Quick Solution: CSV File and Python

Today I was tasked with de-duping a CSV file that had around 25,000 single-entry rows of emails. There were numerous duplicates, and I needed a way to read the file, find out how many duplicates there were, and then rewrite the file without the duplicate entries. I wanted to use Python because I tire of PHP and have little to no experience with any other programming language. Maybe I should have chosen another language to complicate my life, but for time's sake, Python became my go-to language.
Fortunately, Python has a great way to navigate CSV files using the csv module. All I needed to do was import the module and create a reader object to read the file.

import csv

emailReader = csv.reader(open('csvfile.csv', 'r', newline=''), delimiter=',', quotechar='"')

Once I had the reader object, I just needed to iterate over it and put the contents in a list. Easy:

a = []
for row in emailReader:
    a.append(", ".join(row))

Unfortunately, I discovered a problem with the data set: a number of the email addresses had periods at the end of them. I'm not sure what happens to mail sent to an address that ends in a period, but I'd rather not find out with 5,000 bounced messages. So some string manipulation became important. Fortunately for me, Python has easy ways to deal with leading and trailing characters: strip and rstrip. Since I was worried about trailing characters, I used rstrip.

s = "This string ends in a period."
s = s.rstrip('.')

Now s = "This string ends in a period". A handy trick to know for manipulating strings. The trick for me, however, was figuring out how to perform that operation on every item in a list. Not too hard, I just used the enumerate function.

for index, email in enumerate(a):
    a[index] = email.rstrip('.')

Using this method, you can apply a function to each indexed item within a list.
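For what it's worth, the same cleanup can also be written as a list comprehension, which builds a new list instead of modifying the old one in place. Just a sketch, with made-up sample addresses:

```python
# Strip trailing periods from every address in one pass.
a = ['alice@example.com.', 'bob@example.com']
a = [email.rstrip('.') for email in a]
print(a)  # ['alice@example.com', 'bob@example.com']
```

Either style works; the comprehension just saves you the index bookkeeping.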
Now I had a clean list of 25,000 email addresses, 'a', and I just needed to de-dupe the list. Snap! The next step was to figure out how to make those email addresses unique. If you've ever tried de-duping a MySQL database, you know that this can be a major pain in the ass. With Python lists, though, de-duping is a breeze. I found the built-in set and frozenset types for quick de-duping.

b = set(a)

Now my list goes from ['email@example.com', 'email@example.com'] to set(['email@example.com']). Nice, right? You can iterate over the set like any other list. Now a quick check for the number of duplicates.
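One thing to watch: a set does not preserve the order of the original list, so if a stable order matters (say, for comparing output files between runs), it's worth sorting before you iterate. A small sketch with made-up addresses:

```python
emails = ['c@example.com', 'a@example.com', 'c@example.com', 'b@example.com']
unique = set(emails)          # duplicates gone, but order is arbitrary
for email in sorted(unique):  # sort for a deterministic order
    print(email)
```

This prints the three unique addresses in alphabetical order, no matter how the set happens to store them.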

r = len(a) - len(b)  # Subtract the length of the set from the original list.
r = str(r)  # Make the int into a string for concatenation.
print("You removed " + r + " duplicate entries.")

Now everything is in place to write the emails back out to a CSV file. Pretty easy, just like reading the file.

emailWriter = csv.writer(open('clean-emails.csv', 'w', newline=''), delimiter=' ', quotechar='|')

for item in b:
    emailWriter.writerow([item])

And that does it. Super easy python magic!
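Putting the steps above together, here's a minimal end-to-end sketch. The file names match the ones used in the post; the sample input written at the top is made up purely so the script has something to chew on:

```python
import csv

# Create a small sample input file for illustration (made-up addresses).
with open('csvfile.csv', 'w', newline='') as f:
    f.write('a@example.com.\nb@example.com\na@example.com\n')

# Read the addresses (one per row) into a list.
with open('csvfile.csv', 'r', newline='') as f:
    a = [", ".join(row) for row in csv.reader(f, delimiter=',', quotechar='"')]

# Strip trailing periods, then de-dupe with a set.
a = [email.rstrip('.') for email in a]
b = set(a)
print("You removed " + str(len(a) - len(b)) + " duplicate entries.")

# Write the unique addresses back out.
with open('clean-emails.csv', 'w', newline='') as f:
    emailWriter = csv.writer(f, delimiter=' ', quotechar='|')
    for item in b:
        emailWriter.writerow([item])
```

With the sample input above, the script reports one duplicate removed and writes two unique addresses to clean-emails.csv.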
