HiveBrain v1.2.0
pattern · python · Minor

Will multi-threading or other method make my program run faster?

Submitted by: @import:stackexchange-codereview

Problem

I haven't used multi-threading so far, as I didn't need to. But as far as I've read, implementing it would make my program slightly faster than it currently is.

from validate_email import validate_email
import os

# the program reads each line from "emails.txt", removes duplicates and sorts the good / bad emails into separate files

def verify_emails(all_emails_file, all_good_emails_file, all_bad_emails_file):
    with open(all_emails_file) as f: all_emails = f.readlines()

    rs_emails = [elem.strip('\n') for elem in all_emails]
    rs_emails_set = set(rs_emails)  # remove duplicates

    good_emails_file, bad_emails_file = open(all_good_emails_file, 'w+'), open(all_bad_emails_file, 'w+')

    for email in rs_emails_set:
        if validate_email(email, verify=True):
            print >> good_emails_file, email
        else:
            print >> bad_emails_file, email

if __name__ == "__main__":
    clear = lambda: os.system('cls')
    clear()
    try:
        verify_emails("emails.txt", "good_emails.txt", "bad_emails.txt")
    except:
        print "\n\nFile with emails could not be found. Please create emails.txt and run the program again\n\n"


My code is perfectly functional, but when it handles big files (> 2k rows) it runs really slowly. I'd like to get the best out of it and make it as fast as possible using multi-threading or any other method.

When answering, I'd like somebody to explain, if possible, whether using multi-threading will help me optimize the program or not. I'd also like somebody to explain, from his or her experience, how I can optimize my code.

Solution

As stated in the comments, multi-threading might not be what you are looking for.

I think there is room for improvement on the way you read your file. Currently:

  • you read the whole file into a list: all_emails = f.readlines()

  • you remove duplicates: rs_emails_set = set(rs_emails)

  • and you iterate over every element of this set: for email in rs_emails_set:

Reading this comment, I strongly recommend that you test the following:

processed_emails = set()
with open(all_emails_file) as f:
    for line in f:              # stream the file line by line instead of readlines()
        email = line.strip()    # drop the trailing newline
        if email not in processed_emails:
            # validate the email (good / bad) here
            processed_emails.add(email)


Instead of immediately writing the good and bad emails to their files, you could store them in two lists and write each list in one go at the very end (reducing filesystem I/O as well), as in the sketch below.
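
Here is a minimal sketch of how the two suggestions could fit together: stream the file with a set for deduplication, collect results in two lists, and write each output file exactly once at the end. It keeps the validate_email call and file names from the question, but uses Python 3 syntax rather than the original print >> statements, so treat it as an illustration rather than a drop-in replacement.

from validate_email import validate_email

def verify_emails(all_emails_file, all_good_emails_file, all_bad_emails_file):
    processed_emails = set()
    good_emails, bad_emails = [], []

    with open(all_emails_file) as f:
        for line in f:
            email = line.strip()
            if not email or email in processed_emails:
                continue  # skip blanks and duplicates without validating twice
            processed_emails.add(email)
            # sort each address into the matching list instead of writing it immediately
            if validate_email(email, verify=True):
                good_emails.append(email)
            else:
                bad_emails.append(email)

    # a single write per output file, instead of one write per email
    with open(all_good_emails_file, 'w') as good_file:
        good_file.write('\n'.join(good_emails) + '\n')
    with open(all_bad_emails_file, 'w') as bad_file:
        bad_file.write('\n'.join(bad_emails) + '\n')

The set only grows by one entry per unique address, so membership checks stay fast even for large files, and each output file is opened, written and closed exactly once.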

Code Snippets

processed_emails = set()
with open(all_emails_file) as f:
    for line in f:              # stream the file line by line instead of readlines()
        email = line.strip()    # drop the trailing newline
        if email not in processed_emails:
            # validate the email (good / bad) here
            processed_emails.add(email)

Context

StackExchange Code Review Q#106914, answer score: 5

Revisions (0)

No revisions yet.