
Searching within multiple objects of the bucket

Submitted by: @import:stackexchange-codereview

Problem

I have millions of files in a Google Cloud Storage bucket. I want to search within the files that have a .index extension and retrieve their contents.

This is what I am currently doing, but the overall process takes a long time. Is there a better and faster way to do this?

class storage():
    ...
    self.indexFile = []
    self.indexFileIndex = []
    ...

    def get_content(self,param1,param2):
        c_n = []
        for value, index in zip(param1, param2):
            object_contents = StringIO.StringIO()
            srcObjURI = boto.storage_uri(value, self.storage)
            srcObjURI.get_key().get_file(object_contents)
            c_n.append(object_contents.getvalue())
            object_contents.close()
        return c_n

    def get_PATHs(self):
        paths=[]
        #pts=open("paths.txt","w")
        pts_log=open("paths_log.txt","a")
        pts_log.write("-"*20+time.ctime()+"-"*20+"\n")
        indexFileContents = self.get_content(self.indexFile, self.indexFileIndex)
        for c,d in zip(indexFileContents,self.indexFile):
            regx = r"(.*)\/" + r"(.*?)\|"
            patternPathList = re.compile(regx)
            for match in patternPathList.finditer(c):
                p=match.group(1).strip() + "/"+ match.group(2).strip()
                tst_exst=""
                if p in paths:
                    tst_exst="Already exist !"
                else:
                    tst_exst="Added to PATHs list"
                    paths.append(p)
                    #pts.write(p)
                    #pts.write("\n")
                pts_log.write("FROM : %s --> %s %s"%(d,p,tst_exst))
                pts_log.write("\n")
        #pts.close()
        pts_log.close()
        return paths


The files that I am trying to search vary from 200 KB to 1 MB and sometimes contain Unicode characters.

Solution

A few remarks to add to @JoeWallis's already excellent answer:

- If you want speed, consider using the cStringIO module instead of StringIO. Since your import statements are missing, I can't tell whether you already have a try ... except to import it conditionally. You should really post the whole code.

- There are very few occasions when you want to keep commented-out code. Generally speaking, the best thing to do is to remove dead code and let source control software remind you of what the old code was like.

- Instead of zip, consider using itertools.izip, which zips the iterables lazily. In your case it does not change much, since you never break out of the loops early (except perhaps on an error), but in general it saves memory and sometimes avoids computing unused values.

- You don't need the parentheses in class storage(): unless you explicitly inherit from a class. Since you don't inherit from anything, you can simply drop them so that your code looks cleaner.
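
The first and third remarks can be sketched together as below. This is only an illustration, not the asker's code: the helper names first_match and roundtrip are hypothetical, and the fallback imports are an assumption added so that the snippet also runs on Python 3, where cStringIO and izip no longer exist. Note also that on Python 2, cStringIO cannot hold Unicode strings containing non-ASCII characters, so it is only suitable if the file contents are kept as byte strings.

```python
# Conditional import: prefer the C implementation when it is available.
try:
    from cStringIO import StringIO  # Python 2, C-accelerated
except ImportError:
    try:
        from StringIO import StringIO  # Python 2, pure Python
    except ImportError:
        from io import StringIO  # Python 3 (assumed fallback)

try:
    from itertools import izip  # Python 2: lazy zip
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy


def first_match(names, contents, needle):
    """Return the first name whose content contains needle, else None.

    Hypothetical helper: because izip is lazy, pairs after the first
    match are never requested from the underlying iterables.
    """
    for name, content in izip(names, contents):
        if needle in content:
            return name
    return None


def roundtrip(text):
    """Write text to an in-memory buffer and read it back (hypothetical helper)."""
    buf = StringIO()
    buf.write(text)
    value = buf.getvalue()
    buf.close()
    return value
```

If contents is a generator that downloads each object on demand, first_match stops downloading as soon as a match is found. With the eager Python 2 zip, every pair would be built first.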

Context

StackExchange Code Review Q#98613, answer score: 4
