HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Iterate over large list of lists and replace its elements

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
elementsreplaceiterateitslargelistsandlistover

Problem

I have a large list of lists, where each list contains words/characters as elements, and each list is of different length and may contain same words/characters more than once, as for example:

list_ = [['black', 'window', 'says', 'window'], ['dog', 'great'], ['after', 'end', 'explains', 'income', '.', '?']]


I also have a large dictionary of words/characters like this:

dictionary = {'word1': 'RARE', 'word2': 'RARE', 'word3': 'RARE', '%': 'RARE'} etc.


The length of list_ is around 40 000 (the total number of words in all lists is about 1 mln) and the length of dictionary is around 31 000. For each word in dictionary, I need to check if it's in any of the lists of list_ and if it is, I need to replace it with 'RARE'.

For example, if 'black'== 'word1' and 'after' == word3, the final list_ will be:

list_ = [['RARE', 'window', 'says'], ['dog', 'says'], ['RARE', 'end', 'explains', 'income']]


Because of the large length of list_ and dictionary, the code that I currently have below takes almost 40min to run. I'm new to Python and I would appreciate if someone could suggest whether there's a way to do this task much faster/more efficiently.

def replace_all(word_list, dictionary):
   for i, j in dictionary.iteritems():
      for lst in word_list: 
         if i in set(lst): 
            lst[lst.index(i)] = j

replace_all(list_, dictionary)

Solution

You're looping in incorrect order, take the advantage of the fact that dictionary provides O(1) lookups. So, you should loop over the list and replace its items with the value present in the dictionary. This way you will loop over the list only once:

We can either use dict.get here and avoid an if condition:

def replace_matched_items(word_list, dictionary):
   for lst in word_list:
      for ind, item in enumerate(lst):
          lst[ind] = dictionary.get(item, item)


Or check for the key first and modify only matched item(LBYL):

def replace_matched_items(word_list, dictionary):
   for lst in word_list:
      for ind, item in enumerate(lst):
          if item in dictionary:
              lst[ind] = dictionary[item]


or if you prefer EAFP then try to catch the KeyError:

def replace_matched_items(word_list, dictionary):
   for lst in word_list:
      for ind, item in enumerate(lst):
          try:
              lst[ind] = dictionary[item]
          except KeyError:
              pass


I generally find functions that modify a mutable item in-place and don't return anything confusing, it's usually a better idea to return the list object from the function and re-assign to the variable from where it was called.

def replace_matched_items(word_list, dictionary):
   for lst in word_list:
      for ind, item in enumerate(lst):
          lst[ind] = dictionary.get(item, item)

list_ = replace_matched_items(list_, dictionary)


But the above approach is still flawed(at least when our object has multiple references), with this even though we re-assigned only a single variable all the other references to the list object will still be affected, which can be very confusing to users. For example:

def func(seq):
    for lst in seq:
        for i, x in enumerate(lst):
            if x == 1:
                lst[i] = 100
    return seq

foo = [[1, 1, 2], [1, 2, 3]]
dct = {'key': foo}

foo = func(foo)
print foo
print dct


Outputs:

[[100, 100, 2], [100, 2, 3]]
{'key': [[100, 100, 2], [100, 2, 3]]}


To fix such such cases we can either pass a deepcopy of the list to the function(or simple shallow copy for list containing only immutable items) or we can create a new list inside of the function. But as we are dealing with 1 million items creating another list will surely result in extra memory consumption. Here's an example of how to do this by creating a new list using dict.get:

def replace_matched_items(word_list, dictionary):
    new_list = [[dictionary.get(item, item) for item in lst] for lst in word_list]
    return new_list

list_ = replace_matched_items(list_, dictionary)


Few other questions on the same topic:

  • Is making in-place operations return the object a bad idea?



  • Why does django ORM's save method not return the saved object?

Code Snippets

def replace_matched_items(word_list, dictionary):
   for lst in word_list:
      for ind, item in enumerate(lst):
          lst[ind] = dictionary.get(item, item)
def replace_matched_items(word_list, dictionary):
   for lst in word_list:
      for ind, item in enumerate(lst):
          if item in dictionary:
              lst[ind] = dictionary[item]
def replace_matched_items(word_list, dictionary):
   for lst in word_list:
      for ind, item in enumerate(lst):
          try:
              lst[ind] = dictionary[item]
          except KeyError:
              pass
def replace_matched_items(word_list, dictionary):
   for lst in word_list:
      for ind, item in enumerate(lst):
          lst[ind] = dictionary.get(item, item)

list_ = replace_matched_items(list_, dictionary)
def func(seq):
    for lst in seq:
        for i, x in enumerate(lst):
            if x == 1:
                lst[i] = 100
    return seq

foo = [[1, 1, 2], [1, 2, 3]]
dct = {'key': foo}

foo = func(foo)
print foo
print dct

Context

StackExchange Code Review Q#79891, answer score: 2

Revisions (0)

No revisions yet.