patternpythonMinor
Iterate over large list of lists and replace its elements
Viewed 0 times
elementsreplaceiterateitslargelistsandlistover
Problem
I have a large list of lists, where each list contains words/characters as elements, and each list is of different length and may contain same words/characters more than once, as for example:
I also have a large dictionary of words/characters like this:
The length of
For example, if
Because of the large length of
list_ = [['black', 'window', 'says', 'window'], ['dog', 'great'], ['after', 'end', 'explains', 'income', '.', '?']]I also have a large dictionary of words/characters like this:
dictionary = {'word1': 'RARE', 'word2': 'RARE', 'word3': 'RARE', '%': 'RARE'} etc.The length of
list_ is around 40 000 (the total number of words in all lists is about 1 mln) and the length of dictionary is around 31 000. For each word in dictionary, I need to check if it's in any of the lists of list_ and if it is, I need to replace it with 'RARE'. For example, if
'black'== 'word1' and 'after' == word3, the final list_ will be: list_ = [['RARE', 'window', 'says'], ['dog', 'says'], ['RARE', 'end', 'explains', 'income']]Because of the large length of
list_ and dictionary, the code that I currently have below takes almost 40min to run. I'm new to Python and I would appreciate if someone could suggest whether there's a way to do this task much faster/more efficiently. def replace_all(word_list, dictionary):
for i, j in dictionary.iteritems():
for lst in word_list:
if i in set(lst):
lst[lst.index(i)] = j
replace_all(list_, dictionary)Solution
You're looping in incorrect order, take the advantage of the fact that dictionary provides
We can either use
Or check for the key first and modify only matched item(LBYL):
or if you prefer EAFP then try to catch the
I generally find functions that modify a mutable item in-place and don't return anything confusing, it's usually a better idea to return the list object from the function and re-assign to the variable from where it was called.
But the above approach is still flawed(at least when our object has multiple references), with this even though we re-assigned only a single variable all the other references to the list object will still be affected, which can be very confusing to users. For example:
Outputs:
To fix such such cases we can either pass a deepcopy of the list to the function(or simple shallow copy for list containing only immutable items) or we can create a new list inside of the function. But as we are dealing with 1 million items creating another list will surely result in extra memory consumption. Here's an example of how to do this by creating a new list using
Few other questions on the same topic:
O(1) lookups. So, you should loop over the list and replace its items with the value present in the dictionary. This way you will loop over the list only once:We can either use
dict.get here and avoid an if condition:def replace_matched_items(word_list, dictionary):
for lst in word_list:
for ind, item in enumerate(lst):
lst[ind] = dictionary.get(item, item)Or check for the key first and modify only matched item(LBYL):
def replace_matched_items(word_list, dictionary):
for lst in word_list:
for ind, item in enumerate(lst):
if item in dictionary:
lst[ind] = dictionary[item]or if you prefer EAFP then try to catch the
KeyError:def replace_matched_items(word_list, dictionary):
for lst in word_list:
for ind, item in enumerate(lst):
try:
lst[ind] = dictionary[item]
except KeyError:
passI generally find functions that modify a mutable item in-place and don't return anything confusing, it's usually a better idea to return the list object from the function and re-assign to the variable from where it was called.
def replace_matched_items(word_list, dictionary):
for lst in word_list:
for ind, item in enumerate(lst):
lst[ind] = dictionary.get(item, item)
list_ = replace_matched_items(list_, dictionary)But the above approach is still flawed(at least when our object has multiple references), with this even though we re-assigned only a single variable all the other references to the list object will still be affected, which can be very confusing to users. For example:
def func(seq):
for lst in seq:
for i, x in enumerate(lst):
if x == 1:
lst[i] = 100
return seq
foo = [[1, 1, 2], [1, 2, 3]]
dct = {'key': foo}
foo = func(foo)
print foo
print dctOutputs:
[[100, 100, 2], [100, 2, 3]]
{'key': [[100, 100, 2], [100, 2, 3]]}To fix such such cases we can either pass a deepcopy of the list to the function(or simple shallow copy for list containing only immutable items) or we can create a new list inside of the function. But as we are dealing with 1 million items creating another list will surely result in extra memory consumption. Here's an example of how to do this by creating a new list using
dict.get:def replace_matched_items(word_list, dictionary):
new_list = [[dictionary.get(item, item) for item in lst] for lst in word_list]
return new_list
list_ = replace_matched_items(list_, dictionary)Few other questions on the same topic:
- Is making in-place operations return the object a bad idea?
- Why does django ORM's
savemethod not return the saved object?
Code Snippets
def replace_matched_items(word_list, dictionary):
for lst in word_list:
for ind, item in enumerate(lst):
lst[ind] = dictionary.get(item, item)def replace_matched_items(word_list, dictionary):
for lst in word_list:
for ind, item in enumerate(lst):
if item in dictionary:
lst[ind] = dictionary[item]def replace_matched_items(word_list, dictionary):
for lst in word_list:
for ind, item in enumerate(lst):
try:
lst[ind] = dictionary[item]
except KeyError:
passdef replace_matched_items(word_list, dictionary):
for lst in word_list:
for ind, item in enumerate(lst):
lst[ind] = dictionary.get(item, item)
list_ = replace_matched_items(list_, dictionary)def func(seq):
for lst in seq:
for i, x in enumerate(lst):
if x == 1:
lst[i] = 100
return seq
foo = [[1, 1, 2], [1, 2, 3]]
dct = {'key': foo}
foo = func(foo)
print foo
print dctContext
StackExchange Code Review Q#79891, answer score: 2
Revisions (0)
No revisions yet.