First attempt at weighted search
Problem
My concern is speed: since this may be used on an e-commerce website, I want it to be as quick as possible. Where can I improve?
```
# Module-level imports this method relies on:
import operator
from functools import reduce

from django.db.models import Q


def get_queryset(self):
    search_terms = self.request.GET.getlist('search', None)
    if not search_terms:
        return []
    # removing trailing slash on restangular calls
    search_terms[0] = search_terms[0].lower().replace('/', '')
    terms = [term.split(" ") for term in search_terms][0]
    # Query that will go through each item to see if their name or description contain the search terms
    results = reduce(operator.or_,
                     (Item.objects.filter(Q(name__icontains=term) | Q(description__icontains=term))
                      for term in terms))
    # Using enumerate so I can get the index, storing index at end of list for future reference
    # Concats the item name and the item description into one list, using that for the item's weight in the result
    results_split = [list(set(item.name.lower().split() + item.description.lower().split() + list((index,))))
                     for index, item in enumerate(results)]
    # Builds weight for each term
    # Example: The search has 3 terms, Red, Shoes, Pants
    # Red would have a weight of 3 since it is the first word, shoes would be 2 and pants would be 1
    query_with_weights = [(term, len(search_terms[0].split()) - search_terms[0].split().index(term))
                          for term in terms]
    # This section will go through and weigh each item based on name and description weight.
    # This may be problematic if the description uses the key word multiple times.
    # It could result in things being weighed incorrectly. See the example below for more details.
    # Example 2: 'red pants' is the query.
    # We have, in no particular order, a red shoe item, a blue pants item, a red pants item, a red swim trunks item.
    # Each item's description is "sweet {{ item.name }} bro"
    # The resulting weight would be Red: 2, Pants: 1
```
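For reference, this is what that weighting scheme produces for the three-term example in the comments (a standalone illustration, not part of the view):

```
# Illustration only: reproduces the weighting described in the comments above.
search = "red shoes pants"
terms = search.split()
weights = [(term, len(terms) - terms.index(term)) for term in terms]
print(weights)  # [('red', 3), ('shoes', 2), ('pants', 1)]
```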
Solution
I agree with @Gareth-Rees, so I'm going to comment on the comments as well: most of them are fine, if not a bit too detailed. It could make sense to move them into docstrings and separate functions if there really is that much to say about it. I kind of expected the whole function to have a docstring anyway, since at the moment neither input nor output is really documented.
In general, if you care about performance it'd be best to use a profiler to find the problematic parts. That said, the construction of all these lists is probably not the best thing to do; `results_split`, for example, is only iterated over once. Another general suggestion would be to avoid linear search and to use dictionaries for faster lookups instead.
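A minimal sketch of that last point (standalone, not tied to the question's models): a dictionary lookup is a single hash access, whereas `list.index` rescans the list on every call.

```
terms = "red shoes pants".split()

# Linear search: .index() walks the list on every lookup.
weights_as_list = [(term, len(terms) - terms.index(term)) for term in terms]

# Dictionary: built once, after which each lookup is constant time.
weights_as_dict = {term: len(terms) - i for i, term in enumerate(terms)}

print(weights_as_dict["shoes"])  # 2
```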
- The construction of `terms` is confusing. If the only value used is the first one, then just use that all the time. Also, is the `split(" ")` significant, or could just `split()` be used as well (since it will remove any whitespace instead of just a single space, cf. the docs for `str.split([sep[, maxsplit]])`)?

  > If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

  I'm going to assume so.
- The `None` default value in `getlist` is probably not necessary, as the empty list already counts as a false value.
- The definition of `get_weight` should be done with `def` as well. Additionally, avoiding the list creation by using a normal loop may be faster.
- `results` can probably be calculated more efficiently by combining all the queries instead of combining all the results (see the sketch after this list). In any case I'd move this into a separate method.
- `list((index,))` is a very verbose way of saying `[index]`.
- The comment above `results_split` is wrong insofar as the index is added to the set, and thus isn't necessarily at the end of the resulting list. Changing the parentheses slightly would fix this; however, it doesn't seem like the index is used anywhere else, so neither adding the index nor using `enumerate` is necessary.
- `search_terms[0]` is used more than once, so it makes sense to store the calculated values and reuse them.
- The `sort` method can be used instead of `sorted` to avoid additional allocations.
- `top_level_sorted` sounds like it does something different from what the code says, as there is nothing preventing an unsorted list in those lines. Also, it would be better to fetch all objects at once by ID instead of fetching all of them and then removing duplicates with the set (see the sketch after the rewritten code below).
- `all_results` is bad. It's searching for the index value in a list that was explicitly constructed with that number at the end, so it should be rewritten to make use of that fact. Honestly, I can't really follow the logic here; I'll leave it out in the code below.
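A rough illustration of the "combine the queries" point (a sketch only; it assumes the same `Item` model and fields as in the question):

```
import operator
from functools import reduce

from django.db.models import Q

def search_items(terms):
    # OR all per-term conditions into a single Q object, then hit the
    # database once instead of building one queryset per term.
    combined = reduce(operator.or_,
                      (Q(name__icontains=term) | Q(description__icontains=term)
                       for term in terms))
    return Item.objects.filter(combined)  # Item as defined in the question (assumed)
```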
I tried to rewrite this a bit according to those points; however, there's a trade-off between readability and faster code. The best thing would be to merge all the steps that operate on the same lists one after another, and to extract functionality into separate functions to make up for the lack of readability.
```
def get_matching_items(self, term):
    return Item.objects.filter(Q(name__icontains=term) | Q(description__icontains=term))

def get_queryset(self):
    search_terms = self.request.GET.getlist('search')
    if not search_terms:
        return []
    terms = search_terms[0].lower().replace('/', '').split()
    terms_length = len(terms)
    # Builds weight for each term
    # Example: The search has 3 terms, Red, Shoes, Pants
    # Red would have a weight of 3 since it is the first word, shoes would be 2 and pants would be 1
    query_with_weights = {term: terms_length - index for index, term in enumerate(terms)}
    # Query that will go through each item to see if their name or description contain the search terms
    results = self.get_matching_items(terms[0])
    for term in terms[1:]:
        results |= self.get_matching_items(term)
    # This section will go through and weigh each item based on name and description weight.
    # This may be problematic if the description uses the key word multiple times.
    # It could result in things being weighed incorrectly. See the example below for more details.
    # Example 2: 'red pants' is the query.
    # We have, in no particular order, a red shoe item, a blue pants item, a red pants item, a red swim trunks item.
    # Each item's description is "sweet {{ item.name }} bro"
    # The resulting weight would be Red: 2, Pants: 1
    # However, the returned result would be, in this order, [Red Pants, Red Shoe, Red Swim Trunks, Blue Pants]
    sorted_results = []
    for item in results:
        item_terms = set(item.name.lower().split())
        item_terms.update(item.description.lower().split())
        weighted_sum = sum(query_with_weights.get(term, 0) for term in item_terms)
        sorted_results.append((item, weighted_sum))
    sorted_results.sort(key=lambda lst: lst[1], reverse=True)
    ...
    # Gets the top level item for each sub item that hit in the results.
    # If multiple sub items FK back to a single master item we want to only display that master item once.
    top_level_ids = set()
    top_level_sorted = []
    for item in all_results:
        if not item.is_variant:
            continue
        top_level_id = item.get_top_level_item().id
        if top_level_id in top_level_ids:
            continue
        top_level_item = Item.objects.get(id=top_level_id)
        top_level_ids.add(top_level_id)
        top_level_sorted.append(top_level_item)
    return top_level_sorted
```
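And a sketch of the "fetch all objects at once by ID" suggestion (it assumes the question's `all_results`, `is_variant` and `get_top_level_item()` stay as they are):

```
# Collect the top-level IDs first, then issue a single query instead of one
# Item.objects.get() call per result. As noted above, the original loop did
# not guarantee any particular order either.
top_level_ids = {item.get_top_level_item().id
                 for item in all_results
                 if item.is_variant}
top_level_items = Item.objects.filter(id__in=top_level_ids)
```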
Context
StackExchange Code Review Q#92847, answer score: 4
Revisions (0)
No revisions yet.