HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythondjangoMinor

First attempt at weighted search

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
weightedsearchfirstattempt

Problem

My concern is speed, since this may be used in an e-commerce website I want it to be as quick as possible. Where can I improve?

```
def get_queryset(self):
search_terms = self.request.GET.getlist('search', None)

if not search_terms:
return []

# removing trailing slash on restangular calls
search_terms[0] = search_terms[0].lower().replace('/', '')

terms = [term.split(" ") for term in search_terms][0]

# Query that will go through each item to see if their name or description match contain the search terms
results = reduce(operator.or_,
(Item.objects.filter
(Q(name__icontains=term) | Q(description__icontains=term))
for term in terms))

# Using enumerate so I can get the index, storing index at end of list for future reference
# Concats the item name and the item description into one list, using that for the items weight in the result
results_split = [list(set(item.name.lower().split() + item.description.lower().split() + list((index,))))
for index, item in enumerate(results)]

# Builds weight for each term
# Example: The search has 3 terms, Red, Shoes, Pants
# Red would have a weight of 3 since it is the first word, shoes would be 2 and pants would be 1
query_with_weights = [(term, len(search_terms[0].split()) - search_terms[0].split().index(term))
for term in terms]

# This section will go through and weigh each item based on name and description weight.
# This may be problematic if the description uses the key word multiple times.
# It could result in things being weighed incorrectly. See the example below for more details.
# Example 2: 'red pants' is the query.
# We have, in no particular order, a red shoe item, a blue pants item, a red pants item, a red swim trunks item.
# Each items description is sweet {{ item.name }} bro
# The resulting weight would b

Solution

I agree with @Gareth-Rees, so I'm going to comment on the comments as
well: Most of them are fine if not a bit too detailed. It could make
sense to move them into docstrings and separate functions if there
really is that much to say about it. I kind of expected the whole
function to have a docstring anyway, since at the moment neither input
nor output is really documented.

In general, if you care about performance it'd be best to use a profiler
to find problematic parts. That said the construction of all these
lists is probably not the best thing to do. E.g. results_split is
only iterated over once. Another general suggestion would be to avoid
linear search and rather use dictionaries for faster lookups.

-
The construction of terms is confusing. If the only value used is
the first one, then just use that all the time. Also, is the split(" ") significant, or could just split() be used as well (since it will remove any whitespace instead of just a single space, c.f. the docs)?


str.split([sep[, maxsplit]])


If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.

I'm going to assume so.

  • The None default value in getlist is probably not necessary, as


the empty list already counts as a false value.

  • The definition of get_weight should be done with def as well.


Additionally, avoiding the list creation by using a normal loop may be
faster.

  • results can probably be calculated more efficiently by combining all


the queries instead of combining all the results? In any case I'd
move this into a separate method.

  • list((index,)) is a very verbose way of saying [index].



  • The comment above results_split is wrong insofar as the index is


added to the set, thus isn't necessarily at the end of the resulting
list. Changing the parenthesis slightly would fix this - however it
doesn't seem like the index is used anywhere else, so both adding the
index, as well as using enumerate aren't necessary.

  • search_terms[0] is used more than once, so it makes sense to store


the calculated values and reuse them.

  • The sort method can be used instead of sorted to avoid additional


allocations.

  • top_level_sorted sounds like it does something different than what


the code says, as there is nothing preventing an unsorted list in
those lines. Also, it would be better to fetch all objects at once by
ID instead of fetching all of them and then removing duplicates with
the set.

  • all_results is bad. It's searching for the index value in a list


that was explicitely constructed with that number at the end, so it
should be rewritten to make use of that fact. Honestly I can't really
follow the logic here, I'll leave it out in the code below.

I tried to rewrite this a bit according to those points, however there's
a trade-off between readability and faster code. Best thing would be to
merge all the steps that operate on the same lists one after another and
extract functionality into separate functions to make up for the lack of
readability.

```
def get_matching_items(self, term):
return Item.objects.filter(Q(name__icontains=term) | Q(description__icontains=term))

def get_queryset(self):
search_terms = self.request.GET.getlist('search')

if not search_terms:
return []

terms = search_terms[0].lower().replace('/', '').split()
terms_length = len(terms)

# Builds weight for each term
# Example: The search has 3 terms, Red, Shoes, Pants
# Red would have a weight of 3 since it is the first word, shoes would be 2 and pants would be 1
query_with_weights = {term: terms_length - index for index, term in enumerate(terms)}

# Query that will go through each item to see if their name or description match contain the search terms
results = get_matching_items(terms[0])
for term in terms[1:]:
results |= get_matching_items(term)

# This section will go through and weigh each item based on name and description weight.
# This may be problematic if the description uses the key word multiple times.
# It could result in things being weighed incorrectly. See the example below for more details.
# Example 2: 'red pants' is the query.
# We have, in no particular order, a red shoe item, a blue pants item, a red pants item, a red swim trunks item.
# Each items description is sweet {{ item.name }} bro
# The resulting weight would be Red: 2, Pants: 1
# However, the returned result would be, in this order, [Red Pants, Red Shoe, Red Swim Trunks, Blue Pants]
sorted_results = []
for item in results:
terms = set(item.name.lower().split())
terms.update(item.description.lower().split())

weighted_sum = sum(query_with_weights.get(y, 0) for term in terms)

sorted_results.

Code Snippets

def get_matching_items(self, term):
    return Item.objects.filter(Q(name__icontains=term) | Q(description__icontains=term))

def get_queryset(self):
    search_terms = self.request.GET.getlist('search')

    if not search_terms:
        return []

    terms = search_terms[0].lower().replace('/', '').split()
    terms_length = len(terms)

    # Builds weight for each term
    # Example: The search has 3 terms, Red, Shoes, Pants
    # Red would have a weight of 3 since it is the first word, shoes would be 2 and pants would be 1
    query_with_weights = {term: terms_length - index for index, term in enumerate(terms)}

    # Query that will go through each item to see if their name or description match contain the search terms
    results = get_matching_items(terms[0])
    for term in terms[1:]:
        results |= get_matching_items(term)

    # This section will go through and weigh each item based on name and description weight.
    # This may be problematic if the description uses the key word multiple times.
    #  It could result in things being weighed incorrectly. See the example below for more details.
    # Example 2: 'red pants' is the query.
    # We have, in no particular order, a red shoe item, a blue pants item, a red pants item, a red swim trunks item.
    # Each items description is sweet {{ item.name }} bro
    # The resulting weight would be Red: 2, Pants: 1
    # However, the returned result would be, in this order, [Red Pants, Red Shoe, Red Swim Trunks, Blue Pants]
    sorted_results = []
    for item in results:
        terms = set(item.name.lower().split())
        terms.update(item.description.lower().split())

        weighted_sum = sum(query_with_weights.get(y, 0) for term in terms)

        sorted_results.append((item, weighted_sum))

    sorted_results.sort(key=lambda lst: lst[1], reverse=True)

    ...

    # Gets the top level item for each sub item that hit in the results.
    #   If multiple sub items FK back to the a single master item we want to only display that master item once.
    top_level_ids = set()
    top_level_sorted = []
    for item in all_results:
        if not item.is_variant:
            continue
        top_level_id = item.get_top_level_item().id
        if top_level_id in top_level_ids:
            continue
        top_level_item = Item.objects.get(id=top_level_id)
        top_level_ids.add(top_level_id)
        top_level_sorted.append(top_level_item)

    return top_level_sorted

Context

StackExchange Code Review Q#92847, answer score: 4

Revisions (0)

No revisions yet.