HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavascriptMinor

Query for finding collection entries matching an arbitrary set of URLs

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
setarbitraryquerycollectionforurlsfindingmatchingentries

Problem

Most of my slow queries on my server are due to a single "find" query. I have a collection that represents website pages. I want to be able to look up a page in the collection by URL. The caveat is that a website might have URLs that redirect to it.

I've modeled my schema (with Node.js/Mongoose) like this:

{
        'url':                      { type: String, index: {unique: true, dropDups: true} },
        'redirects':                { type: Array, index: true },
        // ... other stuff
    }


In other words, one entry might have the URL http://codereview.stackexchange.com/ but also have an array of 5 other URLs (such as bit.ly URLs) that redirect to it.

Later, I want to query the collection to see if we have an entry which matches an arbitrary set of URLs. I don't know if the URLs are redirected URLs or what -- I just want to find all matches in my collection that represent these URLs.

So I do this:

// urls is an array of URL strings we want to find...
model.find({$or: [{'url': {$in: urls}}, {'redirects': {$in: urls}}]}).lean().exec( ... );


Unfortunately, I might be looking for around 200 URLs at once, so this query sometimes takes > 1 second.

Is there further optimization I can do, or would it be better to split the query into multiple queries, capping the search-size each time?

Solution

I haven't used MongoDB, but I have a few suggestions based on experiences with optimization and other databases.

-
Index hashes of the URLs and search for those instead. Using a simple MD5 hash would probably speed up searching with the cost of dealing with false positives (unlikely but possible).

-
Store every URL as a top-level object and add a redirectsTo attribute. This alleviates the need to index or search the redirects collections. While now you need to perform two queries--one for the original URLs and another for the redirected URLs--this could end up being faster if only a small percentage of URLs are redirects.

Here's some psuedocode to clarify what I mean:

all = []
found = model.find({'url': {$in: urls}})...
redirects = []
for each item in found
    if item.redirectsTo
        redirects += item.redirectsTo
    else
        all += item
redirected += model.find({'url': {$in: redirects}})...
for each item in redirected
    all += item


-
Update your question with the higher-level problem you're trying to solve. I often find that when I'm trying to optimize slow task T to solve problem P, the better approach is to rethink P.

Code Snippets

all = []
found = model.find({'url': {$in: urls}})...
redirects = []
for each item in found
    if item.redirectsTo
        redirects += item.redirectsTo
    else
        all += item
redirected += model.find({'url': {$in: redirects}})...
for each item in redirected
    all += item

Context

StackExchange Code Review Q#23542, answer score: 3

Revisions (0)

No revisions yet.