HiveBrain v1.2.0

Indexing a large static dataset

Submitted by: @import:stackexchange-dba
Tags: indexing, dataset, large, static

Problem

My current situation is this: I have 328 million lines of data in a flat CSV file. Not optimal. I want to be able to query this data (I'll explain in more detail). Grep is getting tired. This data is static and will not change.

I'm new to PostgreSQL and am working with this data on my 2010 MacBook Pro, 2.4 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3. Storage size is not really a problem, the data is around 65GB on a 1TB disk, but read/write speed is not great (it's not an SSD unfortunately). From what I remember from my CS studies a few years ago, indexing could help me improve query speeds, and I'm reading up on how best to do that.

Why am I using PostgreSQL? Peer pressure. Not really, but let's go with that. I have PostgreSQL 9.3.4 running on OS X 10.9.2.

Now, what I'm trying to do. The data includes date and time info and spans about 10 months. I want, for example, to be able to:

  • Find and summarize all activity on, say, all Mondays in the data set, for a specific user and/or for multiple users;
  • Look at differences between weeks in the summertime vs. weeks in the wintertime;
  • Between months; and
  • Hours of day, etc.

It's mostly timestamp related summaries that I'm interested in.
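As a concrete sketch of that kind of summary (column names match the schema described below; the exact query shape is my assumption, not the poster's):

```sql
-- Hypothetical example: total activity per Monday for one user.
-- In PostgreSQL, extract(dow from ...) returns 0 (Sunday) to 6 (Saturday).
select date_trunc('day', t.timestamp) as day,
       count(*) as events
from test t
where t.userID = '<someuser>'
  and extract(dow from t.timestamp) = 1   -- Mondays
group by day
order by day;
```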

Each data record includes, besides the timestamp, a userID and a servID. In another file (which is stored in another table), each servID is linked to a location as X and Y values. It's these locations that I'm interested in displaying on a map, so I would be exporting these to plot using matlab, in aggregate for many users or focusing on specific users at a time.

All values are repeated, i.e. there are about 700,000 unique userIDs and 1500 unique servIDs. Timestamps are not unique either: the resolution is only seconds, so there are repeated timestamps.
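Those cardinalities are easy to sanity-check once the data is loaded; assuming the column names used later in this answer:

```sql
-- Verify the rough cardinalities quoted above.
select count(distinct userID) as distinct_users,   -- expect ~700,000
       count(distinct servID) as distinct_servers  -- expect ~1,500
from test;
```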

So far I've created two tables, one with a small subset of test data and the X-Y location table:

```
create table test (
    userID varchar,
    junk1 varchar,
    date varchar,
    time varchar,
    junk2 varchar,
    servID varchar,      -- the original snippet was cut off here;
    timestamp timestamp  -- these last columns are inferred from the query below
);
```

Solution

Query cleanup

First, let's rewrite your query to be readable, by using table aliases, qualifying field names, and using an ANSI join:

select t.userID, t.date, t.time, t.servID, t.timestamp, 
       l.servID_HEX, l.SERV_LOCY, l.SERV_LOCX 
from test t
     inner join locations l on (t.servID=l.servID_HEX)
where t.userID='<someusers>' 
  and extract(dow from t.timestamp)=2;  -- dow runs 0 (Sunday) to 6 (Saturday), so 2 is Tuesday; use 1 for Mondays
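Before and after adding indexes, it's worth checking what the planner actually does with this query; EXPLAIN (ANALYZE, BUFFERS) is available in 9.3 and shows the real plan, row counts and I/O:

```sql
-- Inspect the actual execution plan for the query above.
explain (analyze, buffers)
select t.userID, t.date, t.time, t.servID, t.timestamp,
       l.servID_HEX, l.SERV_LOCY, l.SERV_LOCX
from test t
     inner join locations l on (t.servID = l.servID_HEX)
where t.userID = '<someusers>'
  and extract(dow from t.timestamp) = 2;
```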


What the query needs

To start breaking this problem down, the first thing you'll want to do is look at the fields used and where they're used:

Tables: locations, test

Fields output: test.userID, test.date, test.time, test.servID, test.timestamp, locations.servID_HEX, locations.SERV_LOCY, locations.SERV_LOCX

Terms used in filter predicates: test.servID, locations.servID_HEX, test.userID, extract(dow from test.timestamp)=2

Indexes

Now, the most important thing is to make sure that you have indexes that target high-selectivity columns used in predicates.

"high selectivity" just means that there aren't tons of the same value - there's no point indexing a column if 50% of the time it's 'a' and 50% of the time it's 'b', because you just don't gain much benefit from the index.
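You don't have to guess at selectivity: after ANALYZE, PostgreSQL's pg_stats view reports its own estimates (n_distinct is negative when it's expressed as a fraction of the row count):

```sql
-- Check the planner's selectivity estimates for the relevant columns.
analyze test;
select attname, n_distinct, null_frac
from pg_stats
where tablename = 'test'
  and attname in ('userid', 'servid');  -- unquoted identifiers fold to lower case
```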

Assuming that all these columns have lots of well-distributed distinct values (so they're highly selective, with no values that are massively common), you'll want to create an ordinary b-tree index. That's enough for locations, because only one of its columns is used in a predicate:

CREATE INDEX locations_servid_idx ON locations(servID_HEX);


For test it's more complex.

You could create indexes separately for each column, but it's more efficient to create a single composite index that covers all the columns used in this query's predicates for a given table. So you want servID, userID and timestamp. However, you don't use timestamp directly: you use it to get the day of week, and an index on timestamp can't be used to look up extract(dow from timestamp).

This is where expression indexes come in.

Indexes don't just cover columns. They can also cover arbitrary expressions. In this case, we'll create an index that includes extract(dow from timestamp).

So the index for test will look like:

CREATE INDEX test_custom_idx ON test (
    servID,
    extract(dow from timestamp),
    userID
);


As you'll be doing different kinds of timestamp based queries I'd consider creating multiple indexes for different timestamp expressions.
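For example, something along these lines (index names are mine; pick the expressions that match the summaries you actually run):

```sql
-- Additional expression indexes for other timestamp-based groupings.
CREATE INDEX test_month_idx ON test (extract(month from timestamp));
CREATE INDEX test_hour_idx  ON test (extract(hour from timestamp));
```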

Which order to put the columns in depends on how selective each column is. Even though I'm sure the user ID will be more selective, I've put the server ID first because it's used in a join condition that lets us completely disregard rows of the other table.

Since you didn't provide sample tables or sample data I can't really test it and I'm not keen on dummying some up. Adjustments may be required.

Partial indexes

Partial indexes can also be a big win: the index covers only a subset of the rows, e.g. (note the distinct name, since test_custom_idx is already taken):

CREATE INDEX test_dow2_partial_idx ON test (
    servID, userID
) WHERE (extract(dow from timestamp) = 2);


This index contains only data for that day, so it's much faster for queries involving just that day, but it cannot be used at all for queries that involve other days, or where the day is determined at runtime from the output of another part of the query.
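Concretely, the planner can only use a partial index when the query's WHERE clause provably implies the index's predicate:

```sql
-- Can use the partial index: the filter matches its predicate exactly.
select count(*) from test
where extract(dow from timestamp) = 2
  and userID = '<someusers>';

-- Cannot use it: a different constant day.
select count(*) from test
where extract(dow from timestamp) = 3
  and userID = '<someusers>';
```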

Index-only scans and covering indexes

Another thing you can consider is that all those unused columns still have to be read from disk to fetch your rows. In PostgreSQL 9.2 and newer there's a way to avoid that using an index-only scan if all columns you are fetching are in an index.

So you could create a covering index, where you include columns that aren't really required for the actual search but are there so they can be output without reading from the main table. The first thing to do for that is to get rid of useless values in your SELECT list, so we'll delete the redundant t.date and t.time from your query. Then create two new indexes:

CREATE INDEX locations_covering_idx 
ON locations(
    servID_HEX,
    SERV_LOCY,
    SERV_LOCX
);

CREATE INDEX test_covering_idx ON test (
    servID,
    extract(dow from timestamp),
    userID,
    timestamp
);
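One more PostgreSQL detail: index-only scans also depend on the table's visibility map, so VACUUM the tables once after the bulk load (cheap for static data). The goal is a plan that reports an Index Only Scan:

```sql
-- Build the visibility map so index-only scans can actually happen.
vacuum analyze test;
vacuum analyze locations;

-- All selected columns are in test_covering_idx, so an
-- "Index Only Scan using test_covering_idx" becomes possible.
explain
select servID, userID, timestamp
from test
where servID = '<someserv>'
  and extract(dow from timestamp) = 2;
```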


Reducing repetition

Since your dataset is static, having tons and tons of indexes isn't a big problem.

Each index has a cost for insert/update/delete. So for non-static data you want to minimize that by getting rid of indexes that don't provide enough improvement to be worth the cost.

One way to do that is to split up composite indexes. In your case, for example, I might take the timestamp expression out of the main index on test and put it in a set of new indexes instead. That way all the different timestamp expression indexes can be updated without also rewriting a lot of unrelated data for userID and servID. PostgreSQL can still often use the indexes together, combining them with a bitmap index scan.
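As a sketch, the split version might look like this; when a query filters on several of these columns, PostgreSQL can combine the indexes with a bitmap index scan (shown as BitmapAnd in EXPLAIN output):

```sql
-- Single-purpose indexes instead of one wide composite.
CREATE INDEX test_servid_idx ON test (servID);
CREATE INDEX test_userid_idx ON test (userID);
CREATE INDEX test_dow_idx    ON test (extract(dow from timestamp));
```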

That's not a concern for your current dataset, though, since it's static.


Context

StackExchange Database Administrators Q#65137, answer score: 10
