HiveBrain v1.2.0
Get Started
← Back to all entries
principlesqlMinor

MongoDB performance vs. PostgreSQL with 5.5 million rows / documents

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlmillionrowswithmongodbperformancedocuments

Problem

Can someone help me compare these queries and explain why the PostgreSQL query executes in just under 2000ms and the MongoDB aggregate query takes almost 9000ms and sometimes as high as 130K ms?

PostgreSQL 9.3.2 on x86_64-apple-darwin, compiled by i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.9.00), 64-bit


PostgreSQL query

SELECT locomotive_id,
   SUM(date_trunc('second', datetime) - date_trunc('second', prevDatetime)) AS utilization_time

FROM bpkdmp 
WHERE datetime >= '2013-7-26 00:00:00.0000' 
AND   datetime <= '2013-7-26 23:59:59.9999'
GROUP BY locomotive_id
order by locomotive_id


MongoDB Query

db.bpkdmp.aggregate([
   {
      $match : {
          datetime : { $gte : new Date(2013,6,26, 0, 0, 0, 0), $lt : new Date(2013,6,26, 23, 59, 59, 9999) }
   }
   },
   {
      $project: {
         locomotive_id : "$locomotive_id",
         loco_time : { $subtract : ["$datetime", "$prevdatetime"] }, 
      }
   },
   {
      $group : {
         _id : "$locomotive_id",
         utilization_time : { $sum : "$loco_time" }
      }
   },
   {
      $sort : {_id : 1}
   }
])


Both the PostgreSQL table and the MongoDB collection are indexed on datetime : 1 and locomotive_id : 1

These queries are being testing on an iMac with a 2TB hybrid drive and 16GB of memory. I have received comparable results on a Windows 7 machine with 8GB of memory and a 256GB SSD.

Thanks!

** Update : I am posting the EXPLAIN (BUFFERS, ANALYZE) results after my question was posted

```
"Sort (cost=146036.84..146036.88 rows=19 width=24) (actual time=2182.443..2182.457 rows=152 loops=1)"
" Sort Key: locomotive_id"
" Sort Method: quicksort Memory: 36kB"
" Buffers: shared hit=13095"
" -> HashAggregate (cost=146036.24..146036.43 rows=19 width=24) (actual time=2182.144..2182.360 rows=152 loops=1)"
" Buffers: shared hit=13095"
" -> Bitmap Heap Scan on bpkdmp (cost=12393.84..138736.97 rows=583942 width=24) (actu

Solution

All PostgreSQL is doing here is a bitmap heap scan on bpkdmp_datetime_ix to find blocks that might contain matching rows, then a heap scan of those blocks to find matching rows in bpkdmp. It then groups the rows into hash buckets using hashes of the grouping key, sums each bucket, and sorts the results. It's a simple, basic query plan - it might perform better if you throw lots of work_mem at it, but it might not, too.

There's no parallelism anywhere in that query, either; it'll all happen on one core.

I can only assume that MongoDB is using a less efficient method, or is not benefiting from an appropriate index. You'd need to show the explain for the MongoDB query for a useful comment there to be possible; see cursor.explain.

Context

StackExchange Database Administrators Q#57448, answer score: 9

Revisions (0)

No revisions yet.