HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Find friends of friends (recursively) efficiently using Postgresql

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlefficientlyrecursivelyusingfindfriends

Problem

Objective: Users submit their Contact Books, and then the application looks for connections between users, according to their Phone Number. Something like "6 Handshakes" idea (https://en.wikipedia.org/wiki/Six_degrees_of_separation).

Problem: Make this query performance close to real time. When the User submits his phone number and gets the full list of other phones, he may know. Plain list - without connections (graph vertices etc), but full, not paginated (this requirement is here because original goal is more complex).

Question: is it possible to achieve close-to-realtime performance with pure relational database, without Graph databases (Neo4j, etc), graph extensions (bitnine agensgraph) or workflow redesign? Any denormalization is possible, but to my understanding, it won't help.

Given:

test=# select * from connections;
 user_phone | friend_phone
------------+--------------
          1 |            2
          1 |            3
          1 |            4
          2 |            6
          2 |            7
          2 |            8
          8 |           10
          8 |           11
          8 |           12
         20 |           30
         40 |           50
         60 |           70


I expect to receive the following connections for User with Phone === 1:

friend_phone
--------------
            2
            3
            4
            6
            7
            8
           10
           11
           12
(9 rows)


It is really difficult to estimate real-world connections numbers. But I was testing at least with:

  • 10,000 Users (Phone Numbers)



  • Each user was randomly assigned with 50-1000 Connections with pseudo-random other Users



  • This resulted in about 1,000,000 Connections



If it is impossible to achieve this in general (using some tricky ORDER BYs or subqueries etc) - what metrics should be considered for example to understand that:

with 1,000,000 connections you need 128GB RAM instance to get 2 seconds response time

an

Solution

Just by looking at the explain plan, we see that the optimizer's estimate are totally off the grid:

  • the sort node is estimated at 46170 rows whereas there are actually 637677 rows



  • the index scan is estimated at 2270852 rows whereas there are actually 1722262 rows



And so on.

So I think you'll need to vacuum analyze your tables connections and friends before getting a new plan.

Maybe, it will only help a little, but we need to get rid of this problem before trying to understand what's happening and how to optimize that query in your context.

Context

StackExchange Database Administrators Q#271010, answer score: 2

Revisions (0)

No revisions yet.