For what types of data is it better to use a relational database, and for what types a non-relational one?
Problem
I am trying to write my first big backend project. It is a mobile/web application like Instagram, but for different purposes. While searching the internet I found that Instagram uses PostgreSQL and Cassandra as its main databases, but I don't know which database it uses for which purpose, type, or part of its data.
Does anyone know more about the databases Instagram uses? Or, in general, how do I decide for which services or which types of data/application it's better to use SQL or NoSQL databases?
Solution
may I know how to decide for what services or what type of data/application it's better to use SQL or NoSQL databases?
Use a SQL database if any combination of the following is true:
- You have relational data
- Your schema is under your control
- The schema you're adhering to changes at a rate that's tolerable for you to manage
Use a NoSQL database otherwise. Specifically if your schema is ill-defined and changes more frequently than you're willing to manage.
There aren't many differences in use cases otherwise. NoSQL is somewhat a subset of SQL, but the two complement each other in different ways too.
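As a rough illustration of the "relational data" criterion above, here is a minimal sketch using Python's built-in sqlite3 module. The `users` and `posts` tables, their columns, and the helper functions are hypothetical, chosen only to resemble a photo-sharing app. They are not taken from Instagram or from the question. The point is that when records refer to each other, a SQL database can enforce those references and answer questions across them with a join.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE posts (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        caption TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO users (id, username) VALUES (1, 'alice'), (2, 'bob')")
conn.execute(
    "INSERT INTO posts (user_id, caption) "
    "VALUES (1, 'sunset'), (1, 'coffee'), (2, 'hiking')"
)


def posts_per_user(connection):
    """Return (username, post_count) pairs by joining users to posts."""
    return connection.execute(
        """
        SELECT u.username, COUNT(p.id)
        FROM users AS u
        LEFT JOIN posts AS p ON p.user_id = u.id
        GROUP BY u.id
        ORDER BY u.username
        """
    ).fetchall()


def insert_orphan_post(connection):
    """Try to insert a post for a user that does not exist; report success."""
    try:
        connection.execute(
            "INSERT INTO posts (user_id, caption) VALUES (99, 'ghost')"
        )
        return True
    except sqlite3.IntegrityError:
        # The foreign key constraint rejects the row.
        return False
```

Calling `posts_per_user(conn)` gives one row per user with their post count, and `insert_orphan_post(conn)` is rejected by the foreign key. If your data has no stable structure like this to declare up front, that is the situation where the answer suggests looking at NoSQL instead.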
What about horizontal scaling? I've read that RDBMS databases have big issues in horizontal scaling for big, rapidly growing projects.
Horizontal scaling, a.k.a. sharding, is a little gimmicky, in my honest opinion. There's a lot of room for vertical scaling with servers these days, especially with virtualization and/or cloud services.
MongoDB is one of the most mainstream NoSQL databases, and one of the first that comes to mind when the topic of horizontal scaling / sharding comes up. Many inexperienced people like to exclaim that sharding makes scaling easier, cheaper, or both. But even the developers of MongoDB claim the contrary: they state that sharding is problematic, hard to manage, and limits which queries can be run, and they even go as far as claiming vertical scaling is often more practical and cost effective (myth #5):
Sharded clusters also make your data harder to manage, and they add some limitations to the types of queries you can conduct. Sharding is useful if you need it, but it's often cheaper and easier to simply upgrade your hardware!
In fact, that same article discusses how a minimum of 8 servers is needed to properly set up sharding with MongoDB. That's surely going to be more costly, given the redundancy of paying for a full system's worth of hardware for each of the 8 servers, as opposed to vertically scaling a specific piece of hardware in your infrastructure. Of course there are exceptions, but generally speaking, it's hard for me to see it being cost efficient for the average user.
But even modern SQL databases support various implementations of horizontal scaling, should one feel the need to use this methodology. For example, SQL Server has a feature called Availability Groups, which automatically synchronizes the data from one server to the other replica servers in the same group. It even offers two modes of synchronization (synchronous-commit and asynchronous-commit), depending on whether you prioritize performance or data consistency between servers (similar to, though not exactly the same as, the idea of eventual consistency normally seen with NoSQL database synchronization in a sharded topology).
Aside from all of that, when a database is optimally architected, with efficiently designed queries, a SQL database can handle a high level of concurrency for large amounts of data. SQL Server is rated to handle tables with trillions of records. I've personally worked with tables in the tens of billions of rows on servers that were fairly heavily transactional (thousands of transactions per minute), with minimal hardware (16 GB of memory, 8 CPU cores, etc.), and most queries ran in under 1 second.
Some key ideas behind this are that data at rest can be any size, and it doesn't really matter from a performance perspective (for most use cases). Proper indexing, which uses a B-Tree data structure, allows for efficient searching of any achievable size of data (even by the big data standards of 100 years from now). That's because B-Trees have an O(log(n)) search time complexity. That means if your table has 1 billion rows, in the worst case it would take about log2(1 billion) ≈ 30 nodes to find any subset of the data. If that table grew to 1 trillion rows, it would take about log2(1 trillion) ≈ 40 nodes to find any subset. Searching through 30 or 40 nodes of a B-Tree takes a few milliseconds for any modern computer, even with the hardware of a cellphone.
Long story short, there are no differences from a performance perspective in what any modern SQL database is capable of vs. a modern NoSQL database. Most data scale problems are best solved on the software side of the equation, not the hardware side, as most performance problems are caused by poor database design / implementation and/or poorly written code (queries).
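The logarithm arithmetic above can be checked with a short sketch. The helper `levels_needed` is a hypothetical name, and it uses integer arithmetic to avoid floating-point rounding. A branching factor of 2 reproduces the answer's log2 figures. The fanout of 100 keys per node is purely an assumption for illustration: real B-Tree nodes generally hold many keys, so the actual tree depth tends to be smaller than the binary estimate.

```python
def levels_needed(n_rows, fanout):
    """Smallest number of levels h such that fanout**h >= n_rows.

    This is the worst-case number of nodes visited when each step
    narrows the search space by a factor of `fanout`.
    """
    levels, capacity = 0, 1
    while capacity < n_rows:
        capacity *= fanout
        levels += 1
    return levels


ONE_BILLION = 10**9
ONE_TRILLION = 10**12

# Binary branching, matching the log2 estimate in the text.
binary_billion = levels_needed(ONE_BILLION, 2)
binary_trillion = levels_needed(ONE_TRILLION, 2)

# Hypothetical fanout of 100 keys per node (an assumption, not a measured value).
wide_billion = levels_needed(ONE_BILLION, 100)
wide_trillion = levels_needed(ONE_TRILLION, 100)

print(binary_billion, binary_trillion)  # 30 40
print(wide_billion, wide_trillion)      # 5 6
```

Either way, growing the table a thousandfold adds only a handful of extra steps per lookup, which is the point the answer is making about indexed searches.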
Context
StackExchange Database Administrators Q#320456, answer score: 7