NoSQL: What is unstructured data?
Problem
We are currently running at the edge of our resources with our MS SQL Server based solution. We now have several traditional options for tackling the load:
- buy faster CPUs and I/O
- split some customers off to a separate server
- move the database to a cluster
All of these are expensive in terms of licensing, hardware, or time. So I want to add another option: moving the whole system to the scalable solution that the NoSQL engine Cassandra promises.
However, I am not sure about this and have no experience with NoSQL databases, so I need to understand the structure of "unstructured" data.
In our application, we basically store data entered by users in various ways as "key-value" lists. There is a parent table that contains the head element (like an Order) and a child table with the key-value pairs comprising the contents of the order (like Order_Lines).
Business-wise, an Order and its OrderLines are a unit, but because of the RDBMS they are stored in separate tables and must be joined all the time.
During operations we sometimes choose to load only the head part, but most of the time we load the head row plus some KVPs to display useful information. For example, in an overview list we show the head identifier plus some of the values in columns for each row.
UPDATE: We store forms of any kind, so basically we store "documents". Nevertheless, we have to be able to search through these forms by any value, sort them, and so on. Data access control adds another layer of complexity on the database side.
As you may guess, the number and availability of certain KVPs varies from object to object. Creating a separate table for each kind of object is not a viable option, as we would have to create thousands of tables for the different data combinations.
Would this kind of "dictionary"-like dataset be better stored in a NoSQL database, and would we gain performance from this? Would Cassandra model these head+KVPs as one dataset? Looking at the Cassandra website and some tutorials, I have the impression…
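To make the described layout concrete, here is a minimal runnable sketch of the head + key-value model, using SQLite purely for illustration; all table and column names (Orders, OrderValues, and so on) are hypothetical stand-ins, not the actual schema:

```python
import sqlite3

# In-memory database sketching the head + key-value-pair layout
# described above; every name here is illustrative, not the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (               -- parent: one head row per object
        order_id   INTEGER PRIMARY KEY,
        identifier TEXT
    );
    CREATE TABLE OrderValues (          -- child: one row per key-value pair
        order_id INTEGER REFERENCES Orders(order_id),
        key      TEXT,
        value    TEXT
    );
""")
conn.execute("INSERT INTO Orders VALUES (1, 'ORD-0001')")
conn.executemany("INSERT INTO OrderValues VALUES (1, ?, ?)",
                 [("customer", "ACME"), ("total", "99.90")])

# Loading the head row plus selected KVPs always requires a join:
rows = conn.execute("""
    SELECT o.identifier, v.key, v.value
    FROM Orders o
    JOIN OrderValues v ON v.order_id = o.order_id
    WHERE o.order_id = 1
    ORDER BY v.key
""").fetchall()
print(rows)  # [('ORD-0001', 'customer', 'ACME'), ('ORD-0001', 'total', '99.90')]
```

This is exactly the join-on-every-read pattern the question complains about: even the overview list (head identifier plus a few values per row) cannot be served from a single table.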
Solution
Despite the current popularity of NoSQL databases, IMHO the decision to adopt such a technology should be based on what you need to achieve with the stored information, not only on the performance you currently get. This means your best option may well be to stick with the SQL database and improve your hardware.
But additionally, I read something in your question that made me think. You don't say much about the current state of your database, but your sentence "we basically store data entered by users in various ways as 'key-value' lists" makes me wonder whether the problem is a poor data model rather than a lack of physical resources. I have managed really large tables (10+ billion rows) with excellent performance in "traditional" SQL databases.
I'm not saying your model is wrong; of course, I cannot advise you on the right data model with so little information about your current solution. But consider revisiting your data model as an additional option alongside the others, since you may find some clues by digging there.
Key-value lists are usually a fine trade-off when you cannot implement the model in its final form because you don't know in advance which keys you will face, or which of the possible keys you will need for a given element. But once the system is in place, I like to revisit such decisions after a while, when enough information has been gathered to identify the common cases of use and to decide whether the data model is still the best choice. If you know you will have a certain number of keys, benchmark against a regular table designed in the traditional way:
CREATE TABLE benchmarkTable (
    element INTEGER,
    key1    VARCHAR(xx),
    key2    INTEGER,
    key3    DECIMAL(5,2),
    ...
);
...adding the corresponding indexes. Try it out and measure the execution plans of both approaches. You may be surprised, especially if you fetch more than one key at a time, since, among other advantages, the number of data blocks read should be reduced and performance improved.
Hope this helps, or at least broadens the possibilities and opens a new line of investigation.
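The benchmark idea above can be sketched end to end. This uses SQLite only so the comparison is self-contained and runnable (on MS SQL Server you would compare the actual execution plans instead, e.g. with SET STATISTICS IO), and the table and key names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Key-value layout: one row per (element, key) pair.
    CREATE TABLE kv (element INTEGER, key TEXT, value TEXT);
    CREATE INDEX ix_kv ON kv (element, key);
    -- Regular wide layout: one column per known key.
    CREATE TABLE benchmarkTable (
        element INTEGER PRIMARY KEY,
        key1 TEXT, key2 INTEGER, key3 REAL
    );
""")
conn.executemany("INSERT INTO kv VALUES (?, ?, ?)",
                 [(1, "key1", "abc"), (1, "key2", "42"), (1, "key3", "3.14")])
conn.execute("INSERT INTO benchmarkTable VALUES (1, 'abc', 42, 3.14)")

# Gathering three keys from the KV layout needs three index lookups,
# expressed here as a triple self-join (a pivot would be equivalent):
eav = conn.execute("""
    SELECT k1.value, k2.value, k3.value
    FROM kv k1
    JOIN kv k2 ON k2.element = k1.element AND k2.key = 'key2'
    JOIN kv k3 ON k3.element = k1.element AND k3.key = 'key3'
    WHERE k1.element = 1 AND k1.key = 'key1'
""").fetchone()

# The wide table returns the same data from a single row:
wide = conn.execute(
    "SELECT key1, key2, key3 FROM benchmarkTable WHERE element = 1"
).fetchone()

print(eav)   # ('abc', '42', '3.14')  -- note: everything comes back as text
print(wide)  # ('abc', 42, 3.14)      -- typed columns preserved
```

Besides the extra joins, the sketch shows a second cost of the key-value layout: every value is stored as text, so types (and therefore correct sorting and range filtering) are lost unless reconstructed in the application.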
Context
StackExchange Database Administrators Q#112298, answer score: 5