Should I do the maths within PostgreSQL for ACIDity?
Problem
This is my first question on Database Administrators Stack Exchange, so please be kind if something is not done the proper way.
I would like to be advised concerning the way I'm working on my database, in order to make it as ACID as possible.
I am working on a Web Application using Flask Framework (Python) and a PostgreSQL (9.4) database. I am using SQLAlchemy as ORM.
I am doing Financial Analysis on Stocks handled by my Hedge Fund.
My work consists of three parts:
- updating the database daily
- doing mathematical analysis on it and looking for the best results
- presenting the results on the web app
Here is my database:
security table (the column active only means whether the stock should be updated or not):

```
 id |     ticker     | active
----+----------------+--------
  1 | 3993 HK Equity | t
  2 | A2A IM Equity  | t
  3 | AA UN Equity   | f
  4 | AA/ LN Equity  | f
  5 | AAL LN Equity  | t
  6 | AALB NA Equity | t
```

valuation table:

```
 security_id |    date    | px_close | px_open | px_high | px_low  | volume
-------------+------------+----------+---------+---------+---------+--------
          42 | 2012-11-06 |   44.757 |  45.099 |  45.336 |  43.918 | 157295
          42 | 2012-11-07 |   45.099 |  44.791 |  45.099 |  44.226 | 162671
          42 | 2012-11-08 |   44.245 |  45.051 |  45.08  |  44.241 |  56763
          42 | 2012-11-09 |   43.994 |  44.307 |  44.672 |  43.771 | 123424
          43 | 2012-11-06 |  143.994 | 144.255 | 144.601 | 143.785 |  54598
          43 | 2012-11-07 |  143.885 | 143.629 | 144.108 | 143.524 |  39380
          43 | 2012-11-08 |  144.392 | 144.056 | 144.928 | 143.866 |  60509
          43 | 2012-11-09 |  144.008 | 144.563 |  144.61 | 143.913 |  62784
```

regression table (I have a lot of other tables like this one, e.g. polynomial_regression, learning_clustering, multiple_regression, etc., but this one serves as the example):

```
 security_id | driver_id | security_field |
```
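For reference, the tables above can be reproduced as a minimal, self-contained sketch. This uses Python's built-in sqlite3 purely for illustration (the real database is PostgreSQL behind SQLAlchemy), and the column types are assumptions:

```python
import sqlite3

# In-memory stand-in for the schema described above (types are guesses).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE security (
        id     INTEGER PRIMARY KEY,
        ticker TEXT NOT NULL,
        active BOOLEAN NOT NULL   -- whether the stock should still be updated
    );
    CREATE TABLE valuation (
        security_id INTEGER REFERENCES security(id),
        date        TEXT,
        px_close REAL, px_open REAL, px_high REAL, px_low REAL,
        volume   INTEGER,
        PRIMARY KEY (security_id, date)
    );
""")
conn.execute("INSERT INTO security VALUES (1, '3993 HK Equity', 1)")
conn.execute("INSERT INTO valuation VALUES "
             "(1, '2012-11-06', 44.757, 45.099, 45.336, 43.918, 157295)")

row = conn.execute("SELECT s.ticker, v.px_close FROM security s "
                   "JOIN valuation v ON v.security_id = s.id").fetchone()
print(row)
```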
Solution
Welcome to DBA Stack Exchange!
> My problem is that I don't know the 'speed' of SQL when it deals with stats and maths, compared to other languages. I know that basic functions (correlation, R^2, ...) are already implemented in SQL, but I am using far more 'advanced' (I mean 'complex' here..) functions (even if I have not represented them here).
As a rule of thumb, for basic aggregations (grouping, joining, summation) on large datasets (millions of rows), SQL will perform better. I'd recommend you leverage SQL's strengths here and use a hybrid approach: prep your data with basic aggregations as much as possible, but leave the higher-level mathematics and analysis logic in Python.
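A minimal sketch of that hybrid approach, using the standard-library sqlite3 module in place of PostgreSQL and a made-up slice of the valuation table (the numbers are from the question, the split of work is the point):

```python
import sqlite3

# Hypothetical slice of the valuation table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE valuation (security_id INTEGER, date TEXT, px_close REAL)")
conn.executemany("INSERT INTO valuation VALUES (?, ?, ?)", [
    (42, "2012-11-06", 44.757),
    (42, "2012-11-07", 45.099),
    (42, "2012-11-08", 44.245),
    (43, "2012-11-06", 143.994),
    (43, "2012-11-07", 143.885),
])

# Step 1: let SQL do what it is good at -- grouping and aggregating --
# so only a small result set crosses the wire.
rows = conn.execute("""
    SELECT security_id, AVG(px_close), MIN(px_close), MAX(px_close), COUNT(*)
    FROM valuation
    GROUP BY security_id
    ORDER BY security_id
""").fetchall()

# Step 2: hand the aggregated result to Python for whatever higher-level
# analysis would be painful to express in SQL.
for security_id, avg_close, lo, hi, n in rows:
    print(security_id, round(avg_close, 3), round(hi - lo, 3), n)
```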
> Would you think I should translate all my mathematical analysis from Python to SQL, and create triggers launching it whenever a new value is inserted?
No.
As stated before, you can do basic aggregate functions, but I expect you will find it tedious, if not outright impossible, to model the more 'complex' statistical functions from a feature-rich language such as Python in a constrained language such as SQL. Here is a decent article exploring this concept. Even if we are successful in translating Python functions into SQL, our success is going to be highly dependent on whether the database is optimized to deliver a set-based representation of all the data inputs needed for those calculations at insert time.
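To make the tedium concrete, here is a textbook statistic (Pearson's r, on invented data) written both ways. PostgreSQL does ship a corr() aggregate, but anything beyond the built-ins has to be spelled out like the SQL version below; the sketch uses sqlite3 and registers a SQRT function so it runs anywhere:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("SQRT", 1, math.sqrt)  # portable stand-in for SQL sqrt()
conn.execute("CREATE TABLE xy (x REAL, y REAL)")
conn.executemany("INSERT INTO xy VALUES (?, ?)",
                 [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)])

# Pearson's r spelled out by hand in SQL -- already unwieldy, and this is
# one of the simplest statistics there is:
(r_sql,) = conn.execute("""
    SELECT (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y))
         / (SQRT(COUNT(*) * SUM(x * x) - SUM(x) * SUM(x))
          * SQRT(COUNT(*) * SUM(y * y) - SUM(y) * SUM(y)))
    FROM xy
""").fetchone()

# The same statistic in plain Python: short, readable, and easy to extend.
xs, ys = zip(*conn.execute("SELECT x, y FROM xy"))
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
r_py = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        / math.sqrt(sum((a - mx) ** 2 for a in xs)
                    * sum((b - my) ** 2 for b in ys)))
```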
> The thing is that between the update of the values, and the update of the regressions, the database is not in a 'stable' state: new values have been inserted, but the corresponding regressions don't exist yet.
Let's be sure we aren't dogmatically promoting database ACIDity at the potential cost of performance, and at the assured cost of increased code complexity. I DO understand the importance of ACIDity, but from your description this sounds like an analytic workflow to me: a 3 AM bulk load of data, which is subsequently analyzed outside of SQL, with the results then re-inserted into the database.
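If the intermediate state is the real worry, a lighter-weight option than triggers is to wrap the load and the regression refresh in a single transaction, so readers never see "values inserted, regressions missing". A sketch with the standard-library sqlite3 module (table names from the question; `nightly_refresh` and the numbers are hypothetical; in PostgreSQL this would be one BEGIN/COMMIT through SQLAlchemy):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE valuation  (security_id INTEGER, date TEXT, px_close REAL);
    CREATE TABLE regression (security_id INTEGER, slope REAL, intercept REAL);
""")

def nightly_refresh(conn, prices, fitted):
    # One transaction: either both the new prices and the regressions
    # computed from them become visible, or neither does.
    with conn:  # sqlite3 commits on clean exit, rolls back on exception
        conn.executemany("INSERT INTO valuation VALUES (?, ?, ?)", prices)
        conn.executemany("INSERT INTO regression VALUES (?, ?, ?)", fitted)

nightly_refresh(conn,
                prices=[(42, "2012-11-10", 44.2)],
                fitted=[(42, 0.013, 43.9)])  # illustrative numbers only
```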
To summarize: yes, SQL will perform better on basic aggregations across large datasets, but higher-level statistical analysis is best left to Python. I'm not sure it's worth the risk of translating Python functions into SQL triggers for the sake of ACIDity.
P.S. I know this is a late answer to your question, so let us know what you eventually decided to implement!
Context
StackExchange Database Administrators Q#132150, answer score: 3