Best DB structure for large CSVs?
Problem
I previously posted this on Stack Overflow; however, I was directed here.
I'm getting to the stage where I need to analyse lots of flat CSVs:
- Read-in around 500GB of CSVs (e.g. daily data split by month)
- Group the data (e.g. by month or year)
- Output the aggregated data as a small CSV
I'm not sure SQLite is the right package for this (adding CSVs to it appears to take a long time and, as in my link above, after a certain size I can't access the database). Should I consider something else, like PostgreSQL or Microsoft SQL Server?
I'm looking to invest in a hardware/software platform for this (e.g. SSD, RAID, Microsoft SQL Server) and am hoping for some pointers on where to begin.
In particular, if PostgreSQL is a possibility, is there a way to import CSVs quickly, similar to the approach described here:
https://blog.netnerds.net/2015/01/powershell-high-performance-techniques-for-importing-csv-to-sql-server/
Edit (08/10/15):
I'm testing out uploading the data into a PostgreSQL database and am averaging 16 minutes per 10 GB CSV. My issue is that some of my columns hold very big values, so I would have to change them from bigint to varchar, etc. The data has 38 columns and around 50 million rows per file, so figuring out which categorical variables are strings and which are integers is a real pain.
At the moment I am using:
```
cur.copy_expert(sql="COPY %s FROM stdin DELIMITERS '~' CSV;", file=f)
```
with my data types mainly being varchar. I did this because the CSV is a bit messy and sometimes what I think is an integer turns out to be alphanumeric, which forces me to re-upload.
Would it be much slower to import each file as a Pandas DataFrame (so that Pandas takes care of column types for me) and use that with SQLAlchemy to insert into Postgres? I'm guessing that even if Pandas is a bit slower, it will make up for it because the column types will be optimised (since it decides on them for me).
E.g. something like this:
```
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('mypath.csv')  # Pandas infers column types on read
# connection string and table name are placeholders; substitute your own
engine = create_engine('postgresql://user:password@localhost:5432/mydb')
df.to_sql('mytable', engine, if_exists='append', index=False)
```
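If the goal is just to discover sensible column types before settling on a schema, one option (a sketch, not from the original post; the path and sample size are assumptions) is to let Pandas infer types from a sample of each file and inspect the result:
```
import pandas as pd

# read only the first 100000 rows so type inference is cheap on a 10 GB file
sample = pd.read_csv('mypath.csv', nrows=100000)
print(sample.dtypes)  # e.g. int64, float64, or object (string) per column
```
A sample can still mislead (an alphanumeric value may first appear beyond the sampled rows), so varchar remains the safe fallback for messy columns.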
Solution
Frankly, I probably wouldn't use a database in the first place. It doesn't sound like you have any need for ACID, which is one of the main things databases provide and which doesn't come cheap in terms of performance. Nor does it sound like you need complicated indexes.
What I would do (and have done, an awful lot) is just loop over the files with Perl or Python and use their built-in hash table features to generate your aggregates. Since you have specified that the output is small, you should have little problem holding the intermediate aggregation state in memory for the duration.
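A minimal sketch of that approach in Python, assuming (hypothetically) headerless rows with an ISO date like 2015-08-10 in the first column and a numeric value in the second, aggregated into per-month sums:
```
import csv
import glob
from collections import defaultdict

totals = defaultdict(float)  # intermediate aggregation state, held in memory

for path in glob.glob('data/*.csv'):       # loop over the flat CSV files
    with open(path, newline='') as f:
        for row in csv.reader(f):
            month = row[0][:7]             # '2015-08' from '2015-08-10'
            totals[month] += float(row[1])

with open('aggregated.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['month', 'total'])
    for month, total in sorted(totals.items()):
        writer.writerow([month, total])
```
Because only one running total per group is kept, memory use scales with the number of groups, not with the 500 GB of input.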
Context
StackExchange Database Administrators Q#117010, answer score: 3