patternsqlMinor
PostgreSQL importing thousands of columns as an array
Viewed 0 times
postgresqlcolumnsarraythousandsimporting
Problem
PROBLEM:
I have a CSV file with 16,382 columns with data that looks like this:
RAW FORMAT:
Except for the first two columns, everything else afterward are 16,380 columns of floating points. The heading for these floating points are integer IDs of human genes and are not contiguous natural numbers.
WHAT I WANT:
I am using PostgreSQL 9.5.6 on 64-bit Linux based on Ubuntu 16.04.4
I want to put the 16,380 numbers into a REAL array of size 16,380 in PostgreSQL, but I am lost as to how to do this.
I want th
I have a CSV file with 16,382 columns with data that looks like this:
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+
| PATIENT_ID | DIAGNOSIS | 1 | 2 | 3 | 9 | 10 | 13 | 14 | ... |
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+
| X764_130520 | 0 | 0.35 | 9.68 | 0.11 | 0.04 | 0.03 | 0 | 32.54 | |
| X800_130701 | 0 | 2.24 | 32.04 | 0.13 | 0.34 | 0.04 | 0 | 39.47 | ... |
| X164_120423 | 6 | 3.12 | 24.08 | 0.04 | 0.1 | 0.08 | 0 | 73.47 | ... |
| X218_120425 | 6 | 2.48 | 20.62 | 0 | 0.22 | 0 | 0 | 59.06 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | |
| X824_130725 | 1 | 0.76 | 44.77 | 0.1 | 0.28 | 0.13 | 0 | 45.35 | |
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+RAW FORMAT:
PATIENT_ID,DIAGNOSIS,1,2,3,9,10,13,14,15,16,18,19,20,21,22,23,24,25,...
X764_130520,0,0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77,..
X800_130701,0,2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92,...
X218_120425,6,2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83,...
X266_120430,6,2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03,...
X276_120430,6,2.45,14.64,0,0.06,0.1,0,60.77,0.34,88.92,59.66,...
X411_120503,6,3.33,28.23,0,0.19,0.04,0,54.92,0.52,64.29,53.62,0.7,...
X448_120507,6,2.48,22.98,0,0.08,0,0,46.58,0.25,67.85,57.76,...
X517_120515,6,3.5,30.19,0.07,0.07,0.03,0.02,51.38,0.25,67.89,56.2,1.07,...Except for the first two columns, everything else afterward are 16,380 columns of floating points. The heading for these floating points are integer IDs of human genes and are not contiguous natural numbers.
WHAT I WANT:
I am using PostgreSQL 9.5.6 on 64-bit Linux based on Ubuntu 16.04.4
I want to put the 16,380 numbers into a REAL array of size 16,380 in PostgreSQL, but I am lost as to how to do this.
I want th
Solution
You don't need any auxiliary tools. This works on any platform.
I chose backspace as delimiter (
Assuming UTF8 encoding. Adapt if necessary.
If you can't use SQL
Create the target table by parsing the row in the next step:
Simple and fast.
The temporary table dies at the end of the session automatically.
The first line of
About
The regular expression in
This assumes no white space between values else you might want to use
Since an array does not seem useful for your purpose, you might instead just keep the comma-separated list:
COPY to a simple temporary ancillary table:CREATE TEMP TABLE tmp(txt text);
COPY tmp FROM '/path/to/your_file.csv' (FORMAT csv, ENCODING 'utf8', DELIMITER E'\b');I chose backspace as delimiter (
E'\b') which never occurs. This way you get one text column.Assuming UTF8 encoding. Adapt if necessary.
If you can't use SQL
COPY, consider the psql meta command \copy instead:- Postgres client copy (\copy) command doesn't have access to a temporary table?
Create the target table by parsing the row in the next step:
CREATE TABLE target AS
SELECT split_part(txt, ',', 1) AS patient_id
, split_part(txt, ',', 2) AS diagnosis
, string_to_array(substring(txt, '^(?:[^,]*,){2}(.*)'), ',')::real[] AS arr
FROM tmp;Simple and fast.
The temporary table dies at the end of the session automatically.
The first line of
integer values can be stored in real[], too.About
split_part():- Split comma separated column data into additional columns
The regular expression in
substring(): '^(?:[^,],){2}(.)' - it basically says: omit the first two fields and take the rest.This assumes no white space between values else you might want to use
trim(). Related answer with more explanation and links for regular expressions:- Filter strings with regex before casting to numeric
Since an array does not seem useful for your purpose, you might instead just keep the comma-separated list:
substring(txt, '^(?:[^,]*,){2}(.*)') AS listCode Snippets
CREATE TEMP TABLE tmp(txt text);
COPY tmp FROM '/path/to/your_file.csv' (FORMAT csv, ENCODING 'utf8', DELIMITER E'\b');CREATE TABLE target AS
SELECT split_part(txt, ',', 1) AS patient_id
, split_part(txt, ',', 2) AS diagnosis
, string_to_array(substring(txt, '^(?:[^,]*,){2}(.*)'), ',')::real[] AS arr
FROM tmp;substring(txt, '^(?:[^,]*,){2}(.*)') AS listContext
StackExchange Database Administrators Q#166867, answer score: 4
Revisions (0)
No revisions yet.