HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL importing thousands of columns as an array

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
postgresqlcolumnsarraythousandsimporting

Problem

PROBLEM:

I have a CSV file with 16,382 columns with data that looks like this:

+-------------+-----------+------+-------+------+------+------+-----+-------+-----+
| PATIENT_ID  | DIAGNOSIS |  1   |   2   |  3   |  9   |  10  | 13  |  14   | ... |
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+
| X764_130520 | 0         | 0.35 | 9.68  | 0.11 | 0.04 | 0.03 | 0   | 32.54 |     |
| X800_130701 | 0         | 2.24 | 32.04 | 0.13 | 0.34 | 0.04 | 0   | 39.47 | ... |
| X164_120423 | 6         | 3.12 | 24.08 | 0.04 | 0.1  | 0.08 | 0   | 73.47 | ... |
| X218_120425 | 6         | 2.48 | 20.62 | 0    | 0.22 | 0    | 0   | 59.06 | ... |
| ...         | ...       | ...  | ...   | ...  | ...  | ...  | ... | ...   |     |
| X824_130725 | 1         | 0.76 | 44.77 | 0.1  | 0.28 | 0.13 | 0   | 45.35 |     |
+-------------+-----------+------+-------+------+------+------+-----+-------+-----+


RAW FORMAT:

PATIENT_ID,DIAGNOSIS,1,2,3,9,10,13,14,15,16,18,19,20,21,22,23,24,25,...
X764_130520,0,0.35,9.68,0.11,0.04,0.03,0,32.54,0.13,49.73,33.34,0.77,..
X800_130701,0,2.24,32.04,0.13,0.34,0.04,0,39.47,0.51,44.92,...
X218_120425,6,2.48,20.62,0,0.22,0,0,59.06,0.11,86.48,62.63,1.09,110.83,...
X266_120430,6,2.66,19.95,0.05,0.13,0.1,0,39.72,0.32,41.78,40.78,1.03,...
X276_120430,6,2.45,14.64,0,0.06,0.1,0,60.77,0.34,88.92,59.66,...
X411_120503,6,3.33,28.23,0,0.19,0.04,0,54.92,0.52,64.29,53.62,0.7,...
X448_120507,6,2.48,22.98,0,0.08,0,0,46.58,0.25,67.85,57.76,...
X517_120515,6,3.5,30.19,0.07,0.07,0.03,0.02,51.38,0.25,67.89,56.2,1.07,...


Except for the first two columns, everything else afterward are 16,380 columns of floating points. The heading for these floating points are integer IDs of human genes and are not contiguous natural numbers.

WHAT I WANT:

I am using PostgreSQL 9.5.6 on 64-bit Linux based on Ubuntu 16.04.4

I want to put the 16,380 numbers into a REAL array of size 16,380 in PostgreSQL, but I am lost as to how to do this.

I want th

Solution

You don't need any auxiliary tools. This works on any platform.

COPY to a simple temporary ancillary table:

CREATE TEMP TABLE tmp(txt text);

COPY tmp FROM '/path/to/your_file.csv' (FORMAT csv, ENCODING 'utf8', DELIMITER E'\b');


I chose backspace as delimiter (E'\b') which never occurs. This way you get one text column.

Assuming UTF8 encoding. Adapt if necessary.

If you can't use SQL COPY, consider the psql meta command \copy instead:

  • Postgres client copy (\copy) command doesn't have access to a temporary table?



Create the target table by parsing the row in the next step:

CREATE TABLE target AS
SELECT split_part(txt, ',', 1) AS patient_id
     , split_part(txt, ',', 2) AS diagnosis
     , string_to_array(substring(txt, '^(?:[^,]*,){2}(.*)'), ',')::real[] AS arr
FROM   tmp;


Simple and fast.

The temporary table dies at the end of the session automatically.

The first line of integer values can be stored in real[], too.

About split_part():

  • Split comma separated column data into additional columns



The regular expression in substring(): '^(?:[^,],){2}(.)' - it basically says: omit the first two fields and take the rest.

This assumes no white space between values else you might want to use trim(). Related answer with more explanation and links for regular expressions:

  • Filter strings with regex before casting to numeric



Since an array does not seem useful for your purpose, you might instead just keep the comma-separated list:

substring(txt, '^(?:[^,]*,){2}(.*)') AS list

Code Snippets

CREATE TEMP TABLE tmp(txt text);

COPY tmp FROM '/path/to/your_file.csv' (FORMAT csv, ENCODING 'utf8', DELIMITER E'\b');
CREATE TABLE target AS
SELECT split_part(txt, ',', 1) AS patient_id
     , split_part(txt, ',', 2) AS diagnosis
     , string_to_array(substring(txt, '^(?:[^,]*,){2}(.*)'), ',')::real[] AS arr
FROM   tmp;
substring(txt, '^(?:[^,]*,){2}(.*)') AS list

Context

StackExchange Database Administrators Q#166867, answer score: 4

Revisions (0)

No revisions yet.