HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Changing the representation of NULLs in pg_dump plaintext output

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
thepg_dumpoutputchangingrepresentationnullsplaintext

Problem

I'm trying to convert a large-ish Postgres DB (500 GB) to SQL Server 2012. After investigating a few third-party tools and being disappointed in features, performance, or both, I started pursuing a simple pg_dump/bulk import solution. Things looked promising until I realized that pg_dump represents NULLs in plaintext as "\N", which causes the bulk insert to vomit in cases of type mismatch. Even if I were to automate the pg_dump process to produce a single file per table, some of the individual tables involved are very large (20-50 GB) and performing a comprehensive find-replace using even fast file editing options in linux, or a Perl script, add too much overhead to the time required for the import/export.

I'm hoping there's a way to modify the NULL representation in the pg_dump output that I'm not aware of, or failing that, to get some recommendations for alternative approaches to this process in terms of tools or strategies. Thanks in advance for your help.

Solution

I think the best for your situation is to use the COPY ... TO with one file for each table, and use pg_dump only for schema.

With COPY you can do change the output format (including NULL), example:

COPY foo TO '/path/to/foo.output' WITH NULL '';


You can even use CSV or other format (I don't know what MSSQL recognizes).

To automatize, you can easily check all the tables with a shell script. To achieve that, first edit as needed the following and save in getcommand.sql:

SELECT 'COPY '||c.oid::regclass||' TO ''/path/to/'||c.oid::regclass||''' WITH NULL '''';'
FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind='r' AND n.nspname !~ '^(information_schema$|pg_)';


With that, you use the following to execute the copys (you can even use xargs to parallelize it):

psql  -XAtf getcommand.sql | psql 


It will generate and execute the COPY commands to export all tables data.

EDIT1: Added regclass to filename and corrected a syntax error on the generated COPY command (the WITH NULL goes at the end, not begining). Added -X to psql to avoid unexpected input through pipe.

EDIT2: Parallelism

I have made the test with xargs to parallelize it, and the result is the following:

psql -XAtf getcommand.sql | xargs -d '\n' -L 1 -P  psql -c


On a simple test I did, with a small database I use for tests (~800MB), with no parallelism it took 6 seconds to complete, and with 4 jobs it took little more than 1 second. I imagine that for a 500GB database on a good server, the difference will be bigger. If you want to compare:

time ( psql -XAtf getcommand.sql | xargs -d '\n' -L 1 -P 4 psql -c )
time psql -XAtf getcommand.sql | psql


NOTICE: For those thinking in use it for parallel backups (e.g. to substitute pg_dump): don't do it! Because this solution uses different sessions (and transactions) for each table, so it can't provide consistency on the result if there are other users working on the database. On PG 9.2+ we can use transaction snapshot to guarantee consistency, but PG 9.3 (still beta) already has parallelism on pg_dump.

Code Snippets

COPY foo TO '/path/to/foo.output' WITH NULL '<your null>';
SELECT 'COPY '||c.oid::regclass||' TO ''/path/to/'||c.oid::regclass||''' WITH NULL '''';'
FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind='r' AND n.nspname !~ '^(information_schema$|pg_)';
psql <conn_info> -XAtf getcommand.sql | psql <conn_info>
psql -XAtf getcommand.sql | xargs -d '\n' -L 1 -P <num jobs> psql -c
time ( psql -XAtf getcommand.sql | xargs -d '\n' -L 1 -P 4 psql -c )
time psql -XAtf getcommand.sql | psql

Context

StackExchange Database Administrators Q#45055, answer score: 5

Revisions (0)

No revisions yet.