patternsqlMinor
Changing the representation of NULLs in pg_dump plaintext output
Viewed 0 times
thepg_dumpoutputchangingrepresentationnullsplaintext
Problem
I'm trying to convert a large-ish Postgres DB (500 GB) to SQL Server 2012. After investigating a few third-party tools and being disappointed in features, performance, or both, I started pursuing a simple pg_dump/bulk import solution. Things looked promising until I realized that pg_dump represents NULLs in plaintext as "\N", which causes the bulk insert to vomit in cases of type mismatch. Even if I were to automate the pg_dump process to produce a single file per table, some of the individual tables involved are very large (20-50 GB) and performing a comprehensive find-replace using even fast file editing options in linux, or a Perl script, add too much overhead to the time required for the import/export.
I'm hoping there's a way to modify the NULL representation in the pg_dump output that I'm not aware of, or failing that, to get some recommendations for alternative approaches to this process in terms of tools or strategies. Thanks in advance for your help.
I'm hoping there's a way to modify the NULL representation in the pg_dump output that I'm not aware of, or failing that, to get some recommendations for alternative approaches to this process in terms of tools or strategies. Thanks in advance for your help.
Solution
I think the best for your situation is to use the
With
You can even use CSV or other format (I don't know what MSSQL recognizes).
To automatize, you can easily check all the tables with a shell script. To achieve that, first edit as needed the following and save in
With that, you use the following to execute the copys (you can even use
It will generate and execute the
EDIT1: Added
EDIT2: Parallelism
I have made the test with
On a simple test I did, with a small database I use for tests (~800MB), with no parallelism it took 6 seconds to complete, and with 4 jobs it took little more than 1 second. I imagine that for a 500GB database on a good server, the difference will be bigger. If you want to compare:
NOTICE: For those thinking in use it for parallel backups (e.g. to substitute
COPY ... TO with one file for each table, and use pg_dump only for schema.With
COPY you can do change the output format (including NULL), example:COPY foo TO '/path/to/foo.output' WITH NULL '';You can even use CSV or other format (I don't know what MSSQL recognizes).
To automatize, you can easily check all the tables with a shell script. To achieve that, first edit as needed the following and save in
getcommand.sql:SELECT 'COPY '||c.oid::regclass||' TO ''/path/to/'||c.oid::regclass||''' WITH NULL '''';'
FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind='r' AND n.nspname !~ '^(information_schema$|pg_)';With that, you use the following to execute the copys (you can even use
xargs to parallelize it):psql -XAtf getcommand.sql | psql It will generate and execute the
COPY commands to export all tables data.EDIT1: Added
regclass to filename and corrected a syntax error on the generated COPY command (the WITH NULL goes at the end, not begining). Added -X to psql to avoid unexpected input through pipe.EDIT2: Parallelism
I have made the test with
xargs to parallelize it, and the result is the following:psql -XAtf getcommand.sql | xargs -d '\n' -L 1 -P psql -cOn a simple test I did, with a small database I use for tests (~800MB), with no parallelism it took 6 seconds to complete, and with 4 jobs it took little more than 1 second. I imagine that for a 500GB database on a good server, the difference will be bigger. If you want to compare:
time ( psql -XAtf getcommand.sql | xargs -d '\n' -L 1 -P 4 psql -c )
time psql -XAtf getcommand.sql | psqlNOTICE: For those thinking in use it for parallel backups (e.g. to substitute
pg_dump): don't do it! Because this solution uses different sessions (and transactions) for each table, so it can't provide consistency on the result if there are other users working on the database. On PG 9.2+ we can use transaction snapshot to guarantee consistency, but PG 9.3 (still beta) already has parallelism on pg_dump.Code Snippets
COPY foo TO '/path/to/foo.output' WITH NULL '<your null>';SELECT 'COPY '||c.oid::regclass||' TO ''/path/to/'||c.oid::regclass||''' WITH NULL '''';'
FROM pg_class c JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE c.relkind='r' AND n.nspname !~ '^(information_schema$|pg_)';psql <conn_info> -XAtf getcommand.sql | psql <conn_info>psql -XAtf getcommand.sql | xargs -d '\n' -L 1 -P <num jobs> psql -ctime ( psql -XAtf getcommand.sql | xargs -d '\n' -L 1 -P 4 psql -c )
time psql -XAtf getcommand.sql | psqlContext
StackExchange Database Administrators Q#45055, answer score: 5
Revisions (0)
No revisions yet.