patternpythonMinor
Migrate files from MySQL BLOBs to PostgreSQL largeobjects
Viewed 0 times
postgresqllargeobjectsblobsmigratemysqlfilesfrom
Problem
As title says, this piece of code migrates files (binary and metadata) from a database to another one with different structure.
Currently my problem is that when I have to deal with a big database (between 12 and 30GB) , at a certain point the script CRASHES. With the 30GB database, this took 4 days of activity and then it just got killed from the system.
```
#!/usr/bin/python
# ------------------------------
# Import standard libraries |
# ------------------------------
#
import os
import sys
import mysql.connector
import psycopg2
import shutil
import base64
# ------------------------------
# Import internal snippets |
# ------------------------------
#
from include.db_config import *
from include.functions import *
# ------------------------------
# Open database connections |
# ------------------------------
#
# Mysql connection
try:
cnx_msql = mysql.connector.connect( host=host_mysql, user=user_mysql, passwd=pswd_mysql, db=dbna_mysql )
except mysql.connector.Error as e:
print "MYSQL: Unable to connect!", e.msg
sys.exit(1)
# Mysql unbuffered connection
try:
cnx_msql_unbuffered = mysql.connector.connect( host=host_mysql, user=user_mysql, passwd=pswd_mysql, db=dbna_mysql )
except mysql.connector.Error as e:
print "MYSQL: Unable to connect!", e.msg
sys.exit(1)
# Postgresql connection
try:
cnx_psql = psycopg2.connect(**psql_param)
# enable autocommit
#cnx_psql.set_isolation_level(0)
except psycopg2.Error as e:
print('PSQL: Unable to connect!\n{0}').format(e)
sys.exit(1)
# -----------
# FUNCTIONS |
# -----------
#
def fv_missing_records():
cur_msql = cnx_msql_unbuffered.cursor(dictionary=True)
cur_psql = cnx_psql.cursor()
qry_1 = "SELECT r.resource_id, r.author_id, r.create_date, r.visible_from, '' AS description, r.size, \
r.format, r.content_type, r.status, rep.content, rep.iv, 1 as version, NULL AS loid \
FROM resources r \
INNER JOIN repository rep O
Currently my problem is that when I have to deal with a big database (between 12 and 30GB) , at a certain point the script CRASHES. With the 30GB database, this took 4 days of activity and then it just got killed from the system.
```
#!/usr/bin/python
# ------------------------------
# Import standard libraries |
# ------------------------------
#
import os
import sys
import mysql.connector
import psycopg2
import shutil
import base64
# ------------------------------
# Import internal snippets |
# ------------------------------
#
from include.db_config import *
from include.functions import *
# ------------------------------
# Open database connections |
# ------------------------------
#
# Mysql connection
try:
cnx_msql = mysql.connector.connect( host=host_mysql, user=user_mysql, passwd=pswd_mysql, db=dbna_mysql )
except mysql.connector.Error as e:
print "MYSQL: Unable to connect!", e.msg
sys.exit(1)
# Mysql unbuffered connection
try:
cnx_msql_unbuffered = mysql.connector.connect( host=host_mysql, user=user_mysql, passwd=pswd_mysql, db=dbna_mysql )
except mysql.connector.Error as e:
print "MYSQL: Unable to connect!", e.msg
sys.exit(1)
# Postgresql connection
try:
cnx_psql = psycopg2.connect(**psql_param)
# enable autocommit
#cnx_psql.set_isolation_level(0)
except psycopg2.Error as e:
print('PSQL: Unable to connect!\n{0}').format(e)
sys.exit(1)
# -----------
# FUNCTIONS |
# -----------
#
def fv_missing_records():
cur_msql = cnx_msql_unbuffered.cursor(dictionary=True)
cur_psql = cnx_psql.cursor()
qry_1 = "SELECT r.resource_id, r.author_id, r.create_date, r.visible_from, '' AS description, r.size, \
r.format, r.content_type, r.status, rep.content, rep.iv, 1 as version, NULL AS loid \
FROM resources r \
INNER JOIN repository rep O
Solution
You don't have any performance metrics, so there is no telling whether this will have a significant effect, but one of the things you could do is combine your queries.
Take this one:
If you first gathered all the
That way you go to the database once and allow the database engine to optimize your one query, rather than the lot that you're shooting at it.
You could do the same with other queries, but you'd have to do some more processing to split the resultset back to something you can work with.
This too could be combined, but you'd have to do more work to get a good combination. You know better than I how the original data looks, so if you have a lot of the same version numbers ("rev1", "rev2", "rev3") and wildly varying resource id's, then you'd be better off making it
Whether or not this is worth it can be determined by just printing the queries, then sorting them by type, and running a few non-combined and combined queries by hand.
Take this one:
full_qry="SELECT resource_id, path, address, asas_id, progress \
FROM resources \
WHERE resource_id={0}".format(row['rid'])If you first gathered all the
rid's, you could then do this:full_qry="SELECT resource_id, path, address, asas_id, progress \
FROM resources \
WHERE resource_id IN {0}"That way you go to the database once and allow the database engine to optimize your one query, rather than the lot that you're shooting at it.
You could do the same with other queries, but you'd have to do some more processing to split the resultset back to something you can work with.
full_qry="SELECT v.resource_id, v.version, rep.content, v.description, v.format, v.content_type, \
v.size, v.user_id, v.timestamp, v.status, rep.iv, NULL AS loid \
FROM versions v \
INNER JOIN repository rep \
ON v.resource_id = rep.resource_id \
AND v.version = rep.version \
WHERE v.resource_id={0} AND v.version={1}".format(row['resource_id'], row['version'])This too could be combined, but you'd have to do more work to get a good combination. You know better than I how the original data looks, so if you have a lot of the same version numbers ("rev1", "rev2", "rev3") and wildly varying resource id's, then you'd be better off making it
WHERE v.resource_id IN with a fixed v.version. But if the resource id's are mostly the same and you have a lot of versions, then you might want to put the IN clause for v.version.Whether or not this is worth it can be determined by just printing the queries, then sorting them by type, and running a few non-combined and combined queries by hand.
Code Snippets
full_qry="SELECT resource_id, path, address, asas_id, progress \
FROM resources \
WHERE resource_id={0}".format(row['rid'])full_qry="SELECT resource_id, path, address, asas_id, progress \
FROM resources \
WHERE resource_id IN {0}"full_qry="SELECT v.resource_id, v.version, rep.content, v.description, v.format, v.content_type, \
v.size, v.user_id, v.timestamp, v.status, rep.iv, NULL AS loid \
FROM versions v \
INNER JOIN repository rep \
ON v.resource_id = rep.resource_id \
AND v.version = rep.version \
WHERE v.resource_id={0} AND v.version={1}".format(row['resource_id'], row['version'])Context
StackExchange Code Review Q#131622, answer score: 3
Revisions (0)
No revisions yet.