HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Migrate files from MySQL BLOBs to PostgreSQL largeobjects

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
postgresqllargeobjectsblobsmigratemysqlfilesfrom

Problem

As title says, this piece of code migrates files (binary and metadata) from a database to another one with different structure.

Currently my problem is that when I have to deal with a big database (between 12 and 30GB) , at a certain point the script CRASHES. With the 30GB database, this took 4 days of activity and then it just got killed from the system.

```
#!/usr/bin/python

# ------------------------------
# Import standard libraries |
# ------------------------------
#
import os
import sys
import mysql.connector
import psycopg2
import shutil
import base64

# ------------------------------
# Import internal snippets |
# ------------------------------
#
from include.db_config import *
from include.functions import *

# ------------------------------
# Open database connections |
# ------------------------------
#
# Mysql connection
try:
cnx_msql = mysql.connector.connect( host=host_mysql, user=user_mysql, passwd=pswd_mysql, db=dbna_mysql )
except mysql.connector.Error as e:
print "MYSQL: Unable to connect!", e.msg
sys.exit(1)

# Mysql unbuffered connection
try:
cnx_msql_unbuffered = mysql.connector.connect( host=host_mysql, user=user_mysql, passwd=pswd_mysql, db=dbna_mysql )
except mysql.connector.Error as e:
print "MYSQL: Unable to connect!", e.msg
sys.exit(1)

# Postgresql connection
try:
cnx_psql = psycopg2.connect(**psql_param)
# enable autocommit
#cnx_psql.set_isolation_level(0)
except psycopg2.Error as e:
print('PSQL: Unable to connect!\n{0}').format(e)
sys.exit(1)

# -----------
# FUNCTIONS |
# -----------
#
def fv_missing_records():
cur_msql = cnx_msql_unbuffered.cursor(dictionary=True)
cur_psql = cnx_psql.cursor()

qry_1 = "SELECT r.resource_id, r.author_id, r.create_date, r.visible_from, '' AS description, r.size, \
r.format, r.content_type, r.status, rep.content, rep.iv, 1 as version, NULL AS loid \
FROM resources r \
INNER JOIN repository rep O

Solution

You don't have any performance metrics, so there is no telling whether this will have a significant effect, but one of the things you could do is combine your queries.

Take this one:

full_qry="SELECT resource_id, path, address, asas_id, progress \
            FROM resources \
            WHERE resource_id={0}".format(row['rid'])


If you first gathered all the rid's, you could then do this:

full_qry="SELECT resource_id, path, address, asas_id, progress \
            FROM resources \
            WHERE resource_id IN {0}"


That way you go to the database once and allow the database engine to optimize your one query, rather than the lot that you're shooting at it.

You could do the same with other queries, but you'd have to do some more processing to split the resultset back to something you can work with.

full_qry="SELECT v.resource_id, v.version, rep.content, v.description, v.format, v.content_type, \
                v.size, v.user_id, v.timestamp, v.status, rep.iv, NULL AS loid \
                FROM versions v \
                INNER JOIN repository rep \
                ON v.resource_id = rep.resource_id \
                AND v.version = rep.version \
                WHERE v.resource_id={0} AND v.version={1}".format(row['resource_id'], row['version'])


This too could be combined, but you'd have to do more work to get a good combination. You know better than I how the original data looks, so if you have a lot of the same version numbers ("rev1", "rev2", "rev3") and wildly varying resource id's, then you'd be better off making it WHERE v.resource_id IN with a fixed v.version. But if the resource id's are mostly the same and you have a lot of versions, then you might want to put the IN clause for v.version.

Whether or not this is worth it can be determined by just printing the queries, then sorting them by type, and running a few non-combined and combined queries by hand.

Code Snippets

full_qry="SELECT resource_id, path, address, asas_id, progress \
            FROM resources \
            WHERE resource_id={0}".format(row['rid'])
full_qry="SELECT resource_id, path, address, asas_id, progress \
            FROM resources \
            WHERE resource_id IN {0}"
full_qry="SELECT v.resource_id, v.version, rep.content, v.description, v.format, v.content_type, \
                v.size, v.user_id, v.timestamp, v.status, rep.iv, NULL AS loid \
                FROM versions v \
                INNER JOIN repository rep \
                ON v.resource_id = rep.resource_id \
                AND v.version = rep.version \
                WHERE v.resource_id={0} AND v.version={1}".format(row['resource_id'], row['version'])

Context

StackExchange Code Review Q#131622, answer score: 3

Revisions (0)

No revisions yet.