patternsqlMinor
Biological Sequences of UniProt in PostgreSQL
Viewed 0 times
uniprotbiologicalpostgresqlsequences
Problem
What is the best way to store UniProt biological sequences in PostreSQL?
Data Details
Data Access Details
To answer Jeremiah Peschka's comment:
Backwards Compatibility
It would be nice to be able to continue to be able to apply the following hashing function (SEGUID - SEquence Globally Unique IDentifier) to the sequences.
Data Details
- We pull in 12 million sequences from UniProt - this number is likely to double every 3-10 month.
- The length of a sequence can vary from 10 to 50 billion characters
- Less than 1% of the sequences are longer than 10 thousand characters
- Would it improve performance to store the longer sequences separately?
- A sequence can be of either Protein or DNA alphabet
- The DNA alphabet has 5 characters (A, T, C, G, or -).
- The Protein alphabet will have around 30 characters.
- We don't mind storing the sequences of the two different alphabets in different columns or even different tables. Would that help?
Data Access Details
To answer Jeremiah Peschka's comment:
- Protein and DNA sequences would be accessed at different times
- Would not need to search within the sequence (that's done outside of db)
- Would ether access single rows at a time or pull out sets of rows by IDs. We would not need to scan rows. All sequences are referenced by other tables - several biologically and chronologically meaningful hierarchies exist in the database.
Backwards Compatibility
It would be nice to be able to continue to be able to apply the following hashing function (SEGUID - SEquence Globally Unique IDentifier) to the sequences.
CREATE OR REPLACE FUNCTION gfam.get_seguid(p_sequence character varying)
RETURNS character varying AS
$BODY$
declare
result varchar := null;
x integer;
begin
select encode(gfam.digest(p_sequence, 'sha1'), 'base64')
into result;
x := length(result);
if substring(result from x for 1) = '=' then
result := substring( result from 1 for x-1 );
end if;
return result;
end;
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;Solution
Exploring the functions at PostBio it looks like they have a couple of ways of encoding. However, given that those extensions are optimized for searching, they make multiple references to simply using the
According to the documentation:
Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB.
Therefore, by putting the table into its own very large tablespace on dedicated hardware should be sufficient for your performance goals. If 1 GB is too small for your data, the int_interval from ProtBio should provide excellent performance:
A sequence feature corresponds to a triplet (id, orient, ii) where id is a sequence identifier (possibly the primary key for a sequence table), orient is a boolean indicating if the feature is in the same or contrary orientation of the sequence, and ii is the int_interval representing the feature as a subsequence.
Encoding the sequence in sha1 looks to be a very painful way of making a GUID, considering the potential lengths of the sequence.
If the different sequences are unrelated, store them on different tablespaces on different disks for maximum performance.
text data type. According to the documentation:
Long strings are compressed by the system automatically, so the physical requirement on disk might be less. Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values. In any case, the longest possible character string that can be stored is about 1 GB.
Therefore, by putting the table into its own very large tablespace on dedicated hardware should be sufficient for your performance goals. If 1 GB is too small for your data, the int_interval from ProtBio should provide excellent performance:
A sequence feature corresponds to a triplet (id, orient, ii) where id is a sequence identifier (possibly the primary key for a sequence table), orient is a boolean indicating if the feature is in the same or contrary orientation of the sequence, and ii is the int_interval representing the feature as a subsequence.
Encoding the sequence in sha1 looks to be a very painful way of making a GUID, considering the potential lengths of the sequence.
If the different sequences are unrelated, store them on different tablespaces on different disks for maximum performance.
Context
StackExchange Database Administrators Q#66, answer score: 7
Revisions (0)
No revisions yet.