patterncppModerate
Simple program to split DNA sequences based on ambiguous content
Viewed 0 times
simpleprogramambiguoussplitdnabasedcontentsequences
Problem
The problem
I have a program that takes a file of DNA sequences as input and writes each entry to one of two output files: one file for sequences containing only canonical characters (
The data
DNA sequencing machines produce gobs of data, with each entry adhering to the following format.
The first line contains an ID and metadata, the second line contains DNA sequence (this is what the program filters on), the third line is stupid and can be ignored, and the fourth line contains ASCII-encoded quality values corresponding to each nucleotide in the DNA. Here are 3 entries from a real file.
When a DNA sequencing machine cannot reliably determine the molecule at a given position, it will use an
I have a program that takes a file of DNA sequences as input and writes each entry to one of two output files: one file for sequences containing only canonical characters (
A, C, G, and T), and another file for sequences containing non-canonical (or ambiguous) characters.The data
DNA sequencing machines produce gobs of data, with each entry adhering to the following format.
@uniqueID free-form description and/or metadata
ACGTACGTACGTACGT
+
BBBBBBBBBBBBBBBBThe first line contains an ID and metadata, the second line contains DNA sequence (this is what the program filters on), the third line is stupid and can be ignored, and the fourth line contains ASCII-encoded quality values corresponding to each nucleotide in the DNA. Here are 3 entries from a real file.
@ERR899705.3060 HWI-D00574:77:C6A74ANXX:5:1101:2907:2432/1
GCCTTCCAGAAAAATGTTCCCAGTTCATTGATGGCAAACTCAAACAGGTTTGAAAAAAAATCTCAATATTAAATATGGATATGATGCATCAGTTCAAGGTGATTTCTGGTTAGCTGTGTGCTGACA
+
BCCCCGGGEGGGGGGGGGGGGGGGGGFGGGGGGGGGGGEGGGGGGGGE1>FCFGF1FFCEG@@0FGGGGGCEDGCG>DCGGGGG=0D=GG
@ERR899705.3061 HWI-D00574:77:C6A74ANXX:5:1101:2869:2437/1
CAAATAAATAAATAAATATTTAGTTACTGAAATAATAATGTTGATTGGTAAAATAAGAATGTATTCTCTCTAAAAACTAGTTAGTTTGTACTAAAAAATGGGCATAATTATTATCTCAAAGATTAG
+
BBBBBEGGGGGGGGGGGGGGGGGGGGGGD@GGGGGGGGEGC@FGFG>F:CGGGGGGFGGGCFCBDC11:F>FGGGGBEGGGGGGGGGC@F@FGGGGGG>000F@FGGGGGGEGGG00B0CFGGE>0
@ERR899705.3062 HWI-D00574:77:C6A74ANXX:5:1101:2786:2438/1
TTGGAAAACATTTATTACATTTTCTTCTAATTTCTTAGAATCATTTACTTGTAATATTTAGATACTCGGCGATCCTTTCAGGTATTCACTGCATATCCCCAAATATTGCATCAAACTCTCCCACGT
+
AABB0111@F111F111=1;@@;111E1?111:111=1=1111F/////0/1:C1:11=0:=0;@0>=0000==8>00000=8000<0000008000;0When a DNA sequencing machine cannot reliably determine the molecule at a given position, it will use an
N character instead of an ACGT character. Also, programs for analyzing DNA sequences will sometimes use other non-ACGT characters (IUPAC notation) to represent uncertainty about the content of DNA Solution
Well one simple change will make it do lines in groups of four.
We can also improve the test for a specific character. Note I am changing the return type to
That will work. But its a bit slow if you get a lot of 'T'. We can optimize that if you want with a lookup table;
Bit of a micro optimization but if your files are huge. Now that @Edward has published an answer I like his switch statement better.
Prefer '\n' over
Your C++ code will run slower than C code because the C++ streams are tied to the C stream and try and keep them in sync. If you don't care about the C streams you can improve performance by unbinding them.
Also because you are reading from the standard input this can cause slow downs (compared to reading from a file). Because the input needs to force the output to flush before getting reading input (so you see any messages that you should respond too). Since you don't want to wait for user input you can disable this behavior. In your case it probably will not make any difference but its worth adding just to be pedantic:
Questions
How should I format the if statement in the non_canon function? Everything I've tried looks hideous.
Don't use and if statement. My code above just returns the expression.
I'm using the ternary operator to conditionally assign the output stream. My typical pattern in Python is to assign, check, and then reassign if needed, but std::ostream references cannot be reassigned in C++. Is there a clearer way of doing this?
I think your code looks fine. But you don't need an intermediate variable.
Or you can put it in a function.
while (getline(std::cin, defline)
&& getline(std::cin, seq)
&& getline(std::cin, buffer)
&& getline(std::cin, qual)
)
{
// Only enter the loop if you correctly read all four lines.
}We can also improve the test for a specific character. Note I am changing the return type to
bool.bool canon(char bp)
{
bp = toupper(bp);
return bp == 'A' || bp == 'C' || bp == 'G' || bp == 'T';
}That will work. But its a bit slow if you get a lot of 'T'. We can optimize that if you want with a lookup table;
bool canon(unsigned char bp)
{
static bool const is_cannon[] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, // A C G
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // T
0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, // a c g
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // t
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
return is_cannon[bp];
}Bit of a micro optimization but if your files are huge. Now that @Edward has published an answer I like his switch statement better.
Prefer '\n' over
std::endl. It (the endl) will only slow your code down. If forces a flush and this flush is unnecessary. The streams will automatically flush when they need to.Your C++ code will run slower than C code because the C++ streams are tied to the C stream and try and keep them in sync. If you don't care about the C streams you can improve performance by unbinding them.
std::ios::sync_with_stdio(false);Also because you are reading from the standard input this can cause slow downs (compared to reading from a file). Because the input needs to force the output to flush before getting reading input (so you see any messages that you should respond too). Since you don't want to wait for user input you can disable this behavior. In your case it probably will not make any difference but its worth adding just to be pedantic:
std::cin.tie(nullptr);Questions
How should I format the if statement in the non_canon function? Everything I've tried looks hideous.
Don't use and if statement. My code above just returns the expression.
I'm using the ternary operator to conditionally assign the output stream. My typical pattern in Python is to assign, check, and then reassign if needed, but std::ostream references cannot be reassigned in C++. Is there a clearer way of doing this?
I think your code looks fine. But you don't need an intermediate variable.
(non_acgt ? std::cerr : std::cout) << defline << '\n' << seq << "\n+\n" << qual << "\n";Or you can put it in a function.
inline std::ostream& getOutputStream(bool non_acgt) {
return non_acgt ? std::cerr : std::cout;
}
getOutputStream(non_acgt) << defline << '\n' << seq << "\n+\n" << qual << "\n";Code Snippets
while (getline(std::cin, defline)
&& getline(std::cin, seq)
&& getline(std::cin, buffer)
&& getline(std::cin, qual)
)
{
// Only enter the loop if you correctly read all four lines.
}bool canon(char bp)
{
bp = toupper(bp);
return bp == 'A' || bp == 'C' || bp == 'G' || bp == 'T';
}bool canon(unsigned char bp)
{
static bool const is_cannon[] = {
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, // A C G
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // T
0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, // a c g
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // t
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
return is_cannon[bp];
}std::ios::sync_with_stdio(false);std::cin.tie(nullptr);Context
StackExchange Code Review Q#161898, answer score: 11
Revisions (0)
No revisions yet.