Convert impute2 files to mach format

Submitted by: @import:stackexchange-codereview

Problem

Here is a program for converting Impute2 files into MaCH format (related to genetics).

Source files include one xxx_haps file and one xxx_samples file, for example:

// chromosome21.gen_haps
--- rs62224609 16051249 T C 0 0 0 0 0 0
--- rs62224610 16051347 G C 0 1 0 1 1 0
--- rs143503259 16051453 A C 0 0 0 0 0 0
--- rs192339082 16051477 C A 1 0 0 0 0 0
--- rs191139082 16052477 ACCTT A 1 0 0 0 0 0
--- rs191139082 16052478 G GCCGG 0 1 0 0 0 0

// chromosome21.gen_samples
ID_1 ID_2 missing father mother sex plink_pheno
0 0 0 D D D B
RS3 10051 0 0 0 2 -9
RS3 10052 0 0 0 2 -9
RS3 10068 0 0 0 2 -9


The output should be two MaCH files, one xxx.mach.out and one xxx.data.dat. For example, the input files above should be converted to:

// chromosome21.mach.out
RS3->10051 HAPL01 TGAADR
RS3->10051 HAPL02 TCACRI
RS3->10052 HAPL01 TGACRR
RS3->10052 HAPL02 TCACRR
RS3->10068 HAPL01 TCACRR
RS3->10068 HAPL02 TGACRR

// chromosome21.data.dat
M   rs62224609
M   rs62224610
M   rs143503259
M   rs192339082
M   rs191139082
M   rs191139082


Java code:

```
import com.beust.jcommander.JCommander;
import com.google.common.base.Joiner;

import java.io.*;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) throws Exception {
        ArgParser argParser = new ArgParser();
        new JCommander(argParser, args);
        if (argParser.debug) {
            System.out.printf("Debugging mode on\n%-22s%s\n%-22s%s\n",
                    "File to convert:",
                    argParser.file,
                    "Output folder:",
                    argParser.outDirString);
        }

        Pattern pattern = Pattern.compile("_haps\\.gz$");
        String fileStem = pattern.matcher(argParser.file).replaceAll("");
        System.out.println("File stem: " + fileStem);
        System.out.println("File : " + argParser.file);
        File outDir;
        File machOutFile = new File(fileStem + ".mach
```

Solution

My rule of thumb is that a method should not be more than 10 lines long. So your single method makes me weep.

My guiding principle is the "single responsibility principle" (aka "separation of concerns"). You should try to split your method into smaller methods where each method does one "logical" thing. You will probably spend a lot of time changing your mind and doing a lot of copy-paste, and that's normal. As you gain experience, you will get better.
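
As a small illustration, the file-name handling in the question's `main` could be pulled out into methods of its own. This is only a sketch: the class name `FileStems` and both method names are made up for this example. The `_haps\.gz$` pattern is taken from the question's code, and the `.mach.out` suffix from the file names in the problem statement.

```java
import java.util.regex.Pattern;

// Hypothetical helper class: each method does exactly one thing.
final class FileStems {
    // Same pattern as in the question: a trailing "_haps.gz" suffix.
    private static final Pattern HAPS_SUFFIX = Pattern.compile("_haps\\.gz$");

    private FileStems() {}

    /** Returns the path with a trailing "_haps.gz" removed, or unchanged if the suffix is absent. */
    static String stemOf(String hapsPath) {
        return HAPS_SUFFIX.matcher(hapsPath).replaceAll("");
    }

    /** Builds the name of the MaCH haplotype file from a stem. */
    static String machOutName(String stem) {
        return stem + ".mach.out";
    }
}
```

With these in place, `main` shrinks to calls such as `FileStems.machOutName(FileStems.stemOf(argParser.file))`, and each piece can be tested without touching the filesystem.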

For example, the code that reads from and writes to files should be completely separated from the code that manipulates the data. Then, if you later get the data from the network instead of from the filesystem, you only have to modify the part that does data input.
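
Here is a rough sketch of that separation for the `.data.dat` output. The class and method names (`DataDatConverter`, `readLines`, `toDataDatLines`, `writeLines`) are hypothetical, and the `M` prefix followed by three spaces is simply copied from the sample output in the question, so treat the separator as an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: input, transformation, and output live in separate methods.
final class DataDatConverter {
    private DataDatConverter() {}

    // Input only: reads every non-blank line from any Reader (file, socket, string).
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Data manipulation only: haps lines in, .data.dat lines out. No I/O here.
    static List<String> toDataDatLines(List<String> hapsLines) {
        List<String> out = new ArrayList<>();
        for (String hapsLine : hapsLines) {
            String[] fields = hapsLine.trim().split("\\s+");
            if (fields.length < 2) {
                throw new IllegalArgumentException("Not a haps line: " + hapsLine);
            }
            // Field 1 is the marker ID (e.g. rs62224609) in the question's sample.
            out.add("M   " + fields[1]);
        }
        return out;
    }

    // Output only: writes each line, newline-terminated, to any Writer.
    static void writeLines(Writer sink, List<String> lines) throws IOException {
        for (String line : lines) {
            sink.write(line);
            sink.write('\n');
        }
    }
}
```

Because `readLines` takes a `Reader` rather than a file name, the caller decides where the data comes from: a `FileReader`, an `InputStreamReader` over a gzip or network stream, or a `StringReader` in a unit test.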

Since Java is an OO language, you should probably use that too. Think carefully about what should be a class here. Maybe you could define the two kinds of files you are manipulating as classes. Those two types of files seem to contain two different data structures, which are written in different formats in the two files. You can define two new classes for those data types, and then subclasses for the way they are represented in the two different file types. You would also need to define some methods to convert between those formats.
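
One possible shape for this is sketched below. It deviates slightly from the suggestion above: instead of subclasses, it uses one format-neutral data class plus a parser for the IMPUTE2 side and a formatter for the MaCH side. All class and method names are hypothetical. The rule for multi-base alleles (`R` for the first allele, `I` or `D` for a longer or shorter second allele) is inferred from the sample output in the question, not from a MaCH specification, so treat it as an assumption. Missing data and the sample labels from the xxx_samples file are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// One variant site, independent of any file format.
final class Marker {
    final String id;
    final String allele0;
    final String allele1;

    Marker(String id, String allele0, String allele1) {
        this.id = id;
        this.allele0 = allele0;
        this.allele1 = allele1;
    }
}

// Format-neutral data: alleles.get(m)[h] is the 0/1 allele index of haplotype h at marker m.
final class HaplotypeSet {
    final List<Marker> markers;
    final List<int[]> alleles;

    HaplotypeSet(List<Marker> markers, List<int[]> alleles) {
        this.markers = markers;
        this.alleles = alleles;
    }
}

// IMPUTE2 side: knows only the haps line layout shown in the question.
final class Impute2HapsParser {
    private static final int FIRST_ALLELE_COLUMN = 5;

    static HaplotypeSet parse(List<String> hapsLines) {
        List<Marker> markers = new ArrayList<>();
        List<int[]> alleles = new ArrayList<>();
        for (String line : hapsLines) {
            // Columns: "---", marker ID, position, allele 0, allele 1, then one 0/1 per haplotype.
            String[] f = line.trim().split("\\s+");
            markers.add(new Marker(f[1], f[3], f[4]));
            int[] row = new int[f.length - FIRST_ALLELE_COLUMN];
            for (int h = 0; h < row.length; h++) {
                row[h] = Integer.parseInt(f[FIRST_ALLELE_COLUMN + h]);
            }
            alleles.add(row);
        }
        return new HaplotypeSet(markers, alleles);
    }
}

// MaCH side: knows only how to render a HaplotypeSet as haplotype strings.
final class MachFormatter {
    // Assumed encoding, inferred from the question's sample output.
    static String symbol(Marker m, int alleleIndex) {
        boolean snp = m.allele0.length() == 1 && m.allele1.length() == 1;
        if (snp) {
            return alleleIndex == 0 ? m.allele0 : m.allele1;
        }
        if (alleleIndex == 0) {
            return "R";
        }
        return m.allele1.length() > m.allele0.length() ? "I" : "D";
    }

    // Returns one string per haplotype, in the column order of the haps file.
    static List<String> haplotypeStrings(HaplotypeSet set) {
        List<String> out = new ArrayList<>();
        int haplotypeCount = set.alleles.isEmpty() ? 0 : set.alleles.get(0).length;
        for (int h = 0; h < haplotypeCount; h++) {
            StringBuilder sb = new StringBuilder();
            for (int m = 0; m < set.markers.size(); m++) {
                sb.append(symbol(set.markers.get(m), set.alleles.get(m)[h]));
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

Fed the six haps lines from the question, `MachFormatter.haplotypeStrings(Impute2HapsParser.parse(lines))` yields `TGAADR`, `TCACRI`, `TGACRR`, `TCACRR`, `TCACRR`, `TGACRR`, which matches the last column of the expected chromosome21.mach.out.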

If you do it this way, when you add a third file type, all you have to do is add some new subclasses and new conversion methods. Even if you don't plan to add a third file type, you should code as if you will, since it will help you modularize the code. Modularized code might be longer overall than non-modularized code, but it is a lot clearer to read and modify.

Context

StackExchange Code Review Q#69077, answer score: 3
