Categorizing gene sequences read from a CSV file

Submitted by: @import:stackexchange-codereview

Problem

I am relatively new to programming and would love to get some feedback on the following section of my code.

class Gene:
    def __init__(self, gene_symbol, gene_id):
        # gene_symbol represents the abbreviated name of the gene (string)
        # Example: 'RHO' (short for 'Rhodopsin')
        self.gene_symbol = gene_symbol

        # gene_id represents an accession ID for the gene (string)
        # Example: 'NM_000539.3'
        self.gene_id = gene_id

        # is_valid_ functions check if the type of the gene_id is valid
        # and returns True or False
        # Assume that the types of the gene_id are mutually exclusive
        if is_valid_refseq(gene_id):
            # Example of a valid refseq: 'NM_000539.3'
            self.gene_id_type = REFSEQ

        elif is_valid_ensembl_gene(gene_id):
            # Example of a valid ensembl_gene: 'ENSG00000163914'
            self.gene_id_type = ENSEMBL_GENE

        elif is_valid_ensembl_transcript(gene_id):
            # Example of a valid ensembl_transcript: 'ENST00000296271'
            self.gene_id_type = ENSEMBL_TRANSCRIPT

        else:
            raise InvalidGeneIDError("Invalid gene_id: {}".format(gene_id))

refseqs = []
ensembl_genes = []
ensembl_transcripts = []

with open(csv_gene_list, 'r', newline='') as csv_input:

    reader = csv.reader(csv_input, delimiter=',')
    next(reader)
    for row in reader:
        row_gene_symbol = row[0]
        row_gene_id = row[1]

        row_gene = Gene(row_gene_symbol, row_gene_id)

        if row_gene.gene_id_type == REFSEQ:
            refseqs.append(row_gene)

        elif row_gene.gene_id_type == ENSEMBL_GENE:
            ensembl_genes.append(row_gene)

        elif row_gene.gene_id_type == ENSEMBL_TRANSCRIPT:
            ensembl_transcripts.append(row_gene)

        else:
            # What do I do here?
            raise AssertionError('Unrecognized gene_id_type: {}'.format(
                    row_gene.gene_id_type))


Assume gene_symbol and

Solution

Yeah, the problem is the switch: you have the same if-elif-else chain duplicated in two places. Chances are that as your program grows, you will add more copies of this chain. The trouble is that if you later need to add one more case in the middle, you will have to modify every copy accordingly. This is error-prone, as you might not remember all the places where the chain was replicated.

How can we deal with that better? Can we replace these chains with something else, in a way that if we add a new case in the Gene constructor, we can be free from worrying about the rest of the program?

At least in this example, there is a workable solution. Consider these 3 lists:

refseqs = []
ensembl_genes = []
ensembl_transcripts = []


These lists are an important part of the problem. They mirror the currently supported 3 types, and when you put items in these lists, you need to refer to these lists by these names exactly.

Instead of these lists, you could use a dictionary of lists, where the keys are the gene types, and the values are lists. Something like this:

genes = dict()

with open(csv_gene_list, 'r', newline='') as csv_input:

    reader = csv.reader(csv_input, delimiter=',')
    next(reader)
    for row in reader:
        row_gene_symbol = row[0]
        row_gene_id = row[1]

        row_gene = Gene(row_gene_symbol, row_gene_id)

        # Group by the ID type that the Gene constructor determined
        if row_gene.gene_id_type not in genes:
            genes[row_gene.gene_id_type] = []

        genes[row_gene.gene_id_type].append(row_gene)
        # ...


In this form the chain is gone, along with its else branch, so the second part of your question naturally goes away too.
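To make the idea concrete, here is a minimal, self-contained sketch of the grouping, keyed by gene_id_type. The constant values, the regular expressions inside the is_valid_* helpers, the group_genes function name, and the sample rows are assumptions made for illustration (the question does not show them), and ValueError stands in for InvalidGeneIDError.

```python
import csv
import io
import re

# Hypothetical stand-ins for the question's constants and validators.
# The patterns are simplified assumptions, not official ID formats.
REFSEQ = "refseq"
ENSEMBL_GENE = "ensembl_gene"
ENSEMBL_TRANSCRIPT = "ensembl_transcript"


def is_valid_refseq(gene_id):
    return re.fullmatch(r"[A-Z]{2}_\d+(\.\d+)?", gene_id) is not None


def is_valid_ensembl_gene(gene_id):
    return re.fullmatch(r"ENSG\d{11}", gene_id) is not None


def is_valid_ensembl_transcript(gene_id):
    return re.fullmatch(r"ENST\d{11}", gene_id) is not None


class Gene:
    def __init__(self, gene_symbol, gene_id):
        self.gene_symbol = gene_symbol
        self.gene_id = gene_id
        # This is now the only place that enumerates the ID types.
        if is_valid_refseq(gene_id):
            self.gene_id_type = REFSEQ
        elif is_valid_ensembl_gene(gene_id):
            self.gene_id_type = ENSEMBL_GENE
        elif is_valid_ensembl_transcript(gene_id):
            self.gene_id_type = ENSEMBL_TRANSCRIPT
        else:
            raise ValueError("Invalid gene_id: {}".format(gene_id))


def group_genes(csv_input):
    """Return a dict mapping each gene_id_type to a list of Gene objects."""
    genes = {}
    reader = csv.reader(csv_input, delimiter=",")
    next(reader)  # skip the header row
    for row in reader:
        row_gene = Gene(row[0], row[1])
        if row_gene.gene_id_type not in genes:
            genes[row_gene.gene_id_type] = []
        genes[row_gene.gene_id_type].append(row_gene)
    return genes


# In-memory sample standing in for the CSV file from the question.
sample = io.StringIO(
    "gene_symbol,gene_id\n"
    "RHO,NM_000539.3\n"
    "RHO,ENSG00000163914\n"
    "RHO,ENST00000296271\n"
)
genes = group_genes(sample)

# The old list names are still available when needed.
refseqs = genes.get(REFSEQ, [])
```

Supporting a new ID type would then mean touching only the Gene constructor; the loop needs no change.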

As @jaime pointed out in a comment, this part can be greatly simplified:

if row_gene.gene_id_type not in genes:
    genes[row_gene.gene_id_type] = []

genes[row_gene.gene_id_type].append(row_gene)


by using setdefault, like this:

genes.setdefault(row_gene.gene_id_type, []).append(row_gene)
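Another standard-library option, not mentioned in the original answer, is collections.defaultdict, which creates the empty list on first access. The sketch below is only an illustration: the namedtuple is a hypothetical stand-in for the Gene class, and the sample rows and type strings are made up for the example.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the Gene objects built inside the CSV loop.
Gene = namedtuple("Gene", ["gene_symbol", "gene_id", "gene_id_type"])

row_genes = [
    Gene("RHO", "NM_000539.3", "refseq"),
    Gene("RHO", "ENSG00000163914", "ensembl_gene"),
    Gene("RHO", "ENST00000296271", "ensembl_transcript"),
]

# Missing keys are initialised with list() automatically.
genes = defaultdict(list)
for row_gene in row_genes:
    genes[row_gene.gene_id_type].append(row_gene)

# Converting to a plain dict afterwards means a later lookup of an
# unknown type raises KeyError instead of silently adding an empty list.
grouped = dict(genes)
```

Whether setdefault or defaultdict reads better is largely a matter of taste; setdefault keeps genes a plain dict throughout.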

Context

StackExchange Code Review Q#112942, answer score: 4
