Categorizing gene sequences read from a CSV file
Problem
I am relatively new to programming and would love to get some feedback on the following section of my code.
class Gene:
    def __init__(self, gene_symbol, gene_id):
        # gene_symbol represents the abbreviated name of the gene (string)
        # Example: 'RHO' (short for 'Rhodopsin')
        self.gene_symbol = gene_symbol
        # gene_id represents an accession ID for the gene (string)
        # Example: 'NM_000539.3'
        self.gene_id = gene_id
        # is_valid_ functions check if the type of the gene_id is valid
        # and return True or False
        # Assume that the types of the gene_id are mutually exclusive
        if is_valid_refseq(gene_id):
            # Example of a valid refseq: 'NM_000539.3'
            self.gene_id_type = REFSEQ
        elif is_valid_ensembl_gene(gene_id):
            # Example of a valid ensembl_gene: 'ENSG00000163914'
            self.gene_id_type = ENSEMBL_GENE
        elif is_valid_ensembl_transcript(gene_id):
            # Example of a valid ensembl_transcript: 'ENST00000296271'
            self.gene_id_type = ENSEMBL_TRANSCRIPT
        else:
            raise InvalidGeneIDError("Invalid gene_id: {}".format(gene_id))
import csv

refseqs = []
ensembl_genes = []
ensembl_transcripts = []
with open(csv_gene_list, 'r', newline='') as csv_input:
    reader = csv.reader(csv_input, delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        row_gene_symbol = row[0]
        row_gene_id = row[1]
        row_gene = Gene(row_gene_symbol, row_gene_id)
        if row_gene.gene_id_type == REFSEQ:
            refseqs.append(row_gene)
        elif row_gene.gene_id_type == ENSEMBL_GENE:
            ensembl_genes.append(row_gene)
        elif row_gene.gene_id_type == ENSEMBL_TRANSCRIPT:
            ensembl_transcripts.append(row_gene)
        else:
            # What do I do here?
            raise AssertionError('Unrecognized gene_id_type: {}'.format(
                row_gene.gene_id_type))
Assume gene_symbol and
Solution
Yeah, the problem is the switch: you have the same set of if-elif-else conditions duplicated in two places. Chances are that as your program grows, you will add more replicas of these chains. The trouble is that if you later need to add one more case in the middle, you will have to modify every replica accordingly. This is error-prone, as you might not remember all the places where you replicated the chains.
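To make the hazard concrete, here is a minimal sketch. The prefix checks are crude stand-ins for the real is_valid_ functions, and NEW_TYPE with its "NEW" prefix is an invented fourth type, not something from the question. The first chain has been updated for the new type; the replica has not.

```python
# Hypothetical type tags; NEW_TYPE is an invented type added "later".
REFSEQ = "refseq"
ENSEMBL_GENE = "ensembl_gene"
ENSEMBL_TRANSCRIPT = "ensembl_transcript"
NEW_TYPE = "new_type"


def classify(gene_id):
    """First chain: already updated to know about NEW_TYPE."""
    # Prefix checks are simplified assumptions, not real validators.
    if gene_id.startswith("NM_"):
        return REFSEQ
    elif gene_id.startswith("ENSG"):
        return ENSEMBL_GENE
    elif gene_id.startswith("ENST"):
        return ENSEMBL_TRANSCRIPT
    elif gene_id.startswith("NEW"):
        return NEW_TYPE
    raise ValueError("Invalid gene_id: {}".format(gene_id))


def bucket_name(gene_id_type):
    """Second chain: a replica that nobody remembered to update."""
    if gene_id_type == REFSEQ:
        return "refseqs"
    elif gene_id_type == ENSEMBL_GENE:
        return "ensembl_genes"
    elif gene_id_type == ENSEMBL_TRANSCRIPT:
        return "ensembl_transcripts"
    raise AssertionError(
        "Unrecognized gene_id_type: {}".format(gene_id_type))


def try_bucket(gene_id):
    """Return the bucket name, or the error message if the chains disagree."""
    try:
        return bucket_name(classify(gene_id))
    except AssertionError as exc:
        return "error: {}".format(exc)


print(try_bucket("NM_000539.3"))
print(try_bucket("NEW123"))
```

The second call fails only at run time, and only when an ID of the new type actually shows up in the input.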
How can we deal with that better? Can we replace these chains with something else, in a way that if we add a new case in the Gene constructor, we can be free from worrying about the rest of the program? At least in this example, there is a workable solution. Consider these 3 lists:
refseqs = []
ensembl_genes = []
ensembl_transcripts = []
These lists are an important part of the problem. They mirror the 3 currently supported types, and when you put items in these lists, you need to refer to these lists by these names exactly.
Instead of these lists, you could use a dictionary of lists, where the keys are the gene types, and the values are lists. Something like this:
genes = dict()
with open(csv_gene_list, 'r', newline='') as csv_input:
    reader = csv.reader(csv_input, delimiter=',')
    next(reader)
    for row in reader:
        row_gene_symbol = row[0]
        row_gene_id = row[1]
        row_gene = Gene(row_gene_symbol, row_gene_id)
        # Key on the type that the Gene constructor computed
        if row_gene.gene_id_type not in genes:
            genes[row_gene.gene_id_type] = []
        genes[row_gene.gene_id_type].append(row_gene)
        # ...
In this form the chain is gone, and the second part of your question (what to do in the final else branch) is naturally gone too.
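For a version you can run end to end, here is a minimal sketch of the same grouping. Several things are assumptions made for illustration: the string values of the type tags, the simplified regex patterns standing in for the is_valid_ functions, the group_genes helper, and the in-memory CSV. It also takes one liberty beyond the answer above by driving the constructor from a table (ID_PATTERNS) rather than an if-elif chain, so a new type would be added in one place only.

```python
import csv
import io
import re

# Hypothetical type tags; the question does not show how they are defined.
REFSEQ = "refseq"
ENSEMBL_GENE = "ensembl_gene"
ENSEMBL_TRANSCRIPT = "ensembl_transcript"

# Simplified stand-in patterns (assumptions, not official format definitions).
ID_PATTERNS = [
    (REFSEQ, re.compile(r"[A-Z]{2}_\d+\.\d+")),
    (ENSEMBL_GENE, re.compile(r"ENSG\d{11}")),
    (ENSEMBL_TRANSCRIPT, re.compile(r"ENST\d{11}")),
]


class InvalidGeneIDError(ValueError):
    pass


class Gene:
    def __init__(self, gene_symbol, gene_id):
        self.gene_symbol = gene_symbol
        self.gene_id = gene_id
        # The first pattern that matches the whole ID decides the type.
        for id_type, pattern in ID_PATTERNS:
            if pattern.fullmatch(gene_id):
                self.gene_id_type = id_type
                break
        else:
            raise InvalidGeneIDError("Invalid gene_id: {}".format(gene_id))


def group_genes(csv_file):
    """Group Gene objects by gene_id_type without naming any type."""
    genes = {}
    reader = csv.reader(csv_file, delimiter=",")
    next(reader)  # skip the header row
    for row in reader:
        gene = Gene(row[0], row[1])
        if gene.gene_id_type not in genes:
            genes[gene.gene_id_type] = []
        genes[gene.gene_id_type].append(gene)
    return genes


# In-memory stand-in for the CSV file, using the example IDs from the question.
sample = io.StringIO(
    "symbol,id\n"
    "RHO,NM_000539.3\n"
    "RHO,ENSG00000163914\n"
    "RHO,ENST00000296271\n"
)
genes = group_genes(sample)
for id_type, members in genes.items():
    print(id_type, [g.gene_id for g in members])
```

Code that previously used refseqs would read genes[REFSEQ] instead (or genes.get(REFSEQ, []) when that type may be absent from the file).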
As @jaime pointed out in a comment, this can be greatly simplified:
if row_gene.gene_id_type not in genes:
    genes[row_gene.gene_id_type] = []
genes[row_gene.gene_id_type].append(row_gene)
Using setdefault like this:
genes.setdefault(row_gene.gene_id_type, []).append(row_gene)
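A closely related option, not mentioned in the original answer, is collections.defaultdict from the standard library, which creates the empty list on first access. A minimal sketch, with made-up (type, symbol) pairs standing in for the Gene objects:

```python
from collections import defaultdict

# Illustrative placeholder rows: (gene_id_type, gene_symbol).
rows = [
    ("refseq", "GENE_A"),
    ("ensembl_gene", "GENE_A"),
    ("refseq", "GENE_B"),
]

genes = defaultdict(list)
for gene_id_type, gene_symbol in rows:
    # A missing key is created with an empty list automatically.
    genes[gene_id_type].append(gene_symbol)

print(dict(genes))
```

One difference worth knowing: merely reading genes[some_key] on a defaultdict inserts an empty list for that key, whereas a plain dict with setdefault only creates entries where you explicitly ask for them.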
Context
StackExchange Code Review Q#112942, answer score: 4