HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

Impyla parse user args into SQL query

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
sqluserintoqueryparseargsimpyla

Problem

I feel like there must be a better way to do the following. I would really appreciate any feedback/pointers so that I can improve my code and skills (or lack thereof!).

The goal is to allow users to query our imapala database using a python script, so that they don't have to learn SQL.

I'd like to take the following dictionary of user arguments (as an example):

args = {
'members':'all',
'subject_id':'all',
'genes':'BRCA1,TRT1,SON%,HOX%',
'gt':'hom',
'chrom':'8',
'search_db':'wgs.comgen_variant',
'search_db_cols':'all', 
'annot_db':'ref.ensembl_genes',
'annot_db_cols':'gene_name',
}


And parse them as follows:

-
If a value is 'all', no statement is added to the query

-
If a value ends in "%", create a statement as AND db_name.table_name.column LIKE "value%"

-
If a value doesn't end in "%", create a statement as AND db_name.table_name.column = "value"

The desired output should look like (this is just the beginning of the query):

SELECT wgs.comgen_variant.*, ref.ensembl_genes.gene_name    
FROM wgs.comgen_variant,ref.ensembl_genes 
WHERE (ref.ensembl_genes.gene_name = 'BRCA1' 
        OR ref.ensembl_genes.gene_name = 'TRT1' 
        OR ref.ensembl_genes.gene_name LIKE 'SON%' 
        OR ref.ensembl_genes.gene_name LIKE 'HOX%')
        AND wgs.comgen_variant.chrom = "8"
        AND wgs.comgen_variant.gt = "hom"


I came up with the following code, to separate the wildcards from the non-wildcards and create statements:

```
# separate wildcards from args
def find_wildcards(x):
user_args = []
wildcards = []
# separate user args by comma
arg_list = x.replace("'", "").split(',')
for arg in arg_list:
if arg.endswith('%'):
wildcards.append(arg)
else:
user_args.append(arg)
return user_args, wildcards

# function to turn arg_list and wild_list into statements
def process_args(arg_list, wild_list, name_arg, arg_db):
# if there is only one user arg
if (len(arg_list) == 1 a

Solution

First of all, I apologize if this is incomplete - I'm working from my laptop and I tend to code worse with fewer and smaller screens.

That being said, there is one huge issue I see with your code

Don't trust users. Ever

I don't see anything in your code that would prevent a SQL injection attack - for example, if I were to supply you with arguments like

args = {
    'members': 'all',
    'subject_id': 'all',
    'genes': 'BRCA1,TRT1,SON%,HOX%',
    'gt': 'hom',
    'chrom': '8',
    'search_db': 'wgs.comgen_variant; BEGIN TRANSACTION badness; DROP TABLE IMPORTANT_TABLE; COMMIT TRANSACTION badness;',
    'search_db_cols': 'all', 
    'annot_db': 'ref.ensembl_genes',
    'annot_db_cols': 'gene_name'
}


you'd probably have a bad time. I'm sure that there are much more subtle ways to achieve similarly nasty results, but this is something to keep in mind. I don't have a good suggestion on how to prevent SQL injection - I strongly prefer using libraries where they handle that and I just have to call their methods properly. If possible I'd work on leveraging existing code and techniques to avoid this problem.

Other stuff

Naming

In your find_wildcards function you name your input parameter x which tells me absolutely nothing about what it is or what it should be. Consider a more useful name.

Documentation

It will greatly help you, and any future contributors, if you include documentation on what each function is supposed to do - I recommend adding a docstring to every function, as well as a module level one.

String formatting

You use a weird mixture of string concatenation and string.format() to build your strings - pick one (and it better be formatting) and stick with it. For example, make

query_args = 'AND {0}.{1} = '.format(arg_db, name_arg) + "'" + ", ".join(str(e) for e in arg_list) + "'"


into

query_args = "AND {}.{} = '{}'".format(arg_db, name_arg, ', '.join(str(e) for e in arg_list))


Note that you also don't need to specify {index} - I generally find it just adds noise unless you want the same value inserted in multiple places.

Don't make unnecessary intermediate lists

Sometimes you use a list comprehension (which is good) and then map a function to that list. This is okay, but it could be better. Consider

conditions = ['{0}.{1} = '.format(arg_db, name_arg) + "'" + s + "'" for s in arg_list]
query_args = "AND (" + " OR ".join(map(str, conditions)) + ")"


This could be easily rewritten as

query_args = "AND ({})".format(" OR ".join("{}.{} = '{}'".format(arg_db, name_arg, s) for s in arg_list))


Some people might say that's hard to read - I'm not one of them, but I get the point. Thank god for indentation.

query_args = "AND ({})".format(" OR ".join(
    "{}.{} = '{}'".format(arg_db, name_arg, s)
    for s in arg_list))


Something else you could do - the form "{}.{}".format(arg_db, name_arg) is everywhere. Instead of repeating that formatting everywhere, save that as a local variable and insert it as needed.

There looks like a lot of repetition here. You could probably reduce that by being clever with function parameters and string formatting.

Finally, this section is pretty hard to understand. You wouldn't go wrong with some comments explaining why you're doing the things you do.

Simply booleans

Instead of

if len(query_arg) > 0:


just do

if query_arg:


Simplify and explain get_cols

This function is also hard to read and follow. Adding comments and documentation would go a long way to help with this

Make a main function

At the end you have quite a few steps required to actually get the query you want - it would be much easier to write a wrapper function that puts all of that work together and outputs the final query.

Code Snippets

args = {
    'members': 'all',
    'subject_id': 'all',
    'genes': 'BRCA1,TRT1,SON%,HOX%',
    'gt': 'hom',
    'chrom': '8',
    'search_db': 'wgs.comgen_variant; BEGIN TRANSACTION badness; DROP TABLE IMPORTANT_TABLE; COMMIT TRANSACTION badness;',
    'search_db_cols': 'all', 
    'annot_db': 'ref.ensembl_genes',
    'annot_db_cols': 'gene_name'
}
query_args = 'AND {0}.{1} = '.format(arg_db, name_arg) + "'" + ", ".join(str(e) for e in arg_list) + "'"
query_args = "AND {}.{} = '{}'".format(arg_db, name_arg, ', '.join(str(e) for e in arg_list))
conditions = ['{0}.{1} = '.format(arg_db, name_arg) + "'" + s + "'" for s in arg_list]
query_args = "AND (" + " OR ".join(map(str, conditions)) + ")"
query_args = "AND ({})".format(" OR ".join("{}.{} = '{}'".format(arg_db, name_arg, s) for s in arg_list))

Context

StackExchange Code Review Q#97981, answer score: 2

Revisions (0)

No revisions yet.