snippetpythonMinor
Impyla parse user args into SQL query
Viewed 0 times
sqluserintoqueryparseargsimpyla
Problem
I feel like there must be a better way to do the following. I would really appreciate any feedback/pointers so that I can improve my code and skills (or lack thereof!).
The goal is to allow users to query our imapala database using a python script, so that they don't have to learn SQL.
I'd like to take the following dictionary of user arguments (as an example):
And parse them as follows:
-
If a value is 'all', no statement is added to the query
-
If a value ends in "%", create a statement as
-
If a value doesn't end in "%", create a statement as
The desired output should look like (this is just the beginning of the query):
I came up with the following code, to separate the wildcards from the non-wildcards and create statements:
```
# separate wildcards from args
def find_wildcards(x):
user_args = []
wildcards = []
# separate user args by comma
arg_list = x.replace("'", "").split(',')
for arg in arg_list:
if arg.endswith('%'):
wildcards.append(arg)
else:
user_args.append(arg)
return user_args, wildcards
# function to turn arg_list and wild_list into statements
def process_args(arg_list, wild_list, name_arg, arg_db):
# if there is only one user arg
if (len(arg_list) == 1 a
The goal is to allow users to query our imapala database using a python script, so that they don't have to learn SQL.
I'd like to take the following dictionary of user arguments (as an example):
args = {
'members':'all',
'subject_id':'all',
'genes':'BRCA1,TRT1,SON%,HOX%',
'gt':'hom',
'chrom':'8',
'search_db':'wgs.comgen_variant',
'search_db_cols':'all',
'annot_db':'ref.ensembl_genes',
'annot_db_cols':'gene_name',
}And parse them as follows:
-
If a value is 'all', no statement is added to the query
-
If a value ends in "%", create a statement as
AND db_name.table_name.column LIKE "value%"-
If a value doesn't end in "%", create a statement as
AND db_name.table_name.column = "value"The desired output should look like (this is just the beginning of the query):
SELECT wgs.comgen_variant.*, ref.ensembl_genes.gene_name
FROM wgs.comgen_variant,ref.ensembl_genes
WHERE (ref.ensembl_genes.gene_name = 'BRCA1'
OR ref.ensembl_genes.gene_name = 'TRT1'
OR ref.ensembl_genes.gene_name LIKE 'SON%'
OR ref.ensembl_genes.gene_name LIKE 'HOX%')
AND wgs.comgen_variant.chrom = "8"
AND wgs.comgen_variant.gt = "hom"I came up with the following code, to separate the wildcards from the non-wildcards and create statements:
```
# separate wildcards from args
def find_wildcards(x):
user_args = []
wildcards = []
# separate user args by comma
arg_list = x.replace("'", "").split(',')
for arg in arg_list:
if arg.endswith('%'):
wildcards.append(arg)
else:
user_args.append(arg)
return user_args, wildcards
# function to turn arg_list and wild_list into statements
def process_args(arg_list, wild_list, name_arg, arg_db):
# if there is only one user arg
if (len(arg_list) == 1 a
Solution
First of all, I apologize if this is incomplete - I'm working from my laptop and I tend to code worse with fewer and smaller screens.
That being said, there is one huge issue I see with your code
Don't trust users. Ever
I don't see anything in your code that would prevent a SQL injection attack - for example, if I were to supply you with arguments like
you'd probably have a bad time. I'm sure that there are much more subtle ways to achieve similarly nasty results, but this is something to keep in mind. I don't have a good suggestion on how to prevent SQL injection - I strongly prefer using libraries where they handle that and I just have to call their methods properly. If possible I'd work on leveraging existing code and techniques to avoid this problem.
Other stuff
Naming
In your
Documentation
It will greatly help you, and any future contributors, if you include documentation on what each function is supposed to do - I recommend adding a docstring to every function, as well as a module level one.
String formatting
You use a weird mixture of string concatenation and
into
Note that you also don't need to specify
Don't make unnecessary intermediate lists
Sometimes you use a list comprehension (which is good) and then map a function to that list. This is okay, but it could be better. Consider
This could be easily rewritten as
Some people might say that's hard to read - I'm not one of them, but I get the point. Thank god for indentation.
Something else you could do - the form
There looks like a lot of repetition here. You could probably reduce that by being clever with function parameters and string formatting.
Finally, this section is pretty hard to understand. You wouldn't go wrong with some comments explaining why you're doing the things you do.
Simply booleans
Instead of
just do
Simplify and explain
This function is also hard to read and follow. Adding comments and documentation would go a long way to help with this
Make a
At the end you have quite a few steps required to actually get the query you want - it would be much easier to write a wrapper function that puts all of that work together and outputs the final query.
That being said, there is one huge issue I see with your code
Don't trust users. Ever
I don't see anything in your code that would prevent a SQL injection attack - for example, if I were to supply you with arguments like
args = {
'members': 'all',
'subject_id': 'all',
'genes': 'BRCA1,TRT1,SON%,HOX%',
'gt': 'hom',
'chrom': '8',
'search_db': 'wgs.comgen_variant; BEGIN TRANSACTION badness; DROP TABLE IMPORTANT_TABLE; COMMIT TRANSACTION badness;',
'search_db_cols': 'all',
'annot_db': 'ref.ensembl_genes',
'annot_db_cols': 'gene_name'
}you'd probably have a bad time. I'm sure that there are much more subtle ways to achieve similarly nasty results, but this is something to keep in mind. I don't have a good suggestion on how to prevent SQL injection - I strongly prefer using libraries where they handle that and I just have to call their methods properly. If possible I'd work on leveraging existing code and techniques to avoid this problem.
Other stuff
Naming
In your
find_wildcards function you name your input parameter x which tells me absolutely nothing about what it is or what it should be. Consider a more useful name.Documentation
It will greatly help you, and any future contributors, if you include documentation on what each function is supposed to do - I recommend adding a docstring to every function, as well as a module level one.
String formatting
You use a weird mixture of string concatenation and
string.format() to build your strings - pick one (and it better be formatting) and stick with it. For example, make query_args = 'AND {0}.{1} = '.format(arg_db, name_arg) + "'" + ", ".join(str(e) for e in arg_list) + "'"into
query_args = "AND {}.{} = '{}'".format(arg_db, name_arg, ', '.join(str(e) for e in arg_list))Note that you also don't need to specify
{index} - I generally find it just adds noise unless you want the same value inserted in multiple places.Don't make unnecessary intermediate lists
Sometimes you use a list comprehension (which is good) and then map a function to that list. This is okay, but it could be better. Consider
conditions = ['{0}.{1} = '.format(arg_db, name_arg) + "'" + s + "'" for s in arg_list]
query_args = "AND (" + " OR ".join(map(str, conditions)) + ")"This could be easily rewritten as
query_args = "AND ({})".format(" OR ".join("{}.{} = '{}'".format(arg_db, name_arg, s) for s in arg_list))Some people might say that's hard to read - I'm not one of them, but I get the point. Thank god for indentation.
query_args = "AND ({})".format(" OR ".join(
"{}.{} = '{}'".format(arg_db, name_arg, s)
for s in arg_list))Something else you could do - the form
"{}.{}".format(arg_db, name_arg) is everywhere. Instead of repeating that formatting everywhere, save that as a local variable and insert it as needed.There looks like a lot of repetition here. You could probably reduce that by being clever with function parameters and string formatting.
Finally, this section is pretty hard to understand. You wouldn't go wrong with some comments explaining why you're doing the things you do.
Simply booleans
Instead of
if len(query_arg) > 0:just do
if query_arg:Simplify and explain
get_colsThis function is also hard to read and follow. Adding comments and documentation would go a long way to help with this
Make a
main functionAt the end you have quite a few steps required to actually get the query you want - it would be much easier to write a wrapper function that puts all of that work together and outputs the final query.
Code Snippets
args = {
'members': 'all',
'subject_id': 'all',
'genes': 'BRCA1,TRT1,SON%,HOX%',
'gt': 'hom',
'chrom': '8',
'search_db': 'wgs.comgen_variant; BEGIN TRANSACTION badness; DROP TABLE IMPORTANT_TABLE; COMMIT TRANSACTION badness;',
'search_db_cols': 'all',
'annot_db': 'ref.ensembl_genes',
'annot_db_cols': 'gene_name'
}query_args = 'AND {0}.{1} = '.format(arg_db, name_arg) + "'" + ", ".join(str(e) for e in arg_list) + "'"query_args = "AND {}.{} = '{}'".format(arg_db, name_arg, ', '.join(str(e) for e in arg_list))conditions = ['{0}.{1} = '.format(arg_db, name_arg) + "'" + s + "'" for s in arg_list]
query_args = "AND (" + " OR ".join(map(str, conditions)) + ")"query_args = "AND ({})".format(" OR ".join("{}.{} = '{}'".format(arg_db, name_arg, s) for s in arg_list))Context
StackExchange Code Review Q#97981, answer score: 2
Revisions (0)
No revisions yet.