patternpythonMinor
csvshuf: a tool to shuffle CSV columns written in Python
Viewed 0 times
writtencolumnscsvshufflepythoncsvshuftool
Problem
Edit: After reading Gareth's answer I have pushed an updated version of the code to github.
I needed to shuffle cells from specific columns in several CSV files. One of the requirements was to be able to do derangement (shuffling that leaves no element in its original place), I read about Sattolo's algorithm.
So I started writing
Code:
Usage:
```
csvshuf -c1 foobar.csv
(shuffles the first column of each row of foobar.csv using Python's shuffle())
svshuf -c2 -k foobar.csv
(shuffles the second column of each row using Knuth-Fischer-Yeats algorithm.)
svshuf -c3 -s foobar.csv
(shuffles the third column of each row using Sattolo's algorithm.)
csvshuf foobar.csv
(shuffles all the columns of foobar.csv)
csvshuf -C1 foobar.csv
(shuffles all the columns but the first of foobar.csv)
head -10 foobar.csv | csvshuf -c 1,3
(shuffles the first and third columns of the first ten lines of foobar.csv)
csvshuf -c1,3 -d "|" foobar.csv
(shuffles the first and third columns of the pipe-delimited foobar.csv)
csvshuf -c 1,3 -t foobar.csv
(shuffles the first and third columns of the tab-delimited foobar.csv if present, the -d option will be ignored.)
csvshuf -c 1,2,3 -
I needed to shuffle cells from specific columns in several CSV files. One of the requirements was to be able to do derangement (shuffling that leaves no element in its original place), I read about Sattolo's algorithm.
So I started writing
csvshuf this afternoon. Code was originally based in csvcut. I tried to make it good but I am sure it can be improved as I have little experience programming in Python. I would be grateful if I get some code review.Code:
import csv
import sys
import random
import getopt
# From https://softwareengineering.stackexchange.com/q/218255/149749
def shuffle_kfy(items):
i = len(items) - 1
while i > 0:
j = random.randrange(i + 1) # 0 1:
i = i - 1
j = random.randrange(i) # 0 len(headers):
print('Invalid column {}. Last column is {}').format(c, len(headers))
exit(1)
table[c - 1] = shuffle(table[c - 1], search_mode)
table = zip(*table)
writer = csv.writer(sys.stdout, delimiter=output_delimiter)
writer.writerow(headers)
for row in table:
writer.writerow(row)Usage:
```
csvshuf -c1 foobar.csv
(shuffles the first column of each row of foobar.csv using Python's shuffle())
svshuf -c2 -k foobar.csv
(shuffles the second column of each row using Knuth-Fischer-Yeats algorithm.)
svshuf -c3 -s foobar.csv
(shuffles the third column of each row using Sattolo's algorithm.)
csvshuf foobar.csv
(shuffles all the columns of foobar.csv)
csvshuf -C1 foobar.csv
(shuffles all the columns but the first of foobar.csv)
head -10 foobar.csv | csvshuf -c 1,3
(shuffles the first and third columns of the first ten lines of foobar.csv)
csvshuf -c1,3 -d "|" foobar.csv
(shuffles the first and third columns of the pipe-delimited foobar.csv)
csvshuf -c 1,3 -t foobar.csv
(shuffles the first and third columns of the tab-delimited foobar.csv if present, the -d option will be ignored.)
csvshuf -c 1,2,3 -
Solution
-
There are no docstrings. It's hard to review code when we don't know what it is supposed to do.
-
The various
-
The shuffle functions can be simplified using
Note that I've cached
-
The
-
Using a string argument to select a mode of operation is error-prone. If you mistyped the string, for example
-
The
-
Having code at the top level of a module makes it hard to test — for example, you can't test the
-
The code processes the command-line options by transforming them into a dictionary and then looking up each valid argument. But this means that there are no error messages for invalid arguments. If you look at the
-
The name
-
Error messages should go to standard error, not standard output:
This is particularly important here, because this program writes its results to standard output, and so probably the user is redirecting standard output to a file. If you write error messages to standard output, they will go to the file and the user will not see them.
-
It's a better idea to use the
```
def column_list(string):
"""Validate and convert comma-separated list of column numbers."""
try:
columns = list(map(int, string.split(',')))
except ValueError as e:
raise argparse.ArgumentTypeError(*e.args)
for column in columns:
if column < 1:
raise argparse.ArgumentTypeError(
"Invalid column {!r}: column numbers start at 1."
.format(column))
return columns
def main():
parser = argparse.ArgumentParser(
description="Shuffle columns in a CSV file")
parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
default=sys.stdin, help='Input CSV file')
parser.add_argument('-s', '--sattolo',
action='store_const', const=shuffle_sattolo,
dest='shuffle', default=random.shuffle,
help="Use Sattolo's shuffle.")
col_group = parser.add_mutually_exclusive_group()
col_group.add_argument('-c', '--columns', type=column_list,
help="Comma-separated list of columns to include.")
col_group.add_argument('-C', '--no-columns', type=column_list,
help="Comma-separated list of columns to exclude.")
delim_group = parser.add_mutually_exclusive_group()
delim_group.add_argument('-d', '--delimiter', type=str, default=',',
help="Input column delimiter.")
delim_group.add_argument('-t', '--tabbed', dest='delimiter',
action='store_const', const='\t',
help="Delimit input with tabs.")
parser.add_argument('-q', '--quotechar', type=str, default='"',
help="Quote character.")
parser.add_argument('-o', '--output-delimiter', type=str, default=',',
help="Output column delimiter.")
args = parser.parse_args()
reader = csv.reader(args.infile,
There are no docstrings. It's hard to review code when we don't know what it is supposed to do.
-
The various
shuffle functions both modify their items argument and return it. It would be clearer if these functions either modified their argument without returning it (like random.shuffle) or returned a new list without modifying the original (like random.sample). Doing both is redundant and confusing.-
The shuffle functions can be simplified using
reversed and range:def shuffle_sattolo(items):
"""Shuffle items in place using Sattolo's algorithm."""
_randrange = random.randrange
for i in reversed(range(1, len(items))):
j = _randrange(i) # 0 <= j < i
items[j], items[i] = items[i], items[j]Note that I've cached
random.randrange in a local variable. This is to avoid having to look it up again each time around the loop.-
The
shuffle_kfy function is redundant because random.shuffle already implements the Fisher–Yates shuffle.-
Using a string argument to select a mode of operation is error-prone. If you mistyped the string, for example
shuffle(items, 'satolo'), then you would not get an error, it would just silently do the wrong thing.-
The
shuffle function is redundant. Instead of assigning a search mode while parsing the arguments, assign a shuffle function:shuffle = random.shuffle
# ...
if '-k' in opts:
shuffle = shuffle_kfy
elif '-s' in opts:
shuffle = shuffle_sattolo-
Having code at the top level of a module makes it hard to test — for example, you can't test the
shuffle function by itself, because you can't import the module without running all the top-level code. It is a good idea to put top-level code into a main function and guard it with if __name__ == '__main__':.-
The code processes the command-line options by transforming them into a dictionary and then looking up each valid argument. But this means that there are no error messages for invalid arguments. If you look at the
getopt examples in the manual, you'll see that they do:for o, a in opts:
if o == '-v':
# process option
else:
# report error-
The name
i is normally used for index variables, so it's misleading to use it for an input file.-
Error messages should go to standard error, not standard output:
sys.stderr.write("Invalid column 0. Columns are 1-based.\n")This is particularly important here, because this program writes its results to standard output, and so probably the user is redirecting standard output to a file. If you write error messages to standard output, they will go to the file and the user will not see them.
-
It's a better idea to use the
argparse module for parsing command-line arguments. With argparse the code is more explicit, it automatically emits error messages for invalid arguments, it has various types of built-in argument validation and conversion, and it has built-in support for the --help option. Something like this:```
def column_list(string):
"""Validate and convert comma-separated list of column numbers."""
try:
columns = list(map(int, string.split(',')))
except ValueError as e:
raise argparse.ArgumentTypeError(*e.args)
for column in columns:
if column < 1:
raise argparse.ArgumentTypeError(
"Invalid column {!r}: column numbers start at 1."
.format(column))
return columns
def main():
parser = argparse.ArgumentParser(
description="Shuffle columns in a CSV file")
parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
default=sys.stdin, help='Input CSV file')
parser.add_argument('-s', '--sattolo',
action='store_const', const=shuffle_sattolo,
dest='shuffle', default=random.shuffle,
help="Use Sattolo's shuffle.")
col_group = parser.add_mutually_exclusive_group()
col_group.add_argument('-c', '--columns', type=column_list,
help="Comma-separated list of columns to include.")
col_group.add_argument('-C', '--no-columns', type=column_list,
help="Comma-separated list of columns to exclude.")
delim_group = parser.add_mutually_exclusive_group()
delim_group.add_argument('-d', '--delimiter', type=str, default=',',
help="Input column delimiter.")
delim_group.add_argument('-t', '--tabbed', dest='delimiter',
action='store_const', const='\t',
help="Delimit input with tabs.")
parser.add_argument('-q', '--quotechar', type=str, default='"',
help="Quote character.")
parser.add_argument('-o', '--output-delimiter', type=str, default=',',
help="Output column delimiter.")
args = parser.parse_args()
reader = csv.reader(args.infile,
Code Snippets
def shuffle_sattolo(items):
"""Shuffle items in place using Sattolo's algorithm."""
_randrange = random.randrange
for i in reversed(range(1, len(items))):
j = _randrange(i) # 0 <= j < i
items[j], items[i] = items[i], items[j]shuffle = random.shuffle
# ...
if '-k' in opts:
shuffle = shuffle_kfy
elif '-s' in opts:
shuffle = shuffle_sattolofor o, a in opts:
if o == '-v':
# process option
else:
# report errorsys.stderr.write("Invalid column 0. Columns are 1-based.\n")def column_list(string):
"""Validate and convert comma-separated list of column numbers."""
try:
columns = list(map(int, string.split(',')))
except ValueError as e:
raise argparse.ArgumentTypeError(*e.args)
for column in columns:
if column < 1:
raise argparse.ArgumentTypeError(
"Invalid column {!r}: column numbers start at 1."
.format(column))
return columns
def main():
parser = argparse.ArgumentParser(
description="Shuffle columns in a CSV file")
parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
default=sys.stdin, help='Input CSV file')
parser.add_argument('-s', '--sattolo',
action='store_const', const=shuffle_sattolo,
dest='shuffle', default=random.shuffle,
help="Use Sattolo's shuffle.")
col_group = parser.add_mutually_exclusive_group()
col_group.add_argument('-c', '--columns', type=column_list,
help="Comma-separated list of columns to include.")
col_group.add_argument('-C', '--no-columns', type=column_list,
help="Comma-separated list of columns to exclude.")
delim_group = parser.add_mutually_exclusive_group()
delim_group.add_argument('-d', '--delimiter', type=str, default=',',
help="Input column delimiter.")
delim_group.add_argument('-t', '--tabbed', dest='delimiter',
action='store_const', const='\t',
help="Delimit input with tabs.")
parser.add_argument('-q', '--quotechar', type=str, default='"',
help="Quote character.")
parser.add_argument('-o', '--output-delimiter', type=str, default=',',
help="Output column delimiter.")
args = parser.parse_args()
reader = csv.reader(args.infile, delimiter=args.delimiter,
quotechar=args.quotechar)
# ... and so on ...Context
StackExchange Code Review Q#129806, answer score: 8
Revisions (0)
No revisions yet.