patternpythonMinor

csvshuf: a tool to shuffle CSV columns written in Python

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

writtencolumnscsvshufflepythoncsvshuftool

Problem

Edit: After reading Gareth's answer I have pushed an updated version of the code to github.

I needed to shuffle cells from specific columns in several CSV files. One of the requirements was to be able to do derangement (shuffling that leaves no element in its original place), I read about Sattolo's algorithm.

So I started writing csvshuf this afternoon. Code was originally based in csvcut. I tried to make it good but I am sure it can be improved as I have little experience programming in Python. I would be grateful if I get some code review.

Code:

import csv
import sys
import random
import getopt

# From https://softwareengineering.stackexchange.com/q/218255/149749
def shuffle_kfy(items):
    i = len(items) - 1
    while i > 0:
        j = random.randrange(i + 1)  # 0  1:
        i = i - 1
        j = random.randrange(i)  # 0  len(headers):
        print('Invalid column {}. Last column is {}').format(c, len(headers))
        exit(1)
    table[c - 1] = shuffle(table[c - 1], search_mode)

table = zip(*table)

writer = csv.writer(sys.stdout, delimiter=output_delimiter)
writer.writerow(headers)
for row in table:
    writer.writerow(row)

Usage:

```
csvshuf -c1 foobar.csv
(shuffles the first column of each row of foobar.csv using Python's shuffle())

svshuf -c2 -k foobar.csv
(shuffles the second column of each row using Knuth-Fischer-Yeats algorithm.)

svshuf -c3 -s foobar.csv
(shuffles the third column of each row using Sattolo's algorithm.)

csvshuf foobar.csv
(shuffles all the columns of foobar.csv)

csvshuf -C1 foobar.csv
(shuffles all the columns but the first of foobar.csv)

head -10 foobar.csv | csvshuf -c 1,3
(shuffles the first and third columns of the first ten lines of foobar.csv)

csvshuf -c1,3 -d "|" foobar.csv
(shuffles the first and third columns of the pipe-delimited foobar.csv)

csvshuf -c 1,3 -t foobar.csv
(shuffles the first and third columns of the tab-delimited foobar.csv if present, the -d option will be ignored.)

csvshuf -c 1,2,3 -

Solution

-
There are no docstrings. It's hard to review code when we don't know what it is supposed to do.

-
The various shuffle functions both modify their items argument and return it. It would be clearer if these functions either modified their argument without returning it (like random.shuffle) or returned a new list without modifying the original (like random.sample). Doing both is redundant and confusing.

-
The shuffle functions can be simplified using reversed and range:

def shuffle_sattolo(items):
    """Shuffle items in place using Sattolo's algorithm."""
    _randrange = random.randrange
    for i in reversed(range(1, len(items))):
        j = _randrange(i)  # 0 <= j < i
        items[j], items[i] = items[i], items[j]

Note that I've cached random.randrange in a local variable. This is to avoid having to look it up again each time around the loop.

-
The shuffle_kfy function is redundant because random.shuffle already implements the Fisher–Yates shuffle.

-
Using a string argument to select a mode of operation is error-prone. If you mistyped the string, for example shuffle(items, 'satolo'), then you would not get an error, it would just silently do the wrong thing.

-
The shuffle function is redundant. Instead of assigning a search mode while parsing the arguments, assign a shuffle function:

shuffle = random.shuffle

# ...

if '-k' in opts:
    shuffle = shuffle_kfy
elif '-s' in opts:
    shuffle = shuffle_sattolo

-
Having code at the top level of a module makes it hard to test — for example, you can't test the shuffle function by itself, because you can't import the module without running all the top-level code. It is a good idea to put top-level code into a main function and guard it with if __name__ == '__main__':.

-
The code processes the command-line options by transforming them into a dictionary and then looking up each valid argument. But this means that there are no error messages for invalid arguments. If you look at the getopt examples in the manual, you'll see that they do:

for o, a in opts:
    if o == '-v':
        # process option
    else:
        # report error

-
The name i is normally used for index variables, so it's misleading to use it for an input file.

-
Error messages should go to standard error, not standard output:

sys.stderr.write("Invalid column 0. Columns are 1-based.\n")

This is particularly important here, because this program writes its results to standard output, and so probably the user is redirecting standard output to a file. If you write error messages to standard output, they will go to the file and the user will not see them.

-
It's a better idea to use the argparse module for parsing command-line arguments. With argparse the code is more explicit, it automatically emits error messages for invalid arguments, it has various types of built-in argument validation and conversion, and it has built-in support for the --help option. Something like this:

```
def column_list(string):
"""Validate and convert comma-separated list of column numbers."""
try:
columns = list(map(int, string.split(',')))
except ValueError as e:
raise argparse.ArgumentTypeError(*e.args)
for column in columns:
if column < 1:
raise argparse.ArgumentTypeError(
"Invalid column {!r}: column numbers start at 1."
.format(column))
return columns

def main():
parser = argparse.ArgumentParser(
description="Shuffle columns in a CSV file")
parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
default=sys.stdin, help='Input CSV file')
parser.add_argument('-s', '--sattolo',
action='store_const', const=shuffle_sattolo,
dest='shuffle', default=random.shuffle,
help="Use Sattolo's shuffle.")
col_group = parser.add_mutually_exclusive_group()
col_group.add_argument('-c', '--columns', type=column_list,
help="Comma-separated list of columns to include.")
col_group.add_argument('-C', '--no-columns', type=column_list,
help="Comma-separated list of columns to exclude.")
delim_group = parser.add_mutually_exclusive_group()
delim_group.add_argument('-d', '--delimiter', type=str, default=',',
help="Input column delimiter.")
delim_group.add_argument('-t', '--tabbed', dest='delimiter',
action='store_const', const='\t',
help="Delimit input with tabs.")
parser.add_argument('-q', '--quotechar', type=str, default='"',
help="Quote character.")
parser.add_argument('-o', '--output-delimiter', type=str, default=',',
help="Output column delimiter.")
args = parser.parse_args()

reader = csv.reader(args.infile,

Code Snippets

def shuffle_sattolo(items):
    """Shuffle items in place using Sattolo's algorithm."""
    _randrange = random.randrange
    for i in reversed(range(1, len(items))):
        j = _randrange(i)  # 0 <= j < i
        items[j], items[i] = items[i], items[j]

shuffle = random.shuffle

# ...

if '-k' in opts:
    shuffle = shuffle_kfy
elif '-s' in opts:
    shuffle = shuffle_sattolo

for o, a in opts:
    if o == '-v':
        # process option
    else:
        # report error

sys.stderr.write("Invalid column 0. Columns are 1-based.\n")

def column_list(string):
    """Validate and convert comma-separated list of column numbers."""
    try:
        columns = list(map(int, string.split(',')))
    except ValueError as e:
        raise argparse.ArgumentTypeError(*e.args)
    for column in columns:
        if column < 1:
            raise argparse.ArgumentTypeError(
                "Invalid column {!r}: column numbers start at 1."
                .format(column))
    return columns

def main():
    parser = argparse.ArgumentParser(
        description="Shuffle columns in a CSV file")
    parser.add_argument('infile', type=argparse.FileType('r'), nargs='?',
                        default=sys.stdin, help='Input CSV file')
    parser.add_argument('-s', '--sattolo',
                        action='store_const', const=shuffle_sattolo,
                        dest='shuffle', default=random.shuffle,
                        help="Use Sattolo's shuffle.")
    col_group = parser.add_mutually_exclusive_group()
    col_group.add_argument('-c', '--columns', type=column_list,
                           help="Comma-separated list of columns to include.")
    col_group.add_argument('-C', '--no-columns', type=column_list,
                           help="Comma-separated list of columns to exclude.")
    delim_group = parser.add_mutually_exclusive_group()
    delim_group.add_argument('-d', '--delimiter', type=str, default=',',
                             help="Input column delimiter.")
    delim_group.add_argument('-t', '--tabbed', dest='delimiter',
                             action='store_const', const='\t',
                             help="Delimit input with tabs.")
    parser.add_argument('-q', '--quotechar', type=str, default='"',
                        help="Quote character.")
    parser.add_argument('-o', '--output-delimiter', type=str, default=',',
                        help="Output column delimiter.")
    args = parser.parse_args()

    reader = csv.reader(args.infile, delimiter=args.delimiter,
                        quotechar=args.quotechar)

    # ... and so on ...

Context

StackExchange Code Review Q#129806, answer score: 8

Revisions (0)

No revisions yet.