HiveBrain v1.2.0

Script for concatenating delimited text files

Submitted by: @import:stackexchange-codereview

Problem

I need to concatenate a bunch of delimited text files. In the process, I need to add a new column to the data based on part of the name of one of the directories containing each file. The script works, but I have a feeling that it is inelegant in the extreme.

In particular, I'm wondering whether I should be reading the source files into a data structure that can later be written to file as a delimited text file. It seems like such an approach would be more general and easily extended.

In its current structure, I have doubts about the efficiency of how I'm skipping the header lines in source files with the is_header variable. It seems like this approach requires more condition-checking than should be strictly necessary. Can't I just iterate over some object? I tried for row in reader[1:]: but apparently objects of type csv.reader don't allow subscripting.

#!/usr/bin/env python3

import glob
import csv

file_names = glob.glob('*/unknown/*.dat')

with open(file_names[0], 'r') as csv_input:
  reader = csv.reader(csv_input, delimiter = '\t')
  header = ['seed'] + next(reader)

with open('output.dat', 'a') as csv_output:
  writer = csv.writer(csv_output, delimiter = '\t')
  writer.writerow(header)

  for file_name in file_names:
    param_dir = file_name.split('/')[0]
    seed = param_dir.split('-')[0]

    with open(file_name, 'r') as csv_input:
      reader = csv.reader(csv_input, delimiter = '\t')
      is_header = True
      for row in reader:
        if not is_header:
          out_row = [seed] + row
          writer.writerow(out_row)
        is_header = False

Solution

There is some repetition between your header-reading routine and the rest of the code. You could incorporate it into the main loop to remove that duplication and avoid opening the first file twice. You already use next(reader) nicely to read the header; use the same technique to skip the header row.
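On the asker's side question: reader[1:] fails because csv.reader is an iterator, not a sequence, but the same effect is available lazily via itertools.islice, which works on any iterator. A minimal, self-contained sketch with inline data:

```python
import csv
import io
from itertools import islice

# Inline stand-in for a tab-delimited file with one header row.
data = "col1\tcol2\na\tb\nc\td\n"

reader = csv.reader(io.StringIO(data), delimiter='\t')
# islice(reader, 1, None) yields everything after the first row,
# without loading the file into memory or subscripting the reader.
rows = list(islice(reader, 1, None))
print(rows)  # [['a', 'b'], ['c', 'd']]
```

That said, next(reader) is the more direct idiom when you also want to keep the header, as the rewrite below does.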

To determine the seed from a file path, you split the top-level directory name on '-'. Include a '-' in the glob pattern itself, so that only directories containing one are matched and the split always succeeds.
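To make the extraction concrete, here is a worked example on a hypothetical path matching the '*-*/unknown/*.dat' pattern (the directory name '1234-trial' is made up for illustration):

```python
path = '1234-trial/unknown/results.dat'

param_dir = path.split('/')[0]   # top-level directory: '1234-trial'
seed = param_dir.split('-')[0]   # text before the first '-': '1234'
print(seed)  # 1234
```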

import csv
from glob import glob

paths = glob('*-*/unknown/*.dat')

with open('output.dat', 'a') as csv_output:
    writer = csv.writer(csv_output, delimiter='\t')
    header_written = False

    for path in paths:
        param_dir = path.split('/')[0]
        seed = param_dir.split('-')[0]

        with open(path, 'r') as csv_input:
            reader = csv.reader(csv_input, delimiter='\t')

            header = next(reader)
            if not header_written:
                writer.writerow(['seed'] + header)
                header_written = True

            for row in reader:
                writer.writerow([seed] + row)
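On the asker's broader question about reading each file into a data structure first: one reasonable sketch uses csv.DictReader and csv.DictWriter, which carry column names with each record. The column names, seed value, and inline data below are hypothetical stand-ins, and io.StringIO stands in for real files:

```python
import csv
import io

# Hypothetical tab-delimited input with made-up column names.
source = "time\tvalue\n0\t1.5\n1\t2.5\n"

reader = csv.DictReader(io.StringIO(source), delimiter='\t')
rows = []
for row in reader:
    row['seed'] = '1234'  # annotate each record with its seed
    rows.append(row)

# Write the annotated records back out, seed column first.
out = io.StringIO()
fieldnames = ['seed'] + list(reader.fieldnames)
writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter='\t')
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

This is more general (columns can be reordered or dropped by changing fieldnames), at the cost of holding rows in memory; the streaming rewrite above is preferable when the files are large.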

Context

StackExchange Code Review Q#35218, answer score: 2
