Script for concatenating delimited text files
Problem
I need to concatenate a bunch of delimited text files. In the process, I need to add a new column to the data based on part of the name of one of the directories containing each file. The script works, but I have a feeling that it is inelegant in the extreme.
In particular, I'm wondering whether I should be reading the source files into a data structure that can later be written to file as a delimited text file. It seems like such an approach would be more general and easily extended.
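One way to sketch that data-structure approach — a hedged illustration, with hypothetical helper names read_rows/write_rows that are not taken from the question — is to load each file into a list of dicts via csv.DictReader and write them back with csv.DictWriter:

```python
import csv

def read_rows(path, extra):
    """Read a tab-delimited file into a list of dicts,
    merging in extra columns (e.g. the seed)."""
    with open(path, newline='') as f:
        reader = csv.DictReader(f, delimiter='\t')
        return [{**extra, **row} for row in reader]

def write_rows(path, rows, fieldnames):
    """Write a list of dicts out as a tab-delimited file."""
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter='\t')
        writer.writeheader()
        writer.writerows(rows)
```

Holding everything in memory like this is only sensible if the files are small; the streaming approach in the original script scales better.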
In its current structure, I have doubts about the efficiency of how I'm skipping the header lines in source files with the is_header variable. It seems like this approach requires more condition-checking than should be strictly necessary. Can't I just iterate over some object? I tried
for row in reader[1:]:

but apparently objects of type csv.reader don't allow subscripting.

#!/usr/bin/env python3

import glob
import csv

file_names = glob.glob('*/unknown/*.dat')

with open(file_names[0], 'r') as csv_input:
    reader = csv.reader(csv_input, delimiter='\t')
    header = ['seed'] + next(reader)

with open('output.dat', 'a') as csv_output:
    writer = csv.writer(csv_output, delimiter='\t')
    writer.writerow(header)
    for file_name in file_names:
        param_dir = file_name.split('/')[0]
        seed = param_dir.split('-')[0]
        with open(file_name, 'r') as csv_input:
            reader = csv.reader(csv_input, delimiter='\t')
            is_header = True
            for row in reader:
                if not is_header:
                    out_row = [seed] + row
                    writer.writerow(out_row)
                is_header = False

Solution
There is some repetition between your header-reading routine and the rest of the code. You could incorporate it into the main loop to remove that duplication and avoid opening the first file twice. You already use next(reader) nicely to read the header; use the same technique to skip the header row.

To determine the seed from a file path, you look for a - in the file path. Your glob should have a - in it to ensure that that succeeds.

import csv
from glob import glob

paths = glob('*-*/unknown/*.dat')

with open('output.dat', 'a') as csv_output:
    writer = csv.writer(csv_output, delimiter='\t')
    header_written = False
    for path in paths:
        param_dir = path.split('/')[0]
        seed = param_dir.split('-')[0]
        with open(path, 'r') as csv_input:
            reader = csv.reader(csv_input, delimiter='\t')
            header = next(reader)
            if not header_written:
                writer.writerow(['seed'] + header)
                header_written = True
            for row in reader:
                writer.writerow([seed] + row)
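To answer the "can't I just iterate over some object?" question directly: a csv.reader is an iterator, so next(reader) consumes the header row and a subsequent for loop sees only data rows; itertools.islice provides the slicing behaviour that reader[1:] was reaching for. A minimal, self-contained illustration (the sample data here is invented for the demo):

```python
import csv
import io
from itertools import islice

data = io.StringIO('colA\tcolB\nx\t1\ny\t2\n')

reader = csv.reader(data, delimiter='\t')
header = next(reader)   # consumes the header row
rows = list(reader)     # only the data rows remain

# Equivalent, without naming the header:
data.seek(0)
rows2 = list(islice(csv.reader(data, delimiter='\t'), 1, None))

assert rows == rows2 == [['x', '1'], ['y', '2']]
```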
Context
StackExchange Code Review Q#35218, answer score: 2