HiveBrain v1.2.0

Combining CSVs using ruby to match headers

Submitted by: @import:stackexchange-codereview

Problem

I am trying to write a Ruby script that will take all of the CSVs in a given directory and map their headers so that duplicate columns match up and columns unique to one file get their own column in the final data file. The following code produces the output I desire, but something about it seems overly complicated and clumsy. Are there better practices or patterns I should follow to accomplish this? For example, it seems to me that the master_hash variable should lose scope when I transfer it to key_arr, but when I try to make the code reflect that, I end up overwriting it each time I open a new file. Am I trying too hard to avoid "slurping" the files? Can this be done using less memory, or faster? Forgive me if this question is too general in nature; if I can improve it to be more useful, please let me know.

```
require 'csv'

key_arr = [] # create key array to store the headers
master_hash = {} # create master hash to consolidate headers

# get each csv in the directory
Dir.glob('datasample*.csv').each do |file|
  # open each csv
  File.open(file) do |csv|
    # create a temporary hash of the headers
    temp_hash = Hash[csv.readline.split(',').map.with_index.to_a]
    # merge temporary hash into master hash
    master_hash = master_hash.merge(temp_hash)
  end
  key_arr = master_hash.keys # put keys from hash into array
end

CSV.open('dest.csv', 'w') do |dest_csv| # open destination file
  dest_csv << key_arr # write header row
end

Dir.glob('datasample*.csv').each do |file| # get all the csv's to be merged
  # open destination file in append mode
  CSV.open('dest.csv', 'a', headers: true) do |dest_csv|
    # open each source file for reading
    CSV.open(file, headers: true) do |source_csv|
      source_csv.each do |row| # iterate over each row of the source file
        row_arr = Array.new(key_arr.length) # create array to hold the row
        row.each do |cell| # get each header/field combination
          # iterate over the key arr (key arr is the combined header list)
          # and place the value in the slot matching its header
          key_arr.each_with_index do |key, index|
            row_arr[index] = cell[1] if cell[0] == key
          end
        end
        dest_csv << row_arr # write the completed row to the destination file
      end
    end
  end
end
```

Solution

Your overall approach seems fine (i.e. get all headers, then combine the files). However, it can be made much simpler.

Before getting into that, though, I'm slightly concerned about how you "manually" parse the headers of each file. The CSV format can have some nasty syntax, so simply splitting on commas may give you the wrong results (for instance, if the header text itself contains a comma). So I'd use the CSV methods for reliable, robust parsing of the header lines.
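To make that concrete, here's a quick illustration (the header text is made up) comparing a naive split with CSV's own parser:

```ruby
require 'csv'

# A header line where one column name itself contains a comma
header_line = '"Name, Full",Age,City'

# Naive split breaks the quoted column name into two pieces
header_line.split(',')
# => ["\"Name", " Full\"", "Age", "City"]

# CSV.parse_line respects the quoting rules
CSV.parse_line(header_line)
# => ["Name, Full", "Age", "City"]
```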

As for the actual "meat" of the script, it seems to me that you're working way, way too hard, using arrays, hashes, indices, and all sorts of other stuff. No need for that. In fact, you rarely ever need to muck around with low-level array/index stuff in Ruby.

And the neat thing about Ruby's CSV library is that you can reference cells by the header they belong to, so you don't need numeric indices. Instead, you can do something like:

```
CSV.foreach(some_file, headers: true) do |row|
  puts row["Column A"] # grab the value from a named column in the row
end
```


CSV.foreach avoids reading everything all at once. Instead it reads the rows one by one, passing them to the block. And since it passes them as CSV::Row objects, you can access the row cells by header name.
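A handy consequence, sketched below with made-up data: asking a CSV::Row for a header the file doesn't contain simply returns nil, and nil comes out as an empty cell when written, which is exactly what we want when combining files with different columns.

```ruby
require 'csv'

# A row parsed from data that only has columns "a" and "b"
row = CSV.parse("a,b\n1,2", headers: true).first

row["a"] # => "1"
row["c"] # => nil -- header not present in this file
```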

So here's my take:

```
require 'csv'

# Get input files
input_files = Dir.glob("datasample*.csv")

# Collect/combine headers
all_headers = input_files.reduce([]) do |all_headers, file|
  header_line = File.open(file, &:gets)     # grab first line
  all_headers | CSV.parse_line(header_line) # parse headers and merge with known ones
end

# Write combined file
CSV.open("out.csv", "w") do |out|
  # Write all headers
  out << all_headers

  # Write rows from each file
  input_files.each do |file|
    CSV.foreach(file, headers: true) do |row|
      out << all_headers.map { |header| row[header] }
    end
  end
end
```


The output file will contain all the headers, and all the data in the correct order, interspersed with blank cells where necessary. The memory usage should be really, really low, because no file is read in its entirety. We just read a single line from each to get all the headers, and afterward, we go row by row, outputting them as we read them.
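As a quick sanity check, here's the same approach run end to end on two tiny, made-up input files (written to a temp directory just for the demo, and sorted so the output order is predictable):

```ruby
require 'csv'
require 'tmpdir'

merged = Dir.mktmpdir do |dir|
  # Two sample files with one shared column ("name") and one unique column each
  File.write(File.join(dir, "datasample1.csv"), "name,age\nalice,30\n")
  File.write(File.join(dir, "datasample2.csv"), "name,city\nbob,berlin\n")

  input_files = Dir.glob(File.join(dir, "datasample*.csv")).sort

  # Union of all headers, in first-seen order
  all_headers = input_files.reduce([]) do |headers, file|
    headers | CSV.parse_line(File.open(file, &:gets))
  end

  out_path = File.join(dir, "out.csv")
  CSV.open(out_path, "w") do |out|
    out << all_headers
    input_files.each do |file|
      CSV.foreach(file, headers: true) do |row|
        out << all_headers.map { |header| row[header] }
      end
    end
  end

  File.read(out_path)
end

puts merged
# name,age,city
# alice,30,
# bob,,berlin
```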


Context

StackExchange Code Review Q#84290, answer score: 6
