Processing text files for data extraction and analysis

Submitted by: @import:stackexchange-codereview

Problem

I started learning Haskell to see if I can use it at my job. A lot of my work is processing text files for data extraction and analysis.

For my first test, I append a counter to the end of each line of a .csv text file (for now I am not concerned with proper CSV format handling).

My current code in Haskell is:

import qualified Data.ByteString.Lazy.Char8 as L

addRecordId :: String -> String -> Int -> String
addRecordId "" _ _ = ""
addRecordId rec sep cnt = rec ++ sep ++ show cnt

addIncrementalId :: String -> [String] -> [String]
addIncrementalId _ []       = []
addIncrementalId sep ls = addId ls 1
  where
    addId []     _   = []
    addId (l:ls) cnt = addRecordId l sep cnt : addId ls (cnt + 1)

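identifyFile :: FilePath -> String -> IO [String]
identifyFile path sep = do
  -- (body reconstructed from the signature and the helpers above;
  -- the original lines were garbled in extraction)
  inpStr <- L.readFile path
  return (addIncrementalId sep (lines (L.unpack inpStr)))

printLnIdentifiedFile :: IO [String] -> IO ()
printLnIdentifiedFile ls = do
  lines <- ls
  putStr (unlines lines)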

main = printLnIdentifiedFile (identifyFile "myfile.csv" ";")


This code processes a 1 GB file (4,845,000 records) in 90 seconds.

The C code below does the same job in 10 seconds:

#include <stdio.h>
#include <stdlib.h>

int main() {
  FILE *f = fopen("myfile.csv", "r");

  size_t bytes_read;
  size_t current_buffer_size = 400;
  char *buffer = calloc(current_buffer_size, 1);

  long cnt = 1;
  while ((bytes_read = getline(&buffer, &current_buffer_size, f)) > 0) {
    if (feof(f)) break;

    buffer[ bytes_read - 2 ] = 0;
    printf("%s;%ld\n", buffer, cnt++);
  }

  fclose(f);
  return 0;
}


And the Java code below does the job in 30 seconds:

```
package test.perf.numadr;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class NumAdr {

    static public void main(String[] args) {
        BufferedReader br = null;

        try {

            String sCurrentLine;

            br = new BufferedReader(new FileReader("myfile.csv"));

            int cnt = 1;
            String lineWithId;
            while ((sCurrentLine = br.readLine()) != null) {
                lineWithId = sCu
```

Solution

I think your program could be more idiomatic Haskell (and shorter). Here is my rewrite:

import qualified Data.ByteString.Lazy.Char8 as L

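-- Append ";<record number>" to every line, staying in lazy ByteStrings
-- throughout instead of converting to String.
processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines out
    where out = zipWith f [1..] (L.lines contents)   -- pair each line with its 1-based index
          sep = L.pack ";"
          f n l = l `L.append` sep `L.append` L.pack (show n)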

main = do
    contents <- L.readFile "myfile.csv"
    L.putStr (processContents contents)
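
To make the behaviour of `processContents` concrete, here is a small sketch that runs the same function on an in-memory sample instead of `myfile.csv`. The `sample` value and its contents are made up for illustration; only `processContents` comes from the rewrite above.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Same function as in the rewrite above.
processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines out
    where out = zipWith f [1..] (L.lines contents)
          sep = L.pack ";"
          f n l = l `L.append` sep `L.append` L.pack (show n)

-- Hypothetical two-record input, standing in for the real file.
sample :: L.ByteString
sample = L.pack "a,b\nc,d\n"

main :: IO ()
main = L.putStr (processContents sample)
-- prints:
-- a,b;1
-- c,d;2
```

Because `L.readFile` and `L.putStr` work on lazy ByteStrings, the file version can stream through the input chunk by chunk rather than loading it all at once.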


Need better speed? It is trivial to convert this code into a parallel version.

import qualified Data.ByteString.Lazy.Char8 as L
import Control.Parallel.Strategies
import GHC.Conc(numCapabilities)

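-- Same transformation, but the output lines are evaluated in parallel:
-- the list is split into chunks and each chunk is fully evaluated as a spark.
processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines (out `using` parListChunk chunks rdeepseq)
    where out = zipWith f [1..] (L.lines contents)
          sep = L.pack ";"
          f n l = l `L.append` sep `L.append` L.pack (show n)
          -- Roughly one chunk per capability; note that `length out`
          -- walks the whole list of lines before any output is produced.
          chunks = 1 + (length out `div` numCapabilities)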

main = do
    contents <- L.readFile "myfile.csv"
    L.putStr (processContents contents)
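
As a quick sanity check on the parallel version, the sketch below illustrates that `using` with `parListChunk` only changes how the list is evaluated, not its value. It assumes the `parallel` package is installed; `sampleLines`, `number`, `sequentialOut`, `parallelOut` and the chunk size of 2 are hypothetical names and values chosen for the demo, not part of the answer.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Hypothetical sample records standing in for the lines of the real file.
sampleLines :: [L.ByteString]
sampleLines = map L.pack ["a,b", "c,d", "e,f"]

-- The numbering step from the answer, applied to a list of lines.
number :: [L.ByteString] -> [L.ByteString]
number = zipWith f [1 :: Int ..]
  where sep = L.pack ";"
        f n l = l `L.append` sep `L.append` L.pack (show n)

sequentialOut :: [L.ByteString]
sequentialOut = number sampleLines

-- Same list, evaluated in parallel in chunks of 2 (an arbitrary size for the demo).
parallelOut :: [L.ByteString]
parallelOut = number sampleLines `using` parListChunk 2 rdeepseq

main :: IO ()
main = print (sequentialOut == parallelOut)  -- prints True
```

Whether the parallel version is actually faster on a given file depends on the number of cores and on how much of the run time is I/O, so it is worth measuring both.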


To compile it, use the following command:

ghc -O2 -threaded -with-rtsopts=-N program.hs

Context

StackExchange Code Review Q#51251, answer score: 12
