Processing text files for data extraction and analysis
Problem
I started learning Haskell to see if I can use it at my job. A lot of my work is processing text files for data extraction and analysis.
For my first test, I append a counter to the end of each line of a .csv text file (for now I am not concerned with handling the CSV format itself).
My current code in Haskell is:
```
import qualified Data.ByteString.Lazy.Char8 as L

addRecordId :: String -> String -> Int -> String
addRecordId "" _ _ = ""
addRecordId rec sep cnt = rec ++ sep ++ show cnt

addIncrementalId :: String -> [String] -> [String]
addIncrementalId _ [] = []
addIncrementalId sep ls = addId ls 1
  where
    addId [] _ = []
    addId (l:ls) cnt = addRecordId l sep cnt : addId ls (cnt + 1)

identifyFile :: FilePath -> String -> IO [String]
identifyFile path sep = do
    -- Assumption: this body is reconstructed from the type signature and the
    -- helpers above; the original lines are missing from the source.
    inpStr <- readFile path
    return (addIncrementalId sep (lines inpStr))

printLnIdentifiedFile :: IO [String] -> IO ()
printLnIdentifiedFile ls = do
    lines <- ls
    putStr (unlines lines)

main = printLnIdentifiedFile (identifyFile "myfile.csv" ";")
```

This code processes a 1 GB file (4,845,000 records) in 90 seconds.

The C code below does the same job in 10 seconds:

```
#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE *f = fopen("myfile.csv", "r");
    size_t bytes_read;
    size_t current_buffer_size = 400;
    char *buffer = calloc(current_buffer_size, 1);
    long cnt = 1;
    while ((bytes_read = getline(&buffer, &current_buffer_size, f)) > 0) {
        if (feof(f)) break;
        buffer[ bytes_read - 2 ] = 0;
        printf("%s;%ld\n", buffer, cnt++);
    }
    fclose(f);
    return 0;
}
```

And the Java code below does the job in 30 seconds:

```
package test.perf.numadr;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class NumAdr {
    static public void main(String[] args) {
        BufferedReader br = null;
        try {
            String sCurrentLine;
            br = new BufferedReader(new FileReader("myfile.csv"));
            int cnt = 1;
            String lineWithId;
            while ((sCurrentLine = br.readLine()) != null) {
                lineWithId = sCu
```
Solution
I think your program needs to be more Haskell-style (and shorter). Here is my rewrite:
```
import qualified Data.ByteString.Lazy.Char8 as L

-- Append ";<n>" to the n-th line of the input, working on lazy ByteStrings.
processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines out
  where out = zipWith f [1..] (L.lines contents)
        sep = L.pack ";"
        f n l = l `L.append` sep `L.append` L.pack (show n)

main = do
  contents <- L.readFile "myfile.csv"
  L.putStr (processContents contents)
```

Need better speed? It is trivial to convert this code into a parallel version:
```
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Parallel.Strategies  -- from the parallel package
import GHC.Conc(numCapabilities)

-- Same transformation, but the output lines are evaluated in parallel chunks.
processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines (out `using` parListChunk chunks rdeepseq)
  where out = zipWith f [1..] (L.lines contents)
        sep = L.pack ";"
        f n l = l `L.append` sep `L.append` L.pack (show n)
        -- Roughly one chunk per capability (OS thread running Haskell code).
        chunks = 1 + (length out `div` numCapabilities)

main = do
  contents <- L.readFile "myfile.csv"
  L.putStr (processContents contents)
```

To compile it, use the following command:

```
ghc -O2 -threaded -with-rtsopts=-N program.hs
```
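As a small illustration of what the sequential `processContents` above does, the sketch below runs it on an in-memory sample instead of `myfile.csv`. It also shows one behavioural difference from the code in the question: `addRecordId` leaves empty lines untouched, while the rewrite appends a counter to them as well. `processKeepingBlanks` is a hypothetical variant, not part of the original answer, that restores the question's behaviour; the sample input is made up for the example.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Sequential transformation from the rewrite: append ";<n>" to line n.
processContents :: L.ByteString -> L.ByteString
processContents contents = L.unlines (zipWith f [1 :: Int ..] (L.lines contents))
  where
    sep = L.pack ";"
    f n l = l `L.append` sep `L.append` L.pack (show n)

-- Hypothetical variant (an assumption, not in the original answer): leave
-- empty lines untouched, as addRecordId in the question does, while still
-- counting them.
processKeepingBlanks :: L.ByteString -> L.ByteString
processKeepingBlanks contents = L.unlines (zipWith f [1 :: Int ..] (L.lines contents))
  where
    sep = L.pack ";"
    f n l
      | L.null l  = l
      | otherwise = l `L.append` sep `L.append` L.pack (show n)

main :: IO ()
main = do
  let sample = L.pack "a;b\n\nc;d\n"     -- made-up input with one blank line
  L.putStr (processContents sample)      -- lines: "a;b;1", ";2", "c;d;3"
  L.putStr (processKeepingBlanks sample) -- lines: "a;b;1", "", "c;d;3"
```

If the input never contains blank lines, the two functions produce identical output, so the choice only matters for files with empty records.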
Context
StackExchange Code Review Q#51251, answer score: 12