HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Copy first n lines of each file in a directory

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
directoryfileeachfirstlinescopy

Problem

In an attempt to learn Haskell, I've built this - it accepts a number n, a source directory, and a sink directory (ending with /) copies n lines from each file in the source directory to the sink directory. It is not recursive. Thoughts?

module Main where
import System.Directory
import Control.Monad
import System.IO
import System.IO.Error
import Data.List
import Control.Applicative
import Control.Exception.Base
import System.Environment
import Data.List

--requires / or \ at end.
listFilesInDir :: FilePath -> IO [FilePath]
listFilesInDir dir= (getDirectoryContents dir) >>= filterM (\x ->doesFileExist $ dir ++ x)

readAndWrite:: Handle -> Handle -> IO ()
readAndWrite inHandle outHandle = do
    myLn if isEOFError e
                   then return ()
                   else ioError e
         Right myLn -> do hPutStrLn outHandle myLn
                          return ()

writeFirstNLines :: Int-> FilePath->FilePath -> IO ()
writeFirstNLines n source sink =do
    inHandle <- openFile source ReadMode
    outHandle <- openFile sink WriteMode
    replicateM_ n (readAndWrite inHandle outHandle)

main :: IO ()
main = do
    args <- getArgs
    let n = read $ args!!0
    let source = args!!1
    let sink   = args!!2
    let fileList=listFilesInDir  source
    fileSources <-fmap (map (source ++)) $ fileList
    fileSinks <-fmap (map (sink ++)) $ fileList
    --putStrLn "Sources"
    --mapM_ putStrLn fileSources
    --putStrLn "Sinks"
    --mapM_ putStrLn fileSinks
    q<-zipWithM (writeFirstNLines n) fileSources fileSinks
    return()

Solution

requires / or \ at end.

Your code requires that the user supplied source and destination
directories end in a path name separator. A better approach is
to simply add it. See the comments below on the () operator.

partial functions

When processing the command line arguments you
are using functions which could throw errors, e.g. read, head, !!.
When they fail you'll get a generic error message like Prelude.read: no parse or Prelude.!!: index too large.

Hackage has libraries you can use to parse and validate command line arguments
like cmdargs or any of the optparse- packages. However, it's not too difficult to roll your own like this:

import Data.Char

 parseArgs (nstr : source : dest : _) =
   if all isDigit nstr
     then ( read nstr, source, dest )
     else error "first argument is not a number"
 parseArgs _ = error "not enough arguments"

main = do
  args <- getArgs
  let (n, source, dest) = parseArgs args
  ...


parseArgs checks that there are at least three command line arguments and
also that the first one consists entirely of digits.

(++)

Currently the FilePath type is a String, so using (++) to join
together path name segments is fine. However, in System.FilePath
Haskell defines the operator () which is meant to be a cross-platform
way to do this - i.e. on Windows it will use \\ and on Posix systems it
will use a /. Moreover using () signals to the reader that you are
working with a path name string.

The module System.FilePath also has utility functions for dealing with
path names such as extracting the base / directory name from a path, extracting and replacing the suffix, etc. which are good to know about and may come
in handy.

A good place to use () is in your listFilesInDir function, e.g.:

listFilesInDir dir =
  getDirectoryContents >>= filterM (\x -> doesFileExist (dir  x))


Now dir does not have to end in a / or \\ because using () will
add the appropriate separator.

fileList

Note that fileList is an IO-action, so in these lines:

fileSources <-fmap (map (source ++)) $ fileList
fileSinks <-fmap (map (sink ++)) $ fileList


you are repeating the IO-action twice. That is, you are performing
getDirectoryContents twice on the same directory. Instead you
want to perform fileList
only once and use its result to compute fileSources and fileSinks:

leafNames ) leafNames
    fileSinks = map (sink ) leafNames
zipWithM_ (writeFirstNLines n) fileSources fileSinks


Note I'm using zipWithM_ here - it works just like zipWithM but discards
the results. It's more efficient because zipWithM_ discards the result of
the zip operation immediately, whereas zipWithM saves the results in a
list thus consuming memory for data you'll never look at.

hClose

In Haskell you need to explicitly close your file handles when you are done
with them. Every process has limit on the number of handles which can be open at any one time, so if you never close any handles then eventually openFile will fail when you hit this limit.

A basic way to make sure your file handles always get closed is to use
the withFile function:

withFile :: FilePath -> IOMode -> (Handle -> IO r) -> IO r


withFile creates a file handle by opening a path with the specified IOMode
and passes the handle to an IO-action. It will ensure that the handle is
closed when the IO-action completes - whether it returns normally
or if an exception is thrown.

Using withFile your writeFirstNLines would look like:

writeFirstNLines n source sink = do
  withFile source ReadMode (\hin -> do
    withFile sink WriteMode (\hout -> do
      ... copy from hin to hout ...
    )
  )


Now both hin and hout will automatically get closed when the copy action
completes.

replicateM

Note that replicateM_ n action will always execute action
exactly n times - no more and no less. So this is not a good way
to write a loop which could be executed a variable number of times.

As an example of how inefficient using replicateM_ here could be,
suppose that n is 1000 and you are operating on a 3 line file.
Then:

replicateM_ n readAndWrite


will execute readAndWrite 1000 times even though the last
997 calls won't copy any lines.

Here is a recipe for a loop which is executed at most n times
with the possibility of breaking out early:

loop n | n <= 0 = return ()
loop n = do quit <- condition       -- check condition
            if quit
              then return ()        -- break out early
              else ... do stuff...
                   loop (n-1)       -- perform next iteration


You can use this to copy the first n lines by substituting:

condition        =  hIsEOF hin
...do stuff...   =  hGetLine hin >>= hPutStrLn hout

Code Snippets

import Data.Char

 parseArgs (nstr : source : dest : _) =
   if all isDigit nstr
     then ( read nstr, source, dest )
     else error "first argument is not a number"
 parseArgs _ = error "not enough arguments"

main = do
  args <- getArgs
  let (n, source, dest) = parseArgs args
  ...
listFilesInDir dir =
  getDirectoryContents >>= filterM (\x -> doesFileExist (dir </> x))
fileSources <-fmap (map (source ++)) $ fileList
fileSinks <-fmap (map (sink ++)) $ fileList
leafNames <- listFilesInDir source
let fileSources = map (source </>) leafNames
    fileSinks = map (sink </>) leafNames
zipWithM_ (writeFirstNLines n) fileSources fileSinks
withFile :: FilePath -> IOMode -> (Handle -> IO r) -> IO r

Context

StackExchange Code Review Q#108943, answer score: 2

Revisions (0)

No revisions yet.