09 April 2007

Haskell for the Short Attention Span: A Simple File Filter

Well, there is more to say about run-length encoding and other interesting examples I was sent in response to my last article, but today I've been working on another little real-world problem. I've been able to spend a little bit of time on #haskell and the gang there has been extremely helpful. The task: filter a binary log file, turning it into a text representation, and do some stateful validation. I'm not going to show you all the code (it is pretty boring), but here are the minor pitfalls and helpful suggestions I encountered along the way. First, I started with a very basic file filter example from the Haskell 98 report:

main = do
    putStr "Input file: "
    ifile <- getLine 
    putStr "Output file: "
    ofile <- getLine 
    s <- readFile ifile 
    writeFile ofile (filter isAscii s)
    putStr "Filtering successful\n"

Pretty simple. As a sanity check, I wanted to run this example on a data file. The first thing I found is that isAscii is not in the default namespace. To use this program in GHC you'll need to import the module Char (Data.Char in GHC's hierarchical libraries). You can learn this via Hoogle. I've started using Hoogle quite a bit; it is a great little tool! You can also get to it directly through #haskell via Lambdabot, or you can put it right in your GHCi.
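For reference, here is a minimal sketch of the filtering step with the import in place. The name keepAscii is mine, not the report's, and I'm assuming the hierarchical module name Data.Char:

```haskell
import Data.Char (isAscii)

-- Keep only the characters in the 7-bit ASCII range.
keepAscii :: String -> String
keepAscii = filter isAscii
```

With this in scope, the writeFile line in the example could read writeFile ofile (keepAscii s).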

Anyway, the next thing I found was that, at least under Cygwin, unless you explicitly open a file in binary mode, you might encounter problems. One I should have anticipated -- you might see line feed translation. But there are apparently others -- like silent truncation of the data! Somewhere in the guts of Cygwin, or maybe Windows, control codes designed back in the Jurassic days of computing are still being honored. Input was terminating on (apparently) an EOF character in my binary file. Binary mode was another suggestion from one of the very smart folks on #haskell.

I'll just use readBinaryFile instead of readFile, right? Well, no, readBinaryFile is part of MissingH, a third-party library. Back to #haskell, where I was advised that I could roll my own. Importing System.IO, I use the following snippet:

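-- Open in binary mode (no newline translation), then read lazily.
readBinaryFile :: FilePath -> IO String
readBinaryFile s =
    System.IO.openBinaryFile s System.IO.ReadMode >>= System.IO.hGetContents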

This works fine, although I would humbly suggest that GHC's standard library should provide easy access to binary files; while the solution is pretty trivial, it is a wart.

I next stubbed my toe on printf. It is available in module Text.Printf and provides a type-safe version of C's printf. As a long-time C and C++ programmer, I figured I knew just about everything there is to know about printf, so I quickly gave it the format string "%02X" and passed it a char. Apparently the uppercase X, which produces a hex representation with uppercase A-F, is not supported (grrr). Another minor wart -- if you provide printf, it should behave like printf -- but we'll move on.
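If uppercase hex digits really matter, one possible workaround is to format with the lowercase conversion and upcase the result afterwards. This is only a sketch: hexUpper is a name I made up, and I convert the char to its code with ord first rather than relying on how printf treats a Char.

```haskell
import Data.Char (ord, toUpper)
import Text.Printf (printf)

-- Format a character's code as two uppercase hex digits.
-- Uses the lowercase %02x conversion and upcases the result afterwards.
hexUpper :: Char -> String
hexUpper c = map toUpper (printf "%02x" (ord c))
```

For example, hexUpper '\255' gives "FF".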

After more chatting on #haskell, I was given this one-liner to dump binary data in a nicely formatted way:

writeFile ofile (concat $ zipWith (printf "%02x %s") s (cycle $ replicate 19 "" ++ ["\n"]))

I want to take a moment to talk about how it works. First, cycle $ replicate 19 "" ++ ["\n"]. The replicate function gives us a list of 19 empty strings, to which we append a one-element list holding a newline. Applying cycle to this 20-element list treats it as an infinitely repeating circular list of strings, where every twentieth is a newline. The bytes and these separators are then fed pairwise to printf using zipWith. zipWith is an interesting function: while zip takes two lists and generates a list of pairs produced by assembling the list elements into tuples, zipWith doesn't tupleize the elements; instead it feeds the elements to the provided function, and makes a list of the results.
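To make the moving parts easier to see, here is the same idea pulled apart into a small pure function. It is a sketch rather than the code from my filter: hexDump, fmt and the width parameter are names I've introduced, and I pass ord c to printf instead of the raw char.

```haskell
import Data.Char (ord)
import Text.Printf (printf)

-- Hex-dump a string, starting a new line after every `width` bytes.
hexDump :: Int -> String -> String
hexDump width s = concat (zipWith fmt s separators)
  where
    -- (width - 1) empty strings followed by a newline, repeated forever.
    separators = cycle (replicate (width - 1) "" ++ ["\n"])

    -- The signature pins printf's result type to String.
    fmt :: Char -> String -> String
    fmt c sep = printf "%02x %s" (ord c) sep
```

With a width of 2, hexDump 2 "ABC" produces "41 42 \n43 ": zipWith stops when the input runs out, so the infinite list of separators is never a problem.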

While this worked, it was interesting enough that I wanted to play with it using GHCi. But I had to give up on that; I kept tripping over the type checker. While I appreciate the masochistic joys of strict type checking and the safety that comes with it, it can be frustrating for programmers with experience in, say, Ruby, or even C.

For example:

let xs = [1..100]
let ys = take 100 (cycle $ replicate 19 "" ++ ["\n"])
zipWith (printf "%02x %s") xs ys

GHCi replies:

Ambiguous type variable `c' in the constraint:
`PrintfType c' arising from use of `printf' at <interactive>:1:9-24
Probable fix: add a type signature that fixes these type variable(s)

Ugh. Using the :t command in GHCi it is easy to see that GHCi thinks the type of xs is [Integer] and ys is [[Char]] (a list of lists of chars, also known as a list of strings). If I put roughly the same code in a Literate Haskell source file and ask GHC to load it, I get:

Ambiguous type variable `a' in the constraints:
  `Enum a'
    arising from the arithmetic sequence `1 .. 100'
    at E:\toy.lhs:3:5-12
  `Num a' arising from the literal `100' at E:\toy.lhs:3:9-11
  `PrintfArg a' arising from use of `printf' at E:\toy.lhs:5:18-33
Possible cause: the monomorphism restriction applied to the following:
  xs :: [a] (bound at E:\toy.lhs:3:0)
Probable fix: give these definition(s) an explicit type signature
              or use -fno-monomorphism-restriction

followed immediately by:

Ambiguous type variable `c' in the constraint:
  `PrintfType c' arising from use of `printf' at E:\toy.lhs:5:18-33
Possible cause: the monomorphism restriction applied to the following:
  result :: [c] (bound at E:\toy.lhs:5:0)
Probable fix: give these definition(s) an explicit type signature
              or use -fno-monomorphism-restriction

Failed, modules loaded: none.
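For comparison, here is a minimal sketch of a source file that follows the "Probable fix" advice and gives every binding an explicit type signature. The names xs, ys and result mirror the ones in the messages above; the particular types ([Int] and [String]) are my assumption about what was intended.

```haskell
import Text.Printf (printf)

-- The element type is fixed, so Enum/Num/PrintfArg are no longer ambiguous.
xs :: [Int]
xs = [1..100]

ys :: [String]
ys = take 100 (cycle (replicate 19 "" ++ ["\n"]))

-- Asking for [String] tells printf to return a String
-- for each pair of arguments.
result :: [String]
result = zipWith (printf "%02x %s") xs ys
```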

Wow. I'd say that is not really a newbie-friendly error message. However, this printf works fine in my real program. I'm not certain why, and I'm not going to dive into it too deeply right now.

But here's a simpler type checking example: while C is strongly typed, you can treat numbers as chars and vice-versa, as long as you keep integral promotion and sign extension in mind. GHC is a harsher mistress. Let's say we want to pattern-match on our binary data. The value 16 in my binary data is DLE, which stands for Data Link Escape; it is often used in serial data to indicate packet boundaries, while inside the payload, it will be escaped (doubled). So here's a little pattern to remove doubled DLEs:

de_dle (16:16:xs) = ...

Simple enough, right? No, to Haskell a number and a Char are not interchangeable. Back to #haskell, where I got a quick explanation of the type checker's error messages. I turned the numbers into chars:

de_dle ('\16':'\16':xs) = ...

And that works just beautifully.
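I'm not showing my real function, but a minimal, hypothetical sketch of the idea looks like this. The name deDle and the two extra clauses are illustrative assumptions, not the code from my filter:

```haskell
-- Hypothetical sketch: collapse each doubled DLE (decimal 16) into one.
deDle :: String -> String
deDle ('\16':'\16':xs) = '\16' : deDle xs   -- escaped DLE: keep a single copy
deDle (x:xs)           = x : deDle xs       -- anything else passes through
deDle []               = []
```

Note that '\16' is a decimal escape in Haskell, so it really is character 16, not octal 16 as a C programmer might expect.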

Anyway, to make a long story shorter, I was able to write my file filter, which does some nice pattern matching and formatted output. The problems I had while developing it were the kind I like: problems choosing my algorithm properly, not problems fighting with the language. The runtime was quite helpful here; while processing the file, if I hit a case which my patterns did not handle, the program stopped with a runtime error about non-exhaustive patterns. That led me to reorganize my patterns, and the result was much clearer.

There was one more minor pitfall remaining. I compiled my program using GHC, but when I ran it, instead of my prompts for input and output filenames, I got nothing! Haskell was silently waiting for input. Back to #haskell. To make the output show up, I had to import System.IO and do hSetBuffering stdout NoBuffering. (Alternately, I could flush stdout immediately after each putStr, but that seems even uglier). I hope that saves someone a little aggravation.
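Both options can be sketched in a few lines. The helper names unbufferStdout and prompt are mine; hSetBuffering, hFlush and NoBuffering come from System.IO:

```haskell
import System.IO

-- Option 1: turn off buffering on stdout so prompts appear immediately.
unbufferStdout :: IO ()
unbufferStdout = hSetBuffering stdout NoBuffering

-- Option 2: leave buffering alone and flush explicitly after each prompt.
prompt :: String -> IO String
prompt msg = do
    putStr msg
    hFlush stdout
    getLine
```

With the second option, the start of main would become ifile <- prompt "Input file: " and so on, which hides the flush in one place.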

Speaking of aggravation, how did it all come out? Well, the original binary log file is about six megabytes. Since I was doing this by hand, I was more concerned that the program ran correctly than that it ran fast; I would have been satisfied with anything under a half-hour. In fact, without any attempts at optimization at all, my filter ran in under thirty seconds, which is more than fast enough for an ad hoc little tool. I started out trying to do this task using some regular expressions in vi, and that was quickly going nowhere, and taking forever to do it. The vi that came with Cygwin doesn't support some of the more advanced regular expression features (there is no {x,y} syntax for specifying the number of repeats of a pattern). Notepad++, my workhorse Windows text editor, also doesn't support this syntax. Without it the regexes were becoming hideous, although in (say) Perl they would have been rather simple. But I was able to use Haskell instead, thanks to GHC and the kind folks on #haskell!

4 comments:

Malte said...

I do not understand the ins and outs of your ghci/printf example myself, but it seems to me that you have to tell the Haskell type system that there are two sequences to replace in your string. That is, you have to tell Haskell that (printf "%02x %s") takes two further arguments. The simplest way to achieve this is to specify the type of the result:

zipWith (printf "%02x %s") xs ys :: [String]

This will evaluate to a string list, as specified.

Malte

Unknown said...

The printf issue is, I believe, the "magic" that lets Text.Printf.printf do double duty as C's printf and sprintf functions. Unless you specify the type there, you could be ending up with [String] or [IO ()], so as far as GHC is concerned, the type of that function is ambiguous.

If at some point you actually use the function in code, then the context will allow GHC to infer the type. This is probably why it works in the program, but not at REPL.

And because if I don't someone else will: the point you make about being able to treat chars as ints in C is exactly that C is weakly typed. You probably mean statically typed.

rien said...

" I had to import System.IO and do hSetBuffering stdout NoBuffering. (Alternately, I could flush stdout immediately after each putStr, but that seems even uglier). I hope that saves someone a little aggravation."

You just did, thank you so much!

Unknown said...

If you are prompting the user, then you should write the prompt to unbuffered stderr instead:

hPutStr stderr "something"