05 March 2007

Haskell for the Short Attention Span: Run-Length Encoding, Part 2

Greetings! Last time I introduced a piece of code to solve a small real-world problem: decoding a set of function IDs from a run-length-encoded list of change points. I received some excellent feedback. One of the most valuable suggestions was that I avoid unnecessarily taking the length of the list in my "triplify" function. That makes a great deal of sense, considering that ideally I'd like to be able to handle infinite lists, and of course it would be nice if the runtime of my algorithm did not spiral out of control quite so quickly!

I also gave a little more thought to the various base cases. One of the nicest things about Haskell is its pattern matching. Indeed, even a simple function like "square a = a * a" makes use of pattern matching, where "a" on the left-hand side is irrefutable. My implementation last time failed to fully take advantage of pattern matching on multiple parameters, particularly to handle base cases, and did not even handle all the various termination conditions properly. So, here is a slightly condensed version of the code that makes the termination conditions much more explicit, and so I claim even though it is shorter, it is more self-explanatory.

>import Char
>import Numeric
>rle_raw_str = "00 30 90 09 31 01 10 32 20 22 12 30 23 12 41 24" ++
>  "34 40 44 1C 0F C1 1D 00 D0 1D 03 D0 4D 05 D0 7D 0A D0 CD 10" ++
>    "D1 2D 20 D2 6D 30 D3 1D 40 D4 1D 50 D5 1F 00"

A new triplify:

>
>triplify :: String -> [String]
>triplify [] = []
>triplify (x:y:z:rest) = [x, y, z] : triplify rest
>triplify str = error "Unexpected string length (trying to match 3 characters)"

Treating the input data as a list of 12-bit numbers:

>rle_change_points = map (fst . head . readHex)
>  (triplify (filter isHexDigit rle_raw_str))

The main function: given a list of change points, a hint as to whether we are initially presuming the decoder will output ones or zeroes, and a maximum value, decode RLE data from the change points. Return the indices of the one bits.

>deRLE :: [Integer] -> Bool -> Integer -> [Integer]
>deRLE [] False _ = []
>deRLE points False max = deRLEHelper points max
>deRLE points True max = deRLEHelper (0 : points) max

>deRLEHelper :: [Integer] -> Integer -> [Integer]
>deRLEHelper [] max = []
>deRLEHelper [unmatched] max = [unmatched..max]
>deRLEHelper (start:end:rest) max = [start..end - 1] 
>  ++ deRLEHelper rest max

This is more like it -- the kind of program that resembles a proof more than code. I'm starting to prefer this style!

A brief explanation: if we're initially decoding zeroes, the empty list case can be disposed of quickly. The program filters this case out first so that the recursive helper function, which accumulates the output, doesn't need to handle this case. The next two lines handle non-empty lists. If we're initially decoding ones, we can still treat all cases the same if we just prepend a zero to "flip" the decoder. Our helper function then has only three cases: a termination case where an interval of ones is not closed (because we have a single change point left), in which case we spit out ones up to and including our maximum possible value; a termination case where we have two change points forming a closed interval, in which case we spit out an interval inclusive of the start and exclusive of the end, and a non-termination case in which we have more than two entries, which spits out a closed interval and continues the process.

Let's look at some test cases. First, we confirm that if we're generating zeroes and we encounter no change points, our result set is empty:

deRLE [] False 0xF

[]

We confirm that if we hit a single change point, the result set will contain all the values up to and including the maximum possible:

deRLE [0] False 0xF

[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

deRLE [11] False 0xF

[11,12,13,14,15]

In the general case where our interval is closed, our result set is inclusive on the left and exclusive on the right:

deRLE [5, 10] False 0xF

[5, 6, 7, 8, 9]

Closing one range and opening another without closing it works correctly:

deRLE [5, 10, 11] False 0xF

[5,6,7,8,9,11,12,13,14,15]

If we begin assuming our decoder generates zeroes, an empty list of change points maps to all possible values:

deRLE [] True 0xF

[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

A single change point at zero toggles the state of the decoder and produces and empty result:

deRLE [0] True 0xF

[]

A single change point greater than zero allows some values into our result set before it is closed:

deRLE [5] True 0xF

[0,1,2,3,4]

Two change points defines a range to exclude from the results, and we get the rest of the values in the range:

deRLE [5, 10] True 0xF

[0,1,2,3,4,10,11,12,13,14,15]

Three change points suppresses the rest of the result set:

deRLE [5, 10, 13]

[0,1,2,3,4,10,11,12]

With these tests I'm satisified that deRLE works for all expected cases.

I received more comments, including some recommendations for a nifty unfold. Thanks to everyone who made suggestions. Next time I'll take a look at how those work, talk about anything else that comes to mind, and then proceed to my next real-world problem!

2 comments:

augustss said...

You can simplify the helper:


deRLEHelper :: [Integer] -> Integer -> [Integer]
deRLEHelper [] max = []
deRLEHelper [unmatched] max = [unmatched..max]
deRLEHelper (start:end:rest) max = [start..end - 1] ++ deRLEHelper rest max

Paul R. Potts said...

Hi lennart,

You are correct. To recap:

deRLEHelper :: [Integer] -> Integer -> [Integer]
deRLEHelper (unmatched:[]) max = [unmatched..max]
deRLEHelper (start:end:[]) max = [start..end - 1]
deRLEHelper (start:end:rest) max = [start..end - 1]
++ ( deRLEHelper rest max )

can be simplified by turning the second-to-last equation into a termination case on the empty list (which you then moved up in order, so that it is matched first). This means a list containing only two elements is no longer handled with its own equation, but matches on the last equation. The key is noticing, which I did not, that the pattern match on [start:end:rest] will work even if rest is [].

We'll get one more call as deRLEHelper is called again on [], but it more clearly shows the termination, so I'm adopting this change into my article. Thanks for the suggestion!