07 March 2007

Haskell for the Short Attention Span: Run-Length Encoding, Part 3

This isn't a big improvement, but it occurs to me that my implementation from last time:

>deRLE :: [Integer] -> Bool -> Integer -> [Integer]
>deRLE [] False _ = []
>deRLE points False max = deRLEHelper points max
>deRLE points True max = deRLEHelper (0 : points) max

>deRLEHelper :: [Integer] -> Integer -> [Integer]
>deRLEHelper [] max = []
>deRLEHelper [unmatched] max = [unmatched..max]
>deRLEHelper (start:end:rest) max = [start..end - 1]
>    ++ deRLEHelper rest max

can be yet shorter; since deRLEHelper handles the empty list specially, deRLE doesn't need to:

>deRLE :: [Integer] -> Bool -> Integer -> [Integer]
>deRLE points False max = deRLEHelper points max
>deRLE points True max = deRLEHelper (0 : points) max

That way only the function that actually recurses down to the empty list has to explicitly handle it.

Can we make it even more concise and still retain the self-documenting aspects of the code? I'll have to come back to that question. It's time to respond to comments. When I first mentioned my triplify function, which breaks a string into strings of three characters, I wrote "I'm sure there is a very clever one-line fold that applies take 3 in an mind-expanding manner, but I was unable to find it." I got some great suggestions. The first, due to Cale Gibbard, was:

map (take 3) . takeWhile (not . null) . iterate (drop 3)

Let's look at it piecewise as a pipeline, from right to left.

First, our list/string (for example, "ABCDEFG") is fed to the function iterate along with (drop 3). Iterate is an interesting function. I tried to understand it a few months ago by reading the description at zvon.org, which says that iterate

creates an infinite list where the first item is calculated by applying the function on the second argument, the second item by applying the function on the previous result and so on.

They give the following example:

Input: take 10 (iterate (2*) 1)
Output: [1,2,4,8,16,32,64,128,256,512]

Note that if the behavior followed the description, the output would be [2,4,8,16,32,64,128,256,512,1024]. Their description is wrong, although the output matches GHCi's output. What does iterate really do? I'll describe it in the form of a poem.

iterate takes a function and a value, and returns a list consisting of:

the value

followed by
  the value of applying the function to the value

followed by
  the value of applying the function to
    the value of applying the function to the value

followed by
  the value of applying the function to
    the value of applying the function to
      the value of applying the function to the value
      
it will keep talking
  even when
    it has nothing
      new to say
        so please don't
        ask it any
          open-ended
            questions...

That is, I can't evaluate

iterate (drop 3) "ABCDEFG"

because even though the input is finite, the output isn't; there's no termination to the recursion. However, I can evaluate part of the result:

take 4 (iterate (drop 3) "ABCDEFG")
["ABCDEFG","DEFG","G",""]

And because of Haskell's lazy evaluation, we can feed this infinite result "upstream" to be used in the remainder of our calculation, and it will work fine as long as it doesn't evaluate the complete result. The next stage in our "points-free" pipeline is:

takeWhile (not . null)

The takeWhile function operates on a list, returning a list made up of its values as long as they match the predicate supplied. This has the effect of squashing the overly talkative iterate, because we stop evaluating its results as soon as it returns the first null value. Lazy evaluation is so cool!

Now, our finite list of strings is handed off to map (take 3), which generates a list of 3-character prefixes. Here's the whole process:

iterate (drop 3) "ABCDEFG"

yields an infinite list beginning ["ABCDEFG","DEFG","G",""];

takeWhile (not . null)

applied to the above will take elements from the list until it hits the first null: ["ABCDEFG","DEFG","G"]

map (take 3)

applied to the above will give us ["ABC","DEF","G"]. Note that the short remainder is preserved.

>

That seems so useful, I'm surprised it isn't a function in the Standard Prelude, called groupsOf. Although it would also be nice to have a version which didn't keep the short remainder, perhaps groupsOfOnly.

I got another suggestion from a user called "kirby." He suggested the function

takeWhile (\="") . List . unfoldr (Just . splitAt 3)

Wow, my first unfoldr! I'm going to have to stop there for today, while I learn a little something about Just, Nothing, and the Maybe warm fuzzy thing. My first funster!

06 March 2007

English Majors as Programmers

The Embedded Muse is an e-newsletter edited by Jack Ganssle; see the Ganssle Group home page. In Issue 142 he writes:

I've found that some of the best developers of all are English majors. They'll often graduate with no programming experience at all, and certainly without a clue about the difference between DRAM and EPROM. But they can write. That's the art of conveying information concisely and clearly. Software development and writing are both the art of knowing what you're going to do, and then lucidly expressing your ideas. The worst developers, regardless of background, fail due to their inability to be clear. Their thoughts and code tend to ramble rather than zero-in on the goal... Too many engineering-trained developers have a total disregard for stylistic issues in programming. Anything goes. Firmware is the most expensive thing in the universe, so it makes sense to craft it carefully and in accordance with a standard style guide. Make sure it clearly communicates its intent. This is where the English majors shine; they've spent 4 years learning everything there is to know about styles and communication.

As an English major who has programmed computers since the age of ten, and who also took every programming course I could manage, it's nice to hear a little love for English majors! But I should add that an English degree is not a requirement; I also know some excellent developers who majored in more technical subjects, such as anthropology and archaeology!

05 March 2007

Haskell for the Short Attention Span: Run-Length Encoding, Part 2

Greetings! Last time I introduced a piece of code to solve a small real-world problem: decoding a set of function IDs from a run-length-encoded list of change points. I received some excellent feedback. One of the most valuable suggestions was that I avoid unnecessarily taking the length of the list in my "triplify" function. That makes a great deal of sense, considering that ideally I'd like to be able to handle infinite lists, and of course it would be nice if the runtime of my algorithm did not spiral out of control quite so quickly!

I also gave a little more thought to the various base cases. One of the nicest things about Haskell is its pattern matching. Indeed, even a simple function like "square a = a * a" makes use of pattern matching, where "a" on the left-hand side is irrefutable. My implementation last time failed to fully take advantage of pattern matching on multiple parameters, particularly to handle base cases, and did not even handle all the various termination conditions properly. So, here is a slightly condensed version of the code that makes the termination conditions much more explicit, and so I claim even though it is shorter, it is more self-explanatory.

>import Char
>import Numeric
>rle_raw_str = "00 30 90 09 31 01 10 32 20 22 12 30 23 12 41 24" ++
>  "34 40 44 1C 0F C1 1D 00 D0 1D 03 D0 4D 05 D0 7D 0A D0 CD 10" ++
>    "D1 2D 20 D2 6D 30 D3 1D 40 D4 1D 50 D5 1F 00"

A new triplify:

>
>triplify :: String -> [String]
>triplify [] = []
>triplify (x:y:z:rest) = [x, y, z] : triplify rest
>triplify str = error "Unexpected string length (trying to match 3 characters)"

Treating the input data as a list of 12-bit numbers:

>rle_change_points = map (fst . head . readHex)
>  (triplify (filter isHexDigit rle_raw_str))

The main function: given a list of change points, a hint as to whether we are initially presuming the decoder will output ones or zeroes, and a maximum value, decode RLE data from the change points. Return the indices of the one bits.

>deRLE :: [Integer] -> Bool -> Integer -> [Integer]
>deRLE [] False _ = []
>deRLE points False max = deRLEHelper points max
>deRLE points True max = deRLEHelper (0 : points) max

>deRLEHelper :: [Integer] -> Integer -> [Integer]
>deRLEHelper [] max = []
>deRLEHelper [unmatched] max = [unmatched..max]
>deRLEHelper (start:end:rest) max = [start..end - 1] 
>  ++ deRLEHelper rest max

This is more like it -- the kind of program that resembles a proof more than code. I'm starting to prefer this style!

A brief explanation: if we're initially decoding zeroes, the empty list case can be disposed of quickly. The program filters this case out first so that the recursive helper function, which accumulates the output, doesn't need to handle this case. The next two lines handle non-empty lists. If we're initially decoding ones, we can still treat all cases the same if we just prepend a zero to "flip" the decoder. Our helper function then has only three cases: a termination case where an interval of ones is not closed (because we have a single change point left), in which case we spit out ones up to and including our maximum possible value; a termination case where we have two change points forming a closed interval, in which case we spit out an interval inclusive of the start and exclusive of the end, and a non-termination case in which we have more than two entries, which spits out a closed interval and continues the process.

Let's look at some test cases. First, we confirm that if we're generating zeroes and we encounter no change points, our result set is empty:

deRLE [] False 0xF

[]

We confirm that if we hit a single change point, the result set will contain all the values up to and including the maximum possible:

deRLE [0] False 0xF

[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

deRLE [11] False 0xF

[11,12,13,14,15]

In the general case where our interval is closed, our result set is inclusive on the left and exclusive on the right:

deRLE [5, 10] False 0xF

[5, 6, 7, 8, 9]

Closing one range and opening another without closing it works correctly:

deRLE [5, 10, 11] False 0xF

[5,6,7,8,9,11,12,13,14,15]

If we begin assuming our decoder generates zeroes, an empty list of change points maps to all possible values:

deRLE [] True 0xF

[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]

A single change point at zero toggles the state of the decoder and produces and empty result:

deRLE [0] True 0xF

[]

A single change point greater than zero allows some values into our result set before it is closed:

deRLE [5] True 0xF

[0,1,2,3,4]

Two change points defines a range to exclude from the results, and we get the rest of the values in the range:

deRLE [5, 10] True 0xF

[0,1,2,3,4,10,11,12,13,14,15]

Three change points suppresses the rest of the result set:

deRLE [5, 10, 13]

[0,1,2,3,4,10,11,12]

With these tests I'm satisified that deRLE works for all expected cases.

I received more comments, including some recommendations for a nifty unfold. Thanks to everyone who made suggestions. Next time I'll take a look at how those work, talk about anything else that comes to mind, and then proceed to my next real-world problem!

01 March 2007

Haskell for the Short Attention Span: Run Length Encoding, Part 1

I'm a fortunate man. Some 50% of people who suffer herpes zoster infections involving the eye wind up with permanent eye damage. I seem to have escaped with little or no permanent damage. The nerves are not completely healed, and there is still visible damage under the skin, but it is no longer painful to work at the computer or read!

It's time to jump back on the horse and risk looking foolish in public by writing some code. I still consider myself to be close to a rank beginner in Haskell, so you're reading my "finger exercises." To experienced functional programmers they will probably sound like a new violin student sawing away. However, since I haven't been a student for almost twenty years and there is no one to assign me homework but myself, this seems to be the next best way to force myself to learn something!

With three children at home, two of them babies, I don't get a lot of unbroken free time; I've got a guaranteed short attention span. So consequently I'm going to look at some very small programs. However, rather than make up toy programs from scratch, I am now going to turn to a few real-world problems that I've stumbled across while doing embedded programming.

OK. Consider a spherical cow of uniform density... wait, I mean, consider a black-box system with an outward-facing interface in which a number of functions are made available. Functions are accessed via a 12-bit unique ID. A client of that system doesn't know which functions are available. The client needs a method to retrieve a list of funciton IDs. Let's assume we don't have bandwidth to burn, so we don't want to send a complete list of all the function IDs. Let's further assume that the functions are organized into a small number of groups with consecutive IDs. In other words, this is data that is very amenable to a simple compression scheme.

We're going to do it using Run-Length Encoding (RLE). Here is a brief informal explanation of RLE; if you are already know all about RLE, skip the next paragraph (but read the ones after that, please!)

RLE is a commonly used technique for encoding bitmaps; a variant of RLE is even used in fax machines. Rather than store a bit for every bit in a bitmap, RLE records _runs_ of repeating data. If you have a raster image that you're scanning left to right, top to bottom, and that image has a lot of horizontal lines in it, or black and white rectangles, this might be a big win. The RLE encoding will record long alternating runs of ones and zeroes. If the image has a lot of vertical lines, this might not be much of a win. The worst-case scenario is a 50% checkerboard: in that case, we'll have to indicate runs of length 1, and the overhead will be far worse than just encoding the bits. For that kind of image you'd be better off using a means of compression that actually adapts to the frequency of occurrence of particular patterns in the data, such as Huffman coding or LZW.

Now, there are several possible ways to represent the runs of RLE data. One scheme might use tuples consisting of the bit value and the number of occurrences of that value (the "run length"). For example, the tuples [(0, 14), (1, 28), (0, 3)] could represent a run of fifteen zeroes, followed by a run of twenty-nine ones, followed by four zeroes. Why are the values off by one? Because we would never need to encode a run of length zero, so we allow zero to represent one, and so on. A tuple like this could be encoded in a single byte, where the high-order bit represents the value and the remaining seven bits represented the run length - 1. This gives us the ability to represent runs of up to 128 bits. Of course, if we have a run of length 129 or greater, we'll need to represent it with multiple bytes. For degenerate cases where we encode very long runs, we'll have unnecessary redundancy.

For that low-entropy bitmap, a better scheme might be to just record the points at which the data _changes_. We have to choose an initial state for decoding: should the decoder, prior to the first change point, produce ones, or zeroes? Once we know that, we don't need to encode the bit values at all. If we assume that our encoded data begins with a run of zeroes, we could represent the example above (fifteen zeroes, followed by twenty-nine ones, followed by four zeroes) by recording only [15, 44]. Let's say that we want our decoder to spit out the indices of the one bits. We'll get:
[15,16,17,18,19,20,21,22,23,24,
25,26,27,28,29,30,31,32,33,34,
35,36,37,38,39,40,41,42,43]
Using this scheme, a bitmap of all zeroes would be represented by an empty list, while a bitmap of all ones would be represented by by a single [0]; it turns on the "ones" state and never turns it off, so we just generate ones until our range of values is exhausted. (This implies that, in addition to knowing what starting state it should assume, the decoder has to know the size of the bitmap we are representing). A bitmap consisting of all zeroes followed by a single one would be represented by [N - 1] where N is the number of bits in the bitmap.

If, instead of assuming that our data starts with an implicit run of zeroes, we assume instead that it starts with an implicit run of ones, it should be obvious that we could encode the same example using [0, 15, 44]; in this case, a bitmap of all zeroes would be represented by [0], which would yield an empty list of one-bit indices, while a bitmap of all ones would be represented by an empty list, yielding a list [0..n - 1] of bit indices.

We'll assume that our function IDs are represented this way: they are the one bits in the bitmap, and we want a list of their indices. When we query our interface about its function IDs, we get back a response that consists of a short list of change points. Here are the data bytes:
00 30 90 09 31 01 10 32 20 22 12 30 23 12 41 24 34 
40 44 1C 0F C1 1D 00 D0 1D 03 D0 4D 05 D0 7D 0A D0
CD 10 D1 2D 20 D2 6D 30 D3 1D 40 D4 1D 50 D5 1F 00
The first time I encountered data like this, I wrote myself a little bit of Ruby code to decode it. Here's first part of the Ruby code, which turns these bytes into a series of 12-bit unsigned numbers by assembling three nybbles (half-bytes) at a time:
rle_raw_list = gets.chomp.delete(" ")

rle_change_func_ids = []
0.step(rle_raw_list.length - 2, 3) { |nybble_idx|
rle_change_func_ids.push(rle_raw_list[nybble_idx..nybble_idx + 2].hex)
}
Here's how it works. The first line reads data from standard input and, reading left to right, removes any end-of-line character and deletes spaces. We next create an empty list. "0.step" is cute little bit of idiomatic Ruby shorthand: everything is an object, including integers, and they have a "step" method, so by calling 0.step we loop from zero to the string length - 2, incrementing by threes. We also provide a Ruby block, which is basically a closure that receives one parameter, the loop index, and which is bound to the name nybble_idx. Working inside-out, the body of this closure takes three-byte substrings of the raw data, passes them to the hex function to yield an integer, and then uses push, which treats the list like a stack and appends the new integer to the end. The result looks like this:
[0x003, 0x090, 0x093, 0x101, 0x103 ...]
This is the list of change points. Once I have that, I can generate the runs of data. I'll assume that our data starts with an initial run of ones:
rle_state = true
MAX_FUNC_ID = 0xFFF
current_func_id = 0

for change_func_id in rle_change_func_ids
if rle_state == true
(current_func_id...change_func_id).each {|valid_func_id|
puts sprintf('0x%02X', valid_func_id)
}
rle_state = false
elsif rle_state == false
rle_state = true
end
current_func_id = change_func_id
end
Now, I should mention that I like Ruby, and especially like its support for nice idioms like blocks and closures and iterators and chaining functions. But while this Ruby code worked well enough for my little task, you might notice that it doesn't actually handle all the termination cases properly. As I work with Haskell, I'm finding that I like it quite a bit more than Ruby. One of the reasons is that Haskell code, being less imperative, seems closer to a kind of abstract, platonic representation of the problem space. Problems such as the failure to properly handle various termination cases tend to look much more obvious, particularly when functions are broken down into pieces using pattern matching.

Anyway, here's my Haskell version:
>import Char
>import Numeric

>rle_raw_str = "00 30 90 09 31 01 10 32 20 22 12 30 23 12 41 24" ++
> "34 40 44 1C 0F C1 1D 00 D0 1D 03 D0 4D 05 D0 7D 0A D0 CD 10" ++
> "D1 2D 20 D2 6D 30 D3 1D 40 D4 1D 50 D5 1F 00"
(My first opportunity to look silly: I'm sure there's a way to break a string across multiple lines with some sort of continuation character, but I could not get "\" to work together with Literate Haskell mode, so I'll just concatenate them and hope no one notices...)
>rle_bytes_str = filter isHexDigit rle_raw_str
That was easy!

Now, my second opportunity to look silly: I'm sure there is a very clever one-line fold that applies "take 3" in an mind-expanding manner, but I was unable to find it. This was about as clever as I could manage, but it does have the advantage of checking for unexpected data length:
>triplify :: String -> [String]
>triplify str | length str == 0 = []
>triplify str | length str `mod` 3 == 0 = ( take 3 str ) : triplify ( drop 3 str )
>triplify str = error "bad length (expected multiple of three)"
To see the results, evaluate "triplify rle_bytes_str." I get:
["003","090","093","101","103","220","221","230","231","241",
"243","440","441","C0F","C11","D00","D01","D03","D04","D05",
"D07","D0A","D0C","D10","D12","D20","D26","D30","D31","D40",
"D41","D50","D51","F00"]
Now I want to turn these into a list of numbers; these are the change points. Mumble, mumble...
>rle_change_points = map (fst . head . readHex) (triplify rle_bytes_str)

[3,144,147,257,259,544,545,560,561,577,579,1088,1089,3087,3089,3328,3329,3331,3332,
3333,3335,3338,3340,3344,3346,3360,3366,3376,3377,3392,3393,3408,3409,3840]
Our playing field is the range of integers 0x000..0xFFF or 0..4095 in decimal. We want to lazily generate a list of valid values using our RLE decoding scheme. Rather than walk the list of change points spitting out values using some kind of toggling state variable to keep track of whether the decoder is processing a run of ones or a run of zeroes, or walking all the possible values filtering it by the list of change points, we'll use two functions, one for extracting ranges, and one for generating values from ranges. Both functions receive a list representing "the rest of the work to do" in the form of the remainder of the list of change points, and so this is a _little_ bit like using continuation-passing style, or using a Monad to maintain the progress of the calculation. Maybe in Part 2 I'll have learned enough to explain what I mean by that!

Our first function, getRangeFromChangePoints, has to handle some interesting cases. Assuming we are decoding RLE in the range of possible values [X..Y], what happens when our list contains a final value, N, that opens a final range but doesn't explicitly close it? Well, there are two approaches we could take. The more conservative approach is to assume that this was a harmless mistake and close the range [N..N + 1) so that it contains only the value N. (Note that I'm using interval notation in which a square bracket indicates that the range _includes_ the bound specified, and a parenthesis _excludes_ the bound specified). However, there is a problem with this approach: it means we can't generate a range that includes the maximum value, Y! That's no good, so we'll assume that a final range opening with N means that we really want [N..Y] (inclusive). In fact, since we treat ranges as exclusive at the end, this is actually the only way we can generate a range containing Y.

For now, we also assume that running out of values is an error. We're assuming that our caller will handle the termination conditions.
>getRangeFromChangePoints :: [Integer] -> (Integer, Integer, [Integer])
>getRangeFromChangePoints ([]) = error "getRangeFromRLE: no change points in list!"
>getRangeFromChangePoints (a:[]) = (a, max, []) where max = 0xFFF
>getRangeFromChangePoints (a:b:[]) = (a, b - 1, [])
>getRangeFromChangePoints (a:b:ls) = (a, b - 1, ls)
Wow, that's pretty wordy as Haskell code goes, but I kind of like the way the pattern-matching breaks down. Note that in particular it makes the termination conditions obvious. This helps me think about the problem domain, rather than the implementation, in a way that the imperative Ruby code did not.

Next, here is the recursive function to generate values from ranges. We represent a range using the first two values of our tuple, while the third value is a list of the remaining ranges (sort of like our continuation). We handle the termination case, in which we have no more work to do, by just returning a list from our range, while the non-terminating case gets the next range (including the list representing our remaining work) and calls the function again to continue generating the list.
>genValuesFromRange :: (Integer, Integer, [Integer]) -> [Integer]
>genValuesFromRange (first, last, []) = [first..last]
>genValuesFromRange (first, last, to_do) = [first..last] ++
> (genValuesFromRange (getRangeFromChangePoints to_do))
To generate the initial case we use a helper function called valuesFromRLE. In ddition to the change points list, it takes a Boolean value which indicates whether we are to assume that our RLE decoding is initially inclusive or initially exclusive: that is, whether the first value in our first range indicates that the range is starting or ending. If we want it to behave inclusively, we prepend a zero value to create a range starting with zero. Note that this will be harmless if our first range starts with zero, because we'll generate a range like [0..(-1)], which will contain no members.
>valuesFromRLE :: [Integer] -> Bool -> [Integer]
>valuesFromRLE change_points initially_inclusive = genValuesFromRange $
> if initially_inclusive then getRangeFromChangePoints (0 : change_points)
> else getRangeFromChangePoints change_points
Note that we can evaluate the whole thing, but more interestingly, we can use "take" to produce only the first few elements:
take 16 $ valuesFromRLE rle_change_points True

[0,1,2,144,145,146,257,258,544,560,577,578,1088,3087,3088,3328]
And there you have it. If you like, play with the code. Next time we'll see if we can encode the function IDs back into change points, and confirm the round trip. As always, I appreciate your comments.