18 February 2008

Is Python an Acceptable Haskell?

So I struggled with some Python this weekend. I'm taking over a piece of code that is written in Python, but the original is very imperative, with a lot of for loops. It was giving me a headache. It just looked way too long and tedious. I'm too old to spend my life debugging off-by-one errors in for loops.

After staying up half the night struggling with a malfunctioning IDLE on Mac OS X, I finally discovered the cure for the headache was -- more Python!

# This gives us a "curry" equivalent
from functools import partial

# Reduce a list of strings to a single string by concatenation
reduce_stringlist = partial( reduce, str.__add__ )

# Given a list of XML elements, return a list of only the data from only the text nodes
# (ignores any non-text nodes)
def get_text_data( nodelist ):
    return map( lambda node: node.data,
           filter( lambda node: node.nodeType == node.TEXT_NODE, nodelist ) )

# Now, combined: reduce a list of nodes to the concatenated text extracted out of only
# the text node data
def extract_text( nodelist ):
    return reduce_stringlist( get_text_data( nodelist ) )

# Given a portion of the parsed XML tree and a name, extract a single string. Expect
# only one element.
def get_one_text_element( node, name ):
    elt_list = node.getElementsByTagName( name )
    assert elt_list.length == 1
    return extract_text( elt_list[0].childNodes )

# Given a starting node and a list of tag names, retrieve one text string each and
# put them in a newly created dictionary using the tag name as a key
def make_text_element_dict( node, namelist ):
    return dict(
        zip( namelist,
             map( partial( get_one_text_element, node ),
                  namelist ) ) )
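To see the whole pipeline end to end, here is a self-contained sketch of the helpers above driving a made-up example (the book/title/author XML is mine, not from the real program; this assumes Python 3, where reduce has moved into functools):

```python
from functools import partial, reduce
from xml.dom import minidom

reduce_stringlist = partial(reduce, str.__add__)

def get_text_data(nodelist):
    # Keep only text nodes, then pull out their .data strings
    return map(lambda node: node.data,
               filter(lambda node: node.nodeType == node.TEXT_NODE, nodelist))

def extract_text(nodelist):
    return reduce_stringlist(get_text_data(nodelist))

def get_one_text_element(node, name):
    elt_list = node.getElementsByTagName(name)
    assert elt_list.length == 1
    return extract_text(elt_list[0].childNodes)

def make_text_element_dict(node, namelist):
    return dict(zip(namelist,
                    map(partial(get_one_text_element, node), namelist)))

# Hypothetical input document, just for illustration
doc = minidom.parseString(
    "<book><title>Dune</title><author>Frank Herbert</author></book>")
info = make_text_element_dict(doc.documentElement, ["title", "author"])
print(info)  # {'title': 'Dune', 'author': 'Frank Herbert'}
```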

Ahhh, I feel much better! All that excess hairy boilerplate seems to be just melting away!

The only thing I did not like was that there did not seem to be a nice way to specify keyword arguments to "reduce" so that I could create a partial (curried) application that supplied the first and optional third parameter, leaving the second as the one to be supplied at runtime.
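A quick sketch of the limitation (my illustration, assuming Python 3 where reduce lives in functools): partial() will happily record a keyword argument, but reduce() is implemented in C with positional-only parameters, so the call blows up at invocation time, and a plain function is the workaround.

```python
from functools import partial, reduce

# partial() records the keyword, but reduce() takes no keyword arguments,
# so the failure only shows up when the partial is finally called.
bad = partial(reduce, str.__add__, initial="")
try:
    bad(["a", "b"])
except TypeError:
    print("reduce() rejects keyword arguments")

# A plain function binds the first and third positions, leaving the second open:
def reduce_stringlist(strings):
    return reduce(str.__add__, strings, "")

print(reduce_stringlist(["foo", "bar"]))  # foobar
print(repr(reduce_stringlist([])))        # ''
```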

I experimented with writing the above using Python's list comprehension idiom, generators, and various permutations on join(). The above just seems to make more sense to me. Haskell has apparently ruined me for other languages.

Apparently Guido would like to ban reduce() from Python. I say -- prefer the standard idiom to the offbeat, and avoid the Not Invented Here syndrome. Python's comprehensions and generators are nice, but I've been spoiled by seeking more and more expressive languages, seeking truth and beauty on my wandering but inexorable path from Dylan to NewtonScript to Scheme to Haskell. Lambda, map, curry, reduce, and zip now seem to me to be fundamental primitives that any reasonable dynamic language ought to provide, and they ought to be preferred to an obscure language-specific trick. As long as I have to use Python, Guido will have to pry the curried functions from my cold, dead hands!


Justin said...

return map( lambda node: node.data,
filter( lambda node: node.nodeType == node.TEXT_NODE, nodelist ) )

is much simpler when written as a LC:

return [node.data for node in nodelist if node.nodeType == node.TEXT_NODE]

Erik said...

Just posted this on reddit but thought I'd show you here, as well:

You used:

reduce_string = partial(reduce, str.__add__)

Better and, even more functional would be:

reduce_string = ''.join

which could be seen as the point free style to:

reduce_string = lambda ls: ''.join(ls)
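Erik's point can be checked directly (a tiny sketch of my own): ''.join is already a bound method, so it works as a callable with no wrapper, and it handles the empty list gracefully.

```python
# ''.join is a bound method, hence already point-free
reduce_string = "".join
print(reduce_string(["foo", "bar"]))  # foobar
print(repr(reduce_string([])))        # '' -- no error on the empty list
```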

anonymous_479843 said...

Wow, I guess python is not haskell! Why don't you use something like this:

def extract_text( nodelist ):
    return u''.join( node.data for node in nodelist if node.nodeType == node.TEXT_NODE )

It's much more "pythonic". List comprehensions and generator expressions are usually easy to read and avoid boilerplate code.

Bill Mill said...

Just to clarify, reduce() isn't gone from python 3, it's just moved into the functools library.

Bill Mill said...

Also, you like your get_text_data better than:

[node.data for node in nodelist if node.nodeType == node.TEXT_NODE]

? Really? Did I mis-translate between idioms there?

Bill Mill said...

In fact, your whole extract_text seems to be equivalent to this (more idiomatic, IMHO) python:

"".join(node.data for node in nodelist if node.nodeType == node.TEXT_NODE)

To me, that just seems miles better, but I'm certainly not used to that much filter, map, and reduce; I also understand that "".join is fingernails on a chalkboard for many people.

Paul R. Potts said...

Thanks to everyone who commented. Yes, I am aware of the list comprehension syntax [node.data for node in nodelist if node.nodeType == node.TEXT_NODE], and aware of the ''.join idiom. The join idiom strikes me as being like the Ruby idiom 3.times {}, which is similarly cute (trumpeting "everything is an object"), but when I'm thinking in functions and data, not objects, it seems jarring.

I may wind up using the list comprehension syntax in the final code instead of the map/filter combination. I agree that there is a benefit to the LC form, in that I can specify the additional step of looking up each .data field along with the predicate. I appreciate Haskell's comprehension syntax, so I ought to be able to appreciate Python's.

Oh, I also found out there is a bug in the code. My use of partial(reduce, str.__add__) breaks in degenerate cases with a runtime error about not supplying an initial value. So the curry-like "partial" doesn't really serve me well in that case, and it is safer to write:

def reduce_stringlist( strings ):
    return reduce( str.__add__, strings, '' )

It would work better if the initial parameter came second in the parameter list (and were not optional) so that I could curry it away. I'm also not certain what the benefit is of making this parameter optional; it implies the implementation of reduce() is deciding what to do in the degenerate case for some incoming parameter types but not others. I'd consider trying to fix it, but given his written opinions on the matter I suspect GvR would not be charitable towards any patches to reduce(). If there were a built-in reduce_stringlist() that was not a method, I'd pretty much be willing to get over my attachment to reduce().
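The degenerate case is easy to reproduce (a small check of my own, using Python 3's functools.reduce): an element with no text children hands reduce() an empty sequence, and without an initial value it raises.

```python
from functools import partial, reduce

reduce_stringlist = partial(reduce, str.__add__)

# With no initial value, reduce() has nothing to return for an empty sequence
try:
    reduce_stringlist([])
    print("no error")
except TypeError:
    print("TypeError: no initial value to fall back on")

# Supplying '' as the initial value makes the empty case well defined
print(repr(reduce(str.__add__, [], "")))  # ''
```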

I am not really trying to convert any Python fans towards my way of thinking, and as I get accustomed to Python I will probably wind up with a greater appreciation of the Python way of doing things. For the moment, though, the functional idioms that are supported were a big help in getting me "across the divide" and feeling like I could get productive again. I can appreciate GvR's general desire to want to provide one way to do things, because at the other extreme lies Perl.

The only remaining syntactic complaint I have is that I could not easily determine how to pass the string concatenation operator as a parameter. It would be nice to have a syntactic trick like \+ to get to the unevaluated symbol. The separate "implementation name" __add__ seems kind of ugly.

Bill Mill said...

Agreed about the annoyance of the + operator, I really like that part of Haskell's syntax. The more idiomatic way of doing that is to use operator.add.

Also, I understand not liking "".join; it's not my favorite either. But turning 7ish lines into 1 and saving yourself a temporary list allocation seems like the right tradeoff to me.
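For concreteness (my sketch, assuming Python 3): operator.add is the standard library's ready-made function form of +, so it drops into reduce without touching __add__ at all.

```python
import operator
from functools import reduce

# operator.add is the named-function form of '+', for strings or numbers alike
print(reduce(operator.add, ["foo", "bar", "baz"], ""))  # foobarbaz
print(reduce(operator.add, [1, 2, 3], 0))               # 6
```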

Paul R. Potts said...

Maybe a good compromise for the reduce_string would be a standard library equivalent to the sum() function, but for strings?

Maybe there already is one, but I haven't seen it yet?
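As it turns out, there isn't one: CPython's sum() explicitly refuses a string start value and steers you toward ''.join instead. A quick check (my addition):

```python
# sum() special-cases strings and raises rather than concatenating them
try:
    sum(["a", "b", "c"], "")
    print("summed")
except TypeError:
    print("sum() refuses string arguments; use ''.join instead")
```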

I've also seen reference to '+' on strings operating much more slowly than the join() method. I'm not currently worried very much about efficiency for this program, but everything else being equal I'd certainly want to choose the more efficient implementation.
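A rough benchmark of the two approaches (my sketch; absolute numbers depend on the machine and interpreter version, since CPython has since optimized some += cases, so treat them as relative only):

```python
import timeit

words = ["word"] * 1000

def concat_plus():
    # Repeated '+' builds a new string on every iteration
    s = ""
    for w in words:
        s += w
    return s

def concat_join():
    # join allocates the result once
    return "".join(words)

print("+  loop :", timeit.timeit(concat_plus, number=200))
print("''.join :", timeit.timeit(concat_join, number=200))
```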