31 August 2006

Generic Functions and Pointers to Member Functions

People who know me know that I still carry a torch for a wonderfully-designed but little-used programming language called Dylan. In my arrogant opinion Dylan took much of the best of Common Lisp and CLOS, regularized and streamlined it, and gave it a syntax more acceptable (due to familiarity) to programmers from the C and C++ world (although this annoyed some prefix syntax fans in the process). Dylan is a multi-paradigm language, where "everything is an object," designed to support functional styles, object-oriented styles, and procedural styles.

Dylan is still alive and kicking with several impressive implementations available here including d2c, an amazing compiler that generates C code from Dylan source, and which is written primarily in Dylan itself, and the formerly commercial Functional Objects compiler, which generates native machine code. I have not studied the Functional Objects codebase very much, but I have spent time with d2c. It is an amazing piece of code. Studying it makes me wish I was able to strap on an auxilliary brain or two and take a pill that gave me a post-graduate education in language implementation without all that tedious mucking about in graduate school, so I could contribute something useful to the project instead of just an occasional bug report or ignorant question.

In Dylan, and in Dylan's sire Common Lisp with CLOS, classes contain data. Classes don't contain methods. Methods are implemented using a construct known as a generic function. A generic function lets you define several specialized functions to operate on objects of different classes. A variety of functions with the same name are bundled together; as you define additional functions, they get added to the generic function. At run-time, when you make a call to the generic function, the specific function chosen can depend on the actual run-time type (the class) of the object you send to it. Classes can inherit from one or more other classes; this inheritance tree is used to make the decision about which actual method to call for a given object. Generic functions can be specialized in other ways; they are extraordinarily flexible and powerful. But I'm only going to touch today on generic function dispatch specialized on the class of the incoming object, incoming parameters, and return type.

In his article "A First Look at Dylan: Classes, Functions, and Modules," Steve Strassmann writes:

Object-oriented languages, including Dylan, provide polymorphic functions, which means a given function may be executed as one of several possible implementations of that function, called methods... when Dylan sees a call to name(x), depending on what type of object is, one of several methods is selected and executed. In Dylan, name is called a generic function, consisting of a family of name methods that implement the functionality of name for various classes (see Figure 2). Each of these methods "belongs to" its generic function (in this case, name) rather than to a class. This is a key point; it's the core difference between C++'s object model and Dylan's.

In terms of implementation, this means that when you make a generic function call, the generic function isn't a "standard" function; a dispatch mechanism comes into play to select the proper generic function based on the call's arguments. The run-time context contained in the object is explicitly passed to the function.

Strassmann gives examples generic function; I have stripped down one of his examples somewhat here:

// No type declaration, works on any type of object
define method double (x)
   pair(x, x);
end method double;

// Works on all numbers
define method double (x :: <number>)
   2 * x;
end method double;

// Works on all strings
define method double (x :: <string>)
   concatenate(x, x);
end method double;
When double is invoked on an argument, the most specific method is invoked... for example, double("foo") would invoke the third method, because <string> is more specific than <object>, which is what the first method is specialized to. If no match is found, Dylan will catch it and signal an error.

Besides a plethora of different ways to specialize generic functions, Dylan supplies various forms of introspection; if you want to extend or alter the run-time dispatch, you can do so.

To C++ programmers, this is inside-out. In C++, classes "contain" methods, known more commonly as "member functions" because they are "members" of the class. C++ supports polymorphism and dispatch based on run-time type. I'm going to talk a little bit about how objects and member functions work together in C++ and how you can use them, and how, when we look at C++ in light of Dylan or CLOS, C++ actually gives us a subset of object dispatch, presented in an obfuscated form.

Let's say you don't want to just make a new subclass with specialized methods and allow polymorphic dispatch to operate based on run-time type, but instead you want to handle your own run-time dispatch. In Dylan you can do this using introspection and some handlers that allow you to override the standard generic function dispatch. Since you have the object and the generic function and can get at specific functions, you can just call them, passing the object.

To do the same thing in C++ you need to work around the fact that C++ obfuscates what is actually going on when you call a method. You can do this by using a slightly obscure C++ construct called the pointer to member. For the present discussion the member in question will always be a member function, but keep in mind that pointers to members can work with data members as well. This use is probably even less common since if you have an object, or a pointer or reference to an object you can access public members directly as you would access members of a struct, or via an accessor function; I would think that pointers to data members are necessary only for serious runtime hacking or compiler implementation. But the key point here is that C++ pretends that both data members and member functions are part of classes, while CLOS and Dylan much more explicitly separate these concepts.

Pointers to member functions are uncommonly used, probably in part because of its somewhat obscure syntax. The only place I've seen them used, in fact, is the Darwin kernel's IOKit source code. Recently I've had a need to customize class behaviors at runtime without subclassing and without a complete refactoring of a class into modular pieces, so I've been investigating the construct again.

The syntax is awkward, but it is consistent with the general C school of thought which says that declarations are read backwards (right to left). In C, if you have an integer variable:

int i;

to make a pointer out of it, you insert an asterisk to the left of the variable name:

int * i;

Similarly, if you have a prototype for a function returning int and taking an int parameter:

int fn ( int );

to take a pointer to it, you insert an asterisk to the left of the name:

int * fn ( int );

but the order of evaluation means that the parser reads it as a function returning a pointer to int. So you have to change the order of evaluation using parentheses:

int ( *fn ) ( int );

Now fn means a pointer to a function, not a function returning a pointer.

A member function is basically the same, with the addition of the class name and the scope resolution operator (::). When you are declaring your class, this prefix is not wanted or needed (it is implicit within the class declaration's curly braces), but when you define your methods, you do so like this:

int my_class::my_member_func( int )
{
   /* function body */
}

An aside: the separation between the class declaration and class definition is just one of the speed bumps you have to live with when using C++; it is there to support separate compilation, but if you've ever made the switch from C++ to Java you know that giving up the need to keep header files and implementation files in perfect synchronization gives you an immediate productivity boost.

If we want to make a variable that can point to that member function, we put an asterisk to the left of the name:

int my_class::*my_member_func_ptr( int );

The precedence rules get confused by this (during parsing, it probably looks like a collision between a definition of a member function and a conventional function returning a pointer with no type attached, or some such nonsense), so again we have to use parentheses:

int ( my_class::*my_member_func_ptr )( int );

and this gives us a variable, my_member_func_ptr, which can hold a pointer to a member function with the signature "returns int and takes one int parameter." Note that the parameter list, as in a function prototype or member function declaration, only requires the types of the parameters and not the names, although you can supply names for documentation purposes if you want.

Another aside: it would be easy to ridicule this syntax, but ridiculing the syntax of C++ is kind of like shooting fish in a barrel. I've shown that the syntax is at least somewhat consistent with the other C++ syntactical forms. Because of that, this particular syntax doesn't really bother me; if I can't remember it off the top of my head, I can mentally follow this derivation path and reason it out. This isn't necessarily true of a lot of other pieces of C++ syntax which do bother me and which I'm quite happy to rant about; see my piece on struct.

A pointer to member function does not have a standard implementation; the implementation (size, etc.) may differ between implementations; you should probably treat it as an opaque data structure and make no assumptions about what it actually looks like inside, unless you want to write very non-portable code. There's nothing in the standard that says that pointers to member functions have to be the same size, for example, and the standard specifically says they are _not_ just specially typed pointers; they are not convertible to and from void* by casting, as other pointer types are, using either the old-style or new-style casts.

This is because pointer to member function is not a usable pointer by itself; instead it contains whatever extra information is needed to represent a specific member function for a given class, probably an index or offset of some sort. You can imagine that the compiler needs a mechanism like this for its own use, so that it can turn

object.member_function(...)

or

object_pointer->member_function(...)

into C-style function calls with the addition of a hidden parameter:

looked_up_member_function( this, ... )

Remember our discussion on generic function dispatch? You might notice that when you rewrite the code to explicitly pass in the contest (the "this"), it starts to look like Dylan (or, ignoring the difference between infix and prefix notation, CLOS). You can actually think of the C++ object model as a much more limited and obfuscated form of CLOS generic function dispatch. How weird is that?

The "this" pointer allows the function body access to the specific object instance variables, and can also be passed on to any subsidiary member function calls on the same object. In Dylan or CLOS you would access the data members (in slots) of the incoming object explicitly; C++ gives you access to the data members and member functions of "this" implicitly, although you can still use this-> as a prefix if you want to make very clear what you are doing, or protect against accidentally accessing something global or local in the member function's namespace that might inadvertently shadow a member. This merging of namespaces is another rich source of potential errors (remember my comment about how C++ is like obfuscated Dylan?)

Note that our pointer to member function is not defined to refer to a specific member function in the given class. It can have assigned to it any member function that is part of the class (or a derived class), and that has the same signature (that is, matching return type and parameter types). This, as you might imagine, allows the compiler to do dynamic dispatch based on runtime type; you can do the same thing, based on any criteria you choose.

Note also that the variable you've created is not yet initialized. There's that obfuscation again! In a language where everything is done by reference, that isn't even quite possible; your reference will wind up referring to something. But pointers enable whole new classes of errors, and there is unfortunately no such thing as a "reference to member function" in C++. Be careful to make certain you initialize it before use, preferably at the point the variable is defined! Not doing so will cause you severe tire damage!

And, in fact, the compiler may not help you avoid this. I tested GCC by writing a call to an uninitialized regular function pointer:

void ( *fptr ) ( void );
fptr();

and it generated a helpful warning for this case. When I wrote a similar call using an uninitialized pointer to a member function, like so:

void ( Class_c::*mptr ) ( void );
( this->*mptr ) ();

GCC produced no warning at all, and as you might expect, the call caused an immediate crash. This is the likely outcome, but according to the C++ standard the behavior is undefined (this is very bad; it means the program is not required by the standard to catch this as an error; it is free to silently fail in some insidious way, or destroy your hard drive, or electrify your keyboard).

You can initialize them to zero, the null pointer constant (there is an explicitly allowed conversion), but the results of calling a null pointer to member function are also undefined, so this doesn't buy you anything; you are better off initializing them at the point where they are defined. Here is an example of how to do so:

void ( my_class::*mp ) ( void ) = &my_class::member_function;

Since pointers to member functions are hard to read, especially for member functions that that accept a long list of parameters, it is valuable to make the construct into a type using typedef. I recommend using the naming convention class name + the _kind_ of member func + "_pm_t"

But follow your own convention, or your team's convention, if you like. Keep in mind that the type can be used to represent any member function of the class (or a derived class) that matches the signature. The member function will probably represent a handler of some kind, or in design pattern terms a specific strategy. Describe what it handles, or what the strategies are trying to accomplish.

When you write the typedef, keep in mind that it looks just like the variable definition, except that the type name replaces the variable name and you put typedef in front. To define a type called my_class_SomethingHandler_pm_t, use the form:

typedef int ( my_class::*my_class_SomethingHandler_pm_t ) ( int );

Once you have a type, the following definition creates a variable with the new type and initializes it with a specific member function:

my_class_SomethingHandler_pm_t pm = &my_class::my_member_func;

This assignment can also take place in parameter binding. This means you can write a standalone function, member function, or static member function that accepts any pointer to member function matching the specified signature, and then pass it a specific one at run-time.

Once you have a pointer to member function, you can call that member function, but to do this you need an object (the "this" that will be in effect for the duration of the call). If you have a pointer to the object (which could be "this"), use the syntax:

( object_ptr->*mp )( params );

If you have a local or global variable holding an object or a reference to an object, use the syntax:

( object.*mp ) ( params );

Note that the parentheses around the first part are mandatory, to help the parser. If your object pointer or pointer to member function is stored in a structure or you are accessing it via pointer you will probably have to introduce more parentheses to make sure the parts of the expression are evaluated in the right order. This kind of construct can get ugly fast, so be careful.

In his C++ FAQ Lite section on pointers to member functions available here Marshall Cline suggest that when making calls using pointers to member functions, you always define a calling macro like this:

#define CALL_MEMBER_FN(object,ptrToMember) ((object).*(ptrToMember))

and then do the call like this:

result = CALL_MEMBER_FN(PO,PM)( PL );

but I wouldn't necessarily recommend that; C-style macros are still quite evil, and can make it very hard to find out what the compiler is complaining about, or even what it actually compiled that you didn't intend.

30 August 2006

The Kernel Programming Language and the Quest for Simplicity

I've been reading with considerable interest about the Kernel programming language under development by John Shutt. Information and a PDF file are available here.

I am not sure I fully understand the implications and details of his $vau and $wrap constructions, but I strongly support his goals, and the rest of his report kept me, literally, awake half the night reading it. Shutt writes:

...the mandate of the Kernel language is to have a clean design, flowing from a broad design philosophy as refined by a series of more specific design guidelines --- just one of which is that all manipulable entities should be first-class.

He is trying to create a Scheme-like language which is even more uniform, predictable, and configurable than R5RS Scheme.

I don't have a Ph.D. in the programming language field, so bear with me if my terminology is a little vague. It seems that the big thing Shutt is attempting is to give fine-grained control over evaluation, essentially separating out the parameter-evaluation stage of a lambda call into a distinct and fully configurable stage. This means you can do more clever things with objects without evaluating them. I think it also points you towards macro-like constructions that don't lead off into the existing slightly swampy maze of different Scheme macro implementations. It also could be a more explicit and flexible means of handling the need for quoting.

On top of that he is also doing a number of slightly smaller things (on the scale of theoretical complexity), but which I think are great.

For instance, Shutt has in his design very explicit rules for what to do with cyclic structures, a.k.a. infinite lists. While in some situations creating these would be considered an error, in others they make perfect sense: for example, in creating trees that describe recursive-descent parsing. Therefore, giving a sensible semantics to handling cyclic structures is a sensible and intriguing thing to do. If you want to write functions which aren't willing to handle cyclic structures, you can use one of his predicates to check for that possibility.

He also has addressed a number of minor irritants which don't require a deep understanding of the lambda calculus to appreciate. For instance, he proposes that boolean evaluation should never be implicit: that is, if you want to evaluate something as a boolean value, you have to use a predicate on it to generate a boolean value, and compare it to true or false. In other words, no more "anything not explicitly false is not true" or "false is false, and nil is false, but anything else is true, and true is true" or other bizarre variant on this theme.

As someone who switches languages a lot, I have always tried to make my boolean evaluations absolutely explicit to avoid the potential confusion that implicit evaluation can bring. For example, I always write (pedantically) in C++:

if ( false == stack.full() ) {
    push()...

In addition, he has addressed the question "if everything in Scheme has a value, what is the value of a control structure?" To make this explicit he proposes an immutable value called inert, somewhat analagous to void in C/C++, and which is an explicit type and which even has explict external representation.

There's lots more great stuff in Shutt's draft report and it addresses many of the weaknesses that make Scheme implementations so maddeningly inconsistent. Read it, study it, live it. I am looking forward to Shutt's doctoral thesis and I hope that there is one day very soon a freely available implementation of Kernel to try out!

17 August 2006

A Rant about Const

WARNING: THE FOLLOWING MAY CAUSE BRAIN DAMAGE

This is an unfinished essay that sat in draft form in my bog, unpublished, for several years. It seems very unfinished, but I decided to go ahead and publish it anyway. I'm not sure it actually makes sense; it certainly needs some revision. It's possibly my thinking has just been too altered by working with pure functional programming languages such as Haskell to really understand the C++ mindset anymore. There may indeed be a point buried in here somewhere, if only I can determine what it is, but it might be so trivial or pedantic as to be uninteresting to anyone but me. But I'll let you the reader decide that.

A while back I was thinking about why I couldn't qualify the return value of a function -- in this case an object -- with const. When compiling some code with GCC, such a qualifier produced a warning, somehting like "type qualifier on return type is meaningless." GCC was calling my code meaningless.

Maybe it isn't sensible, and maybe the compiler can't enforce it, but it seemed to express what I meant, so I looked this up in the comp.lang.c++ newsgroups. What I keep reading is that returning a const value, when the value is int or float or some other plain built-in data type, as opposed to a class type, is not "meaningful."

Either I'm too dumb to understand this argument or the argument isn't actually coherent, it is just widely propagated because everyone mistakes the existing C/C++ runtime behavior for a truly sensible model of language semantics.

What they seem to be actually saying is "according to the language standard, C++ is not allowed to enforce meaningful const semantics on return types that are not reference or pointer types. And pointer types are hopelessly unsafe. But references can be pretty unsafe too."

There was a thread about this on comp.lang.c++, specifically about why you can't return a const int. One of the comments was:

"A value returned by a function is a rvalue, unless it is a reference. The notion of being (or not being) const-qualified is not applicable to rvalues of non-class types ('int' in your case). A rvalue cannot be modified just because it is a rvlaue. That's it. It doesn't make any difference whether you const-qualify it or not. That's what your compiler is trying to tell you."

See: http://groups.google.com/group/comp.lang.c++/browse_thread/thread/4e50f8abf9c11f71/6039d5b2e6612df0%236039d5b2e6612df0

As far as I can tell, the C++ standard specifies this because of the way returning values from functions has usually been implemented in C/C++ (by copying on the stack). So the caller isn't actually getting the "same" object or integer value, and in fact the original will be gone.

What this means is actually that you can freely change the return value:

int y = f(x);
y = y + 1;

But this is not actually "changing the rvalue" because the rvalue was actually copied into y; you're just changing a copy. So the statement above is correct in that the rvalue "cannot be modified." (It means that only indirectly).

In cases even when the compiler puts the return values in registers, or wherever, the standard rules about lvalues and rvalues still apply, for sanity and safety and consistency.

But it is not what I want to express when I use const with a return value -- consider it an annotation, if you will, rather than something the compiler will enforce. In the methods that generated warnings, I was using const return values to mean:

Case 1: what I'm returning is the result of a mathematical function call: f(x) always returns the same value for the given x, in a functional programming sense, so it wouldn't make sense for you to change the value I'm returning to you. If my function f is f(x) = 1 + x, then f(1) = 2. It doesn't make any sense to then say f(1) = 3 later. But of course this makes more sense in Haskell, where variables don't "vary," than in C or C++.

Case 2: what I'm returning is the result of an iterator and represents the state of another object at a given moment in the life of the program. Therefore, you shouldn't change the value itself; if you want to update it, call the iterator to get a new updated value. If say iterator_state = iterator:GetNextValue() and get a new iterator_state back, it doesn't make any sense to say iterator_state = iterator_state + 1 without calling GetNextValue(). Of course, it expresses something in C++, but your program is confusing. It would be less confusing, readability-wise, if you explicitly copied iterator_state to a working variable before altering it.

It seems to me that I ought to be able to expressthe notion that the caller of one of my object's methods should not update the object I'm returning, whether "returning" means binding it via a pointer, a reference (hidden pointer), or by copying the rvalue.

If it is a const u32_t I'm returning, the compiler should insist that the value be assigned only to a const u32_t. (It doesn't).

If I'm returning a const object, the compiler, it seems to me that the compiler should be able to require that it is assigned to a const variable. (It doesn't, but at least it doesn't generate the warning in those cases). This is because returning an object on the stack does not constitute returning an "rvalue."

This is, as far as I can tell, a completely arbitrary distinction, with the sole exception that copying an object gets to invoke the copy constructor (default, or one you provide), whereas copying a built-in type just does it for you.

In a similar thread Greg Comeau wrote:

"Also, returning a const T, where T IS NOT A BUILTIN TYPE many not be meaningless, because you may end up calling member functions on that object, and the member function may need to be a const member function."

It is "meaningful" in the sense that you may want to put the returned object into a const variable to _prevent_ calling any non-const member functions. But Greg phrased it backwards. You can always call const member functions on non-const objects. He should have said "you may end up calling member functions on that object, and you might want to prevent the calling of non-const member functions."

Example:

static AZMMultiZoneRef_c const MakeFromZonesBitmap( u32_t zones_bmp );
const AZMMultiZoneRef_c zones_ref = AZMMultiZoneRef_c::MakeFromZonesBitmap( p_data->zones_bitmap );
u32_t AZMMultiZoneRef_c::MutateMe( void_t );
zones_ref.MutateMe();

The compiler will (rightly) complain about this. But it will not complain about calling const member functions on a non-const object. The intent is that you should not be able to mutate an object that is const, except using the mutable workaround to allow hidden state changes.

However, the rules for returned objects let us completely get around this! If I don't like that restriction, I can just assign the return value to a non-const object. So this is legal:

AZMMultiZoneRef_c zones_ref = AZMMultiZoneRef_c::MakeFromZonesBitmap( p_data->zones_bitmap );
zones_ref_2.MutateMe();

Not much of a restriction, then, is it?

OK, so let's say we're more serious about enforcing const. So, we'll use a pointer or reference instead, because that will maintain the const enforcement restrictions, right? BUT... the semantics are actually quite different, and there is a serious safety hole introduced:

"When returning references, a reference bound to the return value is not valid after object deletion; when returning values, a reference bound to the return value is valid after object deletion."

And, of course, if what you are returning a pointer or reference to is a an object in a local variable, that is a seriously broken program right there. So it is kind of a "damned if I do, damned if I don't" situation.

C++ is actively hostile to functional programming. In fact, the "safety features" of the language can't even be used consistently. And with this compiler, when you try to express what you mean (taken as a given that the compiler won't enforce it), it treats it as a warning!