07 December 2006

The Division Bell Tolls for Me, Part Four (Conclusion)

Well, this was initially going to be a three-part article, but things have become more interesting. The situation is not unlike picking up a log in the forest and finding all kinds of nifty crawlies living under it. "Interesting" to certain geeky people like myself, at least! It's about to get geekier. If you are allergic to ANSI or ISO standards, you'd better stop reading now. But I claim the payoff is worth it, if you can make it through to the end. At least, for some value of "worth it."

In the last installment we were exploring the behavior of integer division in Haskell. I want to get back to C now. We learned that the behavior of integer division involving negative results is not strictly defined for C. The behavior of % (mod) is also not very strictly defined.

Let's start with Kernighan and Ritchie's The C Programming Language, Second Edition (I'm not even going to try to figure out rules for pre-ANSI/ISO C). K&R tell us on p. 10 that

...integer division truncates; any fractional part is discarded.

That isn't very detailed, but on p. 41 they tell us

the direction of truncation for / and the sign of the result for % are machine-dependent for negative operands, as is the action taken on overflow or underflow.

OK, but that doesn't describe exactly what the options are. Things get a little bit more detailed on p. 205, where we read that no guarantees are made about the sign of the remainder!

...if the second operand is zero, the result is undefined. Otherwise, it is always true that (a/b)*b + a%b is equal to a. If both operands are non-negative, then the remainder is non-negative and smaller than the divisor; if not, it is guaranteed only that the absolute value of the remainder is smaller than the absolute value of the divisor.

This isn't actually internally consistent, because if the sign of the remainder is not correct, then the identity for division doesn't work for negative values unless the sign of the quotient is allowed to be incorrect! I've never seen it suggested anywhere else that division could be that loosely defined. But K&R isn't the formal standard, so let's move on.

The reason for the "looseness" that exists in the C standard, of course, is that the standard was originally written to, as much as possible, codify existing practice, and C has a strong bias towards efficiency. Since (apparently) different hardware division algorithms had different behavior, the standards committee did not feel that it was appropriate to require existing implementations to change behavior. Doing so might have required existing architectures to have to replace hardware division with software division, which could have inflicted an enormous efficiency cost.

According to ISO/IEC 9899:1990 (known as C90):

When integers are divided and the division is inexact, if both operands are positive the result of the / operator is the largest integer less than the algebraic quotient and the result of the % operator is positive. If either operand is negative, whether the result of the / operator is the largest integer less than or equal to the algebraic quotient or the smallest integer greater than or equal to the algebraic quotient is implementation-defined, as is the sign of the result of the % operator. If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a.

So we know that there are two options for rounding. In order to maintain the identity involving integer division and mod, though, the value of mod has to be consistent with the rounding strategy, even if the sign does not have to be consistent.

There is a back door that lets us get to defined behavior. Apparently the div function does have strictly defined rounding. In section 7.10.6.2 of the standard, we read:

If the division is inexact, the resulting quotient is the integer of lesser magnitude that is the nearest to the algebraic quotient.

That's interesting -- this is rounding towards zero. The div function returns a structure of type div_t that contains both the integer quotient and integer remainder. It is provided in the <stdlib> header, available in C++ as <cstdlib>. In his book The Standard C Library, P. J. Plauger provides an implementation for div. It looks like this:

div_t (div)(int numer, int denom)
    {        /* compute int quotient and remainder */
    div_t val;
    val.quot = numer / denom;
    val.rem = numer - denom * val.quot;
    if (val.quot < 0 && 0 < val.rem)
        {    /* fix remainder with wrong sign */
        val.quot += 1;
        val.rem -= denom;
        }
return (val); }

We can see what he is doing: for negative quotients that are inexact answers to the division problem, he shifts the quotients up towards zero and the remainders get reversed. Note that he does not use the built-in % facility, presumably since its behavior does not have very strong guarantees placed on it.

C: A Reference Manual (5th Edition) by Harbison and Steele seems to indicate that the semantics are a little more rigorously defined. We read there that

For integral operands, if the mathematical quotient of the operands is not an exact integer, then the fractional part is discarded (truncation toward zero). Prior to C99, C implementations could choose to truncate toward or away from zero if either of the operands were negative. The div and ldiv library functions were always well defined for negative operands.

That seems pretty unambiguous, but then on page 419, when describing the div and ldiv function, the authors write

The returned quotient quot is the same as n/d, and the remainder rem is the same as n%d.

But that ignores the possibility of a pre-C99 implementation where / rounds away from zero for negative values, as does GHCi in my tests. So again we have a lack of rigorous internal consistency. It is worth noting here that K&R don't make any extra statements about the required rounding behavior of div and ldiv, and since K&R is still considered the "bible" by many C programmers, the guarantees on the div and ldiv function may not be very well-known -- or properly implemented.

Does C99 clarify this at all? ISO/IEC 9899:1999(E), Second Edition 1999-12-01, tells us that "the result of the / operator is the algebraic quotient with any fractional part discarded" and a footnote indicating that this is often called "truncation towards zero."

How about C++? ISO/IEC 14882, Second Edition (2003-10-15) does not actually say how rounding is performed, although it says that "If both operands are nonnegative then the remainder is nonnegative; if not, the sign of the remainder is implementation-defined." There is a footnote indicating that "According to work underway toward the revision of ISO C, the preferred algorithm for integer division follows the rules defined in the ISO Fortran standard" (rounding towards zero). When we try to look up information about div and ldiv, we find that the C++ standard just refers us to the Standard C library.

OK, all the standards language is giving me a headache; let's take a look at some code. To begin with, let's confirm the way my C implementation does rounding of integer quotients. We took a look at 6/5, -6/5, 5/6, and -5/6 in Haskell; let's look at the same quotients in C:

int val_1 = 6;
int val_2 = 5;
int val_3 = -6;
int val_4 = -5
int result_1 = val_1 / val_2;
int result_2 = val_3 / val_2;
int result_3 = val_2 / val_1;
int result_4 = val_4 / val_1;
printf("result_1 = %d, result_2 = %d, result_3 = %d, result_4 = %d\n",
    result_1, result_2, result_3, result_4);

This gives me:

result_1 = 1, result_2 = -1, result_3 = 0, result_4 = 0

Yep, I'm getting rounding towards zero for both positive and negative values. Next, some Haskell. Let's bring back our algorithm from part 2:

let gen_nths denom =
    [(numer, denom::Int) | numer <- [(-denom::Int)..(denom::Int)] ]

let div_map m1 m2 = map (\tup -> (fst tup + snd tup) *
    (((m2::Int) - (m1::Int)) `div` 2) `div` snd tup + m1)

div_map (-10) 10 (gen_nths 3)

[-10,-7,-4,0,3,6,10]

And now let's write up my C algorithm again. This time we'll use int instead of long since we don't need big numbers (although on my target, ints are actually 32 bits long).

unsigned int denom = 3;
int numer;
for ( numer = -3; numer <= 3; numer++ )
{
    int quot = numer * 10 / denom;
    printf("quotient %d\n", quot);
}

Go ahead and try it on your platform.

Whoah! What happened? You might have noticed something very odd about the first few quotients. They are monstrously large! On my target I get:

quotient 1431655755
quotient 1431655758
quotient 1431655762
quotient 0
quotient 3
quotient 6
quotient 10

What happened is that I injected a bug to point out yet another possible way your C code can fail catastrophically. This behavior can surprise even experienced programmers. Note that since my denominator is always a positive value, I declared it as an unsigned int instead of an int. So what went wrong?

Let's look at the first example: -3/3 yielding 1431655755. The behavior you are looking at is actually mandated by the C standard. When mixing signed and unsigned types, the compiler is required to promote (that is, widen) the actual type of the calculation to an unsigned type. So, internally, if we are performing the calulation -22 (signed) / 7 (unsigned), the compiler is required to notice that both a signed and unsigned long value are presented to the integer division operator. The 32-bit twos-complement representation of -3 is 0xFFFFFFFD (the conversion to an unsigned intermediate value changes the meaning, not the representation). I multiply this value by 10, yielding 0xFFFFFFE2, and then divide it by 3, yielding 0x5555554B, or a decimal value of 1431655755. The compiler sticks this unsigned value right into the signed result variable. The signed or unsigned status of the destination variable has absolutely no effect; it does not dissuade the compiler from deciding to treat the values as unsigned. (It is a common misconception to think that the destination value influences the type of the calculation in C).

If you have a target with 16-bit ints, your rolled-over values will be different (0xFFFD, 0xFFE2, and 0x554B, or decimal 21835). So the mixing of signed and unsigned values has not only given us the wrong answer, but made the code that generates the wrong answer produce inconsistent results depending on the size of int, even when the values we start with are not actually too big to fit into 16 bits.

If that wasn't perfectly clear, and even if it was, if you are really trying to write portable numeric code in C that has to be correct, I urge to consider purchasing the MISRA (Motor Industry Software Reliability Association) book MISRA-C:2004, Guidelines for the Use of the C language in Critical Systems. It has the most detailed explanation of C's arithmetic type conversions, and more importantly, the implications of these type conversions in real code. The crux of the biscuit is that C is hard to get right, even for good programmers.

OK, so let's change my denominator back to a signed long and try it again. This time the results look more reasonable: -10, -6, -3, 0, 3, and 6.

Are things any different if we use div?

int denom = 3;
int numer;
for ( numer = -3; numer <= 3; numer++ )
{
    div_t val = div(numer * 10, denom);
    printf("quotient %d\n", val.quot);
}

In my case, no, since on my compiler and target, / already seems to round towards zero (asymmetrically) while Haskell rounds symmetrically downwards (floor behavior).

By the way, while we're at it, let's see what C has to say about the "weird" number divided by -1:

long most_negative_long = -2147483648;
long minus_one = -1;
long result = most_negative_long / minus_one;
ldiv_t ldiv_result = ldiv(most_negative_long, minus_one);
printf("results: %lX, %lX\n", result, ldiv_result.quot);

What do you get? I get zero as the result of both the / and ldiv.

I also get GCC telling me "warning: this decimal constant is unsigned only in ISO C90." What does that mean? The compiler is apparently warning us that in C99 we won't be getting an implicitly unsigned value out of the constant -2147483648, in case we might have wanted to use this constant in an expression involving signed types to force the calculation to be promoted to an unsigned type. But why would we get an unsigned value in the first place? Apparently in C90, the number's magnitude is interpreted first, and the value is negated. 2147483648 is too large (by one) to fit into a signed long, so the compiler promotes the type to an unsigned long, then negates it.

I have been trying to come up with an example of an expression that behaves differently when I use -2147483648 as a constant, as opposed to 0x80000000, but so far I haven't been able to come up with one. This may be because my compiler is already operating under C99 rules and so I never get the implicit promotion to an unsigned value.

Anyway, be that as it may, some parting advice and conclusions:

1. GHCi (on the PC and Mac) and C (Visual C++ 6.0 and GCC on the PC and Mac) yield distinctly different rounding behavior for integer division. C90 allows this behavior. Does Haskell? Would it make sense for Haskell to define integer division more strictly, the way that C99 does?

2. Division by zero is never OK; neither is taking % zero. Both are a fatal error in GHCi. The results are undefined in C (but a fatal error is generated by most implementations).

3. Overflow (which, in the case of integer division, occurs only when the "weird number" is divided by -1) produces undefined results, and your code should avoid it, both in Haskell and in C.

4. Operations on unsigned values are guaranteed to produce results that are consistent from platform to platform, assuming the integer size is the same. Operations on signed values don't have this guarantee.

5. The rounding (truncation) of inexact quotients may be either towards zero in all cases, or towards zero for positive quotients and away from zero when the results are negative. If your implementation follows the newer C99 standard rounding should always be towards zero.

6. Mixing signed and unsigned values in C expressions is very dangerous. If you do so, your implementation may have errors far more severe than differences in rounding behavior.

7. Consider avoiding signed division of negative values in C altogether, if your algorithm allows it. Either implement the algorithm using entirely unsigned integral types, or shift the domain of your function so that you are operating on non-negative signed values, and shift it back (using subtraction, which has well-defined semantics) when your division is done.

8. Algorithms which feed a range of values which cross the origin to integer division may vary between GHCi and C implementations. Because GHCi does not provide guaranteed rounding towards zero as C99 and the div and ldiv functions require, it is difficult to prototype in GHCi and expect to get the same results in C.

And, finally,

9. Make sure your implementation actually works the way it is supposed to. Only testing can truly accomplish this.

Interesting, isn't it, how studying one language can enrich your understanding of another! At least, it works for me!

4 comments:

The Alternate Moebyus said...

Paul, this is really really useful. Thank you.

Explains the common saying you hear after hanging around people who code numerical stuff: "never, ever, mix unsigned and signed types."

Paul R. Potts said...

Thanks. They say that all open source software projects start out as someone trying to scratch a personal itch. I generally start writing things like this in an attempt to codify what I've learned from bugs in my own code, in the hopes that someone else will be able to learn something from my process as well.

Unknown said...

When I try your "weird" number divided by -1 (using the C code from your post), I get "Floating point exception" and a program crash. This is on a Pentium-M; on what platform do you get zero?

Paul R. Potts said...

Hi Carl,

I have tried the C code on MacOS X 10.4 using the freely available XCode toolchain (which uses GCC).

The C standard says that the behavior of division under these circumstances is undefined. All environments I've worked with produce some kind of a program-terminating crash when dividing by zero, but I generally would not have expected overflow in integer division to produce a crash. It certainly doesn't occur in multiplication or addition which overflows, or subtraction which underflows. However, according to Wikipedia the IA-32 produces an exception when the "weird number" is divided by -1; see

http://en.wikipedia.org/wiki/SIGFPE

I guess the question then becomes what the OS and/or the program does with that exception. I don't really know what the answers to those questions are for any given OS and application, although it might be interesting to find out.