18 February 2016
Cancellation in floating point addition
A question I have been asking myself lately is: with floating point numbers, what is the smallest i for which
i = i + 1.0
holds due to loss of precision (cancellation)?
Of course, that depends on the format. Let us inspect the IEEE Standard for Floating-Point Arithmetic (IEEE 754): first the 32-bit binary format (float in C) and then the 64-bit one (double in C).
A float consists of a sign (1 bit), an exponent (8 bits) and a mantissa/fraction (23 bits). I won't explain this in detail because there is a wealth of books and articles covering it (e.g. Wikipedia).
If we write the sign first, then the exponent and then the mantissa, 1.0 has the (big-endian) representation
0 01111111 00000000000000000000000
The reason why the mantissa is zero is that there is an implicit "1." in front of it. And the exponent is that high because the mantissa is multiplied by 2 raised to the power of (exponent - 127). Thus, if the stored exponent is 127 (01111111 in binary), we multiply the mantissa 1.00000000000000000000000 by 2^0 = 1 and get 1.0.
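As a quick check, here is a small illustrative C snippet of my own (not part of the original post) that extracts the three fields by copying the float's bits into an integer; only the field widths and the bias of 127 are taken from the standard:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 1.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* well-defined type punning in C */
    printf("sign:     %u\n", (unsigned)(bits >> 31));          /* 0   */
    printf("exponent: %u\n", (unsigned)((bits >> 23) & 0xFF)); /* 127 */
    printf("mantissa: 0x%06X\n", (unsigned)(bits & 0x7FFFFF)); /* 0   */
    return 0;
}

The output matches the bit string above: sign 0, biased exponent 127, mantissa 0.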
To add two floats, the exponent of the smaller operand must be raised to match that of the larger, which shifts its mantissa to the right and drops the low bits; this is why addition with large numbers suffers cancellation. In particular, we have complete cancellation if the exponent of the smaller value has to be increased by (number of mantissa bits + 1), in this case 24 (why + 1? Because of the implicit "1." before the mantissa): to compute 2^24 + 1.0, the mantissa of 1.0 must be shifted right by 24 places, pushing its single set bit out of the significand entirely. Thus,
16777216.0f + 1.0f = 16777216.0f (16777216 = 2^24; the 1.0f is lost completely), but
16777215.0f + 1.0f = 16777216.0f (16777215 = 2^24 - 1; the sum is exactly representable).
(These numbers have been verified in C with the correct floating point CPU flags set.)
Similarly, for doubles (9007199254740992 = 2^53), we have
9007199254740992.0 + 1.0 = 9007199254740992.0 but
9007199254740991.0 + 1.0 = 9007199254740992.0
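These sums can be reproduced with a short self-contained C program (my sketch, not the author's test program). The volatile qualifiers keep the compiler from folding the additions at compile time, and these particular sums come out the same under the default round-to-nearest mode as under truncation:

#include <stdio.h>

int main(void) {
    volatile float  f = 16777216.0f;         /* 2^24     */
    volatile float  g = 16777215.0f;         /* 2^24 - 1 */
    volatile double d = 9007199254740992.0;  /* 2^53     */
    volatile double e = 9007199254740991.0;  /* 2^53 - 1 */

    printf("%.1f\n", (double)(f + 1.0f)); /* 16777216.0 -- the 1.0f vanished */
    printf("%.1f\n", (double)(g + 1.0f)); /* 16777216.0 -- exact result      */
    printf("%.1f\n", d + 1.0);            /* 9007199254740992.0              */
    printf("%.1f\n", e + 1.0);            /* 9007199254740992.0              */
    return 0;
}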
Generally, if you add two numbers and one is larger than the other by a factor of at least 2 raised to the power of (number of mantissa bits + 1), you will have complete cancellation (with denormals it is even more problematic). Strictly speaking, this holds exactly only if we use truncation as our rounding mode. Other rounding modes make it a bit more subtle, because two extra precision bits (guard and round) and a third, special bit (sticky) influence rounding and thus cancellation.
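To see the rounding mode actually change a result, here is a small C99 sketch of mine using fesetround from <fenv.h>. A conforming program needs #pragma STDC FENV_ACCESS ON, and some compilers additionally want -frounding-math so they do not constant-fold across the mode switch (the volatile helps with that, too):

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void) {
    /* 16777218 = 2^24 + 2; adding 1 gives 16777219, which lies exactly
       halfway between the representable floats 16777218 and 16777220. */
    volatile float x = 16777218.0f;

    fesetround(FE_TONEAREST);  /* default: round to nearest, ties to even */
    printf("%.1f\n", (double)(x + 1.0f)); /* 16777220.0 -- tie resolved to the even significand */

    fesetround(FE_TOWARDZERO); /* truncation */
    printf("%.1f\n", (double)(x + 1.0f)); /* 16777218.0 -- the 1.0f is lost */
    return 0;
}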
This has interesting consequences. In particular, since JavaScript uses 64-bit double for all numerical variables, the for loop
var start = 9007199254740992; // 2^53
for (var i = start; i < start + 2; ++i) {
    // ++i evaluates i + 1 = 2^53 + 1, which rounds back to 2^53,
    // so i never reaches start + 2
}
never terminates and hangs the VM! Do note that depending on your CPU architecture, your floating point operations might be carried out with 80-bit intermediate precision (the x87 FPU), causing unforeseen and nondeterministic results. As far as I know, this is impossible to configure directly in JavaScript, hence I had to verify the numbers with a C program that uses assembly instructions to set the floating point flags.
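The post's test program is not shown, so purely as an assumption about what setting those flags can look like, here is a sketch using GCC-style inline assembly on x86. fnstcw and fldcw store and load the x87 control word, whose precision-control field (bits 8 and 9) selects 24-, 53- or 64-bit significands:

#include <stdio.h>

/* Force the x87 FPU to 53-bit (double) precision so that intermediate
   results match IEEE 754 double. x86 with GCC/Clang inline asm only. */
static void set_x87_double_precision(void) {
    unsigned short cw;
    __asm__ volatile ("fnstcw %0" : "=m"(cw));      /* read control word  */
    cw = (unsigned short)((cw & ~0x0300) | 0x0200); /* precision = double */
    __asm__ volatile ("fldcw %0" : : "m"(cw));      /* write control word */
}

int main(void) {
    set_x87_double_precision();
    volatile double d = 9007199254740992.0; /* 2^53 */
    printf("%.1f\n", d + 1.0);              /* 9007199254740992.0 */
    return 0;
}

On x86-64 targets, compilers use SSE arithmetic by default, which evaluates float and double at their declared widths, so this adjustment mainly matters for 32-bit x87 code (alternatively, GCC/Clang accept -mfpmath=sse).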
The study of how floating point arithmetic behaves on modern systems is important to me because a programmer should know how their numerical variables behave. Also, my current side project relies on floating point determinism. So, depending on how much spare time I can invest, I will write more about this subject, though nothing in this post is new.