### 18 February 2016

## Cancellation in floating point addition

A question I have been asking myself lately is: with floating point numbers, what is the lowest `i` for which `i == i + 1.0`, due to loss of precision (cancellation)?

Of course, that depends on the format. Let us inspect the IEEE Standard for Floating-Point Arithmetic (IEEE 754), first the binary 32-bit version (float in C) and then the 64-bit version (double in C).

A float consists of a sign (1 bit), an exponent (8 bits) and a mantissa/fraction (23 bits). I won't explain this in much detail because there is a wealth of books and articles explaining it (e.g. Wikipedia).

If we write the sign first, then the exponent and then the mantissa, 1.0 has the (big-endian) representation

```
0 01111111 00000000000000000000000
```

The mantissa is zero because there is an implicit "1." in front of it, and the exponent is that high because the mantissa is multiplied by `2` raised to the power of `(exponent - 127)`. Thus, if the stored exponent is `127` (`01111111` in binary), we multiply the mantissa `1.00000000000000000000000` by `2^0 = 1` and get `1.0`.
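To see this layout on a real machine, here is a minimal sketch (assuming an IEEE 754 `float` that fits exactly into a `uint32_t`, which holds on all common platforms) that extracts and prints the three fields of `1.0f`:

```
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 1.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);          /* reinterpret the float's bytes */

    uint32_t sign     = bits >> 31;          /* 1 bit   */
    uint32_t exponent = (bits >> 23) & 0xFF; /* 8 bits  */
    uint32_t mantissa = bits & 0x7FFFFF;     /* 23 bits */

    /* prints: 0 01111111 00000000000000000000000 */
    printf("%u ", sign);
    for (int i = 7; i >= 0; --i)  putchar('0' + ((exponent >> i) & 1));
    putchar(' ');
    for (int i = 22; i >= 0; --i) putchar('0' + ((mantissa >> i) & 1));
    putchar('\n');
    return 0;
}
```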

Since the smaller operand must have its exponent increased (and its mantissa shifted right accordingly) to match the larger one before the addition, we get cancellation when adding numbers of very different magnitude. In particular, we have **complete** cancellation if we have to increase the exponent of the smaller value by `(number of mantissa bits + 1)`, in this case 24 (why + 1? Because of the implicit "1." before the mantissa). Thus,

```
16777216.0f + 1.0f = 16777216.0f   (16777216 = 2^24)
but
16777215.0f + 1.0f = 16777216.0f   (16777215 = 2^24 - 1)
```

(these numbers have been verified in C with the correct floating point CPU flags set)

Similarly, for doubles, we have

```
9007199254740992.0 + 1.0 = 9007199254740992.0   (9007199254740992 = 2^53)
but
9007199254740991.0 + 1.0 = 9007199254740992.0   (9007199254740991 = 2^53 - 1)
```
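A minimal C check of both claims, assuming the default round-to-nearest mode and no 80-bit intermediate precision (see the note on x87 arithmetic below):

```
#include <assert.h>
#include <stdio.h>

int main(void) {
    /* volatile keeps the compiler from folding the sums at compile time */
    volatile float  f_big   = 16777216.0f;        /* 2^24     */
    volatile float  f_below = 16777215.0f;        /* 2^24 - 1 */
    volatile double d_big   = 9007199254740992.0; /* 2^53     */
    volatile double d_below = 9007199254740991.0; /* 2^53 - 1 */

    assert(f_big   + 1.0f == 16777216.0f);        /* the 1.0f is absorbed */
    assert(f_below + 1.0f == 16777216.0f);        /* this sum is exact    */
    assert(d_big   + 1.0  == 9007199254740992.0);
    assert(d_below + 1.0  == 9007199254740992.0);

    puts("all four additions behave as described");
    return 0;
}
```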

Generally, if you add two numbers and one is larger than the other by a factor of at least 2 raised to the power of `(number of mantissa bits + 1)`, you will have complete cancellation (with denormals it is even more problematic). This only applies exactly if we use truncation as our rounding mode. Other rounding modes make this a bit more subtle because there are 2 more precision bits (guard, round) and a third, special precision bit (sticky) that influence rounding and thus cancellation.
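To make the rounding-mode dependence concrete, here is a small sketch using C99's `fesetround` from `<fenv.h>`. The `volatile` qualifiers (and, with GCC/Clang, possibly `-frounding-math`) keep the compiler from folding the sum at compile time:

```
#include <fenv.h>
#include <stdio.h>

int main(void) {
    volatile float big = 16777216.0f; /* 2^24 */
    volatile float one = 1.0f;

    fesetround(FE_TONEAREST);  /* default: 2^24 + 1 ties to even */
    printf("to nearest : %.1f\n", big + one); /* 16777216.0 */

    fesetround(FE_UPWARD);     /* rounds up to the next float    */
    printf("upward     : %.1f\n", big + one); /* 16777218.0 */

    fesetround(FE_TOWARDZERO); /* truncation, as discussed above */
    printf("toward zero: %.1f\n", big + one); /* 16777216.0 */
    return 0;
}
```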

This has interesting consequences. In particular, since JavaScript uses 64-bit doubles for all numerical variables, the for loop

```
var start = 9007199254740992; // 2^53
for (var i = start; i < start + 2; ++i) {
}
```

never terminates and hangs the VM, because at `2^53` the increment `++i` rounds back down to `2^53`, so `i` never reaches `start + 2`! Do note that depending on your CPU architecture, your floating point operations might be calculated with 80-bit intermediate precision, causing unforeseen and nondeterministic results. As far as I know, this is impossible to configure directly in JavaScript, hence I had to test the numbers with a C program using assembly instructions to set the floating point flags.
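For reference, on x86 with glibc the same effect as such assembly can be had through the `_FPU_GETCW`/`_FPU_SETCW` macros from `<fpu_control.h>`. The sketch below is glibc/x86 specific and unnecessary on typical x86-64 builds, which use SSE2 rather than the x87 FPU:

```
#include <fpu_control.h>
#include <stdio.h>

int main(void) {
    /* Restrict the x87 FPU to a 53-bit significand so intermediate
       results are computed in double, not 80-bit extended precision. */
    fpu_control_t cw;
    _FPU_GETCW(cw);
    cw &= ~_FPU_EXTENDED; /* clear the precision-control field */
    cw |= _FPU_DOUBLE;    /* select double (53-bit) precision  */
    _FPU_SETCW(cw);
    puts("x87 precision control set to 53 bits");
    return 0;
}
```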

The study of how floating point arithmetic behaves on modern systems is important to me because a programmer is supposed to know how his numerical variables behave. Also, my current side project relies on floating point determinism. So, depending on how much spare time I can invest, I will write more about this subject, though nothing I wrote in this post is new.
