Consider the addition of floating point numbers:
If the numbers are small, or use few bits, we may simply be able to convert the floating point numbers to a fixed point representation and then add them in the traditional way.
For example, using 12 bits (8 for a fractional two's complement mantissa and 4 for the two's complement integer exponent), add the following two numbers:
0.1001000 0010 + 0.1111000 0100
Converting the numbers from normalised form to fixed point gives
10.01 and 1111.0
respectively (2.25 and 15 in decimal). Now we can add:
10.01 + 1111.0 = 10001.01
To normalise we move the binary point 5 places to the left, giving an exponent of 5 (0101), so the answer is:
0.1000101 0101
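The 12-bit example above can be checked with a short Python sketch. The helper names (`twos_complement`, `decode`, `normalise`) are illustrative only, and the sketch assumes the binary point sits just after the mantissa's sign bit and that the value being normalised is positive.

```python
from fractions import Fraction


def twos_complement(bits: str) -> int:
    """Interpret a bit string as a two's complement integer."""
    n = int(bits, 2)
    return n - (1 << len(bits)) if bits[0] == "1" else n


def decode(mantissa: str, exponent: str) -> Fraction:
    """Exact value of mantissa x 2^exponent.

    Assumes the binary point sits just after the mantissa's sign bit.
    """
    m = Fraction(twos_complement(mantissa), 1 << (len(mantissa) - 1))
    return m * Fraction(2) ** twos_complement(exponent)


def normalise(value: Fraction, mant_bits: int = 8, exp_bits: int = 4):
    """Encode a positive value with a mantissa in [1/2, 1); extra bits are truncated."""
    exponent = 0
    while value >= 1:                  # move the binary point left
        value /= 2
        exponent += 1
    while value < Fraction(1, 2):      # move the binary point right
        value *= 2
        exponent -= 1
    m = int(value * (1 << (mant_bits - 1)))
    e = exponent & ((1 << exp_bits) - 1)   # two's complement exponent field
    return format(m, f"0{mant_bits}b"), format(e, f"0{exp_bits}b")


a = decode("01001000", "0010")   # 0.1001000 x 2^2
b = decode("01111000", "0100")   # 0.1111000 x 2^4
total = a + b
print(a, b, total)               # 9/4 15 69/4
print(normalise(total))          # ('01000101', '0101')
```

Using `Fraction` keeps every intermediate value exact, so any loss of accuracy could only come from the final truncation to 8 mantissa bits (there is none in this example).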
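Adding two larger magnitude floating point numbers requires the following steps: make the exponents equal by shifting the mantissa of the number with the smaller exponent, add the mantissas, and then normalise the result if necessary.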
For example, using a 10-bit mantissa with a 6-bit exponent, where it is very difficult to convert numbers to a fixed point representation:
1011011011 001111
Just so we can check the correct result later, let's convert this binary value to its decimal equivalent.
The mantissa 1011011011 is negative (its leading bit is 1), so we take the two's complement to find its magnitude:
0100100101
As a fraction this is -(1/2 + 1/16 + 1/128 + 1/512), or -293/512.
Expressed as a decimal this is approximately -0.5722656; as a power of 2 it is -293 x 2^-9. Either way, we have to multiply it by 2 raised to the exponent (001111, or 15), which is 2^15 (32768).
From here we can work out that its decimal value is:
-293 x 2^-9 x 2^15 = -293 x 2^6 = -18752
-0.5722656 x 32768 = -18752
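As a quick check of this conversion, here is a minimal Python sketch. The function name `twos_complement` is just an illustrative helper; the mantissa is treated as a signed integer counted in units of 2^-9, because a 10-bit mantissa has nine bits after the binary point.

```python
def twos_complement(bits: str) -> int:
    """Interpret a bit string as a two's complement integer."""
    n = int(bits, 2)
    return n - (1 << len(bits)) if bits[0] == "1" else n


mantissa = "1011011011"
exponent = "001111"

m = twos_complement(mantissa)   # mantissa in units of 2^-9
e = twos_complement(exponent)
value = m * 2 ** (e - 9)        # -293 x 2^-9 x 2^15 = -293 x 2^6

print(m, e, value)              # -293 15 -18752
```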
Suppose we wish to add 294 to this number. That would be (1/2 + 1/16 + 1/128 + 1/256) x 2^9, or
0100100110 001001
in our representation. Now we have 001111 (15) for the first exponent and 001001 (9) for the second.
So, for the smaller, we have to repeatedly shift the mantissa one place to the right and increase the exponent by one until it is equal to the larger:
0100100110 001001
0010010011 001010
0001001001 001011
0000100100 001100
0000010010 001101
0000001001 001110
0000000100 001111
And the exponents of the two numbers are now equal. Notice, however, that our original value of 294 has been altered to 256 (32768/128) by this process.
Accuracy has been lost because the final bits "fell off" the end of our mantissa: a truncation error.
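The shifting process can be sketched in Python as follows. The function `align` is a hypothetical helper, not part of any library; it assumes a positive mantissa held as an integer in units of 2^-9 and records each intermediate bit pattern.

```python
def align(mantissa: int, exponent: int, target_exponent: int):
    """Shift right one place at a time until the exponent reaches the target.

    Assumes a positive 10-bit mantissa stored as an integer in units of 2^-9.
    """
    steps = []
    while exponent < target_exponent:
        mantissa >>= 1          # the lowest bit is discarded (truncation)
        exponent += 1
        steps.append((format(mantissa, "010b"), exponent))
    return mantissa, exponent, steps


m, e, steps = align(0b0100100110, 9, 15)   # 294 = 0.100100110 x 2^9
for bits, exp in steps:
    print(bits, exp)
print(m * 2 ** (e - 9))                    # 256
```

The last line shows the value that the shifted mantissa now represents: 256 rather than the original 294.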
Now we can add the two mantissas:
1011011011 + 0000000100 = 1011011111
The final result is
(-1 + (1/4 + 1/8 + 1/32 + 1/64 + 1/128 + 1/256 + 1/512)) x 2^15
where the mantissa is -1 + 223/512 or -289/512.
Again, using powers of 2:
-289 x 2^-9 x 2^15 = -289 x 2^6 = -18496
The true answer (-18752 + 294) is -18458, so the process has introduced an error of 38.
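The whole addition can be reproduced with a small Python sketch. It is a simplified model under stated assumptions, not how any particular hardware works: mantissas are signed integers in units of 2^-9, the function names `add` and `value` are illustrative, and no renormalisation or overflow check is performed.

```python
FRACTION_BITS = 9   # 10-bit mantissa: one sign bit plus nine fraction bits


def add(m1: int, e1: int, m2: int, e2: int):
    """Add two numbers given as (signed mantissa in units of 2^-9, exponent)."""
    if e1 < e2:                       # make (m1, e1) the larger-exponent operand
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= e1 - e2                    # align: the low bits of m2 are discarded
    return m1 + m2, e1                # no renormalisation or overflow check


def value(m: int, e: int) -> float:
    """Decimal value of mantissa m (units of 2^-9) with exponent e."""
    return m * 2.0 ** (e - FRACTION_BITS)


m, e = add(-293, 15, 294, 9)
print(m, e, value(m, e))                  # -289 15 -18496.0
print(value(-293, 15) + value(294, 9))    # -18458.0
```

The first line printed is the 16-bit result; the second is the same sum computed with Python's much wider floats, which is exact here.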
In real systems these errors are reduced by allocating more bits to the storage of floating point numbers. For example, in Java, 4 bytes (32 bits) are used to store float primitives and 8 bytes are used to store double primitives.
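The effect of the extra bits can be seen with a short Python sketch. It assumes that Python's own `float` is a 64-bit IEEE 754 double (true on common platforms) and uses the standard `struct` module to round a value to 32 bits and back; `to_float32` is an illustrative helper name.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit float to the nearest 32-bit float and convert it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


print(struct.calcsize("<f"), struct.calcsize("<d"))   # 4 8
print(to_float32(16777217.0))                         # 16777216.0
print(to_float32(0.1) == 0.1)                         # False
```

The value 16777217 (2^24 + 1) needs 25 significant bits, one more than a 32-bit float provides, so it is stored as 16777216; the 64-bit format holds it exactly.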
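To improve accuracy, intermediate calculations often use more bits than the final stored result. This would help to reduce the error in the calculation shown above, although it could not remove it entirely: with an exponent of 15 a 10-bit mantissa can only represent multiples of 64, and -18458 is not one of them.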
Situations in which errors can occur
If two numbers are added or multiplied together such that the result is too big to be stored in the allocated number of bits then we have overflow. Depending on the system, this either causes a run-time error or produces a special value such as infinity.
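How overflow is reported varies between languages and operations. As an illustration, the Python sketch below (assuming the usual 64-bit IEEE 754 doubles) shows both behaviours: plain multiplication quietly produces infinity, while `math.exp` raises an exception.

```python
import math
import sys

largest = sys.float_info.max      # largest finite double, about 1.8e308
print(largest * 2)                # inf

try:
    math.exp(1000.0)              # result is far too large for a double
except OverflowError as err:
    print("OverflowError:", err)
```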
If the result is too small to be stored in normalised form (after division or subtraction, for example) then the error condition is known as underflow. Most systems will simply use zero to represent this condition.
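A minimal Python illustration of underflow, again assuming 64-bit IEEE 754 doubles: the true product below is 1e-400, which is far smaller than anything a double can hold, so the stored result is zero.

```python
small = 1e-200
product = small * small   # the true result, 1e-400, is below the double range

print(product)            # 0.0
print(product == 0.0)     # True
```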
As we have seen, truncation errors occur when bits are "chopped off" the end of a number during some process, such as shifting the mantissa of a number so that two exponents are equal.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. © 2001 - 2009 Richard Jones, PO BOX 246, Cambridge, New Zealand;
This page was last modified: October 28, 2013