
Errors in Floating Point representation


Consider the addition of floating point numbers:

If the numbers are small, or use few bits, we may be able to convert the floating point numbers to a fixed point representation and then add them in the traditional way.

For example, using 12 bits (8 for a fractional two's complement mantissa and 4 for the two's complement integer exponent), add the following two numbers:

     0.1001000 0010 + 0.1111000 0100

Converting the numbers from normalised form to fixed point gives

     010.01000

and

     01111.000

respectively. Now we can add:

     first        010.01000
     second     01111.000
     result     10001.01000

To normalise we move the binary point 5 places to the left, giving an exponent of 5, so the answer is:

      0.1000101 0101
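The worked example above can be checked with a short sketch. It assumes the 12-bit layout described in the text (an 8-bit fractional two's complement mantissa followed by a 4-bit two's complement exponent); the helper names `twos_complement` and `decode` are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def twos_complement(bits: str) -> int:
    """Interpret a bit string as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value

def decode(mantissa: str, exponent: str) -> Fraction:
    """Value of a fractional two's complement mantissa times 2**exponent."""
    m = Fraction(twos_complement(mantissa), 1 << (len(mantissa) - 1))
    return m * Fraction(2) ** twos_complement(exponent)

first = decode("01001000", "0010")    # 0.1001000 x 2^2
second = decode("01111000", "0100")   # 0.1111000 x 2^4
total = first + second
print(first, second, total)           # 9/4 15 69/4

# The normalised answer from the text decodes to the same value.
answer = decode("01000101", "0101")   # 0.1000101 x 2^5
print(answer == total)                # True
```

Exact fractions are used here so that the check itself does not introduce any rounding.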

Adding two larger magnitude floating point numbers requires the following steps:

  • Normalise the bigger (in magnitude) number.
  • Change the other number so that the two exponents are the same.
  • Add (or subtract) the mantissas.
  • Normalise the result.
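The steps above can be sketched as follows. This is a minimal illustration, not a description of any real hardware: it assumes both inputs are already normalised, treats each mantissa as a signed integer `m` standing for `m / 2**(bits - 1)`, and ignores the limited range of the exponent. The function name `add_floats` is hypothetical.

```python
def add_floats(m1, e1, m2, e2, bits=10):
    """Add two normalised values, each equal to m * 2**(e - (bits - 1)).

    m is the mantissa read as a signed (two's complement) integer.
    """
    # Put the operand with the larger exponent first.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    # Align: shift the other mantissa right; low-order bits are discarded.
    m2 >>= e1 - e2
    # Add the mantissas.
    m, e = m1 + m2, e1
    # Normalise the result.
    limit = 1 << (bits - 1)
    half = limit >> 1
    while m >= limit or m < -limit:      # sum no longer fits: shift right
        m >>= 1
        e += 1
    while m != 0 and -half <= m < half:  # leading bit redundant: shift left
        m <<= 1
        e -= 1
    return m, e

print(add_floats(384, 3, 256, 1))   # 6 + 1 -> (448, 3), i.e. 0.875 x 2^3 = 7
print(add_floats(384, 3, 384, 3))   # 6 + 6 -> (384, 4), i.e. 0.75 x 2^4 = 12
```

The right shift in the alignment step is where low-order bits can be lost, which is the source of the truncation error discussed later in this section.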

For example, using a 10-bit mantissa with a 6-bit exponent, where it is very difficult to convert numbers to a fractional representation:

     Bit      15  14   13   12   11    10    9     8      7      6      5    4   3   2   1   0
     Weight   -1  1/2  1/4  1/8  1/16  1/32  1/64  1/128  1/256  1/512  -32  16  8   4   2   1
     Value    1   0    1    1    0     1     1     0      1      1      0    0   1   1   1   1

Just so we can check the correct result later, let's convert this binary value to its decimal equivalent.

The mantissa bit pattern 1011011011 has a leading 1, so it is negative; negating it (two's complement) gives 0100100101, so its value is:

                         - 0.100100101

As a fraction this is -(1/2 + 1/16 + 1/128 + 1/512), or -293/512.

Expressed as a decimal this is approximately -0.5722656; as a power of 2 it is -293 x 2^-9. Either way, we have to multiply it by 2^15 (32768), because the exponent is 001111 (15).

From here we can work out that its decimal value is:

                          -293 x 2^-9 x 2^15 = -293 x 2^6 = -18752

or

                          -0.5722656 x 32768 = -18752
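The same conversion can be done by summing the place weights of the bits that are set, as in the table. The sketch below assumes the 16-bit layout used in this example (bits 15 to 6 for the mantissa, bits 5 to 0 for the exponent); the variable names are illustrative only.

```python
from fractions import Fraction

word = "1011011011001111"
mantissa_bits, exponent_bits = word[:10], word[10:]

# Place weights: the leading bit of each two's complement field is negative.
mantissa_weights = [Fraction(-1)] + [Fraction(1, 2 ** i) for i in range(1, 10)]
exponent_weights = [-32, 16, 8, 4, 2, 1]

mantissa = sum(w for b, w in zip(mantissa_bits, mantissa_weights) if b == "1")
exponent = sum(w for b, w in zip(exponent_bits, exponent_weights) if b == "1")
value = mantissa * 2 ** exponent

print(mantissa, exponent, value)   # -293/512 15 -18752
```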

Suppose we wish to add 294 to this number. That is exactly (1/2 + 1/16 + 1/128 + 1/256) x 2^9, or

    0100100110 001001

in our representation. Now we have 001111 (15) for the first exponent and 001001 (9) for the second.

So, for the smaller, we have to repeatedly shift the mantissa one place to the right and increase the exponent by one until it is equal to the larger:

so

     0100100110 001001

becomes

     0010010011 001010 on the first shift
     0001001001 001011 on the second shift
     0000100100 001100 on the third shift
     0000010010 001101 on the fourth shift
     0000001001 001110 on the fifth shift
     0000000100 001111 on the sixth shift

And the exponents of the two numbers are now equal. Notice, however, that our original value of 294 has been altered to 256 (32768/128) by this process, i.e. accuracy has been lost.

In this case, accuracy has been lost as the final bits "fell off" the end of our mantissa - a truncation error.
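The alignment loop can be reproduced in a few lines. This is a sketch under the assumptions of the example (a 10-bit mantissa holding 294/512 with exponent 9, aligned to exponent 15); it prints the mantissa and exponent after each shift and then the value that survives.

```python
BITS = 10                              # mantissa width used in the text
mantissa, exponent = 0b0100100110, 9   # 294 as 294/512 x 2^9
target = 15                            # exponent of the larger operand

while exponent < target:
    mantissa >>= 1                     # the lowest bit is discarded
    exponent += 1
    print(f"{mantissa:010b} {exponent:06b}")

aligned_value = mantissa * 2 ** (exponent - (BITS - 1))
print(aligned_value, 294 - aligned_value)   # 256 38
```

The difference of 38 is the part of 294 that was carried by the discarded bits.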

Anyway, now we can add the two mantissas: 

       1011011011
     + 0000000100
     = 1011011111 exp 001111

The final result is

          (-1 + 1/4 + 1/8 + 1/32 + 1/64 + 1/128 + 1/256 + 1/512) x 2^15

The mantissa is

          -1 + 223/512 or -289/512

so, again using powers of 2:

          -289 x 2^-9 x 2^15 = -289 x 2^6 = -18496

The true answer (-18752 + 294) is -18458, so the process has introduced an error of 38.

In real systems these errors are reduced by allocating more bits to the storage of floating point numbers. For example, in Java, 4 bytes (32 bits) are used to store float primitives and 8 bytes (64 bits) are used to store double primitives.
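The effect of the storage width can be seen with a small sketch. It is written in Python rather than Java, and assumes that Python's float is a 64-bit value (as it is on typical platforms); the standard `struct` module is used to round a value to 32 bits and back. The helper name `to_float32` is hypothetical.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit value and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

exact = 16777216.0 + 1.0      # 2^24 + 1 fits in a 64-bit float
narrow = to_float32(exact)    # but not in the shorter 32-bit mantissa

print(exact, narrow)          # 16777217.0 16777216.0
```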

To improve accuracy, intermediate calculations may use more bits than the final stored result (this can reduce, though not always eliminate, errors like the one in the calculation shown above).
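As a rough illustration of extra intermediate bits, the sketch below repeats the addition from the example with and without additional low-order ("guard") bits. The function name `add_with_guard_bits` and the choice of rounding to the nearest value at the end are assumptions made for this sketch; real systems differ in how many extra bits they keep and how they round.

```python
def add_with_guard_bits(m_big, m_small, shift, guard):
    """Align m_small by `shift` places, keeping `guard` extra low-order bits,
    add, then round the sum back to the original mantissa width."""
    wide_big = m_big << guard
    wide_small = (m_small << guard) >> shift   # fewer bits fall off the end
    wide_sum = wide_big + wide_small
    half = (1 << guard) >> 1                   # half a unit in the last place
    return (wide_sum + half) >> guard

exact = -18752 + 294                                   # -18458
plain = add_with_guard_bits(-293, 294, 6, 0) * 2 ** 6  # no extra bits
wider = add_with_guard_bits(-293, 294, 6, 6) * 2 ** 6  # six extra bits

print(plain, abs(plain - exact))   # -18496 38
print(wider, abs(wider - exact))   # -18432 26
```

The stored result still has only 10 mantissa bits, so -18458 itself cannot be represented; the extra bits only allow the nearest representable value to be chosen.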

Situations in which errors can occur

If two numbers are added or multiplied together and the result is too big to be stored in the allocated number of bits, then we have overflow. Depending on the system, this either causes a run-time error or produces a special value such as infinity (as Java does for float and double).

If the result is too small in magnitude to be stored in normalised form (after division or subtraction, for example), then the error condition is known as underflow. Most systems will simply use zero to represent this condition.
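Both conditions can be observed with ordinary 64-bit floats. The sketch below uses Python, assuming its float is an IEEE 754 double as on typical platforms; other languages and systems may report these conditions differently.

```python
import math
import sys

# Overflow: the product is too big to store. Multiplication yields infinity.
overflowed = sys.float_info.max * 2
print(overflowed, math.isinf(overflowed))   # inf True

# Some library functions report overflow as a run-time error instead.
try:
    math.exp(1000)
    raised = False
except OverflowError:
    raised = True
print(raised)                               # True

# Underflow: the quotient is too small to store, so zero is used.
tiny = 5e-324                               # smallest positive double
underflowed = tiny / 2
print(underflowed)                          # 0.0
```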

As we have seen, truncation errors occur when bits are "chopped off" the end of a number during some process, such as shifting a mantissa to align two exponents.



 

Questions or problems related to this web site should be addressed to Richard Jones who asserts his right to be identified as the author and owner of these materials - unless otherwise indicated. Please feel free to use the material presented here and to create links to it for non-commercial purposes; an acknowledgement of the source is required by the Creative Commons licence. Use of materials from this site is conditional upon your having read the additional terms of use on the about page and the Creative Commons Licence. View privacy policy.



This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. © 2001 - 2009 Richard Jones, PO BOX 246, Cambridge, New Zealand;
This page was last modified: October 28, 2013