Consider the addition of floating point numbers. If the numbers are small, or use few bits, we may simply be able to convert them to a fixed point representation and then add them in the traditional way. For example, using 12 bits (8 for a fractional two's complement mantissa and 4 for the two's complement integer exponent), add the following two numbers:

0.1001000 0010 + 0.1111000 0100

Converting the numbers from normalised form to fixed point gives 010.01000 and 01111.000 respectively. Now we can add:

  000010.01000
+ 001111.00000
  ------------
  010001.01000
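To normalise we move the binary point 5 places to the left, giving an exponent of 5, so the answer is:

0.1000101 0101

Adding two floating point numbers of larger magnitude requires the following steps: make the exponents equal by shifting the mantissa of the smaller number to the right, add the mantissas, and normalise the result if necessary.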
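These steps can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the helper names `to_signed`, `to_bits`, `decode` and `fp_add` are hypothetical, both operands are assumed to use the same mantissa and exponent widths, and any bits shifted off the right-hand end are simply discarded.

```python
from fractions import Fraction


def to_signed(bits):
    """Interpret a bit string as a two's complement integer."""
    value = int(bits, 2)
    return value - (1 << len(bits)) if bits[0] == "1" else value


def to_bits(value, width):
    """Two's complement bit string of the given width."""
    return format(value & ((1 << width) - 1), f"0{width}b")


def decode(mantissa_bits, exponent_bits):
    """Exact value: fractional mantissa multiplied by 2**exponent."""
    mantissa = Fraction(to_signed(mantissa_bits), 1 << (len(mantissa_bits) - 1))
    return mantissa * Fraction(2) ** to_signed(exponent_bits)


def fp_add(a, b):
    """Add two (mantissa_bits, exponent_bits) pairs of equal widths."""
    m_width, e_width = len(a[0]), len(a[1])
    (ma, ea), (mb, eb) = [(to_signed(m), to_signed(e)) for m, e in (a, b)]

    # Step 1: shift the mantissa of the smaller exponent to the right
    # until both exponents are equal (low bits are lost here).
    if ea < eb:
        ma, ea = ma >> (eb - ea), eb
    else:
        mb, eb = mb >> (ea - eb), ea

    # Step 2: add the aligned mantissas.
    m, e = ma + mb, ea

    # Step 3: normalise. First bring the sum back into the mantissa range,
    # then shift left while the top two bits are still the same.
    limit = 1 << (m_width - 1)
    while not -limit <= m < limit:
        m, e = m >> 1, e + 1
    while m != 0 and -limit // 2 <= m < limit // 2:
        m, e = m << 1, e - 1

    e_limit = 1 << (e_width - 1)
    if not -e_limit <= e < e_limit:
        raise OverflowError("exponent does not fit in the available bits")
    return to_bits(m, m_width), to_bits(e, e_width)


# The 12-bit example from the text: 0.1001000 0010 + 0.1111000 0100
result = fp_add(("01001000", "0010"), ("01111000", "0100"))
print(result)                  # ('01000101', '0101')
print(float(decode(*result)))  # 17.25
```

For the small 12-bit example the result matches the fixed point addition exactly, because no significant bits are lost when the smaller mantissa is shifted.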
For example, using a 10-bit mantissa with a 6-bit exponent, where it is very difficult to convert numbers to a fixed point representation:

1011011011 001111
Just so we can check the correct result later, let's convert this binary value to its decimal equivalent. The mantissa 1011011011 begins with a 1, so it is negative; taking its two's complement gives the magnitude 0.100100101. As a fraction this is 1/2 + 1/16 + 1/128 + 1/512 or 293/512, so the mantissa is -293/512. Expressed as a decimal this is approximately -0.5722656; as a power of 2 it is -293 × 2^-9. Either way we have to multiply it by 2^15 (32768), because the exponent is 001111 (15). From here we can work out that its decimal value is:

-293 × 2^-9 × 2^15 = -293 × 2^6 = -18752
or
-0.5722656 × 32768 ≈ -18752

Suppose we wish to add 294 to this number. That would be (1/2 + 1/16 + 1/128 + 1/256) × 2^9, or 0100100110 001001 in our representation. Now we have 001111 for the first exponent and 001001 for the second. So, for the smaller, we have to repeatedly shift the mantissa one place to the right and increase the exponent by one until it is equal to the larger:

0100100110 001001
0010010011 001010
0001001001 001011
0000100100 001100
0000010010 001101
0000001001 001110
0000000100 001111
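To see where the accuracy goes during this alignment, the shifts can be traced one at a time. The sketch below is illustrative only: `align_trace` is a hypothetical helper, it assumes a non-negative mantissa and a non-negative exponent, and it drops the lowest bit on each shift. It records the bit patterns together with the decimal value they represent.

```python
def align_trace(mantissa_bits, exponent, target_exponent, exp_width=6):
    """Shift right one place at a time, recording (mantissa, exponent, value)."""
    width = len(mantissa_bits)
    mantissa = int(mantissa_bits, 2)  # assumes the sign bit is 0
    steps = []
    while True:
        # Fractional mantissa (mantissa / 2**(width-1)) times 2**exponent.
        value = mantissa * 2 ** exponent / 2 ** (width - 1)
        steps.append((format(mantissa, f"0{width}b"),
                      format(exponent, f"0{exp_width}b"),
                      value))
        if exponent >= target_exponent:
            return steps
        mantissa >>= 1   # the lowest bit "falls off" the end
        exponent += 1


for bits, exp, value in align_trace("0100100110", 9, 15):
    print(bits, exp, value)
# 0100100110 001001 294.0
# 0010010011 001010 294.0
# 0001001001 001011 292.0
# 0000100100 001100 288.0
# 0000010010 001101 288.0
# 0000001001 001110 288.0
# 0000000100 001111 256.0
```

The value only changes on the shifts where a 1 bit is discarded, which is why it drops in three stages rather than steadily.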
And the exponents of the two numbers are now equal. Notice, however, that our original value of 294 has been altered to 256 (32768/128) by this process, i.e. accuracy has been lost. In this case, accuracy has been lost because the final bits "fell off" the end of our mantissa: a truncation error.

Anyway, now we can add the two mantissas:

  1011011011
+ 0000000100
  ----------
  1011011111

The final result is (-1 + 1/4 + 1/8 + 1/32 + 1/64 + 1/128 + 1/256 + 1/512) × 2^15, that is (-1 + 223/512) × 2^15 or -289/512 × 2^15. Again, using powers of 2:

-289 × 2^-9 × 2^15 = -289 × 2^6 = -18496

The true answer (-18752 + 294) is -18458, so the process has resulted in an error of 38, exactly the amount lost when 294 was truncated to 256.

In real systems these errors are reduced by allocating more bits to the storage of floating point numbers. For example, in Java, 4 bytes (32 bits) are used to store float primitives and 8 bytes (64 bits) are used to store double primitives. To improve accuracy, intermediate calculations may use more bits than the final stored result (this would help to avoid the error in the calculation shown above).

Situations in which errors can occur

If two numbers are combined such that the result is too big to be stored in the allocated number of bits then we have overflow. Depending on the system, this causes a runtime error or produces a special value such as infinity. An example is adding or multiplying two numbers together where the outcome is too big to store.

If the result is too small to be stored in normalised form (after division or subtraction, for example) then the error condition is known as underflow. Most systems will simply use zero to represent this condition.

As we have seen, truncation errors occur when bits are "chopped off" from the end of a number in some process such as adjusting the mantissa of a number to be processed.
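The three error conditions can also be observed with ordinary Python floats. This is a small sketch, assuming the usual case where Python's `float` is a 64-bit IEEE 754 double. That format rounds to the nearest representable value rather than chopping bits off, but the lost accuracy is the same kind of effect as the truncation described above.

```python
import math
import sys

# Overflow: the result is too large to store, so it becomes infinity.
overflowed = sys.float_info.max * 10
print(math.isinf(overflowed))      # True

# Some operations report overflow as a runtime error instead.
try:
    2.0 ** 1024
    raised = False
except OverflowError:
    raised = True
print(raised)                      # True

# Underflow: the result is too small to store, so zero is used.
underflowed = 5e-324 / 2           # 5e-324 is the smallest positive double
print(underflowed == 0.0)          # True

# Lost low-order bits: the added 1.0 does not fit in the mantissa.
lost = 2.0 ** 53 + 1.0
print(lost == 2.0 ** 53)           # True
```

Note that the same overflow condition shows up in two different ways here, as a special value and as an exception, depending on the operation.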


Questions or problems related to this web site should be addressed to Richard Jones, who asserts his right to be identified as the author and owner of these materials unless otherwise indicated. Please feel free to use the material presented here and to create links to it for non-commercial purposes; an acknowledgement of the source is required by the Creative Commons licence. Use of materials from this site is conditional upon your having read the additional terms of use on the about page and the Creative Commons Licence. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. © 2001-2009 Richard Jones, PO BOX 246, Cambridge, New Zealand. This page was last modified: October 28, 2013