 
  
  
  
  
 Floating point numbers approximate real numbers. Operations with floating point numbers approximate corresponding operations with real numbers. Consider the following addition operation:
  
 
The two operands are members of the set of floating point numbers; their exact sum is not.
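A minimal sketch of such an addition, assuming Python floats are IEEE 754 binary64 values. The operands `a` and `b` and the helper `is_float` are chosen here purely for illustration. Exact rational arithmetic stands in for the real numbers.

```python
from fractions import Fraction

# Illustrative operands (an assumption of this sketch): both are powers of two,
# so both are exactly representable as floating point numbers.
a = 1.0
b = 2.0 ** -60

# Fraction(float) recovers the exact rational value stored in a float.
exact_sum = Fraction(a) + Fraction(b)  # the implied real result
float_sum = a + b                      # the floating point result

def is_float(q: Fraction) -> bool:
    """True when the rational q is exactly representable as a float."""
    return Fraction(float(q)) == q

print(is_float(Fraction(a)), is_float(Fraction(b)))  # True True
print(is_float(exact_sum))                           # False
print(float_sum == a)                                # True: the sum was rounded back to a
```

The exact sum needs 61 significant bits, more than the 53 available, so it is not a floating point number and the addition has to round.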
When the implied real result of a floating point operation is not a floating point number the result is rounded to a floating point number. The most common form of rounding is ``rounding to nearest'' where the result is rounded to the nearest floating point number. Using such rounding the previous example would result in:
  
 
Another form of rounding is ``upward rounding'' where the result is rounded up to a larger floating point number. If the result is positive, it is rounded away from zero; if the result is negative, it is rounded towards zero. Using such rounding the previous example would result in:
  
 
Another form of rounding is ``downward rounding'' where the result is rounded down to a smaller floating point number. If the result is positive, it is rounded towards zero; if the result is negative, it is rounded away from zero. Using such rounding the previous example would result in:
  
 
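The three forms of rounding can be sketched as follows, assuming Python 3.9 or later (for `math.nextafter`) and relying on CPython's correctly rounded integer division when converting a `Fraction` to a float. The function names and the value `q` are hypothetical, chosen only for this illustration.

```python
import math
from fractions import Fraction

def round_nearest(q: Fraction) -> float:
    # float(Fraction) divides numerator by denominator with correct rounding,
    # which gives the nearest floating point number.
    return float(q)

def round_up(q: Fraction) -> float:
    # Smallest float that is not below q.
    f = float(q)
    return f if Fraction(f) >= q else math.nextafter(f, math.inf)

def round_down(q: Fraction) -> float:
    # Largest float that is not above q.
    f = float(q)
    return f if Fraction(f) <= q else math.nextafter(f, -math.inf)

# Illustrative real value that is not a float: 1 + 2^-60.
q = Fraction(1) + Fraction(1, 2 ** 60)

print(round_nearest(q), round_up(q), round_down(q))
print(round_up(-q), round_down(-q))
```

For the positive value, upward rounding moves away from zero and downward rounding moves towards zero; for the negated value the directions swap, as described above.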
Numerical libraries provide three forms of rounding: rounding to nearest, upward rounding, and downward rounding. The default mode of rounding is rounding to nearest. When an explicit rounding mode is not specified, as was done earlier, rounding to nearest is assumed.
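As one illustration of a library interface with selectable rounding and a default mode, the sketch below uses Python's standard `decimal` module. It is an assumption of this example, not the library discussed here, and it rounds decimal rather than binary floating point numbers. The helper `divide` and its `digits` parameter are hypothetical.

```python
from decimal import (Decimal, localcontext,
                     ROUND_HALF_EVEN, ROUND_CEILING, ROUND_FLOOR)

def divide(a: int, b: int, rounding: str = ROUND_HALF_EVEN,
           digits: int = 4) -> Decimal:
    """Divide a by b, keeping `digits` significant digits.

    When no rounding mode is given, rounding to nearest is assumed.
    """
    with localcontext() as ctx:
        ctx.prec = digits
        ctx.rounding = rounding
        return Decimal(a) / Decimal(b)

print(divide(1, 3))                  # 0.3333  (nearest, the default)
print(divide(1, 3, ROUND_CEILING))   # 0.3334  (upward)
print(divide(1, 3, ROUND_FLOOR))     # 0.3333  (downward)
print(divide(-1, 3, ROUND_CEILING))  # -0.3333 (upward is towards zero here)
```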
Although IEEE 754 requires that the algebraic operators $+$, $-$, $\times$, $\div$, and $\sqrt{\cdot}$ are rounded to the nearest floating point number, other operators are not so favoured. The following example will illustrate what can happen with operators whose results are not guaranteed to be accurate to within one ULP (Unit in the Last Place). With a sine implementation that is guaranteed to be accurate to within 40 ULPs the following may occur:
  
 
The true result is bracketed by a lower floating point bound and an upper floating point bound. These brackets may be widely separated; with our example sine implementation they may differ by up to 80 ULPs. The result using ``rounding to nearest'' only guarantees that the true result will fall within the bracketed region.
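One way to picture such a bracket is to step a fixed number of floating point numbers below and above a computed value. The sketch below assumes Python 3.9 or later. The 40 ULP bound is the hypothetical figure used in this section, not a documented property of `math.sin`, and the helper `bracket` and the argument `x` are illustrative.

```python
import math

def bracket(y: float, ulps: int) -> tuple[float, float]:
    """Return the floats `ulps` steps below and `ulps` steps above y."""
    lo = hi = y
    for _ in range(ulps):
        lo = math.nextafter(lo, -math.inf)
        hi = math.nextafter(hi, math.inf)
    return lo, hi

ERROR_BOUND_ULPS = 40  # assumed accuracy bound of the sine implementation

x = 1.0                # illustrative argument
y = math.sin(x)        # computed result
lo, hi = bracket(y, ERROR_BOUND_ULPS)

# Width of the bracket measured in units in the last place of y.
width_in_ulps = (hi - lo) / math.ulp(y)
print(lo, y, hi, width_in_ulps)
```

Under the assumed bound, the true sine of `x` lies somewhere between `lo` and `hi`, and the two brackets are 80 ULPs apart.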
Using real numbers directly in computations is currently infeasible. Floating point numbers are commonly used because of their computational advantages. Unfortunately, rounding causes the result returned to be inexact.
 
  
  
  
  
 | Jeff Tupper | March 1996 |