Floating point numbers are commonly used to approximate real numbers. Floating point facilities are common in computer hardware, so most floating point operations can be performed very quickly.
There are many different floating point number systems [5, 49, 50, 35], although they are all very similar. A floating point number can be written as:

$$a \times b^{c}$$

where a is the significand, b is the base, and c is the exponent.
All of the numbers in a particular floating point
number system use a single choice
of b. The set of floating point numbers with b=2
is the system
of choice for computer implementations, since a and c
are usually stored in binary.
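
As a rough illustration of the b=2 case, the sketch below splits a Python float into an integer a and an exponent c such that the float equals a × 2^c exactly. It assumes Python floats are IEEE 754 doubles, which holds on mainstream platforms. The helper name `decompose` and the choice to reduce a to an odd integer are conventions of this sketch, not part of the text above.

```python
import math
from fractions import Fraction

def decompose(x):
    """Return integers (a, c) with x == a * 2**c exactly, for finite x."""
    if x == 0.0:
        return 0, 0
    frac, exp = math.frexp(x)      # x == frac * 2**exp, with 0.5 <= |frac| < 1
    a = int(frac * 2**53)          # scaling by a power of two is exact
    c = exp - 53
    while a % 2 == 0:              # strip trailing zero bits so a is odd
        a //= 2
        c += 1
    return a, c

a, c = decompose(0.1)
# Exact rational check: the float written as 0.1 is a * 2**c, not 1/10.
print(Fraction(a) * Fraction(2) ** c == Fraction(0.1))   # True
print(Fraction(0.1) == Fraction(1, 10))                  # False
print(decompose(6.0))                                    # (3, 1)
```

The second comparison shows the approximation at work: the decimal value 0.1 has no exact representation with b=2, so the nearest representable number is stored instead.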
Implementations usually represent a and c in a
fixed number of bits. A common example is IEEE 754 [5]
64-bit double precision, where a is stored in 53 bits
(fifty-two bits for the magnitude, one for the
sign) while c is stored in 11 bits (using a
biased binary representation).
Such a system
is compactly described by its base, its number of digits, and its exponent range;
of the 2048 exponent values available in 11 bits, two are reserved to indicate non-normalized numbers.
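
The double precision layout can be inspected directly. The following is a minimal sketch, assuming Python floats are IEEE 754 doubles. The helper name `fields` is hypothetical, and the split into sign, biased exponent, and stored magnitude bits follows the bit counts given above.

```python
import struct

def fields(x):
    """Split an IEEE 754 double into (sign, biased exponent, magnitude bits)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                      # 1 sign bit
    exponent = (bits >> 52) & 0x7FF        # 11-bit biased exponent
    magnitude = bits & ((1 << 52) - 1)     # 52 stored magnitude bits
    return sign, exponent, magnitude

print(fields(1.0))             # (0, 1023, 0)
print(fields(-2.0))            # (1, 1024, 0)
print(fields(5e-324))          # (0, 0, 1)
print(fields(float("inf")))    # (0, 2047, 0)
```

The last two calls show the two reserved exponent values: the all-zeros field (0) marks zero and subnormal numbers, and the all-ones field (2047) marks infinities and NaNs.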
The floating point operations described below are
required in IEEE 754 compliant numerical libraries.
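Formally, for a digit count A and exponent bounds m and M, the system includes
all numbers which may be expressed as $a \times b^{c}$
and satisfy:

$$a, c \in \mathbb{Z}, \qquad |a| < b^{A}, \qquad m \le c \le M$$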
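Another view of the floating point numbers is to imagine
the numbers of the system as being described
by A base b digits multiplied by b raised
to an exponent between m and M:

$$\pm\, d_{1} d_{2} \cdots d_{A} \times b^{c}, \qquad 0 \le d_{i} < b, \qquad m \le c \le M$$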
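
To make the digit view concrete, the sketch below enumerates a very small system. It assumes the significand is an integer of at most A base b digits; placing the radix point elsewhere would shift m and M but not the overall structure. The function name `toy_system` and the parameter values are chosen purely for illustration.

```python
from fractions import Fraction

def toy_system(b, A, m, M):
    """Enumerate every value a * b**c with |a| < b**A and m <= c <= M."""
    values = set()
    for c in range(m, M + 1):
        for a in range(-(b**A) + 1, b**A):
            values.add(Fraction(a) * Fraction(b) ** c)   # exact rational value
    return sorted(values)

nums = toy_system(b=2, A=2, m=-1, M=1)
print([str(v) for v in nums if v >= 0])
# ['0', '1/2', '1', '3/2', '2', '3', '4', '6']
print(len(nums))   # 15
```

Even in this tiny system the spacing between neighbouring numbers grows with magnitude (steps of 1/2 near zero, steps of 2 near the top), which is characteristic of floating point systems in general.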
Throughout this presentation the exact details of the
underlying floating point system will not be important,
so $\mathbb{F}$ will be used to denote any particular
floating point system.
The exact format used to store floating point numbers
does not concern us.
The meticulous reader is encouraged to read one of [x,y,z]
for details omitted in this brief
exposé of floating point.
We use
for numerical examples.
Jeff Tupper | March 1996 |