Floating point numbers are commonly used to approximate real numbers. Hardware support for floating point is widespread, so most floating point operations can be performed very quickly.
There are many different floating point number systems [5, 49, 50, 35], although they are all very similar. A floating point number can be written as:

\[ a \times b^{c} \]
All of the numbers in a particular floating point number system can be specified with a single choice of b. The set of floating point numbers with b=2 is the system of choice for computer implementations, since a and c are usually stored in binary.
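As an illustration of the form a × b^c with b=2, the following Python sketch splits a number into integers a and c. The helper name `decompose` is hypothetical, and the sketch assumes that Python's `float` is an IEEE 754 double and that the input is finite.

```python
import math


def decompose(x: float) -> tuple[int, int]:
    """Return integers (a, c) such that x == a * 2**c exactly.

    Assumes x is finite (infinities and NaNs have no such form).
    """
    if x == 0.0:
        return 0, 0
    # frexp gives x == m * 2**e with 0.5 <= |m| < 1.
    m, e = math.frexp(x)
    # A double carries at most 53 significant bits, so m * 2**53 is an integer.
    a = int(math.ldexp(m, 53))
    c = e - 53
    # Strip trailing zero bits so that a is odd (one canonical choice of many).
    while a % 2 == 0:
        a //= 2
        c += 1
    return a, c


if __name__ == "__main__":
    for x in (1.0, -6.5, 0.1):
        a, c = decompose(x)
        print(f"{x!r} = {a} * 2**{c}")
```

The pair (a, c) is not unique, since a can absorb or release factors of 2; the sketch picks the representation with odd a. Rebuilding the value with `math.ldexp(a, c)` returns the original number.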
Implementations usually represent a and c in a fixed number of bits. A common example is IEEE 754 [5] 64-bit double precision, where a is stored in 53 bits (fifty-two bits for the magnitude, one for the sign) while c is stored in 11 bits (using a biased binary representation). Two exponent values are reserved to indicate non-normalized numbers: the all-zeros exponent encodes zero and subnormal numbers, and the all-ones exponent encodes infinities and NaNs. The floating point operations described below are required in IEEE 754 compliant numerical libraries.
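The 64-bit layout can be inspected directly. The sketch below is only an illustration: the function name `double_fields` is hypothetical, and it relies on Python's standard `struct` module packing a `float` as an IEEE 754 double.

```python
import struct


def double_fields(x: float) -> tuple[int, int, int]:
    """Split a double into (sign, biased exponent, fraction) bit fields."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, stored with a bias of 1023
    fraction = bits & ((1 << 52) - 1)    # 52 bits of magnitude
    return sign, exponent, fraction


if __name__ == "__main__":
    for x in (1.0, -2.0, 0.0, float("inf")):
        print(x, double_fields(x))
```

For example, `double_fields(1.0)` gives `(0, 1023, 0)`, while zero and infinity show the two reserved exponent values, 0 and 2047.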
Formally, the system includes all numbers which may be expressed as $a \times b^{c}$ and satisfy:

\[ a, c \in \mathbb{Z}, \qquad |a| < b^{A}, \qquad m \le c \le M \]
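Another view of the floating point numbers is to imagine the numbers of the system as being described by A base-b digits multiplied by b raised to an exponent between m and M:

\[ \pm\, d_{A-1} d_{A-2} \cdots d_{0} \times b^{c}, \qquad 0 \le d_i < b, \qquad m \le c \le M \]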
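To make the digit-and-exponent view concrete, the sketch below enumerates a very small system exactly. It rests on an assumed reading of the parameters: values are a × b^c with integer a and c, where a has at most A base-b digits (|a| < b^A) and m ≤ c ≤ M. The function name `toy_system` and the chosen parameters are hypothetical.

```python
from fractions import Fraction


def toy_system(b: int, A: int, m: int, M: int) -> list[Fraction]:
    """List every value a * b**c with |a| < b**A and m <= c <= M.

    Assumed reading of the parameters: A digits for a, exponent range [m, M].
    """
    values = set()
    for c in range(m, M + 1):
        scale = Fraction(b) ** c            # exact power, also for negative c
        for a in range(-(b**A) + 1, b**A):  # all a with at most A base-b digits
            values.add(a * scale)           # duplicates collapse in the set
    return sorted(values)


if __name__ == "__main__":
    system = toy_system(b=2, A=2, m=-1, M=1)
    print(len(system), [str(v) for v in system])
```

With b=2, A=2, m=-1, M=1 this yields 15 distinct values between -6 and 6, spaced more closely near zero than near the extremes. Several (a, c) pairs name the same number, which is why the values are collected in a set.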
Throughout this presentation the exact details of the underlying floating point system, including the format used to store floating point numbers, will not be important, so the discussion applies to any particular floating point system. The meticulous reader is encouraged to read one of [x,y,z] for details omitted in this brief exposé of floating point. We use for numerical examples.
Jeff Tupper | March 1996 |