Question & Answer
Question
The following test case prints the result of subtracting two single-precision floating-point numbers. The printed result is not the expected value. What is the problem?
/* t.c */
#include
Cause
This behavior is a result of a limitation of single-precision floating-point arithmetic. The VisualAge C++ compiler implementation of single-precision and double-precision numbers follows the IEEE 754 standard, like most other hardware and software.
The complete binary representation of the values stored in f1 and f2 cannot fit into a single-precision floating-point variable. The binary format of a 32-bit single-precision float variable is s-eeeeeeee-fffffffffffffffffffffff, where s=sign, e=exponent, and f=fractional part (mantissa). A single-precision float has only about 7 decimal digits of precision (more precisely, the log base 10 of 2^23, or about 6.92 digits). Those digits are shared between the integer part and the fractional part, so the larger the integer part is, the fewer digits are left for the fractional part.
Therefore, the compiler actually performs subtraction of the following numbers:
    520.020020
  - 520.039978
  =  -0.019958
Answer
You can get the expected answer of -0.02 by using double-precision arithmetic, which provides about 15 to 16 decimal digits of precision. The long double type can provide even greater precision on platforms where it is wider than double. Double-precision arithmetic is more than adequate for most scientific applications, particularly if you use algorithms designed to maintain accuracy. Nonetheless, all floating-point representations are only approximations. For an accounting application, it may be better to use integer arithmetic rather than floating-point arithmetic. For instance, you could perform your calculations in cents and divide by 100 to convert to dollars only when you display the results.
Document Information
Modified date:
12 October 2022
UID
swg21194436