You should almost never check floating-point values for exact equality, especially when they come from computation. Instead, check that the absolute value of the difference from the comparison value is less than a certain epsilon. The "precision" of a double is fixed by its internal number representation and you can't change it. Choosing epsilon exactly can be difficult; there are some comments on that answer discussing it, which are worth reading, but you will eventually arrive at a practical epsilon-based equality.
Answer from Felice Pollano on Stack Overflow
There's no portable way, no.
With the GNU C library, you can use this API to change the rounding mode.
But in general, it's better to express it with code, so that your expectations become clear and portable:
#define EQUALITY_EPSILON 1e-3 /* Or whatever. */
if(fabs(x - y) <= EQUALITY_EPSILON)
{
}
You can't do that, since precision is determined by the data type (i.e. float or double or long double). If you want to round it for printing purposes, you can use the proper format specifiers in printf(), i.e. printf("%0.3f\n", 0.666666666).
You can't. Precision depends entirely on the data type. You've got float and double and that's it.
I'm using a method to calculate PI, but the floating-point precision errors mean that the calculation isn't 100% accurate. How can I ensure that even the last few bits of precision in my floating-point are correct?
Can I virtually increase the mantissa bits by using a class that stores several values?
Do I need to recreate my own sqrt function for greater precision?
Any Idea to solve this problem ? (I know about Boost::multiprecision but I don't want to use any external library)
https://godbolt.org/z/471Tc1b95
Note that I do this for the fun, the challenge and the desire to learn.
EDIT :
I've managed to solve my problem by creating a more_presice<T> class, where T is a floating-point.
example_3.cpp
I recommend @Jens Gustedt hexadecimal solution: use %a.
OP wants “print with maximum precision (or at least to the most significant decimal)”.
A simple example would be to print one seventh as in:
#include <float.h>
int Digs = DECIMAL_DIG;
double OneSeventh = 1.0/7.0;
printf("%.*e\n", Digs, OneSeventh);
// 1.428571428571428492127e-01
But let's dig deeper ...
Mathematically, the answer is "0.142857 142857 142857 ...", but we are using finite precision floating point numbers.
Let's assume IEEE 754 double-precision binary.
So the OneSeventh = 1.0/7.0 results in the value below. Also shown are the preceding and following representable double floating point numbers.
OneSeventh before = 0.1428571428571428 214571170656199683435261249542236328125
OneSeventh = 0.1428571428571428 49212692681248881854116916656494140625
OneSeventh after = 0.1428571428571428 769682682968777953647077083587646484375
Printing the exact decimal representation of a double has limited uses.
C has 2 families of macros in <float.h> to help us.
The first set is the number of significant digits to print in a string in decimal so when scanning the string back,
we get the original floating point. These are shown with the C spec's minimum value and a sample C11 compiler.
FLT_DECIMAL_DIG 6, 9 (float) (C11)
DBL_DECIMAL_DIG 10, 17 (double) (C11)
LDBL_DECIMAL_DIG 10, 21 (long double) (C11)
DECIMAL_DIG 10, 21 (widest supported floating type) (C99)
The second set is the number of significant digits a string may be scanned into a floating point and then the FP printed, still retaining the same string presentation. These are shown with the C spec's minimum value and a sample C11 compiler. I believe these are available pre-C99.
FLT_DIG 6, 6 (float)
DBL_DIG 10, 15 (double)
LDBL_DIG 10, 18 (long double)
The first set of macros seems to meet OP's goal of significant digits. But that macro is not always available.
#ifdef DBL_DECIMAL_DIG
#define OP_DBL_Digs (DBL_DECIMAL_DIG)
#else
#ifdef DECIMAL_DIG
#define OP_DBL_Digs (DECIMAL_DIG)
#else
#define OP_DBL_Digs (DBL_DIG + 3)
#endif
#endif
The "+ 3" was the crux of my previous answer. It centered on this question: knowing the digits for the round-trip conversion string-FP-string (set #2 macros, available in C89), how would one determine the digits for FP-string-FP (set #1 macros, available post C89)? In general, adding 3 was the result.
Now how many significant digits to print is known and driven via <float.h>.
To print N significant decimal digits one may use various formats.
With "%e", the precision field is the number of digits after the lead digit and decimal point.
So - 1 is in order. Note: This -1 is not in the initial int Digs = DECIMAL_DIG;
printf("%.*e\n", OP_DBL_Digs - 1, OneSeventh);
// 1.4285714285714285e-01
With "%f", the precision field is the number of digits after the decimal point.
For a number like OneSeventh/1000000.0, one would need OP_DBL_Digs + 6 to see all the significant digits.
printf("%.*f\n", OP_DBL_Digs , OneSeventh);
// 0.14285714285714285
printf("%.*f\n", OP_DBL_Digs + 6, OneSeventh/1000000.0);
// 0.00000014285714285714285
Note: Many are used to "%f". That displays 6 digits after the decimal point; 6 is the display default, not the precision of the number.
The short answer to print floating point numbers losslessly (such that they can be read back in to exactly the same number, except NaN and Infinity):
- If your type is float: use printf("%.9g", number).
- If your type is double: use printf("%.17g", number).
Do NOT use %f, since that only specifies how many significant digits after the decimal and will truncate small numbers. For reference, the magic numbers 9 and 17 can be found in float.h which defines FLT_DECIMAL_DIG and DBL_DECIMAL_DIG.
Hi Debojit Acharjee,
The significand of the double type is approximately 15 to 17 decimal digits for most platforms. In most cases, a variable of type double can accurately represent 15 to 17 decimal digits. Numbers outside this range may lose precision or be rounded.
When I defined a 18 digits number, the result lost precision.
When using floating-point numbers, you should choose the appropriate data type according to your specific needs and precision requirements, and avoid using values beyond its representation range for calculations.
Regarding the double type, this documentation states:
Microsoft Specific The double type contains 64 bits: 1 for sign, 11 for the exponent, and 52 for the mantissa. Its range is +/-1.7E308 with at least 15 digits of precision.
You could also refer to this document for the float type.
Best regards,
Elya Yao
A double is stored in base 2, not decimal. It's stored in 64 bits. The mantissa is 52 bits, or max 4503599627370495 (2^52 - 1) in decimal. The exponent is 11 bits, or max 2047 in decimal. The final bit is the sign bit.
See:
https://en.wikipedia.org/wiki/Computer_number_format#:~:text=an%2011%2Dbit%20binary%20exponent,gives%20the%20actual%20signed%20value
How to reduce the precision of a double in C?
To reduce the relative precision of a floating point numbers such that various least significant bits of the significand/mantissa are zero'd, code needs to access the significand.
Use frexp() to extract the significand and exponent of the FP number.
Scale the significand with ldexp() and then round, truncate, or floor - depending on coding goals - to remove precision. Truncation is shown, yet I recommend rounding via rint()
Scale back and add back the exponent.
#include <math.h>
#include <stdio.h>
double reduce(double x, int precision_power_2) {
if (isfinite(x)) {
int power_2;
// The frexp functions break a floating-point number into a
// normalized fraction and an integral power of 2.
double normalized_fraction = frexp(x, &power_2); // 0.5 <= result < 1.0 or 0
// The ldexp functions multiply a floating-point number by an integral power of 2
double less_precise = trunc(ldexp(normalized_fraction, precision_power_2));
x = ldexp(less_precise, power_2 - precision_power_2);
}
return x;
}
void testr(double x, int pow2) {
printf("reduce(%a, %d --> %a\n", x, pow2, reduce(x, pow2));
}
int main(void) {
testr(0.1, 5);
return 0;
}
Output
// v-53 bin.digs-v v-v 5 significant binary digits
reduce(0x1.999999999999ap-4, 5 --> 0x1.9p-4
Use frexpf(), ldexpf(), rintf(), truncf(), floorf(), etc. for float.
If you wish to apply the bitwise and &, you need to apply it to the integer representation of the float value:
float f = 0.1f;
printf("Before: %a %.16e\n", f, f);
unsigned int i;
_Static_assert(sizeof f == sizeof i, "pick integer type of the correct size");
memcpy(&i, &f, sizeof i);
i &= ~ 0x3U; // or any other mask.
// This one assumes the endianness of floats is identical to integers'
memcpy(&f, &i, sizeof f);
printf("After: %a %.16e\n", f, f);
Note that this does not provide you with 29-bit IEEE-754-like numbers. The value in f was first rounded as a 32-bit single-precision number, and then brutally truncated.
A more elegant method relies on a floating-point constant with two bits set:
float f = 0.1f;
float factor = 5.0f; // or 3, or 9, or 17
float c = factor * f;
f = c - (c - f);
printf("After: %a %.16e\n", f, f);
The advantage of this method is that it rounds f to the nearest value using N bits of significand, as opposed to truncating it towards zero as in the first method. However, the program is still computing with 32-bit IEEE 754 floating-point and then rounding to fewer bits, so the result is still not always equivalent to what a narrower floating-point type would have produced.
The second method relies on an idea by Dekker, described online in this article.
I can't figure out the code to take a double f = 44.444 and store it into double c as 44.44. I need to make the precision after the decimal place 2.
For cout << setprecision(2) << fixed << f, I know this works, but I don't want to print on the screen.