variables - What are the actual min/max values for float and double (C++) - Stack Overflow
Why is the max value of a 32 bit floating point number 3.4 x 10^38?
c# - What is the Max value for 'float'? - Stack Overflow
How to calculate min/max values of floating point numbers? - Software Engineering Stack Exchange
Alright. Using what I learned from here (thanks, everyone) and other parts of the web, I wrote a neat little summary of the two in case I run into another issue like this.
In C++ there are two ways to represent/store decimal values.
Floats and Doubles
A float can store values from:
- -340282346638528859811704183484516925440.0000000000000000 Float lowest
- 340282346638528859811704183484516925440.0000000000000000 Float max
A double can store values from:
- -179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0000000000000000 Double lowest
- 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0000000000000000 Double max
Float's precision allows it to store up to 9 digits (7 real digits, plus 2 from the decimal-to-binary conversion).
Double, as the name suggests, can store twice as much precision as a float. It can store up to 17 digits (15 real digits, plus 2 from the decimal-to-binary conversion).
e.g.
float x = 1.426;
double y = 8.739437;
Decimals & Math
Because a float can carry 7 real decimal digits, and a double can carry 15, a proper method must be used to print them correctly when performing calculations.
e.g.
#include <iostream>
#include <limits>
using namespace std;

int main() {
    typedef std::numeric_limits<double> dbl;
    cout.precision(dbl::max_digits10 - 2); // sets the precision to the *proper* number of digits
    cout << dbl::max_digits10 << endl;     // prints 17
    double x = 12345678.312;
    double a = 12345678.244;
    // without setting the precision, the result of this calculation won't be printed correctly
    cout << endl << x + a << endl;
}
example 2:
int main() {
    typedef std::numeric_limits<float> flt; // <limits>, as above
    cout.precision(flt::max_digits10 - 2);
    cout << flt::max_digits10 << endl; // prints 9
    float x = 54.122111f;
    float a = 11.323111f;
    /* without setting the precision this outputs a different value, as well as
       making sure we're *limited* to 7 digits; if we entered another digit before
       the decimal point, there would be one fewer digit on the right, as there
       can only be 7. Doubles work in the same way. */
    cout << endl << x + a << endl;
}
Roughly how accurate is this description? Can it be used as a standard when confused?
The std::numeric_limits class template in the <limits> header provides information about the characteristics of numeric types.
For a floating-point type T, here are the greatest and least values representable in the type, in various senses of “greatest” and “least.” I also include the values for the common IEEE 754 64-bit binary type, which is called double in this answer. These are in decreasing order:
- std::numeric_limits<T>::infinity() is the largest representable value, if T supports infinity. It is, of course, infinity. Whether the type T supports infinity is indicated by std::numeric_limits<T>::has_infinity.
- std::numeric_limits<T>::max() is the largest finite value. For double, this is 2^1024 − 2^971, approximately 1.79769•10^308.
- std::numeric_limits<T>::min() is the smallest positive normal value. Floating-point formats often have an interval where the exponent cannot get any smaller, but the significand (fraction portion of the number) is allowed to get smaller until it reaches zero. This comes at the expense of precision but has some desirable mathematical-computing properties. min() is the point where this precision loss starts. For double, this is 2^−1022, approximately 2.22507•10^−308.
- std::numeric_limits<T>::denorm_min() is the smallest positive value. In types which have subnormal values, it is subnormal. Otherwise, it equals std::numeric_limits<T>::min(). For double, this is 2^−1074, approximately 4.94066•10^−324.
- std::numeric_limits<T>::lowest() is the least finite value. It is usually a negative number large in magnitude. For double, this is −(2^1024 − 2^971), approximately −1.79769•10^308.
- If std::numeric_limits<T>::has_infinity and std::numeric_limits<T>::is_signed are true, then -std::numeric_limits<T>::infinity() is the least value. It is, of course, negative infinity.
Another characteristic you may be interested in is:
std::numeric_limits<T>::digits10 is the greatest number of decimal digits such that converting any decimal number with that many digits to T and then converting back to the same number of decimal digits will yield the original number. For double, this is 15.
I understand 2^x − 1 gives you the max number for x unsigned bits. But say we're using an IEEE-754 32-bit floating point number. Shouldn't the maximum value be 10^256 × 2^23, since the mantissa is a 23-bit number and the exponent is 10 raised to the power of an 8-bit number?
Wikipedia, however, says the max is (2 − 2^−23) × 2^127.
where does this come from? Why is the exponent negative?
For 32-bit floating point, the maximum value is shown in Table III:
0.9999998 x 2^127 represented in hex as: mantissa=7FFFFF, exponent=7F.
We can decompose the mantissa/exponent into a (close) decimal value as follows:
7FFFFF <base-16> = 8,388,607 <base-10>.
There are 23 bits of significance, so we divide 8,388,607 by 2^23.
8,388,607 / 2^23 = 0.99999988079071044921875 (see Table III)
as far as the exponent:
7F <base-16> = 127 <base-10>
and now we multiply the mantissa by 2^127 (the exponent)
8,388,607 / 2^23 * 2^127 =
8,388,607 * 2^104 = 1.7014116317805962808001687976863 * 10^38
This is the largest 32-bit floating point value because the largest mantissa is used and the largest exponent.
The 48-bit floating point format adds 16 bits of lesser-significance mantissa but leaves the exponent the same size. Thus, the max value would be represented in hex as
mantissa=7FFFFFFFFF, exponent=7F.
again, we can compute
7FFFFFFFFF <base-16> = 549,755,813,887 <base-10>
the max exponent is still 127, but now we need to divide by 2^39 (23 + 16 = 39 mantissa bits). 127 − 39 = 88, so just multiply by 2^88:
549,755,813,887 * 2^88 =
1.7014118346015974672186595864716 * 10^38
This is the largest 48-bit floating point value because we used the largest possible mantissa and largest possible exponent.
So, the max values are:
1.7014116317805962808001687976863 * 10^38, for 32-bit, and,
1.7014118346015974672186595864716 * 10^38, for 48-bit
The max value for 48-bit is just slightly larger than for 32-bit, which stands to reason since a few bits are added to the end of the mantissa.
(To be exact the maximum number for the 48-bit format can be expressed as a binary number that consists of 39 1's followed by 88 0's.)
(The smallest is just the negative of this value. The closest to zero without being zero can also easily be computed as above: use the smallest possible (positive) mantissa, 000001, and the smallest possible exponent, 80 in hex, or −128 in decimal.)
FYI
Some floating point formats use an unrepresented hidden 1 bit in the mantissa (this allows for one extra bit of precision in the mantissa, as follows: the first binary digit of all numbers (except 0, or denormals, see below) is a 1, therefore we don't have to store that 1, and we have an extra bit of precision). This particular format doesn't seem to do this.
Other floating point formats allow a denormalized mantissa, which allows representing (positive) numbers smaller than the smallest exponent alone would allow, by trading bits of precision for additional (negative) powers of 2. This is easy to support if the format doesn't also have the hidden one bit, a bit harder if it does.
8,388,607 / 2^23 is the value you'd get with mantissa=0x7FFFFF and exponent=0x00. It is not the single bit value but rather the value with a full mantissa and a neutral, or more specifically, a zero exponent.
The reason this value is not directly 8388607, and requires division (by 2^23, and hence is less than what you might expect), is that the implied radix point is in front of the mantissa rather than after it. So, think +/-.11111111111111111111111 (a sign bit, followed by a radix point, followed by twenty-three 1-bits) for the mantissa, and a plain integer (no radix point; in this case, 127) for the exponent.
mantissa = 0x7FFFFF with exponent = 0x7F is the largest value which corresponds to 8388607 * 2 ^ 104, where the 104 comes from 127-23: again, subtracting 23 powers of two because the mantissa has the radix point at the beginning. If the radix point were at the end, then the largest value (0x7FFFFF,0x7F) would indeed be 8,388,607 * 2 ^ 127.
There are several possible ways to consider a single bit value for the mantissa. One is mantissa=0x400000, and the other is mantissa=0x000001. Without considering the radix point or the exponent, the former is 4,194,304 and the latter is 1. With a zero exponent and considering the radix point, the former is 0.5 (decimal) and the latter is 0.00000011920928955078125. With a maximum (or minimum) exponent, we can compute the max and min single bit values.
(Note that the latter format where the mantissa has leading zeros would be considered denormalized in some number formats, and its normalized representation would be 0x400000 with an exponent of -23).
You can borrow from how IEEE floating point is laid out for fast comparison: sign, exponent, mantissa. However, in that PDF I see the mantissa and exponent are reversed.
This means that to compare you'll have to first check the sign bit and if one is not the winner yet you compare the exponents and then you compare the mantissa.
If one is positive and the other is negative then the positive is the max.
If both are positive and one exponent is larger, then that one is the max (if both are negative, then it is the min).
Similarly for mantissa.
