Alright. Using what I learned from here (thanks everyone) and the other parts of the web I wrote a neat little summary of the two just in case I run into another issue like this.
In C++ there are two ways to represent/store decimal values.
Floats and Doubles
A float can store values from:
- -340282346638528859811704183484516925440.0000000000000000 Float lowest
- 340282346638528859811704183484516925440.0000000000000000 Float max
A double can store values from:
-179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0000000000000000 Double lowest
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0000000000000000 Double max
Float's precision allows it to store a value of up to 9 digits (7 real digits, +2 from decimal to binary conversion)
Double, like the name suggests can store twice as much precision as a float. It can store up to 17 digits. (15 real digits, +2 from decimal to binary conversion)
e.g.
float x = 1.426;
double y = 8.739437;
Decimals & Math
Because a float can carry 7 real decimal digits and a double can carry 15, a proper method must be used to print them when performing calculations.
e.g
#include <iostream>
#include <limits>
using namespace std;
typedef std::numeric_limits<double> dbl;
cout.precision(dbl::max_digits10-2); // sets the precision to the *proper* amount of digits.
cout << dbl::max_digits10 <<endl; // prints 17.
double x = 12345678.312;
double a = 12345678.244;
// these calculations won't be printed correctly without setting the precision.
cout << endl << x+a <<endl;
example 2:
typedef std::numeric_limits< float> flt;
cout.precision(flt::max_digits10-2);
cout << flt::max_digits10 <<endl;
float x = 54.122111;
float a = 11.323111;
cout << endl << x+a <<endl; /* without setting precision this outputs a different value, as well as making sure we're *limited* to 7 digits. If we were to enter another digit before the decimal point, the digits on the right would be one fewer, as there can only be 7. Doubles work in the same way */
Roughly how accurate is this description? Can it be used as a reference when I'm confused?
Answer from user9318444 on Stack Overflow
The std::numeric_limits class template in the <limits> header provides information about the characteristics of numeric types.
For a floating-point type T, here are the greatest and least values representable in the type, in various senses of “greatest” and “least.” I also include the values for the common IEEE 754 64-bit binary type, which is called double in this answer. These are in decreasing order:
- std::numeric_limits<T>::infinity() is the largest representable value, if T supports infinity. It is, of course, infinity. Whether the type T supports infinity is indicated by std::numeric_limits<T>::has_infinity.
- std::numeric_limits<T>::max() is the largest finite value. For double, this is 2^1024 − 2^971, approximately 1.79769•10^308.
- std::numeric_limits<T>::min() is the smallest positive normal value. Floating-point formats often have an interval where the exponent cannot get any smaller, but the significand (fraction portion of the number) is allowed to get smaller until it reaches zero. This comes at the expense of precision but has some desirable mathematical-computing properties. min() is the point where this precision loss starts. For double, this is 2^−1022, approximately 2.22507•10^−308.
- std::numeric_limits<T>::denorm_min() is the smallest positive value. In types which have subnormal values, it is subnormal. Otherwise, it equals std::numeric_limits<T>::min(). For double, this is 2^−1074, approximately 4.94066•10^−324.
- std::numeric_limits<T>::lowest() is the least finite value. It is usually a negative number large in magnitude. For double, this is −(2^1024 − 2^971), approximately −1.79769•10^308.
- If std::numeric_limits<T>::has_infinity and std::numeric_limits<T>::is_signed are true, then -std::numeric_limits<T>::infinity() is the least value. It is, of course, negative infinity.
Another characteristic you may be interested in is:
std::numeric_limits<T>::digits10 is the greatest number of decimal digits such that converting any decimal number with that many digits to T and then converting back to the same number of decimal digits will yield the original number. For double, this is 15.
I am using MS Visual Studio, so there may be some MS specific values. Something cross platform would, of course, be nicer.
We have, for instance, INT32_MIN and INT32_MAX. Is there something similar for Floating Point, when stored in 32 bits?
A floating point number consists of 3 parts: a sign, a fraction, and an exponent. These are all integers, and they're combined to get a real number: (-1)^sign × (fraction × 2^-23) × 2^exponent
The Wikipedia article uses a binary number with a decimal point for the fraction, but I find it clearer to think of it as an integer multiplied by a fixed constant. Mathematically it's the same.
The fraction is 23 bits, but there's an extra hidden bit that makes it a 24-bit value. The largest integer that can be represented in 24 bits is 16777215, which has just over 7 decimal digits. This defines the precision of the format.
The exponent is the magic that expands the range of the numbers beyond what the precision can hold. There are 8 bits to hold the exponent, but a couple of those values are special. The value 255 is reserved for infinities and Not-a-Number (NaN) representations, which aren't real numbers and don't follow the formula given above. The value 0 represents the denormal range, so called because the hidden bit of the fraction is 0 rather than 1 - it's not normalized. In this case the exponent is always -126. Note that the precision of denormal numbers declines as the fraction gets smaller, because it has fewer digits. For all the other bit patterns, 1-254, the hidden bit of the fraction is 1 and the exponent is the stored value minus 127. You can see the details at the Wikipedia section on exponent encoding.
The smallest positive denormal number is (-1)^0 × (1 × 2^-23) × 2^-126, or about 1.4e-45.
The smallest positive normalized number is (-1)^0 × (0x800000 × 2^-23) × 2^(1 - 127), or about 1.175494e-38.
One must distinguish between the internal representation and the format.
In the internal representation, floating-point numbers are typically packed as a sign bit, the exponent field, and the significand or mantissa, from left to right. This representation defines the range you mentioned (the mathematical limit).
The format defines the "external" representation and is limited to the available space and thereby to the precision of the data type e.g. float about 7 digits (technical limit).
For 32-bit floating point, the maximum value is shown in Table III:
0.9999998 x 2^127 represented in hex as: mantissa=7FFFFF, exponent=7F.
We can decompose the mantissa/exponent into a (close) decimal value as follows:
7FFFFF <base-16> = 8,388,607 <base-10>.
There are 23 bits of significance, so we divide 8,388,607 by 2^23.
8,388,607 / 2^23 = 0.99999988079071044921875 (see Table III)
as far as the exponent:
7F <base-16> = 127 <base-10>
and now we multiply the mantissa by 2^127 (the exponent)
8,388,607 / 2^23 * 2^127 =
8,388,607 * 2^104 = 1.7014116317805962808001687976863 * 10^38
This is the largest 32-bit floating point value because the largest mantissa is used and the largest exponent.
The 48-bit floating point adds 16 bits of lesser-significance mantissa but leaves the exponent the same size. Thus, the max value would be represented in hex as
mantissa=7FFFFFFFFF, exponent=7F.
again, we can compute
7FFFFFFFFF <base-16> = 549,755,813,887 <base-10>
the max exponent is still 127, but we now need to divide by 2^39 (since 23 + 16 = 39 mantissa bits). Because 127 - 39 = 88, we just multiply by 2^88:
549,755,813,887 * 2^88 =
1.7014118346015974672186595864716 * 10^38
This is the largest 48-bit floating point value because we used the largest possible mantissa and largest possible exponent.
So, the max values are:
1.7014116317805962808001687976863 * 10^38, for 32-bit, and,
1.7014118346015974672186595864716 * 10^38, for 48-bit
The max value for 48-bit is just slightly larger than for 32-bit, which stands to reason since a few bits are added to the end of the mantissa.
(To be exact the maximum number for the 48-bit format can be expressed as a binary number that consists of 39 1's followed by 88 0's.)
(The smallest is just the negative of this value. The closest to zero without being zero can also easily be computed as per above: use the smallest possible (positive) mantissa: 0x000001, and the smallest possible exponent: 80 in hex, or -128 in decimal.)
FYI
Some floating point formats use an unrepresented hidden 1 bit in the mantissa (this allows for one extra bit of precision in the mantissa, as follows: the first binary digit of all numbers (except 0, or denormals, see below) is a 1, therefore we don't have to store that 1, and we have an extra bit of precision). This particular format doesn't seem to do this.
Other floating point formats allow a denormalized mantissa, which allows representing (positive) numbers smaller than the smallest exponent alone would permit, by trading bits of precision for additional (negative) powers of 2. This is easy to support if the format doesn't also use the hidden one bit, and a bit harder if it does.
8,388,607 / 2^23 is the value you'd get with mantissa=0x7FFFFF and exponent=0x00. It is not the single bit value but rather the value with a full mantissa and a neutral, or more specifically, a zero exponent.
The reason this value is not directly 8388607, and requires division (by 2^23 and hence is less than what you might expect) is that the implied radix point is in front of the mantissa, rather than after it. So, think +/-.111111111111111111111 (a sign bit, followed by a radix point, followed by twenty-three 1-bits) for the mantissa and +/-111111111111 (no radix point here, just an integer, in this case, 127) for the exponent.
mantissa = 0x7FFFFF with exponent = 0x7F is the largest value which corresponds to 8388607 * 2 ^ 104, where the 104 comes from 127-23: again, subtracting 23 powers of two because the mantissa has the radix point at the beginning. If the radix point were at the end, then the largest value (0x7FFFFF,0x7F) would indeed be 8,388,607 * 2 ^ 127.
There are several possible ways we could consider a single-bit value for the mantissa. One is mantissa=0x400000, and the other is mantissa=0x000001. Without considering the radix point or the exponent, the former is 4,194,304, and the latter is 1. With a zero exponent and considering the radix point, the former is 0.5 (decimal) and the latter is 0.00000011920928955078125. With a maximum (or minimum) exponent, we can compute max and min single-bit values.
(Note that the latter format where the mantissa has leading zeros would be considered denormalized in some number formats, and its normalized representation would be 0x400000 with an exponent of -23).
You can borrow from how the IEEE floating point is laid out for fast comparison: sign, exponent, mantissa. However, in that PDF I see the mantissa and exponent are reversed.
This means that to compare you'll have to first check the sign bit and if one is not the winner yet you compare the exponents and then you compare the mantissa.
If one is positive and the other is negative then the positive is the max.
If both are positive and one exponent is larger then it is the max (if both are negative then it is the min)
Similarly for mantissa.
The NORMAL ranges are:
- 16-bit (half precision): ±6.10e-5 to ±65504.0
- 32-bit (single precision): ±1.18e−38 to ±3.4e38
- 64-bit (double precision): ±2.23e−308 to ±1.80e308
If you allow for DENORMALS as well, then the minimum values are:
- 16-bit: ±5.96e-8
- 32-bit: ±1e-45
- 64-bit: ±5e-324
Always keep in mind that just because a number is in this range doesn't mean it can be exactly represented. At any range, floating-point numbers necessarily skip values due to cardinality reasons. The classic example is 1/3 which has no exact representation in any finite precision, for binary or decimal formats. In general you can only precisely represent those numbers that are called "dyadic" for the binary format, i.e., those of the form A/2^B for some A and B; provided the result falls into the dynamic range.
In answer to the question, "How do I calculate the min/max?", it's straightforward, as long as you pay attention to several nuances of the IEEE-754 formats.
Let's start with single precision. There's a 24-bit significand, including one "hidden" or "implicit" bit. So the largest significand is the binary fraction 0b1.11111111111111111111111, which is equal to the hexadecimal fraction 0x1.fffffe, or the decimal fraction 1.99999988079071044921875.
There's an 8-bit exponent, with raw values ranging from 0 to 255. The lowest and highest values are reserved (more on those in a bit), and there's a bias of 127. So the largest exponent value is 254 - 127 = 127.
So the largest single-precision floating-point value is 1.99999988079071044921875 × 2^127, or about 3.4 × 10^38. (The exact value is 3.4028234663852885981170418348451692544 × 10^38.)
For normal floating-point numbers, the "implicit" or "hidden" bit is always 1, so the smallest normal significand is 0b1.00000000000000000000000, which is just 1. The smallest normal exponent is 1 - 127 = -126. (Again, the smallest raw exponent value, 0, is reserved.) So the smallest normal single-precision floating-point number is 1.0 × 2^-126, or about 1.1755 × 10^-38. (If you're into this stuff, it's fun to compute these numbers out to full precision, but the trailing digits don't mean much, so from now on I'm going to round to more reasonable approximations like 1.1755.)
Finally, we come to the subnormal numbers, which are the ones that the minimum raw exponent value of 0 is reserved for. These have a "hidden" or "implicit" bit of 0. So the smallest nonzero subnormal significand is 0b0.00000000000000000000001, or 0x0.000002, or about 0.0000001192. So the very smallest, nonzero, subnormal, single-precision, floating-point number is about 0.0000001192 × 2^-126, or about 1.4 × 10^-45.
The subnormal numbers are special because they have fewer than 24 binary bits of significance. They fill in the gap between 0 and the smallest normal number, and allow for "gradual underflow". (Notice that the exponent in the subnormal case is one higher than you might have expected by naïvely subtracting the bias from 0. Stated another way, the precision of the subnormal binade, 2^(-126-23) = 2^-149, is equal to the precision of the lowest normal binade, as it has to be if they're to gracefully fill in that gap.)
For the record, the other special values are Infinity and NaN ("Not a Number"). Those are the values that the maximum exponent value is reserved for. They don't really concern us here, except that they mean that (for single precision) the maximum scaled exponent value we care about is +127, not +128.
We can use the same procedure to compute the minima and maxima for half and double precision. Double precision has a 52+1 = 53 bit significand, so the largest value is
0b1.1111111111111111111111111111111111111111111111111111
There are 11 bits for the exponent, with a bias of 1023, giving a largest scaled exponent of 2046 - 1023 = 1023. So the math for the largest value works out to 1.999999999999999778 × 2^1023 ≈ 1.8 × 10^308.
The minimum normal and subnormal values have significands of
0b1.0000000000000000000000000000000000000000000000000000
0b0.0000000000000000000000000000000000000000000000000001
and a base-2 exponent of 1 - 1023 = -1022, so the minimum values work out to
1.0 × 2^-1022 ≈ 2.22 × 10^-308
0.00000000000000022 × 2^-1022 ≈ 4.9 × 10^-324
You've probably noticed by now that we don't really need to work with those 24- and 53-bit binary significands explicitly, since all three of the values we care about end up being close to or exactly powers of two, with the powers being numbers like 127 or 1023, plus or minus 1, or for the subnormals, additionally shifted by the number of significand bits. So here's a shortcut, for single, double, and also half precision:
| precision | exp, signif bits | min/max | power of two | equals approximately |
|---|---|---|---|---|
| single | 8, 23 | max | 2^128 | 3.4 × 10^38 |
| | | min norm | 2^-126 | 1.175 × 10^-38 |
| | | min subnorm | 2^(-126-23) | 1.4 × 10^-45 |
| double | 11, 52 | max | 2^1024 | 1.8 × 10^308 |
| | | min norm | 2^-1022 | 2.22 × 10^-308 |
| | | min subnorm | 2^(-1022-52) | 4.9 × 10^-324 |
| half | 5, 10 | max | 2^16 | 6.55 × 10^4 |
| | | min norm | 2^-14 | 6.1 × 10^-5 |
| | | min subnorm | 2^(-14-10) | 5.96 × 10^-8 |
Finally, if you don't want to compute these values (or even look them up on the internet), many programming languages have ways of requesting some/all of them programmatically. For example, C has constants in <float.h> like FLT_MIN and DBL_MAX, and C++ has static methods from <limits> like std::numeric_limits<float>::min() and std::numeric_limits<double>::max().
For Java there's Float.MIN_NORMAL and Double.MAX_VALUE, and Python has sys.float_info.
Hello!
I'm studying a bit of Arduino programming and for ints and longs I have learned their minimum and maximum values depend on the amount of bits they take up in RAM.
So, since integer takes 16 bits, that's 2^16 combinations, and since half goes for negative numbers, half for positive numbers and a zero, its range is from -2^15 to (2^15) - 1. Long is 32 bits so formula just changes to -2^31 to (2^31) - 1. I also understand that unsigned makes the variable hold only positive values so I get 0 to (2^16)-1 and 0 to (2^32)-1.
I know I can't apply this logic to floats and doubles. From Arduino documentation: Floating-point numbers can be as large as 3.4028235E+38 and as low as -3.4028235E+38. They are stored as 32 bits (4 bytes) of information.
How are floats stored in memory and how can I calculate their min and max values?