Alright. Using what I learned from here (thanks everyone) and the other parts of the web I wrote a neat little summary of the two just in case I run into another issue like this.
In C++ there are two ways to represent/store decimal values.
Floats and Doubles
A float can store values from:
- -340282346638528859811704183484516925440.0000000000000000 Float lowest
- 340282346638528859811704183484516925440.0000000000000000 Float max
A double can store values from:
-179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0000000000000000 Double lowest
179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0000000000000000 Double max
Float's precision allows it to store a value of up to 9 digits (7 real digits, +2 from decimal to binary conversion)
Double, like the name suggests can store twice as much precision as a float. It can store up to 17 digits. (15 real digits, +2 from decimal to binary conversion)
e.g.
float x = 1.426;
double y = 8.739437;
Decimals & Math
Because a float can carry 7 real decimal digits and a double can carry 15, a proper method must be used to print them when performing calculations.
e.g
#include <iostream>
#include <limits>
using namespace std;
typedef std::numeric_limits<double> dbl;
cout.precision(dbl::max_digits10-2); // sets the precision to the *proper* amount of digits.
cout << dbl::max_digits10 <<endl; // prints 17.
double x = 12345678.312;
double a = 12345678.244;
// these calculations won't be printed correctly without setting the precision.
cout << endl << x+a <<endl;
example 2:
typedef std::numeric_limits< float> flt;
cout.precision(flt::max_digits10-2);
cout << flt::max_digits10 <<endl;
float x = 54.122111;
float a = 11.323111;
cout << endl << x+a <<endl; /* without setting precision this outputs a different value, as well as making sure we're *limited* to 7 digits. If we were to enter another digit before the decimal point, the digits on the right would be one fewer, as there can only be 7. Doubles work in the same way */
Roughly how accurate is this description? Can it be used as a reference when I'm confused?
Answer from user9318444 on Stack Overflow
The std::numeric_limits class template in the <limits> header provides information about the characteristics of numeric types.
For a floating-point type T, here are the greatest and least values representable in the type, in various senses of “greatest” and “least.” I also include the values for the common IEEE 754 64-bit binary type, which is called double in this answer. These are in decreasing order:
- std::numeric_limits<T>::infinity() is the largest representable value, if T supports infinity. It is, of course, infinity. Whether the type T supports infinity is indicated by std::numeric_limits<T>::has_infinity.
- std::numeric_limits<T>::max() is the largest finite value. For double, this is 2^1024 − 2^971, approximately 1.79769•10^308.
- std::numeric_limits<T>::min() is the smallest positive normal value. Floating-point formats often have an interval where the exponent cannot get any smaller, but the significand (fraction portion of the number) is allowed to get smaller until it reaches zero. This comes at the expense of precision but has some desirable mathematical-computing properties. min() is the point where this precision loss starts. For double, this is 2^−1022, approximately 2.22507•10^−308.
- std::numeric_limits<T>::denorm_min() is the smallest positive value. In types which have subnormal values, it is subnormal. Otherwise, it equals std::numeric_limits<T>::min(). For double, this is 2^−1074, approximately 4.94066•10^−324.
- std::numeric_limits<T>::lowest() is the least finite value. It is usually a negative number large in magnitude. For double, this is −(2^1024 − 2^971), approximately −1.79769•10^308.
- If std::numeric_limits<T>::has_infinity and std::numeric_limits<T>::is_signed are true, then -std::numeric_limits<T>::infinity() is the least value. It is, of course, negative infinity.
Another characteristic you may be interested in is:
std::numeric_limits<T>::digits10 is the greatest number of decimal digits such that converting any decimal number with that many digits to T and then converting back to the same number of decimal digits will yield the original number. For double, this is 15.
I am using MS Visual Studio, so there may be some MS specific values. Something cross platform would, of course, be nicer.
We have, for instance, INT32_MIN and INT32_MAX. Is there something similar for Floating Point, when stored in 32 bits?
A floating point number consists of 3 parts: a sign, a fraction, and an exponent. These are all integers, and they're combined to get a real number: (-1)^sign × (fraction × 2^-23) × 2^exponent
The Wikipedia article uses a binary number with a decimal point for the fraction, but I find it clearer to think of it as an integer multiplied by a fixed constant. Mathematically it's the same.
The fraction is 23 bits, but there's an extra hidden bit that makes it a 24-bit value. The largest integer that can be represented in 24 bits is 16777215, which has just over 7 decimal digits. This defines the precision of the format.
The exponent is the magic that expands the range of the numbers beyond what the precision can hold. There are 8 bits to hold the exponent, but a couple of those values are special. The value 255 is reserved for infinities and Not-a-Number (NaN) representations, which aren't real numbers and don't follow the formula given above. The value 0 represents the denormal range, so called because the hidden bit of the fraction is 0 rather than 1 - it's not normalized. In this case the exponent is always -126. Note that the precision of denormal numbers declines as the fraction gets smaller, because it has fewer digits. For all the other bit patterns, 1-254, the hidden bit of the fraction is 1 and the exponent is the stored value minus 127. You can see the details at the Wikipedia section on exponent encoding.
The smallest positive denormal number is (-1)^0 × (1 × 2^-23) × 2^-126, or about 1.4e-45.
The smallest positive normalized number is (-1)^0 × (0x800000 × 2^-23) × 2^(1 - 127), or about 1.175494e-38.
One must distinguish between the internal representation and the format.
In the internal representation, floating-point numbers are typically packed as a sign bit, the exponent field, and the significand or mantissa, from left to right. This representation defines the range you mentioned (the mathematical limit).
The format defines the "external" representation and is limited to the available space and thereby to the precision of the data type e.g. float about 7 digits (technical limit).
For 32-bit floating point, the maximum value is shown in Table III:
0.9999998 x 2^127 represented in hex as: mantissa=7FFFFF, exponent=7F.
We can decompose the mantissa/exponent into a (close) decimal value as follows:
7FFFFF <base-16> = 8,388,607 <base-10>.
There are 23 bits of significance, so we divide 8,388,607 by 2^23.
8,388,607 / 2^23 = 0.99999988079071044921875 (see Table III)
as far as the exponent:
7F <base-16> = 127 <base-10>
and now we multiply the mantissa by 2^127 (the exponent)
8,388,607 / 2^23 * 2^127 =
8,388,607 * 2^104 = 1.7014116317805962808001687976863 * 10^38
This is the largest 32-bit floating point value because the largest mantissa is used and the largest exponent.
The 48-bit floating point adds 16 bits of lesser-significance mantissa but leaves the exponent the same size. Thus, the max value would be represented in hex as
mantissa=7FFFFFFFFF, exponent=7F.
again, we can compute
7FFFFFFFFF <base-16> = 549,755,813,887 <base-10>
the max exponent is still 127, but we now need to divide by 2^39 (since 23 + 16 = 39 mantissa bits). Because 127 - 39 = 88, we just multiply by 2^88:
549,755,813,887 * 2^88 =
1.7014118346015974672186595864716 * 10^38
This is the largest 48-bit floating point value because we used the largest possible mantissa and largest possible exponent.
So, the max values are:
1.7014116317805962808001687976863 * 10^38, for 32-bit, and,
1.7014118346015974672186595864716 * 10^38, for 48-bit
The max value for 48-bit is just slightly larger than for 32-bit, which stands to reason since a few bits are added to the end of the mantissa.
(To be exact the maximum number for the 48-bit format can be expressed as a binary number that consists of 39 1's followed by 88 0's.)
(The smallest is just the negative of this value. The closest to zero without being zero can also easily be computed as per above: use the smallest possible (positive) mantissa: 0x000001, and the smallest possible exponent: 80 in hex, or -128 in decimal.)
FYI
Some floating point formats use an unrepresented hidden 1 bit in the mantissa (this allows for one extra bit of precision in the mantissa, as follows: the first binary digit of all numbers (except 0, or denormals, see below) is a 1, therefore we don't have to store that 1, and we have an extra bit of precision). This particular format doesn't seem to do this.
Other floating point formats allow a denormalized mantissa, which allows representing (positive) numbers smaller than the smallest exponent alone would permit, by trading bits of precision for additional (negative) powers of 2. This is easy to support if the format doesn't also use the hidden one bit, and a bit harder if it does.
8,388,607 / 2^23 is the value you'd get with mantissa=0x7FFFFF and exponent=0x00. It is not the single bit value but rather the value with a full mantissa and a neutral, or more specifically, a zero exponent.
The reason this value is not directly 8388607, and requires division (by 2^23 and hence is less than what you might expect) is that the implied radix point is in front of the mantissa, rather than after it. So, think +/-.111111111111111111111 (a sign bit, followed by a radix point, followed by twenty-three 1-bits) for the mantissa and +/-111111111111 (no radix point here, just an integer, in this case, 127) for the exponent.
mantissa = 0x7FFFFF with exponent = 0x7F is the largest value which corresponds to 8388607 * 2 ^ 104, where the 104 comes from 127-23: again, subtracting 23 powers of two because the mantissa has the radix point at the beginning. If the radix point were at the end, then the largest value (0x7FFFFF,0x7F) would indeed be 8,388,607 * 2 ^ 127.
There are several possible ways we could consider a single-bit value for the mantissa. One is mantissa=0x400000, and the other is mantissa=0x000001. Without considering the radix point or the exponent, the former is 4,194,304, and the latter is 1. With a zero exponent and considering the radix point, the former is 0.5 (decimal) and the latter is 0.00000011920928955078125. With a maximum (or minimum) exponent, we can compute max and min single-bit values.
(Note that the latter format where the mantissa has leading zeros would be considered denormalized in some number formats, and its normalized representation would be 0x400000 with an exponent of -23).
You can borrow from how the IEEE floating point is laid out for fast comparison: sign, exponent, mantissa. However, in that PDF I see the mantissa and exponent are reversed.
This means that to compare you'll have to first check the sign bit and if one is not the winner yet you compare the exponents and then you compare the mantissa.
If one is positive and the other is negative then the positive is the max.
If both are positive and one exponent is larger then it is the max (if both are negative then it is the min)
Similarly for mantissa.
The NORMAL ranges are:
- 16-bit (half precision): ±6.10e-5 to ±65504.0
- 32-bit (single precision): ±1.18e−38 to ±3.4e38
- 64-bit (double precision): ±2.23e−308 to ±1.80e308
If you allow for DENORMALS as well, then the minimum values are:
- 16-bit: ±5.96e-8
- 32-bit: ±1e-45
- 64-bit: ±5e-324
Always keep in mind that just because a number is in this range doesn't mean it can be exactly represented. At any range, floating-point numbers necessarily skip values due to cardinality reasons. The classic example is 1/3 which has no exact representation in any finite precision, for binary or decimal formats. In general you can only precisely represent those numbers that are called "dyadic" for the binary format, i.e., those of the form A/2^B for some A and B; provided the result falls into the dynamic range.
In answer to the question, "How do I calculate the min/max?", it's straightforward, as long as you pay attention to several nuances of the IEEE-754 formats.
Let's start with single precision. There's a 24-bit significand, including one "hidden" or "implicit" bit. So the largest significand is the binary fraction 0b1.11111111111111111111111, which is equal to the hexadecimal fraction 0x1.fffffe, or the decimal fraction 1.99999988079071044921875.
There's an 8-bit exponent, with raw values ranging from 0 to 255. The lowest and highest values are reserved (more on those in a bit), and there's a bias of 127. So the largest exponent value is 254 - 127 = 127.
So the largest single-precision floating-point value is 1.99999988079071044921875 × 2^127, or about 3.4 × 10^38. (The exact value is 3.4028234663852885981170418348451692544 × 10^38.)
For normal floating-point numbers, the "implicit" or "hidden" bit is always 1, so the smallest normal significand is 0b1.00000000000000000000000, which is just 1. The smallest normal exponent is 1 - 127 = -126. (Again, the smallest raw exponent value, 0, is reserved.) So the smallest normal single-precision floating-point number is 1.0 × 2^-126, or about 1.1755 × 10^-38. (If you're into this stuff, it's fun to compute these numbers out to full precision, but the trailing digits don't mean much, so from now on I'm going to round to more reasonable approximations like 1.1755.)
Finally, we come to the subnormal numbers, which are the ones that the minimum raw exponent value of 0 is reserved for. These have a "hidden" or "implicit" bit of 0. So the smallest nonzero subnormal significand is 0b0.00000000000000000000001, or 0x0.000002, or about 0.0000001192. So the very smallest, nonzero, subnormal, single-precision, floating-point number is about 0.0000001192 × 2^-126, or about 1.4 × 10^-45.
The subnormal numbers are special because they have fewer than 24 binary bits of significance. They fill in the gap between 0 and the smallest normal number, and allow for "gradual underflow". (Notice that the exponent in the subnormal case is one higher than you might have expected by naïvely subtracting the bias from 0. Stated another way, the precision of the subnormal binade, 2^(-126-23) = 2^-149, is equal to the precision of the lowest normal binade, as it has to be if they're to gracefully fill in that gap.)
For the record, the other special values are Infinity and NaN ("Not a Number"). Those are the values that the maximum exponent value is reserved for. They don't really concern us here, except that they mean that (for single precision) the maximum scaled exponent value we care about is +127, not +128.
We can use the same procedure to compute the minima and maxima for half and double precision. Double precision has a 52+1 = 53 bit significand, so the largest value is
0b1.1111111111111111111111111111111111111111111111111111
There are 11 bits for the exponent, with a bias of 1023, giving a largest scaled exponent of 2046 - 1023 = 1023. So the math for the largest value works out to 1.999999999999999778 × 2^1023 ≈ 1.8 × 10^308.
The minimum normal and subnormal values have significands of
0b1.0000000000000000000000000000000000000000000000000000
0b0.0000000000000000000000000000000000000000000000000001
and a base-2 exponent of 1 - 1023 = -1022, so the minimum values work out to
1.0 × 2^-1022 ≈ 2.22 × 10^-308
0.00000000000000022 × 2^-1022 ≈ 4.9 × 10^-324
You've probably noticed by now that we don't really need to work with those 24- and 53-bit binary significands explicitly, since all three of the values we care about end up being close to or exactly powers of two, with the powers being numbers like 127 or 1023, plus or minus 1, or for the subnormals, additionally shifted by the number of significand bits. So here's a shortcut, for single, double, and also half precision:
| precision | exp, signif bits | min/max | power of two | equals approximately |
|---|---|---|---|---|
| single | 8, 23 | max | 2^128 | 3.4 × 10^38 |
| | | min norm | 2^-126 | 1.175 × 10^-38 |
| | | min subnorm | 2^(-126-23) | 1.4 × 10^-45 |
| double | 11, 52 | max | 2^1024 | 1.8 × 10^308 |
| | | min norm | 2^-1022 | 2.22 × 10^-308 |
| | | min subnorm | 2^(-1022-52) | 4.9 × 10^-324 |
| half | 5, 10 | max | 2^16 | 6.55 × 10^4 |
| | | min norm | 2^-14 | 6.1 × 10^-5 |
| | | min subnorm | 2^(-14-10) | 5.96 × 10^-8 |
Finally, if you don't want to compute these values (or even look them up on the internet), many programming languages have ways of requesting some/all of them programmatically. For example, C has constants in <float.h> like FLT_MIN and DBL_MAX, and C++ has static methods from <limits> like std::numeric_limits<float>::min() and std::numeric_limits<double>::max().
For Java there's Float.MIN_NORMAL and Double.MAX_VALUE, and Python has sys.float_info.
Hello!
I'm studying a bit of Arduino programming and for ints and longs I have learned their minimum and maximum values depend on the amount of bits they take up in RAM.
So, since integer takes 16 bits, that's 2^16 combinations, and since half goes for negative numbers, half for positive numbers and a zero, its range is from -2^15 to (2^15) - 1. Long is 32 bits so formula just changes to -2^31 to (2^31) - 1. I also understand that unsigned makes the variable hold only positive values so I get 0 to (2^16)-1 and 0 to (2^32)-1.
I know I can't apply this logic to floats and doubles. From Arduino documentation: Floating-point numbers can be as large as 3.4028235E+38 and as low as -3.4028235E+38. They are stored as 32 bits (4 bytes) of information.
How are floats stored in memory and how can I calculate their min and max values?