In this chapter we are going to give elements of understanding of how numbers are represented in today's computers. We won't have time to go into the details (which are perhaps not so relevant for non informaticians), however it is extremely important to understand the pitfalls of the so-called "floating-point arithmetics", because sooner or later one of these pitfalls will hit you (I promise).
Traditional arithmetics and mathematics rely on the decimal system. Every number can be decomposed in a sum of products such as: $908.2 = 9\times10^2 + 0\times10^1 + 8\times10^{0} + 2\times10^{-1}$. This system is called decimal because the factors can take the values $[0 - 9]$ while the fixed base (the number which is scaled by an exponent) is 10.
Nearly all modern processors however represent numbers in binary form, with each digit being represented by a two-valued physical quantity such as a "low" or "high" voltage: 0s and 1s. These binary digits are called bits, and are the basis of the binary representation of numbers.
With only 0s and 1s available the most efficient way to represent numbers is as a sum of 2s to the power of i, with i going from 0 to N-1. It is best shown with an example, here with N=4 digits:
0101
To convert our number to a decimal representation we compute $0\times2^3 + 1\times2^2 + 0\times2^1 + 1\times2^0 = 5$. In this convention, the leftmost digit is called the most significant bit and the rightmost the least significant bit. If all the elements from the left are zeros, the convention is to omit them when writing. This is for example what the built-in bin function chooses to do:
bin(5)
which converts an integer number to a binary string prefixed with 0b
. Now it appears quite clear that the number of different integers that a computer can represent with this system depends on the size N of the binary sequence. A positive integer represented with a byte (8 bits are called a byte) can thus be as large as:
sum([2**i for i in range(8)])
but not larger. Unless specified otherwise, the first bit is often used to give the sign of the integer it represents. Therefore, the actual range of numbers that a byte can represent is $[-2^7; 2^7-1]= [-128; 127]$ (the reason for this asymmetry is a matter of definition, as we will see later). If you are sure that you only want to do arithmetics with positive number you can spare this one bit and specify your binary number as being of unsigned integer type.
So how many bits does our computer use to represent integers? Well, it depends on the platform and programming language you are using. Many older languages (including old python versions) made the distinction between short (32 bits) and long (64 bits) integers.
Exercise: what is the largest number that a 64-bit integer can represent? The smallest? And a 32-bit unsigned integer?
Now, what is the default length of binary integers in python 3? Let's try to find out:
from sys import getsizeof
a = 10
getsizeof(a)
So 28 bytes? That's a lot of bits. This is because the getsizeof function returns the memory consumption of the object int(10)
. What does that mean? Well, in python numbers are not only numbers, they are also "things" (objects). And these things come with services, like for example the bit_length
method:
a.bit_length()
Exercise: what does bit_length()
do? What is the bit length of 5? of 127?
These services have a memory cost (an "overhead"), and are required no matter how big our number is:
size_int = getsizeof(int())
size_int # size of an empty integer object
def get_int_bitsize(integer):
"""Returns the actual memory consumption of an integer (in bits) without the object overhead."""
return (getsizeof(integer) - getsizeof(int())) * 8
get_int_bitsize(2)
Ha! This looks more reasonable. So python uses 32 bits to store integers? But then, the largest number it can manipulate must be very small? Let's see if we can create a number larger than 2**32-1
:
12**68
get_int_bitsize(2**18), get_int_bitsize(2**68), get_int_bitsize(2**100000)
As shown above, it turns out that modern python versions have no limitations on the size of integers (other than the total memory available on your computer). The memory slot used to store your number simply depends on how large it is.
get_int_bitsize(2**100000) / 8 / 1024
So the 2**100000
number requires about 13 KB (Kilobytes) of memory.
Exercise: print 2**100000
on screen, just for "fun".
This dynamic resizing of integers in python means that they cannot overflow. Overflow is a common pitfall of many numerical languages though, and to illustrate what it is we can either use floats (unlike python integers, python floats can overflow), or numpy, which uses integer types of fixed length:
import numpy as np
a = np.int8(127)
a.dtype
a
a + np.int8(1)
What happened here? To understand this we need to understand how binary numbers are added together first. Please read the addition chapter of the Wikipedia article on binary numbers before going on.
Basically, we added 1 (binary 00000001
) to 127 (binary 01111111
) which gives us the binary number 10000000
, i.e. -128 in two's complement representation, which is the most common method of representing signed integers on computers. At least, python warned us that we are doing something wrong here. But be aware that this is not always the case:
np.int32(2147483648)
a = np.array([18])
a**a
These are examples of silent overflows. Silent means that they do not warn you about the probable mistake and could happen in your code without you noticing that this happens.
A more general definition of the binary representation for integers is to use negative exponents as well. Any rational number a can be approximated by:
$$a = \pm \sum_{i=-j}^{k} z_i b^i$$ with $b > 1$ (base) and $0 \le z_i \le b-1$, $j$ and $k$ all positive integers. The precision depends on the size of $j$ while the maximum range depends on the size of $k$.
In the notation, a fixed point separates digits of positive powers of the base, from those of negative powers. In base 2 (binary), the number $10000.1_2$ is equal to $16.5_{10}$ in base 10.
Indeed:
$$1\times2^4 + 0\times2^3 + 0\times2^2 + 0\times2^1 + 0\times2^0 + 1\times2^{-1} = 16.5_{10}$$
Although this representation is convenient, the representable value range is heavily limited by the number of bits available to represent the number (see this week's assignments). Therefore, most computers today are relying on the floating point notation to represent real numbers.
Because computers have a finite number of storage units available, they can only represent a finite number of distinguishable values. In fact, a memory slot with $N$ available bits cannot represent more then $2^N$ distinguishable values. The range of real (or complex) numbers is of course infinite, and therefore it becomes clear that in the computer representation of numbers there will always be a trade-off between the range of numbers one would like to represent and their relative accuracy (i.e. the absolute difference between two consecutive representable numbers).
Taking the decimal representation of the number 1/3 as an example: it can be written as 0.33
, 0.333
, 0.3333
, etc. Depending on the numbers of digits available, the precision of the number will increase but never reach the exact value, at least in the decimal representation.
This fundamental limitation is the explanation for unexpected results of certain arithmetic operations. For example:
0.1 + 0.1 # so far so good
0.1 + 0.2 # wtf?
This is a typical rounding error, happening because most computers do not represent numbers as decimal fractions, but as binary. Without going too much into details (which can be tricky), this chapter will give you some elements of understanding in order to prepare you for the most common pitfalls of floating point arithmetics.
In the exponential notation, a number is approximated with a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base; the base for the scaling is normally two, ten, or sixteen. A number that can be represented exactly is of the following form:
$\mathrm{number} = \mathrm{significand} \times \mathrm{base} ^{\mathrm{exponent}}$
For example, in base 10:
$$1.234 = 1234 \times 10^{-3}$$
The number 1.234
can easily be represented exactly in base 10, but the number 1/3 cannot. However, in base 3 the latter can be represented exactly by $1 \times 3^{-1}$.
To approximate the number 1/3 in base 10, the rule of "the more bits you have, the more precise you can be" that we learned with integers continues to hold true: $33 \times 10^{-2}$ and $33333333333333 \times 10^{-14}$ are two ways to approximate the number, the later being more expensive in terms of storage requirements but more accurate.
The exponential notation is the most common way to represent real numbers in computers and is the basis of the floating-point number representation. A floating point number in base 2, 10, or N will store the two attributes: the significand (including its sign) and the exponent.
This precision/size trade-off raises many challenges: memory and computer time must be optimized while still allowing programs to realize computations on a wide range of numbers in the same program. A good example is provided with atmospheric models, which realize computations on specific humidity values ($\mathrm{kg}\,\mathrm{kg}^{-1}$) with typical values of 10$^{-5}$ or smaller up to values of several order of magnitude larger. Using different standards for each set of numbers would be impracticable. This lead to the development of the IEEE Standard for Floating-Point Arithmetic (IEEE 754).
The standard defines five basic formats that are named for their numeric base and the number of bits used in their encoding, as listed in this table. For example, binary64
is the famous "double precision" format, called float64
in numpy and simply float
in the python standard library. In this format, 53 bits are used for the significand, and the remaining 11 for the exponent. Remember:
This has the important consequence that small numbers have a larger absolute precision than large numbers. See this example:
.99 == .98 # so far so good
999999999999999.99 == 999999999999999.98 # wtf?
999999999999999.99 # wtf?
Fortunately, it is possible to compute the approximate precision of IEEE754 floating point numbers according to their value:
Source: wikipedia
Like numpy integers, floating point numbers can overflow:
np.float16(65510) + np.float16(20)
Fortunately, the IEEE 754 standard defines two new numbers (-inf and +inf) which are more informative (and less dangerous) than the negative reset of integer numbers. IEEE 754 also defines "Not A Number" (abbreviated NaN), which propagates through computations:
np.NaN * 10
As we've learned, errors in floating-point numbers are unavoidable. Even if these errors are very small, simple calculations on approximated numbers can contain pitfalls that increase the error in the result way beyond just having the individual errors "add up".
In certain cases where perfect decimal accuracy is needed (for example when dealing with currencies and money), it is possible to use a decimal floating point representation instead of the default binary one:
1/10*3 # not precise
from decimal import Decimal
Decimal(1) / Decimal(10) * 3 # precise
With limited-precision decimals there are no unexpected rounding errors. In practice, however, alternative datatypes are used rarely because the precision gain comes with a performance overhead: computers work best with 0s and 1s, not with numbers humans can read.
Symbolic computations are realized literally (like in mathematics), not approximately. SymPy is the most famous python library for symbolic mathematics:
import sympy
a = sympy.sympify('1 / 3')
a + a
Seams like the perfect solution, right? It probably is if you are a mathematician, but for actual numerical computations SymPy will be way too slow to use. Symbolic mathematics can only be used for problems where analytical solutions are known. Unfortunately, this is not always the case (take numerical models of the atmosphere for example).
There are no simple answers to numerical rounding errors. Be aware that they occur and try to mitigate their effect.
Awareness is mostly hindered by the string representation of floating point numbers. In practice:
0.1
format(0.1, '.16g') # give 16 significant digits
format(0.1, '.30g') # give 30 significant digits
The default 0.1
print is therefore a "lie", but it is a useful one: in most cases you don't want to know about these insignificant digits at the end. The numpy.finfo is a useful function informing you about the machine limits for floating point types:
info = np.finfo(np.float)
info.bits, info.precision, info.max
Preventing rounding errors to happen is not possible, but there are a few general rules:
(list taken from from the floating point guide)
As illustration for the difference between addition and multiplication, see the following example:
a = 10 * 0.1
b = 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1
a == b
a, b
Realize safer computations therefore involves asking yourself at which stage the most precision is going to be lost: this is most often the case when adding numbers of very different magnitudes. When building numerical models, this should always be something you consider: is a formula leading to dangerous additions, then reformulate it and/or use other units for your variables (e.g. $\mathrm{g}\,\mathrm{kg}^{-1}$ instead of $\mathrm{kg}\,\mathrm{kg}^{-1}$ for specific humidity). Consider the following example:
a = 1
b = 1e-16
c = 10
c1 = c * (a + b)
c2 = c * a + c * b
c1 == c2
c * (a + b)
and c * a + c * b
are mathematically equivalent, and the first is computationally less expensive (two operations instead of three). However, the second is less prone to rounding errors!
Fortunately, rounding errors often remain unnoticed, meaning that your computations are probably OK! In our field in particular, we often do not care if the post-processed temperature forecast is numerically precise at 0.001° since the forecast accuracy is much lower anyway. However, this can still lead to surprises when comparing arrays for equality (e.g. when testing that a temperature is equal to zero, or for matching coordinates like longitude or latitude). In these cases, prefer to use numpy's specialized functions:
np.isclose(c1, c2) # OK is you don't care about small numerical errors
np.float32
, np.float64
, np.float128
).Because of the importance of floating point computations you will find many resources online. I highly recommend to go through the short floating point guide website, which explains the problem to non specialists. It will give you another point of view on the topic.
Other resources:
Back to the table of contents, or jump to this week's assignment.