Assignment #04: rounding errors

Today's exercises deal with the pitfalls of floating point arithmetic.

Exercise #04-01: fixed point binary numbers

Write a function which converts binary strings to decimal numbers. The function should handle unsigned (positive) numbers only. Examples:

  • '101010' $\rightarrow$ 42
  • '10000.1' $\rightarrow$ 16.5
  • '1011101101.10101011' $\rightarrow$ 749.66796875
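A minimal sketch of such a converter (the function name `bin_to_dec` and the split-at-the-dot approach are just one possible design):

```python
def bin_to_dec(s):
    """Convert an unsigned fixed-point binary string to a decimal number."""
    int_part, _, frac_part = s.partition('.')
    # Integer bits are handled by int(); an empty integer part counts as 0
    value = int(int_part, 2) if int_part else 0
    # Fractional bits contribute 2**-1, 2**-2, ...
    for i, bit in enumerate(frac_part, start=1):
        value += int(bit) * 2.0**-i
    return value

print(bin_to_dec('101010'))               # 42
print(bin_to_dec('10000.1'))              # 16.5
print(bin_to_dec('1011101101.10101011'))  # 749.66796875
```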

Now let's develop a new standard based on this representation. A dot cannot be encoded with 0s and 1s, so if we want the position of the dot to be flexible, we need an additional memory slot to store it. Let's define our new format as a 32-bit sequence: the first 5 bits (starting from the left) encode the position of the dot, and the remaining 27 bits represent the number. Examples:

  • '10101010101010101010101010101010' $\rightarrow$ 699050.65625.
  • '00000001100110011001100110011001' $\rightarrow$ 0.19999999552965164.

Explanation for example 1: the first five digits are '10101', which gives the number 21. A dot is therefore inserted at position 21 of the remaining 27 bits: '010101010101010101010.101010'. This binary number is then converted to decimal.
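The decoding step above can be sketched as follows, reusing a fixed-point converter like the one from the previous exercise (the names `bin_to_dec` and `bse_to_dec` are illustrative):

```python
def bin_to_dec(s):
    """Convert an unsigned fixed-point binary string to a decimal number."""
    int_part, _, frac_part = s.partition('.')
    value = int(int_part, 2) if int_part else 0
    for i, bit in enumerate(frac_part, start=1):
        value += int(bit) * 2.0**-i
    return value

def bse_to_dec(bits):
    """Decode a 32-bit BSE string: 5 bits of dot position + 27 bits of digits."""
    assert len(bits) == 32
    dot = int(bits[:5], 2)  # dot position, 0..31
    digits = bits[5:]       # the remaining 27 bits
    return bin_to_dec(digits[:dot] + '.' + digits[dot:])

print(bse_to_dec('10101010101010101010101010101010'))  # 699050.65625
```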

Let's name this standard "BSE" (for "best standard ever") and try to convince the Institute of Electrical and Electronics Engineers (IEEE) to adopt it in place of the old IEEE 754 standard. We have to answer the following questions:

  • what is the smallest number the BSE can represent? The largest?
  • what is the maximal accuracy of the BSE? (in other words, what is the difference between the smallest positive number and zero?)
  • what is the lowest accuracy of our standard? (in other words, what is the difference between the largest number we can represent and the second largest?)
  • does the difference between two nearest representable numbers change when the dot position doesn't?
  • try to produce a precision plot similar to the one on the Wikipedia page about IEEE 754, for the range of possible values of the BSE (you can also use a log-log plot for it).

Finally, how does our format compare with the IEEE 754 binary32 format (or its numpy equivalent, np.float32)? Do you think we should try to convince the IEEE and present our results to them?
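For the comparison, one handy tool is np.spacing, which returns the gap between a float and the next representable value. A quick sketch (assuming numpy is available):

```python
import numpy as np

# Gap between x and the next representable float32, at several magnitudes.
# For float32 the gap scales with the magnitude of x (relative precision),
# whereas a fixed-point layout like the BSE has a constant gap of 2**-dot
# for a given dot position (absolute precision).
for x in [1e-3, 1.0, 1e3, 1e6]:
    print(x, np.spacing(np.float32(x)))
```

At 1.0 the float32 gap is 2**-23 (the machine epsilon of binary32), which gives a reference point for the precision plot above.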

Exercise #04-02: exponential error growth

The number e can be defined as the sum of the infinite series:

$$e = \sum_{n=0}^{\infty} \frac{1}{n!}$$

We are going to approximate this number by truncating the sum to a finite number of terms. We use the standard library and its math module:

In [1]:
import math
n = 100
In [2]:
e1 = sum([1. / math.factorial(i) for i in range(n+1)])
e1
Out[2]:
2.7182818284590455

Close enough! Now let's compute it with the same terms, but summed in reverse order, from n=100 down to n=0:

In [3]:
e2 = sum([1. / math.factorial(i) for i in range(n+1)][::-1])
e2
Out[3]:
2.718281828459045

Seems reasonable too! Are they different?

In [4]:
e1 - e2
Out[4]:
4.440892098500626e-16

Which of the two values is closer to the actual e? Explain why this happens, and what we can learn from this experiment.
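One way to check is to compare both truncated sums against the standard library's math.e, a sketch of which could look like this:

```python
import math

n = 100
terms = [1.0 / math.factorial(i) for i in range(n + 1)]
e1 = sum(terms)        # largest terms first
e2 = sum(terms[::-1])  # smallest terms first
# Summing the small terms first lets them accumulate before they are
# added to the large ones, so fewer low-order bits are lost to rounding.
print(abs(e1 - math.e), abs(e2 - math.e))
```

As an aside, the standard library's math.fsum computes a correctly rounded sum regardless of the order of the terms.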
