
Round-Off Error and Computer Arithmetic

The arithmetic performed by a calculator or computer is different from the arithmetic that we use in our algebra and calculus courses. From your past experience you might expect that statements such as $2+2=4$, $4\times8=32$, and $\left(\sqrt{3}\right)^{2}=3$ are always true. In standard computational arithmetic we expect exact results for $2+2=4$ and $4\times8=32$, but we will not have precisely $\left(\sqrt{3}\right)^{2}=3$. To understand why this is so, we must explore the world of finite-digit arithmetic.



In our traditional mathematical world we permit numbers with an infinite number of digits. The arithmetic we use in this world defines $\sqrt{3}$ as the unique positive number that, when multiplied by itself, produces the integer 3. In the computational world, however, each representable number has only a fixed and finite number of digits. This means, for example, that only rational numbers, and not even all of these, can be represented exactly. Since $\sqrt{3}$ is not rational, it is given an approximate representation within the machine, a representation whose square will not be precisely 3, although it will likely be sufficiently close to 3 to be acceptable in most situations. In most cases, then, this machine representation and arithmetic are satisfactory and pass without notice or concern, but at times problems arise because of this discrepancy.
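For instance, the following minimal Python sketch shows the discrepancy directly (the exact digits printed depend on the machine's binary floating-point arithmetic, but the inequality is typical):

```python
import math

root3 = math.sqrt(3)       # machine approximation of sqrt(3)
print(root3 * root3)       # typically 2.9999999999999996, not 3
print(root3 * root3 == 3)  # False
```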

The error that is produced when a calculator or computer is used to perform real-number calculations is called round-off error. It occurs because the arithmetic performed in a machine involves numbers with only a finite number of digits, with the result that calculations are performed with only approximate representations of the actual numbers. In a typical computer, only a relatively small subset of the real number system is used to represent all the real numbers. This subset contains only rational numbers, both positive and negative, and each number is stored as a fractional part together with an exponential part.

In 1985, the IEEE (Institute of Electrical and Electronics Engineers) published a report called Binary Floating Point Arithmetic Standard 754-1985. Formats were specified for single, double, and extended precisions. These standards are generally followed by all microcomputer manufacturers using hardware that performs real-number, or floating-point, arithmetic operations. For example, double-precision real numbers require a 64-bit (binary digit) representation.

The first bit is a sign indicator, denoted $s$. This is followed by an 11-bit exponent, $c$, and a 52-bit binary fraction, $f$, called the mantissa. The base for the exponent is 2.

The normalized form for the nonzero double-precision numbers has $0<c<2^{11}-1=2047$. Since $c$ is positive, a bias of 1023 is subtracted from $c$ to give an actual exponent in the interval $(-1023,1024)$. This permits adequate representation of numbers with both large and small magnitude. The first bit of the fractional part of a number is assumed to be 1 and is not stored, in order to give one additional bit of precision to the representation. Since 53 binary digits correspond to between 15 and 16 decimal digits, we can assume that a number represented in this system has at least 15 decimal digits of precision. Thus, numbers represented in normalized double precision have the form
\[\left(-1\right)^{s}\cdot2^{c-1023}\cdot\left(1+f\right)\]
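To make the format concrete, here is a small Python sketch that extracts $s$, $c$, and $f$ from a double and re-evaluates the formula above (it assumes the interpreter's `float` is an IEEE 754 binary64, which the standard `struct` module lets us inspect; the helper name `fields` is ours):

```python
import struct

def fields(x: float):
    """Split a binary64 float into sign bit s, biased exponent c, and fraction f."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]   # the raw 64-bit pattern
    s = bits >> 63                        # 1 sign bit
    c = (bits >> 52) & 0x7FF              # 11 exponent bits
    f = (bits & ((1 << 52) - 1)) / 2**52  # 52 fraction bits as a value in [0, 1)
    return s, c, f

s, c, f = fields(math := __import__("math").pi)
print(s, c, f)                            # 0 1024 0.5707963267948966
print((-1)**s * 2**(c - 1023) * (1 + f))  # 3.141592653589793, i.e., pi again
```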
Consider, for example, the machine number
$$0100000000111011100100010000000000000000000000000000000000000000.$$
The leftmost bit is zero, which indicates that the number is positive. The next $11$ bits, $10000000011$, giving the exponent, are equivalent to the decimal number
\[c=1\cdot2^{10}+0\cdot2^{9}+\cdots+0\cdot2^{2}+1\cdot2^{1}+1\cdot2^{0}=1024+2+1=1027\]
The exponential part of the number is, therefore, $2^{1027-1023}=2^{4}$. The final 52 bits specify that the mantissa is
\[f=1\cdot\left(\frac{1}{2}\right)^{1}+1\cdot\left(\frac{1}{2}\right)^{3}+1\cdot\left(\frac{1}{2}\right)^{4}+1\cdot\left(\frac{1}{2}\right)^{5}+1\cdot\left(\frac{1}{2}\right)^{8}+1\cdot\left(\frac{1}{2}\right)^{12}\]
As a consequence, this machine number precisely represents the decimal number
$\left(-1\right)^{s}\cdot2^{c-1023}\cdot\left(1+f\right)=\left(-1\right)^{0}\cdot2^{1027-1023}\cdot\left(1+\left({\displaystyle \frac{1}{2}+\frac{1}{8}+\frac{1}{16}+\frac{1}{32}+\frac{1}{256}+\frac{1}{4096}}\right)\right)=27.56640625.$
However, the next smallest machine number is
\[0100000000111011100100001111111111111111111111111111111111111111\]
and the next largest machine number is
\[0100000000111011100100010000000000000000000000000000000000000001.\]
This means that our original machine number represents not only $27.56640625$ but also half of the real numbers that are between $27.56640625$ and its two nearest machine-number neighbors. To be precise, it represents any real number in the interval
$$[27.56640624999999988897769753748434595763683319091796875,
27.56640625000000011102230246251565404236316680908203125).$$
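These bit patterns can be checked directly. The sketch below (Python, again assuming binary64 floats; `math.nextafter` requires Python 3.9+) converts the 64-bit string above into a float and asks for its two nearest machine neighbors:

```python
import math
import struct

bits = "0100000000111011100100010000000000000000000000000000000000000000"
x = struct.unpack(">d", struct.pack(">Q", int(bits, 2)))[0]
print(x)                             # 27.56640625

# The adjacent machine numbers on either side:
print(math.nextafter(x, -math.inf))  # just below 27.56640625
print(math.nextafter(x, math.inf))   # just above 27.56640625
```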
The smallest normalized positive number that can be represented has $s=0$, $c=1$, and $f=0$, and is equivalent to the decimal number
\[2^{-1022}\cdot(1+0)\approx0.22251\times10^{-307}.\]
The largest normalized positive number that can be represented has $s=0$, $c=2046$, and $f=1-2^{-52}$, and is equivalent to the decimal number \[2^{1023}\cdot\left(1+\left(1-2^{-52}\right)\right)\approx0.17977\times10^{309}.\]
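Python exposes both extremes through `sys.float_info`, so the two decimal values above can be verified directly (a sketch assuming binary64 floats):

```python
import sys

print(sys.float_info.min)  # 2.2250738585072014e-308, i.e., 2**(-1022)
print(sys.float_info.max)  # 1.7976931348623157e+308, i.e., (2 - 2**-52) * 2**1023
print(sys.float_info.min == 2.0**-1022)                # True
print(sys.float_info.max == (2 - 2**-52) * 2.0**1023)  # True
```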
Numbers occurring in calculations that have too small a magnitude to be represented result in underflow and are generally set to $0$, with computations continuing. However, numbers occurring in calculations that have too large a magnitude to be represented result in overflow and typically cause the computations to stop (unless the program has been designed to detect this occurrence). Note that there are two representations for the number zero: a positive $0$ when $s=0$, $c=0$, and $f=0$, and a negative $0$ when $s=1$, $c=0$, and $f=0$.
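Both behaviors, and the two signed zeros, are easy to observe in Python (a sketch; CPython's `float` follows IEEE 754 here and returns `inf` on overflow in multiplication rather than stopping):

```python
import math

print(1e-200 * 1e-200)  # 0.0: the true value 1e-400 underflows to zero
print(1e200 * 1e200)    # inf: the true value 1e400 overflows

# The two representations of zero compare equal but carry different signs:
print(0.0 == -0.0)                                        # True
print(math.copysign(1.0, 0.0), math.copysign(1.0, -0.0))  # 1.0 -1.0
```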
The use of binary digits tends to complicate the computational problems that occur when a finite collection of machine numbers is used to represent all the real numbers. To examine these problems, we now assume, for simplicity, that machine numbers are represented in the normalized decimal form
\[\pm0.d_{1}d_{2}\cdots d_{k}\times10^{n},\qquad1\leq d_{1}\leq9,\quad0\leq d_{i}\leq9\]
for each $i=2,\ldots,k$. Numbers of this form are called $k$-digit decimal machine numbers. Any positive real number within the numerical range of the machine can be normalized to the form \[y=0.d_{1}d_{2}\cdots d_{k}d_{k+1}d_{k+2}\cdots\times10^{n}.\]
The floating-point form of $y$, denoted by $fl(y)$, is obtained by terminating the mantissa of $y$ at $k$ decimal digits. There are two ways of performing the termination. One method, called chopping, is to simply chop off the digits $d_{k+1}d_{k+2}\cdots$ to obtain
\[fl\left(y\right)=0.d_{1}d_{2}\cdots d_{k}\times10^{n}.\]
The other method of terminating the mantissa of $y$ at $k$ decimal digits is called rounding. If the $(k+1)$st digit is smaller than $5$, then the result is the same as chopping. If the $(k+1)$st digit is $5$ or greater, then $1$ is added to the $k$th digit and the resulting number is chopped. As a consequence, rounding can be accomplished by simply adding $5\times10^{n-(k+1)}$ to $y$ and then chopping the result to obtain $fl(y)$. Note that when rounding up, the exponent $n$ could increase by $1$. In summary, when $d_{k+1}\geq5$ we add one to $d_{k}$ to obtain $fl(y)$; that is, we round up. When $d_{k+1}<5$, we chop off all but the first $k$ digits; that is, we round down.
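Both terminations are straightforward to imitate for positive $y$, as in the Python sketch below (the helper names `chop` and `round_fl` are ours, and the returned values are themselves binary floats, so this only illustrates the decimal idealization):

```python
import math

def chop(y: float, k: int) -> float:
    """Terminate the mantissa of y > 0 at k decimal digits by chopping."""
    n = math.floor(math.log10(y)) + 1  # exponent n with 0.1 <= y / 10**n < 1
    scale = 10.0 ** (k - n)
    return math.floor(y * scale) / scale

def round_fl(y: float, k: int) -> float:
    """Terminate at k digits by rounding: add 5 x 10**(n-(k+1)), then chop."""
    n = math.floor(math.log10(y)) + 1
    return chop(y + 5 * 10.0 ** (n - (k + 1)), k)

print(chop(math.e, 5))      # 2.7182
print(round_fl(math.e, 5))  # 2.7183
```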

The next examples illustrate floating-point arithmetic when the number of digits being retained is quite small. Although the floating-point arithmetic that is performed on a calculator or computer will retain many more digits, the problems this arithmetic can cause are essentially the same regardless of the number of digits. Retaining more digits simply postpones the awareness of the situation.

EXAMPLE 


The irrational number $\pi$ has an infinite decimal expansion of the form $\pi=3.14159265\ldots$. Written in normalized decimal form, we have
\[\pi=0.314159265...\times10^{1}\]
The five-digit floating-point form of $\pi$ using chopping is
\[fl(\pi)=0.31415\times10^{1}=3.1415.\]
Since the sixth digit of the decimal expansion of $\pi$ is a 9, the five-digit floating-point form of $\pi$ using rounding is
\[fl\left(\pi\right)=\left(0.31415+0.00001\right)\times10^{1}=0.31416\times10^{1}=3.1416.\]
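Python's `decimal` module can reproduce this five-digit arithmetic exactly: with precision 5, `ROUND_DOWN` is chopping and `ROUND_HALF_UP` is the rounding rule described above.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

pi = Decimal("3.14159265")
print(Context(prec=5, rounding=ROUND_DOWN).plus(pi))     # 3.1415 (chopping)
print(Context(prec=5, rounding=ROUND_HALF_UP).plus(pi))  # 3.1416 (rounding)
```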
The error that results from replacing a number with its floating-point form is called round-off error (regardless of whether the rounding or chopping method is used). There are two common methods for measuring approximation errors. 

The approximation $p^{*}$ to $p$ has absolute error $\left|p-p^{*}\right|$ and relative error $\left|p-p^{*}\right|/\left|p\right|$, provided that $p\neq0$.
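For the two five-digit forms of $\pi$ computed above, both error measures are small (a quick Python check; the printed values are approximate):

```python
p = 3.14159265                   # reference value of pi to nine digits
for p_star in (3.1415, 3.1416):  # chopped and rounded five-digit forms
    abs_err = abs(p - p_star)
    rel_err = abs_err / abs(p)
    print(p_star, abs_err, rel_err)  # ~9.3e-5, ~3.0e-5 then ~7.4e-6, ~2.3e-6
```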
