Numeric computation:
Representation of special values:
gradual underflow -> the subnormal numbers
[Gradual underflow provides a number of advantages over abrupt
underflow. Without it, the gap between zero and the smallest floating-point number is much
larger than the gap between successive small floating-point numbers. Without gradual
underflow one can find two values, X and Y (such that X is not equal to Y), and yet when
you subtract them their result is zero. While a skilled numerical analyst could work
around this limitation in many situations, this anomaly would tend to cause problems for
less skilled programmers. -- Charles Severance]
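Conclusion: the most significant bit of the significand is determined by the exponent
field. If the exponent satisfies 0 < exponent < 2^e - 1, the most significant bit of the
significand is 1 and the number is said to be in normal form.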
Zianed Version 1.0 5
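If the exponent is 0, the most significant bit of the significand is 0 and the number is
said to be in subnormal form. Three special cases should be noted:
If the exponent is 0 and the fraction is 0, the number is ±0 (depending on the sign bit).
If the exponent = 2^e - 1 and the fraction is 0, the number is ±infinity (again depending
on the sign bit).
If the exponent = 2^e - 1 and the fraction is nonzero, the value is Not a Number (NaN).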
The rules above are summarized as follows:
Form          Exponent         Fraction
Zero          0                0
Subnormal     0                nonzero
Normal        1 to 2^e - 2     any
Infinity      2^e - 1          0
NaN           2^e - 1          nonzero
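The classification table can be sketched in Java for the 32-bit float format (e = 8, so 2^e - 1 = 255). The class name FloatClassifier and the method classify are illustrative names, not part of any library; only Float.floatToIntBits is a standard API.

```java
// Classify a float by its exponent and fraction fields (binary32: 8 exponent bits, 23 fraction bits).
class FloatClassifier {
    static String classify(float value) {
        int bits = Float.floatToIntBits(value);
        int exponent = (bits >>> 23) & 0xFF;   // 8-bit biased exponent field
        int fraction = bits & 0x7FFFFF;        // 23-bit fraction field
        if (exponent == 0) {
            return fraction == 0 ? "zero" : "subnormal";
        }
        if (exponent == 0xFF) {                // 2^e - 1 = 255
            return fraction == 0 ? "infinity" : "NaN";
        }
        return "normal";                       // exponent in 1 .. 2^e - 2
    }

    public static void main(String[] args) {
        float[] samples = {0.0f, -0.0f, Float.MIN_VALUE, Float.MIN_NORMAL, 1.0f,
                           Float.NEGATIVE_INFINITY, Float.NaN};
        for (float f : samples) {
            System.out.println(f + " -> " + classify(f));
        }
    }
}
```

The sign bit is ignored on purpose: it selects between +0 and -0 or between +infinity and -infinity but does not change the class.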
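A binary fraction of Fraction bits can represent 2^(Fraction) distinct values, whereas a
decimal fraction of Fraction digits can represent 10^(Fraction) distinct values.
The share that can be represented exactly is 2^(Fraction)/10^(Fraction) = 0.2^(Fraction).
The larger Fraction is, the smaller the share of decimal fractions that can be
represented exactly.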
Subnormal numbers:
For the decimal32 format, whose minimum exponent is -95, the values -1×10^-95 and
1×10^-95 are the smallest (in magnitude) normal numbers; non-zero numbers between
these two values are called subnormal numbers.
Subnormal numbers provide the guarantee that addition and subtraction of floating-point
numbers never underflows; two nearby floating-point numbers always have a
representable non-zero difference. Without gradual underflow, the subtraction a−b can
underflow and produce zero even though the values are not equal. This can, in turn, lead
to division by zero errors that cannot occur when gradual underflow is used.
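A minimal Java sketch of this guarantee, using the two smallest neighbouring normal doubles. The class name GradualUnderflowDemo and the helper gapAbove are illustrative; Math.nextUp, Double.MIN_NORMAL and Double.MIN_VALUE are standard.

```java
class GradualUnderflowDemo {
    // Distance from x to the next representable double above it.
    static double gapAbove(double x) {
        return Math.nextUp(x) - x;
    }

    public static void main(String[] args) {
        double x = Double.MIN_NORMAL;          // smallest positive normal double, 2^-1022
        double y = Math.nextUp(x);             // the next representable double above x
        double diff = y - x;                   // 2^-1074, a subnormal number

        System.out.println(x != y);                    // true
        System.out.println(diff != 0.0);               // true: the difference is representable
        System.out.println(diff == Double.MIN_VALUE);  // true
        System.out.println(diff < Double.MIN_NORMAL);  // true: the result is subnormal
        System.out.println(1.0 / diff);                // finite only because diff is not zero
    }
}
```

With flush-to-zero arithmetic, diff would be 0.0 and the final division would yield Infinity.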
Filling the underflow gap like this still loses significant digits, but not to the same
extent as flushing to zero on underflow, which loses all significant digits throughout the
underflow gap. Hence the production of a denormal number is sometimes called
gradual underflow, because it allows a calculation to lose precision slowly when the
result is small.
Some processors handle subnormal values in hardware, just as they handle normal values.
Subnormal values (as arguments or results) then pose no particular performance issue;
they are handled at the same speed as normal values. But some processors leave the
handling of subnormal values to system software and handle only normal values (and zero)
in hardware. In this case, computing with subnormal values is significantly slower than
computing with normal values. Some applications need to contain code to avoid
subnormal numbers, either to maintain accuracy or to avoid the performance
penalty on some processors.
If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero),
then the value is a subnormal (denormalized) number, which does not have an assumed leading 1
before the binary point. Thus, this represents a number (-1)^s × 0.f × 2^-126, where s is the
sign bit and f is the fraction. For double precision, denormalized numbers are of the form
(-1)^s × 0.f × 2^-1022. From this you can interpret zero as a special type of denormalized
number.
Subnormal values in the various numeric types:
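3. Floating-point rounding
The result of an operation on significands is usually held in a wider register; when the
result is converted back to the floating-point format, the extra bits must be discarded.
Rounding can be performed in several ways; the IEEE standard lists four:
Round to nearest: rounds the result to the nearest representable value.
Round toward +∞: rounds the result toward positive infinity, as ceil() does.
Round toward -∞: rounds the result toward negative infinity, as floor() does.
Round toward 0: rounds the result toward zero, as an (int) cast does by truncating.
Rounding algorithms in IEEE 754-2008: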
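1) Rounding to nearest
Round to nearest, ties to even: when the result lies exactly halfway between two
representable values, round to the one whose last digit is even (last bit 0). This is the
default and the recommended mode. (Rationale: ties are rounded up about as often as they
are rounded down, so the rounding errors do not accumulate in one direction.)
Round to nearest, ties away from zero: when the result lies exactly halfway, round away
from zero; positive values go to the larger neighbour and negative values to the smaller.
2) Directed roundings
Round toward 0: rounds the result toward zero.
Round toward +∞: rounds the result toward positive infinity.
Round toward -∞: rounds the result toward negative infinity.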
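The four directions can be illustrated in Java by rounding a double to an integral value. This is only a sketch of the direction of each mode: Java's own floating-point arithmetic always rounds to nearest, and the rounding mode of the hardware is not selectable from Java. The class name RoundingDemo and the method roundAll are illustrative.

```java
class RoundingDemo {
    // Returns {ties-to-even, toward +infinity, toward -infinity, toward zero} for v.
    static double[] roundAll(double v) {
        return new double[] {
            Math.rint(v),    // round to nearest, ties to even
            Math.ceil(v),    // round toward +infinity
            Math.floor(v),   // round toward -infinity
            (long) v         // the cast truncates, i.e. rounds toward zero
        };
    }

    public static void main(String[] args) {
        double[] values = {2.5, 3.5, -2.5, 2.4};
        for (double v : values) {
            double[] r = roundAll(v);
            System.out.println(v + ": nearest-even=" + r[0] + " up=" + r[1]
                + " down=" + r[2] + " toward-zero=" + r[3]);
        }
    }
}
```

Note that 2.5 and 3.5 round to 2.0 and 4.0 under ties to even, whereas Math.round(2.5) returns 3 because it rounds ties toward positive infinity.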
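4. Exceptions in numeric processing
The standard defines five exceptions (invalid operation, division by zero, overflow,
underflow, inexact).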
The standard defines five exceptions, each of which has a corresponding status
flag that (except in certain cases of underflow) is raised when the exception
occurs. No other action is required, but alternatives are recommended (see
below).
The five possible exceptions are:
Invalid operation (e.g., square root of a negative number)
Division by zero
Overflow (a result is too large to be represented correctly)
Underflow (a result is very small (outside the normal range) and is inexact)
Inexact.
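Java does not expose the IEEE status flags, so the five exceptions can only be observed through the default results they produce. The sketch below shows one operation per exception; the class name FloatingPointExceptionsDemo and the method defaultResults are illustrative.

```java
class FloatingPointExceptionsDemo {
    // One default result per IEEE exception, in the order listed above.
    static double[] defaultResults() {
        return new double[] {
            Math.sqrt(-1.0),          // invalid operation -> NaN
            1.0 / 0.0,                // division by zero  -> Infinity
            Double.MAX_VALUE * 2.0,   // overflow          -> Infinity
            Double.MIN_VALUE / 2.0,   // underflow         -> 0.0 (tie rounds to even)
            0.1 + 0.2                 // inexact           -> 0.30000000000000004
        };
    }

    public static void main(String[] args) {
        String[] names = {"invalid", "divide by zero", "overflow", "underflow", "inexact"};
        double[] results = defaultResults();
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + ": " + results[i]);
        }
    }
}
```

None of these operations throws; integer division by zero is the only arithmetic case in which Java raises an ArithmeticException.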
Underflow
Recall that the IEEE format for a normal floating-point number is:
(-1)^s × 2^(e - bias) × 1.f
where s is the sign bit, e is the biased exponent, and f is the fraction. Only s, e, and f
need to be stored to fully specify the number. Because the implicit leading bit of the
significand is defined to be 1 for normal numbers, it need not be stored.
The smallest positive normal number that can be stored, then, has the negative
exponent of greatest magnitude and a fraction of all zeros. Even smaller numbers can be
accommodated by considering the leading bit to be zero rather than one. In the
double-precision format, this effectively extends the minimum exponent from 10^-308 to
10^-324, because the fraction part is 52 bits long (roughly 16 decimal digits). These are the
subnormal numbers; returning a subnormal number (rather than flushing an underflowed
result to zero) is gradual underflow.
Clearly, the smaller a subnormal number, the fewer nonzero bits in its fraction;
computations producing subnormal results do not enjoy the same bounds on relative
roundoff error as computations on normal operands. However, the key fact about gradual
underflow is that its use implies:
Underflowed results need never suffer a loss of accuracy any greater than that
which results from ordinary roundoff error.
Addition, subtraction, comparison, and remainder are always exact when the
result is very small.
Recall that the IEEE format for a subnormal floating-point number is:
(-1)^s × 2^(-bias + 1) × 0.f
where s is the sign bit, the biased exponent e is zero, and f is the fraction. Note that
the implicit power-of-two bias is one greater than the bias in the normal format, and the
implicit leading bit of the fraction is zero.
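Both formats can be decoded by hand and checked against the JVM. The following sketch does this for single precision (bias 127, 23 fraction bits); the class name FloatDecoder and the method decode are illustrative. Math.scalb multiplies by a power of two and is exact for every value that arises here.

```java
class FloatDecoder {
    // Decode a finite binary32 bit pattern using the normal and subnormal formulas.
    static double decode(int bits) {
        int s = bits >>> 31;                 // sign bit
        int e = (bits >>> 23) & 0xFF;        // biased exponent
        int f = bits & 0x7FFFFF;             // fraction bits
        if (e == 0xFF) {
            throw new IllegalArgumentException("infinity or NaN has no finite value");
        }
        double fraction = f / (double) (1 << 23);           // the value 0.f
        double magnitude = (e == 0)
            ? Math.scalb(fraction, -127 + 1)                // subnormal: 2^(-bias + 1) × 0.f
            : Math.scalb(1.0 + fraction, e - 127);          // normal:    2^(e - bias) × 1.f
        return s == 0 ? magnitude : -magnitude;
    }

    public static void main(String[] args) {
        int[] patterns = {0x00000001, 0x007fffff, 0x00800000, 0x3f800000, 0xc0490fdb};
        for (int bits : patterns) {
            System.out.println(Integer.toHexString(bits) + " -> " + decode(bits)
                + " (JVM: " + Float.intBitsToFloat(bits) + ")");
        }
    }
}
```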
Gradual underflow allows you to extend the lower range of representable numbers. It
is not smallness that renders a value questionable, but its associated error. Algorithms
exploiting subnormal numbers have smaller error bounds than other systems. The next
section provides some mathematical justification for gradual underflow.
Why Gradual Underflow?
The purpose of subnormal numbers is not to avoid underflow/overflow entirely, as
some other arithmetic models do. Rather, subnormal numbers eliminate underflow as a
cause for concern for a variety of computations (typically, multiply followed by add). For
a more detailed discussion, see "Underflow and the Reliability of Numerical Software"
by James Demmel and "Combatting the Effects of Underflow and Overflow in
Determining Real Roots of Polynomials" by S. Linnainmaa.
The presence of subnormal numbers in the arithmetic means that untrapped
underflow (which implies loss of accuracy) cannot occur on addition or subtraction. If x
and y are within a factor of two, then x - y is error-free. This is critical to a number of
algorithms that effectively increase the working precision at critical places in
algorithms. In addition, gradual underflow means that errors due to underflow are no
worse than usual roundoff error. This is a much stronger statement than can be made
about any other method of handling underflow, and this fact is one of the best
justifications for gradual underflow.
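The error-free subtraction property can be checked empirically with BigDecimal, whose BigDecimal(double) constructor converts a double exactly. The sketch below tests pairs just above Double.MIN_NORMAL, where the difference is usually subnormal. The class name ExactSubtractionDemo, the helper isExact and the fixed random seed are illustrative choices.

```java
import java.math.BigDecimal;
import java.util.Random;

class ExactSubtractionDemo {
    // True if the double result x - y equals the mathematically exact difference.
    static boolean isExact(double x, double y) {
        BigDecimal exact = new BigDecimal(x).subtract(new BigDecimal(y));
        return exact.compareTo(new BigDecimal(x - y)) == 0;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        boolean allExact = true;
        for (int i = 0; i < 1000; i++) {
            double y = Double.MIN_NORMAL * (1.0 + random.nextDouble()); // tiny normal number
            double x = y * (1.0 + random.nextDouble());                 // y <= x <= 2y
            allExact &= isExact(x, y);
        }
        System.out.println(allExact);               // true: within a factor of two
        System.out.println(isExact(1.0, 1e-20));    // false: operands far apart
    }
}
```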
Most of the time, floating-point results are rounded:
computed result = true result ± roundoff
In IEEE arithmetic, with rounding mode to nearest,
0 ≤ |roundoff| ≤ 1/2 ulp
of the computed result. ulp is an acronym for Unit in the Last Place. The least
significant bit of the fraction of a number in its standard representation is the last place.
If the roundoff error is less than or equal to one half unit in the last place, then the
calculation is correctly rounded.
The ulp of 1.0 for each floating-point data type is:
Precision                Value
single                   2^-23  ~ 1.192092896e-07
double                   2^-52  ~ 2.22044604925031308e-16
Intel double extended    2^-63  ~ 1.084202172485504434e-19
quadruple                2^-112 ~ 1.92592994438723585305597794258492732e-34
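In Java the ulp of a value is available through Math.ulp, which covers the single and double rows of the table (Java has no extended or quadruple type). A minimal sketch, with the illustrative class name UlpDemo:

```java
class UlpDemo {
    public static void main(String[] args) {
        System.out.println(Math.ulp(1.0f));    // 1.1920929E-7          (2^-23)
        System.out.println(Math.ulp(1.0));     // 2.220446049250313E-16 (2^-52)
        System.out.println(Math.ulp(1.0e10));  // larger magnitude, larger ulp
        // In the subnormal range the spacing is constant and equals the smallest subnormal.
        System.out.println(Math.ulp(Double.MIN_NORMAL) == Double.MIN_VALUE);  // true
        System.out.println(Math.ulp(Double.MIN_VALUE) == Double.MIN_VALUE);   // true
    }
}
```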
Any conventional set of representable floating-point numbers has the property that
the worst effect of one inexact result is to introduce an error no worse than the distance to
one of the representable neighbors of the computed result. When subnormal numbers are
added to the representable set and gradual underflow is implemented, the worst effect of
one inexact or underflowed result is to introduce an error no greater than the distance to
one of the representable neighbors of the computed result.
In particular, in the region between zero and the smallest normal number, the
distance between any two neighboring numbers equals the distance between zero and the
smallest subnormal number. The presence of subnormal numbers eliminates the
possibility of introducing a roundoff error that is greater than the distance to the nearest
representable number.
In the absence of gradual underflow, user programs need to be sensitive to the
implicit inaccuracy threshold. For example, in single precision, if underflow occurs in
some parts of a calculation, and Store 0 is used to replace underflowed results with 0,
then accuracy can be guaranteed only to around 10^-31, not 10^-38, the usual lower range for
single-precision exponents.
This means that programmers need to implement their own method of detecting when
they are approaching this inaccuracy threshold, or else abandon the quest for a robust,
stable implementation of their algorithm. Some algorithms can be scaled so that
computations don't take place in the constricted area near zero. However, scaling the
algorithm and detecting the inaccuracy threshold can be difficult and time-consuming for
each numerical program.
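Takeaway:
Gradual underflow lets a program retain more accuracy at the point where a result would
otherwise be truncated to zero; the subnormal numbers are the range of values that
makes this possible.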
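5. float and double in Java
1) float
A float has an 8-bit exponent and a 23-bit fraction.
Exponent field range: 0 to 255; subtracting the bias of 127 gives -127 to 128.
Of these, -127 and 128 are reserved for special values; -126 to 127 are used for normal
numbers.
Zero uses the reserved exponent -127 (exponent field 0):
0x00000000 = (sign 0, exponent field 0, fraction 0) is defined to be zero rather than
1.0 × 2^(-127).
Infinity and NaN use the reserved exponent 128 (exponent field 255):
0x[7|f]f[8|c]...... = (sign 0|1, exponent 128, fraction): a zero fraction means infinity;
a nonzero fraction means NaN.
Hence the largest finite positive value is 0x7f7fffff = (0, 127, 0x7fffff),
and the smallest positive value is 0x00000001 = 2^(-23) × 2^(-126) = 2^(-149).
MIN_NORMAL = 0x00800000 = (0, exponent field 1, fraction 0) = 2^(-126)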
Why -126?
Otherwise we'd be skipping numbers: in binary,
0.1 × 2^-126 = 1.0 × 2^-127
Subnormal numbers are all the values from MIN_VALUE up to, but not including, MIN_NORMAL.
                   Subnormalized                        Normalized                          Approximate decimal
Single precision   ± 2^-149 to (1 - 2^-23) × 2^-126     ± 2^-126 to (2 - 2^-23) × 2^127     ± ~10^-44.85 to ~10^38.53
                   = (2^-23 to 1 - 2^-23) × 2^-126
Double precision   ± 2^-1074 to (1 - 2^-52) × 2^-1022   ± 2^-1022 to (2 - 2^-52) × 2^1023   ± ~10^-323.3 to ~10^308.3
                   = (2^-52 to 1 - 2^-52) × 2^-1022
In java.lang.Float:
// Same as Float.intBitsToFloat(0x7f800000): (sign 0, exponent 255 - 127 = 128, fraction 0)
public static final float POSITIVE_INFINITY = 1.0f / 0.0f;
// Same as Float.intBitsToFloat(0xff800000): (sign 1, exponent 255 - 127 = 128, fraction 0)
public static final float NEGATIVE_INFINITY = -1.0f / 0.0f;
// Same as Float.intBitsToFloat(0x7fc00000): (sign 0, exponent 255 - 127 = 128, fraction nonzero)
public static final float NaN = 0.0f / 0.0f;
// Same as Float.intBitsToFloat(0x7f7fffff)
public static final float MAX_VALUE = 0x1.fffffeP+127f; // 3.4028235e+38f
// Same as Float.intBitsToFloat(0x00800000): (sign 0, exponent 1 - 127 = -126, fraction 0)
public static final float MIN_NORMAL = 0x1.0p-126f; // 1.17549435E-38f
// Same as Float.intBitsToFloat(0x1): (sign 0, exponent field 0, so subnormal with exponent -126)
public static final float MIN_VALUE = 0x0.000002P-126f; // 1.4e-45f
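The bit patterns quoted in these comments can be confirmed directly. The sketch below prints each constant as an 8-digit hexadecimal pattern; the class name FloatConstantsDemo and the helper hex are illustrative.

```java
class FloatConstantsDemo {
    // Format the 32-bit pattern of a float as 0x followed by 8 hex digits.
    static String hex(float value) {
        return String.format("0x%08x", Float.floatToIntBits(value));
    }

    public static void main(String[] args) {
        System.out.println(hex(Float.POSITIVE_INFINITY)); // 0x7f800000
        System.out.println(hex(Float.NEGATIVE_INFINITY)); // 0xff800000
        System.out.println(hex(Float.NaN));               // 0x7fc00000
        System.out.println(hex(Float.MAX_VALUE));         // 0x7f7fffff
        System.out.println(hex(Float.MIN_NORMAL));        // 0x00800000
        System.out.println(hex(Float.MIN_VALUE));         // 0x00000001
    }
}
```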
Get the 32-bit int bit pattern of a float:
// Collapses every NaN to the canonical pattern 0x7fc00000.
public static int floatToIntBits(float value)
// Preserves the actual NaN bit pattern.
public static native int floatToRawIntBits(float value)
Return the float value represented by a 32-bit pattern:
public static native float intBitsToFloat(int bits)
2) double
In java.lang.Double:
// Same as Double.longBitsToDouble(0x7ff0000000000000L)
public static final double POSITIVE_INFINITY = 1.0 / 0.0;
// Same as Double.longBitsToDouble(0xfff0000000000000L)
public static final double NEGATIVE_INFINITY = -1.0 / 0.0;
// Same as Double.longBitsToDouble(0x7ff8000000000000L)
public static final double NaN = 0.0d / 0.0;
// Same as Double.longBitsToDouble(0x7fefffffffffffffL)
public static final double MAX_VALUE = 0x1.fffffffffffffP+1023; // 1.7976931348623157e+308
// Same as Double.longBitsToDouble(0x0010000000000000L)
public static final double MIN_NORMAL = 0x1.0p-1022; // 2.2250738585072014E-308
// Same as Double.longBitsToDouble(0x1L)
public static final double MIN_VALUE = 0x0.0000000000001P-1022; // 4.9e-324
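Get the 64-bit long bit pattern of a double:
// Collapses every NaN to the canonical pattern 0x7ff8000000000000L.
public static long doubleToLongBits(double value)
// Preserves the actual NaN bit pattern.
public static native long doubleToRawLongBits(double value)
Return the double value represented by a 64-bit pattern:
public static native double longBitsToDouble(long bits)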
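A short round trip through these methods, as a sketch. The class name DoubleBitsDemo and the particular NaN payload 0x123 are illustrative choices; the behaviour shown for doubleToLongBits (all NaNs map to one canonical pattern) is documented in the Double API.

```java
class DoubleBitsDemo {
    public static void main(String[] args) {
        long bits = Double.doubleToLongBits(1.0);
        System.out.println(Long.toHexString(bits));             // 3ff0000000000000
        System.out.println(Double.longBitsToDouble(bits));      // 1.0

        // A NaN carrying a nonzero payload in the low fraction bits.
        double payloadNaN = Double.longBitsToDouble(0x7ff8000000000123L);
        System.out.println(Double.isNaN(payloadNaN));           // true

        // doubleToLongBits collapses every NaN to the canonical pattern.
        System.out.println(Long.toHexString(Double.doubleToLongBits(payloadNaN))); // 7ff8000000000000
    }
}
```

Use doubleToRawLongBits when the payload itself matters, for example when inspecting data produced by another system.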
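6. BigDecimal in Java
public class BigDecimal extends Number implements Comparable<BigDecimal>
It implements the Comparable interface, so instances can be compared with one another.
Immutable, arbitrary-precision signed decimal numbers. A BigDecimal consists of an
arbitrary-precision integer unscaled value and a 32-bit integer scale. If the scale is zero
or positive, it is the number of digits to the right of the decimal point. If it is negative,
the unscaled value is multiplied by ten to the power of the negation of the scale. The
value represented by a BigDecimal is therefore (unscaledValue × 10^-scale).
In financial and other monetary calculations, use this class in place of double to avoid
accumulated error and obtain exact decimal results.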
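The unscaled value and the scale can be inspected directly, which makes the formula unscaledValue × 10^-scale concrete. A minimal sketch with the illustrative class name BigDecimalScaleDemo:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

class BigDecimalScaleDemo {
    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("123.45");
        System.out.println(a.unscaledValue() + " " + a.scale()); // 12345 2  (12345 × 10^-2)

        BigDecimal b = new BigDecimal("1E+3");
        System.out.println(b.unscaledValue() + " " + b.scale()); // 1 -3     (1 × 10^3)

        // Build the same value as a from its two components.
        BigDecimal c = new BigDecimal(BigInteger.valueOf(12345), 2);
        System.out.println(c.equals(a));                         // true
    }
}
```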
Test code:
import java.math.BigDecimal;
import static java.lang.System.out;
double v1 = 1.0;
double v2 = 0.9;
out.println(v1 - v2); // 0.09999999999999998
BigDecimal value1 = new BigDecimal(Double.toString(v1));
BigDecimal value2 = new BigDecimal(Double.toString(v2));
out.println(value1.subtract(value2)); // 0.1
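Usage notes:
1) Use the constructor that takes a String rather than the one that takes a double.
new BigDecimal(double) converts the exact binary value of the double, which is usually
not the decimal value written in the source (new BigDecimal(0.1) is not exactly 0.1).
public BigDecimal(String val)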
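The difference between the two constructors is easy to see by printing the results. The sketch below also shows BigDecimal.valueOf(double), which goes through Double.toString and therefore matches the String constructor for this input. The class name BigDecimalConstructorDemo is illustrative.

```java
import java.math.BigDecimal;

class BigDecimalConstructorDemo {
    public static void main(String[] args) {
        // The exact binary value of the double nearest to 0.1:
        System.out.println(new BigDecimal(0.1));
        // 0.1000000000000000055511151231257827021181583404541015625

        System.out.println(new BigDecimal("0.1"));    // 0.1
        System.out.println(BigDecimal.valueOf(0.1));  // 0.1

        System.out.println(new BigDecimal(0.1).equals(new BigDecimal("0.1")));   // false
        System.out.println(BigDecimal.valueOf(0.1).equals(new BigDecimal("0.1"))); // true
    }
}
```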
2) Basic arithmetic
Addition
public BigDecimal add(BigDecimal augend)
Subtraction
public BigDecimal subtract(BigDecimal subtrahend)
Multiplication
public BigDecimal multiply(BigDecimal multiplicand)
Division
public BigDecimal divide(BigDecimal divisor)
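A short sketch of the four operations. The one-argument divide throws ArithmeticException when the exact quotient has a non-terminating decimal expansion, so the example also uses the standard overload divide(BigDecimal, int, RoundingMode) with an explicit scale. The class name BigDecimalArithmeticDemo and the sample operands are illustrative.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class BigDecimalArithmeticDemo {
    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("10.25");
        BigDecimal b = new BigDecimal("3");

        System.out.println(a.add(b));        // 13.25
        System.out.println(a.subtract(b));   // 7.25
        System.out.println(a.multiply(b));   // 30.75

        try {
            System.out.println(a.divide(b)); // 10.25 / 3 = 3.41666... never terminates
        } catch (ArithmeticException e) {
            System.out.println("no exact quotient: " + e.getMessage());
        }

        // Give divide a scale and a rounding mode when the quotient may not terminate.
        System.out.println(a.divide(b, 4, RoundingMode.HALF_EVEN)); // 3.4167
    }
}
```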
3) Rounding modes
// Rounding Modes
//Rounding mode to round away from zero. Always increments the digit.
public final static int ROUND_UP = 0;
//Rounding mode to round towards zero. Never increments the digit.
public final static int ROUND_DOWN = 1;
//Rounding mode to round towards positive infinity.
public final static int ROUND_CEILING = 2;
//Rounding mode to round towards negative infinity.
public final static int ROUND_FLOOR = 3;
//Rounding mode to round towards nearest neighbor
//unless both neighbors are equidistant, in which case round up.
public final static int ROUND_HALF_UP = 4;
//Rounding mode to round towards nearest neighbor
//unless both neighbors are equidistant, in which case round down.
public final static int ROUND_HALF_DOWN = 5;
//Rounding mode to round towards the nearest neighbor
//unless both neighbors are equidistant, in which case, round
// towards the even neighbor.
public final static int ROUND_HALF_EVEN = 6;
//Rounding mode to assert that the requested operation has an exact
// result, hence no rounding is necessary.
public final static int ROUND_UNNECESSARY = 7;
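The effect of each mode is easiest to see on a tie. The sketch below rounds 2.5 and -2.5 to an integer with setScale, using the java.math.RoundingMode enum, whose constants correspond to the int constants listed above (the int constants are deprecated since Java 9). The class name BigDecimalRoundingDemo and the helper roundToInteger are illustrative.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class BigDecimalRoundingDemo {
    // Round a decimal string to scale 0 using the given mode.
    static BigDecimal roundToInteger(String value, RoundingMode mode) {
        return new BigDecimal(value).setScale(0, mode);
    }

    public static void main(String[] args) {
        for (RoundingMode mode : RoundingMode.values()) {
            if (mode == RoundingMode.UNNECESSARY) {
                continue; // would throw ArithmeticException: 2.5 is not an integer
            }
            System.out.println(mode + ": 2.5 -> " + roundToInteger("2.5", mode)
                + ", -2.5 -> " + roundToInteger("-2.5", mode));
        }
    }
}
```

HALF_EVEN corresponds to the IEEE default of round to nearest, ties to even; HALF_UP is the rounding usually taught in school.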
4) To compare two values numerically, use
public int compareTo(BigDecimal val)
and not
public boolean equals(Object x)
which also compares the scale, so 1 and 1.0 are not equal.
Test code:
import java.math.BigDecimal;
import static java.lang.System.out;
BigDecimal value11 = new BigDecimal("1");
BigDecimal value21 = new BigDecimal("1.0");
out.println(value11.equals(value21)); // false: the scales differ (0 vs 1)
out.println(value11.compareTo(value21) == 0); // true: the values are numerically equal
7. Development of IEEE 754
IEEE 754-2008 governs binary and decimal floating-point arithmetic. It specifies number
formats, basic operations, conversions, and exceptional conditions. The 2008 edition
supersedes both the 754-1985 standard and the related IEEE 854-1987, which generalized
754-1985 to cover decimal arithmetic as well as binary.
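The IEEE 754-2008 standard defines:
1) arithmetic formats: binary and decimal floating-point data, including
signed zeros, subnormal numbers, infinities, and NaN (Not a Number);
2) interchange formats: encodings (bit strings) used to exchange floating-point data efficiently;
3) rounding algorithms: how results are rounded during arithmetic and conversions;
4) operations: arithmetic and other operations on the arithmetic formats;
5) exception handling: indications of exceptional conditions (division by zero, overflow).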
Basic formats:
Name         Common name           Base   Digits   E min     E max     Notes
binary16     Half precision        2      10+1     -14       +15       storage, not basic
binary32     Single precision      2      23+1     -126      +127
binary64     Double precision      2      52+1     -1022     +1023
binary128    Quadruple precision   2      112+1    -16382    +16383
decimal32                          10     7        -95       +96       storage, not basic
decimal64                          10     16       -383      +384
decimal128                         10     34       -6143     +6144
All the basic formats are available in both hardware and software implementations.
Arithmetic formats:
Floating-point numbers represented by a sign, a significand, and an exponent.
Interchange formats:
The width of the exponent field for a k-bit binary interchange format with k ≥ 128 is
computed as
w = round(4×log2(k)) - 13 (the formula for the number of exponent bits)
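A quick check of the formula in Java. IEEE 754-2008 gives it for binary interchange formats of at least 128 bits whose width is a multiple of 32, so the sketch rejects other widths. The class name ExponentWidth and the method exponentBits are illustrative.

```java
class ExponentWidth {
    // Exponent field width w = round(4 × log2(k)) - 13 for a k-bit binary interchange format.
    static long exponentBits(int k) {
        if (k < 128 || k % 32 != 0) {
            throw new IllegalArgumentException("formula applies to k >= 128, k a multiple of 32");
        }
        double log2 = Math.log(k) / Math.log(2);
        return Math.round(4 * log2) - 13;
    }

    public static void main(String[] args) {
        System.out.println(exponentBits(128)); // 15, matching binary128 (E max = +16383)
        System.out.println(exponentBits(256)); // 19
    }
}
```

For binary128 this leaves 128 - 1 - 15 = 112 stored significand bits, which agrees with the 112+1 digits in the table of basic formats.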
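Decimal floating point:
Professor Kahan's view: use decimal floating point to avoid human error, that is, errors
of this kind: double d = 0.1; in fact, d ≠ 0.1. IBM's view: use decimal floating point in
economic, financial, and human-oriented programs. However, without hardware support,
decimal floating-point arithmetic implemented in software is 100 to 1000 times slower
than binary floating-point arithmetic implemented in hardware. Because decimal floating
point was adopted by IEEE 754R, IBM will implement a decimal FPU in its next-generation
Power chips.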
Summary and appendix charts: