CMPE12c Gabriel Hugh Elkaim1
What do floating-point numbers represent?
•Rational numbers with non-repeating expansions
in the given base within the specified exponent range.
•They do not represent repeating rational or irrational
numbers, or any number too small or too large.
Floating Point Format
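A minimal sketch of the point above, assuming Python's built-in float (an IEEE double on typical platforms). The helper name is_exact_in_binary is hypothetical; it compares the exact value of a decimal string with the exact value of the double it rounds to.

```python
from fractions import Fraction

def is_exact_in_binary(value: str) -> bool:
    """Return True if the decimal string converts to a double with no rounding."""
    # Fraction(str) is the exact decimal value; Fraction(float) is the exact
    # value actually stored in the double.
    return Fraction(float(value)) == Fraction(value)

print(is_exact_in_binary("0.25"))  # 1/4 has a finite base-2 expansion
print(is_exact_in_binary("0.1"))   # 1/10 repeats forever in base 2
```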
IEEE Double Precision FP
•IEEE Double Precision is similar to SP
–52-bit M
•53 bits of precision with hidden bit
–11-bit E, excess 1023, representing –1022 <--> 1023
–One sign bit
•Always use DP unless memory/file size is important
–SP ~ 10^-38 … 10^38
–DP ~ 10^-308 … 10^308
•Be very careful of these ranges in numeric computation
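The DP layout above can be inspected with a short sketch, assuming Python floats are IEEE doubles. The function name dp_fields is hypothetical; it splits the 64 bits into the sign bit, the 11-bit excess-1023 exponent, and the 52-bit fraction.

```python
import struct

def dp_fields(x: float):
    """Return (sign, biased exponent E, 52-bit fraction M) of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    e = (bits >> 52) & 0x7FF          # 11-bit exponent, excess 1023
    m = bits & ((1 << 52) - 1)        # 52 stored fraction bits (hidden bit not stored)
    return sign, e, m

sign, e, m = dp_fields(-2.0)
print(sign, e - 1023, m)              # sign, true exponent, fraction

# Stepping outside the ~10^-308 ... 10^308 range loses the value entirely.
print(1e308 * 10)                     # overflows to infinity
print(1e-308 / 1e20)                  # underflows to zero
```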
Floating Point Arithmetic
Floating Point operations include
•Addition
•Subtraction
•Multiplication
•Division
They are complicated because…
Floating Point Addition
Decimal Review
  9.997 x 10^2
+ 4.631 x 10^-1
1.Align decimal points
  9.997    x 10^2
+ 0.004631 x 10^2
2.Add
  10.001631 x 10^2
3.Normalize the result
•Often already normalized
•Otherwise move one digit
  1.0001631 x 10^3
4.Round result
  1.000 x 10^3
How do we do this?
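The decimal review can be checked with Python's decimal module. This is only a sketch: the 4-digit context and the round-half-even mode are assumptions chosen to match the four-digit mantissas in the example.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

a = Decimal("9.997E+2")
b = Decimal("4.631E-1")

# The default context keeps 28 digits, so this sum is exact.
exact = a + b
print(exact)

# A 4-digit context performs the final "round result" step.
four_digits = Context(prec=4, rounding=ROUND_HALF_EVEN)
rounded = four_digits.add(a, b)
print(rounded)
```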
Floating Point Addition
Example: 0.25 + 100 in SP FP
First step: get into SP FP if not already
.25 = 0 01111101 00000000000000000000000
100 = 0 10000101 10010000000000000000000
Or with hidden bit (the single 1 between E and F)
.25 = 0 01111101 1 00000000000000000000000
100 = 0 10000101 1 10010000000000000000000
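The two encodings above can be reproduced with the standard struct module. A small sketch; the helper name sp_fields is hypothetical, and it prints the stored fields only (the hidden bit is implied, not stored).

```python
import struct

def sp_fields(x: float) -> str:
    """Format x as 'S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF' (IEEE single precision)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(sp_fields(0.25))
print(sp_fields(100.0))
```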
Floating Point Addition
Second step: Align radix points
–Shifting F left by 1 bit, decreasing e by 1
–Shifting F right by 1 bit, increasing e by 1
–Shift F right so least significant bits fall off
–Which of the two numbers should we shift?
Floating Point Addition
Second step: Align radix points cont.
Shift the .25 to increase its exponent so it matches that of 100.
0.25’s e: 01111101 – 01111111 (127) = –2
100’s e: 10000101 – 01111111 (127) = 6
Shift .25 by 6 – (–2) = 8 then.
Easier method: Bias cancels with subtraction, so
  10000101   100’s E
- 01111101   0.25’s E
  00001000   = 8
Carefully shifting the 0.25’s fraction
S E HB F
•0 01111101 1 00000000000000000000000 (original value)
•0 01111110 0 10000000000000000000000 (shifted by 1)
•0 01111111 0 01000000000000000000000 (shifted by 2)
•0 10000000 0 00100000000000000000000 (shifted by 3)
•0 10000001 0 00010000000000000000000 (shifted by 4)
•0 10000010 0 00001000000000000000000 (shifted by 5)
•0 10000011 0 00000100000000000000000 (shifted by 6)
•0 10000100 0 00000010000000000000000 (shifted by 7)
•0 10000101 0 00000001000000000000000 (shifted by 8)
Floating Point Addition
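The alignment step can be sketched in a few lines. The function name align is hypothetical; it assumes a 24-bit significand (hidden bit plus 23 fraction bits) and biased 8-bit exponents, and it simply lets the low bits fall off.

```python
def align(e_small: int, sig_small: int, e_big: int):
    """Shift the smaller operand's significand right until the exponents match."""
    shift = e_big - e_small            # the bias cancels in the subtraction
    return e_big, sig_small >> shift, shift

# 0.25 has E = 01111101 and significand 1.000...0; 100 has E = 10000101.
e, sig, shift = align(0b01111101, 1 << 23, 0b10000101)
print(shift)                                   # number of positions shifted
print(f"{e:08b} {sig >> 23} {sig & 0x7FFFFF:023b}")   # E, HB, F after alignment
```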
Floating Point Addition
Third Step: Add fractions with hidden bit
0 10000101 1 10010000000000000000000 (100)
+ 0 10000101 0 00000001000000000000000 (.25)
0 10000101 1 10010001000000000000000
Fourth Step: Normalize the result
•Get a ‘1’ back in hidden bit
•Already normalized most of the time
•Remove hidden bit and finished
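Putting the four steps together, here is a rough sketch of SP addition. It is not a full IEEE implementation: it assumes both operands are positive and normalized, it truncates instead of rounding, and it ignores overflow. The function names are hypothetical.

```python
import struct

def float_to_bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b))[0]

def add_sp_positive(a_bits: int, b_bits: int) -> int:
    """Add two positive, normalized SP bit patterns (truncating, no overflow check)."""
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    sa = (a_bits & 0x7FFFFF) | 0x800000     # restore the hidden bit
    sb = (b_bits & 0x7FFFFF) | 0x800000
    if ea < eb:                             # make a the operand with the larger exponent
        ea, eb, sa, sb = eb, ea, sb, sa
    sb >>= ea - eb                          # align radix points
    total = sa + sb                         # add significands
    if total & 0x1000000:                   # carry past the hidden bit: renormalize
        total >>= 1
        ea += 1
    return (ea << 23) | (total & 0x7FFFFF)  # drop the hidden bit again

result = add_sp_positive(float_to_bits(0.25), float_to_bits(100.0))
print(f"0x{result:08X}", bits_to_float(result))
```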
Floating Point Addition
Normalization example
  S E   HB F
  0 011 1  1100
+ 0 011 1  1011
  0 011 11 0111
Need to shift so that only a 1 in HB spot
  0 100 1  1011   (last bit, 1, discarded)
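The normalization example can be traced with a toy routine. This sketch assumes the slide's small format (3-bit E, 4-bit F); the function name normalize is hypothetical, and bits shifted out on the right are simply discarded.

```python
def normalize(e: int, sig: int, frac_bits: int = 4):
    """Return (e, sig) with exactly one 1 in the hidden-bit position."""
    hidden = 1 << frac_bits
    if sig == 0:
        return e, sig                 # zero cannot be normalized
    while sig >= hidden << 1:         # too big: shift right, raise exponent
        sig >>= 1
        e += 1
    while sig < hidden:               # too small: shift left, lower exponent
        sig <<= 1
        e -= 1
    return e, sig

# 1.1100 + 1.1011 = 11.0111, with both operands at E = 011
e, sig = normalize(0b011, 0b11100 + 0b11011)
print(f"{e:03b} {sig >> 4} {sig & 0xF:04b}")   # E, HB, F
```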
Floating Point Example
•0xD4F80000 + 0x56B00000
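A hand-worked answer to an exercise like this can be checked against the machine's own arithmetic. The helper add_sp_hex below is hypothetical; it assumes Python floats are IEEE doubles, in which case rounding the double sum back to single precision gives the same bits as a true SP add.

```python
import struct

def add_sp_hex(a: int, b: int) -> int:
    """Add two SP bit patterns using the host's floating point; return the SP bits."""
    fa = struct.unpack(">f", struct.pack(">I", a))[0]
    fb = struct.unpack(">f", struct.pack(">I", b))[0]
    return struct.unpack(">I", struct.pack(">f", fa + fb))[0]

# Work the example by hand first, then compare with this output.
print(f"0x{add_sp_hex(0xD4F80000, 0x56B00000):08X}")
```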
Another SP FP Example
•0xD5D00000 + 0x54600000
Floating Point Subtraction
•Mantissas are sign-magnitude
•Watch out when the numbers are close
  1.23455 x 10^2
- 1.23456 x 10^2
•A many-digit normalization is possible
This is why FP addition is in many ways more
difficult than FP multiplication
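The close-operands case can be made concrete with Python's decimal module. A small sketch: adjusted() returns the exponent of the leading digit, so the difference between the operand's and the result's adjusted exponents is the number of digit positions normalization has to move.

```python
from decimal import Decimal

a = Decimal("1.23455E+2")
b = Decimal("1.23456E+2")

diff = a - b
print(diff)                                  # only one significant digit survives

# How far the leading digit moved relative to the operands.
shift = a.adjusted() - diff.adjusted()
print(shift)
```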
Floating Point Subtraction
Steps to do subtraction
1.Align radix points
2.Perform sign-magnitude operand swap if needed
•Compare magnitudes (with hidden bit)
•Change sign bit if order of operands is changed.
3.Subtract
4.Normalize
5.Round
Floating Point Subtraction
Simple Example:
  S E   HB F
  0 011 1  1011   smaller
- 0 011 1  1101   bigger
switch order and make result negative
  0 011 1  1101   bigger
- 0 011 1  1011   smaller
  1 011 0  0010   (sign switched to 1)
  1 000 1  0000   (normalized: F shifted left 3, E decreased by 3)
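The simple example can be followed step by step in code. This is a sketch under the slide's assumptions: a toy format with 4 fraction bits, two positive operands that already share an exponent, and no rounding. The function name subtract_toy is hypothetical.

```python
def subtract_toy(e: int, sig_a: int, sig_b: int, frac_bits: int = 4):
    """Compute a - b for two positive operands with a common exponent e.

    sig_a and sig_b include the hidden bit. Returns (sign, e, sig).
    """
    sign = 0
    if sig_a < sig_b:                  # compare magnitudes, swap if needed
        sig_a, sig_b = sig_b, sig_a
        sign = 1                       # operand order changed, so flip the sign
    diff = sig_a - sig_b               # subtract
    hidden = 1 << frac_bits
    while diff and diff < hidden:      # normalize: get a 1 back in the hidden bit
        diff <<= 1
        e -= 1
    return sign, e, diff

sign, e, sig = subtract_toy(0b011, 0b11011, 0b11101)
print(sign, f"{e:03b}", sig >> 4, f"{sig & 0xF:04b}")   # S, E, HB, F
```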
Floating Point Multiplication
Decimal example:
  3.0 x 10^1
x 5.0 x 10^2
1.Multiply mantissas
  3.0
x 5.0
  15.00
2.Add exponents
  1 + 2 = 3
3.Combine
  15.00 x 10^3
4.Normalize if needed
  1.50 x 10^4
How do we do this?
Floating Point Multiplication
Multiplication in binary (4-bit F)
  0 10000100 0100
x 1 00111100 1100
Step 1: Multiply mantissas
(put hidden bit back first!!)
      1.0100
    x 1.1100
       00000
      00000
     10100
    10100
+  10100
  1000110000
Placing the radix point (8 fraction bits): 10.00110000
Floating Point Multiplication
Second step: Add exponents, subtract extra bias.
  10000100
+ 00111100
  11000000

  11000000
- 01111111 (127)
  01000001
Third step: Renormalize, correcting exponent
1 01000001 10.00110000
Becomes
1 01000010 1.000110000
Fourth step: Drop the hidden bit
1 01000010 000110000
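The four multiplication steps can be sketched for the slide's reduced format (1 sign bit, 8-bit excess-127 exponent, 4-bit F). The function name multiply_toy is hypothetical; the sketch truncates the product to 4 fraction bits and does not handle exponent overflow or underflow.

```python
BIAS = 127
FRAC_BITS = 4

def multiply_toy(a, b):
    """Multiply two (sign, biased exponent, 4-bit fraction) triples."""
    (sa, ea, fa), (sb, eb, fb) = a, b
    sign = sa ^ sb                              # signs differ -> negative
    e = ea + eb - BIAS                          # add exponents, remove the extra bias
    hidden = 1 << FRAC_BITS
    product = (hidden | fa) * (hidden | fb)     # has 2 * FRAC_BITS fraction bits
    shift = FRAC_BITS
    if product >> (2 * FRAC_BITS + 1):          # product >= 2.0: renormalize
        shift += 1
        e += 1
    frac = (product >> shift) & (hidden - 1)    # drop hidden bit, truncate to 4 bits
    return sign, e, frac

s, e, f = multiply_toy((0, 0b10000100, 0b0100), (1, 0b00111100, 0b1100))
print(s, f"{e:08b}", f"{f:04b}")
```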
Floating Point Multiplication
Multiply these SP FP numbers together
  0x49FC0000
x 0x4BE00000
Another SP FP Example
•0xC9F4 × 0x484F
Floating Point Division
•True division
–Unsigned, full-precision division on mantissas
–This is much more costly (e.g. 4x) than mult.
–Subtract exponents
•Faster division
–Newton’s method to find reciprocal
–Multiply dividend by reciprocal of divisor
–May not yield exact result without some work
–Similar speed as multiplication
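The faster-division idea can be sketched as follows. This is an illustration, not how any particular FPU does it: the function names, the iteration count, and the linear initial guess 48/17 - 32/17·m (a common choice for m in [0.5, 1)) are assumptions, and the divisor is assumed to be nonzero and finite. The result may differ from true division in the last bits, which is the "not exact without some work" caveat.

```python
import math

def reciprocal(d: float, iterations: int = 4) -> float:
    """Approximate 1/d with Newton's method, using only multiply and subtract."""
    m, e = math.frexp(abs(d))            # abs(d) = m * 2**e with 0.5 <= m < 1
    x = 48 / 17 - 32 / 17 * m            # initial guess for 1/m
    for _ in range(iterations):
        x = x * (2 - m * x)              # Newton step for f(x) = 1/x - m
    return math.copysign(math.ldexp(x, -e), d)   # undo the scaling, restore sign

def divide(n: float, d: float) -> float:
    """Multiply the dividend by the reciprocal of the divisor."""
    return n * reciprocal(d)

print(divide(1.0, 3.0))
print(divide(10.0, -4.0))
```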