floating point

A floating-point number has three parts:

- sign bit (s)
- exponent bits (e), which encode the exponent E
- fraction bits (f), which encode the significand M

\begin{equation} V = (-1)^s \times M \times 2^E \end{equation}

There are two kinds of finite floating-point values:

- normalized
- denormalized

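### normalized

A value is normalized when the exponent field e is neither all 0s nor all 1s.

$E = e - Bias$

$Bias = 2^{k-1} - 1$, where k is the number of exponent bits

$M = 1.f_{n-1} \ldots f_1 f_0$, where n is the number of fraction bits (the leading 1 is implicit)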
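The normalized rules can be sketched in a few lines of Python. This is only an illustration: the function name `decode_normalized` is made up here, and the defaults k = 4 and n = 3 are borrowed from the 8-bit example format used in the table on this page. Exact fractions are used so the results can be compared with the table directly.

```python
from fractions import Fraction

def decode_normalized(e: int, f: int, k: int = 4, n: int = 3) -> Fraction:
    """Value of a positive normalized number from its raw e and f fields."""
    assert 0 < e < (1 << k) - 1, "e must be neither all 0s nor all 1s"
    bias = (1 << (k - 1)) - 1        # Bias = 2^(k-1) - 1
    E = e - bias                     # E = e - Bias
    M = 1 + Fraction(f, 1 << n)      # M = 1.f (implicit leading 1)
    return M * Fraction(2) ** E

print(decode_normalized(0b0111, 0b000))  # 1
print(decode_normalized(0b0110, 0b110))  # 7/8
print(decode_normalized(0b1110, 0b111))  # 240
```

The three calls correspond to the bit patterns 0 0111 000, 0 0110 110 and 0 1110 111.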

### denormalized

A value is denormalized when e is all 0s.

$E = 1 - Bias$

$M = 0.f_{n-1} \ldots f_1 f_0$, that is, $M = f$ with no implicit leading 1
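A short sketch may help show why the exponent is $1 - Bias$ rather than $-Bias$: with this choice the largest denormalized value sits exactly one step below the smallest normalized value. The helper name `decode_denormalized` is hypothetical, and k = 4, n = 3 are assumed from the 8-bit example format on this page.

```python
from fractions import Fraction

def decode_denormalized(f: int, k: int = 4, n: int = 3) -> Fraction:
    """Value of a non-negative denormalized number (e is all 0s)."""
    bias = (1 << (k - 1)) - 1        # Bias = 2^(k-1) - 1
    E = 1 - bias                     # E = 1 - Bias
    M = Fraction(f, 1 << n)          # M = 0.f (no implicit leading 1)
    return M * Fraction(2) ** E

step = decode_denormalized(0b001)            # spacing between denormalized values
largest_denorm = decode_denormalized(0b111)  # bit pattern 0 0000 111
smallest_norm = Fraction(2) ** (1 - 7)       # bit pattern 0 0001 000: M = 1, E = 1 - 7

print(step)                                   # 1/512
print(largest_denorm)                         # 7/512
print(smallest_norm - largest_denorm == step) # True
```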

### special values

When e is all 1s, the encoding is a special value:

- if f = 0, it is $\infty$ (positive or negative, depending on s)
- otherwise it is NaN (Not a Number)
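Putting the cases together, a complete decoder for a small format might look like the sketch below. It is an illustration under assumptions, not a reference implementation: the function name `decode` is invented here, the layout (sign bit, then k exponent bits, then n fraction bits) follows the 8-bit example on this page, and results are returned as ordinary Python floats.

```python
import math

def decode(bits: int, k: int = 4, n: int = 3) -> float:
    """Decode a (1 + k + n)-bit pattern laid out as s | e | f."""
    s = (bits >> (k + n)) & 1
    e = (bits >> n) & ((1 << k) - 1)
    f = bits & ((1 << n) - 1)
    bias = (1 << (k - 1)) - 1
    sign = -1.0 if s else 1.0

    if e == (1 << k) - 1:                 # e is all 1s: special values
        return sign * math.inf if f == 0 else math.nan
    if e == 0:                            # e is all 0s: denormalized
        return sign * math.ldexp(f, 1 - bias - n)
    # otherwise normalized: M = (2^n + f) / 2^n, E = e - bias
    return sign * math.ldexp((1 << n) + f, e - bias - n)

print(decode(0b0_1111_000))  # inf
print(decode(0b1_1111_000))  # -inf
print(decode(0b0_1111_001))  # nan
print(decode(0b0_0111_000))  # 1.0
```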

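The following example uses an 8-bit floating-point format with k = 4 exponent bits, n = 3 fraction bits, and Bias = 7. Note that the values increase together with the bit patterns, so non-negative numbers are easy to compare.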

| Description | Bit representation | e | E | f | M | V |
|---|---|---|---|---|---|---|
| Zero | 0 0000 000 | 0 | -6 | 0 | 0 | 0 |
| Smallest denormalized | 0 0000 001 | 0 | -6 | 1/8 | 1/8 | 1/512 |
| | 0 0000 010 | 0 | -6 | 2/8 | 2/8 | 2/512 |
| | 0 0000 011 | 0 | -6 | 3/8 | 3/8 | 3/512 |
| | … | | | | | |
| | 0 0000 110 | 0 | -6 | 6/8 | 6/8 | 6/512 |
| Largest denormalized | 0 0000 111 | 0 | -6 | 7/8 | 7/8 | 7/512 |
| Smallest normalized | 0 0001 000 | 1 | -6 | 0 | 8/8 | 8/512 |
| | 0 0001 001 | 1 | -6 | 1/8 | 9/8 | 9/512 |
| | … | | | | | |
| | 0 0110 110 | 6 | -1 | 6/8 | 14/8 | 14/16 |
| | 0 0110 111 | 6 | -1 | 7/8 | 15/8 | 15/16 |
| One | 0 0111 000 | 7 | 0 | 0 | 8/8 | 1 |
| | 0 0111 001 | 7 | 0 | 1/8 | 9/8 | 9/8 |
| | 0 0111 010 | 7 | 0 | 2/8 | 10/8 | 10/8 |
| | … | | | | | |
| | 0 1110 110 | 14 | 7 | 6/8 | 14/8 | 224 |
| Largest normalized | 0 1110 111 | 14 | 7 | 7/8 | 15/8 | 240 |
| Infinity | 0 1111 000 | - | - | - | - | $\infty$ |
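The ordering visible in the table can be checked mechanically. The sketch below assumes the same 8-bit layout (k = 4, n = 3); the helper `value` is a made-up name that decodes only non-negative, non-NaN patterns. It walks every bit pattern from zero up to positive infinity in unsigned-integer order and confirms that the decoded values strictly increase.

```python
import math

K, N = 4, 3                      # exponent bits, fraction bits (from the example)
BIAS = (1 << (K - 1)) - 1        # 7

def value(bits: int) -> float:
    """Decode a non-negative, non-NaN 8-bit pattern."""
    e = (bits >> N) & ((1 << K) - 1)
    f = bits & ((1 << N) - 1)
    if e == (1 << K) - 1:
        return math.inf          # only reached here with f == 0
    if e == 0:
        return math.ldexp(f, 1 - BIAS - N)          # denormalized
    return math.ldexp((1 << N) + f, e - BIAS - N)   # normalized

# 0 0000 000 (zero) through 0 1111 000 (infinity), in unsigned-integer order
patterns = range(0b0_0000_000, 0b0_1111_000 + 1)
values = [value(b) for b in patterns]

is_monotonic = all(a < b for a, b in zip(values, values[1:]))
print(is_monotonic)  # True
```

Negative numbers are not covered by this check: with the sign bit set, larger bit patterns correspond to smaller values.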

page revision: 5, last edited: 03 Oct 2008 06:51