Binary Floating Point-8 Bit (bf8) Specification
Overview
The bf8 format is a compact 8-bit floating-point representation. It aims to balance range, precision, and computational cost, which makes it well suited to applications with limited memory or bandwidth.
Bit Allocation
The bf8 format represents a floating-point number using a total of 8 bits:
- One bit for the sign
- Three bits for the exponent
- Four bits for the fraction
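The bit allocation above can be sketched as a pair of helper functions. This is a minimal illustration, not a definitive layout: it assumes the fields are stored in the order sign, exponent, fraction from the most significant bit down (as in the IEEE 754 interchange formats), which the list above does not state explicitly. The function names are hypothetical.

```python
def unpack_fields(byte: int) -> tuple:
    """Split an 8-bit pattern into (sign, exponent, fraction) fields.

    Assumed layout (not stated in the text): bit 7 = sign,
    bits 6..4 = exponent, bits 3..0 = fraction.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("a bf8 pattern must fit in 8 bits")
    sign = (byte >> 7) & 0x1
    exponent = (byte >> 4) & 0x7
    fraction = byte & 0xF
    return sign, exponent, fraction


def pack_fields(sign: int, exponent: int, fraction: int) -> int:
    """Combine the three fields back into one 8-bit pattern."""
    if not (0 <= sign <= 1 and 0 <= exponent <= 7 and 0 <= fraction <= 15):
        raise ValueError("field out of range")
    return (sign << 7) | (exponent << 4) | fraction
```

For example, under this assumed layout the pattern 1101 0110 splits into sign 1, exponent 101 and fraction 0110.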
Representation
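A normalized bf8 number N (exponent field neither all 0s nor all 1s) is represented by the formula: N = (-1)^sign * 1.fraction * 2^(exponent - bias). The value of bias is not stated in this specification; under the IEEE 754 convention for a 3-bit exponent it would be 2^(3 - 1) - 1 = 3.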
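The formula can be evaluated directly, as in the sketch below. It assumes a bias of 3 (the IEEE 754 convention 2^(k - 1) - 1 for k = 3 exponent bits), since the text does not give the bias. The name decode_normal is hypothetical.

```python
BIAS = 3  # assumption: IEEE-style bias 2**(3 - 1) - 1; not stated in the text


def decode_normal(sign: int, exponent: int, fraction: int) -> float:
    """Evaluate (-1)^sign * 1.fraction * 2^(exponent - bias)."""
    if not 1 <= exponent <= 6:
        # Exponent 0 (denormalized) and 7 (special) follow other rules.
        raise ValueError("not a normalized bf8 number")
    significand = 1.0 + fraction / 16.0  # implicit leading 1, 4 fraction bits
    return (-1.0) ** sign * significand * 2.0 ** (exponent - BIAS)
```

Under this assumption, sign 0, exponent 100 and fraction 1000 give 1.5 * 2^(4 - 3) = 3.0, and the largest normalized value is 1.9375 * 2^(6 - 3) = 15.5.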
Special Numbers
- Infinity: Exponent of all 1s and fraction of all 0s
- Negative Infinity: Infinity with the sign bit set to 1
- NaN (Not a Number): Exponent of all 1s and non-zero fraction
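The three rules above amount to a check on the exponent and fraction fields. The sketch below assumes the sign bit is bit 7, the exponent occupies bits 6..4 and the fraction bits 3..0. The name decode_special is hypothetical.

```python
import math


def decode_special(byte: int):
    """Return +inf, -inf or NaN for a special pattern, else None."""
    sign = (byte >> 7) & 0x1
    exponent = (byte >> 4) & 0x7
    fraction = byte & 0xF
    if exponent != 0b111:
        return None  # ordinary (normalized or denormalized) number
    if fraction == 0:
        return -math.inf if sign else math.inf
    return math.nan  # exponent all 1s with a non-zero fraction
```

With this layout, 0x70 is positive infinity, 0xF0 is negative infinity, and the 30 patterns 0x71..0x7F and 0xF1..0xFF are all NaN.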
Denormalized Numbers
When the exponent field is all 0s, the number is interpreted in denormalized form: N_denormalized = (-1)^sign * 0.fraction * 2^(1 - bias). A fraction of all 0s in this form yields positive or negative zero, depending on the sign bit.
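The denormalized formula can be evaluated the same way as the normalized one. The sketch below again assumes a bias of 3, which the text does not state. The name decode_denormal is hypothetical.

```python
BIAS = 3  # assumption: IEEE-style bias 2**(3 - 1) - 1; not stated in the text


def decode_denormal(sign: int, fraction: int) -> float:
    """Evaluate (-1)^sign * 0.fraction * 2^(1 - bias) for exponent field 000."""
    if not 0 <= fraction <= 15:
        raise ValueError("fraction must fit in 4 bits")
    # No implicit leading 1: the significand is fraction / 16.
    return (-1.0) ** sign * (fraction / 16.0) * 2.0 ** (1 - BIAS)
```

With a bias of 3, the denormalized values are the multiples of 1/64 from 0 to 15/64. The next multiple, 16/64 = 0.25, is the smallest normalized number, so the spacing stays uniform across the boundary.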
Rounding
Rounding should adhere to the IEEE 754 standard, including its round-to-nearest, ties-to-even rule (banker’s rounding).
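As an illustration of round-to-nearest-even, the sketch below converts a Python float to a bf8 bit pattern. It is one possible approach under stated assumptions, not a reference implementation: it assumes a bias of 3, a sign/exponent/fraction layout from the most significant bit down, and an arbitrary NaN pattern (0x78). It relies on Python's built-in round(), which rounds ties to the even integer. The name encode_bf8 is hypothetical.

```python
import math

BIAS = 3       # assumption: IEEE-style bias, not stated in the text
FRAC_BITS = 4


def encode_bf8(x: float) -> int:
    """Round a Python float to the nearest bf8 pattern, ties to even."""
    if math.isnan(x):
        return 0x78  # any non-zero fraction would do; this choice is arbitrary
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    mag = abs(x)
    if math.isinf(mag):
        return sign | 0x70
    min_exp = 1 - BIAS  # exponent shared by denormals and the smallest normals
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exp = max(math.frexp(mag)[1] - 1, min_exp) if mag else min_exp
    ulp = 2.0 ** (exp - FRAC_BITS)  # spacing of bf8 values in this binade
    q = round(mag / ulp)            # built-in round() breaks ties to even
    # Adding q (which includes the implicit leading 1 for normals) lets a
    # carry out of the fraction bump the exponent field automatically.
    code = ((exp + BIAS - 1) << 4) + q
    return sign | min(code, 0x70)   # past the largest finite value: infinity
```

For instance, 1.03125 lies exactly halfway between 1.0 (fraction 0000) and 1.0625 (fraction 0001) and rounds to 1.0, while 1.09375 lies halfway between fractions 0001 and 0010 and rounds up to the even fraction 0010.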
Operations
The basic operations (addition, subtraction, multiplication, and division) should conform to the IEEE 754 rules, including the handling of overflow, underflow, and exceptions.
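One way to model these operations in software is to decode both operands, compute in a wider format, and round the result once. The sketch below does this for addition, subtraction and multiplication using a lookup table of all finite bf8 magnitudes. It assumes a bias of 3 and a sign/exponent/fraction layout from the most significant bit down, uses an arbitrary NaN pattern (0x78), and does not model IEEE 754 exception flags. Division is omitted because Python raises an error on float division by zero. All function names are hypothetical.

```python
import math

BIAS = 3  # assumption: IEEE-style bias, not stated in the text


def decode(code: int) -> float:
    """Convert a bf8 bit pattern to a Python float."""
    sign = -1.0 if code & 0x80 else 1.0
    exponent, fraction = (code >> 4) & 0x7, code & 0xF
    if exponent == 0x7:
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:
        return sign * (fraction / 16.0) * 2.0 ** (1 - BIAS)
    return sign * (1.0 + fraction / 16.0) * 2.0 ** (exponent - BIAS)


# Magnitudes of codes 0x00..0x6F ascend with the code. The extra entry 16.0
# is the value 0x70 would have if it were finite; rounding to it is overflow.
GRID = [decode(c) for c in range(0x70)] + [16.0]


def encode(x: float) -> int:
    """Round to the nearest bf8 pattern, ties to the even code."""
    if math.isnan(x):
        return 0x78  # arbitrary NaN pattern
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    mag = min(abs(x), 16.0)  # anything at or beyond 16 becomes infinity
    code = min(range(len(GRID)), key=lambda c: (abs(GRID[c] - mag), c & 1))
    return sign | code


def bf8_add(a: int, b: int) -> int:
    return encode(decode(a) + decode(b))


def bf8_sub(a: int, b: int) -> int:
    return encode(decode(a) - decode(b))


def bf8_mul(a: int, b: int) -> int:
    return encode(decode(a) * decode(b))
```

Python floats are binary64, which represents every sum and product of two finite bf8 values exactly, so each operation here is rounded only once. Overflow produces infinity (15.5 * 2 gives 0x70), a tiny product such as 1/64 * 0.5 rounds to zero, and invalid operations such as infinity minus infinity produce NaN.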
Implementation
Implementations should use a low-level language that can interface with hardware instruction sets. They should also support compound operations such as fused multiply-add.
Application
The bf8 format is well suited to memory-constrained applications and architectures, including embedded devices, IoT devices, and machine learning applications.
Disclaimer
With only four fraction bits, a normalized bf8 number carries five significant bits (roughly 1.5 decimal digits). The format should therefore not be used for applications that require high precision.