8-Bit Binary Floating-Point (bf8) Specification

Overview

The bf8 format is a compact 8-bit floating-point representation. It aims to balance range, precision, and computational cost, which makes it suitable for applications with limited memory or bandwidth.

Bit Allocation

The bf8 format represents a floating-point number using a total of 8 bits:

One bit for the sign
Three bits for the exponent
Four bits for the fraction
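
The field split above can be sketched in Python as follows. This is a minimal illustration, not a normative layout: it assumes the sign occupies the most significant bit, followed by the exponent and then the fraction (the usual IEEE 754-style ordering, which this document does not state explicitly). The helper names unpack_bf8 and pack_bf8 are hypothetical.

```python
def unpack_bf8(byte: int) -> tuple:
    """Split an 8-bit pattern into (sign, exponent, fraction) fields.

    Assumed layout (not stated in the specification): s eee ffff,
    with the sign in the most significant bit.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("a bf8 bit pattern must fit in 8 bits")
    sign = (byte >> 7) & 0x1      # 1 bit
    exponent = (byte >> 4) & 0x7  # 3 bits
    fraction = byte & 0xF         # 4 bits
    return sign, exponent, fraction


def pack_bf8(sign: int, exponent: int, fraction: int) -> int:
    """Combine the three fields back into one 8-bit pattern."""
    return ((sign & 0x1) << 7) | ((exponent & 0x7) << 4) | (fraction & 0xF)
```

Under this assumed layout, the pattern 0b10110100 unpacks to sign 1, exponent 3, fraction 4, and packing those fields gives the same byte back.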

Representation

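A normalized bf8 number N (exponent field neither all 0s nor all 1s) can be represented using the formula: N = (-1)^sign * 1.fraction * 2^(exponent - bias)

This document does not state the value of the bias. Following the IEEE 754 convention of 2^(k-1) - 1 for a k-bit exponent, a 3-bit exponent would have a bias of 3.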
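
As a worked illustration of the formula, the sketch below evaluates it for normalized encodings only. It assumes a bias of 3, which follows the IEEE 754 convention for a 3-bit exponent but is not fixed by this document. The name decode_normalized is hypothetical.

```python
BIAS = 3  # assumption: 2**(3 - 1) - 1, the IEEE 754-style bias for 3 exponent bits


def decode_normalized(sign: int, exponent: int, fraction: int) -> float:
    """Evaluate (-1)^sign * 1.fraction * 2^(exponent - bias).

    Only exponent fields 1..6 are normalized; 0 and 7 are reserved for
    denormalized and special values.
    """
    if not 1 <= exponent <= 6:
        raise ValueError("exponent fields 0 and 7 are not normalized encodings")
    significand = 1.0 + fraction / 16.0  # four fraction bits: steps of 1/16
    return (-1.0) ** sign * significand * 2.0 ** (exponent - BIAS)
```

With this bias, sign 0, exponent 4, and fraction 8 give 1.5 * 2^1 = 3.0. The largest normalized magnitude is 1.9375 * 2^3 = 15.5 and the smallest is 1.0 * 2^-2 = 0.25.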

Special Numbers

Infinity: Exponent of all 1s and fraction of all 0s
Negative Infinity: Infinity with the sign bit set to 1
NaN (Not a Number): Exponent of all 1s and non-zero fraction
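
The three rules above could be checked directly on the bit pattern, as in this sketch. It assumes the sign is the most significant bit, followed by the exponent and the fraction. The function name classify_bf8 and its return strings are hypothetical.

```python
def classify_bf8(byte: int) -> str:
    """Label a bf8 bit pattern as '+inf', '-inf', 'nan', or 'finite'."""
    sign = (byte >> 7) & 0x1
    exponent = (byte >> 4) & 0x7
    fraction = byte & 0xF
    if exponent == 0x7:           # exponent field is all 1s
        if fraction == 0:         # fraction all 0s: infinity, signed by the sign bit
            return "-inf" if sign else "+inf"
        return "nan"              # any non-zero fraction: NaN
    return "finite"
```

Under this layout, 0x70 is positive infinity, 0xF0 is negative infinity, and the 30 patterns 0x71 to 0x7F and 0xF1 to 0xFF are all NaNs.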

Denormalized Numbers

When the exponent field is all 0s, the number is interpreted in denormalized form: N_denormalized = (-1)^sign * 0.fraction * 2^(1 - bias). With a fraction of all 0s, this formula evaluates to zero.
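
Putting the normalized, denormalized, and special cases together, a complete decoder might look like the sketch below. It assumes a bias of 3 and a sign-exponent-fraction bit order, neither of which is fixed by this document. The name decode_bf8 is hypothetical.

```python
import math

BIAS = 3  # assumption: IEEE 754-style bias for a 3-bit exponent


def decode_bf8(byte: int) -> float:
    """Convert a bf8 bit pattern to a Python float."""
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exponent = (byte >> 4) & 0x7
    fraction = byte & 0xF
    if exponent == 0x7:
        # All-ones exponent: infinity if the fraction is zero, NaN otherwise.
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:
        # Denormalized: no implicit leading 1, fixed scale 2^(1 - bias).
        return sign * (fraction / 16.0) * 2.0 ** (1 - BIAS)
    # Normalized: implicit leading 1.
    return sign * (1.0 + fraction / 16.0) * 2.0 ** (exponent - BIAS)
```

Under these assumptions the smallest positive denormal (0x01) is 1/64 = 0.015625, the largest (0x0F) is 0.234375, and the next pattern (0x10) is the smallest normalized value, 0.25, so the denormals fill the gap between zero and the normalized range in even steps of 1/64.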

Rounding

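Rounding should adhere to the IEEE 754 standard, including its round-to-nearest, ties-to-even rule (banker's rounding): a value exactly halfway between two representable bf8 numbers rounds to the one whose last fraction bit is 0.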
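
One way to sketch this rounding rule is an encoder from a Python float to the nearest bf8 pattern. The code below is an illustration under stated assumptions, not a reference implementation: it assumes a bias of 3, a sign-exponent-fraction bit order, and 0x78 as the NaN pattern to emit (any non-zero fraction would satisfy the NaN rule). The name encode_bf8 is hypothetical.

```python
import math

BIAS = 3            # assumption: IEEE 754-style bias for a 3-bit exponent
FRACTION_BITS = 4
MIN_EXP = 1 - BIAS  # unbiased exponent used by denormals and the smallest normals


def encode_bf8(x: float) -> int:
    """Round a Python float to the nearest bf8 bit pattern, ties to even."""
    if math.isnan(x):
        return 0x78  # assumed NaN pattern: exponent all 1s, non-zero fraction
    sign_bit = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    m = abs(x)
    if math.isinf(m):
        return sign_bit | 0x70
    if m == 0.0:
        return sign_bit
    # frexp gives m = f * 2**k with 0.5 <= f < 1, so the unbiased exponent is k - 1.
    # Clamping to MIN_EXP makes small inputs use the fixed denormal scale.
    exp = max(math.frexp(m)[1] - 1, MIN_EXP)
    # Scale so that one unit equals the spacing of bf8 values at this exponent.
    scaled = math.ldexp(m, FRACTION_BITS - exp)
    n = round(scaled)  # Python's round() resolves exact ties to the even integer
    if n == 32:        # rounding carried into the next power of two
        n, exp = 16, exp + 1
    if n < 16:         # no implicit leading 1: denormal, or zero after rounding
        return sign_bit | n
    field = exp + BIAS
    if field >= 7:     # beyond the largest finite value: overflow to infinity
        return sign_bit | 0x70
    return sign_bit | (field << 4) | (n - 16)
```

For example, 1.03125 lies exactly halfway between 1.0 (0x30) and 1.0625 (0x31) and rounds to 0x30, whose last fraction bit is 0, while 1.09375 lies halfway between 0x31 and 0x32 and rounds up to 0x32.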

Operations

The basic operations (addition, subtraction, multiplication, and division) should conform to the IEEE 754 rules, including the handling of overflow, underflow, and exceptions.
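
A software emulation can illustrate the expected behavior: decode both operands, compute in a wider format, and round the result back to bf8 once. The sketch below does this for addition and multiplication only, using Python floats as the wider format. It assumes a bias of 3, a sign-exponent-fraction bit order, and 0x78 as the NaN result pattern. All function names are hypothetical, and this is not a hardware-level implementation.

```python
import math

BIAS = 3  # assumption: IEEE 754-style bias for a 3-bit exponent


def decode_bf8(b: int) -> float:
    """bf8 bit pattern -> Python float."""
    sign = -1.0 if b & 0x80 else 1.0
    e, f = (b >> 4) & 0x7, b & 0xF
    if e == 7:
        return sign * math.inf if f == 0 else math.nan
    if e == 0:
        return sign * math.ldexp(f, 1 - BIAS - 4)   # denormal: 0.fraction
    return sign * math.ldexp(16 + f, e - BIAS - 4)  # normal: 1.fraction


def encode_bf8(x: float) -> int:
    """Python float -> nearest bf8 bit pattern, ties to even."""
    if math.isnan(x):
        return 0x78  # assumed NaN pattern
    s = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    m = abs(x)
    if math.isinf(m):
        return s | 0x70
    if m == 0.0:
        return s
    exp = max(math.frexp(m)[1] - 1, 1 - BIAS)
    n = round(math.ldexp(m, 4 - exp))  # round() sends exact ties to even
    if n == 32:                        # carry into the next power of two
        n, exp = 16, exp + 1
    if n < 16:                         # denormal, or underflow to zero
        return s | n
    if exp + BIAS >= 7:                # overflow to infinity
        return s | 0x70
    return s | ((exp + BIAS) << 4) | (n - 16)


def bf8_add(a: int, b: int) -> int:
    """Add two bf8 patterns; the float sum is rounded to bf8 once."""
    return encode_bf8(decode_bf8(a) + decode_bf8(b))


def bf8_mul(a: int, b: int) -> int:
    """Multiply two bf8 patterns; the float product is rounded to bf8 once."""
    return encode_bf8(decode_bf8(a) * decode_bf8(b))
```

In this sketch, 15.5 * 2.0 (0x6F times 0x40) overflows to infinity (0x70), adding positive and negative infinity (0x70 and 0xF0) yields a NaN, and 1/64 * 0.5 (0x01 times 0x20) underflows to zero because the exact product is a tie between zero and the smallest denormal.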

Implementation

An implementation of this specification should be written in a low-level language so that it can interface with hardware instruction sets. It should also support compound operations such as fused multiply-add.

Application

The bf8 format is suited to memory-constrained applications and architectures, including embedded devices, IoT devices, and machine learning applications.

Disclaimer

The bf8 format has low precision because of its small number of bits: with four fraction bits, a normalized value carries only five significant binary digits. It should therefore not be used for applications that require high precision.