Abstract for ``Representing numeric data in 32 bits while preserving 64-bit precision''

Representing numeric data in 32 bits while preserving 64-bit precision

Radford M. Neal, Dept. of Statistical Sciences and Dept. of Computer Science, University of Toronto

Data files often consist of numbers having only a few significant decimal digits, whose information content would allow storage in only 32 bits. However, we may require that arithmetic operations involving these numbers be done with 64-bit floating-point precision, which precludes simply representing the data as 32-bit floating-point values. Decimal floating point gives a compact and exact representation, but requires conversion with a slow division operation before the data can be used in an arithmetic operation. Here, I show that interesting subsets of 64-bit floating-point values can be compactly and exactly represented by the 32 bits consisting of the sign, exponent, and high-order part of the mantissa, with the lower-order 32 bits of the mantissa filled in by a table lookup indexed by bits from the part of the mantissa that is retained, and possibly some bits from the exponent. For example, decimal data with four or fewer digits to the left of the decimal point and two or fewer digits to the right of the decimal point can be represented in this way, using a decoding table with 32 entries, indexed by the lower-order 5 bits of the retained part of the mantissa. Data consisting of six decimal digits with the decimal point in any of the seven positions before or after one of the digits can also be represented this way, and decoded using a table indexed by 19 bits from the mantissa and exponent. Encoding with such a scheme is a simple copy of half the 64-bit value, followed, if necessary, by verification that the value can be represented, by checking that it decodes correctly. Decoding requires only extraction of index bits and a table lookup. Lookup in a small table will usually reference fast cache memory, and even with larger tables, decoding is still faster than conversion from decimal floating point with a division operation.
I present several variations on these schemes, show how they perform on various recent computer systems, and discuss how such schemes might be used to automatically compress large arrays in interpretive languages such as R.
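
As a rough illustration of the 32-entry scheme described above, the following Python sketch encodes a 64-bit value by keeping its upper 32 bits and decodes it by looking up the lower 32 bits in a table indexed by the low 5 bits of the retained half. It is not taken from the paper or its C programs. The function names (`build_table`, `encode`, `decode`) are hypothetical, and building the table by brute-force enumeration of every value from 0.00 to 9999.99 is an assumption made here for simplicity, not necessarily how the report constructs it.

```python
import struct

INDEX_MASK = 0x1F  # low-order 5 bits of the retained upper 32 bits


def to_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b: int) -> float:
    """Return the double whose 64-bit pattern is b."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def build_table() -> list:
    """Build the 32-entry decoding table by enumeration (an assumption of
    this sketch): visit every value n/100 for n = 0..999999 and record the
    lower 32 bits seen for each 5-bit index, checking that values sharing
    an index always share the same lower 32 bits."""
    table = [None] * 32
    for n in range(1_000_000):
        bits = to_bits(n / 100)
        hi, lo = bits >> 32, bits & 0xFFFFFFFF
        idx = hi & INDEX_MASK
        if table[idx] is None:
            table[idx] = lo
        else:
            assert table[idx] == lo, "index does not determine the low bits"
    # Indices never produced by the data are unused; fill them with zero.
    return [0 if entry is None else entry for entry in table]


TABLE = build_table()


def decode(hi: int) -> float:
    """Rebuild the 64-bit value: retained upper half plus table lookup."""
    return from_bits((hi << 32) | TABLE[hi & INDEX_MASK])


def encode(x: float):
    """Copy the upper 32 bits of x, then verify that they decode back to
    exactly x. Returns None when x is not representable in this scheme."""
    hi = to_bits(x) >> 32
    return hi if to_bits(decode(hi)) == to_bits(x) else None
```

For instance, `decode(encode(12.34))` gives back `12.34` exactly, while `encode(0.001)` returns `None` because a value with three digits after the decimal point falls outside the subset this table covers.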

Technical report, 8 April 2015, 16 pages: pdf.

Also available as arXiv:1504.02914.

The tests in the paper were done using these C programs.