Summary: This article explains the limitations faced by the Arithmetic coder and
the Range coder on 32bit platforms, and how using 64bit variables can improve
the compression results significantly. With 64bit processors becoming
increasingly popular, it makes sense to move on and use the available
processing power.
Effects of Precision on Entropy Coding
Conceptually, both Arithmetic coding and Range coding are best described
using floating-point mathematics. But when it comes to actually implementing
them for practical use, it turns out that this is best accomplished using
standard integer math. No floating-point math is required, nor would it help
to use it. What is used instead is an incremental transmission scheme,
where fixed-size integer state variables receive new bits at the low end
and shift them out the high end, forming a single number that can be as many
bits long as are available on the computer's storage medium. (Read Mark
Nelson's article on arithmetic coding for a detailed introduction to the
incremental transmission scheme; the same background applies to range coding too.)
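As a rough illustration of the idea (an illustrative sketch, not the exact coder used in this article), the renormalization step of such an incremental scheme can look like this: whenever the low and high state variables agree in their top bit, that bit can never change again, so it is shifted out at the high end and a fresh bit enters at the low end.

```cpp
#include <cstdint>
#include <vector>

// Sketch of incremental transmission with 32bit state variables
// (illustrative only): settled top bits leave at the high end while
// fresh bits are shifted in at the low end.
const std::uint32_t HALF = 0x80000000u;

void renormalize(std::uint32_t& low, std::uint32_t& high, std::vector<int>& out)
{
    while ((low & HALF) == (high & HALF)) {
        out.push_back((low & HALF) ? 1 : 0); // settled bit leaves at the high end
        low  = low << 1;                     // a 0 bit enters low at the low end
        high = (high << 1) | 1;              // a 1 bit enters high at the low end
    }
}
```

After the loop, low and high differ in their top bit, so all 32 bits of precision are again available for the next coding step; a 64bit version simply widens the state variables and the HALF constant.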
The size of integer variables has two independent impacts:
- It directly affects the accuracy of calculations during entropy encoding.
- It affects the accuracy with which the model can keep track of probabilities.
The limitation faced on 32bit platforms can be overcome using 64bit
arithmetic. Most current compilers and systems allow the use of 64bit
arithmetic (implemented as unsigned long long or __int64), and 64bit
processors are also becoming increasingly popular.
Here I evaluate the effect of both these impacts. The results are compared
on the Calgary Corpus and the Large Canterbury Corpus, using an order-0
model with 4 entropy coder implementations:
- 32bit arithmetic coder
- 64bit arithmetic coder
- 32bit range coder
- 64bit range coder
Details of the implementations used are described where needed. The source
code is also available at the bottom of this page.
Effect of Precision on Accuracy of Calculations in Entropy Coding
The size of the integer variables determines the number of bits that
participate in the calculations at a given time. The size greatly affects the
precision/accuracy of the calculations and hence the end result. Reduced
accuracy with 32bit variables, as expected, leads to more bits than necessary
being generatedated by the encoder, thus hurting the compression ratio achieved.
Accuracy can be improved by using 64bit variables.
Besides providing a comparison of the precision of 32bit vs 64bit
implementations, this also provides a comparison of the Range coder against
the Arithmetic coder. The results on the Calgary Corpus:
All sizes are in bytes; the coder columns give the size after compression with the order-zero model.

File Name    Original Size   32bit Arith Coder   64bit Arith Coder   32bit Range Coder   64bit Range Coder
BIB                 111261               72791               72791               72815               72813
BOOK1               768771              436914              436911              437063              437000
BOOK2               610856              364778              364776              364896              364852
GEO                 102400               72405               72404               72426               72424
NEWS                377109              244492              244492              244571              244544
OBJ1                 21504               16039               16039               16045               16048
OBJ2                246814              187307              187304              187357              187340
PAPER1               53161               33130               33130               33144               33144
PAPER2               82199               47541               47541               47557               47561
PAPER3               46526               27392               27391               27401               27404
PAPER4               13286                7999                7999                8004                8007
PAPER5               11954                7561                7560                7565                7569
PAPER6               38105               23839               23839               23848               23850
PIC                 513216               75060               75056               75100               75079
PROGC                39611               25923               25923               25932               25932
PROGL                71646               42616               42616               42633               42628
PROGP                49379               30208               30208               30219               30221
TRANS                93695               64341               64341               64364               64361
The results on the Large Canterbury Corpus:

File Name       Original Size   32bit Arith Coder   64bit Arith Coder   32bit Range Coder   64bit Range Coder
E.coli                4638690             1176867             1176867             1177665             1177304
world192.txt          2473400             1545446             1545440             1545915             1545733
bible.txt             4047392             2202010             2202002             2202798             2202477
Note: In each of the 4 encoders, the frequency counts in the model were
halved whenever the sum of frequencies reached 0x3FFF. This is done
because, as will be discussed soon, the maximum range of frequencies which
the model can keep is limited by the precision of the integer variables used.
Here the 32bit Arithmetic Coder has the most restrictive "maximum range",
hence that range is used for all encoders.
As the gains achieved by using 64bit implementations are very small on the
Calgary Corpus files, and considering that the test files are small in size,
it is worth mentioning here that the 64bit encoders have to flush 8 bytes at
the end, whereas the 32bit encoders flush only 4 bytes. This also explains
the slight inconsistency in results where using a 64bit implementation
increased the compressed size.
Conclusion: 64bit implementations do improve the precision/accuracy of
calculations in the entropy coder, but the gains are not significant for
small data sets. On larger files, a definite improvement can be seen. Also,
the effect of using a 64bit implementation is more pronounced on the range
coder.
Effect of Entropy Coder Precision on Model
The size of the integer variables also determines the maximum precision
with which the model can represent the probabilities. For entropy coding,
once the character probabilities are known, the individual symbols need to
be assigned a range along a "probability line". The 'maximum range' that can
be used determines the accuracy with which we can represent the
probabilities. This 'maximum range' depends upon the size of the integer
variables used for entropy coding, in both the arithmetic and the range coder.
This is a severe limitation that forces us to use approximations:
frequency counts are rescaled (usually halved) whenever they near the
limit. Since the probability estimation is then no longer accurate, the
number of bits generated is more than optimal.
In adaptive models, if the nature of the source data changes, then this
halving of frequency counts can improve compression, because in such cases
it effectively translates into "reduce the weight of statistics gathered
till now and give more importance to new statistics". However, when the
data is statistically uniform, this halving results in reduced precision
and thus less than optimal compression.
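A minimal sketch of such a rescaling step (a hypothetical helper, not taken from the article's source) halves every count while rounding up, so that no symbol's count drops to zero:

```cpp
// Halve all frequency counts in an order-0 model when their sum nears the
// coder's limit; rounding up keeps every count >= 1, so no symbol ends up
// with zero probability. (Illustrative helper, not the article's code.)
void rescale(unsigned* freq, int n)
{
    for (int i = 0; i < n; ++i)
        freq[i] = (freq[i] + 1) >> 1;
}
```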
Arithmetic Coder: The major problem with the CACM arithmetic coder is
arithmetic overflow in the steps involving calculation of the new high and
low. If the machine supports w-bit arithmetic (until recently, most
systems supported at most 32bit arithmetic), then to avoid overflow the
number of bits used to store cumulative frequency counts (f) is restricted
by w: it is required that w >= 2f + 1. Thus (as w = 32) f cannot exceed 15,
and as a result, the sum of frequency counts in any given context cannot be
more than 32767.
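This constraint can be expressed directly in code; the helper below is a sketch (the function name is mine, not the article's):

```cpp
// For a w-bit arithmetic coder the constraint w >= 2f + 1 bounds the number
// of bits f available for cumulative frequency counts, and therefore the
// maximum sum of counts the model may accumulate before it must rescale.
unsigned long long max_total_freq_arith(unsigned w)
{
    unsigned f = (w - 1) / 2;   // largest f satisfying w >= 2f + 1
    return (1ULL << f) - 1;     // counts must sum to less than 2^f
}
```

For w = 32 this gives f = 15 and a limit of 32767; for w = 64 it gives f = 31 and a limit of 2147483647.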
If we use 64bit integers, then with w = 64 we can have f = 31. Thus the sum
of frequency counts can be as much as 2147483647, a value big enough for
most practical purposes. As we now rarely have to rescale the frequencies,
we get more accurate probability estimation. This reduces the loss and, as
we will soon see, gives significantly better results.
Range Coder: The problem with the range coder is roughly the same as with
the arithmetic coder, but things are slightly better here.
The range coder normalizes itself 1 byte at a time, i.e. 8 bits at a time.
On a w-bit machine, the range coder defines two constants,
TOP = 1 << (w-16) and
BOTTOM = 1 << (w-16); that is, TOP = 1 << (w-8) and BOTTOM = 1 << (w-16). To avoid
overflow, the sum of frequencies of symbols in any given context must not
exceed the value of BOTTOM. Thus
the number of bits used to store the cumulative frequency counts (f)
should not be greater than w-16. With w = 32, f cannot
exceed 16, and as a result, the sum of frequency counts cannot be more than
65535. This is still a severe enough restriction, for the same reason as
above (it forces us to use approximations).
This again can be fixed using 64bit arithmetic. With w = 64, we can
have f as large as 48. Thus the sum of frequency counts can now be as
much as 281474976710655, which is more than enough.
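In code, the Subbotin-style constants and the resulting limit might be sketched as follows (function names are illustrative, not from the article's source):

```cpp
// Subbotin-style range coder constants on a w-bit machine: the coder
// renormalizes one byte (8 bits) at a time, and the total frequency in a
// context must stay below BOTTOM to avoid overflow, i.e. f <= w - 16 bits.
unsigned long long range_top(unsigned w)    { return 1ULL << (w - 8);  }
unsigned long long range_bottom(unsigned w) { return 1ULL << (w - 16); }

unsigned long long max_total_freq_range(unsigned w)
{
    return range_bottom(w) - 1; // counts must sum to less than 2^(w-16)
}
```

With w = 32 this bounds the total at 65535; with w = 64 the bound becomes 281474976710655.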
Results
This time also, an order-zero model is used with each of the 4 encoders. In
each case, the frequency counts are halved whenever the total count reaches
the limit for that encoder:
- 32bit Arithmetic Coder: 0x3FFF
- 64bit Arithmetic Coder: 0x3FFFFFFF
- 32bit Range Coder: 0xFFFF
- 64bit Range Coder: 0xFFFFFFFFFFFF
The results on the Calgary Corpus (sizes in bytes, after compression with the order-zero model):

File Name    Original Size   32bit Arith Coder   64bit Arith Coder   32bit Range Coder   64bit Range Coder
BIB                 111261               72791               72602               72662               72622
BOOK1               768771              436914              435399              435904              435501
BOOK2               610856              364778              366284              365517              366366
GEO                 102400               72405               72442               72474               72462
NEWS                377109              244492              244940              244836              244996
OBJ1                 21504               16039               16122               16128               16131
OBJ2                246814              187307              193337              191771              193376
PAPER1               53161               33130               33353               33370               33369
PAPER2               82199               47541               47543               47545               47560
PAPER3               46526               27392               27373               27388               27385
PAPER4               13286                7999                7999                8004                8007
PAPER5               11954                7561                7560                7565                7569
PAPER6               38105               23839               24088               24100               24100
PIC                 513216               75060               77962               76907               77981
PROGC                39611               25923               25968               25980               25979
PROGL                71646               42616               42977               43000               42990
PROGP                49379               30208               30292               30308               30305
TRANS                93695               64341               65055               64947               65076
The results on the Large Canterbury Corpus:

File Name       Original Size   32bit Arith Coder   64bit Arith Coder   32bit Range Coder   64bit Range Coder
E.coli                4638690             1176867             1160066             1165708             1160511
world192.txt          2473400             1545446             1545742             1543913             1546045
bible.txt             4047392             2202010             2197536             2196442             2197987
It has already been discussed that in adaptive models, if the nature of the
source data changes, then the halving of frequency counts can improve
compression, because in such cases it effectively translates into "reducing
the weight of statistics gathered till then and giving more importance to
new statistics". However, when the data is statistically uniform, this
halving results in reduced precision and thus less than optimal compression.
This observation is confirmed by the results obtained. For statistically
uniform files (E.coli, for example), a marked improvement in compression
ratio is observed with the 64bit entropy coders.
Conclusion: By using 64bit encoders, our choice of when and how often to
rescale the frequency counts in the model can now depend on the data source,
instead of being dictated by the restrictions of the entropy coder used.
Summary
Here is a summary of what we have concluded from the results.
- 64bit implementations do improve the precision/accuracy of calculations
in the entropy coder, but the gains are not significant for small data
sets. On larger files, a definite improvement can be seen.
- The effect of using a 64bit implementation is more pronounced on the
range coder.
- For statistically uniform files, a marked improvement in compression
ratio is observed with 64bit entropy coders.
- By using 64bit encoders, our choice of when and how often to rescale the
frequency counts in the model can now depend on the data source, instead of
being dictated by the restrictions of the entropy coder used.
- With 64bit processors becoming increasingly popular, it makes sense to
move on and use the available processing power, as it will not add any
overhead in processing time.
Source Code
Here is the source code for the 64bit Arithmetic coder and the 64bit
Range coder used for this article. The archive includes C++ classes for
the entropy coders and code for the order-0 model which was used for testing
the coders and gathering results. The archive also includes reference 32bit
implementations of both the arithmetic coder and the range coder.
You can browse the source tree here.
Download Source Code (22 KB)
References
For arithmetic coding, I have used Mark Nelson's code, which was published
with his 1991 DDJ article. Other interesting papers on arithmetic coding
include the 1987 CACM paper and Alistair Moffat's 1998 paper, Arithmetic
Coding Revisited.
For the range coder, I have used Dmitry Subbotin's carryless implementation
found in Dmitry Shkarin's fast PPMd code. Range coding was introduced by
Martin in his 1979 paper.
