Quick Python zlib vs bz2 benchmark

http://log.bthomson.com/2011/01/quick-python-gzip-vs-bz2-benchmark.html
I use the zlib module a lot on Google App Engine; the small CPU cost of decompression is often a good tradeoff for the disk space saved. I was curious how bz2 compares, so I ran this short benchmark.

The test file was a plaintext book (pg4238.txt in the code below), a highly compressible source. Columns in the compress tables are: level, time, bytes uncompressed, bytes compressed, ratio. The decompress tables show only level and time.

% ./bench.zsh

zlib compress
0 6.98ms 640599 640700 1.000
1 21.22ms 640599 274195 2.336
2 25.08ms 640599 261638 2.448
3 34.24ms 640599 249649 2.566
4 36.41ms 640599 241500 2.653
5 54.24ms 640599 232545 2.755
6 77.22ms 640599 228621 2.802
7 87.94ms 640599 228032 2.809
8 112.49ms 640599 227622 2.814
9 113.03ms 640599 227622 2.814

zlib decompress
0 1.54ms
1 6.39ms
2 6.13ms
3 6.02ms
4 6.22ms
5 5.96ms
6 5.94ms
7 5.90ms
8 5.89ms
9 5.94ms

bz2 compress
1 105.30ms 640599 196752 3.256
2 103.42ms 640599 186082 3.443
3 105.40ms 640599 180905 3.541
4 104.95ms 640599 177642 3.606
5 113.12ms 640599 176232 3.635
6 110.45ms 640599 173153 3.700
7 113.06ms 640599 169634 3.776
8 110.27ms 640599 169634 3.776
9 111.43ms 640599 169634 3.776

bz2 decompress
1 36.40ms
2 35.79ms
3 36.35ms
4 36.81ms
5 41.18ms
6 44.86ms
7 48.96ms
8 48.45ms
9 47.95ms

Conclusion: probably not worth it. bz2 at level=4 takes about 6 times longer to decompress than zlib at level=9 (36.81ms vs 5.94ms) for only a modest improvement in the compression ratio, from 2.8 to 3.6.
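The comparison can be checked directly from the table figures. Here is a small sketch of that arithmetic; the numbers are copied from the zlib level 9 and bz2 level 4 rows above, and the variable names are just for illustration.

```python
# Figures copied from the benchmark tables above.
ORIGINAL_BYTES = 640599
zlib9 = {"compressed": 227622, "decompress_ms": 5.94}
bz2_4 = {"compressed": 177642, "decompress_ms": 36.81}


def ratio(entry):
    """Compression ratio: uncompressed size divided by compressed size."""
    return ORIGINAL_BYTES / float(entry["compressed"])


# How much longer bz2 takes to decompress than zlib.
slowdown = bz2_4["decompress_ms"] / zlib9["decompress_ms"]

# Fraction of the zlib output size that bz2 saves.
saved = 1 - bz2_4["compressed"] / float(zlib9["compressed"])

print("zlib level 9 ratio: %.3f" % ratio(zlib9))
print("bz2 level 4 ratio:  %.3f" % ratio(bz2_4))
print("bz2 decompress slowdown: %.1fx" % slowdown)
print("bz2 output is %.0f%% smaller than zlib's" % (100 * saved))
```

So the extra decompression time buys output that is roughly a fifth smaller than what zlib produces.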

Interestingly, for write-heavy workloads bz2 may actually be the better choice, since its compression time (roughly 103ms to 113ms) is about the same as zlib at level=9 (113ms) while the output is smaller.

I think it's better not to use the timeit module for this kind of benchmark, since in typical usage you compress or decompress a given piece of data only once. If the operations speed up on repeat runs due to caching (and they do), the result doesn't reflect typical usage. Starting a new Python process for each test seems to reduce cache effects.
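One way to see whether repeat runs behave differently is to time the same call several times inside a single process and compare the first timing with the later ones. This is only a rough sketch: the input is a synthetic repeated sentence standing in for the book (an assumption, not the original file), and the helper name time_once_ms is made up for this example. Results will vary by machine.

```python
import time
import zlib


def time_once_ms(func, *args):
    """Time a single call with wall-clock time, the same way bench.py does."""
    start = time.time()
    func(*args)
    return 1000 * (time.time() - start)


# Synthetic stand-in for the book text (an assumption, not the original file).
data = b"It was the best of times, it was the worst of times.\n" * 12000

# Compress the same data five times within one process.
timings = [time_once_ms(zlib.compress, data, 9) for _ in range(5)]

print("first run:  %.2fms" % timings[0])
print("later runs: " + " ".join("%.2fms" % t for t in timings[1:]))
```

If the later runs come out noticeably faster than the first, an averaged or best-of-N figure would understate the cost of a one-off compression.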

Anyway, here is the code.

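from __future__ import print_function  # same script runs on Python 2 and 3

import bz2
import sys
import time
import zlib

# Usage: python bench.py LEVEL USE_ZLIB IS_DECOMPRESS
level = int(sys.argv[1])
mod = zlib if int(sys.argv[2]) else bz2
is_decompress = int(sys.argv[3])

# Read raw bytes; compress() needs a byte string on Python 3.
with open("pg4238.txt", "rb") as f:
  data = f.read()

# For a decompression run, build the compressed input outside the timed region.
if is_decompress:
  c_data = mod.compress(data, level)

t = time.time()
if is_decompress:
  data = mod.decompress(c_data)
else:
  c_data = mod.compress(data, level)
elapsed_ms = 1000 * (time.time() - t)

# Sizes and ratio are only reported for compression runs.
fields = [str(level), "%6.02fms" % elapsed_ms]
if not is_decompress:
  fields += [str(len(data)), str(len(c_data)),
             "%.03f" % (float(len(data)) / len(c_data))]
print(" ".join(fields))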

#!/usr/bin/zsh
echo 'zlib compress'
for level in {0..9}; do python bench.py $level 1 0; done
echo '\nzlib decompress'
for level in {0..9}; do python bench.py $level 1 1; done
echo '\nbz2 compress'
for level in {1..9}; do python bench.py $level 0 0; done
echo '\nbz2 decompress'
for level in {1..9}; do python bench.py $level 0 1; done
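For anyone without zsh, the same driver loop could be written in Python. The following is a sketch rather than the script used for the numbers above: the function names build_runs and run_all are made up here, and it assumes bench.py and pg4238.txt sit in the current directory. Like the zsh version, it starts a fresh interpreter for every measurement.

```python
import os
import subprocess
import sys


def build_runs():
    """Return (title, commands) pairs in the same order as the zsh script."""
    runs = []
    # (module name, value of bench.py's second argument, levels to test)
    for name, use_zlib, levels in (("zlib", 1, range(0, 10)),
                                   ("bz2", 0, range(1, 10))):
        for label, is_decompress in (("compress", 0), ("decompress", 1)):
            commands = [
                [sys.executable, "bench.py",
                 str(level), str(use_zlib), str(is_decompress)]
                for level in levels
            ]
            runs.append(("%s %s" % (name, label), commands))
    return runs


def run_all():
    """Run every benchmark in its own Python process."""
    for title, commands in build_runs():
        print("\n" + title)
        sys.stdout.flush()  # keep the title ahead of the child's output
        for command in commands:
            subprocess.check_call(command)


if __name__ == "__main__":
    if os.path.exists("bench.py"):
        run_all()
    else:
        print("bench.py not found in the current directory")
```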
