Using a preset deflate dictionary to reduce compressed archive file size




I have a requirement where text files are sent from one location to another. Both locations are under our control. The nature of the content and the words that can appear in it are mostly the same. This means that if I keep the deflate dictionary at both locations once, there is no need to send it with each file.

I have been reading about this for the last week and experimenting with some available code samples.

However, I am still in the dark.

A few questions I still have:

  1. Can we generate and use a custom deflate dictionary from a preset list of words?
  2. Can we send a file without the deflate dictionary and use a local one?
  3. If not gzip, is there any other compression library that can be used for this purpose?

Some references I stumbled upon so far:

  1. https://medium.com/iecse-hashtag/huffman-coding-compression-basics-in-python-6653cdb4c476
  2. https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/
  3. https://www.euccas.me/zlib/#zlib_optimize_cloudflare_dict

Answer

Below are the specific answers I found, along with example code.

1. Can we generate and use a custom deflate dictionary from a preset list of words?

Yes, this can be done. A quick example in Python is below:

import zlib

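# Data for compression (here the same bytes also serve as the dictionary)
hello = b'hello'

# Compress with dictionary; a negative wbits produces a raw deflate stream
# with no zlib header or trailer
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()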
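
To illustrate building the dictionary from a preset list of words rather than from the message itself, here is a minimal sketch. The word list, the sample message, and the helper names `build_dictionary` and `deflate` are assumptions made up for illustration. The zlib documentation suggests placing the most commonly used strings toward the end of the dictionary, which is why the list is ordered from least to most common.

```python
import zlib

# Hypothetical preset words, ordered from least to most common, since zlib
# recommends putting the most frequently used strings near the end.
PRESET_WORDS = [b"warning", b"timeout", b"connection", b"request", b"status", b"error"]


def build_dictionary(words):
    """Join the preset words into one bytes object usable as zdict."""
    return b" ".join(words)


def deflate(data, zdict=None):
    """Raw-deflate data, optionally priming the compressor with zdict."""
    if zdict is None:
        co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    else:
        co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=zdict)
    return co.compress(data) + co.flush()


zdict = build_dictionary(PRESET_WORDS)
message = b"status error: connection timeout on request"

with_dict = deflate(message, zdict)
without_dict = deflate(message)

# Sizes depend on the data; the dictionary only helps when the message
# shares substrings with it.
print(len(message), len(without_dict), len(with_dict))
```

The gain is largest for short messages, where the compressor would otherwise have no earlier data to refer back to.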

2. Can we send a file without the deflate dictionary and use a local one?

Yes, you can send just the data without the dictionary. The compressed data is in compress_data in the example code above. However, to decompress it you need the same zdict value that was passed during compression. Here is how it is decompressed:

# Same dictionary bytes that were used for compression
hello = b'hello'
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
data = do.decompress(compress_data)
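
The two-location setup from the question could be sketched roughly as follows. The names `SHARED_DICT`, `send`, and `receive`, as well as the dictionary content and the message, are hypothetical and only stand in for whatever is stored at both ends. The sketch also tries to decompress without the dictionary, to show that the payload alone is not enough.

```python
import zlib

# Assumed shared dictionary, stored once at both locations (hypothetical content).
SHARED_DICT = b"temperature humidity pressure sensor reading"


def send(text_bytes):
    """Sender side: return raw deflate bytes; the dictionary is not included."""
    co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=SHARED_DICT)
    return co.compress(text_bytes) + co.flush()


def receive(payload, zdict=None):
    """Receiver side: inflate with the given dictionary, or return None on failure."""
    try:
        if zdict is None:
            do = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
        else:
            do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=zdict)
        return do.decompress(payload) + do.flush()
    except zlib.error:
        return None


message = b"sensor reading: temperature 21, humidity 40"
payload = send(message)

restored = receive(payload, SHARED_DICT)  # uses the local copy of the dictionary
broken = receive(payload)                 # no dictionary available

print(restored == message)
print(broken == message)
```

Because both sides must hold byte-identical dictionaries, it may be worth versioning the dictionary file so that a change at one location cannot silently break decompression at the other.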

A full example code with and without dict data:

import zlib

#Data for compression
hello = b'hello'

#Compression with dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()

#Compression without dictionary
co_nodict = zlib.compressobj(wbits=-zlib.MAX_WBITS)
compress_data_nodict = co_nodict.compress(hello) + co_nodict.flush()

#De-compression with dictionary
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
data = do.decompress(compress_data)

#print compressed output when dict used
print(compress_data)

#print compressed output when dict not used
print(compress_data_nodict)

#print decompressed output when dict used
print(data)

The code above does not accept Unicode strings directly, because zlib works on bytes. Encode the text to bytes first, and decode with the same codec after decompression. For example:

import zlib

#Data for compression
unicode_data = 'റെക്കോർഡ്'
hello = unicode_data.encode('utf-16be')

#Compression with dictionary
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=hello)
compress_data = co.compress(hello) + co.flush()
...
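
For completeness, a full round trip for Unicode text might look like the sketch below. It keeps the `utf-16be` codec from the snippet above, but that choice is only an assumption; any codec should work as long as both sides use the same one for the data and for the dictionary. The variable names are illustrative.

```python
import zlib

ENCODING = "utf-16be"  # assumption: both locations agree on this codec

text = "റെക്കോർഡ്"
# Illustrative dictionary: it simply reuses the sample word, encoded to bytes.
zdict = text.encode(ENCODING)

# Encode to bytes before compressing, since zlib does not accept str.
raw = text.encode(ENCODING)
co = zlib.compressobj(wbits=-zlib.MAX_WBITS, zdict=zdict)
payload = co.compress(raw) + co.flush()

# Decompress with the same dictionary, then decode back to str.
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=zdict)
restored = (do.decompress(payload) + do.flush()).decode(ENCODING)

print(restored == text)
```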

References for a JavaScript-based approach:

  1. How to find a good/optimal dictionary for zlib ‘setDictionary’ when processing a given set of data?
  2. Compression of data with dictionary using zlib in node.js


Source: stackoverflow