Data Compression Conference

Abstract

A pattern is a sequence of indices that contains all consecutive integer indices up to some integer k, in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence, which, in turn, can be compressed by itself. In this paper, two low-complexity sequential schemes are proposed for universally compressing patterns obtained from sequences generated by independently identically distributed (i.i.d.) sources with unknown (possibly large) alphabets of unknown size. The description lengths both schemes assign to a pattern are investigated and bounded by rigorous closed-form expressions in terms of the maximum likelihood (ML) probability of the underlying i.i.d. sequence. In particular, each distinct index in the pattern is shown to cost 0.5·log(n/k^3) + 1.59·log e bits above the i.i.d. ML cost. This results in a description length for the unknown parameters that is shorter than the minimum code length of an i.i.d. sequence if there are more than e^(19/18)·n^(1/3) indices in the pattern. The sequential performance results are then used to establish a connection between the pattern entropy and the underlying i.i.d. entropy. This final result shows that, for large alphabets (including those larger than n), recently derived universal coding redundancy bounds for coding patterns are negligible compared to the reduction in entropy from the underlying i.i.d. one.
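The pattern definition in the abstract can be illustrated with a minimal sketch: each symbol of a sequence is replaced by the order in which it first appeared, so the pattern of a sequence over an unknown alphabet is alphabet-independent. The function name `pattern` below is illustrative, not from the paper.

```python
def pattern(seq):
    """Return the pattern of seq: each symbol mapped to its 1-based
    index of first occurrence, so the output uses all integers 1..k
    (k = number of distinct symbols) in order of first appearance."""
    first_seen = {}
    out = []
    for s in seq:
        if s not in first_seen:
            first_seen[s] = len(first_seen) + 1  # next unused index
        out.append(first_seen[s])
    return out

# For example, pattern("abracadabra") -> [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Note that any sequence obtained by renaming the alphabet symbols (e.g. "xyzxwxvxyzx") yields the same pattern, which is why the pattern can be compressed separately from the cost of describing the unknown alphabet.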