# Do lossless compression algorithms reduce entropy?

According to Wikipedia:

Shannon’s entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc.

So entropy is a measure of the amount of information contained in a message. Entropy coders are used to losslessly compress such a message down to the minimum number of bits needed to represent it (its entropy). To me this looks like a perfect entropy coder would be all that is needed to losslessly compress a message as much as possible.
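To make the question concrete, here is a minimal sketch of what an entropy coder targets: the zeroth-order entropy of a message, computed from the empirical frequencies of its symbols (the function name is my own, and this treats symbols as independent draws):

```python
import math
from collections import Counter

def shannon_entropy_per_symbol(message):
    """Entropy in bits per symbol, treating each symbol as an
    independent draw from the message's empirical distribution."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

msg = "abracadabra"
h = shannon_entropy_per_symbol(msg)
# An ideal entropy coder would need roughly h * len(msg) bits
# to encode this message under the i.i.d. assumption.
```

Note the built-in assumption: each symbol is coded independently. Whether that assumption holds is exactly what the rest of the question turns on.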

Many compression algorithms however use steps before entropy coding to supposedly reduce the entropy of the message.

According to the German Wikipedia:

Entropiekodierer werden häufig mit anderen Kodierern kombiniert. Dabei dienen vorgeschaltete Verfahren dazu, die Entropie der Daten zu verringern.

In English:

Entropy coders are frequently combined with other encoders. The preceding stages serve to reduce the entropy of the data.

For example, bzip2 uses the Burrows-Wheeler transform followed by a move-to-front transform before applying entropy coding (Huffman coding in this case).
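As an illustration of that pipeline, here is a toy sketch of the two preprocessing stages (not bzip2's actual implementation, which works on blocks with suffix sorting and run-length coding; the sentinel character and function names are my own simplifications):

```python
def bwt(s):
    # Append a sentinel so the transform is invertible, then take the
    # last column of the lexicographically sorted rotations.
    s = s + "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def move_to_front(s, alphabet):
    # Encode each symbol as its current index in the table, then move
    # that symbol to the front; runs of equal symbols become runs of 0s.
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

text = "banana_banana_banana"
transformed = bwt(text)
codes = move_to_front(transformed, sorted(set(transformed)))
# BWT clusters equal characters together, so the MTF output is
# dominated by small integers, which a Huffman coder handles well.
```

Neither stage discards anything: the BWT is a permutation (invertible via the sentinel) and MTF is a bijective recoding. They only reshape the statistics that the entropy coder sees.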

Do these steps really reduce the entropy of the message, which would imply reducing the amount of information contained in the message? This seems contradictory to me, since that would mean that information was lost during compression, preventing lossless decompression. Or do they merely transform the message to improve the efficiency of the entropy coding algorithm? Or does entropy not correspond directly to the amount of information in the message?

A lot of casual descriptions of entropy are confusing in this way because entropy is not quite as neat and tidy a measure as sometimes presented. In particular, the standard definition of Shannon entropy stipulates that it only applies when, as Wikipedia puts it, “information due to independent events is additive.”

In other words, for the standard formula to apply, the events must be statistically independent. If they aren't, then you have to find a representation of the data that defines events in ways that make them truly independent. Otherwise, you will overestimate the entropy.

To put it yet another way, Shannon entropy only applies to true probability distributions, and not to random processes in general. For concrete examples of processes that don’t fit the assumptions of Shannon entropy, consider…

### Markov processes

A Markov process generates a series of events in which the most recent event is sampled from a distribution that depends on one or more previous events. Obviously a huge number of real-world phenomena are better modeled as Markov processes than as discrete, independent probability distributions. For example: the text you’re reading right now!

The naively calculated Shannon entropy rate of a Markov process will always be greater than or equal to the true entropy rate of the process. To get the true entropy of the process, you need to take into account the statistical dependence between events. In simple cases, the formula for that looks like this:

$$H(\mathcal{S}) = - \sum_i p_i \sum_j p_i(j) \log p_i(j)$$

This can also be represented like so:

$$H(Y) = - \sum_{ij} \mu_i P_{ij} \log P_{ij}$$

Again quoting Wikipedia, here "$\mu_i$ is the asymptotic distribution of the chain" — that is, the overall probability that a given event will occur over a long horizon.
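The formula is easy to evaluate numerically. Here is a small worked example with a two-state chain (the transition matrix is my own toy choice), comparing the naive rate computed from $\mu$ alone against the true Markov rate:

```python
import math

def h(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Transition matrix P[i][j]: probability of event j given previous event i.
P = [[0.9, 0.1],
     [0.2, 0.8]]

# Stationary distribution mu of this chain (satisfies mu = mu @ P).
mu = [2 / 3, 1 / 3]

naive_rate = h(mu)  # pretends successive events are independent
markov_rate = sum(m * h(row) for m, row in zip(mu, P))  # the formula above

assert markov_rate <= naive_rate  # dependence can only lower the rate
```

For this chain the Markov rate comes out well below the naive rate: knowing the previous state makes the next one much more predictable, and the formula credits that predictability.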

This is all a complicated way of saying that even when you can calculate the overall probability of a given event, certain sequences of events are more likely than others to be generated by a Markov process. So for example, the following three strings of English words are increasingly less likely:

- They ran to the tree
- The tree ran to they
- Tree the they to ran

But a Shannon entropy calculation based on word frequencies alone will assess all three strings as equally likely. The Markov-process entropy takes the difference into account, and as a result assigns a lower entropy rate to the process.
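You can check the first half of that claim directly: since the three strings are permutations of the same five words, a frequency-only model cannot tell them apart (a quick sketch; the function name is my own):

```python
import math
from collections import Counter

def word_entropy(sentence):
    """Per-word entropy from the sentence's own word frequencies."""
    words = sentence.lower().split()
    n = len(words)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(words).values())

sentences = [
    "They ran to the tree",
    "The tree ran to they",
    "Tree the they to ran",
]
entropies = [word_entropy(s) for s in sentences]
# Each sentence uses the same five distinct words exactly once, so the
# frequency-only model gives all three the same entropy: log2(5) bits/word.
```

Only a model with memory, like the Markov formulas above, can express that the first ordering is far more probable English than the third.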

### Entropy rates are model-dependent

If you zoom way out, here’s the big picture: the entropy rate of a given sequence of events from an unknown source is model-dependent. You’ll assign a different entropy rate to a particular series of events depending on how you model the process that generated them.

And very frequently, your model of the process isn't going to be quite correct. This is not a simple problem, or an easy one to solve. In fact, in general it is impossible to determine the true entropy rate of a sufficiently long and complex sequence of events without knowing the true underlying process. This is a central result in algorithmic information theory.

What it means in practice is that given an unknown source of sequences of events, different models will yield different entropies, and it’s impossible to know which is correct in the long run — although the one that assigns the lowest entropy is probably the best.
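A small sketch makes the model-dependence tangible: the same sequence gets two different entropy estimates under a memoryless model and a first-order Markov model fit to the sequence itself (function names and the toy string are my own):

```python
import math
from collections import Counter, defaultdict

def order0_rate(s):
    """Bits per symbol under a memoryless model of s."""
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def order1_rate(s):
    """Bits per symbol under a first-order Markov model fit to s."""
    ctx = defaultdict(Counter)
    for a, b in zip(s, s[1:]):
        ctx[a][b] += 1
    total = len(s) - 1
    rate = 0.0
    for nexts in ctx.values():
        n_a = sum(nexts.values())
        h_a = -sum((c / n_a) * math.log2(c / n_a) for c in nexts.values())
        rate += (n_a / total) * h_a
    return rate

s = "abababababababab"
# The memoryless model sees two equally frequent symbols: 1 bit/symbol.
# The order-1 model sees that 'a' and 'b' strictly alternate and
# assigns an entropy rate of zero; same data, two different answers.
```

In the spirit of the last paragraph: the order-1 model assigns the lower entropy here, and for this sequence it is indeed the better model.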