There’s this popular idea that communication theory has nothing to do with meaning; that measures of information are conceptually divorced from semantic content. Expressions of this idea are usually vague, and at least one version of it is demonstrably false: there is a notion of semantic content that is crucial to communication theory. It is the notion of using symbols to represent events. In their widely cited textbook, Cover & Thomas use this idea in several places; for example:
The notion of assigning events to symbol strings is central to understanding the theorems of communication theory. The first theorem of communication theory gives a lower limit on how many symbols you need, per outcome, to represent the outcomes of an event. Without some concept of encoding – or representing, indicating, denoting, or whatever – this theorem has no clear sense or utility.
Encoding
Encoding schemes define the content of symbol strings by assigning different events to different strings. For example, an encoding scheme for a coin toss might look like this:
$$ \begin{align} \text{Heads} &\rightarrow 0\\ \text{Tails} &\rightarrow 1 \end{align} $$
Here the symbol 0 is being used to encode – to represent – the outcome Heads, while the symbol 1 is being used to encode the outcome Tails. A sequence of coin toss results can be represented by a sequence of symbols:
$$ \begin{align} &\text{Heads},\ \text{Heads},\ \text{Tails},\ \text{Tails},\ \text{Heads},\ \text{Tails},\ \text{…}\\ &\rightarrow 001101\text{…} \end{align} $$
Or consider a dice roll:
$$ \begin{align} 1 &\rightarrow 00\\ 2 &\rightarrow 01\\ 3 &\rightarrow 10\\ 4 &\rightarrow 11\\ 5 &\rightarrow 100\\ 6 &\rightarrow 101 \end{align} $$
However, there is something wrong with this code. Suppose you are using it to transmit the sequence of dice rolls 3, 5. The signal you transmit is the concatenation of the individual “codewords”: \(3, 5 \rightarrow 10, 100 = 10100\). But your receiver might decode that signal as \(10100 = 101, 00 \rightarrow 6, 1\). Nightmare!
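You can check the ambiguity mechanically. Here is a quick sketch (in Python, purely for illustration; the little `parses` helper is mine, not from any of the sources) that finds every way of splitting a signal into codewords from the table above:

```python
# The flawed dice code from above.
code = {"1": "00", "2": "01", "3": "10", "4": "11", "5": "100", "6": "101"}

def parses(signal):
    """Return every way of splitting `signal` into codewords."""
    if signal == "":
        return [[]]
    results = []
    for outcome, word in code.items():
        if signal.startswith(word):
            for rest in parses(signal[len(word):]):
                results.append([outcome] + rest)
    return results

print(parses("10100"))  # [['3', '5'], ['6', '1']] -- two valid readings
```

Two distinct parses of the same signal: exactly the confusion the receiver ran into.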
If something’s worth encoding, it’s worth decoding. We need to be able to uniquely decode signals, so we had better design an encoding scheme that doesn’t confuse the decoder. And because transmitting signals in the physical world costs resources, we want unique codes that are as short as possible. These two constraints provide the basic motivation for communication theory.
Mathematicians and engineers begin by stating the problem formally: encoding schemes are to be evaluated in terms of the average number of symbols they produce per outcome. For the non-uniquely-decodable dice code above, the average length of a codeword (assuming all outcomes are equally likely) is:
$$ \begin{align} \overline{L} &= \frac{2}{6} + \frac{2}{6} + \frac{2}{6} + \frac{2}{6} + \frac{3}{6} + \frac{3}{6}\\ &= \frac{14}{6} \end{align} $$
which is slightly more than 2 symbols per outcome. The shortest uniquely decodable code is likely to be longer than this; but how much longer?
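(The arithmetic is trivial, but for completeness here is the same average computed in code; the fair-die probabilities are just the "all outcomes equally likely" assumption from above.)

```python
code = {1: "00", 2: "01", 3: "10", 4: "11", 5: "100", 6: "101"}
probs = {outcome: 1 / 6 for outcome in code}  # fair die: all outcomes equally likely

# Expected number of symbols per outcome.
avg_length = sum(probs[o] * len(code[o]) for o in code)
print(avg_length)  # 2.333... = 14/6
```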
The first theorem of communication theory answers the question: how short, on average, can the strings be while still uniquely representing every outcome of a given event?
Prelude: the smallest \(\delta\)-sufficient subset
A few technicalities must be introduced before the theorem can be stated formally.
Suppose you are happy to take the risk that very improbable events won’t happen. You can choose to leave improbable events out of your coding scheme. We will say that the least probable events get left out, so long as their total probability is no more than a certain very small number. Call this very small number \(\delta\). You can think of \(\delta\) as a measure of the risk you are willing to take: the smaller the \(\delta\), the less risky you’re being, and the more improbable events you are choosing to include.
We can formalise this idea by defining the set of events to be encoded given our chosen riskiness \(\delta\).
$$ \begin{align*} S_{\delta} := \text{The smallest subset of }\mathcal{A}_X\text{ satisfying } P(x\in S_{\delta})\ge 1-\delta \end{align*} $$
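If it helps, here is how one might construct \(S_{\delta}\) in code: keep adding outcomes, most probable first, until the kept outcomes account for at least \(1-\delta\) of the probability. (The sketch and the biased four-outcome distribution are my own illustration, not taken from MacKay.)

```python
def smallest_delta_sufficient_subset(probs, delta):
    """Keep the most probable outcomes until their total probability
    reaches 1 - delta; what is left out has probability at most delta."""
    kept, covered = [], 0.0
    for outcome, p in sorted(probs.items(), key=lambda item: -item[1]):
        if covered >= 1 - delta:
            break
        kept.append(outcome)
        covered += p
    return kept

# A made-up, biased four-outcome event space.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(smallest_delta_sufficient_subset(probs, delta=0.01))  # ['a', 'b', 'c', 'd']
print(smallest_delta_sufficient_subset(probs, delta=0.10))  # ['a', 'b', 'c'] -- "d" is dropped
```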
So we can define a subset using a “risk parameter” \(\delta\). So what?
Essential bit content
We know that, given an alphabet \(\mathcal{A}_X\), we can encode its outcomes with strings of length \(\log{\left|\mathcal{A}_X\right|}\) symbols (rounded up to a whole number). Having thrown out the improbable events by giving ourselves a risk threshold \(\delta\), we therefore know we can achieve the following length for the now-reduced set:
$$ H_{\delta}(X) := \log{\left|S_{\delta}\right|} $$
Call this the essential bit content of \(X\). (Really this should be “\(\delta\)-essential”, but I’m taking this presentation from MacKay 2003:\(\S\)4 and want to keep my terminology consistent with his.)
What is this measure good for? It tells us a cost we know we can achieve, which makes it an upper bound on the number of symbols we actually need.
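Continuing the sketch from the previous section (same made-up distribution; the function name is mine), the essential bit content is just the log of the size of that subset, and it drops in steps as we accept more risk:

```python
from math import log2

def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|, with S_delta built greedily
    from the most probable outcomes, as sketched above."""
    covered, size = 0.0, 0
    for p in sorted(probs.values(), reverse=True):
        if covered >= 1 - delta:
            break
        covered += p
        size += 1
    return log2(size)

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
for delta in (0.01, 0.06, 0.21, 0.51):
    print(delta, essential_bit_content(probs, delta))
# Prints roughly 2.0, 1.58, 1.0, 0.0 bits: the cost falls in steps as delta grows.
```

It really gains power, though, when we consider sequences of outcomes.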
Sequences of outcomes
If \(X\) denotes an event space, \(X^N\) denotes the space of sequences of \(N\) outcomes. For example, if \(X\) is a dice roll, then \(X^N\) is \(N\) dice rolls in a row.
We can encode sequences of outcomes instead of individual outcomes. We can then drop improbable sequences using our risk threshold \(\delta\).
Whereas the essential bit content of an event space can be understood as the cost of representing an outcome given that we are ignoring improbable outcomes…
$$ \underbrace{H_{\delta}}_{ \substack{ \text{Cost, given}\\ \text{risk }\delta\text{, of}\\ \text{representing} } } \quad \underbrace{(X)}_{ \substack{ \text{an outcome}\\ \text{of event }X } } $$
…the average essential bit content of a sequence can be understood as the per-event cost of representing that sequence:
$$ \underbrace{ \class{mj_blue}{\frac{1}{N\vphantom{N_{\delta}}}} }_{ \substack{ \text{The}\\ \text{per-event} } } \quad \underbrace{H_{\delta} \vphantom{\frac{1}{N_{\delta}}} }_{ \substack{ \text{cost, given}\\ \text{risk }\delta\text{, of}\\ \text{representing} } } \quad \underbrace{(X^{\class{mj_blue}{N}}) \vphantom{\frac{1}{N_{\delta}}} }_{ \substack{ \text{a long sequence}\\ \text{of outcomes}\\ \text{of event }X } } $$
If you are very risk-averse, your \(\delta\) will be very low. Then you might not be able to ignore any outcomes of \(X\). But you might still be able to ignore sequences of outcomes, because sequences are in general less probable than the outcomes that compose them. :Caveat!
The theorem can be understood in terms of a game. Suppose your risk \(\delta\) is fixed, and you are tasked with encoding sequences of outcomes from \(X\). You are allowed to use sequences of any length. What is the smallest per-event cost (defined by the formula above) you can achieve?
The first theorem of communication theory stated
Let \(X\) be an event space with entropy \(H(X)\). Given \(\epsilon > 0\) and \(0 < \delta < 1\), there exists a positive integer \(N_0\) such that for \(N > N_0:\) $$ \left|\frac{1}{N}H_{\delta}(X^N)-H(X)\right| <\epsilon $$
In short, entropy is the limit of the per-event essential bit content.
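As a quick sanity check on the dice example from earlier (my arithmetic, not the sources’): the entropy of a fair die is \(\log_2 6 \approx 2.585\) bits, so no uniquely decodable binary code can average fewer symbols than that per roll; the broken code’s \(14/6 \approx 2.33\) undercut it only by giving up decodability.

```python
from math import log2

# Entropy of a fair six-sided die: H(X) = sum over outcomes of p * log2(1/p).
probs = [1 / 6] * 6
entropy = sum(p * log2(1 / p) for p in probs)
print(entropy)  # about 2.585 bits per roll
print(14 / 6)   # the broken code's average: cheaper, but not uniquely decodable
```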
The theorem in plain language
The per-event cost, given risk \(\delta\), of representing a long sequence of events is equal to the entropy (in the limit).
$$ | \underbrace{\frac{1}{N\vphantom{N_{\delta}}}}_{ \substack{ \text{The}\\ \text{per-event} } } \quad \underbrace{H_{\delta} \vphantom{\frac{1}{N_{\delta}}} }_{ \substack{ \text{cost, given}\\ \text{risk }\delta\text{, of}\\ \text{representing} } } \quad \underbrace{(X^N) \vphantom{\frac{1}{N_{\delta}}} }_{ \substack{ \text{a long sequence}\\ \text{of events} } } \quad \quad \underbrace{-H(X) \vphantom{\frac{1}{N_{\delta}}} }_{ \substack{ \text{is equal to}\\ \text{the entropy} } } | \quad \underbrace{<\epsilon \vphantom{\frac{1}{N_{\delta}}} }_{ \substack{ \text{(in the limit).} } } $$
Or, rewritten a little more intuitively,
The per-event cost of representing a long sequence of events with risk \(\delta\) is equal, in the limit, to the entropy.
What does the theorem tell us?
The first thing to notice is that the theorem contains \(\epsilon,\ \delta \text{ and } N_0.\) These are standard tools in theorems about limits.
There is an intuitive trick that helps us to understand theorems like this, which I learned from :Philip Welch. The theorem asks us to suppose that \(\epsilon\) and \(\delta\) are supplied, and then asserts that an \(N_0\) can be found that satisfies a certain relationship. The trick is to imagine you are playing a game against a mathematical opponent. They give you \(\epsilon\) and \(\delta\), and your task is to find \(N_0\) such that the relationship holds. The theorem guarantees that you can win this game.
Your opponent can make the game harder by choosing lower values of \(\epsilon\) and \(\delta.\) Intuitively that’s because of the role these two terms play in the statement of the theorem.
Greek letter epsilon, \(\epsilon\), is very often used to state the requirement that two terms be arbitrarily close to one another. In showing that some target value is the limit of a sequence, for example, we say that including enough terms in the sequence gets us within \(\epsilon\) of the target. Obviously, smaller values of \(\epsilon\) make the target smaller and harder to hit: imagine trying to fire an arrow at a shrinking bullseye.
The role of \(\delta\), Greek letter delta, is more relevant to the communications engineering context at hand. We are talking about compression, which is the act of representing events by symbol sequences in the hopes of retrieving those events from those sequences further down the road. When you are deciding on the code by which symbol strings will represent events, you could choose to ignore certain unlikely events in the hope of saving on symbols. For example, if there are 9 events, you could assign the first 8 events to the 8 three-symbol binary sequences (000, 001, 010 and so on) and just ignore the ninth event altogether. You will be in trouble if the ninth event occurs – you won’t be able to represent it, and a decoding error will occur – but you will have saved up to one symbol per event in taking that risk. Delta quantifies the risk you are taking by employing a leaky coding scheme.
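To make the game concrete, here is a sketch of playing it numerically (my own code; the coin bias \(P(1)=0.1\) echoes MacKay’s bent-coin example, and \(\delta = 0.01\) is an arbitrary choice). It computes the per-event essential bit content of \(N\) biased coin tosses and watches the gap to the entropy shrink as \(N\) grows; the \(N\) at which the gap finally dips below your opponent’s \(\epsilon\) plays the role of \(N_0\).

```python
from math import comb, log2, ceil

def per_event_cost(p1, N, delta):
    """(1/N) * H_delta(X^N) for N tosses of a coin with P(1) = p1 < 0.5.
    All sequences with k ones share the probability p1^k * (1-p1)^(N-k),
    and there are comb(N, k) of them, so the smallest delta-sufficient
    subset can be built one k-group at a time, most probable (k = 0) first."""
    target = 1 - delta
    covered, count = 0.0, 0
    for k in range(N + 1):
        p_seq = p1 ** k * (1 - p1) ** (N - k)  # probability of one such sequence
        group = comb(N, k)                     # how many such sequences there are
        if covered + group * p_seq < target:
            covered += group * p_seq
            count += group
        else:
            # Only part of this k-group is needed to reach 1 - delta.
            count += ceil((target - covered) / p_seq)
            break
    return log2(count) / N

p1, delta = 0.1, 0.01
entropy = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)
print("H(X) =", entropy)                 # about 0.469 bits per toss
for N in (10, 100, 400, 1000):
    print(N, per_event_cost(p1, N, delta))
# The per-event cost starts well above H(X) and creeps down towards it as N
# grows; the theorem promises that, for any epsilon, taking N large enough
# pushes the gap below epsilon.
```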
Appendix 1: Other names for the theorem
I have found it difficult to organise my thinking about the mathematics of communication theory in part because the same concepts and theorems are often named differently in different sources. I am therefore maintaining the following list in a vain attempt to help me pin down exactly when people are talking about the theorem discussed in this post. Alternative names include:
- Source coding theorem
  - MacKay 2003:\(\S\)4, whose presentation is the basis of this post
  - Cover & Thomas 2006:745 (index)
- Shannon’s source coding theorem
- Shannon’s first theorem
- Noiseless coding theorem
Appendix 2: Definitions
$$ \begin{align} X:=& \text{ Event space: a triple }(x,\mathcal{A}_X,\mathcal{P}_X)\\ x:=& \text{ Outcome: the value of a random variable}\\ \mathcal{A}_X:=& \text{ Alphabet: the set of possible values of } x: \{a_1,a_2,\dots,a_I\}\\ \mathcal{P}_X:=& \text{ Probabilities: the set } \{p_1,p_2,\dots,p_I\} \text{ such that } P(x=a_i)=p_i. \end{align} $$
The probabilities satisfy \(p_i\ge 0 \text{ and }\sum_{a_i\in \mathcal{A}_X}P(x=a_i)=1.\)
Here we’re defining an event space, which will be familiar if you have spent any time on probabilities. Examples of event spaces include the possible outcomes of coin tosses and dice rolls, weather conditions on a given day, and the results of horse races. Any time you have a collection of mutually exclusive and jointly exhaustive outcomes, and you can assign probabilities to those outcomes, you’re dealing with an event space.
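For the code-minded, here is one entirely informal way to write the same definition down (the variable names are mine), using the fair die as the example:

```python
# A minimal, informal rendering of an event space:
# an alphabet of outcomes and a probability for each.
alphabet = [1, 2, 3, 4, 5, 6]
probabilities = {a: 1 / 6 for a in alphabet}

# The probabilities must be non-negative and sum to one.
assert all(p >= 0 for p in probabilities.values())
assert abs(sum(probabilities.values()) - 1) < 1e-9
```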
:x Caveat 1
In the dice roll case, because the individual outcomes are equiprobable, all sequences of outcomes of a given length are equiprobable too. You can still drop a fraction \(\delta\) of those sequences, but it buys you almost nothing: the subset you keep shrinks only from \(6^N\) to about \((1-\delta)6^N\) sequences, which changes the per-event cost by a vanishing amount. In the general case, some individual outcomes are less probable than others, and sequences that contain many of the improbable outcomes are the ones that fall within the “ignorable” set; dropping them is what produces real savings.