by Jan Schnupp
In information theory, "information" is defined almost negatively: if I tell you something you already know, then the amount of information you gain is zero. More generally, information is "the flip side of uncertainty". The more uncertain you are about a particular state of affairs to begin with, the more information you require to get a good appreciation of what is going on.
Example: I send you the message "The first letter of the alphabet is 'A'". Assuming that you remember the alphabet from primary school, you will have received zero new information here.
Now consider another message, one that might interest neuroscientists, for example: "the neuron fired five spikes". How much information does that message give us? You may find it mildly unsatisfactory to learn that this simple question does not have a simple answer. We can only work out how informative this message is if we make assumptions about how this type of neuron typically behaves (does it fire five spikes often or only rarely?), and our answer will then depend on the assumptions we made. More specifically, the information content of a message is related to the inverse of the probability of receiving that message, because less probable messages are more "surprising" and therefore more informative.
Example: if this type of neuron always fired five spikes, then the probability of receiving this message would be 1, the surprise would be zero, and the amount of information would be zero. If the neuron fired five spikes almost all the time, the message would carry only a little information, but if this type of neuron was thought to hardly ever fire five spikes, the message would be "big news".
Example: Assume this kind of neuron keeps "rolling dice", so that it is equally likely to fire 1, 2, 3, 4, 5 or 6 spikes on any one trial, and it never fires more than 6 or fewer than 1. In that case the number of spikes fired becomes a random variable. By convention, random variables are denoted using capital letters, like X. Then the message "the neuron fired 5 spikes" (or X=5 for short) narrows down the possibilities from 6 equally likely states of affairs to just one. In this case the probability of this message would be 1/6, and the "surprise" would be the inverse of the probability, namely 6.
Now let's say we make two observations from the "dice rolling neuron" in the previous example. The first time it fires 5 spikes, the second time 3. Assuming that the two observations are independent, the probability of getting this particular message would be 1/6 times 1/6, or 1/36, and the surprise would be 36. Under independence, the probabilities multiply. However, information theorists would like the information in independent messages to add rather than multiply. To achieve this, they simply define the information content of a message to be equal to the logarithm of the surprise, or, equivalently, minus the logarithm of the probability of that message.
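The additivity described above can be checked numerically. The following is only a minimal sketch for the "dice rolling neuron"; the function names surprise and information are made up for this illustration, and the choice of base 2 for the logarithm is an assumption at this point (any base would show the same additivity).

```python
import math

def surprise(p):
    """Surprise of a message with probability p: the inverse of p."""
    return 1.0 / p

def information(p, base=2.0):
    """Information content: log of the surprise, i.e. minus log of p.

    The base (2 by default) is an assumption made for this sketch.
    """
    return -math.log(p) / math.log(base)

# One observation from the dice rolling neuron, e.g. X=5.
p_single = 1.0 / 6.0

# Two independent observations (5 spikes, then 3): probabilities multiply.
p_pair = p_single * p_single

info_single = information(p_single)
info_pair = information(p_pair)

print("surprise of the pair:", surprise(p_pair))
print("information, one observation:", info_single)
print("information, two observations:", info_pair)
```

With these definitions the surprise of the pair comes out at about 36, while the information of the pair is twice the information of a single observation: the surprises multiply, but their logarithms add.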
Logarithms to base ten are of course commonly used in many branches of science to work out "orders of magnitude" of a value, but in information theory only natural logarithms or logarithms to base 2 are commonly used. The former give information in units of "nats", the latter in units of "bits". Using log2 has the advantage that the answer to any yes/no question where "yes" and "no" are equally likely gives you exactly -log2(1/2) = log2(2) = 1 bit of information. As this is deemed convenient and intuitively appealing, log2 is by far the most commonly used.
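To make the two units concrete, here is a small sketch that computes the information content of the same message in bits and in nats. The helper names information_bits and information_nats are hypothetical, chosen only for this example; the two results differ by the constant factor ln(2).

```python
import math

def information_bits(p):
    """Information content in bits: minus log base 2 of the probability."""
    return -math.log2(p)

def information_nats(p):
    """Information content in nats: minus the natural log of the probability."""
    return -math.log(p)

# A yes/no question where both answers are equally likely.
coin_bits = information_bits(0.5)
coin_nats = information_nats(0.5)

# The message X=5 from the dice rolling neuron (probability 1/6).
dice_bits = information_bits(1.0 / 6.0)
dice_nats = information_nats(1.0 / 6.0)

print("yes/no answer:", coin_bits, "bits =", coin_nats, "nats")
print("X=5:", dice_bits, "bits =", dice_nats, "nats")
```

The yes/no answer comes out at exactly 1 bit, and the message X=5 carries log2(6), roughly 2.585 bits. Converting from bits to nats is just a multiplication by ln(2), so the choice of unit does not change any of the reasoning.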