Shannon information
From The Transhumanist Wiki
An introduction to Shannon information and Shannon entropy
- Wikipedia on information theory
- Dawkins on genetic information
- More resources on Shannon information
- Mathematical properties of Shannon information
Suppose you have a system A that's equally likely to be in any of 8 possible states {A1, A2, A3, A4, A5, A6, A7, A8}. There is an extraordinarily ubiquitous quantity, appearing in many branches of physics and mathematics and even biology, known as the Shannon entropy or simply the entropy. The Shannon entropy of A is 3 bits:
- A1: 001
- A2: 010
- A3: 011
- A4: 100
- A5: 101
- A6: 110
- A7: 111
- A8: 000
Using 3 bits of information, it is possible to uniquely identify the state of A. If you had no idea what was the state of system A, and I wanted to tell you, we could prearrange this code, and then I could use three coins to signal to you the state of A. For example, the coin sequence "heads, heads, tails" might indicate that A was in state A6. The entropy of a system is, roughly, how much additional information it will take to indicate the exact state of a system.
Suppose we have another system B which is equally likely to be in any of 4 possible states [B1..B4]. B's Shannon entropy would be 2 bits, log2(4), the logarithm base 2 of the volume of the state space. In general, to distinguish between 2N equally probable states requires N bits of information, so a system that might be in any of 2N equally probable states has N bits of entropy.
What is the entropy of the combined system A+B?
One reply would be that the combined system can be in any of 32 states {A1+B1, A1+B2, A1+B3, A1+B4, A2+B1, ..., A8+B4}, and so AB has 5 bits of entropy; or more simply that H(AB) = H(A) + H(B). But this is not generally true. Suppose that we don't know which of 8 states A occupies, and we don't know which of 4 states B occupies, but we know that A and B have the same parity – if A is an odd state, B is in an odd state; if A is in an even state, B is in an even state. The set of possible states for the combined system AB is {A1+B1, A1+B3, A2+B2, A2+B4, A3+B1, ..., A8+B2, A8+B4}, or 16 total possible states. Thus the combined system AB has 4 bits of entropy, and it takes 4 bits of information to exactly specify the state of the combined system; H(AB) = 4.
By learning the exact state of A, we constrain B down to two possible choices, {B2, B4} if A is even, {B1, B3} if A is odd. This reduces the entropy of B (the information we do not yet know about B) from 2 bits to 1 bit, since the volume of possible states B could be in has been reduced from 4 states to 2 states. Similarly, learning the exact state of B reduces the possible state volume of A from 8 states to 4 states, reducing A's entropy from 3 bits to 2 bits.
The Shannon mutual information between A and B, I(A, B), is therefore one bit:
- I(A, B) = H(A) + H(B) - H(AB)
- 1 = 3 + 2 - 4
We can also write that the combined entropy equals the sum of the independent entropies minus the mutual information:
- H(AB) = H(A) + H(B) - I(A, B)
- 4 = 3 + 2 - 1
Suppose system C has four possible states {C1, C2, C3, C4} with probabilities:
C1: 1/2 (50%) C2: 1/4 (25%) C3: 1/8 (12.5%) C4: 1/8 (12.5%)
The Shannon entropy for C would be 1.75 bits, found by summing -P(c)log(P(c)) for all states c:
C1: 1/2 (50%) -P(c)*log(P(c)) = -1/2 * log(1/2) = -1/2 * -1 = 1/2 C2: 1/4 (25%) -P(c)*log(P(c)) = -1/4 * log(1/4) = -1/4 * -2 = 1/2 C3: 1/8 (12.5%) -P(c)*log(P(c)) = -1/8 * log(1/8) = -1/8 * -3 = 3/8 C4: 1/8 (12.5%) -P(c)*log(P(c)) = -1/8 * log(1/8) = -1/8 * -3 = 3/8
Total entropy: 1.75 bits
C has 1.75 bits of entropy? What does that mean? It means that, on average, it would take one and three-fourths bits to tell you the exact state of C. How does one signal using three-fourths of a bit? By writing three-fourths of a one or zero? Suppose I were to signal the state of C using this code:
C1: 1 (1 bit, 1/2 of the time) C2: 01 (2 bits, 1/4 of the time) C3: 001 (3 bits, 1/8 of the time) C4: 000 (3 bits, 1/8 of the time)
This alphabet is unambiguous; if you know where the message starts, you can always tell where each symbol begins and ends. For example, 10111001011 decodes unambiguously to C1, C2, C1, C1, C3, C2, C1. Suppose there's a message composed of a string of symbols from C, with the frequencies stated before. Using the above code, on average I can communicate each symbol using 1.75 bits; 1 bit half the time, 2 bits a quarter of the time, and 3 bits a quarter of the time. Since the average signal information is exactly equal to the Shannon entropy computed before, this is a maximally efficient code.
You can go back and verify that the Shannon entropy as formally defined corresponds to the intuitive quantities given earlier. For example, system B has 4 possible states, all equally probable:
B1: 1/4 (25%) -P(b)*log(P(b)) = -1/4 * log(1/4) = -1/4 * -2 = 1/2 B2: 1/4 (25%) -P(b)*log(P(b)) = -1/4 * log(1/4) = -1/4 * -2 = 1/2 B3: 1/4 (25%) -P(b)*log(P(b)) = -1/4 * log(1/4) = -1/4 * -2 = 1/2 B4: 1/4 (25%) -P(b)*log(P(b)) = -1/4 * log(1/4) = -1/4 * -2 = 1/2
Total entropy: 2.0 bits
The Shannon information Sum(-p*log(p)) has its global maximum when all possible basic states are equally probable. This is known as the maximum-entropy distribution.
The minimum entropy is zero, and it occurs when we know the exact state of a system. For example, suppose we know that B is in state B3:
B1: 0 (0%) -P(b)*log(P(b)) = 0 * log(0) = 0 in the limit = 0 B2: 0 (0%) -P(b)*log(P(b)) = 0 * log(0) = 0 in the limit = 0 B3: 1 (100%) -P(b)*log(P(b)) = 1 * log(1) = 0 = 0 B4: 0 (0%) -P(b)*log(P(b)) = 0 * log(0) = 0 in the limit = 0
Total entropy: 0 bits
Shannon entropy is a precise quantity, but it behaves intuitively like our notion of "uncertainty" in our knowledge about a system. If there are a few very strong possibilities, the entropy is low. If there are many weak possibilities, the entropy is high. If you know the exact state, the entropy is zero.
Physics and information theory
Entropy can measure the volume of configuration space a system may possibly occupy. If each possible state of the system is a single point in configuration space, then a volume in configuration space can represent your knowledge and uncertainty about the state of the system. Suppose a system consists of three continuous real variables X, Y, Z. You can then represent any possible state of the system by a single point in three-dimensional space. Suppose you have a system of ten particles, where each particle has a three-dimensional position and a three-dimensional velocity. You can represent any possible state of this system by a point in a sixty-dimensional phase space. Since any possible state of that physical system corresponds to a single point in phase space, a volume in phase space describes many different possible states of a physical system.
If you're uncertain about what state a physical system is in, you're uncertain about where the point lies in phase space. You can describe that uncertainty by drawing a border around the volume of phase space the point might occupy. The wider the border, the larger the volume. The larger the volume, the more points in the volume; the more points in the volume, the more possible states of the system; the more possible states of the system, the greater your uncertainty about the system; the greater your uncertainty, the greater the entropy.
Suppose you have a system of sixty (classical) gas molecules, with three positional dimensions and three velocity dimensions, a 360-dimensional phase space. As the gas molecules fly around and bump into each other, collide and rebound, their positions change and their velocities change, and the point representing the state of the entire system traces out a trajectory through phase space. Suppose you heat up the gas, make the molecules move faster. Each individual molecule has a greater range of possible velocities because the molecules are moving faster; particle 33's velocity in the vertical coordinate might vary between -10 meters/second and 10 meters/second, instead of -5 meters/second and 5 meters/second. You only know the temperature of the gas; you don't know the exact position of any specific particle (note: in quantum physics there is no such thing as a "specific particle") or the exact velocity of any specific particle. If you are told the temperature has changed, say, gone up by 5 degrees, you then become more uncertain of the system state; every individual particle has a wider range of possible velocities. The system as a whole could be in more possible states; the volume in configuration space needs a larger border; the entropy goes up. As an object becomes hotter, its entropy increases. This is the border where physics blends smoothly into information theory.
Think of our belief about a system as a probability distribution over configuration space. Each point in the configuration space represents a possible state of the entire system. Our beliefs about the system are summarized by a function over each point in the configuration space, where the function returns our estimated probability that the system has that exact state.
The second law of thermodynamics
The second law of thermodynamics states that the entropy of a closed system never decreases. The second law of thermodynamics is a consequence of a mathematical statement called Liouville's Theorem which holds for both classical Newtonian and modern quantum physics; Liouville's Theorem states that, in our universe's physics, probability is incompressible.
Suppose that we have a complex classical system with billions of particles, and hence a multi-billion-dimensional phase space. (What do you mean you can't visualize a multi-billion-dimensional phase space? Run out and get your visual cortex upgraded.) The system starts out at some initial point Q. The particles move around under the laws of Newtonian physics, changing positions as they fly, changing velocities as they bump and experience various forces. The point that describes the system's state therefore changes with time, tracing out a trajectory.
Now let us instead take a volume of points, a little blob in this multi-billion-dimensional phase space, and extrapolate forward each point in this blob. Remember, each different point represents a different starting state of the entire physical system, and by "extrapolating" the point, we are taking the corresponding physical system, running it forward under the laws of physics, seeing where it ends up, and then translating that end state back into another point. Liouville's Theorem for classical systems says that no two starting points ever end up in the same final point, and furthermore, the total volume of the blob doesn't change. The second conclusion doesn't follow automatically from the first; for example, you can shrink a blob in every dimension by a factor of two, and each different starting point will map to a different end point, if the variables are real and continuous; but the volume of the blob would still change. Liouville's Theorem says the volume doesn't change. If you start with a milliliter of points in the billion-dimensional phase space of a classical system, and extrapolate that blob forward five minutes, take the physical system corresponding to each point and run that physical system for five minutes and translate it back into a point again, after five minutes you end up with a milliliter of points again.
However, if you started out with a compact volume in phase space, a milliliter of points with a small tight boundary, the extrapolation might turn the compact volume into a decidedly non-compact volume – a blob with all sorts of wriggly fractal tentacles. If you then wanted to draw a compact border around the blob, you might need to draw a much bigger blob to contain it. Suppose that your knowledge about the system started out as that small complex blob – you know the system's state is somewhere in that blob of possibilities. The system runs for five minutes, and if you had the computing power to separately extrapolate each state, you'd end up with a hugely complicated blob of the same size. But you don't have nearly enough computing power to do this. The only thing you can afford to compute is that the point must now be somewhere in this much larger blob, a big blob with compact borders which contains the small but squiggly and tentacle-y blob that you would find if you could afford to actually extrapolate every possible starting state of the system. From your perspective, the entropy of the system has gone up.
Imagine a box with sixty gas molecules, which start out confined in the lower left front corner by a small partition. Then we remove the partition and let the gas molecules bounce around for five minutes. At the beginning of the problem, you know that all sixty gas molecules will be in one corner (a small blob of possibilities). After five minutes, you can only guess that the sixty gas molecules will be somewhere in the entire box, unless you have a huge amount of computing power to check every trajectory from every possible initial condition. Your uncertainty has become worse; information has been lost; entropy has increased.
Liouville's Theorem on the incompressibility of probability directly implies the second law of thermodynamics. Why? Suppose there were a physical process that turned a glass of warm water onto ice cubes, making no other changes. That is, you'd take a glass of warm water, put in a sealed box (a closed system), and the sealed box would turn the warm water into ice cubes, and afterward the sealed box would still be the same temperature. Ice cubes have a smaller volume in phase space than warm water, because the molecules in ice are colder (and also crystallized). This would be a physical process that causes a large blob in phase space to develop into a small blob, violating Liouville's Theorem. Your knowledge of the system can go from a small blob to a large blob without violating Liouville's Theorem because you're losing information – you can't calculate fast enough to keep track of all the wriggling tentacles, so you draw a bigger blob instead. But to create a process that starts out with ordinary tap water (and not tap water in some special, extremely unique initial condition) and maps the phase space volume of tap water onto the phase space volume of ice cubes, you'd need to compress the probability flow, turn a big blob into a little blob.
So how does a refrigerator work? Imagine that we have two subsystems, A and B. A starts out in state A1, and B can occupy any of states [B1..B4]. Let there be a physical process Q that causes the system AB to develop as follows:
A1+B1 -> B1+A2 A1+B2 -> B1+A4 A1+B3 -> B1+A6 A1+B4 -> B1+A8
Process Q does not compress probability; it maps four possible starting points to four unique endpoints. Process Q does compress system B; from four possible starting points, B considered in isolation ends up in the single endpoint B1. The entropy is transferred to system A, which goes from a single possible starting point to four possible endpoints. Developing the metaphor further, we see that A ends up spread out over a more wriggly blob than B originally occupied; a gross description of process Q might look like this:
[B1..B4] -> [B1] [A1] -> [A2..A8]
Entropy has, apparently, increased in the process of flowing from one subsystem to another. You can refrigerate water into ice cubes, but your refrigerator needs to become hotter, and moreover, for real physical systems, your refrigerator will generate waste heat, in addition to the heat it removes from the water.
In the real universe of quantum mechanics, things are more complicated; but the quantum analogue of Liouville's Theorem yet holds, and so does the second law of thermodynamics. This is why physicists are so arrogantly certain that your perpetual motion machine, which uses a most ingenious collection of gears and pumps, can't possibly work. A perpetual motion machine would require new physics, and gears and pumps are decidedly old physics.


