by dedalus on 8/21/23, 8:26 PM with 74 comments
by jwarden on 8/22/23, 6:34 PM
surprisal: how surprised I am when I learn the value of X
    Surprisal(x) = -log p(X=x)

entropy: how surprised I expect to be
    H(p) = 𝔼_X -log p(X)
         = ∑_x p(X=x) * -log p(X=x)

cross-entropy: how surprised I expect Bob to be (if Bob's beliefs are q instead of p)
    H(p,q) = 𝔼_X -log q(X)
           = ∑_x p(X=x) * -log q(X=x)

KL divergence: how much *more* surprised I expect Bob to be than me
    Dkl(p || q) = H(p,q) - H(p,p)
                = ∑_x p(X=x) * log p(X=x)/q(X=x)

information gain: how much less surprised I expect Bob to be if he knew that Y=y
    IG(q|Y=y) = Dkl(q(X|Y=y) || q(X))

mutual information: how much information I expect to gain about X from learning the value of Y
    I(X;Y) = 𝔼_Y IG(q|Y=y)
           = 𝔼_Y Dkl(q(X|Y=y) || q(X))
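A quick numerical sketch of those definitions (a made-up binary X, my beliefs p vs. Bob's beliefs q, log base 2 so everything is in bits; the helper names are just for illustration):

    import math

    # Hypothetical beliefs about a binary X: mine (p) and Bob's (q)
    p = [0.9, 0.1]
    q = [0.6, 0.4]

    def surprisal(prob):
        return -math.log2(prob)                                   # -log p(X=x)

    def entropy(p):
        return sum(px * surprisal(px) for px in p)                # H(p)

    def cross_entropy(p, q):
        return sum(px * surprisal(qx) for px, qx in zip(p, q))    # H(p,q)

    def kl(p, q):
        return cross_entropy(p, q) - entropy(p)                   # Dkl(p || q)

    print(entropy(p))           # ~0.47 bits: how surprised I expect to be
    print(cross_entropy(p, q))  # ~0.80 bits: how surprised I expect Bob to be
    print(kl(p, q))             # ~0.33 bits: how much more surprised Bob is, on average

Feeding a conditional distribution and its marginal into the same kl() gives the information-gain line above, and averaging that over Y gives the mutual information.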
by golwengaud on 8/22/23, 1:27 PM
1. Expected surprise
2. Hypothesis testing
3. MLEs
4. Suboptimal coding
5a. Gambling games -- beating the house (see the sketch after this list)
5b. Gambling games -- gaming the lottery
6. Bregman divergence
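A minimal sketch of intuition 5a, under assumed fair odds and a made-up 3-horse race: the doubling rate a gambler gives up by betting in proportion to beliefs q instead of the true probabilities p comes out to exactly Dkl(p || q).

    import math

    # Hypothetical 3-horse race: true win probabilities p, a gambler's beliefs q
    p = [0.5, 0.3, 0.2]
    q = [0.25, 0.25, 0.5]
    odds = [1 / px for px in p]   # "fair" odds, assumed for simplicity

    def doubling_rate(bets, p, odds):
        # expected log2-growth of wealth when betting fraction bets[x] on outcome x
        return sum(px * math.log2(bx * ox) for px, bx, ox in zip(p, bets, odds))

    def kl(p, q):
        return sum(px * math.log2(px / qx) for px, qx in zip(p, q))

    gap = doubling_rate(p, p, odds) - doubling_rate(q, p, odds)
    print(gap)       # ~0.31 bits per race given up by betting with q
    print(kl(p, q))  # same number

Proportional ("Kelly") betting with the true p maximizes the doubling rate here, so the gap is the entire penalty for believing q.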
by tysam_and on 8/22/23, 8:21 PM
The KL divergence yields a concrete value that tells you how many actual bits of disk space you will waste if you use the encoding table from one ZIP file of data to encode another ZIP file of data. It's not just theoretical; this is exactly the kind of task it's used for.
The closer the folders are to each other in content, the fewer wasted bits. So, we can use this to measure how similar two sets of information are, in a manner of speaking.
These 'wasted bits' are also known as relative entropy, since entropy is basically a measure of how disordered something can be. The more disordered, the more possibilities we have to choose from, and thus the more information possible.
Entropy does not guarantee that the information is usable. It only guarantees how much of this quantity we can get, much like pipes serving water. Yes, they will likely serve water, but you can accidentally have sludge come through instead. Still, their capacity is the same.
One thing to note is that with our ZIP files, if you use the encoding table from one to encode the other, you will end up with a different relative-entropy (i.e. 'wasted bits') number than if you did the reverse. This is because the KL divergence is not what's called symmetric: it can mean something different depending on which direction it goes.
Can you pull out a piece of paper, make yourself an example problem, and tease out an intuition as to why?
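Or, a minimal worked example of that asymmetry, with toy symbol frequencies standing in for the two ZIP folders (the data and helper names here are made up, and ideal code lengths are assumed):

    import math
    from collections import Counter

    def distribution(data):
        # empirical symbol frequencies for a (toy) folder of data
        counts = Counter(data)
        total = sum(counts.values())
        return {sym: c / total for sym, c in counts.items()}

    def wasted_bits(p, q):
        # extra bits per symbol when data distributed as p is coded with a table
        # built for q (assumes every symbol of p also appears in q)
        return sum(px * math.log2(px / q[sym]) for sym, px in p.items())

    a = distribution("aaaaaaab")   # folder A: mostly 'a'
    b = distribution("aabbbbbb")   # folder B: mostly 'b'

    print(wasted_bits(a, b))  # ~1.26 bits/symbol: coding A with B's table
    print(wasted_bits(b, a))  # ~1.49 bits/symbol: coding B with A's table

The two numbers differ because the cost of each mismatch is weighted by the distribution you are actually coding.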
by techwizrd on 8/22/23, 1:58 PM
by zerojames on 8/22/23, 1:53 PM
My theory was: calculate the surprisal of words used in a language (in my case, from an NYT corpus), then calculate the KL divergence between a given piece of prose and the surprisal profiles of different authors. The author from whom the prose had the lowest KL divergence was assumed to be its author. I think this has been used in stylometry a bit.
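A toy sketch of that idea (the author names and corpora below are made up, a real setup would use an actual corpus like the NYT one mentioned above, and add-alpha smoothing is assumed so no word gets zero probability):

    import math
    from collections import Counter

    def word_dist(text, vocab, alpha=1.0):
        # smoothed word distribution so every vocabulary word has nonzero probability
        counts = Counter(text.lower().split())
        total = sum(counts.values()) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def kl(p, q):
        return sum(pw * math.log2(pw / q[w]) for w, pw in p.items())

    # Toy stand-ins for per-author corpora (hypothetical)
    authors = {
        "author_a": "the sea was grey and the sea was cold",
        "author_b": "markets rallied as investors weighed the data",
    }
    unknown = "the cold grey sea"

    vocab = set()
    for text in list(authors.values()) + [unknown]:
        vocab.update(text.lower().split())

    prose = word_dist(unknown, vocab)
    scores = {name: kl(prose, word_dist(text, vocab)) for name, text in authors.items()}
    print(min(scores, key=scores.get))  # attribute to the author with the lowest divergence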
by max_ on 8/22/23, 12:49 PM
Could someone give me a simple explanation as to what it is?
And also, what practical use cases does it have?
by riemannzeta on 8/22/23, 7:37 PM
https://arxiv.org/abs/1508.02421
And to explain the relationship between the rate of evolution and evolutionary fitness:
https://math.ucr.edu/home/baez/bio_asu/bio_asu_web.pdf
The connection between all of these manifestations of KL divergence is that a system far from equilibrium contains more information (in the Shannon sense) than a system in equilibrium. That "excess information" is what drives fitness within some environment.
by jszymborski on 8/22/23, 2:01 PM
by nravic on 8/22/23, 6:26 PM
by mrv_asura on 8/22/23, 1:32 PM
by janalsncm on 8/22/23, 9:47 PM
by ljlolel on 8/22/23, 12:52 PM