2. Important updates regarding Set Shaping Theory
A fundamental step has been taken in the study of this theory: a group of information theory students published an article in which they experimentally confirmed the theoretical predictions of Set Shaping Theory.
“Practical applications of Set Shaping Theory in Huffman coding”
https://arxiv.org/abs/2208.13020
The MATLAB code used is available at the following link:
https://www.mathworks.com/matlabcentral/fileexchange/115590-test-sst-huffman-coding.
At the following link you will find a description of the data compression experiment used to confirm the theoretical predictions:
https://www.academia.edu/88055617/Description_of_the_program_used_to_validate_the_theoretical_results_of_the_Set_Shaping_Theory
3. Let us start with a famous Riemann quote that seems to predict Set Shaping
Theory.
"Two sets that contain the same number of elements can be interpreted as
two points of view that observe the same phenomenon".
As we will see, this sentence was a real source of inspiration: the new theory is based precisely on this intuition.
4. The Set Shaping Theory represents a real change in the approach to data
compression.
For this reason it is important to remember some fundamental concepts
regarding the theory of information developed by Shannon in his famous
article:
“A Mathematical Theory of Communication“
Suppose A is a set of symbols (e.g. a collection of numbers or letters, an alphabet). We will call system or ensemble the triple X = (x; A; P) formed by x, a random variable called the state; A = {x_1, x_2, ..., x_I}, the possible values of x (the states); and P = {p_1, p_2, ..., p_I}, the probability distribution of the states, with P(x_i) = p_i and \sum_{i=1}^{I} p_i = 1.
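Shannon's triple can be sketched as a small data structure. This is a minimal illustration; the class name and method are my own choices, not taken from the original text.

```python
import random

class Ensemble:
    """A sketch of Shannon's ensemble X = (x; A; P)."""
    def __init__(self, states, probs):
        assert abs(sum(probs) - 1.0) < 1e-9   # P must sum to 1
        self.states = list(states)            # A: the possible values of x
        self.probs = list(probs)              # P: probability of each state

    def sample(self, n):
        """Draw n values of the state x according to P."""
        return random.choices(self.states, weights=self.probs, k=n)

# A fair six-sided die is an ensemble with uniform P
die = Ensemble([1, 2, 3, 4, 5, 6], [1/6] * 6)
print(die.sample(10))   # ten simulated values of the state x
```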
5. The fundamental point of Shannon's approach is to shift the focus from the sequence,
which must be compressed, to the ensemble X that generated it.
Thus, for Shannon to encode a sequence means to encode the ensemble X (source) that
generated the sequence.
We can represent the source as a die having N faces, where the probability of rolling a face
is defined by the function P.
Therefore, for example, a balanced die represents an ensemble X with a uniform probability
distribution.
6. Shannon's point of view on data compression is based on associating the shorter codewords with the faces of the die that have a higher probability of occurring.
Consequently, each face of the die is associated with a codeword whose length depends on the probability of that face occurring.
This mathematical model created by Shannon is based on a mathematical function called
entropy, defined in the following way: Given an ensemble 𝑋 = (𝑥; 𝐴; 𝑃), the entropy of X,
denoted H, is defined as:
H(X) = -\sum_{x_i \in A} p(x_i) \log_b p(x_i)
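As a check on the definition above, here is a minimal Python sketch that computes H(X) for a die (with base b = 2, so the result is in bits; the loaded distribution is an arbitrary example of mine).

```python
import math

def entropy(probs, b=2):
    """H(X) = -sum_i p_i * log_b(p_i), skipping zero-probability states."""
    return -sum(p * math.log(p, b) for p in probs if p > 0)

# Fair six-sided die: uniform P, so H(X) = log2(6) ≈ 2.585 bits
fair_die = [1/6] * 6
print(entropy(fair_die))

# A loaded die concentrates probability on one face, so its entropy is lower
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(entropy(loaded_die))
```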
7. Shannon's first theorem (source coding theorem) shows that the average length of the codewords
cannot be less than the entropy of the source H(X).
In order to define a limit to the compression of a message, the asymptotic equipartition principle is used.
This principle is very useful and is also used in the Set Shaping Theory.
To understand its meaning we use a die as a representation of the source.
So, for example, if we take a classic fair six-sided die and roll it 100 times, the probability of getting one hundred 1's is very small: (1/6)^{100}.
By contrast, the probability of obtaining a sequence in which the relative frequencies of the faces are close to 1/6 is very high, and it grows as the number of throws increases.
8. Consequently, this principle tells us that if N (number of dice rolls) is very large, tending to
infinity, the generated sequence almost certainly belongs to a subset (typical set) that contains
only the sequences with an entropy close to NH(X). The error made with this approximation is
negligible.
It is interesting to note that the size of the typical set decreases as the entropy H(X) decreases, so
the smaller size of the typical set implies that the sequences generated by the source can be
encoded in less space.
Using this principle, Shannon's first theorem can be rephrased as follows: N i.i.d. random variables each with entropy H(X) can be compressed into more than NH(X) bits with negligible risk of information loss, as N → ∞; conversely, if they are compressed into fewer than NH(X) bits, it is virtually certain that information will be lost.
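The typical-set behavior can be illustrated numerically: for a loaded die, the per-symbol information content -log2 P(sequence)/N of a long generated sequence concentrates around H(X). A minimal sketch (the loaded distribution is an arbitrary choice for illustration):

```python
import math
import random

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

faces = [1, 2, 3, 4, 5, 6]
probs = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # a loaded die (arbitrary example)
H = entropy(probs)

random.seed(0)
N = 100_000
seq = random.choices(faces, weights=probs, k=N)

# Per-symbol information content of the generated sequence:
# -log2 P(seq) / N = average of -log2 p(x_i) over the sequence
info_rate = sum(-math.log2(probs[x - 1]) for x in seq) / N

print(f"H(X)            = {H:.4f} bits/symbol")
print(f"-log2 P(seq)/N  = {info_rate:.4f} bits/symbol")
# For large N the two values are close: the sequence is almost surely typical.
```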
Now we can introduce Set Shaping Theory, which, as we will see, represents a completely new point of view.
9. Set Shaping Theory
To understand this theory it is interesting to try to answer the following question: how can
we simulate a sequence of N symbols emitted by a source defined by an ensemble 𝑋
= (𝑥; 𝐴; 𝑃)?
The simplest way is to consider our source X as a die, so if we want to simulate 10 values
of X, we roll the die 10 times.
Another way is as follows: if we roll a 6-sided die 10 times, we can generate |A|^10 = 6^10 different sequences. So, if we write each of these sequences on a sheet, put them all in a box, and then draw one at random, we get the same result as rolling the die ten times.
10. Our box containing the 6^10 sequences is nothing more than the set A^10, which contains 6^10 elements.
Consequently, 10 values of the ensemble X can be obtained by randomly extracting (if the distribution P is uniform) an element from the set A^10.
If the distribution of X is not uniform, unlike in our example, the probability of extracting each sequence will depend on the probability distribution P.
According to Riemann's intuition, each set of size |A|^N represents a different point of view from which to observe N values generated by the source X = (x; A; P).
Indeed, if two sets A and B have the same number of elements, it is possible to define a
bijection function that converts an element of set A to an element of set B and vice versa.
According to this theory, the source is seen as a set that can be transformed into any other
set of the same size.
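The equivalence between rolling the die N times and drawing from the box can be sketched by indexing the set A^N: each integer in [0, 6^10) identifies one sequence through its base-6 digits, so a uniform draw from the box is just a uniform random integer. This is one convenient encoding of my own, not something prescribed by the theory.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
N = 10

def index_to_sequence(idx, n=N):
    """Decode an index in [0, 6**n) into a die sequence, reading the
    index as an n-digit base-6 number."""
    seq = []
    for _ in range(n):
        idx, digit = divmod(idx, 6)
        seq.append(FACES[digit])
    return seq

random.seed(0)
# Drawing one element of A^10 uniformly at random from the "box"...
drawn = index_to_sequence(random.randrange(6 ** N))
# ...is distributed exactly like rolling the fair die 10 times.
rolled = [random.choice(FACES) for _ in range(N)]
print(drawn, rolled)
```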
11. At this point, we must ask ourselves what kind of sequences the new set should contain.
The new sequences cannot have a length N2 less than that of the original sequences, because the new set would be smaller than the one generated by our source: |A|^N > |A|^{N2}.
The new sequences cannot have the same length either; indeed, it has been shown that entropy is invariant under every isomorphism.
The only possible solution is for the new sequences to have a length N2 greater than that of the original sequences.
By increasing the length of the sequences, the new set will have many more elements than the source set; indeed, if N2 > N, we have |A|^N < |A|^{N2}.
12. Therefore, we must select from the set A^{N2} a subset of size equal to |A|^N.
This operation is called "shaping of the source", because what is done is to cut some sequences belonging to the set A^{N2}, with N2 > N, making their probability of occurring zero.
So, it is like replacing the die with a new rigged die.
The price we have to pay for this substitution is having to roll the die more times.
13. Since this theory is used in data compression, the most common way of performing the "shaping of the source" is to choose the subset by selecting the sequences with the lowest entropy.
For example, we take as source X = (x; A; P) a classic fair six-sided die (|A| = 6, uniform P) and roll it 10 times. There are 6^10 possible sequences a = (x_1, ..., x_10) that can be obtained by rolling the die 10 times. We call this set A^10 (|A^10| = 6^10).
We order these sequences by their entropy value, so sequence number 1, a_1, has the lowest entropy and sequence number 6^10, a_{6^10}, has the highest entropy.
Now we roll the die 11 times; in this case there are 6^11 possible sequences a = (x_1, ..., x_11). We call this set A^11 (|A^11| = 6^11).
We sort these sequences by entropy value, as in the previous case.
14. In this way, we obtain two lists of sequences with increasing entropy.

    A^10               A^11
                       a'_{6^11}
    ...                ...
    a_{6^10}    →      a'_{6^10}
    ...         →      ...
    a_2         →      a'_2
    a_1         →      a'_1

We define a bijection function that transforms the sequence a_1, belonging to the set A^10, into the sequence a'_1, belonging to the set A^11. We continue in this way for all the sequences of the set A^10, as indicated by the arrows.
15. We thus obtain a bijection function f which transforms the set A^10 into a subset B^11 of the set A^11, with B^11 ⊂ A^11 and |A^10| = |B^11|:

f: A^10 → B^11

We call f_m the bijection function defined in the previous example, in which the sequences a ∈ A^10 are transformed into the lowest-entropy sequences of A^11, according to the scheme defined by the arrows.
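The mapping f_m can be sketched at toy scale: a 3-symbol alphabet and N = 2, K = 1, so the sets stay small enough to enumerate. Using the empirical entropy of each sequence as the sorting key is my reading of "ordering by entropy value"; the function names are my own.

```python
import math
from collections import Counter
from itertools import product

def sequence_entropy(seq):
    """Empirical (zero-order) entropy of a sequence, in bits per symbol."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())

def f_m(alphabet, n, k=1):
    """Pair the i-th lowest-entropy sequence of A^n with the i-th
    lowest-entropy sequence of A^(n+k), as in the arrow scheme."""
    source = sorted(product(alphabet, repeat=n), key=sequence_entropy)
    target = sorted(product(alphabet, repeat=n + k), key=sequence_entropy)
    return dict(zip(source, target))  # only the first |A|^n targets are used

mapping = f_m([1, 2, 3], n=2)
print(len(mapping))        # 9 = 3^2 source sequences
print(mapping[(1, 1)])     # (1, 1, 1): a constant (lowest-entropy) sequence
                           # maps to a constant length-3 sequence
```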
16. To understand the advantages of this theory we need to define some functions.
Given a sequence a_i = (x_1, ..., x_N), generated by a source X = (x; A; P), we define its information content as follows:

I(a_i) = -\sum_{j=1}^{N} \log p(x_j)

The probability P(a_i) that the source X generates the sequence a_i is:

P(a_i) = \prod_{j=1}^{N} p(x_j)

Consequently, the average information content of a sequence generated by a source X = (x; A; P) is:

\bar{I}(a) = \sum_{i=1}^{|A|^N} P(a_i) I(a_i)
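As a sanity check on these definitions, note that I(a_i) = -log P(a_i). A minimal sketch (the three-symbol distribution is an arbitrary example of mine):

```python
import math

def information_content(seq, p):
    """I(a) = -sum_j log2 p(x_j), in bits."""
    return -sum(math.log2(p[x]) for x in seq)

def sequence_probability(seq, p):
    """P(a) = product over j of p(x_j)."""
    prob = 1.0
    for x in seq:
        prob *= p[x]
    return prob

# A loaded three-symbol source (arbitrary example distribution)
p = {"a": 0.5, "b": 0.25, "c": 0.25}
seq = "aabca"
I = information_content(seq, p)
P = sequence_probability(seq, p)
print(I)              # 1+1+2+2+1 = 7.0 bits
print(-math.log2(P))  # the same value: I(a) = -log2 P(a)
```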
17. Now, we apply the bijection function f to the set A^N:

f: A^N → B^{N+K}

with K, N ∈ ℕ, K > 0, and |A^N| = |B^{N+K}|:

f(a) = b

with a = (a_1, ..., a_N), b = (b_1, ..., b_{N+K}), a ∈ A^N and b ∈ B^{N+K}.
The parameter K is called the shaping order of the source and represents the difference in length between the source sequences belonging to A^N and the transformed sequences belonging to B^{N+K}.
18. Given a source X = (x; A; P) and a function f we will have:

\bar{I}(a) = \sum_{i=1}^{|A|^N} P(a_i) I(a_i)

\bar{I}(b) = \sum_{i=1}^{|A|^N} P(a_i) I(b_i)

I(a_i) = -\sum_{j=1}^{N} \log p(x_j), with a_i ∈ A^N

I(b_i) = -\sum_{j=1}^{N+K} \log p(x_j), with b_i ∈ B^{N+K}

Using these definitions, the function f_m of the example is defined as follows: given the set f_m(A^N) = B^{N+K} and its complement A^{N+K} - B^{N+K} = C^{N+K}, for each sequence b ∈ B^{N+K} the information content I(b) is always less than I(c) for each c ∈ C^{N+K}, and I(b_i) < I(b_{i+1}) for all b_i ∈ B^{N+K}.
19. If we apply the function f_m, we would expect Ī(b) ≥ Ī(a); instead, we have a much more complex situation, in which, when |A| > 2, we have Ī(b) < Ī(a).
The table shows the values in bits of Ī(a), Ī(b) and Ī(a) - Ī(b), with K = 1, N = 100 and f = f_m, for a source X = (x; A; P) with |A| varying from 2 to 10 and with uniform probability distribution P(x_i) = 1/|A|.
|A|      Ī(a)       Ī(b)       Ī(a) - Ī(b)
  2      99.275     99.659       -0.383
  3     157.044    157.040        0.004
  4     197.819    197.324        0.495
  5     229.271    228.304        0.968
  6     254.843    253.401        1.443
  7     276.353    274.464        1.889
  8     294.868    292.527        2.341
  9     311.121    308.383        2.738
 10     325.570    322.388        3.181
20. The data in the table show an extremely interesting and unexpected result. Indeed, when |A| > 2, the average information content of a sequence b randomly extracted from the set B^{N+1} turns out to be less than the average information content of a sequence a randomly extracted from the set A^N.
The data in the table relate to N = 100; however, these results remain valid for values of N both lower and higher than 100.
It is important to specify that the set B^{N+1} contains sequences of the same length that are all distinct from one another, so the result obtained does not violate the pigeonhole principle.
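The comparison behind the table can be sketched at toy scale. The sketch below uses the empirical entropy of each sequence both as the sorting key and as its coding cost, which is my reading of the procedure; the toy parameters (N = 4, K = 1) are far from the N = 100 of the table, so the printed averages are illustrative only, not a reproduction of the table's values.

```python
import math
from collections import Counter
from itertools import product

def coding_cost(seq):
    """N * (empirical entropy of seq), in bits: the zero-order cost of
    encoding the sequence with its own symbol frequencies."""
    n = len(seq)
    return -sum(c * math.log2(c / n) for c in Counter(seq).values())

def average_costs(alphabet_size, n, k=1):
    """Average cost over A^n vs. over its image under f_m in A^(n+k),
    for a uniform source (every a has probability |A|**-n)."""
    alphabet = range(alphabet_size)
    source = sorted(product(alphabet, repeat=n), key=coding_cost)
    target = sorted(product(alphabet, repeat=n + k), key=coding_cost)
    # f_m pairs the i-th lowest-cost source sequence with the i-th
    # lowest-cost target sequence, so B is the cheapest |A|^n targets
    i_a = sum(coding_cost(a) for a in source) / len(source)
    i_b = sum(coding_cost(b) for b in target[:len(source)]) / len(source)
    return i_a, i_b

for size in (2, 3, 4):
    i_a, i_b = average_costs(size, n=4)
    print(f"|A|={size}: I(a)={i_a:.3f}  I(b)={i_b:.3f}  diff={i_a - i_b:+.3f}")
```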
21. The reasons why Set Shaping Theory represents a revolution in information theory
1) This theory, introducing conceptually very advanced elements hypothesized by
Riemann, represents a completely different point of view from that proposed by
Shannon.
2) It develops a new class of bijection functions with properties of strong practical
relevance in many fields.
3) It can help us solve many open problems concerning information theory.
4) It raises important questions about entropy that can allow us to better understand
this important function.
Finally, and most importantly, it is a new field with an infinity of possible results and applications yet to be discovered.