Media IT :: Dr Serge Linckels :: http://www.linckels.lu/ :: serge@linckels.lu ::
Faculty of Science, Technology and Communication (FSTC)
Bachelor en informatique (professionnel)
-- Media IT --
Unit 4
Entropy and compression
Assignment
Presentation on data formats
Choose a data format from the following list:
How it works
• You work in teams of 2 students
• Prepare a presentation that explains the principles of your selected data format,
i.e., encoding, compression, features, usage...
• Share your presentation, e.g., PowerPoint file or Prezi link
• Presentation on 29 October or 5 November; 10 – 15 minutes
• Your work (presentation + support) counts for 10% of your final grade
jpeg / jfif / jpeg 2000 wav / aiff / au / raw
png mp3 / vorbis / aac / wma
bmp / (animated) gif tiff / raw
svg
3. Coding and compression
3.1 Basics of information theory
3.2 Entropy and redundancy
3.3 Huffman coding
3.4 Run-length encoding (RLE)
3.5 Lempel–Ziv–Welch (LZW) coding
3.6 The mysterious case of the Xerox scanners (2013)
3. Coding and compression
3.1 Basics of information theory
11 November 2017
One of Europe's smallest countries now holds claim to being a giant in the
space industry. Luxembourg, with a population less than the state of
Vermont, now generates nearly 2 percent of its annual gross domestic
product from the space industry, according to Deputy Prime Minister Etienne
Schneider. The country's economy checked in just shy of $61 billion in 2016,
according to the CIA World Factbook. "We have grown from nothing to the
most dynamic in Europe," Schneider told an audience Saturday, in a speech
at the New Worlds conference in Austin, Texas. He added that the country's
space program was first launched just over 30 years ago. Schneider, who
also serves as Luxembourg's economic minister, told the conference that he
is often questioned about why Luxembourg is so "keen on exploiting space
resources." He replied by saying the same "liberal, extremely business
friendly climate" that pushed the country's financial sector boom is now
being reapplied to attracting space companies. "I have more than 70 space
companies in the pipeline," Schneider told CNBC after the speech.
Luxembourg's "space resources initiative" is the country's plan to make the
most out of a quickly growing global industry, the minister said. "It's a
series of measures to position Luxembourg as the European heart of
exploration and use of space resources."
The word space is used frequently in this source, i.e., it has a
higher occurrence than other words, e.g., Texas (which appears just once)
The probability that the word space re-appears in
a following section (not visible on this slide) is therefore very high
Hence, the word space has a higher “importance” in
this source than other words, e.g., Texas
This is called: self-information or surprisal
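Self-information can be computed directly as I(a) = −log2 p(a) bits: the rarer a word, the more information its occurrence carries. A minimal sketch in Python (the probabilities below are illustrative assumptions, not frequencies measured from the article):

```python
import math

def surprisal(p):
    """Self-information of an event with occurrence probability p, in bits."""
    return -math.log2(p)

# Hypothetical probabilities: a frequent word like "space" vs a rare one like "Texas".
print(surprisal(0.05))    # frequent word -> low surprisal
print(surprisal(0.001))   # rare word -> high surprisal
```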
3. Coding and compression
3.1 Basics of information theory
Computational linguistics basics
𝐴 is an alphabet, i.e., a non-empty set of symbols

Claude Elwood Shannon (1916–2001) was
an American mathematician, electrical
engineer, and cryptographer known as "the
father of information theory"
Examples
𝐴 = {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}
𝐴* is the set of all possible words over this alphabet: 𝐴* = {a, aaa, oxo, jimi, house, haus, maison, abc, bdef, …}
The language 𝐸 ⊆ 𝐴* contains all English words: 𝐸 = {a, and, are, house, …}
𝑎 ∈ 𝐸 is a word of the English language: 𝑎 = house
|𝑎| is the length of the word 𝑎: |𝑎| = 5
𝑝(𝑎) ∈ ℝ is the probability of occurrence of the word 𝑎 in the English language, with 0 ≤ 𝑝(𝑎) ≤ 1
The average word length is written: Σ_{𝑎∈𝐸} 𝑝(𝑎) · |𝑎|
The sum of all probabilities is 1: Σ_{𝑎∈𝐸} 𝑝(𝑎) = 1
3. Coding and compression
3.1 Basics of information theory
Computational linguistics basics
Word frequency depends on the corpus you analyze
https://www.wordfrequency.info/ https://www.sketchengine.eu/
3. Coding and compression
3.1 Basics of information theory
Computational linguistics basics
3. Coding and compression
3.1 Basics of information theory
Computational linguistics basics
Example of application: cryptanalysis
A typical distribution
of letters in English
language text. Weak
ciphers do not
sufficiently mask the
distribution, and this
might be exploited by a
cryptanalyst to read
the message.
Colossus: British
machine developed
during WW2 to help
codebreakers break
the German Lorenz
cipher
LIVITCSWPIYVEWHEVSRIQMXLEYVEOIEWHRXEXIPFEMVEWHKVSTYLXZIXLIKIIXPIJVSZEYPERRGERIM
WQLMGLMXQERIWGPSRIHMXQEREKIETXMJTPRGEVEKEITREWHEXXLEXXMZITWAWSQWXSWEXTVEPMRXRSJ
GSTVRIEYVIEXCVMUIMWERGMIWXMJMGCSMWXSJOMIQXLIVIQIVIXQSVSTWHKPEGARCSXRWIEVSWIIBXV
IZMXFSJXLIKEGAEWHEPSWYSWIWIEVXLISXLIVXLIRGEPIRQIVIIBGIIHMWYPFLEVHEWHYPSRRFQMXLE
PPXLIECCIEVEWGISJKTVWMRLIHYSPHXLIQIMYLXSJXLIMWRIGXQEROIVFVIZEVAEKPIEWHXEAMWYEPP
XLMWYRMWXSGSWRMHIVEXMSWMGSTPHLEVHPFKPEZINTCMXIVJSVLMRSCMWMSWVIRCIGXMWYMX
Hereupon Legrand arose, with a grave and stately air, and brought me the beetle from a glass case in which it
was enclosed. It was a beautiful scarabaeus, and, at that time, unknown to naturalists—of course a great prize
in a scientific point of view. There were two round black spots near one extremity of the back, and a long one
near the other. The scales were exceedingly hard and glossy, with all the appearance of burnished gold. The
weight of the insect was very remarkable, and, taking all things into consideration, I could hardly blame
Jupiter for his opinion respecting it.
Full explanation can be found on Wikipedia
3. Coding and compression
3.1 Basics of information theory
Generates words randomly
Illustration with an alphabet 𝒳 = {a, b, c, d}

Machine 1 generates words with uniform probability:
pa pb pc pd
0,25 0,25 0,25 0,25

Machine 2 generates words according to the following probability:
pa pb pc pd
0,5 0,125 0,125 0,25

Example output: d d a c d a b

What is the next word? To guess it with yes/no questions, machine 1 needs a balanced decision tree ("Is it a or b?", then "Is it a?" or "Is it c?"): every word is reached after exactly 2 questions.

weight: wa = 2, wb = 2, wc = 2, wd = 2
binary code: ca = 11 (2 bit), cb = 10 (2 bit), cc = 01 (2 bit), cd = 00 (2 bit)

Questions to ask on average:
= pa⋅wa + pb⋅wb + pc⋅wc + pd⋅wd
= 0,25⋅2 + 0,25⋅2 + 0,25⋅2 + 0,25⋅2
= 2 questions

For machine 2, an unbalanced tree is better: ask "Is it a?" first (true half of the time), then "Is it d?", then "Is it b?".

weight: wa = 1, wb = 3, wc = 3, wd = 2
binary code: ca = 1 (1 bit), cb = 001 (3 bit), cc = 000 (3 bit), cd = 01 (2 bit)

Questions to ask on average:
= pa⋅wa + pb⋅wb + pc⋅wc + pd⋅wd
= 0,5⋅1 + 0,125⋅3 + 0,125⋅3 + 0,25⋅2
= 1,75 questions

Machine 2 is producing less information than machine 1.
𝐻(𝒳) = − Σ_{𝑥∈𝒳} 𝑝(𝑥) · log2 𝑝(𝑥)
3. Coding and compression
3.2 Entropy and redundancy
Generates words randomly
Illustration with an alphabet 𝒳 = {a, b, c, d}

Machine 1 generates words with uniform probability:
pa pb pc pd
0,25 0,25 0,25 0,25

Machine 2 generates words according to the following probability:
pa pb pc pd
0,5 0,125 0,125 0,25
Information entropy (H) is the average rate at which
information is produced by a stochastic source of data

Entropy for machine 1:
𝐻(𝒳) = − Σ_{𝑥∈𝒳} 𝑝(𝑥) · log2 𝑝(𝑥)
𝐻(𝒳) = − (𝑝a · log2 𝑝a + 𝑝b · log2 𝑝b + 𝑝c · log2 𝑝c + 𝑝d · log2 𝑝d)
𝐻(𝒳) = − (0,25 · (−2) + 0,25 · (−2) + 0,25 · (−2) + 0,25 · (−2))
𝐻(𝒳) = 2
Entropy for machine 2:
𝐻(𝒳) = − Σ_{𝑥∈𝒳} 𝑝(𝑥) · log2 𝑝(𝑥)
𝐻(𝒳) = − (𝑝a · log2 𝑝a + 𝑝b · log2 𝑝b + 𝑝c · log2 𝑝c + 𝑝d · log2 𝑝d)
𝐻(𝒳) = − (0,5 · (−1) + 0,125 · (−3) + 0,125 · (−3) + 0,25 · (−2))
𝐻(𝒳) = 1,75
(The weight wx of a certain word is the number of yes/no questions, i.e., bits, needed to identify it.)

stochastic: having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely
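The two entropy computations above can be verified with a short Python sketch (`entropy` is our own helper name, not a library function):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

machine1 = [0.25, 0.25, 0.25, 0.25]   # uniform source
machine2 = [0.5, 0.125, 0.125, 0.25]  # skewed source

print(entropy(machine1))  # 2.0
print(entropy(machine2))  # 1.75
```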
3. Coding and compression
3.2 Entropy and redundancy
Generates words randomly
Illustration with an alphabet 𝒳 = {a, b, c, d}

Machine 1:
x                 a     b     c     d
probability (px)  0,25  0,25  0,25  0,25
coding (cx)       00    01    10    11
weight (wx)       2     2     2     2
Entropy: H(𝒳) = 2

Machine 2:
x                 a    b      c      d
probability (px)  0,5  0,125  0,125  0,25
coding (cx)       1    001    000    01
weight (wx)       1    3      3      2
Entropy: H(𝒳) = 1,75
The principal idea is to find the optimal coding so that the average code length is as small as possible, in order to reduce
the number of bits to transmit:

𝐿(𝒳) = Σ_{𝑥∈𝒳} 𝑝(𝑥) · |𝑐(𝑥)|

Average code length: machine 1: L(𝒳) = 2; machine 2: L(𝒳) = 1,75

In the above example, the information transmitted by machine 2 requires fewer actual bits than that of machine 1. The code has been
designed so that fewer bits are used to send more frequent symbols, but still so that it can be unambiguously decoded

A code is optimal if L – H is minimal,
i.e., there is little redundancy
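Average code length and redundancy from the tables above can be checked in a few lines (a sketch; the helper names are ours):

```python
import math

def avg_code_length(probs, lengths):
    """L(X) = sum p(x) * |c(x)|: expected number of bits per symbol."""
    return sum(p * l for p, l in zip(probs, lengths))

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = [0.5, 0.125, 0.125, 0.25]   # machine 2
lengths = [1, 3, 3, 2]          # codes 1, 001, 000, 01
L = avg_code_length(p, lengths)
H = entropy(p)
print(L, H, L - H)  # redundancy L - H = 0 -> the code is optimal
```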
3. Coding and compression
3.2 Entropy and redundancy
Illustration with an alphabet 𝒳 = {a, b, c, d}

Machine 3 generates words according to the following probability: the same values as machine 2, but with a worse coding (more bits are used than necessary):

x                 a    b      c      d
probability (px)  0,5  0,125  0,125  0,25
coding (cx)       000  001    010    011
weight (wx)       3    3      3      3

Entropy: H(𝒳) = 1,75
Average code length: L(𝒳) = 3

Here, the redundancy of machine 3 (L – H = 1,25) is higher than that of machine 2 (L – H = 0), although the probabilities,
i.e., the entropy, remain the same. Therefore, this code is less suitable for transmission.
3. Coding and compression
3.2 Entropy and redundancy
Illustration of flipping {a fair | an unfair} coin
The table shows the flipping of a coin, where the probability 0,5 / 0,5 is the only “fair flipping” and
yields the maximum entropy (1 bit). Every other “unfair flipping” results in a smaller entropy, i.e., the surprise
of getting heads is smaller.
probability           entropy (H)
heads   tails
0       1             0
0,1     0,9           0,468
0,2     0,8           0,721
0,3     0,7           0,881
0,4     0,6           0,970
0,5     0,5           1,000
0,6     0,4           0,970
0,7     0,3           0,881
0,8     0,2           0,721
0,9     0,1           0,468
1       0             0

[Chart: entropy H plotted against the probability of heads; the curve peaks at H = 1 for the fair flipping, i.e., equal probabilities of 0,5.]
The extreme case is that of a double-headed coin that never comes up tails, or a double-tailed coin
that never results in a head. Then there is no uncertainty. The entropy is zero: each toss of the coin
delivers no new information as the outcome of each coin toss is always certain.
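The table values can be reproduced with the binary entropy function (a sketch; the printed values match the table up to rounding):

```python
import math

def coin_entropy(p_heads):
    """Binary entropy of a coin flip with P(heads) = p_heads, in bits."""
    h = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # 0 * log2(0) is taken as 0
            h -= p * math.log2(p)
    return h

for p_heads in (0.0, 0.1, 0.3, 0.5, 1.0):
    print(p_heads, round(coin_entropy(p_heads), 3))
```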
Continue the sequence!

11
21            ← 11 is “two 1”
1211          ← 21 is “one 2 and one 1”
111221        ← 1211 is “one 1 and one 2 and two 1”
312211        ← 111221 is “three 1 and two 2 and one 1”
13112221
1113213211
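This is the “look-and-say” sequence: each term reads the previous one aloud as run lengths, which is essentially run-length encoding applied to itself. A sketch of the generator:

```python
from itertools import groupby

def look_and_say(term):
    """Read runs aloud: '111221' is 'three 1, two 2, one 1' -> '312211'."""
    return "".join(str(len(list(run))) + digit for digit, run in groupby(term))

seq = ["11"]
for _ in range(6):
    seq.append(look_and_say(seq[-1]))
print(seq)  # 11, 21, 1211, 111221, 312211, 13112221, 1113213211
```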
3. Coding and compression
3.2 Entropy and redundancy
Practical exercises
How it works
• Try out the two exercises alone.
• Discuss your results with another student.
• This work is not considered for your final grade.

1. Calculate the entropy for rolling a 6-sided die and do the following steps:
• Give the probabilities for each possible value!
• Calculate the entropy using Shannon’s formula!
• Suggest a code for each word to transmit and compute the redundancy!
• Explain the meaning of the obtained values for the entropy!
• Represent graphically the rolling of a fair and an unfair die!

2. Which symbols have the three highest occurrences in the English language (ignore case sensitivity)?
What would be the weight of the letter E?

https://www.khanacademy.org/computing/computer-science/informationtheory/moderninfotheory/v/information-entropy
3. Coding and compression
Objective: reduce the amount of data

Classification:
                    universal                        specific
                    (can be used for any purpose)    (used for specific applications)
without data loss   Huffman, LZW                     PNG, AIFF
with data loss      –                                JPEG, MP3
3. Coding and compression
3.3 Huffman coding
David Albert Huffman (1925–1999) was a
pioneer in computer science and professor
of computer science at the University of
California, Santa Cruz. He is known for
Huffman coding.
http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf
Huffman, D. (1952). "A Method for the Construction of
Minimum-Redundancy Codes" (PDF). Proceedings of the IRE.
40 (9): 1098–1101. doi:10.1109/JRPROC.1952.273898
Principle: entropy encoding method
Every symbol of the source is represented by a code
The length of the code depends on the frequency of occurrence of
the symbol: frequent symbols get shorter codes than less frequent
symbols
Application: MPEG-1 Audio Layer III (MP3) encoder
3. Coding and compression
3.3 Huffman coding
Input
𝐴 is an alphabet of symbols: 𝐴 = {a1, a2, ..., an}
W is the tuple of symbol weights (usually proportional to probabilities): W = {w1, w2, ..., wn} with wi = weight(ai)
Output
CW is the set of binary codewords over 𝐴: CW = {c1, c2, ..., cn}
Goal
Minimize the average code length 𝐿 = Σi wi · |ci|, i.e., approach the Shannon entropy with minimal redundancy
3. Coding and compression
3.3 Huffman coding
Principle

Machine 2 generates words according to the following probability:
pa pb pc pd
0,5 0,125 0,125 0,25

frequency (proportional to the probability):
wa wb wc wd
4  1  1  2

The Huffman tree is the decision tree from before:
“Is it a?” → yes: a; no: “Is it d?” → yes: d; no: “Is it b?” → yes: b; no: c

binary code:
ca = 1 (1 bit)
cb = 001 (3 bit)
cc = 000 (3 bit)
cd = 01 (2 bit)

How do we build the tree?
http://huffman.ooz.ie/
3. Coding and compression
3.3 Huffman coding
Example
text to send: abadacda
frequency: wa = 4, wb = 1, wc = 1, wd = 2

Example 1: Huffman compressed (codes 1, 001, 000, 01)
w    a  b    a  d   a  c    d   a
c    1  001  1  01  1  000  01  1
|c|  1  3    1  2   1  3    2   1
Length of transmission: 14 bit

Example 2: uncompressed (codes 11, 10, 01, 00)
w    a   b   a   d   a   c   d   a
c    11  10  11  00  11  01  00  11
|c|  2   2   2   2   2   2   2   2
Length of transmission: 16 bit

Compression ratio = 1 − 14/16 = 12,5 %
3. Coding and compression
3.3 Huffman coding
Building the Huffman tree

frequency:
wa wb wc wd
4  1  1  2

golden rules
1) Every word is a node (in brackets is the word frequency)
2) Take the two nodes of lowest frequency and create a branching with a new node, having the sum of all frequencies of the branch
3) If three or more nodes have the same (lowest) frequency, then create multiple sub-branches
4) Repeat steps 2) and 3) until there are no nodes left

Remarks
The tree is built bottom-up
Every code is unique and free of ambiguities

Resulting code:
wa wb wc wd
4  1  1  2
1  001 000 01

https://people.ok.ubc.ca/ylucet/DS/Huffman.html
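The golden rules are exactly Huffman's greedy algorithm. A minimal sketch using a priority queue (the bit patterns may differ from the slides depending on tie-breaking, but the code lengths agree):

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from {symbol: frequency} by repeatedly
    merging the two lowest-frequency nodes, as in the golden rules."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)  # tie-breaker so code dicts are never compared
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}      # left branch: 0
        merged.update({s: "1" + c for s, c in right.items()})  # right branch: 1
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes({"a": 4, "b": 1, "c": 1, "d": 2})
print(codes)  # a gets 1 bit, d gets 2 bits, b and c get 3 bits each
```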
3. Coding and compression
3.3 Huffman coding
Example of decoding
Consider the following Huffman code:
frequency: wa = 4, wb = 1, wc = 1, wd = 2
code: a = 1, b = 001, c = 000, d = 01

What message is sent with the following transmission?
code received: 010000011
split into codewords: 01 | 000 | 001 | 1
decoded message: d c b a
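Decoding works by scanning the bit stream and emitting a symbol as soon as the accumulated bits match a codeword; this is unambiguous because Huffman codes are prefix-free. A sketch:

```python
def huffman_decode(bits, codes):
    """Decode a bit string using a prefix-free code table {symbol: codeword}."""
    inverse = {codeword: symbol for symbol, codeword in codes.items()}
    message, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:          # a full codeword has been read
            message.append(inverse[buffer])
            buffer = ""
    return "".join(message)

codes = {"a": "1", "b": "001", "c": "000", "d": "01"}
print(huffman_decode("010000011", codes))  # prints "dcba", as on the slide
```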
3. Coding and compression
3.3 Huffman coding
Practical exercises
How it works
• Try out the two exercises alone.
• Discuss your results with another student.
• This work is not considered for your final grade.

1. Based on the following dictionary, what is the original message that was broadcast using Huffman coding:
1011100101001110010101001011001001000011010011
Calculate the compression ratio against a classical 8-bit ASCII encoding of the same message!
3. Coding and compression
3.3 Huffman coding
Practical exercises
How it works
• Try out the two exercises alone.
• Discuss your results with another student.
• This work is not considered for your final grade.

2. Based on the following frequencies:
• Create the corresponding codes for each symbol!
• Compute the compression ratio to send the text “ecaabae” using Huffman compression against an
uncompressed code of 3 bit!
https://www.dcode.fr/codage-huffman-compression
3. Coding and compression
3.3 Huffman coding
Practical exercises - solution
2.
https://cs.nyu.edu/courses/fall09/V22.0102-002/lectures/Huffman.pdf
3. Coding and compression
3.3 Huffman coding
Practical exercises
3. Consider the text “go go gophers”.
• Create the table of frequencies!
• Encode the text using a Huffman code!
• Compute the compression ratio against an uncompressed code with minimal length!
3. Coding and compression
3.3 Huffman coding
Practical exercises - solution
3. Consider the text “go go gophers”.
• Create the table of frequencies!
• Encode the text using a Huffman code!
• Compute the compression ratio against an uncompressed code with minimal length!
https://www2.cs.duke.edu/csed/poop/huff/info/
3. Coding and compression
3.4 Run-length encoding (RLE)
Principle
Very simple lossless data compression
Based on replacing sequences of identical symbols by a code
Application: bitmap formats such as BMP and PCX, fax machines (ITU-T T.45)
RLE is useful for highly redundant data, e.g., indexed images with many pixels of the same color in a row, or in combination with
other compression techniques
Example
Message: aabbbbbeedddddddddddb
Runs: (a,2) (b,5) (e,2) (d,11) (b,1)
Encoding: different representations are possible
• 2a5b2e11d1b → can cause problems when decoding the frequency
• #a2#b5#e2#d11#b1 → escape character used
• aa2bb5ee2dd11b → any time a character appears twice, it denotes a run
• (a,b,e,d,b) (2,5,2,11,1) → two separate vectors: one for the symbols and one for the frequencies
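The run extraction above can be sketched in a few lines, using the (symbol, length) pair representation:

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical symbols into a (symbol, length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(text)]

def rle_decode(runs):
    """Inverse operation: expand each pair back into a run."""
    return "".join(symbol * length for symbol, length in runs)

runs = rle_encode("aabbbbbeedddddddddddb")
print(runs)  # [('a', 2), ('b', 5), ('e', 2), ('d', 11), ('b', 1)]
```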
3. Coding and compression
3.4 Run-length encoding (RLE)
Practical exercise
The following bitmap image has a size of 15 x 15 pixels. Each pixel can be white, red or black.
How it works
• Try out the exercise alone.
• Discuss your results with another student.
• This work is not considered for your final grade.
Compare the uncompressed size against an RLE compression!
How effective would a Huffman compression be for this purpose?
http://www.xiconeditor.com/
3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Principle
Universal lossless data compression
Published in 1984 by T. Welch as an improvement of the LZ78
algorithm published by A. Lempel & J. Ziv in 1978
Application: Graphics Interchange Format (GIF), Unix file compression utility
Abraham Lempel
(1936-)
Yaakov Ziv
(1931-)
Terry Welch
(1939-1988)
Simple, and widely used in hardware implementations where very
high throughput is needed
Based on replacing recurring sequences by a code, building up a
dictionary of encountered sequences along the way
3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example: encode the message bananenbau

golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code:
Item  Code
a     1
b     2
...
z     26
2) Take the current symbol:
• if the current sequence extended by this symbol is in the dictionary, then extend the sequence
• otherwise, emit the code of the current sequence and add the extended sequence to the dictionary

Step by step (b a n a n e n b a u):
read b: “b” is in the dictionary → extend sequence
read a: “ba” is not in the dictionary → emit 2 (b), add ba = 27
read n: “an” is not in the dictionary → emit 1 (a), add an = 28
read a: “na” is not in the dictionary → emit 14 (n), add na = 29
read n: “an” is in the dictionary → extend sequence
read e: “ane” is not in the dictionary → emit 28 (an), add ane = 30
read n: “en” is not in the dictionary → emit 5 (e), add en = 31
read b: “nb” is not in the dictionary → emit 14 (n), add nb = 32
read a: “ba” is in the dictionary → extend sequence
read u: “bau” is not in the dictionary → emit 27 (ba), add bau = 33
STOP – no more symbols → emit 21 (u)

coded message: 2 1 14 28 5 14 27 21
decoded segments: b a n an e n ba u
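The whole walkthrough can be reproduced with a short LZW encoder (a sketch using the same initial dictionary a = 1 ... z = 26 as the slides):

```python
def lzw_compress(text):
    """LZW encoding: extend the current sequence while it is known;
    on a miss, emit the code of the known prefix and learn the new sequence."""
    dictionary = {chr(ord("a") + i): i + 1 for i in range(26)}  # a=1 ... z=26
    next_code = 27
    sequence, output = "", []
    for symbol in text:
        if sequence + symbol in dictionary:
            sequence += symbol                         # extend sequence
        else:
            output.append(dictionary[sequence])        # emit known prefix
            dictionary[sequence + symbol] = next_code  # add to dictionary
            next_code += 1
            sequence = symbol
    output.append(dictionary[sequence])                # flush the last sequence
    return output

print(lzw_compress("bananenbau"))  # [2, 1, 14, 28, 5, 14, 27, 21]
```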
Example of decoding
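Decoding rebuilds the same dictionary on the fly; the only subtlety is a code the decoder has not defined yet, which must then be the previous sequence plus its own first character. A sketch, complementary to the encoder above:

```python
def lzw_decompress(codes):
    """Rebuild the text from LZW codes, growing the dictionary as we go."""
    dictionary = {i + 1: chr(ord("a") + i) for i in range(26)}  # 1=a ... 26=z
    next_code = 27
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                 # code defined in this very step
            entry = previous + previous[0]
        output.append(entry)
        dictionary[next_code] = previous + entry[0]  # mirror the encoder
        next_code += 1
        previous = entry
    return "".join(output)

print(lzw_decompress([2, 1, 14, 28, 5, 14, 27, 21]))  # prints "bananenbau"
```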
3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Practical exercise
The following image has a resolution of 4 x 4 pixels. Each pixel can be white, black, blue or yellow.
1.
How it works
• Try out the exercise alone.
• Discuss your results with another student.
• This work is not considered for your final grade.
Compress the image according to the LZW coding algorithm!
Compare the uncompressed size against an RLE and a Huffman compression!
3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Practical exercise
Calculate the LZW code for the message bobobobowebewe and give the full dictionary!
2.
How it works
• Try out the exercise alone.
• Discuss your results with another student.
• This work is not considered for your final grade.
3. Coding and compression
3.6 The mysterious case of the Xerox scanners (2013)
The main actors
Xerox WorkCentre line scanners,
which randomly alter written
numbers on scanned pages
David Kriesel at the Chaos
Communication Congress (31C3)
in Hamburg on 29 December
2014
http://www.dkriesel.com/
What went wrong (test set)
original data (Arial, 7pt) scan result
Overview
24 July: D. Kriesel informed Xerox about the case
6 August: Xerox announced that this is not a bug
12 August: Xerox confirmed that hundreds of thousands of
devices world-wide are affected due to a software bug
introduced eight years earlier
22 August: first patches for different devices released
3. Coding and compression
3.6 The mysterious case of the Xerox scanners (2013)
Explaining the bug
https://www.youtube.com/watch?v=7FeqF1-Z1g0
Image to be scanned
JBIG2: compression standard that segments the input page
into regions (patches) of text and images
• Patch 1: the image is compressed, e.g., as JPEG
• Patches 2–4: text is compressed after symbol recognition
• all the rest (white space) does not belong to any patch
Pattern matching: resolve similar patches, e.g., all the
letters “e” → store just one occurrence and re-use the same
pattern
Due to optimization and imprecise pattern
matching, errors can occur
D3, TypeScript, and Deep LearningOswald Campesato
 

Similar to Media IT - Entropy (20)

Homomorphic encryption on Blockchain Principles
Homomorphic encryption on Blockchain PrinciplesHomomorphic encryption on Blockchain Principles
Homomorphic encryption on Blockchain Principles
 
Foundations of Statistics for Ecology and Evolution. 4. Maximum Likelihood
Foundations of Statistics for Ecology and Evolution. 4. Maximum LikelihoodFoundations of Statistics for Ecology and Evolution. 4. Maximum Likelihood
Foundations of Statistics for Ecology and Evolution. 4. Maximum Likelihood
 
C++ and Deep Learning
C++ and Deep LearningC++ and Deep Learning
C++ and Deep Learning
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
DieHard: Probabilistic Memory Safety for Unsafe Languages
DieHard: Probabilistic Memory Safety for Unsafe LanguagesDieHard: Probabilistic Memory Safety for Unsafe Languages
DieHard: Probabilistic Memory Safety for Unsafe Languages
 
Rsa Signature: Behind The Scenes
Rsa Signature: Behind The Scenes Rsa Signature: Behind The Scenes
Rsa Signature: Behind The Scenes
 
Type and proof structures for concurrency
Type and proof structures for concurrencyType and proof structures for concurrency
Type and proof structures for concurrency
 
Media IT - Natural Language Processing
Media IT - Natural Language ProcessingMedia IT - Natural Language Processing
Media IT - Natural Language Processing
 
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked DataDedalo, looking for Cluster Explanations in a labyrinth of Linked Data
Dedalo, looking for Cluster Explanations in a labyrinth of Linked Data
 
nlp dl 1.pdf
nlp dl 1.pdfnlp dl 1.pdf
nlp dl 1.pdf
 
A Short Course in Data Stream Mining
A Short Course in Data Stream MiningA Short Course in Data Stream Mining
A Short Course in Data Stream Mining
 
Life Is Great
Life Is GreatLife Is Great
Life Is Great
 
RSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENESRSA SIGNATURE: BEHIND THE SCENES
RSA SIGNATURE: BEHIND THE SCENES
 
A Signature Algorithm Based On Chaotic Maps And Factoring Problems
A Signature Algorithm Based On Chaotic Maps And Factoring ProblemsA Signature Algorithm Based On Chaotic Maps And Factoring Problems
A Signature Algorithm Based On Chaotic Maps And Factoring Problems
 
PostgreSQL: Joining 1 million tables
PostgreSQL: Joining 1 million tablesPostgreSQL: Joining 1 million tables
PostgreSQL: Joining 1 million tables
 
Project 2: Baseband Data Communication
Project 2: Baseband Data CommunicationProject 2: Baseband Data Communication
Project 2: Baseband Data Communication
 
ML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive AnalyticsML+Hadoop at NYC Predictive Analytics
ML+Hadoop at NYC Predictive Analytics
 
Lec_8_Image Compression.pdf
Lec_8_Image Compression.pdfLec_8_Image Compression.pdf
Lec_8_Image Compression.pdf
 
D3, TypeScript, and Deep Learning
D3, TypeScript, and Deep LearningD3, TypeScript, and Deep Learning
D3, TypeScript, and Deep Learning
 

More from Serge Linckels

Media IT - XML and XML Transformation (XSLT)
Media IT - XML and XML Transformation (XSLT)Media IT - XML and XML Transformation (XSLT)
Media IT - XML and XML Transformation (XSLT)Serge Linckels
 
Media IT - XML and sublanguages
Media IT - XML and sublanguagesMedia IT - XML and sublanguages
Media IT - XML and sublanguagesSerge Linckels
 
Media IT - author rights
Media IT - author rightsMedia IT - author rights
Media IT - author rightsSerge Linckels
 
Semantic Web - Search engines
Semantic Web - Search enginesSemantic Web - Search engines
Semantic Web - Search enginesSerge Linckels
 
Semantic Web - Ontologies
Semantic Web - OntologiesSemantic Web - Ontologies
Semantic Web - OntologiesSerge Linckels
 
Semantic Web - XML and sublanguages
Semantic Web - XML and sublanguagesSemantic Web - XML and sublanguages
Semantic Web - XML and sublanguagesSerge Linckels
 
Semantic Web - Overview
Semantic Web - OverviewSemantic Web - Overview
Semantic Web - OverviewSerge Linckels
 
Semantic Web - Introduction
Semantic Web - IntroductionSemantic Web - Introduction
Semantic Web - IntroductionSerge Linckels
 

More from Serge Linckels (13)

Media IT - XML and XML Transformation (XSLT)
Media IT - XML and XML Transformation (XSLT)Media IT - XML and XML Transformation (XSLT)
Media IT - XML and XML Transformation (XSLT)
 
Media IT - XML and sublanguages
Media IT - XML and sublanguagesMedia IT - XML and sublanguages
Media IT - XML and sublanguages
 
Media IT - author rights
Media IT - author rightsMedia IT - author rights
Media IT - author rights
 
Media IT - Images
Media IT - ImagesMedia IT - Images
Media IT - Images
 
Media IT - Coding
Media IT - CodingMedia IT - Coding
Media IT - Coding
 
Semantic Web - Search engines
Semantic Web - Search enginesSemantic Web - Search engines
Semantic Web - Search engines
 
Semantic Web - OWL
Semantic Web - OWLSemantic Web - OWL
Semantic Web - OWL
 
Semantic Web - Ontologies
Semantic Web - OntologiesSemantic Web - Ontologies
Semantic Web - Ontologies
 
Semantic Web - RDF
Semantic Web - RDFSemantic Web - RDF
Semantic Web - RDF
 
Semantic Web - XML and sublanguages
Semantic Web - XML and sublanguagesSemantic Web - XML and sublanguages
Semantic Web - XML and sublanguages
 
Semantic Web - Overview
Semantic Web - OverviewSemantic Web - Overview
Semantic Web - Overview
 
Semantic Web - Introduction
Semantic Web - IntroductionSemantic Web - Introduction
Semantic Web - Introduction
 
E-Librarian Service
E-Librarian ServiceE-Librarian Service
E-Librarian Service
 

Recently uploaded

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 

Recently uploaded (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 

Media IT - Entropy
  • 4. 3. Coding and compression
    3.1 Basics of information theory
    11 November 2017
    One of Europe's smallest countries now holds claim to being a giant in the space industry. Luxembourg, with a population less than the state of Vermont, now generates nearly 2 percent of its annual gross domestic product from the space industry, according to Deputy Prime Minister Etienne Schneider. The country's economy checked in just shy of $61 billion in 2016, according to the CIA World Factbook. "We have grown from nothing to the most dynamic in Europe," Schneider told an audience Saturday, in a speech at the New Worlds conference in Austin, Texas. He added that the country's space program was first launched just over 30 years ago. Schneider, who also serves as Luxembourg's economic minister, told the conference that he is often questioned about why Luxembourg is so "keen on exploiting space resources." He replied by saying the same "liberal, extremely business friendly climate" that pushed the country's financial sector boom is now being reapplied to attracting space companies. "I have more than 70 space companies in the pipeline," Schneider told CNBC after the speech. Luxembourg's "space resources initiative" is the country's plan to make the most out of a quickly growing global industry, the minister said. "It's a series of measures to position Luxembourg as the European heart of exploration and use of space resources."
    Observations:
    • The word space is frequently used in this source, i.e., it has a higher occurrence than other words, e.g., Texas (which appears just once)
    • The probability that the word space re-appears in a next section (not visible on this slide) is very high
    • Therefore, the word space has a higher "importance" in this source than other words, e.g., Texas. This is called self-information or surprisal
  • 5. 3. Coding and compression
    3.1 Basics of information theory
    Computational linguistics basics
    Claude Elwood Shannon (1916–2001) was an American mathematician, electrical engineer, and cryptographer known as "the father of information theory"
    𝐴 is an alphabet, i.e., a non-empty set of symbols: 𝐴 = {a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z}
    𝐴* is the set of all possible words over this alphabet: 𝐴* = {a, aaa, oxo, jimi, house, haus, maison, abc, bdef, …}
    The language 𝐸 ⊆ 𝐴* contains all English words: 𝐸 = {a, and, are, house, …}
    𝑎 ∈ 𝐸 is a word of the English language, e.g., 𝑎 = house
    |𝑎| is the length of the word 𝑎, e.g., |house| = 5
    𝑝(𝑎) ∈ ℝ is the probability of occurrence of the word 𝑎 in the English language, with 0 ≤ 𝑝(𝑎) ≤ 1
    The average word length is written Σ_{𝑎∈𝐸} 𝑝(𝑎) · |𝑎|
    The sum of all probabilities is 1: Σ_{𝑎∈𝐸} 𝑝(𝑎) = 1
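The definitions above can be checked with a small sketch. The toy language and its word probabilities below are invented for illustration; they only need to sum to 1.

```python
# Hypothetical toy language E with invented probabilities p(a).
E = {"a": 0.2, "and": 0.3, "are": 0.1, "house": 0.4}

# The sum of all probabilities must be 1.
total = sum(E.values())

# Average word length: sum over a in E of p(a) * |a|
avg_len = sum(p * len(a) for a, p in E.items())
```

For these numbers the average word length works out to 0.2·1 + 0.3·3 + 0.1·3 + 0.4·5 = 3.4 symbols.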
  • 6. 3. Coding and compression
    3.1 Basics of information theory
    Computational linguistics basics
    Word frequency depends on the corpus you analyze
    https://www.wordfrequency.info/
    https://www.sketchengine.eu/
  • 7. 3. Coding and compression
    3.1 Basics of information theory
    Computational linguistics basics
  • 8. 3. Coding and compression
    3.1 Basics of information theory
    Computational linguistics basics
    Example of application: cryptanalysis
    A typical distribution of letters in English language text. Weak ciphers do not sufficiently mask the distribution, and this might be exploited by a cryptanalyst to read the message.
    Colossus: British machine developed during WW2 to help codebreakers break the Lorenz cipher (the Enigma machine was attacked with the electromechanical Bombe)
    Ciphertext:
    LIVITCSWPIYVEWHEVSRIQMXLEYVEOIEWHRXEXIPFEMVEWHKVSTYLXZIXLIKIIXPIJVSZEYPERRGERIM
    WQLMGLMXQERIWGPSRIHMXQEREKIETXMJTPRGEVEKEITREWHEXXLEXXMZITWAWSQWXSWEXTVEPMRXRSJ
    GSTVRIEYVIEXCVMUIMWERGMIWXMJMGCSMWXSJOMIQXLIVIQIVIXQSVSTWHKPEGARCSXRWIEVSWIIBXV
    IZMXFSJXLIKEGAEWHEPSWYSWIWIEVXLISXLIVXLIRGEPIRQIVIIBGIIHMWYPFLEVHEWHYPSRRFQMXLE
    PPXLIECCIEVEWGISJKTVWMRLIHYSPHXLIQIMYLXSJXLIMWRIGXQEROIVFVIZEVAEKPIEWHXEAMWYEPP
    XLMWYRMWXSGSWRMHIVEXMSWMGSTPHLEVHPFKPEZINTCMXIVJSVLMRSCMWMSWVIRCIGXMWYMX
    Plaintext:
    Hereupon Legrand arose, with a grave and stately air, and brought me the beetle from a glass case in which it was enclosed. It was a beautiful scarabaeus, and, at that time, unknown to naturalists—of course a great prize in a scientific point of view. There were two round black spots near one extremity of the back, and a long one near the other. The scales were exceedingly hard and glossy, with all the appearance of burnished gold. The weight of the insect was very remarkable, and, taking all things into consideration, I could hardly blame Jupiter for his opinion respecting it.
    Full explanation can be found on Wikipedia
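Frequency analysis of this kind starts from simple letter counts. A minimal sketch (the sample sentence is taken from the plaintext above; in English text, e is expected to dominate):

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each letter, ignoring case and non-letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    return {ch: count / n for ch, count in Counter(letters).items()}

freqs = letter_frequencies("The scales were exceedingly hard and glossy")
most_common = max(freqs, key=freqs.get)
```

Running the same count on the ciphertext and matching its most frequent symbols against the known English distribution is exactly the attack that breaks a weak substitution cipher.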
  • 9. 3. Coding and compression
    3.1 Basics of information theory
    Illustration with an alphabet 𝒳 = {a, b, c, d}: what is the next word?
    Machine 1 generates words randomly (uniform probabilities):
      pa = pb = pc = pd = 0,25
      Guessing strategy: "Is it a or b?", then "Is it a?" / "Is it c?"; every word takes 2 questions
      weight: wa = 2, wb = 2, wc = 2, wd = 2
      binary code: ca = 11 (2 bit), cb = 10 (2 bit), cc = 01 (2 bit), cd = 00 (2 bit)
      Questions to ask on average: pa⋅wa + pb⋅wb + pc⋅wc + pd⋅wd = 0,25⋅2 + 0,25⋅2 + 0,25⋅2 + 0,25⋅2 = 2 questions
    Machine 2 generates words according to the following probabilities:
      pa = 0,5; pb = 0,125; pc = 0,125; pd = 0,25
      Guessing strategy: "Is it a?"; if not, "Is it d?"; if not, "Is it b?" (otherwise c)
      weight: wa = 1, wb = 3, wc = 3, wd = 2
      binary code: ca = 1 (1 bit), cb = 001 (3 bit), cc = 000 (3 bit), cd = 01 (2 bit)
      Questions to ask on average: pa⋅wa + pb⋅wb + pc⋅wc + pd⋅wd = 0,5⋅1 + 0,125⋅3 + 0,125⋅3 + 0,25⋅2 = 1,75 questions
    Machine 2 is producing less information than machine 1
  • 10. 3. Coding and compression
    3.2 Entropy and redundancy
    Information entropy (H) is the average rate at which information is produced by a stochastic source of data
    (stochastic: having a random probability distribution or pattern that may be analyzed statistically but may not be predicted precisely)
    𝐻(𝒳) = − Σ_{𝑥∈𝒳} 𝑝(𝑥) · log2 𝑝(𝑥)
    (−log2 𝑝(𝑥) = log2 (1/𝑝(𝑥)) is the weight of a certain word)
    Illustration with an alphabet 𝒳 = {a, b, c, d}
    Entropy for machine 1 (pa = pb = pc = pd = 0,25):
      𝐻(𝒳) = −(𝑝a·log2 𝑝a + 𝑝b·log2 𝑝b + 𝑝c·log2 𝑝c + 𝑝d·log2 𝑝d)
      𝐻(𝒳) = −(0,25·(−2) + 0,25·(−2) + 0,25·(−2) + 0,25·(−2))
      𝐻(𝒳) = 2
    Entropy for machine 2 (pa = 0,5; pb = 0,125; pc = 0,125; pd = 0,25):
      𝐻(𝒳) = −(0,5·(−1) + 0,125·(−3) + 0,125·(−3) + 0,25·(−2))
      𝐻(𝒳) = 1,75
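Shannon's formula can be checked numerically. A minimal sketch, using the two machines from the slides:

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p * log2 p (terms with p == 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

machine1 = [0.25, 0.25, 0.25, 0.25]
machine2 = [0.5, 0.125, 0.125, 0.25]

H1 = entropy(machine1)   # 2 bits per word
H2 = entropy(machine2)   # 1.75 bits per word
```

The values match the average number of yes/no questions computed on the previous slide, which is no coincidence: an optimal guessing strategy asks exactly one question per bit of entropy.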
  • 11. 3. Coding and compression
    3.2 Entropy and redundancy
    Illustration with an alphabet 𝒳 = {a, b, c, d}
    Machine 1 (generates words randomly):
      x                 a     b     c     d
      probability (px)  0,25  0,25  0,25  0,25
      coding (cx)       00    01    10    11
      weight (wx)       2     2     2     2
      Entropy: H(𝒳) = 2; average code length: L(𝒳) = 2
    Machine 2 (generates words according to the probabilities below):
      x                 a     b     c     d
      probability (px)  0,5   0,125 0,125 0,25
      coding (cx)       1     001   000   01
      weight (wx)       1     3     3     2
      Entropy: H(𝒳) = 1,75; average code length: L(𝒳) = 1,75
    The principal idea is to find the optimal coding so that the average length of the code is as small as possible, in order to reduce the number of bits to transmit
    Average code length: 𝐿 = Σ_{𝑥∈𝒳} 𝑝(𝑥) · |𝑐(𝑥)|
    In the above example, information that is transmitted by machine 2 requires fewer actual bits than machine 1. The code has been designed so that fewer bits are used to send more frequent symbols, but still so that it can be unambiguously decoded
    A code is optimal if L − H is minimal, i.e., little redundancy
  • 12. 3. Coding and compression
    3.2 Entropy and redundancy
    Illustration with an alphabet 𝒳 = {a, b, c, d}
    Machine 3: same probabilities as machine 2 but with a worse coding (more bits are used than necessary)
      x                 a     b     c     d
      probability (px)  0,5   0,125 0,125 0,25
      coding (cx)       000   001   010   011
      weight (wx)       3     3     3     3
    Entropy: H(𝒳) = 1,75
    Average code length: L(𝒳) = 3
    Here, the redundancy for machine 3 (L − H = 1,25) is higher than for the coding of machine 2 (L − H = 0), although the probabilities, i.e., the entropy, remain the same. Therefore, this code is less optimal for transmission.
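Average code length and redundancy for the two codings can be verified with a short sketch. The 3-bit codewords for machine 3 below are one possible wasteful fixed-length coding, chosen as an example:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def avg_code_length(probs, codes):
    """L = sum over x of p(x) * |c(x)|"""
    return sum(p * len(c) for p, c in zip(probs, codes))

probs  = [0.5, 0.125, 0.125, 0.25]       # machines 2 and 3: same probabilities
codes2 = ["1", "001", "000", "01"]       # machine 2: optimal variable-length code
codes3 = ["000", "001", "010", "011"]    # machine 3: wasteful fixed-length code

H  = entropy(probs)                       # 1.75
L2 = avg_code_length(probs, codes2)       # 1.75 -> redundancy L - H = 0
L3 = avg_code_length(probs, codes3)       # 3.0  -> redundancy L - H = 1.25
```

The entropy H is a lower bound: no uniquely decodable code can achieve an average length below it, and machine 2 meets the bound exactly.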
  • 13. 3. Coding and compression
    3.2 Entropy and redundancy
    Illustration of flipping {a fair | an unfair} coin
    The table shows the flipping of a coin where the probability 0,5 / 0,5 is the only "fair flipping" and maximizes the entropy (1 bit). Every other "unfair flipping" results in a smaller entropy, i.e., the surprise of getting heads is smaller.
      p(heads)  p(tails)  entropy (H)
      0         1         0
      0,1       0,9       0,468
      0,2       0,8       0,721
      0,3       0,7       0,881
      0,4       0,6       0,970
      0,5       0,5       1,000
      0,6       0,4       0,970
      0,7       0,3       0,881
      0,8       0,2       0,721
      0,9       0,1       0,468
      1         0         0
    (Graph: entropy plotted over probability; the maximum lies at the fair flipping, i.e., equal probabilities)
    The extreme case is that of a double-headed coin that never comes up tails, or a double-tailed coin that never results in a head. Then there is no uncertainty. The entropy is zero: each toss of the coin delivers no new information as the outcome of each coin toss is always certain.
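The entropy column of the table can be reproduced with the binary entropy function. A minimal sketch:

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a coin with P(heads) = p."""
    if p in (0.0, 1.0):
        return 0.0          # no uncertainty, no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

For example, binary_entropy(0.5) gives the maximum of 1 bit, while binary_entropy(0.2) gives about 0.722 bits (the table rounds these values to three decimals).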
  • 14. Continue the sequence! (each line describes the previous one)
      1
      11          "one 1"
      21          "two 1"
      1211        "one 2 and one 1"
      111221      "one 1 and one 2 and two 1"
      312211      "three 1 and two 2 and one 1"
      13112221
      1113213211
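This is the look-and-say sequence: each term is a run-length description of the previous one, which previews the run-length encoding idea of section 3.4. A minimal generator, sketched with itertools.groupby:

```python
from itertools import groupby

def look_and_say(term):
    """Read the digits aloud: '1211' -> 'one 1, one 2, two 1' -> '111221'."""
    return "".join(str(len(list(group))) + digit for digit, group in groupby(term))

seq = ["1"]
for _ in range(7):
    seq.append(look_and_say(seq[-1]))
# seq == ['1', '11', '21', '1211', '111221', '312211', '13112221', '1113213211']
```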
  • 15. 3. Coding and compression
    3.2 Entropy and redundancy
    Practical exercises
    How it works
    • Try out the two exercises alone
    • Discuss your results with another student
    • This work is not considered for your final grade
    1. Calculate the entropy for rolling a 6-sided die and do the following steps:
      • Give the probabilities for each possible value!
      • Calculate the entropy using Shannon's formula!
      • Suggest a code for each word to transmit and compute the redundancy!
      • Explain the meaning of the obtained values for the entropy!
      • Represent graphically the rolling of a fair and an unfair die!
    2. Which symbols have the three highest occurrences in the English language (ignore case sensitivity)? What would be the weight of the letter E?
    https://www.khanacademy.org/computing/computer-science/informationtheory/moderninfotheory/v/information-entropy
  • 17. 3. Coding and compression
    Objective: reduce the amount of data
    Classification:
    • without data loss
      • universal (can be used for any purpose): Huffman, LZW
      • specific (used for specific applications): PNG, AIFF
    • with data loss
      • specific: JPEG, MP3
  • 18. 3. Coding and compression
    3.3 Huffman coding
    David Albert Huffman (1925–1999) was a pioneer in computer science and professor of computer science at the University of California, Santa Cruz. He is known for his Huffman coding.
    Principle: entropy encoding method
    • Every symbol of the source is represented by a code
    • The length of the code depends on the frequency of occurrence of the symbol; frequent symbols have shorter codes than less frequent symbols
    Application: MPEG-Layer III (MP3) encoder
    Huffman, D. (1952). "A Method for the Construction of Minimum-Redundancy Codes" (PDF). Proceedings of the IRE. 40 (9): 1098–1101. doi:10.1109/JRPROC.1952.273898
    http://compression.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf
  • 19. 3. Coding and compression
    3.3 Huffman coding
    Input
    • 𝐴 is an alphabet of symbols: 𝐴 = {a1, a2, …, an}
    • W is the tuple of symbol weights (usually proportional to probabilities): W = {w1, w2, …, wn} with wi = weight(ai)
    Output
    • CW is the set of binary codewords over 𝐴: CW = {c1, c2, …, cn}
    Goal
    • Find the code with minimal redundancy, i.e., minimize 𝐿 = Σ_{i=1..n} wi · |ci| according to Shannon entropy
  • 20. 3. Coding and compression
    3.3 Huffman coding
    Principle
    Machine 2 generates words according to the following probabilities: pa = 0,5; pb = 0,125; pc = 0,125; pd = 0,25
    Frequencies: wa = 4, wb = 1, wc = 1, wd = 2
    Huffman tree (as a chain of questions): "Is it a?"; if not, "Is it d?"; if not, "Is it b?" (otherwise c)
    Binary code: ca = 1 (1 bit), cb = 001 (3 bit), cc = 000 (3 bit), cd = 01 (2 bit)
    How do we build the tree?
    http://huffman.ooz.ie/
  • 21. 3. Coding and compression
3.3 Huffman coding
Example text to send: abadacda
Example 1: Huffman compressed
frequency: wa = 4, wb = 1, wc = 1, wd = 2; code: a = 1, b = 001, c = 000, d = 01
a b a d a c d a → 1 001 1 01 1 000 01 1 (lengths 1 3 1 2 1 3 2 1)
Length of transmission: 14 bits
Example 2: uncompressed
code: a = 11, b = 10, c = 01, d = 00
a b a d a c d a → 11 10 11 00 11 01 00 11 (2 bits each)
Length of transmission: 16 bits
Compression ratio = 1 − 14/16 = 12.5%
  • 22. 3. Coding and compression
3.3 Huffman coding
Building the Huffman tree
frequency: wa = 4, wb = 1, wc = 1, wd = 2
golden rules
1) Every word is a node (the word frequency is given in brackets)
2) Take the two nodes of lowest frequency and create a branching with a new node, having the sum of all frequencies of the branch
https://people.ok.ubc.ca/ylucet/DS/Huffman.html
  • 23. 3. Coding and compression
3.3 Huffman coding
Building the Huffman tree
frequency: wa = 4, wb = 1, wc = 1, wd = 2
golden rules
1) Every word is a node (the word frequency is given in brackets)
2) Take the two nodes of lowest frequency and create a branching with a new node, having the sum of all frequencies of the branch
3) If three or more nodes have the same (lowest) frequency, then create multiple sub-branches
4) Repeat steps 2) and 3) until there are no nodes left
https://people.ok.ubc.ca/ylucet/DS/Huffman.html
  • 26. 3. Coding and compression
3.3 Huffman coding
Building the Huffman tree
frequency: wa = 4, wb = 1, wc = 1, wd = 2; resulting code: a = 1, b = 001, c = 000, d = 01
golden rules
1) Every word is a node (the word frequency is given in brackets)
2) Take the two nodes of lowest frequency and create a branching with a new node, having the sum of all frequencies of the branch
3) If three or more nodes have the same (lowest) frequency, then create multiple sub-branches
4) Repeat steps 2) and 3) until there are no nodes left
Remarks
The tree is built bottom-up
Every codeword is unique and prefix-free (no codeword is the prefix of another), so decoding is free of ambiguities
https://people.ok.ubc.ca/ylucet/DS/Huffman.html
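The golden rules above can be sketched in a few lines of Python using a min-heap (the function name `huffman_codes` and the dict-based tree representation are illustrative, not part of the slides). Note that swapping which branch is labelled 0 and which 1 yields a different but equally valid code, so the bits may be the complement of those on the slides; the code lengths are the same.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code from a {symbol: frequency} dict.
    Returns {symbol: bit string}."""
    tick = count()  # tie-breaker so the heap never compares dicts
    heap = [(w, next(tick), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # golden rule 2: merge the two nodes of lowest frequency
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tick), merged))
    return heap[0][2]

codes = huffman_codes({"a": 4, "b": 1, "c": 1, "d": 2})
# code lengths match the slides: |ca| = 1, |cd| = 2, |cb| = |cc| = 3,
# so "abadacda" again costs 14 bits
```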
  • 27. 3. Coding and compression
3.3 Huffman coding
Example of decoding
Consider the following Huffman code:
frequency: wa = 4, wb = 1, wc = 1, wd = 2; code: a = 1, b = 001, c = 000, d = 01
What message is sent with the following transmission?
code received: 010000011 → 01 000 001 1 → d c b a
  • 28. 3. Coding and compression
3.3 Huffman coding
Practical exercises
1. Based on the following dictionary, what is the original message that was broadcast using Huffman coding: 1011100101001110010101001011001001000011010011?
Calculate the compression ratio against a classical 8-bit ASCII encoding of the same message!
How it works
• Try out the two exercises alone.
• Discuss your results with another student.
• This work is not considered for your final grade.
  • 29. 3. Coding and compression
3.3 Huffman coding
Practical exercises
2. Based on the following frequencies:
• Create the corresponding codes for each symbol
• Compute the compression ratio to send the text “ecaabae” using Huffman compression against an uncompressed code of 3 bits
How it works
• Try out the two exercises alone.
• Discuss your results with another student.
• This work is not considered for your final grade.
https://www.dcode.fr/codage-huffman-compression
  • 30. 3. Coding and compression
3.3 Huffman coding
Practical exercises - solution
2. https://cs.nyu.edu/courses/fall09/V22.0102-002/lectures/Huffman.pdf
  • 31. 3. Coding and compression
3.3 Huffman coding
Practical exercises
3. Consider the text “go go gophers”.
• Create the table of frequencies!
• Encode the text using a Huffman code!
• Compute the compression ratio against an uncompressed code with minimal length!
  • 32. 3. Coding and compression
3.3 Huffman coding
Practical exercises - solution
3. Consider the text “go go gophers”.
• Create the table of frequencies!
• Encode the text using a Huffman code!
• Compute the compression ratio against an uncompressed code with minimal length!
https://www2.cs.duke.edu/csed/poop/huff/info/
  • 33. 3. Coding and compression
3.4 Run-length encoding (RLE)
Principle
Very simple lossless data compression
Based on replacing sequences of identical symbols by a code
Application: Graphics Interchange Format (GIF), fax machines (T.45)
RLE is useful for highly redundant data, indexed images with many pixels of the same color in a row, or in combination with other compression techniques
Example
Message: aabbbbbeedddddddddddb
Runs: (a,2) (b,5) (e,2) (d,11) (b,1)
Encoding: different representations are possible
• 2a5b2e11d1b → can cause problems in decoding (is a character a symbol or a frequency?)
• #a2#b5#e2#d11#b1 → escape character used
• aa2bb5ee2dd11b → any time a character appears twice it denotes a run
• (a,b,e,d,b) (2,5,2,11,1) → two separate vectors: one for the symbols and one for the frequencies
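The run extraction above can be sketched in Python with `itertools.groupby`, which yields exactly one group per run of identical symbols (the function names `rle_encode` and `rle_decode` are illustrative; this implements the two-component representation, the last bullet above):

```python
from itertools import groupby

def rle_encode(data):
    """Collapse each run of identical symbols into a (symbol, length) pair."""
    return [(sym, len(list(run))) for sym, run in groupby(data)]

def rle_decode(runs):
    """Expand the (symbol, length) pairs back into the original message."""
    return "".join(sym * n for sym, n in runs)

print(rle_encode("aabbbbbeedddddddddddb"))
# → [('a', 2), ('b', 5), ('e', 2), ('d', 11), ('b', 1)]
```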
  • 34. 3. Coding and compression
3.4 Run-length encoding (RLE)
Practical exercise
The following bitmap image has a size of 15 x 15 pixels. Each pixel can be white, red or black.
Compare the uncompressed size against an RLE compression!
How effective would a Huffman compression be for this purpose?
How it works
• Try out the exercise alone.
• Discuss your results with another student.
• This work is not considered for your final grade.
http://www.xiconeditor.com/
  • 35. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Principle
Universal lossless data compression
Published in 1984 by T. Welch as an improvement of the LZ78 algorithm published by A. Lempel & J. Ziv in 1978
Based on replacing recurrent sequences by a code, thereby building a dictionary of encountered sequences
Simple to use and widely used for very high throughput in hardware implementations
Application: Graphics Interchange Format (GIF), Unix file compression utility
Abraham Lempel (1936–), Yaakov Ziv (1931–), Terry Welch (1939–1988)
  • 36. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
initial dictionary: a = 1, b = 2, ..., z = 26
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
  • 37. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
dictionary: a = 1, b = 2, ..., z = 26
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
2) Take the current symbol: if the extended sequence is present in the dictionary, then extend the sequence; otherwise emit the code of the current sequence and add the extended sequence to the dictionary
Step: is “b” in the dictionary? yes; is “ba” in the dictionary? no → emit 2 (b), add ba = 27
coded message: 2
  • 38. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
dictionary: a = 1, b = 2, ..., z = 26, ba = 27
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
2) Take the current symbol: if the extended sequence is present in the dictionary, then extend the sequence; otherwise emit the code of the current sequence and add the extended sequence to the dictionary
Step: is “a” in the dictionary? yes; is “an” in the dictionary? no → emit 1 (a), add an = 28
coded message: 2 1
  • 39. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
dictionary: a = 1, b = 2, ..., z = 26, ba = 27, an = 28
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
2) Take the current symbol: if the extended sequence is present in the dictionary, then extend the sequence; otherwise emit the code of the current sequence and add the extended sequence to the dictionary
Step: is “n” in the dictionary? yes; is “na” in the dictionary? no → emit 14 (n), add na = 29
coded message: 2 1 14
  • 40. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
dictionary: a = 1, b = 2, ..., z = 26, ba = 27, an = 28, na = 29
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
2) Take the current symbol: if the extended sequence is present in the dictionary, then extend the sequence; otherwise emit the code of the current sequence and add the extended sequence to the dictionary
Step: is “a” in the dictionary? yes; is “an” in the dictionary? yes → extend the sequence; is “ane” in the dictionary? no → emit 28 (an), add ane = 30
coded message: 2 1 14 28
  • 41. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
dictionary: a = 1, b = 2, ..., z = 26, ba = 27, an = 28, na = 29, ane = 30
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
2) Take the current symbol: if the extended sequence is present in the dictionary, then extend the sequence; otherwise emit the code of the current sequence and add the extended sequence to the dictionary
Step: is “e” in the dictionary? yes; is “en” in the dictionary? no → emit 5 (e), add en = 31
coded message: 2 1 14 28 5
  • 42. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
dictionary: a = 1, b = 2, ..., z = 26, ba = 27, an = 28, na = 29, ane = 30, en = 31
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
2) Take the current symbol: if the extended sequence is present in the dictionary, then extend the sequence; otherwise emit the code of the current sequence and add the extended sequence to the dictionary
Step: is “n” in the dictionary? yes; is “nb” in the dictionary? no → emit 14 (n), add nb = 32
coded message: 2 1 14 28 5 14
  • 43. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
dictionary: a = 1, b = 2, ..., z = 26, ba = 27, an = 28, na = 29, ane = 30, en = 31, nb = 32
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
2) Take the current symbol: if the extended sequence is present in the dictionary, then extend the sequence; otherwise emit the code of the current sequence and add the extended sequence to the dictionary
Step: is “b” in the dictionary? yes; is “ba” in the dictionary? yes → extend the sequence; is “bau” in the dictionary? no → emit 27 (ba), add bau = 33
coded message: 2 1 14 28 5 14 27
  • 44. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example
message: b a n a n e n b a u
dictionary: a = 1, b = 2, ..., z = 26, ba = 27, an = 28, na = 29, ane = 30, en = 31, nb = 32, bau = 33
golden rules
1) Every symbol (1 char) is represented in the initial dictionary with a code
2) Take the current symbol: if the extended sequence is present in the dictionary, then extend the sequence; otherwise emit the code of the current sequence and add the extended sequence to the dictionary
Step: is “u” in the dictionary? yes; STOP, no more symbols → emit 21 (u)
coded message: 2 1 14 28 5 14 27 21
  • 45. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example - result
message: b a n a n e n b a u
final dictionary: a = 1, b = 2, ..., z = 26, ba = 27, an = 28, na = 29, ane = 30, en = 31, nb = 32, bau = 33
coded message: 2 1 14 28 5 14 27 21 (b a n an e n ba u)
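The encoding steps traced above can be sketched as a small Python function (the name `lzw_encode` is illustrative; the initial dictionary a = 1 ... z = 26 is the one used in the slides):

```python
def lzw_encode(text):
    """LZW over the initial dictionary a = 1 ... z = 26 used in the slides."""
    dictionary = {chr(ord("a") + i): i + 1 for i in range(26)}
    next_code = 27
    seq, out = "", []
    for ch in text:
        if seq + ch in dictionary:
            seq += ch                         # rule 2: extend the sequence
        else:
            out.append(dictionary[seq])       # emit code of the known sequence
            dictionary[seq + ch] = next_code  # learn the extended sequence
            next_code += 1
            seq = ch                          # restart from the current symbol
    if seq:
        out.append(dictionary[seq])           # flush the final sequence
    return out

print(lzw_encode("bananenbau"))  # → [2, 1, 14, 28, 5, 14, 27, 21]
```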
  • 46. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Example of decoding
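Decoding needs no transmitted dictionary: the decoder rebuilds it by mirroring the encoder's steps. A minimal sketch (the name `lzw_decode` is illustrative), including the one subtle case where a received code refers to the dictionary entry the encoder was still building:

```python
def lzw_decode(codes):
    """Rebuild the text from LZW codes, reconstructing the dictionary on
    the fly (initial dictionary a = 1 ... z = 26, as in the slides)."""
    dictionary = {i + 1: chr(ord("a") + i) for i in range(26)}
    next_code = 27
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # the code refers to the entry being built right now,
            # which can only be prev + its own first character
            entry = prev + prev[0]
        dictionary[next_code] = prev + entry[0]  # mirror the encoder's step
        next_code += 1
        out.append(entry)
        prev = entry
    return "".join(out)

print(lzw_decode([2, 1, 14, 28, 5, 14, 27, 21]))  # → bananenbau
```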
  • 47. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Practical exercise
1. The following image has a resolution of 4 x 4 pixels. Each pixel can be white, black, blue or yellow.
Compress the image according to the LZW coding algorithm!
Compare the uncompressed size against an RLE and Huffman compression!
How it works
• Try out the exercise alone.
• Discuss your results with another student.
• This work is not considered for your final grade.
  • 48. 3. Coding and compression
3.5 Lempel–Ziv–Welch (LZW) coding
Practical exercise
2. Calculate the LZW code for the message bobobobowebewe and give the full dictionary!
How it works
• Try out the exercise alone.
• Discuss your results with another student.
• This work is not considered for your final grade.
  • 49. 3. Coding and compression
3.6 The mysterious case of the Xerox scanners (2013)
The main actors
Xerox WorkCentre line: scanners which randomly alter written numbers in pages that are scanned
David Kriesel, who presented the case at the Chaos Communication Congress (31C3) in Hamburg on 29 December 2014
http://www.dkriesel.com/
What went wrong (test set): original data (Arial, 7 pt) vs. scan result
Overview
24 July: D. Kriesel informed Xerox about the case
6 August: Xerox announced that this is not a bug
12 August: Xerox confirms that hundreds of thousands of devices world-wide are affected due to a software bug introduced eight years earlier
22 August: first patches for different devices released
  • 50. 3. Coding and compression
3.6 The mysterious case of the Xerox scanners (2013)
Explaining the bug
JBIG2: compression standard that segments the input page into regions (patches) of text and images
• Patch 1: image is compressed, e.g., with JPEG
• Patches 2 - 4: text is compressed after OCR
• all the rest (white space) does not belong to any patch
Pattern matching: resolve similar patches, e.g., all the letters “e” → store just one occurrence and re-use the same pattern
https://www.youtube.com/watch?v=7FeqF1-Z1g0
  • 52. 3. Coding and compression
3.6 The mysterious case of the Xerox scanners (2013)
Explaining the bug (continued)
Due to optimization and imprecise pattern matching, errors can occur: a stored patch can be substituted for a visually similar but different symbol, so scanned numbers are silently altered
https://www.youtube.com/watch?v=7FeqF1-Z1g0