DATA COMPRESSION
PRESENTED BY,
R.RAMADEVI,
II – M. SC(CS&IT).
HUFFMAN CODE
• THE ALGORITHM AS DESCRIBED BY DAVID HUFFMAN ASSIGNS EVERY SYMBOL TO A LEAF NODE
OF A BINARY CODE TREE.
• THESE NODES ARE WEIGHTED BY THE NUMBER OF OCCURRENCES OF THE CORRESPONDING
SYMBOL, CALLED ITS FREQUENCY OR COST.
• THE TREE STRUCTURE RESULTS FROM COMBINING THE NODES STEP BY STEP UNTIL ALL OF
THEM ARE EMBEDDED IN A SINGLE ROOTED TREE.
• THE ALGORITHM ALWAYS COMBINES THE TWO NODES WITH THE LOWEST FREQUENCIES IN
A BOTTOM-UP PROCEDURE. EACH NEW INTERIOR NODE GETS THE SUM OF THE FREQUENCIES OF
ITS TWO CHILD NODES.
CODE TREE ACCORDING TO HUFFMAN
The branches of the tree represent the binary values 0 and 1 according to the rules
for common prefix-free code trees. The path from the root to the corresponding
leaf node defines the particular code word.
The following example is based on a data source using a set of five different
symbols. The symbol frequencies are:

Symbol   Frequency
A        24
B        12
C        10
D         8
E         8
----> total 186 bit (with 3 bit per code word)
CODE TREE ACCORDING TO HUFFMAN

Symbol   Frequency   Code   Length   Bits
A        24          0      1         24
B        12          100    3         36
C        10          101    3         30
D         8          110    3         24
E         8          111    3         24
-----------------------------------------
Fixed 3-bit code: 186 bit    tot. 138 bit
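The bottom-up combination above can be sketched in Python. This is an illustrative sketch, not from the slides; `huffman_code_lengths` is a hypothetical helper that only tracks code lengths (enough to verify the bit totals), using `heapq` to pick the two lowest-weight nodes at each step:

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length} for a Huffman tree built bottom-up."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    # Each heap entry: (weight, tiebreak, [symbols in this subtree])
    heap = [(f, next(tiebreak), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in freqs}
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)   # two lowest-weight nodes
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # every symbol below gains one level
            depth[s] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))
    return depth

freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
lengths = huffman_code_lengths(freqs)
total = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths, total)  # total is 138 bits vs. 62 * 3 = 186 with fixed 3-bit codes
```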
ADAPTIVE HUFFMAN CODE
• THE ADAPTIVE HUFFMAN CODE IS DEVELOPED STEP BY STEP IN THE COURSE OF THE
CODING.
• IN CONTRAST TO STATIC OR DYNAMIC CODING, THE SYMBOL DISTRIBUTION IS NOT
DETERMINED IN ADVANCE.
• IT IS GENERATED IN PARALLEL, USING THE ALREADY PROCESSED SYMBOLS.
• PRINCIPLE:
• THE WEIGHT OF THE NODE REPRESENTING THE LAST ENCODED SYMBOL IS
INCREASED.
• AFTERWARDS THE CORRESPONDING PART OF THE TREE IS ADAPTED.
• THUS THE TREE GRADUALLY APPROACHES THE CURRENT DISTRIBUTION OF THE
SYMBOLS. HOWEVER, THE CODE TREE ALWAYS REFLECTS THE "PAST" AND NOT THE
REAL DISTRIBUTION.
• INITIALIZATION:
• BECAUSE THE ADAPTIVE HUFFMAN CODE USES PREVIOUSLY ENCODED
SYMBOLS, AN INITIALIZATION PROBLEM ARISES AT THE BEGINNING OF THE
CODING.
• AT FIRST THE CODE TREE IS EMPTY AND DOES NOT CONTAIN SYMBOLS FROM
ALREADY ENCODED DATA. TO SOLVE THIS, A SUITABLE INITIALIZATION HAS TO
BE UTILIZED. AMONG OTHERS THE FOLLOWING OPTIONS ARE AVAILABLE:
• A STANDARD DISTRIBUTION IS USED THAT IS AVAILABLE AT THE ENCODER AS WELL
AS AT THE DECODER.
• AN INITIAL CODE TREE IS GENERATED WITH A FREQUENCY OF 1 FOR EACH
SYMBOL.
• A SPECIAL CONTROL CHARACTER IS ADDED THAT IDENTIFIES NEW SYMBOLS AS THEY
FOLLOW.
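A minimal sketch of the second initialization option (assumed names; simplified to plain counts rather than the incrementally rebalanced tree of the real algorithm): encoder and decoder both start every symbol at frequency 1 and bump the count of each symbol after it is coded, so their models stay synchronized without transmitting the distribution:

```python
from collections import Counter

def updated_counts(message, alphabet="ABCDE"):
    # Initial weight 1 for each symbol, identical on both sides.
    counts = Counter({s: 1 for s in alphabet})
    for symbol in message:
        # ... here the symbol would be coded with the tree built from `counts` ...
        counts[symbol] += 1  # weight of the just-coded symbol grows
    return counts

print(updated_counts("AABCA"))
```

Because the update happens only after each symbol is processed, the model at any point reflects the "past" of the stream, exactly as the slide describes.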
• ALGORITHM ADAPTIVE HUFFMAN CODE
• A NUMBER OF RULES AND CONVENTIONS ARE REQUIRED FOR THE ALGORITHM
DESCRIBED HEREINAFTER.
• IT IS BASED ON A PROCEDURE STARTING WITH AN EMPTY TREE AND INTRODUCING
A SPECIAL CONTROL CHARACTER.
• NO PRIOR ASSUMPTIONS ABOUT THE CODE DISTRIBUTION ARE MADE.
• CONTROL CHARACTER FOR SYMBOLS THAT ARE NOT PART OF THE TREE:
• A CONTROL CHARACTER IS DEFINED IN ADDITION TO THE ORIGINAL SET OF
SYMBOLS.
• THIS CONTROL CHARACTER IDENTIFIES SYMBOLS THAT, AT RUN-TIME, DO NOT
BELONG TO THE CURRENT CODE TREE. IN THE LITERATURE THIS CONTROL
CHARACTER IS DENOTED AS A NYA OR NYT (NOT YET AVAILABLE OR
TRANSMITTED).
WEIGHT OF A NODE:
• ANY NODE HAS AN ATTRIBUTE CALLED THE WEIGHT OF THE NODE.
• THE WEIGHT OF A LEAF NODE IS EQUAL TO THE FREQUENCY WITH WHICH THE
CORRESPONDING SYMBOL HAS BEEN CODED SO FAR.
• THE WEIGHT OF AN INTERIOR NODE IS EQUAL TO THE SUM OF THE WEIGHTS OF ITS TWO
SUBORDINATE NODES. THE CONTROL CHARACTER NYA GETS THE WEIGHT 0.
STRUCTURE OF THE CODE TABLE:
• ALL NODES ARE SORTED ACCORDING TO THEIR WEIGHT IN ASCENDING ORDER.
THE NYA NODE IS ALWAYS THE LOWEST NODE IN THE HIERARCHY.
SIBLING PROPERTY:
• ANY PAIR OF NODES THAT ARE NEIGHBORS IN THE HIERARCHY REFER TO THE
SAME PARENT NODE (NODE 2N-1 AND 2N -> 1 & 2, 3 & 4, 5 & 6, ETC.).
• THE PARENT NODE IS ALWAYS ORDERED AT A HIGHER LEVEL IN THE HIERARCHY
BECAUSE ITS WEIGHT IS THE SUM OF THE WEIGHTS OF ITS SUBORDINATE NODES.
THIS IS DENOTED AS THE SIBLING PROPERTY.
ROOT
• THE NODE WITH THE HIGHEST WEIGHT IS PLACED AT THE HIGHEST LEVEL IN THE
HIERARCHY AND FORMS THE CURRENT ROOT OF THE TREE.
INITIAL TREE
• THE INITIAL TREE ONLY CONTAINS THE NYA NODE WITH THE WEIGHT 0.
• DATA FORMAT FOR UNCODED SYMBOLS
• IN THE SIMPLEST CASE SYMBOLS WHICH ARE NOT CONTAINED IN THE CODE TREE ARE
ENCODED LINEARLY (E.G. WITH 8 BIT FROM A SET OF 256 SYMBOLS).
• BETTER COMPRESSION EFFICIENCY COULD BE ACHIEVED IF THE DATA FORMAT WERE
ADAPTED TO THE NUMBER OF REMAINING SYMBOLS.
• A VARIETY OF OPTIONS ARE AVAILABLE TO OPTIMIZE THE CODE LENGTH. A SUITABLE
PROCEDURE WILL BE PRESENTED GUARANTEEING FULL UTILIZATION OF THE RANGE OF
VALUES.
• MAXIMUM NUMBER OF NODES
• ASSUMING A SET OF N SYMBOLS, THE TOTAL NUMBER OF NODES (LEAF AND INTERIOR)
IS 2N − 1. ACCORDINGLY, THE MAXIMUM NUMBER OF NODES IS 2 × 256 − 1 = 511 IF THE
STANDARD UNIT IS THE BYTE.
• BLOCK
• ALL NODES OF IDENTICAL WEIGHT ARE LOGICALLY GROUPED INTO A BLOCK.
NODES WHICH ARE PART OF A BLOCK ARE ALWAYS NEIGHBORS IN THE CODE TABLE
REPRESENTING THE CODE TREE.
ARITHMETIC CODING:
• HUFFMAN CODING ACHIEVES THE SHANNON VALUE ONLY IF THE
CHARACTER/SYMBOL PROBABILITIES ARE ALL INTEGER POWERS OF ½.
• THE CODEWORDS PRODUCED USING ARITHMETIC CODING ALWAYS ACHIEVE THE
SHANNON VALUE. ARITHMETIC CODING, HOWEVER, IS MORE COMPLICATED
THAN HUFFMAN CODING.
• CONSIDER THE TRANSMISSION OF A MESSAGE COMPRISING A STRING OF
CHARACTERS WITH PROBABILITIES OF:
• E = 0.3, N = 0.3, T = 0.2, W = 0.1, . = 0.1
• AT THE END OF EACH CHARACTER STRING MAKING UP A MESSAGE, A KNOWN
CHARACTER IS SENT WHICH, IN THIS EXAMPLE, IS A PERIOD (.).
• WHEN THIS IS DECODED AT THE RECEIVING SIDE, THE DECODER INTERPRETS
THIS AS THE END OF THE STRING/MESSAGE.
• ARITHMETIC CODING YIELDS A SINGLE CODEWORD FOR EACH STRING OF CHARACTERS.
THE FIRST STEP IS TO DIVIDE THE NUMERIC RANGE FROM 0 TO 1 INTO SEGMENTS, ONE
FOR EACH DIFFERENT CHARACTER PRESENT IN THE MESSAGE TO BE SENT, INCLUDING THE
TERMINATION CHARACTER, WITH THE SIZE OF EACH SEGMENT DETERMINED BY THE
PROBABILITY OF THE RELATED CHARACTER.
• SINCE THERE ARE ONLY FIVE DIFFERENT CHARACTERS, THERE ARE FIVE SEGMENTS,
THE WIDTH OF EACH SEGMENT BEING DETERMINED BY THE PROBABILITY OF
THE RELATED CHARACTER.
• FOR EXAMPLE, THE CHARACTER E HAS A PROBABILITY OF 0.3 AND HENCE IS ASSIGNED THE
RANGE FROM 0 TO 0.3, THE CHARACTER N THE RANGE FROM 0.3 TO 0.6, AND SO ON.
• AN ASSIGNMENT OF THE RANGE, SAY, 0.8 TO 0.9 MEANS THAT THE
CUMULATIVE RANGE RUNS FROM 0.8 TO 0.8999…
• ONCE THIS HAS BEEN DONE, WE ARE READY TO START THE ENCODING PROCESS. WE ASSUME
THE CHARACTER STRING/MESSAGE TO BE ENCODED IS THE SINGLE WORD WENT.
• THE FIRST CHARACTER TO BE ENCODED, W, IS IN THE RANGE 0.8 TO 0.9. THE FINAL (NUMERIC)
CODEWORD IS THEREFORE A NUMBER IN THE RANGE 0.8 TO 0.8999…
• EACH SUBSEQUENT CHARACTER IN THE STRING SUBDIVIDES THE RANGE 0.8 TO 0.9 INTO
PROGRESSIVELY SMALLER SEGMENTS, EACH DETERMINED BY THE PROBABILITIES OF THE
CHARACTERS IN THE STRING.
EXAMPLE CHARACTER SET AND THEIR CUMULATIVE PROBABILITIES:

Character:   E         N         T         W         .
Range:       0.0–0.3   0.3–0.6   0.6–0.8   0.8–0.9   0.9–1.0
• THE NEXT CHARACTER IN THE STRING IS E, AND HENCE ITS RANGE (0.8 TO 0.83) IS
AGAIN SUBDIVIDED INTO FIVE SEGMENTS.
• WITH THE NEW ASSIGNMENTS, THEREFORE, THE CHARACTER E HAS A RANGE
FROM 0.8 TO 0.809 (= 0.8 + 0.3 × 0.03), AND SO ON.
• THIS PROCEDURE CONTINUES UNTIL THE TERMINATION CHARACTER (.) IS
ENCODED. AT THIS POINT THE SEGMENT RANGE OF (.) IS FROM 0.81602 TO 0.8162,
AND HENCE THE CODEWORD FOR THE COMPLETE STRING IS ANY NUMBER
WITHIN THAT RANGE.
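The interval narrowing described above can be sketched as a short Python routine (an illustrative sketch; `SEGMENTS` and `encode` are assumed names, and practical arithmetic coders use integer arithmetic rather than floats to avoid precision loss):

```python
# Segment assignments for E, N, T, W, '.' from the example above.
SEGMENTS = {"E": (0.0, 0.3), "N": (0.3, 0.6), "T": (0.6, 0.8),
            "W": (0.8, 0.9), ".": (0.9, 1.0)}

def encode(message):
    low, width = 0.0, 1.0
    for ch in message:
        seg_low, seg_high = SEGMENTS[ch]
        low = low + width * seg_low           # shrink the interval to this
        width = width * (seg_high - seg_low)  # character's segment
    return low, low + width                   # any value in [low, high) works

low, high = encode("WENT.")
print(low, high)  # ≈ 0.81602, 0.8162 as in the text
```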
• IN THE STATIC MODE, THE DECODER KNOWS THE SET OF CHARACTERS THAT ARE
PRESENT IN THE ENCODED MESSAGES IT RECEIVES AS WELL AS THE SEGMENT TO
WHICH EACH CHARACTER HAS BEEN ASSIGNED AND ITS RELATED RANGE.
• FOR EXAMPLE, IF THE RECEIVED CODEWORD IS, SAY, 0.8161, THEN THE
DECODER CAN READILY DETERMINE FROM THIS THAT THE FIRST CHARACTER IS
W, SINCE IT IS THE ONLY CHARACTER WHOSE RANGE (0.8 TO 0.9) CONTAINS THE CODEWORD.
• THE SECOND CHARACTER MUST BE E, SINCE 0.8161 IS WITHIN THE RANGE 0.8
TO 0.83. THIS PROCEDURE THEN REPEATS UNTIL IT DECODES THE KNOWN
TERMINATION CHARACTER (.),
• AT WHICH POINT IT HAS RECREATED THE (SAY, ASCII) STRING RELATING TO
WENT AND PASSES THIS ON FOR PROCESSING.
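The decoding procedure can be sketched the same way (assumed names; the same floating-point caveats apply). At each step the decoder finds the segment containing the value, emits that character, and rescales the value back into [0, 1):

```python
SEGMENTS = {"E": (0.0, 0.3), "N": (0.3, 0.6), "T": (0.6, 0.8),
            "W": (0.8, 0.9), ".": (0.9, 1.0)}

def decode(value):
    out = []
    while True:
        for ch, (lo, hi) in SEGMENTS.items():
            if lo <= value < hi:              # segment containing the value
                out.append(ch)
                if ch == ".":                 # known termination character
                    return "".join(out)
                value = (value - lo) / (hi - lo)  # rescale for next character
                break

print(decode(0.8161))  # prints WENT.
```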
STATISTICAL AND ADAPTIVE
• THE ADAPTIVE CHARACTER WORD LENGTH ALGORITHM THAT HAS BEEN
PROPOSED CAN BE USED TO ENHANCE THE EFFECTIVENESS OF STATISTICAL
DATA COMPRESSION TECHNIQUES.
• IN THIS SECTION, WE GIVE A BRIEF DESCRIPTION OF ONE OF THE MOST
SOPHISTICATED AND EFFICIENT STATISTICAL DATA COMPRESSION TECHNIQUES,
NAMELY, HUFFMAN CODING.
HUFFMAN CODING:
• HUFFMAN CODING IS A SOPHISTICATED AND EFFICIENT STATISTICAL LOSSLESS DATA
COMPRESSION TECHNIQUE, IN WHICH THE CHARACTERS IN A DATA FILE ARE
CONVERTED TO A BINARY CODE, WHERE THE MOST COMMON CHARACTERS IN THE
FILE HAVE THE SHORTEST BINARY CODES, AND THE LEAST COMMON HAVE THE
LONGEST.
• IN HUFFMAN CODING, FOR EXAMPLE, THE TEXT FILE TO BE COMPRESSED IS READ TO
CALCULATE THE FREQUENCIES FOR ALL THE CHARACTERS USED IN THE TEXT,
INCLUDING ALL LETTERS, DIGITS, AND PUNCTUATION.
• THE FIRST STEP IN BUILDING A HUFFMAN CODE IS TO ORDER THE CHARACTERS FROM
HIGHEST TO LOWEST FREQUENCY OF OCCURRENCE.
• THE SECOND STEP IS TO CONSTRUCT A BINARY TREE STRUCTURE. THIS BEGINS
BY SELECTING THE TWO LEAST-FREQUENT CHARACTERS; THEY ARE LOGICALLY
GROUPED TOGETHER AND THEIR FREQUENCIES ARE ADDED.
• THEN, SELECT THE TWO ELEMENTS THAT HAVE THE LOWEST FREQUENCIES,
REGARDING THE PREVIOUS COMBINATION AS A SINGLE ELEMENT; GROUP THEM
TOGETHER AND ADD THEIR FREQUENCIES.
• CONTINUE IN THE SAME WAY TO SELECT THE TWO ELEMENTS WITH THE LOWEST
FREQUENCIES, GROUP THEM TOGETHER, AND ADD THEIR FREQUENCIES, UNTIL
ONLY ONE ELEMENT REMAINS.
• IN STATISTICAL COMPRESSION ALGORITHMS, IT IS REQUIRED FIRST TO FIND THE
PROBABILITIES (I.E., FREQUENCIES, F) FOR ALL CHARACTERS USED IN THE DATA FILE. FOR
TEXT FILES, CHARACTERS INCLUDE ALL LETTERS, DIGITS, AND PUNCTUATION.
• THESE CHARACTERS ARE THEN ORDERED IN A SEQUENCE (S) FROM HIGHEST
FREQUENCY TO LOWEST.
• THE SECOND STEP IS TO FIND THE EQUIVALENT BINARY CODE FOR EACH CHARACTER
ACCORDING TO THE STATISTICAL DATA COMPRESSION TECHNIQUE THAT HAS BEEN USED,
E.G., HUFFMAN CODING, WHERE THE MOST COMMON CHARACTERS IN THE FILE HAVE THE
SHORTEST BINARY CODES, AND THE LEAST COMMON HAVE THE LONGEST.
• THEN THESE BINARY CODES ARE USED TO CONVERT ALL CHARACTERS IN THE DATA FILE TO
A BINARY CODE. TYPICALLY, IN ALL STATISTICAL ALGORITHMS, A CHARACTER WORD
LENGTH OF 8 BITS IS USED, WHERE THE EQUIVALENT DECIMAL VALUE FOR EACH 8 BITS (0–
255) IS CALCULATED AND CONVERTED TO A CHARACTER THAT IS WRITTEN TO THE OUTPUT
FILE.
• THE SIZE OF THE ORIGINAL DATA FILE (S_data) IN BYTES IS GIVEN BY

  S_data = Σ_{i=1}^{N} f_i
• ENTROPY IS A MEASURE OF THE INFORMATION CONTENT OF THE DATA FILE AND THE
SMALLEST NUMBER OF BITS PER CHARACTER NEEDED, ON AVERAGE, TO REPRESENT THE
COMPRESSED FILE. THEREFORE, THE ENTROPY OF A COMPLETE DATA FILE IS THE SUM
OF THE INDIVIDUAL CHARACTERS' ENTROPIES. THE ENTROPY OF A CHARACTER IS THE
NEGATIVE BASE-TWO LOGARITHM OF ITS PROBABILITY OF OCCURRENCE. WHERE
THE PROBABILITY p_i OF EACH CHARACTER OF THE ALPHABET IS CONSTANT, THE ENTROPY IS
CALCULATED AS

  E = − Σ_{i=1}^{N} p_i log₂(p_i)
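As a worked check of the formula (not from the slides), the five-character source from the arithmetic-coding example (E = 0.3, N = 0.3, T = 0.2, W = 0.1, . = 0.1) has an entropy of about 2.171 bits per character:

```python
from math import log2

probs = [0.3, 0.3, 0.2, 0.1, 0.1]
# E = -sum(p * log2(p)): the theoretical minimum average bits per character.
entropy = -sum(p * log2(p) for p in probs)
print(round(entropy, 3))  # ≈ 2.171
```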
• THE ACW ALGORITHM WILL ACHIEVE A CODING RATE OF (2.94 × 8)/B OR A
COMPRESSION RATIO OF B/2.94, WHERE B IS THE OPTIMUM CHARACTER WORD
LENGTH. THUS, FOR 9- AND 10-BIT CHARACTER WORD LENGTHS, FOR EXAMPLE, THE
CODING RATES ARE 2.61 (C = 3.06) AND 2.35 (C = 3.40), RESPECTIVELY. THE
OPTIMUM VALUE OF B DEPENDS ON A NUMBER OF FACTORS:
• 1. THE SIZE AND THE TYPE OF THE DATA FILE,
• 2. THE CHARACTER FREQUENCIES WITHIN THE DATA FILE,
• 3. THE DISTRIBUTION OF CHARACTERS WITHIN THE FILE, AND
• 4. THE EQUIVALENT BINARY CODE USED FOR EACH CHARACTER.
• IT IS CLEAR THAT, USING THE ACW ALGORITHM, THE COMPRESSION RATIO FOR
ANY STATISTICAL COMPRESSION ALGORITHM INCREASES LINEARLY WITH B,
AND IT CAN BE 4 TIMES ITS ORIGINAL VALUE IF B IS EQUAL TO 32 BITS.
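The quoted figures can be reproduced directly from C = B/2.94, reading the coding rate as 8/C (equivalently (2.94 × 8)/B). This is a quick arithmetic check of the text's numbers, not part of the ACW algorithm itself:

```python
# Compression ratio C = B / 2.94 and coding rate 8 / C for the word
# lengths mentioned in the text (B = 9, 10, 32 bits).
for b in (9, 10, 32):
    ratio = b / 2.94
    rate = 8 / ratio  # coded bits per original 8-bit character
    print(b, round(ratio, 2), round(rate, 2))
```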
• IN ORDER TO DECOMPRESS THE FILE EFFICIENTLY, THE NUMBER OF
POSSIBILITIES AND THEIR DECREMENTALLY ORDERED ACTUAL DECIMAL VALUES
MUST BE STORED INTO THE HEADER OF THE COMPRESSED FILE.
• THE MAXIMUM ADDED OVERHEAD OCCURS WHEN THE VALUE OF B IS 32 BITS
AND ALL THE POSSIBILITIES (0–255) ARE USED.
• THE MAXIMUM OVERHEAD IS EQUAL TO 2563 BYTES. THIS CAN BE REDUCED TO
2050 BYTES IF THESE VALUES ARE STORED IN HEXADECIMAL NOTATION.
DICTIONARY MODELING
• THE COMPRESSION ALGORITHMS WE HAVE STUDIED SO FAR USE A STATISTICAL MODEL TO
ENCODE SINGLE SYMBOLS.
• 1. COMPRESSION: ENCODE SYMBOLS INTO BIT STRINGS THAT USE FEWER BITS.
• DICTIONARY-BASED ALGORITHMS DO NOT ENCODE SINGLE SYMBOLS AS
VARIABLE-LENGTH BIT STRINGS; THEY ENCODE VARIABLE-LENGTH STRINGS OF
SYMBOLS AS SINGLE TOKENS.
• 2. THE TOKENS FORM AN INDEX INTO A PHRASE DICTIONARY.
• 3. IF THE TOKENS ARE SMALLER THAN THE PHRASES THEY REPLACE,
COMPRESSION OCCURS.
• DICTIONARY-BASED COMPRESSION IS EASY TO UNDERSTAND BECAUSE IT
USES A STRATEGY THAT PROGRAMMERS ARE FAMILIAR WITH: USING INDEXES INTO
DATABASES TO RETRIEVE INFORMATION FROM LARGE AMOUNTS OF STORAGE, E.G.:
• 4. TELEPHONE NUMBERS
• 5. POSTAL CODES
• DICTIONARY-BASED COMPRESSION: EXAMPLE
• CONSIDER THE RANDOM HOUSE DICTIONARY OF THE ENGLISH LANGUAGE,
SECOND EDITION, UNABRIDGED. USING THIS DICTIONARY, THE STRING:
• A GOOD EXAMPLE OF HOW DICTIONARY BASED COMPRESSION WORKS CAN
BE CODED AS:
• 1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2
• CODING:
• USES THE DICTIONARY AS A SIMPLE LOOKUP TABLE.
• EACH WORD IS CODED AS X/Y, WHERE X GIVES THE PAGE IN THE
DICTIONARY AND Y GIVES THE NUMBER OF THE WORD ON THAT PAGE.
• THE DICTIONARY HAS 2,200 PAGES WITH FEWER THAN 256 ENTRIES PER PAGE;
THEREFORE X REQUIRES 12 BITS AND Y REQUIRES 8 BITS, I.E., 20 BITS PER
WORD (2.5 BYTES PER WORD).
• USING ASCII CODING THE ABOVE STRING REQUIRES 48 BYTES,
• WHEREAS OUR ENCODING REQUIRES ONLY 20 (= 2.5 × 8) BYTES: MORE THAN 50%
COMPRESSION.
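The X/Y sizing above can be verified with a few lines (illustrative arithmetic only, using the page and entry counts from the example):

```python
from math import ceil, log2

x_bits = ceil(log2(2200))  # page number: 2,200 pages need 12 bits
y_bits = ceil(log2(256))   # word number on the page: 256 entries need 8 bits
per_word_bits = x_bits + y_bits
total_bytes = 8 * per_word_bits / 8  # 8 coded words, 20 bits each
print(x_bits, y_bits, per_word_bits, total_bytes)  # 12 8 20 20.0
```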
• ADAPTIVE DICTIONARY-BASED COMPRESSION
• BUILD THE DICTIONARY ADAPTIVELY.
• NECESSARY WHEN THE SOURCE DATA IS NOT PLAIN TEXT, SAY AUDIO OR VIDEO DATA.
• IS BETTER TAILORED TO THE SPECIFIC SOURCE.
• ORIGINAL METHODS ARE DUE TO ZIV AND LEMPEL IN 1977 (LZ77) AND 1978 (LZ78). TERRY
WELCH IMPROVED THE SCHEME IN 1984 (CALLED LZW COMPRESSION). IT IS USED IN
UNIX COMPRESS AND GIF.
• LZ77: A SLIDING-WINDOW TECHNIQUE IN WHICH THE DICTIONARY CONSISTS OF A SET
OF FIXED-LENGTH PHRASES FOUND IN A WINDOW INTO THE PREVIOUSLY PROCESSED TEXT.
• LZ78: INSTEAD OF USING FIXED-LENGTH PHRASES FROM A WINDOW INTO THE TEXT, IT
BUILDS PHRASES UP ONE SYMBOL AT A TIME, ADDING A NEW SYMBOL TO AN EXISTING PHRASE
WHEN A MATCH OCCURS.
• LZW ALGORITHM
• PRELIMINARIES:
• A DICTIONARY THAT IS INDEXED BY CODES IS USED.
• THE DICTIONARY IS ASSUMED TO BE INITIALIZED WITH 256 ENTRIES
(INDEXED WITH ASCII CODES 0 THROUGH 255) REPRESENTING THE ASCII TABLE.
• THE COMPRESSION ALGORITHM ASSUMES THAT THE OUTPUT IS EITHER A
FILE OR A COMMUNICATION CHANNEL, THE INPUT BEING A FILE OR BUFFER.
• CONVERSELY, THE DECOMPRESSION ALGORITHM ASSUMES THAT THE
INPUT IS A FILE OR A COMMUNICATION CHANNEL AND THE OUTPUT IS A FILE OR
A BUFFER.
• LZW ALGORITHM
• LZW COMPRESSION:
• SET W = NIL
• LOOP WHILE INPUT REMAINS
• READ A CHARACTER K
• IF WK EXISTS IN THE DICTIONARY
• W = WK
• ELSE
• OUTPUT THE CODE FOR W
• ADD WK TO THE DICTIONARY
• W = K
• END LOOP
• OUTPUT THE CODE FOR W
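The compression loop above can be turned into a small runnable Python function (a sketch assuming a byte-oriented dictionary pre-loaded with the 256 single-character entries; variable names follow the pseudocode):

```python
def lzw_compress(data: str):
    dictionary = {chr(i): i for i in range(256)}  # initial ASCII table
    next_code = 256
    w = ""       # current phrase (W in the pseudocode)
    out = []
    for k in data:                         # read a character K
        wk = w + k
        if wk in dictionary:
            w = wk                         # extend the current phrase
        else:
            out.append(dictionary[w])      # output the code for W
            dictionary[wk] = next_code     # add WK to the dictionary
            next_code += 1
            w = k
    if w:
        out.append(dictionary[w])          # flush the final phrase
    return out

print(lzw_compress("ABABABA"))  # [65, 66, 256, 258]
```

Note the final output of W after the loop: without it the last phrase of the input would be lost.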
THANK YOU

More Related Content

Similar to Data Compression Techniques Explained

ANN(Artificial Neural Networks) Clustering Algorithms
ANN(Artificial  Neural Networks)  Clustering AlgorithmsANN(Artificial  Neural Networks)  Clustering Algorithms
ANN(Artificial Neural Networks) Clustering AlgorithmsAnuj Kumar Pathak
 
Communication systems v3
Communication systems v3Communication systems v3
Communication systems v3babak danyal
 
Communication systems week 3
Communication systems week 3Communication systems week 3
Communication systems week 3babak danyal
 
Dt notes part 2
Dt notes part 2Dt notes part 2
Dt notes part 2syedusama7
 
Digital communications
Digital communicationsDigital communications
Digital communicationsAllanki Rao
 
Lesson One Fourth Quarter Second Year High School Understanding Sounds
Lesson One Fourth Quarter Second Year High School Understanding SoundsLesson One Fourth Quarter Second Year High School Understanding Sounds
Lesson One Fourth Quarter Second Year High School Understanding SoundsPerry Mallari
 
ambaaxi protocol basic information presentaion
ambaaxi protocol basic information presentaionambaaxi protocol basic information presentaion
ambaaxi protocol basic information presentaionSandipSolanki10
 
Power line carrier communication,ETL41/42
Power line carrier communication,ETL41/42Power line carrier communication,ETL41/42
Power line carrier communication,ETL41/42Sreenivas Gundu
 
Data converter fundamentals
Data converter fundamentalsData converter fundamentals
Data converter fundamentalsAbhishek Kadam
 
Lecture intro to_wcdma
Lecture intro to_wcdmaLecture intro to_wcdma
Lecture intro to_wcdmaGurpreet Singh
 
Digital signal transmission in ofc
Digital signal transmission in ofcDigital signal transmission in ofc
Digital signal transmission in ofcAnkith Shetty
 

Similar to Data Compression Techniques Explained (20)

ANN(Artificial Neural Networks) Clustering Algorithms
ANN(Artificial  Neural Networks)  Clustering AlgorithmsANN(Artificial  Neural Networks)  Clustering Algorithms
ANN(Artificial Neural Networks) Clustering Algorithms
 
add9.5.ppt
add9.5.pptadd9.5.ppt
add9.5.ppt
 
Communication systems v3
Communication systems v3Communication systems v3
Communication systems v3
 
Communication systems week 3
Communication systems week 3Communication systems week 3
Communication systems week 3
 
Dt notes part 2
Dt notes part 2Dt notes part 2
Dt notes part 2
 
White Box Testing
White Box Testing White Box Testing
White Box Testing
 
Turbo codes
Turbo codesTurbo codes
Turbo codes
 
DPCM
DPCMDPCM
DPCM
 
Digital communications
Digital communicationsDigital communications
Digital communications
 
Lesson One Fourth Quarter Second Year High School Understanding Sounds
Lesson One Fourth Quarter Second Year High School Understanding SoundsLesson One Fourth Quarter Second Year High School Understanding Sounds
Lesson One Fourth Quarter Second Year High School Understanding Sounds
 
ambaaxi protocol basic information presentaion
ambaaxi protocol basic information presentaionambaaxi protocol basic information presentaion
ambaaxi protocol basic information presentaion
 
Power line carrier communication,ETL41/42
Power line carrier communication,ETL41/42Power line carrier communication,ETL41/42
Power line carrier communication,ETL41/42
 
Data converter fundamentals
Data converter fundamentalsData converter fundamentals
Data converter fundamentals
 
cdma2000_Fundamentals.pdf
cdma2000_Fundamentals.pdfcdma2000_Fundamentals.pdf
cdma2000_Fundamentals.pdf
 
C02 transmission fundamentals
C02   transmission fundamentalsC02   transmission fundamentals
C02 transmission fundamentals
 
Lecture intro to_wcdma
Lecture intro to_wcdmaLecture intro to_wcdma
Lecture intro to_wcdma
 
Presentation 1
Presentation 1Presentation 1
Presentation 1
 
Harmonic speech coding
Harmonic speech codingHarmonic speech coding
Harmonic speech coding
 
Digital signal transmission in ofc
Digital signal transmission in ofcDigital signal transmission in ofc
Digital signal transmission in ofc
 
PAUT.pptx
PAUT.pptxPAUT.pptx
PAUT.pptx
 

More from lalithambiga kamaraj (20)

Firewall in Network Security
Firewall in Network SecurityFirewall in Network Security
Firewall in Network Security
 
Data Compression in Multimedia
Data Compression in MultimediaData Compression in Multimedia
Data Compression in Multimedia
 
Digital Audio in Multimedia
Digital Audio in MultimediaDigital Audio in Multimedia
Digital Audio in Multimedia
 
Network Security: Physical security
Network Security: Physical security Network Security: Physical security
Network Security: Physical security
 
Graphs in Data Structure
Graphs in Data StructureGraphs in Data Structure
Graphs in Data Structure
 
Package in Java
Package in JavaPackage in Java
Package in Java
 
Exception Handling in Java
Exception Handling in JavaException Handling in Java
Exception Handling in Java
 
Data structure
Data structureData structure
Data structure
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image Processing
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image Processing
 
Estimating Software Maintenance Costs
Estimating Software Maintenance CostsEstimating Software Maintenance Costs
Estimating Software Maintenance Costs
 
Datamining
DataminingDatamining
Datamining
 
Digital Components
Digital ComponentsDigital Components
Digital Components
 
Deadlocks in operating system
Deadlocks in operating systemDeadlocks in operating system
Deadlocks in operating system
 
Io management disk scheduling algorithm
Io management disk scheduling algorithmIo management disk scheduling algorithm
Io management disk scheduling algorithm
 
Recovery system
Recovery systemRecovery system
Recovery system
 
File management
File managementFile management
File management
 
Preprocessor
PreprocessorPreprocessor
Preprocessor
 
Inheritance
InheritanceInheritance
Inheritance
 
Managing console of I/o operations & working with files
Managing console of I/o operations & working with filesManaging console of I/o operations & working with files
Managing console of I/o operations & working with files
 

Recently uploaded

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 

Recently uploaded (20)

9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 

Data Compression Techniques Explained

  • 2. HUFFMAN CODE • THE ALGORITHM AS DESCRIBED BY DAVID HUFFMAN ASSIGNS EVERY SYMBOL TO A LEAF NODE OF A BINARY CODE TREE. • THESE NODES ARE WEIGHTED BY THE NUMBER OF OCCURRENCES OF THE CORRESPONDING SYMBOL CALLED FREQUENCY OR COST. • THE TREE STRUCTURE RESULTS FROM COMBINING THE NODES STEP-BY-STEP UNTILALL OF THEM ARE EMBEDDED IN A ROOT TREE. • THE ALGORITHM ALWAYS COMBINES THE TWO NODES PROVIDING THE LOWEST FREQUENCY IN A BOTTOM UP PROCEDURE. THE NEW INTERIOR NODES GETS THE SUM OF FREQUENCIES OF BOTH CHILD NODES.
  • 3. CODE TREE ACCORDING TO HUFFMAN Symbol Frequency A 24 B 12 C 10 D 8 E 8 ----> total 186 bit (With 3 bit per code word) The branches of the tree represent the binary values 0 and 1 according to the rules for common prefix-free code trees. The path from the root tree to the corresponding leaf node defines the particular code word. The following example bases on a data source using a set of five different symbols. The symbol's frequencies are:
  • 4. CODE TREE ACCORDING TO HUFFMAN Length Length ------------------------------------- ------------------------------------- - A 24 0 1 24 B 12 100 3 36 C 10 101 3 30 D 8 110 3 24 E 8 111 3 24 ------------------------------------- ---------------------------------- 186 bit tot.138 bit (3 bit code)
  • 5. ADAPTIVE HUFFMAN CODE • THE ADAPTIVE HUFFMAN CODE WILL BE DEVELOPED IN THE COURSE OF THE CODING STEP BY STEP. • IN CONTRAST TO THE STATIC OR DYNAMIC CODING THE SYMBOL DISTRIBUTION WILL NOT BE DETERMINED PREVIOUSLY. • IT WILL BE GENERATED IN PARALLEL BY USING ALREADY PROCESSED SYMBOLS. • PRINCIPLE: • THE WEIGHT OF EACH NODE REPRESENTING THE LAST ENCODED SYMBOL WILL BE INCREASED. • AFTERWARDS THE CORRESPONDING PART OF THE TREE WILL BE ADAPTED. • THUS THE TREE APPROACHES GRADUALLY THE CURRENT DISTRIBUTION OF THE SYMBOLS. HOWEVER, THE CODE TREE ALWAYS REFLECTS THE "PAST" AND NOT THE REAL DISTRIBUTION.
  • 6. • INITIALIZATION: • BECAUSE THE ADAPTIVE HUFFMAN CODE USES PREVIOUSLY ENCODED SYMBOLS, AN INITIALIZATION PROBLEM ARISES AT THE BEGINNING OF THE CODING. • AT FIRST THE CODE TREE IS EMPTY AND DOES NOT CONTAIN SYMBOLS FROM ALREADY ENCODED DATA. TO SOLVE THIS, A SUITABLE INITIALIZATION HAS TO BE UTILIZED. AMONG OTHERS THE FOLLOWING OPTIONS ARE AVAILABLE: • A STANDARD DISTRIBUTION WILL BE USED THAT IS AVAILABLE AT THE ENCODER AS WELL AS THE DECODER. • AN INITIAL CODE TREE WILL BE GENERATED WITH A FREQUENCY OF 1 FOR EACH SYMBOL. • A SPECIAL CONTROL CHARACTER WILL BE ADDED IDENTIFYING NEW SYMBOLS FOLLOWING.
  • 7. • ALGORITHM ADAPTIVE HUFFMAN CODE • A NUMBER OF RULES AND CONVENTIONS ARE REQUIRED FOR THE ALGORITHM DESCRIBED HEREINAFTER. • IT IS BASED ON A PROCEDURE STARTING WITH AN EMPTY TREE AND INTRODUCING A SPECIAL CONTROL CHARACTER. • NO PRIOR ASSUMPTIONS ABOUT THE SYMBOL DISTRIBUTION ARE MADE. • CONTROL CHARACTER FOR SYMBOLS THAT ARE NOT PART OF THE TREE: • A CONTROL CHARACTER IS DEFINED IN ADDITION TO THE ORIGINAL SET OF SYMBOLS. • THIS CONTROL CHARACTER IDENTIFIES SYMBOLS THAT, AT RUN-TIME, DO NOT YET BELONG TO THE CURRENT CODE TREE. IN THE LITERATURE THIS CONTROL CHARACTER IS DENOTED AS NYA OR NYT (NOT YET AVAILABLE / NOT YET TRANSMITTED).
  • 8. WEIGHT OF A NODE: • EVERY NODE HAS AN ATTRIBUTE CALLED THE WEIGHT OF THE NODE. • THE WEIGHT OF A LEAF NODE IS EQUAL TO THE FREQUENCY WITH WHICH THE CORRESPONDING SYMBOL HAS BEEN CODED SO FAR. • THE WEIGHT OF AN INTERIOR NODE IS EQUAL TO THE SUM OF THE WEIGHTS OF ITS TWO CHILD NODES. THE CONTROL CHARACTER NYA GETS THE WEIGHT 0. STRUCTURE OF THE CODE TABLE: • ALL NODES ARE SORTED ACCORDING TO THEIR WEIGHT IN ASCENDING ORDER. THE NYA NODE IS ALWAYS THE LOWEST NODE IN THE HIERARCHY.
  • 9. SIBLING PROPERTY: • ANY PAIR OF NODES THAT ARE NEIGHBORS IN THE HIERARCHY REFER TO THE SAME PARENT NODE (NODES 2N-1 AND 2N -> 1 & 2, 3 & 4, 5 & 6, ETC.). • THE PARENT NODE IS ALWAYS ORDERED AT A HIGHER LEVEL IN THE HIERARCHY BECAUSE ITS WEIGHT RESULTS FROM THE SUM OF THE WEIGHTS OF THE SUBORDINATED NODES. THIS IS DENOTED AS THE SIBLING PROPERTY. ROOT • THE NODE WITH THE HIGHEST WEIGHT IS PLACED AT THE HIGHEST LEVEL IN THE HIERARCHY AND FORMS THE CURRENT ROOT OF THE TREE. INITIAL TREE • THE INITIAL TREE ONLY CONTAINS THE NYA NODE WITH THE WEIGHT 0.
  • 10. • DATA FORMAT FOR UNCODED SYMBOLS • IN THE SIMPLEST CASE, SYMBOLS WHICH ARE NOT CONTAINED IN THE CODE TREE ARE ENCODED LINEARLY (E.G. WITH 8 BIT FROM A SET OF 256 SYMBOLS). • BETTER COMPRESSION EFFICIENCY COULD BE ACHIEVED IF THE DATA FORMAT IS ADAPTED TO THE NUMBER OF REMAINING SYMBOLS. • A VARIETY OF OPTIONS ARE AVAILABLE TO OPTIMIZE THE CODE LENGTH. A SUITABLE PROCEDURE WILL BE PRESENTED GUARANTEEING FULL UTILIZATION OF THE RANGE OF VALUES. • MAXIMUM NUMBER OF NODES • ASSUMING A SET OF N SYMBOLS, THE TOTAL NUMBER OF NODES IS 2N-1 (LEAF AND INTERIOR NODES). ACCORDINGLY, THE MAXIMUM NUMBER OF NODES IS 511 IF THE STANDARD UNIT IS THE BYTE (N = 256). • BLOCK • ALL NODES OF IDENTICAL WEIGHT ARE SUMMARIZED LOGICALLY INTO A BLOCK. NODES WHICH ARE PART OF A BLOCK ARE ALWAYS NEIGHBORING IN THE CODE TABLE REPRESENTING THE CODE TREE.
  • 11. ARITHMETIC CODING: • HUFFMAN CODING ACHIEVES THE SHANNON VALUE ONLY IF THE CHARACTER/SYMBOL PROBABILITIES ARE ALL INTEGER POWERS OF ½. • THE CODEWORDS PRODUCED USING ARITHMETIC CODING ALWAYS ACHIEVE THE SHANNON VALUE. ARITHMETIC CODING, HOWEVER, IS MORE COMPLICATED THAN HUFFMAN CODING. • CONSIDER THE TRANSMISSION OF A MESSAGE COMPRISING A STRING OF CHARACTERS WITH PROBABILITIES OF: • E = 0.3, N = 0.3, T = 0.2, W = 0.1, . = 0.1
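The claim about integer powers of ½ can be checked numerically (a small sketch; the first probability set is a simple power-of-½ source chosen here for illustration, the second is the one from this slide):

```python
import math

# When every probability is an integer power of 1/2, the Huffman code
# lengths (1, 2, 2 bits here) equal -log2(p), so the average code length
# exactly matches the entropy (the Shannon value).
probs = {"A": 0.5, "B": 0.25, "C": 0.25}
entropy = -sum(p * math.log2(p) for p in probs.values())
avg_huffman = 0.5 * 1 + 0.25 * 2 + 0.25 * 2   # code lengths 1, 2, 2
print(entropy, avg_huffman)                    # 1.5 1.5

# The slide's probabilities are not powers of 1/2, so Huffman can only
# approach this bound, while arithmetic coding can reach it.
probs2 = [0.3, 0.3, 0.2, 0.1, 0.1]
h5 = -sum(p * math.log2(p) for p in probs2)
print(round(h5, 3))                            # ~2.171 bits/character
```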
  • 12. • AT THE END OF EACH CHARACTER STRING MAKING UP A MESSAGE, A KNOWN CHARACTER IS SENT WHICH, IN THIS EXAMPLE, IS A PERIOD (.). • WHEN THIS IS DECODED AT THE RECEIVING SIDE, THE DECODER INTERPRETS IT AS THE END OF THE STRING/MESSAGE. • ARITHMETIC CODING YIELDS A SINGLE CODEWORD FOR EACH STRING OF CHARACTERS. THE FIRST STEP IS TO DIVIDE THE NUMERIC RANGE FROM 0 TO 1 INTO A NUMBER OF SEGMENTS, ONE FOR EACH DIFFERENT CHARACTER PRESENT IN THE MESSAGE TO BE SENT (INCLUDING THE TERMINATION CHARACTER), WITH THE SIZE OF EACH SEGMENT DETERMINED BY THE PROBABILITY OF THE RELATED CHARACTER. • SINCE THERE ARE ONLY FIVE DIFFERENT CHARACTERS, THERE ARE FIVE SEGMENTS, THE WIDTH OF EACH SEGMENT BEING DETERMINED BY THE PROBABILITY OF THE RELATED CHARACTER.
  • 13. • FOR EXAMPLE, THE CHARACTER E HAS A PROBABILITY OF 0.3 AND HENCE IS ASSIGNED THE RANGE FROM 0 TO 0.3, THE CHARACTER N THE RANGE FROM 0.3 TO 0.6, AND SO ON. • NOTE THAT AN ASSIGNMENT OF THE RANGE, SAY, 0.8 TO 0.9 MEANS THAT THE CUMULATIVE RANGE IS FROM 0.8 TO 0.8999… • ONCE THIS HAS BEEN DONE, WE ARE READY TO START THE ENCODING PROCESS. WE ASSUME THE CHARACTER STRING/MESSAGE TO BE ENCODED IS THE SINGLE WORD WENT. • THE FIRST CHARACTER TO BE ENCODED, W, IS IN THE RANGE 0.8 TO 0.9, SO THE FINAL (NUMERIC) CODEWORD IS A NUMBER IN THE RANGE 0.8 TO 0.8999..., SINCE EACH SUBSEQUENT CHARACTER IN THE STRING SUBDIVIDES THE RANGE 0.8 TO 0.9 INTO PROGRESSIVELY SMALLER SEGMENTS, EACH DETERMINED BY THE PROBABILITIES OF THE CHARACTERS IN THE STRING. EXAMPLE CHARACTER SET AND THEIR PROBABILITIES:
CHARACTER SET            |--------E--------|--------N--------|-----T-----|---W---|-.-|
CUMULATIVE PROBABILITIES 0                0.3               0.6         0.8     0.9  1.0
  • 14. • THE NEXT CHARACTER IN THE STRING IS E AND HENCE ITS RANGE (0.8 TO 0.83) IS AGAIN SUBDIVIDED INTO FIVE SEGMENTS. • WITH THE NEW ASSIGNMENTS, THEREFORE, THE CHARACTER E HAS A RANGE FROM 0.8 TO 0.809 (0.8 + 0.3 × 0.03), AND SO ON. • THIS PROCEDURE CONTINUES UNTIL THE TERMINATION CHARACTER (.) IS ENCODED. AT THIS POINT THE SEGMENT RANGE OF (.) RUNS FROM 0.81602 TO 0.8162, AND HENCE THE CODEWORD FOR THE COMPLETE STRING IS ANY NUMBER WITHIN THIS RANGE. • IN THE STATIC MODE, THE DECODER KNOWS THE SET OF CHARACTERS THAT ARE PRESENT IN THE ENCODED MESSAGES IT RECEIVES, AS WELL AS THE SEGMENT TO WHICH EACH CHARACTER HAS BEEN ASSIGNED AND ITS RELATED RANGE.
  • 15. • FOR EXAMPLE, IF THE RECEIVED CODEWORD IS, SAY, 0.8161, THEN THE DECODER CAN READILY DETERMINE THAT THE FIRST CHARACTER IS W, SINCE W IS THE ONLY CHARACTER WHOSE RANGE (0.8 TO 0.9) CONTAINS THIS VALUE. • THE SECOND CHARACTER MUST BE E SINCE 0.8161 IS WITHIN THE RANGE 0.8 TO 0.83. THIS PROCEDURE THEN REPEATS UNTIL THE DECODER DECODES THE KNOWN TERMINATION CHARACTER (.), • AT WHICH POINT IT HAS RECREATED THE (SAY, ASCII) STRING RELATING TO WENT AND PASSES IT ON FOR PROCESSING.
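The encoding and decoding walk-through for WENT. can be sketched as follows (an illustration only; `ranges` is the cumulative-probability table from the slides, and the function names `encode`/`decode` are chosen here):

```python
# Cumulative ranges from the slides: E, N, T, W, and the terminator "."
ranges = {"E": (0.0, 0.3), "N": (0.3, 0.6), "T": (0.6, 0.8),
          "W": (0.8, 0.9), ".": (0.9, 1.0)}

def encode(message):
    """Narrow [0, 1) to the sub-interval selected by each character."""
    low, high = 0.0, 1.0
    for ch in message:
        width = high - low
        lo, hi = ranges[ch]
        low, high = low + width * lo, low + width * hi
    return low, high   # any number in [low, high) is a valid codeword

def decode(codeword):
    """Repeatedly pick the segment containing the codeword and rescale."""
    out = ""
    while True:
        for ch, (lo, hi) in ranges.items():
            if lo <= codeword < hi:
                out += ch
                if ch == ".":          # known termination character
                    return out
                codeword = (codeword - lo) / (hi - lo)
                break

low, high = encode("WENT.")
print(low, high)        # ~0.81602 ... 0.8162, matching the slides
print(decode(0.8161))   # WENT.
```

The printed interval reproduces the 0.81602 to 0.8162 range derived on slide 14, and feeding back 0.8161 recovers the string.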
  • 16. STATISTICAL AND ADAPTIVE • THE ADAPTIVE CHARACTER WORDLENGTH (ACW) ALGORITHM THAT HAS BEEN PROPOSED CAN BE USED TO ENHANCE THE EFFECTIVENESS OF STATISTICAL DATA COMPRESSION TECHNIQUES. • IN THIS SECTION, WE GIVE A BRIEF DESCRIPTION OF ONE OF THE MOST SOPHISTICATED AND EFFICIENT STATISTICAL DATA COMPRESSION TECHNIQUES, NAMELY, HUFFMAN CODING.
  • 17. HUFFMAN CODING: • HUFFMAN CODING IS A SOPHISTICATED AND EFFICIENT STATISTICAL LOSSLESS DATA COMPRESSION TECHNIQUE, IN WHICH THE CHARACTERS IN A DATA FILE ARE CONVERTED TO A BINARY CODE, WHERE THE MOST COMMON CHARACTERS IN THE FILE HAVE THE SHORTEST BINARY CODES AND THE LEAST COMMON HAVE THE LONGEST. • IN HUFFMAN CODING, THE TEXT FILE TO BE COMPRESSED IS FIRST READ TO CALCULATE THE FREQUENCIES OF ALL THE CHARACTERS USED IN THE TEXT, INCLUDING ALL LETTERS, DIGITS, AND PUNCTUATION. • THE FIRST STEP IN BUILDING A HUFFMAN CODE IS TO ORDER THE CHARACTERS FROM HIGHEST TO LOWEST FREQUENCY OF OCCURRENCE.
  • 18. • THE SECOND STEP IS TO CONSTRUCT A BINARY TREE STRUCTURE. THIS BEGINS BY SELECTING THE TWO LEAST-FREQUENT CHARACTERS, LOGICALLY GROUPING THEM TOGETHER, AND ADDING THEIR FREQUENCIES. • THEN, SELECT THE TWO ELEMENTS THAT HAVE THE LOWEST FREQUENCIES, REGARDING THE PREVIOUS COMBINATION AS A SINGLE ELEMENT; GROUP THEM TOGETHER AND ADD THEIR FREQUENCIES. • CONTINUE IN THE SAME WAY, SELECTING THE TWO ELEMENTS WITH THE LOWEST FREQUENCIES, GROUPING THEM TOGETHER, AND ADDING THEIR FREQUENCIES, UNTIL ONE ELEMENT REMAINS.
  • 19. • IN STATISTICAL COMPRESSION ALGORITHMS, IT IS REQUIRED FIRST TO FIND THE PROBABILITIES (I.E., FREQUENCIES, F) OF ALL CHARACTERS USED IN THE DATA FILE. FOR TEXT FILES, CHARACTERS INCLUDE ALL LETTERS, DIGITS, AND PUNCTUATION. • THESE CHARACTERS ARE ORDERED IN A SEQUENCE (S) FROM HIGHEST FREQUENCY TO LOWEST. • THE SECOND STEP IS TO FIND THE EQUIVALENT BINARY CODE FOR EACH CHARACTER ACCORDING TO THE STATISTICAL DATA COMPRESSION TECHNIQUE THAT IS USED, E.G., HUFFMAN CODING, WHERE THE MOST COMMON CHARACTERS IN THE FILE HAVE THE SHORTEST BINARY CODES AND THE LEAST COMMON HAVE THE LONGEST. • THEN THESE BINARY CODES ARE USED TO CONVERT ALL CHARACTERS IN THE DATA FILE TO A BINARY CODE. TYPICALLY, IN ALL STATISTICAL ALGORITHMS, A CHARACTER WORD LENGTH OF 8 BITS IS USED, WHERE THE EQUIVALENT DECIMAL VALUE FOR EACH 8 BITS (0–255) IS CALCULATED AND CONVERTED TO A CHARACTER THAT IS WRITTEN TO THE OUTPUT FILE.
  • 20. • THE SIZE OF THE ORIGINAL DATA FILE (S_DATA) IN BYTES IS GIVEN BY
S_DATA = Σ (I = 1..N) F_I
• ENTROPY IS A MEASURE OF THE INFORMATION CONTENT OF THE DATA FILE AND THE SMALLEST NUMBER OF BITS PER CHARACTER NEEDED, ON AVERAGE, TO REPRESENT THE COMPRESSED FILE. THEREFORE, THE ENTROPY OF A COMPLETE DATA FILE IS THE SUM OF THE INDIVIDUAL CHARACTERS' ENTROPIES. THE ENTROPY OF A CHARACTER IS THE NEGATIVE LOGARITHM, TO BASE TWO, OF ITS PROBABILITY OF OCCURRENCE (ITS RELATIVE FREQUENCY P_I = F_I / S_DATA). THE AVERAGE ENTROPY PER CHARACTER IS THEREFORE CALCULATED AS
E = − Σ (I = 1..N) P_I LOG2(P_I)
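Under these definitions, the two formulas can be evaluated for the five-symbol source used in the earlier Huffman example (an illustrative check; the frequencies are taken from that example):

```python
import math

freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}

# S_data: size of the original file in characters (one byte each here)
s_data = sum(freqs.values())                       # 62

# E: average entropy in bits per character
probs = [f / s_data for f in freqs.values()]
entropy = -sum(p * math.log2(p) for p in probs)
print(round(entropy, 3))                           # ~2.176 bits/character

# Lower bound on the compressed size in bits, versus 138 bits for the
# Huffman code and 186 bits for the fixed 3-bit code.
bound = math.ceil(entropy * s_data)
print(bound)                                       # 135
```

The entropy bound of 135 bits shows how close the 138-bit Huffman result comes to the theoretical minimum for this source.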
  • 21. • THE ACW ALGORITHM WILL ACHIEVE A CODING RATE OF (2.94 × 8)/B, OR A COMPRESSION RATIO OF B/2.94, WHERE B IS THE OPTIMUM CHARACTER WORD LENGTH. THUS, FOR 9- AND 10-BIT CHARACTER WORD LENGTHS, FOR EXAMPLE, THE CODING RATES ARE 2.16 (C = 3.06) AND 2.35 (C = 3.40), RESPECTIVELY. THE OPTIMUM VALUE OF B DEPENDS ON A NUMBER OF FACTORS: • 1. THE SIZE AND THE TYPE OF THE DATA FILE, • 2. THE CHARACTER FREQUENCIES WITHIN THE DATA FILE, • 3. THE DISTRIBUTION OF CHARACTERS WITHIN THE FILE, AND • 4. THE EQUIVALENT BINARY CODE USED FOR EACH CHARACTER.
  • 22. • IT IS CLEAR THAT USING THE ACW ALGORITHM, THE COMPRESSION RATIO, FOR ANY STATISTICAL COMPRESSION ALGORITHM, INCREASES LINEARLY WITH B, AND IT CAN BE 4 TIMES ITS ORIGINAL VALUE IF B IS EQUAL TO 32 BITS. • IN ORDER TO DECOMPRESS THE FILE EFFICIENTLY, THE NUMBER OF POSSIBILITIES AND THEIR DECREMENTALLY ORDERED ACTUAL DECIMAL VALUES MUST BE STORED IN THE HEADER OF THE COMPRESSED FILE. • THE MAXIMUM ADDED OVERHEAD OCCURS WHEN THE VALUE OF B IS 32 BITS AND ALL THE POSSIBILITIES (0–255) ARE USED. • THE MAXIMUM OVERHEAD IS EQUAL TO 2563 BYTES. THIS CAN BE REDUCED TO 2050 BYTES IF THESE VALUES ARE STORED IN HEXADECIMAL NOTATION.
  • 23. DICTIONARY MODELING • THE COMPRESSION ALGORITHMS WE HAVE STUDIED SO FAR USE A STATISTICAL MODEL TO ENCODE SINGLE SYMBOLS. • 1. COMPRESSION: ENCODE SYMBOLS INTO BIT STRINGS THAT USE FEWER BITS. • DICTIONARY-BASED ALGORITHMS DO NOT ENCODE SINGLE SYMBOLS AS VARIABLE-LENGTH BIT STRINGS; THEY ENCODE VARIABLE-LENGTH STRINGS OF SYMBOLS AS SINGLE TOKENS. • 2. THE TOKENS FORM AN INDEX INTO A PHRASE DICTIONARY. • 3. IF THE TOKENS ARE SMALLER THAN THE PHRASES THEY REPLACE, COMPRESSION OCCURS. • DICTIONARY-BASED COMPRESSION IS EASY TO UNDERSTAND BECAUSE IT USES A STRATEGY THAT PROGRAMMERS ARE FAMILIAR WITH: USING INDEXES INTO DATABASES TO RETRIEVE INFORMATION FROM LARGE AMOUNTS OF STORAGE.
  • 24. • 4. TELEPHONE NUMBERS • 5. POSTAL CODES • DICTIONARY-BASED COMPRESSION: EXAMPLE • CONSIDER THE RANDOM HOUSE DICTIONARY OF THE ENGLISH LANGUAGE, SECOND EDITION, UNABRIDGED. USING THIS DICTIONARY, THE STRING: • A GOOD EXAMPLE OF HOW DICTIONARY BASED COMPRESSION WORKS • CAN BE CODED AS: • 1/1 822/3 674/4 1343/60 928/75 550/32 173/46 421/2
  • 25. • CODING: • USES THE DICTIONARY AS A SIMPLE LOOKUP TABLE. • EACH WORD IS CODED AS X/Y, WHERE X GIVES THE PAGE IN THE DICTIONARY AND Y GIVES THE NUMBER OF THE WORD ON THAT PAGE. • THE DICTIONARY HAS 2,200 PAGES WITH FEWER THAN 256 ENTRIES PER PAGE; THEREFORE X REQUIRES 12 BITS AND Y REQUIRES 8 BITS, I.E., 20 BITS PER WORD (2.5 BYTES PER WORD). • USING ASCII CODING THE ABOVE STRING REQUIRES 48 BYTES, WHEREAS OUR ENCODING REQUIRES ONLY 20 (2.5 × 8) BYTES: MORE THAN 50% COMPRESSION. • ADAPTIVE DICTIONARY-BASED COMPRESSION • BUILD THE DICTIONARY ADAPTIVELY,
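The bit arithmetic on this slide can be reproduced directly (a small sketch using only the page and entry counts given above):

```python
import math

# 2,200 pages -> 12 bits for the page number X (2**12 = 4096 >= 2200)
page_bits = math.ceil(math.log2(2200))
# fewer than 256 entries per page -> 8 bits for the word number Y
word_bits = math.ceil(math.log2(256))
bits_per_word = page_bits + word_bits        # 20 bits = 2.5 bytes per word

tokens = 8                                   # X/Y codes in the example string
compressed = tokens * bits_per_word / 8      # size of the encoding in bytes
print(page_bits, word_bits, compressed)      # 12 8 20.0
```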
  • 26. • NECESSARY WHEN THE SOURCE DATA IS NOT PLAIN TEXT, SAY AUDIO OR VIDEO DATA. • IS BETTER TAILORED TO THE SPECIFIC SOURCE. • THE ORIGINAL METHODS ARE DUE TO ZIV AND LEMPEL IN 1977 (LZ77) AND 1978 (LZ78). TERRY WELCH IMPROVED THE SCHEME IN 1984 (CALLED LZW COMPRESSION). IT IS USED IN UNIX COMPRESS AND GIF. • LZ77: A SLIDING-WINDOW TECHNIQUE IN WHICH THE DICTIONARY CONSISTS OF A SET OF FIXED-LENGTH PHRASES FOUND IN A WINDOW INTO THE PREVIOUSLY PROCESSED TEXT. • LZ78: INSTEAD OF USING FIXED-LENGTH PHRASES FROM A WINDOW INTO THE TEXT, IT BUILDS PHRASES UP ONE SYMBOL AT A TIME, ADDING A NEW SYMBOL TO AN EXISTING PHRASE WHEN A MATCH OCCURS.
  • 27. • LZW ALGORITHM • PRELIMINARIES: • A DICTIONARY THAT IS INDEXED BY CODES IS USED. • THE DICTIONARY IS ASSUMED TO BE INITIALIZED WITH 256 ENTRIES (INDEXED WITH THE ASCII CODES 0 THROUGH 255) REPRESENTING THE ASCII TABLE. • THE COMPRESSION ALGORITHM ASSUMES THAT THE OUTPUT IS EITHER A FILE OR A COMMUNICATION CHANNEL, THE INPUT BEING A FILE OR BUFFER. • CONVERSELY, THE DECOMPRESSION ALGORITHM ASSUMES THAT THE INPUT IS A FILE OR A COMMUNICATION CHANNEL AND THE OUTPUT IS A FILE OR A BUFFER.
  • 28. • LZW ALGORITHM • LZW COMPRESSION:
SET W = NIL
LOOP
    READ A CHARACTER K
    IF WK EXISTS IN THE DICTIONARY
        W = WK
    ELSE
        OUTPUT THE CODE FOR W
        ADD WK TO THE DICTIONARY
        W = K
END LOOP
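The compression loop above, together with the matching decompressor, can be sketched in Python (a minimal illustration assuming byte-oriented input; the function names are chosen here, and emitting raw integer codes sidesteps the bit-packing a real implementation would do):

```python
def lzw_compress(data: bytes):
    """LZW compression; dictionary starts with the 256 single-byte strings."""
    dictionary = {bytes([i]): i for i in range(256)}
    w = b""
    out = []
    for byte in data:
        k = bytes([byte])
        wk = w + k
        if wk in dictionary:
            w = wk                              # extend the current phrase
        else:
            out.append(dictionary[w])           # output the code for w
            dictionary[wk] = len(dictionary)    # add wk to the dictionary
            w = k
    if w:
        out.append(dictionary[w])               # flush the final phrase
    return out

def lzw_decompress(codes):
    """Rebuild the dictionary on the fly from the received codes."""
    dictionary = {i: bytes([i]) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                                   # code not yet defined: the
            entry = w + w[:1]                   # classic KwKwK special case
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[:1]
        w = entry
    return b"".join(out)

codes = lzw_compress(b"ABABABA")
print(codes)                    # [65, 66, 256, 258]
print(lzw_decompress(codes))    # b'ABABABA'
```

Note how the repeated "AB" and "ABA" phrases are replaced by the single tokens 256 and 258, exactly as the dictionary-modeling slides describe.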