anithabalaprabhu · 19 slides · Feb 08, 2010
1
Huffman Encoding
Βαγγέλης Δούρος
EY0619
2
Text Compression
• On a computer: changing the representation of a file so that it takes less space to store and/or less time to transmit.
– The original file can be reconstructed exactly from the compressed representation.
• Different from data compression in general:
– Text compression has to be lossless.
– Compare with sound and images, where small changes and noise are tolerated.
3
First Approach
• Consider the word ABRACADABRA.
• What is the most economical way to write this string in a binary representation?
• Generally speaking, if a text consists of N different characters, we need ⌈log₂ N⌉ bits to represent each one using a fixed-length encoding.
• Thus, it would require 3 bits for each of 5 different letters, or 33 bits for 11 letters.
• Can we do better?
4
Yes!!!!
• We can do better, provided:
– Some characters are more frequent than others.
– Characters may have different bit lengths, so that, for example, in the English alphabet the letter a may use only one or two bits, while the letter y may use several.
– We have a unique way of decoding the bit stream.
5
Using Variable-Length Encoding (1)
• Magic word: ABRACADABRA
• Let A = 0, B = 100, C = 1010, D = 1011, R = 11.
• Thus, ABRACADABRA = 01001101010010110100110.
• So 11 letters demand 23 bits < 33 bits, an improvement of about 30%.
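The arithmetic above can be checked with a short sketch, using the codeword table given on this slide:

```python
# Encode ABRACADABRA with the variable-length code from the slide.
code = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}
word = "ABRACADABRA"

bits = "".join(code[ch] for ch in word)
print(bits)       # 01001101010010110100110
print(len(bits))  # 23, versus 11 letters * 3 bits = 33 for a fixed-length code
```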
6
Using Variable-Length Encoding (2)
• However, there is a serious danger: how do we ensure unique reconstruction?
• Let A → 01 and B → 0101.
• How do we decode 010101? AB? BA? AAA?
• No problem...
• ...if we use prefix codes: no codeword is a prefix of another codeword.
7
Prefix Codes (1)
• Any prefix code can be represented by a full binary tree.
• Each leaf stores a symbol.
• Each internal node has two children: the left branch means 0, the right branch means 1.
• Codeword = the path from the root to the leaf, interpreting the left and right branches accordingly.
8
Prefix Codes (2)
• ABRACADABRA
• A = 0, B = 100, C = 1010, D = 1011, R = 11
• Decoding is unique and simple!
• Read the bit stream from left to right, starting from the root; whenever a leaf is reached, write down its symbol and return to the root.
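The decoding rule above can be sketched without an explicit tree: because the code is prefix-free, scanning the bits left to right and emitting a symbol as soon as the accumulated bits match a codeword is equivalent to reaching a leaf and returning to the root.

```python
# Decode a bit stream using the prefix code from the slide.
code = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}
decode_table = {bits: sym for sym, bits in code.items()}

def decode(stream: str) -> str:
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in decode_table:  # a leaf is reached: emit symbol, return to root
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

print(decode("01001101010010110100110"))  # ABRACADABRA
```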
9
Prefix Codes (3)
• Let fᵢ be the frequency of the i-th symbol and dᵢ the number of bits required for the i-th symbol (= the depth of this symbol in the tree), 1 ≤ i ≤ n.
• How do we find the optimal coding tree, which minimizes the cost of the tree, C = Σᵢ₌₁ⁿ fᵢ·dᵢ?
– Frequent characters should have short codewords.
– Rare characters should have long codewords.
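For the running example, the cost can be computed directly from the definition, counting frequencies in ABRACADABRA and taking each depth as the codeword length:

```python
# Cost of a coding tree: C = sum over i of f_i * d_i.
word = "ABRACADABRA"
code = {"A": "0", "B": "100", "C": "1010", "D": "1011", "R": "11"}

freq = {ch: word.count(ch) for ch in code}           # f_i
cost = sum(freq[ch] * len(code[ch]) for ch in code)  # d_i = codeword length
print(cost)  # 23, the length of the encoded string from slide 5
```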
10
Huffman’s Idea
• From the previous definition of the cost of the tree, it is clear that the two symbols with the smallest frequencies must be at the bottom of the optimal tree, as children of the lowest internal node.
• This is a good sign that we have to build the optimal code in a bottom-up manner!
• Huffman’s idea is based on a greedy approach, using the previous observations.
• Repeat until all nodes are merged into one tree:
– Remove the two nodes with the lowest frequencies.
– Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their frequencies as the new frequency.
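The greedy loop above can be sketched with Python’s heapq module as the priority queue; representing internal nodes as (left, right) tuples is an assumption of this sketch, not part of the slides:

```python
import heapq
from itertools import count

def huffman_tree(freqs):
    """Merge the two lowest-frequency nodes until one tree remains."""
    order = count()  # tie-breaker so the heap never compares tree tuples
    heap = [(f, next(order), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # lowest frequency
        f2, _, right = heapq.heappop(heap)  # second lowest
        # New internal node: the two removed nodes become children,
        # the sum of their frequencies becomes the new frequency.
        heapq.heappush(heap, (f1 + f2, next(order), (left, right)))
    return heap[0][2]

tree = huffman_tree({"A": 40, "B": 20, "C": 10, "D": 10, "R": 20})
```

Ties are broken arbitrarily, so the exact tree shape may differ from the one drawn on the next slides, but, as slide 12 notes, any such tree has the same (optimal) cost.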
11
Constructing a Huffman Code (1)
• Assume that the frequencies of the symbols are:
– A: 40  B: 20  C: 10  D: 10  R: 20
• The smallest numbers are 10 and 10 (C and D), so connect them.
12
Constructing a Huffman Code (2)
• C and D have already been used, and the new node above them (call it C+D) has value 20.
• The smallest values are B, C+D, and R, all of which have value 20.
– Connect any two of these.
• Clearly the algorithm does not construct a unique tree, but even if we had chosen the other possible connection, the resulting code would be optimal too!
13
Constructing a Huffman Code (3)
• The smallest value is R, while A and B+C+D have value 40.
• Connect R to either of the others.
14
Constructing a Huffman Code (4)
• Connect the final two nodes, labeling the left and right branches of each node with 0 and 1 respectively.
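Reading the codewords off the finished tree can be sketched as follows; the tuple representation of the tree built in slides 11-14 is an assumption of this sketch:

```python
# The tree from the construction: A on the left; (B, (C, D)) and R on the right.
tree = ("A", (("B", ("C", "D")), "R"))

def codewords(node, prefix=""):
    """Walk the tree, appending 0 on left branches and 1 on right branches."""
    if isinstance(node, str):  # leaf: the path taken is the codeword
        return {node: prefix}
    left, right = node
    table = codewords(left, prefix + "0")
    table.update(codewords(right, prefix + "1"))
    return table

print(codewords(tree))
# {'A': '0', 'B': '100', 'C': '1010', 'D': '1011', 'R': '11'}
```

This recovers exactly the code table used on slides 5 and 8.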
15
Algorithm
[Pseudocode of HUFFMAN, shown as a figure in the original slide]
• X is the set of symbols, whose frequencies are known in advance.
• Q is a min-priority queue, implemented as a binary heap.
16
What about Complexity?
• Building the priority queue needs O(n log n).
• Each queue operation inside the loop (two EXTRACT-MINs and one INSERT) needs O(log n).
• The loop runs n − 1 times, so the loop needs O(n log n).
• Thus, the algorithm needs O(n log n).
17
Algorithm’s Correctness
• It is proven that the greedy algorithm HUFFMAN is correct, as the problem of determining an optimal prefix code exhibits the greedy-choice and optimal-substructure properties.
• Greedy Choice: Let C be an alphabet in which each character c ∈ C has frequency f[c]. Let x and y be two characters in C having the lowest frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.
• Optimal Substructure: Let C be a given alphabet with frequency f[c] defined for each character c ∈ C. Let x and y be two characters in C with minimum frequency. Let C′ be the alphabet C with the characters x, y removed and a (new) character z added, so that C′ = C − {x, y} ∪ {z}; define f for C′ as for C, except that f[z] = f[x] + f[y]. Let T′ be any tree representing an optimal prefix code for the alphabet C′. Then the tree T, obtained from T′ by replacing the leaf node for z with an internal node having x and y as children, represents an optimal prefix code for the alphabet C.
18
Last Remarks
• Huffman codes are widely used in applications that involve the compression and transmission of digital data, such as fax machines, modems, and computer networks.
• Huffman encoding is practical if:
– The encoded string is large relative to the code table (because you have to include the code table with the message, if it is not widely known).
– We agree on the code table in advance.
• For example, it is easy to find a table of letter frequencies for English (or any other alphabet-based language).