Department of CSE & IT
C.V. Raman College of Engineering
Bhubaneswar
1
Data Structure using 'C'
Hashing
Department of CSE and IT
Department of CSE and IT
2
Introduction and Definition
Hash Tables as a Data Structure
Choosing a hash Function
Truncation Method
Mid Square Method
Folding Method
Modular Method
Collision Resolution(Open Hashing)
Separate Chaining
Closed Hashing(Open Addressing)
Linear Probing
Quadratic Probing
Double Hashing
Storage Management
Garbage Collection
Dynamic memory management
Buddy System
Binary Buddy System
Fibonacci Buddy System
Application
Department of CSE and IT3
Introduction to Hashing
Hashing is the technique used for performing almost constant time search in case of
insertion, deletion and find operation.
The essence of hashing is to facilitate the next level searching method when
compared with the linear or binary search.
It is the process of mapping large amount of data item to a smaller table with the help
of a hashing function.
The advantage of this searching method is its efficiency to hand vast amount of data
items in a given collection (i.e. collection size).
Due to this hashing process, the result is a Hash data structure that can store or
retrieve data items in an average time disregard to the collection size.
Continued…
Let ‘n’ distinct record Table is one of the important data structure used
for information retrieval.
Records with keys K1,K2-----Kn are stored in a file and we want to
retrieve a record with a given key value K.
One way is to start from the first record by comparing K with its key
value, if found then stop ,otherwise proceed to the next record. The
searching time is directly proportional to the number of records in the
file.
Another way is to use an access table, where the searching time is
significantly reduced.
Department of CSE and IT4
Continued…
Let us assume a function f and apply this function to K, then it returns ‘i’ that
is f(k)=i. The ith entry of the access table gives the location of record with key
value K.
One type of such access table is known as Hash table.
The hash table is an array which contains key values with pointers to the
corresponding records.
The basic idea here is to place a key inside the hash table, and the
location/index of that key will be calculated from the given key value itself.
The one to one correspondence between a key value and index in the hash
table is known as hashing.
Department of CSE and IT5
Department of CSE and IT6
Example: The Search Problem
Find items with keys matching a given search key
Given an array A, containing n keys, and a search key x, find the index i
such as x=A[i]
As in the case of sorting, a key could be part of a large record.
Department of CSE and IT7
Applications
Keeping track of customer account information at a bank
Search through records to check balances and perform transactions
Keep track of reservations on flights
Search to find empty seats, cancel/modify reservations
Search engine
Looks for all documents containing a given word
Department of CSE and IT8
Hash Table
There are some type of tables which helps to retrieve information in
efficient manner.
The hash table is one of them.
This table is an array of constant size which depends on applications.
The hash table contains key values with pointers to the corresponding
records.
The basic idea is to place key value into location in the hash table and this
location will be calculated from the key value itself.
The one to one correspondence between the key value and index in the
hash table is known as hashing.
Department of CSE and IT9
The ideal hash table structure is merely an array of some fixed size, containing
the items.
A stored item needs to have a data member, called key, that will be used in
computing the index value for the item.
Key could be an integer, a string, etc
e.g. a name or Id that is a part of a large employee structure
The size of the array is TableSize.
The items that are stored in the hash table are indexed by values from 0 to
TableSize – 1.
Each key is mapped into some number in the range 0 to TableSize – 1.
The mapping is called a hash function.
Continued…
A hash table, or a hash map, is a data structure that associates keys
(names) with values (attributes).
• Look-Up Table
• Dictionary
• Cache
• Extended Array
Department of CSE and IT10
Continued…
Example: HASH TABLE
U
(universe of keys)
K
(actual
keys)
0
m - 1
h(k
3
)
h(k
2) = h(k
5)
h(k
1
)
h(k
4)
k
1
k
4k
2
k
5
k
3
Department of CSE and IT11
Choosing a Hash Function
A hash function maps keys to small integers (buckets). An ideal hash function
maps the keys to the integers in a random-like manner, so that bucket values are
evenly distributed even if there are regularities in the input data.
This process can be divided into two steps:
Map the key to an integer.
Map the integer to a bucket.
Department of CSE and IT12
Example
Hash
Function
mary 28200
dave 27500
phil 31250
john 25000
Items
Hash
Table
key
key
0
1
2
3
4
5
6
7
8
9
mary 28200
dave 27500
phil 31250
john 25000
Department of CSE and IT13
The hash function:
must be simple to compute.
must distribute the keys evenly among the cells.
Different Approaches to generate hash function
1. Truncation Method
2. Mid Square Method
3. Folding Method
4. Modular Method
Department of CSE and IT14
1.Truncation Method
Ignore a part of the key and use the remaining part directly as the index
Ex:1 If a hash table contains 999 entries at the most or 999 different key indexes
may be kept, then a hash function may be defined such that from an eight digit
integer 12345678, first, second and fifth digits from the right may be used to
define a key index i.e. 478, which is the key position in the hash lookup table
where this element will be inserted.
Any other key combination may be used.
Ex:2 If students have an 9-digit identification number, take the last 3 digits as the
table position
e.g. 925371622
becomes
622
Department of CSE and IT15
2. Mid Square Method
In Mid-Square method, the key element is multiplied by itself to yield
the square of the given key.
If the hash table contains maximum 999 entries, then three digits from
the result of square may be chosen as a hash key for mapping the key
element in the lookup table.
It generates random sequences of hash keys, which are generally key
dependent.
Mid Square method is a flexible method of selecting the hash key
value.
Example: If the input is the number 4567, squaring yields an 8-digit
number, 20857489. The middle two digits of this result are 57. All digits
of the original key value (equivalently, all bits when the number is
viewed in binary) contribute to the middle two digits of the squared
value. Thus, the result is not dominated by the distribution of the
bottom digit or the top digit of the original key value.
Department of CSE and IT16
3. Folding Method
Partition the key into several parts and combine the parts in a convenient
way (often addition or multiplication) to obtain the index.
Ex-1:
An eight digit integer can be divided into groups of three, three and two digits
(or any other combination) the groups added together and truncated if
necessary to be in the proper range of indices.
Hence 12345678 maps to 123+456+78 = 657, since any digit can affect the
index, it offers a wide distribution of key values.
Ex-2: Split a 9-digit number into three 3-digit numbers, and add them
e.g. 925371622 becomes 925 + 376 + 622 = 1923
4. Modular Method
For mapping a given key element in the hash table, mod operation of
individual key is calculated. The remainder denotes particular address position
of each element.
The result so obtained is divided by an integer, usually taken to be the size
of the hash table to obtain the remainder as the hash key to place that element
in the lookup table.
Department of CSE and IT17
5. The Division Method
Idea:
Map a key k into one of the m slots by taking the remainder of k divided by
m
h(k) = k mod m
Advantage:
fast, requires only one operation
Disadvantage:
Certain values of m are bad, e.g.,
power of 2
non-prime numbers
Department of CSE and IT18
Example - The Division Method
If m = 2
p
, then h(k) is just the least significant p bits
of k
p = 1 m = 2
h(k) = , least significant 1 bit of k
p = 2 m = 4
h(k) = , least significant 2 bits of k
Choose m to be a prime, not close to a
power of 2
Column 2:
Column 3:
{0, 1}
{0, 1, 2, 3}
k mod 97
k mod 100
m
97
m
100
Department of CSE and IT19
Collision
Two or more keys hash to the same slot!!
For a given set K of keys
If |K| ≤ m, collisions may or may not happen, depending on the
hash function
If |K| > m, collisions will definitely happen (i.e., there must be at
least two keys that have the same hash value)
Avoiding collisions completely is hard, even with a good hash
function
Department of CSE and IT20
Collision Resolution(Open Hashing)
If, when an element is inserted, it hashes to the same value as
an already inserted element, then we have a collision and need
to resolve it.
There are several methods for dealing with this:
Separate chaining
Open addressing
Linear Probing
Quadratic Probing
Double Hashing
21
Separate Chaining
The idea is to keep a list of all elements that hash to the same
value.
The array elements are pointers to the first nodes of the lists.
A new item is inserted to the front of the list.
Advantages:
Better space utilization for large items.
Simple collision handling: searching linked list.
Overflow: we can store more items than the hash table size.
Deletion is quick and easy: deletion from the linked list.
Department of CSE and IT
21
22
Example
0
1
2
3
4
5
6
7
8
9
0
81 1
64 4
25
36 16
49 9
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
Department of CSE and IT
22
Department of CSE and IT23
Operations
Initialization: all entries are set to NULL
Find:
locate the cell using hash function.
sequential search on the linked list in that cell.
Insertion:
Locate the cell using hash function.
(If the item does not exist) insert it as the first item in the list.
Deletion:
Locate the cell using hash function.
Delete the item from the linked list.
23
Department of CSE and IT24
24
Handling Collisions Using Separate Chaining
Idea:
Put all elements that hash to the same slot into a linked list
Slot j contains a pointer to the head of the list of all elements that
hash to j
Department of CSE and IT25
Closed Hashing(Open Addressing)
If we have enough contiguous memory to store all the keys (m > N)
store the keys in the table itself
No need to use linked lists anymore
Basic idea:
Insertion: if a slot is full, try another one,
until you find an empty one
Search: follow the same sequence of probes
Deletion: more difficult ... (we’ll see why)
Search time depends on the length of the
probe sequence!
e.g., insert 14
Department of CSE and IT26
Generalize hash function notation:
A hash function contains two arguments now:
(i) Key value, and (ii) Probe number
h(k,p), p=0,1,...,m-1
Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>
Must be a permutation of <0,1,...,m-1>
There are m! possible permutations
Good hash functions should be able to produce all m! probe
sequences
insert 14
<1, 5, 9>
Example
Department of CSE and IT27
Common Open Addressing Methods
Linear probing
Quadratic probing
Double hashing
Note: None of these methods can generate more than m
2
different
probing sequences!
Department of CSE and IT28
Linear probing: Inserting a key
Idea: when there is a collision, check the next available
position in the table (i.e., probing)
h(k,i) = (h
1
(k) + i) mod m
i=0,1,2,...
First slot probed: h
1
(k)
Second slot probed: h
1
(k) + 1
Third slot probed: h
1(k)+2, and so on
Can generate m probe sequences maximum, why?
probe sequence: < h1(k), h1(k)+1 , h1(k)+2 , ....>
wrap around
Department of CSE and IT29
Linear probing: Searching for a key
Three cases:
(1) Position in table is occupied with an element of
equal key
(2) Position in table is empty
(3) Position in table occupied with a different
element
Case 2: probe the next higher index until the
element is found or an empty position is found
The process wraps around to the beginning of the
table
0
m - 1
h(k
3
)
h(k
2
) = h(k
5
)
h(k
1
)
h(k
4
)
Department of CSE and IT30
Linear probing: Deleting a key
Problems
Cannot mark the slot as empty
Impossible to retrieve keys inserted after that slot
was occupied
Solution
Mark the slot with a sentinel value DELETED
The deleted slot can later be used for insertion
Searching will be able to find all the keys
0
m - 1
Department of CSE and IT31
Quadratic Probing
i=0,1,2,...
Department of CSE and IT32
Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the increment
for the probe sequence
h(k,i) = (h
1
(k) + i h
2
(k) ) mod m, i=0,1,...
Initial probe: h
1
(k)
Second probe is offset by h
2
(k) mod m, so on ...
Advantage: avoids clustering
Disadvantage: harder to delete an element
Can generate m
2
probe sequences maximum
Department of CSE and IT33
Double Hashing: Example
h
1
(k) = k mod 13
h
2(k) = 1+ (k mod 11)
h(k, i) = (h
1(k) + i h
2(k) ) mod 13
Insert key 14:
h
1(14,0) = 14 mod 13 = 1
h(14,1) = (h
1(14) + h
2(14)) mod 13
= (1 + 4) mod 13 = 5
h(14,2) = (h
1(14) + 2 h
2(14)) mod 13
= (1 + 8) mod 13 = 9
79
69
98
72
50
0
9
4
2
3
1
5
6
7
8
10
11
12
14
Example Linear (Closed) Hashing
D=8, keys a,b,c,d have hash values h(a)=3, h(b)=0, h(c)=4,
h(d)=3
0
2
3
4
5
6
7
1
b
a
c
Where do we insert using linear hashing:
Probe sequence 4%8 = 4
ady fillh
1
(d) = (h(d)+1)%8 = t d? 3 alreed
h
2(d) = (h(d)+2)%8 = 5%8 = 5*
7, 0,1,2
Wraps around the beginning of the table!
d
Already Filled
Department of CSE and IT35
Analysis of Open Addressing
a
1a
(load factor)
k=0
36
Compilers use hash tables to keep track of declared variables.
A hash table can be used for on-line spelling checkers — if misspelling
detection (rather than correction) is important, an entire dictionary can
be hashed and words checked in constant time.
Game playing programs use hash tables to store seen positions,
thereby saving computation time if the position is encountered
again.
Hash functions can be used to quickly check for inequality — if
two elements hash to different values they must be different.
Storing sparse data
Applications of Hashing
Department of CSE and IT
Department of CSE and IT37
Storage Management
An executing program uses memory (storage) for many different purposes, such
as for the machine instructions that represent the executable part of the program,
the values of data objects, and the return location for a function invocation.
Static Allocation :
allocation during translation that remains fixed throughout execution.
does not allow recursive subprograms
Dynamic Allocation: (Heap Storage Management)
Memory used for dynamic allocation of data objects in somewhat
unstructured manner is called heap storage.
Memory Management
Memory Areas and their use
Memory Manager Tasks:
acquire
release
Free List Implementations
Singly Linked List
Doubly Linked List
Buddy Systems
38 Department of CSE and IT
Continued…
Memory areas: In languages like C or Java, the memory used by
a program can be allocated from three different areas:
Static: laid out at compilation time, and allocated when the
program starts.
Used for Global variables and constants
Stack: memory is allocated and freed dynamically, in LIFO
order.
Used for Local variables and parameters
Heap: memory is allocated and freed dynamically, in any
order.
Used for data outliving the method which created them. In
Java all objects are stored in the heap
39 Department of CSE and IT
Memory Manager
The memory manager is part of the Operating System.
It must keep track of which parts of the heap are free, and which are
allocated.
A memory manager supports the following operations:
acquire: allocates memory needed by programs
release: deallocates memory no longer needed by programs
It also defragments memory when needed
40 Department of CSE and IT
Problems faced in memory allocation
Memory fragmentation:
External fragmentation: Memory wasted outside allocated
blocks
Internal fragmentation: Memory wasted inside allocated block.
Results when memory allocated is larger than memory requested.
Overhead: Additional memory that must be allocated, above and beyond
that requested by programs, in order to provide for the management of the
heap.
41 Department of CSE and IT
Free List
Memory manager uses a free list data structure that keeps track of free
memory blocks in a scheme for dynamic memory allocation.
Common implementations for free list:
Singly-linked list
Doubly-linked list
Buddy systems: an array of doubly-linked lists
Allocation Policies:
First fit chooses the first block in the free list big enough to satisfy the
request, and split it.
Next fit is like first fit, except that the search for a fitting block will start
where the last one stopped, instead of at the beginning of the free list.
Best fit chooses the smallest block bigger than the requested one.
Worst fit chooses the biggest, with the aim of avoiding the creation of too
many small fragments – but doesn’t work well in practice.
42 Department of CSE and IT
Singly-linked list implementation of free-list
Each node represents a free block of memory
Nodes must be sorted according to start addresses of free blocks so that
adjacent free memory blocks can be combined.
acquire( ) and release( ) operations are O(n); where n is the number of
blocks in the heap.
In order to acquire a block, a node is searched following one of the
allocation policy. If the block is bigger than requested, it is divided into two.
One part is allocated and one remains in the list.
In order to release a block,
a new node must be inserted (if the adjacent block is not on the free list)
or a node, which contains the adjacent free block, must be modified.
Searching for the place of the new or existing node has complexity O(n).
43 Department of CSE and IT
Doubly-linked list implementation of free-list
In this implementation
Nodes are not sorted according to start addresses of free blocks.
All memory blocks have boundary tags between them. The tag has
information about the size and status (allocated/free)
Each node in the doubly linked list represents a free block. It keeps
size & start address of the free block and start addresses & sizes of
the previous and next memory blocks. The adjacent blocks may be or
may not be free
The release operation does not combine adjacent free blocks. It simply
pretends a node corresponding to a released block at the front of the
free list. This operation is thus O(1). Adjacent free blocks are combined
by acquire().
The acquire operation traverses the free list in order to find a free area of
a suitable size. As it does so it also combines adjacent free blocks.
44 Department of CSE and IT
Doubly Linked List Example
Node structure:
Initial state of memory (shaded=allocated, grayed=boundary tags)
The corresponding free list
45 Department of CSE and IT
Doubly Linked List Example (Cont.)
The operation release(400, 4000) will result in:
The node corresponding to the freed block is appended at the
front of the free-list. The nodes x, y, and z correspond to the
three free blocks that have not yet been combined.
46 Department of CSE and IT
Doubly Linked List Example (Cont.)
The operation acquire(600) using the first-fit allocation policy
will first result in the combination of the three adjacent free
blocks:
At this point the corresponding free list is:
47 Department of CSE and IT
Doubly Linked List Example (Cont.)
The required 600 bytes are then allocated, resulting in:
The corresponding free list is:
48 Department of CSE and IT
Buddy Systems implementation of free-list
Instead of having a single free list, it has an array of free lists;
each element of the array holding blocks of the same size. One
type of buddy systems is the binary buddy system.
For a memory of size m, there are free-lists of size 2
0
, 2
1
,
2
2
, . . . , 2
k
, where m 2
k
The heap is viewed as one large block which can be split into
two equal smaller blocks, called buddies. Each of these
smaller blocks can again be split into two equal smaller
buddies, and so on. Each memory block has its “buddy”. The
“buddy” of a block of size 2
k
that starts at address x is the
block of size 2
k
that start at address y = complementBit_k(x),
where the address bits are numbered from right to left starting
with 0.
49 Department of CSE and IT
Buddies
If each block is of size 8 bytes (i.e., 2
3
bytes); then
the buddy of a block is obtained by complementing
bit 3 of its starting address. If each block is of size 4
bytes (i.e., 2
2
bytes); then the buddy of a block is
obtained by complementing bit 2 of its starting
address.
Example: What is the starting address of the buddy
of a block that starts at address 1100101010101101
if each block is 16 bytes?
Solution: 16 = 2
4
; the starting address of the buddy
is obtained by complementing bit 4:
1100101010111101
50 Department of CSE and IT
Binary Buddy System implementation of free-list
Each array element is a list of free blocks of same size. The size of
each block is a power of 2.
51 Department of CSE and IT
Binary Buddy System Algorithms
acquire(x): x <= 2
k
, the corresponding free list is searched
If there is a block in this list, it is allocated;
otherwise a block of size 2
k+1
, 2
k+2
, and so on is searched and
taken off the free list. The block is divided into two buddies. One
buddy is put on the free list for the next lower size and the other
is either allocated or further splinted if needed.
release(x): The block is placed back in the free list of its size, and
if its buddy is also free they are combined to form a free block of
size 2
k+1
. This block is then moved to the corresponding free list.
If its buddy is free they are combined to form a free block of size
2
k+2
, which is then moved to the appropriate free list and so on.
52 Department of CSE and IT
Buddy Systems Advantages/Disadvantages
•Advantage:
–Both acquire( ) and release( ) operations are fast.
•Disadvantages:
–Only memory of size that is a power of 2 can be allocated
internal fragmentation if memory request is not a power of 2.
–When a block is released, its buddy may not be free, resulting in
external fragmentation.
53 Department of CSE and IT
Department of CSE and IT54
QUIZ Test
1. Technique for direct search is
(A)Binary Search (B) Linear Search (C) Tree Search (D) Hashing
2. If h is any hashing function and is used to hash n keys in to a table of size m,
where n<=m, the expected number of collisions involving a particular key x is :
(A)less than 1. (B) less than n. (C) less than m. (D) less than n/2.
3. The OS of a computer may periodically collect all the free memory space to
form contiguous block of free space. This is called
(A)Concatenation (B) Garbage collection (C) Collision (D) Dynamic Memory Allocation
4. What is the best definition of a collision in a hash table?
A. Two entries are identical except for their keys.
B. Two entries with different data have the exact same key.
C. Two entries with different keys have the same exact hash value.
D. Two entries with the exact same key have different hash values.
5. In an open-address hash table there is a difference between those spots which
have never been used and those spots which have previously been used but no
longer contain an item. Which function has a better implementation because of
this difference?
A. insert B. is present C. remove E. size F. Two or more of the above
functions
-