Lec12-Hash-Tables-27122022-125641pm.pptx


Slide Content

Week 13: Hashing
Qazi Haseeb Yousaf, Dept of CS, Bahria University, Islamabad
Data Structures and Algorithms

Introduction to Searching
- Linear search and binary search locate an item by a sequence of comparisons: the item being sought is repeatedly compared with items in the list.
- Fast searching: the location of an item is determined directly as a function of the item itself, with no trial-and-error comparisons.
- Ideally, the time required to locate an item is constant and does not depend on the number of items stored. This is what hash tables and hash functions provide.

Why Hashing?
- The amount of content, especially on the internet, keeps increasing; it becomes impossible to find anything unless new data structures and algorithms for storing and accessing data are developed.
- Problem with traditional data structures like arrays and linked lists:
  - Sorted array -> binary search -> time complexity O(log n)
  - Unsorted array -> linear search -> time complexity O(n)
  - Neither may be acceptable when processing a very large data set.
- Hashing is a technique that allows us to update and retrieve any entry in constant time, O(1). Constant time, or O(1), performance means the amount of time to perform the operation does not depend on the data size n.

Applications of Hashing
- Compilers use hash tables to implement the symbol table (a data structure that keeps track of declared variables)
- Game programs
- Spell checking
- Substring pattern matching
- Searching
- Document comparison

When Not to Use Hashing?
- Hash tables are very good when many searches are needed in a reasonably stable table.
- Hash tables are not so good if there are many insertions and deletions, or if table traversals are needed.
- If there is more data than available memory, use a tree.
- Hashing is also very slow for any operation that requires the entries to be sorted, e.g. finding the minimum key.

A Simple Example: Direct Hashing
- Seven integers in the range 0-10 are to be stored in a hash table: Keys = {7, 3, 6, 4, 9, 1, 5}
- The hash table can be implemented by an integer array, table:
  - Initialize each array element with some dummy value, like -1.
  - Store value i at location table[i].
- Resulting table:
  index: 0   1   2   3   4   5   6   7   8   9   10
  value: -1  1  -1   3   4   5   6   7  -1   9  -1

A Simple Example: Direct Hashing
- To check whether a particular value, number, is stored in the hash table, we only need to check whether table[number] == number.
- Hash function: the function h, defined here by h(i) = i, that determines the location of an item i in the hash table is called the hash function.
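
As a rough illustration (not part of the slides), here is a minimal C++ sketch of direct hashing for the keys above, assuming a table of size 11 and -1 as the dummy value for an empty cell:

    #include <iostream>

    const int TABLE_SIZE = 11;      // keys range from 0 to 10
    int table[TABLE_SIZE];

    // Direct hashing: the hash function is the identity, h(i) = i.
    void init() {
        for (int i = 0; i < TABLE_SIZE; i++)
            table[i] = -1;          // -1 marks an empty cell
    }
    void insert(int key)   { table[key] = key; }           // store value i at table[i]
    bool contains(int key) { return table[key] == key; }   // one comparison, O(1)

    int main() {
        init();
        int keys[] = {7, 3, 6, 4, 9, 1, 5};
        for (int k : keys) insert(k);
        std::cout << contains(9) << " " << contains(8) << "\n";   // prints 1 0
    }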

A Simple Example: Direct Hashing
- Now suppose the seven integers range from 0 to 999.
- The hash table can still be implemented by an integer array, table, with indices 0 to 999: initialize each element with a dummy value like -1, and store value i at location table[i].

A Simple Example: Direct Hashing
- For the hash function h(i) = i, the time required to search the table for a given item is constant: only one location needs to be examined.
- Very time efficient, but not space efficient at all: only 7 out of 1000 locations are used, leaving 993 unused.
- Since it is possible to store 7 values in 7 locations, we can improve on space utilization.

Hash Functions
One possible hash function could be h(i) = i modulo 7, or in C++ syntax:

    int h(int i) { return i % 7; }

For Keys = {7, 3, 6, 4, 9, 1, 5}, the table of size 7 becomes:
  index: 0  1  2  3  4  5  6
  value: 7  1  9  3  4  5  6

Hashing and Hash Functions
- With a table of size 25, the hash function h(i) = i % 25 always produces an integer in the range 0 to 24.
- The integer 52 is thus stored in table[2], since h(52) = 52 % 25 = 2.
- Similarly, 129, 500 and 49 are stored in locations 4, 0 and 24 respectively.
- Hash table: table[0] = 500, table[2] = 52, table[4] = 129, table[24] = 49; the remaining cells hold -1.

Hash Tables: Formal Definition
- The hash table structure is an array of some fixed size, containing the items.
- A stored item needs to have a data member, called the key, that is used in computing the index value for the item. The key could be an integer, a string, etc., e.g. a name or ID that is part of a larger employee structure.
- The size of the array is TableSize; the items stored in the hash table are indexed by values from 0 to TableSize - 1.
- Each key is mapped into some number in the range 0 to TableSize - 1. The mapping is implemented through a hash function.

Example
Items (name, key): mary 28200, dave 27500, phil 31250, john 25000. Each key is passed through the hash function to obtain an index into the hash table:

    int hash(int key) { return key % table_size; }

    void insert(int key) {
        int h = hash(key);
        table[h] = key;
    }

Example
The simplest kind of hash table is an array of records. This example has 701 record slots, indexed [0] through [700].

Example
Each record has a special field, called its key. In this example, the key is the ID of an individual, a long integer such as 506643548.

Example
The rest of the record holds other information about the person.

Example
When a hash table is in use, some spots contain valid records (here, IDs 506643548, 233667136, 281942902, 155778322) and other spots are empty.

Example: Inserting a New Record
- To insert a new record (e.g. ID 580625685), the key must somehow be converted to an array index.
- The index is called the hash value of the key.

Example: Inserting a New Record
- Simplest hash function: hash value = ID mod 701.
- For the new record: 580625685 mod 701 = 3.

Example: Inserting a New Record
The new record is inserted at location [3] in the hash table.

Hash Function Examples for Integer Keys
Let us consider a hash table of size 1000.
- Truncation: if students have a 9-digit identification number, take the last 3 digits as the table position. E.g. 925371622 becomes 622.
- Folding: split the 9-digit number into three 3-digit numbers and add them. E.g. 925371622 becomes 925 + 371 + 622 = 1918.
- Modular arithmetic: with a table size of 1000, the first example always stays within the table range, but the second does not, so the sum should be taken mod 1000. E.g. 1918 mod 1000 = 918 (1918 % 1000 in C++).
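
As a sketch only (the function names are illustrative, not from the slides), the truncation and folding ideas could be coded in C++ for a 9-digit key and a table of size 1000 as follows:

    // Truncation: keep the last 3 digits, e.g. 925371622 -> 622.
    int truncationHash(long key) { return key % 1000; }

    // Folding: add the 3-digit groups, then take mod 1000 to stay in range,
    // e.g. 925371622 -> 925 + 371 + 622 = 1918 -> 918.
    int foldingHash(long key) {
        long sum = 0;
        while (key > 0) {
            sum += key % 1000;      // peel off the lowest 3-digit group
            key /= 1000;
        }
        return sum % 1000;          // modular arithmetic step
    }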

Hash Function
- A hash function should be easy and fast to compute.
- A hash function should scatter the data evenly throughout the hash table: how well does it scatter random data? How well does it scatter non-random data?
- If the input keys are integers, then simply Key mod TableSize is a general strategy.
- If the keys are strings, the hash function needs more care: first convert the string into a numeric value.

Hash Functions for Non-Numeric Keys
Add up the ASCII values of all characters of the key and take the result mod the table size (here M = 27):

    int h(String x, int M) {
        char[] ch = x.toCharArray();
        int sum = 0;
        for (int i = 0; i < ch.length; i++)
            sum += ch[i];
        return sum % M;
    }

"Apple" = (65 + 112 + 112 + 108 + 101) % 27 = 498 % 27 = 12

Collision
If, when an element is inserted, it hashes to the same value as an already inserted element, then we have a collision and need to resolve it.

Collision Resolution

Separate Chaining
- The idea is to keep a list of all elements that hash to the same value. The array elements are pointers to the first nodes of the lists, and a new item is inserted at the front of its list.
- Advantages:
  - Better space utilization for large items.
  - Simple collision handling: searching a linked list.
  - Overflow: we can store more items than the hash table size.
  - Deletion is quick and easy: just delete from the linked list.

Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 with hash(key) = key % 10.
Resulting chains: 0 -> {0}; 1 -> {81, 1}; 4 -> {64, 4}; 5 -> {25}; 6 -> {36, 16}; 9 -> {49, 9}; buckets 2, 3, 7 and 8 are empty.

Operations
- Initialization: all entries are set to NULL.
- Search: locate the cell using the hash function, then do a sequential search on the linked list in that cell.
- Insertion: locate the cell using the hash function; if the item does not already exist, insert it as the first item in the list.
- Deletion: locate the cell using the hash function, then delete the item from the linked list.

    class Node {
    public:
        int key;
        Node *next;
    };

    class hash {
    public:
        Node *hashtable[MAX];

        // Hash function that generates the hash index
        int hashfunction(int key);

        // Constructor: initialize the array of pointers to NULL
        hash();

        // Insert a value in the hash table: create a node, store the value in it,
        // and place the node at the appropriate location.
        void insert(int k);

        // Display the complete data in the hash table
        void display();
    };
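
One possible implementation of these members (a sketch, not the instructor's reference solution; it assumes #include <iostream>, a #define MAX for the table size, and the class declaration above):

    int hash::hashfunction(int key) { return key % MAX; }

    hash::hash() {                              // all buckets start empty
        for (int i = 0; i < MAX; i++)
            hashtable[i] = NULL;
    }

    void hash::insert(int k) {                  // insert at the front of the bucket's list
        int index = hashfunction(k);
        Node *n = new Node;
        n->key = k;
        n->next = hashtable[index];
        hashtable[index] = n;
    }

    void hash::display() {                      // print every chain
        for (int i = 0; i < MAX; i++) {
            std::cout << i << ":";
            for (Node *p = hashtable[i]; p != NULL; p = p->next)
                std::cout << " " << p->key;
            std::cout << "\n";
        }
    }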

Open Addressing
- Separate chaining has the disadvantage of using linked lists: it requires the implementation of a second data structure.
- In an open addressing hashing system, all the data go inside the table itself. If a collision occurs, alternative cells are tried until an empty cell is found.

Open Addressing
There are three common collision resolution strategies:
- Linear probing
- Quadratic probing
- Double hashing

Load Factor
- Load factor: λ = (current number of items) / TableSize. It measures how full a hash table is; for example, 7 items in a table of size 10 give λ = 0.7.
- A hash table should not be too loaded if we want good performance from hashing.
- A good load factor is generally <= 0.5 in the case of open addressing.
- Only in separate chaining can the load factor be greater than 1.

Linear Probing
- In linear probing, collisions are resolved by sequentially scanning the array (with wraparound) until an empty cell is found.
- Continuing the earlier mod-25 example: once 77 collides with 52 at location 2, we simply put 77 in position 3.
- Hash table now: table[0] = 500, table[2] = 52, table[3] = 77, table[4] = 129, table[24] = 49.

Linear Probing
- To insert 102 (which also hashes to 2), we follow the probe sequence consisting of locations 2, 3, 4 and 5 to find the first available location, and thus store 102 in table[5].
- Note: if the search reaches the end of the table, we continue at the first location.

Linear Probing
To determine if a specified value is in the hash table, we first apply the hash function to compute the position at which this value should be found. There can be one of the following cases:
- The location is empty.
- The location contains the specified value.
- The location contains some other value: begin a circular linear search until either the item is found, we reach an empty location, or we return to the starting location.
Let hash(x) be the slot index computed using the hash function and S the table size:
- If slot hash(x) % S is full, we try (hash(x) + 1) % S.
- If (hash(x) + 1) % S is also full, we try (hash(x) + 2) % S.
- If (hash(x) + 2) % S is also full, we try (hash(x) + 3) % S, and so on.

Linear Probing: Example
Table size is 11 (indices 0..10), hash function h(x) = x mod 11, insert keys 20, 30, 2, 13, 25, 24, 10, 9:
- 20 mod 11 = 9
- 30 mod 11 = 8
- 2 mod 11 = 2
- 13 mod 11 = 2 -> probe 2+1 = 3
- 25 mod 11 = 3 -> probe 3+1 = 4
- 24 mod 11 = 2 -> probe 2+1, 2+2, 2+3 = 5
- 10 mod 11 = 10
- 9 mod 11 = 9 -> probe 9+1, (9+2) mod 11 = 0
Final table: [0]=9, [2]=2, [3]=13, [4]=25, [5]=24, [8]=30, [9]=20, [10]=10 (slots 1, 6, 7 empty).
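
A short C++ sketch of this insertion procedure (illustrative names, not from the slides; it assumes a table of size 11 with -1 marking empty slots):

    const int S = 11;
    int table11[S];                              // fill with -1 before use

    int hashFn(int x) { return x % S; }

    // Linear probing: try h, h+1, h+2, ... (mod S) until an empty slot is found.
    bool insertLinear(int x) {
        int start = hashFn(x);
        for (int i = 0; i < S; i++) {
            int pos = (start + i) % S;
            if (table11[pos] == -1) { table11[pos] = x; return true; }
        }
        return false;                            // table is full
    }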

Linear Probing: Clustering Problem
- One of the problems with linear probing is that table items tend to cluster together, i.e. the table contains groups of consecutively occupied locations. This phenomenon is called primary clustering.
- Clusters can get close to one another and merge into a larger cluster. Thus one part of the table might be quite dense, even though another part has relatively few items.
- Primary clustering causes long probe searches and therefore decreases the overall efficiency.

Clustering Problem
- As long as the table is big enough, a free cell can always be found, but the time to do so can get quite large.
- Larger table sizes are preferred: studies suggest using tables whose capacity is approximately 1.5 to 2 times the number of items that must be stored.

Quadratic Probing
- Quadratic probing eliminates the primary clustering problem of linear probing.
- If the hash function evaluates to h and a search in cell h is inconclusive, we try cells h + 1^2, h + 2^2, ..., h + i^2, i.e. it examines cells 1, 4, 9 and so on away from the original probe.
- Subsequent probe points are a quadratic number of positions from the original probe point.

Quadratic Probing
Quadratic probing almost eliminates the clustering problem. Steps to follow:
- Start from the original hash location i.
- If the location is occupied, check locations i + 1^2, i + 2^2, i + 3^2, i + 4^2, ...
- Wrap around the table if necessary.
Let hash(x) be the slot index computed using the hash function and S the table size:
- If slot hash(x) % S is full, we try (hash(x) + 1*1) % S.
- If (hash(x) + 1*1) % S is also full, we try (hash(x) + 2*2) % S.
- If (hash(x) + 2*2) % S is also full, we try (hash(x) + 3*3) % S, and so on.

Quadratic Probing: Example
Table size is 11 (indices 0..10), hash function h(x) = x mod 11, insert keys 20, 30, 2, 13, 25, 24, 10, 9:
- 20 mod 11 = 9
- 30 mod 11 = 8
- 2 mod 11 = 2
- 13 mod 11 = 2 -> probe 2+1^2 = 3
- 25 mod 11 = 3 -> probe 3+1^2 = 4
- 24 mod 11 = 2 -> probe 2+1^2, 2+2^2 = 6
- 10 mod 11 = 10
- 9 mod 11 = 9 -> probe 9+1^2, (9+2^2) mod 11, (9+3^2) mod 11 = 7
Final table: [2]=2, [3]=13, [4]=25, [6]=24, [7]=9, [8]=30, [9]=20, [10]=10 (slots 0, 1, 5 empty).
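
The same insertion with a quadratic probe step (a sketch reusing table11 and hashFn from the linear probing sketch above; note that bounding the probes by S means insertion can fail even while a few slots remain free):

    // Quadratic probing: try h, h+1^2, h+2^2, ... (mod S).
    bool insertQuadratic(int x) {
        int start = hashFn(x);
        for (int i = 0; i < S; i++) {
            int pos = (start + i * i) % S;
            if (table11[pos] == -1) { table11[pos] = x; return true; }
        }
        return false;
    }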

Double Hashing
- A second hash function is used to drive the collision resolution: we apply a second hash function to x and probe at distances hash2(x), 2*hash2(x), ... and so on.
- The function hash2(x) must never evaluate to zero.
Let hash(x) be the slot index computed using the hash function and S the table size:
- If slot hash(x) % S is full, we try (hash(x) + 1*hash2(x)) % S.
- If (hash(x) + 1*hash2(x)) % S is also full, we try (hash(x) + 2*hash2(x)) % S.
- If (hash(x) + 2*hash2(x)) % S is also full, we try (hash(x) + 3*hash2(x)) % S, and so on.

Double Hashing
- Double hashing also reduces clustering.
- Idea: increment using a second hash function h2, which should satisfy h2(key) != 0 and h2 != h1.
- Probe the following locations until an unoccupied place is found: h1(key), h1(key) + h2(key), h1(key) + 2*h2(key), ...

Double Hashing: Example
Table size is 11 (indices 0..10), hash functions h1(x) = x mod 11 and h2(x) = 7 - (x mod 7), insert keys 58, 14, 91:
- 58 mod 11 = 3
- 14 mod 11 = 3 -> h2(14) = 7, probe 3+7 = 10
- 91 mod 11 = 3 -> h2(91) = 7, probe 3+7, (3+2*7) mod 11 = 6
Final table: [3]=58, [6]=91, [10]=14.
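
And the double hashing variant for this example (a sketch reusing the same table and first hash function from the sketches above; the step size h2(x) = 7 - (x mod 7) is never zero):

    int hash2(int x) { return 7 - (x % 7); }

    // Double hashing: try h1, h1 + h2, h1 + 2*h2, ... (mod S).
    bool insertDouble(int x) {
        int start = hashFn(x);
        int step  = hash2(x);
        for (int i = 0; i < S; i++) {
            int pos = (start + i * step) % S;
            if (table11[pos] == -1) { table11[pos] = x; return true; }
        }
        return false;
    }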

    class hash {
    public:
        int HashTable[MAX];

        // Hash function to generate the index: hash(key) = key % MAX
        int hashfunction(int key);

        // Accepts the hash table and the key to be inserted, and inserts the key
        // at the appropriate location, using linear probing to resolve collisions.
        // The returned value is the index at which the key is inserted.
        int linear_probing(int HashTable[], int key);

        // Same as above, but uses quadratic probing to resolve collisions.
        int quadratic_probing(int HashTable[], int key);

        // Same as above, but uses double hashing to resolve collisions.
        int double_hashing(int HashTable[], int key);
    };

HINT: quadratic probing can be implemented like:

    pos = hashfunction(key);
    for (i = 1; HashTable[pos] != -1; i++)
        pos = (hashfunction(key) + i * i) % MAX;

Self Assessment
Insert the keys {"pineapple", "grapefruit", "apricot", "coconut"} into a hash table of size 10, using the hash function H(key) = key % tablesize. To apply the hash function, first convert each non-numeric key into a number by counting its characters (e.g. 'sam' has three characters, so its numeric value is 3). In case of collision, apply linear probing. Also compute the load factor of the resulting hash table.

Self Assessment
Insert the keys {89, 18, 49, 58, 69} into a hash table of size 10 using the hash function H(key) = key % tablesize and each of the following collision resolution techniques separately:
- Linear probing
- Quadratic probing
- Double hashing, using the second hash function H2(key) = 7 - (key % 7)

Summary
- The analysis shows that the table size itself is not really important, but the load factor is.
- TableSize should be about as large as the number of elements expected in the hash table, to keep the load factor around 1 (in the separate chaining case).
- TableSize should be prime for an even distribution of keys to hash table cells.