How to store Employee records, which consists of Telephone no., Name, City, Department, Salary, etc… Telephone Name City Dept. Salary 2
Solution Use an Array: Search takes a linear time. If stored in sorted order this search can be done in O(log n) using binary search but then insertion and deletion becomes costly because we have to maintain the sorted order. 3
Solution Use an Linked List: Search takes again linear time to search for particular record as se need to go through all the records until the required one is found. Use balanced BST to store records: Insertion takes O(log n). Search takes O(log n). Deletion takes O(log n). 4
Solution Create a Direct Access Table: Insertion takes O(1). Search takes O(1). Deletion takes O(1). 5
Limitations: 1] Size of Table: m * 10^n Where, m = size of pointer to the record n = digits in the telephone number Telephone Memory Location (Pointer) ……………………. 9856234178 0x1… 8695234512 0x2… ……………………. Name City Dept ……. Tejas Pune IT Name City Dept ……. Rahul Mumbai HR Limitations: 2] Integer may not hold the size of n digits 6
Introduction to Hashing Provides O(1) time on average for insert, search and delete. Find items with keys matching a given search key Given an array A, containing n keys, and a search key x, find the index i such as x=A[i] Example of Record KEY Other Data 7
Continue.. Hashing is an important Data Structure which is designed to use a special function called the Hash function. Hash function is used to map a given value with a particular key for faster access of elements. The efficiency of mapping depends of the efficiency of the hash function used. Some examples of how hashing is used in our lives include: In universities, each student is assigned a unique roll number that can be used to retrieve information about them. In libraries, each book is assigned a unique number that can be used to determine information about the book, such as its exact position in the library or the users it has been issued to etc. 8
Continue.. Assume that you have an object and you want to assign a key to it to make searching easy . To store the key/value pair , you can use a simple array like a data structure where keys (integers) can be used directly as an index to store values. However, in cases where the keys are large and cannot be used directly as an index, you should use hashing . 9
Continue.. In hashing , large keys are converted into small keys by using hash functions . The values are then stored in a data structure called hash table . The idea of hashing is to distribute entries (key/value pairs) uniformly across an array . Each element is assigned a key (converted key). By using that key you can access the element in O(1) time. Using the key, the algorithm (hash function) computes an index that suggests where an entry can be found or inserted . 10
Important Keywords: Hash Table: A data structure used for storing and retrieving data quickly. Where each entry in hash table is made using hash function. Hash Function: It maps a big number or string into a small integer that can be used as index in hash table. Used to place data in hash table and also to retrieve data from hash table. A function that can be used to map a data set of an arbitrary size to a data set of a fixed size, which falls into the hash table . The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes. Bucket: Each position of hash table is called bucket. Collision: It is a situation in w hich hash function returns the same address for more than one record. 25 1 Bucket 16 2 42 3 33 4 54 Given: 25, 33, 54, 16, 42 and table of size 5. Hash Function is KEY mod 5 Key mod 5 25 mod 5 = 0 33 mod 5 = 3 54 mod 5 = 4 16 mod 5 = 1 42 mod 5 = 2 30 mod 5 = 0 This is collision. Hash Table 11
Important Keywords: Probe: Each calculation of an address and test for success is known as p robe. Synonym: The set of keys that has the same location are called synonyms. Ex: 25 and 30 are the synonyms. Overflow: When hash table becomes full and new record needs to be inserted and resulting into more collisions the it is called overflow. Perfect hash function: It is a function that maps distinct key elements into the hash table with no collisions. 25 1 16 2 42 3 33 4 54 Entry of 25, 33, 54, 16, 42 in a table of size 5 with probe 5 30 Where 30 mod 5 = 0 is a collision, synonym for 25 and it also shows the overflow situation Hash Table Consider 25, 33, 54, 16, 42, 30. Show Perfect Hash Function 12
To achieve a good hashing mechanism, It is important to have a good hash function with the following basic requirements: Easy to compute Uniform distribution Less Collision 13
Important Keywords: 14
Types of Hash Functions Division Method Multiplication Method Extraction Method Mid Square Method Folding and Universal Method 15
Division Method It depends upon the remainder of division. Typically the divisor is table length. Ex: 54, 72, 89, 37 is to be placed in hash table & if table size 10. 1 2 3 4 5 6 7 8 9 Hash Function h(key) = record % table_size 54 % 10 = 4 72 % 10 = 2 89 % 10 = 9 37 % 10 = 7 54 72 89 37 Disadvantage: The consecutive keys map to consecutive hash values in the hash table. This leads to a poor performance. 16
Multiplication Method 17
Extraction Method Some digits are extracted from the key to form the address location in hash table. Ex: Suppose 1 st , 3 rd and 4 th digits from left are selected for hash key. 4 9 7 8 2 4 478 At 478 location in the hash table of size 1000 the key can be stored. 18
Mid Square Method Square the key. Then extract middle part of the result. That will be the location of key element in hash table. Ex: A] 3111 = Square(3111) B] 16 = Square(16) 96 783 21 At 783 location in the hash table of size 1000 the key can be stored. 2 56 At 56 location in the hash table. 19
Folding Method Two folding techniques: A] Fold Shift: The key is divided into separate parts based on size of required address. Finally add those parts. Ex: 345678123 And required address is of 3 digits. B] Fold Boundary: The key is divided into separate parts based on size of required address. Then left & right parts are folded and then add those parts . Ex: 345678123 345 678 123 345 678 123 ------- 1 146 345 678 123 543 678 321 ------- 1 542 Index is 146. Index is 542. 20
Universal Hashing In this technique randomized algorithm is used to select the hash function at random from family of hash functions with the help of certain mathematical properties. Let U is the set of Universal Keys and H be the finite collection of Hash functions mapping U into {0, 1, …….., m-1}. Say there is collision or your hash function perform badly for input key or a pair (x, y) i.e. { (x,y) | h(x)=h(y)}. U . . m-1 Bad portion of key 21
Universal Hashing The fundamental weakness of hash functions is collision so to avoid that what we do we can choose the Universal hashing where the hash function randomly at the runtime. Lets consider H = set of all hash functions h from U to {0,1,…,m-1}. { h | h : U ( 0,1,…,m-1 ) } H Bad portion 22
Universal Hashing 23
24
Properties of Good Hash Functions Rules for choosing good hash function: The hash function should be simple to compute. Number of collisions should be less ideally no collision should occur (perfect hash function). Hash function should produce such keys which will get distributed uniformly over an array. The hash function should depend on every bit of the key and not on the extracted portion of the key. 25
Collision Resolution Strategies 26
Separate chaining (open hashing) The open hashing is also known as separate chaining in which collisions are stored outside the table. Separate chaining is one of the most commonly used collision resolution techniques . It is usually implemented using linked lists. In separate chaining, each element of the hash table is a linked list. To store an element in the hash table you must insert it into a specific linked list. If there is any collision (i.e. two different elements have same hash value) then store both the elements in the same linked list. 27
Input: 81, 0, 64, 25, 1, 36, 4, 49, 16, 9. 28
Advantages: Simple to implement. Hash table never fills up, we can always add more elements to chain. Less sensitive to the hash function. It is mostly used when it is unknown how many and how frequently keys may be inserted or deleted Disadvantages: Wastage of space. Cache performance of chaining is not good as keys are stored using linked list. If the chain becomes long then search time can become O(n) in worst case. Uses extra space for links. 29
Open Addressing A] Linear Probing with chaining (without replacement): A separate chain table is maintained for colliding data. The address of colliding data is stored with first colliding element in chain table. We can place element at next available bucket or slot and will keep track of that in chain column. Ex: 131, 3, 4, 21, 61, 6, 71, 8, 9. Index Data Chain -1 1 -1 2 -1 3 -1 4 -1 5 -1 6 -1 7 -1 8 -1 9 -1 131 3 4 21 mod 10 = 1 Collision 21 2 61 mod 10 = 1 Collision 6 1 5 6 71 7 8 9 71 mod 10 = 1 Collision -1 Index Data Chain -1 -1 1 131 2 2 21 5 3 3 -1 4 4 -1 5 61 7 6 6 -1 7 71 -1 8 8 -1 9 9 -1 Drawback: Finding the next empty location and that results into the element actually belonging to that empty location cannot obtain its location. 30
Linear Probing with chaining (without replacement) Ex: Consider elements as 21, 31, 41, 2, 6, 5, 7, 55 with table size of 10. Index Data Chain -1 1 -1 2 -1 3 -1 4 -1 5 -1 6 -1 7 -1 8 -1 9 -1 21 41 2 31 2 5 3 6 -1 Index Data Chain -1 -1 1 21 2 2 31 3 3 41 4 4 2 -1 5 5 8 6 6 -1 7 7 -1 8 55 -1 9 -1 -1 7 55 -1 4 8 31
B]Linear Probing with chaining (with replacement) Ex : 131, 21, 31, 4, 5, 2. Index Data Chain -1 1 -1 2 -1 3 -1 4 -1 5 -1 6 -1 7 -1 8 -1 9 -1 131 31 4 21 mod 10 = 1 Collision 21 2 31 mod 10 = 1 Collision 5 3 2 mod 10 = 2 Index Data Chain -1 1 131 2 2 3 31 -1 4 4 -1 5 5 -1 6 7 -1 8 -1 9 -1 Replace 21 by 2 and accordingly chain table will be updated. 2 21 3 21 3 -1 6 Index Data Chain -1 -1 1 131 6 2 2 -1 3 31 -1 4 4 -1 5 5 -1 6 21 3 7 -1 -1 8 -1 -1 9 -1 -1 32
Quadratic Probing Problem of clustering Accumulation of Consecutive cells n hash table When we want to add element we have to go for many probes means need to search for more elements to put element in table. Index Data 79 1 41 2 22 3 32 4 44 5 33 6 7 8 18 9 59 Solution: Quadratic Probing 34
Quadratic Probing Example: 37, 90, 55, 22, 11, 17, 49, 87 with m = 10. Index Data 90 1 11 2 22 3 4 5 55 6 7 37 8 9 17 49 87 36
Clustering Example Ex: 2, 12, 22, 32, 42, 52, 62 Index Data 1 32 2 2 3 12 4 5 6 22 7 52 8 42 9 [Hash(key)+i^2] mod M (2 + 6^2)mod 10 = 38 8 (2 + 4^2)mod 10 = 18 8 (2 + 7^2)mod 10 = 51 1 (2 + 8^2)mod 10 = 66 6 (2 + 9^2)mod 10 = 83 3 (2 + 10^2)mod 10 = 102 2 This is known as secondary clustering. Here we are unable to add an element in quadrating probing because the index calculated is never empty. 37
Double Hashing Second hash function is applied to the key when collision occurs. 38