#009 DS&A - Hashing - Omar ElKhatib

Hashing

Direct address Table

we have many data structures but they are all slow at searching.

type	time complexity
unsorted array	O(n)
sorted array	O(log n)
linked list	O(n)
Binary Tree	O(n)
Binary Search Tree	O(n)
Balanced Binary Search Tree	O(log n)
Priority (min and max)	O(n)

to minimize the searching time we use hashing which is O(1) , also some people use Direct address Table is just same as an array.

{1,2,3,4,5,...,100} if we need to put insert an element we put it with a key and store it let's say we need to store the number of occurrence for character C the number of ASCII let's assume it's 128

arr[99] I will store in this array the number of occurrence for c where 99 is the key of c

, and now if I need to access the number of occurrence letter c I can access it directly using arr[99] which is O(1)

Introduction to hashing

the problem with this direct access is that array should have big as key sizes , to solve this problem we use hashing.

let's say we have m keys and n elements. in order to map a key to element we need an hash function which will take time of n%m which is constant O(1) . When we need to hash a new element to element and it have same location of an previous element it's called collision example 141%10=1 and 21%10=1 so this is collision because they have same key . The challenge with hash map is the collision.

  i) better hash functions
 ii) chaining (using linked list)
iii) open addressing , implemented using verious ways
  a) linear probing
  b) quadratic probing
  c) Double hashing

chaining

after computing the hash if there is a collision , the previous value will link to the new value.

worst case search - O(n)
worst case deletion - O(n)

average search O(1+ alpha)
average deletion O(1)

where n is the number of elements represented in the linked list , and m is the number of cells in linked list.

alpha = n/m where alpha is the load factor.

An advantage of chained hash table over open addressing scheme is deletion is easier.

Open addressing

open addressing AKA closed hashing

let's say we have array that can hold m elements it mean all the elements I need should be inside array not outside like chaining and U keys , where U is very large and m is limited so definitely their is a collision.

we have the load factor alpha which is always will be between 0 and 1 .

We resolve the collision by applying the hash function with a small modification , this is called probing second time. Probing is nothing but finding location in hash table where we need place an element.

we apply the hash function if we find that the element is already picked we make a combination between u and finite set [0,m-1]

the worst time here to search an element is O(m) because we need to visit all the table to find that element.

assume the X is mean that slot is not empty and N is NULL (we can put element here).

h(k)        = 0

1 is NULL so I can put k in it.

h(k1)       = 1
h(k1 , 1)   = 5
h(k1 , 2)   = 6

h(k1) is 1 but 1 is not free ( their is collision ) so we run hash function again we get 5 which also is not free , so we run it again with 2 we get 6 which is NULL.

same for searching assume we have the element we need k1 is at slot 0

h(k,0) = 2
h(k,1) = 6
h(k,2) = 3
...
h(k,7) = 0

so I take O(m) to find it in worst case

assume the k1 is at slot 5 instead of 7 so if I reach h(k,5) and find it NULL I stop because I know that this element is not presented , because if it's not NULL the hash function will occupy it for k1 .

This is the case if no elements being deleting . If it's deleted I will continue searching (probing). So deletion will cause a problem so I will assign an element that mean oh there is an element let's say D here but it is deleted , and in insertion if I find D I will treat it as NULL so I place replace it with my element.

In worst case if All elements are deleted and I need an element after them , so chaining is the best in this case.

Linear probing

h : U --> {0 ... m-1}

if I run h(k) = a so index is a , if a is not empty will I will take another hash function h prime h'(k,i)=(h(k)+i) mod m which is h'(k,1) = ( h(k)+1) mod n = (a+1) mod m = a+1 and h'(k,2) = ( h(k)+2) mod n = (a+2) mod m = a+2

it's called linear probing because if the element at a is occupied it try a+1 if it also occupied it will try a+2 if it reach m it will start again from 0 to a

so if it start at 1 it will start from 1 and end at 0

{1,2,3 ... m-1 , 0} if it start at k it will end at k-1

the number of prob sequences is m . if 2 elements have the initial prob is the same it's called secondary clustering. Primary clustering is if a continuous block of memory is sequentially occupied it mean from a to a+k is filled , so the probability of inserting an element will be k+1/m

the worst search is O(m) but the average is constant O(2.5) their is a lot of proofs but they are hard and contain a lot of advanced probability theorems.

Quadratic probing

h'(k,i) = ( h(k) + c1*i + c2 i*i ) mod m

the problem with quadratic probing is that we need to choose c1 , c2 and m in such a way that h'(k,i) will examine entire table.

Double hashing

h'(k) = ( h1(k) + i * h2(k) ) mod m

if 2 elements have the same
h1 , h2 probably will not be the same

selection of h2(k)

The secondary hash function h2(k) it should never yield an index of zero

it should cycle through the whole table
it should be very fast to compute
it should be pair-wise independent of
h1(k)
The distribution characteristics of h2(k) are irrelevant. It is analogous to a random-number generator - it is only necessary that h2(k) be ’’relatively prime’’ to m .

In practice, if division hashing is used for both functions, the divisors are chosen as primes.

This post is also available on DEV.