Hash Conflicts/Collisions

# Hash Collisions

Recap: we want to store and retrieve our data in constant time, and hashing is one of the ways to get there. In our example of storing book data, we saw that our hash function returned the same value for books whose titles had the same length. This is a classic hash conflict, where the hash function maps different keys to the same location. How do we solve this? One way would be to devise a perfect hash function that never returns the same location for two different keys, but that is usually not possible.
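To make the conflict concrete, here is a minimal sketch in Python. It assumes the weak hash function from the earlier books example simply took the title's length modulo the bucket size; the exact function there may have differed, but the effect is the same:

```python
BUCKET_SIZE = 100  # assumed bucket size, matching the example below

def hash_title(title):
    # Hash based only on the title's length -- deliberately weak.
    return len(title) % BUCKET_SIZE

print(hash_title("Steve Jobs"))  # 10
print(hash_title("Safe House"))  # 10 -- same location for a different key: a collision
```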

We can at least devise a hashing function that minimises the number of collisions, and that should be our aim while designing one. Let's look at some methods that help resolve hash conflicts.

## Open Addressing

In our example, suppose the bucket size is 100 and we have two books titled "Steve Jobs" and "Safe House". Both titles are 10 characters long, so both hash to bucket[10]. "Steve Jobs" gets stored at bucket[10], and for "Safe House" we need to do something else. We can start with something simple: find the next available empty slot in the bucket and store "Safe House" there. For example, we check positions 11, 12, 13, and so on until we find an empty slot, store the book there, and return. This approach is called "linear probing", because we keep probing the following slots until we find an empty one where the key can be stored.
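A minimal sketch of linear probing could look like this, with `None` marking an empty slot; `hash_title` is the assumed length-based hash from above, and the table is assumed not to be full:

```python
BUCKET_SIZE = 100
bucket = [None] * BUCKET_SIZE

def hash_title(title):
    return len(title) % BUCKET_SIZE

def insert(title):
    location = hash_title(title)
    # Linear probing: step to the next slot until an empty one is found.
    while bucket[location] is not None:
        location = (location + 1) % BUCKET_SIZE
    bucket[location] = title
    return location

print(insert("Steve Jobs"))  # 10
print(insert("Safe House"))  # 11 -- bucket[10] was taken, so the next slot is used
```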

How do we find the next empty slot?

We use a technique called rehashing (open addressing): if we get a conflict for a key, we pass the conflicting location to a rehash function, repeatedly if necessary, until we find the right slot for storing the data.

```
location = hash(key)
// keep rehashing while the slot holds a different key and is not empty
while bucket[location] != key && bucket[location] != NULL
    location = rehash(location)
```

So we might end up applying the rehash function several times before we reach an available position to store our data.

A good rehash function would reduce the number of iterations of the while loop and give us a location fairly quickly.
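Following the pseudocode above, a lookup with a pluggable `rehash` function might look like this sketch. The probe counter is borrowed from the mitigation discussed under Challenges below, so the loop cannot run forever on a full table:

```python
def search(bucket, key, hash_fn, rehash):
    location = hash_fn(key)
    probes = 0
    # Keep rehashing while the slot is occupied by a different key.
    while bucket[location] != key and bucket[location] is not None:
        location = rehash(location)
        probes += 1
        if probes >= len(bucket):  # give up after bucket-size probes
            return None
    return location if bucket[location] == key else None

# Usage with a simple linear rehash:
# search(bucket, "Safe House", hash_title, lambda i: (i + 1) % len(bucket))
```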

### Challenges

- The loop may execute forever, because the bucket may be full.
  - We can mitigate this by keeping a counter of stored keys and comparing it against the bucket size.
- The loop can still execute forever if the rehash function only ever probes specific locations.
  - For example, with a rehash function like (i + 2) % 100, a key that hashes to an odd integer will keep landing on odd locations on every subsequent rehash, skipping all even-numbered slots.

A good rehash function should, on successive rehashes from any index i, cover as many of the integers between 0 and bucketSize - 1 as possible. Proven fact: any function rehash(i) = (i + c) % tableSize, where c and tableSize are relatively prime, produces successive values that cover the entire table.
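The sketch below illustrates this on a shrunken table of size 10: a step of c = 2 (which shares a factor with 10) keeps hitting only half the slots, while c = 3 (relatively prime to 10) visits all the remaining slots before coming back to the start:

```python
def probe_sequence(start, c, table_size):
    # Apply rehash(i) = (i + c) % table_size until we return to the start.
    visited = []
    i = (start + c) % table_size
    while i != start:
        visited.append(i)
        i = (i + c) % table_size
    return visited

print(probe_sequence(1, 2, 10))  # [3, 5, 7, 9] -- even slots are never reached
print(probe_sequence(1, 3, 10))  # [4, 7, 0, 3, 6, 9, 2, 5, 8] -- all other slots covered
```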

### Clustering

#### Primary Clustering

#### Secondary Clustering

### Double Hashing

### Deleting Items