
# Hash Collisions

## Recap

We want to store and retrieve our data in constant time, and hashing is one of the ways to achieve that. In our example of storing book data, we saw that our hash function returned the same value for books whose titles had the same length. This is a classic case of a hash conflict, where the hash function returns the same location for different keys. How do we solve this? One way would be to devise a perfect hash function that never returns the same location for two keys, but that is usually not possible.

We can at least devise a hash function that minimizes the number of collisions, and we should aim for this when designing one. Let's look at some methods that can help resolve hash conflicts (also called clashes).

## Open Addressing

In our example, say the bucket size is 100 and we have two books titled "Steve Jobs" and "Safe House". Both titles are 10 characters long, so "Steve Jobs" is stored at bucket[10], and for "Safe House" we need to do something else. We can start with something simple: find the next available empty slot in the bucket and store "Safe House" there. That is, we check positions 11, 12, 13, and so on until we find an empty slot for the book. This approach is called "linear probing", because we keep probing successive slots until we find an empty one for our key.

How do we find the next empty slot?

We use a technique called rehashing (part of open addressing): when we get a conflict for a key, we pass the conflicting location to a rehash function, repeatedly if necessary, until we get a free slot for storing the data.

```python
location = hash_fn(key)
# Keep probing while the slot is occupied by some other key.
while bucket[location] is not None and bucket[location] != key:
    location = rehash(location)
```

So we may apply the rehash function several times before we find an available position for our data.

A good rehash function reduces the number of iterations of this loop and produces a free location fairly quickly.
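To make this concrete, here is a minimal linear-probing sketch in Python. It assumes the toy hash from the running example (a title's length modulo the table size); the names `put`, `get`, and `TABLE_SIZE` are my own, not from any particular library.

```python
TABLE_SIZE = 100
bucket = [None] * TABLE_SIZE            # each slot holds a (key, value) pair or None

def hash_fn(key):
    return len(key) % TABLE_SIZE        # toy hash: title length, as in the example

def rehash(location):
    return (location + 1) % TABLE_SIZE  # linear probing: try the next slot

def put(key, value):
    location, probes = hash_fn(key), 0
    while bucket[location] is not None and bucket[location][0] != key:
        location = rehash(location)
        probes += 1
        if probes == TABLE_SIZE:        # a counter guards against a full table
            raise RuntimeError("hash table is full")
    bucket[location] = (key, value)

def get(key):
    location, probes = hash_fn(key), 0
    while bucket[location] is not None and probes < TABLE_SIZE:
        if bucket[location][0] == key:
            return bucket[location][1]
        location = rehash(location)
        probes += 1
    return None

put("Steve Jobs", "record 1")   # length 10 -> slot 10
put("Safe House", "record 2")   # also slot 10: probes once and lands at slot 11
```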

### Challenges

- The loop may run forever, because the bucket may already be full.
  - We can mitigate this by keeping a count of the stored keys and comparing it against the bucket size.
- The loop can still run forever if the rehash function visits only certain locations.
  - For example, with a rehash function like (i + 2) % 100, a key that hashes to an odd index is rehashed only to odd indices, skipping every even-numbered slot.

A good rehash function should, for any index i, cover as many integers as possible between 0 and bucketSize - 1 over successive rehashes. Proven fact: any function rehash(i) = (i + c) % tableSize, where c and tableSize are relatively prime, produces successive values that cover the entire table.
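A quick way to check this fact in Python (the helper `coverage` is hypothetical, written just for this demonstration):

```python
from math import gcd

def coverage(c, table_size):
    # Count how many distinct slots successive rehashes visit, starting from 0.
    seen, i = set(), 0
    for _ in range(table_size):
        seen.add(i)
        i = (i + c) % table_size
    return len(seen)

print(coverage(2, 100), gcd(2, 100))  # 50 slots;  gcd is 2, so half the table is never reached
print(coverage(7, 100), gcd(7, 100))  # 100 slots; gcd is 1, so the entire table is covered
```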

### Clustering

When we look for the next available slot, occupied positions tend to bunch together into runs, and keys begin competing for the same slots. This effect is called clustering.

#### Primary Clustering

When two keys that hash to different values compete in successive rehashes, it is known as primary clustering. For example, if locations 990, 991, 992, and 993 are occupied and some key initially hashes to 990, there is a very high chance it will end up at 994 rather than at 401 or some other location. We can reduce primary clustering by making sure the rehash function uses multiple parameters to spread the values.
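We can watch the cluster grow with a small standalone snippet (table size 1000 here, chosen only to fit the example's slot numbers):

```python
table = [None] * 1000
for loc in (990, 991, 992, 993):
    table[loc] = "occupied"

location = 990                        # a new key hashes here
while table[location] is not None:
    location = (location + 1) % 1000  # linear rehash
print(location)                       # 994: the run of occupied slots grows by one
```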

#### Secondary Clustering

We could avoid primary clustering that way, but it introduces secondary clustering, in which different keys that hash to the same value follow the same rehash path.

### Double Hashing

We can solve secondary clustering through double hashing, in which a primary hash function picks the starting position and a secondary hash function, computed from the key itself, determines the probe step. Two keys that collide at the same starting slot then follow different rehash paths.
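A minimal sketch of the idea in Python; the particular choices of `hash1` and `hash2` below are illustrative assumptions, and the prime table size guarantees every step is relatively prime to it:

```python
TABLE_SIZE = 101  # a prime, so every step size 1..100 is relatively prime to it

def hash1(key):
    # Primary hash: picks the starting slot.
    return sum(ord(ch) for ch in key) % TABLE_SIZE

def hash2(key):
    # Secondary hash: picks the probe step; never 0, or probing would stall.
    return 1 + sum(ord(ch) for ch in key) % (TABLE_SIZE - 1)

def probe_sequence(key, max_probes=5):
    # The successive locations examined for `key` under double hashing.
    start, step = hash1(key), hash2(key)
    return [(start + j * step) % TABLE_SIZE for j in range(max_probes)]
```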

## Deleting Items

It is difficult to delete items from a hash table that uses rehashing. Suppose we put record r1 at position p, and r2, after rehashing, lands at p + 1. If r1 is simply deleted, the slot at p becomes empty, so a later search for r2 stops at p and never reaches p + 1. One possible solution is to mark the record as "deleted" rather than actually removing it.
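Extending the earlier linear-probing sketch with such a marker (often called a tombstone; the sentinel `DELETED` is my own naming):

```python
DELETED = object()  # sentinel marking a removed entry

def delete(key):
    location, probes = hash_fn(key), 0
    while bucket[location] is not None and probes < TABLE_SIZE:
        slot = bucket[location]
        if slot is not DELETED and slot[0] == key:
            bucket[location] = DELETED  # keep the slot "occupied" for later searches
            return True
        location = rehash(location)
        probes += 1
    return False  # hit a truly empty slot (or scanned the whole table): key absent
```

For this to work, `get` also needs a small change: it must probe past `DELETED` slots rather than inspecting them as entries or stopping at them, and `put` may reuse them for new keys.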

## Chaining or Coalesced Hashing
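An alternative to open addressing is to let every bucket hold all of the keys that hash to it, for example as a list, so a collision never forces a probe sequence. A minimal separate-chaining sketch, reusing the toy `hash_fn` from above:

```python
chain_bucket = [[] for _ in range(TABLE_SIZE)]  # each slot holds a list of (key, value) pairs

def chain_put(key, value):
    chain = chain_bucket[hash_fn(key)]
    for i, (k, _) in enumerate(chain):
        if k == key:
            chain[i] = (key, value)  # update an existing entry in place
            return
    chain.append((key, value))       # a collision just extends the chain

def chain_get(key):
    for k, v in chain_bucket[hash_fn(key)]:
        if k == key:
            return v
    return None
```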

## Efficiency of Hashing Method

- n = number of items in the hash table
- tableSize = size of the hash table
- n / tableSize = the load factor of the hash table

For a large table, it has been proved that the average number of probes required for a successful retrieval in a table organized using linear rehashing is approximately

(2*tableSize - n + 1)/(2*tableSize - 2*n + 2)
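In terms of the load factor a = n / tableSize, this tends to (2 - a) / (2 - 2a) as the table grows. A quick evaluation in Python:

```python
def avg_probes(a):
    # Large-table limit of the formula above, with a = n / tableSize.
    return (2 - a) / (2 - 2 * a)

for a in (0.5, 0.75, 0.9):
    print(f"load factor {a}: about {avg_probes(a):.1f} probes per successful search")
```

At a load factor of 0.5 a successful search takes about 1.5 probes on average, but at 0.9 it takes about 5.5, so performance degrades sharply as the table fills up.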

### Probes Required

Efficiency is measured by the average number of table positions that must be examined while searching for a particular item; this is called the number of probes required.