Published on

Introduction to Hashing

Authors

Introduction

Suppose we're given a list of books in a store where each book has this information

String title,
String location

Now a customer comes and asks to search for a book with particular title then we have one way to find the location of title is we do a linear search in list and when title matches we get the location and retrieve the book. So in worst case if the book is at corner of the store then we would be searching the whole store to locate one book.

Can we do better here ? If titles are sorted lexicographically then we can perform binary search and retrieve the book in logarithmic time which is better than previous approach.

But can we still do better here ? We can use hashing technique to retrieve the title in constant time i.e. one lookup exactly so in this case location of title would depend upon the title itself and not other titles present.

Storage Structure

We create an array to hold the data of books and location.

String[n] title;
// where title[i] = location[i];
// n is size of array

So in hashing, we define a function i.e. hash function which calculates position of a key(in our case book title) in an array.

From wikipedia

A hash function is any function that can be used to map data of arbitrary size to fixed-size values, though there are some hash functions that support variable length output.[1] The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. The values are usually used to index a fixed-size table called a hash table. Use of a hash function to index a hash table is called hashing or scatter storage addressing.

Let's say our hash function is dead simple and looks like this

int func(String s) {
    return s.length();
}

for a given book title, return its length. We will use this value returned from hash function to store data in our array. for book titled "Talk Like Ted", it will return 13, we will store title[13] = "2nd Row 5th Shelf" for book titled "Wardley Maps", it will return 11, we will store title[11] = "4th Row 2nd Shelf"

Introduction to Hashing

Challenges

  • If two books title have same length then our hash function would return the same index which would overwrite and result in data loss.
    • This is called hash collision and there are some techniques to resolve this, we will see soon.
  • Size of underlying array is important.
    • If it is too large and number of keys are low then we would be wasting lots of space.
      • for e.g. imagine we have only 100 books but one of the book title has length of 120 then based on our hash function we need to define array of size 120, although other book title lengths are less than 12 so we would be wasting lots of space.
  • We are loosing the insertion order of our keys. for e.g. if we want to iterate over all the books then we would get them in random order and not in the order we inserted.
  • Choosing a good hash function is really critical.