Skip to content

Hash Tables with Chaining

Chapter 5 of Open Data Structures

Intro

Maps/Dictionary

Map (dictionary) is a relation between a set of keys
and set of values.

Mapping keys to values
Mapping keys to values
Mapping keys to values
Mapping keys to values

Implementing Dynamic Maps/Dictionaries

  • We want a data structure in which finds/searches
    are very fast.

    • As close to as possible.

    • With the minimum number of executed instructions per method.

  • Insert and Deletes should be fast too.

  • Objects in a map have unique keys. A key may be:

    • A single property/attribute value, or

    • Created from multiple properties/values.

We want to implement the map operations:
Insert, Delete, and Search/Find efficiently.

  • Arrays:

    • accomplished in time for integer keys of a known range.

    • but are not space efficient (assumes we leave empty space for keys not currently in the map).

  • Binary search trees:

    • can accomplish in time

    • are space efficient.

  • Hash Tables:

    • A generalization of an array that under some reasonable assumptions is for Insert/Delete/Search of a key.

Array Approach — Example

An application stores information about US citizens where
the search key is their social security numbers (SSN).

  • You can use an array to hold references to all persons.

    • Use an array with range 0 – 999,999,999

    • Using the SSN as a key, you have access to any person.

  • Unfortunately, there are much fewer active keys (Social Security Numbers) than the array size (1 billion entries).

    • Approximately 450 million SSNs have been issued since 1936.
      (net gain of 1 person every 22 seconds).

    • For that many social security numbers, 55% of the array would be unused.

  • Very useful data structure.

    • Good for storing and retrieving key-value pairs.

    • Not good for iterating through a list of items.

  • Example application:
    Storing objects according to ID numbers.

    • When the ID numbers are widely spread out.

    • When you don’t need to access items in ID order.

A Conceptual View of Hash Tables
A Conceptual View of Hash Tables
A Conceptual View of Hash Tables
A Conceptual View of Hash Tables

Hash Tables

  • Hash Tables solve these problems by using a much smaller array and mapping keys with a hash function.

  • Let universe of keys and an array of size . A hash function h is a function from to , that is:

Hash Index/Value

  • A hash value or hash index is used to index the hash table (array).

  • A hash function takes a key and returns a hash value/index.

    • The hash index is an integer (to index an array).
  • The key is a value associated with a specific object being stored in the hash table.

    • The key must remain constant for the lifetime of the object.

Generating Hash Codes

  • A key may not be an integer to directly used for
    hashing.
    For example,

    • Floating-point numbers: (e.g., 3.14159, 23543.43, 3.997E-8)

    • String values: (e.g., "Alice", "Dear diary", ":-)")

    • Objects of custom classes.

  • In Java, every object has a hashCode() method to return a hash code.

    • A series of shifts, adds, and xors is performed on the key to produce pseudo-random numbers.
  • Likewise, we need to make a hash-code function for each type we want to use as a key in our hash table.

Hash Functions & insert()

  • Usage summary:

    unsigned int hashCode (int key);
    unsigned int hashCode (const std::string& key);
    unsigned int hashCode (const ItemType& item);
    
    template <typename Type>
    unsigned int hash (const Type &key) {
      return hashCode(key) % TABLE_SIZE; // example
    };
  • An insert method:

    void insert (int key, itemType item) {
      const unsigned int HASH_VAL = hash(key);
      table[HASH_VAL] = item; // Incomplete
    }

The Mod (i.e., Modulo) Operator

  • The modulo is the remainder after integer division.

  • The modulo is the remainder.

  • For keys , multiples of give the same result, .

    • But multiples of other numbers do not give the same result.

    • What if is a prime number such that the keys are rarely multiples of ?

Hash Tables: Insert Example

For example, if we hash keys into a hash table
with entries and use , we get the following sequence of events.

Reducing Collisions

Ideally, hash(key) has two properties:

  1. Consistent: if x == y, then hash(x) == hash(y)

  2. No Collisions: if x != y, then hash(x) != hash(y)

Choosing a Hash Function

  • Many programming languages (including Java) have built-in hash functions.

  • The performance of the hash table depends on having a hash function that evenly distributes the keys.
    Uniform hashing is the ideal target.

  • Choosing a good hash function requires accounting for the kind of data that will be used.

  • One must account for the statistical distribution of keys.

    • For example, choosing the first letter of a last name will likely cause lots of collisions depending on the nationality of the population.
Distribution of the people’s surname initial letter according to the 2010 US Census.
Distribution of the people’s surname initial letter according to the 2010 US Census.
Distribution of the people’s surname initial letter according to the 2010 US Census.
Distribution of the people’s surname initial letter according to the 2010 US Census.

Choosing a Hash Function

Some hashing methods that convert a hash code, ,
into an array index:

  • Division/Modulo Method:

    • Given: is the array size, which usually should be prime number.

  • Multiplicative Method:

    • Given, is the array size (usually a power of 2).
      is a randomly chosen prime, greater than (often )
      and are random positive integers that are less than .

Prime Number Distribution

For example, assume:

  • Key values, , are multiples of from to .
    (5).

  • The array size (and divisor), , is .

  • Then, the hash values will be evenly distributed from to for the keys.

  • If was , then you would have what kind of distribution?

Total
07
17
27
37
47
57
67
Grand Total49

Choosing Hash Function

  • If the keys are non-random (e.g., part numbers),
    do the following for better distributions:

    • Use all unique data to contribute to the hash function.

    • Consider folding — sum the natural (or arbitrary) groups of digits in the key.

    • Exclude redundant or non-data (e.g., checksum values).

    • Exclude information that may change!

  • Analyze your expected key values (or some representative subset) to ensure your hash function gives a good distribution!

Dealing with Collisions

A problem arises when we have two keys that hash in the same array entry — this is called a collision.

Two ways to resolve collision are:

  • Hashing with Chaining (a.k.a., “Separate Chaining”):
    Every hash-table entry contains a pointer to a linked list of keys that hash in the same entry.

  • Hashing with Open Addressing:
    Every hash-table entry contains only one key. If a new key hashes to a filled table entry, systematically examine other table entries until an empty entry is found for the new key.

Hashing with Chaining

Problem: Collisions (e.g., the keys 34 and 54 both hash to 4 for a table with 5 buckets).

Solution: Place keys that hash in the same hash-table entry in the same chain (linked list) or bucket (array).

Example hash table implemented via chaining.
Example hash table implemented via chaining.
Example hash table implemented via chaining.
Example hash table implemented via chaining.

Performance of Hashing with Chaining

Running-time performance for hash tables with chaining

  • Insert: It takes time to compute the hash function and insert at the head of the linked list.

  • Search: It is proportional to max length of the linked lists.

  • Delete: Same as search.

Therefore, in the unfortunate event that we have a “bad” hash function all keys may hash in the same table entry giving an run-time!

So, we must create a “good” hash function.

Lab 18: Hash Tables with Chaining

Let’s take a look at the lab based on today’s lecture.