Hash Tables with Chaining

Chapter 5 of Open Data Structures

Intro

Maps/Dictionary

Map (dictionary) is a relation between a set of keys
and set of values.

Implementing Dynamic Maps/Dictionaries

We want a data structure in which finds/searches
are very fast.
- As close to as possible.
- With the minimum number of executed instructions per method.
Insert and Deletes should be fast too.
Objects in a map have unique keys. A key may be:
- A single property/attribute value, or
- Created from multiple properties/values.

We want to implement the map operations:
Insert, Delete, and Search/Find efficiently.

Arrays:
- accomplished in time for integer keys of a known range.
- but are not space efficient (assumes we leave empty space for keys not currently in the map).
Binary search trees:
- can accomplish in time
- are space efficient.
Hash Tables:
- A generalization of an array that under some reasonable assumptions is for Insert/Delete/Search of a key.

Array Approach — Example

An application stores information about US citizens where
the search key is their social security numbers (SSN).

You can use an array to hold references to all persons.
- Use an array with range 0 – 999,999,999
- Using the SSN as a key, you have access to any person.
Unfortunately, there are much fewer active keys (Social Security Numbers) than the array size (1 billion entries).
- Approximately 450 million SSNs have been issued since 1936.
  (net gain of 1 person every 22 seconds).
- For that many social security numbers, 55% of the array would be unused.
Very useful data structure.
- Good for storing and retrieving key-value pairs.
- Not good for iterating through a list of items.
Example application:
Storing objects according to ID numbers.
- When the ID numbers are widely spread out.
- When you don’t need to access items in ID order.

Hash Tables

Hash Tables solve these problems by using a much smaller array and mapping keys with a hash function.
Let universe of keys and an array of size . A hash function h is a function from to , that is:

Hash Index/Value

A hash value or hash index is used to index the hash table (array).
A hash function takes a key and returns a hash value/index.
- The hash index is an integer (to index an array).
The key is a value associated with a specific object being stored in the hash table.
- The key must remain constant for the lifetime of the object.

Generating Hash Codes

A key may not be an integer to directly used for
hashing.
For example,
- Floating-point numbers: (e.g., 3.14159, 23543.43, 3.997E-8)
- String values: (e.g., "Alice", "Dear diary", ":-)")
- Objects of custom classes.
In Java, every object has a hashCode() method to return a hash code.
- A series of shifts, adds, and xors is performed on the key to produce pseudo-random numbers.
Likewise, we need to make a hash-code function for each type we want to use as a key in our hash table.

Hash Functions & `insert()`

Usage summary:

unsigned int hashCode (int key);
unsigned int hashCode (const std::string& key);
unsigned int hashCode (const ItemType& item);

template <typename Type>
unsigned int indexFor (const Type &key) {
  return hashCode(key) % TABLE_SIZE; // example
};

An insert method:

void insert (int key, itemType item) {
  const unsigned int HASH_VAL = indexFor(key);
  table[HASH_VAL] = item; // Incomplete
}

The Mod (i.e., Modulo) Operator

The modulo is the remainder after integer division.
The modulo is the remainder.
For keys , multiples of give the same result, .
- But multiples of other numbers do not give the same result.
- What if is a prime number such that the keys are rarely multiples of ?

Hash Tables: Insert Example

For example, if we hash keys into a hash table
with entries and use , we get the following sequence of events.

Reducing Collisions

Ideally, hash(key) has two properties:

Consistent: if x == y, then hash(x) == hash(y)
No Collisions: if x != y, then hash(x) != hash(y)

Choosing a Hash Function

Many programming languages (including Java) have built-in hash functions.
The performance of the hash table depends on having a hash function that evenly distributes the keys.
Uniform hashing is the ideal target.
Choosing a good hash function requires accounting for the kind of data that will be used.
One must account for the statistical distribution of keys.
- For example, choosing the first letter of a last name will likely cause lots of collisions depending on the nationality of the population.

Distribution of the people’s surname initial letter according to the 2010 US Census.

Choosing a Hash Function

Some hashing methods that convert a hash code, ,
into an array index:

Division/Modulo Method:
- Given: is the array size, which usually should be prime number.
Multiplicative Method:
- Given, is the array size (usually a power of 2).
  is a arbitrary prime, greater than (often )
  and are random positive integers that are less than .

Prime Number Distribution

For example, assume:

Key values, , are multiples of from to .
(5).
The array size (and divisor), , is .
Then, the hash values will be evenly distributed from to for the keys.
If was , then you would have what kind of distribution?

	Total
0	7
1	7
2	7
3	7
4	7
5	7
6	7

Grand Total	49

Choosing Hash Function

If the keys are non-random (e.g., part numbers),
do the following for better distributions:
- Use all unique data to contribute to the hash function.
- Consider folding — sum the natural (or arbitrary) groups of digits in the key.
- Exclude redundant or non-data (e.g., checksum values).
- Exclude information that may change!
Analyze your expected key values (or some representative subset) to ensure your hash function gives a good distribution!

Dealing with Collisions

A problem arises when we have two keys that hash in the same array entry — this is called a collision.

Two ways to resolve collision are:

Hashing with Chaining (a.k.a., “Separate Chaining”):
Every hash-table entry contains a pointer to a linked list of keys that hash in the same entry.
Hashing with Open Addressing:
Every hash-table entry contains only one key. If a new key hashes to a filled table entry, systematically examine other table entries until an empty entry is found for the new key.

Hashing with Chaining

Problem: Collisions (e.g., the keys 34 and 54 both hash to 4 for a table with 5 buckets).

Solution: Place keys that hash in the same hash-table entry in the same chain (linked list) or bucket (array).

Example hash table implemented via chaining.

Performance of Hashing with Chaining

Running-time performance for hash tables with chaining

Insert: It takes time to compute the hash function and insert at the head of the linked list.
Search: It is proportional to max length of the linked lists.
Delete: Same as search.

Therefore, in the unfortunate event that we have a “bad” hash function all keys may hash in the same table entry giving an run-time!

So, we must create a “good” hash function.

Lab 18: Hash Tables with Chaining

Let’s take a look at the lab based on today’s lecture.

Hash Tables with Chaining ​

Intro ​

Maps/Dictionary ​

Implementing Dynamic Maps/Dictionaries ​

Array Approach — Example ​

Hash Tables ​

Hash Index/Value ​

Generating Hash Codes ​

Hash Functions & insert() ​

The Mod (i.e., Modulo) Operator ​

Hash Tables: Insert Example ​

Reducing Collisions ​

Choosing a Hash Function ​

Choosing a Hash Function ​

Prime Number Distribution ​

Choosing Hash Function ​

Dealing with Collisions ​

Hashing with Chaining ​

Performance of Hashing with Chaining ​

Lab 18: Hash Tables with Chaining ​

Hash Tables with Chaining

Intro

Maps/Dictionary

Implementing Dynamic Maps/Dictionaries

Array Approach — Example

Hash Tables

Hash Index/Value

Generating Hash Codes

Hash Functions & `insert()`

The Mod (i.e., Modulo) Operator

Hash Tables: Insert Example

Reducing Collisions

Choosing a Hash Function

Choosing a Hash Function

Prime Number Distribution

Choosing Hash Function

Dealing with Collisions

Hashing with Chaining

Performance of Hashing with Chaining

Lab 18: Hash Tables with Chaining