Introduction
Hashing is a technique to map keys to array indices using a hash function, giving average O(1) insert, search, and delete. A hash table is the underlying array of size m; element x is placed at position h(x) mod m.
Hash Function
A good hash function should be deterministic, fast to compute, and uniform (spread keys evenly across the table). Common hash functions include:
- Division:
h(k) = k mod m. Pick m as a prime not close to a power of 2. - Multiplication:
h(k) = floor(m * ((k * A) mod 1))withA≈ 0.618 (Knuth). - String hash (Java):
h(s) = s[0]*31n-1 + s[1]*31n-2 + ... + s[n-1], mod table size.
Collisions
When two different keys hash to the same bucket, a collision occurs. Two strategies resolve collisions:
Separate Chaining
Each bucket holds a linked list of all keys mapped there. Insert appends to the list; search walks the list; delete removes the node. With load factor α = n/m, average search time is O(1 + α). Pros: simple, table never "fills up". Cons: extra pointer overhead, poor cache performance.
Open Addressing
All entries live in the table; on collision, probe other buckets until an empty slot is found.
- Linear probing: try h(k), h(k)+1, h(k)+2, ... mod m. Simple but suffers from primary clustering.
- Quadratic probing: try h(k), h(k)+1², h(k)+2², .... Reduces clustering.
- Double hashing: probe
h1(k) + i * h2(k) mod m. Best distribution.
Load factor must stay below 1 and is usually kept below 0.7 by rehashing.
Rehashing
When the load factor exceeds a threshold, allocate a new larger table (typically double) and reinsert every key using the new size. Amortized cost per insert stays O(1).
Applications
- Language dictionaries: Python
dict, JavaHashMap, C++unordered_map. - Database indexing (hash indexes).
- Symbol tables in compilers.
- Caching (memoization, LRU cache).
- Cryptographic hashing (SHA-256) for integrity.
- Set membership: spam filters, Bloom filters.
Hash Table vs BST
| Aspect | Hash Table | Balanced BST |
|---|---|---|
| Search | O(1) avg | O(log n) |
| Ordered traversal | No | Yes |
| Worst case search | O(n) | O(log n) |
| Extra memory | Table slack | Pointers per node |
Summary
Hashing gives the fastest average lookup but does not preserve order and can degrade with a bad hash function. Pick a prime table size, a uniform hash, and a collision strategy suited to your workload.
Important Questions
- Define hashing and hash function. What are the properties of a good hash function?
- Explain the division and multiplication methods.
- What is a collision? How is it resolved?
- Explain separate chaining with an example.
- Explain linear, quadratic probing, and double hashing.
- What is load factor? When should you rehash?
- Compare hash tables and balanced BSTs.
- List five applications of hashing.