Indexing and Hashing

Efficient data retrieval depends on indexing and hashing. These techniques reduce the number of disk accesses needed to find records and are critical to the performance of all but the smallest databases.

Why Indexing

Without an index, finding a record requires scanning the entire table—linear search with cost proportional to the number of pages. An index maps search-key values to record locations, allowing logarithmic or near-constant time lookup. Indexing is one of the most powerful tuning tools available to a DBA.

Index Types

A primary (clustered) index determines the physical order of records; only one such index can exist per table. A secondary (non-clustered) index points to records whose order is determined by another index. Dense indexes contain an entry for every record; sparse indexes contain entries only for some records and rely on sequential layout for the rest.

Multi-level Indexes

When a single index is too large, multi-level indexing indexes the index itself. The top-level index fits in memory and guides access to deeper levels on disk. This idea is generalized by tree-structured indexes, the most important of which is the B+ tree.

B+ Trees

The B+ tree is the workhorse index structure of relational databases. Each internal node stores keys that guide the search; leaf nodes contain the actual keys and pointers to data records. Leaves are linked to support efficient range queries. Operations maintain balance by splitting overfull nodes and merging underfull ones; the tree height stays at O(log_b N), where b is the branching factor. Commercial indexes routinely have thousands of keys per node and only three to four levels for millions of records.

Hashing

Static hashing maps keys to buckets by a hash function. Lookups, inserts, and deletes run in expected O(1) time, but the structure suffers when the data grows beyond the designed capacity. Extendible hashing and linear hashing are dynamic variants that grow and shrink gracefully. Hash indexes are excellent for equality lookups but not for range queries.

Bitmap Indexes

Bitmap indexes represent the presence of each value as a bit vector. They are compact and support fast AND/OR/NOT operations, making them ideal for data warehouses with low-cardinality attributes. They become inefficient for high-cardinality or write-heavy workloads.

Composite and Partial Indexes

Indexes can span multiple columns; the order of columns matters because the index sorts lexicographically. Partial indexes, which include only rows matching a predicate, save space and speed up queries on subsets of the data. Modern databases also support expression indexes on computed values.

Query Optimization and Indexes

The query optimizer chooses access paths based on cost estimates. It may decide to use an index scan, a full table scan, or an index-only scan depending on selectivity, data distribution, and available statistics. Understanding the optimizer's behaviour is essential for tuning slow queries.

Cost of Maintenance

Indexes accelerate reads but slow writes: every insert, update, or delete must also update the affected indexes. Designers pick indexes that balance query benefit against maintenance cost. Over-indexing is a common pitfall in poorly tuned applications.

Summary

Indexing and hashing transform linear searches into logarithmic or constant ones. B+ trees dominate general-purpose indexing; hash, bitmap, and specialized indexes serve particular workloads. Skilled use of indexes is often the difference between a snappy application and a sluggish one.

Indexing and Hashing