Indexing and Query OptimizationLesson 7.1

Database Indexing: B-Tree, Hash, and Composite Indexes

database index, B-tree index, hash index, composite index, clustered index, covering index, index selectivity, CREATE INDEX

Introduction

Indexes are the single most impactful performance optimization available in any Database Management System. The difference between a query that scans millions of rows (full table scan) and one that jumps directly to the target rows (index seek) can be the difference between a 30-second timeout and a 5-millisecond response. Understanding how indexes work, when to use them, and their tradeoffs is essential for any developer working with databases at scale.

Explanation

Without an index, the database checks every row one by one (slow).With an index, it directly jumps to the required data (fast).An index is a separate data structure maintained by the DBMS that allows fast lookup of rows by specific column values — similar to a book's index that lets you jump directly to a topic instead of reading every page. The most common index structure is the B-tree (Balanced Tree), used for equality and range queries. Hash indexes are used for equality-only lookups and are even faster for exact matches.

Primary keys are automatically indexed. All other columns require explicit index creation. Every index adds write overhead (the index must be updated on every INSERT, UPDATE, and DELETE) and storage cost, but eliminates the need to scan the full table on reads.

Real World Example

An e-commerce platform with 50 million orders. A query filtering by customer_id = 12345 without an index scans all 50 million rows. With a B-tree index on customer_id, the DBMS traverses log2(50,000,000) ≈ 26 levels to find the matching rows — microseconds instead of seconds. At Netflix scale, without proper indexes, serving a user's watch history would take minutes instead of milliseconds.Without index → like searching a name by reading entire phonebook.With index → directly jump using alphabet sections

Technical Breakdown

B-tree index properties:

Supports equality (=), range (<, >, BETWEEN), sorting (ORDER BY), and prefix matching (LIKE 'abc%').
Balanced — all leaf nodes are at the same depth, guaranteeing O(log n) lookup time.
Leaf nodes store either the actual row data (clustered/primary index) or a pointer to the row (secondary index).

Index types:

Clustered index: Determines the physical order of rows on disk. Only one per table. InnoDB uses the primary key as the clustered index.
Secondary (non-clustered) index: Separate structure pointing to rows. Multiple allowed per table.
Composite index: Index on multiple columns. Extremely useful but must be designed in the right column order (leftmost prefix rule).
Covering index: An index that includes all columns needed by a query, allowing the DBMS to answer the query from the index alone without accessing the table.

Example

-- Single column index
CREATE INDEX idx_orders_customer
ON orders(customer_id);

-- Composite index (column order matters!)
-- Useful for: WHERE status = 'pending' AND order_date > '2024-01-01'
-- Also useful for: WHERE status = 'pending' (uses leftmost prefix)
-- NOT useful for: WHERE order_date > '2024-01-01' alone
CREATE INDEX idx_orders_status_date
ON orders(status, order_date);

-- Covering index (includes all columns needed by a common query)
CREATE INDEX idx_products_category_covering
ON products(category, price, product_id, name);
-- This query uses only the index, never reads the table:
-- SELECT product_id, name, price FROM products
-- WHERE category = 'Electronics'
-- ORDER BY price;

-- Check if a query uses an index (MySQL)
EXPLAIN SELECT order_id, total_amount
FROM orders
WHERE customer_id = 101
  AND status = 'completed';
-- Look for 'ref' or 'range' in 'type' column (not 'ALL' which means full scan)

Common Mistakes

Indexing every column — over-indexing slows writes significantly because every INSERT/UPDATE/DELETE must update all indexes. Index only columns used in WHERE, JOIN ON, and ORDER BY.
Ignoring the leftmost prefix rule for composite indexes — a composite index on (a, b, c) helps queries filtering on a, (a,b), or (a,b,c) but NOT queries filtering only on b or c.
Using functions on indexed columns in WHERE clauses — WHERE YEAR(created_at) = 2024 cannot use an index on created_at because the function transforms the value. Use range conditions instead: WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'.

Interview Tips

A classic interview question is "How would you optimize a slow query?" — always start by running EXPLAIN (or EXPLAIN ANALYZE in PostgreSQL) to see the execution plan and identify full table scans. Then consider: adding an index on the WHERE/JOIN columns, creating a covering index to avoid table lookups, checking if the query can be rewritten to use an index-friendly condition, and ensuring statistics are up to date for the query optimizer.