How Databases Actually Work: Storage, Indexes, and Transactions

Most developers use databases through an ORM and have a functional but superficial understanding of what happens underneath. The internals of relational databases are not just academic — they directly explain performance characteristics, failure modes, and the trade-offs between database systems. Here is what actually happens.

Storage: How Data Is Written to Disk

Relational databases write data to disk in pages (typically 8KB or 16KB). A page is the basic unit of I/O — the database reads and writes a full page at a time, even if only one row changed. Storage structures: heap (the basic structure — rows are stored in pages in insertion order; reading a row requires knowing its page and slot); and clustered index (a B-tree structure where the data is stored in the index itself, ordered by the primary key — reading a row by primary key is a single B-tree traversal; PostgreSQL uses heap + secondary index, MySQL InnoDB uses clustered primary key). Write-ahead log (WAL): before any data page is modified, the modification is written to the WAL (a sequential log file). This is the mechanism that enables crash recovery — if the database crashes mid-write, the WAL is replayed on restart to bring the data to a consistent state. The WAL also underpins replication — the replica replays the primary’s WAL to stay in sync. MVCC (Multi-Version Concurrency Control): most modern databases (PostgreSQL, MySQL InnoDB, Oracle) use MVCC to allow readers and writers to not block each other. When a row is updated, the database does not overwrite the old version — it creates a new version. Readers see the version that was current at the start of their transaction; writers create new versions. Old versions are cleaned up by a background process (VACUUM in PostgreSQL). MVCC is why `SELECT COUNT(*) FROM large_table` is slow in PostgreSQL — it must count only the rows visible to the current transaction, which requires a full table scan even if the table is indexed.

Indexes: What They Are and Why They Cost

An index is a separate data structure that enables fast lookup by a specific column or set of columns. B-tree (balanced tree): the default index type. Supports equality lookups (`WHERE column = value`) and range queries (`WHERE column BETWEEN x AND y`), and is sorted so it supports `ORDER BY` without sorting. Lookup time: O(log n) tree traversal. Hash index: faster than B-tree for equality lookups (O(1) hash lookup) but does not support range queries or `ORDER BY`. Supported in PostgreSQL and MySQL as an explicit choice; MongoDB uses hash indexes for hashed sharding. Full-text index (GIN in PostgreSQL, FULLTEXT in MySQL): inverts the text — for each word, stores the list of rows containing it. Enables `WHERE content @@ to_tsquery(‘keyword’)` at speed. Why indexes cost: every write (INSERT, UPDATE, DELETE) must also update all indexes on the table. A table with 10 indexes requires 10 index writes per data write. For write-heavy tables, too many indexes is a performance anti-pattern. Index bloat: deleted rows leave “dead” index entries that are not immediately removed — VACUUM cleans them, but heavy DELETE traffic can cause index bloat, increasing index size and slowing reads. Partial indexes: an index on a subset of rows — `CREATE INDEX ON orders (customer_id) WHERE status = ‘pending’`. Covers only the rows satisfying the WHERE clause, dramatically smaller than a full index for selective conditions. One of the most underused PostgreSQL features.

Transactions and ACID

Atomicity: a transaction is all-or-nothing. If any statement in the transaction fails, all statements are rolled back. Consistency: a transaction brings the database from one valid state to another valid state (all constraints are satisfied after commit). Isolation: concurrent transactions are isolated from each other — the degree of isolation is controlled by the isolation level. Read Committed (default in PostgreSQL and most databases): each statement sees only committed data at the time it executes. Repeatable Read: each transaction sees a consistent snapshot of data from the time the transaction started. Serializable: the strongest isolation — transactions execute as if they were run one at a time serially (no concurrency anomalies). Higher isolation = fewer anomalies = more locking overhead = lower throughput. Durability: after a transaction commits, it will not be lost even if the database crashes — the WAL ensures this. The commit record in the WAL is the atomic “commit completed” signal. The performance reality: ACID guarantees are not free. The WAL write on every commit creates I/O overhead; MVCC version management creates storage overhead and VACUUM needs; serializable isolation requires detecting conflicts which creates CPU overhead. Understanding what each guarantee costs helps you make database design decisions.

上一篇 韩国烤肉:如何在桌旁点餐、烹饪和享用
下一篇 数据库实际上如何工作:存储、索引和事务