PostgreSQL Performance: The Indexing Decisions That Actually Matter

Most PostgreSQL performance problems come down to a small number of indexing mistakes. Here is a practical guide to what matters and what doesn’t.

When PostgreSQL Uses an Index (and When It Doesn’t)

The query planner decides whether to use an index by estimating whether a sequential scan or an index scan is cheaper. The key factors: selectivity (an index on a column where 90% of rows share the same value — like a boolean “is_active” column where most rows are true — provides little benefit), table size (on a small table, sequential scan is faster than an index lookup; the threshold is roughly 10,000 rows where indexes start providing reliable benefit), and query structure (indexes are used for WHERE clauses, JOIN conditions, and ORDER BY; they are not used if the column is wrapped in a function). The EXPLAIN ANALYZE command: essential for understanding what the planner actually does. `EXPLAIN (ANALYZE, BUFFERS) SELECT …` shows the execution plan, estimated vs actual row counts, cost estimates, and buffer hits vs disk reads. Read the plan bottom-up (the innermost nodes execute first).

Index Types and When to Use Each

B-tree (the default): suitable for equality comparisons (=), range comparisons (<, >, BETWEEN), and sorting. Used for the vast majority of indexes. Hash index: equality-only (no range scans), sometimes faster than B-tree for pure equality lookups, but less common and not replicated in older PostgreSQL versions. GIN (Generalised Inverted Index): for searching within values — full-text search (tsvector), JSONB containment (@>), and array overlap (&&). The right choice when you’re searching inside a text document or inside a JSON blob. GiST: for geometric data (PostGIS, point-in-polygon queries), range types, and full-text search (similar to GIN but GIN is generally faster for text search). BRIN (Block Range Index): extremely small, suitable for naturally ordered data like timestamps where you want to narrow down the disk blocks to scan. The wrong choice: using B-tree indexes on large-text columns for LIKE ‘%keyword%’ queries — the wildcard at the start prevents index use; use GIN with pg_trgm for this pattern.

Composite Indexes and Column Order

A composite index on (a, b) can be used for queries filtering on a alone, or a AND b, but not for queries filtering on b alone. Column order in a composite index matters: put the most selective column first (the column that narrows down the result set most). The covering index: if all columns in a query are included in the index, PostgreSQL can satisfy the query from the index alone without hitting the table (index-only scan) — dramatically faster for read-heavy workloads. Example: `CREATE INDEX idx_orders_covering ON orders (user_id, created_at) INCLUDE (status, total)` — a query selecting status and total, filtering on user_id and created_at, never needs to touch the main table. Partial indexes: index only the rows you actually query. `CREATE INDEX idx_pending ON orders (created_at) WHERE status = ‘pending’` — much smaller index, faster to maintain and query, but only useful for queries with WHERE status = ‘pending’. For a large table where 95% of queries target only 5% of rows (e.g., active rows, pending items), a partial index is often the biggest performance win available.

Index Maintenance and Monitoring

Indexes slow down writes (every INSERT, UPDATE, DELETE must update all indexes). Unused indexes are pure overhead. Query to find unused indexes: `SELECT schemaname, tablename, indexname, idx_scan FROM pg_stat_user_indexes WHERE idx_scan = 0 ORDER BY schemaname, tablename;`. Bloat: after many updates, index pages accumulate dead tuples — `REINDEX CONCURRENTLY` or `pg_repack` can reclaim space without locking the table. The key monitoring query: `SELECT tablename, indexname, idx_scan, idx_tup_read, idx_tup_fetch FROM pg_stat_user_indexes ORDER BY idx_scan DESC;` — shows which indexes are being used and how much work they are doing. pg_stat_statements (extension, recommended in production): records execution statistics for every distinct query, including mean execution time, total calls, and rows — the most direct way to find the slow queries that need addressing.

上一篇 韩国烤肉:它实际上是什么以及如何运作
下一篇 PostgreSQL性能:真正重要的索引决策