Python is slower than compiled languages by design: the interpreter, dynamic typing, and garbage collection all add overhead. But most Python performance problems are not “Python is slow” problems. They are algorithmic inefficiency, unnecessary I/O, or misuse of data structures that would be slow in any language. Here is the hierarchy of fixes.
Find the Actual Bottleneck First
Premature optimisation is the root of significant waste in Python projects. Before changing anything, profile.
`cProfile` (standard library): `python -m cProfile -s cumulative myscript.py` shows the cumulative time spent in each function. The `pstats` module lets you sort and filter the results.
`line_profiler`: line-by-line profiling for the functions you suspect. Install with `pip install line-profiler`, decorate the function with `@profile`, and run with `kernprof -l -v myscript.py` (`-v` prints the results immediately; without it, they are only saved to `myscript.py.lprof`).
`memory_profiler`: memory usage by line, useful when memory, not CPU, is the bottleneck.
`py-spy`: a sampling profiler that attaches to a running Python process without restarting it, useful for profiling production code.
The practical insight: as a rule of thumb, 90% or more of execution time is spent in roughly 5% of the code. Profiling tells you which 5%; optimising the rest is wasted effort.
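The `cProfile` and `pstats` workflow can also be driven from inside a script, which is handy when only one code path needs measuring. The following is a minimal sketch: `slow_sum_of_squares`, `main`, and `profile_report` are hypothetical names invented for illustration, not part of any library.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive hot spot (hypothetical example function)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    # Hypothetical workload: one cheap call and one expensive call.
    slow_sum_of_squares(1_000)
    slow_sum_of_squares(200_000)


def profile_report(func, limit=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the first `limit` rows.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_report(main)
print(report)
```

The report lists each function with its call count and cumulative time, so the expensive call path should stand out near the top. Exact timings will vary by machine.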
Algorithmic and Data Structure Fixes (Highest Impact)
O(n²) vs O(n log n) vs O(n): a nested loop over a list of 10,000 items runs 100 million iterations; a dictionary lookup on the same data runs 10,000 lookups. Algorithmic complexity dominates hardware and language speed differences.
The common pattern: using `in list` (O(n) linear search) where `in set` or `in dict` (O(1) average-case hash lookup) would do. Converting a list to a set before repeated membership testing can be a 100x speedup or more on large collections.
Avoid repeatedly appending to strings: Python strings are immutable, so `result = result + new_string` in a loop can create a new string object on every iteration. Use `"".join(list_of_strings)` instead.
Generator expressions vs list comprehensions: `sum(x**2 for x in range(1000000))` doesn’t build the full list in memory; `sum([x**2 for x in range(1000000)])` does. For large iterables, generators reduce peak memory.
`collections` module: `Counter` for frequency counting, `defaultdict` to avoid key-existence checks, `deque` for O(1) pop/append at both ends (`list.pop(0)` is O(n)).
Pandas vectorisation: if you are iterating over DataFrame rows with a Python `for` loop, there is almost always a faster way. Use vectorised operations: column-level arithmetic and built-in methods, or NumPy operations on the underlying arrays. Note that `df['col'].apply(func)` still calls a Python function once per element, so it is tidier than an explicit loop but not truly vectorised. A Python-level loop over 1M rows is typically 100–1000x slower than the equivalent NumPy operation.
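The data-structure swaps described in this section can be sketched side by side using only the standard library. All function names here are hypothetical and chosen for illustration. The slow and fast variants return identical results and differ only in how the work scales.

```python
from collections import Counter, deque


def common_items_slow(a, b):
    # O(len(a) * len(b)): every `in b` scans the whole list.
    return [x for x in a if x in b]


def common_items_fast(a, b):
    # O(len(a) + len(b)): build the set once, then O(1) average lookups.
    b_set = set(b)
    return [x for x in a if x in b_set]


def build_string_slow(parts):
    # Repeated concatenation may copy the growing string each time.
    result = ""
    for part in parts:
        result = result + part
    return result


def build_string_fast(parts):
    # A single join allocates the final string once.
    return "".join(parts)


def drain_queue(items):
    # deque.popleft() is O(1); list.pop(0) would shift every element.
    queue = deque(items)
    drained = []
    while queue:
        drained.append(queue.popleft())
    return drained


# Counter replaces a hand-written "if key in dict" counting loop.
word_counts = Counter(["a", "b", "a", "c", "a", "b"])

# The generator expression never materialises the full list of squares.
sum_of_squares = sum(x ** 2 for x in range(10))
```

To see the difference on real inputs, one option is to time each pair with the standard `timeit` module on lists of a few thousand items. The gap widens as the input grows.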
When to Reach for External Tools
NumPy: for numerical array operations, NumPy runs compiled C code on contiguous memory, typically 10–100x faster than pure Python loops. The model: express the computation as array operations, not Python loops.
PyPy: a JIT-compiled Python interpreter. For CPU-bound pure Python code, PyPy often gives a severalfold speedup (5–10x in favourable cases), usually with no code changes. It is not suitable for all libraries: NumPy works, but some C extensions don’t.
Cython: compile Python code to C with type annotations, useful for performance-critical inner loops. Requires a compile step.
Multiprocessing: CPython’s GIL prevents threads from running Python bytecode in parallel, so threads do not speed up CPU-bound work. `multiprocessing.Pool` runs separate processes, each with its own interpreter and GIL, which suits CPU-bound tasks that can be parallelised. `concurrent.futures.ThreadPoolExecutor` is appropriate for I/O-bound tasks, where threads are fine because the GIL is released during I/O.
asyncio: for I/O-bound code at high concurrency, asyncio allows thousands of concurrent operations in a single thread without the overhead of thousands of threads. Use `aiohttp` or `httpx` for async HTTP and `asyncpg` for async PostgreSQL.
The hierarchy: fix the algorithm first; then NumPy/pandas vectorisation; then multiprocessing/asyncio; only then consider Cython or PyPy.
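To illustrate the asyncio point, here is a minimal sketch in which `asyncio.sleep` stands in for a network call. `fake_fetch` and `fetch_all` are hypothetical names. A real program would presumably await an `aiohttp`, `httpx`, or `asyncpg` call instead.

```python
import asyncio
import time


async def fake_fetch(request_id, delay=0.05):
    """Stand-in for an I/O call (hypothetical; simulated with a sleep)."""
    # Awaiting yields control to the event loop while this task "waits".
    await asyncio.sleep(delay)
    return request_id * 2


async def fetch_all(n):
    # Run all n coroutines concurrently on a single thread;
    # gather returns results in the order the awaitables were passed.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))


start = time.perf_counter()
results = asyncio.run(fetch_all(100))
elapsed = time.perf_counter() - start
print(f"{len(results)} results in {elapsed:.2f}s")
```

Run sequentially, 100 waits of 0.05 s would take about 5 s. Because the waits overlap here, the whole batch should finish in a small fraction of that. The same structure does not help CPU-bound work, since the single thread never yields while it is computing.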



