Files
tarun-elango be31df2d44 more text
2026-04-26 14:09:04 -04:00

399 lines
12 KiB
Markdown

# File 4: STL Deep Dive
## Learning Goals
By the end of this file, you should be able to:
- choose standard containers based on access patterns, not habit
- explain how core STL containers work internally
- understand iterator categories and invalidation rules well enough to avoid subtle bugs
- use algorithms library functions as first-class tools rather than optional extras
- discuss STL complexity tradeoffs in interviews and system design conversations
This file assumes you already understand object lifetime, move semantics, and ownership. The STL is not “just a library.” It is a design philosophy built around generic programming, iterator-based abstraction, and predictable complexity.
## What the STL Is Trying to Solve
### Intuition
Most programs need the same families of operations:
- store collections of data
- traverse them efficiently
- search, sort, transform, filter, and aggregate
The STL gives reusable building blocks for those tasks while preserving performance transparency.
Its core ideas are:
- containers own and organize data
- iterators provide a common traversal interface
- algorithms operate over iterator ranges instead of hardcoding container types
That separation is one of the most important patterns in C++.
## `std::vector`
### Intuition
`std::vector` is the default dynamic array in C++. It stores elements contiguously and grows as needed.
If you do not have a strong reason to pick something else, `vector` is often the correct first choice.
### Internal Mechanics
A vector typically stores:
- a pointer to a contiguous heap buffer
- its current size
- its current capacity
```mermaid
flowchart LR
A[vector object] --> B[data pointer]
A --> C[size]
A --> D[capacity]
B --> E[element 0]
B --> F[element 1]
B --> G[element 2]
```
When capacity is exceeded, vector allocates a larger buffer, moves or copies elements into it, then frees the old buffer.
### Why It Exists
Contiguous storage gives major benefits:
- O(1) random access
- strong cache locality
- easy interop with C APIs and low-level buffers
- efficient iteration and algorithm use
### Practical Usage
- numeric data
- event buffers
- parsed records
- task queues with append-heavy patterns
### Pitfalls
- reallocation invalidates pointers, references, and iterators to elements
- frequent small growth can cause repeated reallocations if capacity is not reserved
- insertion in the middle is expensive because elements after the insertion point must shift
### Real Advice
If you know approximate size up front, call `reserve()`. That is one of the highest-value micro-optimizations in ordinary C++ code.
## `std::deque`
### Intuition
`deque` is a double-ended queue optimized for efficient insertion and removal at both ends while still supporting indexed access.
### Internal Mechanics
Unlike vector, deque is not typically one contiguous buffer. It is often implemented as a segmented structure: a map of fixed-size blocks.
This avoids whole-buffer reallocation for growth at the front or back.
### Practical Usage
- queue-like workloads needing both front and back operations
- sliding window logic
- schedulers and work-stealing structures in some implementations
### Pitfalls
- weaker cache locality than vector
- assumptions about contiguity are invalid
- iterators can be invalidated in ways different from vector
## `std::list`
### Intuition
`list` is a doubly linked list. It exists because some workloads benefit from stable iterators and cheap insertion or removal at known positions.
### Internal Mechanics
Each node usually stores:
- the element value
- pointer to previous node
- pointer to next node
```mermaid
flowchart LR
A[Node] --> B[prev]
A --> C[value]
A --> D[next]
```
### Practical Usage
In practice, far fewer workloads need `list` than many engineers assume. It can be useful when:
- you already hold iterators to splice locations
- stable node addresses matter
- frequent insertion and erasure in the middle dominate performance and traversal locality matters less
### Common Misconception
“List is better for lots of inserts and deletes.”
Only sometimes. Pointer chasing hurts cache locality badly. In many real workloads, vector still wins despite O(n) insertion because contiguous memory is so CPU-friendly.
## Associative Containers: `map` and `set`
### Intuition
Ordered associative containers maintain elements in sorted order and support logarithmic lookup, insertion, and removal.
### Internal Mechanics
`std::map` and `std::set` are typically implemented as balanced binary search trees, commonly red-black trees.
```mermaid
flowchart TB
A[8] --> B[4]
A --> C[12]
B --> D[2]
B --> E[6]
C --> F[10]
C --> G[14]
```
Why this matters:
- elements are kept ordered
- lookup is O(log n)
- iterating produces sorted order
- node-based storage means references and iterators are often more stable than in vector
### Practical Usage
- ordered dictionaries
- interval or range logic using `lower_bound` and `upper_bound`
- workloads where sorted traversal is part of the contract
### Pitfalls
- higher per-element overhead than vector-based approaches
- poorer cache locality because nodes are separately allocated
- using `map` by default when ordered traversal is not needed
## Hash-Based Containers: `unordered_map` and `unordered_set`
### Intuition
Hash-based containers optimize for average-case constant-time lookup rather than ordering.
### Internal Mechanics
An `unordered_map` typically uses:
- a bucket array
- a hash function to choose a bucket
- collision handling, often with chains or equivalent node structures
```mermaid
flowchart LR
A[key hash] --> B[bucket array]
B --> C[bucket 0]
B --> D[bucket 1]
B --> E[bucket 2]
D --> F[node -> node]
```
### Practical Usage
- caches
- symbol tables
- frequency counting
- routing tables or registries when order does not matter
### Tradeoffs
- average O(1) lookup, but worst-case O(n)
- memory overhead from buckets and nodes
- iteration order is not stable or meaningful
### Pitfalls
- bad custom hash functions hurt performance
- rehashing invalidates iterators in many cases
- using `unordered_map` when deterministic iteration order is important
## Iterators
### Intuition
Iterators generalize traversal so algorithms can work across many containers.
Instead of writing one sorting routine for vectors and another for arrays, algorithms operate on iterator ranges.
### Categories Matter
Different iterators support different capabilities:
- input iterator: read sequentially
- forward iterator: one-way multi-pass traversal
- bidirectional iterator: move both forward and backward
- random-access iterator: jump in constant time
This is why `std::sort` works with vector iterators but not list iterators. Sorting efficiently requires random access.
### Mental Model
Think of an iterator as a generalized cursor with container-specific guarantees.
### Practical Usage
- generic algorithms over different container types
- decoupling traversal from storage details
- writing reusable library code
### Pitfalls
- invalidating iterators after insertions or erasures
- dereferencing `end()`
- assuming all iterators support the same operations
## Iterator Invalidation
### Intuition
This is one of the most frequent real-world STL bug sources. The container changes, but code keeps using old iterators, references, or pointers.
### Practical Rules of Thumb
- vector reallocation invalidates all iterators, pointers, and references to elements
- list node insertions usually preserve iterators to other nodes
- unordered containers may invalidate iterators when rehashing occurs
Do not rely on vague memory here. For critical code, check the container's exact guarantees.
## Algorithms Library
### Intuition
The algorithms library exists so you can express intent at a higher level than manual loops while still staying efficient.
Common examples include:
- `std::sort`
- `std::find_if`
- `std::transform`
- `std::accumulate`
- `std::lower_bound`
- `std::remove_if`
### Why It Matters
Algorithms make code:
- more declarative
- easier to review
- easier for the compiler to optimize consistently
- less error-prone than handwritten index manipulation
### Example
```cpp
std::vector<int> values = {5, 1, 4, 2, 3};
std::sort(values.begin(), values.end());
```
You do not need to reimplement quicksort or mergesort in production code unless the problem specifically requires it.
### The Erase-Remove Idiom
This is a classic STL pattern:
```cpp
values.erase(
std::remove_if(values.begin(), values.end(), [](int v) { return v % 2 == 0; }),
values.end());
```
Why it exists:
- `remove_if` reorders the range so kept elements move to the front
- it returns the new logical end
- `erase` actually shrinks the container
Understanding this pattern signals real STL fluency.
## Complexity Cheat Sheet
### Sequence Containers
| Container | Random Access | Push Back | Push Front | Insert Middle | Iterator Stability |
| --- | --- | --- | --- | --- | --- |
| `vector` | O(1) | amortized O(1) | O(n) | O(n) | weak under reallocation |
| `deque` | O(1) | O(1) | O(1) | O(n) | moderate, container-specific |
| `list` | O(n) | O(1) | O(1) | O(1) with iterator | strong for other nodes |
### Associative Containers
| Container | Lookup | Insert | Order | Typical Internal Structure |
| --- | --- | --- | --- | --- |
| `map` | O(log n) | O(log n) | sorted | balanced tree |
| `set` | O(log n) | O(log n) | sorted | balanced tree |
| `unordered_map` | average O(1) | average O(1) | none | hash table |
| `unordered_set` | average O(1) | average O(1) | none | hash table |
### Why Interviews Ask About This
Interviewers are usually not checking if you memorized tables. They want to know whether you can choose the right structure for a workload.
Examples:
- frequent append plus indexed reads: likely `vector`
- ordered lookup with range queries: likely `map`
- key lookup without ordering: likely `unordered_map`
- middle splicing with stable iterators: maybe `list`, but verify locality costs first
## Container Selection in Real Systems
### Prefer `vector` More Often Than You Think
Because of contiguity, vector is often the fastest general-purpose container even when its theoretical complexity looks worse than a node-based alternative.
### Reach for Ordered Containers When Order Matters as Part of the Contract
If you need sorted traversal, nearest-key queries, or stable ordering semantics, `map` earns its cost.
### Use Hash Containers When Key Lookup Dominates and Order Does Not Matter
This is common in compilers, interpreters, caches, and service registries.
### Avoid Cargo-Culting `list`
Many engineers learn linked lists academically and then overestimate their usefulness in high-performance software.
## Common STL Pitfalls
- forgetting iterator invalidation rules
- using `operator[]` on `map` or `unordered_map` when accidental insertion is undesirable
- choosing containers by asymptotic complexity alone and ignoring memory locality
- copying large containers accidentally when references or moves were intended
- assuming all algorithms work with all iterator categories
## Interview Checkpoints
You should be able to explain:
- why `vector` is often the default container
- how vector reallocation works and why `reserve()` helps
- the internal difference between `map` and `unordered_map`
- what iterator categories mean in practice
- why `std::sort` requires random-access iterators
- how the erase-remove idiom works
- why cache locality can beat seemingly better asymptotic complexity
## What Comes Next
The final file moves from language and library mechanics into systems-level C++: threads, locks, atomics, performance work, and the patterns that show up in production engines, compilers, and low-latency systems.