From 26810e43d00e8503365e8e2b6793c0a4f213d872 Mon Sep 17 00:00:00 2001
From: tarun-elango
Date: Sun, 26 Apr 2026 13:27:19 -0400
Subject: [PATCH] sd text

---
 systems design/1.RequestHandling.md | 1868 +++++++++++
 systems design/10.reliabiltyAndProtection.md | 1467 +++++++++
 systems design/2.identityAccess.md | 1808 +++++++++++
 systems design/3.dataStorage.md | 2128 +++++++++++++
 systems design/4.perfLayer.md | 1809 +++++++++++
 systems design/5.asyncSystem.md | 2201 +++++++++++++
 systems design/6.commSystems.md | 2363 ++++++++++++++
 systems design/7.searchAndFIleSystem.md | 2978 ++++++++++++++++++
 systems design/8.financial.md | 1941 ++++++++++++
 systems design/9.internalOps.md | 1545 +++++++++
 10 files changed, 20108 insertions(+)
 create mode 100644 systems design/1.RequestHandling.md
 create mode 100644 systems design/10.reliabiltyAndProtection.md
 create mode 100644 systems design/2.identityAccess.md
 create mode 100644 systems design/3.dataStorage.md
 create mode 100644 systems design/4.perfLayer.md
 create mode 100644 systems design/5.asyncSystem.md
 create mode 100644 systems design/6.commSystems.md
 create mode 100644 systems design/7.searchAndFIleSystem.md
 create mode 100644 systems design/8.financial.md
 create mode 100644 systems design/9.internalOps.md

diff --git a/systems design/1.RequestHandling.md b/systems design/1.RequestHandling.md
new file mode 100644
index 0000000..f3073af
--- /dev/null
+++ b/systems design/1.RequestHandling.md
@@ -0,0 +1,1868 @@
+# Request Handling
+
+Request handling is the full journey of a request from the moment a client sends it to the moment the system returns a response. In interviews, this topic sits at the boundary between API design, distributed systems, reliability engineering, and production operations. In real systems, request handling is where latency, availability, security, and cost are decided.
+
+Most weak system design answers treat a request as if it magically reaches the correct service and that service magically succeeds. Real systems do not work that way. Before business logic runs, a request usually passes through multiple control points:
+
+- edge protection
+- routing
+- load balancing
+- authentication and authorization checks
+- rate limiting or throttling
+- validation
+- version negotiation
+- retries and failover logic
+- observability hooks
+
+If you understand request handling well, you can explain not only how a request succeeds, but also how the system behaves when traffic spikes, instances fail, regions go down, clients retry aggressively, or malformed input hits your API.
+
+This guide is written with two goals:
+
+1. Help you answer backend and system design interview questions with depth and structure.
+2. Help you understand how production systems at companies like Google, Netflix, Uber, Amazon, GitHub, Stripe, and typical SaaS platforms are actually built.
+
+Examples in this guide are intentionally generalized from widely used industry patterns and public engineering discussions rather than private internal implementation details.
+
+## 1. Big Picture: What Request Handling Really Means
+
+At a high level, request handling exists because distributed systems are hostile environments:
+
+- networks are slow and unreliable
+- clients are untrusted
+- services scale up and down dynamically
+- requests are not evenly distributed
+- failures are partial, not binary
+- deployments happen continuously
+- one bad downstream dependency can cascade into an outage
+
+The job of request handling is to answer a series of questions quickly and safely:
+
+1. Should this request be allowed into the system?
+2. Is the client authenticated and allowed to do this?
+3. Is the request well-formed and safe?
+4. Which region, cluster, service, and instance should receive it?
+5. Can the system handle the load right now?
+6. If something is failing, should we retry, reroute, degrade, or reject?
+7. How do we observe what happened later?
+
+### 1.1 End-to-End Request Lifecycle
+
+```mermaid
+flowchart LR
+    C[Client: Browser / Mobile / API Consumer] --> CDN[CDN / Edge Cache]
+    CDN --> WAF[WAF / Reverse Proxy / TLS Termination]
+    WAF --> GW[API Gateway]
+    GW --> POL[Auth / Validation / Rate Limit]
+    POL --> ROUTE[Routing + Load Balancing]
+    ROUTE --> S1[Service A]
+    ROUTE --> S2[Service B]
+    S1 --> CACHE[(Cache)]
+    S1 --> DB[(Database)]
+    S2 --> MQ[(Queue / Stream)]
+    GW -. logs / metrics / traces .-> OBS[Observability Stack]
+    S1 -. logs / metrics / traces .-> OBS
+    S2 -. logs / metrics / traces .-> OBS
+```
+
+### 1.2 Core Goals of Request Handling
+
+| Goal | Why it matters | Typical mechanisms |
+|---|---|---|
+| Correctness | Wrong requests can corrupt data or create security holes | validation, auth, idempotency, request signing |
+| Availability | Users should still be served when instances or regions fail | load balancing, health checks, failover, retries, circuit breaking |
+| Performance | Users care about latency more than architecture diagrams | caching, compression, routing, efficient LB strategy |
+| Isolation | One tenant or one endpoint should not take down the whole system | rate limiting, throttling, priority queues, load shedding |
+| Observability | If you cannot see failures, you cannot fix them | centralized logs, metrics, traces, correlation IDs |
+| Evolvability | APIs and deployments must change without breaking clients | versioning, traffic splitting, canary release, blue-green rollout |
+
+### 1.3 Interview Framing
+
+In an interview, saying "I will add a load balancer" is not enough. A stronger answer sounds more like this:
+
+"Requests first hit a reverse proxy or gateway where we terminate TLS, authenticate the caller, apply rate limiting, and route traffic. From there, an L7 load balancer sends traffic to healthy service instances discovered dynamically. We keep requests idempotent where retries are possible, use readiness checks so bad instances do not receive traffic, and add observability at the edge and service layers so we can debug tail latency and failure patterns."
+
+That answer shows system thinking rather than component name-dropping.
+
+## 2. API Gateway
+
+### 2.1 What It Is
+
+An API gateway is the entry point for requests into a backend system. It sits between clients and backend services and applies common policies before requests reach business logic.
+
+Think of it as a programmable front door for your platform.
+
+Common technologies:
+
+- Envoy
+- NGINX
+- Kong
+- HAProxy
+- AWS API Gateway
+- Spring Cloud Gateway
+- Netflix Zuul historically
+
+### 2.2 Why It Exists
+
+Without a gateway, each service often ends up re-implementing the same cross-cutting concerns:
+
+- token validation
+- request logging
+- rate limiting
+- route matching
+- error normalization
+- response compression
+- API version handling
+
+That leads to duplicated logic, inconsistent behavior, and harder operations.
+
+The gateway centralizes edge concerns so backend services can focus more on business rules.
+
+### 2.3 Main Responsibilities
+
+| Responsibility | What it does | Why it belongs at the gateway |
+|---|---|---|
+| Authentication | Verifies tokens, API keys, signed requests | cheap early reject before backend work |
+| Authorization at coarse level | Rejects callers that cannot access an API family | reduces unnecessary downstream traffic |
+| Request aggregation | Combines data from multiple services into one response | reduces client chattiness, especially mobile |
+| Centralized logging | Captures request metadata once | consistent audit and debugging |
+| Observability | emits metrics, traces, request IDs | easier latency and error tracking |
+| Service discovery integration | resolves service names to live instances | works with autoscaling and dynamic infra |
+| Retries and timeouts | handles transient failures | reduces client-visible errors when used carefully |
+| Circuit breaking | stops sending traffic to failing backends | prevents cascading failure |
+| Response transformation | maps internal responses to public API shape | decouples client contract from internal service contract |
+| Caching | serves repeated read traffic cheaply | lowers latency and backend load |
+
+### 2.4 How It Works Internally
+
+A gateway typically processes a request through a pipeline:
+
+1. Accept TCP/TLS connection.
+2. Terminate TLS or pass it through depending on setup.
+3. Parse HTTP request metadata.
+4. Match route rules using host, path, method, headers, or query parameters.
+5. Apply middleware or policies such as auth, rate limiting, schema checks, and logging.
+6. Resolve the upstream service using static config or service discovery.
+7. Select a healthy backend instance using a load-balancing policy.
+8. Forward the request.
+9. Apply retries, timeouts, or circuit-breaker policy if needed.
+10. Transform, compress, cache, or redact the response.
+11. Emit logs, metrics, and trace spans.
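The policy and routing steps of the pipeline above (roughly steps 4 through 8) can be sketched as a chain of small functions that either reject the request or let it continue. This is a minimal in-process illustration, not how any particular gateway product is implemented. Every name in it (`Request`, `Response`, `VALID_TOKENS`, `ROUTES`, `authenticate`, `make_rate_limiter`, `resolve_upstream`, `handle`) is hypothetical, and the forwarding step is stubbed out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Request:
    method: str
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str


# A policy returns a Response to reject early, or None to let the request continue.
Policy = Callable[[Request], Optional[Response]]

VALID_TOKENS = {"secret-token"}  # hypothetical credential store
ROUTES = {"/orders": "order-service", "/payments": "payment-service"}  # hypothetical route table


def authenticate(req: Request) -> Optional[Response]:
    header = req.headers.get("Authorization", "")
    if header.startswith("Bearer ") and header[len("Bearer "):] in VALID_TOKENS:
        return None
    return Response(401, "unauthorized")


def make_rate_limiter(limit: int) -> Policy:
    counts: Dict[str, int] = {}  # naive lifetime counter, only to show the hook

    def rate_limit(req: Request) -> Optional[Response]:
        client = req.headers.get("X-Client-ID", "anonymous")
        counts[client] = counts.get(client, 0) + 1
        return Response(429, "rate limited") if counts[client] > limit else None

    return rate_limit


def resolve_upstream(path: str) -> Optional[str]:
    # Longest-prefix match so a more specific route wins over a broader one.
    matches = [p for p in ROUTES if path == p or path.startswith(p + "/")]
    return ROUTES[max(matches, key=len)] if matches else None


def handle(req: Request, policies: List[Policy]) -> Response:
    for policy in policies:
        rejected = policy(req)
        if rejected is not None:
            return rejected  # cheap early reject before any backend work
    upstream = resolve_upstream(req.path)
    if upstream is None:
        return Response(404, "no route")
    # A real gateway would now pick a healthy instance and forward the request.
    return Response(200, f"forwarded to {upstream}")
```

Calling `handle(request, [authenticate, make_rate_limiter(100)])` runs the policies in order, so an unauthenticated request is rejected with 401 before it ever consumes rate-limit budget or reaches routing.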
+
+### 2.5 Request Lifecycle Through a Gateway
+
+```mermaid
+sequenceDiagram
+    participant Client
+    participant Gateway
+    participant Auth as Auth/JWKS
+    participant Registry as Service Discovery
+    participant Service as Backend Service
+    participant Obs as Observability
+
+    Client->>Gateway: HTTPS request
+    Gateway->>Auth: Validate token or signature
+    Auth-->>Gateway: Auth result
+    Gateway->>Registry: Resolve upstream instances
+    Registry-->>Gateway: Healthy endpoints
+    Gateway->>Service: Forward request
+    Service-->>Gateway: Response
+    Gateway->>Obs: Logs / metrics / traces
+    Gateway-->>Client: Final response
+```
+
+### 2.6 Request Aggregation
+
+Request aggregation is when the gateway calls multiple backend services and combines their results into a single response.
+
+Example: a mobile home screen might need:
+
+- profile service
+- recommendation service
+- notification service
+- recent activity service
+
+Without aggregation, the mobile app might issue 4 to 8 network calls. With a gateway or backend-for-frontend layer, the client makes one call and receives one composite response.
+
+Why it exists:
+
+- mobile networks are high latency
+- clients should not need to know internal service topology
+- it reduces repetitive orchestration logic across clients
+
+Tradeoffs:
+
+- the gateway becomes more complex
+- partial failures are harder to represent
+- tail latency can worsen because one slow dependency slows the whole aggregate response
+- aggregation logic can become accidental business logic
+
+Best practice: keep aggregation focused on shaping data for clients, not on implementing domain rules that belong in services.
+
+### 2.7 Authentication at the Gateway
+
+This is common because authentication is cheap to reject early and expensive to repeat everywhere.
+
+Typical patterns:
+
+- JWT validation at the gateway using cached public keys from a JWKS endpoint
+- API key lookup for machine clients
+- OAuth token introspection for opaque tokens
+- mTLS for service-to-service requests in internal systems
+
+Important nuance: gateway authentication does not eliminate service-level authorization.
+
+The gateway can answer, "Is this caller known and allowed to hit this API family?"
+
+The service still often needs to answer, "Can this user access this specific invoice, order, or repository?"
+
+Common mistake: pushing all authorization into the gateway. Fine-grained authorization usually belongs closer to the business object.
+
+### 2.8 Centralized Logging and Observability
+
+The gateway is the best place to generate or propagate correlation identifiers such as:
+
+- request ID
+- trace ID
+- span ID
+- tenant ID
+- client application ID
+
+Useful gateway metrics:
+
+- requests per second by route
+- latency percentiles, especially p95 and p99
+- error rates by status code family
+- upstream retry counts
+- rate-limited requests
+- auth failures
+- cache hit ratio
+
+Observability matters because many request-handling bugs look similar from the outside. A user just sees a timeout. Internally, that timeout might have been caused by:
+
+- route misconfiguration
+- unhealthy instances still receiving traffic
+- retry storms
+- TLS handshake issues
+- bad DNS resolution
+- a dependency that is slow but not fully down
+
+### 2.9 Service Discovery Integration
+
+In dynamic environments, backend instances change constantly because of autoscaling, deployments, and failures. Hardcoding backend IPs is not realistic.
+
+The gateway therefore needs service discovery.
+
+Common models:
+
+- client-side discovery: the caller resolves service instances and chooses one
+- server-side discovery: the gateway or load balancer resolves instances and forwards traffic
+
+In Kubernetes, a gateway often routes to a Service object, and kube-proxy or the data plane routes to live pod endpoints. In service-mesh-heavy environments, Envoy sidecars may receive endpoint updates via xDS-style control-plane APIs.
+
+Failure case: stale discovery data can send traffic to dead instances.
+
+Best practices:
+
+- respect health information, not just presence in registry
+- support quick config propagation
+- use connection draining when removing instances
+- avoid very aggressive caching of endpoint lists
+
+### 2.10 Retries
+
+Retries are deceptively dangerous.
+
+Why they exist:
+
+- networks fail transiently
+- connections reset occasionally
+- an instance may fail while others are healthy
+
+Why they are risky:
+
+- retries multiply traffic load during incidents
+- non-idempotent operations may execute twice
+- stacked retries at multiple layers create retry storms
+
+Best practices:
+
+- retry only idempotent or safely repeatable operations
+- use bounded retries, usually very small counts
+- add exponential backoff and jitter
+- couple retries with timeouts and circuit breakers
+- never let every layer retry blindly
+
+Interview point: if a gateway retries a POST payment request, you must discuss idempotency keys.
+
+### 2.11 Circuit Breaking
+
+Circuit breaking protects the rest of the system from a dependency that is failing or timing out.
+
+Typical states:
+
+- closed: traffic flows normally
+- open: requests fail fast instead of calling a bad dependency
+- half-open: limited test traffic checks whether recovery has happened
+
+Why it matters:
+
+If a database or dependency is timing out, continuing to send full traffic often just consumes worker threads, saturates queues, and increases latency everywhere.
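The three states described above map onto a small state machine. The sketch below is a single-threaded illustration under stated assumptions: the `CircuitBreaker` class, its `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all hypothetical names chosen for this example. Production breakers usually add locking, rolling error-rate windows, and per-endpoint state.

```python
import time


class CircuitBreaker:
    """Minimal closed / open / half-open breaker (not thread-safe)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast without touching the struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"  # let one trial request through
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

A caller wraps each dependency call, for example `breaker.call(lambda: fetch_invoice(invoice_id))`, where `fetch_invoice` stands for whatever downstream call is being protected. While the circuit is open, the wrapped function is never invoked, which is what frees worker threads and queues during an incident.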
+
+Circuit breaking buys time and preserves system health.
+
+### 2.12 Response Transformation
+
+Gateways often transform responses by:
+
+- removing internal fields
+- renaming fields for public API consistency
+- changing status code mappings
+- combining multiple backend responses into one DTO
+- translating protocols such as gRPC to JSON/HTTP for clients
+
+Useful when:
+
+- internal services evolve independently
+- multiple clients need different shapes
+- you want to hide internal topology
+
+Danger: too much transformation turns the gateway into a fragile orchestration layer.
+
+### 2.13 Request and Response Caching
+
+Caching at the gateway is powerful for read-heavy APIs.
+
+Good candidates:
+
+- public or semi-public GET responses
+- configuration or feature metadata
+- rarely changing reference data
+
+Hard parts:
+
+- cache invalidation
+- per-user or per-tenant cache keys
+- auth-sensitive content
+- stale responses after writes
+
+Best practices:
+
+- cache only clearly safe responses
+- include authorization context in the cache key if needed
+- use TTLs conservatively
+- prefer cache headers and explicit policies over guesswork
+
+### 2.14 Gateway vs Service Mesh
+
+| Dimension | API Gateway | Service Mesh |
+|---|---|---|
+| Main traffic direction | north-south, from clients into platform | east-west, service-to-service |
+| Main role | edge policy and API entry | internal traffic management |
+| Common features | auth, rate limiting, versioning, aggregation, public API concerns | mTLS, retries, traffic shaping, service identity, observability |
+| Consumer | external clients or apps | internal services |
+| Operational risk | can become a choke point | can add significant platform complexity |
+| Typical tools | API Gateway products, NGINX, Kong, Envoy | Istio, Linkerd, Consul Connect, Envoy-based meshes |
+
+They are not mutually exclusive. Many real systems use both.
+
+### 2.15 Production Patterns
+
+- Netflix popularized a gateway-style edge layer to handle cross-cutting concerns before requests hit microservices.
+- Stripe-like public APIs often emphasize idempotency, auth, versioning, and request logging at the edge because correctness matters more than raw throughput alone.
+- Large SaaS platforms often use gateways to enforce tenant-aware limits and route requests to the correct service family.
+
+### 2.16 Common Mistakes
+
+- turning the gateway into a monolith of business logic
+- retrying non-idempotent requests
+- centralizing coarse and fine-grained authorization in the same place
+- logging secrets or PII in raw form
+- caching personalized responses incorrectly
+- making the gateway a single point of failure without horizontal scaling
+
+## 3. Request Routing
+
+Routing decides where a request goes after it enters the system.
+
+This sounds simple, but routing is one of the most important control points in production because it determines:
+
+- which service handles the request
+- which version handles it
+- which region handles it
+- which tenant or experimental path it follows
+
+### 3.1 Routing Types
+
+| Routing style | How it works | Common use cases | Risks |
+|---|---|---|---|
+| Path-based routing | route by URL path such as `/payments` or `/users` | REST APIs, monolith decomposition, ingress rules | overlapping route patterns, regex complexity |
+| Host-based routing | route by hostname such as `api.example.com` or `admin.example.com` | multiple products or domains behind one edge | DNS and certificate management complexity |
+| Header-based routing | route using headers such as version, tenant, device type, or experiment ID | canaries, A/B tests, tenant isolation | header spoofing, harder debugging |
+| Geo-based routing | route by location, region, or country | latency reduction, data residency, regulatory compliance | incorrect geo inference, data locality problems |
+| Canary routing | send a small portion of traffic to a new version | safe rollout | canary users may not represent real load |
+| Blue-green routing | switch traffic between old and new environments | low-risk deployments with quick rollback | expensive duplication, data migration risk |
+| Weighted traffic splitting | send 90 percent to old version and 10 percent to new version, then ramp | gradual deployment, model rollout | sticky results, measurement bias |
+
+### 3.2 How Routing Usually Works Internally
+
+The router generally evaluates rules in order:
+
+1. Match host.
+2. Match method and path.
+3. Evaluate higher-priority header or cookie rules.
+4. Apply traffic-splitting policy if multiple upstreams are eligible.
+5. Resolve the target service using discovery.
+6. Pick a healthy instance.
+
+Rule order matters. A subtle configuration bug can shadow a more specific route with a broader one.
+
+### 3.3 Routing Decision Flow
+
+```mermaid
+flowchart TD
+    A[Incoming Request] --> B{Host Match?}
+    B -->|api.example.com| C{Path Match?}
+    B -->|admin.example.com| D[Admin Service]
+    C -->|/v1/payments| E{Header or Canary Rule?}
+    C -->|/v1/users| F[User Service]
+    E -->|Canary| G[Payments vNext]
+    E -->|Default| H[Payments vCurrent]
+```
+
+### 3.4 Path-Based Routing
+
+This is the most common form of routing for HTTP APIs.
+
+Examples:
+
+- `/orders/*` to order service
+- `/payments/*` to payment service
+- `/search/*` to search service
+
+Why it exists:
+
+- intuitive for REST-style APIs
+- easy to reason about operationally
+- fits ingress and gateway tools well
+
+Failure case: route collisions. For example, a generic `/payments/*` rule may accidentally catch `/payments/admin/*` if route precedence is wrong.
+
+### 3.5 Host-Based Routing
+
+Host-based routing routes by domain name.
+
+Examples:
+
+- `api.company.com`
+- `admin.company.com`
+- `uploads.company.com`
+- `hooks.company.com`
+
+This is useful when products or workloads differ enough that they deserve different operational policies.
+
+For example, a webhook ingestion domain may need different timeout, retry, and rate-limit rules than a user-facing API domain.
+
+### 3.6 Header-Based Routing
+
+Header-based routing is common for:
+
+- API version rollout
+- internal testing
+- tenant routing
+- language or device-specific responses
+- canary routing with explicit opt-in
+
+Example headers:
+
+- `X-API-Version`
+- `X-Tenant-ID`
+- `X-Experiment`
+- `X-Canary`
+
+Be careful: headers are easy for internal callers, but for public APIs they may be spoofed unless protected by auth and policy.
+
+### 3.7 Geo-Based Routing
+
+Geo routing tries to send users to the best region based on one or more goals:
+
+- lower latency
+- data residency compliance
+- regulatory boundaries
+- disaster isolation
+- capacity balancing
+
+Examples:
+
+- EU users sent to EU region for GDPR-sensitive workloads
+- ride-sharing or mapping traffic sent to region nearest user demand
+- global SaaS tenants pinned to a home region
+
+Tradeoffs:
+
+- nearest region is not always best if the user’s data lives elsewhere
+- geo-IP is imperfect
+- cross-region writes can be very expensive or consistency-sensitive
+
+Interview point: geo routing and data placement must be discussed together.
+
+### 3.8 Canary Routing
+
+Canary routing sends a small portion of traffic to a new version first.
+
+Typical rollout:
+
+1. 1 percent traffic
+2. 5 percent traffic
+3. 10 percent traffic
+4. 25 percent traffic
+5. 50 percent traffic
+6. 100 percent traffic
+
+What you watch:
+
+- error rate
+- latency regression
+- resource utilization
+- business metrics such as checkout success or sign-in success
+
+Why companies use it:
+
+- safer than instant full rollout
+- catches dependency or schema issues early
+- provides rollback window
+
+Real-world intuition: Netflix-style continuous deployment is only practical because traffic shaping and observability let teams expose changes gradually.
+
+### 3.9 Blue-Green Deployments
+
+Blue-green means you maintain two environments:
+
+- blue: current production
+- green: new version
+
+Then you shift traffic from one to the other.
+
+Why it exists:
+
+- rollback is simple in theory because the old environment still exists
+- deployment risk is separated from code build risk
+
+Where it gets hard:
+
+- databases do not switch as cleanly as stateless services
+- dual environments cost more
+- background jobs and asynchronous consumers may still affect shared data
+
+### 3.10 Traffic Splitting
+
+Traffic splitting is a more general concept than canary.
+
+You can split traffic by:
+
+- percentage
+- user cohort
+- tenant tier
+- geography
+- request attributes
+- session stickiness
+
+This is useful for:
+
+- A/B experiments
+- canaries
+- ML model rollout
+- progressive feature migration
+
+### 3.11 Service Discovery Impact
+
+Routing depends on service discovery more than most beginners realize.
+
+A route usually points to a logical service name, not a fixed machine. The system must map that name to currently healthy instances.
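One way to picture the mapping from a logical name to healthy instances is a registry that only returns instances it has heard from recently. The sketch below assumes a heartbeat-with-TTL model, which is just one common approach and is not stated by this guide. The `ServiceRegistry` class, its methods, and the addresses used with it are hypothetical.

```python
import time


class ServiceRegistry:
    """Maps a logical service name to instances with a recent heartbeat."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl = ttl_seconds   # assumed freshness window for a heartbeat
        self.clock = clock       # injectable so tests can control time
        self._last_seen = {}     # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        # Instances call this periodically; the first call registers them.
        self._last_seen[(service, address)] = self.clock()

    def deregister(self, service, address):
        # Called during graceful shutdown, ideally after connection draining.
        self._last_seen.pop((service, address), None)

    def resolve(self, service):
        # Presence in the registry is not enough: stale entries are filtered out.
        now = self.clock()
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self.ttl
        )
```

The route table keeps pointing at the name `"payments"` while `resolve("payments")` changes as instances come and go, which is the separation between route definitions and ephemeral instance identity that routing relies on.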
+
+If service discovery is stale or slow:
+
+- requests go to dead instances
+- traffic may concentrate on a few nodes
+- rollout changes may not propagate consistently
+
+Best practices:
+
+- keep route definitions separate from ephemeral instance identity
+- use health-aware endpoint selection
+- support connection draining during deployment
+- prefer automation over manual endpoint lists
+
+### 3.12 Common Interview Discussions
+
+- How do you safely route traffic to a new version?
+- How do you guarantee tenant isolation during routing?
+- What happens if a route config is wrong globally?
+- How do you roll back a bad canary quickly?
+
+## 4. Load Balancing
+
+### 4.1 Why Load Balancing Exists
+
+If requests were sent to a single server, that server would become a bottleneck and a single point of failure.
+
+Load balancing exists to:
+
+- distribute traffic across multiple instances
+- improve availability
+- enable horizontal scaling
+- reduce overload on any single machine
+- route away from unhealthy instances
+
+Horizontal scaling is the key idea. Instead of buying one huge machine forever, you run multiple smaller instances and distribute work.
+
+### 4.2 Active-Active vs Active-Passive
+
+| Mode | Meaning | Benefits | Drawbacks | Common use |
+|---|---|---|---|---|
+| Active-active | multiple nodes or regions serve traffic at the same time | high availability, better capacity utilization, low failover time | more complex consistency and routing | web and API frontends, global services |
+| Active-passive | one node or region is primary, another waits as standby | simpler operational model, easier correctness story | slower failover, unused capacity | legacy systems, cost-sensitive setups, some databases |
+
+Interview rule of thumb: active-active improves availability and latency, but only if the data layer and operational discipline are strong enough to support it.
+
+### 4.3 Common Load-Balancing Algorithms
+
+| Algorithm | How it works | Best for | Strengths | Weaknesses |
+|---|---|---|---|---|
+| Round robin | rotate evenly through instances | similar stateless servers | simple and cheap | ignores load differences |
+| Weighted round robin | same as round robin, but some instances get more traffic | mixed-capacity fleets | easy to express capacity skew | weights can become stale |
+| Least connections | send to instance with fewest active connections | long-lived connections, uneven request duration | better than round robin for sticky or long sessions | connection count may not reflect real CPU or memory load |
+| Least response time | prefer instances with lower observed latency | latency-sensitive APIs | reacts to slow instances | can create feedback loops or oscillations |
+| Consistent hashing | map request key to instance ring | caches, sharded workloads, affinity | minimal reshuffling when instances change | hotspot risk if keys are skewed |
+| IP hash | hash client IP to pick instance | simple affinity | easy stickiness | NAT and shared IPs skew traffic badly |
+
+One strong interview addition: consistent hashing is not a general-purpose default. It is especially useful when request locality matters, such as cache ownership or shard routing.
+
+### 4.4 Round Robin
+
+Round robin is the simplest algorithm: instance A, then B, then C, then back to A.
+
+Why it exists:
+
+- low overhead
+- easy to reason about
+- works fine when instances are homogeneous and requests are similar
+
+Why it fails:
+
+- requests may vary wildly in cost
+- one node may be slow but still receive equal traffic
+
+### 4.5 Weighted Round Robin
+
+Weighted round robin lets you give bigger instances more traffic.
+
+Example:
+
+- instance A weight 4
+- instance B weight 2
+- instance C weight 1
+
+This is useful during mixed instance migrations or when canary nodes should receive only a small share.
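The 4 / 2 / 1 example above can be sketched in a few lines. This version uses a "smooth" weighted round robin, which interleaves picks instead of sending four requests in a row to the heaviest instance. That choice is an assumption made for illustration, since the guide does not prescribe a variant, and the `WeightedRoundRobin` class name is hypothetical.

```python
class WeightedRoundRobin:
    """Smooth weighted round robin over a fixed set of named instances."""

    def __init__(self, weights):
        self.weights = dict(weights)                       # instance name -> weight
        self.current = {name: 0 for name in self.weights}  # running score per instance
        self.total = sum(self.weights.values())

    def next(self):
        # Every instance earns its weight each round.
        for name, weight in self.weights.items():
            self.current[name] += weight
        # The highest running score wins and pays back the total weight,
        # which spreads the heavy instance's turns across the cycle.
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen
```

With `WeightedRoundRobin({"A": 4, "B": 2, "C": 1})`, the first seven calls to `next()` return A, B, A, C, A, B, A. That is four picks for A, two for B, and one for C per cycle, with no long run on a single instance.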
+
+### 4.6 Least Connections
+
+This is common when connection duration matters, such as proxies handling long-lived connections or websocket-heavy workloads.
+
+Why it helps:
+
+- a server already holding many active sessions gets less new work
+
+Limitations:
+
+- 10 idle connections are not equivalent to 10 expensive requests
+- CPU-heavy but short-lived requests may still be imbalanced
+
+### 4.7 Least Response Time
+
+This tries to avoid slow nodes by routing to faster ones.
+
+Good intuition: if one instance’s latency is rising, it may be overloaded or degraded.
+
+Risk: feedback loops.
+
+If the algorithm overreacts, traffic can bounce around and create instability. Slow-start, damping, and outlier detection often help.
+
+### 4.8 Consistent Hashing
+
+Consistent hashing is important in system design interviews.
+
+Instead of randomly balancing every request, you map a key such as:
+
+- user ID
+- session ID
+- cache key
+- shard key
+
+to a position on a hash ring. Each instance owns a portion of the ring.
+
+Why it exists:
+
+- when instances join or leave, only a subset of keys remap
+- this preserves cache locality and reduces churn
+
+Common use cases:
+
+- distributed caches
+- sharded databases
+- sticky-ish routing without a central session store
+
+Failure case: skewed keys can create hotspots. Virtual nodes and better key design are common mitigations.
+
+### 4.9 IP Hash
+
+IP hash is a crude form of affinity.
+
+It is easy to set up, but it performs poorly when many clients appear behind a single NAT or corporate proxy. One large office can accidentally behave like one giant user from the load balancer’s perspective.
+
+### 4.10 Distributed Load Balancing
+
+At scale, the load balancer itself must also scale.
+
+That means:
+
+- multiple LB nodes, not one appliance
+- shared or replicated configuration
+- health state propagation
+- often anycast or DNS in front of LB fleets
+
+If the load balancer layer is not distributed, you have just moved the bottleneck.
+
+### 4.11 Global Load Balancing
+
+Global load balancing chooses a region before a local load balancer chooses an instance.
+
+Goals:
+
+- reduce latency
+- avoid unhealthy regions
+- keep traffic near user or data
+- manage regional capacity
+
+```mermaid
+flowchart TD
+    U[Global Users] --> GTM[Global Traffic Manager]
+    GTM --> US[US Region]
+    GTM --> EU[EU Region]
+    GTM --> AP[APAC Region]
+    US --> USLB[Regional Load Balancer]
+    EU --> EULB[Regional Load Balancer]
+    AP --> APLB[Regional Load Balancer]
+    USLB --> US1[Service Instances]
+    EULB --> EU1[Service Instances]
+    APLB --> AP1[Service Instances]
+```
+
+Google-scale and Amazon-scale systems rely heavily on global traffic management concepts because the first routing decision is often regional, not per-instance.
+
+### 4.12 DNS Load Balancing
+
+DNS can return different IPs for the same hostname.
+
+Why it is attractive:
+
+- simple
+- globally available
+- often the first layer of balancing
+
+Limitations:
+
+- DNS caching means failover is not immediate
+- clients do not always honor TTL precisely
+- DNS cannot see application-level health very well by itself
+
+Interview point: DNS is useful, but it is usually too coarse to be the only failover mechanism.
+
+### 4.13 Best Practices
+
+- remove unhealthy instances quickly
+- use connection draining before terminating instances
+- prefer zone-aware balancing if cross-zone traffic is expensive
+- track p95 and p99, not just average latency
+- use slow start for newly added instances so they do not get overwhelmed instantly
+- treat retries as part of traffic load, not separate from it
+
+### 4.14 Common Failure Cases
+
+- unhealthy instances still receive traffic because readiness is wrong
+- least-response-time routing amplifies instability
+- one AZ gets overloaded because balancing is not zone-aware
+- sticky affinity causes hotspots
+- the load balancer layer itself is not redundant
+
+## 5. Rate Limiting
+
+### 5.1 Why It Exists
+
+Rate limiting controls how much traffic a client, tenant, API key, IP, or endpoint can send over time.
+
+It exists to enforce:
+
+- abuse prevention
+- fairness
+- cost control
+- protection of downstream systems
+- multi-tenant isolation
+
+Without rate limits, one abusive or buggy client can monopolize resources.
+
+### 5.2 Where It Is Applied
+
+Rate limits can exist at multiple layers:
+
+- CDN or edge
+- API gateway
+- service layer
+- database or queue concurrency controls
+
+Different layers often enforce different types of limits.
+
+Examples:
+
+- per-IP login limit at the edge
+- per-API-key limit at the gateway
+- per-tenant expensive-operation limit in the service
+
+### 5.3 Common Algorithms
+
+| Algorithm | Idea | Strengths | Weaknesses | Good fit |
+|---|---|---|---|---|
+| Fixed window | count requests in each fixed period | very simple | bursty at window boundaries | low-complexity systems |
+| Sliding window | approximate rolling window using adjacent buckets | smoother than fixed window | more logic and state | typical APIs |
+| Sliding log | keep timestamps of requests | precise | expensive in memory and compute | strict low-volume policies |
+| Token bucket | tokens refill at a constant rate, requests spend tokens | allows controlled bursts | stateful logic | public APIs with burst tolerance |
+| Leaky bucket | requests enter a bucket and drain at fixed rate | smooths outgoing rate | may delay or drop burst traffic | traffic shaping and smoothing |
+
+### 5.4 Fixed Window
+
+Example rule: 100 requests per minute.
+
+Implementation is often as simple as:
+
+- increment a counter for current window
+- reject if counter exceeds limit
+- expire counter at end of window
+
+Problem: boundary burst.
+
+A client can send 100 requests at the end of one minute and 100 more at the start of the next minute, effectively sending 200 requests in a short interval.
+
+### 5.5 Sliding Window
+
+Sliding window reduces the boundary-burst problem.
+
+Instead of treating time as disconnected minute buckets, it approximates usage across a rolling interval. This is fairer and smoother for APIs.
+
+### 5.6 Sliding Log
+
+Sliding log stores individual request timestamps and removes old ones.
+
+It is the most exact of the common approaches, but also the most expensive. It is rarely the default choice for very high-cardinality high-throughput public traffic.
+
+### 5.7 Token Bucket
+
+This is one of the most useful algorithms to understand.
+ +Mental model: + +- tokens drip into a bucket at a steady rate +- each request consumes one or more tokens +- if the bucket is empty, reject or delay + +Why it is popular: + +- supports bursts up to bucket capacity +- preserves average rate over time +- easy to reason about for product limits + +Example: + +- refill 10 tokens per second +- bucket size 50 + +The client can burst 50 requests instantly, but over time they only sustain about 10 per second. + +### 5.8 Leaky Bucket + +Leaky bucket emphasizes output smoothing more than burst allowance. + +Requests may queue and drain at a steady rate. This is useful when downstream systems need smooth, predictable load rather than spikes. + +### 5.9 Redis Implementation Patterns + +Redis is common for distributed rate limiting because it is fast and supports atomic operations. + +Typical patterns: + +- fixed window: `INCR` plus `EXPIRE` +- sliding log: sorted set of timestamps with `ZADD`, `ZREMRANGEBYSCORE`, and `ZCARD` +- token bucket: store token count and last refill timestamp, update atomically with Lua + +Why Lua scripts matter: + +- distributed rate limiting requires atomic read-modify-write behavior +- without atomicity, concurrent requests can exceed the intended limit + +Key design examples: + +- `ratelimit:user:123:/payments` +- `ratelimit:tenant:acme:minute` +- `ratelimit:ip:203.0.113.10:login` + +### 5.10 Distributed Rate Limiting Challenges + +This is where interview answers often become shallow. + +Real problems include: + +- hot keys for large tenants or popular routes +- clock skew between nodes +- cross-region consistency +- Redis outages +- fail-open versus fail-closed policy decisions +- cardinality explosion if keys are too granular + +Fail-open means the limiter allows traffic if the limiter store is down. + +Fail-closed means it rejects traffic if the limiter store is down. 
+ +Which is right depends on the endpoint: + +- login or anti-abuse endpoint may prefer fail-closed +- general read endpoint may prefer fail-open to preserve availability + +### 5.11 Best Practices + +- limit by the right identity: IP is often not enough +- use different limits for read, write, and expensive endpoints +- return `429 Too Many Requests` with `Retry-After` when possible +- monitor near-limit behavior, not just hard rejections +- consider shadow mode before enforcing a new limit in production +- keep rate limiting close to the edge for cheap rejection + +### 5.12 Real-World Intuition + +- GitHub-style public APIs need clear client-visible limits to keep the platform fair. +- Stripe-like payment APIs need rate limiting to protect correctness-sensitive backends from abuse or accidental retry loops. +- Multi-tenant SaaS platforms often combine per-user, per-tenant, and per-endpoint limits. + +## 6. Request Validation + +Validation is the discipline of refusing bad requests before they do damage. + +This includes much more than checking whether a JSON field exists. + +### 6.1 Why Validation Exists + +Requests are dangerous because they may be: + +- malformed +- malicious +- duplicated +- replayed +- semantically invalid +- inconsistent with business rules + +Validation protects: + +- correctness +- security +- downstream capacity +- developer sanity + +### 6.2 Validation Layers + +| Layer | Typical checks | Why here | +|---|---|---| +| Edge or gateway | body size, basic schema, auth format, signature presence, rate limit | cheap early rejection | +| API layer | required fields, type checks, enum checks, version compatibility | contract correctness | +| Domain layer | business rules and state-dependent validation | real correctness | +| Database layer | unique constraints, foreign keys, transactional guarantees | final integrity guardrail | + +Important interview point: validation should be layered. Do not rely on only one layer. 
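The layering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `gateway_check`, `api_check`, `domain_check`, the `MAX_BODY_BYTES` limit, and the `balance` rule are hypothetical, and the database layer (constraints, transactions) is left out because it lives in the schema rather than in request code.

```python
import json

MAX_BODY_BYTES = 1024  # hypothetical gateway limit, chosen for illustration


class ValidationError(Exception):
    """Carries the name of the layer that rejected the request."""

    def __init__(self, layer: str, message: str):
        super().__init__(f"{layer}: {message}")
        self.layer = layer


def gateway_check(raw_body: bytes) -> None:
    # Cheap early rejection: no parsing, only size.
    if len(raw_body) > MAX_BODY_BYTES:
        raise ValidationError("gateway", "body too large")


def api_check(payload: dict) -> None:
    # Contract correctness: required fields, types, enums.
    amount = payload.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise ValidationError("api", "amount must be an integer")
    if payload.get("currency") not in {"USD", "EUR"}:
        raise ValidationError("api", "unsupported currency")


def domain_check(payload: dict, balance: int) -> None:
    # Business rules that depend on current state.
    if payload["amount"] <= 0:
        raise ValidationError("domain", "amount must be positive")
    if payload["amount"] > balance:
        raise ValidationError("domain", "insufficient balance")


def validate(raw_body: bytes, balance: int) -> dict:
    """Run the layers from cheapest to most expensive and stop at the first failure."""
    gateway_check(raw_body)
    try:
        payload = json.loads(raw_body)
    except ValueError:
        raise ValidationError("api", "body is not valid JSON") from None
    if not isinstance(payload, dict):
        raise ValidationError("api", "body must be a JSON object")
    api_check(payload)
    domain_check(payload, balance)
    return payload
```

Ordering the checks from cheapest to most expensive means an oversized or malformed body never reaches the code that needs state lookups.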
+ +### 6.3 Schema Validation + +Schema validation checks the request shape. + +Examples: + +- JSON schema or OpenAPI validation for REST +- protobuf validation for gRPC +- GraphQL schema and resolver validation + +Why it exists: + +- catches bad input early +- makes API behavior predictable +- prevents weird null or type bugs from leaking deep into business logic + +What it does not do: + +- prove business correctness + +Example: + +- schema can prove `amount` exists and is numeric +- it cannot prove that a user is allowed to charge that amount + +### 6.4 Input Sanitization + +Sanitization is about ensuring input cannot be used to exploit downstream systems. + +Common concerns: + +- SQL injection +- command injection +- path traversal +- log injection +- XSS if data will later be rendered in browsers + +The important mindset is not "strip all special characters". That often breaks legitimate input. + +Better practice: + +- use parameterized queries +- encode output for the correct context +- validate formats where needed +- avoid blindly interpolating request data into logs or shell commands + +### 6.5 Idempotency + +Idempotency is one of the most important backend interview concepts. + +A request is idempotent if sending it multiple times has the same effect as sending it once. + +Why it matters: + +- clients retry when timeouts happen +- networks fail after the server may already have processed the request +- gateways or proxies may retry transient failures + +Typical example: payments. + +If a client sends "charge $100" twice because the first response was lost, you do not want to charge the card twice. + +A common solution is an idempotency key. 
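A minimal in-memory sketch of the idempotency-key idea, assuming a single process and no concurrent requests for the same key. The class `IdempotentExecutor`, the `IdempotencyConflict` error, and the choice to scope keys by `client_id` are illustrative assumptions; a production store would be shared, durable, expiring, and would need an atomic "claim this key" step.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """The same key was reused with a different request body."""


class IdempotentExecutor:
    def __init__(self):
        # (client_id, key) -> (request fingerprint, stored response)
        self._store = {}

    @staticmethod
    def _fingerprint(payload: dict) -> str:
        # Normalize before hashing so key order does not change the fingerprint.
        normalized = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(normalized.encode()).hexdigest()

    def execute(self, client_id: str, key: str, payload: dict, handler):
        scoped_key = (client_id, key)
        fingerprint = self._fingerprint(payload)

        if scoped_key in self._store:
            saved_fingerprint, saved_response = self._store[scoped_key]
            if saved_fingerprint != fingerprint:
                raise IdempotencyConflict("key reused with a different request")
            # Retry of a request we already handled: replay the stored result.
            return saved_response

        response = handler(payload)
        self._store[scoped_key] = (fingerprint, response)
        return response
```

Usage might look like `executor.execute("client-1", "abc123", {"amount": 100}, charge)`, where `charge` is whatever function performs the side effect. A retry with the same key and body returns the stored response without calling `charge` again.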
+ +```mermaid +sequenceDiagram + participant Client + participant API + participant Store as Idempotency Store + participant Payment as Payment Service + + Client->>API: POST /charges + Idempotency-Key: abc123 + API->>Store: Lookup key abc123 + alt Key not found + Store-->>API: miss + API->>Payment: Execute charge + Payment-->>API: success + API->>Store: Save key + normalized request hash + response + API-->>Client: 200 OK with result + else Key found + Store-->>API: previous response + API-->>Client: return same stored result + end +``` + +Best practices for idempotency: + +- store both the key and enough request fingerprinting to detect misuse +- scope keys appropriately, often per client or per endpoint family +- keep key retention long enough to cover realistic retry windows +- use for non-idempotent operations such as payment creation or order submission + +### 6.6 Replay Protection + +Replay protection prevents an attacker or buggy intermediary from resending a valid request later. + +Common techniques: + +- timestamps with expiration windows +- nonces stored briefly to prevent reuse +- signed requests that include method, path, body hash, and timestamp + +This is especially common in webhook verification and partner API integrations. + +### 6.7 Request Signing Basics + +Request signing often works like this: + +1. client builds a canonical string from method, path, timestamp, and body hash +2. client signs it with an HMAC secret or private key +3. server recomputes expected signature +4. server rejects if signature differs or timestamp is too old + +Why it exists: + +- verifies authenticity +- detects tampering +- supports replay protection when timestamp and nonce are included + +GitHub-style or Stripe-style webhooks commonly use a variant of this pattern so receivers can verify that an event really came from the platform. 
+ +### 6.8 Common Validation Mistakes + +- trusting frontend validation +- validating schema but not business semantics +- implementing idempotency without scoping or request hashing +- rejecting too late after expensive downstream work already happened +- logging raw secrets, tokens, or signed payloads +- treating all retries as duplicates without considering request identity + +## 7. API Versioning + +Versioning exists because APIs change, but clients do not upgrade instantly. + +### 7.1 Why Versioning Matters + +Without a versioning strategy: + +- client upgrades become risky +- breaking changes become outages +- multiple mobile app versions become painful to support +- integration partners lose trust + +Strong backend teams design for API evolution, not just initial launch. + +### 7.2 Common Versioning Strategies + +| Strategy | Example | Benefits | Drawbacks | When useful | +|---|---|---|---|---| +| URI versioning | `/v1/orders` | explicit and easy to see | path clutter, can encourage large version forks | public REST APIs | +| Header versioning | `X-API-Version: 2025-10-01` | cleaner URLs, flexible rollout | less visible, harder to debug manually | mature APIs, platform clients | +| Media type versioning | `Accept: application/vnd.company.v2+json` | precise content negotiation | operationally less friendly, not beginner-friendly | specialized APIs | + +### 7.3 Backward Compatibility + +This is often more important than the version number itself. + +Safer changes: + +- adding optional fields +- adding new endpoints +- adding new enum values only if clients are tolerant + +Risky changes: + +- removing fields +- renaming fields +- changing meaning or units of existing fields +- turning nullable fields into required ones + +Production rule: additive evolution is easier than breaking evolution. + +### 7.4 Deprecation Strategy + +Good deprecation is operational, not just documented. + +A practical strategy: + +1. announce deprecation clearly +2. 
measure usage of the old version +3. provide migration docs and examples +4. support both versions during a migration window +5. alert high-usage customers directly if possible +6. set and communicate a sunset date + +### 7.5 Migration Strategy + +Strong teams avoid big-bang migrations. + +Typical approach: + +- dual-read or dual-write only when necessary and carefully controlled +- gateway routes old and new versions separately +- monitor client adoption +- migrate major SDKs first +- cut off the oldest, least-safe versions gradually + +### 7.6 Real-World Examples + +- Stripe is well known for careful API versioning because payment integrations cannot break casually. +- GitHub exposes explicit API versioning so clients know which contract they are using. +- Internal microservices often use protobuf or schema evolution rules instead of public URI versioning. + +### 7.7 Best Practices + +- version only when needed; do not fork casually +- prefer backward-compatible changes when possible +- monitor version usage by client and tenant +- keep error formats consistent across versions +- document behavioral differences, not just field differences + +## 8. Throttling + +Rate limiting and throttling are related but not identical. + +### 8.1 Throttling vs Rate Limiting + +| Concept | Main goal | Typical action | Example | +|---|---|---|---| +| Rate limiting | enforce quota or fairness | reject when request budget is exceeded | 100 requests per minute per API key | +| Throttling | protect system under stress or shape traffic | slow down, queue, degrade, or reject | reduce expensive search traffic during overload | + +Rate limiting is often policy-driven. + +Throttling is often system-health-driven. + +### 8.2 Why Throttling Exists + +Even legitimate traffic can overwhelm a system. 
+ +Throttling helps you: + +- degrade gracefully rather than crash +- preserve critical endpoints over less important ones +- absorb bursts temporarily +- smooth traffic into fragile downstream systems + +### 8.3 Graceful Degradation + +Graceful degradation means not every feature must remain equally available during stress. + +Examples: + +- checkout stays available, recommendations are temporarily disabled +- write-heavy analytics ingestion is delayed, user sign-in remains online +- expensive search filters are limited, basic search still works + +This is how mature systems preserve business-critical value during incidents. + +### 8.4 Queueing + +Queueing is useful when the work does not need to complete synchronously. + +Examples: + +- email sending +- thumbnail generation +- event enrichment +- some analytics processing + +Why it helps: + +- decouples request acceptance from background work +- smooths spikes +- improves perceived responsiveness if the request can return early + +Danger: unbounded queues are just hidden outages. If the queue grows forever, latency becomes effectively infinite. + +### 8.5 Shedding Load + +Load shedding means rejecting traffic on purpose so the rest of the system survives. + +This can feel wrong to beginners, but it is often the correct decision. + +Serving 70 percent of traffic quickly is better than timing out 100 percent of traffic after exhausting all workers. 
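One simple way to shed load is a cap on in-flight requests that rejects immediately instead of waiting. The sketch below assumes a threaded server; `ConcurrencyLimiter` and `Overloaded` are hypothetical names, and the mapping of `Overloaded` to a `503` or `429` response is left to the surrounding framework.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is shed; the API layer would map this to 503 or 429."""


class ConcurrencyLimiter:
    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def run(self, handler, *args, **kwargs):
        # Non-blocking acquire: fail fast rather than queue behind busy workers.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            return handler(*args, **kwargs)
        finally:
            # Always free the slot, even if the handler raised.
            self._slots.release()
```

Because rejected requests never take a slot, the requests that are admitted keep a predictable latency, which is the point of shedding in the first place.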
+ +Common strategies: + +- reject low-priority requests first +- enforce concurrency caps on expensive endpoints +- return stale cached data for noncritical reads +- cut off optional features during incidents + +### 8.6 Best Practices + +- define traffic priority classes +- keep queues bounded +- expose clear client signals such as `429` or `503` +- combine throttling with backoff guidance for clients +- ensure degraded mode is tested before an incident + +### 8.7 Common Mistakes + +- using queueing for work that users expect immediately +- letting retries refill the queue faster than it drains +- not distinguishing critical and noncritical traffic +- treating throttling only as an edge concern when downstream systems are the real bottleneck + +## 9. Reverse Proxy and Load Balancer + +These terms overlap in practice, but they are not identical. + +### 9.1 Reverse Proxy Role + +A reverse proxy sits in front of backend servers and receives requests on their behalf. + +Clients think they are talking to one endpoint. The proxy decides how to forward requests internally. + +Why it exists: + +- hide internal topology +- centralize TLS handling +- compress and cache content +- enforce some security rules +- simplify operational control + +Common tools: + +- NGINX +- Envoy +- HAProxy +- cloud-managed L7 load balancers + +### 9.2 SSL/TLS Termination + +TLS termination means the proxy or load balancer decrypts incoming HTTPS traffic. + +Benefits: + +- central certificate management +- offload crypto work from backend services +- enables L7 inspection for routing and policy + +Tradeoff: + +- internal traffic must still be protected appropriately +- if you terminate at the edge and send plaintext internally, the trust boundary moves inward + +Many production systems re-encrypt internally or use mTLS on internal hops. + +### 9.3 Caching + +Reverse proxies can cache static assets and some API responses. 
+ +Why it matters: + +- lower origin load +- lower latency +- better resilience during backend spikes + +But be careful with: + +- personalized responses +- auth-dependent content +- stale data after updates + +### 9.4 Compression + +Compression reduces payload size for responses such as JSON, HTML, CSS, and JS. + +Benefits: + +- lower bandwidth +- faster transfers for text-heavy payloads + +Tradeoff: + +- CPU overhead +- not useful for already compressed formats such as many images or zipped binaries + +### 9.5 WAF Basics + +A Web Application Firewall applies security rules to incoming requests. + +It commonly helps with: + +- blocking obviously malicious payloads +- filtering known exploit patterns +- enforcing IP reputation rules +- reducing bot and abuse traffic + +Important nuance: a WAF is helpful, but it is not a substitute for secure application code and proper validation. + +### 9.6 CDN Relationship + +A CDN is often the outermost layer, especially for global systems. + +Typical order: + +1. CDN +2. WAF or reverse proxy +3. API gateway +4. internal load balancing and service routing + +CDNs are best for: + +- static assets +- edge caching +- some globally cacheable API responses +- DDoS absorption and edge presence + +### 9.7 Reverse Proxy vs API Gateway + +A reverse proxy is often lower-level and more generic. + +An API gateway usually adds richer API-specific behavior such as: + +- auth policies +- API keys +- per-route rate limits +- response transformation +- version-aware routing + +In practice, the same product may serve both roles. + +## 10. L4 vs L7 + +This comparison appears constantly in interviews. + +### 10.1 The Basic Difference + +- L4 operates at the transport layer, mainly IP, TCP, and UDP information. +- L7 operates at the application layer, understanding HTTP methods, paths, headers, cookies, and sometimes message semantics. 
+ +### 10.2 Comparison Table + +| Dimension | L4 | L7 | +|---|---|---| +| Visibility | IP, port, protocol, connection metadata | URL path, headers, host, cookies, method, status | +| Speed | generally lower overhead | generally more overhead due to parsing and richer policy | +| Routing options | by IP and port | by path, host, header, content-type, user, version | +| TLS handling | can pass through TLS | often terminates TLS to inspect HTTP | +| Use cases | very high throughput transport balancing, TCP services, simple load distribution | APIs, canary routing, auth, rate limits, response transforms | +| Observability | coarse | rich request-aware observability | +| Examples | AWS NLB, IPVS-style balancing, transport proxies | AWS ALB, Envoy, NGINX, Kong | + +### 10.3 Performance Tradeoffs + +Why choose L4: + +- lower per-request overhead +- works for non-HTTP protocols +- simpler fast-path routing + +Why choose L7: + +- can make smarter decisions +- can enforce API policy +- can support sophisticated routing and deployment patterns + +Interview answer pattern: + +"I would use L7 where I need HTTP-aware routing, auth, canarying, or response handling. I would prefer L4 for simpler, high-throughput transport balancing or protocols where application parsing is unnecessary." + +## 11. Health Checks + +Health checks determine whether an instance should receive traffic or be restarted. + +### 11.1 Liveness + +Liveness asks: is the process alive at all? + +If liveness fails, the platform may restart the container or instance. + +Best practice: keep liveness simple. It should detect deadlock or fatal stuck states, not depend on every external dependency. + +### 11.2 Readiness + +Readiness asks: is this instance ready to serve traffic right now? + +If readiness fails, the instance should stop receiving traffic, but it does not necessarily need a restart. 
+ +Examples of not-ready: + +- startup still in progress +- critical dependency unavailable +- cache warmup incomplete if that makes service unusable +- instance is draining for deployment + +### 11.3 Startup Probes + +Startup probes exist because some applications take time to initialize. + +Without startup-aware logic, a slow boot may be misclassified as a dead process and restarted repeatedly. + +### 11.4 Dependency Health + +This is subtle. + +Should readiness depend on downstream dependencies? + +Answer: only on truly critical dependencies. + +If a noncritical dependency fails and the service can degrade gracefully, readiness should often stay healthy. Otherwise you risk removing all instances from service for a partial dependency issue. + +### 11.5 Kubernetes Relevance + +In Kubernetes: + +- liveness probe failure can restart a pod +- readiness probe failure removes the pod from service endpoints +- startup probe delays liveness/readiness enforcement until boot completes + +This is why bad probes can cause cascading production pain. + +### 11.6 Probe Comparison + +| Probe | Purpose | Good use | Common mistake | +|---|---|---|---| +| Liveness | detect dead or stuck process | deadlock or irrecoverable internal failure | checking database and causing restart storms | +| Readiness | decide whether to receive traffic | dependency-aware traffic gating | marking ready too early | +| Startup | allow slow initialization | JVM warmup, cache preload, large bootstraps | omitting it for slow-start services | + +### 11.7 Best Practices + +- keep liveness shallow +- make readiness meaningful +- distinguish recoverable dependency issues from fatal ones +- support graceful shutdown by failing readiness first, then draining connections +- test probe behavior during deployments and partial outages + +## 12. Failover + +Failover is the process of moving traffic or responsibility from a failing component to a healthy one. 
+ +### 12.1 Why It Matters + +Failure is not exceptional in distributed systems. Machines fail, networks partition, regions degrade, and deployments go wrong. + +Failover is how the system continues serving despite that reality. + +### 12.2 Active-Passive Failover + +One environment serves traffic. Another waits in standby. + +Variants: + +- cold standby: mostly offline until needed +- warm standby: partially provisioned +- hot standby: ready to serve immediately + +Benefits: + +- easier reasoning about writes +- lower coordination complexity + +Costs: + +- potentially slower failover +- wasted standby capacity + +### 12.3 Active-Active Failover + +Multiple regions or clusters actively serve traffic. + +Benefits: + +- lower latency for global users +- fast failover because traffic is already live elsewhere +- better capacity utilization + +Challenges: + +- data consistency +- write coordination +- duplicate processing risk +- routing users to the correct regional data + +### 12.4 Regional Failover + +Regional failover usually means a global traffic manager stops sending requests to a bad region and shifts them elsewhere. + +```mermaid +flowchart LR + User[Client Traffic] --> GTM[Global Traffic Manager] + GTM -->|Primary healthy| R1[Primary Region] + GTM -->|Standby or overflow| R2[Secondary Region] + R1 -. region unhealthy .-> GTM + GTM -->|Failover| R2 +``` + +Hard parts: + +- DNS caches may slow failover +- stateful sessions may not exist in the secondary region +- databases may lag if replication is asynchronous +- legal or data residency rules may limit where data can fail over + +### 12.5 Database Failover Basics + +This is where request-handling discussions must connect to the data layer. + +Stateless compute failover is relatively straightforward. 
+ +Database failover is much harder because of: + +- replication lag +- split-brain risk +- leader election complexity +- transaction durability guarantees +- write fencing and stale primary protection + +Common patterns: + +- primary-replica with leader promotion +- managed database multi-AZ failover +- read replicas for scale, primary for writes +- careful multi-region replication for stricter availability needs + +### 12.6 RTO and RPO + +These two terms matter in interviews. + +- RTO: Recovery Time Objective. How quickly must the system recover? +- RPO: Recovery Point Objective. How much data loss is acceptable? + +Examples: + +- a chat notification system may tolerate some message delay and nonzero RPO +- a payment ledger system usually needs very low RPO and strict correctness + +Lower RTO and lower RPO generally increase cost and complexity. + +### 12.7 Best Practices + +- design failover with data consistency in mind, not just traffic movement +- test failover regularly, not just in slide decks +- make failover automation observable and reversible +- drain traffic from degraded zones before full failure when possible +- separate control-plane failure from data-plane failure in your reasoning + +### 12.8 Common Mistakes + +- assuming DNS failover is instant +- forgetting session or cache locality during regional failover +- promoting replicas without considering lag or split-brain protection +- calling a system active-active when only the stateless tier is active-active + +## 13. Sticky Sessions + +Sticky sessions mean requests from the same client are repeatedly routed to the same backend instance. + +### 13.1 Why They Exist + +They are often used when session state is stored in process memory on the server. 
+ +Examples: + +- legacy web sessions +- websocket affinity +- in-memory shopping cart state in older systems + +### 13.2 When They Help + +- short-term compatibility during migration away from stateful servers +- workloads where per-connection state is expensive to rebuild +- some real-time protocols that prefer affinity + +### 13.3 Why They Are Often Avoided + +Modern backend design prefers stateless services because stateless services: + +- scale horizontally more easily +- recover from failure more cleanly +- tolerate rebalancing better +- simplify failover and deployment + +Sticky sessions hurt these properties. + +Problems they create: + +- uneven load distribution +- poor failover if the chosen instance dies +- harder autoscaling +- harder cross-region portability + +### 13.4 Better Alternatives + +- external session store such as Redis +- signed or encrypted stateless tokens where appropriate +- shared caches for session metadata +- move connection state into durable or replicated infrastructure when necessary + +### 13.5 Common Interview Answer + +If asked about sticky sessions, a strong answer is: + +"I would avoid them unless the workload truly needs affinity. In general I prefer stateless services and store session state externally so the load balancer can route to any healthy instance." + +## 14. How These Pieces Fit Together in Real Architecture + +This is the part many interview answers miss. These systems are not isolated topics. They work together as one request-handling pipeline. 
+ +### 14.1 Typical SaaS API Architecture + +```mermaid +flowchart TD + Client[Client / Partner API Consumer] --> CDN[CDN] + CDN --> Edge[WAF / Reverse Proxy] + Edge --> Gateway[API Gateway] + Gateway --> Limits[Auth / Versioning / Rate Limit / Validation] + Limits --> LB[L7 Routing + Load Balancing] + LB --> App1[Service Instance Group A] + LB --> App2[Service Instance Group B] + App1 --> Cache[(Redis / Cache)] + App1 --> DB[(Primary DB + Replicas)] + App2 --> Queue[(Async Queue)] + Gateway -. traces .-> Obs[Logs / Metrics / Tracing] + App1 -. traces .-> Obs + App2 -. traces .-> Obs + Gateway -. service discovery .-> Registry[Service Registry / Orchestrator] +``` + +### 14.2 End-to-End Example: Payment API + +Consider `POST /payments` in a Stripe-like or SaaS billing system. + +1. Client sends HTTPS request with auth token and idempotency key. +2. CDN or edge forwards dynamic request to reverse proxy. +3. Gateway terminates TLS, checks auth, attaches trace IDs, and applies per-client rate limit. +4. Gateway validates request shape and routes to payment service. +5. Load balancer picks a healthy instance. +6. Payment service checks business rules and idempotency store. +7. Service writes to database transactionally and calls external payment processor if needed. +8. Response returns through gateway, which logs metadata and surfaces normalized errors. + +This one request involves: + +- authentication +- rate limiting +- validation +- routing +- load balancing +- idempotency +- observability +- external failure handling + +### 14.3 End-to-End Example: Global Consumer App + +Consider a Netflix-like or Uber-like global product. + +1. Global traffic manager chooses a region. +2. Edge or gateway enforces auth and traffic policy. +3. Router sends request to correct service version, maybe with canary rules. +4. Service mesh or internal load balancing handles service-to-service calls. +5. Read requests may hit regional caches first. +6. 
If one dependency is degraded, the system may throttle optional features and preserve the core experience. + +### 14.4 Layer-to-Concern Mapping + +| Layer | Main concerns | +|---|---| +| CDN and edge | caching, DDoS absorption, global reach | +| Reverse proxy or WAF | TLS termination, request filtering, compression | +| API gateway | auth, rate limiting, versioning, route policy, observability | +| L7 routing and load balancing | path and header-aware routing, canary, health-aware distribution | +| Service layer | business validation, idempotency, authorization on domain objects | +| Data layer | constraints, consistency, replication, failover | +| Observability platform | logs, metrics, traces, alerting | + +## 15. Real-World Discussion Patterns + +### 15.1 Google-Style or Amazon-Scale Thinking + +At very large scale, request handling is strongly influenced by geography and fleet management. + +The important patterns are: + +- global traffic steering +- regional isolation +- heavy automation around health and rollout +- multi-layer balancing rather than one magical balancer + +### 15.2 Netflix-Style Thinking + +A company deploying frequently cares deeply about: + +- canary releases +- traffic shaping +- resilience under partial failure +- observability and fast rollback + +### 15.3 Uber-Style Thinking + +Systems tied to geography or real-time state often care about: + +- geo-aware routing +- regional capacity balancing +- latency sensitivity +- selective degradation during spikes + +### 15.4 GitHub-Style and Stripe-Style Thinking + +Public API platforms care deeply about: + +- stable API contracts +- client-visible rate limiting +- request signing and webhook verification +- versioning discipline +- auditability and correctness + +### 15.5 Typical SaaS Thinking + +SaaS platforms often need a combination of: + +- tenant-aware routing +- per-tenant quotas and rate limits +- centralized auth and observability +- low operational complexity relative to global 
hyperscale systems + +## 16. Common Interview Questions and Strong Angles + +### 16.1 "Where would you put rate limiting?" + +Strong answer: + +"Mostly at the gateway or edge for cheap rejection, but I may also add service-level limits for expensive operations or tenant-specific protections." + +### 16.2 "How do you avoid duplicate writes when retries happen?" + +Strong answer: + +"Use idempotency keys or natural idempotency where possible, persist enough request identity to detect duplicates, and only retry safely repeatable operations." + +### 16.3 "When would you choose L4 vs L7?" + +Strong answer: + +"L7 when I need HTTP-aware routing and policies like auth, canary, or versioning. L4 when I need simpler, high-throughput transport balancing or I do not need application-layer inspection." + +### 16.4 "How would you do a safe deployment?" + +Strong answer: + +"Use health checks plus canary or blue-green routing, monitor business and technical metrics, and ensure rollback is fast." + +### 16.5 "What happens if the rate limiter or discovery service goes down?" + +Strong answer: + +"I need a failure policy. For some endpoints I fail open to preserve availability; for abuse-sensitive endpoints I may fail closed. For discovery, I keep short-lived cached endpoint data and remove unhealthy instances quickly, but I do not rely on stale data too long." + +### 16.6 "Why avoid sticky sessions?" + +Strong answer: + +"Because they make scaling, failover, and even load distribution harder. Stateless services are easier to operate and recover." + +## 17. 
Common Mistakes Across the Whole Topic + +- describing only the happy path and ignoring failure behavior +- saying "use a load balancer" without specifying which type or why +- retrying everything blindly +- forgetting idempotency on write APIs +- conflating authentication with authorization +- overusing the gateway as a business-logic layer +- assuming health checks are trivial +- assuming DNS-based failover is immediate and sufficient +- using sticky sessions to avoid fixing state management +- forgetting observability at the edge and routing layers + +## 18. Practical Best Practices Checklist + +- terminate or manage TLS deliberately; do not let trust boundaries be accidental +- reject bad or abusive traffic as early as possible +- keep services stateless when you can +- make retries explicit, bounded, and idempotency-aware +- use readiness checks to gate traffic and liveness checks to recover dead processes +- monitor p95 and p99 latency, not just averages +- make rollout and failover mechanisms observable +- keep routing policy simple enough to debug under pressure +- treat API versioning as a product and operational discipline, not just a URL pattern +- test degraded modes, not just normal operation + +## 19. Final Mental Model + +The cleanest way to think about request handling is this: + +Request handling is the system that protects scarce resources while getting the right request to the right code, at the right time, under the right policy, even when parts of the system are failing. + +If you can explain request handling from that perspective, you will do well in interviews and you will design more production-ready backend systems. 
diff --git a/systems design/10.reliabiltyAndProtection.md b/systems design/10.reliabiltyAndProtection.md new file mode 100644 index 0000000..5bfb5a8 --- /dev/null +++ b/systems design/10.reliabiltyAndProtection.md @@ -0,0 +1,1467 @@ +# Reliability & Protection + +Reliability and protection are the control systems that keep a backend usable when reality stops being polite. In a clean whiteboard interview, requests arrive at a nice steady rate, services are healthy, latency is predictable, and failures are isolated. In production, the opposite is usually true: + +- traffic is bursty, not smooth +- clients retry badly +- dependencies slow down before they fail +- bots scrape and abuse public endpoints +- one noisy tenant can starve everyone else +- instances become partially unhealthy +- engineers need to understand outages while users are actively impacted + +This is why good distributed systems do more than process business logic. They also protect themselves, measure themselves, and explain themselves. + +This guide is written for two audiences at the same time: + +1. Someone preparing for backend and system design interviews who needs strong, structured explanations. +2. Someone trying to understand how real production platforms stay stable under load. + +Examples in this guide are generalized from widely used public industry patterns rather than private implementation details, but they map closely to how large companies such as Google, Netflix, Amazon, Uber, Stripe, GitHub, and large SaaS platforms reason about these systems. + +## 1. Big Picture: What Reliability & Protection Actually Mean + +At a high level, reliability is the ability of a system to continue providing acceptable service over time, even when components fail, traffic changes, or dependencies misbehave. + +Protection is the set of mechanisms that stop the system from being destabilized by bad traffic, abusive clients, overload, or unhealthy internal states. 
+ +These two ideas are tightly connected: + +- rate limiting protects reliability by controlling who gets to consume scarce capacity +- monitoring protects reliability by detecting degradation quickly +- logging and tracing protect reliability by making incidents diagnosable +- health checks protect reliability by keeping broken instances out of the serving path +- observability protects reliability by shortening mean time to recovery, or MTTR + +If request handling decides where work goes, reliability and protection decide whether the system survives doing that work. + +### 1.1 Core Questions These Systems Answer + +Every mature backend eventually needs good answers to questions like these: + +1. How do we stop one client from overwhelming the system? +2. How do we know something is wrong before users file tickets? +3. When the system is slow, how do we know whether the bottleneck is the app, the database, or the network? +4. How do we debug a single failing user request across 10 microservices? +5. How do we keep unhealthy instances from receiving traffic? +6. How do we decide whether to reject, delay, retry, degrade, or fail over? + +Interviewers like this topic because it reveals whether you understand production reality, not just architecture vocabulary. + +### 1.2 Reliability Control Surface in a Typical Backend + +```mermaid +flowchart LR + Client[Client / Browser / Mobile App] --> Edge[CDN / WAF / Edge Proxy] + Edge --> Gateway[API Gateway] + Gateway --> ServiceA[Service A] + Gateway --> ServiceB[Service B] + ServiceA --> Cache[(Cache)] + ServiceA --> DB[(Primary DB)] + ServiceB --> MQ[(Queue / Stream)] + ServiceB --> Ext[External API] + + Edge -. edge rate limits .-> RL1[Protection Layer] + Gateway -. auth + quotas + route metrics .-> RL2[Policy Layer] + ServiceA -. app metrics/logs/traces .-> Obs[Observability Stack] + ServiceB -. app metrics/logs/traces .-> Obs + Gateway -. access logs + latency + trace roots .-> Obs + Edge -. 
DDoS signals + blocked traffic .-> Obs + + LB[Load Balancer / Service Mesh] --> ServiceA + LB --> ServiceB + HC[Health Checks] -. remove unhealthy endpoints .-> LB +``` + +### 1.3 Interview Framing + +A weak interview answer says: + +"I would add monitoring and rate limiting." + +A strong interview answer says: + +"I would enforce coarse rate limits at the edge to stop abusive traffic early, then add service-level quotas for expensive operations. I would instrument RED metrics at the gateway and USE metrics for the database and worker pools, propagate trace IDs across services, centralize structured logs for debugging, and use readiness checks so bad instances are drained before they receive traffic." + +That answer shows placement, purpose, and tradeoffs. + +--- + +## 2. Rate Limiting + +Rate limiting is one of the most important protection mechanisms in backend systems because it prevents demand from turning into collapse. + +### 2.1 Core Idea + +Fundamentally, rate limiting controls how quickly a client, tenant, API key, IP address, or internal caller can consume a resource. + +The resource might be: + +- HTTP requests per second +- login attempts per minute +- messages published per second +- database writes per tenant +- expensive AI inference calls per hour +- webhook deliveries per endpoint + +Rate limiting exists because backend resources are finite. CPU, memory, database connections, thread pools, cache bandwidth, and downstream API budgets are all limited. Without protection, the system behaves unfairly and often unstably. 
+ +### 2.2 What Rate Limiting Protects Against + +| Problem | What happens without limits | Why rate limiting helps | +|---|---|---| +| Abuse and bots | scrapers, credential stuffing, spam, brute force requests | slows attackers, raises cost of abuse | +| Cost explosion | one client generates huge billable backend work | protects infrastructure and vendor spend | +| Unfairness | one noisy tenant degrades everyone else | enforces fairness and multi-tenant isolation | +| Cascading overload | overloaded service causes retries and wider collapse | sheds or delays work before queues explode | +| Capacity mismatch | demand spikes above safe throughput | keeps the system inside stable operating bounds | + +An important intuition: rate limiting is not mainly about denying traffic. It is about shaping demand so the system stays in a region where it can still serve useful work. + +### 2.3 Where It Sits in the Request Lifecycle + +Rate limiting can exist at multiple layers: + +- edge or CDN layer +- web application firewall layer +- API gateway layer +- service or endpoint layer +- internal RPC layer +- asynchronous job admission layer + +Each layer protects a different thing: + +- edge limiting protects internet-facing capacity and blocks obvious abuse cheaply +- gateway limiting protects shared APIs and enforces customer quotas consistently +- service-level limiting protects expensive or sensitive business operations +- internal limiting protects downstream systems like databases, queues, or external providers + +### 2.4 Multi-Layer Request Flow + +```mermaid +flowchart LR + C[Client] --> E[Edge / CDN / WAF] + E --> G[API Gateway] + G --> S1[Auth Service] + G --> S2[Core API Service] + S2 --> DB[(Database)] + S2 --> P[(Payment / External Provider)] + + E -. IP reputation / bot filter / coarse rate limit .-> M1[Edge Protection] + G -. API key / tenant / route quota .-> M2[Gateway Rate Limit] + S2 -. expensive op guard / per-user write limit .-> M3[Service Limit] + S2 -. 
metrics / logs / traces .-> O[Observability] + G -. access logs / latency / rejections .-> O +``` + +This multi-layer approach matters because a single limiting point is usually too blunt. If you only limit at the service, abusive traffic still consumes gateway and network resources. If you only limit at the edge, an authenticated user may still abuse an expensive endpoint. + +### 2.5 Token Bucket + +Token bucket is one of the most common production rate limiting algorithms because it allows controlled bursts while maintaining a long-term average rate. + +#### 2.5.1 Intuition + +Imagine a bucket holding tokens. A request can only proceed if it removes a token. Tokens are added back over time at a fixed refill rate until the bucket reaches a maximum size. + +- refill rate controls steady-state throughput +- bucket size controls burst tolerance + +If tokens exist, traffic can burst. If the bucket empties, excess requests are rejected or delayed. + +#### 2.5.2 How It Works Internally + +The limiter stores two main pieces of state: + +- last refill timestamp +- current token count + +When a request arrives: + +1. Compute elapsed time since last refill. +2. Add new tokens according to elapsed time times refill rate. +3. Cap the bucket at max capacity. +4. If at least one token is available, consume one and allow the request. +5. Otherwise reject, throttle, or queue the request. + +In equation form, if the refill rate is $r$ tokens per second and elapsed time is $\Delta t$, then: + +$$ +new\_tokens = \min\left(capacity, current\_tokens + r \cdot \Delta t\right) +$$ + +#### 2.5.3 Concept Diagram + +```mermaid +flowchart LR + T[Time passes] --> R[Add tokens at fixed rate] + R --> B[(Token Bucket)] + Req[Incoming request] --> Check{Token available?} + B --> Check + Check -->|Yes| Allow[Consume token and allow] + Check -->|No| Reject[Reject / delay / degrade] +``` + +#### 2.5.4 Why Production Systems Like It + +Real traffic is rarely perfectly even. 
Users refresh a page, mobile clients reconnect, cron jobs fire on the minute, and webhooks fan out in bursts. Token bucket handles this better than a hard per-second cutoff because it absorbs short bursts without punishing normal behavior. + +This is why APIs from platforms like Stripe, GitHub, and cloud providers often combine long-term quotas with burst allowance rather than a rigid request-per-second cap. + +#### 2.5.5 Configuration Tradeoffs + +| Setting | If too low | If too high | +|---|---|---| +| Refill rate | good clients get throttled during normal use | service may still overload | +| Bucket capacity | no burst tolerance, bad user experience | burst can overwhelm downstream dependencies | +| Key granularity | unfair sharing across users | high cardinality and more storage | + +The important tradeoff is burst versus smoothness. A large bucket improves user experience for bursty workloads, but it can create traffic spikes that a fragile downstream service cannot handle. + +#### 2.5.6 Common Interview Discussion + +Interviewers often ask: "Why use token bucket instead of fixed window counting?" + +Strong answer: + +- token bucket handles bursts better +- it avoids harsh boundary effects like "100 requests at 12:00:59 and another 100 at 12:01:00" +- it maps better to systems that want average throughput plus burst tolerance + +### 2.6 Leaky Bucket + +Leaky bucket is another classic algorithm. It is often used when you want steady outflow instead of burst-friendly admission. + +#### 2.6.1 Intuition + +Imagine requests entering a bucket, but water leaks out at a constant rate. If input comes faster than output, the bucket fills. Once full, new requests are dropped or blocked. + +This models queue smoothing. 
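
The leaky-bucket picture can be sketched in a few lines of Python. This is a minimal illustration rather than a production limiter: the class name `LeakyBucket`, the parameters `capacity` and `leak_rate_per_sec`, and the injectable `clock` are all assumptions made for the example. It treats the bucket as a meter (each admitted request adds one unit of water) instead of holding real queued requests.

```python
import time


class LeakyBucket:
    """Leaky bucket used as a meter. All names here are hypothetical."""

    def __init__(self, capacity, leak_rate_per_sec, clock=time.monotonic):
        self.capacity = capacity            # how much water the bucket can hold
        self.leak_rate = leak_rate_per_sec  # constant drain rate
        self.clock = clock                  # injectable so the sketch is testable
        self.level = 0.0                    # current water level
        self.last = clock()                 # time of the last leak calculation

    def _leak(self):
        # Drain water for the time elapsed since the last call, never below zero.
        now = self.clock()
        elapsed = now - self.last
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last = now

    def try_add(self):
        # Admit the request only if one more unit still fits in the bucket.
        self._leak()
        if self.level + 1 > self.capacity:
            return False
        self.level += 1
        return True
```

With `capacity=2` and `leak_rate_per_sec=1.0`, two back-to-back requests are admitted, a third is rejected, and one more becomes admissible after roughly one second of draining.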
+ +#### 2.6.2 Internal Behavior + +- incoming requests are enqueued +- a scheduler or consumer drains the queue at a fixed rate +- if the queue exceeds capacity, additional requests are rejected + +This produces a smoother stream downstream, which can be useful when the protected system is sensitive to burstiness. + +#### 2.6.3 Where It Is Useful + +- traffic shaping in network systems +- smoothing writes to a fragile downstream service +- controlling job dispatch into worker pools +- protecting databases or third-party providers that dislike spikes + +#### 2.6.4 Token Bucket vs Leaky Bucket + +| Dimension | Token bucket | Leaky bucket | +|---|---|---| +| Traffic model | allows bursts up to bucket size | smooths traffic toward constant rate | +| User experience | better for bursty legitimate traffic | can add queueing delay | +| Downstream protection | moderate | stronger smoothing | +| Common usage | API rate limits, user quotas | shaping outbound work, queue drain control | + +The distinction is simple but important: + +- token bucket decides admission with burst allowance +- leaky bucket decides drain rate with burst smoothing + +Many real systems effectively combine both ideas. For example, an API may admit a burst via token bucket, then a worker queue may leak work toward a payment gateway at a controlled rate. + +### 2.7 Distributed Counters + +Single-process counters work in demos. They fail immediately in horizontally scaled systems. + +If your API runs on 100 instances and each instance only tracks its own local counters, then a user can often exceed the intended global limit by spreading requests across many instances. 
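
A small simulation makes the arithmetic concrete. It is a sketch under simplifying assumptions: round-robin load balancing, a single time window, and hypothetical values (`INSTANCES = 4`, `LIMIT_PER_WINDOW = 100`). With purely local counters, the effective limit becomes roughly the per-instance limit multiplied by the number of instances.

```python
from collections import Counter
from itertools import cycle

LIMIT_PER_WINDOW = 100  # intended limit per client (assumed value)
INSTANCES = 4           # number of API instances (assumed value)


def simulate(requests, instances=INSTANCES, limit=LIMIT_PER_WINDOW):
    """Count how many requests one client gets through when every
    instance enforces the limit using only its own local counter."""
    local_counts = Counter()
    admitted = 0
    round_robin = cycle(range(instances))
    for _ in range(requests):
        node = next(round_robin)
        if local_counts[node] < limit:
            local_counts[node] += 1
            admitted += 1
    return admitted
```

Under these assumptions, a client sending 1000 requests in one window is admitted 400 times instead of the intended 100.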
+ +#### 2.7.1 Why Single-Node Counters Break + +- load balancers route requests to different instances +- autoscaling adds and removes nodes dynamically +- restarts wipe in-memory state +- multi-region routing makes local counters inconsistent globally + +#### 2.7.2 Redis-Based Counters + +A common production approach is to keep rate limit state in Redis because it is fast, centralized, and supports atomic operations. + +Typical designs use: + +- atomic increment with expiration for window counters +- Lua scripts for atomic token bucket logic +- sorted sets for sliding-window calculations + +Why Redis is popular: + +- low latency +- shared across instances +- atomic primitives +- operationally simpler than full database-backed counters + +But Redis does not magically solve everything. + +#### 2.7.3 Consistency and Failure Challenges + +| Challenge | What can go wrong | +|---|---| +| replication lag | replicas may return stale counts | +| hot keys | a popular API key or IP becomes a bottleneck | +| partial outage | if Redis is down, limiter behavior must fail open or fail closed | +| cross-region latency | global counters become slow or inconsistent | +| clock dependence | time-window logic gets tricky around skew and boundaries | + +Fail-open means requests are allowed if the limiter backend is unavailable. This preserves availability but weakens protection. Fail-closed means requests are rejected when limiter state cannot be checked. This protects capacity but can create an outage for legitimate traffic. 
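
The two failure policies can be sketched as a thin wrapper around the limiter call. Everything named here is hypothetical: the `LimiterUnavailable` exception, the `check_limit` callable, the route names, and the policy table are assumptions for illustration, not the API of any particular library.

```python
class LimiterUnavailable(Exception):
    """Raised when the limiter backend (for example a shared store) is unreachable."""


# Hypothetical per-route failure policy: True = fail open, False = fail closed.
FAIL_OPEN_BY_ROUTE = {
    "/login": False,
    "/analytics/events": True,
}


def allow_request(route, key, check_limit, default_fail_open=False):
    """Return True if the request may proceed.

    check_limit(key) is an assumed callable that returns True or False and
    raises LimiterUnavailable when limiter state cannot be read.
    """
    try:
        return check_limit(key)
    except LimiterUnavailable:
        # The backend is down: fall back to the route's configured policy.
        return FAIL_OPEN_BY_ROUTE.get(route, default_fail_open)
```

Making the fallback an explicit, per-route setting keeps the decision visible in configuration rather than leaving it as an accident of error handling.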
+ +Which one is correct depends on what you are protecting: + +- login abuse defenses often lean fail-closed or degrade aggressively +- low-risk analytics APIs may lean fail-open to preserve usability +- financial or fraud-sensitive paths may use more conservative protection + +#### 2.7.4 In-Memory Plus Sync Approaches + +At very high scale, some systems use a hybrid strategy: + +- local in-memory counters for extremely fast coarse enforcement +- periodic synchronization to shared backing state +- eventual consistency with safety margins + +This reduces Redis load and latency, but increases approximation error. + +This is useful when exactness is less important than protection. For example, if the policy is "roughly 100 requests per second per client," being off by a small amount may be acceptable. If the policy is tied to billing or fraud, approximation may not be acceptable. + +#### 2.7.5 Time Window Challenges + +Naive time-based limiting creates ugly edge effects: + +- fixed windows allow bursts at boundaries +- sliding windows cost more to compute accurately +- per-second precision increases storage and write amplification + +This is why interview answers should mention time semantics, not just "store a counter in Redis." + +#### 2.7.6 Multi-Region Scaling Challenges + +Global rate limiting is hard because the system must choose between: + +- strict global accuracy +- low latency +- high availability + +You usually cannot maximize all three. + +Typical patterns: + +- region-local limits with some over-allocation buffer +- global quotas with asynchronous reconciliation +- per-region token allotments periodically rebalanced +- edge-local abuse blocking plus regional business quotas + +Example: a global SaaS API may enforce hard account quotas at a central control plane, but use region-local token buckets for fast data-plane enforcement. + +### 2.8 Abuse Prevention + +Rate limiting is one piece of a broader abuse prevention system. 
+ +#### 2.8.1 Common Abuse Patterns + +- bots scraping public pages or APIs +- credential stuffing using leaked usernames and passwords +- brute force attempts against login or OTP flows +- spam account creation +- card testing against payment endpoints +- low-and-slow abuse designed to stay below simple thresholds +- DDoS traffic intended to exhaust network, compute, or application resources + +#### 2.8.2 IP-Based vs User-Based Limiting + +| Dimension | IP-based | User/API key based | +|---|---|---| +| Strength | useful before authentication | better for fairness after auth | +| Weakness | NAT and mobile networks cause false positives | attackers can create many accounts | +| Best use | edge defense, anonymous traffic | tenant quotas, authenticated APIs | + +Strong systems combine both. Anonymous traffic might be limited by IP and ASN reputation at the edge, while authenticated traffic is limited by account, token, route, and operation type. + +#### 2.8.3 Behavioral Rate Limiting + +Simple request counting misses intent. Behavioral limiting looks at patterns like: + +- many login attempts across many usernames from one IP +- one account creating resources unusually fast +- abnormal error-rate patterns +- suspicious path traversal or enumeration behavior +- unusual geo or device distribution + +This is closer to how large anti-abuse systems at companies like Cloudflare, Stripe, GitHub, and large identity platforms think. The goal is not only "count requests," but "detect harmful behavior patterns." This often combines rules, heuristics, ML scoring, reputation data, and challenge systems such as CAPTCHAs or proof-of-work style friction. + +#### 2.8.4 Adaptive Throttling + +Adaptive throttling changes policy based on current system health or attack conditions. 
+ +Examples: + +- tighten anonymous limits when error rate spikes +- reduce expensive endpoint quotas when database saturation rises +- require additional verification for suspicious login flows +- deprioritize background traffic during an incident + +This is more resilient than static limits because static thresholds are often wrong under dynamic load. + +#### 2.8.5 DDoS Mitigation Basics + +Application teams usually do not stop large volumetric DDoS attacks alone. This is typically handled with layered defenses: + +- anycast edge networks +- CDN absorption +- WAF rules +- SYN flood protections +- upstream scrubbing centers +- edge filtering by reputation and geography + +The application layer still matters because sophisticated attacks often look like valid traffic but target expensive endpoints. + +### 2.9 Production Placement and Best Practices + +#### 2.9.1 Edge vs Service-Level Limiting + +| Placement | Best for | Tradeoff | +|---|---|---| +| Edge | block abusive traffic cheaply and early | limited user context before auth | +| API gateway | enforce route and tenant quotas consistently | gateway can become a bottleneck | +| Service layer | protect expensive operations with business context | traffic already consumed upstream capacity | + +Best practice is layered enforcement, not choosing only one layer. + +#### 2.9.2 Common Mistakes + +- using only IP limits and harming legitimate users behind NAT +- applying one global limit instead of per-route cost-aware limits +- failing open on sensitive abuse endpoints without thinking through risk +- forgetting internal callers and retry storms +- storing high-cardinality limit keys without memory planning +- ignoring user messaging and not returning retry metadata + +A good production API usually returns clear headers or error fields such as remaining quota, reset hints, or backoff guidance. 
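
As a rough sketch of what that retry metadata might look like, the helper below builds a rejection response. HTTP status 429 and the `Retry-After` header are standard; the `X-RateLimit-*` header names follow a common convention but are an assumption here, as are the function name and the JSON field names.

```python
import json
import math


def rate_limited_response(limit, remaining, reset_epoch_s, now_epoch_s):
    """Build (status, headers, body) for a request rejected by a rate limit."""
    # Seconds the client should wait, rounded up and never negative.
    retry_after = max(0, math.ceil(reset_epoch_s - now_epoch_s))
    headers = {
        "Retry-After": str(retry_after),                  # standard HTTP header
        "X-RateLimit-Limit": str(limit),                  # conventional, assumed
        "X-RateLimit-Remaining": str(max(0, remaining)),  # conventional, assumed
        "X-RateLimit-Reset": str(int(reset_epoch_s)),     # conventional, assumed
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limited",
        "retry_after_seconds": retry_after,
    })
    return 429, headers, body
```

Well-behaved clients can then back off for the indicated duration instead of retrying immediately and amplifying the load.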
+ +#### 2.9.3 Real-World Examples + +- GitHub-style APIs expose rate limits to clients and distinguish authenticated versus unauthenticated usage. +- Stripe-style systems protect sensitive payment and authentication flows with route-specific controls, idempotency, and anti-abuse heuristics. +- Cloudflare-style edge systems combine rate limits, bot signals, reputation, and challenge mechanisms. +- Large SaaS platforms often have tenant-level quotas to stop one customer from exhausting shared resources. + +--- + +## 3. Monitoring + +Monitoring is how a system notices that it is drifting away from healthy behavior. + +### 3.1 Core Idea + +Many people think monitoring exists to help debugging after something breaks. That is only part of the story. Monitoring exists to answer a broader question: + +"Is the system still delivering the reliability properties we promised?" + +That means monitoring is about: + +- early detection +- trend awareness +- capacity planning +- incident response +- reliability management +- business risk visibility + +If users discover outages before engineers do, monitoring is underperforming. + +### 3.2 Reliability vs Visibility + +Reliable systems are not systems with many dashboards. They are systems where signals are tied to real user impact. + +Visibility means you can see internal measurements. + +Reliability means you can use those measurements to keep the user experience within acceptable bounds. + +This is why mature teams tie monitoring to service level objectives, or SLOs, not just random machine metrics. + +### 3.3 Metrics + +Metrics are numeric measurements collected over time. They are efficient to aggregate, cheap to alert on, and ideal for answering "what is happening at scale?" 
+ +#### 3.3.1 Main Metric Types + +| Type | Meaning | Example | +|---|---|---| +| Counter | monotonically increasing count of events | requests_total, errors_total | +| Gauge | value that can go up or down | queue_depth, memory_usage | +| Histogram | distribution of observations across buckets | request_latency_ms | + +Counters are good for rates and totals. Gauges show current state. Histograms are essential for latency because averages hide tail pain. + +#### 3.3.2 RED and USE + +Two famous heuristics show up often in interviews. + +RED is useful for services: + +- Rate: how many requests are we serving? +- Errors: how many are failing? +- Duration: how long are they taking? + +USE is useful for resources: + +- Utilization: how busy is the resource? +- Saturation: how much queued or waiting work exists? +- Errors: is the resource itself failing? + +Strong teams use both. RED tells you user-facing service behavior. USE tells you whether underlying resources are the bottleneck. + +#### 3.3.3 Latency, Traffic, Errors, Saturation + +These four ideas should always be mentally connected: + +- traffic increases demand +- latency reveals response time degradation +- errors reveal outright failure +- saturation reveals how close you are to collapse + +Saturation is often the most neglected signal. CPU may be only 60 percent busy while database connection pools, thread pools, disk queues, or downstream concurrency limits are already maxed out. + +#### 3.3.4 Cardinality Problems + +Metrics systems love aggregation and hate unbounded labels. + +Bad idea: + +- label every request by user_id, session_id, order_id, or full URL path + +Why this breaks: + +- memory usage explodes +- query performance degrades +- storage cost rises quickly +- alerting becomes unstable + +This is a classic production lesson. Metrics are not logs. They should summarize classes of behavior, not store every individual event identity. 
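
One way to keep labels bounded is to normalize them before recording. The sketch below is illustrative and uses only the standard library: the helper names (`route_label`, `status_class`, `record`), the `:id` placeholder, and the in-memory `Counter` standing in for a real metrics client are all assumptions.

```python
import re
from collections import Counter

# Path segments that look like numeric IDs or UUIDs become one placeholder.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def route_label(path):
    """Collapse a concrete URL path into a low-cardinality route template."""
    parts = path.split("?")[0].strip("/").split("/")
    return "/" + "/".join(":id" if _ID_SEGMENT.match(p) else p for p in parts)


def status_class(code):
    """Group status codes into classes such as 2xx or 5xx."""
    return f"{code // 100}xx"


# Stand-in for a metrics client: one counter series per bounded label tuple.
requests_total = Counter()


def record(method, path, status):
    requests_total[(method, route_label(path), status_class(status))] += 1
```

Requests for `/users/123/orders/456` and `/users/9/orders/10` then share the single series `/users/:id/orders/:id`, while the individual identifiers remain available in logs and traces.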
+ +#### 3.3.5 How Metrics Drive Alerting + +Common alerting patterns: + +- error rate above threshold for a sustained period +- p99 latency above SLO target +- success rate below objective +- queue depth rising continuously +- worker backlog age exceeding target +- database saturation rising while throughput plateaus + +At companies like Google and many SaaS teams influenced by SRE practices, the strongest alerts are symptom-based. That means they alert on user-visible pain, not just internal weirdness. + +### 3.4 Dashboards + +Dashboards are human-readable views of system state. They exist to help operators build situational awareness quickly. + +#### 3.4.1 What a Good Dashboard Does + +- shows service health at a glance +- connects symptoms to likely causes +- reflects ownership boundaries +- supports incident triage under pressure +- avoids burying the important signal under decorative noise + +#### 3.4.2 Good vs Bad Dashboard Design + +| Good pattern | Bad pattern | +|---|---| +| starts with SLO and user-impact indicators | starts with dozens of host-level charts | +| shows request rate, errors, latency, saturation together | shows disconnected charts without context | +| broken down by region, endpoint, and dependency | mixes unrelated systems on one screen | +| has a clear owner and purpose | exists because someone thought dashboards are good | + +#### 3.4.3 Useful Dashboard Layers + +- executive or SLO dashboard: is the service meeting promises? +- service owner dashboard: what part of the service is degrading? +- dependency dashboard: is the database, cache, or queue causing the issue? +- operational drill-down: instance or pod level details for active debugging + +Interview insight: good dashboards are navigational aids, not data graveyards. + +### 3.5 Alerting + +Alerting is where many organizations accidentally create operational self-harm. 
+ +#### 3.5.1 What Makes an Alert Good + +A good alert is: + +- actionable +- tied to impact or a credible precursor to impact +- routed to the right owner +- urgent enough to deserve interruption +- low-noise enough that people still trust it + +If an alert does not change what an engineer should do, it is probably not a good page. + +#### 3.5.2 Alert Fatigue + +Noisy alerts train engineers to ignore the monitoring system. That is dangerous because the real outage then arrives on a channel people have learned to distrust. + +Common causes: + +- thresholds set too close to normal variability +- paging on transient spikes instead of sustained conditions +- duplicate alerts from multiple layers for the same symptom +- paging on causes rather than symptoms without enough confidence +- alerts with no runbook or clear ownership + +#### 3.5.3 Thresholds vs Anomaly Detection + +| Approach | Strength | Weakness | +|---|---|---| +| Static thresholds | simple and explainable | brittle under seasonal traffic patterns | +| Dynamic baselines / anomaly detection | adapts to changing patterns | can be opaque and noisy if poorly tuned | + +Most mature systems use a mix. Critical SLO breaches often use straightforward thresholds. Supporting signals may use anomaly detection. + +#### 3.5.4 Paging vs Non-Paging Alerts + +- paging alerts wake humans because user impact is happening or imminent +- non-paging alerts create tickets, Slack notifications, or backlog items + +Not every bad graph deserves a pager. 
Use severity tiers such as: + +- P0: severe widespread outage or data risk +- P1: major feature degradation with high customer impact +- P2: limited degradation or internal operational issue + +#### 3.5.5 How Good Teams Reduce Noise + +- alert on burn rate against SLOs rather than raw single-sample spikes +- deduplicate related alerts +- suppress child alerts during known parent outages +- require every alert to have an owner and action +- review and prune alerts after incidents + +Netflix, Google-style SRE teams, and mature SaaS platforms commonly treat alert quality as an engineering problem, not a monitoring config problem. + +### 3.6 Uptime Checks + +Uptime checks are active probes that ask, "Can I reach this system and get the expected result?" + +#### 3.6.1 Synthetic Monitoring + +Synthetic monitoring sends artificial requests on a schedule from one or more regions. + +Examples: + +- GET health endpoint every 30 seconds +- create a lightweight test account and perform a login flow +- simulate checkout without committing payment +- fetch an API response and verify critical fields + +#### 3.6.2 Health Endpoint vs Real User Monitoring + +| Method | What it tells you | Limitation | +|---|---|---| +| Health endpoint | service says it is alive or ready | may not reflect real user experience | +| Synthetic monitoring | external path works for scripted flows | may miss edge cases and customer diversity | +| Real user monitoring | what actual users are experiencing | harder to control and aggregate cleanly | + +#### 3.6.3 Global Checks and Detection Latency + +Checks from many regions matter because an outage may be regional, DNS-related, CDN-related, or ISP-specific. + +Tradeoff: + +- more frequent checks reduce detection latency +- but higher frequency increases probe noise and cost + +#### 3.6.4 "System Is Up" vs "System Is Usable" + +A service can return HTTP 200 and still be functionally broken. 
+ +Examples: + +- login succeeds but dashboard data never loads +- API responds quickly with empty or stale data because a dependency is degraded +- health endpoint passes while database writes silently fail + +This is why mature uptime programs test critical user journeys, not just TCP reachability. + +--- + +## 4. Logging + +Logs are the detailed event record of what the system believed and did at specific moments. + +### 4.1 Core Idea + +If metrics tell you that something is wrong, logs often tell you what happened in enough detail to investigate. + +They are the black box recorder of distributed systems. + +Logs are especially important because production incidents often involve context that metrics intentionally discard: + +- exact error messages +- request payload characteristics +- code path decisions +- retry behavior +- dependency-specific failure reasons +- tenant- or endpoint-specific anomalies + +### 4.2 Logs vs Metrics + +Metrics compress behavior into numeric summaries. Logs preserve detailed event context. + +That difference is why you need both. + +If p99 latency spikes, metrics show the spike. Logs may reveal that requests with a certain downstream provider, tenant, or payload size were timing out. + +### 4.3 Centralized Logs + +#### 4.3.1 Why Centralization Is Necessary + +In modern distributed systems, requests touch many ephemeral machines and containers. Local log files are not enough because: + +- instances autoscale and disappear +- a single request crosses many services +- incidents require searching across time and systems +- security and compliance often require retention and auditability + +So logs are shipped to centralized systems such as Elasticsearch-based stacks, Loki, Splunk, Datadog, or vendor-managed observability backends. + +#### 4.3.2 Structured Logging vs Plain Text + +Structured logs use fields, usually JSON or key-value format, rather than free-form text. 
+ +Example fields: + +- timestamp +- level +- service +- environment +- region +- request_id +- trace_id +- route +- user_id or tenant_id if safe and allowed +- error_code +- latency_ms + +Structured logging wins because it is searchable and aggregatable. Plain text is easy for humans to read locally, but much harder to query reliably at scale. + +#### 4.3.3 Indexing and Searchability + +Central log systems typically parse incoming logs, extract fields, index selected attributes, and support search by time, service, correlation ID, severity, or structured fields. + +The core tradeoff is query flexibility versus cost. + +Full indexing of all fields can become very expensive at scale. Mature teams choose carefully which fields deserve indexing and which belong only in raw archived events. + +#### 4.3.4 Retention Policies + +Log retention is both a cost and compliance topic. + +Hot searchable storage is expensive. Cold archival storage is cheaper but slower to query. + +Many systems use tiers: + +- short retention for high-volume debug logs +- longer retention for important application and audit logs +- strict redaction or encryption for sensitive data + +#### 4.3.5 Log Volume Explosion + +Logging scales badly if left unmanaged. + +Common failure modes: + +- every retry logs a full stack trace +- debug logging is left on in production +- large request or response payloads are logged indiscriminately +- high-cardinality fields create indexing blowups + +During incidents, bad logging can worsen the outage by saturating disk, network, or the logging backend itself. + +### 4.4 Logging in Distributed Systems + +#### 4.4.1 Correlation IDs and Request IDs + +A correlation ID is a shared identifier attached to all logs related to one logical request or workflow. + +Without this, debugging a request across services becomes guesswork. + +Typical flow: + +1. Gateway receives request. +2. Gateway generates or propagates request ID and trace ID. +3. 
Each downstream service logs those IDs. +4. Operators search all logs for that ID. + +#### 4.4.2 Cross-Service Traceability + +Suppose a checkout request hits: + +- API gateway +- cart service +- inventory service +- payment service +- notification service + +If the payment call fails only for certain retries and inventory had already reserved stock, engineers need the end-to-end sequence. Shared identifiers make that investigation possible. + +#### 4.4.3 Ordering Challenges + +Distributed logs are not perfectly ordered because: + +- clocks are not identical +- network delays vary +- logs are buffered and shipped asynchronously +- retries create overlapping attempts + +This is why timestamps alone are insufficient. Correlation IDs and trace IDs are mandatory for serious debugging. + +#### 4.4.4 Real-World Debugging Example + +Imagine users report intermittent checkout failures. + +Metrics show: + +- error rate spike in checkout API +- latency spike in payment dependency + +Logs reveal: + +- payment provider timeout after 2 seconds +- retry policy triggered twice +- inventory reservation succeeded but compensation job was delayed + +Tracing reveals: + +- most latency sits in one external payment span + +Together, these signals tell a coherent story. Any one signal alone would be incomplete. + +### 4.5 Logging Best Practices and Mistakes + +Best practices: + +- use structured logs +- include request and trace identifiers +- log meaningful business events, not only code exceptions +- redact secrets and personal data +- control log levels carefully +- sample noisy repetitive logs when appropriate + +Common mistakes: + +- logging passwords, tokens, or full card data +- logging too little context to debug anything +- logging so much detail that the platform becomes unusable +- using logs as a substitute for metrics + +--- + +## 5. Tracing + +Tracing maps the journey of a request through a distributed system. 
+ +### 5.1 Core Idea + +Logs tell you what happened in individual components. Metrics tell you the system-level shape of behavior. Tracing tells you where time and failure accumulated along one request path. + +This matters because modern backends are not linear. One user request may trigger multiple services, caches, queues, and external APIs. If the request is slow, the critical question becomes: + +"Which part of the path consumed the time?" + +### 5.2 Distributed Tracing Fundamentals + +#### 5.2.1 Trace and Span + +- a trace represents one end-to-end request or workflow +- a span represents one unit of work inside that trace + +Each span usually includes: + +- start time +- end time or duration +- operation name +- service name +- status or error info +- parent span reference +- attributes like route, region, DB statement class, retry count + +#### 5.2.2 Parent-Child Relationships + +Tracing forms a tree or DAG-like timeline. + +Example: + +- root span: incoming API request +- child span: call auth service +- child span: call order service +- child span: call payment service +- nested child span: payment service calls external processor + +This lets engineers see both total latency and where it was spent. 
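The parent-child structure above can be sketched in a few lines. This is a minimal illustration, not any tracing backend's data model: the `Span` fields, the `self_time_ms` and `bottleneck` helpers, and the timings in `TRACE` are all hypothetical, and the calculation assumes the children of a span run sequentially rather than in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(spans: List[Span]) -> Dict[str, int]:
    """Return time spent in each span excluding its direct children.

    Assumption: children of one parent do not overlap in time. With
    parallel children, overlapping intervals would need to be merged first.
    """
    child_total: Dict[str, int] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    times = self_time_ms(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Hypothetical trace mirroring the example hierarchy (all timings invented).
TRACE = [
    Span("a", None, "gateway", "POST /checkout", 0, 2800),
    Span("b", "a", "auth", "verify token", 10, 60),
    Span("c", "a", "orders", "create order", 70, 250),
    Span("d", "a", "payments", "charge card", 260, 2760),
    Span("e", "d", "payments", "external processor call", 300, 2700),
]
```

Calling `bottleneck(TRACE)` on this invented data returns the external processor span: the root span's total duration is large, but almost all of it is inherited from one nested child. Separating total duration from self time is what lets a trace view point at the real culprit.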
+ +#### 5.2.3 Latency Breakdown and Bottleneck Identification + +Tracing is exceptionally good at revealing: + +- which dependency is slow +- where parallelism is working or not working +- which retries amplified latency +- which branch of a request fan-out dominates tail latency + +### 5.3 Request Flow Example + +```mermaid +sequenceDiagram + participant Client + participant Gateway + participant Orders + participant Inventory + participant Payments + participant DB + + Client->>Gateway: POST /checkout + Gateway->>Orders: create order span + Orders->>Inventory: reserve stock span + Inventory->>DB: update inventory span + DB-->>Inventory: success + Inventory-->>Orders: reserved + Orders->>Payments: charge card span + Payments->>DB: load payment state span + DB-->>Payments: payment state + Payments-->>Orders: timeout / slow response + Orders-->>Gateway: partial failure + Gateway-->>Client: 503 or retryable error +``` + +With tracing, the system can show that total request time was, for example, 2.8 seconds, of which 2.4 seconds came from the payment span. + +### 5.4 Trace Propagation + +Tracing only works if context is propagated across service boundaries. + +Typical behavior: + +1. The entry point generates a trace ID and root span. +2. Outbound requests carry tracing headers. +3. Downstream services create child spans linked to the parent. +4. Async workflows may continue the trace with linked context. + +If a service fails to propagate headers, observability becomes fragmented. This is one of the most common real-world tracing failures. + +### 5.5 OpenTelemetry Conceptually + +OpenTelemetry is a widely used vendor-neutral observability framework. 
+ +At a high level, it provides: + +- APIs and SDKs for instrumentation +- conventions for traces, metrics, and logs +- context propagation support +- exporters to different backends + +Why it matters: + +- teams do not want instrumentation tied forever to one vendor +- standard context propagation improves interoperability +- shared semantic conventions reduce chaos across services + +High-level flow: + +- application emits spans and metrics through OpenTelemetry SDKs +- local agent or collector batches and processes telemetry +- exporter sends data to systems such as Jaeger, Tempo, Prometheus-compatible pipelines, Datadog, New Relic, Honeycomb, or cloud-native observability backends + +### 5.6 Tracing in Production + +Tracing every request at full detail can be expensive. + +Real systems often use: + +- head-based sampling: decide early whether to keep a trace +- tail-based sampling: keep traces matching outcomes like errors or high latency +- adaptive sampling: retain more unusual or important traces + +Tradeoff: + +- more tracing detail improves debugging +- but increases storage, network overhead, and backend cost + +Companies handling very high throughput often sample heavily for normal traffic while keeping full traces for errors or important workflows. + +### 5.7 Common Mistakes + +- not propagating trace context across all services +- naming spans inconsistently +- tracing everything but not correlating to logs and metrics +- hiding expensive downstream calls inside uninstrumented libraries +- ignoring async boundaries such as queue consumers and background jobs + +--- + +## 6. Observability + +Observability is the ability to understand the internal state of a system by examining its outputs. + +### 6.1 Core Idea + +Monitoring asks, "Did something go wrong?" + +Observability asks, "Can we explain why this behavior is happening, even if we did not predict this exact failure mode in advance?" 
+ +That difference matters a lot in distributed systems because not every outage matches a prewritten alert rule. + +### 6.2 The Three Pillars + +The common mental model is logs, metrics, and traces. The phrase is a simplification, but still useful. + +| Signal | Best question it answers | Typical strengths | Typical weakness | +|---|---|---|---| +| Metrics | what is happening broadly? | fast aggregation, dashboards, alerts | limited detail | +| Logs | why did this specific event happen? | rich context, debugging detail | high volume and cost | +| Traces | where did time or failure occur in the path? | request journey and dependency map | sampling and instrumentation gaps | + +The simplest interview-friendly summary is: + +- metrics tell you what is happening +- logs tell you why it is happening +- traces tell you where it is happening + +That summary is not perfect, but it is very useful. + +### 6.3 Why Observability Exists + +Modern systems are: + +- distributed +- dynamic +- asynchronous +- partially failing +- constantly changing through deploys and config changes + +Because of that, operators cannot rely on static mental models alone. They need live evidence that connects symptoms to root causes. + +### 6.4 Observability Pipeline Architecture + +```mermaid +flowchart LR + App1[Service A] -->|metrics| Col[OpenTelemetry Collector / Agents] + App1 -->|logs| Col + App1 -->|traces| Col + App2[Service B] -->|metrics| Col + App2 -->|logs| Col + App2 -->|traces| Col + GW[Gateway / Edge] -->|access logs, RED metrics, root traces| Col + + Col --> M[Metrics Store] + Col --> L[Log Store / Search] + Col --> T[Trace Store] + + M --> Dash[Dashboards] + M --> Alert[Alerting Engine] + L --> Investigate[Incident Investigation] + T --> Investigate + Dash --> OnCall[On-call Engineer] + Alert --> OnCall +``` + +This pipeline is important because observability is not just instrumentation inside code. 
It also includes collection, transport, storage, indexing, alerting, retention, access control, and operational cost management. + +### 6.5 Debugging in Production + +#### 6.5.1 Typical Incident Flow + +```mermaid +flowchart TD + A[Alert fires or users report issue] --> B[Check SLO / impact dashboard] + B --> C[Identify affected service, region, route, tenant] + C --> D[Inspect metrics for latency, errors, saturation] + D --> E[Open traces for slow or failed requests] + E --> F[Search logs by request or trace ID] + F --> G[Confirm root cause hypothesis] + G --> H[Mitigate: rollback, failover, throttle, disable feature, scale] + H --> I[Verify recovery with dashboards and synthetic checks] +``` + +#### 6.5.2 How Observability Reduces MTTR + +MTTR is improved when engineers can move quickly from symptom to cause: + +- alert points to user-visible degradation +- dashboard localizes which dimension is affected +- trace identifies slow dependency or failing branch +- logs reveal exact failure details +- health and deployment metadata show whether a rollout caused the issue + +Without observability, incident response becomes guesswork, which increases both outage duration and risk of bad mitigation. + +### 6.6 Practical Real-World Examples + +- Google-style SRE thinking emphasizes SLIs, SLOs, and high-signal alerts tied to user impact. +- Netflix-style operations emphasize rich telemetry, dependency visibility, and resilience under partial failure. +- Uber-style microservice environments rely heavily on traceability and service ownership because requests span many internal systems. +- Stripe-style systems combine deep request context, idempotency, tracing, and auditability for high-stakes financial workflows. +- GitHub-style large API platforms need careful rate-limit observability, route-level error tracking, and tenant-aware operational debugging. 
+ +### 6.7 Common Mistakes + +- collecting lots of telemetry without clear operational questions +- not connecting telemetry to service ownership +- keeping signals in separate tools without correlation IDs +- instrumenting only the application and ignoring proxies, queues, workers, and databases +- forgetting cost: observability can become a major platform expense + +--- + +## 7. Health Checks + +Health checks determine whether a system component should be considered safe to use. + +### 7.1 Core Idea + +In distributed systems, failure is often partial. A process may still be running but unable to serve traffic correctly because: + +- it cannot reach the database +- it is still warming caches +- it is deadlocked internally +- it is overloaded and timing out all work +- a critical dependency is unavailable + +Health checks exist so orchestration systems, service meshes, and load balancers can make smarter routing and restart decisions. + +### 7.2 Liveness Checks + +Liveness answers: + +"Is this process alive, or is it stuck badly enough that restart is reasonable?" + +#### 7.2.1 What It Detects + +- deadlocks +- event loop stalls +- process hangs +- unrecoverable internal corruption + +In Kubernetes-style environments, failing liveness checks usually triggers restart behavior. + +#### 7.2.2 What It Should Not Do + +Liveness should not depend on every downstream dependency. If your database has a brief hiccup and every pod fails liveness, the orchestrator may restart healthy application processes and make the incident worse. + +That is a classic mistake. + +### 7.3 Readiness Checks + +Readiness answers: + +"Can this instance safely receive traffic right now?" 
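One way to keep the two questions separate in code is sketched below. It is an illustration under stated assumptions, not a Kubernetes or framework API: `HealthState`, `beat`, `critical_checks`, and the 10-second heartbeat threshold are hypothetical names and values. The point it shows is that liveness looks only at the process's own progress, while readiness also considers startup state and dependencies that are truly critical.

```python
import time
from typing import Callable, Dict, Optional


class HealthState:
    """Answers "should I be restarted?" and "should I get traffic?" separately."""

    def __init__(
        self,
        critical_checks: Optional[Dict[str, Callable[[], bool]]] = None,
        max_heartbeat_age_s: float = 10.0,  # hypothetical threshold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.critical_checks = critical_checks or {}
        self.max_heartbeat_age_s = max_heartbeat_age_s
        self.clock = clock
        self.startup_complete = False
        self.last_heartbeat = clock()

    def beat(self) -> None:
        """Called by the main work loop to show it is still making progress."""
        self.last_heartbeat = self.clock()

    def liveness(self) -> bool:
        # Deliberately ignores downstream dependencies: a database hiccup
        # must not cause healthy processes to be restarted.
        return self.clock() - self.last_heartbeat <= self.max_heartbeat_age_s

    def readiness(self) -> bool:
        # Not ready until startup and warm-up have finished.
        if not self.startup_complete:
            return False
        # Only dependencies the instance cannot serve without belong here.
        for check in self.critical_checks.values():
            try:
                if not check():
                    return False
            except Exception:
                # A check that raises is treated as "not ready".
                return False
        return True
```

A liveness endpoint would return success only while `liveness()` is true, and a readiness endpoint only while `readiness()` is true. With this split, a database outage flips readiness and drains traffic, but the process keeps running and recovers without a restart.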
+ +#### 7.3.1 Typical Readiness Conditions + +- startup complete +- configuration loaded +- required internal caches warmed +- dependency connections established if truly necessary +- worker pool or listener ready + +If readiness fails, the instance should be removed from load balancer rotation but not necessarily restarted. + +#### 7.3.2 Startup and Warm-Up + +Readiness is critical during deploys because an instance may be alive but not ready. + +Examples: + +- JVM application started but still warming caches +- service needs to load a large model into memory +- background migration step not complete +- thread pools not yet initialized + +Without readiness checks, load balancers send traffic too early and users see transient deployment failures. + +### 7.4 Liveness vs Readiness + +| Question | Liveness | Readiness | +|---|---|---| +| Meaning | should this process be restarted? | should this instance receive traffic? | +| Typical action | restart container or process | stop routing traffic to instance | +| Dependency sensitivity | low | moderate, only for critical serving dependencies | +| Common misuse | tying to downstream outage | making checks so strict that capacity flaps | + +### 7.5 Service Health and Dependency Awareness + +Real systems are not simply healthy or unhealthy. They are often degraded. + +Examples: + +- recommendation service down, checkout still works +- write path degraded, read path fine +- one region unhealthy, others normal +- one dependency timing out but cached responses still serve users + +This is why health models should support partial degradation, not only binary status. + +#### 7.5.1 Dependency-Aware Health Checks + +Sometimes readiness should include critical dependencies. The key word is critical. + +If a service cannot possibly serve correct traffic without the primary database, then readiness may reasonably fail when the DB is unreachable. + +But if a noncritical analytics sink is down, failing readiness would be wrong. 
That would convert a partial outage into total self-inflicted unavailability. + +#### 7.5.2 Cascading Failure Risk + +Naive health checks can create cascading failures: + +1. Database latency rises. +2. Application health endpoint checks DB synchronously. +3. Health checks time out. +4. Load balancer removes many instances. +5. Remaining instances take more load. +6. System collapses faster. + +This is a common interview discussion because it shows whether you understand feedback loops. + +### 7.6 Failure Detection Flow + +```mermaid +sequenceDiagram + participant LB as Load Balancer + participant Pod as Service Instance + participant HC as Health Endpoint + participant App as App Logic + participant DB as Critical Dependency + + LB->>Pod: readiness probe + Pod->>HC: evaluate readiness + HC->>App: check internal serving state + App->>DB: lightweight critical dependency check + DB-->>App: timeout / unhealthy + App-->>HC: not ready + HC-->>Pod: readiness=false + Pod-->>LB: remove from rotation +``` + +This sequence is useful only if the dependency check is carefully scoped. Do not perform heavy downstream checks on every probe. + +### 7.7 Best Practices + +- keep liveness simple and focused on stuck-process detection +- use readiness to gate traffic during startup and critical dependency loss +- support degraded modes where possible +- avoid expensive checks in hot probe paths +- add jitter and sensible intervals to avoid probe storms +- expose health state to operators, not just orchestration systems + +### 7.8 Common Mistakes + +- using identical logic for liveness and readiness +- making readiness depend on optional systems +- making health endpoints themselves expensive and failure-prone +- restarting instances during dependency outages when draining would be enough +- forgetting that health checks happen at scale across many pods simultaneously + +--- + +## 8. 
System Design Integration: How These Pieces Work Together + +This section is the most important operationally because real systems do not run rate limiting, monitoring, logging, tracing, and health checks in isolation. + +They form a coordinated control plane around request processing. + +### 8.1 End-to-End Architecture View + +```mermaid +flowchart LR + Client[Client / SDK / Browser] --> Edge[CDN / WAF / Edge Proxy] + Edge --> Gateway[API Gateway / Ingress] + Gateway --> LB[Internal Load Balancer / Service Mesh] + LB --> SvcA[Service A] + LB --> SvcB[Service B] + SvcA --> DB[(Database)] + SvcA --> Cache[(Cache)] + SvcB --> MQ[(Queue)] + SvcB --> Ext[External API] + + Edge -. edge rate limiting, bot defense, access logs .-> Obs[Observability Pipeline] + Gateway -. auth, quotas, request metrics, root trace, request ID .-> Obs + SvcA -. app logs, spans, business metrics .-> Obs + SvcB -. app logs, spans, worker metrics .-> Obs + DB -. DB metrics, slow query logs .-> Obs + Cache -. hit rate, memory, evictions .-> Obs + + HC[Health Probes] -. readiness / liveness .-> LB + HC -. pod status .-> SvcA + HC -. 
pod status .-> SvcB + + Obs --> MStore[Metrics Store] + Obs --> LStore[Log Store] + Obs --> TStore[Trace Store] + MStore --> Dash[Dashboards + Alerts] + LStore --> OnCall[Incident Investigation] + TStore --> OnCall + Dash --> OnCall +``` + +### 8.2 Where Each System Sits + +| Concern | Typical placement | Why there | +|---|---|---| +| Edge rate limiting | CDN, WAF, ingress | cheapest place to block obvious abuse | +| Auth-aware quotas | API gateway | consistent policy before services execute | +| Expensive-operation limits | service layer | requires business context | +| Metrics | every layer, especially gateway, services, dependencies | high-level health and alerting | +| Logging | gateway, services, workers, dependencies | forensic detail and debugging | +| Tracing | entry points and all RPC boundaries | end-to-end latency and failure mapping | +| Health checks | instances, orchestrator, load balancer | keep broken endpoints out of traffic paths | + +### 8.3 Example Request Journey + +Consider a request to create a payment or place an order: + +1. The client request reaches the edge. +2. Edge systems apply bot screening, IP reputation checks, and coarse anonymous rate limiting. +3. The gateway authenticates the request, generates or propagates a trace ID, emits access metrics, and enforces account or route quotas. +4. The load balancer sends traffic only to ready instances. +5. The service performs business logic and may apply stricter limits for expensive or risky operations. +6. Downstream database calls, queue writes, and provider calls emit spans and logs. +7. Metrics summarize overall behavior, logs capture contextual events, and traces map the request path. +8. If failures appear, alerts fire from user-impacting metrics, and engineers pivot into traces and logs. + +### 8.4 What Breaks at Scale + +As systems scale, each protection mechanism develops new challenges. 
+ +#### 8.4.1 Rate Limiting at Scale + +- limiter storage hot spots +- global versus regional consistency tradeoffs +- fairness across noisy tenants +- false positives under NAT or proxy aggregation + +#### 8.4.2 Monitoring at Scale + +- metric cardinality explosion +- alert floods during incidents +- dashboards nobody trusts +- missing saturation signals + +#### 8.4.3 Logging at Scale + +- ingest cost explosion +- slow search in huge indexes +- secret leakage risk +- logging platform overload during outages + +#### 8.4.4 Tracing at Scale + +- storage cost of unsampled traces +- missing context propagation +- high overhead on hot paths +- incomplete visibility for async workflows + +#### 8.4.5 Health Checking at Scale + +- probe storms against large fleets +- flapping readiness on unstable dependencies +- self-inflicted restarts from aggressive liveness policies + +### 8.5 Real Production Patterns + +#### 8.5.1 Google-Style Thinking + +Google SRE literature strongly shaped industry thinking here: + +- tie monitoring to user-facing indicators +- define service level objectives +- use alerts tied to reliability budget burn, not raw noise +- treat observability as part of operating the system, not a side feature + +#### 8.5.2 Netflix-Style Thinking + +Netflix popularized resilience thinking in microservice-heavy environments: + +- expect dependency failure +- isolate failure domains +- maintain rich telemetry across services +- protect systems through load shedding, timeouts, and adaptive behavior + +#### 8.5.3 Amazon-Style Thinking + +Amazon-style large-scale backend thinking emphasizes: + +- cell-based or isolation-focused architectures +- protecting downstream dependencies with throttles and retries tuned carefully +- operational ownership by teams +- metrics and alarms tied to service health and customer impact + +#### 8.5.4 Uber and Large SaaS Patterns + +In highly distributed service environments: + +- traceability becomes essential because request paths are 
long +- rate limits are often multi-dimensional: user, tenant, endpoint, city, partner, driver, merchant, or feature class +- observability needs strong ownership and standard instrumentation to avoid chaos + +#### 8.5.5 Stripe and GitHub Patterns + +For API-centric platforms: + +- rate limits and quota communication must be clear to external developers +- logs and traces must support incident forensics and customer support +- protection is often business-sensitive, especially around auth, abuse, payments, and webhooks + +### 8.6 Best Practices for Interview Answers + +When discussing reliability and protection in an interview: + +1. Start with what failure or abuse mode you are protecting against. +2. Place each mechanism in the architecture explicitly. +3. Explain tradeoffs, not just components. +4. Mention what changes at scale or in multi-region systems. +5. Tie observability to incident response, not just graphs. + +Example strong phrasing: + +"I would use edge rate limiting for anonymous abuse, gateway quotas for tenant fairness, and service-level throttles for expensive operations. I would instrument RED metrics and trace propagation at the gateway and service boundaries, centralize structured logs with request IDs, and use readiness checks to drain unhealthy instances before they receive traffic. For multi-region operation, I would keep fast local enforcement and accept some approximation rather than placing every request on a globally consistent counter path." 
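The "fast local enforcement" part of that answer can be made concrete with a small sketch. The code below is a per-tenant token bucket held in process memory. The class names, rates, and capacities are hypothetical, and it is not thread-safe. Because each instance keeps its own counters, the fleet-wide limit is only approximate, which is the tradeoff the phrasing above accepts.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Classic token bucket: refills at a steady rate up to a fixed capacity."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so short bursts are allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class LocalTenantLimiter:
    """One bucket per tenant, kept only in this process (approximate globally)."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_s, self.capacity, self.clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow(cost)
```

Expensive operations can pass a larger `cost`, which is one simple way to express service-level throttles for heavy endpoints using the same mechanism.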
+ +### 8.7 Common Cross-Cutting Mistakes + +- adding retries without rate limits and causing retry storms +- adding health checks without thinking about dependency-induced flapping +- adding logs without structured fields or correlation IDs +- adding metrics with unbounded cardinality +- adding tracing without propagation across async boundaries +- adding alerts that page constantly without clear action + +These mistakes usually happen when systems are designed component-by-component rather than as a complete operating model. + +--- + +## 9. Final Mental Model + +The cleanest way to remember this topic is: + +- rate limiting protects capacity and fairness +- monitoring detects that reliability is degrading +- logging preserves detailed facts for investigation +- tracing maps a request across service boundaries +- observability combines those signals into explainability +- health checks keep unhealthy instances out of the traffic path + +Together, they answer the practical questions that matter in both interviews and production: + +- How does the system defend itself? +- How do we know it is failing? +- How do we find the bottleneck quickly? +- How do we stop a partial failure from becoming a full outage? +- How do we operate this architecture at real scale? + +If you can explain those connections clearly, you are already thinking like a backend engineer rather than someone memorizing system design buzzwords. diff --git a/systems design/2.identityAccess.md b/systems design/2.identityAccess.md new file mode 100644 index 0000000..3d1a015 --- /dev/null +++ b/systems design/2.identityAccess.md @@ -0,0 +1,1808 @@ +# 2. Identity & Access + +Identity and access is the control plane for nearly every backend system. It answers four questions for every request: + +1. Who is calling? +2. How do we know they are really that caller? +3. What are they allowed to do right now? +4. How do we prove later that the decision was correct? 
+ +If you understand identity and access well, you can reason about login systems, sessions, JWTs, OAuth integrations, enterprise SSO, authorization policies, service-to-service security, and zero-trust architecture as one connected system rather than as isolated buzzwords. + +This guide is written for two goals at the same time: + +- interview preparation +- real-world backend and system design understanding + +The emphasis is practical. The goal is not to memorize definitions, but to understand why these systems exist, how they fail, and how production systems are actually built. + +--- + +## Table of Contents + +1. Why Identity & Access Exists +2. Core Concepts and Mental Model +3. Authentication Fundamentals +4. Login and Signup +5. Sessions +6. JWT and Token-Based Authentication +7. OAuth +8. SSO: SAML and OIDC +9. Password Reset +10. Authorization Fundamentals +11. RBAC +12. ABAC +13. Permissions and Access Control +14. Service-to-Service Authentication +15. How These Systems Fit Together +16. Real-World Patterns and Company Examples +17. Interview Discussion Guide +18. Common Mistakes and Best Practices + +--- + +## 1. Why Identity & Access Exists + +Most systems are multi-user, multi-device, multi-service, and increasingly multi-tenant. Without identity and access controls, the backend has no safe way to distinguish: + +- one user from another +- a user from an attacker +- an employee from a customer +- a production service from a compromised internal service +- a legitimate action from a replayed or forged request + +At small scale, identity and access looks like a login form plus a password check. 
At production scale, it becomes much bigger: + +- account creation and identity proofing +- credential storage and recovery +- MFA and risk detection +- sessions and token lifecycle management +- delegated access via OAuth +- enterprise federation via SSO +- role and policy evaluation +- service identity inside microservices +- auditing, revocation, key rotation, and incident response + +The reason interviews ask about identity and access so often is simple: it touches security, data modeling, distributed systems, product tradeoffs, and failure handling all at once. + +### The Core Tension + +Identity systems always balance three goals: + +| Goal | What it means | Why it is hard | +| --- | --- | --- | +| Security | Prevent impersonation and unauthorized access | Stronger security usually adds friction | +| Usability | Let real users sign in quickly and recover safely | Easier flows are often easier to abuse | +| Scalability | Support huge traffic, many services, and many tenants | Distributed state and revocation become harder | + +An excellent backend engineer treats identity not as a feature checkbox, but as a reliability and security subsystem. + +--- + +## 2. Core Concepts and Mental Model + +Before discussing flows, build the right mental model. + +### Important Terms + +| Term | Meaning | Practical intuition | +| --- | --- | --- | +| Identity | The subject being represented | A user, admin, device, service, or organization | +| Authentication (AuthN) | Verifying who the subject is | "Prove you are Alice" | +| Authorization (AuthZ) | Deciding what that subject may do | "Can Alice read invoice 123?" 
| +| Session | Server-recognized authenticated continuity over time | "This browser remains logged in" | +| Access token | Credential presented to APIs | Often short-lived | +| Refresh token | Credential used to obtain new access tokens | More sensitive than access tokens | +| Identity Provider (IdP) | System that authenticates identities | Google, Okta, Azure AD | +| Service Provider / Relying Party | App that trusts the IdP | Your SaaS product | +| Policy engine | Evaluates access rules | RBAC, ABAC, ReBAC, custom rules | +| Audit log | Immutable trail of security-relevant events | Needed for forensics and compliance | + +### One Request Through the System + +```mermaid +sequenceDiagram + actor User + participant Client + participant Edge as API Gateway / Edge + participant Auth as Auth Service + participant Policy as Policy Engine + participant App as Business Service + participant Data as Data Store + + User->>Client: Click "View invoice" + Client->>Edge: GET /invoices/123 + cookie/token + Edge->>Auth: Validate session/token + Auth-->>Edge: subject, tenant, auth strength, claims + Edge->>Policy: Can subject read invoice 123? 
+ Policy-->>Edge: allow/deny + reason + Edge->>App: Forward authenticated request + App->>Data: Load resource + Data-->>App: Resource data + App-->>Client: 200 OK or 403 Forbidden +``` + +This is the simplest correct mental model: + +- authentication establishes identity +- authorization evaluates permissions for the requested action +- business logic executes only after those checks +- the decision should be observable and auditable + +### A Production Identity Stack + +In a real system, identity and access usually spans these components: + +| Component | Typical responsibility | +| --- | --- | +| Auth service | Login, signup, password verification, MFA, token issuance | +| User directory | Users, credentials metadata, verification state, tenant membership | +| Session store | Server-side sessions and revocation state | +| Token service | Access token and refresh token lifecycle | +| Policy engine | Role/attribute-based access decisions | +| Key management | Signing keys, encryption keys, secret rotation | +| Audit pipeline | Security events, admin actions, login failures, policy decisions | +| Risk engine | Rate limits, device reputation, fraud checks, anomaly detection | + +Interview shortcut: if you can clearly separate authentication, session/token management, and authorization, you already sound more senior than candidates who collapse them into one vague "auth layer". + +--- + +## 3. Authentication Fundamentals + +Authentication is the process of verifying identity claims. The claim is usually, "I am user X" or "I am service Y". + +### 3.1 Identity Verification Basics + +Authentication depends on evidence. 
The most common categories are: + +| Factor | Example | Strengths | Weaknesses | +| --- | --- | --- | --- | +| Something you know | Password, PIN | Familiar, cheap | Can be guessed, phished, reused | +| Something you have | Phone, authenticator app, hardware key | Stronger than passwords alone | Device loss, recovery complexity | +| Something you are | Fingerprint, Face ID | Convenient on-device UX | Biometric recovery and privacy concerns | + +Important nuance: many systems do not verify a human's real-world identity. They verify control over a credential. For example: + +- password login verifies knowledge of a password +- email verification verifies access to an inbox +- TOTP verifies possession of a seed-bound authenticator +- passkeys verify possession of a private key and user presence + +That is why identity systems often talk about assurance levels rather than absolute truth. + +### 3.2 Identifiers vs Authenticators + +Two concepts are often mixed up: + +- an identifier tells the system which subject is being referenced +- an authenticator proves control over that identity + +Examples: + +- `alice@example.com` is an identifier +- the password, passkey, or OAuth login is the authenticator + +Production systems often support multiple identifiers for the same user: + +- email +- username +- phone number +- enterprise SSO subject ID +- internal immutable user ID + +Best practice: use a stable internal user ID as the true primary key, even if the login identifier changes. + +### 3.3 Credential Storage + +This is one of the most common interview topics because it separates surface-level knowledge from real engineering understanding. + +#### Never store plaintext passwords + +If a database leak reveals plaintext passwords, the incident is catastrophic. Attackers will also try the same passwords on other services because users reuse credentials. + +#### Store password hashes, not passwords + +The flow is: + +1. User submits password. +2. 
Server generates a per-user salt. +3. Server applies a slow password hashing algorithm. +4. Server stores the resulting hash and metadata. +5. On login, the server recomputes and compares. + +Good password hashing algorithms are intentionally expensive. That is the point. They make offline brute force attacks slower. + +| Algorithm | Typical status | Why it matters | +| --- | --- | --- | +| Argon2id | Best modern default | Memory-hard and resistant to GPU attacks | +| bcrypt | Still common and acceptable | Widely supported, battle-tested | +| PBKDF2 | Common in legacy and regulated systems | Safer than fast hashes, but less ideal than Argon2id | +| SHA-256 / MD5 alone | Unsafe for password storage | Too fast, easy to brute force | + +#### Salt and Pepper + +| Mechanism | Purpose | +| --- | --- | +| Salt | Unique random value per password; prevents rainbow-table reuse | +| Pepper | Extra secret held outside the user table, often in KMS/HSM; raises attack cost after DB leaks | + +#### Practical Storage Pattern + +- store algorithm name and parameters with the hash +- use constant-time comparison to reduce timing leakage +- rehash on login when old parameters are outdated +- keep password policy reasonable; massive composition rules often lead to weaker behavior + +#### Interview depth point + +If an interviewer asks, "Why use bcrypt or Argon2 instead of SHA-256?", the real answer is not just "because it is more secure". The real answer is: + +- password databases are often attacked offline after leaks +- attackers can run billions of SHA-256 hashes quickly +- slow, memory-hard algorithms make each guess expensive +- cost parameters can be tuned as hardware improves + +### 3.4 MFA Basics + +Multi-factor authentication exists because passwords are a weak single point of failure. 
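TOTP is one widely deployed second factor, and its core is small enough to sketch. The block below is a minimal illustration using only the Python standard library, assuming HMAC-SHA1, a 30-second step, and a verification window of one step on either side for clock drift. The function names (`hotp`, `totp`, `verify_totp`) are hypothetical helpers, not the API of any particular library.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP: HMAC-SHA1 over an 8-byte big-endian counter, then truncation."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation offset from the last nibble
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """TOTP: HOTP where the counter is the number of elapsed time steps."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify_totp(secret: bytes, submitted: str, at: Optional[float] = None,
                step: int = 30, digits: int = 6, window: int = 1) -> bool:
    """Accept codes from the current step and `window` steps on either side."""
    now = time.time() if at is None else at
    counter = int(now // step)
    for drift in range(-window, window + 1):
        expected = hotp(secret, counter + drift, digits)
        # Constant-time comparison; encode so arbitrary user input is accepted.
        if hmac.compare_digest(expected.encode(), submitted.encode()):
            return True
    return False
```

A real deployment would also rate-limit attempts and reject a code that has already been accepted for the same time step. The sketch omits both.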
+ +Common MFA methods: + +| Method | Security level | Practical notes | +| --- | --- | --- | +| SMS OTP | Low to medium | Vulnerable to SIM swap and phishing | +| Email OTP | Low | Better than nothing, but email is often the same recovery channel | +| TOTP app | Medium | Common and cheap; still phishable | +| Push approval | Medium | Good UX, but push fatigue attacks exist | +| WebAuthn / passkeys / hardware keys | High | Strong phishing resistance | + +Production systems often use risk-based MFA rather than always prompting: + +- new device +- new geography +- impossible travel +- admin action +- payout or billing change +- password reset or recovery event + +This is called step-up authentication. + +#### Recovery Matters + +Many teams design MFA setup but forget MFA recovery. Good systems provide: + +- recovery codes +- alternate authenticators +- carefully controlled support workflows + +The recovery flow is often more attackable than the MFA flow itself. + +### 3.5 Email Verification + +Email verification usually proves inbox control, not human identity. It exists to: + +- reduce fake or mistyped accounts +- ensure password reset reachability +- protect downstream systems from garbage identities +- support trust in notifications, billing, and invites + +Good implementation details: + +- generate a random, single-use token +- store only a hash of the token server-side if possible +- apply a short TTL +- invalidate older outstanding verification tokens after a new one is issued +- avoid leaking whether the account exists during resend flows + +### 3.6 Device Trust + +Device trust tries to answer, "Is this a previously seen, low-risk device?" + +Typical signals: + +- long-lived device cookie +- browser fingerprinting or device metadata +- last successful MFA on that device +- IP reputation and ASN patterns +- OS or app attestation on mobile + +Device trust is useful, but dangerous if over-trusted. Devices can be compromised. Cookies can be stolen. Browsers change.
Treat device trust as a risk signal, not a source of truth. + +### Authentication Failure Cases + +- weak password hashing leads to offline cracking after DB leaks +- email verification links are reusable or never expire +- MFA recovery bypasses stronger checks +- account enumeration leaks whether an email exists +- social login accounts are linked incorrectly to existing local accounts +- device trust becomes an authorization shortcut instead of a risk signal + +### Authentication Best Practices + +- prefer Argon2id or bcrypt for passwords +- rate-limit login, signup, reset, and verification endpoints +- use MFA for privileged users and step-up auth for sensitive actions +- log auth events with context, but never log secrets or raw passwords +- design credential rotation and recovery before launch, not after an incident + +--- + +## 4. Login and Signup + +Signup and login flows are the public entry points to your system. They are also some of the most attacked endpoints you will ever run. + +### 4.1 Signup Flow + +```mermaid +sequenceDiagram + actor User + participant Browser + participant Auth as Auth API + participant Risk as Risk / Abuse Service + participant Users as User DB + participant Mail as Email Service + participant Session as Session Store + + User->>Browser: Submit email + password + Browser->>Auth: POST /signup + Auth->>Risk: Check IP, velocity, disposable email, device + Risk-->>Auth: risk score / allow / challenge + Auth->>Users: Create pending account + password hash + Auth->>Mail: Send verification link + Mail-->>User: Verification email + User->>Browser: Click link + Browser->>Auth: GET /verify?token=... + Auth->>Users: Mark email verified + Auth->>Session: Create session + Auth-->>Browser: Set secure auth cookie +``` + +#### What actually happens in production + +A robust signup flow usually includes: + +1. Input normalization + Normalize email casing rules carefully, trim whitespace, reject obvious malformed values. +2. 
Abuse screening + IP reputation, rate limits, disposable email detection, CAPTCHA when needed, device velocity, and signup bursts by network. +3. Account creation state + Many systems create users in a `pending_verification` state first. +4. Email verification + The account may exist but have limited capabilities until verified. +5. Bootstrap domain objects + For SaaS, create workspace, tenant, default role, billing state, and onboarding tasks. +6. Initial session issuance + Some systems log the user in immediately after verification. Others require explicit login. + +#### Why pending state matters + +If you create fully active accounts before verification, you may end up with: + +- abandoned fake tenants +- spammed invites or API abuse +- polluted analytics and billing pipelines + +### 4.2 Login Flow + +The login flow is simpler than signup conceptually, but much more operationally sensitive. + +Common steps: + +1. Identify account by email/username/federated ID. +2. Fetch credential metadata and account status. +3. Verify password or federated assertion. +4. Evaluate account risk and MFA policy. +5. Create session or issue tokens. +6. Log success or failure for audit and anomaly detection. + +A production login decision often depends on more than a password: + +- account locked or disabled? +- tenant suspended? +- email verified? +- MFA enrolled? +- device known? +- unusual geography? +- refresh token family compromised? + +### 4.3 Signup Verification and Fraud Prevention Basics + +Fraud prevention is not just a payments problem. 
Identity systems are abused for: + +- spam account creation +- credential stuffing +- promo abuse +- referral fraud +- fake trial creation +- scraping and automated signups + +Basic but effective controls: + +| Control | What it helps with | +| --- | --- | +| Rate limiting by IP and identifier | Brute force and signup bursts | +| Device and IP reputation | Known bad networks and bots | +| CAPTCHA or challenge step-up | Automated abuse at suspicious thresholds | +| Email domain heuristics | Disposable inboxes, typo domains | +| Phone verification for high-risk cases | Raises attacker cost | +| Idempotency keys on signup APIs | Retry safety without duplicate accounts | + +Interview point: fraud controls are part of auth architecture because attackers do not politely separate "security" from "growth" endpoints. + +### 4.4 Social Login Considerations + +"Login with Google" or "Login with GitHub" improves user experience, but introduces federation complexity. + +Benefits: + +- no local password to manage +- faster onboarding +- higher conversion for some user segments + +Risks and edge cases: + +- provider outage affects sign-in +- incorrect account linking can cause account takeover +- email from provider may be unverified or not globally unique in the way you assume +- enterprise customers may not want personal social identities linked to business workspaces + +Best practice for account linking: + +- if a social identity is new, do not blindly attach it to a local account just because the email matches +- require proof of control or signed-in confirmation before linking to an existing account + +### 4.5 Onboarding Architecture + +Signup is not just about auth. It often triggers business setup: + +- create personal or team workspace +- assign owner role +- seed settings and notification preferences +- create billing customer object +- publish analytics and onboarding events + +This makes signup a distributed workflow. 
Real systems often handle it with: + +- synchronous creation for the minimum needed to log in +- async events for non-critical setup +- idempotent consumers to avoid duplicate workspaces or billing objects + +### Login and Signup Failure Cases + +- verification emails delayed or blocked, leaving users in limbo +- duplicate accounts created because signup is not idempotent +- support team manually verifies accounts in insecure ways +- social and password accounts merge incorrectly +- signup path leaks which emails already exist + +### Login and Signup Best Practices + +- keep the critical path small and reliable +- separate abuse checks from core credential logic, but make them part of the final decision +- use generic error messages externally and detailed audit logs internally +- make signup and login events observable with metrics and tracing + +--- + +## 5. Sessions + +Sessions are the classic way to keep users logged in across multiple HTTP requests. + +### 5.1 What a Session Really Is + +A session means the server has already authenticated the user and stores an authenticated state keyed by a session identifier. + +Typical flow: + +1. User logs in successfully. +2. Server creates a session record. +3. Server sends the client a session ID in a cookie. +4. Client sends the cookie on future requests. +5. Server looks up session state and reconstructs identity. + +### 5.2 Server-Side Sessions + +In server-side session architecture, the browser usually only stores an opaque identifier. 
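The flow from 5.1 and the opaque-identifier idea can be sketched with an in-memory store. This is an illustration only: `InMemorySessionStore` is a hypothetical stand-in for a shared backend such as Redis, and the 30-minute sliding idle timeout is an assumed policy.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Session:
    user_id: str
    tenant_id: str
    created_at: float
    last_seen_at: float


class InMemorySessionStore:
    """Hypothetical stand-in for a shared session store such as Redis."""

    def __init__(self, idle_ttl_seconds: float = 1800.0) -> None:
        self._idle_ttl = idle_ttl_seconds
        self._sessions: Dict[str, Session] = {}

    def create(self, user_id: str, tenant_id: str,
               now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        # The ID is random and carries no meaning; all state stays server-side.
        session_id = secrets.token_urlsafe(32)
        self._sessions[session_id] = Session(user_id, tenant_id, now, now)
        return session_id

    def lookup(self, session_id: str,
               now: Optional[float] = None) -> Optional[Session]:
        now = time.time() if now is None else now
        session = self._sessions.get(session_id)
        if session is None:
            return None
        if now - session.last_seen_at > self._idle_ttl:
            del self._sessions[session_id]  # idle timeout reached
            return None
        session.last_seen_at = now  # sliding expiration
        return session

    def delete(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)
```

The client only ever sees the value returned by `create`. A stolen ID is still a usable credential, so its transport and storage matter as much as its randomness.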
+ +Example session data: + +- user ID +- tenant ID +- auth strength or MFA state +- issued time and last activity time +- device metadata +- CSRF-related state + +Advantages: + +- easy revocation +- easy logout across devices +- server fully controls state +- easy to add security flags or session versioning + +Disadvantages: + +- needs a session store lookup +- requires shared state across app instances +- harder to scale if poorly designed + +### 5.3 Redis-Backed Sessions + +Redis is a very common session backend because it is fast, supports TTL, and works well as shared ephemeral state. + +```mermaid +flowchart LR + Browser[Browser with secure cookie] --> LB[Load Balancer] + LB --> App1[App Instance A] + LB --> App2[App Instance B] + App1 --> Redis[(Redis Session Store)] + App2 --> Redis + Redis --> Audit[Audit / Security Events] +``` + +Why Redis is popular for sessions: + +- low-latency reads and writes +- TTL expiration built in +- simple key-value model +- easy fit for horizontally scaled app fleets + +Scaling considerations: + +- shard or cluster if session volume is high +- replicate carefully; understand failover and session loss behavior +- monitor hot keys and uneven access patterns +- decide whether to refresh TTL on every request or on a sliding window + +### 5.4 Cookie Security + +Session security depends heavily on cookie configuration. 
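As a concrete starting point, here is a sketch of a hardened session cookie header built with Python's standard `http.cookies` module. The cookie name `session_id`, the one-hour lifetime, and the `SameSite=Lax` choice are assumptions for illustration. The `samesite` attribute requires Python 3.8 or later.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id: str, max_age_seconds: int = 3600) -> str:
    """Return the value of a Set-Cookie header for an opaque session ID."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    morsel = cookie["session_id"]
    morsel["httponly"] = True   # not readable from JavaScript
    morsel["secure"] = True     # only sent over HTTPS
    morsel["samesite"] = "Lax"  # limits cross-site sending
    morsel["path"] = "/"
    morsel["max-age"] = max_age_seconds
    return morsel.OutputString()
```

No `Domain` attribute is set here, which keeps the cookie scoped to the host that issued it.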
+ +| Cookie attribute | Why it matters | +| --- | --- | +| `HttpOnly` | Prevents JavaScript from reading the cookie, reducing XSS impact | +| `Secure` | Sends cookie only over HTTPS | +| `SameSite=Lax/Strict` | Reduces CSRF risk from cross-site requests | +| Domain scoping | Prevents unintended subdomain sharing | +| Path scoping | Limits where the cookie is sent | +| Expiry / Max-Age | Controls session persistence | + +Important nuance: + +- `HttpOnly` helps against token theft by frontend JavaScript +- `SameSite` helps against CSRF +- neither one fixes everything if the app has deeper logic flaws + +### 5.5 Session Invalidation + +Session invalidation is one reason server-side sessions remain attractive. + +You can revoke sessions when: + +- user logs out +- password changes +- MFA is reset +- admin disables the account +- suspicious activity is detected + +Common implementation patterns: + +- delete the session record outright +- mark session version or user auth version and reject old versions +- keep a device/session list per user for device management UI + +### 5.6 Logout Challenges + +Logout sounds trivial, but it is easy to implement incompletely. + +Problems include: + +- logout only clears client cookie but leaves server session valid +- user has multiple active devices and expects global logout +- session persists in mobile apps with long polling or background refresh +- cached pages or in-flight requests still complete after logout + +Good logout design answers: + +- single device logout or all devices? +- immediate revocation or eventual consistency? +- what about concurrent refresh operations? 
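The two revocation patterns from 5.5 (deleting a session record and bumping a per-user auth version) map directly onto single-device and all-device logout. The sketch below assumes an in-memory registry; `SessionRegistry` and its method names are hypothetical, and a real system would keep both maps in a shared store.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SessionRecord:
    user_id: str
    auth_version: int  # user's auth version when the session was created


class SessionRegistry:
    """Hypothetical registry combining per-session and per-user revocation."""

    def __init__(self) -> None:
        self._sessions: Dict[str, SessionRecord] = {}
        self._auth_versions: Dict[str, int] = {}

    def start(self, session_id: str, user_id: str) -> None:
        version = self._auth_versions.setdefault(user_id, 1)
        self._sessions[session_id] = SessionRecord(user_id, version)

    def is_valid(self, session_id: str) -> bool:
        record = self._sessions.get(session_id)
        if record is None:
            return False
        # Sessions created before the latest version bump are rejected.
        return record.auth_version == self._auth_versions.get(record.user_id)

    def logout(self, session_id: str) -> None:
        """Single-device logout: remove the server-side record."""
        self._sessions.pop(session_id, None)

    def logout_everywhere(self, user_id: str) -> None:
        """Global logout, also usable after a password change or MFA reset."""
        self._auth_versions[user_id] = self._auth_versions.get(user_id, 1) + 1
```

The version bump is a single write no matter how many devices are signed in. In exchange, every request must read the user's current auth version.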
+ +### 5.7 Session Security Issues + +| Problem | Meaning | Mitigation | +| --- | --- | --- | +| Session fixation | Attacker forces victim to use known session ID | Regenerate session ID after login | +| CSRF | Browser auto-sends cookies on forged cross-site requests | `SameSite`, CSRF tokens, origin checks | +| Session hijacking | Session token is stolen | HTTPS, `HttpOnly`, device/risk checks, short idle timeouts | +| Store outage | Session backend unavailable | Fallback behavior, multi-AZ design, graceful degradation | + +### Sessions in Interviews + +A good interview answer on sessions usually includes: + +- opaque session ID in secure cookie +- shared store like Redis +- session regeneration after login +- revocation and logout semantics +- CSRF protections +- sliding vs absolute expiration tradeoff + +--- + +## 6. JWT and Token-Based Authentication + +JWTs are one of the most discussed and most misunderstood identity topics. + +### 6.1 What a JWT Is + +JWT stands for JSON Web Token. It is a compact, self-contained token format commonly used to carry claims. + +A JWT typically has three parts: + +`header.payload.signature` + +- header: algorithm and metadata +- payload: claims such as subject, issuer, audience, expiry +- signature: proves integrity if signed correctly + +Important practical truth: signed JWTs are not secret by default. They are encoded, not hidden. Anyone holding the token can often read the claims. + +### 6.2 Signing vs Encryption + +| Mechanism | What it guarantees | Practical meaning | +| --- | --- | --- | +| Signing (JWS) | Integrity and authenticity | Token was issued by trusted signer and not modified | +| Encryption (JWE) | Confidentiality | Token contents are hidden from intermediaries/clients | + +Most production JWT usage is signed, not encrypted. 
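To make "signed, not secret" concrete, the sketch below builds and verifies an HS256 token with only the standard library and then reads the claims without any key. It is an illustration under stated assumptions: the helper names are hypothetical, only HS256 is handled, and checks for `exp`, `iss`, and `aud` are deliberately left out. Production code would normally rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, key: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    parts = [
        b64url(json.dumps(header, separators=(",", ":")).encode("utf-8")),
        b64url(json.dumps(claims, separators=(",", ":")).encode("utf-8")),
    ]
    signing_input = ".".join(parts)
    mac = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + b64url(mac)


def verify_hs256(token: str, key: bytes) -> Optional[dict]:
    """Return the claims if the signature is valid, otherwise None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        # Pin the algorithm instead of trusting whatever the header says.
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode("ascii"),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
            return None
        return json.loads(b64url_decode(payload_b64))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None


def read_claims_without_key(token: str) -> dict:
    """Anyone holding the token can do this; the payload is only encoded."""
    return json.loads(b64url_decode(token.split(".")[1]))
```

`read_claims_without_key` needs no secret at all, which is exactly why the payload should never carry anything confidential.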
+ +That means: + +- do not put secrets in JWT payloads +- do not put more PII than necessary +- use claims for identity and authorization hints, not as a dumping ground + +### 6.3 Access Tokens vs Refresh Tokens + +| Token type | Lifetime | Used by | Main purpose | +| --- | --- | --- | --- | +| Access token | Short-lived | APIs | Authorize a request | +| Refresh token | Longer-lived | Auth client / backend | Obtain new access tokens | + +Best practice: + +- keep access tokens short-lived +- treat refresh tokens as highly sensitive credentials +- store refresh tokens more carefully than access tokens + +### 6.4 Why Teams Use JWTs + +Benefits: + +- easy for distributed services to verify locally +- no session store lookup on every request if verification is local +- good fit for API ecosystems and delegated access +- works well across domains and service boundaries + +Costs: + +- revocation is harder +- permissions embedded in tokens can become stale +- key rotation and issuer validation must be done correctly +- token size can grow dangerously if you stuff too many claims inside + +### 6.5 Token Rotation + +Refresh token rotation is a major real-world security mechanism. + +Idea: + +- every refresh use invalidates the previous refresh token +- the auth server issues a new refresh token and new access token +- if an old refresh token is reused, the server assumes theft and can revoke the token family + +```mermaid +sequenceDiagram + actor User + participant Client + participant Auth as Auth Server + participant Store as Token Store + + User->>Client: Continue using app + Client->>Auth: POST /token/refresh with refresh token + Auth->>Store: Validate token family and prior use + Store-->>Auth: valid / reused / revoked + Auth-->>Client: New access token + new refresh token + Auth->>Store: Mark old token used, persist new token state +``` + +### 6.6 Revocation Challenges + +Revocation is the biggest practical downside of stateless tokens. 
+ +If an access token is self-contained and valid until `exp`, then after it is issued: + +- the user may be disabled +- permissions may change +- a tenant may be suspended +- the token may be stolen + +But the token may still verify cryptographically. + +Mitigations: + +- short access token TTLs +- refresh token rotation +- revocation list or denylist for critical cases +- user/session version claim checked against server state +- opaque tokens with introspection for high-control environments + +### 6.7 Stateless Auth Tradeoffs + +This is a favorite interview question: "Should I use JWT or sessions?" + +The mature answer is not dogmatic. It depends. + +| Topic | Server-side sessions | JWT | +| --- | --- | --- | +| Request-time state lookup | Usually yes | Not always | +| Easy revocation | Yes | Harder | +| Cross-service portability | Moderate | Strong | +| Simplicity for web apps | Often simpler | Often overused | +| Risk of stale claims | Lower | Higher | +| CSRF concern if cookie-based | Yes | Yes if stored in cookies | +| XSS risk if JS-accessible storage | Lower with `HttpOnly` cookies | Higher if stored in localStorage | + +A practical rule: + +- for traditional web apps, server-side sessions are often simpler and safer +- for API ecosystems, third-party integrations, and distributed service verification, tokens are often the better fit + +### 6.8 Common JWT Mistakes + +- storing JWTs in `localStorage` without carefully thinking through XSS risk +- placing roles and permissions in long-lived tokens and forgetting they go stale +- not checking `iss`, `aud`, `exp`, `nbf`, and key identifiers properly +- using symmetric signing keys everywhere and spreading them across many services +- putting secrets or excessive PII in token payloads + +### JWT Best Practices + +- prefer asymmetric signing for shared verification environments +- expose public keys via a JWKS endpoint if multiple verifiers exist +- keep access tokens short-lived +- rotate signing keys safely 
and support key overlap during rotation +- use opaque tokens or introspection if real-time revocation is a hard requirement + +--- + +## 7. OAuth + +OAuth solves delegated authorization. It lets one application access another application's resources on behalf of a user without receiving the user's password. + +### 7.1 The Problem OAuth Solves + +Without OAuth, a user might give App A their password to App B. That is unacceptable because: + +- App A can now do anything the user can do +- App B cannot scope access cleanly +- the user cannot revoke just that delegated access safely + +OAuth introduces a safer model: + +- user authenticates with the authorization server / IdP +- user consents to limited scopes +- client receives tokens with bounded permissions + +### 7.2 Authorization Code Flow with PKCE + +This is the modern default for browser and mobile-friendly public clients. + +```mermaid +sequenceDiagram + actor User + participant Client as SaaS App + participant Browser + participant AS as Authorization Server / IdP + participant API as Third-Party API + + User->>Client: Click "Connect Google Drive" + Client->>Browser: Redirect to /authorize + scope + code_challenge + Browser->>AS: Login and grant consent + AS-->>Browser: Redirect back with authorization code + Browser->>Client: Deliver authorization code + Client->>AS: Exchange code + code_verifier + AS-->>Client: Access token (+ refresh token) + Client->>API: Call API with access token + API-->>Client: Protected resource data +``` + +#### Why PKCE exists + +PKCE protects the code exchange step so a stolen authorization code is less useful. It is critical for public clients such as SPAs and mobile apps. + +### 7.3 Scopes + +Scopes define the breadth of access. Examples: + +- `read:user` +- `repo:write` +- `payments:refunds` +- `calendar.readonly` + +Good scope design is product design plus security design. 
+ +If scopes are too broad: + +- users lose trust +- integrations become over-privileged +- incident blast radius increases + +If scopes are too granular: + +- consent screens become confusing +- implementation complexity rises +- developers ask for full access anyway + +### 7.4 Consent Screens + +Consent is the user-visible manifestation of delegated access. + +Good consent screens answer: + +- who is requesting access? +- to which data or actions? +- for how long? +- can the user revoke later? + +This matters a lot in SaaS ecosystems like Google Workspace, GitHub Apps, or Slack apps. + +### 7.5 Refresh Tokens in OAuth + +Long-running integrations often need refresh tokens so they can keep calling APIs without asking the user to re-consent constantly. + +Refresh token concerns: + +- high-value credential theft risk +- need for rotation and revocation +- tenant admins may want centralized revocation controls + +### 7.6 Third-Party Integrations + +In real SaaS systems, OAuth is often used for: + +- connecting Google Drive, GitHub, Slack, Salesforce, Stripe, or Dropbox +- importing or exporting data +- posting to external systems on behalf of the user or workspace + +Architectural consequences: + +- store provider account linkage metadata +- encrypt or otherwise protect provider refresh tokens +- model scopes per installation or workspace +- surface admin controls for revocation and reauthorization + +### 7.7 OAuth vs Authentication + +OAuth is about authorization. Authentication is not the original purpose of OAuth. + +However, many products use OAuth plus an identity layer such as OpenID Connect to support "Sign in with Google". + +Interview nuance: saying "OAuth is login" is incomplete. 
Better answer: + +- OAuth is delegated authorization +- OIDC adds identity information for authentication use cases + +### OAuth Failure Cases + +- client stores provider tokens insecurely +- redirect URI validation is weak +- state parameter not used correctly, enabling CSRF-like attacks in auth flows +- scopes are excessively broad +- tenants cannot audit or revoke third-party access easily + +--- + +## 8. SSO: SAML and OIDC + +Enterprise customers often do not want each SaaS app to manage a separate corporate password. They want central identity, central policy, and controlled employee access. That is where SSO comes in. + +### 8.1 Identity Provider vs Service Provider + +| Role | Meaning | +| --- | --- | +| Identity Provider (IdP) | System that authenticates the employee, such as Okta, Azure AD, Google Workspace | +| Service Provider (SP) / Relying Party (RP) | The SaaS application that trusts the IdP | + +### 8.2 SAML Basics + +SAML is older, XML-based, and still heavily used in enterprise environments. + +Mental model: + +- user tries to access the SaaS app +- SaaS redirects user to corporate IdP +- IdP authenticates user +- IdP sends signed assertion back to SaaS +- SaaS creates a local session + +Strengths: + +- entrenched in enterprise IT +- widely supported by corporate identity systems + +Costs: + +- XML complexity +- harder developer ergonomics +- trickier debugging and implementation compared with OIDC + +### 8.3 OIDC Basics + +OpenID Connect is an identity layer on top of OAuth 2.0. + +It provides: + +- ID tokens with identity claims +- standardized login flows +- better fit for modern web and mobile apps + +OIDC is usually easier to work with than SAML for modern applications. 
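An ID token is a JWT, so a relying party still has to check its claims after the signature has been verified. The sketch below covers only that second step and assumes signature verification has already happened elsewhere. The function name, the 60-second clock-skew leeway, and the error strings are illustrative assumptions.

```python
import time
from typing import List, Optional


def validate_id_token_claims(claims: dict, expected_issuer: str,
                             expected_audience: str,
                             now: Optional[float] = None,
                             leeway_seconds: float = 60.0) -> List[str]:
    """Return a list of problems; an empty list means the claims look valid."""
    now = time.time() if now is None else now
    problems: List[str] = []

    if claims.get("iss") != expected_issuer:
        problems.append("unexpected issuer")

    # "aud" may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if expected_audience not in audiences:
        problems.append("audience mismatch")

    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now >= exp + leeway_seconds:
        problems.append("expired or missing exp")

    nbf = claims.get("nbf")
    if isinstance(nbf, (int, float)) and now + leeway_seconds < nbf:
        problems.append("not yet valid")

    if not claims.get("sub"):
        problems.append("missing subject")

    return problems
```

Returning reasons rather than a bare boolean makes failed logins easier to audit. The caller should still show the end user a generic error.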
+ +### 8.4 SAML vs OIDC + +| Topic | SAML | OIDC | +| --- | --- | --- | +| Typical format | XML assertions | JSON tokens | +| Common use case | Enterprise browser SSO | Modern app login and API ecosystems | +| Developer ergonomics | Heavier | Easier | +| Mobile/API friendliness | Weaker | Stronger | + +### 8.5 Enterprise Architecture + +```mermaid +flowchart LR + Employee[Employee] --> SaaS[Your SaaS App] + SaaS --> IdP[Enterprise IdP] + IdP --> SaaS + IdP --> Directory[Corporate Directory] + IdP --> SCIM[Provisioning / SCIM] + SCIM --> SaaS + SaaS --> Policy[Workspace Roles and Policies] +``` + +In production, enterprise identity usually includes two separate concerns: + +- authentication and SSO +- lifecycle management and provisioning + +Provisioning is often handled with SCIM or similar directory sync mechanisms so the SaaS app knows: + +- who exists +- which groups they belong to +- who has been deprovisioned + +### 8.6 Common Enterprise Requirements + +- just-in-time user creation on first login +- domain verification to prove company ownership +- group-to-role mapping +- forced MFA at IdP level +- admin-controlled session duration +- audit logs for all SSO events + +### 8.7 Failure Cases + +- bad mapping from IdP groups to app roles causes privilege escalation +- employee is disabled in IdP but app keeps old sessions alive too long +- email is used as unique identity key and later changes +- multiple IdPs or merged companies create ambiguous identity mapping + +### SSO Best Practices + +- use stable external subject identifiers, not just email +- model tenant-specific SSO config cleanly +- separate authentication trust from authorization mapping inside the app +- deprovision aggressively and revoke old sessions when identity status changes + +--- + +## 9. Password Reset + +Password reset is a high-risk recovery flow. Attackers love it because it often bypasses normal login defenses. 
+ +### 9.1 Secure Token Flow + +```mermaid +sequenceDiagram + actor User + participant App + participant Auth as Auth Service + participant Users as User DB + participant Reset as Reset Token Store + participant Mail as Email Service + + User->>App: Click "Forgot password" + App->>Auth: POST /password-reset + Auth->>Users: Lookup account + Auth->>Reset: Store hashed single-use token + expiry + Auth->>Mail: Send password reset link + Auth-->>User: Generic success response + User->>App: Open reset link + App->>Auth: POST /password-reset/confirm token + new password + Auth->>Reset: Validate token unused and unexpired + Auth->>Users: Update password hash + Auth->>Reset: Mark token used + Auth-->>App: Success + revoke other sessions +``` + +### 9.2 Why This Design Exists + +Password reset has to be secure even if the attacker knows the user's email address. Therefore the reset token must be: + +- hard to guess +- short-lived +- single-use +- revocable + +Good systems also revoke active sessions or require re-authentication after password reset. 
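The token properties above (hard to guess, short-lived, single-use, revocable) can be sketched with an in-memory store. `ResetTokenStore` is a hypothetical helper, and the 30-minute TTL and the rule that a new request invalidates older tokens for the same user are assumptions for illustration.

```python
import hashlib
import secrets
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ResetRecord:
    user_id: str
    expires_at: float


class ResetTokenStore:
    """Hypothetical store that keeps only hashes of reset tokens."""

    def __init__(self, ttl_seconds: float = 1800.0) -> None:
        self._ttl = ttl_seconds
        self._records: Dict[str, ResetRecord] = {}  # token hash -> record

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, user_id: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        # Revoke any outstanding tokens for this user (assumed policy).
        self._records = {h: r for h, r in self._records.items()
                         if r.user_id != user_id}
        token = secrets.token_urlsafe(32)  # sent by email, never stored raw
        self._records[self._hash(token)] = ResetRecord(user_id, now + self._ttl)
        return token

    def redeem(self, token: str, now: Optional[float] = None) -> Optional[str]:
        """Return the user ID once; any later attempt with the token fails."""
        now = time.time() if now is None else now
        record = self._records.pop(self._hash(token), None)  # single use
        if record is None or now >= record.expires_at:
            return None
        return record.user_id
```

A fast hash such as SHA-256 is acceptable here, unlike for passwords, because the token is long and random and so cannot be guessed offline the way a human-chosen password can.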
+ +### 9.3 Attack Prevention + +| Threat | Mitigation | +| --- | --- | +| Account enumeration | Return generic responses like "If an account exists, email sent" | +| Token guessing | Long random tokens, rate limits | +| Token replay | Single-use storage and invalidation | +| Email inbox compromise | Step-up verification for high-value actions after reset | +| Old session persistence | Revoke sessions after reset | + +### 9.4 Practical Advice + +- prefer opaque reset tokens over stuffing reset state into a long-lived JWT +- hash reset tokens at rest if you store them server-side +- keep TTL short, often 15 to 60 minutes depending on product sensitivity +- notify users when a reset is requested and completed + +### Password Reset Failure Cases + +- reset token is reusable +- old sessions remain active after password change +- reset endpoint leaks whether account exists +- support team bypasses the secure flow with weak manual procedures + +--- + +## 10. Authorization Fundamentals + +Authentication answers who the subject is. Authorization answers what that subject may do. + +### 10.1 AuthN vs AuthZ + +This distinction matters a lot. + +| Question | Category | +| --- | --- | +| "Who are you?" | Authentication | +| "Are you allowed to do this?" | Authorization | +| "How sure are we?" | Authentication strength / assurance | +| "Why was access denied?" | Authorization decision and audit | + +A user can be perfectly authenticated and still not be authorized. 
+ +### 10.2 Authorization Decision Shape + +Every authZ decision is some variation of: + +`Can subject S perform action A on resource R under context C?` + +Where context may include: + +- tenant +- time of day +- network zone +- device trust level +- MFA level +- resource ownership +- subscription plan +- legal region or data residency constraints + +### 10.3 Enforcement Layers + +Authorization can happen at multiple layers: + +| Layer | Good for | Risk if overused | +| --- | --- | --- | +| API gateway | Coarse access checks, authentication, token validation | Too coarse for resource-specific rules | +| Service layer | Business-specific rules | Easy to duplicate logic across services | +| Data access layer | Row/tenant isolation, final enforcement | Hard to express all product rules here | +| Database native policies | Strong last line of defense in some systems | App logic can still drift if not modeled carefully | + +A common mistake is doing all authorization only at the edge. Edge checks are useful, but most real product rules depend on resource-specific business logic deeper inside the system. + +### 10.4 Policy Design + +Good policy design balances three things: + +- expressiveness +- debuggability +- operational simplicity + +Ask these questions: + +- who is the subject? +- what resource is being accessed? +- what action is requested? +- what context matters? +- who can change the policy? +- how do we explain and audit the decision? + +### 10.5 Auditing and Explainability + +Authorization is not just about allow or deny. In production you often need: + +- reason codes +- which policy matched +- who granted the permission +- when the permission changed +- evidence for support, compliance, and incident response + +This is why mature systems treat authorization as both a runtime path and a data model. + +--- + +## 11. RBAC + +RBAC stands for role-based access control. Permissions are grouped into roles, and subjects are assigned roles. 
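A minimal sketch of that model: permissions attach to roles, and roles attach to users within a tenant. The role and permission names are illustrative assumptions, as is the choice to scope every assignment to a tenant.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Illustrative role-to-permission mapping (names are assumptions).
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "viewer": frozenset({"invoice.read"}),
    "editor": frozenset({"invoice.read", "invoice.edit"}),
    "billing_admin": frozenset({"invoice.read", "invoice.refund"}),
}


class RoleAssignments:
    """Hypothetical store of (user, tenant) -> roles."""

    def __init__(self) -> None:
        self._roles: Dict[Tuple[str, str], Set[str]] = {}

    def grant(self, user_id: str, tenant_id: str, role: str) -> None:
        if role not in ROLE_PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self._roles.setdefault((user_id, tenant_id), set()).add(role)

    def revoke(self, user_id: str, tenant_id: str, role: str) -> None:
        self._roles.get((user_id, tenant_id), set()).discard(role)

    def is_allowed(self, user_id: str, tenant_id: str, permission: str) -> bool:
        # Default deny: no assignment in this tenant means no access.
        roles = self._roles.get((user_id, tenant_id), set())
        return any(permission in ROLE_PERMISSIONS[role] for role in roles)
```

Keying assignments by `(user_id, tenant_id)` means that being an admin in one workspace grants nothing in another.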
+ +### 11.1 Why RBAC Exists + +Without roles, you would assign individual permissions to every user. That becomes unmanageable quickly. + +RBAC simplifies administration: + +- `viewer` +- `editor` +- `admin` +- `billing_admin` + +Instead of attaching dozens of permissions directly to users, you attach permissions to roles and roles to users. + +### 11.2 Basic Model + +| Entity | Example | +| --- | --- | +| Permission | `invoice.read`, `invoice.refund`, `workspace.invite` | +| Role | `support_agent`, `workspace_admin` | +| Assignment | User U has role R in tenant T | + +Tenant scoping is critical. In multi-tenant SaaS, a user is rarely just "an admin" globally. They are usually an admin in a specific workspace or organization. + +### 11.3 Enterprise Patterns + +Common enterprise RBAC patterns include: + +- global roles for platform staff +- tenant-scoped roles for customers +- custom roles for larger organizations +- group-to-role mapping from SSO IdP groups + +### 11.4 The Role Explosion Problem + +RBAC starts simple but can degrade into dozens or hundreds of roles: + +- `viewer` +- `viewer_plus_export` +- `viewer_plus_export_plus_billing` +- `regional_admin_eu` +- `regional_admin_us` + +This is role explosion. + +It happens when RBAC is forced to encode too many contextual conditions that really belong in attributes or policies. + +### 11.5 RBAC Tradeoffs + +| Strength | Weakness | +| --- | --- | +| Easy to explain to users and admins | Coarse-grained for complex cases | +| Efficient at runtime | Can explode in number of roles | +| Works well for common SaaS admin patterns | Poor fit for dynamic context-heavy rules | + +### RBAC Best Practices + +- keep the base role set small +- scope roles by tenant, project, or resource container +- separate platform/internal staff roles from customer roles +- use RBAC for broad permissions and combine with finer policies when needed + +--- + +## 12. ABAC + +ABAC stands for attribute-based access control. 
Instead of only asking "What role does this user have?", ABAC asks about attributes of the subject, resource, and environment. + +### 12.1 Why ABAC Exists + +RBAC is often too static for real-world decisions like: + +- support agent can view tickets only in their assigned region +- manager can approve expenses under a threshold for their own department +- user can access data only from a compliant device in an approved country +- payout release requires recent MFA and a risk score below the threshold + +These rules depend on context, not just role labels. + +### 12.2 Dynamic Policy Evaluation + +ABAC decisions may use attributes such as: + +- subject department +- resource owner +- tenant subscription tier +- request IP or network zone +- device trust score +- current time or shift window +- MFA strength + +Example policy idea: + +"Allow refund approval if the subject role is finance_manager, the order belongs to the same merchant account, the refund amount is below the subject limit, and MFA was performed in the last 10 minutes." + +### 12.3 Policy Engines + +ABAC often benefits from a dedicated policy engine because hardcoding many dynamic rules directly into services becomes brittle. + +Common approaches: + +- custom rules in application code +- centralized policy engine such as OPA/Rego +- cloud-style policy systems such as Cedar-like models +- relationship and graph-based systems for object access patterns + +### 12.4 ABAC Tradeoffs + +| Strength | Weakness | +| --- | --- | +| Expressive and context-aware | Harder to explain and debug | +| Reduces role explosion | Requires clean attribute sources | +| Good for fine-grained enterprise control | Runtime evaluation can be more expensive | + +### 12.5 Practical Use + +Many production systems do not choose "RBAC or ABAC". 
They combine them: + +- RBAC gives the broad lane +- ABAC applies contextual restrictions inside that lane + +Example: + +- role says user may edit invoices +- ABAC rule says only for their tenant, below approval threshold, and only after MFA for high-value invoices + +### ABAC Failure Cases + +- attributes are stale or inconsistently sourced across services +- policies become unreadable and impossible to reason about +- caching hides recent attribute changes like department moves or suspensions + +--- + +## 13. Permissions and Access Control + +Permissions are the actual capabilities a subject has. Access control is the mechanism that enforces those permissions correctly and consistently. + +### 13.1 Permission Models + +Common permission models include: + +| Model | Mental model | Example | +| --- | --- | --- | +| RBAC | Roles map to permissions | Workspace admin | +| ACL | Resource has a list of allowed subjects | Shared document editable by Alice and Bob | +| ABAC | Decision based on attributes | Region and MFA aware access | +| ReBAC | Decision based on relationships | User is member of team that owns repo | +| Capability/token-based | Possession of unforgeable capability grants access | Signed download URL | + +In modern systems, multiple models often coexist. + +GitHub is a good mental example: + +- org and team membership look like RBAC/ReBAC +- repo-specific collaborator lists look like ACLs +- fine product actions are individual permissions + +### 13.2 Inheritance + +Permissions often inherit down a hierarchy: + +- org -> workspace -> project -> resource +- folder -> document +- account -> sub-account + +Inheritance is useful, but easy to get wrong. + +Questions to design explicitly: + +- do child resources inherit all parent permissions? +- can child permissions override parent permissions? +- are denies supported, and if so do they take precedence? +- how do you compute effective permissions efficiently? 
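
One possible way to answer the inheritance questions above is sketched here in Python. It assumes a hypothetical single-parent hierarchy, assumes child resources inherit every allow from their ancestors, and assumes an explicit deny anywhere on the path takes precedence. Those are design choices made for this illustration, not the only valid ones, and all resource and subject names are made up:

```python
from typing import Dict, Optional, Set

# Hypothetical hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "org:acme": None,
    "workspace:eng": "org:acme",
    "project:api": "workspace:eng",
    "doc:design": "project:api",
}

# Grants and denies attached directly to a resource: resource -> {subject: actions}.
ALLOW: Dict[str, Dict[str, Set[str]]] = {
    "org:acme": {"alice": {"read"}},
    "project:api": {"alice": {"write"}, "bob": {"read"}},
}
DENY: Dict[str, Dict[str, Set[str]]] = {
    "doc:design": {"alice": {"write"}},
}


def effective_permissions(subject: str, resource: str) -> Set[str]:
    """Union of allows found on the resource and its ancestors, minus denies on that path."""
    allowed: Set[str] = set()
    denied: Set[str] = set()
    node: Optional[str] = resource
    while node is not None:
        allowed |= ALLOW.get(node, {}).get(subject, set())
        denied |= DENY.get(node, {}).get(subject, set())
        node = PARENT.get(node)
    return allowed - denied
```

Walking the parent chain on every check is simple but costs one lookup per level. Systems with deep hierarchies or hot resources would likely cache or precompute effective permissions, which brings back the staleness concerns that come with any authorization cache.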
+ +### 13.3 Auditing + +You need to know: + +- who granted access +- when access changed +- who accessed a resource +- why a decision was allowed or denied + +Auditing matters for: + +- customer support +- security investigations +- compliance +- admin trust + +### 13.4 Enforcement Patterns + +There are two common implementation patterns. + +#### Pattern A: Embedded authorization in each service + +Pros: + +- low latency +- business context close to the resource + +Cons: + +- duplicated rules across services +- inconsistent decisions and audit semantics + +#### Pattern B: Centralized authorization service or policy engine + +Pros: + +- consistency +- shared policy language +- central auditing and explainability + +Cons: + +- added network hop +- dependency on a central service +- need good caching and fallback behavior + +### 13.5 Centralized Auth Service and Policy Caching + +```mermaid +flowchart LR + Request[Authenticated Request] --> Service[Business Service] + Service --> Cache[Policy Cache] + Cache -->|cache miss| Authz[Central Authorization Service] + Authz --> PDP[Policy Engine] + Authz --> Attrs[Attribute / Relationship Data] + PDP --> Decision[Allow / Deny + Reason] + Decision --> Audit[Audit Log] + Decision --> Service +``` + +Caching is often necessary, but introduces staleness risks. Common techniques: + +- short TTL caches for policy decisions +- versioned policy snapshots +- event-driven invalidation on membership or role changes +- cache only stable intermediate data, not final decisions, in high-risk systems + +### 13.6 Policy Caching Tradeoffs + +| Benefit | Cost | +| --- | --- | +| Lower latency | Stale authorization decisions | +| Lower policy service load | Harder revocation semantics | +| Better resilience during partial outage | Risk of fail-open or fail-stale behavior | + +Design question: when policy service is down, do you fail closed or fail open? 
+ +- fail closed is safer but can hurt availability +- fail open preserves availability but may violate security + +For high-risk actions, fail closed is usually the right answer. + +### Access Control Best Practices + +- enforce tenant isolation early and repeatedly +- keep policy decisions explainable +- separate authentication claims from live authorization state when permissions change frequently +- audit all admin and permission-management actions +- do not trust internal network location as a permission model + +--- + +## 14. Service-to-Service Authentication + +User authentication is only half of production security. Modern backends also need to authenticate services to each other. + +### 14.1 Why Internal Service Authentication Exists + +In microservice systems, one request may pass through many services: + +- edge/API gateway +- auth service +- order service +- payment service +- notification service + +If internal calls are trusted just because they are "inside the VPC", a compromised service can impersonate others too easily. + +This is why zero-trust principles matter internally too. + +### 14.2 Service Identity + +A service needs its own identity, just like a user does. + +Examples: + +- `payments-service.prod` +- `orders-service.eu-west-1` +- workload identity bound to a Kubernetes service account + +A strong service identity system lets the platform answer: + +- which service is calling? +- is it the real deployed workload? +- is it allowed to call this destination? + +### 14.3 mTLS Basics + +Mutual TLS means both sides authenticate each other during the TLS handshake. 
+ +Benefits: + +- encryption in transit +- client and server authentication +- strong cryptographic service identity + +Typical pattern: + +- internal CA issues short-lived certificates to workloads +- service presents client cert on outbound call +- destination validates issuer and identity + +```mermaid +sequenceDiagram + participant A as Service A + participant CA as Internal CA / Identity System + participant B as Service B + + A->>CA: Request workload certificate + CA-->>A: Short-lived cert + A->>B: TLS handshake + client cert + B->>B: Verify cert, issuer, SAN, expiry + B-->>A: Authenticated secure channel + A->>B: Application request + B-->>A: Response +``` + +### 14.4 Short-Lived Credentials + +Short-lived credentials are a major production best practice. + +Why? + +- if stolen, they expire quickly +- less need for manual secret rotation +- better fit with workload identity and automation + +This pattern shows up in: + +- cloud IAM temporary credentials +- Kubernetes workload identity +- service mesh certificates +- internal token minting systems + +### 14.5 Zero Trust Basics + +Zero trust does not mean refusing to trust anything. It means: + +- do not grant access solely based on network location +- verify identity continuously and explicitly +- enforce least privilege +- assume compromise is possible and reduce blast radius + +Google's public BeyondCorp ideas are the canonical mental model here: access should depend on identity, device state, and policy, not on whether traffic comes from "inside the office network". + +### 14.6 Service-to-Service Authorization + +Authentication tells you that a caller is `payments-service`. 
Authorization must still decide whether `payments-service` may: + +- read card metadata +- call refund APIs +- publish to a payout topic +- access a particular database table + +This is often implemented with: + +- service identity plus policy +- SPIFFE-like identity patterns +- service mesh policy +- signed internal tokens with audience restrictions + +### Service Auth Failure Cases + +- long-lived shared secrets copied across many services +- no certificate rotation automation +- any internal service can call any other service +- internal service trusts caller-provided headers like `X-User-Id` without verification +- service identity is authenticated but not authorized + +### Service Auth Best Practices + +- use workload or service identity, not shared static secrets where possible +- prefer short-lived credentials and automatic rotation +- bind end-user identity propagation carefully when needed +- separate service identity from end-user identity in request context + +--- + +## 15. How These Systems Fit Together + +A strong interview answer connects all the pieces into one architecture. 
+ +### 15.1 Typical SaaS Architecture + +```mermaid +flowchart TD + User[Browser / Mobile App] --> Edge[Edge / API Gateway] + Edge --> Auth[Auth Service] + Auth --> UserDB[(User Directory)] + Auth --> Session[(Session Store / Token Store)] + Auth --> Keys[KMS / Signing Keys] + Edge --> App[Business Services] + App --> Authz[Authorization Service / Policy Engine] + Authz --> Perms[(Roles, Relationships, Attributes)] + App --> Data[(Application Data)] + Auth --> Audit[Security Audit Log] + Authz --> Audit + App --> Audit +``` + +The key idea is separation of concerns: + +- auth service proves identity and issues continuity artifacts +- session/token store manages continuity and revocation +- policy engine decides access +- application services enforce business operations +- audit pipeline records security-relevant facts + +### 15.2 Consumer SaaS vs Enterprise SaaS vs Internal Platform + +| Environment | Identity priorities | +| --- | --- | +| Consumer SaaS | Signup conversion, password recovery, abuse prevention, social login | +| Enterprise SaaS | SSO, provisioning, group mapping, auditability, tenant admin controls | +| Internal platform | Service identity, zero trust, least privilege, strong device posture | + +### 15.3 Data Freshness vs Statelessness + +One of the deepest identity design tradeoffs is this: + +- stateless verification is fast and scalable +- fresh authorization state often requires looking up server-side data + +That is why many mature architectures mix the two: + +- token or session for authentication continuity +- live policy check for sensitive authorization + +### 15.4 Tenant Isolation + +For SaaS systems, tenant isolation must be explicit in both authentication and authorization. 
+ +Common patterns: + +- include tenant membership in auth/session context +- scope roles by tenant +- enforce tenant filters in service and data layers +- audit cross-tenant admin actions aggressively + +This is especially important in systems like GitHub organizations, Stripe connected accounts, or enterprise SaaS workspaces. + +--- + +## 16. Real-World Patterns and Company Examples + +These examples are useful as mental anchors, not as exact internal blueprints. + +### Google + +- Public Google identity and OIDC flows are a classic example of large-scale federated identity. +- Google's public BeyondCorp ideas are foundational for zero-trust access. +- Zanzibar is the famous reference point for large-scale, relationship-aware authorization. + +Interview lesson: centralized authorization models can work at huge scale if the data model, caching, and consistency story are designed carefully. + +### Netflix + +- Netflix-style service-rich environments highlight the need for service identity, short-lived credentials, and resilient internal auth patterns. +- Streaming and control plane workloads also show why identity systems must stay available under very high traffic. + +Interview lesson: internal service auth is not optional in large microservice systems. + +### Uber + +- Ride-sharing and marketplace architectures depend on strict service-to-service permissions, real-time risk checks, and strong tenant/user context propagation. +- Payment, dispatch, driver, and rider services cannot safely trust each other based only on network placement. + +Interview lesson: identity context often flows through many services and must remain verifiable. + +### Amazon + +- AWS IAM is the public archetype for policy-heavy authorization with users, roles, resource policies, temporary credentials, and least privilege. + +Interview lesson: enterprise-grade authorization is really a policy and identity modeling problem, not just a list of roles. 
+ +### GitHub + +- GitHub demonstrates a mix of organization membership, teams, repository roles, OAuth apps, GitHub Apps, personal access tokens, and enterprise SSO. + +Interview lesson: one product often needs several identity and authorization models at the same time. + +### Stripe + +- Stripe is a useful example for strong dashboard authentication, MFA, API keys, restricted keys, OAuth for Connect-style platforms, and careful access around money movement. + +Interview lesson: high-risk actions need stronger auth, auditability, and granular permissions than low-risk read-only actions. + +### Typical SaaS Systems + +Most B2B SaaS products end up combining: + +- email/password and social login for self-serve customers +- SSO for enterprise customers +- RBAC for admin/editor/viewer patterns +- ABAC or policy rules for sensitive workflows +- API tokens or OAuth for integrations +- service identity for microservices + +--- + +## 17. Interview Discussion Guide + +If asked to design identity and access for a backend system, structure your answer progressively. + +### 17.1 Clarifying Questions + +Ask: + +- who are the subjects: end users, admins, services, partners? +- is this consumer, enterprise, or internal platform? +- are we designing login, third-party integration, or internal access control? +- what is the risk level: social app, fintech, healthcare, developer platform? +- do we need SSO, API access, or both? +- how fresh must revocation and permission changes be? + +### 17.2 Good Interview Structure + +1. Define identities and trust boundaries. +2. Choose authentication mechanism. +3. Choose continuity mechanism: session or token. +4. Design authorization model. +5. Address recovery, revocation, audit, and failure cases. +6. Address scale, caching, key rotation, and multi-region concerns. + +### 17.3 Common Interview Comparisons + +#### Sessions vs JWT + +Use when the interviewer asks about stateful vs stateless auth. 
+ +| Question | Sessions answer | JWT answer | +| --- | --- | --- | +| Need instant logout? | Strong | Harder | +| Need local verification across services? | Weaker | Strong | +| Web app simplicity? | Often simpler | Often overcomplicated | +| Third-party API ecosystem? | Less natural | Better fit | + +#### RBAC vs ABAC + +| Question | RBAC | ABAC | +| --- | --- | --- | +| Easy admin mental model | Strong | Weaker | +| Fine-grained contextual rules | Weak | Strong | +| Risk of role explosion | High | Lower | +| Ease of debugging | Stronger | Harder | + +#### SAML vs OIDC + +| Question | SAML | OIDC | +| --- | --- | --- | +| Enterprise legacy support | Strong | Strong but varies | +| Modern web/mobile friendliness | Weaker | Strong | +| Developer ergonomics | Heavier | Better | + +### 17.4 Scaling Considerations to Mention + +- Redis or equivalent for shared sessions +- multi-region token verification and key distribution +- policy caching with invalidation strategy +- short-lived credentials for services +- audit event pipelines decoupled from critical-path latency +- abuse protection on login and signup + +### 17.5 Failure Cases Worth Calling Out + +- auth service outage blocks all logins +- Redis session store failure logs users out or prevents validation +- stale permissions cached after role removal +- signing key rotation breaks old verifiers +- refresh token theft leads to silent session hijack + +Interview tip: explicitly talking about revocation, rotation, and failure handling is often what moves an answer from junior to strong mid-level or senior. + +--- + +## 18. 
Common Mistakes and Best Practices + +### Common Mistakes + +- treating authentication and authorization as the same problem +- storing passwords with fast hashes +- putting too much trust in long-lived JWTs +- assuming "internal network" means trusted caller +- forgetting logout, revocation, and recovery flows +- doing authorization only at the gateway +- using email as the only durable identity key in enterprise federation +- failing to audit permission changes and admin actions +- building role systems that cannot express tenant or resource scope + +### Best Practices + +- separate identity proof, session/token continuity, and authorization policy clearly +- use slow password hashing and protect high-value secrets with KMS/HSM support +- prefer MFA and step-up authentication for sensitive actions +- keep access tokens short-lived and refresh tokens protected and rotated +- model tenant-aware roles and permissions explicitly +- centralize policy where consistency matters, but understand cache staleness tradeoffs +- use service identity and short-lived credentials internally +- build auditability and explainability into the system from the beginning + +### Final Mental Model + +If you remember one thing, remember this: + +Identity and access is not one feature. It is a chain of connected systems: + +- identity proof +- credential management +- session or token continuity +- authorization policy +- revocation and recovery +- service identity +- auditing and operations + +Real systems succeed when all of these parts are designed together. + +If one weak link exists, attackers and outages will find it. + +--- + +## Quick Review Checklist + +Use this when revising for interviews. + +- Can I clearly explain AuthN vs AuthZ? +- Do I know when to use sessions vs JWTs? +- Can I explain password hashing, salts, peppers, and MFA tradeoffs? +- Can I walk through OAuth authorization code flow with PKCE? +- Can I explain SAML vs OIDC and IdP vs SP? 
+- Can I compare RBAC and ABAC with examples? +- Can I describe revocation, logout, token rotation, and password reset securely? +- Can I explain service-to-service auth, mTLS, and zero trust? +- Can I describe where policy enforcement should happen in a real system? + +If the answer is yes to those questions, your identity and access fundamentals are strong enough for most software engineering interview discussions and practical backend design conversations. diff --git a/systems design/3.dataStorage.md b/systems design/3.dataStorage.md new file mode 100644 index 0000000..a96465f --- /dev/null +++ b/systems design/3.dataStorage.md @@ -0,0 +1,2128 @@ +# Data Storage + +Data storage is where system design stops being abstract and starts becoming expensive, slow, fragile, or correct. Almost every backend system eventually becomes a storage problem: where data lives, how fast it can be read, how safely it can be written, how it is recovered after failure, and how it evolves as the product and traffic grow. + +In interviews, data storage questions are rarely about naming a database. They are about whether you understand: + +- what kind of data model the system needs +- what read and write patterns dominate +- what correctness guarantees matter +- what breaks under scale or failure +- what tradeoffs you are making when you choose one storage model over another + +In production, storage decisions affect latency, reliability, developer velocity, analytics capability, and incident severity. A weak storage design causes constant pain later: bad schemas, slow queries, difficult migrations, stale replicas, failed restores, or impossible cross-shard operations. + +This guide is written to serve both interview preparation and real backend engineering. The goal is not to memorize definitions. The goal is to understand why these systems exist, how they work internally, and how strong engineers talk about them. 
+ +Examples in this guide are generalized from common industry patterns and public engineering discussions from companies such as Google, Netflix, Uber, Amazon, GitHub, Stripe, and large SaaS platforms. + +## 1. Big Picture: How Storage Fits Into Real Systems + +At a high level, most production systems use multiple storage systems at once. + +One database is rarely enough because different workloads want different things: + +- transactional correctness for orders, payments, and user accounts +- low-latency caching for hot reads +- search-oriented storage for text retrieval +- analytics storage for large scans and aggregations +- event logs for asynchronous processing +- archival storage for old or infrequently accessed data + +### 1.1 Typical Production Architecture + +```mermaid +flowchart LR + C[Client / API Consumer] --> APP[Application Services] + APP --> CACHE[(Cache / Key-Value Store)] + APP --> SQLP[(Primary Relational DB)] + SQLP --> SQLR[(Read Replicas)] + APP --> DOC[(Document DB)] + APP --> KV[(Distributed Key-Value)] + APP --> STREAM[(Event Log / Queue)] + STREAM --> OLAP[(Warehouse / Analytics Store)] + SQLP --> BACKUP[(Backups + PITR Logs)] +``` + +The point is not to use many technologies for their own sake. The point is that storage is selected per workload. + +### 1.2 The Four Questions Behind Every Storage Decision + +| Question | Why it matters | Typical design consequence | +|---|---|---| +| What is the shape of the data? | Rows, documents, events, graphs, blobs, counters all behave differently | Drives relational vs document vs wide-column vs graph | +| How is the data accessed? | Point lookups, range scans, joins, aggregations, traversals, write bursts | Drives indexing, partitioning, caching, denormalization | +| What consistency is required? | Money movement and inventory need stronger guarantees than likes or view counts | Drives ACID, transactions, replica strategy, stale-read tolerance | +| How will it scale and fail? 
| Size, throughput, multi-region behavior, backups, migrations, hotspots | Drives sharding, replication, partitioning, archival, DR strategy | + +### 1.3 Interview Framing + +When an interviewer asks, "What database would you use?", a strong answer sounds like this: + +"I would start from access patterns and correctness requirements. If the core entities are strongly related, require transactions, and need flexible querying, I would default to a relational database. If the workload is massive key-based access with simple lookups and looser consistency, a NoSQL store may be a better fit. I would then discuss indexes, replication, partitioning, and backup strategy because the database choice alone does not solve production scale." + +That answer shows engineering judgment rather than tool memorization. + +## 2. Relational Databases + +Relational databases are the default storage system for a huge amount of backend software because most business systems revolve around structured entities and relationships: users, subscriptions, invoices, orders, teams, repositories, shipments, and permissions. + +### 2.1 What Relational Databases Are + +A relational database stores data in tables. Each table represents a type of entity. Rows represent individual records. Columns represent attributes of those records. + +Example: + +- `users(id, email, created_at)` +- `orders(id, user_id, total_amount, status)` +- `order_items(order_id, product_id, quantity, price)` + +The key idea is not just "table-shaped data". The key idea is that relationships are first-class. One row can reference another through keys, and the database can enforce integrity rules around those relationships. + +### 2.2 Why Relational Databases Exist + +They exist because business data is usually not independent. It is interconnected. 
+ +Examples: + +- an invoice belongs to an account +- an order belongs to a customer +- a repository belongs to an organization +- a payout belongs to a merchant and must match ledger entries + +Without a relational model, every application would have to manually enforce the same guarantees: + +- uniqueness +- referential integrity +- transactional correctness +- safe concurrent updates +- consistent query semantics + +Relational databases centralize those guarantees and provide a query language, optimizer, transaction manager, and storage engine to manage them efficiently. + +### 2.3 Core Building Blocks + +| Concept | Meaning | Why it exists | +|---|---|---| +| Table | Collection of rows of the same entity type | Organizes records under a common schema | +| Row | One record | Represents one object or fact | +| Column | One attribute | Gives structure and type constraints | +| Primary key | Unique identifier for a row | Supports identity, lookup, and relationships | +| Foreign key | Reference to another table's primary key | Enforces valid relationships | +| Constraint | Rule the data must satisfy | Prevents invalid writes at the database layer | +| Schema | The formal structure of tables and fields | Makes data predictable and queryable | + +### 2.4 Primary Keys + +A primary key uniquely identifies a row. + +Typical choices: + +- auto-increment integer +- UUID +- ULID or time-sortable ID +- composite key in some domain-specific cases + +Tradeoffs: + +- auto-increment IDs are compact and index-friendly but can create write hotspots in distributed systems +- UUIDs support decentralized ID generation but are larger, less cache-friendly, and can fragment indexes depending on type +- time-ordered IDs improve write locality while keeping uniqueness across distributed nodes + +Interview discussion: + +If asked about ID choice, do not stop at uniqueness. Discuss index locality, distributed generation, predictability, and operational implications. 
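
To make the time-ordered option concrete, here is a rough Python sketch. It borrows the ULID-style layout of a 48-bit millisecond timestamp followed by 80 random bits, but it is not a spec-compliant ULID: the hex encoding and the function name `new_time_ordered_id` are assumptions made for this example.

```python
import os
import time
from typing import Optional


def new_time_ordered_id(now_ms: Optional[int] = None) -> str:
    """Return a 128-bit ID as 32 hex chars: 48-bit millisecond timestamp + 80 random bits."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    timestamp_bits = now_ms & ((1 << 48) - 1)  # keep only the low 48 bits
    random_bits = int.from_bytes(os.urandom(10), "big")  # 10 bytes = 80 bits
    # The timestamp occupies the high bits, so fixed-width hex strings sort by creation time.
    return f"{(timestamp_bits << 80) | random_bits:032x}"
```

Because the timestamp sits in the most significant bits, new IDs land near the end of a B-tree index instead of scattering across it, and any node can generate them without coordination. IDs created within the same millisecond are unique with very high probability but are not ordered relative to each other, and the embedded timestamp reveals creation time, which is part of the predictability tradeoff worth mentioning.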
+ +### 2.5 Foreign Keys and Constraints + +Foreign keys ensure that a row points to something real. + +Example: + +- `orders.user_id` should reference `users.id` + +Why this matters: + +- prevents orphaned records +- protects against application bugs +- makes data trustworthy for downstream systems + +Constraints commonly used in production: + +- `PRIMARY KEY` +- `UNIQUE` +- `NOT NULL` +- `CHECK` +- `FOREIGN KEY` + +Tradeoff: + +Some high-scale systems relax or avoid certain constraints in hot paths because they want more write throughput or more control in the application layer. But removing database-level safety shifts correctness burden onto application code and background repair jobs. + +### 2.6 Schema Design + +Schema design is the process of deciding how entities are represented and related. + +A good schema: + +- reflects domain boundaries clearly +- supports the most important queries efficiently +- enforces correctness where possible +- avoids unnecessary duplication +- evolves safely over time + +Bad schema design is one of the most expensive long-term mistakes in backend systems. It causes: + +- hard-to-explain data models +- excessive joins +- duplicated fields that drift apart +- slow migrations +- poor index strategy +- downstream analytics confusion + +Practical rule: schema design should start from access patterns and invariants, not from class diagrams alone. + +### 2.7 How Relational Databases Work Internally + +Most modern relational systems are not just "tables on disk". Internally, they usually include: + +- a parser that understands SQL +- a planner/optimizer that chooses an execution strategy +- a buffer pool that caches pages in memory +- a storage engine that organizes rows and indexes into pages or files +- a write-ahead log (WAL) or redo log for durability and crash recovery +- a lock manager or MVCC mechanism for concurrency control + +Conceptually, a write often looks like this: + +1. Parse and validate the SQL. +2. 
Check permissions, constraints, and transaction state. +3. Update in-memory structures or page buffers. +4. Append the change to a durable log. +5. Eventually flush changed pages to disk. +6. On crash recovery, replay the log to restore a consistent state. + +This log-first approach is why databases can be both durable and reasonably fast. + +### 2.8 ACID Properties + +ACID is one of the main reasons relational databases dominate critical business systems. + +| Property | Meaning | Why it matters in production | +|---|---|---| +| Atomicity | A transaction happens entirely or not at all | Prevents partial writes such as charging a card but not recording the payment | +| Consistency | The database moves from one valid state to another valid state | Preserves constraints, invariants, and business rules | +| Isolation | Concurrent transactions behave as if they do not corrupt each other | Prevents race conditions and inconsistent reads | +| Durability | Committed data survives crashes | Critical for orders, money, and audit logs | + +ACID does not mean infinite performance or global serializability everywhere. It means the database provides mechanisms to preserve correctness within its transaction model. + +### 2.9 OLTP vs OLAP Basics + +| Dimension | OLTP | OLAP | +|---|---|---| +| Goal | Fast transactional reads/writes | Large analytical scans and aggregations | +| Query shape | Point lookups, small updates, short transactions | Complex joins, scans, groupings across large datasets | +| Data freshness | Near real-time | Often batch or streaming ingestion | +| Schema style | Highly normalized or moderately normalized | Often denormalized star/snowflake models | +| Example | Checkout, login, billing, inventory | Revenue dashboards, cohort analysis, forecasting | + +Common mistake: using the primary production database for heavy analytics. That often destroys latency for user-facing traffic. 
+ +Production pattern: + +- OLTP DB for live product traffic +- CDC or event stream replicates changes into an OLAP system or warehouse + +### 2.10 When Relational Databases Are the Right Choice + +Choose a relational database when you need: + +- strong transactional correctness +- relationships across entities +- flexible querying across structured data +- strict constraints and integrity checks +- mature tooling and predictable operational behavior + +Typical examples: + +- Stripe-like payments and ledger systems +- GitHub-like repository metadata, accounts, permissions, billing +- SaaS products with users, teams, subscriptions, invoices, audit records +- e-commerce orders, inventory reservations, and shipments + +### 2.11 Common Production Problems + +| Problem | Why it happens | Typical response | +|---|---|---| +| Slow queries | Missing indexes, bad query shape, poor cardinality estimates | Add or fix indexes, rewrite query, update stats | +| Lock contention | Hot rows, long transactions, frequent updates | Reduce transaction scope, reorder operations, use optimistic patterns | +| Replica lag | Heavy writes, slow replication, long-running queries on replicas | Limit lag-sensitive reads, tune replication, use read-after-write routing | +| Connection exhaustion | Too many app instances or poor pooling | Use connection pooling and backpressure | +| Schema migration risk | Tight coupling between app versions and schema | Use backward-compatible migrations | +| Storage bloat | Dead tuples, fragmentation, old data retention | Vacuum/compaction, archiving, partitioning | + +## 3. SQL + +SQL is the language used to describe what data you want and how you want to change it. The important part is that SQL is largely declarative: you say what result you want, and the database decides how to get it. + +That separation is powerful because the optimizer can choose different execution strategies based on data size, indexes, and statistics. 
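
A small experiment can show this separation in practice. The sketch below uses Python's built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN` statement purely as a convenient stand-in for a production database. The `orders` table, the index name, and the `plan_for` helper are hypothetical, and the exact plan wording varies between SQLite versions:

```python
import sqlite3


def plan_for(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's chosen plan as text (the last column of each EXPLAIN QUERY PLAN row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

# The SQL text stays identical; only the available access paths change.
query = "SELECT id, total FROM orders WHERE user_id = ?"
plan_before = plan_for(conn, query, (42,))

conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = plan_for(conn, query, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

On typical SQLite builds the first plan reports a scan of `orders` and the second a search using `idx_orders_user_id`, even though the application code and query text did not change. Larger engines expose the same idea through their own `EXPLAIN` variants.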
+ +### 3.1 Why SQL Exists + +SQL exists because applications need a high-level way to: + +- retrieve specific records +- filter and aggregate data +- join related tables +- modify data safely inside transactions +- define schemas, constraints, and access policies + +Instead of hand-coding file access and pointer traversal, developers describe intent and let the database manage physical execution. + +### 3.2 How Databases Think About SQL Internally + +When you write a query, the database usually goes through stages like: + +1. Parse the SQL into an internal tree. +2. Validate table names, column names, permissions, and types. +3. Rewrite the query if useful, such as simplifying expressions or flattening subqueries. +4. Estimate row counts using statistics. +5. Consider candidate plans. +6. Choose access methods like index scan, sequential scan, hash join, sort, or aggregation. +7. Execute the plan and stream results back. + +### 3.3 Query Execution Flow + +```mermaid +flowchart TD + Q[SQL Query] --> P[Parser] + P --> R[Rewriter] + R --> O[Optimizer / Planner] + O --> A{Choose Access Path} + A --> S1[Sequential Scan] + A --> S2[Index Scan] + A --> J[Join Strategy] + J --> J1[Nested Loop] + J --> J2[Hash Join] + J --> J3[Merge Join] + S1 --> E[Execution Engine] + S2 --> E + J1 --> E + J2 --> E + J3 --> E + E --> RES[Rows Returned] +``` + +The optimizer is the reason two queries with similar meaning can perform very differently. + +### 3.4 Core SQL Operations + +#### `SELECT` + +Used to retrieve data. But the real question is not "how do I write SELECT". The real question is: what access path will the database choose to produce the rows? + +If a query filters by an indexed, selective column, the planner may use an index scan. If the condition matches a large percentage of the table, a full scan may actually be cheaper. + +#### `INSERT` + +Adds new rows. 
Internally, inserts may: + +- allocate a new row location +- update all relevant indexes +- validate constraints +- write to the transaction log + +A write to one logical table can mean multiple physical writes due to indexes. + +#### `UPDATE` + +Modifies existing rows. Updates can be deceptively expensive because they may: + +- find the existing row +- create a new row version under MVCC +- update each affected index +- generate dead tuples or fragmented storage + +#### `DELETE` + +Deletes rows. In many engines, deletes do not immediately remove bytes from disk. They mark rows as deleted, and cleanup happens later through vacuuming or compaction. + +That is why mass deletes can be surprisingly expensive and can cause storage bloat. + +### 3.5 `WHERE`, `GROUP BY`, `ORDER BY`, `HAVING` + +| Clause | What it does logically | What often matters physically | +|---|---|---| +| `WHERE` | Filters rows before aggregation | Index usage, predicate selectivity | +| `GROUP BY` | Groups rows into buckets | Sorting or hashing cost | +| `ORDER BY` | Sorts result rows | Whether an index can satisfy ordering | +| `HAVING` | Filters after grouping | Size of grouped intermediate results | + +Interview point: + +`WHERE` and `HAVING` are not interchangeable. `WHERE` reduces rows earlier. `HAVING` filters after grouping. Earlier reduction is usually cheaper. + +### 3.6 Aggregations + +Aggregations such as `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX` are straightforward conceptually but can be expensive at scale. + +Production concerns: + +- full-table aggregations are expensive on hot OLTP databases +- grouped aggregations can require large hash tables or sort phases +- repeated dashboard queries often need precomputation, caching, or OLAP offloading + +Example: + +A SaaS dashboard showing active users by day should not necessarily run a large `GROUP BY date(created_at)` directly against the primary transactional database every page load. 
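One common way to keep such a dashboard off the raw table is a small precomputed rollup. The sketch below uses Python's `sqlite3` module with hypothetical `logins` and `daily_active_users` tables; in a real system the refresh would more likely be incremental and run by a scheduler or a CDC consumer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, created_at TEXT);
CREATE TABLE daily_active_users (day TEXT PRIMARY KEY, active_users INTEGER);
""")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [
        (1, "2024-05-01 09:00:00"),
        (1, "2024-05-01 17:30:00"),
        (2, "2024-05-01 10:15:00"),
        (2, "2024-05-02 08:00:00"),
    ],
)


def refresh_rollup(conn):
    """Recompute the summary in one pass; intended to run on a schedule."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM daily_active_users")
        conn.execute("""
            INSERT INTO daily_active_users (day, active_users)
            SELECT date(created_at), COUNT(DISTINCT user_id)
            FROM logins
            GROUP BY date(created_at)
        """)


def dashboard_rows(conn):
    """What a page load runs: a cheap read of the small summary table."""
    return conn.execute(
        "SELECT day, active_users FROM daily_active_users ORDER BY day"
    ).fetchall()


refresh_rollup(conn)
rows = dashboard_rows(conn)
print(rows)
```

The expensive `GROUP BY` runs once per refresh instead of once per page view. The tradeoff is staleness: the dashboard is only as fresh as the last refresh.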
+ +### 3.7 Subqueries, CTEs, Views, Stored Procedures + +#### Subqueries + +Useful for expressing dependent logic, but performance depends on whether the optimizer can rewrite or decorrelate them. + +#### CTEs + +Common Table Expressions improve readability and can structure complex logic into steps. In some databases they behave mostly as syntax sugar; in others they can affect optimization boundaries. Do not assume they are always free. + +#### Views + +Views encapsulate query logic behind a virtual table interface. + +Good use cases: + +- simplifying common query logic +- enforcing access patterns +- providing stable read models to consumers + +Tradeoff: + +Regular views do not store data. They execute underlying logic each time. Materialized views store results but must be refreshed. + +#### Stored Procedures + +Stored procedures put logic in the database. + +Why teams use them: + +- reduce application round trips +- centralize critical logic close to data +- improve consistency for some workflows + +Why teams avoid overusing them: + +- harder versioning and testing compared to application code +- business logic becomes harder to trace +- vendor lock-in increases + +In interviews, a balanced answer is strongest: stored procedures are useful for certain data-heavy workflows, but many modern teams prefer to keep most domain logic in application services. + +### 3.8 Transactions and SQL + +SQL statements do not operate in a vacuum. They run inside transaction boundaries. + +This matters because: + +- multiple writes may need to succeed together +- reads may or may not see concurrent updates +- locks or row versions are affected by transaction scope + +Example: + +```sql +BEGIN; + +UPDATE accounts +SET balance = balance - 100 +WHERE id = 1; + +UPDATE accounts +SET balance = balance + 100 +WHERE id = 2; + +COMMIT; +``` + +This is not just two updates. It is one correctness boundary. + +### 3.9 Query Optimization Mindset + +Strong engineers do not guess blindly. 
They ask: + +- how many rows are being scanned? +- which predicates are selective? +- is an index usable? +- are we forcing a sort? +- are we joining large intermediate results? +- are we fetching columns not covered by the index? +- is this workload better served from a cache or OLAP store? + +### 3.10 Reading Execution Plans + +At a high level, execution plans tell you: + +- scan type used +- join order and join method +- estimated versus actual row counts +- sort and aggregation steps +- cost estimates + +What to look for first: + +- unexpected sequential scans on large tables +- massive row count mismatches indicating stale statistics or bad estimates +- repeated nested loops over huge inputs +- sorts spilling or large hash structures + +Interview point: + +You do not need to memorize every plan node. You do need to show that you know how to reason from plan shape to performance bottleneck. + +### 3.11 Common Interview SQL Discussions + +- Explain why a query is slow. +- Design indexes for a query. +- Compare a subquery to a join. +- Explain `WHERE` versus `HAVING`. +- Explain why `SELECT *` is often a bad idea. +- Discuss pagination: offset pagination versus keyset pagination. +- Explain how transactions affect concurrent SQL statements. + +## 4. Indexing + +Indexes exist because scanning every row for every query does not scale. + +If a table has 500 rows, a full scan is fine. If it has 500 million rows and the query wants one customer by email, a full scan is a disaster. + +### 4.1 The Full Table Scan Problem + +Without an index, the database may need to inspect every row to answer a filter such as: + +```sql +SELECT * FROM users WHERE email = 'a@example.com'; +``` + +That means CPU, disk I/O, buffer churn, and higher latency. Indexes trade extra storage and write cost for much faster reads. + +### 4.2 What an Index Is + +An index is an auxiliary data structure that makes certain lookups cheaper. 
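As a rough mental model only (real engines use B-Trees stored in pages, not Python lists), an index can be pictured as sorted keys that point back to row positions. The names `rows`, `email_index`, `full_scan`, and `index_lookup` below are made up for illustration.

```python
import bisect

# The "table": rows stored in arbitrary insertion order.
rows = [
    {"id": 1, "email": "dana@example.com"},
    {"id": 2, "email": "alex@example.com"},
    {"id": 3, "email": "casey@example.com"},
    {"id": 4, "email": "blake@example.com"},
]


def full_scan(rows, email):
    """No index: inspect rows one by one until a match is found."""
    for row in rows:
        if row["email"] == email:
            return row
    return None


# The "index": sorted (key, row position) pairs kept alongside the table.
email_index = sorted((row["email"], pos) for pos, row in enumerate(rows))


def index_lookup(rows, index, email):
    """Binary-search the sorted keys, then fetch the row by its position."""
    i = bisect.bisect_left(index, (email,))
    if i < len(index) and index[i][0] == email:
        return rows[index[i][1]]
    return None


print(index_lookup(rows, email_index, "casey@example.com"))
```

The lookup touches a logarithmic number of keys instead of every row. The cost is that `email_index` is extra storage and must be updated on every write.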
+ +Think of it like a book index: + +- without it, you scan the whole book +- with it, you jump near the right page immediately + +### 4.3 B-Tree Indexes + +B-Tree indexes are the default in many relational systems because they support: + +- exact lookups +- range scans +- ordered traversal +- prefix matches on composite keys + +Internally, B-Trees keep keys in sorted order across pages so the database can navigate down the tree quickly and then scan a relevant range. + +Why they are popular: + +- good general-purpose behavior +- efficient for equality and range queries +- can support `ORDER BY` if the ordering matches the index + +### 4.4 Hash Indexes + +Hash indexes map keys to buckets and are good for exact equality lookups. + +Tradeoff: + +- excellent for `=` lookups +- not useful for range scans like `BETWEEN`, `<`, `>`, or ordered traversal + +In many production systems, B-Trees remain the default because they are more versatile. + +### 4.5 Composite Indexes + +A composite index includes multiple columns, such as: + +```sql +CREATE INDEX idx_orders_user_status_created +ON orders(user_id, status, created_at); +``` + +Important concept: column order matters. + +This index is most useful when queries filter or order by the leading prefix: + +- `WHERE user_id = ?` +- `WHERE user_id = ? AND status = ?` +- `WHERE user_id = ? AND status = ? ORDER BY created_at` + +It may not help much for a query filtering only by `status`. + +### 4.6 Covering Indexes + +A covering index contains enough information to answer a query without going back to the base table. + +This reduces extra reads. + +Example idea: + +If a query needs only `id` and `status`, and those are already in the index, the engine may avoid fetching the full row. 
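A small SQLite sketch of that idea follows, with a hypothetical `orders` table and `idx_orders_user_status` index. One SQLite-specific assumption is worth stating: an `INTEGER PRIMARY KEY` column is the rowid, which SQLite stores in every index, so `id` is available from the index without visiting the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT, total REAL)"
)
conn.execute("CREATE INDEX idx_orders_user_status ON orders(user_id, status)")


def plan(sql):
    """Return SQLite's plan description for a query (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Every selected column lives in the index, so no table lookup is needed.
covered = plan("SELECT id, status FROM orders WHERE user_id = 7")

# `total` is not in the index, so each match needs an extra row fetch.
not_covered = plan("SELECT id, status, total FROM orders WHERE user_id = 7")

print(covered)
print(not_covered)
```

In the first plan SQLite reports a covering index; in the second it still uses the index to find matches but must go back to the table for `total`.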
+ +Why it matters: + +- lower I/O +- lower latency +- better cache efficiency + +### 4.7 Clustered vs Non-Clustered Indexes + +| Type | Meaning | Implication | +|---|---|---| +| Clustered index | Table data is physically ordered by the index key or closely tied to that key | Great for range scans on the clustering key, but only one primary physical order | +| Non-clustered index | Separate structure pointing to row locations | Flexible, but may require extra lookups to fetch row data | + +Practical intuition: + +You get one main physical organization of the table, but many secondary lookup structures. + +### 4.8 Unique Indexes + +Unique indexes enforce that indexed values do not repeat. + +Examples: + +- email should be unique per user account table +- external payment id should be unique to prevent duplicate processing + +They improve correctness, not just performance. + +### 4.9 Partial Indexes + +A partial index only indexes rows matching a condition. + +Example: + +```sql +CREATE INDEX idx_active_subscriptions +ON subscriptions(user_id) +WHERE status = 'active'; +``` + +Why it is useful: + +- smaller index +- faster maintenance +- targeted at frequent queries + +This is especially powerful when the application repeatedly queries a hot subset of data. + +### 4.10 Index Selectivity + +Selectivity means how well a predicate narrows the result set. + +- high selectivity: returns very few rows, like email or order id +- low selectivity: returns many rows, like boolean `is_active = true` in a large table where most rows are active + +Indexes are more useful on selective predicates. Indexing a low-selectivity column alone may not help much. + +### 4.11 How Indexes Affect Query Planning + +The optimizer decides whether an index helps based on estimated cost. 
+ +It may ignore an index when: + +- the query needs most rows anyway +- the table is small enough that a scan is cheaper +- statistics suggest low selectivity +- using the index would cause too many random reads + +This surprises many beginners who think, "I created an index, so it must be used." + +It is not index existence that matters. It is cost. + +### 4.12 Write Amplification Tradeoff + +Every extra index speeds some reads but makes writes more expensive. + +An insert or update may need to update: + +- the base table storage +- the primary key index +- each secondary index + +Costs of over-indexing: + +- slower writes +- more storage +- more memory usage +- more vacuum/compaction overhead +- more operational complexity during schema changes + +### 4.13 Good vs Bad Indexing Examples + +| Pattern | Good or Bad | Why | +|---|---|---| +| Index on `users(email)` for login lookup | Good | High selectivity, common query path | +| Composite index on `(tenant_id, created_at)` for tenant timeline queries | Good | Matches filter and sort shape | +| Separate indexes on `tenant_id` and `created_at` when queries always use both together | Often weaker | Composite index may serve the access pattern better | +| Index every column "just in case" | Bad | High write cost, low usefulness | +| Index a boolean column by itself in a table where almost all values are the same | Often bad | Low selectivity | + +### 4.14 Common Indexing Mistakes + +- indexing without looking at real query patterns +- ignoring column order in composite indexes +- assuming indexes always help writes or deletes +- forgetting that indexes need maintenance and storage +- missing indexes on foreign keys used heavily in joins +- failing to reconsider indexes after query patterns change + +## 5. Joins + +Joins exist because relational data is split across tables to reduce redundancy and preserve structure, but applications still need combined answers. 
+ +Example: + +- users in one table +- orders in another +- order items in a third + +To answer "show me Alice's last 10 orders and item counts", the database must combine related records. + +### 5.1 Why Joins Exist + +Without joins, every application would either: + +- duplicate related data everywhere, or +- perform multiple round trips and manually stitch results together + +Joins let the database combine data close to storage, often more efficiently than the application layer could. + +### 5.2 Join Types + +| Join | Meaning | Typical use | +|---|---|---| +| `INNER JOIN` | Return only matching rows | Orders with valid customers | +| `LEFT JOIN` | Keep all rows from left table, even if right side missing | Users and optional profile settings | +| `RIGHT JOIN` | Keep all rows from right table | Less commonly used; often rewritten as left join | +| `FULL OUTER JOIN` | Keep all rows from both sides | Reconciliation or comparison workloads | +| `SELF JOIN` | Join a table to itself | Org hierarchy, manager relationships | +| `CROSS JOIN` | Cartesian product | Rare except for deliberate matrix generation | + +### 5.3 How Databases Perform Joins Internally + +The database does not "magically combine" tables. It chooses an algorithm. + +#### Nested Loop Join + +For each row from one input, look up matching rows in the other input. + +Best when: + +- one side is small +- the other side has a useful index + +Danger: + +- terrible if both sides are large and lookups are expensive + +#### Hash Join + +Build a hash table from one input, then probe it using the other input. + +Best when: + +- joining large unsorted datasets on equality + +Tradeoff: + +- needs memory to build the hash structure +- not useful for every join predicate shape + +#### Merge Join + +Walk two sorted inputs together. 
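A minimal in-memory sketch of that walk is shown below. It assumes both inputs are lists of dicts already sorted by the join key; the function name and sample data are illustrative, and a real engine works on pages and iterators rather than Python lists.

```python
def merge_join(left, right, key):
    """Equality-join two lists of dicts that are already sorted by `key`."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1  # left side is behind; advance it
        elif left_key > right_key:
            j += 1  # right side is behind; advance it
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # Pair every left row carrying this key with each row in the run.
            while i < len(left) and left[i][key] == left_key:
                for right_row in right[j:j_end]:
                    result.append({**left[i], **right_row})
                i += 1
            j = j_end
    return result


users = [
    {"user_id": 1, "name": "Alice"},
    {"user_id": 2, "name": "Bob"},
    {"user_id": 3, "name": "Cara"},
]
orders = [
    {"user_id": 1, "order_id": 10},
    {"user_id": 1, "order_id": 11},
    {"user_id": 3, "order_id": 12},
]

joined = merge_join(users, orders, "user_id")
print(joined)
```

Each input is read once from start to finish, which is why the strategy is attractive when the sort order already exists.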
+ +Best when: + +- both inputs are already sorted or indexed on join keys +- join is based on sortable keys + +Tradeoff: + +- sorting inputs can be expensive if not already ordered + +### 5.4 Join Execution Example + +```mermaid +flowchart LR + A[Orders Scan or Index Scan] --> J{Join Strategy} + B[Users Scan or Index Scan] --> J + J --> NL[Nested Loop] + J --> HJ[Hash Join] + J --> MJ[Merge Join] + NL --> R[Joined Result] + HJ --> R + MJ --> R +``` + +### 5.5 Join Performance Problems + +Joins become expensive when: + +- join keys are not indexed appropriately +- large intermediate results are created before filtering +- data is highly skewed +- many-to-many joins explode result size +- joins happen across shards or services instead of within one database node + +### 5.6 Avoiding Expensive Joins at Scale + +Common production strategies: + +- add the right indexes on join keys +- filter early before joining large tables +- precompute summaries or read models +- denormalize hot read paths +- move heavy analytical joins to OLAP systems +- avoid cross-service synchronous joins in request paths + +Real-world intuition: + +A GitHub-like product can keep strongly normalized core data in MySQL or Postgres, but timeline feeds, counters, search views, and recommendation paths often use denormalized or precomputed models because repeated multi-table joins on hot endpoints become too costly. + +## 6. Transactions + +Transactions are the unit of correctness in storage systems. + +They exist because many business operations are not a single write. They are a bundle of writes and reads that must behave as one logical step. + +Examples: + +- transferring money between two accounts +- reserving inventory and creating an order +- creating a subscription and initial invoice together +- writing a payment and corresponding ledger entry + +### 6.1 What a Transaction Is + +A transaction is a sequence of operations that the database treats as one atomic unit. 
+ +Either: + +- all intended changes commit, or +- none of them do + +### 6.2 Commit and Rollback + +- `COMMIT` makes the transaction durable and visible according to isolation rules +- `ROLLBACK` discards its changes + +This sounds simple, but under the hood the database must coordinate row versions, locks, logs, and crash recovery. + +### 6.3 Transaction Flow + +```mermaid +sequenceDiagram + participant App + participant DB + participant Log as WAL / Redo Log + + App->>DB: BEGIN + App->>DB: Read current state + App->>DB: Write change A + App->>DB: Write change B + DB->>Log: Append durable log records + App->>DB: COMMIT + DB-->>App: Success +``` + +If a failure happens before commit completes, the system must recover to a consistent state using the log. + +### 6.4 ACID In Depth + +#### Atomicity + +Prevents partial effects. + +Without atomicity: + +- balance deducted from account A +- crash occurs +- balance never credited to account B + +That is unacceptable for financial or inventory systems. + +#### Consistency + +Consistency means the database enforces declared invariants and valid transitions. + +Examples: + +- foreign keys remain valid +- unique constraints are not violated +- balances cannot go below allowed thresholds if enforced properly + +Important nuance: some business consistency rules are stronger than database constraints and must also be enforced in application logic. + +#### Isolation + +Isolation determines how concurrent transactions interact. + +This is where many subtle bugs live. + +#### Durability + +Durability means that once commit succeeds, data survives process crash or machine crash, assuming the system's durability model is functioning correctly. 
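Atomicity is easiest to trust after watching a failure. The sketch below uses SQLite through Python's `sqlite3` module with a hypothetical `accounts` table whose `CHECK` constraint forbids negative balances. The credit is deliberately written before the debit so that the failure happens after one write has already been made.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction; return True if it committed."""
    try:
        with conn:  # commit on success, roll back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected the debit


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts").fetchall())


failed = transfer(conn, 1, 2, 500)   # debit would overdraw account 1
after_failed = balances(conn)
ok = transfer(conn, 1, 2, 30)
after_ok = balances(conn)
print(failed, after_failed, ok, after_ok)
```

The rejected transfer leaves both balances untouched even though the credit to account 2 had already executed inside the transaction. That is the "all or nothing" guarantee in practice.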
+ +### 6.5 Isolation Levels + +| Isolation level | Prevents | Still allows in many systems | Typical tradeoff | +|---|---|---|---| +| Read uncommitted | Very little | Dirty reads, non-repeatable reads, phantoms | Rarely used for correctness-sensitive work | +| Read committed | Dirty reads | Non-repeatable reads, phantoms | Common default in many databases | +| Repeatable read | Dirty reads, many non-repeatable reads | Phantom behavior depends on engine | Stronger consistency, more overhead | +| Serializable | Behaves like serial execution | None conceptually, but may abort conflicting transactions | Strongest correctness, lowest concurrency | + +### 6.6 Read Anomalies + +#### Dirty Read + +Transaction A sees uncommitted data from transaction B. If B rolls back, A observed fiction. + +#### Non-Repeatable Read + +Transaction A reads a row twice and gets different values because another transaction committed a change between reads. + +#### Phantom Read + +Transaction A runs the same predicate twice and sees different sets of rows because another transaction inserted or deleted matching rows. + +Interview point: + +Do not memorize anomalies in isolation. Connect them to business impact, such as showing inconsistent account balance, inventory count, or list of eligible coupons. + +### 6.7 Locking Basics + +Databases often use locks or lock-like mechanisms to coordinate concurrent access. + +Common ideas: + +- shared locks for reading +- exclusive locks for writing +- row-level, page-level, or table-level scope depending on engine and operation + +Stronger isolation often means more locking or more conflict detection. 
+ +### 6.8 Optimistic vs Pessimistic Locking + +| Approach | Idea | Best when | Tradeoff | +|---|---|---|---| +| Optimistic locking | Assume conflicts are rare; detect them at commit/update time | High-read, lower-conflict systems | Retries needed on conflict | +| Pessimistic locking | Lock early to prevent conflicts | High-conflict or correctness-critical updates | Lower concurrency, risk of lock waits | + +Example of optimistic locking: + +- row has a `version` +- update succeeds only if `version` still matches expected value + +This is common in SaaS applications where conflicts are possible but not constant. + +### 6.9 Deadlocks + +A deadlock happens when two transactions each wait for locks held by the other. + +Example: + +- transaction A locks row 1, then wants row 2 +- transaction B locks row 2, then wants row 1 + +The database detects the cycle and aborts one transaction. + +Best practices: + +- access rows in consistent order +- keep transactions short +- avoid user interaction inside transactions +- retry safely on deadlock aborts + +### 6.10 Distributed Transactions Basics + +Transactions inside one database node are already complex. Across multiple services or databases, they are much harder. + +Two-phase commit exists, but many large distributed systems avoid relying on it for core business flows because it adds latency, coordinator complexity, and difficult failure handling. + +### 6.11 Saga Pattern Basics + +The saga pattern breaks a distributed workflow into multiple local transactions plus compensating actions. + +Example: + +1. reserve inventory +2. create order +3. charge payment +4. if payment fails, release inventory and cancel order + +This is not the same as a single ACID transaction. It is a business-level consistency approach. 
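The numbered workflow above can be sketched as a tiny in-process orchestrator. This is an assumption-heavy illustration: `run_saga` and the step functions are hypothetical, everything runs in one process, and a production saga would also need durable state, idempotent steps, and retries for compensations that fail.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If an action raises, run the compensations of the steps that already
    completed, in reverse order, and report failure.
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


state = {"inventory_reserved": False, "order": None}


def reserve_inventory():
    state["inventory_reserved"] = True


def release_inventory():
    state["inventory_reserved"] = False


def create_order():
    state["order"] = "created"


def cancel_order():
    state["order"] = "cancelled"


def charge_payment():
    raise RuntimeError("card declined")  # simulate the failing step


def refund_payment():
    pass  # never reached here because the charge did not succeed


ok, log = run_saga([
    ("reserve_inventory", reserve_inventory, release_inventory),
    ("create_order", create_order, cancel_order),
    ("charge_payment", charge_payment, refund_payment),
])
print(ok, log, state)
```

Between the first step and the last compensation, other readers could observe reserved inventory and a created order. A saga restores business consistency afterwards; it does not hide intermediate states the way isolation does.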
+ +Production tradeoff: + +- better scalability and service autonomy +- more complex failure and compensation logic +- eventual consistency windows must be handled explicitly + +### 6.12 Real-World Production Tradeoffs + +Stripe-like money systems often keep the critical ledger in a strongly consistent relational model because correctness matters more than raw write scale. A social feed or notification counter may accept eventual consistency and use cheaper distributed storage patterns. + +That contrast is the essence of storage design: correctness requirements decide architecture. + +## 7. Normalization + +Normalization exists to reduce redundancy and protect data quality. + +It is often taught as academic normal forms, but in production it is a practical engineering tool for controlling duplication and update pain. + +### 7.1 Why Normalization Exists + +Imagine storing customer email in every order row. If the customer changes email, old and new values may diverge across orders. Now your data is inconsistent. + +Normalization helps separate facts so each fact is stored in one canonical place when appropriate. + +### 7.2 Redundancy Problems and Anomalies + +| Problem | Meaning | Example | +|---|---|---| +| Update anomaly | Same fact must be updated in many places | Customer address copied into many rows | +| Delete anomaly | Deleting one record accidentally loses another fact | Deleting a customer's last order also removes the only stored copy of that customer's metadata | +| Insert anomaly | Cannot insert one fact without another unrelated fact | Cannot create product record until there is an order | + +### 7.3 1NF, 2NF, 3NF + +#### First Normal Form (1NF) + +Store atomic values, not repeating groups stuffed into one field. + +Bad: + +- `phone_numbers = '123,456,789'` + +Better: + +- separate rows or related table for phone numbers + +#### Second Normal Form (2NF) + +For tables with composite keys, non-key attributes should depend on the whole key, not just part of it.
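A concrete case may help here. In the hypothetical schema below, `order_items_flat` is keyed by `(order_id, product_id)` but also stores `product_name`, which depends only on `product_id`. The sketch, using Python's `sqlite3` module, shows the drift this invites and one possible 2NF decomposition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Violates 2NF: product_name depends only on product_id,
-- which is just one part of the composite key.
CREATE TABLE order_items_flat (
    order_id INTEGER,
    product_id INTEGER,
    quantity INTEGER,
    product_name TEXT,
    PRIMARY KEY (order_id, product_id)
);

-- 2NF decomposition: every non-key column depends on its whole key.
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT
);
CREATE TABLE order_items (
    order_id INTEGER,
    product_id INTEGER REFERENCES products(product_id),
    quantity INTEGER,
    PRIMARY KEY (order_id, product_id)
);
""")

conn.executemany(
    "INSERT INTO order_items_flat VALUES (?, ?, ?, ?)",
    [(1, 10, 2, "Mug"), (2, 10, 1, "Mug")],
)
conn.execute("INSERT INTO products VALUES (10, 'Mug')")
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?)", [(1, 10, 2), (2, 10, 1)]
)

# Rename the product. The flat table needs every copy updated; this
# update reaches only one row, so the copies now disagree.
conn.execute(
    "UPDATE order_items_flat SET product_name = 'Large Mug' WHERE order_id = 1"
)
# The normalized design has exactly one place to change.
conn.execute("UPDATE products SET product_name = 'Large Mug' WHERE product_id = 10")

flat_names = {
    row[0]
    for row in conn.execute("SELECT DISTINCT product_name FROM order_items_flat")
}
normalized_names = {
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT p.product_name "
        "FROM order_items oi JOIN products p USING (product_id)"
    )
}
print(flat_names, normalized_names)
```

After the rename, the flat table reports two names for the same product while the decomposed tables report one.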
+ +#### Third Normal Form (3NF) + +Non-key attributes should depend on the key, the whole key, and nothing but the key. In practice, this means avoid storing derived or indirectly dependent facts in the same table when they belong elsewhere. + +### 7.4 When to Stop Normalizing + +Normalize until: + +- duplication is controlled +- invariants are clear +- common writes stay manageable +- queries are still practical + +Do not normalize forever just to satisfy theory. If a read path becomes too join-heavy or too expensive, denormalization may be the right decision. + +### 7.5 Denormalization Tradeoffs + +Denormalization means intentionally duplicating data to improve read performance, reduce joins, or simplify query paths. + +Examples: + +- storing `customer_name` on an order snapshot +- storing aggregate counters on a parent row +- materializing a feed or timeline table for fast reads + +Benefits: + +- faster hot-path reads +- simpler queries +- lower join cost + +Costs: + +- duplicate data can drift +- writes become more complex +- backfills and migrations become harder +- multiple representations must stay consistent + +### 7.6 Normalization vs Denormalization + +| Dimension | Normalized design | Denormalized design | +|---|---|---| +| Write correctness | Easier to enforce | More application logic required | +| Read performance | Can require joins | Often faster for hot reads | +| Storage usage | Lower duplication | Higher duplication | +| Evolution | Cleaner canonical model | Backfills and consistency harder | +| Best for | Core transactional records | High-volume read models, feeds, dashboards | + +Interview framing: + +Strong answers usually say: start normalized for correctness, then denormalize specific read paths once measurement shows it is necessary. + +## 8. NoSQL Databases + +NoSQL does not mean "no structure" or "always better at scale". 
It broadly refers to storage systems that do not center the relational table model and SQL-style joins/transactions as the default abstraction. + +### 8.1 Why NoSQL Exists + +NoSQL systems grew because many modern workloads stressed relational databases in specific ways: + +- enormous write throughput +- horizontal scaling across many nodes +- flexible or nested data models +- low-latency key-based access +- globally distributed workloads +- looser consistency acceptable for some use cases + +Relational databases are powerful, but not every workload needs joins and multi-row ACID transactions. Some systems care more about partition tolerance, write availability, or schema flexibility. + +### 8.2 CAP Theorem Basics + +CAP is often explained badly. + +In the presence of a network partition, a distributed system can prioritize: + +- consistency: all nodes reflect the same latest view before responding +- availability: every request receives a response even if some responses are stale + +Partition tolerance is not optional in distributed systems that span nodes and networks. The real question during partition is how the system behaves. + +Practical takeaway: + +- money transfer systems often prefer stronger consistency +- shopping carts, timelines, caches, and analytics may prefer availability or eventual consistency + +### 8.3 Eventual Consistency Basics + +Eventual consistency means replicas may temporarily disagree, but if no new writes happen, they converge eventually. + +This is acceptable when: + +- temporary staleness is tolerable +- conflict resolution is possible +- the business experience can absorb delayed convergence + +Examples: + +- social like counts +- follower counts +- recommendation data +- session caches in some architectures + +### 8.4 Schema Flexibility + +Some NoSQL systems let records vary more freely than rigid relational schemas. 
+ +This helps when: + +- different entities have different optional fields +- product requirements evolve rapidly +- nested document structures fit the domain naturally + +But schema flexibility is not free. Uncontrolled flexibility leads to inconsistent documents, messy queries, and difficult validation. + +### 8.5 SQL vs NoSQL Comparison + +| Dimension | Relational / SQL | NoSQL | +|---|---|---| +| Data model | Tables with relations | Key-value, document, wide-column, graph | +| Transactions | Usually strong and mature | Often limited, scoped, or workload-dependent | +| Schema | Strongly defined | Flexible or query-driven depending on system | +| Joins | Native and powerful | Often limited or avoided | +| Horizontal scale | Possible but operationally harder | Often a primary design goal | +| Best workloads | Transactional business systems | Large-scale distributed access patterns and specialized models | +| Risk | Scaling and write hotspots if misused | Weaker guarantees or harder query flexibility if misapplied | + +### 8.6 When Not to Use NoSQL + +Do not choose NoSQL just because it sounds scalable. + +Avoid it when you need: + +- rich multi-table queries +- strong relational integrity +- complex transactional workflows +- frequent ad hoc reporting over structured business data + +A surprising number of early-stage products are best served by a relational database plus caching, not by premature polyglot persistence. + +## 9. Key-Value Databases + +Key-value databases store data as a mapping from key to value. + +That sounds simple because it is. Their power comes from doing simple access patterns extremely well. + +### 9.1 Data Model + +- key: unique identifier +- value: blob, string, JSON, serialized object, counter, etc. + +You usually ask for data by exact key. There is often little or no join logic. 
+ +### 9.2 Why They Exist + +They are optimized for: + +- extremely fast reads and writes +- simple lookup patterns +- horizontal distribution +- caching and ephemeral state + +### 9.3 Common Use Cases + +| Use case | Why key-value fits | +|---|---| +| Cache | Same key requested repeatedly, low-latency reads matter | +| Session storage | Retrieve session by session id | +| Rate limiting | Increment counters by user or API key | +| Feature flags | Lookup flag set by environment or tenant | +| Idempotency keys | Check if request key has been seen before | + +### 9.4 Distributed Cache Basics + +Systems like Redis-style stores are often used as distributed caches sitting in front of primary databases. + +Why this matters: + +- many reads are repetitive +- memory is much faster than disk-backed database access +- caching can protect the primary database during traffic spikes + +Common patterns: + +- cache-aside: app reads cache, falls back to DB, then populates cache +- write-through: writes update cache and backing store together +- write-behind: cache writes asynchronously to backing store + +### 9.5 TTL Concepts + +TTL means time to live. After the TTL expires, the item is evicted or treated as absent. + +Useful for: + +- sessions +- temporary tokens +- API response caching +- rate-limit windows + +TTL prevents stale ephemeral data from living forever. + +### 9.6 Eviction Strategies Basics + +When cache memory is full, items must be evicted. + +Common policies: + +- LRU: least recently used +- LFU: least frequently used +- TTL-driven expiration +- random or approximate policies in some implementations + +Tradeoff: + +cache design is about picking what is okay to lose. 
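The TTL, LRU, and cache-aside ideas from the last few sections fit together in a short sketch. The class and function names are made up for illustration, the store is a plain dict standing in for a database, and the injectable `clock` parameter exists only to make expiry easy to test. A real deployment would more likely rely on the cache server's own expiry and eviction settings.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Bounded cache: entries expire after `ttl` seconds, and the least
    recently used entry is evicted when the cache is over capacity."""

    def __init__(self, capacity, ttl, clock=time.monotonic):
        self.capacity = capacity
        self.ttl = ttl
        self.clock = clock
        self._items = OrderedDict()  # key -> (value, expires_at)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]  # expired: treat as absent
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = (value, self.clock() + self.ttl)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def get_user(cache, db, user_id):
    """Cache-aside read: try the cache, fall back to the store, populate."""
    user = cache.get(user_id)
    if user is None:
        user = db[user_id]  # stand-in for a database query
        cache.put(user_id, user)
    return user
```

A short usage example with a fake clock:

```python
now = [0.0]
cache = TTLLRUCache(capacity=2, ttl=10, clock=lambda: now[0])
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # touch "a" so "b" becomes least recently used
cache.put("c", 3)   # over capacity: "b" is evicted
now[0] = 11.0       # past the TTL: remaining entries read as absent
```

Note that this sketch cannot distinguish a cached `None` from a miss; a fuller version would use a sentinel.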
+ +### 9.7 Internal Working Model + +Many key-value systems rely on: + +- in-memory hash tables for fast lookups +- append-only logs or snapshots for durability +- replication for availability +- partitioning across nodes for scale + +If durability matters, do not assume an in-memory store is safe by default. Persistence configuration matters. + +### 9.8 Failure Cases and Mistakes + +- treating cache as the source of truth +- not handling cache stampedes when many requests miss simultaneously +- using unbounded key cardinality and blowing memory +- forgetting TTLs for ephemeral values +- assuming multi-key transactions behave like relational ACID systems + +Production note: + +Netflix-like and large SaaS architectures frequently use aggressive caching layers because keeping all read pressure on the relational core does not scale well enough. + +## 10. Document Databases + +Document databases store data as self-contained documents, often JSON-like. + +They work well when the application naturally thinks in aggregates rather than heavily normalized relationships. + +### 10.1 Data Model + +Example document: + +```json +{ + "orderId": "ord_123", + "customerId": "cus_42", + "shippingAddress": { + "city": "Seattle", + "country": "US" + }, + "items": [ + { "sku": "A1", "qty": 2, "price": 20 }, + { "sku": "B9", "qty": 1, "price": 50 } + ], + "status": "paid" +} +``` + +This is attractive because the whole order can be loaded and stored together. + +### 10.2 Why Document Databases Exist + +They exist for workloads where: + +- entities are naturally hierarchical or nested +- schema evolves frequently +- a whole aggregate is often read or written together +- joins are limited or can be avoided + +### 10.3 Schema Flexibility + +Schema flexibility is helpful during rapid product iteration, but mature teams still impose structure through validation, conventions, or schema tooling. 
+ +Otherwise, the same concept ends up with multiple shapes across documents, and queries become painful. + +### 10.4 Indexing Considerations + +Document databases still need indexes. Flexibility does not eliminate query planning. + +Common indexed fields: + +- tenant id +- status +- created_at +- nested fields used in filters + +Mistake: + +storing large nested documents but frequently querying on many scattered nested fields without planning index cost. + +### 10.5 Querying Patterns + +Document databases are strongest when queries align with document boundaries. + +Good fit: + +- fetch profile by id +- fetch order by order id +- fetch products in category with a few indexed filters + +Weaker fit: + +- many complex joins across many collections +- arbitrary relational analytics +- cross-document transactions on hot paths if the system handles them poorly + +### 10.6 Denormalization Patterns + +Document databases often encourage denormalized aggregates. + +Example: + +- store shipping address snapshot inside the order document +- embed small child objects that are usually fetched together + +Good when: + +- aggregate boundaries are clear +- updates are localized +- read patterns match whole-document access + +Bad when: + +- the embedded data changes independently at high frequency +- many documents must be updated for one logical change + +### 10.7 When Document Databases Work Well + +- content management systems +- product catalogs with variable attributes +- user profiles with optional nested settings +- event metadata or semi-structured application state + +Public-pattern intuition: + +Uber-style dynamic entities, marketplace metadata, or fast-evolving product features often push teams toward document-style models in some subsystems because strict relational modeling can become cumbersome for highly variable payloads. + +## 11. 
Wide-Column Databases + +Wide-column databases are built for high write throughput, large-scale distribution, and query models that are designed around known access patterns. + +Examples include Cassandra-style systems. + +### 11.1 Column Family Model + +These systems are not "SQL tables with many columns". They typically organize data by partition key and clustering columns, storing rows together by partition. + +The design philosophy is often: model your tables around how queries will be executed, not around a perfectly normalized domain model. + +### 11.2 Partition Key and Clustering Key + +| Concept | Role | +|---|---| +| Partition key | Decides which node stores the data | +| Clustering key | Decides ordering within a partition | + +Example mental model: + +- partition by `user_id` +- cluster by `event_time` + +This makes "recent events for user X" very efficient. + +### 11.3 Why They Exist + +Wide-column systems are designed for: + +- large write-heavy workloads +- globally distributed or multi-node availability +- time-series, logs, events, and message-like data +- predictable query-first schemas + +### 11.4 Internal Working Model + +Many wide-column systems use LSM-tree style storage: + +- writes go first to a commit log for durability +- data is accumulated in memory structures such as memtables +- flushed to immutable disk files such as SSTables +- background compaction merges files over time + +Why this helps: + +- writes are fast and mostly sequential +- good for high-ingest workloads + +Tradeoff: + +- reads may touch multiple files until compaction catches up +- compaction can be operationally expensive + +### 11.5 Consistency Tradeoffs + +These systems often let you tune consistency per operation. + +You may choose stronger or weaker consistency depending on how many replicas must acknowledge a read or write. + +This is powerful, but it means the application team must actually understand the tradeoff. 
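The replica-acknowledgement tradeoff can be reasoned about with simple arithmetic. The sketch below assumes the conventional quorum notation, which the text above does not introduce: `N` replicas per key, `W` acknowledgements required per write, and `R` replicas consulted per read. When `R + W > N`, every read set shares at least one replica with every write set. Real systems add caveats (hinted handoff, clock-based conflict resolution), so this is a rule of thumb rather than a complete consistency guarantee.

```python
def quorums_overlap(replication_factor, write_acks, read_acks):
    """True when any read set must share at least one replica with any write set."""
    return read_acks + write_acks > replication_factor


def tolerated_failures(replication_factor, acks_required):
    """Replicas that can be down while an operation needing `acks_required` still succeeds."""
    return replication_factor - acks_required


# Hypothetical settings for a replication factor of 3.
settings = {
    "quorum writes + quorum reads": (3, 2, 2),
    "single-ack writes + single-replica reads": (3, 1, 1),
    "all-replica writes + single-replica reads": (3, 3, 1),
}

for label, (n, w, r) in settings.items():
    print(label, quorums_overlap(n, w, r), tolerated_failures(n, w))
```

The third setting shows the cost side: reads overlap with writes, but a write cannot succeed if even one replica is unavailable.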
+ +### 11.6 Modeling Around Queries First + +Wide-column systems punish generic modeling. + +You usually design tables specifically for questions such as: + +- get latest 100 events for device X +- get messages for conversation Y ordered by timestamp +- get metrics for service Z over last hour + +If you need arbitrary secondary queries later, the design may not support them well. + +### 11.7 Best Use Cases + +- event streams and activity histories +- IoT and telemetry ingestion +- time-series style data +- write-heavy systems that must survive node failure + +Public-pattern example: + +Netflix-like metadata and high-availability distributed services often favor Cassandra-style patterns for workloads where availability and write scalability are more important than relational joins. + +## 12. Graph Databases + +Graph databases model data as nodes and edges. + +They exist because some domains are fundamentally about relationships, not just records. + +### 12.1 Why Graph Databases Exist + +Relational databases can represent relationships, but some workloads involve repeated, deep traversal through interconnected entities. + +Examples: + +- social graph: who follows whom +- fraud graph: account, device, card, IP, merchant relationships +- recommendation graph: users, items, interactions, similarity edges +- network topology or dependency mapping + +### 12.2 Nodes and Edges + +- nodes represent entities +- edges represent relationships + +Edges can also have properties such as weight, type, or timestamp. 
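As a rough illustration of the node and edge model, the sketch below stores typed edges with properties in an adjacency list. `PropertyGraph`, the `USED` edge type, and the identifiers are hypothetical names for this example, not the API of any graph database.

```python
from collections import defaultdict


class PropertyGraph:
    """Minimal in-memory property graph: nodes and typed edges both carry properties."""

    def __init__(self):
        self.nodes = {}                     # node_id -> properties
        self.out_edges = defaultdict(list)  # node_id -> [(edge_type, target_id, properties)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source_id, edge_type, target_id, **properties):
        self.out_edges[source_id].append((edge_type, target_id, properties))

    def neighbors(self, node_id, edge_type):
        """Targets reachable from `node_id` over edges of the given type."""
        return [
            target
            for etype, target, _props in self.out_edges.get(node_id, [])
            if etype == edge_type
        ]


g = PropertyGraph()
g.add_node("acct_1", kind="account")
g.add_node("acct_2", kind="account")
g.add_node("dev_9", kind="device")
g.add_edge("acct_1", "USED", "dev_9", first_seen="2024-01-05")
g.add_edge("acct_2", "USED", "dev_9", first_seen="2024-02-11")

print(g.neighbors("acct_1", "USED"))
```

Because each node keeps its outgoing edges next to it, following a relationship is a direct lookup instead of a join. This is the intuition behind adjacency-oriented storage.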
+ +### 12.3 Traversal Efficiency + +Graph systems are optimized for traversals like: + +- friends of friends +- accounts connected through shared devices +- shortest path +- subgraph pattern matching + +Why they can outperform SQL here: + +- repeated multi-hop joins in relational systems become cumbersome and expensive +- graph engines often store adjacency information more directly for traversal workloads + +### 12.4 When Graph Databases Are Better Than SQL Joins + +Graph databases shine when: + +- the core questions are relationship-first +- traversal depth is variable or multi-hop +- graph algorithms are central to the product + +They are not automatically better for simple CRUD over business entities. A graph database is not a replacement for every transactional workload. + +### 12.5 Typical Use Cases + +- fraud detection pipelines +- recommendation engines +- social features +- dependency impact analysis + +Production reality: + +Many companies do not use a graph database as the primary source of truth. They maintain transactional data in relational or document stores and project relationship-heavy subsets into graph-oriented systems. + +## 13. Scaling Databases + +As traffic or data volume grows, a single database instance eventually becomes a bottleneck. + +Scaling databases is hard because unlike stateless application servers, databases hold durable state, enforce consistency, and often cannot be copied or split casually. + +### 13.1 Replication + +Replication means keeping multiple copies of data across nodes. 
+ +#### Why Replication Exists + +- improve availability +- protect against node failure +- scale reads +- support backups or region-level redundancy + +#### Primary-Replica Model + +In a common setup: + +- primary handles writes +- replicas receive data changes from primary +- replicas may serve reads + +```mermaid +flowchart LR + APP[Application] --> P[(Primary)] + P --> R1[(Replica 1)] + P --> R2[(Replica 2)] + P --> R3[(Replica 3)] + R1 --> READS1[Read Traffic] + R2 --> READS2[Read Traffic] +``` + +#### Synchronous vs Asynchronous Replication + +| Mode | Behavior | Benefit | Cost | +|---|---|---|---| +| Synchronous | Primary waits for replica ack before success | Stronger consistency | Higher write latency, lower availability if replicas slow | +| Asynchronous | Primary commits before replicas catch up | Lower latency, higher write availability | Replica lag and stale reads | + +#### Replication Lag + +Lag means replicas are behind the primary. + +Consequences: + +- stale reads +- broken read-after-write expectations +- confusing user experiences such as "I just updated my profile but do not see it" + +#### Failure Handling and Leader Election Basics + +If the primary fails, the system may promote a replica to become the new leader. + +Leader election matters because multiple primaries accepting writes independently can cause divergence. + +#### Split-Brain Basics + +Split-brain happens when multiple nodes believe they are primary and accept conflicting writes. + +This is one of the most dangerous failure modes in distributed storage. + +Strong coordination, quorum logic, and careful failover processes exist largely to prevent this. + +### 13.2 Read Replicas + +Read replicas are a specific use of replication for read scaling. + +#### Why They Exist + +Many systems are read-heavy. Product pages, user profiles, repository metadata, timelines, dashboards, and catalog views are often read far more than written. 
+ +Read replicas let the system distribute read load away from the primary. + +#### Routing Reads vs Writes + +Common policy: + +- writes always go to primary +- stale-tolerant reads can go to replicas +- consistency-sensitive reads stay on primary or sticky session path + +#### Read-After-Write Consistency Issues + +If a user writes to the primary and immediately reads from a lagging replica, they may not see their update. + +Production strategies: + +- route same-user reads to primary for a short window after write +- use session stickiness +- check replica lag and avoid stale replicas for sensitive endpoints +- use version or timestamp based consistency logic + +GitHub-like or SaaS admin dashboards often need this because users expect immediate visibility after editing settings. + +### 13.3 Sharding + +Sharding means splitting data horizontally across multiple databases so one node does not hold or serve everything. + +#### Why Sharding Exists + +Replication helps availability and read scale, but it does not solve the core problem that one primary still handles all writes and all data volume. + +Sharding is how systems scale beyond one machine's storage, CPU, memory, or write throughput limits. + +#### Sharding Strategy Diagram + +```mermaid +flowchart TD + REQ[Application Request] --> ROUTER[Shard Router] + ROUTER --> S1[(Shard 1: users 1-1M)] + ROUTER --> S2[(Shard 2: users 1M-2M)] + ROUTER --> S3[(Shard 3: users 2M-3M)] + ROUTER --> S4[(Shard 4: users 3M+)] +``` + +#### Shard Key Selection + +The shard key decides where data goes. + +Good shard keys: + +- distribute traffic evenly +- align with common access patterns +- minimize cross-shard queries + +Bad shard keys: + +- create hotspots +- force frequent cross-shard fanout +- are hard to rebalance later + +Example: + +user-based sharding often works well when most queries are scoped to one user or tenant. + +#### Hotspot Problems + +Even a seemingly good shard key can fail if traffic is skewed. 
+ +Examples: + +- one celebrity user receives disproportionate traffic +- one tenant is much larger than all others +- sequential keys route many recent writes to the same shard + +#### Rebalancing Challenges + +As shards fill unevenly, data must be moved. + +This is difficult because: + +- data movement is expensive +- routing metadata changes must be coordinated +- in-flight traffic must still work during migration +- hot shards often need urgent relief without causing downtime + +#### Consistent Hashing Basics + +Consistent hashing reduces remapping when nodes are added or removed. + +It is popular in distributed caches and some storage systems because it minimizes movement compared to naive modulo-based partitioning. + +#### Cross-Shard Query Problems + +Queries that need data from multiple shards become expensive because the application or middleware must: + +- fan out requests +- merge results +- handle partial failures +- coordinate ordering and pagination + +#### Cross-Shard Transactions + +These are much harder than local transactions. + +Many systems avoid them through: + +- careful data ownership boundaries +- asynchronous workflows +- per-tenant or per-user consistency scopes + +#### Operational Complexity + +Sharding solves scale problems by introducing operational problems: + +- routing layer complexity +- uneven load distribution +- resharding complexity +- harder debugging and analytics +- backup and restore per shard + +Public-pattern intuition: + +Uber-like marketplace platforms and GitHub-scale multi-tenant systems often end up with some form of sharding because a single relational instance eventually stops being enough for global growth. + +### 13.4 Partitioning + +Partitioning means splitting data into logical chunks. It can exist inside one database system or across multiple nodes. + +#### Vertical Partitioning + +Split by columns or functionality. 
+ +Example: + +- user auth data in one database +- user analytics profile data in another + +Useful when different parts of the entity have different access, sensitivity, or scaling profiles. + +#### Horizontal Partitioning + +Split by rows. + +Example: + +- users 1-1M in one partition +- users 1M-2M in another + +This overlaps conceptually with sharding, but partitioning may happen within a single database product while sharding usually implies distribution across multiple database instances or clusters. + +#### Range Partitioning + +Partition rows by ranges, such as date ranges. + +Great for: + +- time-based data +- archival and pruning + +Risk: + +- recent range can become a hotspot + +#### Hash Partitioning + +Use a hash of the key to spread rows more evenly. + +Great for: + +- balancing load + +Tradeoff: + +- makes range queries less natural + +#### List Partitioning + +Partition by specific values such as region or tenant tier. + +#### Time-Based Partitioning + +Very common for logs, events, analytics, and audit tables. + +Why it matters: + +- easier retention policies +- easier archival +- partition pruning improves query efficiency + +#### Partition Pruning + +Partition pruning means the database skips partitions that cannot contain relevant data. + +Example: + +If a query asks for logs from last 7 days, the engine should avoid scanning last year's partitions. 
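The pruning decision itself is an interval-overlap check. The sketch below assumes hypothetical monthly partitions of a `logs` table and half-open date ranges. Real engines make this decision inside the query planner from partition metadata, so this only models the logic.

```python
from datetime import date, timedelta


def prune_partitions(partitions, query_start, query_end):
    """Return names of partitions whose [start, end) range overlaps [query_start, query_end)."""
    return [
        name
        for name, (start, end) in partitions.items()
        if start < query_end and query_start < end
    ]


# Hypothetical monthly partitions: name -> (inclusive start, exclusive end).
partitions = {
    "logs_2024_01": (date(2024, 1, 1), date(2024, 2, 1)),
    "logs_2025_02": (date(2025, 2, 1), date(2025, 3, 1)),
    "logs_2025_03": (date(2025, 3, 1), date(2025, 4, 1)),
}

today = date(2025, 3, 4)
last_7_days = prune_partitions(
    partitions, today - timedelta(days=7), today + timedelta(days=1)
)
print(last_7_days)
```

Only the two recent partitions are selected, so the 2024 partition is never scanned. The same check explains why queries that omit the partition key in their filter cannot be pruned.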
+ +#### Archival Strategies + +Old partitions can be: + +- moved to cheaper storage +- exported to object storage or a warehouse +- dropped based on retention policy + +### 13.5 Partitioning vs Sharding + +| Dimension | Partitioning | Sharding | +|---|---|---| +| Scope | Usually a logical split within one system or table family | Usually a split across multiple independent database instances or clusters | +| Goal | Manageability, pruning, performance | Capacity scale beyond one node and distribution of traffic | +| Operational cost | Lower | Higher | +| Query complexity | Often handled by one database engine | Often needs routing and cross-shard logic | + +### 13.6 Replication vs Sharding + +| Dimension | Replication | Sharding | +|---|---|---| +| Primary purpose | Availability and read scaling | Write scaling and data volume scaling | +| Data on each node | Copies of the same dataset (replicas may lag) | Different subsets of data | +| Main challenge | Staleness and failover | Routing and cross-shard complexity | + +## 14. Backups + +Backups exist because high availability is not the same as recoverability. + +Replication protects against node failure. It does not protect against every kind of bad write, accidental delete, corrupt deployment, or operator mistake. Replicas happily replicate mistakes too. 
+ +### 14.1 Why Backups Matter + +You need backups for: + +- accidental deletion +- data corruption +- ransomware or security incidents +- region loss +- bad migrations +- operator error + +### 14.2 Types of Backups + +| Type | Meaning | Best for | +|---|---|---| +| Snapshot | Point-in-time copy of storage volume or dataset | Fast restore of large datasets | +| Logical backup | Export of schemas and rows, often SQL dump style | Portability and selective restore | +| Physical backup | Raw database files or engine-level backup | Faster full restores, engine-specific | + +### 14.3 Point-in-Time Recovery (PITR) + +PITR combines a base backup with transaction logs so you can restore to a specific moment. + +This is critical when the problem is not "disk died" but "bad deployment deleted good data at 11:07 AM". + +### 14.4 Restore Testing + +Backups that have never been restored are assumptions, not backups. + +This is one of the most important real-world lessons in storage engineering. + +Common failure modes: + +- backup files exist but are corrupted +- restore time is far too slow for business requirements +- missing transaction logs make PITR impossible +- permissions or secrets needed for restore are unavailable +- application code no longer works with restored data layout + +### 14.5 Backup Verification and Retention + +Best practices: + +- verify backups automatically +- test restore procedures regularly +- define retention policy by business and compliance needs +- keep copies in separate failure domains +- document RPO and RTO + +Where: + +- RPO: how much data loss is acceptable +- RTO: how long recovery may take + +### 14.6 Disaster Recovery Basics + +Disaster recovery is broader than backups. It includes: + +- alternate region strategy +- restore runbooks +- DNS or traffic failover +- infrastructure recreation +- dependency recovery order + +Interview point: + +If you mention backups but not restore testing, you are leaving the operational story incomplete. 
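One way to make restore testing measurable is to record each drill and compare it against the documented RPO and RTO. The sketch below is a hypothetical helper, and its field names are assumptions for illustration. It treats data loss as the gap between the incident and the newest recovered data, and recovery time as the gap between the incident and the moment service is restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RestoreDrill:
    incident_time: datetime     # moment the simulated failure happened
    recovered_up_to: datetime   # newest data present after the restore
    service_restored: datetime  # moment the restored system served traffic again


def evaluate_drill(drill, rpo, rto):
    """Compare one drill's measured data loss and recovery time with the targets."""
    data_loss = drill.incident_time - drill.recovered_up_to
    recovery_time = drill.service_restored - drill.incident_time
    return {
        "data_loss": data_loss,
        "recovery_time": recovery_time,
        "meets_rpo": data_loss <= rpo,
        "meets_rto": recovery_time <= rto,
    }


drill = RestoreDrill(
    incident_time=datetime(2025, 3, 4, 11, 7),
    recovered_up_to=datetime(2025, 3, 4, 11, 5),
    service_restored=datetime(2025, 3, 4, 13, 30),
)
result = evaluate_drill(drill, rpo=timedelta(minutes=5), rto=timedelta(hours=1))
print(result["meets_rpo"], result["meets_rto"])
```

In this made-up drill the data loss target is met, but recovery takes longer than the one-hour target. A gap like this only shows up when restores are actually exercised.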
+ +## 15. Migrations + +Migrations are how storage evolves without breaking production. + +This is where many outages come from because schema changes are easy in development and dangerous in live systems with rolling deployments and old code versions still running. + +### 15.1 Schema Migrations + +Schema migrations change database structure: + +- add column +- remove column +- add table +- create index +- rename field +- change data type + +### 15.2 Backward-Compatible Migrations + +In production, application and database must remain compatible during rollout. + +Safe default rule: + +- first make schema changes that old and new code can both tolerate +- then deploy code that uses the new schema +- only later remove old structures + +### 15.3 Expand-and-Contract Pattern + +This is a standard zero-downtime migration pattern. + +1. Expand: add new schema elements without removing old ones. +2. Dual-write or backfill data if needed. +3. Switch reads to the new structure. +4. Contract: remove old columns or tables only after full cutover. + +### 15.4 Zero-Downtime Migration Example + +Suppose you want to rename `full_name` to `display_name`. + +Unsafe approach: + +- drop `full_name` +- add `display_name` +- deploy code + +Safe approach: + +1. Add `display_name`. +2. Deploy code that writes both fields and can read either. +3. Backfill old rows. +4. Switch reads fully to `display_name`. +5. Remove `full_name` later. + +### 15.5 Data Migrations + +Data migrations move or transform actual data, not just schema. + +Risks: + +- long-running locks +- replication lag +- huge write amplification +- application reading partially migrated state + +Best practice: + +- run in batches +- make the process idempotent +- measure lag and throughput +- provide rollback or pause controls + +### 15.6 Rolling Deployments Impact + +During rolling deploys, old and new app versions may run simultaneously. 
+ +That means: + +- database changes must tolerate mixed versions +- writes must remain compatible from both versions +- feature flags often help separate schema rollout from behavior rollout + +### 15.7 Rollback Strategies + +Rollback is easy only before destructive steps. + +Strong teams design migrations so that: + +- code can be rolled back without immediate schema breakage +- destructive changes are delayed until confidence is high +- backfills can resume safely + +### 15.8 Migration Failure Scenarios + +| Failure | Why it happens | Mitigation | +|---|---|---| +| Table lock outage | Large DDL on hot table | Use online schema change strategy or maintenance plan | +| Replica lag spike | Backfill or index build overwhelms replication | Throttle migration, observe lag | +| Mixed-version incompatibility | New code expects schema that old DB path lacks | Expand first, contract later | +| Partial backfill | Job crashed midway | Idempotent batch processing and progress tracking | +| Rollback impossible | Old columns already dropped | Delay destructive cleanup | + +Production note: + +At companies like GitHub, Stripe, Amazon, and large SaaS platforms, safe migration discipline is not optional. It is core operational hygiene. + +## 16. How These Systems Connect in Real Architectures + +The strongest backend designs do not ask, "SQL or NoSQL?" They ask, "Which storage model belongs to which workload?" 
+ +### 16.1 Typical SaaS Architecture + +```mermaid +flowchart LR + U[User Request] --> API[API Service] + API --> AUTH[(Relational DB: Users / Billing / Permissions)] + API --> CACHE[(Redis-style Cache)] + API --> SEARCH[(Search / Indexing Store)] + API --> EVENTS[(Event Stream)] + EVENTS --> ANALYTICS[(Warehouse / OLAP)] + EVENTS --> READMODEL[(Denormalized Read Models)] +``` + +What lives where: + +- relational DB for core source-of-truth entities +- cache for hot reads and sessions +- event stream for async workflows and fanout +- warehouse for large analytical queries +- search index for text retrieval +- denormalized read models for dashboards or feeds + +### 16.2 Example System Patterns + +#### Google-style lesson + +When you need globally distributed SQL with strong consistency, a system like Spanner demonstrates that it is possible, but expensive and operationally sophisticated. The lesson is not "always use globally consistent SQL". The lesson is that strong consistency at global scale requires serious infrastructure investment. + +#### Netflix-style lesson + +Large-scale streaming and metadata systems often rely heavily on caches and wide-column stores because read volume and availability requirements are enormous. The lesson is that relational correctness is not the only objective; availability and latency matter too. + +#### Uber-style lesson + +Marketplace systems deal with geospatial lookups, event streams, denormalized views, high write rates, and many asynchronous workflows. The lesson is that one storage engine rarely fits all workloads inside the same product. + +#### Amazon-style lesson + +The Dynamo lineage shows how high availability, partition tolerance, and simple key-based access can drive very different storage choices than a classic relational core. The lesson is to design around failure and scale from the start when the workload demands it. 
+ +#### GitHub-style lesson + +Developer platforms still rely heavily on relational storage for core entities, permissions, and consistency-sensitive workflows, but they complement it with replicas, caches, partitioning, and background jobs. The lesson is that relational databases remain central even at significant scale. + +#### Stripe-style lesson + +Money movement systems keep the ledger and transactional truth in strongly consistent storage, then derive downstream analytics and reporting elsewhere. The lesson is that the source of truth must match the strictest correctness requirement, not the most convenient scale story. + +## 17. Common Interview Discussion Themes + +### 17.1 SQL vs NoSQL + +Good answer structure: + +1. Start with access patterns and correctness needs. +2. Explain why relational is usually default for business workflows. +3. Explain when NoSQL wins because of data shape or scale pattern. +4. Mention that many real systems use both. + +### 17.2 Replication vs Sharding + +Good answer: + +- replication copies the same data for availability and read scaling +- sharding splits the data to scale writes and storage +- most large systems eventually use both + +### 17.3 Normalization vs Denormalization + +Good answer: + +- normalize source-of-truth entities for correctness +- denormalize proven hot read paths for performance +- discuss consistency maintenance and backfill cost + +### 17.4 Transactions in Microservices + +Good answer: + +- keep strong ACID boundaries local when possible +- avoid distributed transactions for every workflow +- use sagas or asynchronous orchestration where acceptable +- discuss idempotency and compensating actions + +### 17.5 What Breaks at Scale + +Interviewers often want to hear these failure modes: + +- hot keys and hot partitions +- replica lag and stale reads +- slow queries due to missing indexes +- lock contention and deadlocks +- painful cross-shard operations +- unsafe migrations +- backups that have never been 
restored + +If you can talk about those fluently, you sound much more like a production engineer. + +## 18. Best Practices Checklist + +- Start from access patterns, not technology hype. +- Use relational databases by default for transactional business systems unless the workload clearly argues otherwise. +- Design indexes from real queries, not guesswork. +- Keep transactions short and explicit. +- Normalize core data first; denormalize intentionally, not accidentally. +- Assume replica lag exists and design around it. +- Choose shard keys carefully because changing them later is painful. +- Separate OLTP from heavy analytics. +- Treat backup restore testing as a real engineering responsibility. +- Use backward-compatible, zero-downtime migration patterns. +- Measure real bottlenecks before introducing new storage technologies. + +## 19. Common Mistakes + +- choosing NoSQL too early for simple SaaS CRUD workloads +- using one database for transactions, analytics, search, caching, and feeds all at once +- adding indexes everywhere without understanding write cost +- building schemas around current ORM classes instead of durable domain boundaries +- ignoring replication lag in user experience design +- trying to solve bad query design with bigger hardware forever +- denormalizing without a plan for consistency repair +- sharding before exhausting simpler options +- believing replicas are backups +- running destructive migrations without compatibility strategy + +## 20. Final Mental Model + +Think about storage decisions in this order: + +1. What are the core entities and invariants? +2. What are the hottest reads and writes? +3. What consistency guarantees are actually required? +4. What data model matches the workload best? +5. How will indexes, replication, caching, and partitioning support that workload? +6. What fails first as traffic grows? +7. How will you back up, restore, and migrate safely? 
+ +If you can answer those questions clearly, you can usually handle both interview discussions and real production design work. + +The most important practical lesson is simple: data storage is not just about where bytes live. It is about preserving correctness while delivering latency, scale, and operability under failure. diff --git a/systems design/4.perfLayer.md b/systems design/4.perfLayer.md new file mode 100644 index 0000000..3defea7 --- /dev/null +++ b/systems design/4.perfLayer.md @@ -0,0 +1,1809 @@ +# Performance Layer + +The performance layer is the part of system design that turns a system from merely correct into fast, cost-efficient, and resilient under real traffic. In interviews, candidates often describe databases, queues, and load balancers, but they stop short of explaining how the system stays responsive when one hot endpoint receives millions of requests, when one viral product page suddenly becomes the top key in the fleet, or when static assets must be served globally at low latency. + +That gap is the performance layer. + +At a practical level, the performance layer is made of techniques and systems that sit between raw demand and expensive work. It includes in-process caches, distributed caches like Redis or Memcached, CDNs, edge caches, TTL strategy, eviction policy, and invalidation mechanisms. These components exist because real backends cannot afford to recompute, reread, or retransmit everything from the source of truth on every request. + +This guide is written for two goals at once: + +- interview preparation, where you need to explain tradeoffs clearly +- real backend engineering, where you need to build systems that remain stable at scale + +The focus is not on memorizing buzzwords. The focus is understanding why these systems exist, how they work internally, what breaks at scale, and how to discuss them like an engineer who has operated them. 
+ +Examples in this guide are generalized from common industry patterns and public engineering discussions from companies such as Google, Netflix, Uber, Amazon, GitHub, Stripe, and large SaaS platforms. + +## 1. Big Picture: What the Performance Layer Actually Does + +Every backend has some expensive path: + +- reading from a database +- computing personalized recommendations +- rendering a large page or API payload +- resizing images +- loading product catalog data +- checking permissions repeatedly +- serving static assets to users across continents + +If every request goes all the way to the expensive source, the system eventually fails one of these goals: + +- latency becomes too high +- database load becomes too high +- compute cost becomes too high +- throughput becomes too low +- spikes become unmanageable + +The performance layer exists to absorb repeated work. + +### 1.1 Performance Layer in a Real Architecture + +```mermaid +flowchart LR + U[User / Browser] --> B[Browser Cache] + B --> DNS[DNS / Global Routing] + DNS --> CDN[CDN Edge] + CDN --> LB[Load Balancer / API Gateway] + LB --> APP[Application Service] + APP --> LC[Local In-Process Cache] + APP --> RC[(Redis / Memcached)] + APP --> DB[(Primary Database)] + DB --> REP[(Read Replicas / Search / Derived Stores)] + DB --> BUS[Event Bus / Invalidation Stream] + BUS --> RC + BUS --> CDN +``` + +This diagram captures an important idea: performance is not a single cache. It is a stack. + +- the browser may cache assets or API responses +- the CDN may answer without touching your origin +- the application may use local in-memory caching for extremely hot keys +- a distributed cache may prevent repeated database reads +- invalidation events keep caches aligned with changing data + +### 1.2 Why Interviewers Care About This Layer + +Interviewers use performance-layer questions to test whether you understand the difference between a toy design and a production design. 
+ +A weak answer sounds like this: + +"I would use a database and maybe Redis for caching." + +A strong answer sounds like this: + +"I expect read-heavy traffic with hot keys, so I would place a distributed cache in front of the database, possibly keep a small local cache inside each service instance for ultra-hot objects, use TTL plus event-driven invalidation to balance freshness and load, and put static assets behind a CDN to reduce origin traffic globally. Then I would discuss what happens during cache misses, stampedes, stale reads, and regional failover." + +That answer shows system-level thinking. + +### 1.3 Core Mental Model + +The performance layer trades one form of complexity for another. + +It improves: + +- latency +- throughput +- cost +- resilience to spikes + +But it introduces: + +- stale data risk +- invalidation complexity +- memory pressure +- partial failure modes +- operational tuning work + +If you remember one sentence from this guide, remember this: + +Performance systems are valuable because they turn expensive operations into cheap lookups, but they make correctness and freshness harder. + +## 2. Caching Fundamentals + +Caching is the core concept of the performance layer. + +### 2.1 What Caching Is + +A cache stores the result of an expensive operation so that future requests can avoid doing the expensive work again. + +That expensive work might be: + +- a database query +- an API response assembly step +- a rendered HTML fragment +- a permission lookup +- a session lookup +- a computed leaderboard +- a static file fetch from origin + +The basic idea is simple, but the system effect is huge. If a result is requested many times, caching changes the system from repeatedly paying the full cost to paying it once and reusing it. 
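The "pay once, reuse many times" effect can be sketched with a get-or-load wrapper around an expensive function. `GetOrLoadCache` and `load_profile` are hypothetical names. The sketch deliberately omits TTLs, eviction, and invalidation so that it isolates the basic idea.

```python
class GetOrLoadCache:
    """Return a cached value if present; otherwise load it from the source and remember it."""

    def __init__(self, loader):
        self.loader = loader  # the expensive operation, e.g. a database query
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self.loader(key)  # full cost is paid only on a miss
        self._store[key] = value
        return value


source_calls = []


def load_profile(user_id):
    # Stand-in for an expensive lookup; records each call to the "source of truth".
    source_calls.append(user_id)
    return {"id": user_id}


cache = GetOrLoadCache(load_profile)
for _ in range(1000):
    cache.get("user:1")

print(len(source_calls), cache.hits, cache.misses)
```

A thousand requests for the same key reach the source once. Everything that makes production caching hard (freshness, memory limits, concurrent misses) comes from relaxing the simplifications made here.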
+ +### 2.2 Why Caching Exists + +| Goal | What caching improves | Real production effect | +|---|---|---| +| Reduce latency | Serves data from memory or from a nearby edge | Reads that Redis often serves in about a millisecond may take tens to hundreds of milliseconds from a database or remote service | +| Reduce database load | Prevents repeated reads for the same objects | Fewer DB connections, less lock contention, lower CPU, fewer read replicas | +| Reduce compute cost | Avoids re-running expensive business logic or rendering | Lower application CPU and lower cloud spend | +| Improve scalability | Lets the same backend support more traffic | A service that would saturate at 20k RPS may survive 10x more if most reads hit cache | +| Smooth spikes | Absorbs bursts on hot keys | Viral traffic becomes manageable instead of overwhelming the source of truth | + +### 2.3 Hot Paths and Hot Data + +Most traffic is not evenly distributed. + +In real systems, a small fraction of endpoints, users, products, or assets often receives a large fraction of total traffic. This is why caching is so effective. + +Examples: + +- an e-commerce homepage and a few trending product pages get disproportionate reads +- a GitHub repository landing page gets far more traffic after a release or public announcement +- a Stripe dashboard loads the same account metadata repeatedly during one user session +- a ride-sharing system repeatedly reads nearby-driver state for a busy downtown area + +This concentration of demand creates hot paths and hot data. + +- hot path: an endpoint or execution path hit extremely often +- hot data: specific keys or objects requested repeatedly + +Performance engineering often starts by identifying those hot spots and caching them before scaling the entire system blindly.
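
A quick way to get intuition for skewed demand is to model key popularity and ask how much traffic the top keys cover. The sketch below assumes a Zipf-like distribution, where the key at rank `r` receives weight `1 / r**exponent`. That distribution is an assumption for illustration, not something measured from a real system, and `zipf_traffic_share` is a hypothetical helper.

```python
def zipf_traffic_share(num_keys: int, top_n: int, exponent: float = 1.0) -> float:
    """Fraction of total requests that target the top_n most popular keys,
    assuming popularity follows a Zipf-like curve (weight = 1 / rank**exponent)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return sum(weights[:top_n]) / sum(weights)


# Under this model, how much traffic do the hottest 1% of keys receive?
share = zipf_traffic_share(num_keys=100_000, top_n=1_000)
print(f"top 1% of keys receive {share:.1%} of requests")
```

With these inputs the top 1 percent of keys cover roughly 60 percent of requests, which is why caching a small hot set can remove most of the load. Real traffic should be measured rather than assumed to follow this curve.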
+ +### 2.4 Cache Hit vs Cache Miss + +| Term | Meaning | Why it matters | +|---|---|---| +| Cache hit | The requested item is found in cache | Fast path, cheap path | +| Cache miss | The item is not in cache | Slow path, must fetch from source | +| Hit ratio | Fraction of requests served from cache | One of the main health metrics for a cache-backed system | +| Miss penalty | Extra latency and source load caused by a miss | Critical when the backend is expensive or fragile | + +You should think of a cache miss as more than a slower request. A miss is also a source-of-truth request. At scale, misses multiply into pressure on databases, services, and storage layers. + +If 95 percent of requests hit cache, the remaining 5 percent still define whether the system stays alive during a cache flush or deployment. + +### 2.5 Cache Lifecycle + +```mermaid +flowchart LR + REQ[Request Arrives] --> LOOKUP{Item In Cache?} + LOOKUP -- Yes --> HIT[Return Cached Value] + LOOKUP -- No --> LOAD[Load From Source] + LOAD --> STORE[Store In Cache] + STORE --> RESP[Return Response] + HIT --> TTL[Wait Until TTL / Invalidation / Eviction] + RESP --> TTL + TTL --> EXPIRE[Item Expires or Is Removed] + EXPIRE --> LOOKUP +``` + +This looks circular because it is. A cache is not a one-time optimization. It is a continuous lifecycle of population, use, staleness, and removal. + +### 2.6 Cache Warming and Cold Starts + +Cache warming means pre-populating a cache before traffic arrives or before a new deployment begins serving traffic. + +Cold start means the cache is empty or mostly empty. 
+ +Why cold starts are dangerous: + +- traffic suddenly falls through to the database +- p99 latency jumps sharply +- a previously healthy backend gets overloaded +- autoscaling can make it worse because new instances start with empty local caches + +Typical warming strategies: + +- prefill the top N hot keys after deployment +- replay recent hot key logs into the cache +- keep a warm standby cache fleet during migration +- gradually shift traffic to fresh instances + +Production example: + +An e-commerce system may warm the homepage, popular category pages, top products, and pricing metadata before turning on a new region. A SaaS dashboard may warm account summaries and permission maps for large enterprise tenants. + +### 2.7 Multi-Layer Caching + +Most large systems use more than one cache layer. + +| Layer | Where it lives | Strengths | Weaknesses | +|---|---|---|---| +| Browser cache | On the client | Zero origin cost, extremely low latency | Harder to invalidate precisely, less control | +| CDN cache | At edge POPs | Global latency reduction, origin offload | Works best for cacheable and moderately shared content | +| Local cache | Inside service process | Fastest server-side lookup, no network hop | Small, per-instance inconsistency, lost on restart | +| Distributed cache | Remote shared cache like Redis or Memcached | Shared across instances, larger capacity | Network hop, operational overhead | + +Multi-layer caching exists because the cheapest cache is the one closest to the requester. The closer you can answer, the less network, compute, and backend work you do. 
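
The layered lookup order can be sketched as follows. This is a simplified model: plain dictionaries stand in for the local cache, the shared distributed cache, and the database, and `LayeredCache` is a hypothetical class name. There is no TTL or invalidation here, only the read path.

```python
class LayeredCache:
    """Read path that checks the closest layer first and fills layers on the way back."""

    def __init__(self, shared_cache, database):
        self.local = {}              # per-instance, in-process cache
        self.shared = shared_cache   # stand-in for Redis or Memcached
        self.database = database     # stand-in for the source of truth
        self.stats = {"local": 0, "shared": 0, "source": 0}

    def get(self, key):
        if key in self.local:                 # closest and cheapest layer
            self.stats["local"] += 1
            return self.local[key]
        if key in self.shared:                # one network hop in a real system
            self.stats["shared"] += 1
            value = self.shared[key]
        else:                                 # most expensive path
            self.stats["source"] += 1
            value = self.database[key]
            self.shared[key] = value
        self.local[key] = value
        return value


database = {"product:1": {"name": "Keyboard"}}
shared = {}
instance_a = LayeredCache(shared, database)
instance_b = LayeredCache(shared, database)

instance_a.get("product:1")  # falls through to the database, fills both caches
instance_a.get("product:1")  # served from instance A's local cache
instance_b.get("product:1")  # B's local cache is cold, but the shared cache is warm
print(instance_a.stats, instance_b.stats)
```

Note that instance B never touches the database, which is the benefit of the shared layer, while each instance's local cache can drift independently, which is its consistency cost.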
+ +### 2.8 Local Cache vs Distributed Cache + +| Dimension | Local Cache | Distributed Cache | +|---|---|---| +| Latency | Extremely low | Low but includes network hop | +| Scope | Single process or instance | Shared across many instances | +| Consistency | Weak across instances | Better shared visibility | +| Capacity | Limited by process memory | Larger dedicated memory pool | +| Failure behavior | Lost on process restart | Survives app restarts but is its own dependency | +| Good for | Ultra-hot config, permission snapshots, small metadata | Shared sessions, popular entities, counters, rate limits | + +Strong production designs often combine them. For example: + +- local cache for 100 hottest objects per instance +- Redis for shared hot data across the fleet +- database for source of truth + +### 2.9 How Caching Changes Overall Architecture + +Caching does not just make a request faster. It changes the control flow of the system. + +Without caching: + +- every read goes to the source of truth +- latency scales with database or service performance +- spikes directly impact the backend + +With caching: + +- reads split into hit path and miss path +- miss handling becomes a core part of correctness +- invalidation and freshness become first-class architecture concerns +- partial outages can be hidden or amplified depending on cache behavior + +That is why caching is never "just add Redis." It is an architectural decision that affects reads, writes, deployments, incident response, and observability. + +### 2.10 Consistency Challenges and Stale Data Tradeoffs + +Caches are usually not the source of truth. That means there is always a risk that the cache contains old data. + +The key design question is not "Can stale data happen?" The answer is almost always yes. 
+ +The real questions are: + +- how stale can data be before users notice or correctness breaks +- how quickly do updates need to propagate +- what should happen if invalidation fails +- which data can tolerate eventual consistency and which cannot + +Examples: + +- stale profile photos are usually fine +- stale inventory counts can cause overselling +- stale permission or fraud state can become a security problem +- stale pricing can create financial or legal issues + +Strong engineers discuss stale data in business terms, not just technical terms. + +### 2.11 Failure Patterns Every Interviewer Expects + +| Failure pattern | What it means | What it looks like in production | Common mitigations | +|---|---|---|---| +| Cache stampede | Many requests miss the same key and all rebuild it | DB traffic spike after key expiry | Request coalescing, single-flight, locks, soft TTL, background refresh | +| Cache penetration | Requests repeatedly ask for keys that do not exist | Attack or bug causes endless DB misses | Negative caching, bloom filters, request validation | +| Cache avalanche | Many keys expire simultaneously | Sudden backend overload when a batch of TTLs ends together | TTL jitter, staggered expiration, warmup, traffic shaping | +| Hot key overload | One key becomes extremely popular | One Redis shard or service instance gets overloaded | Key replication, local cache, consistent hashing, hot key splitting | +| Stale data leak | Invalidation fails or is delayed | Users see old values after updates | Version checks, event-driven invalidation, bounded TTLs, read repair | + +These are not interview-only concepts. They show up in real incidents. 
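
As one illustration of the stampede mitigations in the table, here is a minimal single-process sketch of request coalescing (often called single-flight). `SingleFlightCache` and `slow_loader` are hypothetical names; a real deployment would coordinate across processes or machines, which this sketch does not attempt.

```python
import threading
import time


class SingleFlightCache:
    """Lets only one caller rebuild a missing key while the others wait for it."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self._values:              # fast path, no locking
            return self._values[key]
        with self._lock_for(key):            # only one rebuilder per key
            if key in self._values:          # another caller filled it while we waited
                return self._values[key]
            value = self._loader(key)
            self._values[key] = value
            return value


calls = {"count": 0}
calls_lock = threading.Lock()


def slow_loader(key):
    # Stand-in for an expensive database rebuild (assumption for this sketch).
    with calls_lock:
        calls["count"] += 1
    time.sleep(0.05)
    return f"value-for-{key}"


cache = SingleFlightCache(slow_loader)
threads = [threading.Thread(target=cache.get, args=("hot-key",)) for _ in range(20)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(calls["count"])  # 20 concurrent misses, one rebuild
```

Without the per-key lock, all 20 callers would have hit the loader at once, which is the stampede the table describes.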
+ +### 2.12 Best Practices and Common Mistakes + +Best practices: + +- cache only where the access pattern justifies it +- measure hit ratio, miss latency, key size, and source-of-truth fallback rate +- define ownership of cache invalidation explicitly +- use TTL jitter to avoid synchronized expiration storms +- keep serialized values compact and stable +- protect the miss path with rate limits, request coalescing, or backpressure + +Common mistakes: + +- caching everything without understanding access patterns +- caching data that must be strongly consistent without a freshness plan +- letting huge values or huge key cardinality blow up memory +- using one global TTL for all data types +- ignoring cache outage behavior +- forgetting that a cache flush can look like a DDoS against your database + +## 3. Redis + +Redis is one of the most widely used performance-layer systems in backend engineering. + +### 3.1 What Redis Is + +Redis is an in-memory data structure store commonly used as a distributed cache, fast key-value store, rate limiter, session store, coordination primitive, leaderboard engine, and lightweight stream or queue component. + +It became popular because it combines several valuable traits: + +- very low latency +- simple operational model for many use cases +- multiple built-in data structures +- high developer productivity +- strong ecosystem support + +In practice, Redis often becomes the first non-database data system teams add when they need more performance. + +### 3.2 Why Redis Is Widely Used + +Redis is useful because it covers many common backend problems with one fast system. 
+ +Examples: + +- caching user profiles or product metadata +- storing web sessions +- implementing rate limiting per user, token, or IP +- maintaining rolling counters and quotas +- computing leaderboards with sorted sets +- distributing lightweight invalidation or coordination events +- holding ephemeral state for jobs, retries, and workflows + +A big reason engineers like Redis is that it is often fast enough to move a system from struggling to healthy without a major architectural rewrite. + +### 3.3 In-Memory Architecture + +Redis keeps its primary working dataset in memory. That is the key reason it is fast. + +Compared to disk-backed databases: + +- memory access is much faster than disk I/O +- data structures can be updated quickly with minimal indirection +- many operations are constant time or logarithmic time + +But in-memory design also brings constraints: + +- memory is expensive compared to disk +- total dataset size is limited by RAM +- persistence must be carefully designed if data matters +- large keys and fragmentation can become operational issues + +Redis works best when the working set fits comfortably in memory and when the system is comfortable with Redis being a fast data layer rather than the ultimate durable source of truth. + +### 3.4 Single-Threaded Event Loop Concept + +Historically, Redis is famous for a mostly single-threaded command execution model. + +That sounds like a weakness at first, but the intuition matters. + +Why single-threaded command execution helped Redis: + +- avoids lock contention inside the core data path +- simplifies internal state management +- keeps operations predictable +- reduces complexity in common use cases + +The model is roughly: + +1. accept network requests +2. parse commands +3. execute commands against in-memory data structures +4. write responses back to clients + +Because the operations happen in memory and avoid heavy internal locking, the system can be extremely fast. 
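
The four steps above can be sketched as a toy loop. This is a teaching model only: it is not Redis's implementation or wire protocol, and `run_command_loop` is a hypothetical function. It shows why running one command at a time makes operations like an increment atomic without any locks.

```python
from collections import deque


def run_command_loop(pending, store):
    """Process queued (client_id, command_text) pairs strictly one at a time."""
    replies = []
    while pending:
        client_id, raw = pending.popleft()            # 1. accept the next request
        parts = raw.split()                           # 2. parse the command
        command = parts[0].upper() if parts else ""
        args = parts[1:]
        if command == "SET" and len(args) == 2:       # 3. execute against memory
            store[args[0]] = args[1]
            reply = "OK"
        elif command == "GET" and len(args) == 1:
            reply = store.get(args[0])
        elif command == "INCR" and len(args) == 1:
            new_value = int(store.get(args[0], "0")) + 1
            store[args[0]] = str(new_value)
            reply = new_value
        else:
            reply = "ERR unknown command"
        replies.append((client_id, reply))            # 4. write the response back
    return replies


store = {}
pending = deque([
    ("client-a", "SET visits 10"),
    ("client-b", "INCR visits"),
    ("client-a", "INCR visits"),
    ("client-b", "GET visits"),
])
replies = run_command_loop(pending, store)
print(replies)
```

Two clients increment the same counter, yet no update is lost, because no command ever runs in the middle of another one.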
+ +Important nuance for interviews: + +Modern Redis versions use threads for some I/O and background work, but commands on a given Redis instance are still executed one at a time on a single main thread. The point is not the slogan "single-threaded" by itself. The point is why that model worked: low coordination overhead on an in-memory data path. + +### 3.5 Persistence Basics + +Redis is often used as a cache, but it also supports persistence. + +The two core persistence ideas are: + +| Mechanism | What it does | Strengths | Weaknesses | +|---|---|---|---| +| RDB snapshot | Periodically writes a point-in-time snapshot to disk | Compact, good for backups and restart speed | Can lose recent writes between snapshots | +| AOF append-only file | Logs write operations as they happen | Better durability, more recent recovery | Larger files, rewrite complexity, more I/O | + +Many deployments combine both. + +Interview framing: + +If Redis is used only as a cache, persistence may be optional. If it stores critical ephemeral state like sessions, counters, or queues that you care about recovering, persistence and replication matter much more. + +### 3.6 Redis Data Structures and Why They Matter + +Redis is not just a string map. Its data structures are a large part of why it is useful.
+ +| Data structure | Intuition | Common operations | Production use cases | +|---|---|---|---| +| Strings | Simplest key to value mapping | GET, SET, INCR | General caching, counters, feature flags, serialized objects | +| Hashes | Small field map under one key | HGET, HSET | User/session metadata, grouped object fields | +| Lists | Ordered sequence with push/pop | LPUSH, RPUSH, LPOP | Simple queues, activity buffers, recent events | +| Sets | Unordered unique members | SADD, SISMEMBER | Membership checks, tags, deduplication | +| Sorted sets | Unique members with score ordering | ZADD, ZRANGE | Leaderboards, ranking, delayed tasks, time windows | +| Bitmaps | Bit-level state compactly stored | SETBIT, BITCOUNT | Presence flags, lightweight analytics, feature rollout markers | +| Streams | Append-only log with consumer groups | XADD, XREADGROUP | Event pipelines, work distribution, ordered consumption | + +#### Strings + +Strings are the default choice for caching serialized values such as JSON or protobuf blobs. + +Why teams use them: + +- easiest operationally +- flexible schema at the application layer +- supports TTL directly +- works well for counters with atomic increment operations + +#### Hashes + +Hashes are useful when you want multiple fields grouped under one logical key. They can reduce duplication and sometimes improve ergonomics for partial field access. + +Example: + +- `session:123` with fields like `user_id`, `expires_at`, `role` + +#### Lists + +Lists are good for queue-like behavior, but teams should be careful not to overuse Redis lists as a full durable queue platform when stronger guarantees are required. + +#### Sets + +Sets give fast membership testing. + +Example: + +- which users have access to a beta feature +- which object IDs were already processed + +#### Sorted Sets + +Sorted sets are one of Redis's most powerful primitives. + +They maintain a set of members ordered by score. 
This makes them ideal for: + +- leaderboards +- ranking systems +- top-N queries +- sliding windows for rate limiting +- scheduling delayed tasks by timestamp + +#### Bitmaps + +Bitmaps are memory-efficient when you need large boolean state spaces. + +Example: + +- whether a user ID belongs to a cohort +- whether an event occurred on a given day + +#### Streams + +Streams provide an append-only structure with consumer groups and replay semantics. They are useful when you want lightweight log-like behavior in Redis. + +They are helpful, but you should not automatically replace Kafka or other durable log systems with Redis Streams for large-scale, long-retention event pipelines. + +### 3.7 Pub/Sub Basics + +Redis Pub/Sub allows publishers to send messages to channels and subscribers to receive them. + +This is useful for lightweight fan-out such as: + +- cache invalidation notifications +- internal live updates +- ephemeral coordination signals + +But Pub/Sub is not durable messaging. If a subscriber is down, messages can be missed. That makes it suitable for best-effort signaling, not for business-critical guaranteed delivery. + +### 3.8 Distributed Locks Basics + +Redis is often used for lightweight distributed locking. + +Typical use case: + +- ensure only one worker rebuilds a hot cache key +- avoid duplicate job execution +- coordinate a small critical section across nodes + +The basic pattern is setting a key with a TTL only if it does not already exist. + +Important caution: + +Distributed locking is easy to misuse. If the lock expires too early, if clients pause, or if ownership is not verified on release, correctness bugs appear. For critical correctness, database transactions or purpose-built coordination systems may be safer. 
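
The lock pattern and its main pitfall can be sketched with an in-memory stand-in. `FakeKeyValueStore`, `acquire_lock`, and `release_lock` are hypothetical names, and time is passed in explicitly as `now` so the expiry behavior is easy to follow. In real Redis, acquisition is typically a `SET` with the `NX` and `PX` options, and the release check is usually done atomically on the server, for example with a small script. This sketch only models the semantics.

```python
import uuid


class FakeKeyValueStore:
    """In-memory stand-in for a cache server with per-key expiry."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_if_absent(self, key, value, ttl, now):
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False                      # someone else holds a live lock
        self._data[key] = (value, now + ttl)
        return True

    def delete_if_value_matches(self, key, value, now):
        current = self._data.get(key)
        if current is None or current[1] <= now or current[0] != value:
            return False                      # missing, expired, or owned by another caller
        del self._data[key]
        return True


def acquire_lock(store, name, ttl, now):
    token = uuid.uuid4().hex                  # unique token proves ownership later
    return token if store.set_if_absent(name, token, ttl, now) else None


def release_lock(store, name, token, now):
    return store.delete_if_value_matches(name, token, now)


store = FakeKeyValueStore()
lock_name = "rebuild:product:1"

token_a = acquire_lock(store, lock_name, ttl=5.0, now=0.0)   # worker A wins
token_b = acquire_lock(store, lock_name, ttl=5.0, now=1.0)   # worker B is refused
token_c = acquire_lock(store, lock_name, ttl=5.0, now=6.0)   # A's lock expired, C wins

# Worker A wakes up late and tries to release a lock it no longer owns.
stale_release = release_lock(store, lock_name, token_a, now=7.0)
valid_release = release_lock(store, lock_name, token_c, now=7.0)
print(stale_release, valid_release)
```

The ownership token is what stops worker A from deleting worker C's lock. It does not stop A from continuing its work after the lock expired, which is why TTL-based locks are a poor fit when correctness is critical.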
+ +### 3.9 Common Redis Use Cases + +| Use case | Why Redis fits | +|---|---| +| Rate limiting | Atomic increments and expirations are simple and fast | +| Session storage | Low latency key lookup with TTL | +| Leaderboards | Sorted sets make ranking natural | +| Caching | Memory speed plus TTL support | +| Lightweight queues | Lists or streams for simple work pipelines | +| Token or OTP storage | Fast expiry-based ephemeral data | +| Idempotency keys | Short-lived state for duplicate request protection | + +#### Rate Limiting + +Redis is common for rate limiting because counters and expirations are easy to implement atomically. + +Examples: + +- 100 requests per minute per API key +- 5 login attempts per 10 minutes per account +- per-IP abuse protection at API gateway or edge layer + +#### Session Storage + +Redis is widely used for session storage because sessions are read frequently, written occasionally, and usually have natural expiration. + +Typical SaaS pattern: + +- app server reads session by token +- Redis returns session data quickly +- TTL naturally removes expired sessions + +#### Leaderboards + +Games, social apps, and competition systems often use sorted sets for leaderboards because rank queries and top-N retrieval are natural operations. + +#### Queues Basics + +Redis can be used for simple queueing, retries, and scheduled jobs. + +It is often a good fit when: + +- throughput is moderate +- retention is short +- operational simplicity matters + +It is a weaker fit when: + +- you need long retention +- you need strong replay guarantees +- event history matters deeply +- consumer scaling and durability are primary concerns + +### 3.10 Replication Basics + +Redis commonly uses primary-replica replication. + +Why replicas exist: + +- improve read scale +- improve availability +- reduce data loss risk during failure + +Tradeoff: + +replication is typically asynchronous, so replicas may lag. That means stale reads are possible. 
+ +For many cache use cases, that is acceptable. For correctness-sensitive use cases, that must be discussed explicitly. + +### 3.11 Sentinel Basics + +Redis Sentinel monitors Redis instances and helps automate failover for primary-replica setups. + +What Sentinel does: + +- health checks +- failure detection +- leader election among Sentinel nodes +- promoting a replica to primary +- updating clients or discovery mechanisms + +Sentinel matters when you want high availability without full Redis Cluster complexity, especially for simpler primary-replica deployments. + +### 3.12 Redis Cluster Basics + +Redis Cluster provides sharding across multiple nodes. + +Why it exists: + +- a single Redis node has memory and throughput limits +- large workloads need horizontal scale + +Cluster distributes keys across hash slots. That spreads memory and traffic across nodes. + +Tradeoffs: + +- operations spanning multiple keys become more constrained +- some application logic must be shard-aware +- hot keys can still overload one shard +- operational complexity increases + +Cluster helps with capacity and throughput, but it does not magically eliminate key-distribution problems. + +### 3.13 Memory Management Considerations + +Redis performance problems often become memory problems. + +Things engineers must watch: + +- maxmemory limits +- eviction behavior under pressure +- large keys or huge collections +- fragmentation overhead +- persistence overhead during fork or rewrite +- replication buffers +- serialization bloat + +Bad Redis incidents often come from not respecting memory reality. 
+ +Examples: + +- storing enormous JSON blobs under one key +- letting key cardinality grow without bounds +- forgetting that snapshots or AOF rewrites need extra memory headroom +- assuming TTL means memory disappears immediately and uniformly + +### 3.14 When Redis Should Not Be Used + +Do not use Redis when: + +- the dataset does not fit comfortably in memory +- you need the primary source of truth for large durable data +- you need complex relational queries or joins +- you need very strong durability guarantees with minimal write loss tolerance +- you need long-lived event storage and replay at log-system scale +- you cannot tolerate cache/data loss but are treating Redis like a cheap database + +Redis is excellent, but many outages come from stretching it past its natural use case. + +### 3.15 Real-World Patterns + +Generalized production patterns you will see repeatedly: + +- Amazon-like e-commerce systems cache product and pricing metadata, but keep order and payment state in durable databases +- GitHub-like systems use caching for repository page composition and rate limiting, but not as the source of truth for repository metadata +- Stripe-like systems may use Redis for short-lived idempotency, fraud throttles, or session-like state, while preserving financial correctness in durable transactional stores +- Uber-like systems use fast data systems for hot operational state and rate control, while durable systems preserve business records and historical data + +## 4. Memcached + +Memcached is another classic distributed caching system. + +### 4.1 What Memcached Is + +Memcached is a high-performance, memory-only, distributed cache built around a simpler model than Redis. + +It is focused primarily on one job: caching values in memory and serving them fast. + +That focus is why many companies historically used it heavily for large-scale read caching. 
+ +### 4.2 How Memcached Differs in Spirit from Redis + +Redis evolved into a multi-purpose in-memory data system. + +Memcached stayed closer to a simple cache appliance. + +That means: + +- fewer built-in data structures +- less feature breadth +- simpler mental model +- often lower overhead for straightforward cache workloads + +### 4.3 Simple Distributed Caching Model + +A classic Memcached deployment is made of many independent cache nodes. The client typically decides which node holds a given key using hashing. + +This model is simple: + +- key arrives at the application +- application hashes the key +- application sends request to the selected Memcached node +- node stores or returns the value + +There is usually less server-side coordination than in more feature-rich clustered systems. + +### 4.4 Memory-Only Behavior + +Memcached is memory-only. It is not designed as a durable store. + +This is important conceptually: + +- it is a pure performance layer +- if it restarts, cached data is gone +- that is acceptable because the source of truth should be elsewhere + +This simplicity is powerful when your cache is truly disposable. + +### 4.5 Slab Allocation Basics + +One important internal concept in Memcached is slab allocation. + +The cache groups memory into classes of fixed-size chunks so that similarly sized objects can be stored efficiently. + +Why this exists: + +- general-purpose memory allocation can fragment under heavy cache churn +- cache workloads often involve huge numbers of similarly sized objects +- fixed allocation classes improve speed and predictability + +Tradeoff: + +- if object sizes do not fit slab classes well, memory can be wasted through internal fragmentation + +This is a good example of a design optimized specifically for caching rather than for general-purpose data structures. + +### 4.6 Cache-Focused Design + +Memcached is intentionally narrow. 
+ +Its strength is that it does not try to be a queue, stream platform, lock manager, or ranked index. It tries to be a very fast shared cache. + +This makes it attractive when the problem really is just: + +- store hot objects in memory +- retrieve them quickly +- let the app refill them on misses + +### 4.7 Common Production Use Cases + +Memcached is commonly used for: + +- page fragment caching +- session-like ephemeral web data +- query result caching +- product or profile object caching +- large-scale read-heavy web workloads where durability is irrelevant + +Historically, many large web companies used Memcached aggressively in front of databases for exactly this reason. + +### 4.8 Scaling Characteristics + +Memcached scales horizontally in a straightforward way because nodes are relatively independent. + +Strengths: + +- easy to add more cache capacity +- predictable use for simple key-value caching +- low complexity for read-heavy workloads + +Weaknesses: + +- fewer built-in coordination features +- no rich server-side data structures +- less helpful when the application wants more than plain caching + +### 4.9 Limitations Compared to Redis + +Compared to Redis, Memcached generally has: + +- less feature breadth +- less support for rich data structures +- no native persistence model for recovering data +- fewer coordination-oriented use cases + +But that narrower design can be a feature, not a bug, when simplicity is what you want. 
+ +### 4.10 Redis vs Memcached + +| Dimension | Redis | Memcached | +|---|---|---| +| Primary identity | General-purpose in-memory data store | Pure distributed cache | +| Data structures | Rich: strings, hashes, lists, sets, sorted sets, streams, more | Mostly simple key-value | +| Persistence | Optional RDB/AOF | Memory-only | +| Coordination features | Pub/Sub, scripts, counters, locks, streams | Minimal | +| Operational simplicity for pure cache | Good, but broader feature set | Often very simple | +| Memory efficiency for basic cache workloads | Good, workload dependent | Historically attractive for pure cache cases | +| Best fit | Cache plus broader backend primitives | Straight shared cache at scale | + +### 4.11 When Companies Choose One Over the Other + +Choose Redis when: + +- you want one fast system for caching plus rate limits, counters, sessions, or leaderboards +- you need richer data types +- you want optional persistence or replication features + +Choose Memcached when: + +- the problem is pure disposable caching +- the workload is straightforward key-value object caching +- simplicity and cache-specific behavior matter more than feature breadth + +In interviews, do not answer this as a popularity contest. Answer it as a workload decision. + +## 5. Cache Access Patterns + +The cache technology is only half the story. The access pattern determines behavior, consistency, and failure modes. + +### 5.1 Cache-Aside Pattern + +Cache-aside is the most common caching pattern in production systems. + +The idea is simple: + +1. application reads from cache first +2. if the key exists, return it +3. if it does not exist, read from database or source of truth +4. store the result in cache +5. return it to the caller + +This is also called lazy loading because the cache is filled on demand. + +#### Why Cache-Aside Exists + +It is popular because it is simple, flexible, and keeps the source of truth unchanged. The application decides when and what to cache. 
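
The five read steps above can be sketched as follows. `CacheAsideRepository` is a hypothetical class, dictionaries stand in for the cache and the database, and the `write` method uses delete-on-write, which is only one of several possible invalidation choices.

```python
import time


class CacheAsideRepository:
    """Application-managed cache: read through on miss, invalidate on write."""

    def __init__(self, database, ttl_seconds=60.0, clock=time.monotonic):
        self._db = database           # stand-in for the source of truth
        self._cache = {}              # key -> (value, expires_at)
        self._ttl = ttl_seconds
        self._clock = clock
        self.db_reads = 0

    def read(self, key):
        entry = self._cache.get(key)                      # 1. check the cache first
        if entry is not None and entry[1] > self._clock():
            return entry[0]                               # 2. hit: return it
        self.db_reads += 1
        value = self._db[key]                             # 3. miss: read the source
        self._cache[key] = (value, self._clock() + self._ttl)  # 4. store with TTL
        return value                                      # 5. return to the caller

    def write(self, key, value):
        self._db[key] = value          # source of truth changes first
        self._cache.pop(key, None)     # then drop the cached copy


db = {"user:1": "Ada"}
repo = CacheAsideRepository(db)

first = repo.read("user:1")    # miss, loads from the database
second = repo.read("user:1")   # hit, no database read
repo.write("user:1", "Grace")  # invalidates the cached entry
third = repo.read("user:1")    # miss again, sees the new value
print(first, second, third, repo.db_reads)
```

If the `pop` in `write` were skipped or lost, `third` would keep returning the old value until the TTL expired, which is the stale-read risk this pattern carries.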
+ +#### Cache-Aside Read Flow + +```mermaid +sequenceDiagram + participant C as Client + participant A as Application + participant Cache as Cache + participant DB as Database + + C->>A: Read object + A->>Cache: GET key + alt Cache hit + Cache-->>A: cached value + A-->>C: response + else Cache miss + A->>DB: query object + DB-->>A: row / record + A->>Cache: SET key with TTL + A-->>C: response + end +``` + +#### Advantages + +- simple mental model +- cache stores only demanded data +- no cache write cost for cold data +- application controls key format and TTL per object type + +#### Disadvantages + +- first read after expiry is slow +- cache misses can overload the database +- stale data appears if invalidation is weak +- multiple readers may rebuild the same key at once + +#### Stale Data Risks + +On writes, the source of truth changes first. If the cache is not invalidated immediately, future reads may still see the old cached value. + +This is why cache-aside usually needs one of these: + +- delete cached key on write +- update cached value on write +- short TTL as a backstop +- version checks in the application + +#### Failure Cases + +- cache node unavailable: all reads fall through to DB +- DB slow: miss path becomes dangerous +- key expires under burst traffic: stampede +- invalidation event lost: stale data survives until TTL + +#### Common Production Usage + +Cache-aside is common for: + +- product pages +- user profiles +- configuration data +- permission maps that tolerate bounded staleness +- API aggregation results + +### 5.2 Write-Through + +Write-through means writes go to the cache and to the backing store as part of the write path. + +The intent is to keep cache and source of truth aligned immediately. + +#### Write Path Flow + +1. client sends write +2. application validates input +3. application writes new value to database and cache in the same logical operation +4. 
future reads hit a fresh cache entry + +#### Why It Exists + +Write-through exists because read-after-write consistency from cache is often better than with purely lazy cache-aside. Immediately after a successful write, the cache already has the fresh value. + +#### Benefits + +- fresher cache after writes +- simpler read path after updates +- fewer stale reads right after mutation + +#### Tradeoffs + +- every write pays cache cost even if the data is never read again +- write latency increases because more systems are involved +- failure handling becomes trickier if DB write succeeds but cache write fails, or vice versa + +#### Failure Handling Questions + +You must define: + +- which write is authoritative if one succeeds and one fails +- whether the request should fail or retry +- whether reconciliation jobs exist + +#### Production Suitability + +Write-through is useful when: + +- reads soon after writes are common +- keeping cache hot is valuable +- write volume is manageable + +It is a weaker fit when: + +- write traffic is very high +- many written objects are never read again +- write latency is extremely sensitive + +### 5.3 Write-Back / Write-Behind + +Write-back means the application writes to the cache first and persists to the database asynchronously later. + +This is the most aggressive performance-oriented pattern. + +#### Why It Exists + +It exists to absorb high write throughput and smooth backend load. The immediate write path becomes very fast because the durable store is no longer on the critical path. + +#### Throughput Advantages + +- low-latency writes +- batched or buffered persistence +- can smooth write bursts before they hit the database + +#### Durability Risks + +This pattern is dangerous because data may exist only in the cache or buffer for some time. + +If the cache crashes, if the async worker fails, or if the queue is lost, writes can disappear. 
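
A small sketch makes the durability gap visible. `WriteBackCache` is a hypothetical class, the dirty list is a stand-in for a persistence queue, and `crash` simulates losing the cache node before a flush. A production design would use a durable buffer, which this sketch deliberately leaves out.

```python
class WriteBackCache:
    """Acknowledges writes from memory and persists them later in a batch."""

    def __init__(self, database):
        self._db = database      # stand-in for the durable store
        self._cache = {}
        self._dirty = []         # keys written but not yet persisted, in write order

    def write(self, key, value):
        self._cache[key] = value
        if key not in self._dirty:
            self._dirty.append(key)
        return "acknowledged"    # the caller sees success before anything is durable

    def flush(self):
        flushed = 0
        while self._dirty:
            key = self._dirty.pop(0)
            self._db[key] = self._cache[key]
            flushed += 1
        return flushed

    def crash(self):
        """Simulate losing the cache node; returns the keys whose writes were lost."""
        lost = list(self._dirty)
        self._cache.clear()
        self._dirty.clear()
        return lost


db = {}
counters = WriteBackCache(db)

counters.write("views:post:1", 10)
counters.flush()                      # this write reaches the database

counters.write("views:post:1", 11)    # acknowledged, not yet durable
counters.write("views:post:2", 3)     # acknowledged, not yet durable
lost = counters.crash()
print(db, lost)
```

Both acknowledged writes after the flush are gone, and the database still holds the older value. That may be tolerable for a view counter and is not tolerable for an order or a payment.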
+ +#### Data Loss Scenarios + +- cache node fails before flush +- async worker backlog grows without bound +- persistence queue is dropped during incident +- ordering bugs cause older writes to overwrite newer writes + +#### Queueing Considerations + +Write-back systems are really queueing systems too. You need: + +- durable buffering strategy +- retry behavior +- ordering guarantees +- backpressure when database falls behind +- replay and reconciliation tools + +#### Operational Complexity + +Write-back is harder to operate because the write acknowledgement and true persistence are decoupled. + +This can be acceptable for: + +- analytics counters +- non-critical engagement metrics +- temporary derived state + +It is usually not acceptable for: + +- payments +- orders +- inventory reservation +- anything audit-sensitive + +### 5.4 Pattern Comparison + +| Pattern | Read behavior | Write behavior | Main strength | Main risk | +|---|---|---|---|---| +| Cache-aside | Reads cache first, loads on miss | Source updated separately, cache invalidated or refreshed | Simple and common | Stale reads and miss storms | +| Write-through | Reads often hit fresh cache | Write updates cache and DB together | Better freshness after writes | Higher write latency and dual-write complexity | +| Write-back | Reads hit hot cache | Write acknowledged before durable persistence completes | High write throughput | Data loss and operational complexity | + +### 5.5 Write-Through vs Cache-Aside + +| Question | Write-Through | Cache-Aside | +|---|---|---| +| Is cache populated on write? 
| Yes | Usually no | +| First read after write | Often fast | May miss if cache was invalidated | +| Write cost | Higher | Lower | +| Common fit | Read-after-write sensitive data | General-purpose read-heavy systems | + +### 5.6 Write-Back vs Write-Through + +| Question | Write-Back | Write-Through | +|---|---|---| +| Persistence timing | Asynchronous | Synchronous or near-synchronous | +| Durability | Weaker | Stronger | +| Throughput | Higher | Lower | +| Operational complexity | Higher | Lower | +| Safe for critical data | Rarely | More often | + +## 6. TTL (Time To Live) + +TTL is one of the most important cache controls. + +### 6.1 Why TTL Exists + +TTL gives cached data an expiration time. + +It exists because: + +- cache entries should not live forever +- data changes over time +- invalidation is never perfect +- memory must eventually be reclaimed + +TTL is both a freshness policy and a safety valve. + +### 6.2 Freshness vs Performance Tradeoff + +Short TTL: + +- fresher data +- more misses +- more backend load + +Long TTL: + +- better hit ratio +- lower backend load +- greater stale data risk + +Choosing TTL is not a mathematical purity exercise. It is a business decision informed by traffic shape and correctness requirements. 
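
A rough back-of-envelope model helps show where TTL length actually matters. The sketch below is an idealized assumption, not a measurement: it treats one key in isolation, assumes a steady request rate, and assumes exactly one reload per expiry. The function name `miss_ratio` is hypothetical.

```python
def miss_ratio(requests_per_second, ttl_seconds):
    """Approximate miss ratio for one key under an idealized model.

    Assumption: after each expiry the next request misses and reloads the
    entry, and the roughly requests_per_second * ttl_seconds requests that
    follow within the TTL are hits.
    """
    hits_per_cycle = requests_per_second * ttl_seconds
    return 1.0 / (1.0 + hits_per_cycle)


for rate in (0.1, 100.0):
    for ttl in (5, 600):
        print(f"rate={rate:>5} req/s  ttl={ttl:>3}s  miss_ratio={miss_ratio(rate, ttl):.4f}")
```

Under this model, a key read once every ten seconds misses about two thirds of the time with a 5 second TTL, but under 2 percent of the time with a 600 second TTL. A key read 100 times per second stays well under 1 percent misses either way. For very hot keys the concern with short TTLs is therefore less about hit ratio and more about many concurrent requests missing at the same moment.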
+ +### 6.3 Choosing TTL Values + +| Data type | Typical TTL thinking | Why | +|---|---|---| +| Static assets with versioned URLs | Very long, often effectively immutable | Content changes only when filename changes | +| Product catalog metadata | Minutes to hours, often event-invalidated too | Read heavy, moderate freshness needs | +| User profile display info | Minutes | Slight staleness often acceptable | +| Inventory or seat availability | Very short or event-driven | Stale data can cause user-visible errors | +| Auth or permission data | Short or version-checked | Security sensitivity | +| Rate limiting counters | Natural expiration aligned to window | TTL defines the policy itself | + +### 6.4 Short TTL vs Long TTL + +Short TTLs are attractive because they reduce staleness, but they often create hidden instability. + +If a key is hit constantly and expires every few seconds, the system repeatedly repays the miss penalty. That can waste backend capacity. + +Long TTLs improve performance, but only if you also have a reliable invalidation strategy or a clear tolerance for stale data. + +### 6.5 Dynamic TTL Strategies + +Good production systems often use different TTLs for different data classes. + +Examples: + +- long TTL for immutable product images +- medium TTL for product descriptions +- short TTL for stock level or surge pricing data +- longer TTL for cold data, shorter TTL for volatile entities + +Some systems also vary TTL by popularity. Very hot keys may justify proactive refresh or longer cache retention because the savings are large. + +### 6.6 Soft TTL vs Hard TTL + +Hard TTL means the entry is considered expired and must be reloaded before serving. + +Soft TTL means the entry is considered old enough to refresh, but the system may still serve it briefly while a background refresh happens. + +Soft TTL is a practical way to avoid user-facing latency spikes and stampedes. It supports patterns like stale-while-revalidate. 
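
The difference between the two thresholds can be sketched in a few lines. This is a simplified illustration, assuming a single process, a hypothetical `SoftTtlCache` class, and a refresh queue that a background worker would drain. A fake clock is injected so the behavior is deterministic.

```python
import time


class SoftTtlCache:
    """Toy cache with a soft TTL (refresh hint) and a hard TTL (must reload)."""

    def __init__(self, loader, soft_ttl, hard_ttl, clock=time.monotonic):
        self.loader = loader
        self.soft_ttl = soft_ttl
        self.hard_ttl = hard_ttl
        self.clock = clock
        self.entries = {}            # key -> (value, stored_at)
        self.refresh_queue = []      # keys a background worker should refresh

    def _load(self, key):
        value = self.loader(key)
        self.entries[key] = (value, self.clock())
        return value

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return self._load(key)               # cold miss
        value, stored_at = entry
        age = self.clock() - stored_at
        if age >= self.hard_ttl:
            return self._load(key)               # too old to serve: blocking reload
        if age >= self.soft_ttl and key not in self.refresh_queue:
            self.refresh_queue.append(key)       # serve stale, refresh later
        return value

    def run_background_refresh(self):
        # Stand-in for an async worker draining the refresh queue.
        while self.refresh_queue:
            self._load(self.refresh_queue.pop(0))


# Demo with a fake clock so the behavior is deterministic.
now = [0.0]
versions = {"profile:1": "v1"}
cache = SoftTtlCache(lambda k: versions[k], soft_ttl=60, hard_ttl=300, clock=lambda: now[0])

first = cache.get("profile:1")          # cold miss, loads v1
versions["profile:1"] = "v2"
now[0] = 90                             # past soft TTL, before hard TTL
stale = cache.get("profile:1")          # served from cache, refresh scheduled
cache.run_background_refresh()
refreshed = cache.get("profile:1")      # background refresh has replaced the value
now[0] = 1000                           # past hard TTL
versions["profile:1"] = "v3"
reloaded = cache.get("profile:1")       # blocking reload
print(first, stale, refreshed, reloaded)
```

The read at 90 seconds returns the old value without waiting on the loader, which is the latency benefit. The read after the hard TTL pays the reload cost, which is the correctness bound.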
+ +### 6.7 Expiration Storms and Jitter Strategies + +If many keys are created at the same time with the same TTL, they may all expire together. + +That causes an expiration storm or avalanche. + +The standard mitigation is jitter: add randomness to expiration times. + +Example: + +- instead of every key expiring at exactly 600 seconds +- expire keys at 600 seconds plus or minus a bounded random offset + +This spreads rebuild work over time. + +### 6.8 Practical TTL Decisions in Production + +Strong production TTL policy usually includes: + +- base TTL chosen per data class +- event-driven invalidation for important writes +- jitter to avoid synchronized expiry +- soft TTL for hot or expensive-to-build keys +- observability on miss storms and stale-read complaints + +Practical rule: + +If you cannot explain why a TTL is what it is, the TTL is probably wrong. + +## 7. Eviction Policies + +TTL decides when entries should expire logically. Eviction decides what happens when memory pressure forces the cache to throw something away. + +### 7.1 Why Eviction Policies Matter + +When memory fills up, the cache must choose which entries survive. + +That choice directly affects hit ratio and therefore system performance. + +Wrong eviction policy can destroy performance by retaining low-value data and evicting exactly the hot data that saves the backend. 
+ +### 7.2 Common Policies + +| Policy | Intuition | Works well when | Fails when | +|---|---|---|---| +| LRU | Evict least recently used items | Recent access predicts future access | Workload has scanning patterns that pollute recency | +| LFU | Evict least frequently used items | Repeated popularity matters | Frequency history adapts too slowly to sudden changes if tuned poorly | +| FIFO | Evict oldest inserted items | Simplicity matters more than precision | Age is not a good signal of future value | +| Random | Evict arbitrary items | Cheap and simple, decent in some broad workloads | Can evict very hot keys unpredictably | +| TTL-based | Prefer items nearing expiration | Expiry is meaningful and freshness-driven | Hot but old keys may be evicted too early | + +### 7.3 Redis Eviction Modes + +Redis exposes several eviction modes. + +| Mode | Meaning | +|---|---| +| `noeviction` | Reject writes when memory limit is reached | +| `allkeys-lru` | Evict least recently used keys from all keys | +| `volatile-lru` | Evict least recently used keys only among keys with TTL | +| `allkeys-lfu` | Evict least frequently used keys from all keys | +| `volatile-lfu` | Evict least frequently used keys only among keys with TTL | +| `allkeys-random` | Evict random keys from all keys | +| `volatile-random` | Evict random keys among TTL keys | +| `volatile-ttl` | Evict keys with nearest expiration among TTL keys | + +The right mode depends on workload and whether all keys are disposable. + +### 7.4 Workload-Based Policy Selection + +Use LRU when: + +- recent access strongly predicts future access +- the working set shifts over time + +Use LFU when: + +- long-term popularity matters +- some keys remain hot over long periods + +Use TTL-sensitive strategies when: + +- expiring data is naturally less valuable +- freshness policy is integral to value + +Avoid random or FIFO unless you have a reason. Simpler is not always safer. 
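
To see why scanning workloads hurt LRU, here is a small sketch built on Python's `collections.OrderedDict`. The `LruCache` class and the key names are hypothetical, and real caches typically use more elaborate or approximate implementations.

```python
from collections import OrderedDict


class LruCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # oldest access first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used


cache = LruCache(capacity=3)
cache.put("hot", "dashboard")
cache.put("a", 1)
cache.put("b", 2)
cache.get("hot")                 # hot key is touched, so it survives the next insert
cache.put("c", 3)                # evicts "a"
survived_normal_traffic = cache.get("hot") is not None

# A one-off scan inserts many keys that will never be read again.
for i in range(10):
    cache.put(f"scan:{i}", i)
survived_scan = cache.get("hot") is not None

print(survived_normal_traffic, survived_scan)
```

Under normal traffic the hot key keeps getting refreshed and survives. A single scan of cold keys is enough to push it out, which is the recency pollution noted in the policy table.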
+ +### 7.5 How Wrong Eviction Destroys Performance + +Example: + +- assume a SaaS dashboard has 5 percent extremely hot keys and 95 percent rarely used keys +- if eviction repeatedly removes hot keys, hit ratio falls sharply +- application traffic shifts back to the database +- database CPU rises, tail latency rises, and autoscaling may not help because the problem is miss amplification + +Engineers often blame the database first, but the real issue is sometimes that the cache is keeping the wrong objects. + +### 7.6 Best Practices + +- size memory with headroom rather than relying on constant eviction +- monitor eviction rate alongside hit ratio +- identify hot keys and oversized keys +- match policy to workload instead of using defaults blindly +- test cache behavior during memory pressure, not just normal load + +## 8. CDN (Content Delivery Network) + +Caching is not only a backend service concern. At internet scale, the performance layer extends to the edge. + +### 8.1 What a CDN Is + +A CDN is a globally distributed network of edge servers that caches and delivers content closer to users. + +Instead of every user request hitting your origin servers directly, a nearby edge location can serve cacheable content. 
+ +### 8.2 Why CDNs Exist + +| Goal | CDN benefit | Real effect | +|---|---|---| +| Reduce latency | Content served closer to user | Faster page loads and API edge responses | +| Reduce bandwidth from origin | Repeated asset delivery stays at edge | Lower origin network cost | +| Offload backend | Fewer requests reach origin | Origin survives higher traffic | +| Improve resilience | Edge absorbs surges and some attacks | Better stability during spikes | +| Provide global delivery | POPs around the world | Better user experience across regions | + +### 8.3 CDN Architecture + +```mermaid +flowchart LR + U[User Browser] --> EDGE[Nearest CDN Edge POP] + EDGE -->|Cache Hit| RESP[Response Returned] + EDGE -->|Cache Miss| SHIELD[Origin Shield / Regional Cache] + SHIELD --> ORIGIN[Origin App / Object Store] + ORIGIN --> SHIELD + SHIELD --> EDGE + EDGE --> RESP +``` + +Important concepts: + +- edge server or POP: geographically distributed cache location +- origin server: your source system where content is generated or stored +- origin shield: an extra cache layer between edge POPs and origin to reduce duplicate origin fetches + +#### CDN vs Reverse Proxy + +These terms are related, but they are not the same thing. + +| Dimension | CDN | Reverse Proxy | +|---|---|---| +| Typical placement | Globally distributed edge network | Usually sits in front of origin inside one region or network boundary | +| Main goal | Global latency reduction and origin offload | Traffic routing, load balancing, TLS termination, caching, security controls | +| Geographic reach | Many POPs across the world | Usually one site or a few controlled deployment points | +| Best use case | Shared content close to users worldwide | Centralized front door for backend services | +| Examples in practice | CloudFront, Fastly, Cloudflare edge delivery | NGINX, Envoy, HAProxy at origin or regional edge | + +In real systems, they often work together rather than compete. 
A CDN may sit in front of a reverse proxy, and the reverse proxy then routes to application services. The CDN handles global edge delivery and shared caching; the reverse proxy handles origin-side traffic management and policy enforcement. + +#### DDoS Mitigation Basics + +CDNs help with basic DDoS resilience because they distribute traffic across a large edge footprint, absorb repeated requests close to the network boundary, and keep a meaningful fraction of malicious or accidental traffic away from the origin. That does not eliminate the need for rate limiting, WAF rules, or origin protection, but it reduces how directly every spike hits your backend. + +### 8.4 Edge Caching + +Edge caching means storing content at CDN nodes so users can be served without going back to origin. + +This is especially effective for: + +- static assets +- images +- videos +- public API responses that can be cached safely +- partially personalized pages with shared fragments + +### 8.5 Browser Cache vs CDN Cache + +| Dimension | Browser Cache | CDN Cache | +|---|---|---| +| Location | End user device | Provider edge POP | +| Main benefit | Zero network or reduced network for repeat user visits | Shared origin offload across many users | +| Control | Limited by browser behavior and headers | Controlled via CDN policies and headers | +| Best for | User-specific repeat access to assets | Shared assets and shared responses | + +### 8.6 Cache Headers Basics + +HTTP caching works because servers tell intermediaries and browsers how to cache. 
+ +| Header / concept | What it does | Why it matters | +|---|---|---| +| `Cache-Control` | Defines caching directives like max age or public/private | Primary cache behavior control | +| `s-maxage` | Shared-cache max age | Lets CDN cache differently from browser | +| `ETag` | Validator representing response version | Enables revalidation without full body transfer | +| `Last-Modified` | Timestamp validator | Simpler revalidation mechanism | +| `stale-while-revalidate` | Allows stale content briefly while refresh happens | Better user latency and fewer stalls | +| `Vary` | Signals which request headers affect cache key | Critical for safe caching of content variations | + +### 8.7 Revalidation Flow + +```mermaid +sequenceDiagram + participant U as User Browser + participant E as CDN Edge + participant O as Origin + + U->>E: GET /app.js with validator + E->>O: Revalidate with ETag / If-None-Match + alt Not changed + O-->>E: 304 Not Modified + E-->>U: Cached body reused + else Changed + O-->>E: 200 New content + E-->>U: New content cached and returned + end +``` + +Revalidation avoids retransmitting full content when the content has not changed. + +### 8.8 Personalized Content Challenges + +CDNs are easy for public static assets. They are harder for personalized content. + +Problems: + +- one user's data must not leak to another user +- too many personalization dimensions can destroy cacheability +- authentication headers or cookies may fragment cache keys badly + +Common strategies: + +- cache only the shared shell, fetch personalized data separately +- use edge logic to vary on a small safe set of dimensions +- cache by versioned fragments instead of full pages +- mark highly personalized responses as private or uncacheable at the shared edge + +### 8.9 Dynamic Content Edge Strategies + +Modern systems do not limit CDNs to images and CSS. 
+ +They often use edge caching for: + +- public API responses +- HTML shell plus client-side personalized fetches +- signed asset access +- bot-resistant and rate-limited request handling +- geographically optimized routing to nearest healthy origin + +Google-like and Amazon-like large systems rely heavily on globally distributed frontends or edge layers because global latency is a real product problem, not just a backend benchmark problem. + +### 8.10 Static Asset Delivery + +Static asset delivery is the most successful CDN use case. + +#### JS, CSS, and Image Delivery + +Typical frontend/backend production flow: + +1. frontend build produces versioned asset filenames +2. assets are uploaded to object storage or origin bucket +3. CDN caches those assets globally +4. HTML references versioned URLs +5. browser and CDN cache them aggressively because names change on deploy + +#### Versioned Asset Strategy + +Versioned or content-hashed filenames solve invalidation elegantly. + +Example: + +- `app.8f3d2.js` instead of `app.js` + +If content changes, the filename changes. That means old caches remain valid for old references, while new deployments use new URLs. + +This is one of the cleanest examples of version-based invalidation in production. + +#### Immutable Asset Caching + +If assets are content-addressed or versioned, you can safely use very long cache lifetimes and immutable caching directives. + +That gives extremely high cache hit rates with almost no freshness downside. + +#### Cache Busting + +Cache busting means changing the URL when content changes so caches naturally treat the asset as new. + +Good cache busting is usually versioned naming, not manual emergency purges for every deploy. 
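
As an illustration of versioned naming, the sketch below derives a filename from the file's bytes and pairs it with a long-lived cache header. The helper names `hashed_filename` and `cache_headers_for`, the 8-character digest length, and the one-year `max-age` are assumptions chosen for the example. Real build tools have their own naming schemes.

```python
import hashlib


def hashed_filename(name, content, digest_chars=8):
    """Return a name like app.<digest>.js derived from the file's bytes."""
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    stem, dot, extension = name.rpartition(".")
    if not dot:
        return f"{name}.{digest}"
    return f"{stem}.{digest}.{extension}"


def cache_headers_for(versioned):
    # Versioned assets can be cached for a long time; unversioned entry points
    # such as index.html should be revalidated so they pick up new asset URLs.
    if versioned:
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    return {"Cache-Control": "no-cache"}


old_name = hashed_filename("app.js", b"console.log('v1');")
new_name = hashed_filename("app.js", b"console.log('v2');")
print(old_name, new_name)
print(cache_headers_for(versioned=True))
```

Because the name is a function of the content, a deploy that changes the file produces a new URL, and caches holding the old URL never need to be purged.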
+ +#### Compression Basics + +CDNs and origins commonly use compression: + +- Gzip: common general-purpose compression +- Brotli: often better compression for web assets, especially text assets + +Why it matters: + +- lower transfer size +- faster page loads +- reduced bandwidth cost + +#### Image Optimization Basics + +Images dominate page weight in many systems. + +Common CDN/image strategies: + +- resize images per device size +- use modern formats where possible +- compress aggressively without harming visible quality +- cache multiple transformed variants at edge + +#### Signed URLs Basics + +Signed URLs allow protected asset access through time-limited or permission-scoped links. + +This is common for: + +- private downloads +- customer-specific files +- media assets behind authorization rules + +The CDN can still help, but the cache key and security model must be designed carefully. + +### 8.11 Global Distribution + +Global delivery changes architecture decisions. + +#### Geo Routing + +Geo routing directs users toward nearby or appropriate regions. + +Why it matters: + +- shorter network round trips +- better perceived performance +- better regional failover options + +#### Anycast Basics + +Anycast is a routing technique where multiple edge locations advertise the same IP, and network routing sends the user to a nearby or efficient destination. + +This matters because users do not manually choose an edge. Network routing steers them. + +#### Regional Latency Reduction + +If your origin is only in one region, every distant user pays transcontinental latency. CDNs reduce that for cacheable content, but truly dynamic uncached requests still feel origin distance. + +This is why global systems often pair CDNs with multi-region origins. 
+ +#### Multi-Region Architecture Impact + +Once you have multiple origins or regions, the performance layer must interact with: + +- traffic steering +- state locality +- replication lag +- failover policies +- regional cache consistency + +#### Failover Benefits + +A good CDN and global routing layer can keep a regional origin issue from becoming a full global outage. Edge caches may continue serving stale or previously cached content while origins recover. + +#### Origin Shielding Basics + +Origin shielding adds an intermediate cache layer so many edge POPs do not all miss directly to origin. This is useful during viral events or large cache turnovers. + +### 8.12 Real-World Examples + +- Netflix is the classic example of edge-heavy delivery for video content; the lesson is that moving content close to users dramatically changes scalability economics +- Amazon-like e-commerce systems use CDNs for asset delivery, image optimization, and global storefront performance +- GitHub-like systems use edge delivery for assets, release downloads, and parts of public web traffic +- Stripe-like documentation, dashboards, and static resources benefit from aggressive CDN caching even when core payment flows remain origin-controlled +- typical SaaS systems often keep app shells and static assets heavily cached while user-specific API calls remain dynamic + +## 9. Cache Invalidation + +Cache invalidation is the hardest part of the performance layer because it is where performance and correctness collide. + +### 9.1 Why Cache Invalidation Is Hard + +The famous joke says there are only two hard things in computer science: cache invalidation and naming things. + +The practical meaning is this: + +Once you copy data away from the source of truth, you have created multiple versions of reality. Now you must decide when old copies stop being acceptable. 
+ +That is hard because: + +- one source update may affect many cached views +- invalidation can race with reads and writes +- events can be delayed or lost +- caches may exist at many layers: browser, CDN, local process, Redis +- some views are aggregates, not direct copies of one row + +### 9.2 Delete vs Update Strategies + +There are two classic invalidation approaches after a write. + +| Strategy | How it works | Strengths | Weaknesses | +|---|---|---|---| +| Delete on change | Remove cache entry after source update | Simple, avoids writing wrong value into cache | Next read is a miss, can trigger stampede | +| Update on change | Write new value into cache immediately | Better freshness, avoids immediate miss | Risk of dual-write inconsistency and more write overhead | + +Delete is often simpler and safer. Update can be faster for read-after-write workloads. + +### 9.3 Event-Driven Invalidation + +In event-driven invalidation, the source-of-truth write publishes an event that tells caches or downstream services what changed. + +Example flow: + +1. product price changes in database +2. product service emits `product.updated` +3. consumers remove or refresh relevant cache keys +4. next read sees new data or repopulates with new value + +This is powerful because it decouples writers from all readers and cached views. It is also operationally harder because events must be reliable enough. + +### 9.4 Pub/Sub Invalidation + +Lightweight invalidation often uses Pub/Sub. + +This works when: + +- missed events are acceptable because TTL is a fallback +- low latency matters +- invalidation is best effort rather than strictly durable + +It is weaker when you need guaranteed processing and replay. + +### 9.5 Version-Based Invalidation + +Version-based invalidation means the cache key or validator includes a version. 
+ +Examples: + +- `user:123:v17` +- asset filename hash +- ETag generated from content version + +This is powerful because old cached entries naturally become irrelevant when the version changes. + +It is extremely common in: + +- static assets +- schema-aware API responses +- derived views where version numbers are easy to compute + +### 9.6 Tag-Based Invalidation + +Tag-based invalidation groups related cache entries under logical tags. + +Example: + +- product detail page, search results, category page, and recommendation widget all share the tag `product:123` + +When the product changes, all content attached to that tag can be invalidated. + +This is useful when one underlying object fans out into many cached representations. + +### 9.7 Dependency Invalidation + +Many caches hold derived data, not raw rows. + +Example: + +- homepage recommendations depend on user preferences, inventory, and pricing + +Now invalidation is harder because one source change may invalidate multiple aggregates. + +This is where dependency graphs, tags, or event fan-out matter. + +### 9.8 Eventual Consistency Tradeoffs + +In practice, many invalidation systems are eventually consistent. + +That means for some brief period: + +- source of truth is updated +- one cache is updated +- another cache still has old data + +Your job is to make that window safe. + +Techniques include: + +- bounded TTL +- read version checks +- idempotent invalidation events +- periodic repair or refresh jobs +- treating stale responses as acceptable only for certain data types + +### 9.9 Stale Read Mitigation + +Good production systems do not assume invalidation is perfect. They layer safeguards. 
+ +Common mitigations: + +- short TTL for sensitive data +- longer TTL plus event invalidation for less sensitive data +- version numbers in payloads +- client-side revalidation for edge content +- cache bypass on critical user actions +- observability for stale-read incidents + +### 9.10 Cache Invalidation Flow + +```mermaid +flowchart TD + W[Write Request] --> DB[(Database Update)] + DB --> EVT[Change Event] + EVT --> INV[Invalidation Service] + INV --> REDIS[Redis / App Cache Delete or Refresh] + INV --> CDN[CDN Purge or Tag Invalidate] + INV --> LOCAL[Local Service Cache Bust] + REDIS --> NEXT[Next Read Rebuilds or Uses Fresh Value] + CDN --> NEXT + LOCAL --> NEXT +``` + +### 9.11 Real-World Invalidation Patterns + +Typical production patterns: + +- product catalog systems invalidate product detail keys and search-result fragments when price or stock changes +- GitHub-like public pages often rely on versioned assets and shorter-lived HTML or fragment caching +- Stripe-like systems may avoid aggressive caching on the most correctness-sensitive payment paths but still use invalidation for dashboard and metadata views +- typical SaaS apps invalidate tenant configuration, permissions, and dashboard aggregates via events plus TTL backstops + +### 9.12 Best Practices and Common Mistakes + +Best practices: + +- define source of truth clearly +- design cache keys systematically +- make invalidation events idempotent +- use TTL as backup, not as the only correctness mechanism for important data +- model dependencies explicitly for derived views + +Common mistakes: + +- forgetting that one row update affects multiple cached views +- using overly broad purges that destroy hit ratio +- trusting best-effort invalidation for correctness-sensitive data +- not planning what happens if the invalidation bus is down + +## 10. 
How These Pieces Connect in Actual Architecture + +The performance layer matters most when you can explain how the pieces work together, not just individually. + +### 10.1 Typical SaaS Request Flow + +```mermaid +sequenceDiagram + participant Browser as Browser + participant CDN as CDN Edge + participant API as API Service + participant Local as Local Cache + participant Redis as Redis + participant DB as Database + participant Bus as Event Bus + + Browser->>CDN: Request app shell / assets + alt Edge hit + CDN-->>Browser: Cached asset or cached response + else Edge miss + CDN->>API: Forward request + API->>Local: Lookup hot local entry + alt Local hit + Local-->>API: Value + else Local miss + API->>Redis: Lookup shared cache + alt Redis hit + Redis-->>API: Value + API->>Local: Populate local cache + else Redis miss + API->>DB: Query source of truth + DB-->>API: Fresh data + API->>Redis: Store with TTL + API->>Local: Populate + end + end + API-->>CDN: Response + CDN-->>Browser: Response + end + + DB-->>Bus: Change event on writes + Bus-->>Redis: Invalidate or refresh + Bus-->>CDN: Purge / tag invalidation +``` + +This is what a realistic design conversation sounds like. The system is not "database plus Redis." It is a layered request path with distinct hit and miss behaviors. + +### 10.2 What Breaks at Scale + +As scale grows, the performance layer encounters these problems first: + +- one hot key overloads a single shard +- local caches diverge across many instances +- cache rebuilds overwhelm the database after deployments +- CDN cache keys explode because of too many vary dimensions +- invalidation lags behind writes during incident conditions +- eviction removes exactly the hottest working set +- global traffic shifts expose region-specific cold caches + +A strong answer in interviews is to identify not only the optimization but also the failure mode created by that optimization. 
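
The layered read path from section 10.1, and the local-cache divergence listed above, can both be sketched with plain dictionaries. `LayeredReader` is a hypothetical class; the shared dictionary stands in for Redis and the `database` dictionary for the source of truth. TTLs and the CDN layer are left out to keep the example short.

```python
class LayeredReader:
    """Toy read path: local cache, then shared cache, then database."""

    def __init__(self, shared_cache, database):
        self.local = {}                 # per-instance in-process cache
        self.shared = shared_cache      # stand-in for Redis (a dict here)
        self.database = database        # stand-in for the source of truth
        self.db_reads = 0

    def get(self, key):
        if key in self.local:
            return self.local[key], "local"
        if key in self.shared:
            self.local[key] = self.shared[key]
            return self.local[key], "shared"
        self.db_reads += 1
        value = self.database[key]
        self.shared[key] = value        # a real system would set a TTL here
        self.local[key] = value
        return value, "database"

    def invalidate(self, key):
        # Called when a change event arrives for this key.
        self.local.pop(key, None)
        self.shared.pop(key, None)


shared = {}
database = {"tenant:7:config": {"plan": "pro"}}
instance_a = LayeredReader(shared, database)
instance_b = LayeredReader(shared, database)

sources = [
    instance_a.get("tenant:7:config")[1],   # first read falls through to the database
    instance_a.get("tenant:7:config")[1],   # same instance: local hit
    instance_b.get("tenant:7:config")[1],   # other instance: shared-cache hit
]
print(sources)

database["tenant:7:config"] = {"plan": "enterprise"}
instance_a.invalidate("tenant:7:config")
stale_on_b = instance_b.get("tenant:7:config")[0]   # instance_b's local copy was not invalidated
fresh_on_a = instance_a.get("tenant:7:config")[0]
print(stale_on_b, fresh_on_a)
```

The second half shows the divergence problem: the invalidation reached one instance and the shared cache, but the other instance keeps serving its local copy until something expires or invalidates it. That is why local caches usually need short TTLs or a broadcast invalidation path.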
+ +### 10.3 Performance Layer Design Heuristics + +Use a CDN when: + +- content is static or moderately cacheable +- users are globally distributed +- origin offload matters + +Use local cache when: + +- objects are extremely hot +- a network hop is still too expensive +- slight inconsistency is acceptable + +Use distributed cache when: + +- many app instances need shared fast access +- backend misses are expensive +- TTL and invalidation can be managed + +Use careful invalidation instead of blind long TTLs when: + +- data changes matter to users or correctness + +Do not add caching yet when: + +- traffic is low +- the real bottleneck is poor query design or bad indexes +- correctness cost exceeds performance benefit + +## 11. Interview Discussion Guide + +### 11.1 Common Interview Questions and Strong Answer Angles + +| Question | What a strong answer should include | +|---|---| +| Why add a cache? | Latency reduction, DB offload, compute savings, scalability, hot-key behavior | +| Redis or Memcached? | Workload fit, feature needs, durability expectations, simplicity tradeoff | +| Cache-aside or write-through? | Read/write mix, freshness needs, miss penalty, write latency impact | +| What happens on cache failure? | Fallback behavior, database protection, rate limits, degraded mode | +| Why is invalidation hard? | Multiple copies of data, derived views, event loss, race conditions, multi-layer caches | +| How do you choose TTL? | Volatility, business tolerance for staleness, hit ratio, backend cost, jitter | +| What breaks at scale? | Stampedes, hot keys, eviction under pressure, stale reads, regional cold starts | +| Why use a CDN? | Edge latency reduction, origin offload, static asset delivery, global availability | + +### 11.2 A Strong Interview Structure + +When asked about any performance-layer component, answer in this order: + +1. what problem it solves +2. where it sits in the architecture +3. how it works on the request path +4. 
tradeoffs and failure modes +5. what you would monitor in production + +That structure works for caching, Redis, Memcached, TTL, eviction, CDN, and invalidation. + +### 11.3 Metrics You Should Mention + +For caches: + +- hit ratio +- miss latency +- eviction rate +- memory usage +- hot key distribution +- rebuild rate after expiry +- stale read complaints or mismatches + +For CDNs: + +- edge hit ratio +- origin fetch rate +- regional latency +- revalidation rate +- cache-key cardinality +- purge success and propagation time + +### 11.4 Final Mental Checklist + +Ask yourself these questions in every design: + +- what data is read repeatedly +- which reads can tolerate staleness +- what happens on cache miss +- what happens when the cache is down +- how are writes reflected in cached views +- how are hot keys handled +- how do global users get low latency +- how do you keep the source of truth safe even if the performance layer fails + +## 12. Final Takeaways + +The performance layer is about moving repeated work away from expensive systems and closer to the user. + +Caching reduces latency, protects databases, and lowers cost. Redis gives a rich and fast in-memory platform for many backend patterns. Memcached remains a strong option for pure distributed caching. TTL and eviction policy determine whether the cache behaves like an asset or a liability. CDNs extend caching to the edge and fundamentally change global performance. Invalidation is the price you pay for speed, and it must be treated as a core design problem rather than an afterthought. + +In interviews, the goal is not just to say "use cache." The goal is to explain: + +- why the cache exists +- where it sits in the request path +- how reads and writes behave +- what can go stale +- what breaks at scale +- how production systems stay safe when the performance layer fails + +That is what separates glossary knowledge from engineering understanding. 
diff --git a/systems design/5.asyncSystem.md b/systems design/5.asyncSystem.md new file mode 100644 index 0000000..b661565 --- /dev/null +++ b/systems design/5.asyncSystem.md @@ -0,0 +1,2201 @@ +# Async Systems + +Async systems are the part of backend engineering that let a system accept work now, complete it later, and stay alive when traffic, latency, and failure conditions stop being friendly. + +In interviews, async design is where system design answers start sounding like real production engineering. It is one thing to say, "I will call service B from service A." It is another to explain what happens when service B slows down, when traffic spikes 20x, when jobs fail halfway through, when messages are duplicated, or when you need to process the same event for analytics, billing, notifications, and fraud detection without coupling every system together. + +That is what async systems solve. + +This guide is written for two goals at once: + +- interview preparation, where you need to explain tradeoffs clearly and structure your answer well +- real backend engineering, where you need to design systems that keep working under failure, load, and operational complexity + +The focus here is not memorizing definitions. The focus is understanding why async systems exist, how they work internally, what breaks at scale, and how these pieces connect in real architectures used by companies such as Google, Netflix, Uber, Amazon, GitHub, Stripe, and typical SaaS platforms. + +## 1. 
Big Picture: Why Async Systems Exist + +### 1.1 The Core Problem + +Synchronous request-response works well when: + +- the work is short +- the downstream dependency is healthy +- the user needs the answer immediately +- the traffic pattern is reasonably smooth + +It breaks down when: + +- the work takes seconds or minutes +- one request triggers many side effects +- downstream systems are slower than incoming traffic +- traffic arrives in bursts rather than at a steady rate +- some work is important but not user-blocking + +Examples: + +- uploading a video and generating multiple encodings +- placing an order and triggering payment, inventory reservation, email, analytics, fraud checks, and warehouse workflows +- sending millions of notification emails after a product launch +- recomputing search indexes or analytics aggregates +- retrying webhook delivery to external systems that may be temporarily down + +In all of these cases, pushing everything directly into the request path makes the system fragile. + +### 1.2 Sync vs Async Architecture + +```mermaid +flowchart LR + subgraph Sync Path + U1[User] --> API1[API Service] + API1 --> PAY1[Payment Service] + API1 --> INV1[Inventory Service] + API1 --> MAIL1[Email Service] + API1 --> ANA1[Analytics Service] + ANA1 --> DB1[(Analytics Store)] + end +``` + +In the synchronous version, the request path is only as healthy as the slowest dependency. If analytics stalls, the order path may stall. If email times out, user latency may grow even though email is not necessary to confirm the order. 
+ +```mermaid +flowchart LR + subgraph Async Path + U2[User] --> API2[API Service] + API2 --> DB2[(Primary DB)] + API2 --> Q1[[Order Events / Jobs]] + Q1 --> PAY2[Payment Worker] + Q1 --> INV2[Inventory Worker] + Q1 --> MAIL2[Email Worker] + Q1 --> ANA2[Analytics Worker] + end +``` + +In the async version, the API handles the critical path, persists the authoritative transaction, emits a job or event, and lets background systems do the rest. The user gets a fast response, and the system gains buffering and decoupling. + +### 1.3 What Async Actually Means + +Async does not simply mean "parallel" or "faster." + +It means the caller does not wait for all work to complete before moving on. + +That changes several things: + +- latency becomes lower for the caller +- completion becomes eventual, not immediate +- failures move from direct request failures to background job failures +- observability becomes harder because the workflow spans time and multiple components +- correctness depends on retries, idempotency, and coordination between systems + +### 1.4 Why Companies Lean on Async Systems + +Large production systems use async patterns because they need to survive uneven traffic and isolate failures. 
+ +Common examples: + +- Amazon-style commerce systems separate checkout from downstream fulfillment, shipment updates, recommendation updates, and notification pipelines +- Uber-like trip systems fan trip events out to pricing, receipts, ETA updates, driver incentives, fraud models, and analytics +- Netflix-like media systems use async pipelines for media encoding, playback telemetry ingestion, and recommendation updates +- GitHub-like platforms use background processing for webhook delivery, code scanning, repository indexing, notifications, and audit pipelines +- Stripe-like systems push payment side effects, reconciliation, ledger updates, webhook delivery, and risk review into highly reliable async workflows +- SaaS platforms use async jobs for exports, report generation, email campaigns, CRM sync, file processing, and periodic billing operations + +### 1.5 What Async Does Well and What It Does Not + +| Question | Async is good when... | Async is a bad fit when... | +|---|---|---| +| User latency | Work is non-blocking or long-running | User needs the final answer immediately | +| Reliability | Work can be retried safely | Work is non-idempotent and retrying causes damage | +| Scalability | Arrival rate can exceed processing rate temporarily | The system needs strict instantaneous consistency end-to-end | +| Decoupling | Multiple systems consume the same event independently | Strong synchronous coordination is required | +| Cost | Worker fleets can scale independently | Operational complexity outweighs the benefit | + +### 1.6 The Most Important Mental Model + +Async systems trade immediacy for resilience and throughput. 
+ +That trade is often correct in production, but it means you now need to reason about: + +- queues instead of direct calls +- retries instead of single attempts +- duplicate execution instead of single execution +- delayed completion instead of immediate completion +- lag and backlog instead of only request latency +- operators and dashboards instead of only code correctness + +If you explain async design in an interview, the strongest framing is this: + +"I am moving non-critical or long-running work off the request path so the system can decouple producers from consumers, absorb bursts, retry safely, and scale each side independently. That improves latency and resilience, but now I need to discuss delivery guarantees, idempotency, observability, and failure handling." + +## 2. Message Queues + +### 2.1 What a Message Queue Is + +A message queue is a system that stores units of work or messages so one component can produce them and another component can process them later. + +The producer does not need the consumer to be ready at the same moment. That is the central value. + +The message might represent: + +- a task to execute, such as "send this email" +- an event that occurred, such as "order placed" +- a command, such as "generate invoice PDF" +- a state transition, such as "subscription past due" + +Queues are one of the simplest and most useful tools in backend engineering because they solve timing mismatch. + +### 2.2 Why Async Systems Often Start with a Queue + +Without a queue, the producer and consumer must be healthy and available at the same time. 
+ +With a queue: + +- the producer can continue even if consumers are temporarily slow +- the system can smooth out bursts in traffic +- consumers can be scaled separately from producers +- failures can be retried without involving the original caller +- work can be prioritized, delayed, or routed differently + +### 2.3 Producer-Consumer Model + +```mermaid +flowchart LR + P[Producer Service] --> Q[[Message Queue]] + Q --> C1[Worker 1] + Q --> C2[Worker 2] + Q --> C3[Worker 3] + C1 --> S[(Side Effect / DB / External API)] + C2 --> S + C3 --> S +``` + +The producer creates work. The queue stores work. The consumers pull or receive work and process it. + +This model looks trivial, but it introduces powerful separation: + +- producers control creation rate +- consumers control processing rate +- the queue absorbs the difference + +That is why queues are foundational to scalable systems. + +### 2.4 How a Queue Works Internally + +At a high level, a queue system does four jobs: + +1. accept writes from producers +2. durably store messages until processed or expired +3. deliver messages to consumers +4. track whether a message was acknowledged, retried, or dead-lettered + +Internally, most queue systems need to answer these questions: + +- where is the message stored, memory or disk or both? +- when is a message considered visible to consumers? +- how long is it retained? +- how does the broker know whether processing succeeded? +- what happens if a worker crashes halfway through execution? + +Those questions define delivery semantics and operational behavior. + +### 2.5 Decoupling Services + +Decoupling is one of the most repeated words in system design, but it needs precise meaning. 
+ +Queues decouple services in at least four ways: + +| Type of decoupling | Meaning | Example | +|---|---|---| +| Time decoupling | Producer and consumer do not need to run simultaneously | Checkout can succeed even if email workers are down for 5 minutes | +| Load decoupling | Consumer throughput can differ from producer throughput | 100k uploads arrive in a burst but processing continues over time | +| Failure decoupling | Consumer failure does not immediately fail the producer | Analytics outage should not block payment success | +| Deployment decoupling | Teams can evolve consumer services independently | A new fraud consumer can subscribe later without changing checkout | + +### 2.6 Buffering Traffic Spikes + +This is where queues deliver some of their biggest production value. + +Suppose incoming requests create 50,000 jobs in one minute, but your workers can only process 10,000 per minute. Without a queue, the producers may fail or overload downstream systems immediately. With a queue, the system stores the backlog and drains it over time. + +That does not create free capacity. It creates controlled overload. + +Controlled overload is often the difference between a degraded system and an outage. + +The key metrics become: + +- enqueue rate +- processing rate +- queue depth +- age of oldest message +- time to drain backlog + +If processing rate is greater than arrival rate, the backlog drains. If not, it grows indefinitely. + +### 2.7 Reliability Improvements + +Queues improve reliability by making work durable and retryable. + +Instead of one in-memory attempt inside a request handler, a queue lets the system: + +- persist the work item +- retry if the consumer crashes or a downstream dependency times out +- isolate the failure to one worker or one queue +- inspect failed messages later + +A good interview answer here is not just "queues increase reliability." 
It is: + +"Queues increase reliability because work survives process crashes and temporary downstream failures. The producer can persist the job, the consumer can retry transient errors, and operators can inspect the backlog and DLQ when failures persist." + +### 2.8 Delivery Guarantees + +This is one of the most important interview topics. + +#### At-most-once + +The system tries to deliver a message zero or one time. + +Meaning: + +- duplicates are minimized +- message loss is possible if failure happens before durable acknowledgment or redelivery + +Use it when missing a message is acceptable or when duplicate effects are worse than loss. + +#### At-least-once + +The system ensures a message is delivered one or more times. + +Meaning: + +- message loss is less likely +- duplicates are expected + +This is the most common practical model in production because retrying is easier than proving exact single execution. + +#### Exactly-once + +In interviews, many candidates say "I want exactly-once delivery." In real systems, exactly-once is usually not a universal property. It is a carefully scoped guarantee built from storage transactions, deduplication, idempotent handlers, and well-defined boundaries. + +Practical reality: + +- the broker may deduplicate writes in limited contexts +- the consumer may commit offsets transactionally with produced output in some systems such as Kafka streams-style pipelines +- the business side effect still often needs idempotency to handle retries and partial failures + +The safest real-world mindset is: + +"Assume at-least-once delivery and design consumers to be idempotent." + +### 2.9 Idempotency Importance + +Idempotency means repeating the same operation produces the same effective result. 
+ +Examples: + +- charging a card twice is not idempotent unless you use a payment idempotency key +- setting order status to "email sent" with a unique message key can be idempotent +- inserting into a table with a unique event ID can be idempotent +- uploading the same file by content hash can be idempotent + +Why it matters: + +- workers crash after performing the side effect but before acknowledging the message +- networks time out after the side effect succeeded +- brokers redeliver after visibility timeout or connection loss +- operators replay old messages during recovery + +If the consumer is not idempotent, all of these turn into data corruption or repeated customer-facing actions. + +### 2.10 Ordering Guarantees + +Ordering sounds simple, but strong ordering is expensive. + +Questions to ask: + +- do I need global ordering or just per-entity ordering? +- do I need order for all events or only within one account, trip, or order? +- can I tolerate reordering if events carry version numbers or timestamps? + +Practical rules: + +- global ordering limits scalability badly +- per-partition or per-key ordering is much more practical +- if many workers process the same queue, ordering may break unless the system deliberately serializes processing for a key + +A common production strategy is to partition by entity key, such as `order_id` or `user_id`, so events for the same entity stay ordered while the overall system remains parallel. + +### 2.11 Queue Depth Monitoring and Backpressure Basics + +Backpressure means slowing producers or controlling work intake when consumers cannot keep up. + +If you only watch request latency, you will miss async overload. You need queue-centric metrics. 
+ +| Metric | Why it matters | +|---|---| +| Queue depth | Shows backlog size | +| Age of oldest message | Shows user-visible staleness and starvation | +| Processing latency | Shows time from enqueue to completion | +| Success/failure rate | Shows whether workers are healthy | +| Retry rate | Shows instability in downstream dependencies | +| DLQ rate | Shows persistent or poison failures | +| Consumer lag | Critical in log-based systems like Kafka | + +Backpressure mechanisms include: + +- rate limiting producers +- temporarily rejecting low-priority work +- shedding optional jobs +- autoscaling worker fleets +- using separate queues by priority or workload class + +Without backpressure, retries and requeues can flood the system and make recovery impossible. + +### 2.12 Poison Messages + +A poison message is a message that keeps failing whenever it is processed. + +Reasons: + +- corrupt payload +- incompatible schema +- code bug in the worker +- missing referenced data +- invalid business state +- a downstream dependency rejects that specific request forever + +If poison messages are blindly retried forever, they clog the queue and waste worker capacity. That is why systems use retry limits and dead-letter queues. + +### 2.13 Operational Concerns + +Operating queues well is not just about publishing and consuming. 
+ +You need to think about: + +- retention period +- message size limits +- schema evolution +- reprocessing strategy +- queue explosion from too many fine-grained queues +- hot partitions or hot keys +- worker autoscaling and fairness +- observability across producer, broker, and consumer + +### 2.14 Common Mistakes with Queues + +- putting user-critical validation behind a queue when the user needs an immediate answer +- assuming one message will only ever be delivered once +- not storing a job state or dedupe key +- treating queue depth as the only metric instead of also measuring age and lag +- using a single queue for workloads with wildly different priorities and runtimes +- publishing a message before the source-of-truth database write commits successfully + +That last mistake is especially important. In production, teams often use the outbox pattern: write the business record and an "event to publish" into the same database transaction, then publish from the outbox asynchronously. This avoids saying something happened in the queue before it actually committed in the database. + +## 3. Kafka + +### 3.1 What Kafka Is + +Kafka is a distributed event streaming platform built around an append-only log rather than a traditional broker-managed work queue. + +That design choice changes everything. + +In a traditional queue, consumers often remove work from the queue as they process it. In Kafka, records are appended to a log, retained for a configurable period, and consumers track their own position using offsets. + +That means Kafka is not just a delivery mechanism. It is also a durable event history for a period of time. + +### 3.2 The Append-Only Log Model + +Think of Kafka as a set of logs split into partitions. + +- producers append records to a partition +- records are stored in order within that partition +- consumers read forward by offset +- old records stay available until retention expires + +This makes replay possible. 
That is one of Kafka's defining advantages. + +```mermaid +flowchart LR + P1[Producer A] --> T[Topic Orders] + P2[Producer B] --> T + subgraph Topic[Kafka Topic: orders] + PART0[Partition 0<br/>offset 0..n] + PART1[Partition 1<br/>offset 0..n] + PART2[Partition 2<br/>offset 0..n] + end + T --> PART0 + T --> PART1 + T --> PART2 + PART0 --> CG1[Consumer Group: billing] + PART1 --> CG1 + PART2 --> CG1 + PART0 --> CG2[Consumer Group: analytics] + PART1 --> CG2 + PART2 --> CG2 +``` + +### 3.3 Topics, Partitions, Brokers, Producers, Consumers + +#### Topic + +A topic is the named stream of records, such as `orders`, `payments`, or `trip_events`. + +#### Partition + +A topic is split into partitions for scalability. Ordering is guaranteed only within a partition. + +#### Broker + +A broker is a Kafka server that stores partitions and serves reads and writes. + +#### Producer + +A producer writes records to a topic, optionally choosing a partition key such as `user_id` or `order_id`. + +#### Consumer + +A consumer reads records from partitions and tracks progress using offsets. + +### 3.4 Consumer Groups + +Consumer groups are how Kafka scales consumption. + +Within one consumer group: + +- each partition is assigned to at most one consumer instance at a time +- multiple consumers can split partitions for parallelism +- ordering within a partition is preserved + +Across different consumer groups: + +- each group gets its own view of the stream +- one group can handle billing, another analytics, another search indexing + +This is why Kafka is excellent for fanout pipelines. + +### 3.5 Offsets + +Offsets are position markers in a partition. + +They matter because Kafka does not "know" the business meaning of processing. The consumer decides when it is safe to commit progress. + +That creates important failure scenarios: + +- if you commit the offset before finishing the side effect, you may lose processing on crash +- if you finish the side effect before committing the offset, a crash may cause duplicate processing + +This is the same core tradeoff seen across async systems. Kafka just exposes it more explicitly. + +### 3.6 Retention and Replayability + +Kafka retains records for time-based or size-based windows. 
+ +This enables: + +- replaying data after a consumer bug fix +- bootstrapping a new consumer from historical events +- rebuilding materialized views or search indexes +- running batch and stream consumers from the same source + +This replay capability is why Kafka is so common in event pipelines, analytics ingestion, log aggregation, and change-data-capture architectures. + +### 3.7 Ordering Guarantees + +Kafka guarantees order within a partition, not across the whole topic. + +That means you should choose partition keys carefully. + +Examples: + +- partition by `user_id` if user events must stay ordered +- partition by `order_id` if order state transitions must stay ordered +- do not partition randomly if you require per-entity ordering + +The tradeoff is that a very hot key can overload one partition and become a throughput bottleneck. + +### 3.8 Replication Basics and Leader-Follower Partitions + +Kafka replicates partitions across brokers. + +For each partition: + +- one replica is the leader +- followers replicate data from the leader +- producers and consumers usually talk to the leader +- if the leader fails, a follower can take over + +This improves availability and durability, but replication introduces operational questions: + +- how many replicas do you need? +- how much lag is allowed between leader and followers? +- do you allow writes if not enough replicas acknowledge? + +Those settings affect durability versus availability. + +### 3.9 Why Kafka Has High Throughput + +Kafka gets strong throughput from several design choices: + +- sequential append to logs instead of random in-place mutation +- partition-based parallelism +- batched reads and writes +- efficient network and disk usage +- consumer-controlled progress instead of broker-tracked per-message acknowledgments + +This makes Kafka a strong fit for very high-volume event streams. 
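
The log model covered in the Kafka subsections so far (key-based partitioning, ordering within a partition, consumer-tracked offsets, and replay) can be illustrated with a small in-memory sketch. This is not Kafka client code: `MiniLog`, its methods, and the CRC32-based partitioner are hypothetical names chosen for illustration, and real Kafka uses its own partitioner, on-disk segments, and replication.

```python
import zlib
from collections import defaultdict


class MiniLog:
    """Toy partitioned append-only log (illustrative only, not the Kafka API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # Committed offsets are tracked per (consumer group, partition).
        self.committed = defaultdict(int)

    def partition_for(self, key):
        # Deterministic hash so the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, group, partition):
        """Return (offset, key, value) records at or after the group's committed offset."""
        start = self.committed[(group, partition)]
        records = self.partitions[partition]
        return [(off, k, v) for off, (k, v) in enumerate(records) if off >= start]

    def commit(self, group, partition, next_offset):
        self.committed[(group, partition)] = next_offset


log = MiniLog(num_partitions=3)
for status in ("created", "paid", "shipped"):
    log.append("order-42", status)  # same key, so same partition, so ordered
p = log.partition_for("order-42")

billing = log.read("billing", p)
log.commit("billing", p, billing[-1][0] + 1)  # commit only after processing
remaining = log.read("billing", p)            # nothing new for billing

analytics = log.read("analytics", p)          # independent group, sees everything

log.commit("billing", p, 0)                   # replay: rewind the committed offset
replayed = log.read("billing", p)
```

Records are never removed when read, which is what lets a second consumer group or a rewound offset see the same history again. Retention, which would eventually trim old records, is left out of this sketch.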
+ +### 3.10 Event Streaming vs Traditional Queues + +| Characteristic | Kafka | Traditional work queue | +|---|---|---| +| Core model | Append-only log | Broker-managed message queue | +| Replay | Strong, built-in | Often limited or awkward | +| Fanout | Excellent via consumer groups | Often requires additional routing setup | +| Per-message ack model | Consumer offset commits | Message acknowledgment / redelivery | +| Best at | Event pipelines, analytics, stream processing, audit history | Task execution, RPC-like async jobs, routing-heavy workflows | +| Ordering | Per partition | Depends on queue and consumption pattern | + +### 3.11 When Kafka Is the Right Choice + +Kafka is a strong choice when: + +- many downstream systems need the same event stream +- replay is important +- throughput is very high +- you want durable event history for some retention window +- analytics and operational consumers share a common pipeline +- per-key ordering is enough + +Typical real-world use cases: + +- clickstream and telemetry ingestion at Google-scale or Netflix-scale volumes +- trip event pipelines in Uber-like systems +- audit and domain event pipelines in enterprise SaaS +- CDC-based replication from OLTP databases into search or analytics systems +- fraud, recommendation, and feature-engineering pipelines that consume the same event stream differently + +### 3.12 When Kafka Is a Poor Fit + +Kafka is not always the simplest answer. 
+ +It may be a poor fit when: + +- you only need simple job queue semantics +- you want very low operational overhead +- routing flexibility matters more than replay +- your team does not want to operate partitions, replication, rebalancing, and lag monitoring +- tasks are long-running with per-message acknowledgment and retry patterns better suited to a traditional broker + +### 3.13 Common Operational Issues + +- consumer lag building up silently +- partition skew from bad key choice +- large messages hurting throughput and replication +- expensive rebalances disrupting consumers +- retention misconfiguration causing data loss or runaway storage cost +- schema evolution problems across many producers and consumers +- assuming a replay is free when it can overwhelm downstream consumers + +### 3.14 Interview Discussion Points for Kafka + +In interviews, strong Kafka discussion usually includes: + +- why append-only logs are different from queues +- ordering only within a partition +- consumer groups for fanout and scaling +- offsets and replay as core concepts +- at-least-once processing and idempotent consumers +- operational risks such as lag, hot partitions, and rebalancing + +## 4. RabbitMQ + +### 4.1 What RabbitMQ Is + +RabbitMQ is a broker-based messaging system designed around exchanges, queues, routing rules, acknowledgments, and flexible delivery patterns. + +Where Kafka emphasizes durable event logs and replayable streams, RabbitMQ emphasizes message routing and broker-managed delivery. + +That makes it attractive for transactional workflows and task-oriented messaging patterns. + +### 4.2 Broker-Based Queueing Model + +In RabbitMQ, producers usually publish to an exchange, not directly to a queue. + +The exchange decides where the message should go based on bindings and routing rules. 
+ +```mermaid +flowchart LR + P[Producer] --> EX[Exchange] + EX --> Q1[Queue: email] + EX --> Q2[Queue: billing] + EX --> Q3[Queue: audit] + Q1 --> W1[Email Worker] + Q2 --> W2[Billing Worker] + Q3 --> W3[Audit Worker] +``` + +This routing layer is one of RabbitMQ's biggest strengths. + +### 4.3 Exchanges, Queues, Bindings, Routing Keys + +#### Exchange + +Receives published messages and routes them. + +#### Queue + +Stores messages for consumers. + +#### Binding + +A rule connecting an exchange to a queue. + +#### Routing key + +Metadata used by certain exchange types to decide routing. + +### 4.4 Exchange Types + +#### Direct exchange + +Routes based on exact routing key match. + +Example: + +- `invoice.created` goes only to the billing queue + +#### Fanout exchange + +Broadcasts a message to all bound queues. + +Example: + +- `user.registered` fans out to onboarding email, CRM sync, and analytics queues + +#### Topic exchange + +Routes using wildcard patterns. + +Example: + +- `order.*` or `payment.#` + +This is useful when routing logic needs hierarchy and flexibility. + +```mermaid +flowchart LR + P2[Producer] --> TOP[Topic Exchange] + TOP -->|order.created| QO[Order Queue] + TOP -->|order.created| QA[Audit Queue] + TOP -->|payment.failed| QB[Billing Queue] + TOP -->|payment.failed| QN[Notification Queue] +``` + +### 4.5 Acknowledgments and Delivery Guarantees + +RabbitMQ commonly uses acknowledgments to know whether processing succeeded. + +Typical flow: + +1. broker delivers message to consumer +2. consumer processes the message +3. consumer acknowledges success +4. if consumer crashes or does not ack, message may be redelivered + +This leads to at-least-once behavior in most real deployments. + +Again, idempotency matters. 
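
The acknowledgment flow above, combined with the earlier idempotency discussion, can be sketched without any broker library. `ToyBroker` and `handle_invoice` below are hypothetical stand-ins, not RabbitMQ client calls. The sketch only shows why a crash between the side effect and the ack leads to redelivery, and how a processed-ID check keeps the side effect from happening twice.

```python
from collections import deque


class ToyBroker:
    """Minimal ack-based queue: unacked deliveries are requeued (illustrative only)."""

    def __init__(self):
        self.ready = deque()
        self.unacked = {}

    def publish(self, msg_id, body):
        self.ready.append((msg_id, body))

    def deliver(self):
        msg_id, body = self.ready.popleft()
        self.unacked[msg_id] = body  # held until the consumer acks
        return msg_id, body

    def ack(self, msg_id):
        del self.unacked[msg_id]

    def consumer_lost(self):
        # Connection dropped: everything unacked becomes deliverable again.
        for msg_id, body in self.unacked.items():
            self.ready.append((msg_id, body))
        self.unacked.clear()


processed_ids = set()  # assumption: stands in for a durable store with a unique key
emails_sent = []


def handle_invoice(msg_id, body):
    if msg_id in processed_ids:
        return  # duplicate delivery: skip the side effect
    emails_sent.append(body)
    processed_ids.add(msg_id)


broker = ToyBroker()
broker.publish("msg-1", "receipt for invoice 1001")

msg_id, body = broker.deliver()
handle_invoice(msg_id, body)
broker.consumer_lost()           # crash after the side effect, before the ack

msg_id, body = broker.deliver()  # the broker redelivers the same message
handle_invoice(msg_id, body)     # no second email
broker.ack(msg_id)
```

In a real system the dedupe record and the side effect would need to commit atomically, or the side effect itself would need to accept an idempotency key. The in-memory set here glosses over that.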
+ +### 4.6 Retries and Dead Lettering + +RabbitMQ supports several retry patterns: + +- immediate requeue +- delayed retry using TTL plus dead-letter exchange patterns +- moving failed messages to a dead-letter queue after retry exhaustion + +This is useful in transactional systems where some failures are transient and others are permanent. + +### 4.7 Ordering Considerations + +RabbitMQ can preserve queue order in simple cases, but ordering becomes weaker when: + +- multiple consumers pull from the same queue +- failed messages are requeued +- priorities are used +- multiple queues participate in one workflow + +For interview purposes, a good answer is: + +"RabbitMQ can provide queue order, but end-to-end ordering depends on the consumption pattern. Once you add multiple consumers, retries, or requeueing, you should not assume strict business-level ordering without extra design." + +### 4.8 When RabbitMQ Is Better Than Kafka + +RabbitMQ is often better when: + +- you need flexible routing rules +- you want classic work queue behavior +- per-message acknowledgments matter +- workflows are more transactional than streaming-oriented +- you do not need long retention and replay of large event histories +- the team wants a broker semantics model rather than a distributed event log model + +Typical examples: + +- order workflow tasks +- email and notification dispatch +- RPC-style async request handling +- service integration patterns with explicit routing +- enterprise back-office workflows + +### 4.9 Transactional Workflow Example + +Imagine a SaaS billing system: + +1. invoice closes +2. billing service publishes `invoice.ready` +3. RabbitMQ routes it to payment, email, audit, and CRM queues +4. payment worker tries collection +5. email worker sends receipt or failure notice +6. 
audit worker stores immutable audit data + +RabbitMQ fits well because the routing rules are central, the workflows are task-oriented, and each consumer may have specific retry or DLQ policies. + +### 4.10 Common Operational Issues + +- using too many transient queues and losing track of ownership +- requeue storms from consumer bugs +- large backlogs causing memory or disk pressure +- assuming ordering survives multiple consumers and retries +- weak DLQ discipline leading to silent failure piles +- treating RabbitMQ like an infinite event store when it is not designed for that role + +### 4.11 Interview Discussion Points for RabbitMQ + +- exchange/queue/binding model +- flexible routing as a core advantage +- acknowledgments and redelivery +- why it is good for job and transactional workflows +- when it is simpler and more appropriate than Kafka + +## 5. SQS + +### 5.1 What Amazon SQS-Style Systems Solve + +Amazon SQS represents a very common engineering choice: use a managed queue service so the team does not operate queue infrastructure directly. + +This is appealing because many teams want queue benefits without running brokers themselves. + +Managed queues solve: + +- durable message buffering +- elastic scale without managing brokers +- simple integration with cloud compute services +- built-in retry and DLQ patterns + +### 5.2 Why Teams Often Choose Managed Queues + +Most companies are not trying to innovate on queue infrastructure. They are trying to ship product features. + +SQS-style systems are attractive because they reduce operational burden: + +- no broker fleet to patch +- no partition planning like Kafka +- no complex routing topology like RabbitMQ by default +- simple cloud permissions and integrations +- pay-for-usage economics often fit small and medium workloads well + +The tradeoff is less control and fewer advanced semantics. 
+ +### 5.3 Standard vs FIFO Queues + +| Type | Strengths | Tradeoffs | +|---|---|---| +| Standard | Very high scale, simple, cheap, at-least-once behavior | Best-effort ordering, duplicates possible | +| FIFO | Ordered delivery within message groups, deduplication support | Lower throughput, stricter usage patterns | + +The key interview point is that FIFO is not "better" by default. It is more constrained and should only be chosen when ordering or dedupe requirements justify the throughput tradeoff. + +### 5.4 Visibility Timeout + +Visibility timeout is a core SQS concept. + +When a worker receives a message: + +- the message becomes temporarily invisible to other workers +- the worker processes it +- if the worker deletes it before the timeout expires, processing is considered complete +- if the worker crashes or fails to delete it, the message becomes visible again and can be redelivered + +This is how SQS supports reliable retry without requiring sticky ownership. + +It also means duplicate execution is normal. + +### 5.5 Polling Model and Long Polling + +SQS consumers usually poll for messages. + +Short polling checks quickly and often, which can waste requests and money when queues are idle. + +Long polling waits for messages to become available, reducing empty responses and improving efficiency. + +This sounds like an implementation detail, but it matters in production for both cost and latency behavior. + +### 5.6 Retries and Dead-Letter Queues + +SQS commonly pairs with: + +- redelivery after visibility timeout expiration +- maximum receive count thresholds +- dead-letter queues for messages that keep failing + +That gives teams a simple, managed failure handling pattern. 
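
The interaction between visibility timeout, maximum receive count, and the dead-letter queue can be simulated with a logical clock. This is a sketch under stated assumptions, not the SQS API: `ToyQueue`, `receive`, `delete`, and the `max_receives` parameter are hypothetical names that only loosely mirror the managed service's behavior.

```python
class ToyQueue:
    """Simulated queue with a visibility timeout and a dead-letter list (illustrative)."""

    def __init__(self, visibility_timeout, max_receives):
        self.visibility_timeout = visibility_timeout
        self.max_receives = max_receives
        self.messages = {}      # msg_id -> {"body", "visible_at", "receives"}
        self.dead_letters = []

    def send(self, msg_id, body):
        self.messages[msg_id] = {"body": body, "visible_at": 0, "receives": 0}

    def receive(self, now):
        for msg_id, m in list(self.messages.items()):
            if m["visible_at"] > now:
                continue  # still invisible: some worker is holding it
            if m["receives"] >= self.max_receives:
                # Received too many times without a delete: dead-letter it.
                self.dead_letters.append((msg_id, m["body"]))
                del self.messages[msg_id]
                continue
            m["receives"] += 1
            m["visible_at"] = now + self.visibility_timeout
            return msg_id, m["body"]
        return None

    def delete(self, msg_id):
        # Deleting is how a worker signals successful processing.
        self.messages.pop(msg_id, None)


q = ToyQueue(visibility_timeout=30, max_receives=3)
q.send("job-1", "resize image")
q.send("job-2", "malformed payload")

first = q.receive(now=0)
q.delete(first[0])            # job-1 succeeds and is removed

second = q.receive(now=0)     # job-2 is received but never deleted
hidden = q.receive(now=10)    # None: job-2 is still invisible
q.receive(now=30)             # timeout expired, second attempt
q.receive(now=60)             # third attempt
final = q.receive(now=90)     # receive limit reached: moved to dead letters
```

The worker never reports failure explicitly. Not deleting the message before the timeout is the failure signal, which is why a timeout shorter than the real processing time produces duplicate work.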
+ +```mermaid +flowchart LR + API[Producer Service] --> SQS[[SQS Queue]] + SQS --> W[Worker Fleet] + W --> EXT[DB / External API] + W -->|Success| ACK[Delete Message] + W -->|Failure| VT[Visibility Timeout Expires] + VT --> SQS + SQS -->|Too Many Failures| DLQ[Dead-Letter Queue] +``` + +### 5.7 Scaling Worker Fleets + +SQS makes it easy to scale worker fleets horizontally. + +Common patterns: + +- container workers scaling on queue depth or age of oldest message +- serverless consumers for bursty or low-volume jobs +- separate queues for high-priority and low-priority workloads + +Because the queue is managed, the main scaling work moves to the consumers and their downstream dependencies. + +### 5.8 Serverless Integrations + +This is a major reason SQS-style systems are popular. + +Examples: + +- SQS to Lambda for lightweight asynchronous processing +- SQS to ECS or Kubernetes workers for longer-running or more controlled execution +- SQS paired with Step Functions or workflow engines for multi-step orchestration + +This is common in SaaS platforms where product teams want straightforward event handling without operating their own broker fleet. + +### 5.9 Operational Simplicity Tradeoffs + +Managed simplicity comes with tradeoffs: + +- fewer advanced routing features than RabbitMQ +- weaker replay and stream semantics than Kafka +- polling-based consumption rather than native log streaming +- ordering limitations unless using FIFO queues carefully + +This is why many teams choose SQS for business workflows and Kafka for high-volume event pipelines. 
+ +### 5.10 Common Operational Issues + +- wrong visibility timeout causing duplicate work or slow retries +- workers taking longer than the timeout without extending it +- no idempotency key for retried tasks +- scaling consumers aggressively and overwhelming the database or third-party API +- treating queue depth alone as sufficient while ignoring message age +- sending huge payloads instead of storing payloads in object storage and sending references + +### 5.11 Interview Discussion Points for SQS + +- managed queue benefits +- visibility timeout as the key reliability primitive +- standard versus FIFO tradeoffs +- simplicity versus control tradeoff compared with Kafka and RabbitMQ + +## 6. Pub/Sub and Event-Driven Architecture + +### 6.1 Queue vs Pub/Sub Difference + +This is a classic interview topic. + +A queue is primarily about distributing work so one or a small number of consumers process each message. + +Pub/sub is primarily about distributing an event to multiple independent consumers. + +| Pattern | Main goal | Typical behavior | +|---|---|---| +| Queue | One work item should be handled by one worker or worker group | Competing consumers share the workload | +| Pub/Sub | Many systems should hear that something happened | Each subscriber gets its own copy or logical view | + +In practice, real systems often blend the two. + +### 6.2 Fanout Event Distribution + +```mermaid +flowchart LR + ORD[Order Service] --> BUS[[Event Bus / Topic]] + BUS --> BILL[Billing Consumer] + BUS --> NOTIF[Notification Consumer] + BUS --> SEARCH[Search Index Consumer] + BUS --> ANA[Analytics Consumer] + BUS --> FRAUD[Fraud Consumer] +``` + +This pattern is useful because the order service does not need direct knowledge of every downstream consumer. + +That keeps the system extensible. + +### 6.3 Event-Driven Architecture + +Event-driven architecture uses domain events or system events as a way for services to react to state changes asynchronously. 
+ +Examples: + +- `user_registered` +- `payment_succeeded` +- `subscription_renewal_due` +- `trip_started` +- `repository_pushed` + +Why teams like this pattern: + +- easy fanout to many consumers +- services remain loosely coupled +- new capabilities can be added by attaching new consumers +- events create a natural audit trail if retained + +Why teams fear this pattern when it is done poorly: + +- flows become hard to trace +- ownership becomes ambiguous +- accidental dependencies form through event contracts +- schema changes break multiple downstream systems +- eventual consistency surprises product teams + +### 6.4 Event Notifications vs Source of Truth + +Not every event should carry the full source-of-truth state. + +There are two common patterns: + +#### Notification event + +Says something happened and consumers fetch more data if needed. + +Strengths: + +- small message size +- less schema coupling + +Weaknesses: + +- more downstream reads +- consumers depend on another system being available + +#### State-carrying event + +Carries enough data for the consumer to act without more reads. + +Strengths: + +- lower coupling to synchronous reads +- better for analytics and independent processing + +Weaknesses: + +- schema evolution is harder +- larger payloads +- risk of stale assumptions about what fields downstream users need + +### 6.5 Event Sourcing Basics + +Event sourcing is more specific than general event-driven architecture. + +In event sourcing: + +- the sequence of events is the source of truth +- current state is derived by replaying those events or projecting them into read models + +This can be powerful for auditability and reconstructing state, but it is not the default choice for most product systems because it adds modeling and operational complexity. + +Interview nuance: + +Do not equate "we publish events" with "we use event sourcing." Most event-driven systems are not full event-sourced systems. 
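The difference between publishing events and event sourcing is easier to see in code. The sketch below assumes a hypothetical account domain (`account_opened`, `deposited`, `withdrawn` are made-up event names): the event list is the only stored data, and the balance is always derived by replaying it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str        # hypothetical event name, e.g. "deposited"
    amount: int = 0


def apply(balance, event):
    """Pure transition function: current state + one event -> next state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    return balance  # events that do not affect the balance are ignored


def replay(events, initial=0):
    """Derive state by folding the full event sequence from the beginning."""
    state = initial
    for event in events:
        state = apply(state, event)
    return state


log = [
    Event("account_opened"),
    Event("deposited", 100),
    Event("withdrawn", 30),
    Event("deposited", 5),
]
current_balance = replay(log)          # state derived from the whole log
balance_after_two = replay(log[:2])    # state as of an earlier point in history
```

Replaying a prefix of the log is what gives event sourcing its auditability: any past state can be reconstructed without a separate history table.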
+ +### 6.6 Pub/Sub vs Webhooks + +Webhooks are an externalized event delivery mechanism, usually over HTTP. + +Comparison: + +| Characteristic | Internal Pub/Sub | Webhook | +|---|---|---| +| Audience | Internal services | External customer systems | +| Delivery transport | Broker or event bus | HTTP callbacks | +| Trust boundary | Internal | Cross-organization | +| Retry logic | Controlled internally | Must handle remote untrusted endpoints | +| Security | Internal auth and network control | Signatures, verification, abuse control | + +GitHub-style and Stripe-style systems are famous for webhook delivery. Internally they often still rely on queues or event buses and then run a dedicated webhook delivery pipeline on top, with retries, signatures, dedupe, and observability. + +### 6.7 Delivery Guarantees, Ordering, and Replay + +Pub/sub systems vary widely. + +You should discuss: + +- whether each subscriber tracks its own progress +- whether subscribers can replay older events +- whether ordering is global, per-key, or best-effort +- how long events are retained +- whether consumers see duplicates + +Kafka-style pub/sub is strong for replay. Simpler notification buses may prioritize live fanout over long retention. 
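The questions above (per-subscriber progress, replay, duplicates) can be illustrated with a small in-memory log. This is a sketch under simplifying assumptions, not a model of any specific broker; `RetainedLog`, `poll`, `commit`, and `seek` are hypothetical names.

```python
class RetainedLog:
    """Toy append-only log where each subscriber tracks its own offset."""

    def __init__(self):
        self.events = []
        self.committed = {}  # subscriber name -> index of next unread event

    def publish(self, event):
        self.events.append(event)

    def poll(self, subscriber, max_events=10):
        # Reading does not advance progress; only commit() does.
        start = self.committed.get(subscriber, 0)
        return self.events[start:start + max_events]

    def commit(self, subscriber, count):
        start = self.committed.get(subscriber, 0)
        self.committed[subscriber] = min(len(self.events), start + count)

    def seek(self, subscriber, offset):
        # Rewind (or fast-forward) a subscriber to replay retained events.
        self.committed[subscriber] = offset


log = RetainedLog()
for name in ["trip_started", "trip_updated", "trip_completed"]:
    log.publish(name)

billing_first = log.poll("billing")
log.commit("billing", len(billing_first))
billing_empty = log.poll("billing")          # billing is caught up

analytics_first = log.poll("analytics", max_events=2)
analytics_again = log.poll("analytics", max_events=2)  # no commit, so a duplicate read

log.seek("billing", 0)
billing_replay = log.poll("billing")         # replay from the start
```

Because `analytics` polled twice without committing, it saw the same events twice. That is the at-least-once behavior consumers must tolerate, while `seek` shows why retention makes replay possible at all.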
+ +### 6.8 Real-World Examples + +- Uber-like trip events fan out to ETA, billing, incentives, support timelines, and analytics +- Netflix-like playback events fan out to recommendations, QoE analytics, device debugging, and experimentation pipelines +- Amazon-style order lifecycle events fan out to warehouse, notifications, CRM, and fraud systems +- GitHub-like repository events fan out to notifications, search indexing, audit logs, integrations, and webhook delivery systems +- Stripe-like payment lifecycle events fan out to ledger systems, webhook delivery, risk systems, and reconciliation pipelines + +### 6.9 Common Mistakes with Event-Driven Systems + +- emitting vague event names with weak ownership +- publishing events before the underlying transaction commits +- versioning schemas informally and breaking subscribers +- assuming async fanout automatically creates reliable workflows without job state and retries +- turning the event bus into a hidden dependency graph no one understands + +## 7. Retries + +### 7.1 Why Retries Are Necessary + +Retries exist because many failures are temporary, not permanent. + +Examples: + +- network packet loss +- short-lived service overload +- transient database failover +- rate limit windows resetting +- temporary lock contention +- brief DNS or TLS issues + +If the system gives up immediately on every transient failure, reliability collapses. + +### 7.2 Why Retries Are Dangerous + +Retries can save a system, but they can also destroy it. + +If a downstream service is already struggling, aggressive retries can multiply the load and turn a partial slowdown into a full outage. + +That is why retry design is a systems problem, not just an error-handling detail. + +### 7.3 Timeout Handling and Failure Classification + +Before retrying, classify the failure. 
+ +Safe candidates for retry: + +- timeouts +- connection resets +- 429 or retryable rate-limit signals +- transient 5xx responses + +Usually not safe to retry blindly: + +- validation errors +- permanent authorization failures +- malformed payloads +- business rule violations + +Permanent failures should move quickly to DLQ, error state, or operator attention instead of burning retry budget. + +### 7.4 Exponential Backoff + +Exponential backoff spaces out retries so repeated failures do not happen in a tight loop. + +Example progression: + +- retry after 1 second +- then 2 seconds +- then 4 seconds +- then 8 seconds + +This reduces immediate pressure on a recovering dependency. + +### 7.5 Jitter + +Jitter adds randomness to retry timing. + +Without jitter, thousands of clients may retry at the same exact intervals, causing retry waves. + +With jitter, retries are spread out. + +This is simple and extremely important in production. + +```mermaid +flowchart TD + A[Request Fails] --> B{Transient Failure?} + B -- No --> P[Mark Permanent Failure or DLQ] + B -- Yes --> C[Compute Backoff + Jitter] + C --> D[Wait] + D --> E[Retry] + E --> F{Succeeded?} + F -- Yes --> G[Complete] + F -- No --> H{Retry Budget Left?} + H -- Yes --> C + H -- No --> P +``` + +### 7.6 Retry Storms + +Retry storms happen when many callers retry a failing dependency at once. + +Classic pattern: + +1. downstream latency increases +2. callers time out +3. callers retry immediately +4. downstream load multiplies +5. latency worsens further +6. more retries trigger + +This feedback loop is one of the most common outage amplifiers. + +Mitigations: + +- reasonable timeout settings +- exponential backoff with jitter +- bounded retry counts +- circuit breakers +- admission control and rate limiting +- queueing and load shedding + +### 7.7 Duplicate Execution Problems + +If a request times out, you often do not know whether the remote side performed the action. 
+ +That is the heart of duplicate execution risk. + +Examples: + +- payment charge succeeded remotely but response got lost +- email was sent but acknowledgment failed +- database write committed but worker crashed before marking job done + +This is why retries require idempotency. + +### 7.8 Idempotency Keys + +Idempotency keys are a practical mechanism for safe retries. + +The idea: + +- the client or worker attaches a stable unique key for the logical operation +- the server stores or checks that key +- repeated requests with the same key return the same logical result instead of repeating the side effect + +Stripe-like payment APIs are a canonical example. In distributed systems, idempotency keys are one of the most powerful tools for turning unreliable networks into safe business behavior. + +### 7.9 Relationship to Circuit Breakers + +Retries and circuit breakers complement each other. + +- retries try again when failure seems temporary +- circuit breakers stop hammering a dependency that appears unhealthy + +Without circuit breakers, retry logic may keep feeding an outage. + +Without retries, the system may give up on failures that would have recovered quickly. + +### 7.10 Safe Retry Patterns + +- retry only known-transient failures +- bound the retry count or total retry duration +- add jitter +- make the operation idempotent +- record attempt count and last error +- propagate trace context so retries are observable +- consider moving repeated failures out of the request path into async repair workflows + +### 7.11 Retry Observability + +You should monitor: + +- retry count distribution +- retry success rate +- time to eventual success +- duplicate suppression rate +- downstream latency and error codes +- jobs failing after retry exhaustion + +If you do not observe retries, you do not know whether they are rescuing the system or quietly degrading it. 
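The safe retry patterns above can be sketched in a few lines. This is a minimal illustration, assuming the caller has already classified failures into two hypothetical exception types (`TransientError`, `PermanentError`); the jitter strategy shown is "full jitter", where each delay is drawn uniformly between zero and the exponential cap.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, resets, 429s)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_attempts=4, base=1.0, cap=30.0,
                      sleep=time.sleep, rng=random):
    """Retry only transient failures, with a bounded attempt count."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # retry budget exhausted
            sleep(backoff_delay(attempt, base, cap, rng))
        # PermanentError (and anything else) propagates immediately.
    raise last_error


calls = {"n": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("timeout")
    return "ok"


delays = []  # record delays instead of sleeping so the example runs instantly
result = call_with_retries(flaky, sleep=delays.append, rng=random.Random(7))
```

Injecting `sleep` and `rng` keeps the function testable. The helper does nothing about duplicate side effects, so `operation` still needs to be idempotent, for example by carrying an idempotency key.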
+ +### 7.12 Common Mistakes with Retries + +- retrying permanent validation errors +- retrying without idempotency +- using the same timeout budget on every attempt +- nesting retries in multiple layers and multiplying total requests +- using fixed retry intervals without jitter +- not correlating retries to one logical operation in logs and traces + +## 8. Dead Letter Queues (DLQ) + +### 8.1 What DLQs Are + +A dead-letter queue stores messages that could not be processed successfully after the allowed retry policy or routing rule. + +It is the system's way of saying: + +"This message is not succeeding through normal automation. Move it out of the hot path and make it inspectable." + +### 8.2 Why DLQs Exist + +DLQs exist for three main reasons: + +- failure isolation, so poison messages do not block healthy traffic +- debugging, so operators can inspect the original payload and error context +- controlled recovery, so bad messages can be fixed and replayed deliberately + +### 8.3 Poison Message Handling and Retry Exhaustion + +Without DLQs, a permanently failing message can: + +- loop forever +- waste worker capacity +- hide real throughput problems +- starve newer messages +- create endless noisy alerts + +Moving such messages to a DLQ lets the main queue keep flowing. + +### 8.4 What Good DLQ Payloads Should Include + +Do not just move the raw message and call it done. + +Good DLQ design includes: + +- original payload +- message ID or event ID +- attempt count +- last error code and message +- timestamps for first failure and last failure +- source queue or topic +- trace or correlation ID +- schema version + +That data is what turns DLQ from a trash pile into an operational tool. + +### 8.5 Replay Strategies + +Replaying DLQ messages is dangerous if done casually. 
+ +Common strategies: + +- replay after fixing a code bug +- replay after restoring a downstream dependency +- replay only a filtered subset +- replay into a quarantine or low-rate queue first +- replay with additional dedupe and observability + +Blindly dumping the entire DLQ back into the main queue is a classic mistake. If the root cause is not fixed, you just recreate the outage. + +### 8.6 Operational Monitoring and Alerting + +DLQ metrics deserve explicit monitoring. + +Alert on: + +- DLQ size growth +- new DLQ arrival rate +- spike by message type or tenant +- oldest message age in DLQ +- repeated replay failure + +In real systems, on-call runbooks often start with DLQ inspection because it gives a high-signal view of which workflows are breaking persistently. + +### 8.7 Practical Production Debugging Workflow + +```mermaid +flowchart TD + A[Message Fails Repeatedly] --> B[Moved to DLQ] + B --> C[Alert Fires] + C --> D[Operator Inspects Payload and Error Metadata] + D --> E{Root Cause?} + E -- Code Bug --> F[Fix Code and Deploy] + E -- Bad Data --> G[Repair or Filter Bad Records] + E -- Dependency Outage --> H[Restore Dependency] + F --> I[Replay Safely] + G --> I + H --> I + I --> J[Monitor Replay Success and Dedupe] +``` + +### 8.8 Common Mistakes with DLQs + +- having a DLQ but no alerting on it +- storing too little metadata to diagnose failures +- replaying everything blindly +- never clearing or triaging DLQ growth +- treating DLQ as normal backlog instead of exceptional failure state +- sending messages to DLQ too early without distinguishing transient from permanent failures + +### 8.9 Interview Discussion Points for DLQs + +- DLQs isolate poison messages +- they improve system availability by protecting the hot path +- they support debugging and safe replay +- they are only useful if combined with alerting, metadata, and operational discipline + +## 9. 
Background Workers and Async Jobs + +### 9.1 What Background Jobs Are + +Background jobs are units of work executed outside the request-response path. + +This is the practical execution layer of async systems. + +Examples: + +- sending emails +- generating reports and exports +- media transcoding +- search indexing +- payment reconciliation +- fraud review workflows +- delayed reminders and notifications +- CRM or third-party system sync + +### 9.2 Why Not Everything Should Happen Inside the Request Path + +The request path should stay focused on work that is: + +- necessary to answer the user now +- fast enough for the SLA +- safe to couple to the current user interaction + +Everything else should be questioned. + +Reasons to defer work: + +- user should not wait for it +- downstream dependency is slow or unreliable +- workload is bursty +- task may need retries over minutes or hours +- task is CPU-heavy or operationally isolated + +### 9.3 Request-Response vs Deferred Execution + +```mermaid +flowchart LR + U[User] --> API[Application API] + API --> DB[(Primary DB)] + API --> JOB[[Job Queue]] + API --> RESP[Fast Response to User] + JOB --> W[Background Worker] + W --> EXT[Email / Storage / External API / Analytics] + W --> STATE[(Job State Store)] +``` + +The API does the minimum durable business action, then hands off the rest. + +That is the shape of a huge number of production systems. 
+ +### 9.4 Typical Job Categories + +| Job type | Why async helps | +|---|---| +| Email sending | External provider latency and retries should not block user requests | +| Report generation | Heavy DB reads and file creation may take seconds or minutes | +| Payment reconciliation | Often batch-oriented, retry-heavy, and integration-heavy | +| Media processing | CPU-intensive and long-running | +| Notification systems | High fanout and bursty workloads | +| Search indexing | Eventually consistent is acceptable in many product flows | + +### 9.5 Job State Tracking + +Real systems usually need more than a queue entry. They often need a job record. + +Typical states: + +- pending +- running +- succeeded +- failed +- retrying +- canceled +- dead-lettered + +Why job state matters: + +- product can show progress to users +- operators can inspect failures +- workflows can enforce idempotency and dedupe +- downstream systems can reason about completion + +### 9.6 Idempotent Execution + +Workers must assume the job may be delivered more than once. + +Good patterns: + +- unique constraint on logical operation ID +- status table keyed by job ID or idempotency key +- side effects guarded by dedupe records +- external calls made with idempotency tokens where supported + +### 9.7 Retries, Failures, and Partial Completion + +A background job often has multiple steps. Partial failure is normal. + +Example: + +1. generate invoice PDF succeeds +2. upload to object storage succeeds +3. email provider times out + +Now what? + +You need to know: + +- which steps are safe to repeat +- which outputs already exist +- whether the job should resume, restart, or compensate + +This is why mature job systems store progress and side-effect identifiers, not just success/failure. + +### 9.8 Observability and Tracing + +Async jobs are much harder to observe than synchronous request flows because they span time, brokers, and workers. 
+ +Good observability includes: + +- correlation IDs from request to job +- per-attempt logs +- queue metrics and worker metrics +- traces linking producer, broker, and consumer actions +- dashboards by job type, tenant, and failure reason + +### 9.9 Common Mistakes with Async Jobs + +- no job state outside the queue +- no idempotency key +- treating retries as an implementation detail instead of a first-class design concern +- letting one huge job monopolize a worker and starve short jobs +- no timeout or heartbeat for long-running tasks +- no operator tooling for inspect, cancel, replay, or rate limit + +## 10. Delayed Tasks + +### 10.1 What Delayed Tasks Are + +Delayed tasks are jobs scheduled to run in the future rather than immediately. + +Examples: + +- send reminder in 24 hours +- retry payment tomorrow +- expire reservation after 15 minutes +- renew subscription at period boundary +- send delayed push notification at local 9 AM + +### 10.2 Why Delayed Execution Matters + +Many business workflows are time-based, not request-based. + +If your async system only supports "run now," it cannot model a large class of production behavior. + +### 10.3 Implementation Approaches + +#### Delay queues + +Some systems support delayed visibility or delayed delivery directly. + +Strengths: + +- simple mental model +- good for moderate-scale reminder and retry workflows + +Weaknesses: + +- limited querying and control over future tasks +- very large delayed backlogs may be operationally awkward + +#### Scheduler plus task table + +Store future tasks in a database and periodically enqueue due tasks. + +Strengths: + +- easier auditing and edits +- better support for recurring jobs and cancellation + +Weaknesses: + +- requires a scheduler component and careful dedupe logic + +#### Time wheel or specialized timer systems + +Used in systems that need huge volumes of timers efficiently. + +This is more advanced and common in infrastructure-heavy environments. 
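The "scheduler plus task table" approach can be sketched with SQLite standing in for the production database. The table name, column names, and the `dedupe_key` format are assumptions made for illustration; a real scheduler would also need locking semantics appropriate to its database.

```python
import sqlite3


def create_schema(conn):
    conn.execute(
        """CREATE TABLE scheduled_tasks (
               id INTEGER PRIMARY KEY,
               dedupe_key TEXT NOT NULL UNIQUE,
               run_at INTEGER NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    conn.commit()


def schedule(conn, dedupe_key, run_at):
    """Insert a future task; the UNIQUE key rejects a second copy of the same action."""
    try:
        conn.execute(
            "INSERT INTO scheduled_tasks (dedupe_key, run_at) VALUES (?, ?)",
            (dedupe_key, run_at),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False


def cancel(conn, dedupe_key):
    cur = conn.execute(
        "UPDATE scheduled_tasks SET status = 'canceled' "
        "WHERE dedupe_key = ? AND status = 'pending'",
        (dedupe_key,),
    )
    conn.commit()
    return cur.rowcount == 1


def claim_due(conn, now):
    """Move due tasks from 'pending' to 'enqueued' and return their keys."""
    rows = conn.execute(
        "SELECT id, dedupe_key FROM scheduled_tasks "
        "WHERE status = 'pending' AND run_at <= ? ORDER BY run_at",
        (now,),
    ).fetchall()
    claimed = []
    for task_id, key in rows:
        # The status check in WHERE makes the claim a guarded state transition.
        cur = conn.execute(
            "UPDATE scheduled_tasks SET status = 'enqueued' "
            "WHERE id = ? AND status = 'pending'",
            (task_id,),
        )
        if cur.rowcount == 1:
            claimed.append(key)
    conn.commit()
    return claimed


conn = sqlite3.connect(":memory:")
create_schema(conn)
first_insert = schedule(conn, "reminder:user-1", run_at=100)
second_insert = schedule(conn, "reminder:user-1", run_at=100)  # duplicate request
schedule(conn, "reminder:user-2", run_at=200)
schedule(conn, "expire:reservation-9", run_at=50)
canceled = cancel(conn, "expire:reservation-9")

due_at_150 = claim_due(conn, now=150)
due_again = claim_due(conn, now=150)
due_at_250 = claim_due(conn, now=250)
```

Storing tasks as rows is what makes cancellation and auditing straightforward here, which is the main advantage this approach has over a plain delay queue.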
+ +### 10.4 Duplicate Prevention + +Delayed systems often produce duplicates due to scheduler retries or failover. + +Prevention techniques: + +- unique key per logical scheduled action +- state transition checks before execution +- atomic claim of due task rows +- idempotent handlers at execution time + +### 10.5 Real-World Examples + +- subscription renewal attempts in SaaS billing +- abandoned cart reminders in e-commerce +- invoice follow-up emails after fixed intervals +- account lock release or token expiration enforcement +- payment retry strategies after transient bank failures + +### 10.6 Common Mistakes with Delayed Tasks + +- storing future jobs with no cancellation path +- assuming one scheduler instance will always stay healthy +- not accounting for daylight saving or time zone logic in user-facing reminders +- retrying delayed jobs without a stable dedupe key + +## 11. Task Execution + +### 11.1 How Worker Execution Actually Behaves in Production + +People often imagine worker execution as: + +1. pick message +2. do work +3. mark done + +Real production behavior is messier. + +Workers can: + +- crash mid-execution +- lose network connection after side effects happen +- time out while still running +- be rescheduled during deployment +- hold locks too long +- partially complete multi-step work + +So execution design is fundamentally about ownership, retries, and safe completion. + +### 11.2 Worker Lifecycle + +```mermaid +flowchart TD + A[Worker Polls or Receives Task] --> B[Claims Task / Gets Visibility Lease] + B --> C[Processes Task] + C --> D{Succeeded?} + D -- Yes --> E[Ack / Delete / Commit Progress] + D -- No --> F{Retryable?} + F -- Yes --> G[Requeue or Wait for Visibility Timeout] + F -- No --> H[Move to DLQ or Mark Failed] + E --> I[Fetch Next Task] + G --> I + H --> I +``` + +### 11.3 Task Pickup and Locking + +Task pickup needs some form of ownership control. 
+ +Common patterns: + +- broker-delivered lease, such as SQS visibility timeout +- database row claim using `SELECT ... FOR UPDATE` or status transition +- distributed lock for externally coordinated work +- partition ownership in Kafka consumer groups + +The goal is not perfect exclusivity forever. The goal is to limit concurrent conflicting processing and recover when workers die. + +### 11.4 Acknowledgments and Visibility Timeout Interaction + +In systems with leases or visibility timeouts, the worker has temporary ownership. + +This creates a critical operational question: + +What if the task takes longer than expected? + +Possible outcomes: + +- worker extends the lease or visibility timeout +- worker times out and another worker picks the same task +- duplicate processing occurs + +That is why long-running tasks need heartbeats, lease extension, or a more suitable orchestration system. + +### 11.5 Task Ownership and Partial Failure Handling + +Ownership is rarely absolute. + +If a worker performs an external side effect and crashes before acknowledgment, another worker may retry. + +That means correctness lives in: + +- idempotent side effects +- durable progress markers +- safe retry boundaries +- compensating logic where idempotency is impossible + +### 11.6 Exactly-Once Myths + +Exactly-once execution is usually not a property of the worker fleet as a whole. + +More realistic statements are: + +- exactly-once insertion into this table because of a unique constraint +- exactly-once state transition for this entity because of version checks +- exactly-once recording of an event in an outbox because it commits in the same transaction as the state change (the relay that publishes it is still at-least-once) + +At the task execution level, the practical design is still at-least-once delivery plus idempotent handling. 
+ +### 11.7 Idempotent Job Design + +Good idempotent job design often includes: + +- stable logical operation ID +- input version or event sequence number +- side-effect record of what was already done +- compare-and-set or unique constraint protection +- explicit completion record + +### 11.8 Common Mistakes in Task Execution + +- acknowledging before the durable side effect is safe +- running tasks longer than visibility timeout without extension +- assuming retries will not overlap +- building workers that cannot resume or recognize partial completion +- failing to propagate correlation IDs and attempt counts + +## 12. Worker Pools + +### 12.1 Why Worker Pools Matter + +A queue by itself does nothing. Worker pools turn queued work into completed work. + +Their job is to provide controlled concurrency. + +If concurrency is too low, backlog grows and user-visible delay increases. + +If concurrency is too high, downstream systems get overwhelmed. + +### 12.2 Concurrency Models + +| Model | Best for | Strengths | Weaknesses | +|---|---|---|---| +| Thread pool | Many blocking I/O tasks | Simple mental model in many languages | Memory and scheduling overhead at high counts | +| Async event loop | Very high I/O concurrency | Efficient for network-heavy workloads | Poor fit for CPU-heavy work | +| Process pool | CPU-bound jobs or isolation needs | Better CPU scaling and fault isolation | Higher startup and memory cost | +| Container fleet | Mixed workloads and autoscaling | Operational isolation and elasticity | More orchestration complexity | + +The right model depends on the job type. + +### 12.3 CPU-Bound vs I/O-Bound Jobs + +This distinction matters a lot. 
+ +#### CPU-bound + +Examples: + +- video transcoding +- image processing +- compression +- heavy document rendering + +Need: + +- process-level parallelism or specialized compute +- bounded concurrency to avoid CPU thrash + +#### I/O-bound + +Examples: + +- calling APIs +- sending emails +- writing webhooks +- waiting on storage or DB operations + +Need: + +- high concurrency but careful rate limiting +- strong timeout and retry policies + +### 12.4 Queue Backlog Management + +Backlog management is not just "add more workers." + +You need to ask: + +- are workers the real bottleneck? +- will scaling workers just overload the database? +- are some tasks much longer than others? +- should the queue be split by priority or runtime class? + +Useful metrics: + +- backlog size +- age of oldest message +- throughput per worker +- worker saturation +- downstream latency and error rate + +### 12.5 Worker Starvation, Prioritization, and Fairness + +If all jobs share one queue, long or low-value tasks can starve urgent tasks. + +Common strategies: + +- separate queues by priority +- weighted worker allocation +- per-tenant fairness limits +- dedicated pools for CPU-heavy work +- maximum runtime or chunking for large jobs + +Fairness matters especially in multi-tenant SaaS. One noisy customer should not consume the entire worker fleet. + +### 12.6 Autoscaling Worker Fleets + +Autoscaling sounds simple but can be dangerous. + +Good autoscaling inputs: + +- queue depth +- age of oldest message +- worker CPU or memory +- downstream dependency health + +Bad autoscaling pattern: + +- scale workers aggressively on queue depth alone +- overwhelm the database or external API +- create more retries and errors +- make the backlog worse + +A mature autoscaling design considers both queue pressure and downstream capacity. 
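One way to express "queue pressure plus downstream capacity" is a sizing function that refuses to scale up while the dependency is unhealthy. This is a rough sketch: the signal names, thresholds, and the 25% step-down are assumptions for illustration, not recommended values.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueSignals:
    depth: int                    # messages waiting in the queue
    oldest_age_s: float           # age of the oldest message, in seconds
    per_worker_rate: float        # messages per second one worker can process (> 0)
    downstream_error_rate: float  # fraction of downstream calls failing, 0..1


def desired_workers(signals, current, target_drain_s=300, max_age_s=120,
                    max_error_rate=0.05, min_workers=1, max_workers=50):
    """Return a worker count that respects both backlog and dependency health."""
    if signals.downstream_error_rate > max_error_rate:
        # Dependency is struggling: never add workers, step down by about 25%.
        return max(min_workers, min(current, math.ceil(current * 0.75)))

    # Workers needed to drain the current backlog within the target window.
    needed = math.ceil(signals.depth / (signals.per_worker_rate * target_drain_s))

    if signals.oldest_age_s > max_age_s:
        # Messages are waiting too long even if the backlog looks small.
        needed = max(needed, current + 1)

    return max(min_workers, min(max_workers, needed))


healthy = desired_workers(
    QueueSignals(depth=6000, oldest_age_s=30, per_worker_rate=2.0,
                 downstream_error_rate=0.01),
    current=4,
)
unhealthy = desired_workers(
    QueueSignals(depth=6000, oldest_age_s=30, per_worker_rate=2.0,
                 downstream_error_rate=0.20),
    current=8,
)
stale = desired_workers(
    QueueSignals(depth=100, oldest_age_s=500, per_worker_rate=2.0,
                 downstream_error_rate=0.0),
    current=3,
)
```

The `unhealthy` case has the same backlog as `healthy` but returns fewer workers, which is the opposite of what a depth-only autoscaler would do.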
+ +```mermaid +flowchart LR + Q[[Queue Backlog]] --> M[Autoscaling Controller] + M --> W1[Worker Pod 1] + M --> W2[Worker Pod 2] + M --> W3[Worker Pod N] + W1 --> DEP[DB / External Dependency] + W2 --> DEP + W3 --> DEP + DEP --> FB[Health Signals / Rate Limits] + FB --> M +``` + +### 12.7 Graceful Shutdown and Draining Workers Safely + +Deployments and autoscaling terminate workers all the time. + +Graceful shutdown means: + +- stop pulling new work +- finish or checkpoint in-flight tasks if possible +- extend lease if needed while draining +- release ownership safely if the task cannot finish +- emit final logs and metrics + +Without graceful draining, deployments create duplicates, partial failures, and confusing spikes in retry counts. + +### 12.8 Real-World Worker Fleet Patterns + +- GitHub-like webhook delivery systems often separate fast delivery workers from slow retry workers +- Stripe-like payment or reconciliation systems often isolate high-risk jobs from low-risk notification jobs +- Netflix-like media pipelines use specialized worker pools for CPU-heavy transcoding and separate pipelines for metadata updates +- SaaS export systems often dedicate pools for large tenant exports so normal email or notification jobs are not starved + +## 13. Job Scheduler + +### 13.1 What Cron Systems Are + +Cron systems run tasks on a schedule. + +Simple examples: + +- daily cleanup job +- hourly analytics aggregation +- monthly invoice generation +- weekly email digest + +Single-machine cron is enough for small systems, but it becomes fragile quickly in distributed production environments. 
+ +### 13.2 Single-Machine Cron vs Distributed Cron + +| Approach | Strengths | Weaknesses | +|---|---|---| +| Single machine cron | Simple, easy to understand | Single point of failure, weak observability, hard failover | +| Distributed scheduler | Highly available, scalable, operationally visible | Much harder correctness and coordination problems | + +### 13.3 Limitations of Basic Cron + +Basic cron does not handle many real production needs well: + +- missed execution recovery after downtime +- duplicate prevention across multiple scheduler instances +- retries with visibility into failures +- dynamic per-tenant schedules +- audit trails and job history +- time zone aware execution at scale + +### 13.4 Drift Problems and Missed Execution Handling + +Schedulers drift because clocks drift, processes pause, infrastructure fails, and queues back up. + +Questions a real scheduler must answer: + +- if a server is down at the scheduled minute, should the job run later? +- if a job is delayed 20 minutes, is it still useful? +- if the scheduler restarts, how does it discover missed runs? +- if the same run is scheduled twice during failover, how is duplicate execution prevented? + +### 13.5 Recurring Jobs + +Recurring jobs are jobs that should run repeatedly according to a schedule. + +Examples: + +- recurring billing +- data cleanup +- usage aggregation +- periodic sync with third-party systems +- digest emails +- reindexing or cache warming jobs + +The important design point is that recurring jobs still need idempotency. + +If the scheduler runs the same billing task twice, the system must not double-charge. + +### 13.6 Scheduling Guarantees + +No scheduler gives perfect guarantees for free. + +Useful questions: + +- at what level is uniqueness guaranteed? +- how are missed runs handled? +- how are retries tracked? +- can operators trigger backfill safely? +- can runs overlap, or must they serialize? 
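The missed-run and uniqueness questions above can be turned into a small policy function. This sketch assumes fixed-interval schedules expressed as integer timestamps and identifies each run by a `(job_id, scheduled_time)` key; the `max_staleness` and `catch_up` parameters are hypothetical knobs for "is a late run still useful?" and "should every missed run be backfilled?".

```python
def missed_runs(last_scheduled, now, interval):
    """List every scheduled time after last_scheduled that is already due."""
    runs = []
    t = last_scheduled + interval
    while t <= now:
        runs.append(t)
        t += interval
    return runs


def runs_to_execute(job_id, last_scheduled, now, interval, max_staleness, catch_up):
    """Return run keys to enqueue after downtime, according to the recovery policy."""
    due = [t for t in missed_runs(last_scheduled, now, interval)
           if now - t <= max_staleness]   # drop runs too old to be useful
    if not catch_up:
        due = due[-1:]                    # only the most recent missed run
    # The (job_id, scheduled_time) pair is a stable key for duplicate prevention.
    return [(job_id, t) for t in due]


# Scheduler was down from t=0 until t=3700; the job runs every 600 seconds.
all_missed = missed_runs(last_scheduled=0, now=3700, interval=600)
backfill = runs_to_execute("usage-agg", 0, 3700, 600, max_staleness=1500, catch_up=True)
latest_only = runs_to_execute("usage-agg", 0, 3700, 600, max_staleness=1500, catch_up=False)
```

Because each run has a deterministic key, enqueueing the same run twice during failover can be rejected by a unique constraint or an idempotent handler downstream.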
+ +### 13.7 Distributed Scheduling + +This is much harder than normal cron because now you need coordination. + +Problems to solve: + +- which scheduler instance is active? +- how do you avoid duplicate execution? +- how do you recover if the leader dies mid-scheduling? +- how do you handle clock skew across nodes? +- how do you scale millions of scheduled tasks? + +#### Leader election + +One instance becomes the active scheduler and others remain standby. + +This simplifies duplicates but creates failover logic and lease management concerns. + +#### Distributed locks + +Locks can protect scheduling or execution of a particular run, but lock systems themselves must be reliable and carefully scoped. + +#### Shard-based scheduling + +Very large systems may partition schedules by tenant, time bucket, or hash range so multiple scheduler nodes each own part of the schedule space. + +#### Global-scale scheduling + +At global scale, teams must think about: + +- region failover +- clock discipline +- per-region scheduling ownership +- data locality +- cross-region duplicate prevention + +```mermaid +flowchart LR + subgraph Scheduler Cluster + S1[Scheduler A] + S2[Scheduler B] + S3[Scheduler C] + end + LEASE[(Leader Lease / Lock)] --> S1 + S1 --> DUE[Find Due Jobs] + DUE --> ENQ[[Enqueue Runnable Jobs]] + ENQ --> WF[Worker Fleet] + WF --> RES[(Job State / Run History)] + RES --> S1 +``` + +### 13.8 Scheduler Failover + +When the active scheduler dies, another instance should take over. + +The hard part is preventing both from scheduling the same run. + +Common techniques: + +- leader lease with short renewal interval +- run records with unique keys such as `(job_id, scheduled_time)` +- idempotent enqueue logic +- execution-time dedupe in workers + +### 13.9 Kubernetes CronJobs Basics + +Kubernetes CronJobs are useful for many teams because they provide managed scheduled execution on a cluster. 
+ +They are a strong default for moderate complexity, but they do not remove all design concerns: + +- missed runs during cluster issues +- overlapping runs +- idempotency of the underlying job +- observability of job success/failure at business level + +They schedule containers. They do not automatically solve workflow correctness. + +### 13.10 Workflow Orchestration Basics + +Some workflows are too complex for a single queue job or cron trigger. + +Examples: + +- multi-step data pipelines +- approval workflows +- ML training plus validation plus deployment steps +- payment recovery flows with timed retries and branching behavior + +Airflow-style or workflow-engine systems add: + +- DAGs or state machines +- retries by step +- dependency management +- backfills +- run history and observability + +These systems are useful when the workflow itself becomes a product-worthy operational object. + +### 13.11 Why Distributed Scheduling Is Harder Than Normal Cron + +Because time itself becomes a distributed systems problem. + +Normal cron on one machine mostly asks, "What time is it?" + +Distributed scheduling asks: + +- which machine decides what time it is for this job? +- how do we prove only one machine scheduled this run? +- what if the scheduler died after enqueue but before recording success? +- how do we recover missed windows without duplicating work? + +That is why scheduler correctness is often underestimated. + +## 14. 
Comparative View: Kafka vs RabbitMQ vs SQS + +### 14.1 High-Level Comparison + +| System | Best for | Strengths | Tradeoffs | +|---|---|---|---| +| Kafka | High-throughput event streams, replay, fanout pipelines, analytics | Replayable log, strong throughput, consumer groups, retention | More operational complexity, partition management, weaker fit for classic job semantics | +| RabbitMQ | Task routing, broker-managed messaging, transactional workflows | Flexible exchanges and routing, classic queue semantics, ack model | Less natural for long retention and replay-heavy pipelines | +| SQS | Managed cloud queueing for jobs and async workflows | Operational simplicity, elastic scale, easy cloud integrations | Fewer advanced routing and replay features, polling model, less control | + +### 14.2 Simple Selection Heuristic + +- choose Kafka when events need replay, multiple consumer groups, and very high throughput +- choose RabbitMQ when routing logic and broker-managed delivery matter most +- choose SQS when you want a reliable managed queue with low operational overhead + +In real production systems, companies often use more than one: + +- Kafka for domain event streams and analytics +- SQS or RabbitMQ for application job processing +- internal pub/sub for fanout +- workflow engines for complex orchestration + +## 15. Putting the Pieces Together in Real Architecture + +### 15.1 Example: E-Commerce / SaaS Order Flow + +```mermaid +flowchart LR + U[User] --> API[Checkout API] + API --> DB[(Orders DB)] + API --> OUT[(Outbox Table)] + OUT --> PUB[Publisher] + PUB --> BUS[[Event Bus / Queue]] + BUS --> PAY[Payment Worker] + BUS --> INV[Inventory Worker] + BUS --> MAIL[Email Worker] + BUS --> ANA[Analytics Consumer] + PAY --> LEDGER[(Ledger / Payment Records)] + INV --> STOCK[(Inventory DB)] + MAIL --> ESP[Email Provider] + PAY --> DLQ[DLQ / Failed Operations] + MAIL --> DLQ +``` + +What is happening here: + +1. request path writes the authoritative order record +2. 
outbox pattern ensures the event is not lost between DB write and publication +3. event bus or queue fans out downstream work +4. workers run with retries and idempotency +5. failures that exceed retry policy move to DLQ +6. analytics can consume events separately without coupling to checkout latency + +This is much closer to how real systems work than a purely synchronous diagram. + +### 15.2 Example: Stripe-Like Payment Workflow + +Practical characteristics: + +- API request must be low latency and strongly correct for the initial payment intent creation +- downstream settlement, ledger posting, receipt email, webhook delivery, and reconciliation are async +- idempotency keys are critical +- retries must be carefully controlled because duplicate charges are unacceptable +- operator visibility into failed jobs is mandatory + +### 15.3 Example: GitHub-Like Event and Webhook System + +Practical characteristics: + +- repository events and actions fan out internally +- webhook delivery is isolated from the primary product flow +- each delivery attempt is tracked +- retries happen with backoff over time +- failed deliveries are inspectable and sometimes replayable + +### 15.4 Example: Uber-Like Event-Driven Platform + +Practical characteristics: + +- trip lifecycle events are high volume and consumed by many systems +- per-trip ordering matters more than global ordering +- event streaming platforms are a strong fit +- analytics, fraud, support, receipts, and pricing all consume the same event lineage differently + +### 15.5 Example: Netflix-Like Media Pipeline + +Practical characteristics: + +- ingest and playback telemetry produce huge streams +- media processing is CPU-heavy and asynchronous +- worker pools are specialized by workload type +- retries and backpressure must respect expensive compute resources + +## 16. 
What Breaks at Scale + +### 16.1 Common Failure Patterns + +- queue backlog grows faster than workers can drain it +- retries multiply load on a degraded dependency +- poison messages block throughput without DLQ isolation +- hot keys or partitions create uneven load +- long-running tasks exceed visibility timeouts and execute twice +- schema changes break consumers silently +- scheduler failover creates duplicate runs +- lack of tracing makes async flows impossible to debug quickly + +### 16.2 Scaling Considerations + +When discussing scale, cover: + +- arrival rate versus processing rate +- partitioning strategy +- worker concurrency and downstream capacity +- storage retention and replay cost +- per-tenant fairness +- regional failover and disaster recovery +- operator tooling for replay, cancel, and backfill + +### 16.3 Operational Metrics That Matter + +| Layer | Metrics | +|---|---| +| Queue / broker | depth, lag, enqueue rate, dequeue rate, retention, publish failures | +| Worker fleet | concurrency, throughput, CPU, memory, error rate, retry count | +| Workflow | end-to-end completion latency, success rate, duplicate suppression rate | +| DLQ | inflow, size, age, replay success rate | +| Scheduler | missed runs, duplicate runs, scheduling delay, lock/lease health | + +## 17. 
Best Practices + +### 17.1 Design Principles + +- keep the request path minimal and durable +- assume at-least-once delivery unless you can prove otherwise +- make consumers idempotent +- add retries only for transient failures +- use exponential backoff and jitter +- isolate poison messages with DLQs +- store job state when the business flow needs visibility or repairability +- propagate correlation IDs across async boundaries +- choose technology based on workload shape, not hype + +### 17.2 Architecture Practices Used in Production + +- outbox pattern to publish events reliably after DB commits +- inbox or dedupe table on consumers for exactly-once business effects +- separate queues by priority or runtime class +- autoscale workers with awareness of downstream limits +- runbooks for DLQ inspection and replay +- schema versioning discipline for events +- replay mechanisms that are filtered and rate-limited + +### 17.3 Common Interview Mistakes + +- saying "exactly-once" without explaining what boundary actually guarantees it +- saying "use Kafka" without explaining why stream semantics matter +- saying "add retries" without discussing idempotency and retry storms +- saying "use cron" without addressing duplicate prevention and missed runs in distributed systems +- treating async as always better than sync + +## 18. How to Answer in an Interview + +When an interviewer asks about async systems, structure your answer like this: + +1. identify what must stay synchronous because the user needs it now +2. identify what can move off the request path +3. choose the async primitive: queue, pub/sub, event stream, or scheduler +4. explain delivery semantics and idempotency +5. explain failure handling: retries, DLQ, backpressure +6. explain scaling: partitions, worker pools, autoscaling, backlog monitoring +7. 
explain observability and operator workflows + +A concise but strong answer often sounds like this: + +"I would keep the minimum correctness-critical write in the synchronous request path, then publish an async job or domain event for non-blocking work. I would assume at-least-once delivery, make consumers idempotent, add exponential backoff with jitter for transient failures, send poison messages to a DLQ, and monitor queue depth, message age, lag, and retry rates. If the workload needs replay and many independent consumers, Kafka is a strong fit. If it is more task-oriented with routing-heavy workflows, RabbitMQ or SQS may be better depending on whether we want operational simplicity or richer broker behavior." + +## 19. Final Takeaway + +Async systems are not just queues. They are a way of designing software around the fact that real production systems are full of slow dependencies, bursty traffic, retries, partial failures, and work that does not belong in the request path. + +If you understand async systems well, you can answer four important questions clearly: + +- What should happen now versus later? +- How do we avoid losing work? +- How do we avoid doing the same work dangerously twice? +- How do we keep the system stable when traffic and failures spike? + +That is exactly why async systems are central both to elite interview performance and to real backend engineering. diff --git a/systems design/6.commSystems.md b/systems design/6.commSystems.md new file mode 100644 index 0000000..2c0ba2d --- /dev/null +++ b/systems design/6.commSystems.md @@ -0,0 +1,2363 @@ +# Communication Systems + +Communication systems are the part of backend engineering that move information from one actor to another with the right combination of speed, reliability, cost, and user experience. + +That sounds simple until you try to build one in production. 
+ +At small scale, communication looks like: + +- send a message to a browser +- notify a mobile app that something changed +- email a receipt +- send an OTP SMS +- show whether a user is online + +At production scale, the real questions become harder: + +- how fast must the update arrive before the product feels broken? +- what happens if the user disconnects halfway through? +- how do you keep millions of long-lived connections open? +- how do you avoid sending the same notification five times? +- how do you retry safely without creating retry storms? +- how do you respect user preferences, cost constraints, and compliance requirements? +- how do you fan one event out to millions of recipients without melting the system? + +This is why communication systems are a major system design topic. They sit at the boundary between users and backend state, so every weakness becomes visible quickly: latency, stale data, duplicate delivery, missing messages, ghost presence, notification fatigue, and operational complexity. + +This guide is written for two goals at once: + +- interview preparation, where you need to explain tradeoffs clearly and structure your answer well +- real backend engineering, where you need to design systems that survive load, failure, and product growth + +The focus here is not memorizing definitions. The focus is understanding why these systems exist, how they work internally, what breaks at scale, and how companies such as Google, Netflix, Uber, Amazon, GitHub, Stripe, Slack, WhatsApp, Discord, and typical SaaS platforms think about them in practice. + +## 1. Big Picture: What Communication Systems Solve + +Communication systems usually fall into two overlapping families: + +1. real-time systems, where data should move to users quickly while they are actively connected +2. 
notification systems, where events should be delivered across channels such as in-app, email, SMS, or push, even if the user is not currently online + +In real products, these are not separate worlds. They are usually fed by the same event streams. + +Example: + +- a ride status changes from `driver_arrived` to `trip_started` +- the rider app needs a real-time UI update +- the driver app may need confirmation +- analytics pipelines need the event +- customer support tools may need the updated timeline +- if the rider is offline, a push notification may be needed + +One state transition can trigger multiple communication paths. + +```mermaid +flowchart LR + A[Business Event
order placed / trip updated / message sent] --> B[[Event Bus / Outbox]] + B --> C[Real-Time Delivery Layer] + B --> D[Notification Orchestrator] + B --> E[Analytics / Audit / Logging] + C --> F[WebSocket / SSE / Long Polling Clients] + D --> G[In-App Inbox] + D --> H[Email Provider] + D --> I[SMS Provider] + D --> J[Push Provider] +``` + +The architectural pattern is consistent across many companies: + +- business systems emit authoritative events +- communication infrastructure decides who should be told, through which channel, and with what guarantee +- channel-specific delivery systems handle protocol differences and provider failures + +### 1.1 The Core Design Tension + +Communication systems are always balancing five things: + +| Concern | What you want | Why it is hard | +|---|---|---| +| Latency | Users see updates quickly | Fast delivery usually requires persistent state and more expensive infrastructure | +| Reliability | Important messages are not lost | Retries create duplicates unless the system is idempotent | +| Ordering | Updates arrive in sensible order | Distributed systems naturally reorder across partitions, retries, and reconnects | +| Cost | Delivery is affordable | SMS and mobile push can be expensive or rate-limited; persistent connections consume memory | +| User experience | Users are informed but not annoyed | Over-delivery causes fatigue; under-delivery makes the product feel broken | + +Strong system design answers acknowledge all five. + +### 1.2 Interview Framing + +If an interviewer asks about communication systems, they are usually testing whether you can think beyond "send message from A to B." 
+ +They want to hear that you understand: + +- latency targets depend on the product +- transport choice depends on directionality and interaction pattern +- delivery guarantees must be designed, not assumed +- large fanout requires asynchronous pipelines and backpressure control +- presence and online status are approximate, not perfect truth +- notification systems require orchestration, deduplication, preferences, and retry strategy + +The best answers sound like this: + +"I would separate the authoritative business event from channel delivery. Then I would choose the transport based on interaction style, add idempotency and retry controls, and design for fanout, reconnects, and per-user preference rules." + +## 2. Real-Time Systems Overview + +Real-time systems are systems where the value of data depends heavily on how quickly it arrives. + +In backend interviews, "real-time" almost always means internet-scale soft real-time systems, not embedded hard real-time control systems. + +### 2.1 What Real-Time Means in Backend Systems + +In a web or mobile product, real-time usually means the update should arrive fast enough that the user feels the application is live. + +Examples: + +- chat messages should appear almost immediately +- typing indicators should feel instant +- ride location should update every few seconds +- collaborative edits should converge with minimal visible lag +- market dashboards should feel current enough for decisions +- multiplayer game state should update continuously, though serious gaming systems often use specialized protocols beyond basic web stacks + +"Real-time" is therefore a product requirement, not a single protocol. 
+ +### 2.2 Soft Real-Time vs Hard Real-Time + +| Type | Meaning | Typical examples | Backend interview relevance | +|---|---|---|---| +| Hard real-time | Missing a deadline is a correctness failure | avionics, industrial control, some medical systems | Usually not what web/backend interviews mean | +| Soft real-time | Missing a deadline degrades user experience, but the system still functions | chat, live dashboards, collaborative tools, ride tracking | The common system design case | + +Hard real-time systems care about deterministic deadlines. Soft real-time systems care about low latency and bounded staleness. + +For backend products, the question is usually: + +"How stale can data become before the product feels wrong?" + +That is a much more practical framing than arguing whether the system is literally real-time. + +### 2.3 Why Real-Time Communication Matters + +Real-time communication matters because it changes product behavior: + +- it makes collaboration feel shared instead of delayed +- it reduces the need for manual refreshes +- it keeps users engaged during active workflows +- it improves trust in stateful experiences such as deliveries, rides, or build pipelines +- it enables products whose value depends on immediacy, such as incident dashboards or trading views + +Poor real-time behavior is visible immediately. + +Users may say: + +- "the chat is laggy" +- "the driver marker jumps randomly" +- "I got the notification but the app still shows old state" +- "it says they are online but they are not responding" + +These are communication-system failures, not merely UI issues. + +### 2.4 Latency Expectations by Product + +There is no universal latency target. It depends on the human workflow. 
+ +| Use case | Typical expectation | What users notice | +|---|---|---| +| Typing indicators | 50 to 300 ms | Anything slower feels fake | +| Chat message delivery | under 500 ms ideal | Multi-second delay feels broken | +| Collaborative editing | under 100 to 300 ms for local echo and convergence | Delayed merge or cursor motion is obvious | +| Ride or delivery tracking | 1 to 5 seconds is often acceptable | Long freezes reduce trust | +| Live operational dashboard | 1 to 10 seconds depending on use | Depends on decision criticality | +| Financial/trading dashboards | often sub-second to a few hundred ms for UX, though core trading infra may require much stricter guarantees | Stale numbers can be dangerous | + +A good interview answer explicitly sets a target. If you do not define the acceptable staleness, the rest of the design becomes vague. + +### 2.5 Consistency Challenges + +Real-time systems are fundamentally consistency problems with a time dimension. + +Common challenges: + +- the backend state changes faster than clients can consume updates +- multiple clients observe updates from different replicas or regions +- reconnecting clients may miss some events and need replay +- the UI may apply optimistic updates before the server confirms them +- ordering may break when events are retried, sharded, or merged from multiple sources + +This means you should think in layers: + +1. authoritative state lives in the source-of-truth system +2. real-time delivery is a fast propagation layer, not the only truth +3. clients often need version numbers, sequence numbers, or snapshots to reconcile with source-of-truth state + +### 2.6 Ordering Guarantees + +Ordering sounds simple until the system becomes distributed. 
+ +Useful distinctions: + +- per-connection ordering: messages on a single TCP connection arrive in order +- per-partition ordering: messages within one partition of Kafka or another log may be ordered +- per-room or per-entity ordering: feasible if you shard consistently by room, stream, or entity ID +- global ordering: usually expensive, unnecessary, or impossible at scale + +Interviewers often like hearing this sentence: + +"I would aim for per-entity ordering, not global ordering, because global ordering becomes a bottleneck and is rarely needed by the product." + +### 2.7 Delivery Guarantees + +Real-time systems usually choose one of these practical models: + +| Guarantee | Meaning | Reality in production | +|---|---|---| +| At-most-once | Message may be lost, but not duplicated | Simple and low overhead; common for low-value ephemeral events like typing | +| At-least-once | Message may be duplicated but should eventually arrive | Common when reliability matters and clients can dedupe | +| Exactly-once | Message is processed once end-to-end | Usually approximated with idempotency, dedupe keys, and transactional boundaries rather than achieved literally | + +Important point: a transport like WebSocket does not magically provide business-level delivery guarantees. It only moves bytes on a connection. Business-level guarantees require acknowledgments, persistence, retries, dedupe, and replay logic. + +### 2.8 Fanout Challenges + +Fanout means taking one event and sending it to many recipients. + +Examples: + +- one new chat message goes to every participant in a channel +- one sports score update goes to millions of viewers +- one incident event updates many dashboards + +Fanout is where many designs fail, because the cost of one event is no longer constant. + +If a message goes to 3 users, the cost is small. +If it goes to 3 million users, the delivery system becomes the problem. 
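To make the difference concrete, here is a back-of-the-envelope sketch. The 200-byte payload and the 50,000 sends per second per gateway are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
import math


def fanout_cost(recipients, payload_bytes, sends_per_gateway_per_sec):
    """Estimate the delivery work created by ONE event fanned out to many recipients."""
    return {
        # Bytes written to sockets, ignoring framing and TLS overhead.
        "total_bytes": recipients * payload_bytes,
        # Seconds a single gateway would need to perform every send by itself.
        "single_gateway_seconds": recipients / sends_per_gateway_per_sec,
    }


def gateways_needed(recipients, sends_per_gateway_per_sec, deadline_seconds):
    """Gateways required to finish the fanout within the deadline."""
    per_gateway_capacity = sends_per_gateway_per_sec * deadline_seconds
    return max(1, math.ceil(recipients / per_gateway_capacity))


small = fanout_cost(3, 200, 50_000)
large = fanout_cost(3_000_000, 200, 50_000)
print(small["total_bytes"], large["total_bytes"])
print(gateways_needed(3, 50_000, 1.0), gateways_needed(3_000_000, 50_000, 1.0))
```

Under these assumptions, the 3-user case writes 600 bytes and fits on one gateway, while the 3-million-user case writes 600 MB and needs about 60 gateways to finish within one second.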
+ +Production systems handle this with combinations of: + +- pub/sub layers +- shard-aware connection gateways +- regional fanout +- batching or aggregation for low-priority updates +- per-subscriber filtering so only interested users receive the event + +### 2.9 Scaling Persistent Connections + +Persistent connections are not free. + +Each active connection consumes: + +- file descriptors +- kernel socket buffers +- user-space memory for connection metadata and outbound queues +- CPU for heartbeats, TLS, parsing, and serialization + +At one million connections, even small per-connection cost matters. + +If one connection costs only 20 KB of memory for total state, one million connections already imply roughly 20 GB of memory before business logic overhead. + +That is why large-scale real-time systems often use dedicated gateway tiers optimized for connection handling, event loops, and lightweight session metadata. + +### 2.10 Stateless vs Stateful Challenges + +HTTP APIs are often designed to be stateless. Real-time connection layers are not. + +A persistent connection pins a client to a specific server process for some period of time. 
That creates state: + +- which user is attached to which gateway +- which rooms/channels the connection is subscribed to +- what the last acknowledged sequence number was +- what outbound messages are buffered + +This creates operational consequences: + +- load balancers may need sticky behavior for established connections +- connection draining becomes important during deploys +- reconnect storms can overload specific shards +- state often must be externalized partially to Redis, Kafka, or another coordination layer + +The usual production compromise is: + +- stateful edge or gateway layer for live connections +- stateless application layer for most business logic +- shared pub/sub or stream infrastructure connecting the two + +```mermaid +flowchart LR + C[Clients] --> LB[Load Balancer] + LB --> G1[Realtime Gateway 1] + LB --> G2[Realtime Gateway 2] + LB --> G3[Realtime Gateway 3] + G1 --> PS[[Pub/Sub / Stream Bus]] + G2 --> PS + G3 --> PS + PS --> APP[Business Services] + APP --> DB[(Source of Truth DB)] +``` + +### 2.11 Common Interview Examples + +Typical products that surface these issues: + +- chat systems such as Slack, WhatsApp, Discord +- trading or operational dashboards used by finance or SRE teams +- collaborative editors such as Google Docs +- ride tracking systems such as Uber or delivery apps +- multiplayer or shared presence systems + +Each of these has different priorities: + +- chat needs reasonable ordering and persistence +- collaborative editing needs conflict resolution and convergence +- ride tracking needs geo-update pipelines and bounded staleness +- dashboards often need efficient broadcast and backpressure handling + +## 3. WebSockets + +WebSockets are the standard choice when a browser or app needs persistent, bidirectional, low-latency communication over a single long-lived connection. 
+ +### 3.1 What WebSockets Are + +WebSocket is a protocol that starts as an HTTP request and then upgrades to a long-lived TCP connection that both sides can use to send messages independently. + +Why this matters: + +- traditional HTTP is request-response oriented +- many interactive systems need the server to push updates without waiting for a fresh request +- some systems also need the client to send frequent messages without paying repeated HTTP overhead + +Examples: + +- chat input and delivery +- collaborative cursor updates +- game or live collaboration events +- dashboard subscriptions where clients may also send control messages + +### 3.2 Why WebSockets Exist + +Before WebSockets, developers used polling, long polling, or various hacks to simulate server push. + +Those approaches had problems: + +- repeated request headers and repeated TLS work +- higher latency between event generation and client receipt +- difficulty handling truly bidirectional traffic +- more server overhead due to connection churn + +WebSockets exist because web applications increasingly behaved like continuously connected applications rather than static pages. + +### 3.3 HTTP Upgrade Flow + +The connection begins as HTTP and upgrades. 
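At the header level, the server proves it understood the upgrade by hashing the client's `Sec-WebSocket-Key` with a fixed GUID from RFC 6455 and returning the result as `Sec-WebSocket-Accept`. The sketch below shows only that handshake step. The function names are hypothetical, and it deliberately omits authentication, origin validation, and the `Sec-WebSocket-Version` check that a real gateway would perform.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WEBSOCKET_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def compute_accept_key(sec_websocket_key: str) -> str:
    """Derive the Sec-WebSocket-Accept value from the client's key."""
    digest = hashlib.sha1((sec_websocket_key + WEBSOCKET_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_upgrade_response(request_headers: dict) -> str:
    """Return the 101 response for a valid upgrade request, or raise ValueError."""
    headers = {name.lower(): value for name, value in request_headers.items()}
    if headers.get("upgrade", "").lower() != "websocket":
        raise ValueError("not a WebSocket upgrade request")
    connection_tokens = [t.strip().lower() for t in headers.get("connection", "").split(",")]
    if "upgrade" not in connection_tokens:
        raise ValueError("Connection header does not request an upgrade")
    key = headers.get("sec-websocket-key")
    if not key:
        raise ValueError("missing Sec-WebSocket-Key")
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {compute_accept_key(key)}\r\n"
        "\r\n"
    )
```

After this response is written, the same TCP connection stops carrying HTTP messages and carries WebSocket frames instead.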
+ +```mermaid +sequenceDiagram + participant C as Client + participant LB as Load Balancer + participant G as WebSocket Gateway + + C->>LB: HTTP GET with Upgrade: websocket + LB->>G: Forward upgrade request + G->>G: Authenticate user, validate origin, allocate session + G-->>LB: 101 Switching Protocols + LB-->>C: 101 Switching Protocols + C->>G: WebSocket frames + G->>C: WebSocket frames +``` + +Typical details involved in practice: + +- the client connects with `ws://` or `wss://`; production typically uses `wss://` +- authentication may be done via cookie, short-lived token, or signed session token +- the gateway may register the connection in a presence or subscription store +- after upgrade, messages are framed according to the WebSocket protocol instead of normal HTTP response semantics + +### 3.4 Connection Lifecycle + +A production WebSocket connection usually has this lifecycle: + +1. connect and authenticate +2. subscribe to channels, rooms, or entities +3. exchange bidirectional messages +4. send heartbeats to detect broken connections +5. handle temporary disconnects and reconnects +6. replay missed messages or fetch a snapshot if needed + +Designing only step 3 is not enough. Most operational pain comes from steps 4 through 6. + +### 3.5 Heartbeats, Ping-Pong, and Idle Detection + +Long-lived connections fail silently all the time. + +Reasons include: + +- mobile device sleep +- NAT or proxy idle timeout +- Wi-Fi to cellular network switch +- gateway crash +- browser tab suspension +- broken half-open TCP connection + +This is why WebSocket systems usually use heartbeat messages or protocol-level ping-pong frames. + +Heartbeats help answer two questions: + +1. is the connection still alive? +2. should the system still consider this client online? + +Without heartbeats, you get ghost sessions that appear connected long after the client is gone. + +### 3.6 Reconnect Strategies + +Reconnect behavior is a major part of real-world quality. 
+ +Bad reconnect strategy: + +- immediate reconnect loop +- no jitter +- no session resume +- no replay of missed messages + +This creates reconnect storms during deploys or brief outages. + +Good reconnect strategy usually includes: + +- exponential backoff +- random jitter +- session tokens that expire quickly but can be refreshed safely +- a cursor, offset, or last-seen sequence number so missed updates can be replayed +- a fallback snapshot fetch if the replay window has expired + +For example, a chat client may reconnect with "last message sequence = 8421" so the server can resend messages 8422 onward, or instruct the client to resync from the room history API. + +### 3.7 Backpressure Handling + +Backpressure means the producer is sending faster than the consumer can handle. + +In WebSocket systems, this often shows up as: + +- a client on a slow network +- a dashboard subscribed to too many noisy streams +- a fanout burst after a major event +- a gateway trying to write faster than the kernel socket buffer drains + +If you ignore backpressure, memory usage grows because outbound queues grow. + +Common handling strategies: + +- per-connection outbound queue limits +- drop low-priority ephemeral updates such as typing indicators or cursor positions +- coalesce updates so only the latest state is sent for fast-changing metrics +- disconnect very slow consumers after threshold violations +- split high-volume topics into separate subscription classes + +An important interview line: + +"For rapidly changing state, I would prefer sending the latest state rather than every intermediate event, because preserving every micro-update can create unnecessary backpressure." + +### 3.8 Scaling Millions of Connections + +At large scale, the architecture is usually not "application servers also do WebSockets." Instead it becomes a specialized gateway tier. 
+ +```mermaid +flowchart TB + U[Clients] --> LB[Global Load Balancer] + LB --> R1[Realtime Gateway Pool A] + LB --> R2[Realtime Gateway Pool B] + R1 --> BUS[[Kafka / Redis PubSub / NATS / Custom Bus]] + R2 --> BUS + BUS --> CHAT[Chat Service] + BUS --> PRES[Presence Service] + BUS --> FEED[Live Feed Service] + CHAT --> DB[(Message Store)] + PRES --> CACHE[(Ephemeral Presence Store)] + FEED --> TS[(Metrics / Event Store)] +``` + +Key design patterns: + +- gateways hold active connections and subscription state +- business services publish events to a bus rather than pushing directly to every gateway +- gateways subscribe to only the shards or channels they need +- sharding is often based on user ID, room ID, organization ID, or topic +- regional edge clusters reduce latency and avoid routing every message through one region + +Operational concerns at this scale: + +- file descriptor limits +- efficient event-loop networking, such as epoll or kqueue based runtimes +- TLS termination cost +- per-connection memory footprint +- deployment draining and connection migration +- DDoS and abuse protection + +### 3.9 Sticky Sessions and Their Implications + +WebSocket connections are naturally sticky after they are established because the connection lives on one server. + +Two common patterns exist: + +1. sticky routing only for the lifetime of the connection, with minimal external session state +2. externalized session or subscription metadata so another node can recover more easily after reconnect + +Tradeoff: + +- keeping more state in-memory is fast but makes failover harder +- externalizing more state improves recoverability but increases read/write overhead + +Most production systems use a hybrid approach. + +### 3.10 Pub/Sub Integration + +WebSockets are a transport. They are rarely the source of truth. + +The common production flow is: + +1. a business event happens, such as a new chat message +2. the source-of-truth service persists it +3. 
the service publishes an event to a bus +4. the relevant real-time gateways receive the event +5. the gateways push it to subscribed clients + +This avoids tight coupling between the message store and every connected gateway. + +Chat example: + +- message written to durable message store +- room event published to Kafka or another bus +- gateways with members of that room fan it out +- offline users do not receive live delivery, but the message remains in durable history + +### 3.11 Message Ordering + +WebSocket itself gives ordered byte delivery on a single connection because it runs over TCP. + +That does not solve application-level ordering fully. + +Ordering can still break due to: + +- reconnects +- retry/replay logic +- multi-region replication lag +- events produced from multiple backend services +- fanout from multiple partitions + +Practical fix: + +- assign monotonic sequence numbers per room, stream, or entity +- let clients detect gaps or duplicates +- if a gap exists, trigger replay or snapshot sync + +### 3.12 Delivery Guarantees + +Default WebSocket delivery is closer to at-most-once unless you add more machinery. + +For stronger guarantees, systems add: + +- message IDs +- client acknowledgments +- server-side retry windows +- dedupe on the client or server +- durable storage for important messages + +Chat systems often separate message durability from live delivery: + +- live delivery is fast but may fail during transient disconnects +- durable history ensures the user still sees the message when they reconnect + +This is how systems like Slack or Discord can tolerate temporary live-delivery failures without losing conversation history. 
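The sequence-number and dedupe ideas from the last two sections can be sketched on the client side as follows. This is a minimal illustration, assuming the server assigns consecutive integer sequence numbers per room. The class name `RoomStream` and the shape of the replay request are hypothetical.

```python
class RoomStream:
    """Client-side view of one room's ordered message stream."""

    def __init__(self, last_seq=0):
        self.last_seq = last_seq  # highest sequence number applied so far
        self.applied = []         # payloads handed to the UI, in order

    def receive(self, seq, payload):
        """Apply one message; return 'applied', 'duplicate', or 'gap'."""
        if seq <= self.last_seq:
            return "duplicate"    # already seen, e.g. resent after a reconnect
        if seq > self.last_seq + 1:
            return "gap"          # something was missed; replay or resync first
        self.applied.append(payload)
        self.last_seq = seq
        return "applied"

    def replay_request(self):
        """Cursor to send on reconnect or after a detected gap."""
        return {"after_seq": self.last_seq}
```

A client that has applied message 8421 would accept 8422, ignore a resent 8422, and answer an out-of-order 8424 by requesting replay from its cursor. If the server can no longer replay that range, the client would fall back to fetching a snapshot from durable history.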
+ +### 3.13 WebSockets vs Traditional Request-Response + +| Dimension | Request-response HTTP | WebSockets | +|---|---|---| +| Communication model | Client asks, server answers | Both sides can send after connection is established | +| Overhead | New request/response semantics each time | One persistent connection | +| Server push | Awkward or impossible directly | Natural | +| Best for | CRUD, infrequent interactions, cacheable reads | Interactive real-time features | +| Statefulness | Usually stateless per request | Stateful connection management | +| Scaling pain | Request burst handling | Connection count, fanout, reconnect storms | + +### 3.14 Production Examples + +Slack or Discord style chat: + +- WebSockets to maintain active sessions +- durable message store for history +- presence service for online status +- per-channel fanout through pub/sub +- client acknowledgments or read markers managed separately from transport + +Live operational dashboard: + +- metrics ingestion into stream system +- aggregation layer computes latest values +- gateway pushes only the latest state at a controlled cadence +- slow clients may receive sampled or coalesced updates instead of every raw event + +Collaborative editing like Google Docs style systems: + +- persistent channel for operations and cursor updates +- separate algorithm for convergence such as OT or CRDT +- ordered per-document operation stream more important than global ordering + +### 3.15 Failure Cases and Best Practices + +Common failures: + +- dropped connections during deploys +- missing replay after reconnect +- memory blow-up from unbounded per-client queues +- duplicate sends during reconnect races +- stale authentication on long-lived connections +- load imbalance when popular rooms concentrate on one shard + +Best practices: + +- keep auth tokens short-lived and renewable +- use heartbeats with sensible timeout and grace period +- track per-stream sequence numbers for replay and gap detection +- bound 
outbound queues and define drop policies +- separate durable history from ephemeral transport +- plan explicitly for reconnect storms + +## 4. Server-Sent Events (SSE) + +Server-Sent Events are a simpler real-time transport for one-way server-to-client streaming over HTTP. + +### 4.1 What SSE Is + +SSE uses a long-lived HTTP response with content type `text/event-stream`. The server keeps the response open and pushes events as text lines. + +This is simpler than WebSockets because: + +- it stays within standard HTTP semantics +- the communication direction is only server to client +- browsers provide built-in reconnect behavior through `EventSource` + +SSE is often the right tool when you need live updates but not client-to-server realtime messaging on the same channel. + +### 4.2 Why SSE Exists + +Many applications need push from server to browser, but not full bidirectional sockets. + +Examples: + +- notification feeds +- build or job status updates +- stock or metrics dashboards +- live logs +- admin consoles showing state changes + +For these, WebSockets may be more power than you need. SSE offers lower conceptual complexity. + +### 4.3 How SSE Works Internally + +The server responds with a streaming body where events look like this conceptually: + +```text +id: 1052 +event: status_update +retry: 3000 +data: {"state":"processing","progress":65} + +``` + +Important fields: + +- `data`: payload, often JSON serialized as text +- `event`: optional named event type +- `id`: event ID used for resume/reconnect logic +- `retry`: suggested reconnect delay in milliseconds + +The browser may reconnect automatically and include the last event ID. + +### 4.4 Browser Reconnect Behavior + +One reason teams like SSE is that browsers handle reconnection automatically through the `EventSource` API. + +That does not mean you can ignore replay design. 
+ +In production, you still need to decide: + +- how long the server remembers old event IDs +- whether reconnect should replay from a cursor or force a snapshot refresh +- what happens if the client was disconnected longer than the replay window + +If you care about reliable catch-up, event IDs need meaning. They cannot just be decorative. + +### 4.5 Proxy and Load Balancer Considerations + +SSE works over HTTP, but intermediaries can cause problems. + +Common issues: + +- reverse proxies buffering the response instead of flushing promptly +- idle timeouts closing the stream +- HTTP/1.1 connection count limits in browsers +- network infrastructure tuned for short responses rather than long-lived streams + +Typical mitigations: + +- disable response buffering where necessary +- send periodic keepalive comments or lightweight events +- tune load balancer idle timeout above expected stream lifetime +- prefer HTTP/2 where available to improve multiplexing behavior + +### 4.6 Advantages of SSE + +SSE is often better than WebSockets when: + +- you only need server-to-client updates +- the payload is text-oriented JSON or status messages +- you want easy browser support with simpler client code +- you prefer staying inside HTTP infrastructure where possible +- you want built-in reconnect behavior without custom socket management + +It is especially attractive for SaaS dashboards and internal tools. + +### 4.7 Limitations of SSE + +SSE is not a universal replacement for WebSockets. 
+ +Limitations: + +- browser-to-server communication still needs normal HTTP requests +- payloads are text-based, not native binary frames +- some proxies or CDNs may handle buffering poorly if misconfigured +- older browser and infrastructure quirks may matter in enterprise environments +- very high frequency bidirectional workloads fit WebSockets better + +### 4.8 SSE vs WebSockets + +| Dimension | SSE | WebSockets | +|---|---|---| +| Direction | Server to client only | Bidirectional | +| Protocol model | Streaming HTTP | Upgraded persistent socket | +| Browser reconnect support | Built in | Usually implemented by application | +| Complexity | Lower | Higher | +| Binary support | Not natural | Supported | +| Best use | Feeds, status, live logs, dashboards | Chat, collaboration, two-way real-time | + +### 4.9 When SSE Is the Better Choice + +SSE is often the better choice when the client mostly listens. + +Examples: + +- GitHub or CI-style live build logs +- a Stripe-like dashboard showing payment processing updates +- an incident dashboard showing service health changes +- a notification center feed for a web app +- stock or operations dashboards where commands go through separate REST APIs + +This is a good interview point: choosing the simpler transport when it satisfies requirements is usually a sign of maturity. + +### 4.10 Production Example + +Imagine a live job-status console for a video encoding platform. + +Architecture: + +- workers emit progress events to Kafka +- a status aggregation service maintains current job state +- an SSE gateway streams progress events to browser tabs watching those jobs +- user actions such as cancel or retry still go through normal HTTP APIs + +This keeps the design simple because only one direction needs to be live. 
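As a rough illustration of what the SSE gateway in this example writes to each open response, the following minimal sketch serializes one progress event in the `text/event-stream` format described in section 4.3. The function names `format_sse_event` and `keepalive_comment` are hypothetical, and the payload shape is an assumption chosen to match the job-progress example.

```python
import json
from typing import Optional


def format_sse_event(event_id: int, event: str, payload: dict,
                     retry_ms: Optional[int] = None) -> str:
    """Serialize one event in the text/event-stream wire format."""
    lines = [f"id: {event_id}", f"event: {event}"]
    if retry_ms is not None:
        # Suggested reconnect delay for the client, in milliseconds.
        lines.append(f"retry: {retry_ms}")
    # json.dumps emits no raw newlines by default, so one data: line is enough.
    lines.append(f"data: {json.dumps(payload)}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def keepalive_comment() -> str:
    """Lines starting with ':' are comments that clients ignore.

    Sending one periodically keeps idle proxies from closing the stream.
    """
    return ": keepalive\n\n"


if __name__ == "__main__":
    print(format_sse_event(51, "progress", {"progress": 65}, retry_ms=3000), end="")
    print(keepalive_comment(), end="")
```

The gateway would write each returned string to the response body and flush immediately. Replay after reconnect still depends on the server being able to look up events newer than the `Last-Event-ID` the browser sends.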
+ +```mermaid +sequenceDiagram + participant B as Browser + participant S as SSE Gateway + participant A as Aggregation Service + + B->>S: GET /jobs/123/stream + S->>A: Subscribe to job 123 updates + A-->>S: status event id=51 + S-->>B: event: progress, id:51, data:{65%} + A-->>S: status event id=52 + S-->>B: event: progress, id:52, data:{78%} + Note over B,S: If disconnected, browser reconnects with Last-Event-ID +``` + +### 4.11 Failure Cases and Best Practices + +Common failures: + +- events buffered by proxy, making the feed appear delayed +- auto-reconnect loops against an unhealthy server +- event IDs not aligned with replay logic +- assuming SSE gives guaranteed delivery without durable backing state + +Best practices: + +- use SSE for one-way streaming where simplicity matters +- define replay windows and semantics for `Last-Event-ID` +- tune proxy buffering and timeout settings explicitly +- send heartbeats or comments to keep the stream alive +- keep authoritative state elsewhere so clients can resync after longer disconnects + +## 5. Long Polling + +Long polling is the older but still useful technique where the client makes a request, the server holds it open until data is available or a timeout occurs, then the client immediately opens a new request. + +### 5.1 What Long Polling Is + +Long polling tries to approximate server push while staying entirely within ordinary request-response behavior. + +Lifecycle: + +1. client sends request asking for updates +2. server waits if no new data is available +3. when data arrives, or timeout happens, server responds +4. client immediately sends another request + +This can work surprisingly well at modest scale. 
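The lifecycle above can be sketched with a small in-memory event log. This is a minimal illustration, not a production design: the `EventLog` class and its method names are hypothetical, events live only in memory, and a blocking `threading.Condition` stands in for the async waiting a real server would more likely use.

```python
import threading
from typing import List, Tuple


class EventLog:
    """In-memory, append-only log; the event count doubles as the cursor."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self._cond = threading.Condition()

    def publish(self, payload: str) -> int:
        """Append an event and wake every held request."""
        with self._cond:
            self._events.append(payload)
            self._cond.notify_all()
            return len(self._events)  # cursor value after this event

    def poll(self, cursor: int, timeout_s: float) -> Tuple[int, List[str]]:
        """Hold the caller until events after `cursor` exist or the timeout expires.

        Returns the new cursor and the events the caller has not seen yet.
        An empty list means the request timed out with nothing new.
        """
        with self._cond:
            self._cond.wait_for(lambda: len(self._events) > cursor,
                                timeout=timeout_s)
            new_events = self._events[cursor:]
            return cursor + len(new_events), new_events


if __name__ == "__main__":
    log = EventLog()
    # Publish from another thread shortly after the poll starts waiting.
    threading.Timer(0.1, log.publish, args=("event-1",)).start()
    cursor, events = log.poll(cursor=0, timeout_s=2.0)
    print(cursor, events)
```

A client would call `poll` in a loop, always passing back the cursor it last received, which is what lets it resume without gaps or duplicates.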
+ +### 5.2 Why Older Systems Used It + +Long polling became popular because: + +- it worked before modern browser WebSocket support was universal +- it fit existing HTTP infrastructure better +- it avoided some firewall or proxy restrictions that broke upgraded connections +- it was simple to integrate into applications already centered on HTTP APIs + +Many legacy enterprise products still use it, and some modern systems still choose it when operational simplicity matters more than efficiency. + +### 5.3 Request Lifecycle and Timeout Behavior + +```mermaid +sequenceDiagram + participant C as Client + participant S as Server + + C->>S: GET /updates?cursor=120 + Note over S: Hold request open + S-->>C: 200 OK with new event 121 + C->>S: GET /updates?cursor=121 + Note over S: Hold request open again + S-->>C: 204 No Content or timeout response + C->>S: GET /updates?cursor=121 +``` + +The cursor is important. Without it, the client cannot reliably resume. + +### 5.4 Server Resource Cost + +Long polling has hidden cost: + +- many open requests waiting simultaneously +- repeated HTTP parsing and headers +- repeated auth checks unless optimized +- more connection churn than persistent transports +- potentially more threads or async waiting overhead depending on server model + +In modern async servers this can be manageable, but it is still usually less efficient than SSE or WebSockets for high-frequency updates. + +### 5.5 Retry Flow and Failure Behavior + +If a long-poll request fails, the client retries. If the system is overloaded, thousands of clients may do this simultaneously. + +This can create: + +- thundering herds after load balancer restarts +- increased TLS handshake overhead +- request spikes aligned to timeout intervals + +Good long-poll implementations randomize retry timing and use cursors to avoid gaps. 
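One common way to randomize retry timing is exponential backoff with full jitter. The sketch below is an assumption about how a client might do it. The function name `retry_delay` and the default base and cap values are illustrative, not taken from any particular library.

```python
import random
from typing import Optional


def retry_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0,
                rng: Optional[random.Random] = None) -> float:
    """Return a delay in seconds for the given retry attempt (0-based).

    Full jitter: pick uniformly between 0 and min(cap, base * 2**attempt),
    so clients that failed at the same moment do not retry at the same moment.
    """
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(retry_delay(attempt), 2))
```

The same idea applies to the normal re-request after a timeout response: adding a small random offset keeps request spikes from aligning to the timeout interval.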
+ +### 5.6 Scaling Limitations + +Long polling tends to hit limits earlier than SSE or WebSockets because: + +- every delivery cycle requires a fresh request +- idle clients still create repeated requests +- server-side waiting requests increase memory and scheduling overhead +- load balancers and API gateways see much higher request volume + +It is therefore more expensive per delivered update at scale. + +### 5.7 Long Polling vs SSE vs WebSockets + +| Dimension | Long Polling | SSE | WebSockets | +|---|---|---|---| +| Direction | Mostly server to client via repeated requests | Server to client | Bidirectional | +| Latency | Good but depends on re-request timing | Better | Best for interactive two-way | +| Overhead | Highest | Lower | Lowest per message after connection | +| Infrastructure simplicity | High | Moderate | Moderate to high | +| Browser support model | Universal HTTP | Modern browser friendly | Modern browser friendly | +| Best fit | Legacy environments, simple compatibility | One-way streaming | Full duplex interactive systems | + +### 5.8 Where It Still Appears Today + +Long polling still appears in: + +- legacy enterprise systems +- products that need broad proxy compatibility +- simple notification or status systems where connection counts are moderate +- fallback modes when WebSockets fail + +Some client libraries silently downgrade to long polling when sockets are unavailable. + +### 5.9 Common Mistakes + +Common mistakes: + +- no cursor or sequence number, causing gaps or duplicates +- synchronized timeout intervals across all clients +- treating long polling as free because it uses HTTP +- using blocking server threads for many waiting requests + +Best practices: + +- add jitter to retries and request restarts +- carry a cursor or last seen event ID +- use async I/O servers +- switch to SSE or WebSockets when event frequency or client count grows significantly + +## 6. 
Presence + +Presence is the system that answers questions such as: + +- is this user online? +- were they active recently? +- are they typing right now? +- which device are they on? +- when were they last seen? + +Presence looks easy until you build it across unreliable networks and multiple devices. + +### 6.1 What Presence Means + +Presence is not just a boolean. It is usually a collection of signals. + +Examples of signals: + +- active WebSocket connection exists +- recent heartbeat received within threshold +- app is in foreground or background +- user generated interaction recently +- one or more devices are connected +- a typing event was emitted in the last few seconds + +Products like Slack, WhatsApp, and Discord treat these signals differently based on product semantics and privacy expectations. + +### 6.2 Online/Offline State vs Rich Presence + +There are multiple levels of presence richness: + +| Presence signal | Meaning | Typical use | +|---|---|---| +| Online | At least one recent active connection | chat roster | +| Active now | Recent interaction or foreground state | collaboration tools | +| Last seen | Last trusted activity timestamp | messaging apps | +| Typing | Very recent ephemeral intent signal | chat UI | +| Device-specific presence | Desktop online, mobile background, etc. | cross-device messaging | + +Rich presence improves UX but increases cost and complexity. + +### 6.3 Heartbeat Mechanisms + +Presence systems typically rely on heartbeats because disconnects are not always detected immediately. + +Typical model: + +- client sends heartbeat every N seconds +- gateway refreshes an expiry in an ephemeral store +- if no refresh occurs before TTL plus grace period, the client is considered offline + +Heartbeats are usually lightweight and sometimes piggyback on existing ping-pong traffic. 
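The heartbeat model above can be sketched as follows. This is a minimal in-process illustration: a plain dictionary stands in for the ephemeral TTL store, and the `PresenceTracker` name, the TTL and grace values, and the injectable clock are all assumptions made for the example.

```python
import time
from typing import Callable, Dict


class PresenceTracker:
    """Tracks per-session expiry times refreshed by heartbeats."""

    def __init__(self, ttl_s: float = 30.0, grace_s: float = 15.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._grace_s = grace_s
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def heartbeat(self, session_id: str) -> None:
        """Refresh the expiry for this session, as the gateway would on each heartbeat."""
        self._expires_at[session_id] = self._clock() + self._ttl_s

    def is_online(self, session_id: str) -> bool:
        """A session is online until TTL plus grace period passes without a refresh."""
        expiry = self._expires_at.get(session_id)
        if expiry is None:
            return False
        return self._clock() <= expiry + self._grace_s


if __name__ == "__main__":
    tracker = PresenceTracker(ttl_s=30.0, grace_s=15.0)
    tracker.heartbeat("session-1")
    print(tracker.is_online("session-1"))  # just refreshed
    print(tracker.is_online("session-2"))  # never seen
```

In a shared store the same effect usually comes from setting a key with an expiry on every heartbeat, so stale entries disappear even if the gateway that wrote them crashes.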
+ +### 6.4 Disconnect Detection + +There are several ways a system may detect that a user is gone: + +- clean socket close +- missed heartbeat threshold +- TCP reset or read/write failure +- mobile OS background suspension causing missed keepalives +- gateway process failure + +The important design lesson is that disconnect detection is delayed and probabilistic. + +### 6.5 Ghost Online Problems + +Ghost online means the system shows a user as online even though they have effectively disappeared. + +Causes: + +- stale presence entry in Redis or similar store +- gateway crash before cleanup +- mobile network change without clean disconnect +- delayed heartbeat expiration +- clock skew or delayed writes + +Ghost online is not just cosmetic. In chat systems it changes user expectations and trust. + +### 6.6 Mobile Network Challenges + +Mobile clients make presence significantly harder: + +- apps move between foreground and background +- the OS may throttle network activity +- the device may sleep aggressively +- the network may switch between Wi-Fi and cellular +- radio conditions cause intermittent reachability + +This is why messaging apps rarely promise precise truth about presence. They provide an approximation that is good enough for UX. + +### 6.7 Multi-Device Presence + +A single user may be connected from: + +- desktop browser +- mobile phone +- tablet +- another browser tab + +So the question becomes: how should user presence be aggregated? + +Common policies: + +- user is online if any device is online +- show the most active device class +- typing indicator is scoped per conversation and device session +- last seen is updated from the most recently trusted activity source + +This requires per-device state plus a user-level aggregation layer. 
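A user-level aggregation over per-device sessions might look like the sketch below. The `DeviceSession` and `UserPresence` types are hypothetical, and two policy choices are assumptions: "most active device" is taken to mean the connected device with the most recent activity, and every session is treated as a trusted source for last seen.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DeviceSession:
    device: str            # e.g. "desktop", "mobile", "tablet"
    connected: bool        # whether the session currently counts as online
    last_activity: float   # timestamp of the last trusted activity, epoch seconds


@dataclass(frozen=True)
class UserPresence:
    online: bool
    most_active_device: Optional[str]
    last_seen: Optional[float]


def aggregate(sessions: Iterable[DeviceSession]) -> UserPresence:
    """Collapse per-device state into one user-level presence record."""
    sessions = list(sessions)
    connected = [s for s in sessions if s.connected]
    # Policy: the user is online if any device is online.
    most_active = max(connected, key=lambda s: s.last_activity, default=None)
    # Policy: last seen comes from the most recent activity on any device.
    last_seen = max((s.last_activity for s in sessions), default=None)
    return UserPresence(
        online=bool(connected),
        most_active_device=most_active.device if most_active else None,
        last_seen=last_seen,
    )


if __name__ == "__main__":
    print(aggregate([
        DeviceSession("desktop", connected=False, last_activity=100.0),
        DeviceSession("mobile", connected=True, last_activity=90.0),
    ]))
```

Typing indicators are left out on purpose, since they are scoped per conversation and device session rather than per user.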
+ +### 6.8 Presence Architecture at Scale + +```mermaid +flowchart LR + C1[Client Devices] --> G[Realtime Gateways] + G --> P[Presence Service] + P --> E[(Ephemeral TTL Store)] + P --> L[(Last Seen Store)] + P --> BUS[[Presence Event Bus]] + BUS --> W1[Workspace / Room Subscribers] + BUS --> W2[Friend / Contact Subscribers] +``` + +Important implementation choices: + +- keep rapidly changing online state in an ephemeral store with TTL, often Redis-like +- persist last seen more selectively because writing every heartbeat to durable storage is wasteful +- fan out presence only to users who care, such as channel members or contact lists +- avoid global broadcasts of presence updates + +### 6.9 How Slack, WhatsApp, and Discord Think About Presence + +Slack-like systems: + +- presence is often relevant within a workspace, not globally +- typing indicators are highly ephemeral and can be dropped safely +- active/away may incorporate user activity signal, not only socket existence + +WhatsApp-like systems: + +- privacy settings affect whether last seen or online is visible +- mobile network behavior dominates design decisions +- message delivery status and online status are related but not identical + +Discord-like systems: + +- gateway connections and presence updates are central +- fanout is scoped to servers, friends, or subscribed contexts +- rich presence may include game/activity metadata beyond simple online state + +### 6.10 Failure Cases and Best Practices + +Common failures: + +- presence storms during reconnect events +- over-broadcasting presence changes to too many recipients +- stale typing indicators that never clear +- durable stores overloaded by heartbeat writes + +Best practices: + +- use TTL-based ephemeral storage for active presence +- use grace periods before declaring offline +- treat typing as ephemeral, low-guarantee data +- separate last seen persistence from heartbeat frequency +- scope presence fanout carefully + +## 7. 
Online and Offline State + +Online/offline state deserves separate treatment because it is one of the most misunderstood topics in system design. + +### 7.1 Connection State vs Actual User Activity + +A connected socket does not prove the user is attentive. + +Examples: + +- the app is open in a background tab +- the phone has a connection but the user has not interacted for an hour +- the device is connected through a stale transport path + +So "online" is often a layered concept: + +- connected +- recently active +- foreground active +- reachable through push only +- offline + +### 7.2 Why Online Is Probabilistic + +"Online" is often probabilistic rather than exact because distributed systems only observe signals, not intent. + +Signals can be wrong or delayed: + +- heartbeat arrives late +- device sleeps +- disconnect cleanup fails +- region replication lags +- clock skew shifts thresholds + +That is why serious systems avoid over-promising semantic precision. + +### 7.3 Delayed Disconnect Detection and Grace Periods + +If you mark a user offline immediately on one missed heartbeat, the UI will flap constantly. If you wait too long, users stay falsely online. + +The typical answer is a grace period. + +Example: + +- heartbeat expected every 20 seconds +- after 45 seconds without heartbeat, mark as "probably offline" +- after 90 seconds, emit strong offline event + +The exact numbers depend on product sensitivity and mobile behavior. + +### 7.4 Distributed Presence Tracking + +At scale, presence may be tracked across regions or gateway clusters. 
+ +Challenges: + +- the same user may have devices connected to different regions +- presence updates may replicate asynchronously +- room members may themselves be distributed globally +- failover can cause duplicate or delayed presence transitions + +Practical design patterns: + +- aggregate presence per user from device-level sessions +- keep the fast path regional when possible +- replicate summarized state, not every heartbeat, across regions +- use eventual consistency for non-critical presence displays + +### 7.5 Eventual Consistency and Reconciliation + +Presence is almost always eventually consistent. + +That is acceptable because most presence UI is advisory rather than transactional. + +Reconciliation patterns: + +- refresh contact roster periodically from source state +- clear typing indicators after TTL rather than requiring explicit "stop typing" +- overwrite stale online state when stronger evidence arrives +- prefer monotonic versioning or timestamps for state updates + +### 7.6 Offline Event Reconciliation + +When a user comes back online, the system often needs to reconcile what happened while they were away. + +Examples: + +- unread chat message counts +- missed mentions or alerts +- latest document state +- last known ride or delivery state + +This is where live presence and durable notification systems connect. Presence alone is not enough. Offline catch-up requires durable event or state storage. + +### 7.7 Useful Interview Language + +Strong interview framing: + +"I would treat online status as a best-effort product signal based on recent heartbeats and activity, not as exact truth. I would keep active presence in an ephemeral TTL store, add grace periods to avoid flapping, and rely on durable stores for offline reconciliation." + +## 8. Notification Systems Overview + +Notification systems take internal events and decide whether, when, and how to tell a user. + +This is a much bigger problem than "send email" or "send push." 
+ +### 8.1 What Notification Systems Are + +A notification system is usually an event-driven orchestration system with responsibilities such as: + +- ingest product events +- determine recipients +- check user preferences and eligibility rules +- choose channels such as in-app, email, SMS, or push +- render channel-specific content +- schedule immediate or delayed delivery +- retry safely on transient failures +- capture delivery outcomes and user actions + +This is why notification systems often become a platform inside a company. + +### 8.2 Why They Exist as Dedicated Systems + +Without a dedicated notification system, every product team reimplements: + +- preferences +- templates +- provider integrations +- retry logic +- deduplication +- analytics +- unsubscribe logic + +That leads to inconsistency, bugs, and poor user experience. + +Centralized notification platforms exist because communication policy should be consistent even when event producers are decentralized. + +### 8.3 Event-Driven Producer-Consumer Flow + +```mermaid +flowchart LR + EV[Business Events
PR comment / payment failed / trip update] --> BUS[[Event Bus / Outbox]] + BUS --> ORCH[Notification Orchestrator] + ORCH --> PREF[Preferences / Suppression Rules] + ORCH --> TPL[Template Service] + ORCH --> DEDUPE[Idempotency / Deduplication] + ORCH --> Q1[[Email Queue]] + ORCH --> Q2[[SMS Queue]] + ORCH --> Q3[[Push Queue]] + Q1 --> EP[Email Provider] + Q2 --> SP[SMS Provider] + Q3 --> PP[APNs / FCM] + EP --> CALLBACK[Delivery Callbacks / Webhooks] + SP --> CALLBACK + PP --> CALLBACK + CALLBACK --> STATUS[(Notification Status Store)] +``` + +### 8.4 Notification Orchestration + +The orchestrator is the brain of the system. + +It decides: + +- should we notify at all? +- which recipients should receive it? +- which channels are allowed for this user and event type? +- should the notification be immediate, batched, or suppressed? +- how should duplicates be prevented? +- what fallback should happen if one channel fails? + +This is why the orchestrator is usually more important than the provider adapters. + +### 8.5 Fanout Systems + +A single event can notify: + +- one user +- all watchers of a repository +- all participants in a channel +- all on-call engineers for a critical incident +- millions of users in a campaign + +Fanout complexity depends on the recipient graph. + +Examples: + +- GitHub notification fanout may target repository watchers, mention targets, or review participants +- Uber trip updates may target rider, driver, and support systems +- Stripe-like alerts may target account owners and configured team members + +The fanout step is often separated from channel delivery so the system can scale recipient expansion independently. + +### 8.6 Personalization and Preferences + +Mature systems do not treat all recipients equally. 
+ +They consider: + +- language and locale +- preferred channels +- quiet hours +- marketing consent vs transactional necessity +- account type or plan tier +- notification priority +- prior recent sends to avoid fatigue + +This means notification systems need user preference stores, policy engines, and template personalization. + +### 8.7 Delivery Guarantees and Idempotency + +Notification systems are almost always at-least-once internally. + +Why: + +- queues retry on failures +- provider callbacks can be delayed or duplicated +- downstream consumers may crash after partial processing + +Therefore idempotency is mandatory. + +Examples of idempotency keys: + +- `event_id + recipient_id + channel + template_version` +- `otp_request_id + phone_number` +- `invoice_id + notification_type` + +Without idempotency, retries turn into duplicate emails, repeated SMS, or multiple push notifications. + +### 8.8 User Preferences, Suppression, and Fatigue Prevention + +Good notification systems do not just maximize sends. They maximize useful communication. + +Common controls: + +- per-event-type preferences +- daily or hourly caps +- digesting multiple low-priority events into one summary +- suppression if the user already saw the update in-app +- channel priority rules such as push first, email later if unread +- quiet hours by timezone + +This is especially important for SaaS products, GitHub-like collaboration tools, and consumer apps where over-notification causes churn. 
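The idempotency keys from section 8.7 and the hourly caps described above can be combined into a single pre-send check. The sketch below is illustrative only: `SendGate` and `idempotency_key` are hypothetical names, state is kept in memory rather than in a shared store, and the one-hour sliding window is an assumed policy.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Set


def idempotency_key(event_id: str, recipient_id: str,
                    channel: str, template_version: str) -> str:
    """Build a key of the form event_id + recipient_id + channel + template_version."""
    return f"{event_id}:{recipient_id}:{channel}:{template_version}"


class SendGate:
    """Decides whether a notification may be sent right now."""

    def __init__(self, max_per_hour: int,
                 clock: Callable[[], float] = time.time) -> None:
        self._max_per_hour = max_per_hour
        self._clock = clock
        self._seen: Set[str] = set()
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, recipient_id: str) -> bool:
        if key in self._seen:
            return False  # duplicate caused by a retry; already sent
        now = self._clock()
        window = self._recent[recipient_id]
        # Drop send timestamps that fell out of the one-hour window.
        while window and now - window[0] >= 3600:
            window.popleft()
        if len(window) >= self._max_per_hour:
            # Over the cap. The key is not recorded, so the caller may
            # digest the event or try again later.
            return False
        self._seen.add(key)
        window.append(now)
        return True


if __name__ == "__main__":
    gate = SendGate(max_per_hour=2)
    key = idempotency_key("evt-1", "user-1", "email", "v1")
    print(gate.allow(key, "user-1"))  # first attempt
    print(gate.allow(key, "user-1"))  # retry of the same send
```

A production version would keep both the seen keys and the per-recipient counters in a shared store with expiry, since several orchestrator instances usually process the same queues.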
+ +### 8.9 Real-World Examples + +GitHub-like system: + +- new comment event arrives +- identify subscribers and participants +- if user is active in the web app, maybe show in-app only +- if user is offline, maybe send email or mobile push depending on preferences + +Stripe-like system: + +- payment failed event arrives +- critical transactional email sent immediately +- dashboard alert created +- webhook to merchant retried separately with strong idempotency rules + +Uber-like system: + +- trip state changes drive in-app realtime updates when online +- push or SMS may be used if the user is offline or inactive +- some updates are critical and some are suppressible + +### 8.10 Common Mistakes + +Common mistakes: + +- every producer sends directly to providers +- no central dedupe or idempotency control +- no distinction between marketing and transactional communications +- retrying permanent failures forever +- ignoring user timezone and quiet hours +- no feedback loop from provider delivery callbacks + +## 9. Email Notifications + +Email is deceptively complex. It looks simple because the send API is simple. Production email systems are complicated because delivery, trust, compliance, and reputation matter. + +### 9.1 Transactional vs Marketing Email + +| Type | Purpose | Examples | Operational expectations | +|---|---|---|---| +| Transactional | Required or highly important user/account events | password reset, receipt, billing alert, security alert | High reliability, lower latency, clearer deliverability priority | +| Marketing | Engagement and campaigns | product updates, promotions, newsletters | Segmentation, scheduling, consent, reputation sensitivity | + +Good companies often separate these streams operationally because poor marketing practices can harm the deliverability of important transactional mail. + +### 9.2 Provider Integrations + +Common providers include SES, SendGrid, Mailgun, Postmark, and others. 
+ +The backend typically needs to manage: + +- authentication credentials and rotation +- domain verification +- sending rate limits +- webhook endpoints for bounces, complaints, and delivered events +- fallback provider strategy if the primary provider degrades + +The provider is not your notification system. It is only the last-mile email delivery partner. + +### 9.3 Deliverability Basics + +Deliverability means whether mail actually lands where it should. + +It depends on more than API success. + +Core concepts: + +- SPF, DKIM, and DMARC to establish domain authenticity +- domain and IP reputation +- complaint rates and unsubscribe behavior +- bounce rates +- content quality and spam-like patterns +- volume ramp-up or IP/domain warming + +A send accepted by the provider can still end up in spam or be throttled downstream. + +### 9.4 Retries and Failure Handling + +Email sending failures can be transient or permanent. + +Transient examples: + +- provider API timeout +- rate limit exceeded +- temporary downstream mail server deferral + +Permanent examples: + +- invalid recipient address +- hard bounce history +- unsubscribed marketing recipient + +The notification system should classify failures and retry only the transient ones. + +### 9.5 Bounce Handling + +Bounces matter because continuing to send to bad addresses damages reputation. + +Typical flow: + +1. send email +2. provider later reports delivery, soft bounce, hard bounce, complaint, or unsubscribe +3. your system updates recipient state and future eligibility + +Hard bounces often lead to immediate suppression for that address. +Soft bounces may tolerate a few retries or future attempts. + +### 9.6 Spam Prevention Basics + +Spam prevention is partly technical and partly policy. 
+ +Important basics: + +- authenticate sending domains correctly +- do not mix high-risk marketing traffic with critical transactional mail blindly +- honor unsubscribe and complaint signals quickly +- avoid misleading content or sudden massive bursts from a cold domain +- keep lists clean and verified + +### 9.7 Rate Limits and Provider Behavior + +Providers enforce rate limits, and mailbox providers also effectively rate-limit through throttling. + +Implications: + +- you may need send pacing +- campaign fanout should often be spread over time +- critical transactional mail may need priority queues separate from bulk mail + +Amazon or Stripe style critical receipt and billing systems usually do not want a giant promotional campaign to delay password resets. + +### 9.8 Email Verification Basics + +Verification is used to reduce garbage addresses and account abuse. + +Common patterns: + +- double opt-in for marketing lists +- confirmation link for new account email ownership +- suppression of obviously invalid domains or malformed addresses + +This improves both security and deliverability. + +### 9.9 Unsubscribe Flows + +Unsubscribe handling is non-negotiable for marketing communications and often still useful for preference management around non-critical transactional categories. + +Good systems support: + +- one-click unsubscribe where required +- fine-grained preference center +- audit trail of consent changes +- region/compliance specific logic if needed + +### 9.10 Domain Reputation Basics + +Reputation is cumulative. Sending one terrible campaign can affect future delivery of important mail. 
+ +This is why large companies often: + +- separate sending domains or subdomains by traffic type +- ramp up new domains gradually +- isolate risky campaigns from critical operational email + +### 9.11 Why Email Is More Complex Than "Just Send Email" + +Because success requires all of the following: + +- rendering the right content +- sending through a healthy provider +- passing authentication checks +- avoiding spam classification +- respecting preferences and compliance +- tracking callbacks +- updating suppression lists + +A provider returning `200 OK` is the beginning, not the end. + +## 10. SMS Notifications + +SMS is powerful because it reaches users outside the app and does not depend on an installed app or a mobile data connection. It is also expensive, abuse-prone, and operationally messy. + +### 10.1 Common Use Cases + +SMS is usually best reserved for high-value or urgent communication. + +Examples: + +- OTP and account verification +- fraud or security alerts +- delivery or ride status for users without app engagement +- critical operational alerts + +It is usually a poor default channel for high-volume low-value communication. + +### 10.2 Provider Integrations + +Typical providers include Twilio, Sinch, MessageBird, and region-specific aggregators. + +Real systems often need to manage: + +- phone number normalization and validation +- country-specific sender rules +- short code, long code, toll-free, or sender ID strategy +- provider failover or route selection +- delivery receipt webhook processing + +### 10.3 Delivery Reliability + +SMS delivery reliability is more uncertain than many engineers expect. + +Reasons: + +- carriers can filter traffic +- delivery receipts may be delayed or incomplete +- regional regulations vary heavily +- roaming, handset state, and carrier routing affect outcomes + +This means SMS is a high-reach channel, not a guaranteed-delivery channel. + +### 10.4 Regional Delivery Issues + +SMS is deeply regional. 
+ +Challenges include: + +- country-specific compliance requirements +- differing support for alphanumeric sender IDs +- carrier throughput differences +- local holidays or network conditions affecting delivery +- multi-language content and character encoding costs + +A design that works in the US may be wrong in India, Brazil, or the Middle East. + +### 10.5 Cost Considerations + +SMS is expensive relative to push or in-app messaging. + +Implications: + +- use it for critical moments, not noise +- add rate limits and spend monitoring +- avoid repeated retries that burn money without improving outcomes + +This is one reason companies often prefer push first, then SMS only for critical fallback paths. + +### 10.6 Retry Strategies + +Retrying SMS requires care. + +Bad strategy: + +- retry every failure aggressively +- resend OTP repeatedly with no abuse checks + +Better strategy: + +- retry transient provider errors with capped backoff +- do not retry obvious permanent failures +- rotate providers or routes only when there is evidence of provider-side trouble +- ensure OTP requests invalidate or supersede older codes safely + +### 10.7 Abuse and Fraud Prevention + +SMS systems are targets for abuse. 
+ +Examples: + +- OTP bombing a victim's phone +- account enumeration by testing phone numbers +- SMS pumping fraud, where attackers generate traffic to monetize premium routes +- SIM swap exposure when SMS is treated as strong identity proof + +Protections: + +- per-user and per-destination rate limits +- CAPTCHA or additional friction on suspicious flows +- phone reputation checks where appropriate +- anomaly detection on destination patterns and country mix +- avoid using SMS as the only high-assurance security factor for sensitive systems + +### 10.8 Fallback Strategies + +Reasonable fallback examples: + +- push notification first for app users, SMS only if unread or unreachable +- voice call backup for some OTP flows in limited contexts +- in-app confirmation plus email receipt after SMS-based security action + +Fallback should be policy-driven, not improvised per service. + +### 10.9 Why SMS Should Be Used Carefully + +SMS is valuable because it cuts through offline conditions, but it should be treated as a scarce and risky channel. + +In interviews, a good answer sounds like: + +"I would reserve SMS for high-importance events such as OTP or urgent alerts, because it has real cost, variable regional reliability, and meaningful abuse risk." + +## 11. Push Notifications + +Push notifications are the standard way mobile apps receive out-of-app alerts through platform push services. + +### 11.1 Mobile Push Architecture + +A typical flow is: + +1. app registers with APNs on iOS or FCM on Android +2. the app backend stores the device token +3. when an event occurs, the notification system sends a message to APNs or FCM +4. the platform service attempts delivery to the device +5. 
the app may open, refresh state, or show a system notification depending on payload type and app state + +```mermaid +flowchart LR + APP[Your Backend] --> ORCH[Notification Orchestrator] + ORCH --> APNS[APNs] + ORCH --> FCM[FCM] + APNS --> IOS[iOS Device] + FCM --> AND[Android Device] +``` + +### 11.2 APNs Basics + +APNs is Apple's push service. Important operational realities: + +- you send to APNs, not directly to the device +- device tokens can change and become invalid +- delivery timing is not strictly guaranteed +- app state and OS policies affect whether and how the notification is shown or processed + +### 11.3 FCM Basics + +FCM plays a similar role for Android and can also support web push scenarios. + +Important details: + +- token lifecycle must be managed carefully +- device availability and manufacturer-specific behavior can affect delivery +- notification and data messages behave differently depending on app state + +### 11.4 Token Management + +Token management is one of the most common production pain points. + +You need to handle: + +- token registration on app install or refresh +- multiple tokens per user due to multiple devices +- invalid or expired tokens returned by providers +- token removal when users log out or uninstall + +Without disciplined token hygiene, you waste money and effort sending to dead destinations. + +### 11.5 Delivery Uncertainty + +Push notifications are not a guaranteed immediate-delivery channel. + +Reasons: + +- device is offline +- battery optimization delays delivery +- platform may coalesce or deprioritize some notifications +- user disabled permissions +- token invalidated + +This is why important workflows often require eventual in-app reconciliation or email fallback. + +### 11.6 Silent Notifications + +Silent notifications can wake the app to refresh content without necessarily showing a visible alert. 
+ +They are useful for: + +- refreshing inbox state +- syncing badge counts +- preloading likely-needed data + +But they are heavily constrained by OS policies, so they should not be treated as a guaranteed background execution channel. + +### 11.7 Foreground vs Background Behavior + +The same push may behave differently based on app state: + +- foreground: app may intercept and render custom UX +- background: system may show notification or deliver data silently depending on payload and permissions +- terminated: behavior depends on platform rules and payload type + +This is why push design requires close partnership between backend and mobile engineers. + +### 11.8 Retries and Delivery Limitations + +Your backend can retry requests to APNs or FCM when their API fails transiently, but once the provider accepts a push, actual device delivery is still probabilistic. + +Therefore: + +- retry provider API failures sensibly +- do not assume acceptance equals user seen +- use collapse keys or dedupe keys for replaceable updates +- rely on app open and in-app state sync for real correctness + +### 11.9 Practical Examples + +Uber-like app: + +- real-time in-app updates while rider is online +- push for driver arrival if rider is backgrounded +- fallback SMS only for critical reachability gaps or special cases + +SaaS mobile app: + +- push for mentions, approvals, incidents, or account alerts +- user preferences decide which event types generate push +- app syncs authoritative state on open rather than trusting push payload alone + +### 11.10 Common Mistakes + +Common mistakes: + +- treating push as reliable enough to replace durable inbox state +- never cleaning invalid tokens +- sending too many pushes and training users to disable them +- using silent push as if it were guaranteed scheduled background compute + +Best practices: + +- maintain durable notification state separately +- manage token lifecycle aggressively +- use collapse and dedupe strategies for replaceable 
alerts +- measure open rate, delivery feedback, and disablement trends + +## 12. Retries + +Retries are one of the most important and most dangerous parts of communication systems. + +Done well, they recover from transient failure and make the system resilient. +Done badly, they multiply load, create duplicates, and extend outages. + +### 12.1 Why Retries Exist + +Many failures are temporary: + +- provider timeout +- brief network partition +- overloaded downstream service returning 429 or 503 +- transient database or cache issue + +Retrying later may succeed without user-visible failure. + +### 12.2 Transient vs Permanent Failures + +This distinction is critical. + +| Failure type | Example | Retry? | +|---|---|---| +| Transient | timeout, 503, temporary DNS issue | Usually yes | +| Capacity/rate limit | 429, queue full, provider throttle | Yes, but with backoff and pacing | +| Permanent data error | invalid email, malformed phone number, missing template variable | No | +| Policy rejection | unsubscribed user, blocked sender, compliance rule | No | + +If you do not classify failures, you will retry garbage forever. + +### 12.3 Retry Policies + +A retry policy usually defines: + +- maximum attempts +- initial delay +- backoff multiplier +- jitter strategy +- retryable status codes or error classes +- whether fallback provider or alternate channel should be used +- what goes to dead-letter after final failure + +### 12.4 Exponential Backoff + +Exponential backoff increases delay after each failure. + +Conceptually: + +$$ +delay_n = base \times factor^n +$$ + +This reduces immediate pressure on an unhealthy dependency. + +### 12.5 Jitter + +Jitter randomizes retry timing so all failed requests do not retry in lockstep. + +Without jitter, a temporary outage can turn into synchronized retry spikes. + +That is one of the classic ways retries make outages worse. 
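The backoff formula in 12.4 and the jitter idea in 12.5 can be combined in a few lines. The sketch below is one common variant, often called "full jitter": it caps the exponential delay and then draws the actual wait uniformly between zero and that cap. The function names, the `cap` parameter, and the default values are illustrative assumptions, not part of any specific library.

```python
import random


def backoff_delay(attempt, base=1.0, factor=2.0, cap=60.0, rng=random):
    """Return a retry delay in seconds for a zero-based attempt number.

    The ceiling follows delay_n = base * factor ** n, limited by `cap`.
    The returned delay is drawn uniformly from [0, ceiling] ("full jitter").
    """
    ceiling = min(cap, base * factor ** attempt)
    return rng.uniform(0.0, ceiling)


def retry_schedule(max_attempts, **kwargs):
    """Return one jittered delay per attempt, for attempts 0..max_attempts-1."""
    return [backoff_delay(n, **kwargs) for n in range(max_attempts)]
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests. The cap matters as much as the jitter: without it, a long outage pushes later attempts out to delays that are no longer useful.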
+ +### 12.6 Dead-Letter Queues + +A dead-letter queue stores items that repeatedly failed and need inspection or alternate handling. + +DLQs are useful because they: + +- stop poison messages from blocking healthy traffic +- preserve failed work for debugging +- allow manual replay after fixes +- provide operational visibility into systemic issues + +### 12.7 Poison Messages + +A poison message is a message that always fails because the content or logic is bad. + +Examples: + +- template references a missing variable +- payload violates provider schema +- recipient state is invalid and will never pass validation + +Retries will not fix poison. Classification and DLQ handling will. + +### 12.8 Retry Storms + +Retry storms happen when failure causes more traffic than success ever would. + +Typical causes: + +- too many retries with too little backoff +- many workers retrying the same provider simultaneously +- client reconnect loops plus server-side retries both activating during outage +- no circuit breaker or health-aware throttling + +This is why retries can save systems or destroy them. + +### 12.9 Idempotency Importance + +Retries mean duplicate attempts are inevitable. + +Idempotency makes duplicates safe. 
+ +Examples: + +- if `event_id + recipient + channel` already produced a send, do not send again +- if a webhook delivery with the same delivery ID is retried, downstream consumer should accept duplicate receipt without double-processing +- if an OTP request is superseded, older attempts should not remain valid + +### 12.10 Queue Processing and Retry Pipeline + +```mermaid +flowchart LR + Q[[Notification Queue]] --> W[Worker] + W --> TRY{Send succeeds?} + TRY -- Yes --> DONE[Mark delivered / awaiting callback] + TRY -- No transient --> RETRY[Schedule retry with backoff + jitter] + TRY -- No permanent --> FAIL[Mark failed permanently] + RETRY --> DQ[[Delayed Retry Queue]] + DQ --> W + W --> MAX{Attempts exceeded?} + MAX -- Yes --> DLQ[[Dead Letter Queue]] + MAX -- No --> TRY +``` + +### 12.11 Best Practices + +Best practices: + +- classify failures before retrying +- use exponential backoff with jitter +- cap attempts and send poison messages to DLQ +- make send operations idempotent +- add provider-level circuit breakers or adaptive throttling +- expose retry metrics so operators can see storms forming + +## 13. Templates + +Templates are where product communication and backend correctness meet. + +They are also a major source of production mistakes. + +### 13.1 Why Templates Matter + +Templates allow teams to separate event logic from presentation. + +Without templates, every service hardcodes message strings and formatting logic. That creates: + +- duplicated content logic +- harder localization +- inconsistent branding +- risky ad hoc changes in code deployments + +### 13.2 Template Engines + +Common template systems support: + +- placeholders such as user name, amount, due date +- conditional content +- loops for repeated items +- channel-specific rendering, such as HTML email vs plain text vs push body + +The engine should ideally be constrained. Overly powerful templates can become hard to reason about or insecure. 
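To make "constrained" concrete, here is a minimal sketch at the most restrictive end of the spectrum: placeholder-only substitution using Python's standard `string.Template`, with every value HTML-escaped before rendering. It deliberately has no conditionals or loops. The function name `render_email_html` and the sample template are assumptions for illustration.

```python
import html
from string import Template


def render_email_html(template_text, variables):
    """Render a placeholder-only template, escaping every value for HTML.

    string.Template supports only $name / ${name} substitution, so template
    authors cannot embed loops, conditionals, or arbitrary expressions.
    substitute() raises KeyError when a placeholder has no value.
    """
    escaped = {key: html.escape(str(value)) for key, value in variables.items()}
    return Template(template_text).substitute(escaped)
```

Usage: `render_email_html("<p>Hi $name</p>", {"name": user_name})`. A missing variable fails loudly instead of silently rendering an empty string. Real systems usually need more expressive engines; the tradeoff is that each added feature widens what a template author can break.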
+ +### 13.3 Localization + +Global products need localization support: + +- translated strings +- locale-specific date, time, number, and currency formatting +- right-to-left language support where needed +- per-market legal wording variations + +Localization is a major reason templates should be centrally managed. + +### 13.4 Personalization + +Templates often need data from multiple sources: + +- user profile +- event payload +- account metadata +- product context such as workspace or subscription plan + +The challenge is ensuring all required variables are available and valid at render time. + +### 13.5 Placeholders and Strong Contracts + +A mature system treats template variables like an API contract. + +Good practice: + +- define required and optional variables per template version +- validate payload shape before enqueueing large fanouts +- fail fast if critical variables are missing + +This prevents discovering a missing placeholder only after a million-send campaign starts. + +### 13.6 Versioning + +Templates need versioning because content changes over time. + +Reasons: + +- copy updates +- regulatory changes +- product redesigns +- A/B tests + +Versioning lets you: + +- replay historical sends faithfully +- know exactly which content a user saw +- roll back safely if a bad template is published + +### 13.7 A/B Testing Basics + +Notification systems often support experiments on: + +- subject line +- send timing +- body copy +- CTA wording +- channel choice + +The important engineering point is that experiment assignment should be stable and observable. Otherwise analytics become misleading. + +### 13.8 Approval Workflows + +At scale, template changes are often too risky for direct edit-and-send. 
+ +Companies frequently use: + +- draft and review states +- approval workflows for legal, compliance, or brand teams +- test-send environments +- guarded rollout by percentage or audience segment + +This is especially common for finance, healthcare, and large SaaS platforms. + +### 13.9 Template Safety + +Safety concerns include: + +- accidental broken placeholders +- unsafe HTML content +- unescaped user-generated data +- overlong SMS bodies creating unexpected segmentation cost +- push payload exceeding provider limits + +Template safety needs validation per channel, not just generic render success. + +### 13.10 Rendering Failures + +Rendering can fail because: + +- a required variable is missing +- localization entry is absent +- HTML is malformed +- experiment assignment references an unpublished variant + +These failures should be classified as permanent or configuration errors, not retried blindly forever. + +### 13.11 Preview and Testing Systems + +Good internal tooling matters. + +Useful capabilities: + +- preview with sample payloads +- render in multiple locales +- device/channel preview for email, SMS, and push +- validation against size limits and required variables +- test send to controlled recipients + +This is how companies safely manage hundreds or thousands of templates. + +## 14. Scheduling + +Scheduling is the part of notification systems that answers "not now, but later." It sounds easy until timezones, retries, DST, and exactly-once concerns appear. + +### 14.1 Delayed Notifications and Reminders + +Common scheduled notifications: + +- payment reminder tomorrow at 9 AM local time +- retry failed webhook in 30 minutes +- send digest every evening +- remind user if task remains unresolved for 24 hours + +Scheduling therefore exists both for product UX and operational retry workflows. 
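The core mechanic behind all of these examples is the same: store a job with a due time and release it only once that time has passed. A minimal in-memory sketch using a min-heap is shown below. The `DelayedQueue` class is hypothetical and ignores durability, sharding, and worker crashes, which a production scheduler cannot ignore.

```python
import heapq
import itertools


class DelayedQueue:
    """In-memory queue that releases jobs only once their due time has passed."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so two jobs with the same due time never compare payloads.
        self._counter = itertools.count()

    def schedule(self, due_at, job):
        """Register a job to become available at due_at (any comparable timestamp)."""
        heapq.heappush(self._heap, (due_at, next(self._counter), job))

    def pop_due(self, now):
        """Remove and return all jobs with due_at <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, job = heapq.heappop(self._heap)
            due.append(job)
        return due
```

A worker loop would call `pop_due(time.time())` periodically and hand the results to a normal delivery queue.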
+ +### 14.2 Cron vs Queue-Based Scheduling + +| Approach | Strengths | Weaknesses | Best fit | +|---|---|---|---| +| Cron-like scheduler | Simple for recurring global jobs | Not great for millions of per-user scheduled items | fixed recurring tasks | +| Delayed queue / timer wheel | Good for large numbers of future jobs | Operational complexity at very large horizons | reminders, retries, one-off schedules | +| Database-backed scheduler | Easy to reason about with scans | Can become expensive and contention-heavy | moderate scale scheduling | + +Most production systems use more than one mechanism. + +### 14.3 Distributed Schedulers + +A distributed scheduler must avoid missing jobs and also avoid running the same job many times. + +Typical strategies: + +- partition scheduled jobs by time bucket and worker shard +- use lease/claim model so only one worker owns a due item at a time +- store a dedupe or execution key so reruns are safe +- enqueue work into normal queues when it becomes due rather than executing inline in the scheduler + +This separation keeps the scheduler simple and lets worker fleets scale independently. + +### 14.4 Retry Scheduling + +Retry scheduling is just scheduling with shorter horizons and stricter idempotency. + +Important considerations: + +- next attempt time should be explicit +- attempts should be capped +- retry state must survive worker restarts +- delayed retries should not block primary queue throughput + +### 14.5 Timezone Handling + +Timezone bugs are common because user-local scheduling sounds simple but is not. + +Examples: + +- send at 9 AM in the user's configured timezone +- user changes timezone between scheduling and execution +- organization policy uses workspace timezone instead of user timezone + +The scheduler must define which timezone source is authoritative. 
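As a sketch of "send at 9 AM in the user's configured timezone", the function below converts the next local 9:00 into a UTC instant that a scheduler can store. It assumes Python 3.9+ with the standard `zoneinfo` module and an available IANA timezone database. The function name is hypothetical, and the sketch does not handle nonexistent or ambiguous local times.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def next_local_send_time_utc(now_utc, tz_name, hour=9, minute=0):
    """Return the next occurrence of hour:minute in tz_name as a UTC datetime.

    now_utc must be a timezone-aware datetime.
    """
    tz = ZoneInfo(tz_name)
    now_local = now_utc.astimezone(tz)
    candidate = now_local.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now_local:
        # Today's slot has already passed in local time, so use tomorrow's.
        candidate = (now_local + timedelta(days=1)).replace(
            hour=hour, minute=minute, second=0, microsecond=0
        )
    return candidate.astimezone(timezone.utc)
```

The design choice to notice is where `tz_name` comes from. If the user later changes timezone, a stored UTC instant will not move with them, so the system has to decide whether to recompute pending schedules.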
+ +### 14.6 DST Problems + +Daylight saving time creates two classic edge cases: + +- a local clock time that does not exist +- a local clock time that occurs twice + +If you schedule "2:30 AM local time" on the wrong day, what does that mean? + +Production systems need explicit behavior, such as: + +- shift to the next valid local time +- pin to UTC internally after initial local conversion +- document recurrence rules clearly + +### 14.7 Calendar Edge Cases + +Other calendar issues: + +- month-end behavior, such as "send on the 31st" +- leap years +- business-day-only rules +- local holidays for campaign or billing workflows + +These matter more than many candidates expect, especially in financial and enterprise systems. + +### 14.8 Campaign Scheduling + +Large campaigns introduce extra complexity: + +- recipient expansion can be huge +- provider rate limits require pacing +- rollout may need regional waves +- cancellation or pause controls are necessary +- duplicate prevention across retries and partial runs matters + +A campaign system is often more like a controlled distributed batch pipeline than a simple timer. + +### 14.9 Exactly-Once Challenges + +Scheduling systems rarely achieve true exactly-once delivery end-to-end. + +More realistic goal: + +- enqueue due work at least once +- make downstream processing idempotent +- store execution markers to avoid duplicate user-visible effects + +That is the same practical pattern seen across broader distributed systems. + +### 14.10 A Practical Scheduling Architecture + +```mermaid +flowchart LR + REQ[Create reminder / delayed notification] --> SCHED[(Schedule Store)] + SCHED --> SCAN[Scheduler Workers] + SCAN --> DUE{Now due?} + DUE -- No --> SCHED + DUE -- Yes --> Q[[Delivery Queue]] + Q --> W[Notification Worker] + W --> CH[Channel Provider] + W --> MARK[(Execution / Idempotency Store)] +``` + +## 15. 
Comparing Channels in Practice + +When interviewers ask communication questions, they often want channel tradeoffs, not only definitions. + +### 15.1 WebSockets vs SSE vs Long Polling + +| Question | WebSockets | SSE | Long Polling | +|---|---|---|---| +| Need bidirectional low-latency interaction? | Best choice | No | Poor fit | +| Need simple server-to-browser streaming? | Works, but may be overkill | Often best | Acceptable fallback | +| Need broad compatibility with older infrastructure? | Sometimes harder | Usually okay | Best compatibility | +| High connection count cost | Persistent connection memory | Persistent HTTP stream cost | Highest repeated request overhead | +| Typical examples | chat, collaboration, multiplayer-like UX | feeds, logs, dashboards, status | legacy updates, fallback transports | + +### 15.2 Email vs SMS vs Push Notifications + +| Channel | Strengths | Weaknesses | Best uses | +|---|---|---|---| +| Email | Rich content, durable inbox, cheap at scale relative to SMS | Deliverability complexity, slower user attention | receipts, alerts, digests, workflow updates | +| SMS | Very high reach, urgent attention | Expensive, abuse-prone, region-specific constraints | OTP, critical alerts, fallback reachability | +| Push | Low cost, strong mobile engagement | Delivery uncertainty, token churn, permission dependence | mobile app alerts, reminders, activity nudges | + +A strong production design usually uses channel hierarchy rather than one channel for everything. + +Example: + +- real-time in-app if user is active +- push if mobile user is inactive +- email for durable summary or critical account record +- SMS only for urgent or high-assurance scenarios + +## 16. How These Systems Connect in Real Architectures + +The biggest interview mistake is describing each component in isolation. + +Production systems connect them. 
+ +### 16.1 Chat Application Architecture + +```mermaid +flowchart TB + U1[Sender Client] --> GW1[Realtime Gateway] + GW1 --> MSG[Message Service] + MSG --> DB[(Message Store)] + MSG --> BUS[[Room Event Bus]] + BUS --> GW2[Realtime Gateways] + GW2 --> U2[Online Recipients] + BUS --> NOTIF[Notification Orchestrator] + NOTIF --> PUSH[Push / Email for offline users] + GW2 --> PRES[Presence Service] + PRES --> CACHE[(Presence TTL Store)] +``` + +Key insight: + +- live transport handles active users +- durable message storage handles history and offline recovery +- presence influences whether to send push or keep delivery in-app + +### 16.2 Ride Tracking Architecture + +Uber-like example: + +1. driver app emits location updates +2. ingestion service validates and writes them to a stream +3. geo or trip service computes rider-visible state +4. active rider session receives real-time updates through WebSocket or SSE +5. if rider is backgrounded, notification system may send push for major trip milestones + +The real-time system and notification system are therefore two views over the same event stream. + +### 16.3 SaaS Workflow Notifications + +GitHub or Stripe style example: + +1. domain event occurs, such as comment added or payment failed +2. write event through outbox pattern to avoid losing it during transaction boundaries +3. event bus feeds both in-app real-time feed and notification orchestrator +4. preferences and suppression rules decide channel +5. durable inbox and email/push/SMS state are updated independently + +This architecture is robust because the event is authoritative and downstream communication is decoupled. 
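The outbox step in this flow is easier to reason about with a small example. The sketch below uses SQLite from the Python standard library purely for illustration. The table names, the `payment.failed` event type, and the `publish` callback are assumptions, and a real relay would run as a separate process or be replaced by change data capture.

```python
import json
import sqlite3


def create_schema(conn):
    """Create the business table and the outbox table in the same database."""
    conn.executescript(
        """
        CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def mark_payment_failed(conn, payment_id):
    """Update business state and record the event in one transaction."""
    with conn:  # commits on success, rolls back if either statement raises
        conn.execute(
            "INSERT OR REPLACE INTO payments (id, status) VALUES (?, ?)",
            (payment_id, "failed"),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("payment.failed", json.dumps({"payment_id": payment_id})),
        )


def relay_outbox(conn, publish):
    """Publish unpublished events in order, marking each after publish succeeds.

    A crash between publish() and the UPDATE re-sends the event on the next
    run, so delivery is at-least-once and consumers must be idempotent.
    """
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Because the state change and the outbox row commit together, there is no window where the payment is marked failed but the event is lost. The cost is at-least-once publication, which is why the idempotency guidance in section 12.9 applies to every downstream consumer.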
+ +### 16.4 Notification Delivery Pipeline + +```mermaid +sequenceDiagram + participant P as Producer Service + participant O as Outbox / Event Bus + participant N as Notification Orchestrator + participant R as Rules + Preferences + participant Q as Channel Queue + participant W as Worker + participant X as Provider + participant S as Status Store + + P->>O: Publish domain event + O->>N: Deliver event + N->>R: Resolve recipients, channels, suppression + R-->>N: Eligible deliveries + N->>Q: Enqueue channel jobs with idempotency key + Q->>W: Deliver job + W->>X: Send via provider + X-->>S: Callback / receipt / bounce / failure + S-->>N: Update notification state and metrics +``` + +## 17. What Breaks at Scale + +Large communication systems usually break in predictable ways. + +### 17.1 Real-Time System Failures + +- reconnect storms after deploys or regional blips +- hotspot channels or rooms causing shard imbalance +- slow consumers building large outbound queues +- stale presence from delayed disconnect detection +- message replay windows too short for realistic mobile disconnects + +### 17.2 Notification System Failures + +- duplicate sends from missing idempotency +- retry storms after provider outage +- dead-letter growth due to poison templates or bad payload contracts +- provider callback backlog causing stale delivery state +- campaign sends overwhelming rate limits and delaying transactional traffic + +### 17.3 Organizational Failures + +- every team building its own notification logic +- no shared preference model +- weak observability into delivery stages +- no clear ownership of templates or compliance-sensitive content + +Strong engineering organizations build communication systems as platforms because the problems repeat across products. + +## 18. 
Best Practices and Common Mistakes + +### 18.1 Best Practices + +- separate authoritative business events from channel delivery +- choose the simplest transport that satisfies product requirements +- keep real-time transport distinct from durable history or durable inbox state +- use sequence numbers, cursors, or snapshots for reconciliation +- make retries idempotent and classify failures carefully +- use ephemeral stores with TTL for active presence +- centralize user preferences, suppression, and dedupe logic +- isolate transactional and bulk notification paths +- measure end-to-end latency, not just enqueue success + +### 18.2 Common Mistakes + +- assuming WebSockets imply reliable message delivery by themselves +- treating online state as exact truth +- sending every possible notification instead of designing for user attention +- retrying permanent failures forever +- ignoring provider callbacks and token invalidation +- letting campaigns compete with critical operational messages +- storing all heartbeat or presence writes durably and overwhelming the database + +## 19. How to Explain This in an Interview + +If asked to design or explain a communication system, structure the answer in this order: + +1. define the product requirement and latency expectation +2. choose transport based on interaction style +3. define durability and delivery guarantees +4. explain fanout and scaling approach +5. cover reconnect, retry, and idempotency behavior +6. discuss presence or offline fallback if relevant +7. mention observability, rate limits, and operational failure modes + +Example framing: + +"For active users, I would use WebSockets because the product needs bidirectional low-latency updates. I would keep durable state in the message store and use sequence numbers for replay after reconnect. Presence would be managed through heartbeats in a TTL-based store with grace periods. 
For offline users, I would feed the same events into a notification orchestrator that applies preferences, dedupe, and channel-specific retries for push, email, or SMS." + +That answer sounds like someone who has moved beyond glossary knowledge. + +## 20. Final Mental Model + +Communication systems are not only about sending data. They are about delivering the right information: + +- to the right user +- through the right channel +- with the right urgency +- at the right cost +- with acceptable reliability and ordering +- without overwhelming infrastructure or users + +The best backend systems treat communication as a first-class architecture concern, not a thin wrapper around providers or sockets. + +If you remember one core idea, remember this: + +real-time delivery and notifications should be designed as downstream communication layers built on top of authoritative events, with explicit handling for fanout, retries, ordering, idempotency, presence, and user preference. + +That is the difference between a toy design and a production one. diff --git a/systems design/7.searchAndFIleSystem.md b/systems design/7.searchAndFIleSystem.md new file mode 100644 index 0000000..de3e5be --- /dev/null +++ b/systems design/7.searchAndFIleSystem.md @@ -0,0 +1,2978 @@ +# Search & Discovery + Media & File Systems + +This chapter covers two families of systems that appear in almost every large product: + +- systems that decide what users can find +- systems that store, transform, and deliver large binary data such as images, videos, documents, logs, and backups + +In interviews, these topics are often split into separate questions: "Design search for an e-commerce app", "Design Instagram feed", "Design file upload for a SaaS product", or "Design YouTube video processing". + +In production, they are deeply connected. 
+ +An e-commerce product page may depend on: + +- a transactional database for product metadata +- a search index for keyword retrieval +- a ranking system for relevance and business rules +- a recommendation system for discovery +- an object store for images and video +- a CDN for fast delivery +- an async media pipeline for thumbnails and optimization + +So the real engineering question is not, "What is search?" or "What is S3?". The real question is: + +How do these systems work together under scale, latency pressure, stale data, changing ranking logic, unreliable networks, expensive media processing, and strict security requirements? + +This guide is written for both interview preparation and real backend engineering. The goal is not to memorize terms. The goal is to understand: + +- why each system exists +- what problem it solves that simpler systems do not +- how it works internally +- what fails at scale +- what tradeoffs strong engineers discuss in interviews +- how companies actually combine these components in production + +Examples in this guide are generalized from common patterns publicly discussed by companies such as Google, Amazon, Netflix, Uber, YouTube, Instagram, TikTok, GitHub, Stripe, and large SaaS platforms. + +## 1. Big Picture: Why These Topics Belong Together + +Search, discovery, and media systems all sit on the user-facing edge of backend engineering. 
+ +They are the systems users feel immediately: + +- search that returns the wrong result feels broken +- autocomplete that lags feels cheap +- feeds that repeat stale content feel low quality +- uploads that fail at 95 percent feel unreliable +- videos that buffer or thumbnails that look wrong feel unfinished + +These systems also force tradeoffs faster than many internal systems: + +- relevance vs latency +- freshness vs throughput +- quality vs compute cost +- personalization vs privacy +- durability vs storage cost +- precomputation vs flexibility + +### 1.1 One Product, Many Subsystems + +```mermaid +flowchart LR + U[User] --> APP[API / App Backend] + APP --> DB[(Primary DB)] + DB --> CDC[CDC / Change Events] + CDC --> SEARCH[Search Index] + CDC --> FEAT[Feature / Event Pipeline] + APP --> REC[Recommendation / Feed Service] + FEAT --> REC + REC --> CACHE[(Feed Cache)] + APP --> OBJ[(Object Storage)] + OBJ --> MEDIA[Media Processing Pipeline] + MEDIA --> CDN[CDN] + APP --> CDN + SEARCH --> APP + CACHE --> APP +``` + +This architecture is common across many products: + +- Amazon-like commerce: product DB, search index, ranking, recommendations, images in object storage +- GitHub-like SaaS: repo and issue metadata in DB, code or issue search index, attachments in object storage, permissions filtering everywhere +- YouTube or TikTok: metadata DB, feed ranking system, object storage, transcoding pipeline, CDN delivery +- Stripe-like internal document systems: metadata DB, audit logs, secure file storage, signed downloads, retention policies + +### 1.2 Search vs Discovery + +These ideas are related but not identical. 
+ +| Dimension | Search | Discovery / Recommendation | +|---|---|---| +| User intent | Explicit | Often implicit | +| Input | Query, filters, sort | User profile, session, context, behavior | +| Goal | Find what user asked for | Show what user is likely to want | +| Retrieval basis | Query-document match | Candidate generation from many signals | +| Example | "wireless headphones" | "Products you may like" | + +Search is usually intent retrieval. Recommendation is usually intent inference. + +The best products use both. + +### 1.3 Latency Expectations + +Users tolerate different delays for different surfaces. + +| Surface | Typical expectation | Why it matters | +|---|---|---| +| Autocomplete | Often less than 50 ms server-side, very low hundreds end-to-end | Typing feels broken if suggestions lag | +| Search results | Often around 100-300 ms for the first result page | Search is interactive and abandonment is high | +| Home feed | Often low hundreds of ms for first page, with prefetching and caching | Users expect quick app open | +| File upload initiation | Usually should start immediately | Perceived responsiveness matters | +| Image delivery | Often tens of ms from edge | Visual surfaces must render fast | +| Video playback startup | Usually a few hundred ms to a few seconds depending on network and buffer policy | Startup delay strongly affects engagement | + +The exact target depends on product and network conditions, but the theme is consistent: these are latency-sensitive systems. 
+ +### 1.4 Interview Framing + +When interviewers ask about search, feeds, or media, they are usually evaluating whether you can reason about: + +- data flow from source of truth to user-facing surface +- specialized indexes or storage layouts +- online vs offline computation +- scale bottlenecks and hotspots +- correctness and security boundaries +- degradation behavior when one subsystem is stale or partially unavailable + +Strong answers start from user behavior and workload shape, not from product names. + +## 2. Search System + +### 2.1 What a Search System Is + +A search system is a retrieval system that helps users find relevant documents, records, products, posts, issues, places, or media based on a query. + +The important word is relevant. + +Databases already know how to retrieve data, so why do search systems exist? + +Because most user-facing search is not exact lookup. It is approximate, text-heavy, fuzzy, relevance-ordered retrieval. + +Examples: + +- a user types "noise cancel headphone" and expects products matching "noise-canceling headphones" +- a developer searches GitHub issues for a phrase and expects typo tolerance, ranking, and permission-safe results +- a job seeker searches roles by title, seniority, remote status, and location +- a rider searches for a place in Uber and expects prefix matching, geospatial awareness, and ranking by context + +### 2.2 Why Databases Alone Are Often Not Enough + +Relational databases and standard secondary indexes are optimized for exact lookups, range scans, joins, and transactional workloads. They are not primarily optimized for large-scale full-text ranking. 
+ +If you try to implement serious search with only a transactional database, you quickly run into problems: + +- `LIKE '%term%'` queries do not scale well for large text corpora +- phrase search and ranking are limited or expensive +- stemming, synonyms, language analyzers, and typo tolerance are not first-class features in many OLTP systems +- scoring millions of candidate documents with relevance functions is not what OLTP engines are optimized for +- query patterns are highly varied and difficult to serve with normal B-tree indexes alone + +### 2.3 Full-Text Search vs Exact Lookup + +| Dimension | Exact DB Lookup | Full-Text Search | +|---|---|---| +| Match type | Exact key, range, prefix in some cases | Token-based, phrase-based, fuzzy, semantic, ranked | +| Result ordering | Usually explicit sort order | Usually relevance first | +| Storage layout | Row-oriented or index-oriented for structured fields | Inverted indexes, postings, specialized scoring metadata | +| Common use | User by ID, order by timestamp | Search products, documents, issues, articles | +| Optimization target | Transactional correctness and predictable queries | Fast retrieval and ranking over large corpora | + +Interview shortcut: databases answer "Which rows satisfy these predicates?" Search systems answer "Which documents are most relevant to what the user probably meant?" + +### 2.4 Architecture of a Search System + +At a high level, production search systems split into two pipelines: + +- indexing pipeline: turns source data into searchable indexes +- query pipeline: turns a user query into ranked results + +```mermaid +flowchart LR + SRC[Source of Truth
DB / CMS / Event Log] --> ING[Ingestion] + ING --> IDX[Index Build / Update] + IDX --> SHARDS[Search Shards + Replicas] + Q[User Query] --> API[Search API] + API --> PARSE[Query Parsing / Rewrite] + PARSE --> COORD[Coordinator] + COORD --> SHARDS + SHARDS --> RANK[Merge + Rank] + RANK --> RES[Results] +``` + +The source of truth is usually not the search index itself. It is usually: + +- a relational DB +- a document database +- a content management system +- a stream of events +- a crawler pipeline in web search + +The search index is a serving structure optimized for retrieval, not necessarily for canonical storage. + +### 2.5 How Production Search Differs from Normal Database Queries + +Production search systems usually have to solve problems that ordinary OLTP queries do not: + +- tokenization and normalization +- ranking by multiple signals +- approximate matching +- synonym expansion +- language-aware analysis +- ACL or permission filtering +- faceting and filtering at scale +- scatter-gather across many shards +- near-real-time indexing with eventual consistency +- degraded behavior under partial shard failures + +That is why many production systems use specialized engines such as Elasticsearch, OpenSearch, Solr, Vespa, Lucene-based services, or internal retrieval systems. + +### 2.6 Query Pipeline + +A realistic query pipeline often includes more than keyword matching. 
+ +```mermaid +flowchart LR + Q[Raw Query] --> CLEAN[Normalize / Spell / Parse] + CLEAN --> REWRITE[Synonyms / Query Rewrite / Intent Detection] + REWRITE --> RET[Retrieval] + RET --> FILT[Apply Filters] + FILT --> LR[Lightweight Ranking] + LR --> HR[Heavy Ranking / Business Rules] + HR --> ACL[Permission Check / Result Shaping] + ACL --> OUT[Final Results] +``` + +Important steps: + +- query normalization: lowercasing, punctuation handling, Unicode normalization +- parser: interpret phrases, field-specific search, boolean operators, quoted strings +- query rewrite: expand synonyms, fix spelling, map common variants +- retrieval: find candidate documents quickly from the index +- ranking: score candidates using lexical, behavioral, freshness, popularity, and quality signals +- filtering: apply structured constraints such as category, location, price, permissions + +### 2.7 Distributed Search Basics + +Large search systems cannot keep all searchable data on one machine. So the index is partitioned into shards. + +Common pattern: + +1. documents are assigned to shards +2. each shard stores a local index +3. replicas provide availability and read capacity +4. a query coordinator fans the request out to relevant shards +5. each shard returns its top-k candidates +6. the coordinator merges them into the global top-k + +This is called scatter-gather. + +Challenges: + +- tail latency: the whole query waits for slow shards unless timeouts or degraded modes exist +- score comparability: local shard scores may need consistent scoring logic so global merge is meaningful +- hotspot shards: skewed data or popular terms can overload specific shards +- rebalancing: adding capacity requires moving large index segments + +### 2.8 Freshness vs Performance + +Search freshness means how quickly updates in the source system appear in search results. 
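Freshness is easiest to reason about as a measured lag. A minimal sketch, assuming you can record when a write committed in the source system and when it first became searchable (the timestamps and helper names here are illustrative, not part of any specific engine):

```python
import math

def freshness_lags(samples):
    """samples: iterable of (committed_at, searchable_at) timestamps in seconds."""
    return sorted(searchable_at - committed_at for committed_at, searchable_at in samples)

def percentile(sorted_values, p):
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

# Hypothetical measurements: three fast documents and one that waited on a slow batch.
samples = [(0, 1), (10, 12), (20, 21), (30, 40)]
lags = freshness_lags(samples)
p50 = percentile(lags, 50)
p99 = percentile(lags, 99)
```

Tracking a high percentile next to the median exposes the documents that got stuck, which an average would hide.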
+ +Users expect different freshness depending on domain: + +- social posts or breaking news: often seconds or near-real-time +- product inventory and pricing: usually very fresh because stale search hurts conversion +- document search in SaaS: usually seconds to minutes is acceptable depending on UX promises +- code search on a giant corpus: sometimes modest indexing delay is acceptable if retrieval is fast and reliable + +The tradeoff is that very fresh indexing increases write pressure and may reduce query efficiency. + +Common tension: + +- frequent small segment updates improve freshness +- large optimized segments improve query performance and compression + +Many systems compromise with near-real-time indexing: documents become searchable quickly, while expensive segment merges happen asynchronously. + +### 2.9 Consistency Challenges + +Search is usually eventually consistent with the source of truth. + +Typical failure cases: + +- DB write succeeded but indexing event was delayed +- product deleted in DB still appears in search for a short period +- permission change not reflected immediately, risking data leakage if ACL filtering is wrong +- inventory count in search is stale while checkout uses the DB + +Best practice: + +- treat the DB or source system as the correctness authority +- do not let search be the final source for money, inventory reservation, or permissions +- apply final correctness checks in downstream business logic when it matters + +GitHub-like systems care deeply about permission-safe search. Returning a private issue, repo, or file in search is worse than returning no result. + +### 2.10 Search Latency and Fault Tolerance + +Search systems are interactive systems, so they need graceful degradation. 
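One concrete form of graceful degradation is a coordinator that stops waiting for a straggler shard and returns what it has, flagged as partial. A minimal sketch using threads to stand in for network calls; the shard names, latencies, and response shape are assumptions made up for illustration:

```python
import heapq
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical shards: name -> (simulated latency in seconds, [(score, doc_id), ...])
SHARDS = {
    "shard-0": (0.0, [(0.9, "d1"), (0.4, "d7")]),
    "shard-1": (0.0, [(0.8, "d3"), (0.7, "d4")]),
    "shard-2": (0.75, [(0.95, "d9")]),  # straggler
}

def query_shard(name):
    latency, hits = SHARDS[name]
    time.sleep(latency)  # stands in for a remote call
    return hits

def scatter_gather(shard_names, k, timeout_s):
    pool = ThreadPoolExecutor(max_workers=len(shard_names))
    futures = {pool.submit(query_shard, name): name for name in shard_names}
    # Wait for every shard, but never longer than the coordinator's deadline.
    done, not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block the response on stragglers
    hits = [hit for future in done for hit in future.result()]
    missing = sorted(futures[future] for future in not_done)
    return {
        "results": heapq.nlargest(k, hits),  # global top-k from local top-k lists
        "partial": bool(missing),
        "missing_shards": missing,
    }

response = scatter_gather(list(SHARDS), k=3, timeout_s=0.25)
```

Here `shard-2` holds the highest-scoring document but misses the deadline, so the response carries the best three hits from the other shards and marks itself partial. Whether that trade is acceptable depends on the product: it is usually fine for product search and usually not fine for compliance or audit search.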
+ +Common patterns: + +- shard replicas for availability +- coordinator timeouts to avoid waiting forever on a straggler shard +- degraded results if one replica set is temporarily unavailable +- hot query caching for common requests +- precomputed filter bitsets or caches for expensive constraints +- monitoring p50, p95, p99 separately because tail latency matters more than averages + +Interview note: if an interviewer asks about fault tolerance, discuss replicas, partial results, timeouts, retry behavior, and stale indexes. "We have backups" is not the answer for serving systems. + +### 2.11 Common Mistakes + +- treating search as just another SQL query layer +- making the search index the source of truth for critical writes +- ignoring permission filtering until late in the design +- underestimating reindexing cost after analyzer changes +- focusing only on retrieval and forgetting ranking quality + +## 3. Indexing + +### 3.1 Why Search Indexing Exists + +Search indexing exists because scanning every document for every query is too slow. + +The core idea is preprocessing. + +Instead of asking, "Which documents contain this term?" by reading the whole corpus repeatedly, the system builds a data structure ahead of time that answers that question quickly. + +That preprocessing step is indexing. + +### 3.2 Document Ingestion Pipeline + +Indexing is usually a pipeline, not a single write. 
+ +```mermaid +flowchart LR + DB[DB / Source Records] --> EVT[CDC / Event Stream / Batch Export] + EVT --> EXTRACT[Extract Fields] + EXTRACT --> ANALYZE[Tokenize / Normalize / Language Analysis] + ANALYZE --> ENRICH[Synonyms / ACLs / Metadata / Quality Signals] + ENRICH --> BUILD[Build or Update Index Segments] + BUILD --> REPL[Replicate / Refresh Search Nodes] + REPL --> SERVE[Search Serving] +``` + +Documents often need field-specific handling: + +- title may get higher weight +- tags may be exact or lightly analyzed +- description may use full stemming and stop-word removal +- permissions and category fields may be stored for filtering +- freshness timestamps may be stored for ranking + +### 3.3 Tokenization + +Tokenization breaks text into searchable units called tokens. + +Examples: + +- "wireless headphones" -> `wireless`, `headphones` +- "foo-bar" may become `foo`, `bar`, or `foo-bar` depending on analyzer design +- East Asian languages may require dictionary-based or statistical segmentation rather than whitespace splitting + +Why this matters: + +What counts as a token determines what can be retrieved. + +Poor tokenization causes obvious product bugs: + +- searching "e-mail" fails to match "email" +- searching code symbols breaks because punctuation handling is wrong +- searching C++ or C# fails because analyzers stripped important characters + +Production systems often use different analyzers for different fields and languages. + +### 3.4 Normalization + +Normalization makes equivalent text forms consistent. + +Common steps: + +- lowercase conversion +- Unicode normalization +- accent folding in some products +- punctuation normalization +- whitespace collapsing + +Without normalization, simple variants become separate search terms and retrieval quality drops. + +### 3.5 Stemming and Lemmatization + +Stemming reduces words to a base form so related terms match. 
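To make the analysis chain concrete, here is a minimal sketch of normalization, tokenization, and a toy suffix-stripping stemmer. It is deliberately naive and is not how any production analyzer works; the suffix list and function names are assumptions for illustration only.

```python
import re
import unicodedata

def normalize(text):
    # NFKD splits accented characters into base + combining mark,
    # so dropping combining marks gives simple accent folding.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.lower()

def tokenize(text):
    # Naive: keep runs of ASCII letters and digits, split on everything else.
    # This turns "e-mail" into two tokens and "C++" into "c".
    return re.findall(r"[a-z0-9]+", text)

SUFFIXES = ("ing", "ed", "s")  # toy rule set, not a real stemming algorithm

def toy_stem(token):
    if token.endswith("ss"):
        return token  # avoid turning "wireless" into "wireles"
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    return [toy_stem(token) for token in tokenize(normalize(text))]

same_root = analyze("Connected connecting connects")
hyphen_bug = analyze("e-mail") != analyze("email")
```

The sketch conflates `connected`, `connecting`, and `connects`, but it also shows the limits of rule-based stripping: `runs` becomes `run`, `running` becomes `runn`, and the irregular form `ran` is left alone. It also reproduces the "e-mail" versus "email" mismatch described in 3.3, which is why analyzer choices need testing against real queries.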
+ +Examples: + +- `running`, `runs`, `ran` may reduce toward a common root +- `connect`, `connected`, `connection` may become more retrievable together + +Why it exists: + +Users usually care about concept matching, not exact inflected forms. + +Tradeoff: + +- aggressive stemming increases recall +- too much stemming can hurt precision by conflating different meanings + +Not every domain wants stemming. Code search, SKU search, names, and legal text often need more exact handling. + +### 3.6 Stop Words + +Stop words are very common words such as "the", "a", or "of" that may add little value to retrieval. + +Why they are sometimes removed: + +- they occur in many documents +- they increase index size +- they often do not help ranking + +Why they are sometimes kept: + +- phrase search needs them +- some queries depend on them +- domain-specific language may make them important + +Example: "to be or not to be" or song titles need precise handling. + +### 3.7 Synonyms + +Synonyms allow related terms to match the same concept. + +Examples: + +- `tv` and `television` +- `hoodie` and `sweatshirt` +- `software engineer` and `developer` in some job systems +- `nyc` and `new york city` + +Synonyms are powerful and dangerous. + +They improve recall, but bad synonym rules can create surprising results. Expanding `apple` to `fruit` and `company` naively is an obvious relevance bug. + +Production systems usually treat synonyms as curated domain knowledge, not as a casual text feature. + +### 3.8 Language Handling Basics + +Multi-language search is not just translation. 
+ +Language handling may include: + +- language detection +- per-language analyzers +- script normalization +- stemming rules per language +- tokenization strategies for languages without whitespace delimiters +- query rewriting or synonym dictionaries per locale + +Global products such as Google, Amazon, YouTube, and large SaaS tools need language-aware indexing because naive English-centric analysis fails internationally. + +### 3.9 Incremental Indexing + +Rebuilding the entire index on every document change is impossible at scale. So production systems do incremental indexing. + +Typical process: + +1. source record changes +2. an event or CDC record is emitted +3. indexer fetches or receives the updated document +4. changed fields are reanalyzed +5. the document is added, updated, or tombstoned in the index + +Common challenges: + +- event duplication +- out-of-order updates +- delete propagation +- retries causing duplicate work +- partial failure between DB write and index update + +Idempotent indexing pipelines matter a lot. + +### 3.10 Near Real-Time Indexing + +Many search engines are near real-time rather than strictly real-time. + +That means: + +- writes become searchable after a short delay +- index refresh is decoupled from durable storage operations +- background merges optimize segments later + +This design keeps query latency reasonable while preserving acceptable freshness. + +For products like issue search, product catalogs, and SaaS document search, near-real-time indexing is often the right tradeoff. + +### 3.11 Reindexing Challenges + +Reindexing is one of the biggest operational realities in search. 
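Rebuilds and incremental updates both replay events, so both depend on the replay-safe writes described in 3.9. A minimal sketch of a version-guarded apply step, assuming every source write carries a monotonically increasing version per document (the `IndexEvent` and `ToyIndex` names are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IndexEvent:
    doc_id: str
    version: int          # assumed to increase with every source write
    body: Optional[str]   # None means the document was deleted

class ToyIndex:
    def __init__(self):
        self.docs = {}      # doc_id -> body
        self.versions = {}  # doc_id -> last applied version, kept for deletes too

    def apply(self, event):
        """Apply an event once; return False for duplicates and stale events."""
        if event.version <= self.versions.get(event.doc_id, -1):
            return False
        self.versions[event.doc_id] = event.version
        if event.body is None:
            self.docs.pop(event.doc_id, None)  # tombstone: remove from serving
        else:
            self.docs[event.doc_id] = event.body
        return True

index = ToyIndex()
applied = [
    index.apply(IndexEvent("p1", 1, "wireless headphones")),
    index.apply(IndexEvent("p1", 2, None)),                    # delete
    index.apply(IndexEvent("p1", 1, "wireless headphones")),   # late replay
]
```

Because the delete's version is remembered, the late replay of version 1 is ignored and the deleted product does not reappear. Without that guard, replaying events after a backfill could resurrect deleted documents.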
+ +You need full reindexing when: + +- analyzer rules change +- synonym logic changes significantly +- field weights or schema design changes +- permissions model changes +- you move to a new index version + +Why it is hard: + +- large corpora take time to rebuild +- dual-running old and new indexes increases cost +- cutover must avoid downtime and bad ranking regressions +- stale or missing events during rebuild can corrupt freshness + +Common strategies: + +- build a new index version in parallel +- backfill from source of truth +- replay recent events after the backfill window +- run shadow reads or compare sample queries +- switch traffic gradually + +This is similar to blue-green deployment for search data. + +### 3.12 Database Indexes vs Search Indexes + +| Dimension | Database Index | Search Index | +|---|---|---| +| Primary goal | Speed up structured lookups and range queries | Speed up text retrieval and relevance-ranked retrieval | +| Typical structure | B-tree, hash, LSM-related structures | Inverted index, postings, term dictionaries, doc values | +| Query style | Predicates on fields | Query terms, phrases, fuzzy matching, ranking | +| Source of truth role | Often part of the canonical DB | Usually derived from another source of truth | +| Update pattern | Tight coupling with DB writes | Often async or near-real-time | +| Ranking support | Limited compared with search engines | Central purpose of the system | + +### 3.13 Best Practices + +- keep indexing idempotent and replay-safe +- separate source of truth from serving index +- version analyzers and schemas explicitly +- measure freshness lag, not just query latency +- treat reindexing as a normal operational workflow, not an emergency-only task + +## 4. Inverted Index + +### 4.1 What an Inverted Index Is + +An inverted index maps each term to the documents that contain it. + +Instead of storing documents and asking, "Which terms are inside this document?" 
the system stores terms and asks, "Which documents contain this term?" + +That inversion is what makes large-scale text retrieval efficient. + +### 4.2 Why It Powers Most Search Systems + +Most keyword search systems need to answer queries like: + +- which documents contain `wireless` +- which documents contain both `wireless` and `headphones` +- which documents contain the exact phrase `noise cancelling` + +If you have a term-to-document mapping, you can answer these queries much faster than scanning all documents. + +### 4.3 Term -> Document Mapping + +A simple example: + +Documents: + +- D1: "wireless noise cancelling headphones" +- D2: "wired gaming headset" +- D3: "wireless earbuds with case" + +Inverted view: + +- `wireless` -> D1, D3 +- `noise` -> D1 +- `cancelling` -> D1 +- `headphones` -> D1 +- `wired` -> D2 +- `gaming` -> D2 +- `headset` -> D2 +- `earbuds` -> D3 +- `case` -> D3 + +The list of document IDs for a term is called a postings list. + +### 4.4 Postings Lists + +A postings list usually stores more than just document IDs. + +It may include: + +- document ID +- term frequency in the document +- positions of the term inside the document +- field information such as title vs body +- payloads or extra per-hit metadata in some engines + +Why extra metadata matters: + +- term frequency helps ranking +- positions enable phrase and proximity search +- field data enables field weighting + +### 4.5 Positional Indexes and Phrase Search + +If the system stores term positions, it can support phrase queries. + +Example: + +- D1: "machine learning systems" +- D2: "systems for machine translation learning" + +Searching for the phrase "machine learning" should strongly prefer D1. + +Without positions, the engine only knows both terms exist. With positions, it knows whether they appear adjacent and in the correct order. + +### 4.6 Boolean Search + +Boolean search combines postings lists. 
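A minimal sketch of those set operations, using integer document IDs 1 to 3 for D1 to D3 from the example in 4.3. The intersection uses a two-pointer merge, which is the reason postings are kept sorted; the union and difference helpers use Python sets for brevity rather than merges.

```python
# Sorted postings lists for a few terms from the 4.3 example (1 = D1, 2 = D2, 3 = D3).
POSTINGS = {
    "wireless": [1, 3],
    "headphones": [1],
    "wired": [2],
    "earbuds": [3],
}

def intersect(a, b):
    """A AND B: walk both sorted lists once, keeping IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """A OR B: every ID present in either list, returned sorted."""
    return sorted(set(a) | set(b))

def difference(a, b):
    """A NOT B: IDs in a that do not appear in b, order preserved."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]

both = intersect(POSTINGS["wireless"], POSTINGS["headphones"])
either = union(POSTINGS["wireless"], POSTINGS["wired"])
without = difference(POSTINGS["wireless"], POSTINGS["earbuds"])
```

The merge touches each posting at most once, so the cost of `A AND B` grows with the combined length of the two lists rather than with corpus size.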
+ +Examples: + +- `A AND B`: intersect postings lists +- `A OR B`: union postings lists +- `A NOT B`: subtract postings lists + +This is one reason inverted indexes are fast: set operations on sorted document ID lists are efficient. + +### 4.7 Compression Basics + +Postings lists can be huge, so compression matters. + +Common ideas: + +- store sorted document IDs and compress gaps between them instead of raw IDs +- use variable-length integer encoding +- group postings into blocks +- add skip pointers or skip blocks so the engine can jump ahead during intersections + +Compression improves memory and disk efficiency, and often query speed too because less data must be read. + +### 4.8 Distributed Inverted Indexes + +At large scale, the index is partitioned across many nodes. + +Partitioning approaches: + +- document partitioning: each shard stores all terms for a subset of documents +- term partitioning: less common in many general-purpose serving systems, but conceptually possible for some specialized workloads + +Document partitioning is common because it simplifies writes and local scoring. + +The coordinator sends the query to all relevant shards, and each shard computes local top results using its local inverted index. + +### 4.9 How Retrieval Is Fast in Practice + +Suppose the query is `wireless headphones`. + +The search engine typically: + +1. normalizes the query +2. finds postings for `wireless` +3. finds postings for `headphones` +4. intersects or otherwise combines candidate sets +5. uses frequency, field boosts, positions, and ranking signals to score candidates +6. returns only the top few results + +The system does not score the whole corpus. It narrows aggressively using the inverted index first. + +That is why retrieval and ranking are separated. 
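The steps in 4.9 can be sketched end to end on the three-document example from 4.3. This is a toy: the score is just a sum of term frequencies, candidates use OR semantics, and the function names are assumptions, but it shows that only documents reachable through postings are ever scored.

```python
import heapq
from collections import defaultdict

DOCS = {
    "D1": "wireless noise cancelling headphones",
    "D2": "wired gaming headset",
    "D3": "wireless earbuds with case",
}

def build_index(docs):
    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query, k=2):
    terms = query.lower().split()  # step 1: normalize the query
    candidates = set()
    for term in terms:             # steps 2-4: look up and combine postings
        candidates.update(index.get(term, {}))
    scored = [                     # step 5: score candidates only
        (sum(index.get(term, {}).get(doc_id, 0) for term in terms), doc_id)
        for doc_id in candidates
    ]
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]  # step 6: top-k

index = build_index(DOCS)
results = search(index, "wireless headphones")
```

D2 contains neither query term, so it never enters the candidate set and costs nothing at scoring time. A real engine would replace the toy score with something like BM25 plus field boosts, but the narrowing step is the same idea.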
+ +### 4.10 Common Failure Cases + +- very common terms create long postings lists and high query cost +- badly chosen analyzers create index bloat +- large positional indexes improve quality but increase storage +- hotspot terms create shard imbalance +- deletes and updates create segment fragmentation until merges clean things up + +### 4.11 Interview Angle + +If asked to explain an inverted index, keep it simple: + +"It is a term-to-document lookup structure. Instead of scanning every document on each query, the engine jumps directly from the query terms to candidate documents through postings lists. Positional metadata enables phrase search, and compressed postings plus shard-level retrieval keep it fast at scale." + +## 5. Autocomplete + +### 5.1 Why Autocomplete Exists + +Autocomplete reduces typing effort, helps users express intent, and increases query success. + +It is one of the highest-leverage search UX features because it helps before the actual search even runs. + +Good autocomplete does several things: + +- speeds up input +- corrects or guides query formulation +- exposes popular intents +- reduces zero-result searches +- nudges users toward query structures the backend handles well + +Amazon-style search boxes, Google suggestions, GitHub issue filters, and SaaS global search bars all rely on autocomplete. + +### 5.2 Prefix Matching + +The simplest autocomplete form is prefix matching. + +If the user types `wire`, the system returns suggestions starting with that prefix: + +- `wireless headphones` +- `wireless mouse` +- `wired headset` + +Prefix matching is attractive because it is conceptually simple and fast. + +### 5.3 Trie Basics + +A trie is a tree where each edge represents a character or token prefix. 
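A minimal character-level sketch that also keeps a precomputed top-k list at every node, so a lookup is just a walk down the prefix. The query counts are made up for illustration, and a production system would more likely use a compacted structure than one Python object per character.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.top = []       # best (count, phrase) pairs for this prefix

class SuggestionTrie:
    def __init__(self, k=3):
        self.root = TrieNode()
        self.k = k

    def insert(self, phrase, count):
        """Add a phrase once; every prefix node keeps only its k most popular phrases."""
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((count, phrase))
            node.top.sort(key=lambda pair: (-pair[0], pair[1]))
            del node.top[self.k:]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [phrase for _, phrase in node.top]

trie = SuggestionTrie(k=3)
# Hypothetical query popularity counts.
for phrase, count in [
    ("wireless headphones", 900),
    ("wireless mouse", 400),
    ("wired headset", 250),
    ("wire cutters", 50),
]:
    trie.insert(phrase, count)

for_wire = trie.suggest("wire")
for_wirel = trie.suggest("wirel")
```

Storing the top-k at each node trades memory and insert cost for very cheap reads, which fits a workload where prefixes are read far more often than popularity data changes.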
+ +Why tries are useful: + +- prefixes share storage +- prefix lookups are fast +- top suggestions can be stored or aggregated at intermediate nodes + +```mermaid +flowchart TD + ROOT[Root] --> W[w] + W --> WI[wi] + WI --> WIR[wir] + WIR --> WIRE[wire] + WIRE --> WIREL[wirel] + WIREL --> WIRELE[wirele] + WIRELE --> WIRELES[wireles] + WIRELES --> WIRELESS[wireless] + WIRELESS --> H[wireless headphones] + WIRELESS --> M[wireless mouse] +``` + +At scale, production systems usually store compacted tries or other optimized prefix structures rather than naive character-by-character trees. + +### 5.4 N-gram Approaches + +Autocomplete is not always solved with a trie. + +N-gram indexing can help with: + +- substring matching +- typo tolerance +- matching mid-word fragments +- languages or domains where token boundaries are tricky + +Tradeoff: + +- n-grams improve recall and flexibility +- they increase index size significantly +- they may add noise if scoring is weak + +### 5.5 Popularity-Based Suggestions + +Not every valid prefix completion should be shown. + +Usually suggestions are ranked by popularity and usefulness. + +Signals may include: + +- historical query frequency +- click-through rate after suggestion selection +- conversion rate in commerce systems +- recency or trending status +- user-specific history + +Example: + +If millions of users search for `iphone charger`, that suggestion should likely outrank a rare but lexically valid completion. + +### 5.6 Recent Searches and Personalization + +Autocomplete often mixes multiple sources: + +- global popular suggestions +- user's own recent searches +- session context +- personalized entities such as repos, docs, contacts, or previous products viewed + +GitHub-like enterprise search or SaaS admin dashboards often use personalization heavily because each user's accessible universe is different. + +### 5.7 Typo Tolerance Basics + +Users make mistakes while typing. 
Production autocomplete systems usually include typo handling such as: + +- edit-distance based correction +- keyboard-neighbor heuristics +- common misspelling dictionaries +- phonetic or transliteration support in some markets + +The challenge is latency. + +Autocomplete does not have much time budget, so typo tolerance must be efficient and bounded. + +### 5.8 Caching Strategies + +Autocomplete traffic is extremely cache-friendly because prefixes repeat heavily. + +Common patterns: + +- cache hot prefixes in memory +- use CDN or edge caching for anonymous popular suggestions where acceptable +- keep top-k suggestions per prefix precomputed +- debounce client requests to avoid one request per keystroke + +### 5.9 Large-Scale Production Considerations + +Large autocomplete systems often need to handle: + +- huge prefix skew on popular queries +- language and locale differences +- abuse or bot traffic +- personalized suggestions that reduce cacheability +- freshness for trending terms +- safe filtering of prohibited or low-quality suggestions + +A common design is hybrid: + +- static or precomputed prefix data for speed +- online popularity updates for freshness +- user history overlay for personalization + +### 5.10 Common Mistakes + +- sending requests on every keystroke without debounce +- returning lexically valid suggestions with poor user value +- ignoring abuse and suggestion poisoning +- making autocomplete depend on expensive full ranking pipelines + +## 6. Filtering + +### 6.1 What Filtering Is + +Filtering narrows results using structured constraints. + +Examples: + +- price between 50 and 100 +- remote jobs only +- category = laptops +- flights with one stop or fewer +- issues labeled `bug` +- only repositories the user can access + +Filtering is not a side feature. In many business systems it is central. + +For some search experiences, the user query is weak and filters carry most of the actual intent. 
+ +### 6.2 Structured Filters + +Structured filters work over fields whose semantics are known. + +Examples: + +- numeric ranges +- enums or categories +- dates +- geo constraints +- booleans such as in-stock only + +These differ from text ranking because they are usually precise constraints rather than fuzzy matches. + +### 6.3 Faceted Search + +Facets show result breakdowns by filter values. + +Example in e-commerce: + +- brand counts +- color counts +- price buckets +- availability counts + +Why facets matter: + +- they help users refine large result sets +- they reveal the shape of the catalog +- they guide discovery without requiring new queries + +The challenge is performance. Facet counts can be expensive, especially when the base query is broad and filters are combined interactively. + +### 6.4 Range Filters + +Range filters are common for: + +- price +- salary +- rating +- departure time +- file size +- timestamps + +They often need specialized data structures or optimized field storage because arbitrary numeric range scans over huge result sets can be expensive. + +### 6.5 Filtering and Ranking Interaction + +Filtering and ranking interact more than beginners expect. + +If you filter too early, you may remove items that could have been relevant under softer criteria. + +If you filter too late, you may waste ranking work on documents that cannot be shown. 
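A small sketch of the "too late" case, using an in-stock constraint on a made-up candidate list. The scores and field names are assumptions; the point is only the difference in what the user ends up seeing.

```python
# Hypothetical ranked candidates for one query.
ITEMS = [
    {"id": "p1", "score": 0.95, "in_stock": False},
    {"id": "p2", "score": 0.90, "in_stock": False},
    {"id": "p3", "score": 0.80, "in_stock": True},
    {"id": "p4", "score": 0.60, "in_stock": True},
    {"id": "p5", "score": 0.50, "in_stock": True},
]

def top_k(items, k):
    return sorted(items, key=lambda item: item["score"], reverse=True)[:k]

def post_filter(items, k):
    """Rank first, then drop ineligible items from the already truncated page."""
    return [item["id"] for item in top_k(items, k) if item["in_stock"]]

def pre_filter(items, k):
    """Drop ineligible items first, then rank what is left."""
    eligible = [item for item in items if item["in_stock"]]
    return [item["id"] for item in top_k(eligible, k)]

late = post_filter(ITEMS, k=3)
early = pre_filter(ITEMS, k=3)
```

With a page size of three, filtering after ranking leaves a single result because two of the top three were never eligible, while filtering first fills the page. Systems that must post-filter usually over-fetch to compensate, which is the wasted ranking work the text describes.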
+ +| Strategy | Good for | Risk | +|---|---|---| +| Pre-filter before ranking | Hard constraints such as ACLs, category limits, geography, inventory | Can shrink candidate set too aggressively if constraints are loose or noisy | +| Post-filter after retrieval or early ranking | Soft presentation rules, some UI-level shaping | Wastes work and may leave too few valid final results | + +Common practice: + +- apply hard constraints early +- apply softer business shaping later + +### 6.6 Filter Performance Optimization + +Common techniques: + +- store filterable fields in efficient columnar or doc-values structures +- build bitmap or bitset representations for high-volume facets +- cache frequent filter combinations +- precompute common facet counts where practical +- use approximate counts if exactness is not required for UX + +ACL filtering is especially important. If the user should not see a document, that filter should behave as a hard constraint and should be efficient. + +### 6.7 Real-World Examples + +E-commerce: + +- category, brand, price, availability, shipping speed, seller, ratings + +Job boards: + +- location, remote, salary range, experience level, company size, visa support + +Travel search: + +- stops, departure window, airline, baggage, refund policy, hotel rating, neighborhood + +These products often spend as much engineering effort on filter performance and facet correctness as on keyword matching. + +### 6.8 Common Failure Cases + +- facet counts computed on stale or mismatched indexes +- filters applied after ranking causing irrelevant or empty pages +- high-cardinality filters destroying cache hit rate +- permission filtering bolted on late and leaking data + +## 7. Ranking + +### 7.1 Why Ranking Matters + +Retrieval answers "what could match". Ranking answers "what should be shown first". + +Ranking matters because users rarely inspect many results. 
+ +If the best result is not in the first few positions, the system feels wrong even if it technically retrieved the right document somewhere deeper in the list. + +### 7.2 Why Ranking Is Usually Multi-Stage + +Ranking is usually multi-stage because expensive models and business logic cannot run on the whole corpus. + +Typical shape: + +1. retrieve a broad candidate set cheaply +2. apply lightweight ranking to reduce candidates +3. apply heavier ranking on a smaller set +4. apply final business rules, diversity, sponsorship, and presentation logic + +```mermaid +flowchart LR + Q[Query / Context] --> RET[Retrieval: thousands] + RET --> L1[Stage 1 Ranker: hundreds] + L1 --> L2[Stage 2 Ranker: tens] + L2 --> BR[Business Rules / Diversity / Ads] + BR --> UI[Final Ordered Results] +``` + +### 7.3 Retrieval vs Ranking + +This separation is critical. + +Retrieval is optimized for recall and speed. + +Ranking is optimized for precision and utility. + +If retrieval misses a relevant item entirely, ranking cannot recover it. + +If retrieval returns too many weak candidates, ranking becomes expensive and noisy. + +### 7.4 Relevance Ranking vs Business Ranking + +Production ranking is rarely pure relevance. 
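To make that concrete, here is a sketch of a two-stage ranker: a cheap lexical cut followed by a blended score. The weights, field names, and candidate values are illustrative assumptions, not recommended settings.

```python
# Hypothetical weights for the second-stage blend.
WEIGHTS = {"text": 0.6, "in_stock": 0.3, "rating": 0.1}

CANDIDATES = [
    {"id": "a", "text": 0.9, "in_stock": False, "rating": 4.0},
    {"id": "b", "text": 0.8, "in_stock": True, "rating": 4.5},
    {"id": "c", "text": 0.7, "in_stock": True, "rating": 3.0},
    {"id": "d", "text": 0.2, "in_stock": True, "rating": 5.0},
]

def stage1(candidates, keep):
    """Cheap pass: keep the best lexical matches only."""
    return sorted(candidates, key=lambda c: c["text"], reverse=True)[:keep]

def blended_score(candidate):
    """Heavier pass: mix text relevance with business and quality signals."""
    return (
        WEIGHTS["text"] * candidate["text"]
        + WEIGHTS["in_stock"] * (1.0 if candidate["in_stock"] else 0.0)
        + WEIGHTS["rating"] * candidate["rating"] / 5.0
    )

def rank(candidates, keep=3, k=2):
    survivors = stage1(candidates, keep)
    ordered = sorted(survivors, key=blended_score, reverse=True)
    return [c["id"] for c in ordered[:k]]

final_order = rank(CANDIDATES)
```

Item `a` is the best text match but is out of stock, so it falls behind `b` and `c` after blending. Item `d` has the best rating but never reaches the second stage, which illustrates the point from 7.3 that later stages cannot recover what earlier stages drop.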
+ +In e-commerce, a result order may consider: + +- text relevance +- inventory availability +- margin or business priority +- fulfillment speed +- review quality +- return rate +- seller trust +- sponsored placements + +In job search: + +- lexical match +- application likelihood +- compensation quality +- recency +- employer quality +- geographic fit + +In a SaaS global search: + +- text relevance +- recency of the document +- document type priority +- ownership or collaboration strength + +### 7.5 Common Ranking Signals + +Signals often include: + +- lexical relevance: term match, phrase match, field boosts +- popularity: clicks, purchases, views, stars, installs +- freshness: newer items may deserve higher weight in some surfaces +- engagement: dwell time, completion, watch time, saves, shares +- quality: seller quality, document quality, content safety +- trust: verified sources, low spam risk, low abuse signals +- context: location, device, language, current session intent + +### 7.6 Diversity Constraints + +Blindly ranking by one score can produce repetitive or unhealthy outputs. + +Examples: + +- ten nearly identical products from one seller +- a feed dominated by one creator +- a job result page filled with duplicates from the same company + +Diversity rules improve the experience by spreading exposure across: + +- categories +- sellers +- creators +- content types +- freshness buckets + +This is especially important for feeds and discovery systems. + +### 7.7 Sponsored Content Considerations + +Sponsored results complicate ranking because monetization and relevance must coexist. + +Strong systems separate: + +- auction or eligibility logic +- relevance constraints +- sponsored placement policies +- disclosure and compliance requirements + +Bad design either destroys relevance or leaves too much money on the table. 
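One way to picture that separation is a blend step where eligibility, auction order, and placement are distinct decisions. This is a hypothetical sketch: the relevance floor, slot positions, and field names are assumptions, and real ad systems are far more involved.

```python
def blend(organic, sponsored, slots=(1,), min_relevance=0.5):
    """Insert sponsored items into fixed slots, labeled, if they pass a relevance floor."""
    # Relevance constraint: a high bid cannot buy placement for an irrelevant ad.
    eligible = [ad for ad in sponsored if ad["relevance"] >= min_relevance]
    # Auction logic: among eligible ads, the highest bid wins the first slot.
    winners = sorted(eligible, key=lambda ad: ad["bid"], reverse=True)
    page = [{"id": doc_id, "sponsored": False} for doc_id in organic]
    # Placement policy: fixed positions, filled in ascending order.
    for slot, ad in zip(sorted(slots), winners):
        page.insert(min(slot, len(page)), {"id": ad["id"], "sponsored": True})
    return page

organic = ["a", "b", "c"]
ads = [
    {"id": "ad1", "bid": 5.0, "relevance": 0.2},  # highest bid, but off-topic
    {"id": "ad2", "bid": 2.0, "relevance": 0.8},
]
page = blend(organic, ads)
```

The off-topic ad loses despite the higher bid, the organic order is untouched, and every item carries an explicit `sponsored` flag that the UI can use for disclosure.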
+ +### 7.8 Failure Cases + +- optimizing only CTR and creating clickbait +- overusing popularity so incumbents dominate forever +- boosting freshness too much and burying authoritative content +- letting one business rule overwhelm all relevance signals +- failing to monitor ranking regressions after model changes + +### 7.9 Best Practices + +- make ranking multi-stage +- separate hard eligibility from soft scoring +- log ranking features and decisions for debugging +- evaluate quality with offline metrics and online experiments +- protect the system from feedback loops that only reward already-popular items + +## 8. Relevance Scoring + +### 8.1 What Relevance Scoring Means + +Relevance scoring is how the system estimates how well a result matches the user's need. + +There is no single universal score. Real systems combine multiple signals. + +### 8.2 TF-IDF Basics + +TF-IDF is one of the classic ideas in lexical search. + +Intuition: + +- terms that appear often in a document may be important to that document +- terms that appear in many documents are less discriminative across the corpus + +One simple form is: + +$$ +TF\text{-}IDF(t,d) = \text{TF}(t,d) \cdot \log\left(\frac{N}{\text{DF}(t)}\right) +$$ + +Where: + +- $\text{TF}(t,d)$ is term frequency of term $t$ in document $d$ +- $\text{DF}(t)$ is the number of documents containing $t$ +- $N$ is the total number of documents + +Why it matters: + +Rare but present terms are usually more informative than extremely common terms. + +### 8.3 BM25 Basics + +BM25 is a practical ranking function widely used in lexical search systems. 
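A minimal sketch of the per-term BM25 contribution next to the TF-IDF form from 8.2. It assumes the commonly used defaults `k1 = 1.2` and `b = 0.75` and takes IDF as an input, since engines differ in the exact IDF variant they use.

```python
import math

def tfidf(tf, n_docs, df):
    """TF-IDF as written in 8.2: grows linearly with term frequency."""
    return tf * math.log(n_docs / df)

def bm25_term(tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """Contribution of one query term to a document's BM25 score."""
    length_norm = 1 - b + b * doc_len / avgdl
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Same document length as the corpus average, IDF fixed at 1.0 for readability.
once = bm25_term(tf=1, doc_len=100, avgdl=100, idf=1.0)
many = bm25_term(tf=100, doc_len=100, avgdl=100, idf=1.0)
long_doc = bm25_term(tf=1, doc_len=200, avgdl=100, idf=1.0)
ceiling = 1.0 * (1.2 + 1)  # idf * (k1 + 1): the limit as tf grows
```

With these inputs, one hundred occurrences score higher than one but stay under the ceiling of `idf * (k1 + 1)`, and the same single occurrence counts for less in a document twice the average length. TF-IDF in the simple form above has neither property, which is the practical difference BM25 is known for.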
+ +BM25 improves on simpler TF-IDF variants by handling: + +- term frequency saturation +- document length normalization + +A common form is: + +$$ +BM25(q,d)=\sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d)(k_1+1)}{f(t,d)+k_1\left(1-b+b\cdot \frac{|d|}{\text{avgdl}}\right)} +$$ + +Where: + +- $f(t,d)$ is the frequency of term $t$ in document $d$ +- $\text{IDF}(t)$ is an inverse document frequency weight for term $t$ +- $|d|$ is the length of document $d$ and $\text{avgdl}$ is the average document length in the corpus +- $k_1$ controls how quickly term frequency saturates and $b$ controls the strength of length normalization (often set near 1.2 and 0.75) + +You do not need to memorize the formula in interviews, but you should know the intuition: + +- more occurrences help, but not linearly forever +- long documents should not win just because they contain more words + +### 8.4 Semantic Relevance Basics + +Lexical matching is powerful but limited. + +Semantic relevance tries to capture meaning, not just exact token overlap. + +Examples: + +- `sofa` matching `couch` +- `software engineer` matching `backend developer` +- a support query matching a knowledge base article with similar meaning but different wording + +A common production pattern today is hybrid retrieval: + +- lexical retrieval for precision and exact matching +- semantic retrieval or re-ranking for meaning-based matches + +Why hybrid is common: + +- lexical search handles exact identifiers, codes, names, and rare terms well +- semantic methods handle paraphrases and intent better +- using both reduces the weaknesses of either alone + +### 8.5 Behavioral Signals + +Relevance is not only about text. + +Behavioral signals often matter: + +- clicks +- dwell time +- add-to-cart rate +- purchase rate +- save rate +- watch completion +- query reformulations after a click + +These signals help the system learn which results users actually found useful. + +But they are noisy. A click does not always mean the user was satisfied. + +### 8.6 Quality, Trust, and Spam Prevention + +High relevance is not enough if the content is low quality or abusive. 
+ +Production systems often include additional scoring dimensions: + +- content quality scores +- seller or source trust +- spam or fraud risk +- policy safety scores +- freshness or staleness penalties + +Examples: + +- Amazon-like marketplaces must suppress spammy or low-trust listings +- GitHub-like search may downrank spam repositories or abusive content +- YouTube-like platforms need safety and trust constraints around recommendations + +### 8.7 Balancing Relevance vs Business Goals + +"Best result" in production usually means the best result under multiple objectives. + +These can include: + +- lexical match quality +- user satisfaction +- engagement +- monetization +- content safety +- fairness or exposure goals +- freshness + +This is why ranking discussions often become multi-objective optimization discussions. + +### 8.8 Interview Framing + +If asked how a search engine decides the best result, a strong answer is: + +"It usually starts with lexical or hybrid retrieval to get candidates, then uses a relevance score combining term signals such as BM25, field boosts, freshness, popularity, behavioral feedback, quality signals, and business constraints. The final order is almost never based on one score alone." + +## 9. Recommendation System Overview + +### 9.1 What Recommendation Systems Are + +Recommendation systems choose what to show users when the user did not explicitly ask for a specific query. + +They power: + +- Netflix homepages +- YouTube next videos +- TikTok For You feeds +- Amazon "Customers also bought" +- Instagram and X home feeds +- SaaS dashboards showing suggested docs, tasks, or entities + +### 9.2 Why They Exist + +The internet has too much content. Most users will not search for everything they could care about. + +Recommendation systems help with discovery by predicting relevance from behavior, similarity, context, and popularity. + +Search solves explicit intent. Recommendation solves hidden intent. 
+ +### 9.3 Online vs Offline Recommendation + +| Dimension | Offline | Online | +|---|---|---| +| When computed | Batch or scheduled jobs | At request time or near request time | +| Good for | Heavy model training, embeddings, broad candidate pools | Fresh context, session adaptation, final ranking | +| Tradeoff | Efficient at scale but stale | Fresh but latency-sensitive | + +Most production systems use both. + +Example: + +- offline jobs compute user embeddings, item embeddings, similarity graphs, creator clusters, trending statistics +- online systems use current session behavior, freshness, and context to rank a page right now + +### 9.4 Cold Start Problem + +Cold start means the system has too little data. + +Two forms: + +- new user: little or no history +- new item or creator: little or no interaction data + +Common mitigations: + +- use popularity and trending signals +- use content-based features +- use onboarding preferences +- use location, language, device, and session context +- provide exploration slots so new items can earn data + +TikTok-like and YouTube-like systems care deeply about new content discovery. If only established content wins, the ecosystem becomes stale. + +### 9.5 Feedback Loops + +Recommendation systems influence behavior, so they create feedback loops. + +If the system shows content, it gets more interaction data on that content, which can cause it to rank even higher. + +This can be useful, but it can also create runaway popularity bias. + +Examples of risks: + +- rich-get-richer exposure +- narrow content bubbles +- overfitting to clickbait +- suppressing new creators + +### 9.6 Exploration vs Exploitation + +Recommendation systems constantly balance: + +- exploitation: show what is most likely to perform well now +- exploration: show some uncertain or new content to learn more and avoid stagnation + +If you only exploit, the system becomes conservative and may miss better items. 
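One simple way to picture this balance is an epsilon-greedy slot filler, sketched below. It is only one of many possible exploration policies and is not attributed to any specific platform; `fill_slots`, `epsilon`, and the item names are hypothetical.

```python
import random


def fill_slots(ranked: list[str], explore_pool: list[str], n_slots: int,
               epsilon: float, rng: random.Random) -> list[str]:
    """Fill each slot with an exploration item with probability epsilon."""
    already_ranked = set(ranked)
    exploit = list(ranked)  # best-known items, highest score first
    explore = [x for x in explore_pool if x not in already_ranked]
    rng.shuffle(explore)  # uncertain items have no meaningful order yet
    page: list[str] = []
    for _ in range(n_slots):
        if explore and (not exploit or rng.random() < epsilon):
            page.append(explore.pop())
        elif exploit:
            page.append(exploit.pop(0))
        else:
            break  # both pools are exhausted
    return page


rng = random.Random(0)
ranked = ["r1", "r2", "r3", "r4"]
new_items = ["n1", "n2", "r1"]
only_exploit = fill_slots(ranked, new_items, 3, epsilon=0.0, rng=rng)
only_explore = fill_slots(ranked, new_items, 4, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the page is just the top of the ranked list, and with `epsilon=1.0` every slot goes to new items until that pool runs out. In this sketch `epsilon` is the single tuning knob between the two extremes.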
+ +If you explore too much, user experience degrades. + +### 9.7 Engagement vs Quality Tradeoffs + +Not every engagement signal maps to long-term product quality. + +High click-through or short-term watch time may conflict with: + +- satisfaction +- trust +- creator ecosystem health +- safety +- retention quality + +This is a central real-world discussion in recommendation systems. + +### 9.8 Example Patterns + +Netflix: + +- heavy personalization by row and by title ranking +- strong use of offline signals plus contextual ranking + +YouTube: + +- massive candidate generation followed by multi-stage ranking +- strong importance of watch time, satisfaction, freshness, safety + +TikTok: + +- short-term session signals matter heavily +- content and user embeddings are critical for rapid personalization + +Amazon: + +- recommendations mix collaborative signals, co-purchase graphs, browse history, price sensitivity, and business objectives + +## 10. Ranking in Recommendation Systems + +### 10.1 Candidate Ranking + +Recommendation ranking usually happens after candidate generation. The candidate pool may already be hundreds or thousands of items rather than millions. + +The ranker then decides ordering using user, item, and context features. + +### 10.2 Multi-Stage Ranking + +The same multi-stage logic from search applies, but recommendation models are often more feature-heavy. + +Typical pipeline: + +1. generate candidates from many sources +2. apply a lightweight ranker to remove obvious weak candidates +3. apply a heavier model to a smaller candidate set +4. 
apply diversity, policy, and business constraints + +```mermaid +flowchart LR + CTX[User + Session + Context] --> CG[Candidate Sources] + CG --> CANDS[Candidate Pool] + CANDS --> FAST[Fast Ranker] + FAST --> HEAVY[Heavy Ranker] + HEAVY --> MIX[Diversity / Fairness / Ads / Policy] + MIX --> FEED[Final Feed] +``` + +### 10.3 Lightweight vs Heavy Rankers + +Fast rankers may use: + +- simple feature transforms +- linear models +- shallow trees +- small neural models + +Heavy rankers may use: + +- deeper neural models +- sequence models over session history +- expensive cross-feature interactions +- richer content understanding features + +The reason for multiple rankers is simple: the most accurate model is often too expensive to run on too many candidates. + +### 10.4 ML Ranking Basics + +ML rankers usually learn from historical interactions. + +Common labels or targets: + +- click +- long dwell time +- like or save +- watch completion +- add to cart +- purchase +- hide or negative feedback + +The hardest part is not fitting a model. It is defining the right objective and handling bias in logged data. + +### 10.5 Online Ranking Decisions + +At request time, the system may incorporate: + +- current session signals +- latest follows or interactions +- freshness windows +- user device and network conditions +- time of day or location +- safety or rate-limit decisions + +This is why recommendation systems are rarely fully precomputed. + +### 10.6 Delayed Feedback Challenges + +Some outcomes arrive late. + +Examples: + +- purchases happen long after an impression +- subscription retention takes days or weeks +- satisfaction surveys are sparse + +This creates training and evaluation problems because immediate clicks are easy to measure, but long-term satisfaction is harder. 
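A common way to cope with late outcomes is to label impressions with an attribution window. The sketch below assumes hypothetical `Impression` and `Purchase` records and a fixed window, and it treats impressions whose window is still open as pending instead of negative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Impression:
    user_id: str
    item_id: str
    shown_at: datetime


@dataclass(frozen=True)
class Purchase:
    user_id: str
    item_id: str
    bought_at: datetime


def label_impressions(impressions: list[Impression], purchases: list[Purchase],
                      window: timedelta,
                      now: datetime) -> list[tuple[Impression, Optional[int]]]:
    """Label each impression 1 (converted), 0 (did not), or None (pending)."""
    bought: dict[tuple[str, str], list[datetime]] = {}
    for p in purchases:
        bought.setdefault((p.user_id, p.item_id), []).append(p.bought_at)

    labeled = []
    for imp in impressions:
        deadline = imp.shown_at + window
        times = bought.get((imp.user_id, imp.item_id), [])
        if any(imp.shown_at <= t <= deadline for t in times):
            label: Optional[int] = 1
        elif now < deadline:
            label = None  # window still open: too early to call it a negative
        else:
            label = 0
        labeled.append((imp, label))
    return labeled


t0 = datetime(2024, 1, 1)
impressions = [
    Impression("u1", "shoes", t0),
    Impression("u2", "shoes", t0),
    Impression("u3", "shoes", t0 + timedelta(days=8)),
]
purchases = [Purchase("u1", "shoes", t0 + timedelta(days=2))]
labels = [label for _, label in label_impressions(
    impressions, purchases, window=timedelta(days=7), now=t0 + timedelta(days=10))]
```

Dropping or down-weighting the pending rows avoids training on false negatives, at the cost of making the freshest data unusable for a while.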
+ +### 10.7 Fairness and Creator Fairness + +Real platforms often need fairness constraints such as: + +- not letting one creator dominate all slots +- giving new creators a chance to gather signal +- balancing exposure across categories or sellers +- avoiding discrimination in jobs, housing, lending, or other regulated domains + +These are not only ethics topics. They are product-health topics. + +### 10.8 Freshness vs Relevance + +Older content may have stronger engagement history. Newer content may be more timely. + +Different products choose differently: + +- news and social updates care heavily about freshness +- evergreen education or documentation may prioritize authority over recency +- commerce often needs a mix of demand history and current availability + +## 11. Fanout + +### 11.1 What Fanout Means + +Fanout is the process of distributing content references to the users who may see them. + +This is most commonly discussed for social feeds. + +If a user posts something, how do followers get it into their home timelines? + +### 11.2 Fanout-on-Write + +In fanout-on-write, when a user creates a post, the system pushes that post reference into follower timelines immediately or soon after write time. + +Why it exists: + +- fast feed reads for ordinary users +- precomputed per-user timelines + +Tradeoff: + +- very expensive for users with huge follower counts +- write amplification can be enormous + +### 11.3 Fanout-on-Read + +In fanout-on-read, when a user opens the app, the system fetches posts from followed accounts and constructs the feed at read time. + +Why it exists: + +- avoids massive write amplification +- handles high-fanout authors better + +Tradeoff: + +- more expensive reads +- more complex low-latency ranking at request time + +### 11.4 Hybrid Models + +Real systems often use hybrid approaches. 
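One way to picture a hybrid is the in-memory sketch below. It assumes a hypothetical `CELEBRITY_THRESHOLD` (real thresholds would be far larger) and uses plain dictionaries as stand-ins for the timeline store and the post store.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, tiny so the example is readable

followers = defaultdict(set)         # author -> users who follow them
following = defaultdict(set)         # user -> authors they follow
timelines = defaultdict(list)        # user -> pushed (timestamp, post_id) entries
posts_by_author = defaultdict(list)  # author -> (timestamp, post_id)


def follow(user: str, author: str) -> None:
    followers[author].add(user)
    following[user].add(author)


def is_celebrity(author: str) -> bool:
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def publish(author: str, post_id: str, ts: int) -> None:
    posts_by_author[author].append((ts, post_id))
    if not is_celebrity(author):
        # Fanout-on-write: push a reference into every follower timeline.
        for user in followers[author]:
            timelines[user].append((ts, post_id))


def read_feed(user: str, limit: int = 10) -> list[str]:
    # Start from the precomputed timeline, then pull celebrity posts at read
    # time. A set removes duplicates if an author crossed the threshold later.
    entries = set(timelines[user])
    for author in following[user]:
        if is_celebrity(author):
            entries.update(posts_by_author[author])
    return [post_id for _, post_id in sorted(entries, reverse=True)[:limit]]


for u in ("u1", "u2"):
    follow(u, "alice")
for u in ("u1", "u2", "u3"):
    follow(u, "star")

publish("alice", "p1", ts=1)  # pushed to u1 and u2
publish("star", "p2", ts=2)   # stored once, merged at read time
```

Here `alice` costs two timeline writes per post, while `star` costs one write regardless of follower count and shifts the work to each read.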
+ +Common pattern: + +- ordinary users: fanout-on-write +- celebrity or mega-scale accounts: fanout-on-read or special handling + +This is the classic celebrity problem. + +### 11.5 Fanout-on-Write vs Fanout-on-Read + +| Dimension | Fanout-on-Write | Fanout-on-Read | +|---|---|---| +| Write cost | High | Lower | +| Read cost | Lower | Higher | +| Good for | Many readers with modest graph sizes | Large fanout creators and flexible ranking | +| Freshness control | Good if timelines update quickly | Good if reads fetch current data | +| Complexity | Simpler reads, harder writes | Harder reads, simpler writes | + +### 11.6 Social Feed Architecture Example + +```mermaid +flowchart LR + P[Post Created] --> DIST[Distribution Service] + DIST --> FW[Write to Follower Timelines] + DIST --> HOT[Mark Celebrity Posts for Read-Time Fetch] + U[User Opens Feed] --> FEED[Feed Service] + FEED --> TL[(Timeline Cache)] + FEED --> HOT + TL --> RANK[Rank + Merge] + HOT --> RANK + RANK --> OUT[Feed Page] +``` + +### 11.7 Cache Invalidation Challenges + +Feed systems must handle: + +- deleted posts +- blocked users +- privacy changes +- edited content +- ranking model changes + +If timelines are heavily cached or precomputed, invalidation becomes hard. + +### 11.8 Common Failure Cases + +- pushing celebrity posts to millions of timelines and overwhelming storage or queues +- expensive read-time joins across too many sources +- duplicated items because of retries or hybrid merge bugs +- stale deleted content because cache invalidation lagged + +### 11.9 Interview Angle + +If asked to design a social feed, always discuss fanout strategy. That is one of the main architectural decisions. + +## 12. Candidate Generation + +### 12.1 Why Candidate Generation Exists + +Recommendation ranking does not start from the whole corpus. It starts from a narrowed candidate pool. + +Why? + +Because ranking millions or billions of items per request is impossible. 
+ +Candidate generation reduces the problem from "everything" to "a few hundred or thousand promising items". + +### 12.2 Common Candidate Sources + +Production systems often combine many candidate sources: + +- collaborative filtering +- content-based similarity +- social graph neighbors +- trending or popular items +- creator-follow graph +- embedding nearest neighbors +- recently interacted entities +- business-curated pools + +The union of these sources becomes the candidate pool. + +### 12.3 Collaborative Filtering Basics + +Collaborative filtering uses behavioral similarity. + +Core intuition: + +- users who behaved similarly in the past may like similar things in the future +- items that co-occur in behavior may be related + +Examples: + +- users who bought product A often buy product B +- users who watched show X often watch show Y + +Amazon-style "customers also bought" is the classic mental model. + +### 12.4 Content-Based Filtering Basics + +Content-based filtering uses item attributes. + +Examples: + +- recommend jobs similar to jobs a user previously clicked +- recommend articles with similar topics or embeddings +- recommend videos with related audio, captions, or visual features + +This is useful for cold start because it does not depend entirely on historical interaction volume. + +### 12.5 Graph-Based Candidates + +Graph-based candidates come from relationships: + +- who the user follows +- who similar users follow +- authors frequently co-engaged by the same audience +- co-starred repos or linked documents + +Graph candidates are common in social apps, GitHub-like collaboration systems, and commerce systems with co-view or co-purchase graphs. + +### 12.6 Trending Candidates + +Trending pools capture global or local momentum. + +Why they matter: + +- they solve some cold start problems +- they inject freshness +- they expose popular content without deep personalization + +But trending alone is not personalization. 
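The co-occurrence idea from 12.3 and the trending fallback from 12.6 can be combined in a few lines. The baskets below are invented sample data, and `also_bought` is a hypothetical helper, not any retailer's actual algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations

orders = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "mouse"},
    {"case", "charger"},
]

co_counts: dict[str, Counter] = defaultdict(Counter)  # item -> co-purchase counts
popularity: Counter = Counter()                       # crude "trending" signal

for basket in orders:
    popularity.update(basket)
    for a, b in combinations(sorted(basket), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1


def also_bought(item: str, k: int = 3) -> list[str]:
    """Co-purchase neighbors first, then popular items as backfill."""
    neighbors = [other for other, _ in co_counts.get(item, Counter()).most_common(k)]
    for other, _ in popularity.most_common():
        if len(neighbors) >= k:
            break
        if other != item and other not in neighbors:
            neighbors.append(other)
    return neighbors


phone_recs = also_bought("phone", k=2)
laptop_recs = also_bought("laptop", k=2)
unknown_recs = also_bought("brand-new-item", k=2)
```

An item with no purchase history still gets candidates from the popularity pool, which is the cold start role of trending described above. Those backfilled candidates are not personalized in any way.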
+ +### 12.7 Embedding Retrieval and ANN Basics + +Embeddings map users and items into vector spaces where similar concepts are close together. + +This enables nearest-neighbor retrieval: + +- find items near the user's embedding +- find items near the current session embedding +- find items similar to the current content being viewed + +Exact nearest-neighbor search can be expensive at scale, so many systems use approximate nearest neighbor, or ANN, techniques. + +High-level idea: + +- use structures that avoid comparing against every vector +- trade a little exactness for large speed gains + +This is often good enough for candidate generation because ranking happens later. + +### 12.8 Why Ranking Starts After Candidate Generation + +Candidate generation is about recall and breadth. + +Ranking is about precision and ordering. + +If you skip candidate generation, ranking is too expensive. + +If candidate generation is poor, ranking quality is capped. + +This boundary is one of the most important concepts in modern recommendation systems. + +## 13. Personalization + +### 13.1 What Personalization Means + +Personalization means the same corpus produces different results for different users. + +Examples: + +- two users searching the same marketplace query may see different ranking orders +- two Netflix users get different homepages +- two SaaS users searching the same global search term see different accessible docs and different likely hits + +### 13.2 User Profiles + +A personalization profile may include: + +- long-term interests +- recent interactions +- follows, subscriptions, or teams +- geographic preferences +- device/network patterns +- explicit preferences +- negative feedback and muted topics + +Profiles are often built from both online and offline signals. 
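To connect profiles with the embedding retrieval idea from 12.7, the sketch below derives a user vector by averaging the vectors of items in the user's history, then finds the nearest items by cosine similarity. The three-dimensional vectors and item names are invented, and the search is exact brute force; at real scale an ANN index would stand in for `nearest_items`.

```python
import math

item_vecs = {
    "redis-intro":   [0.9, 0.1, 0.0],
    "kafka-basics":  [0.8, 0.3, 0.1],
    "pasta-recipes": [0.0, 0.1, 0.9],
    "bread-baking":  [0.1, 0.0, 0.8],
}


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def user_embedding(history: list[str]) -> list[float]:
    """Average the embeddings of items the user interacted with."""
    if not history:
        raise ValueError("cold start: no history to build an embedding from")
    vecs = [item_vecs[item] for item in history]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def nearest_items(query: list[float], k: int, exclude: frozenset = frozenset()) -> list[str]:
    """Exact top-k by cosine similarity; compares against every item."""
    scored = [(cosine(query, vec), item)
              for item, vec in item_vecs.items() if item not in exclude]
    return [item for _, item in sorted(scored, reverse=True)[:k]]


profile = user_embedding(["redis-intro"])
recs = nearest_items(profile, k=1, exclude=frozenset({"redis-intro"}))
```

The empty-history case raises on purpose: it is the cold start situation from 9.4, where other signals such as popularity or onboarding preferences have to take over.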
+ +### 13.3 Implicit vs Explicit Feedback + +Explicit feedback: + +- likes, ratings, follows, saves, thumbs up, manual preferences + +Implicit feedback: + +- clicks, dwell time, purchases, watch completion, skips, repeats, hides + +Implicit data is abundant but noisy. Explicit data is sparse but clearer. + +### 13.4 Long-Term vs Short-Term Interests + +Long-term interests represent stable tastes. + +Short-term interests capture immediate intent. + +Example: + +- a user generally likes backend engineering content +- this week the user is specifically searching for Redis and search systems + +TikTok-like feeds often weight short-term session intent heavily. Netflix-like systems also care about context, but long-term taste matters more for broad discovery. + +### 13.5 Contextual Ranking + +Ranking may depend on context such as: + +- time of day +- current page or query +- device type +- network speed +- location +- current session sequence + +Example: + +- low-bandwidth users may get lower-bitrate or different video choices +- local services like Uber care strongly about geography and current location + +### 13.6 Privacy Considerations + +Personalization raises privacy questions: + +- how much behavioral data is stored +- how long it is retained +- whether sensitive attributes are inferred +- whether users can opt out or reset personalization +- whether data is used across products or only within one surface + +Production systems need data minimization, retention controls, access controls, and often region-specific compliance behavior. + +### 13.7 Explainability Basics + +Explainability means giving a human-understandable reason for some results. + +Examples: + +- "Because you watched..." +- "Suggested because you follow..." +- "Related to your recent searches" + +Explainability is useful for trust, debugging, and product feedback even if the underlying model is more complex than the explanation suggests. 
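One lightweight way to produce such explanations is to carry the candidate source along with each item and map it to a template. The source names and the `explain` helper below are hypothetical; the wording of the templates follows the examples above.

```python
REASON_TEMPLATES = {
    "watched_similar": "Because you watched {anchor}",
    "follow_graph": "Suggested because you follow {anchor}",
    "recent_search": "Related to your recent searches",
}
DEFAULT_REASON = "Recommended for you"


def explain(candidate: dict) -> str:
    """Pick a human-readable reason from the candidate's provenance."""
    template = REASON_TEMPLATES.get(candidate.get("source"), DEFAULT_REASON)
    anchor = candidate.get("anchor")
    if "{anchor}" in template and not anchor:
        # A template that needs an anchor but has none falls back safely.
        return DEFAULT_REASON
    return template.format(anchor=anchor)


reasons = [
    explain({"item": "show-42", "source": "watched_similar", "anchor": "Dark"}),
    explain({"item": "doc-7", "source": "recent_search"}),
    explain({"item": "vid-9", "source": "embedding_ann"}),
]
```

The explanation reflects where the candidate came from, not everything the ranker considered, which is why it stays simpler than the underlying model.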
+ +### 13.8 Common Mistakes + +- overfitting to recent clicks and making feeds unstable +- ignoring negative feedback +- overpersonalizing so much that exploration disappears +- using sensitive data carelessly + +## 14. Feed Generation + +### 14.1 What Feed Generation Is + +Feed generation is the process of deciding what appears in a user's home timeline or discovery surface. + +This is one of the hardest system design topics because it combines: + +- graph data +- recommendation ranking +- caching +- fanout strategy +- freshness +- pagination +- abuse controls +- content safety + +### 14.2 Home Feed Architecture + +```mermaid +flowchart LR + U[User Opens App] --> FEED[Feed Service] + FEED --> PROF[Profile / Session Service] + FEED --> SOURCES[Candidate Sources] + SOURCES --> FOLLOW[Following Graph] + SOURCES --> TREND[Trending] + SOURCES --> EMB[Embedding Retrieval] + SOURCES --> CACHE[(Timeline / Candidate Cache)] + SOURCES --> RANK[Ranking Service] + RANK --> HYD[Hydration / Metadata Fetch] + HYD --> PAGE[Paginated Response] +``` + +### 14.3 Cache Layers + +Feed systems may cache: + +- precomputed timelines +- candidate pools +- ranking features +- hydrated entity metadata +- first page responses for very hot users or anonymous feeds + +Caching helps, but cache invalidation is hard because feed contents change frequently and are personalized. + +### 14.4 Pagination Challenges + +Pagination in feeds is not as simple as offset and limit. + +Problems with offset-based pagination: + +- feed contents change between requests +- inserts at the top shift offsets +- duplicates or gaps appear + +Cursor-based pagination is usually better. + +But even cursor pagination is tricky if the ranking model is highly dynamic. + +### 14.5 Consistency vs Freshness + +Users want fresh content, but highly dynamic feed generation can lead to inconsistent paging and repeated items. 
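The repeated-item problem is easy to reproduce. The sketch below assumes a purely chronological feed where post IDs increase over time, which is a simplification: it shows offset pagination repeating an item after a new post arrives, while a cursor on the last seen ID does not.

```python
from typing import Optional


def page_by_offset(items: list[int], offset: int, limit: int) -> list[int]:
    return items[offset:offset + limit]


def page_by_cursor(items: list[int], before_id: Optional[int], limit: int) -> list[int]:
    """Return items strictly older than the cursor (IDs increase with time)."""
    older = [i for i in items if before_id is None or i < before_id]
    return older[:limit]


feed = [105, 104, 103, 102, 101]  # post IDs, newest first

first_offset = page_by_offset(feed, 0, 2)
first_cursor = page_by_cursor(feed, None, 2)

# A new post lands at the top between the two page requests.
feed = [106] + feed

second_offset = page_by_offset(feed, 2, 2)                 # shifted by the insert
second_cursor = page_by_cursor(feed, first_cursor[-1], 2)  # anchored to last seen ID

repeated = set(first_offset) & set(second_offset)
```

In this run the offset version shows post `104` on both pages, and the cursor version does not. The cursor only stays stable because the ordering key is fixed. Once ranking scores change between requests, the cursor needs a stable snapshot to point into.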
+ +Common compromise: + +- freeze a short-lived ranked window for a session or cursor +- refresh when the user pulls to refresh or a new session begins + +### 14.6 Backfill Strategies + +Backfill means what to show when the natural candidate pool is sparse. + +Examples: + +- new user with few follows +- quiet time period with not enough new content +- strict filters remove many candidates + +Backfill sources may include: + +- trending content +- suggested accounts or topics +- evergreen content +- sponsored content under policy rules + +### 14.7 Ranking at Read Time vs Write Time + +Write-time ranking: + +- rank or partially prepare content as it is distributed +- fast reads +- less flexible when user state changes + +Read-time ranking: + +- more personalized and fresh +- more expensive and latency-sensitive + +Many systems use hybrid approaches: precompute easy parts, rank final candidates at read time. + +### 14.8 Product Styles + +X or Twitter-like following feed: + +- strong graph component +- hybrid fanout patterns +- freshness matters heavily + +Instagram-like home feed: + +- mix of follow graph, engagement prediction, and recommendations +- strong importance of re-ranking and diversity + +TikTok-like For You feed: + +- candidate generation from broad corpus, not only follows +- strong session-based ranking and content understanding + +### 14.9 Common Failure Cases + +- stale caches showing deleted or blocked content +- duplicated items across pages +- expensive read-time ranking melting the service during traffic spikes +- feedback loops making the feed monotonous +- ranking bugs that over-prioritize one creator or content type + +## 15. Object Storage + +### 15.1 What Object Storage Is + +Object storage stores data as objects, each usually accessed by a key within a bucket or namespace. 
+ +An object typically contains: + +- the binary content +- metadata +- a key or name + +Object storage is the default storage layer for large unstructured blobs such as: + +- images +- videos +- PDFs +- archives +- backups +- logs +- user uploads + +### 15.2 Why It Exists + +Databases are good for structured records. Local disks are tied to one machine. Traditional file systems provide hierarchical paths and POSIX-like semantics. + +Object storage exists because internet-scale systems need: + +- massive scale +- high durability +- relatively simple access patterns +- low operational burden per file +- cost-effective storage of large blobs + +### 15.3 Object Storage vs Block Storage vs File Systems + +| Dimension | Object Storage | Block Storage | File System | +|---|---|---|---| +| Interface | Key/object API | Raw blocks attached to machines | Files and directories | +| Typical use | Media, backups, logs, attachments | Databases, VM disks, low-level persistent volumes | Shared files, app files, local hierarchical access | +| Scaling model | Very large namespaces, distributed service | Usually attached volumes per instance or host | Depends on file system implementation | +| Mutation model | Often write whole objects or multipart operations | Fine-grained block updates | File operations with richer semantics | +| Strength | Durability and scale for blobs | Low-level performance control | Familiar file semantics | + +### 15.4 Object Storage vs Database vs Local Disk + +| Dimension | Object Storage | Database | Local Disk | +|---|---|---|---| +| Best for | Large blobs | Structured records and queries | Fast machine-local access | +| Query support | Minimal metadata lookup | Rich queries and indexes | Minimal unless app-managed | +| Durability model | Service-level replication or erasure coding | Depends on DB replication and backup | Depends on host and disk setup | +| Sharing | Easy across services | Structured access only | Tied to machine unless networked | +| Cost 
profile | Usually cheap per GB for large blobs | Higher for blob-heavy usage | Cheap locally but operationally limited | + +### 15.5 Durability Concepts + +Object stores are designed for very high durability. + +Internally, that typically means some combination of: + +- replication across devices or zones +- erasure coding +- background integrity checks +- repair workflows for lost fragments + +Durability is different from availability. + +An object can be highly durable but temporarily unavailable due to network or control-plane issues. + +### 15.6 Scalability Characteristics + +Object stores are designed for: + +- huge object counts +- independent object retrieval +- simple write and read APIs +- high parallelism + +They are not designed for transactional joins or row-level relational queries. + +### 15.7 Object Immutability Concepts + +Many object storage workflows treat objects as immutable. + +Instead of modifying a large object in place, systems often: + +- upload a new version +- update metadata pointers +- rely on versioning for history + +This simplifies distributed durability and caching. + +### 15.8 Metadata and CDN Relationship + +Object stores are often paired with: + +- a database for metadata and permissions +- a CDN for global low-latency delivery + +Why a CDN matters: + +- it caches content near users +- reduces origin load +- improves image and video latency dramatically + +### 15.9 Cost Considerations + +Cost is not just storage per GB. + +Real costs include: + +- request volume +- data transfer out +- replication region choices +- lifecycle tiering +- media derivative explosion such as multiple image sizes and video renditions + +Systems that look cheap at rest can become expensive in egress and processing. 
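A quick back-of-the-envelope calculation makes the point concrete. Every price below is a made-up placeholder, not any provider's real rate, and the workload numbers are invented for illustration.

```python
# All prices are hypothetical placeholders chosen only for the arithmetic.
PRICE_STORAGE_PER_GB_MONTH = 0.02
PRICE_PER_1K_GET = 0.0004
PRICE_EGRESS_PER_GB = 0.08


def monthly_cost(stored_gb: float, get_requests: int, egress_gb: float) -> dict:
    """Split a monthly bill into storage, request, and egress components."""
    storage = stored_gb * PRICE_STORAGE_PER_GB_MONTH
    requests = get_requests / 1000 * PRICE_PER_1K_GET
    egress = egress_gb * PRICE_EGRESS_PER_GB
    return {
        "storage": storage,
        "requests": requests,
        "egress": egress,
        "total": storage + requests + egress,
    }


# 1,000 GB of originals, tripled by thumbnails and renditions (assumed factor).
originals_gb = 1_000
derivative_multiplier = 3

cost = monthly_cost(
    stored_gb=originals_gb * derivative_multiplier,
    get_requests=50_000_000,
    egress_gb=20_000,
)
egress_share = cost["egress"] / cost["total"]
```

Under these assumed numbers, egress is by far the largest line item even though the stored volume tripled, which is why CDN caching and derivative policies matter as much as the storage tier.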
+ +### 15.10 Lifecycle Policies + +Object stores often support lifecycle policies such as: + +- transition old objects to cheaper storage tiers +- expire temporary uploads +- clean abandoned multipart uploads +- retain or lock data for compliance windows + +This matters for backups, logs, and SaaS attachment retention. + +## 16. S3-Style Storage + +### 16.1 Buckets, Objects, and Keys + +S3-style systems usually organize data as: + +- bucket: top-level namespace or container +- object: stored binary plus metadata +- key: object identifier within the bucket + +Despite folder-like UIs, the key space is conceptually flat. Paths are usually naming conventions encoded into the key. + +### 16.2 Versioning + +Versioning keeps previous object versions when objects are overwritten or deleted. + +Why it matters: + +- accidental delete recovery +- auditability +- rollback +- safer overwrite semantics + +### 16.3 Pre-Signed URLs + +Pre-signed URLs allow temporary access to upload or download an object without proxying the bytes through the application server. + +Why they are useful: + +- reduce backend bandwidth load +- keep object store credentials hidden from clients +- enforce short-lived, scoped access + +This is one of the most common production upload and download patterns. + +### 16.4 Access Control Basics + +Access control is usually layered: + +- bucket-level policies +- object-level permissions in some systems +- application-level authorization before issuing signed URLs +- CDN or origin access policies + +Best practice: do not make private content directly world-readable and hope the frontend hides the URL. + +### 16.5 Multipart Uploads + +Multipart uploads break a large object into parts. + +Benefits: + +- retry only failed parts +- upload parts in parallel +- resume large uploads more efficiently + +This is essential for video uploads, cloud drive systems, and large backups. 
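The retry-only-failed-parts benefit can be shown with an in-memory stand-in for an object store. `FakeStore` and its methods are hypothetical and do not mirror any real SDK; SHA-256 digests stand in for the per-part identifiers a real store would return.

```python
import hashlib


def split_parts(data: bytes, part_size: int) -> list[tuple[int, bytes]]:
    """Cut data into numbered parts; part numbers start at 1."""
    return [(n + 1, data[i:i + part_size])
            for n, i in enumerate(range(0, len(data), part_size))]


class FakeStore:
    """In-memory stand-in that fails the first attempt of chosen parts."""

    def __init__(self, fail_once: set[int]):
        self.fail_once = set(fail_once)
        self.parts: dict[int, bytes] = {}
        self.attempts = 0

    def upload_part(self, number: int, body: bytes) -> str:
        self.attempts += 1
        if number in self.fail_once:
            self.fail_once.discard(number)
            raise ConnectionError(f"simulated network failure on part {number}")
        self.parts[number] = body
        return hashlib.sha256(body).hexdigest()

    def complete(self, receipts: dict[int, str]) -> bytes:
        # Verify each part, then assemble strictly in part-number order.
        for number, checksum in receipts.items():
            if hashlib.sha256(self.parts[number]).hexdigest() != checksum:
                raise ValueError(f"checksum mismatch on part {number}")
        return b"".join(self.parts[n] for n in sorted(receipts))


def upload(store: FakeStore, data: bytes, part_size: int, max_attempts: int = 3) -> bytes:
    receipts: dict[int, str] = {}
    for number, body in split_parts(data, part_size):
        for attempt in range(max_attempts):
            try:
                receipts[number] = store.upload_part(number, body)
                break  # this part is done; never resend it
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise
    return store.complete(receipts)


data = bytes(range(256)) * 10  # 2,560 bytes -> three parts of up to 1,000 bytes
store = FakeStore(fail_once={2})
assembled = upload(store, data, part_size=1000)
```

Only part 2 is sent twice, so the upload finishes in four attempts instead of restarting all three parts. A real client would upload parts concurrently and add backoff between retries, both of which are omitted here.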
+ +### 16.6 Consistency Basics + +Modern object stores often provide strong consistency for many common operations within a region, but engineers should still think carefully about distributed workflows. + +Why? + +- event notifications may arrive asynchronously +- cross-region replication may lag +- caches and CDNs may serve stale content +- metadata DB updates may race with object lifecycle events + +So even if storage reads are strongly consistent, the surrounding workflow may still be eventually consistent. + +### 16.7 Event-Driven Workflows + +Object creation often triggers downstream work: + +- antivirus scanning +- image resizing +- video transcoding +- metadata extraction +- OCR or transcription +- search indexing + +This is why object storage frequently sits at the center of async pipelines. + +### 16.8 Real-World Examples + +- user uploads for profile photos or attachments +- backup archives +- application logs shipped to long-term storage +- static site assets +- media hosting for images and videos + +## 17. Uploads and Downloads + +### 17.1 Direct Upload vs Backend Proxy Upload + +There are two common upload styles. + +| Strategy | How it works | Good for | Main tradeoff | +|---|---|---|---| +| Direct upload | Client gets signed URL and uploads to object storage directly | Large media, high scale, low backend bandwidth | More client complexity and async workflow coordination | +| Backend proxy upload | Client uploads bytes to app backend, backend forwards or stores | Simpler auth/control, small files, strict validation workflows | Backend becomes bandwidth bottleneck | + +For large-scale media systems, direct upload is usually preferred. 
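To show what makes a signed URL safe to hand to a client, here is a generic HMAC sketch. It is deliberately not the signing scheme of S3 or any other real provider. The secret, the host `storage.example.com`, and the query parameter names are all assumptions made for illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical; held by the backend, never sent to clients


def _signature(method: str, path: str, expires_at: int) -> str:
    message = f"{method}\n{path}\n{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(method: str, path: str, expires_at: int) -> str:
    """Issue a URL valid for one method and one path until expires_at."""
    query = urlencode({"expires": expires_at, "sig": _signature(method, path, expires_at)})
    return f"https://storage.example.com{path}?{query}"


def verify_url(method: str, url: str, now: int) -> bool:
    """Storage-side check: signature must match and the URL must not be expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        provided = params["sig"][0]
    except (KeyError, IndexError, ValueError):
        return False
    if now > expires_at:
        return False
    expected = _signature(method, parsed.path, expires_at)
    return hmac.compare_digest(provided, expected)


upload_url = sign_url("PUT", "/uploads/u1/photo.jpg", expires_at=1_000)
```

Because the method, path, and expiry are all covered by the signature, a client cannot reuse the URL for a different object, switch from upload to download, or extend its lifetime.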
+ +### 17.2 Browser Upload Flow + +```mermaid +flowchart LR + C[Client] --> API[Backend API] + API --> AUTH[Authorize Upload] + AUTH --> SIGN[Generate Signed Upload URL] + SIGN --> C + C --> OBJ[(Object Storage)] + OBJ --> EVT[Object Created Event] + EVT --> PROC[Scan / Extract / Transform] + PROC --> DB[(Metadata DB)] + DB --> READY[Asset Ready] +``` + +Typical steps: + +1. client asks backend to start upload +2. backend authorizes user and creates upload record +3. backend returns signed URL or multipart session data +4. client uploads directly to object storage +5. object store emits event +6. async processors validate and enrich the asset +7. metadata DB marks asset ready for use + +### 17.3 Mobile Upload Considerations + +Mobile uploads are harder because of: + +- flaky networks +- app backgrounding +- battery constraints +- limited memory +- varying file sizes and camera formats + +Best practices: + +- resumable multipart uploads +- persistent local upload state +- idempotent retry tokens +- chunk sizes tuned for mobile conditions + +### 17.4 Secure Download Patterns + +Secure download is often implemented with: + +- backend authorization check +- short-lived signed URL or signed cookie +- private object origin behind CDN +- optional watermarking or audit logging for sensitive downloads + +Typical SaaS pattern: + +- metadata and permissions live in the app DB +- app checks access +- app issues a short-lived signed URL for the specific file or CDN path + +```mermaid +flowchart LR + REQ[Client Requests File] --> API[Backend API] + API --> AUTH[Authorize User / Tenant / ACL] + AUTH --> SIGN[Issue Short-Lived Signed URL or Cookie] + SIGN --> CDN[CDN / Private Origin] + CDN --> OBJ[(Object Storage)] + OBJ --> CDN + CDN --> RESP[Client Downloads File] +``` + +### 17.5 Resumable Uploads + +Resumable upload means the client can continue after failure without restarting from zero. 
+ +This matters for: + +- large videos +- weak mobile connectivity +- long uploads in browser tabs +- enterprise file transfer systems + +### 17.6 Retry Strategies + +Retries must be careful. + +Good practice: + +- retry failed chunks, not the whole upload +- use exponential backoff with limits +- make upload initiation idempotent +- track completed parts so retries do not duplicate work + +### 17.7 Integrity Verification + +File integrity matters because uploads can be corrupted or truncated. + +Checks may include: + +- size verification +- checksum per part and final checksum +- content-type validation +- file signature or magic-byte inspection + +Do not trust only the filename extension. + +### 17.8 Antivirus Scanning Basics + +Many production systems scan user uploads before making them broadly available. + +Common flow: + +- upload lands in quarantine or pending state +- scan job checks the file +- file is promoted to usable state only after passing validation + +This is common in SaaS attachment systems and customer-facing file upload platforms. + +### 17.9 Failure Cases + +- signed URL expired during large upload +- backend marked metadata row ready before scan or transform completed +- clients retried whole uploads and multiplied storage cost +- insecure direct object exposure leaked private files + +## 18. Chunked Uploads + +### 18.1 Why Chunked Uploads Exist + +Chunked or multipart uploads exist because large files are fragile to send as one giant request. + +Problems with single-shot uploads: + +- one failure restarts the whole transfer +- memory usage may be large +- progress tracking is coarse +- network interruptions waste more work + +### 18.2 Multipart Upload Flow + +```mermaid +flowchart LR + INIT[Initiate Multipart Upload] --> PARTS[Upload Parts in Parallel] + PARTS --> TRACK[Track Completed Part IDs + Checksums] + TRACK --> COMPLETE[Complete Upload] + COMPLETE --> ASSEMBLE[Store Final Object] +``` + +Typical process: + +1. 
client requests multipart upload session +2. backend or object store returns upload ID and per-part instructions +3. client uploads parts independently +4. client records which parts succeeded +5. final completion call assembles the object + +### 18.3 Resume After Failure + +To resume correctly, the client or backend needs state such as: + +- upload ID +- part numbers +- completed part identifiers or ETags +- expected total file size +- checksum state if used + +This state may live in: + +- client local storage +- backend DB +- Redis for short-lived sessions + +### 18.4 Parallel Uploads + +Parallel part uploads improve throughput, especially for large files. + +Tradeoffs: + +- more concurrency can increase speed +- too much concurrency can overwhelm mobile devices, browsers, or rate limits + +Chunk size and concurrency are tuning knobs. + +### 18.5 Ordering and Assembly + +Even when parts upload in parallel, the final object needs deterministic ordering. + +The system uses part numbers or ordered manifests to assemble the correct final file. + +### 18.6 Checksum Verification + +Checksums can be used at: + +- per-part level +- final object level + +This helps detect corruption during transfer or assembly. + +### 18.7 Large File Handling in Production + +For very large uploads, systems often add: + +- rate limiting by user or tenant +- upload quotas +- expiration of abandoned multipart sessions +- lifecycle cleanup of orphaned parts + +Cloud drives and video platforms rely heavily on these controls. + +### 18.8 Common Mistakes + +- not cleaning abandoned multipart parts +- not persisting upload state for resume +- choosing chunk sizes without considering mobile networks +- trusting client-provided completion status without validation + +## 19. Metadata + +### 19.1 Why Metadata Exists Separately + +Object storage is good at storing blobs, but most applications need richer metadata and business rules around each file. 
+ +Examples of metadata: + +- owner user or tenant +- permission model +- file name and original content type +- processing status +- timestamps +- retention or legal hold flags +- checksum and size +- references to derived assets such as thumbnails or transcoded variants + +This metadata is usually stored in a database, not only inside the object store. + +### 19.2 Database + Object Storage Relationship + +Common pattern: + +- object store holds the bytes +- DB holds metadata, ownership, permissions, and workflow state + +Why this split is useful: + +- queries are easier +- business transactions stay in the DB +- permissions are easier to manage +- processing state is easier to update + +Stripe-like or GitHub-like systems use this model for user files, exports, logs, and compliance artifacts. + +### 19.3 Ownership and Permissions + +Metadata tables often model: + +- user ownership +- organization or tenant ownership +- access scopes +- sharing links +- expiration rules + +The object key alone should not be the permission system. + +### 19.4 Indexing Metadata + +Applications often need search over metadata, not the object bytes themselves. + +Examples: + +- search files by filename, tag, owner, created date +- search PDFs by OCR text and metadata +- search media library by dimensions, duration, language, transcript terms + +This is where search and file systems meet. + +### 19.5 Audit Logging + +Sensitive file systems often log: + +- who uploaded a file +- who downloaded it +- who changed permissions +- who deleted or restored it + +Audit trails are critical in enterprise SaaS, finance, healthcare, and security tools. + +### 19.6 Soft Delete Patterns + +Soft delete means a record is marked deleted in metadata before hard physical removal. + +Why it helps: + +- recovery from mistakes +- retention window enforcement +- asynchronous cleanup jobs + +The object may remain in storage until retention rules allow permanent deletion. 
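
The soft delete pattern can be sketched in a few lines. This is a minimal illustration, not a production design: `MetadataStore`, `FileRecord`, the 30-day `RETENTION` window, and the in-memory dictionaries standing in for the metadata DB and the object store are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Assumed retention window; real systems derive this from policy or contract.
RETENTION = timedelta(days=30)


@dataclass
class FileRecord:
    file_id: str
    object_key: str
    deleted_at: Optional[datetime] = None  # None means the file is live


class MetadataStore:
    """In-memory stand-in for a metadata DB plus an object store (hypothetical)."""

    def __init__(self) -> None:
        self.records: Dict[str, FileRecord] = {}
        self.objects: Dict[str, bytes] = {}

    def put(self, file_id: str, object_key: str, data: bytes) -> None:
        self.objects[object_key] = data
        self.records[file_id] = FileRecord(file_id, object_key)

    def soft_delete(self, file_id: str, now: datetime) -> None:
        # Only the metadata row changes; the bytes stay in storage.
        rec = self.records[file_id]
        if rec.deleted_at is None:  # repeated deletes keep the first timestamp
            rec.deleted_at = now

    def restore(self, file_id: str) -> None:
        self.records[file_id].deleted_at = None

    def list_visible(self) -> List[str]:
        return sorted(r.file_id for r in self.records.values() if r.deleted_at is None)

    def purge_expired(self, now: datetime) -> List[str]:
        # Asynchronous cleanup job: hard-delete only after the retention window.
        expired = [
            r for r in self.records.values()
            if r.deleted_at is not None and now - r.deleted_at >= RETENTION
        ]
        for rec in expired:
            self.objects.pop(rec.object_key, None)
            del self.records[rec.file_id]
        return sorted(r.file_id for r in expired)
```

The design choice worth noting is that user-facing reads filter on `deleted_at`, while physical removal happens in a separate job. A legal hold flag, if needed, would be one more condition in `purge_expired`.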
+ +### 19.7 Retention Policies + +Retention policies control how long data stays available or must be preserved. + +This matters for: + +- compliance +- customer contracts +- internal audits +- backup windows + +### 19.8 Why Metadata Usually Does Not Live Only in the Object Store + +Because object stores are not designed to answer business questions efficiently. + +Questions like these belong in a metadata DB or search index: + +- show all files owned by tenant X uploaded in the last 7 days +- list all pending virus scans +- find all files shared externally +- restore the previous version of this document + +## 20. Versioning + +### 20.1 Object Versioning + +Versioning means keeping multiple historical states of an object rather than replacing the old state permanently. + +This supports: + +- rollback after mistakes +- accidental deletion protection +- auditability +- legal retention + +### 20.2 File History + +In user-facing products such as Google Drive or Dropbox, file history is a product feature. + +Under the hood, this may mean: + +- multiple object versions in storage +- metadata rows pointing to the active version +- retention rules controlling how long old versions are kept + +### 20.3 Rollback Capability + +Rollback often means changing metadata to point to an older version rather than mutating the object in place. + +This is simpler and safer in distributed systems. + +### 20.4 Accidental Deletion Protection + +Versioning helps because delete can mean: + +- create a delete marker +- hide current version +- retain earlier versions for recovery + +This protects users from mistakes and protects operators from incidents. + +### 20.5 Legal Retention and Holds + +Some systems need to prevent deletion for compliance or litigation. 
+ +That means versioning and retention logic must support: + +- immutable retention windows +- hold flags +- auditability around deletion attempts + +### 20.6 Overwrite Semantics + +Without versioning, overwrite means old content is lost. + +With versioning, overwrite usually means: + +- new object version becomes current +- old version remains restorable until policy cleanup + +### 20.7 Tradeoffs + +- better safety and recovery +- higher storage cost +- more complex metadata and lifecycle management + +Production systems usually accept that tradeoff for important user data. + +## 21. Image Optimization + +### 21.1 Why Image Optimization Exists + +Raw uploaded images are often too large, too slow, or in the wrong format for user-facing delivery. + +Image optimization exists to improve: + +- page load time +- bandwidth cost +- visual quality per byte +- rendering on different device sizes + +This matters for profile pictures, e-commerce catalogs, social posts, dashboards, and documentation platforms. + +### 21.2 Resizing and Thumbnails + +Common derivatives: + +- thumbnail +- small card image +- medium detail view +- large zoomable image + +Why precompute sizes: + +- repeated on-the-fly resizing is expensive +- predictable variants simplify caching +- UI surfaces often reuse the same size classes + +### 21.3 Responsive Image Delivery + +Different devices need different image sizes. + +Serving a giant desktop asset to a mobile client wastes bandwidth. 
+ +Responsive delivery uses: + +- multiple size variants +- CDN selection or URL conventions +- client hints or frontend image markup strategies + +### 21.4 Format Conversion Basics + +Common formats: + +- JPEG: widely compatible, good for photos +- PNG: lossless and good for sharp graphics or transparency-heavy content +- WebP: often smaller than JPEG/PNG for many web cases +- AVIF: often strong compression efficiency, but with ecosystem and encoding tradeoffs + +The right format depends on content type, compatibility, and CPU cost. + +### 21.5 Lazy Loading Relevance + +Lazy loading reduces unnecessary downloads for off-screen images. + +This is mostly a frontend delivery concern, but backend and CDN design still matter because smaller, optimized derivatives make lazy loading much more effective. + +### 21.6 CDN Image Optimization + +Some systems optimize images on request at the edge or CDN layer. + +Benefits: + +- fewer precomputed variants needed +- flexible resizing +- device-aware format negotiation + +Tradeoffs: + +- added compute cost at the edge or image service +- cache fragmentation if variant space is uncontrolled + +### 21.7 Quality vs Size Tradeoffs + +Image optimization is always a tradeoff. + +Too much compression causes artifacts. Too little wastes bandwidth and storage. + +E-commerce sites care a lot here because product clarity affects conversion, but slow pages also hurt conversion. + +### 21.8 Async Processing Pipelines + +Common image flow: + +1. upload original asset +2. validate and scan +3. generate variants +4. store derivatives in object storage +5. publish metadata and CDN paths + +Profile pictures, avatar systems, and commerce catalogs commonly use this asynchronous derivative pipeline. + +## 22. Video Transcoding + +### 22.1 Why Transcoding Exists + +A raw uploaded video is rarely suitable for direct delivery to every device and network condition. + +Transcoding converts the input into delivery-friendly formats and variants. 
+ +Why it is needed: + +- devices support different codecs and containers +- users have different bandwidth conditions +- multiple resolutions are needed +- streaming systems need chunked delivery formats + +### 22.2 Codec Basics + +A codec defines how video and audio are compressed and decoded. + +You do not need deep media math in most interviews, but you should know: + +- codecs affect compatibility, compression efficiency, and CPU cost +- better compression often costs more encode time +- playback support across devices matters as much as compression ratio + +### 22.3 Bitrate Adaptation and Resolution Variants + +Adaptive streaming works by producing multiple renditions such as: + +- 240p low bitrate +- 480p medium bitrate +- 720p or 1080p higher bitrate + +The player switches between them based on network and device conditions. + +This reduces buffering and improves startup reliability. + +### 22.4 HLS and DASH Basics + +HLS and DASH are common adaptive streaming approaches. + +High-level idea: + +- split video into chunks or segments +- generate manifests listing available renditions and segments +- player fetches segments dynamically based on bandwidth and playback logic + +### 22.5 Async Job Processing + +Video transcoding is expensive, so it is almost always asynchronous. + +Common architecture: + +```mermaid +flowchart LR + UP[Uploaded Video] --> Q[Queue / Job Orchestrator] + Q --> TR[Transcoding Workers] + TR --> PACK[Package HLS / DASH Variants] + TR --> THUMB[Thumbnail Extraction] + PACK --> OBJ[(Object Storage)] + THUMB --> OBJ + OBJ --> CDN[CDN] +``` + +### 22.6 Queue-Based Transcoding Systems + +Why queues are used: + +- uploads arrive in bursts +- transcoding jobs vary wildly by duration and cost +- retries and worker scaling need decoupling +- backpressure needs to be explicit + +Workers may be specialized by codec, region, or job size. + +### 22.7 Storage Explosion Challenges + +Video systems multiply data quickly. 
+ +One uploaded asset may create: + +- multiple renditions +- audio tracks +- captions or subtitles +- thumbnails +- preview clips +- manifests + +This is why video platforms think carefully about retention, renditions, archival tiers, and whether every input truly needs every derivative. + +### 22.8 Playback Optimization + +Playback quality depends on more than transcoding. + +It also depends on: + +- segment sizing +- CDN placement +- startup buffer strategy +- manifest design +- thumbnail or preview availability +- device compatibility testing + +YouTube- and Netflix-style systems invest heavily in startup latency and rebuffer reduction because user drop-off is highly sensitive to playback quality. + +### 22.9 Common Failure Cases + +- queue backlogs causing long time-to-playable +- corrupted input files causing worker crashes +- incompatible renditions for some clients +- expensive reprocessing after pipeline changes +- storage growth from keeping every derivative forever + +## 23. Thumbnails and Previews + +### 23.1 Why Thumbnails Exist + +Thumbnails help users decide what to open before downloading or playing the full asset. + +They matter for: + +- video browsing +- PDF and document explorers +- image galleries +- cloud drive UIs +- e-commerce product grids + +### 23.2 Video Thumbnail Generation + +Video thumbnails may be generated by: + +- picking fixed offsets +- picking keyframes +- choosing frames based on saliency or quality heuristics + +Why keyframe selection matters: + +- a black frame or transition frame makes the content look broken +- better thumbnails improve CTR and perceived quality + +### 23.3 Document Preview Generation + +Documents such as PDFs or presentations often need previews. 
+ +Typical flow: + +- extract first page image or several page previews +- store derived images in object storage +- cache them behind a CDN + +### 23.4 Async Preview Generation + +Preview generation is usually asynchronous because: + +- file types vary +- extraction can be CPU-heavy +- malformed files must be isolated from the main request path + +### 23.5 Caching Strategies + +Thumbnail and preview requests are highly cacheable. + +Common patterns: + +- store deterministic derivative paths +- serve from CDN +- cache immutable variants aggressively when the version is part of the URL + +### 23.6 Failure Cases + +- preview service timing out on large or malformed documents +- thumbnail chosen from a poor video frame +- preview cache not invalidated after replacement or new version upload + +## 24. Compression + +### 24.1 Why Compression Exists + +Compression reduces storage and network transfer size. + +It matters because: + +- bandwidth is expensive +- users have limited network quality +- storage multiplies at scale +- smaller payloads improve latency when CPU cost is acceptable + +### 24.2 Lossless vs Lossy + +| Type | Meaning | Common use | +|---|---|---| +| Lossless | Original data can be reconstructed exactly | Text, archives, logs, some images, many document workflows | +| Lossy | Some information is discarded for higher compression | Photos, audio, video | + +### 24.3 CPU Tradeoffs + +Compression is not free. + +Stronger compression can save bandwidth or storage but costs more CPU and adds latency. + +This is why production systems choose compression differently for: + +- hot online serving +- asynchronous background processing +- archival storage + +### 24.4 Compression in Uploads and Downloads + +Compression may happen at multiple places: + +- client-side before upload in some media apps +- server-side during processing +- CDN or transport layer for text-based responses +- archival pipeline for logs and backups + +Do not blindly recompress already compressed media formats.
It may waste CPU and reduce quality. + +### 24.5 Media-Specific Considerations + +Images and videos already use domain-specific codecs. General-purpose compression on top often gives limited benefit. + +Instead, media optimization usually means: + +- choosing the right codec or format +- choosing quality settings carefully +- generating the right resolution variants + +### 24.6 Archive Workflows + +Archive workflows such as backups and log retention often use strong compression because: + +- data is cold +- latency matters less +- storage savings compound heavily over time + +### 24.7 Where Compression Should Happen + +Good rule of thumb: + +- compress once in a controlled part of the pipeline +- avoid repeated transcoding or recompression unless there is a clear reason +- separate serving optimization from archival optimization + +### 24.8 Common Mistakes + +- recompressing lossy media too many times +- optimizing for smallest size and harming user experience +- using CPU-heavy compression in hot request paths without need + +## 25. How These Systems Connect in Real Architectures + +The most useful mental model is that search, recommendation, and media systems are not isolated services. They are derived-serving systems around a source-of-truth core. 
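
One way to picture a derived serving layer is as a projection that is rebuilt from versioned change events emitted by the source of truth. The sketch below is only an illustration under stated assumptions: the event shape (`id`, `version`, `fields`, `deleted`), the `DerivedIndex` class, and the substring match on `title` are hypothetical and stand in for a real change stream and search index.

```python
from typing import Dict, List, Optional, Tuple


class DerivedIndex:
    """Toy search index fed by source-of-truth change events (hypothetical shape)."""

    def __init__(self) -> None:
        # doc_id -> (last applied version, fields or None for a tombstone)
        self.docs: Dict[str, Tuple[int, Optional[dict]]] = {}

    def apply(self, event: dict) -> bool:
        doc_id, version = event["id"], event["version"]
        current = self.docs.get(doc_id)
        if current is not None and current[0] >= version:
            return False  # stale or duplicate event: safe to ignore
        if event.get("deleted"):
            # Keep a tombstone so an older, late-arriving event cannot resurrect the doc.
            self.docs[doc_id] = (version, None)
        else:
            self.docs[doc_id] = (version, event["fields"])
        return True

    def search(self, term: str) -> List[str]:
        term = term.lower()
        return sorted(
            doc_id
            for doc_id, (_, fields) in self.docs.items()
            if fields is not None and term in fields["title"].lower()
        )
```

Because `apply` compares versions before writing, replaying the full event stream or receiving events out of order converges to the same index state, which is what makes backfills and reprocessing practical.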
+ +### 25.1 Example: E-Commerce Architecture + +Product system: + +- DB stores product, inventory, seller, and pricing records +- search index stores analyzed text plus filterable fields +- ranking combines lexical relevance, popularity, margin, seller trust, and availability +- recommendation system generates related items and home feed modules +- object storage stores product images and videos +- image pipeline generates thumbnails and responsive variants +- CDN serves optimized media globally + +Failure discussion: + +- stale index may show out-of-stock products +- stale media caches may show old images after updates +- poor ranking can bury relevant products even if retrieval worked + +### 25.2 Example: SaaS Document Platform + +Document system: + +- metadata DB stores ownership, permissions, and workflow state +- object storage stores uploaded files and derived previews +- search index stores filename, OCR text, comments, tags, and ACL-aware retrieval fields +- recommendation or discovery surface suggests recent or relevant docs +- signed URLs protect downloads +- antivirus and preview generation run asynchronously + +Failure discussion: + +- ACL lag can leak confidential docs if search filtering is wrong +- metadata/object mismatch can show broken files +- preview lag makes the product feel stale even when upload technically succeeded + +### 25.3 Example: Short-Form Video Platform + +```mermaid +flowchart LR + CREATOR[Creator Upload] --> OBJ[(Object Storage)] + OBJ --> MEDIA[Transcode + Thumbnail + Moderation] + MEDIA --> CDN[CDN] + MEDIA --> META[(Metadata DB)] + META --> INDEX[Search / Hashtag Index] + META --> REC[Candidate Generation + Ranking] + REC --> FEED[Home Feed Service] + FEED --> USER[Viewer] + CDN --> USER +``` + +System properties: + +- upload pipeline must be reliable and resumable +- media pipeline must scale with bursty creator traffic +- recommendation system must generate candidates and rank them quickly +- search index may support 
creators, hashtags, captions, or sounds +- CDN must absorb global playback traffic + +### 25.4 Common Cross-System Failure Modes + +- source-of-truth DB updated but search index stale +- object uploaded but metadata row missing +- metadata row exists but object processing failed +- recommendation service uses stale features or bad model rollout +- CDN serves old media after overwrite because cache keying is wrong +- ACL changes propagate inconsistently across DB, search, cache, and signed download logic + +### 25.5 Strong Engineering Principles Across All These Systems + +- keep a clear source of truth +- treat indexes, feeds, and media derivatives as derived serving layers +- design async pipelines to be idempotent and replay-safe +- measure freshness, not just latency +- plan for backfills and reprocessing from day one +- separate hard constraints from soft ranking +- build for partial degradation, not only full success +- treat permissions as first-class, especially in search and file access + +## 26. Interview Playbook + +If you are asked to design one of these systems, structure your answer around these questions: + +1. What is the user-facing behavior and latency expectation? +2. What is the source of truth? +3. What derived indexes or serving structures are needed? +4. What is precomputed vs done online? +5. How does ranking or retrieval work? +6. What are the main failure cases and stale-data risks? +7. How do permissions, abuse, and compliance affect the design? +8. What changes at 10x scale? 
+ +### 26.1 High-Value Tradeoffs to Discuss + +- database query vs specialized search index +- lexical retrieval vs semantic retrieval vs hybrid +- freshness vs query performance +- fanout-on-write vs fanout-on-read +- precompute vs read-time ranking +- direct upload vs backend proxy upload +- object storage vs block storage vs file systems +- exact counts vs approximate facets +- aggressive personalization vs fairness and exploration + +### 26.2 What Breaks at Scale + +- long-tail shard latency dominates search response time +- popular prefixes overload autocomplete caches +- ranking models become too expensive for online serving +- celebrity fanout explodes write amplification +- object store costs spike from derivative explosion and egress +- background media pipelines backlog during traffic bursts +- stale caches and delayed async processing create user-visible inconsistency + +### 26.3 Final Mental Model + +Search systems answer explicit intent. + +Recommendation systems infer likely intent. + +Object storage and media pipelines make large assets durable and deliverable. + +The engineering challenge is rarely the isolated component. It is the interaction between source data, derived indexes, ranking, caching, asynchronous processing, permissions, and scale. + +That is the level interviews usually want, and it is also the level production systems demand. diff --git a/systems design/8.financial.md b/systems design/8.financial.md new file mode 100644 index 0000000..4f6c732 --- /dev/null +++ b/systems design/8.financial.md @@ -0,0 +1,1941 @@ +# Financial Systems + +Financial systems are where backend engineering stops being forgiving. A normal CRUD bug might show the wrong profile picture or duplicate a notification. A financial bug can double-charge a customer, underpay a merchant, misstate revenue, break compliance rules, or create an accounting discrepancy that surfaces weeks later. 
+ +That is why payment, billing, invoicing, and ledger design show up so often in system design interviews. These systems force you to think about correctness, retries, immutability, eventual consistency, external dependencies, and operational recovery all at once. + +This guide is designed for two goals: + +1. Help you explain financial systems clearly in backend and system design interviews. +2. Help you build a realistic mental model of how production financial systems are actually built. + +Examples in this guide are generalized from public industry patterns used by systems like Stripe, PayPal, Amazon, Uber, Netflix, Shopify, GitHub, and typical SaaS subscription platforms. + +## 1. Big Picture: Why Financial Systems Are Different + +At a high level, a financial system answers a deceptively simple question: + +"Who owes whom how much money, for what reason, in what currency, at what time, and how certain are we?" + +That question is harder than it looks because the answer changes over time and is often influenced by external systems you do not control. + +In a normal application, the request-response cycle often feels authoritative. In financial systems, the HTTP response is usually only the beginning of the story. 
+ +### 1.1 Why These Systems Are Hard + +| Problem | Why it exists | What it forces you to design for | +|---|---|---| +| Money movement has real consequences | A bug affects customer trust, legal exposure, and revenue | correctness, auditability, strong operational controls | +| External systems are involved | banks, card networks, processors, and payment providers can be delayed or inconsistent | async workflows, reconciliation, retries | +| Requests are retried | clients, load balancers, and workers will retry on timeouts | idempotency, deduplication, safe replays | +| Truth is distributed | your DB, the processor, and the bank may disagree temporarily | state machines, eventual consistency, operational review | +| History matters | you cannot casually mutate financial records without losing traceability | immutable records, append-only ledgers, reversals | +| Regulations and contracts matter | taxes, invoices, refunds, disputes, and payouts have legal rules | compliance-aware data design | +| Failure is partial | auth may succeed while your app times out, or webhook may arrive late | ambiguity handling, polling, reconciliation | + +### 1.2 The Core Mental Model + +A strong production design separates responsibilities instead of letting one table do everything. + +- Product systems decide what happened in the business domain. +- Billing decides what should be charged. +- Payments decide whether money collection succeeded. +- Invoices define what the customer legally owes. +- The ledger records financial truth as immutable movements. +- Reconciliation confirms that internal truth matches external truth. +- Audit logs explain who changed what and when. + +If a company tries to collapse all of that into a single `payments` table with a `status` column, the system usually becomes fragile very quickly. + +### 1.3 Non-Negotiable Design Rules + +These are not optional nice-to-haves. They are survival rules. + +1. 
Represent money as integers in minor units, not floating point. +2. Separate order state, payment state, invoice state, entitlement state, and ledger state. +3. Assume every network call can be retried or duplicated. +4. Assume every webhook can be delayed, reordered, or redelivered. +5. Prefer append-only financial history over mutable rows. +6. Build operational tools for refunds, disputes, and reconciliation from day one. +7. Treat external provider reports as mandatory inputs, not optional debug data. + +### 1.4 End-to-End Financial Architecture + +```mermaid +flowchart LR + PROD[Product Events\norders seats rides storage usage] --> BILL[Billing Engine] + PLAN[Plan Catalog\npricing discounts tax rules] --> BILL + BILL --> INV[Invoice Engine] + INV --> PAY[Payment Orchestrator] + PAY --> GW[Gateway / Processor / Provider] + GW --> RAILS[Card Network / Bank Rails / Wallet Rails] + GW --> WEBHOOKS[Webhook / Settlement Events] + WEBHOOKS --> PAY + PAY --> LEDGER[Ledger Posting Engine] + INV --> LEDGER + PAY --> ENT[Entitlement Service] + BILL --> ENT + LEDGER --> RECON[Reconciliation Engine] + GW --> REPORTS[Provider Reports] + RAILS --> BANK[Bank Statements / Settlement Files] + REPORTS --> RECON + BANK --> RECON + PAY --> AUDIT[Audit Log] + BILL --> AUDIT + LEDGER --> ACCOUNTING[Accounting Export / Finance Warehouse] + RECON --> OPS[Finance Ops / Support Review] +``` + +This diagram is important because it shows the most common interview mistake: assuming payment success automatically means all other financial systems are done. In reality, payment collection, invoice generation, ledger posting, entitlements, and reconciliation are separate concerns. + +## 2. Payment System + +### 2.1 What a Payment System Is + +A payment system is the software and infrastructure that lets a customer transfer money to a merchant or platform. In practice, it is less about "charging a card" and more about coordinating many moving parts safely. 
+ +A production payment system often needs to: + +- accept a payment request +- authenticate or verify the payment method +- call external payment providers +- handle redirects or additional customer action such as 3DS +- track payment state transitions over time +- prevent duplicates +- post financial entries +- support refunds and disputes +- reconcile internal state with provider and bank reports + +### 2.2 Why Payment Systems Are Difficult + +Payment systems are difficult because they combine all of the hardest parts of distributed systems: + +- external dependencies you do not control +- user-facing latency requirements +- irreversible or expensive side effects +- asynchronous confirmations +- legal and audit requirements +- fraud and abuse pressure + +If an API that sends emails times out, you can usually retry safely and move on. If a payment API times out, you often do not know whether money was already authorized, captured, declined, or is still in flight. + +That ambiguity is one of the defining properties of payment engineering. + +### 2.3 Trust, Correctness, and Money Movement Requirements + +Financial systems are built around trust: + +- customers must trust they will not be double charged +- merchants must trust they will get paid correctly +- finance teams must trust balances and reports +- regulators and auditors must trust the records + +That leads to three core requirements: + +| Requirement | What it means in practice | +|---|---| +| Correctness | no duplicate charges, balanced ledger entries, accurate invoice totals | +| Durability | once a financial event is committed, it should not disappear | +| Traceability | every state change should be explainable later | + +### 2.4 High-Level Payment Actors + +The exact terminology varies across providers, but the standard card ecosystem usually includes these actors. 
+ +| Actor | Role | Practical note | +|---|---|---| +| Customer | the person or business paying | may abandon, retry, dispute, or fail authentication | +| Merchant | the business receiving payment | often your company or your platform merchant | +| Payment gateway | API layer that tokenizes or routes payment requests | often confused with processor in interviews | +| Payment processor | coordinates transaction processing with acquiring banks and networks | Stripe, Adyen, Braintree, PayPal, etc. can abstract this | +| Acquiring bank | bank that processes payments for the merchant | receives funds on merchant side | +| Issuing bank | bank that issued the customer's card | decides approve, decline, fraud challenge | +| Card network | Visa, Mastercard, AmEx, RuPay, etc. | routes messages and settlement rules | + +Important interview nuance: in modern systems, one provider may abstract multiple roles. Stripe or Adyen may expose a unified API, but the underlying ecosystem still includes processors, acquirers, networks, and issuers. 
+ +### 2.5 High-Level Payment Flow + +```mermaid +sequenceDiagram + participant C as Customer + participant UI as Merchant App + participant M as Merchant Backend + participant P as Payment Service + participant G as Gateway/Processor + participant A as Acquirer + participant N as Card Network + participant I as Issuer + participant W as Webhooks + + C->>UI: Checkout and submit payment method + UI->>M: Create order + M->>P: Create payment intent / payment session + P->>G: Authorize payment + G->>A: Forward request + A->>N: Route auth + N->>I: Approve or decline + I-->>N: Decision + N-->>A: Decision + A-->>G: Decision + G-->>P: Initial auth result + P-->>M: Payment pending / authorized / requires action + M-->>UI: Show status to customer + G-->>W: Async settlement/refund/dispute events + W-->>P: Webhook delivery + P->>M: Update order / trigger fulfillment / update ledger +``` + +### 2.6 Payment Lifecycle Overview + +A payment usually goes through several phases rather than a single binary success/failure outcome. + +1. Payment is created. +2. Payment method is attached or selected. +3. Additional customer action may be required. +4. Authorization may be attempted. +5. The merchant may capture immediately or later. +6. Funds clear and settle later. +7. Refunds or disputes may happen days or weeks later. + +That is why a payment system is almost always modeled as a state machine rather than a single insert into a `charges` table. + +### 2.7 Online vs Offline Payment Flows + +#### Online payments + +Online payments involve real-time network communication at transaction time. 
+ +Examples: + +- card-not-present ecommerce checkout +- UPI or instant bank payment +- wallet checkout using an external provider + +Characteristics: + +- synchronous API call initiates the payment +- immediate or near-immediate response exists +- still may need async confirmation later + +#### Offline or deferred flows + +Offline payments do not always consult the full network in real time, or the final financial confirmation happens later. + +Examples: + +- transit systems with offline authorization logic +- POS terminals in poor connectivity environments +- invoice payments by bank transfer +- cash on delivery or manual collections + +Characteristics: + +- customer action and financial confirmation are decoupled +- risk is shifted to later operational validation +- reconciliation becomes more important + +### 2.8 Synchronous vs Asynchronous Confirmation + +This distinction is one of the most important concepts to explain in an interview. + +| Type | Meaning | Example | +|---|---|---| +| Synchronous confirmation | request returns an immediate initial result | card authorization approved or declined | +| Asynchronous confirmation | final truth arrives later via webhook, polling, file, or settlement report | ACH success later, refund completed later, dispute opened later | + +Real-world payment systems are often both. A card payment may synchronously return `authorized`, but the final fulfillment decision may still wait for provider webhooks, fraud checks, or successful capture. + +### 2.9 Failure Handling and Retries + +Payment failures are not all the same. Good systems classify them. 
+ +| Failure type | Example | Correct response | +|---|---|---| +| User/business decline | insufficient funds, expired card, stolen card suspicion | do not blindly retry; ask for new method or user action | +| Technical transient error | provider timeout, connection reset, temporary 5xx | retry safely with idempotency | +| Ambiguous outcome | request timed out after provider received it | query status, use idempotency, reconcile later | +| Async failure | auth succeeded, capture failed later | state machine transition plus operational handling | + +Best practices: + +- separate retriable and non-retriable failures +- never retry without an idempotency strategy +- preserve provider reference IDs for later lookup +- expose pending states to the business system instead of guessing + +### 2.10 How Payment Systems Differ from Normal CRUD Systems + +| Normal CRUD system | Payment system | +|---|---| +| request-response often feels authoritative | response is often provisional | +| updates overwrite prior state | financial history is usually append-only or versioned | +| duplicate requests may be harmless | duplicate requests can create duplicate charges | +| a row can often be edited freely | financial records may require reversal instead of mutation | +| internal DB is primary source of truth | external providers and banks also matter | +| logs are mostly for debugging | logs and audit trails are operationally and legally important | + +## 3. Payment Processing + +Payment processing is the detailed mechanics of what happens after the customer initiates a payment. 
+ +### 3.1 Core Stages + +| Stage | What it means | Why it exists | +|---|---|---| +| Authorization | reserve funds or confirm the payment method can be charged | lets merchant validate ability to pay before finalizing | +| Capture | convert an authorization into an actual charge | useful when final amount or fulfillment timing is delayed | +| Clearing | exchange transaction details between parties | needed for network and bank processing | +| Settlement | actual money movement between institutions | this is when funds are truly transferred | +| Refund | return some or all funds after charge | needed for cancellations, service failures, policy decisions | +| Chargeback | issuer/cardholder dispute reverses or challenges the charge | consumer protection and fraud handling | + +### 3.2 Authorization vs Capture + +This is a classic interview comparison. + +| Topic | Authorization | Capture | +|---|---|---| +| Purpose | checks or reserves available funds | actually charges the customer | +| Timing | often happens at checkout | may happen immediately or later | +| Business use case | reserve payment before shipment or service completion | take money when fulfillment is confirmed | +| Example | Amazon authorizes before shipment | Amazon captures when items ship | +| Failure impact | order can remain unfulfilled | revenue collection fails after customer thought checkout succeeded | + +Delayed capture is common in real systems: + +- Amazon captures when an item ships, not when the order is first placed. +- Uber may pre-authorize an estimated ride amount and capture the final amount after trip completion. +- Hotels often authorize a card at check-in and settle later. + +### 3.3 Payment State Machines + +A robust payment system uses explicit state transitions instead of ad hoc boolean flags. 
+ +```mermaid +stateDiagram-v2 + [*] --> RequiresPaymentMethod + RequiresPaymentMethod --> RequiresConfirmation + RequiresConfirmation --> RequiresAction + RequiresConfirmation --> Processing + RequiresAction --> Processing + Processing --> Authorized + Processing --> Failed + Authorized --> Captured + Authorized --> Expired + Captured --> Settling + Settling --> Succeeded + Succeeded --> PartiallyRefunded + PartiallyRefunded --> Refunded + Succeeded --> Refunded + Succeeded --> ChargebackOpen + ChargebackOpen --> ChargebackWon + ChargebackOpen --> ChargebackLost + Failed --> [*] + Refunded --> [*] + ChargebackWon --> [*] + ChargebackLost --> [*] +``` + +The specific states vary across providers, but the idea is consistent: + +- payments are long-lived workflows +- transitions happen over time +- asynchronous events may drive later transitions +- invalid transitions should be rejected explicitly + +### 3.4 Stripe-Like Payment Intent Model + +Stripe popularized the idea that a payment should be represented as a durable intent object rather than a one-shot charge request. + +The idea is powerful because it matches reality. + +A payment intent usually stores: + +- merchant or account context +- customer context +- amount and currency +- payment method information or token reference +- current state +- whether capture is automatic or manual +- metadata for order correlation +- provider reference IDs + +Why this model exists: + +- the same payment may require multiple attempts +- customer authentication may interrupt the flow +- the client, server, and provider all need a shared durable object to coordinate around +- webhooks can safely attach later state changes to that object + +In practice, a payment intent model reduces the temptation to create multiple charges for the same checkout retry. + +### 3.5 Partial Capture + +Partial capture means the merchant authorized one amount but captures only part of it. 
+ +Examples: + +- an order with multiple items ships in parts +- a ride estimate was higher than the final fare +- a restaurant pre-authorized a larger amount but settled the final bill + +Design implications: + +- the payment object must track authorized amount, captured amount, and remaining capturable amount +- ledger posting must reflect only captured funds, not full authorization +- invoice and fulfillment systems must know whether a partial capture is expected behavior or a problem + +### 3.6 Partial Refunds + +Partial refunds are extremely common. + +Examples: + +- one item in a multi-item order is returned +- a service credit covers part of the invoice +- support grants a goodwill refund for part of the charge + +Design implications: + +- never model refund as a boolean flag on payment +- track cumulative refunded amount and remaining refundable amount +- prevent total refunds from exceeding captured amount +- ledger and accounting should reflect each refund as its own event + +### 3.7 Delayed Capture + +Delayed capture is not an edge case. It is a standard production requirement. + +Why it exists: + +- final amount may be unknown at checkout +- merchant may want to confirm inventory or fraud review first +- fulfillment may happen hours or days later + +Tradeoffs: + +- improves business control +- increases complexity in payment lifecycle tracking +- introduces authorization expiry risk if capture is too late + +### 3.8 Clearing and Settlement + +Developers often stop thinking at authorization, but finance systems cannot. + +#### Clearing + +Clearing is the exchange of transaction details among payment institutions. It confirms what transaction should be processed and under what rules. + +#### Settlement + +Settlement is the actual movement of funds between institutions. 
+ +Why this matters: + +- a payment can be authorized but not yet settled +- a successful API response does not mean cash is in the merchant bank account yet +- marketplace payout logic often depends on settlement confidence, not just auth success + +### 3.9 Refunds and Chargebacks + +Refunds are merchant-initiated. Chargebacks are usually customer-issuer-initiated disputes. + +| Topic | Refund | Chargeback | +|---|---|---| +| Who initiates | merchant or support workflow | cardholder through issuer | +| Typical reason | return, cancellation, service issue | fraud, dissatisfaction, unrecognized charge | +| Timing | usually after capture | often days or weeks later | +| Control | merchant usually controls initiation | external process, evidence-based | +| Operational impact | customer experience and accounting adjustment | financial loss risk, dispute operations, fees | + +Chargebacks are a major reason payment systems need strong evidence storage, audit logs, and operational tooling. + +### 3.10 Retry Behavior in Processing Systems + +Good retry behavior requires classification. + +Retry candidates: + +- connection timeout before receiving provider result +- transient provider 5xx +- webhook delivery failure +- temporary downstream database outage + +Do not blindly retry: + +- issuer declines +- fraud rules blocked the payment +- expired payment method +- authorization window already expired + +Practical pattern: + +1. attempt payment with provider idempotency key +2. on timeout, mark local state as uncertain or processing +3. query provider by external reference if possible +4. wait for webhook or scheduled reconciliation if ambiguity remains + +### 3.11 Webhooks and Event-Driven Payment Updates + +Real payment platforms are event-driven because external truth arrives asynchronously. 
+ +Typical events: + +- payment succeeded +- payment failed +- payment requires action +- charge captured +- refund succeeded +- dispute opened +- payout paid + +Best practices: + +- verify webhook signatures +- store raw webhook payloads for replay and audit +- deduplicate by provider event ID and business object ID +- process via internal queue, not inline on the HTTP handler only +- make handlers idempotent and order-aware + +Common mistake: assuming events arrive exactly once and in order. Providers often deliver at least once and may redeliver older events. + +### 3.12 Duplicate Charge Prevention + +Duplicate charge prevention is not one feature. It is multiple layers. + +| Layer | Example | +|---|---| +| API idempotency | same checkout request with same key returns same result | +| Business key uniqueness | one successful payment per order or invoice | +| Payment intent reuse | retries reuse the same payment object | +| Provider-side idempotency | send a provider idempotency key or merchant reference | +| Operational controls | support tools show prior attempts before allowing manual re-charge | + +### 3.13 Exactly-Once vs At-Least-Once Reality + +Exactly-once is mostly a marketing phrase in distributed money systems. + +What you can realistically achieve is: + +- at-least-once delivery of events +- idempotent processing of duplicates +- unique business constraints that prevent duplicate final effects +- reconciliation that catches anything missed + +A strong interview answer explicitly says this instead of claiming "I will ensure exactly once delivery". + +## 4. Subscriptions + +Subscriptions turn one-time payment processing into a long-lived revenue engine. 
+ +### 4.1 What Recurring Billing Really Means + +A subscription is a contract-like object that says: + +- who is being billed +- for what plan or service +- on what schedule +- under what pricing rules +- with what payment collection behavior + +Recurring billing is not just a cron job that charges a card every month. It is a coordinated system involving: + +- plan catalog and pricing rules +- entitlement logic +- invoice generation +- taxes and discounts +- payment collection +- retries and dunning +- cancellation and plan change rules + +### 4.2 High-Level Subscription Architecture + +```mermaid +flowchart LR + CAT[Plan Catalog\nprices features coupons] --> SUB[Subscription Service] + CUST[Customer / Account] --> SUB + SUB --> SCHED[Renewal Scheduler] + SCHED --> BILL[Billing Engine] + USAGE[Usage Metering] --> BILL + BILL --> INV[Invoice Engine] + INV --> PAY[Payment Service] + PAY --> PSP[Payment Provider] + PAY --> LEDGER[Ledger] + INV --> LEDGER + SUB --> ENT[Entitlement Service] + PAY --> ENT + SUB --> NOTIFY[Email / In-App Notifications] + PAY --> NOTIFY +``` + +### 4.3 Monthly vs Annual Subscriptions + +| Model | Business benefit | Technical impact | +|---|---|---| +| Monthly | lower entry barrier, more flexible | more frequent renewals, more retry churn | +| Annual | better cash flow, lower churn, simpler gross retention story | larger invoice amounts, proration complexity, annual tax handling | + +A subscription system must treat cadence as a first-class configuration, not a hard-coded monthly assumption. + +### 4.4 Trial Periods + +Trials exist because product growth often needs a usage or time-limited evaluation phase. + +Design questions: + +- does the trial require a payment method up front? +- what happens when the trial ends without a payment method? +- when do entitlements activate and deactivate? +- can the same customer create repeated free trials? 
+ +Real-world examples: + +- Netflix historically used trials as a conversion funnel in some markets +- SaaS tools often require a card to reduce abuse and improve conversion quality + +### 4.5 Upgrades, Downgrades, and Proration + +Proration exists because customers change plans mid-cycle. + +Example: + +- plan A costs $100 per month +- plan B costs $200 per month +- customer upgrades halfway through the billing cycle +- the incremental charge is usually about $50 before tax and discounts + +Why proration is hard: + +- seat counts may change multiple times in the same cycle +- taxes may vary by jurisdiction +- credits may need to be carried to the next invoice instead of immediately refunded +- invoice presentation must stay understandable for the customer + +Best practice: store billable events and proration line items explicitly rather than trying to recompute historical changes from current plan state. + +### 4.6 Grace Periods + +A grace period allows temporary continued access after a failed renewal or overdue invoice. + +Why it exists: + +- transient card failures are common +- immediate shutdown creates bad customer experience +- enterprise contracts may allow time for manual payment + +But grace periods need clear rules: + +- what features remain enabled? +- how long does grace last? +- which customer segments qualify? +- when do collections escalate? + +### 4.7 Failed Renewal Handling, Retries, and Dunning + +Dunning is the process for recovering failed recurring payments. + +Typical strategy: + +1. attempt renewal +2. if failed for retriable reason, retry on scheduled intervals +3. notify the customer by email or in-product banners +4. optionally use backup payment methods +5. apply grace period +6. cancel or suspend if payment is not recovered + +Important nuance: retry behavior should depend on failure reason. Retrying an insufficient-funds decline after salary day might work. Retrying a stolen-card block probably will not. 
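The reason-dependent retry rule above can be sketched as a small scheduling function. This is a minimal sketch under stated assumptions: the decline categories, the retry offsets, the grace length, and the names `DeclineReason`, `DunningPolicy`, and `next_retry_at` are all hypothetical and do not correspond to any provider's real decline codes or recommended schedule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional, Tuple


class DeclineReason(Enum):
    # Hypothetical categories; real providers expose their own decline codes.
    INSUFFICIENT_FUNDS = "insufficient_funds"
    PROVIDER_TIMEOUT = "provider_timeout"
    EXPIRED_CARD = "expired_card"
    STOLEN_CARD = "stolen_card"


# Reasons where a later retry has a realistic chance of succeeding (assumed).
RETRIABLE = {DeclineReason.INSUFFICIENT_FUNDS, DeclineReason.PROVIDER_TIMEOUT}


@dataclass(frozen=True)
class DunningPolicy:
    # Days after the first failed renewal at which retries run (assumed schedule).
    retry_offsets_days: Tuple[int, ...] = (1, 3, 7)
    # Retries are never scheduled beyond the grace period (assumed length).
    grace_period_days: int = 10


def next_retry_at(
    policy: DunningPolicy,
    first_failure_at: datetime,
    retries_made: int,
    reason: DeclineReason,
) -> Optional[datetime]:
    """Return when the next retry should run, or None if retrying should stop."""
    if reason not in RETRIABLE:
        # Non-retriable declines need customer action, not another attempt.
        return None
    if retries_made >= len(policy.retry_offsets_days):
        # Schedule exhausted: fall through to suspend or cancel handling.
        return None
    candidate = first_failure_at + timedelta(
        days=policy.retry_offsets_days[retries_made]
    )
    grace_end = first_failure_at + timedelta(days=policy.grace_period_days)
    return candidate if candidate <= grace_end else None
```

Returning `None` rather than raising keeps the caller's decision simple: a `None` result means the subscription moves on to the notification, grace, or cancellation steps instead of another charge attempt.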
+ +### 4.8 Cancellations + +Cancellation behavior should be explicit. + +Common models: + +- immediate cancellation with immediate entitlement removal +- cancel at period end +- cancel and refund unused period under specific policy +- pause rather than cancel + +Good systems store both: + +- operational subscription status +- effective cancellation date + +Those are not always the same thing. + +### 4.9 Scheduled Plan Changes + +Many SaaS systems allow a downgrade or upgrade to take effect later. + +Examples: + +- upgrade immediately, because customer wants more features now +- downgrade at next renewal, because customer already paid for current cycle +- price increase takes effect at next term for existing customers + +That means you often need both: + +- current plan version +- pending future plan version + +### 4.10 Subscription Lifecycle Design + +```mermaid +stateDiagram-v2 + [*] --> Trialing + Trialing --> Active + Trialing --> Incomplete + Incomplete --> Active + Active --> PastDue + PastDue --> GracePeriod + GracePeriod --> Active + GracePeriod --> Paused + GracePeriod --> Canceled + Active --> PendingCancellation + PendingCancellation --> Canceled + Active --> Paused + Paused --> Active + Canceled --> [*] +``` + +Do not collapse all of these into one status without clear semantics. In real systems, invoice state, payment state, and entitlement state may temporarily differ. + +Example: + +- invoice is unpaid +- payment is retrying +- subscription is still active because the customer is inside grace period + +### 4.11 SaaS Subscription Architecture Patterns + +Strong SaaS systems usually separate: + +- subscription contract management +- pricing and catalog management +- invoice generation +- payment collection +- entitlement enforcement + +That separation matters because GitHub-style seat billing, Shopify merchant billing, and enterprise SaaS contracts all need different rules, but they still plug into the same general architecture. 
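One way to keep entitlement enforcement separate from payment collection is a small pure function that decides access from subscription state alone. The sketch below is illustrative: the status names mirror the lifecycle diagram in 4.10, while `SubscriptionSnapshot`, `has_access`, and the rule that past-due and grace-period customers keep access until a deadline are assumptions about policy, not a prescribed design.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class SubscriptionStatus(Enum):
    TRIALING = "trialing"
    ACTIVE = "active"
    PAST_DUE = "past_due"
    GRACE_PERIOD = "grace_period"
    PENDING_CANCELLATION = "pending_cancellation"
    PAUSED = "paused"
    CANCELED = "canceled"


@dataclass(frozen=True)
class SubscriptionSnapshot:
    # Hypothetical read model; invoice and payment state are deliberately absent.
    status: SubscriptionStatus
    grace_ends_at: Optional[datetime] = None
    cancel_effective_at: Optional[datetime] = None


def has_access(sub: SubscriptionSnapshot, now: datetime) -> bool:
    """Decide entitlement from subscription policy, not from invoice status."""
    if sub.status in (SubscriptionStatus.TRIALING, SubscriptionStatus.ACTIVE):
        return True
    if sub.status in (SubscriptionStatus.PAST_DUE, SubscriptionStatus.GRACE_PERIOD):
        # Assumed policy: access continues only while a grace deadline is open.
        return sub.grace_ends_at is not None and now < sub.grace_ends_at
    if sub.status is SubscriptionStatus.PENDING_CANCELLATION:
        # Operational status and effective cancellation date can differ.
        return sub.cancel_effective_at is None or now < sub.cancel_effective_at
    # PAUSED and CANCELED grant no access in this sketch.
    return False
```

Because the function never reads invoice or payment status, an unpaid invoice with a retrying payment still yields access while the subscription sits inside its grace period, which is the situation described in 4.10.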
+ +### 4.12 Common Failures and Scaling Considerations + +What breaks at scale: + +- renewing millions of subscriptions at exactly midnight UTC +- large proration computations for accounts with many seat changes +- delayed usage events missing the current invoice cut-off +- broken entitlements because payment and subscription state were coupled too tightly + +Best practices: + +- spread renewal schedules across time instead of one giant batch +- snapshot pricing terms at subscription creation time +- version plans instead of mutating them in place +- keep entitlements separately derived from subscription policy plus payment policy + +## 5. Invoices + +### 5.1 What an Invoice Is + +An invoice is a formal statement that documents what the customer owes for specific goods or services. + +In many businesses, the invoice is not just an internal billing artifact. It is a customer-facing financial document with tax, legal, and accounting implications. + +### 5.2 Why Invoices Exist + +Invoices exist because businesses need a structured record of: + +- what was sold +- when it was sold +- who owes the money +- what taxes applied +- when payment is due + +For subscription businesses, invoices often bridge billing, finance, and customer support. For enterprise sales, they may be the main financial document the customer actually processes. + +### 5.3 Invoice Generation + +Invoice generation typically pulls from multiple sources: + +- subscription charges +- usage aggregates +- discounts and credits +- taxes +- currency settings +- customer billing profile + +Typical stages: + +1. collect billable line items +2. compute rating and discounts +3. compute taxes +4. create draft invoice +5. review or finalize +6. collect payment or send for manual payment + +### 5.4 Invoice Numbering + +Invoice numbering sounds trivial, but it is often compliance-sensitive. 
+ +Requirements may include: + +- uniqueness +- no accidental reuse +- understandable sequence for finance operations +- jurisdiction-specific numbering expectations + +Practical approaches: + +- globally unique invoice numbers +- per legal entity sequences +- per country or tax registration sequences + +Common mistake: using a random UUID as the only customer-visible invoice identifier. UUIDs are fine as internal IDs, but finance and customers often still need a human-usable invoice number. + +### 5.5 Tax Basics + +A production billing system usually needs at least a working tax model. + +Common concerns: + +- sales tax vs VAT vs GST +- business location and customer location +- tax-exempt customers +- net price vs gross price presentation +- tax per line item vs tax on total + +Even if a company uses Stripe Tax, Avalara, or another external tax provider, your system still needs to store inputs, outputs, versioned tax decisions, and invoice presentation. + +### 5.6 Invoice Finalization + +Finalization is the point at which a draft invoice becomes an authoritative billing document. + +Why it matters: + +- line items usually become locked +- taxes and totals become authoritative +- payment collection or customer delivery may begin +- downstream accounting and reporting can depend on it + +Many systems allow drafts to be edited but finalized invoices to be immutable. + +### 5.7 Why Invoices Often Cannot Be Edited Like Normal Records + +This is a very common interview discussion point. + +Once an invoice has been issued, multiple downstream systems may already rely on it: + +- customer procurement or AP systems +- tax reporting +- finance close processes +- revenue reporting +- external accounting exports + +If you simply edit the original invoice row, you can destroy auditability and legal traceability. 
+ +Typical production approaches instead are: + +- void the invoice if rules allow +- issue a credit note +- create a replacement invoice +- maintain versioned presentation while preserving immutable original financial facts + +### 5.8 Due Dates and Payment Status Tracking + +Common invoice statuses: + +- draft +- finalized or open +- paid +- partially paid +- overdue +- uncollectible +- void + +These statuses are not just cosmetic. They drive collections, revenue operations, support workflows, and customer entitlements in some businesses. + +### 5.9 Invoice vs Receipt + +| Topic | Invoice | Receipt | +|---|---|---| +| Purpose | request or document amount owed | proof that payment was received | +| Timing | typically before or at billing time | after payment succeeds | +| Business meaning | customer owes or was billed | customer already paid | +| Common use | subscription billing, enterprise accounts receivable | ecommerce confirmation, payment confirmation | + +This distinction matters because many junior designs treat them as interchangeable. + +### 5.10 Invoice Lifecycle + +```mermaid +flowchart LR + D[Draft Invoice] --> F[Finalized Invoice] + F --> O[Open / Awaiting Payment] + O --> P[Paid] + O --> OD[Overdue] + OD --> U[Uncollectible] + F --> V[Void] + P --> R[Credit Note / Refund Adjustment] +``` + +### 5.11 Invoice Versioning Considerations + +Versioning is often needed for presentation changes without rewriting financial history. 
+ +Examples: + +- corrected customer address +- updated PDF template +- additional display metadata for support + +Good design distinguishes between: + +- immutable financial facts +- mutable presentation metadata + +### 5.12 Legal and Compliance Considerations + +The rules vary heavily by country and business model, but common themes include: + +- invoice numbering discipline +- retention periods +- tax fields and calculations +- non-editability of issued documents +- legal entity identifiers + +The engineering lesson is simple: invoice design is not just a UI export problem. + +## 6. Idempotency + +Idempotency is one of the most important topics in financial backend design. + +### 6.1 Why Idempotency Is Critical + +Retries are unavoidable. + +They happen because: + +- the client timed out +- the mobile network dropped +- a proxy retried the request +- a worker retried the job +- a human clicked the checkout button twice + +Without idempotency, retries turn into duplicate financial side effects. + +### 6.2 Duplicate Payment Risks + +Duplicate payment failures are expensive because they create: + +- customer trust loss +- support costs +- refund workload +- dispute risk +- accounting noise + +That is why idempotency is a first-class design concern, not a nice API feature. + +### 6.3 Idempotency Keys + +An idempotency key is a client- or server-generated key representing one logical operation. + +Example use cases: + +- `create payment for order 123` +- `refund payment 456 for $20` +- `create invoice for subscription cycle 2026-04` + +Good idempotency key design usually includes scope: + +- merchant or account ID +- operation type +- business object reference + +### 6.4 API Idempotency Design + +Typical pattern: + +1. client sends request with idempotency key +2. server checks durable idempotency store +3. if key exists with completed identical request, return stored result +4. if key exists with conflicting payload, reject +5. 
if key is new, reserve it and process the request +6. store final response or final effect reference + +Important nuance: storing only "request seen" is often insufficient. You usually also need the final outcome or object ID. + +### 6.5 Idempotency Processing Flow + +```mermaid +flowchart TD + REQ[Incoming Request with Idempotency Key] --> LOOKUP{Key Exists?} + LOOKUP -->|No| RESERVE[Reserve Key as In-Progress] + RESERVE --> EXEC[Execute Business Operation] + EXEC --> STORE[Store Final Result / Object Reference] + STORE --> RESP[Return Response] + LOOKUP -->|Yes same payload| RETURN[Return Existing Result] + LOOKUP -->|Yes different payload| CONFLICT[Reject as Idempotency Conflict] +``` + +### 6.6 Storage Strategies + +| Strategy | Pros | Cons | +|---|---|---| +| Relational table with unique constraint | durable, transactional, simple to reason about | can become hot under heavy traffic | +| Redis plus durable backing store | low latency | needs careful durability and failover semantics | +| Business-object uniqueness only | simpler for narrow cases | not enough for generic API retries | +| Workflow engine state store | good for long-running operations | heavier architecture | + +In financial systems, a relational durable store is often the safest default for write paths that truly move money. + +### 6.7 Expiration Strategies + +Idempotency keys should not necessarily live forever, but expiring them too quickly is dangerous. + +Considerations: + +- retry window of clients and background jobs +- provider response delays +- support workflows that may replay operations +- fraud or abuse risks from unbounded storage + +A practical approach is: + +- keep keys for a bounded period such as 24 hours or several days for payment creation APIs +- rely on business-level uniqueness constraints for longer-term protection + +### 6.8 Webhook Idempotency + +Webhook handlers also need idempotency. 
+ +Use multiple layers: + +- deduplicate provider event IDs +- also make object-state transitions idempotent +- reject invalid backward transitions unless explicitly allowed + +Why both matter: + +- the same event may be delivered multiple times +- different events may refer to the same payment object +- events may arrive out of order + +### 6.9 Exactly-Once Myths + +Exactly-once delivery across clients, APIs, workers, providers, and webhooks is not a realistic assumption. + +What you want instead: + +- idempotent APIs +- idempotent event processing +- durable business object references +- reconciliation to catch misses + +### 6.10 Designing Safe Retries + +Safe retry design usually means: + +- deterministic operation identity +- state machine aware transitions +- provider-side idempotency support when available +- ability to query by external reference after ambiguous failures + +### 6.11 Real-World Example: Duplicate Checkout Prevention + +Imagine a customer presses "Pay" twice because the page looked frozen. + +A robust system uses: + +- one order ID +- one payment intent for that order +- one client-generated idempotency key per payment create or confirm action +- a unique constraint that only one successful charge can attach to the order + +That layered design is how systems like Stripe-integrated checkouts avoid accidental duplicate charges. + +## 7. Reconciliation + +Reconciliation is the process of comparing internal financial records with external records to ensure they match. + +### 7.1 What Reconciliation Is + +Reconciliation answers questions like: + +- did every captured payment in our system appear in provider reports? +- did every provider settlement correspond to a ledger entry? +- did our bank receive the expected payout amount? +- are refunds, fees, and chargebacks reflected correctly? + +### 7.2 Why Reconciliation Exists + +Reconciliation exists because your code is not the only actor in the system. 
+ +Even if your application logic is correct, discrepancies still happen because of: + +- delayed webhooks +- provider bugs or temporary outages +- manual operations +- settlement timing differences +- duplicate or missing files +- currency conversion and fee differences + +This is why reconciliation is mandatory even when your code is "correct". + +### 7.3 Internal vs External Reconciliation + +| Type | What it compares | Example | +|---|---|---| +| Internal reconciliation | your own systems against each other | order amount matches invoice, payment, and ledger postings | +| External reconciliation | your systems against providers and banks | captured payments match processor report and bank payout | + +Strong platforms do both. + +### 7.4 Payment Provider Reconciliation + +Provider reconciliation usually compares internal payment records against: + +- provider event feeds +- settlement reports +- fee reports +- refund reports +- dispute reports + +Matching keys often include: + +- external payment ID +- merchant reference or order ID +- amount and currency +- settlement date + +### 7.5 Bank Reconciliation + +Bank reconciliation compares expected cash movement against actual bank statements or payout reports. + +This matters because provider-level success does not always mean the merchant bank account received the exact expected cash at the expected time. + +### 7.6 Settlement Verification + +Settlement verification answers: + +- was a captured payment settled? +- was the correct fee deducted? +- did the merchant or platform receive the expected net amount? +- were refunds and chargebacks netted correctly? + +Marketplace platforms care deeply about this because payouts to sellers depend on correct net settlement calculations. 
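The settlement verification questions above reduce to a net-amount check per settlement batch. The following is a rough sketch, assuming a single currency, integer minor units (such as cents), and a hypothetical `SettlementBatch` shape; real processor reports carry more fields and may net amounts differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SettlementBatch:
    # All amounts are integer minor units (for example cents) in one currency.
    captured: int
    fees: int
    refunds: int
    chargebacks: int
    reported_net: int  # net amount stated by the processor or bank report


def expected_net(batch: SettlementBatch) -> int:
    """Net amount we expect to receive after fees, refunds, and chargebacks."""
    return batch.captured - batch.fees - batch.refunds - batch.chargebacks


def settlement_difference(batch: SettlementBatch) -> int:
    """Reported minus expected; 0 means the batch matches exactly."""
    return batch.reported_net - expected_net(batch)


def is_settled_correctly(batch: SettlementBatch, tolerance: int = 0) -> bool:
    """True when the reported net is within the allowed tolerance (default: exact)."""
    return abs(settlement_difference(batch)) <= tolerance
```

A nonzero difference would normally be routed to an exceptions queue with the batch reference attached rather than being corrected automatically.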
+ +### 7.7 Reconciliation Flow + +```mermaid +flowchart LR + INT[Internal Orders / Payments / Ledger] --> MATCH[Reconciliation Engine] + PR[Processor Reports] --> MATCH + BANK[Bank Statements / Payout Files] --> MATCH + MATCH --> OK[Matched Records] + MATCH --> EX[Exceptions Queue] + EX --> OPS[Finance Ops Review] + OPS --> FIX[Adjustments / Replays / Escalation] + FIX --> LED[Ledger Adjustment or Operational Resolution] +``` + +### 7.8 Mismatch Detection + +Common mismatch categories: + +- internal record exists but provider record missing +- provider record exists but internal record missing +- amount mismatch +- currency mismatch +- status mismatch +- settlement date mismatch +- duplicate external events + +### 7.9 Missing Transaction Handling + +When records are missing, the system should not just log and forget. + +Typical workflow: + +1. detect exception automatically +2. classify likely cause +3. attempt automatic replay or status refresh if safe +4. escalate to finance or payment ops queue +5. resolve via adjustment, support action, or provider escalation + +### 7.10 Delayed Event Handling + +Reconciliation systems need time windows and tolerance for delay. + +Example: + +- a payment was captured on day 1 +- settlement report arrives on day 2 +- bank payout arrives on day 3 + +If your recon job assumes all systems should match instantly, it will generate noisy false positives. + +### 7.11 Operational Workflows and Manual Review + +Real finance systems need exception management. + +Manual review may be required for: + +- ambiguous provider outcomes +- chargeback evidence preparation +- missing bank settlement +- suspicious refund patterns +- mismatched currency conversions + +This is a good interview point: strong financial system design includes admin tools and ops queues, not just APIs and tables. 
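The delay tolerance from 7.10 and the exception queue from 7.11 can be combined in a simple classifier that separates "not yet visible" from "confirmed problem". This is only a sketch: the three statuses, the three-day default window, and the function name `classify_record` are assumptions chosen for illustration.

```python
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class ReconStatus(Enum):
    MATCHED = "matched"
    PENDING = "pending"      # inside the tolerance window, not yet a problem
    EXCEPTION = "exception"  # should be routed to the ops review queue


def classify_record(
    captured_on: date,
    internal_amount: int,
    report_amount: Optional[int],
    today: date,
    settlement_window_days: int = 3,
) -> ReconStatus:
    """Compare one internal payment with its external report line, if any.

    Amounts are integer minor units; report_amount is None when no
    external record has been seen yet.
    """
    if report_amount is None:
        # Missing external record: only an exception once the window has passed.
        window = timedelta(days=settlement_window_days)
        if today - captured_on <= window:
            return ReconStatus.PENDING
        return ReconStatus.EXCEPTION
    if report_amount == internal_amount:
        return ReconStatus.MATCHED
    # Amount mismatch is never resolved by waiting, so escalate immediately.
    return ReconStatus.EXCEPTION
```

Keeping `PENDING` distinct from `EXCEPTION` is what prevents a reconciliation job from flooding the ops queue with false positives while settlement reports are still in flight.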
+ +### 7.12 Best Practices and Common Mistakes + +Best practices: + +- store raw provider reports and bank files immutably +- keep reconciliation jobs rerunnable +- use deterministic matching rules and explainable exception categories +- distinguish pending mismatch from confirmed mismatch + +Common mistakes: + +- relying only on webhooks and skipping provider reports +- silently auto-correcting discrepancies without traceability +- failing to account for settlement delay windows +- not exposing exception queues to operations teams + +## 8. Ledger Systems + +The ledger is the core financial memory of the platform. + +### 8.1 What a Ledger System Is + +A ledger system records money movement as financial entries between accounts. + +It is the source of truth for balances, obligations, and historical financial events. + +Examples: + +- wallet balances +- merchant payable balances +- platform fee revenue entries +- stored credits and promotional balances +- payout obligations + +### 8.2 Why a Ledger Exists + +Applications often start with a naive design like this: + +- `user.balance = user.balance + amount` + +That works until you need to answer: + +- why is the balance this number? +- which transaction changed it? +- can we reverse one specific event? +- can we reproduce yesterday's state? +- can auditors follow the trail? + +The ledger exists because balances should be derived from recorded entries, not treated as unexplained mutable facts. + +### 8.3 Append-Only Design + +Good ledgers are append-only. + +That means: + +- new financial events create new entries +- existing entries are not casually edited or deleted +- corrections are represented as reversing or compensating entries + +This is essential for auditability. + +### 8.4 Immutable Financial Records + +Immutability matters because historical financial truth should be reconstructable. 
+ +If a platform changes an old ledger row in place, it may make today's balance look correct while destroying the explanation of how that balance was reached. + +### 8.5 Double-Entry Bookkeeping Basics + +Double-entry bookkeeping means every financial event affects at least two accounts, and the posting remains balanced. + +The core idea is simple: + +- money cannot appear from nowhere +- money cannot disappear without an offsetting explanation + +In interviews, the exact accounting treatment is less important than the principle that every financial event should create balanced entries. + +### 8.6 Debit and Credit Concepts + +Developers often struggle here because debit and credit are not just synonyms for plus and minus. + +They are directions whose effect depends on account type. + +Useful interview-safe intuition: + +- assets and expenses typically increase with debits +- liabilities, equity, and revenue typically increase with credits + +You do not need to present a CPA-level lecture. You do need to show that a financial event should be represented as balanced movement across accounts, not one mutable balance update. + +### 8.7 Example Ledger Posting + +Suppose a platform charges a customer $100 and keeps a $3 fee while $97 is owed to the merchant. + +One possible journal representation is: + +| Account | Entry | +|---|---| +| Processor receivable | Debit $100 | +| Merchant payable | Credit $97 | +| Platform fee revenue | Credit $3 | + +The exact chart of accounts differs by business, but the balanced-entry principle does not. + +### 8.8 Balances Derived from Entries + +A balance table may still exist for performance, but it should usually be a materialized or derived view of ledger entries, not the only source of truth. + +This is a fundamental distinction. 
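The ideas in 8.3 through 8.8 (append-only entries, balanced postings, and balances derived from entries) can be sketched together. This is a minimal in-memory illustration, not a real ledger: the `Line` and `Ledger` names, the account names, and the "debits minus credits" sign convention are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Line:
    account: str
    currency: str
    debit_minor: int = 0    # minor units (cents); never floats for money
    credit_minor: int = 0


class Ledger:
    """Append-only in-memory ledger; a real one would sit on durable storage."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Tuple[Line, ...]]] = []

    def post(self, entry_id: str, lines: List[Line]) -> None:
        totals: Dict[str, int] = defaultdict(int)
        for line in lines:
            if line.debit_minor < 0 or line.credit_minor < 0:
                raise ValueError("amounts must be non-negative")
            totals[line.currency] += line.debit_minor - line.credit_minor
        # Invariant: debits equal credits within each currency.
        if len(lines) < 2 or any(total != 0 for total in totals.values()):
            raise ValueError(f"journal entry {entry_id} is not balanced")
        # Entries are only ever appended, never edited or removed.
        self._entries.append((entry_id, tuple(lines)))

    def balance(self, account: str, currency: str) -> int:
        """Derived balance as debits minus credits (assumed sign convention)."""
        return sum(
            line.debit_minor - line.credit_minor
            for _, lines in self._entries
            for line in lines
            if line.account == account and line.currency == currency
        )


# The $100 charge from 8.7: $97 owed to the merchant, $3 platform fee.
ledger = Ledger()
ledger.post("charge-1", [
    Line("processor_receivable", "USD", debit_minor=10_000),
    Line("merchant_payable", "USD", credit_minor=9_700),
    Line("platform_fee_revenue", "USD", credit_minor=300),
])
```

Under this convention, credit-normal accounts such as `merchant_payable` show a negative derived balance (here -9700), which a reporting layer would typically present as a positive amount owed. A materialized balance table would cache the result of `balance` rather than replace the entries.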
+ +### 8.9 Ledger vs Simple Balance Table + +| Ledger system | Simple balance table | +|---|---| +| append-only history | latest number only | +| supports reconstruction and audit | hard to explain changes | +| supports reversals and corrections | corrections overwrite history | +| good for compliance and reconciliation | fragile under concurrency and debugging | +| more complex to build | simpler initially | + +### 8.10 Ledger Posting Flow + +```mermaid +flowchart LR + EV[Business Event\npayment capture refund payout] --> POST[Posting Engine] + POST --> RULES[Posting Rules / Chart of Accounts] + RULES --> JE[Journal Entry with Balanced Lines] + JE --> LED[Append-Only Ledger] + LED --> BAL[Materialized Balances] + LED --> STMT[Statements / Reporting / Reconciliation] +``` + +### 8.11 Reversals Instead of Updates + +If a refund happens, the ledger should usually record a new reversing or compensating event instead of editing the original capture entry. + +Why this matters: + +- preserves history +- makes reconciliation explainable +- supports financial close and audit workflows + +### 8.12 Auditability and Correctness Guarantees + +Good ledger systems enforce invariants such as: + +- journal entries must balance +- account and currency must be explicit +- financial periods may be locked after close +- external references must be preserved + +### 8.13 Wallet and Platform Examples + +Examples where ledgers matter deeply: + +- Stripe Connect-like marketplace balances +- PayPal wallet balances +- Uber driver earnings and adjustments +- Shopify merchant payouts and fee deductions +- SaaS customer credit balances + +### 8.14 Common Mistakes + +Common mistakes include: + +- using floating point for money +- mixing currencies in the same account without explicit conversion events +- updating balances directly without append-only entries +- letting business services write arbitrary ledger rows without a posting engine + +### 8.15 Scaling Considerations + +What changes at 
scale: + +- very hot accounts may need partitioning or sharding +- balances may need snapshotting for fast reads +- backfills need controlled replay semantics +- period close and replay rules need strong governance + +Best practice: centralize posting logic so product teams do not each invent their own accounting behavior. + +## 9. Billing System + +Billing is the system that decides what should be charged, when, and under what pricing rules. + +### 9.1 Billing System Overview + +Billing is not the same thing as payments. + +| System | Core question | +|---|---| +| Billing | what should this customer owe? | +| Payments | did we successfully collect money? | +| Invoicing | what formal document shows the charge? | +| Ledger | what are the immutable financial movements? | + +Why billing and payments are separate: + +- a customer can owe money before payment happens +- payment can fail while the invoice remains valid +- some businesses bill monthly but collect later by wire or ACH +- some charges are usage-derived long after the product event happened + +### 9.2 Billing Architecture + +```mermaid +flowchart LR + PE[Product Events\nAPI calls seats storage compute] --> MTR[Metering Ingestion] + MTR --> AGG[Usage Aggregation] + CAT[Plan Catalog\nprice books discounts] --> RATE[Rating Engine] + AGG --> RATE + SUB[Subscription Contracts] --> RATE + RATE --> INV[Invoice Engine] + INV --> PAY[Payment Collection] + INV --> AR[Accounts Receivable] + PAY --> LEDGER[Ledger] + INV --> LEDGER + LEDGER --> ACC[Accounting Export] + PAY --> ENT[Entitlement / Service Control] + INV --> NOTIFY[Invoice Email / Customer Portal] +``` + +### 9.3 Usage-Based vs Subscription Billing + +| Model | Example | Technical implications | +|---|---|---| +| Pure subscription | Netflix monthly plan | scheduled renewals, simpler predictable invoices | +| Usage-based | API calls, storage, compute hours | metering, aggregation, late events, rating complexity | +| Seat-based | GitHub or SaaS user seats | 
seat snapshots, proration, entitlement sync | +| Hybrid | Shopify plan plus transaction fees | combine recurring and usage or fee-derived line items | + +### 9.4 Pricing Model Design + +Pricing is partly a product decision and partly a systems design problem. + +Technical questions pricing creates: + +- can pricing rules be versioned? +- can invoices explain the charge simply? +- can finance reconcile it? +- can sales and support understand it? +- can the system backfill or re-rate if needed? + +Bad pricing models are often hard not because math is hard, but because they create confusing operational behavior. + +### 9.5 Usage Tracking + +Usage tracking powers usage-based billing and parts of seat-based or entitlement billing. + +#### 9.5.1 Usage Metering Basics + +Metering means recording billable events such as: + +- API requests +- storage GB-months +- compute minutes or instance-hours +- messages sent +- seats active during a billing window + +#### 9.5.2 Event Ingestion + +Meter events usually arrive through: + +- synchronous product write path +- async message streams +- log pipelines +- batch imports from service usage systems + +Best practice: persist raw usage events before aggregation so billing can be recomputed if needed. + +#### 9.5.3 Usage Aggregation + +Aggregation transforms raw events into billable quantities. + +Examples: + +- total API calls by account per day +- average daily active seats in cycle +- total storage byte-hours converted to GB-month + +Aggregation often needs a dedicated pipeline because raw event volume is too large for invoice-time computation. + +#### 9.5.4 Deduplication + +Usage events are often duplicated by retries or replay. + +Dedup strategies include: + +- event IDs unique per producer +- idempotent upserts into raw event store +- windowed dedup during aggregation + +#### 9.5.5 Delayed and Late-Arriving Usage + +Late usage is normal. 
+ +Examples: + +- a region buffers logs and ships them later +- mobile device usage syncs late +- a downstream service republishes events after outage recovery + +This forces you to choose a policy: + +- hold invoice finalization until watermark is reached +- close invoice on time and roll late usage into next cycle +- issue an adjustment invoice later + +There is no universal answer. It depends on customer expectations and finance policy. + +#### 9.5.6 Billing Windows + +Billing windows define which usage belongs to which cycle. + +You need clear answers for: + +- timezone and cutoff rules +- how retries across midnight are handled +- how backdated corrections are posted + +#### 9.5.7 Prepaid vs Postpaid Usage + +| Model | Meaning | Example | +|---|---|---| +| Prepaid | customer buys credits or balance in advance | ad platforms, wallet systems, prepaid API credits | +| Postpaid | usage is measured first, billed later | cloud compute, SaaS overage billing | + +Prepaid systems care more about balance control and real-time enforcement. Postpaid systems care more about accurate metering and invoice correctness. + +#### 9.5.8 Accuracy Guarantees + +Real systems rarely guarantee perfect exactly-once event ingestion. Instead they aim for: + +- durable raw event retention +- deterministic aggregation logic +- replay capability +- reconciliation between product metrics and billable usage + +#### 9.5.9 Fraud Prevention Basics + +Usage billing can be abused. 
+ +Examples: + +- account takeover causing huge compute spend +- self-generated fake usage for promotional credit abuse +- duplicated or forged meter events + +Basic controls: + +- signed or authenticated meter producers +- anomaly detection on usage spikes +- spend caps and alerts +- quota-based temporary throttling + +#### 9.5.10 Real-World Examples + +- API platforms bill on requests or tokens consumed +- cloud storage bills on byte-hours or GB-months +- compute platforms bill on runtime duration or instance-hours +- GitHub-like SaaS products may bill on active seats or premium features enabled + +### 9.6 Subscription Plans + +#### 9.6.1 Pricing Models + +| Model | Example | System implications | +|---|---|---| +| Flat rate | one plan, one price | easiest billing, simplest invoices | +| Seat-based | per user or active seat | seat counting rules, proration, entitlement sync | +| Tiered pricing | first 100 units one rate, next 900 another | complex rating and customer explanation | +| Usage-based | pay per request, GB, minute | metering pipeline required | +| Hybrid | base subscription plus overages | multiple billing engines meet on one invoice | + +#### 9.6.2 Feature Entitlements + +Feature access should not be hard-coded to plan name strings. + +Good design uses: + +- plan version +- feature flags or entitlement policies +- effective date ranges + +Why this matters: + +- marketing renames plans +- enterprise exceptions happen +- grandfathered customers need older feature bundles + +#### 9.6.3 Enterprise Custom Plans + +Enterprise contracts often include: + +- custom pricing +- annual commitments +- manual invoicing +- negotiated payment terms like net 30 or net 60 +- true-up charges later + +This is why billing platforms need enough flexibility to support both self-serve checkout and finance-managed accounts receivable flows. + +#### 9.6.4 Discounts, Coupons, and Promotions + +Discount systems need careful modeling. 
+ +Questions to answer: + +- percent or flat discount? +- one-time or recurring? +- plan-limited or account-wide? +- does it apply before or after tax? +- does it affect revenue reporting? + +#### 9.6.5 Grandfathered Plans + +When pricing changes, many existing customers keep old terms. + +Engineering implication: + +- do not mutate the current plan price in place and assume history will still make sense +- create new plan versions +- store which version each subscription is attached to + +#### 9.6.6 Plan Versioning + +Plan versioning is one of the most important billing design habits. + +Without it, you cannot safely explain historical invoices after pricing or feature changes. + +### 9.7 Refunds + +Refunds connect customer support, payments, billing, ledger, and fraud controls. + +#### 9.7.1 Full and Partial Refunds + +Refunds can be: + +- full refund of the whole captured amount +- partial refund of specific amount or line items +- service credit instead of cash refund + +#### 9.7.2 Refund Approval Flows + +Not every refund should be a direct API call. + +Common approval patterns: + +- automated refund within policy limits +- support-initiated refund with permission checks +- manager approval for large amounts +- finance review for old or exceptional cases + +#### 9.7.3 Refund Timing Constraints + +Refund behavior depends on payment stage. + +- if payment was only authorized, you may be able to void instead of refund +- if captured but not fully settled, provider behavior may differ +- bank transfer refunds may require separate payout instructions + +#### 9.7.4 Asynchronous Refund Completion + +Refunds are often asynchronous. + +That means the system should track states like: + +- refund requested +- refund submitted to provider +- refund pending +- refund succeeded +- refund failed + +#### 9.7.5 Ledger Reversal Handling + +A refund should create compensating financial entries, not erase the original charge. 
+ +That usually means: + +- reduce receivable or cash position as appropriate +- reduce merchant payable or reverse revenue where applicable +- link refund entries to original payment event + +#### 9.7.6 Abuse Prevention and Fraud Considerations + +Refund systems can be abused. + +Examples: + +- compromised support account issuing fraudulent refunds +- customer requesting repeated partial refunds across channels +- refunding to a payment method not tied to the original transaction + +Controls: + +- permissioned refund roles +- approval thresholds +- audit logs for every refund action +- anomaly detection on refund velocity + +#### 9.7.7 Accounting Implications + +Refunds affect more than payment state. + +They can affect: + +- revenue reporting +- tax adjustments +- merchant payable balances +- customer statements + +#### 9.7.8 Safe Operational Refund Design + +```mermaid +flowchart LR + SUP[Support / Customer Request] --> POLICY[Refund Policy Engine] + POLICY --> APPROVE{Needs Approval?} + APPROVE -->|Yes| MGR[Manager / Finance Approval] + APPROVE -->|No| REQ[Create Refund Request] + MGR --> REQ + REQ --> PAY[Payment Provider Refund API] + REQ --> LED[Ledger Reversal Pending] + PAY --> WH[Refund Webhook / Status Update] + WH --> LED2[Ledger Reversal Finalized] + WH --> NOTIF[Customer Notification] + WH --> AUD[Audit Log] +``` + +### 9.8 Audit Logs + +Audit logs are not the same as application logs. + +#### 9.8.1 Why Audit Logs Matter + +Audit logs answer questions like: + +- who issued this refund? +- who changed billing settings? +- when was a plan changed? +- who updated payout bank details? +- who overrode a failed payment and granted access? 
+ +#### 9.8.2 Compliance and Operational Requirements + +Financially relevant actions often require durable traceability for: + +- internal investigations +- fraud reviews +- regulatory or audit requests +- customer disputes + +#### 9.8.3 What Audit Logs Typically Capture + +- actor identity +- action type +- target object +- before and after state where allowed +- timestamp +- request ID or trace ID +- approval context if applicable + +#### 9.8.4 Immutable Event History + +Audit logs are usually append-only and write-restricted. + +Why: + +- if admins can casually edit audit history, the log loses its value +- investigations need evidence quality, not best-effort debugging + +#### 9.8.5 Admin Action Tracking + +Particularly sensitive actions include: + +- refunds +- manual captures +- invoice voids +- plan price changes +- bank account or payout changes +- permission grants for support and finance roles + +#### 9.8.6 Financial Investigation Workflows + +When something goes wrong, investigators often need to correlate: + +- customer-facing event +- internal admin actions +- payment provider events +- ledger postings +- reconciliation exceptions + +Strong systems make that traceability possible with consistent IDs. 
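The audit record fields from 9.8.3 and the append-only property from 9.8.4 can be sketched as follows. Hash chaining is used here as one possible way to make edits detectable; it is an assumption of this example rather than something the guide prescribes, and the `AuditLog` class and its field names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


class AuditLog:
    """In-memory append-only log; each record commits to the previous hash."""

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    @staticmethod
    def _digest(body: Dict[str, Any]) -> str:
        # Canonical JSON so the same content always hashes the same way.
        encoded = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(encoded.encode("utf-8")).hexdigest()

    def append(self, actor: str, action: str, target: str, timestamp: str,
               request_id: str, before: Optional[dict] = None,
               after: Optional[dict] = None) -> Dict[str, Any]:
        body = {
            "actor": actor,
            "action": action,
            "target": target,
            "timestamp": timestamp,
            "request_id": request_id,
            "before": before,
            "after": after,
            "prev_hash": self._records[-1]["hash"] if self._records else "",
        }
        record = dict(body, hash=self._digest(body))
        self._records.append(record)
        return record

    def verify(self) -> bool:
        """Recompute the chain.

        An edited record, or one removed from the middle, breaks the chain.
        Truncation at the tail is not detected by this sketch.
        """
        prev = ""
        for record in self._records:
            body = {k: v for k, v in record.items() if k != "hash"}
            if body["prev_hash"] != prev or self._digest(body) != record["hash"]:
                return False
            prev = record["hash"]
        return True
```

Carrying the same `request_id` through payment events, ledger postings, and audit records is what makes the correlation described in 9.8.6 practical.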
+ +#### 9.8.7 Retention Strategy and Tamper Resistance + +Common design choices: + +- long retention for financially relevant events +- restricted delete permissions +- append-only storage patterns +- checksums or immutable storage layers for high-sensitivity contexts + +#### 9.8.8 Application Logs vs Audit Logs + +| Topic | Application logs | Audit logs | +|---|---|---| +| Purpose | debugging and observability | accountability and investigation | +| Retention | often shorter | often longer | +| Editability | may be reprocessed or rotated freely | should be tightly controlled | +| Structure | often operational and noisy | structured and action-focused | +| Example | HTTP 500 from refund API | admin user 42 approved refund of $500 | + +Common mistake: assuming standard service logs are enough for auditability. + +## 10. How These Systems Connect in Real Architecture + +The best way to understand the whole stack is to follow a realistic end-to-end flow. + +### 10.1 SaaS Subscription Example + +Imagine a GitHub-like SaaS product with seat-based billing and annual plans. + +1. Customer selects a plan and seat count. +2. Subscription service creates the contract and stores plan version. +3. Invoice engine generates the first invoice, applying discount and tax. +4. Payment service creates a payment intent and charges the default card. +5. Provider returns initial result, then sends async confirmation webhook. +6. Ledger posts the charge and resulting receivable or revenue flows. +7. Entitlement service activates the plan after policy conditions are met. +8. Renewal scheduler later creates the next billing cycle. +9. Usage and seat changes may create proration line items. +10. Reconciliation confirms settlement and fee correctness. + +### 10.2 Marketplace Example + +Imagine a Shopify-like or Uber-like platform. + +1. Customer pays the platform. +2. Payment is authorized and later captured. +3. Ledger records customer payment, platform fee, and merchant or driver payable. 
+4. Refunds or chargebacks later create adjusting entries. +5. Payout system sends net funds to merchant or driver. +6. Reconciliation confirms provider settlement and bank payout. + +This example shows why ledgers are central in platforms that hold and distribute money between parties. + +### 10.3 Common Architectural Separation + +Good financial architectures often split into services like: + +- checkout or payment orchestration +- provider integration adapters +- invoice engine +- subscription service +- metering and rating pipeline +- ledger or accounting posting service +- reconciliation service +- audit and admin tooling + +This separation exists because each domain has different correctness rules, scaling patterns, and operational teams. + +## 11. Common Interview Discussions + +When interviewers ask about financial systems, they usually care less about memorizing provider jargon and more about whether you understand failure and correctness. + +### 11.1 Questions You Should Be Ready For + +- How do you prevent duplicate charges? +- How do you model payment state transitions? +- Why do you need a ledger instead of a balance column? +- How do you handle provider webhooks arriving late or twice? +- Why is reconciliation needed if your service is correct? +- How would you support partial refunds and chargebacks? +- How do you design subscription retries and dunning? +- Why are billing and payments separate systems? 
+ +### 11.2 Strong Talking Points + +Strong answers usually include: + +- idempotency at API and event-processing layers +- explicit state machines for long-lived payment workflows +- append-only ledger entries with reversals instead of mutation +- asynchronous processing and webhook handling +- reconciliation against provider and bank data +- operational tooling for refunds, disputes, and manual review + +### 11.3 Weak Talking Points + +Weak answers often sound like: + +- "I would just store payment status in a table" +- "I would make sure events are exactly once" +- "If the request times out, I would retry" +- "The invoice can just be edited if something changes" + +Those answers ignore the real complexity of money systems. + +## 12. Common Production Mistakes + +These are the mistakes that repeatedly break real systems. + +1. Using floating point for money. +2. Treating payment success as final truth without waiting for async confirmation where needed. +3. Not separating order, invoice, payment, and ledger state. +4. Forgetting idempotency on write APIs and webhook consumers. +5. Updating financial records in place instead of using reversals or versioned models. +6. Skipping reconciliation because "our DB is correct". +7. Designing no admin tooling for refunds, disputes, and exceptions. +8. Hard-coding plan names and pricing instead of versioning them. +9. Using the balance table as the only source of truth. +10. Ignoring currency, tax, and settlement timing edge cases. + +## 13. Practical Best Practices Checklist + +If you want an interview answer that also sounds production-ready, these are the habits to emphasize. 
+ +- represent money in minor units with explicit currency +- use business IDs and provider IDs together +- design state machines, not booleans +- keep financial records append-only where possible +- make retries safe with idempotency +- verify and store raw webhook events +- build reconciliation jobs from the start +- maintain audit logs for sensitive operations +- version pricing, plans, and invoice-affecting rules +- give operations teams visibility and controls + +## 14. Final Mental Model + +If you remember only one thing, remember this: + +Financial systems are not just about moving money. They are about maintaining trust under retries, ambiguity, delay, failure, and scrutiny. + +A strong design usually separates: + +- business events +- billable calculation +- payment execution +- legal documents +- immutable financial recording +- external verification +- operational accountability + +That is the mental model behind real payment platforms, subscription businesses, marketplaces, and SaaS billing systems. + +And that is also the mental model interviewers are usually trying to detect. diff --git a/systems design/9.internalOps.md b/systems design/9.internalOps.md new file mode 100644 index 0000000..7f7e87f --- /dev/null +++ b/systems design/9.internalOps.md @@ -0,0 +1,1545 @@ +# Internal Operations + +Internal operations are the systems a company uses to run the product after launch. They include admin tools, moderation consoles, support dashboards, feature-flag and configuration control planes, operational overrides, event pipelines, reporting systems, and BI layers. In interviews, many candidates describe the user-facing request path and stop there. In production, that is only half the architecture. 
+ +Real companies also need ways for employees and automated control loops to: + +- inspect state safely +- change system behavior without a redeploy +- investigate incidents +- enforce policy and trust rules +- resolve customer issues without pulling engineers into every ticket +- understand product behavior without hurting transactional workloads + +If these internal systems are weak, the product becomes expensive and fragile to operate. Support teams depend on engineers for routine tickets. Moderators cannot keep up with abuse spikes. Incident responders have no safe kill switch. Analysts query the production database and slow down user traffic. Every team builds a one-off internal panel with different permissions and no audit log. + +This guide treats internal operations as first-class system design. It is written for interview preparation, but it is also meant to build strong production intuition. + +## 1. Big Picture: What Internal Operations Actually Are + +The easiest mistake is to think of internal operations as "just dashboards." That is too shallow. + +Internal operations are the part of the system that lets the company operate, govern, debug, and understand the product. Product systems serve end users. Internal systems serve staff, automated workflows, and business decision-makers. 
+ +### 1.1 Three Planes You Should Keep Separate Mentally + +| Plane | Primary users | Main job | Typical workload | What matters most | +|---|---|---|---|---| +| Product data plane | end users, public API clients | serve product behavior | request/response, transactions, low-latency reads and writes | availability, correctness, latency | +| Admin control plane | support, moderators, ops, engineers, automated controls | inspect and change production safely | privileged reads, state-changing operations, policy enforcement | safety, auditability, least privilege | +| Analytics / decision plane | analysts, finance, product, leadership, automated reporting | explain what happened and why | large scans, aggregations, historical analysis | correctness, consistency, query efficiency | + +The phrase control plane vs data plane is extremely useful in interviews. + +- The data plane does the core product work: create orders, process payments, serve feeds, store messages, update listings. +- The control plane tells the system how to behave or helps humans intervene: disable a feature, suspend an account, reroute traffic, resend a webhook, review flagged content. + +At small scale, these concerns are often mixed together. At larger scale, separating them becomes mandatory because the operational requirements are different. + +### 1.2 Why Internal Operations Are Underrated + +Internal systems are often underestimated because they do not look customer-facing. But they directly affect customer outcomes. + +Examples: + +- A Stripe-like payments company needs support staff to inspect payment attempts, webhooks, disputes, and risk decisions quickly. Without this, every customer issue becomes an engineering ticket. +- A GitHub-like developer platform needs internal views of repositories, organizations, abuse flags, account state, and audit records. Without this, abuse handling and support become chaotic. 
+- A marketplace needs moderation and risk tooling to detect scams, fake listings, refund abuse, and policy violations before trust collapses. +- A Netflix-like platform needs operational controls to disable problematic features, shift traffic, and understand viewer behavior without slowing the streaming experience. + +The operational leverage is enormous. Good internal systems reduce mean time to resolve incidents, reduce manual toil, and reduce the number of engineers needed to support day-to-day business operations. + +### 1.3 Why Product Systems and Internal Systems Diverge + +User-facing systems and internal systems usually diverge on five axes: + +| Dimension | User-facing systems | Internal systems | +|---|---|---| +| Latency expectations | often tight, user-visible | usually looser, but correctness is stricter | +| Access model | broad but low privilege | small user set but very high privilege | +| Data shape | product-oriented, transactional | cross-cutting, investigative, operational | +| Failure cost | user-facing downtime | privileged mistakes, compliance failures, operational paralysis | +| Evolution pattern | optimized around product features | optimized around workflows, audits, and safety controls | + +This is why serious companies eventually stop letting staff run direct SQL or use ad hoc scripts against production. They build explicit internal products with proper authorization, auditing, and safe abstractions. 
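The move from direct SQL to explicit internal products can be illustrated with a single purpose-built admin action. This is a rough sketch under stated assumptions: the role names, the permission strings, and the `AdminAPI` class are hypothetical, and a real system would load policy from a dedicated service and write audit records to durable storage.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical role-to-permission map for illustration only.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "support_agent": {"account.view", "webhook.resend"},
    "trust_safety": {"account.view", "account.suspend"},
}


class PermissionDenied(Exception):
    pass


@dataclass
class AdminAPI:
    accounts: Dict[str, str]                       # account_id -> status
    audit: List[dict] = field(default_factory=list)

    def suspend_account(self, actor: str, role: str,
                        account_id: str, reason: str) -> None:
        """A narrow, named action instead of an arbitrary UPDATE statement."""
        allowed = "account.suspend" in ROLE_PERMISSIONS.get(role, set())
        # Record the attempt whether or not it is allowed.
        self.audit.append({
            "actor": actor,
            "role": role,
            "action": "account.suspend",
            "target": account_id,
            "reason": reason,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionDenied(f"{role} may not suspend accounts")
        if account_id not in self.accounts:
            raise KeyError(account_id)
        self.accounts[account_id] = "suspended"
```

The design choice worth noting is that denied attempts are audited too, since a pattern of rejected privileged calls is itself a useful insider-risk signal.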
+ +### 1.4 High-Level Architecture + +```mermaid +flowchart TB + subgraph DataPlane[Product Data Plane] + Users[End users] --> App[Web / Mobile / Public APIs] + App --> Services[Backend services] + Services --> OLTP[(Operational DBs)] + Services --> Cache[(Caches)] + end + + subgraph ControlPlane[Admin Control Plane] + Staff[Support / Moderators / Ops / Engineers] --> AdminUI[Internal tools] + AdminUI --> AdminAPI[Internal admin APIs] + AdminAPI --> Policy[RBAC / approvals / policy checks] + Policy --> Flags[Feature flags / runtime config] + Policy --> Actions[Support, moderation, incident actions] + Actions --> Audit[(Audit log)] + end + + Services -. events .-> Stream[Event stream] + OLTP -. CDC .-> Warehouse[(Warehouse / lake)] + Stream --> Warehouse + Warehouse --> BI[Dashboards / BI / reports] + Stream --> Realtime[(Realtime analytics store)] + Realtime --> OpsDash[Operational dashboards] + Services -. logs / traces .-> SupportRead[Support read models / log index] + AdminAPI --> SupportRead +``` + +The key idea is that internal operations are not one system. They are a family of systems that sit around the product and make it operable. + +## 2. Admin Systems + +Admin systems are the internal interfaces used by staff and automation to inspect, govern, and control the product. + +They exist because production operations cannot scale through engineers manually SSH-ing into boxes, editing config files, or running one-off scripts. That works in emergencies at very small scale. It fails badly once the company has customers, audits, on-call rotations, or regulated data. + +In practice, admin systems often become the operational nervous system of the company. 
+ +### 2.1 What Admin Systems Usually Include + +Common examples: + +- moderation consoles +- customer support dashboards +- account and tenant management tools +- feature flag UIs +- risk review panels +- configuration management interfaces +- back-office operations tools for finance, fulfillment, or marketplace operations +- incident controls such as kill switches and traffic overrides + +### 2.2 Design Principles for Serious Admin Systems + +If the interviewer asks how to design internal tools at scale, these principles are worth stating explicitly: + +1. Do not let the UI write directly to production databases. +2. Route privileged operations through internal APIs with policy checks. +3. Separate read paths from write paths. +4. Make every privileged action auditable. +5. Prefer purpose-built actions over arbitrary mutation. +6. Assume insider risk and accidental misuse. +7. Build for workflow, not just data visibility. + +That last point matters. A weak admin tool shows data. A strong admin tool helps an employee complete a job safely. + +### 2.3 Moderation Tools + +Moderation tools exist when a platform allows user-generated content or user-generated actions that can harm the ecosystem. + +Examples: + +- social platforms: posts, comments, images, videos, direct messages +- marketplaces: listings, seller profiles, reviews, refund behavior +- developer platforms: abuse reports, malware packages, spam accounts, phishing repositories +- fintech and payments products: fraud signals, high-risk accounts, suspicious transaction flows + +The hard part is that moderation is not just classification. It is a policy enforcement system operating under uncertainty. 
+ +#### 2.3.1 Why Moderation Exists + +Without moderation, platforms degrade quickly: + +- abusive users drive away normal users +- scams destroy trust and conversion +- illegal or policy-violating content creates legal and reputational risk +- spam overwhelms genuine content +- employees are forced into reactive manual cleanup + +Moderation is usually a mixture of three things: + +- policy definition +- automated detection +- human review and enforcement + +#### 2.3.2 How Moderation Works Internally + +The common production shape is a layered pipeline: + +1. Inline checks at write time. +2. Asynchronous scoring and enrichment. +3. Queue-based human review for uncertain cases. +4. Enforcement and appeals. + +Inline checks are cheap and fast. They may include: + +- auth and account reputation checks +- rate limits +- blocklists +- duplicate or spam heuristics +- malware scanning for uploads +- file type and size validation + +Asynchronous processing does the heavier work: + +- ML classification +- image or video analysis +- graph-based risk scoring +- cross-account linkage detection +- policy-specific rule evaluation + +This separation matters because some decisions must happen immediately, but others are too expensive or uncertain for the request path. 
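The split between inline checks and asynchronous scoring can be sketched as a write path that runs only cheap checks and hands everything else to a queue. This is a simplified illustration: the blocklist entry, the size limit, and the `WritePath` class are hypothetical, the in-memory `deque` stands in for an event stream, and the per-author counter has no time window, which a real rate limiter would need.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Set

BLOCKLIST: Set[str] = {"bad-domain.example"}   # hypothetical blocklist entry
MAX_BYTES = 5_000                              # hypothetical size limit


@dataclass
class Submission:
    author_id: str
    text: str


class WritePath:
    def __init__(self, per_author_limit: int = 3) -> None:
        self.per_author_limit = per_author_limit
        self.counts: Dict[str, int] = {}
        self.async_queue: Deque[Submission] = deque()  # stand-in for a stream

    def submit(self, item: Submission) -> str:
        # Inline checks: cheap and deterministic, safe for the request path.
        if len(item.text.encode("utf-8")) > MAX_BYTES:
            return "rejected:too_large"
        if any(term in item.text for term in BLOCKLIST):
            return "rejected:blocklist"
        seen = self.counts.get(item.author_id, 0)
        if seen >= self.per_author_limit:
            return "rejected:rate_limited"
        self.counts[item.author_id] = seen + 1
        # Heavier scoring (ML, graph signals) happens later, off the request path.
        self.async_queue.append(item)
        return "accepted:pending_async_review"
```

Only accepted submissions count toward the author's limit in this sketch; whether rejected attempts should also count is a policy decision.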
+ +#### 2.3.3 Moderation Decision Flow + +```mermaid +flowchart LR + Submit[User submits content or listing] --> Inline[Inline checks: auth, rate limit, blocklists, malware, spam heuristics] + Inline -->|clear violation| Quarantine[Quarantine or reject] + Inline -->|allowed or uncertain| Publish[Store content and emit moderation event] + Publish --> Models[Rules engine + ML scoring + risk enrichment] + Models --> Decision{Confidence and policy outcome} + Decision -->|high confidence safe| Live[Leave content live] + Decision -->|high confidence unsafe| AutoAction[Auto remove or restrict reach] + Decision -->|uncertain| Queue[Human review queue] + Queue --> Reviewer[Moderator console] + Reviewer --> Action{Moderator action} + Action --> Remove[Remove content] + Action --> Shadow[Shadow ban / limit distribution] + Action --> Warn[Warning / strike] + Action --> Suspend[Account suspension] + Action --> Escalate[Escalate to specialist or legal] + Quarantine --> Audit[(Audit trail)] + Live --> Audit + AutoAction --> Audit + Remove --> Audit + Shadow --> Audit + Warn --> Audit + Suspend --> Audit + Escalate --> Audit +``` + +#### 2.3.4 Human-in-the-Loop Moderation + +Human reviewers remain necessary because policy is ambiguous and context-dependent. + +Typical workflow: + +- automation assigns a risk score and recommended action +- the system routes items into queues by severity, language, region, policy type, or SLA +- moderators review evidence, policy excerpts, account history, and previous decisions +- the tool records the decision, reason code, evidence, policy version, and actor identity + +This queue-based design is critical. If every suspicious item blocked the write path, the system would become too slow and too costly. If everything were auto-approved, abuse would slip through. Queues let the company apply scarce human attention where it matters most. 
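
A rough sketch of the routing step, assuming a single risk score in the range 0 to 1 and an integer severity. The thresholds and the `ReviewQueue` class are illustrative assumptions; real systems typically route on many more attributes (language, region, policy type, SLA).

```python
import heapq
import itertools

# Assumed thresholds; in practice these vary per policy and are tuned over time.
AUTO_ACTION_THRESHOLD = 0.95  # at or above: act automatically
SAFE_THRESHOLD = 0.10         # at or below: leave content live


def route(risk_score: float) -> str:
    """Send only the uncertain middle band to human reviewers."""
    if risk_score >= AUTO_ACTION_THRESHOLD:
        return "auto_action"
    if risk_score <= SAFE_THRESHOLD:
        return "leave_live"
    return "human_review"


class ReviewQueue:
    """Priority queue: most severe items first, FIFO within a severity."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving arrival order

    def push(self, item_id: str, severity: int) -> None:
        # heapq is a min-heap, so negate severity to pop the highest first.
        heapq.heappush(self._heap, (-severity, next(self._counter), item_id))

    def pop(self) -> str:
        _, _, item_id = heapq.heappop(self._heap)
        return item_id
```

Widening or narrowing the band between the two thresholds is the main lever for trading reviewer workload against automation risk.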
+ +#### 2.3.5 Common Moderation Actions + +| Action | What it does | Typical use | Risk | +|---|---|---|---| +| Remove content | hides or tombstones a post, listing, or media object | clear policy violation | over-removal hurts trust and engagement | +| Shadow ban / reduce distribution | limits visibility without explicit hard deletion | spam, coordinated manipulation, low-confidence abuse | opacity can create fairness concerns | +| Warning / strike | notifies user and records a policy incident | first offense, borderline behavior | inconsistent enforcement creates appeals load | +| Account suspension | blocks future actions, sometimes temporarily | repeated or severe abuse | mistaken suspensions damage trust | +| Escalation / legal hold | preserves evidence and routes to specialists | safety, fraud, legal, regulatory issues | slow path can create backlog | + +In production, many platforms do not immediately hard-delete evidence. They hide the content in the product but retain an access-controlled copy for audits, appeals, and legal obligations. + +#### 2.3.6 Precision, Recall, and Abuse Tradeoffs + +Moderation is full of false-positive and false-negative tradeoffs. + +- Tuning for high precision means the system rarely flags innocent content, but it tends to miss more bad content. +- Tuning for high recall means the system catches more bad content, but it tends to create more false positives. + +Which side you prefer depends on the domain. + +- A children-focused or safety-critical platform may prefer more aggressive enforcement that favors recall. +- A professional collaboration tool may optimize harder for avoiding false suspensions. +- A marketplace may auto-hide obviously fraudulent listings quickly but send many edge cases to review. + +Interviewers often like hearing that the right answer depends on policy severity, appeal cost, legal requirements, and user trust.
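
To make the tradeoff concrete, here is a small sketch that computes precision and recall for a flagging threshold. The scores and labels are made-up illustration data, not measurements from any real system.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when everything scoring >= threshold is flagged.

    labels[i] is True when item i really violates policy.
    """
    tp = fp = fn = 0
    for score, is_bad in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_bad:
            tp += 1  # correctly flagged
        elif flagged and not is_bad:
            fp += 1  # innocent content flagged
        elif not flagged and is_bad:
            fn += 1  # bad content missed
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Made-up model scores and ground-truth labels.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
labels = [True, True, False, True, True, False]

strict = precision_recall(scores, labels, threshold=0.8)   # (1.0, 0.5)
lenient = precision_recall(scores, labels, threshold=0.3)  # (0.8, 1.0)
```

With the strict threshold nothing innocent is flagged but half the violations are missed; with the lenient threshold every violation is caught at the cost of one false positive.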
+ +#### 2.3.7 Real-World Production Patterns + +Common large-platform patterns include: + +- social media systems using ML plus policy reviewers for posts, comments, and media +- marketplace systems scoring listings, sellers, and buyer behavior to catch scams and counterfeit activity +- ride-sharing or delivery platforms reviewing safety incidents, identity issues, and abuse reports with escalation queues +- payments systems using risk signals, manual review queues, and specialized compliance teams + +The implementation is usually distributed: + +- the core product service stores the content or entity +- an event is published into a stream +- moderation services enrich and classify +- a review system builds work queues +- an enforcement service writes policy decisions back to product systems +- audit logs capture who did what and why + +#### 2.3.8 Failure Cases at Scale + +What breaks in production: + +- queue backlog grows faster than humans can review +- model drift causes sudden false positives after a product change +- policy changes are not versioned, so old decisions become impossible to interpret +- evidence is not snapshotted, so moderators review a moving target +- global products fail to account for language and regional policy differences +- internal reviewers have too much power and too little audit oversight + +#### 2.3.9 Best Practices + +- version policies and store the policy version with each decision +- keep moderation decisions append-only where possible +- store reason codes, evidence references, and actor identity +- support appeals and second-level review +- separate recommendation from final enforcement when confidence is low +- measure reviewer throughput, queue depth, decision consistency, and appeal overturn rate + +#### 2.3.10 Interview Discussion Angles + +When asked to design moderation at scale, strong answers usually mention: + +- synchronous and asynchronous checks +- human-in-the-loop queues +- auditability +- precision/recall 
tradeoffs +- abuse evasion and adversarial behavior +- region and policy versioning + +### 2.4 Support Dashboards + +Support dashboards are the operational interface between customers and the company. + +The goal is not just to show data. The goal is to let support resolve real issues without waiting on engineers for every question. + +For many SaaS companies, support tooling is one of the highest leverage internal investments because it reduces escalations, shortens ticket time, and improves customer trust. + +#### 2.4.1 What a Good Support Dashboard Does + +A serious support dashboard usually offers a customer 360 view: a single place to inspect the current state and recent history of a user, account, tenant, or transaction. + +Typical components: + +- account profile and plan information +- permissions and organization membership +- recent user actions +- payment history, invoices, disputes, webhook deliveries +- feature flags or experiments affecting the account +- relevant logs, request IDs, and traces +- risk flags or account restrictions +- recent support actions taken by staff + +The key is correlation. Support cannot debug an issue from one table alone. + +#### 2.4.2 State, Events, and Logs Are Different + +One of the best mental models for support tooling is this: + +| Question | Best source | +|---|---| +| What is true now? | current database state or read model | +| What happened over time? | event timeline or audit trail | +| Why did the system behave that way? | logs, traces, and downstream error details | + +If the dashboard only shows current state, support misses the story. If it only shows logs, support misses the business meaning. Good systems join these perspectives into one workflow. 
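
One way to picture joining the three perspectives is a function that assembles current state, an ordered timeline, and correlated logs for one account. This is a sketch under simplifying assumptions: the stores are plain in-memory dictionaries and lists, timestamps are ISO-8601 UTC strings in a uniform format (so they sort lexicographically), and `build_investigation_view` is a hypothetical name.

```python
def build_investigation_view(account_id, state_store, events, logs):
    """Combine 'what is true now', 'what happened', and 'why' for one account."""
    # What is true now: the current state or read model.
    state = state_store.get(account_id, {})

    # What happened over time: this account's events in event-time order.
    timeline = sorted(
        (e for e in events if e["account_id"] == account_id),
        key=lambda e: e["occurred_at"],
    )

    # Why it behaved that way: logs correlated through request IDs.
    request_ids = {e["request_id"] for e in timeline if e.get("request_id")}
    related_logs = [entry for entry in logs if entry["request_id"] in request_ids]

    return {"state": state, "timeline": timeline, "logs": related_logs}
```

The join key is the request ID, which is why consistent correlation IDs across services matter so much for support tooling.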
+ +#### 2.4.3 Support Investigation Flow + +```mermaid +sequenceDiagram + participant Customer + participant Agent as Support Agent + participant UI as Support Dashboard + participant Read as Customer 360 Read Model + participant Events as Event Timeline + participant Logs as Log / Trace Index + participant Admin as Action API + participant Audit as Audit Log + + Customer->>Agent: Reports issue + Agent->>UI: Open account or request + UI->>Read: Fetch current account state + UI->>Events: Fetch timeline of recent actions + UI->>Logs: Search request IDs, errors, traces + Read-->>UI: Current state + Events-->>UI: Ordered activity history + Logs-->>UI: Failure context + UI-->>Agent: Unified investigation view + Agent->>Admin: Perform scoped action + Admin->>Audit: Record actor, reason, change + Admin-->>Agent: Action result + Agent-->>Customer: Resolution or next steps +``` + +#### 2.4.4 Why Support Dashboards Reduce Engineering Load + +Without good tooling, support teams ask engineers questions like: + +- Did this webhook fail? +- Why was this user unable to log in? +- Was this charge retried? +- Which feature flags were active for this tenant? +- Did a recent deployment change behavior? + +If the support dashboard can answer these questions directly, engineering involvement drops dramatically. + +This is how Stripe-like payment platforms and mature B2B SaaS companies scale support without turning backend engineers into a human query layer. + +#### 2.4.5 Timeline Views Matter More Than People Expect + +A timeline view is often the single most useful support primitive.
+ +Why: + +- it converts scattered operational signals into a human-readable narrative +- it makes causality easier to understand +- it exposes ordering problems, retries, and partial failures +- it helps support and engineering talk about the same incident + +Good timeline events are business-aware: + +- invoice created +- payment authorized +- webhook delivery failed +- retry scheduled +- customer updated billing details +- support resent invoice email + +These are far more useful than raw low-level logs alone. + +#### 2.4.6 Impersonation Tools + +"View as user" or impersonation is common because many customer issues are easiest to reproduce from the user's perspective. + +But impersonation is dangerous. Safe implementations usually include: + +- strong RBAC and just-in-time access +- explicit reason capture +- prominent session banners +- read-only mode by default +- masking of secrets or regulated fields +- action restrictions or approval for mutating operations +- full audit trail + +The safest pattern is often not true raw impersonation, but a scoped support token that renders the customer experience while blocking sensitive operations. + +#### 2.4.7 Production Architecture Patterns + +Support dashboards usually should not query transactional services directly for every page load. + +Common production patterns: + +- read models built from events or CDC +- search indexes for account and ticket lookup +- read replicas for operational state +- log and trace indexes for investigation context +- admin action APIs for carefully scoped writes + +This matters because support queries are cross-cutting and investigative. They often span many services and would be too expensive or too fragile to assemble live from the hot request path every time. 
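
A read model built from events can be sketched as a small projection that folds each event into a per-account summary. The event types (`plan.changed`, `webhook.delivery_failed`) and the summary fields are assumptions chosen for illustration; the in-memory dictionary stands in for whatever store backs the dashboard.

```python
class Customer360Projection:
    """Folds a stream of events into a read-optimized per-account summary."""

    def __init__(self):
        self.accounts = {}
        self._seen_event_ids = set()

    def apply(self, event):
        # Delivery is usually at-least-once, so skip events already applied.
        if event["event_id"] in self._seen_event_ids:
            return
        self._seen_event_ids.add(event["event_id"])

        summary = self.accounts.setdefault(
            event["account_id"],
            {"plan": None, "failed_webhooks": 0, "last_event_at": None},
        )
        if event["event_type"] == "plan.changed":
            summary["plan"] = event["metadata"]["plan"]
        elif event["event_type"] == "webhook.delivery_failed":
            summary["failed_webhooks"] += 1

        # Keep the latest event time even if events arrive out of order.
        last = summary["last_event_at"]
        if last is None or event["occurred_at"] > last:
            summary["last_event_at"] = event["occurred_at"]
```

A dashboard page load then reads one precomputed summary instead of fanning out to several transactional services. The price is staleness, so it is worth showing `last_event_at` to the agent.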
+ +#### 2.4.8 Failure Cases + +- support sees stale state and gives the wrong answer +- the dashboard leaks secrets, PII, or credentials +- actions are not idempotent, so retries create duplicate refunds or emails +- internal actions bypass product invariants and corrupt state +- staff actions are not audited, making incident reconstruction impossible +- every team adds fields ad hoc, creating a cluttered and confusing UI + +#### 2.4.9 Best Practices + +- build read-optimized customer 360 models +- use correlation IDs consistently across systems +- expose both current state and event history +- default to redaction and least privilege +- wrap state changes in well-defined action APIs +- require reason codes for sensitive operations +- record every mutation in an immutable audit log + +### 2.5 Internal Panels + +Internal panels are the broader class of admin UIs used by operations, finance, trust and safety, growth, customer success, and engineering teams. + +They are often CRUD-heavy, but reducing them to CRUD misses the important part. Their real job is to encode internal workflows safely. + +Typical examples: + +- user and account management +- tenant provisioning +- catalog and listing administration +- risk case management +- refunds and credits +- feature flag management +- configuration rollout interfaces +- partner onboarding workflows + +#### 2.5.1 What Companies Build in Internal Panels + +At a startup, internal panels may begin as a single admin web app. At larger companies, they evolve into a platform: + +- shared auth and SSO +- reusable RBAC and approval workflows +- common tables and detail pages +- standardized audit logging +- workflow primitives such as queues, checklists, and task assignment +- extensibility for team-specific tools + +This is where "internal tool sprawl" becomes a real problem. Every team wants its own dashboard. 
Without platform discipline, the company ends up with dozens of brittle apps that all reimplement auth, search, and audit badly. + +#### 2.5.2 Low-Code vs Custom Tools + +| Approach | Strengths | Weaknesses | Best fit | +|---|---|---|---| +| Low-code internal tools | fast to build, easy forms/tables, often built-in auth and connectors | limited custom workflows, hidden complexity, weaker testing and review | early-stage back-office tools | +| Custom-built tools | full control, better for complex workflows and domain-specific safety rules | higher engineering cost | sensitive or core internal operations | +| Platform plus extensions | shared foundation with custom modules | requires strong governance | common pattern in growing companies | + +Low-code is attractive because internal users want results quickly. But once tools become security-sensitive or business-critical, custom or platform-based approaches usually become necessary. + +#### 2.5.3 Internal Tool Architecture + +```mermaid +flowchart TB + Staff[Employee] --> SSO[SSO + MFA] + SSO --> Portal[Internal admin portal] + Portal --> Gateway[Internal gateway] + Gateway --> RBAC[RBAC / policy engine / approvals] + RBAC --> ReadSvc[Read services] + RBAC --> ActionSvc[Action services] + ReadSvc --> ReadModels[(Read replicas / search / event timelines)] + ActionSvc --> Domain[Domain service APIs] + ActionSvc --> Config[(Config / flag store)] + ActionSvc --> Jobs[Background jobs] + ActionSvc --> Audit[(Audit log)] + Domain --> Primary[(Primary databases)] +``` + +The architecture emphasizes something important: internal tools should usually call services, not mutate storage directly. 
+ +Why: + +- services already know domain invariants +- side effects such as notifications and events remain consistent +- authorization can be centralized +- operations become testable and auditable + +#### 2.5.4 Maintainability Principles + +Good internal panels tend to share these traits: + +- one internal identity layer, not per-tool local auth +- one consistent policy model, not ad hoc role checks scattered everywhere +- clear ownership of each tool and workflow +- thin UI, with logic pushed into internal APIs and services +- reusable primitives for search, timelines, approval, notes, and audit records +- strong schema discipline so tables and forms do not drift wildly + +#### 2.5.5 Common Failure Modes + +- direct production SQL becomes the de facto admin interface +- one "super admin" role can do everything with no review +- each team builds a separate dashboard with inconsistent permissions +- write actions bypass domain services and skip side effects +- the UI becomes a giant orchestration layer nobody can test +- internal tools accumulate stale features and nobody knows what is safe to remove + +### 2.6 Operational Controls + +Operational controls are the mechanisms engineers and operators use to change runtime behavior safely in production. + +These are essential because when something is breaking, waiting for a full code deployment is often too slow or too risky. + +Typical controls: + +- feature flags +- kill switches +- traffic rerouting +- rollback controls +- runtime configuration changes +- rate limit overrides +- circuit-breaker thresholds +- background job pausing or queue draining + +#### 2.6.1 Why Operational Controls Exist + +At scale, outages are often mitigated before they are fixed. 
+ +Examples: + +- disable a new recommendation model that is causing timeouts +- reduce request volume to a degraded downstream dependency +- turn off an expensive feature for one region +- shift traffic away from a failing cluster +- pause a worker that is corrupting records + +These are control-plane actions. They do not solve root cause, but they reduce blast radius and buy time. + +#### 2.6.2 Common Controls and What They Protect + +| Control | Purpose | Typical scope | Example | +|---|---|---|---| +| Feature flag | enable or disable functionality | user, tenant, region, environment, percentage | turn off a new checkout step | +| Kill switch | immediately stop a dangerous path | service or workflow | disable an outbound webhook processor | +| Traffic reroute | shift requests to healthy capacity | region, cluster, service subset | move traffic away from a failing AZ | +| Rollback | revert recent code or config | deploy unit or service | roll back a bad release | +| Rate limit override | protect dependencies or unblock key customers | tenant, route, system-wide | lower traffic during database stress | +| Runtime config | tune behavior without code changes | service or policy domain | change queue concurrency or timeout values | + +#### 2.6.3 Runtime Configuration Systems + +A mature runtime config system usually includes: + +- versioned configuration +- targeting rules by tenant, region, user cohort, or environment +- propagation to services through polling, push, or sidecars +- validation and dry-run support +- roll-forward and rollback capability +- audit logging + +The hard problem is not storing config. The hard problem is safe propagation and consistent interpretation. 
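
As a rough sketch of versioned configuration with targeting rules and validation, consider a single integer setting such as a timeout in milliseconds. `TargetingRule`, `ConfigVersion`, and the bounds used in `validate` are hypothetical; the sketch only covers resolution and a semantic sanity check, not propagation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TargetingRule:
    value: int
    tenant_id: Optional[str] = None  # None matches any tenant
    region: Optional[str] = None     # None matches any region


@dataclass(frozen=True)
class ConfigVersion:
    version: int
    default: int
    rules: Sequence[TargetingRule] = ()  # ordered most specific first


def validate(config: ConfigVersion, low: int = 50, high: int = 30_000) -> None:
    """Reject values that are well-formed but operationally dangerous."""
    for value in [config.default, *(rule.value for rule in config.rules)]:
        if not low <= value <= high:
            raise ValueError(f"value {value} outside safe range [{low}, {high}]")


def resolve(config: ConfigVersion, tenant_id: str, region: str) -> int:
    """First matching rule wins; otherwise fall back to the default."""
    for rule in config.rules:
        if rule.tenant_id is not None and rule.tenant_id != tenant_id:
            continue
        if rule.region is not None and rule.region != region:
            continue
        return rule.value
    return config.default
```

Keeping each version immutable makes rollback a matter of pointing services back at an earlier `ConfigVersion` rather than editing values in place.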
+ +Common failure patterns: + +- some instances see the new value, others do not +- the config is syntactically valid but semantically dangerous +- services interpret the same flag differently +- operators change config faster than the system can stabilize + +#### 2.6.4 Safe Rollouts + +Operational controls and rollout strategy are tightly linked. + +Common safe rollout patterns: + +- canary deployment: send a small portion of traffic to the new version first +- ring deployment: expand from internal users to low-risk cohorts to the general population +- regional rollout: enable in one region before global enablement +- tenant-based rollout: enable for a small set of customers first +- dark launch: execute code paths without exposing the user-visible result + +Companies like Netflix, Amazon, and other large platforms are well known for taking rollout safety seriously because operational mistakes at their scale are amplified immediately. + +#### 2.6.5 Incident Mitigation Control Loop + +```mermaid +flowchart LR + Detect[Alert or anomaly detected] --> Triage[Triage using dashboards, logs, traces] + Triage --> Decide{Choose mitigation} + Decide --> Flag[Disable feature or kill switch] + Decide --> Route[Reroute traffic or fail over] + Decide --> Limit[Lower limits or shed load] + Decide --> Rollback[Rollback code or config] + Flag --> Verify[Observe metrics and user impact] + Route --> Verify + Limit --> Verify + Rollback --> Verify + Verify -->|stable| Stabilize[Keep safe state and document] + Verify -->|not stable| Escalate[Escalate and broaden response] + Escalate --> Decide + Stabilize --> Timeline[(Incident timeline and audit record)] +``` + +#### 2.6.6 Production Realities + +In real systems, the most valuable control is often not the fanciest one. 
It is the one that is: + +- obvious to find during an incident +- well understood by responders +- tested before the incident happens +- constrained to a safe blast radius +- reversible + +An elegant but untested kill switch is less useful than a simple, well-practiced flag that the team knows how to use. + +#### 2.6.7 Failure Cases and Mistakes + +- flags are added but never cleaned up, turning the system into a maze +- kill switches are too broad and take out healthy functionality +- rollback paths break because database migrations are not reversible +- traffic rerouting overwhelms the target region +- only senior engineers know how to use the controls +- there is no audit trail for emergency actions + +#### 2.6.8 Best Practices + +- treat controls as product features, not emergency hacks +- document ownership and intended usage +- test controls during game days and incident drills +- require audit records and reason capture +- prefer scoped controls over system-wide ones +- pair controls with observable success criteria + +## 3. Analytics Systems + +Analytics systems answer questions that product systems are bad at answering directly. + +Examples: + +- How many users completed onboarding this week? +- Which experiment variant increased conversion? +- How many failed payments came from one issuer? +- Which content categories create the most abuse reports? +- What is the retention curve for users acquired through a campaign? + +These are not transactional questions. They are aggregations across time, users, regions, dimensions, and events. + +If you try to answer them by repeatedly querying the production database, you eventually hurt the product. + +### 3.1 User Events + +User events are append-only records of something that happened. 
+ +Examples: + +- page viewed +- search executed +- listing created +- checkout started +- payment succeeded +- comment flagged +- invoice downloaded + +Events are foundational because they let the company observe behavior without repeatedly reverse-engineering it from mutable transactional state. + +#### 3.1.1 Why Events Exist + +Transactional databases tell you the current state very well. They are weaker at explaining behavioral history at scale. + +Example: + +- An orders table may tell you that an order is currently cancelled. +- It is much worse at telling you the full user journey that led there: page views, add-to-cart events, retries, payment declines, coupon application, and support contact. + +Events preserve the narrative. + +They also decouple producers from consumers. One emitted event can feed: + +- realtime dashboards +- fraud systems +- recommendation systems +- experimentation systems +- BI and reporting +- support timelines + +#### 3.1.2 Event Schema Basics + +Good event design is part data modeling and part operational discipline. 
+ +Typical fields: + +| Field | Purpose | +|---|---| +| event_id | unique identifier for deduplication | +| event_type | semantic name such as `checkout.started` or `invoice.paid` | +| occurred_at | when the event happened at the source | +| ingested_at | when the pipeline received it | +| actor_id / user_id | who initiated it | +| account_id / tenant_id | organizational context | +| request_id / trace_id | correlation with logs and traces | +| source | client, server, worker, third party | +| schema_version | allows safe evolution | +| metadata | event-specific attributes | + +Example event: + +```json +{ + "event_id": "evt_6f2c6f7c", + "event_type": "payment.succeeded", + "occurred_at": "2026-04-26T12:34:56Z", + "ingested_at": "2026-04-26T12:34:58Z", + "user_id": "usr_123", + "account_id": "acct_456", + "request_id": "req_789", + "source": "server", + "schema_version": 3, + "metadata": { + "amount": 4200, + "currency": "USD", + "payment_method": "card" + } +} +``` + +#### 3.1.3 Client-Side vs Server-Side Instrumentation + +| Strategy | Strengths | Weaknesses | Good use cases | +|---|---|---|---| +| Client-side events | captures UX actions directly, useful for product analytics | ad blockers, offline behavior, clock skew, tampering | page views, button clicks, UI funnels | +| Server-side events | more authoritative, tied to real business outcomes | can miss pure client intent, requires backend integration | payments, orders, account changes, security-sensitive workflows | + +Strong production systems often use both. + +- client-side for user interaction detail +- server-side for authoritative business facts + +For money, permissions, and compliance-sensitive actions, server-side events usually win. 
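
Before events from either source enter the pipeline, a collector typically checks them against the expected schema. The sketch below validates a subset of the fields from 3.1.2 and quarantines anything malformed instead of dropping it. The required-field set, the allowed `source` values, and the function names are assumptions for illustration.

```python
from datetime import datetime, timezone

# Assumed minimal contract: field name -> expected Python type.
REQUIRED_FIELDS = {
    "event_id": str,
    "event_type": str,
    "occurred_at": str,
    "source": str,
    "schema_version": int,
}
ALLOWED_SOURCES = {"client", "server", "worker", "third_party"}


def validate_event(event: dict) -> list:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    source = event.get("source")
    if isinstance(source, str) and source not in ALLOWED_SOURCES:
        errors.append(f"unknown source: {source}")
    return errors


def ingest(event: dict, accepted: list, quarantine: list, now=None) -> bool:
    """Accept valid events (stamping ingested_at); quarantine the rest."""
    errors = validate_event(event)
    if errors:
        quarantine.append({"event": event, "errors": errors})
        return False
    ingested_at = (now or datetime.now(timezone.utc)).isoformat()
    accepted.append({**event, "ingested_at": ingested_at})
    return True
```

Quarantining rather than discarding keeps bad events available for debugging and replay once the producer is fixed.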
+ +#### 3.1.4 Event Ingestion Pipeline + +```mermaid +flowchart LR + User[User action] --> App[Web / Mobile / API] + App --> ClientSDK[Client SDK] + App --> Backend[Backend service] + ClientSDK --> Collector[Event collector] + Backend --> Collector + Collector --> Validate[Schema validation + enrichment] + Validate --> Stream[Kafka / Kinesis / PubSub] + Stream --> StreamProc[Stream processing] + Stream --> Raw[(Raw event lake)] + StreamProc --> RT[(Realtime analytics store)] + StreamProc --> Warehouse[(Warehouse)] + RT --> Dash[Realtime dashboards] + Warehouse --> BI[BI / reporting / experimentation] +``` + +This is the core analytics pattern at many companies. + +#### 3.1.5 Event Flow: User Action to Dashboard + +```mermaid +sequenceDiagram + participant User + participant App + participant Ingest as Event Collector + participant Stream + participant RT as Realtime Store + participant WH as Warehouse + participant Dash as Dashboard + + User->>App: Completes product action + App->>App: Write transactional state + App->>Ingest: Emit event + Ingest->>Stream: Append event + Stream->>RT: Update short-window aggregates + Stream->>WH: Load raw event for historical models + Dash->>RT: Query latest KPIs + Dash->>WH: Query historical trends + Dash-->>User: Fresh operational view and long-term context +``` + +This is why analytics pipelines are usually separate from product databases. They are designed for different questions. + +#### 3.1.6 Ordering Challenges + +Event ordering is trickier than it looks. 
+ +Problems: + +- clients can be offline and upload late +- mobile device clocks can be wrong +- distributed producers write to different partitions +- retries create duplicates and apparent reordering +- downstream consumers process at different speeds + +Production systems usually distinguish between: + +- event time: when the action happened +- processing time: when the system processed it +- ingestion time: when it entered the pipeline + +You should not assume global ordering across the whole system. At best, you often get ordering within a key or partition. + +#### 3.1.7 Deduplication and the Exactly-Once Myth + +Interviewers often appreciate hearing that "exactly once" is mostly an end-to-end design goal, not a magical default. + +In practice, many event systems are at-least-once. + +That means you need deduplication using things like: + +- event IDs +- idempotency keys +- consumer-side upsert semantics +- watermarking and replay windows + +If a pipeline retries and emits the same purchase event twice, dashboards and reports can become wrong very quickly. + +#### 3.1.8 Schema Evolution + +Schema evolution problems are common and painful. 
+ +Examples: + +- a producer renames a field and breaks downstream jobs +- required fields are added without backward compatibility +- teams overload a generic metadata blob and lose consistent semantics +- analysts interpret `country` differently across producers + +Mature systems solve this with: + +- schema registries +- versioned event contracts +- compatibility checks in CI +- tracking plans and naming conventions +- ownership for event families + +#### 3.1.9 Why Events Beat Querying OLTP for Analytics + +Events scale better than reading analytics straight from transactional databases because they are: + +- append-friendly +- decoupled from user-facing transactions +- easier to fan out to many consumers +- richer in behavioral context +- safer to process asynchronously + +By contrast, running heavy analytical queries on OLTP systems causes contention, cache churn, lock pressure, and unpredictable latency for the actual product. + +#### 3.1.10 Production Examples + +- Netflix-like products collect playback and engagement events for recommendations, QoE dashboards, and experimentation. +- Uber-like systems emit trip lifecycle events, marketplace balance events, and operational events that feed realtime ops dashboards and pricing systems. +- Stripe-like systems emit payment, dispute, webhook, and account events used for support, reporting, and downstream automation. +- GitHub-like platforms track repository activity, workflow runs, package events, and abuse signals for analytics and internal operations. 
+ +#### 3.1.11 Failure Cases + +- ad blockers or privacy settings drop client events +- bot traffic pollutes product metrics +- event volume spikes overwhelm collectors +- schema drift breaks downstream consumers silently +- delayed events distort daily reporting windows +- PII leaks into event payloads and spreads through the warehouse + +#### 3.1.12 Best Practices + +- define event naming conventions early +- prefer stable semantic event types over UI-specific names +- make business-critical events server authoritative +- carry request IDs and tenant IDs consistently +- validate schemas at ingestion +- quarantine bad events rather than silently dropping them +- publish freshness and completeness metrics for pipelines + +### 3.2 Dashboards + +Dashboards are the human-readable surface of analytics systems. + +They answer questions such as: + +- What is happening right now? +- Is the new release improving or hurting conversion? +- Which region is failing? +- Is abuse rising? +- Are support queues growing faster than expected? + +Dashboards are deceptively hard. A chart is easy to draw. A trustworthy dashboard is not. + +#### 3.2.1 Realtime vs Near Realtime vs Batch + +| Mode | Freshness | Typical backend | Common use cases | Main tradeoff | +|---|---|---|---|---| +| Realtime | seconds | stream processors, in-memory stores, realtime OLAP | ops monitoring, fraud, support queue health | more cost and more complexity | +| Near realtime | tens of seconds to minutes | micro-batches, materialized views | growth funnels, experiment monitoring | slightly stale but simpler | +| Batch | hours or daily | warehouses and scheduled jobs | executive reporting, finance, compliance | highest correctness, lowest freshness | + +One of the most important interview points is that not every dashboard needs to be realtime. Realtime is expensive. Use it where operational decisions need it. 
+ +#### 3.2.2 Why Dashboards Do Not Sit on OLTP Databases + +Operational databases are optimized for point reads and writes on current state. + +Dashboards want: + +- large scans +- time-window aggregations +- percentiles and histograms +- group-by across many dimensions +- historical comparisons + +Those are OLAP-style queries. Running them on product databases eventually hurts the product. + +#### 3.2.3 OLAP Basics + +OLAP systems are designed for analytical queries. Conceptually they are: + +- read-optimized +- good at scanning many rows but only the columns needed +- good at aggregation and filtering across dimensions +- often columnar rather than row-oriented + +Well-known examples of this style include ClickHouse, Druid, BigQuery, Snowflake, and Redshift. + +#### 3.2.4 Pre-Aggregation vs On-Demand Queries + +| Approach | Strengths | Weaknesses | Best fit | +|---|---|---|---| +| Pre-aggregation | fast dashboard reads, predictable cost | extra pipeline complexity, less flexibility | repeated KPIs and standard dashboards | +| On-demand aggregation | flexible exploration | slower and more expensive queries | ad hoc analysis and lower-volume queries | + +Most production systems use both: + +- precompute the common metrics +- allow on-demand queries for exploration or less frequent questions + +#### 3.2.5 Dashboard Query Flow + +```mermaid +flowchart LR + Viewer[Manager / Analyst / Ops] --> UI[Dashboard UI] + UI --> QueryAPI[Analytics query API] + QueryAPI --> Auth[Metric definitions + access control] + QueryAPI --> Cache[(Result cache)] + Cache -->|miss| Agg[(Materialized views / cubes)] + Agg --> OLAP[(OLAP store / warehouse)] + OLAP --> QueryAPI + QueryAPI --> UI +``` + +The important production point is that a dashboard is often not a raw SQL client. There is usually a query service in between to enforce metric definitions, caching, authorization, and query limits.
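
A minimal sketch of such a query service, assuming a registry of named metrics, a role-based access check, and a TTL result cache. `AnalyticsQueryService`, the registry shape, and the `run_query` callable are hypothetical; the backend is injected so the sketch does not depend on any particular OLAP store.

```python
import time


class AnalyticsQueryService:
    """Sits between the dashboard UI and the analytical store."""

    def __init__(self, metrics, run_query, ttl_seconds=60, clock=time.monotonic):
        # metrics: name -> {"query": str, "allowed_roles": set of role names}
        self._metrics = metrics
        self._run_query = run_query  # callable(query, window) -> result
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}

    def query(self, metric, window, role):
        definition = self._metrics.get(metric)
        if definition is None:
            # Only registered metrics are queryable: no arbitrary SQL.
            raise KeyError(f"unknown metric: {metric}")
        if role not in definition["allowed_roles"]:
            # Authorize before touching the cache so hits cannot leak data.
            raise PermissionError(f"role {role!r} may not read {metric!r}")

        key = (metric, window)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]

        result = self._run_query(definition["query"], window)
        self._cache[key] = (now, result)
        return result
```

Because every viewer goes through the same named definitions, two dashboards showing the same metric cannot silently compute it in different ways, and identical requests within the TTL hit the backend only once.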
+ +#### 3.2.6 Caching Strategies + +Common dashboard caching strategies: + +- result cache for identical queries +- time-bucket cache, especially for recent windows +- pre-rendered tiles for expensive panels +- CDN or edge caching for public dashboards + +You can often cache aggressively because many dashboard viewers ask the same questions over the same time ranges. + +#### 3.2.7 Performance Challenges + +What makes dashboards slow or expensive: + +- high-cardinality dimensions such as user ID or raw URL +- many joins across poorly modeled tables +- percentile queries over huge windows +- unbounded filter combinations +- realtime calculations without materialization + +A classic anti-pattern is letting every panel issue arbitrary raw queries with no guardrails. + +#### 3.2.8 Good vs Bad Dashboards + +Good dashboards: + +- align to an operational question or business decision +- show freshness and time window clearly +- define metrics consistently +- highlight anomalies, not noise +- support drill-down to the next useful level + +Bad dashboards: + +- track dozens of vanity metrics with no owner +- mix incompatible definitions on one page +- hide the data lag and make stale numbers look live +- overfit the dashboard to one incident and create long-term clutter + +#### 3.2.9 Real-World Examples + +- ops teams use realtime dashboards for request volume, latency, queue depth, and error rate +- product teams use near-realtime funnels and experiment monitoring dashboards +- marketplace teams monitor live listing creation, fraud flags, dispute rates, and support backlogs +- SaaS companies monitor active seats, feature adoption, and tenant health + +### 3.3 Reporting Systems + +Reporting systems generate scheduled, repeatable outputs for business or customer consumption. 
+ +Examples: + +- daily revenue reports +- weekly marketplace liquidity reports +- monthly customer usage summaries +- payout statements +- partner settlement files +- compliance exports + +Reports are different from exploratory dashboards because they are expected to be consistent and repeatable. + +#### 3.3.1 Why Reporting Is Usually Batch-Oriented + +Many reports care more about correctness and reproducibility than raw freshness. + +Example: + +- Finance may prefer a daily report generated from a closed accounting window, even if it is a few hours old. +- A customer usage statement should not change every minute while they are reading it. + +That is why reporting pipelines are usually batch-oriented, even in otherwise realtime companies. + +#### 3.3.2 ETL Pipelines + +ETL stands for Extract, Transform, Load. + +Conceptually: + +1. Extract data from source systems. +2. Transform it into a consistent and useful model. +3. Load it into the target analytical store or reporting layer. + +Modern systems often do ELT in practice: load raw data first, then transform inside the warehouse. But the ETL mental model is still useful because it describes the stages clearly. + +#### 3.3.3 Batch Reporting Pipeline + +```mermaid +flowchart LR + Sources[OLTP DBs / event stream / third-party systems] --> Extract[Batch extract or ELT load] + Extract --> Transform[Data cleaning, joins, business rules, snapshots] + Transform --> Warehouse[(Warehouse)] + Warehouse --> Metrics[Scheduled metric and snapshot jobs] + Metrics --> Report[Report generator] + Report --> Files[(CSV / PDF / data export)] + Report --> Delivery[Email / Slack / API delivery] + Metrics --> QA[(Reconciliation and data quality checks)] +``` + +#### 3.3.4 How Companies Keep Reports Correct + +Correct reporting usually depends on operational discipline, not just SQL skill. 
+ +Common techniques: + +- snapshot tables for closed reporting periods +- idempotent batch jobs +- data quality tests and reconciliations +- late-data handling rules +- metric versioning +- explicit timezone handling +- backfill processes with approval and lineage + +This is especially important in finance-like domains where a report can trigger money movement, compliance submissions, or executive decisions. + +#### 3.3.5 Consistency vs Freshness + +Reporting systems often choose consistency over freshness. + +Tradeoff examples: + +- a report built from a stable midnight snapshot is easier to audit +- a near-live report is fresher but may change as late events arrive + +Strong interview answers explain that the right choice depends on the business meaning of the report. + +#### 3.3.6 PDF and Email Report Generation + +Report generation is often its own mini-platform: + +- select data window and report definition +- compute metrics and aggregate tables +- render CSV, PDF, or HTML +- store generated artifacts +- deliver by email or API +- track delivery status and retries + +This sounds mundane, but at scale it raises real concerns: + +- retries must not duplicate deliveries incorrectly +- generated files may contain sensitive data +- report definitions change over time and need versioning +- rendering can become CPU heavy + +#### 3.3.7 Failure Cases + +- upstream data arrives late and misses the report window +- reruns produce different numbers without explanation +- finance and product teams use slightly different metric logic +- timezone mistakes shift data across reporting days +- the warehouse contains duplicate rows from replayed ingestion +- report delivery succeeds but the metadata says it failed, triggering duplicates + +#### 3.3.8 Best Practices + +- define the reporting window explicitly +- snapshot critical numbers when appropriate +- keep report definitions versioned and reviewable +- reconcile against source-of-truth systems +- separate customer-facing 
reports from exploratory analytics models +- publish freshness and completion status + +### 3.4 Business Intelligence (BI) + +Business intelligence is the layer that turns raw operational and event data into a shared analytical truth for the company. + +BI is broader than dashboards. Dashboards answer repeated questions. BI helps the company ask and answer new questions safely. + +#### 3.4.1 Dashboards vs BI + +| Category | Dashboards | BI systems | +|---|---|---| +| Main purpose | repeated visibility into key metrics | broad self-serve analysis and decision support | +| Primary users | operators, managers, product teams | analysts, finance, product, leadership | +| Query style | curated and repeated | ad hoc and exploratory | +| Data model | often pre-aggregated | modeled warehouse tables, semantic layers | +| Governance need | moderate | very high | + +#### 3.4.2 Facts and Dimensions + +Dimensional modeling is a common BI concept because it makes analytical queries easier and more consistent. + +- Facts are measurable events or transactions: orders, payments, sessions, ad impressions. +- Dimensions describe context: date, customer, region, product, plan, campaign. + +This model is popular because it aligns with how businesses ask questions: + +- revenue by region +- orders by merchant and week +- churn by plan and acquisition source + +#### 3.4.3 Star Schema vs Snowflake Schema + +| Model | Description | Strengths | Weaknesses | +|---|---|---|---| +| Star schema | one fact table linked directly to denormalized dimension tables | simple, fast for common BI queries | can duplicate dimension data | +| Snowflake schema | dimension tables are further normalized | less duplication, more normalized modeling | more joins, often harder for analysts | + +At interview depth, the main takeaway is that star schemas are usually easier for analytics and self-serve use. 
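A small worked example can make the star schema concrete. The sketch below uses Python's built-in `sqlite3` module purely as a convenient stand-in for a warehouse; the table names (`fact_orders`, `dim_region`, `dim_plan`) and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe context.
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE dim_plan (plan_id INTEGER PRIMARY KEY, plan_name TEXT);

    -- The fact table holds measurable events and points at each dimension.
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        region_id INTEGER REFERENCES dim_region(region_id),
        plan_id INTEGER REFERENCES dim_plan(plan_id),
        amount_cents INTEGER
    );

    INSERT INTO dim_region VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO dim_plan VALUES (1, 'starter'), (2, 'enterprise');
    INSERT INTO fact_orders VALUES
        (100, 1, 1, 2000),
        (101, 1, 2, 9000),
        (102, 2, 2, 9000);
    """
)

# "Revenue by region" is one join from the fact table to a single dimension.
revenue_by_region = conn.execute(
    """
    SELECT r.region_name, SUM(f.amount_cents) AS revenue_cents
    FROM fact_orders AS f
    JOIN dim_region AS r ON r.region_id = f.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
    """
).fetchall()
```

In a snowflake schema, `dim_region` might itself reference a separate country or territory table, so the same question would need an additional join.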
+ +#### 3.4.4 BI Architecture + +```mermaid +flowchart LR + Product[Product DBs and services] --> ELT[ELT / CDC / event loads] + Events[Event stream / raw lake] --> ELT + ThirdParty[Payments / CRM / support / ads] --> ELT + ELT --> Warehouse[(Data warehouse)] + Warehouse --> Models[Curated models: facts, dimensions, semantic layer] + Models --> BI[BI tools / notebooks / dashboard builders] + Models --> Reverse[Reverse ETL / operational activation] + Models --> Gov[Catalog / lineage / access control] + BI --> Teams[Finance / Product / Ops / Leadership] +``` + +This is how BI becomes a truth layer. Raw data is not enough. The company needs curated definitions, lineage, and governance. + +#### 3.4.5 Self-Serve Analytics + +Self-serve analytics is attractive because it lets teams answer questions without waiting on a central data team. + +But self-serve only works when the underlying models are good. + +Without that foundation, self-serve turns into chaos: + +- ten definitions of active user +- inconsistent timezone logic +- duplicated dashboards with different filters +- analysts querying raw semi-structured events directly + +This is the "single source of truth" problem. + +#### 3.4.6 Governance and Access Control + +BI often contains the broadest copy of company data, so governance matters a lot. + +Important controls: + +- role-based access to schemas and metrics +- row-level and column-level security +- PII masking and tokenization +- dataset ownership and certification +- lineage so teams know where a number came from +- approval workflows for sensitive exports + +This is one reason warehouses and BI tools become central to company operations. They are not just reporting surfaces. They become the shared decision substrate. 
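Row-level and column-level controls are easier to reason about with a small example. The sketch below applies a hypothetical per-role policy to query results: rows outside the caller's scope are dropped, and columns the role may not see in the clear are replaced with a token. The `POLICIES` structure, role names, and column names are assumptions for illustration, and the unsalted hash is a simplification; a real deployment would more likely use a keyed hash or a tokenization service.

```python
import hashlib

# Hypothetical policy: which columns each role sees unmasked, and which
# attribute (if any) restricts the rows the role may read.
POLICIES = {
    "finance": {
        "clear_columns": {"customer_email", "region", "mrr_cents"},
        "row_scope": None,
    },
    "regional_ops": {
        "clear_columns": {"region", "mrr_cents"},
        "row_scope": "region",
    },
}


def mask(value: str) -> str:
    """Replace a value with a stable token (simplified, unsalted)."""
    return "tok_" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_policy(rows, role, user_attrs):
    """Filter rows by scope, then mask columns the role may not read."""
    policy = POLICIES[role]
    scope = policy["row_scope"]
    visible = []
    for row in rows:
        # Row-level security: skip rows outside the caller's scope.
        if scope is not None and row[scope] != user_attrs[scope]:
            continue
        # Column-level security: mask anything not explicitly allowed.
        visible.append({
            col: (val if col in policy["clear_columns"] else mask(str(val)))
            for col, val in row.items()
        })
    return visible


rows = [
    {"customer_email": "a@example.com", "region": "EMEA", "mrr_cents": 5000},
    {"customer_email": "b@example.com", "region": "APAC", "mrr_cents": 7000},
]
ops_view = apply_policy(rows, "regional_ops", {"region": "EMEA"})
finance_view = apply_policy(rows, "finance", {})
```

Here the regional operator sees only the EMEA row with the email tokenized, while finance sees both rows unmasked. Enforcing this in one shared layer, rather than in each dashboard, is what keeps exports and ad hoc queries consistent with the same policy.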
+ +#### 3.4.7 Production Reality + +A common modern stack conceptually looks like this: + +- source systems emit events or expose CDC +- raw data lands in warehouse or lake storage +- transformation models produce fact and dimension tables +- a semantic layer defines core metrics +- BI tools query the curated layer +- selected outputs get pushed back into operations through reverse ETL or internal tools + +This is how a support team might see customer health scores, or a sales team might see product usage metrics inside a CRM. + +#### 3.4.8 Failure Cases + +- raw data is accessible but poorly documented, so every team redefines metrics +- the warehouse becomes a dumping ground with no ownership +- dashboards disagree because dimensions are modeled inconsistently +- analysts accidentally expose PII through exports +- no one knows whether a metric is fresh, deprecated, or certified + +#### 3.4.9 Best Practices + +- define and certify core business metrics +- keep curated models separate from raw ingestion data +- publish freshness, lineage, and owner metadata +- standardize semantic definitions for company-wide metrics +- invest in access control and masking early + +## 4. Operational vs Analytical Systems + +This distinction shows up constantly in backend engineering interviews. + +The company usually needs both: + +- operational systems to run the product +- analytical systems to understand the product + +Trying to make one system do both equally well usually leads to pain. 
+ +### 4.1 OLTP vs OLAP + +| Property | OLTP | OLAP | +|---|---|---| +| Full name | Online Transaction Processing | Online Analytical Processing | +| Main purpose | serve product transactions | serve analytics and aggregation | +| Typical access pattern | many small reads and writes | fewer but much heavier scans and aggregations | +| Data shape | current state, normalized or service-owned | historical, denormalized or columnar, aggregation-friendly | +| Optimization target | low latency, correctness, concurrency | scan efficiency, aggregation speed, compression | +| Example questions | create order, update profile, fetch invoice | weekly retention by cohort, top erroring tenants, revenue by region | + +This is the simplest reason companies separate operational DBs and analytical stores: they are optimized for different query patterns. + +### 4.2 Write-Optimized vs Read-Optimized + +Operational systems are usually write-aware and state-aware. + +- point lookups +- transaction boundaries +- invariants and locks +- low-latency updates + +Analytical systems are usually read-optimized. + +- scan many rows quickly +- read only the needed columns +- aggregate across large windows +- serve repeated heavy queries efficiently + +One database engine can blur the lines a bit, but the architectural distinction remains important. + +### 4.3 Realtime vs Batch Processing + +| Processing style | Strengths | Weaknesses | Typical internal-ops use cases | +|---|---|---|---| +| Streaming / realtime | low latency, supports operational decisions | more complexity, higher cost, harder debugging | incident dashboards, fraud signals, live moderation queues | +| Batch | simpler, easier to reconcile, cheaper for large windows | stale data | scheduled reports, finance summaries, historical BI | + +Most mature companies run both. 
+ +Examples: + +- realtime metrics for current system health +- batch pipelines for finance and board reporting +- near-realtime operational analytics for support teams + +### 4.4 Why Separation Matters + +Separating operational and analytical systems provides: + +- isolation of user traffic from heavy analytics queries +- better storage formats for each use case +- easier fan-out to many analytics consumers +- improved governance over analytical data copies + +The cost is data duplication and lag. + +That tradeoff is worth calling out explicitly in interviews. + +Data duplication costs: + +- more storage +- more pipelines to operate +- consistency lag between source and analytics views +- more governance surface area + +Benefits: + +- safer production performance +- richer historical analysis +- easier experimentation and reporting +- multiple downstream consumers can reuse the same event data + +### 4.5 Dual Pipeline Architecture + +```mermaid +flowchart LR + Users[Users] --> API[Product APIs] + API --> OLTP[(Operational DB)] + API --> Bus[Event stream] + OLTP -. CDC .-> ELT[CDC / ELT jobs] + Bus --> StreamProc[Stream processing] + StreamProc --> RT[(Realtime analytics store)] + StreamProc --> Lake[(Raw event lake)] + ELT --> Warehouse[(Warehouse)] + Lake --> Warehouse + RT --> OpsDash[Realtime dashboards] + Warehouse --> BI[BI and reports] +``` + +This pattern is extremely common because it lets the product serve users while analytics systems consume the same business activity in a form optimized for different questions. + +## 5. How These Systems Connect in Real Architecture + +The strongest system design answers do not describe admin tools and analytics in isolation. They explain how these systems reinforce each other. 
+ +Examples: + +- support dashboards rely on event timelines, logs, and current state +- moderation tools consume product events and write policy actions back into the product +- incident controls use analytics and monitoring signals to guide mitigations +- BI systems consume the same event streams and CDC feeds to build business truth + +### 5.1 Combined Admin and Analytics Architecture + +```mermaid +flowchart TB + Users[End users] --> Product[Web / Mobile / Public APIs] + Product --> Services[Core backend services] + Services --> Primary[(Operational DBs)] + Services --> EventBus[Event stream] + Services -. logs / traces .-> Obs[(Observability stores)] + + Staff[Support / Moderation / Ops / Finance] --> Portal[Internal portal] + Portal --> AdminAPI[Internal APIs] + AdminAPI --> Guard[SSO / RBAC / approvals / audit] + Guard --> Support[Support dashboard] + Guard --> Moderation[Moderation queues and review] + Guard --> Ops[Flags / config / incident controls] + + Support --> ReadModels[(Customer 360 read models)] + Support --> Obs + Moderation --> ReviewStore[(Case and evidence store)] + Moderation --> Services + Ops --> Config[(Flag and config store)] + Config --> Services + + EventBus --> StreamProc[Stream processing] + StreamProc --> RT[(Realtime analytics store)] + StreamProc --> Warehouse[(Warehouse / lakehouse)] + Primary -. CDC .-> Warehouse + Warehouse --> BI[BI tools / reports / exec dashboards] + RT --> LiveDash[Operational dashboards] + AdminAPI --> Audit[(Audit log)] +``` + +### 5.2 Example: Typical SaaS Product + +Imagine a B2B SaaS product with subscriptions, usage billing, feature flags, and customer support. 
+ +The product architecture might look like this: + +- transactional services manage accounts, subscriptions, invoices, and permissions +- server-side domain events are emitted for account updates, invoices, usage records, and payment outcomes +- support dashboards assemble a customer 360 view from read replicas, billing events, and log indexes +- operational controls let responders disable a failing billing integration or reroute traffic +- analytics pipelines load usage and billing events into the warehouse for growth and finance reporting +- BI defines canonical metrics such as MRR, churn, active seats, and feature adoption + +This is a clean example of why internal operations are not secondary. They are the system that helps the company run the product and understand the business. + +### 5.3 Example: Marketplace Platform + +In a marketplace: + +- product services create listings, orders, messages, and reviews +- moderation pipelines evaluate listings, media, and abuse reports +- risk systems score suspicious sellers and transactions +- support dashboards show buyer and seller timelines, disputes, payout state, and prior enforcement +- analytics systems track GMV, liquidity, trust metrics, dispute rates, and review turnaround times +- ops controls can restrict categories, lower seller creation limits, or disable risky workflows during attacks + +This is a great interview example because it naturally combines trust, support, analytics, and operational control. 
+ +### 5.4 Example: Large Consumer Platform + +For a Meta-like or Google-scale content platform: + +- user-generated content creates immense moderation pressure +- automation handles high-confidence cases while humans review edge cases +- trust and safety teams need queue systems, appeals, policy versioning, and analytics on reviewer accuracy +- support or operations teams need account-level investigation tools +- BI teams need reliable definitions for engagement, retention, integrity metrics, and policy enforcement outcomes + +The big lesson is that internal operations scale with product complexity. They do not remain a small side tool forever. + +## 6. Common Interview Discussions and What Strong Answers Sound Like + +### 6.1 How Would You Design Internal Tools at Scale? + +Good discussion points: + +- separate control plane from product data plane +- use internal APIs rather than direct DB writes +- implement strong RBAC, audit logs, and approval flows +- build read models for investigative workflows +- design write actions as safe, idempotent operations +- expect tool sprawl and standardize the platform early + +Weak answer: + +- "I would build an admin dashboard that can edit the database" + +That answer ignores safety, invariants, auditability, and scale. + +### 6.2 How Do Companies Manage Operations Safely in Production? + +Good discussion points: + +- feature flags and runtime config +- kill switches and rollback paths +- scoped rollout strategies +- incident mitigation loops tied to metrics and logs +- tested, auditable controls with small blast radius + +### 6.3 How Do Analytics Pipelines Work End to End? + +Good discussion points: + +- instrumentation and event schemas +- collectors, validation, and stream transport +- stream processing and batch processing +- realtime stores for operational dashboards +- warehouses and semantic models for BI +- deduplication, ordering, and schema evolution + +### 6.4 What Breaks as Systems Grow? 
+ +Good discussion points: + +- internal tool sprawl +- queue backlog and operational bottlenecks +- privilege misuse and missing audit trails +- metric definition drift +- schema evolution breaking downstream consumers +- analytics queries leaking back onto OLTP systems +- stale read models and support confusion + +### 6.5 Tradeoffs Worth Mentioning in Interviews + +| Tradeoff | Why it matters | +|---|---| +| precision vs recall in moderation | trust, user harm, and review cost are in tension | +| low-code vs custom internal tools | speed of delivery vs safety and long-term maintainability | +| realtime vs batch analytics | freshness vs complexity and cost | +| pre-aggregation vs on-demand querying | speed vs flexibility | +| data duplication vs isolation | extra pipeline cost vs protecting OLTP performance | +| centralized governance vs team autonomy | consistency vs speed of iteration | + +## 7. Common Mistakes in Real Systems + +These mistakes show up repeatedly across companies: + +- treating internal tools as temporary hacks, then never hardening them +- allowing direct privileged database mutations outside controlled APIs +- building dashboards without ownership, definitions, or freshness metadata +- assuming event ordering and uniqueness without explicit design +- storing sensitive data in logs or event payloads carelessly +- forgetting that support, moderation, and analytics have different latency and correctness needs +- not connecting admin actions to audit logs and incident timelines + +## 8. Final Mental Model + +Internal operations are the systems that let a company safely run, govern, debug, and understand its product. + +Admin systems are the control plane. They let humans and automated policies intervene in production through support tools, moderation systems, internal panels, and operational controls. + +Analytics systems are the decision plane. 
They turn events and data copies into dashboards, reports, and BI so the company can understand behavior, detect issues, and make decisions without harming the transactional product path. + +The most practical architecture lesson is simple: + +- keep product traffic fast and safe +- keep privileged actions explicit and auditable +- keep analytics workloads off the transactional path +- connect these systems through well-defined events, read models, and governance + +If you can explain internal operations this way in an interview, you sound like someone who has thought beyond APIs and databases and into how real backend systems are actually operated.