Request Handling
Request handling is the full journey of a request from the moment a client sends it to the moment the system returns a response. In interviews, this topic sits at the boundary between API design, distributed systems, reliability engineering, and production operations. In real systems, request handling is where latency, availability, security, and cost are decided.
Most weak system design answers treat a request as if it magically reaches the correct service and that service magically succeeds. Real systems do not work that way. Before business logic runs, a request usually passes through multiple control points:
- edge protection
- routing
- load balancing
- authentication and authorization checks
- rate limiting or throttling
- validation
- version negotiation
- retries and failover logic
- observability hooks
If you understand request handling well, you can explain not only how a request succeeds, but also how the system behaves when traffic spikes, instances fail, regions go down, clients retry aggressively, or malformed input hits your API.
This guide is written with two goals:
- Help you answer backend and system design interview questions with depth and structure.
- Help you understand how production systems at companies like Google, Netflix, Uber, Amazon, GitHub, Stripe, and typical SaaS platforms are actually built.
Examples in this guide are intentionally generalized from widely used industry patterns and public engineering discussions rather than private internal implementation details.
1. Big Picture: What Request Handling Really Means
At a high level, request handling exists because distributed systems are hostile environments:
- networks are slow and unreliable
- clients are untrusted
- services scale up and down dynamically
- requests are not evenly distributed
- failures are partial, not binary
- deployments happen continuously
- one bad downstream dependency can cascade into an outage
The job of request handling is to answer a series of questions quickly and safely:
- Should this request be allowed into the system?
- Is the client authenticated and allowed to do this?
- Is the request well-formed and safe?
- Which region, cluster, service, and instance should receive it?
- Can the system handle the load right now?
- If something is failing, should we retry, reroute, degrade, or reject?
- How do we observe what happened later?
1.1 End-to-End Request Lifecycle
flowchart LR
C[Client: Browser / Mobile / API Consumer] --> CDN[CDN / Edge Cache]
CDN --> WAF[WAF / Reverse Proxy / TLS Termination]
WAF --> GW[API Gateway]
GW --> POL[Auth / Validation / Rate Limit]
POL --> ROUTE[Routing + Load Balancing]
ROUTE --> S1[Service A]
ROUTE --> S2[Service B]
S1 --> CACHE[(Cache)]
S1 --> DB[(Database)]
S2 --> MQ[(Queue / Stream)]
GW -. logs / metrics / traces .-> OBS[Observability Stack]
S1 -. logs / metrics / traces .-> OBS
S2 -. logs / metrics / traces .-> OBS
1.2 Core Goals of Request Handling
| Goal | Why it matters | Typical mechanisms |
|---|---|---|
| Correctness | Wrong requests can corrupt data or create security holes | validation, auth, idempotency, request signing |
| Availability | Users should still be served when instances or regions fail | load balancing, health checks, failover, retries, circuit breaking |
| Performance | Users care about latency more than architecture diagrams | caching, compression, routing, efficient LB strategy |
| Isolation | One tenant or one endpoint should not take down the whole system | rate limiting, throttling, priority queues, load shedding |
| Observability | If you cannot see failures, you cannot fix them | centralized logs, metrics, traces, correlation IDs |
| Evolvability | APIs and deployments must change without breaking clients | versioning, traffic splitting, canary release, blue-green rollout |
1.3 Interview Framing
In an interview, saying "I will add a load balancer" is not enough. A stronger answer sounds more like this:
"Requests first hit a reverse proxy or gateway where we terminate TLS, authenticate the caller, apply rate limiting, and route traffic. From there, an L7 load balancer sends traffic to healthy service instances discovered dynamically. We keep requests idempotent where retries are possible, use readiness checks so bad instances do not receive traffic, and add observability at the edge and service layers so we can debug tail latency and failure patterns."
That answer shows system thinking rather than component name-dropping.
2. API Gateway
2.1 What It Is
An API gateway is the entry point for requests into a backend system. It sits between clients and backend services and applies common policies before requests reach business logic.
Think of it as a programmable front door for your platform.
Common technologies:
- Envoy
- NGINX
- Kong
- HAProxy
- AWS API Gateway
- Spring Cloud Gateway
- Netflix Zuul historically
2.2 Why It Exists
Without a gateway, each service often ends up re-implementing the same cross-cutting concerns:
- token validation
- request logging
- rate limiting
- route matching
- error normalization
- response compression
- API version handling
That leads to duplicated logic, inconsistent behavior, and harder operations.
The gateway centralizes edge concerns so backend services can focus more on business rules.
2.3 Main Responsibilities
| Responsibility | What it does | Why it belongs at the gateway |
|---|---|---|
| Authentication | Verifies tokens, API keys, signed requests | cheap early reject before backend work |
| Authorization at coarse level | Rejects callers that cannot access an API family | reduces unnecessary downstream traffic |
| Request aggregation | Combines data from multiple services into one response | reduces client chattiness, especially mobile |
| Centralized logging | Captures request metadata once | consistent audit and debugging |
| Observability | emits metrics, traces, request IDs | easier latency and error tracking |
| Service discovery integration | resolves service names to live instances | works with autoscaling and dynamic infra |
| Retries and timeouts | handles transient failures | reduces client-visible errors when used carefully |
| Circuit breaking | stops sending traffic to failing backends | prevents cascading failure |
| Response transformation | maps internal responses to public API shape | decouples client contract from internal service contract |
| Caching | serves repeated read traffic cheaply | lowers latency and backend load |
2.4 How It Works Internally
A gateway typically processes a request through a pipeline:
- Accept TCP/TLS connection.
- Terminate TLS or pass it through depending on setup.
- Parse HTTP request metadata.
- Match route rules using host, path, method, headers, or query parameters.
- Apply middleware or policies such as auth, rate limiting, schema checks, and logging.
- Resolve the upstream service using static config or service discovery.
- Select a healthy backend instance using a load-balancing policy.
- Forward the request.
- Apply retries, timeouts, or circuit-breaker policy if needed.
- Transform, compress, cache, or redact the response.
- Emit logs, metrics, and trace spans.
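The pipeline above can be sketched as an ordered chain of small policy functions followed by a route lookup. This is a minimal illustration under stated assumptions, not how any particular gateway product is implemented: `Request`, `Response`, `authenticate`, `make_rate_limiter`, and `handle` are hypothetical names, the auth check only looks for a header, and the rate limiter is an in-memory counter that never resets.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    method: str
    path: str
    headers: dict = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str = ""


# A policy returns a Response to short-circuit, or None to continue.
Policy = Callable[[Request], Optional[Response]]


def authenticate(req: Request) -> Optional[Response]:
    # Hypothetical check: a real gateway would verify a token or signature.
    if "Authorization" not in req.headers:
        return Response(401, "missing credentials")
    return None


def make_rate_limiter(limit: int) -> Policy:
    counts: dict = {}

    def rate_limit(req: Request) -> Optional[Response]:
        caller = req.headers.get("Authorization", "anonymous")
        counts[caller] = counts.get(caller, 0) + 1
        if counts[caller] > limit:
            return Response(429, "rate limit exceeded")
        return None

    return rate_limit


def handle(req: Request, policies: list, routes: dict) -> Response:
    # Run policies in order; the first rejection wins.
    for policy in policies:
        rejection = policy(req)
        if rejection is not None:
            return rejection
    # Route by longest matching path prefix, then "forward" to the handler.
    for prefix in sorted(routes, key=len, reverse=True):
        if req.path.startswith(prefix):
            return routes[prefix](req)
    return Response(404, "no route")
```

The ordering is the design choice worth noticing: cheap rejections (auth, rate limit) run before any upstream work, so a rejected request costs the backend nothing.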
2.5 Request Lifecycle Through a Gateway
sequenceDiagram
participant Client
participant Gateway
participant Auth as Auth/JWKS
participant Registry as Service Discovery
participant Service as Backend Service
participant Obs as Observability
Client->>Gateway: HTTPS request
Gateway->>Auth: Validate token or signature
Auth-->>Gateway: Auth result
Gateway->>Registry: Resolve upstream instances
Registry-->>Gateway: Healthy endpoints
Gateway->>Service: Forward request
Service-->>Gateway: Response
Gateway->>Obs: Logs / metrics / traces
Gateway-->>Client: Final response
2.6 Request Aggregation
Request aggregation is when the gateway calls multiple backend services and combines their results into a single response.
Example: a mobile home screen might need:
- profile service
- recommendation service
- notification service
- recent activity service
Without aggregation, the mobile app might issue 4 to 8 network calls. With a gateway or backend-for-frontend layer, the client makes one call and receives one composite response.
Why it exists:
- mobile networks are high latency
- clients should not need to know internal service topology
- it reduces repetitive orchestration logic across clients
Tradeoffs:
- the gateway becomes more complex
- partial failures are harder to represent
- tail latency can worsen because one slow dependency slows the whole aggregate response
- aggregation logic can become accidental business logic
Best practice: keep aggregation focused on shaping data for clients, not on implementing domain rules that belong in services.
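A minimal sketch of aggregation with per-dependency timeouts, assuming an asyncio-based gateway. `fetch_section` is a hypothetical stand-in for real backend calls, and the delays and failure flag are simulated. The point is that one slow or failing dependency degrades a single section instead of failing the whole composite response.

```python
import asyncio


async def fetch_section(name: str, delay: float, fail: bool = False) -> dict:
    # Stand-in for a backend call; `delay` and `fail` simulate behaviour.
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} unavailable")
    return {"source": name}


async def guarded(coro, timeout: float):
    # Convert a slow or failing dependency into None instead of failing
    # the whole aggregate response.
    try:
        return await asyncio.wait_for(coro, timeout)
    except (asyncio.TimeoutError, RuntimeError):
        return None


async def home_screen(timeout: float = 0.05) -> dict:
    names = ["profile", "recommendations", "notifications", "activity"]
    calls = [
        fetch_section("profile", 0.0),
        fetch_section("recommendations", 0.5),  # too slow, will time out
        fetch_section("notifications", 0.0, fail=True),
        fetch_section("activity", 0.0),
    ]
    # All four calls run concurrently, so total latency is bounded by the
    # timeout rather than by the sum of the individual calls.
    results = await asyncio.gather(*(guarded(c, timeout) for c in calls))
    sections = dict(zip(names, results))
    # Tell the client which parts are missing rather than hiding it.
    degraded = sorted(n for n, r in sections.items() if r is None)
    return {"sections": sections, "degraded": degraded}
```

Running `asyncio.run(home_screen())` returns the profile and activity sections and lists the other two as degraded, which is one way to represent partial failure explicitly.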
2.7 Authentication at the Gateway
Authenticating at the gateway is common because unauthenticated requests are cheap to reject early, and authentication logic is expensive to repeat in every service.
Typical patterns:
- JWT validation at the gateway using cached public keys from a JWKS endpoint
- API key lookup for machine clients
- OAuth token introspection for opaque tokens
- mTLS for service-to-service requests in internal systems
Important nuance: gateway authentication does not eliminate service-level authorization.
The gateway can answer, "Is this caller known and allowed to hit this API family?"
The service still often needs to answer, "Can this user access this specific invoice, order, or repository?"
Common mistake: pushing all authorization into the gateway. Fine-grained authorization usually belongs closer to the business object.
2.8 Centralized Logging and Observability
The gateway is the best place to generate or propagate correlation identifiers such as:
- request ID
- trace ID
- span ID
- tenant ID
- client application ID
Useful gateway metrics:
- requests per second by route
- latency percentiles, especially p95 and p99
- error rates by status code family
- upstream retry counts
- rate-limited requests
- auth failures
- cache hit ratio
Observability matters because many request-handling bugs look similar from the outside. A user just sees a timeout. Internally, that timeout might have been caused by:
- route misconfiguration
- unhealthy instances still receiving traffic
- retry storms
- TLS handshake issues
- bad DNS resolution
- a dependency that is slow but not fully down
2.9 Service Discovery Integration
In dynamic environments, backend instances change constantly because of autoscaling, deployments, and failures. Hardcoding backend IPs is not realistic.
The gateway therefore needs service discovery.
Common models:
- client-side discovery: the caller resolves service instances and chooses one
- server-side discovery: the gateway or load balancer resolves instances and forwards traffic
In Kubernetes, a gateway often routes to a Service object, and kube-proxy or the data plane routes to live pod endpoints. In service-mesh-heavy environments, Envoy sidecars may receive endpoint updates via xDS-style control-plane APIs.
Failure case: stale discovery data can send traffic to dead instances.
Best practices:
- respect health information, not just presence in registry
- support quick config propagation
- use connection draining when removing instances
- avoid very aggressive caching of endpoint lists
2.10 Retries
Retries are deceptively dangerous.
Why they exist:
- networks fail transiently
- connections reset occasionally
- an instance may fail while others are healthy
Why they are risky:
- retries multiply traffic load during incidents
- non-idempotent operations may execute twice
- stacked retries at multiple layers create retry storms
Best practices:
- retry only idempotent or safely repeatable operations
- use bounded retries, usually very small counts
- add exponential backoff and jitter
- couple retries with timeouts and circuit breakers
- never let every layer retry blindly
Interview point: if a gateway retries a POST payment request, you must discuss idempotency keys.
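The practices above can be sketched as a bounded retry loop with exponential backoff, full jitter, and a stable idempotency key. This is an illustrative sketch: `TransientError`, `backoff_delays`, and `call_with_retries` are hypothetical names, and the sketch assumes the server deduplicates requests that carry the same key.

```python
import random
import time


class TransientError(Exception):
    """Raised by the hypothetical downstream call for retryable failures."""


def backoff_delays(max_retries: int, base: float, cap: float, rng=random.random):
    # "Full jitter": pick a random delay in [0, min(cap, base * 2**attempt)].
    for attempt in range(max_retries):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, idempotency_key: str, max_retries: int = 2,
                      base: float = 0.1, cap: float = 2.0, sleep=time.sleep):
    # The same idempotency key is sent on every attempt so the server can
    # deduplicate a request that actually succeeded before the failure.
    delays = backoff_delays(max_retries, base, cap)
    while True:
        try:
            return operation(idempotency_key)
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted; surface the error
            sleep(delay)
```

Only errors classified as transient are retried, and the retry count is small by default. Anything else propagates immediately, which keeps this layer from amplifying an incident.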
2.11 Circuit Breaking
Circuit breaking protects the rest of the system from a dependency that is failing or timing out.
Typical states:
- closed: traffic flows normally
- open: requests fail fast instead of calling a bad dependency
- half-open: limited test traffic checks whether recovery has happened
Why it matters:
If a database or dependency is timing out, continuing to send full traffic often just consumes worker threads, saturates queues, and increases latency everywhere.
Circuit breaking buys time and preserves system health.
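The three states can be sketched as a small state machine. This is a simplified illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`): it counts consecutive failures rather than a failure rate, and it does not cap how many probe requests pass while half-open, which production breakers usually do.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed probe in half-open, or too many failures in closed,
            # (re)opens the circuit and restarts the timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a timeout, which is what frees worker threads and queues during an incident.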
2.12 Response Transformation
Gateways often transform responses by:
- removing internal fields
- renaming fields for public API consistency
- changing status code mappings
- combining multiple backend responses into one DTO
- translating protocols such as gRPC to JSON/HTTP for clients
Useful when:
- internal services evolve independently
- multiple clients need different shapes
- you want to hide internal topology
Danger: too much transformation turns the gateway into a fragile orchestration layer.
2.13 Request and Response Caching
Caching at the gateway is powerful for read-heavy APIs.
Good candidates:
- public or semi-public GET responses
- configuration or feature metadata
- rarely changing reference data
Hard parts:
- cache invalidation
- per-user or per-tenant cache keys
- auth-sensitive content
- stale responses after writes
Best practices:
- cache only clearly safe responses
- include authorization context in the cache key if needed
- use TTLs conservatively
- prefer cache headers and explicit policies over guesswork
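A minimal sketch of two of these practices: putting the authorization context in the cache key and expiring entries with a conservative TTL. `ResponseCache` is a hypothetical in-memory class, and using a tenant ID as the authorization context is an assumption made for illustration; a real gateway would also honor cache headers.

```python
import time


class ResponseCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}

    @staticmethod
    def key(method: str, path: str, tenant_id: str = "") -> tuple:
        # The tenant is part of the key so one tenant can never be served
        # another tenant's cached response.
        return (method, path, tenant_id)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl:
            del self._entries[key]  # expired
            return None
        return value

    def put(self, key, value) -> None:
        method = key[0]
        if method != "GET":
            return  # only cache safe, read-only requests
        self._entries[key] = (self.clock(), value)
```

Two tenants requesting the same path get separate cache entries, and a response to a write method is never stored.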
2.14 Gateway vs Service Mesh
| Dimension | API Gateway | Service Mesh |
|---|---|---|
| Main traffic direction | north-south, from clients into platform | east-west, service-to-service |
| Main role | edge policy and API entry | internal traffic management |
| Common features | auth, rate limiting, versioning, aggregation, public API concerns | mTLS, retries, traffic shaping, service identity, observability |
| Consumer | external clients or apps | internal services |
| Operational risk | can become a choke point | can add significant platform complexity |
| Typical tools | API Gateway products, NGINX, Kong, Envoy | Istio, Linkerd, Consul Connect, Envoy-based meshes |
They are not mutually exclusive. Many real systems use both.
2.15 Production Patterns
- Netflix popularized a gateway-style edge layer to handle cross-cutting concerns before requests hit microservices.
- Stripe-like public APIs often emphasize idempotency, auth, versioning, and request logging at the edge because correctness matters more than raw throughput alone.
- Large SaaS platforms often use gateways to enforce tenant-aware limits and route requests to the correct service family.
2.16 Common Mistakes
- turning the gateway into a monolith of business logic
- retrying non-idempotent requests
- centralizing coarse and fine-grained authorization in the same place
- logging secrets or PII in raw form
- caching personalized responses incorrectly
- making the gateway a single point of failure without horizontal scaling
3. Request Routing
Routing decides where a request goes after it enters the system.
This sounds simple, but routing is one of the most important control points in production because it determines:
- which service handles the request
- which version handles it
- which region handles it
- which tenant or experimental path it follows
3.1 Routing Types
| Routing style | How it works | Common use cases | Risks |
|---|---|---|---|
| Path-based routing | route by URL path such as /payments or /users | REST APIs, monolith decomposition, ingress rules | overlapping route patterns, regex complexity |
| Host-based routing | route by hostname such as api.example.com or admin.example.com | multiple products or domains behind one edge | DNS and certificate management complexity |
| Header-based routing | route using headers such as version, tenant, device type, or experiment ID | canaries, A/B tests, tenant isolation | header spoofing, harder debugging |
| Geo-based routing | route by location, region, or country | latency reduction, data residency, regulatory compliance | incorrect geo inference, data locality problems |
| Canary routing | send a small portion of traffic to a new version | safe rollout | canary users may not represent real load |
| Blue-green routing | switch traffic between old and new environments | low-risk deployments with quick rollback | expensive duplication, data migration risk |
| Weighted traffic splitting | send 90 percent to old version and 10 percent to new version, then ramp | gradual deployment, model rollout | sticky results, measurement bias |
3.2 How Routing Usually Works Internally
The router generally evaluates rules in order:
- Match host.
- Match method and path.
- Evaluate higher-priority header or cookie rules.
- Apply traffic-splitting policy if multiple upstreams are eligible.
- Resolve the target service using discovery.
- Pick a healthy instance.
Rule order matters. A subtle configuration bug can shadow a more specific route with a broader one.
3.3 Routing Decision Flow
flowchart TD
A[Incoming Request] --> B{Host Match?}
B -->|api.example.com| C{Path Match?}
B -->|admin.example.com| D[Admin Service]
C -->|/v1/payments| E{Header or Canary Rule?}
C -->|/v1/users| F[User Service]
E -->|Canary| G[Payments vNext]
E -->|Default| H[Payments vCurrent]
3.4 Path-Based Routing
This is the most common form of routing for HTTP APIs.
Examples:
- /orders/* to order service
- /payments/* to payment service
- /search/* to search service
Why it exists:
- intuitive for REST-style APIs
- easy to reason about operationally
- fits ingress and gateway tools well
Failure case: route collisions. For example, a generic /payments/* rule may accidentally catch /payments/admin/* if route precedence is wrong.
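One way to avoid this class of bug is to make precedence independent of declaration order by always choosing the most specific match. The sketch below assumes simple prefix routes; `match_route` and the `ROUTES` table are hypothetical, and real gateways each define their own precedence rules.

```python
def match_route(path: str, routes: dict):
    # Choose the most specific (longest) matching prefix so that a broad
    # rule cannot shadow a narrower one, regardless of declaration order.
    best = None
    for prefix in routes:
        # Match only on a path-segment boundary: "/payments" must not
        # match "/paymentsx".
        on_boundary = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if on_boundary and (best is None or len(prefix) > len(best)):
            best = prefix
    return routes[best] if best is not None else None


ROUTES = {
    "/payments": "payment-service",
    "/payments/admin": "payment-admin-service",
    "/orders": "order-service",
}
```

With this table, `/payments/admin/refunds` resolves to the admin service even though the broader `/payments` rule is declared first.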
3.5 Host-Based Routing
Host-based routing routes by domain name.
Examples:
- api.company.com
- admin.company.com
- uploads.company.com
- hooks.company.com
This is useful when products or workloads differ enough that they deserve different operational policies.
For example, a webhook ingestion domain may need different timeout, retry, and rate-limit rules than a user-facing API domain.
3.6 Header-Based Routing
Header-based routing is common for:
- API version rollout
- internal testing
- tenant routing
- language or device-specific responses
- canary routing with explicit opt-in
Example headers:
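- X-API-Version
- X-Tenant-ID
- X-Experiment
- X-Canary
Be careful: headers set by internal callers can usually be trusted, but on public APIs any client can send them, so routing headers can be spoofed unless the gateway validates or overwrites them based on auth and policy.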
3.7 Geo-Based Routing
Geo routing tries to send users to the best region based on one or more goals:
- lower latency
- data residency compliance
- regulatory boundaries
- disaster isolation
- capacity balancing
Examples:
- EU users sent to EU region for GDPR-sensitive workloads
- ride-sharing or mapping traffic sent to region nearest user demand
- global SaaS tenants pinned to a home region
Tradeoffs:
- nearest region is not always best if the user’s data lives elsewhere
- geo-IP is imperfect
- cross-region writes can be very expensive or consistency-sensitive
Interview point: geo routing and data placement must be discussed together.
3.8 Canary Routing
Canary routing sends a small portion of traffic to a new version first.
Typical rollout:
- 1 percent traffic
- 5 percent traffic
- 10 percent traffic
- 25 percent traffic
- 50 percent traffic
- 100 percent traffic
What you watch:
- error rate
- latency regression
- resource utilization
- business metrics such as checkout success or sign-in success
Why companies use it:
- safer than instant full rollout
- catches dependency or schema issues early
- provides rollback window
Real-world intuition: Netflix-style continuous deployment is only practical because traffic shaping and observability let teams expose changes gradually.
3.9 Blue-Green Deployments
Blue-green means you maintain two environments:
- blue: current production
- green: new version
Then you shift traffic from one to the other.
Why it exists:
- rollback is simple in theory because the old environment still exists
- deployment risk is separated from release risk: the new version can be deployed and verified in green before any production traffic reaches it
Where it gets hard:
- databases do not switch as cleanly as stateless services
- dual environments cost more
- background jobs and asynchronous consumers may still affect shared data
3.10 Traffic Splitting
Traffic splitting is a more general concept than canary.
You can split traffic by:
- percentage
- user cohort
- tenant tier
- geography
- request attributes
- session stickiness
This is useful for:
- A/B experiments
- canaries
- ML model rollout
- progressive feature migration
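Percentage-based splitting is often made deterministic by hashing a stable key, so the same user sees the same version on every request. The sketch below is illustrative only: `bucket`, `choose_upstream`, and the salt value are hypothetical, and SHA-256 is simply one convenient stable hash.

```python
import hashlib


def bucket(key: str, salt: str = "") -> int:
    # Stable bucket in [0, 100): the same key always lands in the same bucket.
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_upstream(user_id: str, canary_percent: int, salt: str = "payments-v2") -> str:
    # Users whose bucket falls below the canary percentage get the new version.
    # Raising the percentage only adds users; nobody flips back to stable.
    return "canary" if bucket(user_id, salt) < canary_percent else "stable"
```

Using a different salt per experiment keeps the same users from always landing in every canary, which helps reduce measurement bias across rollouts.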
3.11 Service Discovery Impact
Routing depends on service discovery more than most beginners realize.
A route usually points to a logical service name, not a fixed machine. The system must map that name to currently healthy instances.
If service discovery is stale or slow:
- requests go to dead instances
- traffic may concentrate on a few nodes
- rollout changes may not propagate consistently
Best practices:
- keep route definitions separate from ephemeral instance identity
- use health-aware endpoint selection
- support connection draining during deployment
- prefer automation over manual endpoint lists
3.12 Common Interview Discussions
- How do you safely route traffic to a new version?
- How do you guarantee tenant isolation during routing?
- What happens if a route config is wrong globally?
- How do you roll back a bad canary quickly?
4. Load Balancing
4.1 Why Load Balancing Exists
If requests were sent to a single server, that server would become a bottleneck and a single point of failure.
Load balancing exists to:
- distribute traffic across multiple instances
- improve availability
- enable horizontal scaling
- reduce overload on any single machine
- route away from unhealthy instances
Horizontal scaling is the key idea. Instead of buying one huge machine forever, you run multiple smaller instances and distribute work.
4.2 Active-Active vs Active-Passive
| Mode | Meaning | Benefits | Drawbacks | Common use |
|---|---|---|---|---|
| Active-active | multiple nodes or regions serve traffic at the same time | high availability, better capacity utilization, low failover time | more complex consistency and routing | web and API frontends, global services |
| Active-passive | one node or region is primary, another waits as standby | simpler operational model, easier correctness story | slower failover, unused capacity | legacy systems, cost-sensitive setups, some databases |
Interview rule of thumb: active-active improves availability and latency, but only if the data layer and operational discipline are strong enough to support it.
4.3 Common Load-Balancing Algorithms
| Algorithm | How it works | Best for | Strengths | Weaknesses |
|---|---|---|---|---|
| Round robin | rotate evenly through instances | similar stateless servers | simple and cheap | ignores load differences |
| Weighted round robin | same as round robin, but some instances get more traffic | mixed-capacity fleets | easy to express capacity skew | weights can become stale |
| Least connections | send to instance with fewest active connections | long-lived connections, uneven request duration | better than round robin for sticky or long sessions | connection count may not reflect real CPU or memory load |
| Least response time | prefer instances with lower observed latency | latency-sensitive APIs | reacts to slow instances | can create feedback loops or oscillations |
| Consistent hashing | map request key to instance ring | caches, sharded workloads, affinity | minimal reshuffling when instances change | hotspot risk if keys are skewed |
| IP hash | hash client IP to pick instance | simple affinity | easy stickiness | NAT and shared IPs skew traffic badly |
One strong interview addition: consistent hashing is not a general-purpose default. It is especially useful when request locality matters, such as cache ownership or shard routing.
4.4 Round Robin
Round robin is the simplest algorithm: instance A, then B, then C, then back to A.
Why it exists:
- low overhead
- easy to reason about
- works fine when instances are homogeneous and requests are similar
Why it fails:
- requests may vary wildly in cost
- one node may be slow but still receive equal traffic
4.5 Weighted Round Robin
Weighted round robin lets you give bigger instances more traffic.
Example:
- instance A weight 4
- instance B weight 2
- instance C weight 1
This is useful during mixed instance migrations or when canary nodes should receive only a small share.
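One way to implement the weights above is the variant often called smooth weighted round robin, which interleaves picks instead of sending several requests in a row to the heaviest instance. This is a sketch of that variant with a hypothetical class name, not the only way to implement weighting.

```python
class SmoothWeightedRoundRobin:
    def __init__(self, weights: dict):
        self.weights = dict(weights)
        self.total = sum(self.weights.values())
        self.current = {name: 0 for name in self.weights}

    def next(self) -> str:
        # Every instance gains its weight; the leader is picked and then
        # pays back the total, which spreads picks evenly over time.
        for name, weight in self.weights.items():
            self.current[name] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen
```

With weights A=4, B=2, C=1, this sketch yields A, B, A, C, A, B, A and then repeats, so A still receives four of every seven requests without a burst of four consecutive picks.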
4.6 Least Connections
This is common when connection duration matters, such as proxies handling long-lived connections or websocket-heavy workloads.
Why it helps:
- a server already holding many active sessions gets less new work
Limitations:
- 10 idle connections are not equivalent to 10 expensive requests
- CPU-heavy but short-lived requests may still be imbalanced
4.7 Least Response Time
This tries to avoid slow nodes by routing to faster ones.
Good intuition: if one instance’s latency is rising, it may be overloaded or degraded.
Risk: feedback loops.
If the algorithm overreacts, traffic can bounce around and create instability. Slow-start, damping, and outlier detection often help.
4.8 Consistent Hashing
Consistent hashing is important in system design interviews.
Instead of randomly balancing every request, you map a key such as:
- user ID
- session ID
- cache key
- shard key
to a position on a hash ring. Each instance owns a portion of the ring.
Why it exists:
- when instances join or leave, only a subset of keys remap
- this preserves cache locality and reduces churn
Common use cases:
- distributed caches
- sharded databases
- sticky-ish routing without a central session store
Failure case: skewed keys can create hotspots. Virtual nodes and better key design are common mitigations.
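A minimal hash ring with virtual nodes, as a sketch of the idea rather than a production implementation. `HashRing` is a hypothetical class, SHA-256 is an arbitrary choice of hash, and 100 virtual nodes per instance is an assumed default.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed at many positions (virtual nodes) to even
        # out the share of the ring it owns.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's position, wrapping around.
        index = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[index % len(self._ring)][1]
```

When a fourth node joins a three-node ring, the only keys that change owner are the ones that move to the new node; every other key keeps its previous instance, which is what preserves cache locality.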
4.9 IP Hash
IP hash is a crude form of affinity.
It is easy to set up, but it performs poorly when many clients appear behind a single NAT or corporate proxy. One large office can accidentally behave like one giant user from the load balancer’s perspective.
4.10 Distributed Load Balancing
At scale, the load balancer itself must also scale.
That means:
- multiple LB nodes, not one appliance
- shared or replicated configuration
- health state propagation
- often anycast or DNS in front of LB fleets
If the load balancer layer is not distributed, you have just moved the bottleneck.
4.11 Global Load Balancing
Global load balancing chooses a region before a local load balancer chooses an instance.
Goals:
- reduce latency
- avoid unhealthy regions
- keep traffic near user or data
- manage regional capacity
flowchart TD
U[Global Users] --> GTM[Global Traffic Manager]
GTM --> US[US Region]
GTM --> EU[EU Region]
GTM --> AP[APAC Region]
US --> USLB[Regional Load Balancer]
EU --> EULB[Regional Load Balancer]
AP --> APLB[Regional Load Balancer]
USLB --> US1[Service Instances]
EULB --> EU1[Service Instances]
APLB --> AP1[Service Instances]
Google-scale and Amazon-scale systems rely heavily on global traffic management concepts because the first routing decision is often regional, not per-instance.
4.12 DNS Load Balancing
DNS can return different IPs for the same hostname.
Why it is attractive:
- simple
- globally available
- often the first layer of balancing
Limitations:
- DNS caching means failover is not immediate
- clients do not always honor TTL precisely
- DNS cannot see application-level health very well by itself
Interview point: DNS is useful, but it is usually too coarse to be the only failover mechanism.
4.13 Best Practices
- remove unhealthy instances quickly
- use connection draining before terminating instances
- prefer zone-aware balancing if cross-zone traffic is expensive
- track p95 and p99, not just average latency
- use slow start for newly added instances so they do not get overwhelmed instantly
- treat retries as part of traffic load, not separate from it
4.14 Common Failure Cases
- unhealthy instances still receive traffic because readiness is wrong
- least-response-time routing amplifies instability
- one AZ gets overloaded because balancing is not zone-aware
- sticky affinity causes hotspots
- the load balancer layer itself is not redundant
5. Rate Limiting
5.1 Why It Exists
Rate limiting controls how much traffic a client, tenant, API key, IP, or endpoint can send over time.
It exists to enforce:
- abuse prevention
- fairness
- cost control
- protection of downstream systems
- multi-tenant isolation
Without rate limits, one abusive or buggy client can monopolize resources.
5.2 Where It Is Applied
Rate limits can exist at multiple layers:
- CDN or edge
- API gateway
- service layer
- database or queue concurrency controls
Different layers often enforce different types of limits.
Examples:
- per-IP login limit at the edge
- per-API-key limit at the gateway
- per-tenant expensive-operation limit in the service
5.3 Common Algorithms
| Algorithm | Idea | Strengths | Weaknesses | Good fit |
|---|---|---|---|---|
| Fixed window | count requests in each fixed period | very simple | bursty at window boundaries | low-complexity systems |
| Sliding window | approximate rolling window using adjacent buckets | smoother than fixed window | more logic and state | typical APIs |
| Sliding log | keep timestamps of requests | precise | expensive in memory and compute | strict low-volume policies |
| Token bucket | tokens refill at a constant rate, requests spend tokens | allows controlled bursts | stateful logic | public APIs with burst tolerance |
| Leaky bucket | requests enter a bucket and drain at fixed rate | smooths outgoing rate | may delay or drop burst traffic | traffic shaping and smoothing |
5.4 Fixed Window
Example rule: 100 requests per minute.
Implementation is often as simple as:
- increment a counter for current window
- reject if counter exceeds limit
- expire counter at end of window
Problem: boundary burst.
A client can send 100 requests in the last second of one minute and 100 more in the first second of the next, so 200 requests pass in about two seconds, double the intended rate.
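The boundary burst is easy to reproduce with a small in-memory sketch. `FixedWindowLimiter` is a hypothetical class that takes the current time as an argument so the behaviour is deterministic; it never prunes old windows, which a real limiter would do through key expiry.

```python
class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # window index -> request count

    def allow(self, now: float) -> bool:
        # All timestamps inside the same window share one counter.
        index = int(now // self.window)
        count = self.counts.get(index, 0)
        if count >= self.limit:
            return False
        self.counts[index] = count + 1
        return True
```

With a limit of 100 per 60 seconds, 150 attempts at t=59.5 admit 100 requests, and 150 more attempts at t=60.5 admit another 100, because the second batch falls into a fresh window.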
5.5 Sliding Window
Sliding window reduces the boundary-burst problem.
Instead of treating time as disconnected minute buckets, it approximates usage across a rolling interval. This is fairer and smoother for APIs.
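A common approximation keeps only two counters and weights the previous window by how much of it still overlaps the rolling interval. The sketch below assumes that approach; `SlidingWindowCounter` is a hypothetical name, and the estimate treats requests in the previous window as evenly spread, which is an approximation rather than an exact count.

```python
class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.index = 0      # index of the current fixed window
        self.current = 0    # requests counted in the current window
        self.previous = 0   # requests counted in the window before it

    def allow(self, now: float) -> bool:
        index = int(now // self.window)
        if index != self.index:
            # Roll forward; if more than one window passed, nothing carries over.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index
        elapsed = (now % self.window) / self.window
        # Weight the previous window by how much of it still overlaps
        # the rolling window ending at `now`.
        estimated = self.current + self.previous * (1.0 - elapsed)
        if estimated >= self.limit:
            return False
        self.current += 1
        return True
```

Replaying a burst of 100 requests at t=59.5 followed by more attempts at t=60.5 admits only one additional request, because almost all of the previous window still counts against the limit.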
5.6 Sliding Log
Sliding log stores individual request timestamps and removes old ones.
It is the most exact of the common approaches, but also the most expensive. It is rarely the default choice for very high-cardinality high-throughput public traffic.
5.7 Token Bucket
This is one of the most useful algorithms to understand.
Mental model:
- tokens drip into a bucket at a steady rate
- each request consumes one or more tokens
- if the bucket is empty, reject or delay
Why it is popular:
- supports bursts up to bucket capacity
- preserves average rate over time
- easy to reason about for product limits
Example:
- refill 10 tokens per second
- bucket size 50
The client can burst 50 requests instantly, but over time they only sustain about 10 per second.
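The example above can be sketched with lazy refill: instead of a background timer, the bucket tops itself up based on elapsed time whenever a request arrives. `TokenBucket` is a hypothetical in-memory class, and time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily based on elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

With `rate=10` and `capacity=50`, a client can spend 50 tokens at once, gets 10 more after one second, and never accumulates more than 50 no matter how long it stays idle. The `cost` parameter lets expensive endpoints consume more than one token per request.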
5.8 Leaky Bucket
Leaky bucket emphasizes output smoothing more than burst allowance.
Requests may queue and drain at a steady rate. This is useful when downstream systems need smooth, predictable load rather than spikes.
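A leaky bucket used as a traffic shaper can be sketched by assigning each request a release time. In this hypothetical `LeakyBucket` class, `capacity` is assumed to mean the maximum number of requests that may be waiting; anything beyond that is dropped.

```python
class LeakyBucket:
    def __init__(self, drain_rate: float, capacity: int):
        self.interval = 1.0 / drain_rate      # seconds between released requests
        self.capacity = capacity              # maximum number of waiting requests
        self.last_scheduled = float("-inf")   # release time of the latest request

    def schedule(self, now: float):
        """Return the time at which the request may proceed, or None if dropped."""
        # Requests are released no closer together than `interval`.
        release_at = max(now, self.last_scheduled + self.interval)
        # How many drain intervals this request would have to wait.
        backlog = (release_at - now) / self.interval
        if backlog > self.capacity:
            return None
        self.last_scheduled = release_at
        return release_at
```

Unlike a token bucket, idle time earns no burst credit here: a burst of arrivals is spread out at the drain rate, and the overflow is rejected.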
5.9 Redis Implementation Patterns
Redis is common for distributed rate limiting because it is fast and supports atomic operations.
Typical patterns:
- fixed window: INCR plus EXPIRE
- sliding log: sorted set of timestamps with ZADD, ZREMRANGEBYSCORE, and ZCARD
- token bucket: store token count and last refill timestamp, update atomically with Lua
Why Lua scripts matter:
- distributed rate limiting requires atomic read-modify-write behavior
- without atomicity, concurrent requests can exceed the intended limit
Key design examples:
- ratelimit:user:123:/payments
- ratelimit:tenant:acme:minute
- ratelimit:ip:203.0.113.10:login
5.10 Distributed Rate Limiting Challenges
This is where interview answers often become shallow.
Real problems include:
- hot keys for large tenants or popular routes
- clock skew between nodes
- cross-region consistency
- Redis outages
- fail-open versus fail-closed policy decisions
- cardinality explosion if keys are too granular
Fail-open means the limiter allows traffic if the limiter store is down.
Fail-closed means it rejects traffic if the limiter store is down.
Which is right depends on the endpoint:
- login or anti-abuse endpoint may prefer fail-closed
- general read endpoint may prefer fail-open to preserve availability
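The policy choice can be made explicit in code rather than left to whatever the exception handler happens to do. In this sketch, `LimiterUnavailable`, `FAIL_OPEN_BY_ROUTE`, and the route names are hypothetical; the point is only that each endpoint declares its own behavior when the limiter store is unreachable.

```python
from typing import Callable, Dict


class LimiterUnavailable(Exception):
    """Raised by a (hypothetical) limiter backend when its store cannot be reached."""


# Assumed per-route policy: True means fail open, False means fail closed.
FAIL_OPEN_BY_ROUTE: Dict[str, bool] = {
    "/login": False,     # abuse-sensitive: reject when the limiter is down
    "/catalog": True,    # general read: keep serving when the limiter is down
}


def check_with_policy(route: str, check: Callable[[], bool]) -> bool:
    """Run the limiter check and apply the route's failure policy if it errors."""
    try:
        return check()
    except LimiterUnavailable:
        # Unknown routes default to fail closed in this sketch.
        return FAIL_OPEN_BY_ROUTE.get(route, False)
```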
5.11 Best Practices
- limit by the right identity: IP is often not enough
- use different limits for read, write, and expensive endpoints
- return 429 Too Many Requests with Retry-After when possible
- monitor near-limit behavior, not just hard rejections
- consider shadow mode before enforcing a new limit in production
- keep rate limiting close to the edge for cheap rejection
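A small sketch of the client-facing rejection, assuming a framework-neutral `(status, headers, body)` tuple and a hypothetical error body. Retry-After is sent as a whole number of seconds, rounded up so clients do not retry slightly too early.

```python
import json
import math
from typing import Dict, Tuple


def too_many_requests(retry_after_seconds: float) -> Tuple[int, Dict[str, str], str]:
    """Build a 429 response that tells the client when it is worth retrying."""
    seconds = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Retry-After": str(seconds),
        "Content-Type": "application/json",
    }
    body = json.dumps({"error": "rate_limited", "retry_after_seconds": seconds})
    return 429, headers, body
```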
5.12 Real-World Intuition
- GitHub-style public APIs need clear client-visible limits to keep the platform fair.
- Stripe-like payment APIs need rate limiting to protect correctness-sensitive backends from abuse or accidental retry loops.
- Multi-tenant SaaS platforms often combine per-user, per-tenant, and per-endpoint limits.
6. Request Validation
Validation is the discipline of refusing bad requests before they do damage.
This includes much more than checking whether a JSON field exists.
6.1 Why Validation Exists
Requests are dangerous because they may be:
- malformed
- malicious
- duplicated
- replayed
- semantically invalid
- inconsistent with business rules
Validation protects:
- correctness
- security
- downstream capacity
- developer sanity
6.2 Validation Layers
| Layer | Typical checks | Why here |
|---|---|---|
| Edge or gateway | body size, basic schema, auth format, signature presence, rate limit | cheap early rejection |
| API layer | required fields, type checks, enum checks, version compatibility | contract correctness |
| Domain layer | business rules and state-dependent validation | real correctness |
| Database layer | unique constraints, foreign keys, transactional guarantees | final integrity guardrail |
Important interview point: validation should be layered. Do not rely on only one layer.
6.3 Schema Validation
Schema validation checks the request shape.
Examples:
- JSON schema or OpenAPI validation for REST
- protobuf validation for gRPC
- GraphQL schema and resolver validation
Why it exists:
- catches bad input early
- makes API behavior predictable
- prevents weird null or type bugs from leaking deep into business logic
What it does not do:
- prove business correctness
Example:
- schema can prove amount exists and is numeric
- it cannot prove that a user is allowed to charge that amount
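The split between shape and business correctness can be sketched as two separate functions. The field names, error strings, and the `charge_limit` parameter are assumptions for illustration; a real API would typically use a schema library for the first step.

```python
from typing import List


def validate_shape(payload: dict) -> List[str]:
    """Schema-level checks: fields exist and have the right types."""
    errors = []
    amount = payload.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append("amount must be a number")
    if not isinstance(payload.get("currency"), str):
        errors.append("currency must be a string")
    return errors


def validate_business(payload: dict, charge_limit: float) -> List[str]:
    """Domain-level checks: run only after the shape is known to be valid."""
    errors = []
    if payload["amount"] <= 0:
        errors.append("amount must be positive")
    elif payload["amount"] > charge_limit:
        errors.append("amount exceeds this user's charge limit")
    return errors
```

A payload can pass the first function and still fail the second, which is why the domain layer cannot delegate its checks to the gateway.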
6.4 Input Sanitization
Sanitization is about ensuring input cannot be used to exploit downstream systems.
Common concerns:
- SQL injection
- command injection
- path traversal
- log injection
- XSS if data will later be rendered in browsers
The important mindset is not "strip all special characters". That often breaks legitimate input.
Better practice:
- use parameterized queries
- encode output for the correct context
- validate formats where needed
- avoid blindly interpolating request data into logs or shell commands
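To make the first practice concrete, here is a small runnable sketch using Python's built-in sqlite3 module with an in-memory database. The table, the `find_user` helper, and the hostile input are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user(connection: sqlite3.Connection, name: str) -> list:
    # The ? placeholder sends the value separately from the SQL text,
    # so the input is always treated as data, never as SQL syntax.
    cursor = connection.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


# A classic injection attempt; with a parameterized query it is just an odd name.
hostile_input = "alice' OR '1'='1"
```

Note that the input is not stripped or escaped by hand. Legitimate names containing quotes keep working, and hostile ones simply match nothing.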
6.5 Idempotency
Idempotency is one of the most important backend interview concepts.
A request is idempotent if sending it multiple times has the same effect as sending it once.
Why it matters:
- clients retry when timeouts happen
- networks fail after the server may already have processed the request
- gateways or proxies may retry transient failures
Typical example: payments.
If a client sends "charge $100" twice because the first response was lost, you do not want to charge the card twice.
A common solution is an idempotency key.
sequenceDiagram
participant Client
participant API
participant Store as Idempotency Store
participant Payment as Payment Service
Client->>API: POST /charges + Idempotency-Key: abc123
API->>Store: Lookup key abc123
alt Key not found
Store-->>API: miss
API->>Payment: Execute charge
Payment-->>API: success
API->>Store: Save key + normalized request hash + response
API-->>Client: 200 OK with result
else Key found
Store-->>API: previous response
API-->>Client: return same stored result
end
Best practices for idempotency:
- store both the key and enough request fingerprinting to detect misuse
- scope keys appropriately, often per client or per endpoint family
- keep key retention long enough to cover realistic retry windows
- use idempotency keys for operations that are not naturally idempotent, such as payment creation or order submission
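A minimal in-memory sketch of these practices, assuming keys are scoped per client and the request fingerprint is a SHA-256 hash of the canonicalized JSON body. The class names are hypothetical, and this version is not safe under concurrency: a production store would need an atomic insert (for example a unique constraint) so two simultaneous requests with the same key cannot both execute.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class IdempotencyConflict(Exception):
    """The same key was reused with a different request body."""


class IdempotencyStore:
    def __init__(self) -> None:
        # (client_id, key) -> (request fingerprint, stored response)
        self._records: Dict[Tuple[str, str], Tuple[str, Any]] = {}

    @staticmethod
    def fingerprint(payload: dict) -> str:
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def execute(self, client_id: str, key: str, payload: dict,
                operation: Callable[[dict], Any]) -> Any:
        scoped_key = (client_id, key)
        request_hash = self.fingerprint(payload)
        record = self._records.get(scoped_key)
        if record is not None:
            stored_hash, stored_response = record
            if stored_hash != request_hash:
                raise IdempotencyConflict("key reused with a different request body")
            return stored_response          # replay: return the original result
        response = operation(payload)       # first time: actually do the work
        self._records[scoped_key] = (request_hash, response)
        return response
```

Retention is omitted here; in practice each record would carry an expiry long enough to cover realistic retry windows.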
6.6 Replay Protection
Replay protection prevents an attacker or buggy intermediary from resending a valid request later.
Common techniques:
- timestamps with expiration windows
- nonces stored briefly to prevent reuse
- signed requests that include method, path, body hash, and timestamp
This is especially common in webhook verification and partner API integrations.
6.7 Request Signing Basics
Request signing often works like this:
- client builds a canonical string from method, path, timestamp, and body hash
- client signs it with an HMAC secret or private key
- server recomputes expected signature
- server rejects if signature differs or timestamp is too old
Why it exists:
- verifies authenticity
- detects tampering
- supports replay protection when timestamp and nonce are included
GitHub-style or Stripe-style webhooks commonly use a variant of this pattern so receivers can verify that an event really came from the platform.
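The steps above can be sketched with Python's standard hmac and hashlib modules. The canonical string layout (newline-joined method, path, timestamp, body hash), the five-minute tolerance, and the function names are assumptions for illustration; real platforms each define their own exact format. Nonce tracking is left out to keep the sketch short.

```python
import hashlib
import hmac


def canonical_string(method: str, path: str, timestamp: int, body: bytes) -> bytes:
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, str(timestamp), body_hash]).encode("utf-8")


def sign(secret: bytes, method: str, path: str, timestamp: int, body: bytes) -> str:
    message = canonical_string(method, path, timestamp, body)
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, method: str, path: str, timestamp: int, body: bytes,
           signature: str, now: int, max_age_seconds: int = 300) -> bool:
    # Reject stale or far-future timestamps to bound the replay window.
    if abs(now - timestamp) > max_age_seconds:
        return False
    expected = sign(secret, method, path, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```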
6.8 Common Validation Mistakes
- trusting frontend validation
- validating schema but not business semantics
- implementing idempotency without scoping or request hashing
- rejecting too late after expensive downstream work already happened
- logging raw secrets, tokens, or signed payloads
- treating all retries as duplicates without considering request identity
7. API Versioning
Versioning exists because APIs change, but clients do not upgrade instantly.
7.1 Why Versioning Matters
Without a versioning strategy:
- client upgrades become risky
- breaking changes become outages
- multiple mobile app versions become painful to support
- integration partners lose trust
Strong backend teams design for API evolution, not just initial launch.
7.2 Common Versioning Strategies
| Strategy | Example | Benefits | Drawbacks | When useful |
|---|---|---|---|---|
| URI versioning | /v1/orders | explicit and easy to see | path clutter, can encourage large version forks | public REST APIs |
| Header versioning | X-API-Version: 2025-10-01 | cleaner URLs, flexible rollout | less visible, harder to debug manually | mature APIs, platform clients |
| Media type versioning | Accept: application/vnd.company.v2+json | precise content negotiation | operationally less friendly, not beginner-friendly | specialized APIs |
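A gateway that accepts more than one of these strategies needs a deterministic way to pick the requested version. The sketch below reuses the example values from the table; the precedence order (URI, then header, then media type), the default version, and the function name are assumptions, not a standard.

```python
import re
from typing import Dict

DEFAULT_VERSION = "1"  # assumed fallback when the client states no version
_URI_PATTERN = re.compile(r"^/v(\d+)/")
_MEDIA_TYPE_PATTERN = re.compile(r"application/vnd\.company\.v(\d+)\+json")


def resolve_version(path: str, headers: Dict[str, str]) -> str:
    """Pick the API version from the path, a version header, or the Accept header."""
    match = _URI_PATTERN.match(path)
    if match:
        return match.group(1)
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if "x-api-version" in lowered:
        return lowered["x-api-version"].strip()
    match = _MEDIA_TYPE_PATTERN.search(lowered.get("accept", ""))
    if match:
        return match.group(1)
    return DEFAULT_VERSION
```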
7.3 Backward Compatibility
This is often more important than the version number itself.
Safer changes:
- adding optional fields
- adding new endpoints
- adding new enum values only if clients are tolerant
Risky changes:
- removing fields
- renaming fields
- changing meaning or units of existing fields
- turning nullable fields into required ones
Production rule: additive evolution is easier than breaking evolution.
7.4 Deprecation Strategy
Good deprecation is operational, not just documented.
A practical strategy:
- announce deprecation clearly
- measure usage of the old version
- provide migration docs and examples
- support both versions during a migration window
- alert high-usage customers directly if possible
- set and communicate a sunset date
7.5 Migration Strategy
Strong teams avoid big-bang migrations.
Typical approach:
- dual-read or dual-write only when necessary and carefully controlled
- gateway routes old and new versions separately
- monitor client adoption
- migrate major SDKs first
- cut off the oldest, least-safe versions gradually
7.6 Real-World Examples
- Stripe is well known for careful API versioning because payment integrations cannot break casually.
- GitHub exposes explicit API versioning so clients know which contract they are using.
- Internal microservices often use protobuf or schema evolution rules instead of public URI versioning.
7.7 Best Practices
- version only when needed; do not fork casually
- prefer backward-compatible changes when possible
- monitor version usage by client and tenant
- keep error formats consistent across versions
- document behavioral differences, not just field differences
8. Throttling
Rate limiting and throttling are related but not identical.
8.1 Throttling vs Rate Limiting
| Concept | Main goal | Typical action | Example |
|---|---|---|---|
| Rate limiting | enforce quota or fairness | reject when request budget is exceeded | 100 requests per minute per API key |
| Throttling | protect system under stress or shape traffic | slow down, queue, degrade, or reject | reduce expensive search traffic during overload |
Rate limiting is often policy-driven.
Throttling is often system-health-driven.
8.2 Why Throttling Exists
Even legitimate traffic can overwhelm a system.
Throttling helps you:
- degrade gracefully rather than crash
- preserve critical endpoints over less important ones
- absorb bursts temporarily
- smooth traffic into fragile downstream systems
8.3 Graceful Degradation
Graceful degradation means not every feature must remain equally available during stress.
Examples:
- checkout stays available, recommendations are temporarily disabled
- write-heavy analytics ingestion is delayed, user sign-in remains online
- expensive search filters are limited, basic search still works
This is how mature systems preserve business-critical value during incidents.
8.4 Queueing
Queueing is useful when the work does not need to complete synchronously.
Examples:
- email sending
- thumbnail generation
- event enrichment
- some analytics processing
Why it helps:
- decouples request acceptance from background work
- smooths spikes
- improves perceived responsiveness if the request can return early
Danger: unbounded queues are just hidden outages. If the queue grows forever, latency becomes effectively infinite.
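A bounded queue makes the overload visible at the point of acceptance instead of hiding it as growing latency. This sketch uses the standard library queue module; the `accept` helper and the choice of 202 and 503 as return values are illustrative.

```python
import queue


def accept(jobs: queue.Queue, job: str) -> int:
    """Try to enqueue background work; signal overload instead of queueing forever."""
    try:
        jobs.put_nowait(job)
    except queue.Full:
        return 503   # queue is at capacity: tell the client to back off
    return 202       # accepted for asynchronous processing


# A small bound keeps the example short; real bounds depend on drain rate
# and on how long a job may reasonably wait.
thumbnail_jobs = queue.Queue(maxsize=3)
```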
8.5 Shedding Load
Load shedding means rejecting traffic on purpose so the rest of the system survives.
This can feel wrong to beginners, but it is often the correct decision.
Serving 70 percent of traffic quickly is better than timing out 100 percent of traffic after exhausting all workers.
Common strategies:
- reject low-priority requests first
- enforce concurrency caps on expensive endpoints
- return stale cached data for noncritical reads
- cut off optional features during incidents
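Rejecting low-priority requests first can be expressed as a utilization threshold per priority class. The class names and threshold values below are invented for illustration; the idea is only that lower-priority traffic is shed earlier as the system fills up.

```python
# Assumed thresholds: the fraction of capacity in use at which each class is shed.
SHED_THRESHOLDS = {
    "critical": 1.0,   # shed only when completely full
    "normal": 0.8,
    "low": 0.5,        # shed first
}


def should_shed(priority: str, in_flight: int, capacity: int) -> bool:
    """Return True if a new request of this priority should be rejected."""
    utilization = in_flight / capacity
    return utilization >= SHED_THRESHOLDS[priority]
```

At 50 percent utilization only low-priority requests are rejected; critical requests keep being admitted until every slot is taken.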
8.6 Best Practices
- define traffic priority classes
- keep queues bounded
- expose clear client signals such as 429 or 503
- combine throttling with backoff guidance for clients
- ensure degraded mode is tested before an incident
8.7 Common Mistakes
- using queueing for work that users expect immediately
- letting retries refill the queue faster than it drains
- not distinguishing critical and noncritical traffic
- treating throttling only as an edge concern when downstream systems are the real bottleneck
9. Reverse Proxy and Load Balancer
These terms overlap in practice, but they are not identical.
9.1 Reverse Proxy Role
A reverse proxy sits in front of backend servers and receives requests on their behalf.
Clients think they are talking to one endpoint. The proxy decides how to forward requests internally.
Why it exists:
- hide internal topology
- centralize TLS handling
- compress and cache content
- enforce some security rules
- simplify operational control
Common tools:
- NGINX
- Envoy
- HAProxy
- cloud-managed L7 load balancers
9.2 SSL/TLS Termination
TLS termination means the proxy or load balancer decrypts incoming HTTPS traffic.
Benefits:
- central certificate management
- offload crypto work from backend services
- enables L7 inspection for routing and policy
Tradeoff:
- internal traffic must still be protected appropriately
- if you terminate at the edge and send plaintext internally, the trust boundary moves inward
Many production systems re-encrypt internally or use mTLS on internal hops.
9.3 Caching
Reverse proxies can cache static assets and some API responses.
Why it matters:
- lower origin load
- lower latency
- better resilience during backend spikes
But be careful with:
- personalized responses
- auth-dependent content
- stale data after updates
9.4 Compression
Compression reduces payload size for responses such as JSON, HTML, CSS, and JS.
Benefits:
- lower bandwidth
- faster transfers for text-heavy payloads
Tradeoff:
- CPU overhead
- not useful for already compressed formats such as many images or zipped binaries
9.5 WAF Basics
A Web Application Firewall applies security rules to incoming requests.
It commonly helps with:
- blocking obviously malicious payloads
- filtering known exploit patterns
- enforcing IP reputation rules
- reducing bot and abuse traffic
Important nuance: a WAF is helpful, but it is not a substitute for secure application code and proper validation.
9.6 CDN Relationship
A CDN is often the outermost layer, especially for global systems.
Typical order:
- CDN
- WAF or reverse proxy
- API gateway
- internal load balancing and service routing
CDNs are best for:
- static assets
- edge caching
- some globally cacheable API responses
- DDoS absorption and edge presence
9.7 Reverse Proxy vs API Gateway
A reverse proxy is often lower-level and more generic.
An API gateway usually adds richer API-specific behavior such as:
- auth policies
- API keys
- per-route rate limits
- response transformation
- version-aware routing
In practice, the same product may serve both roles.
10. L4 vs L7
This comparison appears constantly in interviews.
10.1 The Basic Difference
- L4 operates at the transport layer, mainly IP, TCP, and UDP information.
- L7 operates at the application layer, understanding HTTP methods, paths, headers, cookies, and sometimes message semantics.
10.2 Comparison Table
| Dimension | L4 | L7 |
|---|---|---|
| Visibility | IP, port, protocol, connection metadata | URL path, headers, host, cookies, method, status |
| Speed | generally lower overhead | generally more overhead due to parsing and richer policy |
| Routing options | by IP and port | by path, host, header, content-type, user, version |
| TLS handling | can pass through TLS | often terminates TLS to inspect HTTP |
| Use cases | very high throughput transport balancing, TCP services, simple load distribution | APIs, canary routing, auth, rate limits, response transforms |
| Observability | coarse | rich request-aware observability |
| Examples | AWS NLB, IPVS-style balancing, transport proxies | AWS ALB, Envoy, NGINX, Kong |
10.3 Performance Tradeoffs
Why choose L4:
- lower per-request overhead
- works for non-HTTP protocols
- simpler fast-path routing
Why choose L7:
- can make smarter decisions
- can enforce API policy
- can support sophisticated routing and deployment patterns
Interview answer pattern:
"I would use L7 where I need HTTP-aware routing, auth, canarying, or response handling. I would prefer L4 for simpler, high-throughput transport balancing or protocols where application parsing is unnecessary."
11. Health Checks
Health checks determine whether an instance should receive traffic or be restarted.
11.1 Liveness
Liveness asks: is the process alive at all?
If liveness fails, the platform may restart the container or instance.
Best practice: keep liveness simple. It should detect deadlock or fatal stuck states, not depend on every external dependency.
11.2 Readiness
Readiness asks: is this instance ready to serve traffic right now?
If readiness fails, the instance should stop receiving traffic, but it does not necessarily need a restart.
Examples of not-ready:
- startup still in progress
- critical dependency unavailable
- cache warmup incomplete if that makes service unusable
- instance is draining for deployment
11.3 Startup Probes
Startup probes exist because some applications take time to initialize.
Without startup-aware logic, a slow boot may be misclassified as a dead process and restarted repeatedly.
11.4 Dependency Health
This is subtle.
Should readiness depend on downstream dependencies?
Answer: only on truly critical dependencies.
If a noncritical dependency fails and the service can degrade gracefully, readiness should often stay healthy. Otherwise you risk removing all instances from service for a partial dependency issue.
11.5 Kubernetes Relevance
In Kubernetes:
- liveness probe failure can restart a pod
- readiness probe failure removes the pod from service endpoints
- startup probe delays liveness/readiness enforcement until boot completes
This is why bad probes can cause cascading production pain.
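The difference between the probes is easiest to see in the handlers behind them. This framework-neutral sketch assumes a `HealthState` object whose methods return HTTP status codes; the attribute names and the idea of tracking critical dependencies in a dict are illustrative.

```python
from typing import Dict, Iterable


class HealthState:
    """Backs the liveness and readiness endpoints of one service instance."""

    def __init__(self, critical_dependencies: Iterable[str]) -> None:
        self.started = False     # set to True once initialization completes
        self.draining = False    # set to True when shutdown begins
        self.dependency_up: Dict[str, bool] = {name: True for name in critical_dependencies}

    def liveness(self) -> int:
        # Shallow on purpose: if this code runs, the process is not deadlocked.
        # No dependency checks here, so a database outage cannot trigger restarts.
        return 200

    def readiness(self) -> int:
        if not self.started or self.draining:
            return 503
        # Only critical dependencies gate traffic in this sketch.
        if not all(self.dependency_up.values()):
            return 503
        return 200
```

During a database outage this instance reports not ready but still alive, so it is taken out of rotation without being restarted.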
11.6 Probe Comparison
| Probe | Purpose | Good use | Common mistake |
|---|---|---|---|
| Liveness | detect dead or stuck process | deadlock or irrecoverable internal failure | checking database and causing restart storms |
| Readiness | decide whether to receive traffic | dependency-aware traffic gating | marking ready too early |
| Startup | allow slow initialization | JVM warmup, cache preload, large bootstraps | omitting it for slow-start services |
11.7 Best Practices
- keep liveness shallow
- make readiness meaningful
- distinguish recoverable dependency issues from fatal ones
- support graceful shutdown by failing readiness first, then draining connections
- test probe behavior during deployments and partial outages
12. Failover
Failover is the process of moving traffic or responsibility from a failing component to a healthy one.
12.1 Why It Matters
Failure is not exceptional in distributed systems. Machines fail, networks partition, regions degrade, and deployments go wrong.
Failover is how the system continues serving despite that reality.
12.2 Active-Passive Failover
One environment serves traffic. Another waits in standby.
Variants:
- cold standby: mostly offline until needed
- warm standby: partially provisioned
- hot standby: ready to serve immediately
Benefits:
- easier reasoning about writes
- lower coordination complexity
Costs:
- potentially slower failover
- wasted standby capacity
12.3 Active-Active Failover
Multiple regions or clusters actively serve traffic.
Benefits:
- lower latency for global users
- fast failover because traffic is already live elsewhere
- better capacity utilization
Challenges:
- data consistency
- write coordination
- duplicate processing risk
- routing users to the correct regional data
12.4 Regional Failover
Regional failover usually means a global traffic manager stops sending requests to a bad region and shifts them elsewhere.
flowchart LR
User[Client Traffic] --> GTM[Global Traffic Manager]
GTM -->|Primary healthy| R1[Primary Region]
GTM -->|Standby or overflow| R2[Secondary Region]
R1 -. region unhealthy .-> GTM
GTM -->|Failover| R2
Hard parts:
- DNS caches may slow failover
- stateful sessions may not exist in the secondary region
- databases may lag if replication is asynchronous
- legal or data residency rules may limit where data can fail over
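The routing decision itself is simple; the hard parts listed above are what surround it. As a sketch, assume each region carries a static priority and a health flag maintained by some external checker. The `Region` type and `choose_region` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    name: str
    priority: int    # lower number means more preferred
    healthy: bool    # assumed to be updated by an external health checker


def choose_region(regions: List[Region]) -> Optional[str]:
    """Return the most preferred healthy region, or None if nothing is healthy."""
    candidates = [region for region in regions if region.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda region: region.priority).name
```

Clients with cached DNS answers may keep hitting the old region after this function changes its answer, which is why the decision and the observed traffic shift can be minutes apart.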
12.5 Database Failover Basics
This is where request-handling discussions must connect to the data layer.
Stateless compute failover is relatively straightforward.
Database failover is much harder because of:
- replication lag
- split-brain risk
- leader election complexity
- transaction durability guarantees
- write fencing and stale primary protection
Common patterns:
- primary-replica with leader promotion
- managed database multi-AZ failover
- read replicas for scale, primary for writes
- careful multi-region replication for stricter availability needs
12.6 RTO and RPO
These two terms matter in interviews.
- RTO: Recovery Time Objective. How quickly must the system recover?
- RPO: Recovery Point Objective. How much data loss is acceptable?
Examples:
- a chat notification system may tolerate some message delay and nonzero RPO
- a payment ledger system usually needs very low RPO and strict correctness
Lower RTO and lower RPO generally increase cost and complexity.
12.7 Best Practices
- design failover with data consistency in mind, not just traffic movement
- test failover regularly, not just in slide decks
- make failover automation observable and reversible
- drain traffic from degraded zones before full failure when possible
- separate control-plane failure from data-plane failure in your reasoning
12.8 Common Mistakes
- assuming DNS failover is instant
- forgetting session or cache locality during regional failover
- promoting replicas without considering lag or split-brain protection
- calling a system active-active when only the stateless tier is active-active
13. Sticky Sessions
Sticky sessions mean requests from the same client are repeatedly routed to the same backend instance.
13.1 Why They Exist
They are often used when session state is stored in process memory on the server.
Examples:
- legacy web sessions
- websocket affinity
- in-memory shopping cart state in older systems
13.2 When They Help
- short-term compatibility during migration away from stateful servers
- workloads where per-connection state is expensive to rebuild
- some real-time protocols that prefer affinity
13.3 Why They Are Often Avoided
Modern backend design prefers stateless services because stateless services:
- scale horizontally more easily
- recover from failure more cleanly
- tolerate rebalancing better
- simplify failover and deployment
Sticky sessions hurt these properties.
Problems they create:
- uneven load distribution
- poor failover if the chosen instance dies
- harder autoscaling
- harder cross-region portability
13.4 Better Alternatives
- external session store such as Redis
- signed or encrypted stateless tokens where appropriate
- shared caches for session metadata
- move connection state into durable or replicated infrastructure when necessary
13.5 Common Interview Answer
If asked about sticky sessions, a strong answer is:
"I would avoid them unless the workload truly needs affinity. In general I prefer stateless services and store session state externally so the load balancer can route to any healthy instance."
14. How These Pieces Fit Together in Real Architecture
This is the part many interview answers miss. These systems are not isolated topics. They work together as one request-handling pipeline.
14.1 Typical SaaS API Architecture
flowchart TD
Client[Client / Partner API Consumer] --> CDN[CDN]
CDN --> Edge[WAF / Reverse Proxy]
Edge --> Gateway[API Gateway]
Gateway --> Limits[Auth / Versioning / Rate Limit / Validation]
Limits --> LB[L7 Routing + Load Balancing]
LB --> App1[Service Instance Group A]
LB --> App2[Service Instance Group B]
App1 --> Cache[(Redis / Cache)]
App1 --> DB[(Primary DB + Replicas)]
App2 --> Queue[(Async Queue)]
Gateway -. traces .-> Obs[Logs / Metrics / Tracing]
App1 -. traces .-> Obs
App2 -. traces .-> Obs
Gateway -. service discovery .-> Registry[Service Registry / Orchestrator]
14.2 End-to-End Example: Payment API
Consider POST /payments in a Stripe-like or SaaS billing system.
- Client sends HTTPS request with auth token and idempotency key.
- CDN or edge forwards dynamic request to reverse proxy.
- Gateway terminates TLS, checks auth, attaches trace IDs, and applies per-client rate limit.
- Gateway validates request shape and routes to payment service.
- Load balancer picks a healthy instance.
- Payment service checks business rules and idempotency store.
- Service writes to database transactionally and calls external payment processor if needed.
- Response returns through gateway, which logs metadata and surfaces normalized errors.
This one request involves:
- authentication
- rate limiting
- validation
- routing
- load balancing
- idempotency
- observability
- external failure handling
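The flow above can be modeled as a chain of stages where any stage may reject the request before business logic runs. The stage functions, the token value, and the request dict are all stand-ins; a real gateway would implement each stage very differently.

```python
from typing import Callable, List, Optional, Tuple

Request = dict
Response = Tuple[int, str]
Stage = Callable[[Request], Optional[Response]]


def authenticate(request: Request) -> Optional[Response]:
    # Stand-in check; a real gateway would verify a signed token.
    return None if request.get("token") == "valid-token" else (401, "unauthenticated")


def validate(request: Request) -> Optional[Response]:
    amount = request.get("amount")
    return None if isinstance(amount, int) and amount > 0 else (400, "invalid amount")


def handle_payment(request: Request) -> Response:
    return 200, "charged %d" % request["amount"]


def run_pipeline(request: Request, stages: List[Stage],
                 handler: Callable[[Request], Response]) -> Response:
    for stage in stages:
        rejection = stage(request)
        if rejection is not None:
            return rejection        # cheap early exit before later stages run
    return handler(request)
```

Ordering matters: cheaper checks go first so that abusive or malformed requests never reach the expensive stages.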
14.3 End-to-End Example: Global Consumer App
Consider a Netflix-like or Uber-like global product.
- Global traffic manager chooses a region.
- Edge or gateway enforces auth and traffic policy.
- Router sends request to correct service version, maybe with canary rules.
- Service mesh or internal load balancing handles service-to-service calls.
- Read requests may hit regional caches first.
- If one dependency is degraded, the system may throttle optional features and preserve the core experience.
14.4 Layer-to-Concern Mapping
| Layer | Main concerns |
|---|---|
| CDN and edge | caching, DDoS absorption, global reach |
| Reverse proxy or WAF | TLS termination, request filtering, compression |
| API gateway | auth, rate limiting, versioning, route policy, observability |
| L7 routing and load balancing | path and header-aware routing, canary, health-aware distribution |
| Service layer | business validation, idempotency, authorization on domain objects |
| Data layer | constraints, consistency, replication, failover |
| Observability platform | logs, metrics, traces, alerting |
15. Real-World Discussion Patterns
15.1 Google-Style or Amazon-Scale Thinking
At very large scale, request handling is strongly influenced by geography and fleet management.
The important patterns are:
- global traffic steering
- regional isolation
- heavy automation around health and rollout
- multi-layer balancing rather than one magical balancer
15.2 Netflix-Style Thinking
A company deploying frequently cares deeply about:
- canary releases
- traffic shaping
- resilience under partial failure
- observability and fast rollback
15.3 Uber-Style Thinking
Systems tied to geography or real-time state often care about:
- geo-aware routing
- regional capacity balancing
- latency sensitivity
- selective degradation during spikes
15.4 GitHub-Style and Stripe-Style Thinking
Public API platforms care deeply about:
- stable API contracts
- client-visible rate limiting
- request signing and webhook verification
- versioning discipline
- auditability and correctness
15.5 Typical SaaS Thinking
SaaS platforms often need a combination of:
- tenant-aware routing
- per-tenant quotas and rate limits
- centralized auth and observability
- low operational complexity relative to global hyperscale systems
16. Common Interview Questions and Strong Angles
16.1 "Where would you put rate limiting?"
Strong answer:
"Mostly at the gateway or edge for cheap rejection, but I may also add service-level limits for expensive operations or tenant-specific protections."
16.2 "How do you avoid duplicate writes when retries happen?"
Strong answer:
"Use idempotency keys or natural idempotency where possible, persist enough request identity to detect duplicates, and only retry safely repeatable operations."
16.3 "When would you choose L4 vs L7?"
Strong answer:
"L7 when I need HTTP-aware routing and policies like auth, canary, or versioning. L4 when I need simpler, high-throughput transport balancing or I do not need application-layer inspection."
16.4 "How would you do a safe deployment?"
Strong answer:
"Use health checks plus canary or blue-green routing, monitor business and technical metrics, and ensure rollback is fast."
16.5 "What happens if the rate limiter or discovery service goes down?"
Strong answer:
"I need a failure policy. For some endpoints I fail open to preserve availability; for abuse-sensitive endpoints I may fail closed. For discovery, I keep short-lived cached endpoint data and remove unhealthy instances quickly, but I do not rely on stale data too long."
16.6 "Why avoid sticky sessions?"
Strong answer:
"Because they make scaling, failover, and even load distribution harder. Stateless services are easier to operate and recover."
17. Common Mistakes Across the Whole Topic
- describing only the happy path and ignoring failure behavior
- saying "use a load balancer" without specifying which type or why
- retrying everything blindly
- forgetting idempotency on write APIs
- conflating authentication with authorization
- overusing the gateway as a business-logic layer
- assuming health checks are trivial
- assuming DNS-based failover is immediate and sufficient
- using sticky sessions to avoid fixing state management
- forgetting observability at the edge and routing layers
18. Practical Best Practices Checklist
- terminate or manage TLS deliberately; do not let trust boundaries be accidental
- reject bad or abusive traffic as early as possible
- keep services stateless when you can
- make retries explicit, bounded, and idempotency-aware
- use readiness checks to gate traffic and liveness checks to recover dead processes
- monitor p95 and p99 latency, not just averages
- make rollout and failover mechanisms observable
- keep routing policy simple enough to debug under pressure
- treat API versioning as a product and operational discipline, not just a URL pattern
- test degraded modes, not just normal operation
19. Final Mental Model
The cleanest way to think about request handling is this:
Request handling is the system that protects scarce resources while getting the right request to the right code, at the right time, under the right policy, even when parts of the system are failing.
If you can explain request handling from that perspective, you will do well in interviews and you will design more production-ready backend systems.