tarun-elango 26810e43d0 sd text
2026-04-26 13:27:19 -04:00

Request Handling

Request handling is the full journey of a request from the moment a client sends it to the moment the system returns a response. In interviews, this topic sits at the boundary between API design, distributed systems, reliability engineering, and production operations. In real systems, request handling is where latency, availability, security, and cost are decided.

Most weak system design answers treat a request as if it magically reaches the correct service and that service magically succeeds. Real systems do not work that way. Before business logic runs, a request usually passes through multiple control points:

  • edge protection
  • routing
  • load balancing
  • authentication and authorization checks
  • rate limiting or throttling
  • validation
  • version negotiation
  • retries and failover logic
  • observability hooks

If you understand request handling well, you can explain not only how a request succeeds, but also how the system behaves when traffic spikes, instances fail, regions go down, clients retry aggressively, or malformed input hits your API.

This guide is written with two goals:

  1. Help you answer backend and system design interview questions with depth and structure.
  2. Help you understand how production systems at companies like Google, Netflix, Uber, Amazon, GitHub, Stripe, and typical SaaS platforms are actually built.

Examples in this guide are intentionally generalized from widely used industry patterns and public engineering discussions rather than private internal implementation details.

1. Big Picture: What Request Handling Really Means

At a high level, request handling exists because distributed systems are hostile environments:

  • networks are slow and unreliable
  • clients are untrusted
  • services scale up and down dynamically
  • requests are not evenly distributed
  • failures are partial, not binary
  • deployments happen continuously
  • one bad downstream dependency can cascade into an outage

The job of request handling is to answer a series of questions quickly and safely:

  1. Should this request be allowed into the system?
  2. Is the client authenticated and allowed to do this?
  3. Is the request well-formed and safe?
  4. Which region, cluster, service, and instance should receive it?
  5. Can the system handle the load right now?
  6. If something is failing, should we retry, reroute, degrade, or reject?
  7. How do we observe what happened later?

1.1 End-to-End Request Lifecycle

```mermaid
flowchart LR
	C[Client: Browser / Mobile / API Consumer] --> CDN[CDN / Edge Cache]
	CDN --> WAF[WAF / Reverse Proxy / TLS Termination]
	WAF --> GW[API Gateway]
	GW --> POL[Auth / Validation / Rate Limit]
	POL --> ROUTE[Routing + Load Balancing]
	ROUTE --> S1[Service A]
	ROUTE --> S2[Service B]
	S1 --> CACHE[(Cache)]
	S1 --> DB[(Database)]
	S2 --> MQ[(Queue / Stream)]
	GW -. logs / metrics / traces .-> OBS[Observability Stack]
	S1 -. logs / metrics / traces .-> OBS
	S2 -. logs / metrics / traces .-> OBS
```

1.2 Core Goals of Request Handling

| Goal | Why it matters | Typical mechanisms |
| --- | --- | --- |
| Correctness | Wrong requests can corrupt data or create security holes | validation, auth, idempotency, request signing |
| Availability | Users should still be served when instances or regions fail | load balancing, health checks, failover, retries, circuit breaking |
| Performance | Users care about latency more than architecture diagrams | caching, compression, routing, efficient LB strategy |
| Isolation | One tenant or one endpoint should not take down the whole system | rate limiting, throttling, priority queues, load shedding |
| Observability | If you cannot see failures, you cannot fix them | centralized logs, metrics, traces, correlation IDs |
| Evolvability | APIs and deployments must change without breaking clients | versioning, traffic splitting, canary release, blue-green rollout |

1.3 Interview Framing

In an interview, saying "I will add a load balancer" is not enough. A stronger answer sounds more like this:

"Requests first hit a reverse proxy or gateway where we terminate TLS, authenticate the caller, apply rate limiting, and route traffic. From there, an L7 load balancer sends traffic to healthy service instances discovered dynamically. We keep requests idempotent where retries are possible, use readiness checks so bad instances do not receive traffic, and add observability at the edge and service layers so we can debug tail latency and failure patterns."

That answer shows system thinking rather than component name-dropping.

2. API Gateway

2.1 What It Is

An API gateway is the entry point for requests into a backend system. It sits between clients and backend services and applies common policies before requests reach business logic.

Think of it as a programmable front door for your platform.

Common technologies:

  • Envoy
  • NGINX
  • Kong
  • HAProxy
  • AWS API Gateway
  • Spring Cloud Gateway
  • Netflix Zuul (historically)

2.2 Why It Exists

Without a gateway, each service often ends up re-implementing the same cross-cutting concerns:

  • token validation
  • request logging
  • rate limiting
  • route matching
  • error normalization
  • response compression
  • API version handling

That leads to duplicated logic, inconsistent behavior, and harder operations.

The gateway centralizes edge concerns so backend services can focus more on business rules.

2.3 Main Responsibilities

| Responsibility | What it does | Why it belongs at the gateway |
| --- | --- | --- |
| Authentication | Verifies tokens, API keys, signed requests | cheap early reject before backend work |
| Authorization at coarse level | Rejects callers that cannot access an API family | reduces unnecessary downstream traffic |
| Request aggregation | Combines data from multiple services into one response | reduces client chattiness, especially mobile |
| Centralized logging | Captures request metadata once | consistent audit and debugging |
| Observability | emits metrics, traces, request IDs | easier latency and error tracking |
| Service discovery integration | resolves service names to live instances | works with autoscaling and dynamic infra |
| Retries and timeouts | handles transient failures | reduces client-visible errors when used carefully |
| Circuit breaking | stops sending traffic to failing backends | prevents cascading failure |
| Response transformation | maps internal responses to public API shape | decouples client contract from internal service contract |
| Caching | serves repeated read traffic cheaply | lowers latency and backend load |

2.4 How It Works Internally

A gateway typically processes a request through a pipeline:

  1. Accept TCP/TLS connection.
  2. Terminate TLS or pass it through depending on setup.
  3. Parse HTTP request metadata.
  4. Match route rules using host, path, method, headers, or query parameters.
  5. Apply middleware or policies such as auth, rate limiting, schema checks, and logging.
  6. Resolve the upstream service using static config or service discovery.
  7. Select a healthy backend instance using a load-balancing policy.
  8. Forward the request.
  9. Apply retries, timeouts, or circuit-breaker policy if needed.
  10. Transform, compress, cache, or redact the response.
  11. Emit logs, metrics, and trace spans.
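
The pipeline above can be sketched as a chain of policy functions, each of which either rejects the request early or lets it continue. This is a minimal illustration rather than how any particular gateway is implemented: the `Request` and `Response` shapes, the policy names, the `API_KEYS` set, and the `ROUTES` table are all hypothetical, and TLS, discovery, retries, and telemetry are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    method: str
    path: str
    headers: dict = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str


# A policy returns a Response to short-circuit, or None to continue.
Policy = Callable[[Request], Optional[Response]]

API_KEYS = {"demo-key"}                    # hypothetical credential store
ROUTES = {"/orders": "order-service",      # hypothetical route table
          "/payments": "payment-service"}


def authenticate(req: Request) -> Optional[Response]:
    if req.headers.get("X-API-Key") not in API_KEYS:
        return Response(401, "unauthenticated")
    return None


def validate(req: Request) -> Optional[Response]:
    if req.method not in {"GET", "POST"}:
        return Response(405, "method not allowed")
    return None


def handle(req: Request, policies: list) -> Response:
    # Run policies in order; the first rejection wins and no upstream is called.
    for policy in policies:
        rejected = policy(req)
        if rejected is not None:
            return rejected
    # Match the route by path prefix and pretend to forward the request.
    for prefix, upstream in ROUTES.items():
        if req.path.startswith(prefix):
            return Response(200, f"forwarded to {upstream}")
    return Response(404, "no route")
```

Ordering the policies from cheapest to most expensive keeps early rejection cheap, which is the main reason these checks live at the gateway.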

2.5 Request Lifecycle Through a Gateway

```mermaid
sequenceDiagram
	participant Client
	participant Gateway
	participant Auth as Auth/JWKS
	participant Registry as Service Discovery
	participant Service as Backend Service
	participant Obs as Observability

	Client->>Gateway: HTTPS request
	Gateway->>Auth: Validate token or signature
	Auth-->>Gateway: Auth result
	Gateway->>Registry: Resolve upstream instances
	Registry-->>Gateway: Healthy endpoints
	Gateway->>Service: Forward request
	Service-->>Gateway: Response
	Gateway->>Obs: Logs / metrics / traces
	Gateway-->>Client: Final response
```

2.6 Request Aggregation

Request aggregation is when the gateway calls multiple backend services and combines their results into a single response.

Example: a mobile home screen might need:

  • profile service
  • recommendation service
  • notification service
  • recent activity service

Without aggregation, the mobile app might issue 4 to 8 network calls. With a gateway or backend-for-frontend layer, the client makes one call and receives one composite response.

Why it exists:

  • mobile networks are high latency
  • clients should not need to know internal service topology
  • it reduces repetitive orchestration logic across clients

Tradeoffs:

  • the gateway becomes more complex
  • partial failures are harder to represent
  • tail latency can worsen because one slow dependency slows the whole aggregate response
  • aggregation logic can become accidental business logic

Best practice: keep aggregation focused on shaping data for clients, not on implementing domain rules that belong in services.
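
To make the tail-latency and partial-failure tradeoffs concrete, here is a small sketch of an aggregating handler. It assumes three hypothetical backend calls (`fetch_profile`, `fetch_recommendations`, `fetch_notifications`) simulated with `asyncio.sleep`; a real gateway would make network calls instead. Each dependency gets its own timeout and fallback, and the response reports which sections are degraded.

```python
import asyncio


async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.01)            # stands in for a network call
    return {"name": f"user-{user_id}"}


async def fetch_recommendations(user_id: str) -> list:
    await asyncio.sleep(0.5)             # deliberately slower than the timeout
    return ["item-1", "item-2"]


async def fetch_notifications(user_id: str) -> list:
    raise ConnectionError("notification service unavailable")


async def call_with_fallback(coro, timeout: float, fallback):
    """Bound one dependency's latency and degrade instead of failing the page."""
    try:
        return await asyncio.wait_for(coro, timeout), True
    except (asyncio.TimeoutError, ConnectionError):
        return fallback, False


async def home_screen(user_id: str, timeout: float = 0.1) -> dict:
    # Fan out concurrently so total latency tracks the slowest call, not the sum.
    (profile, p_ok), (recs, r_ok), (notes, n_ok) = await asyncio.gather(
        call_with_fallback(fetch_profile(user_id), timeout, None),
        call_with_fallback(fetch_recommendations(user_id), timeout, []),
        call_with_fallback(fetch_notifications(user_id), timeout, []),
    )
    outcomes = [("profile", p_ok), ("recommendations", r_ok), ("notifications", n_ok)]
    return {
        "profile": profile,
        "recommendations": recs,
        "notifications": notes,
        # Tell the client which sections are degraded rather than hiding it.
        "degraded": [name for name, ok in outcomes if not ok],
    }
```

Calling `asyncio.run(home_screen("42"))` returns the profile plus empty recommendation and notification lists, with both listed under `degraded`. Whether a missing section should degrade or fail the whole response is a product decision, not something the gateway can decide on its own.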

2.7 Authentication at the Gateway

Authenticating at the gateway is common because an unauthenticated request is cheap to reject early, while repeating the same check in every service is expensive.

Typical patterns:

  • JWT validation at the gateway using cached public keys from a JWKS endpoint
  • API key lookup for machine clients
  • OAuth token introspection for opaque tokens
  • mTLS for service-to-service requests in internal systems

Important nuance: gateway authentication does not eliminate service-level authorization.

The gateway can answer, "Is this caller known and allowed to hit this API family?"

The service still often needs to answer, "Can this user access this specific invoice, order, or repository?"

Common mistake: pushing all authorization into the gateway. Fine-grained authorization usually belongs closer to the business object.

2.8 Centralized Logging and Observability

The gateway is the best place to generate or propagate correlation identifiers such as:

  • request ID
  • trace ID
  • span ID
  • tenant ID
  • client application ID

Useful gateway metrics:

  • requests per second by route
  • latency percentiles, especially p95 and p99
  • error rates by status code family
  • upstream retry counts
  • rate-limited requests
  • auth failures
  • cache hit ratio

Observability matters because many request-handling bugs look similar from the outside. A user just sees a timeout. Internally, that timeout might have been caused by:

  • route misconfiguration
  • unhealthy instances still receiving traffic
  • retry storms
  • TLS handshake issues
  • bad DNS resolution
  • a dependency that is slow but not fully down

2.9 Service Discovery Integration

In dynamic environments, backend instances change constantly because of autoscaling, deployments, and failures. Hardcoding backend IPs is not realistic.

The gateway therefore needs service discovery.

Common models:

  • client-side discovery: the caller resolves service instances and chooses one
  • server-side discovery: the gateway or load balancer resolves instances and forwards traffic

In Kubernetes, a gateway often routes to a Service object, and kube-proxy or the data plane routes to live pod endpoints. In service-mesh-heavy environments, Envoy sidecars may receive endpoint updates via xDS-style control-plane APIs.

Failure case: stale discovery data can send traffic to dead instances.

Best practices:

  • respect health information, not just presence in registry
  • support quick config propagation
  • use connection draining when removing instances
  • avoid very aggressive caching of endpoint lists

2.10 Retries

Retries are deceptively dangerous.

Why they exist:

  • networks fail transiently
  • connections reset occasionally
  • an instance may fail while others are healthy

Why they are risky:

  • retries multiply traffic load during incidents
  • non-idempotent operations may execute twice
  • stacked retries at multiple layers create retry storms

Best practices:

  • retry only idempotent or safely repeatable operations
  • use bounded retries, usually very small counts
  • add exponential backoff and jitter
  • couple retries with timeouts and circuit breakers
  • never let every layer retry blindly

Interview point: if a gateway retries a POST payment request, you must discuss idempotency keys.
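
A bounded retry loop with exponential backoff and jitter might look like the sketch below. `TransientError`, `backoff_delays`, and `call_with_retries` are illustrative names, and the `idempotent` flag is an assumption about how a caller would declare that an operation is safe to repeat (for example because it carries an idempotency key).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 2.0, rng=random.random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(operation, *, idempotent: bool, max_retries: int = 2, sleep=time.sleep):
    # Non-idempotent operations get exactly one attempt.
    retries = max_retries if idempotent else 0
    delays = backoff_delays(retries)
    while True:
        try:
            return operation()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise                 # retry budget exhausted, surface the error
            sleep(delay)
```

The retry count is kept small on purpose: with two retries at each of three layers, one client request can turn into 27 backend attempts during an incident.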

2.11 Circuit Breaking

Circuit breaking protects the rest of the system from a dependency that is failing or timing out.

Typical states:

  • closed: traffic flows normally
  • open: requests fail fast instead of calling a bad dependency
  • half-open: limited test traffic checks whether recovery has happened

Why it matters:

If a database or dependency is timing out, continuing to send full traffic often just consumes worker threads, saturates queues, and increases latency everywhere.

Circuit breaking buys time and preserves system health.
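
The three states can be sketched as a small state machine. This is a simplified illustration, not a production breaker: the class name, thresholds, and injectable `clock` are assumptions, and the half-open state here lets any caller through as a probe, whereas real implementations usually limit how many probes run at once.

```python
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("failing fast without calling the dependency")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed probe while half-open, or too many failures while
            # closed, opens (or re-opens) the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```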

2.12 Response Transformation

Gateways often transform responses by:

  • removing internal fields
  • renaming fields for public API consistency
  • changing status code mappings
  • combining multiple backend responses into one DTO
  • translating protocols such as gRPC to JSON/HTTP for clients

Useful when:

  • internal services evolve independently
  • multiple clients need different shapes
  • you want to hide internal topology

Danger: too much transformation turns the gateway into a fragile orchestration layer.

2.13 Request and Response Caching

Caching at the gateway is powerful for read-heavy APIs.

Good candidates:

  • public or semi-public GET responses
  • configuration or feature metadata
  • rarely changing reference data

Hard parts:

  • cache invalidation
  • per-user or per-tenant cache keys
  • auth-sensitive content
  • stale responses after writes

Best practices:

  • cache only clearly safe responses
  • include authorization context in the cache key if needed
  • use TTLs conservatively
  • prefer cache headers and explicit policies over guesswork
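
As a rough sketch of the cache-key and TTL practices above, the example below caches only GET responses and puts the tenant in the key. `TTLCache` and `cache_key` are hypothetical helpers, and using the tenant ID as the authorization context is an assumption; some APIs need the user or scope in the key instead.

```python
import time
from typing import Optional


class TTLCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}             # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]     # lazily evict expired entries
            return None
        return value

    def put(self, key, value) -> None:
        self._entries[key] = (self.clock() + self.ttl, value)


def cache_key(method: str, path: str, tenant_id: Optional[str] = None) -> Optional[tuple]:
    """Return a cache key, or None when the response must not be cached."""
    if method != "GET":
        return None
    # The tenant is part of the key so one tenant can never be served
    # another tenant's cached body.
    return (method, path, tenant_id)
```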

2.14 Gateway vs Service Mesh

| Dimension | API Gateway | Service Mesh |
| --- | --- | --- |
| Main traffic direction | north-south, from clients into platform | east-west, service-to-service |
| Main role | edge policy and API entry | internal traffic management |
| Common features | auth, rate limiting, versioning, aggregation, public API concerns | mTLS, retries, traffic shaping, service identity, observability |
| Consumer | external clients or apps | internal services |
| Operational risk | can become a choke point | can add significant platform complexity |
| Typical tools | API Gateway products, NGINX, Kong, Envoy | Istio, Linkerd, Consul Connect, Envoy-based meshes |

They are not mutually exclusive. Many real systems use both.

2.15 Production Patterns

  • Netflix popularized a gateway-style edge layer to handle cross-cutting concerns before requests hit microservices.
  • Stripe-like public APIs often emphasize idempotency, auth, versioning, and request logging at the edge because correctness matters more than raw throughput alone.
  • Large SaaS platforms often use gateways to enforce tenant-aware limits and route requests to the correct service family.

2.16 Common Mistakes

  • turning the gateway into a monolith of business logic
  • retrying non-idempotent requests
  • centralizing coarse and fine-grained authorization in the same place
  • logging secrets or PII in raw form
  • caching personalized responses incorrectly
  • making the gateway a single point of failure without horizontal scaling

3. Request Routing

Routing decides where a request goes after it enters the system.

This sounds simple, but routing is one of the most important control points in production because it determines:

  • which service handles the request
  • which version handles it
  • which region handles it
  • which tenant or experimental path it follows

3.1 Routing Types

| Routing style | How it works | Common use cases | Risks |
| --- | --- | --- | --- |
| Path-based routing | route by URL path such as /payments or /users | REST APIs, monolith decomposition, ingress rules | overlapping route patterns, regex complexity |
| Host-based routing | route by hostname such as api.example.com or admin.example.com | multiple products or domains behind one edge | DNS and certificate management complexity |
| Header-based routing | route using headers such as version, tenant, device type, or experiment ID | canaries, A/B tests, tenant isolation | header spoofing, harder debugging |
| Geo-based routing | route by location, region, or country | latency reduction, data residency, regulatory compliance | incorrect geo inference, data locality problems |
| Canary routing | send a small portion of traffic to a new version | safe rollout | canary users may not represent real load |
| Blue-green routing | switch traffic between old and new environments | low-risk deployments with quick rollback | expensive duplication, data migration risk |
| Weighted traffic splitting | send 90 percent to old version and 10 percent to new version, then ramp | gradual deployment, model rollout | sticky results, measurement bias |

3.2 How Routing Usually Works Internally

The router generally evaluates rules in order:

  1. Match host.
  2. Match method and path.
  3. Evaluate higher-priority header or cookie rules.
  4. Apply traffic-splitting policy if multiple upstreams are eligible.
  5. Resolve the target service using discovery.
  6. Pick a healthy instance.

Rule order matters. A subtle configuration bug can shadow a more specific route with a broader one.

3.3 Routing Decision Flow

```mermaid
flowchart TD
	A[Incoming Request] --> B{Host Match?}
	B -->|api.example.com| C{Path Match?}
	B -->|admin.example.com| D[Admin Service]
	C -->|/v1/payments| E{Header or Canary Rule?}
	C -->|/v1/users| F[User Service]
	E -->|Canary| G[Payments vNext]
	E -->|Default| H[Payments vCurrent]
```

3.4 Path-Based Routing

This is the most common form of routing for HTTP APIs.

Examples:

  • /orders/* to order service
  • /payments/* to payment service
  • /search/* to search service

Why it exists:

  • intuitive for REST-style APIs
  • easy to reason about operationally
  • fits ingress and gateway tools well

Failure case: route collisions. For example, a generic /payments/* rule may accidentally catch /payments/admin/* if route precedence is wrong.
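
The collision is easy to reproduce. The sketch below compares a naive first-match router with a longest-prefix matcher over a hypothetical route table; the service names are made up, and real gateways each have their own precedence rules that you should check rather than assume.

```python
ROUTES = [
    ("/payments/", "payment-service"),
    ("/payments/admin/", "payment-admin-service"),
    ("/orders/", "order-service"),
]


def first_match(path: str, routes: list):
    """Order-dependent matcher: a broad rule listed first shadows specific ones."""
    for prefix, upstream in routes:
        if path.startswith(prefix):
            return upstream
    return None


def longest_prefix_match(path: str, routes: list):
    """Prefer the most specific (longest) matching prefix, regardless of order."""
    matches = [(prefix, upstream) for prefix, upstream in routes
               if path.startswith(prefix)]
    if not matches:
        return None
    return max(matches, key=lambda match: len(match[0]))[1]
```

With this table, `first_match("/payments/admin/refunds", ROUTES)` returns `payment-service`, while `longest_prefix_match` returns `payment-admin-service`.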

3.5 Host-Based Routing

Host-based routing routes by domain name.

Examples:

  • api.company.com
  • admin.company.com
  • uploads.company.com
  • hooks.company.com

This is useful when products or workloads differ enough that they deserve different operational policies.

For example, a webhook ingestion domain may need different timeout, retry, and rate-limit rules than a user-facing API domain.

3.6 Header-Based Routing

Header-based routing is common for:

  • API version rollout
  • internal testing
  • tenant routing
  • language or device-specific responses
  • canary routing with explicit opt-in

Example headers:

  • X-API-Version
  • X-Tenant-ID
  • X-Experiment
  • X-Canary

Be careful: headers are convenient for internal callers, but on a public API any client can set them, so header-driven routing can be spoofed unless it is protected by auth and policy.

3.7 Geo-Based Routing

Geo routing tries to send users to the best region based on one or more goals:

  • lower latency
  • data residency compliance
  • regulatory boundaries
  • disaster isolation
  • capacity balancing

Examples:

  • EU users sent to EU region for GDPR-sensitive workloads
  • ride-sharing or mapping traffic sent to region nearest user demand
  • global SaaS tenants pinned to a home region

Tradeoffs:

  • nearest region is not always best if the user's data lives elsewhere
  • geo-IP is imperfect
  • cross-region writes can be very expensive or consistency-sensitive

Interview point: geo routing and data placement must be discussed together.

3.8 Canary Routing

Canary routing sends a small portion of traffic to a new version first.

Typical rollout:

  1. 1 percent traffic
  2. 5 percent traffic
  3. 10 percent traffic
  4. 25 percent traffic
  5. 50 percent traffic
  6. 100 percent traffic

What you watch:

  • error rate
  • latency regression
  • resource utilization
  • business metrics such as checkout success or sign-in success

Why companies use it:

  • safer than instant full rollout
  • catches dependency or schema issues early
  • provides rollback window

Real-world intuition: Netflix-style continuous deployment is only practical because traffic shaping and observability let teams expose changes gradually.

3.9 Blue-Green Deployments

Blue-green means you maintain two environments:

  • blue: current production
  • green: new version

Then you shift traffic from one to the other.

Why it exists:

  • rollback is simple in theory because the old environment still exists
  • deployment risk is separated from code build risk

Where it gets hard:

  • databases do not switch as cleanly as stateless services
  • dual environments cost more
  • background jobs and asynchronous consumers may still affect shared data

3.10 Traffic Splitting

Traffic splitting is a more general concept than canary.

You can split traffic by:

  • percentage
  • user cohort
  • tenant tier
  • geography
  • request attributes
  • session stickiness

This is useful for:

  • A/B experiments
  • canaries
  • ML model rollout
  • progressive feature migration
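
One common way to implement a percentage split is to hash a stable identifier into a bucket, so the same user always lands on the same side. The sketch below assumes a user ID is available at the routing layer; the `salt` value and function names are illustrative.

```python
import hashlib


def bucket(user_id: str, salt: str = "rollout-1") -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    # Users whose bucket falls below the threshold see the canary.
    return "canary" if bucket(user_id) < canary_percent else "stable"
```

Because the bucket is fixed per user, ramping from 10 percent to 25 percent only adds users to the canary and never moves anyone back. Changing the salt reshuffles the cohorts, which helps avoid exposing the same users to every experiment.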

3.11 Service Discovery Impact

Routing depends on service discovery more than most beginners realize.

A route usually points to a logical service name, not a fixed machine. The system must map that name to currently healthy instances.

If service discovery is stale or slow:

  • requests go to dead instances
  • traffic may concentrate on a few nodes
  • rollout changes may not propagate consistently

Best practices:

  • keep route definitions separate from ephemeral instance identity
  • use health-aware endpoint selection
  • support connection draining during deployment
  • prefer automation over manual endpoint lists

3.12 Common Interview Discussions

  • How do you safely route traffic to a new version?
  • How do you guarantee tenant isolation during routing?
  • What happens if a route config is wrong globally?
  • How do you roll back a bad canary quickly?

4. Load Balancing

4.1 Why Load Balancing Exists

If requests were sent to a single server, that server would become a bottleneck and a single point of failure.

Load balancing exists to:

  • distribute traffic across multiple instances
  • improve availability
  • enable horizontal scaling
  • reduce overload on any single machine
  • route away from unhealthy instances

Horizontal scaling is the key idea. Instead of buying one huge machine forever, you run multiple smaller instances and distribute work.

4.2 Active-Active vs Active-Passive

| Mode | Meaning | Benefits | Drawbacks | Common use |
| --- | --- | --- | --- | --- |
| Active-active | multiple nodes or regions serve traffic at the same time | high availability, better capacity utilization, low failover time | more complex consistency and routing | web and API frontends, global services |
| Active-passive | one node or region is primary, another waits as standby | simpler operational model, easier correctness story | slower failover, unused capacity | legacy systems, cost-sensitive setups, some databases |

Interview rule of thumb: active-active improves availability and latency, but only if the data layer and operational discipline are strong enough to support it.

4.3 Common Load-Balancing Algorithms

| Algorithm | How it works | Best for | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| Round robin | rotate evenly through instances | similar stateless servers | simple and cheap | ignores load differences |
| Weighted round robin | same as round robin, but some instances get more traffic | mixed-capacity fleets | easy to express capacity skew | weights can become stale |
| Least connections | send to instance with fewest active connections | long-lived connections, uneven request duration | better than round robin for sticky or long sessions | connection count may not reflect real CPU or memory load |
| Least response time | prefer instances with lower observed latency | latency-sensitive APIs | reacts to slow instances | can create feedback loops or oscillations |
| Consistent hashing | map request key to instance ring | caches, sharded workloads, affinity | minimal reshuffling when instances change | hotspot risk if keys are skewed |
| IP hash | hash client IP to pick instance | simple affinity | easy stickiness | NAT and shared IPs skew traffic badly |

One strong interview addition: consistent hashing is not a general-purpose default. It is especially useful when request locality matters, such as cache ownership or shard routing.

4.4 Round Robin

Round robin is the simplest algorithm: instance A, then B, then C, then back to A.

Why it exists:

  • low overhead
  • easy to reason about
  • works fine when instances are homogeneous and requests are similar

Why it fails:

  • requests may vary wildly in cost
  • one node may be slow but still receive equal traffic

4.5 Weighted Round Robin

Weighted round robin lets you give bigger instances more traffic.

Example:

  • instance A weight 4
  • instance B weight 2
  • instance C weight 1

This is useful during mixed instance migrations or when canary nodes should receive only a small share.
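
The weights above can be exercised with a short sketch. This uses the "smooth" variant of weighted round robin, which interleaves picks instead of sending four requests in a row to A; it is one possible scheduling choice, not the only way to honor weights.

```python
def smooth_weighted_round_robin(weights: dict, picks: int) -> list:
    """Return the order in which instances are chosen over `picks` requests."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    order = []
    for _ in range(picks):
        # Every instance earns credit proportional to its weight.
        for name, weight in weights.items():
            current[name] += weight
        # The instance with the most credit is chosen and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        order.append(chosen)
    return order


print(smooth_weighted_round_robin({"A": 4, "B": 2, "C": 1}, 7))
# -> ['A', 'B', 'A', 'C', 'A', 'B', 'A']
```

Over every cycle of seven requests, A receives four, B two, and C one, but no instance absorbs its whole share in one burst.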

4.6 Least Connections

This is common when connection duration matters, such as proxies handling long-lived connections or websocket-heavy workloads.

Why it helps:

  • a server already holding many active sessions gets less new work

Limitations:

  • 10 idle connections are not equivalent to 10 expensive requests
  • CPU-heavy but short-lived requests may still be imbalanced

4.7 Least Response Time

This tries to avoid slow nodes by routing to faster ones.

Good intuition: if one instance's latency is rising, it may be overloaded or degraded.

Risk: feedback loops.

If the algorithm overreacts, traffic can bounce around and create instability. Slow-start, damping, and outlier detection often help.

4.8 Consistent Hashing

Consistent hashing is important in system design interviews.

Instead of randomly balancing every request, you map a key such as:

  • user ID
  • session ID
  • cache key
  • shard key

to a position on a hash ring. Each instance owns a portion of the ring.

Why it exists:

  • when instances join or leave, only a subset of keys remap
  • this preserves cache locality and reduces churn

Common use cases:

  • distributed caches
  • sharded databases
  • sticky-ish routing without a central session store

Failure case: skewed keys can create hotspots. Virtual nodes and better key design are common mitigations.
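
A minimal hash ring with virtual nodes can be sketched as follows. The `HashRing` class, the choice of SHA-256, and the default of 100 virtual nodes per instance are assumptions made for illustration; production rings tune these and usually add replication.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []                 # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node is placed at many positions to even out ownership.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

When a fourth node joins a three-node ring, only the keys that the new node now owns change hands (about a quarter on average), and every other key keeps its previous owner. Virtual nodes spread node positions evenly, but they do not fix a single very hot key, which still maps to one owner.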

4.9 IP Hash

IP hash is a crude form of affinity.

It is easy to set up, but it performs poorly when many clients appear behind a single NAT or corporate proxy. One large office can accidentally behave like one giant user from the load balancer's perspective.

4.10 Distributed Load Balancing

At scale, the load balancer itself must also scale.

That means:

  • multiple LB nodes, not one appliance
  • shared or replicated configuration
  • health state propagation
  • often anycast or DNS in front of LB fleets

If the load balancer layer is not distributed, you have just moved the bottleneck.

4.11 Global Load Balancing

Global load balancing chooses a region before a local load balancer chooses an instance.

Goals:

  • reduce latency
  • avoid unhealthy regions
  • keep traffic near user or data
  • manage regional capacity

```mermaid
flowchart TD
	U[Global Users] --> GTM[Global Traffic Manager]
	GTM --> US[US Region]
	GTM --> EU[EU Region]
	GTM --> AP[APAC Region]
	US --> USLB[Regional Load Balancer]
	EU --> EULB[Regional Load Balancer]
	AP --> APLB[Regional Load Balancer]
	USLB --> US1[Service Instances]
	EULB --> EU1[Service Instances]
	APLB --> AP1[Service Instances]
```

Google-scale and Amazon-scale systems rely heavily on global traffic management concepts because the first routing decision is often regional, not per-instance.

4.12 DNS Load Balancing

DNS can return different IPs for the same hostname.

Why it is attractive:

  • simple
  • globally available
  • often the first layer of balancing

Limitations:

  • DNS caching means failover is not immediate
  • clients do not always honor TTL precisely
  • DNS cannot see application-level health very well by itself

Interview point: DNS is useful, but it is usually too coarse to be the only failover mechanism.

4.13 Best Practices

  • remove unhealthy instances quickly
  • use connection draining before terminating instances
  • prefer zone-aware balancing if cross-zone traffic is expensive
  • track p95 and p99, not just average latency
  • use slow start for newly added instances so they do not get overwhelmed instantly
  • treat retries as part of traffic load, not separate from it

4.14 Common Failure Cases

  • unhealthy instances still receive traffic because readiness is wrong
  • least-response-time routing amplifies instability
  • one AZ gets overloaded because balancing is not zone-aware
  • sticky affinity causes hotspots
  • the load balancer layer itself is not redundant

5. Rate Limiting

5.1 Why It Exists

Rate limiting controls how much traffic a client, tenant, API key, IP, or endpoint can send over time.

It exists to enforce:

  • abuse prevention
  • fairness
  • cost control
  • protection of downstream systems
  • multi-tenant isolation

Without rate limits, one abusive or buggy client can monopolize resources.

5.2 Where It Is Applied

Rate limits can exist at multiple layers:

  • CDN or edge
  • API gateway
  • service layer
  • database or queue concurrency controls

Different layers often enforce different types of limits.

Examples:

  • per-IP login limit at the edge
  • per-API-key limit at the gateway
  • per-tenant expensive-operation limit in the service

5.3 Common Algorithms

| Algorithm | Idea | Strengths | Weaknesses | Good fit |
| --- | --- | --- | --- | --- |
| Fixed window | count requests in each fixed period | very simple | bursty at window boundaries | low-complexity systems |
| Sliding window | approximate rolling window using adjacent buckets | smoother than fixed window | more logic and state | typical APIs |
| Sliding log | keep timestamps of requests | precise | expensive in memory and compute | strict low-volume policies |
| Token bucket | tokens refill at a constant rate, requests spend tokens | allows controlled bursts | stateful logic | public APIs with burst tolerance |
| Leaky bucket | requests enter a bucket and drain at fixed rate | smooths outgoing rate | may delay or drop burst traffic | traffic shaping and smoothing |

5.4 Fixed Window

Example rule: 100 requests per minute.

Implementation is often as simple as:

  • increment a counter for current window
  • reject if counter exceeds limit
  • expire counter at end of window

Problem: boundary burst.

A client can send 100 requests at the end of one minute and 100 more at the start of the next minute, effectively sending 200 requests in a short interval.
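
The boundary burst can be reproduced with an in-memory sketch. `FixedWindowLimiter` is a hypothetical single-process stand-in for a shared counter store; timestamps are passed in explicitly so the behavior is easy to see, and old windows are never expired here, which a real store would handle with a TTL.

```python
from collections import defaultdict


class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)      # (client, window index) -> count

    def allow(self, client: str, now: float) -> bool:
        # All requests in the same window share one counter.
        key = (client, int(now // self.window))
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_seconds=60)
late = sum(limiter.allow("client-1", now=59.0) for _ in range(150))
early = sum(limiter.allow("client-1", now=60.5) for _ in range(150))
print(late, early)   # -> 100 100
```

Both bursts are accepted in full, so the client gets 200 requests through within 1.5 seconds even though the rule says 100 per minute.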

5.5 Sliding Window

Sliding window reduces the boundary-burst problem.

Instead of treating time as disconnected minute buckets, it approximates usage across a rolling interval. This is fairer and smoother for APIs.
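
One common approximation keeps only two counters, the current window and the previous one, and weights the previous count by how much of it still overlaps the rolling interval. The sketch below is a single-client, in-memory illustration with assumed names; it also assumes timestamps never go backwards.

```python
class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.current_index = 0
        self.current_count = 0
        self.previous_count = 0

    def _advance(self, now: float) -> None:
        index = int(now // self.window)
        if index == self.current_index:
            return
        # Moving one window forward keeps the old count; a longer gap discards it.
        adjacent = index == self.current_index + 1
        self.previous_count = self.current_count if adjacent else 0
        self.current_count = 0
        self.current_index = index

    def allow(self, now: float) -> bool:
        self._advance(now)
        elapsed_fraction = (now % self.window) / self.window
        # The previous window counts in proportion to its remaining overlap.
        estimated = self.previous_count * (1 - elapsed_fraction) + self.current_count
        if estimated >= self.limit:
            return False
        self.current_count += 1
        return True
```

With a limit of 100 per 60 seconds, a client that used its full quota at second 59 is still estimated at about 99 requests at second 60.5, so only one more request is admitted instead of a fresh 100.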

5.6 Sliding Log

Sliding log stores individual request timestamps and removes old ones.

It is the most exact of the common approaches, but also the most expensive. It is rarely the default choice for very high-cardinality high-throughput public traffic.
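
A single-client sliding log fits in a few lines, which also makes its cost visible: memory grows with the number of requests kept in the window. The class name and the explicit `now` argument are illustrative.

```python
from collections import deque


class SlidingLogLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()       # one entry per accepted request

    def allow(self, now: float) -> bool:
        # Drop timestamps that have left the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True
```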

5.7 Token Bucket

This is one of the most useful algorithms to understand.

Mental model:

  • tokens drip into a bucket at a steady rate
  • each request consumes one or more tokens
  • if the bucket is empty, reject or delay

Why it is popular:

  • supports bursts up to bucket capacity
  • preserves average rate over time
  • easy to reason about for product limits

Example:

  • refill 10 tokens per second
  • bucket size 50

The client can burst 50 requests instantly, but over time they only sustain about 10 per second.
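
The example numbers above can be checked with a small in-memory token bucket. This sketch refills lazily on each call instead of running a background timer; the class name and the injectable `clock` are assumptions made so the behavior can be exercised deterministically.

```python
import time


class TokenBucket:
    def __init__(self, refill_rate: float, capacity: float, clock=time.monotonic):
        self.refill_rate = refill_rate      # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.clock = clock
        self.tokens = capacity              # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, never above capacity.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True
```

With `refill_rate=10` and `capacity=50`, 60 back-to-back calls admit 50 requests; one second later another 10 are admitted, and after a long idle period the bucket is full again but never holds more than 50.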

5.8 Leaky Bucket

Leaky bucket emphasizes output smoothing more than burst allowance.

Requests may queue and drain at a steady rate. This is useful when downstream systems need smooth, predictable load rather than spikes.

5.9 Redis Implementation Patterns

Redis is common for distributed rate limiting because it is fast and supports atomic operations.

Typical patterns:

  • fixed window: INCR plus EXPIRE
  • sliding log: sorted set of timestamps with ZADD, ZREMRANGEBYSCORE, and ZCARD
  • token bucket: store token count and last refill timestamp, update atomically with Lua

Why Lua scripts matter:

  • distributed rate limiting requires atomic read-modify-write behavior
  • without atomicity, concurrent requests can exceed the intended limit
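
The atomicity requirement can be illustrated without a Redis server. In the sketch below a `threading.Lock` stands in for the guarantee that a Lua script gives inside Redis: the check and the increment happen as one step. The class `AtomicCounterStore` is a hypothetical stand-in, not a Redis client.

```python
import threading


class AtomicCounterStore:
    """In-memory stand-in for a shared limiter store."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts = {}

    def check_and_increment(self, key: str, limit: int) -> bool:
        # Read, compare, and write under one lock so no other caller can
        # slip in between the check and the increment.
        with self._lock:
            count = self._counts.get(key, 0)
            if count >= limit:
                return False
            self._counts[key] = count + 1
            return True


store = AtomicCounterStore()
results = []


def worker() -> None:
    for _ in range(50):
        results.append(store.check_and_increment("ratelimit:user:123:/payments", 100))


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

allowed = results.count(True)  # 400 concurrent attempts, exactly 100 admitted
```

If the read and the write were separate steps, two workers could both observe a count of 99 and both be admitted.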

Key design examples:

  • ratelimit:user:123:/payments
  • ratelimit:tenant:acme:minute
  • ratelimit:ip:203.0.113.10:login

5.10 Distributed Rate Limiting Challenges

This is where interview answers often become shallow.

Real problems include:

  • hot keys for large tenants or popular routes
  • clock skew between nodes
  • cross-region consistency
  • Redis outages
  • fail-open versus fail-closed policy decisions
  • cardinality explosion if keys are too granular

Fail-open means the limiter allows traffic if the limiter store is down.

Fail-closed means it rejects traffic if the limiter store is down.

Which is right depends on the endpoint:

  • login or anti-abuse endpoint may prefer fail-closed
  • general read endpoint may prefer fail-open to preserve availability
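
A per-endpoint failure policy can be sketched as a small wrapper around the limiter call. The exception type, the `FAIL_OPEN` table, and the endpoint paths are hypothetical names chosen for illustration.

```python
class LimiterUnavailable(Exception):
    """Raised by the (hypothetical) limiter client when its store is unreachable."""


# Hypothetical policy table: True means fail open, False means fail closed.
FAIL_OPEN = {"/login": False, "/articles": True}


def allow_request(endpoint: str, check_limit) -> bool:
    """Ask the limiter; if it is down, fall back to the endpoint's failure policy."""
    try:
        return check_limit()
    except LimiterUnavailable:
        # Endpoints without an explicit policy fail closed in this sketch.
        return FAIL_OPEN.get(endpoint, False)


def store_is_down() -> bool:
    raise LimiterUnavailable("limiter store unreachable")


read_allowed = allow_request("/articles", store_is_down)  # stays available
login_allowed = allow_request("/login", store_is_down)    # rejects while blind
```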

5.11 Best Practices

  • limit by the right identity: IP is often not enough
  • use different limits for read, write, and expensive endpoints
  • return 429 Too Many Requests with Retry-After when possible
  • monitor near-limit behavior, not just hard rejections
  • consider shadow mode before enforcing a new limit in production
  • keep rate limiting close to the edge for cheap rejection

5.12 Real-World Intuition

  • GitHub-style public APIs need clear client-visible limits to keep the platform fair.
  • Stripe-like payment APIs need rate limiting to protect correctness-sensitive backends from abuse or accidental retry loops.
  • Multi-tenant SaaS platforms often combine per-user, per-tenant, and per-endpoint limits.

6. Request Validation

Validation is the discipline of refusing bad requests before they do damage.

This includes much more than checking whether a JSON field exists.

6.1 Why Validation Exists

Requests are dangerous because they may be:

  • malformed
  • malicious
  • duplicated
  • replayed
  • semantically invalid
  • inconsistent with business rules

Validation protects:

  • correctness
  • security
  • downstream capacity
  • developer sanity

6.2 Validation Layers

| Layer | Typical checks | Why here |
| --- | --- | --- |
| Edge or gateway | body size, basic schema, auth format, signature presence, rate limit | cheap early rejection |
| API layer | required fields, type checks, enum checks, version compatibility | contract correctness |
| Domain layer | business rules and state-dependent validation | real correctness |
| Database layer | unique constraints, foreign keys, transactional guarantees | final integrity guardrail |

Important interview point: validation should be layered. Do not rely on only one layer.

6.3 Schema Validation

Schema validation checks the request shape.

Examples:

  • JSON schema or OpenAPI validation for REST
  • protobuf validation for gRPC
  • GraphQL schema and resolver validation

Why it exists:

  • catches bad input early
  • makes API behavior predictable
  • prevents weird null or type bugs from leaking deep into business logic

What it does not do:

  • prove business correctness

Example:

  • schema can prove amount exists and is numeric
  • it cannot prove that a user is allowed to charge that amount
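
The split between shape and business correctness can be sketched with two separate checks. The function names, field names, and the per-charge limit are illustrative assumptions; a real API would typically use a schema library for the first check.

```python
def shape_errors(payload: dict) -> list:
    """Schema-level check: field presence and types only."""
    errors = []
    amount = payload.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append("amount must be a number")
    if not isinstance(payload.get("currency"), str):
        errors.append("currency must be a string")
    return errors


def business_errors(payload: dict, per_charge_limit: float) -> list:
    """Domain-level check: needs state the schema cannot see (the caller's limit)."""
    if payload["amount"] > per_charge_limit:
        return ["amount exceeds the caller's per-charge limit"]
    return []


well_formed = {"amount": 100, "currency": "usd"}
print(shape_errors(well_formed))                           # passes the schema
print(business_errors(well_formed, per_charge_limit=50))   # still rejected by the domain rule
```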

6.4 Input Sanitization

Sanitization is about ensuring input cannot be used to exploit downstream systems.

Common concerns:

  • SQL injection
  • command injection
  • path traversal
  • log injection
  • XSS if data will later be rendered in browsers

The important mindset is not "strip all special characters". That often breaks legitimate input.

Better practice:

  • use parameterized queries
  • encode output for the correct context
  • validate formats where needed
  • avoid blindly interpolating request data into logs or shell commands
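
Parameterized queries are the clearest of these practices to demonstrate. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `users` table and `find_user` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user(connection: sqlite3.Connection, name: str) -> list:
    # The ? placeholder passes `name` as data, so it is never parsed as SQL.
    query = "SELECT id, name FROM users WHERE name = ?"
    return connection.execute(query, (name,)).fetchall()


hostile = "alice' OR '1'='1"
normal_rows = find_user(conn, "alice")
hostile_rows = find_user(conn, hostile)  # treated as an odd literal name, matches nothing
```

Note that the hostile input is not stripped or rewritten. It is simply never interpreted as SQL.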

6.5 Idempotency

Idempotency is one of the most important backend interview concepts.

A request is idempotent if sending it multiple times has the same effect as sending it once.

Why it matters:

  • clients retry when timeouts happen
  • networks fail after the server may already have processed the request
  • gateways or proxies may retry transient failures

Typical example: payments.

If a client sends "charge $100" twice because the first response was lost, you do not want to charge the card twice.

A common solution is an idempotency key.

sequenceDiagram
	participant Client
	participant API
	participant Store as Idempotency Store
	participant Payment as Payment Service

	Client->>API: POST /charges + Idempotency-Key: abc123
	API->>Store: Lookup key abc123
	alt Key not found
		Store-->>API: miss
		API->>Payment: Execute charge
		Payment-->>API: success
		API->>Store: Save key + normalized request hash + response
		API-->>Client: 200 OK with result
	else Key found
		Store-->>API: previous response
		API-->>Client: return same stored result
	end

Best practices for idempotency:

  • store both the key and enough request fingerprinting to detect misuse
  • scope keys appropriately, often per client or per endpoint family
  • keep key retention long enough to cover realistic retry windows
  • use idempotency keys for operations that are not naturally idempotent, such as payment creation or order submission
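
A minimal in-memory sketch of the flow in the diagram, including the request fingerprint and per-client scoping. The names `IdempotentHandler` and `IdempotencyConflict` are assumptions, and the sketch is not concurrency-safe: a production version would need an atomic "insert if absent" in the store so that two simultaneous requests with the same key cannot both execute.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """The same key was reused with a different request body."""


class IdempotentHandler:
    def __init__(self, execute) -> None:
        self.execute = execute
        self.store = {}  # (client_id, key) -> (request fingerprint, response)

    def handle(self, client_id: str, key: str, request: dict):
        # Normalize the body before hashing so key order does not matter.
        body = json.dumps(request, sort_keys=True).encode()
        fingerprint = hashlib.sha256(body).hexdigest()
        record = self.store.get((client_id, key))
        if record is not None:
            stored_fingerprint, response = record
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict("key reused with a different request body")
            return response  # replay the stored result, do not execute again
        response = self.execute(request)
        self.store[(client_id, key)] = (fingerprint, response)
        return response


charges = []


def charge(request: dict) -> dict:
    charges.append(request)
    return {"charge_id": len(charges), "amount": request["amount"]}


handler = IdempotentHandler(charge)
first = handler.handle("client-1", "abc123", {"amount": 100})
retry = handler.handle("client-1", "abc123", {"amount": 100})  # lost-response retry
```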

6.6 Replay Protection

Replay protection prevents an attacker or buggy intermediary from resending a valid request later.

Common techniques:

  • timestamps with expiration windows
  • nonces stored briefly to prevent reuse
  • signed requests that include method, path, body hash, and timestamp

This is especially common in webhook verification and partner API integrations.
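
The timestamp window and the nonce cache work together, which a short sketch makes concrete. `ReplayGuard` and its in-memory dictionary are illustrative assumptions; a shared deployment would keep nonces in a store with a TTL.

```python
class ReplayGuard:
    """Rejects requests that are too old or whose nonce was already seen."""

    def __init__(self, max_age_seconds: float) -> None:
        self.max_age = max_age_seconds
        self.seen = {}  # nonce -> timestamp the request carried

    def accept(self, nonce: str, timestamp: float, now: float) -> bool:
        # Nonces older than the window can be forgotten: a replay that old
        # is already rejected by the timestamp check below.
        self.seen = {n: t for n, t in self.seen.items() if now - t <= self.max_age}
        if abs(now - timestamp) > self.max_age:
            return False  # too old, or too far in the future
        if nonce in self.seen:
            return False  # same request presented twice
        self.seen[nonce] = timestamp
        return True


guard = ReplayGuard(max_age_seconds=300)
original = guard.accept("n1", timestamp=1000, now=1010)
replayed = guard.accept("n1", timestamp=1000, now=1020)
stale = guard.accept("n2", timestamp=1000, now=1400)
fresh = guard.accept("n3", timestamp=1390, now=1400)
```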

6.7 Request Signing Basics

Request signing often works like this:

  1. client builds a canonical string from method, path, timestamp, and body hash
  2. client signs it with an HMAC secret or private key
  3. server recomputes expected signature
  4. server rejects if signature differs or timestamp is too old

Why it exists:

  • verifies authenticity
  • detects tampering
  • supports replay protection when timestamp and nonce are included

GitHub-style or Stripe-style webhooks commonly use a variant of this pattern so receivers can verify that an event really came from the platform.
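
The four steps can be sketched with Python's standard `hmac` and `hashlib` modules. The canonical string layout here (newline-joined method, path, timestamp, body hash) is an assumption for illustration and is not the format of any specific provider.

```python
import hashlib
import hmac


def canonical_string(method: str, path: str, timestamp: int, body: bytes) -> str:
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, str(timestamp), body_hash])


def sign(secret: bytes, method: str, path: str, timestamp: int, body: bytes) -> str:
    message = canonical_string(method, path, timestamp, body).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, method: str, path: str, timestamp: int, body: bytes,
           signature: str, now: int, max_age: int = 300) -> bool:
    if abs(now - timestamp) > max_age:
        return False  # outside the replay window
    expected = sign(secret, method, path, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


secret = b"shared-secret"
body = b'{"amount": 100}'
signature = sign(secret, "POST", "/charges", 1000, body)

valid = verify(secret, "POST", "/charges", 1000, body, signature, now=1010)
tampered = verify(secret, "POST", "/charges", 1000, b'{"amount": 999}', signature, now=1010)
expired = verify(secret, "POST", "/charges", 1000, body, signature, now=2000)
```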

6.8 Common Validation Mistakes

  • trusting frontend validation
  • validating schema but not business semantics
  • implementing idempotency without scoping or request hashing
  • rejecting too late after expensive downstream work already happened
  • logging raw secrets, tokens, or signed payloads
  • treating all retries as duplicates without considering request identity

7. API Versioning

Versioning exists because APIs change, but clients do not upgrade instantly.

7.1 Why Versioning Matters

Without a versioning strategy:

  • client upgrades become risky
  • breaking changes become outages
  • multiple mobile app versions become painful to support
  • integration partners lose trust

Strong backend teams design for API evolution, not just initial launch.

7.2 Common Versioning Strategies

| Strategy | Example | Benefits | Drawbacks | When useful |
| --- | --- | --- | --- | --- |
| URI versioning | /v1/orders | explicit and easy to see | path clutter, can encourage large version forks | public REST APIs |
| Header versioning | X-API-Version: 2025-10-01 | cleaner URLs, flexible rollout | less visible, harder to debug manually | mature APIs, platform clients |
| Media type versioning | Accept: application/vnd.company.v2+json | precise content negotiation | operationally less friendly, not beginner-friendly | specialized APIs |
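
To make the three strategies concrete, the sketch below extracts a version from each of the example forms in the table. An API would normally commit to one strategy; checking all three, the precedence order, and the `DEFAULT_VERSION` fallback are assumptions made only so the forms can be compared side by side.

```python
import re

DEFAULT_VERSION = "1"  # hypothetical fallback when the client sends no version


def resolve_version(path: str, headers: dict) -> str:
    # URI versioning: /v1/orders
    uri_match = re.match(r"^/v(\d+)/", path)
    if uri_match:
        return uri_match.group(1)
    lowered = {name.lower(): value for name, value in headers.items()}
    # Header versioning: X-API-Version: 2025-10-01
    if "x-api-version" in lowered:
        return lowered["x-api-version"]
    # Media type versioning: Accept: application/vnd.company.v2+json
    media_match = re.search(r"application/vnd\.company\.v(\d+)\+json", lowered.get("accept", ""))
    if media_match:
        return media_match.group(1)
    return DEFAULT_VERSION


from_uri = resolve_version("/v1/orders", {})
from_header = resolve_version("/orders", {"X-API-Version": "2025-10-01"})
from_media = resolve_version("/orders", {"Accept": "application/vnd.company.v2+json"})
```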

7.3 Backward Compatibility

This is often more important than the version number itself.

Safer changes:

  • adding optional fields
  • adding new endpoints
  • adding new enum values only if clients are tolerant

Risky changes:

  • removing fields
  • renaming fields
  • changing meaning or units of existing fields
  • making optional or nullable fields required

Production rule: additive evolution is easier than breaking evolution.

7.4 Deprecation Strategy

Good deprecation is operational, not just documented.

A practical strategy:

  1. announce deprecation clearly
  2. measure usage of the old version
  3. provide migration docs and examples
  4. support both versions during a migration window
  5. alert high-usage customers directly if possible
  6. set and communicate a sunset date

7.5 Migration Strategy

Strong teams avoid big-bang migrations.

Typical approach:

  • dual-read or dual-write only when necessary and carefully controlled
  • gateway routes old and new versions separately
  • monitor client adoption
  • migrate major SDKs first
  • cut off the oldest, least-safe versions gradually

7.6 Real-World Examples

  • Stripe is well known for careful API versioning because payment integrations cannot break casually.
  • GitHub exposes explicit API versioning so clients know which contract they are using.
  • Internal microservices often use protobuf or schema evolution rules instead of public URI versioning.

7.7 Best Practices

  • version only when needed; do not fork casually
  • prefer backward-compatible changes when possible
  • monitor version usage by client and tenant
  • keep error formats consistent across versions
  • document behavioral differences, not just field differences

8. Throttling

Rate limiting and throttling are related but not identical.

8.1 Throttling vs Rate Limiting

| Concept | Main goal | Typical action | Example |
| --- | --- | --- | --- |
| Rate limiting | enforce quota or fairness | reject when request budget is exceeded | 100 requests per minute per API key |
| Throttling | protect system under stress or shape traffic | slow down, queue, degrade, or reject | reduce expensive search traffic during overload |

Rate limiting is often policy-driven.

Throttling is often system-health-driven.

8.2 Why Throttling Exists

Even legitimate traffic can overwhelm a system.

Throttling helps you:

  • degrade gracefully rather than crash
  • preserve critical endpoints over less important ones
  • absorb bursts temporarily
  • smooth traffic into fragile downstream systems

8.3 Graceful Degradation

Graceful degradation means not every feature must remain equally available during stress.

Examples:

  • checkout stays available, recommendations are temporarily disabled
  • write-heavy analytics ingestion is delayed, user sign-in remains online
  • expensive search filters are limited, basic search still works

This is how mature systems preserve business-critical value during incidents.

8.4 Queueing

Queueing is useful when the work does not need to complete synchronously.

Examples:

  • email sending
  • thumbnail generation
  • event enrichment
  • some analytics processing

Why it helps:

  • decouples request acceptance from background work
  • smooths spikes
  • improves perceived responsiveness if the request can return early

Danger: unbounded queues are just hidden outages. If the queue grows forever, latency becomes effectively infinite.
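
A bounded queue turns that hidden failure into an explicit one. The sketch below uses the standard library `queue.Queue` with a `maxsize`; the tiny capacity and the `enqueue` helper are illustrative.

```python
import queue

# A small bound makes the behavior easy to see; real limits come from capacity planning.
jobs = queue.Queue(maxsize=3)


def enqueue(job) -> bool:
    """Accept the job if there is room; otherwise tell the caller to back off."""
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        # Rejecting here is the signal to return 429 or 503 upstream
        # instead of letting latency grow without limit.
        return False


outcomes = [enqueue(f"email-{i}") for i in range(5)]
```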

8.5 Shedding Load

Load shedding means rejecting traffic on purpose so the rest of the system survives.

This can feel wrong to beginners, but it is often the correct decision.

Serving 70 percent of traffic quickly is better than timing out 100 percent of traffic after exhausting all workers.

Common strategies:

  • reject low-priority requests first
  • enforce concurrency caps on expensive endpoints
  • return stale cached data for noncritical reads
  • cut off optional features during incidents
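
The first two strategies can be combined in one small admission check: a concurrency cap where part of the capacity is reserved for critical requests. The class name, the two-level priority, and the numbers are assumptions; the sketch is single-threaded and would need a lock in a concurrent server.

```python
class LoadShedder:
    """Concurrency cap that sheds low-priority requests first."""

    def __init__(self, max_in_flight: int, reserved_for_critical: int) -> None:
        self.max_in_flight = max_in_flight
        self.reserved_for_critical = reserved_for_critical
        self.in_flight = 0

    def try_acquire(self, critical: bool) -> bool:
        # Non-critical traffic may not use the reserved headroom.
        limit = self.max_in_flight if critical else self.max_in_flight - self.reserved_for_critical
        if self.in_flight >= limit:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1


shedder = LoadShedder(max_in_flight=10, reserved_for_critical=3)
optional_admitted = sum(shedder.try_acquire(critical=False) for _ in range(10))
critical_admitted = sum(shedder.try_acquire(critical=True) for _ in range(5))
```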

8.6 Best Practices

  • define traffic priority classes
  • keep queues bounded
  • expose clear client signals such as 429 or 503
  • combine throttling with backoff guidance for clients
  • ensure degraded mode is tested before an incident

8.7 Common Mistakes

  • using queueing for work that users expect immediately
  • letting retries refill the queue faster than it drains
  • not distinguishing critical and noncritical traffic
  • treating throttling only as an edge concern when downstream systems are the real bottleneck

9. Reverse Proxy and Load Balancer

These terms overlap in practice, but they are not identical.

9.1 Reverse Proxy Role

A reverse proxy sits in front of backend servers and receives requests on their behalf.

Clients think they are talking to one endpoint. The proxy decides how to forward requests internally.

Why it exists:

  • hide internal topology
  • centralize TLS handling
  • compress and cache content
  • enforce some security rules
  • simplify operational control

Common tools:

  • NGINX
  • Envoy
  • HAProxy
  • cloud-managed L7 load balancers

9.2 SSL/TLS Termination

TLS termination means the proxy or load balancer decrypts incoming HTTPS traffic.

Benefits:

  • central certificate management
  • offload crypto work from backend services
  • enables L7 inspection for routing and policy

Tradeoff:

  • internal traffic must still be protected appropriately
  • if you terminate at the edge and send plaintext internally, the trust boundary moves inward

Many production systems re-encrypt internally or use mTLS on internal hops.

9.3 Caching

Reverse proxies can cache static assets and some API responses.

Why it matters:

  • lower origin load
  • lower latency
  • better resilience during backend spikes

But be careful with:

  • personalized responses
  • auth-dependent content
  • stale data after updates

9.4 Compression

Compression reduces payload size for responses such as JSON, HTML, CSS, and JS.

Benefits:

  • lower bandwidth
  • faster transfers for text-heavy payloads

Tradeoff:

  • CPU overhead
  • not useful for already compressed formats such as many images or zipped binaries

9.5 WAF Basics

A Web Application Firewall applies security rules to incoming requests.

It commonly helps with:

  • blocking obviously malicious payloads
  • filtering known exploit patterns
  • enforcing IP reputation rules
  • reducing bot and abuse traffic

Important nuance: a WAF is helpful, but it is not a substitute for secure application code and proper validation.

9.6 CDN Relationship

A CDN is often the outermost layer, especially for global systems.

Typical order:

  1. CDN
  2. WAF or reverse proxy
  3. API gateway
  4. internal load balancing and service routing

CDNs are best for:

  • static assets
  • edge caching
  • some globally cacheable API responses
  • DDoS absorption and edge presence

9.7 Reverse Proxy vs API Gateway

A reverse proxy is often lower-level and more generic.

An API gateway usually adds richer API-specific behavior such as:

  • auth policies
  • API keys
  • per-route rate limits
  • response transformation
  • version-aware routing

In practice, the same product may serve both roles.

10. L4 vs L7

This comparison appears constantly in interviews.

10.1 The Basic Difference

  • L4 operates at the transport layer and sees only connection-level information: IP addresses, TCP or UDP ports, and protocol.
  • L7 operates at the application layer, understanding HTTP methods, paths, headers, cookies, and sometimes message semantics.

10.2 Comparison Table

| Dimension | L4 | L7 |
| --- | --- | --- |
| Visibility | IP, port, protocol, connection metadata | URL path, headers, host, cookies, method, status |
| Speed | generally lower overhead | generally more overhead due to parsing and richer policy |
| Routing options | by IP and port | by path, host, header, content-type, user, version |
| TLS handling | can pass through TLS | often terminates TLS to inspect HTTP |
| Use cases | very high throughput transport balancing, TCP services, simple load distribution | APIs, canary routing, auth, rate limits, response transforms |
| Observability | coarse | rich request-aware observability |
| Examples | AWS NLB, IPVS-style balancing, transport proxies | AWS ALB, Envoy, NGINX, Kong |

10.3 Performance Tradeoffs

Why choose L4:

  • lower per-request overhead
  • works for non-HTTP protocols
  • simpler fast-path routing

Why choose L7:

  • can make smarter decisions
  • can enforce API policy
  • can support sophisticated routing and deployment patterns

Interview answer pattern:

"I would use L7 where I need HTTP-aware routing, auth, canarying, or response handling. I would prefer L4 for simpler, high-throughput transport balancing or protocols where application parsing is unnecessary."

11. Health Checks

Health checks determine whether an instance should receive traffic or be restarted.

11.1 Liveness

Liveness asks: is the process alive at all?

If liveness fails, the platform may restart the container or instance.

Best practice: keep liveness simple. It should detect deadlock or fatal stuck states, not depend on every external dependency.

11.2 Readiness

Readiness asks: is this instance ready to serve traffic right now?

If readiness fails, the instance should stop receiving traffic, but it does not necessarily need a restart.

Examples of not-ready:

  • startup still in progress
  • critical dependency unavailable
  • cache warmup incomplete if that makes service unusable
  • instance is draining for deployment

11.3 Startup Probes

Startup probes exist because some applications take time to initialize.

Without startup-aware logic, a slow boot may be misclassified as a dead process and restarted repeatedly.

11.4 Dependency Health

This is subtle.

Should readiness depend on downstream dependencies?

Answer: only on truly critical dependencies.

If a noncritical dependency fails and the service can degrade gracefully, readiness should often stay healthy. Otherwise you risk removing all instances from service for a partial dependency issue.
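
The distinction can be sketched as two handlers that return an HTTP status. The function names, the dependency names, and the idea of passing dependency status as a dictionary are illustrative assumptions.

```python
def liveness() -> int:
    # Deliberately shallow: being able to answer at all shows the process is not wedged.
    return 200


def readiness(dependency_status: dict, critical: set) -> tuple:
    """Return (status, failing critical dependencies); noncritical outages are ignored."""
    failing = [name for name in sorted(critical) if not dependency_status.get(name, False)]
    return (503, failing) if failing else (200, [])


critical_dependencies = {"database"}

# Recommendations are down, but the service can degrade gracefully: stay ready.
degraded = readiness({"database": True, "recommendations": False}, critical_dependencies)

# The database is down: stop receiving traffic, but do not restart the process.
not_ready = readiness({"database": False, "recommendations": True}, critical_dependencies)
```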

11.5 Kubernetes Relevance

In Kubernetes:

  • liveness probe failure can restart a pod
  • readiness probe failure removes the pod from service endpoints
  • startup probe delays liveness/readiness enforcement until boot completes

This is why bad probes can cause cascading production pain.

11.6 Probe Comparison

| Probe | Purpose | Good use | Common mistake |
| --- | --- | --- | --- |
| Liveness | detect dead or stuck process | deadlock or irrecoverable internal failure | checking database and causing restart storms |
| Readiness | decide whether to receive traffic | dependency-aware traffic gating | marking ready too early |
| Startup | allow slow initialization | JVM warmup, cache preload, large bootstraps | omitting it for slow-start services |

11.7 Best Practices

  • keep liveness shallow
  • make readiness meaningful
  • distinguish recoverable dependency issues from fatal ones
  • support graceful shutdown by failing readiness first, then draining connections
  • test probe behavior during deployments and partial outages

12. Failover

Failover is the process of moving traffic or responsibility from a failing component to a healthy one.

12.1 Why It Matters

Failure is not exceptional in distributed systems. Machines fail, networks partition, regions degrade, and deployments go wrong.

Failover is how the system continues serving despite that reality.

12.2 Active-Passive Failover

One environment serves traffic. Another waits in standby.

Variants:

  • cold standby: mostly offline until needed
  • warm standby: partially provisioned
  • hot standby: ready to serve immediately

Benefits:

  • easier reasoning about writes
  • lower coordination complexity

Costs:

  • potentially slower failover
  • wasted standby capacity

12.3 Active-Active Failover

Multiple regions or clusters actively serve traffic.

Benefits:

  • lower latency for global users
  • fast failover because traffic is already live elsewhere
  • better capacity utilization

Challenges:

  • data consistency
  • write coordination
  • duplicate processing risk
  • routing users to the correct regional data

12.4 Regional Failover

Regional failover usually means a global traffic manager stops sending requests to a bad region and shifts them elsewhere.

flowchart LR
	User[Client Traffic] --> GTM[Global Traffic Manager]
	GTM -->|Primary healthy| R1[Primary Region]
	GTM -->|Standby or overflow| R2[Secondary Region]
	R1 -. region unhealthy .-> GTM
	GTM -->|Failover| R2

Hard parts:

  • DNS caches may slow failover
  • stateful sessions may not exist in the secondary region
  • databases may lag if replication is asynchronous
  • legal or data residency rules may limit where data can fail over
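
The traffic manager's decision in the diagram can be sketched as a health tracker plus a preference-ordered pick. The consecutive-failure threshold is an assumption added to avoid flapping on a single failed probe; it does not address the DNS, session, or replication problems listed above.

```python
class RegionHealth:
    """Marks a region unhealthy after several consecutive failed probes."""

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = {}

    def record(self, region: str, ok: bool) -> None:
        # One success resets the streak; failures accumulate.
        previous = self.consecutive_failures.get(region, 0)
        self.consecutive_failures[region] = 0 if ok else previous + 1

    def is_healthy(self, region: str) -> bool:
        return self.consecutive_failures.get(region, 0) < self.failure_threshold


def pick_region(preferred_order: list, health: RegionHealth):
    """Return the first healthy region in preference order, or None if all are down."""
    for region in preferred_order:
        if health.is_healthy(region):
            return region
    return None


health = RegionHealth(failure_threshold=3)
order = ["primary", "secondary"]
choices = [pick_region(order, health)]
for _ in range(3):
    health.record("primary", ok=False)
    choices.append(pick_region(order, health))
health.record("primary", ok=True)
choices.append(pick_region(order, health))
```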

12.5 Database Failover Basics

This is where request-handling discussions must connect to the data layer.

Stateless compute failover is relatively straightforward.

Database failover is much harder because of:

  • replication lag
  • split-brain risk
  • leader election complexity
  • transaction durability guarantees
  • write fencing and stale primary protection

Common patterns:

  • primary-replica with leader promotion
  • managed database multi-AZ failover
  • read replicas for scale, primary for writes
  • careful multi-region replication for stricter availability needs

12.6 RTO and RPO

These two terms matter in interviews.

  • RTO: Recovery Time Objective. How quickly must the system recover?
  • RPO: Recovery Point Objective. How much data loss is acceptable?

Examples:

  • a chat notification system may tolerate some message delay and nonzero RPO
  • a payment ledger system usually needs very low RPO and strict correctness

Lower RTO and lower RPO generally increase cost and complexity.

12.7 Best Practices

  • design failover with data consistency in mind, not just traffic movement
  • test failover regularly, not just in slide decks
  • make failover automation observable and reversible
  • drain traffic from degraded zones before full failure when possible
  • separate control-plane failure from data-plane failure in your reasoning

12.8 Common Mistakes

  • assuming DNS failover is instant
  • forgetting session or cache locality during regional failover
  • promoting replicas without considering lag or split-brain protection
  • calling a system active-active when only the stateless tier is active-active

13. Sticky Sessions

Sticky sessions mean requests from the same client are repeatedly routed to the same backend instance.

13.1 Why They Exist

They are often used when session state is stored in process memory on the server.

Examples:

  • legacy web sessions
  • websocket affinity
  • in-memory shopping cart state in older systems

13.2 When They Help

  • short-term compatibility during migration away from stateful servers
  • workloads where per-connection state is expensive to rebuild
  • some real-time protocols that prefer affinity

13.3 Why They Are Often Avoided

Modern backend design prefers stateless services because stateless services:

  • scale horizontally more easily
  • recover from failure more cleanly
  • tolerate rebalancing better
  • simplify failover and deployment

Sticky sessions hurt these properties.

Problems they create:

  • uneven load distribution
  • poor failover if the chosen instance dies
  • harder autoscaling
  • harder cross-region portability

13.4 Better Alternatives

  • external session store such as Redis
  • signed or encrypted stateless tokens where appropriate
  • shared caches for session metadata
  • move connection state into durable or replicated infrastructure when necessary

13.5 Common Interview Answer

If asked about sticky sessions, a strong answer is:

"I would avoid them unless the workload truly needs affinity. In general I prefer stateless services and store session state externally so the load balancer can route to any healthy instance."

14. How These Pieces Fit Together in Real Architecture

This is the part many interview answers miss. These systems are not isolated topics. They work together as one request-handling pipeline.

14.1 Typical SaaS API Architecture

flowchart TD
	Client[Client / Partner API Consumer] --> CDN[CDN]
	CDN --> Edge[WAF / Reverse Proxy]
	Edge --> Gateway[API Gateway]
	Gateway --> Limits[Auth / Versioning / Rate Limit / Validation]
	Limits --> LB[L7 Routing + Load Balancing]
	LB --> App1[Service Instance Group A]
	LB --> App2[Service Instance Group B]
	App1 --> Cache[(Redis / Cache)]
	App1 --> DB[(Primary DB + Replicas)]
	App2 --> Queue[(Async Queue)]
	Gateway -. traces .-> Obs[Logs / Metrics / Tracing]
	App1 -. traces .-> Obs
	App2 -. traces .-> Obs
	Gateway -. service discovery .-> Registry[Service Registry / Orchestrator]

14.2 End-to-End Example: Payment API

Consider POST /payments in a Stripe-like or SaaS billing system.

  1. Client sends HTTPS request with auth token and idempotency key.
  2. CDN or edge forwards dynamic request to reverse proxy.
  3. Gateway terminates TLS, checks auth, attaches trace IDs, and applies per-client rate limit.
  4. Gateway validates request shape and routes to payment service.
  5. Load balancer picks a healthy instance.
  6. Payment service checks business rules and idempotency store.
  7. Service writes to database transactionally and calls external payment processor if needed.
  8. Response returns through gateway, which logs metadata and surfaces normalized errors.

This one request involves:

  • authentication
  • rate limiting
  • validation
  • routing
  • load balancing
  • idempotency
  • observability
  • external failure handling

14.3 End-to-End Example: Global Consumer App

Consider a Netflix-like or Uber-like global product.

  1. Global traffic manager chooses a region.
  2. Edge or gateway enforces auth and traffic policy.
  3. Router sends request to correct service version, maybe with canary rules.
  4. Service mesh or internal load balancing handles service-to-service calls.
  5. Read requests may hit regional caches first.
  6. If one dependency is degraded, the system may throttle optional features and preserve the core experience.

14.4 Layer-to-Concern Mapping

| Layer | Main concerns |
| --- | --- |
| CDN and edge | caching, DDoS absorption, global reach |
| Reverse proxy or WAF | TLS termination, request filtering, compression |
| API gateway | auth, rate limiting, versioning, route policy, observability |
| L7 routing and load balancing | path and header-aware routing, canary, health-aware distribution |
| Service layer | business validation, idempotency, authorization on domain objects |
| Data layer | constraints, consistency, replication, failover |
| Observability platform | logs, metrics, traces, alerting |

15. Real-World Discussion Patterns

15.1 Google-Style or Amazon-Scale Thinking

At very large scale, request handling is strongly influenced by geography and fleet management.

The important patterns are:

  • global traffic steering
  • regional isolation
  • heavy automation around health and rollout
  • multi-layer balancing rather than one magical balancer

15.2 Netflix-Style Thinking

A company deploying frequently cares deeply about:

  • canary releases
  • traffic shaping
  • resilience under partial failure
  • observability and fast rollback

15.3 Uber-Style Thinking

Systems tied to geography or real-time state often care about:

  • geo-aware routing
  • regional capacity balancing
  • latency sensitivity
  • selective degradation during spikes

15.4 GitHub-Style and Stripe-Style Thinking

Public API platforms care deeply about:

  • stable API contracts
  • client-visible rate limiting
  • request signing and webhook verification
  • versioning discipline
  • auditability and correctness

15.5 Typical SaaS Thinking

SaaS platforms often need a combination of:

  • tenant-aware routing
  • per-tenant quotas and rate limits
  • centralized auth and observability
  • low operational complexity relative to global hyperscale systems

16. Common Interview Questions and Strong Angles

16.1 "Where would you put rate limiting?"

Strong answer:

"Mostly at the gateway or edge for cheap rejection, but I may also add service-level limits for expensive operations or tenant-specific protections."

16.2 "How do you avoid duplicate writes when retries happen?"

Strong answer:

"Use idempotency keys or natural idempotency where possible, persist enough request identity to detect duplicates, and only retry safely repeatable operations."

16.3 "When would you choose L4 vs L7?"

Strong answer:

"L7 when I need HTTP-aware routing and policies like auth, canary, or versioning. L4 when I need simpler, high-throughput transport balancing or I do not need application-layer inspection."

16.4 "How would you do a safe deployment?"

Strong answer:

"Use health checks plus canary or blue-green routing, monitor business and technical metrics, and ensure rollback is fast."

16.5 "What happens if the rate limiter or discovery service goes down?"

Strong answer:

"I need a failure policy. For some endpoints I fail open to preserve availability; for abuse-sensitive endpoints I may fail closed. For discovery, I keep short-lived cached endpoint data and remove unhealthy instances quickly, but I do not rely on stale data too long."

16.6 "Why avoid sticky sessions?"

Strong answer:

"Because they make scaling, failover, and even load distribution harder. Stateless services are easier to operate and recover."

17. Common Mistakes Across the Whole Topic

  • describing only the happy path and ignoring failure behavior
  • saying "use a load balancer" without specifying which type or why
  • retrying everything blindly
  • forgetting idempotency on write APIs
  • conflating authentication with authorization
  • overusing the gateway as a business-logic layer
  • assuming health checks are trivial
  • assuming DNS-based failover is immediate and sufficient
  • using sticky sessions to avoid fixing state management
  • forgetting observability at the edge and routing layers

18. Practical Best Practices Checklist

  • terminate or manage TLS deliberately; do not let trust boundaries be accidental
  • reject bad or abusive traffic as early as possible
  • keep services stateless when you can
  • make retries explicit, bounded, and idempotency-aware
  • use readiness checks to gate traffic and liveness checks to recover dead processes
  • monitor p95 and p99 latency, not just averages
  • make rollout and failover mechanisms observable
  • keep routing policy simple enough to debug under pressure
  • treat API versioning as a product and operational discipline, not just a URL pattern
  • test degraded modes, not just normal operation
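
As one illustration of "explicit, bounded, and idempotency-aware" retries, the sketch below caps the number of attempts and uses exponential backoff with full jitter. The `TransientError` type and the parameter names are assumptions, and the wrapped operation is assumed to be safe to repeat (for example, because it carries an idempotency key).

```python
import random
import time


class TransientError(Exception):
    """A failure that is worth retrying, such as a timeout (hypothetical type)."""


def call_with_retries(operation, max_attempts: int, base_delay: float, max_delay: float,
                      sleep=time.sleep, rng=random):
    """Call `operation` up to max_attempts times with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure instead of looping
            # Full jitter: wait a random time up to the capped exponential ceiling.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))


slept = []
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("temporary failure")
    return "ok"


# Record the delays instead of sleeping so the example runs instantly.
result = call_with_retries(flaky, max_attempts=4, base_delay=0.1, max_delay=1.0, sleep=slept.append)
```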

19. Final Mental Model

The cleanest way to think about request handling is this:

Request handling is the system that protects scarce resources while getting the right request to the right code, at the right time, under the right policy, even when parts of the system are failing.

If you can explain request handling from that perspective, you will do well in interviews and you will design more production-ready backend systems.