tarun-elango 26810e43d0 sd text

2026-04-26 13:27:19 -04:00

Reliability & Protection

Reliability and protection are the control systems that keep a backend usable when reality stops being polite. In a clean whiteboard interview, requests arrive at a nice steady rate, services are healthy, latency is predictable, and failures are isolated. In production, the opposite is usually true:

traffic is bursty, not smooth
clients retry badly
dependencies slow down before they fail
bots scrape and abuse public endpoints
one noisy tenant can starve everyone else
instances become partially unhealthy
engineers need to understand outages while users are actively impacted

This is why good distributed systems do more than process business logic. They also protect themselves, measure themselves, and explain themselves.

This guide is written for two audiences at the same time:

Someone preparing for backend and system design interviews who needs strong, structured explanations.
Someone trying to understand how real production platforms stay stable under load.

Examples in this guide are generalized from widely used public industry patterns rather than private implementation details, but they map closely to how large companies such as Google, Netflix, Amazon, Uber, Stripe, GitHub, and large SaaS platforms reason about these systems.

1. Big Picture: What Reliability & Protection Actually Mean

At a high level, reliability is the ability of a system to continue providing acceptable service over time, even when components fail, traffic changes, or dependencies misbehave.

Protection is the set of mechanisms that stop the system from being destabilized by bad traffic, abusive clients, overload, or unhealthy internal states.

These two ideas are tightly connected:

rate limiting protects reliability by controlling who gets to consume scarce capacity
monitoring protects reliability by detecting degradation quickly
logging and tracing protect reliability by making incidents diagnosable
health checks protect reliability by keeping broken instances out of the serving path
observability protects reliability by shortening mean time to recovery, or MTTR

If request handling decides where work goes, reliability and protection decide whether the system survives doing that work.

1.1 Core Questions These Systems Answer

Every mature backend eventually needs good answers to questions like these:

How do we stop one client from overwhelming the system?
How do we know something is wrong before users file tickets?
When the system is slow, how do we know whether the bottleneck is the app, the database, or the network?
How do we debug a single failing user request across 10 microservices?
How do we keep unhealthy instances from receiving traffic?
How do we decide whether to reject, delay, retry, degrade, or fail over?

Interviewers like this topic because it reveals whether you understand production reality, not just architecture vocabulary.

1.2 Reliability Control Surface in a Typical Backend

flowchart LR
	Client[Client / Browser / Mobile App] --> Edge[CDN / WAF / Edge Proxy]
	Edge --> Gateway[API Gateway]
	Gateway --> ServiceA[Service A]
	Gateway --> ServiceB[Service B]
	ServiceA --> Cache[(Cache)]
	ServiceA --> DB[(Primary DB)]
	ServiceB --> MQ[(Queue / Stream)]
	ServiceB --> Ext[External API]

	Edge -. edge rate limits .-> RL1[Protection Layer]
	Gateway -. auth + quotas + route metrics .-> RL2[Policy Layer]
	ServiceA -. app metrics/logs/traces .-> Obs[Observability Stack]
	ServiceB -. app metrics/logs/traces .-> Obs
	Gateway -. access logs + latency + trace roots .-> Obs
	Edge -. DDoS signals + blocked traffic .-> Obs

	LB[Load Balancer / Service Mesh] --> ServiceA
	LB --> ServiceB
	HC[Health Checks] -. remove unhealthy endpoints .-> LB

1.3 Interview Framing

A weak interview answer says:

"I would add monitoring and rate limiting."

A strong interview answer says:

"I would enforce coarse rate limits at the edge to stop abusive traffic early, then add service-level quotas for expensive operations. I would instrument RED metrics at the gateway and USE metrics for the database and worker pools, propagate trace IDs across services, centralize structured logs for debugging, and use readiness checks so bad instances are drained before they receive traffic."

That answer shows placement, purpose, and tradeoffs.

2. Rate Limiting

Rate limiting is one of the most important protection mechanisms in backend systems because it prevents demand from turning into collapse.

2.1 Core Idea

Fundamentally, rate limiting controls how quickly a client, tenant, API key, IP address, or internal caller can consume a resource.

The resource might be:

HTTP requests per second
login attempts per minute
messages published per second
database writes per tenant
expensive AI inference calls per hour
webhook deliveries per endpoint

Rate limiting exists because backend resources are finite. CPU, memory, database connections, thread pools, cache bandwidth, and downstream API budgets are all limited. Without protection, the system behaves unfairly and often unstably.

2.2 What Rate Limiting Protects Against

Problem	What happens without limits	Why rate limiting helps
Abuse and bots	scrapers, credential stuffing, spam, brute force requests	slows attackers, raises cost of abuse
Cost explosion	one client generates huge billable backend work	protects infrastructure and vendor spend
Unfairness	one noisy tenant degrades everyone else	enforces fairness and multi-tenant isolation
Cascading overload	overloaded service causes retries and wider collapse	sheds or delays work before queues explode
Capacity mismatch	demand spikes above safe throughput	keeps the system inside stable operating bounds

An important intuition: rate limiting is not mainly about denying traffic. It is about shaping demand so the system stays in a region where it can still serve useful work.

2.3 Where It Sits in the Request Lifecycle

Rate limiting can exist at multiple layers:

edge or CDN layer
web application firewall layer
API gateway layer
service or endpoint layer
internal RPC layer
asynchronous job admission layer

Each layer protects a different thing:

edge limiting protects internet-facing capacity and blocks obvious abuse cheaply
gateway limiting protects shared APIs and enforces customer quotas consistently
service-level limiting protects expensive or sensitive business operations
internal limiting protects downstream systems like databases, queues, or external providers

2.4 Multi-Layer Request Flow

flowchart LR
	C[Client] --> E[Edge / CDN / WAF]
	E --> G[API Gateway]
	G --> S1[Auth Service]
	G --> S2[Core API Service]
	S2 --> DB[(Database)]
	S2 --> P[(Payment / External Provider)]

	E -. IP reputation / bot filter / coarse rate limit .-> M1[Edge Protection]
	G -. API key / tenant / route quota .-> M2[Gateway Rate Limit]
	S2 -. expensive op guard / per-user write limit .-> M3[Service Limit]
	S2 -. metrics / logs / traces .-> O[Observability]
	G -. access logs / latency / rejections .-> O

This multi-layer approach matters because a single limiting point is usually too blunt. If you only limit at the service, abusive traffic still consumes gateway and network resources. If you only limit at the edge, an authenticated user may still abuse an expensive endpoint.

2.5 Token Bucket

Token bucket is one of the most common production rate limiting algorithms because it allows controlled bursts while maintaining a long-term average rate.

2.5.1 Intuition

Imagine a bucket holding tokens. A request can only proceed if it removes a token. Tokens are added back over time at a fixed refill rate until the bucket reaches a maximum size.

refill rate controls steady-state throughput
bucket size controls burst tolerance

If tokens exist, traffic can burst. If the bucket empties, excess requests are rejected or delayed.

2.5.2 How It Works Internally

The limiter stores two main pieces of state:

last refill timestamp
current token count

When a request arrives:

Compute elapsed time since last refill.
Add new tokens equal to the elapsed time multiplied by the refill rate.
Cap the bucket at max capacity.
If at least one token is available, consume one and allow the request.
Otherwise reject, throttle, or queue the request.

In equation form, if the refill rate is $r$ tokens per second and elapsed time is $\Delta t$, then:

$$
\text{new\_tokens} = \min\left(\text{capacity},\ \text{current\_tokens} + r \cdot \Delta t\right)
$$
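
The refill steps and the formula above can be sketched in a few lines of Python. This is a minimal single-process illustration, not a production limiter: the class name `TokenBucket`, the `cost` parameter, and the injectable `clock` are assumptions made for the example.

```python
import time


class TokenBucket:
    """Token bucket that refills lazily each time a request is checked."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second (r)
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so tests can control time
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last_refill
        # new_tokens = min(capacity, current_tokens + r * elapsed)
        self.tokens = min(self.capacity, self.tokens + self.rate * elapsed)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `rate=2` and `capacity=3`, three back-to-back requests pass, the fourth is rejected, and one more request is admitted after half a second. Refilling lazily on each check avoids a background timer per bucket, which matters when there is one bucket per client.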

2.5.3 Concept Diagram

flowchart LR
	T[Time passes] --> R[Add tokens at fixed rate]
	R --> B[(Token Bucket)]
	Req[Incoming request] --> Check{Token available?}
	B --> Check
	Check -->|Yes| Allow[Consume token and allow]
	Check -->|No| Reject[Reject / delay / degrade]

2.5.4 Why Production Systems Like It

Real traffic is rarely perfectly even. Users refresh a page, mobile clients reconnect, cron jobs fire on the minute, and webhooks fan out in bursts. Token bucket handles this better than a hard per-second cutoff because it absorbs short bursts without punishing normal behavior.

This is why APIs from platforms like Stripe, GitHub, and cloud providers often combine long-term quotas with burst allowance rather than a rigid request-per-second cap.

2.5.5 Configuration Tradeoffs

Setting	If too low	If too high
Refill rate	good clients get throttled during normal use	service may still overload
Bucket capacity	no burst tolerance, bad user experience	burst can overwhelm downstream dependencies
Key granularity	unfair sharing across users	high cardinality and more storage

The important tradeoff is burst versus smoothness. A large bucket improves user experience for bursty workloads, but it can create traffic spikes that a fragile downstream service cannot handle.

2.5.6 Common Interview Discussion

Interviewers often ask: "Why use token bucket instead of fixed window counting?"

Strong answer:

token bucket handles bursts better
it avoids harsh boundary effects like "100 requests at 12:00:59 and another 100 at 12:01:00"
it maps better to systems that want average throughput plus burst tolerance

2.6 Leaky Bucket

Leaky bucket is another classic algorithm. It is often used when you want steady outflow instead of burst-friendly admission.

2.6.1 Intuition

Imagine requests pouring into a bucket like water while the bucket leaks at a constant rate. If input comes faster than output, the bucket fills. Once full, new requests are dropped or blocked.

This models queue smoothing.

2.6.2 Internal Behavior

incoming requests are enqueued
a scheduler or consumer drains the queue at a fixed rate
if the queue exceeds capacity, additional requests are rejected

This produces a smoother stream downstream, which can be useful when the protected system is sensitive to burstiness.
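
A rough sketch of that behavior, assuming a bounded in-memory queue and a caller that invokes `drain` periodically. The names `LeakyBucket`, `offer`, and `drain` are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a fixed rate, regardless of arrival bursts."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity      # maximum number of queued requests
        self.leak_rate = leak_rate    # requests released per second
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, request):
        """Enqueue a request, or reject it if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now):
        """Release the requests that are due at time `now`."""
        slots = int((now - self.last_drain) * self.leak_rate)
        if slots <= 0:
            return []
        # Advance by whole slots only, so fractional time carries over.
        self.last_drain += slots / self.leak_rate
        released = []
        while slots > 0 and self.queue:
            released.append(self.queue.popleft())
            slots -= 1
        return released
```

Unused drain slots are discarded rather than banked. That is the design choice that keeps the outflow smooth: an idle period never turns into a later burst.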

2.6.3 Where It Is Useful

traffic shaping in network systems
smoothing writes to a fragile downstream service
controlling job dispatch into worker pools
protecting databases or third-party providers that dislike spikes

2.6.4 Token Bucket vs Leaky Bucket

Dimension	Token bucket	Leaky bucket
Traffic model	allows bursts up to bucket size	smooths traffic toward constant rate
User experience	better for bursty legitimate traffic	can add queueing delay
Downstream protection	moderate	stronger smoothing
Common usage	API rate limits, user quotas	shaping outbound work, queue drain control

The distinction is simple but important:

token bucket decides admission with burst allowance
leaky bucket decides drain rate with burst smoothing

Many real systems effectively combine both ideas. For example, an API may admit a burst via token bucket, then a worker queue may leak work toward a payment gateway at a controlled rate.

2.7 Distributed Counters

Single-process counters work in demos. They stop enforcing the intended limit as soon as the system is scaled horizontally.

If your API runs on 100 instances and each instance only tracks its own local counters, then a user can often exceed the intended global limit by spreading requests across many instances.

2.7.1 Why Single-Node Counters Break

load balancers route requests to different instances
autoscaling adds and removes nodes dynamically
restarts wipe in-memory state
multi-region routing makes local counters inconsistent globally

2.7.2 Redis-Based Counters

A common production approach is to keep rate limit state in Redis because it is fast, centralized, and supports atomic operations.

Typical designs use:

atomic increment with expiration for window counters
Lua scripts for atomic token bucket logic
sorted sets for sliding-window calculations

Why Redis is popular:

low latency
shared across instances
atomic primitives
operationally simpler than full database-backed counters
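
The increment-with-expiration pattern can be illustrated without a running Redis. The sketch below uses a hypothetical `InMemoryStore` that only mimics the semantics of the Redis `INCR` and `EXPIRE` commands; the key format `rl:{client}:{window}` and the function name are also assumptions.

```python
class InMemoryStore:
    """Stand-in for a shared store; mimics INCR and EXPIRE semantics only."""

    def __init__(self, clock):
        self.clock = clock
        self.values = {}
        self.expires_at = {}

    def incr(self, key):
        # Drop the key first if its expiry has passed.
        if key in self.expires_at and self.clock() >= self.expires_at[key]:
            self.values.pop(key, None)
            self.expires_at.pop(key, None)
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, seconds):
        self.expires_at[key] = self.clock() + seconds


def allow_fixed_window(store, client_id, limit, window_seconds, now):
    """Count requests per client per window; one key per window."""
    window = int(now // window_seconds)
    key = f"rl:{client_id}:{window}"
    count = store.incr(key)
    if count == 1:
        # First hit in this window: let the key clean itself up later.
        store.expire(key, window_seconds * 2)
    return count <= limit
```

Against a real store, the increment and the expiry are two separate round trips, so a crash between them can leave a key without a TTL. That gap is one reason the logic is often moved into a single atomic script.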

But Redis does not magically solve everything.

2.7.3 Consistency and Failure Challenges

Challenge	What can go wrong
replication lag	replicas may return stale counts
hot keys	a popular API key or IP becomes a bottleneck
partial outage	if Redis is down, limiter behavior must fail open or fail closed
cross-region latency	global counters become slow or inconsistent
clock dependence	time-window logic gets tricky around skew and boundaries

Fail-open means requests are allowed if the limiter backend is unavailable. This preserves availability but weakens protection. Fail-closed means requests are rejected when limiter state cannot be checked. This protects capacity but can create an outage for legitimate traffic.

Which one is correct depends on what you are protecting:

login abuse defenses often lean fail-closed or degrade aggressively
low-risk analytics APIs may lean fail-open to preserve usability
financial or fraud-sensitive paths may use more conservative protection
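
One way to make this choice explicit per route is a small wrapper around the limiter call. This is a sketch under stated assumptions: `LimiterUnavailable` and `check_with_policy` are hypothetical names, and `check` stands for whatever function consults the limiter backend.

```python
class LimiterUnavailable(Exception):
    """Raised when limiter state cannot be checked (hypothetical)."""


def check_with_policy(check, fail_open):
    """Run a limiter check; fall back to the route's policy on backend failure.

    fail_open=True  -> allow the request when the backend is unreachable
    fail_open=False -> reject the request when the backend is unreachable
    """
    try:
        return check()
    except LimiterUnavailable:
        return fail_open


def broken_backend():
    raise LimiterUnavailable("limiter backend timed out")


analytics_allowed = check_with_policy(broken_backend, fail_open=True)
login_allowed = check_with_policy(broken_backend, fail_open=False)
```

Keeping the policy as a per-route argument forces the decision to be made deliberately for each endpoint instead of being inherited from a global default.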

2.7.4 In-Memory Plus Sync Approaches

At very high scale, some systems use a hybrid strategy:

local in-memory counters for extremely fast coarse enforcement
periodic synchronization to shared backing state
eventual consistency with safety margins

This reduces Redis load and latency, but increases approximation error.

This is useful when exactness is less important than protection. For example, if the policy is "roughly 100 requests per second per client," being off by a small amount may be acceptable. If the policy is tied to billing or fraud, approximation may not be acceptable.
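
The approximation error is easy to see in a toy model. In this sketch a plain dict stands in for the shared backing state, and `HybridCounter` and `sync_every` are hypothetical names; a real system would sync on a timer and handle concurrency.

```python
class HybridCounter:
    """Local counter that flushes to shared state every `sync_every` hits."""

    def __init__(self, shared, key, limit, sync_every):
        self.shared = shared          # dict standing in for shared state
        self.key = key
        self.limit = limit
        self.sync_every = sync_every
        self.pending = 0              # local hits not yet flushed
        self.known_global = shared.get(key, 0)  # last global count we saw

    def allow(self):
        # Decide using a possibly stale view of the global count.
        if self.known_global + self.pending >= self.limit:
            return False
        self.pending += 1
        if self.pending >= self.sync_every:
            self.sync()
        return True

    def sync(self):
        self.shared[self.key] = self.shared.get(self.key, 0) + self.pending
        self.pending = 0
        self.known_global = self.shared[self.key]


shared = {}
a = HybridCounter(shared, "tenant-1", limit=10, sync_every=5)
b = HybridCounter(shared, "tenant-1", limit=10, sync_every=5)
admitted = sum(a.allow() for _ in range(5))
admitted += sum(b.allow() for _ in range(5))
admitted += sum(a.allow() for _ in range(5))  # a still believes the count is 5
```

Here two instances admit 15 requests against a limit of 10 because instance `a` acts on a stale global count. The overshoot grows with the number of instances and the sync interval, which is why such designs run with a safety margin below the true capacity.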

2.7.5 Time Window Challenges

Naive time-based limiting creates ugly edge effects:

fixed windows allow bursts at boundaries
sliding windows cost more to compute accurately
per-second precision increases storage and write amplification

This is why interview answers should mention time semantics, not just "store a counter in Redis."
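
The boundary effect can be reproduced directly. The sketch below compares a fixed-window counter with a sliding-window log for a limit of 100 requests per 60 seconds; the names `SlidingWindowLog` and `fixed_window_admitted` are illustrative only.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per admitted request within the last window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()

    def allow(self, now):
        # Evict timestamps that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False


def fixed_window_admitted(timestamps, limit, window_seconds):
    """Count how many requests a fixed-window counter would admit."""
    counts = {}
    admitted = 0
    for ts in timestamps:
        bucket = int(ts // window_seconds)
        if counts.get(bucket, 0) < limit:
            counts[bucket] = counts.get(bucket, 0) + 1
            admitted += 1
    return admitted


# 100 requests at second 59, then 100 more at second 60.
burst = [59.0] * 100 + [60.0] * 100
fixed_total = fixed_window_admitted(burst, limit=100, window_seconds=60)
sliding = SlidingWindowLog(limit=100, window_seconds=60)
sliding_total = sum(sliding.allow(ts) for ts in burst)
```

The fixed window admits all 200 requests within about one second, while the sliding log admits only 100. The cost is visible too: the log stores one timestamp per admitted request instead of a single integer per window.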

2.7.6 Multi-Region Scaling Challenges

Global rate limiting is hard because the system must choose between:

strict global accuracy
low latency
high availability

You usually cannot maximize all three.

Typical patterns:

region-local limits with some over-allocation buffer
global quotas with asynchronous reconciliation
per-region token allotments periodically rebalanced
edge-local abuse blocking plus regional business quotas

Example: a global SaaS API may enforce hard account quotas at a central control plane, but use region-local token buckets for fast data-plane enforcement.

2.8 Abuse Prevention

Rate limiting is one piece of a broader abuse prevention system.

2.8.1 Common Abuse Patterns

bots scraping public pages or APIs
credential stuffing using leaked usernames and passwords
brute force attempts against login or OTP flows
spam account creation
card testing against payment endpoints
low-and-slow abuse designed to stay below simple thresholds
DDoS traffic intended to exhaust network, compute, or application resources

2.8.2 IP-Based vs User-Based Limiting

Dimension	IP-based	User/API key based
Strength	useful before authentication	better for fairness after auth
Weakness	NAT and mobile networks cause false positives	attackers can create many accounts
Best use	edge defense, anonymous traffic	tenant quotas, authenticated APIs

Strong systems combine both. Anonymous traffic might be limited by IP and ASN reputation at the edge, while authenticated traffic is limited by account, token, route, and operation type.

2.8.3 Behavioral Rate Limiting

Simple request counting misses intent. Behavioral limiting looks at patterns like:

many login attempts across many usernames from one IP
one account creating resources unusually fast
abnormal error-rate patterns
suspicious path traversal or enumeration behavior
unusual geo or device distribution

This is closer to how large anti-abuse systems at companies like Cloudflare, Stripe, GitHub, and large identity platforms think. The goal is not only "count requests," but "detect harmful behavior patterns." This often combines rules, heuristics, ML scoring, reputation data, and challenge systems such as CAPTCHAs or proof-of-work style friction.
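
As a small illustration of the first pattern in the list above, the sketch below flags an IP that fails logins against many distinct usernames. The class name and threshold are hypothetical, and a real detector would also expire old observations and combine this with other signals.

```python
from collections import defaultdict


class CredentialStuffingDetector:
    """Tracks distinct usernames with failed logins per source IP."""

    def __init__(self, max_distinct_usernames):
        self.max_distinct = max_distinct_usernames
        self.failed = defaultdict(set)

    def record_failed_login(self, ip, username):
        self.failed[ip].add(username)

    def is_suspicious(self, ip):
        # Use .get so that lookups do not create empty entries.
        return len(self.failed.get(ip, ())) > self.max_distinct


detector = CredentialStuffingDetector(max_distinct_usernames=3)
for name in ["alice", "bob", "carol", "dave", "erin"]:
    detector.record_failed_login("203.0.113.7", name)
# One user mistyping a password several times is not the same pattern.
for _ in range(5):
    detector.record_failed_login("198.51.100.2", "frank")
```

Both IPs produced five failed logins, so a plain request counter treats them identically. Counting distinct usernames separates the two cases.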

2.8.4 Adaptive Throttling

Adaptive throttling changes policy based on current system health or attack conditions.

Examples:

tighten anonymous limits when error rate spikes
reduce expensive endpoint quotas when database saturation rises
require additional verification for suspicious login flows
deprioritize background traffic during an incident

This is more resilient than static limits because static thresholds are often wrong under dynamic load.
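
A minimal sketch of the idea, assuming the limiter can read a recent error rate and a database saturation figure. The function name and the thresholds (5 percent errors, 80 percent saturation, halving per signal) are invented for illustration and would need tuning against real traffic.

```python
def adaptive_limit(base_limit, error_rate, db_saturation, floor=1):
    """Scale a base limit down when health signals degrade."""
    factor = 1.0
    if error_rate > 0.05:       # hypothetical threshold
        factor *= 0.5
    if db_saturation > 0.8:     # hypothetical threshold
        factor *= 0.5
    # Never drop to zero: keep a trickle of traffic to detect recovery.
    return max(floor, int(base_limit * factor))
```

For a base limit of 100, this yields 100 when healthy, 50 when only the error rate is elevated, and 25 when both signals are bad.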

2.8.5 DDoS Mitigation Basics

Application teams usually do not stop large volumetric DDoS attacks alone. This is typically handled with layered defenses:

anycast edge networks
CDN absorption
WAF rules
SYN flood protections
upstream scrubbing centers
edge filtering by reputation and geography

The application layer still matters because sophisticated attacks often look like valid traffic but target expensive endpoints.

2.9 Production Placement and Best Practices

2.9.1 Edge vs Service-Level Limiting

Placement	Best for	Tradeoff
Edge	block abusive traffic cheaply and early	limited user context before auth
API gateway	enforce route and tenant quotas consistently	gateway can become a bottleneck
Service layer	protect expensive operations with business context	traffic already consumed upstream capacity

Best practice is layered enforcement, not choosing only one layer.

2.9.2 Common Mistakes

using only IP limits and harming legitimate users behind NAT
applying one global limit instead of per-route cost-aware limits
failing open on sensitive abuse endpoints without thinking through risk
forgetting internal callers and retry storms
storing high-cardinality limit keys without memory planning
ignoring user messaging and not returning retry metadata

A good production API usually returns clear headers or error fields such as remaining quota, reset hints, or backoff guidance.
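
A sketch of what such a rejection might carry. `Retry-After` and status 429 are standard HTTP; the `X-RateLimit-*` header names are a widespread convention rather than a standard, and the function name and body fields are assumptions for this example.

```python
import math


def too_many_requests(limit, retry_after_seconds):
    """Build the status, headers, and body for a rate-limited response."""
    # Round up so clients never retry slightly too early.
    retry_after = max(1, math.ceil(retry_after_seconds))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
    }
    body = {"error": "rate_limited", "retry_after_seconds": retry_after}
    return 429, headers, body


status, headers, body = too_many_requests(limit=100, retry_after_seconds=2.2)
```

Well-behaved clients can then back off for the advertised interval instead of retrying immediately and amplifying the overload.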

2.9.3 Real-World Examples

GitHub-style APIs expose rate limits to clients and distinguish authenticated versus unauthenticated usage.
Stripe-style systems protect sensitive payment and authentication flows with route-specific controls, idempotency, and anti-abuse heuristics.
Cloudflare-style edge systems combine rate limits, bot signals, reputation, and challenge mechanisms.
Large SaaS platforms often have tenant-level quotas to stop one customer from exhausting shared resources.

3. Monitoring

Monitoring is how a system notices that it is drifting away from healthy behavior.

3.1 Core Idea

Many people think monitoring exists to help debugging after something breaks. That is only part of the story. Monitoring exists to answer a broader question:

"Is the system still delivering the reliability properties we promised?"

That means monitoring is about:

early detection
trend awareness
capacity planning
incident response
reliability management
business risk visibility

If users discover outages before engineers do, monitoring is underperforming.

3.2 Reliability vs Visibility

Reliable systems are not systems with many dashboards. They are systems where signals are tied to real user impact.

Visibility means you can see internal measurements.

Reliability means you can use those measurements to keep the user experience within acceptable bounds.

This is why mature teams tie monitoring to service level objectives, or SLOs, not just random machine metrics.

3.3 Metrics

Metrics are numeric measurements collected over time. They are efficient to aggregate, cheap to alert on, and ideal for answering "what is happening at scale?"

3.3.1 Main Metric Types

Type	Meaning	Example
Counter	monotonically increasing count of events	requests_total, errors_total
Gauge	value that can go up or down	queue_depth, memory_usage
Histogram	distribution of observations across buckets	request_latency_ms

Counters are good for rates and totals. Gauges show current state. Histograms are essential for latency because averages hide tail pain.
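
The point about averages is easy to demonstrate with a toy bucketed histogram. This is a simplified sketch, not any particular metrics library: the `Histogram` class, its bucket bounds, and `quantile_upper_bound` are assumptions for the example.

```python
import bisect


class Histogram:
    """Counts observations per bucket; each bucket has an upper bound."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot = overflow

    def observe(self, value):
        # Index of the first bound that is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        target = q * sum(self.counts)
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")


# 98 fast requests and 2 very slow ones, in milliseconds.
latencies = [20] * 98 + [900] * 2
hist = Histogram([50, 100, 250, 500, 1000])
for value in latencies:
    hist.observe(value)
mean_ms = sum(latencies) / len(latencies)
p50_ms = hist.quantile_upper_bound(0.50)
p99_ms = hist.quantile_upper_bound(0.99)
```

The mean is 37.6 ms and the median bucket is 50 ms, both of which look healthy, while the p99 bucket is 1000 ms. The bucket bounds also limit precision: the histogram can only say the p99 is at most 1000 ms, not that it is exactly 900 ms.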

3.3.2 RED and USE

Two famous heuristics show up often in interviews.

RED is useful for services:

Rate: how many requests are we serving?
Errors: how many are failing?
Duration: how long are they taking?

USE is useful for resources:

Utilization: how busy is the resource?
Saturation: how much queued or waiting work exists?
Errors: is the resource itself failing?

Strong teams use both. RED tells you user-facing service behavior. USE tells you whether underlying resources are the bottleneck.

3.3.3 Latency, Traffic, Errors, Saturation

These four ideas should always be mentally connected:

traffic increases demand
latency reveals response time degradation
errors reveal outright failure
saturation reveals how close you are to collapse

Saturation is often the most neglected signal. CPU may be only 60 percent busy while database connection pools, thread pools, disk queues, or downstream concurrency limits are already maxed out.

3.3.4 Cardinality Problems

Metrics systems love aggregation and hate unbounded labels.

Bad idea:

label every request by user_id, session_id, order_id, or full URL path

Why this breaks:

memory usage explodes
query performance degrades
storage cost rises quickly
alerting becomes unstable

This is a classic production lesson. Metrics are not logs. They should summarize classes of behavior, not store every individual event identity.
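
One common mitigation is to normalize the route before using it as a label. The sketch below assumes identifiers appear as numeric or UUID-like path segments; the regular expression and the `:id` placeholder are illustrative choices, and a framework's own route template is usually a better source when available.

```python
import re

# Purely numeric segments, or hex-and-dash strings of 32 to 36 characters.
_ID_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F-]{32,36})$")


def route_label(path):
    """Collapse per-entity identifiers so the label set stays bounded."""
    path = path.split("?", 1)[0]  # query strings never belong in labels
    parts = [":id" if _ID_SEGMENT.match(p) else p for p in path.split("/")]
    return "/".join(parts)


label = route_label("/users/12345/orders/987?expand=items")
```

Every user and order now maps to the single label `/users/:id/orders/:id`, so the number of time series depends on the number of routes rather than the number of customers.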

3.3.5 How Metrics Drive Alerting

Common alerting patterns:

error rate above threshold for a sustained period
p99 latency above SLO target
success rate below objective
queue depth rising continuously
worker backlog age exceeding target
database saturation rising while throughput plateaus

At companies like Google and many SaaS teams influenced by SRE practices, the strongest alerts are symptom-based. That means they alert on user-visible pain, not just internal weirdness.

3.4 Dashboards

Dashboards are human-readable views of system state. They exist to help operators build situational awareness quickly.

3.4.1 What a Good Dashboard Does

shows service health at a glance
connects symptoms to likely causes
reflects ownership boundaries
supports incident triage under pressure
avoids burying the important signal under decorative noise

3.4.2 Good vs Bad Dashboard Design

Good pattern	Bad pattern
starts with SLO and user-impact indicators	starts with dozens of host-level charts
shows request rate, errors, latency, saturation together	shows disconnected charts without context
broken down by region, endpoint, and dependency	mixes unrelated systems on one screen
has a clear owner and purpose	exists because someone thought dashboards are good

3.4.3 Useful Dashboard Layers

executive or SLO dashboard: is the service meeting promises?
service owner dashboard: what part of the service is degrading?
dependency dashboard: is the database, cache, or queue causing the issue?
operational drill-down: instance or pod level details for active debugging

Interview insight: good dashboards are navigational aids, not data graveyards.

3.5 Alerting

Alerting is where many organizations accidentally create operational self-harm.

3.5.1 What Makes an Alert Good

A good alert is:

actionable
tied to impact or a credible precursor to impact
routed to the right owner
urgent enough to deserve interruption
low-noise enough that people still trust it

If an alert does not change what an engineer should do, it is probably not a good page.

3.5.2 Alert Fatigue

Noisy alerts train engineers to ignore the monitoring system. That is dangerous because the real outage then arrives on a channel people have learned to distrust.

Common causes:

thresholds set too close to normal variability
paging on transient spikes instead of sustained conditions
duplicate alerts from multiple layers for the same symptom
paging on causes rather than symptoms without enough confidence
alerts with no runbook or clear ownership

3.5.3 Thresholds vs Anomaly Detection

Approach	Strength	Weakness
Static thresholds	simple and explainable	brittle under seasonal traffic patterns
Dynamic baselines / anomaly detection	adapts to changing patterns	can be opaque and noisy if poorly tuned

Most mature systems use a mix. Critical SLO breaches often use straightforward thresholds. Supporting signals may use anomaly detection.

3.5.4 Paging vs Non-Paging Alerts

paging alerts wake humans because user impact is happening or imminent
non-paging alerts create tickets, Slack notifications, or backlog items

Not every bad graph deserves a pager. Use severity tiers such as:

P0: severe widespread outage or data risk
P1: major feature degradation with high customer impact
P2: limited degradation or internal operational issue

3.5.5 How Good Teams Reduce Noise

alert on burn rate against SLOs rather than raw single-sample spikes
deduplicate related alerts
suppress child alerts during known parent outages
require every alert to have an owner and action
review and prune alerts after incidents

Netflix, Google-style SRE teams, and mature SaaS platforms commonly treat alert quality as an engineering problem, not a monitoring config problem.
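
Burn rate is the observed error rate divided by the error budget, where the budget is one minus the SLO target. A rough sketch of a two-window check follows; the function names are hypothetical, and the paging threshold is a parameter because the right value depends on the SLO period and how much budget a team is willing to spend before being paged.

```python
def burn_rate(errors, requests, slo_target):
    """How many times faster than 'exactly on budget' errors are accruing."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(short_window, long_window, slo_target, threshold):
    """Page only if both a short and a long window exceed the threshold.

    Each window is an (errors, requests) pair.
    """
    return (burn_rate(*short_window, slo_target) >= threshold
            and burn_rate(*long_window, slo_target) >= threshold)


# 99.9% SLO: the budget is 0.1% of requests.
sustained = should_page((200, 10_000), (1_500, 100_000), 0.999, threshold=10)
transient = should_page((200, 10_000), (100, 100_000), 0.999, threshold=10)
```

In the sustained case both windows burn budget at 15 to 20 times the allowed pace, so the alert fires. In the transient case the short window spikes but the long window sits near 1, so nobody is woken up for a blip.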

3.6 Uptime Checks

Uptime checks are active probes that ask, "Can I reach this system and get the expected result?"

3.6.1 Synthetic Monitoring

Synthetic monitoring sends artificial requests on a schedule from one or more regions.

Examples:

GET health endpoint every 30 seconds
create a lightweight test account and perform a login flow
simulate checkout without committing payment
fetch an API response and verify critical fields

3.6.2 Health Endpoint vs Real User Monitoring

Method	What it tells you	Limitation
Health endpoint	service says it is alive or ready	may not reflect real user experience
Synthetic monitoring	external path works for scripted flows	may miss edge cases and customer diversity
Real user monitoring	what actual users are experiencing	harder to control and aggregate cleanly

3.6.3 Global Checks and Detection Latency

Checks from many regions matter because an outage may be regional, DNS-related, CDN-related, or ISP-specific.

Tradeoff:

more frequent checks reduce detection latency
but higher frequency increases probe noise and cost

3.6.4 "System Is Up" vs "System Is Usable"

A service can return HTTP 200 and still be functionally broken.

Examples:

login succeeds but dashboard data never loads
API responds quickly with empty or stale data because a dependency is degraded
health endpoint passes while database writes silently fail

This is why mature uptime programs test critical user journeys, not just TCP reachability.
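
A synthetic check can encode that distinction by asserting on content and latency, not only on the status code. The sketch below evaluates an already-fetched response so it stays independent of any HTTP client; the function name, field names, and latency budget are assumptions.

```python
def evaluate_probe(status_code, body, required_fields, latency_ms, max_latency_ms):
    """Return a list of problems; an empty list means the probe passed."""
    problems = []
    if status_code != 200:
        problems.append(f"unexpected status {status_code}")
    for field in required_fields:
        # A present-but-empty field still counts as a failure.
        if body.get(field) in (None, "", [], {}):
            problems.append(f"missing or empty field: {field}")
    if latency_ms > max_latency_ms:
        problems.append(f"latency {latency_ms} ms exceeds {max_latency_ms} ms")
    return problems


# HTTP 200 and fast, but the dashboard payload is empty.
problems = evaluate_probe(
    status_code=200,
    body={"user": "probe-account", "items": []},
    required_fields=["user", "items"],
    latency_ms=120,
    max_latency_ms=500,
)
```

This probe fails even though the status is 200, which is exactly the "up but not usable" case a bare reachability check would miss.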

4. Logging

Logs are the detailed event record of what the system believed and did at specific moments.

4.1 Core Idea

If metrics tell you that something is wrong, logs often tell you what happened in enough detail to investigate.

They are the black box recorder of distributed systems.

Logs are especially important because production incidents often involve context that metrics intentionally discard:

exact error messages
request payload characteristics
code path decisions
retry behavior
dependency-specific failure reasons
tenant- or endpoint-specific anomalies

4.2 Logs vs Metrics

Metrics compress behavior into numeric summaries. Logs preserve detailed event context.

That difference is why you need both.

If p99 latency spikes, metrics show the spike. Logs may reveal that requests with a certain downstream provider, tenant, or payload size were timing out.

4.3 Centralized Logs

4.3.1 Why Centralization Is Necessary

In modern distributed systems, requests touch many ephemeral machines and containers. Local log files are not enough because:

instances autoscale and disappear
a single request crosses many services
incidents require searching across time and systems
security and compliance often require retention and auditability

So logs are shipped to centralized systems such as Elasticsearch-based stacks, Loki, Splunk, Datadog, or vendor-managed observability backends.

4.3.2 Structured Logging vs Plain Text

Structured logs use fields, usually JSON or key-value format, rather than free-form text.

Example fields:

timestamp
level
service
environment
region
request_id
trace_id
route
user_id or tenant_id if safe and allowed
error_code
latency_ms

Structured logging wins because it is searchable and aggregatable. Plain text is easy for humans to read locally, but much harder to query reliably at scale.
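
With Python's standard `logging` module, one way to emit such events is a custom formatter that serializes each record to JSON. This is a minimal sketch: the `JsonFormatter` class and the convention of passing structured data under an `extra={"fields": ...}` key are assumptions of this example, not a built-in feature.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(
                record.created, timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Structured fields passed via logger.error(..., extra={"fields": {...}})
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout-api"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.error(
    "payment provider timeout",
    extra={"fields": {"trace_id": "abc123", "route": "/checkout", "latency_ms": 2003}},
)
```

Because every event is a flat JSON object, a log backend can filter on `trace_id` or aggregate on `route` without fragile text parsing.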

4.3.3 Indexing and Searchability

Central log systems typically parse incoming logs, extract fields, index selected attributes, and support search by time, service, correlation ID, severity, or structured fields.

The core tradeoff is query flexibility versus cost.

Full indexing of all fields can become very expensive at scale. Mature teams choose carefully which fields deserve indexing and which belong only in raw archived events.

4.3.4 Retention Policies

Log retention is both a cost and compliance topic.

Hot searchable storage is expensive. Cold archival storage is cheaper but slower to query.

Many systems use tiers:

short retention for high-volume debug logs
longer retention for important application and audit logs
strict redaction or encryption for sensitive data

4.3.5 Log Volume Explosion

Logging scales badly if left unmanaged.

Common failure modes:

every retry logs a full stack trace
debug logging is left on in production
large request or response payloads are logged indiscriminately
high-cardinality fields create indexing blowups

During incidents, bad logging can worsen the outage by saturating disk, network, or the logging backend itself.

4.4 Logging in Distributed Systems

4.4.1 Correlation IDs and Request IDs

A correlation ID is a shared identifier attached to all logs related to one logical request or workflow.

Without this, debugging a request across services becomes guesswork.

Typical flow:

Gateway receives request.
Gateway generates or propagates request ID and trace ID.
Each downstream service logs those IDs.
Operators search all logs for that ID.
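
Inside a single Python service, the flow above might look like the following sketch. It uses `contextvars` so the ID follows the request without being passed through every function call. The `X-Request-ID` header is a common convention rather than a standard, and the helper names are hypothetical.

```python
import contextvars
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_headers):
    """Reuse the caller's ID if present, otherwise generate a new one."""
    request_id = incoming_headers.get("X-Request-ID") or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


def outgoing_headers():
    """Headers to attach to every downstream call made for this request."""
    return {"X-Request-ID": request_id_var.get()}


def log_event(message, **fields):
    """Build a log event that always carries the current request ID."""
    return {"request_id": request_id_var.get(), "message": message, **fields}


start_request({"X-Request-ID": "req-42"})
event = log_event("inventory reserved", sku="A-100")
downstream = outgoing_headers()
```

The important property is that propagation is automatic: a developer adding a new log line or downstream call gets the ID without having to remember it.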

4.4.2 Cross-Service Traceability

Suppose a checkout request hits:

API gateway
cart service
inventory service
payment service
notification service

If the payment call fails only for certain retries and inventory had already reserved stock, engineers need the end-to-end sequence. Shared identifiers make that investigation possible.

4.4.3 Ordering Challenges

Distributed logs are not perfectly ordered because:

clocks are not identical
network delays vary
logs are buffered and shipped asynchronously
retries create overlapping attempts

This is why timestamps alone are insufficient. Correlation IDs and trace IDs are mandatory for serious debugging.

4.4.4 Real-World Debugging Example

Imagine users report intermittent checkout failures.

Metrics show:

error rate spike in checkout API
latency spike in payment dependency

Logs reveal:

payment provider timeout after 2 seconds
retry policy triggered twice
inventory reservation succeeded but compensation job was delayed

Tracing reveals:

most latency sits in one external payment span

Together, these signals tell a coherent story. Any one signal alone would be incomplete.

4.5 Logging Best Practices and Mistakes

Best practices:

use structured logs
include request and trace identifiers
log meaningful business events, not only code exceptions
redact secrets and personal data
control log levels carefully
sample noisy repetitive logs when appropriate

Common mistakes:

logging passwords, tokens, or full card data
logging too little context to debug anything
logging so much detail that the platform becomes unusable
using logs as a substitute for metrics
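
Redaction is easiest to enforce when it happens in one place, just before an event is serialized. The sketch below is a rough illustration only: the key list, the digit-run pattern, and the `redact` helper are assumptions for the example, and a real system would need a reviewed list of sensitive fields.

```python
import re

# Hypothetical field names treated as sensitive in this example.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "cvv"}
# Crude heuristic: any standalone run of 13 to 19 digits looks like a card number.
CARD_PATTERN = re.compile(r"\b\d{13,19}\b")


def redact(value, key=None):
    """Return a copy of `value` with sensitive fields and card-like numbers masked."""
    if isinstance(key, str) and key.lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return CARD_PATTERN.sub("[REDACTED_PAN]", value)
    return value


event = {
    "user": "u1",
    "password": "hunter2",
    "payment": {
        "card_number": "4111111111111111",
        "note": "card 4111111111111111 declined",
    },
}
safe_event = redact(event)
```

The function returns a new structure rather than mutating the input, so the application can keep using the original event while only `safe_event` reaches the log pipeline.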

5. Tracing

Tracing maps the journey of a request through a distributed system.

5.1 Core Idea

Logs tell you what happened in individual components. Metrics tell you the system-level shape of behavior. Tracing tells you where time and failure accumulated along one request path.

This matters because modern backends are not linear. One user request may trigger multiple services, caches, queues, and external APIs. If the request is slow, the critical question becomes:

"Which part of the path consumed the time?"

5.2 Distributed Tracing Fundamentals

5.2.1 Trace and Span

a trace represents one end-to-end request or workflow
a span represents one unit of work inside that trace

Each span usually includes:

start time
end time or duration
operation name
service name
status or error info
parent span reference
attributes like route, region, DB statement class, retry count

5.2.2 Parent-Child Relationships

Tracing forms a tree or DAG-like timeline.

Example:

root span: incoming API request
child span: call auth service
child span: call order service
child span: call payment service
nested child span: payment service calls external processor

This lets engineers see both total latency and where it was spent.

5.2.3 Latency Breakdown and Bottleneck Identification

Tracing is exceptionally good at revealing:

which dependency is slow
where parallelism is working or not working
which retries amplified latency
which branch of a request fan-out dominates tail latency
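
A small calculation makes the idea concrete. The sketch below models spans as plain records and computes each span's self time, meaning the time not covered by its direct children. The `Span` fields, the `self_times` helper, and the sample timings are assumptions for illustration, and span names are assumed unique within the trace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Return {span name: time not covered by any direct child span}."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    result = {}
    for s in spans:
        covered, cursor = 0, s.start_ms
        # Merge child intervals so parallel children are not double counted.
        for c in sorted(children.get(s.span_id, []), key=lambda c: c.start_ms):
            lo, hi = max(c.start_ms, cursor), min(c.end_ms, s.end_ms)
            if hi > lo:
                covered += hi - lo
                cursor = hi
        result[s.name] = s.duration_ms - covered
    return result


trace = [
    Span("a", None, "POST /checkout", 0, 2800),
    Span("b", "a", "auth", 10, 60),
    Span("c", "a", "orders", 70, 2790),
    Span("d", "c", "inventory.reserve", 80, 300),
    Span("e", "c", "payments.charge", 310, 2710),
]
times = self_times(trace)
bottleneck = max(times, key=times.get)
```

With these sample timings the root span lasts 2800 ms but spends almost none of it in its own code, while `payments.charge` accounts for 2400 ms, which is the kind of answer a trace view gives at a glance.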

5.3 Request Flow Example

sequenceDiagram
	participant Client
	participant Gateway
	participant Orders
	participant Inventory
	participant Payments
	participant DB

	Client->>Gateway: POST /checkout
	Gateway->>Orders: create order span
	Orders->>Inventory: reserve stock span
	Inventory->>DB: update inventory span
	DB-->>Inventory: success
	Inventory-->>Orders: reserved
	Orders->>Payments: charge card span
	Payments->>DB: load payment state span
	DB-->>Payments: payment state
	Payments-->>Orders: timeout / slow response
	Orders-->>Gateway: partial failure
	Gateway-->>Client: 503 or retryable error

With tracing, the system can show that total request time was, for example, 2.8 seconds, of which 2.4 seconds came from the payment span.

5.4 Trace Propagation

Tracing only works if context is propagated across service boundaries.

Typical behavior:

The entry point generates a trace ID and root span.
Outbound requests carry tracing headers.
Downstream services create child spans linked to the parent.
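Async workflows such as queue consumers and background jobs carry the context in message metadata and either continue the trace or start a new one linked to it.

If a service fails to propagate headers, the trace splits into disconnected fragments and the end-to-end view is lost. This is one of the most common real-world tracing failures.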
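
As one concrete header format, the W3C Trace Context standard defines a `traceparent` header of the form `00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`. The sketch below shows the propagation steps by hand, assuming incoming headers arrive as a dict with lowercase keys; the helper names are hypothetical, and in practice an instrumentation library would normally handle this.

```python
import re
import secrets

# Version 00 of the W3C traceparent format: lowercase hex fields only.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_span(incoming_headers):
    """Return (trace_id, parent_span_id, span_id, flags) for the local span."""
    match = TRACEPARENT_RE.match(incoming_headers.get("traceparent", ""))
    # All-zero trace or parent IDs are invalid and treated as missing.
    if match and match.group(1) != "0" * 32 and match.group(2) != "0" * 16:
        trace_id, parent_id, flags = match.groups()
    else:
        # No usable context: this service becomes the root of a new trace.
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    span_id = secrets.token_hex(8)  # 16 hex characters
    return trace_id, parent_id, span_id, flags


def outbound_headers(trace_id, span_id, flags):
    # The local span becomes the parent of whatever the callee creates.
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
trace_id, parent_id, span_id, flags = start_span(incoming)
headers_for_next_hop = outbound_headers(trace_id, span_id, flags)
```

The trace ID stays constant across hops while the span ID changes at each one. A service that drops the header forces the next service down the `else` branch, which is exactly how a single request ends up as several unrelated traces.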

5.5 OpenTelemetry Conceptually

OpenTelemetry is a widely used vendor-neutral observability framework.

At a high level, it provides:

APIs and SDKs for instrumentation
conventions for traces, metrics, and logs
context propagation support
exporters to different backends

Why it matters:

teams do not want instrumentation tied forever to one vendor
standard context propagation improves interoperability
shared semantic conventions reduce chaos across services

High-level flow:

application emits spans and metrics through OpenTelemetry SDKs
local agent or collector batches and processes telemetry
exporter sends data to systems such as Jaeger, Tempo, Prometheus-compatible pipelines, Datadog, New Relic, Honeycomb, or cloud-native observability backends

5.6 Tracing in Production

Tracing every request at full detail can be expensive.

Real systems often use:

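head-based sampling: decide at the start of the request, before the outcome is known, whether to keep the trace
tail-based sampling: decide after the trace completes, keeping traces that match outcomes like errors or high latency
adaptive sampling: adjust sampling rates dynamically to retain more unusual or important traces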

Tradeoff:

more tracing detail improves debugging
but increases storage, network overhead, and backend cost

Companies handling very high throughput often sample heavily for normal traffic while keeping full traces for errors or important workflows.
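
The two main strategies can be contrasted in a few lines. This is a simplified sketch rather than how any particular tracing backend implements sampling; the function names, the 1000 ms slow threshold, and the 1% baseline rate are illustrative assumptions.

```python
def head_sample(trace_id, rate):
    """Decide up front, using only the trace ID.

    The decision is deterministic, so every service that sees the same
    trace ID reaches the same answer without coordination.
    """
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000  # value in [0, 1)
    return bucket < rate


def tail_sample(trace_id, had_error, duration_ms, slow_ms=1000, baseline_rate=0.01):
    """Decide after the trace finishes: always keep errors and slow traces."""
    if had_error or duration_ms >= slow_ms:
        return True
    # Keep a small share of normal traffic as a baseline for comparison.
    return head_sample(trace_id, baseline_rate)


normal = tail_sample("f" * 32, had_error=False, duration_ms=50)
slow = tail_sample("f" * 32, had_error=False, duration_ms=2400)
failed = tail_sample("f" * 32, had_error=True, duration_ms=5)
```

Head-based sampling is cheap because unsampled requests never produce span data, but it cannot know which traces will turn out to be interesting. Tail-based sampling can, at the cost of buffering every span until the trace completes.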

5.7 Common Mistakes

not propagating trace context across all services
naming spans inconsistently
tracing everything but not correlating to logs and metrics
hiding expensive downstream calls inside uninstrumented libraries
ignoring async boundaries such as queue consumers and background jobs

6. Observability

Observability is the ability to understand the internal state of a system by examining its outputs.

6.1 Core Idea

Monitoring asks, "Did something go wrong?"

Observability asks, "Can we explain why this behavior is happening, even if we did not predict this exact failure mode in advance?"

That difference matters a lot in distributed systems because not every outage matches a prewritten alert rule.

6.2 The Three Pillars

The common mental model is logs, metrics, and traces. The phrase is a simplification, but still useful.

Signal	Best question it answers	Typical strengths	Typical weakness
Metrics	what is happening broadly?	fast aggregation, dashboards, alerts	limited detail
Logs	why did this specific event happen?	rich context, debugging detail	high volume and cost
Traces	where did time or failure occur in the path?	request journey and dependency map	sampling and instrumentation gaps

The simplest interview-friendly summary is:

metrics tell you what is happening
logs tell you why it is happening
traces tell you where it is happening

That summary is not perfect, since a log line can also show what happened and a metric can help localize where, but it is a useful starting point.

6.3 Why Observability Exists

Modern systems are:

distributed
dynamic
asynchronous
partially failing
constantly changing through deploys and config changes

Because of that, operators cannot rely on static mental models alone. They need live evidence that connects symptoms to root causes.

6.4 Observability Pipeline Architecture

flowchart LR
	App1[Service A] -->|metrics| Col[OpenTelemetry Collector / Agents]
	App1 -->|logs| Col
	App1 -->|traces| Col
	App2[Service B] -->|metrics| Col
	App2 -->|logs| Col
	App2 -->|traces| Col
	GW[Gateway / Edge] -->|access logs, RED metrics, root traces| Col

	Col --> M[Metrics Store]
	Col --> L[Log Store / Search]
	Col --> T[Trace Store]

	M --> Dash[Dashboards]
	M --> Alert[Alerting Engine]
	L --> Investigate[Incident Investigation]
	T --> Investigate
	Dash --> OnCall[On-call Engineer]
	Alert --> OnCall

This pipeline is important because observability is not just instrumentation inside code. It also includes collection, transport, storage, indexing, alerting, retention, access control, and operational cost management.

6.5 Debugging in Production

6.5.1 Typical Incident Flow

flowchart TD
	A[Alert fires or users report issue] --> B[Check SLO / impact dashboard]
	B --> C[Identify affected service, region, route, tenant]
	C --> D[Inspect metrics for latency, errors, saturation]
	D --> E[Open traces for slow or failed requests]
	E --> F[Search logs by request or trace ID]
	F --> G[Confirm root cause hypothesis]
	G --> H[Mitigate: rollback, failover, throttle, disable feature, scale]
	H --> I[Verify recovery with dashboards and synthetic checks]

6.5.2 How Observability Reduces MTTR

MTTR (mean time to recovery) improves when engineers can move quickly from symptom to cause:

alert points to user-visible degradation
dashboard localizes which dimension is affected
trace identifies slow dependency or failing branch
logs reveal exact failure details
health and deployment metadata show whether a rollout caused the issue

Without observability, incident response becomes guesswork, which increases both outage duration and risk of bad mitigation.

6.6 Practical Real-World Examples

Google-style SRE thinking emphasizes SLIs, SLOs, and high-signal alerts tied to user impact.
Netflix-style operations emphasize rich telemetry, dependency visibility, and resilience under partial failure.
Uber-style microservice environments rely heavily on traceability and service ownership because requests span many internal systems.
Stripe-style systems combine deep request context, idempotency, tracing, and auditability for high-stakes financial workflows.
GitHub-style large API platforms need careful rate-limit observability, route-level error tracking, and tenant-aware operational debugging.

6.7 Common Mistakes

collecting lots of telemetry without clear operational questions
not connecting telemetry to service ownership
keeping signals in separate tools without correlation IDs
instrumenting only the application and ignoring proxies, queues, workers, and databases
forgetting cost: observability can become a major platform expense

7. Health Checks

Health checks determine whether a system component should be considered safe to use.

7.1 Core Idea

In distributed systems, failure is often partial. A process may still be running but unable to serve traffic correctly because:

it cannot reach the database
it is still warming caches
it is deadlocked internally
it is overloaded and timing out all work
a critical dependency is unavailable

Health checks exist so orchestration systems, service meshes, and load balancers can make smarter routing and restart decisions.

7.2 Liveness Checks

Liveness answers:

"Is this process alive, or is it stuck badly enough that restart is reasonable?"

7.2.1 What It Detects

deadlocks
event loop stalls
process hangs
unrecoverable internal corruption

In Kubernetes-style environments, a failing liveness check usually causes the orchestrator to restart the container.

7.2.2 What It Should Not Do

Liveness should not depend on every downstream dependency. If your database has a brief hiccup and every pod fails liveness, the orchestrator may restart healthy application processes and make the incident worse.

That is a classic mistake.

7.3 Readiness Checks

Readiness answers:

"Can this instance safely receive traffic right now?"

7.3.1 Typical Readiness Conditions

startup complete
configuration loaded
required internal caches warmed
dependency connections established if truly necessary
worker pool or listener ready

If readiness fails, the instance should be removed from load balancer rotation but not necessarily restarted.

7.3.2 Startup and Warm-Up

Readiness is critical during deploys because an instance may be alive but not ready.

Examples:

JVM application started but still warming caches
service needs to load a large model into memory
background migration step not complete
thread pools not yet initialized

Without readiness checks, load balancers send traffic too early and users see transient deployment failures.

7.4 Liveness vs Readiness

Question	Liveness	Readiness
Meaning	should this process be restarted?	should this instance receive traffic?
Typical action	restart container or process	stop routing traffic to instance
Dependency sensitivity	low	moderate, only for critical serving dependencies
Common misuse	tying to downstream outage	making checks so strict that capacity flaps

7.5 Service Health and Dependency Awareness

Real systems are not simply healthy or unhealthy. They are often degraded.

Examples:

recommendation service down, checkout still works
write path degraded, read path fine
one region unhealthy, others normal
one dependency timing out but cached responses still serve users

This is why health models should support partial degradation, not only binary status.

7.5.1 Dependency-Aware Health Checks

Sometimes readiness should include critical dependencies. The key word is critical.

If a service cannot possibly serve correct traffic without the primary database, then readiness may reasonably fail when the DB is unreachable.

But if a noncritical analytics sink is down, failing readiness would be wrong. That would convert a partial outage into total self-inflicted unavailability.
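
The distinction between critical and optional dependencies can be encoded directly in the health model. The sketch below is a minimal illustration; the `HealthState` class, the dependency names, and the status strings are assumptions for the example, and the status codes follow the common convention of 200 for healthy and 503 for not ready.

```python
from dataclasses import dataclass, field


@dataclass
class HealthState:
    startup_complete: bool = False
    # Maps dependency name to whether it is currently reachable.
    critical: dict = field(default_factory=dict)
    optional: dict = field(default_factory=dict)

    def liveness(self):
        # Deliberately ignores dependencies: reaching this code at all shows
        # the process can still accept and answer a request.
        return 200, {"status": "alive"}

    def readiness(self):
        failed_critical = sorted(n for n, ok in self.critical.items() if not ok)
        failed_optional = sorted(n for n, ok in self.optional.items() if not ok)
        if not self.startup_complete or failed_critical:
            return 503, {"status": "not_ready", "failed": failed_critical}
        if failed_optional:
            # Keep serving traffic, but surface the problem to operators.
            return 200, {"status": "degraded", "failed": failed_optional}
        return 200, {"status": "ready", "failed": []}


state = HealthState(
    startup_complete=True,
    critical={"primary_db": True},
    optional={"analytics_sink": False},
)
status_code, body = state.readiness()
```

Here the analytics sink is down, yet the instance stays in rotation and reports `degraded`. Only a failed critical dependency or incomplete startup produces a 503.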

7.5.2 Cascading Failure Risk

Naive health checks can create cascading failures:

Database latency rises.
Application health endpoint checks DB synchronously.
Health checks time out.
Load balancer removes many instances.
Remaining instances take more load.
System collapses faster.

This is a common interview discussion because it shows whether you understand feedback loops.

7.6 Failure Detection Flow

sequenceDiagram
	participant LB as Load Balancer
	participant Pod as Service Instance
	participant HC as Health Endpoint
	participant App as App Logic
	participant DB as Critical Dependency

	LB->>Pod: readiness probe
	Pod->>HC: evaluate readiness
	HC->>App: check internal serving state
	App->>DB: lightweight critical dependency check
	DB-->>App: timeout / unhealthy
	App-->>HC: not ready
	HC-->>Pod: readiness=false
	Pod-->>LB: remove from rotation

This sequence is useful only if the dependency check is carefully scoped. Do not perform heavy downstream checks on every probe.
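
One simple way to scope the check is to cache its result so that frequent probes do not each reach the dependency. The sketch below assumes a hypothetical `CachedCheck` wrapper with an illustrative 5-second TTL; a production version would more likely refresh in a background task with its own timeout so a probe never blocks on the dependency.

```python
import time


class CachedCheck:
    """Run an expensive check at most once per `ttl_seconds`; probes read the cache."""

    def __init__(self, check, ttl_seconds=5.0, clock=time.monotonic):
        self.check = check
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._checked_at = None
        self._healthy = False

    def healthy(self):
        now = self.clock()
        if self._checked_at is None or now - self._checked_at >= self.ttl_seconds:
            try:
                self._healthy = bool(self.check())
            except Exception:
                # A check that raises (for example, on timeout) counts as unhealthy.
                self._healthy = False
            self._checked_at = now
        return self._healthy


calls = []
db_check = CachedCheck(lambda: calls.append("ping") or True, ttl_seconds=5.0)
results = [db_check.healthy() for _ in range(100)]  # 100 probes in quick succession
```

A hundred probes in quick succession result in a single call to the dependency, so probe traffic stays roughly constant no matter how many load balancers or orchestrator components are polling.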

7.7 Best Practices

keep liveness simple and focused on stuck-process detection
use readiness to gate traffic during startup and critical dependency loss
support degraded modes where possible
avoid expensive checks in hot probe paths
add jitter and sensible intervals to avoid probe storms
expose health state to operators, not just orchestration systems

7.8 Common Mistakes

using identical logic for liveness and readiness
making readiness depend on optional systems
making health endpoints themselves expensive and failure-prone
restarting instances during dependency outages when draining would be enough
forgetting that health checks happen at scale across many pods simultaneously

8. System Design Integration: How These Pieces Work Together

This section is the most important operationally because real systems do not run rate limiting, monitoring, logging, tracing, and health checks in isolation.

They form a coordinated control plane around request processing.

8.1 End-to-End Architecture View

flowchart LR
	Client[Client / SDK / Browser] --> Edge[CDN / WAF / Edge Proxy]
	Edge --> Gateway[API Gateway / Ingress]
	Gateway --> LB[Internal Load Balancer / Service Mesh]
	LB --> SvcA[Service A]
	LB --> SvcB[Service B]
	SvcA --> DB[(Database)]
	SvcA --> Cache[(Cache)]
	SvcB --> MQ[(Queue)]
	SvcB --> Ext[External API]

	Edge -. edge rate limiting, bot defense, access logs .-> Obs[Observability Pipeline]
	Gateway -. auth, quotas, request metrics, root trace, request ID .-> Obs
	SvcA -. app logs, spans, business metrics .-> Obs
	SvcB -. app logs, spans, worker metrics .-> Obs
	DB -. DB metrics, slow query logs .-> Obs
	Cache -. hit rate, memory, evictions .-> Obs

	HC[Health Probes] -. readiness / liveness .-> LB
	HC -. pod status .-> SvcA
	HC -. pod status .-> SvcB

	Obs --> MStore[Metrics Store]
	Obs --> LStore[Log Store]
	Obs --> TStore[Trace Store]
	MStore --> Dash[Dashboards + Alerts]
	LStore --> OnCall[Incident Investigation]
	TStore --> OnCall
	Dash --> OnCall

8.2 Where Each System Sits

Concern	Typical placement	Why there
Edge rate limiting	CDN, WAF, ingress	cheapest place to block obvious abuse
Auth-aware quotas	API gateway	consistent policy before services execute
Expensive-operation limits	service layer	requires business context
Metrics	every layer, especially gateway, services, dependencies	high-level health and alerting
Logging	gateway, services, workers, dependencies	forensic detail and debugging
Tracing	entry points and all RPC boundaries	end-to-end latency and failure mapping
Health checks	instances, orchestrator, load balancer	keep broken endpoints out of traffic paths

8.3 Example Request Journey

Consider a request to create a payment or place an order:

The client request reaches the edge.
Edge systems apply bot screening, IP reputation checks, and coarse anonymous rate limiting.
The gateway authenticates the request, generates or propagates a trace ID, emits access metrics, and enforces account or route quotas.
The load balancer sends traffic only to ready instances.
The service performs business logic and may apply stricter limits for expensive or risky operations.
Downstream database calls, queue writes, and provider calls emit spans and logs.
Metrics summarize overall behavior, logs capture contextual events, and traces map the request path.
If failures appear, alerts fire from user-impacting metrics, and engineers pivot into traces and logs.

8.4 What Breaks at Scale

As systems scale, each protection mechanism develops new challenges.

8.4.1 Rate Limiting at Scale

limiter storage hot spots
global versus regional consistency tradeoffs
fairness across noisy tenants
false positives under NAT or proxy aggregation

8.4.2 Monitoring at Scale

metric cardinality explosion
alert floods during incidents
dashboards nobody trusts
missing saturation signals

8.4.3 Logging at Scale

ingest cost explosion
slow search in huge indexes
secret leakage risk
logging platform overload during outages

8.4.4 Tracing at Scale

storage cost of unsampled traces
missing context propagation
high overhead on hot paths
incomplete visibility for async workflows

8.4.5 Health Checking at Scale

probe storms against large fleets
flapping readiness on unstable dependencies
self-inflicted restarts from aggressive liveness policies

8.5 Real Production Patterns

8.5.1 Google-Style Thinking

Google SRE literature strongly shaped industry thinking here:

tie monitoring to user-facing indicators
define service level objectives
alert on how fast the error budget is burning, not on every noisy raw threshold
treat observability as part of operating the system, not a side feature
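
Budget-based alerting reduces to a small piece of arithmetic: compare the observed error rate with the error rate the objective allows. The sketch below is illustrative only; the function names, the two-window rule, and the threshold of 10 are assumptions for the example rather than recommended values.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    budget = 1.0 - slo_target  # allowed error fraction, e.g. 0.001 for 99.9%
    return (errors / requests) / budget


def should_page(short_window, long_window, slo_target, threshold=10.0):
    """Page only if both a short and a long window are burning fast.

    Each window is an (errors, requests) pair. Requiring both filters out
    brief blips while still reacting quickly to sustained problems.
    """
    return (burn_rate(*short_window, slo_target) >= threshold
            and burn_rate(*long_window, slo_target) >= threshold)


# 99.9% objective: 20 errors in 1,000 requests is about 20 times the allowed rate.
recent = burn_rate(20, 1000, 0.999)
page_sustained = should_page((20, 1000), (150, 10000), 0.999)
page_blip = should_page((20, 1000), (50, 10000), 0.999)
```

In the second call the short window looks alarming, but the longer window is burning at only about 5 times budget, so no page is sent. That is the practical difference between alerting on user-facing budget and alerting on raw spikes.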

8.5.2 Netflix-Style Thinking

Netflix popularized resilience thinking in microservice-heavy environments:

expect dependency failure
isolate failure domains
maintain rich telemetry across services
protect systems through load shedding, timeouts, and adaptive behavior

8.5.3 Amazon-Style Thinking

Amazon-style large-scale backend thinking emphasizes:

cell-based or isolation-focused architectures
protecting downstream dependencies with carefully tuned throttles and retries
operational ownership by teams
metrics and alarms tied to service health and customer impact

8.5.4 Uber and Large SaaS Patterns

In highly distributed service environments:

traceability becomes essential because request paths are long
rate limits are often multi-dimensional: user, tenant, endpoint, city, partner, driver, merchant, or feature class
observability needs strong ownership and standard instrumentation to avoid chaos

8.5.5 Stripe and GitHub Patterns

For API-centric platforms:

rate limits and quota communication must be clear to external developers
logs and traces must support incident forensics and customer support
protection is often business-sensitive, especially around auth, abuse, payments, and webhooks

8.6 Best Practices for Interview Answers

When discussing reliability and protection in an interview:

Start with what failure or abuse mode you are protecting against.
Place each mechanism in the architecture explicitly.
Explain tradeoffs, not just components.
Mention what changes at scale or in multi-region systems.
Tie observability to incident response, not just graphs.

Example strong phrasing:

"I would use edge rate limiting for anonymous abuse, gateway quotas for tenant fairness, and service-level throttles for expensive operations. I would instrument RED metrics and trace propagation at the gateway and service boundaries, centralize structured logs with request IDs, and use readiness checks to keep instances that are not ready out of rotation. For multi-region operation, I would keep fast local enforcement and accept some approximation rather than placing every request on a globally consistent counter path."

8.7 Common Cross-Cutting Mistakes

adding retries without rate limits and causing retry storms
adding health checks without thinking about dependency-induced flapping
adding logs without structured fields or correlation IDs
adding metrics with unbounded cardinality
adding tracing without propagation across async boundaries
adding alerts that page constantly without clear action

These mistakes usually happen when systems are designed component-by-component rather than as a complete operating model.
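
The first mistake in the list, retries without limits, has a compact countermeasure: cap retries as a fraction of normal traffic and spread them out with jittered backoff. The sketch below is a simplified illustration; `RetryBudget`, `backoff_delay`, and their default values are assumptions for the example, and the counters never decay, whereas a real budget would track a sliding time window.

```python
import random


class RetryBudget:
    """Allow retries only while they stay under `ratio` of original requests."""

    def __init__(self, ratio=0.1, min_retries=1):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False  # budget exhausted: fail fast instead of retrying


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


budget = RetryBudget(ratio=0.5)
for _ in range(10):
    budget.record_request()
# Ten original requests with a 50% ratio permit five retries in total.
decisions = [budget.try_spend() for _ in range(6)]
```

When a dependency is failing for everyone, the budget runs out quickly and most callers fail fast, so retry traffic stays bounded instead of multiplying the load on a system that is already struggling.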

9. Final Mental Model

The cleanest way to remember this topic is:

rate limiting protects capacity and fairness
monitoring detects that reliability is degrading
logging preserves detailed facts for investigation
tracing maps a request across service boundaries
observability combines those signals into explainability
health checks keep unhealthy instances out of the traffic path

Together, they answer the practical questions that matter in both interviews and production:

How does the system defend itself?
How do we know it is failing?
How do we find the bottleneck quickly?
How do we stop a partial failure from becoming a full outage?
How do we operate this architecture at real scale?

If you can explain those connections clearly, you are already thinking like a backend engineer rather than someone memorizing system design buzzwords.
