tarun-elango 26810e43d0 sd text

2026-04-26 13:27:19 -04:00

Communication Systems

Communication systems are the part of backend engineering that moves information from one actor to another with the right combination of speed, reliability, cost, and user experience.

That sounds simple until you try to build one in production.

At small scale, communication looks like:

send a message to a browser
notify a mobile app that something changed
email a receipt
send an OTP SMS
show whether a user is online

At production scale, the real questions become harder:

how fast must the update arrive before the product feels broken?
what happens if the user disconnects halfway through?
how do you keep millions of long-lived connections open?
how do you avoid sending the same notification five times?
how do you retry safely without creating retry storms?
how do you respect user preferences, cost constraints, and compliance requirements?
how do you fan one event out to millions of recipients without melting the system?

This is why communication systems are a major system design topic. They sit at the boundary between users and backend state, so every weakness becomes visible quickly: latency, stale data, duplicate delivery, missing messages, ghost presence, notification fatigue, and operational complexity.

This guide is written for two goals at once:

interview preparation, where you need to explain tradeoffs clearly and structure your answer well
real backend engineering, where you need to design systems that survive load, failure, and product growth

The focus here is not memorizing definitions. The focus is understanding why these systems exist, how they work internally, what breaks at scale, and how companies such as Google, Netflix, Uber, Amazon, GitHub, Stripe, Slack, WhatsApp, Discord, and typical SaaS platforms think about them in practice.

1. Big Picture: What Communication Systems Solve

Communication systems usually fall into two overlapping families:

real-time systems, where data should move to users quickly while they are actively connected
notification systems, where events should be delivered across channels such as in-app, email, SMS, or push, even if the user is not currently online

In real products, these are not separate worlds. They are usually fed by the same event streams.

Example:

a ride status changes from driver_arrived to trip_started
the rider app needs a real-time UI update
the driver app may need confirmation
analytics pipelines need the event
customer support tools may need the updated timeline
if the rider is offline, a push notification may be needed

One state transition can trigger multiple communication paths.

flowchart LR
	A[Business Event<br/>order placed / trip updated / message sent] --> B[[Event Bus / Outbox]]
	B --> C[Real-Time Delivery Layer]
	B --> D[Notification Orchestrator]
	B --> E[Analytics / Audit / Logging]
	C --> F[WebSocket / SSE / Long Polling Clients]
	D --> G[In-App Inbox]
	D --> H[Email Provider]
	D --> I[SMS Provider]
	D --> J[Push Provider]

The architectural pattern is consistent across many companies:

business systems emit authoritative events
communication infrastructure decides who should be told, through which channel, and with what guarantee
channel-specific delivery systems handle protocol differences and provider failures

1.1 The Core Design Tension

Communication systems are always balancing five things:

Concern	What you want	Why it is hard
Latency	Users see updates quickly	Fast delivery usually requires persistent state and more expensive infrastructure
Reliability	Important messages are not lost	Retries create duplicates unless the system is idempotent
Ordering	Updates arrive in sensible order	Distributed systems naturally reorder across partitions, retries, and reconnects
Cost	Delivery is affordable	SMS is typically billed per message, and push providers throttle or rate-limit senders; persistent connections consume memory
User experience	Users are informed but not annoyed	Over-delivery causes fatigue; under-delivery makes the product feel broken

Strong system design answers acknowledge all five.

1.2 Interview Framing

If an interviewer asks about communication systems, they are usually testing whether you can think beyond "send message from A to B."

They want to hear that you understand:

latency targets depend on the product
transport choice depends on directionality and interaction pattern
delivery guarantees must be designed, not assumed
large fanout requires asynchronous pipelines and backpressure control
presence and online status are approximate, not perfect truth
notification systems require orchestration, deduplication, preferences, and retry strategy

The best answers sound like this:

"I would separate the authoritative business event from channel delivery. Then I would choose the transport based on interaction style, add idempotency and retry controls, and design for fanout, reconnects, and per-user preference rules."

2. Real-Time Systems Overview

Real-time systems are those where the value of data depends heavily on how quickly it arrives.

In backend interviews, "real-time" almost always means internet-scale soft real-time systems, not embedded hard real-time control systems.

2.1 What Real-Time Means in Backend Systems

In a web or mobile product, real-time usually means the update should arrive fast enough that the user feels the application is live.

Examples:

chat messages should appear almost immediately
typing indicators should feel instant
ride location should update every few seconds
collaborative edits should converge with minimal visible lag
market dashboards should feel current enough for decisions
multiplayer game state should update continuously, though serious gaming systems often use specialized protocols beyond basic web stacks

"Real-time" is therefore a product requirement, not a single protocol.

2.2 Soft Real-Time vs Hard Real-Time

Type	Meaning	Typical examples	Backend interview relevance
Hard real-time	Missing a deadline is a correctness failure	avionics, industrial control, some medical systems	Usually not what web/backend interviews mean
Soft real-time	Missing a deadline degrades user experience, but the system still functions	chat, live dashboards, collaborative tools, ride tracking	The common system design case

Hard real-time systems care about deterministic deadlines. Soft real-time systems care about low latency and bounded staleness.

For backend products, the question is usually:

"How stale can data become before the product feels wrong?"

That is a much more practical framing than arguing whether the system is literally real-time.

2.3 Why Real-Time Communication Matters

Real-time communication matters because it changes product behavior:

it makes collaboration feel shared instead of delayed
it reduces the need for manual refreshes
it keeps users engaged during active workflows
it improves trust in stateful experiences such as deliveries, rides, or build pipelines
it enables products whose value depends on immediacy, such as incident dashboards or trading views

Poor real-time behavior is visible immediately.

Users may say:

"the chat is laggy"
"the driver marker jumps randomly"
"I got the notification but the app still shows old state"
"it says they are online but they are not responding"

These are communication-system failures, not merely UI issues.

2.4 Latency Expectations by Product

There is no universal latency target. It depends on the human workflow.

Use case	Typical expectation	What users notice
Typing indicators	50 to 300 ms	Anything slower feels fake
Chat message delivery	under 500 ms ideal	Multi-second delay feels broken
Collaborative editing	roughly 100 to 300 ms for local echo and convergence	Delayed merge or cursor motion is obvious
Ride or delivery tracking	1 to 5 seconds is often acceptable	Long freezes reduce trust
Live operational dashboard	1 to 10 seconds depending on use	Depends on decision criticality
Financial/trading dashboards	often sub-second to a few hundred ms for UX, though core trading infra may require much stricter guarantees	Stale numbers can be dangerous

A good interview answer explicitly sets a target. If you do not define the acceptable staleness, the rest of the design becomes vague.

2.5 Consistency Challenges

Real-time systems are fundamentally consistency problems with a time dimension.

Common challenges:

the backend state changes faster than clients can consume updates
multiple clients observe updates from different replicas or regions
reconnecting clients may miss some events and need replay
the UI may apply optimistic updates before the server confirms them
ordering may break when events are retried, sharded, or merged from multiple sources

This means you should think in layers:

authoritative state lives in the source-of-truth system
real-time delivery is a fast propagation layer, not the only truth
clients often need version numbers, sequence numbers, or snapshots to reconcile with source-of-truth state

2.6 Ordering Guarantees

Ordering sounds simple until the system becomes distributed.

Useful distinctions:

per-connection ordering: messages on a single TCP connection arrive in order
per-partition ordering: messages within one partition of Kafka or a similar log are read in the order they were appended, but there is no ordering across partitions
per-room or per-entity ordering: feasible if you shard consistently by room, stream, or entity ID
global ordering: usually expensive, unnecessary, or impossible at scale
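Per-entity ordering depends on every event for an entity landing on the same partition. A minimal sketch of that routing, assuming a hypothetical `partition_for` helper and a fixed partition count (real brokers ship their own partitioners, so treat this as an illustration of the idea only):

```python
import hashlib


def partition_for(entity_id: str, num_partitions: int) -> int:
    """Map an entity ID (room, document, trip) to a stable partition index."""
    # hashlib produces the same digest in every process, unlike the built-in
    # hash(), which is salted per interpreter run for strings.
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Every event for "room-42" maps to the same partition, so a single
# consumer sees that room's events in append order.
first = partition_for("room-42", 16)
second = partition_for("room-42", 16)
```

Changing the partition count remaps entities, which is one reason resharding a live system needs a migration plan.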

Interviewers often like hearing this sentence:

"I would aim for per-entity ordering, not global ordering, because global ordering becomes a bottleneck and is rarely needed by the product."

2.7 Delivery Guarantees

Real-time systems usually choose one of these practical models:

Guarantee	Meaning	Reality in production
At-most-once	Message may be lost, but not duplicated	Simple and low overhead; common for low-value ephemeral events like typing
At-least-once	Message may be duplicated but should eventually arrive	Common when reliability matters and clients can dedupe
Exactly-once	Message is processed once end-to-end	Usually approximated with idempotency, dedupe keys, and transactional boundaries rather than achieved literally

Important point: a transport like WebSocket does not magically provide business-level delivery guarantees. It only moves bytes on a connection. Business-level guarantees require acknowledgments, persistence, retries, dedupe, and replay logic.
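One common way to approximate exactly-once on top of at-least-once delivery is receiver-side deduplication by message ID. A minimal sketch, assuming a hypothetical `Deduplicator` class and an in-memory window (a production system would more likely keep this in a shared store with a TTL):

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers recently seen message IDs so redelivered messages are ignored."""

    def __init__(self, max_ids: int = 10_000) -> None:
        self._seen = OrderedDict()  # message_id -> None, oldest first
        self._max_ids = max_ids

    def first_time(self, message_id: str) -> bool:
        """Return True the first time an ID is seen inside the window."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True


dedupe = Deduplicator(max_ids=2)
results = [dedupe.first_time(m) for m in ["m1", "m1", "m2"]]
```

The window is bounded on purpose: a duplicate that arrives after its ID has been evicted is processed again, so the window has to be sized against the longest expected retry delay.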

2.8 Fanout Challenges

Fanout means taking one event and sending it to many recipients.

Examples:

one new chat message goes to every participant in a channel
one sports score update goes to millions of viewers
one incident event updates many dashboards

Fanout is where many designs fail, because the cost of one event is no longer constant.

If a message goes to 3 users, the cost is small. If it goes to 3 million users, the delivery system becomes the problem.

Production systems handle this with combinations of:

pub/sub layers
shard-aware connection gateways
regional fanout
batching or aggregation for low-priority updates
per-subscriber filtering so only interested users receive the event

2.9 Scaling Persistent Connections

Persistent connections are not free.

Each active connection consumes:

file descriptors
kernel socket buffers
user-space memory for connection metadata and outbound queues
CPU for heartbeats, TLS, parsing, and serialization

At one million connections, even small per-connection cost matters.

If one connection costs only 20 KB of memory in total, one million connections already imply roughly 20 GB of memory before any business logic overhead.

That is why large-scale real-time systems often use dedicated gateway tiers optimized for connection handling, event loops, and lightweight session metadata.
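The arithmetic above is worth scripting when sizing a gateway tier. A rough sketch, where the function names and the 8 GB of usable memory per host are illustrative assumptions and not figures from any real deployment:

```python
import math


def connection_memory_gb(connections: int, kb_per_connection: float) -> float:
    """Total connection memory in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return connections * kb_per_connection / 1_000_000


def gateways_needed(total_gb: float, usable_gb_per_host: float) -> int:
    """Hosts required if each host can dedicate usable_gb_per_host to connections."""
    return math.ceil(total_gb / usable_gb_per_host)


total = connection_memory_gb(1_000_000, 20)           # 20.0
hosts = gateways_needed(total, usable_gb_per_host=8)  # 3
```

This ignores headroom for deploys and reconnect bursts, which in practice pushes the host count higher.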

2.10 Stateless vs Stateful Challenges

HTTP APIs are often designed to be stateless. Real-time connection layers are not.

A persistent connection pins a client to a specific server process for some period of time. That creates state:

which user is attached to which gateway
which rooms/channels the connection is subscribed to
what the last acknowledged sequence number was
what outbound messages are buffered

This creates operational consequences:

load balancers may need sticky behavior for established connections
connection draining becomes important during deploys
reconnect storms can overload specific shards
state often must be externalized partially to Redis, Kafka, or another coordination layer

The usual production compromise is:

stateful edge or gateway layer for live connections
stateless application layer for most business logic
shared pub/sub or stream infrastructure connecting the two

flowchart LR
	C[Clients] --> LB[Load Balancer]
	LB --> G1[Realtime Gateway 1]
	LB --> G2[Realtime Gateway 2]
	LB --> G3[Realtime Gateway 3]
	G1 --> PS[[Pub/Sub / Stream Bus]]
	G2 --> PS
	G3 --> PS
	PS --> APP[Business Services]
	APP --> DB[(Source of Truth DB)]

2.11 Common Interview Examples

Typical products that surface these issues:

chat systems such as Slack, WhatsApp, Discord
trading or operational dashboards used by finance or SRE teams
collaborative editors such as Google Docs
ride tracking systems such as Uber or delivery apps
multiplayer or shared presence systems

Each of these has different priorities:

chat needs reasonable ordering and persistence
collaborative editing needs conflict resolution and convergence
ride tracking needs geo-update pipelines and bounded staleness
dashboards often need efficient broadcast and backpressure handling

3. WebSockets

WebSockets are the standard choice when a browser or app needs persistent, bidirectional, low-latency communication over a single long-lived connection.

3.1 What WebSockets Are

WebSocket is a protocol that starts as an HTTP request and then upgrades the same underlying TCP connection into a long-lived, full-duplex channel that both sides can use to send messages independently.

Why this matters:

traditional HTTP is request-response oriented
many interactive systems need the server to push updates without waiting for a fresh request
some systems also need the client to send frequent messages without paying repeated HTTP overhead

Examples:

chat input and delivery
collaborative cursor updates
game or live collaboration events
dashboard subscriptions where clients may also send control messages

3.2 Why WebSockets Exist

Before WebSockets, developers used polling, long polling, or various hacks to simulate server push.

Those approaches had problems:

repeated request headers, plus repeated TLS handshakes when connections are not reused
higher latency between event generation and client receipt
difficulty handling truly bidirectional traffic
more server overhead due to connection churn

WebSockets exist because web applications increasingly behaved like continuously connected applications rather than static pages.

3.3 HTTP Upgrade Flow

The connection begins as HTTP and upgrades.

sequenceDiagram
	participant C as Client
	participant LB as Load Balancer
	participant G as WebSocket Gateway

	C->>LB: HTTP GET with Upgrade: websocket
	LB->>G: Forward upgrade request
	G->>G: Authenticate user, validate origin, allocate session
	G-->>LB: 101 Switching Protocols
	LB-->>C: 101 Switching Protocols
	C->>G: WebSocket frames
	G->>C: WebSocket frames

Typical details involved in practice:

the client connects with ws:// or wss://; production typically uses wss://
authentication may be done via cookie, short-lived token, or signed session token
the gateway may register the connection in a presence or subscription store
after upgrade, messages are framed according to the WebSocket protocol instead of normal HTTP response semantics
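One concrete piece of the upgrade is the handshake proof defined in RFC 6455: the server concatenates the client's Sec-WebSocket-Key header with a fixed GUID, hashes the result with SHA-1, and returns the base64 digest in Sec-WebSocket-Accept. A minimal sketch (the function name `websocket_accept` is an assumption; WebSocket libraries normally do this for you):

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a client's handshake key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key from RFC 6455, section 1.3.
accept = websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")
# accept == "s3pPLMBiTxaQ9kYGzzhZRbK+xOo="
```

The value proves the server understood the upgrade request. It is not authentication, which still has to happen through cookies or tokens as described above.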

3.4 Connection Lifecycle

A production WebSocket connection usually has this lifecycle:

connect and authenticate
subscribe to channels, rooms, or entities
exchange bidirectional messages
send heartbeats to detect broken connections
handle temporary disconnects and reconnects
replay missed messages or fetch a snapshot if needed

Designing only the message exchange step is not enough. Most operational pain comes from the last three: heartbeats, reconnects, and replay.

3.5 Heartbeats, Ping-Pong, and Idle Detection

Long-lived connections fail silently all the time.

Reasons include:

mobile device sleep
NAT or proxy idle timeout
Wi-Fi to cellular network switch
gateway crash
browser tab suspension
broken half-open TCP connection

This is why WebSocket systems usually use heartbeat messages or protocol-level ping-pong frames.

Heartbeats help answer two questions:

is the connection still alive?
should the system still consider this client online?

Without heartbeats, you get ghost sessions that appear connected long after the client is gone.
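Ghost-session cleanup can be sketched as a table of last-heartbeat timestamps plus a periodic reaper. This is a minimal in-memory illustration; the `HeartbeatTracker` name, the 30 second default, and the injectable clock are all assumptions made for the example:

```python
import time
from typing import Callable, Dict, List


class HeartbeatTracker:
    """Tracks the last heartbeat per connection and reaps silent ones."""

    def __init__(self, timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so tests can control time
        self._last_seen: Dict[str, float] = {}

    def beat(self, conn_id: str) -> None:
        """Record a heartbeat (or any inbound frame) from a connection."""
        self._last_seen[conn_id] = self._clock()

    def is_online(self, conn_id: str) -> bool:
        return conn_id in self._last_seen

    def reap(self) -> List[str]:
        """Remove and return connections silent for longer than the timeout."""
        now = self._clock()
        dead = [c for c, t in self._last_seen.items() if now - t > self._timeout_s]
        for conn_id in dead:
            del self._last_seen[conn_id]
        return dead
```

A gateway would call `reap()` on a timer and close the returned connections, which is also the point where presence should be updated.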

3.6 Reconnect Strategies

Reconnect behavior is a major part of real-world quality.

Bad reconnect strategy:

immediate reconnect loop
no jitter
no session resume
no replay of missed messages

This creates reconnect storms during deploys or brief outages.

Good reconnect strategy usually includes:

exponential backoff
random jitter
session tokens that expire quickly but can be refreshed safely
a cursor, offset, or last-seen sequence number so missed updates can be replayed
a fallback snapshot fetch if the replay window has expired

For example, a chat client may reconnect with "last message sequence = 8421" so the server can resend messages 8422 onward, or instruct the client to resync from the room history API.
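The backoff and jitter items above can be combined into one delay function. A minimal sketch using the "full jitter" variant, where the delay is drawn uniformly between zero and an exponentially growing ceiling; the function name and the default base and cap are assumptions:

```python
import random


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0,
                  rng=random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based)."""
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(max(attempt, 0), 32)
    ceiling = min(cap_s, base_s * (2 ** exponent))
    # Full jitter: spread clients across the whole interval so they do not
    # reconnect in synchronized waves after a deploy or outage.
    return rng.uniform(0.0, ceiling)


delays = [backoff_delay(n) for n in range(8)]
```

The cap bounds the worst-case wait, and the randomness is what actually prevents the storm: without it, every client disconnected by the same event retries at the same instants.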

3.7 Backpressure Handling

Backpressure means the producer is sending faster than the consumer can handle.

In WebSocket systems, this often shows up as:

a client on a slow network
a dashboard subscribed to too many noisy streams
a fanout burst after a major event
a gateway trying to write faster than the kernel socket buffer drains

If you ignore backpressure, memory usage grows because outbound queues grow.

Common handling strategies:

per-connection outbound queue limits
drop low-priority ephemeral updates such as typing indicators or cursor positions
coalesce updates so only the latest state is sent for fast-changing metrics
disconnect very slow consumers after threshold violations
split high-volume topics into separate subscription classes
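Queue limits and coalescing can live in the same structure. A minimal sketch of a per-connection outbound queue that keeps only the latest value per key and refuses new keys once full; the `CoalescingQueue` class is hypothetical:

```python
from collections import OrderedDict
from typing import Any, Optional, Tuple


class CoalescingQueue:
    """Bounded outbound queue that keeps only the latest value per key."""

    def __init__(self, max_keys: int) -> None:
        self._items = OrderedDict()  # key -> latest value, oldest key first
        self._max_keys = max_keys

    def offer(self, key: str, value: Any) -> bool:
        """Queue an update; returns False if the queue is full."""
        if key in self._items:
            self._items[key] = value  # overwrite in place, keep queue position
            return True
        if len(self._items) >= self._max_keys:
            return False  # caller decides: drop the update or disconnect
        self._items[key] = value
        return True

    def poll(self) -> Optional[Tuple[str, Any]]:
        """Pop the oldest pending (key, value), or None if empty."""
        if not self._items:
            return None
        return self._items.popitem(last=False)
```

This only suits state-like updates such as metrics or cursor positions. Chat messages and other events where every item matters need a different policy, because overwriting them loses data.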

An important interview line:

"For rapidly changing state, I would prefer sending the latest state rather than every intermediate event, because preserving every micro-update can create unnecessary backpressure."

3.8 Scaling Millions of Connections

At large scale, the architecture is usually not "application servers also do WebSockets." Instead it becomes a specialized gateway tier.

flowchart TB
	U[Clients] --> LB[Global Load Balancer]
	LB --> R1[Realtime Gateway Pool A]
	LB --> R2[Realtime Gateway Pool B]
	R1 --> BUS[[Kafka / Redis PubSub / NATS / Custom Bus]]
	R2 --> BUS
	BUS --> CHAT[Chat Service]
	BUS --> PRES[Presence Service]
	BUS --> FEED[Live Feed Service]
	CHAT --> DB[(Message Store)]
	PRES --> CACHE[(Ephemeral Presence Store)]
	FEED --> TS[(Metrics / Event Store)]

Key design patterns:

gateways hold active connections and subscription state
business services publish events to a bus rather than pushing directly to every gateway
gateways subscribe to only the shards or channels they need
sharding is often based on user ID, room ID, organization ID, or topic
regional edge clusters reduce latency and avoid routing every message through one region

Operational concerns at this scale:

file descriptor limits
efficient event-loop networking, such as epoll or kqueue based runtimes
TLS termination cost
per-connection memory footprint
deployment draining and connection migration
DDoS and abuse protection

3.9 Sticky Sessions and Their Implications

WebSocket connections are naturally sticky after they are established because the connection lives on one server.

Two common patterns exist:

sticky routing only for the lifetime of the connection, with minimal external session state
externalized session or subscription metadata so another node can recover more easily after reconnect

Tradeoff:

keeping more state in-memory is fast but makes failover harder
externalizing more state improves recoverability but increases read/write overhead

Most production systems use a hybrid approach.

3.10 Pub/Sub Integration

WebSockets are a transport. They are rarely the source of truth.

The common production flow is:

a business event happens, such as a new chat message
the source-of-truth service persists it
the service publishes an event to a bus
the relevant real-time gateways receive the event
the gateways push it to subscribed clients

This avoids tight coupling between the message store and every connected gateway.

Chat example:

message written to durable message store
room event published to Kafka or another bus
gateways with members of that room fan it out
offline users do not receive live delivery, but the message remains in durable history
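The gateway side of this flow can be sketched as a subscription table plus a handler for events arriving from the bus. This is an in-memory illustration only: the `Gateway` class and its method names are assumptions, and the bus itself is left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


class Gateway:
    """Holds live connections and fans bus events out to room subscribers."""

    def __init__(self) -> None:
        self._rooms: Dict[str, Set[str]] = defaultdict(set)  # room -> conn IDs
        self._send: Dict[str, Callable[[str], None]] = {}    # conn ID -> sender

    def connect(self, conn_id: str, send: Callable[[str], None]) -> None:
        self._send[conn_id] = send

    def subscribe(self, conn_id: str, room_id: str) -> None:
        self._rooms[room_id].add(conn_id)

    def disconnect(self, conn_id: str) -> None:
        self._send.pop(conn_id, None)
        for members in self._rooms.values():
            members.discard(conn_id)

    def on_bus_event(self, room_id: str, message: str) -> int:
        """Deliver one bus event to local subscribers; returns deliveries made."""
        delivered = 0
        for conn_id in self._rooms.get(room_id, ()):
            send = self._send.get(conn_id)
            if send is not None:
                send(message)
                delivered += 1
        return delivered
```

A return value of zero is normal: it means no member of that room is connected to this gateway, and the message is still available from durable history.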

3.11 Message Ordering

WebSocket itself delivers messages in order on a single connection because it runs over TCP.

That does not solve application-level ordering fully.

Ordering can still break due to:

reconnects
retry/replay logic
multi-region replication lag
events produced from multiple backend services
fanout from multiple partitions

Practical fix:

assign monotonic sequence numbers per room, stream, or entity
let clients detect gaps or duplicates
if a gap exists, trigger replay or snapshot sync
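On the client, the practical fix above reduces to comparing each incoming sequence number with the last one applied. A minimal sketch, assuming a hypothetical `SequenceTracker` and sequence numbers that increase by exactly one per stream:

```python
class SequenceTracker:
    """Classifies incoming messages for one room, stream, or entity."""

    def __init__(self, last_seq: int = 0) -> None:
        self.last_seq = last_seq

    def classify(self, seq: int) -> str:
        """Return "apply", "duplicate", or "gap" for an incoming sequence number."""
        if seq <= self.last_seq:
            return "duplicate"  # already applied; safe to ignore
        if seq == self.last_seq + 1:
            self.last_seq = seq
            return "apply"
        # One or more messages are missing: do not advance the cursor,
        # request replay from last_seq + 1 or fall back to a snapshot.
        return "gap"


tracker = SequenceTracker(last_seq=8421)
outcomes = [tracker.classify(s) for s in (8422, 8422, 8425)]
```

The tracker deliberately does not advance on a gap, so the replay request can always be expressed as "everything after `last_seq`".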

3.12 Delivery Guarantees

Default WebSocket delivery is closer to at-most-once unless you add more machinery.

For stronger guarantees, systems add:

message IDs
client acknowledgments
server-side retry windows
dedupe on the client or server
durable storage for important messages

Chat systems often separate message durability from live delivery:

live delivery is fast but may fail during transient disconnects
durable history ensures the user still sees the message when they reconnect

This is how systems like Slack or Discord can tolerate temporary live-delivery failures without losing conversation history.
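The sender side of acknowledgments can be sketched as a table of messages that were sent but not yet acked. The `PendingAcks` class below is a hypothetical in-memory version with no retry timers or size limits:

```python
from typing import Dict, List, Tuple


class PendingAcks:
    """Tracks messages sent to one client until the client acknowledges them."""

    def __init__(self) -> None:
        self._pending: Dict[str, str] = {}  # message_id -> payload, send order

    def sent(self, message_id: str, payload: str) -> None:
        self._pending[message_id] = payload

    def acked(self, message_id: str) -> None:
        # Unknown or repeated acks are ignored rather than treated as errors.
        self._pending.pop(message_id, None)

    def to_resend(self) -> List[Tuple[str, str]]:
        """Messages still unacknowledged, in original send order."""
        return list(self._pending.items())
```

Resending after a reconnect can deliver a message the client already processed but had not yet acked, so this pattern only works if the receiver deduplicates by message ID.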

3.13 WebSockets vs Traditional Request-Response

Dimension	Request-response HTTP	WebSockets
Communication model	Client asks, server answers	Both sides can send after connection is established
Overhead	Headers and request framing repeated on every exchange	One persistent connection
Server push	Awkward or impossible directly	Natural
Best for	CRUD, infrequent interactions, cacheable reads	Interactive real-time features
Statefulness	Usually stateless per request	Stateful connection management
Scaling pain	Request burst handling	Connection count, fanout, reconnect storms

3.14 Production Examples

Slack or Discord style chat:

WebSockets to maintain active sessions
durable message store for history
presence service for online status
per-channel fanout through pub/sub
client acknowledgments or read markers managed separately from transport

Live operational dashboard:

metrics ingestion into stream system
aggregation layer computes latest values
gateway pushes only the latest state at a controlled cadence
slow clients may receive sampled or coalesced updates instead of every raw event

Collaborative editing like Google Docs style systems:

persistent channel for operations and cursor updates
separate algorithm for convergence such as OT or CRDT
ordered per-document operation stream more important than global ordering

3.15 Failure Cases and Best Practices

Common failures:

dropped connections during deploys
missing replay after reconnect
memory blow-up from unbounded per-client queues
duplicate sends during reconnect races
stale authentication on long-lived connections
load imbalance when popular rooms concentrate on one shard

Best practices:

keep auth tokens short-lived and renewable
use heartbeats with sensible timeout and grace period
track per-stream sequence numbers for replay and gap detection
bound outbound queues and define drop policies
separate durable history from ephemeral transport
plan explicitly for reconnect storms

4. Server-Sent Events (SSE)

Server-Sent Events are a simpler real-time transport for one-way server-to-client streaming over HTTP.

4.1 What SSE Is

SSE uses a long-lived HTTP response with content type text/event-stream. The server keeps the response open and pushes events as text lines.

This is simpler than WebSockets because:

it stays within standard HTTP semantics
the communication direction is only server to client
browsers provide built-in reconnect behavior through EventSource

SSE is often the right tool when you need live updates but not client-to-server realtime messaging on the same channel.

4.2 Why SSE Exists

Many applications need push from server to browser, but not full bidirectional sockets.

Examples:

notification feeds
build or job status updates
stock or metrics dashboards
live logs
admin consoles showing state changes

For these, WebSockets may be more power than you need. SSE offers lower conceptual complexity.

4.3 How SSE Works Internally

The server responds with a streaming body where events look like this conceptually:

id: 1052
event: status_update
retry: 3000
data: {"state":"processing","progress":65}

Important fields:

data: payload, often JSON serialized as text
event: optional named event type
id: event ID used for resume/reconnect logic
retry: suggested reconnect delay in milliseconds

Each event is terminated by a blank line. When the browser reconnects automatically, it sends the last event ID it received in the Last-Event-ID request header.
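Producing this wire format takes only a few lines. A minimal sketch of a serializer, where the `format_sse` name and its parameters are assumptions; multi-line payloads are split across several data: lines, as the format requires:

```python
import json
from typing import Any, Optional


def format_sse(data: Any, event: Optional[str] = None,
               event_id: Optional[str] = None,
               retry_ms: Optional[int] = None) -> str:
    """Serialize one event in text/event-stream format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")
    # Non-string payloads are encoded as compact JSON.
    payload = data if isinstance(data, str) else json.dumps(data, separators=(",", ":"))
    for part in payload.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


frame = format_sse({"state": "processing", "progress": 65},
                   event="status_update", event_id="1052", retry_ms=3000)
```

Here `frame` reproduces the example event shown above, followed by the terminating blank line.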

4.4 Browser Reconnect Behavior

One reason teams like SSE is that browsers handle reconnection automatically through the EventSource API.

That does not mean you can ignore replay design.

In production, you still need to decide:

how long the server remembers old event IDs
whether reconnect should replay from a cursor or force a snapshot refresh
what happens if the client was disconnected longer than the replay window

If you care about reliable catch-up, event IDs need meaning. They cannot just be decorative.
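A replay window can be sketched as a bounded buffer keyed by event ID. The `ReplayBuffer` class is hypothetical, and it assumes event IDs are consecutive increasing integers, which is what lets the server tell whether the client's last ID is still covered:

```python
from collections import deque
from typing import List, Optional, Tuple


class ReplayBuffer:
    """Keeps the most recent events so reconnecting clients can catch up."""

    def __init__(self, max_events: int) -> None:
        self._events = deque(maxlen=max_events)  # (event_id, payload) pairs

    def append(self, event_id: int, payload: str) -> None:
        self._events.append((event_id, payload))

    def since(self, last_event_id: int) -> Optional[List[Tuple[int, str]]]:
        """Events after last_event_id, or None if the window has moved past it."""
        if not self._events:
            return []  # nothing buffered, so nothing to replay
        oldest_id = self._events[0][0]
        if last_event_id < oldest_id - 1:
            return None  # events were evicted: client must fetch a snapshot
        return [e for e in self._events if e[0] > last_event_id]
```

Returning `None` instead of a partial list is the important design choice: a client that silently receives an incomplete replay has no way to know its state is wrong.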

4.5 Proxy and Load Balancer Considerations

SSE works over HTTP, but intermediaries can cause problems.

Common issues:

reverse proxies buffering the response instead of flushing promptly
idle timeouts closing the stream
HTTP/1.1 connection count limits in browsers, typically about six per origin, which several open tabs with SSE streams can exhaust
network infrastructure tuned for short responses rather than long-lived streams

Typical mitigations:

disable response buffering where necessary
send periodic keepalive comments or lightweight events
tune load balancer idle timeout so it is comfortably longer than the keepalive interval
prefer HTTP/2 where available to improve multiplexing behavior

4.6 Advantages of SSE

SSE is often better than WebSockets when:

you only need server-to-client updates
the payload is text-oriented JSON or status messages
you want easy browser support with simpler client code
you prefer staying inside HTTP infrastructure where possible
you want built-in reconnect behavior without custom socket management

It is especially attractive for SaaS dashboards and internal tools.

4.7 Limitations of SSE

SSE is not a universal replacement for WebSockets.

Limitations:

browser-to-server communication still needs normal HTTP requests
payloads are text-based, not native binary frames
some proxies or CDNs may handle buffering poorly if misconfigured
older browser and infrastructure quirks may matter in enterprise environments
very high frequency bidirectional workloads fit WebSockets better

4.8 SSE vs WebSockets

Dimension	SSE	WebSockets
Direction	Server to client only	Bidirectional
Protocol model	Streaming HTTP	Upgraded persistent socket
Browser reconnect support	Built in	Usually implemented by application
Complexity	Lower	Higher
Binary support	Not natural	Supported
Best use	Feeds, status, live logs, dashboards	Chat, collaboration, two-way real-time

4.9 When SSE Is the Better Choice

SSE is often the better choice when the client mostly listens.

Examples:

GitHub or CI-style live build logs
a Stripe-like dashboard showing payment processing updates
an incident dashboard showing service health changes
a notification center feed for a web app
stock or operations dashboards where commands go through separate REST APIs

This is a good interview point: choosing the simpler transport when it satisfies requirements is usually a sign of maturity.

4.10 Production Example

Imagine a live job-status console for a video encoding platform.

Architecture:

workers emit progress events to Kafka
a status aggregation service maintains current job state
an SSE gateway streams progress events to browser tabs watching those jobs
user actions such as cancel or retry still go through normal HTTP APIs

This keeps the design simple because only one direction needs to be live.

sequenceDiagram
	participant B as Browser
	participant S as SSE Gateway
	participant A as Aggregation Service

	B->>S: GET /jobs/123/stream
	S->>A: Subscribe to job 123 updates
	A-->>S: status event id=51
	S-->>B: event: progress, id:51, data:{65%}
	A-->>S: status event id=52
	S-->>B: event: progress, id:52, data:{78%}
	Note over B,S: If disconnected, browser reconnects with Last-Event-ID

4.11 Failure Cases and Best Practices

Common failures:

events buffered by proxy, making the feed appear delayed
auto-reconnect loops against an unhealthy server
event IDs not aligned with replay logic
assuming SSE gives guaranteed delivery without durable backing state

Best practices:

use SSE for one-way streaming where simplicity matters
define replay windows and semantics for Last-Event-ID
tune proxy buffering and timeout settings explicitly
send heartbeats or comments to keep the stream alive
keep authoritative state elsewhere so clients can resync after longer disconnects

5. Long Polling

Long polling is the older but still useful technique where the client makes a request, the server holds it open until data is available or a timeout occurs, then the client immediately opens a new request.

5.1 What Long Polling Is

Long polling tries to approximate server push while staying entirely within ordinary request-response behavior.

Lifecycle:

client sends request asking for updates
server waits if no new data is available
when data arrives, or timeout happens, server responds
client immediately sends another request

This can work surprisingly well at modest scale.

5.2 Why Older Systems Used It

Long polling became popular because:

it worked before modern browser WebSocket support was universal
it fit existing HTTP infrastructure better
it avoided some firewall or proxy restrictions that broke upgraded connections
it was simple to integrate into applications already centered on HTTP APIs

Many legacy enterprise products still use it, and some modern systems still choose it when operational simplicity matters more than efficiency.

5.3 Request Lifecycle and Timeout Behavior

sequenceDiagram
	participant C as Client
	participant S as Server

	C->>S: GET /updates?cursor=120
	Note over S: Hold request open
	S-->>C: 200 OK with new event 121
	C->>S: GET /updates?cursor=121
	Note over S: Hold request open again
	S-->>C: 204 No Content or timeout response
	C->>S: GET /updates?cursor=121

The cursor is important. Without it, the client cannot reliably resume.
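The server side of this exchange can be sketched with a condition variable: the handler waits until the log has grown past the client's cursor or the timeout expires. The `UpdateLog` class is a hypothetical in-memory stand-in that uses a blocking thread per request for brevity, where the cursor is simply the number of events the client has already consumed:

```python
import threading
from typing import List, Tuple


class UpdateLog:
    """Append-only event log with a blocking, cursor-based poll."""

    def __init__(self) -> None:
        self._events: List[str] = []
        self._cond = threading.Condition()

    def append(self, event: str) -> None:
        with self._cond:
            self._events.append(event)
            self._cond.notify_all()  # wake every waiting poll

    def poll(self, cursor: int, timeout_s: float) -> Tuple[List[str], int]:
        """Return (events after cursor, new cursor), waiting up to timeout_s."""
        with self._cond:
            self._cond.wait_for(lambda: len(self._events) > cursor,
                                timeout=timeout_s)
            new_events = self._events[cursor:]
            return new_events, cursor + len(new_events)
```

An empty list with an unchanged cursor corresponds to the 204 or timeout response in the diagram: the client simply polls again with the same cursor, so nothing is skipped or repeated.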

5.4 Server Resource Cost

Long polling has hidden cost:

many open requests waiting simultaneously
repeated HTTP parsing and headers
repeated auth checks unless optimized
more connection churn than persistent transports
potentially more threads or async waiting overhead depending on server model

In modern async servers this can be manageable, but it is still usually less efficient than SSE or WebSockets for high-frequency updates.

5.5 Retry Flow and Failure Behavior

If a long-poll request fails, the client retries. If the system is overloaded, thousands of clients may do this simultaneously.

This can create:

thundering herds after load balancer restarts
increased TLS handshake overhead
request spikes aligned to timeout intervals

Good long-poll implementations randomize retry timing and use cursors to avoid gaps.

5.6 Scaling Limitations

Long polling tends to hit limits earlier than SSE or WebSockets because:

every delivery cycle requires a fresh request
idle clients still create repeated requests
server-side waiting requests increase memory and scheduling overhead
load balancers and API gateways see much higher request volume

It is therefore more expensive per delivered update at scale.

5.7 Long Polling vs SSE vs WebSockets

Dimension	Long Polling	SSE	WebSockets
Direction	Mostly server to client via repeated requests	Server to client	Bidirectional
Latency	Good but depends on re-request timing	Better	Best for interactive two-way
Overhead	Highest	Lower	Lowest per message after connection
Infrastructure simplicity	High	Moderate	Moderate to high
Browser support model	Universal HTTP	Modern browser friendly	Modern browser friendly
Best fit	Legacy environments, simple compatibility	One-way streaming	Full duplex interactive systems

5.8 Where It Still Appears Today

Long polling still appears in:

legacy enterprise systems
products that need broad proxy compatibility
simple notification or status systems where connection counts are moderate
fallback modes when WebSockets fail

Some client libraries silently downgrade to long polling when sockets are unavailable.

5.9 Common Mistakes

Common mistakes:

no cursor or sequence number, causing gaps or duplicates
synchronized timeout intervals across all clients
treating long polling as free because it uses HTTP
using blocking server threads for many waiting requests

Best practices:

add jitter to retries and request restarts
carry a cursor or last seen event ID
use async I/O servers
switch to SSE or WebSockets when event frequency or client count grows significantly

6. Presence

Presence is the system that answers questions such as:

is this user online?
were they active recently?
are they typing right now?
which device are they on?
when were they last seen?

Presence looks easy until you build it across unreliable networks and multiple devices.

6.1 What Presence Means

Presence is not just a boolean. It is usually a collection of signals.

Examples of signals:

active WebSocket connection exists
recent heartbeat received within threshold
app is in foreground or background
user generated interaction recently
one or more devices are connected
a typing event was emitted in the last few seconds

Products like Slack, WhatsApp, and Discord treat these signals differently based on product semantics and privacy expectations.

6.2 Online/Offline State vs Rich Presence

There are multiple levels of presence richness:

Presence signal	Meaning	Typical use
Online	At least one recent active connection	chat roster
Active now	Recent interaction or foreground state	collaboration tools
Last seen	Last trusted activity timestamp	messaging apps
Typing	Very recent ephemeral intent signal	chat UI
Device-specific presence	Desktop online, mobile background, etc.	cross-device messaging

Rich presence improves UX but increases cost and complexity.

6.3 Heartbeat Mechanisms

Presence systems typically rely on heartbeats because disconnects are not always detected immediately.

Typical model:

client sends heartbeat every N seconds
gateway refreshes an expiry in an ephemeral store
if no refresh occurs before TTL plus grace period, the client is considered offline

Heartbeats are usually lightweight and sometimes piggyback on existing ping-pong traffic.
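
The heartbeat model above can be sketched with an in-memory dictionary standing in for the ephemeral store. `PresenceStore` is a hypothetical name, and the injectable clock is an assumption made so the behavior can be tested without sleeping:

```python
import time
from typing import Callable


class PresenceStore:
    """Ephemeral presence: each heartbeat pushes the expiry forward."""

    def __init__(self, ttl: float, grace: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._grace = grace
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def heartbeat(self, session_id: str) -> None:
        self._expires_at[session_id] = self._clock() + self._ttl

    def is_online(self, session_id: str) -> bool:
        expiry = self._expires_at.get(session_id)
        if expiry is None:
            return False
        if self._clock() > expiry + self._grace:
            del self._expires_at[session_id]  # lazy cleanup of the stale entry
            return False
        return True
```

In a Redis-like store the expiry would typically be handled by the store's own key TTL instead of the lazy cleanup shown here.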

6.4 Disconnect Detection

There are several ways a system may detect that a user is gone:

clean socket close
missed heartbeat threshold
TCP reset or read/write failure
mobile OS background suspension causing missed keepalives
gateway process failure

The important design lesson is that disconnect detection is delayed and probabilistic.

6.5 Ghost Online Problems

Ghost online means the system shows a user as online even though they have effectively disappeared.

Causes:

stale presence entry in Redis or similar store
gateway crash before cleanup
mobile network change without clean disconnect
delayed heartbeat expiration
clock skew or delayed writes

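Ghost online is not just cosmetic. In chat systems it changes user expectations and trust: a sender who sees a contact as online expects a prompt reply and may read silence as being ignored.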

6.6 Mobile Network Challenges

Mobile clients make presence significantly harder:

apps move between foreground and background
the OS may throttle network activity
the device may sleep aggressively
the network may switch between Wi-Fi and cellular
radio conditions cause intermittent reachability

This is why messaging apps rarely promise precise truth about presence. They provide an approximation that is good enough for UX.

6.7 Multi-Device Presence

A single user may be connected from:

desktop browser
mobile phone
tablet
another browser tab

So the question becomes: how should user presence be aggregated?

Common policies:

user is online if any device is online
show the most active device class
typing indicator is scoped per conversation and device session
last seen is updated from the most recently trusted activity source

This requires per-device state plus a user-level aggregation layer.
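
One possible shape for that aggregation layer, as a sketch. `DeviceSession` and `aggregate` are hypothetical names, and the policy shown (online if any device is online, primary device is the most recently active online one) is just one of the options listed above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceSession:
    device: str          # e.g. "desktop", "mobile"
    online: bool
    last_active: float   # epoch seconds of the last trusted activity


def aggregate(sessions: list[DeviceSession]) -> dict:
    """Collapse per-device sessions into one user-level presence view."""
    online = [s for s in sessions if s.online]
    most_recent = max(sessions, key=lambda s: s.last_active, default=None)
    primary = max(online, key=lambda s: s.last_active, default=None)
    return {
        "online": bool(online),
        "primary_device": primary.device if primary else None,
        "last_seen": most_recent.last_active if most_recent else None,
    }
```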

6.8 Presence Architecture at Scale

flowchart LR
	C1[Client Devices] --> G[Realtime Gateways]
	G --> P[Presence Service]
	P --> E[(Ephemeral TTL Store)]
	P --> L[(Last Seen Store)]
	P --> BUS[[Presence Event Bus]]
	BUS --> W1[Workspace / Room Subscribers]
	BUS --> W2[Friend / Contact Subscribers]

Important implementation choices:

keep rapidly changing online state in an ephemeral store with TTL, often Redis-like
persist last seen more selectively because writing every heartbeat to durable storage is wasteful
fan out presence only to users who care, such as channel members or contact lists
avoid global broadcasts of presence updates

6.9 How Slack, WhatsApp, and Discord Think About Presence

Slack-like systems:

presence is often relevant within a workspace, not globally
typing indicators are highly ephemeral and can be dropped safely
active/away may incorporate user activity signal, not only socket existence

WhatsApp-like systems:

privacy settings affect whether last seen or online is visible
mobile network behavior dominates design decisions
message delivery status and online status are related but not identical

Discord-like systems:

gateway connections and presence updates are central
fanout is scoped to servers, friends, or subscribed contexts
rich presence may include game/activity metadata beyond simple online state

6.10 Failure Cases and Best Practices

Common failures:

presence storms during reconnect events
over-broadcasting presence changes to too many recipients
stale typing indicators that never clear
durable stores overloaded by heartbeat writes

Best practices:

use TTL-based ephemeral storage for active presence
use grace periods before declaring offline
treat typing as ephemeral, low-guarantee data
separate last seen persistence from heartbeat frequency
scope presence fanout carefully

7. Online and Offline State

Online/offline state deserves separate treatment because it is one of the most misunderstood topics in system design.

7.1 Connection State vs Actual User Activity

A connected socket does not prove the user is attentive.

Examples:

the app is open in a background tab
the phone has a connection but the user has not interacted for an hour
the device is connected through a stale transport path

So "online" is often a layered concept:

connected
recently active
foreground active
reachable through push only
offline

7.2 Why Online Is Probabilistic

"Online" is often probabilistic rather than exact because distributed systems only observe signals, not intent.

Signals can be wrong or delayed:

heartbeat arrives late
device sleeps
disconnect cleanup fails
region replication lags
clock skew shifts thresholds

That is why serious systems avoid over-promising semantic precision.

7.3 Delayed Disconnect Detection and Grace Periods

If you mark a user offline immediately on one missed heartbeat, the UI will flap constantly. If you wait too long, users stay falsely online.

The typical answer is a grace period.

Example:

heartbeat expected every 20 seconds
after 45 seconds without heartbeat, mark as "probably offline"
after 90 seconds, emit strong offline event

The exact numbers depend on product sensitivity and mobile behavior.
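
The two-threshold example can be written as a small classification function. The function name and state labels are illustrative, and the defaults simply mirror the example numbers above:

```python
def presence_state(seconds_since_heartbeat: float,
                   soft_after: float = 45.0,
                   hard_after: float = 90.0) -> str:
    """Map heartbeat silence to a presence state with a grace band in between."""
    if seconds_since_heartbeat < soft_after:
        return "online"
    if seconds_since_heartbeat < hard_after:
        return "probably_offline"  # UI may dim the indicator without emitting an event
    return "offline"               # safe to emit the strong offline event
```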

7.4 Distributed Presence Tracking

At scale, presence may be tracked across regions or gateway clusters.

Challenges:

the same user may have devices connected to different regions
presence updates may replicate asynchronously
room members may themselves be distributed globally
failover can cause duplicate or delayed presence transitions

Practical design patterns:

aggregate presence per user from device-level sessions
keep the fast path regional when possible
replicate summarized state, not every heartbeat, across regions
use eventual consistency for non-critical presence displays

7.5 Eventual Consistency and Reconciliation

Presence is almost always eventually consistent.

That is acceptable because most presence UI is advisory rather than transactional.

Reconciliation patterns:

refresh contact roster periodically from source state
clear typing indicators after TTL rather than requiring explicit "stop typing"
overwrite stale online state when stronger evidence arrives
prefer monotonic versioning or timestamps for state updates
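
A sketch of the versioning idea, assuming each writer attaches an increasing version number per user. `PresenceRecord` and `apply_update` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class PresenceRecord:
    status: str
    version: int  # increasing per user, assigned by the writer


def apply_update(current: dict[str, PresenceRecord], user_id: str,
                 status: str, version: int) -> bool:
    """Apply the update only if it is newer than what is stored."""
    existing = current.get(user_id)
    if existing is not None and version <= existing.version:
        return False  # stale or duplicate update: keep the current state
    current[user_id] = PresenceRecord(status, version)
    return True
```

With this check, a delayed "online" update that arrives after a newer "offline" update is dropped instead of resurrecting stale state.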

7.6 Offline Event Reconciliation

When a user comes back online, the system often needs to reconcile what happened while they were away.

Examples:

unread chat message counts
missed mentions or alerts
latest document state
last known ride or delivery state

This is where live presence and durable notification systems connect. Presence alone is not enough. Offline catch-up requires durable event or state storage.

7.7 Useful Interview Language

Strong interview framing:

"I would treat online status as a best-effort product signal based on recent heartbeats and activity, not as exact truth. I would keep active presence in an ephemeral TTL store, add grace periods to avoid flapping, and rely on durable stores for offline reconciliation."

8. Notification Systems Overview

Notification systems take internal events and decide whether, when, and how to tell a user.

This is a much bigger problem than "send email" or "send push."

8.1 What Notification Systems Are

A notification system is usually an event-driven orchestration system with responsibilities such as:

ingest product events
determine recipients
check user preferences and eligibility rules
choose channels such as in-app, email, SMS, or push
render channel-specific content
schedule immediate or delayed delivery
retry safely on transient failures
capture delivery outcomes and user actions

This is why notification systems often become a platform inside a company.

8.2 Why They Exist as Dedicated Systems

Without a dedicated notification system, every product team reimplements:

preferences
templates
provider integrations
retry logic
deduplication
analytics
unsubscribe logic

That leads to inconsistency, bugs, and poor user experience.

Centralized notification platforms exist because communication policy should be consistent even when event producers are decentralized.

8.3 Event-Driven Producer-Consumer Flow

flowchart LR
	EV[Business Events<br/>PR comment / payment failed / trip update] --> BUS[[Event Bus / Outbox]]
	BUS --> ORCH[Notification Orchestrator]
	ORCH --> PREF[Preferences / Suppression Rules]
	ORCH --> TPL[Template Service]
	ORCH --> DEDUPE[Idempotency / Deduplication]
	ORCH --> Q1[[Email Queue]]
	ORCH --> Q2[[SMS Queue]]
	ORCH --> Q3[[Push Queue]]
	Q1 --> EP[Email Provider]
	Q2 --> SP[SMS Provider]
	Q3 --> PP[APNs / FCM]
	EP --> CALLBACK[Delivery Callbacks / Webhooks]
	SP --> CALLBACK
	PP --> CALLBACK
	CALLBACK --> STATUS[(Notification Status Store)]

8.4 Notification Orchestration

The orchestrator is the brain of the system.

It decides:

should we notify at all?
which recipients should receive it?
which channels are allowed for this user and event type?
should the notification be immediate, batched, or suppressed?
how should duplicates be prevented?
what fallback should happen if one channel fails?

This is why the orchestrator is usually more important than the provider adapters.

8.5 Fanout Systems

A single event can notify:

one user
all watchers of a repository
all participants in a channel
all on-call engineers for a critical incident
millions of users in a campaign

Fanout complexity depends on the recipient graph.

Examples:

GitHub notification fanout may target repository watchers, mention targets, or review participants
Uber trip updates may target rider, driver, and support systems
Stripe-like alerts may target account owners and configured team members

The fanout step is often separated from channel delivery so the system can scale recipient expansion independently.
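
One way to keep recipient expansion independent is to stream recipients and enqueue them in fixed-size batches. This is a sketch with a hypothetical helper name, not a description of any particular company's pipeline:

```python
from itertools import islice
from typing import Iterable, Iterator


def recipient_batches(recipient_ids: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield recipients in fixed-size batches without materializing the full list."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(recipient_ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch can become one queue message, so a campaign with millions of recipients turns into many small units of work that channel workers consume at their own pace.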

8.6 Personalization and Preferences

Mature systems do not treat all recipients equally.

They consider:

language and locale
preferred channels
quiet hours
marketing consent vs transactional necessity
account type or plan tier
notification priority
prior recent sends to avoid fatigue

This means notification systems need user preference stores, policy engines, and template personalization.

8.7 Delivery Guarantees and Idempotency

Notification systems are almost always at-least-once internally.

Why:

queues retry on failures
provider callbacks can be delayed or duplicated
downstream consumers may crash after partial processing

Therefore idempotency is mandatory.

Examples of idempotency keys:

event_id + recipient_id + channel + template_version
otp_request_id + phone_number
invoice_id + notification_type

Without idempotency, retries turn into duplicate emails, repeated SMS, or multiple push notifications.
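
A minimal sketch of the first key shape above. `SendLedger` is a hypothetical in-memory stand-in; a production system would need a shared store with an atomic insert-if-absent operation:

```python
import hashlib


def idempotency_key(event_id: str, recipient_id: str, channel: str,
                    template_version: str) -> str:
    """Derive a stable key from the fields that define 'the same send'."""
    raw = "\x1f".join([event_id, recipient_id, channel, template_version])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class SendLedger:
    """Records which sends have already been claimed."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to claim this key."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

A worker calls `claim` before contacting the provider and skips the send when it returns False, so a redelivered queue message does not become a second email.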

8.8 User Preferences, Suppression, and Fatigue Prevention

Good notification systems do not just maximize sends. They maximize useful communication.

Common controls:

per-event-type preferences
daily or hourly caps
digesting multiple low-priority events into one summary
suppression if the user already saw the update in-app
channel priority rules such as push first, email later if unread
quiet hours by timezone

This is especially important for SaaS products, GitHub-like collaboration tools, and consumer apps where over-notification causes churn.
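
A sketch of how a few of these controls might combine into one decision. The function names, the priority labels, and the fixed UTC offset are assumptions; a real system would more likely store an IANA timezone per user:

```python
from datetime import datetime, timedelta, timezone


def in_quiet_hours(now_utc: datetime, utc_offset_minutes: int,
                   start_hour: int = 22, end_hour: int = 7) -> bool:
    """Check the user's local hour against a quiet window that may wrap midnight."""
    local = now_utc.astimezone(timezone(timedelta(minutes=utc_offset_minutes)))
    if start_hour <= end_hour:
        return start_hour <= local.hour < end_hour
    return local.hour >= start_hour or local.hour < end_hour


def decide(priority: str, sent_last_24h: int, daily_cap: int,
           now_utc: datetime, utc_offset_minutes: int) -> str:
    """Return 'send', 'defer', or 'suppress' for one candidate notification."""
    if priority == "critical":
        return "send"      # assumed policy: critical alerts bypass caps and quiet hours
    if sent_last_24h >= daily_cap:
        return "suppress"
    if in_quiet_hours(now_utc, utc_offset_minutes):
        return "defer"     # hold until the quiet window ends
    return "send"
```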

8.9 Real-World Examples

GitHub-like system:

new comment event arrives
identify subscribers and participants
if user is active in the web app, maybe show in-app only
if user is offline, maybe send email or mobile push depending on preferences

Stripe-like system:

payment failed event arrives
critical transactional email sent immediately
dashboard alert created
webhook to merchant retried separately with strong idempotency rules

Uber-like system:

trip state changes drive in-app realtime updates when online
push or SMS may be used if the user is offline or inactive
some updates are critical and some are suppressible

8.10 Common Mistakes

Common mistakes:

every producer sends directly to providers
no central dedupe or idempotency control
no distinction between marketing and transactional communications
retrying permanent failures forever
ignoring user timezone and quiet hours
no feedback loop from provider delivery callbacks

9. Email Notifications

Email is deceptively complex. It looks simple because the send API is simple. Production email systems are complicated because delivery, trust, compliance, and reputation matter.

9.1 Transactional vs Marketing Email

Type	Purpose	Examples	Operational expectations
Transactional	Required or highly important user/account events	password reset, receipt, billing alert, security alert	High reliability, lower latency, clearer deliverability priority
Marketing	Engagement and campaigns	product updates, promotions, newsletters	Segmentation, scheduling, consent, reputation sensitivity

Good companies often separate these streams operationally because poor marketing practices can harm the deliverability of important transactional mail.

9.2 Provider Integrations

Common providers include SES, SendGrid, Mailgun, Postmark, and others.

The backend typically needs to manage:

authentication credentials and rotation
domain verification
sending rate limits
webhook endpoints for bounces, complaints, and delivered events
fallback provider strategy if the primary provider degrades

The provider is not your notification system. It is only the last-mile email delivery partner.

9.3 Deliverability Basics

Deliverability means whether mail actually lands where it should.

It depends on more than API success.

Core concepts:

SPF, DKIM, and DMARC to establish domain authenticity
domain and IP reputation
complaint rates and unsubscribe behavior
bounce rates
content quality and spam-like patterns
volume ramp-up or IP/domain warming

A send accepted by the provider can still end up in spam or be throttled downstream.

9.4 Retries and Failure Handling

Email sending failures can be transient or permanent.

Transient examples:

provider API timeout
rate limit exceeded
temporary downstream mail server deferral

Permanent examples:

invalid recipient address
hard bounce history
unsubscribed marketing recipient

The notification system should classify failures and retry only the transient ones.

9.5 Bounce Handling

Bounces matter because continuing to send to bad addresses damages reputation.

Typical flow:

send email
provider later reports delivery, soft bounce, hard bounce, complaint, or unsubscribe
your system updates recipient state and future eligibility

Hard bounces often lead to immediate suppression for that address. Soft bounces may tolerate a few retries or future attempts.
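
The flow above can be sketched as a small suppression list driven by provider events. The event names and the soft-bounce threshold are assumptions, since each provider reports these outcomes in its own format:

```python
SOFT_BOUNCE_LIMIT = 3  # assumed threshold, tune per provider and product


class SuppressionList:
    """Tracks which addresses are no longer eligible for sends."""

    def __init__(self) -> None:
        self._suppressed: dict[str, str] = {}
        self._soft_bounces: dict[str, int] = {}

    def record(self, address: str, event: str) -> None:
        addr = address.strip().lower()  # simplified normalization
        if event in ("hard_bounce", "complaint"):
            self._suppressed[addr] = event
        elif event == "soft_bounce":
            count = self._soft_bounces.get(addr, 0) + 1
            self._soft_bounces[addr] = count
            if count >= SOFT_BOUNCE_LIMIT:
                self._suppressed[addr] = "repeated_soft_bounce"
        elif event == "delivered":
            self._soft_bounces.pop(addr, None)  # a success resets the streak

    def can_send(self, address: str) -> bool:
        return address.strip().lower() not in self._suppressed
```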

9.6 Spam Prevention Basics

Spam prevention is partly technical and partly policy.

Important basics:

authenticate sending domains correctly
do not mix high-risk marketing traffic with critical transactional mail blindly
honor unsubscribe and complaint signals quickly
avoid misleading content or sudden massive bursts from a cold domain
keep lists clean and verified

9.7 Rate Limits and Provider Behavior

Providers enforce rate limits, and mailbox providers also effectively rate-limit through throttling.

Implications:

you may need send pacing
campaign fanout should often be spread over time
critical transactional mail may need priority queues separate from bulk mail

Amazon-style or Stripe-style receipt and billing systems usually cannot afford to let a giant promotional campaign delay receipts or password resets.

9.8 Email Verification Basics

Verification is used to reduce garbage addresses and account abuse.

Common patterns:

double opt-in for marketing lists
confirmation link for new account email ownership
suppression of obviously invalid domains or malformed addresses

This improves both security and deliverability.

9.9 Unsubscribe Flows

Unsubscribe handling is non-negotiable for marketing communications and often still useful for preference management around non-critical transactional categories.

Good systems support:

one-click unsubscribe where required
fine-grained preference center
audit trail of consent changes
region/compliance specific logic if needed

9.10 Domain Reputation Basics

Reputation is cumulative. Sending one terrible campaign can affect future delivery of important mail.

This is why large companies often:

separate sending domains or subdomains by traffic type
ramp up new domains gradually
isolate risky campaigns from critical operational email

9.11 Why Email Is More Complex Than "Just Send Email"

Because success requires all of the following:

rendering the right content
sending through a healthy provider
passing authentication checks
avoiding spam classification
respecting preferences and compliance
tracking callbacks
updating suppression lists

A provider returning 200 OK is the beginning, not the end.

10. SMS Notifications

SMS is powerful because it reaches users outside the app and does not depend on a mobile data connection or an installed app. It is also expensive, abuse-prone, and operationally messy.

10.1 Common Use Cases

SMS is usually best reserved for high-value or urgent communication.

Examples:

OTP and account verification
fraud or security alerts
delivery or ride status for users without app engagement
critical operational alerts

It is usually a poor default channel for high-volume low-value communication.

10.2 Provider Integrations

Typical providers include Twilio, Sinch, MessageBird, and region-specific aggregators.

Real systems often need to manage:

phone number normalization and validation
country-specific sender rules
short code, long code, toll-free, or sender ID strategy
provider failover or route selection
delivery receipt webhook processing

10.3 Delivery Reliability

SMS delivery reliability is more uncertain than many engineers expect.

Reasons:

carriers can filter traffic
delivery receipts may be delayed or incomplete
regional regulations vary heavily
roaming, handset state, and carrier routing affect outcomes

This means SMS is not a guaranteed-delivery channel. It is a high-reach channel.

10.4 Regional Delivery Issues

SMS is deeply regional.

Challenges include:

country-specific compliance requirements
differing support for alphanumeric sender IDs
carrier throughput differences
local holidays or network conditions affecting delivery
multi-language content and character encoding costs

A design that works in the US may be wrong in India, Brazil, or the Middle East.

10.5 Cost Considerations

SMS is expensive relative to push or in-app messaging.

Implications:

use it for critical moments, not noise
add rate limits and spend monitoring
avoid repeated retries that burn money without improving outcomes

This is one reason companies often prefer push first, then SMS only for critical fallback paths.

10.6 Retry Strategies

Retrying SMS requires care.

Bad strategy:

retry every failure aggressively
resend OTP repeatedly with no abuse checks

Better strategy:

retry transient provider errors with capped backoff
do not retry obvious permanent failures
rotate providers or routes only when there is evidence of provider-side trouble
ensure OTP requests invalidate or supersede older codes safely
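
The last point can be sketched as a store that keeps exactly one active code per phone number. `OtpStore` is a hypothetical in-memory example; it omits attempt limits and persistence, which a real OTP flow would need:

```python
import hmac
import secrets
import time
from typing import Callable


class OtpStore:
    """Keeps one active code per phone number; a new request replaces the old one."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._active: dict[str, tuple[str, float]] = {}

    def issue(self, phone: str) -> str:
        code = f"{secrets.randbelow(10**6):06d}"
        self._active[phone] = (code, self._clock() + self._ttl)  # supersedes any older code
        return code

    def verify(self, phone: str, code: str) -> bool:
        entry = self._active.get(phone)
        if entry is None:
            return False
        expected, expires_at = entry
        if self._clock() > expires_at:
            del self._active[phone]
            return False
        if not hmac.compare_digest(expected.encode(), code.encode()):
            return False
        del self._active[phone]  # single use
        return True
```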

10.7 Abuse and Fraud Prevention

SMS systems are targets for abuse.

Examples:

OTP bombing a victim's phone
account enumeration by testing phone numbers
SMS pumping fraud, where attackers generate traffic to monetize premium routes
SIM swap exposure when SMS is treated as strong identity proof

Protections:

per-user and per-destination rate limits
CAPTCHA or additional friction on suspicious flows
phone reputation checks where appropriate
anomaly detection on destination patterns and country mix
avoid using SMS as the only high-assurance security factor for sensitive systems
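
A sketch of the first protection, a sliding-window limiter keyed by destination number or user ID. The class name and the in-memory storage are assumptions; a multi-instance deployment would need shared state:

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allows at most `limit` events per key within the trailing window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        while hits and now - hits[0] >= self._window:
            hits.popleft()  # drop events that fell out of the window
        if len(hits) >= self._limit:
            return False
        hits.append(now)
        return True
```

Applying one limiter per destination and another per requesting account covers both OTP bombing of a single victim and one attacker spraying many numbers.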

10.8 Fallback Strategies

Reasonable fallback examples:

push notification first for app users, SMS only if unread or unreachable
voice call backup for some OTP flows in limited contexts
in-app confirmation plus email receipt after SMS-based security action

Fallback should be policy-driven, not improvised per service.

10.9 Why SMS Should Be Used Carefully

SMS is valuable because it reaches users who have no data connection or are not using the app, but it should be treated as a scarce and risky channel.

In interviews, a good answer sounds like:

"I would reserve SMS for high-importance events such as OTP or urgent alerts, because it has real cost, variable regional reliability, and meaningful abuse risk."

11. Push Notifications

Push notifications are the standard way mobile apps receive out-of-app alerts through platform push services.

11.1 Mobile Push Architecture

A typical flow is:

app registers with APNs on iOS or FCM on Android
the app backend stores the device token
when an event occurs, the notification system sends a message to APNs or FCM
the platform service attempts delivery to the device
the app may open, refresh state, or show a system notification depending on payload type and app state

flowchart LR
	APP[Your Backend] --> ORCH[Notification Orchestrator]
	ORCH --> APNS[APNs]
	ORCH --> FCM[FCM]
	APNS --> IOS[iOS Device]
	FCM --> AND[Android Device]

11.2 APNs Basics

APNs is Apple's push service. Important operational realities:

you send to APNs, not directly to the device
device tokens can change and become invalid
delivery timing is not strictly guaranteed
app state and OS policies affect whether and how the notification is shown or processed

11.3 FCM Basics

FCM plays a similar role for Android and can also support web push scenarios.

Important details:

token lifecycle must be managed carefully
device availability and manufacturer-specific behavior can affect delivery
notification and data messages behave differently depending on app state

11.4 Token Management

Token management is one of the most common production pain points.

You need to handle:

token registration on app install or refresh
multiple tokens per user due to multiple devices
invalid or expired tokens returned by providers
token removal when users log out or uninstall

Without disciplined token hygiene, you waste money and effort sending to dead destinations.
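
A sketch of the bookkeeping this implies. `TokenRegistry` is a hypothetical in-memory structure; the calls that report invalid tokens differ between APNs and FCM and are not modeled here:

```python
class TokenRegistry:
    """Maps users to device tokens and keeps each token owned by one user."""

    def __init__(self) -> None:
        self._by_user: dict[str, set[str]] = {}
        self._owner: dict[str, str] = {}

    def register(self, user_id: str, token: str) -> None:
        previous = self._owner.get(token)
        if previous is not None and previous != user_id:
            self._by_user[previous].discard(token)  # device switched accounts
        self._owner[token] = user_id
        self._by_user.setdefault(user_id, set()).add(token)

    def remove(self, token: str) -> None:
        """Call on logout, uninstall signals, or a provider 'invalid token' response."""
        owner = self._owner.pop(token, None)
        if owner is not None:
            self._by_user[owner].discard(token)

    def tokens_for(self, user_id: str) -> set[str]:
        return set(self._by_user.get(user_id, ()))
```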

11.5 Delivery Uncertainty

Push notifications are not a guaranteed immediate-delivery channel.

Reasons:

device is offline
battery optimization delays delivery
platform may coalesce or deprioritize some notifications
user disabled permissions
token invalidated

This is why important workflows often require eventual in-app reconciliation or email fallback.

11.6 Silent Notifications

Silent notifications can wake the app to refresh content without necessarily showing a visible alert.

They are useful for:

refreshing inbox state
syncing badge counts
preloading likely-needed data

But they are heavily constrained by OS policies, so they should not be treated as a guaranteed background execution channel.

11.7 Foreground vs Background Behavior

The same push may behave differently based on app state:

foreground: app may intercept and render custom UX
background: system may show notification or deliver data silently depending on payload and permissions
terminated: behavior depends on platform rules and payload type

This is why push design requires close partnership between backend and mobile engineers.

11.8 Retries and Delivery Limitations

Your backend can retry requests to APNs or FCM when their API fails transiently, but once the provider accepts a push, actual device delivery is still probabilistic.

Therefore:

retry provider API failures sensibly
do not assume acceptance equals user seen
use collapse keys or dedupe keys for replaceable updates
rely on app open and in-app state sync for real correctness

11.9 Practical Examples

Uber-like app:

real-time in-app updates while rider is online
push for driver arrival if rider is backgrounded
fallback SMS only for critical reachability gaps or special cases

SaaS mobile app:

push for mentions, approvals, incidents, or account alerts
user preferences decide which event types generate push
app syncs authoritative state on open rather than trusting push payload alone

11.10 Common Mistakes

Common mistakes:

treating push as reliable enough to replace durable inbox state
never cleaning invalid tokens
sending too many pushes and training users to disable them
using silent push as if it were guaranteed scheduled background compute

Best practices:

maintain durable notification state separately
manage token lifecycle aggressively
use collapse and dedupe strategies for replaceable alerts
measure open rate, delivery feedback, and disablement trends

12. Retries

Retries are one of the most important and most dangerous parts of communication systems.

Done well, they recover from transient failure and make the system resilient. Done badly, they multiply load, create duplicates, and extend outages.

12.1 Why Retries Exist

Many failures are temporary:

provider timeout
brief network partition
overloaded downstream service returning 429 or 503
transient database or cache issue

Retrying later may succeed without user-visible failure.

12.2 Transient vs Permanent Failures

This distinction is critical.

Failure type	Example	Retry?
Transient	timeout, 503, temporary DNS issue	Usually yes
Capacity/rate limit	429, queue full, provider throttle	Yes, but with backoff and pacing
Permanent data error	invalid email, malformed phone number, missing template variable	No
Policy rejection	unsubscribed user, blocked sender, compliance rule	No

If you do not classify failures, you will retry garbage forever.
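
The table can be turned into a small classifier. The error-code strings and the exact status-code mapping are assumptions for illustration; real providers document their own retryable and non-retryable responses:

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY = "retry"
    RETRY_WITH_PACING = "retry_with_pacing"
    FAIL = "fail"


# Hypothetical application-level error codes that are never worth retrying.
PERMANENT_ERRORS = {"invalid_recipient", "unsubscribed", "template_variable_missing"}


def classify(status_code: Optional[int], error_code: Optional[str] = None) -> Action:
    """Classify a failed attempt; status_code is None when no response arrived."""
    if error_code in PERMANENT_ERRORS:
        return Action.FAIL
    if status_code is None:
        return Action.RETRY              # timeout or connection error
    if status_code < 400:
        raise ValueError("classify() expects a failed attempt")
    if status_code == 429:
        return Action.RETRY_WITH_PACING  # slow down as well as back off
    if status_code == 408 or 500 <= status_code < 600:
        return Action.RETRY
    return Action.FAIL                   # remaining 4xx: the request itself is wrong
```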

12.3 Retry Policies

A retry policy usually defines:

maximum attempts
initial delay
backoff multiplier
jitter strategy
retryable status codes or error classes
whether fallback provider or alternate channel should be used
what goes to dead-letter after final failure

12.4 Exponential Backoff

Exponential backoff increases delay after each failure.

Conceptually:


$$\text{delay}_n = \text{base} \times \text{factor}^{n}$$

This reduces immediate pressure on an unhealthy dependency.
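
The formula is usually combined with an upper bound so delays do not grow without limit. A sketch, assuming a zero-based attempt counter and a hypothetical `cap` parameter that is not part of the formula above:

```python
def backoff_delay(attempt: int, base: float = 1.0, factor: float = 2.0,
                  cap: float = 60.0) -> float:
    """Delay in seconds before retry number `attempt` (0 for the first retry)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    try:
        delay = base * factor ** attempt
    except OverflowError:
        return cap  # very large attempt numbers simply hit the cap
    return min(cap, delay)
```

With these defaults the first retries wait 1, 2, 4, 8, 16, and 32 seconds, and every later retry waits 60 seconds.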

12.5 Jitter

Jitter randomizes retry timing so all failed requests do not retry in lockstep.

Without jitter, a temporary outage can turn into synchronized retry spikes.

That is one of the classic ways retries make outages worse.
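
One common variant is "full jitter", where each retry waits a random amount between zero and the exponential ceiling. A sketch with hypothetical parameter names; the `rng` argument is there only so the randomness can be controlled in tests:

```python
import random


def full_jitter_delay(attempt: int, base: float = 1.0, factor: float = 2.0,
                      cap: float = 60.0, rng: random.Random = random.Random()) -> float:
    """Pick a random delay in [0, min(cap, base * factor**attempt)]."""
    ceiling = min(cap, base * factor ** attempt)
    return rng.uniform(0.0, ceiling)
```

Because every client draws its own random delay, retries from a large fleet spread across the whole interval instead of arriving in one spike.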

12.6 Dead-Letter Queues

A dead-letter queue stores items that repeatedly failed and need inspection or alternate handling.

DLQs are useful because they:

stop poison messages from blocking healthy traffic
preserve failed work for debugging
allow manual replay after fixes
provide operational visibility into systemic issues

12.7 Poison Messages

A poison message is a message that always fails because the content or logic is bad.

Examples:

template references a missing variable
payload violates provider schema
recipient state is invalid and will never pass validation

Retries will not fix poison. Classification and DLQ handling will.

12.8 Retry Storms

Retry storms happen when failure causes more traffic than success ever would.

Typical causes:

too many retries with too little backoff
many workers retrying the same provider simultaneously
client reconnect loops plus server-side retries both activating during outage
no circuit breaker or health-aware throttling

This is why retries can save systems or destroy them.
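
A circuit breaker is one way to stop workers from hammering an unhealthy provider. The sketch below is deliberately simplified and uses hypothetical names; in particular, its half-open state admits all traffic after the cool-down, where production breakers usually admit only a few probe requests:

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after consecutive failures and allows traffic again after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let requests probe the dependency again.
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe
```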

12.9 Idempotency Importance

Retries mean duplicate attempts are inevitable.

Idempotency makes duplicates safe.

Examples:

if event_id + recipient + channel already produced a send, do not send again
if a webhook delivery with the same delivery ID is retried, downstream consumer should accept duplicate receipt without double-processing
if an OTP request is superseded, older attempts should not remain valid

12.10 Queue Processing and Retry Pipeline

flowchart LR
	Q[[Notification Queue]] --> W[Worker]
	W --> TRY{Send succeeds?}
	TRY -- Yes --> DONE[Mark sent / awaiting callback]
	TRY -- No permanent --> FAIL[Mark failed permanently]
	TRY -- No transient --> MAX{Attempts exceeded?}
	MAX -- Yes --> DLQ[[Dead Letter Queue]]
	MAX -- No --> RETRY[Schedule retry with backoff + jitter]
	RETRY --> DQ[[Delayed Retry Queue]]
	DQ --> W
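
The pipeline can be sketched as a single-process loop. The exception classes, `Job`, and `process` are hypothetical names, and the delayed retry queue is collapsed into an immediate re-enqueue to keep the example short:

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


class TransientError(Exception):
    """Failure that may succeed on a later attempt."""


class PermanentError(Exception):
    """Failure that retrying cannot fix."""


@dataclass
class Job:
    payload: str
    attempts: int = 0


def process(queue: deque, send: Callable[[str], None], max_attempts: int = 3):
    """Drain the queue; return (sent, failed, dead_lettered) payload lists."""
    sent, failed, dead_lettered = [], [], []
    while queue:
        job = queue.popleft()
        job.attempts += 1
        try:
            send(job.payload)
            sent.append(job.payload)
        except PermanentError:
            failed.append(job.payload)
        except TransientError:
            if job.attempts >= max_attempts:
                dead_lettered.append(job.payload)
            else:
                queue.append(job)  # a real system would delay this with backoff + jitter
    return sent, failed, dead_lettered
```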

12.11 Best Practices

Best practices:

classify failures before retrying
use exponential backoff with jitter
cap attempts and send poison messages to DLQ
make send operations idempotent
add provider-level circuit breakers or adaptive throttling
expose retry metrics so operators can see storms forming

13. Templates

Templates are where product communication and backend correctness meet.

They are also a major source of production mistakes.

13.1 Why Templates Matter

Templates allow teams to separate event logic from presentation.

Without templates, every service hardcodes message strings and formatting logic. That creates:

duplicated content logic
harder localization
inconsistent branding
risky ad hoc changes in code deployments

13.2 Template Engines

Common template systems support:

placeholders such as user name, amount, due date
conditional content
loops for repeated items
channel-specific rendering, such as HTML email vs plain text vs push body

The engine should ideally be constrained. Overly powerful templates can become hard to reason about or insecure.

13.3 Localization

Global products need localization support:

translated strings
locale-specific date, time, number, and currency formatting
right-to-left language support where needed
per-market legal wording variations

Localization is a major reason templates should be centrally managed.

13.4 Personalization

Templates often need data from multiple sources:

user profile
event payload
account metadata
product context such as workspace or subscription plan

The challenge is ensuring all required variables are available and valid at render time.

13.5 Placeholders and Strong Contracts

A mature system treats template variables like an API contract.

Good practice:

define required and optional variables per template version
validate payload shape before enqueueing large fanouts
fail fast if critical variables are missing

This prevents discovering a missing placeholder only after a million-send campaign starts.
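
A minimal sketch of such a contract check follows. It uses `string.Template` from the Python standard library as a deliberately constrained engine. `TemplateVersion`, `validate_payload`, and `render` are hypothetical names, and rendering missing optional variables as empty strings is an assumption.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class TemplateVersion:
    name: str
    version: int
    body: str                          # uses $placeholder syntax
    required: frozenset = frozenset()
    optional: frozenset = frozenset()


def validate_payload(tpl: TemplateVersion, payload: dict) -> list:
    """Return a list of contract violations; an empty list means valid."""
    keys = set(payload)
    errors = []
    for key in sorted(tpl.required - keys):
        errors.append(f"missing required variable: {key}")
    for key in sorted(keys - tpl.required - tpl.optional):
        errors.append(f"unexpected variable: {key}")
    return errors


def render(tpl: TemplateVersion, payload: dict) -> str:
    """Fail fast on contract violations, then render."""
    errors = validate_payload(tpl, payload)
    if errors:
        raise ValueError("; ".join(errors))
    values = {key: "" for key in tpl.optional}  # assumed default for optionals
    values.update(payload)
    return Template(tpl.body).substitute(values)
```

Running `validate_payload` once against a representative payload before recipient expansion surfaces a missing variable before any job is enqueued.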

13.6 Versioning

Templates need versioning because content changes over time.

Reasons:

copy updates
regulatory changes
product redesigns
A/B tests

Versioning lets you:

replay historical sends faithfully
know exactly which content a user saw
roll back safely if a bad template is published

13.7 A/B Testing Basics

Notification systems often support experiments on:

subject line
send timing
body copy
CTA wording
channel choice

The important engineering point is that experiment assignment should be stable and observable. Otherwise analytics become misleading.
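
One common way to get stable assignment is to hash the experiment name together with the user ID. The sketch below assumes equal-weight variants. `assign_variant` is a hypothetical name.

```python
import hashlib


def assign_variant(experiment: str, user_id: str, variants: list) -> str:
    """Deterministically map (experiment, user) to one of the variants."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Because the assignment depends only on its inputs, a retried send lands in the same variant as the original attempt. Logging the chosen variant with the delivery record is what makes the assignment observable.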

13.8 Approval Workflows

At scale, template changes are often too risky for direct edit-and-send.

Companies frequently use:

draft and review states
approval workflows for legal, compliance, or brand teams
test-send environments
guarded rollout by percentage or audience segment

This is especially common for finance, healthcare, and large SaaS platforms.

13.9 Template Safety

Safety concerns include:

accidental broken placeholders
unsafe HTML content
unescaped user-generated data
overlong SMS bodies creating unexpected segmentation cost
push payload exceeding provider limits

Template safety needs validation per channel, not just generic render success.
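
As one example of a channel-specific check, the sketch below estimates how many SMS segments a rendered body will consume. It uses the standard limits of 160 characters for a single GSM-7 message (153 per part when concatenated) and 70 UTF-16 code units for a single UCS-2 message (67 per part). `sms_segments` is a hypothetical name, and the count is an estimate: it ignores the rule that an escape sequence or surrogate pair cannot be split across parts, and carriers or providers may apply their own encoding rules.

```python
# GSM 03.38 default alphabet; extension characters cost two septets each.
GSM7_BASIC = frozenset(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM7_EXTENDED = frozenset("^{}\\[~]|€\f")


def sms_segments(text: str) -> int:
    """Estimate the number of SMS parts needed for text."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in text):
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)
        single, multi = 160, 153
    else:
        # Any other character forces UCS-2; count UTF-16 code units.
        units = len(text.encode("utf-16-be")) // 2
        single, multi = 70, 67
    if units <= single:
        return 1
    return -(-units // multi)  # ceiling division
```

A validation step could reject or flag any template whose worst-case rendering exceeds an agreed segment budget. A single emoji or non-Latin character in user-generated data switches the whole message to the smaller UCS-2 limits.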

13.10 Rendering Failures

Rendering can fail because:

a required variable is missing
a localization entry is absent
the rendered HTML is malformed
an experiment assignment references an unpublished variant

These failures should be classified as permanent or configuration errors, not retried blindly forever.

13.11 Preview and Testing Systems

Good internal tooling matters.

Useful capabilities:

preview with sample payloads
render in multiple locales
device/channel preview for email, SMS, and push
validation against size limits and required variables
test send to controlled recipients

This is how companies safely manage hundreds or thousands of templates.

14. Scheduling

Scheduling is the part of notification systems that answers "not now, but later." It sounds easy until timezones, retries, DST, and exactly-once concerns appear.

14.1 Delayed Notifications and Reminders

Common scheduled notifications:

payment reminder tomorrow at 9 AM local time
retry failed webhook in 30 minutes
send digest every evening
remind user if task remains unresolved for 24 hours

Scheduling therefore exists both for product UX and operational retry workflows.

14.2 Cron vs Queue-Based Scheduling

Approach	Strengths	Weaknesses	Best fit
Cron-like scheduler	Simple for recurring global jobs	Not great for millions of per-user scheduled items	fixed recurring tasks
Delayed queue / timer wheel	Good for large numbers of future jobs	Operational complexity at very large horizons	reminders, retries, one-off schedules
Database-backed scheduler	Easy to reason about with scans	Can become expensive and contention-heavy	moderate scale scheduling

Most production systems use more than one mechanism.

14.3 Distributed Schedulers

A distributed scheduler must avoid missing jobs and also avoid running the same job many times.

Typical strategies:

partition scheduled jobs by time bucket and worker shard
use a lease/claim model so only one worker owns a due item at a time
store a dedupe or execution key so reruns are safe
enqueue work into normal queues when it becomes due rather than executing inline in the scheduler

This separation keeps the scheduler simple and lets worker fleets scale independently.
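
The lease/claim strategy can be sketched with a single conditional update. The example uses an in-memory SQLite database as a stand-in for the schedule store. The table layout and the names `make_store` and `claim_due_job` are assumptions for illustration.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory schedule store (stand-in for a real database)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs (id TEXT PRIMARY KEY, due_at REAL, "
        "owner TEXT, lease_until REAL, done INTEGER DEFAULT 0)"
    )
    return db


def claim_due_job(db: sqlite3.Connection, job_id: str, worker: str,
                  now: float, lease_s: float = 30.0) -> bool:
    """Try to take the lease; True means this worker owns the job for now."""
    cur = db.execute(
        "UPDATE jobs SET owner = ?, lease_until = ? "
        "WHERE id = ? AND done = 0 AND due_at <= ? "
        "AND (owner IS NULL OR lease_until <= ?)",
        (worker, now + lease_s, job_id, now, now),
    )
    db.commit()
    return cur.rowcount == 1  # exactly one row changed means the claim won
```

The update succeeds for only one worker at a time because the ownership condition and the write happen in one statement. If the owner crashes, the lease expires and another worker can claim the job, which is why the downstream work still needs an execution key.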

14.4 Retry Scheduling

Retry scheduling is just scheduling with shorter horizons and stricter idempotency.

Important considerations:

next attempt time should be explicit
attempts should be capped
retry state must survive worker restarts
delayed retries should not block primary queue throughput

14.5 Timezone Handling

Timezone bugs are common because user-local scheduling sounds simple but is not.

Examples:

send at 9 AM in the user's configured timezone
user changes timezone between scheduling and execution
organization policy uses workspace timezone instead of user timezone

The scheduler must define which timezone source is authoritative.

14.6 DST Problems

Daylight saving time creates two classic edge cases:

a local clock time that does not exist
a local clock time that occurs twice

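If you schedule "2:30 AM local time" on a day when the clocks change, that time may not exist at all or may occur twice. The system has to decide what the schedule means in each case.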

Production systems need explicit behavior, such as:

shift to the next valid local time
pin to UTC internally after initial local conversion
document recurrence rules clearly
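
Both edge cases can be detected before a schedule is stored. The sketch below uses the standard library `zoneinfo` module, which needs time zone data on the host. `classify_local_time` is a hypothetical name, and the dates in the usage note are the 2024 transitions for `America/New_York`.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive: datetime, tz_name: str) -> str:
    """Return "nonexistent", "ambiguous", or "normal" for a wall-clock time."""
    tz = ZoneInfo(tz_name)
    local = naive.replace(tzinfo=tz)
    # A wall time inside a DST gap does not survive a round trip through UTC.
    round_trip = local.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) != naive:
        return "nonexistent"
    # A repeated wall time has a different UTC offset on its second occurrence.
    if local.utcoffset() != local.replace(fold=1).utcoffset():
        return "ambiguous"
    return "normal"
```

For example, `classify_local_time(datetime(2024, 3, 10, 2, 30), "America/New_York")` falls in the spring gap and `datetime(2024, 11, 3, 1, 30)` occurs twice in the autumn. What the scheduler does with each result is a product decision, which is why the behavior should be explicit.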

14.7 Calendar Edge Cases

Other calendar issues:

month-end behavior, such as "send on the 31st"
leap years
business-day-only rules
local holidays for campaign or billing workflows

These matter more than many candidates expect, especially in financial and enterprise systems.
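
One possible rule for "send on the 31st" is to clamp to the last day of shorter months. This is only one policy (skipping the month is another), and `monthly_send_date` is a hypothetical name.

```python
import calendar
from datetime import date


def monthly_send_date(year: int, month: int, day_of_month: int) -> date:
    """Clamp the requested day to the last day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day_of_month, last_day))
```

With this rule, a schedule for the 31st runs on 29 February in a leap year and on 28 February otherwise.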

14.8 Campaign Scheduling

Large campaigns introduce extra complexity:

recipient expansion can be huge
provider rate limits require pacing
rollout may need regional waves
cancellation or pause controls are necessary
duplicate prevention across retries and partial runs matters

A campaign system is often more like a controlled distributed batch pipeline than a simple timer.
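
Pacing against provider rate limits is often done with a token bucket. The sketch below is a minimal single-process version with an injected clock so it can be tested deterministically. `TokenBucket` and its parameters are assumptions, and a real campaign system would typically keep this state in a shared store.

```python
class TokenBucket:
    """Allow up to `burst` sends at once, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = burst   # start full
        self.last = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A campaign worker that fails to acquire a token would re-enqueue the job with a short delay instead of calling the provider. Giving transactional traffic its own bucket keeps a large campaign from starving it.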

14.9 Exactly-Once Challenges

Scheduling systems rarely achieve true exactly-once delivery end-to-end.

More realistic goal:

enqueue due work at least once
make downstream processing idempotent
store execution markers to avoid duplicate user-visible effects

That is the same practical pattern seen across broader distributed systems.

14.10 A Practical Scheduling Architecture

flowchart LR
	REQ[Create reminder / delayed notification] --> SCHED[(Schedule Store)]
	SCHED --> SCAN[Scheduler Workers]
	SCAN --> DUE{Now due?}
	DUE -- No --> SCHED
	DUE -- Yes --> Q[[Delivery Queue]]
	Q --> W[Notification Worker]
	W --> CH[Channel Provider]
	W --> MARK[(Execution / Idempotency Store)]

15. Comparing Channels in Practice

When interviewers ask communication questions, they often want channel tradeoffs, not only definitions.

15.1 WebSockets vs SSE vs Long Polling

Question	WebSockets	SSE	Long Polling
Need bidirectional low-latency interaction?	Best choice	Not supported (server-to-client only)	Poor fit
Need simple server-to-browser streaming?	Works, but may be overkill	Often best	Acceptable fallback
Need broad compatibility with older infrastructure?	Sometimes harder	Usually okay	Best compatibility
Cost at high connection counts	Persistent connection memory	Persistent HTTP stream cost	Highest repeated request overhead
Typical examples	chat, collaboration, multiplayer-like UX	feeds, logs, dashboards, status	legacy updates, fallback transports

15.2 Email vs SMS vs Push Notifications

Channel	Strengths	Weaknesses	Best uses
Email	Rich content, durable inbox, cheap at scale relative to SMS	Deliverability complexity, slower user attention	receipts, alerts, digests, workflow updates
SMS	Very high reach, urgent attention	Expensive, abuse-prone, region-specific constraints	OTP, critical alerts, fallback reachability
Push	Low cost, strong mobile engagement	Delivery uncertainty, token churn, permission dependence	mobile app alerts, reminders, activity nudges

A strong production design usually uses channel hierarchy rather than one channel for everything.

Example:

real-time in-app if user is active
push if mobile user is inactive
email for durable summary or critical account record
SMS only for urgent or high-assurance scenarios
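
The hierarchy above can be sketched as a small selection function. The `Recipient` fields, the `choose_channels` name, and the email fallback when nothing else applies are assumptions for illustration. A real orchestrator would also apply user preferences and suppression rules.

```python
from dataclasses import dataclass


@dataclass
class Recipient:
    active_in_app: bool
    has_push_token: bool
    allows_sms: bool


def choose_channels(r: Recipient, urgent: bool, needs_record: bool) -> list:
    """Pick channels in priority order for one notification."""
    channels = []
    if r.active_in_app:
        channels.append("in_app")      # user is looking at the product now
    elif r.has_push_token:
        channels.append("push")        # inactive but reachable on mobile
    if needs_record:
        channels.append("email")       # durable copy for the account record
    if urgent and r.allows_sms:
        channels.append("sms")         # reserved for urgent cases
    return channels or ["email"]       # assumed fallback when nothing applies
```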

16. How These Systems Connect in Real Architectures

The biggest interview mistake is describing each component in isolation.

Production systems connect them.

16.1 Chat Application Architecture

flowchart TB
	U1[Sender Client] --> GW1[Realtime Gateway]
	GW1 --> MSG[Message Service]
	MSG --> DB[(Message Store)]
	MSG --> BUS[[Room Event Bus]]
	BUS --> GW2[Realtime Gateways]
	GW2 --> U2[Online Recipients]
	BUS --> NOTIF[Notification Orchestrator]
	NOTIF --> PUSH[Push / Email for offline users]
	GW2 --> PRES[Presence Service]
	PRES --> CACHE[(Presence TTL Store)]

Key insight:

live transport handles active users
durable message storage handles history and offline recovery
presence influences whether to send push or keep delivery in-app

16.2 Ride Tracking Architecture

Uber-like example:

driver app emits location updates
ingestion service validates and writes them to a stream
geo or trip service computes rider-visible state
active rider session receives real-time updates through WebSocket or SSE
if rider is backgrounded, notification system may send push for major trip milestones

The real-time system and notification system are therefore two views over the same event stream.

16.3 SaaS Workflow Notifications

GitHub or Stripe style example:

domain event occurs, such as comment added or payment failed
write the event through the outbox pattern so it is committed in the same transaction as the business change and is not lost if publishing fails
event bus feeds both in-app real-time feed and notification orchestrator
preferences and suppression rules decide channel
durable inbox and email/push/SMS state are updated independently

This architecture is robust because the event is authoritative and downstream communication is decoupled.
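
The outbox step can be sketched with two tables in one transaction. The example uses an in-memory SQLite database. The table names, the `payment.failed` event type, and the functions `make_db`, `mark_payment_failed`, and `drain_outbox` are assumptions for illustration.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT)")
    db.execute(
        "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)"
    )
    return db


def mark_payment_failed(db: sqlite3.Connection, payment_id: str) -> None:
    """Write the business change and its event in one transaction."""
    with db:  # commits both rows together, or rolls both back on error
        db.execute(
            "INSERT OR REPLACE INTO payments (id, status) VALUES (?, 'failed')",
            (payment_id,),
        )
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("payment.failed", json.dumps({"payment_id": payment_id})),
        )


def drain_outbox(db: sqlite3.Connection, publish) -> int:
    """Publish unpublished events in order, marking each one afterwards."""
    rows = db.execute(
        "SELECT seq, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Because an event is marked only after `publish` returns, a crash between the two steps causes the event to be published again on the next drain. That is at-least-once delivery, so consumers still need idempotent handling.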

16.4 Notification Delivery Pipeline

sequenceDiagram
	participant P as Producer Service
	participant O as Outbox / Event Bus
	participant N as Notification Orchestrator
	participant R as Rules + Preferences
	participant Q as Channel Queue
	participant W as Worker
	participant X as Provider
	participant S as Status Store

	P->>O: Publish domain event
	O->>N: Deliver event
	N->>R: Resolve recipients, channels, suppression
	R-->>N: Eligible deliveries
	N->>Q: Enqueue channel jobs with idempotency key
	Q->>W: Deliver job
	W->>X: Send via provider
	X-->>S: Callback / receipt / bounce / failure
	S-->>N: Update notification state and metrics

17. What Breaks at Scale

Large communication systems usually break in predictable ways.

17.1 Real-Time System Failures

reconnect storms after deploys or regional blips
hotspot channels or rooms causing shard imbalance
slow consumers building large outbound queues
stale presence from delayed disconnect detection
message replay windows too short for realistic mobile disconnects

17.2 Notification System Failures

duplicate sends from missing idempotency
retry storms after provider outage
dead-letter growth due to poison templates or bad payload contracts
provider callback backlog causing stale delivery state
campaign sends overwhelming rate limits and delaying transactional traffic

17.3 Organizational Failures

every team building its own notification logic
no shared preference model
weak observability into delivery stages
no clear ownership of templates or compliance-sensitive content

Strong engineering organizations build communication systems as platforms because the problems repeat across products.

18. Best Practices and Common Mistakes

18.1 Best Practices

separate authoritative business events from channel delivery
choose the simplest transport that satisfies product requirements
keep real-time transport distinct from durable history or durable inbox state
use sequence numbers, cursors, or snapshots for reconciliation
make retries idempotent and classify failures carefully
use ephemeral stores with TTL for active presence
centralize user preferences, suppression, and dedupe logic
isolate transactional and bulk notification paths
measure end-to-end latency, not just enqueue success

18.2 Common Mistakes

assuming WebSockets imply reliable message delivery by themselves
treating online state as exact truth
sending every possible notification instead of designing for user attention
retrying permanent failures forever
ignoring provider callbacks and token invalidation
letting campaigns compete with critical operational messages
storing all heartbeat or presence writes durably and overwhelming the database

19. How to Explain This in an Interview

If asked to design or explain a communication system, structure the answer in this order:

define the product requirement and latency expectation
choose transport based on interaction style
define durability and delivery guarantees
explain fanout and scaling approach
cover reconnect, retry, and idempotency behavior
discuss presence or offline fallback if relevant
mention observability, rate limits, and operational failure modes

Example framing:

"For active users, I would use WebSockets because the product needs bidirectional low-latency updates. I would keep durable state in the message store and use sequence numbers for replay after reconnect. Presence would be managed through heartbeats in a TTL-based store with grace periods. For offline users, I would feed the same events into a notification orchestrator that applies preferences, dedupe, and channel-specific retries for push, email, or SMS."

That answer sounds like someone who has moved beyond glossary knowledge.

20. Final Mental Model

Communication systems are not only about sending data. They are about delivering the right information:

to the right user
through the right channel
with the right urgency
at the right cost
with acceptable reliability and ordering
without overwhelming infrastructure or users

The best backend systems treat communication as a first-class architecture concern, not a thin wrapper around providers or sockets.

If you remember one core idea, remember this:

real-time delivery and notifications should be designed as downstream communication layers built on top of authoritative events, with explicit handling for fanout, retries, ordering, idempotency, presence, and user preference.

That is the difference between a toy design and a production one.


