Communication Systems
Communication systems are the part of backend engineering that moves information from one actor to another with the right combination of speed, reliability, cost, and user experience.
That sounds simple until you try to build one in production.
At small scale, communication looks like:
- send a message to a browser
- notify a mobile app that something changed
- email a receipt
- send an OTP SMS
- show whether a user is online
At production scale, the real questions become harder:
- how fast must the update arrive before the product feels broken?
- what happens if the user disconnects halfway through?
- how do you keep millions of long-lived connections open?
- how do you avoid sending the same notification five times?
- how do you retry safely without creating retry storms?
- how do you respect user preferences, cost constraints, and compliance requirements?
- how do you fan one event out to millions of recipients without melting the system?
This is why communication systems are a major system design topic. They sit at the boundary between users and backend state, so every weakness becomes visible quickly: latency, stale data, duplicate delivery, missing messages, ghost presence, notification fatigue, and operational complexity.
This guide is written for two goals at once:
- interview preparation, where you need to explain tradeoffs clearly and structure your answer well
- real backend engineering, where you need to design systems that survive load, failure, and product growth
The focus here is not memorizing definitions. The focus is understanding why these systems exist, how they work internally, what breaks at scale, and how companies such as Google, Netflix, Uber, Amazon, GitHub, Stripe, Slack, WhatsApp, Discord, and typical SaaS platforms think about them in practice.
1. Big Picture: What Communication Systems Solve
Communication systems usually fall into two overlapping families:
- real-time systems, where data should move to users quickly while they are actively connected
- notification systems, where events should be delivered across channels such as in-app, email, SMS, or push, even if the user is not currently online
In real products, these are not separate worlds. They are usually fed by the same event streams.
Example:
- a ride status changes from driver_arrived to trip_started
- the rider app needs a real-time UI update
- the driver app may need confirmation
- analytics pipelines need the event
- customer support tools may need the updated timeline
- if the rider is offline, a push notification may be needed
One state transition can trigger multiple communication paths.
flowchart LR
A[Business Event<br/>order placed / trip updated / message sent] --> B[[Event Bus / Outbox]]
B --> C[Real-Time Delivery Layer]
B --> D[Notification Orchestrator]
B --> E[Analytics / Audit / Logging]
C --> F[WebSocket / SSE / Long Polling Clients]
D --> G[In-App Inbox]
D --> H[Email Provider]
D --> I[SMS Provider]
D --> J[Push Provider]
The architectural pattern is consistent across many companies:
- business systems emit authoritative events
- communication infrastructure decides who should be told, through which channel, and with what guarantee
- channel-specific delivery systems handle protocol differences and provider failures
1.1 The Core Design Tension
Communication systems are always balancing five things:
| Concern | What you want | Why it is hard |
|---|---|---|
| Latency | Users see updates quickly | Fast delivery usually requires persistent state and more expensive infrastructure |
| Reliability | Important messages are not lost | Retries create duplicates unless the system is idempotent |
| Ordering | Updates arrive in sensible order | Distributed systems naturally reorder across partitions, retries, and reconnects |
| Cost | Delivery is affordable | SMS is billed per message, push and email providers enforce rate limits, and persistent connections consume memory |
| User experience | Users are informed but not annoyed | Over-delivery causes fatigue; under-delivery makes the product feel broken |
Strong system design answers acknowledge all five.
1.2 Interview Framing
If an interviewer asks about communication systems, they are usually testing whether you can think beyond "send message from A to B."
They want to hear that you understand:
- latency targets depend on the product
- transport choice depends on directionality and interaction pattern
- delivery guarantees must be designed, not assumed
- large fanout requires asynchronous pipelines and backpressure control
- presence and online status are approximate, not perfect truth
- notification systems require orchestration, deduplication, preferences, and retry strategy
The best answers sound like this:
"I would separate the authoritative business event from channel delivery. Then I would choose the transport based on interaction style, add idempotency and retry controls, and design for fanout, reconnects, and per-user preference rules."
2. Real-Time Systems Overview
Real-time systems are systems where the value of data depends heavily on how quickly it arrives.
In backend interviews, "real-time" almost always means internet-scale soft real-time systems, not embedded hard real-time control systems.
2.1 What Real-Time Means in Backend Systems
In a web or mobile product, real-time usually means the update should arrive fast enough that the user feels the application is live.
Examples:
- chat messages should appear almost immediately
- typing indicators should feel instant
- ride location should update every few seconds
- collaborative edits should converge with minimal visible lag
- market dashboards should feel current enough for decisions
- multiplayer game state should update continuously, though serious gaming systems often use specialized protocols beyond basic web stacks
"Real-time" is therefore a product requirement, not a single protocol.
2.2 Soft Real-Time vs Hard Real-Time
| Type | Meaning | Typical examples | Backend interview relevance |
|---|---|---|---|
| Hard real-time | Missing a deadline is a correctness failure | avionics, industrial control, some medical systems | Usually not what web/backend interviews mean |
| Soft real-time | Missing a deadline degrades user experience, but the system still functions | chat, live dashboards, collaborative tools, ride tracking | The common system design case |
Hard real-time systems care about deterministic deadlines. Soft real-time systems care about low latency and bounded staleness.
For backend products, the question is usually:
"How stale can data become before the product feels wrong?"
That is a much more practical framing than arguing whether the system is literally real-time.
2.3 Why Real-Time Communication Matters
Real-time communication matters because it changes product behavior:
- it makes collaboration feel shared instead of delayed
- it reduces the need for manual refreshes
- it keeps users engaged during active workflows
- it improves trust in stateful experiences such as deliveries, rides, or build pipelines
- it enables products whose value depends on immediacy, such as incident dashboards or trading views
Poor real-time behavior is visible immediately.
Users may say:
- "the chat is laggy"
- "the driver marker jumps randomly"
- "I got the notification but the app still shows old state"
- "it says they are online but they are not responding"
These are communication-system failures, not merely UI issues.
2.4 Latency Expectations by Product
There is no universal latency target. It depends on the human workflow.
| Use case | Typical expectation | What users notice |
|---|---|---|
| Typing indicators | 50 to 300 ms | Anything slower feels fake |
| Chat message delivery | under 500 ms ideal | Multi-second delay feels broken |
| Collaborative editing | roughly 100 to 300 ms for local echo and convergence | Delayed merge or cursor motion is obvious |
| Ride or delivery tracking | 1 to 5 seconds is often acceptable | Long freezes reduce trust |
| Live operational dashboard | 1 to 10 seconds depending on use | Depends on decision criticality |
| Financial/trading dashboards | often sub-second to a few hundred ms for UX, though core trading infra may require much stricter guarantees | Stale numbers can be dangerous |
A good interview answer explicitly sets a target. If you do not define the acceptable staleness, the rest of the design becomes vague.
2.5 Consistency Challenges
Real-time systems are fundamentally consistency problems with a time dimension.
Common challenges:
- the backend state changes faster than clients can consume updates
- multiple clients observe updates from different replicas or regions
- reconnecting clients may miss some events and need replay
- the UI may apply optimistic updates before the server confirms them
- ordering may break when events are retried, sharded, or merged from multiple sources
This means you should think in layers:
- authoritative state lives in the source-of-truth system
- real-time delivery is a fast propagation layer, not the only truth
- clients often need version numbers, sequence numbers, or snapshots to reconcile with source-of-truth state
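One way to make that reconciliation concrete is to tag every state snapshot with a version and have the client ignore anything that is not newer. The sketch below is illustrative only: VersionedView and its fields are hypothetical names, and it assumes the server assigns a monotonically increasing version per entity.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedView:
    """Client-side copy of one entity plus the version it reflects."""
    version: int = -1
    state: dict = field(default_factory=dict)

    def apply(self, version: int, state: dict) -> bool:
        # Stale or duplicate snapshot (late retry, slow replica): ignore it.
        if version <= self.version:
            return False
        # Newer snapshot: replace local state wholesale.
        self.version = version
        self.state = dict(state)
        return True


view = VersionedView()
view.apply(7, {"status": "trip_started"})
view.apply(6, {"status": "driver_arrived"})  # arrives late, is ignored
```

Because each update carries full state rather than a delta, a late or repeated update can never move the client backwards. It only gets dropped.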
2.6 Ordering Guarantees
Ordering sounds simple until the system becomes distributed.
Useful distinctions:
- per-connection ordering: messages on a single TCP connection arrive in order
- per-partition ordering: messages within one partition of Kafka or a similar log are read in the order they were appended
- per-room or per-entity ordering: feasible if you shard consistently by room, stream, or entity ID
- global ordering: usually expensive, unnecessary, or impossible at scale
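Per-entity ordering usually comes down to routing every event for the same entity to the same partition. A minimal sketch, assuming a fixed partition count and string entity IDs (partition_for is a hypothetical helper, not any particular library's API):

```python
import hashlib


def partition_for(entity_id: str, num_partitions: int) -> int:
    """Map an entity ID to a stable partition index.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is randomized per process for strings and so is not stable
    across servers or restarts.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Every event for room-42 lands on the same partition, so a single
# consumer sees that room's events in the order they were appended.
p1 = partition_for("room-42", 16)
p2 = partition_for("room-42", 16)
```

Note that changing the partition count remaps entities, so real systems either fix the count up front or use a scheme such as consistent hashing.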
Interviewers often like hearing this sentence:
"I would aim for per-entity ordering, not global ordering, because global ordering becomes a bottleneck and is rarely needed by the product."
2.7 Delivery Guarantees
Real-time systems usually choose one of these practical models:
| Guarantee | Meaning | Reality in production |
|---|---|---|
| At-most-once | Message may be lost, but not duplicated | Simple and low overhead; common for low-value ephemeral events like typing |
| At-least-once | Message may be duplicated but should eventually arrive | Common when reliability matters and clients can dedupe |
| Exactly-once | Message is processed once end-to-end | Usually approximated with idempotency, dedupe keys, and transactional boundaries rather than achieved literally |
Important point: a transport like WebSocket does not magically provide business-level delivery guarantees. It only moves bytes on a connection. Business-level guarantees require acknowledgments, persistence, retries, dedupe, and replay logic.
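In practice, at-least-once delivery plus receiver-side deduplication is how most systems approximate exactly-once. The sketch below assumes each message carries a unique ID chosen by the sender; Deduplicator is a hypothetical name, and the bounded memory window is an assumption of this example.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers recently seen message IDs, bounded to max_ids entries."""

    def __init__(self, max_ids: int = 10_000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def first_time(self, message_id: str) -> bool:
        """Return True only the first time an ID is seen within the window."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)  # keep recently seen IDs longest
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the least recently seen ID
        return True


dedupe = Deduplicator()
for message_id in ["m1", "m2", "m1"]:  # "m1" is redelivered by a retry
    if dedupe.first_time(message_id):
        pass  # apply the message exactly once here
```

The window size is the tradeoff: a duplicate that arrives after its ID has been evicted will be processed again, so the window should comfortably exceed the sender's retry horizon.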
2.8 Fanout Challenges
Fanout means taking one event and sending it to many recipients.
Examples:
- one new chat message goes to every participant in a channel
- one sports score update goes to millions of viewers
- one incident event updates many dashboards
Fanout is where many designs fail, because the cost of one event is no longer constant.
If a message goes to 3 users, the cost is small. If it goes to 3 million users, the delivery system becomes the problem.
Production systems handle this with combinations of:
- pub/sub layers
- shard-aware connection gateways
- regional fanout
- batching or aggregation for low-priority updates
- per-subscriber filtering so only interested users receive the event
2.9 Scaling Persistent Connections
Persistent connections are not free.
Each active connection consumes:
- file descriptors
- kernel socket buffers
- user-space memory for connection metadata and outbound queues
- CPU for heartbeats, TLS, parsing, and serialization
At one million connections, even small per-connection cost matters.
If one connection costs only 20 KB of memory for total state, one million connections already imply roughly 20 GB of memory before business logic overhead.
That is why large-scale real-time systems often use dedicated gateway tiers optimized for connection handling, event loops, and lightweight session metadata.
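The arithmetic behind that estimate is worth keeping handy for capacity planning. A small sketch, using decimal units (1 KB = 1,000 bytes, 1 GB = 10^9 bytes); the function name is made up for illustration:

```python
def connection_memory_gb(connections: int, bytes_per_connection: int) -> float:
    """Total memory for connection state alone, in decimal gigabytes."""
    return connections * bytes_per_connection / 1e9


# 1,000,000 connections x 20,000 bytes each
print(connection_memory_gb(1_000_000, 20_000))  # 20.0
```

The same function makes it easy to see how sensitive the total is to per-connection cost: halving the footprint to 10 KB halves the fleet memory needed for connection state.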
2.10 Stateless vs Stateful Challenges
HTTP APIs are often designed to be stateless. Real-time connection layers are not.
A persistent connection pins a client to a specific server process for some period of time. That creates state:
- which user is attached to which gateway
- which rooms/channels the connection is subscribed to
- what the last acknowledged sequence number was
- what outbound messages are buffered
This creates operational consequences:
- load balancers may need sticky behavior for established connections
- connection draining becomes important during deploys
- reconnect storms can overload specific shards
- state often must be externalized partially to Redis, Kafka, or another coordination layer
The usual production compromise is:
- stateful edge or gateway layer for live connections
- stateless application layer for most business logic
- shared pub/sub or stream infrastructure connecting the two
flowchart LR
C[Clients] --> LB[Load Balancer]
LB --> G1[Realtime Gateway 1]
LB --> G2[Realtime Gateway 2]
LB --> G3[Realtime Gateway 3]
G1 --> PS[[Pub/Sub / Stream Bus]]
G2 --> PS
G3 --> PS
PS --> APP[Business Services]
APP --> DB[(Source of Truth DB)]
2.11 Common Interview Examples
Typical products that surface these issues:
- chat systems such as Slack, WhatsApp, Discord
- trading or operational dashboards used by finance or SRE teams
- collaborative editors such as Google Docs
- ride tracking systems such as Uber or delivery apps
- multiplayer or shared presence systems
Each of these has different priorities:
- chat needs reasonable ordering and persistence
- collaborative editing needs conflict resolution and convergence
- ride tracking needs geo-update pipelines and bounded staleness
- dashboards often need efficient broadcast and backpressure handling
3. WebSockets
WebSockets are the standard choice when a browser or app needs persistent, bidirectional, low-latency communication over a single long-lived connection.
3.1 What WebSockets Are
WebSocket is a protocol that starts as an HTTP request and then upgrades the same TCP connection into a long-lived, full-duplex channel that both sides can use to send messages independently.
Why this matters:
- traditional HTTP is request-response oriented
- many interactive systems need the server to push updates without waiting for a fresh request
- some systems also need the client to send frequent messages without paying repeated HTTP overhead
Examples:
- chat input and delivery
- collaborative cursor updates
- game or live collaboration events
- dashboard subscriptions where clients may also send control messages
3.2 Why WebSockets Exist
Before WebSockets, developers used polling, long polling, or various hacks to simulate server push.
Those approaches had problems:
- repeated request headers and repeated TLS work
- higher latency between event generation and client receipt
- difficulty handling truly bidirectional traffic
- more server overhead due to connection churn
WebSockets exist because web applications increasingly behaved like continuously connected applications rather than static pages.
3.3 HTTP Upgrade Flow
The connection begins as HTTP and upgrades.
sequenceDiagram
participant C as Client
participant LB as Load Balancer
participant G as WebSocket Gateway
C->>LB: HTTP GET with Upgrade: websocket
LB->>G: Forward upgrade request
G->>G: Authenticate user, validate origin, allocate session
G-->>LB: 101 Switching Protocols
LB-->>C: 101 Switching Protocols
C->>G: WebSocket frames
G->>C: WebSocket frames
Typical details involved in practice:
- the client connects with ws:// or wss://; production typically uses wss://
- authentication may be done via cookie, short-lived token, or signed session token
- the gateway may register the connection in a presence or subscription store
- after upgrade, messages are framed according to the WebSocket protocol instead of normal HTTP response semantics
3.4 Connection Lifecycle
A production WebSocket connection usually has this lifecycle:
- connect and authenticate
- subscribe to channels, rooms, or entities
- exchange bidirectional messages
- send heartbeats to detect broken connections
- handle temporary disconnects and reconnects
- replay missed messages or fetch a snapshot if needed
Designing only the message exchange is not enough. Most operational pain comes from the later stages: heartbeats, reconnects, and replay.
3.5 Heartbeats, Ping-Pong, and Idle Detection
Long-lived connections fail silently all the time.
Reasons include:
- mobile device sleep
- NAT or proxy idle timeout
- Wi-Fi to cellular network switch
- gateway crash
- browser tab suspension
- broken half-open TCP connection
This is why WebSocket systems usually use heartbeat messages or protocol-level ping-pong frames.
Heartbeats help answer two questions:
- is the connection still alive?
- should the system still consider this client online?
Without heartbeats, you get ghost sessions that appear connected long after the client is gone.
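A gateway can answer both questions with a last-seen timestamp per connection and a periodic sweep. The sketch below is a simplified illustration: HeartbeatTracker is a hypothetical class, the 30-second timeout is an arbitrary assumption, and the clock is injectable so the logic can be tested without sleeping.

```python
import time


class HeartbeatTracker:
    """Tracks the last heartbeat per connection and reaps silent ones."""

    def __init__(self, timeout_seconds: float = 30.0, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen = {}

    def record(self, connection_id: str) -> None:
        # Call on every heartbeat, pong, or application message.
        self._last_seen[connection_id] = self._clock()

    def is_online(self, connection_id: str) -> bool:
        return connection_id in self._last_seen

    def reap_expired(self) -> list:
        """Remove and return connections silent for longer than the timeout."""
        now = self._clock()
        expired = [
            cid for cid, seen in self._last_seen.items()
            if now - seen > self._timeout
        ]
        for cid in expired:
            del self._last_seen[cid]
        return expired
```

A real gateway would also close the underlying socket for each reaped connection and publish a presence change, usually after a short grace period so brief network blips do not flap the user's status.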
3.6 Reconnect Strategies
Reconnect behavior is a major part of real-world quality.
Bad reconnect strategy:
- immediate reconnect loop
- no jitter
- no session resume
- no replay of missed messages
This creates reconnect storms during deploys or brief outages.
Good reconnect strategy usually includes:
- exponential backoff
- random jitter
- session tokens that expire quickly but can be refreshed safely
- a cursor, offset, or last-seen sequence number so missed updates can be replayed
- a fallback snapshot fetch if the replay window has expired
For example, a chat client may reconnect with "last message sequence = 8421" so the server can resend messages 8422 onward, or instruct the client to resync from the room history API.
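The backoff-and-jitter part can be sketched in a few lines. This version picks a uniformly random delay between zero and an exponentially growing ceiling, an approach sometimes called full jitter. The base and cap values are assumptions for illustration, and the random source is injectable for testing.

```python
import random


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                    rng=random.random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles with each attempt and is capped; the actual delay
    is a random fraction of that ceiling so that clients disconnected at
    the same moment do not all reconnect at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


# Ceilings for the first few attempts: 0.5 s, 1 s, 2 s, 4 s, ... up to 30 s.
delays = [reconnect_delay(attempt) for attempt in range(5)]
```

The client should reset the attempt counter only after a connection has stayed healthy for a while; resetting immediately on connect brings back the tight reconnect loop when the server accepts and then drops connections.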
3.7 Backpressure Handling
Backpressure means the producer is sending faster than the consumer can handle.
In WebSocket systems, this often shows up as:
- a client on a slow network
- a dashboard subscribed to too many noisy streams
- a fanout burst after a major event
- a gateway trying to write faster than the kernel socket buffer drains
If you ignore backpressure, memory usage grows because outbound queues grow.
Common handling strategies:
- per-connection outbound queue limits
- drop low-priority ephemeral updates such as typing indicators or cursor positions
- coalesce updates so only the latest state is sent for fast-changing metrics
- disconnect very slow consumers after threshold violations
- split high-volume topics into separate subscription classes
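Two of these strategies, bounded queues and coalescing, combine naturally into one small data structure. The following is a sketch under assumptions: CoalescingQueue is a hypothetical per-connection buffer, keys identify a piece of fast-changing state (a metric name, a cursor owner), and the caller decides what to do when an offer is rejected.

```python
from collections import OrderedDict


class CoalescingQueue:
    """Bounded outbound buffer that keeps only the latest value per key."""

    def __init__(self, max_keys: int):
        self._pending = OrderedDict()
        self._max_keys = max_keys

    def offer(self, key: str, value) -> bool:
        """Queue an update; return False if the buffer is full."""
        if key in self._pending:
            # Coalesce: overwrite the stale value, keep the queue position.
            self._pending[key] = value
            return True
        if len(self._pending) >= self._max_keys:
            # Full: the caller may drop the update or disconnect the client.
            return False
        self._pending[key] = value
        return True

    def drain(self) -> list:
        """Return pending (key, value) pairs in order and clear the buffer."""
        items = list(self._pending.items())
        self._pending.clear()
        return items


queue = CoalescingQueue(max_keys=2)
queue.offer("cpu", 10)
queue.offer("cpu", 12)   # replaces 10; the slow client never sees it
queue.offer("mem", 5)
```

Memory per connection is now bounded by the number of distinct keys rather than by the event rate, which is the property that matters during a fanout burst.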
An important interview line:
"For rapidly changing state, I would prefer sending the latest state rather than every intermediate event, because preserving every micro-update can create unnecessary backpressure."
3.8 Scaling Millions of Connections
At large scale, the architecture is usually not "application servers also do WebSockets." Instead it becomes a specialized gateway tier.
flowchart TB
U[Clients] --> LB[Global Load Balancer]
LB --> R1[Realtime Gateway Pool A]
LB --> R2[Realtime Gateway Pool B]
R1 --> BUS[[Kafka / Redis PubSub / NATS / Custom Bus]]
R2 --> BUS
BUS --> CHAT[Chat Service]
BUS --> PRES[Presence Service]
BUS --> FEED[Live Feed Service]
CHAT --> DB[(Message Store)]
PRES --> CACHE[(Ephemeral Presence Store)]
FEED --> TS[(Metrics / Event Store)]
Key design patterns:
- gateways hold active connections and subscription state
- business services publish events to a bus rather than pushing directly to every gateway
- gateways subscribe to only the shards or channels they need
- sharding is often based on user ID, room ID, organization ID, or topic
- regional edge clusters reduce latency and avoid routing every message through one region
Operational concerns at this scale:
- file descriptor limits
- efficient event-loop networking, such as epoll or kqueue based runtimes
- TLS termination cost
- per-connection memory footprint
- deployment draining and connection migration
- DDoS and abuse protection
3.9 Sticky Sessions and Their Implications
WebSocket connections are naturally sticky after they are established because the connection lives on one server.
Two common patterns exist:
- sticky routing only for the lifetime of the connection, with minimal external session state
- externalized session or subscription metadata so another node can recover more easily after reconnect
Tradeoff:
- keeping more state in-memory is fast but makes failover harder
- externalizing more state improves recoverability but increases read/write overhead
Most production systems use a hybrid approach.
3.10 Pub/Sub Integration
WebSockets are a transport. They are rarely the source of truth.
The common production flow is:
- a business event happens, such as a new chat message
- the source-of-truth service persists it
- the service publishes an event to a bus
- the relevant real-time gateways receive the event
- the gateways push it to subscribed clients
This avoids tight coupling between the message store and every connected gateway.
Chat example:
- message written to durable message store
- room event published to Kafka or another bus
- gateways with members of that room fan it out
- offline users do not receive live delivery, but the message remains in durable history
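The chat flow above can be modeled in memory to show the separation of concerns. Everything here is a stand-in: Bus plays the role of Kafka or another bus, the history dict plays the role of the durable message store, and Gateway records what it would have pushed over live connections. All class and method names are hypothetical.

```python
from collections import defaultdict


class Bus:
    """Toy synchronous pub/sub bus keyed by topic."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self._handlers[topic]):
            handler(event)


class ChatService:
    """Source of truth: persists first, then publishes."""

    def __init__(self, bus):
        self.bus = bus
        self.history = defaultdict(list)  # stand-in for a durable store

    def send(self, room, text):
        message = {"room": room, "seq": len(self.history[room]) + 1, "text": text}
        self.history[room].append(message)  # 1. persist
        self.bus.publish(room, message)     # 2. publish for live delivery
        return message


class Gateway:
    """Holds live connections and fans room events out to them."""

    def __init__(self, bus):
        self.bus = bus
        self.delivered = defaultdict(list)  # user -> messages pushed live
        self._members = defaultdict(set)    # room -> connected users

    def connect(self, user, room):
        if not self._members[room]:
            # Subscribe only to rooms with at least one local member.
            self.bus.subscribe(room, self._on_event)
        self._members[room].add(user)

    def _on_event(self, event):
        for user in self._members[event["room"]]:
            self.delivered[user].append(event)
```

A user who never connected receives nothing live, but the message is still in history, which is what lets them catch up later.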
3.11 Message Ordering
WebSocket itself gives ordered byte delivery on a single connection because it runs over TCP.
That does not solve application-level ordering fully.
Ordering can still break due to:
- reconnects
- retry/replay logic
- multi-region replication lag
- events produced from multiple backend services
- fanout from multiple partitions
Practical fix:
- assign monotonic sequence numbers per room, stream, or entity
- let clients detect gaps or duplicates
- if a gap exists, trigger replay or snapshot sync
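On the client, that fix reduces to comparing each incoming sequence number with the last one applied. A minimal sketch, assuming sequence numbers increase by exactly one per stream (SequenceTracker is a hypothetical name):

```python
class SequenceTracker:
    """Classifies incoming messages for one room, stream, or entity."""

    def __init__(self, last_seq: int = 0):
        self.last_seq = last_seq

    def classify(self, seq: int) -> str:
        if seq <= self.last_seq:
            return "duplicate"   # already applied; safe to drop
        if seq == self.last_seq + 1:
            self.last_seq = seq
            return "apply"       # next in order
        return "gap"             # something was missed; replay or resync


tracker = SequenceTracker(last_seq=8421)
tracker.classify(8422)  # "apply"
tracker.classify(8422)  # "duplicate"
tracker.classify(8425)  # "gap": 8423 and 8424 are missing
```

On "gap" the tracker deliberately does not advance, so the client can request replay from last_seq + 1 and then continue normally.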
3.12 Delivery Guarantees
Default WebSocket delivery is closer to at-most-once unless you add more machinery.
For stronger guarantees, systems add:
- message IDs
- client acknowledgments
- server-side retry windows
- dedupe on the client or server
- durable storage for important messages
Chat systems often separate message durability from live delivery:
- live delivery is fast but may fail during transient disconnects
- durable history ensures the user still sees the message when they reconnect
This is how systems like Slack or Discord can tolerate temporary live-delivery failures without losing conversation history.
3.13 WebSockets vs Traditional Request-Response
| Dimension | Request-response HTTP | WebSockets |
|---|---|---|
| Communication model | Client asks, server answers | Both sides can send after connection is established |
| Overhead | New request/response semantics each time | One persistent connection |
| Server push | Awkward or impossible directly | Natural |
| Best for | CRUD, infrequent interactions, cacheable reads | Interactive real-time features |
| Statefulness | Usually stateless per request | Stateful connection management |
| Scaling pain | Request burst handling | Connection count, fanout, reconnect storms |
3.14 Production Examples
Slack or Discord style chat:
- WebSockets to maintain active sessions
- durable message store for history
- presence service for online status
- per-channel fanout through pub/sub
- client acknowledgments or read markers managed separately from transport
Live operational dashboard:
- metrics ingestion into stream system
- aggregation layer computes latest values
- gateway pushes only the latest state at a controlled cadence
- slow clients may receive sampled or coalesced updates instead of every raw event
Collaborative editing like Google Docs style systems:
- persistent channel for operations and cursor updates
- separate algorithm for convergence such as OT or CRDT
- ordered per-document operation stream more important than global ordering
3.15 Failure Cases and Best Practices
Common failures:
- dropped connections during deploys
- missing replay after reconnect
- memory blow-up from unbounded per-client queues
- duplicate sends during reconnect races
- stale authentication on long-lived connections
- load imbalance when popular rooms concentrate on one shard
Best practices:
- keep auth tokens short-lived and renewable
- use heartbeats with sensible timeout and grace period
- track per-stream sequence numbers for replay and gap detection
- bound outbound queues and define drop policies
- separate durable history from ephemeral transport
- plan explicitly for reconnect storms
4. Server-Sent Events (SSE)
Server-Sent Events are a simpler real-time transport for one-way server-to-client streaming over HTTP.
4.1 What SSE Is
SSE uses a long-lived HTTP response with content type text/event-stream. The server keeps the response open and pushes events as text lines.
This is simpler than WebSockets because:
- it stays within standard HTTP semantics
- the communication direction is only server to client
- browsers provide built-in reconnect behavior through EventSource
SSE is often the right tool when you need live updates but not client-to-server realtime messaging on the same channel.
4.2 Why SSE Exists
Many applications need push from server to browser, but not full bidirectional sockets.
Examples:
- notification feeds
- build or job status updates
- stock or metrics dashboards
- live logs
- admin consoles showing state changes
For these, WebSockets may be more power than you need. SSE offers lower conceptual complexity.
4.3 How SSE Works Internally
The server responds with a streaming body where events look like this conceptually:
id: 1052
event: status_update
retry: 3000
data: {"state":"processing","progress":65}
Important fields:
- data: payload, often JSON serialized as text
- event: optional named event type
- id: event ID used for resume/reconnect logic
- retry: suggested reconnect delay in milliseconds
The browser may reconnect automatically and include the last event ID.
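Producing that wire format on the server is mostly string assembly. The helper below is a sketch, not a framework API: format_sse is a hypothetical function that takes an already serialized text payload, emits one data line per payload line, and ends the event with a blank line.

```python
def format_sse(data: str, event: str = None, event_id=None,
               retry_ms: int = None) -> str:
    """Serialize one event in text/event-stream format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    if retry_ms is not None:
        lines.append(f"retry: {retry_ms}")
    # A payload containing newlines must be split across several data
    # fields; the client joins them back together with newlines.
    for chunk in data.split("\n"):
        lines.append(f"data: {chunk}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


frame = format_sse(
    '{"state":"processing","progress":65}',
    event="status_update",
    event_id=1052,
    retry_ms=3000,
)
```

The server writes each frame to the open response and flushes immediately; without the flush, buffering anywhere in the path makes the stream look stalled.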
4.4 Browser Reconnect Behavior
One reason teams like SSE is that browsers handle reconnection automatically through the EventSource API.
That does not mean you can ignore replay design.
In production, you still need to decide:
- how long the server remembers old event IDs
- whether reconnect should replay from a cursor or force a snapshot refresh
- what happens if the client was disconnected longer than the replay window
If you care about reliable catch-up, event IDs need meaning. They cannot just be decorative.
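One simple way to give event IDs meaning is a bounded replay buffer keyed by them. The sketch below assumes integer event IDs that increase by exactly one per stream; ReplayBuffer is a hypothetical class, and returning None is this example's way of saying "replay window expired, fetch a snapshot instead".

```python
from collections import deque


class ReplayBuffer:
    """Keeps the most recent events so reconnecting clients can catch up."""

    def __init__(self, max_events: int):
        self._events = deque(maxlen=max_events)  # (event_id, payload) pairs

    def append(self, event_id: int, payload) -> None:
        self._events.append((event_id, payload))

    def replay_after(self, last_event_id: int):
        """Events newer than last_event_id, or None if continuity is unknown."""
        if not self._events:
            return None  # nothing buffered; cannot prove the client missed nothing
        oldest_id = self._events[0][0]
        if last_event_id < oldest_id - 1:
            return None  # events between the client's cursor and the buffer are gone
        return [(eid, p) for eid, p in self._events if eid > last_event_id]


buffer = ReplayBuffer(max_events=3)
for event_id in range(50, 55):
    buffer.append(event_id, {"progress": event_id})
# The buffer now holds 52, 53, 54; older events were evicted.
```

A reconnect with Last-Event-ID 52 gets events 53 and 54 back, while a reconnect with 49 gets None and should be answered with a snapshot or an instruction to refetch.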
4.5 Proxy and Load Balancer Considerations
SSE works over HTTP, but intermediaries can cause problems.
Common issues:
- reverse proxies buffering the response instead of flushing promptly
- idle timeouts closing the stream
- HTTP/1.1 connection count limits in browsers (typically about six per origin), which open SSE streams count against
- network infrastructure tuned for short responses rather than long-lived streams
Typical mitigations:
- disable response buffering where necessary
- send periodic keepalive comments or lightweight events
- set the load balancer idle timeout comfortably above the keepalive interval so a quiet but healthy stream is not closed
- prefer HTTP/2 where available to improve multiplexing behavior
4.6 Advantages of SSE
SSE is often better than WebSockets when:
- you only need server-to-client updates
- the payload is text-oriented JSON or status messages
- you want easy browser support with simpler client code
- you prefer staying inside HTTP infrastructure where possible
- you want built-in reconnect behavior without custom socket management
It is especially attractive for SaaS dashboards and internal tools.
4.7 Limitations of SSE
SSE is not a universal replacement for WebSockets.
Limitations:
- browser-to-server communication still needs normal HTTP requests
- payloads are text-based, not native binary frames
- some proxies or CDNs may handle buffering poorly if misconfigured
- older browser and infrastructure quirks may matter in enterprise environments
- very high frequency bidirectional workloads fit WebSockets better
4.8 SSE vs WebSockets
| Dimension | SSE | WebSockets |
|---|---|---|
| Direction | Server to client only | Bidirectional |
| Protocol model | Streaming HTTP | Upgraded persistent socket |
| Browser reconnect support | Built in | Usually implemented by application |
| Complexity | Lower | Higher |
| Binary support | Not natural | Supported |
| Best use | Feeds, status, live logs, dashboards | Chat, collaboration, two-way real-time |
4.9 When SSE Is the Better Choice
SSE is often the better choice when the client mostly listens.
Examples:
- GitHub or CI-style live build logs
- a Stripe-like dashboard showing payment processing updates
- an incident dashboard showing service health changes
- a notification center feed for a web app
- stock or operations dashboards where commands go through separate REST APIs
This is a good interview point: choosing the simpler transport when it satisfies requirements is usually a sign of maturity.
4.10 Production Example
Imagine a live job-status console for a video encoding platform.
Architecture:
- workers emit progress events to Kafka
- a status aggregation service maintains current job state
- an SSE gateway streams progress events to browser tabs watching those jobs
- user actions such as cancel or retry still go through normal HTTP APIs
This keeps the design simple because only one direction needs to be live.
sequenceDiagram
participant B as Browser
participant S as SSE Gateway
participant A as Aggregation Service
B->>S: GET /jobs/123/stream
S->>A: Subscribe to job 123 updates
A-->>S: status event id=51
S-->>B: event: progress, id:51, data:{65%}
A-->>S: status event id=52
S-->>B: event: progress, id:52, data:{78%}
Note over B,S: If disconnected, browser reconnects with Last-Event-ID
4.11 Failure Cases and Best Practices
Common failures:
- events buffered by proxy, making the feed appear delayed
- auto-reconnect loops against an unhealthy server
- event IDs not aligned with replay logic
- assuming SSE gives guaranteed delivery without durable backing state
Best practices:
- use SSE for one-way streaming where simplicity matters
- define replay windows and semantics for Last-Event-ID
- tune proxy buffering and timeout settings explicitly
- send heartbeats or comments to keep the stream alive
- keep authoritative state elsewhere so clients can resync after longer disconnects
5. Long Polling
Long polling is the older but still useful technique where the client makes a request, the server holds it open until data is available or a timeout occurs, then the client immediately opens a new request.
5.1 What Long Polling Is
Long polling tries to approximate server push while staying entirely within ordinary request-response behavior.
Lifecycle:
- client sends request asking for updates
- server waits if no new data is available
- when data arrives, or timeout happens, server responds
- client immediately sends another request
This can work surprisingly well at modest scale.
5.2 Why Older Systems Used It
Long polling became popular because:
- it worked before modern browser WebSocket support was universal
- it fit existing HTTP infrastructure better
- it avoided some firewall or proxy restrictions that broke upgraded connections
- it was simple to integrate into applications already centered on HTTP APIs
Many legacy enterprise products still use it, and some modern systems still choose it when operational simplicity matters more than efficiency.
5.3 Request Lifecycle and Timeout Behavior
sequenceDiagram
participant C as Client
participant S as Server
C->>S: GET /updates?cursor=120
Note over S: Hold request open
S-->>C: 200 OK with new event 121
C->>S: GET /updates?cursor=121
Note over S: Hold request open again
S-->>C: 204 No Content or timeout response
C->>S: GET /updates?cursor=121
The cursor is important. Without it, the client cannot reliably resume.
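The hold-and-respond behavior with a cursor can be sketched with asyncio, which avoids parking a thread per waiting request. This is an in-memory illustration only: EventLog is a hypothetical class, the cursor is simply the number of events the client has already consumed, and an empty result stands in for the 204 or timeout response.

```python
import asyncio


class EventLog:
    """Append-only log that waiting pollers can block on."""

    def __init__(self):
        self._events = []
        self._changed = asyncio.Condition()

    async def append(self, payload) -> int:
        async with self._changed:
            self._events.append(payload)
            self._changed.notify_all()  # wake every held request
            return len(self._events)

    async def poll(self, cursor: int, timeout: float):
        """Wait until events past `cursor` exist or the timeout elapses.

        Returns (new_events, next_cursor); new_events is empty on timeout.
        """
        async def wait_for_new():
            async with self._changed:
                await self._changed.wait_for(lambda: len(self._events) > cursor)

        try:
            await asyncio.wait_for(wait_for_new(), timeout)
        except asyncio.TimeoutError:
            pass  # nothing new; the client re-polls with the same cursor
        new_events = self._events[cursor:]
        return new_events, cursor + len(new_events)
```

Because the client always sends back the cursor it received, an event appended between two polls is picked up by the next request rather than lost.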
5.4 Server Resource Cost
Long polling has hidden cost:
- many open requests waiting simultaneously
- repeated HTTP parsing and headers
- repeated auth checks unless optimized
- more connection churn than persistent transports
- potentially more threads or async waiting overhead depending on server model
In modern async servers this can be manageable, but it is still usually less efficient than SSE or WebSockets for high-frequency updates.
5.5 Retry Flow and Failure Behavior
If a long-poll request fails, the client retries. If the system is overloaded, thousands of clients may do this simultaneously.
This can create:
- thundering herds after load balancer restarts
- increased TLS handshake overhead
- request spikes aligned to timeout intervals
Good long-poll implementations randomize retry timing and use cursors to avoid gaps.
5.6 Scaling Limitations
Long polling tends to hit limits earlier than SSE or WebSockets because:
- every delivery cycle requires a fresh request
- idle clients still create repeated requests
- server-side waiting requests increase memory and scheduling overhead
- load balancers and API gateways see much higher request volume
It is therefore more expensive per delivered update at scale.
5.7 Long Polling vs SSE vs WebSockets
| Dimension | Long Polling | SSE | WebSockets |
|---|---|---|---|
| Direction | Mostly server to client via repeated requests | Server to client | Bidirectional |
| Latency | Good but depends on re-request timing | Better | Best for interactive two-way |
| Overhead | Highest | Lower | Lowest per message after connection |
| Infrastructure simplicity | High | Moderate | Moderate to high |
| Browser support model | Universal HTTP | Modern browser friendly | Modern browser friendly |
| Best fit | Legacy environments, simple compatibility | One-way streaming | Full duplex interactive systems |
5.8 Where It Still Appears Today
Long polling still appears in:
- legacy enterprise systems
- products that need broad proxy compatibility
- simple notification or status systems where connection counts are moderate
- fallback modes when WebSockets fail
Some client libraries silently downgrade to long polling when sockets are unavailable.
5.9 Common Mistakes
Common mistakes:
- no cursor or sequence number, causing gaps or duplicates
- synchronized timeout intervals across all clients
- treating long polling as free because it uses HTTP
- using blocking server threads for many waiting requests
Best practices:
- add jitter to retries and request restarts
- carry a cursor or last seen event ID
- use async I/O servers
- switch to SSE or WebSockets when event frequency or client count grows significantly
6. Presence
Presence is the system that answers questions such as:
- is this user online?
- were they active recently?
- are they typing right now?
- which device are they on?
- when were they last seen?
Presence looks easy until you build it across unreliable networks and multiple devices.
6.1 What Presence Means
Presence is not just a boolean. It is usually a collection of signals.
Examples of signals:
- active WebSocket connection exists
- recent heartbeat received within threshold
- app is in foreground or background
- user generated interaction recently
- one or more devices are connected
- a typing event was emitted in the last few seconds
Products like Slack, WhatsApp, and Discord treat these signals differently based on product semantics and privacy expectations.
6.2 Online/Offline State vs Rich Presence
There are multiple levels of presence richness:
| Presence signal | Meaning | Typical use |
|---|---|---|
| Online | At least one recent active connection | chat roster |
| Active now | Recent interaction or foreground state | collaboration tools |
| Last seen | Last trusted activity timestamp | messaging apps |
| Typing | Very recent ephemeral intent signal | chat UI |
| Device-specific presence | Desktop online, mobile background, etc. | cross-device messaging |
Rich presence improves UX but increases cost and complexity.
6.3 Heartbeat Mechanisms
Presence systems typically rely on heartbeats because disconnects are not always detected immediately.
Typical model:
- client sends heartbeat every N seconds
- gateway refreshes an expiry in an ephemeral store
- if no refresh occurs before TTL plus grace period, the client is considered offline
Heartbeats are usually lightweight and sometimes piggyback on existing ping-pong traffic.
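A minimal sketch of the heartbeat model, assuming an in-process dictionary in place of a real ephemeral store. `PresenceStore` and its method names are hypothetical, and time is passed in explicitly so the behavior is easy to reason about.

```python
class PresenceStore:
    """Ephemeral presence keyed by (user_id, device_id), expiring after a TTL."""

    def __init__(self, ttl_seconds: float, grace_seconds: float = 0.0) -> None:
        self.ttl = ttl_seconds
        self.grace = grace_seconds
        self._expires_at: dict[tuple[str, str], float] = {}

    def heartbeat(self, user_id: str, device_id: str, now: float) -> None:
        """Refresh the expiry for one device session."""
        self._expires_at[(user_id, device_id)] = now + self.ttl

    def is_online(self, user_id: str, now: float) -> bool:
        """Online means at least one device entry is within TTL plus grace."""
        return any(
            uid == user_id and now <= expiry + self.grace
            for (uid, _), expiry in self._expires_at.items()
        )

    def sweep(self, now: float) -> int:
        """Drop expired entries; a store with native TTLs would do this itself."""
        dead = [k for k, exp in self._expires_at.items() if now > exp + self.grace]
        for key in dead:
            del self._expires_at[key]
        return len(dead)
```

With a 30 second TTL and a 10 second grace period, a device that last sent a heartbeat at t=0 is still reported online at t=35 and offline at t=41.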
6.4 Disconnect Detection
There are several ways a system may detect that a user is gone:
- clean socket close
- missed heartbeat threshold
- TCP reset or read/write failure
- mobile OS background suspension causing missed keepalives
- gateway process failure
The important design lesson is that disconnect detection is delayed and probabilistic.
6.5 Ghost Online Problems
Ghost online means the system shows a user as online even though they have effectively disappeared.
Causes:
- stale presence entry in Redis or similar store
- gateway crash before cleanup
- mobile network change without clean disconnect
- delayed heartbeat expiration
- clock skew or delayed writes
Ghost online is not just cosmetic. In chat systems it changes user expectations and trust.
6.6 Mobile Network Challenges
Mobile clients make presence significantly harder:
- apps move between foreground and background
- the OS may throttle network activity
- the device may sleep aggressively
- the network may switch between Wi-Fi and cellular
- radio conditions cause intermittent reachability
This is why messaging apps rarely promise precise truth about presence. They provide an approximation that is good enough for UX.
6.7 Multi-Device Presence
A single user may be connected from:
- desktop browser
- mobile phone
- tablet
- another browser tab
So the question becomes: how should user presence be aggregated?
Common policies:
- user is online if any device is online
- show the most active device class
- typing indicator is scoped per conversation and device session
- last seen is updated from the most recently trusted activity source
This requires per-device state plus a user-level aggregation layer.
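One way to express such an aggregation layer is a pure function over device sessions. The sketch below assumes a hypothetical `DeviceSession` record and an illustrative ranking of states; it implements the "most active device class" policy, and other policies would need different rules.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ranking: a higher number means "more active" for display purposes.
STATE_RANK = {"offline": 0, "background": 1, "online": 2, "active": 3}


@dataclass(frozen=True)
class DeviceSession:
    device_id: str
    device_class: str     # e.g. "desktop" or "mobile"
    state: str            # one of the STATE_RANK keys
    last_activity: float  # unix seconds of the last trusted activity


def aggregate_presence(sessions: list[DeviceSession]) -> dict[str, Optional[object]]:
    """Collapse per-device state into one user-level presence summary."""
    if not sessions:
        return {"state": "offline", "device_class": None, "last_seen": None}
    # Pick the most active device; break ties by the most recent activity.
    best = max(sessions, key=lambda s: (STATE_RANK[s.state], s.last_activity))
    return {
        "state": best.state,
        "device_class": best.device_class if best.state != "offline" else None,
        "last_seen": max(s.last_activity for s in sessions),
    }
```

Keeping the aggregation pure makes it straightforward to recompute the user-level view whenever any single device session changes.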
6.8 Presence Architecture at Scale
flowchart LR
C1[Client Devices] --> G[Realtime Gateways]
G --> P[Presence Service]
P --> E[(Ephemeral TTL Store)]
P --> L[(Last Seen Store)]
P --> BUS[[Presence Event Bus]]
BUS --> W1[Workspace / Room Subscribers]
BUS --> W2[Friend / Contact Subscribers]
Important implementation choices:
- keep rapidly changing online state in an ephemeral store with TTL, often Redis-like
- persist last seen more selectively because writing every heartbeat to durable storage is wasteful
- fan out presence only to users who care, such as channel members or contact lists
- avoid global broadcasts of presence updates
6.9 How Slack, WhatsApp, and Discord Think About Presence
Slack-like systems:
- presence is often relevant within a workspace, not globally
- typing indicators are highly ephemeral and can be dropped safely
- active/away may incorporate user activity signals, not only socket existence
WhatsApp-like systems:
- privacy settings affect whether last seen or online is visible
- mobile network behavior dominates design decisions
- message delivery status and online status are related but not identical
Discord-like systems:
- gateway connections and presence updates are central
- fanout is scoped to servers, friends, or subscribed contexts
- rich presence may include game/activity metadata beyond simple online state
6.10 Failure Cases and Best Practices
Common failures:
- presence storms during reconnect events
- over-broadcasting presence changes to too many recipients
- stale typing indicators that never clear
- durable stores overloaded by heartbeat writes
Best practices:
- use TTL-based ephemeral storage for active presence
- use grace periods before declaring offline
- treat typing as ephemeral, low-guarantee data
- separate last seen persistence from heartbeat frequency
- scope presence fanout carefully
7. Online and Offline State
Online/offline state deserves separate treatment because it is one of the most misunderstood topics in system design.
7.1 Connection State vs Actual User Activity
A connected socket does not prove the user is attentive.
Examples:
- the app is open in a background tab
- the phone has a connection but the user has not interacted for an hour
- the device is connected through a stale transport path
So "online" is often a layered concept:
- connected
- recently active
- foreground active
- reachable through push only
- offline
7.2 Why Online Is Probabilistic
"Online" is often probabilistic rather than exact because distributed systems only observe signals, not intent.
Signals can be wrong or delayed:
- heartbeat arrives late
- device sleeps
- disconnect cleanup fails
- region replication lags
- clock skew shifts thresholds
That is why serious systems avoid over-promising semantic precision.
7.3 Delayed Disconnect Detection and Grace Periods
If you mark a user offline immediately on one missed heartbeat, the UI will flap constantly. If you wait too long, users stay falsely online.
The typical answer is a grace period.
Example:
- heartbeat expected every 20 seconds
- after 45 seconds without heartbeat, mark as "probably offline"
- after 90 seconds, emit strong offline event
The exact numbers depend on product sensitivity and mobile behavior.
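The example thresholds above map onto a small classification function. This is a sketch only: the function name and state labels are assumptions, and the defaults simply reuse the 45 and 90 second figures from the example.

```python
def classify_presence(
    seconds_since_heartbeat: float,
    soft_offline_after: float = 45.0,
    hard_offline_after: float = 90.0,
) -> str:
    """Map heartbeat silence onto a coarse presence state.

    The band between the two thresholds is the grace period: the user is
    shown as probably offline, but no strong offline event is emitted yet.
    """
    if seconds_since_heartbeat < soft_offline_after:
        return "online"
    if seconds_since_heartbeat < hard_offline_after:
        return "probably_offline"
    return "offline"
```

With a 20 second heartbeat interval, a single missed heartbeat leaves roughly 40 seconds of silence, which is still below the 45 second threshold, so the state does not flap.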
7.4 Distributed Presence Tracking
At scale, presence may be tracked across regions or gateway clusters.
Challenges:
- the same user may have devices connected to different regions
- presence updates may replicate asynchronously
- room members may themselves be distributed globally
- failover can cause duplicate or delayed presence transitions
Practical design patterns:
- aggregate presence per user from device-level sessions
- keep the fast path regional when possible
- replicate summarized state, not every heartbeat, across regions
- use eventual consistency for non-critical presence displays
7.5 Eventual Consistency and Reconciliation
Presence is almost always eventually consistent.
That is acceptable because most presence UI is advisory rather than transactional.
Reconciliation patterns:
- refresh contact roster periodically from source state
- clear typing indicators after TTL rather than requiring explicit "stop typing"
- overwrite stale online state when stronger evidence arrives
- prefer monotonic versioning or timestamps for state updates
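The last pattern can be illustrated with a version check on every write. The sketch assumes each presence update carries a per-user version that only ever increases at the writer; `PresenceRecord` and `apply_update` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresenceRecord:
    state: str
    version: int  # assumed to increase monotonically per user at the writer


def apply_update(
    current: dict[str, PresenceRecord], user_id: str, incoming: PresenceRecord
) -> bool:
    """Apply an update only if it is newer than what is already stored.

    Returns True when the stored state changed. Stale or duplicate updates
    (for example, a delayed cross-region replica) are ignored.
    """
    existing = current.get(user_id)
    if existing is not None and incoming.version <= existing.version:
        return False
    current[user_id] = incoming
    return True
```

Under this rule, an "online" update that arrives late cannot overwrite a newer "offline" state, which removes one common source of ghost presence.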
7.6 Offline Event Reconciliation
When a user comes back online, the system often needs to reconcile what happened while they were away.
Examples:
- unread chat message counts
- missed mentions or alerts
- latest document state
- last known ride or delivery state
This is where live presence and durable notification systems connect. Presence alone is not enough. Offline catch-up requires durable event or state storage.
7.7 Useful Interview Language
Strong interview framing:
"I would treat online status as a best-effort product signal based on recent heartbeats and activity, not as exact truth. I would keep active presence in an ephemeral TTL store, add grace periods to avoid flapping, and rely on durable stores for offline reconciliation."
8. Notification Systems Overview
Notification systems take internal events and decide whether, when, and how to tell a user.
This is a much bigger problem than "send email" or "send push."
8.1 What Notification Systems Are
A notification system is usually an event-driven orchestration system with responsibilities such as:
- ingest product events
- determine recipients
- check user preferences and eligibility rules
- choose channels such as in-app, email, SMS, or push
- render channel-specific content
- schedule immediate or delayed delivery
- retry safely on transient failures
- capture delivery outcomes and user actions
This is why notification systems often become a platform inside a company.
8.2 Why They Exist as Dedicated Systems
Without a dedicated notification system, every product team reimplements:
- preferences
- templates
- provider integrations
- retry logic
- deduplication
- analytics
- unsubscribe logic
That leads to inconsistency, bugs, and poor user experience.
Centralized notification platforms exist because communication policy should be consistent even when event producers are decentralized.
8.3 Event-Driven Producer-Consumer Flow
flowchart LR
EV[Business Events<br/>PR comment / payment failed / trip update] --> BUS[[Event Bus / Outbox]]
BUS --> ORCH[Notification Orchestrator]
ORCH --> PREF[Preferences / Suppression Rules]
ORCH --> TPL[Template Service]
ORCH --> DEDUPE[Idempotency / Deduplication]
ORCH --> Q1[[Email Queue]]
ORCH --> Q2[[SMS Queue]]
ORCH --> Q3[[Push Queue]]
Q1 --> EP[Email Provider]
Q2 --> SP[SMS Provider]
Q3 --> PP[APNs / FCM]
EP --> CALLBACK[Delivery Callbacks / Webhooks]
SP --> CALLBACK
PP --> CALLBACK
CALLBACK --> STATUS[(Notification Status Store)]
8.4 Notification Orchestration
The orchestrator is the brain of the system.
It decides:
- should we notify at all?
- which recipients should receive it?
- which channels are allowed for this user and event type?
- should the notification be immediate, batched, or suppressed?
- how should duplicates be prevented?
- what fallback should happen if one channel fails?
This is why the orchestrator is usually more important than the provider adapters.
8.5 Fanout Systems
A single event can notify:
- one user
- all watchers of a repository
- all participants in a channel
- all on-call engineers for a critical incident
- millions of users in a campaign
Fanout complexity depends on the recipient graph.
Examples:
- GitHub notification fanout may target repository watchers, mention targets, or review participants
- Uber trip updates may target rider, driver, and support systems
- Stripe-like alerts may target account owners and configured team members
The fanout step is often separated from channel delivery so the system can scale recipient expansion independently.
8.6 Personalization and Preferences
Mature systems do not treat all recipients equally.
They consider:
- language and locale
- preferred channels
- quiet hours
- marketing consent vs transactional necessity
- account type or plan tier
- notification priority
- prior recent sends to avoid fatigue
This means notification systems need user preference stores, policy engines, and template personalization.
8.7 Delivery Guarantees and Idempotency
Notification systems are almost always at-least-once internally.
Why:
- queues retry on failures
- provider callbacks can be delayed or duplicated
- downstream consumers may crash after partial processing
Therefore idempotency is mandatory.
Examples of idempotency keys:
- event_id + recipient_id + channel + template_version
- otp_request_id + phone_number
- invoice_id + notification_type
Without idempotency, retries turn into duplicate emails, repeated SMS, or multiple push notifications.
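A minimal sketch of key-based deduplication, assuming an in-memory set where a production system would use a shared store with an atomic insert-if-absent. `idempotency_key`, `SendLedger`, and `send_once` are hypothetical names.

```python
import hashlib
from typing import Callable


def idempotency_key(*parts: str) -> str:
    """Derive a stable key from the fields that define one logical send."""
    # A separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class SendLedger:
    """Remembers which keys have already been claimed."""

    def __init__(self) -> None:
        self._claimed: set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True only for the first caller that presents this key."""
        if key in self._claimed:
            return False
        self._claimed.add(key)
        return True


def send_once(ledger: SendLedger, key: str, send: Callable[[], None]) -> bool:
    """Invoke `send` only if this key has not been claimed before."""
    if not ledger.claim(key):
        return False
    send()
    return True
```

Note that claiming before sending trades duplicates for possible loss if the process crashes between the two steps. Which side of that tradeoff is right depends on the channel and is left open here.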
8.8 User Preferences, Suppression, and Fatigue Prevention
Good notification systems do not just maximize sends. They maximize useful communication.
Common controls:
- per-event-type preferences
- daily or hourly caps
- digesting multiple low-priority events into one summary
- suppression if the user already saw the update in-app
- channel priority rules such as push first, email later if unread
- quiet hours by timezone
This is especially important for SaaS products, GitHub-like collaboration tools, and consumer apps where over-notification causes churn.
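Two of these controls, daily caps and quiet hours, can be sketched as a single eligibility check. The function names, the 22:00 to 07:00 default window, and the rule that critical notifications bypass both controls are assumptions for illustration. A real system would more likely resolve the user's IANA time zone than accept a fixed offset.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(
    now_utc: datetime,
    user_tz: tzinfo,
    start: time = time(22, 0),
    end: time = time(7, 0),
) -> bool:
    """Check whether the user's local time falls inside the quiet window."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        return start <= local < end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return local >= start or local < end


def should_send(
    now_utc: datetime,
    user_tz: tzinfo,
    sends_today: int,
    daily_cap: int,
    critical: bool = False,
) -> bool:
    """Apply the cap and quiet hours; critical notifications bypass both."""
    if critical:
        return True
    if sends_today >= daily_cap:
        return False
    return not in_quiet_hours(now_utc, user_tz)
```

For a user at UTC-5, a low-priority event at 04:00 UTC lands at 23:00 local time and is held back, while the same event at 15:00 UTC is eligible.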
8.9 Real-World Examples
GitHub-like system:
- new comment event arrives
- identify subscribers and participants
- if user is active in the web app, maybe show in-app only
- if user is offline, maybe send email or mobile push depending on preferences
Stripe-like system:
- payment failed event arrives
- critical transactional email sent immediately
- dashboard alert created
- webhook to merchant retried separately with strong idempotency rules
Uber-like system:
- trip state changes drive in-app realtime updates when online
- push or SMS may be used if the user is offline or inactive
- some updates are critical and some are suppressible
8.10 Common Mistakes
Common mistakes:
- every producer sends directly to providers
- no central dedupe or idempotency control
- no distinction between marketing and transactional communications
- retrying permanent failures forever
- ignoring user timezone and quiet hours
- no feedback loop from provider delivery callbacks
9. Email Notifications
Email is deceptively complex. It looks simple because the send API is simple. Production email systems are complicated because delivery, trust, compliance, and reputation matter.
9.1 Transactional vs Marketing Email
| Type | Purpose | Examples | Operational expectations |
|---|---|---|---|
| Transactional | Required or highly important user/account events | password reset, receipt, billing alert, security alert | High reliability, lower latency, clearer deliverability priority |
| Marketing | Engagement and campaigns | product updates, promotions, newsletters | Segmentation, scheduling, consent, reputation sensitivity |
Good companies often separate these streams operationally because poor marketing practices can harm the deliverability of important transactional mail.
9.2 Provider Integrations
Common providers include SES, SendGrid, Mailgun, Postmark, and others.
The backend typically needs to manage:
- authentication credentials and rotation
- domain verification
- sending rate limits
- webhook endpoints for bounces, complaints, and delivered events
- fallback provider strategy if the primary provider degrades
The provider is not your notification system. It is only the last-mile email delivery partner.
9.3 Deliverability Basics
Deliverability is whether mail actually lands where it should, ideally in the recipient's inbox.
It depends on more than API success.
Core concepts:
- SPF, DKIM, and DMARC to establish domain authenticity
- domain and IP reputation
- complaint rates and unsubscribe behavior
- bounce rates
- content quality and spam-like patterns
- volume ramp-up or IP/domain warming
A send accepted by the provider can still end up in spam or be throttled downstream.
9.4 Retries and Failure Handling
Email sending failures can be transient or permanent.
Transient examples:
- provider API timeout
- rate limit exceeded
- temporary downstream mail server deferral
Permanent examples:
- invalid recipient address
- hard bounce history
- unsubscribed marketing recipient
The notification system should classify failures and retry only the transient ones.
9.5 Bounce Handling
Bounces matter because continuing to send to bad addresses damages reputation.
Typical flow:
- send email
- provider later reports delivery, soft bounce, hard bounce, complaint, or unsubscribe
- your system updates recipient state and future eligibility
Hard bounces often lead to immediate suppression for that address. Soft bounces may tolerate a few retries or future attempts.
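The flow above can be sketched as a suppression list driven by provider events. The event names are hypothetical normalized labels rather than any provider's actual webhook vocabulary, and treating every unsubscribe as a full suppression is a simplification. Real systems usually scope it to marketing mail.

```python
from collections import defaultdict


class SuppressionList:
    """Tracks which addresses should no longer receive mail, and why."""

    def __init__(self, soft_bounce_limit: int = 3) -> None:
        self.soft_bounce_limit = soft_bounce_limit
        self._suppressed: dict[str, str] = {}
        self._soft_bounces: defaultdict[str, int] = defaultdict(int)

    @staticmethod
    def _normalize(address: str) -> str:
        # Assumption: addresses are compared case-insensitively.
        return address.strip().lower()

    def record_event(self, address: str, event: str) -> None:
        """Update recipient state from a delivery callback."""
        addr = self._normalize(address)
        if event in ("hard_bounce", "complaint", "unsubscribe"):
            self._suppressed[addr] = event
        elif event == "soft_bounce":
            self._soft_bounces[addr] += 1
            if self._soft_bounces[addr] >= self.soft_bounce_limit:
                self._suppressed[addr] = "repeated_soft_bounce"
        elif event == "delivered":
            # A successful delivery resets the consecutive soft bounce count.
            self._soft_bounces.pop(addr, None)

    def can_send(self, address: str) -> bool:
        return self._normalize(address) not in self._suppressed
```

The send path would call `can_send` before enqueueing, so a hard bounce reported by one campaign protects every later send to that address.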
9.6 Spam Prevention Basics
Spam prevention is partly technical and partly policy.
Important basics:
- authenticate sending domains correctly
- do not mix high-risk marketing traffic with critical transactional mail blindly
- honor unsubscribe and complaint signals quickly
- avoid misleading content or sudden massive bursts from a cold domain
- keep lists clean and verified
9.7 Rate Limits and Provider Behavior
Providers enforce rate limits, and mailbox providers also effectively rate-limit through throttling.
Implications:
- you may need send pacing
- campaign fanout should often be spread over time
- critical transactional mail may need priority queues separate from bulk mail
Amazon- or Stripe-style receipt and billing systems cannot afford to let a giant promotional campaign delay password resets or payment receipts.
9.8 Email Verification Basics
Verification is used to reduce garbage addresses and account abuse.
Common patterns:
- double opt-in for marketing lists
- confirmation link for new account email ownership
- suppression of obviously invalid domains or malformed addresses
This improves both security and deliverability.
9.9 Unsubscribe Flows
Unsubscribe handling is non-negotiable for marketing communications and often still useful for preference management around non-critical transactional categories.
Good systems support:
- one-click unsubscribe where required
- fine-grained preference center
- audit trail of consent changes
- region/compliance specific logic if needed
9.10 Domain Reputation Basics
Reputation is cumulative. Sending one terrible campaign can affect future delivery of important mail.
This is why large companies often:
- separate sending domains or subdomains by traffic type
- ramp up new domains gradually
- isolate risky campaigns from critical operational email
9.11 Why Email Is More Complex Than "Just Send Email"
Because success requires all of the following:
- rendering the right content
- sending through a healthy provider
- passing authentication checks
- avoiding spam classification
- respecting preferences and compliance
- tracking callbacks
- updating suppression lists
A provider returning 200 OK is the beginning, not the end.
10. SMS Notifications
SMS is powerful because it reaches users outside the app and does not depend on a mobile data connection. It is also expensive, abuse-prone, and operationally messy.
10.1 Common Use Cases
SMS is usually best reserved for high-value or urgent communication.
Examples:
- OTP and account verification
- fraud or security alerts
- delivery or ride status for users without app engagement
- critical operational alerts
It is usually a poor default channel for high-volume low-value communication.
10.2 Provider Integrations
Typical providers include Twilio, Sinch, MessageBird, and region-specific aggregators.
Real systems often need to manage:
- phone number normalization and validation
- country-specific sender rules
- short code, long code, toll-free, or sender ID strategy
- provider failover or route selection
- delivery receipt webhook processing
10.3 Delivery Reliability
SMS delivery reliability is more uncertain than many engineers expect.
Reasons:
- carriers can filter traffic
- delivery receipts may be delayed or incomplete
- regional regulations vary heavily
- roaming, handset state, and carrier routing affect outcomes
This means SMS is not a guaranteed-delivery channel. It is just a high-reach channel.
10.4 Regional Delivery Issues
SMS is deeply regional.
Challenges include:
- country-specific compliance requirements
- differing support for alphanumeric sender IDs
- carrier throughput differences
- local holidays or network conditions affecting delivery
- multi-language content and character encoding costs
A design that works in the US may be wrong in India, Brazil, or the Middle East.
10.5 Cost Considerations
SMS is expensive relative to push or in-app messaging.
Implications:
- use it for critical moments, not noise
- add rate limits and spend monitoring
- avoid repeated retries that burn money without improving outcomes
This is one reason companies often prefer push first, then SMS only for critical fallback paths.
10.6 Retry Strategies
Retrying SMS requires care.
Bad strategy:
- retry every failure aggressively
- resend OTP repeatedly with no abuse checks
Better strategy:
- retry transient provider errors with capped backoff
- do not retry obvious permanent failures
- rotate providers or routes only when there is evidence of provider-side trouble
- ensure OTP requests invalidate or supersede older codes safely
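The last point can be sketched as a store that keeps at most one active code per phone number, so a new request supersedes the old one. `OtpStore` is a hypothetical name, the six-digit format and 300 second lifetime are assumptions, and attempt limiting is deliberately left out of the sketch.

```python
import hmac
import secrets


class OtpStore:
    """Holds at most one active OTP per phone number."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl = ttl_seconds
        self._active: dict[str, tuple[str, float]] = {}  # phone -> (code, expires_at)

    def issue(self, phone: str, now: float) -> str:
        """Create a new code; overwriting invalidates any older code."""
        code = f"{secrets.randbelow(10**6):06d}"
        self._active[phone] = (code, now + self.ttl)
        return code

    def verify(self, phone: str, code: str, now: float) -> bool:
        """Accept only the latest, unexpired code, and only once."""
        entry = self._active.get(phone)
        if entry is None:
            return False
        expected, expires_at = entry
        if now > expires_at:
            del self._active[phone]
            return False
        if not hmac.compare_digest(expected, code):
            return False
        del self._active[phone]  # single use
        return True
```

Because only the latest code is stored, resending an OTP never widens the set of valid codes.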
10.7 Abuse and Fraud Prevention
SMS systems are targets for abuse.
Examples:
- OTP bombing a victim's phone
- account enumeration by testing phone numbers
- SMS pumping fraud, where attackers generate traffic to monetize premium routes
- SIM swap exposure when SMS is treated as strong identity proof
Protections:
- per-user and per-destination rate limits
- CAPTCHA or additional friction on suspicious flows
- phone reputation checks where appropriate
- anomaly detection on destination patterns and country mix
- avoid using SMS as the only high-assurance security factor for sensitive systems
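The first protection, per-destination rate limiting, can be sketched as a sliding-window counter. `SlidingWindowLimiter` is a hypothetical name and the state lives in process memory. A real deployment would likely keep these counters in a shared store so all workers see the same limits.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `max_events` per key within a rolling time window."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window = window_seconds
        self._events: defaultdict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        """Record and allow the event, or reject it if the key is over its limit."""
        timestamps = self._events[key]
        # Discard timestamps that have aged out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= self.max_events:
            return False
        timestamps.append(now)
        return True
```

Keying one limiter by destination phone number and another by requesting account or IP covers both OTP bombing of a single victim and one actor spraying many numbers.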
10.8 Fallback Strategies
Reasonable fallback examples:
- push notification first for app users, SMS only if unread or unreachable
- voice call backup for some OTP flows in limited contexts
- in-app confirmation plus email receipt after SMS-based security action
Fallback should be policy-driven, not improvised per service.
10.9 Why SMS Should Be Used Carefully
SMS is valuable because it cuts through offline conditions, but it should be treated as a scarce and risky channel.
In interviews, a good answer sounds like:
"I would reserve SMS for high-importance events such as OTP or urgent alerts, because it has real cost, variable regional reliability, and meaningful abuse risk."
11. Push Notifications
Push notifications are the standard way mobile apps receive out-of-app alerts through platform push services.
11.1 Mobile Push Architecture
A typical flow is:
- app registers with APNs on iOS or FCM on Android
- the app backend stores the device token
- when an event occurs, the notification system sends a message to APNs or FCM
- the platform service attempts delivery to the device
- the app may open, refresh state, or show a system notification depending on payload type and app state
flowchart LR
APP[Your Backend] --> ORCH[Notification Orchestrator]
ORCH --> APNS[APNs]
ORCH --> FCM[FCM]
APNS --> IOS[iOS Device]
FCM --> AND[Android Device]
11.2 APNs Basics
APNs is Apple's push service. Important operational realities:
- you send to APNs, not directly to the device
- device tokens can change and become invalid
- delivery timing is not strictly guaranteed
- app state and OS policies affect whether and how the notification is shown or processed
11.3 FCM Basics
FCM plays a similar role for Android and can also support web push scenarios.
Important details:
- token lifecycle must be managed carefully
- device availability and manufacturer-specific behavior can affect delivery
- notification and data messages behave differently depending on app state
11.4 Token Management
Token management is one of the most common production pain points.
You need to handle:
- token registration on app install or refresh
- multiple tokens per user due to multiple devices
- invalid or expired tokens returned by providers
- token removal when users log out or uninstall
Without disciplined token hygiene, you waste money and effort sending to dead destinations.
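The responsibilities listed above can be sketched as a small registry. `TokenRegistry` is a hypothetical name, and the labels in `DEAD_OUTCOMES` are assumed normalized values that a provider adapter would map real APNs or FCM responses onto.

```python
# Hypothetical normalized outcomes meaning "this token will never work again".
DEAD_OUTCOMES = {"invalid_token", "unregistered"}


class TokenRegistry:
    """Tracks push tokens per user; one user may hold several device tokens."""

    def __init__(self) -> None:
        self._tokens: dict[str, set[str]] = {}

    def register(self, user_id: str, token: str, replaces: str = "") -> None:
        """Add a token on install or refresh, dropping the token it replaces."""
        tokens = self._tokens.setdefault(user_id, set())
        tokens.discard(replaces)
        tokens.add(token)

    def remove(self, user_id: str, token: str) -> None:
        """Remove a token on logout or uninstall."""
        self._tokens.get(user_id, set()).discard(token)

    def tokens_for(self, user_id: str) -> list[str]:
        return sorted(self._tokens.get(user_id, set()))

    def prune(self, user_id: str, outcomes: dict[str, str]) -> list[str]:
        """Drop tokens the provider reported as dead; return what was removed."""
        dead = sorted(t for t, outcome in outcomes.items() if outcome in DEAD_OUTCOMES)
        for token in dead:
            self.remove(user_id, token)
        return dead
```

Calling `prune` with the per-token results of every send keeps the registry from accumulating dead destinations.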
11.5 Delivery Uncertainty
Push notifications are not a guaranteed immediate-delivery channel.
Reasons:
- device is offline
- battery optimization delays delivery
- platform may coalesce or deprioritize some notifications
- user disabled permissions
- token invalidated
This is why important workflows often require eventual in-app reconciliation or email fallback.
11.6 Silent Notifications
Silent notifications can wake the app to refresh content without necessarily showing a visible alert.
They are useful for:
- refreshing inbox state
- syncing badge counts
- preloading likely-needed data
But they are heavily constrained by OS policies, so they should not be treated as a guaranteed background execution channel.
11.7 Foreground vs Background Behavior
The same push may behave differently based on app state:
- foreground: app may intercept and render custom UX
- background: system may show notification or deliver data silently depending on payload and permissions
- terminated: behavior depends on platform rules and payload type
This is why push design requires close partnership between backend and mobile engineers.
11.8 Retries and Delivery Limitations
Your backend can retry requests to APNs or FCM when their API fails transiently, but once the provider accepts a push, actual device delivery is still probabilistic.
Therefore:
- retry provider API failures sensibly
- do not assume provider acceptance means the user saw the notification
- use collapse keys or dedupe keys for replaceable updates
- rely on app open and in-app state sync for real correctness
11.9 Practical Examples
Uber-like app:
- real-time in-app updates while rider is online
- push for driver arrival if rider is backgrounded
- fallback SMS only for critical reachability gaps or special cases
SaaS mobile app:
- push for mentions, approvals, incidents, or account alerts
- user preferences decide which event types generate push
- app syncs authoritative state on open rather than trusting push payload alone
11.10 Common Mistakes
Common mistakes:
- treating push as reliable enough to replace durable inbox state
- never cleaning invalid tokens
- sending too many pushes and training users to disable them
- using silent push as if it were guaranteed scheduled background compute
Best practices:
- maintain durable notification state separately
- manage token lifecycle aggressively
- use collapse and dedupe strategies for replaceable alerts
- measure open rate, delivery feedback, and disablement trends
12. Retries
Retries are one of the most important and most dangerous parts of communication systems.
Done well, they recover from transient failure and make the system resilient. Done badly, they multiply load, create duplicates, and extend outages.
12.1 Why Retries Exist
Many failures are temporary:
- provider timeout
- brief network partition
- overloaded downstream service returning 429 or 503
- transient database or cache issue
Retrying later may succeed without user-visible failure.
12.2 Transient vs Permanent Failures
This distinction is critical.
| Failure type | Example | Retry? |
|---|---|---|
| Transient | timeout, 503, temporary DNS issue | Usually yes |
| Capacity/rate limit | 429, queue full, provider throttle | Yes, but with backoff and pacing |
| Permanent data error | invalid email, malformed phone number, missing template variable | No |
| Policy rejection | unsubscribed user, blocked sender, compliance rule | No |
If you do not classify failures, you will retry garbage forever.
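The table can be turned into a small classifier that runs before any retry is scheduled. The error code strings are hypothetical normalized labels, and treating every remaining 4xx as permanent is an assumption that may not suit every provider.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    RETRY = "retry"
    RETRY_WITH_PACING = "retry_with_pacing"
    DROP = "drop"


# Hypothetical normalized error codes for data and policy failures.
PERMANENT_ERROR_CODES = {
    "invalid_recipient",
    "template_variable_missing",
    "unsubscribed",
    "blocked_sender",
}


def classify_failure(status_code: Optional[int], error_code: Optional[str] = None) -> Decision:
    """Decide what to do with a failed send attempt.

    `status_code` is None when no response arrived at all (timeout, DNS issue).
    """
    if error_code in PERMANENT_ERROR_CODES:
        return Decision.DROP
    if status_code is None:
        return Decision.RETRY
    if status_code == 429:
        return Decision.RETRY_WITH_PACING
    if 500 <= status_code < 600:
        return Decision.RETRY
    return Decision.DROP  # remaining client errors are treated as permanent
```

Putting this decision in one place keeps individual workers from inventing their own retry rules.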
12.3 Retry Policies
A retry policy usually defines:
- maximum attempts
- initial delay
- backoff multiplier
- jitter strategy
- retryable status codes or error classes
- whether fallback provider or alternate channel should be used
- what goes to dead-letter after final failure
12.4 Exponential Backoff
Exponential backoff increases delay after each failure.
Conceptually:
$$\text{delay}_n = \text{base} \times \text{factor}^{n}$$
This reduces immediate pressure on an unhealthy dependency.
12.5 Jitter
Jitter randomizes retry timing so all failed requests do not retry in lockstep.
Without jitter, a temporary outage can turn into synchronized retry spikes.
That is one of the classic ways retries make outages worse.
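Backoff and jitter are usually combined. The sketch below computes the capped exponential delay and then picks a uniformly random delay between zero and that value, a variant often called full jitter. The function names, the cap, and the default parameters are assumptions for illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, factor: float = 2.0, cap: float = 60.0) -> float:
    """Capped exponential delay: min(cap, base * factor ** attempt)."""
    return min(cap, base * factor ** attempt)


def full_jitter_delay(
    attempt: int,
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Pick a random delay in [0, backoff_delay] to spread retries out."""
    source = rng if rng is not None else random
    return source.uniform(0.0, backoff_delay(attempt, base, factor, cap))
```

With `base=1` and `factor=2`, attempts 0 through 3 have upper bounds of 1, 2, 4, and 8 seconds. The randomization means that clients which failed at the same moment are unlikely to retry at the same moment.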
12.6 Dead-Letter Queues
A dead-letter queue stores items that repeatedly failed and need inspection or alternate handling.
DLQs are useful because they:
- stop poison messages from blocking healthy traffic
- preserve failed work for debugging
- allow manual replay after fixes
- provide operational visibility into systemic issues
12.7 Poison Messages
A poison message is a message that always fails because the content or logic is bad.
Examples:
- template references a missing variable
- payload violates provider schema
- recipient state is invalid and will never pass validation
Retries will not fix poison. Classification and DLQ handling will.
12.8 Retry Storms
Retry storms happen when failure causes more traffic than success ever would.
Typical causes:
- too many retries with too little backoff
- many workers retrying the same provider simultaneously
- client reconnect loops plus server-side retries both activating during outage
- no circuit breaker or health-aware throttling
This is why retries can save systems or destroy them.
12.9 Idempotency Importance
Retries mean duplicate attempts are inevitable.
Idempotency makes duplicates safe.
Examples:
- if event_id + recipient + channel already produced a send, do not send again
- if a webhook delivery with the same delivery ID is retried, the downstream consumer should accept the duplicate receipt without double-processing
- if an OTP request is superseded, older attempts should not remain valid
12.10 Queue Processing and Retry Pipeline
```mermaid
flowchart LR
    Q[[Notification Queue]] --> W[Worker]
    W --> TRY{Send succeeds?}
    TRY -- Yes --> DONE[Mark delivered / awaiting callback]
    TRY -- No transient --> RETRY[Schedule retry with backoff + jitter]
    TRY -- No permanent --> FAIL[Mark failed permanently]
    RETRY --> DQ[[Delayed Retry Queue]]
    DQ --> W
    W --> MAX{Attempts exceeded?}
    MAX -- Yes --> DLQ[[Dead Letter Queue]]
    MAX -- No --> TRY
```
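The pipeline above can be approximated in a few lines. This is a minimal sketch with in-memory lists standing in for queues; `Outcome`, `Job`, and `Pipeline` are hypothetical names, and the `send` callable is assumed to have already classified the failure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Outcome(Enum):
    OK = "ok"
    TRANSIENT = "transient"   # e.g. timeout or provider throttling
    PERMANENT = "permanent"   # e.g. invalid recipient


@dataclass
class Job:
    job_id: str
    attempts: int = 0


@dataclass
class Pipeline:
    max_attempts: int = 3
    delivered: List[Job] = field(default_factory=list)
    retry_queue: List[Job] = field(default_factory=list)
    failed: List[Job] = field(default_factory=list)
    dead_letter: List[Job] = field(default_factory=list)

    def handle(self, job: Job, send: Callable[[Job], Outcome]) -> None:
        job.attempts += 1
        outcome = send(job)
        if outcome is Outcome.OK:
            self.delivered.append(job)
        elif outcome is Outcome.PERMANENT:
            self.failed.append(job)          # retrying cannot help
        elif job.attempts >= self.max_attempts:
            self.dead_letter.append(job)     # attempts exhausted
        else:
            self.retry_queue.append(job)     # retry later with backoff
```

In a real deployment the retry queue would carry an explicit next-attempt time, and the dead-letter list would be a separate durable queue that operators can inspect and replay.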
12.11 Best Practices
Best practices:
- classify failures before retrying
- use exponential backoff with jitter
- cap attempts and send poison messages to DLQ
- make send operations idempotent
- add provider-level circuit breakers or adaptive throttling
- expose retry metrics so operators can see storms forming
13. Templates
Templates are where product communication and backend correctness meet.
They are also a major source of production mistakes.
13.1 Why Templates Matter
Templates allow teams to separate event logic from presentation.
Without templates, every service hardcodes message strings and formatting logic. That creates:
- duplicated content logic
- harder localization
- inconsistent branding
- risky ad hoc changes in code deployments
13.2 Template Engines
Common template systems support:
- placeholders such as user name, amount, due date
- conditional content
- loops for repeated items
- channel-specific rendering, such as HTML email vs plain text vs push body
The engine should ideally be constrained. Overly powerful templates can become hard to reason about or insecure.
13.3 Localization
Global products need localization support:
- translated strings
- locale-specific date, time, number, and currency formatting
- right-to-left language support where needed
- per-market legal wording variations
Localization is a major reason templates should be centrally managed.
13.4 Personalization
Templates often need data from multiple sources:
- user profile
- event payload
- account metadata
- product context such as workspace or subscription plan
The challenge is ensuring all required variables are available and valid at render time.
13.5 Placeholders and Strong Contracts
A mature system treats template variables like an API contract.
Good practice:
- define required and optional variables per template version
- validate payload shape before enqueueing large fanouts
- fail fast if critical variables are missing
This prevents discovering a missing placeholder only after a million-send campaign starts.
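A minimal sketch of such a contract check follows. It assumes templates use Python `str.format`-style `{name}` placeholders, which is an assumption for this example rather than a statement about any particular template engine; `placeholders` and `validate_payload` are hypothetical helper names.

```python
from string import Formatter
from typing import Dict, List, Set


def placeholders(template: str) -> Set[str]:
    """Collect the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def validate_payload(template: str, required: Set[str],
                     payload: Dict[str, object]) -> List[str]:
    """Return a list of contract violations; an empty list means safe to enqueue."""
    errors = []
    # The template must not depend on variables the contract does not guarantee.
    for name in sorted(placeholders(template) - required):
        errors.append(f"template uses undeclared variable: {name}")
    # The payload must supply every variable the contract requires.
    for name in sorted(required - set(payload)):
        errors.append(f"payload is missing required variable: {name}")
    return errors
```

Running this check once against a sample payload before a large fanout is enqueued turns a potential mid-campaign failure into an immediate, cheap rejection.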
13.6 Versioning
Templates need versioning because content changes over time.
Reasons:
- copy updates
- regulatory changes
- product redesigns
- A/B tests
Versioning lets you:
- replay historical sends faithfully
- know exactly which content a user saw
- roll back safely if a bad template is published
13.7 A/B Testing Basics
Notification systems often support experiments on:
- subject line
- send timing
- body copy
- CTA wording
- channel choice
The important engineering point is that experiment assignment should be stable and observable. Otherwise analytics become misleading.
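One way to get stable assignment is to derive the variant from a hash of the experiment and user identifiers, so the same user always lands in the same variant without any stored state. The sketch below assumes equal-weight variants; `assign_variant` is a hypothetical helper.

```python
import hashlib
from typing import Sequence


def assign_variant(experiment_id: str, user_id: str,
                   variants: Sequence[str]) -> str:
    """Deterministically map a user to one variant of an experiment."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Including the experiment ID in the hash input keeps assignments independent across experiments. Logging the chosen variant alongside each send keeps the assignment observable for later analysis.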
13.8 Approval Workflows
At scale, template changes are often too risky for direct edit-and-send.
Companies frequently use:
- draft and review states
- approval workflows for legal, compliance, or brand teams
- test-send environments
- guarded rollout by percentage or audience segment
This is especially common for finance, healthcare, and large SaaS platforms.
13.9 Template Safety
Safety concerns include:
- accidental broken placeholders
- unsafe HTML content
- unescaped user-generated data
- overlong SMS bodies creating unexpected segmentation cost
- push payload exceeding provider limits
Template safety needs validation per channel, not just generic render success.
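A few per-channel checks can be sketched as follows. The SMS estimate is deliberately rough: it treats pure-ASCII text as GSM-7 (160 characters for a single message, 153 per segment when concatenated) and everything else as UCS-2 (70 and 67), which ignores GSM-7 extension characters. The push limit is passed in by the caller because it depends on the provider. All function names here are hypothetical.

```python
import html
import json
import math
from typing import Dict


def escape_for_html(value: str) -> str:
    """Escape user-generated text before placing it into an HTML email."""
    return html.escape(value, quote=True)


def estimate_sms_segments(body: str) -> int:
    """Rough segment count; see the simplifying assumptions above."""
    if not body:
        return 0
    if body.isascii():
        length, single, multi = len(body), 160, 153
    else:
        # UCS-2 messages are measured in UTF-16 code units.
        length, single, multi = len(body.encode("utf-16-be")) // 2, 70, 67
    return 1 if length <= single else math.ceil(length / multi)


def fits_push_limit(payload: Dict[str, object], limit_bytes: int) -> bool:
    """Check the serialized payload size against a provider-specific limit."""
    encoded = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return len(encoded) <= limit_bytes
```

Checks like these belong in the publish and preview path for a template, so an oversized or unsafe render is caught before any recipient is involved.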
13.10 Rendering Failures
Rendering can fail because:
- a required variable is missing
- localization entry is absent
- HTML is malformed
- experiment assignment references an unpublished variant
These failures should be classified as permanent or configuration errors, not retried blindly forever.
13.11 Preview and Testing Systems
Good internal tooling matters.
Useful capabilities:
- preview with sample payloads
- render in multiple locales
- device/channel preview for email, SMS, and push
- validation against size limits and required variables
- test send to controlled recipients
This is how companies safely manage hundreds or thousands of templates.
14. Scheduling
Scheduling is the part of notification systems that answers "not now, but later." It sounds easy until timezones, retries, DST, and exactly-once concerns appear.
14.1 Delayed Notifications and Reminders
Common scheduled notifications:
- payment reminder tomorrow at 9 AM local time
- retry failed webhook in 30 minutes
- send digest every evening
- remind user if task remains unresolved for 24 hours
Scheduling therefore exists both for product UX and operational retry workflows.
14.2 Cron vs Queue-Based Scheduling
| Approach | Strengths | Weaknesses | Best fit |
|---|---|---|---|
| Cron-like scheduler | Simple for recurring global jobs | Not great for millions of per-user scheduled items | fixed recurring tasks |
| Delayed queue / timer wheel | Good for large numbers of future jobs | Operational complexity at very large horizons | reminders, retries, one-off schedules |
| Database-backed scheduler | Easy to reason about with scans | Can become expensive and contention-heavy | moderate scale scheduling |
Most production systems use more than one mechanism.
14.3 Distributed Schedulers
A distributed scheduler must avoid missing jobs and also avoid running the same job many times.
Typical strategies:
- partition scheduled jobs by time bucket and worker shard
- use lease/claim model so only one worker owns a due item at a time
- store a dedupe or execution key so reruns are safe
- enqueue work into normal queues when it becomes due rather than executing inline in the scheduler
This separation keeps the scheduler simple and lets worker fleets scale independently.
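The lease/claim strategy can be illustrated with a conditional update. The sketch below uses an in-memory SQLite database as a stand-in for the schedule store; the table layout, column names, and `claim_due_job` helper are assumptions for this example.

```python
import sqlite3
from typing import Optional


def make_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    with db:
        db.execute(
            "CREATE TABLE scheduled_jobs ("
            " job_id TEXT PRIMARY KEY,"
            " due_at REAL NOT NULL,"
            " lease_owner TEXT,"
            " lease_expires_at REAL NOT NULL DEFAULT 0,"
            " done INTEGER NOT NULL DEFAULT 0)"
        )
    return db


def claim_due_job(db: sqlite3.Connection, worker_id: str, now: float,
                  lease_seconds: float = 30.0) -> Optional[str]:
    """Try to take a lease on one due job; return its ID or None."""
    row = db.execute(
        "SELECT job_id FROM scheduled_jobs"
        " WHERE done = 0 AND due_at <= ? AND lease_expires_at <= ?"
        " ORDER BY due_at LIMIT 1",
        (now, now),
    ).fetchone()
    if row is None:
        return None
    with db:
        # The WHERE clause repeats the lease check, so only one worker wins.
        cur = db.execute(
            "UPDATE scheduled_jobs SET lease_owner = ?, lease_expires_at = ?"
            " WHERE job_id = ? AND done = 0 AND lease_expires_at <= ?",
            (worker_id, now + lease_seconds, row[0], now),
        )
    return row[0] if cur.rowcount == 1 else None
```

If a worker crashes while holding a lease, the lease simply expires and another worker can claim the job. That is also why the downstream execution needs a dedupe key: the first worker may have partly finished before it died.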
14.4 Retry Scheduling
Retry scheduling is just scheduling with shorter horizons and stricter idempotency.
Important considerations:
- next attempt time should be explicit
- attempts should be capped
- retry state must survive worker restarts
- delayed retries should not block primary queue throughput
14.5 Timezone Handling
Timezone bugs are common because user-local scheduling sounds simple but is not.
Examples:
- send at 9 AM in the user's configured timezone
- user changes timezone between scheduling and execution
- organization policy uses workspace timezone instead of user timezone
The scheduler must define which timezone source is authoritative.
14.6 DST Problems
Daylight saving time creates two classic edge cases:
- a local clock time that does not exist
- a local clock time that occurs twice
If you schedule "2:30 AM local time" on a day when the clocks change, what does that mean?
Production systems need explicit behavior, such as:
- shift to the next valid local time
- pin to UTC internally after initial local conversion
- document recurrence rules clearly
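Both edge cases can be detected with the standard library. The sketch below uses `zoneinfo` (Python 3.9+, which needs time zone data available on the host). The policy in `to_utc` is one possible choice for illustration: a nonexistent time moves forward by the length of the gap, and an ambiguous time resolves to its first occurrence.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive: datetime, tz_name: str) -> str:
    """Return 'normal', 'ambiguous', or 'nonexistent' for a local wall time."""
    tz = ZoneInfo(tz_name)
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    if first.utcoffset() == second.utcoffset():
        return "normal"
    # A real wall time survives a round trip through UTC; a skipped one does not.
    round_trip = first.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) == naive:
        return "ambiguous"
    return "nonexistent"


def to_utc(naive: datetime, tz_name: str) -> datetime:
    """Pin a local wall time to UTC using the policy described above."""
    local = naive.replace(tzinfo=ZoneInfo(tz_name), fold=0)
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    print(classify_local_time(datetime(2024, 3, 10, 2, 30), "America/New_York"))
    print(classify_local_time(datetime(2024, 11, 3, 1, 30), "America/New_York"))
```

Pinning to UTC works well for one-off schedules. Recurring schedules usually need to store the local rule ("every day at 9 AM in this zone") and recompute the UTC instant for each occurrence.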
14.7 Calendar Edge Cases
Other calendar issues:
- month-end behavior, such as "send on the 31st"
- leap years
- business-day-only rules
- local holidays for campaign or billing workflows
These matter more than many candidates expect, especially in financial and enterprise systems.
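For the month-end case, one simple policy is to clamp the preferred day to the last day of the month. The sketch below shows that policy only; `monthly_send_date` is a hypothetical helper, and other policies (such as skipping the month) are equally valid if documented.

```python
import calendar
from datetime import date


def monthly_send_date(year: int, month: int, preferred_day: int) -> date:
    """Clamp the preferred day of month to the last valid day of that month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(preferred_day, last_day))
```

With this policy, "send on the 31st" lands on 29 February in a leap year, 28 February otherwise, and on the 30th in 30-day months.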
14.8 Campaign Scheduling
Large campaigns introduce extra complexity:
- recipient expansion can be huge
- provider rate limits require pacing
- rollout may need regional waves
- cancellation or pause controls are necessary
- duplicate prevention across retries and partial runs matters
A campaign system is often more like a controlled distributed batch pipeline than a simple timer.
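Pacing against provider rate limits is often implemented with a token bucket. The sketch below is a single-process illustration with an injectable clock; the class name and parameters are assumptions, and a fleet of workers would need a shared limiter instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `burst` sends and a steady `rate_per_sec` after that."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def try_acquire(self, n: int = 1) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A campaign worker would call `try_acquire()` before each send and put the job back on a delayed queue when it returns `False`. Giving transactional traffic its own bucket keeps a large campaign from starving it.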
14.9 Exactly-Once Challenges
Scheduling systems rarely achieve true exactly-once delivery end-to-end.
More realistic goal:
- enqueue due work at least once
- make downstream processing idempotent
- store execution markers to avoid duplicate user-visible effects
That is the same practical pattern seen across broader distributed systems.
14.10 A Practical Scheduling Architecture
```mermaid
flowchart LR
    REQ[Create reminder / delayed notification] --> SCHED[(Schedule Store)]
    SCHED --> SCAN[Scheduler Workers]
    SCAN --> DUE{Now due?}
    DUE -- No --> SCHED
    DUE -- Yes --> Q[[Delivery Queue]]
    Q --> W[Notification Worker]
    W --> CH[Channel Provider]
    W --> MARK[(Execution / Idempotency Store)]
```
15. Comparing Channels in Practice
When interviewers ask communication questions, they often want channel tradeoffs, not only definitions.
15.1 WebSockets vs SSE vs Long Polling
| Question | WebSockets | SSE | Long Polling |
|---|---|---|---|
| Need bidirectional low-latency interaction? | Best choice | No | Poor fit |
| Need simple server-to-browser streaming? | Works, but may be overkill | Often best | Acceptable fallback |
| Need broad compatibility with older infrastructure? | Sometimes harder | Usually okay | Best compatibility |
| What does a high connection count cost? | Persistent connection memory | Persistent HTTP stream cost | Highest repeated request overhead |
| Typical examples | chat, collaboration, multiplayer-like UX | feeds, logs, dashboards, status | legacy updates, fallback transports |
15.2 Email vs SMS vs Push Notifications
| Channel | Strengths | Weaknesses | Best uses |
|---|---|---|---|
| Email | Rich content, durable inbox, cheap at scale relative to SMS | Deliverability complexity, slower user attention | receipts, alerts, digests, workflow updates |
| SMS | Very high reach, urgent attention | Expensive, abuse-prone, region-specific constraints | OTP, critical alerts, fallback reachability |
| Push | Low cost, strong mobile engagement | Delivery uncertainty, token churn, permission dependence | mobile app alerts, reminders, activity nudges |
A strong production design usually uses channel hierarchy rather than one channel for everything.
Example:
- real-time in-app if user is active
- push if mobile user is inactive
- email for durable summary or critical account record
- SMS only for urgent or high-assurance scenarios
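The hierarchy in this example can be expressed as a small routing function. This is a sketch under assumptions: the `RecipientState` fields, the `choose_channels` name, and the final email fallback are illustrative choices, and a real orchestrator would also consult preferences and suppression rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecipientState:
    active_in_app: bool
    push_enabled: bool
    email_verified: bool
    sms_opted_in: bool


def choose_channels(state: RecipientState, *, urgent: bool = False,
                    needs_durable_record: bool = False) -> List[str]:
    channels: List[str] = []
    if state.active_in_app:
        channels.append("in_app")        # user is connected right now
    elif state.push_enabled:
        channels.append("push")          # inactive mobile user
    if needs_durable_record and state.email_verified:
        channels.append("email")         # durable summary or account record
    if urgent and state.sms_opted_in:
        channels.append("sms")           # reserved for high-assurance cases
    if not channels and state.email_verified:
        channels.append("email")         # assumed fallback when nothing else applies
    return channels
```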
16. How These Systems Connect in Real Architectures
The biggest interview mistake is describing each component in isolation.
Production systems connect them.
16.1 Chat Application Architecture
```mermaid
flowchart TB
    U1[Sender Client] --> GW1[Realtime Gateway]
    GW1 --> MSG[Message Service]
    MSG --> DB[(Message Store)]
    MSG --> BUS[[Room Event Bus]]
    BUS --> GW2[Realtime Gateways]
    GW2 --> U2[Online Recipients]
    BUS --> NOTIF[Notification Orchestrator]
    NOTIF --> PUSH[Push / Email for offline users]
    GW2 --> PRES[Presence Service]
    PRES --> CACHE[(Presence TTL Store)]
```
Key insight:
- live transport handles active users
- durable message storage handles history and offline recovery
- presence influences whether to send push or keep delivery in-app
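The presence decision can be sketched with a small TTL store. This is an in-memory stand-in for whatever ephemeral store backs the presence service; `PresenceStore`, `delivery_path`, and the 30-second TTL are assumptions for this example.

```python
import time
from typing import Callable, Dict


class PresenceStore:
    """Treats a user as online only if a heartbeat arrived within the TTL."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        self._last_seen[user_id] = self.clock()

    def is_online(self, user_id: str) -> bool:
        seen = self._last_seen.get(user_id)
        return seen is not None and self.clock() - seen < self.ttl


def delivery_path(presence: PresenceStore, user_id: str) -> str:
    # Online users get the live transport; everyone else goes to the orchestrator.
    return "realtime" if presence.is_online(user_id) else "notification"
```

Because presence is only an estimate, the message is still written to the durable store first. A wrong guess about presence then costs a redundant or delayed notification, not a lost message.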
16.2 Ride Tracking Architecture
Uber-like example:
- driver app emits location updates
- ingestion service validates and writes them to a stream
- geo or trip service computes rider-visible state
- active rider session receives real-time updates through WebSocket or SSE
- if rider is backgrounded, notification system may send push for major trip milestones
The real-time system and notification system are therefore two views over the same event stream.
16.3 SaaS Workflow Notifications
GitHub or Stripe style example:
- domain event occurs, such as comment added or payment failed
- write event through outbox pattern to avoid losing it during transaction boundaries
- event bus feeds both in-app real-time feed and notification orchestrator
- preferences and suppression rules decide channel
- durable inbox and email/push/SMS state are updated independently
This architecture is robust because the event is authoritative and downstream communication is decoupled.
16.4 Notification Delivery Pipeline
```mermaid
sequenceDiagram
    participant P as Producer Service
    participant O as Outbox / Event Bus
    participant N as Notification Orchestrator
    participant R as Rules + Preferences
    participant Q as Channel Queue
    participant W as Worker
    participant X as Provider
    participant S as Status Store
    P->>O: Publish domain event
    O->>N: Deliver event
    N->>R: Resolve recipients, channels, suppression
    R-->>N: Eligible deliveries
    N->>Q: Enqueue channel jobs with idempotency key
    Q->>W: Deliver job
    W->>X: Send via provider
    X-->>S: Callback / receipt / bounce / failure
    S-->>N: Update notification state and metrics
```
17. What Breaks at Scale
Large communication systems usually break in predictable ways.
17.1 Real-Time System Failures
- reconnect storms after deploys or regional blips
- hotspot channels or rooms causing shard imbalance
- slow consumers building large outbound queues
- stale presence from delayed disconnect detection
- message replay windows too short for realistic mobile disconnects
17.2 Notification System Failures
- duplicate sends from missing idempotency
- retry storms after provider outage
- dead-letter growth due to poison templates or bad payload contracts
- provider callback backlog causing stale delivery state
- campaign sends overwhelming rate limits and delaying transactional traffic
17.3 Organizational Failures
- every team building its own notification logic
- no shared preference model
- weak observability into delivery stages
- no clear ownership of templates or compliance-sensitive content
Strong engineering organizations build communication systems as platforms because the problems repeat across products.
18. Best Practices and Common Mistakes
18.1 Best Practices
- separate authoritative business events from channel delivery
- choose the simplest transport that satisfies product requirements
- keep real-time transport distinct from durable history or durable inbox state
- use sequence numbers, cursors, or snapshots for reconciliation
- make retries idempotent and classify failures carefully
- use ephemeral stores with TTL for active presence
- centralize user preferences, suppression, and dedupe logic
- isolate transactional and bulk notification paths
- measure end-to-end latency, not just enqueue success
18.2 Common Mistakes
- assuming WebSockets imply reliable message delivery by themselves
- treating online state as exact truth
- sending every possible notification instead of designing for user attention
- retrying permanent failures forever
- ignoring provider callbacks and token invalidation
- letting campaigns compete with critical operational messages
- storing all heartbeat or presence writes durably and overwhelming the database
19. How to Explain This in an Interview
If asked to design or explain a communication system, structure the answer in this order:
- define the product requirement and latency expectation
- choose transport based on interaction style
- define durability and delivery guarantees
- explain fanout and scaling approach
- cover reconnect, retry, and idempotency behavior
- discuss presence or offline fallback if relevant
- mention observability, rate limits, and operational failure modes
Example framing:
"For active users, I would use WebSockets because the product needs bidirectional low-latency updates. I would keep durable state in the message store and use sequence numbers for replay after reconnect. Presence would be managed through heartbeats in a TTL-based store with grace periods. For offline users, I would feed the same events into a notification orchestrator that applies preferences, dedupe, and channel-specific retries for push, email, or SMS."
That answer sounds like someone who has moved beyond glossary knowledge.
20. Final Mental Model
Communication systems are not only about sending data. They are about delivering the right information:
- to the right user
- through the right channel
- with the right urgency
- at the right cost
- with acceptable reliability and ordering
- without overwhelming infrastructure or users
The best backend systems treat communication as a first-class architecture concern, not a thin wrapper around providers or sockets.
If you remember one core idea, remember this:
real-time delivery and notifications should be designed as downstream communication layers built on top of authoritative events, with explicit handling for fanout, retries, ordering, idempotency, presence, and user preference.
That is the difference between a toy design and a production one.