# Embedded System Reliability

Embedded system reliability is the discipline of making a product continue to behave safely and predictably when the real world is ugly: power rails sag, cables get zapped, motors inject noise, firmware gets stuck, sensors lie, and users do things the design did not politely ask for.

This guide is written as a practical handbook. The goal is not to memorize definitions. The goal is to understand why embedded systems fail, how engineers prevent those failures, how hardware and software cooperate, and how to debug problems that only appear after thousands of devices are shipped.

---

## 1. Reliability Mindset

### 1.1 What reliability actually means

In real engineering, reliability is not simply "the system works." A reliable system:

- starts correctly,
- keeps working under expected stress,
- detects abnormal conditions,
- transitions to a controlled state when it cannot keep working,
- recovers when recovery is appropriate,
- leaves behind enough evidence for engineers to understand what happened.

That last point matters more than many students expect. A device that resets itself but never records why it reset is hard to improve. Reliability is as much about diagnosability as it is about survival.

### 1.2 How embedded systems fail in practice

Embedded failures are often partial failures, not total failures.

Examples:

- The CPU still executes code, but an I2C peripheral is hung and blocks the control loop.
- The 3.3 V rail looks acceptable on a multimeter, but has 400 mV dips during motor startup.
- A GPIO button works in the lab but causes random resets in the field because ESD current is returning through the digital ground plane.
- Firmware "recovers" from a fault by restarting the main task, but the actuator output remains latched high in hardware.

This is why embedded reliability is system engineering, not just circuit design and not just firmware quality.

### 1.3 The reliability control loop

Good products follow the same loop repeatedly:

1. Prevent faults where possible.
2. Detect faults early.
3. Contain the fault so it does not spread.
4. Recover if safe.
5. Record enough data to learn from the event.

```mermaid
flowchart LR
    A[Disturbance or Fault\nPower, ESD, noise, software hang] --> B[Prevention\nDecoupling, layout, filters, reviews]
    B --> C[Detection\nWatchdogs, supervisors, CRCs, plausibility checks]
    C --> D[Containment\nReset, isolate peripheral, disable actuator]
    D --> E[Recovery\nReboot, retry, degraded mode]
    E --> F[Evidence\nReset cause, counters, logs, traces]
    F --> G[Design Improvement]
    G --> B
```

### 1.4 Time scales of failure

One reason reliability feels hard is that faults happen on very different time scales.

- Nanoseconds to microseconds: ESD, ringing, ground bounce, fast transients.
- Microseconds to milliseconds: watchdog windows, reset pulses, switching noise, relay kickback.
- Milliseconds to seconds: communication timeouts, boot sequencing, brownouts, control loop stalls.
- Hours to years: capacitor aging, connector corrosion, flash wear, thermal cycling.

You do not solve all of these with one technique. A watchdog does not stop ESD current. A TVS diode does not fix a deadlock. A brownout reset does not tell an actuator what safe state to enter.

### 1.5 The hardware-software contract

Reliable embedded systems depend on a clear contract between hardware and software.

Hardware should provide:

- clean power,
- stable clocks,
- valid reset behavior,
- protection paths for transients,
- safe actuator defaults,
- observability signals when possible.

Software should provide:

- watchdog policy,
- timeout and retry logic,
- state validation,
- fault classification,
- safe-state transitions,
- event logging and post-reset diagnosis.

Many real failures happen exactly at the boundary between the two. For example, firmware assumes it can save data during a brownout, while hardware cannot actually keep the rail alive long enough.

---

## 2. Reliability Architecture at a System Level

Before diving into the individual topics, it helps to see how they fit together.

```mermaid
flowchart TD
    P[Power Input] --> REG[Regulators / PMIC]
    REG --> MCU[MCU / SoC]
    REG --> PER[Peripherals]
    EXT[External Connectors] --> ESDP[ESD / Surge Protection]
    ESDP --> MCU
    ESDP --> PER
    ACT[Actuators / Loads] --> NOISE[Noise Sources\nMotors, relays, switching edges]
    NOISE --> REG
    NOISE --> MCU
    NOISE --> PER
    MCU --> WD[Watchdog Strategy]
    REG --> BOR[Brownout / Supervisor]
    BOR --> RST[Reset Distribution]
    WD --> RST
    RST --> MCU
    MCU --> SAFE[Fail-safe State Machine]
    SAFE --> ACT
    MCU --> LOG[Fault Logs / Reset Cause / Counters]
```

This picture is important because the topics in this guide are not separate chapters in a textbook. In production hardware, they interact continuously.

- Brownout logic affects reset behavior.
- Reset behavior affects watchdog effectiveness.
- ESD and noise can trigger resets.
- Fail-safe design determines what happens after a reset or fault.

---

## 3. Watchdogs

### 3.1 First principles

A watchdog exists because embedded systems can fail in ways that leave some parts alive and others dead.

The most common misconception is: "If the CPU is stuck, nothing runs." That is not always true.

Real failure examples:

- The main loop is blocked waiting for a peripheral flag that will never change.
- An interrupt storm starves normal tasks.
- A priority inversion prevents a control task from executing.
- Memory corruption changes a state variable so the system keeps spinning in a wrong but legal loop.
- Clocking or bus faults leave software executing nonsense or repeating a narrow code path forever.

In these cases, power is still present. The oscillator may still be running. The device may still toggle some outputs. That is why the system needs an independent mechanism to ask, "Are you truly healthy, or only partially alive?"

### 3.2 What a watchdog really does

A watchdog is a timer that must be refreshed by healthy software within a valid time window. If the refresh does not happen correctly, the watchdog assumes the system is unhealthy and triggers a recovery action, usually a reset.

The key phrase is healthy software. If you refresh the watchdog from the wrong place, you can create a system that looks protected but is not.

### 3.3 Types of watchdogs

| Type | How it works | Strengths | Weaknesses | Good use |
| --- | --- | --- | --- | --- |
| Internal watchdog | Timer inside MCU resets MCU if not refreshed | Cheap, easy, fast | Shares silicon and clock domain with MCU; may fail with same fault | Baseline protection |
| Windowed watchdog | Must be refreshed neither too late nor too early | Catches runaway loops that kick too fast | Needs careful timing analysis | Safety-conscious designs |
| External watchdog | Separate IC supervises MCU and can assert reset | More independent, stronger fault coverage | Extra BOM and layout work | Industrial, automotive, higher reliability products |
| Task or software watchdog | Supervisor task checks heartbeats from important tasks | Detects partial software failure | Not independent if poorly implemented | RTOS-based systems |

In serious products, engineers often combine them:

- task-level health supervision inside firmware,
- internal watchdog inside the MCU,
- external supervisor or watchdog IC for stronger independence.

### 3.4 Why feeding the watchdog is not the same as being healthy

Students often write code like this:

```c
while (1) {
    kick_watchdog();
    run_application();
}
```

This is weak because the system may still call `kick_watchdog()` even when the application is unhealthy.

Better approach:

1. Each critical task reports progress.
2. A health manager validates task timing, data freshness, memory margins, and fault status.
3. Only the health manager can refresh the watchdog.
4. If health is not confirmed, the system enters a safe state and intentionally stops feeding the watchdog.

```mermaid
flowchart TD
    T1[Control Task Heartbeat] --> HM[Health Manager]
    T2[Comms Task Heartbeat] --> HM
    T3[Sensor Task Fresh Data] --> HM
    T4[Memory / Stack Checks] --> HM
    T5[Fault Flags / Plausibility] --> HM
    HM -->|All healthy| KICK[Kick Watchdog]
    HM -->|Any unhealthy| SAFE[Enter Safe State]
    SAFE --> NOFEED[Stop Feeding Watchdog]
    NOFEED --> RESET[Watchdog Reset]
```

### 3.5 Practical watchdog architecture

A production-quality watchdog strategy usually includes the following rules:

- The watchdog is enabled early in boot, but only after clocks and reset cause capture are stable.
- The code that refreshes it is centralized.
- Refresh is tied to verified system progress, not just CPU activity.
- Reset cause is captured at next boot.
- Repeated watchdog resets are counted and treated differently from a rare one-off event.

For example, if a device watchdog-resets once every six months after a severe EMI event, an automatic reboot may be acceptable. If it watchdog-resets three times in one minute, the device may need to remain in a restricted mode and ask for service.

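The repeated-reset rule can be sketched as a small boot-time policy. This is a minimal illustration, not a reference implementation: the names `reset_history_t`, `boot_policy_on_reset`, `boot_policy_on_stable_run`, and the limit of three are assumptions made for the example, and the counter is assumed to live in memory that survives a reset (retention RAM or a backup register). It counts consecutive watchdog resets rather than resets per minute, which is a simplification of the time-based rule described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical boot modes for this sketch. */
typedef enum {
    BOOT_NORMAL,
    BOOT_RESTRICTED
} boot_mode_t;

/* Assumed limit: three watchdog resets in a row triggers restricted mode. */
#define WDT_RESET_LIMIT 3u

/* On a real target this would be placed in reset-surviving memory. */
typedef struct {
    uint8_t consecutive_wdt_resets;
} reset_history_t;

/* Call once at boot, after the reset cause has been captured. */
boot_mode_t boot_policy_on_reset(reset_history_t *hist, bool reset_was_watchdog)
{
    if (reset_was_watchdog) {
        if (hist->consecutive_wdt_resets < UINT8_MAX) {
            hist->consecutive_wdt_resets++;
        }
    } else {
        /* Any other reset cause breaks the streak. */
        hist->consecutive_wdt_resets = 0;
    }

    return (hist->consecutive_wdt_resets >= WDT_RESET_LIMIT)
               ? BOOT_RESTRICTED
               : BOOT_NORMAL;
}

/* Call after the system has run healthily for long enough to trust it. */
void boot_policy_on_stable_run(reset_history_t *hist)
{
    hist->consecutive_wdt_resets = 0;
}
```

Clearing the counter only after a period of stable operation is what separates a rare one-off reset from a reboot loop. How long "stable" must be is a product decision.
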
### 3.6 Firmware pattern

```c
#include <stdbool.h>

/* Platform hooks, assumed to be implemented elsewhere in the firmware. */
void kick_watchdog(void);
void enter_safe_state(void);
void log_fault_snapshot(void);

typedef struct {
    bool control_ok;
    bool comms_ok;
    bool sensor_data_fresh;
    bool storage_ok;
    bool stack_margin_ok;
} system_health_t;

/* Gathers the current health flags; also implemented elsewhere. */
system_health_t collect_system_health(void);

/* The watchdog may be refreshed only when every monitored condition holds. */
static bool system_can_feed_watchdog(system_health_t h)
{
    return h.control_ok &&
           h.comms_ok &&
           h.sensor_data_fresh &&
           h.storage_ok &&
           h.stack_margin_ok;
}

/* Called every 10 ms; the only place that refreshes the watchdog. */
void health_manager_tick_10ms(void)
{
    system_health_t h = collect_system_health();

    if (system_can_feed_watchdog(h)) {
        kick_watchdog();
    } else {
        enter_safe_state();
        log_fault_snapshot();
        /* Intentionally do not kick watchdog. */
    }
}
```

The important engineering decision is not the code itself. It is deciding what counts as healthy enough to keep the system alive.

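One way a health manager might derive a flag such as `control_ok` is from per-task heartbeat timestamps. The sketch below is an assumption about how that could look. `task_heartbeat_t`, `task_heartbeat_report`, and `task_heartbeat_is_fresh` are hypothetical names, and the millisecond tick source is left to the platform.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-task heartbeat record (hypothetical layout). */
typedef struct {
    uint32_t last_heartbeat_ms;  /* tick value at the last progress report */
    uint32_t max_silence_ms;     /* longest acceptable gap between reports */
} task_heartbeat_t;

/* Called by the task itself each time it completes a unit of real work. */
void task_heartbeat_report(task_heartbeat_t *t, uint32_t now_ms)
{
    t->last_heartbeat_ms = now_ms;
}

/* Called by the health manager to decide whether the task is still alive. */
bool task_heartbeat_is_fresh(const task_heartbeat_t *t, uint32_t now_ms)
{
    /* Unsigned subtraction stays correct when the tick counter wraps. */
    uint32_t elapsed_ms = now_ms - t->last_heartbeat_ms;
    return elapsed_ms <= t->max_silence_ms;
}
```

Each critical task gets its own record, so a stalled control task cannot hide behind a comms task that is still reporting.
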
### 3.7 Timeout selection

Watchdog timeout is a design decision, not a random constant.

If the timeout is too short:

- normal long operations cause nuisance resets,
- boot time may become unstable,
- field devices reset under valid heavy load.

If it is too long:

- faults persist too long,
- actuators may remain uncontrolled for dangerous durations,
- diagnostic resolution gets worse.

Choose timeout based on:

- worst-case task execution time,
- scheduler jitter,
- maximum acceptable fault reaction time,
- actuator hazard level,
- bootloader and firmware update behavior.

Windowed watchdogs add another dimension. They detect code that refreshes too early, which is useful for catching runaway loops.

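The window idea can be modeled on a host machine before committing to register settings. The following sketch only classifies a refresh by the time elapsed since the previous one. `wwdg_config_t`, `wwdg_classify_refresh`, and the result names are hypothetical. On real hardware a refresh that comes too late never actually happens, because the reset has already fired at the timeout.

```c
#include <stdint.h>

/* Outcome of a refresh attempt in this host-side model. */
typedef enum {
    WWDG_REFRESH_OK,
    WWDG_REFRESH_TOO_EARLY,  /* typical of a runaway loop kicking too fast */
    WWDG_REFRESH_TOO_LATE    /* typical of a stalled or overloaded system */
} wwdg_result_t;

/* Window limits, measured from the previous valid refresh. */
typedef struct {
    uint32_t window_open_ms;  /* refresh is rejected before this time */
    uint32_t timeout_ms;      /* refresh must arrive by this time */
} wwdg_config_t;

wwdg_result_t wwdg_classify_refresh(const wwdg_config_t *cfg, uint32_t elapsed_ms)
{
    if (elapsed_ms < cfg->window_open_ms) {
        return WWDG_REFRESH_TOO_EARLY;
    }
    if (elapsed_ms > cfg->timeout_ms) {
        return WWDG_REFRESH_TOO_LATE;
    }
    return WWDG_REFRESH_OK;
}
```

Feeding this model with measured worst-case and best-case loop periods, including scheduler jitter, is a cheap way to check that the nominal refresh period sits comfortably inside the window.
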
### 3.8 External watchdogs and why they matter

An internal watchdog is better than nothing, but it is not fully independent.

Why this matters:

- If the MCU clock tree is broken in a subtle way, the internal watchdog may be affected too.
- If a silicon erratum impacts reset or watchdog logic, internal recovery can become unreliable.
- If firmware accidentally disables the watchdog, the protection disappears.

An external watchdog or supervisor IC can:

- monitor a heartbeat pin,
- enforce timing window behavior,
- hold reset low for a guaranteed duration,
- supervise supply voltage at the same time.

This is common in industrial controllers, vehicles, power electronics, and products where field recovery must be robust.

### 3.9 Production scenarios

#### Industrial controller

A PLC-like controller manages digital IO and fieldbus communication. A bad field device causes bus transactions to hang. Without a watchdog strategy, outputs may remain in their last state forever. With a task-level watchdog and safe output defaults, the system can:

1. detect stalled communication,
2. place outputs into a predefined safe state,
3. stop feeding the external watchdog,
4. reboot cleanly,
5. log a communication-stall reset reason.

#### Motor controller

If the control loop misses timing deadlines, the safe action may be to disable the gate driver immediately in hardware, not wait for a general-purpose reset. In this case, the watchdog protects the controller, but the fail-safe output path protects people and equipment.

### 3.10 Common mistakes with watchdogs

- Feeding the watchdog from an ISR or timer that can keep running while the main application is broken.
- Disabling the watchdog in release builds because it was inconvenient during debugging.
- Failing to log reset cause and last-known fault state.
- Using one heartbeat for the whole system instead of checking critical tasks separately.
- Ignoring repeated-reset behavior and creating permanent reboot loops.
- Enabling the watchdog during firmware update without designing update-safe timing.

### 3.11 Debugging watchdog issues

When a system is "mysteriously resetting," do not guess. Gather evidence.

Useful checks:

1. Read the reset cause register at the earliest possible boot stage.
2. Preserve crash breadcrumbs in retention RAM, backup registers, FRAM, or a reserved flash journal.
3. Count consecutive watchdog resets.
4. Measure reset pin behavior and watchdog output with an oscilloscope or logic analyzer.
5. Verify that long flash operations, radio startup, or blocking drivers do not violate watchdog timing.

If a device resets only in the field, add software breadcrumbs such as:

- last task entered,
- last successful communication transaction,
- stack watermark,
- recent fault flags,
- supply voltage sample if valid,
- number of resets by cause.

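A breadcrumb record is only useful if boot code can tell real data from whatever happened to be in memory at power-up. The sketch below shows one possible shape, with a marker word and a simple checksum. The struct layout, the `BREADCRUMB_MAGIC` value, and the function names are all assumptions for illustration, and a production design would more likely use a proper CRC than this rotate-and-XOR checksum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary marker chosen for this sketch. */
#define BREADCRUMB_MAGIC 0xC0FFEE01u

/* Hypothetical breadcrumb layout, intended for reset-surviving memory. */
typedef struct {
    uint32_t magic;
    uint32_t last_task_id;
    uint32_t fault_flags;
    uint32_t stack_watermark_bytes;
    uint32_t wdt_reset_count;
    uint32_t checksum;
} breadcrumb_t;

/* Rotate-and-XOR over every field except the checksum itself. */
static uint32_t breadcrumb_checksum(const breadcrumb_t *b)
{
    const uint32_t words[] = {
        b->magic, b->last_task_id, b->fault_flags,
        b->stack_watermark_bytes, b->wdt_reset_count
    };
    uint32_t sum = 0;

    for (size_t i = 0; i < sizeof(words) / sizeof(words[0]); i++) {
        sum = (sum << 5) | (sum >> 27);  /* rotate left by 5 bits */
        sum ^= words[i];
    }
    return sum;
}

/* Call after updating fields so the record validates at next boot. */
void breadcrumb_seal(breadcrumb_t *b)
{
    b->magic = BREADCRUMB_MAGIC;
    b->checksum = breadcrumb_checksum(b);
}

/* Call early in boot, before anything overwrites the record. */
bool breadcrumb_is_valid(const breadcrumb_t *b)
{
    return b->magic == BREADCRUMB_MAGIC &&
           b->checksum == breadcrumb_checksum(b);
}
```

At boot, an invalid record is treated as "no evidence" and reinitialized. A valid one is copied into the fault log before normal operation resumes.
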
### 3.12 Interview-level understanding

A strong engineer should be able to explain:

- Why kicking a watchdog in the main loop is often insufficient.
- Why windowed watchdogs catch failures that normal watchdogs miss.
- Why an external watchdog increases fault coverage.
- Why watchdog design must include safe-state behavior, not just reset behavior.

---

## 4. Brownout Protection

### 4.1 First principles

Digital systems are built on analog physics. A microcontroller only looks digital because its transistors switch cleanly when voltage, current, and timing remain inside valid ranges.

A brownout is a condition where supply voltage drops below the level required for correct operation, but not necessarily all the way to zero.

This is dangerous because partial voltage can create partial correctness.

Examples of brownout behavior:

- CPU core logic becomes unreliable before the board is fully off.
- Flash or EEPROM writes can be corrupted.
- GPIO output levels can drift or glitch.
- One rail may collapse before another, causing peripheral latch-up or invalid bus signaling.
- A regulator may remain technically on, but already be out of regulation.

The key idea is that brownouts are not merely "power off." They are periods where your system may be alive enough to do damage but not healthy enough to do useful work.

### 4.2 Brownout, undervoltage, and power-good are not the same

These terms are related but different.

- Power-on reset (POR): ensures proper startup from power-up.
- Brownout reset (BOR): resets the MCU when supply falls below a threshold.
- UVLO (undervoltage lockout): prevents a regulator or power stage from operating below a safe voltage.
- Power-good (PG): indicates a rail is in regulation and usually ready for dependent circuits.

Engineers often confuse BOR with full brownout protection. BOR is only one part of the solution.

### 4.3 What actually happens during a brownout

Step by step:

1. Load current increases or input voltage dips.
2. The regulator output begins to droop.
3. Decoupling capacitors try to hold the rail up for a short time.
4. If the drop is brief, the system may ride through it.
5. If the rail continues down, clocks, logic thresholds, flash timing, and peripheral behavior become unreliable.
6. If BOR or a supervisor threshold is crossed, reset is asserted.
7. If reset is asserted too late, corrupted execution may occur first.
8. If power recovers, the system restarts and should log that a brownout occurred.

This is why threshold selection matters. The reset must occur before functional correctness is lost.

```mermaid
stateDiagram-v2
    [*] --> Normal
    Normal --> VoltageDip: Load step / input sag
    VoltageDip --> RideThrough: Capacitance and regulator recover rail
    RideThrough --> Normal
    VoltageDip --> WarningZone: Rail below margin but MCU may still run
    WarningZone --> BORAsserted: Supervisor or BOR threshold crossed
    BORAsserted --> ResetHeld: Reset held low until rail is valid
    ResetHeld --> Reboot: Rail recovers and reset releases
    Reboot --> LogCause: Boot code reads brownout flag
    LogCause --> Normal
```

### 4.4 Sources of brownout in real products

- Battery droop during radio burst or motor start.
- Automotive cold crank.
- USB cable resistance and hot-plug events.
- Shared supply with high inrush load.
- Weak wall adapter.
- Long cable harness causing voltage drop.
- Regulator dropout under peak current.
- Poor bulk capacitance or high ESR capacitors.

Many brownouts are load-induced, not source-induced. Engineers sometimes blame the power adapter when the real problem is local current pulse demand and inadequate energy storage or layout.

### 4.5 Core design tools for brownout protection

#### Brownout reset inside the MCU

Useful and often mandatory. It protects against invalid CPU execution when supply falls.

But internal BOR thresholds may be coarse, temperature-dependent, or not aligned with all system rails and peripherals.

#### External voltage supervisor

An external supervisor IC can provide:

- accurate threshold,
- hysteresis,
- guaranteed reset pulse width,
- monitoring of rails not handled by the MCU,
- better independence.

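The value of hysteresis is easiest to see in a small behavioral model. The sketch below is not a driver for any particular supervisor IC. `supervisor_model_t`, `supervisor_update`, and the threshold values used with it are illustrative assumptions. Reset asserts when the rail falls below one threshold and releases only after the rail rises above a higher one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Behavioral model of a supervisor with separate falling and rising thresholds. */
typedef struct {
    uint32_t assert_below_mv;   /* falling threshold: assert reset below this */
    uint32_t release_above_mv;  /* rising threshold: release reset above this */
    bool reset_asserted;
} supervisor_model_t;

/* Feed one rail sample; returns true while reset is held asserted. */
bool supervisor_update(supervisor_model_t *s, uint32_t rail_mv)
{
    if (!s->reset_asserted && rail_mv < s->assert_below_mv) {
        s->reset_asserted = true;
    } else if (s->reset_asserted && rail_mv > s->release_above_mv) {
        s->reset_asserted = false;
    }
    /* Inside the band between the two thresholds the state is unchanged. */
    return s->reset_asserted;
}
```

With the two thresholds set equal, a rail hovering near the limit would toggle reset on every sample. The gap between them is what prevents that chatter.
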
#### Bulk capacitance and decoupling

Capacitors buy time. They do not create energy, but they can briefly deliver energy while the source and regulator catch up.

A useful first-order hold-up estimate, assuming a roughly constant load current:

`C = I * dt / dV`

Example:

- current draw = 120 mA
- required hold-up time = 5 ms
- allowable droop = 0.3 V

Then:

- `C = 0.12 * 0.005 / 0.3 = 0.002 F = 2000 µF`

This immediately shows why "save everything to flash when power drops" is often unrealistic unless the load is tiny or the allowed droop is large.

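The same estimate is easy to wrap in a helper so it can be rerun as the current budget changes. This is a sketch under the constant-current assumption. The function names are hypothetical, and it ignores capacitor ESR, tolerance, and derating, all of which push the real requirement higher.

```c
/* Required capacitance in farads for C = I * dt / dV.
 * Returns a negative value if the inputs are not physically meaningful. */
double holdup_capacitance_farads(double load_current_a,
                                 double holdup_time_s,
                                 double allowed_droop_v)
{
    if (load_current_a < 0.0 || holdup_time_s < 0.0 || allowed_droop_v <= 0.0) {
        return -1.0;
    }
    return load_current_a * holdup_time_s / allowed_droop_v;
}

/* Inverse form: how long a given capacitance holds the rail, dt = C * dV / I.
 * Returns a negative value if the inputs are not physically meaningful. */
double holdup_time_seconds(double capacitance_f,
                           double load_current_a,
                           double allowed_droop_v)
{
    if (capacitance_f < 0.0 || load_current_a <= 0.0 || allowed_droop_v < 0.0) {
        return -1.0;
    }
    return capacitance_f * allowed_droop_v / load_current_a;
}
```

Calling `holdup_capacitance_farads(0.12, 0.005, 0.3)` reproduces the 0.002 F figure from the worked example. The inverse form shows that a 100 µF capacitor under the same load and droop buys only about 0.25 ms.
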
#### Power sequencing and power-good signals

Some systems require analog, IO, and core rails to come up and down in a specific order. If a peripheral drives bus lines while the MCU is unpowered, strange failure modes can happen.

#### Firmware write strategy

If power can disappear unexpectedly, nonvolatile data writes must be robust:

- use journaling or double-buffered records,
- include CRCs,
- avoid partially updating the only copy,
- commit with version markers,
- design so sudden reset leaves either the old record or the new record valid.

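The double-buffered idea can be sketched with two slots, a sequence number, and a CRC. Everything here is illustrative: `record_t`, the slot functions, and the single `payload` word are assumptions, the slots are plain RAM rather than flash, and the CRC is a straightforward bitwise CRC-32 (reflected polynomial `0xEDB88320`). On a real device the commit step would erase, program, and verify the target slot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One stored record; the CRC covers every field before it. */
typedef struct {
    uint32_t sequence;  /* increases with every commit */
    uint32_t payload;   /* stand-in for the real persistent data */
    uint32_t crc;
} record_t;

/* Bitwise CRC-32, reflected form, polynomial 0xEDB88320. */
static uint32_t crc32_bytes(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return ~crc;
}

static uint32_t record_crc(const record_t *r)
{
    return crc32_bytes((const uint8_t *)r, offsetof(record_t, crc));
}

bool record_is_valid(const record_t *r)
{
    return r->crc == record_crc(r);
}

/* Returns the index (0 or 1) of the newest valid slot, or -1 if none. */
int record_select_slot(const record_t slots[2])
{
    bool v0 = record_is_valid(&slots[0]);
    bool v1 = record_is_valid(&slots[1]);

    if (v0 && v1) {
        /* Signed difference keeps the comparison correct across wraparound. */
        return ((int32_t)(slots[1].sequence - slots[0].sequence) > 0) ? 1 : 0;
    }
    if (v0) return 0;
    if (v1) return 1;
    return -1;
}

/* Writes the new record into the slot that is NOT the current valid one,
 * so an interrupted write can never damage the only good copy. */
void record_commit(record_t slots[2], uint32_t payload)
{
    int current = record_select_slot(slots);
    int target = (current == 0) ? 1 : 0;
    record_t r = {
        .sequence = (current < 0) ? 1u : slots[current].sequence + 1u,
        .payload = payload,
        .crc = 0u
    };

    r.crc = record_crc(&r);
    slots[target] = r;
}
```

If power fails during a commit, the target slot fails its CRC check at the next boot and `record_select_slot` falls back to the previous record, which is the "old or new, never neither" property the list above asks for.
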
### 4.6 Choosing thresholds correctly

This is a professional-level design decision.

Your BOR or supervisor threshold should consider:

- MCU minimum operating voltage at the chosen clock frequency,
- flash programming minimum voltage,
- regulator dropout behavior,
- sensor and transceiver valid operating limits,
- worst-case temperature,
- transient droop depth and duration,
- desired margin.

Common mistake: setting the threshold just above the absolute minimum CPU voltage. That protects only the core in a narrow sense. It may not protect flash writes, clock accuracy, communication peripherals, or actuator outputs.

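Part of this reasoning can be captured as a simple design-time check: the threshold must sit above the most demanding minimum voltage plus margin, yet below the lowest rail voltage seen in normal operation. The sketch below is a simplification with hypothetical names, and it leaves temperature and threshold tolerance to the engineer. Any numbers fed into it should come from the actual datasheets.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimum voltages the reset threshold has to protect, in millivolts. */
typedef struct {
    uint32_t mcu_min_mv;            /* at the chosen clock frequency */
    uint32_t flash_program_min_mv;
    uint32_t peripheral_min_mv;     /* most demanding sensor or transceiver */
    uint32_t margin_mv;
} rail_requirements_t;

static uint32_t max_u32(uint32_t a, uint32_t b)
{
    return (a > b) ? a : b;
}

/* Lowest acceptable reset threshold: the strictest requirement plus margin. */
uint32_t reset_threshold_floor_mv(const rail_requirements_t *req)
{
    uint32_t floor_mv = max_u32(req->mcu_min_mv,
                                max_u32(req->flash_program_min_mv,
                                        req->peripheral_min_mv));
    return floor_mv + req->margin_mv;
}

/* True if normal worst-case droop stays above the threshold,
 * which means the threshold should not cause nuisance resets. */
bool reset_threshold_avoids_nuisance(uint32_t threshold_mv,
                                     uint32_t lowest_normal_rail_mv)
{
    return threshold_mv < lowest_normal_rail_mv;
}
```

If no threshold satisfies both checks, the fix lies in the power design (more capacitance, a better regulator, a lower clock frequency), not in the threshold setting.
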
### 4.7 Software-hardware cooperation during brownout

A good design makes the software do only what it can genuinely finish before the rail collapses.

Bad assumption:

- "We detect falling voltage in an ADC interrupt and save full state to flash."

Why this is often wrong:

- ADC sample may already be invalid.
- CPU timing may be marginal.
- flash write latency may exceed available hold-up time.
- the rail may bounce around the threshold.

Better strategy:

- Use BOR or supervisor to force a clean reset.
- Keep critical persistent data updated incrementally during normal operation.
- At reboot, detect brownout cause and validate storage with CRC.
- If needed, use a very small emergency state save only when hardware guarantees enough hold-up energy.

### 4.8 Production scenarios

#### Battery-powered wireless sensor

Radio transmit bursts pull short current peaks. The system passes basic bench tests but resets when battery is cold. Root cause: battery internal resistance rises at low temperature, causing rail droop during transmit. Fix: more local capacitance, lower burst power, revised battery selection, BOR threshold check, and transmit scheduling.

#### Motor-driven embedded controller

Motor start causes shared rail dip that corrupts encoder communication. The MCU does not fully reset, but the peripheral state machine breaks. Fix: isolate motor power path, improve return routing, add supervisor IC, and define safe reinitialization sequence.

#### Automotive module

Cranking causes supply dip below nominal for tens of milliseconds. The design must either ride through the crank or shut down cleanly and restart without unsafe output behavior.

### 4.9 Common mistakes in brownout design

- No distinction between POR and BOR behavior.
- Threshold too low to protect flash or peripherals.
- No hysteresis, causing chatter around threshold.
- Assuming a multimeter is enough to validate supply quality.
- No analysis of peak current paths.
- Writing critical data non-atomically.
- Forgetting that debug bench supply is much better than real field power sources.

### 4.10 How to debug brownout problems

1. Use an oscilloscope, not just a DMM.
2. Probe the rail at the MCU pins, not only at the power connector.
3. Capture the reset line, power-good signal, and suspicious load enable signal at the same time.
4. Intentionally inject supply droops using a programmable supply or load step.
5. Read and log brownout reset flags.
6. Validate nonvolatile data integrity after repeated power interruption tests.

A useful field test is repeated power interruption at different points in the product duty cycle. This often reveals data corruption bugs that normal functional tests miss.

### 4.11 Interview-level understanding

A strong engineer should be able to explain:

- Why brownouts are more dangerous than simple power-off.
- Why BOR threshold selection must consider more than CPU minimum voltage.
- Why flash corruption risk is a brownout problem even when code still seems to run.
- Why power integrity measurements must be time-domain measurements at the load.

---

## 5. ESD

### 5.1 First principles

Electrostatic discharge is the sudden flow of charge between bodies at different potentials. In embedded products, the human body, cables, enclosure parts, and external connectors can all build charge and then discharge into the system.

ESD events are fast, high-voltage, and high-current relative to normal signal behavior. Even if the total energy is not huge, the speed of the event makes it dangerous.

Why it matters:

- semiconductors are small,
- junctions can be overstressed,
- internal parasitic structures can trigger latch-up,
- fast current pulses can inject noise into ground and supply networks,
- software can fail even when hardware is not permanently damaged.

### 5.2 ESD models engineers should know

- HBM (human body model): useful as a component-level reference.
- IEC 61000-4-2: system-level ESD immunity testing used in real product qualification.

Important practical point: a part rated well for HBM is not automatically sufficient for IEC system-level events. Component robustness and system robustness are related but not identical.

### 5.3 What ESD does to embedded systems

ESD can cause:

- immediate catastrophic damage,
- latent damage that weakens the part and fails later,
- logic upset,
- corrupted communication frames,
- unexpected reset,
- latch-up with excessive supply current,
- sensor misreadings,
- reboot loops if reset handling is weak.

This is why engineers distinguish between survival and functional immunity. A product that survives an ESD hit but reboots or locks up every time may still fail compliance or field expectations.

### 5.4 Think in terms of current path, not just voltage rating

One of the most important reliability intuitions is this: ESD design is mostly about controlling current path.

If a charged person touches a connector pin, ask:

- Where does the current enter?
- What is the lowest-impedance path to chassis or return?
- Does the current flow through sensitive silicon before reaching protection?
- Does the protection device clamp early enough and sit physically close enough?

If you only ask "Did we place a TVS diode?" you are asking the wrong question.

```mermaid
flowchart LR
    HIT[ESD Strike at Connector] --> TVS[TVS / Clamp Near Entry]
    TVS --> CHASSIS[Chassis or Controlled Return Path]
    HIT --> BAD[Bad Path Through MCU Ground / Signal Trace]
    BAD --> MCU[MCU / Transceiver Stress]
    CHASSIS --> SAFE[Energy Diverted Away from Sensitive Silicon]
```

### 5.5 Typical protection elements

#### TVS diodes

TVS devices clamp voltage by conducting surge current. They are effective only when:

- their working voltage fits the interface,
- they are placed very close to the entry point,
- the return path is short and low inductance,
- the PCB routes ESD current away from sensitive circuitry.

#### Series resistors or ferrite beads

These add impedance, slowing or reducing surge current into IC pins. They are often used on slower interfaces or GPIOs.

#### Common-mode chokes

Useful on differential external interfaces such as USB, CAN, or Ethernet in some designs, depending on data rate and EMC goals.

#### Chassis grounding and shielding

If the product has a metal enclosure or shield, a controlled path to chassis can dramatically improve robustness. The goal is to keep the discharge current out of the digital ground plane whenever possible.

### 5.6 Layout matters as much as component selection

Many ESD failures come from correct parts used incorrectly.

Good layout practice:

- place protection at the connector entry,
- keep the path from connector to TVS short,
- keep the path from TVS to chassis or return short and wide,
- avoid routing protected signals deep into the board before the protection element,
- separate noisy entry regions from sensitive analog or clock regions.

Common mistake: placing a TVS near the MCU because that is where there was board space. By then, the damaging current has already traveled across the board.

### 5.7 Interface-specific examples

#### Exposed GPIO button

If a user-accessible button trace runs directly into the MCU pin, the pin becomes an entry point for ESD. Better design may include:

- series resistor,
- RC filtering if timing allows,
- external clamp or TVS depending on exposure,
- Schmitt-trigger input if available,
- route discipline.

#### USB or UART connector

Use interface-appropriate low-capacitance protection. High-speed interfaces require careful capacitance and routing choices. A TVS with excessive capacitance can destroy signal integrity even while improving surge robustness.

#### Industrial IO line

Often needs stronger protection, perhaps TVS, series resistance, filtering, isolation, and careful cable shield handling.

### 5.8 Software implications of ESD

ESD is not only a hardware problem.

Software should assume an ESD event may cause:

- one bad frame,
- one invalid sensor sample,
- temporary peripheral bus lockup,
- unexpected reset.

Useful software responses:

- CRC on communication,
- timeouts and retry logic,
- reinitialization of stuck peripherals,
- reset cause logging,
- plausibility filtering of suddenly impossible sensor values.

This is a good example of hardware-software co-design. Hardware reduces how often the event enters the system. Software prevents one transient from turning into a persistent failure.

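Plausibility filtering in particular is cheap to sketch. The filter below rejects a sample that is outside the physical range or that jumps further from the last accepted value than the process could realistically move in one sample period. The names `plausibility_filter_t` and `plausibility_accept`, along with any limits passed in, are assumptions for illustration. A real design would also re-baseline after persistent disagreement so a genuine step change is not rejected forever.

```c
#include <stdbool.h>
#include <stdint.h>

/* Range and rate limits for one sensor channel (hypothetical). */
typedef struct {
    int32_t min_value;        /* lowest physically possible reading */
    int32_t max_value;        /* highest physically possible reading */
    int32_t max_step;         /* largest believable change per sample */
    int32_t last_good;        /* last accepted reading */
    bool has_last_good;
    uint32_t rejected_count;  /* evidence counter for later diagnosis */
} plausibility_filter_t;

/* Returns true and updates last_good if the sample is believable. */
bool plausibility_accept(plausibility_filter_t *f, int32_t sample)
{
    bool in_range = (sample >= f->min_value) && (sample <= f->max_value);
    bool step_ok = true;

    if (in_range && f->has_last_good) {
        /* 64-bit arithmetic avoids overflow when computing the difference. */
        int64_t step = (int64_t)sample - (int64_t)f->last_good;
        if (step < 0) {
            step = -step;
        }
        step_ok = (step <= (int64_t)f->max_step);
    }

    if (in_range && step_ok) {
        f->last_good = sample;
        f->has_last_good = true;
        return true;
    }

    f->rejected_count++;
    return false;
}
```

When a sample is rejected, the caller keeps using `last_good`. The `rejected_count` field turns an otherwise invisible transient into an event counter that field diagnostics can read.
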
### 5.9 Latch-up and why it is serious

Latch-up is a condition where parasitic structures inside a semiconductor create a low-impedance path between supply rails, causing excessive current until power is removed or the device is destroyed.

ESD and overvoltage conditions can trigger it.

Symptoms:

- sudden high current draw,
- overheating,
- permanent failure or device that only recovers after power cycle.

If ESD testing causes abnormal supply current, latch-up should be high on the suspect list.

### 5.10 Production scenarios

#### Handheld device with exposed connector

Users plug in long cables and touch the connector shell. Device survives lab bench use but resets in dry winter conditions. Root cause: ESD current couples into logic ground due to poor connector-to-chassis discharge path.

#### Factory sensor node

Long cable runs act like antennas and charge collectors. System does not die, but communication errors spike after handling. Fix: interface protection, cable shield strategy, better grounding, and software retries with event counters.

### 5.11 Common ESD design mistakes

- Choosing a TVS only by peak power rating, ignoring capacitance and clamping behavior.
- Placing protection too far from the connector.
- Dumping ESD current into sensitive digital ground without controlled return.
- Assuming internal MCU clamp diodes are enough for external interfaces.
- Forgetting enclosure, cable, and user interaction in the system-level current path.
- Not testing functional recovery after ESD strike.

### 5.12 How to debug ESD problems

1. Identify exposed touch points and connectors.
2. Reproduce with controlled ESD test method if available.
3. Monitor reset lines, supply rails, key communication lines, and supply current.
4. Distinguish between permanent damage, temporary upset, and reset.
5. Inspect layout around entry points and current return paths.
6. Review whether failures happen on contact discharge, air discharge, or cable connection events.

If the hardware survives but functionality degrades, the software recovery path may be inadequate even if the board-level protection is reasonable.

### 5.13 Interview-level understanding
|
|
|
|
A strong engineer should be able to explain:
|
|
|
|
- Why ESD protection is mainly about current path control.
|
|
- Why HBM part ratings do not guarantee IEC system immunity.
|
|
- Why placement of the TVS diode matters as much as the device itself.
|
|
- Why software needs recovery logic even in a hardware protection discussion.
|
|
|
|
---
|
|
|
|
## 6. Noise Issues
|
|
|
|
### 6.1 First principles
|
|
|
|
Noise is any unwanted electrical disturbance that interferes with the signal or power behavior your system depends on.
|
|
|
|
In embedded systems, noise is not just random fuzz on a waveform. It becomes bugs:
|
|
|
|
- false interrupts,
|
|
- ADC readings that wander,
|
|
- corrupted serial packets,
|
|
- relays triggering unexpectedly,
|
|
- intermittent resets,
|
|
- "works on the bench, fails near the motor" field complaints.
|
|
|
|
### 6.2 Source, path, victim
|
|
|
|
This is the most useful practical model for reasoning about noise.
|
|
|
|
- Source: where the disturbance originates.
|
|
- Path: how it couples into something else.
|
|
- Victim: the signal or subsystem that gets disturbed.
|
|
|
|
If you only focus on the victim, you will often patch symptoms instead of fixing the root cause.
|
|
|
|
```mermaid
flowchart LR
    SRC[Source\nMotor, relay, buck converter, clock edge] --> PATH[Coupling Path\nCapacitive, inductive, common impedance, radiated]
    PATH --> VIC[Victim\nADC, reset line, UART, sensor input]
    VIC --> FAIL[Observed Failure\nReset, glitch, bad sample, comm error]
```
|
|
|
|
### 6.3 Major types of noise in embedded systems
|
|
|
|
#### Conducted noise
|
|
|
|
Travels through wires, traces, or shared power and ground impedance.
|
|
|
|
Examples:
|
|
|
|
- switching regulator ripple,
|
|
- motor current spikes,
|
|
- ground bounce,
|
|
- supply droop coupling into analog front ends.
|
|
|
|
#### Radiated noise
|
|
|
|
Travels through electromagnetic fields.
|
|
|
|
Examples:
|
|
|
|
- fast clock edges coupling into nearby traces,
|
|
- long cable acting as an antenna,
|
|
- RF transmitter affecting high-impedance sensor line.
|
|
|
|
#### Common-mode noise
|
|
|
|
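Both conductors of a pair move together relative to a reference such as ground or chassis. This matters most on cables and differential interfaces, where the receiver tolerates only a limited common-mode range.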
|
|
|
|
#### Differential noise
|
|
|
|
Appears between the two signal conductors directly and can corrupt data bits or analog measurements.
|
|
|
|
### 6.4 Where noise comes from in real products
|
|
|
|
- DC motors, especially brushed types,
|
|
- relays and solenoids,
|
|
- switching power supplies,
|
|
- long cable harnesses,
|
|
- high dI/dt load transients,
|
|
- poorly decoupled digital ICs,
|
|
- shared return paths,
|
|
- un-terminated or badly routed fast digital traces.
|
|
|
|
### 6.5 Why digital systems still fail from analog noise
|
|
|
|
Digital logic depends on thresholds and timing.
|
|
|
|
Noise causes failure when it changes either:
|
|
|
|
- the apparent voltage level at the sampling moment,
|
|
- the reference used to interpret that level,
|
|
- or the timing of the transition.
|
|
|
|
Examples:
|
|
|
|
- a noisy reset line crosses threshold briefly and resets the MCU,
|
|
- an encoder input double-counts because of ringing,
|
|
- an ADC reference bounces during a conversion,
|
|
- an I2C bus gets stuck because a noise glitch is counted as a clock edge, leaving a device holding SDA low.
|
|
|
|
### 6.6 Noise mitigation hierarchy
|
|
|
|
Best practice is to attack noise in this order when possible:
|
|
|
|
1. Reduce the source.
|
|
2. Interrupt the coupling path.
|
|
3. Harden the victim.
|
|
4. Add software filtering and fault tolerance.
|
|
|
|
This order matters. Software filtering cannot fully fix a reset line that is physically glitching.
|
|
|
|
### 6.7 Hardware techniques
|
|
|
|
#### Decoupling capacitors
|
|
|
|
Local capacitors supply transient current close to the IC and reduce rail disturbance. Their effectiveness depends on value, ESR, ESL, and placement. A decoupling capacitor that is physically far away loses much of its high-frequency benefit.
|
|
|
|
#### Bulk capacitance
|
|
|
|
Helps with lower-frequency load steps and supply stability.
|
|
|
|
#### Ground and return path design
|
|
|
|
Current always returns somehow. If high-current return shares impedance with sensitive circuits, noise appears as voltage drop on the shared path.
|
|
|
|
This is why layout is often more important than schematic beauty.
|
|
|
|
#### Filtering
|
|
|
|
- RC low-pass for slow inputs,
|
|
- LC or pi filters for power paths,
|
|
- ferrites for high-frequency isolation,
|
|
- common-mode chokes for cable interfaces,
|
|
- flyback diodes or snubbers for inductive loads.
|
|
|
|
#### Signal integrity practices
|
|
|
|
- termination when edges are fast relative to trace length,
|
|
- controlled routing for clocks and fast buses,
|
|
- avoiding long high-impedance sensor traces,
|
|
- differential signaling for noisy environments.
|
|
|
|
### 6.8 Software techniques
|
|
|
|
Software cannot replace hardware integrity, but it can dramatically improve robustness.
|
|
|
|
Useful strategies:
|
|
|
|
- debounce digital inputs,
|
|
- median or moving-average filters for sensor data where latency permits,
|
|
- plausibility checks and range validation,
|
|
- CRC and retry on communication,
|
|
- timeout and reinitialization logic for buses,
|
|
- majority voting over repeated samples,
|
|
- event counters to correlate failures with operating conditions.
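
Two of the strategies above, median filtering and plausibility checking, can be sketched as follows. This is a minimal illustration, not a production filter: the function names, the temperature limits, and the maximum step size are all assumptions chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Median of three samples: rejects a single outlier without averaging it in. */
int32_t median3(int32_t a, int32_t b, int32_t c)
{
    if (a > b) { int32_t t = a; a = b; b = t; }   /* now a <= b */
    if (b > c) { int32_t t = b; b = c; c = t; }   /* now c is the maximum */
    if (a > b) { int32_t t = a; a = b; b = t; }   /* now a <= b <= c */
    return b;
}

/* Hypothetical limits for a temperature sensor, in tenths of a degree C. */
#define TEMP_MIN_DECI_C  (-400)
#define TEMP_MAX_DECI_C  (1250)
#define TEMP_MAX_STEP    (50)   /* largest believable change per sample */

/* Returns true and updates *accepted only if the new reading is plausible. */
bool temp_plausible(int32_t reading, int32_t *accepted)
{
    if (reading < TEMP_MIN_DECI_C || reading > TEMP_MAX_DECI_C) {
        return false;               /* outside the physical range */
    }
    int32_t delta = reading - *accepted;
    if (delta < 0) {
        delta = -delta;
    }
    if (delta > TEMP_MAX_STEP) {
        return false;               /* changed faster than the process can */
    }
    *accepted = reading;
    return true;
}
```

A rejected reading should normally increment an event counter, and a long run of rejections should be treated as a sensor fault rather than filtered forever.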
|
|
|
|
Example:
|
|
|
|
A noisy mechanical input should usually be handled with both hardware and software measures:
|
|
|
|
- hardware RC or Schmitt-trigger input to clean the edge,
|
|
- software debounce to reject remaining bounce and transient spikes.
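
The software half of that pairing might look like the counter-based debounce below. It is a sketch under stated assumptions: `debounce_update` is a hypothetical name, and the function is assumed to be called at a fixed rate (for example from a 1 ms tick) with the raw pin level.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_TICKS 5u   /* assumed: 5 consecutive samples must agree */

typedef struct {
    bool    stable;   /* debounced state reported to the application */
    uint8_t count;    /* consecutive samples that disagree with 'stable' */
} debounce_t;

/* Call at a fixed rate with the raw pin level; returns the debounced state. */
bool debounce_update(debounce_t *d, bool raw)
{
    if (raw == d->stable) {
        d->count = 0;                       /* agreement: discard any partial run */
    } else if (++d->count >= DEBOUNCE_TICKS) {
        d->stable = raw;                    /* disagreement persisted long enough */
        d->count = 0;
    }
    return d->stable;
}
```

The tick count trades response latency against rejection of bounce and short spikes, so it should be chosen from measured bounce time rather than guessed.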
|
|
|
|
### 6.9 Practical examples
|
|
|
|
#### ADC noise from switching regulator
|
|
|
|
Symptom: ADC values vary only when DC/DC converter load changes.
|
|
|
|
Possible causes:
|
|
|
|
- poor analog supply filtering,
|
|
- bad reference routing,
|
|
- conversion timing aligned with switching activity,
|
|
- ground return sharing.
|
|
|
|
Potential fixes:
|
|
|
|
- improve analog reference decoupling,
|
|
- separate analog and high-current return paths appropriately,
|
|
- sample synchronously during quiet intervals,
|
|
- add input RC filter if bandwidth allows.
|
|
|
|
#### False reset from noisy line
|
|
|
|
Symptom: MCU resets when relay or motor switches.
|
|
|
|
Possible causes:
|
|
|
|
- reset trace routed near noisy switching node,
|
|
- reset pull-up too weak,
|
|
- no supervisor IC,
|
|
- ground bounce affecting reset threshold.
|
|
|
|
Potential fixes:
|
|
|
|
- shorten and shield reset routing,
|
|
- strengthen pull-up as appropriate,
|
|
- add RC filtering if supported by device requirements,
|
|
- use a proper reset supervisor.
|
|
|
|
#### Serial communication errors near motor drive
|
|
|
|
Symptom: UART or RS-485 errors only when motor PWM is active.
|
|
|
|
Potential fixes:
|
|
|
|
- improve cable routing and separation,
|
|
- improve common-mode control and grounding,
|
|
- use shield strategy correctly,
|
|
- add proper line termination,
|
|
- implement CRC and frame retry.
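
The CRC part of the last item can be sketched as below. The frame layout (payload followed by a big-endian CRC) and the function names are assumptions for illustration. The CRC itself is the common CRC-16/CCITT-FALSE variant, whose published check value for the ASCII string `123456789` is `0x29B1`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Assumed frame layout: payload bytes, then CRC high byte, then CRC low byte. */
bool frame_valid(const uint8_t *frame, size_t len)
{
    if (len < 3u) {
        return false;                       /* need at least 1 payload byte + CRC */
    }
    uint16_t received = (uint16_t)(((uint16_t)frame[len - 2u] << 8) | frame[len - 1u]);
    return crc16_ccitt(frame, len - 2u) == received;
}
```

A receiver that rejects a frame would typically request a retry and increment a retry counter, so that error bursts can later be correlated with motor activity.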
|
|
|
|
### 6.10 Common mistakes with noise
|
|
|
|
- Measuring only average voltage and missing fast transients.
|
|
- Using long oscilloscope ground leads and diagnosing probe-induced ringing as a real problem.
|
|
- Treating all ground as magically identical everywhere on the board.
|
|
- Adding large filters without checking bandwidth, delay, or startup side effects.
|
|
- Relying entirely on software filters when hardware is fundamentally unstable.
|
|
- Forgetting cable routing and connector placement in the system design.
|
|
|
|
### 6.11 How to debug noise problems
|
|
|
|
1. Reproduce under the exact operating condition that triggers failure.
|
|
2. Identify likely source, path, and victim.
|
|
3. Measure with proper probing technique.
|
|
4. Correlate failures with switching events, load changes, or communication timing.
|
|
5. Temporarily isolate aggressors one at a time.
|
|
6. Add or remove filtering experimentally to test hypotheses.
|
|
|
|
A key professional habit: do not call something noise until you have a time correlation. Intermittent means hard to see, not random.
|
|
|
|
### 6.12 Interview-level understanding
|
|
|
|
A strong engineer should be able to explain:
|
|
|
|
- Why source-path-victim is a better model than simply "there is noise."
|
|
- Why layout and return current paths dominate many noise problems.
|
|
- Why hardware filtering and software filtering solve different parts of the problem.
|
|
- Why proper measurement technique matters when debugging fast transients.
|
|
|
|
---
|
|
|
|
## 7. Reset Circuits
|
|
|
|
### 7.1 Why reset circuits matter
|
|
|
|
A reset signal tells a digital system when it is allowed to start from a known state. That sounds simple, but it is one of the most important reliability mechanisms on the board.
|
|
|
|
Without a clean reset strategy:
|
|
|
|
- the CPU may start before power is valid,
|
|
- peripherals may initialize in the wrong order,
|
|
- one chip may drive a bus while another is still unpowered,
|
|
- the device may enter random states after brownout or watchdog events.
|
|
|
|
Reset design is where power integrity, sequencing, and recovery behavior meet.
|
|
|
|
### 7.2 Sources of reset in embedded systems
|
|
|
|
- power-on reset,
|
|
- brownout reset,
|
|
- watchdog reset,
|
|
- manual reset button,
|
|
- external supervisor reset,
|
|
- debug or programming tool reset,
|
|
- software-requested reset.
|
|
|
|
These different causes should ideally be distinguishable in firmware so the boot path can respond intelligently.
|
|
|
|
### 7.3 What a good reset circuit must do
|
|
|
|
A reliable reset mechanism should:
|
|
|
|
- assert reset early enough during power rise and power fall,
|
|
- keep reset asserted until voltage and clocking are valid,
|
|
- provide sufficient reset pulse width,
|
|
- avoid glitches or chatter,
|
|
- coordinate multiple devices if needed,
|
|
- allow cause diagnosis after reboot.
|
|
|
|
### 7.4 RC reset versus supervisor IC
|
|
|
|
Many beginners learn RC reset first. It can work in simple low-risk systems, but it has limitations.
|
|
|
|
| Approach | Advantages | Limitations | Best fit |
| --- | --- | --- | --- |
| Simple RC reset | Cheap, minimal parts | Threshold is not precise, sensitive to ramp rate and tolerances, weak for complex sequencing | Very simple designs |
| Dedicated supervisor | Precise threshold, hysteresis, guaranteed timing, better brownout behavior | Extra cost and parts | Most professional products |
|
|
|
|
Why RC reset is limited:
|
|
|
|
- actual reset release depends on the receiving input threshold,
|
|
- supply ramp shape changes timing,
|
|
- temperature and tolerance affect behavior,
|
|
- brownout behavior is usually poor compared to a proper supervisor.
|
|
|
|
In systems that must behave predictably across supply variations, supervisors are usually worth the BOM cost.
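
A rough calculation shows how loose RC reset timing can be. This is an idealized sketch: it assumes the supply steps instantly to $V_{DD}$, an active-low reset pin pulled up through $R$ with $C$ to ground, and example component values that are not from any particular design. The reset pin voltage then rises as

$$
V_{RST}(t) = V_{DD}\left(1 - e^{-t/RC}\right)
$$

and reset is released when it crosses the input threshold $V_{th}$:

$$
t_{release} = -RC \,\ln\!\left(1 - \frac{V_{th}}{V_{DD}}\right)
$$

With $R = 10\ \mathrm{k\Omega}$ and $C = 100\ \mathrm{nF}$, $RC = 1\ \mathrm{ms}$. If the input threshold may fall anywhere between $0.3\,V_{DD}$ and $0.7\,V_{DD}$, the release time ranges from about $0.36\ \mathrm{ms}$ to about $1.2\ \mathrm{ms}$, a spread of more than 3:1 before component tolerance, temperature, or a slow supply ramp is considered.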
|
|
|
|
### 7.5 Reset distribution architecture
|
|
|
|
```mermaid
flowchart TD
    VIN[Input Power] --> REG[Regulator / PMIC]
    REG --> SUP[Voltage Supervisor / BOR]
    WDG[Watchdog IC] --> ORRST[Reset Combination Logic or Shared Reset Net]
    BTN[Manual Reset] --> ORRST
    SUP --> ORRST
    ORRST --> MCU[MCU Reset]
    ORRST --> PER1[Peripheral Reset]
    ORRST --> PER2[Communication IC Reset]
    MCU --> LOG[Capture Reset Cause Early in Boot]
```
|
|
|
|
In some systems, not all devices should reset at the same time or for the same duration. For example, a communication transceiver may require a different sequence from the MCU.
|
|
|
|
### 7.6 Reset cause logging
|
|
|
|
Reset without diagnosis is incomplete engineering. Early boot code should capture the reset source before it is overwritten or cleared.
|
|
|
|
```c
void early_boot_reset_capture(void)
{
    /* Read the hardware reset flags once, then clear them so the next
       reset is not confused with this one. */
    reset_cause_t cause = read_and_clear_reset_flags();
    persist_reset_event(cause);

    if (cause == RESET_CAUSE_WATCHDOG) {
        persist_fault_counter_increment(FAULT_WATCHDOG);
    }

    if (cause == RESET_CAUSE_BROWNOUT) {
        persist_fault_counter_increment(FAULT_BROWNOUT);
    }
}
```
|
|
|
|
This simple pattern is the basis of a real field diagnostic history.
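
On many MCUs the reset status register can have more than one flag set at once, so the raw flags usually need to be reduced to a single cause with an explicit priority. The sketch below shows one way to do that. The bit positions, the extra enum values, and `decode_reset_flags` are hypothetical; the real flag layout comes from the MCU reference manual.

```c
#include <stdint.h>

/* Hypothetical flag bits; real positions come from the MCU reference manual. */
#define RSTF_POWER_ON  (1u << 0)
#define RSTF_BROWNOUT  (1u << 1)
#define RSTF_WATCHDOG  (1u << 2)
#define RSTF_SOFTWARE  (1u << 3)
#define RSTF_PIN       (1u << 4)

typedef enum {
    RESET_CAUSE_UNKNOWN = 0,
    RESET_CAUSE_POWER_ON,
    RESET_CAUSE_BROWNOUT,
    RESET_CAUSE_WATCHDOG,
    RESET_CAUSE_SOFTWARE,
    RESET_CAUSE_PIN
} reset_cause_t;

/* Reduce raw flags to one cause. Power events are checked first because
   they can set other flags as a side effect. */
reset_cause_t decode_reset_flags(uint32_t flags)
{
    if (flags & RSTF_POWER_ON) { return RESET_CAUSE_POWER_ON; }
    if (flags & RSTF_BROWNOUT) { return RESET_CAUSE_BROWNOUT; }
    if (flags & RSTF_WATCHDOG) { return RESET_CAUSE_WATCHDOG; }
    if (flags & RSTF_SOFTWARE) { return RESET_CAUSE_SOFTWARE; }
    if (flags & RSTF_PIN)      { return RESET_CAUSE_PIN; }
    return RESET_CAUSE_UNKNOWN;
}
```

Logging the raw flag word alongside the decoded cause is cheap and preserves information that the priority rule would otherwise discard.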
|
|
|
|
### 7.7 Reset sequencing examples
|
|
|
|
#### MCU with external sensor and transceiver
|
|
|
|
Potential issue: sensor powers up slowly, but MCU starts immediately and tries to read it, creating false fault reports.
|
|
|
|
Possible solution:
|
|
|
|
- hold MCU in reset until sensor power-good is valid,
|
|
- or let MCU boot but gate sensor initialization on a validated ready signal,
|
|
- or reset the peripheral separately after its rail stabilizes.
|
|
|
|
#### Shared bus system
|
|
|
|
If one device is held in reset while another is driving the bus, the reset device's pins must not load or corrupt the bus. That requires checking IO default states, pull resistors, and power-domain behavior.
|
|
|
|
### 7.8 Reset lines are noise-sensitive signals
|
|
|
|
A reset pin is often one of the most important and most fragile nets on the board.
|
|
|
|
Good practice:
|
|
|
|
- keep it short,
|
|
- keep it away from aggressive switching nodes,
|
|
- use a defined pull-up or pull-down,
|
|
- filter only within device requirements,
|
|
- avoid accidental coupling from user-accessible traces,
|
|
- test with real transients.
|
|
|
|
### 7.9 Manual reset and user experience
|
|
|
|
Manual reset buttons look simple, but product behavior matters.
|
|
|
|
Questions to decide:
|
|
|
|
- Does a short press reset the MCU only, or the whole system?
|
|
- Should holding reset erase settings or enter bootloader mode?
|
|
- Is button debounce handled electrically, in firmware, or both?
|
|
- Can the user accidentally trigger reset through ESD or noise on the button line?
|
|
|
|
### 7.10 Common reset design mistakes
|
|
|
|
- Using only RC reset in a design with meaningful power ramp uncertainty.
|
|
- Ignoring brownout behavior and validating only power-up.
|
|
- Failing to verify reset pulse width requirements of peripherals.
|
|
- Not logging reset cause.
|
|
- Sharing reset nets carelessly across incompatible voltage domains.
|
|
- Routing reset lines near switching nodes or long exposed traces.
|
|
|
|
### 7.11 How to debug reset issues
|
|
|
|
1. Measure the supply rail and reset line together.
|
|
2. Check if reset deasserts before the rail is actually stable.
|
|
3. Verify oscillator and clock startup timing if relevant.
|
|
4. Trigger on reset assertion during real load events.
|
|
5. Confirm which reset source firmware reports.
|
|
6. Check whether peripherals require additional initialization after reset.
|
|
|
|
A classic failure pattern is: the MCU resets correctly, but an external peripheral remains wedged because it was not reset or reinitialized.
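
Firmware can only reinitialize a wedged peripheral if it notices that the peripheral is wedged, which means no wait may be unbounded. A minimal sketch of a bounded wait follows. The register pointer, `READY_MASK`, and the poll budget are placeholders; a real driver would usually bound the wait with a hardware timer rather than a loop count.

```c
#include <stdbool.h>
#include <stdint.h>

#define READY_MASK  0x1u   /* hypothetical "ready" bit in a status register */

typedef enum { PERIPH_OK, PERIPH_REINIT_NEEDED } periph_status_t;

/* Poll until (reg & mask) is nonzero or the poll budget runs out. */
bool wait_for_flag(const volatile uint32_t *reg, uint32_t mask, uint32_t max_polls)
{
    while (max_polls-- > 0u) {
        if ((*reg & mask) != 0u) {
            return true;
        }
    }
    return false;                    /* timed out: caller must not assume success */
}

/* Report a timeout explicitly so the caller can reset or reinitialize. */
periph_status_t check_peripheral_ready(const volatile uint32_t *status_reg)
{
    return wait_for_flag(status_reg, READY_MASK, 1000u) ? PERIPH_OK
                                                        : PERIPH_REINIT_NEEDED;
}
```

On `PERIPH_REINIT_NEEDED`, the caller would typically pulse the peripheral's reset line or rerun its initialization sequence, and count the event.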
|
|
|
|
### 7.12 Interview-level understanding
|
|
|
|
A strong engineer should be able to explain:
|
|
|
|
- Why reset design is more than just getting the MCU to boot.
|
|
- Why RC reset is often insufficient in professional systems.
|
|
- Why reset cause capture is essential for field reliability.
|
|
- Why brownout behavior and reset behavior must be designed together.
|
|
|
|
---
|
|
|
|
## 8. Fail-Safe Design
|
|
|
|
### 8.1 What fail-safe means
|
|
|
|
Fail-safe means that when the system fails, it moves to a state that minimizes harm.
|
|
|
|
That state depends on the application.
|
|
|
|
Examples:
|
|
|
|
- A motor controller safe state may be torque disabled.
|
|
- A smart lock safe state may depend on fire and security policy.
|
|
- A medical infusion device safe state may be pump stop plus alarm.
|
|
- A battery system safe state may be contactor open.
|
|
|
|
Important point: fail-safe does not always mean power off, and it is not the same thing as fail-silent or fail-operational.
|
|
|
|
Definitions:
|
|
|
|
- Fail-safe: enter a state designed to reduce risk.
|
|
- Fail-silent: stop producing outputs.
|
|
- Fail-operational: continue operation despite some faults.
|
|
|
|
Designers must choose deliberately which one applies.
|
|
|
|
### 8.2 Safe state must be defined before implementation
|
|
|
|
If the team cannot clearly answer "What should the system do when X fails?" then fail-safe design has not yet started.
|
|
|
|
That answer should exist for:
|
|
|
|
- sensor failure,
|
|
- communication timeout,
|
|
- watchdog event,
|
|
- brownout,
|
|
- overtemperature,
|
|
- actuator feedback mismatch,
|
|
- internal self-test failure.
|
|
|
|
### 8.3 A practical fail-safe design process
|
|
|
|
1. Identify hazards.
|
|
2. Identify faults that can create those hazards.
|
|
3. Determine how to detect those faults.
|
|
4. Define the safe state.
|
|
5. Decide whether recovery is automatic, manual, or disallowed.
|
|
6. Design the hardware and software path to enforce the safe state.
|
|
7. Verify behavior with fault injection.
|
|
|
|
This is a simplified engineering version of hazard analysis and FMEA thinking.
|
|
|
|
### 8.4 Fail-safe architecture
|
|
|
|
```mermaid
stateDiagram-v2
    [*] --> Startup
    Startup --> Normal: Self-test passed
    Startup --> SafeState: Self-test failed
    Normal --> Degraded: Non-critical fault detected
    Normal --> SafeState: Critical fault detected
    Degraded --> Normal: Fault cleared and validated
    Degraded --> SafeState: Fault escalates
    SafeState --> Recovery: Reset or operator action allowed
    Recovery --> Startup
```
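
The diagram maps naturally onto an explicit transition function. The sketch below is one possible encoding; the state and event names are invented for illustration, and ignoring unlisted events is a design choice, not a recommendation. A stricter policy could send any unexpected event to the safe state.

```c
typedef enum {
    ST_STARTUP, ST_NORMAL, ST_DEGRADED, ST_SAFE, ST_RECOVERY
} sys_state_t;

typedef enum {
    EV_SELFTEST_PASS, EV_SELFTEST_FAIL,
    EV_FAULT_MINOR, EV_FAULT_CRITICAL, EV_FAULT_CLEARED,
    EV_RECOVERY_ALLOWED, EV_TICK
} sys_event_t;

/* Returns the next state; events not listed for a state leave it unchanged. */
sys_state_t next_state(sys_state_t s, sys_event_t e)
{
    switch (s) {
    case ST_STARTUP:
        if (e == EV_SELFTEST_PASS) { return ST_NORMAL; }
        if (e == EV_SELFTEST_FAIL) { return ST_SAFE; }
        break;
    case ST_NORMAL:
        if (e == EV_FAULT_MINOR)    { return ST_DEGRADED; }
        if (e == EV_FAULT_CRITICAL) { return ST_SAFE; }
        break;
    case ST_DEGRADED:
        if (e == EV_FAULT_CLEARED)  { return ST_NORMAL; }
        if (e == EV_FAULT_CRITICAL) { return ST_SAFE; }
        break;
    case ST_SAFE:
        /* The safe state is left only by an explicit permission to recover. */
        if (e == EV_RECOVERY_ALLOWED) { return ST_RECOVERY; }
        break;
    case ST_RECOVERY:
        return ST_STARTUP;           /* recovery always re-runs startup checks */
    }
    return s;
}
```

Keeping every transition in one function makes the policy reviewable in a design review and easy to exercise with fault injection tests.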
|
|
|
|
### 8.5 Hardware mechanisms for fail-safe behavior
|
|
|
|
Reliable fail-safe action often must exist in hardware, because software may be the thing that failed.
|
|
|
|
Examples:
|
|
|
|
- gate driver enable pin defaults low through hardware pull-down,
|
|
- relay or contactor opens on loss of control power,
|
|
- hardware comparator cuts off output on overcurrent,
|
|
- thermal fuse provides last-resort protection,
|
|
- external watchdog can reset or disable outputs if heartbeat is lost,
|
|
- redundant interlock chain overrides MCU command.
|
|
|
|
This is a central lesson in embedded reliability: do not require broken software to be responsible for making the system safe.
|
|
|
|
### 8.6 Software mechanisms for fail-safe behavior
|
|
|
|
Software still plays a major role:
|
|
|
|
- sensor plausibility checking,
|
|
- command timeout detection,
|
|
- state machine enforcement,
|
|
- degraded-mode control,
|
|
- alarm generation,
|
|
- logging for maintenance and incident analysis.
|
|
|
|
Example:
|
|
|
|
```c
#include <stdbool.h>

void apply_motor_command(int torque_cmd)
{
    /* Every condition must hold before torque is applied. */
    bool safe = sensors_plausible() &&
                !comms_timed_out() &&
                !brownout_warning_latched() &&
                !watchdog_recovery_mode();

    if (!safe) {
        /* Remove drive first, then apply the brake. */
        set_gate_driver_enable(false);
        set_brake(true);
        return;
    }

    set_brake(false);
    send_torque_to_driver(torque_cmd);
}
```
|
|
|
|
The software intent is clear, but the hardware should also support a safe default if this function never runs.
|
|
|
|
### 8.7 Fail-safe versus nuisance trips
|
|
|
|
One of the hardest practical tradeoffs is sensitivity.
|
|
|
|
If the fault detection is too sensitive:
|
|
|
|
- the system trips unnecessarily,
|
|
- customers perceive poor reliability,
|
|
- availability suffers.
|
|
|
|
If the fault detection is too insensitive:
|
|
|
|
- real hazards are missed,
|
|
- damage or unsafe operation may continue too long.
|
|
|
|
This tradeoff is why engineers use:
|
|
|
|
- thresholds with hysteresis,
|
|
- timers and persistence checks,
|
|
- graded fault levels,
|
|
- degraded mode before full shutdown when appropriate.
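
The first two tools, hysteresis and persistence, are often combined in a single fault detector. The sketch below uses an overtemperature fault as the example; the thresholds, the sample count, and the names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define OT_TRIP_DECI_C      850   /* assumed trip threshold: 85.0 C */
#define OT_CLEAR_DECI_C     750   /* assumed clear threshold: 75.0 C */
#define OT_PERSIST_SAMPLES  3u    /* consecutive samples needed to trip */

typedef struct {
    bool    active;       /* fault currently asserted */
    uint8_t over_count;   /* consecutive samples at or above the trip level */
} overtemp_t;

/* Call once per temperature sample; returns true while the fault is active. */
bool overtemp_update(overtemp_t *f, int32_t temp_deci_c)
{
    if (!f->active) {
        if (temp_deci_c >= OT_TRIP_DECI_C) {
            if (++f->over_count >= OT_PERSIST_SAMPLES) {
                f->active = true;           /* persisted long enough: trip */
            }
        } else {
            f->over_count = 0;              /* a single spike does not accumulate */
        }
    } else if (temp_deci_c <= OT_CLEAR_DECI_C) {
        f->active = false;                  /* clears only below the lower threshold */
        f->over_count = 0;
    }
    return f->active;
}
```

The gap between trip and clear thresholds prevents chatter near the limit, while the persistence count rejects isolated noisy samples at the cost of slower detection.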
|
|
|
|
### 8.8 Fail-safe examples by product type
|
|
|
|
#### Motor drive
|
|
|
|
Critical hazard: unintended torque.
|
|
|
|
Fail-safe ideas:
|
|
|
|
- hardware disable line to gate driver,
|
|
- watchdog-supervised control loop,
|
|
- speed feedback plausibility,
|
|
- STO-like concept where applicable,
|
|
- separate fault latch requiring deliberate reset.
|
|
|
|
#### Battery management system
|
|
|
|
Critical hazard: overcharge, overcurrent, thermal runaway.
|
|
|
|
Fail-safe ideas:
|
|
|
|
- hardware cutoff thresholds,
|
|
- contactor control default open when logic supply is absent,
|
|
- independent temperature sensing where needed,
|
|
- persistent fault logging.
|
|
|
|
#### Industrial controller
|
|
|
|
Critical hazard: output remains active after controller fault.
|
|
|
|
Fail-safe ideas:
|
|
|
|
- output state defaults defined in hardware,
|
|
- communication timeout maps to safe output state,
|
|
- startup requires explicit requalification before outputs re-enable.
|
|
|
|
#### Connected consumer device
|
|
|
|
Critical concern may be less about physical hazard and more about data integrity or availability.
|
|
|
|
Fail-safe ideas:
|
|
|
|
- rollback-safe updates,
|
|
- reset-and-recover policy,
|
|
- preserving last known valid configuration,
|
|
- not bricking on interrupted power.
|
|
|
|
### 8.9 Common mistakes in fail-safe design
|
|
|
|
- Defining safe state too vaguely.
|
|
- Assuming reset alone is a safe state.
|
|
- Depending on firmware to disable outputs after firmware has already failed.
|
|
- Not separating critical and non-critical faults.
|
|
- No plan for repeated reboot loops.
|
|
- No fault injection testing.
|
|
|
|
### 8.10 How to debug fail-safe behavior
|
|
|
|
Ask structured questions:
|
|
|
|
1. What exact fault was injected or observed?
|
|
2. Was it detected?
|
|
3. How long did detection take?
|
|
4. What state transition occurred?
|
|
5. Did hardware outputs match the intended safe state?
|
|
6. Could the system recover, and was that recovery appropriate?
|
|
7. Was the event logged clearly enough to support field diagnosis?
|
|
|
|
A product that enters a safe state but cannot explain why will create expensive support problems.
|
|
|
|
### 8.11 Interview-level understanding
|
|
|
|
A strong engineer should be able to explain:
|
|
|
|
- Why fail-safe is application-specific.
|
|
- Why reset is not automatically a safe-state action.
|
|
- Why hardware interlocks are often required in addition to software checks.
|
|
- Why degraded mode can be better than immediate shutdown for some fault classes.
|
|
|
|
---
|
|
|
|
## 9. How These Topics Interact in Real Products
|
|
|
|
The topics in this guide are tightly connected. Here are common cross-topic interactions.
|
|
|
|
### 9.1 Brownout plus watchdog
|
|
|
|
If voltage droops, software timing may degrade before the watchdog resets the MCU. If the safe state depends on firmware action, that may not happen reliably. This is why a brownout supervisor and hardware-safe outputs are often both needed.
|
|
|
|
### 9.2 ESD plus reset behavior
|
|
|
|
An ESD strike may not damage hardware, but it can momentarily disturb reset or communication lines. If reset distribution is weak or cause logging is missing, the result looks like a random bug.
|
|
|
|
### 9.3 Noise plus brownout
|
|
|
|
Large current transients from motors or regulators create both voltage dip and high-frequency noise. Engineers sometimes classify the failure as either power or EMI, but it is often both at once.
|
|
|
|
### 9.4 Watchdog plus fail-safe design
|
|
|
|
The watchdog should not simply reboot the CPU. It should fit into a fault-response strategy:
|
|
|
|
- disable hazardous outputs,
|
|
- record state,
|
|
- reset cleanly,
|
|
- prevent endless unsafe reboot loops.
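
The last item can be sketched with a counter that survives reset. The example assumes the record lives somewhere that is preserved across a watchdog reset, such as no-init RAM or a backup register; the names, the magic value, and the limit of three are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOT_MAGIC          0xB007C0DEu  /* marks the record as initialized */
#define MAX_WATCHDOG_BOOTS  3u           /* assumed policy limit */

typedef struct {
    uint32_t magic;
    uint32_t consecutive_wdt_resets;
} boot_record_t;

/* Call early in boot. Returns true if outputs may be re-enabled automatically. */
bool boot_policy_allow_run(boot_record_t *rec, bool reset_was_watchdog)
{
    if (rec->magic != BOOT_MAGIC) {          /* first boot or corrupted record */
        rec->magic = BOOT_MAGIC;
        rec->consecutive_wdt_resets = 0;
    }
    if (reset_was_watchdog) {
        if (rec->consecutive_wdt_resets < UINT32_MAX) {
            rec->consecutive_wdt_resets++;
        }
    } else {
        rec->consecutive_wdt_resets = 0;     /* any other reset breaks the streak */
    }
    return rec->consecutive_wdt_resets < MAX_WATCHDOG_BOOTS;
}

/* Call after the application has run healthily for a defined period. */
void boot_policy_mark_healthy(boot_record_t *rec)
{
    rec->consecutive_wdt_resets = 0;
}
```

When the function returns false, the firmware would stay in its safe state and wait for operator action or a service command instead of rebooting again.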
|
|
|
|
### 9.5 Reset circuits plus fail-safe outputs
|
|
|
|
At reset, IO pins may float, tristate, or change mode. If an actuator-enable line is not designed with safe electrical defaults, a clean reset can still produce unsafe behavior.
|
|
|
|
---
|
|
|
|
## 10. Reliability Design Review Checklist
|
|
|
|
Use this section as a practical design review reference.
|
|
|
|
### 10.1 Watchdog checklist
|
|
|
|
- Is the watchdog enabled in production firmware?
|
|
- Who is allowed to feed it?
|
|
- Does feeding require proof of real task health?
|
|
- What happens after one watchdog reset? After repeated watchdog resets?
|
|
- Is reset cause captured early?
|
|
- Does firmware update mode handle watchdog timing safely?
|
|
|
|
### 10.2 Brownout checklist
|
|
|
|
- What are the real supply worst cases?
|
|
- Are BOR and supervisor thresholds chosen with margin?
|
|
- Is hysteresis present?
|
|
- Is nonvolatile storage robust against mid-write reset?
|
|
- Has the design been tested with realistic droops and load steps?
|
|
- Are all rails and power sequencing dependencies understood?
|
|
|
|
### 10.3 ESD checklist
|
|
|
|
- What are the user-touchable and cable-exposed entry points?
|
|
- Is protection placed at the entry point?
|
|
- Where does ESD current return?
|
|
- Are protection devices appropriate for interface speed and capacitance?
|
|
- Has functional recovery been verified, not just survival?
|
|
- Does layout support the intended current path?
|
|
|
|
### 10.4 Noise checklist
|
|
|
|
- What are the major noise sources?
|
|
- Through what paths can they couple into sensitive circuits?
|
|
- Are return current paths controlled?
|
|
- Are reset, clock, analog, and communication lines hardened appropriately?
|
|
- Are measurement methods good enough to see fast transients?
|
|
- Is software filtering used appropriately but not as a crutch?
|
|
|
|
### 10.5 Reset checklist
|
|
|
|
- What reset sources exist and how are they prioritized?
|
|
- Is reset timing guaranteed across ramp conditions?
|
|
- Do all peripherals need the same reset behavior?
|
|
- Are reset lines protected from noise?
|
|
- Is reset cause logged and used at boot?
|
|
- Does startup sequencing avoid false alarms and unsafe outputs?
|
|
|
|
### 10.6 Fail-safe checklist
|
|
|
|
- Is safe state explicitly defined for each critical fault?
|
|
- Can hardware enforce safe outputs if software is dead?
|
|
- Are degraded and critical faults distinguished?
|
|
- Is recovery policy defined and tested?
|
|
- Are repeated-reset or repeated-fault scenarios handled deliberately?
|
|
- Has fault injection been performed?
|
|
|
|
---
|
|
|
|
## 11. Troubleshooting Playbook
|
|
|
|
When a field issue appears as "random resets" or "intermittent lockup," use a structured process.
|
|
|
|
```mermaid
flowchart TD
    A[Observed Failure] --> B{Did the system reset?}
    B -->|Yes| C[Read and log reset cause]
    B -->|No| D[Check for lockup, bad comms, or corrupted outputs]
    C --> E{Brownout?}
    E -->|Yes| F[Measure rails under dynamic load]
    E -->|No| G{Watchdog?}
    G -->|Yes| H[Check task health, timing, deadlocks, long blocking calls]
    G -->|No| I[Inspect reset line noise, supervisor behavior, ESD exposure]
    D --> J{Correlated with switching, motor, RF, cable touch?}
    J -->|Yes| K[Investigate source-path-victim noise or ESD path]
    J -->|No| L[Check software state machine, memory corruption, peripheral wedging]
    F --> M[Validate BOR threshold, capacitance, layout, load step]
    H --> N[Improve watchdog policy and breadcrumbs]
    I --> O[Review reset architecture and layout]
    K --> P[Improve protection, routing, filtering, recovery logic]
    L --> Q[Add diagnostics and targeted fault injection]
```
|
|
|
|
### 11.1 Step-by-step debugging discipline
|
|
|
|
1. Reproduce the problem in the narrowest possible setup.
|
|
2. Add evidence collection before changing the design.
|
|
3. Classify whether the problem is reset, lockup, data corruption, or unsafe output behavior.
|
|
4. Correlate the failure with power, timing, switching, cable interaction, or temperature.
|
|
5. Change one thing at a time and retest.
|
|
|
|
This sounds simple, but it is how senior engineers avoid wasting days on false theories.
|
|
|
|
### 11.2 Fault injection is not optional in serious designs
|
|
|
|
Reliable products are tested by deliberately causing trouble:
|
|
|
|
- power interruption during writes,
|
|
- induced load steps,
|
|
- communication loss,
|
|
- forced task starvation,
|
|
- ESD events at external interfaces,
|
|
- unplugging and replugging peripherals,
|
|
- invalid sensor inputs,
|
|
- repeated reset cycles.
|
|
|
|
If your design has never been tested under controlled failure, your first field deployment is the real validation lab.
|
|
|
|
---
|
|
|
|
## 12. Common Engineering Patterns That Improve Reliability
|
|
|
|
These patterns show up repeatedly in successful products.
|
|
|
|
### 12.1 Safe electrical defaults
|
|
|
|
Critical outputs should default to the non-hazardous state through passive hardware whenever practical.
|
|
|
|
Examples:
|
|
|
|
- pull-down on gate driver enable,
|
|
- relay de-energized as safe condition where appropriate,
|
|
- communication transceiver disabled until MCU explicitly enables it.
|
|
|
|
### 12.2 Incremental state persistence
|
|
|
|
Do not wait for a fault to save all important data. Update critical state incrementally during normal operation using atomic record strategies.
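
One common atomic record strategy is to alternate between two storage slots, so that a power loss mid-write can damage only the slot being written. The sketch below models the slots as plain memory. The record layout, the simple integrity word (a real design would more likely use a CRC), and the function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t sequence;   /* increases with every committed write */
    uint32_t payload;    /* the state being persisted */
    uint32_t check;      /* integrity word over the two fields above */
} record_t;

/* Simple integrity word for illustration; erased (all-ones) and all-zero
   records both fail this check. */
uint32_t record_check(const record_t *r)
{
    return ~(r->sequence ^ (r->payload * 2654435761u));
}

bool record_valid(const record_t *r)
{
    return r->check == record_check(r);
}

/* Pick the newest valid slot; returns NULL if neither slot is valid. */
const record_t *record_load(const record_t slots[2])
{
    bool v0 = record_valid(&slots[0]);
    bool v1 = record_valid(&slots[1]);
    if (v0 && v1) {
        /* Signed difference tolerates sequence number wraparound. */
        return ((int32_t)(slots[0].sequence - slots[1].sequence) > 0) ? &slots[0]
                                                                      : &slots[1];
    }
    if (v0) { return &slots[0]; }
    if (v1) { return &slots[1]; }
    return NULL;
}

/* Write into the slot that does NOT hold the newest valid record. */
void record_store(record_t slots[2], uint32_t payload)
{
    const record_t *cur = record_load(slots);
    record_t next = { cur ? cur->sequence + 1u : 1u, payload, 0u };
    next.check = record_check(&next);
    record_t *target = (cur == &slots[0]) ? &slots[1] : &slots[0];
    *target = next;      /* on real flash: program data first, check word last */
}
```

If power fails during `record_store`, the half-written slot fails its check on the next boot and `record_load` falls back to the previous record, so the system loses at most one update.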
|
|
|
|
### 12.3 Layered supervision
|
|
|
|
Use multiple layers:
|
|
|
|
- local timeout on a peripheral transaction,
|
|
- task-level health monitoring,
|
|
- internal watchdog,
|
|
- external supervisor.
|
|
|
|
Each layer catches a different failure class.
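
The task-level layer can gate the watchdog layer, so that the watchdog is fed only when every critical task has recently proven it is alive. The sketch below shows the idea. The task bits and function names are hypothetical, and in a preemptive system the read-modify-write on `checkins` would need to be made atomic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical set of tasks that must all report in each supervision period. */
#define TASK_CONTROL  (1u << 0)
#define TASK_COMMS    (1u << 1)
#define TASK_SENSORS  (1u << 2)
#define ALL_TASKS     (TASK_CONTROL | TASK_COMMS | TASK_SENSORS)

static volatile uint32_t checkins;

/* Each task calls this only after completing a full, healthy iteration. */
void task_checkin(uint32_t task_bit)
{
    checkins |= task_bit;
}

/* Called once per supervision period. Returns true only if every task
   checked in since the last call; the caller feeds the watchdog on true. */
bool supervisor_should_feed(void)
{
    bool healthy = ((checkins & ALL_TASKS) == ALL_TASKS);
    checkins = 0;                    /* every period requires fresh evidence */
    return healthy;
}
```

If any one task stalls, the supervisor stops feeding and the hardware watchdog expires, which is the behavior a timer-interrupt feed can never provide.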
|
|
|
|
### 12.4 Explicit degraded mode
|
|
|
|
A system does not need to be only fully healthy or fully shut down.
|
|
|
|
Examples of degraded mode:
|
|
|
|
- reduced motor torque,
|
|
- lower communication bandwidth,
|
|
- sensor fusion disabled and simpler fallback estimator used,
|
|
- user alerted while non-critical features are suspended.
|
|
|
|
### 12.5 Reliability telemetry
|
|
|
|
Collect counters and logs such as:
|
|
|
|
- resets by cause,
|
|
- brownout count,
|
|
- communication retry count,
|
|
- watchdog reset history,
|
|
- maximum temperature reached,
|
|
- stack margin minimum,
|
|
- last fault code.
|
|
|
|
This data turns mysterious field returns into diagnosable engineering problems.
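
A telemetry block does not need to be elaborate. The sketch below shows one possible layout with saturating counters and running extremes; the field names and widths are assumptions, and how the structure is persisted is left out.

```c
#include <stdint.h>

typedef struct {
    uint16_t resets_watchdog;
    uint16_t resets_brownout;
    uint16_t comm_retries;
    int16_t  max_temp_deci_c;        /* highest temperature seen, 0.1 C units */
    uint16_t min_stack_free_bytes;   /* smallest stack margin seen */
    uint16_t last_fault_code;
} telemetry_t;

/* Saturate instead of wrapping, so a very large count is never read as small. */
void counter_inc_saturating(uint16_t *c)
{
    if (*c < UINT16_MAX) {
        (*c)++;
    }
}

/* Track the running maximum temperature. */
void telemetry_note_temp(telemetry_t *t, int16_t temp_deci_c)
{
    if (temp_deci_c > t->max_temp_deci_c) {
        t->max_temp_deci_c = temp_deci_c;
    }
}

/* Track the running minimum free stack. */
void telemetry_note_stack_free(telemetry_t *t, uint16_t free_bytes)
{
    if (free_bytes < t->min_stack_free_bytes) {
        t->min_stack_free_bytes = free_bytes;
    }
}
```

The extremes should be initialized to `INT16_MIN` and `UINT16_MAX` respectively so that the first observation always replaces them.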
|
|
|
|
---
|
|
|
|
## 13. What Experienced Engineers Usually Look For First
|
|
|
|
When a product seems unreliable, senior engineers often check these first:
|
|
|
|
- Is power actually clean at the IC pins under real load?
|
|
- Is reset timing correct under power-up and power-down, not just power-up?
|
|
- Is the watchdog genuinely supervising health or just being petted?
|
|
- Are there external touch points or cables that invite ESD trouble?
|
|
- Are there inductive or switching loads sharing return paths with sensitive logic?
|
|
- Are critical outputs electrically safe during boot, reset, and fault?
|
|
- Is there enough logging to know what happened last time?
|
|
|
|
This is not because these are the only failure mechanisms. It is because they are common, high-impact, and often underdesigned.
|
|
|
|
---
|
|
|
|
## 14. Final Takeaways
|
|
|
|
Embedded reliability is about controlled behavior under non-ideal conditions.
|
|
|
|
The core ideas of this handbook can be summarized like this:
|
|
|
|
- Watchdogs are for detecting loss of healthy execution, not merely loss of activity.
|
|
- Brownout protection is about preventing partially powered misbehavior, not just resetting eventually.
|
|
- ESD protection is about steering current away from sensitive silicon through intentional paths.
|
|
- Noise problems are best understood through source, path, and victim.
|
|
- Reset circuits are reliability infrastructure, not a minor startup detail.
|
|
- Fail-safe design must be defined at the system level and enforced in hardware as well as software.
|
|
|
|
If you remember one engineering principle from this guide, make it this:
|
|
|
|
**A reliable embedded system is not one that never experiences faults. It is one that anticipates faults, limits their consequences, recovers appropriately, and leaves evidence behind.**
|
|
|
|
That mindset is what separates a board that works in the lab from a product that survives the field.
|