tarun-elango 62197e52c0 ml

Co-authored-by: Copilot <copilot@github.com>

2026-04-30 19:59:29 -04:00

Reinforcement Learning Basics

Learning through feedback.

1. Why Reinforcement Learning Matters

Reinforcement learning (RL) is the branch of machine learning concerned with making sequences of decisions under uncertainty. Instead of learning from a fixed dataset of correct answers, an RL system learns by interacting with an environment, observing the consequences of its actions, and adjusting behavior to maximize long-term reward.

That makes RL fundamentally different from the machine learning workflows most engineers see first:

In supervised learning, you are usually given input-output pairs and asked to imitate the right answer.
In unsupervised learning, you are asked to find structure in unlabeled data.
In reinforcement learning, there often is no direct correct action label. The system must discover good behavior from consequences.

This matters in real engineering because many important systems are not one-shot prediction problems. They are control problems.

Examples:

A robot arm does not make one prediction; it performs a sequence of motor commands.
A recommendation engine can optimize not just clicks today, but retention over weeks.
A data center controller can adjust power, cooling, and workload placement over time.
A chip power-management policy can trade energy against performance over many operating cycles.
A network congestion controller must continuously react to packet delay, loss, and changing traffic.

The central idea is simple:

A good action is not just one that looks good immediately. It is one that improves the total future outcome.

That single idea is what makes RL powerful and what makes it difficult.

2. When RL Is the Right Tool and When It Is Not

RL is attractive because it sounds general: an agent learns by trial and error. In practice, many teams misuse it.

RL is a strong fit when:

The problem is inherently sequential.
Actions change future states.
There is delayed feedback.
The environment is interactive or can be simulated.
It is hard to hand-code a strategy, but possible to define a performance signal.

RL is a poor fit when:

A simple rule-based controller already solves the problem reliably.
The action is effectively one-shot, so supervised learning is enough.
Exploration is unsafe or too expensive.
There is no good reward signal.
You cannot simulate the environment and cannot afford bad real-world behavior.

Practical decision rule:

If your problem can be solved as "predict, then act", start there. Use RL only when the real difficulty is acting over time, not prediction.

3. First Principles: What RL Is Actually Solving

At first principles, RL is about four facts:

The system is embedded in a world that changes over time.
The system can influence that world by choosing actions.
The quality of actions is often only visible after multiple future steps.
The system must improve behavior while still being uncertain about the world.

These lead to the core engineering challenges of RL.

3.1 Sequential Dependence

If you brake too late in autonomous control, that error affects the next state. If a recommender shows low-quality content now, user engagement may drop later. In RL, actions are coupled through time.

3.2 Delayed Consequences

The hardest part of RL is that the reward often appears after many actions. This is called the credit assignment problem.

Example:

A warehouse robot takes 30 small navigation decisions.
It collides only at the end.
Which earlier decisions caused the failure?

RL algorithms exist largely to solve this credit assignment problem efficiently.

3.3 Uncertainty

The agent usually does not know in advance how the environment will react. It must learn from samples.

3.4 Exploration

If the agent only repeats what currently seems best, it may miss better strategies. If it explores too much, performance suffers. This is the exploration-exploitation tradeoff.

4. Core Concepts and Intuition

Before math, get the mental model right.

4.1 Agent

The learner or decision-maker.

Examples:

A software process scheduling workloads
A robot controller
A bidding strategy in an ad platform
A cache tuning policy in a hardware system

4.2 Environment

Everything the agent interacts with. The environment receives the action, updates the world, and returns observations and rewards.

4.3 State

A state is the information needed to choose a good action. In theory, a state captures everything relevant about the current situation. In engineering practice, state design is a major source of success or failure.

Examples:

In a robot: position, velocity, sensor readings, battery level
In networking: RTT, packet loss, queue size, throughput history
In CPU power management: utilization, temperature, current frequency, power budget

If the state leaves out a critical variable, the agent may behave irrationally because it is effectively blind.

4.4 Observation vs State

In many real systems, the agent does not observe the true state directly. It only sees partial measurements. Cameras do not reveal everything. Network telemetry is noisy. Sensor streams can be delayed.

This is called partial observability.

Practical implication:

Many real-world RL problems are not clean fully observable state-control problems. They are closer to history-based control under uncertainty.

4.5 Action

The decision the agent makes.

Discrete action example: choose route A, B, or C
Continuous action example: steering angle, motor torque, voltage setting

4.6 Reward

A scalar feedback signal telling the agent how good the immediate outcome was.

This looks simple, but reward design is one of the most dangerous parts of RL. If you specify the wrong reward, the agent can optimize the wrong behavior very effectively.

4.7 Policy

A policy is the agent's behavior rule: given the current state or observation, what action should it take?

There are two broad types:

Deterministic policy: always picks the same action for the same state
Stochastic policy: outputs a distribution over actions

Stochastic policies are often useful when exploration matters or when the environment is uncertain.

4.8 Return

RL does not optimize only immediate reward. It optimizes return, usually the total discounted future reward.

If rewards are r_t, r_{t+1}, r_{t+2}, \dots, then the return from time t is:


G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots

where \gamma is the discount factor.

Intuition:

If \gamma is small, the agent is short-sighted.
If \gamma is close to 1, the agent values long-term outcomes more strongly.
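
As a minimal sketch of the return formula above, the function below sums a finite reward list with discounting. The name `discounted_return` and the example rewards are illustrative choices, not part of any library. It works backward through the list, using the identity G_t = r_t + \gamma G_{t+1}.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    g = 0.0
    # Work backward: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
short_sighted = discounted_return(rewards, gamma=0.0)  # only the first reward counts
far_sighted = discounted_return(rewards, gamma=1.0)    # all rewards count equally
```

With the same rewards, changing only \gamma changes how much the later steps contribute, which is the short-sighted versus far-sighted behavior described above.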

4.9 Episode

An episode is one rollout from start to termination.

Examples:

One game in Atari
One robot pick-and-place attempt
One session of user interaction

Some systems are episodic. Others are continuing tasks without a natural end.

5. The Standard RL Interaction Loop

flowchart LR
	A[Agent observes state or observation] --> B[Policy selects action]
	B --> C[Environment applies action]
	C --> D[Environment transitions to next state]
	D --> E[Environment emits reward]
	E --> F[Agent updates value estimates or policy]
	F --> A

This loop hides the true difficulty. Each stage has practical engineering choices:

What exactly is observed?
How often are actions applied?
How noisy is the reward?
Is the environment stationary?
How are updates computed?
How do we keep training stable?

6. Markov Decision Processes from First Principles

The classical mathematical model for RL is the Markov Decision Process (MDP).

An MDP contains:

A set of states S
A set of actions A
A transition function P(s' \mid s, a)
A reward function R(s, a, s')
A discount factor \gamma

6.1 Why the Markov Property Matters

The Markov property says:

If the state is defined properly, the future depends only on the present state and action, not on the full past history.

This is not magic. It is a modeling assumption.

If the state representation is complete enough, the present summarizes the useful past.

Example:

If you control a drone and include only its current position in the state, but not its velocity, the next position cannot be predicted from the state alone.
Add velocity, and the state comes much closer to satisfying the Markov property.

This is a deep engineering lesson:

Poor state design turns an easy RL problem into a hard one.

6.2 State Transitions

stateDiagram-v2
	[*] --> Observe
	Observe --> Decide: choose action a
	Decide --> Transition: environment reacts
	Transition --> Reward: emit r
	Reward --> Observe: new state s'

6.3 Model-Based vs Model-Free View

Model-based RL tries to learn or use the transition dynamics and rewards explicitly.
Model-free RL learns good behavior or values directly from interaction without an explicit world model.

In practice:

Model-based methods can be more sample efficient when models are accurate.
Model-free methods are often simpler to implement but can require much more data.

7. Value Functions: Why They Exist

A core idea in RL is that instead of reasoning about all future consequences from scratch every time, the agent can estimate how good states or actions are.

7.1 State Value Function

The value of a state under a policy \pi is the expected return if you start there and follow that policy:


V^\pi(s) = \mathbb{E}[G_t \mid s_t = s]

This tells you: how promising is this state?

7.2 Action Value Function

The action value, or Q-value, measures the expected return if you take action a in state s and then continue according to the policy:


Q^\pi(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]

This tells you: how promising is this action in this state?

7.3 Why Value Functions Are Useful

Suppose you are controlling a warehouse robot at an intersection.

Going left gives a slightly slower immediate path.
Going right looks shorter, but often leads to congestion.

Immediate reward may favor right. Long-term return may favor left. A value function captures that future effect.

8. Bellman Equations: The Core Recursive Insight

Bellman equations are central because they express long-term value recursively.

8.1 The Main Intuition

The value of a state is:

the immediate reward you expect now
plus the value of where you expect to land next

That recursive structure lets RL break a long horizon problem into repeated local updates.

8.2 Bellman Expectation Equation

For a fixed policy:


V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a) \left[r + \gamma V^\pi(s')\right]

Interpretation:

average over possible actions chosen by the policy
average over possible next states and rewards
immediate reward plus discounted future value
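
The equation above can be turned into a simple iterative procedure: sweep over the states and replace each value with the right-hand side until nothing changes much. The sketch below assumes a small tabular MDP stored as `P[state][action] = [(probability, next_state, reward), ...]`. This layout, the two-state example, and the function name are illustrative assumptions.

```python
def policy_evaluation(P, policy, gamma, tol=1e-10, max_sweeps=10_000):
    """Iteratively apply the Bellman expectation equation for a fixed policy.

    P[s][a] is a list of (probability, next_state, reward) tuples.
    policy[s][a] is the probability of choosing action a in state s.
    """
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in P:
            new_v = 0.0
            for a, pi_a in policy[s].items():          # average over actions
                for prob, s_next, r in P[s][a]:        # average over outcomes
                    new_v += pi_a * prob * (r + gamma * V[s_next])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    return V


# Hypothetical two-state MDP: staying in B pays 2 per step, moving A -> B pays 1.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "move": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "move": [(1.0, "A", 0.0)]},
}
policy = {"A": {"move": 1.0}, "B": {"stay": 1.0}}
V = policy_evaluation(P, policy, gamma=0.5)
```

Here V("B") solves v = 2 + 0.5 v, and V("A") is the one-step reward plus the discounted value of B.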

8.3 Bellman Optimality Equation

For the optimal value function:


V^*(s) = \max_a \sum_{s', r} P(s', r \mid s, a) \left[r + \gamma V^*(s')\right]

and for the optimal action value:


Q^*(s, a) = \sum_{s', r} P(s', r \mid s, a) \left[r + \gamma \max_{a'} Q^*(s', a')\right]

This is the foundation behind dynamic programming, Q-learning, and many other algorithms.

8.4 Step-by-Step Bellman Backup Intuition

Suppose a robot in state s can choose two actions.

Estimate the immediate reward for each action.
Estimate where each action usually leads.
Look up how valuable those next states are.
Combine immediate reward and next-state value.
Prefer the action with the larger total.

This repeated propagation of future value backward into present decisions is called a backup.
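
Applying that backup to every state until the values stop changing is value iteration. The sketch below uses a hypothetical version of the warehouse intersection: going right has the better immediate reward but usually leads to a congested aisle. The state names, rewards, and probabilities are made up for illustration, and the MDP layout `P[state][action] = [(probability, next_state, reward), ...]` is an assumption of this sketch.

```python
def q_backup(P, V, s, a, gamma):
    """One Bellman backup: expected immediate reward plus discounted next-state value."""
    return sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])


def value_iteration(P, gamma, tol=1e-10, max_sweeps=10_000):
    V = {s: 0.0 for s in P}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in P:
            best = max(q_backup(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Prefer the action with the larger backed-up total in each state.
    greedy = {s: max(P[s], key=lambda a: q_backup(P, V, s, a, gamma)) for s in P}
    return V, greedy


# Hypothetical intersection: rewards are negative travel costs.
P = {
    "intersection": {
        "left": [(1.0, "clear_aisle", -2.0)],
        "right": [(0.8, "congested_aisle", -1.0), (0.2, "clear_aisle", -1.0)],
    },
    "clear_aisle": {"go": [(1.0, "goal", -1.0)]},
    "congested_aisle": {"go": [(1.0, "goal", -5.0)]},
    "goal": {"stay": [(1.0, "goal", 0.0)]},
}
V, greedy = value_iteration(P, gamma=0.9)
```

In this toy setup the immediate reward favors `right`, but the backed-up values favor `left`, because the expected cost of congestion is propagated back to the intersection.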

9. Exploration vs Exploitation

This is the most famous RL tradeoff.

Exploitation means choosing what currently looks best.
Exploration means trying alternatives to learn whether something better exists.

9.1 Why It Is Hard

If a controller only exploits, it can get stuck with a mediocre strategy. If it explores aggressively in production, it may cause failures or incur real cost.

9.2 Common Exploration Strategies

Epsilon-Greedy

With probability \varepsilon, choose a random action. Otherwise choose the best-known action.

Good for simple discrete tasks, but crude.
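
A minimal sketch of epsilon-greedy selection over a list of estimated action values. The function name and the optional `rng` argument are illustrative choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Ties are broken in favor of the lowest action index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

In practice \varepsilon is often decayed over training, so the agent explores more early on and less once its estimates are better.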

Softmax or Boltzmann Exploration

Actions with higher estimated value are more likely, but not guaranteed.
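
One way to sketch this is to turn the value estimates into probabilities with a softmax and then sample from them. The `temperature` parameter and the function names are assumptions of this example. Lower temperature makes the choice greedier, and higher temperature makes it closer to uniform.

```python
import math
import random


def softmax_probs(q_values, temperature=1.0):
    """Convert value estimates into action probabilities."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, temperature=1.0, rng=random):
    """Sample an action index in proportion to its softmax probability."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


probs_warm = softmax_probs([1.0, 2.0], temperature=1.0)
probs_cold = softmax_probs([1.0, 2.0], temperature=0.1)
```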

Optimism in the Face of Uncertainty

Prefer less-visited actions because their value is uncertain.

Upper Confidence Bound Style Exploration

Common in bandits and some RL settings, balancing estimated value with uncertainty.
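
A rough sketch of UCB-style selection for a bandit setting: each action's score is its estimated value plus a bonus that shrinks as the action is tried more often. The exploration constant `c` and the function name are illustrative assumptions. The exact bonus differs between UCB variants.

```python
import math


def ucb_action(value_estimates, counts, c=2.0):
    """Pick the action with the highest value estimate plus uncertainty bonus."""
    # Try every action at least once before comparing scores.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    total = sum(counts)
    scores = [
        q + c * math.sqrt(math.log(total) / n)
        for q, n in zip(value_estimates, counts)
    ]
    return max(range(len(scores)), key=lambda a: scores[a])
```

For example, an action tried once can outrank a slightly better-looking action tried a hundred times, because its estimate is far less certain.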

Entropy Regularization

Popular in policy-gradient methods. Encourages action diversity during training.

9.3 Production Reality

In many production systems, uncontrolled exploration is unacceptable.

Examples:

Robotics can damage hardware.
Ad bidding can waste money.
Power control can violate thermal limits.
Medical systems can cause harm.

So teams often use:

simulation-first training
offline policy evaluation
constrained exploration
shadow deployments
human approval gates

10. Major Algorithm Families

You do not need to memorize every algorithm. You need to understand the families and why they exist.

10.1 Dynamic Programming

Dynamic programming assumes you know the full environment model.

Examples:

policy evaluation
policy iteration
value iteration

Why it matters:

It gives conceptual foundations.
It is often not directly usable in messy real-world systems because the model is not known exactly.

10.2 Monte Carlo Methods

Monte Carlo methods wait until the end of an episode and use actual sampled returns to estimate values.

Strengths:

Conceptually simple
Uses real returns, not bootstrapped estimates

Weaknesses:

High variance
Needs episode completion
Slow credit assignment for long tasks
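
A minimal sketch of first-visit Monte Carlo value estimation, assuming each episode is recorded as a list of `(state, reward)` pairs where `reward` is what was received after leaving `state`. That data layout and the function name are assumptions of this example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma):
    """Estimate V(s) as the average return following the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_return = {}
        # Walk backward so g is the return from each step onward.
        # Overwriting keeps the value from the earliest visit to each state.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_return[state] = g
        for state, g_first in first_return.items():
            returns[state].append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 3.0)],
]
V = first_visit_mc(episodes, gamma=0.5)
```

Nothing can be updated until an episode finishes, which is the "needs episode completion" weakness listed above.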

10.3 Temporal Difference Learning

Temporal Difference (TD) learning updates estimates using a one-step bootstrapped target.

Basic idea:


\text{new estimate} \leftarrow \text{old estimate} + \alpha (\text{target} - \text{old estimate})

In TD learning, the target uses the current estimate of the next state's value.

Why TD is powerful:

Learns online
Does not need episode termination
Often lower variance than Monte Carlo

Tradeoff:

Introduces bias through bootstrapping
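
A minimal sketch of a single TD(0) update for state values, assuming values are kept in a plain dictionary and unseen states default to zero. The function name and argument order are illustrative.

```python
def td0_update(V, state, reward, next_state, done, alpha, gamma):
    """Move V[state] a step toward the one-step bootstrapped target."""
    # The target uses the current estimate of the next state's value,
    # unless the episode has ended.
    target = reward if done else reward + gamma * V.get(next_state, 0.0)
    td_error = target - V.get(state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error


V = {"A": 0.0, "B": 2.0}
error = td0_update(V, "A", reward=1.0, next_state="B", done=False, alpha=0.5, gamma=0.9)
```

The update happens after one transition, so learning can proceed online without waiting for the episode to end.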

10.4 SARSA

SARSA is an on-policy TD control method.

Update intuition:

It learns from the action actually taken next.

This tends to make it more conservative when exploration is present.
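
A minimal sketch of the SARSA update, assuming Q-values live in a `defaultdict` keyed by `(state, action)`. The names are illustrative. The important detail is that the target uses `a_next`, the action the agent actually takes next, rather than the best action available.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha, gamma):
    """On-policy TD control: bootstrap from the next action actually taken."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 5.0
# The agent explored and took "left" in s1, so the target uses 1.0, not 5.0.
sarsa_update(Q, "s0", "go", 0.0, "s1", "left", done=False, alpha=0.5, gamma=1.0)
```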

10.5 Q-Learning

Q-learning is an off-policy TD control method.

Its famous update is:


Q(s, a) \leftarrow Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)

Step-by-step meaning:

Start with the old estimate for action a in state s.
Observe reward r.
Estimate the best future value at the next state s'.
Form a target: immediate reward plus discounted best future estimate.
Move the current estimate a bit toward that target.

Why it works conceptually:

It repeatedly enforces Bellman optimality through sampled experience.

Why it fails in practice for large spaces:

Tables do not scale to large or continuous states.
Sample inefficiency becomes severe.
Function approximation can make learning unstable.
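
To make the update concrete, here is a small tabular sketch on a hypothetical five-state corridor where only reaching the rightmost state pays a reward. The environment, hyperparameters, and names are assumptions chosen for illustration. The behavior policy is purely random, which shows the off-policy property: the learned Q-values still describe the greedy policy.

```python
import random

N_STATES = 5      # corridor states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Deterministic corridor dynamics with a reward of 1 at the right end."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)  # random behavior policy
            s_next, r, done = step(s, a)
            # Target: immediate reward plus discounted best future estimate.
            best_next = 0.0 if done else max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
            if done:
                break
    return Q


Q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

A table works here because there are only ten state-action pairs. The scaling problems listed above appear as soon as the state cannot be enumerated.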

10.6 Deep Q-Networks (DQN)

DQN replaces the Q-table with a neural network.

This makes large state spaces tractable, but creates instability because:

consecutive samples are correlated
targets change while the network is learning

Two famous engineering fixes:

Experience replay: store transitions and train on randomized batches
Target network: use a slower-moving copy of the Q-network to stabilize targets

flowchart TD
	A[Environment transition] --> B[Store transition in replay buffer]
	B --> C[Sample random mini-batch]
	C --> D[Q-network predicts Q values]
	C --> E[Target network builds target values]
	D --> F[Compute TD loss]
	E --> F
	F --> G[Gradient update Q-network]
	G --> H[Periodically sync target network]
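
The two fixes can be sketched without a deep learning framework. Below, `ReplayBuffer` stores transitions and samples them uniformly at random, and `td_targets` builds targets from a `target_q` callable that stands in for the target network. Both names and the callable interface are assumptions of this sketch, not the API of any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer that evicts the oldest transitions first."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling breaks up the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma):
    """Build TD targets; target_q(next_obs) returns a list of Q-values."""
    targets = []
    for _, _, reward, next_obs, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_obs))
        targets.append(reward + bootstrap)
    return targets
```

In a full DQN, the online network would be trained to match these targets for the sampled actions, and `target_q` would be refreshed from the online network only periodically.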

10.7 Policy Gradient Methods

Instead of learning values and deriving actions indirectly, policy-gradient methods optimize the policy directly.

Why use them:

natural handling of stochastic policies
good for continuous action spaces
direct optimization of behavior

Main challenge:

gradients can have high variance
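
As a minimal illustration, the sketch below runs a REINFORCE-style update on a hypothetical two-armed bandit with a softmax policy over two logits. The arm rewards, learning rate, and running-mean baseline are assumptions chosen to keep the example small. Real policy-gradient code would use a neural network and automatic differentiation.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Optimize a softmax policy directly from sampled rewards."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(logits)
        a = rng.choices(range(len(logits)), weights=probs, k=1)[0]
        r = arm_rewards[a]
        advantage = r - baseline  # subtracting a baseline reduces variance
        for i in range(len(logits)):
            # Gradient of log pi(a) with respect to logit i.
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_pi
        baseline += (r - baseline) / t  # running mean of observed rewards
    return softmax(logits)


final_probs = reinforce_bandit([0.2, 1.0])
```

The policy shifts probability toward the better arm without ever estimating a Q-table, which is the "direct optimization of behavior" point above.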

10.8 Actor-Critic Methods

Actor-critic combines two components:

Actor: chooses actions
Critic: estimates how good states or actions are

Why this architecture is common:

the actor improves the policy
the critic reduces variance by providing a learned baseline or value estimate

10.9 PPO and Why Engineers Use It So Often

Proximal Policy Optimization (PPO) became popular because it is relatively robust and easier to tune than many earlier policy-gradient approaches.

Why teams like it:

good practical stability
works across many environments
simpler than more fragile second-order methods

Why it is not magic:

still sample hungry
still sensitive to reward design
can still overfit simulator quirks
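
Much of PPO's practical stability comes from its clipped surrogate objective, which this section does not spell out. A per-sample sketch follows. Here `ratio` is the new policy's probability of the action divided by the old policy's, `advantage` is the estimated advantage, and `clip_eps=0.2` is a commonly used default rather than a required value.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: the pessimistic minimum of clipped and unclipped terms."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

When the advantage is positive, the objective stops rewarding ratio increases beyond `1 + clip_eps`. When it is negative, decreases below `1 - clip_eps` stop helping. This limits how far a single update can move the policy.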

10.10 Deterministic Policy Gradient and Continuous Control

For continuous control tasks like torque, steering, power allocation, or analog parameter tuning, deterministic or stochastic actor-critic methods are common.

Examples:

DDPG
TD3
SAC

Practical note:

Continuous control adds sensitivity to scaling, action clipping, and simulation accuracy.

11. Bandits vs Full Reinforcement Learning

Many engineers should use contextual bandits before RL.

Bandits:

choose an action
observe reward
no long state transition chain

RL:

actions change future states
delayed consequences matter

If your recommendation system only needs to choose the next item and future dynamics are weak, a contextual bandit may be a better engineering choice than full RL.

12. Reward Design: The Most Common Source of Failure

Reward design is not a small detail. It defines the optimization target.

12.1 Good Reward Properties

A reward should be:

aligned with the real business or system objective
measurable
not too sparse unless the algorithm can handle it
resistant to exploitation
stable enough that training can make progress

12.2 Reward Hacking

An RL agent will optimize exactly what you specify, not what you meant.

Examples:

A robot learns to spin in place because movement sensor counts increase reward.
A recommender maximizes clicks by showing low-quality sensational content, hurting retention.
A thermal controller reduces measured temperature by throttling performance so aggressively that throughput collapses.

12.3 Sparse vs Dense Rewards

Sparse reward: reward only when the final goal is achieved
Dense reward: reward provides intermediate guidance

Sparse reward is often more faithful but much harder to learn from.

Dense reward is easier for learning but can distort behavior if shaping is poorly designed.

12.4 Step-by-Step Reward Design Process

Write down the actual system objective in operational terms.
List what signals are directly measurable online.
Identify unintended shortcuts the agent might exploit.
Add penalties or constraints for unacceptable behavior.
Test reward behavior on edge cases before large-scale training.

13. State and Action Space Design

RL performance often depends more on problem formulation than algorithm choice.

13.1 State Design Mistakes

Common mistakes:

omitting variables that affect dynamics
including huge numbers of irrelevant features
mixing signals with inconsistent time scales
ignoring latency and observation delay
feeding raw values with poor normalization

13.2 Action Design Mistakes

Common mistakes:

giving the agent more freedom than the system can safely support
using overly fine-grained continuous actions with noisy actuators
using unrealistic actions not available in deployment

Engineering rule:

Restrict the action space to what the real actuator, API, or controller can actually execute.

14. Model-Free vs Model-Based in Real Systems

14.1 Model-Free Advantages

simpler conceptually
easier to bootstrap from logs plus interaction
often fewer assumptions about system dynamics

14.2 Model-Free Disadvantages

typically data hungry
costly if environment interaction is expensive
weaker extrapolation under distribution shift

14.3 Model-Based Advantages

can improve sample efficiency
enables planning and imagination rollouts
useful when system dynamics are partially known from physics or engineering models

14.4 Model-Based Disadvantages

learned models can be wrong in dangerous ways
compounding model errors can break planning
engineering complexity is higher

14.5 Hardware-Connected Example

Consider RL for CPU frequency scaling.

A model-free agent may learn from measured utilization, latency, temperature, and power.
A model-based system may also use known thermal dynamics or power-performance models.

Tradeoff:

model-based methods may learn faster
but a bad model of thermal lag can produce unstable oscillations in deployment

15. Online RL, Offline RL, and Batch Learning

15.1 Online RL

The agent learns by interacting with the live environment.

Pros:

direct adaptation
no mismatch between a static dataset and the live behavior loop

Cons:

risky
costly
exploration may be unsafe

15.2 Offline RL

Offline RL learns from previously collected logs without live exploration during training.

Pros:

safer for high-risk domains
uses historical data
easier to audit initially

Cons:

limited by data coverage
agent may choose actions not well supported by the dataset
evaluation is hard

15.3 Practical Engineering Choice

Many production teams use a staged approach:

start with logged data
train offline
evaluate carefully
deploy in shadow mode
allow narrow online adaptation later

16. Training Pipeline in Practice

An RL project is not just an algorithm. It is a pipeline.

flowchart LR
	A[Problem definition] --> B[State, action, reward design]
	B --> C[Simulator or environment integration]
	C --> D[Data collection rollouts]
	D --> E[Training jobs]
	E --> F[Evaluation and safety checks]
	F --> G[Shadow deployment]
	G --> H[Controlled production rollout]
	H --> I[Monitoring and retraining]

16.1 Environment Interface

At minimum, an environment usually needs:

reset() to start a new episode
step(action) to apply an action and receive next observation, reward, done flag, and metadata

This looks simple, but environment correctness is critical.

If step() has bugs, your agent will learn the wrong physics, wrong costs, or wrong timing.
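
A minimal sketch of this interface, using a hypothetical `ThermostatEnv` whose dynamics and reward are invented for illustration. It returns the four values described above. Libraries differ in the exact signature: Gymnasium, for example, splits the done flag into `terminated` and `truncated`.

```python
import random


class ThermostatEnv:
    """Hypothetical toy environment: keep a temperature near a setpoint."""

    def __init__(self, setpoint=21.0, max_steps=50, seed=0):
        self.setpoint = setpoint
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.temp = None
        self.t = 0

    def reset(self):
        # Reset ALL episode state so nothing leaks between episodes.
        self.temp = self.setpoint + self.rng.uniform(-3.0, 3.0)
        self.t = 0
        return self.temp

    def step(self, action):
        # action: -1 = cool, 0 = idle, +1 = heat
        if action not in (-1, 0, 1):
            raise ValueError(f"invalid action: {action}")
        self.temp += 0.5 * action + self.rng.gauss(0.0, 0.1)
        self.t += 1
        reward = -abs(self.temp - self.setpoint)
        done = self.t >= self.max_steps
        info = {"step": self.t}
        return self.temp, reward, done, info


env = ThermostatEnv()
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = 1 if obs < env.setpoint else -1  # naive hand-written controller
    obs, reward, done, info = env.step(action)
    total_reward += reward
```

Even in a toy like this, the reset logic, the done condition, and action validation are the places where bugs tend to hide.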

16.2 Simulator Quality

A simulator is often the most important component in applied RL.

Bad simulators cause:

unrealistic policies
exploitation of artifacts
failure during sim-to-real transfer

What to verify:

timing fidelity
sensor noise realism
latency realism
actuator saturation
failure modes
distribution of rare events

16.3 Experience Collection

Choices include:

single-threaded rollouts
vectorized environments
distributed actor workers

Tradeoff:

More parallel rollout improves throughput but can create stale policies if learners and actors drift apart.

17. Example Implementation Skeleton

for episode in range(num_episodes):
	obs = env.reset()
	done = False

	while not done:
		action = policy(obs)
		next_obs, reward, done, info = env.step(action)
		replay_buffer.add(obs, action, reward, next_obs, done)
		learner.update(replay_buffer)
		obs = next_obs

This loop hides the real concerns:

action noise and exploration policy
reward normalization
target calculation
batching
device placement
checkpointing
replay sampling strategy
deterministic evaluation runs

18. Stability Problems and Why Deep RL Is Hard

Deep RL is not just supervised learning with rewards. It is harder for structural reasons.

18.1 Non-Stationary Targets

The data distribution changes as the policy changes, and with bootstrapping the regression targets shift as well.

In supervised learning, labels are usually fixed. In RL, the agent changes behavior, which changes visited states, which changes the training data.

18.2 Correlated Samples

Sequential experience samples are correlated. That violates the i.i.d. assumption under which stochastic gradient methods behave best.

18.3 Bootstrapping Error

If you estimate targets from other estimates, errors can feed into future errors.

18.4 Overestimation Bias

Using a max over noisy estimates can bias values upward. Double Q-learning style fixes aim to reduce this.
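
A small simulation makes the bias visible. Assume every action has a true value of exactly zero, so the true maximum is also zero, and each estimate is corrupted by independent Gaussian noise. The number of actions, noise level, and function name are illustrative assumptions.

```python
import random


def mean_max_of_noisy_estimates(n_actions=5, noise_std=1.0, trials=2000, seed=0):
    """Average of max over noisy estimates when all true values are zero."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(estimates)
    return total / trials


bias = mean_max_of_noisy_estimates()
```

The averaged maximum comes out clearly positive even though no action is actually better than zero. With bootstrapping, that upward error is fed back into future targets.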

18.5 Distribution Shift

Policies may fail badly on states underrepresented in training.

19. Evaluation: How to Know If the Agent Is Actually Good

RL evaluation is often weaker than teams think.

19.1 Training Reward Is Not Enough

A rising training reward does not guarantee production usefulness.

The agent may be:

overfitting the simulator
exploiting reward loopholes
succeeding only on easy scenarios
unstable under real latency or noise

19.2 What to Measure

Measure at least:

primary task success
safety violations
worst-case behavior
sample efficiency
robustness to perturbations
sensitivity to random seeds
inference latency
resource usage

19.3 Evaluation Regimes

deterministic evaluation runs
unseen scenario tests
adversarial stress tests
ablations
baseline comparisons against heuristic controllers

19.4 Baselines Matter

A common RL mistake is comparing against weak baselines.

Always compare against:

simple heuristics
rule-based controllers
classical optimization methods
supervised or bandit baselines where applicable

If RL cannot beat a clear heuristic, the project may not justify itself.

20. Sim-to-Real Transfer

Many exciting RL demos fail when leaving simulation.

20.1 Why Sim-to-Real Is Hard

Real systems differ from simulation in:

friction and wear
sensor bias
delays
packet loss
thermal inertia
manufacturing variation
actuator nonlinearities

20.2 Common Mitigations

domain randomization
system identification
safety envelopes and fallback controllers
gradual rollout on hardware
residual learning on top of trusted controllers

20.3 Software and Hardware Connection

A robot policy that is stable in a perfect simulator may oscillate on real hardware because motor drivers saturate, encoder readings lag, and battery voltage drops under load.

That is not just an algorithm issue. It is a software-hardware co-design issue.

21. Safety, Constraints, and Guardrails

In real systems, maximizing reward is not enough. You need constraints.

Examples:

do not exceed temperature limit
do not violate collision boundaries
do not overspend budget
do not exceed network loss threshold
do not trigger unstable oscillations in control loops

Practical safety patterns:

action clipping
rule-based safety layers
constrained RL formulations
fallback controller takeover
kill switches
anomaly detection around policy outputs

flowchart TD
	A[Policy proposes action] --> B{Safety validator}
	B -->|Safe| C[Execute action]
	B -->|Unsafe| D[Fallback controller or clipped action]
	C --> E[Observe result and log]
	D --> E
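
The validator in the diagram can be sketched as a small function that sits between the policy and the actuator. The thresholds, the temperature check, and the return labels below are hypothetical. A real system would encode its own limits and log every intervention.

```python
import math


def safe_action(proposed, temperature, *, low, high, temp_limit, fallback):
    """Validate a proposed continuous action before it reaches the actuator.

    Returns (action_to_execute, reason).
    """
    # Hand control to the fallback if the system is already at its limit
    # or the policy produced a non-finite output.
    if temperature >= temp_limit or math.isnan(proposed) or math.isinf(proposed):
        return fallback, "fallback"
    clipped = max(low, min(proposed, high))
    if clipped != proposed:
        return clipped, "clipped"
    return proposed, "ok"
```

Usage might look like `action, reason = safe_action(policy_output, temp, low=1.0, high=3.0, temp_limit=90.0, fallback=1.0)`, with `reason` logged so that frequent interventions can be detected.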

22. Real Industry Use Cases

22.1 Robotics and Industrial Automation

Use cases:

grasping
locomotion
path planning assistance
manipulation under uncertainty

Challenges:

expensive exploration
sim-to-real gap
safety-critical failures

22.2 Recommendation and Personalization

Use cases:

long-term engagement optimization
notification timing
multi-step user interaction policies

Challenges:

delayed rewards
confounding from user behavior
exploration ethics and product risk

22.3 Data Center and Cloud Control

Use cases:

cooling optimization
workload placement
power-performance tuning

Challenges:

slow dynamics
multi-objective tradeoffs
partial observability

22.4 Networking and Congestion Control

Use cases:

adaptive congestion control
routing decisions
queue management

Challenges:

non-stationary traffic
noisy measurements
unfairness risks

22.5 Hardware and Computer Engineering Scenarios

Use cases:

dynamic voltage and frequency scaling (DVFS)
cache prefetching and memory policy tuning
NoC routing heuristics
compiler optimization ordering
chip floorplanning assistance

Why RL appears here:

many control knobs
long-term tradeoffs between power, latency, throughput, and thermals
hard-to-model interactions between hardware layers and workloads

22.6 Operations Research and Scheduling

Use cases:

job-shop scheduling
warehouse dispatch
fleet management

Challenges:

large combinatorial action spaces
sparse rewards
hard constraints

23. Common Failure Modes

23.1 Reward Misalignment

The policy gets better at the specified metric while the real system gets worse.

23.2 State Aliasing

Different real situations look identical to the agent because the observation is missing critical variables.

23.3 Unstable Training

Loss spikes, Q-values diverge, policy collapses, or performance varies wildly across seeds.

23.4 Simulator Exploitation

The policy finds loopholes in simulation that do not exist in reality.

23.5 Offline-to-Online Collapse

A policy trained on logs chooses actions outside the data support and fails after deployment.

23.6 Over-Optimization of Proxy Metrics

A system maximizes short-term measurable metrics while harming long-term objectives.

24. Debugging Reinforcement Learning Systems

RL debugging must be systematic. Random tuning is usually wasted effort.

flowchart TD
	A[Policy performs poorly] --> B{Environment correct?}
	B -->|No| C[Fix transition, reward, reset, or done logic]
	B -->|Yes| D{Reward aligned?}
	D -->|No| E[Redesign reward and constraints]
	D -->|Yes| F{State sufficient?}
	F -->|No| G[Add missing signals, history, normalization]
	F -->|Yes| H{Baseline beats agent?}
	H -->|Yes| I[Re-evaluate algorithm choice and hyperparameters]
	H -->|No| J{Generalizes to unseen cases?}
	J -->|No| K[Expand evaluation distribution and regularize]
	J -->|Yes| L[Investigate deployment latency, scaling, and safety layers]

24.1 Debugging Order That Saves Time

Verify environment correctness.
Verify reward correctness.
Verify state and action definitions.
Beat a trivial baseline on a tiny version of the problem.
Check reproducibility across random seeds.
Only then tune larger models and advanced algorithms.

24.2 Practical Checks

manually inspect several episodes step by step
print or plot rewards, dones, and critical state variables
confirm resets do not leak hidden state
confirm action bounds match deployment bounds
compare policy outputs against a hand-written controller
run ablations removing one feature at a time

24.3 If Training Is Unstable

Check:

learning rate too high
reward scale too large
missing normalization
target updates too aggressive
replay buffer too small or too biased
actor-critic update imbalance
insufficient entropy or too much entropy

25. Best Practices for Engineers

25.1 Start with the Simplest Viable Environment

Reduce the problem first. If you cannot learn in a toy version, the full version will not work.

25.2 Build Strong Baselines

Before deep RL, implement:

random policy
heuristic policy
supervised or bandit baseline if applicable
classical controller if available

25.3 Make the Environment Observable

Log everything needed to reconstruct behavior:

observations
actions
rewards
next observations
done reasons
policy version
simulator version

25.4 Control Randomness

Use fixed seeds for debugging, then multiple seeds for serious evaluation.

25.5 Normalize Inputs and Sometimes Rewards

Poor scaling can destroy stability.
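
One common approach is to keep running statistics of each input and standardize on the fly. The sketch below uses Welford's online algorithm for a single scalar feature. The class name and the `eps` default are illustrative assumptions.

```python
import math


class RunningNormalizer:
    """Track a running mean and variance, and standardize new values."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Fall back to unit variance until there are at least two samples.
        variance = self.m2 / self.count if self.count > 1 else 1.0
        return (x - self.mean) / math.sqrt(variance + self.eps)


norm = RunningNormalizer()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    norm.update(value)
```

Statistics should normally be frozen for evaluation and deployment, so the policy sees inputs scaled the same way as during training.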

25.6 Separate Training and Evaluation

Do not judge a policy while exploration noise is still active unless that is the real deployment behavior.

25.7 Protect Production with Guardrails

Never assume the learned policy alone is a sufficient safety mechanism.

26. Tradeoffs and Design Decisions

26.1 Discrete vs Continuous Actions

Discrete is easier algorithmically.
Continuous is often more realistic for control.
Discretizing continuous control can simplify training but may reduce policy quality.

26.2 Dense Reward vs Sparse Reward

Dense reward improves learning speed.
Sparse reward may preserve objective fidelity.
Reward shaping helps, but can introduce bias and shortcuts.

26.3 On-Policy vs Off-Policy

On-policy methods are often simpler to reason about and more stable, but they use data less efficiently because samples from older policies are discarded.
Off-policy methods can reuse data and be more sample efficient, but stability becomes trickier.

26.4 Simulation Fidelity vs Simulation Speed

High fidelity improves realism.
High speed improves experimentation.
Teams usually need a layered stack: fast simulator for iteration, high-fidelity simulator for validation.

26.5 Single Objective vs Multi-Objective Control

Real systems often optimize multiple goals:

latency
power
cost
reliability
fairness

Collapsing them into one reward scalar is convenient but dangerous because it hides tradeoffs.
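A small illustration of how a scalar hides tradeoffs, assuming a weighted sum of two of the goals above with made-up weights and measurements: two very different operating points receive the same reward.

```python
def scalar_reward(latency_ms, power_w, w_latency=1.0, w_power=1.0):
    """Weighted sum of two costs, negated so that lower cost is higher reward."""
    return -(w_latency * latency_ms + w_power * power_w)

# Hypothetical policies: one fast and power-hungry, one slow and frugal.
policy_a = {"latency_ms": 10.0, "power_w": 50.0}
policy_b = {"latency_ms": 50.0, "power_w": 10.0}

reward_a = scalar_reward(**policy_a)  # -60.0
reward_b = scalar_reward(**policy_b)  # -60.0, indistinguishable from policy_a
```

Logging each component alongside the scalar keeps the tradeoff visible during evaluation.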

27. Interview-Level Understanding

An engineer should be able to explain these clearly.

27.1 What Makes RL Harder Than Supervised Learning?

no direct labels for correct actions
delayed rewards
exploration required
non-stationary data distribution
agent actions affect future data

27.2 What Is the Difference Between Monte Carlo and TD?

Monte Carlo uses full sampled returns after episode completion.
TD bootstraps using current estimates of future value.
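Monte Carlo is unbiased but has higher variance. TD introduces bias through bootstrapping but has lower variance and can update before the episode ends, which often makes it more practical.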
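The two targets can be compared on a tiny example. This sketch assumes a discount factor `gamma` and a hypothetical current value estimate of 0.8 for the next state; both numbers are made up.

```python
def monte_carlo_return(rewards, gamma):
    """Discounted sum of all rewards observed until the episode ended."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, next_value, gamma, done=False):
    """One-step target: observed reward plus discounted value estimate."""
    return reward + (0.0 if done else gamma * next_value)

rewards = [1.0, 0.0, 2.0]
gamma = 0.5

# Needs the whole episode: 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
mc = monte_carlo_return(rewards, gamma)

# Needs one step plus an estimate: 1.0 + 0.5 * 0.8 = 1.4
td = td_target(rewards[0], next_value=0.8, gamma=gamma)
```

The TD target is only as good as the estimate it bootstraps from, which is where its bias comes from.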

27.3 Why Is Experience Replay Useful?

breaks temporal correlations
improves data reuse
stabilizes deep Q-learning training
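A minimal replay buffer sketch, assuming transitions are stored as plain tuples and sampled uniformly; the class name and the optional `seed` argument are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store that samples past transitions uniformly."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up temporal ordering.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=5, seed=0)
for step in range(10):
    # (state, action, reward, next_state, done) with toy values
    buffer.add((step, 0, 1.0, step + 1, False))
batch = buffer.sample(3)
```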

27.4 Why Is Reward Design So Important?

Because the agent optimizes the specified objective precisely, including loopholes. A badly defined reward produces systematically bad behavior.

27.5 What Is the Difference Between On-Policy and Off-Policy?

On-policy learns from data generated by the current policy.
Off-policy can learn from data generated by a different policy.

This matters for data reuse, stability, and deployment strategy.

28. A Practical Worked Example: Thermal-Aware CPU Frequency Control

This example connects RL to computer engineering.

28.1 Problem

Choose CPU frequency over time to balance:

application latency
throughput
power consumption
thermal limits

28.2 State Candidates

current frequency
CPU utilization
queue depth
recent latency percentiles
package temperature
recent power draw
workload type indicator

28.3 Actions

discrete frequency steps
optional turbo enable or disable

28.4 Reward Example

One crude reward might be:


\text{reward} = -w_1 \cdot \text{latency} - w_2 \cdot \text{power} - w_3 \cdot \max(0, \text{temp} - T_{\max})
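The crude reward above translates directly into code. In this sketch the weights, the units, and the function name `frequency_reward` are assumptions chosen for illustration.

```python
def frequency_reward(latency_ms, power_w, temp_c, t_max_c,
                     w1=1.0, w2=0.5, w3=10.0):
    """Penalize latency, power, and any temperature above the limit."""
    thermal_excess = max(0.0, temp_c - t_max_c)  # zero while under the limit
    return -w1 * latency_ms - w2 * power_w - w3 * thermal_excess

# Under the thermal limit: -1.0 * 5 - 0.5 * 20 = -15.0
cool = frequency_reward(latency_ms=5.0, power_w=20.0, temp_c=70.0, t_max_c=90.0)

# 5 degrees over the limit adds a further -10.0 * 5 = -50.0
hot = frequency_reward(latency_ms=5.0, power_w=20.0, temp_c=95.0, t_max_c=90.0)
```

The thermal term is a soft penalty, so it does not replace a hard safety override.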

28.5 Risks

the agent may oscillate frequency rapidly
sensor lag may create delayed feedback loops
reward may encourage aggressive throttling that hurts throughput

28.6 Practical Safeguards

minimum dwell time before changing frequency again
thermal emergency override
action smoothing
evaluation on bursty and sustained workloads
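The first two safeguards can sit in a thin wrapper between the policy and the actuator. This is a sketch, assuming discrete frequency levels where level 0 is the safe minimum; `FrequencyGuard` and its parameters are hypothetical names.

```python
class FrequencyGuard:
    """Applies a minimum dwell time and a thermal emergency override."""

    def __init__(self, min_dwell_steps, t_emergency_c, safe_level=0):
        self.min_dwell_steps = min_dwell_steps
        self.t_emergency_c = t_emergency_c
        self.safe_level = safe_level
        self.current_level = safe_level
        self.steps_since_change = min_dwell_steps  # first change is allowed

    def filter(self, proposed_level, temp_c):
        if temp_c >= self.t_emergency_c:
            applied = self.safe_level      # override bypasses the dwell time
        elif self.steps_since_change < self.min_dwell_steps:
            applied = self.current_level   # too soon: hold the current level
        else:
            applied = proposed_level
        if applied != self.current_level:
            self.current_level = applied
            self.steps_since_change = 0
        else:
            self.steps_since_change += 1
        return applied

guard = FrequencyGuard(min_dwell_steps=2, t_emergency_c=95.0)
proposals = [(3, 70.0), (1, 70.0), (1, 70.0), (1, 70.0), (4, 99.0)]
applied = [guard.filter(level, temp) for level, temp in proposals]
# applied == [3, 3, 3, 1, 0]: two held steps, one change, then the override
```

Keeping the guard outside the learned policy means it still holds when the policy misbehaves.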

This example shows why RL in engineering is never just about the algorithm. It is about control stability, sensing, actuation, and constraints.

29. Step-by-Step Mental Model for Solving an RL Problem

When you face a new RL problem, work through it in this order.

What is the real objective over time?
What decisions actually influence future outcomes?
What information is available at decision time?
What actions are truly possible in production?
What makes exploration risky or expensive?
Can the environment be simulated accurately enough?
What simple baseline should already work?
What metrics prove the policy is useful and safe?

If you cannot answer these clearly, the problem is not ready for RL.

30. Common Mistakes Engineers Make

choosing RL for a problem that is really supervised learning or optimization
defining rewards that are easy to exploit
skipping strong heuristic baselines
trusting simulator performance too early
using huge models before validating environment correctness
ignoring latency, actuator limits, or safety constraints
evaluating on only one seed or one narrow scenario set
forgetting that deployment distribution changes over time

31. Production Deployment Patterns

31.1 Shadow Mode

Run the policy without controlling the system. Compare proposed actions against the live controller.
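A sketch of the comparison step, assuming both controllers log one discrete action per decision step; `shadow_report` and its output keys are illustrative.

```python
def shadow_report(live_actions, shadow_actions):
    """Compare what the shadow policy proposed with what the live controller did."""
    if len(live_actions) != len(shadow_actions):
        raise ValueError("action logs must have the same length")
    disagreements = [
        i for i, (live, shadow) in enumerate(zip(live_actions, shadow_actions))
        if live != shadow
    ]
    total = len(live_actions)
    agreement = 1.0 - len(disagreements) / total if total else 1.0
    return {"agreement_rate": agreement, "disagreement_steps": disagreements}

live = [2, 2, 3, 1]      # actions the live controller actually took
shadow = [2, 3, 3, 1]    # actions the policy would have taken
report = shadow_report(live, shadow)
# agreement_rate is 0.75 and step 1 is flagged for review
```

Agreement alone does not show which controller was right, so the flagged steps are the ones worth inspecting by hand.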

31.2 Human-in-the-Loop Approval

Useful for high-cost decisions where operators can review policy suggestions.

31.3 Hierarchical Control

Let RL handle high-level strategy while a classical controller handles low-level stable execution.

Example:

RL chooses target speed or energy budget
PID or MPC handles actuator-level control
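A toy sketch of this split, with loud assumptions: `high_level_policy` is a hand-written stand-in for a learned policy, the low-level loop is a PI controller rather than full PID, and the plant is a simple integrator. All gains and numbers are made up.

```python
class PIController:
    """Proportional-integral controller that tracks a setpoint."""

    def __init__(self, kp, ki, dt):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.integral = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral

def high_level_policy(load):
    """Hypothetical stand-in for a learned policy: picks a target speed."""
    return 2.0 if load > 0.5 else 1.0

dt = 0.1
speed = 0.0
controller = PIController(kp=0.5, ki=0.1, dt=dt)
target = high_level_policy(load=0.8)       # slow, strategic decision

for _ in range(400):                       # fast, stable inner loop
    command = controller.step(target, speed)
    speed += command * dt                  # toy integrator plant

# speed has settled close to the chosen target of 2.0
```

The learned component only has to choose sensible targets, while stability of the inner loop can be analyzed with classical tools.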

31.4 Safe Rollback

Every deployment should support immediate fallback to a trusted baseline.

flowchart LR
	A[Offline training] --> B[Replay and simulator evaluation]
	B --> C[Shadow mode]
	C --> D[Limited traffic rollout]
	D --> E[Monitored production]
	E --> F{Anomaly detected?}
	F -->|Yes| G[Rollback to baseline]
	F -->|No| H[Expand rollout]
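The last decision in the flowchart can be written as a small gate. This sketch assumes rollout is measured as a fraction of traffic and expands by doubling; the function name and the step size are illustrative.

```python
def next_rollout_fraction(current_fraction, anomaly_detected,
                          step=2.0, max_fraction=1.0):
    """Roll back to the baseline on anomaly, otherwise expand the rollout."""
    if anomaly_detected:
        return 0.0  # all traffic returns to the trusted baseline
    return min(max_fraction, current_fraction * step)

expanded = next_rollout_fraction(0.05, anomaly_detected=False)    # 0.1
capped = next_rollout_fraction(0.8, anomaly_detected=False)       # 1.0
rolled_back = next_rollout_fraction(0.5, anomaly_detected=True)   # 0.0
```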

32. Tooling and Infrastructure Considerations

A serious RL stack often needs:

experiment tracking
reproducible configuration management
distributed rollout workers
replay storage
checkpointing
metrics dashboards
simulator versioning
model registry and deployment controls

Questions engineers should ask early:

Can we reproduce a result from two weeks ago?
Can we trace a deployed policy back to training data and config?
Can we replay bad episodes exactly?
Can we compare policy versions under the same evaluation set?
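One small building block for traceability is a stable fingerprint of the training configuration, stored alongside every checkpoint. A sketch, assuming the configuration is a JSON-serializable dictionary; the keys and values shown are made up.

```python
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Short stable hash of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

# Hypothetical configurations for illustration.
config_a = {"algo": "ppo", "lr": 0.0003, "seed": 1}
config_b = {"seed": 1, "lr": 0.0003, "algo": "ppo"}   # same content, new order
config_c = {"algo": "ppo", "lr": 0.0003, "seed": 2}   # different seed

same = config_fingerprint(config_a) == config_fingerprint(config_b)
different = config_fingerprint(config_a) != config_fingerprint(config_c)
```

Tagging logged episodes and deployed policies with the same fingerprint links them back to the run that produced them.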

33. How RL Connects to Classical Control and Optimization

RL is not a replacement for control theory, optimization, or systems engineering.

In fact, many good RL projects combine them.

Classical control gives stability, safety, and domain structure.
Optimization gives constraints and planning tools.
RL adds adaptability when hand-designed policies are too rigid or incomplete.

A strong engineer asks:

Should RL fully control the system, or should it tune a classical controller?

Often the second option is more robust.

34. Summary of Key Intuitions

RL is about maximizing long-term outcome, not immediate gain.
The central challenge is credit assignment through time.
State, action, and reward design usually matter as much as algorithm choice.
Exploration is necessary but often dangerous in real systems.
Deep RL is hard because data is sequential, targets move, and policies change the data distribution.
Production RL requires baselines, evaluation discipline, safety layers, and deployment controls.
Many practical wins come from combining RL with simulation, constraints, and domain knowledge.

35. Final Checklist for Engineering Use

Before committing to RL, confirm:

the task is genuinely sequential
future state depends on actions
a reward can be defined and audited
a baseline exists
a simulator or safe data source exists
evaluation metrics cover robustness and safety
rollout, rollback, and monitoring plans exist

If those are not true, the correct engineering decision may be to avoid RL.

36. What to Study Next

After mastering the basics in this handbook, the next useful topics are:

contextual bandits
deep Q-learning in detail
policy gradients and actor-critic math
PPO and SAC in practice
offline RL and counterfactual evaluation
constrained and safe RL
multi-agent RL
sim-to-real robotics workflows
RLHF and preference optimization for foundation models

The right next step depends on your domain. For robotics and hardware control, focus on safety, simulation, and continuous control. For product systems, focus on bandits, offline evaluation, and reward alignment. For research-heavy ML systems, focus on actor-critic methods, stability, and scaling.
