tarun-elango 62197e52c0 ml
Co-authored-by: Copilot <copilot@github.com>
2026-04-30 19:59:29 -04:00

Reinforcement Learning Basics

Learning through feedback.

1. Why Reinforcement Learning Matters

Reinforcement learning (RL) is the branch of machine learning concerned with making sequences of decisions under uncertainty. Instead of learning from a fixed dataset of correct answers, an RL system learns by interacting with an environment, observing the consequences of its actions, and adjusting behavior to maximize long-term reward.

That makes RL fundamentally different from the machine learning workflows most engineers see first:

  • In supervised learning, you are usually given input-output pairs and asked to imitate the right answer.
  • In unsupervised learning, you are asked to find structure in unlabeled data.
  • In reinforcement learning, there often is no direct correct action label. The system must discover good behavior from consequences.

This matters in real engineering because many important systems are not one-shot prediction problems. They are control problems.

Examples:

  • A robot arm does not make one prediction; it performs a sequence of motor commands.
  • A recommendation engine can optimize not just clicks today, but retention over weeks.
  • A data center controller can adjust power, cooling, and workload placement over time.
  • A chip power-management policy can trade energy against performance over many operating cycles.
  • A network congestion controller must continuously react to packet delay, loss, and changing traffic.

The central idea is simple:

A good action is not just one that looks good immediately. It is one that improves the total future outcome.

That single idea is what makes RL powerful and what makes it difficult.

2. When RL Is the Right Tool and When It Is Not

RL is attractive because it sounds general: an agent learns by trial and error. In practice, many teams misuse it.

RL is a strong fit when:

  • The problem is inherently sequential.
  • Actions change future states.
  • There is delayed feedback.
  • The environment is interactive or can be simulated.
  • It is hard to hand-code a strategy, but possible to define a performance signal.

RL is a poor fit when:

  • A simple rule-based controller already solves the problem reliably.
  • The action is effectively one-shot, so supervised learning is enough.
  • Exploration is unsafe or too expensive.
  • There is no good reward signal.
  • You cannot simulate the environment and cannot afford bad real-world behavior.

Practical decision rule:

If your problem can be solved as "predict, then act", start there. Use RL only when the real difficulty is acting over time, not prediction.

3. First Principles: What RL Is Actually Solving

At first principles, RL is about four facts:

  1. The system is embedded in a world that changes over time.
  2. The system can influence that world by choosing actions.
  3. The quality of actions is often only visible after multiple future steps.
  4. The system must improve behavior while still being uncertain about the world.

These lead to the core engineering challenges of RL.

3.1 Sequential Dependence

If you brake too late in autonomous control, that error affects the next state. If a recommender shows low-quality content now, user engagement may drop later. In RL, actions are coupled through time.

3.2 Delayed Consequences

A central difficulty in RL is that reward often arrives only after many actions, so it is unclear which of them deserves the credit or the blame. This is called the credit assignment problem.

Example:

  • A warehouse robot takes 30 small navigation decisions.
  • It collides only at the end.
  • Which earlier decisions caused the failure?

RL algorithms exist largely to solve this credit assignment problem efficiently.

3.3 Uncertainty

The agent usually does not know in advance how the environment will react. It must learn from samples.

3.4 Exploration

If the agent only repeats what currently seems best, it may miss better strategies. If it explores too much, performance suffers. This is the exploration-exploitation tradeoff.

4. Core Concepts and Intuition

Before math, get the mental model right.

4.1 Agent

The learner or decision-maker.

Examples:

  • A software process scheduling workloads
  • A robot controller
  • A bidding strategy in an ad platform
  • A cache tuning policy in a hardware system

4.2 Environment

Everything the agent interacts with. The environment receives the action, updates the world, and returns observations and rewards.

4.3 State

A state is the information needed to choose a good action. In theory, a state captures everything relevant about the current situation. In engineering practice, state design is a major source of success or failure.

Examples:

  • In a robot: position, velocity, sensor readings, battery level
  • In networking: RTT, packet loss, queue size, throughput history
  • In CPU power management: utilization, temperature, current frequency, power budget

If the state leaves out a critical variable, the agent may behave irrationally because it is effectively blind.

4.4 Observation vs State

In many real systems, the agent does not observe the true state directly. It only sees partial measurements. Cameras do not reveal everything. Network telemetry is noisy. Sensor streams can be delayed.

This is called partial observability.

Practical implication:

Many real-world RL problems are not clean fully observable state-control problems. They are closer to history-based control under uncertainty.

4.5 Action

The decision the agent makes.

  • Discrete action example: choose route A, B, or C
  • Continuous action example: steering angle, motor torque, voltage setting

4.6 Reward

A scalar feedback signal telling the agent how good the immediate outcome was.

This looks simple, but reward design is one of the most dangerous parts of RL. If you specify the wrong reward, the agent can optimize the wrong behavior very effectively.

4.7 Policy

A policy is the agent's behavior rule: given the current state or observation, what action should it take?

There are two broad types:

  • Deterministic policy: always picks the same action for the same state
  • Stochastic policy: outputs a distribution over actions

Stochastic policies are often useful when exploration matters or when the environment is uncertain.

4.8 Return

RL does not optimize only immediate reward. It optimizes return, usually the total discounted future reward.

If rewards are r_t, r_{t+1}, r_{t+2}, \dots, then the return from time t is:


G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots

where \gamma \in [0, 1] is the discount factor.

Intuition:

  • If \gamma is small, the agent is short-sighted.
  • If \gamma is close to 1, the agent values long-term outcomes more strongly.
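To make the effect of the discount factor concrete, here is a minimal sketch that computes the return for a finite list of rewards. The function name `discounted_return` and the reward values are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
    g = 0.0
    # Accumulate from the last reward backward, using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative episode: a small reward now, a large reward three steps later.
rewards = [1.0, 0.0, 0.0, 10.0]
print(discounted_return(rewards, gamma=0.5))   # short-sighted: the late reward barely counts
print(discounted_return(rewards, gamma=0.99))  # far-sighted: the late reward dominates
```

With the smaller discount factor, the delayed reward of 10 contributes only 1.25 to the return; with the larger one it contributes almost its full value.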

4.9 Episode

An episode is one rollout from start to termination.

Examples:

  • One game in Atari
  • One robot pick-and-place attempt
  • One session of user interaction

Some systems are episodic. Others are continuing tasks without a natural end.

5. The Standard RL Interaction Loop

flowchart LR
	A[Agent observes state or observation] --> B[Policy selects action]
	B --> C[Environment applies action]
	C --> D[Environment transitions to next state]
	D --> E[Environment emits reward]
	E --> F[Agent updates value estimates or policy]
	F --> A

This loop hides the true difficulty. Each stage has practical engineering choices:

  • What exactly is observed?
  • How often are actions applied?
  • How noisy is the reward?
  • Is the environment stationary?
  • How are updates computed?
  • How do we keep training stable?

6. Markov Decision Processes from First Principles

The classical mathematical model for RL is the Markov Decision Process (MDP).

An MDP contains:

  • A set of states S
  • A set of actions A
  • A transition function P(s' \mid s, a)
  • A reward function R(s, a, s')
  • A discount factor \gamma

6.1 Why the Markov Property Matters

The Markov property says:

The future depends on the present state and action, not on the full past history, if the state is defined properly.

This is not magic. It is a modeling assumption.

If the state representation is complete enough, the present summarizes the useful past.

Example:

  • If you control a drone and include only its current position, but not its velocity, the next position cannot be predicted from the state alone.
  • Add velocity, and the state becomes much closer to Markov.

This is a deep engineering lesson:

Poor state design turns an easy RL problem into a hard one.

6.2 State Transitions

stateDiagram-v2
	[*] --> Observe
	Observe --> Decide: choose action a
	Decide --> Transition: environment reacts
	Transition --> Reward: emit r
	Reward --> Observe: new state s'

6.3 Model-Based vs Model-Free View

  • Model-based RL tries to learn or use the transition dynamics and rewards explicitly.
  • Model-free RL learns good behavior or values directly from interaction without an explicit world model.

In practice:

  • Model-based methods can be more sample efficient when models are accurate.
  • Model-free methods are often simpler to implement but can require much more data.

7. Value Functions: Why They Exist

A core idea in RL is that instead of reasoning about all future consequences from scratch every time, the agent can estimate how good states or actions are.

7.1 State Value Function

The value of a state under a policy \pi is the expected return if you start there and follow that policy:


V^\pi(s) = \mathbb{E}[G_t \mid s_t = s]

This tells you: how promising is this state?

7.2 Action Value Function

The action value, or Q-value, measures the expected return if you take action a in state s and then continue according to the policy:


Q^\pi(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]

This tells you: how promising is this action in this state?

7.3 Why Value Functions Are Useful

Suppose you are controlling a warehouse robot at an intersection.

  • Going left gives a slightly slower immediate path.
  • Going right looks shorter, but often leads to congestion.

Immediate reward may favor right. Long-term return may favor left. A value function captures that future effect.

8. Bellman Equations: The Core Recursive Insight

Bellman equations are central because they express long-term value recursively.

8.1 The Main Intuition

The value of a state is:

  • the immediate reward you expect now
  • plus the value of where you expect to land next

That recursive structure lets RL break a long horizon problem into repeated local updates.

8.2 Bellman Expectation Equation

For a fixed policy:


V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a) \left[r + \gamma V^\pi(s')\right]

Interpretation:

  • average over possible actions chosen by the policy
  • average over possible next states and rewards
  • immediate reward plus discounted future value

8.3 Bellman Optimality Equation

For the optimal value function:


V^*(s) = \max_a \sum_{s', r} P(s', r \mid s, a) \left[r + \gamma V^*(s')\right]

and for the optimal action value:


Q^*(s, a) = \sum_{s', r} P(s', r \mid s, a) \left[r + \gamma \max_{a'} Q^*(s', a')\right]

This is the foundation behind dynamic programming, Q-learning, and many other algorithms.

8.4 Step-by-Step Bellman Backup Intuition

Suppose a robot in state s can choose two actions.

  1. Estimate the immediate reward for each action.
  2. Estimate where each action usually leads.
  3. Look up how valuable those next states are.
  4. Combine immediate reward and next-state value.
  5. Prefer the action with the larger total.

This repeated propagation of future value backward into present decisions is called a backup.
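The five steps above can be sketched for a single state, reusing the warehouse intersection as an example. The transition probabilities, rewards, and next-state values are made-up numbers, and the names `transitions`, `V`, and `backup` are hypothetical.

```python
GAMMA = 0.9

# Assumed model for one state: action -> list of (probability, reward, next_state).
transitions = {
    "left": [(1.0, -1.0, "aisle")],                     # slightly slower, but reliable
    "right": [(0.7, 0.0, "jam"), (0.3, 0.0, "aisle")],  # looks shorter, often congested
}

# Current value estimates for the possible next states (illustrative numbers).
V = {"aisle": 10.0, "jam": 2.0}


def backup(action):
    """Expected immediate reward plus discounted next-state value for one action."""
    return sum(p * (r + GAMMA * V[s_next]) for p, r, s_next in transitions[action])


q = {a: backup(a) for a in transitions}
best_action = max(q, key=q.get)
print(q, best_action)
```

Here "right" has the better immediate reward, but "left" wins once the value of the likely next states is included.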

9. Exploration vs Exploitation

This is the most famous RL tradeoff.

  • Exploitation means choosing what currently looks best.
  • Exploration means trying alternatives to learn whether something better exists.

9.1 Why It Is Hard

If a controller only exploits, it can get stuck with a mediocre strategy. If it explores aggressively in production, it may cause failures or incur real costs.

9.2 Common Exploration Strategies

Epsilon-Greedy

With probability \varepsilon, choose a random action. Otherwise choose the best-known action.

Good for simple discrete tasks, but crude.
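A minimal sketch of epsilon-greedy selection over a list of estimated action values follows. The function name and the `rng` parameter are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated value (first one wins on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps exploration reproducible during debugging.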

Softmax or Boltzmann Exploration

Actions with higher estimated value are more likely, but not guaranteed.
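As a sketch, Boltzmann exploration turns value estimates into probabilities with a softmax. The `temperature` parameter and the function names below are illustrative assumptions; lower temperatures make the choice greedier.

```python
import math
import random


def softmax_probs(q_values, temperature=1.0):
    """Convert value estimates into action probabilities."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, temperature=1.0, rng=random):
    """Sample an action in proportion to its softmax probability."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```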

Optimism in the Face of Uncertainty

Prefer less-visited actions because their value is uncertain.

Upper Confidence Bound Style Exploration

Common in bandits and some RL settings, balancing estimated value with uncertainty.
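One well-known instance is the UCB1 rule from the bandit literature, sketched below under the assumption that the agent tracks a visit count and a mean reward estimate per action. The function name and the constant `c` are illustrative.

```python
import math


def ucb1_action(counts, value_estimates, c=2.0):
    """Choose the action with the highest estimate plus uncertainty bonus."""
    # Try every action once before trusting any estimate.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    total = sum(counts)
    # The bonus shrinks as an action is tried more often.
    scores = [
        v + math.sqrt(c * math.log(total) / n)
        for v, n in zip(value_estimates, counts)
    ]
    return max(range(len(scores)), key=lambda a: scores[a])
```

A rarely tried action can be selected even when its current estimate is lower, because its uncertainty bonus is large.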

Entropy Regularization

Popular in policy-gradient methods. Encourages action diversity during training.

9.3 Production Reality

In many production systems, uncontrolled exploration is unacceptable.

Examples:

  • Robotics can damage hardware.
  • Ad bidding can waste money.
  • Power control can violate thermal limits.
  • Medical systems can cause harm.

So teams often use:

  • simulation-first training
  • offline policy evaluation
  • constrained exploration
  • shadow deployments
  • human approval gates

10. Major Algorithm Families

You do not need to memorize every algorithm. You need to understand the families and why they exist.

10.1 Dynamic Programming

Dynamic programming assumes you know the full environment model.

Examples:

  • policy evaluation
  • policy iteration
  • value iteration

Why it matters:

  • It gives conceptual foundations.
  • It is often not directly usable in messy real-world systems because the model is not known exactly.
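To show what "knowing the full model" buys you, here is a minimal value iteration sketch on a three-state toy MDP. The model, its rewards, and the names `model` and `value_iteration` are assumptions made for illustration.

```python
GAMMA = 0.9

# Assumed known model: model[state][action] = list of (probability, next_state, reward).
model = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
    2: {},  # terminal state: no actions, value stays 0
}


def value_iteration(model, gamma, tol=1e-8):
    """Apply Bellman optimality backups until the values stop changing."""
    V = {s: 0.0 for s in model}
    while True:
        delta = 0.0
        for s, actions in model.items():
            if not actions:
                continue
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


V = value_iteration(model, GAMMA)
print(V)
```

The reward of 1 at the end propagates backward: state 1 is worth 1.0 and state 0 is worth 0.9.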

10.2 Monte Carlo Methods

Monte Carlo methods wait until the end of an episode and use actual sampled returns to estimate values.

Strengths:

  • Conceptually simple
  • Uses real returns, not bootstrapped estimates

Weaknesses:

  • High variance
  • Needs episode completion
  • Slow credit assignment for long tasks
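A minimal sketch of every-visit Monte Carlo value estimation follows. It assumes each episode is a list of `(state, reward)` pairs, where the reward is the one received after leaving that state; this episode format and the function name are assumptions.

```python
from collections import defaultdict


def mc_value_estimate(episodes, gamma):
    """Average the sampled return observed after each visit to a state."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backward so each step's return is built from the one after it.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


episodes = [
    [("A", 0.0), ("B", 1.0)],  # this rollout ended in success
    [("A", 0.0), ("B", 0.0)],  # this one did not
]
values = mc_value_estimate(episodes, gamma=1.0)
```

Nothing is updated until an episode has finished, which is why the method needs episode completion.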

10.3 Temporal Difference Learning

Temporal Difference (TD) learning updates estimates toward a bootstrapped target. In its simplest form, TD(0), the target looks one step ahead.

Basic idea:


\text{new estimate} \leftarrow \text{old estimate} + \alpha (\text{target} - \text{old estimate})

In TD learning, the target uses the current estimate of the next state's value.

Why TD is powerful:

  • Learns online
  • Does not need episode termination
  • Often lower variance than Monte Carlo

Tradeoff:

  • Introduces bias through bootstrapping
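The one-step update can be sketched as a single function over a dictionary of state values. The name `td0_update` and the `done` flag handling are assumptions, not a fixed API.

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """Move V[s] a fraction alpha toward the bootstrapped one-step target."""
    # A terminal next state contributes no future value.
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V[s]


V = {"a": 0.0, "b": 1.0}
td0_update(V, "a", r=1.0, s_next="b", done=False, alpha=0.5, gamma=0.9)
```

Unlike Monte Carlo, the update happens right after one transition, using the current estimate `V[s_next]` in place of the real future return.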

10.4 SARSA

SARSA is an on-policy TD control method.

Update intuition:

It learns from the action actually taken next.

This tends to make it more conservative when exploration is present.
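As a sketch, the SARSA update differs from Q-learning only in the target: it uses the value of the action `a_next` that the behavior policy actually selected. The table layout (a dictionary keyed by `(state, action)`) and the function name are assumptions.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """On-policy TD control update using the next action actually taken."""
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)
Q[("s2", "right")] = 2.0
sarsa_update(Q, "s1", "left", r=1.0, s_next="s2", a_next="right",
             done=False, alpha=0.5, gamma=0.5)
```

If `a_next` was an exploratory action with a poor value, that poor value flows into the target, which is the source of the more cautious behavior.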

10.5 Q-Learning

Q-learning is an off-policy TD control method.

Its famous update is:


Q(s, a) \leftarrow Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)

Step-by-step meaning:

  1. Start with the old estimate for action a in state s.
  2. Observe reward r.
  3. Estimate the best future value at the next state s'.
  4. Form a target: immediate reward plus discounted best future estimate.
  5. Move the current estimate a bit toward that target.

Why it works conceptually:

It repeatedly enforces Bellman optimality through sampled experience.

Why it fails in practice for large spaces:

  • Tables do not scale to large or continuous states.
  • Sample inefficiency becomes severe.
  • Function approximation can make learning unstable.
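For a small discrete problem the table approach works well. The sketch below runs tabular Q-learning on a hypothetical five-cell corridor; the environment, the hyperparameters, and the names `step` and `train` are all illustrative assumptions.

```python
import random
from collections import defaultdict

N_STATES, ACTIONS = 5, (0, 1)  # toy corridor: action 0 = left, 1 = right


def step(state, action):
    """Toy dynamics: reaching the right end gives reward 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                # Greedy action, with ties broken at random.
                a = max(ACTIONS, key=lambda x: (Q[(s, x)], rng.random()))
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(Q[(s2, x)] for x in ACTIONS)
            # Q-learning update: move toward reward plus discounted best next value.
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q


Q = train()
greedy = [max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(N_STATES - 1)]
```

After training, the greedy action in every non-terminal cell should be "right". The same code stops being practical once the table no longer fits the state space.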

10.6 Deep Q-Networks (DQN)

DQN replaces the Q-table with a neural network.

This makes large state spaces tractable, but creates instability because:

  • consecutive samples are correlated
  • targets change while the network is learning

Two famous engineering fixes:

  • Experience replay: store transitions and train on randomized batches
  • Target network: use a slower-moving copy of the Q-network to stabilize targets

flowchart TD
	A[Environment transition] --> B[Store transition in replay buffer]
	B --> C[Sample random mini-batch]
	C --> D[Q-network predicts Q values]
	C --> E[Target network builds target values]
	D --> F[Compute TD loss]
	E --> F
	F --> G[Gradient update Q-network]
	G --> H[Periodically sync target network]
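The replay buffer in this diagram can be sketched with the standard library alone. The class below is a minimal uniform-sampling version; the class name, the `capacity` and `seed` parameters, and the transition tuple layout are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Sampling at random breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(i, 0, 0.0, i + 1, False)
batch = buf.sample(2)
```

Production implementations usually store transitions in preallocated arrays for speed, but the behavior is the same.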

10.7 Policy Gradient Methods

Instead of learning values and deriving actions indirectly, policy-gradient methods optimize the policy directly.

Why use them:

  • natural handling of stochastic policies
  • good for continuous action spaces
  • direct optimization of behavior

Main challenge:

  • gradients can have high variance
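A minimal sketch of the idea, assuming the simplest possible setting: a two-armed bandit with a softmax policy over action preferences, trained with a REINFORCE-style update and a running-average baseline. The arm probabilities, learning rate, and function names are illustrative.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(arm_means=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """Directly adjust policy parameters in the direction that raises expected reward."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_means)  # policy parameters
    baseline = 0.0                  # running average reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        a = rng.choices(range(len(prefs)), weights=probs, k=1)[0]
        reward = 1.0 if rng.random() < arm_means[a] else 0.0
        for i in range(len(prefs)):
            # Gradient of log pi(a) with respect to prefs[i] for a softmax policy.
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            prefs[i] += lr * (reward - baseline) * grad_log
        baseline += (reward - baseline) / t
    return softmax(prefs)


final_probs = reinforce_bandit()
```

No value table is consulted to pick actions; the policy itself is the thing being optimized, and the baseline only tames the noise in the gradient estimate.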

10.8 Actor-Critic Methods

Actor-critic combines two components:

  • Actor: chooses actions
  • Critic: estimates how good states or actions are

Why this architecture is common:

  • the actor improves the policy
  • the critic reduces variance by providing a learned baseline or value estimate

10.9 PPO and Why Engineers Use It So Often

Proximal Policy Optimization (PPO) became popular because it is relatively robust and easier to tune than many earlier policy-gradient approaches.

Why teams like it:

  • good practical stability
  • works across many environments
  • simpler to implement than trust-region methods such as TRPO, which rely on second-order approximations

Why it is not magic:

  • still sample hungry
  • still sensitive to reward design
  • can still overfit simulator quirks
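Much of PPO's stability comes from limiting how far one update can move the policy. In the widely used clipped variant, this is done per sample as sketched below; `ratio` is the new policy's probability of the action divided by the old policy's, and the function name and default `clip_eps` are assumptions.

```python
def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


print(ppo_clipped_term(1.5, advantage=1.0))   # gain is capped once the ratio passes 1.2
print(ppo_clipped_term(0.5, advantage=-1.0))  # penalty is held at the clipped value
```

A full implementation averages this term over a batch and adds value and entropy losses; only the clipping logic is shown here.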

10.10 Deterministic Policy Gradient and Continuous Control

For continuous control tasks like torque, steering, power allocation, or analog parameter tuning, deterministic or stochastic actor-critic methods are common.

Examples:

  • DDPG
  • TD3
  • SAC

Practical note:

Continuous control adds sensitivity to scaling, action clipping, and simulation accuracy.

11. Bandits vs Full Reinforcement Learning

Many engineers should use contextual bandits before RL.

Bandits:

  • choose an action
  • observe reward
  • no long state transition chain

RL:

  • actions change future states
  • delayed consequences matter

If your recommendation system only needs to choose the next item and future dynamics are weak, a contextual bandit may be a better engineering choice than full RL.

12. Reward Design: The Most Common Source of Failure

Reward design is not a small detail. It defines the optimization target.

12.1 Good Reward Properties

A reward should be:

  • aligned with the real business or system objective
  • measurable
  • not too sparse unless the algorithm can handle it
  • resistant to exploitation
  • stable enough that training can make progress

12.2 Reward Hacking

An RL agent will optimize exactly what you specify, not what you meant.

Examples:

  • A robot learns to spin in place because the reward counts movement-sensor ticks rather than progress toward the goal.
  • A recommender maximizes clicks by showing low-quality sensational content, hurting retention.
  • A thermal controller reduces measured temperature by throttling performance so aggressively that throughput collapses.

12.3 Sparse vs Dense Rewards

  • Sparse reward: reward only when the final goal is achieved
  • Dense reward: reward provides intermediate guidance

Sparse reward is often more faithful but much harder to learn from.

Dense reward is easier for learning but can distort behavior if shaping is poorly designed.

12.4 Step-by-Step Reward Design Process

  1. Write down the actual system objective in operational terms.
  2. List what signals are directly measurable online.
  3. Identify unintended shortcuts the agent might exploit.
  4. Add penalties or constraints for unacceptable behavior.
  5. Test reward behavior on edge cases before large-scale training.
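Steps 4 and 5 can be sketched for the thermal controller example: start from the real objective, add penalties for unacceptable behavior, then check edge cases before training. All thresholds, the penalty size, and the function name are hypothetical.

```python
def thermal_reward(throughput, temperature_c,
                   temp_limit_c=85.0, min_throughput=0.5, penalty=10.0):
    """Reward throughput, but penalize overheating and excessive throttling."""
    reward = throughput
    if temperature_c > temp_limit_c:
        reward -= penalty  # constraint: stay under the thermal limit
    if throughput < min_throughput:
        reward -= penalty  # close the shortcut of cooling down by doing no work
    return reward


# Edge cases worth checking before any large training run.
normal = thermal_reward(1.0, 70.0)
overheated = thermal_reward(1.0, 90.0)
over_throttled = thermal_reward(0.1, 60.0)
```

If the over-throttled case scored higher than the normal case, the agent would have a loophole to exploit.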

13. State and Action Space Design

RL performance often depends more on problem formulation than algorithm choice.

13.1 State Design Mistakes

Common mistakes:

  • omitting variables that affect dynamics
  • including huge numbers of irrelevant features
  • mixing signals with inconsistent time scales
  • ignoring latency and observation delay
  • feeding raw values with poor normalization

13.2 Action Design Mistakes

Common mistakes:

  • giving the agent more freedom than the system can safely support
  • using overly fine-grained continuous actions with noisy actuators
  • using unrealistic actions not available in deployment

Engineering rule:

Restrict the action space to what the real actuator, API, or controller can actually execute.

14. Model-Free vs Model-Based in Real Systems

14.1 Model-Free Advantages

  • simpler conceptually
  • easier to get started with, using logs plus live interaction
  • often fewer assumptions about system dynamics

14.2 Model-Free Disadvantages

  • typically data hungry
  • costly if environment interaction is expensive
  • weaker extrapolation under distribution shift

14.3 Model-Based Advantages

  • can improve sample efficiency
  • enables planning and imagination rollouts
  • useful when system dynamics are partially known from physics or engineering models

14.4 Model-Based Disadvantages

  • learned models can be wrong in dangerous ways
  • compounding model errors can break planning
  • engineering complexity is higher

14.5 Hardware-Connected Example

Consider RL for CPU frequency scaling.

  • A model-free agent may learn from measured utilization, latency, temperature, and power.
  • A model-based system may also use known thermal dynamics or power-performance models.

Tradeoff:

  • model-based methods may learn faster
  • but a bad model of thermal lag can produce unstable oscillations in deployment

15. Online RL, Offline RL, and Batch Learning

15.1 Online RL

The agent learns by interacting with the live environment.

Pros:

  • direct adaptation
  • no mismatch between static dataset and real behavior loop

Cons:

  • risky
  • costly
  • exploration may be unsafe

15.2 Offline RL

Offline RL learns from previously collected logs without live exploration during training.

Pros:

  • safer for high-risk domains
  • uses historical data
  • easier to audit initially

Cons:

  • limited by data coverage
  • agent may choose actions not well supported by the dataset
  • evaluation is hard

15.3 Practical Engineering Choice

Many production teams use a staged approach:

  1. start with logged data
  2. train offline
  3. evaluate carefully
  4. deploy in shadow mode
  5. allow narrow online adaptation later

16. Training Pipeline in Practice

An RL project is not just an algorithm. It is a pipeline.

flowchart LR
	A[Problem definition] --> B[State, action, reward design]
	B --> C[Simulator or environment integration]
	C --> D[Data collection rollouts]
	D --> E[Training jobs]
	E --> F[Evaluation and safety checks]
	F --> G[Shadow deployment]
	G --> H[Controlled production rollout]
	H --> I[Monitoring and retraining]

16.1 Environment Interface

At minimum, an environment usually needs:

  • reset() to start a new episode
  • step(action) to apply an action and receive next observation, reward, done flag, and metadata

This looks simple, but environment correctness is critical.

If step() has bugs, your agent will learn the wrong physics, wrong costs, or wrong timing.
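A minimal environment with this interface might look like the sketch below. The corridor task, the class name, and the `done_reason` key in the metadata are assumptions made for illustration.

```python
class CorridorEnv:
    """Toy environment: walk right along a corridor to reach the goal cell."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        # Clear all per-episode state so nothing leaks between episodes.
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        if action not in (0, 1):
            raise ValueError("action must be 0 (left) or 1 (right)")
        if action == 0:
            self.position = max(0, self.position - 1)
        else:
            self.position += 1
        self.steps += 1
        reached_goal = self.position == self.length - 1
        timed_out = self.steps >= self.max_steps
        reward = 1.0 if reached_goal else 0.0
        done = reached_goal or timed_out
        reason = "goal" if reached_goal else ("timeout" if timed_out else None)
        return self.position, reward, done, {"done_reason": reason}
```

Recording why an episode ended makes it much easier to spot bugs in the termination logic later.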

16.2 Simulator Quality

A simulator is often the most important component in applied RL.

Bad simulators cause:

  • unrealistic policies
  • exploitation of artifacts
  • failure during sim-to-real transfer

What to verify:

  • timing fidelity
  • sensor noise realism
  • latency realism
  • actuator saturation
  • failure modes
  • distribution of rare events

16.3 Experience Collection

Choices include:

  • single-threaded rollouts
  • vectorized environments
  • distributed actor workers

Tradeoff:

More parallel rollouts improve throughput, but actors can end up collecting data with stale policies if they fall behind the learner.

17. Example Implementation Skeleton

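for episode in range(num_episodes):
	obs = env.reset()  # start a new episode and get the first observation
	done = False

	while not done:
		action = policy(obs)  # may include exploration noise during training
		next_obs, reward, done, info = env.step(action)
		replay_buffer.add(obs, action, reward, next_obs, done)  # store the transition
		learner.update(replay_buffer)  # one learning step from sampled experience
		obs = next_obs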

This loop hides the real concerns:

  • action noise and exploration policy
  • reward normalization
  • target calculation
  • batching
  • device placement
  • checkpointing
  • replay sampling strategy
  • deterministic evaluation runs

18. Stability Problems and Why Deep RL Is Hard

Deep RL is not just supervised learning with rewards. It is harder for structural reasons.

18.1 Non-Stationary Targets

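Both the data distribution and the learning targets move as the policy changes.

In supervised learning, labels are usually fixed. In RL, the agent changes behavior, which changes visited states, which changes the training data. With bootstrapping, the targets themselves also shift whenever the value estimates are updated.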

18.2 Correlated Samples

Sequential experience samples are correlated. That violates the i.i.d. assumption under which stochastic gradient methods behave best.

18.3 Bootstrapping Error

If you estimate targets from other estimates, errors can feed into future errors.

18.4 Overestimation Bias

Using a max over noisy estimates can bias values upward. Double Q-learning style fixes aim to reduce this.
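The bias is easy to reproduce. In the sketch below every action has a true value of 0, yet the max over noisy estimates is clearly positive on average. Selecting the action with one set of estimates and evaluating it with an independent set, which is the idea behind double estimators, removes the bias. The function name and parameters are assumptions.

```python
import random


def overestimation_demo(n_actions=5, trials=20000, seed=0):
    """Compare max-of-noisy-estimates with a double-estimator alternative."""
    rng = random.Random(seed)
    single_total, double_total = 0.0, 0.0
    for _ in range(trials):
        # Two independent noisy estimates of the same true values (all zero).
        est_a = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        est_b = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
        single_total += max(est_a)                  # select and evaluate with est_a
        best = max(range(n_actions), key=lambda i: est_a[i])
        double_total += est_b[best]                 # select with est_a, evaluate with est_b
    return single_total / trials, double_total / trials


single_mean, double_mean = overestimation_demo()
print(single_mean, double_mean)
```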

18.5 Distribution Shift

Policies may fail badly on states underrepresented in training.

19. Evaluation: How to Know If the Agent Is Actually Good

RL evaluation is often weaker than teams think.

19.1 Training Reward Is Not Enough

A rising training reward does not guarantee production usefulness.

The agent may be:

  • overfitting the simulator
  • exploiting reward loopholes
  • succeeding only on easy scenarios
  • unstable under real latency or noise

19.2 What to Measure

Measure at least:

  • primary task success
  • safety violations
  • worst-case behavior
  • sample efficiency
  • robustness to perturbations
  • sensitivity to random seeds
  • inference latency
  • resource usage

19.3 Evaluation Regimes

  • deterministic evaluation runs
  • unseen scenario tests
  • adversarial stress tests
  • ablations
  • baseline comparisons against heuristic controllers

19.4 Baselines Matter

A common RL mistake is comparing against weak baselines.

Always compare against:

  • simple heuristics
  • rule-based controllers
  • classical optimization methods
  • supervised or bandit baselines where applicable

If RL cannot beat a clear heuristic, the project may not justify itself.

20. Sim-to-Real Transfer

Many exciting RL demos fail when leaving simulation.

20.1 Why Sim-to-Real Is Hard

Real systems differ from simulation in:

  • friction and wear
  • sensor bias
  • delays
  • packet loss
  • thermal inertia
  • manufacturing variation
  • actuator nonlinearities

20.2 Common Mitigations

  • domain randomization
  • system identification
  • safety envelopes and fallback controllers
  • gradual rollout on hardware
  • residual learning on top of trusted controllers

20.3 Software and Hardware Connection

A robot policy that is stable in a perfect simulator may oscillate on real hardware because motor drivers saturate, encoder readings lag, and battery voltage drops under load.

That is not just an algorithm issue. It is a software-hardware co-design issue.

21. Safety, Constraints, and Guardrails

In real systems, maximizing reward is not enough. You need constraints.

Examples:

  • do not exceed temperature limit
  • do not violate collision boundaries
  • do not overspend budget
  • do not exceed network loss threshold
  • do not trigger unstable oscillations in control loops

Practical safety patterns:

  • action clipping
  • rule-based safety layers
  • constrained RL formulations
  • fallback controller takeover
  • kill switches
  • anomaly detection around policy outputs

flowchart TD
	A[Policy proposes action] --> B{Safety validator}
	B -->|Safe| C[Execute action]
	B -->|Unsafe| D[Fallback controller or clipped action]
	C --> E[Observe result and log]
	D --> E
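The validator pattern in the diagram can be sketched as a thin wrapper around the policy output. The function name and the `is_safe` and `fallback` callables are hypothetical placeholders for whatever checks and trusted controller a real system has.

```python
def safe_action(proposed, low, high, is_safe, fallback):
    """Clip the proposed action, then defer to a fallback if it is still unsafe."""
    clipped = max(low, min(proposed, high))
    if is_safe(clipped):
        return clipped, "policy"
    return fallback(), "fallback"


# Illustrative use: actions above 2.0 are considered unsafe in the current state.
action, source = safe_action(
    proposed=5.0, low=0.0, high=3.0,
    is_safe=lambda a: a <= 2.0,
    fallback=lambda: 1.0,
)
```

Returning the source of the executed action alongside the action itself makes it possible to log how often the fallback takes over.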

22. Real Industry Use Cases

22.1 Robotics and Industrial Automation

Use cases:

  • grasping
  • locomotion
  • path planning assistance
  • manipulation under uncertainty

Challenges:

  • expensive exploration
  • sim-to-real gap
  • safety-critical failures

22.2 Recommendation and Personalization

Use cases:

  • long-term engagement optimization
  • notification timing
  • multi-step user interaction policies

Challenges:

  • delayed rewards
  • confounding from user behavior
  • exploration ethics and product risk

22.3 Data Center and Cloud Control

Use cases:

  • cooling optimization
  • workload placement
  • power-performance tuning

Challenges:

  • slow dynamics
  • multi-objective tradeoffs
  • partial observability

22.4 Networking and Congestion Control

Use cases:

  • adaptive congestion control
  • routing decisions
  • queue management

Challenges:

  • non-stationary traffic
  • noisy measurements
  • unfairness risks

22.5 Hardware and Computer Engineering Scenarios

Use cases:

  • dynamic voltage and frequency scaling (DVFS)
  • cache prefetching and memory policy tuning
  • NoC routing heuristics
  • compiler optimization ordering
  • chip floorplanning assistance

Why RL appears here:

  • many control knobs
  • long-term tradeoffs between power, latency, throughput, and thermals
  • hard-to-model interactions between hardware layers and workloads

22.6 Operations Research and Scheduling

Use cases:

  • job-shop scheduling
  • warehouse dispatch
  • fleet management

Challenges:

  • large combinatorial action spaces
  • sparse rewards
  • hard constraints

23. Common Failure Modes

23.1 Reward Misalignment

The policy gets better at the specified metric while the real system gets worse.

23.2 State Aliasing

Different real situations look identical to the agent because the observation is missing critical variables.

23.3 Unstable Training

Loss spikes, Q-values diverge, policy collapses, or performance varies wildly across seeds.

23.4 Simulator Exploitation

The policy finds loopholes in simulation that do not exist in reality.

23.5 Offline-to-Online Collapse

A policy trained on logs chooses actions outside the data support and fails after deployment.

23.6 Over-Optimization of Proxy Metrics

A system maximizes short-term measurable metrics while harming long-term objectives.

24. Debugging Reinforcement Learning Systems

RL debugging must be systematic. Random tuning is usually wasted effort.

flowchart TD
	A[Policy performs poorly] --> B{Environment correct?}
	B -->|No| C[Fix transition, reward, reset, or done logic]
	B -->|Yes| D{Reward aligned?}
	D -->|No| E[Redesign reward and constraints]
	D -->|Yes| F{State sufficient?}
	F -->|No| G[Add missing signals, history, normalization]
	F -->|Yes| H{Baseline beats agent?}
	H -->|Yes| I[Re-evaluate algorithm choice and hyperparameters]
	H -->|No| J{Generalizes to unseen cases?}
	J -->|No| K[Expand evaluation distribution and regularize]
	J -->|Yes| L[Investigate deployment latency, scaling, and safety layers]

24.1 Debugging Order That Saves Time

  1. Verify environment correctness.
  2. Verify reward correctness.
  3. Verify state and action definitions.
  4. Beat a trivial baseline on a tiny version of the problem.
  5. Check reproducibility across random seeds.
  6. Only then tune larger models and advanced algorithms.

24.2 Practical Checks

  • manually inspect several episodes step by step
  • print or plot rewards, dones, and critical state variables
  • confirm resets do not leak hidden state
  • confirm action bounds match deployment bounds
  • compare policy outputs against a hand-written controller
  • run ablations removing one feature at a time

24.3 If Training Is Unstable

Check:

  • learning rate too high
  • reward scale too large
  • missing normalization
  • target updates too aggressive
  • replay buffer too small or too biased
  • actor-critic update imbalance
  • insufficient entropy or too much entropy

25. Best Practices for Engineers

25.1 Start with the Simplest Viable Environment

Reduce the problem first. If you cannot learn in a toy version, the full version will not work.

25.2 Build Strong Baselines

Before deep RL, implement:

  • random policy
  • heuristic policy
  • supervised or bandit baseline if applicable
  • classical controller if available

25.3 Make the Environment Observable

Log everything needed to reconstruct behavior:

  • observations
  • actions
  • rewards
  • next observations
  • done reasons
  • policy version
  • simulator version
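
One possible shape for such a log record is sketched below, using only the standard library. The field names mirror the list above; `TransitionRecord`, `to_log_line`, and the example version strings are assumptions made for illustration, not an established schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class TransitionRecord:
    """Hypothetical per-step log record; one JSON line per transition."""
    observation: List[float]
    action: int
    reward: float
    next_observation: List[float]
    done_reason: Optional[str]  # e.g. "timeout" or "goal"; None while the episode continues
    policy_version: str
    simulator_version: str


def to_log_line(record):
    # sort_keys keeps the field order stable, which makes log diffs readable.
    return json.dumps(asdict(record), sort_keys=True)
```

Writing one JSON object per line keeps the log easy to filter and replay with ordinary text tools.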

25.4 Control Randomness

Use fixed seeds for debugging, then multiple seeds for serious evaluation.
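
A minimal sketch of both habits follows. `run_trial` is a hypothetical stand-in for "train and evaluate once"; here it only draws synthetic returns so the example stays self-contained. The point is the pattern: a local seeded generator for reproducibility, and mean plus spread across seeds for reporting.

```python
import random
import statistics


def run_trial(seed, episodes=200):
    """Hypothetical stand-in for one train-and-evaluate run; returns a mean return."""
    rng = random.Random(seed)  # local generator, so no hidden global random state
    returns = [rng.gauss(1.0, 0.5) for _ in range(episodes)]
    return statistics.mean(returns)


def evaluate_across_seeds(seeds):
    # Report the spread as well as the mean; a single seed hides variance.
    scores = [run_trial(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)
```

Calling `run_trial(7)` twice should give identical results, which is the property that makes a failing run debuggable.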

25.5 Normalize Inputs and Sometimes Rewards

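Observations with very different scales, or rewards with very large magnitudes, lead to badly conditioned updates and can destabilize training.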
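
A common way to normalize inputs online is to track a running mean and variance. The sketch below uses Welford's update for a single scalar feature; the class name `RunningNormalizer` and the `eps` default are assumptions for illustration, and a real system would typically keep one such tracker per observation dimension.

```python
import math


class RunningNormalizer:
    """Running mean/variance tracker for one scalar feature (Welford's method)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Population variance; fall back to 1.0 until there are two samples.
        var = self.m2 / self.count if self.count > 1 else 1.0
        return (x - self.mean) / math.sqrt(var + self.eps)
```

The statistics should be frozen or saved with the policy at deployment, otherwise the policy sees differently scaled inputs than it was trained on.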

25.6 Separate Training and Evaluation

Do not judge a policy while exploration noise is still active unless that is the real deployment behavior.
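
For an epsilon-greedy agent, this separation can be as small as one flag. The sketch below assumes a list of Q-values per action; `select_action` and its `explore` parameter are hypothetical names chosen for illustration.

```python
import random


def select_action(q_values, epsilon, rng, explore):
    """Epsilon-greedy during training, purely greedy during evaluation."""
    if explore and rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: uniform random action
    # Greedy action: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `explore=False` the choice depends only on `q_values`, so evaluation scores reflect the learned policy rather than the exploration schedule.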

25.7 Protect Production with Guardrails

Never assume the learned policy alone is a sufficient safety mechanism.
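
One simple form of guardrail is a deterministic filter between the policy and the actuator. The sketch below clamps the proposed action to fixed bounds and limits how far it may move per step; `guard_action` and its parameters are hypothetical, and it assumes `previous` is already within bounds.

```python
def guard_action(proposed, previous, low, high, max_delta):
    """Clamp a scalar action to [low, high] and rate-limit the change per step."""
    clipped = min(max(proposed, low), high)
    step = min(max(clipped - previous, -max_delta), max_delta)
    return previous + step
```

Because the filter is independent of the learned policy, it can be tested and reviewed like any other safety-relevant code.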

26. Tradeoffs and Design Decisions

26.1 Discrete vs Continuous Actions

  • Discrete is easier algorithmically.
  • Continuous is often more realistic for control.
  • Discretizing continuous control can simplify training but may reduce policy quality.
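
The quality loss from discretization can be seen directly as quantization error. The sketch below builds an evenly spaced action grid and snaps a desired continuous action to the nearest grid point; `make_action_grid` and `nearest_action` are hypothetical helper names.

```python
def make_action_grid(low, high, n_bins):
    """Evenly spaced discrete actions covering [low, high]; requires n_bins >= 2."""
    step = (high - low) / (n_bins - 1)
    return [low + i * step for i in range(n_bins)]


def nearest_action(grid, target):
    # The discrete agent can only pick grid points, so the best it can do
    # for a desired continuous action is the closest one.
    return min(grid, key=lambda a: abs(a - target))
```

With five bins on `[-1, 1]` the grid spacing is 0.5, so the worst-case error is 0.25. More bins shrink the error but enlarge the action space the agent must explore.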

26.2 Dense Reward vs Sparse Reward

  • Dense reward usually speeds up learning because feedback arrives at every step.
  • Sparse reward may preserve objective fidelity.
  • Reward shaping helps, but can introduce bias and shortcuts.

26.3 On-Policy vs Off-Policy

  • On-policy methods are often simpler to reason about and more stable, but they discard data once the policy changes.
  • Off-policy methods can reuse data and be more sample efficient, but stability becomes trickier.

26.4 Simulation Fidelity vs Simulation Speed

  • High fidelity narrows the gap between simulated and real behavior.
  • High speed improves experimentation.
  • Teams usually need a layered stack: fast simulator for iteration, high-fidelity simulator for validation.

26.5 Single Objective vs Multi-Objective Control

Real systems often optimize multiple goals:

  • latency
  • power
  • cost
  • reliability
  • fairness

Collapsing them into one reward scalar is convenient but dangerous because it hides tradeoffs.
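
A small numeric sketch of how a scalar reward hides tradeoffs: with the illustrative weights below, a fast, power-hungry operating point and a slow, frugal one receive exactly the same reward. The function names, weights, and operating points are made up for illustration.

```python
def scalar_reward(latency_ms, power_w, w_latency=1.0, w_power=2.0):
    """Weighted-sum scalarization of two costs (hypothetical weights)."""
    return -(w_latency * latency_ms + w_power * power_w)


def reward_breakdown(latency_ms, power_w, w_latency=1.0, w_power=2.0):
    # Logging each term separately keeps the tradeoff visible.
    return {"latency": -w_latency * latency_ms, "power": -w_power * power_w}


fast_hot = scalar_reward(latency_ms=10.0, power_w=30.0)   # -(10 + 60) = -70.0
slow_cool = scalar_reward(latency_ms=50.0, power_w=10.0)  # -(50 + 20) = -70.0
```

The agent has no reason to prefer either point, even though an operator almost certainly would. Logging the per-term breakdown alongside the scalar is a cheap mitigation.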

27. Interview-Level Understanding

An engineer should be able to explain these clearly.

27.1 What Makes RL Harder Than Supervised Learning?

  • no direct labels for correct actions
  • delayed rewards
  • exploration required
  • non-stationary data distribution
  • agent actions affect future data

27.2 What Is the Difference Between Monte Carlo and TD?

  • Monte Carlo uses full sampled returns after episode completion.
  • TD bootstraps using current estimates of future value.
  • Monte Carlo has higher variance; TD introduces bias from its own value estimates but is often more practical.
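
The two targets can be written side by side. This is a minimal sketch with a discount factor `gamma`; the function names are chosen for illustration.

```python
def monte_carlo_return(rewards, gamma):
    """Discounted return from a complete episode, computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, next_value, gamma, done):
    """One-step TD target: bootstraps from the current estimate of the next state's value."""
    return reward + (0.0 if done else gamma * next_value)


print(monte_carlo_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
print(td_target(1.0, next_value=4.0, gamma=0.5, done=False))  # 1 + 0.5*4 = 3.0
```

The Monte Carlo version needs the whole reward sequence, so it can only run after the episode ends. The TD version needs one transition plus a value estimate, and any error in that estimate flows into the target, which is the source of the bias.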

27.3 Why Is Experience Replay Useful?

  • breaks temporal correlations
  • improves data reuse
  • stabilizes deep Q-learning training
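
A minimal uniform replay buffer, sketched with the standard library only. The class name, the fixed default seed, and the transition tuple layout are assumptions for illustration; production buffers usually store arrays and may use prioritized sampling.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        # transition: (obs, action, reward, next_obs, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes transitions from
        # different time steps, which breaks temporal correlation.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each stored transition can be sampled many times before eviction, which is where the data reuse comes from.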

27.4 Why Is Reward Design So Important?

Because the agent optimizes the reward exactly as specified, loopholes included, rather than the objective you intended. A badly defined reward produces systematically bad behavior.

27.5 What Is the Difference Between On-Policy and Off-Policy?

  • On-policy learns from data generated by the current policy.
  • Off-policy can learn from data generated by a different policy.

This matters for data reuse, stability, and deployment strategy.

28. A Practical Worked Example: Thermal-Aware CPU Frequency Control

This example connects RL to computer engineering.

28.1 Problem

Choose CPU frequency over time to balance:

  • application latency
  • throughput
  • power consumption
  • thermal limits

28.2 State Candidates

  • current frequency
  • CPU utilization
  • queue depth
  • recent latency percentiles
  • package temperature
  • recent power draw
  • workload type indicator

28.3 Actions

  • discrete frequency steps
  • optional turbo enable or disable

28.4 Reward Example

One crude reward might be:


\text{reward} = -w_1 \cdot \text{latency} - w_2 \cdot \text{power} - w_3 \cdot \max(0, \text{temp} - T_{\max})
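
The same reward, written as a function. The weights, units, and the 90 °C limit below are placeholder assumptions chosen only to make the example concrete; the handbook does not specify them.

```python
def thermal_reward(latency_ms, power_w, temp_c,
                   t_max_c=90.0, w1=1.0, w2=0.1, w3=5.0):
    """Crude reward: penalize latency, power, and temperature above the limit."""
    overheat = max(0.0, temp_c - t_max_c)  # zero while below the thermal limit
    return -w1 * latency_ms - w2 * power_w - w3 * overheat
```

Note that the thermal term contributes nothing until the limit is exceeded, so the agent gets no warning signal as it approaches `t_max_c`.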

28.5 Risks

  • the agent may oscillate frequency rapidly
  • sensor lag may create delayed feedback loops
  • reward may encourage aggressive throttling that hurts throughput

28.6 Practical Safeguards

  • minimum dwell time before changing frequency again
  • thermal emergency override
  • action smoothing
  • evaluation on bursty and sustained workloads
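
Two of these safeguards, minimum dwell time and the thermal emergency override, can be sketched as a small wrapper around whatever the policy proposes. `FrequencyGuard`, its parameters, and the frequency values in the usage are hypothetical and not tied to any real power-management API.

```python
class FrequencyGuard:
    """Filters proposed frequencies: enforces dwell time and a thermal override."""

    def __init__(self, min_dwell_steps, t_emergency_c, safe_freq_mhz, initial_freq_mhz):
        self.min_dwell_steps = min_dwell_steps
        self.t_emergency_c = t_emergency_c
        self.safe_freq_mhz = safe_freq_mhz
        self.current = initial_freq_mhz
        self.steps_since_change = min_dwell_steps  # allow an immediate first change

    def apply(self, proposed_freq_mhz, temp_c):
        self.steps_since_change += 1
        if temp_c >= self.t_emergency_c:
            target = self.safe_freq_mhz  # override ignores the policy and the dwell time
        elif self.steps_since_change >= self.min_dwell_steps:
            target = proposed_freq_mhz
        else:
            target = self.current  # too soon since the last change: hold
        if target != self.current:
            self.current = target
            self.steps_since_change = 0
        return self.current
```

With `min_dwell_steps=3`, a policy that proposes a new frequency every step only gets a change through every third step, while an over-temperature reading forces the safe frequency immediately.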

This example shows why RL in engineering is never just about the algorithm. It is about control stability, sensing, actuation, and constraints.

29. Step-by-Step Mental Model for Solving an RL Problem

When you face a new RL problem, work through it in this order.

  1. What is the real objective over time?
  2. What decisions actually influence future outcomes?
  3. What information is available at decision time?
  4. What actions are truly possible in production?
  5. What makes exploration risky or expensive?
  6. Can the environment be simulated accurately enough?
  7. What simple baseline should already work?
  8. What metrics prove the policy is useful and safe?

If you cannot answer these clearly, the problem is not ready for RL.

30. Common Mistakes Engineers Make

  • choosing RL for a problem that is really supervised learning or optimization
  • defining rewards that are easy to exploit
  • skipping strong heuristic baselines
  • trusting simulator performance too early
  • using huge models before validating environment correctness
  • ignoring latency, actuator limits, or safety constraints
  • evaluating on only one seed or one narrow scenario set
  • forgetting that deployment distribution changes over time

31. Production Deployment Patterns

31.1 Shadow Mode

Run the policy without controlling the system. Compare proposed actions against the live controller.
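
A shadow-mode comparison can be as simple as the sketch below, which assumes scalar actions. The two controllers are toy stand-ins (a threshold rule for the live controller and a lower threshold for the candidate policy), and all names are hypothetical.

```python
def shadow_report(observations, live_controller, shadow_policy, tolerance=0.0):
    """Compare proposed actions with live actions; the proposals are never applied."""
    disagreements = []
    for obs in observations:
        live = live_controller(obs)
        proposed = shadow_policy(obs)  # logged only
        if abs(live - proposed) > tolerance:
            disagreements.append((obs, live, proposed))
    agreement = 1.0 - len(disagreements) / len(observations)
    return agreement, disagreements


def live_rule(utilization):
    return 1 if utilization > 0.8 else 0  # toy live controller


def candidate_policy(utilization):
    return 1 if utilization > 0.4 else 0  # toy stand-in for a learned policy
```

The disagreement list is usually more valuable than the agreement rate: it shows exactly which situations the new policy would handle differently.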

31.2 Human-in-the-Loop Approval

Useful for high-cost decisions where operators can review policy suggestions.

31.3 Hierarchical Control

Let RL handle high-level strategy while a classical controller handles low-level stable execution.

Example:

  • RL chooses target speed or energy budget
  • PID or MPC handles actuator-level control
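
A toy version of this split is sketched below. A stand-in high-level policy picks a target speed, and a PI controller (a PID without the derivative term) tracks it on a simple integrator plant. The gains, the plant model, and `high_level_policy` are assumptions chosen for illustration only.

```python
class PIController:
    """Proportional-integral controller for low-level tracking."""

    def __init__(self, kp, ki):
        self.kp = kp
        self.ki = ki
        self.integral = 0.0

    def control(self, target, measured, dt):
        error = target - measured
        self.integral += error * dt
        return self.kp * error + self.ki * self.integral


def high_level_policy(load):
    """Hypothetical stand-in for a learned policy: chooses a target speed."""
    return 2.0 if load > 0.5 else 1.0


def simulate(load, steps=200, dt=0.1):
    controller = PIController(kp=1.0, ki=0.5)
    target = high_level_policy(load)  # the RL layer decides what to aim for
    speed = 0.0
    for _ in range(steps):
        # Integrator plant: the control signal is the rate of change of speed.
        speed += controller.control(target, speed, dt) * dt
    return target, speed
```

The learned component only chooses between a small set of targets, so a bad decision is bounded by what the low-level loop will accept.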

31.4 Safe Rollback

Every deployment should support immediate fallback to a trusted baseline.

flowchart LR
	A[Offline training] --> B[Replay and simulator evaluation]
	B --> C[Shadow mode]
	C --> D[Limited traffic rollout]
	D --> E[Monitored production]
	E --> F{Anomaly detected?}
	F -->|Yes| G[Rollback to baseline]
	F -->|No| H[Expand rollout]

32. Tooling and Infrastructure Considerations

A serious RL stack often needs:

  • experiment tracking
  • reproducible configuration management
  • distributed rollout workers
  • replay storage
  • checkpointing
  • metrics dashboards
  • simulator versioning
  • model registry and deployment controls

Questions engineers should ask early:

  • Can we reproduce a result from two weeks ago?
  • Can we trace a deployed policy back to training data and config?
  • Can we replay bad episodes exactly?
  • Can we compare policy versions under the same evaluation set?
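
One lightweight aid for traceability is to stamp every checkpoint and log with a deterministic fingerprint of the training configuration. The sketch below assumes the config is a JSON-serializable dict; `config_fingerprint` and the 12-character truncation are illustrative choices.

```python
import hashlib
import json


def config_fingerprint(config):
    """Short, order-independent hash of a JSON-serializable config dict."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Because keys are sorted before hashing, two configs with the same contents always map to the same fingerprint, while any changed value produces a different one.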

33. How RL Connects to Classical Control and Optimization

RL is not a replacement for control theory, optimization, or systems engineering.

In fact, many good RL projects combine them.

  • Classical control gives stability, safety, and domain structure.
  • Optimization gives constraints and planning tools.
  • RL adds adaptability when hand-designed policies are too rigid or incomplete.

A strong engineer asks:

Should RL fully control the system, or should it tune a classical controller?

Often the second option is more robust.

34. Summary of Key Intuitions

  • RL is about maximizing long-term outcome, not immediate gain.
  • The central challenge is credit assignment through time.
  • State, action, and reward design usually matter as much as algorithm choice.
  • Exploration is necessary but often dangerous in real systems.
  • Deep RL is hard because data is sequential, targets move, and policies change the data distribution.
  • Production RL requires baselines, evaluation discipline, safety layers, and deployment controls.
  • Many practical wins come from combining RL with simulation, constraints, and domain knowledge.

35. Final Checklist for Engineering Use

Before committing to RL, confirm:

  • the task is genuinely sequential
  • future state depends on actions
  • a reward can be defined and audited
  • a baseline exists
  • a simulator or safe data source exists
  • evaluation metrics cover robustness and safety
  • rollout, rollback, and monitoring plans exist

If those are not true, the correct engineering decision may be to avoid RL.

36. What to Study Next

After mastering the basics in this handbook, the next useful topics are:

  • contextual bandits
  • deep Q-learning in detail
  • policy gradients and actor-critic math
  • PPO and SAC in practice
  • offline RL and counterfactual evaluation
  • constrained and safe RL
  • multi-agent RL
  • sim-to-real robotics workflows
  • RLHF and preference optimization for foundation models

The right next step depends on your domain. For robotics and hardware control, focus on safety, simulation, and continuous control. For product systems, focus on bandits, offline evaluation, and reward alignment. For research-heavy ML systems, focus on actor-critic methods, stability, and scaling.