Key Concepts in Reinforcement Learning¶
This page introduces the foundational concepts of reinforcement learning. We cover the mathematical framework (MDPs), the core objects (policies, value functions), the key equations (Bellman equations), and the fundamental challenges (exploration vs. exploitation).
The Agent-Environment Interface¶
At its core, RL is about an agent interacting with an environment over discrete time steps. At each step \(t\):
- The agent observes a state \(s_t \in \mathcal{S}\)
- The agent selects an action \(a_t \in \mathcal{A}\) according to its policy
- The environment transitions to a new state \(s_{t+1}\) and emits a reward \(r_t \in \mathbb{R}\)
The agent's goal is to learn a behavior (policy) that maximizes the cumulative reward it receives over time.
Markov Decision Process (MDP)¶
The standard mathematical formulation for RL problems is the Markov Decision Process, defined by the tuple \((\mathcal{S}, \mathcal{A}, P, R, \gamma)\):
| Symbol | Meaning |
|---|---|
| \(\mathcal{S}\) | State space |
| \(\mathcal{A}\) | Action space |
| \(P(s' \mid s, a)\) | Transition dynamics — probability of next state \(s'\) given current state \(s\) and action \(a\) |
| \(R(s, a, s')\) | Reward function |
| \(\gamma \in [0, 1)\) | Discount factor |
The Markov property states that the future is conditionally independent of the past given the present:
When Markov Doesn't Hold
In practice, agents often observe partial state information (e.g., a robot's camera image doesn't reveal hidden object positions). This leads to Partially Observable MDPs (POMDPs), where the agent receives observations \(o_t\) instead of true states \(s_t\). Many practical RL systems handle partial observability by using observation histories or recurrent policies.
Return and Discount Factor¶
The return \(G_t\) is the total discounted reward from time step \(t\):
The discount factor \(\gamma\) controls the trade-off between immediate and future rewards:
- \(\gamma \to 0\): myopic agent, only cares about immediate reward
- \(\gamma \to 1\): far-sighted agent, values future reward almost as much as immediate
Why Discount?
Discounting serves multiple purposes: (1) it ensures the return is finite for infinite-horizon problems, (2) it models preference for sooner rewards, and (3) it reduces variance in return estimates.
Policy¶
A policy \(\pi\) defines the agent's behavior — how it selects actions given states.
Stochastic policy: \(\pi(a \mid s) = P(a_t = a \mid s_t = s)\)
Deterministic policy: \(a = \mu(s)\)
The goal of RL is to find an optimal policy \(\pi^*\) that maximizes the expected return:
where \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) denotes a trajectory sampled under policy \(\pi\).
Value Functions¶
Value functions estimate "how good" it is for the agent to be in a given state (or to take a given action in a state) under a policy \(\pi\).
State-Value Function¶
Action-Value Function (Q-Function)¶
The relationship between \(V\) and \(Q\):
Advantage Function¶
The advantage function measures how much better an action is compared to the average:
The advantage function is central to many policy gradient algorithms (PPO, TRPO, GAE).
Bellman Equations¶
The Bellman equations express a recursive relationship between value functions.
Bellman Expectation Equation¶
For \(V^\pi\):
For \(Q^\pi\):
Bellman Optimality Equation¶
For the optimal value functions \(V^*\) and \(Q^*\):
Why Bellman Equations Matter
Almost all RL algorithms are, at their core, methods for approximately solving the Bellman equations. Value-based methods directly approximate \(V^*\) or \(Q^*\). Policy-based methods optimize \(\pi\) to implicitly satisfy optimality. Understanding Bellman equations is essential for understanding RL.
Exploration vs. Exploitation¶
A fundamental challenge in RL is the exploration-exploitation trade-off:
- Exploitation: Choose actions that maximize expected reward based on current knowledge
- Exploration: Try new actions to discover potentially better strategies
Common exploration strategies:
| Strategy | Description |
|---|---|
| \(\varepsilon\)-greedy | With probability \(\varepsilon\), take a random action; otherwise, take the greedy action |
| Boltzmann (softmax) | Sample actions proportional to \(\exp(Q(s,a)/\tau)\) where \(\tau\) is temperature |
| UCB | Choose actions that maximize \(Q(s,a) + c\sqrt{\ln(t)/N(s,a)}\) |
| Entropy regularization | Add entropy bonus to the objective: $\mathcal{H}(\pi(\cdot |
| Intrinsic motivation | Reward curiosity — visiting novel states (ICM, RND, etc.) |
Temporal Difference Learning¶
Temporal Difference (TD) learning is a central idea in RL that combines Monte Carlo sampling with bootstrapping from value estimates.
The simplest TD update (TD(0)) for estimating \(V^\pi\):
where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error.
TD learning has several advantages over Monte Carlo methods:
- Lower variance (bootstraps from value estimates instead of full returns)
- Online learning (updates after each step, not after complete episodes)
- Works for continuing tasks (doesn't require episodes to terminate)
TD vs Monte Carlo vs Dynamic Programming
- Dynamic Programming: Requires full model \(P(s'|s,a)\), bootstraps, no sampling
- Monte Carlo: Model-free, no bootstrapping, requires complete episodes
- TD Learning: Model-free, bootstraps, online — combines advantages of both
What's Next¶
With these core concepts in place, proceed to:
- Algorithm Taxonomy — How RL algorithms are organized
- Intro to Policy Optimization — The policy gradient theorem