Key Concepts in Reinforcement Learning¶

This page introduces the foundational concepts of reinforcement learning. We cover the mathematical framework (MDPs), the core objects (policies, value functions), the key equations (Bellman equations), and the fundamental challenges (exploration vs. exploitation).

The Agent-Environment Interface¶

At its core, RL is about an agent interacting with an environment over discrete time steps. At each step $t$:

The agent observes a state $s_t \in \mathcal{S}$
The agent selects an action $a_t \in \mathcal{A}$ according to its policy
The environment transitions to a new state $s_{t+1}$ and emits a reward $r_t \in \mathbb{R}$

The agent's goal is to learn a behavior (policy) that maximizes the cumulative reward it receives over time.

Markov Decision Process (MDP)¶

The standard mathematical formulation for RL problems is the Markov Decision Process, defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

Symbol	Meaning
$\mathcal{S}$	State space
$\mathcal{A}$	Action space
$P(s' \mid s, a)$	Transition dynamics — probability of next state $s'$ given current state $s$ and action $a$
$R(s, a, s')$	Reward function
$\gamma \in [0, 1)$	Discount factor

The Markov property states that the future is conditionally independent of the past given the present:

\[ P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t) \]

When Markov Doesn't Hold

In practice, agents often observe partial state information (e.g., a robot's camera image doesn't reveal hidden object positions). This leads to Partially Observable MDPs (POMDPs), where the agent receives observations $o_t$ instead of true states $s_t$. Many practical RL systems handle partial observability by using observation histories or recurrent policies.

Return and Discount Factor¶

The return $G_t$ is the total discounted reward from time step $t$:

\[ G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \]

The discount factor $\gamma$ controls the trade-off between immediate and future rewards:

$\gamma \to 0$: myopic agent, only cares about immediate reward
$\gamma \to 1$: far-sighted agent, values future reward almost as much as immediate

Why Discount?

Discounting serves multiple purposes: (1) it ensures the return is finite for infinite-horizon problems, (2) it models preference for sooner rewards, and (3) it reduces variance in return estimates.

Policy¶

A policy $\pi$ defines the agent's behavior — how it selects actions given states.

Stochastic policy: $\pi(a \mid s) = P(a_t = a \mid s_t = s)$

Deterministic policy: $a = \mu(s)$

The goal of RL is to find an optimal policy $\pi^*$ that maximizes the expected return:

\[ \pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \]

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ denotes a trajectory sampled under policy $\pi$.

Value Functions¶

Value functions estimate "how good" it is for the agent to be in a given state (or to take a given action in a state) under a policy $\pi$.

State-Value Function¶

\[ V^\pi(s) = \mathbb{E}_{\pi} \left[ G_t \mid s_t = s \right] = \mathbb{E}_{\pi} \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s \right] \]

Action-Value Function (Q-Function)¶

\[ Q^\pi(s, a) = \mathbb{E}_{\pi} \left[ G_t \mid s_t = s, a_t = a \right] \]

The relationship between $V$ and $Q$:

\[ V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q^\pi(s, a) \right] \]

Advantage Function¶

The advantage function measures how much better an action is compared to the average:

\[ A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \]

The advantage function is central to many policy gradient algorithms (PPO, TRPO, GAE).

Bellman Equations¶

The Bellman equations express a recursive relationship between value functions.

Bellman Expectation Equation¶

For $V^\pi$:

\[ V^\pi(s) = \sum_{a} \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \right] \]

For $Q^\pi$:

\[ Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') Q^\pi(s',a') \]

Bellman Optimality Equation¶

For the optimal value functions $V^*$ and $Q^*$:

\[ V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right] \]

\[ Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a') \]

Why Bellman Equations Matter

Almost all RL algorithms are, at their core, methods for approximately solving the Bellman equations. Value-based methods directly approximate $V^*$ or $Q^*$. Policy-based methods optimize $\pi$ to implicitly satisfy optimality. Understanding Bellman equations is essential for understanding RL.

Exploration vs. Exploitation¶

A fundamental challenge in RL is the exploration-exploitation trade-off:

Exploitation: Choose actions that maximize expected reward based on current knowledge
Exploration: Try new actions to discover potentially better strategies

Common exploration strategies:

Strategy	Description
$\varepsilon$-greedy	With probability $\varepsilon$, take a random action; otherwise, take the greedy action
Boltzmann (softmax)	Sample actions proportional to $\exp(Q(s,a)/\tau)$ where $\tau$ is temperature
UCB	Choose actions that maximize $Q(s,a) + c\sqrt{\ln(t)/N(s,a)}$
Entropy regularization	Add entropy bonus to the objective: $\mathcal{H}(\pi(\cdot
Intrinsic motivation	Reward curiosity — visiting novel states (ICM, RND, etc.)

Temporal Difference Learning¶

Temporal Difference (TD) learning is a central idea in RL that combines Monte Carlo sampling with bootstrapping from value estimates.

The simplest TD update (TD(0)) for estimating $V^\pi$:

\[ V(s_t) \leftarrow V(s_t) + \alpha \left[ r_t + \gamma V(s_{t+1}) - V(s_t) \right] \]

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

TD learning has several advantages over Monte Carlo methods:

Lower variance (bootstraps from value estimates instead of full returns)
Online learning (updates after each step, not after complete episodes)
Works for continuing tasks (doesn't require episodes to terminate)

TD vs Monte Carlo vs Dynamic Programming

Dynamic Programming: Requires full model $P(s'|s,a)$, bootstraps, no sampling
Monte Carlo: Model-free, no bootstrapping, requires complete episodes
TD Learning: Model-free, bootstraps, online — combines advantages of both

What's Next¶

With these core concepts in place, proceed to:

Algorithm Taxonomy — How RL algorithms are organized
Intro to Policy Optimization — The policy gradient theorem

Symbol	Meaning
\(\mathcal{S}\)	State space
\(\mathcal{A}\)	Action space
\(P(s' \mid s, a)\)	Transition dynamics — probability of next state \(s'\) given current state \(s\) and action \(a\)
\(R(s, a, s')\)	Reward function
\(\gamma \in [0, 1)\)	Discount factor

Strategy	Description
\(\varepsilon\)-greedy	With probability \(\varepsilon\), take a random action; otherwise, take the greedy action
Boltzmann (softmax)	Sample actions proportional to \(\exp(Q(s,a)/\tau)\) where \(\tau\) is temperature
UCB	Choose actions that maximize \(Q(s,a) + c\sqrt{\ln(t)/N(s,a)}\)
Entropy regularization	Add entropy bonus to the objective: $\mathcal{H}(\pi(\cdot
Intrinsic motivation	Reward curiosity — visiting novel states (ICM, RND, etc.)