Exercises
Hands-on exercises to deepen your understanding. Each exercise set corresponds to a section of the resource and builds progressively from conceptual to implementation challenges.
Part I: Reinforcement Learning
Exercise Set 1: Fundamentals
1.1 Bellman Equation Derivation
Derive the Bellman expectation equation for \(Q^\pi(s,a)\) starting from the definition of the action-value function. Show all steps.
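One possible starting point, written with the usual textbook definitions. The reward symbol \(r_{t+1}\) and the time indexing are assumptions, since the exercise does not fix a notation:

\[
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad Q^\pi(s,a) = \mathbb{E}_\pi\left[\, G_t \mid s_t = s,\ a_t = a \,\right].
\]

A natural first step is to split off the first reward, \(G_t = r_{t+1} + \gamma G_{t+1}\), and then condition on the next state and next action.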
1.2 Discount Factor Analysis
Consider a simple MDP with two states and deterministic transitions. The agent receives reward +1 at every step. Compute \(V^\pi(s_0)\) analytically for \(\gamma = 0.9\) and \(\gamma = 0.99\). What happens as \(\gamma \to 1\)?
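To check an analytical answer numerically, a minimal sketch is given here. It assumes the constant reward of +1 from the exercise; the function names and the truncation horizon are illustrative choices, not part of the exercise.

```python
def truncated_value(gamma: float, horizon: int, reward: float = 1.0) -> float:
    """Discounted sum of a constant reward over a finite number of steps."""
    return sum(reward * gamma**t for t in range(horizon))


def closed_form_value(gamma: float, reward: float = 1.0) -> float:
    """Limit of the geometric series reward * sum_t gamma**t for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("the infinite sum only converges for 0 <= gamma < 1")
    return reward / (1.0 - gamma)


if __name__ == "__main__":
    # The truncated sum should approach the closed form as the horizon grows.
    for gamma in (0.9, 0.99, 0.999):
        print(gamma, truncated_value(gamma, 20_000), closed_form_value(gamma))
```

Trying values of \(\gamma\) closer and closer to 1 gives a feel for the limiting behavior the exercise asks about.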
1.3 Advantage Function Properties
Prove that \(\mathbb{E}_{a \sim \pi}[A^\pi(s,a)] = 0\) for any state \(s\) and policy \(\pi\). Why is this property useful for variance reduction?
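Before writing the proof, it can help to see the identity hold numerically. The sketch below assumes a single state with a small discrete action set, arbitrary Q-values, and a softmax policy; all names are hypothetical and the check does not replace the proof.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_advantage(q_values, policy_probs):
    """E_{a~pi}[A(s,a)] with V(s) = sum_a pi(a|s) Q(s,a) and A = Q - V."""
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    advantages = [q - v for q in q_values]
    return sum(p * a for p, a in zip(policy_probs, advantages))


rng = random.Random(0)
q = [rng.uniform(-5.0, 5.0) for _ in range(4)]        # arbitrary Q(s, .)
pi = softmax([rng.gauss(0.0, 1.0) for _ in range(4)])  # arbitrary policy
print(expected_advantage(q, pi))  # zero up to floating-point error
```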
Exercise Set 2: Policy Gradient
2.1 Implement REINFORCE
Implement REINFORCE from scratch in PyTorch. Train it on CartPole-v1. Plot the learning curve (episode return vs. training step) over 5 random seeds.
Bonus: Add a learned baseline and compare the learning curves with and without the baseline.
2.2 Variance Analysis
Empirically compare the variance of the policy gradient estimator using: (a) full trajectory return, (b) reward-to-go, (c) reward-to-go with baseline.
Estimate the variance by holding the policy parameters fixed, computing one gradient estimate per trajectory over many trajectories, and measuring the sample variance across those estimates.
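The three weighting schemes differ only in the scalar that multiplies each \(\nabla \log \pi(a_t \mid s_t)\) term. A minimal sketch of those weights, plus a helper that sums per-parameter variance over a set of gradient samples, might look like this. The function names are hypothetical, and the gradients themselves would come from your own REINFORCE code.

```python
from statistics import pvariance


def full_return_weights(rewards, gamma=1.0):
    """(a) Every time step is weighted by the return of the whole trajectory."""
    total = sum(gamma**t * r for t, r in enumerate(rewards))
    return [total] * len(rewards)


def reward_to_go(rewards, gamma=1.0):
    """(b) Time step t is weighted only by rewards collected from t onward."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def with_baseline(weights, baselines):
    """(c) Subtract a state-dependent baseline b(s_t) from each weight."""
    return [w - b for w, b in zip(weights, baselines)]


def gradient_variance(grad_samples):
    """Population variance of each gradient component, summed over components.

    grad_samples is a list of flat gradient vectors, one per trajectory.
    """
    n_params = len(grad_samples[0])
    return sum(pvariance([g[i] for g in grad_samples]) for i in range(n_params))
```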
2.3 Implement PPO
Implement PPO-Clip with GAE and train it on HalfCheetah-v4 (MuJoCo). Also run it on CartPole-v1 and compare it with your REINFORCE implementation from Exercise 2.1. Key components:
- Clipped surrogate objective
- Value function loss
- Entropy bonus
- GAE advantage estimation
- Mini-batch updates over multiple epochs
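Two of the components above, GAE and the clipped surrogate, are easy to get subtly wrong, so a framework-free reference can be handy for unit-testing a PyTorch version. The following is a sketch under assumptions: plain Python lists for a single rollout, a `dones` flag that marks true episode termination, and default hyperparameters (`gamma=0.99`, `lam=0.95`, `clip_eps=0.2`) that are common choices rather than values given by the exercise.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t); last_value is V of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO-Clip objective (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s).
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With `gamma=1` and `lam=1` and a zero value function, the advantages reduce to the undiscounted reward-to-go, which makes a convenient sanity check.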
Exercise Set 3: Value-Based Methods
3.1 Implement DQN
Implement DQN with experience replay and target networks. Train on PongNoFrameskip-v4. Key components:
- Frame stacking (4 frames)
- \(\epsilon\)-greedy exploration with annealing
- Experience replay buffer
- Target network (update every 10K steps)
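The bookkeeping pieces of the list above can be sketched without any deep learning framework. The 10,000-step target update period comes from the exercise; the epsilon schedule endpoints, the annealing length, and all class and function names are assumptions chosen for illustration.

```python
import random
from collections import deque


def linear_epsilon(step, start=1.0, end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start + frac * (end - start)


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform sampling."""

    def __init__(self, capacity, seed=0):
        self._storage = deque(maxlen=capacity)  # oldest transitions drop out
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly at random."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


def should_sync_target(step, period=10_000):
    """True on the steps where the target network copies the online weights."""
    return step > 0 and step % period == 0
```

Converting the deque to a list on every `sample` call is simple but slow for large buffers; a preallocated array with a write index is a common replacement once the logic is verified.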
3.2 Double DQN
Modify your DQN to use Double Q-learning. Compare the Q-value estimates (plot Q-values over training) between DQN and Double DQN.
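The change from DQN to Double DQN is confined to the bootstrap target: the online network selects the next action and the target network evaluates it. A minimal per-transition sketch, assuming the Q-values for the next state are available as plain lists (the function names are hypothetical):

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    bootstrap = 0.0 if done else max(q_target_next)
    return reward + gamma * bootstrap


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]
```

When the two networks disagree about the best action, the Double DQN target is never larger than the DQN target, which is the effect the Q-value plots should reveal.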
Part II: World Models
Exercise Set 4: Representation Learning
4.1 Train a VAE
Train a convolutional VAE on frames from a simple environment (e.g., rendered CartPole). Visualize the latent space and reconstructions.
- What happens when you change \(\beta\) in the \(\beta\)-VAE objective?
- Can you find meaningful structure in the latent space?
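For reference when experimenting with \(\beta\), the objective can be written as reconstruction error plus \(\beta\) times a KL term. The sketch below assumes a diagonal Gaussian encoder and a standard normal prior, for which the KL has a closed form. The reconstruction error is taken as a precomputed number, and the function names are illustrative.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def beta_vae_loss(recon_error, mu, logvar, beta=1.0):
    """Reconstruction error plus beta-weighted KL; beta = 1 is the plain VAE."""
    return recon_error + beta * gaussian_kl(mu, logvar)
```

Larger values of `beta` put more weight on matching the prior, at the cost of reconstruction quality.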
4.2 Latent Dynamics Model
Extend Exercise 4.1: train a latent dynamics model on top of the VAE.
- Collect trajectories from a random policy
- Train: encoder, decoder, transition model
- Evaluate: predict 10 steps ahead, visualize reconstructed observations
- Measure prediction error vs. rollout horizon
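For the last bullet, the key detail is that the rollout is open-loop: the model is fed its own previous prediction, not the encoded ground-truth frame. A minimal sketch, assuming latents are lists of floats and `step_fn(z, a)` is a hypothetical stand-in for your trained transition model:

```python
def rollout_errors(step_fn, z0, actions, true_latents):
    """Squared latent prediction error at each horizon of an open-loop rollout.

    true_latents[k] is the encoder output for the observation reached after
    applying actions[0..k] in the real environment.
    """
    errors = []
    z = list(z0)
    for action, z_true in zip(actions, true_latents):
        z = step_fn(z, action)  # model consumes its own previous prediction
        errors.append(sum((p - q) ** 2 for p, q in zip(z, z_true)))
    return errors
```

Averaging these per-horizon errors over many starting states gives the error-versus-horizon curve.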
Exercise Set 5: Planning
5.1 CEM Planning
Implement CEM planning with a learned dynamics model.
- Train a dynamics model on CartPole transitions
- Implement CEM to plan action sequences
- Use MPC (re-plan at each step) to control the environment
- Compare with a random policy and a trained RL policy
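The planning loop itself can be prototyped on a toy problem before plugging in a learned model. Below is a sketch of CEM over open-loop action sequences, assuming scalar actions clipped to \([-1, 1]\), a deterministic `step_fn(state, action)`, and a `reward_fn(next_state, action)`. These signatures and the default population sizes are assumptions, not part of the exercise.

```python
import random


def cem_plan(step_fn, reward_fn, state, horizon=5, population=64,
             n_elite=8, iterations=5, seed=0):
    """Return the mean action sequence found by the cross-entropy method."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iterations):
        candidates = []
        for _ in range(population):
            # Sample one action sequence from the current Gaussian and clip it.
            actions = [max(-1.0, min(1.0, rng.gauss(m, s)))
                       for m, s in zip(mean, std)]
            # Score the sequence by rolling it out through the model.
            s, total = state, 0.0
            for a in actions:
                s = step_fn(s, a)
                total += reward_fn(s, a)
            candidates.append((total, actions))
        # Refit the sampling distribution to the best-scoring sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        elites = [actions for _, actions in candidates[:n_elite]]
        for t in range(horizon):
            column = [e[t] for e in elites]
            mean[t] = sum(column) / n_elite
            var = sum((x - mean[t]) ** 2 for x in column) / n_elite
            std[t] = max(var ** 0.5, 1e-3)  # floor avoids a collapsed search
    return mean


if __name__ == "__main__":
    # Toy 1-D problem: drive x toward 0 from x = 5 with x' = x + a.
    plan = cem_plan(lambda x, a: x + a, lambda x, a: -x * x, state=5.0)
    print(plan)
```

For MPC, only the first action of the returned plan is executed; the environment is stepped and `cem_plan` is called again from the new state.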
Part III: Embodied AI
Exercise Set 6: Locomotion
6.1 Train a Walking Policy
Using Isaac Gym (or MuJoCo), train a quadruped locomotion policy with PPO:
- Set up the environment with a simple quadruped (e.g., ANYmal or Unitree Go1)
- Design a reward function for velocity tracking
- Train with domain randomization (at least 3 randomized parameters)
- Evaluate: plot tracking error vs. commanded velocity
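One commonly used shape for a velocity-tracking reward is an exponential of the squared tracking error, paired with per-episode sampling of physical parameters. The sketch below is only an illustration: the `sigma` value, the three randomized parameters, and their ranges are hypothetical placeholders, and neither Isaac Gym nor MuJoCo APIs are used.

```python
import math
import random


def velocity_tracking_reward(commanded, measured, sigma=0.25):
    """exp(-||v_cmd - v||^2 / sigma): equals 1 at perfect tracking, decays with error."""
    sq_err = sum((c - m) ** 2 for c, m in zip(commanded, measured))
    return math.exp(-sq_err / sigma)


# Hypothetical parameters and ranges; tune these for your robot and simulator.
RANDOMIZATION_RANGES = {
    "friction": (0.5, 1.25),
    "payload_mass_kg": (-1.0, 3.0),
    "motor_strength_scale": (0.9, 1.1),
}


def sample_randomization(rng):
    """Draw one set of physical parameters, e.g. at every episode reset."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in RANDOMIZATION_RANGES.items()}


if __name__ == "__main__":
    print(velocity_tracking_reward((1.0, 0.0), (0.8, 0.1)))
    print(sample_randomization(random.Random(0)))
```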
6.2 Terrain Curriculum
Extend Exercise 6.1 with a terrain curriculum:
- Start training on flat ground
- Gradually introduce slopes, then stairs, then rough terrain
- Compare learning curves with and without curriculum
- Analyze which terrain types are hardest
Exercise Set 7: Data Collection
7.1 Behavior Cloning Pipeline
Build a complete behavior cloning pipeline:
- Collect 100 demonstrations (can be scripted for simplicity) of a pick-and-place task
- Train a BC policy (MLP or simple Transformer)
- Evaluate success rate
- Analyze failure modes: when does BC fail?
Part IV: Distributed RL
Exercise Set 8: Scaling
8.1 Vectorized Environments
Benchmark the effect of environment parallelism:
- Run PPO on CartPole with 1, 4, 16, 64, 256, 1024 parallel environments
- Plot: wall-clock time to reach a target return vs. number of environments
- Plot: sample efficiency (return vs. total environment steps), with one curve per number of environments
- Where does adding more environments stop helping?
8.2 Profiling
Profile an RL training run (e.g., PPO on a MuJoCo task):
- Measure time spent in: environment stepping, policy inference, gradient computation, data transfer
- Identify the bottleneck
- Propose and implement one optimization
- Measure the speedup
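A lightweight way to get the per-phase breakdown is to wrap each phase of the training loop in a timing context manager. The sketch below uses only the standard library; the class name and section labels are hypothetical. If the policy runs on a GPU, kernels launch asynchronously, so the device needs to be synchronized (for example with `torch.cuda.synchronize()`) before each clock read for the numbers to be meaningful.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulates wall-clock time per named section of a training loop."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self):
        """Share of the total measured time spent in each section."""
        total = sum(self.totals.values())
        if total <= 0.0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


if __name__ == "__main__":
    timer = SectionTimer()
    for _ in range(3):
        with timer.section("env_step"):
            time.sleep(0.01)   # stand-in for environment stepping
        with timer.section("policy_inference"):
            time.sleep(0.002)  # stand-in for a forward pass
    print(timer.fractions())
```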
General Research Exercises
R.1 Paper Reproduction
Choose a paper from the Key Papers list and reproduce its main result:
- Implement the algorithm from the paper description (not from existing code)
- Run the same experiment setup
- Compare your results to the reported results
- Write a 1-page report on what you learned, including any discrepancies
R.2 Ablation Study
Take any algorithm you've implemented and perform a thorough ablation study:
- Identify 3-5 design choices (e.g., network size, learning rate schedule, advantage estimation method)
- Test each choice independently
- Present results in a clear table or figure
- Which design choices matter most?
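Testing each choice independently usually means changing one factor at a time relative to a fixed baseline configuration. A minimal sketch of generating such a run list follows; the configuration keys and values are hypothetical examples rather than recommendations.

```python
# Hypothetical baseline and alternatives; replace with your own design choices.
BASELINE = {"hidden_size": 64, "lr_schedule": "constant", "advantage": "gae"}
ALTERNATIVES = {
    "hidden_size": [32, 256],
    "lr_schedule": ["linear_decay"],
    "advantage": ["monte_carlo"],
}


def one_at_a_time(baseline, alternatives):
    """Return (label, config) pairs that each differ from baseline in one key."""
    runs = [("baseline", dict(baseline))]
    for key, values in alternatives.items():
        for value in values:
            config = dict(baseline)
            config[key] = value
            runs.append((f"{key}={value}", config))
    return runs


if __name__ == "__main__":
    for label, config in one_at_a_time(BASELINE, ALTERNATIVES):
        print(label, config)
```

Each configuration should be run over several random seeds so that differences between rows of the final table can be separated from seed noise.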