Lec 03 - Policy Gradients
Policy gradient derivation, REINFORCE, variance reduction, off-policy methods
Core RL Objective
\[\theta^\star = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]\]where trajectory $\tau = (s_1, a_1, s_2, \ldots, s_T, a_T)$ and $p_\theta(\tau) = p(s_1) \prod_t \pi_\theta(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$
Policy Gradient Derivation
Step 1: Express objective as expectation
\[J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] = \int p_\theta(\tau) r(\tau) d\tau\]where $r(\tau) = \sum_t r(s_t, a_t)$ is the total trajectory reward.
Step 2: Take gradient
\[\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) r(\tau) d\tau = \int \nabla_\theta p_\theta(\tau) r(\tau) d\tau\]Step 3: Log-derivative trick
Key identity: $\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$
Therefore: $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$
Derivation of identity: by the chain rule, \(\nabla_\theta \log p_\theta(\tau) = \frac{1}{p_\theta(\tau)} \nabla_\theta p_\theta(\tau)\); multiplying both sides by \(p_\theta(\tau)\) gives \(p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) = \nabla_\theta p_\theta(\tau)\)
Substituting back:
\[\nabla_\theta J(\theta) = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) r(\tau) d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \cdot r(\tau) \right]\]Step 4: Simplify log probability
Recall: $p_\theta(\tau) = p(s_1) \prod_t \pi_\theta(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$
Taking log:
\[\log p_\theta(\tau) = \log p(s_1) + \sum_t \log \pi_\theta(a_t|s_t) + \sum_t \log p(s_{t+1}|s_t, a_t)\]Taking gradient (only policy $\pi_\theta$ depends on $\theta$):
\[\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\]Step 5: Final policy gradient formula
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right) \left(\sum_t r(s_t, a_t)\right) \right]\]Step 6: Monte Carlo estimate
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right) \left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)\]REINFORCE Algorithm (Vanilla Policy Gradient)
Loop:
- Sample trajectories $\{\tau^i\}$ from current policy $\pi_\theta(a_t\mid s_t)$
- Compute gradient: \(\nabla_\theta J(\theta) \approx \sum_i \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)\right) \left(\sum_t r(s_t^i, a_t^i)\right)\)
- Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Intuition:
- Increase log-probability of actions taken in high-reward trajectories
- Decrease log-probability of actions taken in low-reward trajectories
- Formalization of “trial-and-error” learning
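A minimal sketch of this loop in PyTorch, assuming a hypothetical gym-style `env` (with `reset`/`step`) and a `policy` network that maps an observation to a `torch.distributions` action distribution; every log-probability in a trajectory is weighted by that trajectory's total reward, matching the Monte Carlo estimate above.

```python
import torch

def reinforce_update(env, policy, optimizer, num_trajectories=16):
    """One vanilla REINFORCE update on a batch of sampled trajectories."""
    loss_terms = []
    for _ in range(num_trajectories):
        obs, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            dist = policy(torch.as_tensor(obs, dtype=torch.float32))  # pi_theta(. | s_t)
            action = dist.sample()
            obs, reward, done, _ = env.step(action.numpy())
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        total_reward = sum(rewards)                                   # r(tau)
        loss_terms.append(-torch.stack(log_probs).sum() * total_reward)
    loss = torch.stack(loss_terms).mean()   # negative surrogate: minimizing = gradient ascent on J
    optimizer.zero_grad()
    loss.backward()                         # autodiff yields the policy gradient estimate
    optimizer.step()
```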
Relationship to Imitation Learning
Imitation Learning (Behavioral Cloning) objective:
\[\min_{\theta} -\mathbb{E}_{(s,a) \sim \mathcal{D}} [\log \pi_\theta(a|s)]\]This is equivalent to maximizing: \(\mathbb{E}_{(s,a) \sim \mathcal{D}} [\log \pi_\theta(a\mid s)]\)
Gradient: \(\nabla_\theta J_{BC}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}} [\nabla_\theta \log \pi_\theta(a|s)]\)
Policy Gradient can be seen as “weighted imitation learning”:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right) \left(\sum_t r(s_t, a_t)\right) \right]\]Key differences:
| Aspect | Imitation Learning | Policy Gradient |
|---|---|---|
| Data source | Expert demonstrations | Self-generated trajectories |
| Weighting | All actions weighted equally | Actions weighted by trajectory reward |
| Objective | Mimic expert behavior | Maximize expected reward |
| Gradient | $\nabla_\theta \log \pi_\theta(a \mid s)$ | $\nabla_\theta \log \pi_\theta(a\mid s) \cdot r(\tau)$ |
Connection:
- If every trajectory received the same reward ($r(\tau) = c$ for all $\tau$), the policy gradient would have the same form as the behavioral cloning gradient applied to the policy's own data (scaled by $c$)
- Policy gradient = imitation learning where you imitate your own good behaviors (high reward) and avoid your own bad behaviors (low reward)
- Both use the same $\nabla_\theta \log \pi_\theta(a\mid s)$ term, but PG adds reward-based weighting
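To make the contrast concrete, a short sketch of the two losses side by side (`policy`, the expert batch, and `weights` are assumed placeholders; `weights` plays the role of the trajectory-reward weighting):

```python
# Behavioral cloning: maximize log-likelihood of expert actions (all weighted equally)
bc_loss = -policy(expert_states).log_prob(expert_actions).mean()

# Policy gradient: the same log-prob term on self-generated data, weighted by reward
pg_loss = -(policy(states).log_prob(actions) * weights).mean()
```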
Variance Reduction Techniques
1. Causality (Reward-to-Go)
Key insight: the action taken at time $t$ cannot affect rewards received at earlier times $t' < t$
Using causality, we only weight by future rewards:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right) \right]\]Derivation:
Starting from: \(\nabla_\theta J(\theta) = \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \sum_{t'=1}^T r(s_{i,t'}, a_{i,t'})\)
We can rearrange as: \(= \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=1}^{t-1} r(s_{i,t'}, a_{i,t'}) + \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right)\)
The first term $\sum_{t'=1}^{t-1} r(s_{i,t'}, a_{i,t'})$ does not depend on $a_{i,t}$, so in expectation:
\[\mathbb{E}_{\tau} \left[\nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'})\right] = 0\]This holds because, conditioned on the trajectory up to time $t$, the past rewards are fixed and $\mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}[\nabla_\theta \log \pi_\theta(a_t|s_t)] = 0$ (the same argument as the baseline proof below). This leaves only the future rewards (reward-to-go).
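As a concrete sketch, reward-to-go can be computed with one backward pass over the reward sequence; this matches the `compute_rewards_to_go` helper used in the implementation section below.

```python
import torch

def compute_rewards_to_go(rewards):
    """rtg[t] = sum_{t' >= t} r_{t'}, via a reverse cumulative sum."""
    rtg = torch.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += float(rewards[t])
        rtg[t] = running
    return rtg
```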
2. Baselines
Subtract a constant baseline $b$ (typically average reward):
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) (r(\tau) - b) \right]\]Common choice: $b = \frac{1}{N} \sum_i r(\tau^i)$ (average reward across batch)
Proof of unbiasedness:
We need to show: $\mathbb{E}_{\tau} [\nabla_\theta \log p_\theta(\tau) \cdot b] = 0$
\[\mathbb{E}_{\tau} [\nabla_\theta \log p_\theta(\tau) \cdot b] = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) \cdot b \, d\tau\] \[= b \int p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} d\tau = b \int \nabla_\theta p_\theta(\tau) d\tau\] \[= b \nabla_\theta \int p_\theta(\tau) d\tau = b \nabla_\theta 1 = 0\]Effect:
- Actions in above-average trajectories get positive gradients (increase probability)
- Actions in below-average trajectories get negative gradients (decrease probability)
3. Combined: Causality + Baseline
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b\right)\]Implementation: Surrogate Objective
Why We Need a Surrogate
Problem: We derived the gradient $\nabla_\theta J(\theta)$ mathematically, but computing $\nabla_\theta \log \pi_\theta(a\mid s)$ manually for a neural network is impractical.
Solution: Construct a scalar function $\tilde{J}(\theta)$ whose gradient equals our derived policy gradient, then use automatic differentiation.
The Surrogate Function
\[\tilde{J}(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b\right)\]Key insight: When we differentiate $\tilde{J}(\theta)$:
\[\nabla_\theta \tilde{J}(\theta) = \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b\right)\]This is exactly our policy gradient from the derivation! The reward terms are constants, so they pass through the gradient operator unchanged.
Workflow
- Derive mathematically: $\nabla_\theta J(\theta)$ (what we want)
- Construct surrogate: $\tilde{J}(\theta)$ (scalar function without $\nabla_\theta$)
- Let autodiff compute: $\nabla_\theta \tilde{J}(\theta) = \nabla_\theta J(\theta)$ automatically
Implementation
# Define the surrogate (no gradient symbol appears in the code)
log_probs = policy(states).log_prob(actions)          # log pi_theta(a_t | s_t)
weights = compute_rewards_to_go(rewards) - baseline   # reward-to-go minus baseline
surrogate = (log_probs * weights).mean()
# Autodiff computes the gradient for us
loss = -surrogate        # negate: optimizers minimize, but we want gradient ascent on J
optimizer.zero_grad()    # clear stale gradients before backprop
loss.backward()          # PyTorch computes ∇_θ surrogate = ∇_θ J
optimizer.step()
Policy Parameterizations
Discrete actions (Cross-Entropy):
- Network outputs logits, apply softmax: $\pi_\theta(a\mid s) = \text{softmax}(f_\theta(s))_a$
- Surrogate uses: $\log \pi_\theta(a\mid s)$
- Resembles classification loss
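For example, with `torch.distributions` (a sketch; `logits_net` is an assumed network that outputs one logit per action):

```python
import torch
from torch.distributions import Categorical

dist = Categorical(logits=logits_net(state))  # softmax applied internally
action = dist.sample()
log_prob = dist.log_prob(action)              # log pi_theta(a | s), used in the surrogate
```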
Continuous actions (Gaussian):
- Network outputs mean: $\mu_\theta(s)$, with variance $\sigma^2$
- Action sampled: $a \sim \mathcal{N}(\mu_\theta(s), \sigma^2)$
- Log prob: $\log \pi_\theta(a\mid s) = -\frac{1}{2\sigma^2}\|a - \mu_\theta(s)\|^2 + \text{const}$
- Surrogate uses squared error form
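Analogously for a Gaussian policy (a sketch; `mean_net` and the state-independent `log_std` parameter are assumptions):

```python
import torch
from torch.distributions import Normal

dist = Normal(mean_net(state), torch.exp(log_std))  # mean from network, fixed or learned std
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)            # sum over action dimensions
```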
Challenges & Practical Considerations
Problem 1: High Variance
Issues:
- Policy gradients are noisy/high-variance even with tricks
- Sensitive to reward scale
- Worse for sparse rewards (e.g., most trajectories may have zero reward)
Example (from slides):
- Humanoid walking: Only some trajectories move forward, most fall
- Jacket folding: Only 1 out of 4 trajectories succeeds
Solutions:
- Use large batch sizes (more trajectories per update)
- Use dense reward shaping when possible
- Apply both causality and baselines
- Consider more advanced methods (PPO, TRPO)
Problem 2: On-Policy Constraint
Definition:
- On-policy: Algorithm uses only data from current policy $\pi_\theta$
- Off-policy: Algorithm can reuse data from past policies
Issue with vanilla PG:
- Gradient formula requires $\tau \sim \pi_\theta$ (current policy)
- After updating $\theta \rightarrow \theta'$, old trajectories are invalid
- Must recollect data after every update (sample inefficient)
Why vanilla PG is on-policy:
The expectation \(\mathbb{E}_{\tau \sim \pi_\theta}\) is with respect to the current policy. After \(\theta\) changes, the distribution changes, so old samples don’t represent the new policy.
Summary
Key equations:
- Policy gradient theorem: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right) r(\tau) \right]\)
- With causality: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right]\)
- With baseline: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^T r(s_{t'}, a_{t'}) - b\right) \right]\)
Key takeaways:
- Do more of what worked (high reward), less of what didn’t (low reward)
- Use causality (reward-to-go) and baselines to reduce variance
- Method is on-policy and requires careful tuning
- Trade-off: Simple and general, but high variance and sample inefficient
Off-Policy Policy Gradients
Motivation: Sample Efficiency Problem
On-policy limitation: Vanilla policy gradient requires trajectories from current policy $\pi_{\theta'}$
- After each gradient update $\theta \rightarrow \theta'$, all old data becomes invalid
- Must recollect entirely new trajectories
- Extremely sample inefficient
Goal: Reuse data from old policy $\pi_\theta$ to update new policy $\pi_{\theta'}$
IS: Importance Sampling
Key idea: Convert expectation under one distribution to expectation under another
\[\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[f(\tau)] = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} f(\tau)\right]\]Importance weight: $\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)}$
For trajectories, this ratio expands to:
\[\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} = \frac{p(s_1) \prod_{t=1}^T \pi_{\theta'}(a_t|s_t) p(s_{t+1}|s_t, a_t)}{p(s_1) \prod_{t=1}^T \pi_{\theta}(a_t|s_t) p(s_{t+1}|s_t, a_t)} = \prod_{t=1}^T \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\]Critical observation: the dynamics $p(s_{t+1} \mid s_t, a_t)$ cancel out; only the policy ratios remain.
Off-Policy Policy Gradient (Trajectory-Level)
Starting from the on-policy gradient with causality and baseline, where $\pi_{\theta'}$ is the latest policy:
\[\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)} \left[ \left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\right) \left(\left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right) - b\right) \right]\]Apply importance sampling to sample from $\pi_\theta$ instead:
\[\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} \left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\right) \left(\left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right) - b\right) \right]\]Substitute the importance weight:
\[\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ \prod_{t=1}^T \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)} \left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\right) \left(\left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right) - b\right) \right]\]Problem with trajectory-level IS:
- Product $\prod_{t=1}^T \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}$ can explode or vanish for long horizons
- High variance, unstable training
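In code, the trajectory-level weight is a product of per-step ratios; a sketch that works in log space (here `new_log_probs` and `old_log_probs` are per-timestep action log-probabilities under $\pi_{\theta'}$ and $\pi_\theta$). The exponent grows with the horizon $T$, which is exactly why the weight explodes or vanishes:

```python
import torch

def trajectory_is_weight(new_log_probs, old_log_probs):
    """prod_t pi_theta'(a_t|s_t) / pi_theta(a_t|s_t), computed via a sum of log-ratios."""
    return torch.exp((new_log_probs - old_log_probs).sum())
```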
Off-Policy Policy Gradient (Timestep-Level)
Better approach: Apply importance sampling at individual timestep level instead of full trajectory
\[\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(a_{i,t}|s_{i,t})}{\pi_{\theta}(a_{i,t}|s_{i,t})} \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}|s_{i,t}) \left(\left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right) - b\right)\]Key advantages:
- Importance weight is now per-timestep: $\frac{\pi_{\theta'}(a_{i,t} \mid s_{i,t})}{\pi_{\theta}(a_{i,t} \mid s_{i,t})}$ (single ratio, not product)
- Much less likely to explode/vanish
- More stable variance
Common simplification: When $\pi_{\theta'}$ hasn't changed much from $\pi_\theta$, approximate the ratio by $1$:
\[\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}|s_{i,t}) \left(\left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right) - b\right)\]Off-Policy Algorithm
Full algorithm:
- Sample trajectories $\{\tau^i\}$ from old policy $\pi_\theta(a_t\mid s_t)$
- Compute off-policy gradient $\nabla_\theta J(\theta)$ using $\{\tau^i\}$ with importance weights
- Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
- Key benefit: Can take multiple gradient steps on same batch before resampling
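A sketch of such an update under these assumptions: `old_log_probs` are the log-probabilities recorded (and detached) when the batch was collected, and `weights` are the reward-to-go-minus-baseline terms. The gradient of `ratio * weights` is exactly the per-timestep importance-weighted policy gradient above, since $\nabla_{\theta'} \frac{\pi_{\theta'}}{\pi_\theta} = \frac{\pi_{\theta'}}{\pi_\theta} \nabla_{\theta'} \log \pi_{\theta'}$.

```python
import torch

def off_policy_updates(policy, optimizer, states, actions, weights, old_log_probs,
                       num_steps=5):
    """Timestep-level importance-sampled PG: reuse one batch for several gradient steps."""
    for _ in range(num_steps):
        new_log_probs = policy(states).log_prob(actions)
        ratio = torch.exp(new_log_probs - old_log_probs.detach())  # per-timestep IS weight
        surrogate = (ratio * weights).mean()
        loss = -surrogate
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```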
Policy Constraint Problem
Issue: If $\pi_{\theta'}$ drifts too far from $\pi_\theta$ during updates:
- Old data no longer representative of states new policy visits
- Importance weights become inaccurate
- Gradient estimates degrade
Solution: Constrain policy updates to keep $\pi_{\theta'}$ close to $\pi_\theta$
Common constraint (KL divergence):
\[\mathbb{E}_{s \sim \pi_\theta} [D_{KL}(\pi_{\theta'}(\cdot | s) \| \pi_{\theta}(\cdot | s))] \leq \delta\]This ensures the new policy doesn’t deviate too much from the old one in expectation.
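In code, this expected KL can be estimated over a batch of visited states and used to decide when to stop updating on the current batch (a sketch; `new_policy(states)`, `old_policy(states)`, and the threshold `delta` are assumptions, with the policies returning `torch.distributions` objects):

```python
import torch
from torch.distributions import kl_divergence

# Empirical estimate of E_s[ D_KL( pi_theta'(.|s) || pi_theta(.|s) ) ]
kl = kl_divergence(new_policy(states), old_policy(states)).mean()
stop_updating = kl.item() > delta   # e.g., early-stop further steps on this batch
```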
Methods using this constraint:
- TRPO (Trust Region Policy Optimization): Hard constraint via constrained optimization
- PPO (Proximal Policy Optimization): Soft constraint via clipped objective
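For reference, PPO's clipped surrogate (roughly sketched here; `advantages` plays the role of the reward-to-go-minus-baseline weight) caps how far the per-timestep ratio can push the objective, which implicitly keeps $\pi_{\theta'}$ near $\pi_\theta$:

```python
import torch

def ppo_clip_surrogate(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Take the more pessimistic of the clipped and unclipped importance-weighted terms."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.min(ratio * advantages, clipped * advantages).mean()
```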
Summary: On-Policy vs Off-Policy
| Aspect | On-Policy (Vanilla PG) | Off-Policy (IS) |
|---|---|---|
| Data source | Current policy $\pi_{\theta'}$ | Old policy $\pi_\theta$ |
| Sample efficiency | Low (discard all data after update) | High (reuse data) |
| Gradient | $\mathbb{E}_{\tau \sim \pi_{\theta'}}[\cdots]$ | $\mathbb{E}_{\tau \sim \pi_{\theta}}[\text{IS weight} \cdot \cdots]$ |
| Variance | Lower | Higher (due to IS weights) |
| Updates per batch | 1 | Multiple |
| Constraint needed? | No | Yes (to prevent drift) |
Key insight: Off-policy methods trade increased variance (from importance sampling) for dramatically improved sample efficiency (by reusing data).
References
- Lecture 03 Video CS224R Stanford (2025)
- Lecture 03 Slides CS224R Stanford (2025)