Lec 03 - Policy Gradients

Policy gradient derivation, REINFORCE, variance reduction, off-policy methods

Core RL Objective

\[\theta^\star = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t r(s_t, a_t) \right]\]

where trajectory $\tau = (s_1, a_1, s_2, \ldots, s_T, a_T)$ and $p_\theta(\tau) = p(s_1) \prod_t \pi_\theta(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$

Policy Gradient Derivation

Step 1: Express objective as expectation

\[J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ r(\tau) \right] = \int p_\theta(\tau) r(\tau) d\tau\]

where $r(\tau) = \sum_t r(s_t, a_t)$ is the total trajectory reward.

Step 2: Take gradient

\[\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau) r(\tau) d\tau = \int \nabla_\theta p_\theta(\tau) r(\tau) d\tau\]

Step 3: Log-derivative trick

Key identity: $\nabla_\theta \log p_\theta(\tau) = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}$

Therefore: $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$

Derivation of identity: \(\nabla_\theta \log p_\theta(\tau) = \nabla_\theta \log p_\theta(\tau) \cdot \frac{p_\theta(\tau)}{p_\theta(\tau)} = \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\)

Substituting back:

\[\nabla_\theta J(\theta) = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) r(\tau) d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) \cdot r(\tau) \right]\]

Step 4: Simplify log probability

Recall: $p_\theta(\tau) = p(s_1) \prod_t \pi_\theta(a_t \mid s_t) p(s_{t+1} \mid s_t, a_t)$

Taking log:

\[\log p_\theta(\tau) = \log p(s_1) + \sum_t \log \pi_\theta(a_t|s_t) + \sum_t \log p(s_{t+1}|s_t, a_t)\]

Taking gradient (only policy $\pi_\theta$ depends on $\theta$):

\[\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\]

Step 5: Final policy gradient formula

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right) \left(\sum_t r(s_t, a_t)\right) \right]\]

Step 6: Monte Carlo estimate

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \left(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right) \left(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\right)\]

REINFORCE Algorithm (Vanilla Policy Gradient)

Loop:

  1. Sample trajectories ${\tau^i}$ from current policy $\pi_\theta(a_t\mid s_t)$
  2. Compute gradient: \(\nabla_\theta J(\theta) \approx \sum_i \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)\right) \left(\sum_t r(s_t^i, a_t^i)\right)\)
  3. Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Intuition:

  • Increase log-probability of actions taken in high-reward trajectories
  • Decrease log-probability of actions taken in low-reward trajectories
  • Formalization of “trial-and-error” learning

Relationship to Imitation Learning

Imitation Learning (Behavioral Cloning) objective:

\[\min_{\theta} -\mathbb{E}_{(s,a) \sim \mathcal{D}} [\log \pi_\theta(a|s)]\]

This is equivalent to maximizing: \(\mathbb{E}_{(s,a) \sim \mathcal{D}} [\log \pi_\theta(a\mid s)]\)

Gradient: \(\nabla_\theta J_{BC}(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}} [\nabla_\theta \log \pi_\theta(a|s)]\)

Policy Gradient can be seen as “weighted imitation learning”:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right) \left(\sum_t r(s_t, a_t)\right) \right]\]

Key differences:

Aspect Imitation Learning Policy Gradient
Data source Expert demonstrations Self-generated trajectories
Weighting All actions weighted equally Actions weighted by trajectory reward
Objective Mimic expert behavior Maximize expected reward
Gradient $\nabla_\theta \log \pi_\theta(a \mid s)$ $\nabla_\theta \log \pi_\theta(a\mid s) \cdot r(\tau)$

Connection:

  • If all rewards are equal ($r(\tau) = c$ for all $\tau$), policy gradient reduces to behavioral cloning
  • Policy gradient = imitation learning where you imitate your own good behaviors (high reward) and avoid your own bad behaviors (low reward)
  • Both use the same $\nabla_\theta \log \pi_\theta(a\mid s)$ term, but PG adds reward-based weighting

Variance Reduction Techniques

1. Causality (Reward-to-Go)

Key insight: Policy at time $t$ cannot affect rewards at time $t’ < t$

Using causality, we only weight by future rewards:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right) \right]\]

Derivation:

Starting from: \(\nabla_\theta J(\theta) = \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \sum_{t'=1}^T r(s_{i,t'}, a_{i,t'})\)

We can rearrange as: \(= \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=1}^{t-1} r(s_{i,t'}, a_{i,t'}) + \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right)\)

The first term $\sum_{t’=1}^{t-1} r(s_{i,t’}, a_{i,t’})$ doesn’t depend on $a_t$, so:

\[\mathbb{E}_{\tau} \left[\nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'})\right] = 0\]

This leaves only future rewards.

2. Baselines

Subtract a constant baseline $b$ (typically average reward):

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \nabla_\theta \log p_\theta(\tau) (r(\tau) - b) \right]\]

Common choice: $b = \frac{1}{N} \sum_i r(\tau^i)$ (average reward across batch)

Proof of unbiasedness:

We need to show: $\mathbb{E}{\tau} [\nabla\theta \log p_\theta(\tau) \cdot b] = 0$

\[\mathbb{E}_{\tau} [\nabla_\theta \log p_\theta(\tau) \cdot b] = \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) \cdot b \, d\tau\] \[= b \int p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} d\tau = b \int \nabla_\theta p_\theta(\tau) d\tau\] \[= b \nabla_\theta \int p_\theta(\tau) d\tau = b \nabla_\theta 1 = 0\]

Effect:

  • Actions in above-average trajectories get positive gradients (increase probability)
  • Actions in below-average trajectories get negative gradients (decrease probability)

3. Combined: Causality + Baseline

\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b\right)\]

Implementation: Surrogate Objective

Why We Need a Surrogate

Problem: We derived the gradient $\nabla_\theta J(\theta)$ mathematically, but computing $\nabla_\theta \log \pi_\theta(a\mid s)$ manually for a neural network is impractical.

Solution: Construct a scalar function $\tilde{J}(\theta)$ whose gradient equals our derived policy gradient, then use automatic differentiation.

The Surrogate Function

\[\tilde{J}(\theta) = \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b\right)\]

Key insight: When we differentiate $\tilde{J}(\theta)$:

\[\nabla_\theta \tilde{J}(\theta) = \frac{1}{N} \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b\right)\]

This is exactly our policy gradient from the derivation! The reward terms are constants, so they pass through the gradient operator unchanged.

Workflow

  1. Derive mathematically: $\nabla_\theta J(\theta)$ (what we want)
  2. Construct surrogate: $\tilde{J}(\theta)$ (scalar function without $\nabla_\theta$)
  3. Let autodiff compute: $\nabla_\theta \tilde{J}(\theta) = \nabla_\theta J(\theta)$ automatically

Implementation

log_probs = policy(states).log_prob(actions)
rewards_to_go = compute_rewards_to_go(rewards) - baseline
surrogate = (log_probs * rewards_to_go).mean()

# Autodiff computes the gradient for us
loss = -surrogate  # Negative for gradient ascent
loss.backward()    # PyTorch computes ∇_θ surrogate = ∇_θ J
optimizer.step()

Policy Parameterizations

Discrete actions (Cross-Entropy):

  • Network outputs logits, apply softmax: $\pi_\theta(a\mid s) = \text{softmax}(f_\theta(s))_a$
  • Surrogate uses: $\log \pi_\theta(a\mid s)$
  • Resembles classification loss

Continuous actions (Gaussian):

  • Network outputs mean: $\mu_\theta(s)$, with variance $\sigma^2$
  • Action sampled: $a \sim \mathcal{N}(\mu_\theta(s), \sigma^2)$
  • Log prob: $\log \pi_\theta(a\mid s) \propto -\frac{1}{2\sigma^2}|a - \mu_\theta(s)|^2$
  • Surrogate uses squared error form

    Challenges & Practical Considerations

Problem 1: High Variance

Issues:

  • Policy gradients are noisy/high-variance even with tricks
  • Sensitive to reward scale
  • Worse for sparse rewards (e.g., most trajectories may have zero reward)

Example (from slides):

  • Humanoid walking: Only some trajectories move forward, most fall
  • Jacket folding: Only 1 out of 4 trajectories succeeds

Solutions:

  • Use large batch sizes (more trajectories per update)
  • Use dense reward shaping when possible
  • Apply both causality and baselines
  • Consider more advanced methods (PPO, TRPO)

Problem 2: On-Policy Constraint

Definition:

  • On-policy: Algorithm uses only data from current policy $\pi_\theta$
  • Off-policy: Algorithm can reuse data from past policies

Issue with vanilla PG:

  • Gradient formula requires $\tau \sim \pi_\theta$ (current policy)
  • After updating $\theta \rightarrow \theta’$, old trajectories are invalid
  • Must recollect data after every update (sample inefficient)

Why vanilla PG is on-policy:

The expectation \(\mathbb{E}_{\tau \sim \pi_\theta}\) is with respect to the current policy. After \(\theta\) changes, the distribution changes, so old samples don’t represent the new policy.

Summary

Key equations:

  1. Policy gradient theorem: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\right) r(\tau) \right]\)

  2. With causality: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right]\)

  3. With baseline: \(\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^T r(s_{t'}, a_{t'}) - b\right) \right]\)

Key takeaways:

  • Do more of what worked (high reward), less of what didn’t (low reward)
  • Use causality (reward-to-go) and baselines to reduce variance
  • Method is on-policy and requires careful tuning
  • Trade-off: Simple and general, but high variance and sample inefficient

Off-Policy Policy Gradients

Motivation: Sample Efficiency Problem

On-policy limitation: Vanilla policy gradient requires trajectories from current policy $\pi_{\theta’}$

  • After each gradient update $\theta \rightarrow \theta’$, all old data becomes invalid
  • Must recollect entirely new trajectories
  • Extremely sample inefficient

Goal: Reuse data from old policy $\pi_\theta$ to update new policy $\pi_{\theta’}$

IS: Importance Sampling

Key idea: Convert expectation under one distribution to expectation under another

\[\mathbb{E}_{\tau \sim p_{\theta'}(\tau)}[f(\tau)] = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} f(\tau)\right]\]

Importance weight: $\frac{p_{\theta’}(\tau)}{p_{\theta}(\tau)}$

For trajectories, this ratio expands to:

\[\frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} = \frac{p(s_1) \prod_{t=1}^T \pi_{\theta'}(a_t|s_t) p(s_{t+1}|s_t, a_t)}{p(s_1) \prod_{t=1}^T \pi_{\theta}(a_t|s_t) p(s_{t+1}|s_t, a_t)} = \prod_{t=1}^T \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}\]
Critical observation: Dynamics $p(s_{t+1} s_t, a_t)$ cancel out! Only policy ratios remain.

Off-Policy Policy Gradient (Trajectory-Level)

Starting from on-policy gradient with causality and baseline, where $\pi_{\theta’}$ is latest policy:

\[\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)} \left[ \left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\right) \left(\left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right) - b\right) \right]\]

Apply importance sampling to sample from $\pi_\theta$ instead:

\[\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ \frac{p_{\theta'}(\tau)}{p_{\theta}(\tau)} \left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\right) \left(\left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right) - b\right) \right]\]

Substitute the importance weight:

\[\nabla_{\theta'} J(\theta') = \mathbb{E}_{\tau \sim p_{\theta}(\tau)} \left[ \prod_{t=1}^T \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)} \left(\sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_t|s_t)\right) \left(\left(\sum_{t'=t}^T r(s_{t'}, a_{t'})\right) - b\right) \right]\]

Problem with trajectory-level IS:

  • Product $\prod_{t=1}^T \frac{\pi_{\theta’}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}$ can explode or vanish for long horizons
  • High variance, unstable training

Off-Policy Policy Gradient (Timestep-Level)

Better approach: Apply importance sampling at individual timestep level instead of full trajectory

\[\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(a_{i,t}|s_{i,t})}{\pi_{\theta}(a_{i,t}|s_{i,t})} \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}|s_{i,t}) \left(\left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right) - b\right)\]

Key advantages:

  • Importance weight is now per-timestep: $\frac{\pi_{\theta’}(a_{i,t} \mid s_{i,t})}{\pi_{\theta}(a_{i,t} \mid s_{i,t})}$ (single ratio, not product)
  • Much less likely to explode/vanish
  • More stable variance

Common simplification: When $\pi_{\theta’}$ hasn’t changed much from $\pi_\theta$, approximate ratio as $\approx 1$:

\[\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t}|s_{i,t}) \left(\left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\right) - b\right)\]

Off-Policy Algorithm

Full algorithm:

  1. Sample trajectories ${\tau^i}$ from old policy $\pi_\theta(a_t\mid s_t)$
  2. Compute off-policy gradient $\nabla_\theta J(\theta)$ using ${\tau^i}$ with importance weights
  3. Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
  4. Key benefit: Can take multiple gradient steps on same batch before resampling

Policy Constraint Problem

Issue: If $\pi_{\theta’}$ drifts too far from $\pi_\theta$ during updates:

  • Old data no longer representative of states new policy visits
  • Importance weights become inaccurate
  • Gradient estimates degrade

Solution: Constrain policy updates to keep $\pi_{\theta’}$ close to $\pi_\theta$

Common constraint (KL divergence):

\[\mathbb{E}_{s \sim \pi_\theta} [D_{KL}(\pi_{\theta'}(\cdot | s) \| \pi_{\theta}(\cdot | s))] \leq \delta\]

This ensures the new policy doesn’t deviate too much from the old one in expectation.

Methods using this constraint:

  • TRPO (Trust Region Policy Optimization): Hard constraint via constrained optimization
  • PPO (Proximal Policy Optimization): Soft constraint via clipped objective

Summary: On-Policy vs Off-Policy

Aspect On-Policy (Vanilla PG) Off-Policy (IS)
Data source Current policy $\pi_{\theta’}$ Old policy $\pi_\theta$
Sample efficiency Low (discard all data after update) High (reuse data)
Gradient $\mathbb{E}{\tau \sim \pi{\theta’}}[\cdots]$ $\mathbb{E}{\tau \sim \pi{\theta}}[\text{IS weight} \cdot \cdots]$
Variance Lower Higher (due to IS weights)
Updates per batch 1 Multiple
Constraint needed? No Yes (to prevent drift)

Key insight: Off-policy methods trade increased variance (from importance sampling) for dramatically improved sample efficiency (by reusing data).

Lecture Slides

References

  1. Chelsea Finn, Sergey Levine. Lecture 03 Video CS224R Stanford (2025)
  2. Chelsea Finn, Sergey Levine. Lecture 03 Slides CS224R Stanford (2025)