Lec 04 - Actor-Critic
Value functions, advantage estimation, actor-critic algorithm, bootstrapping
Actor-Critic Method
Basis for PPO (used in language-model training)
- Value Function $V^{\pi}(s)$: expected future rewards starting at $s$ and following $\pi$ \(V^{\pi}(s_t) = \sum_{t'=t}^{T} \mathbb{E}_{s_{t'}, a_{t'} \sim \pi} \left[ r(s_{t'}, a_{t'}) | s_t \right]\)
here, every future action $a_{t'}$ is sampled from $\pi$
- Q-Function $Q^{\pi}(s,a)$: expected future rewards starting at $s$, taking $a$, and then following $\pi$
here, we take a specific action $a_t$ first (which might not come from $\pi$), then follow $\pi$ for all subsequent actions.
\[V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)} \left[Q^{\pi}(s,a)\right]\]- Advantage $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$: how much better it is to take $a$ than to follow $\pi$ at state $s$
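As a quick numeric illustration (made-up values for a single state with two actions), $V$ is the policy-weighted average of $Q$, and $A$ measures each action against that average:

```python
# Tiny worked example (hypothetical numbers) relating Q, V, and A at one state.
Q = {"left": 1.0, "right": 3.0}          # Q(s, a): expected return per action
pi = {"left": 0.5, "right": 0.5}         # pi(a | s)

V = sum(pi[a] * Q[a] for a in Q)         # V(s) = E_{a~pi}[Q(s, a)] = 2.0
A = {a: Q[a] - V for a in Q}             # A(s, a) = Q(s, a) - V(s)

print(V)   # 2.0
print(A)   # {'left': -1.0, 'right': 1.0}
```

Note the advantages are centered: actions better than the policy average get positive advantage, worse ones negative.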
Improving policy gradient
Estimating reward-to-go: estimate of future rewards if we take $a_{i,t'}$ in state $s_{i,t'}$
$\hat{Q}_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$
Instead of using a single sampled sum of rewards, we can run the policy many times from the same state and action and average, which gives the Q-function, the true expected reward-to-go:
\(\sum_{t'=t}^{T} \mathbb{E}_{\pi_{\theta}} \left[ r(s_{t'}, a_{t'}) | s_{t}, a_{t} \right] = Q(s_{t}, a_{t})\)
Therefore, \(\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'}) - b\right)\)
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \left(Q(s_{i,t}, a_{i,t}) - V^{\pi}(s_{i,t})\right)\]where $b = \frac{1}{N} \sum_i Q(s_{i,t}, a_{i,t})$, which approximates $V^{\pi}(s_t)$
\[\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) A^{\pi}(s_{i,t}, a_{i,t})\]This is an unbiased estimate.
Better estimates of $A$ lead to less noisy gradients!
Algorithm
Initialize the policy (randomly, with imitation learning, with heuristics)
Do till convergence:
- Run policy to collect batch of data
- Fit model to estimate expected return: estimate $V^{\pi}$, $Q^{\pi}$, or $A^{\pi}$
- Improve policy: $\theta \leftarrow \theta + \alpha \nabla_{\theta}J(\theta)$
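The three-step loop above can be sketched in code. Everything below is an illustrative toy, not the lecture's algorithm verbatim: a 1-D Gaussian policy $a \sim \mathcal{N}(\theta, 1)$ with reward $r(a) = -a^2$ (so the optimum is $\theta = 0$), and a constant baseline standing in for the fitted value estimate:

```python
import random

# Runnable sketch of the generic loop: collect data, estimate expected
# return, improve the policy. All details here are a hypothetical toy.
random.seed(0)

def run_policy(theta, n=256):
    # 1. Run policy to collect a batch of (action, reward) pairs.
    batch = []
    for _ in range(n):
        a = random.gauss(theta, 1.0)
        batch.append((a, -a * a))
    return batch

def fit_baseline(batch):
    # 2. Estimate expected return (a constant baseline stands in for V).
    return sum(r for _, r in batch) / len(batch)

def policy_gradient(theta, batch, b):
    # 3. Score-function gradient: grad_theta log N(a; theta, 1) = (a - theta).
    return sum((a - theta) * (r - b) for a, r in batch) / len(batch)

theta, alpha = 3.0, 0.05
for _ in range(200):                  # "do till convergence"
    batch = run_policy(theta)
    b = fit_baseline(batch)
    theta += alpha * policy_gradient(theta, batch, b)

print(round(theta, 1))                # drifts toward the optimum near 0
```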
Estimate Expected Return
- $A^{\pi}(s_t,a_t) = Q^{\pi}(s_t,a_t) - V^{\pi}(s_t)$
- $A^{\pi}(s_t,a_t) \approx r(s_t, a_t) + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$
This lets us train only $V^{\pi}$, using a neural network (parameters $\phi$) whose output is a scalar value
Derivation
$
\begin{aligned}
Q^{\pi}(s_t, a_t) &= \sum_{t'=t}^{T} \mathbb{E}_{\pi_{\theta}} \left[ r(s_{t'}, a_{t'}) | s_t, a_t \right] \\
&= r(s_t, a_t) + \sum_{t'=t+1}^{T} \mathbb{E}_{\pi_{\theta}} \left[ r(s_{t'}, a_{t'}) | s_t, a_t \right] \\
&= r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(\cdot | s_t, a_t)} \left[ V^{\pi}(s_{t+1}) \right] \\
&\approx r(s_t, a_t) + V^{\pi}(s_{t+1}) \quad \text{(using single sample estimate)}
\end{aligned}
$
$Q^{\pi}(s_t, a_t)$: reward at the current time step + expected value after that time step
Estimating $V^{\pi}$
Single Sample vs Monte Carlo Estimates
Single trajectory estimate: \(V^{\pi}(s_t) \approx \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \quad \text{(single trajectory estimate)}\)
Monte Carlo estimate (true averaging): \(V^{\pi}(s_t) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) \quad \text{(requires resetting to } s_t \text{ for } N \text{ trajectories; often impractical)}\)
where $s_{i,t'}$ denotes the state at timestep $t'$ in trajectory $i$.
Practical Approach: Function Approximation
Since resetting to the same state multiple times is impractical, we use function approximation instead:
Step 1: Aggregate dataset of single sample estimates
Run $N$ trajectories. For each state-timestep pair encountered, compute the return (sum of future rewards): \(y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\)
This gives us the dataset $\{(s_{i,t}, y_{i,t})\}$ containing many state-return pairs from different states.
Step 2: Supervised learning to fit value function
Train a neural network $\hat{V}_{\phi}^{\pi}$ to predict returns by minimizing: \(\mathcal{L}(\phi) = \frac{1}{2} \sum_{i,t} \left\| \hat{V}_{\phi}^{\pi}(s_{i,t}) - y_{i,t} \right\|^2\)
Key Insight: Each training sample $(s_{i,t}, y_{i,t})$ uses a single noisy estimate, but the neural network learns to generalize across many states and effectively “averages out” the noise. This avoids needing to visit the same state multiple times.
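A small sketch of Steps 1-2, with a linear model standing in for the neural network $\hat{V}_{\phi}^{\pi}$; the data-generating process (true value $2s$ plus noise) is made up for illustration:

```python
import numpy as np

# Fit a value function to noisy single-sample returns (toy, hypothetical setup:
# true V(s) = 2*s; each observed return y is V(s) plus noise).
rng = np.random.default_rng(0)
s = rng.uniform(0, 1, size=500)            # states visited across trajectories
y = 2.0 * s + rng.normal(0, 0.5, 500)      # noisy single-trajectory returns

# Stand-in for the neural network: V_hat(s) = w*s + b, trained by gradient
# descent on the squared loss from the notes, L = 0.5 * mean((V_hat - y)^2).
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    err = w * s + b - y
    w -= lr * np.mean(err * s)             # dL/dw
    b -= lr * np.mean(err)                 # dL/db

# Each label is noisy, but the fit averages the noise out across states.
print(round(w, 1), round(b, 1))            # should land near 2.0 and 0.0
```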
Version 2: Bootstrapping (Temporal Difference Learning)
Ideal target (true expected return): \(y_{i,t} = \sum_{t'=t}^{T} \mathbb{E}_{\pi_{\theta}} \left[ r(s_{t'}, a_{t'}) | s_{i,t} \right]\)
Applying the Bellman equation: \(y_{i,t} = r(s_{i,t}, a_{i,t}) + \sum_{t'=t+1}^{T} \mathbb{E}_{\pi_{\theta}} \left[ r(s_{t'}, a_{t'}) | s_{i,t+1} \right] \approx r(s_{i,t}, a_{i,t}) + V^{\pi}(s_{i,t+1})\)
Bootstrapped target (using fitted value): \(y_{i,t} = r(s_{i,t}, a_{i,t}) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})\)
Step 1: Aggregate dataset of “bootstrapped” estimates
Training data: $\left\{ \left( s_{i,t},\ \underbrace{r(s_{i,t}, a_{i,t}) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})}_{y_{i,t}} \right) \right\}$
Important: Update labels $y_{i,t}$ every gradient update!
Step 2: Supervised learning
\[\mathcal{L}(\phi) = \frac{1}{2} \sum_{i,t} \left\| \hat{V}_{\phi}^{\pi}(s_{i,t}) - y_{i,t} \right\|^2\]Also called temporal difference (TD) learning.
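Constructing the bootstrapped targets is a one-liner; the rewards and $\hat{V}$ values below are made-up numbers:

```python
# Build bootstrapped TD targets from one trajectory, assuming a fitted value
# function V_hat (here a hypothetical list of per-timestep value estimates).
rewards = [0.0, 0.0, 1.0, 0.0]          # r(s_t, a_t) along the trajectory
V_hat   = [0.5, 0.6, 0.2, 0.1, 0.0]     # V_hat(s_t) for t = 0..4 (s_4 terminal)

# y_t = r(s_t, a_t) + V_hat(s_{t+1})    (no discount, for simplicity)
targets = [r + V_hat[t + 1] for t, r in enumerate(rewards)]
print(targets)   # [0.6, 0.2, 1.1, 0.0]
```

In practice these labels are recomputed with the current $\hat{V}_{\phi}$ as training proceeds, per the note above.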
Comparison: Monte Carlo vs. Bootstrap
| Method | Target | Variance | Description |
|---|---|---|---|
| Monte Carlo | $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$ | Very high | Supervise with roll-out's summed rewards |
| Bootstrap/TD | $y_{i,t} = r(s_{i,t}, a_{i,t}) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})$ | Much lower | Supervise using reward + value estimate |
Key difference:
- MC: Very high variance (sums many random rewards)
- Bootstrap (TD): Much lower variance (only one reward + value estimate)
Example: a trajectory with $r=0$ at most steps and a single $r=-1$ or $r=1$ somewhere: the MC target has high variance, while the TD target has lower variance.
Example: MC vs TD Estimates
TODO: Add diagram image
Scenario: From state $s$ (blue), we first get $r=0$ and reach a shared next state; from there, two outcomes are equally likely:
- Branch 1 (red): get $r=-1$, then $r=0$ afterwards → return $-1$
- Branch 2 (green): get $r=1$, then $r=0$ afterwards → return $1$
Analysis:
Starting from state $s$ (blue):
Monte Carlo estimates:
- $\hat{V}_{MC}^{\pi}(s)$: if the rollout takes the red branch, MC return = $0 + \cdots + (-1) + \cdots = -1$ (high variance)
- If the rollout takes the green branch, MC return = $0 + \cdots + 1 + \cdots = 1$ (high variance)
- True value: $V^{\pi}(s) = 0$ (average of both paths)
TD estimates:
- $\hat{V}_{TD}^{\pi}(s)$: uses $r + \hat{V}(s_{next})$
- After seeing both branches a few times, $\hat{V}(s_{next})$ learns to predict 0
- TD estimate converges faster to true value with lower variance
Key insight: TD can learn from incomplete episodes and has lower variance because it uses the value function to “fill in” future rewards.
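This can be checked with a tiny simulation of the assumed setup (branch reward $\pm 1$ reached after a shared next state, zeros elsewhere); the numbers are illustrative:

```python
import random
import statistics

# Simulate the two-branch example: from s we get r=0, reach a shared next
# state, then with equal probability see r=-1 or r=+1, then zeros forever.
random.seed(0)

def sample_return():
    # Full return from s for one rollout: 0 + (-1 or +1) + 0 + ...
    return random.choice([-1.0, 1.0])

# Monte Carlo targets for V(s): one full return per rollout.
mc_targets = [sample_return() for _ in range(10000)]

# TD targets for V(s): r(s) + V_hat(s_next). Suppose V_hat(s_next) has been
# fit by averaging many visits and now predicts its true value, 0.
V_hat_next = 0.0
td_targets = [0.0 + V_hat_next for _ in range(10000)]

print(round(statistics.pvariance(mc_targets), 2))  # close to 1.0
print(statistics.pvariance(td_targets))            # 0.0
```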
N-Step Returns: Middle Ground Between MC and TD
Motivation: Instead of using just one reward (TD) or all rewards (MC), we can use the next $n$ rewards and then bootstrap from the value estimate at $t+n$.
N-step return formula: \(y_{i,t} = \sum_{t'=t}^{t+n-1} r(s_{i,t'}, a_{i,t'}) + \hat{V}_{\phi}^{\pi}(s_{i,t+n})\)
Comparison:
| Method | Formula | Variance | Bias |
|---|---|---|---|
| Monte Carlo ($n = T-t$) | $\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$ | Very high | None (unbiased) |
| N-step returns ($1 < n < T$) | $\sum_{t'=t}^{t+n-1} r(s_{i,t'}, a_{i,t'}) + \hat{V}_{\phi}^{\pi}(s_{i,t+n})$ | Medium | Medium |
| 1-step Bootstrap/TD ($n = 1$) | $r(s_{i,t}, a_{i,t}) + \hat{V}_{\phi}^{\pi}(s_{i,t+1})$ | Much lower | Highest |
N-step returns = Middle ground between MC and TD
In practice: $1 < n < T$ often works best!
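A sketch of all three targets for one made-up trajectory (no discount), showing how $n$ interpolates between TD and MC:

```python
# MC, 1-step TD, and n-step targets at t = 0 for one trajectory, using a
# hypothetical fitted value function V_hat per timestep.
rewards = [0.0, 0.0, 0.0, 1.0, 0.0]        # r(s_t, a_t), T = 5
V_hat   = [0.9, 0.8, 0.9, 1.0, 0.1, 0.0]   # V_hat(s_t), t = 0..5 (s_5 terminal)
T = len(rewards)

def n_step_target(t, n):
    # Use the next n real rewards, then bootstrap from V_hat(s_{t+n}).
    end = min(t + n, T)
    return sum(rewards[t:end]) + V_hat[end]

print(n_step_target(0, T))   # n = T: pure Monte Carlo -> 1.0 (V_hat(s_T) = 0)
print(n_step_target(0, 1))   # n = 1: 1-step TD        -> 0.8
print(n_step_target(0, 2))   # n = 2: middle ground    -> 0.9
```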
Aside: Discount Factor $\gamma$
Problem: If $T = \infty$, $\hat{V}_{\phi}^{\pi}$ can become infinitely large.
Solution: Use discount factor $\gamma \in [0, 1]$ (typically 0.99):
\[y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}_{\phi}^{\pi}(s_{i,t+1})\]
Interpretation of $\gamma$
$\gamma = 0.99$ means:
- The agent has a 1% probability per timestep that it will get stuck and stay in a terminal state forever
- Equivalently: $(1 - \gamma) = 0.01$ probability of episode ending at each step
Mathematically: $\gamma$ modifies the MDP dynamics:
- $\tilde{p}(s' | s,a) = \gamma\, p(s' | s,a)$ for non-terminal states
- $\tilde{p}(\text{terminal} | s,a) = 1-\gamma$
Choosing $\gamma$
Rule of thumb:
- If variance increases slowly over time → use larger $\gamma$ (e.g., 0.99, 0.995)
- If variance increases quickly → use smaller $\gamma$ (e.g., 0.95, 0.9)
Why? Larger $\gamma$ means we care more about distant future rewards, which increases variance. If rewards are already highly variable, a smaller $\gamma$ helps reduce this variance.
General N-step form with discount: \(y_{i,t} = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) + \gamma^n \hat{V}_{\phi}^{\pi}(s_{i,t+n})\)
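Sanity check of the discounted n-step formula on made-up numbers:

```python
# Discounted n-step target y_{i,t} for t = 0, computed from a hypothetical
# trajectory and fitted value function.
gamma, n = 0.99, 3
rewards = [1.0, 0.0, 2.0, 0.0, 1.0]        # r(s_t, a_t)
V_hat   = [3.0, 2.5, 2.0, 1.5, 1.0, 0.0]   # V_hat(s_t), t = 0..5

t = 0
y = sum(gamma**k * rewards[t + k] for k in range(n)) + gamma**n * V_hat[t + n]
# y = 1.0 + 0.99*0.0 + 0.99^2 * 2.0 + 0.99^3 * 1.5
print(round(y, 4))   # 4.4156
```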
Actor-Critic Algorithm
Full Algorithm
- Sample batch of data $\{(s_{1,i}, a_{1,i}, \ldots, s_{T,i}, a_{T,i})\}$ from $\pi_{\theta}$
- Fit $\hat{V}_{\phi}^{\pi_{\theta}}$ to returns: \(\mathcal{L}(\phi) = \frac{1}{2} \sum_{i,t} \left\| \hat{V}_{\phi}^{\pi}(s_{i,t}) - y_{i,t} \right\|^2\)
where $y_{i,t} = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) + \gamma^n \hat{V}_{\phi}^{\pi}(s_{i,t+n})$
- Evaluate advantage: \(\hat{A}^{\pi_{\theta}}(s_{t,i}, a_{t,i}) = r(s_{t,i}, a_{t,i}) + \gamma \hat{V}_{\phi}^{\pi_{\theta}}(s_{t+1,i}) - \hat{V}_{\phi}^{\pi_{\theta}}(s_{t,i})\)
- Evaluate policy gradient: \(\nabla_{\theta} J(\theta) \approx \sum_{t,i} \nabla_{\theta} \log \pi_{\theta}(a_{t,i} | s_{t,i}) \hat{A}^{\pi_{\theta}}(s_{t,i}, a_{t,i})\)
- Update policy: $\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)$
Architecture:
- Actor: $\pi_{\theta}(a | s)$ outputs action probabilities
- Critic: $\hat{V}_{\phi}^{\pi}(s)$ estimates state values
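The full algorithm fits in a short script. This is a hypothetical toy instance (tabular critic, softmax actor, a 2-state MDP where reward is 1 when the action index matches the state index), not the lecture's exact setup:

```python
import numpy as np

# Minimal actor-critic on a made-up 2-state MDP: r(s, a) = 1 if a == s else 0,
# next state uniform at random. Actor: softmax over logits. Critic: tabular V.
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, lr_pi, lr_v = 0.9, 0.1, 0.1

theta = np.zeros((n_states, n_actions))   # actor: policy logits
V = np.zeros(n_states)                    # critic: value estimates

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for _ in range(5000):
    pi = policy(s)
    a = rng.choice(n_actions, p=pi)       # sample action from actor
    r = 1.0 if a == s else 0.0
    s_next = rng.integers(n_states)

    # TD advantage: A = r + gamma * V(s') - V(s)
    adv = r + gamma * V[s_next] - V[s]
    # Critic update: move V(s) toward the bootstrapped target
    V[s] += lr_v * adv
    # Actor update: for softmax logits, grad log pi(a|s) = onehot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += lr_pi * adv * grad_log_pi
    s = s_next

print(policy(0), policy(1))   # each should put most mass on a == s
```

The same structure scales up by replacing the tables with neural networks for $\pi_\theta$ and $\hat{V}_\phi$.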
Summary: Policy Evaluation Methods
Three approaches:
- Monte Carlo: Supervised learning on sum of future rewards
- Very high variance, unbiased
- Bootstrapping (1-step TD): Supervised learning on reward + next state value
- Much lower variance, biased
- N-step returns: Sum of next $n$ rewards + value after that
- Middle ground: balanced variance and bias
Off-Policy Actor-Critic
Version 1: Multiple Gradient Steps
Reuse collected data for multiple gradient updates. Same algorithm as above, but:
- In policy gradient step: use importance weights to correct for distribution mismatch
Note: When the policy changes after multiple updates, the data becomes off-policy. Importance sampling corrects for this, leading to algorithms like PPO, TRPO, and SAC.
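A toy sketch of the importance-weight correction (single state, two actions; all numbers are illustrative):

```python
import numpy as np

# Reweight each sample by pi_new(a|s) / pi_old(a|s) so that data collected
# under the old policy can still estimate quantities under the new policy.
pi_old = np.array([0.5, 0.5])     # behavior policy that collected the data
pi_new = np.array([0.7, 0.3])     # current policy after some updates
advantages = np.array([1.0, -1.0])

ratios = pi_new / pi_old          # importance weights
# Expectation under the data-collection distribution pi_old of the
# ratio-weighted advantage; analytically this equals E_{a~pi_new}[A(s,a)].
surrogate = np.sum(pi_old * ratios * advantages)
print(surrogate)                  # 0.7*1.0 + 0.3*(-1.0) = 0.4
```

This ratio-weighted objective is the surrogate that PPO then clips to keep the new policy close to the old one.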