Lec 02 - Imitation Learning
Behavioral cloning, DAgger, HG-DAgger, and addressing compounding errors
Goal of Imitation Learning
Learn a policy $\pi_\theta$ that performs at the expert level by mimicking expert demonstrations.
Input: Dataset $\mathcal{D} := \{(s_1, a_1), \ldots, (s_T, a_T)\}$ collected by an expert policy $\pi_{expert}$
Example Application: Autonomous driving
- Sensor readings + steering commands from human drivers
Behavioral Cloning (BC)
Version 0: Deterministic Policy
Algorithm:
- Given expert demonstrations $\mathcal{D}$
- Train policy using supervised regression: \(\min_\theta \frac{1}{|\mathcal{D}|} \sum_{(s,a)\in\mathcal{D}} \|a - \hat{a}\|^2 \quad \text{where } \hat{a} = \pi_\theta(s)\) (see the sketch after this list)
- Deploy learned policy $\pi_\theta$
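A minimal sketch of this recipe in PyTorch; the dimensions, network size, and optimizer settings are illustrative assumptions rather than the lecture's implementation:

```python
import torch
import torch.nn as nn

# Illustrative dimensions; replace with your environment's state/action sizes.
STATE_DIM, ACTION_DIM = 32, 4

# Deterministic policy: a plain MLP mapping states to actions.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_update(states, actions):
    """One supervised-regression step: minimize ||a - pi_theta(s)||^2 on a batch."""
    pred = policy(states)                          # \hat{a} = pi_theta(s)
    loss = ((actions - pred) ** 2).sum(-1).mean()  # mean squared L2 error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```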
Problem: Fails with Multimodal Data
- When data collected from multiple experts/people
- Different driving styles (e.g., merge left vs. stay straight)
- L2 regression averages the demonstrated behaviors and outputs their mean
- The policy learns to “average” contradictory actions, e.g., steering halfway between merging and staying straight, which causes crashes
Key Insight: Happens all the time in practice with multi-human datasets
Version 1: Expressive Policies
Algorithm:
- Given expert demonstrations $\mathcal{D}$
- Train a generative model of expert actions: \(\min_\theta\, -\mathbb{E}_{(s,a)\sim\mathcal{D}}[\log \pi_\theta(a \mid s)]\), i.e., maximize the log probability of the demonstrated actions under the policy (see the sketch after this list)
- Deploy learned policy $\pi_\theta$
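A minimal sketch of one training step under this objective, assuming the policy object exposes a `log_prob(actions, states)` method (the mixture-of-Gaussians head sketched after the next list provides exactly this interface):

```python
import torch

def nll_update(policy, optimizer, states, actions):
    """One BC step with an expressive policy: maximize log pi_theta(a | s)."""
    loss = -policy.log_prob(actions, states).mean()  # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```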
Three Generative Model Approaches:
- Mixture of Gaussians (GMM)
- Output: $\mu_1, \sigma_1, w_1, \mu_2, \sigma_2, w_2, \ldots$
- Can represent multimodal distributions
- Discretize + Autoregressive
- Output: \(p(a_{t,1}), p(a_{t,2} \mid \hat{a}_{t,1}), p(a_{t,3}\mid \hat{a}_{t,1:2}), \cdots\)
- Sequential action prediction
- In driving, steering and acceleration can each be discretized into bins
- Each action dimension is predicted conditioned on the previously predicted dimensions
- Diffusion
- Iteratively denoise: condition on the state $s_t$ and a noised action $a_t + \sum_i \epsilon_i$
- Run $n = N, \ldots, 1$ denoising steps
- Output: a noise estimate $\epsilon_n$ at each step, which is subtracted off to recover the action
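As one concrete (illustrative) instance, the GMM approach can be implemented with `torch.distributions`; the layer sizes and number of modes below are assumptions, and the class exposes the `log_prob` interface assumed in the training step above:

```python
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class GMMPolicy(nn.Module):
    """Outputs mixture weights w_k, means mu_k, and scales sigma_k for each state."""
    def __init__(self, state_dim, action_dim, n_modes=5, hidden=256):
        super().__init__()
        self.n_modes, self.action_dim = n_modes, action_dim
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, n_modes)                 # mixture weights
        self.means = nn.Linear(hidden, n_modes * action_dim)     # component means
        self.log_stds = nn.Linear(hidden, n_modes * action_dim)  # component scales

    def dist(self, states):
        h = self.trunk(states)
        mix = Categorical(logits=self.logits(h))
        mu = self.means(h).view(-1, self.n_modes, self.action_dim)
        std = self.log_stds(h).view(-1, self.n_modes, self.action_dim).exp()
        comp = Independent(Normal(mu, std), 1)  # diagonal Gaussian per mode
        return MixtureSameFamily(mix, comp)     # multimodal action distribution

    def log_prob(self, actions, states):
        return self.dist(states).log_prob(actions)

    def act(self, state):
        # state: [state_dim]; sample one action (commit to a mode, don't average).
        return self.dist(state.unsqueeze(0)).sample()[0]
```

Sampling from the mixture at deployment, rather than taking a mean, is what lets the policy commit to a single mode (merge left or stay straight) instead of averaging the two.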
Important Note: Neural network expressivity is often distinct from the expressivity of the action distribution: even a very large network trained with L2 regression still outputs a single mean action
Empirical Results (Version 1 vs Version 0):
- Simulated transport task:
- Data collected by a single human: Diffusion (1.0), GMM (0.9)
- Data collected by multiple humans: Diffusion (0.9), GMM (0.4)
- Real shirt-hanging task (multi-human data):
- Diffusion (0.7) vs. L1 regression (0.25)
Addressing Compounding Errors
Problem with Pure BC
BC is fully offline: the policy learns only from a fixed dataset
- Policy errors compound over time
- Agent visits states unseen in training data
- Distribution shift between training and deployment
Solution: Online Data Collection
DAgger (Dataset Aggregation)
Algorithm:
- Roll out learned policy $\pi_\theta$: $s'_1, \hat{a}_1, \ldots, s'_T$
- Query expert action at visited states: $a^* \sim \pi_{expert}(\cdot \mid s')$
- Aggregate corrections with existing data: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s', a^*)\}$
- Update policy: $\min_\theta \mathcal{L}(\pi_\theta, \mathcal{D})$
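A sketch of the aggregation loop; `rollout`, `expert_action`, and `train` are hypothetical helpers standing in for running the policy, querying the expert, and supervised BC training on the aggregated dataset:

```python
def dagger(policy, dataset, rollout, expert_action, train, n_iters=10):
    """DAgger: relabel states visited by the learner with expert actions."""
    for _ in range(n_iters):
        visited_states = rollout(policy)          # roll out pi_theta, record s'_1..s'_T
        dataset += [(s, expert_action(s))         # D <- D U {(s', a*)}
                    for s in visited_states]
        policy = train(policy, dataset)           # min_theta L(pi_theta, D)
    return policy
```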
Advantages:
- Data-efficient way to learn from expert
- No reward function needed
- Can achieve reliable performance
Disadvantages:
- Challenging to query expert when agent has control
- May need impractically large amounts of data
HG-DAgger (Human-Gated DAgger)
Algorithm:
- Start rolling out the learned policy $\pi_\theta$: $s'_1, \hat{a}_1, \ldots, s'_t$
- Expert intervenes at time $t$ when the policy makes a mistake
- Expert provides a (partial) demonstration: $s'_t, a^*_t, \ldots, s'_T$
- Aggregate new demos with existing data: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s'_i, a^*_i)\}$ for $i \geq t$
- Update policy: $\min_\theta \mathcal{L}(\pi_\theta, \mathcal{D})$
Key Difference: Corrective data is collected while the expert takes full control, rather than relabeling states that the learned policy continues to visit (see the sketch below)
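A sketch of the human-gated variant with the same hypothetical helpers; `interactive_rollout` stands in for an interface that runs $\pi_\theta$ until the human takes over and records only the expert-controlled segment:

```python
def hg_dagger(policy, dataset, interactive_rollout, train, n_iters=10):
    """HG-DAgger: the expert intervenes and demonstrates from the point of failure."""
    for _ in range(n_iters):
        # Runs pi_theta until the expert intervenes at some time t, then returns
        # the expert-controlled segment {(s'_i, a*_i) : i >= t}.
        expert_segment = interactive_rollout(policy)
        dataset += expert_segment                 # aggregate only the corrections
        policy = train(policy, dataset)           # min_theta L(pi_theta, D)
    return policy
```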
Advantages:
- Much more practical interface for providing corrections
- Expert takes control when needed
Disadvantages:
- Can be hard to catch mistakes quickly in some domains
- Requires ability to intervene in real-time
Open Question: Could you automatically detect when intervention is needed?
Summary Comparison
Behavioral Cloning (BC) - Offline
Definition: Train policy to mimic offline expert demonstrations
Properties:
- Best with expressive generative models over actions
- Fully offline algorithm
Advantages:
- Simple, no need for online data collection
- Safe (offline data can be verified)
- No reward function needed
Disadvantages:
- Doesn’t provide a framework for self-improvement
- Compounding errors in deployment
DAgger / HG-DAgger - Online
Definition: Improve policy using online expert interventions
Properties:
- Requires interface for human/expert intervention
- Algorithm runs policy online
Advantages:
- Possible path to reliable performance
- More data-efficient than offline BC
- No reward function needed
Disadvantages:
- May need impractically large amounts of data for reliable performance
- Requires ability to query expert online
- Safety concerns with online data collection
Key Takeaway
Many successful methods combine imitation learning and reinforcement learning:
- Use BC for initialization
- Use RL for fine-tuning and self-improvement
- Best of both worlds: expert knowledge + autonomous learning
Next: Lecture 03 - Policy Gradients - Learn how to optimize policies directly using gradients
References
- Lecture 02 Video CS224R Stanford (2025)
- Lecture 02 Slides CS224R Stanford (2025)