Lec 01 - Introduction to RL
Core concepts, MDP vs POMDP, policy and value functions, RL algorithm types
Core Concepts
Supervised Learning: Learn a mapping $f(x) \approx y$ from labeled pairs
Reinforcement Learning: Learn policy $\pi(a|s)$
Fundamental Components
State Representation
State ($s_t$): Complete description of the environment, i.e., the state of the world at time $t$ (e.g., robot position, velocity)
Observation ($o_t$): Partial view when full state isn’t accessible: $o_t = f(s_t)$
Action ($a_t$): Decision taken at time $t$
TODO: Image of state representation
Sequences and Dynamics
Trajectory ($\tau$): Sequence of states/observations and actions \(\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T)\)
- Also called: rollout or episode
- Length $T$ can be variable
Transition Dynamics: $p(s_{t+1} \mid s_t, a_t)$ - probability of next state given current state-action
Trajectory Distribution: \(p(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t | s_t) p(s_{t+1} | s_t, a_t)\)
Reward ($r(s, a)$): Scalar feedback indicating how good state-action pair is
TODO: Add image of trajectory example
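The trajectory distribution above can be made concrete with a rollout in a toy MDP. This is a minimal sketch: the 2-state/2-action numbers, the policy table, and the initial distribution are all made up for illustration. Sampling step by step while multiplying the factors $p(s_1)$, $\pi_\theta(a_t \mid s_t)$, and $p(s_{t+1} \mid s_t, a_t)$ yields the probability $p(\tau)$ of the sampled trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T = 2, 2, 5
# Hypothetical transition tensor: P[s, a, s'] = p(s' | s, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# Hypothetical policy table: pi[s, a] = pi(a | s)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
p_s1 = np.array([1.0, 0.0])  # initial state distribution p(s_1)

# Roll out tau = (s_1, a_1, ..., s_T, a_T), accumulating
# p(tau) = p(s_1) * prod_{t=1}^{T} pi(a_t | s_t) p(s_{t+1} | s_t, a_t)
s = rng.choice(n_states, p=p_s1)
p_tau = p_s1[s]
traj = []
for t in range(T):
    a = rng.choice(n_actions, p=pi[s])
    p_tau *= pi[s, a]
    traj.append((s, a))
    s_next = rng.choice(n_states, p=P[s, a])
    p_tau *= P[s, a, s_next]
    s = s_next

print(traj, p_tau)
```

Each factor is at most 1, so longer trajectories have exponentially smaller probabilities, which is why RL algorithms work with expectations over $p(\tau)$ rather than individual trajectory probabilities.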
Markov Property
Fully Observable (MDP): \(s_{t+1} \perp s_{0:t-1} \mid s_t\) Future independent of past given present. Current state contains all relevant information.
Partially Observable (POMDP): \(o_{t+1} \not\perp o_{0:t-1} \mid o_t\) History matters when observations don’t capture full state. Need: $a_t = \pi(o_{t-m}, \ldots, o_t)$
Key distinction: States are Markovian; observations may not be.
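The distinction can be shown with a toy example (assumed for illustration, not from the lecture): a point mass whose full state is (position, velocity) with deterministic dynamics, observed through position alone. The state process is Markov, but the observation process is not.

```python
# Full state s = (position, velocity); deterministic dynamics pos += vel.
def step(state):
    pos, vel = state
    return (pos + vel, vel)

def observe(state):
    return state[0]  # observation = position only

# Two states with identical observations but different (hidden) velocities...
s_a, s_b = (0.0, +1.0), (0.0, -1.0)
assert observe(s_a) == observe(s_b)  # same o_t

# ...produce different next observations: o_t alone cannot predict o_{t+1},
# so the observation process is not Markov even though the state process is.
assert observe(step(s_a)) != observe(step(s_b))
```

This is exactly why a POMDP policy may need a history of observations ($a_t = \pi(o_{t-m}, \ldots, o_t)$): here, two consecutive positions suffice to recover the velocity.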
Goal and Objectives
Maximize expected sum of rewards: \(\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(s_t, a_t) \right]\)
Discount Factor ($\gamma$): $0 < \gamma \leq 1$
- Keeps the return well-defined for long or infinite horizons and weights near-term rewards more heavily
- Discounted return: $\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t)$
- $\gamma \to 1$: far-sighted, $\gamma \to 0$: myopic
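The discounted return can be computed with a backward accumulation, since $G_t = r_t + \gamma G_{t+1}$. A short sketch with an illustrative reward sequence:

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=1}^{T} gamma^{t-1} r_t via backward accumulation."""
    g = 0.0
    for r in reversed(rewards):  # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]  # made-up reward sequence
assert discounted_return(rewards, 1.0) == 3.0  # gamma -> 1: far-sighted (plain sum)
assert discounted_return(rewards, 0.0) == 1.0  # gamma -> 0: myopic (first reward only)
assert abs(discounted_return(rewards, 0.9) - (1.0 + 0.9 * 0.0 + 0.81 * 2.0)) < 1e-12
```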
Stochasticity sources:
- Transition dynamics $p(s_{t+1} \mid s_t, a_t)$
- Stochastic policy $\pi_\theta(a_t \mid s_t)$
Policy and Value Functions
Policy ($\pi$): Maps states/observations to actions
- $\pi(a \mid s)$: probability distribution over actions given state
- Parameterized: $\pi_\theta(a \mid s)$
Value Function ($V^\pi(s)$): Future expected reward starting at $s$, following $\pi$ \(V^\pi(s) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \mid s_t = s \right]\)
Q-Function ($Q^\pi(s, a)$): Future expected reward starting at $s$, taking $a$, then following $\pi$ \(Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \mid s_t = s, a_t = a \right]\)
Relationship: $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} [Q^\pi(s, a)]$
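The relationship $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q^\pi(s, a)]$ can be verified numerically with finite-horizon policy evaluation in a tiny MDP. All numbers below are hypothetical; the recursion used is $Q_t(s, a) = r(s, a) + \sum_{s'} p(s' \mid s, a)\, V_{t+1}(s')$ and $V_t(s) = \sum_a \pi(a \mid s)\, Q_t(s, a)$.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are illustrative only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s'] = p(s' | s, a)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r(s, a)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                # pi(a | s)
T = 5

V = np.zeros(2)                            # terminal condition V_{T+1} = 0
for t in range(T):
    Q = r + P @ V                          # Q_t(s, a): shape (2, 2)
    V = (pi * Q).sum(axis=1)               # V_t(s) = E_{a~pi}[Q_t(s, a)]

# The relationship holds at every step by construction:
assert np.allclose(V, (pi * Q).sum(axis=1))
```

Note that $V$ is recovered from $Q$ by averaging over the policy's action distribution, which is exactly the expectation in the relationship above.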
Why Stochastic Policies?
- Exploration: To learn from experience, must try different things
- Modeling stochastic behavior: Real-world data exhibits varied behaviors, which stochastic policies can capture using generative modeling tools
Problem Formulation Checklist
Define:
- State $s$ or observation $o$
- Action $a$
- Trajectory $\tau$ (including horizon $T$)
- Reward $r(s, a)$
Flow: $s_t \xrightarrow{\pi(\cdot \mid s_t)} a_t \xrightarrow{p(\cdot \mid s_t, a_t)} s_{t+1} \rightarrow \tau$
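This flow can be sketched as an interaction loop. The policy, dynamics, and reward below are stand-ins invented for illustration; the point is the structure: sample $a_t$ from the policy, step the dynamics to get $s_{t+1}$, and collect the pairs into $\tau$.

```python
import random

random.seed(0)

def policy(s):
    """Stand-in stochastic policy pi(. | s): pick an action at random."""
    return random.choice([0, 1])

def dynamics(s, a):
    """Stand-in noisy transition p(. | s, a)."""
    return s + a + random.choice([-1, 0, 1])

def reward(s, a):
    """Stand-in reward r(s, a)."""
    return float(a == 1)

T = 10
s, tau, total_r = 0, [], 0.0
for t in range(T):
    a = policy(s)              # s_t --pi--> a_t
    tau.append((s, a))         # accumulate trajectory
    total_r += reward(s, a)
    s = dynamics(s, a)         # (s_t, a_t) --p--> s_{t+1}

assert len(tau) == T
```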
Examples
Example 1: Robotic Arm (MDP)
State: RGB images, joint positions, joint velocities
Action: Commanded next joint position
Trajectory: 10-sec sequence at 20 Hz, $T = 200$
Reward:
\(r(s, a) = \begin{cases}
1 & \text{if towel on hook} \\
0 & \text{otherwise}
\end{cases}\)
Example 2: Chatbot (POMDP)
Observation: User’s most recent message
Action: Chatbot’s next message
Trajectory: Variable length conversation
Reward:
\(r(s, a) = \begin{cases}
+1 & \text{if upvote} \\
-10 & \text{if downvote} \\
0 & \text{if no feedback}
\end{cases}\)
Types of RL Algorithms
- Imitation Learning: Mimic a policy that achieves high reward
- Policy Gradients: Directly differentiate the RL objective
- Actor-Critic: Estimate value of current policy and use it to improve
- Value-Based: Estimate value of optimal policy
- Model-Based: Learn dynamics model, use for planning or policy improvement
Why Many Algorithms?
Different trade-offs and assumptions:
- Data collection ease: Simulator vs. real-world
- Supervision forms: Demonstrations, detailed rewards
- Stability: Importance of reliable convergence
- Action space: Dimensionality, continuous vs. discrete
- Model learning: Ease of learning accurate dynamics
References
- Lecture 01 Video, CS224R, Stanford (2025)
- Lecture 01 Slides, CS224R, Stanford (2025)