Lec 01 - Introduction to RL
Core concepts, MDP vs POMDP, policy and value functions, RL algorithm types
Core Concepts
Supervised Learning: Learn a mapping $f(x) \approx y$ from labeled pairs
Reinforcement Learning: Learn policy $\pi(a|s)$
Fundamental Components
State Representation
State ($s_t$): Complete description of the environment, i.e., the state of the world at time $t$ (e.g., robot position, velocity)
Observation ($o_t$): Partial view when full state isn’t accessible: $o_t = f(s_t)$
Action ($a_t$): Decision taken at time $t$
TODO: Image of state representation
Sequences and Dynamics
Trajectory ($\tau$): Sequence of states/observations and actions \(\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T)\)
- Also called: rollout or episode
- Length $T$ can be variable
Transition Dynamics: $p(s_{t+1} \mid s_t, a_t)$ - probability of next state given current state-action
Trajectory Distribution: \(p(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t | s_t) p(s_{t+1} | s_t, a_t)\)
Reward ($r(s, a)$): Scalar feedback indicating how good state-action pair is
TODO: Add image of trajectory example
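The trajectory distribution above can be made concrete with a rollout in a toy MDP. This is a minimal sketch: the 2-state/2-action numbers, the policy table, and the initial distribution are all made up for illustration. Sampling step by step while multiplying the factors $p(s_1)$, $\pi_\theta(a_t \mid s_t)$, and $p(s_{t+1} \mid s_t, a_t)$ yields the probability $p(\tau)$ of the sampled trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, T = 2, 2, 5
# Hypothetical transition tensor: P[s, a, s'] = p(s' | s, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# Hypothetical policy table: pi[s, a] = pi(a | s)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
p_s1 = np.array([1.0, 0.0])  # initial state distribution p(s_1)

# Roll out tau = (s_1, a_1, ..., s_T, a_T), accumulating
# p(tau) = p(s_1) * prod_{t=1}^{T} pi(a_t | s_t) p(s_{t+1} | s_t, a_t)
s = rng.choice(n_states, p=p_s1)
p_tau = p_s1[s]
traj = []
for t in range(T):
    a = rng.choice(n_actions, p=pi[s])
    p_tau *= pi[s, a]
    traj.append((s, a))
    s_next = rng.choice(n_states, p=P[s, a])
    p_tau *= P[s, a, s_next]
    s = s_next

print(traj, p_tau)
```

Each factor is at most 1, so longer trajectories have exponentially smaller probabilities, which is why RL algorithms work with expectations over $p(\tau)$ rather than individual trajectory probabilities.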
Markov Property
Fully Observable (MDP): \(s_{t+1} \perp s_{0:t-1} \mid s_t\) Future independent of past given present. Current state contains all relevant information.
Partially Observable (POMDP): \(o_{t+1} \not\perp o_{0:t-1} \mid o_t\) History matters when observations don’t capture full state. Need: $a_t = \pi(o_{t-m}, \ldots, o_t)$
Key distinction: States are Markovian; observations may not be.
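The distinction can be shown with a toy example (assumed for illustration, not from the lecture): a point mass whose full state is (position, velocity) with deterministic dynamics, observed through position alone. The state process is Markov, but the observation process is not.

```python
# Full state s = (position, velocity); deterministic dynamics pos += vel.
def step(state):
    pos, vel = state
    return (pos + vel, vel)

def observe(state):
    return state[0]  # observation = position only

# Two states with identical observations but different (hidden) velocities...
s_a, s_b = (0.0, +1.0), (0.0, -1.0)
assert observe(s_a) == observe(s_b)  # same o_t

# ...produce different next observations: o_t alone cannot predict o_{t+1},
# so the observation process is not Markov even though the state process is.
assert observe(step(s_a)) != observe(step(s_b))
```

This is exactly why a POMDP policy may need a history of observations ($a_t = \pi(o_{t-m}, \ldots, o_t)$): here, two consecutive positions suffice to recover the velocity.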
Goal and Objectives
Maximize expected sum of rewards: \(\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(s_t, a_t) \right]\)
Discount Factor ($\gamma$): $0 < \gamma \leq 1$
- Keeps the return well-defined for long or infinite horizons and weights near-term rewards more heavily
- Discounted return: $\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t)$
- $\gamma \to 1$: far-sighted, $\gamma \to 0$: myopic
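The discounted return can be computed with a backward accumulation, since $G_t = r_t + \gamma G_{t+1}$. A short sketch with an illustrative reward sequence:

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=1}^{T} gamma^{t-1} r_t via backward accumulation."""
    g = 0.0
    for r in reversed(rewards):  # G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]  # made-up reward sequence
assert discounted_return(rewards, 1.0) == 3.0  # gamma -> 1: far-sighted (plain sum)
assert discounted_return(rewards, 0.0) == 1.0  # gamma -> 0: myopic (first reward only)
assert abs(discounted_return(rewards, 0.9) - (1.0 + 0.9 * 0.0 + 0.81 * 2.0)) < 1e-12
```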
Stochasticity sources:
- Transition dynamics $p(s_{t+1} \mid s_t, a_t)$
- Stochastic policy $\pi_\theta(a_t \mid s_t)$
Policy and Value Functions
Policy ($\pi$): Maps states/observations to actions
- $\pi(a \mid s)$: probability distribution over actions given state
- Parameterized: $\pi_\theta(a \mid s)$
Value Function ($V^\pi(s)$): Future expected reward starting at $s$, following $\pi$ \(V^\pi(s) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \mid s_t = s \right]\)
Q-Function ($Q^\pi(s, a)$): Future expected reward starting at $s$, taking $a$, then following $\pi$ \(Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \mid s_t = s, a_t = a \right]\)
Relationship: $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} [Q^\pi(s, a)]$
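The relationship $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[Q^\pi(s, a)]$ can be verified numerically with finite-horizon policy evaluation in a tiny MDP. All numbers below are hypothetical; the recursion used is $Q_t(s, a) = r(s, a) + \sum_{s'} p(s' \mid s, a)\, V_{t+1}(s')$ and $V_t(s) = \sum_a \pi(a \mid s)\, Q_t(s, a)$.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers are illustrative only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s'] = p(s' | s, a)
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # r(s, a)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])                # pi(a | s)
T = 5

V = np.zeros(2)                            # terminal condition V_{T+1} = 0
for t in range(T):
    Q = r + P @ V                          # Q_t(s, a): shape (2, 2)
    V = (pi * Q).sum(axis=1)               # V_t(s) = E_{a~pi}[Q_t(s, a)]

# The relationship holds at every step by construction:
assert np.allclose(V, (pi * Q).sum(axis=1))
```

Note that $V$ is recovered from $Q$ by averaging over the policy's action distribution, which is exactly the expectation in the relationship above.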
Why Stochastic Policies?
- Exploration: To learn from experience, must try different things
- Modeling stochastic behavior: Real-world data exhibits varied behaviors, which stochastic policies can capture using generative modeling tools
Problem Formulation Checklist
Define:
- State $s$ or observation $o$
- Action $a$
- Trajectory $\tau$ (including horizon $T$)
- Reward $r(s, a)$
Flow: $s_t \xrightarrow{\pi(\cdot \mid s_t)} a_t \xrightarrow{p(\cdot \mid s_t, a_t)} s_{t+1} \rightarrow \tau$
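This flow can be sketched as an interaction loop. The policy, dynamics, and reward below are stand-ins invented for illustration; the point is the structure: sample $a_t$ from the policy, step the dynamics to get $s_{t+1}$, and collect the pairs into $\tau$.

```python
import random

random.seed(0)

def policy(s):
    """Stand-in stochastic policy pi(. | s): pick an action at random."""
    return random.choice([0, 1])

def dynamics(s, a):
    """Stand-in noisy transition p(. | s, a)."""
    return s + a + random.choice([-1, 0, 1])

def reward(s, a):
    """Stand-in reward r(s, a)."""
    return float(a == 1)

T = 10
s, tau, total_r = 0, [], 0.0
for t in range(T):
    a = policy(s)              # s_t --pi--> a_t
    tau.append((s, a))         # accumulate trajectory
    total_r += reward(s, a)
    s = dynamics(s, a)         # (s_t, a_t) --p--> s_{t+1}

assert len(tau) == T
```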
Examples
Example 1: Robotic Arm (MDP)
State: RGB images, joint positions, joint velocities
Action: Commanded next joint position
Trajectory: 10-sec sequence at 20 Hz, $T = 200$
Reward:
\(r(s, a) = \begin{cases}
1 & \text{if towel on hook} \\
0 & \text{otherwise}
\end{cases}\)
Example 2: Chatbot (POMDP)
Observation: User’s most recent message
Action: Chatbot’s next message
Trajectory: Variable length conversation
Reward:
\(r(s, a) = \begin{cases}
+1 & \text{if upvote} \\
-10 & \text{if downvote} \\
0 & \text{if no feedback}
\end{cases}\)
Types of RL Algorithms
- Imitation Learning: Mimic a policy that achieves high reward
- Policy Gradients: Directly differentiate the RL objective
- Actor-Critic: Estimate value of current policy and use it to improve
- Value-Based: Estimate value of optimal policy
- Model-Based: Learn dynamics model, use for planning or policy improvement
Why Many Algorithms?
Different trade-offs and assumptions:
- Data collection ease: Simulator vs. real-world
- Supervision forms: Demonstrations, detailed rewards
- Stability: Importance of reliable convergence
- Action space: Dimensionality, continuous vs. discrete
- Model learning: Ease of learning accurate dynamics
References
- Lecture 01 Video, CS224R, Stanford (2025)
- Lecture 01 Slides, CS224R, Stanford (2025)