Introduction to RL

Core Concepts

Supervised Learning: Learn a mapping $f(x) \approx y$ from labeled data
Reinforcement Learning: Learn policy $\pi(a|s)$


Fundamental Components

State Representation

State ($s_t$): Complete description of the environment at time $t$ (e.g., robot position, velocity)

Observation ($o_t$): Partial view when full state isn’t accessible: $o_t = f(s_t)$

Action ($a_t$): Decision taken at time $t$

TODO: Image of state representation

Sequences and Dynamics

Trajectory ($\tau$): Sequence of states/observations and actions \(\tau = (s_1, a_1, s_2, a_2, \ldots, s_T, a_T)\)

  • Also called: rollout or episode
  • Length $T$ can be variable

Transition Dynamics: $p(s_{t+1} \mid s_t, a_t)$ - probability of next state given current state-action

Trajectory Distribution: \(p(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t | s_t) p(s_{t+1} | s_t, a_t)\)

Reward ($r(s, a)$): Scalar feedback indicating how good state-action pair is
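The trajectory-distribution factorization above can be sanity-checked numerically. Below is a minimal sketch on a toy tabular MDP; the state/action counts, the `pi` and `p_next` tables, and the function name are all illustrative, not part of the notes.

```python
import numpy as np

# Toy tabular MDP: 2 states, 2 actions (all values here are illustrative).
p_s1 = np.array([0.5, 0.5])                    # initial state distribution p(s_1)
pi = np.array([[0.9, 0.1],                     # pi_theta(a | s): rows = states
               [0.2, 0.8]])
p_next = np.array([[[0.7, 0.3], [0.1, 0.9]],   # p(s' | s, a): indexed [s][a][s']
                   [[0.5, 0.5], [0.6, 0.4]]])

def trajectory_log_prob(states, actions):
    """log p(tau) = log p(s_1) + sum_t [log pi(a_t|s_t) + log p(s_{t+1}|s_t,a_t)]."""
    logp = np.log(p_s1[states[0]])
    for t in range(len(actions)):
        logp += np.log(pi[states[t], actions[t]])
        if t + 1 < len(states):
            logp += np.log(p_next[states[t], actions[t], states[t + 1]])
    return logp

# p(tau) = 0.5 * 0.1 * 0.9 * 0.8 * 0.4 = 0.0144 for this trajectory
logp = trajectory_log_prob(states=[0, 1, 1], actions=[1, 1])
```

Working in log space, as here, avoids numerical underflow when trajectories get long.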

TODO: Add image of trajectory example


Markov Property

Fully Observable (MDP): \(s_{t+1} \perp s_{1:t-1} \mid s_t\) Future independent of past given present. Current state contains all relevant information.

Partially Observable (POMDP): \(o_{t+1} \not\perp o_{1:t-1} \mid o_t\) History matters when observations don’t capture full state. Need: $a_t = \pi(o_{t-m}, \ldots, o_t)$

Key distinction: States are Markovian; observations may not be.
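One common way to realize $a_t = \pi(o_{t-m}, \ldots, o_t)$ is to wrap a policy so it acts on a sliding window of recent observations. This is a sketch, assuming a hypothetical `base_policy` callable and window length `m`:

```python
from collections import deque

def make_history_policy(base_policy, m):
    """Wrap a policy so it conditions on the last m observations."""
    history = deque(maxlen=m)  # oldest observation drops out automatically
    def act(o_t):
        history.append(o_t)
        return base_policy(tuple(history))  # window is shorter at episode start
    return act

# Example: a trivial base policy that acts on the newest observation in the window.
policy = make_history_policy(lambda window: window[-1] % 2, m=3)
actions = [policy(o) for o in [4, 7, 2, 9]]  # -> [0, 1, 0, 1]
```

In deep RL the same idea appears as frame stacking or as a recurrent policy that summarizes the history in a hidden state.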


Goal and Objectives

Maximize expected sum of rewards: \(\max_\theta \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t=1}^{T} r(s_t, a_t) \right]\)

Discount Factor ($\gamma$): $0 < \gamma \leq 1$

  • Keeps the return finite for long or infinite horizons and weights near-term rewards more heavily
  • Discounted return: $\sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t)$
  • $\gamma \to 1$: far-sighted, $\gamma \to 0$: myopic
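The discounted return in the bullets above can be computed with a backward recursion instead of explicit powers of $\gamma$; a minimal sketch (function name is illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=1}^{T} gamma^{t-1} r_t by iterating backward:
    G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# gamma near 1 weights all rewards; gamma = 0 keeps only the first reward.
g = discounted_return([1.0, 1.0, 1.0], gamma=0.9)  # 1 + 0.9 + 0.81 = 2.71
```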

Stochasticity sources:

  1. Transition dynamics $p(s_{t+1} \mid s_t, a_t)$
  2. Stochastic policy $\pi_\theta(a_t \mid s_t)$

Policy and Value Functions

Policy ($\pi$): Maps states/observations to actions

  • $\pi(a \mid s)$: probability distribution over actions given state
  • Parameterized: $\pi_\theta(a \mid s)$

Value Function ($V^\pi(s)$): Expected future reward starting at $s$ and following $\pi$ \(V^\pi(s) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \mid s_t = s \right]\)

Q-Function ($Q^\pi(s, a)$): Expected future reward starting at $s$, taking $a$, then following $\pi$ \(Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \mid s_t = s, a_t = a \right]\)

Relationship: $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} [Q^\pi(s, a)]$
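In the tabular case this relationship is just a policy-weighted average of Q-values per state; a small numerical check, with illustrative `pi` and `Q` tables:

```python
import numpy as np

# Illustrative tabular quantities for a 2-state, 2-action problem.
pi = np.array([[0.9, 0.1],
               [0.2, 0.8]])   # pi(a | s)
Q = np.array([[1.0, 3.0],
              [0.5, 2.0]])    # Q^pi(s, a)

# V^pi(s) = sum_a pi(a|s) * Q^pi(s, a): expectation of Q over the policy.
V = (pi * Q).sum(axis=1)
# V[0] = 0.9*1.0 + 0.1*3.0 = 1.2;  V[1] = 0.2*0.5 + 0.8*2.0 = 1.7
```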


Why Stochastic Policies?

  1. Exploration: To learn from experience, the agent must try different actions, not just the current best guess
  2. Modeling stochastic behavior: Existing data exhibits varying behaviors; stochastic policies can capture this with generative modeling tools

Problem Formulation Checklist

Define:

  1. State $s$ or observation $o$
  2. Action $a$
  3. Trajectory $\tau$ (including horizon $T$)
  4. Reward $r(s, a)$

Flow: $s_t \xrightarrow{\pi(\cdot \mid s_t)} a_t \xrightarrow{p(\cdot \mid s_t, a_t)} s_{t+1} \rightarrow \tau$
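The flow above is exactly the rollout loop used to collect a trajectory. Here is a minimal sketch with stand-in `reset`, `step`, and `policy` callables (all hypothetical, supplied by whatever environment you define):

```python
def rollout(reset, step, policy, T):
    """Collect tau = [(s_1, a_1, r_1), ..., (s_T, a_T, r_T)]."""
    s = reset()                    # sample s_1
    traj = []
    for _ in range(T):
        a = policy(s)              # a_t ~ pi(. | s_t)
        s_next, r = step(s, a)     # s_{t+1} ~ p(. | s_t, a_t), plus r(s_t, a_t)
        traj.append((s, a, r))
        s = s_next
    return traj

# Toy deterministic environment: integer state, action shifts it right.
traj = rollout(
    reset=lambda: 0,
    step=lambda s, a: (s + a, float(s + a == 2)),  # reward 1 when next state is 2
    policy=lambda s: 1,                            # always move right
    T=3,
)
# traj = [(0, 1, 0.0), (1, 1, 1.0), (2, 1, 0.0)]
```

Real environment interfaces (e.g., Gymnasium) follow the same reset/step pattern.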


Examples

Example 1: Robotic Arm (MDP)

State: RGB images, joint positions, joint velocities
Action: Commanded next joint position
Trajectory: 10-sec sequence at 20 Hz, $T = 200$
Reward: \(r(s, a) = \begin{cases} 1 & \text{if towel on hook} \\ 0 & \text{otherwise} \end{cases}\)

Example 2: Chatbot (POMDP)

Observation: User’s most recent message
Action: Chatbot’s next message
Trajectory: Variable length conversation
Reward: \(r(s, a) = \begin{cases} +1 & \text{if upvote} \\ -10 & \text{if downvote} \\ 0 & \text{if no feedback} \end{cases}\)


Types of RL Algorithms

  1. Imitation Learning: Mimic a policy that achieves high reward
  2. Policy Gradients: Directly differentiate the RL objective
  3. Actor-Critic: Estimate value of current policy and use it to improve
  4. Value-Based: Estimate value of optimal policy
  5. Model-Based: Learn dynamics model, use for planning or policy improvement

Why Many Algorithms?

Different trade-offs and assumptions:

  • Data collection ease: Simulator vs. real-world
  • Supervision forms: Demonstrations, detailed rewards
  • Stability: Importance of reliable convergence
  • Action space: Dimensionality, continuous vs. discrete
  • Model learning: Ease of learning accurate dynamics