<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://rohit.vision/blogs/feed.xml" rel="self" type="application/atom+xml" /><link href="https://rohit.vision/blogs/" rel="alternate" type="text/html" /><updated>2026-03-09T16:12:07+00:00</updated><id>https://rohit.vision/blogs/feed.xml</id><title type="html">Rohit Kumar | AI Research Blog</title><subtitle>Deep dives into Computer Vision, LLMs, Diffusion Models, and Agentic AI. Technical tutorials with math, code, and interactive visualizations.</subtitle><author><name>Rohit Kumar</name></author><entry><title type="html">Flash Attention: Making Transformers Scale</title><link href="https://rohit.vision/blogs/posts/flash-attention-scaling/" rel="alternate" type="text/html" title="Flash Attention: Making Transformers Scale" /><published>2024-12-18T00:00:00+00:00</published><updated>2024-12-18T00:00:00+00:00</updated><id>https://rohit.vision/blogs/posts/flash-attention-scaling</id><content type="html" xml:base="https://rohit.vision/blogs/posts/flash-attention-scaling/"><![CDATA[<p>In my <a href="/posts/transformer-attention-deep-dive/">previous post on Transformer Attention</a>, we explored the mathematical foundations of attention. The key limitation? <strong>Quadratic memory complexity</strong> $O(n^2)$ makes long sequences prohibitively expensive. Flash Attention solves this.</p>

<blockquote>
  <p>[!NOTE]
Flash Attention achieves <strong>2-4x speedup</strong> and dramatically reduces memory usage without any approximation — it’s mathematically identical to standard attention.</p>
</blockquote>

<h2 id="the-memory-bottleneck-problem">The Memory Bottleneck Problem</h2>

<p>Standard attention computes and stores the full $n \times n$ attention matrix:</p>

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

<p>For a sequence of length 8192 with batch size 1 (see the quick check below the list):</p>
<ul>
  <li>Attention matrix: $8192^2 \times 4$ bytes = <strong>256 MB</strong> per head</li>
  <li>With 32 heads: <strong>8 GB</strong> just for attention weights!</li>
</ul>
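
<p>A quick back-of-the-envelope check of those numbers (plain Python, float32 assumed):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seq_len, n_heads, bytes_fp32 = 8192, 32, 4

per_head_mb = seq_len * seq_len * bytes_fp32 / 2**20
total_gb = per_head_mb * n_heads / 1024

print(f"{per_head_mb:.0f} MB per head")   # 256 MB
print(f"{total_gb:.0f} GB for 32 heads")  # 8 GB
</code></pre></div></div>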

<blockquote>
  <p>[!CRITICAL]
This quadratic scaling is why GPT-3 was limited to 2048 tokens, and why long-context models like Claude and GPT-4 required architectural innovations.</p>
</blockquote>

<h2 id="flash-attention-the-key-insight">Flash Attention: The Key Insight</h2>

<p>Flash Attention exploits the <strong>memory hierarchy</strong> of modern GPUs:</p>

<table>
  <thead>
    <tr>
      <th>Memory Type</th>
      <th>Size</th>
      <th>Bandwidth</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SRAM (on-chip)</td>
      <td>~20 MB</td>
      <td>~19 TB/s</td>
    </tr>
    <tr>
      <td>HBM (GPU RAM)</td>
      <td>40-80 GB</td>
      <td>~1.5 TB/s</td>
    </tr>
  </tbody>
</table>

<p>The insight: <strong>Memory I/O is the bottleneck, not compute</strong>. Standard attention makes three round trips through HBM (sketched in code after the list):</p>
<ol>
  <li>Loads Q, K from HBM → computes $QK^T$ → writes to HBM</li>
  <li>Loads $QK^T$ from HBM → computes softmax → writes to HBM</li>
  <li>Loads softmax output from HBM → multiplies by V → writes to HBM</li>
</ol>
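
<p>In code, the unfused version looks like this (a sketch to show the materialized intermediates, not how the actual kernels are written):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
import torch

def standard_attention(Q, K, V):
    """Naive attention: every intermediate is materialized in HBM."""
    d_k = Q.shape[-1]
    S = Q @ K.transpose(-1, -2) / math.sqrt(d_k)  # (n, n) scores  -&gt; HBM
    P = torch.softmax(S, dim=-1)                  # (n, n) weights -&gt; HBM
    return P @ V                                  # output         -&gt; HBM
</code></pre></div></div>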

<p>Flash Attention fuses all operations into a <strong>single kernel</strong> that keeps intermediate results in fast SRAM.</p>

<h2 id="the-tiling-algorithm">The Tiling Algorithm</h2>

<p>Flash Attention processes the attention matrix in <strong>tiles</strong> that fit in SRAM:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Pseudocode for Flash Attention forward pass
</span><span class="k">def</span> <span class="nf">flash_attention</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">block_size</span><span class="o">=</span><span class="mi">64</span><span class="p">):</span>
    <span class="s">"""
    Tiled attention computation.
    Q, K, V: (batch, seq_len, d_head)
    """</span>
    <span class="n">n</span> <span class="o">=</span> <span class="n">Q</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">output</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">Q</span><span class="p">)</span>
    
    <span class="c1"># Process in blocks
</span>    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">block_size</span><span class="p">):</span>
        <span class="n">q_block</span> <span class="o">=</span> <span class="n">Q</span><span class="p">[:,</span> <span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="n">block_size</span><span class="p">]</span>
        
        <span class="c1"># Track running max and normalizer for numerical stability
</span>        <span class="n">m_i</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">full</span><span class="p">((</span><span class="n">q_block</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">block_size</span><span class="p">),</span> <span class="nb">float</span><span class="p">(</span><span class="s">'-inf'</span><span class="p">))</span>
        <span class="n">l_i</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">q_block</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">block_size</span><span class="p">))</span>
        <span class="n">o_i</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">q_block</span><span class="p">)</span>
        
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">block_size</span><span class="p">):</span>
            <span class="n">k_block</span> <span class="o">=</span> <span class="n">K</span><span class="p">[:,</span> <span class="n">j</span><span class="p">:</span><span class="n">j</span><span class="o">+</span><span class="n">block_size</span><span class="p">]</span>
            <span class="n">v_block</span> <span class="o">=</span> <span class="n">V</span><span class="p">[:,</span> <span class="n">j</span><span class="p">:</span><span class="n">j</span><span class="o">+</span><span class="n">block_size</span><span class="p">]</span>
            
            <span class="c1"># Compute attention scores for this tile
</span>            <span class="n">scores</span> <span class="o">=</span> <span class="n">q_block</span> <span class="o">@</span> <span class="n">k_block</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">d_k</span><span class="p">)</span>
            
            <span class="c1"># Online softmax update
</span>            <span class="n">m_ij</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">).</span><span class="n">values</span>
            <span class="n">m_new</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">maximum</span><span class="p">(</span><span class="n">m_i</span><span class="p">,</span> <span class="n">m_ij</span><span class="p">)</span>
            
            <span class="c1"># Rescale and accumulate
</span>            <span class="n">alpha</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">m_i</span> <span class="o">-</span> <span class="n">m_new</span><span class="p">)</span>
            <span class="n">beta</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">m_ij</span> <span class="o">-</span> <span class="n">m_new</span><span class="p">)</span>
            
            <span class="n">l_i</span> <span class="o">=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">l_i</span> <span class="o">+</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">torch</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">scores</span> <span class="o">-</span> <span class="n">m_ij</span><span class="p">),</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
            <span class="n">o_i</span> <span class="o">=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">o_i</span> <span class="o">+</span> <span class="n">beta</span> <span class="o">*</span> <span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">scores</span> <span class="o">-</span> <span class="n">m_ij</span><span class="p">)</span> <span class="o">@</span> <span class="n">v_block</span><span class="p">)</span>
            <span class="n">m_i</span> <span class="o">=</span> <span class="n">m_new</span>
        
        <span class="c1"># Final normalization
</span>        <span class="n">output</span><span class="p">[:,</span> <span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="n">block_size</span><span class="p">]</span> <span class="o">=</span> <span class="n">o_i</span> <span class="o">/</span> <span class="n">l_i</span><span class="p">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">output</span>
</code></pre></div></div>
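
<p>To convince yourself the tiling is exact, compare it against the direct computation (a quick CPU check using the sketch above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>torch.manual_seed(0)
Q, K, V = (torch.randn(1, 256, 64) for _ in range(3))

# Reference: materialize the full attention matrix
ref = torch.softmax(Q @ K.transpose(-1, -2) / 64 ** 0.5, dim=-1) @ V
assert torch.allclose(flash_attention(Q, K, V), ref, atol=1e-5)
</code></pre></div></div>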

<blockquote>
  <p>[!TIP]
The magic is the <strong>online softmax</strong> algorithm — we can compute exact softmax incrementally without storing the full attention matrix!</p>
</blockquote>

<h2 id="online-softmax-the-mathematical-trick">Online Softmax: The Mathematical Trick</h2>

<p>Standard softmax requires two passes:</p>
<ol>
  <li>Find max for numerical stability</li>
  <li>Compute exp and normalize</li>
</ol>

<p>Online softmax does it in <strong>one pass</strong> using running statistics:</p>

\[m^{(j)} = \max(m^{(j-1)}, \max(S_{:,j}))\]

\[\ell^{(j)} = e^{m^{(j-1)} - m^{(j)}} \ell^{(j-1)} + \sum_i e^{S_{i,j} - m^{(j)}}\]

\[O^{(j)} = e^{m^{(j-1)} - m^{(j)}} O^{(j-1)} + e^{S_{:,j} - m^{(j)}} V_j\]

<blockquote>
  <p>[!PROOF]
<strong>Correctness</strong>: After the final block, $O/\ell$ equals the exact attention output. The argument is a simple induction: each update rescales the accumulated numerator $O$ and denominator $\ell$ to the new running max, so after every block they equal the exact partial sums over the keys seen so far. ∎</p>
</blockquote>
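
<p>Here is the normalizer update on a single row, split into two blocks (an illustrative numerical check, not part of the proof):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

torch.manual_seed(0)
s = torch.randn(8)       # one row of attention scores
s1, s2 = s[:4], s[4:]    # two key blocks

m1 = s1.max()                     # block-1 running max
l1 = torch.exp(s1 - m1).sum()     # block-1 normalizer

m2 = torch.maximum(m1, s2.max())  # updated running max
l2 = torch.exp(m1 - m2) * l1 + torch.exp(s2 - m2).sum()

# Matches the two-pass normalizer computed over the full row
assert torch.allclose(l2, torch.exp(s - s.max()).sum())
</code></pre></div></div>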

<h2 id="memory-complexity-comparison">Memory Complexity Comparison</h2>

<table>
  <thead>
    <tr>
      <th>Algorithm</th>
      <th>Memory</th>
      <th>I/O Complexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Standard Attention</td>
      <td>$O(n^2)$</td>
<td>$O(n d + n^2)$</td>
    </tr>
    <tr>
      <td>Flash Attention</td>
      <td>$O(n)$</td>
      <td>$O(n^2 d^2 / M)$</td>
    </tr>
  </tbody>
</table>

<p>Where $M$ is the SRAM size (~20 MB) and $d$ is the head dimension (~64-128). Since $d^2 \ll M$ in practice, Flash Attention performs far fewer HBM accesses than the standard algorithm.</p>

<blockquote>
  <p>[!SUCCESS]
For typical transformer configs, Flash Attention reduces memory from <strong>quadratic to linear</strong> in sequence length!</p>
</blockquote>

<h2 id="practical-performance">Practical Performance</h2>

<div class="collapsible" data-label="Show Benchmark Results">

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import time

import torch
from flash_attn import flash_attn_func

def benchmark_attention(seq_len, n_heads=32, d_head=64, batch=1):
    """Compare standard vs Flash Attention"""

    device = 'cuda'
    q = torch.randn(batch, seq_len, n_heads, d_head, device=device, dtype=torch.float16)
    k = torch.randn(batch, seq_len, n_heads, d_head, device=device, dtype=torch.float16)
    v = torch.randn(batch, seq_len, n_heads, d_head, device=device, dtype=torch.float16)

    # Warmup
    for _ in range(10):
        _ = flash_attn_func(q, k, v)
    torch.cuda.synchronize()

    # Flash Attention
    start = time.time()
    for _ in range(100):
        _ = flash_attn_func(q, k, v)
    torch.cuda.synchronize()
    flash_time = (time.time() - start) / 100

    # Standard attention baseline (expects (batch, heads, seq, dim);
    # will OOM at long sequence lengths)
    q_std = q.transpose(1, 2)
    k_std = k.transpose(1, 2)
    v_std = v.transpose(1, 2)

    start = time.time()
    for _ in range(100):
        attn = torch.matmul(q_std, k_std.transpose(-2, -1)) / (d_head ** 0.5)
        attn = torch.softmax(attn, dim=-1)
        _ = torch.matmul(attn, v_std)
    torch.cuda.synchronize()
    std_time = (time.time() - start) / 100

    print(f"Seq len {seq_len}: Flash={flash_time*1000:.2f}ms, "
          f"Std={std_time*1000:.2f}ms, Speedup={std_time/flash_time:.2f}x")

# Run benchmarks
for seq_len in [1024, 2048, 4096, 8192]:
    benchmark_attention(seq_len)
</code></pre></div></div>

<p><strong>Typical results on A100:</strong></p>

<table>
  <thead>
    <tr>
      <th>Sequence Length</th>
      <th>Standard</th>
      <th>Flash Attention</th>
      <th>Speedup</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1024</td>
      <td>1.2 ms</td>
      <td>0.4 ms</td>
      <td>3.0x</td>
    </tr>
    <tr>
      <td>2048</td>
      <td>4.8 ms</td>
      <td>1.2 ms</td>
      <td>4.0x</td>
    </tr>
    <tr>
      <td>4096</td>
      <td>19.2 ms</td>
      <td>4.1 ms</td>
      <td>4.7x</td>
    </tr>
    <tr>
      <td>8192</td>
      <td>OOM</td>
      <td>15.3 ms</td>
      <td>∞</td>
    </tr>
  </tbody>
</table>

</div>

<h2 id="using-flash-attention-in-practice">Using Flash Attention in Practice</h2>

<h3 id="with-hugging-face-transformers">With Hugging Face Transformers</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoModelForCausalLM</span>

<span class="c1"># Flash Attention 2 is enabled automatically for supported models
</span><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
    <span class="s">"meta-llama/Llama-2-7b-hf"</span><span class="p">,</span>
    <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
    <span class="n">attn_implementation</span><span class="o">=</span><span class="s">"flash_attention_2"</span>
<span class="p">)</span>
</code></pre></div></div>

<h3 id="direct-usage-with-flash-attn-library">Direct Usage with flash-attn Library</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">flash_attn</span> <span class="kn">import</span> <span class="n">flash_attn_func</span>

<span class="c1"># Q, K, V shape: (batch, seq_len, n_heads, head_dim)
</span><span class="n">output</span> <span class="o">=</span> <span class="n">flash_attn_func</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">causal</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<blockquote>
  <p>[!WARNING]
Flash Attention requires <strong>GPU with compute capability &gt;= 8.0</strong> (Ampere or newer). For older GPUs, consider xFormers or PyTorch’s built-in <code class="language-plaintext highlighter-rouge">scaled_dot_product_attention</code>.</p>
</blockquote>
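
<p>For those older GPUs, PyTorch 2.x ships a built-in that automatically picks the best available kernel. A minimal sketch on a CUDA device; note it expects <code class="language-plaintext highlighter-rouge">(batch, heads, seq, head_dim)</code>, unlike <code class="language-plaintext highlighter-rouge">flash_attn_func</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 1024, 64, device='cuda', dtype=torch.float16)
k = torch.randn(1, 32, 1024, 64, device='cuda', dtype=torch.float16)
v = torch.randn(1, 32, 1024, 64, device='cuda', dtype=torch.float16)

# Dispatches to Flash, memory-efficient, or math kernels as available
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
</code></pre></div></div>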

<h2 id="flash-attention-2--3">Flash Attention 2 &amp; 3</h2>

<p>Flash Attention has evolved:</p>

<p><strong>Flash Attention 2</strong> (2023):</p>
<ul>
  <li>Better work partitioning across GPU threads</li>
  <li>2x faster than FA1</li>
  <li>Better parallelism for small batch sizes</li>
</ul>

<p><strong>Flash Attention 3</strong> (2024):</p>
<ul>
  <li>Exploits Hopper architecture (H100)</li>
  <li>Asynchronous operations</li>
  <li>1.5-2x faster than FA2 on H100</li>
</ul>

<h2 id="key-takeaways">Key Takeaways</h2>

<blockquote>
  <p>[!TIP]
<strong>Summary</strong>: Flash Attention is <strong>IO-aware</strong> — it minimizes memory transfers between GPU HBM and SRAM. By tiling the computation and using online softmax, it achieves linear memory with no approximation.</p>
</blockquote>

<ol>
  <li><strong>Memory I/O is the bottleneck</strong>, not compute — Flash Attention optimizes for this</li>
  <li><strong>Tiling + online softmax</strong> = exact attention with linear memory</li>
  <li><strong>2-4x speedup</strong> with <strong>10-20x memory reduction</strong> for long sequences</li>
  <li><strong>Drop-in replacement</strong> — mathematically identical to standard attention</li>
</ol>

<hr />

<p><em>Previous: <a href="/posts/transformer-attention-deep-dive/">Transformer Attention: A Mathematical Deep Dive</a></em></p>]]></content><author><name>Rohit Kumar</name></author><category term="transformers" /><category term="attention" /><category term="optimization" /><category term="flash-attention" /><category term="gpu" /><summary type="html"><![CDATA[In my previous post on Transformer Attention, we explored the mathematical foundations of attention. The key limitation? Quadratic memory complexity $O(n^2)$ makes long sequences prohibitively expensive. Flash Attention solves this.]]></summary></entry><entry><title type="html">Transformer Attention: A Mathematical Deep Dive</title><link href="https://rohit.vision/blogs/posts/transformer-attention-deep-dive/" rel="alternate" type="text/html" title="Transformer Attention: A Mathematical Deep Dive" /><published>2024-12-17T00:00:00+00:00</published><updated>2024-12-17T00:00:00+00:00</updated><id>https://rohit.vision/blogs/posts/transformer-attention-deep-dive</id><content type="html" xml:base="https://rohit.vision/blogs/posts/transformer-attention-deep-dive/"><![CDATA[<p>The attention mechanism is the core innovation behind transformers. Let’s break it down mathematically and implement it from scratch.</p>

<blockquote>
  <p>[!NOTE]
This post assumes familiarity with basic linear algebra and neural networks. If you’re new to these topics, check out my <a href="#">prerequisites guide</a>.</p>
</blockquote>

<h2 id="the-attention-formula">The Attention Formula</h2>

<p>At its heart, attention computes a weighted sum of values based on query-key similarity:</p>

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

<p>Where:</p>
<ul>
  <li>$Q$ = Query matrix of shape $(n, d_k)$</li>
  <li>$K$ = Key matrix of shape $(m, d_k)$</li>
  <li>$V$ = Value matrix of shape $(m, d_v)$</li>
  <li>$d_k$ = Key/query dimension (scaling factor)</li>
</ul>

<blockquote>
  <p>[!TIP]
Think of attention as a “soft” dictionary lookup: queries find relevant keys, and retrieve their associated values.</p>
</blockquote>

<h2 id="why-scale-by-sqrtd_k">Why Scale by $\sqrt{d_k}$?</h2>

<p>The entries of $QK^T$ have variance that grows linearly with $d_k$. For large $d_k$, the logits become large, the softmax saturates toward one-hot vectors, and gradients vanish. Scaling by $\sqrt{d_k}$ keeps the variance stable:</p>

\[\text{Var}(q \cdot k) = d_k \cdot \text{Var}(q_i) \cdot \text{Var}(k_i) = d_k\]

<p>After scaling: $\text{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1$</p>
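
<p>You can verify this empirically in a few lines (a quick sketch):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10_000, d_k)
k = torch.randn(10_000, d_k)

dots = (q * k).sum(dim=-1)        # 10k sample dot products
print(dots.var())                 # ~512, i.e. ~d_k
print((dots / d_k ** 0.5).var())  # ~1 after scaling
</code></pre></div></div>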

<blockquote>
  <p>[!WARNING]
Forgetting this scaling factor is a common bug! Without it, gradients vanish for $d_k &gt; 64$.</p>
</blockquote>

<h2 id="pytorch-implementation">PyTorch Implementation</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="nn">torch.nn.functional</span> <span class="k">as</span> <span class="n">F</span>

<span class="k">class</span> <span class="nc">ScaledDotProductAttention</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d_k</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">scale</span> <span class="o">=</span> <span class="n">d_k</span> <span class="o">**</span> <span class="o">-</span><span class="mf">0.5</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> 
        <span class="n">query</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>  <span class="c1"># (batch, n, d_k)
</span>        <span class="n">key</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>    <span class="c1"># (batch, m, d_k)
</span>        <span class="n">value</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span>  <span class="c1"># (batch, m, d_v)
</span>        <span class="n">mask</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">tuple</span><span class="p">[</span><span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">]:</span>
        <span class="c1"># Compute attention scores
</span>        <span class="n">scores</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">scale</span>
        
        <span class="c1"># Apply mask (for causal attention or padding)
</span>        <span class="k">if</span> <span class="n">mask</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">scores</span> <span class="o">=</span> <span class="n">scores</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">mask</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s">'-inf'</span><span class="p">))</span>
        
        <span class="c1"># Softmax over keys
</span>        <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">scores</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">attn_weights</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">)</span>
        
        <span class="c1"># Weighted sum of values
</span>        <span class="n">output</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="n">value</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">output</span><span class="p">,</span> <span class="n">attn_weights</span>
</code></pre></div></div>
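
<p>A quick smoke test of the module above, including a causal mask (shapes are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>attn = ScaledDotProductAttention(d_k=64)
attn.eval()  # disable dropout for a deterministic check

q = torch.randn(2, 10, 64)   # (batch, n, d_k)
k = torch.randn(2, 10, 64)   # (batch, m, d_k)
v = torch.randn(2, 10, 64)   # (batch, m, d_v)

# Lower-triangular causal mask: position i may attend to j &lt;= i
mask = torch.tril(torch.ones(10, 10))

out, weights = attn(q, k, v, mask)
print(out.shape, weights.shape)  # (2, 10, 64), (2, 10, 10)
</code></pre></div></div>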

<h2 id="multi-head-attention">Multi-Head Attention</h2>

<p>Instead of a single attention function, transformers use multiple “heads” in parallel:</p>

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

<p>where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$</p>

<blockquote>
  <p>[!QUESTION]
Why use multiple heads instead of one large attention? Answer: Each head can attend to different aspects of the input (syntax, semantics, position, etc.).</p>
</blockquote>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MultiHeadAttention</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="k">assert</span> <span class="n">d_model</span> <span class="o">%</span> <span class="n">n_heads</span> <span class="o">==</span> <span class="mi">0</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span> <span class="o">=</span> <span class="n">d_model</span> <span class="o">//</span> <span class="n">n_heads</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span> <span class="o">=</span> <span class="n">n_heads</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">W_q</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_k</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_v</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">W_o</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_model</span><span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">attention</span> <span class="o">=</span> <span class="n">ScaledDotProductAttention</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">,</span> <span class="n">dropout</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="n">batch_size</span> <span class="o">=</span> <span class="n">query</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
        
        <span class="c1"># Linear projections and reshape for multi-head
</span>        <span class="n">Q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_q</span><span class="p">(</span><span class="n">query</span><span class="p">).</span><span class="n">view</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">).</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
        <span class="n">K</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_k</span><span class="p">(</span><span class="n">key</span><span class="p">).</span><span class="n">view</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">).</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
        <span class="n">V</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_v</span><span class="p">(</span><span class="n">value</span><span class="p">).</span><span class="n">view</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">).</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
        
        <span class="c1"># Apply attention
</span>        <span class="n">x</span><span class="p">,</span> <span class="n">attn</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">attention</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">V</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span>
        
        <span class="c1"># Concatenate heads and project
</span>        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">).</span><span class="n">contiguous</span><span class="p">().</span><span class="n">view</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">n_heads</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">d_k</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">W_o</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="n">attn</span>
</code></pre></div></div>
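
<p>And a sanity check of the full module (self-attention, so query, key, and value are the same tensor):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mha = MultiHeadAttention(d_model=512, n_heads=8)

x = torch.randn(2, 100, 512)  # (batch, seq_len, d_model)
out, attn = mha(x, x, x)

print(out.shape)   # torch.Size([2, 100, 512])
print(attn.shape)  # torch.Size([2, 8, 100, 100]) -- one map per head
</code></pre></div></div>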

<h2 id="complexity-analysis">Complexity Analysis</h2>

<table>
  <thead>
    <tr>
      <th>Operation</th>
      <th>Time Complexity</th>
      <th>Space Complexity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>$QK^T$ computation</td>
      <td>$O(n \cdot m \cdot d_k)$</td>
      <td>$O(n \cdot m)$</td>
    </tr>
    <tr>
      <td>Softmax</td>
      <td>$O(n \cdot m)$</td>
      <td>$O(n \cdot m)$</td>
    </tr>
    <tr>
      <td>Attention × Value</td>
      <td>$O(n \cdot m \cdot d_v)$</td>
      <td>$O(n \cdot d_v)$</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td>$O(n \cdot m \cdot d)$</td>
      <td>$O(n \cdot m)$</td>
    </tr>
  </tbody>
</table>

<p>For self-attention ($n = m$), this is <strong>quadratic</strong> in sequence length — the main bottleneck for long sequences.</p>
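
<p>The growth is easy to see numerically (float32 assumed, per attention head):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def attn_matrix_mb(n, bytes_per_el=4):
    """Memory for one (n x n) float32 attention matrix, in MiB."""
    return n * n * bytes_per_el / 2**20

for n in [1024, 2048, 4096, 8192]:
    print(f"{n:&gt;5}: {attn_matrix_mb(n):6.0f} MB")  # 4, 16, 64, 256 MB
</code></pre></div></div>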

<h2 id="interactive-demo">Interactive Demo</h2>

<p>Try this attention visualization on Hugging Face:</p>

<div class="hf-space" data-src="exbert-project/exbert" data-height="600"></div>

<h2 id="try-it-yourself">Try It Yourself</h2>

<p>Run this simple attention calculation directly in your browser:</p>

<div class="runnable" data-lang="python">

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># Simple attention example (no PyTorch needed!)
</span><span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="n">exp_x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">exp_x</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">exp_x</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">attention</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">V</span><span class="p">):</span>
    <span class="n">d_k</span> <span class="o">=</span> <span class="n">K</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="n">Q</span> <span class="o">@</span> <span class="n">K</span><span class="p">.</span><span class="n">T</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">d_k</span><span class="p">)</span>
    <span class="n">weights</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">weights</span> <span class="o">@</span> <span class="n">V</span><span class="p">,</span> <span class="n">weights</span>

<span class="c1"># Create sample query, key, value vectors
</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">Q</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>  <span class="c1"># 2 queries, dim 4
</span><span class="n">K</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>  <span class="c1"># 3 keys
</span><span class="n">V</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>  <span class="c1"># 3 values
</span>
<span class="n">output</span><span class="p">,</span> <span class="n">attn_weights</span> <span class="o">=</span> <span class="n">attention</span><span class="p">(</span><span class="n">Q</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">V</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"Query shape:"</span><span class="p">,</span> <span class="n">Q</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Key shape:"</span><span class="p">,</span> <span class="n">K</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Value shape:"</span><span class="p">,</span> <span class="n">V</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Attention weights (which keys each query attends to):"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Output shape:"</span><span class="p">,</span> <span class="n">output</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div>  </div>

</div>

<h2 id="key-takeaways">Key Takeaways</h2>

<blockquote>
  <p>[!TIP]
<strong>Summary</strong>: Attention is a “soft dictionary” — queries find keys, retrieve values. Multi-head attention learns multiple perspectives. The $\sqrt{d_k}$ scaling prevents gradient issues.</p>
</blockquote>

<ol>
  <li><strong>Attention is a soft dictionary lookup</strong>: queries find relevant keys, retrieve values</li>
  <li><strong>Scaling prevents gradient vanishing</strong> in high dimensions</li>
  <li><strong>Multi-head = multiple perspectives</strong> on the same input</li>
  <li><strong>Quadratic complexity</strong> motivates efficient variants (Flash Attention, Linear Attention)</li>
</ol>

<hr />

<h2 id="new-feature-showcase">New Feature Showcase</h2>

<p>This section demonstrates the new “Second Brain” features added to the blog.</p>

<h3 id="enhanced-callouts">Enhanced Callouts</h3>

<blockquote>
  <p>[!ABSTRACT]
This post provides a mathematical deep-dive into the attention mechanism, the core innovation behind transformer architectures. We cover scaled dot-product attention, multi-head attention, complexity analysis, and provide interactive implementations.</p>
</blockquote>

<blockquote>
  <p>[!DEFINITION]
<strong>Scaled Dot-Product Attention</strong> is a function that maps a query and a set of key-value pairs to an output, where the output is computed as a weighted sum of the values, with weights determined by the compatibility of the query with the corresponding keys.</p>
</blockquote>

<blockquote>
  <p>[!PROOF]
<strong>Variance Stability Proof</strong>: Let $q_i, k_i \sim \mathcal{N}(0, 1)$ be i.i.d. standard normal. Then $\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k$. Dividing by $\sqrt{d_k}$ gives $\text{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1$. ∎</p>
</blockquote>

<blockquote>
  <p>[!EXAMPLE]
For a 3-token sequence “The cat sat”, self-attention allows “sat” to attend to “cat” (subject) with high weight, while “The” attends mostly to itself since it’s a common determiner.</p>
</blockquote>

<blockquote>
  <p>[!CRITICAL]
<strong>GPU Memory Warning</strong>: Attention’s $O(n^2)$ space complexity means a sequence of length 8192 requires ~256MB just for the attention matrix (float32). This is why Flash Attention and memory-efficient variants are essential for long-context models!</p>
</blockquote>

<blockquote>
  <p>[!SUCCESS]
After implementing attention correctly with proper scaling, you should see smooth training curves and stable gradients even with $d_k = 512$ or higher.</p>
</blockquote>

<h3 id="collapsible-code-block">Collapsible Code Block</h3>

<p>The full multi-head attention implementation is hidden by default to keep the article clean:</p>

<div class="collapsible" data-label="Show Full Transformer Block">

  <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TransformerBlock</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="s">"""Complete transformer block with attention, FFN, and residual connections."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d_model</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">d_ff</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.1</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        
        <span class="c1"># Multi-head attention
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">attention</span> <span class="o">=</span> <span class="n">MultiHeadAttention</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">,</span> <span class="n">dropout</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">norm1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">d_model</span><span class="p">)</span>
        
        <span class="c1"># Feed-forward network
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">ffn</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_ff</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">GELU</span><span class="p">(),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_ff</span><span class="p">,</span> <span class="n">d_model</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">d_model</span><span class="p">)</span>
        
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="c1"># Self-attention with residual
</span>        <span class="n">attn_out</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">attention</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm1</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">attn_out</span><span class="p">)</span>
        
        <span class="c1"># FFN with residual
</span>        <span class="n">ffn_out</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">ffn</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">norm2</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">ffn_out</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">x</span>

<span class="c1"># Example usage
</span><span class="n">block</span> <span class="o">=</span> <span class="n">TransformerBlock</span><span class="p">(</span><span class="n">d_model</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">n_heads</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">d_ff</span><span class="o">=</span><span class="mi">2048</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">512</span><span class="p">)</span>  <span class="c1"># batch=2, seq_len=100, d_model=512
</span><span class="n">output</span> <span class="o">=</span> <span class="n">block</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Input: </span><span class="si">{</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s"> -&gt; Output: </span><span class="si">{</span><span class="n">output</span><span class="p">.</span><span class="n">shape</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div>  </div>

</div>

<h3 id="video-demo-example-syntax">Video Demo (Example Syntax)</h3>

<p>Here’s how you can embed videos showing your model in action:</p>

<div class="video-embed" data-src="https://www.youtube.com/watch?v=kCc8FmEb1nY" data-caption="Andrej Karpathy's excellent 'Let's build GPT' tutorial"></div>

<h3 id="image-comparison-slider">Image Comparison Slider</h3>

<p>Drag the slider to compare raw vs processed attention patterns:</p>

<div class="image-compare" data-before="/assets/images/attention_before.png" data-after="/assets/images/attention_after.png">
  <span class="compare-label-before">Raw</span>
  <span class="compare-label-after">Processed</span>
</div>

<p><strong>Usage syntax:</strong></p>
<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"image-compare"</span> 
     <span class="na">data-before=</span><span class="s">"/path/to/before.png"</span> 
     <span class="na">data-after=</span><span class="s">"/path/to/after.png"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;span</span> <span class="na">class=</span><span class="s">"compare-label-before"</span><span class="nt">&gt;</span>Before<span class="nt">&lt;/span&gt;</span>
  <span class="nt">&lt;span</span> <span class="na">class=</span><span class="s">"compare-label-after"</span><span class="nt">&gt;</span>After<span class="nt">&lt;/span&gt;</span>
<span class="nt">&lt;/div&gt;</span>
</code></pre></div></div>

<h3 id="multi-image-layouts">Multi-Image Layouts</h3>

<p><strong>Single Image (1x1)</strong> - Default layout:</p>

<figure class="single-image">
    <img src="https://picsum.photos/seed/single/600/300" alt="Single image with caption" /><figcaption>Single image with caption</figcaption></figure>

<p><strong>Two-Column (1x2)</strong> - Just add two images:</p>

<div class="image-row"><figure>
        <img src="https://picsum.photos/seed/attention1/400/300" alt="Self-Attention" /><figcaption>Self-Attention</figcaption></figure><figure>
        <img src="https://picsum.photos/seed/attention2/400/300" alt="Multi-Head Attention" /><figcaption>Multi-Head Attention</figcaption></figure></div>

<p><strong>Grid (3+)</strong> - Automatically creates grid:</p>

<div class="image-grid cols-3 "><figure>
        <img src="https://picsum.photos/seed/layer1/300/200" alt="Layer 1" /><figcaption>Layer 1</figcaption></figure><figure>
        <img src="https://picsum.photos/seed/layer2/300/200" alt="Layer 2" /><figcaption>Layer 2</figcaption></figure><figure>
        <img src="https://picsum.photos/seed/layer3/300/200" alt="Layer 3" /><figcaption>Layer 3</figcaption></figure><figure>
        <img src="https://picsum.photos/seed/layer4/300/200" alt="Layer 4" /><figcaption>Layer 4</figcaption></figure><figure>
        <img src="https://picsum.photos/seed/layer5/300/200" alt="Layer 5" /><figcaption>Layer 5</figcaption></figure><figure>
        <img src="https://picsum.photos/seed/layer6/300/200" alt="Layer 6" /><figcaption>Layer 6</figcaption></figure></div>

<p><strong>Usage syntax:</strong></p>
<div class="language-liquid highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
{# Single image #}
<span class="p">{%</span><span class="w"> </span><span class="nt">include</span><span class="w"> </span>img.html<span class="w"> </span><span class="na">src</span><span class="o">=</span><span class="s2">"/path/image.png"</span><span class="w"> </span><span class="p">%}</span>
<span class="p">{%</span><span class="w"> </span><span class="nt">include</span><span class="w"> </span>img.html<span class="w"> </span><span class="na">src</span><span class="o">=</span><span class="s2">"/path/image.png"</span><span class="w"> </span><span class="na">cap</span><span class="o">=</span><span class="s2">"With caption"</span><span class="w"> </span><span class="p">%}</span>

{# Two-column #}
<span class="p">{%</span><span class="w"> </span><span class="nt">include</span><span class="w"> </span>img.html<span class="w"> </span><span class="na">src</span><span class="o">=</span><span class="s2">"/path/1.png, /path/2.png"</span><span class="w"> </span><span class="na">cap</span><span class="o">=</span><span class="s2">"Left, Right"</span><span class="w"> </span><span class="p">%}</span>

{# Grid (3+ images) #}
<span class="p">{%</span><span class="w"> </span><span class="nt">include</span><span class="w"> </span>img.html<span class="w"> </span><span class="na">src</span><span class="o">=</span><span class="s2">"/1.png, /2.png, /3.png"</span><span class="w"> </span><span class="na">cap</span><span class="o">=</span><span class="s2">"A, B, C"</span><span class="w"> </span><span class="na">cols</span><span class="o">=</span><span class="s2">"3"</span><span class="w"> </span><span class="p">%}</span>

</code></pre></div></div>

<h3 id="summary-of-new-markdown-syntax">Summary of New Markdown Syntax</h3>

<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>Syntax</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Abstract</td>
      <td><code class="language-plaintext highlighter-rouge">&gt; [!ABSTRACT]</code></td>
    </tr>
    <tr>
      <td>Definition</td>
      <td><code class="language-plaintext highlighter-rouge">&gt; [!DEFINITION]</code></td>
    </tr>
    <tr>
      <td>Proof</td>
      <td><code class="language-plaintext highlighter-rouge">&gt; [!PROOF]</code></td>
    </tr>
    <tr>
      <td>Example</td>
      <td><code class="language-plaintext highlighter-rouge">&gt; [!EXAMPLE]</code></td>
    </tr>
    <tr>
      <td>Critical</td>
      <td><code class="language-plaintext highlighter-rouge">&gt; [!CRITICAL]</code></td>
    </tr>
    <tr>
      <td>Success</td>
      <td><code class="language-plaintext highlighter-rouge">&gt; [!SUCCESS]</code></td>
    </tr>
    <tr>
      <td>Collapsible Code</td>
      <td><code class="language-plaintext highlighter-rouge">&lt;div class="collapsible"&gt;...&lt;/div&gt;</code></td>
    </tr>
    <tr>
      <td>Video Embed</td>
      <td><code class="language-plaintext highlighter-rouge">&lt;div class="video-embed" data-src="URL"&gt;</code></td>
    </tr>
    <tr>
      <td>Image Compare</td>
      <td><code class="language-plaintext highlighter-rouge">&lt;div class="image-compare" data-before="..." data-after="..."&gt;</code></td>
    </tr>
    <tr>
      <td>Single Image</td>
      <td><code class="language-plaintext highlighter-rouge">{% include img.html src="/path.png" %}</code></td>
    </tr>
    <tr>
      <td>Two Images</td>
      <td><code class="language-plaintext highlighter-rouge">{% include img.html src="/1.png, /2.png" %}</code></td>
    </tr>
    <tr>
      <td>Image Grid</td>
      <td><code class="language-plaintext highlighter-rouge">{% include img.html src="/1.png, /2.png, /3.png" cols="3" %}</code></td>
    </tr>
  </tbody>
</table>

<hr />

<p><em>Next post: We’ll implement Flash Attention and benchmark against naive attention.</em></p>]]></content><author><name>Rohit Kumar</name></author><category term="transformers" /><category term="attention" /><category term="deep-learning" /><category term="tutorial" /><summary type="html"><![CDATA[The attention mechanism is the core innovation behind transformers. Let’s break it down mathematically and implement it from scratch.]]></summary></entry></feed>