Chapter 3 was the math chapter and the chapter that closes Part 1. The promise
at the top was a 30-minute refresher; the actual chapter is longer than that
because the refresher needed to be specific to robot policies rather than to
deep learning in general. The five sections introduced the four objects you
will see on almost every page of the rest of the book (vectors, matrices,
gradients, the chain rule), the probabilistic frame that makes those objects
behave (random variables, expectations, KL divergence), the 50-line PyTorch
loop that turns the math into running code, the three loss families that
parameterize “training” across the whole field, and the debugging discipline
that keeps a run from silently producing nonsense. This summary collects the
load-bearing ideas in one place and tells you which ones the next four chapters
are going to lean on hardest.
The four ideas worth carrying forward
Calculus is the API between a policy and an optimizer. §3.1 made the case
that the gradient of the loss with respect to the weights is the only thing
SGD ever sees. Everything else — the architecture, the tokenization, the data
pipeline — exists to shape that gradient into something useful. The chain rule
is the mechanism by which a scalar loss at the end of a 7-billion-parameter
network produces a per-parameter update at the beginning. Two consequences
follow. First, when a model fails to learn, the gradient is the first thing
to inspect, not the architecture. Second, the manipulator Jacobian
J(q)∈R6×7 and the loss Jacobian
∂L/∂θ are the same kind of object, computed by the
same kind of chain-rule decomposition; the only difference is whether the
chain runs through joint angles to end-effector pose, or through network
weights to a scalar loss. This is why a person who can derive the Jacobian
of a 2-link arm tends to pick up backprop quickly, and why a person who has
trained a deep network on images tends to pick up inverse kinematics quickly.
They are reusing the same idea.
Cross-entropy and KL divergence are the same loss in two notations. §3.2
worked through this twice — once with the math and once with the OpenVLA
action-bin example — because it is the single most useful identity for
reading the modern literature. RT-1 and OpenVLA describe themselves as
training with “cross-entropy loss on discretized action bins”; π0 describes
itself as training with a flow-matching loss that is equivalent to minimizing
a particular KL divergence between forward and reverse processes. The
descriptions sound different. They are not. Cross-entropy
H(p,q)=−∑ipilogqi and KL divergence
DKL(p∥q)=∑ipilog(pi/qi) differ only by the
entropy of p, which is a constant when p is the target distribution.
Minimizing one minimizes the other. Once you internalize that, the loss
function of essentially every supervised and self-supervised action model
in the book reduces to one of three patterns: cross-entropy on tokens
(RT-1, OpenVLA), MSE on continuous actions (ACT, Diffusion Policy
arXiv:2303.04137), or a denoising-style velocity loss (π0,
arXiv:2410.24164). The variety in named loss functions is mostly notation.
A training loop is six lines that do the actual work and a few hundred
lines of plumbing. §3.3 made this concrete by writing the SmallPolicy loop
in 50 lines and annotating which six lines are the optimization step. The
remaining 44 lines are data loading, logging, learning-rate scheduling, and
device placement — necessary, but mechanical. When you read the OpenVLA
training code, or the Diffusion Policy training code, or the π0 training
code, the six lines are present unchanged: forward pass, loss computation,
zero gradients, backward pass, optimizer step, scheduler step. What differs
between codebases is the plumbing, especially the data pipeline. This
matters because the plumbing is where almost all the bugs are. A
correctly-written six-line update on a corrupted dataloader produces a model
that does not work. The diagnostic from §3.5 — print action statistics
before training a single step — exists because the data pipeline is the
default suspect.
The loss family determines the failure mode. §3.4 introduced the three
families and §3.5 turned them into a diagnostic table. Supervised losses
fail by overfitting, by mode collapse on multimodal data, and by label
noise. Reinforcement-learning losses fail by reward hacking, by sparse-
reward stagnation, and by reward miscalibration. Self-supervised losses
fail by noise-schedule mismatch and by integrator error at sampling time.
Knowing which family your loss belongs to narrows the hypothesis space
the moment a run starts misbehaving. The point of organizing the chapter
this way is that “my model isn’t training” is not a question anyone can
answer; “my supervised behavior-cloning loss plateaus at 1.2 nats per
token after step 5000 on a dataset with two demonstrators” is a question
with two or three plausible answers and a clear next experiment.
What you should be able to do now
Four concrete capabilities, in increasing order of how much the rest of the
book is going to rely on them.
You should be able to read a model card or paper appendix and identify the
loss function in one pass. The OpenVLA paper says “we train with the
standard next-token cross-entropy objective over 256 discrete action bins
per dimension.” That sentence packs three §3.4 commitments — supervised
family, cross-entropy form, tokenized output — and it lets you predict the
diagnostic suite from §3.5 you will need when you fine-tune the model. The
Diffusion Policy paper says “we train an ϵ-prediction denoising
network conditioned on observations, with a fixed cosine noise schedule.”
That sentence packs four commitments — self-supervised family, MSE-on-noise
form, continuous output, schedule fixed. Reading either sentence quickly
saves the half-day you would otherwise spend reverse-engineering the loss
from the training script.
You should be able to write a 50-line PyTorch training loop from scratch
without consulting a reference. Not a state-of-the-art one — a working
one. The six update lines, plus a dataloader, plus a logging hook, plus
a learning-rate scheduler. The SmallPolicy from §3.3 is the template;
copying it for a new architecture should be a 20-minute job, not a
two-day job. The skill matters because every chapter from 6 onward asks
you to train at least one variant of the model under discussion, and the
hands-on exercises at the end of each chapter assume you can stand up a
loop without ceremony. The handful of moving parts the loop needs — gradient
zeroing, backward call, optimizer step, scheduler step — are universal
across PyTorch projects; the §3.4 loss family is the only piece that
changes from chapter to chapter.
You should be able to debug a non-converging run in under an hour by working
the §3.5 checklist. Print action statistics. Verify timestamp alignment.
Run for 10 steps on a batch of 4. Inspect gradient norms. Add NaN assertions.
Read the loss curve. Apply the loss-family diagnostic. The checklist exists
because in practice the cause is almost always one of seven things, and
identifying which of the seven saves you from the much longer tail of “it
must be the architecture.” A modern VLA fine-tune costs on the order of
hundreds of GPU-hours; the cost of mis-diagnosing is paid in days. The
checklist is cheap.
You should be able to predict the architectural consequences of a loss
choice. If a paper proposes a new VLA that uses MSE on continuous actions
with a single deterministic head, you should immediately ask: how does it
handle bimodal demonstrations? If a paper proposes RL fine-tuning on a
real robot, you should immediately ask: what is the reward function and
how dense is it? If a paper proposes a flow-matching action head, you
should immediately ask: how many integration steps at inference, and what
is the latency budget? These are not gotcha questions. They are the
questions the §3.4 taxonomy implies; asking them surfaces the design
decisions a paper has made and the trade-offs it has accepted. By the end
of Chapter 10, you will recognize each of these trade-offs as the entry
point to a specific later chapter — and the §3.4 vocabulary is the index.
Where the chapter has set up the rest of the book
Part 1 is now complete. Three explicit forward references are worth
re-naming, because they structure all of Part 2 and most of Part 3. The
three-loss-families framing from §3.4 is the spine of Chapter 5 (RL family
in depth), Chapter 6 (imitation family in depth), Chapter 7 (deep RL),
and Chapter 10 (the self-supervised family applied to action generation).
Each of those chapters elaborates one family into its own design space.
The PyTorch loop from §3.3 is the substrate for every code listing in
Chapters 6, 7, 10, 11, and 12; later chapters add larger models, bigger
datasets, and richer logging, but the six-line update stays. The
debugging checklist from §3.5 is the substrate for Chapter 16’s
fine-tuning recipes and Chapter 17’s deployment monitoring — both
chapters treat the checklist as a prerequisite rather than introducing
their own.
The chapter has not set up two things you might have expected. It has not
covered transformers, attention, or the architectural details of the models
listed in Part 4; those wait for Chapter 8, where they can be discussed in
the context of Decision Transformer and the lineage that produced RT-1. It
has also not covered the data side — Open X-Embodiment, LIBERO, CALVIN,
SimplerEnv — which is the entire subject of Chapter 15. The §3.5
admonition to “print action statistics before training a single step” is a
preview; the systematic treatment of robot datasets, their failure modes,
and their benchmark properties is twelve chapters away.
Part 1 closes here. Chapters 1, 2, and 3 between them gave you the
vocabulary of action models (Chapter 1), a working VLA you trained and
broke yourself (Chapter 2), and the math, code, and debugging instinct that
turn the rest of the book into a series of one-section variations on
recognizable themes (Chapter 3). Part 2 starts in Chapter 4 with the
classical-methods family — PDDL, IK, motion planning, computed-torque
control — and walks the four-family taxonomy forward through Chapter 7. By
the time you reach Chapter 11 and the VLA recipe proper, every component
you encounter will be something you have seen the smaller, earlier version
of. That progression is the design of the book, and Part 1 is the part
where the design is set.
§3.x closes the chapter with one hands-on exercise — a debugging puzzle on
an intentionally broken SmallPolicy run, with the cause hidden in one of
the seven places §3.5 named — and the full reading list for the chapter.
References
Kim et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246.
Brohan et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. arXiv:2212.06817.
Black et al. (2024). π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164.
Chi et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv:2303.04137.