Chapter 1 · The robot learning problem

§1.2 Anatomy of an action model: inputs, outputs, training signal

Section 1.1 argued that turning intent into motion is structurally hard and named four reasons. This section gets concrete. We will take a modern action model apart and label its components, in the vocabulary the rest of the book will use. By the end of the section you should be able to look at any paper in the field (RT-1, OpenVLA, π0, Helix, GR00T N1, or the next one published the week after you read this) and identify the three slots that define it: what goes in, what comes out, and what training signal taught the model to map one to the other.

Three slots, in that order. They look simple. Most of the design space of the field lives in the choices you make at each one.

Slot 1 — Inputs: what the policy is allowed to look at

An action model takes some observation of the world and produces something the robot can do with it. The first design choice is what counts as “observation.”

The minimum is one RGB camera and a clock. Most modern systems use more. The typical input bundle for a manipulation VLA is:

Three observations about this slot will recur. First, the input modality mix is not fixed by the problem; it is a design choice. The same task can be done by a single-camera policy or a five-camera one, and the comparison is non-trivial because more cameras mean more compute and more places for the model to overfit. Second, the inputs are heterogeneous: pixels, tokens, floating-point joint angles. Combining them is its own architectural question, which we will treat in Chapter 8 when we discuss tokenization and in Chapter 11 when we trace the CLIP → RT-1 lineage. Third, what you decide not to look at matters. A policy that has access to a force-torque sensor will learn to use it; a policy that does not will silently substitute visual approximations of contact. Both can work; they fail differently.
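To make the heterogeneity concrete, here is a minimal sketch of an input bundle as a data structure. The class name, field names, and the `modalities` helper are hypothetical and chosen for illustration; no particular model uses this layout. The optional `wrench` field stands in for a force-torque sensor that a given robot may or may not have.

```python
from dataclasses import dataclass
from typing import Optional

# One RGB image as rows of (r, g, b) pixels. Illustrative only.
Image = list[list[tuple[int, int, int]]]


@dataclass
class Observation:
    """Hypothetical input bundle; every field name here is an assumption."""
    images: dict[str, Image]               # camera name -> RGB pixels
    instruction_tokens: list[int]          # language instruction as token ids
    joint_angles: list[float]              # proprioception, in radians
    wrench: Optional[list[float]] = None   # force-torque reading, None if absent


def modalities(obs: Observation) -> list[str]:
    """List the input modalities this observation actually carries."""
    names = [f"camera:{name}" for name in sorted(obs.images)]
    names.append("language")
    names.append("proprioception")
    if obs.wrench is not None:
        names.append("force_torque")
    return names


tiny_frame: Image = [[(0, 0, 0), (255, 255, 255)]]  # 1 x 2 placeholder image
obs = Observation(
    images={"wrist": tiny_frame},
    instruction_tokens=[101, 7, 42],
    joint_angles=[0.0] * 7,
)
print(modalities(obs))
```

A policy trained on bundles where `wrench` is always `None` never sees contact forces directly, which is the situation the third observation describes.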

Slot 2 — Outputs: what the policy is allowed to do

The second design choice is the action space — the set of things the policy is allowed to output. This choice has more consequences than any other single decision in the model, because it determines what kind of motion the policy can express and how the policy is trained.

Action spaces come in two cuts: by type and by frame. The type of an action is what the number physically means. The frame is what coordinate system the number is expressed in.

The four common types, ordered roughly from lowest- to highest-level:

The frame distinction cuts across all four types. Most action spaces are expressed relative to the current end-effector pose: “go 1 cm forward from where you are” rather than “go to absolute world coordinate (0.45, 0.10, 0.30).” Relative actions generalize better because they do not depend on the calibration of the robot’s base frame, and they are easier to compose. Almost every contemporary VLA uses relative actions. The exceptions are systems with strong global scene grounding (some 3D-aware VLAs like LEO in Chapter 15) and end-to-end driving policies (OpenDriveVLA), where the relevant frame is the world, not the agent.
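The calibration argument can be checked with a few lines of arithmetic. The sketch below is a simplification under stated assumptions: it handles position only (no orientation), treats "forward" as +x, and models a miscalibrated base frame as a constant translation. The starting pose and the `base_error` offset are made-up numbers.

```python
import math

Vec3 = tuple[float, float, float]


def add(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise sum: apply a relative action to a position."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def sub(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise difference: the relative action that takes b to a."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


current: Vec3 = (0.44, 0.10, 0.30)   # assumed current end-effector position
target: Vec3 = (0.45, 0.10, 0.30)    # the absolute coordinate from the text
delta = sub(target, current)          # roughly 1 cm in +x

# Hypothetical base-frame calibration error: every absolute coordinate shifts.
base_error: Vec3 = (0.02, -0.01, 0.0)
current_shifted = add(current, base_error)
target_shifted = add(target, base_error)
delta_shifted = sub(target_shifted, current_shifted)

# The absolute target changed, but the relative action did not.
same_delta = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(delta, delta_shifted)
)
print(same_delta)

# Relative actions also compose by simple addition: two 1 cm steps.
two_steps = add(add(current, delta), delta)
```

With rotations in play the same invariance holds only if the delta is expressed in the end-effector's own frame, which this translation-only sketch does not model.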

A second axis cuts across the type: discrete vs. continuous representation. Even within “pose deltas,” a model can either output continuous floating-point numbers — the natural representation — or discretize each dimension into a fixed number of bins and predict bin indices. RT-1, RT-2, and OpenVLA all chose discretization (256 bins per axis is canonical), because it lets them reuse a language-model decoder head and a cross-entropy loss. π0 and Octo went the other way, with continuous outputs from a diffusion or flow-matching head. We will spend Chapter 10 on the trade-off; for the anatomy, what matters is that the same physical action — “move 1 cm in +x” — gets represented and trained differently depending on which side of this choice the architecture made.
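A minimal sketch of the discretization step, assuming uniform bins over a fixed range. The 256-bin count comes from the text; the range of ±5 cm and the function names `to_bin` and `from_bin` are assumptions for illustration, and real systems choose their ranges from the training data in their own ways.

```python
N_BINS = 256  # bins per axis, as in the text


def to_bin(value: float, low: float, high: float, n_bins: int = N_BINS) -> int:
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)          # out-of-range values saturate
    fraction = (clipped - low) / (high - low)     # position within the range, 0..1
    return min(int(fraction * n_bins), n_bins - 1)


def from_bin(index: int, low: float, high: float, n_bins: int = N_BINS) -> float:
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


LOW, HIGH = -0.05, 0.05           # hypothetical per-step range in metres
index = to_bin(0.01, LOW, HIGH)   # "move 1 cm in +x"
recovered = from_bin(index, LOW, HIGH)
print(index, recovered)
```

The round trip is lossy: the recovered value is the bin center, so it can differ from the original by up to half a bin width (about 0.2 mm with these assumed bounds). The bin index is what a language-model head predicts as a token; the continuous alternative skips this step and regresses the value directly.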

Slot 3 — Training signal: how the model learns to fill the gap

The third slot is what tells the model that one mapping from inputs to outputs is better than another. This is where action models split into the four families we will name in Section 1.4, and it is the slot where the methods have changed the most over the last fifty years.

Three training-signal types dominate the modern field. They are not exclusive; most contemporary systems use two of them in combination.

The modern recipe, almost without exception, layers these signals: self-supervised pretraining on internet data, supervised imitation on robot demonstrations, and, optionally, reinforcement-learning fine-tuning to close the last gap. When you read a paper, the most informative question is not “which signal does it use?” but “in what proportions, and in what order?”
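The supervised-imitation layer of that recipe reduces to a per-step loss against the demonstrated action. The sketch below shows the two shapes that loss commonly takes, matching the discrete and continuous output choices from Slot 2. It is a simplification: the function names are hypothetical, and the plain squared error only stands in for the continuous case, since diffusion and flow-matching heads use more elaborate objectives.

```python
import math


def cross_entropy(logits: list[float], target_index: int) -> float:
    """Negative log-probability of the demonstrated bin under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def mean_squared_error(predicted: list[float], demonstrated: list[float]) -> float:
    """Average squared gap between predicted and demonstrated continuous actions."""
    return sum((p - d) ** 2 for p, d in zip(predicted, demonstrated)) / len(predicted)


# Discrete head: one logit per bin for a single action axis (256 bins assumed).
demonstrated_bin = 153
untrained_logits = [0.0] * 256
confident_logits = [0.0] * 256
confident_logits[demonstrated_bin] = 10.0

loss_untrained = cross_entropy(untrained_logits, demonstrated_bin)   # equals ln(256)
loss_confident = cross_entropy(confident_logits, demonstrated_bin)

# Continuous head: compare a predicted delta with the demonstrated one.
loss_regression = mean_squared_error([0.012, 0.0, 0.0], [0.010, 0.0, 0.0])
print(loss_untrained, loss_confident, loss_regression)
```

In both cases the demonstration supplies the target; what differs is whether the model is scored on picking the right bin or on landing near the right number.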

A worked instance: OpenVLA in three slots

The anatomy is easier to remember if you pin it to a real model. Take OpenVLA (Kim et al. 2024, arXiv:2406.09246), the open-source VLA you will run in Chapter 2.

Three slots, four design choices each, one paragraph. You can do this exercise for any model in the Model Zoo (Appendix F) and the structure of the design space falls out.
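One way to run that exercise is to keep a small record per model and fill it in as you read. The sketch below is a hypothetical template, not a format used by the book or by any paper: the class name and fields are assumptions. The OpenVLA entries are limited to what this section states about it (discretized outputs, 256 bins per axis, a cross-entropy loss); the inputs are left unfilled on purpose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotCard:
    """Hypothetical reading template: one record per model, one field per choice."""
    model: str
    inputs: Optional[list[str]] = None             # Slot 1
    action_representation: Optional[str] = None    # Slot 2
    action_bins_per_axis: Optional[int] = None     # Slot 2, discrete models only
    training_loss: Optional[str] = None            # Slot 3

    def unfilled(self) -> list[str]:
        """Names of the fields still to be filled in from the paper."""
        return [name for name, value in vars(self).items() if value is None]


card = SlotCard(
    model="OpenVLA",
    action_representation="discretized bins",
    action_bins_per_axis=256,
    training_loss="cross-entropy",
)
print(card.unfilled())
```

Comparing two cards field by field is a quick way to see where two models made different choices at the same slot.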

Where this differs from a perception model and from a planner

Two contrasts close out this section, because the entire premise of action models — and of this book — is that they are a distinct object from the two things they are most often confused with.

A perception model — an image classifier, an object detector, a vision-language model — has Slot 1 and Slot 3 but not a Slot 2 in the sense above. Its outputs are labels, segments, or natural-language responses, not commands a robot will execute. A perception model can be a component of an action model, and indeed every modern VLA has a perception model embedded in it. But the embedding is not free: the perception model has to be wired into a head that emits actions, and a training signal that grounds those actions has to be added. Most of the engineering effort in OpenVLA, RT-2, and π0 lives in that wiring.

A planner (a STRIPS- or PDDL-based task planner, or a motion planner such as RRT or PRM) has Slot 2 but typically not Slot 1 or Slot 3 in the sense above. It takes a symbolic or geometric description of the world rather than raw sensor input, it produces an action or trajectory (Slot 2), and it does not learn from data: its rules are written by hand. Classical planners are extremely good at certain things action models are bad at, and Chapter 4 makes that case in detail. They are bad at certain things action models are good at, which is why the two coexist in modern robotic stacks rather than one replacing the other.

The action models this book is about sit in the middle: they accept raw high-dimensional sensor input like a perception model, they produce executable actions like a planner, and they learn the mapping from data rather than having it written. That combination is what makes them new and what makes them hard. Section 1.3 traces the history of how the field arrived at that combination — and Section 1.4 names the four families that share the slot structure but differ in how they fill it.

References

  1. Kim et al. (2024). OpenVLA. arXiv:2406.09246.
  2. Brohan et al. (2022). RT-1. arXiv:2212.06817.
  3. Sutton & Barto (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  4. Argall et al. (2009). A Survey of Robot Learning from Demonstration. RAS 57(5).