Chapter 1 · The robot learning problem

§1.x Hands-on exercise + chapter references

The Chapter 1 exercise has no GPU requirement, no code requirement, and no robot requirement. It is a reading and classification exercise, designed to take about ninety minutes, that will make the four-family vocabulary from §1.4 and the three-slot anatomy from §1.2 stick. Resist the urge to skip it. Most students who do the exercise report afterwards that it changed how they read the rest of the book; most students who skip it end up doing it implicitly, more slowly, during Chapter 11.

Exercise 1.x.1 — The four-paper triage

Pick four papers, one from each family, read each in abstract-plus-figures-only mode (twenty minutes per paper, no full read), and fill in a table. Recommended choices for a first pass:

For each paper, fill in seven cells:

  1. Family (one of the four).
  2. Era (one of the six from §1.3).
  3. Slot 1 — input. What does the model see? Be specific: image resolution, number of cameras, presence of proprioception, language conditioning.
  4. Slot 2 — output. What does the model emit? Symbolic plan, joint targets, end-effector deltas, action tokens, continuous chunks? At what rate?
  5. Slot 3 — training signal. Derivation, reward, demonstration, or pretraining-plus-fine-tune? On what data?
  6. Compounding-error response. What does the paper do (if anything) to handle the closed-loop / open-loop mismatch from §1.1?
  7. One sentence on why this paper mattered. No hedging.

A worked example, for ALVINN:

| Cell | Value |
| --- | --- |
| Family | Imitation |
| Era | 2 (end-to-end imitation, 1988 prototype) |
| Slot 1 | 30×32 grayscale camera, single front-facing |
| Slot 2 | Steering angle (a single scalar), at ~10 Hz |
| Slot 3 | Supervised regression on human-driving demonstrations |
| Compounding-error response | None explicitly; trajectory divergence was the dominant failure mode |
| Why it mattered | First demonstration that a neural network could output a control signal directly from pixels, on real hardware |
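
If you would rather keep the triage table in a file than on paper, the seven cells map naturally onto a small record type. The sketch below is one possible encoding and not part of the book's tooling: the class name `TriageRow`, its field names, and the `FAMILIES` tuple are hypothetical, and the values are copied from the ALVINN worked example above.

```python
from dataclasses import dataclass, fields

# The four families used in this chapter (the label spellings are our own).
FAMILIES = ("classical", "rl", "imitation", "foundation/vla")


@dataclass(frozen=True)
class TriageRow:
    """One row of the four-paper triage table (hypothetical field names)."""
    paper: str
    family: str             # cell 1: one of FAMILIES
    era: int                # cell 2: 1..6, the six eras of section 1.3
    slot1_input: str        # cell 3: what the model sees
    slot2_output: str       # cell 4: what the model emits, and at what rate
    slot3_signal: str       # cell 5: training signal and data
    compounding_error: str  # cell 6: response to compounding error
    why_it_mattered: str    # cell 7: one sentence, no hedging

    def __post_init__(self):
        # Reject values that fall outside the chapter's vocabulary.
        if self.family not in FAMILIES:
            raise ValueError(f"unknown family: {self.family!r}")
        if not 1 <= self.era <= 6:
            raise ValueError(f"era must be between 1 and 6, got {self.era}")

    def as_text(self) -> str:
        """Render the row as aligned 'name  value' lines, one per field."""
        names = [f.name for f in fields(self)]
        width = max(len(n) for n in names)
        return "\n".join(f"{n:<{width}}  {getattr(self, n)}" for n in names)


alvinn = TriageRow(
    paper="ALVINN",
    family="imitation",
    era=2,
    slot1_input="30×32 grayscale camera, single front-facing",
    slot2_output="Steering angle (a single scalar), at ~10 Hz",
    slot3_signal="Supervised regression on human-driving demonstrations",
    compounding_error="None explicitly; trajectory divergence was the dominant failure mode",
    why_it_mattered="First demonstration that a neural network could output a "
                    "control signal directly from pixels, on real hardware",
)

print(alvinn.as_text())
```

The validation in `__post_init__` is the useful part: it forces every row to commit to exactly one family and one era, which is the commitment the exercise asks for.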

Doing this for four papers takes ninety minutes the first time and is faster on the second pass. The table is itself the artifact; do not write prose around it.

Exercise 1.x.2 — The dishwasher problem, written down

Re-do the dishwasher worked example from §1.4 in writing, for a robot and kitchen you have actually seen. Choose any household task — emptying a dishwasher, sorting recycling, putting laundry in a basket — and write four short paragraphs (under 150 words each), one per family, that describe how you would build a robot to do it. For each, name:

The exercise is the writing, not the answer. There is no scoring rubric; the value is in noticing where you hesitate. The places you hesitate are the places later chapters will fill in.
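
The 150-word limit on each paragraph is easy to check mechanically. This is a minimal sketch, assuming that whitespace-separated tokens count as words; the function names and the placeholder drafts are illustrative only.

```python
WORD_LIMIT = 150  # the exercise asks for paragraphs "under 150 words each"


def word_count(paragraph: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(paragraph.split())


def over_limit(paragraphs: dict[str, str], limit: int = WORD_LIMIT) -> dict[str, int]:
    """Return {family: word count} for every paragraph that is not under the limit."""
    counts = {name: word_count(text) for name, text in paragraphs.items()}
    return {name: n for name, n in counts.items() if n >= limit}


# Placeholder drafts, one per family; replace the strings with your own paragraphs.
drafts = {
    "classical": "Replace with your classical-pipeline paragraph.",
    "rl": "Replace with your RL paragraph.",
    "imitation": "Replace with your imitation paragraph.",
    "foundation/vla": "Replace with your foundation/VLA paragraph.",
}

print(over_limit(drafts))
```

An empty dictionary means all four paragraphs fit; any family that appears in the result needs trimming.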

Exercise 1.x.3 — Read one survey, end to end

Open Sapkota et al. (arXiv:2505.04769) and read it in one sitting. It is long but readable, and the local source mirror has it cached. Mark, with a pencil or a highlighter, three places where it refers to a model or method that does not yet make sense to you. Specific chapters of this book, usually among Chapters 11 through 14, will explain each of them. At the end of each of those chapters, come back to the survey and re-read the marked passages. The before-and-after difference is the single best check that the book is working.

If you find yourself unable to mark three places (that is, if every sentence of the Sapkota survey already makes sense), you already have the prerequisites the book assumes and can probably skip Chapter 3 (math refresher) and parts of Chapter 8 (transformers for control), going directly to Part 4. Most readers will mark more than three.

Exercise 1.x.4 — A short language exercise

Write a one-sentence definition for each of the five terms below, without looking back. The list:

  1. Compounding error.
  2. The three-slot anatomy.
  3. Action tokenization.
  4. The dividing line (between classical and learned components in a stack).
  5. Cross-embodiment generalization.

Then look back and check. The point is not the score; the point is noticing which of the five you cannot define cleanly. Each of the five is a recurring concept; the ones you cannot define yet are the ones to keep a finger on while you read the next chapter.

Chapter 1 reading list

The works below are the ones cited or referenced in §1.1–§1.6. They are grouped by purpose. Full bibliographic entries for everything cited in the whole book live in Appendix E.2; this list is the chapter-local subset.

Foundational references

Deep RL and policy learning

Imitation and behavior cloning

Foundation / VLA models

Surveys and field-state references

Background textbooks worth keeping nearby

Chapter summary

Chapter 1 set out a vocabulary for thinking about action models: the three-slot anatomy of inputs, outputs, and training signal; the six-era history that moved from STRIPS to π0; and the four families (classical, RL, imitation, foundation/VLA) that organize the rest of the book. With that vocabulary, you can now read a modern VLA abstract and pull out the design choices, place a published system in the right family and era, sketch the four candidate solutions to a new robot task, and predict the failure mode of an action model from the family it sits in. Chapter 2 is where the vocabulary first earns its keep: a complete, end-to-end fine-tune of an OpenVLA checkpoint on a small benchmark, on a single GPU, in roughly seven pages.

References

  1. Fikes & Nilsson (1971). STRIPS. Artificial Intelligence 2(3–4).
  2. Pomerleau (1988). ALVINN. NeurIPS 1988.
  3. LaValle (2006). Planning Algorithms. Cambridge University Press.
  4. Argall et al. (2009). A Survey of Robot Learning from Demonstration. RAS 57(5).
  5. Kober, Bagnell & Peters (2013). RL in Robotics: A Survey. IJRR 32(11).
  6. Mnih et al. (2015). Human-level control through deep RL. Nature 518.
  7. Brohan et al. (2022). RT-1. arXiv:2212.06817.
  8. Brohan et al. (2023). RT-2. arXiv:2307.15818.
  9. Kim et al. (2024). OpenVLA. arXiv:2406.09246.
  10. O'Neill et al. (2023). Open X-Embodiment. arXiv:2310.08864.
  11. Black et al. (2024). π0. arXiv:2410.24164.
  12. Sapkota et al. (2025). VLA Models: Concepts, Progress, Applications, Challenges. arXiv:2505.04769.