
Rooms Over Red Pens: What's with the sudden interest in RL?

During pretraining, progress came from the web. Models read large, diverse, cleaned text and learned to predict the next token. That one skill unlocked many others.

During supervised finetuning, progress came from paired exchanges. Prompts and answers, shaped to look like the outputs you want at inference time. Style guides, instruction patterns, careful curation.

Both still matter. Today the scarce asset is environments. Not static questions, but places a model can act, see what happened, and try again. That shifts the goal from imitation to decision-making. It also lets a single artifact do double duty: if a setup can judge behavior, small tweaks usually let it teach behavior.

Here is what an environment looks like in practice. Take a monthly invoice sweep on a flaky vendor portal. The agent has tools: open a page, wait for elements, click, rename, move files, post to Slack. The goal is simple to grade: a Slack note that lists the correct filenames now sitting in the correct folder. A minimal reward pays for that end state. Shaping adds small credit when the right file appears in downloads, and when the target folder matches the invoice date. A verifier closes cheap shortcuts by hashing files so moving a random PDF does not score. Freeze the policy and it is an eval. Pay for partial progress and it becomes training.
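To make that concrete, here is a minimal sketch of the grading logic in Python. Everything named here is hypothetical (the `SweepState` fields, the folder layout); it just mirrors the description above: shaped credit for partial progress, a hash-based verifier, and a full payout only for the correct end state.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path

@dataclass
class SweepState:
    """End-of-episode snapshot of the invoice sweep (all fields hypothetical)."""
    downloads: list[Path]       # files the agent pulled from the vendor portal
    target_folder: Path         # folder the agent filed them into
    slack_message: str          # the note the agent posted
    expected_hashes: set[str]   # known-good invoice hashes, the verifier's ground truth
    expected_folder: Path       # folder named after the invoice date
    expected_names: list[str]   # filenames the Slack note should list

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verified_invoices(state: SweepState) -> list[Path]:
    # Verifier: only files whose hashes match real invoices count,
    # so moving a random PDF into the right folder does not score.
    return [p for p in state.target_folder.glob("*.pdf")
            if sha256(p) in state.expected_hashes]

def reward(state: SweepState, shaped: bool = True) -> float:
    score = 0.0
    if shaped:
        # Shaping: small credit for the right file in downloads
        # and for a target folder that matches the invoice date.
        if any(sha256(p) in state.expected_hashes for p in state.downloads):
            score += 0.1
        if state.target_folder == state.expected_folder:
            score += 0.1
    # Minimal reward: verified invoices in the correct folder, plus a
    # Slack note that lists the correct filenames.
    filed = verified_invoices(state)
    noted = all(name in state.slack_message for name in state.expected_names)
    if filed and state.target_folder == state.expected_folder and noted:
        score += 1.0
    return score
```

Run it with `shaped=False` against a frozen policy and it is just an eval score; leave shaping on and the same function pays for partial progress during training.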

A second domain looks the same. Consider a code-fix bot in a small repo. The tools are edit, run tests, lint, open PR. The grade is clear: all tests pass without changing the tests. Shaping pays for compiling, for passing more of the suite, for shrinking the diff. A verifier blocks reward hacking by refusing runs that touch the test files. Again, the same scaffold evaluates and trains.
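A sketch of the same idea for the code-fix bot, again with hypothetical helpers (`changed_files`, `diff_size`) and an assumed layout where tests live under `tests/`; the verifier zeroes out any run whose diff touches them.

```python
import subprocess

def changed_files(repo: str) -> list[str]:
    # Hypothetical helper: which files does the working diff touch?
    out = subprocess.run(["git", "diff", "--name-only", "HEAD"],
                         cwd=repo, capture_output=True, text=True, check=True)
    return [line for line in out.stdout.splitlines() if line]

def diff_size(repo: str) -> int:
    # Hypothetical helper: total inserted plus deleted lines in the diff.
    out = subprocess.run(["git", "diff", "--shortstat", "HEAD"],
                         cwd=repo, capture_output=True, text=True, check=True)
    digits = [int(tok) for tok in out.stdout.split() if tok.isdigit()]
    return sum(digits[1:]) if len(digits) > 1 else 0

def reward(repo: str, compiled: bool, tests_passed: int, tests_total: int) -> float:
    # Verifier: refuse any run that touches the test files (assumed to
    # live under tests/), no matter how many of them now pass.
    if any(path.startswith("tests/") for path in changed_files(repo)):
        return 0.0
    score = 0.0
    if compiled:
        score += 0.1                                     # shaping: it builds
    score += 0.4 * (tests_passed / max(tests_total, 1))  # shaping: partial suite
    score -= min(0.1, diff_size(repo) / 1000)            # shaping: prefer small diffs
    if tests_passed == tests_total:
        score += 0.5                                     # the grade: everything passes
    return max(score, 0.0)
```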

This is the practical difference from a benchmark. A benchmark is a fixed pile of questions. An environment is a sandbox with tools, latency, and consequences. The model does not guess what a human would say. It tries moves. Some help. Some do nothing. Some backfire. The grade lands at the end of the run.
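The contrast fits in a few lines of Python. The interface below is a generic sketch, not any particular library's API: a benchmark is a list you score once, while an environment is something you act in until the run is graded.

```python
from typing import Callable, Protocol

# A benchmark: a fixed pile of questions, scored once by comparison.
Benchmark = list[tuple[str, str]]  # (prompt, reference answer)

def score_benchmark(bench: Benchmark, model: Callable[[str], str]) -> float:
    return sum(model(q) == a for q, a in bench) / len(bench)

# An environment: tools, state, consequences, and a grade at the end of the run.
class Environment(Protocol):
    def reset(self) -> str: ...                           # initial observation
    def step(self, action: str) -> tuple[str, bool]: ...  # (observation, done)
    def grade(self) -> float: ...                         # reward for the whole run

def run_episode(env: Environment, policy: Callable[[str], str],
                max_steps: int = 50) -> float:
    obs = env.reset()
    for _ in range(max_steps):
        obs, done = env.step(policy(obs))   # some moves help, some backfire
        if done:
            break
    return env.grade()                      # the grade lands at the end
```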

Scale follows from how experience is gathered. Many actors can collect traces in parallel while a single learner updates on the stream. Browsers, CI runners, and users in the wild can all contribute experience asynchronously. The bottleneck shifts from curating labels to generating rollouts.
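A toy version of that actor/learner split, using threads and a queue as stand-ins for the real fan-out; the environment and the update step are stubs, and names like `collect_rollout` are made up for illustration.

```python
import queue
import random
import threading
import time

rollouts: queue.Queue = queue.Queue(maxsize=1000)

def collect_rollout() -> float:
    time.sleep(0.01)        # stand-in for a browser session or CI run
    return random.random()  # stand-in for the run's terminal grade

def actor(stop: threading.Event) -> None:
    # Many actors, each owning its own environment, stream experience in.
    while not stop.is_set():
        try:
            rollouts.put(collect_rollout(), timeout=0.1)
        except queue.Full:
            continue

def learner(stop: threading.Event, steps: int = 200) -> None:
    # A single learner updates on whatever arrives, in whatever order.
    for _ in range(steps):
        try:
            grade = rollouts.get(timeout=1.0)
        except queue.Empty:
            continue
        _ = grade  # a real learner would apply a policy update here
    stop.set()

stop = threading.Event()
for _ in range(8):
    threading.Thread(target=actor, args=(stop,), daemon=True).start()
trainer = threading.Thread(target=learner, args=(stop,))
trainer.start()
trainer.join()
```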

Why now? Products want models that operate across many steps and tools, where intermediate labels do not exist. Labeling every mouse move does not scale. Well-made environments do. They turn real work into repeatable exercises, and they make evaluation and training the same asset.

None of this says scalar rewards are perfect. Rewards get gamed if gates are sloppy. Verifiers and preference signals keep the score honest. The claim is narrower and more durable: when the job is multi-step decision-making, leverage lives in environments that are varied, durable, and easy to share.

Red pens polish steps. Rooms produce outcomes. Build the rooms where good outcomes are the easiest thing to do.