Real-World RL: From The Matrix to Planet Earth

Today Physical Intelligence released the technical report of $\pi_{0.6}^*$ ¹. The most important part is that it uses Real-World RL on real robots to improve performance. I had already read some Real-World RL work before, so I used this chance to review the existing work, sort out the story, and think a bit about possible future directions.

Background: What is Real-World RL?

For a long time, “robots + reinforcement learning” almost always meant “collecting experience inside a physics simulator”. We put virtual robots into simulators and frameworks such as MuJoCo, Isaac Gym, Isaac Lab, and ManiSkill, let them practice day and night, and after hundreds of millions of actions obtain a policy that maximizes return. We then rely on sim-to-real techniques such as domain randomization or residual actions to compensate for the dynamics mismatch between the simulator and the real world. This mismatch is usually called the Sim2Real gap, and it becomes more pronounced as tasks become more dynamic and longer-horizon.
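As a rough illustration of the domain randomization mentioned above, the sketch below resamples a few physics parameters at the start of every simulated episode. The parameter names and ranges are invented for illustration and are not taken from any particular simulator or paper.

```python
import random
from dataclasses import dataclass


@dataclass
class PhysicsParams:
    """Hypothetical simulator parameters; names and ranges are illustrative."""
    friction: float
    mass_scale: float
    motor_delay_s: float


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw one set of physics parameters for a single training episode."""
    return PhysicsParams(
        friction=rng.uniform(0.5, 1.5),
        mass_scale=rng.uniform(0.8, 1.2),
        motor_delay_s=rng.uniform(0.0, 0.03),
    )


rng = random.Random(0)
# Each episode sees a slightly different simulated "world", so the policy
# cannot overfit to one exact set of dynamics.
episodes = [sample_physics(rng) for _ in range(1000)]
```

The hope is that the real robot's dynamics fall somewhere inside the randomized range; when they do not, the remaining mismatch is exactly the Sim2Real gap.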

The term “Real-World RL” refers to another path that is more direct, tougher, and also more meaningful: letting data collected on real robots in the real world carry most of the learning. Simulation is either not used at all or used only as a rough starting point, and repeated interaction in the real world polishes the final performance. This is a very appealing goal because, as with the evolution of life on Earth, the real world is the richest, most complex, and most faithful training ground, and the only place where a robot can really learn how to survive.

Compared with classic “sim-to-real”, there are two key differences:

The trajectories for RL come from the real world, so there is no Sim2Real gap
Algorithms and systems must face real-world constraints seriously: hardware breaks, objects get lost, environments are messy, and human patience is limited

This is why DayDreamer ² emphasizes from the start: no simulation, no human demonstrations, only online RL on real robots. It takes the Dreamer world model and puts it on real platforms, and shows that a world model can also learn behaviors efficiently in the physical world.

Later, A Walk in the Park ³ lowered the bar even further: no world model, no complex pretraining. As long as you polish a model-free RL algorithm and the controller and engineering details, a Unitree A1 quadruped can learn to walk stably on various outdoor terrains in about 20 minutes of real-world interaction.

From these works on, Real-World RL stopped being just a “cool vision” and became a technical route that people could discuss seriously.

First Stage: From Learning in Simulation to Learning in Reality

If you only read the titles, it is easy to treat DayDreamer and A Walk in the Park as two separate lines: one from the world-model side, the other from the model-free side. But if you stretch out the timeline, they actually work together to show one thing: doing RL in the real world is not a myth, but an engineering problem that can be solved.

DayDreamer: Not in World, in World Model

In short, DayDreamer moves the Dreamer world model to real robots and tests whether it can still learn. It evaluates on several different platforms: for example, training robots in the real world to navigate, to balance, and to do locomotion control, without relying on simulation or human teleoperation. Data collection and training run online in parallel. ²

In this setup the world model matters because it gives real-world RL a learned virtual environment and makes each real transition go further: the robot improves its policy inside the learned world model, which greatly reduces the number of real interactions needed. This is what makes online RL on physical robots practical in this setting.
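To make the idea concrete, here is a deliberately tiny sketch of "learn a model from a few real steps, then improve the policy in imagination". It is not Dreamer (which learns a latent dynamics model and trains an actor-critic inside it); the 1-D system, the one-parameter model, and the gain search are all assumptions chosen to keep the example short.

```python
import random


def real_step(x: float, a: float) -> float:
    """Stand-in for the physical robot; the agent does not know this rule."""
    return x + 0.5 * a


# 1) Collect a small batch of real transitions (the expensive part).
rng = random.Random(0)
real_data = []
x = 1.0
for _ in range(20):
    a = rng.uniform(-1.0, 1.0)
    x_next = real_step(x, a)
    real_data.append((x, a, x_next))
    x = x_next

# 2) Fit a one-parameter model x' = x + b * a by least squares.
num = sum(a * (x_next - x) for x, a, x_next in real_data)
den = sum(a * a for _, a, _ in real_data)
b_hat = num / den


def imagined_cost(gain: float, horizon: int = 10) -> float:
    """Roll out the policy a = -gain * x inside the learned model only."""
    x, cost = 1.0, 0.0
    for _ in range(horizon):
        x = x + b_hat * (-gain * x)
        cost += x * x
    return cost


# 3) Improve the policy with imagined rollouts, using no extra real steps.
candidate_gains = [0.5, 1.0, 1.5, 2.0, 2.5]
best_gain = min(candidate_gains, key=imagined_cost)
```

Here 20 real steps support 50 imagined ones; the same leverage, at a much larger scale, is what the world model buys on a physical robot.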

DayDreamer is like a small road sign written on a rough wooden board: this road is walkable, even if the path is still narrow.

A Walk in the Park: Algorithm or Engineering?

Shortly afterwards, A Walk in the Park gave a result that looks somewhat counter-intuitive: you do not need a world model, and you do not need a fancy algorithm. If you carefully refine a model-free method (for example, a SAC-like algorithm) and the low-level controller, you can train the A1 robot directly in the real world, have it learn to walk in about 20 minutes, and adapt to grass, gravel, mountain trails, and other outdoor terrains. ³

The message of this work is subtle:

On one hand, it follows the direction of DayDreamer: no simulation, learn directly in the real world
On the other hand, it attributes most of the success to MDP design, control stack design, and engineering optimization, rather than to some “revolutionary RL algorithm”

From then on, the sentence “the real difficulty lies in the system and engineering, not in the loss function” started to sound more convincing.

RoboCat: Self-generated Data

In 2023, DeepMind released RoboCat, which takes a different path. It is not a system designed specifically for Real-World RL, but it looks like a kind of “rehearsal” for real-world RL. ⁴

RoboCat builds on a Gato-style vision-based decision Transformer. It trains a generalist agent from demonstration data covering many robots and many tasks, and then runs a self-improvement loop: humans provide 100–1000 demonstrations for a new task, the model is fine-tuned, then it practices about ten thousand times in simulation or in the real world, and the generated data is fed back into the training set to produce a new, stronger version.
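The self-improvement loop described above has a simple shape, sketched below. Both `fine_tune` and `practice` are hypothetical stand-ins (a toy "skill" number replaces an actual trained agent), and the count of 500 demonstrations is just one value inside the 100–1000 range mentioned in the text.

```python
import random
from typing import List, Tuple

Episode = Tuple[str, bool]  # (task name, success flag); toy stand-in for a trajectory


def fine_tune(dataset: List[Episode]) -> float:
    """Hypothetical stand-in for training: returns a toy 'skill' in [0, 1)."""
    return len(dataset) / (len(dataset) + 500.0)


def practice(skill: float, task: str, n: int, rng: random.Random) -> List[Episode]:
    """Hypothetical stand-in for autonomous rollouts with the current agent."""
    return [(task, rng.random() < skill) for _ in range(n)]


rng = random.Random(0)
task = "new_task"
dataset: List[Episode] = [(task, True)] * 500      # human demonstrations
skills = []
for _ in range(3):                                 # three self-improvement rounds
    skill = fine_tune(dataset)                     # train on everything collected so far
    skills.append(skill)
    dataset += practice(skill, task, 10_000, rng)  # self-generated data goes back in
```

The toy skill function only encodes "more data helps"; the structural point is that the data the agent generates itself ends up in the next round's training set.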

Although RoboCat does not emphasize “online RL” as strongly as DayDreamer or A Walk in the Park, it clearly brings out another important idea:

A generalist robot policy can keep getting stronger by “generating its own data and improving itself”.

This idea will later become a core theme in $\pi_{0.6}^*$.

Second Stage: From “Can Learn” to “Learns Well”

The first stage showed that real-world RL can learn. But when people try to deploy these systems, they care about two other questions:

Can the agent learn in a way that is stable enough, with success rates close to 100%?
Can it keep running for a long time, without needing a human to rescue it every half hour?

A series of works between 2023 and 2025 can be seen as systematic answers to these questions.

HIL-SERL: Data + Human Corrections + RL

HIL-SERL (Human-in-the-Loop Sample-Efficient RL) comes from Berkeley RAIL and appeared in Science Robotics in 2025. It targets a set of tasks that are much harder than “learning to walk”: dynamic shaking to pull out a block, precise assembly, dual-arm collaboration, pan tossing for cooking, and other real manipulation tasks. ⁵

The training procedure of HIL-SERL is simple but effective. First, they collect successful and failed trajectories by teleoperation and train a binary reward model that judges success or failure. Next, they use a small number of demonstrations to initialize the policy. Finally, they run online RL on real robots, where humans step in to correct the robot at key moments. RL then improves the policy using this corrected data and the learned reward.
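The data flow of such a human-in-the-loop rollout can be sketched as follows. Everything here is a toy stand-in: the 1-D state, the hand-written "classifier", the random policy, and the rule for when the operator takes over are assumptions, and routing corrections into a separate demonstration buffer is one plausible design rather than a statement of the paper's exact implementation.

```python
import random
from typing import List, NamedTuple, Optional


class Transition(NamedTuple):
    state: float
    action: float
    reward: float
    next_state: float
    from_human: bool


def reward_classifier(state: float) -> float:
    """Stand-in for the learned binary success classifier (1.0 = success)."""
    return 1.0 if abs(state) < 0.1 else 0.0


def policy_action(state: float, rng: random.Random) -> float:
    """Stand-in for the current, still imperfect, RL policy."""
    return rng.uniform(-0.3, 0.3)


def human_correction(state: float) -> Optional[float]:
    """Toy operator: takes over only when the state drifts far from the goal at 0."""
    return -0.5 * state if abs(state) > 1.0 else None


rng = random.Random(0)
demo_buffer: List[Transition] = []   # demonstrations and human corrections
rl_buffer: List[Transition] = []     # everything the robot experienced
state = 0.8
for _ in range(200):
    correction = human_correction(state)
    action = correction if correction is not None else policy_action(state, rng)
    next_state = state + action
    t = Transition(state, action, reward_classifier(next_state), next_state,
                   from_human=correction is not None)
    rl_buffer.append(t)
    if t.from_human:
        demo_buffer.append(t)        # corrections are also kept as demonstration data
    state = next_state
```

An off-policy learner would then sample from both buffers, so that the relatively rare corrections are not drowned out by ordinary policy data.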

The results are very direct: on a set of complex manipulation tasks, HIL-SERL can push the success rate of vision-based policies close to 100% in about 1–2.5 hours of interaction, and the final execution speed is even faster than human teleoperation.

This work makes two points that strongly influence later research:

Real-world RL should not start from random exploration, but should stand on top of demonstrations
Human interventions are not “noise”; they are the key component that makes RL both safe and efficient

You can view it as an upgrade of what DayDreamer and A Walk in the Park did: from “can learn” to “can learn well, and learn fast”.

RL-100: Systematizing the Pipeline

If HIL-SERL is still a method, RL-100 has already grown into an engineering system.

RL-100 proposes a three-stage pipeline: first, use imitation learning to inject human experience into a diffusion policy; second, run offline RL with offline policy evaluation (OPE) to obtain conservative policy improvement; finally, run a short period of online RL on real robots to clean up the remaining failure modes. ⁶
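One way to picture "conservative policy improvement" in the offline stage is as a gate: a candidate policy replaces the current one only when the offline estimate of its value is clearly better. The sketch below illustrates that idea only; the margin rule and the numbers are assumptions, not the paper's actual OPE procedure or results.

```python
from typing import List


def ope_gated_improvement(ope_estimates: List[float], margin: float = 0.01) -> List[int]:
    """Return indices of candidate policies accepted by a conservative gate.

    ope_estimates[i] is an offline estimate of candidate i's value; index 0 is
    the imitation-learned starting policy, which is always accepted.
    """
    accepted = [0]
    best = ope_estimates[0]
    for i, value in enumerate(ope_estimates[1:], start=1):
        if value >= best + margin:   # only move when OPE predicts a clear gain
            accepted.append(i)
            best = value
    return accepted


# Illustrative numbers only, not results from the paper.
accepted = ope_gated_improvement([0.70, 0.74, 0.73, 0.80, 0.805])
```

With these made-up estimates, candidates 1 and 3 pass the gate while 2 and 4 are rejected, so the policy never moves on a marginal or negative predicted change.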

They validate the system on seven real-robot tasks, including cloth folding, pouring fluids and granular materials, dynamic pushing, dexterous nut tightening, and multi-stage orange juicing. In 900 evaluations they achieve 900/900 successes, and some tasks can run 250 times in a row without failure.

Technically, RL-100 and HIL-SERL share the same spirit:

Both rely on demonstrations and offline data to ensure a good starting point
All exploration stays within safety boundaries monitored by OPE or humans
The role of RL is to fix long-tail failures, not to invent motions from scratch

But RL-100 does one extra important thing: it turns the whole pipeline into a framework that is relatively agnostic to tasks, robot platforms, and sensing modalities. This is a step from “paper demo” toward “reusable system”.

Contact-Rich Sim-to-Real: A Compromise Route

For assembly and tight insertion tasks where contact mechanics are very sensitive, learning entirely in the real world is still too risky. For such settings, work from Tomizuka’s group proposes a hybrid idea: learn trajectories and compliance parameters with RL in simulation, then, in the real world, only do online fine-tuning of a small admittance residual. ⁷
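For readers unfamiliar with admittance control, the sketch below shows a 1-D version of the law $M\ddot{x} + D\dot{x} + K(x - x_{\text{ref}}) = f_{\text{ext}}$ together with a bounded online residual added to a gain learned in simulation. Applying the residual to the stiffness, the 20% bound, and all numeric values are assumptions for illustration, not details taken from the paper.

```python
def admittance_step(x, v, x_ref, f_ext, mass, damping, stiffness, dt=0.002):
    """One semi-implicit Euler step of M*a + D*v + K*(x - x_ref) = f_ext."""
    a = (f_ext - damping * v - stiffness * (x - x_ref)) / mass
    v = v + a * dt
    x = x + v * dt
    return x, v


def apply_residual(base, residual, max_fraction=0.2):
    """Add a bounded online correction to a gain learned in simulation."""
    limit = max_fraction * base
    return base + max(-limit, min(limit, residual))


base_stiffness = 500.0                                       # pretend this came from simulation
stiffness = apply_residual(base_stiffness, residual=-300.0)  # online proposal, clipped to -20%
x, v = 0.0, 0.0
for _ in range(5000):                                        # 10 s of contact with a constant 4 N push
    x, v = admittance_step(x, v, x_ref=0.0, f_ext=4.0,
                           mass=1.0, damping=40.0, stiffness=stiffness)
```

Because the residual is clipped, even a poor online update can only soften or stiffen the controller slightly, which is what keeps the real-world fine-tuning low-risk.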

This style of method may not be as eye-catching as HIL-SERL or RL-100, but it is very practical in industrial scenarios: most of the risk is handled in simulation, and real-world RL only “tightens the screw a little bit”.

You can view it as an important side branch in the second stage: Real-World RL is not always the main actor, but can serve as the final adaptation layer in sim-to-real.

Third Stage: From Task-Specific to General Policies

So far, most work still focuses on “letting the robot learn a single task”. $\pi_{0.6}^*$ does something slightly counter-intuitive: it changes the object of RL training from “a task” to “a general policy”.

A Good-Enough General VLA

Physical Intelligence released $\pi_{0}$ in 2024. This model is essentially a vision-language-action (VLA) foundation model: it combines internet-scale vision–language pretraining with large-scale robot data to train a model that can generalize zero-shot and few-shot across robots and tasks. ⁸

$\pi_{0.5}$ and $\pi_{0.6}$ then scale up the model, the training data, and the architecture, yielding a large policy model that can “basically get the job done” on many household and simple industrial tasks. But, like the earlier systems discussed above, it hits a familiar problem: success rates are passable, but still not high enough for real use.

This is the background for $\pi_{0.6}^*$.

RL with Experience & Corrections

The technical report of $\pi_{0.6}^*$ tells the story in a very human way: first you go to class, then the teacher corrects you, and finally you practice by yourself. ¹

More concretely, they propose Recap (RL with Experience & Corrections via Advantage-conditioned Policies), which looks like this:

First they run offline RL on $\pi_{0.6}$, so that the model learns from offline data to “tell good actions from bad actions”
For each concrete task, they then run a round of supervised / imitation learning fine-tuning from human demonstrations, so the model has a decent starting point
Next they deploy the model on real robots and let it run the task by itself. Humans only step in when there are clear mistakes; these corrections are logged as examples of “what the correct action should have been in states where the model actually fails”
Finally they train a value function on the model’s own trajectories, and compute an advantage signal that marks which actions are “better than average” or “worse than average”. This advantage is fed as a condition into the VLA, so that the policy learns to prefer high-advantage behavior
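The last step can be sketched in a few lines: compute the return from each step, subtract the value estimate to get an advantage, and turn it into a binary "better than expected" label that is attached to the training example. The per-step reward of -1 until success, the zero threshold, and the hand-written value estimates below are assumptions for illustration; the report's actual reward design and conditioning format may differ.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of future rewards from each time step onward."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def advantage_labels(rewards: List[float], values: List[float],
                     threshold: float = 0.0) -> List[bool]:
    """True where the observed return beat the value function's prediction."""
    returns = returns_to_go(rewards)
    return [g - v > threshold for g, v in zip(returns, values)]


# Toy episode: -1 per step until the task succeeds on the last step.
rewards = [-1.0, -1.0, -1.0, 0.0]
values = [-2.5, -2.5, -0.5, 0.0]   # hypothetical value-function predictions
labels = advantage_labels(rewards, values)

# Each training example carries its label as an extra conditioning input;
# at deployment the policy is always conditioned on the "good" label.
training_examples = [(t, label) for t, label in enumerate(labels)]
```

Training on both labels lets the policy learn from failures as well as successes, while conditioning on the positive label at test time asks it to reproduce only the better-than-average behavior.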

This may sound abstract, but we can describe it more simply: $\pi_{0.6}^*$ uses RL to fix the real errors that $\pi_{0.6}$ makes in the physical world, one by one. It does not only fix the single state where you corrected the robot; by using an advantage-conditioned policy, it tries to improve behavior in all similar situations.

What about results? The report lists concrete numbers and case studies: on complex tasks such as making espresso, assembling cardboard boxes, and folding different types of clothes, training with Recap roughly doubles throughput (the number of tasks completed per unit time) compared with the same model without Recap, and cuts failure rates by half or more. The team runs robots from 5:30 in the morning to 11:30 at night making coffee, folding 50 unseen garments in a stranger’s house, or assembling 59 real boxes on a factory line, without any run ending early because of model errors.

If you zoom out on the timeline, it is natural to see $\pi_{0.6}^*$ standing on the shoulders of the previous work:

Like HIL-SERL, it uses the trio of demonstrations + human corrections + RL to solve long-tail failures
Like RL-100, it treats RL as a “final repair layer” that upgrades performance from “sometimes wrong” to “rarely wrong”
But it also goes further: it is not optimizing a policy for a single task, but fine-tuning a large, general model

With $\pi_{0.6}^*$, Real-World RL finally changes its role from “a skill learning algorithm” to “the last-mile training tool for a general policy”.

Summary and Outlook: Where Might Real-World RL Go Next?

Compressed into one sentence, the story above is roughly:

DayDreamer and A Walk in the Park show that “real-world RL can learn”; HIL-SERL and RL-100 show that “it can learn stably and for a long time”; and $\pi_{0.6}^*$ shows that “it can become the last step for training general robot policies”.

From the point of view of research methodology, Real-World RL has quietly gone through several conceptual shifts:

From “we need new RL algorithms” to “we need reliable system engineering and training pipelines”
From “letting RL learn one skill in the real world” to “letting RL fix all the corners that a VLA cannot handle in the real world”
From “simulation is the main work and the real world is only for evaluation” to “real experience is a necessary stage, and simulation is just the appetizer”

Looking forward, some promising directions might cluster around:

Larger-scale real-world data: trying “robots generate their own training data” across many tasks at the same time
More automated and cheaper human intervention and safety mechanisms: for example, better semi-automatic correction, batch annotation tools, making human corrections cheaper and more natural, and letting robots recover autonomously in more situations, instead of requiring an engineer to stand by as the “big red button”
More dexterous motion: overcoming the large Sim2Real gap in dexterous or highly dynamic manipulation, so that Real-World RL can learn genuinely “human-like” in-hand manipulation, such as one-handed Rubik’s Cube rotations or picking up objects with chopsticks, instead of just composing simple pick-and-place actions

If you are doing research or building products in robot learning, the most practical value of Real-World RL today may not be “inventing an even fancier RL algorithm”, but carefully answering two very concrete questions:

Which parts of your system should be handled by demonstrations and offline training, so that the model becomes smart enough to not self-destruct easily?
And then, where must RL touch the real world, and learn from real failures and long tails?

$\pi_{0.6}^*$ gives the following answer:

Demonstrations and pretraining are responsible for getting the success rate above zero, and Real-World RL is responsible for walking through all scenarios where the policy would crash in the real world and filling in every hole, until the robot can really live in the physical world.

That is probably the most attractive part of Real-World RL: it is not meant to replace everything else, but to make the whole robot system finally stand firm in the real world.

¹ $\pi_{0.6}^*$: A VLA that Learns from Experience. Physical Intelligence Blog, 2025-11-17. https://www.pi.website/blog/pistar06
² DayDreamer: World Models for Physical Robot Learning. CoRL 2022. https://danijar.com/project/daydreamer/
³ Laura Smith et al. A Walk in the Park: Learning to Walk in 20 Minutes With Model Free Reinforcement Learning. RSS Demo Track 2023. https://arxiv.org/abs/2208.07860
⁴ RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. DeepMind, 2023. https://arxiv.org/abs/2306.11706
⁵ HIL-SERL: Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Sample-Efficient Robotic Reinforcement Learning. Science Robotics, 2025. https://hil-serl.github.io/
⁶ Kun Lei et al. RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning. arXiv:2510.14830, 2025. https://arxiv.org/abs/2510.14830
⁷ Xiang Zhang et al. Efficient Sim-to-real Transfer of Contact-Rich Manipulation Skills with Online Admittance Residual Learning. CoRL 2023. https://arxiv.org/abs/2310.10509
⁸ $\pi_{0}$: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence Blog, 2024-10-31. https://www.physicalintelligence.company/blog/pi0
