Real-World RL: From The Matrix to Planet Earth

Physical Intelligence released the technical report for $\pi_{0.6}^*$ [1]. A key point is that it uses real-world RL on physical robots to improve performance. I had been following real-world RL for a while, so I used this as a chance to review the line of work and sketch a few possible directions.

Background: What is Real-World RL?#

For a long time, robots + reinforcement learning almost meant collecting experience inside a physics simulator. We put virtual robots into environments and frameworks such as MuJoCo, Isaac Gym, Isaac Lab, and ManiSkill, let them run day and night, and after hundreds of millions of actions we find a policy that maximizes return. Then we try to use domain randomization, sim-to-real, or residual actions to make up for the dynamics gap between the simulator and the real world. This gap is usually called the Sim2Real Gap, and it becomes more obvious as tasks become more dynamic and longer-horizon.
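
To make the classic recipe concrete, here is a minimal sketch of what per-episode domain randomization typically looks like; the parameter names, ranges, and the `env`/`policy` interfaces are hypothetical placeholders, not any particular simulator's API.

```python
import random

# Hypothetical sketch of per-episode domain randomization in a classic sim-to-real
# pipeline. Parameter names, ranges, and the env/policy interfaces are illustrative.

def sample_dynamics():
    return {
        "friction":   random.uniform(0.5, 1.5),   # ground friction coefficient
        "payload_kg": random.uniform(0.0, 1.0),   # extra mass attached to the base
        "motor_gain": random.uniform(0.8, 1.2),   # actuator strength multiplier
        "latency_ms": random.uniform(0.0, 40.0),  # observation/action delay
    }

def train(env, policy, episodes=100_000):
    for _ in range(episodes):
        env.set_dynamics(sample_dynamics())       # re-randomize physics before every episode
        policy.update(env.run_episode(policy))
```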

Real-world RL takes a more direct path: let real-robot data carry most of the learning, use simulation only as a rough warm-up (if at all), and improve performance through repeated real interaction. The payoff is that training happens under the deployment dynamics, with no simulator bias.

Compared with classic sim-to-real, there are two key differences:

  • The trajectories for RL come from the real world, so there is no Sim2Real Gap
  • Algorithms and systems must face real-world constraints seriously: hardware breaks, objects get lost, environments are messy, and human patience is limited

This is why DayDreamer [2] emphasizes from the start: no simulation, no human demonstrations, only online RL on real robots. It takes the Dreamer world model, puts it on real platforms, and shows that a world model can learn behaviors efficiently in the physical world as well.

Later, A Walk in the Park [3] lowered the bar even further: no world model, no complex pretraining. As long as you polish a model-free RL algorithm together with the controller and the engineering details, a Unitree A1 quadruped can learn to walk stably on various outdoor terrains in about 20 minutes of real-world interaction.

From these works on, Real-World RL stopped being just a vision and became a technical route that people could discuss seriously.

First Stage: From Learning in Simulation to Learning in Reality#

If you only read the titles, it is easy to treat DayDreamer and A Walk in the Park as two separate lines: one from the world-model side, the other from the model-free side. But if you stretch out the timeline, they actually work together to show one thing: doing RL in the real world is not a myth, but an engineering problem that can be solved.

DayDreamer: Not in World, in World Model#

In short, DayDreamer moves the Dreamer world model to real robots and tests whether it can still learn. It evaluates on several different platforms: for example, training robots in the real world to navigate, to balance, and to do locomotion control, without relying on simulation or human teleoperation. Data collection and training run online in parallel [2].

In this setup the world model is important because it gives real-world RL a virtual environment and increases the efficiency of using real data: the robot first learns a policy in the learned world model, which greatly reduces the number of real interactions needed. This makes online RL in the real world feasible for the first time.
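
At a high level, the training loop looks roughly like the sketch below. This is my structural reconstruction with placeholder stubs: the real system runs data collection and learning asynchronously in parallel, whereas the sketch interleaves them sequentially for readability.

```python
# Structural sketch of a DayDreamer-style loop with placeholder stubs (not the
# authors' code). The real system runs the robot-facing and learner-facing parts
# in parallel; here they are interleaved for clarity.

def collect_real_episode(policy):
    """Roll out the current policy on the physical robot; return observed transitions."""
    raise NotImplementedError  # robot-specific I/O

def train_world_model(world_model, replay_buffer):
    """Fit latent dynamics, reward, and reconstruction models on all real experience."""
    raise NotImplementedError

def imagine_rollouts(world_model, policy, horizon=15):
    """Roll the policy forward inside the learned latent dynamics -- no robot involved."""
    raise NotImplementedError

def update_actor_critic(policy, imagined_trajectories):
    """Improve the policy and value function on imagined (synthetic) trajectories."""
    raise NotImplementedError

def daydreamer_loop(policy, world_model, iterations=1_000):
    replay_buffer = []
    for _ in range(iterations):
        replay_buffer.append(collect_real_episode(policy))   # scarce, expensive real data
        train_world_model(world_model, replay_buffer)        # learn dynamics from it
        dreams = imagine_rollouts(world_model, policy)       # cheap imagined experience
        update_actor_critic(policy, dreams)                  # most gradient steps use dreams
```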

DayDreamer’s main result is proof that online RL with a learned world model is workable on real robots, turning the path from theory into something you can walk.

A Walk in the Park: Algorithm or Engineering?#

A year later, A Walk in the Park reported that a carefully tuned SAC-style algorithm combined with well-engineered low-level controllers lets an A1 learn to walk outdoors in about 20 minutes, with no world model at all. The takeaway is that the win came from task/MDP design and controller engineering rather than a new RL trick: the bottleneck is the system around the algorithm, not the loss function [3].
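
To illustrate what "controller engineering" means here, the sketch below shows a common pattern in this line of legged-robot work: the policy acts at a low rate by choosing small offsets around a nominal pose, while a high-rate PD loop tracks the resulting joint targets. The gains, scales, and interfaces are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Sketch of the action-space and controller engineering this result rests on
# (my reconstruction with illustrative numbers): the policy outputs small offsets
# around a nominal pose at a low rate; a high-rate PD loop tracks the targets.

NOMINAL_POSE = np.zeros(12)   # 12 joint angles; a real system would use its standing pose
ACTION_SCALE = 0.3            # rad; bounds exploration to a safe neighborhood of the pose
KP, KD = 60.0, 2.0            # PD gains (illustrative)

def targets_from_action(action):
    """Map a normalized policy action in [-1, 1]^12 to joint-angle targets."""
    return NOMINAL_POSE + ACTION_SCALE * np.clip(action, -1.0, 1.0)

def pd_torque(q, dq, q_target):
    """Joint torques that track the targets between (slow) policy steps."""
    return KP * (q_target - q) - KD * dq

def control_step(policy, robot_state, robot_step, inner_steps=25):
    """One policy step at ~20 Hz wrapping ~25 PD steps at ~500 Hz."""
    q, dq = robot_state
    q_target = targets_from_action(policy(q, dq))
    for _ in range(inner_steps):
        q, dq = robot_step(pd_torque(q, dq, q_target))  # apply torque, read fresh joint state
    return q, dq
```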

RoboCat: Self-generated Data#

In 2023, DeepMind released RoboCat, which takes a different path. It is not a system designed specifically for Real-World RL, but it looks like a kind of rehearsal for real-world RL [4].

RoboCat builds on a Gato-style vision-based decision Transformer. It trains a generalist agent from demonstration data covering many robots and many tasks, and then runs a self-improvement loop: humans provide 100-1000 demonstrations for a new task, the model is fine-tuned, then it practices about ten thousand times in simulation or in the real world, and the generated data is fed back into the training set to produce a new, stronger version.
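
The self-improvement loop can be summarized in a few lines of structural pseudocode. The helper names and counts below are placeholders for illustration; DeepMind's actual pipeline, data filtering, and scale differ in the details.

```python
# Structural sketch of a RoboCat-style self-improvement cycle (hypothetical helper
# names; the real pipeline's filtering and scale are more involved).

def collect_teleop_demos(task, n):
    raise NotImplementedError   # 100-1000 human demonstrations for the new task

def finetune(model, episodes):
    raise NotImplementedError   # task-specific fine-tuning of the generalist

def train_generalist(dataset):
    raise NotImplementedError   # retrain the generalist on the grown dataset

def self_improvement_cycle(generalist, dataset, new_task, n_demos=500, n_practice=10_000):
    demos = collect_teleop_demos(new_task, n_demos)
    specialist = finetune(generalist, demos)                               # assumed interface
    practice = [specialist.rollout(new_task) for _ in range(n_practice)]   # self-generated data
    dataset = dataset + demos + practice        # feed the experience back into the training set
    return train_generalist(dataset), dataset   # the next, stronger generation
```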

Although RoboCat does not emphasize online RL as strongly as DayDreamer or A Walk in the Park, it clearly brings out another important idea:

A generalist robot policy can keep getting stronger by generating its own data and improving itself.

This idea will later become a core theme in $\pi_{0.6}^*$.

Second Stage: From Can Learn to Learns Well#

The first stage showed that real-world RL can learn. But when people try to deploy these systems, they care about two other questions:

  • Can the agent learn in a way that is stable enough, with success rates close to 100%?
  • Can it keep running for a long time, without needing a human to rescue it every half hour?

A series of works between 2023 and 2025 can be seen as systematic answers to these questions.

HIL-SERL: Data + Human Corrections + RL#

HIL-SERL (Human-in-the-Loop Sample-Efficient RL) comes from Berkeley RAIL and appeared in Science Robotics 2025. It targets a set of tasks that are much harder than learning to walk: dynamic shaking to pull out a block, precise assembly, dual-arm collaboration, pan tossing for cooking, and other real manipulation tasks [5].

The training procedure of HIL-SERL is simple but effective: first they collect good and bad trajectories using teleoperation, and train a binary reward model that judges success or failure; then they use a small number of demonstrations to initialize the policy; finally they run online RL on real robots, where humans step in to correct the robot at key moments. RL then improves the policy using this correction data and the learned reward.
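
One concrete and reusable piece of this recipe is the learned binary reward. Below is a minimal sketch assuming pre-computed visual features and a tiny classifier head of my own choosing; the paper's reward model sits on the policy's image encoder and is more involved.

```python
import torch
import torch.nn as nn

# Minimal sketch of a learned binary success reward in a HIL-SERL-style pipeline
# (my simplification: a classifier on precomputed visual features).

class SuccessClassifier(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, features):                 # features: (batch, feat_dim)
        return self.net(features).squeeze(-1)    # logit of "this state is a success"

def train_reward_model(model, features, labels, epochs=50, lr=3e-4):
    """Fit the classifier on human-labeled success/failure frames from teleop trajectories."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(features), labels)  # labels: (batch,) floats in {0, 1}
        loss.backward()
        opt.step()
    return model

def sparse_reward(model, features, threshold=0.5):
    """During online RL, the reward is simply a threshold on the classifier's confidence."""
    return (torch.sigmoid(model(features)) > threshold).float()
```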

The results are very direct: on a set of complex manipulation tasks, HIL-SERL can push the success rate of vision-based policies close to 100% in about 1-2.5 hours of interaction, and the final execution speed is even faster than human teleoperation.

This work makes two points that strongly influence later research:

  • Real-world RL should not start from random exploration, but should stand on top of demonstrations
  • Human interventions are not noise; they are the key component that makes RL both safe and efficient

You can view it as an upgrade of what DayDreamer and A Walk in the Park did: from can learn to can learn well, and learn fast.

RL-100: Systematizing the Pipeline#

If HIL-SERL is still a method, RL-100 has already grown into an engineering system.

RL-100 proposes a three-stage pipeline: first, use imitation learning to inject human experience into a diffusion policy; second, run offline RL with offline policy evaluation (OPE) to obtain conservative policy improvement; finally, run a short period of online RL on real robots to clean up the remaining failure modes [6].
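
In code, the three stages reduce to something like the following skeleton. The helper functions and the OPE acceptance test are simplified placeholders; the paper's diffusion-policy updates and evaluation criteria are considerably more detailed.

```python
# Structural sketch of the three-stage recipe with stubbed helpers (not the paper's code).

def imitation_learn_diffusion_policy(demos): ...
def offline_rl_update(policy, data): ...
def offline_policy_evaluation(policy, data): ...   # returns an estimated return/success score
def online_rl_update(policy, data): ...

def rl100_pipeline(demos, robot, offline_iters=10, online_iters=5, ope_margin=0.0):
    policy = imitation_learn_diffusion_policy(demos)   # stage 1: IL injects human experience

    # Stage 2: offline RL with an OPE gate -- a candidate update is accepted only if
    # offline evaluation says it is at least as good as the current policy.
    for _ in range(offline_iters):
        candidate = offline_rl_update(policy, demos)
        if (offline_policy_evaluation(candidate, demos)
                >= offline_policy_evaluation(policy, demos) + ope_margin):
            policy = candidate

    # Stage 3: a short period of online RL on the real robot to remove the residual
    # failure modes that offline data cannot cover.
    for _ in range(online_iters):
        rollouts = robot.collect(policy)
        policy = online_rl_update(policy, rollouts)
    return policy
```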

They validate the system on seven real-robot tasks, including cloth folding, pouring fluids and granular materials, dynamic pushing, dexterous nut tightening, and multi-stage orange juicing. In 900 evaluations they achieve 900/900 successes, and some tasks can run 250 times in a row without failure.

Technically, RL-100 and HIL-SERL share the same spirit:

  • Both rely on demonstrations and offline data to ensure a good starting point
  • Both keep exploration within safety boundaries monitored by OPE or humans
  • The role of RL is to fix long-tail failures, not to invent motions from scratch

But RL-100 does one extra important thing: it turns the whole pipeline into a framework that is relatively agnostic to tasks, robot platforms, and sensing modalities. This is a step from paper demo toward a reusable system.

Contact-Rich Sim-to-Real: A Compromise Route#

For assembly and tight-insertion tasks where contact mechanics are very sensitive, learning entirely in the real world is still too risky. For such settings, work from Tomizuka’s group proposes a hybrid idea: learn trajectories and compliance parameters with RL in simulation, then fine-tune only a small admittance residual online in the real world [7].
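
The control structure is easy to sketch in a simplified one-dimensional form: the admittance parameters come from RL in simulation, and only a small learned residual is adapted online in the real world. Where exactly the residual enters (the commanded position, as below, or the admittance gains themselves) is a design choice; this is my reconstruction, not the paper's exact formulation.

```python
# 1-D sketch of "admittance controller + small learned residual" (illustrative values).

class AdmittanceWithResidual:
    def __init__(self, m=1.0, d=50.0, k=200.0):
        self.m, self.d, self.k = m, d, k      # virtual mass / damping / stiffness, tuned in sim
        self.x, self.dx = 0.0, 0.0            # compliant offset state
        self.residual = lambda obs: 0.0       # small correction, fine-tuned online in the real world

    def step(self, f_ext, x_ref, obs, dt=0.001):
        # Classic admittance update: the measured contact force deflects a virtual
        # mass-spring-damper, producing a compliant offset around the reference.
        ddx = (f_ext - self.d * self.dx - self.k * self.x) / self.m
        self.dx += ddx * dt
        self.x += self.dx * dt
        # Commanded position = reference + compliant offset + learned residual.
        return x_ref + self.x + self.residual(obs)
```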

This style of method may not be as eye-catching as HIL-SERL or RL-100, but it is very practical in industrial scenarios: most of the risk is handled in simulation, and real-world RL only applies a small residual update.

You can view it as an important side branch in the second stage: Real-World RL is not always the main actor, but can serve as the final adaptation layer in sim-to-real.

Third Stage: From Task-Specific to General Policies#

So far, most work still focuses on letting the robot learn a single task. $\pi_{0.6}^*$ does something slightly counter-intuitive: it changes the object of RL training from a task-specific policy to a general policy.

A Good-Enough General VLA#

Physical Intelligence released $\pi_0$ in 2024. This model is essentially a vision-language-action (VLA) foundation model: it combines internet-scale vision-language pretraining with large-scale robot data to train a model that can generalize zero-shot and few-shot across robots and tasks [8].

$\pi_{0.5}$ and $\pi_{0.6}$ then increase model size, training data, and architectural capacity, forming a large policy model that can “basically get the job done” on many household and simple industrial tasks. But, just like all the previous systems we have discussed, it hits the familiar problem: success rates are passable, but still not high enough for real use.

This is the background for $\pi_{0.6}^*$.

RL with Experience & Corrections#

The technical report of $\pi_{0.6}^*$ describes a staged training recipe: offline pretraining, supervised fine-tuning, and online correction-driven RL [1].

More concretely, they propose Recap (RL with Experience & Corrections via Advantage-conditioned Policies), which looks like this:

  • First they run offline RL on $\pi_{0.6}$, so the model learns action preferences from offline data
  • For each concrete task, they then run a round of supervised / imitation learning fine-tuning from human demonstrations, so the model has a decent starting point
  • Next they deploy the model on real robots and let it run the task by itself. Humans only step in when there are clear mistakes; these corrections are logged as supervision in the failure states
  • Finally they train a value function on the model’s own trajectories, and compute an advantage signal that scores actions as better or worse than average. This advantage is fed as a condition into the VLA, so that the policy learns to prefer high-advantage behavior

This may sound abstract, but we can describe it more simply: $\pi_{0.6}^*$ uses RL to fix the real errors that $\pi_{0.6}$ makes in the physical world, one by one. It does not only fix the single state where you corrected the robot; by using an advantage-conditioned policy, it tries to improve behavior in all similar situations.
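
A minimal sketch of the advantage-conditioning idea, as I read it from the report: fit a value function on logged trajectories, turn one-step advantages into a conditioning token, and fine-tune the VLA to imitate its own (and the corrected) actions given that token. The `vla.imitation_update` interface and the token vocabulary are hypothetical placeholders, not Physical Intelligence's implementation.

```python
import torch

# Sketch of advantage conditioning in the spirit of Recap. The value function,
# discretization, and VLA interface are assumptions for illustration only.

def one_step_advantages(values, rewards, gamma=0.99):
    """A(s_t, a_t) ~= r_t + gamma * V(s_{t+1}) - V(s_t), over one logged trajectory."""
    next_values = torch.cat([values[1:], values[-1:]])   # bootstrap; repeat the last state
    return rewards + gamma * next_values - values

def advantage_token(adv):
    """Collapse the advantage into a tiny vocabulary of conditioning tokens."""
    return "<adv_good>" if adv > 0 else "<adv_bad>"

def recap_style_update(vla, traj):
    """Fine-tune the policy to imitate logged actions *conditioned on* how good they were."""
    advs = one_step_advantages(traj["values"], traj["rewards"])
    for obs, action, adv in zip(traj["obs"], traj["actions"], advs):
        prompt = traj["instruction"] + " " + advantage_token(adv.item())
        vla.imitation_update(obs, prompt, action)         # hypothetical VLA training call

# At deployment, the prompt always carries "<adv_good>", so the policy reproduces only
# the better-than-average behavior seen in its own experience and in human corrections.
```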

What about results? The report lists concrete numbers and case studies: on complex tasks like making espresso, assembling cardboard boxes, and folding different types of clothes, Recap roughly doubles the throughput of $\pi_{0.6}$ (the number of tasks finished per unit time) and cuts failure rates to half or even less. The team runs robots from 5:30 in the morning to 11:30 at night making coffee, folding 50 unseen garments in a stranger’s house, or assembling 59 real boxes on a factory line, without any run ending early because of model errors.

If you zoom out on the timeline, it is natural to see $\pi_{0.6}^*$ standing on the shoulders of the previous work:

  • Like HIL-SERL, it uses the trio of demonstrations + human corrections + RL to solve long-tail failures
  • Like RL-100, it treats RL as a final repair layer that upgrades performance from sometimes wrong to rarely wrong
  • But it also goes further: it is not optimizing a policy for a single task, but fine-tuning a large, general model

At the level of $\pi_{0.6}^*$, Real-World RL changes its role from a skill-learning algorithm to a last-mile training tool for a general policy.

Summary and Outlook: Where Might Real-World RL Go Next?#

DayDreamer and A Walk in the Park show that real-world RL can learn. HIL-SERL and RL-100 show that it can learn stably and for a long time. $\pi_{0.6}^*$ shows that it can become the last step for training general robot policies.

From the point of view of research methodology, Real-World RL has quietly shifted:

  • From “we need new RL algorithms” to “we need reliable system engineering and training pipelines”
  • From “let RL learn one skill in the real world” to “let RL fix the corners that a VLA cannot handle in the real world”
  • From “simulation is the main work and the real world is only for evaluation” to “real experience is a necessary stage, and simulation is just a warm-up”

Looking forward, some promising directions might cluster around:

  • Larger-scale real-world data: letting robots generate their own training data across many tasks at the same time
  • More automated and cheaper human intervention and safety mechanisms: for example, better semi-automatic correction, batch annotation tools, and more autonomous recovery, instead of requiring an engineer to stand by with an emergency stop
  • More dexterous motion: overcoming the large Sim2Real gap in dexterous or highly dynamic manipulation, so that Real-World RL can learn in-hand skills beyond simple pick-and-place (for example, one-handed Rubik’s Cube rotations, or using chopsticks)

If you are doing research or building products in robot learning, the most practical value of Real-World RL today may not be inventing an even fancier RL algorithm, but carefully answering two very concrete questions:

  • Which parts of your system should be handled by demonstrations and offline training, so that the model becomes smart enough not to self-destruct easily?
  • And then, where must RL touch the real world, and learn from real failures and long tails?

$\pi_{0.6}^*$ gives the following answer:

Demonstrations and pretraining are responsible for getting the success rate above zero, and Real-World RL is responsible for addressing real-world failure cases and closing the remaining gaps, until the robot can operate reliably in the physical world.

That is probably the most attractive part of Real-World RL: it is not meant to replace everything else, but to make the whole robot system work reliably in the real world.


Footnotes#

  1. $\pi_{0.6}^*$: A VLA that Learns from Experience. Physical Intelligence Blog, 2025-11-17. https://www.pi.website/blog/pistar06

  2. DayDreamer: World Models for Physical Robot Learning. CoRL 2022. https://danijar.com/project/daydreamer/

  3. Laura Smith et al. A Walk in the Park: Learning to Walk in 20 Minutes With Model-Free Reinforcement Learning. RSS Demo Track 2023. https://arxiv.org/abs/2208.07860

  4. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. DeepMind, 2023. https://arxiv.org/abs/2306.11706

  5. HIL-SERL: Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Sample-Efficient Robotic Reinforcement Learning. Science Robotics, 2025. https://hil-serl.github.io/

  6. Kun Lei et al. RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning. arXiv:2510.14830, 2025. https://arxiv.org/abs/2510.14830

  7. Xiang Zhang et al. Efficient Sim-to-real Transfer of Contact-Rich Manipulation Skills with Online Admittance Residual Learning. CoRL 2023. https://arxiv.org/abs/2310.10509

  8. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. Physical Intelligence Blog, 2024-10-31. https://www.physicalintelligence.company/blog/pi0
