Frontier AI research lab Decart is aiming to bridge the gap between synthetic simulation and physical AI with the launch of its latest world model, Oasis 3.
Announced recently, the new video output model was designed to accelerate the training of operating system models for robots and autonomous vehicles. The goal is to create intelligent hardware that won’t be fazed by the unpredictable nature of the world it operates in.
Makers of robotics face significant challenges due to the dearth of useful data that can be harnessed to train their machines to operate in complex physical environments.
While autonomous cars can be taught to navigate a static parking lot with fixed traffic cones easily enough, such environments are nothing like what they face on the open road, especially as weather and lighting conditions change.
Getting them to the level where they can navigate chaotic city streets amid torrential rainfall and react to a dog that suddenly dashes into the road without warning is an entirely different ball game. This is the challenge Oasis 3 is designed to solve.
The robotics training bottleneck
The development of large language models for generating text and images has dramatically outpaced that of general-purpose robotics, also known as physical AI, because of the masses of media resources available.
As Bessemer Ventures pointed out in a research report earlier this year, LLM developers have had the luxury of being able to scrape billions of webpages from the public internet. But the Vision-Language-Action models needed to power robots to interact with the physical world do not have that luxury.
VLA models, as they’re known, work by ingesting data from their environment, processing it so they can understand what’s happening, and finally by reacting to that input. When it comes to training them, developers have three options.
The first is to create their own teleoperation data, which means sticking a human in a suit in order to mimic a robot working in a particular scenario. Doing this provides the highest quality training data, but it’s extremely expensive and slow, which makes it impossible to scale to the level required.
The second option is to use videos from the open web. The supply is plentiful, but the usefulness of such videos is limited due to their messy nature – environments are inconsistent, can’t be controlled to replicate the needed variety of conditions, and they lack spatial data telemetry and direct-action conditioning.
Alternatively, developers can use synthetic data, which is a kind of middleground between the two. But the challenge with this is the substandard nature of the existing physics engines used to create it, which struggle to reflect the nuances of the real-world due to their rigid boundaries.
Researchers have labeled this disconnect the “sim-to-real gap”. In a nutshell, the AI software used to generate virtual training environments for robotics just cannot account for the chaos of the real world, where anything can and usually does happen – for instance, oil spills on a road, or uncharacteristically fragile boxes in a factory warehouse.
When confronted with such randomness, autonomous vehicles and robots usually don’t know how to react.
Closing the gap with closed-loop, generative simulations
Decart says Oasis 3 is designed to overcome the limitations of existing virtual training grounds by marrying photorealistic, interactive motion graphics capabilities with a uniquely powerful physics engine.
They’re integrated within a single, high-performance training loop, enabling Oasis 3 to create action-conditioned video streams where developers can generate just about any kind of chaos they can imagine. This enables a superior training environment that’s much more like the physical world.
Developers can use Oasis 3 to create multiview environments that are not only ultra-realistic, but also highly-controllable. Should a self-driving car veer to the left, the real-time generative stream will instantly adjust the perspective with less than 200 milliseconds latency, well within the threshold needed to support reinforcement learning.
The model was co-designed with Nvidia’s physical AI ecosystem and runs on CoreWeave’s specialized cloud infrastructure at 22 frames per second, generating interactive virtual environments at 512x768x3 resolution.
It supports a native three-camera view in order to maintain spatial and temporal consistency from multiple angles, allowing autonomous systems to accurately gauge depth and peripheral environments.
It’s being made available through Decart’s API, so developers can easily integrate it into their existing physical AI simulation workflows.
Training robots to conquer uncharted territory
For physical AI to reach the level of science-fiction humanoids, developers need to be able to train robots to handle unique edge cases in real time.
This means creating scenarios that are impossible to replicate in any lab, such as when a load falls off the back of a truck into the path of an oncoming autonomous vehicle while its camera has been smeared with mud.
This is exactly the kind of situation Oasis 3 allows developers to create. Using simple natural language prompts, it’s possible to spin up an infinite number of variations of such an event, from numerous angles and in all kinds of inclement weather, and on different road surfaces, for example.
Developers may finally have an affordable way to expose their models to millions of different hazards and ensure that they’ll be ready for anything that could plausibly happen in the real world.
