VLA and Teleoperation Are Dead in 2026 — NVIDIA’s Jim Fan Reveals the Future of Robotics
NVIDIA's Jim Fan declares the VLA architecture and teleoperation obsolete for next-gen robotics, advocating physics-simulating world action models and autonomous data collection as the new frontier.

VLA and teleoperation are dead, according to Jim Fan, Senior Research Scientist and Lead of AI Agents at NVIDIA. In a recent interview with Sequoia Capital, Fan argued that Vision-Language-Action (VLA) models are fundamentally misaligned with the demands of physical autonomy. He argues that next-token prediction, the training objective of language models, is irrelevant in environments governed by physics rather than language. Instead, Fan proposes a shift toward "world action models" that predict next-frame dynamics to guide robotic behavior in real time.
Why VLAs Fail in Physical Environments
Vision-Language-Action (VLA) models rely on human-labeled datasets and linguistic patterns to predict actions. In dynamic, physics-driven settings, such as a robot grasping a slippery object or navigating uneven terrain, these models lack grounding in cause and effect. They imitate behavior without understanding it, which leads to brittle performance in novel scenarios. As Fan notes, "Language models don’t know gravity. Robots do."
The Rise of Physics-Based World Action Models
Fan’s vision replaces token prediction with physics-based simulation. Rather than learning from human demonstrations or static datasets, future robots must internalize the laws of motion, friction, gravity, and object interaction. These world action models generate internal simulations of possible outcomes before executing actions, enabling adaptive, safe, and efficient behavior without human intervention. This mirrors the scaling laws that propelled large language models — but applied to the physical world.
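To make the idea of simulating outcomes before acting concrete, here is a minimal sketch in Python. It is not Fan's or NVIDIA's method: the one-dimensional point mass, the hand-written predict_next function standing in for a learned world model, the random-shooting planner, and every constant are assumptions chosen purely for illustration.

```python
import random

DT = 0.1        # simulation step in seconds (assumed)
FRICTION = 0.5  # viscous friction coefficient (assumed)

def predict_next(state, force):
    """Toy stand-in for a learned world model: advance (position, velocity) one step."""
    pos, vel = state
    acc = force - FRICTION * vel
    vel = vel + acc * DT
    pos = pos + vel * DT
    return (pos, vel)

def rollout_cost(state, forces, target):
    """Imagine a whole action sequence and score the state it ends in."""
    for f in forces:
        state = predict_next(state, f)
    pos, vel = state
    return (pos - target) ** 2 + 0.1 * vel ** 2

def plan(state, target, rng, horizon=10, candidates=200):
    """Random-shooting planner: sample sequences, simulate each, keep the best first action."""
    best_cost, best_force = float("inf"), 0.0
    for _ in range(candidates):
        forces = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = rollout_cost(state, forces, target)
        if cost < best_cost:
            best_cost, best_force = cost, forces[0]
    return best_force

def run_episode(target=1.0, steps=100, seed=0):
    """Plan, act, observe, replan. The 'real world' here is the same toy model."""
    rng = random.Random(seed)
    state = (0.0, 0.0)
    for _ in range(steps):
        force = plan(state, target, rng)
        state = predict_next(state, force)
    return state
```

Calling run_episode() starts the point mass at position 0 and steers it toward the target at 1.0. At each step only the first action of the best imagined sequence is executed, and the planner then replans from the new state.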
NVIDIA’s Roadmap for Embodied AI
According to Reuters, Fan’s team at NVIDIA has already begun testing prototype systems that integrate neural physics engines with multimodal perception. Early results show a 40% reduction in trial-and-error failures during object manipulation tasks compared to VLA-based approaches. The shift isn’t merely technical — it’s philosophical. Robotics, Fan contends, must stop mimicking human behavior and start embodying autonomous intelligence grounded in physical reality.
Why Teleoperation Is Becoming Obsolete
Equally consequential is Fan’s prediction that teleoperation will become negligible within two years. Teleoperation, once the dominant method for training robots through remote human control, introduces latency, inconsistency, and scalability limits: human operators cannot scale to millions of tasks. Fan asserts that the future lies in egocentric autonomous data collection, in which robots learn from their own sensory experience, correct their own errors, and generate synthetic training data through simulation.
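As a rough illustration of a robot learning from its own experience without an operator, the sketch below has a simulated robot apply random exploratory forces, log what it observes, and fit a friction coefficient from that log by least squares. Everything here is hypothetical: the one-dimensional dynamics in real_world_step, the sensor noise level, and the function names are assumptions for the example, not anything described in the interview.

```python
import random

DT = 0.1  # time step in seconds (assumed)

def real_world_step(vel, force, friction=0.5):
    """Hypothetical ground-truth dynamics that the robot cannot inspect directly."""
    return vel + (force - friction * vel) * DT

def collect_transitions(steps=500, seed=0, sensor_noise=0.002):
    """The robot acts on its own and logs (velocity, force, observed next velocity)."""
    rng = random.Random(seed)
    vel, log = 0.0, []
    for _ in range(steps):
        force = rng.uniform(-1.0, 1.0)  # exploratory action, no human operator
        next_vel = real_world_step(vel, force)
        observed = next_vel + rng.gauss(0.0, sensor_noise)  # noisy sensor reading
        log.append((vel, force, observed))
        vel = next_vel
    return log

def fit_friction(log):
    """Least-squares estimate of the friction coefficient from the logged transitions."""
    num = den = 0.0
    for vel, force, observed in log:
        # Acceleration not explained by the applied force: -friction * vel plus noise.
        residual_acc = (observed - vel) / DT - force
        num += -vel * residual_acc
        den += vel * vel
    return num / den
```

Running fit_friction(collect_transitions()) recovers a value close to the hidden coefficient of 0.5, with the remaining error coming from the simulated sensor noise.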
The Sim2Real Advantage in 2026
Fan points to NVIDIA’s Sim2Real framework as a critical enabler. By running millions of simulated trials in parallel, robots can accumulate vast, diverse experience without physical wear or human oversight. This approach allows systems to develop "intuition" for complex tasks, such as assembling irregular parts or navigating cluttered homes, far beyond what any human could demonstrate in a lab. The result? End-to-end learning that scales exponentially.
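The sketch below illustrates the general idea of many simulated trials with varied physics, a practice often called domain randomization. It does not use any NVIDIA API: the block-pushing task, the PD controller, and the friction and mass ranges are all assumptions for illustration. The trials run in a plain loop here, but because each one is independent they could be distributed across many workers.

```python
import random

DT = 0.05  # simulation step in seconds (assumed)

def simulate_trial(gain, damping, friction, mass, target=1.0, steps=400):
    """One simulated trial: a PD controller pushes a block toward the target."""
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        force = gain * (target - pos) - damping * vel
        acc = (force - friction * vel) / mass
        vel += acc * DT
        pos += vel * DT
    return abs(pos - target) < 0.05  # success if the block ends near the target

def success_rate(gain, damping, trials=1000, seed=0):
    """Evaluate one controller across many trials with randomized physics."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        friction = rng.uniform(0.1, 1.0)  # randomized per trial (assumed range)
        mass = rng.uniform(0.5, 2.0)      # randomized per trial (assumed range)
        successes += simulate_trial(gain, damping, friction, mass)
    return successes / trials
```

Comparing success_rate(4.0, 2.0) with success_rate(0.0, 2.0) shows how a controller can be scored across a thousand physical variations without touching hardware. The second controller applies no corrective force and never leaves the start position.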
Industry analysts are taking notice. Venture capital firms are pivoting funding from teleoperation startups to companies building physics-aware AI agents. The implications stretch beyond manufacturing and logistics into healthcare, disaster response, and even space exploration. If Fan’s roadmap holds, we are witnessing the end of the "human-in-the-loop" era in robotics.
While skeptics question the computational demands of real-time physics simulation, Fan counters that advances in NVIDIA’s Grace Hopper architecture and accelerated computing make it not only feasible but cost-effective at scale. The era of manually coded behaviors and human-guided training is over. The future belongs to machines that learn physics, not language.
VLA and teleoperation are dead — replaced not by incremental upgrades, but by a new paradigm rooted in embodied intelligence and self-supervised physical learning. The robotics revolution is no longer coming. It’s already here.


