From LLMs to LWMs: With Human Motion At The Core

February 10, 2026
10 min read
By Jakob Balslev

A few years ago, everyone was suddenly forced to learn a new term overnight: LLMs, or large language models. Just as that concept has reached mainstream understanding, another paradigm shift is beginning to take shape. It may be time to start thinking in terms of LWMs - large world models.

But here’s the thing: AI still doesn’t understand how humans move through the world. It can write like us. It can draw like us. But it cannot yet convincingly inhabit a body, with balance, intention, gravity, and consequence. And that gap is not cosmetic. It’s architectural.

For the past decade, AI progress has largely followed a familiar formula: scale language models using massive datasets and compute, and watch new capabilities emerge.

That paradigm gave us systems that can write, reason, code, and generate images and video. It fundamentally changed how we interact with software and how content is created.

But a new shift is now underway.

Large World Models (LWMs) are AI systems that don’t just predict words, pixels, or audio, but attempt to predict how the world evolves over time. Instead of generating content frame by frame, these models aim to understand physics, causality, interaction, and intention.

If LLMs are autocomplete for language, world models are simulators for reality. They attempt to predict not just what comes next in a sentence, but what happens next in the world. In other words, AI is moving from models that describe the world to models that simulate and act inside it.
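
To make that concrete, here is a minimal sketch of the core loop a world model implies: predict the next state of the world from the current state and an action, then repeat. The Python below is purely illustrative; the names (`WorldState`, `predict`, `rollout`, `NudgeModel`) are placeholders, not any particular system's API.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class WorldState:
    # Placeholder for whatever the model tracks: poses, velocities, contacts.
    positions: list[float]

class WorldModel(Protocol):
    def predict(self, state: WorldState, action: list[float]) -> WorldState:
        """Return the next world state, conditioned on an action."""
        ...

def rollout(model: WorldModel, state: WorldState, actions: list[list[float]]) -> list[WorldState]:
    """Roll the model forward through a sequence of actions."""
    trajectory = [state]
    for action in actions:
        state = model.predict(state, action)
        trajectory.append(state)
    return trajectory

# Toy stand-in "model": each action just nudges the positions.
class NudgeModel:
    def predict(self, state, action):
        return WorldState([p + a for p, a in zip(state.positions, action)])

states = rollout(NudgeModel(), WorldState([0.0, 0.0]), [[1.0, 0.5]] * 3)
print(states[-1])  # WorldState(positions=[3.0, 1.5])
```

The interesting part is what hides inside `predict`: for a real world model, that single call has to encode physics, causality, and interaction.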

This emerging paradigm has been increasingly discussed in research and industry conversations, including a recent perspective by NVIDIA’s Jim Fan describing what he calls the “second pre-training paradigm” - moving from next-token prediction toward predicting future world states conditioned on actions.

And as this transition accelerates, one challenge is becoming increasingly clear:

Realistic human motion may be one of the most difficult problems world models must solve.

The shift from language understanding to physical intelligence

Language is structured, symbolic, and surprisingly compressible. This is part of why scaling language models has been so effective. The internet provided a massive corpus of structured data describing human knowledge and communication.

Physical reality is fundamentally different.

World models must learn:

  • How objects interact through contact and force
  • How balance and locomotion work
  • How actions influence future states
  • How intention translates into movement
  • How timing and coordination create believable behavior

Video generative models and robotics systems are early examples of this transition. They attempt to predict future world states conditioned on actions rather than simply generating isolated outputs.

However, across many of these systems, one domain consistently proves difficult: human movement.

Why human motion is uniquely challenging

Human motion is hard because:

  • It unfolds across time, not in a single frame
  • It obeys physics: gravity, inertia, friction
  • It reflects intent: emotion, goals, and reactions
  • It is high-dimensional: dozens of joints interacting simultaneously
  • It must stay coherent from every camera angle

Motion is extremely high-dimensional

A single human performance involves:

  • Full-body articulation
  • Finger movement with dozens of degrees of freedom
  • Facial expression and micro-movement
  • Interaction with objects and environments
  • Rhythm, balance, and momentum

Each of these layers compounds the complexity of the others. Combined, they form one of the most information-dense signals in human communication.
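
A quick back-of-the-envelope calculation shows why. The figures below are illustrative, not a real rig specification: a common 24-joint body skeleton, roughly 15 articulated joints per hand, and a 52-channel blendshape face.

```python
# Rough channel count for one second of full-body performance.
# All figures are illustrative; real rigs vary widely.
body_joints   = 24        # e.g. a common SMPL-style body skeleton
hand_joints   = 2 * 15    # roughly 15 articulated joints per hand
face_channels = 52        # e.g. an ARKit-style blendshape face rig
fps           = 60        # a common capture frame rate

rotation_channels  = (body_joints + hand_joints) * 3  # 3 rotation axes per joint
channels_per_frame = rotation_channels + face_channels + 3  # +3 for root translation

print(channels_per_frame)        # 217 channels per frame
print(channels_per_frame * fps)  # 13,020 values per second of motion
```

And that counts only kinematics; contact, forces, and interaction with the environment add further structure a model must respect.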

Motion is deeply temporal

Generating a plausible pose is relatively straightforward for modern AI systems.

Generating physically consistent movement across seconds or minutes is dramatically harder. Models must maintain long-range dependencies while respecting biomechanical constraints, environmental interaction, and performer intent.
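
One way to see the difficulty: many motion generators work frame by frame, feeding each predicted pose back in as input. The toy sketch below is hypothetical code, with a noisy identity function standing in for a learned model, but it shows how even tiny per-frame errors compound into visible drift over a ten-second clip.

```python
import numpy as np

def autoregressive_rollout(predict_next, first_pose, n_frames):
    """Generate motion frame by frame; each predicted pose is fed back in,
    so small per-frame errors compound over long horizons."""
    poses = [first_pose]
    for _ in range(n_frames - 1):
        poses.append(predict_next(poses[-1]))
    return np.stack(poses)

# Toy stand-in for a learned model: the previous pose, slightly damped and noisy.
rng = np.random.default_rng(0)
def noisy_identity(pose):
    return pose * 0.999 + rng.normal(0.0, 0.001, size=pose.shape)

start = rng.normal(size=63)  # e.g. 21 joints x 3 rotation channels
clip = autoregressive_rollout(noisy_identity, start, n_frames=600)  # ~10 s at 60 fps
print("pose drift after 600 frames:", np.linalg.norm(clip[-1] - clip[0]))
```

A real model faces the same feedback loop, with the added requirement that every intermediate pose stay biomechanically plausible.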

Imagine a humanoid robot trained to assist in a warehouse.
If it doesn’t understand balance and weight distribution, it drops objects.
If it doesn’t understand intent, it hesitates.
If it doesn’t understand human motion, it cannot collaborate safely.

Motion exists in a long-tail reality

Unlike text or imagery, there is no massive, self-organizing internet-scale corpus of structured human movement. Motion is highly contextual, often culturally specific, and frequently never recorded. Capturing authentic human performance remains one of the most challenging data collection problems in AI.

The emerging hybrid approach: Human performance + motion intelligence

The conversation around motion and AI is often framed as a competition between motion capture and generative models. In practice, the future increasingly looks hybrid.

World models benefit from combining two complementary sources of motion intelligence.

Human performance capture provides ground truth

Performance capture delivers:

  • Physically accurate movement
  • Intent-driven performances
  • Rare or specialized motion examples
  • High-fidelity temporal continuity

These elements anchor motion data in physical reality.

AI models provide scale and flexibility

Generative motion systems enable:

  • Motion interpolation and variation (see the sketch after this list)
  • Procedural animation and synthesis
  • Workflow acceleration and iteration
  • New forms of interactive control
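
To give a flavor of the first item, here is the kind of primitive that sits underneath motion interpolation: spherical linear interpolation (slerp) between two joint rotations, the standard way to blend rotations without distortion. This is a generic textbook routine, not a description of any specific product's internals.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    dot = np.dot(q0, q1)
    if dot < 0.0:              # take the shorter rotation path
        q1, dot = -q1, -dot
    if dot > 0.9995:           # nearly parallel: fall back to a linear blend
        out = q0 + t * (q1 - q0)
        return out / np.linalg.norm(out)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

# Blend a joint's rotation 30% of the way from pose A to pose B.
q_a = np.array([1.0, 0.0, 0.0, 0.0])        # identity rotation
q_b = np.array([0.7071, 0.7071, 0.0, 0.0])  # 90 degrees about the x axis
print(slerp(q_a, q_b, 0.3))                 # ~[0.972, 0.233, 0.0, 0.0]
```

Generative systems layer learned models on top of primitives like this to produce variation that still respects the captured performance.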

Rather than replacing human performance, AI expands and amplifies it.

This mirrors transformations seen in speech, vision, and image generation, where real-world data provides realism while AI enables scalability and creative flexibility.

Motion as infrastructure for embodied AI

As robotics, digital humans, and interactive simulations advance, motion is shifting from a production tool to foundational infrastructure.

Embodied AI systems require movement that is:

  • Physically safe
  • Socially believable
  • Contextually adaptive
  • Efficient and controllable

Without convincing motion:

  • Robots struggle in human environments
  • Digital humans feel artificial
  • Training simulations lose realism
  • Interactive content breaks immersion

Motion sits at the intersection of physical intelligence and human communication. It is both functional and expressive, governed by physics and psychology simultaneously.

Rokoko’s perspective: Building the motion layer for world models

At Rokoko, we see the future of motion evolving across three interconnected layers: capture, structuring, and motion intelligence.

Capturing human reality

Our core mission has always been to make high-quality performance capture accessible. Full-body, hand, and facial capture allow creators, developers, and researchers to capture authentic human movement in production environments and experimental workflows alike.

Custom human performance capture will continue to play a critical role in industries ranging from entertainment to robotics training and simulation.

Structuring motion into learnable systems

Raw motion data only becomes valuable when it can be standardized, cleaned, retargeted, and integrated across different platforms and skeleton systems. This structuring layer is often overlooked but is essential for both production pipelines and AI training.
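
At its simplest, part of that structuring is just mapping between naming and skeleton conventions so clips from different sources can flow through one pipeline. Below is a deliberately simplified, hypothetical example; real retargeting also has to handle bone lengths, rotation orders, and coordinate spaces.

```python
# Hypothetical structuring step: renaming joints from one skeleton convention
# to another so clips from different sources can feed one shared pipeline.
# Joint names and the mapping here are illustrative, not a real spec.
JOINT_MAP = {
    "Hips": "pelvis",
    "Spine": "spine_01",
    "LeftUpLeg": "left_hip",
    "RightUpLeg": "right_hip",
    # ...one entry per joint in the source rig
}

def remap_frame(frame: dict[str, tuple[float, float, float]]) -> dict:
    """Rename one frame's joint rotations to the target convention,
    dropping joints the target skeleton does not define."""
    return {JOINT_MAP[name]: rot for name, rot in frame.items() if name in JOINT_MAP}

source_frame = {"Hips": (0.0, 0.1, 0.0), "LeftUpLeg": (12.5, 0.0, 3.0)}
print(remap_frame(source_frame))  # {'pelvis': (0.0, 0.1, 0.0), 'left_hip': (12.5, 0.0, 3.0)}
```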

Advancing motion foundation models

Alongside capture tools, we are investing heavily in what we refer to as Motion Foundation Models - AI systems designed to understand, generate, and transform human movement.

These models aim to:

  • Generate new motion from high-level prompts
  • Blend and adapt performance data
  • Fill gaps between captured sequences
  • Enable new hybrid workflows between human performers and AI

Our approach is built around the belief that authentic human motion and generative AI are not competing forces, but complementary components of the same future motion stack.

The next bottleneck in AI progress

Language models transformed communication between humans and machines.

World models aim to transform how AI interacts with the physical world.

As AI systems move closer to embodiment - whether through robotics, virtual agents, or digital humans - movement becomes a central unsolved challenge. Human motion combines physics, cognition, and social signaling in ways that remain extremely difficult to synthesize convincingly.

This is not just a modeling challenge. It is equally a data challenge, a workflow challenge, and an interface challenge between human performance and machine intelligence.

Looking ahead

The next generation of AI will not only generate content or answer questions. It will act, collaborate, and interact inside dynamic environments. To do that convincingly, AI must learn to move like we do.

At Rokoko, we believe authentic human motion will play a central role in this transition. By combining accessible performance capture with motion intelligence models, we aim to help build the motion layer that future world models will rely on.

The shift from language models to world models is still unfolding. But one thing is already clear: teaching AI to move may be one of the defining challenges of the next era.

That era will not be defined by who can generate the most text. It will be defined by who can simulate the world with physical consistency, human intention, and embodied intelligence.

Language was the first frontier. Motion may be the foundation everything after it depends on.
