Gunhee Lee 2026.06.10

RFM: Action-Oriented Intelligence for Physical AI

From the “Thinking Brain” to the “Action-Oriented Brain”

Recent advancements in AI technology have moved beyond simply generating text and images on a screen, evolving into the era of “physical AI,” where AI interacts with the real world through physical embodiment. If traditional Large Language Model (LLM)-based AI was a “thinking brain” that communicated through language in a virtual world, physical AI is an “acting brain” that perceives the physical environment and performs tasks directly using actual hardware as a medium. At the heart of this transformation lies the Robot Foundation Model (RFM), a universal robotic intelligence that can be applied across various robots and tasks.

Previous robot AI research has primarily focused on training specialized policies optimized for specific tasks. Today, however, the paradigm is shifting toward building Robot Foundation Models (RFMs) that can be applied across diverse environments and robot embodiments. Just as foundation models in natural language processing and computer vision have demonstrated their versatility across various tasks through large-scale pre-training, the core challenge in robotics lies in extending this capability into actions within the physical world.

The growing attention on RFMs stems from the clear limitations of conventional robot learning methods. Previous robot policies were often overfitted to a specific robot, environment, and task. Under this paradigm, performance is highly sensitive to change: even a minor variation in the robot, the environment, or the task can cause a sharp decline in performance.

However, an RFM is far more than a linear extension of an LLM or VLM. This is because robots must go beyond simply understanding text and images; they need to actively interact with the physical world through their actions, observe the outcomes of those behaviors, and dynamically modify their actions when faced with failures. Currently, RFM research is primarily centered around two directions: generating robot actions based on visual and linguistic information, and predicting physical world changes to utilize them for action generation.

In this context, this article surveys the landscape of RFM research along two distinct technical directions. The first is the Vision-Language-Action (VLA) model, which directly maps visual and textual inputs to robot actions. The second is the World Model (or World Action Model) approach, which focuses on predicting future environmental changes alongside action generation.

1. The Primary Landscape of RFM Research

RFM research is currently unfolding across two primary technical approaches. The first is the Vision-Language-Action (VLA) framework, which directly generates robot actions based on visual and textual inputs. The second is the World Model (or World Action Model) approach, which focuses on predicting future environmental states alongside action generation.

While early VLA models focused on a straightforward pipeline of "current observation + textual instruction → action," recent work increasingly incorporates reasoning, task context, failure cases, and tactile/force signals to refine control policies. At the same time, methods for joint training on datasets from disparate robot platforms have become a focal point for achieving cross-embodiment generalization, the core objective of RFMs. The World Action Model paradigm, in turn, couples video-based future prediction with robot action generation, and is evolving to compensate for the limitations of VLA models in forecasting physical and dynamic changes in the environment.

Image 1. VLA Model and Cosmos World Foundation Model
(Source: https://pub.towardsai.net/what-are-world-models-41ff394ed871)

1-1. VLA Models: Mapping Vision and Language Directly to Real-World Action

VLA models predict optimal robot actions by taking images or videos, natural language instructions, and the current state of the robot as inputs. Unlike conventional imitation learning, which merely trains a robot to replicate human-demonstrated motions, VLA synthesizes these demonstrations with language instructions and visual grounding. In essence, it extends the semantic understanding inherent in VLMs into the robot's physical action space.
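
The input-to-action mapping described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: `Observation`, `ToyVLAPolicy`, `encode`, and `act` are hypothetical names that come from no cited model, and the hand-written `encode` rule merely stands in for the large vision-language backbone a real VLA would use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """The three inputs named in the text (all fields are toy stand-ins)."""
    image: List[List[float]]      # toy grayscale image, rows of pixel values
    instruction: str              # natural language command
    joint_positions: List[float]  # current robot state (proprioception)


class ToyVLAPolicy:
    """Hypothetical stand-in for a VLA model; a real one is a large neural network."""

    def __init__(self, step_size: float = 0.1):
        self.step_size = step_size

    def encode(self, obs: Observation) -> float:
        # Placeholder "vision-language" feature: mean pixel brightness,
        # with its sign flipped when the instruction asks to move left.
        pixels = [p for row in obs.image for p in row]
        brightness = sum(pixels) / len(pixels)
        direction = -1.0 if "left" in obs.instruction.lower() else 1.0
        return direction * brightness

    def act(self, obs: Observation) -> List[float]:
        # Action = target joint positions: current state plus a small step
        # whose size and sign depend on the fused image/instruction feature.
        feature = self.encode(obs)
        return [q + self.step_size * feature for q in obs.joint_positions]


obs = Observation(image=[[1.0, 1.0], [1.0, 1.0]],
                  instruction="Move left",
                  joint_positions=[0.0, 0.5])
print(ToyVLAPolicy().act(obs))
```

The point of the sketch is the interface rather than the rule inside it: image, instruction, and robot state go in together, and a command in the robot's own action space comes out.
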

Research within the VLA paradigm has recently diversified along several lines:

Gemini Robotics^[1]: Exemplifies the trend of reinforcing embodied reasoning, including scene understanding, spatial reasoning, and task planning.
NVIDIA GR00T N1^[2] and Figure AI Helix^[3]: Serve as flagship examples of VLA-based RFMs tailored for humanoid robotics. GR00T N1 aims to be a generalist humanoid robot foundation model, while Helix was introduced as a generalist humanoid VLA capable of performing diverse household object manipulations and multi-robot collaborations based on natural language instructions.
Physical Intelligence π0 Series^[4,5,6,7]: Pursues a generalist robot policy, showing a distinct trend toward minimizing recurring failure modes in real-world deployment. It achieves this by utilizing rich context to condition policies and incorporating autonomous rollouts, failure cases, and human interventions directly into the training loop.

RLWRLD's RLDX-1^[8]: Represents a distinct pivot within the VLA domain that emphasizes dexterous manipulation. While conventional VLAs heavily prioritized vision-and-language-based universality, RLDX-1 integrates motion awareness, long-term memory, and tactile/torque signals to strengthen contact-rich manipulation, which is essential for complex operational environments.

In summary, the VLA framework is advancing by building on its core architecture, which maps visual and textual information to action, and augmenting it with reasoning, context conditioning, failure learning, and tactile/force feedback for robust operation in real-world environments. Meanwhile, making effective use of multi-embodiment data within a unified model remains a major open challenge under active investigation.

Image 2. Model Architecture of NVIDIA GR00T N1^[2]

1-2. World Action Models: Visualizing the Future Beyond Action

While VLA models focus on predicting immediate actions based on current observations and textual instructions, the World Model (or World Action Model) paradigm also addresses how the environment will change after the action. In other words, the robot learns not only "what action to take" but also how to predict the impact of that action on the external world.

In this approach, video serves as a critical supervisory signal. While a single image offers only a static snapshot of the current state, video implicitly captures object displacement, physical contact, and the temporal transitions of scene dynamics. Consequently, World Model-based approaches leverage video to first internalize the physics and dynamics of the real world, and then inject this understanding into robot action generation.
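
One way to picture "learning the action and its consequence together" is a model with two outputs and a loss with two terms. The sketch below is a toy under stated assumptions: `ToyWorldActionModel`, `joint_loss`, and `video_weight` are hypothetical names, frames are short lists of numbers rather than images, and the simple sum of two mean squared errors is an illustrative choice, not the objective of any cited model (DreamZero, for example, is described as building on a video diffusion backbone).

```python
from typing import List, Tuple

Frame = List[float]  # toy "video frame": a flat list of pixel values


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class ToyWorldActionModel:
    """Hypothetical model that outputs an action AND the frame it expects next."""

    def __init__(self, gain: float = 0.5):
        self.gain = gain

    def predict(self, frame: Frame, goal: Frame) -> Tuple[List[float], Frame]:
        # Action: move each value a fraction of the way toward the goal frame.
        action = [self.gain * (g - f) for f, g in zip(frame, goal)]
        # Future prediction: the frame the model expects after that action.
        next_frame = [f + a for f, a in zip(frame, action)]
        return action, next_frame


def joint_loss(pred_action: List[float], pred_frame: Frame,
               demo_action: List[float], real_next_frame: Frame,
               video_weight: float = 1.0) -> float:
    """Action imitation error plus weighted future-frame prediction error."""
    return mse(pred_action, demo_action) + video_weight * mse(pred_frame, real_next_frame)


model = ToyWorldActionModel(gain=0.5)
action, predicted = model.predict(frame=[0.0, 2.0], goal=[1.0, 0.0])
# The demonstrated action matches, but the recorded next frame does not,
# so only the video term contributes to the loss.
print(joint_loss(action, predicted, demo_action=[0.5, -1.0], real_next_frame=[0.5, 2.0]))
```

In this toy, the video term penalizes a wrong picture of the future even when the action itself matches the demonstration, which is the sense in which video acts as an extra supervisory signal.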

DreamZero^[9]: A flagship study that directly anchors a world-model approach to robot control. By co-modeling video and action trajectories based on a pretrained video diffusion backbone, it predicts both the immediate action and the resulting future visual state. This approach effectively complements VLA models by capturing intricate physical changes and enhancing motion generalization across novel environments.
DreamDojo^[10]: An illustrative case of utilizing world models as a sandbox for robot learning and evaluation. By learning diverse everyday environments and object interactions from egocentric human videos, it generates future rollouts without requiring continuous execution on physical hardware. This capability is then leveraged for policy evaluation and model-based planning.
NVIDIA Cosmos^[11]: An infrastructure-centric approach providing a world foundation model platform tailored for Physical AI, supporting robot synthetic data generation and simulation-based learning. Notably, Cosmos-Predict2.5^[12] is engineered to generate and predict future scenes given text, image, and video inputs, offering a way to ease the bottleneck of large-scale physical data collection.

In short, while DreamZero aligns world models directly with real-world robot policies, DreamDojo exploits them as planning and evaluation environments. NVIDIA Cosmos scales these methodologies into a massive world foundation model platform designed to fuel and support robot learning data pipelines.
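
The idea of using a world model as a sandbox for policy evaluation can be illustrated with imagined rollouts. The following is a minimal sketch, not the method of any cited work: `learned_dynamics` is a hand-written one-dimensional stand-in for a learned video world model, and `imagined_rollout`, `evaluate`, and `pick_best` are hypothetical helper names.

```python
from typing import Callable, Dict, List

State = float
Policy = Callable[[State], float]


def learned_dynamics(state: State, action: float) -> State:
    """Hypothetical learned world model: predicts the next state.
    Here it is a toy 1-D system in which the action shifts the state directly."""
    return state + action


def imagined_rollout(policy: Policy, start: State, horizon: int) -> List[State]:
    """Roll the policy forward inside the world model; no robot is involved."""
    states = [start]
    for _ in range(horizon):
        action = policy(states[-1])
        states.append(learned_dynamics(states[-1], action))
    return states


def evaluate(policy: Policy, start: State, goal: State, horizon: int) -> float:
    """Score a policy by its final imagined distance to the goal (lower is better)."""
    return abs(imagined_rollout(policy, start, horizon)[-1] - goal)


def pick_best(policies: Dict[str, Policy], start: State, goal: State, horizon: int) -> str:
    """Return the name of the candidate with the smallest imagined error."""
    return min(policies, key=lambda name: evaluate(policies[name], start, goal, horizon))


GOAL = 1.0
candidates: Dict[str, Policy] = {
    "idle": lambda s: 0.0,
    "cautious": lambda s: 0.1 * (GOAL - s),
    "aggressive": lambda s: 0.5 * (GOAL - s),
}
print(pick_best(candidates, start=0.0, goal=GOAL, horizon=5))
```

Every candidate here is scored purely from predicted states, so the ranking is only as trustworthy as the world model that produced them.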

Image 3. Model Architecture of DreamZero^[9]

2. RFM: Learning Action and Physics in the Physical World

RFM research is dynamically evolving across multiple dimensions, anchored primarily by the VLA and World Model frameworks. The VLA track concentrates on mapping visual and textual cues directly onto robot actions, whereas the World Model track prioritizes forecasting environmental state transitions post-action. Though their points of departure differ, both approaches share the ultimate objective of enabling robots to operate with greater stability and robustness in the physical world.

The overarching challenge threading through this paradigm shift is cross-embodiment generalization. For an RFM to transcend platform-specific control policies and become a true 'foundation model,' it must effectively synthesize and exploit data from heterogeneous robot forms within a unified learning architecture. This bottleneck manifests as an action representation or robot-specific adapter challenge within the VLA domain, while in the World Model domain, it centers on extracting shared physical dynamics across disparate embodiments via video data.
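
The "robot-specific adapter" idea mentioned for the VLA domain can be sketched as a shared backbone that emits one latent action, plus a small decoder per embodiment. This is a toy under stated assumptions: `EmbodimentAdapter`, `SharedPolicy`, the embodiment names, and the fixed linear weights are all hypothetical, and in practice both the backbone and the adapters would be learned.

```python
from typing import Dict, List


class EmbodimentAdapter:
    """Hypothetical robot-specific head: maps a shared latent action
    to that robot's own joint command space via a fixed linear map."""

    def __init__(self, weights: List[List[float]]):
        self.weights = weights  # shape: [num_joints][latent_dim]

    def decode(self, latent: List[float]) -> List[float]:
        return [sum(w * z for w, z in zip(row, latent)) for row in self.weights]


class SharedPolicy:
    """One backbone shared by all robots, plus one adapter per embodiment."""

    def __init__(self, adapters: Dict[str, EmbodimentAdapter]):
        self.adapters = adapters

    def backbone(self, observation: List[float]) -> List[float]:
        # Placeholder for the shared network: a 2-D latent action
        # made of the mean and the range of the observation vector.
        mean = sum(observation) / len(observation)
        return [mean, max(observation) - min(observation)]

    def act(self, observation: List[float], embodiment: str) -> List[float]:
        latent = self.backbone(observation)
        return self.adapters[embodiment].decode(latent)


policy = SharedPolicy({
    "arm_2dof": EmbodimentAdapter([[1.0, 0.0], [0.0, 1.0]]),
    "gripper_1dof": EmbodimentAdapter([[0.5, 0.5]]),
})
print(policy.act([1.0, 3.0], "arm_2dof"))      # two joint commands
print(policy.act([1.0, 3.0], "gripper_1dof"))  # one joint command
```

The design choice illustrated is that data from every robot can train the same backbone, while differences in joint count and command space are confined to the adapters.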

Current studies are actively expanding the generalization capabilities of RFMs from multi-faceted angles, including action generation, future state prediction, failure learning, tactile/force feedback, and cross-embodiment learning. Viewed from this perspective, RFM research has progressed beyond the rudimentary stage of merely adapting LLMs/VLMs to robotics; it is now steering toward autonomous systems that act reliably in the physical world and iteratively self-improve through failure experiences.

3. The Potential and Roadmap of LG's “Physical AI”

As illustrated above, RFMs are rapidly advancing toward translating visual and textual comprehension into executable robotic actions. However, deploying these models in messy real-world settings, such as manufacturing facilities or smart homes, still presents formidable hurdles. Chief among these are generalization across diverse hardware forms and shifting environments, the stability of contact-rich manipulation, and real-time recovery in the face of unexpected failures.

Building on the text and visual comprehension and reasoning capabilities matured through its EXAONE LLM/VLM research, LG AI Research is actively pursuing RFM initiatives to extend these capabilities into the domain of Physical AI. Specifically, the organization is systematically refining the core technologies needed for physical grounding and execution, spanning vision-language-based task comprehension, scene and object grounding, behavior planning, and low-level robot control integration.

On the empirical front, LG AI Research is driving Proof of Concept (PoC) initiatives in collaboration with the LG Electronics Production Engineering Research Institute (PRI) to deploy these models into live manufacturing ecosystems. Given that factory floors are highly volatile environments characterized by frequent process variations demanding extreme precision and uptime, this collaboration serves as an ideal stress-test to validate whether RFMs can accurately digest on-site operational instructions and visual feeds to solve complex automation bottlenecks.

Concurrently, we are exploring home robot intelligence. Unlike factories, household environments are inherently unstructured, heavily populated by non-rigid objects, and require safe navigation alongside human cohabitants under diverse, ambiguous user prompts. Consequently, the smart home represents a pivotal application domain for mapping EXAONE's multimodal understanding and high-level reasoning onto safe, dependable robotic action sequences.

Moving forward, the Physical Intelligence Lab at LG AI Research intends to leverage the competitive edge of the EXAONE LLM/VLM ecosystem to build a versatile RFM capable of operating across both industrial and domestic settings, thereby establishing a robust technological foundation for Physical AI that acts reliably in the real world.

References

[1] Gemini Robotics Team, “Gemini Robotics: Bringing AI into the Physical World.” arXiv, 2025.

[2] NVIDIA, “GR00T N1: An Open Foundation Model for Generalist Humanoid Robots.” arXiv, 2025.

[3] Figure AI, “Helix: A Vision-Language-Action Model for Generalist Humanoid Control.” Figure AI, 2025.

[4] Physical Intelligence, “π0: A Vision-Language-Action Flow Model for General Robot Control.” arXiv, 2024.

[5] Physical Intelligence, “π0.5: A Vision-Language-Action Model with Open-World Generalization.” arXiv, 2025.

[6] Physical Intelligence, “π*0.6: A VLA That Learns From Experience.” arXiv, 2025.

[7] Physical Intelligence, “π?.?: A Steerable Generalist Robotic Foundation Model with Emergent Capabilities.” arXiv, 2026.

[8] D. Kim et al., “RLDX-1 Technical Report.” arXiv, 2026.

[9] S. Ye et al., “World Action Models are Zero-shot Policies.” arXiv, 2026.

[10] S. Gao et al., “DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos.” arXiv, 2026.

[11] N. Agarwal et al., “Cosmos World Foundation Model Platform for Physical AI.” arXiv, 2025.

[12] A. Ali et al., “World Simulation with Video Foundation Models for Physical AI.” arXiv, 2025.
