Abstract
Applications such as embodied intelligence, AI hardware, autonomous driving, and intelligent cockpits rely heavily on a real-time perception–decision–action closed loop, which places stringent demands on real-time streaming video understanding. However, current agents generally suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time inference and proactive interaction under long-duration, continuous streaming inputs. These shortcomings have become a key bottleneck that prevents agents from sustaining perception, making real-time decisions, and executing closed-loop actions in complex real-world environments, severely constraining their deployment and potential in dynamic, open physical worlds.
To address these issues, we propose StreamingClaw, a unified agent framework for real-time streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time multimodal streaming interaction. StreamingClaw systematically integrates the following five core capabilities:
(1) It supports real-time streaming reasoning over multimodal inputs.
(2) It supports reasoning about future events and proactive interaction as interaction objectives evolve online.
(3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents, tailored for real-time streaming scenarios.
(4) It supports a perception–decision–action closed loop. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments.
(5) It is compatible with the OpenClaw framework, allowing it to fully leverage the extensive resources and support of the open-source community.
With the above design, StreamingClaw integrates online real-time perception, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by using scalable tools and skills to translate decisions into executable actions, StreamingClaw enables direct control of the physical world, supporting practical deployment of embodied interaction.
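For illustration only, the perception–decision–action closed loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the StreamingClaw implementation: the class `StreamingAgentSketch`, the `Observation` record, the keyword-matching policy, and the `obstacle` skill are all hypothetical names introduced here, and a bounded deque stands in for the long-term multimodal memory.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Iterable, List, Optional, Tuple


@dataclass
class Observation:
    """One perceived unit of the stream (hypothetical structure)."""
    timestamp: float
    description: str


@dataclass
class StreamingAgentSketch:
    # Hypothetical stand-in; none of these names come from StreamingClaw or OpenClaw.
    skills: Dict[str, Callable[[Observation], str]]
    # A bounded buffer as a toy substitute for long-term multimodal memory.
    memory: Deque[Observation] = field(default_factory=lambda: deque(maxlen=1000))

    def perceive(self, timestamp: float, frame: object) -> Observation:
        # Assumption: a real system would run a multimodal model on the frame here.
        return Observation(timestamp, str(frame))

    def decide(self, obs: Observation) -> Optional[str]:
        # Toy policy: pick the first skill whose name appears in the observation.
        for name in self.skills:
            if name in obs.description:
                return name
        return None

    def step(self, timestamp: float, frame: object) -> Optional[str]:
        obs = self.perceive(timestamp, frame)
        self.memory.append(obs)  # store every observation before deciding
        skill = self.decide(obs)
        return self.skills[skill](obs) if skill is not None else None

    def run(self, stream: Iterable[Tuple[float, object]]) -> List[str]:
        # Process inputs one at a time, as they arrive, and collect emitted actions.
        actions = []
        for timestamp, frame in stream:
            result = self.step(timestamp, frame)
            if result is not None:
                actions.append(result)
        return actions


agent = StreamingAgentSketch(skills={"obstacle": lambda o: f"brake@{o.timestamp}"})
actions = agent.run([(0.0, "clear road"), (0.5, "obstacle ahead"), (1.0, "clear road")])
print(actions)  # ['brake@0.5']
```

The sketch processes each input as it arrives rather than waiting for the full video, which is the property that distinguishes streaming from offline understanding; proactive interaction corresponds to the agent emitting an action without an explicit user query.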