StreamingClaw

Abstract

Applications such as embodied intelligence, AI hardware, autonomous driving, and intelligent cockpits rely heavily on a real-time perception–decision–action closed loop, which places stringent demands on real-time streaming video understanding. However, current agents generally suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time inference and proactive interaction under long-duration, continuous streaming inputs. These shortcomings have become a key bottleneck preventing agents from sustaining perception, making real-time decisions, and executing closed-loop actions in complex real-world environments, severely constraining their deployment and potential in dynamic, open physical worlds.

To alleviate these critical issues, we propose StreamingClaw, a unified agent framework for real-time streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time multimodal streaming interaction. StreamingClaw systematically integrates the following five core capabilities:

(1) It supports real-time streaming reasoning over multimodal inputs.

(2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives.

(3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents, tailored for real-time streaming scenarios.

(4) It supports a closed loop of perception–decision–action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments.

(5) It is compatible with the OpenClaw framework, allowing it to fully leverage the extensive resources and support of the open-source community.

With the above design, StreamingClaw integrates online real-time perception, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by using scalable tools and skills to translate decisions into executable actions, StreamingClaw enables direct control of the physical world, supporting practical deployment of embodied interaction.

Framework of StreamingClaw

Pipeline of StreamingClaw: Multi-end inputs and the user query are fed into the main–sub agents for perception and decision-making. The instructions produced by the main–sub agents then guide the execution of tools and skills, whose results are fed back to the main–sub agents, forming a closed-loop pipeline of perception–decision–action.
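The closed loop described above can be illustrated with a minimal Python sketch. It is not the StreamingClaw implementation: `Instruction`, `EchoAgent`, `run_closed_loop`, and the `log_event` tool are hypothetical names introduced only for illustration, and frames are stand-in strings rather than video.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Instruction:
    """Hypothetical instruction: which tool to run and with what arguments."""
    tool: str
    args: dict = field(default_factory=dict)


class EchoAgent:
    """Toy stand-in for the main-sub agents (an assumption, not the real agents)."""

    def __init__(self) -> None:
        self.observations: List[str] = []

    def decide(self, frame: str, query: str) -> Optional[Instruction]:
        # Perception + decision: request a tool only when the query appears in the frame.
        if query in frame:
            return Instruction(tool="log_event", args={"frame": frame})
        return None

    def observe(self, result: str) -> None:
        # Feedback: tool results flow back into the agent's state.
        self.observations.append(result)


def run_closed_loop(
    frames: Iterable[str],
    query: str,
    agent: EchoAgent,
    tools: Dict[str, Callable[..., str]],
) -> List[str]:
    """Run perception -> decision -> action -> feedback once per incoming frame."""
    for frame in frames:
        instruction = agent.decide(frame, query)
        if instruction is None:
            continue  # nothing to act on for this frame
        result = tools[instruction.tool](**instruction.args)  # action
        agent.observe(result)  # close the loop
    return agent.observations
```

In this sketch the tool table is a plain dictionary, so adding a tool means adding one entry; how the actual framework registers tools and skills is not specified here.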

StreamingMind Agent

StreamingMind targets streaming video understanding scenarios with continuous input and output. Its main goal is to achieve real-time perception, understanding, and reasoning under low-latency constraints, responding to user queries and producing results in real time.

Streaming Reasoning: StreamingMind enables real-time streaming video reasoning by chunking incoming video, using a dynamic sliding window to bound context, and leveraging a pruned streaming KV-cache for efficient incremental decoding, thereby maintaining low-latency understanding and response generation over long video streams.
Self-Planning Scheduling: StreamingMind dynamically parses user instructions to self-plan and schedule sub-agents, optionally retrieving hierarchical memory or delegating proactive interaction decisions; when neither is needed, it performs direct low-latency streaming multimodal reasoning.
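The chunking and bounded-context idea in Streaming Reasoning can be sketched as follows. This is a simplified illustration under stated assumptions: the window uses a fixed frame budget (`max_frames`) rather than a dynamic one, frames are plain integers, and KV-cache pruning is not modeled at all. `chunk_stream` and `SlidingWindow` are hypothetical names.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def chunk_stream(frames: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Group an incoming frame stream into fixed-size chunks (last one may be short)."""
    chunk: List[int] = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


class SlidingWindow:
    """Keeps only the most recent chunks so the model context stays bounded."""

    def __init__(self, max_frames: int) -> None:
        self.max_frames = max_frames  # assumed fixed budget; the text describes a dynamic one
        self.chunks: Deque[List[int]] = deque()
        self.size = 0

    def push(self, chunk: List[int]) -> None:
        self.chunks.append(chunk)
        self.size += len(chunk)
        # Evict the oldest chunks until within budget, always keeping the newest chunk.
        while self.size > self.max_frames and len(self.chunks) > 1:
            self.size -= len(self.chunks.popleft())

    def context(self) -> List[int]:
        """Flatten the retained chunks into the context passed to the model."""
        return [frame for chunk in self.chunks for frame in chunk]
```

Evicting whole chunks rather than single frames keeps eviction cheap and aligned with the unit in which the stream arrives; a real system would also have to drop or prune the cached keys and values for evicted frames.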

Flowchart of StreamingMind’s streaming reasoning and self-planning scheduling.

HierarchicalMemory Agent

Flowchart of HierarchicalMemory’s memory storage, evolution, and retrieval algorithm.

HierarchicalMemory is introduced to provide evolving, efficient, multimodal, hierarchical long-term memory support for complex streaming video understanding tasks.

Multimodal Memory Storage: Store vision-centric multimodal memories as incremental memory nodes.
Hierarchical Memory Evolution: Evolve fragments into actions and events, improving retrieval.
Efficient Memory Retrieval: Command-driven, parallel, temporal traversal retrieval boosts efficiency and robustness.
Cross-Agent Unified Memory: Unified interfaces enable cross-agent memory sharing with differentiated management.
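The storage, evolution, and retrieval steps listed above can be sketched in a few lines of Python. The merging rule (join fragments separated by at most `max_gap` seconds) and the single fragment-to-event step are assumptions made for illustration; the report describes an intermediate action level and a command-driven, parallel retrieval that this sketch does not model. `MemoryNode`, `evolve`, and `retrieve` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryNode:
    """A memory entry covering the time span [start, end] in seconds."""
    start: float
    end: float
    text: str
    level: str = "fragment"


def evolve(fragments: List[MemoryNode], max_gap: float) -> List[MemoryNode]:
    """Merge temporally adjacent fragments into coarser 'event' nodes."""
    events: List[MemoryNode] = []
    for frag in sorted(fragments, key=lambda node: node.start):
        if events and frag.start - events[-1].end <= max_gap:
            # Close enough in time: extend the current event.
            last = events[-1]
            last.end = max(last.end, frag.end)
            last.text += "; " + frag.text
        else:
            # Too far apart: start a new event node.
            events.append(MemoryNode(frag.start, frag.end, frag.text, level="event"))
    return events


def retrieve(nodes: List[MemoryNode], t0: float, t1: float) -> List[MemoryNode]:
    """Temporal retrieval: return nodes whose span overlaps the window [t0, t1]."""
    return [node for node in nodes if node.start <= t1 and node.end >= t0]
```

Querying the coarser event level first means fewer nodes are scanned per request, which is one plausible reading of why evolving fragments into events improves retrieval.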

ProactiveMind Agent

ProactiveMind monitors streaming video for events and triggers timely proactive responses via training-free or training-based paradigms.
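As a rough illustration of a training-free trigger, the sketch below fires a proactive response whenever a per-chunk relevance score crosses a threshold, with a cooldown to avoid repeated alerts for the same event. The scoring source, `threshold`, and `cooldown` are all assumptions for illustration and are not taken from the report; `proactive_triggers` is a hypothetical name.

```python
from typing import Iterable, List, Optional, Tuple


def proactive_triggers(
    scores: Iterable[Tuple[float, float]],
    threshold: float,
    cooldown: float,
) -> List[float]:
    """Return timestamps at which a proactive response would be issued.

    scores: (timestamp_seconds, relevance_score) pairs in stream order.
            How scores are produced is outside this sketch (assumed given).
    """
    fired: List[float] = []
    last_fired: Optional[float] = None
    for timestamp, score in scores:
        above = score >= threshold
        cooled = last_fired is None or timestamp - last_fired >= cooldown
        if above and cooled:
            fired.append(timestamp)
            last_fired = timestamp
    return fired
```

The cooldown trades responsiveness for fewer duplicate alerts; a training-based variant would instead learn when to respond, which this sketch does not attempt.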

Interaction scenarios of the ProactiveMind agent: time-aware scenario and event-grounding scenario.

Implementation of the ProactiveMind agent: training-free adaptation and training-based adaptation.

Flowchart of the training-based adaptation pipeline for the ProactiveMind agent.

Scalable Tools and Skills

StreamingClaw integrates scalable tools and skills for the multimodal perception–decision–action loop, including video- and streaming-specific extensions.

Example of video cut tool invocation.

Example of call memory tool invocation.

Citation

BibTeX

@article{streamingclaw2026,
  title={StreamingClaw Technical Report},
  author={Chen, Jiawei and Chen, Zhe and Du, Chaoqun and He, Maokui and He, Wei and Li, Hengtao and Li, Qizhen and Liu, Zide and Ma, Hao and Pan, Xuhao and Ren, Chang and Rao, Xudong and Shen, Xintian and Wang, Chenfeng and Wei, Tao and Yu, Chengjun and Yu, Pengfei and Yao, Shengyu and Zhou, Chunpeng and Zhan, Kun and Zheng, Lihao and Zhou, Pan and Zhu, Xuhan and Zheng, Yufei},
  journal={arXiv preprint arXiv:2603.22120},
  year={2026}
}
