StreamingClaw

Abstract

Emerging applications such as embodied intelligence, AI hardware, autonomous driving, and intelligent cockpits rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents mostly suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time reasoning and proactive interaction under streaming input. These shortcomings have become a key bottleneck that prevents agents from sustaining perception, making real-time decisions, and executing closed-loop actions in complex real-world environments, constraining their deployment and potential in dynamic, open physical worlds. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. Beyond maintaining full compatibility with the OpenClaw framework, it natively supports real-time, multimodal streaming interactions. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term memory storage, hierarchical memory evolution, efficient memory retrieval, and memory sharing across multiple agents. (4) It supports a closed loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by using scalable tools and skills to translate decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.

StreamingReasoning Agent

StreamingReasoning targets streaming video understanding scenarios with continuous input and output. Its main goal is to achieve real-time perception, understanding, and reasoning under low-latency constraints, responding to user queries and producing results in real time.

Streaming Inference: StreamingReasoning enables real-time streaming video reasoning by chunking incoming video, using a dynamic sliding window to bound context, and leveraging a pruned streaming KV-cache for efficient incremental decoding.
Self-Planning Scheduling: StreamingReasoning dynamically parses user instructions to plan and schedule sub-agents on its own. Depending on the instruction, it retrieves hierarchical memory or delegates proactive interaction decisions; otherwise, it performs direct low-latency streaming multimodal reasoning.
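
The chunking and bounded-context idea in the streaming inference item can be illustrated with a minimal sketch. It is not StreamingClaw's implementation: the names `Chunk`, `chunk_stream`, and `SlidingWindow` are hypothetical, frames are stand-in integers, and the window uses a fixed frame budget rather than the dynamic policy the text describes. KV-cache pruning is not implemented; the evicted chunks only mark where cached entries would be dropped.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Iterator, List


@dataclass
class Chunk:
    index: int         # position of the chunk in the stream
    frames: List[int]  # stand-in for decoded frames (here: frame ids)


def chunk_stream(frames: Iterable[int], chunk_size: int) -> Iterator[Chunk]:
    """Group an incoming frame stream into fixed-size chunks."""
    buf: List[int] = []
    index = 0
    for frame in frames:
        buf.append(frame)
        if len(buf) == chunk_size:
            yield Chunk(index, buf)
            buf, index = [], index + 1
    if buf:  # flush the trailing partial chunk
        yield Chunk(index, buf)


class SlidingWindow:
    """Keeps at most `max_frames` frames of context (assumed fixed budget)."""

    def __init__(self, max_frames: int) -> None:
        self.max_frames = max_frames
        self.chunks: Deque[Chunk] = deque()
        self.size = 0

    def push(self, chunk: Chunk) -> List[Chunk]:
        """Add a chunk and return the chunks evicted to stay within budget."""
        self.chunks.append(chunk)
        self.size += len(chunk.frames)
        evicted: List[Chunk] = []
        # Evict whole chunks from the front, always keeping the newest one.
        # A real system would also drop the cached keys/values for these chunks.
        while self.size > self.max_frames and len(self.chunks) > 1:
            old = self.chunks.popleft()
            self.size -= len(old.frames)
            evicted.append(old)
        return evicted

    def context(self) -> List[int]:
        """Frames currently visible to the model, oldest first."""
        return [f for c in self.chunks for f in c.frames]
```

Evicting whole chunks rather than single frames keeps the window aligned with the units the stream was split into, which is one plausible way to keep incremental decoding simple.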

Flowchart of StreamingReasoning’s streaming inference and self-planning scheduling.
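
The three-way routing in self-planning scheduling can be sketched as follows. This is an assumption-heavy illustration: the `Route` enum, the cue lists, and `plan` are hypothetical, and keyword matching merely stands in for whatever instruction parsing the agent actually performs.

```python
from enum import Enum
from typing import List


class Route(Enum):
    MEMORY = "memory"        # retrieve from hierarchical memory
    PROACTIVE = "proactive"  # delegate to proactive interaction
    DIRECT = "direct"        # answer directly from the current stream


# Hypothetical cues; a real planner would not rely on keyword matching.
PAST_CUES = ("earlier", "before", "yesterday", "last time", "ago")
PROACTIVE_CUES = ("remind me", "alert me", "notify me", "tell me when")


def plan(instruction: str) -> List[Route]:
    """Pick the sub-agents to schedule for one user instruction."""
    text = instruction.lower()
    routes: List[Route] = []
    if any(cue in text for cue in PROACTIVE_CUES):
        routes.append(Route.PROACTIVE)
    if any(cue in text for cue in PAST_CUES):
        routes.append(Route.MEMORY)
    # Fall back to direct low-latency reasoning when nothing else applies.
    return routes or [Route.DIRECT]
```

For example, `plan("Tell me when the water boils")` selects the proactive route, while a question about the present scene falls through to direct reasoning.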

StreamingMemory Agent

Flowchart of StreamingMemory’s memory storage, evolution, and retrieval algorithm.

StreamingMemory is introduced to provide evolving, efficient, multimodal, hierarchical long-term memory support for complex streaming video understanding tasks.

Multimodal Memory Storage: Store vision-centric multimodal memories as incremental memory nodes.
Hierarchical Memory Evolution: Evolve fragments into actions and events, improving retrieval.
Efficient Memory Retrieval: Command-driven, parallel, temporal traversal retrieval boosts efficiency and robustness.
Cross-Agent Unified Memory: Unified interfaces enable cross-agent memory sharing with differentiated management.
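
The fragment, action, and event hierarchy and its temporal traversal can be sketched as a small tree. This is only an illustration under stated assumptions: `MemoryNode`, `merge`, and `retrieve` are hypothetical names, summaries are plain strings rather than multimodal content, and the traversal is sequential rather than parallel.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    level: str     # "fragment", "action", or "event"
    start: float   # start time in seconds
    end: float     # end time in seconds
    summary: str   # stand-in for the stored multimodal content
    children: List["MemoryNode"] = field(default_factory=list)


def merge(nodes: List[MemoryNode], level: str, summary: str) -> MemoryNode:
    """Fold consecutive lower-level nodes into one higher-level node."""
    return MemoryNode(level, nodes[0].start, nodes[-1].end, summary, list(nodes))


def retrieve(node: MemoryNode, t0: float, t1: float, level: str) -> List[MemoryNode]:
    """Top-down traversal returning nodes at `level` that overlap [t0, t1]."""
    if node.end < t0 or node.start > t1:
        return []  # prune subtrees outside the query window
    if node.level == level:
        return [node]
    hits: List[MemoryNode] = []
    for child in node.children:
        hits.extend(retrieve(child, t0, t1, level))
    return hits
```

Because each parent spans the time range of its children, a query can discard an entire action or event without visiting its fragments, which is one way a hierarchy can make retrieval cheaper than a flat scan.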

StreamingProactivity Agent

StreamingProactivity monitors streaming video for events and triggers timely proactive responses via training-free or training-based paradigms.

Interaction scenarios of the StreamingProactivity agent: time-aware scenario and event-grounding scenario.

Implementation of the StreamingProactivity agent: training-free adaptation and training-based adaptation.

Flowchart of the training-based adaptation pipeline for the StreamingProactivity agent.

The training-based adaptation approach achieves the following objectives:

Support Multiple Event Types: The agent can connect to diverse proactive interaction scenarios via a unified visual-signal triggering interface. With scenario-specific signal mappings, it improves the accuracy of multi-event recognition and enhances cross-scenario generalization.
Customizable Events with High Recognition Accuracy: By constructing domain-tailored video training data, the training scheme learns the correspondence between events and proactive signals, achieving higher event recognition accuracy than training-free approaches.
Reduced Inference Overhead: When multiple proactive interaction requirements coexist, the model can generate multiple scenario-specific tokens as response signals in a single forward pass, instead of running inference multiple times to produce multiple signals.
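
The last objective, producing several scenario-specific signals from one inference call, can be illustrated on the decoding side. The token names, the `SIGNAL_TOKENS` mapping, and `decode_signals` are hypothetical; the source does not specify the actual signal vocabulary.

```python
import re
from typing import Dict, List

# Hypothetical mapping from special tokens to proactive scenarios.
SIGNAL_TOKENS: Dict[str, str] = {
    "<sig_fall>": "fall_detected",
    "<sig_drowsy>": "driver_drowsy",
    "<sig_arrive>": "owner_approaching",
}

TOKEN_PATTERN = re.compile(r"<sig_[a-z]+>")


def decode_signals(model_output: str) -> List[str]:
    """Map every known signal token in one decoded output to its scenario."""
    scenarios: List[str] = []
    for token in TOKEN_PATTERN.findall(model_output):
        scenario = SIGNAL_TOKENS.get(token)
        # Skip unknown tokens and repeated signals.
        if scenario and scenario not in scenarios:
            scenarios.append(scenario)
    return scenarios
```

With this shape, an output such as `"<sig_drowsy><sig_fall>"` triggers two scenarios from a single model response instead of one inference run per scenario.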

Scalable Tools and Skills

Streaming-Adapted Toolbox

Example of call memory tool invocation.

Example of video cut tool invocation.

Embodied Interaction Skill Library

Daily streaming dialogue - multi-agent collaboration: proactive interaction and cross-agent streaming multimodal memory

Embodied vehicle skill - driver monitoring system: serving as a driving companion assistant to provide safety protection and emotional support

Embodied vehicle skill - vehicle welcome: serving as a thoughtful travel concierge

Embodied robot skill - home care: providing elderly companionship, family caregiving, round-the-clock continuous care, and danger alerts

Citation

BibTeX

@article{streamingclaw2026,
  title={StreamingClaw Technical Report},
  author={Chen, Jiawei and Chen, Zhe and Du, Chaoqun and He, Maokui and He, Wei and Li, Hengtao and Li, Qizhen and Liu, Zide and Ma, Hao and Pan, Xuhao and Ren, Chang and Rao, Xudong and Shen, Xintian and Wang, Chenfeng and Wei, Tao and Yu, Chengjun and Yu, Pengfei and Yao, Shengyu and Zhou, Chunpeng and Zhan, Kun and Zheng, Lihao and Zhou, Pan and Zhu, Xuhan and Zheng, Yufei},
  journal={arXiv preprint arXiv:2603.22120},
  year={2026}
}
