Framework of StreamingClaw

StreamingClaw Framework

Pipeline of StreamingClaw: Multi-end inputs and the user query are fed into the main–sub agents for perception and decision-making. The instructions produced by the main–sub agents then guide the execution of tools and skills, whose results are fed back to the main–sub agents, forming a closed-loop pipeline of perception–decision–action.
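
The closed loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Instruction`, `LoopState`, `run_closed_loop`, the `decide` callback, and the toy `caption` tool are hypothetical names, not part of StreamingClaw.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Instruction:
    """Hypothetical instruction emitted by the agents: which tool to run, on what."""
    tool: str
    argument: str


@dataclass
class LoopState:
    """Everything the agents can see: the user query plus accumulated observations."""
    query: str
    observations: List[str] = field(default_factory=list)


def run_closed_loop(
    query: str,
    inputs: List[str],
    decide: Callable[[LoopState], Optional[Instruction]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> LoopState:
    """Perceive, decide, act, then feed the result back, until the agent stops."""
    state = LoopState(query=query, observations=list(inputs))
    for _ in range(max_steps):
        instruction = decide(state)          # decision step (stand-in for the agents)
        if instruction is None:              # the agent has nothing more to do
            break
        result = tools[instruction.tool](instruction.argument)  # action step
        state.observations.append(result)    # feedback closes the loop
    return state


def toy_decide(state: LoopState) -> Optional[Instruction]:
    """Toy policy: request one caption for the latest input, then stop."""
    if any(obs.startswith("caption:") for obs in state.observations):
        return None
    return Instruction(tool="caption", argument=state.observations[-1])


toy_tools = {"caption": lambda frame: f"caption:{frame}"}
final_state = run_closed_loop("what is happening?", ["frame_0"], toy_decide, toy_tools)
print(final_state.observations)
```

The `max_steps` bound is a design choice of this sketch, added so that a misbehaving policy cannot loop forever.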

StreamingMind Agent

StreamingMind targets streaming video understanding scenarios with continuous input and output. Its main goal is to perceive, understand, and reason over the stream under low-latency constraints, answering user queries and producing results in real time.

  • Streaming Reasoning: StreamingMind enables real-time streaming video reasoning by chunking incoming video, using a dynamic sliding window to bound context, and leveraging a pruned streaming KV-cache for efficient incremental decoding, thereby maintaining low-latency understanding and response generation over long video streams.
  • Self-Planning Scheduling: StreamingMind parses each user instruction to plan and schedule sub-agents on its own. When the instruction calls for it, StreamingMind retrieves hierarchical memory or delegates proactive interaction decisions; otherwise it performs direct low-latency streaming multimodal reasoning.

StreamingMind Agent Architecture

Flowchart of StreamingMind’s streaming reasoning and self-planning scheduling.
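
The chunking and sliding-window ideas from the Streaming Reasoning item can be illustrated with a minimal sketch. It assumes frames are plain integers and that the window is measured in chunks; `chunk_stream` and `SlidingContext` are hypothetical names. The pruned KV-cache is not modeled here, only the bounded context it would operate on.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def chunk_stream(frames: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Group an incoming frame stream into fixed-size chunks (last one may be short)."""
    chunk: List[int] = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


class SlidingContext:
    """Keeps only the most recent chunks so the context stays bounded."""

    def __init__(self, max_chunks: int) -> None:
        self.window: Deque[List[int]] = deque(maxlen=max_chunks)

    def push(self, chunk: List[int]) -> None:
        # Appending to a full deque silently evicts the oldest chunk.
        self.window.append(chunk)

    def resize(self, max_chunks: int) -> None:
        # "Dynamic" window: rebuild with a new bound, keeping the newest chunks.
        self.window = deque(self.window, maxlen=max_chunks)

    def frames(self) -> List[int]:
        """Flatten the window into the frames currently visible to the model."""
        return [frame for chunk in self.window for frame in chunk]


context = SlidingContext(max_chunks=2)
for chunk in chunk_stream(range(10), chunk_size=3):
    context.push(chunk)

visible = context.frames()
print(visible)
```

With ten frames, a chunk size of three, and a two-chunk window, only the last two chunks remain visible. How StreamingMind actually chooses the window size at run time is not specified here, so `resize` is left for the caller to invoke.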

HierarchicalMemory Agent

HierarchicalMemory Agent Architecture

Flowchart of HierarchicalMemory’s memory storage, evolution, and retrieval algorithms.

HierarchicalMemory is introduced to provide evolving, efficient, multimodal, hierarchical long-term memory support for complex streaming video understanding tasks.

  • Multimodal Memory Storage: Store vision-centric multimodal memories as incremental memory nodes.
  • Hierarchical Memory Evolution: Evolve fragments into actions and events, improving retrieval.
  • Efficient Memory Retrieval: Command-driven, parallel, temporal traversal retrieval boosts efficiency and robustness.
  • Cross-Agent Unified Memory: Unified interfaces enable cross-agent memory sharing with differentiated management.
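
As a rough illustration of the storage, evolution, and retrieval ideas listed above, the sketch below builds a three-level tree (fragment, action, event) and retrieves fragments by time range. It is a toy under stated assumptions: `MemoryNode`, `evolve`, and `retrieve` are hypothetical names, grouping by fixed size stands in for whatever merging rule the system uses, and summaries are plain strings rather than multimodal content. Command-driven and parallel retrieval are not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    level: str            # "fragment", "action", or "event"
    start: float          # start time in seconds
    end: float            # end time in seconds
    summary: str
    children: List["MemoryNode"] = field(default_factory=list)


def evolve(nodes: List[MemoryNode], parent_level: str, group_size: int) -> List[MemoryNode]:
    """Merge consecutive nodes into parents (placeholder for real summarization)."""
    parents = []
    for i in range(0, len(nodes), group_size):
        group = nodes[i:i + group_size]
        parents.append(MemoryNode(
            level=parent_level,
            start=group[0].start,
            end=group[-1].end,
            summary=" | ".join(node.summary for node in group),
            children=group,
        ))
    return parents


def retrieve(roots: List[MemoryNode], t0: float, t1: float,
             level: str = "fragment") -> List[MemoryNode]:
    """Top-down temporal traversal: descend only into nodes overlapping [t0, t1]."""
    hits = []
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        if node.end < t0 or node.start > t1:
            continue                      # prune the whole subtree
        if node.level == level:
            hits.append(node)
        else:
            stack.extend(reversed(node.children))
    return hits


# Six 2-second fragments -> three actions -> one event.
fragments = [MemoryNode("fragment", 2 * i, 2 * i + 2, f"clip{i}") for i in range(6)]
actions = evolve(fragments, "action", group_size=2)
events = evolve(actions, "event", group_size=3)

hits = retrieve(events, t0=5, t1=7)
print([node.summary for node in hits])
```

The hierarchy lets retrieval skip entire subtrees whose time span does not overlap the query, which is one plausible reading of why evolving fragments into actions and events improves retrieval.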

ProactiveMind Agent

ProactiveMind monitors streaming video for events and triggers timely proactive responses via training-free or training-based paradigms.
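
A training-free trigger of this kind could look roughly like the sketch below. It is an assumption-laden toy, not ProactiveMind's method: `monitor` is a hypothetical function, the `score` callback stands in for any event detector, and the threshold and cooldown values are illustrative.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def monitor(
    chunks: Iterable[Tuple[float, str]],
    score: Callable[[str], float],
    threshold: float,
    cooldown: float,
) -> List[float]:
    """Return the timestamps at which a proactive response would fire."""
    fired: List[float] = []
    last_fired: Optional[float] = None
    for timestamp, content in chunks:
        if score(content) < threshold:
            continue                      # no event detected in this chunk
        if last_fired is not None and timestamp - last_fired < cooldown:
            continue                      # suppress repeated triggers for one event
        fired.append(timestamp)
        last_fired = timestamp
    return fired


# Toy stream of (timestamp, description) pairs and a keyword-based scorer.
stream = [
    (0.0, "empty road"),
    (1.0, "a dog appears"),
    (2.0, "dog running"),
    (3.0, "empty road"),
    (6.0, "dog returns"),
]
triggers = monitor(
    stream,
    score=lambda text: 1.0 if "dog" in text else 0.0,
    threshold=0.5,
    cooldown=3.0,
)
print(triggers)
```

The cooldown keeps the agent from responding again to the same ongoing event. A training-based variant would presumably replace the hand-written scorer and threshold with a learned decision.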

ProactiveMind Scenarios

Interaction scenarios of the ProactiveMind agent: time-aware scenario and event-grounding scenario.

ProactiveMind Agent Schemes

Implementation of the ProactiveMind agent: training-free adaptation and training-based adaptation.

ProactiveMind Training

Flowchart of the training-based adaptation pipeline for the ProactiveMind agent.

Scalable Tools and Skills

StreamingClaw integrates scalable tools and skills for the multimodal perception–decision–action loop, including video- and streaming-specific extensions.
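
One common way to keep a tool set extensible is a name-based registry, sketched below. This is an illustration only: `SkillRegistry` and the two example skills (`clip_video`, `count_frames`) are hypothetical and do not reflect StreamingClaw's actual interfaces.

```python
from typing import Callable, Dict, List


class SkillRegistry:
    """Maps skill names to callables so new skills can be added without touching callers."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[..., object]] = {}

    def register(self, name: str) -> Callable[[Callable[..., object]], Callable[..., object]]:
        """Decorator that registers a function under the given name."""
        def decorator(fn: Callable[..., object]) -> Callable[..., object]:
            if name in self._skills:
                raise ValueError(f"skill already registered: {name}")
            self._skills[name] = fn
            return fn
        return decorator

    def call(self, name: str, **kwargs: object) -> object:
        """Dispatch an instruction (name plus keyword arguments) to a skill."""
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        return self._skills[name](**kwargs)

    def names(self) -> List[str]:
        return sorted(self._skills)


registry = SkillRegistry()


@registry.register("clip_video")
def clip_video(start: float, end: float) -> str:
    # Placeholder: a real skill would cut the video between the two timestamps.
    return f"clip[{start:.1f}-{end:.1f}]"


@registry.register("count_frames")
def count_frames(duration: float, fps: float) -> int:
    # Number of frames in a segment of the given duration.
    return int(duration * fps)


print(registry.names())
print(registry.call("clip_video", start=1, end=2.5))
print(registry.call("count_frames", duration=2.0, fps=30.0))
```

Because agents refer to skills only by name and keyword arguments, video- or streaming-specific extensions can be registered without changing the dispatch code.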

Citation

BibTeX

@article{streamingclaw2026,
  title={StreamingClaw Technical Report},
  author={Chen, Jiawei and Chen, Zhe and Du, Chaoqun and He, Maokui and He, Wei and Li, Hengtao and Li, Qizhen and Liu, Zide and Ma, Hao and Pan, Xuhao and Ren, Chang and Rao, Xudong and Shen, Xintian and Wang, Chenfeng and Wei, Tao and Yu, Chengjun and Yu, Pengfei and Yao, Shengyu and Zhou, Chunpeng and Zhan, Kun and Zheng, Lihao and Zhou, Pan and Zhu, Xuhan and Zheng, Yufei},
  journal={arXiv preprint arXiv:2603.22120},
  year={2026}
}