StreamingReasoning Agent

StreamingReasoning targets streaming video understanding scenarios in which input arrives and output is produced continuously. Its main goal is real-time perception, understanding, and reasoning under low-latency constraints, so that user queries are answered while the video is still streaming.

  • Streaming Inference: StreamingReasoning enables real-time streaming video reasoning by chunking incoming video, using a dynamic sliding window to bound context, and leveraging a pruned streaming KV-cache for efficient incremental decoding.
  • Self-Planning Scheduling: StreamingReasoning dynamically parses user instructions to plan and schedule sub-agents. Depending on the instruction, it retrieves hierarchical memory or delegates proactive interaction decisions; otherwise it performs direct low-latency streaming multimodal reasoning.
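
The scheduling decision described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the cue lists, the step names, and the keyword matching are hypothetical stand-ins for the model-driven instruction parsing, which the page does not specify.

```python
from typing import List

# Hypothetical cue words; the real agent parses instructions with a model.
MEMORY_CUES = ("earlier", "before", "previously", "remember")
PROACTIVE_CUES = ("notify", "alert", "remind", "tell me when")


def plan(instruction: str) -> List[str]:
    """Return the (hypothetical) sub-agent steps to schedule for an instruction."""
    text = instruction.lower()
    steps: List[str] = []
    if any(cue in text for cue in MEMORY_CUES):
        steps.append("retrieve_memory")        # consult hierarchical memory
    if any(cue in text for cue in PROACTIVE_CUES):
        steps.append("delegate_proactivity")   # hand off to the proactive agent
    if not steps:
        steps.append("direct_reasoning")       # default low-latency path
    return steps


print(plan("What is the person holding?"))
print(plan("Where did I put my keys earlier?"))
print(plan("Alert me when the kettle boils"))
```

The only point of the sketch is the control flow: optional memory retrieval, optional delegation, and direct reasoning as the fallback.
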
StreamingReasoning Agent Architecture

Flowchart of StreamingReasoning’s streaming inference and self-planning scheduling.
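
As a rough illustration of the streaming inference idea (chunking the incoming video and bounding context with a sliding window), here is a minimal sketch. It rests on several assumptions: frames are represented as integers, the window has a fixed size rather than the dynamic size described above, and KV-cache pruning is not modeled. The names `chunk_stream` and `SlidingWindowContext` are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List


def chunk_stream(frames: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Group an incoming frame stream into fixed-size chunks."""
    chunk: List[int] = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk


class SlidingWindowContext:
    """Keeps only the most recent chunks so the context stays bounded."""

    def __init__(self, max_chunks: int) -> None:
        self.window: Deque[List[int]] = deque(maxlen=max_chunks)

    def push(self, chunk: List[int]) -> List[int]:
        self.window.append(chunk)  # the oldest chunk is evicted automatically
        return [frame for c in self.window for frame in c]


ctx = SlidingWindowContext(max_chunks=2)
contexts = [ctx.push(c) for c in chunk_stream(range(7), chunk_size=3)]
print(contexts)  # [[0, 1, 2], [0, 1, 2, 3, 4, 5], [3, 4, 5, 6]]
```

Each call to `push` returns the frames a decoder would attend to at that step. Once the window is full, the oldest chunk drops out, which is what keeps per-step cost bounded on an unbounded stream.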

StreamingMemory Agent

HierarchicalMemory Agent Architecture

Flowchart of StreamingMemory’s memory storage, evolution, and retrieval algorithm.

StreamingMemory is introduced to provide evolving, efficient, multimodal, hierarchical long-term memory support for complex streaming video understanding tasks.

  • Multimodal Memory Storage: Store vision-centric multimodal memories as incremental memory nodes.
  • Hierarchical Memory Evolution: Evolve fragments into actions and events, improving retrieval.
  • Efficient Memory Retrieval: Command-driven, parallel, temporal traversal retrieval boosts efficiency and robustness.
  • Cross-Agent Unified Memory: Unified interfaces enable cross-agent memory sharing with differentiated management.
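
The storage, evolution, and retrieval ideas above can be sketched with a small tree of memory nodes. This is an illustrative assumption, not the StreamingMemory implementation: the `MemoryNode` fields, the `merge` helper, and the text summaries are hypothetical, real memories are vision-centric rather than text-only, and parallel retrieval is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    level: str   # "fragment", "action", or "event"
    start: float  # start time in seconds
    end: float    # end time in seconds
    summary: str
    children: List["MemoryNode"] = field(default_factory=list)


def merge(nodes: List[MemoryNode], level: str, summary: str) -> MemoryNode:
    """Evolve lower-level nodes into one higher-level node spanning them."""
    return MemoryNode(level, min(n.start for n in nodes),
                      max(n.end for n in nodes), summary, list(nodes))


def retrieve(node: MemoryNode, start: float, end: float, level: str) -> List[MemoryNode]:
    """Temporal traversal: descend only into nodes overlapping [start, end]."""
    if node.end < start or node.start > end:
        return []  # prune the whole subtree
    if node.level == level:
        return [node]
    hits: List[MemoryNode] = []
    for child in node.children:
        hits.extend(retrieve(child, start, end, level))
    return hits


fragments = [
    MemoryNode("fragment", 0, 2, "hand reaches for cup"),
    MemoryNode("fragment", 2, 4, "cup lifted"),
    MemoryNode("fragment", 4, 6, "water poured"),
    MemoryNode("fragment", 6, 8, "kettle put down"),
]
pick_up = merge(fragments[:2], "action", "pick up cup")
pour = merge(fragments[2:], "action", "pour water")
event = merge([pick_up, pour], "event", "make tea")

print([n.summary for n in retrieve(event, 5, 7, "fragment")])
print([n.summary for n in retrieve(event, 0, 1, "action")])
```

Because higher-level nodes carry the time span of their children, a query can skip entire subtrees that fall outside the requested interval, which is one way a hierarchy can make retrieval cheaper than scanning every fragment.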

StreamingProactivity Agent

StreamingProactivity monitors streaming video for events and triggers timely proactive responses via training-free or training-based paradigms.

ProactiveMind Scenarios

Interaction scenarios of the StreamingProactivity agent: time-aware scenario and event-grounding scenario.

ProactiveMind Agent Schemes

Implementation of the StreamingProactivity agent: training-free adaptation and training-based adaptation.

ProactiveMind Training

Flowchart of the training-based adaptation pipeline for the StreamingProactivity agent.

The training-based adaptation approach achieves the following objectives:

  • Support Multiple Event Types: The agent can connect to diverse proactive interaction scenarios via a unified visual-signal triggering interface. With scenario-specific signal mappings, it improves the accuracy of multi-event recognition and enhances cross-scenario generalization.
  • Customizable Events with High Recognition Accuracy: By constructing domain-tailored video training data, the training scheme learns the correspondence between events and proactive signals, achieving higher event recognition accuracy than training-free approaches.
  • Reduced Inference Overhead: When multiple proactive interaction requirements coexist, the model can generate multiple scenario-specific tokens as response signals in a single forward pass, instead of running inference multiple times to produce multiple signals.
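
To make the single-pass, multi-signal idea concrete, the sketch below decodes several scenario-specific tokens from one model output and maps them to actions. The token names, the `SIGNAL_MAP` table, and the action names are all hypothetical; the page does not list the actual signal vocabulary or mapping.

```python
import re
from typing import Dict, List

# Hypothetical scenario-specific signal mapping (not from the report).
SIGNAL_MAP: Dict[str, str] = {
    "<fall>": "alert_caregiver",
    "<door_open>": "notify_user",
    "<silent>": "no_action",
}
TOKEN_PATTERN = re.compile(r"<[a-z_]+>")


def decode_signals(model_output: str) -> List[str]:
    """Map every signal token from one forward pass to its action."""
    actions: List[str] = []
    for token in TOKEN_PATTERN.findall(model_output):
        action = SIGNAL_MAP.get(token)
        if action and action != "no_action":  # skip silent or unknown tokens
            actions.append(action)
    return actions


print(decode_signals("<fall><silent>"))
print(decode_signals("<fall><door_open>"))
print(decode_signals("<silent><silent>"))
```

With one token slot per active scenario, a single output string carries the decision for every scenario at once, so the model does not have to be queried once per scenario.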

Scalable Tools and Skills

Streaming-Adapted Toolbox

Embodied Interaction Skill Library

Citation

BibTeX

@article{streamingclaw2026,
  title={StreamingClaw Technical Report},
  author={Chen, Jiawei and Chen, Zhe and Du, Chaoqun and He, Maokui and He, Wei and Li, Hengtao and Li, Qizhen and Liu, Zide and Ma, Hao and Pan, Xuhao and Ren, Chang and Rao, Xudong and Shen, Xintian and Wang, Chenfeng and Wei, Tao and Yu, Chengjun and Yu, Pengfei and Yao, Shengyu and Zhou, Chunpeng and Zhan, Kun and Zheng, Lihao and Zhou, Pan and Zhu, Xuhan and Zheng, Yufei},
  journal={arXiv preprint arXiv:2603.22120},
  year={2026}
}