Skip to main contentPlayground
A cutting-edge MoE model achieving SOTA performance across text, image, audio, and video simultaneously. It uses a Thinker–Talker architecture for low-latency, real-time, streaming responses.
Key Features
- Natively Omni-Modal: Unifies processing of text, image, audio, and video, ensuring high performance across all modalities.
- Real-Time Speed: Features ultra-low latency streaming and natural speech output, enabling fluent audio-visual dialogue.
- SOTA Audio: Achieves state-of-the-art results in audio benchmarks, excelling at speech recognition and sound analysis.
- Flexible Control: Supports customization via system prompts and function calling for seamless integration with external tools.