Research on Large Language Models that can process video as native input, covering proprietary APIs and open-source alternatives.
## Quick Comparison
| Model | Video Input | API Access | Key Strengths |
|---|---|---|---|
| Gemini 2.5 Pro | ✅ Native | Google AI API | Multi-hour video at low media resolution, 1M token context |
| Gemini 2.0 Flash | ✅ Native + Live | Google AI API | Real-time streaming video, 1M context |
| GPT-4o | ⚠️ Frame extraction | OpenAI API | Vision model processes video frames as images |
| Claude 3.5/4.5 | ❌ No video | Anthropic API | Strong image/PDF understanding, no video |
| Video-LLaMA 3 | ✅ Native | Open Source | Audio-visual understanding, self-hostable |
| LLaVA-NeXT | ✅ Zero-shot | Open Source | Strong video transfer from image training |
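
Of the proprietary models above, only Gemini accepts a video file directly as model input. Below is a minimal sketch using the `google-genai` Python SDK; the file name and prompt are placeholders, and upload/polling details may vary by SDK version:

```python
# Minimal sketch: native video input via the Gemini API (google-genai SDK).
# Assumes GEMINI_API_KEY is set in the environment; "lecture.mp4" is a
# placeholder file path.
import time

from google import genai

client = genai.Client()

# Upload the video through the Files API; video files are processed
# asynchronously, so poll until the file is ready to use.
video = client.files.upload(file="lecture.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

# Pass the uploaded video directly as model input alongside a text prompt.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video, "Summarize this video and list its key moments."],
)
print(response.text)
```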
## Key Findings
- Google's Gemini models lead proprietary offerings in native video understanding (see the Gemini sketch above)
- GPT-4o requires a frame-extraction workaround rather than true native video input via the API (see the sketch below)
- Claude focuses on text, image, and document analysis; it offers no video support
- Open-source alternatives (Video-LLaMA 3, LLaVA-NeXT) provide self-hostable video understanding (see the transformers sketch after the GPT-4o example)
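
Because GPT-4o has no video endpoint, the usual workaround is to sample frames and send them to the vision API as images. A minimal sketch with OpenCV and the official `openai` SDK; the file name, sampling interval, and frame cap are illustrative assumptions:

```python
# Minimal sketch of the GPT-4o workaround: sample frames from a video with
# OpenCV, then send them as base64-encoded images in a single chat request.
# "clip.mp4", every_n, and max_frames are placeholder values.
import base64

import cv2
from openai import OpenAI


def sample_frames(path: str, every_n: int = 30, max_frames: int = 20) -> list[str]:
    """Return up to max_frames JPEG frames, base64-encoded, one per every_n frames."""
    frames: list[str] = []
    cap = cv2.VideoCapture(path)
    i = 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(jpeg.tobytes()).decode("utf-8"))
        i += 1
    cap.release()
    return frames


client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Build one user message containing the prompt plus every sampled frame.
content = [{"type": "text", "text": "Describe what happens across these video frames."}]
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    for b64 in sample_frames("clip.mp4")
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```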
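
For the self-hosted route, LLaVA-NeXT-Video can be run zero-shot through Hugging Face `transformers`. A minimal sketch, assuming the `llava-hf/LLaVA-NeXT-Video-7B-hf` checkpoint and its documented `USER: <video>` prompt format; verify both against the model card before relying on them:

```python
# Minimal self-hosted sketch: zero-shot video QA with LLaVA-NeXT-Video.
# The checkpoint, frame count, and prompt format are assumptions taken from
# the Hugging Face model card.
import cv2
import numpy as np
import torch
from transformers import (
    LlavaNextVideoForConditionalGeneration,
    LlavaNextVideoProcessor,
)

MODEL_ID = "llava-hf/LLaVA-NeXT-Video-7B-hf"


def read_frames(path: str, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample num_frames RGB frames as a (T, H, W, 3) array."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)


processor = LlavaNextVideoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(
    text=prompt, videos=read_frames("clip.mp4"), return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```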