Research on Large Language Models that can process video as native input, covering proprietary APIs and open-source alternatives.
## Quick Comparison
| Model | Video Input | API Access | Key Strengths |
|---|---|---|---|
| Gemini 2.5 Pro | ✅ Native | Google AI API | Multi-hour video at low media resolution, 1M token context |
| Gemini 2.0 Flash | ✅ Native + Live | Google AI API | Real-time streaming video, 1M context |
| GPT-4o | ⚠️ Frame extraction | OpenAI API | Vision model processes video frames as images |
| Claude 3.5/4.5 | ❌ No video | Anthropic API | Strong image/PDF understanding, no video |
| Video-LLaMA 3 | ✅ Native | Open Source | Audio-visual understanding, self-hostable |
| LLaVA-NeXT | ✅ Zero-shot | Open Source | Strong video transfer from image training |
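
Of the proprietary models above, only Gemini accepts a video file directly as model input. Below is a minimal sketch using the `google-genai` Python SDK; the file name and prompt are placeholders, and upload/polling details may vary by SDK version:

```python
# Minimal sketch: native video input via the Gemini API (google-genai SDK).
# Assumes GEMINI_API_KEY is set in the environment; "lecture.mp4" is a
# placeholder file path.
import time

from google import genai

client = genai.Client()

# Upload the video through the Files API; video files are processed
# asynchronously, so poll until the file is ready to use.
video = client.files.upload(file="lecture.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

# Pass the uploaded video directly as model input alongside a text prompt.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video, "Summarize this video and list its key moments."],
)
print(response.text)
```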
## Key Findings
- Google's Gemini models lead proprietary offerings in native video understanding (see the Gemini sketch above)
- GPT-4o requires a frame-extraction workaround rather than true native video input via the API (see the sketch below)
- Claude focuses on text, image, and document analysis; it offers no video support
- Open-source alternatives (Video-LLaMA 3, LLaVA-NeXT) provide self-hostable video understanding (see the transformers sketch after the GPT-4o example)
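
Because GPT-4o has no video endpoint, the usual workaround is to sample frames and send them to the vision API as images. A minimal sketch with OpenCV and the official `openai` SDK; the file name, sampling interval, and frame cap are illustrative assumptions:

```python
# Minimal sketch of the GPT-4o workaround: sample frames from a video with
# OpenCV, then send them as base64-encoded images in a single chat request.
# "clip.mp4", every_n, and max_frames are placeholder values.
import base64

import cv2
from openai import OpenAI


def sample_frames(path: str, every_n: int = 30, max_frames: int = 20) -> list[str]:
    """Return up to max_frames JPEG frames, base64-encoded, one per every_n frames."""
    frames: list[str] = []
    cap = cv2.VideoCapture(path)
    i = 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok, jpeg = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(jpeg.tobytes()).decode("utf-8"))
        i += 1
    cap.release()
    return frames


client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Build one user message containing the prompt plus every sampled frame.
content = [{"type": "text", "text": "Describe what happens across these video frames."}]
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    for b64 in sample_frames("clip.mp4")
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```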
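
For the self-hosted route, LLaVA-NeXT-Video can be run zero-shot through Hugging Face `transformers`. A minimal sketch, assuming the `llava-hf/LLaVA-NeXT-Video-7B-hf` checkpoint and its documented `USER: <video>` prompt format; verify both against the model card before relying on them:

```python
# Minimal self-hosted sketch: zero-shot video QA with LLaVA-NeXT-Video.
# The checkpoint, frame count, and prompt format are assumptions taken from
# the Hugging Face model card.
import cv2
import numpy as np
import torch
from transformers import (
    LlavaNextVideoForConditionalGeneration,
    LlavaNextVideoProcessor,
)

MODEL_ID = "llava-hf/LLaVA-NeXT-Video-7B-hf"


def read_frames(path: str, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample num_frames RGB frames as a (T, H, W, 3) array."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)


processor = LlavaNextVideoProcessor.from_pretrained(MODEL_ID)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(
    text=prompt, videos=read_frames("clip.mp4"), return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```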