Multimodal AI in 2026: What Models Can Actually See and Do

Multimodal AI — models that process both text and images (and increasingly audio and video) — has moved from research novelty to production capability. Here is what is actually possible in 2026 and what the limits are.

What Current Models Can See

The frontier models as of 2026 (Claude 3.5+, GPT-4o, Gemini Pro) can process images with substantial capability. What they are genuinely good at: describing what is in an image in accurate detail; reading text in images (OCR — including handwriting, though accuracy varies); understanding charts, graphs, and diagrams; identifying objects, people (in general terms — recognising specific individuals is both limited and restricted by policy), and scenes; answering questions about image content; and understanding spatial relationships (“what is to the left of the red car?”). Document understanding: models can process photos of documents (contracts, forms, receipts, handwritten notes) and extract information from them. This is one of the most practically useful multimodal applications — feeding a photo of an expense receipt to a model and extracting the date, amount, and merchant accurately is a production use case at scale. Visual reasoning: models can answer questions that require combining visual information with world knowledge — “is this mushroom likely to be edible?” or “what’s wrong with this circuit diagram?” These work reasonably well for common cases but should not be trusted for high-stakes decisions without expert verification.

What Is New in 2026: Video and Real-Time

Video understanding: Gemini 1.5+ and GPT-4o support video input, allowing analysis of video content. Practical applications: summarising video content, identifying key moments in footage, answering questions about what happened in a video. Current limitation: video processing is slower and more expensive than image processing; very long videos (over 30 minutes) hit context length and cost limits. Real-time vision: computer use tools (Anthropic’s computer use, OpenAI Operator) let AI models see and interact with computer screens in real time. This represents a qualitative shift — from analysing a static image to actively observing and responding to a live visual stream. These tools are early-stage in 2026 but represent the foundation of what agentic AI systems that operate in visual environments will look like. Audio: GPT-4o supports audio input and output natively (not just speech-to-text piped to a text model); Gemini Ultra has native audio understanding. This enables real-time conversation with natural-sounding voice without noticeable latency — qualitatively different from previous voice interfaces.

What Multimodal Models Cannot Do Well

Precise spatial measurements: “how many centimetres is this object?” — models are very bad at precise spatial measurement from images. Reading very small text: text below roughly 12px in the original image is often unreadable or misread. Distinguishing visually similar objects: two similar bird species, similar medication pills, similar circuit components — accuracy drops significantly when visual discrimination requires expert knowledge. Counting: counting large numbers of objects (>20) in an image is unreliable. Detecting subtle image manipulation (deepfakes): current models are not reliable detectors of sophisticated synthetic media. Understanding three-dimensional structure from 2D images: spatial reasoning about 3D structure is limited. The practical implication: multimodal capabilities are very useful for common visual tasks but require human review for any application where precision is required and errors are costly.

上一篇 波尔图:为什么对大多数旅行者来说它比里斯本更好
下一篇 2026年的多模态AI:模型实际上能看到和做什么