Suppose we directly model p(text, pixels, sound) with one big autoregressive transformer.
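One way to picture this is a minimal sketch, assuming every modality has already been turned into discrete tokens (that tokenization step is an assumption here, not something the text specifies). The tokens are mapped into one shared vocabulary and the joint probability is factored by the chain rule, p(x_1..x_n) = prod_i p(x_i | x_<i). The vocabulary sizes, the helper names, and the uniform stand-in for the transformer are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical per-modality vocabulary sizes (illustrative numbers only).
VOCAB_SIZES: Dict[str, int] = {"text": 1000, "pixels": 512, "sound": 256}


def vocab_offsets(sizes: Dict[str, int]) -> Dict[str, int]:
    """Give each modality a disjoint id range inside one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets


def to_shared_sequence(
    segments: Sequence[Tuple[str, Sequence[int]]], sizes: Dict[str, int]
) -> List[int]:
    """Flatten (modality, tokens) segments into one token sequence."""
    offsets = vocab_offsets(sizes)
    seq: List[int] = []
    for modality, tokens in segments:
        for t in tokens:
            if not 0 <= t < sizes[modality]:
                raise ValueError(f"token {t} out of range for {modality}")
            seq.append(offsets[modality] + t)
    return seq


def joint_log_prob(
    seq: Sequence[int], next_token_probs: Callable[[Sequence[int]], Sequence[float]]
) -> float:
    """Chain rule: log p(seq) = sum_i log p(seq[i] | seq[:i])."""
    return sum(math.log(next_token_probs(seq[:i])[tok]) for i, tok in enumerate(seq))


def uniform_model(prefix: Sequence[int]) -> List[float]:
    """Stand-in for the transformer: ignores the prefix, predicts uniformly."""
    total = sum(VOCAB_SIZES.values())
    return [1.0 / total] * total


segments = [("text", [5, 7]), ("pixels", [0, 3]), ("sound", [10])]
seq = to_shared_sequence(segments, VOCAB_SIZES)
log_p = joint_log_prob(seq, uniform_model)
```

In a real system `uniform_model` would be replaced by a trained transformer's next-token distribution; the point of the sketch is only that a single model over one interleaved sequence defines the joint distribution over all three modalities.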
OpenAI's GPT-4o sets a new benchmark with real-time multimodal capabilities, matching GPT-4 in text tasks and excelling in audio response with low latency.