OpenAI announced what it says is a vastly superior large language model capable of interacting at human-like speeds using text, voice, and visual prompts. But at least one analyst said the company is just playing catch-up with competitors at this point.

After weeks of speculation, ChatGPT creator OpenAI announced a new desktop version of ChatGPT, a user interface upgrade, and a new flagship model called GPT-4o that allows consumers to interact using text, voice, and visual prompts. GPT-4o can recognize and respond to screenshots, photos, documents, or charts uploaded to it. The new model can also recognize facial expressions and information written by hand on paper.

OpenAI said the improved model and accompanying chatbot can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, “which is similar to human response time in a conversation.” Previous versions of GPT also had a conversational Voice Mode, but with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4).

GPT-4o now matches the performance of GPT-4 Turbo (released in November) on English text and code, with significant improvement on text in non-English languages, while also being faster and 50% cheaper in the API, according to OpenAI Chief Technology Officer Mira Murati. “GPT-4o is especially better at vision and audio understanding compared to existing models,” OpenAI said in its announcement.

During an on-stage event, Murati said GPT-4o will also have new memory capabilities, giving it the ability to learn from previous conversations with users and factor that into its answers.

Chirag Dekate, a Gartner vice president analyst, said that while he was impressed with OpenAI’s multimodal large language model (LLM), the company was clearly playing catch-up to competitors, in contrast to its earlier status as an industry leader in generative AI tech.

“You’re now starting to see GPT enter into the multimodal era,” Dekate said. “But they’re playing catch-up to where Google was three months ago when it announced Gemini 1.5, which is its native multimodal model with a one-million-token context window.”

Still, the capabilities demonstrated by GPT-4o and its accompanying ChatGPT chatbot are impressive for a natural language processing engine. The model displayed improved conversational capability, letting users interrupt it and begin new or modified queries, and it is versed in 50 languages.

In one onstage live demonstration, the new Voice Mode translated back and forth between Murati speaking Italian and Barret Zoph, OpenAI’s head of post-training, speaking English. In another live demonstration, Zoph wrote out an algebraic equation on paper while ChatGPT watched through his phone’s camera lens, then asked the chatbot to talk him through the solution.

While the voice recognition and conversational interactions were extremely human-like, there were also noticeable glitches: the interactive bot cut out during conversations and picked things back up moments later.

The chatbot was then asked to tell a bedtime story. The presenters were able to interrupt it, have it add more emotion to its voice intonation, and even switch to a computer-like rendition of the story.

In another demo, Zoph brought up software code on his laptop screen and used GPT-4o’s voice interface to have it evaluate the code and determine what the application, a weather charting app, did. GPT-4o was then able to read the app’s chart and identify data points on it related to high and low temperatures.
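The chart-reading demo maps onto what OpenAI exposes through its chat completions API, which accepts mixed text and image input. As a minimal sketch only, using the official OpenAI Python SDK (v1.x), with a placeholder image URL and prompt:

```python
# Hedged sketch: asking GPT-4o about a chart image via the OpenAI
# Python SDK (v1.x). The image URL and prompt are placeholders, not
# the ones used in OpenAI's demo.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What are the high and low temperatures shown on this chart?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/weather-chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```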
From left to right, OpenAI CTO Mira Murati, head of Frontiers Research Mark Chen, and head of post-training Barret Zoph demonstrate GPT-4o’s ability to interpret a graphic’s data during an onstage event. Credit: OpenAI

Murati said GPT-4o’s text and image capabilities will be rolled out iteratively, with extended “red team” access starting today. Paying ChatGPT Plus users will have up to five times higher message limits. A new version of Voice Mode with GPT-4o will arrive in alpha in the coming weeks, Murati said.

Developers can also now access GPT-4o in the API as a text and vision model. The new model is twice as fast, half the price, and has five times higher rate limits compared to GPT-4 Turbo, Murati said. “We plan to launch support for GPT-4o’s new audio and video capabilities to a small group of trusted partners in the API in the coming weeks,” she said.

Zoph demonstrates, using his smartphone’s camera, how GPT-4o can read math equations written on paper and assist a user in solving them. Credit: OpenAI

What was not clear in OpenAI’s GPT-4o announcement, Dekate said, was the size of the model’s context window, which for GPT-4 is 128,000 tokens. “Context size helps define the accuracy of the model. The larger the context size, the more data you can input and the better outputs you get,” he said.

Google’s Gemini 1.5, for example, offers a one-million-token context window, the longest of any large-scale foundation model to date. Next in line is Anthropic’s Claude 2.1, which offers a context window of up to 200,000 tokens. Google’s larger context window means an application’s entire code base can be fed to the genAI model for updates or upgrades; GPT-4 could accept only about 1,200 lines of code, Dekate said. An OpenAI spokesperson said GPT-4o’s context window size remains at 128,000 tokens.

Mistral also announced its LLaVA-NeXT multimodal model earlier this month. And Google is expected to make further Gemini 1.5 announcements at its Google I/O event tomorrow.

“I would argue in some sense that OpenAI is now playing catch-up to Meta, Google, and Mistral,” Dekate said.

Nathaniel Whittemore, CEO of AI training platform Superintelligent, called OpenAI’s announcement “the most divisive” he’d ever seen. “Some feel like they’ve glimpsed the future; the vision from Her brought to real life. Others are left saying, ‘That’s it?’” he said in an email reply. “Part of this is about what this wasn’t: it wasn’t an announcement about GPT-4.5 or GPT-5. There is so much attention on the state-of-the-art horserace that for some, anything less than that was going to be a disappointment no matter what.”

Murati said OpenAI recognizes that GPT-4o will also present new opportunities for misuse of its real-time audio and visual recognition. She said the company will continue to work with various entities, including the government, the media, and the entertainment industry, to try to address the security issues.

The previous version of ChatGPT also had a Voice Mode, but it relied on a pipeline of three separate models: one transcribes audio to text, a second takes in text and outputs text, and a third converts that text back to audio. That pipeline, Murati explained, loses a lot of information along the way: the main model can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or expressions of emotion.
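That cascade corresponds roughly to chaining three separate API calls. OpenAI has not published the old Voice Mode’s internals, so this is only an approximation built from its public transcription, chat, and text-to-speech endpoints; the file names are illustrative:

```python
# Rough sketch of the cascaded Voice Mode pipeline Murati described:
# speech -> text -> text -> speech, using three separate models.
# Each hop adds latency and strips paralinguistic detail such as
# tone, multiple speakers, and background sound.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the user's audio to text (Whisper).
with open("user_question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2. Generate a text reply from the transcript alone.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Synthesize the reply back into audio (text-to-speech).
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.write_to_file("assistant_reply.mp3")
```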
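As for the context-window numbers Dekate cited, developers can estimate whether an input fits a model’s window with OpenAI’s open-source tiktoken tokenizer. A minimal sketch, assuming the cl100k_base encoding used by GPT-4-era models and a hypothetical source file:

```python
# Minimal sketch: count tokens in a source file to check whether it
# fits a 128,000-token context window. The file name is illustrative.
import tiktoken

CONTEXT_WINDOW = 128_000  # tokens; GPT-4o's stated limit

enc = tiktoken.get_encoding("cl100k_base")
with open("app.py") as f:
    source = f.read()

n_tokens = len(enc.encode(source))
print(f"{n_tokens} tokens; fits in window: {n_tokens <= CONTEXT_WINDOW}")
```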
GPT-4o, by contrast, replaces that three-model cascade with a single end-to-end model spanning text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network for more of a real-time experience.

“Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations,” Murati said. “Over the next few weeks, we will continue iterative deployments to bring them to you.”