OpenAI is not the first to offer generative AI technology that can transform a text prompt into realistic video, but its tool appears to be among the most advanced to date.

OpenAI last week unveiled a new capability for its generative AI (genAI) platform that can use a text input to generate video, complete with lifelike actors and other moving parts. The new genAI model, called Sora, has a text-to-video function that can create complex, realistic moving scenes with multiple characters, specific types of motion, and accurate details of the subject and background "while maintaining visual quality and adherence to the user's prompt." Sora understands not only what a user asks for in the prompt, but also how those things exist in the physical world.

The technology essentially translates written descriptions into video content, leveraging AI models that understand textual input and generate the corresponding visual and auditory elements, according to Bernard Marr, a technology futurist and business and technology consultant. "This process involves deep learning algorithms capable of interpreting text and synthesizing videos that reflect the described scenes, actions, and dialogues," Marr said.

Though text-to-video is not a new capability for AI engines offered by other providers, such as Google's Gemini, Sora's impact is expected to be profound, according to Marr.

[Image: Google's Lumiere allows off-the-shelf, text-based image-editing methods to be used for video editing. Credit: Google]

Like any advanced genAI technology, he said, Sora will help reshape content creation, enhancing storytelling and democratizing video production. "Text-to-video capabilities hold immense potential across diverse fields such as education, where they can create immersive learning materials; marketing, for generating engaging content; and entertainment, for rapid prototyping and storytelling," Marr said.

However, Marr warned, the ability of AI models to translate textual descriptions into full-fledged videos also underscores the need for rigorous ethical considerations and safeguards against misuse. "The emergence of text-to-video technology introduces complex issues regarding copyright infringement, particularly as it becomes capable of generating content that might closely mirror copyrighted works," Marr said. "The legal landscape in this area is currently being navigated through several ongoing lawsuits, making it premature to definitively state how copyright concerns will be resolved."

Potentially more concerning is the technology's ability to produce highly convincing deepfakes, which raises serious ethical and privacy issues and underscores the need for close scrutiny and regulation, Marr said.

Dan Faggella, founder and lead researcher of Emerj Artificial Intelligence, gave a presentation about deepfakes at the United Nations five years ago. At the time, he emphasized that regardless of warnings about deepfakes, "people will want to believe what they want to believe."

There is, however, a bigger consideration: soon, people will be able to live in genAI worlds, strapping on a headset and telling an AI model to create a unique world that satisfies their emotional needs, be it relaxation, humor, or action, all built programmatically for that user. "And what the machine is going to be able to do is conjure visual and audio and eventually haptic experiences for me that are trained on the [previous experiences] wearing the headset," Faggella said.
"We need to think about this from a policy standpoint: how much of that escapism do we permit?" he asked.

Text-to-video models can also power applications that conjure AI experiences to help people be productive, educate them, and keep them focused on their most important work. "Maybe train them to be a great salesperson, maybe help them write great code, and do a lot more coding than they can do right now," he said.

Both OpenAI's Sora and Google's Gemini 1.5 multimodal AI model are, for now, internal research projects being offered only to a select group of third-party academics and others testing the technology. Unlike OpenAI's popular ChatGPT, Google said, Gemini 1.5 lets users feed a much larger amount of information into its query engine to get more accurate responses. Even though Sora and Gemini 1.5 are currently internal research projects, both companies have showcased real examples and detailed information, including videos, photos, GIFs, and related research papers.

Along with Google's Gemini multimodal AI engine, Sora was predated by several text-to-video models, including Meta's Emu, Runway's Gen-2, and Stability AI's Stable Video Diffusion.

[Image: The denoising process used by Stable Diffusion. The model generates images by iteratively removing random noise until a configured number of steps has been reached, guided by a CLIP text encoder pretrained on visual concepts and an attention mechanism, producing an image that depicts the requested concept. Credit: Stable Diffusion/Wikipedia]

Google has two concurrent research projects, Lumiere and VideoPoet, advancing what a spokesperson called the "state-of-the-art in video generation models." Released earlier this month, Lumiere is Google's more advanced video generation technology; it generates 80 frames per clip, compared to the 25 frames produced by competitors such as Stable Video Diffusion.

"Gemini, designed to process information and automate tasks, offers a seamless integration of modalities from the outset, potentially making it more intuitive for users who seek a straightforward, task-oriented experience," Marr said. "On the other hand, GPT-4's layering approach allows for a more granular enhancement of capabilities over time, providing flexibility and depth in conversational abilities and content generation."

In a head-to-head comparison, Sora appears more powerful than Google's video generation models. While Google's Lumiere can produce video at 512×512-pixel resolution, Sora claims to reach resolutions of up to 1920×1080 pixels, or full-HD quality. Lumiere's videos are limited to about five seconds in length; Sora's can run up to one minute. Additionally, Lumiere cannot make videos composed of multiple shots, while Sora can. Sora, like other models, is also reportedly capable of video-editing tasks such as creating videos from images or other videos, combining elements from different videos, and extending videos in time.

"In the competition between OpenAI's Sora and startups like Runway AI, maturity may offer advantages in terms of reliability and scalability," Marr said. "While startups often bring innovative approaches and agility, OpenAI, with large funding from companies like Microsoft, will be able to catch up and potentially overtake quickly."
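To make the denoising process described in the Stable Diffusion caption above more concrete, here is a minimal sketch of text-guided diffusion using the open-source Hugging Face diffusers library. The checkpoint name, prompt, and parameter values are illustrative assumptions; this shows the publicly documented Stable Diffusion workflow, not the internals of Sora or Lumiere.

```python
# A minimal sketch of text-guided diffusion, assuming the open-source
# Hugging Face "diffusers" library. Checkpoint, prompt, and settings
# are illustrative; this is not Sora's or Lumiere's implementation.
import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint (bundles the CLIP text
# encoder, the U-Net denoiser, and the VAE image decoder).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The prompt is encoded by the CLIP text encoder; the U-Net then
# iteratively removes noise from a random latent, steered toward the
# prompt via cross-attention, until the step budget is exhausted.
image = pipe(
    prompt="an astronaut riding a horse on the moon",
    num_inference_steps=50,  # the "configured number of steps"
    guidance_scale=7.5,      # how strongly the text guides denoising
).images[0]

image.save("astronaut.png")
```

Text-to-video diffusion models such as Stable Video Diffusion extend the same loop by denoising a stack of frames (or a spatiotemporal latent) rather than a single image, which is what allows a written prompt to become a short moving scene.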