Google recently held its annual developer conference, Google I/O, where it announced "Gemini Omni," a new multimodal generative AI. Google has long championed AGI (Artificial General Intelligence) and views multimodal AI as an essential requirement for achieving it. In this article, we will use "Gemini Omni" to examine just how much closer we have come to AGI.
1. What Kind of AI is "Gemini Omni"?
First, let's look at the explanation released by Google (1).
"We’re introducing Gemini Omni, where Gemini’s ability to reason meets the ability to create. Omni is our new model that can create anything from any input – starting with video. With Omni, you can combine images, audio, video and text as input and generate high-quality videos grounded in Gemini's real-world knowledge. You can also easily edit your videos through conversation. Gemini Omni Flash is a model that can create anything from any input – starting with video."
In short, it can be described as "a generative AI that can take any form of information as input and produce output in any format." It appears that "Omni" understands 3D spatial information, visual elements, and physical laws (such as objects falling downward), all of which are difficult to grasp through text alone. If so, this is a major step forward toward AGI.
The Omni Flash model that debuted this time is limited to video output. However, in line with the "any-to-any" concept, the next version is expected to support output across all formats. It is something to look forward to.
2. The Task: Singing to a Given Theme
So, how capable is Omni Flash in practice? Can it successfully integrate various forms of information? Can it keep its output consistent? To test this, we take the image below, add a prompt, and see whether the model can make her sing with emotion on a specific theme. This is Leia, an instructor at ToshiStats Co., Ltd. She is a familiar face on YouTube, but this time she is taking part in our experiment.
Leia, Instructor at ToshiStats Co., Ltd.
For this experiment, we prepared the following prompt:
"She is singing 'Kita-wing' in English. It is 80s Japanese pop. This must be 1. An urban and bittersweet melody, 2. about emotion of an independent, mature woman for love, 3. provide courage for action, 4. A movie-like scenery born from a 'midnight flight', 5. A deep, plaintive, and vibrating long vibration. 6, This scene is needed 'An airplane gliding through the midnight sky above the glowing metropolis.'."
We entered this prompt along with the image above. We believe this makes the singing theme reasonably clear. In particular, we want to focus on how well it can express emotional nuances, such as item 1: "An urban and bittersweet melody."
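For readers who want to vary the theme, the prompt above can be thought of as a subject sentence followed by a numbered list of requirements. The snippet below is only a minimal sketch of that idea in Python. The names `build_prompt`, `SUBJECT`, and `REQUIREMENTS` are hypothetical and were not part of our experiment, and the string it produces is a lightly normalized version of our prompt, not the exact text we entered. It only assembles the text and does not call any Google API.

```python
def build_prompt(subject: str, requirements: list[str]) -> str:
    """Join a subject sentence and numbered requirements into one prompt string."""
    # Number each requirement starting from 1, e.g. "1. An urban and bittersweet melody".
    numbered = ", ".join(
        f"{i}. {req}" for i, req in enumerate(requirements, start=1)
    )
    return f"{subject} This must be {numbered}."


# Hypothetical constants that mirror the six requirements in the prompt above.
SUBJECT = "She is singing 'Kita-wing' in English. It is 80s Japanese pop."
REQUIREMENTS = [
    "An urban and bittersweet melody",
    "about emotion of an independent, mature woman for love",
    "provide courage for action",
    "A movie-like scenery born from a 'midnight flight'",
    "A deep, plaintive, and vibrating long vibration",
    "This scene is needed 'An airplane gliding through the midnight sky "
    "above the glowing metropolis.'",
]

prompt = build_prompt(SUBJECT, REQUIREMENTS)
print(prompt)
```

Keeping the requirements in a list makes it easy to swap one item, for example the melody description in item 1, and compare how the generated video changes.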
While you can listen to Leia’s actual singing later on YouTube, let's walk through the analysis first. Although the original Leia had a bright smile, the singing Leia looks somewhat sorrowful.
When it transitions to a close-up, those emotions become very clear.
The prompt asked for a "midnight flight" scene, and it has indeed been inserted effectively. In the actual video, the airplane moves slowly.
Her physical expressions and body language look natural as she conveys emotion. It is impressive.
Actually, the video ended right at the climax. Ah, what a pity. I wanted to hear more. The maximum generation time for the current Omni Flash is 10 seconds, so it cannot be helped. Let's look forward to an extended generation time in the next version update.
3. The Roadmap to AGI
In this test, Omni Flash handled quite difficult emotional expressions and kept them consistent throughout the video. It understood the meaning and context when creating the video: it kept her original clothing unchanged and swapped out only the background to match the theme. Its adherence to the prompt was also excellent. While the short generation time remains a bottleneck, the content itself deserves high praise.
It is highly probable that Google will use Omni Flash as a starting point to accelerate its development toward AGI. With the AI industry currently facing a shortage of GPUs, Google has become one of the few companies actively speaking out about AGI. Ultimately, being able to develop and produce its own computing resources, such as TPUs, gives it an overwhelming advantage. Demis Hassabis, CEO of Google DeepMind, who is leading the development of Omni Flash, has stated that AGI is "just a few years away" (2).
What did you think? Through this experiment, we confirmed the latent potential of the new multimodal generative AI "Omni" and discussed its possibilities for achieving AGI. Here at ToshiStats, we will continue to explore various ideas under the theme of "Road to AGI." Stay tuned!
You can enjoy our video news “ToshiStats AI Weekly Review” from this link, too!
1) Introducing Gemini Omni, Google
2) A new era of discovery: AI and the frontiers of science with Demis Hassabis, May 22, 2026, Google for Developers
Copyright © 2026 ToshiStats Co., Ltd. All rights reserved.
Notice: This is for educational purposes only. ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the report, the codes and the software.
