The rapid progress of generative AI is remarkable. Recently, multimodal capabilities, which handle images and videos in addition to natural language, have improved dramatically. The ability to work out where things are in an image, and how they relate to one another, is sometimes referred to as AI's "spatial understanding." Let's run a few quick experiments on what information generative AI can extract from images, to check the performance of the current Gemini 2.5 Flash model.

1. Google AI Studio

Once again, I'll be using the familiar generative AI development platform, Google AI Studio (1). I've prepared a no-code app for spatial understanding that displays the number of identified objects and their coordinates. For example, when asked for "hands," it accurately identifies two hands.
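To make "the number of identified objects and their coordinates" concrete, here is a minimal sketch of how such output could be turned into pixel positions. It assumes the model returns a JSON list where each item has a `label` and a `box_2d` in `[ymin, xmin, ymax, xmax]` order, normalized to a 0 to 1000 range. The post does not show the app's actual output format, so treat that layout, the function name `to_pixel_boxes`, and the sample values as assumptions for illustration only.

```python
import json


def to_pixel_boxes(response_text, image_width, image_height, scale=1000):
    """Convert normalized boxes into pixel coordinates.

    Assumes each detection looks like
    {"label": str, "box_2d": [ymin, xmin, ymax, xmax]} with values in 0..scale.
    """
    detections = json.loads(response_text)
    boxes = []
    for det in detections:
        ymin, xmin, ymax, xmax = det["box_2d"]
        boxes.append({
            "label": det["label"],
            "left": round(xmin / scale * image_width),
            "top": round(ymin / scale * image_height),
            "right": round(xmax / scale * image_width),
            "bottom": round(ymax / scale * image_height),
        })
    return boxes


# Made-up sample response for a "hands" query (not real model output).
sample = (
    '[{"label": "hand", "box_2d": [100, 200, 500, 450]},'
    ' {"label": "hand", "box_2d": [300, 600, 800, 900]}]'
)

boxes = to_pixel_boxes(sample, image_width=800, image_height=600)
print(len(boxes), "objects found")
for box in boxes:
    print(box)
```

Counting the items in the list gives the number of objects, and the scaled values are what a viewer would need to draw rectangles over the original image.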

2. Generative AI Understands the Meaning of Words and Can Identify Objects

So, what about a task that requires understanding the positional relationship between a flower and a hand, such as "a hand holding a flower"? It identifies the correct hand.

Conversely, what about "a hand not holding a flower"? It gets this right as well. This is impressive; it handled the negation with no problem.

Next, can it identify an object based solely on its positional relationship to another object? Let's ask it to identify "what's on the hamburg" (that is, the Hamburg steak). It easily answered "fried egg." Gemini has been touted for its high-performance image processing since its debut in December 2023, but I'm honestly surprised it can do this much.

3. Can It Identify Station Names from a Sign?

Let's try a slightly more difficult task. This is a section of a subway station sign in Kuala Lumpur, the capital of Malaysia. Let's see if it can identify the three stations between Ampang Park and Chan Sow Lin from this image of the sign.

It accurately identified all three stations. This task requires it not only to read the text in the image correctly but also to understand the order in which the stations are laid out. It accomplished this without any difficulty. I have nothing more to say; it's amazing!
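The reasoning involved here can be sketched in a few lines: once the station names have been read off the sign in order, finding the stations between two endpoints is a matter of slicing that ordered list. This is only an illustration of the positional step, not how Gemini works internally. The function name `stations_between` is hypothetical, and "Station A" to "Station C" are placeholders, since the post does not list the actual station names.

```python
def stations_between(line, start, end):
    """Return the stations strictly between start and end, in line order."""
    i, j = line.index(start), line.index(end)
    lo, hi = sorted((i, j))
    return line[lo + 1:hi]


# Placeholder names stand in for the stations read from the sign.
line = ["Ampang Park", "Station A", "Station B", "Station C", "Chan Sow Lin"]

between = stations_between(line, "Ampang Park", "Chan Sow Lin")
print(between)
```

The hard part for the model is everything before this step: reading the text correctly and recovering the order of the stations from the layout of the sign.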

What do you think? I'm sure many of you are surprised by this high level of spatial understanding. Generative AI is still in its early stages, so its performance will continue to improve, and its practical applications will expand accordingly. It's something to look forward to. Also, I created this AI app on Google AI Studio without writing any code. Google AI Studio is very user-friendly and capable, and I encourage you all to try it. ToshiStats will keep taking on the challenge of building various AI apps. Please stay tuned!

1) Google AI Studio

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.