Toshifumi Kuga

March 14, 2026

Many-Shot, In-Context Learning, AI agent, context window

Many-Shot In-Context Learning: The Game Changer of the Long-Context AI Era

Recently, OpenAI released its newest AI model, GPT-5.4 (1). While much of the praise has focused on its overall performance, I want to highlight the length of its context window, which is the amount of information a generative AI can process in a single request. GPT-5.4 now supports 1M (one million) tokens. With its rival Opus 4.6 also at 1M and Google Gemini having reached 1M two years ago, the frontier models from all of the "Big Three" now offer 1M-token context windows. It is fair to say that AI has entered the Long-Context Era.
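To get a feel for what a 1M-token window means in practice, here is a rough sketch that estimates whether a batch of texts fits. The 4-characters-per-token ratio is only a common rule of thumb for English, and the helper names and the `reserve` value are my own assumptions; a real tokenizer will give different counts.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; real tokenizers will differ."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(texts, context_window=1_000_000, reserve=20_000):
    """Check whether all texts fit, leaving `reserve` tokens for the
    instructions, the query, and the model's answer (assumed value)."""
    total = sum(estimate_tokens(t) for t in texts)
    return total, total <= context_window - reserve


# Toy data: 1,000 synthetic texts of 1,200 characters each.
toy_texts = ["x" * 1200] * 1000
total, ok = fits_in_context(toy_texts)
print(total, ok)
```

The point is only the order of magnitude: a thousand paragraph-sized texts sit comfortably inside a 1M-token window.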

How will this impact the development of AI agents? Let’s explore.

1. What is Many-Shot In-Context Learning?

When you ask ChatGPT, "What is the capital of Japan?" and it replies, "Tokyo," that question or instruction is called a prompt. However, you can input much more than just a short prompt.

For example, if you provide examples first (such as "Where was the World Expo held in Japan?" followed by "Osaka") and then ask your actual question, the accuracy is known to improve. This technique is called In-Context Learning. When the number of examples grows beyond roughly 10 and you supply a large volume of data, it is referred to as Many-Shot In-Context Learning. Here is a brief summary.
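In code, the idea can be sketched as a prompt builder that places labeled examples before the real question. The function name `build_prompt` and the `Q:`/`A:` layout are illustrative assumptions, not a format required by any particular model.

```python
def build_prompt(examples, question, instruction=""):
    """Assemble an in-context learning prompt.

    examples: list of (question, answer) pairs shown before the real question.
    With no examples this is a zero-shot prompt; with hundreds or
    thousands of pairs it becomes a many-shot prompt.
    """
    parts = []
    if instruction:
        parts.append(instruction)
    for q, a in examples:
        parts.append(f"Q: {q}\nA: {a}")
    # The real question comes last, with the answer left open for the model.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


examples = [("Where was the World Expo held in Japan?", "Osaka")]
prompt = build_prompt(examples, "What is the capital of Japan?")
print(prompt)
```

Only the length of `examples` changes between the few-shot and many-shot cases; the structure of the prompt stays the same.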

2. Challenging a 20-Class Classification Task Using Bank Complaint Data

To measure the effectiveness of Many-Shot In-Context Learning, I decided to tackle a difficult 20-class classification task using bank complaint data (2). This dataset contains an "issue" column describing why a complaint occurred. The goal is to read the "text" column and select the correct cause from 20 possible categories. For this, I used Gemini 3.1 Flash-Lite (3).

Rather than using a simple prompt like "Please classify this," I asked the AI itself to "create the optimal prompt," resulting in a highly detailed set of instructions—what you might call a "Prompt Powered by AI."

I first tried Zero-shot (providing no examples) with this enhanced prompt. Unfortunately, the accuracy was only 46%. Since the model is wrong more than half the time, it isn't yet viable for practical business use.
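For reference, accuracy here simply means the share of complaints whose predicted category matches the true one. A minimal sketch, with toy labels that are not taken from the dataset:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example with made-up category names.
toy_labels = ["fees", "fraud", "fees", "closing"]
toy_preds = ["fees", "fraud", "fraud", "fees"]
print(accuracy(toy_preds, toy_labels))
```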

3. Executing Many-Shot In-Context Learning with 1,000 Samples

Next, I implemented Many-Shot In-Context Learning by providing 1,000 examples alongside the prompt. While the underlying process remains the same as the Zero-shot approach, the volume of information is massive. The following are the first five examples.

The results were dramatic: accuracy jumped to 70%. This clearly demonstrates the sheer power of the "Many-Shot" approach.

However, with a 30% error rate, there is still room for improvement. I had an AI Agent analyze why the errors occurred and generate a report. The insights gained from this analysis are highly valuable for further refinement.
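Before handing errors to an agent, a simple first pass is to count which categories get confused with which. The sketch below is not the agent-based analysis described above; it is just a plain tally, and the category names are made up for illustration.

```python
from collections import Counter


def top_confusions(predictions, labels, n=3):
    """Return the n most frequent (true, predicted) pairs among errors."""
    errors = Counter(
        (y, p) for p, y in zip(predictions, labels) if p != y
    )
    return errors.most_common(n)


# Toy data with hypothetical categories.
toy_labels = ["fees", "fees", "fraud", "fraud", "fees"]
toy_preds = ["fees", "fraud", "fraud", "fees", "fraud"]
print(top_confusions(toy_preds, toy_labels))
```

Pairs that dominate this list are natural candidates for extra examples or sharper category definitions in the prompt.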

Conclusion

There are several ways to improve the accuracy of generative AI, but as 1M-token context windows become the standard, Many-Shot In-Context Learning is set to become a major focal point. At ToshiStats, we plan to continue evolving this methodology.

Stay tuned!

You can enjoy our video news ToshiStats AI Weekly Review from this link, too!

1) Introducing GPT‑5.4, OpenAI, March 5, 2026
2) Consumer Complaint Database
3) Gemini 3.1 Flash-Lite: Built for intelligence at scale, Google, Mar 03, 2026

Notice: This is for educational purposes only. ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the report, the codes and the software.

Toshifumi Kuga

March 8, 2026

Gemini 3.1 Flash-Lite, AI agent, Opus4.6, Banking Complaint

Which AI Model Should You Use Daily? Why Gemini 3.1 Flash-Lite is the Top Choice!

I’ve been using Opus 4.6 for coding lately, but I've realized that the costs can really add up when running it via API. This led me to think that for tasks where absolute peak precision isn't the only priority, a more budget-friendly model would be a better fit. Right on cue, Google announced gemini-3.1-flash-lite-preview, a model built for speed and affordability (1). I decided to put it to the test immediately.

1. The Perfect Balance of Speed, Cost, and Performance

The Flash-Lite series is the most affordable tier in the Gemini lineup. It’s likely the engine behind many of Google’s own internal services. Speed, in particular, seems to be its standout feature.

When compared to its rivals, the processing speed is remarkably fast. Its cost-efficiency is equally impressive: at $0.25 per 1 million input tokens, it is poised to be a powerhouse for tasks involving massive amounts of data. For a startup like ours, this is incredibly encouraging.
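To make the price concrete, here is the arithmetic as a small helper. It covers input tokens only, at the $0.25 per 1M figure quoted above; output tokens are billed separately and are not modeled here. The function name is my own.

```python
def input_cost_usd(num_tokens, price_per_million=0.25):
    """Input-token cost in USD at a flat per-million price."""
    return num_tokens / 1_000_000 * price_per_million


print(input_cost_usd(1_000_000))    # one full 1M-token prompt
print(input_cost_usd(100_000_000))  # one hundred such prompts
```

At this rate, a hundred requests that each fill a 1M-token window come to $25 of input cost.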

Affordability hasn't come at the expense of performance, however. As shown in the Leaderboard (2), it boasts a score exceeding 1430. Given that the top-tier frontier models are currently competing around the 1500 mark, a score of 1430 for a lightweight model is truly outstanding.

2. Performance Evaluation: Banking Complaint Classification

To see what it can really do, I tested the model on a banking complaint classification task. Using this dataset (3), I provided the model with customer complaints from the "text" column and asked it to select the most relevant category from six financial products listed in the "Product" column. I ran this test on 100 samples to see how accurately it could categorize each complaint.
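One practical detail in a task like this is turning the model's free-text reply into exactly one of the allowed categories before scoring. A minimal sketch, assuming hypothetical label names (the actual six products are not listed here) and a helper name of my own:

```python
def normalize_label(raw_output, allowed_labels):
    """Map a model's free-text reply onto one allowed label, or None."""
    cleaned = raw_output.strip().strip('."\'').lower()
    for label in allowed_labels:
        if cleaned == label.lower():
            return label
    # Fall back to a substring match only if exactly one label matches.
    hits = [label for label in allowed_labels if label.lower() in cleaned]
    return hits[0] if len(hits) == 1 else None


# Hypothetical labels for illustration only.
allowed = ["Mortgage", "Credit card", "Student loan"]
print(normalize_label(" credit card. ", allowed))
print(normalize_label("The category is Mortgage", allowed))
print(normalize_label("Mortgage or Student loan", allowed))
```

Replies that cannot be mapped unambiguously return `None`, so they can be counted as errors rather than silently guessed.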

Here is the detailed prompt I used.

The results were fantastic, achieving a 92% accuracy rate. The entire process finished in about 60 seconds, demonstrating its high-speed processing capabilities. I’ve attempted this specific task several times in the past, but this is the first time a model has exceeded 90% accuracy without any fine-tuning. Truly impressive!

3. A High-Speed Model You Can Use Without Budget Anxiety

For the past few months, I’ve relied on Opus 4.6 for its sheer coding power. While its performance is top-notch, the costs are substantial. When you want to run various experiments where success isn't guaranteed, the budget can become a significant hurdle.

That’s where gemini-3.1-flash-lite-preview shines. Its balance of performance and cost makes it easy to iterate and experiment freely. It’s the perfect "partner" for development, and I plan to integrate it into my workflow even more moving forward.

What do you think? It looks like Google will continue to roll out new AI models one after another. We might even see some open-source models soon, so it's definitely something to keep an eye on. Here at ToshiStats, we’ll keep testing and integrating various AI models into our workflow. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Gemini 3.1 Flash-Lite: Built for intelligence at scale, Google, Mar 03, 2026
2) Arena
3) Consumer Complaint Database

Toshifumi Kuga

March 1, 2026

Agent teams, AI agent, claude code, Corporate Strategy

The Rise of the AI Strategist: Can AI Agents Master Corporate Strategy?

Did you know that Claude Code, the coding assistant that's exploding in popularity worldwide, lets you use Agent Teams (1) to run AI agents as a team? The idea is to run multiple AI agents simultaneously, each with its own purpose, to achieve performance that a single agent couldn't deliver. This time, we'd like to test whether we can use Agent Teams to develop corporate strategy. Let's get started!

1. Implementing Five Forces Analysis with Agent Teams

There's a well-known framework in competitive strategy called Five Forces Analysis (2). This time, we'd like to apply it to the Japanese digital payment market and explore the possibility of market entry. We'll analyze from the following five perspectives, setting up an AI agent for each one.

We entered the following prompt into Claude Code, which you're all familiar with by now. There's nothing particularly difficult about it. Of course, no programming is required. However, if this is your first time using Agent Teams, you'll need to configure the settings, so don't forget (1).

The multi-agent system we'll actually build looks like the following. A total of seven AI agents will be running, but the key point is the loop involving Agent 6 and Agent 7. After Agent 6 creates a report summarizing the research findings, Agent 7, positioned independently, verifies that report. The report isn't complete until Agent 7 approves it and gives the go-ahead. Quite rigorous, isn't it?
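The write-then-verify loop between Agent 6 and Agent 7 can be sketched in plain Python as follows. This is a conceptual illustration only, not how Claude Code implements Agent Teams; `write_report`, `verify_report`, and the toy stand-ins are hypothetical.

```python
def run_with_verification(write_report, verify_report, findings, max_rounds=3):
    """Repeat write -> verify until the verifier approves the report."""
    feedback = None
    for round_no in range(1, max_rounds + 1):
        report = write_report(findings, feedback)
        approved, feedback = verify_report(report)
        if approved:
            return report, round_no
    raise RuntimeError("report was not approved within max_rounds")


# Toy stand-ins for the writer (Agent 6) and the verifier (Agent 7).
def toy_writer(findings, feedback):
    report = "Summary: " + "; ".join(findings)
    if feedback:
        report += " [revised: " + feedback + "]"
    return report


def toy_verifier(report):
    if "[revised" in report:
        return True, None
    return False, "add sources"


report, rounds = run_with_verification(toy_writer, toy_verifier, ["a", "b"])
print(rounds, report)
```

Keeping the verifier separate from the writer, and capping the number of rounds, are the two design choices that matter here.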

2. The Report Creation Process

Now let's follow the report creation process on the actual screen. As you can see below, seven AI agents have indeed been configured. You can also see that the crucial verification loop has been created.

First, Phase 1. The five research AI agents begin by pulling information from the web. They gather information about the Japanese digital payment market from the five perspectives of Five Forces Analysis. Each AI agent operates independently and processes in parallel, making it very efficient.
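As a rough analogy for this parallel phase, the sketch below runs one worker per force with a thread pool. The `research` function is a placeholder of my own; in the real setup each worker is an AI agent searching the web, and Claude Code handles the orchestration itself.

```python
from concurrent.futures import ThreadPoolExecutor

# The five perspectives of Porter's Five Forces Analysis.
FORCES = [
    "Threat of new entrants",
    "Bargaining power of suppliers",
    "Bargaining power of buyers",
    "Threat of substitutes",
    "Competitive rivalry",
]


def research(force):
    # Placeholder for one research agent; a real one would search the web.
    return f"Findings on: {force}"


def run_research(forces, worker=research):
    """Run one worker per force in parallel and collect results by force."""
    with ThreadPoolExecutor(max_workers=len(forces)) as pool:
        results = list(pool.map(worker, forces))  # map() preserves input order
    return dict(zip(forces, results))


findings = run_research(FORCES)
print(len(findings))
```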

Work has progressed, and it appears four of the research tasks are complete. The competitive landscape from each perspective is documented as well. Just a little more to go.

The research by all five AI agents is complete, and we move into Phase 2: creating the integrated report. I'm excited to see what kind of report it will be.

Then we enter the most important phase—Phase 3: the verification loop. Here, the goals are: 1) fact-checking through search, 2) identifying logical inconsistencies, and 3) identifying hallucinations, all aimed at improving the quality of the integrated report.

It appears eight errors were identified and corrected.

The report is finally complete. As shown below, there are six types of reports. We compiled all six into a single PDF file, and it spans 60 pages of content. Impressive, isn't it?

3. Structure of the Generated Analysis Report

The structure of the consolidated report is as follows. It's written in accordance with the Five Forces Analysis framework.

We can't present everything here, but the summary in Chapter 1 looks like the following—I think it's very clearly organized. Please note that this summary is for educational purposes only and should not be directly applied to business decisions or the like.

Notice: This is for educational purposes only.

So, what did you think? We carried out corporate strategy development using Five Forces Analysis, and the AI agents produced an excellent report. While further verification is needed, it could potentially be used as a starting point for discussion. I should note that Agent Teams is currently in an experimental phase, so changes to specifications are possible going forward (1). At ToshiStats, we'll continue applying multi-agent systems across various fields. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Orchestrate teams of Claude Code sessions, Anthropic
2) Porter's five forces analysis, Wikipedia

Toshifumi Kuga

January 2, 2026

AI agent, Machine Learning, claude code, Governance

What Awaits Us in 2026? Bold Predictions for AI Agents & Machine Learning

Happy New Year!

As we finally step into 2026, I am sure many of you are keenly interested in how AI agents will develop this year. Therefore, I would like to make some bold predictions by raising three key points, while also considering their connection to machine learning. Let's get started.

1. A Dramatic Leap in Multimodal Performance

I believe the high precision of the image generation AI "Nano Banana Pro (1)," released by Google on November 20, 2025, likely stunned not just AI researchers but the general public as well. Its ability to thoroughly grasp the meaning of a prompt and faithfully reproduce it in an image is magnificent, possessing a capability that could be described as "Text-to-Infographics."

Furthermore, its multilingual capabilities have improved significantly, allowing it to perfectly generate Japanese neon signs like this: "明けましておめでとう 2026" (Happy New Year 2026)

This model is not a simple image generation AI; it is built on top of the Gemini 3 Pro frontier model with added image generation capabilities. That is why the AI can deeply understand the user's prompt and generate images that align with their intent. Google also possesses AI models like Genie 3 (2) that perform simulations using video, leading the industry with multimodal models. We certainly cannot take our eyes off their movements in 2026.

2. The Explosive Popularity of "Agentic Coding"

Currently, coding by AI agents—"Agentic Coding"—has become a massive global movement. However, for complex code, it is not yet 100% perfect, and human review is still necessary. Additionally, humans still need to create the Product Requirement Document (PRD), which serves as the blueprint for implementation.

I have built several default prediction models used in the financial industry, and I always feel that development is more efficient when the human side first creates a precise PRD. By doing so, we can largely entrust the actual coding to the AI agent. This is an example of a default prediction model.

However, the speed of evolution for frontier models is tremendous. In the latter half of 2026, we expect updates like Gemini 4, GPT-6, and Claude 5, and frankly, it is difficult to even imagine what capabilities AI agents will acquire as a result.

Alongside the progress of these models, the toolsets known as "code assistants" are also likely to significantly improve their capabilities. Tools like Claude Code, Gemini CLI, Cursor, and Codex have become indispensable for programmers today, but in 2026, these code assistants will likely play an active role in fields closer to business, such as machine learning and economic analysis.

At this point, calling them "code assistants" might be off the mark; a broader name like "Thinking Machine for Business" might be more appropriate. The day when those who don't know how to code can master these tools may be close at hand. It is very exciting.

3. AI Agents and Governance

As mentioned above, it is predicted that in 2026, AI agents will increasingly permeate large organizations such as corporations and governments. However, there is one thing we must be careful about here.

The behavior of AI agents changes probabilistically. This means that different outputs can be produced for the same input, which is vastly different from current systems. Furthermore, if an AI agent possesses the ability for Recursive Self-Improvement (updating and improving itself), it means the AI agent will change over time and in response to environmental changes. In 2026, we must begin discussions on governance: how do we structure organizational processes and achieve our goals using AI agents that possess characteristics unlike any previous system? This is a very difficult theme, but I believe it is unavoidable if humanity is to securely capture the benefits and gains from AI agents. I previously established corporate governance structures in the financial industry, and I hope to contribute even a little based on that experience.
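One simple way to make this probabilistic behavior visible is to send the same input several times and measure how often the answers agree. The sketch below uses a hard-coded list of toy outputs; the function name and the data are my own assumptions.

```python
from collections import Counter


def agreement_rate(outputs):
    """Share of runs that produced the most common output."""
    if not outputs:
        raise ValueError("need at least one output")
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)


# Toy example: four runs of the same input, one of which disagrees.
toy_outputs = ["approve", "approve", "reject", "approve"]
print(agreement_rate(toy_outputs))
```

A traditional deterministic system would score 1.0 by construction, so tracking a number like this over time is one possible starting point for governance discussions.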

What did you think? It looks like AI evolution will accelerate even further in 2026. I hope we can all enjoy it together. I look forward to another great year with you all.

You can enjoy our video news ToshiStats-AI from this link, too!

1) Introducing Nano Banana Pro, Google, Nov 20, 2025
2) Genie 3: A new frontier for world models, Jack Parker-Holder and Shlomi Fruchter, Google DeepMind, August 5, 2025


Toshifumi Kuga

December 25, 2025

Gemini 3 Flash, AI agent, Google DeepMind, Google AI Studio

Gemini 3 Flash: The Multi-modal Powerhouse Dominating the 2026 AI Scene!

Gemini 3 Flash (1) — likely the final major AI model debut of 2025 — is currently making waves. Despite being positioned as an affordable, mid-tier model, its performance is reportedly on par with flagship models. Today, I want to put Gemini 3 Flash to the test and see just how much its multimodal capabilities have evolved. Let’s dive right in.

1. App Development

To conduct our experiments, I wanted to create a simple application using Google AI Studio. By simply entering a prompt into the interface, the app was ready in an instant. No Python was used at all. This level of accessibility means even non-engineers can build functional apps now. Things have truly become incredibly convenient.

2. Object Counting

First, I challenged the model with a task that has historically been difficult for AI: counting objects. I asked the AI to count the number of cans and cars in an image. I counted them myself as well, and the AI’s response was spot on. At this level of accuracy, we might no longer need specialized object detection models for general tasks.

3. Economic Analysis from Charts

Next, let’s try a task that requires a higher level of intelligence: interpreting economic indicators from charts and generating an analytical report. Japan has entered a super-aging society faster than any other developed nation, and the labor force is steadily declining. For this test, I provided charts for the labor force population, the unemployment rate, and manufacturing-sector hourly wages. I then instructed the AI to read these charts, synthesize the data, and produce a comprehensive analysis.

In 30 seconds, the economic report was generated. Below is an excerpt. I was genuinely impressed by the depth of analysis derived from just three charts. Gemini 3 Flash is truly formidable!

Conclusion

What do you think? Gemini 3 Flash is a fantastic value, being significantly cheaper than rival flagship models. Given that its multimodal performance is top-tier, I believe this will become the "go-to" model for many users. For AI startups like ours, having a model that allows for extensive experimentation with high token volumes without breaking the bank is incredibly reassuring. I highly recommend giving it a try!

Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Gemini 3 Flash: frontier intelligence built for speed, Dec 17, 2025, Google

Copyright © 2025 Toshifumi Kuga. All rights reserved.

Toshifumi Kuga

December 18, 2025

Vibe Coding, claude code, AI agent, Machine Learning, Plan Mode, MetaPrompt

Improving ML Vibe Coding Accuracy: Hands-on with Claude Code's Plan Mode

2025 was a year in which I actively incorporated "Vibe Coding" into machine learning. Over repeated trials, however, I found that coding accuracy was inconsistent: sometimes good, sometimes bad.

Therefore, in this experiment, I decided to use Claude Code "Plan Mode" (1) to automatically generate an implementation plan via an AI agent before generating the actual code. Based on this plan, I will attempt to see if a machine learning model can be built stably using "Vibe Coding." Let's get started!

1. Generating an Implementation Plan with Claude Code "Plan Mode"

Once again, I would like to build a model that predicts in advance whether a customer will default (on a loan, etc.). I will use publicly available credit card default data (2). For the code assistant, I am using Claude Code, and for the IDE, the familiar VS Code.

To provide input to the Claude Code AI agent, I summarized the task and implementation points into a "Product Requirement Document (PRD)." This is the only document I created.

I input this PRD into Claude Code "Plan Mode" and instructed it to: "Create a plan to create predictive model under the folder of PD-20251217".

Within minutes, the following implementation plan was generated. Comparing it to the initial PRD, you can see how refined it is. Note that I am only showing half of the actual plan generated here—a truly detailed plan was created. I can only say that the ability of the AI agent to envision this far is amazing.

2. Beautifully Visualizing Prediction Accuracy

When this implementation plan is approved and executed, the prediction model is generated. Naturally, we are curious about the accuracy of the resulting model.

Here, it is visualized clearly according to the implementation plan. While these are familiar metrics for machine learning experts, all the important ones are covered and visualized in an easy-to-understand way, summarized as a single HTML file viewable in a browser.

The charts below are excerpts from that file. It includes ROC curves, SHAP values, and even hyperparameter tuning results. This time, the total implementation time was about 10 minutes. If it can be generated automatically to this extent in that amount of time, I’d rather leave it to the AI agent.
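For readers less familiar with the metric behind an ROC curve, the area under it (AUC) equals the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. A small pure-Python sketch of that definition follows; it is for illustration only and is not the code the agent generated.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise version is O(n^2), which is fine
    for a small illustration but not for large datasets.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = default, 0 = no default.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```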

3. Meta-Prompting with Claude Code "Plan Mode"

A Meta-Prompt refers to a "prompt (instruction to AI) used to create and control prompts."

In this case, I called Claude Code "Plan Mode" and instructed it to "generate an implementation plan" based on my PRD. This is nothing other than executing a meta-prompt in "Plan Mode."

Thanks to the meta-prompt, I didn't have to write a detailed implementation plan myself; I only needed to review the output. It is efficient because I can review it before coding, and since that implementation plan can be viewed as a highly precise prompt, the accuracy of the actual coding is expected to improve.
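The two-stage idea (first ask for a plan, then ask for code based on that plan) can be sketched as follows. Plan Mode itself is an interactive feature of Claude Code; `call_model` is a hypothetical stand-in for any model call, and the prompt wording is my own.

```python
def plan_then_code(prd_text, call_model):
    """Stage 1: PRD -> implementation plan. Stage 2: plan -> code."""
    meta_prompt = (
        "Read the following Product Requirement Document and write a "
        "detailed, step-by-step implementation plan. Do not write code yet.\n\n"
        + prd_text
    )
    plan = call_model(meta_prompt)
    # A human would review (and possibly edit) the plan here before stage 2.
    code_prompt = "Implement the following plan exactly:\n\n" + plan
    code = call_model(code_prompt)
    return plan, code


# Toy stand-in for a model, for illustration only.
def fake_model(prompt):
    if prompt.startswith("Read the following"):
        return "1. Load data\n2. Train model"
    return "# code for: " + prompt.splitlines()[-1]


plan, code = plan_then_code("Predict credit card default.", fake_model)
print(plan)
print(code)
```

The review step between the two calls is where the human effort goes, which matches the workflow described above.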

To be honest, I don't have the confidence to write the entire implementation plan myself. I definitely want to leave it to the AI agent. It has truly become convenient!

What did you think? Generating implementation plans with Claude Code "Plan Mode" seems applicable not only to machine learning but also to various other fields and tasks. I definitely intend to keep trying it out. I encourage everyone to give it a try as well.

That’s all for today. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) How to use Plan Mode, Anthropic

2) Default of Credit Card Clients

Toshifumi Kuga

November 11, 2025

Opal, AI agent, marketing, Google

This Is What Happens When an AI Agent Runs Our 2025 Autumn Marketing!

Hello, the high temperature in Tokyo has dropped to 16°C, and it's starting to feel very much like autumn. For those unfamiliar with autumn in Japan, this is the season when the leaves on the mountains change from green to orange. The entire mountainside is dyed orange, creating a beautiful and spectacular view. Therefore, I decided to use orange as the background color for this marketing campaign's promotional video. The challenge is: "To devise a campaign to sell cakes to women in Ashiya, an affluent residential area in the Kansai region." What happens when we entrust this task to an AI agent? Let's find out.

1. Creating an AI Marketing Agent with "Google Opal"

This time, I'm creating an AI marketing agent using Google Opal (1). As the description says, "Opal, our no-code AI mini-app builder," you can easily develop an AI agent app like the one below.

For this AI agent's development, I only entered the following prompt: "You are an expert in marketing campaigns. You will be given the following information: 1. The product/service to sell, 2. The target customer, 3. The location/region, 4. The time/season of the campaign, 5. The desired brand image color, 6. A photo of the facilitator. Using this information, please create the following: a. A marketing strategy, b. A marketing campaign name, c. A logo based on the name, d. A promotional video featuring the facilitator, complete with BGM."

Just by executing this, you can create a workflow like the one shown above using the AI agent. After that, you just switch to the app and answer questions related to your task, and the marketing campaign is created. Amazing, isn't it!

2. Marketing Strategy and Logo

Once you input all the necessary information, you get the results back immediately. First is the marketing strategy. In reality, a more detailed discussion followed. This time, I'll just introduce the beginning. Even though I didn't input very detailed information about the campaign at the initial stage, I think this marketing strategy is well-done.

Next is the marketing campaign name and logo. What it generated was a cool, French-style logo. I'd love to try using it sometime.

3. Three Short Promotional Videos

First, I provide the AI agent with a base image of a woman. Then, using this image as a starting point and based on the created marketing strategy, an approximately 8-second short video is generated. It's exciting to see what kind of video the AI agent will produce. This time, it created three videos with BGM. All of them are based on the theme of "Autumn Cakes." It's hard to pick a winner; they are all excellent. After actually creating the videos, I felt that even 8 seconds is enough to convey the image clearly. Which one did you like the best?

What did you think? Although this was just a demo AI agent, I was astonished at what it could accomplish with no code, no programming. It seems like it will become a powerful ally for marketers. Of course, there are limitations, but what I created this time can be done for free with just a Google account. I highly recommend giving it a try. ToshiStats will continue to share more about AI agents. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Opal is now available in more than 160 countries, Google, 7 Nov 2025

Toshifumi Kuga

October 15, 2025

Andrew Ng, AI agent, generative ai, Agentic AI

Your Guide to AI Agents: Insights from Andrew Ng's Latest Course

A new online course called "Agentic AI" (1) has been released by DeepLearning.AI. The creator is Andrew Ng, an adjunct professor at Stanford University, who is also famous for his earlier machine learning courses. For me, this is the first course I've taken from him since the Deep Learning Specialization in 2018. I've just completed it, and I'd like to share my thoughts and a recommendation.

1. Course Overview

The course is divided into five modules, each consisting of 5-7 short videos (about 5-10 minutes each), a quiz, and coding tasks in Jupyter notebooks. Once you pass every assignment, you are awarded a certificate of completion. The level is listed as intermediate; while a basic knowledge of Python is necessary, I believe that even those without specialized knowledge in AI can progress through the material and naturally come to understand it. The main topics are as follows:

Reflection: AI critiques its own work and iterates to improve quality—like code review, but automated.

Tool Use: Connect AI to databases, APIs, and external services so it can actually perform actions, not just generate text.

Planning: Break complex tasks into executable steps that AI can follow and adapt when things don’t go as expected.

Multi-Agent: Coordinate multiple specialized AI systems to handle different parts of a complex workflow.

The course was created by Andrew Ng, who teaches at Stanford while also doing practical consulting work, and I found it to have a wonderful balance between theory and practice.

2. Reflection and Tool Use

The second and third modules cover technologies that are critical for the future realization of AGI. In particular, "Reflection," where an AI improves itself, is also known as Recursive Self-Improvement and is a field being researched worldwide. This module introduces a method that allows even non-experts to incorporate reflection functionality, which I am very eager to try implementing. Additionally, using tools allows a generative AI to incorporate information that is difficult to acquire on its own, thereby enhancing the AI agent's capabilities. Furthermore, this information can be applied to the "Reflection" process, promising a synergistic effect. I'm also keen to implement this and see what kind of information can be integrated.

3. Error Analysis

Echoing Andrew Ng's own emphasis, I consider this fourth module the most important and valuable content in the course. Generative AI is excellent, but it is not perfect. There is still a considerable possibility that it will produce incorrect answers. Therefore, to raise its accuracy to a practical level, the course emphasizes the importance of adopting a strategy that quickly identifies the parts of the overall process with the lowest performance and allocates resources to improving those areas. I can certainly see how, for a complex AI agent that may contain numerous sub-agents, identifying and prioritizing the reinforcement of its weaknesses is incredibly important in practical applications.
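The core of this strategy can be sketched as a simple tally: record for each run which steps succeeded, then rank the steps by failure rate. This is my own minimal illustration, not code from the course, and the step names are hypothetical.

```python
from collections import Counter


def error_rate_by_step(traces):
    """traces: list of dicts mapping step name -> True (ok) / False (failed).

    Returns (step, failure rate) pairs, worst step first.
    """
    totals, failures = Counter(), Counter()
    for trace in traces:
        for step, ok in trace.items():
            totals[step] += 1
            if not ok:
                failures[step] += 1
    rates = {step: failures[step] / totals[step] for step in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Toy traces from four runs of a hypothetical three-step agent.
toy_traces = [
    {"search": False, "summarize": True, "write": True},
    {"search": True, "summarize": False, "write": True},
    {"search": False, "summarize": True, "write": True},
    {"search": True, "summarize": True, "write": True},
]
print(error_rate_by_step(toy_traces))
```

The step at the top of the list is where improvement effort is likely to pay off first.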

So, what did you think? With a flood of AI-related news every day, many people are likely wondering, "How should I proceed with my AI projects from now on?" I believe this course provides a valuable perspective for thinking in the medium to long term. While it is a paid course, it is not as expensive as university tuition, and I highly recommend trying it. Incidentally, because I studied intensively, I was able to receive my certificate in about three days. It's certainly possible for a business professional to complete it over a long weekend.

Well, that's all for today. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Agentic AI, Andrew Ng, DeepLearning.AI, Oct 2025

Toshifumi Kuga

October 6, 2025

AI agent, artificial intelligence, MLE-STAR, Machine Learning, ADK

The Secret to High-Accuracy AI: An Exploration of the Machine Learning Engineering Agent

In a previous post, I explained Google's research paper, "MLE STAR" (1), and uncovered the mechanism by which an AI can build its own high-accuracy machine learning models. This time, I'm going to implement that AI agent using the Google ADK and experiment to see if it can truly achieve high accuracy. For reference, the MLE STAR code is available as open source (2).

1. The Information I Provided

With MLE STAR, humans only need to handle the data input and task definition. The data I used for this experiment comes from the Kaggle competition "Home Credit Default Risk" (3). While the original data consists of 8 files, I combined them into a single file for this experiment. I reduced the training data to 10% of the original, resulting in about 30,000 samples, and kept the original test data of 48,700 samples.

The task was set as follows: "A classification task to predict default." Note that to speed up the experiment, the number of iterative loops was set to a minimum.

2. Deciding Which Model to Use

MLE STAR uses a web search to select the optimal model for the given task. In this case, it ultimately chose LightGBM. To finish the experiment quickly, I configured it to select only one model. If I had set it to select two, it likely would have also chosen something like XGBoost. Both are models frequently used in data science competitions.

It generated the initial script below. As a frequent user of LightGBM, I find the code familiar, but generating it in an instant is something only an AI can do. It's amazing!

3. Identifying Key Code Blocks with "Ablation Studies"

Next, it uses ablation studies to identify which code blocks should be improved. In this case, ablation2 showed that removing Early Stopping worsened the model's performance, so this feature was kept in the training process from then on.

**Ablation Studies Results by MLE STAR**
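For readers new to the idea, here is a minimal sketch of an ablation study. It is not MLE STAR's own code. The component names and the `toy_evaluate` function are hypothetical stand-ins for training and validating a real model.

```python
def run_ablation(components, evaluate):
    """Disable one component at a time and record the change in score.

    components: names of pipeline pieces that can be switched off.
    evaluate:   callable taking the set of enabled components and returning
                a validation score (higher is better).
    """
    baseline = evaluate(set(components))
    impact = {}
    for name in components:
        score = evaluate(set(components) - {name})
        impact[name] = baseline - score  # positive: removing it hurts
    return baseline, impact

# Hypothetical stand-in for training and validating a model: each component
# contributes a fixed, made-up amount to the score.
TOY_CONTRIBUTION = {
    "early_stopping": 0.010,
    "feature_scaling": 0.000,
    "target_encoding": 0.004,
}

def toy_evaluate(enabled):
    return 0.74 + sum(TOY_CONTRIBUTION[c] for c in enabled)

baseline, impact = run_ablation(list(TOY_CONTRIBUTION), toy_evaluate)
most_important = max(impact, key=impact.get)
print(most_important)
```

The component whose removal costs the most score is the one worth keeping and refining first.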

4. Iteratively Improving the Model

Based on the ablation studies, MLE STAR decided to improve the model using the following two techniques: K-fold target encoding and binary encoding. These techniques themselves are common in machine learning and are not particularly unusual.
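As a rough illustration of the first technique, the sketch below implements K-fold target encoding in plain Python. It is my own simplified version, not the code MLE STAR generated. The toy column and labels are hypothetical, and the folds are assigned without shuffling to keep the example short.

```python
from collections import defaultdict

def kfold_target_encode(categories, targets, n_folds=5):
    """Out-of-fold target encoding for one categorical column.

    Each row is encoded with the mean target of its category computed from
    the other folds only, so a row's own label never leaks into its feature.
    Categories unseen in the other folds fall back to those folds' global mean.
    """
    n = len(categories)
    encoded = [0.0] * n
    fold_of = [i % n_folds for i in range(n)]  # interleaved folds, no shuffling
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        for i in range(n):
            if fold_of[i] != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        global_mean = total / seen if seen else 0.0
        for i in range(n):
            if fold_of[i] == fold:
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else global_mean
    return encoded

# Hypothetical toy data: contract type vs. default flag
cats = ["cash", "cash", "revolving", "cash", "revolving", "cash"]
y = [1, 0, 0, 1, 0, 0]
enc = kfold_target_encode(cats, y, n_folds=3)
print(enc)
```

Computing the means out of fold is the point of the "K-fold" part: a plain per-category mean would include each row's own label and overstate validation accuracy.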

This ability to "use ablation studies to identify which code blocks to improve" is likely a major reason for MLE STAR's high accuracy. I look forward to seeing how this functionality evolves in the future.

5. The Results Are In. Unfortunately, I Lost.

For its final step, MLE STAR ensembles the models to create the final version. For more details, please see the research paper. It also generates a CSV file with the default predictions, which I slightly modified and promptly submitted to Kaggle. This task is evaluated using AUC, where a score closer to 1 indicates higher accuracy.
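For intuition about the metric, AUC can be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The sketch below computes it directly from that definition. The labels and scores are made-up illustration values, and the pairwise loop is only practical for small inputs.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    Runs in O(n_pos * n_neg), fine for illustration but slow for large data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```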

The top score is the result I achieved using my own LightGBM model. The score in the red box at the bottom is the one automatically generated by MLE STAR. With a difference of more than 0.01 on both the Public and Private scores, it was my complete defeat.

**Kaggle Prediction Accuracy Evaluation (AUC)**

Improving the AUC by 0.01 is quite a challenge, which gives a glimpse into how excellent MLE STAR is. I didn't perform any extensive tuning on my LightGBM model, so I believe my score would have improved if I had spent time tuning it manually. However, MLE STAR produced its result in about 7 minutes from the start of the computation, so from an efficiency standpoint, I couldn't compete.

So, what did you think? Although this was a limited experiment, I feel I was able to grasp the high potential of MLE STAR. I was truly impressed by the power of its Recursive Self-Improvement, which identifies specific code blocks and improves upon them autonomously.

Here at Toshi Stats, I plan to continue digging into MLE STAR. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement, Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arık, and Tomas Pfister, Google Cloud and KAIST, 23 Aug 2025

2) Machine Learning Engineering with Multiple Agents (MLE-STAR), Google

3) Home Credit Default Risk, Kaggle

Toshifumi Kuga

September 29, 2025

Machine Learning, MLE-STAR, AGI, AI agent, artificial intelligence

Is an AI Machine Learning Assistant Finally a Reality? I Looked Into It, and It's Incredible!

I often build machine learning models for my job. The process of collecting data, creating features, and gradually improving the model's accuracy takes time, specialized knowledge, and programming skills in various libraries. I've always found it to be quite a challenge. That's why I've been hoping for an AI that could skillfully assist with this work, and recently, a potential candidate has emerged. I'd like to take a deep dive into it right away.

1. A Basic Three-Layer Structure

This AI assistant is called MLE-STAR, and according to a research paper (1), it has the following structure. Simply put, it first searches the internet for promising libraries. Next, after writing code using those libraries, it identifies which parts, called "code blocks," should be improved further. Finally, it decides how to improve those code blocks. Let's explore each of these steps in detail.

2. Selecting the Optimal Library with a Search Function

To create a high-accuracy machine learning model, you first need to decide "what kind of model to use." This means you have to select a library to implement the model. This is where the search function comes in. For example, in a finance task to calculate default probability, many methods are possible, but gradient boosting is often used in competitions like Kaggle. I also use gradient boosting in most cases. It seems MLE-STAR can use its search function to find the optimal library on its own, even without me specifying "use gradient boosting." That's amazing! This would eliminate the need for humans to research everything, leading to greater efficiency.

3. Finding Where to Improve the Code and Steadily Making Progress

Once the library is chosen and a baseline script is written, it's time to start making improvements to increase accuracy. But it's often difficult to know where to begin. MLE-STAR employs an ablation study to understand how accuracy changes when a feature is added or removed, thereby identifying the most impactful code block. This part of the process typically relies on human experience and intuition, involving a lot of trial and error. By using MLE-STAR, we can make data-driven decisions, which is incredibly efficient.

4. Iterating Until Accuracy Actually Improves

Once the code block for improvement is identified, the system gradually changes parameters and confirms the accuracy improvements. This is also done automatically within a loop, without requiring human intervention. The accuracy is calculated at each step, and as a rule, only changes that improve performance are adopted, ensuring that the model's accuracy steadily increases. Incredible, isn't it? In fact, a graph comparing the performance of MLE-STAR with past AI assistants shows that MLE-STAR won a "gold medal" in approximately 36% of the tasks, highlighting its superior performance.
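The "adopt only what improves" rule can be sketched as a simple loop. This is an illustration under my own assumptions, not MLE-STAR's implementation. In the real system an LLM proposes the changes, whereas `toy_propose` and `toy_evaluate` here are hypothetical stand-ins.

```python
import random

def refine(params, evaluate, propose, n_iters=50, seed=0):
    """Keep a candidate change only if it improves the validation score."""
    rng = random.Random(seed)
    best_params, best_score = dict(params), evaluate(params)
    history = [best_score]
    for _ in range(n_iters):
        candidate = propose(best_params, rng)
        score = evaluate(candidate)
        if score > best_score:       # adopt improvements only
            best_params, best_score = candidate, score
        history.append(best_score)   # so the history never decreases
    return best_params, best_score, history

# Hypothetical stand-ins: a toy score that peaks at learning_rate = 0.1,
# and a proposal step that nudges the learning rate up or down.
def toy_evaluate(p):
    return 1.0 - (p["learning_rate"] - 0.1) ** 2

def toy_propose(p, rng):
    new = dict(p)
    new["learning_rate"] = max(0.001, p["learning_rate"] * rng.choice([0.8, 1.25]))
    return new

best, score, history = refine({"learning_rate": 0.5}, toy_evaluate, toy_propose)
print(best, round(score, 4))
```

Because rejected candidates are discarded, the recorded best score can stay flat but never gets worse, which is what "steadily increases" means in practice.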

So, what did you think? This new framework for an AI assistant looks extremely promising. In particular, its ability to identify which code blocks to improve and then actually increase the accuracy is likely to become even more powerful as the performance of foundation models continues to advance. I'm truly excited about future developments.

Next time, I plan to apply it to some actual analysis data to see what kind of accuracy it can achieve. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

Toshifumi Kuga

September 14, 2025

AI agent, ADK, generative ai, marketing

A Sweet Strategy: Selling Cakes in Wealthy Residential Areas!

Have you ever thought about starting a cake shop? As a cake lover myself, I often find myself wondering, "What kind of cake would be perfect?" However, developing a concrete business strategy is a real challenge. That's why this time, I'd like to conduct a case study with the support of an "AI marketing-agency." Let's get started.

1. Selling Cakes in an Upscale Kansai Neighborhood

The business scenario I've prepared for this case is a simple one:

Goal: To sell premium fruit cakes in the Kansai region.

Cake Features: Premium shortcakes featuring strawberries, peaches, and muscat grapes.
Target Audience: Women in their 20s to 40s living in upscale residential areas.
Stores: 3 cafes near Yamate Dentetsu Ashiya Station, 1 cafe near Kaigan Dentetsu Ashiya Station.
Direct Sales Outlet: 1 store inside the Yamate Dentetsu Ashiya Station premises.
Branding: The brand's primary color will be blue, with the website and logo also unified in blue.
Current Plan: In the process of planning a sales promotion for the autumn season.

From here, what kind of concrete business strategy can we derive? First, I'll input the business scenario into the AI marketing-agency.

The first thing it does is automatically generate 10 cool domain names.

It's hard to choose, but for now, I'll proceed with branding using "PremiumAshiyaCake.com".

2. A Practical Business Strategy

Now, let's ask the AI marketing-agency to formulate a business strategy for selling our premium fruit cakes in Kansai. When prompted to input the necessary information, I re-entered the business scenario, and the following business strategy was generated in about two minutes. Amazing!

It's a long document, over five pages, so I can't share it all, but here is the "Core of the Marketing Strategy."

Overall Approach: Direct Response that Inspires Aspiration
- We will build an aspirational, luxury brand image through beautiful content, and then convert that desire into immediate store visits using precisely targeted calls-to-action (CTAs).
Core Message and Positioning:
- Positioning Statement: For the discerning women of Kansai, Premium Ashiya Cake is the patisserie that transforms a moment into a cherished memory with its exquisitely crafted seasonal shortcakes.
- Tagline / Core Message: "Premium Ashiya Cake: An exquisite moment, crafted for you."
Key Pillars of the Strategy:
- Visual Elegance and a "Blue" Signature: All visuals must be of professional, magazine-quality. The brand color "blue" will be used as a sophisticated accent in styling—such as on blue ribbons, parts of the tableware, or as background elements—to create a recognizable and unique visual signature.
- Hyper-local Exclusivity: Marketing efforts will be geographically and demographically laser-focused on the target audience residing in Ashiya and its surrounding affluent areas. This creates an "in-the-know" allure for locals.
- Seasonal Storytelling: Treat each season's campaign as a major event. We will build a narrative around the star ingredients, such as Shine Muscat grapes from a specific partner farm, to build anticipation and justify the premium price point.

This is wonderfully practical content. The keywords I provided—"blue," "Ashiya," and "muscat"—have been skillfully integrated into the strategy.

3. The Logo is Excellent, Too—This is Usable!

Because I specified in the initial business scenario that I wanted to "unify the color scheme based on blue," it created this cool logo for me. It really looks like something I could use right away. Google's image generation AI, Imagen 3.0, is used here. The quality of this AI is always highly rated, so it's no surprise that the logo generated this time is also of outstanding quality.

So, what did you think of the AI marketing-agency? The business strategy is professional, and it's amazing how it automatically created the domain names and logo with such excellent results. Although I couldn't introduce it this time, it also includes a website creation feature. It's surprising that a tool this capable is available for free. The development kit, "Google ADK," is provided as open source, and the AI marketing-agency from this article is available for free as one of its samples (1). If you can use Python, I think you'll get the hang of it with a little practice. The only operational cost is the usage fee for Google Gemini 2.5 Pro, so the cost-effectiveness is outstanding. I encourage you all to give it a try.

Please note that this story is a work of fiction and does not represent anything that actually exists. That's all for today, stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Marketing Agency, Google, May 2025

Toshifumi Kuga

August 24, 2025

AI agent, GPT5, generative ai, prompt, prompt engineering

Let's Explore the Best Practices for Crafting GPT-5 Prompts!

We are already hearing from many in the field that with the arrival of GPT-5, "the writing style is different from GPT-4o and earlier" and "its performance as an agent is on another level." Here, we will build upon the key points from OpenAI's "GPT-5 Prompt Guide" (1) and organize, from a practical perspective, "how to write prompts to stably reproduce desired behaviors." The following three points are key:

GPT-5 acts very proactively as an AI agent.
Self-reflection and guiding principles.
Instruction following with "surgical precision."

Let's delve into each of these.

1. GPT-5 acts very proactively as an AI agent.

GPT-5's enhanced capabilities in tool-calling, understanding long contexts, and planning allow it to proceed autonomously even with ambiguous tasks. Whether you "harness" or "suppress" this capability depends on how you design the agent's "eagerness."

1-1. Controlling Eagerness with Prompts

To suppress eagerness, intentionally limit the depth of exploration and explicitly set caps on parallel searches or additional tool calls. This is effective in situations where processing time and cost are priorities, or when requirements are clear and exploration needs to be minimized.

To enhance eagerness, explicitly state rules for persistence, such as "Do not end the turn until the problem is fully resolved" and "Even with uncertainty, proceed with the best possible plan." This is suitable for long-duration tasks where you want the agent to see them through to completion with minimal check-ins with the user.

Practical Snippet (To suppress eagerness):

<context_gathering>
Goal: Reach a conclusion quickly with minimal information gathering.
Method: A single-batch search, starting broad and then narrowing down. Avoid duplicate searches.
Budget: A maximum of 2 tool calls.
Escape: If a conclusion is reasonably certain, accept minor incompleteness to provide an early answer.
</context_gathering>

Practical Snippet (To encourage eagerness):

<persistence>
Do not end the turn until the problem is completely resolved.
Reason through uncertainty and continue with the best possible plan.
Minimize clarifying questions. Adopt reasonable assumptions and state them later.
</persistence>

1-2. Visualize with a "Tool Preamble"

When the agent outputs a long rollout during execution, having it first provide a brief summary—explaining the objective, outlining the plan, noting progress, and confirming completion—makes it easier for the user to follow along and creates a better user experience.

Recommended Snippet:

<tool_preambles>
First, restate the user's goal in a single sentence. Follow with a bulleted list of the planned steps.
During execution, add concise progress logs sequentially.
Finally, provide a summary that clearly distinguishes between the "Plan" and the "Actual Results."
</tool_preambles>
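In application code, the tagged blocks from this section can be assembled into one system prompt. The sketch below shows one way to do it. The helper names `tagged_block` and `build_system_prompt` are my own, and the instruction lines are shortened versions of the snippets in this section.

```python
def tagged_block(tag, lines):
    """Wrap instruction lines in an XML-style tag."""
    body = "\n".join(lines)
    return f"<{tag}>\n{body}\n</{tag}>"

def build_system_prompt(eager):
    """Pick the eagerness block that fits the task, then add the tool preamble."""
    if eager:
        eagerness = tagged_block("persistence", [
            "Do not end the turn until the problem is completely resolved.",
            "Reason through uncertainty and continue with the best possible plan.",
        ])
    else:
        eagerness = tagged_block("context_gathering", [
            "Goal: Reach a conclusion quickly with minimal information gathering.",
            "Budget: A maximum of 2 tool calls.",
        ])
    preamble = tagged_block("tool_preambles", [
        "First, restate the user's goal in a single sentence.",
        "During execution, add concise progress logs sequentially.",
    ])
    return eagerness + "\n\n" + preamble

prompt = build_system_prompt(eager=False)
print(prompt)
```

Selecting exactly one eagerness block in code keeps the "suppress" and "encourage" instructions from ending up in the same prompt by accident.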

2. Self-reflection and Guiding Principles

GPT-5 excels at "internally refining" the quality of its output through self-reflection. However, if the criteria for judging quality are not established beforehand, this reflection can become unproductive. This is where guiding principles and a private rubric are effective.

2-1. Provide a "Self-Grading Scorecard" with a Private Rubric

For zero-to-one generation tasks (e.g., creating a new web app, drafting specifications), have the model internally create a scorecard with 5-7 evaluation criteria. Then, have it repeatedly rewrite and re-evaluate its output based on these criteria.

Rubric Generation Snippet:

<self_reflection>
Define the conditions that a world-class deliverable should meet across 5-7 categories (e.g., UI quality, readability, robustness, extensibility, accessibility, accountability). Score your own proposal against these criteria, identify shortcomings, and redesign. The rubric itself should not be shown to the user.
</self_reflection>

2-2. Reduce Inconsistency with Guiding Principles

For ongoing development or modifying existing code, first provide the project's conventions by clearly stating its design principles, directory structure, and UI standards. This ensures that the model's suggested improvements and changes integrate naturally with the existing culture.

Guiding Principles Snippet (Example):

<guiding_principles>
Clarity and Reusability: Keep components small and reusable. Group them and avoid duplication.
Consistency: Unify tokens, typography, and spacing.
Simplicity: Avoid unnecessary complexity in styling and logic.
</guiding_principles>

2-3. Separately Control Verbosity and Reasoning Effort

GPT-5 can control its verbosity (the length of the final answer) and its reasoning_effort (the depth of thought) independently. This allows for context-specific overrides, such as "be concise in prose, but provide detailed explanations in code." The guide introduces a practical example of prompt tuning by Cursor, which is worth checking out. A useful tip for fast mode (minimal reasoning) is to require a brief summary of its thinking or plan at the beginning to assist its process.

3. GPT-5's Instruction Following has "Surgical Precision"

GPT-5 is extremely sensitive to the accuracy and consistency of instructions. Contradictory requests or ambiguous prompts waste reasoning resources and degrade output quality. Therefore, it is crucial to "structure" your instruction hierarchy to prevent contradictions before they occur.

3-1. Design to Avoid Contradictions

Take the example of a healthcare administrator scheduling a patient appointment based on symptoms. "Exceptions," such as altering preceding steps only in emergencies, must be clearly stated so they do not conflict with standard procedures.

Bad Example: The instructions "Do not schedule without consent" and "First, automatically secure the fastest same-day slot" coexist.
Correct Example: When "Always check the profile" and "In an emergency, immediately direct to 911" coexist, the exception rule is declared first.

OpenAI offers the following warning:

We understand that the process of building prompts is an iterative one, and that many prompts are living documents, constantly being updated by different stakeholders. But that’s why it is even more important to thoroughly review for instructions that are phrased improperly. We have already seen multiple early users discover ambiguities and contradictions within their core prompt libraries when they did such a review. Removing them dramatically streamlined and improved GPT-5's performance. We encourage you to test your prompts with our Prompt Optimizer tool to identify these kinds of issues.

How was that? In this article, we explored key points for prompt design from OpenAI's GPT-5 Prompt Guide (1). GPT-5 is a "partner in practice," combining powerful autonomy with precise instruction following. Try incorporating the points discussed today into your prompts and take your AI agents to the next level. That's all for today. Stay tuned!

1) GPT-5 prompting_guide, OpenAI, August 7, 2025

You can enjoy our video news ToshiStats-AI from this link, too!

Toshifumi Kuga

August 10, 2025

Prompt Optimization, AI agent, AGI, prompt engineering, Google DeepMind, generative ai

Prompt Optimization: The Secret to Building Better AI Agents?

The instructions that humans write for generative AI are called "prompts." There are many books and blogs out there that offer guidance on how to write them. Many of you have probably tried, and it's surprisingly difficult, isn't it? While no programming language is required, you have to go through a lot of trial and error to get the output you want from a generative AI. This process can be quite time-consuming, isn't well-systematized, and you often have to start from scratch for each new task.

So, this time, we'd like to experiment with "what happens if we have a generative AI write the prompts for us?" Let's get started.

1. Prompt Optimization

In 2023, Google DeepMind released a research paper titled "LARGE LANGUAGE MODELS AS OPTIMIZERS" (1).

This paper explored the use of LLMs to optimize prompts, and it seems to have worked well for several tasks. While a human writes the initial prompt, subsequent improvements are delegated to the LLM (the optimizer). The LLM is also responsible for judging whether the result was successful or not (the evaluator), meaning this approach can be applied even without labeled data that provides the correct answers. This is very helpful, as tasks involving generative AI often lack labeled data. Below is a flowchart of this process, which is effectively the automation of prompt engineering. This is professionally referred to as "prompt optimization." The specific method we adopted for this experiment is called OPRO (Optimization by PROmpting).
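The loop can be sketched in a few lines. This is a simplified illustration of an OPRO-style loop, not the paper's code or the script used in this experiment. `toy_propose` and `toy_score` are hypothetical stand-ins for the optimizer LLM call and the evaluation step, so the example runs without an API key.

```python
def opro(initial_prompt, propose, score, n_steps=5, top_k=3):
    """Simplified OPRO-style loop.

    propose(best) -> new prompt string; in practice an LLM call that is
                     shown the best (prompt, score) pairs so far.
    score(prompt) -> float; in practice accuracy on a small evaluation set
                     or a rating from an LLM evaluator.
    """
    history = [(initial_prompt, score(initial_prompt))]
    for _ in range(n_steps):
        # Show the optimizer only the best prompts so far, worst first,
        # so the strongest example sits closest to the request.
        best = sorted(history, key=lambda ps: ps[1])[-top_k:]
        candidate = propose(best)
        history.append((candidate, score(candidate)))
    return max(history, key=lambda ps: ps[1])

# Hypothetical stand-ins for the LLM optimizer and the evaluator.
HINTS = [
    "Answer with exactly one category.",
    "Read the whole complaint first.",
    "If unsure, pick the closest product.",
]

def toy_propose(best):
    top_prompt = best[-1][0]
    for hint in HINTS:  # add the first hint not yet present
        if hint not in top_prompt:
            return top_prompt + " " + hint
    return top_prompt

def toy_score(prompt):
    return 0.70 + 0.05 * sum(h in prompt for h in HINTS)

best_prompt, best_score = opro("Classify the complaint.", toy_propose, toy_score)
print(round(best_score, 2), best_prompt)
```

Keeping the scored history and feeding the best entries back to the optimizer is what separates this from simply asking for a better prompt once.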

2. Experiment with a Customer Complaint Classification Task

Similar to our blog post on July 26th, we set up a task to predict which financial product a bank's customer complaint is about. We used an LLM to solve a classification task where it selects one of the following six financial products. We used gemini-2.5-flash for this experiment, with a sample size of 100 customer complaints.

Mortgage
Checking or savings account
Student loan
Money transfer, virtual currency, or money service
Bank account or service
Consumer Loan

In this experiment, the LLM handled the prompt generation, but a meta-prompt was necessary to further improve the resulting prompts. I wrote the meta-prompt as follows. Essentially, it tells the LLM to "please further improve the resulting prompt."

We had the LLM generate 20 prompts, and the results are shown below. The final number is the accuracy. An accuracy of 0.8 means 80 out of 100 cases were correct. Since this dataset came with ground-truth labels, calculating the accuracy was easy.

We adopted the second prompt from the list, which had the best accuracy of 0.89 in this experiment. When we ported this prompt to our regular experimental environment and ran it, the accuracy exceeded 0.9, as shown below. We've done this task many times before, but this is the first time we've surpassed 0.9 accuracy. That's amazing!

3. What Does the Future of Prompt Engineering Look Like?

As you can see, it seems possible to optimize prompts by leveraging the power of generative AI. Of course, when considering cost and time, the results might not always be worth the effort. Nevertheless, I feel there's a strong need for prompt automation. Researchers worldwide are currently exploring various methods, so many things that aren't possible now will likely become possible in the near future. Prompt engineering techniques will continue to evolve, and I'm looking forward to these technological developments and plan to try out various methods myself.

So, what did you think? The ability of an AI agent to fully utilize the power of generative AI and improve itself without human intervention is called "recursive self-improvement." At ToshiStats, we will continue to provide the latest updates on this topic. Please look forward to it. Stay tuned!

1) LARGE LANGUAGE MODELS AS OPTIMIZERS, Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen, Google DeepMind

Toshifumi Kuga

July 26, 2025

Google AI Studio, AI agent, generative ai

I tried creating and implementing an AI app with no-code on Google AI Studio, and it was amazing!

Google has been rapidly releasing generative AI and related products recently, with Google AI Studio (1) particularly standing out as a developer platform. It integrates the latest image and video generation AI, truly embodying a multimodal platform. What's more, it's free up to a certain limit, making it a powerful ally for startups like ours. So, let's actually create an AI application with this platform!

1. Google AI Studio Portal

Below is the Google AI Studio portal. It has so many features that an AI beginner might get confused without prior knowledge. I suppose that's why it's a developer-oriented platform. By clicking the button in the red box, you'll be taken to a site where you can create an application simply by writing a prompt.

Here's the prompt I used this time.

"As a 'Complaint Categorization Agent,' you are an expert at understanding which product a customer is complaining about. You can select only one product from the complaint. Comprehensively analyze the provided complaint and classify it into one of the following categories:

Mortgage
Checking or savings account
Student loan
Money transfer, virtual currency, or money service
Bank account or service
Consumer Loan

Your output should be only one of the above categories. All samples must be classified into one of these classes. Results for all samples are required. Create a GUI that adds the ability to input a CSV file of customer complaints and generate a graph showing the distribution of customer complaint classes. Add features to the GUI to add labeled data independently of the customer complaint CSV file, calculate and display accuracy, and display a confusion matrix of the results."

Just by typing this prompt into the box and running it, the application described below is created. I didn't use any coding like Python at all. It's amazing!

2. Tackling a Real Classification Task with the Created App

After two or three attempts, the final application I built is shown below. It handles the task of classifying bank customer complaints by financial product. This time, I've set it to six types of financial products, but generative AI can achieve high accuracy even without prior training, so it's possible to classify many more classes if desired.

We import customer complaints via a CSV file. This time, I'll use 100 complaints. Furthermore, if ground truth data is available, I've added functionality to output accuracy and a confusion matrix. Below are the actual classification results. The distribution of the six financial products is displayed. It seems this customer complaint data primarily concerns mortgages.

Here's the crucial classification accuracy. This time, we achieved over 80% accuracy, at 83%, without any prior training. It's incredible!

The confusion matrix, often used in classification tasks, can also be displayed. This not only provides a numerical accuracy but also shows where classification errors frequently occur, making it easier to set guidelines for improving accuracy and enabling more effective improvements.

Confusion Matrix
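For readers who want to compute the same metrics outside the app, here is a minimal sketch in plain Python. The five complaints and their predictions are hypothetical, and only three of the six categories are used to keep the output small.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]

# Hypothetical results for five complaints
labels = ["Mortgage", "Student loan", "Consumer Loan"]
y_true = ["Mortgage", "Mortgage", "Student loan", "Consumer Loan", "Consumer Loan"]
y_pred = ["Mortgage", "Consumer Loan", "Student loan", "Consumer Loan", "Mortgage"]

acc = accuracy(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels)
print(f"accuracy = {acc:.2f}")
for name, row in zip(labels, matrix):
    print(f"{name:15s} {row}")
```

Off-diagonal cells show which pairs of classes get confused. In this toy data, "Mortgage" and "Consumer Loan" are mistaken for each other once in each direction.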

3. Agent Evaluation

What I realized when creating this app was that once an evaluation metric is available, discussions about subsequent improvements become much deeper. Trying just a few samples won't give you a good grasp of the generative AI's behavior. Preparing at least 10, and ideally 100 or more, samples with corresponding ground truth data and having the AI app output evaluation metrics makes it possible to propose effective accuracy improvements. This theme is called "Agent evaluation," and I believe it will become essential for building practical AI applications in the future.

What do you think? Despite not doing any programming at all this time, I was able to create such an amazing AI application. Google AI Studio integrates perfectly with Google Cloud, allowing you to deploy your app to the cloud with a single button and use it worldwide. At Toshi Stats, we will continue to challenge ourselves by building various AI applications. Stay tuned!

1) Google AI Studio

Toshifumi Kuga

June 16, 2025

prompt, AI agent, startup

The Cutting Edge of Prompt Engineering: A Look at a Silicon Valley Startup

Hello everyone. How often do you find yourselves writing prompts? I imagine more and more of you are writing them daily and conversing with generative AI. So today, we're going to look at the state of cutting-edge prompt engineering, using a case study from a Silicon Valley startup. Let's get started.

1. "Parahelp," a Customer Support AI Startup

There's a startup in Silicon Valley called "Parahelp" that provides AI-powered customer support. Impressively, they have publicly shared some of their internally developed prompt know-how (1). In the hyper-competitive world of AI startups, I want to thank the Parahelp management team for generously sharing their valuable knowledge to help those who come after them. The details are in the link below for you to review, but my key takeaway from their know-how is this: "The time spent writing the prompt itself isn't long, but what's crucial is dedicating time to the continuous process of executing, evaluating, and improving that prompt."

When we write prompts in a chat, we often want an immediate answer and tend to aim for "100% quality on the first try." However, it seems the style in cutting-edge prompt engineering is to meticulously refine a prompt through numerous revisions. For an AI startup to earn its clients' trust, this expertise is essential and may very well be the source of its competitive advantage. I believe "iteration" is the key for prompts as well.

2. Prompts That Look Like a Computer Program

Let's take a look at a portion of the published prompt. This is a prompt for an AI agent to behave as a manager, and even this is only about half of the full version.

Here is my analysis of the prompt above:

Assigning a persona (in this case, the role of a manager)
Describing tasks clearly and specifically
Listing detailed, numbered instructions
Providing important points as context
Defining the output format

I felt it adheres to the fundamental structure of a good prompt. Perhaps because it has been forged in the fierce competition of Silicon Valley, it is written with incredible precision. There's still more to it, so if you're interested, please view it from the link. It's written in even finer detail, and with its heavy use of XML tags, you could almost mistake it for a computer program. Incredible!

3. The Future of Prompt Engineering

I imagine that committing this much time and cost to prompt engineering is a high hurdle for the average business person. After learning the basics of prompt writing, many people struggle with what the next step should be.

One tip is to take a prompt you've written and feed it back to the generative AI with the task, "Please improve this prompt." This is called a "meta-prompt." Of course, the challenges of how to give instructions and how to evaluate the results still remain. At Toshi Stats, we plan to explore meta-prompts further.

So, what did you think? Even the simple term "prompt" has a lot of depth, doesn't it? As generative AI continues to evolve, or as methods for creating multi-AI agents advance, I believe prompt engineering itself will also continue to evolve. It's definitely something to keep an eye on. I plan to provide an update on this topic in the near future.

That's all for today. Stay tuned!

ToshiStats Co., Ltd. offers various AI-related services. Please check them out here!

1) Prompt design at Parahelp, Parahelp, May 28, 2025
