
Your Guide to AI Agents: Insights from Andrew Ng's Latest Course

DeepLearning.AI has released a new online course called "Agentic AI" (1). It was created by Andrew Ng, an adjunct professor at Stanford University who is also well known for his earlier machine learning courses. It's the first course I've taken from him since the Deep Learning Specialization in 2018. I've just completed it, and I'd like to share my thoughts and a recommendation.

 

1. Course Overview

The course is divided into five modules, each consisting of 5-7 short videos (about 5-10 minutes each), a quiz, and coding tasks in Jupyter notebooks. By passing each assignment, you are ultimately awarded a certificate of completion. The level is listed as intermediate; while basic Python knowledge is necessary, I believe that even those without specialized knowledge of AI can progress through the material and naturally come to understand it. The main topics are as follows:

Reflection: AI critiques its own work and iterates to improve quality—like code review, but automated.

Tool Use: Connect AI to databases, APIs, and external services so it can actually perform actions, not just generate text.

Planning: Break complex tasks into executable steps that AI can follow and adapt when things don’t go as expected.

Multi-Agent: Coordinate multiple specialized AI systems to handle different parts of a complex workflow.

Because Andrew Ng teaches at Stanford while also doing hands-on consulting work, the course strikes a wonderful balance between theory and practice.

 

2. Reflection and Tool Use

The second and third modules cover technologies that are critical for the future realization of AGI. In particular, "Reflection," where an AI improves its own output, is closely related to recursive self-improvement, a field being researched worldwide. The module introduces a method that allows even non-experts to incorporate reflection functionality, which I am very eager to try implementing. Tool use, in turn, lets a generative AI pull in information that is difficult to acquire on its own, enhancing the agent's capabilities. That information can also feed back into the reflection process, promising a synergistic effect. I'm keen to implement this and see what kind of information can be integrated.
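To make the reflection pattern concrete, here is a minimal sketch in Python. It is not the course's notebook code; the call_llm() helper is a hypothetical placeholder for whichever LLM API you use.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with an actual API call (e.g., Gemini or OpenAI).
    raise NotImplementedError

def reflect_and_improve(task: str, n_rounds: int = 2) -> str:
    # Draft, critique, and revise: the basic reflection loop.
    draft = call_llm(f"Complete the following task:\n{task}")
    for _ in range(n_rounds):
        critique = call_llm(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\n"
            "Critique this draft: list concrete errors, omissions, and style issues."
        )
        draft = call_llm(
            f"Task:\n{task}\n\nDraft answer:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the draft so that it addresses every point in the critique."
        )
    return draft

Tool use slots naturally into the same loop: the critique step can draw on information fetched from an external API or database rather than relying on the model alone.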

 

3. Error Analysis

As Andrew Ng emphasizes, this fourth module is, in my opinion, the most important and valuable part of the course. Generative AI is excellent, but it is not perfect; there is still a considerable chance it will produce incorrect answers. To raise accuracy to a practical level, the course therefore stresses a strategy of quickly identifying the parts of the overall process with the lowest performance and allocating resources to improving those areas. For a complex AI agent that may contain numerous sub-agents, I can certainly see how identifying its weaknesses and prioritizing their reinforcement is incredibly important in practice.
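As a rough illustration of this strategy, the sketch below tallies failures per pipeline step so the weakest component stands out. The step names and the traces structure are made-up examples, not anything from the course.

from collections import Counter

# One dict of per-step pass/fail flags for each evaluated example (illustrative data).
traces = [
    {"retrieval": True,  "reasoning": True,  "formatting": False},
    {"retrieval": False, "reasoning": True,  "formatting": True},
    {"retrieval": True,  "reasoning": False, "formatting": True},
]

failures = Counter()
for trace in traces:
    for step, passed in trace.items():
        if not passed:
            failures[step] += 1

for step, count in failures.most_common():
    print(f"{step}: {count} failures out of {len(traces)} examples")

Whichever step tops the list is where improvement effort pays off first.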

 

So, what did you think? With a flood of AI-related news every day, many people are likely wondering, "How should I proceed with my AI projects from now on?" I believe this course provides a valuable perspective for thinking in the medium to long term. While it is a paid course, it is not as expensive as university tuition, and I highly recommend trying it. Incidentally, because I studied intensively, I was able to receive my certificate in about three days. It's certainly possible for a business professional to complete it over a long weekend.

Well, that's all for today. Stay tuned!

 

You can enjoy our video news ToshiStats-AI from this link, too!


1) Agentic AI, Andrew Ng, DeepLearning.AI, October 2025







Copyright © 2025 Toshifumi Kuga. All rights reserved.

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

The Secret to High-Accuracy AI: An Exploration of a Machine Learning Engineering Agent

In a previous post, I explained Google's research paper "MLE-STAR" (1) and uncovered the mechanism by which an AI can build its own high-accuracy machine learning models. This time, I'm going to implement that AI agent using the Google ADK and experiment to see if it can truly achieve high accuracy. For reference, the MLE-STAR code is available as open source (2).

 

1. The Information I Provided

With MLE-STAR, humans only need to handle the data input and the task definition. The data I used for this experiment comes from the Kaggle competition "Home Credit Default Risk" (3). The original data consists of 8 files, which I combined into a single file for this experiment. I reduced the training data to 10% of the original, resulting in about 30,000 samples, and kept the original test data of 48,700 samples.
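For reference, here is a minimal sketch of this kind of data preparation with pandas. The file and column names follow the competition's layout, but the aggregation shown is illustrative only.

import pandas as pd

app = pd.read_csv("application_train.csv")   # main table
bureau = pd.read_csv("bureau.csv")           # one of the auxiliary tables

# Aggregate the auxiliary table to one row per applicant, then join it on.
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .mean(numeric_only=True)
    .add_prefix("BUREAU_MEAN_")
    .reset_index()
)
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")

# Keep roughly 10% of the training data, as in this experiment.
train_small = merged.sample(frac=0.10, random_state=42)
train_small.to_csv("home_credit_merged_10pct.csv", index=False)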

The task was set as follows: "A classification task to predict default." Note that to speed up the experiment, the number of iterative loops was set to a minimum.

                     Task Setup

 

2. Deciding Which Model to Use

MLE-STAR uses a web search to select the optimal model for the given task. In this case, it ultimately chose LightGBM. To finish the experiment quickly, I configured it to select only one model. If I had set it to select two, it likely would have also chosen something like XGBoost. Both are models frequently used in data science competitions.

                Model Selection by MLE STAR

It generated the initial script below. As a frequent user of LightGBM, the code looks familiar, but the ability to generate it in an instant is something only an AI can do. It's amazing!

 

3. Identifying Key Code Blocks with "Ablation Studies"

Next, it uses ablation studies to identify which code blocks should be improved. In this case, the second ablation showed that removing early stopping worsened the model's performance, so that feature was kept in the training process from then on.

               Ablation Studies Results by MLE STAR
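To give a feel for what such an ablation looks like, here is a minimal sketch that compares validation AUC with and without early stopping in LightGBM. The file and column names follow the earlier data-preparation sketch, it keeps numeric features only for simplicity, and it is not the script MLE-STAR generated.

import lightgbm as lgb
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("home_credit_merged_10pct.csv")
X = df.drop(columns=["TARGET", "SK_ID_CURR"]).select_dtypes("number")
y = df["TARGET"]
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

def fit_and_score(use_early_stopping: bool) -> float:
    # Train a LightGBM classifier and return its validation AUC.
    model = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
    callbacks = [lgb.early_stopping(50)] if use_early_stopping else []
    model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], callbacks=callbacks)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

print("with early stopping   :", fit_and_score(True))
print("without early stopping:", fit_and_score(False))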

 

4. Iteratively Improving the Model

Based on the ablation studies, MLE-STAR decided to improve the model using two techniques: K-fold target encoding and binary encoding. These techniques are common in machine learning and not particularly unusual.

                   K-fold Target Encoding

                     Binary Encoding
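For readers unfamiliar with the first technique, here is a minimal sketch of K-fold target encoding: each fold's rows are encoded with target means computed on the other folds, which limits leakage. The column names in the usage example are illustrative, and this is not the exact code MLE-STAR generated.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df: pd.DataFrame, col: str, target: str, n_splits: int = 5) -> pd.Series:
    encoded = np.full(len(df), np.nan)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fit_idx, enc_idx in kf.split(df):
        # Means are computed on the "fit" folds and applied to the held-out fold.
        fold_means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded[enc_idx] = df.iloc[enc_idx][col].map(fold_means).fillna(global_mean).to_numpy()
    return pd.Series(encoded, index=df.index)

# Example usage (hypothetical column names):
# df["OCCUPATION_TYPE_TE"] = kfold_target_encode(df, "OCCUPATION_TYPE", "TARGET")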

This ability to use ablation studies to identify which code blocks to improve is likely a major reason for MLE-STAR's high accuracy. I look forward to seeing how this functionality evolves.

 

5. The Results Are In. Unfortunately, I Lost.

For its final step, MLE-STAR ensembles the models to create the final version. For more details, please see the research paper. It also generates a CSV file with the default predictions, which I slightly modified and promptly submitted to Kaggle. This task is evaluated using AUC, where a score closer to 1 indicates higher accuracy.

The top score is the result I achieved using my own LightGBM model. The score in the red box at the bottom is the one automatically generated by MLE STAR. With a difference of more than 0.01 on both the Public and Private scores, it was my complete defeat.

             Kaggle Prediction Accuracy Evaluation (AUC)

Improving the AUC by 0.01 is quite a challenge, which gives a glimpse into how excellent MLE-STAR is. I didn't perform any extensive tuning on my LightGBM model, so I believe my score would have improved if I had spent time tuning it manually. However, MLE-STAR produced its result in about 7 minutes from the start of the computation, so from an efficiency standpoint, I couldn't compete.

 
 

So, what did you think? Although this was a limited experiment, I feel I was able to grasp the high potential of MLE-STAR. I was truly impressed by the power of its recursive self-improvement, which identifies specific code blocks and improves upon them autonomously.

Here at Toshi Stats, I plan to continue digging into MLE-STAR. Stay tuned!





You can enjoy our video news ToshiStats-AI from this link, too!




1) MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement, Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arık, and Tomas Pfister, Google Cloud and KAIST, August 23, 2025

2) Machine Learning Engineering with Multiple Agents (MLE-STAR), Google

3) Home Credit Default Risk, Kaggle



Copyright © 2025 Toshifumi Kuga. All rights reserved.

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Is an AI Machine Learning Assistant Finally a Reality? I Looked Into It, and It's Incredible!

I often build machine learning models for my job. The process of collecting data, creating features, and gradually improving the model's accuracy takes time, specialized knowledge, and programming skills in various libraries. I've always found it to be quite a challenge. That's why I've been hoping for an AI that could skillfully assist with this work, and recently, a potential candidate has emerged. I'd like to take a deep dive into it right away.

 
1. A Basic Three-Layer Structure

This AI assistant is called MLE-STAR, and according to a research paper (1), it has the following structure. Simply put, it first searches the internet for promising libraries. Next, after writing code using those libraries, it identifies which parts, called "code blocks," should be improved further. Finally, it decides how to improve those code blocks. Let's explore each of these steps in detail.

 

2. Selecting the Optimal Library with a Search Function

To create a high-accuracy machine learning model, you first need to decide "what kind of model to use." This means you have to select a library to implement the model. This is where the search function comes in. For example, in a finance task to calculate default probability, many methods are possible, but gradient boosting is often used in competitions like Kaggle. I also use gradient boosting in most cases. It seems MLE-STAR can use its search function to find the optimal library on its own, even without me specifying "use gradient boosting." That's amazing! This would eliminate the need for humans to research everything, leading to greater efficiency.

 

3. Finding Where to Improve the Code and Steadily Making Progress

Once the library is chosen and a baseline script is written, it's time to start making improvements to increase accuracy. But it's often difficult to know where to begin. MLE-STAR employs an ablation study to understand how accuracy changes when a feature is added or removed, thereby identifying the most impactful code block. This part of the process typically relies on human experience and intuition, involving a lot of trial and error. By using MLE-STAR, we can make data-driven decisions, which is incredibly efficient.

 

4. Iterating Until Accuracy Actually Improves

Once the code block for improvement is identified, the system gradually changes parameters and confirms the accuracy improvements. This is also done automatically within a loop, without requiring human intervention. The accuracy is calculated at each step, and as a rule, only changes that improve performance are adopted, ensuring that the model's accuracy steadily increases. Incredible, isn't it? In fact, a graph comparing the performance of MLE-STAR with past AI assistants shows that MLE-STAR won a "gold medal" in approximately 36% of the tasks, highlighting its superior performance.
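In pseudocode-like Python, the loop described above might look like the sketch below; propose_change() and evaluate() are placeholders for the LLM rewrite step and the validation-score computation, not MLE-STAR's actual code.

def refine(initial_solution, propose_change, evaluate, n_iterations: int = 10):
    best = initial_solution
    best_score = evaluate(best)
    for _ in range(n_iterations):
        candidate = propose_change(best)   # e.g., ask an LLM to rewrite one code block
        score = evaluate(candidate)        # e.g., validation accuracy of the new script
        if score > best_score:             # adopt the change only if it helps
            best, best_score = candidate, score
    return best, best_score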

 

So, what did you think? This new framework for an AI assistant looks extremely promising. In particular, its ability to identify which code blocks to improve and then actually increase the accuracy is likely to become even more powerful as the performance of foundation models continues to advance. I'm truly excited about future developments.

Next time, I plan to apply it to some actual analysis data to see what kind of accuracy it can achieve. Stay tuned!




You can enjoy our video news ToshiStats-AI from this link, too!



1) MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement, Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arık, and Tomas Pfister, Google Cloud and KAIST, August 23, 2025



Copyright © 2025 Toshifumi Kuga. All rights reserved.

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

A Sweet Strategy: Selling Cakes in Wealthy Residential Areas!

Have you ever thought about starting a cake shop? As a cake lover myself, I often find myself wondering, "What kind of cake would be perfect?" However, developing a concrete business strategy is a real challenge. That's why, this time, I'd like to conduct a case study with the support of an "AI marketing agency." Let's get started.


1. Selling Cakes in an Upscale Kansai Neighborhood

The business scenario I've prepared for this case is a simple one:

  • Goal: To sell premium fruit cakes in the Kansai region.

  • Cake Features: Premium shortcakes featuring strawberries, peaches, and muscat grapes.

  • Target Audience: Women in their 20s to 40s living in upscale residential areas.

  • Stores: 3 cafes near Yamate Dentetsu Ashiya Station, 1 cafe near Kaigan Dentetsu Ashiya Station.

  • Direct Sales Outlet: 1 store inside the Yamate Dentetsu Ashiya Station premises.

  • Branding: The brand's primary color will be blue, with the website and logo also unified in blue.

  • Current Plan: In the process of planning a sales promotion for the autumn season.

From here, what kind of concrete business strategy can we derive? First, I'll input the business scenario into the AI marketing agency.

The first thing it does is automatically generate 10 cool domain names.

It's hard to choose, but for now, I'll proceed with branding using "PremiumAshiyaCake.com".

 

2. A Practical Business Strategy

Now, let's ask the AI marketing agency to formulate a business strategy for selling our premium fruit cakes in Kansai. When prompted to input the necessary information, I re-entered the business scenario, and the following business strategy was generated in about two minutes. Amazing!

It's a long document, over five pages, so I can't share it all, but here is the "Core of the Marketing Strategy."

  • Overall Approach: Direct Response that Inspires Aspiration

    • We will build an aspirational, luxury brand image through beautiful content, and then convert that desire into immediate store visits using precisely targeted calls-to-action (CTAs).

  • Core Message and Positioning:

    • Positioning Statement: For the discerning women of Kansai, Premium Ashiya Cake is the patisserie that transforms a moment into a cherished memory with its exquisitely crafted seasonal shortcakes.

    • Tagline / Core Message: "Premium Ashiya Cake: An exquisite moment, crafted for you."

  • Key Pillars of the Strategy:

    • Visual Elegance and a "Blue" Signature: All visuals must be of professional, magazine-quality. The brand color "blue" will be used as a sophisticated accent in styling—such as on blue ribbons, parts of the tableware, or as background elements—to create a recognizable and unique visual signature.

    • Hyper-local Exclusivity: Marketing efforts will be geographically and demographically laser-focused on the target audience residing in Ashiya and its surrounding affluent areas. This creates an "in-the-know" allure for locals.

    • Seasonal Storytelling: Treat each season's campaign as a major event. We will build a narrative around the star ingredients, such as Shine Muscat grapes from a specific partner farm, to build anticipation and justify the premium price point.

This is wonderfully practical content. The keywords I provided—"blue," "Ashiya," and "muscat"—have been skillfully integrated into the strategy.

 

3. The Logo is Excellent, Too—This is Usable!

Because I specified in the initial business scenario that I wanted to "unify the color scheme based on blue," it created this cool logo for me. It really looks like something I could use right away. Google's image generation AI, Imagen 3.0, is used here. The quality of this AI is always highly rated, so it's no surprise that the logo generated this time is also of outstanding quality.

 

So, what did you think of the AI marketing agency? The business strategy is professional, and it's amazing how it automatically created the domain names and logo with such excellent results. Although I couldn't introduce it this time, it also includes a website creation feature. It's surprising that a tool this capable is actually available for free. A development kit called "Google ADK" is provided as open source, and the AI marketing agency from this article can be downloaded and used for free as a sample (1). For those who can use Python, I think you'll get the hang of it with a little practice. The operational costs are also limited to the usage fees for Google Gemini 2.5 Pro, so the cost-effectiveness is outstanding. I encourage you all to give it a try.

Please note that this story is a work of fiction and does not represent anything that actually exists. That's all for today, stay tuned!

 

You can enjoy our video news ToshiStats-AI from this link, too!

1) Marketing Agency, Google, May 2025



Copyright © 2025 Toshifumi Kuga. All rights reserved.

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

How to Turn GPT-5 into a Pro Marketing Analyst with AI Agents!

A while back, I introduced a guide to prompting GPT-5, but it can be quite a challenge to write a perfect prompt from scratch. Not to worry! You can actually have GPT-5 write prompts for GPT-5. Pretty cool, right? Let's take a look at how.

 

1. Using GPT-5 to Do a Marketer's Job

I have some global sales data for stickers (1). Based on this data, I want to develop a sales strategy.

                 Global Sticker Sales Records

In a typical company, a data scientist would analyze the data, and a marketing manager would then create an action plan based on the results. We're going to see if we can get GPT-5 to handle this entire process. Of course, this requires a good prompt, but what kind of prompt is best? This is where it gets tricky. The principle I always adhere to is this: "Data analysis is a means, not an end." There are many data analysis methods, so the same data can be analyzed in various ways. However, what we really want is a sales strategy that boosts revenue. With this in mind, let's reconsider what makes a good prompt.

It's a bit of a puzzle, but I've managed to draft a preliminary version.

 

2. Using Metaprompting to Improve the Prompt with GPT-5

Now, let's have GPT-5 improve the prompt I quickly drafted. The image below shows the process. The first red box is my draft prompt.

                    Metaprompt

The second red box explicitly states the principle: "Perform data analysis with the goal of creating a Marketing strategy." When you provide the data and run this prompt, GPT-5 creates the improvement suggestions you see below, which are very detailed. I actually ran this process twice to get a better result.

                   Final Prompt
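For those who want to try it themselves, here is a minimal sketch of such a metaprompting call, assuming the OpenAI Python SDK and access to a GPT-5 model. The draft prompt shown is illustrative, not the exact one I used.

from openai import OpenAI

client = OpenAI()

draft_prompt = "Analyze the sticker sales data and summarize the key trends."

improvement_request = (
    "You are a prompt engineer. Improve the prompt below so that the data analysis "
    "is explicitly performed in service of a marketing strategy that grows revenue. "
    "Return only the improved prompt.\n\n"
    f"Prompt to improve:\n{draft_prompt}"
)

response = client.responses.create(model="gpt-5", input=improvement_request)
print(response.output_text)

Running the returned prompt (together with the data) is then a second, ordinary call.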

 

3. The Result: GPT-5 Generates a Marketing Strategy!

Running the final prompt took about a minute and produced the following output. The detailed analysis and resulting insights are directly connected to marketing actions, staying true to our initial principle. It's fantastic.

The output is concise and perfect for busy executives. Creating this content on my own would likely take an entire day, but with GPT-5, the whole process, including the time it took to draft the initial prompt myself, takes only about 30 minutes. This really shows how powerful GPT-5 is.

 

What do you think? This time, we explored a method for getting GPT-5 to improve its own prompts. This technique is called Metaprompting, and it's described in the OpenAI GPT-5 Prompting Guide (2).

I encourage you to try Metaprompting starting today and take your AI agent to the next level. That's all for now! Stay tuned!

 



You can enjoy our video news ToshiStats-AI from this link, too!

 

Copyright © 2025 Toshifumi Kuga. All rights reserved.

1) Forecasting Sticker Sales, Kaggle, January 1, 2025

2) GPT-5 Prompting Guide, OpenAI, August 7, 2025


Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Let's Explore the Best Practices for Crafting GPT-5 Prompts!

We are already hearing from many in the field that with the arrival of GPT-5, "the writing style is different from GPT-4o and earlier" and "its performance as an agent is on another level." Here, we will build on the key points from OpenAI's GPT-5 Prompting Guide (1) and organize, from a practical perspective, how to write prompts that reliably reproduce the behaviors you want. The following three keywords are key:

  1. GPT-5 acts very proactively as an AI agent.

  2. Self-reflection and guiding principles.

  3. Instruction following with "surgical precision."

Let's delve into each of these.

 




 

1. GPT-5 acts very proactively as an AI agent.

GPT-5's enhanced capabilities in tool-calling, understanding long contexts, and planning allow it to proceed autonomously even with ambiguous tasks. Whether you harness or suppress this capability depends on how you design the agent's "eagerness."


1-1. Controlling Eagerness with Prompts

To suppress eagerness, intentionally limit the depth of exploration and explicitly set caps on parallel searches or additional tool calls. This is effective in situations where processing time and cost are priorities, or when requirements are clear and exploration needs to be minimized.

To enhance eagerness, explicitly state rules for persistence, such as "Do not end the turn until the problem is fully resolved" and "Even with uncertainty, proceed with the best possible plan." This is suitable for long-duration tasks where you want the agent to see them through to completion with minimal check-ins with the user.

Practical Snippet (To suppress eagerness):

<context_gathering>
Goal: Reach a conclusion quickly with minimal information gathering.
Method: A single-batch search, starting broad and then narrowing down. Avoid duplicate searches.
Budget: A maximum of 2 tool calls.
Escape: If a conclusion is reasonably certain, accept minor incompleteness to provide an early answer.
</context_gathering>

Practical Snippet (To encourage eagerness):

<persistence>
Do not end the turn until the problem is completely resolved.
Reason through uncertainty and continue with the best possible plan.
Minimize clarifying questions. Adopt reasonable assumptions and state them later.
</persistence>

1-2. Visualize with a "Tool Preamble"

When the agent outputs a long rollout during execution, having it first provide a brief summary—explaining the objective, outlining the plan, noting progress, and confirming completion—makes it easier for the user to follow along and creates a better user experience.

Recommended Snippet:

<tool_preambles>
First, restate the user's goal in a single sentence. Follow with a bulleted list of the planned steps.
During execution, add concise progress logs sequentially.
Finally, provide a summary that clearly distinguishes between the "Plan" and the "Actual Results."
</tool_preambles>
 
 

2. Self-reflection and Guiding Principles

GPT-5 excels at "internally refining" the quality of its output through self-reflection. However, if the criteria for judging quality are not established beforehand, this reflection can become unproductive. This is where guiding principles and a private rubric are effective.


2-1. Provide a "Self-Grading Scorecard" with a Private Rubric

For zero-to-one generation tasks (e.g., creating a new web app, drafting specifications), have the model internally create a scorecard with 5-7 evaluation criteria. Then, have it repeatedly rewrite and re-evaluate its output based on these criteria.

Rubric Generation Snippet:

<self_reflection>
Define the conditions that a world-class deliverable should meet across 5-7 categories (e.g., UI quality, readability, robustness, extensibility, accessibility, accountability). Score your own proposal against these criteria, identify shortcomings, and redesign. The rubric itself should not be shown to the user.
</self_reflection>

2-2. Reduce Inconsistency with Guiding Principles

For ongoing development or modifying existing code, first provide the project's conventions by clearly stating its design principles, directory structure, and UI standards. This ensures that the model's suggested improvements and changes integrate naturally with the existing culture.

Guiding Principles Snippet (Example):

<guiding_principles>
Clarity and Reusability: Keep components small and reusable. Group them and avoid duplication.
Consistency: Unify tokens, typography, and spacing.
Simplicity: Avoid unnecessary complexity in styling and logic.
</guiding_principles>

2-3. Separately Control Verbosity and Reasoning Effort

GPT-5 can control its verbosity (the length of the final answer) and its reasoning_effort (the depth of thought) independently. This allows for context-specific overrides, such as "be concise in prose, but provide detailed explanations in code." The guide introduces a practical example of prompt tuning by Cursor, which is worth checking out. A useful tip for fast mode (minimal reasoning) is to require a brief summary of its thinking or plan at the beginning to assist its process.
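As a minimal sketch, and assuming the Responses API parameters described in the guide, the two knobs can be set independently like this:

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},  # fast mode: shallow reasoning
    text={"verbosity": "low"},        # keep the final answer short
    input=(
        "Before answering, give a one-line summary of your plan, "
        "then answer: what does HTTP status 429 mean?"
    ),
)
print(response.output_text)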

 
 


3. GPT-5's Instruction Following Has "Surgical Precision"

GPT-5 is extremely sensitive to the accuracy and consistency of instructions. Contradictory requests or ambiguous prompts waste reasoning resources and degrade output quality. Therefore, it is crucial to "structure" your instruction hierarchy to prevent contradictions before they occur.



3-1. Design to Avoid Contradictions

Take the example of a healthcare administrator scheduling a patient appointment based on symptoms. "Exceptions," such as altering preceding steps only in emergencies, must be clearly stated so they do not conflict with standard procedures.

  • Bad Example: The instructions "Do not schedule without consent" and "First, automatically secure the fastest same-day slot" coexist.

  • Correct Example: When "Always check the profile" and "In an emergency, immediately direct to 911" coexist, the exception rule is declared first.

OpenAI offers the following warning:

We understand that the process of building prompts is an iterative one, and that many prompts are living documents, constantly being updated by different stakeholders. But that’s why it is even more important to thoroughly review for instructions that are phrased improperly. We have already seen multiple early users discover ambiguities and contradictions within their core prompt libraries when they did such a review. Removing them dramatically streamlined and improved GPT-5's performance. We encourage you to test your prompts with our Prompt Optimizer tool to identify these kinds of issues.

 
 

How was that? In this article, we explored key points for prompt design from OpenAI's GPT-5 Prompt Guide (1). GPT-5 is a "partner in practice," combining powerful autonomy with precise instruction following. Try incorporating the points discussed today into your prompts and take your AI agents to the next level. That's all for today. Stay tuned!

 
 

Copyright © 2025 Toshifumi Kuga. All rights reserved.

1) GPT-5 Prompting Guide, OpenAI, August 7, 2025

You can enjoy our video news ToshiStats-AI from this link, too!

 

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Prompt Optimization: The Secret to Building Better AI Agents?

The instructions that humans write for generative AI are called "prompts." There are many books and blogs out there that offer guidance on how to write them. Many of you have probably tried, and it's surprisingly difficult, isn't it? While no programming language is required, you have to go through a lot of trial and error to get the output you want from a generative AI. This process can be quite time-consuming, isn't well-systematized, and you often have to start from scratch for each new task.

So, this time, we'd like to experiment with "what happens if we have a generative AI write the prompts for us?" Let's get started.

 


1. Prompt Optimization

In 2023, Google DeepMind released a research paper titled "Large Language Models as Optimizers" (1).

This paper explored the use of LLMs to optimize prompts, and it seems to have worked well for several tasks. While a human writes the initial prompt, subsequent improvements are delegated to the LLM (the optimizer). The LLM is also responsible for judging whether the result was successful or not (the evaluator), meaning this approach can be applied even without labeled data that provides the correct answers. This is very helpful, as tasks involving generative AI often lack labeled data. Below is a flowchart of this process, which is effectively the automation of prompt engineering. This is professionally referred to as "prompt optimization." The specific method we adopted for this experiment is called OPRO (Optimization by PROmpting).
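To make the loop concrete, here is a minimal OPRO-style sketch: one LLM call proposes new prompts (the optimizer), accuracy on a small labeled set scores them, and the scored history is fed back into the next proposal. call_llm() is a placeholder for a real API call, and the meta-prompt wording is illustrative only.

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # replace with e.g. a Gemini or OpenAI call

def score_prompt(prompt: str, samples: list[tuple[str, str]]) -> float:
    # Accuracy of the prompt on (complaint, label) pairs.
    correct = 0
    for complaint, label in samples:
        prediction = call_llm(f"{prompt}\n\nComplaint:\n{complaint}")
        correct += int(label.lower() in prediction.lower())
    return correct / len(samples)

def optimize_prompt(initial_prompt: str, samples, n_rounds: int = 20):
    history = [(initial_prompt, score_prompt(initial_prompt, samples))]
    for _ in range(n_rounds):
        trajectory = "\n".join(
            f"score={s:.2f}: {p}" for p, s in sorted(history, key=lambda x: x[1])
        )
        candidate = call_llm(
            "Here are previous prompts and their accuracies on a complaint-classification "
            f"task, from worst to best:\n{trajectory}\n\n"
            "Write a new prompt that is likely to score higher. Return only the prompt."
        )
        history.append((candidate, score_prompt(candidate, samples)))
    return max(history, key=lambda x: x[1])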






2. Experiment with a Customer Complaint Classification Task

Similar to our blog post on July 26th, we set up a task to predict which financial product a bank's customer complaint is about. We used an LLM to solve a classification task where it selects one of the following six financial products. We used gemini-2.5-flash for this experiment, with a sample size of 100 customer complaints.

  • Mortgage

  • Checking or savings account

  • Student loan

  • Money transfer, virtual currency, or money service

  • Bank account or service

  • Consumer Loan

In this experiment, the LLM handled the prompt generation, but a meta-prompt was necessary to further improve the resulting prompts. I wrote the meta-prompt as follows. Essentially, it tells the LLM to "please further improve the resulting prompt."

We had the LLM generate 20 prompts, and the results are shown below. The final number is the accuracy; an accuracy of 0.8 means 80 out of 100 cases were correct. Since this dataset comes with labels, calculating the accuracy was easy.

We adopted the second prompt from the list, which had the best accuracy of 0.89 in this experiment. When we ported this prompt to our regular experimental environment and ran it, the accuracy exceeded 0.9, as shown below. We've done this task many times before, but this is the first time we've surpassed 0.9 accuracy. That's amazing!

 






3. What Does the Future of Prompt Engineering Look Like?

As you can see, it seems possible to optimize prompts by leveraging the power of generative AI. Of course, when considering cost and time, the results might not always be worth the effort. Nevertheless, I feel there's a strong need for prompt automation. Researchers worldwide are currently exploring various methods, so many things that aren't possible now will likely become possible in the near future. Prompt engineering techniques will continue to evolve, and I'm looking forward to these technological developments and plan to try out various methods myself.

 

So, what did you think? The ability of an AI agent to fully utilize the power of generative AI and improve itself without human intervention is called "recursive self-improvement." At ToshiStats, we will continue to provide the latest updates on this topic. Stay tuned!

 

Copyright © 2025 Toshifumi Kuga. All rights reserved.

1) Large Language Models as Optimizers, Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen, Google DeepMind, 2023

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

I Tried Creating and Implementing an AI App with No Code on Google AI Studio, and It Was Amazing!

Google has been rapidly releasing generative AI and related products recently, with Google AI Studio (1) particularly standing out as a developer platform. It integrates the latest image and video generation AI, truly embodying a multimodal platform. What's more, it's free up to a certain limit, making it a powerful ally for startups like ours. So, let's actually create an AI application with this platform!


1. Google AI Studio Portal

Below is the Google AI Studio portal. It has so many features that an AI beginner might get confused without prior knowledge. I suppose that's why it's a developer-oriented platform. By clicking the button in the red box, you'll be taken to a site where you can create an application simply by writing a prompt.

Google AI Studio

Here's the prompt I used this time.

"As a 'Complaint Categorization Agent,' you are an expert at understanding which product a customer is complaining about. You can select only one product from the complaint. Comprehensively analyze the provided complaint and classify it into one of the following categories:

  • Mortgage

  • Checking or savings account

  • Student loan

  • Money transfer, virtual currency, or money service

  • Bank account or service

  • Consumer Loan

Your output should be only one of the above categories. All samples must be classified into one of these classes. Results for all samples are required. Create a GUI that adds the ability to input a CSV file of customer complaints and generate a graph showing the distribution of customer complaint classes. Add features to the GUI to add labeled data independently of the customer complaint CSV file, calculate and display accuracy, and display a confusion matrix of the results."

Just by typing this prompt into the box and running it, the application described below is created. I didn't use any coding like Python at all. It's amazing!



2. Tackling a Real Classification Task with the Created App

After two or three attempts, the final application I built is shown below. It handles the task of classifying bank customer complaints by financial product. This time, I've set it to six types of financial products, but generative AI can achieve high accuracy even without prior training, so it's possible to classify many more classes if desired.

Input Screen

We import customer complaints via a CSV file. This time, I'll use 100 complaints. Furthermore, if ground truth data is available, I've added functionality to output accuracy and a confusion matrix. Below are the actual classification results. The distribution of the six financial products is displayed. It seems this customer complaint data primarily concerns mortgages.

Class Distribution

Here's the crucial classification accuracy. This time, we achieved over 80% accuracy, at 83%, without any prior training. It's incredible!

Classification accuracy

The confusion matrix, often used in classification tasks, can also be displayed. This not only provides a numerical accuracy but also shows where classification errors frequently occur, making it easier to set guidelines for improving accuracy and enabling more effective improvements.

Confusion Matrix

 

3. Agent Evaluation

What I realized when creating this app was that if some evaluation metric is available, the quality of discussions about subsequent improvements deepens. Trying just a few samples won't give a good grasp of the generative AI's behavior. Preparing at least 10, and ideally 100 or more, samples with corresponding ground-truth data, and having the AI app output evaluation metrics, enables effective suggestions for improving accuracy. This theme is called "agent evaluation," and I believe it will become essential for building practical AI applications in the future.
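As a rough sketch of what that evaluation step can look like outside the app, the snippet below compares predicted classes against ground-truth labels to produce accuracy and a confusion matrix. The file and column names are assumptions for illustration.

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

predictions = pd.read_csv("predicted_classes.csv")  # columns: complaint_id, predicted_product
labels = pd.read_csv("labeled_complaints.csv")      # columns: complaint_id, true_product

merged = predictions.merge(labels, on="complaint_id")
print("Accuracy:", accuracy_score(merged["true_product"], merged["predicted_product"]))
print(confusion_matrix(merged["true_product"], merged["predicted_product"]))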

 

What do you think? Despite not doing any programming at all this time, I was able to create such an amazing AI application. Google AI Studio integrates perfectly with Google Cloud, allowing you to deploy your app to the cloud with a single button and use it worldwide. Toshi Stats will continue to challenge ourselves by building various AI applications. Stay tuned!

 

Copyright © 2025 Toshifumi Kuga. All rights reserved.

1) Google AI Studio

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

The Cutting Edge of Prompt Engineering: A Look at a Silicon Valley Startup

Hello everyone. How often do you find yourselves writing prompts? I imagine more and more of you are writing them daily and conversing with generative AI. So today, we're going to look at the state of cutting-edge prompt engineering, using a case study from a Silicon Valley startup. Let's get started.

 

1. "Parahelp," a Customer Support AI Startup

There's a startup in Silicon Valley called "Parahelp" that provides AI-powered customer support. Impressively, they have publicly shared some of their internally developed prompt know-how (1). In the hyper-competitive world of AI startups, I want to thank the Parahelp management team for generously sharing their valuable knowledge to help those who come after them. The details are in the link below for you to review, but my key takeaway from their know-how is this: "The time spent writing the prompt itself isn't long, but what's crucial is dedicating time to the continuous process of executing, evaluating, and improving that prompt."

When we write prompts in a chat, we often want an immediate answer and tend to aim for "100% quality on the first try." However, it seems the style in cutting-edge prompt engineering is to meticulously refine a prompt through numerous revisions. For an AI startup to earn its clients' trust, this expertise is essential and may very well be the source of its competitive advantage. I believe "iteration" is the key for prompts as well.

 

2. Prompts That Look Like a Computer Program

Let's take a look at a portion of the published prompt. This is a prompt for an AI agent to behave as a manager, and even this is only about half of the full version.

structures of prompts

Here is my analysis of the prompt above:

  • Assigning a persona (in this case, the role of a manager)

  • Describing tasks clearly and specifically

  • Listing detailed, numbered instructions

  • Providing important points as context

  • Defining the output format

I felt it adheres to the fundamental structure of a good prompt. Perhaps because it has been forged in the fierce competition of Silicon Valley, it is written with incredible precision. There's still more to it, so if you're interested, please view it via the link. It's written in even finer detail, and with its heavy use of XML tags, you could almost mistake it for a computer program. Incredible!
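As a generic illustration of that structure (and emphatically not Parahelp's actual prompt), a skeleton in the same style might look like this:

<role>
You are a customer-support manager agent. You review plans proposed by worker agents.
</role>
<task>
Approve or reject the proposed plan for resolving the customer's ticket.
</task>
<instructions>
1. Read the ticket and the proposed plan.
2. Check every step of the plan against the policies in <context>.
3. If any step violates a policy, reject the plan and explain why.
</instructions>
<context>
Refund requests above $100 require human approval.
</context>
<output_format>
Return exactly one line: "APPROVED" or "REJECTED: <reason>".
</output_format>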

 

3. The Future of Prompt Engineering

I imagine that committing this much time and cost to prompt engineering is a high hurdle for the average business person. After learning the basics of prompt writing, many people struggle with what the next step should be.

One tip is to take a prompt you've written and feed it back to the generative AI with the task, "Please improve this prompt." This is called a "meta-prompt." Of course, the challenges of how to give instructions and how to evaluate the results still remain. At Toshi Stats, we plan to explore meta-prompts further.

 

So, what did you think? Even the simple term "prompt" has a lot of depth, doesn't it? As generative AI continues to evolve, and as methods for building multi-agent systems advance, I believe prompt engineering itself will also continue to evolve. It's definitely something to keep an eye on. I plan to provide an update on this topic in the near future.

That's all for today. Stay tuned!

 

ToshiStats Co., Ltd. offers various AI-related services. Please check them out here!

 

Copyright © 2025 Toshifumi Kuga. All rights reserved.

1) Prompt design at Parahelp, Parahelp, May 28, 2025

 






Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.





Google DeepMind Announces "AlphaEvolve," Hinting at an Intelligence Explosion!

Google DeepMind has unveiled a new research paper today, introducing "AlphaEvolve" (1), a coding agent that leverages evolutionary computation. It's already garnering significant attention due to its broad applicability and proven successes, such as discovering more efficient methods for matrix calculations in mathematics and improving efficiency in Google's data centers. Let's dive a little deeper into what makes it so remarkable.

 

1. LLMs Empowered with Evolutionary Computation

In a nutshell, "AlphaEvolve" can be described as an agent that leverages LLMs to the fullest to evolve code. Briefly, "evolutionary computation" is a family of algorithms that mimics biological evolution to improve systems, replicating genetic crossover and mutation on a computer. Traditionally, the function responsible for this, called an "operator," had to be designed by humans. AlphaEvolve automates the creation of operators with the support of LLMs, enabling more efficient code generation. That sounds incredibly powerful! While evolutionary computation itself isn't new, with practical applications dating back to the 2000s, its combination with LLMs appears to have unlocked new capabilities. The red box in the diagram below indicates where evolutionary computation is applied.
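To make the idea concrete, here is an illustrative sketch (not AlphaEvolve's actual implementation) of an evolutionary loop in which the LLM plays the role of the crossover/mutation operator. propose_variant() and evaluate() are placeholders.

import random

def propose_variant(parents: list[str]) -> str:
    # In AlphaEvolve-style systems, an LLM rewrites code given one or more parent programs.
    raise NotImplementedError

def evaluate(program: str) -> float:
    # Problem-specific score, e.g. runtime or solution quality of the generated code.
    raise NotImplementedError

def evolve(seed_programs: list[str], generations: int = 50, population_size: int = 20):
    population = [(p, evaluate(p)) for p in seed_programs]
    for _ in range(generations):
        parents = [p for p, _ in random.sample(population, k=min(2, len(population)))]
        child = propose_variant(parents)
        population.append((child, evaluate(child)))
        # Keep only the fittest programs for the next generation.
        population = sorted(population, key=lambda x: x[1], reverse=True)[:population_size]
    return max(population, key=lambda x: x[1])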

 

2. Continued Evolution with Meta-Prompts

I'm particularly intrigued by the "prompt_sampler" mentioned above because this is where "meta-prompts" are executed. The paper explains, "Meta prompt evolution: instructions and context suggested by the LLM itself in an additional prompt-generation step, co-evolved in a separate database analogous to the solution programs." It seems that prompts are also evolving! The diagram below also shows that accuracy decreases when meta-prompt evolution is not applied compared to when it is.

This is incredible! With an algorithm like this, I'd certainly want to apply it to my own tasks.

 

3. Have We Taken a Step Closer to an Intelligence Explosion?

Approximately a year ago, researcher Leopold Aschenbrenner published a paper (2) predicting that computers would surpass human performance by 2030 as a result of an intelligence explosion. The graph below illustrates this projection. This latest "AlphaEvolve" can be seen as having acquired the ability to improve its own performance. This might just be a step closer to an intelligence explosion. It's hard to imagine the outcome of countless AI agents like this, each evolving independently, but it certainly feels like something monumental is on the horizon. After all, computers operate 24 hours a day, 365 days a year, so once they acquire self-improvement capabilities, their pace of evolution is likely to accelerate. He refers to this as "recursive self-improvement" (p47).

 



What are your thoughts? The idea of AI surpassing humans can be a bit challenging to grasp intuitively, but just thinking about what AI agents might be like around 2027 is incredibly exciting. I'll be sure to provide updates if a sequel to "AlphaEvolve" is released in the future. That's all for now. Stay tuned!

 


1) AlphaEvolve: A coding agent for scientific and algorithmic discovery, Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog, Google DeepMind, May 16, 2025

2) Situational Awareness: The Decade Ahead, Leopold Aschenbrenner, June 2024


 


Copyright © 2025 Toshifumi Kuga. All rights reserved.

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes, the software and the contents.

We Built a Customer Complaint Classification Agent with Google's New AI Agent Framework "ADK"

On April 9th, Google released a new AI agent framework called "ADK" (Agent Development Kit) (1). It's an excellent framework that incorporates the latest multi-agent technology while remaining user-friendly, allowing an agent to be implemented in about 100 lines of code. At Toshi Stats, we decided to immediately try creating a customer complaint classification agent using ADK.

 

1. Customer Complaint Classification Task

Banks receive various complaints from customers. We want to classify these complaints based on which financial product they concern. Specifically, this is a 6-class classification task where we choose one from the following six financial products. Random guessing would yield an accuracy below 20%.

Financial products to classify

 

2. Implementation with ADK

Now, let's move on to the ADK implementation. We'll defer to the official documentation for the file structure and other details, and instead show how to write the AI agent below. The "instruction" part is particularly important; writing this carefully improves accuracy. This is what's known as a "prompt." In this case, we've specifically instructed it to select only one of the six financial products. Other parts are largely unchanged from the tutorials. It has a simple structure, and I believe it's not difficult once you get used to it.

AI agent implementation with ADK
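For readers who want a feel for the code, here is a minimal sketch based on the ADK quickstart pattern. The model name and instruction wording are illustrative; the agent built for this post differs in its details.

from google.adk.agents import Agent

root_agent = Agent(
    name="complaint_classifier",
    model="gemini-2.0-flash",  # replace with the Gemini model you have access to
    description="Classifies bank customer complaints by financial product.",
    instruction=(
        "You are a complaint categorization agent. Read the customer complaint and answer "
        "with exactly one of: Mortgage; Checking or savings account; Student loan; "
        "Money transfer, virtual currency, or money service; Bank account or service; "
        "Consumer Loan. Output only the category name."
    ),
)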

 

3. Accuracy Verification

We created six classification examples and had the AI agent provide answers. In the first example, I believe it answered "student loan" based on the word "graduation." It's quite smart! Also, in the second example, it's presumed to have answered "mortgage" based on the phrase "prime location." ADK has a built-in UI like the one shown below, which is very convenient for testing immediately after implementation.

ADK user interface

The generative AI model used this time, Google's "gemini-2.5-flash-04-17," is highly capable. When tasked with a 6-class classification problem using 100 actual customer complaints received by a bank, it typically achieves an accuracy of over 80%. For simple examples like the ones above, it wouldn't be surprising if it achieved 100% accuracy.

 

So, what did you think? This was our first time covering ADK, but I feel it will become popular due to its high performance and ease of use. Combined with A2A (2), which was announced by Google around the same time, I believe use cases will continue to increase. We're excited to see what comes next! At Toshi Stats, we will continue to build even more advanced AI agents with ADK. Stay tuned!

 



1) Agent Development Kit, Google, April 9th, 2025
2) Agent2Agent, Google, April 9th, 2025

 



Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.