Maximizing Customer Retention: Churn Prevention Strategies Using AI and Machine Learning

It is always sad when customers who have taken the time to purchase our products or services end up leaving. If possible, we want to catch the signs early and take action to prevent them from churning. However, identifying customers who are likely to churn beforehand is no easy task. That is why, this time, I tried creating a customer churn prediction model. I would like to take on the challenge of predicting and countering customer churn using machine learning and generative AI. For formulating the key churn prevention strategies, I used Gemini 3.5 Flash (1), which offers a fantastic balance of performance and cost. Let's get started.

Gemini 3.5 Flash

 

1. Customer Churn Prediction Model

Using the created customer churn prediction model, let's first take a look at a general customer.

General Customer

The churn probability is 15.0%, indicating a "high likelihood of continuation," so no countermeasures are needed at this time. That's a relief.

SHAP Analysis of a General Customer

At this point, some of you might be wondering, "But why did the model decide that the likelihood of continuation is high?" This is where "SHAP" (2), shown in the figure above, comes into play. Simply put, a SHAP value is a "value that indicates which data influenced the model's decision and to what extent." The SHAP graph for this customer extends significantly to the left, in the negative direction, which means the features are pushing the churn probability down. SHAP values are computed for each individual customer, so they show why the model made its decision for that specific customer. This is very helpful when we try to understand the results.
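To make the idea of a SHAP value more concrete, here is a minimal sketch that computes exact Shapley values for a toy churn score by averaging each feature's marginal contribution over all feature orderings. The scoring function, the customer, and the baseline are hypothetical and are not the model used in this article; the SHAP library computes the same kind of attribution far more efficiently for real models.

```python
from itertools import permutations

# Hypothetical toy churn score (NOT the article's model): higher means more likely to churn.
def churn_score(features):
    score = 0.30
    score -= 0.004 * features["tenure"]           # longer tenure lowers risk
    score += 0.002 * features["MonthlyCharges"]   # higher charges raise risk
    if features["Contract"] == "Month-to-month":
        score += 0.10
    return score

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)            # start from the baseline customer
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature at a time
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

# Assumed reference customer and an assumed at-risk customer.
baseline = {"tenure": 32, "MonthlyCharges": 65.0, "Contract": "One year"}
customer = {"tenure": 2, "MonthlyCharges": 95.0, "Contract": "Month-to-month"}

phi = shapley_values(churn_score, customer, baseline)
for name, value in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:15s} {value:+.4f}")
```

The values add up to the difference between this customer's score and the baseline score, which is why a bar extending to the right can be read as "this feature raised the churn probability."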

 

2. How to Prevent Customer Churn

Now, let's look at a customer who is on the verge of churning. Unlike before, the churn probability is 54.5%, indicating a "high likelihood of churning," which suggests that some countermeasures are necessary.

Customer Likely to Churn

Analysis of a Customer Likely to Churn

SHAP Analysis of a Customer Likely to Churn

You can see that the SHAP graph, unlike the previous one, extends significantly to the right. In particular, the SHAP values for tenure and MonthlyCharges are large, making them the main factors that increased this customer's churn probability.

Also, in the explanatory text for "Individual Customer Analysis," it states:

“To retain this customer, we recommend proactive outreach with a targeted retention offer. Specifically, we can address their high monthly charges by offering a loyalty discount, or incentivize them to transition from a flexible month-to-month contract to a more stable longer-term contract.”

This is a personalized retention measure for this specific customer. It is not a generic strategy. This is because, as stated in the explanation, it was created by the generative AI, Gemini 3.5 Flash, based on the individual customer's analysis results:

“The primary factors driving up their churn risk are their tenure (SHAP: +0.2620), high monthly charges (SHAP: +0.0655), and having a month-to-month contract (SHAP: +0.0311).”

It is trustworthy precisely because it is a measure tailored to the individual customer's situation. Fantastic!
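As a rough illustration of how per-customer SHAP values can be handed to a generative AI, the sketch below ranks the churn drivers and formats them into a prompt. The SHAP values and the 54.5% probability are the ones quoted above; the helper functions and the prompt wording are hypothetical, and no actual Gemini call is made.

```python
# SHAP values quoted in the explanation above; the helpers and prompt wording are hypothetical.
shap_values = {
    "tenure": 0.2620,
    "MonthlyCharges": 0.0655,
    "Contract (month-to-month)": 0.0311,
}

def top_churn_drivers(values, k=3):
    """Return the k features that push the churn probability up the most."""
    positive = [(name, v) for name, v in values.items() if v > 0]
    return sorted(positive, key=lambda item: item[1], reverse=True)[:k]

def build_retention_prompt(churn_probability, drivers):
    """Format the model output as text an LLM could turn into a retention suggestion."""
    lines = [f"- {name} (SHAP: {value:+.4f})" for name, value in drivers]
    return (
        f"This customer's predicted churn probability is {churn_probability:.1%}.\n"
        "The main factors increasing the risk are:\n"
        + "\n".join(lines)
        + "\nSuggest a retention measure tailored to these factors."
    )

drivers = top_churn_drivers(shap_values)
prompt = build_retention_prompt(0.545, drivers)
print(prompt)
```

Because the prompt carries this customer's own drivers, the suggestion that comes back can address high monthly charges or the contract type rather than repeating a generic strategy.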

 

3. For Further Development

In machine learning and AI, the quantity and quality of the input data are always the key. As these increase, more diverse analyses become possible, and accuracy improves. In other words, I believe it is possible to elevate this into a marketing analytics platform in the future. I am really looking forward to its future developments. As the core technologies for this product development, I used Google Gemini 3.5 Flash for natural language processing, Choice-Learn for machine learning, and Google ADK for AI agent implementation. For app development, I am using Claude Code. These core technologies do not need to be fixed forever; I think it is best to use or replace them flexibly as needed. Since technology advances quickly, I plan to adopt the best tools available at any given time.

 

What did you think? I felt that with "Machine Learning + AI," we can create fantastic products where they complement each other perfectly. I'm excited about future developments. Here at ToshiStats, we plan to continue tackling tasks in the marketing field using "Machine Learning + AI." Stay tuned!

 

You can enjoy our video news “ToshiStats AI Weekly Review” from this link, too!

1) Gemini 3.5 Flash Best for frontier performance across agents and coding,  Google DeepMind
2) Welcome to the SHAP documentation

Copyright © 2026 ToshiStats Co., Ltd. All rights reserved.

Notice: This is for educational purposes only. ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the report, the codes and the software.

Can AI Agents Invent New Economics? The Future of Theory Generation

Economics offers many theories that are highly useful in business. For example, the economic theory awarded the 2020 Nobel Prize in Economics (1) made a massive contribution to the design of spectrum allocation auctions in the United States, successfully creating a market worth over 100 billion dollars. So this time, I built a simple app around the proof assistant LEAN (2) to see whether an AI agent using LEAN can be applied to economics. Let's get right into it.

 

1. Overview of the Developed App

Here is the app's screen. Because of my interest in building new markets, I named it the "Market Designs Verification App." It takes economic propositions and hypotheses and uses LEAN to automatically determine if there are any logical contradictions. I used Google Gemini 3.5 Flash for the natural language processing and Google ADK for the AI agent implementation. For the app development itself, I used Claude Code.

               Market Designs Verification App

The proposition provided this time is as follows. It serves as an example of a simple business plan.

"I would like to report on the premise for goal setting for the next business plan. First, as for our current status, we have secured a solid baseline of at least 100 million yen in Annual Recurring Revenue (ARR). Building on this achievement, our policy for the next term is to set our target ARR at 1.7 times our current level (70% growth), aiming for further business expansion.

On the other hand, while pursuing growth, it is also necessary to develop a plan that takes into account the realistic constraints of the business environment. Considering the current framework of our internal budget and the limitations of our target market size, we do not intend to pursue an open-ended target ARR for the next term. Instead, we anticipate aiming for a realistic landing with a maximum cap of 150 million yen."

I want to verify this hypothesis using LEAN to confirm whether it is feasible. Of course, I want to avoid any hallucinations caused by the LLM.

 

2. Looking at the Analysis Report

When I ran the app, it generated the following report in about 3 minutes. Let's take a look. First, the hypothesis, judgment, and conclusion are summarized. The bottom line is that the growth target exceeds the constraints, so the plan cannot be executed as stated.

Final Report

The actual verification process using LEAN is displayed. It is a bit complex.

A DAG (Directed Acyclic Graph) is also used. In LEAN, if the file compiles without errors, the verification is complete, meaning the proof has been established. Because every step is checked strictly, the results are reliable. It is reassuring that a hallucinated step, like those seen with LLMs, cannot slip through the check.
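For readers who have not seen LEAN code, here is a minimal sketch of how the three premises of the business plan might be stated and shown to be contradictory. It is a hypothetical formalization written for this explanation (Lean 4 with Mathlib assumed, amounts in millions of yen), not the code the app actually generated.

```lean
import Mathlib.Data.Real.Basic
import Mathlib.Tactic.Linarith

-- Hypothetical formalization; not the app's actual output.
-- p1: current ARR is at least 100.
-- p2: target ARR is 1.7 times current ARR (written as 17 / 10).
-- p3: target ARR is capped at 150.
theorem plan_is_infeasible (arr target : ℝ)
    (p1 : 100 ≤ arr)
    (p2 : target = 17 / 10 * arr)
    (p3 : target ≤ 150) : False := by
  linarith
```

If the premises were consistent, no proof of `False` could compile, so a successful compilation is itself the evidence that the plan contradicts itself.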

 

3. Implications Obtained

The implications obtained are as follows:

5. Economic Implication

The three business premises cannot all hold at once. To make the plan feasible, at least one premise must be relaxed. Concretely:

  • (a) Relax P1: lower the current-ARR floor 100 to ≤ 88.2353 (start from a smaller base).

  • (b) Relax P2: reduce the growth multiplier to ≤ 1.5× (keep multiplier × 100 ≤ 150).

  • (c) Relax P3: raise the target-ARR cap 150 to ≥ 170 (revisit budget / market-size constraints).
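The three thresholds in the report can be re-derived with simple arithmetic. The sketch below is only a cross-check written for this explanation, using the figures from the business plan (in millions of yen) and exact fractions to avoid rounding issues; the variable names are my own.

```python
from fractions import Fraction

# Figures from the business plan, in millions of yen.
arr_floor = Fraction(100)       # P1: current ARR is at least 100
multiplier = Fraction(17, 10)   # P2: target = 1.7 x current ARR
target_cap = Fraction(150)      # P3: target ARR is at most 150

# Smallest possible target under P1 and P2; the plan is feasible only if it fits under the cap.
min_target = arr_floor * multiplier
feasible = min_target <= target_cap

# Each relaxation changes one premise while keeping the other two.
max_arr_floor = target_cap / multiplier   # (a) largest ARR floor compatible with P2 and P3
max_multiplier = target_cap / arr_floor   # (b) largest multiplier compatible with P1 and P3
min_target_cap = arr_floor * multiplier   # (c) smallest cap compatible with P1 and P2

print(f"minimum target: {float(min_target):.1f}, feasible: {feasible}")
print(f"(a) ARR floor <= {float(max_arr_floor):.4f}")
print(f"(b) multiplier <= {float(max_multiplier):.1f}")
print(f"(c) cap >= {float(min_target_cap):.0f}")
```

The minimum target of 170 exceeds the cap of 150, and the three relaxations come out as 88.2353, 1.5, and 170, matching (a), (b), and (c) in the report.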

We then need to put these results to use in business decision-making. This time, the results are well summarized and easy to understand, which is where the flexibility of LLMs comes into play. It is reassuring that, if even complex management strategies can be verified with LEAN, we can automatically determine whether they can be executed without contradictions. In the future, I expect this to become an excellent advisor in boardroom meetings.

 

What did you think? It was a simple hypothesis verification, but I believe it is entirely possible to apply LEAN to economics. I felt that combining the flexibility of LLMs with the strictness of LEAN creates a wonderful system where they beautifully complement each other. I am looking forward to future developments. At ToshiStats, we plan to continue tackling tasks in the field of economics using LLM + LEAN. Stay tuned!






1) Stanford Economists Paul Milgrom and Robert Wilson Win the Nobel in Economic Sciences, Stanford Graduate School of Business, Oct 12, 2020
2) LEAN

 








Logic-Powered Agents: How LLMs' Evolution in Math Is Shaping the Future of AGI

On June 3rd, a new research paper (1) was released by Google. It states that difficult mathematical proofs were solved by combining the LLM Gemini 3.1-pro with a mathematical proof language called LEAN (2). This time, I would like to delve into this paper and consider what kind of developments we can expect from this new AI agent in the future, beyond the framework of mathematics.

 

1. The Synergistic Effect of the LLM's Flexibility and LEAN's Strictness

The paper introduces an AI agent called LEAP. It uses only Gemini 3.1-pro as the LLM, with no task-specific fine-tuning; the model is used straight out of the box. Even so, it is reported to demonstrate outstanding exploration capabilities in mathematical proofs. Because no additional training is required, it can be put to use immediately, which is very convenient in practice. Furthermore, because LEAN is used alongside it, a proof whose logic is contradictory because of a hallucination causes a compilation error and is automatically rejected.

         LEAP (LLM-in-Lean Environment Agentic Prover)

Since LLM responses can fluctuate probabilistically, humans need to verify them in detail when conducting rigorous arguments. However, by introducing LEAN, this process has been automated. This is very reassuring. It seems that this fantastic result was achieved by combining the flexibility of the LLM and the strictness of LEAN in this way. Let's look closer.
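The division of labor described above, where a flexible proposer is paired with a strict checker, can be sketched as a small loop. This is only a toy illustration under my own assumptions and not LEAP's implementation: the "LLM" is a fixed list of candidate claims, the "checker" simply re-computes an arithmetic claim, and unlike a real agent the error message is printed rather than fed back to the proposer.

```python
from typing import Callable, Iterable, Optional

def propose_and_verify(
    candidates: Iterable[str],
    verifier: Callable[[str], Optional[str]],
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate the verifier accepts, or None if none passes.

    `verifier` returns None on success or an error message on failure,
    mimicking a compiler that either accepts a proof or reports an error.
    """
    for attempt, candidate in enumerate(candidates, start=1):
        if attempt > max_attempts:
            break
        error = verifier(candidate)
        if error is None:
            return candidate
        print(f"attempt {attempt} rejected: {error}")
    return None

# Stand-in for the LLM: a fixed sequence of proposed claims about 17 * 23.
def toy_proposals():
    yield "17 * 23 = 381"   # plausible-looking but wrong, like a hallucinated step
    yield "17 * 23 = 391"   # correct

# Stand-in for the proof checker: strictly re-computes the claim.
def toy_verifier(claim: str) -> Optional[str]:
    left, right = claim.split("=")
    a, b = (int(x) for x in left.split("*"))
    if a * b == int(right):
        return None
    return f"{left.strip()} is {a * b}, not {right.strip()}"

accepted = propose_and_verify(toy_proposals(), toy_verifier)
print("accepted:", accepted)
```

The point of the design is that the proposer is allowed to be wrong: only candidates that survive the strict check are ever returned.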

 

2. The Structure and Accuracy of LEAP

Here is the structure of LEAP. The figure on the left is the roadmap for the theorem to be proved using LEAN. Technically, it forms a structure called a DAG (Directed Acyclic Graph). Complex mathematical proofs are not completed in a single attempt; the proof progresses by going back and forth between the LLM and LEAN several times. The key here is the section in the red frame, where the LLM describes an INFORMAL BLUEPRINT in natural language and converts it into a FORMAL SKETCH in LEAN. Furthermore, a two-tier review by LEAN and the LLM awaits. LEAN verifies whether the new proof method has any contradictions, and the LLM's review verifies whether that method is genuinely effective. In other words, the LLM acts as a pilot in the search for proof methods. Even though it's just using Gemini 3.1-pro as is, its potential is truly surprising.

                LEAP workflow

Now, let's look at the results of applying LEAP to an actual task. It tackled the notoriously difficult Putnam 2025. Putnam 2025 contains twelve undergraduate-level problems from the 86th William Lowell Putnam Mathematical Competition, a highly challenging North American mathematics competition.

Looking at the DAG, you can see how the proof actually progresses. In this example of Putnam 2025 Problem A6, you can see layers upon layers of connected proofs. It's certainly a difficult problem. The green indicates the parts that have already been proven.

            DAG example for Putnam 2025 Problem A6
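To show what such a DAG looks like as data, here is a minimal sketch that stores a proof plan as a dependency graph, derives one valid proving order, and lists the nodes that can be attempted next. The node names and the set of proven ("green") nodes are hypothetical and are not taken from the Putnam 2025 A6 proof.

```python
from graphlib import TopologicalSorter

# Hypothetical proof plan: each node maps to the lemmas it depends on.
# The node names are illustrative and are not taken from the Putnam A6 proof.
dependencies = {
    "main_theorem": {"lemma_a", "lemma_b"},
    "lemma_a": {"lemma_c"},
    "lemma_b": {"lemma_c", "lemma_d"},
    "lemma_c": set(),
    "lemma_d": set(),
}

proven = {"lemma_c"}  # the "green" nodes

# A valid proving order: every lemma appears after everything it depends on.
order = list(TopologicalSorter(dependencies).static_order())

# Nodes that can be attempted now: not yet proven, and all dependencies proven.
ready = sorted(
    node for node, deps in dependencies.items()
    if node not in proven and deps <= proven
)

print("one valid order:", order)
print("ready to attempt:", ready)
```

Because the graph has no cycles, such an order always exists, and the main theorem can only be closed once every lemma beneath it has turned green.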

The results, as shown below, were that LEAP answered all questions correctly. An overwhelming accuracy.

                Results on Putnam 2025

You can see that while the original Gemini 3.1-pro couldn't score at all, it was able to demonstrate tremendous capabilities by combining it with LEAN. I think it is truly a breakthrough.

 

3. Beyond Mathematical Proofs into Various Fields

What we have seen so far were tasks related to mathematical proofs. With LEAP being able to construct such perfect logic, I felt it would be a waste to keep it confined solely to mathematics. In particular, its application to economics, which is directly linked to business practices, has immense scope and depth, and I believe it can contribute to expanding the areas where LLMs can be active. Economics is also generally described using mathematical formulas, so I think it has a high affinity with LEAP. A paper (3) on its application to economics has already been published, so if you are interested, please do give it a try.

 

What did you think? I believe the combination of LLMs and LEAN will be expanded and improved in various ways in the future. It might be stepping closer to AGI. It's very exciting.

At ToshiStats, we plan to take on tasks in the field of economics moving forward. Stay tuned!

 


1) LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks, 3 Jun 2026, Google
2) LEAN
3) We Can't Agree to Disagree, Formally: Aumann's Theorem and Assumption Accounting in Lean, May 27, 2026, Ruize Chen, Ben Eltschig, Ken Ono, Jujian Zhang (Axiom Math), Scott Duke Kominers (Harvard University; a16z crypto)


A Game-Changer for Financial Analysts: How Opus 4.8 Redefines Financial Research!

Anthropic has announced the update of its generative AI, Claude Opus 4.8. This update came less than 40 days after the previous one, which came as a bit of a surprise, but it may indicate that their internal development efficiency has increased significantly. Therefore, in this article, I would like to take on the challenge of using a combination of Claude Code and Opus 4.8 to conduct a financial analysis using US financial statements and create an investment memo.

 

1. Opus 4.8: The Most Powerful Model at Present

As always, when a new generative AI model is released, I compare its performance with existing models. The introduction page for Opus 4.8 (1) features the comparison table shown below. It is reported to have outperformed existing models in almost all areas. While strong coding capability is a tradition for the Opus series, what caught my attention was its exceptional strength in knowledge work. As indicated by the red box, it has achieved excellent results in two benchmarks that measure knowledge work capabilities.

Opus 4.8 Performance Comparison

Therefore, in this article, I would like to verify the potential of Opus 4.8 regarding knowledge work.

 

2. Taking On the Creation of an Investment Memo

This time, I will attempt to create an investment memo for Google using Form 10-K, the annual report filed with the US SEC. An investment memo is an internal document that investors use to make a final in-house decision (approval) on whether to invest in a specific company. Normally, financial analysts draw on their expertise to write it from source materials. This time, I would like to try automating that process.

First, I used the plan mode of Claude Code to formulate an implementation plan. I created a detailed plan this time as well. The following shows the initial part of it, but the actual plan continues further.

Implementation Plan

After reviewing the created implementation plan and confirming there were no issues, I switched Claude Code to auto mode and actually started coding. This time, the implementation was completed all at once in about 30 minutes without stopping midway. Once I gave the green light, there was no human intervention required. It was a moment where I caught a glimpse of the true capability of Opus 4.8.

Normally, you would need a "prompt" that defines and instructs how to write each section of the investment memo, but I did not need to write it myself. Here too, Opus 4.8 automatically generated the "prompts" for me. The following is an example of this, and it is well-written without missing any key points. It is truly amazing.

Generated Prompt Example

 

3. Reviewing the Investment Memo

In this experiment, I had the investment memo created in both English and Japanese and output as PDF files. Let's take a look at the content right away. It summarizes the overview beautifully in the opening section, as shown below. It looks very sophisticated.

Investment memo by Claude Code with Opus 4.8

It also summarizes the investment theme concisely as follows.

The investment memo this time exceeds 10 pages in total, so I cannot introduce the full text here, but I would like to look specifically at the section on competitive advantage analysis.

I think it is very well summarized. If the process can be automated to this extent, humans only need to review it, which will dramatically increase work efficiency. Furthermore, if you desire a deeper analysis leveraging domain knowledge, you can simply rewrite the "prompts." This means you can proceed based on existing work, allowing for smooth and efficient collaboration between humans and generative AI. It is wonderful. By the way, please understand that these texts were created for educational purposes and cannot be used for making investment decisions.
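To illustrate why rewriting the "prompts" is enough to deepen the analysis, here is a minimal sketch of one way per-section prompts could be organized. The section names, the template wording, and the `render_prompt` helper are all hypothetical; the prompts that Opus 4.8 actually generated in this experiment are not reproduced here.

```python
# Hypothetical section prompts; the prompts generated in this experiment are not reproduced here.
SECTION_PROMPTS = {
    "Executive Summary": "Summarize the investment case for {company} in under 200 words.",
    "Competitive Advantage": (
        "Using only the 10-K excerpt below, describe the sources of {company}'s "
        "competitive advantage and how durable each one is."
    ),
}

def render_prompt(section: str, company: str, excerpt: str) -> str:
    """Fill a section template and attach the source text the answer must rely on."""
    instruction = SECTION_PROMPTS[section].format(company=company)
    return f"[{section}]\n{instruction}\n\n10-K excerpt:\n{excerpt}"

# Deepening the analysis means editing one template, not the rest of the pipeline.
SECTION_PROMPTS["Competitive Advantage"] += " Compare against the two closest competitors."

prompt = render_prompt("Competitive Advantage", "Google", "<text extracted from Form 10-K>")
print(prompt)
```

Keeping the domain knowledge in the templates means an analyst can refine a section without touching the code that reads the filing or produces the PDF.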

 

What did you think? I took on the creation of an investment memo using Claude Code and Opus 4.8, and the results exceeded my expectations. I believe the performance of Opus 4.8 in knowledge work was outstanding. However, I would like to emphasize that a final review by a human is absolutely necessary. It is important to bear in mind that hallucinations can still occur. Moving forward, cooperation between generative AI and humans will continue to be essential.

At ToshiStats, we plan to take on various tasks using Opus 4.8. Stay tuned!

 


1) Introducing Claude Opus 4.8,   May 28, 2026,  Anthropic PBC


Is Google Omni One Step Closer to AGI? Testing It in a 10-Second Video

The other day, Google held its annual developer conference, Google I/O, where they announced "Gemini Omni," a new multimodal generative AI. Google has championed AGI (Artificial General Intelligence) since its inception, viewing multimodal AI as an essential requirement to achieve it. In this article, we will use "Gemini Omni" to examine just how much closer we have come to AGI.

 

1. What Kind of AI is "Gemini Omni"?

First, let's look at the explanation released by Google (1).

"We’re introducing Gemini Omni, where Gemini’s ability to reason meets the ability to create. Omni is our new model that can create anything from any input — starting with video. With Omni, you can combine images, audio, video and text as input and generate high-quality videos grounded in Gemini's real-world knowledge. You can also easily edit your videos through conversation. Gemini Omni Flash is a model that can create anything from any input – starting with video."

In short, it can be described as "a generative AI that can take any form of information as input and output it in any format." It appears that "Omni" understands 3D spatial information, visual elements, and physical laws—such as objects falling downward—which are difficult to grasp through text alone. This is truly a massive leap forward toward AGI.

The Omni Flash model that debuted this time is limited to video output. However, in line with the "any-to-any" concept, support for output across all formats is highly anticipated in the next version. It is something to look forward to.

 

2. The Task: Singing to a Given Theme

So, how capable is Omni Flash in practice? Can it successfully integrate various forms of information? Can it maintain consistency in its output? To test this, we will use the image below, add a prompt, and see if it can sing emotionally based on a specific theme. She is Leia, an instructor at ToshiStats Co. She is a familiar face on YouTube, but this time she is participating in our experiment.

Leia, Instructor at ToshiStats Co., Ltd.

For this experiment, we prepared the following prompt:

"She is singing 'Kita-wing' in English. It is 80s Japanese pop. This must be 1. An urban and bittersweet melody, 2. about emotion of an independent, mature woman for love, 3. provide courage for action, 4. A movie-like scenery born from a 'midnight flight', 5. A deep, plaintive, and vibrating long vibration. 6, This scene is needed 'An airplane gliding through the midnight sky above the glowing metropolis.'."

We entered this prompt along with the image above. We believe this makes the singing theme reasonably clear. In particular, we want to focus on how well it can express emotional nuances, such as item 1: "An urban and bittersweet melody."

While you can listen to Leia’s actual singing later on YouTube, let's walk through the analysis first. Although the original Leia had a bright smile, the singing Leia looks somewhat sorrowful.

When it transitions to a close-up, those emotions become very clear.

We specified in the prompt to incorporate a "midnight flight" scene. It has indeed been inserted effectively. In the actual video, the airplane moves slowly.

Her physical expressions and body language look natural as she conveys emotion. It is impressive.

Actually, the video ended right at the climax. Ah, what a pity. I wanted to hear more. The maximum generation time for the current Omni Flash is 10 seconds, so it cannot be helped. Let's look forward to an extended generation time in the next version update.

Please take a moment to listen to Leia's song. Both English and Japanese versions are available. The English version is nearly perfect, but the Japanese version has a few parts where the pronunciation is slightly unclear. This is an area for improvement.

 

3. The Roadmap to AGI

In this test, Omni Flash consistently generated quite difficult emotional expressions. It understood the meaning and context—keeping her original clothing unchanged while swapping out only the background to match the theme—to create the video. Its adherence to the prompt was also excellent. While the short generation time remains a bottleneck, the content itself deserves high praise.

It is highly probable that Google will use Omni Flash as a starting point to accelerate its development toward AGI. The AI industry is currently suffering from a shortage of GPU supplies, and Google has become one of the few actively speaking out about AGI. Ultimately, being able to develop and produce their own computing resources, such as TPUs, gives them an overwhelming advantage. Demis Hassabis, CEO of Google DeepMind, who is leading the development of Omni Flash, has stated that AGI is "just a few years away" (2).

 

What did you think? Through this experiment, we confirmed the latent potential of the new multimodal generative AI "Omni" and discussed its possibilities for achieving AGI. Here at ToshiStats, we will continue to explore various ideas under the theme of "Road to AGI." Stay tuned!

 




1) Introducing Gemini Omni, Google
2) A new era of discovery: AI and the frontiers of science with Demis Hassabis, May 22, 2026, Google for Developers


"Agentic Commerce and Agentic Payments: The Next Game Changers for the Financial Industry?"

On May 7, 2026, Mitsubishi UFJ Financial Group, Inc. (hereinafter referred to as MUFG) and Google announced a strategic partnership in the retail sector. They stated that they will collaborate to create new financial services and customer experiences within Japan's retail finance industry. The fact that MUFG, one of Japan's largest financial conglomerates, has teamed up with Google—often regarded as the strongest among AI giants—has an extremely significant impact. In this article, out of several key points, I would like to delve deeper, focusing particularly on AI agents.

 

1. Agentic Commerce and Agentic Payments

First, let's take a closer look at the release from MUFG regarding the section on AI agents (1).

Content of the Partnership (1)

Next-Generation Financial Experiences Supported by AI Agents, spanning from Purchases and Payments to Financial Transaction Decision-Making: Initiatives Toward Autonomous Finance, including Agentic Commerce / Agentic Payments

  • We will collaborate with an eye toward early domestic realization in the fields of "Agentic Commerce" and "Agentic Payments," where AI agents autonomously support a continuous series of processes from product selection and purchasing to payment execution.

  • Google Cloud plans to leverage its expertise in AI and cloud infrastructure to provide MUFG with cloud and AI technologies, as well as technical advice and development support for these initiatives.

  • Through this partnership, MUFG aims to build a next-generation payment infrastructure on Google Cloud to realize Agentic Commerce / Agentic Payments, striving to establish a new standard for purchasing and payments in the AI agent era in Japan.

  • Furthermore, by having AI agents that cooperate at a high level on this same platform support decision-making processes in daily purchases, payments, and various procedures, we aim to realize a new form of finance (autonomous finance) that gently guides customers without burdening them, while respecting their intentions.

  • In addition to digital channels, we will integrate physical touchpoints such as branch offices and remote consultations. By having AI agents understand and support situations across channels, we will provide a consistent sense of security and convenience, while achieving continuous support tailored to each individual customer throughout their daily lives and life events.

As shown above, this is a highly ambitious strategy. In particular, the phrase "MUFG aims to build a next-generation payment infrastructure on Google Cloud to realize Agentic Commerce / Agentic Payments, striving to establish a new standard for purchasing and payments in the AI agent era in Japan" reads like a declaration that they intend to leave other domestic competitors far behind. The following chart is a conceptual diagram of Agentic Commerce / Agentic Payments (1). Next, let's think about why MUFG chose Google.

Conceptual Diagram of Agentic Commerce / Agentic Payments

 

2. Google’s AI Agent Protocol Suite is One of the Strongest in the World

The primary reason MUFG chose Google this time is presumably that the suite of AI agent protocols spearheaded by Google is one of the strongest in the world, leaving few alternatives. Starting with the release of the Agent Development Kit (ADK) (2) in April 2025, Google has successively released AI agent protocols (communication standards) such as A2A, AP2, and UCP, expanding its partner network and leading the industry in standardization (3). In particular, the Agent Payments Protocol (AP2) is specialized for payments, which must have been highly coveted by the financial industry. Each of these is now evolving as open-source software, but the fact that Google is driving them is nevertheless crucial. The following guide explains AI agent protocols well, and I highly recommend reading it (3).

Developer’s Guide to AI Agent Protocols (3)

 

3. Potential for Development from the Japanese Market to the Global Market

Future developments might be easier to understand when looked at from Google’s perspective. Google knows all too well how much of a competitive advantage can be gained by securing a de facto standard in software. A prime example of this is Android, the operating system for mobile devices. Companies that want to manufacture mobile devices typically adopt Android. This is because the Android ecosystem is fully established, and even if a company were to build a proprietary system from scratch now, no partner would willingly adopt a brand-new OS. Many of you probably use mobile devices that run on Android. Through this ecosystem, Google is always able to maintain a competitive advantage in mobile devices. If they can establish a position like Android's in purchasing and payments for the AI agent era, it will bring massive benefits. Although this partnership concerns the Japanese retail market, if it succeeds in the Japanese market, expanding it as-is into the global market would be easy. This is because, inherently, there are no national borders for AI agent protocol suites. We cannot take our eyes off future developments.

 

What do you think? It feels like a harbinger of AI agents entering the payment market in earnest. I am very much looking forward to seeing how future financial services will change.

At ToshiStats, we will continue to think about the evolution of AI agent protocols and financial services. Stay tuned!


You can enjoy our video news “ToshiStats AI Weekly Review” from this link, too!

1) Strategic Partnership between MUFG and Google in the Retail Sector, May 7, 2026, MUFG
2) Agent Development Kit (ADK)
3) Developer’s Guide to AI Agent Protocols, MARCH 18, 2026, Google





Copyright © 2026 ToshiStats Co., Ltd. All rights reserved.

Notice: This is for educational purpose only. ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the report, the codes and the software.

"Root Cause Analysis" is All You Need !

Have you ever tried to automate any classification tasks using Generative AI? I do this quite often, but occasionally, as the number of classification classes increases, the accuracy gradually drops to a point where it is no longer viable for practical business use. So, this time, I will tackle the task of classifying bank customer complaints (1) based on their root causes. There are 20 cause classes in total, making it a difficult problem where a random guess would yield only about a 5% accuracy rate. In the example below, the "text" column contains the customer complaint, and the "Issue," which is the underlying cause, is classified by an AI agent.

              Bank Customer Complaints

We are provided with a mere 100 samples. I would like to implement Root Cause Analysis (RCA) during the classification process to see just how crucial RCA is for improving accuracy. Let's get started right away.

 

1. What is RCA?

RCA stands for "Root Cause Analysis". When a problem occurs, it is a method used not just to resolve the superficial events (symptoms) you see, but to pinpoint the "true cause (root cause)" in order to prevent a recurrence. This time, I have designed the following RCA approach for classification failures:

Root Cause Analysis (RCA):

  • Record the success/failure of each sample.

  • Verification: Calculate the accuracy and generate an error analysis report.

  • Failure Analysis: If a classification error occurs, scrutinize the principle and conduct a Root Cause Analysis (RCA) on why it failed (e.g., confusion with similar categories, context complexity, etc.).

  • Create a principle improvement report based on the failure analysis results. Send this feedback to the generator. Take care to ensure the generator does not overfit.
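To make the four steps above concrete, here is a minimal Python sketch of the bookkeeping side of the RCA loop. It is not the code Claude Code generated for this experiment: the `Record` class, the function names, and the sample labels are all hypothetical, and the actual root-cause reasoning (done by the evaluator agent) is reduced here to counting which classes get confused with which.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    """One classified sample: the complaint text, the true label, and the prediction."""
    text: str
    expected: str
    predicted: str

    @property
    def correct(self) -> bool:
        return self.expected == self.predicted


def accuracy(records: list[Record]) -> float:
    """Share of samples classified correctly (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(r.correct for r in records) / len(records)


def confusion_pairs(records: list[Record]) -> Counter:
    """Count (expected, predicted) pairs among the failures only."""
    return Counter((r.expected, r.predicted) for r in records if not r.correct)


def improvement_report(records: list[Record], top_n: int = 3) -> str:
    """Summarize the most frequent confusions as feedback for the generator."""
    lines = [f"Accuracy: {accuracy(records):.1%}"]
    for (expected, predicted), count in confusion_pairs(records).most_common(top_n):
        lines.append(f"- '{expected}' misread as '{predicted}' ({count}x): clarify the boundary")
    return "\n".join(lines)


# Hypothetical labels, for illustration only.
sample = [
    Record("charged twice", "Fees", "Fees"),
    Record("unknown withdrawal", "Fraud", "Fees"),
]
print(improvement_report(sample))
```

Keeping the report short (only the top few confusions) is one simple way to limit how much the generator can tailor itself to individual samples.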

Now, as shown in the infographic below, let's actually take on the bank customer complaint classification task using an AI agent equipped with RCA capabilities.

               AI Agent with RCA Capabilities

Note that I referenced this paper (2) for this experiment. If you are interested, it is well worth checking out.

 

2. Implementing the Bank Customer Complaint Classification AI Agent using Claude Code

Once again, I used Anthropic's Claude Code to implement and analyze the AI agent as follows. First, I set it to Plan Mode, compiled what I wanted to accomplish into a PRD (Product Requirements Document), handed it over to Claude Code, and formulated an implementation plan. This PRD incorporates the Root Cause Analysis (RCA) explained above.

              Claude Code's Plan Mode

An implementation plan like the one below is formulated in about 5 minutes. The actual document is much longer, but I will only show the first part here. The important thing is to thoroughly review this implementation plan. It is long and can be tedious, but this stage allows you to confirm whether it aligns with the task's objectives before actually diving into coding. Anthropic's generative AI, Opus 4.7, is extremely high-performing; once it enters the implementation phase, it runs non-stop until the end. Since it is difficult for humans to intervene midway, the accuracy of the implementation plan holds the key to solving the task.

               Implementation Plan via Plan Mode

Since this implementation plan was well-crafted, I will proceed directly to implementation. I switch to Auto Mode as shown below and start coding. You can see the AI agent completing the implementation process step by step.

              Implementation via Auto Mode

This time, we iterated on the analysis and improvement 9 times, which ultimately took over 10 hours, but we obtained the results below. This is the result of classifying 100 randomly sampled customer complaints into 20 classes. You can see that the accuracy gradually improves thanks to the RCA feedback.

               Accuracy at Each Iteration

However, it seems to have overfitted due to repeating the process for far too long. I validated it with newly sampled data, but saw no improvement from iteration 7 onwards.

                 Accuracy on New Data
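One common guard against this kind of overfitting is to choose the iteration by accuracy on held-out data rather than on the original 100 samples. The sketch below assumes a simple patience rule. The function name and the accuracy values in the usage lines are illustrative, not the measurements from this experiment.

```python
def best_iteration(holdout_accuracy: list[float], patience: int = 2) -> int:
    """Return the index of the iteration to keep.

    Scanning stops once `patience` consecutive iterations fail to beat
    the best held-out accuracy seen so far.
    """
    if not holdout_accuracy:
        raise ValueError("need at least one accuracy value")
    best_index = 0
    stale = 0
    for i in range(1, len(holdout_accuracy)):
        if holdout_accuracy[i] > holdout_accuracy[best_index]:
            best_index = i
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_index


# Illustrative values only: accuracy on newly sampled data per iteration.
scores = [0.38, 0.45, 0.52, 0.58, 0.57, 0.58, 0.56]
print(best_iteration(scores))
```

With a rule like this, the loop could stop a few iterations after held-out accuracy flattens instead of running for the full 10 hours.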

 

3. Results and Challenges This Time

In this bank customer complaint classification task, the baseline accuracy using a generative AI "out of the box" without doing anything special was under 40%. By adopting a multi-agent system with a generator and an evaluator, and incorporating RCA feedback, we achieved just under 60% accuracy even on new data, so I believe the RCA was effective. However, once the accuracy on the original data exceeded 80%, overfitting occurred, so figuring out how to improve this is a future challenge.

 

What did you think? This time, I explicitly stated the RCA in the PRD, implemented it as a multi-agent system, and tackled the task of classifying bank customer complaints based on their causes. While the accuracy improved from under 40% to just under 60% on new data, overfitting remained an issue. To aim for an accuracy of 70% or higher on new data, another breakthrough might be necessary.

At ToshiStats, we plan to further develop RCA. Please look forward to it. Stay tuned!

 


1) Consumer Complaint Database
2) CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification, Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue (Steve) Liu, Xiaoxiao Li, Philip S. Yu (University of Illinois Chicago, MBZUAI, McGill University, Columbia University, Zhejiang University, University of British Columbia), April 12, 2026


The Race for AI Supremacy: Will Google Come Out on Top?

The AI market is a battlefield where diverse players like OpenAI, Anthropic, NVIDIA, Alibaba, and Tencent are engaged in fierce competition. Today, I want to focus on Google and delve into whether they can truly seize hegemony in the AI market in the near future.

 

1. Google’s Secret Weapon: The 8th Generation TPU

Google recently announced its 8th generation TPU (1). The most significant feature of this generation is the separation into independent chips for training and inference. What particularly caught my attention is the remarkable improvement in inference speed. As highlighted in the red frame, the computation speed has increased approximately tenfold compared to the previous generation. While I found myself wondering, "Can it really get this much faster in just one year?", I am eager to try it out as soon as possible. It is expected to debut later this year.

                TPU Performance Comparison

With TPU inference becoming this fast, we might see the same generative AI models produce results significantly quicker when running on TPUs. Currently, among public clouds, only Google Cloud offers the TPU option, which is likely to further boost Google Cloud's competitive edge.

 

2. Massive Investment in Anthropic

Currently, the most popular frontier model in the AI market is Claude, developed by Anthropic. It is exceptionally strong, particularly in the B2B sector. Recently, Google reportedly committed to a massive investment in Anthropic (up to $40 billion, albeit with conditions) (2). From the perspective of frontier model development, Google and Anthropic are competitors. On the other hand, Anthropic is a major customer for Google Cloud.

Therefore, this massive investment holds significant strategic weight. If the likelihood of Claude’s training and inference being performed on TPUs increases, so does the potential for Google to generate revenue from it. This can be viewed as a form of risk diversification for Google. While it would be ideal if Google’s own frontier model, Gemini, maintained a dominant market share, rivals are constantly launching high-performance models. Practically speaking, it is a rational risk-hedging strategy to have even competing models run on TPUs—thereby collecting Google Cloud usage fees—or to aim for capital gains through equity stakes in those invested companies. In any case, we must keep a close eye on the collaboration between Google and Anthropic.

 

3. Google DeepMind’s Technical Prowess and Google’s Product Ecosystem

One cannot discuss Google’s AI without mentioning Gemini. Developed by Google DeepMind, this frontier model is natively multimodal and has made headlines for its high performance with every new release. The current model is Gemini 3, and there is anticipation that a next-generation model might be announced at Google I/O, the annual event starting on May 19, 2026. It’s very exciting.

However, Gemini is not the only generative AI from Google DeepMind. Boasting one of the most diverse arrays of models among all AI labs, their portfolio includes image and video generation models, as well as world models like Genie 3 (3).

Furthermore, Google possesses a vast amount of data required for model generation. Google already operates various products globally, and the data harvested from them is immense—YouTube alone is a clear example. Compared to many AI labs that must build their user bases from scratch, Google has an overwhelming advantage. The combination of "Google DeepMind’s technical prowess + data obtained from various products" is unparalleled.

 

What do you think? Today, we took a deep dive into Google. With powerful technology spanning not just AI model development but various other fields, Google’s strength feels overwhelming. They will likely continue to lead the AI market. Conversely, they are so strong that one might even worry about when they might run afoul of antitrust laws. What are your thoughts?

ToshiStats will continue to cover Google in the future. Stay tuned!

 


1) Our eighth generation TPUs: two chips for the agentic era, Google, Apr 23, 2026
2) Google to invest up to $40B in Anthropic in cash and compute, TechCrunch, April 24, 2026
3) Genie 3: A new frontier for world models, Google, August 5, 2025





Opus 4.7’s Auto Mode: The Secret Weapon for Boosting Productivity

Anthropic has released the frontier generative AI model, Opus 4.7. This update comes just over two months after the release of Opus 4.6, highlighting the accelerating pace of technological progress. In this article, I will dive deep into the remarkable new feature added alongside Opus 4.7, "Auto Mode," by utilizing it to build a machine learning model for credit default prediction.

 

1. What is Auto Mode?

Boris Cherny, the developer of Claude Code (an agentic coding environment), commented on "Auto Mode" as follows (1):

Auto mode = no more permission prompts

In the past, you either had to babysit the model while it did these sorts of long tasks, or use --dangerously-skip-permissions. We recently rolled out auto mode as a safer alternative. In this mode, permission prompts are routed to a model-based classifier to decide whether the command is safe to run. If it's safe, it's auto-approved.

In short, this feature reduces the frequency of "Please approve" requests that appear during long agentic coding sessions, thereby boosting productivity. For someone like me, who handles dozens of these approval requests daily, this is a very welcome addition.
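The routing idea in the quote can be illustrated with a toy Python sketch. To be clear about the assumptions: this is not how Claude Code implements Auto Mode. The real feature uses a model-based classifier whose details are not given in the quote, so `toy_classifier` below is a deliberately crude keyword rule standing in for it, and the "needs review" fallback for unsafe commands is my own assumption.

```python
from typing import Callable

# Toy stand-in for the model-based classifier: flags a few obviously risky tokens.
RISKY_TOKENS = ("rm -rf", "sudo", "curl")


def toy_classifier(command: str) -> bool:
    """Return True when the command looks safe under this toy keyword rule."""
    return not any(token in command for token in RISKY_TOKENS)


def route_permission(command: str, is_safe: Callable[[str], bool]) -> str:
    """Auto-approve commands judged safe; otherwise fall back to review (assumed behavior)."""
    return "auto-approved" if is_safe(command) else "needs review"


for cmd in ("pytest -q", "sudo rm -rf /tmp/build"):
    print(cmd, "->", route_permission(cmd, toy_classifier))
```

Passing the classifier in as a function mirrors the point of the quote: the approval decision is delegated to a separate component rather than to the person watching the session.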

You can verify the "Auto Mode" status via the indicator at the bottom left of the Claude Code interface.

Auto Mode

When you first enable it, a notice will appear; I recommend giving it a thorough read.

Notice of Auto Mode

 

2. Building a Default Prediction Model with Auto Mode

I used Claude Code’s "Auto Mode" to actually build a default prediction model. For this project, I used data from the Home Credit Default Risk competition (2) on Kaggle.

First, I created an implementation plan using Plan Mode. Through dialogue with Claude Code, a structured plan was established.

                  Implementation Plan

At this stage, Claude Code asks, "Would you like to use Auto Mode?" and answering "Yes" initiates the process.

                   Approval Request

The Implementation Process: I watched to see how many approval requests would appear before completion.

                Implementation using Auto Mode

After approximately 90 minutes, the system announced, "Finished." Remarkably, not a single approval request was triggered. This makes the work significantly easier and the implementation process much more enjoyable.

                   Completion Notice

Accuracy Validation: I checked the evaluation metric on Kaggle. The result was an AUC = 0.79632. This is my personal best for a single model without using ensembles. It ranks within the top 4.2% of the competition. Achieving this score without any manual intervention after the initial planning phase is truly astonishing.

                 Evaluation Metric
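For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The sketch below computes it directly from that definition in plain Python. The function name and the toy data are mine, and Kaggle's own scorer is surely implemented differently. This pairwise version is slow on large data and is only meant to show what the number means.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = default, 0 = repaid; scores are predicted default probabilities.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives some context for the 0.79632 reported above.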

 

3. Auto Mode and Productivity in Data Analysis

While Auto Mode makes implementation effortless, its true power lies elsewhere. Because the frequency of approval requests has decreased so dramatically, it is now feasible to work with parallel computing—building multiple models simultaneously.

Whether in Kaggle competitions or practical business scenarios, we are often required to improve accuracy within a limited timeframe. If parallel computing becomes this easy, increasing productivity by 5x to 10x is no longer just a dream. It is a challenge well worth taking.

 

Conclusion

Auto Mode has simplified parallel computing and opened a new path toward enhanced productivity. At ToshiStats, we will continue to explore case studies using Auto Mode.

Stay tuned!

 


1) https://x.com/bcherny/status/2044847848035156457, Boris Cherny, Anthropic
2) Home Credit Default Risk, kaggle










Revolutionizing Enterprise AI: The Power of Claude Managed Agents

Anthropic, a leader in generative AI, has announced "Claude Managed Agents," an AI agent hosting service. This service appears to offer significant advantages for enterprises utilizing AI agents, so let’s dive deeper into what it’s all about.

 

1. What is "Claude Managed Agents"?

First, what exactly is "Claude Managed Agents"? Let’s look at a quote from Anthropic's technical blog (1):

Harnesses encode assumptions that go stale as models improve. Managed Agents—our hosted service for long-horizon agent work—is built around interfaces that stay stable as harnesses change.

It seems "Claude Managed Agents" refers to an AI agent infrastructure designed for stable, long-term operation, even as underlying models are updated. A key concept here—which is also the title of their blog post—is "Decoupling the brain from the hands."

The solution we arrived at was to decouple what we thought of as the “brain” (Claude and its harness) from both the “hands” (sandboxes and tools that perform actions) and the “session”

Because the functions are separated, if the system stops, you only need to fix the specific affected part to achieve a quick recovery. This certainly looks promising.

              Decoupling the brain from the hands
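The decoupling idea can be pictured with a small Python sketch. Everything here is hypothetical: the class and function names are mine, the source does not show the real Managed Agents interfaces, and treating the session as a simple event log is my own assumption. The point is only that the "brain" talks to the "hands" through a narrow interface, so either side can be replaced without touching the other.

```python
from typing import Protocol


class Hands(Protocol):
    """Anything that can execute an action on behalf of the brain."""

    def run(self, action: str) -> str: ...


class Session:
    """Event log kept outside both the brain and the hands (assumed design)."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def append(self, event: str) -> None:
        self.events.append(event)


class EchoSandbox:
    """Toy hands: pretends to run the action and reports back."""

    def run(self, action: str) -> str:
        return f"ran: {action}"


def brain_step(action: str, hands: Hands, session: Session) -> str:
    """The brain picks an action, delegates it to the hands, and logs the outcome."""
    result = hands.run(action)
    session.append(result)
    return result


session = Session()
brain_step("list files", EchoSandbox(), session)
print(session.events)
```

If the sandbox fails, a fresh `EchoSandbox` can be passed to the next `brain_step` call while the session log is left untouched, which is the kind of quick, localized recovery described above.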

 

2. Creating a Customer Complaint Classification Agent with "Claude Managed Agents"

Descriptions alone don't quite capture the experience, so let’s try running "Claude Managed Agents" ourselves. First, we enter a prompt into the box on the bottom left.

               Claude Managed Agents Console

For this test, we will create an agent to classify bank customer complaints. I have instructed it to select one of six financial products. Immediately, a configuration file is generated as shown below. Next, we create the agent.

               Prompt Input and Configuration File

The agent is now created. Next, we set up the environment.

                Environment Configuration

The environment is ready. Now, we start a session.

Start Session

The session has begun.

Ready

The preparation was finished in no time. There is nothing technically difficult about this; it’s just a matter of clicking buttons. Let's test it out immediately. I'll enter a bank customer complaint as follows:

Bank Customer Complaint Input

The result came back as "Student loan." Correct!

Now, let’s try one more.

It came back as "Mortgage". Correct!

It’s working perfectly. All I did was provide a prompt instructing the AI agent on what to do. The rest was handled almost automatically by "Claude Managed Agents." This is impressive.
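For comparison, here is roughly what the prompt side of such a classifier might look like if written by hand in Python. This is a sketch under assumptions, not the configuration that the console generated: only "Student loan" and "Mortgage" appear in this article, so the other four product names are placeholders, and the prompt wording and helper names are hypothetical. No API call is made.

```python
from typing import Optional

# Only "Student loan" and "Mortgage" come from the article; the rest are placeholders.
PRODUCTS = [
    "Student loan",
    "Mortgage",
    "Credit card",
    "Checking account",
    "Vehicle loan",
    "Debt collection",
]


def build_prompt(complaint: str) -> str:
    """Assemble a single-label classification prompt for the agent."""
    options = "\n".join(f"- {p}" for p in PRODUCTS)
    return (
        "Classify the bank customer complaint into exactly one product.\n"
        f"Products:\n{options}\n"
        f"Complaint: {complaint}\n"
        "Answer with the product name only."
    )


def parse_label(response: str) -> Optional[str]:
    """Map the agent's answer to a known product, or None if it matches nothing."""
    cleaned = response.strip().strip('".').lower()
    for product in PRODUCTS:
        if cleaned == product.lower():
            return product
    return None


print(build_prompt("My servicer misapplied my tuition loan payment."))
```

Validating the answer against a fixed list, as `parse_label` does, is a cheap way to catch responses that drift outside the six allowed classes.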

 

3. Easy Enterprise Scaling: The Rakuten Success Story

Now, let's look at an example of a Japanese company that used "Claude Managed Agents" to scale its AI agents: Rakuten, the e-commerce giant. By switching from in-house infrastructure development to "Claude Managed Agents," they succeeded in deploying AI agents across the company with overwhelming speed.

“Deployed Claude Managed Agents across product, sales, marketing, finance within one week” (2)

It is particularly notable that business-side staff, not just engineers, are actively involved. It truly sounds like a company-wide initiative. Wonderful! I look forward to seeing more Japanese companies follow this lead.

"Claude Managed Agents" Success Story: Rakuten

 

How was that? Between the rapid development enabled by "Claude Managed Agents" and the reduced maintenance burden associated with updating frontier models, this feels like a paradigm shift in enterprise AI. While concerns about vendor lock-in remain, for companies that prioritize speed above all else, "Claude Managed Agents" appears to be an ideal service.

ToshiStats will continue to cover AI agent development in the corporate world. Stay tuned!

 
 


1) Scaling Managed Agents: Decoupling the brain from the hands, Anthropic
2) Rakuten accelerates development with Claude Code, Anthropic


Unlocking Recursive Self-Improvement via Meta-Harness

Recently, discussions on how to significantly improve AI agent performance by optimizing "what information is provided to the agent and at what timing" have been gaining momentum. In this post, based on a recent research paper, we will explore the possibility of "Recursive Self-Improvement of AI Agents," where agents improve their own performance. Let’s dive in.

 

1. Meta-Harness: A New Methodology for Harness Construction

A paper (1) from Stanford University has introduced a novel approach that significantly boosts accuracy. I believe the two major features are as follows:

  • Full access to past information

  • Adoption of Claude Code

The paper defines a "harness" as follows:

The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model.

Simply put, a harness is the mechanism surrounding the generative AI that controls data to maximize its performance. To build this harness using an AI agent, it seems that maximum data access is required.

By running a loop as shown below, "Recursive Self-Improvement"—where the agent learns from past failures to improve itself—becomes possible.

                   Meta-Harness
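The shape of this loop can be sketched in a few lines of Python. This is a simplification under assumptions, not the paper's algorithm: the history lives in an in-memory list rather than in logs and files, and `propose` and `evaluate` are hypothetical callables standing in for the proposer agent and the benchmark run.

```python
from typing import Callable

# Each entry keeps everything about one candidate: {"harness", "score", "log"}.
History = list[dict]


def meta_harness_loop(
    propose: Callable[[History], str],
    evaluate: Callable[[str], tuple[float, str]],
    seed_harness: str,
    iterations: int,
) -> dict:
    """Repeatedly propose a harness from the full history and return the best entry."""
    score, log = evaluate(seed_harness)
    history: History = [{"harness": seed_harness, "score": score, "log": log}]
    for _ in range(iterations):
        # The proposer sees every past harness, score, and log, with nothing summarized away.
        candidate = propose(history)
        score, log = evaluate(candidate)
        history.append({"harness": candidate, "score": score, "log": log})
    return max(history, key=lambda entry: entry["score"])


# Toy stand-ins so the sketch runs: "improving" means appending a character.
best = meta_harness_loop(
    propose=lambda history: history[-1]["harness"] + "+",
    evaluate=lambda harness: (len(harness) / 10, f"length={len(harness)}"),
    seed_harness="h",
    iterations=3,
)
print(best["harness"], best["score"])
```

The important design choice is that `propose` receives the whole `history` object, failures included, rather than a compressed summary of it.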

 

2. Full Access to Information: The Secret to Improved Accuracy

Previously, there were various methods for constructing harnesses, but humans had to summarize or compress large amounts of information in some form. Consequently, critical information was often lost during the process, creating a bottleneck when aiming for higher accuracy.

"Meta-Harness" addresses this by granting the proposer access to all past logs and files. By allowing the agent to see all information without concealment, this structure eliminates the bottleneck. As a result, it achieved excellent performance on the Pareto frontier, as shown below.

                Pareto Frontier

This graph illustrates the relationship between additional information (context) and accuracy. The closer a point is to the top-left, the higher the accuracy achieved with less information, which signifies superior performance.
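As a reminder of what "on the Pareto frontier" means, the following Python sketch filters a set of (context, accuracy) points down to those that no other point beats on both axes. The function name and the sample points are made up for illustration and are not the paper's results.

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (context, accuracy) points that are not dominated by any other point.

    A point is dominated when another point uses no more context and reaches
    at least the same accuracy, with at least one of the two strictly better.
    """
    frontier = []
    for context, acc in points:
        dominated = any(
            c <= context and a >= acc and (c < context or a > acc)
            for c, a in points
        )
        if not dominated:
            frontier.append((context, acc))
    return sorted(frontier)


# Made-up points: (additional context, accuracy).
candidates = [(10, 0.50), (20, 0.60), (30, 0.55), (40, 0.70), (15, 0.45)]
print(pareto_frontier(candidates))
```

Points that survive this filter are the ones toward the top-left of the graph: no alternative gives higher accuracy without also using more context.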

 

3. The Emergence of Claude Code

The proposer plays a central role in "Meta-Harness." Let’s look at the details through pseudo-code, where P represents the proposer. Looking at the section outlined in red, we can see that a new harness is being created by the proposer.

                 Pseudo-code

In this context, the proposer is Claude Code. In other words, the new harness is created by drawing on the capabilities of Claude Code. Claude Code is already active in various fields, and here it appears again in a leading role, which is truly impressive. This suggests that future AI research will increasingly be driven by AI agents like Claude Code. We are truly at the cutting edge of the era.

 

Conclusion

As we have seen, providing Claude Code with maximum information access enables the construction of high-performance harnesses. Of course, detailed tuning is necessary, so I highly recommend reading the full paper.

At ToshiStats, we will continue to cover harness design, which is the key to improving AI agent accuracy. Stay tuned!

 


1) Meta-Harness: End-to-End Optimization of Model Harnesses,  Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn,  Mar 30, 2026

 


 

Navigating the Evolution of Generative AI: Insights from Anthropic

Every week, a variety of generative AI updates are released, and it feels as though this pace will only continue to accelerate. On the other hand, many people may be feeling lost, wondering how exactly they should navigate these changes. Therefore, in this post, I would like to explore some hints from Anthropic's technical blog (1).

 

1. Experiments at Anthropic

Mr. Prithvi Rajasekaran from the Labs team has provided a detailed report on several implementation experiments.

The experiments consisted of three projects: front-end design development, full-stack 2D retro game development, and Digital Audio Workstation (DAW) development. This time, I would like to focus specifically on the full-stack 2D retro game development. Through various development and implementation processes, they observed cases where long-running agentic coding failed. A common factor was that the AI often overestimated incomplete implementations, judging them to be at a sufficient level when they were actually still unfinished. They believed that unless this was improved, it would be impossible to achieve satisfactory results in long-running agentic coding.

 

2. The Key Technology for Success

To address this, a "harness" design consisting of a pair of a Generator and an Evaluator was introduced. This was reportedly inspired by a technology well-known in image generation called Generative Adversarial Networks (GANs). For more details, please see below. In short, the model does not evaluate its own work.

New Harness Design

A loop was established between the Generator and the Evaluator, where flawed implementations were subjected to rigorous criticism. Naturally, this took a significant amount of time, and costs rose about 20-fold. However, the quality improved by even more than the added cost would suggest, so the return on investment was clearly sufficient.

Performance Comparison: Single Agent vs. Full Harness
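The loop between the Generator and the Evaluator described above can be sketched in Python as follows. This is a minimal illustration under assumptions, not Anthropic's harness: `generate` and `evaluate` are hypothetical callables standing in for the two agents, and the stopping rule (accept or run out of rounds) is my own simplification.

```python
from typing import Callable


def generate_until_accepted(
    generate: Callable[[str], str],
    evaluate: Callable[[str], tuple[bool, str]],
    task: str,
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Loop generator -> evaluator until the evaluator accepts or rounds run out.

    Returns the last draft and the number of rounds used.
    """
    feedback = ""
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task + feedback)
        # A separate evaluator judges the draft; the generator never grades itself.
        accepted, critique = evaluate(draft)
        if accepted:
            return draft, round_number
        feedback = f"\nPrevious critique: {critique}"
    return draft, max_rounds
```

Each extra round means another generation plus another evaluation, which is consistent with the large increase in cost reported above.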

3. Gains from the Update from Opus 4.5 to 4.6

While the AI engineers were continuing to refine the harness, an update for the generative AI model, Opus, was released, moving the version from 4.5 to 4.6. The performance improvement in Opus 4.6 was remarkable, and as a result, part of the harness that had been necessary for Opus 4.5 became redundant. This allowed the implementation to become simpler. Fantastic! Please see the chart below for details. In the V2 harness, a portion of V1 has indeed been removed.

Harness Design with Opus 4.6

Based on this experience, the blog describes the following lessons:

“the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.”

“From this work, my conviction is that the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.”

In other words, I believe this means: "As the capabilities of generative AI improve, the number of things that can be solved by a standalone baseline model increases, making parts of existing harnesses unnecessary. However, as the capability of the baseline model rises, tasks that were previously unreachable become solvable by improving the harness design." If the things we can do with new generative AI models continue to increase, our opportunities for harness design will also grow, and it looks like we will be kept quite busy.

 

What did you think? As the capabilities of generative AI rise, it is expected that new harness designs will be required to push those capabilities to their limits. It seems there will be plenty to do, at least until AGI is realized. ToshiStats will continue to feature harness designs, which are the key to improving the accuracy of AI agents. Stay tuned!

 
 


1) Harness design for long-running application development,  Engineering at Anthropic.  Mar 24, 2026


The Secret Sauce for Mastering Agentic Coding!

Since the beginning of this year, we've been hearing a lot about "agentic coding"—where AI agents handle the coding—everywhere. While we no longer write programs ourselves and instead focus entirely on giving instructions to AI agents via prompts, many people likely find themselves wondering, "What exactly should I learn to write good prompts?" So, today, I'd like to explore this topic using an experiment conducted at ETH Zurich as our guide.

 

1. Overview of the Experiment

The reference for this discussion is the paper titled "Computer Science Achievement and Writing Skills Predict Vibe Coding Proficiency" (1). The researchers gathered 100 students, who first took tests to measure their writing skills, computer science achievement, and general cognitive abilities. I've summarized these three foundational skills below.

        Three Foundational Skills

Afterward, to measure their "agentic coding" proficiency, the participants reviewed a sample application, drafted prompts for an LLM-based agent, tested the generated application, and then further refined it. The final applications were evaluated by human graders.

         Measuring "Agentic Coding" Proficiency

This process reveals the relationship between the three foundational skills and agentic coding proficiency.

 

2. As Expected, Computer Science Skills Mattered

As the results below show, computer science skills were most strongly correlated with agentic coding proficiency, showing a correlation coefficient of 0.39. Writing skills also showed a significant correlation, with a coefficient of 0.29. Here is a summary of the results.

        Skills Correlated with Agentic Coding Proficiency

Now, some of you might find this a bit puzzling. Computer science skills are primarily centered around programming, whereas in agentic coding, humans don't actually write code directly. So, why did computer science skills show such a high correlation? The research paper explains it as follows:

"It may have contributed through problem decomposition or mental models of control flow and state."

It's certainly true that people hone these kinds of abilities through the practice of programming. If that's the case, it makes perfect sense that individuals with strong computer science skills would perform well, even in natural language-driven agentic coding.
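To make the correlation coefficients in this section a little more concrete, here is a minimal sketch of how a correlation between a skill score and a proficiency score can be computed. I am assuming a Pearson correlation here (the paper may use a different statistic), the function name is hypothetical, and the scores are made-up numbers rather than data from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for six participants (illustrative only).
cs_scores = [55, 62, 70, 48, 81, 66]
coding_scores = [60, 58, 75, 52, 70, 80]
print(round(pearson_r(cs_scores, coding_scores), 2))
```

A value near 1 means the two scores rise together, a value near 0 means little linear relationship, so 0.39 and 0.29 indicate a moderate but real association rather than a guarantee.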

 

3. How Those with No Programming Experience Can Become Excellent Agentic Coders

Based on our discussion so far, I'd like to explore how people with no programming experience can become excellent agentic coders. As agentic coding becomes more widespread, the incentive to learn traditional programming may inevitably fade. However, the following skills remain essential for mastering agentic coding:

  • The ability to decompose tasks

  • The ability to understand system flows

  • The ability to expand your vocabulary and accurately define requirements in writing

For those without programming experience, deliberately studying these specific points alongside your regular prompt-writing practice will likely accelerate your improvement. This is something you can start doing today. I highly recommend it!

 

What do you think? While we focused on "agentic coding" today, the insights we've gained go far beyond just "coding"—they can be seen as universal skills for unlocking the true potential of AI agents. As AI agents become integrated into various fields in the future, these skills will essentially become mandatory subjects for all of us. Here at ToshiStats, we will continue to discuss the collaboration between business professionals and AI agents. Stay tuned!

 

You can enjoy our video news ToshiStats AI Weekly Review from this link, too!


1) Computer Science Achievement and Writing Skills Predict Vibe Coding Proficiency, Sverrir Thorgeirsson, Theo B. Weidmann, Zhendong Su. 14 Mar 2026



The End of Traditional Research: How "autoresearch" is Changing Everything

"It would be wonderful to have a system where you could give instructions to an AI agent before going to bed, and while you sleep, the AI agent executes the program so that a finished product is ready by the time you wake up in the morning." This is not a story about the future. It is an application called "autoresearch" (1) released on March 6, 2026, and anyone can use it for free. Let’s take a look right away.

 

1. What is "autoresearch"?

This is a project by the renowned AI researcher Andrej Karpathy. According to his GitHub, it is described as "AI agents running research on single-GPU nanochat training automatically," meaning he has created AI agents that automatically train nanochat (2). Nanochat is a small yet high-performance large language model (LLM) that he developed. Usually, he trains nanochat while manually tuning it, but this is a very ambitious project to automate that process using "autoresearch." According to him, even though it has just begun, "autoresearch" has worked very well. For details, please see his post on X (3).

 

2. Simple is Best

When you hear about automating the training of a large language model, you might imagine a very complex system, but there are only three basic files. Furthermore, the only file a human needs to write directly is program.md. In this file, you write in natural language, such as English or Japanese, "what kind of research team we want to form by launching multiple AI agents and what we want them to do." No programming is required. The AI agent that receives these instructions autonomously writes code in train.py to improve the accuracy of nanochat. The final file, prepare.py, is never updated during training. It serves as the foundation for the experiment, so it remains the same until the end. It is a very simple structure. I highly recommend checking Andrej Karpathy’s GitHub for the contents of each file; it will be very informative. I have summarized the overview briefly below.

This is the autoresearch repository for Mac that I executed this time. You can certainly see the three files I introduced. The file structure is extremely simple, and I believe anyone can handle it.

 

3. Running on a MacBook Air

Now, let's run it on my MacBook Air. This Mac was purchased exactly one year ago and is equipped with an M4 chip and 24GB of RAM. Claude Code is active as the development environment once again. It is on duty at our company almost every day.

Claude Code

When I asked Claude Code to draw a diagram, it looked like the one below. It is simple and easy to understand. The second box from the right, "MLX Train 5m," means that a 5-minute training session is repeated many times, so about 12 sessions fit into one hour. The box on the far right, "Evaluate val_bpb," means "evaluate the metric val_bpb (validation bits per byte) and check whether the value is steadily decreasing." A lower value means the model is improving. If the value does not decrease, that session is discarded and training continues from the previous state. If you let this run while you sleep, you can conduct roughly 100 experiments in a single night.

autoresearch Training Process
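The keep-or-discard logic described above can be sketched in a few lines. This is only my own illustration, not code from the autoresearch repository: `run_experiment` is a hypothetical stand-in for one 5-minute training run that returns val_bpb, and the simulated values are made up.

```python
def autoresearch_loop(run_experiment, baseline_bpb, num_rounds):
    """Keep a change only when it lowers val_bpb; otherwise discard it."""
    best_bpb = baseline_bpb
    history = []
    for round_index in range(num_rounds):
        # Stand-in for one 5-minute training session plus evaluation.
        candidate_bpb = run_experiment(round_index)
        kept = candidate_bpb < best_bpb
        if kept:
            best_bpb = candidate_bpb  # continue from the improved state
        # If not kept, best_bpb is unchanged: the session is discarded.
        history.append((round_index, candidate_bpb, kept))
    return best_bpb, history

# Simulated results for three rounds (made-up numbers).
simulated = [1.95, 2.10, 1.90]
best, log = autoresearch_loop(lambda i: simulated[i], baseline_bpb=2.00, num_rounds=3)
print(best)
print(log)

# Budget arithmetic from the text: 5-minute runs give 12 runs per hour.
runs_per_hour = 60 // 5
# Assuming an 8-hour night, that is 96 runs, close to the 100 mentioned above.
print(runs_per_hour * 8)
```

The point of the single metric is that the loop needs no human judgment: a lower number is kept, anything else is thrown away.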

Andrej Karpathy describes this design as follows: "Self-contained. No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric."

Since I wanted to confirm if it would work properly this time, I ran the loop only three times. As seen below, the evaluation metric did indeed decrease, showing that the training progressed smoothly. During this time, I gave no instructions at all. It’s amazing. It truly is "autoresearch"!

Trends in Evaluation Metric Values

 

What did you think? Andrej Karpathy stated on his X (3) account:

“All LLM frontier labs will do this.”

“any metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.”

You, too, might be able to create your own AI lab using a Mac. It is a wonderful thing. At ToshiStats, we will continue to conduct experiments incorporating cutting-edge technology. Stay tuned!


1) autoresearch, Andrej Karpathy, March 6, 2026
2) nanochat, Andrej Karpathy, Oct 13, 2025
3) https://x.com/karpathy/status/2031135152349524125


Many-Shot In-Context Learning: The Game Changer of the Long-Context AI Era

Recently, OpenAI released its newest AI model, GPT-5.4 (1). While much of the praise has focused on its overall performance, I want to highlight its context window length. The context window refers to the amount of information a generative AI can process in a single go. GPT-5.4 now supports 1M (one million) tokens. With its rival Opus 4.6 also at 1M and Google Gemini having achieved 1M two years ago, all frontier models from the "Big Three" now possess 1M-token context windows. We can officially say that AI has entered the Long-Context Era.

How will this impact the development of AI agents? Let’s explore.

 

1. What is Many-Shot In-Context Learning?

When you ask ChatGPT, "What is the capital of Japan?" and it replies, "Tokyo," that question or instruction is called a prompt. However, you can input much more than just a short prompt.

For example, if you provide examples first—such as "Where was the World Expo held in Japan?" followed by "Osaka"—and then ask your actual question, the accuracy is known to improve. This technique is called In-Context Learning. When the number of examples exceeds roughly 10 and you provide a massive amount of data, it is referred to as Many-Shot In-Context Learning. Here is a brief summary.

In-Context Learning
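As a rough illustration of the idea above, here is a minimal sketch of how examples can be placed in front of the actual question. The function name and the "Input:/Output:" layout are my own assumptions for illustration; real prompts can use any consistent format.

```python
def build_many_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, worked examples, and the final query."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")
    # The final item has no answer, so the model completes it.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

examples = [
    ("Where was the World Expo held in Japan?", "Osaka"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_many_shot_prompt(
    "Answer with a single city name.",
    examples,
    "Which Japanese city hosted the 1998 Winter Olympics?",
)
print(prompt)
```

With two examples this is ordinary few-shot prompting. The many-shot version is the same function called with hundreds or thousands of examples, which is exactly what a 1M-token context window makes practical.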

 

2. Challenging a 20-Class Classification Task Using Bank Complaint Data

To measure the effectiveness of Many-Shot In-Context Learning, I decided to tackle a difficult 20-class classification task using bank complaint data (2). This dataset contains an "issue" column describing why a complaint occurred. The goal is to read the "text" column and select the correct cause from 20 possible categories. For this, I used Gemini 3.1 Flash-Lite (3).

     Banking complaints dataset

Rather than using a simple prompt like "Please classify this," I asked the AI itself to "create the optimal prompt," resulting in a highly detailed set of instructions—what you might call a "Prompt Powered by AI."

prompt powered by AI

I first ran this enhanced prompt zero-shot (providing no examples). Unfortunately, the accuracy was only 46%. Since the model gets it wrong more than half the time, it isn't yet viable for practical business use.

Zero-Shot accuracy

 

3. Executing Many-Shot In-Context Learning with 1,000 Samples

Next, I implemented Many-Shot In-Context Learning by providing 1,000 examples alongside the prompt. While the underlying process remains the same as the Zero-shot approach, the volume of information is massive. The following are the first five examples.

Many-Shot samples

The results were dramatic: accuracy jumped to 70%. This clearly demonstrates the sheer power of the "Many-Shot" approach.

Many-Shot accuracy

However, with a 30% error rate, there is still room for improvement. I had an AI Agent analyze why the errors occurred and generate a report. The insights gained from this analysis are highly valuable for further refinement.

Root cause analysis
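For readers who want to reproduce this kind of evaluation, here is a minimal sketch of how an accuracy figure is computed and how misclassifications can be tallied as a first step toward a root cause analysis. The function names and the category labels are hypothetical; they are not the actual 20 categories or my actual results.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def top_confusions(true_labels, predicted_labels, n=3):
    """Most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(n)

# Hypothetical labels for five complaints (illustrative only).
true_labels = ["Fees", "Fraud", "Fees", "Loans", "Fraud"]
predicted_labels = ["Fees", "Fees", "Fees", "Loans", "Fees"]

print(accuracy(true_labels, predicted_labels))
print(top_confusions(true_labels, predicted_labels))
```

Pairs that are confused again and again are good candidates for extra examples or a sharper category definition in the prompt.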

 

Conclusion

There are several ways to improve the accuracy of generative AI, but as 1M-token context windows become the standard, Many-Shot In-Context Learning is set to become a major focal point. At ToshiStats, we plan to continue evolving this methodology.

Stay tuned!


 

1) Introducing GPT-5.4, OpenAI, March 5, 2026
2) Consumer Complaint Database
3) Gemini 3.1 Flash-Lite: Built for intelligence at scale, Google, Mar 03, 2026

Copyright © 2026 ToshiStats Co., Ltd. All rights reserved.


Which AI Model Should You Use Daily? Why Gemini 3.1 Flash-Lite is the Top Choice!

I’ve been using Opus 4.6 for coding lately, but I've realized that the costs can really add up when running it via API. This led me to think that for tasks where absolute peak precision isn't the only priority, a more budget-friendly model would be a better fit. Right on cue, Google announced the gemini-3.1-flash-lite-preview—a model built for speed and affordability (1). I decided to put it to the test immediately.

 

1. The Perfect Balance of Speed, Cost, and Performance

The Flash-Lite series is the most affordable tier in the Gemini lineup. It’s likely the engine behind many of Google’s own internal services. Speed, in particular, seems to be its standout feature.

When compared to its rivals, the processing speed is remarkably fast. Its cost-efficiency is equally impressive: at $0.25 per 1 million input tokens, it is poised to be a powerhouse for tasks involving massive amounts of data. For a startup like ours, this is incredibly encouraging.

               Comparison with Rival AI Models
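To get a feel for what $0.25 per 1 million input tokens means in practice, here is a small back-of-the-envelope sketch. It covers input tokens only, since output pricing is not discussed here, and the workload size is a hypothetical example.

```python
INPUT_PRICE_PER_MILLION_USD = 0.25  # input price quoted in the text

def input_cost_usd(num_input_tokens, price_per_million=INPUT_PRICE_PER_MILLION_USD):
    """Cost of the input tokens only (output tokens are not included)."""
    return num_input_tokens / 1_000_000 * price_per_million

# Hypothetical workload: 100 requests, each carrying about 200,000 input tokens.
total_tokens = 100 * 200_000
print(f"${input_cost_usd(total_tokens):.2f}")  # prints $5.00
```

Even a workload of 20 million input tokens stays in the range of a few dollars, which is why this tier is attractive for experiments that feed in large amounts of data.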

Affordability hasn't come at the expense of performance, however. As shown in the Leaderboard (2), it boasts a score exceeding 1430. Given that the top-tier frontier models are currently competing around the 1500 mark, a score of 1430 for a lightweight model is truly outstanding.

                 Leaderboard Standings

 

2. Performance Evaluation: Banking Complaint Classification

To see what it can really do, I tested the model on a banking complaint classification task. Using this dataset (3), I provided the model with customer complaints from the "text" column and asked it to select the most relevant category from six financial products listed in the "Product" column. I ran this test on 100 samples to see how accurately it could categorize each complaint.

                 Banking Complaint Data

Here is the detailed prompt I used.

The Prompt

The results were fantastic, achieving a 92% accuracy rate. The entire process finished in about 60 seconds, demonstrating its high-speed processing capabilities. I’ve attempted this specific task several times in the past, but this is the first time a model has exceeded 90% accuracy without any fine-tuning. Truly impressive!

Task Accuracy Results

 

3. A High-Speed Model You Can Use Without Budget Anxiety

For the past few months, I’ve relied on Opus 4.6 for its sheer coding power. While its performance is top-notch, the costs are substantial. When you want to run various experiments where success isn't guaranteed, the budget can become a significant hurdle.

That’s where gemini-3.1-flash-lite-preview shines. Its balance of performance and cost makes it easy to iterate and experiment freely. It’s the perfect "partner" for development, and I plan to integrate it into my workflow even more moving forward.

 

What do you think? It looks like Google will continue to roll out new AI models one after another. We might even see some open-source models soon, so it's definitely something to keep an eye on. Here at ToshiStats, we’ll keep testing and integrating various AI models into our workflow. Stay tuned!

 


1) Gemini 3.1 Flash-Lite: Built for intelligence at scale, Google, Mar 03, 2026
2) Arena
3) Consumer Complaint Database


The Rise of the AI Strategist: Can AI Agents Master Corporate Strategy?

Claude Code, the coding assistant, is exploding in popularity worldwide. Did you know you can use Agent Teams (1) to run AI agents as a team? The idea is to run multiple AI agents simultaneously according to their purpose, achieving performance that a single agent couldn't deliver. This time, we'd like to test whether we can use Agent Teams to develop corporate strategy. Let's get started!

 

1. Implementing Five Forces Analysis with Agent Teams

There's a well-known framework in competitive strategy called Five Forces Analysis (2). This time, we'd like to apply it to the Japanese digital payment market and explore the possibility of market entry. We'll analyze from the following five perspectives, setting up an AI agent for each one.

                  Five Forces Analysis

We entered the following prompt into Claude Code, which you're all familiar with by now. There's nothing particularly difficult about it, and of course no programming is required. However, if this is your first time using Agent Teams, don't forget that you'll need to configure the settings first (1).

                    Claude Code

The multi-agent system we'll actually build looks like the following. A total of seven AI agents will be running, but the key point is the loop involving Agent 6 and Agent 7. After Agent 6 creates a report summarizing the research findings, Agent 7, positioned independently, verifies that report. The report isn't complete until Agent 7 approves it and gives the go-ahead. Quite rigorous, isn't it?

                Strategic Analysis Multi-Agent System
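The loop between Agent 6 and Agent 7 follows a simple pattern: draft, verify, revise, and repeat until approval. Here is a minimal sketch of that pattern. It is not how Agent Teams is implemented internally; `verify` and `revise` are hypothetical stand-ins for the two agents, and the toy versions below just look for a marker string.

```python
def run_verification_loop(draft, verify, revise, max_rounds=5):
    """Revise the report until the verifier approves or rounds run out."""
    report = draft
    for round_number in range(1, max_rounds + 1):
        issues = verify(report)  # role of the independent verifier
        if not issues:
            return report, round_number, True  # approved
        report = revise(report, issues)  # role of the report writer
    return report, max_rounds, False  # not approved within the budget

def toy_verify(report):
    # Toy rule: any line marked UNVERIFIED counts as an issue.
    return [line for line in report.splitlines() if "UNVERIFIED" in line]

def toy_revise(report, issues):
    # Toy fix: drop the lines that were flagged.
    return "\n".join(line for line in report.splitlines() if line not in issues)

draft = "The market is growing.\nUNVERIFIED: one player holds 90% share."
final, rounds, approved = run_verification_loop(draft, toy_verify, toy_revise)
print(final, rounds, approved)
```

The cap on rounds is my own addition so the loop always terminates; the important design choice is that the verifier is separate from the writer and has the final say.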

 

2. The Report Creation Process

Now let's follow the report creation process on the actual screen. As you can see below, seven AI agents have indeed been configured. You can also see that the crucial verification loop has been created.

                    Seven AI Agents

First, Phase 1. The five research AI agents begin by pulling information from the web. They gather information about the Japanese digital payment market from the five perspectives of Five Forces Analysis. Each AI agent operates independently and processes in parallel, making it very efficient.

Work has progressed, and it appears four of the research tasks are complete. The competitive landscape from each perspective is documented as well. Just a little more to go.

The research by all five AI agents is complete, and we move into Phase 2: creating the integrated report. I'm excited to see what kind of report it will be.

Then we enter the most important phase—Phase 3: the verification loop. Here, the goals are: 1) fact-checking through search, 2) identifying logical inconsistencies, and 3) identifying hallucinations, all aimed at improving the quality of the integrated report.

It appears eight errors were identified and corrected.

The report is finally complete. As shown below, there are six types of reports. We compiled all six into a single PDF file, and it spans 60 pages of content. Impressive, isn't it?

 

3. Structure of the Generated Analysis Report

The structure of the consolidated report is as follows. It's written in accordance with the Five Forces Analysis framework.

Structure of the analysis Report

We can't present everything here, but the summary in Chapter 1 looks like the following—I think it's very clearly organized. Please note that this summary is for educational purposes only and should not be directly applied to business decisions or the like.

              notice : This is for educational purpose only

 

So, what did you think? We carried out corporate strategy development using Five Forces Analysis, and the AI agents produced an excellent report. While further verification is needed, it could potentially be used as a starting point for discussion. I should note that Agent Teams is currently in an experimental phase, so changes to specifications are possible going forward (1). At ToshiStats, we'll continue applying multi-agent systems across various fields. Stay tuned!

 


1) Orchestrate teams of Claude Code sessions, Anthropic
2) Porter's five forces analysis, Wikipedia


Predicting Loan Payback through "Agent Skills": The New Standard for Enterprise AI

The most common complaint about AI agents in business? 'The output isn't what I wanted.' In a corporate landscape, consistency is everything—without pre-defined formats, users get lost. Instead of just teaching everyone to prompt better, why not embed that expertise into the organization itself? By providing standardized prompts upfront, users get perfect results from day one. The secret to this is 'Agent skills' (1). Let’s see how it works!

 

1. What are Agent Skills?

Announced as "skills" by the AI giant Anthropic in October 2025, Agent Skills have since been adopted by almost every major AI company. They have become the de facto standard for providing domain-specific knowledge to generative AI. According to Anthropic:

“Agent Skills are modular capabilities that extend Claude's functionality. Each Skill packages instructions, metadata, and optional resources (scripts, templates) that Claude uses automatically when relevant.”

The beauty of defined Agent Skills is their portability—once created, they can be used across different platforms.

 

2. Creating Agent Skills

Now, let's dive right in. I’m going to create an 'Agent Skill' using Claude Cowork. I uploaded the PRD (Product Requirements Document) I typically use for building prediction models and entered the following prompt.

           Claude Cowork

Since Claude Cowork has a built-in skill creator, it automatically generates an Agent Skills folder containing a skill.md file. This skill.md stores the most fundamental information for the Agent Skill, and its header always includes the following content. AI agents like Claude Code are designed to read this section first.

         skill.md 1

For tasks related to predictive modeling, the agent reads the specific implementation logic defined in the skill (which, in this case, spans about 240 lines) before moving to the coding phase.

           skill.md 2

 

3. Building a Prediction Model via Agent Skills

Next, I utilized Claude Code for agentic coding. As shown below, the "skills" we just created are active and recognized by the environment.

Claude Code

Because the detailed modeling process is already governed by the Agent Skill, my manual prompt can be as simple as: "Please create a prediction model." For this project, I used data from the Kaggle "Predicting Loan Payback" competition (2), where the goal is to predict whether a borrower will repay their loan. The entire implementation was completed in about two hours with almost no manual corrections. The stability of Opus 4.6 (3) is truly remarkable!

The model achieved an AUC of 0.92435 on the Kaggle leaderboard—a score that is well within the range of practical, production-ready application.

Kaggle leaderboard

One secret behind this high accuracy was the creation of new features based on ratios. By analyzing feature importance, we ensured only the most impactful variables were included in the final model.

new features based on ratios
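To show what "features based on ratios" means, here is a minimal sketch. The column names and the specific ratios are hypothetical examples of the idea, not the actual features the agent generated for this model.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None when the denominator is zero or missing."""
    return numerator / denominator if denominator else None

def add_ratio_features(row):
    """Return a copy of the row with ratio-based features added."""
    enriched = dict(row)
    enriched["loan_to_income"] = safe_ratio(row["loan_amount"], row["annual_income"])
    enriched["payment_to_income"] = safe_ratio(
        row["monthly_payment"] * 12, row["annual_income"]
    )
    return enriched

# Hypothetical borrower record (illustrative only).
borrower = {"loan_amount": 10_000, "annual_income": 50_000, "monthly_payment": 500}
print(add_ratio_features(borrower))
```

A ratio such as loan amount relative to income often says more about repayment capacity than either raw number alone, which is one plausible reason such features help.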

 

4. Testing the Resulting Model

Let’s look at the model built via Agent Skills in action. First, we calculate the probability of repayment for an individual customer. In this example, the probability exceeds 96%, resulting in a "Success" (likely to repay) classification based on a 50% threshold. This threshold is, of course, adjustable depending on the specific business objectives.

prediction for an individual customer
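The thresholding step described above is simple enough to sketch directly. The function name and the "Failure" label are my own assumptions; only the "Success" label and the 50% default come from the example above.

```python
def classify_repayment(probability, threshold=0.5):
    """Map a repayment probability to a label using an adjustable threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "Success" if probability >= threshold else "Failure"

print(classify_repayment(0.96))                  # default 50% threshold
print(classify_repayment(0.96, threshold=0.97))  # a much stricter lender
```

Raising the threshold makes the model more conservative about approving a borrower, so the right value depends on how costly a missed default is compared with a rejected good customer.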

To avoid the "black box" problem, I use SHAP analysis to explain why a customer received a specific score. As seen in the graph, the length of the red arrows indicates the contribution of each feature. Here, employment_status was the most significant factor driving the "Success" prediction. This transparency is crucial for corporate accountability.

SHAP analysis for a customer
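One property that makes SHAP easy to read is that the contributions add up: the base value plus all per-feature SHAP values equals the model's raw output for that customer. Here is a minimal sketch of that arithmetic, assuming the model is explained in log-odds space (common for gradient-boosted classifiers, but an assumption here). The numbers and the debt_to_income feature are made up for illustration.

```python
import math

def probability_from_shap(base_value, shap_values):
    """Add the base value and SHAP values (log-odds), then apply the sigmoid."""
    log_odds = base_value + sum(shap_values.values())
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical contributions for one customer (illustrative only).
contributions = {
    "employment_status": 1.8,
    "credit_score": 0.6,
    "debt_to_income": -0.3,
}
p = probability_from_shap(base_value=1.0, shap_values=contributions)
top_feature = max(contributions, key=lambda name: abs(contributions[name]))
print(round(p, 3), top_feature)
```

Positive values push the prediction toward "Success" and negative values pull it away, and the feature with the largest absolute value corresponds to the longest arrow in the plot.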

 

We can also apply SHAP to the entire dataset. Again, employment_status emerges as the top contributor, this time across the entire customer base.

SHAP analysis for all customers

Furthermore, SHAP allows us to visualize the non-linear relationship between specific features and repayment probability. For example, with credit_score, the probability doesn't just rise linearly. The data shows that the probability remains flat until a score of 550, starts to rise at 600, and accelerates significantly after 700. This level of granular insight is what makes SHAP so valuable.

Feature-wise SHAP Analysis

 

By using Agent Skills, you can embed entire libraries of domain knowledge directly into your AI’s workflow. These skills are reusable, portable, and—in my opinion—will soon be a requirement for any business using AI agents.

I look forward to seeing how Agent Skills continue to permeate the corporate world and what innovations they will trigger. ToshiStats Co., Ltd. will continue to lead the way in this space.

Stay tuned!

 


1) Agent Skills
2) Predicting Loan Payback, Yao Yan, Walter Reade, Elizabeth Park. Kaggle, 2025
3) Introducing Claude Opus 4.6, Anthropic, Feb 5 2026

Copyright © 2026 Toshifumi Kuga. All rights reserved.

From Zero to Production: How Opus 4.6 Agentic Coding Revolutionizes Insurance Analytics

In the ever-evolving landscape of InsurTech, cross-selling is a goldmine. Using Opus 4.6 and agentic coding, I built an implementation pipeline for an "Insurance Cross-Sell Prediction Model," covering everything from memory-optimized data loading to complex feature engineering. Let’s dive in!

 

1. Agentic Coding with Opus 4.6

Unlike traditional coding, Agentic Coding with Opus 4.6 (1) allows the AI to function as an autonomous engineer. It goes beyond writing snippets; it manages directory structures, ensures memory efficiency for datasets of 11.5 million rows, and completes a production-ready Streamlit dashboard.

In this process, my role was simply to write the "Product Requirement Document (PRD)," a natural-language document (in Japanese or English) defining what I wanted to build. No Python knowledge was required on my part. When Claude Code is put into plan mode, it automatically generates an implementation blueprint, allowing me to verify the coding logic before Opus 4.6 executes it. While I monitored the progress, I never had to write a single line of code myself. Truly remarkable.

 

2. Project Overview

This project features a robust ecosystem designed for real-world application:

  • LightGBM + Optuna: Automated hyperparameter optimization to maximize AUC.

  • 50 Ratio-Based Features: Generation of 50 unique indicators to capture hidden customer behavior patterns.

  • Explainability via SHAP: Implementation of SHAP values to visualize why a specific customer is likely to purchase.

The data was sourced from a Kaggle competition regarding automobile insurance cross-selling (2).

Kaggle competition regarding automobile insurance cross-selling

Performance Results: When evaluating the model built via Opus 4.6 Agentic Coding on the Kaggle leaderboard, it achieved a high score of AUC = 0.88343. This level of accuracy is more than sufficient for practical business use.

Kaggle leaderboard

 

3. Key Features of the Implementation

The model provides two primary functions: individual customer prediction and total customer portfolio analysis.

Individual Prediction

We set the threshold for a "successful" cross-sell at a probability of 35% or higher. Below is an example of a customer predicted to be a successful cross-sell target. To avoid the "Black Box" problem, we use SHAP values to show the contribution of each feature. The larger the SHAP value, the higher its contribution to the positive prediction. This allows staff to understand the concrete reasoning behind the AI's decision.

customer predicted to be success

feature contribution

Conversely, for customers predicted to fail (probability below 35%), the SHAP values indicate which factors are pulling the probability down.

customer predicted to fail

feature contribution

Customer Portfolio Analysis

We can also analyze the "Cross-Sell Success Rate" across an entire customer portfolio. In this demo, we imported a CSV of 30,000 customers. With the threshold set at 35%, the model identified 3,708 potential targets. By adjusting the threshold, marketing teams can narrow or broaden their focus for specific campaigns. The dashboard also displays the overall probability distribution across the entire dataset.

probability distribution
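The portfolio view boils down to two small calculations: counting customers at or above the threshold, and binning the probabilities for the distribution chart. Here is a minimal sketch with hypothetical function names and made-up probabilities; it is not the dashboard's actual code.

```python
def count_targets(probabilities, threshold=0.35):
    """Number of customers at or above the cross-sell threshold."""
    return sum(p >= threshold for p in probabilities)

def histogram(probabilities, num_bins=10):
    """Count probabilities per equal-width bin over [0, 1]."""
    counts = [0] * num_bins
    for p in probabilities:
        # A probability of exactly 1.0 goes into the last bin.
        index = min(int(p * num_bins), num_bins - 1)
        counts[index] += 1
    return counts

# Made-up predicted probabilities for six customers.
sample = [0.05, 0.22, 0.36, 0.38, 0.91, 1.0]
print(count_targets(sample))
print(histogram(sample))

# Share of targets in the demo above: 3,708 out of 30,000 customers.
print(f"{3708 / 30000:.1%}")  # prints 12.4%
```

Lowering the threshold increases the count of targets and broadens the campaign, while raising it narrows the list to the most promising leads.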

 

4. Business Impact

This high-precision model provides sales representatives with a prioritized "Hot Lead" list. Thanks to the Streamlit-based GUI, non-technical staff can execute batch predictions and verify the reasoning via SHAP instantly. This is the definition of Data-Driven Marketing.

 

Conclusion

The synergy between Opus 4.6 and human expertise is redefining the speed of machine learning development and implementation. The potential is, quite frankly, staggering. At ToshiStats, we will continue to explore innovations in this field.

Stay tuned!

 

1) Introducing Claude Opus 4.6, Anthropic, Feb 5 2026
2) Binary Classification of Insurance Cross Selling,  Walter Reade and Ashley Chow, Kaggle



Mind-Blowing Performance: Building a Bank Churn Prediction Model using Claude Opus 4.6

Earlier in 2026, the AI giant Anthropic announced Opus 4.6 (1), the latest update to its frontier model series. Today, I want to share my experience using Claude Code to build a bank customer churn prediction model to see just how far this new version can go. Let’s dive in.

 

1. The Ultimate Coding Model

Opus 4.6 is Anthropic’s new masterpiece, outperforming Opus 4.5 across various benchmarks. Its coding capabilities, in particular, are often rated as the best in the industry, and it feels like it’s now a giant leap ahead of the competition.

 

2. Developing a Churn Prediction Model via "Agentic Coding"

I decided to pair Claude Code with Opus 4.6 to develop a prediction model using "agentic coding," a method where the AI agent handles the entire Python implementation with little to no human intervention.

The task: Bank Customer Churn Prediction. Losing customers is costly and hurts brand loyalty. A predictive model allows us to identify "at-risk" customers and take proactive retention measures before they leave. For this experiment, I used a dataset from a well-known Kaggle competition (2).

The Workflow

  1. PRD Creation: I wrote a detailed Product Requirement Document (PRD) outlining my goals.

  2. Autonomous Execution: I ran Claude Code in plan mode. It drafted the implementation strategy, and once I gave the green light, it proceeded to code the entire system.

  3. Minimal Intervention: While Claude Code occasionally asked for permissions, I simply hit "yes" every time. It was effectively 100% AI-driven development.


The Resulting GUI

The final application is a sleek tool where you can select a Customer ID to see their specific churn probability. It clearly distinguishes between "Loyal" and "At-Risk" customers.

                Example: Predicted Non-Churner

                Example: Predicted Churner

  • Individual Prediction: Instant probability scores for specific users.

  • Batch Prediction: For a bird's-eye view, you can upload a CSV of your entire database (approx. 110,000 customers).

  • Dynamic Thresholding: You can set a churn threshold. For example, at a 50% threshold, 31.2% of the customers are flagged as likely to leave.

By raising the threshold to 90%, the list narrows down to the most critical 8.3% of the customer base. This makes it incredibly easy to target high-stakes marketing campaigns or retention offers.
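The effect of moving the threshold can be sketched in a few lines of Python. This is a minimal example, assuming a customer is flagged when the predicted churn probability is at or above the threshold (the app may use a strict comparison instead). The `flagged_share` helper and the sample probabilities are made up for illustration and are not the real data.

```python
def flagged_share(probabilities, threshold):
    """Fraction of customers whose churn probability is at or above the threshold."""
    if not probabilities:
        return 0.0
    flagged = sum(1 for p in probabilities if p >= threshold)
    return flagged / len(probabilities)


# Made-up churn probabilities for ten customers
probs = [0.05, 0.20, 0.45, 0.55, 0.62, 0.78, 0.91, 0.95, 0.30, 0.10]

for threshold in (0.5, 0.9):
    share = flagged_share(probs, threshold)
    print(f"threshold {threshold:.0%}: {share:.0%} of customers flagged")
```

On this toy list, half of the customers are flagged at a 50% threshold and only two of ten at 90%, mirroring how the real list shrinks as the threshold rises.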

Efficiency Note: The entire process—from data acquisition to a fully functional predictive model—took only about 90 minutes. Not having to write a single line of Python manually is a massive productivity boost.

To enable even deeper analysis, I’ve also included a CSV export feature. Those proficient in Python can leverage this file to conduct their own custom evaluations as needed.
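As one example of such a custom evaluation, the sketch below computes precision and recall at a chosen threshold. It assumes the exported file has been joined with known outcomes and has columns named `customer_id`, `churn_probability`, and `actual_churn`; these column names and the sample rows are hypothetical, so adjust them to match the real export.

```python
import csv
import io

# Stand-in for the exported file; replace with open("export.csv") for real data.
SAMPLE_EXPORT = """customer_id,churn_probability,actual_churn
1001,0.92,1
1002,0.15,0
1003,0.67,0
1004,0.81,1
1005,0.40,1
"""


def precision_recall(rows, threshold):
    """Precision and recall of 'churn' predictions at the given threshold."""
    tp = fp = fn = 0
    for row in rows:
        predicted = float(row["churn_probability"]) >= threshold
        actual = row["actual_churn"] == "1"
        if predicted and actual:
            tp += 1  # flagged and really churned
        elif predicted:
            fp += 1  # flagged but stayed
        elif actual:
            fn += 1  # churned but not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
precision, recall = precision_recall(rows, 0.5)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Running the same function at several thresholds is a simple way to see the trade-off between catching more churners and contacting customers who would have stayed anyway.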

 

3. Glimpsing the Latent Potential of Opus 4.6

As expected, Opus 4.6 completed the end-to-end development process without a single error. When I attempted this same task with Opus 4.5, I had to tell the AI agent to correct a calculation method because I hadn't been specific enough in my pipeline description. This time? Zero rework. The performance improvement is tangible.

 

Opus 4.6 is set to become an indispensable partner in machine learning development. While this isn't a "full" generational leap (like a version 5.0), the refinement is world-class. Rumor has it that Opus 5 is already deep in development at Anthropic and might debut in late 2026. I can’t wait to see what kind of evolution that brings.

Stay tuned!

 

1) Introducing Claude Opus 4.6, Anthropic, Feb 5 2026
2) Binary Classification with a Bank Churn Dataset, Kaggle, Jan 2, 2024

