Toshifumi Kuga

May 31, 2026

Opus 4.8, Anthropic, claude code, investment memo

A Game-Changer for Financial Analysts: How Opus 4.8 Redefines Financial Research !

Toshifumi Kuga

May 31, 2026

Opus 4.8, Anthropic, claude code, investment memo

Anthropic has announced the update of its generative AI, Claude Opus 4.8. This update came less than 40 days after the previous one, which came as a bit of a surprise, but it may indicate that their internal development efficiency has increased significantly. Therefore, in this article, I would like to take on the challenge of using a combination of Claude Code and Opus 4.8 to conduct a financial analysis using US financial statements and create an investment memo.

1. Opus 4.8: The Most Powerful Model at Present

As always, when a new generative AI model is released, I compare its performance with existing models. The introduction page for Opus 4.8 (1) features the comparison table shown below. It is reported to have outperformed existing models in almost all areas. While strong coding capability is a tradition for the Opus series, what caught my attention was its exceptional strength in knowledge work. As indicated by the red box, it has achieved excellent results in two benchmarks that measure knowledge work capabilities.

‍　　　　　　　　　　　Opus 4.8 Performance Comparison

Therefore, in this article, I would like to verify the potential of Opus 4.8 regarding knowledge work.

2. Challenging the Creation of an Investment Memo

This time, I will attempt to create an investment memo for Google using Form 10-K, the annual performance report registered with the US SEC. An investment memo is an internal document created for investors to make a final in-house decision (approval) on whether or not to execute an investment in a specific company. Normally, financial analysts mobilize their expertise to create this based on source materials. This time, I would like to try automating that process.

First, I used the plan mode of Claude Code to formulate an implementation plan. I created a detailed plan this time as well. The following shows the initial part of it, but the actual plan continues further.

After reviewing the created implementation plan and confirming there were no issues, I switched Claude Code to auto mode and actually started coding. This time, the implementation was completed all at once in about 30 minutes without stopping midway. Once I gave the green light, there was no human intervention required. It was a moment where I caught a glimpse of the true capability of Opus 4.8.

Normally, you would need a "prompt" that defines and instructs how to write each section of the investment memo, but I did not need to write it myself. Here too, Opus 4.8 automatically generated the "prompts" for me. The following is an example of this, and it is well-written without missing any key points. It is truly amazing.

‍　　‍　　　　　　　　　　　　　　Generated Prompt Example

3. Reviewing the Investment Memo

In this experiment, I had the investment memo created in both English and Japanese versions and outputted as PDF files. Let's take a look at the content right away. It summarizes the overview beautifully in the opening section, as shown below. It looks very sophisticated.

investment memo by ClaudeCode with Opus4.8

It also summarizes the investment theme concisely as follows.

The investment memo this time exceeds 10 pages in total, so I cannot introduce the full text here, but I would like to look specifically at the section on competitive advantage analysis.

I think it is very well summarized. If the process can be automated to this extent, humans only need to review it, which will dramatically increase work efficiency. Furthermore, if you desire a deeper analysis leveraging domain knowledge, you can simply rewrite the "prompts." This means you can proceed based on existing work, allowing for smooth and efficient collaboration between humans and generative AI. It is wonderful. By the way, please understand that these texts were created for educational purposes and cannot be used for making investment decisions.

What did you think? I challenged the creation of an investment memo using Claude Code and Opus 4.8, and the results exceeded my expectations. I believe the performance of Opus 4.8 in knowledge work was outstanding. However, I would like to emphasize that a final review by a human is absolutely necessary. It is important to bear in mind that hallucinations can still occur. Moving forward, cooperation between generative AI and humans will continue to be essential.

At Toshi Stats, we plan to take on various tasks using Opus 4.8. Stay tuned!

You can enjoy our video news “ToshiStats AI Weekly Review” from this link, too!

1) Introducing Claude Opus 4.8, May 28, 2026, Anthropic PBC

Notice: This is for educational purpose only. ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the report, the codes and the software.

Toshifumi Kuga

May 9, 2026

Root Cause Analysis, AI agent, claude code, RCA

"Root Cause Analysis" is All You Need !

Toshifumi Kuga

May 9, 2026

Root Cause Analysis, AI agent, claude code, RCA

Have you ever tried to automate any classification tasks using Generative AI? I do this quite often, but occasionally, as the number of classification classes increases, the accuracy gradually drops to a point where it is no longer viable for practical business use. So, this time, I will tackle the task of classifying bank customer complaints (1) based on their root causes. There are 20 cause classes in total, making it a difficult problem where a random guess would yield only about a 5% accuracy rate. In the example below, the "text" column contains the customer complaint, and the "Issue," which is the underlying cause, is classified by an AI agent.

We are provided with a mere 100 samples. I would like to implement Root Cause Analysis (RCA) during the classification process to see just how crucial RCA is for improving accuracy. Let's get started right away.

1. What is RCA?

RCA stands for "Root Cause Analysis". When a problem occurs, it is a method used not just to resolve the superficial events (symptoms) you see, but to pinpoint the "true cause (root cause)" in order to prevent a recurrence. This time, I have designed the following RCA approach for classification failures:

Root Cause Analysis (RCA):

Record the success/failure of each sample.
Verification: Calculate the accuracy and generate an error analysis report.
Failure Analysis: If a classification error occurs, scrutinize the principle and conduct a Root Cause Analysis (RCA) on why it failed (e.g., confusion with similar categories, context complexity, etc.).
Create a principle improvement report based on the failure analysis results. Send this feedback to the generator. Take care to ensure the generator does not overfit.

Now, as shown in the infographic below, let's actually take on the bank customer complaint classification task using an AI agent equipped with RCA capabilities.

Note that I referenced this paper (2) for this experiment. If you are interested, please definitely check it out.

2. Implementing the Bank Customer Complaint Classification AI Agent using Claude Code

Once again, I used Anthropic's Claude Code to implement and analyze the AI agent as follows. First, I set it to Plan Mode, compiled what I wanted to accomplish into a PRD (Product Requirements Document), handed it over to Claude Code, and formulated an implementation plan. This PRD incorporates the Root Cause Analysis (RCA) explained above.

An implementation plan like the one below is formulated in about 5 minutes. The actual document is much longer, but I will only show the first part here. The important thing is to thoroughly review this implementation plan. It is long and can be tedious, but this stage allows you to confirm whether it aligns with the task's objectives before actually diving into coding. Anthropic's generative AI, Opus 4.7, is extremely high-performing; once it enters the implementation phase, it runs non-stop until the end. Since it is difficult for humans to intervene midway, the accuracy of the implementation plan holds the key to solving the task.

Since this implementation plan was well-crafted, I will proceed directly to implementation. I switch to Auto Mode as shown below and start coding. You can see the AI agent completing the implementation process step by step.

This time, we iterated on the analysis and improvement 9 times, which ultimately took over 10 hours, but we obtained the results below. This is the result of classifying 100 randomly sampled customer complaints into 20 classes. You can see that the accuracy gradually improves thanks to the RCA feedback.

However, it seems to have overfitted due to repeating the process for far too long. I validated it with newly sampled data, but saw no improvement from iteration 7 onwards.

3. Results and Challenges This Time

In this bank customer complaint classification task, the baseline accuracy using a generative AI "out of the box" without doing anything special was under 40%. By adopting a multi-agent system with a generator and an evaluator, and incorporating RCA feedback, we achieved just under 60% accuracy even on new data, so I believe the RCA was effective. However, once the accuracy on the original data exceeded 80%, overfitting occurred, so figuring out how to improve this is a future challenge.

What did you think? This time, I explicitly stated the RCA in the PRD, implemented it as a multi-agent system, and tackled the task of classifying bank customer complaints based on their causes. While the accuracy improved from 40% to about 60%, overfitting remained an issue. To aim for an accuracy of 70% or higher on new data, another breakthrough might be necessary.

At ToshiStats, we plan to further develop RCA. Please look forward to it. Stay tuned!

You can enjoy our video news ToshiStats AI Weekly Review from this link, too!

1) Consumer Complaint Database
2) CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification, Hanrong Zhang1∗ Shicheng Fan1∗ Henry Peng Zou1 Yankai Chen2,3
Zhenting Wang2
Jiayu Zhou4 Chengze Li1 Wei-Chieh Huang1 Yifei Yao5
Kening Zheng1 Xue (Steve) Liu2,3 Xiaoxiao Li6 Philip S. Yu1
1University of Illinois Chicago 2MBZUAI 3McGill University
4Columbia University 5Zhejiang University 6University of British Columbia, April 12 2026

Toshifumi Kuga

April 25, 2026

Auto Mode, agentic coding, Anthropic, claude code

Opus 4.7’s Auto Mode: The Secret Weapon for Boosting Productivity

Toshifumi Kuga

April 25, 2026

Auto Mode, agentic coding, Anthropic, claude code

Anthropic has released the frontier generative AI model, Opus 4.7. This update comes just over two months after the release of Opus 4.6, highlighting the accelerating pace of technological progress. In this article, I will dive deep into the remarkable new feature added alongside Opus 4.7, "Auto Mode," by utilizing it to build a machine learning model for credit default prediction.

1. What is Auto Mode?

Boris Cherney, the developer of Claude Code—an Agentic coding development environment—commented on "Auto Mode" as follows:

Auto mode = no more permission prompts
In the past, you either had to babysit the model while it did these sorts of long tasks, our use--dangerously-skip-permissions.We recently rolled out auto mode as a safer alternative. In this mode, permission prompts are routed to a model-based classifier to decide whether the command is safe to run. If it'ssafe, it's auto-approved.

In short, this feature reduces the frequency of "Please approve" requests that appear during long agentic coding sessions, thereby boosting productivity. For someone like me, who handles dozens of these approval requests daily, this is a very welcome addition.

You can verify the "Auto Mode" status via the indicator at the bottom left of the Claude Code interface.

When you first enable it, a notice will appear; I recommend giving it a thorough read.

2. Building a Default Prediction Model with Auto Mode

I used Claude Code’s "Auto Mode" to actually build a default prediction model. For this project, I used data from Home Credit Default Risk competition(2) at Kaggle .

First, I created an implementation plan using Plan Mode. Through dialogue with Claude Code, a structured plan was established.

At this stage, Claude Code asks, "Would you like to use Auto Mode?" and answering "Yes" initiates the process.

The Implementation Process: I watched to see how many approval requests would appear before completion.

After approximately 90 minutes, the system announced, "Finished." Remarkably, not a single approval request was triggered. This makes the work significantly easier and the implementation process much more enjoyable.

Accuracy Validation: I checked the evaluation metric on Kaggle. The result was an AUC = 0.79632. This is my personal best for a single model without using ensembles. It ranks within the top 4.2% of the competition. Achieving this score without any manual intervention after the initial planning phase is truly astonishing.

3. Auto Mode and Productivity in Data Analysis

While Auto Mode makes implementation effortless, its true power lies elsewhere. Because the frequency of approval requests has decreased so dramatically, it is now feasible to work with parallel computing—building multiple models simultaneously.

Whether in Kaggle competitions or practical business scenarios, we are often required to improve accuracy within a limited timeframe. If parallel computing becomes this easy, increasing productivity by 5x to 10x is no longer just a dream. It is a challenge well worth taking.

Conclusion

Auto Mode has simplified parallel computing and opened a new path toward enhanced productivity. At ToshiStats, we will continue to explore case studies using Auto Mode.

Stay tuned!

You can enjoy our video news ToshiStats AI Weekly Review from this link, too!

1) https://x.com/bcherny/status/2044847848035156457, Boris Cherney, Anthropic
2) Home Credit Default Risk, kaggle

Toshifumi Kuga

April 18, 2026

Claude Managed Agents, AI agent, Anthropic, claude code

Revolutionizing Enterprise AI: The Power of Claude Managed Agents

Toshifumi Kuga

April 18, 2026

Claude Managed Agents, AI agent, Anthropic, claude code

Anthropic, a leader in generative AI, has announced "Claude Managed Agents," an AI agent hosting service. This service appears to offer significant advantages for enterprises utilizing AI agents, so let’s dive deeper into what it’s all about.

1. What is "Claude Managed Agents"?

First, what exactly is "Claude Managed Agents"? Let’s look at a quote from Anthropic's technical blog (1):

Harnesses encode assumptions that go stale as models improve. Managed Agents—our hosted service for long-horizon agent work—is built around interfaces that stay stable as harnesses change.

It seems "Claude Managed Agents" refers to an AI agent infrastructure designed for stable, long-term operation, even as underlying models are updated. A key concept here—which is also the title of their blog post—is "Decoupling the brain from the hands."

The solution we arrived at was to decouple what we thought of as the “brain” (Claude and its harness) from both the “hands” (sandboxes and tools that perform actions) and the “session”

Because the functions are separated, if the system stops, you only need to fix the specific affected part to achieve a quick recovery. This certainly looks promising.

2. Creating a Customer Complaint Classification Agent with "Claude Managed Agents"

Descriptions alone don't quite capture the experience, so let’s try running "Claude Managed Agents" ourselves. First, we enter a prompt into the box on the bottom left.

For this test, we will create an agent to classify bank customer complaints. I have instructed it to select one of six financial products. Immediately, a configuration file is generated as shown below. Next, we create the agent.

The agent is now created. Next, we set up the environment.

The environment is ready. Now, we start a session.

The session has begun.

The preparation was finished in no time. There is nothing technically difficult about this; it’s just a matter of clicking buttons. Let's test it out immediately. I'll enter a bank customer complaint as follows:

The result came back as "Student loan." Correct!

Now, let’s try one more.

It came back as "Mortgage". Correct!

It’s working perfectly. All I did was provide a prompt instructing the AI agent on what to do. The rest was handled almost automatically by "Claude Managed Agents." This is impressive.

3. Easy Enterprise Scaling: The Rakuten Success Story

Now, let's look at an example of a Japanese company that used "Claude Managed Agents" to scale its AI agents: Rakuten, the e-commerce giant. By switching from in-house infrastructure development to "Claude Managed Agents," they succeeded in deploying AI agents across the company with overwhelming speed.

“Deployed Claude Managed Agents across product, sales, marketing, finance within one week“ (2)

It is particularly notable that business-side staff, not just engineers, are actively involved. It truly sounds like a company-wide initiative. Wonderful! I look forward to seeing more Japanese companies follow this lead.

"Claude Managed Agents" Success Story: Rakuten

How was that? Between the rapid development enabled by "Claude Managed Agents" and the reduced maintenance burden associated with updating frontier models, this feels like a paradigm shift in enterprise AI. While concerns about vendor lock-in remain, for companies that prioritize speed above all else, "Claude Managed Agents" appears to be an ideal service.

ToshiStats will continue to cover AI agent development in the corporate world. Stay tuned!

You can enjoy our video news ToshiStats AI Weekly Review from this link, too!

1) Scaling Managed Agents: Decoupling the brain from the hands, Anthropic
2) Rakuten accelerates development with Claude Code, Anthropic

Toshifumi Kuga

April 11, 2026

Meta-Harness, AI agent, claude code, recursive self-improvemen

Unlocking Recursive Self-Improvement via Meta-Harness

Toshifumi Kuga

April 11, 2026

Meta-Harness, AI agent, claude code, recursive self-improvemen

Recently, discussions on how to significantly improve AI agent performance by optimizing "what information is provided to the agent and at what timing" have been gaining momentum. In this post, based on a recent research paper, we will explore the possibility of "Recursive Self-Improvement of AI Agents," where agents improve their own performance. Let’s dive in.

1. Meta-Harness: A New Methodology for Harness Construction

A paper (1) from Stanford University has introduced a novel approach that significantly boosts accuracy. I believe the two major features are as follows:

Full access to past information
Adoption of Claude Code

The paper defines a "harness" as follows:

The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model.

Simply put, a harness is the mechanism surrounding the generative AI that controls data to maximize its performance. To build this harness using an AI agent, it seems that maximum data access is required.

By running a loop as shown below, "Recursive Self-Improvement"—where the agent learns from past failures to improve itself—becomes possible.

2. Full Access to Information: The Secret to Improved Accuracy

Previously, there were various methods for constructing harnesses, but humans had to summarize or compress large amounts of information in some form. Consequently, critical information was often lost during the process, creating a bottleneck when aiming for higher accuracy.

"Meta-Harness" addresses this by granting the proposer access to all past logs and files. By allowing the agent to see all information without concealment, this structure eliminates the bottleneck. As a result, it achieved excellent performance on the Pareto frontier, as shown below.

This graph illustrates the relationship between additional information (context) and accuracy. The closer a point is to the top-left, the higher the accuracy achieved with less information, which signifies superior performance.

3. The Emergence of Claude Code

The proposer plays a central role in "Meta-Harness." Let’s look at the details through pseudo-code, where P represents the proposer. Looking at the section outlined in red, we can see that a new harness is being created by the proposer.

In this context, the proposer specifically refers to Claude Code. In other words, the new harness is created based on the latent capabilities of Claude Code. While Claude Code is proving active in various fields, it appears here again in a leading role. It is truly impressive. This demonstrates that future AI research will be driven by AI agents like Claude Code at its core. We are truly at the cutting edge of the era.

Conclusion

As we have seen, providing Claude Code with maximum information access enables the construction of high-performance harnesses. Of course, detailed tuning is necessary, so I highly recommend reading the full paper.

At ToshiStats, we will continue to cover harness design, which is the key to improving AI agent accuracy. Stay tuned!

You can enjoy our video news ToshiStats AI Weekly Review from this link, too!

1) Meta-Harness: End-to-End Optimization of Model Harnesses, Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn, Mar 30, 2026

Toshifumi Kuga

April 3, 2026

harness, agentic coding, claude code, Anthropic

Navigating the Evolution of Generative AI: Insights from Anthropic

Toshifumi Kuga

April 3, 2026

harness, agentic coding, claude code, Anthropic

Every week, a variety of generative AI updates are released, and it feels as though this pace will only continue to accelerate. On the other hand, many people may be feeling lost, wondering how exactly they should navigate these changes. Therefore, in this post, I would like to explore some hints from Anthropic's technical blog (1).

1. Experiments at Anthropic

Mr. Prithvi Rajasekaran from the Labs team has provided a detailed report on several implementation experiments.

The experiments consisted of three projects: front-end design development, full-stack 2D retro game development, and Digital Audio Workstation (DAW) development. This time, I would like to focus specifically on the full-stack 2D retro game development. Through various development and implementation processes, they observed cases where long-running agentic coding failed. A common factor was that the AI often overestimated incomplete implementations, judging them to be at a sufficient level when they were actually still unfinished. They believed that unless this was improved, it would be impossible to achieve satisfactory results in long-running agentic coding.

2. The Key Technology for Success

To address this, a "harness" design consisting of a pair of a Generator and an Evaluator was introduced. This was reportedly inspired by a technology well-known in image generation called Generative Adversarial Networks (GANs). For more details, please see below. In short, the model does not evaluate its own work.

A loop was established between the Generator and the Evaluator, where flawed implementations were subjected to rigorous criticism. Naturally, this took a significant amount of time, and costs jumped by 20 times. However, the quality improved even more than the cost suggested. The return on investment was clearly sufficient.

**Performance Comparison: Single Agent vs. Full Harness**

3. Gains from the Update from Opus 4.5 to 4.6

While the AI engineers were continuing to refine the harness, an update for the generative AI model, Opus, was released, moving the version from 4.5 to 4.6. The performance improvement in Opus 4.6 was remarkable, and as a result, part of the harness that had been necessary for Opus 4.5 became redundant. This allowed the implementation to become simpler. Fantastic! Please see the chart below for details. In the V2 harness, a portion of V1 has indeed been removed.

Based on this experience, the blog describes the following lessons:

“the better the models get, the more space there is to develop harnesses that can achieve complex tasks beyond what the model can do at baseline.”

“From this work, my conviction is that the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves, and the interesting work for AI engineers is to keep finding the next novel combination.”

In other words, I believe this means: "As the capabilities of generative AI improve, the number of things that can be solved by a standalone baseline model increases, making parts of existing harnesses unnecessary. However, as the capability of the baseline model rises, tasks that were previously unreachable become solvable by improving the harness design." If the things we can do with new generative AI models continue to increase, our opportunities for harness design will also grow, and it looks like we will be kept quite busy.

What did you think? As the capabilities of generative AI rise, it is expected that new harness designs will be required to push those capabilities to their limits. It seems there will be plenty to do, at least until AGI is realized. ToshiStats will continue to feature harness designs, which are the key to improving the accuracy of AI agents. Stay tuned!

You can enjoy our video news ToshiStats AI Weekly Review from this link, too!

1) Harness design for long-running application development, Engineering at Anthropic. Mar 24, 2026

Toshifumi Kuga

March 21, 2026

autoresearch, AI agent

The End of Traditional Research: How "autoresearch" is Changing Everything

Toshifumi Kuga

March 21, 2026

autoresearch, AI agent

"It would be wonderful to have a system where you could give instructions to an AI agent before going to bed, and while you sleep, the AI agent executes the program so that a finished product is ready by the time you wake up in the morning." This is not a story about the future. It is an application called "autoresearch" (1) released on March 6, 2026, and anyone can use it for free. Let’s take a look right away.

1. What is "autoresearch"?

This is a project by the renowned AI researcher Andrej Karpathy. According to his GitHub, it is described as "AI agents running research on single-GPU nanochat training automatically," meaning he has created AI agents that automatically train nanochat (2). Nanochat is a small yet high-performance large language model (LLM) that he developed. Usually, he trains nanochat while manually tuning it, but this is a very ambitious project to automate that process using "autoresearch." According to him, even though it has just begun, "autoresearch" has worked very well. For details, please see his post on X (3).

2. Simple is Best

When you hear about automating the training of a large language model, you might imagine a very complex system, but there are only three basic files. Furthermore, the only file a human needs to write directly is program.md. In this file, you write in natural language, such as English or Japanese, "what kind of research team we want to form by launching multiple AI agents and what we want them to do." No programming is required. The AI agent that receives these instructions autonomously writes code in train.py to improve the accuracy of nanochat. The final file, prepare.py, is never updated during training. It serves as the foundation for the experiment, so it remains the same until the end. It is a very simple structure. I highly recommend checking Andrej Karpathy’s GitHub for the contents of each file; it will be very informative. I have summarized the overview briefly below.

This is the autoresearch repository for Mac that I executed this time. You can certainly see the three files I introduced. The file structure is extremely simple, and I believe anyone can handle it.

3. Running on a MacBook Air

Now, let's run it on my MacBook Air. This Mac was purchased exactly one year ago and is equipped with an M4 chip and 24GB of RAM. Claude Code is active as the development environment once again. It is on duty at our company almost every day.

When I asked Claude Code to draw a diagram, it looked like the one below. It is simple and easy to understand. On the second from the right, it says MLX Train 5m, which means repeating a 5-minute training session many times. It can be executed about 12 times in one hour. On the far right, Evaluate val_bpb means "evaluate the metric val_bpb (validation bits per byte) and check if the value is steadily decreasing." If the value decreases, it means the accuracy is improving. If not, that session is discarded, and training continues from the previous state. If you let this run while you sleep, you can conduct 100 experiments in a single night.

Andrej Karpathy describes this design as follows: ‘Self-contained. No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.‘

Since I wanted to confirm if it would work properly this time, I ran the loop only three times. As seen below, the evaluation metric did indeed decrease, showing that the training progressed smoothly. During this time, I gave no instructions at all. It’s amazing. It truly is "autoresearch"!

What did you think? Andrej Karpathy stated on his X (3) account:

“All LLM frontier labs will do this. “,

“any metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.”

You, too, might be able to create your own AI lab using a Mac. It is a wonderful thing. At ToshiStats, we will continue to conduct experiments incorporating cutting-edge technology. Stay tuned!.

You can enjoy our video news ToshiStats AI Weekly Review from this link, too!

1) autoresearch, Andrej karpathy, March 6, 2026
2) nanochat, Andrej karpathy, Oct13,2025
3) https://x.com/karpathy/status/2031135152349524125

Toshifumi Kuga

March 1, 2026

Agent teams, AI agent, claude code, Corporate Strategy

The Rise of the AI Strategist: Can AI Agents Master Corporate Strategy?

Toshifumi Kuga

March 1, 2026

Agent teams, AI agent, claude code, Corporate Strategy

Claude Code, the coding assistant that's exploding in popularity worldwide—did you know you can use Agent teams (1) to run AI agents as a team? The idea is to run multiple AI agents simultaneously according to their purpose, achieving performance that a single agent couldn't deliver. This time, we'd like to test whether we can use Agent Teams to develop corporate strategy. Let's get started!

1. Implementing Five Forces Analysis with Agent Teams

There's a well-known framework in competitive strategy called Five Forces Analysis (2). This time, we'd like to apply it to the Japanese digital payment market and explore the possibility of market entry. We'll analyze from the following five perspectives, setting up an AI agent for each one.

We entered the following prompt into Claude Code, which you're all familiar with by now. There's nothing particularly difficult about it. Of course, no programming is required. However, if this is your first time using Agent Teams, you'll need to configure the settings, so don't forget (1).

The multi-agent system we'll actually build looks like the following. A total of seven AI agents will be running, but the key point is the loop involving Agent 6 and Agent 7. After Agent 6 creates a report summarizing the research findings, Agent 7, positioned independently, verifies that report. The report isn't complete until Agent 7 approves it and gives the go-ahead. Quite rigorous, isn't it?

2. The Report Creation Process

Now let's follow the report creation process on the actual screen. As you can see below, seven AI agents have indeed been configured. You can also see that the crucial verification loop has been created.

First, Phase 1. The five research AI agents begin by pulling information from the web. They gather information about the Japanese digital payment market from the five perspectives of Five Forces Analysis. Each AI agent operates independently and processes in parallel, making it very efficient.

Work has progressed, and it appears four of the research tasks are complete. The competitive landscape from each perspective is documented as well. Just a little more to go.

The research by all five AI agents is complete, and we move into Phase 2: creating the integrated report. I'm excited to see what kind of report it will be.

Then we enter the most important phase—Phase 3: the verification loop. Here, the goals are: 1) fact-checking through search, 2) identifying logical inconsistencies, and 3) identifying hallucinations, all aimed at improving the quality of the integrated report.

It appears eight errors were identified and corrected.

The report is finally complete. As shown below, there are six types of reports. We compiled all six into a single PDF file, and it spans 60 pages of content. Impressive, isn't it?

3. Structure of the Generated Analysis Report

The structure of the consolidated report is as follows. It's written in accordance with the Five Forces Analysis framework.

We can't present everything here, but the summary in Chapter 1 looks like the following—I think it's very clearly organized. Please note that this summary is for educational purposes only and should not be directly applied to business decisions or the like.

notice : This is for educational purpose only

So, what did you think? We carried out corporate strategy development using Five Forces Analysis, and the AI agents produced an excellent report. While further verification is needed, it could potentially be used as a starting point for discussion. I should note that Agent Teams is currently in an experimental phase, so changes to specifications are possible going forward (1). At Toshi Stats, we'll continue applying multi-agent systems across various fields. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Orchestrate teams of Claude Code sessions, Anthropic
2) Porter's five forces analysis, Wikipedia

Toshifumi Kuga

February 20, 2026

Agent Skills, agentic coding, claude code, Opus4.6, Loan Payback

Predicting Loan Payback through "Agent Skills": The New Standard for Enterprise AI

Toshifumi Kuga

February 20, 2026

Agent Skills, agentic coding, claude code, Opus4.6, Loan Payback

The most common complaint about AI agents in business? 'The output isn't what I wanted.' In a corporate landscape, consistency is everything—without pre-defined formats, users get lost. Instead of just teaching everyone to prompt better, why not embed that expertise into the organization itself? By providing standardized prompts upfront, users get perfect results from day one. The secret to this is 'Agent skills' (1). Let’s see how it works!

1. What are Agent Skills?

Announced as "skills" by the AI giant Anthropic in October 2025, Agent Skills have since been adopted by almost every major AI company. They have become the de facto standard for providing domain-specific knowledge to generative AI. According to Anthropic:

“Agent Skills are modular capabilities that extend Claude's functionality. Each Skill packages instructions, metadata, and optional resources (scripts, templates) that Claude uses automatically when relevant.”

The beauty of defined Agent Skills is their portability—once created, they can be used across different platforms.

2. Creating Agent Skills

Now, let's dive right in. I’m going to create an 'Agent Skill' using Claude Cowork. I uploaded the PRD (Product Requirements Document) I typically use for building prediction models and input the following prompt.

Since Claude Cowork has a built-in skill creator, it automatically generates an Agent Skills folder containing a skill.md file. This skill.md stores the most fundamental information for the Agent Skill, and its header always includes the following content. AI agents like Claude Code are designed to read this section first.

For tasks related to predictive modeling, the agent reads the specific implementation logic defined in the skill (which, in this case, spans about 240 lines) before moving to the coding phase.

3. Building a Prediction Model via Agent Skills

Next, I utilized Claude Code for agentic coding. As shown below, the "skills" we just created are active and recognized by the environment.

Because the detailed modeling process is already governed by the Agent Skill, my manual prompt can be as simple as: "Please create a prediction model." For this project, I used data from the Kaggle "Predicting Loan Payback" competition (2), where the goal is to predict whether a borrower will repay their loan. The entire implementation was completed in about two hours with almost no manual corrections. The stability of Opus 4.6 (3) is truly remarkable!

The model achieved an AUC of 0.92435 on the Kaggle leaderboard—a score that is well within the range of practical, production-ready application.

One secret behind this high accuracy was the creation of new features based on ratios. By analyzing feature importance, we ensured only the most impactful variables were included in the final model.

4. Testing the Resulting Model

Let’s look at the model built via Agent Skills in action. First, we calculate the probability of repayment for an individual customer. In this example, the probability exceeds 96%, resulting in a "Success" (likely to repay) classification based on a 50% threshold. This threshold is, of course, adjustable depending on the specific business objectives.

To avoid the "black box" problem, I use SHAP analysis to explain why a customer received a specific score. As seen in the graph, the length of the red arrows indicates the contribution of each feature. Here, employment_status was the most significant factor driving the "Success" prediction. This transparency is crucial for corporate accountability.

We can also apply SHAP to the entire dataset. Again, employment_status emerges as the top contributor across all customers. We can see that this feature also carries a high degree of contribution across the entire customer base.

Furthermore, SHAP allows us to visualize the non-linear relationship between specific features and repayment probability. For example, with credit_score, the probability doesn't just rise linearly. The data shows that the probability remains flat until a score of 550, starts to rise at 600, and accelerates significantly after 700. This level of granular insight is what makes SHAP so valuable.

By using Agent Skills, you can embed entire libraries of domain knowledge directly into your AI’s workflow. These skills are reusable, portable, and—in my opinion—will soon be a requirement for any business using AI agents.

I look forward to seeing how Agent Skills continue to permeate the corporate world and what innovations they will trigger. TOSHI STATS Co. will continue to lead the way in this space.

Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Agent Skills
2) Predicting Loan Payback, Yao Yan, Walter Reade, Elizabeth Park. Kaggle, 2025
3) Introducing Claude Opus 4.6, Anthropic, Feb 5 2026

Copyright © 2026 Toshifumi Kuga. All right reserved
Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Toshifumi Kuga

February 15, 2026

Insurance Cross-Sell, agentic coding, claude code, Opus4.6

From Zero to Production: How Opus 4.6 Agentic Coding Revolutionizes Insurance Analytics

Toshifumi Kuga

February 15, 2026

Insurance Cross-Sell, agentic coding, claude code, Opus4.6

In the ever-evolving landscape of InsurTech, cross-selling is a literal goldmine. Utilizing Opus 4.6 and Agentic Coding, I have constructed a sophisticated "Insurance Cross-Sell Prediction Model" implementation pipeline, covering everything from memory-optimized data loading to complex feature engineering. Let’s dive in!

1. Agentic Coding with Opus 4.6

Unlike traditional coding, Agentic Coding with Opus 4.6 (1) allows the AI to function as an autonomous engineer. It goes beyond writing snippets; it manages directory structures, ensures memory efficiency for datasets of 11.5 million rows, and completes a production-ready Streamlit dashboard.

In this process, my role was simply to write the "Product Requirement Document (PRD)”—a document in natural language (Japanese or English) defining what I wanted to build. No Python knowledge was required on my part. By putting Claude Code into plan mode, an implementation blueprint is automatically generated, allowing me to verify the coding logic before Opus 4.6 executes it. While I monitored the progress, I never had to write a single line of code myself. Truly remarkable.

2. Project Overview

This project features a robust ecosystem designed for real-world application:

LightGBM + Optuna: Automated hyperparameter optimization to maximize AUC.
50 Ratio-Based Features: Generation of 50 unique indicators to capture hidden customer behavior patterns.
Explainability via SHAP: Implementation of SHAP values to visualize why a specific customer is likely to purchase.

The data was sourced from a Kaggle competition regarding automobile insurance cross-selling (2).

Performance Results: When evaluating the model built via Opus 4.6 Agentic Coding on the Kaggle leaderboard, it achieved a high score of AUC = 0.88343. This level of accuracy is more than sufficient for practical business use.

3. Key Features of the Implementation

The model provides two primary functions: individual customer prediction and total customer portfolio analysis.

Individual Prediction

We set the threshold for a "successful" cross-sell at a probability of 35% or higher. Below is an example of a customer predicted to be a successful cross-sell target. To avoid the "Black Box" problem, we use SHAP values to show the contribution of each feature. The larger the SHAP value, the higher its contribution to the positive prediction. This allows staff to understand the concrete reasoning behind the AI's decision.

Conversely, for customers predicted to fail (probability below 35%), the SHAP values indicate which factors are pulling the probability down.

Customer portfolio Analysis

We can also analyze the "Cross-Sell Success Rate" across an entire customer portfolio. In this demo, we imported a CSV of 30,000 customers. With the threshold set at 35%, the model identified 3,708 potential targets. By adjusting the threshold, marketing teams can narrow or broaden their focus for specific campaigns. The dashboard also displays the overall probability distribution across the entire dataset.

4. Business Impact

This high-precision model provides sales representatives with a prioritized "Hot Lead" list. Thanks to the Streamlit-based GUI, non-technical staff can execute batch predictions and verify the reasoning via SHAP instantly. This is the definition of Data-Driven Marketing.

Conclusion

The synergy between Opus 4.6 and human expertise is redefining the speed of machine learning development and implementation. The potential is, quite frankly, staggering. At TOSHI STATS, we will continue to explore innovations in this field.

Stay tuned!

1) Introducing Claude Opus 4.6, Anthropic, Feb 5 2026
2) Binary Classification of Insurance Cross Selling, Walter Reade and Ashley Chow, Kaggle

You can enjoy our video news ToshiStats-AI from this link, too!

Toshifumi Kuga

February 8, 2026

Bank Churn Prediction, Vibe Coding, claude code, Machine Learning, Opus4.6

Mind-Blowing Performance: Building a Bank Churn Prediction Model using Claude Opus 4.6

Toshifumi Kuga

February 8, 2026

Bank Churn Prediction, Vibe Coding, claude code, Machine Learning, Opus4.6

Earlier in 2026, the AI giant Anthropic announced Opus 4.6(1), the latest update to its frontier model series. Today, I want to share my experience using Claude Code to build a bank customer churn prediction model to see just how far this new version can go. Let’s dive in.

1. The Ultimate Coding Model

Opus 4.6 is Anthropic’s new masterpiece, outperforming Opus 4.5 across various benchmarks. Its coding capabilities, in particular, are often rated as the best in the industry, and it feels like it’s now a giant leap ahead of the competition.

2. Developing a Churn Prediction Model via "Agentic Coding"

I decided to pair Claude Code with Opus 4.6 to develop a prediction model using "agentic coding"—a method where the AI agent handles the entire Python implementation without human intervention.

The task: Bank Customer Churn Prediction. Losing customers is costly and hurts brand loyalty. A predictive model allows us to identify "at-risk" customers and take proactive retention measures before they leave. For this experiment, I used a dataset from a well-known Kaggle competition.

The Workflow

PRD Creation: I wrote a detailed Product Requirement Document (PRD) outlining my goals.
Autonomous Execution: I ran Claude Code in plan mode. It drafted the implementation strategy, and once I gave the green light, it proceeded to code the entire system.
Minimal Intervention: While Claude Code occasionally asked for permissions, I simply hit "yes" every time. It was effectively 100% AI-driven development.

The Resulting GUI

The final application is a sleek tool where you can select a Customer ID to see their specific churn probability. It clearly distinguishes between "Loyal" and "At-Risk" customers.

Individual Prediction: Instant probability scores for specific users.
Batch Prediction: For a birds-eye view, you can upload a CSV of your entire database (approx. 110,000 customers).
Dynamic Thresholding: You can set a churn threshold. For example, at a 50% threshold, 31.2% of the customers are flagged as likely to leave.

By raising the threshold to 90%, the list narrows down to the most critical 8.3% of the customer base. This makes it incredibly easy to target high-stakes marketing campaigns or retention offers.

Efficiency Note: The entire process—from data acquisition to a fully functional predictive model—took only about 90 minutes. Not having to write a single line of Python manually is a massive productivity boost.

To enable even deeper analysis, I’ve also included a CSV export feature. Those proficient in Python can leverage this file to conduct their own custom evaluations as needed.

3. Glimpsing the Latent Potential of Opus 4.6

As expected, Opus 4.6 completed the end-to-end development process without a single error. When I attempted this same task with Opus 4.5, I had to tell AI agent to correct a calculation method because I hadn't been specific enough in my pipeline description. This time? Zero rework. The performance improvement is tangible.

Opus 4.6 is set to become an indispensable partner in machine learning development. While this isn't a "full" generational leap (like a version 5.0), the refinement is world-class. Rumor has it that Opus 5 is already deep in development at Anthropic and might debut in late 2026. I can’t wait to see what kind of evolution that brings.

Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) Introducing Claude Opus 4.6, Anthropic, Feb 5 2026
2) Binary Classification with a Bank Churn Dataset, Kaggle, Jan 2, 2024

Toshifumi Kuga

February 1, 2026

Genie 3, agentic coding, claude code

AGI in 2 Years or 5 Years? — Survival Strategies for 2030

Toshifumi Kuga

February 1, 2026

Genie 3, agentic coding, claude code

In January 2026, several interviews with CEOs of top AI labs were released. One particularly fascinating encounter was the face-to-face interview (1) between Anthropic CEO Dario Amodei and Google DeepMind CEO Demis Hassabis. I have summarized my thoughts on what their comments imply. I hope you find this insightful!

1. Will AGI Arrive Within 2 Years?

Dario seems to hold a more accelerated timeline for the realization of AGI. While prefixing his thoughts with "It is difficult to predict exactly when it will happen," he pointed to the reality within his own company: "There are already engineers at Anthropic who say they no longer write code themselves. In the next 6 to 12 months, AI might handle the majority of code development. I feel that loop is closing rapidly." He argued that AI development is hitting a flywheel effect, particularly noting that progress in coding and research is so remarkable that AI intelligence will surpass public expectations within a few short years.

A prime example is Claude Code, released by Anthropic last year. This revolutionary product is currently taking the software development world by storm. It is no exaggeration to say that the common refrain "I don’t code manually anymore" is a direct result of this tool. In fact, I recently used it to tackle a past Kaggle competition; I achieved an AUC of 0.79 with zero manual coding, which absolutely stunned me (3).

2. AGI is Still 5 Years Away

On the other hand, Demis maintains his characteristically cautious stance. He often remarks that there is a "50% chance of achieving AGI in five years." His reasoning is grounded in the current limitations of AI: "Today’s AI isn't yet consistently superior to humans across all fields. A model might show incredible performance in one area but make elementary mistakes in another. This inconsistency means we haven't reached AGI yet." He believes two or three more major breakthroughs are required, which explains his longer timeline compared to Dario.

Unlike Anthropic, which is heavily optimized for coding and language, Google is focusing on a broader spectrum. One such focus is World Models—simulations of the physical spaces we inhabit. In these models, physics like gravity are reproduced, allowing the AI to better understand the "real" world. Genie 3 (2) is their latest version in this category. While it has only been released in the US so far, I am eagerly anticipating its global rollout. The "breakthroughs" Demis mentions likely lie at the end of this developmental path.

3. Are We Prepared for AGI?

While their timelines differ, Dario and Demis agree on one fundamental point: AGI—which will surpass human capabilities in every field—is not far off. Exactly ten years ago, in March 2016, DeepMind’s AlphaGo defeated the world’s top Go professional. Since then, no human has been able to beat AI in the game of Go. Soon, we may reach a point where humans can no longer outperform AI in any field. What we are seeing in the world of coding today is the precursor to that shift.

It is a world that is difficult to visualize. Industrial structures will be upended, and the very role of "human work" will change. It is hard to say that we are currently prepared for this reality. In 2026, we must begin a serious global dialogue on how to adapt. I look forward to engaging in these discussions with people around the world.

I highly recommend watching the full interview with Dario and Demis. These two individuals hold the keys to our collective future. That’s all for today. Stay tuned!

1) The Day After AGI | World Economic Forum Annual Meeting 2026, World Economic Forum, Jan 21, 2026
2) Genie 3, Google DeepMind, Jan 29, 2026
3) Is agentic coding viable for Kaggle competitions?, January 16, 2026

You can enjoy our video news ToshiStats-AI from this link, too!

Toshifumi Kuga

January 16, 2026

Agentic AI, claude code, Opus 4.5, agentic coding

Is agentic coding viable for Kaggle competitions?

Toshifumi Kuga

January 16, 2026

Agentic AI, claude code, Opus 4.5, agentic coding

The "Agentic Coding" trend continues to accelerate as we enter 2026. In this post, I will challenge myself to see how high I can push accuracy by delegating the coding process to an AI agent, using data from the Kaggle competition Home Credit Default Risk [1]. Let's get started right away.

1. Combining Claude Code and Opus 4.5

I will be using Opus 4.5, a generative AI renowned for its coding capabilities. Additionally, I will use Claude Code as my coding assistant, as shown below. While I enter instructions into the prompt box, I do not write any Python code myself.

You can see the words "plan mode" at the bottom of the screen. In this mode, Claude Code formulates an implementation plan based on my instructions. I simply review it, and if everything looks good, I authorize the execution.

Let's look at the actual instructions I issued. It is quite long for a "prompt," spanning about two A4 pages. The beginning of the implementation instructions is shown below. I wrote it in great detail. I'd like you to pay special attention to the final instruction regarding the creation of 50 new features using ratio calculations.

Part of the Product Requirement Document

Below is a portion of the implementation plan formulated by the AI agent. It details the method for creating new features via ratio calculations. Although I only specified the quantity of features, the plan shows that it selected features likely to be relevant to loan defaults before calculating the ratios.

The AI agent utilized its own domain knowledge to make these selections; they were certainly not chosen at random. This demonstrates the high-level judgment capabilities unique to AI agents.

New feature creation plan by the AI Agent

Part of the new features actually created by the AI Agent

2. Achieving an AUC of 0.79

By adopting LightGBM as the machine learning library, using the newly created features, and performing hyperparameter tuning, I was able to achieve an AUC of 0.79063, as shown below.

Reaching this level without writing a single line of Python code myself marks this experiment as a success. The data used to build the machine learning model consisted of seven different CSV files. These had to be merged correctly, and the AI agent handled this task seamlessly. Truly impressive!

3. Will AI Agents Handle Future Machine Learning Model Development?

While the computation time depends on the number of features created, it generally took between 1 to 4 hours. I ran the process several times, and the calculation never stopped due to syntax errors. The AI agent likely corrected any errors itself before proceeding to the next calculation step.

Therefore, once the initial implementation plan is approved, the results are generated without any further human intervention. This could be revolutionary. You simply input what you want to achieve via a PRD (Product Requirement Document), the AI agent creates an implementation plan, and once you approve it, you just wait for the results. The potential for multiplying productivity several times over is certainly there.

How was it? I was personally astonished by the high potential of the "Claude Code and Opus 4.5" combination. With a little ingenuity, it seems capable of even more.

This story is just beginning. Opus 4.5 will likely be upgraded to Opus 5 within the year. I am already looking forward to seeing what AI agents will be capable of then.

That’s all for today. Stay tuned!

1) Home Credit Default Risk, kaggle

You can enjoy our video news ToshiStats-AI from this link, too!

Toshifumi Kuga

January 9, 2026

Opus 4.5, claude code, generative ai, ADK, Machine Learning, Agentic AI

"ClaudeCode + Opus 4.5" Arrives as the 2026 Game Changer !

Toshifumi Kuga

January 9, 2026

Opus 4.5, claude code, generative ai, ADK, Machine Learning, Agentic AI

2026 has officially begun! The AI community is already abuzz with talk of "agentic coding" using ClaudeCode + Opus 4.5. I decided to build an actual application myself to test the potential of this combination. Let’s dive in.

1. ClaudeCode + Opus 4.5

These are the coding assistant and frontier model from Anthropic, respectively, both renowned for their strength in coding tasks. I imagine many will use them integrated into an IDE like VS Code, as shown below. You can see the selected model is Opus 4.5. Also, notice the "plan mode" indicator at the bottom.

Here, a data scientist inputs a prompt detailing exactly what they want to develop. The system then enters "plan mode" and generates an implementation plan like the following. The actual output is quite long, but here is the summary:

The goal this time is to create an application that combines machine learning and Generative AI, as described above. Once you agree to this implementation plan, the actual coding begins.

2. Completion of the AI App with GUI

In this completed app, you can input customer data via the screen below to calculate the probability of default, which can then be used to assess loan eligibility.

The first customer shows low risk, so a loan appears feasible.

**‍　　　　　　　　　　　　　　　　　‍**Default Probability 2

For the second customer, as highlighted in the red frame, the payment status shows a 2-month delay. The probability of default skyrockets to 65.54%. This is a no-go for a loan.

3. Validating Model Accuracy on a Separate Screen

This screen displays the metrics for the constructed prediction model, allowing you to gauge its accuracy. While figures like AUC are bread and butter for experts, they might be a bit difficult for general business users to grasp.

To address this, I decided to include natural language explanations. By leveraging Generative AI, implementing multilingual support is relatively straightforward.

Switching the setting changes the text from English to Japanese. Of course, support for other languages could be added with further development.

While I used Opus 4.5 during the development phase, this application uses an open-source Generative AI model internally. This allows it to function completely disconnected from the internet—making it ideal even for enterprises with strict security requirements.

So, what are your thoughts?

An application with this rich feature set and a high-precision machine learning model was completed entirely with no-code. I didn't write a single line of code this time.

Opus 4.5 was truly impressive; the process never stalled due to syntax errors or similar issues. I can genuinely feel that the accuracy is on a completely different level compared to just six months ago. moving forward, it seems likely that "agentic coding" will become the standard starting point for creating new machine learning models and GenAI apps. It feels like PoC-level projects could now be knocked out in a matter of days.

I’m looking forward to building many more things. That’s all for today.

Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

Toshifumi Kuga

January 2, 2026

AI agent, Machine Learning, claude code, Governance

What Awaits Us in 2026? Bold Predictions for AI Agents & Machine Learning

Toshifumi Kuga

January 2, 2026

AI agent, Machine Learning, claude code, Governance

Happy New Year!

As we finally step into 2026, I am sure many of you are keenly interested in how AI agents will develop this year. Therefore, I would like to make some bold predictions by raising three key points, while also considering their connection to machine learning. Let's get started.

1. A Dramatic Leap in Multimodal Performance

I believe the high precision of the image generation AI "Nano Banana Pro (1)," released by Google on November 20, 2025, likely stunned not just AI researchers but the general public as well. Its ability to thoroughly grasp the meaning of a prompt and faithfully reproduce it in an image is magnificent, possessing a capability that could be described as "Text-to-Infographics."

Furthermore, its multilingual capabilities have improved significantly, allowing it to perfectly generate Japanese neon signs like this: "明けましておめでとう 2026" (Happy New Year 2026)

This model is not a simple image generation AI; it is built on top of the Gemini 3 Pro frontier model with added image generation capabilities. That is why the AI can deeply understand the user's prompt and generate images that align with their intent. Google also possesses AI models like Genie 3(2) that perform simulations using video, leading the industry with multimodal models. We certainly cannot take our eyes off their movements in 2026.

2. The Explosive Popularity of "Agentic Coding"

Currently, coding by AI agents—"Agentic Coding"—has become a massive global movement. However, for complex code, it is not yet 100% perfect, and human review is still necessary. Additionally, humans still need to create the Product Requirement Document (PRD), which serves as the blueprint for implementation.

I have built several default prediction models used in the financial industry, and I always feel that development is more efficient when the human side first creates a precise PRD. By doing so, we can largely entrust the actual coding to the AI agent. This is an example of default prediction model.

However, the speed of evolution for frontier models is tremendous. In the latter half of 2026, we expect updates like Gemini 4, GPT-6, and Claude 5, and frankly, it is difficult to even imagine what capabilities AI agents will acquire as a result.

Alongside the progress of these models, the toolsets known as "code assistants" are also likely to significantly improve their capabilities. Tools like Claude Code, Gemini CLI, Cursor, and Codex have become indispensable for programmers today, but in 2026, these code assistants will likely play an active role in fields closer to business, such as machine learning and economic analysis.

At this point, calling them "code assistants" might be off the mark; a broader name like "Thinking Machine for Business" might be more appropriate. The day when those who don't know how to code can master these tools may be close at hand. It is very exciting.

3. AI Agents and Governance

As mentioned above, it is predicted that in 2026, AI agents will increasingly permeate large organizations such as corporations and governments. However, there is one thing we must be careful about here.

The behavior of AI agents changes probabilistically. This means that different outputs can be produced for the same input, which is vastly different from current systems. Furthermore, if an AI agent possesses the ability for Recursive Self-Improvement (updating and improving itself), it means the AI agent will change over time and in response to environmental changes. In 2026, we must begin discussions on governance: how do we structure organizational processes and achieve our goals using AI agents that possess characteristics unlike any previous system? This is a very difficult theme, but I believe it is unavoidable if humanity is to securely capture the benefits and gains from AI agents. I previously established corporate governance structures in the financial industry, and I hope to contribute even a little based on that experience.

What did you think? It looks like AI evolution will accelerate even further in 2026. I hope we can all enjoy it together. I look forward to another great year with you all.

You can enjoy our video news ToshiStats-AI from this link, too!

1) Introducing Nano Banana Pro, Google, Nov 20, 2025
2) Genie 3: A new frontier for world models, Jack Parker-Holder and Shlomi Fruchter, Google DeepMind, August 5, 2025

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

Toshifumi Kuga

December 18, 2025

Vibe Coding, claude code, AI agent, Machine Learning, Plan Mode, MetaPrompt

Improving ML Vibe Coding Accuracy: Hands-on with Claude Code's Plan Mode

Toshifumi Kuga

December 18, 2025

Vibe Coding, claude code, AI agent, Machine Learning, Plan Mode, MetaPrompt

2025 was a year where I actively incorporated "Vibe Coding" into machine learning. After repeated trials, I encountered situations where coding accuracy was inconsistent—sometimes good, sometimes bad.

Therefore, in this experiment, I decided to use Claude Code "Plan Mode" (1) to automatically generate an implementation plan via an AI agent before generating the actual code. Based on this plan, I will attempt to see if a machine learning model can be built stably using "Vibe Coding." Let's get started!

1. Generating an Implementation Plan with Claude Code "Plan Mode"

Once again, I would like to build a model that predicts in advance whether a customer will default (on a loan, etc.). I will use publicly available credit card default data (2). For the code assistant, I am using Claude Code, and for the IDE, the familiar VS Code.

To provide input to the Claude Code AI agent, I summarized the task and implementation points into a "Product Requirement Document (PRD)." This is the only document I created.

I input this PRD into Claude Code "Plan Mode" and instructed it to: "Create a plan to create predictive model under the folder of PD-20251217".

Within minutes, the following implementation plan was generated. Comparing it to the initial PRD, you can see how refined it is. Note that I am only showing half of the actual plan generated here—a truly detailed plan was created. I can only say that the ability of the AI agent to envision this far is amazing.

2. Beautifully Visualizing Prediction Accuracy

When this implementation plan is approved and executed, the prediction model is generated. Naturally, we are curious about the accuracy of the resulting model.

Here, it is visualized clearly according to the implementation plan. While these are familiar metrics for machine learning experts, all the important ones are covered and visualized in an easy-to-understand way, summarized as a single HTML file viewable in a browser.

The charts below are excerpts from that file. It includes ROC curves, SHAP values, and even hyperparameter tuning results. This time, the total implementation time was about 10 minutes. If it can be generated automatically to this extent in that amount of time, I’d rather leave it to the AI agent.

3. Meta-Prompting with Claude Code "Plan Mode"

A Meta-Prompt refers to a "prompt (instruction to AI) used to create and control prompts."

In this case, I called Claude Code "Plan Mode" and instructed it to "generate an implementation plan" based on my PRD. This is nothing other than executing a meta-prompt in "Plan Mode."

Thanks to the meta-prompt, I didn't have to write a detailed implementation plan myself; I only needed to review the output. It is efficient because I can review it before coding, and since that implementation plan can be viewed as a highly precise prompt, the accuracy of the actual coding is expected to improve.

To be honest, I don't have the confidence to write the entire implementation plan myself. I definitely want to leave it to the AI agent. It has truly become convenient!

How was it? Generating implementation plans with Claude Code "Plan Mode" seems applicable not only to machine learning but also to various other fields and tasks. I definitely intend to continue trying it out in the future. I encourage everyone to give it a challenge as well.

That’s all for today. Stay tuned!

You can enjoy our video news ToshiStats-AI from this link, too!

1) How to use Plan Mode, Anthropic

2) Default of Credit Card Clients

Copyright © 2025 Toshifumi Kuga. All right reserved
Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.