The Secret to High-Accuracy AI: An Exploration of a Machine Learning Engineering Agent

In a previous post, I explained Google's research paper "MLE-STAR" (1) and uncovered the mechanism by which an AI can build its own high-accuracy machine learning models. This time, I'm going to implement that AI agent using the Google ADK and run an experiment to see whether it can truly achieve high accuracy. For reference, the MLE-STAR code is available as open source (2).

 

1. The Information I Provided

With MLE-STAR, humans only need to handle the data input and the task definition. The data for this experiment comes from the Kaggle competition "Home Credit Default Risk" (3). The original data consists of 8 files, which I combined into a single file for this experiment. I reduced the training data to 10% of the original, about 30,000 samples, and kept the full test set of 48,700 samples.
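The merging itself is straightforward in pandas. Below is a minimal sketch of that kind of preprocessing, shown for just two of the eight tables; the file names and the SK_ID_CURR join key follow the competition's data layout, but the aggregation choice is my own simplification, not necessarily what I ran:

```python
import pandas as pd

# Two of the eight Home Credit tables, for illustration
app = pd.read_csv("application_train.csv")   # one row per applicant
bureau = pd.read_csv("bureau.csv")           # many rows per applicant

# Collapse the child table to one row per applicant, then join
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
    .mean(numeric_only=True)
    .add_prefix("BUREAU_")
    .reset_index()
)
combined = app.merge(bureau_agg, on="SK_ID_CURR", how="left")

# Downsample the training data to roughly 10% of the original
train_small = combined.sample(frac=0.1, random_state=42)
train_small.to_csv("train_combined.csv", index=False)
```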

The task was defined as follows: "A classification task to predict default." Note that, to speed up the experiment, the number of refinement loops was set to the minimum.

[Figure: Task Setup]

 

2. Deciding Which Model to Use

MLE-STAR uses web search to select a suitable model for the given task. In this case, it ultimately chose LightGBM. To finish the experiment quickly, I configured it to select only one model; had I allowed two, it would likely have also chosen something like XGBoost. Both models are used frequently in data science competitions.

[Figure: Model Selection by MLE-STAR]

It then generated an initial training script. As a frequent LightGBM user, I found the code familiar, but generating it in an instant is something only an AI can do. It's amazing!
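The generated script itself appeared as an image in the original post, so it isn't reproduced here, but it followed the usual LightGBM baseline pattern. The sketch below is my own reconstruction of that pattern, not the agent's actual output; the file name and hyperparameters are assumptions, while TARGET is the competition's real label column:

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Combined Home Credit file produced in the previous step
df = pd.read_csv("train_combined.csv")
y = df["TARGET"]
X = pd.get_dummies(df.drop(columns=["TARGET"]))  # one-hot the categoricals

# Hold out a validation split to estimate AUC
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_tr, y_tr)

val_pred = model.predict_proba(X_val)[:, 1]
print("Validation AUC:", roc_auc_score(y_val, val_pred))
```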

 

3. Identifying Key Code Blocks with "Ablation Studies"

Next, it uses ablation studies to identify which code blocks should be improved. In this case, ablation2 showed that removing early stopping worsened the model's performance, so early stopping was kept in the training process from then on (a minimal sketch of that block follows the figure below).

[Figure: Ablation Study Results by MLE-STAR]
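For readers unfamiliar with the block in question, early stopping in LightGBM looks roughly like this. This is my own minimal sketch on synthetic data, not the agent's generated code:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a binary classification task
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Stop adding trees once validation AUC fails to improve for 100 rounds
model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)
print("Best iteration:", model.best_iteration_)
```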

 

4. Iteratively Improving the Model

Based on the ablation studies, MLE-STAR decided to improve the model with two techniques: K-fold target encoding and binary encoding (both are sketched below). The techniques themselves are common in machine learning and not particularly unusual.

[Figure: K-fold Target Encoding]
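K-fold target encoding replaces a categorical column with the mean of the target computed out-of-fold, so each row's encoding never sees that row's own label. Here is a minimal sketch of the idea, with hypothetical column names; it is my illustration, not the code MLE-STAR generated:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df: pd.DataFrame, col: str, target: str,
                        n_splits: int = 5, seed: int = 42) -> pd.Series:
    """Out-of-fold target-mean encoding for one categorical column."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Category means are computed on the other folds only
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = df[col].iloc[enc_idx].map(means).to_numpy()
    # Categories unseen in a fold fall back to the global target mean
    return encoded.fillna(df[target].mean())

# Toy usage with a hypothetical categorical column
df = pd.DataFrame({"occupation": ["a", "b", "a", "c", "b", "a"],
                   "TARGET": [1, 0, 0, 1, 1, 0]})
df["occupation_te"] = kfold_target_encode(df, "occupation", "TARGET", n_splits=3)
```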

[Figure: Binary Encoding]
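Binary encoding maps each category to an integer code and spreads the bits of that code across a handful of 0/1 columns, which is far more compact than one-hot encoding for high-cardinality features. Again, a sketch of my own rather than the generated code:

```python
import pandas as pd

def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Expand each category's integer code into binary digit columns."""
    codes = series.astype("category").cat.codes.to_numpy()  # missing values become -1
    n_bits = max(int(codes.max()).bit_length(), 1)
    return pd.DataFrame(
        {f"{series.name}_bit{i}": (codes >> i) & 1 for i in range(n_bits)},
        index=series.index,
    )

# Five categories need only three binary columns instead of five one-hot columns
s = pd.Series(["a", "b", "c", "d", "e", "a"], name="city")
print(binary_encode(s))
```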

This ability to "use ablation studies to identify which code blocks to improve" is likely a major reason for MLE-STAR's high accuracy. I look forward to seeing how this functionality evolves.

 

5. The Results Are In. Unfortunately, I Lost.

For its final step, MLE-STAR ensembles the candidate models to create the final version; for details, please see the research paper. It also generates a CSV file with the default predictions, which I slightly modified and promptly submitted to Kaggle. The task is evaluated with AUC, where a score closer to 1 indicates higher accuracy.
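As a toy illustration of both steps, the snippet below averages two probability predictions and scores them with scikit-learn's roc_auc_score. The simple averaging is my stand-in for brevity; the paper's ensembling strategy is more sophisticated, and the data here is synthetic:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # stand-in labels

# Two weakly informative toy probability predictions
p_a = np.clip(0.5 * y_true + 0.5 * rng.random(1000), 0, 1)
p_b = np.clip(0.4 * y_true + 0.6 * rng.random(1000), 0, 1)

# Simplest possible ensemble: average the predicted probabilities
p_ens = (p_a + p_b) / 2
for name, p in [("model A", p_a), ("model B", p_b), ("ensemble", p_ens)]:
    print(name, round(roc_auc_score(y_true, p), 4))
```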

The top score is the result I achieved with my own LightGBM model. The score in the red box at the bottom is the one MLE-STAR generated automatically. With a gap of more than 0.01 on both the Public and Private leaderboards, it was my complete defeat.

[Figure: Kaggle Prediction Accuracy Evaluation (AUC)]

Improving AUC by 0.01 is quite a challenge, which shows just how capable MLE-STAR is. I didn't perform any extensive tuning on my LightGBM model, so my score would likely have improved had I spent time tuning it manually. But MLE-STAR produced its result in about 7 minutes from the start of computation, so on efficiency alone I couldn't compete.

 
 

So, what did you think? Although this was a limited experiment, I feel I was able to grasp MLE-STAR's high potential. I was truly impressed by the power of its Recursive Self-Improvement, which identifies specific code blocks and improves them autonomously.

Here at Toshi Stats, I plan to keep digging into MLE-STAR. Stay tuned!





You can also enjoy our video news, ToshiStats-AI, from this link!




1) MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement. Jaehyun Nam, Jinsung Yoon, Jiefeng Chen, Jinwoo Shin, Sercan Ö. Arık, and Tomas Pfister. Google Cloud and KAIST, Aug 23, 2025.

2) Machine Learning Engineering with Multiple Agents (MLE-STAR), Google.

3) Home Credit Default Risk, Kaggle.



Copyright © 2025 Toshifumi Kuga. All rights reserved.

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.