We are already hearing from many in the field that, with the arrival of GPT-5, "the writing style is different from GPT-4o and earlier" and "its performance as an agent is on another level." Here, we build on the key points of OpenAI's "GPT-5 Prompt Guide" (1) and organize, from a practical perspective, how to write prompts that reliably reproduce the behavior you want. The following three points are key:

GPT-5 acts very proactively as an AI agent.
Self-reflection and guiding principles.
Instruction following with "surgical precision."

Let's delve into each of these.

1. GPT-5 acts very proactively as an AI agent.

GPT-5's enhanced capabilities in tool calling, long-context understanding, and planning allow it to proceed autonomously even on ambiguous tasks. Whether you "harness" or "suppress" this capability depends on how you design the agent's "eagerness."

1-1. Controlling Eagerness with Prompts

To suppress eagerness, intentionally limit the depth of exploration and explicitly set caps on parallel searches or additional tool calls. This is effective in situations where processing time and cost are priorities, or when requirements are clear and exploration needs to be minimized.

To enhance eagerness, explicitly state rules for persistence, such as "Do not end the turn until the problem is fully resolved" and "Even with uncertainty, proceed with the best possible plan." This is suitable for long-duration tasks where you want the agent to see them through to completion with minimal check-ins with the user.

Practical Snippet (To suppress eagerness):

<context_gathering>
Goal: Reach a conclusion quickly with minimal information gathering.
Method: A single-batch search, starting broad and then narrowing down. Avoid duplicate searches.
Budget: A maximum of 2 tool calls.
Escape: If a conclusion is reasonably certain, accept minor incompleteness to provide an early answer.
</context_gathering>

Practical Snippet (To encourage eagerness):

<persistence>
Do not end the turn until the problem is completely resolved.
Reason through uncertainty and continue with the best possible plan.
Minimize clarifying questions. Adopt reasonable assumptions and state them later.
</persistence>
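The two snippets above can be switched per task when the system prompt is assembled. Below is a minimal sketch of that idea. The helper name build_system_prompt, the "low"/"high" labels, and the max_tool_calls parameter are our own illustrative assumptions and do not come from the guide. Only the block texts are taken from the snippets above.

```python
# Illustrative sketch: names and mode labels are assumptions, not part of the guide.
LOW_EAGERNESS = """<context_gathering>
Goal: Reach a conclusion quickly with minimal information gathering.
Method: A single-batch search, starting broad and then narrowing down. Avoid duplicate searches.
Budget: A maximum of {max_tool_calls} tool calls.
Escape: If a conclusion is reasonably certain, accept minor incompleteness to provide an early answer.
</context_gathering>"""

HIGH_EAGERNESS = """<persistence>
Do not end the turn until the problem is completely resolved.
Reason through uncertainty and continue with the best possible plan.
Minimize clarifying questions. Adopt reasonable assumptions and state them later.
</persistence>"""


def build_system_prompt(base_instructions: str, eagerness: str, max_tool_calls: int = 2) -> str:
    """Append the eagerness block that matches the task to the base instructions."""
    if eagerness == "low":
        block = LOW_EAGERNESS.format(max_tool_calls=max_tool_calls)
    elif eagerness == "high":
        block = HIGH_EAGERNESS
    else:
        raise ValueError(f"eagerness must be 'low' or 'high', got {eagerness!r}")
    return f"{base_instructions.strip()}\n\n{block}"


if __name__ == "__main__":
    print(build_system_prompt("You are a research assistant.", "low"))
```

Keeping the blocks as separate constants makes it easy to tell which behavior a given agent run was asked for when you compare results later.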

1-2. Visualize with a "Tool Preamble"

When the agent produces a long rollout during execution, having it first give a brief preamble (restating the objective, outlining the plan, noting progress, and confirming completion) makes it easier for the user to follow along and improves the user experience.

Recommended Snippet:

<tool_preambles>
First, restate the user's goal in a single sentence. Follow with a bulleted list of the planned steps.
During execution, add concise progress logs sequentially.
Finally, provide a summary that clearly distinguishes between the "Plan" and the "Actual Results."
</tool_preambles>

2. Self-reflection and Guiding Principles

GPT-5 excels at "internally refining" the quality of its output through self-reflection. However, if the criteria for judging quality are not established beforehand, this reflection can become unproductive. This is where guiding principles and a private rubric are effective.

2-1. Provide a "Self-Grading Scorecard" with a Private Rubric

For zero-to-one generation tasks (e.g., creating a new web app, drafting specifications), have the model internally create a scorecard with 5-7 evaluation criteria. Then, have it repeatedly rewrite and re-evaluate its output based on these criteria.

Rubric Generation Snippet:

<self_reflection>
Define the conditions that a world-class deliverable should meet across 5-7 categories (e.g., UI quality, readability, robustness, extensibility, accessibility, accountability). Score your own proposal against these criteria, identify shortcomings, and redesign. The rubric itself should not be shown to the user.
</self_reflection>

2-2. Reduce Inconsistency with Guiding Principles

For ongoing development or modifications to existing code, first give the model the project's conventions by clearly stating its design principles, directory structure, and UI standards. This helps the model's suggested improvements and changes fit naturally into the existing codebase.

Guiding Principles Snippet (Example):

<guiding_principles>
Clarity and Reusability: Keep components small and reusable. Group them and avoid duplication.
Consistency: Unify tokens, typography, and spacing.
Simplicity: Avoid unnecessary complexity in styling and logic.
</guiding_principles>

2-3. Separately Control Verbosity and Reasoning Effort

With GPT-5, you can control verbosity (the length of the final answer) and reasoning_effort (the depth of thought) independently. This allows for context-specific overrides, such as "be concise in prose, but provide detailed explanations in code." The guide introduces a practical example of prompt tuning by Cursor, which is worth checking out. A useful tip for fast mode (minimal reasoning) is to ask the model to begin its answer with a brief summary of its thinking or plan, which supports its reasoning process.
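As a rough illustration of setting the two controls separately, the sketch below only builds a request payload and does not call any API. The field layout (text.verbosity and reasoning.effort) and the allowed values reflect our understanding of OpenAI's Responses API at the time of writing, so please treat them as assumptions and check the current API reference. The helper name build_request is hypothetical.

```python
# Sketch only: builds a payload dict, no network call is made.
# Field names and allowed values are assumptions based on our reading of the
# Responses API; verify them against the current documentation.
from typing import Optional

ALLOWED_VERBOSITY = {"low", "medium", "high"}
ALLOWED_EFFORT = {"minimal", "low", "medium", "high"}


def build_request(
    user_input: str,
    verbosity: str = "medium",
    reasoning_effort: str = "medium",
    instructions: Optional[str] = None,
) -> dict:
    """Return a request payload with verbosity and reasoning effort set independently."""
    if verbosity not in ALLOWED_VERBOSITY:
        raise ValueError(f"unknown verbosity: {verbosity!r}")
    if reasoning_effort not in ALLOWED_EFFORT:
        raise ValueError(f"unknown reasoning effort: {reasoning_effort!r}")
    request = {
        "model": "gpt-5",
        "input": user_input,
        "text": {"verbosity": verbosity},          # length of the final answer
        "reasoning": {"effort": reasoning_effort},  # depth of thought
    }
    if instructions:
        # A context-specific override written in natural language.
        request["instructions"] = instructions
    return request


if __name__ == "__main__":
    print(build_request(
        "Refactor this function.",
        verbosity="low",
        reasoning_effort="high",
        instructions="Be concise in prose, but provide detailed explanations in code.",
    ))
```

The point of the sketch is that a short answer and deep reasoning are not in conflict: the two settings live in separate fields, and a natural-language instruction can still override verbosity for a specific context such as code.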

3. GPT-5's Instruction Following has "Surgical Precision"

GPT-5 is extremely sensitive to the accuracy and consistency of instructions. Contradictory requests or ambiguous prompts waste reasoning resources and degrade output quality. Therefore, it is crucial to "structure" your instruction hierarchy to prevent contradictions before they occur.

3-1. Design to Avoid Contradictions

Take the example of a healthcare administrator scheduling a patient appointment based on symptoms. "Exceptions," such as altering preceding steps only in emergencies, must be clearly stated so they do not conflict with standard procedures.

Bad Example: The prompt contains both "Do not schedule without consent" and "First, automatically secure the fastest same-day slot," which contradict each other.
Correct Example: The prompt contains both "Always check the profile" and "In an emergency, immediately direct to 911," but the exception rule is declared first, so the two do not conflict.
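One way to keep exceptions from colliding with the standard procedure is to store rules as data and always render the exceptions first. The following is a minimal sketch of that idea. The Rule class, the render_rules function, and the <instructions> tag name are hypothetical names we introduce for illustration. It only orders the rules and does not detect contradictions automatically.

```python
# Illustrative sketch: class, function, and tag names are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    text: str
    is_exception: bool = False


def render_rules(rules: List[Rule]) -> str:
    """Render rules so that exceptions are declared before the standard procedure."""
    exceptions = [r for r in rules if r.is_exception]
    standard = [r for r in rules if not r.is_exception]
    lines = ["<instructions>"]
    if exceptions:
        lines.append("Exceptions (these override the standard procedure):")
        lines.extend(f"- {r.text}" for r in exceptions)
    lines.append("Standard procedure:")
    lines.extend(f"- {r.text}" for r in standard)
    lines.append("</instructions>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_rules([
        Rule("Always check the profile."),
        Rule("In an emergency, immediately direct to 911.", is_exception=True),
    ]))
```

Because the ordering is enforced by the renderer, a stakeholder who appends a new rule at the bottom of the list cannot accidentally bury an exception below the steps it is supposed to override.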

OpenAI offers the following warning:

We understand that the process of building prompts is an iterative one, and that many prompts are living documents, constantly being updated by different stakeholders. But that’s why it is even more important to thoroughly review for instructions that are phrased improperly. We have already seen multiple early users discover ambiguities and contradictions within their core prompt libraries when they did such a review. Removing them dramatically streamlined and improved GPT-5's performance. We encourage you to test your prompts with our Prompt Optimizer tool to identify these kinds of issues.

How was that? In this article, we explored key points for prompt design from OpenAI's GPT-5 Prompt Guide (1). GPT-5 is a "partner in practice," combining powerful autonomy with precise instruction following. Try incorporating the points discussed today into your prompts and take your AI agents to the next level. That's all for today. Stay tuned!

1) GPT-5 prompting_guide, OpenAI, August 7, 2025

Notice: ToshiStats Co., Ltd. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithms or ideas contained herein, or acting or refraining from acting as a result of such use. ToshiStats Co., Ltd. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on ToshiStats Co., Ltd. and me to correct any errors or defects in the codes and the software.

The instructions that humans write for generative AI are called "prompts." There are many books and blogs out there that offer guidance on how to write them. Many of you have probably tried, and it's surprisingly difficult, isn't it? While no programming language is required, you have to go through a lot of trial and error to get the output you want from a generative AI. This process can be quite time-consuming, isn't well-systematized, and you often have to start from scratch for each new task.

So, this time, we'd like to experiment with "what happens if we have a generative AI write the prompts for us?" Let's get started.

1. Prompt Optimization

In 2023, Google DeepMind released a research paper titled "LARGE LANGUAGE MODELS AS OPTIMIZERS" (1).

This paper explored the use of LLMs to optimize prompts, and it seems to have worked well for several tasks. While a human writes the initial prompt, subsequent improvements are delegated to the LLM (the optimizer). The LLM is also responsible for judging whether the result was successful or not (the evaluator), meaning this approach can be applied even without labeled data that provides the correct answers. This is very helpful, as tasks involving generative AI often lack labeled data. Below is a flowchart of this process, which is effectively the automation of prompt engineering. Among practitioners, this is known as "prompt optimization." The specific method we adopted for this experiment is called OPRO (Optimization by PROmpting).
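The loop described above can be sketched roughly as follows. This is a simplified illustration of an OPRO-style loop, not the paper's implementation or the exact code we ran. The optimizer LLM and the evaluator are passed in as plain functions (propose and score), and all function names, the meta-prompt wording, and the top_k parameter are our own assumptions.

```python
# Simplified OPRO-style loop. `propose` stands in for the optimizer LLM and
# `score` for the evaluator; both are hypothetical callables supplied by the caller.
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def build_meta_prompt(task_description: str, history: History, top_k: int = 5) -> str:
    """Show the best prompts so far, lowest score first, and ask for a better one."""
    best = sorted(history, key=lambda item: item[1])[-top_k:]
    lines = [task_description, "", "Previous prompts and their scores (higher is better):"]
    for prompt, value in best:
        lines.append(f"prompt: {prompt}")
        lines.append(f"score: {value:.2f}")
    lines.append("")
    lines.append("Write a new prompt that differs from the ones above and achieves a higher score.")
    return "\n".join(lines)


def optimize_prompt(
    initial_prompt: str,
    propose: Callable[[str], str],
    score: Callable[[str], float],
    task_description: str,
    steps: int = 20,
    top_k: int = 5,
) -> Tuple[str, float]:
    """Repeatedly ask the optimizer for a new prompt and keep the best-scoring one."""
    history: History = [(initial_prompt, score(initial_prompt))]
    for _ in range(steps):
        meta_prompt = build_meta_prompt(task_description, history, top_k)
        candidate = propose(meta_prompt)
        history.append((candidate, score(candidate)))
    return max(history, key=lambda item: item[1])
```

In practice, propose would wrap a call to the LLM with the meta-prompt, and score would either compare predictions with labels or ask an LLM to judge the output. Passing them in as functions keeps the loop itself independent of any particular model or API.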

2. Experiment with a Customer Complaint Classification Task

Similar to our blog post on July 26th, we set up a task to predict which financial product a bank's customer complaint is about. We used an LLM to solve a classification task where it selects one of the following six financial products. We used gemini-2.5-flash for this experiment, with a sample size of 100 customer complaints.

Mortgage
Checking or savings account
Student loan
Money transfer, virtual currency, or money service
Bank account or service
Consumer Loan
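For reference, a classification prompt for these six products and a small parser for the model's answer could look like the sketch below. This is not the prompt used in our experiment. The wording of the prompt and the function names build_classification_prompt and parse_label are illustrative assumptions. Only the six labels come from the list above.

```python
# Illustrative sketch: prompt wording and function names are assumptions.
from typing import Optional

LABELS = [
    "Mortgage",
    "Checking or savings account",
    "Student loan",
    "Money transfer, virtual currency, or money service",
    "Bank account or service",
    "Consumer Loan",
]


def build_classification_prompt(complaint: str) -> str:
    """Ask the model to answer with exactly one of the six product names."""
    options = "\n".join(f"- {label}" for label in LABELS)
    return (
        "Classify the following bank customer complaint into exactly one financial product.\n"
        f"Choose from:\n{options}\n"
        "Answer with the product name only.\n\n"
        f"Complaint: {complaint}"
    )


def parse_label(model_output: str) -> Optional[str]:
    """Map the model's answer back to a known label, or None if it does not match."""
    cleaned = model_output.strip().strip(".\"'").lower()
    for label in LABELS:
        if cleaned == label.lower():
            return label
    return None
```

Returning None for an unrecognized answer, instead of guessing the closest label, keeps formatting failures visible so they are counted as errors when accuracy is calculated.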

In this experiment, the LLM handled the prompt generation, but a meta-prompt was necessary to further improve the resulting prompts. I wrote the meta-prompt as follows. Essentially, it tells the LLM to "please further improve the resulting prompt."

We had the LLM generate 20 prompts, and the results are shown below. The final number on each line is the accuracy. An accuracy of 0.8 means 80 out of 100 cases were classified correctly. Since this dataset came with labels, calculating the accuracy was easy.
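The accuracy calculation itself is simple. Here is a minimal sketch, where the function name accuracy and the toy data in the usage example are made up for illustration and are not our experimental data.

```python
# Minimal sketch: accuracy = number of correct predictions / number of cases.
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one labeled case is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Toy data: 80 matching cases and 20 mismatches out of 100.
    labels = ["Mortgage"] * 100
    predictions = ["Mortgage"] * 80 + ["Student loan"] * 20
    print(accuracy(predictions, labels))  # prints 0.8
```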

We adopted the second prompt from the list, which had the best accuracy of 0.89 in this experiment. When we ported this prompt to our regular experimental environment and ran it, the accuracy exceeded 0.9, as shown below. We've done this task many times before, but this is the first time we've surpassed 0.9 accuracy. That's amazing!

3. What Does the Future of Prompt Engineering Look Like?

As you can see, it seems possible to optimize prompts by leveraging the power of generative AI. Of course, when considering cost and time, the results might not always be worth the effort. Nevertheless, I feel there's a strong need for prompt automation. Researchers worldwide are currently exploring various methods, so many things that aren't possible now will likely become possible in the near future. Prompt engineering techniques will continue to evolve, and I'm looking forward to these technological developments and plan to try out various methods myself.

So, what did you think? The ability of an AI agent to fully utilize the power of generative AI and improve itself without human intervention is called "recursive self-improvement." At ToshiStats, we will continue to provide the latest updates on this topic. Stay tuned!

1) LARGE LANGUAGE MODELS AS OPTIMIZERS, Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen, Google DeepMind
