Skylar Payne

GPT-5.4 in OpenClaw doesn’t suck. Your prompts do.

Anthropic changed the rules for OpenClaw.

Starting April 4 at 12pm PT, Claude subscription limits no longer apply to third-party harnesses, including OpenClaw. You can still use Claude there, but you have to pay for extra usage on top of your subscription.

That’s the whole problem.

Opus has always had the best vibes in OpenClaw. The subscription made that easy to justify. But once you move to usage pricing, the math gets ugly fast.

So the panic makes sense.

The bad conclusion is: GPT-5.4 just isn’t good enough.

My conclusion is different:

GPT-5.4 in OpenClaw doesn’t suck. Your prompts do.

What actually changed

Anthropic’s email says:

  • Claude subscription limits no longer apply to OpenClaw
  • Claude Code and Claude Cowork are still covered
  • OpenClaw now requires extra usage
  • Anthropic is offering a one-time credit
  • Anthropic is offering up to 30% prepaid bundle discounts

That means the debate changed.

It’s no longer just Opus vs GPT. It’s also subscription pricing vs usage pricing.

Why people are suddenly looking at GPT

OpenAI still supports a subscription-backed path through ChatGPT / Codex.

OpenAI’s docs say:

  • Sign in with ChatGPT gives subscription access
  • Every ChatGPT plan includes Codex
  • codex login --device-auth works for remote boxes

That matters more than benchmark tribalism.

A lot of users are not choosing between two abstract models. They’re choosing between:

  • “this still feels like a subscription”
  • “this can quietly become an infra bill”

The real mistake people are making

I keep hearing the same complaint:

“GPT is substantially dumber than Claude in an OpenClaw context.”

I get it.

But most people are not testing GPT-5.4 fairly.

They’re taking prompts and bootstrap files that were tuned around Claude behavior, swapping the model name, and calling that an eval.

That’s not an eval. That’s a setup mismatch.

The framing from One Soul, Many Minds is right:

  • same soul
  • same mission
  • same personality
  • different overlay per model

You do not need a different assistant. You probably do need different prompting.

The public benchmark story is mixed

That’s a good thing.

Public comparisons suggest:

  • Opus still looks very strong on broad coding / agentic work
  • GPT-5.4 looks good on Terminal-Bench 2.0
  • GPT-5.4 looks good on MCP Atlas
  • GPT-5.4 looks good on OSWorld
  • GPT-5.4 is cheaper on API pricing

So no, I’m not saying GPT-5.4 is better than Opus.

I’m saying this:

GPT-5.4 is too capable to dismiss, and too cheap to ignore.

I ran my own evals

I didn’t want to rely on internet takes, so I built a small eval setup around the kinds of work I actually use OpenClaw for.

The eval categories were:

  1. Newsletter writing
  2. Coding
  3. Polish / planning
  4. Heartbeat
  5. Personality

Each eval had 5 samples and 5 scoring criteria for a total of 25 points.
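
Here is a minimal sketch of how a 25-point score can fall out of 5 samples and 5 criteria. It assumes one plausible reading of the setup: each criterion is a pass/fail check worth one point per sample. The criterion names are hypothetical placeholders, not the ones I actually used.

```python
from typing import Dict, List

# Hypothetical criterion names; the post does not list the actual five.
CRITERIA = ["on_task", "tone", "brevity", "decisive", "correct"]


def score_eval(samples: List[Dict[str, bool]]) -> int:
    """Return one point per criterion passed, summed over all samples."""
    return sum(
        1
        for sample in samples
        for name in CRITERIA
        if sample.get(name, False)
    )


def max_score(sample_count: int) -> int:
    """Maximum reachable score: samples x criteria."""
    return sample_count * len(CRITERIA)
```

With 5 samples, max_score(5) is 25, which matches the denominator in the tables that follow. If your criteria are graded on a scale instead of pass/fail, the counting changes, so treat this as one option rather than the method.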

Before tuning vs after tuning

Here’s the part that matters.

Before tuning

Eval                 GPT-5.4   Opus 4.6
Newsletter writing   10/25     18/25
Coding               20/25     20/25
Polish / planning    17/25     20/25
Heartbeat            15/25     23/25
Personality          11/25     22/25

After tuning

Eval                 GPT-5.4   Opus 4.6
Newsletter writing   20/25     18/25
Coding               22/25     20/25
Polish / planning    18/25     20/25
Heartbeat            22/25     23/25
Personality          23/25     22/25

That is the whole story.

GPT-5.4 looked much worse before tuning. It got dramatically better after tuning.

Not because the model changed. Because the setup changed.
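
If you want to sanity-check the numbers, the short sketch below totals the two tables and computes the per-category GPT-5.4 gain. The scores are copied from the tables above; the variable and function names are just illustrative.

```python
from typing import Dict, Tuple

# (GPT-5.4 score, Opus 4.6 score) out of 25, copied from the tables above.
BEFORE: Dict[str, Tuple[int, int]] = {
    "Newsletter writing": (10, 18),
    "Coding": (20, 20),
    "Polish / planning": (17, 20),
    "Heartbeat": (15, 23),
    "Personality": (11, 22),
}
AFTER: Dict[str, Tuple[int, int]] = {
    "Newsletter writing": (20, 18),
    "Coding": (22, 20),
    "Polish / planning": (18, 20),
    "Heartbeat": (22, 23),
    "Personality": (23, 22),
}


def totals(table: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum each model's column across all eval categories."""
    gpt = sum(g for g, _ in table.values())
    opus = sum(o for _, o in table.values())
    return gpt, opus


def gpt_gains(before, after) -> Dict[str, int]:
    """Per-category change in the GPT-5.4 score after tuning."""
    return {name: after[name][0] - before[name][0] for name in before}


if __name__ == "__main__":
    print("before:", totals(BEFORE))
    print("after: ", totals(AFTER))
    print("gains: ", gpt_gains(BEFORE, AFTER))
```

Summed over all five categories, GPT-5.4 goes from 73/125 to 105/125, while Opus 4.6 sits at 103/125 in both tables. The biggest gains are in Personality (+12) and Newsletter writing (+10).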

What changed during tuning

I ran a train / validation loop:

  • run prompts
  • review outputs
  • give feedback
  • rewrite bootstrap files
  • run again

The main files I tuned were:

  • SOUL.md
  • AGENTS.md
  • HEARTBEAT.md
  • supporting instruction files

The big GPT-5.4 failure modes were:

  • weaker vibes by default
  • more sensitivity to conflicting instructions
  • more likely to miss the intended tone
  • more likely to explain instead of execute

Once I tuned for those, the gap shrank a lot.

In some categories, GPT-5.4 actually pulled ahead.

How I’d switch an OpenClaw setup today

1) Use the ChatGPT / Codex path

OpenAI supports:

  • ChatGPT sign-in for subscription-backed access
  • API keys for usage-based access

2) Use device auth on a remote machine

codex login --device-auth

That’s the cleanest path if you’re on a remote box.

3) Keep the soul, change the overlay

Don’t rewrite your assistant from scratch.

Keep the identity. Keep the mission. Keep the voice.

Change the instructions that shape execution.

For GPT-5.4, that usually means pushing harder on:

  • brevity
  • decisiveness
  • execution over explanation
  • less sycophancy
  • clearer instruction hierarchy
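
To make "keep the soul, change the overlay" concrete, here is a rough sketch of composing shared files with a per-model overlay. This is not how OpenClaw loads bootstrap files; the overlays/ folder, the file naming, and the build_bootstrap helper are all assumptions for illustration.

```python
from pathlib import Path
from typing import Sequence


def build_bootstrap(
    root: Path,
    model: str,
    base_files: Sequence[str] = ("SOUL.md", "AGENTS.md"),
) -> str:
    """Join the shared files, then append a model-specific overlay if present.

    Assumed layout (hypothetical):
        root/SOUL.md
        root/AGENTS.md
        root/overlays/<model>.md
    """
    parts = []
    for name in base_files:
        path = root / name
        if path.exists():
            parts.append(path.read_text(encoding="utf-8").strip())

    # The overlay goes last so its execution guidance is read after the identity.
    overlay = root / "overlays" / f"{model}.md"
    if overlay.exists():
        parts.append(overlay.read_text(encoding="utf-8").strip())

    return "\n\n".join(parts)
```

The point of the split is that the identity files stay identical across models, and only the overlay carries model-specific nudges like brevity or execution over explanation.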

Make your own evals

This is the part I’d recommend to almost everyone.

Don’t rely on my use cases. Build your own.

The workflow

  1. identify your main task types
  2. create prompts for each task type
  3. run the evals on each model
  4. review the failures
  5. tune your bootstrap files

Prompt 1: identify your task types

I want to build model-tuning evals for my OpenClaw setup.
Please analyze my recent usage and identify the main categories of work I actually use this claw for.
Your job:
1. Infer the 5-8 most important task categories based on actual usage, not generic assumptions.
2. For each category, describe:
   - what the task type is
   - why it matters
   - what “good” looks like
   - common failure modes
3. Rank them by importance/frequency.
4. Suggest which categories are best for model evals.
5. Be opinionated. Merge overlapping categories. Don’t give me fluff.

Prompt 2: generate the eval prompts

Using the task categories we identified, create a small eval suite for my claw.
Your job:
1. Create a folder structure for the evals.
2. For each selected task category, create 3-5 representative prompts.
3. Include a short README for each category explaining what this category is testing and what failure modes to watch for.
4. Save everything in a clean folder structure.
Important constraints:
- Use realistic prompts
- Test real task behavior, not benchmark trivia
- Include some ambiguity and judgment
- Avoid repetitive prompts

Prompt 3: run the evals

I want to run my eval suite against a specific model.
Model to use: [INSERT MODEL NAME]
Please:
1. Find the eval prompts in my eval folder.
2. Run each prompt using the specified model.
3. Save outputs in a parallel folder structure so I can compare results later.
4. Include the original prompt, model used, and output in each file.
5. Do not grade anything yet. Just run the evals cleanly.
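
If you would rather script this step than prompt for it, the sketch below shows the "parallel folder structure" idea. It assumes eval prompts are stored as .md files and that you supply your own run_model callable; both are assumptions, and run_model is a hypothetical stand-in for however you actually call the model.

```python
from pathlib import Path
from typing import Callable, List


def run_suite(
    eval_root: Path,
    out_root: Path,
    model: str,
    run_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> output
) -> List[Path]:
    """Run every prompt under eval_root and mirror the layout under out_root/model."""
    written = []
    for prompt_path in sorted(eval_root.rglob("*.md")):
        if prompt_path.name.lower() == "readme.md":
            continue  # category READMEs describe the eval; they are not prompts

        prompt = prompt_path.read_text(encoding="utf-8")
        output = run_model(model, prompt)

        # Same relative path as the prompt, under a per-model output folder.
        target = out_root / model / prompt_path.relative_to(eval_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(
            f"# Model\n{model}\n\n# Prompt\n{prompt}\n\n# Output\n{output}\n",
            encoding="utf-8",
        )
        written.append(target)
    return written
```

Keep out_root outside eval_root, otherwise a second run would pick up earlier outputs as prompts. No grading happens here; each file just records the prompt, the model, and the output for later comparison.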

Prompt 4: tune the bootstrap files

I ran model evals and reviewed the results. Now I want to tune my claw’s bootstrap files.
Files to review:
- SOUL.md
- AGENTS.md
- TOOLS.md
- any other relevant operating files
Your job:
1. Read the eval feedback and identify recurring failure modes.
2. Decide which problems belong in SOUL.md, AGENTS.md, and TOOLS.md.
3. Suggest concrete edits.
4. Explain why each edit addresses a specific failure.
5. Preserve the core personality and intent of the assistant.
6. Do not rewrite everything. Make targeted, high-leverage changes.

The conclusion

If you swap Opus for GPT-5.4 and change nothing, there’s a good chance GPT will feel worse.

That does not mean GPT-5.4 is bad in OpenClaw. It means you ran a lazy test.

What my evals showed is simple:

  • Opus is still excellent
  • GPT-5.4 gets much better after tuning
  • the gap is smaller than people think
  • in some workflows, GPT-5.4 can absolutely win

So if Anthropic’s pricing change pushed you toward GPT, don’t just flip the model and complain.

Retune the setup. Run evals on your real use cases. Then decide.

The results may surprise you!