Coming around on LLMs

For a long time I’ve been a sceptic of LLMs and how they’re being used and marketed. I tried ChatGPT when it first launched, and was totally underwhelmed. Don’t get me wrong: I find the technology damn impressive, but I just couldn’t see any use for it.
Recently I’ve seen more and more comments along the lines of “people who criticise LLMs haven’t used the latest models”, and a good number of developers that I respect have said they use coding models in some capacity. So it seemed like it was time to give them another shake.
The first decision to make was which model to try. OpenAI are no longer the only player in the game; every tech company of a certain size is now also somehow an AI company[1]. I looked a bit at some benchmarks, and then mostly ignored them and went with the only company that I didn’t outright hate: Anthropic, and their model Claude.
Initial impressions
The latest Claude models do feel a lot more “capable” than the earlier ChatGPT versions I remember, but they also still have a lot of the same problems. At their heart, they’re still text-prediction models, and still seem to be trained to predict text that will please the user rather than be factually accurate or useful.
One of the improvements that people kept mentioning was the ability for models to access the web directly. I’ve seen it kick in naturally a few times, and asked for it explicitly at others, and it’s… nothing special? Are people just really bad at searching the web? Maybe they should try Kagi? I feel like about 80% of the time I could have found the information just as quickly myself, about 10% of the time it searched and then hallucinated an answer, and the remaining 10% it found something quicker than I otherwise would have. Those aren’t great results, especially when you consider how much cheaper and simpler just searching the web is.
There is another way to trigger web searches in Claude: using “research mode”. When you enable it, it splits off into different models: a “lead researcher” to come up with a plan, and then some minions that execute it. It ends up doing hundreds of web queries in the span of a few seconds. Suddenly I understand why things like Anubis need to exist. I’ve not used the “please launch a DoS attack” button since.
What did impress me, though, was its ability to churn out reasonable-ish code. It can hack together a bash script as well as I can, and do it far faster than I’d be able to. Sometimes they even work. That made me wonder what it would be like doing actual coding with it. Anthropic have a CLI tool called claude-code, so I paid them lots of money and gave it a spin.
Coding with Claude
The first thing I notice about claude-code is that I really like the interface. It’s basically an input box in a terminal. It doesn’t force me to use a certain IDE or do things in a certain way. By default it asks before making any changes, showing you a side-by-side diff of what it’s doing and allowing you to provide feedback. If I had to design a way to interact with a coding agent from scratch, I can’t think of many things I’d improve.

The power in claude-code versus just using the web UI is that it can use tools. It can query git, run grep commands, even use sed if it wants to change something in lots of files at once. It’s very good at figuring out its way around a codebase, even without any explicit instructions. You can see in the screenshot that with a little prompting it managed to get the permalink to this post; I didn’t tell it where the posts were stored, or how to work out the latest, I just told it when it was wrong. If I’d run /init before it would have probably picked up on the fact that my posts have custom permalinks, and noted it in CLAUDE.md.
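For anyone who hasn’t come across it: CLAUDE.md is just a plain markdown file in the repository that claude-code reads at the start of each session. Something like this hypothetical snippet (the paths and commands are made up for illustration) would be enough:

# Notes for Claude
- Blog posts live in content/posts/, one markdown file per post.
- Some posts set a custom permalink in their front matter; prefer that over the default slug.
- Preview the site locally with make serve.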
But how good is it at actually writing code? It’s like having a keen but not particularly thorough junior engineer at your beck and call. If you give it a clearly defined task and guidance on how to implement it (and maybe some feedback as it suggests changes), it’s more than capable of doing it. If you don’t give it enough guidance, it tends to go off the rails. I tried having it generate a simple application from scratch with minimal technical guidance and no review of what it was doing, and it made such a mess of it that I decided it was quicker to throw it away and start again by hand.
The man behind the curtain
Even with sufficient guidance, at times it’s really obvious that it’s an LLM generating pleasing-token-strings and not something that genuinely understands what it’s doing. It will spit out code like this:
if err != nil {
    if err == sql.ErrNoRows {
        return nil, err
    }
    return nil, err
}
Why’s that check for sql.ErrNoRows there? It’s entirely pointless. There was no instruction to check for it, none of the existing code checked for it, but I assume it comes up quite a bit in the training data. But it didn’t understand why, so it put the check in, and returned the exact same thing as if it hadn’t.
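For contrast, the reason that pattern shows up so often in real code is that callers usually want sql.ErrNoRows translated into something domain-specific they can branch on. A rough sketch of what the check normally buys you (the User type and ErrUserNotFound error are made up for this example):

package store

import (
    "database/sql"
    "errors"
)

type User struct {
    ID   int64
    Name string
}

// ErrUserNotFound is something a caller can usefully branch on,
// unlike a bare sql.ErrNoRows leaking out of the storage layer.
var ErrUserNotFound = errors.New("user not found")

func getUser(db *sql.DB, id int64) (*User, error) {
    var u User
    err := db.QueryRow("SELECT id, name FROM users WHERE id = ?", id).Scan(&u.ID, &u.Name)
    if errors.Is(err, sql.ErrNoRows) {
        // The special case only earns its place because it returns something different.
        return nil, ErrUserNotFound
    }
    if err != nil {
        return nil, err
    }
    return &u, nil
}

That’s the branch the generated code was pattern-matching on, minus the part that makes it meaningful.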
It also occasionally tries to “cheat” or solve problems the wrong way. It sometimes feels a bit like you’re asking for wishes from a Monkey’s Paw. “Stop the unit tests failing”, you’ll say; Claude will respond with a request to delete the failing test. The man behind the curtain isn’t particularly well hidden, and the training data and pattern matching often show through.
This lack of understanding also makes it very hard to get Claude to use comments in a sensible way. It absolutely loves doing nonsense like this:
// render form
function renderForm() {
    // ...
}
It does this even with explicit instructions to avoid useless comments, or comments that only explain what the code does rather than why it does it, or any other prompting. Again, I’m chalking it up to a mixture of training data that does that, and a lack of any actual understanding of why a human developer might want a comment to exist. Of course, there are plenty of flesh-and-blood devs out there that also don’t comment effectively, so maybe I shouldn’t give Claude too much of a hard time on this.
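For what it’s worth, the kind of comment I actually want is one that explains why rather than what. A made-up Go example of the sort of thing I mean:

package fetch

import "net/http"

func fetchDashboard(url string) (*http.Response, error) {
    // Retry once: the upstream CDN occasionally drops the first request after
    // a cold start, and a single retry papers over that far more cheaply than
    // a full backoff loop would.
    resp, err := http.Get(url)
    if err != nil {
        resp, err = http.Get(url)
    }
    return resp, err
}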
… and yet …
So with all those problems, it sounds like it’s just not worth it, right? Well, not quite. I feel like programming mostly consists of two distinct tasks: thinking how to implement things; actually implementing them; and then debugging why you’re off-by-one somewhere. Letting an LLM do the thinking is a big no-go for me: it’s evidently not good at it, and frankly that’s one of the things I most like about programming. Having let it loose on small projects, I shudder to think what the codebases of all the “vibe coded” projects that are popping up are like.
For the second stage, though, it’s actually quite nice. If I’ve thought through how I want something to be implemented, I can feed those steps to Claude and have it churn out the otherwise not-too-interesting code. This requires far less time and concentration on my part than writing code, to the extent that I can be thinking about the next feature, or doing something else at the same time. I don’t feel like I’m going to end up deskilling myself this way, as I’ve already formed the idea of what I want to code, I’m just using the LLM to spit it out faster than I can type it.
A lot of the most boring bits of coding like implementing CRUD-y operations can be summarised as “Look at this file/function. Do the same thing but slightly differently elsewhere”. Claude is great at this, especially when explicitly prompted like that. You’re basically playing to its pattern-matching strengths, rather than asking it to come up with anything novel. Most of my prompts tend to be prefixed with “Look at @some_file.go and @other_file.go.” to cue up the patterns I want it to use, and I find this works well.
As for debugging, it’s a mixed bag. It’s sometimes amazingly insightful, and sometimes just runs around in circles trying the wrong things over and over. It’s worth asking the question, but I definitely wouldn’t rely on it over my own abilities.
The future
I’m probably going to carry on using claude-code, at least for personal projects. I’ve got so much done that I just wouldn’t have been bothered to do if I was doing it all by hand. I very much enjoy being in the more “diffuse thinking” mindset, planning how things are going to work, rather than being stuck in the mines digging out SQL queries. After all, who wouldn’t want an over-eager assistant to work on all their hobby projects?
Work is a slightly different matter: a private project, or even an open source project that disclaims any liability, is different to something I’m being paid to deliver and bear responsibility for fixing if it’s not done correctly. I’m not saying I won’t use it at all, but if I do, it’ll be in a much more constrained way than in my personal projects.
As for non-code usages: I’m not sold. My sceptic hat is still firmly in place. I don’t think chat is a particularly good interface for many things, and hallucinations are still a big problem despite what people say. I hate the tide of AI slop that’s taking over the Internet, and how LLM-powered chat agents are being forced into every random product. You’re definitely not going to be seeing any AI-generated blog posts from me!
One topic I’ve not gone into here is the ethical concerns about using LLMs. They obviously exist, and I do have thoughts, but that’s a topic for another day.
[1] I’m surprised it’s not gone more mainstream: why’s there no Tesco Value LLM model yet?