Inconsistency as a service
LLMs are far from the first technology prone to frequent unpredictable errors.
Long ago I worked on hardware: specifically, a DSP chip for Nortel Networks. On the walls of the windowless fab where I worked hung massive posters of the final circuitry of previously designed chips, giant engineering diagrams that looked like abstract art. What I hadn’t appreciated, prior to walking into that building, was how many of those circuits—how much of the real estate of every integrated circuit you use—had been devoted entirely to testing whether the chip actually worked, rather than to any useful signal processing.
Semiconductor fabrication is a stochastic process: for any given wafer, some of its chips will work, some will not, and you have to identify which is which. As such, to this day, “test cost [is] about 30 percent of the cost of a chip.” Without the ongoing enormous time, effort, and cost poured into maintaining quality testing infrastructure, semiconductors would be a disastrous mess.
Even longer ago I worked in radio: specifically, radio communications for RCMP units posted to Canada’s far north, where it gets so cold they leave their cars running 24/7 lest the engine blocks freeze. Radio is a weird medium, especially in unusual climates and geographies, so radio communication is a stochastic process. As such it makes heavy use of error-correcting techniques such as checksums and TCP windowing. Without this, packet-switched radio would be a disastrous mess.
Today, LLMs are stochastic token generators, and … you can see where I’m going here.
Test The Model, Test The Process
Above I really described two different types of testing. One is the ‘shakedown cruise’; does the chip work? Suppose you built your system for the ‘GPT-x’ model; now, does the new ‘GPT-x-latest’, or alternatively the cheaper ‘limerick’ model, work as well for your purposes, or better, or worse? How do you know? Vibes? Focus groups? Quantified ratings? This makes sense to traditional software people, who write unit and integration and sometimes even load tests for their work as a matter of course.
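To make the first kind concrete, here’s a minimal sketch of a shakedown harness in Python. It runs a fixed set of test cases through each candidate model and compares pass rates; the test cases and the call_model callables are placeholders for whatever workload and API you actually have.

```python
# A minimal "shakedown cruise" harness: run the same fixed test cases through
# each candidate model and compare pass rates. The test cases and model-calling
# functions here are illustrative placeholders, not any particular vendor's API.
from typing import Callable

TEST_CASES = [
    {"prompt": "Extract the invoice total from: 'Total due: $42.17'", "must_contain": "42.17"},
    {"prompt": "Extract the invoice total from: 'Amount payable: $0.99'", "must_contain": "0.99"},
]

def pass_rate(call_model: Callable[[str], str]) -> float:
    """Fraction of test cases whose output contains the expected substring."""
    passed = sum(case["must_contain"] in call_model(case["prompt"]) for case in TEST_CASES)
    return passed / len(TEST_CASES)

def compare(models: dict[str, Callable[[str], str]]) -> None:
    # e.g. compare({"gpt-x": call_gpt_x, "gpt-x-latest": call_gpt_x_latest})
    for name, call_model in models.items():
        print(f"{name}: {pass_rate(call_model):.0%} of cases passed")
```

Crude, but it replaces vibes and focus groups with a number you can recompute every time a new model ships.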
The other is the ‘watchful referee’; sure, you did the right thing last time, but how can I believe you’ll do it this time? This is alien to most software people, who assume that software does the same thing every time you run it, but second nature to most hardware people, familiar with the random cruelties of the physical world.
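In code, the referee can be as simple as the sketch below: validate every single output before trusting it, and retry (or fail loudly) when it doesn’t check out. The complete() argument is a stand-in for your actual model call, and the invoice task is just an example.

```python
# A minimal "watchful referee": never trust a single LLM output. Validate the
# structure every time, retry a few times, and fail loudly rather than silently.
import json

def extract_invoice(text: str, complete, max_attempts: int = 3) -> dict:
    prompt = f"Return JSON with keys 'total' (number) and 'currency' (string) for: {text}"
    for _ in range(max_attempts):
        raw = complete(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output; try again
        if isinstance(data.get("total"), (int, float)) and isinstance(data.get("currency"), str):
            return data  # the referee is satisfied, this time
    raise ValueError(f"Model failed validation {max_attempts} times in a row")
```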
You may be thinking: are you saying that every time you get an LLM to do anything, you have to evaluate its output? And the answer is: maybe! This is even true for simple data-processing LLM applications. Maybe it works every time you give it good data, but what if your traditional-software RAG process fails, you don’t give the LLM enough (or any) data for its in-context learning, and starved of anything to work with, it hallucinates wildly?
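Guarding against that particular failure is cheap, as in the sketch below, with retrieve() and complete() standing in for your own pipeline: if retrieval comes back empty, don’t let the model improvise.

```python
# A minimal guard against the "starved RAG" failure: if retrieval returns
# nothing, refuse to answer instead of letting the model hallucinate.
def answer_with_context(question: str, retrieve, complete) -> str:
    chunks = retrieve(question)
    if not chunks:
        # An honest error beats a confident hallucination.
        return "No supporting documents found; refusing to answer."
    context = "\n\n".join(chunks)
    return complete(f"Answer using ONLY this context:\n{context}\n\nQuestion: {question}")
```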
Testing is cheap, once you’ve built the test harness. Use claude-haiku or gemini-flash or 4o-mini for quick sanity checking; they’re all incredibly inexpensive and more than good enough. If you like, throw passing results away to avoid log noise and cognitive load. Maybe even decide not to test every time. But at least build the harness, and make it easy to test, so that skipping is an informed rather than a lazy decision.
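A sanity-check gate can be as small as the sketch below, with cheap_model() standing in for whichever inexpensive model you pick. Passing results are simply thrown away, and the whole check can be switched off once you’ve decided that’s an informed trade-off.

```python
# A minimal cheap-model sanity gate: ask an inexpensive model a pass/fail
# question about an output, discard passes, and log only the failures.
# cheap_model() is a placeholder for whatever low-cost model you choose.
import logging

def sanity_check(output: str, cheap_model, enabled: bool = True) -> bool:
    if not enabled:
        return True  # the harness still exists; skipping it is now a deliberate choice
    verdict = cheap_model(
        f"Answer PASS or FAIL only. Is the following output coherent and on-topic?\n\n{output}"
    )
    if verdict.strip().upper().startswith("PASS"):
        return True  # throw passing results away: no log noise, no cognitive load
    logging.warning("Sanity check failed: %s", output[:200])
    return False
```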
How To Test
The quickest, easiest, and sometimes only way to test an LLM is to have its outputs rated by … an LLM. This can and does lead to a turtles-all-the-way-down problem, but it’s at least useful for a quick-and-dirty ‘is this disastrous’ test, and when well-honed with good prompts and structured outputs, it can actually make for a quality test harness, one whose results you can then LLM-summarize for human consumption. Turtles, I know. But still.
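A rough sketch of that kind of judge, assuming a placeholder judge_model() call: ask for a rubric score as strict JSON, so the results can be aggregated, tracked over time, and later summarized mechanically.

```python
# A minimal LLM-as-judge with structured output: the judge scores each response
# against a small rubric and must reply in JSON. judge_model() is a placeholder.
import json

RUBRIC_PROMPT = """Score the RESPONSE to the TASK from 1 to 5 for each of:
accuracy, relevance, tone. Also say whether it is outright disastrous.
Reply with JSON only, for example:
{{"accuracy": 4, "relevance": 5, "tone": 3, "disastrous": false}}

TASK: {task}
RESPONSE: {response}"""

def judge(task: str, response: str, judge_model) -> dict:
    raw = judge_model(RUBRIC_PROMPT.format(task=task, response=response))
    scores = json.loads(raw)  # of course, the judge can itself misbehave; turtles again
    assert {"accuracy", "relevance", "tone", "disastrous"} <= scores.keys()
    return scores
```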
Ultimately, though, you might run into the same problem as LLM benchmarks: a lot of what LLMs do is just hard to quantify. If you’re processing unstructured data into JSON, this may not be a problem. But if you’re generating anything for human consumption, it’s really hard to quantify how it will feel to human eyes. Sooner or later you’re going to need some kind of vibes check. Ideally you can quantify that somewhat too, in an NPS-ish kind of way … but don’t lose sight of the fact that it’s still a vibes check.
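If you do put a number on the vibes, the arithmetic is the same as classic NPS: ratings of 9-10 count as promoters, 0-6 as detractors, and the score is the difference, as in this small sketch.

```python
# NPS-style aggregation of human "vibes" ratings on a 0-10 scale:
# score = percent promoters (9-10) minus percent detractors (0-6).
def nps(ratings: list[int]) -> float:
    promoters = sum(r >= 9 for r in ratings)
    detractors = sum(r <= 6 for r in ratings)
    return 100.0 * (promoters - detractors) / len(ratings)

# nps([10, 9, 7, 4, 8, 10]) == 33.3...: positive overall, but one rater hated it
```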
We’re still in the early days of AI engineering and figuring out what the term even means, with drastic changes hitting the field every six months, so we shouldn’t be too surprised that we haven’t collectively come up with standards and best practices yet. It took decades for traditional software engineers to realize testing was not just part of their remit, but a crucial part. It will hopefully take much less time to realize that the only way to move LLMs from “amazing canned demos” to “works most of the time” to “works 99.9% of the time, more nines available at commensurate expense” is robust test harnesses … just like every other form of stochastic engineering.