The Many Frustrations Of LLMs
We prompt them not because they are consistent, but because we thought they were going to be consistent.
Postulate 1: The core, the bedrock, the cornerstone of software is its consistency. “Computers do exactly what you tell them to do. That is why we hate them.” Compiled code always does the same thing even if you run it a trillion times. This is how and why we could build the tower of abstractions that is modern software. Beneath every bloated JavaScript framework is a gargantuan pyramid of tiny functions running in your browser, OS, SQLite, window manager, network, I/O drivers, interpreter, JIT compiler, kernel, etc. etc., each accepting its inputs and generating its outputs—correctly—every single time!
The reason the software industry exists, the only reason it can exist, is because we can rely on this consistent bedrock of ancestral code. Conversely, all our most dreaded bugs emerge from some deadly configuration we suddenly can’t rely on: race conditions, hardware errors, ‘Heisenbugs.’
Postulate 2: Everything an LLM generates is a Heisenbug.
Whether you call them artificial intelligences, stochastic parrots, fuzzy processors, or crude epithets, LLMs are the conceptual opposite of this bedrock code. Their operations are mysterious, their outputs unpredictable, and they are often profoundly sensitive to subtle variations in their inputs. (Hence the field of prompt engineering.) We desperately try to construct whole forests of benchmarks…and yet, when a new model emerges, AI engineers fall back to talking about their “intuition” and “vibes.”
This is endlessly fascinating. It is also, if you work with them professionally, equally frustrating. So I’m taking this opportunity to channel my last year-or-so of LLM frustrations into a personal taxonomy of how and why they go wrong. This is by no means an exhaustive list—I’m sure the particulars vary depending on what you do with them—but I hope it’s at least somewhat representative…
Freakouts
These occur when LLMs fail even at the structure of what they’re asked to do. For instance, we generally ask for outputs in Markdown. And GPT-4o is mostly very good at this. But near the end of its outputs, it can drop newlines, hyphens, and other format syntax, sometimes even spaces!—untilyougetsentencesthatarealloneword. (The latest version seems to have fixed this. I think. Tentacles crossed.) Similarly, until recently, when you asked for JSON outputs, you would get them 95+% of the time … but occasionally you would instead get (and pay for!) several thousand whitespaces and newlines with no actual text.
In fairness, freakouts are almost charming, because they feel like software bugs as we know them, and it’s easy to run evals to catch them.
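To make “run evals to catch them” concrete, here is a minimal sketch of that kind of format check in Python; call_llm is a hypothetical stand-in for whatever client you actually use, and the retry count is arbitrary.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client you actually use."""
    raise NotImplementedError

def json_output_is_sane(output: str) -> bool:
    """Catch the 'thousands of whitespaces' freakout and plain invalid JSON."""
    if not output.strip():
        return False  # nothing but whitespace and newlines came back
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True

def get_json(prompt: str, max_attempts: int = 3) -> dict:
    """Retry until the output passes the format eval, then parse it."""
    for _ in range(max_attempts):
        output = call_llm(prompt)
        if json_output_is_sane(output):
            return json.loads(output)
    raise RuntimeError("Output never passed the format eval")
```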
Fidelity
New model versions giveth, by being faster and cheaper and a little smarter and often fixing some of the abovementioned freakouts. But alas, new versions oft also taketh away, by messing with your prompts. ‘Upgrade’ to a new version, and reliable old prompts that you iterated on for weeks and have come to smugly think of as Just Working might suddenly start to sputter and fail in new ways, until you iterate them again. Usually not a lot … but still.
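One hedge is a small regression harness: keep your hard-won prompts paired with crude automated checks, and run the whole suite against a candidate model version before you ‘upgrade.’ A toy sketch, with call_llm, the prompts, and the checks all placeholder assumptions:

```python
from typing import Callable

def call_llm(prompt: str, model: str) -> str:
    """Hypothetical stand-in for whatever LLM client you actually use."""
    raise NotImplementedError

# Each entry pairs a battle-tested prompt with a crude automated check.
PROMPT_SUITE: list[tuple[str, Callable[[str], bool]]] = [
    ("Summarize this changelog as exactly three Markdown bullets: ...",
     lambda out: sum(1 for line in out.splitlines() if line.startswith("- ")) == 3),
    ("Return the status as a JSON object for: ...",
     lambda out: out.strip().startswith("{") and out.strip().endswith("}")),
]

def pass_rate(model: str, runs_per_prompt: int = 5) -> float:
    """Run every prompt several times (they're stochastic!) and score the results."""
    results = []
    for prompt, check in PROMPT_SUITE:
        for _ in range(runs_per_prompt):
            results.append(check(call_llm(prompt, model)))
    return sum(results) / len(results)

# Only migrate if the candidate doesn't regress your existing prompts, e.g.:
#     pass_rate("candidate-model") >= pass_rate("current-model")
```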
Fustianism
Isn’t “Fustian” a great word? It means “pompous or pretentious speech or writing.” We all know LLMs have a tendency to bloviate and use pretentious words such as, infamously, “delve.” (Well—pretentious in the West; this might just reflect RLHFing by contractors in Nigeria, where “delve” is commonly used.) More generally, LLMs tend to use substantially more words than required to carry their actual semantic content.
This can be addressed somewhat with prompting—“Be terse and concise; no yapping”—and I suspect it will also be addressed by simply training better models, as concision is a sign of intelligence. For now, though, that signature syrupy LLM wordiness remains.
Refutations
This doesn’t happen much when you’re reporting on software projects, as we do … but any time you bring up a culturally sensitive topic, you run the risk of the LLM wrinkling its nose and declining to answer. That is, until you prompt it more sensitively, or more forcefully. Gemini is particularly prone to this; it occasionally refuses comically anodyne requests, as you might expect of a Google product that’s been tapiocaed into perfect inoffensiveness.
Obviously this is a side effect of RLHF (or, more recently, perhaps DPO or some other technique). It does make you wonder about the less obvious effects of these ‘alignment’ techniques. It seems apparent that they make the models both more blinkered and somewhat less intelligent. Still a worthy trade-off in many contexts! …but one I’d like to be able to choose to make, or not.
Forbidding
LLMs are very good at doing what you tell them to do; but quite bad at not doing things you tell them not to do. Basically they are even worse than humans at the “don’t think of a white elephant” game, except, of course, they don’t actually think, they just write. As such, negative instructions such as “do not mention the white elephant” are followed erratically at best.
The solution is generally an eval loop: very gently suggesting, in passing, in your prompt that the LLM avoid the topic of pale pachyderms; collecting its output; running an eval asking, “Did this response mention a white elephant?”; and if so, re-running (maybe with a tweaked prompt) until you get a response devoid of albino Dumbos.
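In sketch form, assuming a hypothetical call_llm client and a crude string check standing in for the real eval, that loop looks something like this:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client you actually use."""
    raise NotImplementedError

FORBIDDEN = "white elephant"

def mentions_forbidden_topic(response: str) -> bool:
    """Crude check; in practice this eval is often itself an LLM call asking,
    'Did this response mention a white elephant?'"""
    return FORBIDDEN in response.lower()

def generate_without_elephants(prompt: str, max_attempts: int = 4) -> str:
    nudge = ""
    for _ in range(max_attempts):
        response = call_llm(prompt + nudge)
        if not mentions_forbidden_topic(response):
            return response
        # Tweak the prompt a little more firmly each time it slips up.
        nudge = "\n(In passing: there is no need to discuss pale pachyderms.)"
    raise RuntimeError("The elephant would not leave the room")
```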
Figuring
“Attention Is All You Need” ushered in the transformer era, and ‘attention’ is actually a pretty great metaphor for my intuitions about what LLMs can/will manage. That being: any given run has only so many ‘cognitive spoons’ to spend, so you should try not to ask it to do too much at once. It’s best not to ask it to ‘context switch’ between different subjects, if possible, because such switches cost spoons, and once those spoons run out, the LLM starts talking like a person who is no longer really paying attention to, and doesn’t really care about, either its inputs or its previous outputs.
Figures
Look, they’re just really bad at math.
That’s OK! …Ish. There exists a pattern for that: “function calling,” where instead of getting it to do math, you get it to generate (as JSON) the inputs for a function, then you run that function for it, then append those outputs to the prompt for the next phase. This works for external data of all kinds, not just math. More generally, if you want to incorporate any kind of calculations or statistics into LLM-generated outputs, you want to run those outside the LLM and feed it the results in the prompt. Function calling is a fun workaround … but also an annoying workaround.
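Here’s a rough sketch of the pattern, with call_llm as a hypothetical stand-in and a made-up loan calculation as the function being called:

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client you actually use."""
    raise NotImplementedError

def monthly_payment(principal: float, annual_rate: float, months: int) -> float:
    """Deterministic math the model should never be asked to do itself."""
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -months)

def answer_loan_question(question: str) -> str:
    # Step 1: ask the model only for the *arguments*, as JSON.
    args = json.loads(call_llm(
        "Extract principal, annual_rate (as a decimal), and months "
        f"as a JSON object from: {question}"
    ))
    # Step 2: run the real function ourselves, outside the LLM.
    payment = monthly_payment(args["principal"], args["annual_rate"], args["months"])
    # Step 3: append the computed result to the prompt for the next phase.
    return call_llm(
        f"The monthly payment is {payment:.2f}. "
        f"Now write a one-paragraph answer to: {question}"
    )
```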
Fixations
Again, LLMs do what you tell them to do … but they do it in their way, with their biases. It’s like having a minion who is extremely loyal but also very set in their ways. For instance, when they analyze code or pull requests, they tend to be obsessed with error handling. Which, I mean, it can be important! …But probably not something that you mention every single time you analyze a piece of software. Unless of course you’re an LLM. You aren’t, right?
Futurity
It’s pretty obvious that LLMs are bad at math. It’s less obvious, but maybe a corollary, that they are also—intermittently—bad at time. If something happened ten days ago, did it happen within the last week? An LLM will probably get that right … most of the time … but if you’re asking that question, you probably want the answer to be reliably correct. Was 2021-09-07 before or after August 14th? How about 10/20? Again, they’ll usually get these ones right … if they haven’t spent all their cognitive spoons elsewhere … but sometimes they won’t. You’ll benefit greatly from rendering all of the dates in your context window in the same format, where possible, and making a point of being very clear about relative times and timespans.
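A small sketch of that sort of date hygiene, assuming you know the handful of formats your data arrives in:

```python
from datetime import date, datetime

def normalize_dates_for_context(raw_dates: list[str], today: date) -> list[str]:
    """Render every date in one ISO format, with an explicit relative offset,
    so the model never has to do calendar arithmetic itself."""
    known_formats = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")
    lines = []
    for raw in raw_dates:
        for fmt in known_formats:
            try:
                d = datetime.strptime(raw, fmt).date()
                break
            except ValueError:
                continue
        else:
            continue  # leave unparseable dates out rather than guess
        days_ago = (today - d).days
        lines.append(f"{d.isoformat()} ({days_ago} days before {today.isoformat()})")
    return lines

# normalize_dates_for_context(["2021-09-07", "08/14/2021"], date(2021, 10, 20))
# -> ['2021-09-07 (43 days before 2021-10-20)', '2021-08-14 (67 days before 2021-10-20)']
```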
Fabrications
It may surprise you that hallucinations come last on this list. But you know what? Hallucinations have actually not been that big a deal for us. Every time we have seen a flagrant hallucination, it has been because of a huge lacuna, or something else deeply pathological, in the data. Not to anthropomorphize, but LLMs don’t actually ‘want’ to hallucinate. If you give them good data in their context window, they’ll use that data.
My theory is that the real cause of most hallucinations is the misuse of RAG. People agonize over their prompts … and then cavalierly embed an ocean of data, run some kind of vector search on those embeddings, throw the resulting chaotic jumble of results into their context window, and look surprised when the results are wonky. There is no actual distinction between “prompt” and “in-context learning data.” It’s all one context window. It’s very strange to me that people work very carefully at the first part of it, and then just give up and heap a lot of raw search results on top. (We don’t do that; we carefully collate and construct our in-context data.)
That said, hallucinations are largely solvable: you can ask an LLM to itemize all of the factual statements in each response it previously generated, and then, for each of those statements, ask it to look through the source data and cite the exact source(s). This is quite slow and expensive, but it can be done, it is worthwhile in at least some cases, and it basically requires the LLM to hallucinate twice in the same way (very unlikely indeed) before it can tell you something false.
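Sketched out, again with call_llm as a hypothetical stand-in for your actual client, that double-check looks something like this:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client you actually use."""
    raise NotImplementedError

def verify_response(response: str, source_data: str) -> list[dict]:
    """Pass 1: itemize factual claims. Pass 2: demand a verbatim citation for each."""
    claims = call_llm(
        "List every factual statement in the following response, one per line:\n\n"
        + response
    ).splitlines()

    report = []
    for claim in claims:
        citation = call_llm(
            "Quote the exact passage from SOURCE that supports CLAIM, "
            "or reply with the single word UNSUPPORTED.\n\n"
            f"SOURCE:\n{source_data}\n\nCLAIM:\n{claim}"
        ).strip()
        # A claim only survives if its quoted support actually appears
        # verbatim in the source data.
        supported = citation != "UNSUPPORTED" and citation in source_data
        report.append({"claim": claim, "citation": citation, "supported": supported})
    return report
```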
Furthermore
Again, this is not intended as an exhaustive list… and I suspect it will look quaint in short order. Perhaps, though, it will spur some fellow-feeling in my fellow engineers.