LLMs and The Internet's Original Sin
We can all agree the conception has been anything but immaculate.
Once upon a time the Internet was invented, and lo, it was good, and a few people used it for email, file transfer, and newsgroups. Then Tim Berners-Lee created the World Wide Web, and Marc Andreessen devised the img tag, and lo, it was even better, and web pages began to proliferate, and we immediately started trying to index them. From the beginning, from info.cern.ch to Excite to Lycos to Yet Another Hierarchical Officious Oracle aka YAHOO!, people and companies sought to curate the Web, and curated "portals" became big business, for a time, during the dot-com boom.
There were attempts to search the web from the beginning too, of course. All the portals offered search. But they ... weren't great. Even specialists like AltaVista were ... not much better. Until two Stanford graduate students noticed that web links were themselves a form of implicit, distributed curation, and that one could index the whole web and use that curation to rank search results, and Google was born.
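(For the curious: the core of that insight reduces to a surprisingly small computation. Below is a minimal sketch of PageRank-style power iteration over a toy link graph; illustrative only, not Google's production algorithm.)

```python
# Minimal PageRank-style power iteration over a toy link graph.
# A sketch of the idea, not Google's actual ranking pipeline.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}  # start with uniform rank
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank everywhere
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:             # each link is a vote, split across outlinks
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Every inbound link acts as an implicit act of curation:
print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))
```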
Of course search rapidly became much more complicated. And better. And it conquered the web, and supplanted curation, and the portals withered and faded away. There was a brief burst of curation in the blogging era, "blogrolls" and Google Reader and so forth; but it was a false dawn, and Google launched AdWords and bought DoubleClick and turned their search supremacy into the mightiest fountain of money in the history of capitalism, and google.com became where you went to find things online, and SEO ruled over all with its iron fist.
To some this was the original sin of the Internet, the root from which the crapware, and the centralization of power in a shockingly small number of all-powerful sites and services, proliferated. To others it was ... generally vaguely acceptable ... though it meant constant warfare between Google and those hackers who sought to use SEO to harvest a tiny trickle of that gargantuan money fountain, via sites that were crapware crammed full of ads. Indeed most of the web (by sheer page volume, though thankfully not by traffic) became such crapware, and simple recipes had to be prefaced with five hundred words of bloviation, and at times the SEO hackers would be in the ascendancy, and much frustration and vituperation would be visited upon Google.
We are in such a time today. This is nothing new. (Here's some vituperation from yours truly at another such time, thirteen years ago.) We already know what happens next: Google recovers the upper hand; the great fountain of money continues to flow.
...Or does it?
We are entering the era of AI, which means, in practice, online, the era of large language models. For proof, look no further than … Google, which is incorporating AI into its search results, with … let’s call them “sporadically suboptimal” … results. It makes sense, right? LLMs like ChatGPT take text in and generate text out, query and response, much like search itself. Google has used previous generations of AI to improve search for years.
But the problem is, as everyone has noticed, LLMs may have the form factor of search, but that doesn’t mean they’re good at it. They hallucinate, of course, if you simply ask them what they know. This was inevitable — if they didn’t hallucinate, it would mean they somehow losslessly contained all the knowledge of their trillions of tokens of training data within a few billion parameters, magical 1000:1 compression. This isn’t super relevant to Google, who of course are not asking their LLMs to answer questions from memory alone, but it affects the public’s trust in any information that LLMs have touched…
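(Back-of-the-envelope, with assumed round numbers, since the real figures for frontier models aren’t public:)

```python
# Rough arithmetic behind the "magical 1000:1 compression" claim.
# All numbers here are illustrative assumptions, not any model's specs.

training_tokens = 10e12   # ~10 trillion training tokens (assumed)
bytes_per_token = 2       # rough average text bytes per token (assumed)
parameters      = 10e9    # ~10 billion parameters (assumed)
bytes_per_param = 2       # e.g. 16-bit weights

corpus_bytes = training_tokens * bytes_per_token  # ~20 TB of text
model_bytes  = parameters * bytes_per_param       # ~20 GB of weights

# Storing the corpus losslessly in the weights would require ~1000:1
# compression, which is why some confabulation is inevitable.
print(f"compression ratio = {corpus_bytes / model_bytes:.0f}:1")
```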
…and it affects what Google is doing, which is, feeding search results into LLMs to synthesize and summarize the most salient information for their users. A noble goal! (Although one wonders how the people whose information is synthesized and summarized so well that they no longer get clicks from users … and therefore no longer get a few droplets of the money fountain … might feel about this.) But LLMs can make mistakes there, too. They ‘understand’ (in a Plato’s Cave / Jungle Snooker sense) the world remarkably well … but they can still answer the wrong question, or the right question in the wrong way, suggest that you glue the cheese to your pizza, and so forth. Call these “meta-hallucinations” — not so much factual errors as contextual ones.
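(Mechanically, that retrieve-then-summarize loop is simple enough to sketch. A minimal, hypothetical version, assuming the OpenAI Python client and a stand-in search_web helper:)

```python
# Sketch of the retrieve-then-summarize pattern behind "AI overviews".
# search_web() is a hypothetical helper; the model call uses the real
# OpenAI Python client, but any chat-completion API would do.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_web(query: str) -> list[str]:
    """Hypothetical: return the top result snippets for a query."""
    raise NotImplementedError

def ai_overview(query: str) -> str:
    snippets = search_web(query)
    prompt = (
        "Summarize the most salient information from these search "
        f"results for the query {query!r}. Use only the snippets, and "
        "say so if they don't answer the question.\n\n"
        + "\n---\n".join(snippets)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    # Grounding in retrieved text reduces, but does not eliminate,
    # mistakes -- hence "meta-hallucinations" like glue on pizza.
    return response.choices[0].message.content
```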
In part this is ascribable to trying to use LLMs to answer questions regarding every human endeavor, when they’re still palpably better in some fields, such as software, and palpably worse in others, such as, well, math. (This is why my own AI startup focuses on AI analysis/reporting of software projects — I expect all kinds of projects will ultimately be amenable to automated analysis, in the GPT-5 era and/or beyond, but today’s frontier models understand software particularly well and seem far more reliable when limiting their remit accordingly.) But also, at Google scale, even a 0.1% error rate means hundreds of thousands of errors a day, and that, combined with people’s pre-existing (well-earned) mistrust of LLMs, is likely to lead to at least some loss of faith.
The fundamental problem is that Google is forcing the square peg of LLMs into the round hole called “search.” The ironic thing is that, speaking as someone who works with them every day, one thing LLMs would be really good at is … wait for it … curation. After all, the limiting factor of curation, the reason it lost out to search, was that it relied on human intervention, did not scale, could not be automated. …Until suddenly, now, today, it can. But attempts at curation at scale died so long ago that we no longer have any relict machinery to even try to supercharge with LLMs.
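(What would curation at scale even look like? A hedged sketch: ask a model to apply an explicit editorial rubric to each page. The rubric, labels, and model choice below are invented for illustration.)

```python
# Sketch: LLM-as-curator. Instead of ranking by links, ask the model to
# judge each page against an editorial rubric, the way a portal-era human
# editor once did. Rubric and labels are invented for illustration.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are a curator. Given a page title and excerpt, return JSON with "
    '"verdict" ("include" or "exclude"), "topic", and a one-line "reason". '
    "Exclude SEO filler, ad-stuffed pages, and content-free bloviation."
)

def curate(title: str, excerpt: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Title: {title}\n\nExcerpt: {excerpt}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Run over a whole crawl, and curation no longer needs human intervention:
# verdicts = [curate(title, excerpt) for title, excerpt in crawled_pages]
```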
And yet. Curation never really died. Your Twitter feed was a form of curation. Journalism is a form of curation. Both Twitter and journalism are widely perceived to be dying, of course. But it would be a deliciously ironic twist if LLMs powered, rather than a new era of search, a pendulum swing back to curation at scale. I know which side I’ll be cheering for, if so.