Modern AI models are completely stateless. ChatGPT and Midjourney are exactly the same systems, with exactly the same weights, every time you interact with them. Oh, they get updated every so often, after laborious and expensive training cycles, and by default there's some randomness in their outputs … but if you set ChatGPT’s ‘temperature’ to zero, any given prompt will consistently provoke exactly the same output. The model does not change, does not learn, and certainly doesn’t modify itself.
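To make that concrete, here is a minimal sketch (assuming the pre-1.0 openai Python client, an API key already configured, and an example model name) of asking the same question repeatedly with temperature pinned to zero:

```python
# Illustrative sketch: with temperature=0 the model picks the most likely token at
# every step, so the same prompt should yield the same completion each time.
import openai  # assumes openai<1.0 and OPENAI_API_KEY set in the environment

for _ in range(3):
    reply = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",                      # example model name
        messages=[{"role": "user", "content": "Name one prime number."}],
        temperature=0,                              # no sampling randomness
    )
    print(reply.choices[0].message.content)         # prints the same answer every time
```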
I’m fond of the analogy that modern AI models are like yesteryear’s microprocessors; OpenAI/Google/Anthropic are the new AMD/Intel/DEC, and “fuzzy processors” are the new “processors,” as a divide grows between people/companies who train models and those who use them. Some say all organizations will have to train their own models, because AI systems that are mere “wrappers around GPT-4” cannot compete…
… but when I hear that, I don my skeptical face. That’s what people said about pure software companies, too, way back when. To quote Neal Stephenson:
Jobs and Wozniak, the founders of Apple, came up with the very strange idea of selling information processing machines for use in the home. The business took off, and its founders made a lot of money and received the credit they deserved for being daring visionaries. But around the same time, Bill Gates and Paul Allen came up with an idea even stranger and more fantastical: selling computer operating systems. This was much weirder than the idea of Jobs and Wozniak.
IBM outsourced the PC’s OS to Microsoft because they didn’t consider it a real business. After all, how could mere software compete in a world of custom hardware? …Rather well, it turns out, as you may have noticed.
But history does not actually repeat, all analogies are flawed, and such flaws make the future interesting. It is not possible for microprocessors to meaningfully change, once etched. (Modulo FPGAs, which are fascinating but never really took off.) But it is possible for AI models to evolve. In fact, in principle, there's no reason a model couldn't continually receive additional training.
But that’s neither effective nor cost-effective. What is possible, however, is to take a base model and “fine-tune” it for a particular purpose / on a particular dataset, so that it’s more attuned to — and responds more accurately to — a given type of input.
In fact you’ll probably never interact with an AI model which isn’t fine-tuned. Modern models are first “pre-trained” — the famous process of using enormous amounts of computing, data (an approximate copy of the entire Internet), and mounds of time and money (an estimated $100 million for GPT-4) — then fine-tuned to interact with users. ChatGPT is tuned to be conversational; Copilot, to suggest lines of code in an IDE. Such tuning gives you an “instruct” or “SFT” model (after the technique used, Supervised Fine-Tuning).
They are then tuned even more aggressively, to restrict their outputs, for safety. Modern AI models are consumer products, and consumer products must be made safe… even if this nerfs them somewhat. Raw foundation models might go anywhere from their inputs: abuse, bigotry, death threats, declarations of love, pornography, suicidal ideation, etc. The models you interact with are trained to avoid such subjects, using a technique known as RLHF, or Reinforcement Learning from Human Feedback. (You may recall that early Bing-GPT had... colorful... interactions with users, probably because its RLHF training was incomplete.)
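For a flavor of what “human feedback” means mechanically: annotators rank pairs of model responses, those rankings train a reward model that scores outputs, and the language model is then pushed toward responses the scorer likes. A toy sketch of that scoring step (everything here is illustrative, a stand-in for any lab’s actual pipeline):

```python
# Toy reward-model sketch: given two responses to the same prompt, one preferred
# ("chosen") by human raters and one not ("rejected"), learn to score the chosen
# response higher. Real systems use a full transformer, not a single linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.score = nn.Linear(embedding_dim, 1)  # stand-in for a transformer + scoring head

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Fake embeddings standing in for the chosen / rejected responses to one prompt.
chosen, rejected = torch.randn(1, 128), torch.randn(1, 128)

# Bradley-Terry-style loss: push the chosen response's score above the rejected one's.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
# A reinforcement-learning step (e.g. PPO) then nudges the language model toward
# responses this scorer rates highly.
```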
How does all this work? Well, rather than retread well-worn tires: Jon Stokes, of Ars Technica fame, has a good two-part write-up of instruction and RLHF tuning. Nathan Lambert has written an excellent deep dive into RLHF. And prompt engineer extraordinaire Riley Goodside has a good Twitter overview:
https://twitter.com/goodside/status/1668383898315751425
Ultimately, once these several phases of training are complete, you're left with a "foundation model." (A term also used for what I call "base models" above. The field is new enough that its terms of art can be muddy.) These are the ones you interact with, through ChatGPT’s input box or API, or Midjourney's Discord ... or by simply running a copy of an open-source model, like Meta's LLaMa or the UAE's Falcon. (Both of which come already fine-tuned; for instance, Falcon seems careful not to say anything bad about the UAE….)
https://twitter.com/jankulveit/status/1670735364707721216
If you have such a model on hand, you can further fine-tune it yourself; and/or, OpenAI makes various models available for custom fine-tuning through their API ... although not their flagship GPT-4. This is balky, expensive, and time-consuming (a rough code sketch follows the list):
1. Assemble a lot of "given input / desired output" pairs
2. Set up a fine-tuning training pipeline
3. Feed the input/output pairs into that pipeline
4. Train (and wait, or pay)
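Concretely, that pipeline might look something like the following sketch, using Hugging Face's transformers and datasets libraries; the model name and the single example pair are placeholders, not a recipe:

```python
# Minimal sketch of the four steps above with Hugging Face's Trainer.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# 1. Assemble "given input / desired output" pairs (one toy example shown).
pairs = [{"prompt": "Summarize: The cat sat on the mat.",
          "completion": "A cat sat on a mat."}]

# 2. Set up the pipeline: load a base model and its tokenizer ("my_base_model" is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("my_base_model")
model = AutoModelForCausalLM.from_pretrained("my_base_model")
if tokenizer.pad_token is None:                 # many causal-LM tokenizers lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

# 3. Feed the pairs into the pipeline: join prompt and completion, then tokenize.
def tokenize(example):
    text = example["prompt"] + "\n" + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(pairs).map(tokenize, remove_columns=["prompt", "completion"])

# 4. Train (and wait, or pay for the GPU time).
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```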
There are lots of detailed guides out there. The general consensus is that fine-tuning doesn't really make the model in question more powerful or capable; the compute it requires is a tiny fraction of the enormous cost and effort of pre-training. But it can add to a model’s implicit knowledgebase, and also seems to attune the fine-tuned model to its mission, like a decathlete focusing on a single Olympic event.
Furthermore, the ability to fine-tune open-weight models on your own GPU does make them much more flexible, and thus more competitive with superior closed-weight models such as OpenAI’s and Anthropic’s. Previously, before a mini-breakthrough called QLoRA, such “local fine-tuning” would have been impossible:
Most large language models (LLM) are too big to be fine-tuned on consumer hardware. To fine-tune a 65-billion-parameter model we need more than 780 Gb of GPU memory. This is equivalent to ten A100 80 Gb GPUs. Now, with QLoRa (Dettmers et al., 2023), you could do it with only one A100.
…Of course that’s still really only of interest if you have $10K to drop on an A100. But the approach is intriguing. As you may recall, neural networks are constructed by stacking layers of neurons. A technique called "LoRA" adds a tiny ‘adapter’ to each layer, so we can fine-tune by tweaking only that adapter, like transforming the look of a movie by simply putting a filter on a camera lens, not changing all the lighting. QLoRA then "quantizes" the frozen base model’s weights, rounding them to fewer significant digits, e.g. from 1.38987438 to 1.39 — which has surprisingly little effect on the results.
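In code, the combination looks roughly like this sketch, using Hugging Face's peft and bitsandbytes integrations (the model name and hyperparameters are illustrative assumptions):

```python
# Sketch of a QLoRA-style setup: load the base model 4-bit quantized, freeze it,
# and attach small trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# The "Q": load the frozen base model with its weights quantized to 4 bits.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("my_base_model",        # placeholder name
                                             quantization_config=bnb_config)

# The "LoRA": add tiny low-rank adapters; only these get trained.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()  # typically well under 1% of the model's parameters
# Training then proceeds much as in the earlier Trainer sketch, but only the
# adapters receive gradient updates.
```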
(As an aside, one can also “distill” models, arguably a form of fine-tuning; this takes a large, expert model, such as GPT-4, and ‘downloads’ a specific, focused subset of its knowledge to a smaller model which can be used more efficiently and on smaller computers — maybe even phones.)
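The core of distillation is a training objective: the small “student” model is trained to match the softened output distribution of the large “teacher,” not just hard labels. A toy version of that loss (all tensors here are made-up placeholders):

```python
# Toy knowledge-distillation loss: penalize the student for diverging from the
# teacher's (temperature-softened) probability distribution over next tokens.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Example with random stand-in logits over a 10-token vocabulary:
print(distillation_loss(torch.randn(4, 10), torch.randn(4, 10)))
```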
Does fine-tuning make a huge difference? Generally, no, but it’s enough to matter for many use cases. What makes it especially interesting, for me, is that it’s a way in which foundation models are not fixed and eternal, and do change after emerging from the factory. Many of modern AI’s limitations stem from statelessness. (For example, GPT-4 is phenomenally good at writing software ... but is unaware of, so cannot make use of, open-source tools and libraries which have emerged/evolved since its training.) Much research is ongoing into adding storage or memory to LLMs, beyond today’s crude “search a vector database, turn the results into LLM inputs.” It seems likely, at least to me, that ongoing fine-tuning will be a big part of any such adaptation.
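For reference, that crude pattern is simple enough to sketch end to end; here embed() and ask_llm() are fake stand-ins for real embedding and LLM APIs, and the documents are toy data:

```python
# Toy, self-contained sketch of "search a vector database, turn the results into LLM inputs."
import numpy as np

documents = ["Falcon is an open-weight LLM from the UAE.",
             "QLoRA lets you fine-tune a large model on a single GPU.",
             "LoRA adds small adapter layers to a frozen base model."]

def embed(text: str) -> np.ndarray:
    # Stand-in embedding; a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(64)

doc_vectors = np.stack([embed(d) for d in documents])   # the "vector database"

def retrieve(question: str, top_k: int = 2) -> list[str]:
    q = embed(question)
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:top_k]]

def ask_llm(prompt: str) -> str:
    # Stand-in for a call to an actual (stateless) model.
    return f"[model response to a {len(prompt)}-character prompt]"

question = "How can I fine-tune a big model on one GPU?"
context = "\n".join(retrieve(question))
print(ask_llm(f"Context:\n{context}\n\nQuestion: {question}"))
```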
Jon, what happens when we get a bunch of these fine-tuned by different folks for different purposes, to interact with each other? Ask them to fine-tune each other? Possible? Silly question probably. I am a philosopher, not a CS guy after all!