It’s been another extremely nonboring month in AIstan. OpenAI released Code Interpreter, which makes a slew of new tasks—data analysis, file format conversion, creating graphs, running calculations—available via the same handy chatbot interface. (It’s basically a highly sophisticated wrapper around GPT-4, with access to extra tools.) Meta released Llama 2, a family of open-weight models that you can run and fine-tune on your own hardware to your heart’s content, for any purposes including commercial ones, as long as you have fewer than 700 million users.
Meanwhile, OpenAI also quietly shut down its attempt to build a system that can determine whether text is AI- or human-generated. That, like prompt injection, seems—for now, at least—a genuinely unsolvable problem. We live in a strange time.
In this post I want to look a little bit further forward, though.
The transformer/attention mechanism that powers LLMs is not restricted to working on language. (This should be entirely unsurprising. LLMs never actually see language. We feed them a bunch of numbers, and get back a bunch of probabilities; the translation to and from words happens outside their remit.) Transformers can be used for anything whose inputs can be digitized into a sequence of numbers, and whose outputs amount to “pick one of X options.”
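To make that contract concrete, here is a toy sketch in Python (nothing resembling any real model's API, with a meaningless stand-in where the transformer would go) of numbers in, probabilities out:

```python
# Toy illustration: the "language" part of an LLM is just an encode/decode
# wrapper around a model that maps integers to one probability per option.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
inv_vocab = {i: ch for ch, i in vocab.items()}

def encode(text):
    """Text in -> list of integers. This is all the model ever sees."""
    return [vocab[ch] for ch in text.lower() if ch in vocab]

def stand_in_model(token_ids):
    """Placeholder for a transformer: numbers in, a probability per option out."""
    scores = [(sum(token_ids) * (i + 1)) % 7 + 1 for i in range(len(vocab))]
    total = sum(scores)
    return [s / total for s in scores]

probs = stand_in_model(encode("hello world"))
next_id = max(range(len(probs)), key=probs.__getitem__)
print(inv_vocab[next_id])  # the chosen number, translated back into a character
```

Nothing in that loop cares that the integers came from text. Swap the tokenizer for anything else you can number, and the machinery is unchanged.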
In particular, they can be used for robots.
Robot programming is famously difficult. To quote ROS.org, home of probably the best-known robot SDK, “you have all the difficulties of any software development effort combined with the need to interact asynchronously with the physical world, through sensors and actuators.” Furthermore, to do many useful things, robots have to work in real time, rather than waiting until their computations finish.
This difficulty is compounded by Moravec’s Paradox: “contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources.” (Longtime readers may recall that Hans Moravec was an early Extropian.)
But … suppose that instead of writing reams of complex Python or C++ code, you simply tokenize a robot’s sensory inputs and motor/actuator outputs, and train it to walk, chew gum, and juggle in exactly the same way that one trains an LLM to write poetry, code, and memos? (And then let it act quasi-independently, in the same way that LLMs can power autonomous agents?)
OK, yes, simply is doing a lot of extremely ironic work there. But quietly, behind the scenes, robotic transformer progress is moving fast. Last December, Google open-sourced their RT-1 Robotics Transformer code. Six months later — i.e., last month — DeepMind published RoboCat: A self-improving robotic agent. (Here’s a great analysis by Eric Jang of 1X.)
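The core trick is the one hinted at above: continuous sensor and actuator values get bucketed into a fixed vocabulary of integer tokens (RT-1, for instance, discretizes each action dimension into 256 bins), at which point a joint angle is grist for a transformer in exactly the way a word is. Here's a minimal sketch of that binning, with made-up ranges and no claim to match RT-1's actual code:

```python
import math

# Sketch of sensor/action tokenization (hypothetical values, not RT-1's code):
# clamp a continuous reading into a known range, then bucket it into one of
# N_BINS integer tokens -- the robot's "vocabulary".
N_BINS = 256

def to_token(value, lo, hi, n_bins=N_BINS):
    """Continuous reading -> integer token in [0, n_bins - 1]."""
    value = max(lo, min(hi, value))
    frac = (value - lo) / (hi - lo)
    return min(n_bins - 1, int(frac * n_bins))

def from_token(token, lo, hi, n_bins=N_BINS):
    """Integer token -> center of its bucket, usable as a motor command."""
    return lo + (token + 0.5) * (hi - lo) / n_bins

# A joint angle of 0.37 radians, with a range of [-pi, pi], becomes one token:
tok = to_token(0.37, -math.pi, math.pi)
print(tok, from_token(tok, -math.pi, math.pi))
```

Decode the model's chosen token back through from_token and you have a motor command; the transformer itself never leaves the land of integers.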
Key to RoboCat is that it improves by generating its own training data. Sufficient quantities of data are, of course, absolutely vital to training transformer architectures. This is especially true for robots, which do not have the equivalent of an entire scraped Internet available as training data: as such, Google’s list of challenges that AI robotics models face starts with “the lack of large-scale and diverse robotic data.”
RoboCat, however, has been trained on enough data that it can generalize from as few as 100 examples of a new task. As such:
With RoboCat’s diverse training, it learned to operate different robotic arms within a few hours. While it had been trained on arms with two-pronged grippers, it was able to adapt to a more complex arm with a three-fingered gripper and twice as many controllable inputs […] RoboCat has a virtuous cycle of training: the more new tasks it learns, the better it gets at learning additional new tasks.
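Schematically, the cycle DeepMind describes looks something like the skeleton below. (The Agent class and its methods are stand-ins I've invented for illustration; in reality each step is a large-scale training run.)

```python
# RoboCat's "virtuous cycle," reduced to a runnable skeleton. Agent is a
# hypothetical stand-in; the real system is vastly larger at every step.
class Agent:
    def fine_tune(self, demos):
        """Spin off a task-specific specialist from ~100 demonstrations."""
        return Agent()

    def rollout(self):
        """Practice the task once, returning a self-generated trajectory."""
        return "trajectory"

    def retrain(self, dataset):
        """Retrain the generalist on the grown dataset."""
        return Agent()

def self_improvement_cycle(agent, dataset, new_task_demos, n_rollouts=10_000):
    specialist = agent.fine_tune(new_task_demos)     # 1. fine-tune on few demos
    self_generated = [specialist.rollout()           # 2. specialist practices,
                      for _ in range(n_rollouts)]    #    generating its own data
    dataset.extend(new_task_demos + self_generated)  # 3. grow the shared dataset
    return agent.retrain(dataset)                    # 4. next tasks come easier

agent = self_improvement_cycle(Agent(), [], ["demo"] * 100)
```

Step 3 is the part with no obvious LLM analogue: each new task leaves behind data that makes every subsequent task cheaper to learn.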
(I note, handwaving madly, that while it seems—though it’s not yet certain—that data generated by LLMs is not useful for training other LLMs, except in certain specific cases such as distillation … robots don’t have to synthesize data; they have sensors; they can collect it, from the real world. In the long run maybe this will be a big advantage?)
In his analysis, Jang suggests a wilder prospect yet: that robotics may simply be handled directly by LLMs, in the same way that LLMs learned to do state-of-the-art translation without ever really being trained for that in particular:
The question weighing on many researchers’ minds these days is whether visual foundation models (sometimes referred to as VLMs) like GPT4 + Images will just zero-shot robotics. If the answer is yes, then roboticists should stop wasting their time on real robots and their difficulties and just work on computer vision and NLP benchmarks like everyone else until the model wakes up one day and knows how to control motors.
I encourage you to read Jang’s writeup, which is far, far, far more expert than mine — he is VP of AI at 1X, a robotics company funded by OpenAI — but in general, keep an eye on the robot space. It’s increasingly plausible that Large Limbed Machines may soon … ish … join Large Language Models on the wavefront of AI progress.
Finally, on a personal note: in only six weeks, my epic AI novel Exadelic hits bookstores, and I’m pleased to report that it received a starred review from Publishers Weekly (only ~5% of the thousands of books they review each year are so honored) and that a clutch of great reviews/blurbs is trickling in. Yes, I’ll be reminding you about this very regularly over the next few months…