Among the many Large Language Model (LLM) memes I’ve seen making the rounds lately, one that’s concerned me is the idea that “LLMs are the ultimate search engine”. In fact, I think this is a particularly problematic and ultimately unhelpful way to think about them.
One place I saw this recently was in a tweet by Noah Smith. I don’t mean this as an attack on Noah at all – I follow him on Twitter (and Substack) because I appreciate his perspective on a variety of topics. This is why it distressed me to see him repeating this meme:
Noah referenced a piece by Senior PowerPoint Engineer (@ryxcommar) titled “ChatGPT as a query engine on a giant corpus of text”. The actual article is solid and presents a much more nuanced take than the title (and Noah’s tweet) would suggest. It’s an interesting perspective, and while it differs from mine in certain ways, I recommend giving it a read.
Before diving into the main topic for today, let me get one of the more nitpicky issues I have with Noah’s comment out of the way. That is the conflation of ChatGPT with LLMs. ChatGPT is not an LLM. ChatGPT is an application built on top of an LLM.
That’s an important distinction which I wish everyone talking about this space would internalize and explicitly call out whenever discussing one or the other. “LLMs” are a class of machine-learned model. GPT-4 is an instance of such a class. ChatGPT is an application, built on top of GPT-x models. Indeed, ChatGPT Plus subscribers can currently select between three different GPT-x models to use through the ChatGPT experience.
I wrote a bit about how I like to explain the high-level functionality of an LLM a couple of months ago, and explained how the “agent” you talk to when using ChatGPT is imaginary. If you haven’t read that, consider doing that now, as it may provide good context for the rest of this post.
So if an LLM isn’t a search engine, is ChatGPT one?
I still say no. But with a caveat. If you enable certain ChatGPT “plug-ins”, my answer changes. Before we get to that, though, let’s talk a little more about LLMs and why I think it’s not a good idea to think of them like search engines.
Reason #1: Large Language Models model language, not knowledge
This may seem trite, but as language models, LLMs model language. This is very different from modeling, say, the world. Or information. Or knowledge.
Let’s say you study English to the point where you know every single word in the Oxford English Dictionary. Let’s further say that you know every rule of grammar, as well as every style guideline from every style guide. In fact, you know the language and its common usage so well, you can read poor quality English almost as well as that which follows the “official” rules. In accomplishing this feat, you’ve developed your own internal language model for English, and practiced it by reading literature and writing essays about that literature – as one typically does in a high school English class.
We can then say that you know and understand English. Now, does knowing English mean that you know everything that can be expressed with it? Can you answer every question someone might ask you, just because they use words and grammatical constructions you understand? Can you enumerate every known fact about the world? Can you provide a source for every string of words you’ve encountered in the course of learning, or that has ever been spoken or written using a language you understand? Of course, the answer is no, you cannot.
However, you can probably answer some questions. For example, if I ask you “What shape is a basketball?”, you can probably answer this solely by knowing the definitions of commonly used words. You can also likely reason over concepts modeled in commonly used language constructs. For example, if I ask, “How many generations removed is a grandson from their grandfather?”, you can probably work out the answer just using your knowledge of what those words mean.
You also might be able to answer questions about general knowledge and trivia that you picked up in the course of your study of the language. If I ask you who coined the term “Gilded Age”, you might remember having read that it was Mark Twain. Or perhaps you remember that it was an author from that time period, so you could take a reasonable guess. If I ask you what comes after “It was the best of times”, you might be able to answer, “it was the worst of times”. That’s because this phrase is referenced commonly in English writing, and you may have even read A Tale of Two Cities in the course of your English studies.
However, if I ask you, “Was my grandfather in the Air Force or the Army?”, you probably understand the question, but you probably don’t know the answer. You could take a guess, but even that relies on you assuming that the answer is one of the two options I gave you. In general, that’s probably a reasonable assumption. In this case, it turns out I was trying to trick you – my grandfather was in the Navy.
Hopefully you can see where I’m going with this. The purpose of a language model (“large” or otherwise) is to model language, not the world or its information. This is why I repeatedly tell people that LLMs should not be treated as repositories of knowledge. They are not, in fact, a “blurry JPEG of the web”. Or at least, to the extent that they are, it’s merely a side effect of how they learned written human language. If GPT-4 is a blurry JPEG of the web, then you are a blurry JPEG of the dictionary and all the books you’ve read in your life. Do you think that’s a useful way to describe yourself? I would guess not.
Reason #2: The goal of Deep Learning is to generalize, not memorize
The goal of deep learning models like those in OpenAI’s GPT-x family is not to build a lookup table or search index over their training data – not even a compressed one. A model that does that is said to have “overfit” to its training data. The objective of this kind of machine learning is to generalize a solution by learning patterns from a set of training examples.
Let’s say you wanted to train an artificial neural network to perform basic arithmetic, so you give it examples like:
1+2 = 3
2+2 = 4
1+4 = 5
It wouldn’t be very useful if it could only perform those specific operations with those specific numbers. You’d at least want the model to be able to perform addition operations with other values like 1+1, 2+3, and ideally a wide (if not unbounded) range of inputs. You’d want it to learn the pattern, not the examples.
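For the curious, here’s roughly what that looks like in code. This is a deliberately tiny PyTorch sketch, not how LLMs are actually trained, but it shows the difference between learning the pattern and memorizing the examples: a model fit on just the three pairs above can still come close on a pair it never saw.

```python
import torch

# The three training examples from above: 1+2=3, 2+2=4, 1+4=5
inputs = torch.tensor([[1., 2.], [2., 2.], [1., 4.]])
targets = torch.tensor([[3.], [4.], [5.]])

# A single linear layer: two inputs (the addends), one output (the sum)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.02)
loss_fn = torch.nn.MSELoss()

for _ in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# 3+4 never appeared in training. A model that generalized the pattern
# still gets close to 7, while a pure lookup table would have no answer.
with torch.no_grad():
    print(model(torch.tensor([[3., 4.]])))  # something close to 7.0
```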
LLMs are at least intended to generalize, not memorize. In practice, they seem to do a bit of both. You probably do that, too. As kids, many of us learned to memorize our multiplication tables – and thus can recite by rote the answer to every multiplication operation involving single positive integer values. This is an optimization we use to then perform multiplication on multi-digit numbers. We memorize the simple cases, then learn how to break complex cases down into multiple steps, where each step can leverage what we memorized.
Thus, it shouldn’t be surprising that LLMs memorize some things from their training data. In some cases, they likely memorize primitives which they can then leverage in the course of more complex reasoning (as humans tend to do in my multiplication example). In other cases, a training example may be so unusual that it doesn’t fit any learnable pattern, so the only way to learn it is to memorize it. You probably did that with some digits of pi. In other cases, the same input may be repeated many times across documents the model was trained on, so it decides that the exact wording is important and holds special meaning (e.g., “It was the best of times”).
Search engines, by their nature, must memorize. LLMs, on the other hand, should in general be generalizing.
Reason #3: The random element
Part of what makes an LLM application like ChatGPT so engaging is that there’s an element of randomness to it. It doesn’t always give the same answer. Part of the idea is to emulate creativity, and to act “naturally” in its interactions with the user. This is accomplished by introducing elements of randomization into how the output is chosen.
You may know that an autoregressive LLM like GPT-3.5 works by predicting the next token (more or less, a word) that is most likely to follow the input text. That process is repeated in a loop to generate the sequences of text you see as output in something like ChatGPT. However, the underlying model doesn’t just output its single highest-confidence prediction. It outputs many candidate tokens, each with a confidence value. OpenAI’s APIs expose a configurable parameter called “temperature” to control how much randomness is applied when selecting the output from the top N candidates. If you set temperature to zero, you’ll always get the same answer for a given input.
ChatGPT does not use a temperature value of zero. That means, even within the bounds of things the model has memorized, you won’t always get the memorized answer from ChatGPT. However, it’s important to note that this randomization occurs for each token that is output, and so selecting a non-top token for the first output token doesn’t necessarily mean you won’t end up with the same answer to your question – it might just use different words (and a different number of words) to get there.
For example, if you set the temperature to 0 and run the model with the input “My favorite animals are”, the output will always be “dogs”. With the temperature set higher, the result could be “cats”, but it could also be “large”, “brown”, “lazy”, “good”, or “the”. Then the next word, calculated by running the newly extended text (e.g., “My favorite animals are lazy”) through the model again, could be “dogs”. Or perhaps the answer you get thanks to the randomness factor is even longer, e.g., “the furry friends we call dogs”.
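If you want to see the mechanics, here’s a small numpy sketch of temperature sampling. The candidate tokens and scores below are invented for illustration (a real model scores tens of thousands of tokens, and repeats this step in a loop for every token it emits), but the shape of the computation is the same.

```python
# Illustrative sketch of temperature sampling over next-token scores.
# The candidate tokens and logits here are made up for the example.
import numpy as np

tokens = ["dogs", "cats", "large", "brown", "lazy", "good", "the"]
logits = np.array([4.0, 3.1, 1.5, 1.2, 1.0, 0.9, 0.5])  # the model's raw scores

def sample_next_token(logits, temperature, rng):
    if temperature == 0:
        # Greedy: always pick the highest-scoring token ("dogs" here).
        return int(np.argmax(logits))
    scaled = logits / temperature              # higher temperature flattens the distribution
    probs = np.exp(scaled - scaled.max())      # softmax, shifted for numerical stability
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng()
print(tokens[sample_next_token(logits, temperature=0, rng=rng)])  # always "dogs"
for _ in range(5):
    # With temperature > 0, "dogs" is still the most likely pick, but any
    # candidate can come out, and generation then continues from whatever
    # was chosen, token by token.
    print(tokens[sample_next_token(logits, temperature=1.0, rng=rng)])
```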
Reason #4: Facts and Fiction look a lot alike
As I mentioned in my earlier post, LLMs are trained on both fact and fiction (and opinions, mistakes, deliberate deception, out-of-date statements, and every other kind of text you might find on the internet). They don’t really have a way to know the difference. Indeed, even humans often struggle to discern this. If you’d like to see just how difficult this can be, just post a tweet asking if an airplane can take off from a conveyor belt going in the opposite direction. I apologize for the rabbit hole I just sent you down.
Basically, nobody told the machine learning algorithm responsible for training the model whether any given string of text it was shown during training represented something “true”. And even if we had a way to do that (theoretically, it’s possible), defining what is true versus false is not always something humans can agree on.
What’s more, in the domain of human-recorded language (e.g., text), “verifiable facts” surely make up a minuscule subset. This means the available training data would be much more limited. Imagine if growing up you were only allowed to read non-fiction. Would you have a better or worse grasp of language and all its uses and nuances?
Ignoring the practical difficulties, you might think one approach to training a “factual LLM” would be to train it only on factual material (perhaps relying on a consensus from a body of respected scholars to determine which inputs receive that label). While there might be some value in doing that for certain use cases, I think you’d discover two things:
First, you can’t really generalize across a sea of factual statements. While the content the model memorizes may be much more reliably accurate, it can’t escape making things up (or producing interpolated, and thus sometimes incorrect, results) when given input it didn’t see during training. If you trained it to always say “I don’t know” in cases it didn’t memorize, you’d be forcing it to overfit – and rather than a language model, you’d really have a compressed form of the training data.
This is actually a thing! And can be useful, if compressing a data set is your objective. I’d describe this as a compression algorithm which shares techniques with machine learning, but it’s not really learning anything in the sense we normally mean. It’s certainly not what we want for a language model.
Second, generalizing across fact and fiction can be really useful! Indeed, many of the things people love doing with ChatGPT do not involve recitation of verifiable facts. I’m pretty sure writing a rap battle between React Native and Electron is not something you can do with a model trained only on “facts”. So even if you did somehow create a fact-only model, there would still be value in a general language one.
Reason #5: A lot of information is dynamic
Some pieces of information will, at least for practical purposes, never change. It’s probably safe to say that the sun will always rise in the east. However, what time it will rise is dynamic. It changes every day, and depends on where you are on the Earth’s surface. Sure, there’s a predictable pattern to it, which in theory could be learned or memorized to an extent, but training such a pattern into a language model would be an extremely inefficient solution to the problem of being able to answer, “what time will the sun rise tomorrow?”.
Other pieces of information are not predictable at all. For example: “What’s the price of a plane ticket from Seattle to New York?”, “Who is Kim Kardashian dating?”, and “What happened on last week’s episode of Star Trek: Picard?”.
Even if you wanted to build one massive, real-time, all-encompassing “universal knowledge model”, it wouldn’t be remotely practical to do so. Perhaps someday in the very, very distant future this could theoretically be accomplished. But would it ever make sense to do it even if you could? I’m skeptical. Maybe someone has a good argument why that would be valuable. But that argument would be entirely academic, as it’s not at all feasible today.
So what’s the deal with Bing Chat?
Great question! If LLMs, and chat bots based on them like ChatGPT, should not be thought of as search engines, then what’s up with Bing? It’s really quite simple…
Bing Chat is not just a wrapper around a Large Language Model presented as a chat bot. In fact, the whole premise is that it does not generally rely on the underlying LLM to answer your queries. Instead, it uses the LLM to understand and sometimes reason over your input. It then issues one or more queries to the “classic” Bing search engine. Finally, it looks over the results and reads them to figure out which ones best answer your query. It then summarizes the answer and provides links to those results as its sources.
This is basically the same thing you or I could do if someone asked us a question we didn’t know the answer to. For example, if someone asks me how much of the US electrical supply currently comes from renewable energy, I don’t know the answer off the top of my head. But I know how to type “US electricity sources 2022” into Bing or Google, look for a result that seems like it may contain the answer I’m looking for, and then skim through the page to see if it’s there. I can then distill that answer into a succinct form to give it to the person who asked it of me. And if they want to know where I got it from, I can tell them.
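Sketched in code, that loop looks something like this. The helper functions are placeholder stubs, not real Bing or OpenAI APIs; the point is the division of labor, where the search engine does the remembering and the LLM does the reading and summarizing.

```python
# A rough sketch of the search-augmented pattern described above. The helpers
# are stand-in stubs, not real Bing or OpenAI APIs.

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to a language model."""
    return "(model output for: " + prompt[:40] + "...)"

def search_web(query: str) -> list[str]:
    """Stand-in for a classic search engine; returns snippets of page text."""
    return ["(text of a web page that discusses " + query + ")"]

def answer_with_search(user_question: str) -> str:
    # 1. Use the LLM only to understand the question and turn it into a query.
    query = ask_llm("Write a web search query that would help answer: " + user_question)
    # 2. Do the actual retrieval with a classic search engine.
    snippets = search_web(query)
    # 3. Have the LLM read what came back and summarize it, citing its sources,
    #    rather than relying on whatever it happened to memorize in training.
    return ask_llm(
        "Answer the question using only these sources, and cite them.\n"
        "Question: " + user_question + "\nSources:\n" + "\n".join(snippets)
    )

print(answer_with_search(
    "How much of the US electrical supply currently comes from renewable energy?"
))
```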
This also gets back to my earlier comment regarding ChatGPT’s new “plug-ins” capability. With certain plug-ins, ChatGPT can gain abilities similar to what Bing Chat does, among others. It can query a web search engine, query a database of airline ticket prices, or in some cases even take actions on your behalf.
In my opinion, this is how LLMs are most useful. Not in reciting knowledge, but in understanding and reasoning over human language. That can be pretty powerful on its own, but it’s super powerful when you pair it with all the things that computers are “classically” good at. Storing and retrieving data in a fast, reliable database. Encoding and decoding video files. Doing math.
An LLM being able to perform certain mathematical operations using only what it learned in its training is neat. But it’s also absurdly inefficient to use it for that. Adding two integers by performing a trillion matrix multiplications on an 800GB neural net is the kind of thing a Captain Planet villain would dream up. Or, you know, those guys.
The answer to how to make ChatGPT do math reliably and efficiently isn’t to train a better language model. It’s to hand it a calculator. If you want it to help you find information on the web, hook it up to a search engine.
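To make that concrete, here’s a toy sketch of the “hand it a calculator” idea. This is not OpenAI’s actual plug-in protocol, just the shape of it: the model’s only job is to translate the request into a structured tool call, and plain deterministic code does the arithmetic.

```python
# Toy sketch of "hand it a calculator" (not the real plug-in protocol): the
# language model's only job is to translate the request into a structured
# tool call; a small, deterministic function does the actual arithmetic.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str):
    """Safely evaluate a simple arithmetic expression like '1234 * 5678'."""
    def evaluate(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

# Imagine the LLM turned "what's 1234 times 5678?" into this structured call:
tool_call = {"tool": "calculator", "input": "1234 * 5678"}

if tool_call["tool"] == "calculator":
    print(calculator(tool_call["input"]))  # 7006652, exact and cheap
```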