The following is a rough transcript which has not been revised by The Jim Rutt Show or Melanie Mitchell. Please check with us before using any quotations from this transcript. Thank you.
Jim: Today’s guest is Melanie Mitchell. She’s a professor at the Santa Fe Institute, and her current research focuses on conceptual abstraction, analogy making, and visual recognition in artificial intelligence systems. Melanie is the author or editor of six books and numerous scholarly papers in the fields of artificial intelligence, cognitive science, and complex systems. Her book Complexity is still, to my mind, the best beginner’s introduction to complexity science, and I recommend it highly for those of you listeners out there who want to learn more about this complexity stuff we’re always talking about. That’s Complexity by Melanie Mitchell. Her latest book is Artificial Intelligence: A Guide for Thinking Humans, and she also has a Substack with the same name, AI: A Guide for Thinking Humans. Anyway, welcome back, Melanie. You’ve been on the show many times.
Melanie: Yeah, thanks. Good to be here.
Jim: Yeah, it’s actually quite interesting, and actually a little bit humorous, as I was prepping today. Melanie had published an essay, I think on her Substack, I’m not really a hundred percent sure, called Did ChatGPT Really Pass Graduate Level Exams? A very good, careful analysis. Alas, it was on ChatGPT 3.5, and that was on February 9th, and we are now at the end of March. So if we say that dog years run seven to one against human years, what are generative AI years? A hundred to one? It just feels like 10 years have gone by since the 9th of February. So I think we’re going to start with that question. Are you up on what GPT-4 is doing on these exams, which is much better?
Melanie: Yeah, so I read the paper that OpenAI put out about GPT-4, where they gave it various standardized exams in lots of different areas, and it did great on all of them except evidently AP English, which it did not score well on at all. But it’s doing amazingly, obviously. Still, the concerns I had about ChatGPT 3.5 are, I think, still relevant when talking about giving standardized exams to these language models.
Jim: Let’s dig into it.
Melanie: So what does it mean when you give a human a standardized exam, or any kind of test? You’re making assumptions that performance on the test will carry over to the concepts the test is meant to measure, that your performance on the test will mean something in the real world.
Jim: Yeah, like if you’ve passed the medical exam, you can probably take somebody’s tonsils out without killing them, right?
Melanie: Yeah, possibly, although there is some question about that. And that if you were given another test with different questions around the same topics, testing the same concepts, you would still do well. So there are all kinds of issues with giving these language models these tests. One is you have to ask whether those questions, or similar questions, appear in the training data. The way these tests are made, for instance by the College Board, which makes the AP tests, they come up with new questions every year, but the questions are similar; they’re sort of rewritten in certain ways. And so you wonder what’s been in the training data, and how much the system is actually responding out of its memorization or compression of previous text it’s encountered that’s similar to the questions it’s being given.
So to test that, back in the ancient history of GPT 3.5, or ChatGPT, I tried an experiment. I took a question from an MBA exam, from a Wharton MBA professor who published a paper where he gave this question to MBA students on his test. He gave it to ChatGPT, and it produced an answer that he said was a great A-plus answer. So I took that same question and rewrote it with the exact same problem but a different word scenario, and found that ChatGPT didn’t do well on it at all. And we know that these systems are very sensitive to the prompts they’re given. So you have to ask: if it does very well on one question but can’t answer the exact same question with a slightly different word scenario, how much does it actually understand about the underlying concepts?
Now with GPT-4, obviously it’s doing better on all of these standardized tests that they’ve given it, according to their technical report. But none of us can really test that, because, number one, we don’t have access to GPT-4 itself. We have some restricted access through the new ChatGPT interface, but we don’t have access to the model that they gave the tests on. So we can’t really probe it, and we also don’t know exactly what material they tested on. So I think there’s a lack of transparency here that’s making it difficult to actually do science.
Jim: That’s really, really annoying. And in fact, a number of us out here in the field keep saying, Hey, OpenAI, why don’t you change your name to Closed AI? Because open AI is definitely not what you’re about at the moment.
Melanie: Yeah, I think the original plan for OpenAI when it was founded was to be really a research lab rather than a commercial company. And things change when you get into the world of having a bottom line and customers and so on.
Jim: $15 billion worth of investment from Microsoft, right?
Jim: It’s really a very different game though. I’m going to use this as a chance to point out something I think is very important for research purposes, based on a podcast I did the other day with Shivanshu Purohit from EleutherAI, who also works at Stability AI. It turns out Stability and Eleuther are working on a joint venture where they will very soon, he just kept saying soon, be producing an ensemble of fully open source models, including the software that built them and the datasets. And they’re going to range in size from a one-gigabyte model to about an 80-gigabyte model, which they estimate will be about as powerful as ChatGPT 3.5.
And importantly, they were all built using exactly the same dataset, and many of them were trained with the data processed in exactly the same order. They also did some experiments with models that used the same dataset presented in different orders, and they’re going to make all of that available to people. It was specifically designed for doing stable, interesting scientific research. So if OpenAI isn’t helping the research field, this Eleuther-Stability project should be huge. I’m really excited about seeing this.
Melanie: Yeah, I think that’s great. And there’s other companies like Hugging Face and even Meta, which has open sourced some of their language model software. So there are going to be opportunities to do real science on these systems, but right now what we have more is kind of these companies like OpenAI and Microsoft saying, trust us, here’s a paper we wrote that looks like a scientific paper, but you can’t check any of it. Just trust us. This is what these systems can do. And I think we really don’t yet have a clear view of what these systems can and cannot do.
Jim: That makes sense. I do like the interesting finding that if you take the same prompt and go with essentially a synonym of the prompt, it may not do nearly as well.
Melanie: Yeah. This gets into this whole area of what’s called prompt engineering.
Jim: Yeah. I did an experiment this morning. One of the things I’ve been thinking about is: how do I use GPT as a sky hook for itself? One of its problems, as we know, is that it hallucinates like crazy, just makes stuff up, particularly for things on the fringe. You ask it for the biography of George Washington, it’s pretty accurate. You ask it for the biography of Melanie Mitchell, it’s going to be probably so-so. You ask it for the biography of Jim Rutt, a very fringe character, it’s mostly wrong. It will attempt it, it knows who I am, quote unquote, but it’s mostly wrong. At least GPT-3 was; GPT-3.5 was not much better; 4 is surprisingly better. But anyway, I took 3.5 and tried an experiment with a prompt I’ve used before to probe it: who are the 10 most prominent guests that appeared on the Jim Rutt Show?
And it does know about the Jim Rutt Show. It’s got all the transcripts, I can tell by asking it questions. And with 3.5, eight out of the 10 were hallucinations. There were no such episodes, though they were very plausible. They were people like Nassim Taleb and Eric Weinstein and Richard Dawkins, and every one of them was somebody that, yeah, if I could get them on my show, I’d probably have them on, or maybe I wouldn’t, I’m not sure. I usually don’t go for the really big names. So I then tried an experiment where I took that prompt and asked GPT itself to generate 25 paraphrases of it.
And then I tested those paraphrases to see how different the results were, and they were significantly different. The next step I need to do is automate it. I just got my access to GPT-4 yesterday, to the API that is, and with the API I can now automate all this stuff and then start actually plotting in vector space where these answers are, compute the centroid, see if that’s better. Is it better to take the average of them? But anyway, this idea of using GPT itself to generate synonym prompts, at least after 45 minutes of work this morning, looks fairly promising.
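The centroid step Jim sketches here can be mocked up without any model calls. This is a minimal illustration in Python, under the assumption that each answer has already been turned into a vector by some embedding model; the toy vectors below are invented stand-ins, not real embeddings:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def central_answer(answers, embeddings):
    """Pick the answer whose embedding lies closest to the centroid
    of all the answers -- the 'most typical' response across paraphrases."""
    c = centroid(embeddings)
    best = min(range(len(answers)), key=lambda i: distance(embeddings[i], c))
    return answers[best]

# Toy example: three answers cluster together, one is an outlier
# (standing in for a hallucinated response).
answers = ["A", "B", "C", "D"]
embeddings = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [5.0, 5.0]]
print(central_answer(answers, embeddings))  # A
```

In the actual pipeline the embedding vectors would come from an API call per answer; the geometry of picking the most central response is the same either way.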
Melanie: Yeah, that’s interesting. I think there’s going to be a lot of this kind of bootstrapping, where you ask it to critique its own answer, and you ask it to modify its own prompt, the prompt you gave it, to be a better prompt. And all that stuff is going to end up either taking you into completely crazy land or really getting it to do a lot better. So we’ll see.
Jim: Yeah, and I will say that it was not obvious to me from doing maybe 10 of these by hand whether there actually is a way to boost the signal from the noise. Because my initial hypothesis was: if I ran a hundred of them and did something like capture the number of times given names occur, and then took the top 10 names by occurrence across a hundred runs, would that actually concentrate to more real people or not? And I would say that doing a sample of eight of those runs was not enough to tell me one way or the other whether doing it at scale will work. But even if it doesn’t work, I’ll then have degrees of freedom to play with, how the prompts are structured and things of that sort, to see if we can find this sky hook where GPT can lift itself up.
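The counting scheme Jim describes, tally names across many paraphrase runs and keep the top of the list, can be sketched in a few lines. The guest lists below are invented placeholders, not real runs:

```python
from collections import Counter

def top_names(runs, k=3):
    """Tally every name across all runs and return the k most common.
    The hypothesis: real guests recur across paraphrased prompts,
    while hallucinated names scatter and rarely repeat."""
    counts = Counter(name for run in runs for name in run)
    return [name for name, _ in counts.most_common(k)]

# Invented example runs: two names recur, the rest appear once each.
runs = [
    ["Melanie Mitchell", "David Krakauer", "Eric Weinstein"],
    ["Melanie Mitchell", "David Krakauer", "Richard Dawkins"],
    ["Melanie Mitchell", "Nassim Taleb"],
]
print(top_names(runs, k=2))  # ['Melanie Mitchell', 'David Krakauer']
```

Whether the real names actually rise to the top at scale is exactly the empirical question Jim says eight runs couldn’t settle.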
Melanie: You would assume that that would only work if there’s some kind of statistical signal there that like my name having been on your show would have a stronger statistical association than Richard Dawkins or somebody, but I’m not sure that that’s there.
Jim: Well, it’s there for GPT-4. When I ask it for the 10, nine out of 10 are correct, and oh, by the way, you turn up as one of the 10 most prominent.
Melanie: Oh, excellent.
Jim: And the only hallucination is an extremely plausible one, which is Geoffrey West.
Melanie: Oh, I’m surprised he hasn’t been on your show.
Jim: Yeah, we were actually scheduled to do it about two or three years ago, and he fell down a flight of stairs and couldn’t do it, and we haven’t gotten around to rescheduling. But 4 is much, much, much, much better in that regard. But here’s my hypothesis on this question, and I originally came up with this not thinking about auto-generation of paraphrased queries, but: what happens when we have independent LLMs that are based on different training sets and different algorithms? Could we create an oracle based on having, say, 25 different models and taking the centroid of the answers in, say, latent semantic space?
And would that be a significant boost? The theory being… I mean, the hypothesis, not yet a theory, is that the hallucinations are less correlated than the correct answers, so that there’s a greater entropy in the hallucinations than in the correct answers. And therefore, over time, with a big enough N, the signal of the correct answers will start to stand out. And of course, that depends on the correlation between the models and a bunch of other statistical attributes of what these hallucinations really are, which I don’t think anybody yet really knows.
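As a rough sketch of that hypothesis, a cross-model consensus filter plus an entropy measure might look like the following. The model outputs here are invented, and the majority threshold is an arbitrary choice, not anything from a published system:

```python
import math
from collections import Counter

def consensus(model_answers, threshold=0.5):
    """Keep answers produced by more than `threshold` of the models.
    If hallucinations are weakly correlated across independently
    trained models, they should rarely clear the bar."""
    n = len(model_answers)
    counts = Counter(a for answers in model_answers for a in set(answers))
    return {a for a, c in counts.items() if c / n > threshold}

def answer_entropy(model_answers):
    """Shannon entropy (in bits) of the pooled answer distribution.
    More scattered, hallucination-heavy output means higher entropy."""
    counts = Counter(a for answers in model_answers for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented outputs from three hypothetical models; H1-H3 stand in for
# uncorrelated hallucinations, X and Y for shared correct answers.
models = [
    {"X", "Y", "H1"},
    {"X", "Y", "H2"},
    {"X", "H3"},
]
print(sorted(consensus(models)))  # ['X', 'Y']
```

The entropy number gives a crude version of Jim’s “greater entropy in the hallucinations” intuition: the more the pooled answers scatter across models, the higher it climbs.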
Melanie: Yeah. Well, and I’m curious why GPT-4 was better, if it’s been trained on more data or if it has some other kind of mechanism for being less prone to hallucinations.
Jim: Yeah, it’s an interesting question, but unfortunately we don’t know. We do now know the parameter count, I guess I got the definitive number the other day: 1.3 trillion parameters, so it’s about eight times bigger than the best of the GPT 3.5 models.
Melanie: I’m curious how you got that number because they didn’t publish that.
Jim: Well, I have my sources, and I’ve heard numbers all over the place, from 1 trillion to a hundred trillion, but I would put this person at B-plus quality as a source.
Jim: And he was pretty definitive. The number was 1.3.
Melanie: I have also heard all kinds of rumors about numbers but…
Jim: And that may just be for the text side. It may not include the multimodal stuff, which I guess they’re now starting to surface over on Bing. I haven’t had a chance to play with that yet. That’s going to open up this whole damn thing to a whole nother level.
Melanie: Yeah, yeah, that’ll be really interesting. Yes.
Jim: So let’s get back to this idea of comparison of GPT X performance versus humans on things like standardized tests.
Melanie: Yeah, so I guess the real question is: when we formulate test questions for humans, we make a lot of assumptions about human cognition. One obvious one is that the human hasn’t memorized all of Wikipedia, or, as somebody put it, most babies haven’t been trained on all of GitHub’s code. So when we give tests to humans, we make certain assumptions that allow us to kind of extrapolate, and the extrapolation is not perfect. Your score on the SAT of course doesn’t predict in very great detail your likely success in life or whatever, but there are some things that can be extrapolated. But I don’t think it’s been shown that those same things can be extrapolated from large language models passing these tests. So I think we have to figure out how to probe them in a more systematic way, different from the way we probe humans.
Jim: That makes sense. Let’s think about that for a second. I’m not sure I want to let GPT loose with a scalpel to do medical stuff, but it might be safe enough to have it write legal briefs.
Melanie: With editing. I wouldn’t let it do it autonomously.
Jim: Well, I would let it do it autonomously for the purpose of an experiment: have first-year associates write some and have GPT write some, then do blind grading by, let’s say, law school profs, and see how GPT does on the performance of the actual work, which is writing a relatively routine set of legal briefs, which is exactly the kind of work, at least some of the work, first-year legal associates do. And I was just chatting with a guy the other day who has a law firm, and his view is he can cut his associates by two thirds and hire slightly smarter associates than he normally does. He’s not a big law firm, so he doesn’t usually get to hire the Yales, but he may move from Ohio State to Notre Dame or something and say, instead of hiring 12 from Ohio State, I’ll hire three from Notre Dame and give them GPT-4. And he thinks that’s a winning strategy.
Melanie: Yeah, I mean, there’s already companies I think that are doing that kind of service in the legal sphere and…
Jim: Legal and accounting seem to be really… Certain parts of legal and accounting, not the heavy duty reasoning and counseling part, but the straightforward part, all right, let’s file a UCC lien on this block of a hundred railroad cars or something like that.
Melanie: Right. Yeah. I think there’s a good chance that will work well, but I still think you can’t let it be autonomous when you’re actually using it in a real situation, a real legal encounter where the stakes are real, because these things can make unpredictable errors. They’re great a lot of the time, but they are not thinking the way that we think. They’re not reasoning the way that we reason, they don’t understand the world the way we understand the world, and therefore they can make mistakes. And as you saw with your experiment with GPT-4, they’re as confident about their mistakes as they are about their correct answers.
Jim: Yeah. I give this example all the time: one of the things that makes them dangerous is that they don’t know they’re lying. If you talk to a detective, and I happen to be from a police family, my father was a cop for 20 years, my brother was in federal law enforcement for 31 years, my favorite cousin was a cop, so I know a lot of cop lore, the cops will all tell you they can, at a high level of probability, not a hundred percent, tell if somebody’s lying. Because when people lie, their language structure is different. In particular, they provide more detail in certain odd areas than you would expect. And of course, as parents, we know that our nine-year-olds, when they lie, the lies sort of feel different from the truth. But because ChatGPT does not know it’s lying, its hallucinations aren’t linguistically different from its truths, which actually works around our traditional ability to detect lies.
Melanie: Right. I mean, they don’t have a model of what’s true and what’s not true the way that we do. They’re using statistics, and the statistics of something untruthful are, to them, equal to the statistics of something truthful. So I don’t see how they could tell the difference.
Jim: Yep, they couldn’t. And that tricks us, because normally lies have some stylistic differences from truth when generated by humans, if they know they’re lying and they’re not sociopaths. If they’re sociopaths, you can’t tell, right, ’cause that’s just the way they are. Okay. Before we move on from testing: I gave GPT 3.5 an IQ test back on March 3rd, on the advice of a psychology professor I know. It’s the vocabulary IQ test called the VIQT. It was actually quite clever, and I think it had less of the risk of the model having found the artifacts, the actual questions and answers, that may be biasing some of the other tests, because this test is built from five-word lists.
And the task was just to find the two words that were most similar in each five-word list. At first it was totally trivial, but eventually it got pretty tricky; you had to have a big, big vocabulary to be able to do it. And when I gave it the full test, which took me about an hour, it came out with an IQ of 119, which I thought was kind of interesting. That’s about the level of a four-year college grad from a third-tier state university.
Melanie: Yeah. I don’t know. But that seems like an ideal task for a language model, given all the language that it’s been trained on.
Jim: Yeah, I was a little surprised it didn’t totally ace it. When I started, I asked myself, what’s my prior here? I said I would not be shocked if it entirely aced it, but it wasn’t even close. It got 38 questions right and seven wrong out of 45.
Melanie: Well, yeah, I mean, probably GPT-4 would ace it or have a higher, quote unquote, IQ. But that’s a great example of what I’m talking about. That test is meant to test not just a human’s knowledge of vocabulary, but also some more general intelligence, right?
Jim: Well, it at least assumes that g, the general intelligence factor, is relatively strongly correlated with the vocabulary IQ test. The correlation is nowhere near one, as it turns out, and it’s not even the strongest of the subtests, but it is relatively strong.
Melanie: Right. But then does that mean the same is true for ChatGPT when you give it that test? Is that same correlation there? And I would say, I don’t know. Probably not. But that’s an empirical question. Giving that test, which is predictive for humans, might not be predictive for machines.
Jim: Yeah. It’s funny, the very first assignment I gave ChatGPT was basically ripping off an exercise from senior honors English in high school, which was: compare and contrast Moby Dick and Conrad’s Lord Jim. It’s the classic assignment for high school honors English or freshman English at a third-tier state university. And it did a credible, though not brilliant, job. I suppose I could say that on that very subjective, unverified test, it kind of felt like an IQ of 119 in terms of its ability to write a literary essay. It was very disciplined. It was insightful enough. It wrote with perfect grammar, of course. So in that sense, it knocked off the equivalent of an assignment you might get in freshman English.
Melanie: Yeah. Well, I think this is a big challenge for us going forward: to figure out what are the right assessments to give these systems, ones that would actually predict their abilities on real-world tasks. And there are a lot of people working on this. There’s a thing that people at Stanford put out called HELM, the Holistic Evaluation of Language Models, that tries to get at a better way to test these things, one that will actually correlate with whether they can help us as doctors or lawyers or programmers or in other kinds of professional work in the real world. Which is really the thing we want to know, right?
Jim: Yeah. This looks like the paper, Holistic Evaluation of Language Models, with about 40 co-authors.
Melanie: Exactly, exactly. Yeah.
Jim: I’ll have to read that because this alone would be an interesting business. Young entrepreneurs out there, remember people, I’m too old and too rich to fool with this stuff. So all these ideas, please run with them, right?
Melanie: Yeah. A company like the College Board, but that tests language models and stuff.
Jim: Exactly. There’s a little idea for you people, you could do it. It reminds me of the PC in 1980, where there were a zillion bits of low-hanging fruit that one-, two-, three-, or five-person teams could easily knock off.
Melanie: Yeah, I think that’s exactly the analogy for where we are is that this is kind of like the PC. It’s this thing that’s going to open up a huge number of applications but it’s going to take human creativity to do it. It’s not already there.
Jim: And to your point, most of the really valuable uses are going to be with humans in the loop for quite a while. As you said, I wouldn’t let GPT-4 write a UCC filing that I was going to give to the court. Hell no, right? But on the other hand, it may be able to do it, with some oversight, at a much higher level of productivity. I actually did use GPT early on for a productive purpose. I wrote a resignation letter from a board of advisors I was on, at a company I wanted to remain on good terms with and had had a good experience with.
And so it was one of those letters that you have to write in a nuanced fashion; it probably would’ve taken me an hour to write a one-page letter with the right level of nuance. But instead, I just gave it the hints, and it did a perfect job, better than I would’ve done spending an hour. Press send, done. It took five minutes at most, so it was a 12-to-one boost in productivity, a classic example of humans in the loop. And I didn’t make any changes. I just had to quality control it: yeah, yeah, it’s exactly what I wanted.
Melanie: Yeah. No, I think that’s exactly right, and we’re all going to be using these things for all kinds of things. I use it to write code, but little pieces of code, things that would take me a couple of hours, and it just spits them out. It has errors often, but then I can go fix the errors.
Jim: What I found for coding and to me it’s probably even more of an aid to me than someone who codes regularly. I go on these coding binges about once every six months to a year, and I don’t do any coding in between. So my fingers forget how to write Python, for instance. And man, this time when I went back to writing some of this stuff around these models and also building my own chat bots and some things of that sort, wow, even ChatGPT 3.5 almost perfect in writing functions. So 10 to 30 lines of code, even if it was something fairly convoluted. And what’s particularly impressive, at least in the Python space where it seems to have very good coverage is it really seems to know the APIs. So some new API that I’m using, and I don’t know it’s syntax, some complicated, many parameters.
Man, it just knows how to do that. It’s astounding. It’s at least five to one for me, maybe 10 to one. For an experienced programmer, someone who’s current day in and day out, maybe less, but maybe not; it’s certainly big. Well, let’s move on from assessments. Let’s let some young, aggressive folks build the College Board for LLMs, and I’ll take advantage of it. Let’s move on to a paper you co-wrote with David Krakauer, who’s been on the podcast a couple of times. He’s the president of the Santa Fe Institute, a very interesting fella. And the name of the paper is The Debate Over Understanding in AI’s Large Language Models. Now that is a mighty big question.
Melanie: Yeah. So this paper came about because I wanted to sort of summarize what people are saying, what the sides in this debate are. There’s been a whole bunch of people on one side saying these language models can understand human language the same way humans do, and maybe are even conscious in some sense, that’s another debate too, and that we really should treat them as language understanders who understand the world the way we do, because they’ve been trained on language, which is kind of an intermediate representation of the world. Then there’s another group of people, you might call them the stochastic parrot side, who say these things don’t understand anything.
They are just parroting, in a slightly more sophisticated sense, the language they’ve been trained on. They’re computing the probability of the next word, but they’re not understanding. We wouldn’t have said that Google’s search engine understands your queries; it’s using an algorithm, it’s not the kind of category of thing that understands. So we tried to review what these people were saying, and what the notion of understanding means in cognitive science, how psychologists and neuroscientists and so on talk about human understanding or animal understanding, and how it compares with these systems. So basically, the idea is that the word understanding really… there’s a lot of stress on it today.
Jim: Exactly, yeah. The word understanding is not well understood.
Melanie: Absolutely not. And that’s the problem. It’s a word that can’t take the new stress that is being put on it by these language models.
Jim: Oh, that’s beautiful. Because this is something I keep saying is that we may learn a hell of a lot more about intelligence, consciousness, cognition, understanding from having to deal with these LLMs. It’s forcing us to clarify our thinking.
Melanie: And AI has done that throughout its history from the very beginning. I mean, back in the 1960s and 70s, people were saying that to get a chess playing computer, a system that could play at the level of a grand master or be a world champion, you would have to have full-blown human general intelligence. And clearly AI proved us wrong, right?
Melanie: We don’t want to say that Deep Blue is intelligent.
Jim: They can’t even drive a car, god-damn it.
Melanie: We don’t want to say that intelligence is brute force search. I mean, at least I don’t want to say that. So I say, okay, maybe we were naive about what intelligence is. Maybe we don’t understand. And these new artifacts that are created by AI, by people working in AI, will pressure us to refine and clarify our understanding of what these mental terms mean.
Jim: Yeah. You mentioned one, which is one of my pet peeves because the area I work in the most is the scientific study of consciousness and areas around the possibilities of machine consciousness. And that conversation is a complete botch. The guy that got fired from Google saying that their language model was conscious, right?
Melanie: He said sentient.
Jim: Did he say sentient? Okay.
Melanie: Yes, which maybe is not. I mean, that’s…
Jim: Yeah, okay. Then what’s the difference between sentience and intelligence and… The big takeaway I keep trying to remind people of is that consciousness and intelligence overlap, but not necessarily a lot in certain places. Something like a self-driving car is pretty intelligent. It’s navigating in high complexity on high-stakes tasks, lots of different tasks, and it has to deal with pretty open-ended situations. But is it conscious? Really hard to imagine. On the other hand, a toad is probably conscious at a relatively rudimentary level, in the John Searle sense of consciousness, which is the one I happen to like.
It’s got lots of lower-order intelligence, keeping its body working and digesting flies and things like that. But in terms of higher-order intelligence, it doesn’t have a hell of a lot; it would not score high on an IQ test. And yet it’s got consciousness. And then, of course, you even run into the strange cases of humans with severe brain damage, where much of their cortex is gone, or their hippocampal memory-building areas are gone, and yet they’re still clearly conscious. Consciousness and intelligence… there’s some overlap, but they’re by no means the same thing, and people get them so confused.
Melanie: Well, even intelligence is not a single thing. It’s very multidimensional. When we talk about machine intelligence and human intelligence, they’re different things, or at least people don’t mean the same thing when they use the term. And I think the same is true of understanding. One of the things we talked about is this notion that humans have a very strong desire, maybe innate, to compress: to understand via compression, to take some complicated thing and compress it into something like Newton’s laws, what we might call a lower-dimensional representation of the thing. We have these concepts of the world, which are not equal to the world, but are compressed models of the world. And these large language models don’t have that same evolutionary pressure to build compressed models. At least we don’t think they do.
One thing: we have a very small working memory, whereas GPT-4 has a context window, the number of tokens of text it can take in, of 32K tokens. You can’t keep 32K tokens in your working memory; it just won’t work. You have to build abstractions, you have to build compressions. So I think we have a different kind of understanding that maybe will end up being more generalizable than what these language models have. And I actually heard a really interesting talk from the AI researcher Yoshua Bengio in Montreal about how this constraint on working memory may be the secret to our intellectual abilities, something these machines won’t have.
Jim: Again, I’ve studied this stuff a lot, and the architecture of our conscious cognition, which certainly includes working memory, is a huge bottleneck. It has stereotyped our cognition in a certain way that turns out to be good enough to achieve general intelligence, although, as you point out, it’s nothing at all like the large language model; it’s quite the opposite. In fact, when I’m hypothesizing on this, I put forth the concept that what’s allowed higher animal intelligence is something like heuristic induction. Because of all these bottlenecks, we need to find small rules that have high leverage. And so far, at least, these large language models don’t have any evolutionary pressure toward that at all.
Jim: And it’s interesting, because I put out on Twitter yesterday that one of my first projects, once I get my APIs fired up, is to build some hierarchies of memory exterior to the LLM, plus an intentional mechanism that works with those memories, and see if I can somehow coerce the LLMs to act like the unconscious processes that we use for understanding and producing language, but not be the repository for all the various hierarchies of memories that we have. It may be just a pipe dream, but it seems like something that could be fruitful.
Melanie: One of the things that we have that these language models don’t is a long-term memory. You can say they have a long-term memory in that they have these billions of weights, or a trillion now maybe, that are storing everything they’ve learned from their training data. But they don’t have a memory of, say, a conversation we had two years ago. I remember all the different interactions I’ve had with you. I have an episodic memory that forms my own sense of self, and these systems don’t have that. They’re lacking that, and I think it’s going to limit some of the things they can do.
Melanie: They’re different, but maybe people will be, as you say, bolting on extra parts or integrating these parts and making them more human-like.
Jim: And then again, if we look at the work of people like Antonio Damasio and Anil Seth, et cetera, who focus on embodied cognition and animal cognition, emotions clearly have a large part to play in the final decisions that we make, even if we don’t want to admit it. Damasio is a clinician in addition to a researcher, and he’s had patients with problems that essentially destroyed their emotional machinery, and they couldn’t decide what to have for breakfast. Because at the end of the day, the tipping factor is emotion or intuition or whatever we want to call it.
It’s some bodily signal that says, all right, of these options that are there, we’re going for this one. And I’ve long thought that if we look at the mathematics of inference, if you try to do it formally, in… what was that language we used to fool around with in the 80s? Prolog, yeah. It would fail because of the combinatoric explosion of inference. It just got too big, too fast. Using working memory, the limited contents of consciousness, and emotion as a very simple-minded picker is a kind of hack to get around the combinatoric explosion of inference, one that allows a relatively low-CPU, low-clock-speed device to actually process the world.
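Jim’s point about the combinatoric explosion, and about emotion as a “simple-minded picker,” can be sketched with a bit of arithmetic (a toy editorial illustration, not from the conversation; the branching factors and depths are invented):

```python
# Toy illustration of the combinatoric explosion of exhaustive inference:
# a complete search tree with branching factor b and depth d contains
# (b^(d+1) - 1) / (b - 1) nodes, so it blows up exponentially with depth.

def tree_nodes(branching: int, depth: int) -> int:
    """Total nodes in a complete search tree of the given depth."""
    return (branching ** (depth + 1) - 1) // (branching - 1)

# Even a modest branching factor of 10 gets out of hand within a few steps.
for depth in (5, 10, 20):
    print(depth, tree_nodes(10, depth))

# A greedy "picker" that commits to one option per step, the way an
# emotion-like tie-breaker would, inspects only b * d options instead.
def greedy_nodes(branching: int, depth: int) -> int:
    return branching * depth

print(greedy_nodes(10, 20))  # 200 options examined, versus ~10^20 nodes
```

This is the sense in which a bottlenecked, low-clock-speed device can still act in the world: it trades exhaustive inference for a cheap heuristic commitment at each step.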
Melanie: Well, emotions probably evolved because we are such a social species; at least that’s one of the reasons. We have to deal with each other, we sometimes have to have motivations, and emotions are very key to that. They enable our social interactions, and we have to care. There was a great article a couple of years ago by the philosopher Margaret Boden; I think the title was something like AI Won’t Take Over The World Because It Doesn’t Care. And I think it’s a really interesting question: how much does caring matter for intelligence? You have AlphaGo, which is better than any human Go player, but it didn’t care. So maybe caring doesn’t matter for that kind of game playing, but does it matter for actually being intelligent in the real world?
Jim: That’s an interesting question. Well, couldn’t you say AlphaGo’s caring was the definition of its loss function, winning games?
Melanie: I don’t think it had anything like emotion around that. That’s a very impoverished view of caring. That goes back to: does the thermostat want to keep the temperature constant?
Jim: Of course, Tononi would say the thermostat is conscious at the level of one bit of Phi, which…
Melanie: Yeah. I’m not that comfortable talking about consciousness, because I think there are so many different definitions, and it’s so vague and ill-defined.
Jim: Well, let’s get back to understanding. What else did you guys lay out in your paper on understanding?
Melanie: Well, we said that maybe we should be more pluralistic in our ideas about understanding, that these different kinds of intelligences may have different kinds of understanding. So here’s an example: AlphaFold, the program that predicts protein structure, did not have a mechanistic physics model of, say, the electrostatics of protein folding and so on. It used a statistical model of correlations between known sequence structures and new sequences, and some other information too. Did it understand protein structure? Well, not in the same way that we do, but it certainly did a better job of predicting it. So maybe there are different kinds of understanding that will be useful in different kinds of circumstances, and we have to make better sense of what we mean by the term in these different systems. That was the paper: a plea for a new, better science of intelligence that will help us make sense of what’s going on with these large language models and with ourselves.
Jim: Yeah, I do love that it’s forcing us to think about ourselves. And I sometimes say, just to annoy people, which I always enjoy, especially on Twitter, that the biggest takeaway from the LLMs is that humans are a hell of a lot more like LLMs than we would’ve ever thought.
Melanie: Maybe and maybe not.
Jim: I’d love to hear your thoughts on that. I threw that out there to be intentionally provocative.
Melanie: Yeah, I mean, clearly a big part of our intelligence is predicting what’s going to happen, and that can involve predicting the next word or the next frame in a video. There’s a lot of complexity underlying our ability to predict, and maybe there’s a lot of complexity underlying these language models’ ability to predict. But I think we do it for different reasons. We’re under a different evolutionary pressure than these models are, and I think that might make the kinds of internal representations very different. So there’s been some interesting research on the language area of the brain, the one that deals with the form of language: syntax, grammar, et cetera.
Jim: Broca’s region, right?
Melanie: Yeah. Some of the representations in that area can actually map onto representations in large language models. Ev Fedorenko of MIT has done some really provocative studies in that area. But that’s the form of language, as opposed to what they call the function of language, which maps language onto all of these bodily sensations and our physical experience. That’s what the language models don’t have. They don’t have that kind of grounding. So in that sense, I don’t think they understand the world in the same way we do.
Jim: And the question is, could they, if we hooked them-
Melanie: Could they? Yeah. When David Krakauer and I wrote the paper, we got into a discussion, because one of the things he was saying is that we have this underlying physics model: when we think about a situation or read about one, we have a model of the physics of what’s going on, whereas these language models don’t have such a model. I think that’s debatable, but the question is, could a model of physics come out of just learning from language? Is language a rich enough representation to give these systems the kinds of intuitive physics models we use to understand the world, or the intuitive psychology models we use to understand other people? I think that’s an empirical question.
Jim: And I think a lot of us five years ago would’ve said, nah, language is not enough. But some of the recent results have us scratching our heads. I talk quite a bit with Josh Tenenbaum up at MIT, and he has built some simple 18-month-old baby AIs and things. One of the things he built in was a physics model from the Unity game engine, and he found that it actually helped quite a bit. But that’s something closer to older-style symbolic AI than to trying to extract reality from the most massive amount of language. So I’d say it’s an empirical question. I’d love to see what happens when multi-modality comes in, when you start saying, all right, Mr. GPT, explain why this glass fell off the table, for instance, right?
Melanie: I think of it as Ms. GPT. No, yeah, exactly. I think that’s going to be really interesting, and I think those systems have, in some sense, a better chance of developing these kinds of basic intuitive physics models. But yeah, we’ll see. One of the things we cited in our paper was a survey somebody did of natural language processing researchers. They were asked whether they agreed or disagreed with the statement that just training on language will be sufficient for an AI system to learn to understand language in some non-trivial sense, and half agreed and half disagreed. So there’s definitely a big split. I don’t know if that’s changed in the last year, but…
Jim: I’d love to see that one redone. And of course, the other area where people say, oh my God… Somebody today riffed on the old saying that if World War III is fought with nuclear weapons, World War IV will be fought with sticks and stones. His little witticism was that when GPT-4 goes to war, the GPT-5 war will be fought with paperclips. But I do know how these things work: we know that they’re feed-forward networks. There’s no learning, no online learning yet. And if you calculated Tononi’s Phi for ChatGPT, it would be approximately zero, on the same order as a thermostat, maybe less. For all the blather about these things, it’s useful to keep in mind what these large language models are: static feed-forward networks that were created one shot. And while they’re very interesting, and they really are surprising, until they have online learning, the ability to adapt themselves to the world as they meet it, et cetera, they’re going to be something other than the kind of self-modifying intelligences that we are.
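The static-versus-self-modifying distinction Jim is drawing can be sketched in a few lines (an editorial toy model, not any real LLM’s architecture; the single tanh layer and all its numbers are invented):

```python
import math
import random

random.seed(0)

# A frozen feed-forward "model": its weights are fixed once, at training time.
W = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]

def forward(weights, x):
    # One dense layer with tanh; inference never modifies the weights.
    return [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in weights]

x = [0.5, -1.0, 2.0]
assert forward(W, x) == forward(W, x)  # same input, same output, forever

# An online learner, by contrast, nudges its weights after every interaction.
def online_step(weights, x, target, lr=0.1):
    y = forward(weights, x)
    return [
        [w - lr * (yi - ti) * (1 - yi ** 2) * xj for w, xj in zip(row, x)]
        for row, yi, ti in zip(weights, y, target)
    ]

W2 = online_step(W, x, target=[0.0, 0.0, 0.0])
assert W2 != W  # the online learner's weights move; the frozen model's never do
```

The point of the contrast is that in the frozen model, every interaction with the world leaves the system exactly as it was, which is what Jim means by a one-shot, non-self-modifying network.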
Melanie: Yeah, definitely. Although there’s some interesting claims about online learning taking place within the attention layers as the system’s doing its inference.
Jim: Yeah, I have read that. Have you dug into that research at all? Do you have an opinion on it?
Melanie: Not really. Yeah, I’d like to understand that better.
Jim: Yeah. Scratched my head about that, but I didn’t know [inaudible 00:49:53].
Melanie: Yeah. But what’s amazing, when I talk to journalists and the general media, is how much disagreement there is among experts about these systems: how they work, what they can do, what they can’t do. I think it’s a bit amazing how little we understand these things we’ve created, how little we understand exactly what they are and what they can do.
Jim: Yeah, that was actually probably the biggest theme of my podcast last week with Shivanshu Purohit: we just don’t know. It’s amazing. Here are these huge, powerful technologies, and there is no theory yet of how they do what they do. I asked him the complexity question: “You guys are building these models from one gig to 80 gig. Is there a phase change somewhere along the line where interesting emergences occur that didn’t occur at smaller sizes?” And he said, “Definitely there is. And it’s around the size…” at least for their model, “around 10 gigabytes.” And their model, they believe, is more efficient per parameter than ChatGPT or the GPT family, which was just brute force.
They have, allegedly, a clever way of building the parameter base, but he said, yeah, you can see a whole class of emergences above that size and not below. Unfortunately I didn’t have time to get into it, and he may not know; he was the engineer building it, not necessarily the scientist probing it. So I’d love to see some complexity-oriented scientists work with models at this scale and set up experiments that would demonstrate a relatively sharp phase change with respect to size, and get back to Phil Anderson’s old chestnut that more is different.
Melanie: Yeah. Yeah. I think that that’s a great challenge for complexity science, and I hope to take part in it.
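What a “sharp phase change with respect to size” would look like can be sketched with a toy curve (an editorial illustration only; the critical size, the sharpness, and every number below are invented, not data from any real model):

```python
import math

# Toy illustration of "more is different": a capability that is near zero
# below a critical scale and jumps sharply above it. We model the capability
# as a logistic curve in the log of model size; all parameters are invented.

def toy_capability(size_billion: float, critical: float = 10.0,
                   sharpness: float = 5.0) -> float:
    """Invented capability score in [0, 1] as a function of model size."""
    z = sharpness * (math.log10(size_billion) - math.log10(critical))
    return 1 / (1 + math.exp(-z))

# Near zero below the invented critical size, near one above it.
for size in (1, 3, 10, 30, 80):
    print(size, round(toy_capability(size), 3))
```

The experiment Jim is asking for would amount to measuring a real capability at many sizes and testing whether the fitted curve is this kind of sharp sigmoid rather than a gradual slope.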
Jim: Yeah. Very cool. Well, I want to thank Melanie Mitchell, one of the smartest and most experienced people in the AI world, who has been a wonderful guest here today and helped us make sense of this unbelievably fast-moving world. Probably the next time we talk will be a thousand years from now in LLM time.
Melanie: Exactly. Yeah. Can’t say anything because it’s going to be disproven next week.
Jim: Exactly. It’s crazy. All right. Thank you again, and this has been wonderful.
Melanie: Thanks, Jim.