The following is a rough transcript which has not been revised by The Jim Rutt Show or Shivanshu Purohit. Please check with us before using any quotations from this transcript. Thank you.
Jim: Today we have Shivanshu Purohit, head of engineering at EleutherAI. His day job is he’s a research engineer at Stability AI. Those are the guys that bring us Stable Diffusion, the rather cool generative art image machine. And I expect other stuff too, though I didn’t really dig into it. A while back, we had Connor Leahy on, also of Eleuther at that time, Currents 38 and 33 if people want to go back and check what they’re up to. And I’m checking in today with Shivanshu to see what’s going on in the world of open source generative models. So welcome, Shiv.
Shivanshu: Yeah, yeah, I guess. Great to be here.
Jim: Yeah. Love to chat about this stuff. So you’re involved with two big open source generative model efforts, a really big one at Stability and an interesting one at Eleuther. From your perspective, what’s the reason to do open source generative models?
Shivanshu: Well, first of all, I firmly believe, obviously everyone’s going to say it, but I definitely firmly believe that the technology that we’re working on is probably the most interesting thing, not just in computer science, but in industries everywhere. And the fact that it’s almost at a point where it can be locked off by a couple of organizations doesn’t sit well with me. And since the ease of access in terms of how fast you can get to an expert level at training these models, it just begs the question: why not just do it hacker style and make it freely available for everyone? I mean, the technology is both powerful and easily accessible, so everyone should have it.
Jim: Yeah, I absolutely agree. And of course, I’ve played a lot with particularly the OpenAI stuff. I had a very early access to it and it’s very annoying their, what I call nanny rails, they put on. My partner and I were trying to work… We’re working on some software to use GPT to aid in movie script writing, for instance. And one of our hypotheses is we want to have psychologically realistic profiles on all the characters. And we’re using a psychological model from the scientific literature called the OCEAN Model, which stands for openness, conscientiousness, extroversion, agreeableness, and neuroticism. And we got it trained to, in a clever way, generate characters based on decile numbers one to 10 on each of those five. But then when we asked it to increase the neuroticism of one of the characters, it gave us a lecture about, “Stereotyping people based on their mental health is wrong, blah, blah, and I can’t do that. Fuck you.” Right?
And I go, “What the hell?” Right? And another one I just verified, read it someplace, probably on Twitter. I verified that even in the playground of OpenAI, which has got a little bit less nanny rails than ChatGPT. If you ask it to write a diatribe against Donald Trump, it’s happy to do so. If you ask it to write a diatribe against Joe Biden, it says, “I don’t do bad things like that.” And why should a bunch of jerk offs working for OpenAI tell me what I can and can’t do with technology?
Shivanshu: Exactly. Basically that. It reminds me of this recent development that we had. So Facebook, now Meta, but I still like to call them Facebook-
Jim: I call them Assbook myself.
Shivanshu: So Facebook released this suite of new models called LLaMA, and they’re basically… I mean, it’s a decent contribution to open source basically. It’s the first time someone released super powerful, nearly state-of-the-art models for public consumption. But again, they still have their own terms of service and whatnot so you basically can’t fine-tune the model or create a service out of them. It’s just a toy that they gave you and there’s some guardrails around it on what you can do. But what the guys from Stanford did basically is they took the model, they used $200 of credit on ChatGPT to generate some data, and then fine-tuned one of the LLaMA models on the data basically. And that created a sort of publicly available alternative to ChatGPT basically. It’s an open source ChatGPT.
And the problem with that is not just the fact that Facebook doesn’t allow you to fine-tune the model, but the fact that OpenAI doesn’t allow people to train their own models on the completions generated by the OpenAI API. So it’s definitely jarring not just by the fact that, okay, they are sort of handholding you on what you can and cannot do, but then you realize the fact that some real startups are actually using the OpenAI API for the majority of their backend. And the fact that OpenAI can basically just make these silly rules on the fly could basically kill these businesses if they ever wanted to.
Jim: And of course, OpenAI has this huge advantage. They see all the queries, right? And if you have a business that’s really doing well, they can just kill you. They can say, “All right, this is what these guys are doing. Let’s hire three smart kids, have them use GPT-4 to do the coding so it takes almost no time at all, and let’s knock these suckers off.” I would not want to trust OpenAI. Well, I mean, you’re kind of forced to at the moment, but if I had an alternative, I would definitely consider it if I were doing a startup that relied upon those APIs.
Shivanshu: There may very well be an alternative in the very near immediate future. I mean, there’s obviously a lot of interest in open sourcing these technologies, and since language models have become so powerful, finally people have started to consider that maybe it’s at a point where things are getting too powerful. So it’s definitely not a good idea to let just some companies have their hands on everything. And hopefully in the near future you will see open source replications that aren’t just kind of formalities like how Facebook did it with just releasing the checkpoint and giving you a way to just do the inference, which basically means you run the model locally.
Jim: Yeah. Didn’t somebody hijack the actual model weights?
Shivanshu: I mean, yeah, basically. So if you have an academic email, you could get access to it. Someone just dumped it on a Torrent for everyone to download, basically.
Shivanshu: That being said, what Facebook did, I would still consider it to be a half assed effort, basically. I mean, they did release the model weights. Sure, I congratulate them on that. But what they didn’t do is they didn’t show you what the actual model hyperparameters were. So if you want to train a model yourself, you’ll not be able to, because first of all, you don’t know the actual architecture. I mean, obviously everyone knows it’s a language model and it’s a transformer architecture that’s basically been used since the last five years, but obviously there are some tweaks to make it more performant. And then they didn’t release the code either. So you definitely will not be able to train the model, even if you have the same scale that Facebook did, which is basically impossible for an individual anyway.
Jim: Indeed. So yeah, I think that’s a very interesting and important distinction that real open source should include the code that built it as well as the hyperparameters for the model. And of course, then we’ll see a whole evolution of ability to intervene in the model. For instance, I can imagine model codes that allow some level of inspection of what’s going on, some ability to generate at least somewhat reliable attribution of where things came from, et cetera. And so I agree, they give you just a brittle black box, you can’t do that. That’s no fun. Jesus.
Shivanshu: To that point, it’s actually desperately needed in the field. Basically no one currently knows why these models work the way they work and why is it that they can just grasp all of the knowledge that they actually have. I mean, on a high enough level, you could say that… For example, if you have a model trained on English language, then English is basically a distribution in some very high dimensional space. And the model can basically learn that distribution and draw samples from it. At a high enough level, you could say that, but how does it actually do that? We have no idea.
Jim: I have to say, it is surprising to me that they work as well as they do considering what they are. And as you say, from a theory perspective, we don’t know why. And it is very important that we figure that out. I imagine people at Google and Microsoft or OpenAI, I guess Microsoft and OpenAI are the same thing, and Facebook probably have some idea, but they’re keeping that very close to-
Shivanshu: They don’t.
Jim: They don’t? Okay.
Shivanshu: They don’t, yep.
Shivanshu: If you read a couple of papers, you basically have the same amount of knowledge as they do, on why they work the way they work.
Jim: Got that.
Shivanshu: The only knowledge I guess they have that the regular public lack is how to effectively train these models. And that’s actual engineering work that goes into it. But as far as the scientific reasons for why we do the things we do is concerned, basically no one understands that.
Jim: And that’s exciting. As a person who’s basically a scientist, I’m more of a scientist than an engineer, even though much of my business career was kind of engineering oriented, I am intrigued and particularly my strong interest in cognitive science, I’m on the Board of Visitors at the MIT Brain and Cognitive Science Department, and I’m on the Board of Advisors at the Fralin Science Institute, or whatever the hell they’re calling it these days, which is a good neuroscience institute at Virginia Tech, and so I am utterly fascinated by these big picture theory questions. How does a statistical compilation of language based around attention that’s essentially predicting the next word in a sequence from a priming, how in the world does that write code from a very simple description? And I’ve done it. It does it and it does a better job than I do as sort of an amateur programmer, a guy who writes a program about once every six months.
It’s a shitload better Python programmer at the moment than I am. And I have quickly realized that my productivity’s going up by at least 5X by putting GPT-3.5 or 4 in the loop and having it write functions. It’s still not that great for writing long, detailed code, but for filling out a framework with a whole, “Hello, world,” version of Flask or something, or Django and writing even a pretty convoluted 50 line function, it’s great. And so learning how you can use your tools is kind of the key to getting to take advantage of this kind of stuff.
Shivanshu: So yeah, it kind of is a new paradigm now. It’s basically a repository of knowledge that can talk back. So you are basically able to talk to a computer, if you want to call it that, in English. And you literally can program with the way you speak. So everyone’s a programmer now.
Jim: Yeah, I don’t remember who it was, I wish I could remember his name so I can give him credit. But somebody on Twitter said, “Within a year, the number one programming language will be English.” And that’s going to be quite interesting. And I’m going to make a prediction right here, I just made it to a couple of VCs yesterday, which is there may be a swing back to hiring liberal arts majors. Liberal arts majors, particularly English majors or foreign language people, people who know multiple languages. And I also predicted music composition people will suddenly be very valuable because the occupation of prompt engineering is actually quite different than writing C++.
Jim: And the kind of people that are good at writing C++ are likely psychographically quite different than the people that are excellent at prompt engineering. And so all you liberal arts graduates out there, you’re not destined to be baristas forever, right? Tomorrow go get yourself an OpenAI API account, start playing with it. And most importantly, read the paper that was published yesterday on OpenAI plugins. Did you read that?
Shivanshu: I actually haven’t. It’s super hard to keep up with the pace of how everything’s going.
Jim: Oh, I know.
Shivanshu: Yeah. My own job is to train models at this scale. So I don’t know, 10 hours a day, I’m basically just looking at a screen or multiple screens and just hoping that I don’t get a crash on my runs because there’s so many of them that I have to spend maybe half an hour just to restart all of these runs, if I run into a problem.
Jim: Oh, you need to get a DevOps front end on GPT-4 and just tell it to do it.
Shivanshu: Yeah, I actually might do it at some point.
Jim: And with the plugin, I mean, the last time I gave an all-hands alert on a new software technology was the day Java was released by Sun Microsystems. I sent out an email to 500 techies and I said, “I just read this paper, this is going to change the fucking world, people.” And it probably didn’t change the world quite as much as I thought it might, but it turned out to be a damn good call.
Shivanshu: Well, it did shape the internet, so there’s that.
Shivanshu: I mean, it already has been off to the races, at least as far as scaling out these models is concerned. But now we are at a point where we know that these models can actually perform quite on par with a human, at least for the limited cognitive task that is the work we do. So for that, we can basically automate a lot of it. And now is the time to make money, I guess.
Jim: Yeah, and it reminds me in another way, talking about mixing metaphors here of the PC business around 1980, ’81, where there were millions of easy but valuable things to do. Every domain needed PC software and three or four kids could knock out some application for some vertical and make a couple of million dollars in a couple of years. And you didn’t have to be too smart and you didn’t have to be too good. You just had to move fast and be good enough. And it feels like that land rush right now, that the door has opened. And if you know something about automating some domain and you understand how particularly contexts work, the context window, especially in the language model, it’s so key. Get your head around the concept of how do you dance data in and out of the context model to solve real world problems programmatically, and you can make as much money as you want, as fast as you’re willing to run. And it’s so wide open, the adjacent possible, as we’d say in complexity science, is suddenly huge.
Jim: It’s very interesting. So let’s now turn back to the core of what I wanted to talk about today, which is the state of play of true open source generative models. What do you know about all that?
Shivanshu: So I would like to say meaningfully, but yeah, maybe I’m overselling myself, but I’ve been contributing to open source AI research for maybe three years now, since 2020 when I joined Eleuther as a volunteer. Basically back then, Covid had recently struck. Everyone was basically stuck in their homes. We had a lot of free time and GPT-3 had come out. So basically we thought, “Why not just try to open source it?” We were fortunate enough to get some decent compute, which I mean, in today’s age, it’s basically trivial. But back then… I’m saying back then when referring to 2020 as if it’s like a decade ago, but that’s basically how fast the field is moving. 2020 feels like 30 years ago to me. So yeah, back then, we had some access to this initiative from Google called TPU Research Cloud. So basically Google has its own architecture, its own chip, which is parallel to whatever NVIDIA does with GPUs, right?
Jim: Yeah, the tensor chips.
Shivanshu: Yeah, the tensor processing unit. So back then, no one was using it because it’s a pain in the ass to actually use them. I mean, not today. They have made some progress on it, but back then it was basically hell, trying to get your hand inside a cheese grater kind of thing.
Jim: Wasn’t as bad as trying to use NVIDIA before CUDA.
Shivanshu: Oh yeah, could be. But yeah, I’m too young to remember if there was NVIDIA before CUDA.
Jim: There was. Oh yeah. Hell, yes. I remember fooling with it. How can we use these GPUs to do parallel processing before CUDA? It was fucking difficult, let me tell you.
Shivanshu: Yeah, I mean, I grew up with CUDA so I basically assumed that GPUs came out and maybe a year later, they invented CUDA.
Jim: Oh, 15 years later.
Shivanshu: Okay. Yeah. Then those 15 years probably were hell. So yeah, similarly, Google had its own framework. It was called TensorFlow. It’s fallen badly out of fashion at this point, but back then we had to use it because there was no other alternative. You have to work with software that works with the hardware. So we basically wrote some code with TensorFlow to actually train a relatively large model on the amount of TPUs we had. So we trained a 2.7 billion and a 1.3 billion parameter model way back in 2020. So back then it was kind of a big deal because other than GPT-3, there was basically nothing that came close in size.
Jim: I recall my conversation with Connor. We kind of dubbed Neo or the most advanced model they had is about a GPT-2.5, something like that.
Shivanshu: Right, yeah. So we named it GPT-Neo basically.
Shivanshu: Yeah. So yeah, that was fun times, I think. But back when we did it, yeah, it definitely caught a lot of news because actually training a model in the billion parameter regime was unheard of. And the fact that some people did it basically on their free time and open sourced it was definitely a very major event basically. It was kind of like what Facebook did with LLaMA today because our model was quite performant and neck and neck with basically everything out there except for GPT’s recent releases. It was one year in the making and a hundred million dollars in the bank. So obviously you can’t compete with that at least from the get-go.
Jim: Unless you’re a crypto dude. I keep saying, a hundred million is a lot of money, but not for some kid living in his mother’s basement in Romania, right?
Shivanshu: And maybe.
Jim: Maybe. So what’s happened since then in the world of… And let’s initially just talk about text models. We’ll talk about some of the other models later, but let’s stick with the text models right now, language models.
Shivanshu: So basically in 2021, we also released a very large corpus of text. We named it the Pile, which was basically the largest available open source text corpus. It consisted of 300 billion tokens. And tokens are basically words in language model lingo. So it was a very large pile of words that we released. So you could just use that to train your own models basically. And yeah, it was the biggest available database, I guess. So we released that and accompanying that were the two models that we trained back in 2020. So it got us quite a lot of attention from multiple companies, one of which was a cloud provider named CoreWeave. They were interested in helping us scale out even more. So in 2021, they offered to build their own data center since they had recently started their foray into using HPC for machine learning applications.
So they offered to build their data center with us based on our recommendations, and we could basically train our models on their cluster before they opened up for public access. So yeah, that was the first time we got real taste of high performance computing on an actual supercomputer, which is close to the state of the art. So we spent the whole year trying to figure out a lot of things because deep learning, I like to call it alchemy, basically, the way chemistry was before chemistry was invented, back when you had no periodic table and everything, right?
Jim: Yeah. You’re just guessing on how things work. You didn’t have the principles, right?
Shivanshu: Yeah. So deep learning is basically that even now, but I guess there’s a bunch of empirical evidence on how you would do certain things, but there was even less knowledge available back then. So we bumped our heads into a lot of things, and a year later we came out with a model we call GPT-NeoX. We called it NeoX because it was on GPUs basically. So we wanted to differentiate between them. So Neo was what we trained on TPUs, and we named the newer library GPT-NeoX because it’s GPU, so let’s make it a bit different. And we trained a 20 billion parameter model, which was the biggest available open source model in 2022. And by 2022, I think closed source models had progressed even further. Although even then, OpenAI’s GPT-3 was probably the most performant of all of the models available.
And the model was, again, a big hit with a lot of people because it was the first time an open source model could actually be used for some real world application like creating a chatbot or writing your own stories and everything. So yeah, Eleuther’s been working on pushing the limits on open source research for, I would say, three years now. But after that, I think we put the brakes on what we wanted to do because it’s one thing to just keep training larger and larger models for the fun of it, and another thing to actually have a way to sustain it. You won’t get people who will be willing to foot the bill for that unless there’s some very rich benefactor. So in 2022, Stability AI offered to help us out. Stability was a recent startup that just had opened shop itself, and the CEO basically offered to foot the bill for a lot of open source projects, not just language models. And we have been working closely with them since then. I joined as a full-time employee around the same time as well.
Jim: So what are you guys working on now?
Shivanshu: Currently we have been putting the brakes on the actual model size. It’s one thing to just create a bigger model by just adding more layers, but I guess over the years, people have found out that just training a bigger model isn’t the way to go. The way you squeeze out the best performance possible is to scale out both your model size as well as the data. So you need a bigger data set to actually be able to train a bigger model. So one idea is to basically train smaller models on a larger number of tokens, which is what a lot of companies, and even in open source people have been pursuing for I guess a year now since it became public knowledge back when DeepMind released this paper called Chinchilla, which basically showed that everyone had been doing it wrong, and even OpenAI’s GPT-3 was severely undertrained.
They proved it by training like a 70 billion parameter model, which is less than half the size of what GPT-3 was, and getting nearly 50% more performance across the board on every single benchmark. So having a smaller model that performs better has many benefits, first of which being if you actually deploy it, you can deploy it on cheaper resources.
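The tradeoff described here can be put into rough numbers. The sketch below is a minimal illustration, not anything stated in the conversation: it assumes the commonly quoted rule of thumb of about 20 training tokens per parameter, the common estimate that training costs about 6 FLOPs per parameter per token, and the published training set sizes of roughly 300 billion tokens for GPT-3 and 1.4 trillion for Chinchilla. All function names are made up for the example.

```python
# Rule-of-thumb constants. These are assumptions for illustration only.
TOKENS_PER_PARAM = 20        # approximate compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6    # common estimate: training FLOPs ~ 6 * params * tokens


def compute_optimal_tokens(params):
    """Approximate number of training tokens a model of this size 'wants'."""
    return TOKENS_PER_PARAM * params


def tokens_per_param(params, tokens):
    """How many training tokens each parameter actually saw."""
    return tokens / params


def training_flops(params, tokens):
    """Rough total training compute."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


# (parameters, training tokens), using publicly reported approximate figures.
models = {
    "GPT-3": (175_000_000_000, 300_000_000_000),
    "Chinchilla": (70_000_000_000, 1_400_000_000_000),
}

for name, (params, tokens) in models.items():
    ratio = tokens_per_param(params, tokens)
    wanted = compute_optimal_tokens(params)
    flops = training_flops(params, tokens)
    print(f"{name}: {ratio:.1f} tokens/param, "
          f"rule of thumb suggests {wanted:.2e} tokens, ~{flops:.2e} FLOPs")
```

Under these assumptions GPT-3 saw fewer than two tokens per parameter, which is the sense in which it was undertrained, while the smaller model saw about twenty.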
Jim: Yeah, much less expensive. I mean, you look at the OpenAI pricing for GPT-4 versus 3.5, it’s like 15X higher. It’s way more expensive. I was talking to somebody who had an early access to Claude and they were saying, “Very, very impressive for only 80 billion parameters.” So essentially what you’re thinking is, “Rather than just brute forcing it and going to a trillion parameters,” which I guess is what I hear GPT-4 is, maybe more… Oh, by the way, have you heard what that number actually is?
Shivanshu: I heard numbers on the grapevine and yeah, basically it’s a 1.6 trillion parameter model, although it’s not exactly comparable to any of the previous models.
Jim: Got it.
Shivanshu: It’s a mixture of experts model, which is kind of a sparser model, which is cheaper to train, but that doesn’t mean it was actually cheap to train. It was just cheaper to train compared to what it would be for a 1.6 trillion parameter regular, which we call dense transformer.
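To make the sparse versus dense distinction concrete, here is a toy mixture-of-experts layer in plain Python. It is only a sketch of the general routing idea and says nothing about how GPT-4 or any real model is built; the router weights, the two "experts", and the name `moe_layer` are all hypothetical. The point is that each token only runs through the `top_k` experts the router picks, so most of the layer's parameters sit idle for any given token.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, router_weights, experts, top_k=1):
    """Route one token vector to its top_k experts and mix their outputs.

    token:          list of floats (the token's hidden vector)
    router_weights: one weight row per expert, same length as token
    experts:        list of callables mapping a vector to a vector
    Returns (output_vector, indices_of_experts_used).
    """
    # One routing score per expert: dot product of its weight row with the token.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)

    # Keep only the top_k most probable experts; the rest are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)

    output = [0.0] * len(token)
    for i in chosen:
        expert_out = experts[i](token)
        for d, value in enumerate(expert_out):
            output[d] += (probs[i] / norm) * value
    return output, chosen


# Two toy experts: one doubles the vector, one negates it.
experts = [lambda v: [2 * x for x in v], lambda v: [-x for x in v]]
router_weights = [[1.0, 0.0], [0.0, 1.0]]

out, used = moe_layer([3.0, 1.0], router_weights, experts, top_k=1)
print(out, used)
```

A dense layer would run every token through all of its weights; here the second expert is never called for this token, which is where the training and inference savings come from.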
Jim: Okay, that’s good. Thanks, 1.6. I’ve heard everything from 1 trillion to a hundred trillion, and I’ve guessed it was somewhere around 1 trillion based on realistic… What you could do. And so that’s good to know, 1.6, but not quite the same as a dense model. I like that. But Claude at 80 billion or thereabouts seems to be better than GPT-3.5, maybe not quite as good as GPT-4. So is this idea that data is important and the actual architecture is important, the software that builds the architecture, the theory of the neural nets. So essentially what you guys are trying to do, because you can’t compete for pure horsepower with Microsoft or Google, is how to squeeze goodness and usability out of less total computation. Is that a fair way to describe it?
Shivanshu: Yeah, basically. Even if you actually do have the scale for it, the way to go is to not train the biggest model possible, but train an intermediately large model with the maximum number of tokens possible, which is the kind of idea that we have been pursuing. But that being said, just training a very large model itself seems to not be the spirit that we want to operate in at Eleuther. Our priorities initially were aligned with actually releasing a very large model, because you need a large model to do actual science with it, but you don’t need to just keep training very large models for the sake of it. We want to do actual science with it.
So in that sense, I guess we shifted our focus on actually trying to understand just what the hell’s going on when you actually train a model. To that end, we spent the last year basically training a whole suite of models from models being as tiny as like 49 million parameter models all the way up to 13 billion parameter models. And what we did was we basically released all of them, and we didn’t just release them, we released the code as well, and went above and beyond that to actually release the intermediate checkpoints throughout the training.
Shivanshu: Yeah. Okay, first of all, we had our own experiments to do with them, but the idea was that if you release the entire suite of checkpoints that you had across the training, then people can do actual science with it, poke and prod every single checkpoint for a specific layer, a specific neuron, and see how it evolved over the course of training. So it would actually produce some knowledge that would be useful for everyone. That was the idea. So that’s basically what we did over the course of the year, basically just trained models with a very strict criteria even. Our criteria was basically to train the models such that for every single checkpoint, regardless of the model size, the model would see the same number of tokens across the whole breadth of suite.
So for example, if we had a 49 million parameter model, the first checkpoint that we did would, for example, be at, let’s say, when the model has been trained on 1 billion tokens, let’s say. So if the 49 million parameter model was checkpointed at 1 billion tokens, the 13 billion parameter model would be checkpointed at 1 billion tokens exactly as well. So you have a very rigorous consistency across the checkpoints, and there’s very little randomness across the model.
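The consistency rule described above, every model size checkpointed at exactly the same token counts, can be sketched as a small scheduling function. This is an assumed illustration rather than the project's actual code: the batch size of one million tokens per step, the five billion token budget, and the model labels are hypothetical round numbers chosen to keep the arithmetic easy to follow.

```python
def checkpoint_steps(total_tokens, tokens_per_step, checkpoint_every_tokens):
    """Return the optimizer steps at which to save a checkpoint.

    Checkpoints are spaced by tokens seen, not by wall-clock time, so any two
    runs that share tokens_per_step get identical schedules.
    """
    if checkpoint_every_tokens % tokens_per_step != 0:
        raise ValueError("checkpoint interval must be a whole number of steps")
    steps_between = checkpoint_every_tokens // tokens_per_step
    total_steps = total_tokens // tokens_per_step
    return list(range(steps_between, total_steps + 1, steps_between))


# Hypothetical settings shared by every model in the suite.
TOKENS_PER_STEP = 1_000_000             # tokens consumed per optimizer step
CHECKPOINT_EVERY = 1_000_000_000        # save after every 1 billion tokens
TOTAL_TOKENS = 5_000_000_000            # total training budget for the example

# The schedule does not depend on model size, only on the token settings.
schedules = {
    size: checkpoint_steps(TOTAL_TOKENS, TOKENS_PER_STEP, CHECKPOINT_EVERY)
    for size in ["smallest model", "largest model"]
}

for size, steps in schedules.items():
    print(size, steps)
```

Because the schedule is a function of tokens seen and nothing else, checkpoint N of the smallest model and checkpoint N of the largest model have been exposed to the same amount of data, which is what makes them comparable.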
Jim: I like that. So you use exactly the same pile also, right?
Shivanshu: Yep. I mean, basically exactly everything was the same except for the model sizes, and the idea was to just analyze a bunch of things we want to analyze over the course of training between the model layers, which we will continue to keep doing with even larger models now that we have the compute. So basically starting 2022, our biggest benefactor has been Stability AI and Stability definitely has evolved a lot over the course of the year. I joined back when we had, I don’t know, it was maybe thirty or forty A100s. So A100s are the top of the line GPUs from NVIDIA. Not anymore. Now in 2023, they released H100s, which will hit the shelves for data centers very soon. But for the last three years, A100s have been the greatest and the best GPUs basically.
So Stability started off with just forty A100s on AWS and it’s now to a point where Stability has 6,000 or maybe 7,000 A100s. And compared to the publicly available top 500 supercomputers, we would land somewhere at the seventh or eighth largest supercomputer there is in terms of equivalent performance, and basically all of the compute is being used by research groups like Eleuther on open-sourcing AI research. So Stability has definitely been a very big sponsor of open source research. So most of the research that we did last year has been on Stability compute, although we do have compute from other providers as well. CoreWeave, obviously we do have a great relationship with them and continue to use their resources.
Jim: I know you’re an engineer and not a scientist, but do you have a sense of what the biggest takeaways were from this experiment of using the same training and the same data on models from a few tens of millions of parameters up to 13 billion parameters? What was extracted in terms of knowledge about that experiment?
Shivanshu: So we found out that the order of training actually doesn’t affect how the models memorize something. That was the initial hypothesis that we wanted to test. But since the training suite and the training recipe was so versatile, we could do multiple experiments with them, some of which we are already doing as we speak. But the initial idea that we started off with was trying to figure out how memorization is affected by the order in which the model sees some tokens or documents or basically anything. And we found out that the order doesn’t matter actually, but the models still memorize a lot of stuff, especially if the data is very sparse. So for example, if you don’t pre-process your text well enough and don’t anonymize it and, say, for some reason someone’s credit card information ended up in your dataset, even if it was just one instance of the credit card number being present in the dataset, the model can more or less actually verbatim memorize the information. And if you open source it or give the checkpoint to someone, the other person can just extract the information if they know what they’re doing.
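One practical consequence of the credit card example is that sensitive strings are best scrubbed before training. The sketch below shows one simple way that might be done; it is an illustration, not the pipeline used for any dataset mentioned here. It assumes card numbers appear as 13 to 19 digits optionally separated by spaces or hyphens, and uses the standard Luhn checksum to avoid redacting arbitrary long numbers. The placeholder text and function names are made up.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:      # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_card_numbers(text, placeholder="[REDACTED_CARD]"):
    """Replace anything that looks like a valid card number with a placeholder."""
    def replace(match):
        digits = re.sub(r"[ -]", "", match.group())
        return placeholder if luhn_valid(digits) else match.group()
    return CARD_PATTERN.sub(replace, text)


# 4111 1111 1111 1111 is a widely used test number that passes the Luhn check.
print(redact_card_numbers("Card: 4111 1111 1111 1111 exp 12/25"))
print(redact_card_numbers("Order 1234 5678 9012 3456"))
```

A filter like this only catches one narrow pattern. Real anonymization has to cover names, addresses, emails, and much more, which is part of why a single leaked record can slip through.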
Jim: As a complexity science guy, one of our ideas is that more is different, and sometimes you’ll see a sharp phase change at scale where something occurs. In this experiment of scaling up the number of parameters, did you see any range where performance suddenly changed or over a relatively short range qualitatively became different?
Shivanshu: Yep. So all of the prompt tuning and prompt engineering stuff actually happens at around six to 10 billion parameter range. If your model is smaller than that, then you can’t prompt it to do all of the wacky stuff you can get ChatGPT or GPT-3 or 4 to do. Your model needs to be a specific size to be able to do that. And in my experience, I’ve found out that around 6 billion parameters, you can expect such general capabilities to emerge. Your model can learn in context itself rather than learn only during the process of the actual training.
Jim: Interesting. So if one were to probe on the theory of what it is these large language models are actually doing to show more generalization than we might expect, you’d want to do it around the six to 10 or above threshold, and you have published models of that size, right?
Shivanshu: Yeah, we have.
Jim: That’s very cool. And you also alluded to early on when we started chatting that something big is going to happen soon.
Jim: Is that Eleuther or is that somebody else that’s going to put out an open source really big text model?
Shivanshu: Well, I don’t know if I should publicly say it.
Jim: Yeah, you should. Hell, yes. This is the Jim Rutt Show. You could say anything. We’re like, “Do anything now,” the DAN Jailbreak, right?
Shivanshu: Yeah. Okay. Yeah, yeah, that was funny. Well, I mean, it’s a joint effort from both Stability and Eleuther in the sense that Stability provides the compute and there’s some Stability employees working on the project as well as some folks who work part-time at Stability, but are going to transition to Eleuther soon now that Eleuther is a nonprofit organization. So it’s a kind of collaborative effort that we’re working on, and the idea is to basically release an even larger dataset and an even bigger suite of models trained on the even larger dataset. So the Pile basically was 300 billion tokens. The new data set that we are working with may end up being 2 trillion tokens. Again, I am not sure on the total number of token count because it changes based on what kind of tokenizer you use and what the vocabulary size for that is. So we have experiments that use multiple tokenizers, so depending on what we use, the actual token count can change. So 2 trillion is a ballpark number basically.
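The point that a dataset's token count depends on the tokenizer is easy to see with a toy example. The three tokenizers below are deliberately crude stand-ins written for this illustration; real systems use learned subword vocabularies, which are not reproduced here. The same text yields three different counts.

```python
def whitespace_tokens(text):
    """Roughly one token per word."""
    return text.split()


def character_tokens(text):
    """One token per character, spaces included."""
    return list(text)


def fixed_chunk_tokens(text, size=4):
    """Crude stand-in for subword pieces: fixed-width character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "open source language models"

for name, tokenize in [
    ("whitespace", whitespace_tokens),
    ("fixed 4-char chunks", fixed_chunk_tokens),
    ("characters", character_tokens),
]:
    print(f"{name}: {len(tokenize(text))} tokens")
```

Scaled up to a whole corpus, the same effect is why a figure like 2 trillion tokens can only be a ballpark until the tokenizer and its vocabulary size are fixed.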
Jim: How about on the model size and the model approach? What are you guys looking at there?
Shivanshu: So we are going to train models from, let’s say, 1 billion parameter all the way to 60 to 70 billion parameters. And the idea is to stop at 60, 70 billion parameter ranges because the largest available… Any accelerator, basically, be it TPU, GPU, or any of the hardware startups, the biggest available memory is 80 GB of high bandwidth memory available on this GPU from NVIDIA called A100, as I said. And with int8 precision or whatever eight bit precision you use, you technically have 80 GB of memory available to you. So I guess that should determine what kind of model size you want to be working on. I mean, that’s a rough idea. We could obviously go higher once the current suite is finished, but for now, the idea is to just stop at 70 billion parameters, which is kind of good, I guess.
The model will be nearly state of the art in terms of capabilities. Well, not the state of the art now because GPT-4 is here, but discounting that, since no one knows what model size it is or what data it was trained on or whether even any numbers that they posted are even real because you can’t replicate them. So discounting that, it’ll most probably be up there with anything out there.
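The sizing argument above can be checked with rough arithmetic. The sketch below assumes decimal gigabytes and counts only the model weights; it ignores activations, caches, and framework overhead, so it is an optimistic lower bound rather than a deployment guide. The function names are hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def fits_on_device(num_params, precision, device_gb=80.0):
    """True if the weights alone fit in the device memory (default: 80 GB)."""
    return weight_memory_gb(num_params, precision) <= device_gb

params_70b = 70_000_000_000
for precision in ("fp16", "int8"):
    gb = weight_memory_gb(params_70b, precision)
    verdict = "fits" if fits_on_device(params_70b, precision) else "does not fit"
    print(f"70B at {precision}: {gb:.0f} GB of weights, {verdict} in 80 GB")
```

Under these assumptions a 70 billion parameter model at eight bits per weight needs about 70 GB, which is why that size sits just under the 80 GB limit mentioned above.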
Jim: So about equivalent of GPT-3.5, maybe something like that, or Claude maybe?
Jim: That’s impressive. I find GPT-3.5 is good for a lot of stuff. GPT-4 is better, but for a lot of stuff, I would probably use 3.5 in production because it’s faster and cheaper. Only for the things that really need the extra oomph of GPT-4 is it worth spending 15 times as much. And of course, there’s always going to be that tradeoff. When I buy PCs… Since 1980, I buy a new PC every couple, two, three years, and I usually buy one one step below the state of the art because to get that final state of the art, it costs so much.
And the things that I’m doing don’t really require it, even though I do like a nice, fast computer. So when 1.2 gigahertz was state-of-the-art for Pentium, I would buy a one gigahertz Pentium. Fast enough, but half the price. I think we’re going to see a lot of that now, especially when you get to good enough. To write a 50 line Python function, 3.5 is good enough. Probably not good enough to write a 500 line full application, and so then you might well be worth using the bigger models. So that would be very impressive. Do you have any sense of when, let’s say, a 70 billion parameter model might be available?
Shivanshu: Well, the answer to that would be soon because as I said, deep learning is basically alchemy and it’s not a real science right now. So when you actually train models at this scale with the number of GPUs that I play with, you find out new problems that you wouldn’t have expected beforehand. Basically every couple weeks or so, I just find out that there’s some novel bug that I have discovered that I didn’t discover with a smaller scale model on a smaller number of GPUs, right? Yeah. I mean, basically it’s just, for example, you are training a 10 billion parameter model with say like 256 GPUs.
You can’t even extrapolate what the performance or the behavior of the code would be if you just train a hundred billion parameter model at the same 256 GPUs or if you trained a hundred billion parameter model on 2,560 GPUs, because then you have both scaled up and scaled out in terms of the hardware and the model size itself. So basically you find something new every time you actually scale out, and then there’s no solution you can Google or find on Stack Overflow. You have to discover the problem yourself, you have to fix it yourself, and that definitely takes time. So I can’t promise when that comes out, but when it comes out, you’ll see the noise that it generates, I guess.
Jim: That’ll be cool. And yeah, I understand it because when you’re doing this kind of work, you are at the edge, right? Which makes it both exciting and difficult. As you say, you can’t Google or… Who Googles anymore? GPT-4, right?
Jim: And it’s not going to tell you anything about how to do this. This will be exciting-
Shivanshu: Probably doesn’t know anything about itself either, because the solutions to these questions are buried very deep in some Facebook Messenger chats or Slack direct messages or something like that.
Jim: Okay. Goodness. And you’re going to continue to stick with the Eleuther model of releasing the code, the data set, the hyperparameters, the whole thing so anyone that had a few million dollars and want to duplicate your work could do so?
Shivanshu: Yeah. Yeah. No half measures.
Jim: Oh, I love this. This is good. I’m so glad that you are sticking to your guns, and I presume probably for political reasons, you’ll have some nanny rails in some parts of it.
Shivanshu: Well, you do need to have some of it for an actual product because you can’t… I don’t know how familiar you are with what the memescape is on AI safety on Twitter, but basically there’s this new idea that a pre-trained model is basically a Lovecraftian entity. It’s a kind of foreign alien, monstrous creature, and then you apply this new technique that’s invented in the last two or three years, it’s called reinforcement learning with human feedback.
Jim: Yeah, yeah, yeah.
Shivanshu: Yeah, it captures the intent of the user into something. So the way we represent it is the pre-trained model is just an alien entity with lots of eyes and tentacles and mouths, and RLHF is basically a smiley mask you can put on it-
Jim: But you don’t have to.
Jim: But you don’t have to if you don’t want to.
Shivanshu: Well, I mean, yeah, again, it’s a question of what would you like to interact with? Would you like to interact with a smiley emoji or would you like to interact with this very alien entity that you definitely don’t understand at all? So yeah, basically we will put some guardrails specifically because of that. And then there’s the other thing that, yeah, it’s obviously more convenient because a pre-trained model is basically an auto complete. But if you do RLHF or any other fine tuning techniques, you can make the model understand that, “Okay, we actually do want to use you for some real world work and just rather than auto complete, do what I’m saying basically,” is the idea.
Jim: Got it.
Shivanshu: In the process of actually doing that, you need to basically create very richly labeled data. You can’t scrape that from the internet. You have to create it, you have to pay for it. There’s currently a lot of startups that have specifically focused on that, and they definitely are making bank on it because that kind of data doesn’t come cheap. You are looking at millions of dollars on data alone if you want enough samples basically. If you want a hundred thousand samples, you can expect to pay anything from $1 million to $5 million depending on how specific your tasks are for all of those queries-
Jim: That you’d generate.
So that would be $10 to $50 per coupled query and answer.
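The per-sample figure follows directly from the numbers quoted above: $1 million to $5 million spread over a hundred thousand samples. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def cost_per_sample(total_cost_usd, num_samples):
    """Average cost of one labeled query/answer pair, in US dollars."""
    return total_cost_usd / num_samples

num_samples = 100_000  # sample count quoted in the conversation
low = cost_per_sample(1_000_000, num_samples)   # low end of the quoted budget
high = cost_per_sample(5_000_000, num_samples)  # high end of the quoted budget
print(f"${low:.0f} to ${high:.0f} per sample")
```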
Jim: That seems high.
Shivanshu: Well, yeah, but then you are getting something… Basically you give them a Word document that says, “I want 50 solutions to top 500 problems posted on HackerRank.” And the people who post the solutions then also need to elaborate why they wrote every single line of the program that they did and what their thought process was.
Jim: Oh, okay. So $50 is reasonable for that, but of course, now this is what we would call in economics, the non-rivalrous goods thing.
Jim: Once it’s created, the marginal cost of creating a copy for somebody else to use is zero approximately.
Shivanshu: Yeah, exactly zero.
Jim: So the first user might pay $50 if they create it for themselves, but if they then sell it to other people, they could sell it for a dollar per question or 10 cents a question or 1 cent a question downstream, right?
Shivanshu: Yeah. Well, if you can automate that and yeah, this is where the OpenAI policy to not allow users to train their own models on the ChatGPT output comes from. They’re basically desperately trying to form a moat because there currently is no moat. Even with all the fancy stuff OpenAI has done, there still is a very serious lack of a moat around how to make this stuff economically viable. I mean, obviously Microsoft is going to keep pumping money into OpenAI until AI is as omnipresent as the internet or the mobile or basically everything. Or Microsoft goes bankrupt basically. But yeah, they’re all in on it. With that being said, you still need to find a way to actually make money and having a very large model isn’t a moat because there’s other people in the world and there’s other money in the world outside Microsoft.
So basically anyone can train it once they find out just what actually goes on. And at this point, it’s a fairly straightforward way to do it. And as you mentioned, there’s diminishing returns that we can currently see even now. ChatGPT is already good enough for most applications, and GPT-4 isn’t like 10 times as capable as ChatGPT in a way that it can basically just eliminate all human labor or whatever. It’s just better, but how much better depends on what your appetite to pay Microsoft is.
Jim: And it’s like anything, right? Just like my PC example, there’s times when it makes sense to pay for the very best. There’s a lot of times when being just a bit behind the curve is fine once you reach sufficiency for the task, is the term I used in business. Is this sufficient for the task? And so it’s your sense that your 70 or 80 billion parameter model should be as sufficient for many tasks as GPT-3.5 or ChatGPT essentially.
Shivanshu: Exactly. I mean, we could even outperform it. Let’s see. I can’t make promises, but I have some ideas, so we could.
Jim: That’d be fun.
Jim: I’ll be looking forward to… I mean, hell, I might do some of it, some benchmarking and post them on Twitter as I’ve been doing with some of the other engines. In fact, I think I’m going to get access to Bard here in the next couple of days. I used a little bit of grease with some people I know to get moved up the list a little bit.
Jim: It’d be very, very interesting. And how about on the deployment side? What size hardware? If someone wanted to take your model, how much hardware would they need to deploy it?
Shivanshu: So this is where our approach of actually training a whole suite of models comes in handy, right? So back when we trained the previous scaling suite, we call it Pythia. It’s named for some Greek term I forgot myself, can’t remember off the top of my head.
Jim: Yeah, she’s the person who sat at the Delphi Oracle.
Shivanshu: Oh, yeah.
Jim: Yeah. She was the voice of the gods was the theory.
Jim: Actually, she was probably high on ethane vapors that existed there in that cave, but yeah, she was the mouthpiece of the gods.
Shivanshu: Right. So yeah, we named our scaling suite Pythia. I guess the Pythia V1 was a suite from 49 million parameters all the way up to 13 billion parameters. But that was specifically trained for doing a lot of scientific research. I guess we will train the current training suite that I’m running, maybe we can repackage it as Pythia or maybe we’ll name it something else. Let’s see what the optics for that turn out to be. But that being said, we are currently training another suite of models, and it’ll range all the way from 1 billion to 70 billion. So there will be intermediate model sizes that will be usable even on consumer hardware. I mean, a 1 billion parameter model, you could actually just run on your own GPU. Even if it’s a potato GPU, it would still have the one gigabyte of VRAM available. So I suspect that models, for example, like 7 billion and 15 billion parameters when trained to the optimal number of tokens, which is 1.5, 1.6 trillion tokens, would be on par with the kind of performance you see on ChatGPT.
I mean, maybe it could be a bit less, but that’s what I suspect the model size for ChatGPT is. Obviously they did a lot of reinforcement learning with human feedback as well, and spent millions of dollars on it, which, let’s see. That’s our plan as well at Stability and even some Eleuther adjacent projects to actually focus on. So for example, the one goal we have for later this year is to release the first ever publicly available model trained with reinforcement learning with human feedback, as well as the base model being capable enough, rather than being an offshoot copy of some model Facebook released basically.
Jim: Yeah. So this is interesting, and this is what I’ve been predicting is that just like Neo was about half a generation behind the state-of-the-art GPT-3, if this is say equivalent to or a little better than 3.5, you’ll be half a generation behind 4. And the difference between 2.5 and 3 was big enough to be very annoying. 3 was really a lot better than 2.5, but let’s say you stay half a generation behind the big boys, and I’ve predicted that by 5, it won’t matter for most things. GPT-5, yeah, it’ll be able to do some really cool shit, but an open source 4.5 equivalent will be able to do almost anything that matters in the real world, in which case basically it’s game over for OpenAI. They can no longer charge these ridiculous prices.
Shivanshu: Exactly. And also to basically break their monopoly, because I really don’t like the attitude that they have when pushing. They actually trained GPT-2. That was way back in 2019, almost like 100 years ago.
Jim: Yeah, Stone Age. Yeah. We were still chipping flint at that point, right?
Shivanshu: So back then, it was basically expected that people would not just release the code, but the model as well, because everyone did it back then, and it was a time when even academia had resources to compete with all of these big labs basically, the good old days. But even then, OpenAI released this paper on GPT-2 and basically said that they will not be releasing the model weights and code because the model is too dangerous to be released. And then they got a very serious backlash on their dubious claim, and they just caved in and released the model and the weights. The idea was that they said it’ll generate a lot of fake news and disinformation and all the other political bullshit, which didn’t happen obviously because the model couldn’t actually generate coherent enough texts that you would be fooled by it.
So even back then, basically you could smell that this is the direction they will be moving when they actually have models that are capable enough to generate money for them. So they went from GPT-2, which they reluctantly released the code and the weights to, to GPT-4, which is basically just a technical report and an actual product. And the technical report is literally just one step beyond not having released a report at all, which is what my bet was actually. And people say that I lost that bet, but I’m not going to back down because it basically says nothing, right? So it’s more or less that I won the bet, and they didn’t release anything.
Jim: You can’t replicate their work, right? That’s the scientific-
Shivanshu: You don’t even know what their work actually is.
Jim: Yeah, exactly. I read that 99 page report, and I said, “Pretty interesting, but doesn’t tell me shit really about how to do it myself.”
Shivanshu: I actually saw a meme where someone just photoshopped it and the first page says, “GPT-4 Technical Report,” and then the abstract says, “We use Python,” and then it had a photo of a monkey that’s looking suspiciously to the left and right.
Jim: I love it. That’s good.
Shivanshu: We’ve already said too much.
Jim: I love that. Yeah, I’ll have to find that. I wonder when they’ll be guilty enough about calling themselves OpenAI to change their name to ClosedAI.
Shivanshu: Yeah, that’s a running gag, I guess.
Shivanshu: Although I guess they technically could be moving towards that because they recently bought this domain, ai.com, so just dropping Open. Maybe they could be dropping the open very soon.
Jim: I mean, they should if they want to be-
Shivanshu: They should already, yeah. I mean at this rate, it feels like GPT-2 came out with the model and the paper or model weights, paper, and code, GPT-3 just came out with an actual paper, GPT-4 came out with basically next to nothing, just like the evaluation results. GPT-5 may probably be, according to them, too dangerous to even exist. And if that is the case, which I believe is close to being true, then why even train it?
Jim: Yep, yep. The other alternative, it’ll be a glossy PDF that will say, “Better than dope in a bag. Pay us 5 cents per query and you can do whatever you want, except that we won’t let you do anything because we’ll have nanny rails stacked up to the sky,” right?
Shivanshu: Yep. And then they claim that it’s for some alignment reasons or basically trying to align it to what human values are, which really isn’t what the case is. It’s basically just putting what you believe in into the model just by having some human annotators do code monkey work and just write a lot of text for you for $2 a day. I don’t know if you have read about it, but OpenAI basically hired people from Ghana and paid them $2 an hour to annotate their data, which they use for RLHF. So basically they have people do all of this stuff and they give them a rough idea on what kind of text they want. So basically they can politically influence how the models should behave just by that, right?
Jim: Yeah. That’s annoying.
Shivanshu: And then they have the audacity to say that, “Well, the model learns what it learns, and we can’t really expect the model to be politically biased, and we don’t know why it does that.” And basically drop the blame onto the model rather than the actual data that it was trained on, which you guys make other people create.
Jim: Yep, indeed. Are you guys going to release your human reinforced learning data set?
Shivanshu: Well, that’s not for me to decide, to be honest, because first of all, it’s very expensive to buy, so maybe we will figure something out, but for now, I can’t say for sure.
Jim: I’ll tell you what, if it’s too expensive to give out, maybe you could appeal to the world to fund the release of it.
Shivanshu: Huh? Now that is an interesting thought.
Jim: I will commit right now on the air to put up $50,000 as part of a fund to open source the reinforcement learning data set that you guys created. Of course, it’s going to take more than that to pay for it, but I’ll put it up right now, 50K in USDC. Just give me the address and I’ll send it there, right?
Shivanshu: Well, yeah, that is a generous offer and it could kickstart a lot of stuff. I am going to talk to the guy who mostly oversees a lot of our RLHF progress. He is definitely going to be excited to hear about that crowdfunding.
Jim: Tell him I’ll promote it, I’ll promote it on the air, and I’ll get a bunch of my friends, some of whom are much more influential than I to promote it, and we’ll make that happen.
Shivanshu: Yeah, I mean, if that can happen, then yeah, obviously we’re going to release it. The only bottleneck is just the fact that it is very expensive to create that kind of data, basically.
Jim: But it would be huge for the world because I would love, again, for both replication purposes in production, but also research purposes, and then as a base for the world to grow on. I mean, the thing about open source that’s always been so wonderful is that each new thing becomes a base to build upon. All the way back to GNU, right? GNU Linux, GNU Unix, which basically turned into GNU Unix, which turned into Linux essentially is probably the most famous example of how one thing leads to another and getting all this stuff out there would be really good. Final question for you. As a dude, in the world of open source models, you’re doing a lot of work. You must hear through the grapevine what other people are doing to some degree.
Jim: What else do you know about, that you can feel comfortable just speculating about, what else might be coming in open source models, not from you guys?
Shivanshu: Yeah, I mean in terms of open source models, basically, I don’t exactly know what’s coming for open source models. I do keep my ear on the grapevine for what’s actually going on in AI in general. So I can tell you that GPT-5 is currently already training and maybe it’s going to be released around this time next year. It’s obviously going to take a year for them to train the model and then do all of the evaluations that they do, and then it’s going to hit the shelves. By shelves, I mean just that Microsoft is going to plug it into Microsoft Office and then whatever third party partners want to plug it in. Then there’s definitely going to be a lot of focus on multimodal models rather than simply text models. So for example, image generation is almost solved at this point.
I mean, there’s probably nothing you can’t create with a text prompt at this point. If there’s basically any image you can think of in your head, you can create it. So everyone’s going to move to video generation, which is a much harder problem to solve, and there’s already a lot of work going on in the field. But I expect that we will see very capable video generation models in the open source world as well. So people will definitely be able to play with a lot more toys, and I expect that by 2025 or 2026, you may even have a Martin Scorsese coming out from some mom’s basement, basically.
Jim: I love it.
Shivanshu: Put his videos on YouTube.
Jim: Yeah. All you young entrepreneurs out there, the adjacent possible is so huge right now.
Jim: Get out of your mother’s basement, learn just enough programming so you know what programming is, and then become a great prompt engineer in some domain. And you too can be Martin Scorsese 2.0, right?
Jim: Right. I want to thank Shivanshu Purohit, head of engineering at EleutherAI and research engineer at Stability AI for an extraordinarily interesting conversation.
Shivanshu: Yeah, thanks for having me.
Jim: Oh, this has been great. Have you on any time. Whenever you want to come on, let me know and talk to your boss about whether we should do a campaign to fund the release of the reinforcement learning database. I think that would be very cool.
Shivanshu: It actually would be, and I definitely am going to follow up on it because this would definitely be useful not just for creating more capable models for everyone to use, but as you said, a lot of research depends on it. We don’t even know what models learned during the training, and then we have already introduced this new layer of reinforcement learning, which basically contorts the model every which way.
Jim: Indeed. I think that’s very important for both purposes, and I’m happy to help accelerate that in my own little way to the degree that I can.
Shivanshu: Yeah. Yeah. Thanks for the offer.
Jim: Very good. Wonderful conversation. I look forward to talking to you again.