The following is a rough transcript which has not been revised by The Jim Rutt Show or John Ash. Please check with us before using any quotations from this transcript. Thank you.
Jim: Today’s guest is John Ash. He’s an artist and he just calls himself, or at least has been known to call himself online, a Qualia Engineer. He produces works of art across multiple modalities, including at least at a minimum, paintings and things of that sort, image art and also music. And while I was plunking around today getting ready for this call, I happened to stumble upon one of his albums, which I actually liked quite a bit. It’s called Strange Hymns. I found it on Amazon, but I think it’s also available on at least Apple Music, so check it out. Anyway, welcome John Ash.
John: Hello. I’m so excited and grateful to be here and I have so many thoughts on my mind already.
Jim: I’m looking forward to this conversation. I kind of, as I do sometimes, I sort of pick John out from a flow of tweets and engage with him a little bit. We’ve argued a little bit, all fun. And I said I’d like to talk to this guy and in particular I target him for one thing I’ve become interested in and I haven’t really had a show dedicated to it, and that’s the new recently emergent AI art generating tools. Things like DALL·E, Midjourney, Stable Diffusion of course, and the new, I don’t know if it’s actually released yet, Unstable Diffusion and who knows what else is coming.
I play with them a little bit, but as I’ve said on the podcast before I have no artistic ability at all in the visual realm and so I produced various and sundry things, but I’m sure they would all be considered to be quite terrible by any objective measure. So I thought I’d bring on someone who does seem to have some serious talent and also seems to be a deep thinker and has played with these tools a fair bit.
And so with that intro, John, why don’t you tell us a little bit very briefly, your history with the tools and your thoughts so far on what they represent?
John: Well, I’m a machine learning engineer. I worked in natural language processing for the last 10 years, and about 5 years ago I just sort of saw the writing on the wall of where generative models were going. Essentially, the most common trope in television regarding AI is that the last thing that it would be able to do was create art, was to create language or meaning. And for most people in the field who were seeing generative models form really low resolution versions of art, we understood that to be not true.
So I sort of made this decision where I had secured enough autonomy through saving to be able to step away from putting all of my consciousness, focus, and effort towards making money for a corporation, and started to focus my effort towards using these emerging tools to communicate complexity.
So it’s not just that these new art models have suddenly appeared it’s been coming for a long time, and all of the pieces were all sort of slowly emerging until somebody finally put them all together. And the resulting chaos of bringing it to the people is playing out in real time right now. I have been playing with any AI art tool that I can find since way, way, way, way back. I mean the first and most foundational aspect of these models is the capacity to cut them into parts. And you know what a convolutional neural net is, correct? What a U-Net is, right?
Jim: Yeah. Well, I would say for our audience, if you could give a very much a layman’s take on what these things are, that would be very helpful. If you can do it without using the word convolutional neural network that would be good.
John: Yeah. The ones that are really successful recently are removing noise from noise in a way that you tell it to. So you kind of tell it what isn’t noise and then it tries to remove anything from the image that is not your definition of noise. The training process uses literal noise.
Jim: And it also uses vast repositories of examples. How many images approximately would be in behind something like DALL·E 2?
John: I don’t know. I have learned recently that, well, not recently, is that the training mechanisms in the way that you’re using the data is a lot more valuable now than just amassing lots and lots of data. We have lots of data, but if you look at something like a GAN or Stable Diffusion, the way that it is actually approaching the generation, both of them are very clever in their unsupervised use of the data, meaning that they’re generating new data from the data that is there. It is adding noise at differing amounts. And so one image itself is many different training examples.
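A minimal sketch of the point John makes here, that one image becomes many training examples once noise is added at differing amounts. The 4-pixel "image", the `keep_levels` schedule, and the function names are illustrative assumptions, not anything taken from DALL·E or Stable Diffusion.

```python
import math
import random

def add_noise(pixels, keep, rng):
    """Blend an image with Gaussian noise.

    keep is the fraction of signal variance retained:
    1.0 returns the clean image, 0.0 returns pure noise.
    """
    signal_scale = math.sqrt(keep)
    noise_scale = math.sqrt(1.0 - keep)
    return [signal_scale * p + noise_scale * rng.gauss(0.0, 1.0) for p in pixels]

def training_examples(pixels, keep_levels, seed=0):
    """Turn one image into several (noisy image, keep level) pairs."""
    rng = random.Random(seed)
    return [(add_noise(pixels, keep, rng), keep) for keep in keep_levels]

image = [0.0, 0.5, 1.0, 0.5]       # a toy 4-pixel "image" (assumption)
levels = [0.99, 0.9, 0.5, 0.1]     # hypothetical noise schedule
examples = training_examples(image, levels)
print(len(examples), "training examples from one image")
```

Each pair gives a model something to practice on: given the noisy version and how noisy it is, recover what was added.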
Jim: It’s interesting. I suspected as much, it’s funny. I said that ought to be possible. Let’s use the generative adversarial network approach and then probably have some scoring mechanism that you throw ones out that don’t score as sort of reasonable or something. But as you say, it does allow you to multiply your dataset quite far if you’re willing to throw enough computation at it. So I think that’s probably enough for our audience. Lots of data in, AI, machine learning tricks, and then I guess the other thing I would put on it is it produces a static representation of all the data that has been processed into it, at least today they’re all static. And when one throws a prompt against it, it uses that as a way to, as you say, pull its knowledge of all art in its database and the words associated with those images. Let’s not go into the detail of how it starts with a noisy image and de-noises it. But it essentially is able to extract from this deep static pattern something surprisingly close in most cases to the English language words you prompted it with.
John: Sure. Let’s add another metaphor. Imagine you’re looking up at the sky and you see some clouds. Clouds are kind of like Gaussian noise. Sometimes you’ll see things in the clouds. The model you might be saying is search out things that look like this in those clouds and change them to look more like that thing and then do that again in passes.
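The cloud metaphor can be sketched as a toy loop: start from Gaussian noise and, in repeated passes, change it slightly to look more like a pattern. This is only an illustration. The `nudge_toward` function is a hypothetical stand-in that is handed the target directly, whereas a real diffusion model uses a trained neural network to decide what to change.

```python
import random

def nudge_toward(current, pattern, rate):
    """Stand-in for a learned denoiser: move each value a little toward the pattern."""
    return [c + rate * (p - c) for c, p in zip(current, pattern)]

def distance(a, b):
    """Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

rng = random.Random(0)
pattern = [0.0, 1.0, 1.0, 0.0]                     # the "thing" we are told to look for
clouds = [rng.gauss(0.0, 1.0) for _ in pattern]    # start from pure Gaussian noise

image = clouds
for _ in range(30):                                # make the small change again, in passes
    image = nudge_toward(image, pattern, rate=0.2)

print(distance(clouds, pattern), "->", distance(image, pattern))
```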
Jim: Yeah, that’s the concept. So then let’s get back on your personal journey in interacting with things. What was the first one you ran into? What was your reaction to it?
John: The first thing that we had available was style transfer, and the first useful iteration of that I think was Prisma in about 2016. Style transfer took individual images and captured the essence of their paint strokes in it and then would reinterpolate an image with those paint strokes. So the cover of Strange Hymns is one of those first AI practices. They were only 480 by 480 pixels, so it wasn’t enough to do the whole thing. But what I did is I rendered a concentric wave and then I sent that to the Great Wave, that painting of the Great Wave, the Japanese painting, and it deconstructed the paint strokes from that and then reconstructed that image with those paint strokes. Then, I took that 480 by 480 pixel thing and I gave it to a human and I said, reinterpret that through your mind. And they gave me a much higher resolution version.
Now, I don’t necessarily need a human. I could just load up Topaz Gigapixel AI, and I can take something that’s 1,000 pixels to a full wall image. It’s still a creative process to get it to do it right, but we’ve really rapidly increased our capacity to manipulate patterns using these tools.
Jim: And what did that feel like? Had you been a visual artist previously? I know the oldest stuff I can find of yours is audio.
John: It feels integrative to the extent that meditation is. That I think we all come with some specialization in our brain and just by our lives we attend to different parts of things. Some tried to push their bodies, some tried to use their left hand even though it doesn’t naturally come to them. So when I started to be able to integrate the images that were in my mind and see them outside of myself, that started to integrate my understanding of the world, which is why I think it’s a really healthy process for people who don’t make art to just play with them. Because you might have something you want to say that you can’t say with words and now you can. And that’s how I’ve been using it recently, is to construct new memes or complex things that I’ve struggled to say. Like, generating a sign that says truth sale in front of a shop.
I’m trying to say to people this idea that we need to pursue knowledge independently of knowledge that makes profit, and when I say that it makes people angry for some reason. They think I’m attacking money foundationally and saying, we have to get rid of money. And I’m saying we are really focused on profit; if we can be focused on knowledge, and in addition knowledge that doesn’t have any sugar in it, just knowledge for the sake of knowledge, then we sort of rebalance our society. I’ve had trouble saying that, and when I say it in words, people often don’t really get it, but when I started to be able to generate images and then start talking at the same time, they see the image. And so there’s a visual association with the semantics, and suddenly they are able to understand this complex idea that I’ve been holding in for five years that I could never get out before. So it feels like growth.
Jim: Of course, the other thing is it is a transformer of sorts. It takes a higher order thing that’s in your brain, which first goes through a lower level resolution when it gets turned into an utterance, let’s say language. And it was interesting and we don’t know how the brain actually creates our utterances from our pre-utterance model, but it’s doing something quite interesting and does it very rapidly. And then you take that utterance and turn it into art. And again, that’s a transformer. It’s got to be lossy to some degree. And yet, as you say, in some ways it allows you to communicate to people you wouldn’t otherwise be able to communicate to. What’s your thinking about that lossy chain that ends up nonetheless being quite valuable?
John: The lossy chain is a funny example because the way that Stable Diffusion models work is a Markov chain of noise. It’s literally a Markov trace where the steps are for de-noising.
My work in cognicism was really based in information theory and the notion of trying to send a signal over a channel with conversation or with making art. Though often when people are making art they don’t want to particularly explain it, whereas when you’re communicating, people are much more willing to explain it. It has enabled me to see how much loss there is. How much in conversation when I talk to somebody, they’re not really talking to me, they’re talking to their neocortex and a simulation of me that is formed from parts like the little convolutional filters that are learned by the image models. They construct a version of me and render a version of me and interpret my words through that version of me. I see that very clearly now when I interrelate with a lot of people, because I have these concepts of the latent space, because I have these analogs in machine learning that do similar things to what the human brain is doing, although through different mechanisms. I can analogize those things. I can see the error and miscommunication and I can pause and question the noise, question the noise in the transmission between two people.
I communicate in this way. I can get a discrete thought in my mind, a discrete packet of meaning, it’s a truth and then I interpolate it into many different forms. And I’ll keep saying it in different ways to the person, and I’ll ask them to rephrase it to me and say in their own words. And then I use my feelings to feel if that reflects it. And through that iterative process we de-noise the channel of communication in a similar way.
Jim: Now of course, there’s something very similar when you use one of these art models or a text generator like ChatGPT, which is that generally speaking, at least for me, the first prompt to the system isn’t the right one. Let’s say you have a vision in your head, you put the first prompt out, I want a picture of a golden retriever running across a meadow with mountains and pine trees behind them. And the first one is like, blah. Then you say, and with a stream in the middle, I want the dog closer to the stream and I want the meadow to be no more than 30% of the depth of field or something. So anyway, iterate on this thing. Talk a little bit about how you feel, what’s that like and is it similar to this process you just described of iteration with a human when you’re trying to do the same thing?
John: Yeah, because I view myself as sort of a dual being that has different modes of attention and the way that my attention in my right brain works is very different than the mode of attention in my left brain. And so the process of creating art is a process of a conversation between the editor and the actor in me, that the person who allows things to happen and the sense of inspiration that I get. And when I create art of any type there’s a feeling that arises, there’s something to be said. There is an iterative process in bringing it into life depending on the complexity of it of, I put it out there and I get a feeling that says, no, that’s not what was trying to be communicated. So there’s sort of a conversation with myself. When I study the greatest artists that I know, they communicate about the same way that they’re forming a conversation with themselves.
So I liken it to, I’ll say two things. One, you know what stochastic gradient descent is? The approaching and diminishing of error, talking about basins that you’re sort of finding the minima or maxima of. And you’ve heard the game hunt the thimble, it’s just the hot cold game. You’re getting hotter, you’re getting colder. Hunt the thimble is like a literal thing. You put the thimble in a location, it’s a party game and you’re limiting your ways of communicating, you’re cutting off something so that you strengthen the other connections.
So, I view the process of communicating within myself and integrating ideas very similar to the way that is external. It’s just a lot easier inside because it’s literally the synapses go right together. It’s right there. There’s no gap between that. So it’s like I am one, I get to be just one because it’s right there. But I do notice that I can talk to myself back and forth. I do note that when I create a piece of art, there’s an editor in me that says that’s good or bad. I do note that to perceive hot I also need cold, to perceive light I also need dark. So in my relationship with myself, I am discovering a similar relationship of how I communicate with others. And now that I have these machine learning tools for integrating my capacity to communicate with myself, I’m seeing a process slowly but surely, where the people that I really respect, like you and Jordan and John Vervaeke and everybody who I felt like could not understand what I was saying, no matter how I said it, I just kept feeling misunderstood.
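The "hotter, colder" picture can be sketched with plain gradient descent on a one-dimensional error. This is the simple, non-stochastic version, used here as an illustration; the hidden `target` (the thimble), the step size, and the function names are assumptions made for the example.

```python
def error(guess, target):
    """Squared distance from the hidden target: lower means "hotter"."""
    return (guess - target) ** 2

def gradient(guess, target):
    """Slope of the error with respect to the guess."""
    return 2.0 * (guess - target)

def descend(start, target, step_size=0.1, steps=50):
    """Repeatedly step downhill and record the error after each step."""
    guess = start
    history = [error(guess, target)]
    for _ in range(steps):
        guess -= step_size * gradient(guess, target)
        history.append(error(guess, target))
    return guess, history

final_guess, errors = descend(start=10.0, target=3.0)
print(final_guess, errors[0], errors[-1])
```

The stochastic variant used to train neural networks estimates the gradient from a random batch of examples at each step instead of computing it exactly.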
In your reaction to the initial posting of the Cognicist Manifesto, I felt like you had misinterpreted it so dramatically that I thought it was intentional. I thought you were setting me up to be able to talk. I seriously considered that you were just being really nice and being a 4D chess player and being like, here’s a setup for you to communicate these ideas. I have learned through using these tools to be more patient and to interpolate more. If they do not get it, I need to breathe and I need to just pause and acknowledge that they might not understand.
The struggle that I’m getting right now is that when you tell people that they don’t understand, they don’t like it. And so if you keep pointing at the interpolated form of the thought in different ways and say it’s the same thing, there’s an emotional thing that’s happening, there’s a free energy computational thing that’s happening that makes meaning transmission hard. We have to be receptive to meaning. And so if I tell somebody that they don’t understand they sort of shut down. So there’s still some things for me to figure out but the more of these tools that come out, the more I’m able to speak. I don’t know if you saw the Atheus and the Golden Braid AI actor video?
John: Okay. We have now created latent spaces of performance, vocal performance from narrated books. There’s a massive story reading corpus of people just performing the words and their breaths and their emotions. And we can learn a mapping between the semantics and the words and the rise and the fall of the emotional performance that the person gives. So I would encourage anybody to go listen to the Tale of Atheus. It’s on my YouTube under just like AI actor performs something like this. It sounds so much like a human, and I left little errors in it intentionally so people would be able to guess that it wasn’t. But it sounds so much like a human. It’s so emotionally impactful. It’s so connective that I’m able to say something with a fake voice that I can’t say on my own. And it’s like this British dude, he’s very, very proper and I was doing a really bad lower British, it’s like in my brain… Whatever. But the point being is that I can now use these voices to say, there’s this story I’ve written, express it with emotional valence of this person who has trained and refined their capacity to communicate through the prosody of performance.
Jim: And so is that a tool you created or did you use an off-the-shelf tool to do that?
John: Play.ht has these ultra realistic voices that they just debuted. They are not calling them actors, they’re not calling them performance, but that is what they are. You can literally just give performance directions. You can say the next thing this way and it will.
Jim: Wow. That’s worth checking out. I want to jump back a little bit here because we had a couple of analogies, so I think we missed one in the middle. You talked about having an idea in your head and trying to communicate it with somebody and then having some iteration as you tried to get the image in their head to be at least a decent lower res version of the image in your head. And we talked briefly about prompt engineering, I think is the term that’s come around or iterative prompt engineering. And then actually let me add a third one because of your history as a creator of musical art, presumably you are a practitioner of the art of songwriting. I have by chance gotten to know some songwriters over the years and the really good ones every once in a while they just drill one. But sometimes they’ll work on a song for 20 years iteratively. In that case they’re working with themselves, but they’ll sit there at the piano or with their guitar, whatever their preferred medium for exploring songwriting, which is sometimes not the same as their preferred medium for performance and they’ll iterate with themselves.
Do you see all three of those, converging with another person, prompt engineering and let’s say songwriting as examples of a higher order taxon?
John: Yes. It’s very much that: a channel of meaning with loss in it. And let’s start with the prompt engineering, that is an art that a lot of people are into where it’s like the final image is just one singular prompt. It’s really fun when that happens. But I’ve started to bring in both DALL·E and Stable Diffusion where you can take the structure forms from one latent space, because DALL·E creates very clear forms that are very basic and sometimes almost childlike in their painting. And then you can take that almost as a heat map for the different parts, feed it into Stable Diffusion, do in-painting or out-painting and just refine that way too.
My new, more common method is that once I find a basic prompt that is pretty good, I can do in-painting, add new features, I can mutate it in a direction towards a particular style. And it all comes down to the relationship between the output of what I’m saying and my intuition, the feeling that I have about whether it was done right.
There are multiple ways as a songwriter that I get songs. One is you get it as a full shot thing that happens to me. I’ll wake up in the morning usually when my left self has been turned off a really long time. I wake up, pick up the guitar, press play. It’s as if it was written in my dream, I didn’t have the dream but it’s just all there. Those are often described by people like Elton John or John Mayer to be the ones that become classics or the ones that resonate most with people.
However, there’s another type of song or type of art which is you’re pushing the form of complexity. You’re trying to bring in a number of new patterns that you are unfamiliar with, so it is impossible to channel it in one go. There is a song by Paul Simon, I can’t remember exactly the name of it and I won’t be able to find it quickly, but it is just a dramatic number of chord changes within it. And he’s talking on this show about his writing process in the exact same way that he says there are some songs that I get that are as if they are from the muse, but at a certain point certain types of songs take a very long time to channel through because of their complexity. And you can channel more complexity too if you train your body like a jazz person to always let it be streaming through you.
But to me a big part of the creative process is noticing signal in noise. There’s some pattern, I’m just sitting, I see some pattern and it looks like something to me is there. And you train yourself to be willing to be impulsive enough to chase that when it is not necessarily a logical thing to do. Now it seems more and more logical, because I sort of explained it seems like chasing these representations helps people communicate better.
Jim: Well, let’s bring that back to prompt engineering. Let’s keep it the simple case where you’re just iterating with a single tool, whether it’s Midjourney or DALL·E or what have you. What does that feel like? Does it feel like you’re talking to yourself or does it feel like you’re talking to somebody else or some entirely new category?
John: It feels like I’m trying to simulate the latent space that the model has. It’s like I’m trying to imagine the relationality between the patterns that are embedded in the space. A latent space, that static space that you described that things are drawn from, it places patterns near each other that are semantically related so that when you draw something from that space, between two points it seems like it’s in between those things. We can’t see that space, it’s called latent for a reason. So the iterative process is trying different words and seeing how them in relation to each other affects the output and trying to visualize in my mind, how are the patterns organized in this hidden space. And then once I have that in my mind then I can generate something easier.
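A minimal sketch of "drawing something between two points" in a latent space, using straight-line interpolation between two vectors. The three-dimensional vectors and their labels are made up for illustration; real latent spaces have many more dimensions, and their coordinates are learned rather than hand-written.

```python
def lerp(a, b, t):
    """Linear interpolation: t=0.0 returns a, t=1.0 returns b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

# Hypothetical latent vectors for two concepts.
cat = [1.0, 0.0, 0.2]
dog = [0.0, 1.0, 0.4]

midpoint = lerp(cat, dog, 0.5)                     # halfway between the two points
path = [lerp(cat, dog, i / 4) for i in range(5)]   # five evenly spaced stops
print(midpoint)
```

Decoding each stop along `path` with an image model is what produces the familiar morphing sequences between two prompts.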
Jim: And just for the audience’s sake, how many times might you iterate if, again, let’s stick with the simple case of working in one tool before you find a piece of art that you think is worthy of sharing with the world? A gross number or a range even. How many iterations of prompts do you go through to get what’s in your head out?
John: One to insanity. When you’re inspired you’re like boom, got it. Boom, got it. Boom, got it. Boom, got it. When you’re fighting yourself, you are just like, you feel like there’s something in your head, the model will not output it. And there’s a trick to do that for the listeners, the CFG scale, which is the classifier-free guidance, which is the guidance at the rendering step. If you increase that, then if the pattern is less common perhaps in that latent space, you can increase the likelihood that it’s going to be manifested in your output. But, sometimes you eventually have to give up and say this pattern’s not in here. I tried to render a Klein bottle for Matt Pirkowski in any form and it’s just not a consistent enough pattern to learn and I would’ve had to introduce a new label, which you can do. You can just get a bunch of Klein bottles and say Klein bottle as one word and then it will be able to bring that into the latent space. We don’t have easy tools for that yet, but we will in the future.
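The CFG scale can be sketched with the usual classifier-free guidance combination: the model predicts noise once without the prompt and once with it, and the scale controls how far the result is pushed in the prompt's direction. The two-element predictions below are made-up numbers standing in for a real model's output.

```python
def guided_prediction(uncond, cond, cfg_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    cfg_scale = 0.0 ignores the prompt, 1.0 uses the conditioned
    prediction as-is, and larger values exaggerate the prompt's effect.
    """
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0]   # hypothetical prediction with an empty prompt
cond = [1.0, 3.0]     # hypothetical prediction with the user's prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, guided_prediction(uncond, cond, scale))
```

Raising the scale amplifies whatever difference the prompt makes, which is why a rarely seen pattern is more likely to show up, usually at some cost to variety.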
Jim: Indeed. Yes. So yeah, one to crazy, that makes sense. Again, my limited playing with it, seldom for me is at one, being not a naturally visually artistically inclined person, but sometimes it’s two or three.
John: Put Funkasaurus Rex painting into DALL·E, every Funkasaurus Rex is amazing. There are some phrases where it’s like how is this always there in latent space?
Jim: And then other times I’ll spend 20, but probably no more than 20 and I’ll just say, all right, it’s not there or I’m not willing to chase it that far. So it’s good to hear that there are some people that go well beyond that even.
Another very interesting thing to me in these art tools is how easily and reasonably well they do stylistic interpolation. So you can say, give me the picture of a dragon from Game of Thrones in the style of Cézanne or something. It’ll do it. Talk a little bit about that. The fact that it’s captured style by name and is able to interpolate a prompt that you give into that style.
John: I mean well, we can point to the thing that it tends to be generally bad at, which is changing colors as being a distinct thing from changing the structural patterns. And if you think about it’s a de-noising model so it’s going to basically form structure of patterns from the noise. And, most of the diffusion models are functioning at this pixel level. The key thing that’s different with Stable Diffusion is they use a latent space, not on the actual pixels. They’re using a hidden representation adding noise in that space.
So there’s two halves of how this works. One is the diffusion process, which is just that they’re un-noising images. The second is how that is being guided and it’s being guided by different representations of text. One way that you do that is you just throw the word embeddings, you just concatenate them with the image representation and then you feed that into the generator. But it’s not as good as this other process where we have a secondary model which is giving guidance at the time to form the output.
So there’s two places where there’s associations between language and patterns being learned. One is the actual diffusion model because it does have the language in that space and one is the secondary model, which is guiding the output. The secondary model, the CLIP model that seems to do really well, is a transformer model. Similar to GPT, but it’s basically trained associations between… It’s scanned the whole internet, it got all the image tags and all the alt tags, and then it’s creating this associative map and trying to find the associations between those things. And then it cuts that off and it’s got this mapping in this transformer between language patterns and visual patterns, and then it says, hey, you’re trying to de-noise it, I’m going to condition it on this representation of meaning, which is in the parameters, and then it moves it in that direction in successive passes.
One thing that we should probably have said a long time ago is the difference between diffusion models and something like a GAN: a GAN just outputs one image, it’s done. The diffusion model has successive passes and it does it in steps. So each step is changing a little bit and you’re going to get the output every single time. And that’s that idea that every single step you’re using this meaning representation from the language model to change it a lot.
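The "associative map" between language patterns and visual patterns can be sketched as a shared embedding space in which matching text and images point in similar directions. The three-dimensional vectors below are hand-written assumptions; a trained model such as CLIP learns its embeddings from image and caption pairs, and this toy does not reproduce that training.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(caption_vecs, key=lambda name: cosine(image_vec, caption_vecs[name]))

# Hypothetical text embeddings in a shared 3-dimensional space.
captions = {
    "a dog": [1.0, 0.0, 0.0],
    "a cat": [0.0, 1.0, 0.0],
    "a bicycle": [0.0, 0.0, 1.0],
}
image_embedding = [0.9, 0.1, 0.0]   # hypothetical embedding of a dog photo

print(best_caption(image_embedding, captions))
```

During guided generation the same kind of similarity score tells the denoiser which direction makes the image a better match for the prompt.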
I don’t know how to say this part but the way that the knowledge is encoded and mapped between the language and the diffusion model and the way that the language is mapped in the transformer model is different. And I just have an intuition about how I use that. That’s that part of the right brain where I can’t put that into language yet. I just have an intuition.
Jim: And how does that bear on the question of style? Give me a bicycle in the style of Van Gogh or something. Where is the Van Gogh implemented? Is it in the full model?
John: There’s about 6 or 7,000 Van Gogh tagged pieces in the training set. The Van Gogh-ness is both in the CLIP model and in the actual diffuser itself. It’s just that when you look at a lot of the examples that don’t have guidance, they just are not as good. And there’s all these other hacks in there too, like attentional layers. The attentional layers in the CLIP model are like, which parts of the text are you attending to in relationship to generating this output? There’s so much stuff in here. But what I’d really like to talk about is that sourcing thing that we started on earlier, because that’s a big debate.
Jim: That was going to be my next question, which is again to remind the audience, this neural net is derived from many, many, many, many source inputs. Most of them I presume, spidered off the internet one way or the other, and generally speaking without anybody’s permission. So let’s talk a lot about this issue actually.
John: Well, we have to sort of connect it back to my work with cognicism, which is the idea of high resolution democracy where you’re learning synergistic satisfiers, representations of meaning that people seem satisfied with basically, that everybody can agree upon. And after it has learned whatever idea is most satisfactory to people, you can see who contributed to it. So it’s a process of debate where at the end of it you have refined to an answer, like the way that we were talking before, those three different modalities of narrowing in towards something. That process is in that latent space and you get this output and then you say, oh, who was this sourced from? That particular thing is not that hard. You can analogize it and bring it into an image model.
I think the easiest example that we can talk about is the appearance of watermarks in these images. Because they actively do try to sort out watermarks before training, but that doesn’t necessarily mean it’s not getting covered into things in it, it’s just because they don’t want watermarks. If you use a training set of data that has watermarks, then it’s going to try to create a watermark that is formed from all of the different watermarks that it is pulling the patterns from. We can’t interpret that, that’s meaningless to us.
But to me, I see, oh, it literally doesn’t understand that humans are different because it’s a machine, it’s just trying to put it into one space. Its job is to put all of this meaning into one space. So because it’s a semantic space, if I was to do something like who said this to the GPT model, it will mash names together. If I say if there’s a research paper right now, because all those names are in that same meaning space, it’s going to be like doctor… Whatever, I can’t think of multiple names off the top of my head. It can mash names into one and be like, I’m doing a good job sourcing.
That’s sort of the debate is that these models do not attribute in any way. And there’s also a cultural problem which is that our beliefs about how important our things are not… How do I put this? There’s a debate on ArtStation right now trying to protest and remove their work from the data set. I want to explain to these people that if you removed all the art from the ArtStation training set, you would still be able to render things that look like art in the ArtStation training set from all the other patterns, because patterns leak through minds. We’re all holding these little pattern makers, these little convolutional filters, we’re all holding these little pieces from which we are constructing new wholes from.
So nobody can really own them, but we can point to who patterns are being sourced from when you have a model like this. It’s not even hard, it’s just not in the conversation because that would mean having people register their art, that would mean them having to maintain some level of credit to the people. And it’s a lot easier to just put it out into the world and be like, oh, we’re just cutting it into parts, you don’t have any ownership over this.
Jim: Of course, they also want to at least dodge the most obvious legal handles. If there were attribution, then somebody could say, wait a minute, this is a copyrighted work of art and you don’t have the right to use it. If they take the approach that they do, which is we’ve scanned in all kinds of stuff, we’ve, as you say, broken it up into tiny, tiny, tiny little fragments in the mathematical and statistical nuances around that and further, these latent spaces are meaningless except in the context of all the other objects that are in there. It’s not even reasonable to say that anything that you have copyright on is directly represented in this corpus. And that’s just a practical reason why they would not want to provide that attribution. Though, along that same line, I did communicate with OpenAI a week or two ago saying that for ChatGPT, I thought it would be hugely valuable if you could have a version that provided footnotes. And they said they’re working on it, which would be quite interesting.
John: They’re working on it and I could talk about that in detail. It comes down to, again, putting the sources and the text itself in the same embed space, which is something that you would want to technically be able to do if you can figure out how to do everything unsupervised and just say, okay, perfectly who it’s sourcing from. I think that’s where their heads are.
The simple solution is that you have the words and that’s the meaning, and the way that it figures out sources is just because it’s in the text itself. If you move the sources into their own embedding space where for example, each word token exists within a word embedding space, you have each source, each artist, each person is registered within that embedding space. That embedding space is mapped to the space that they’re pulling things from. Then in your output, if you have an attention layer on the sourcing, then it will tell you from which sources it was drawn. But that would require a layer between the culture of artists and people and the architecture. And it is very hard to find human beings who both get the model architecture and who are really deeply into the math and who understand what it really means to relate with these and how human beings are going to feel and react to what it means to have art be commodified in this particular type of way, the creation of art being commodified.
These things were predictable. They could have definitely put these things in. What I’m suggesting is not a magical intervention. Source embeddings plus an attention mechanism in relationship to the latent space, it’s not that complex. It’s just that when you’re looking at a large corpus of data you do not have them labeled in any way by sources. And I’m trying to say to them, you don’t need to start with everything being labeled by sources, you can do transfer learning. You would start with, oh, we’ve learned on the whole internet. Okay, now we step into this phase, which is into the source embedding space. You put everything from before into one identifier and then anybody can put in their own art or can put in their own meaning and have it be attributed to them when any output samples from it.
And then you could know, you could see really the sources of the patterns that become most common. If there’s a new genre that emerges that uses these patterns a lot to construct and it emerges from one person, you literally see and trace back the origin of the meme, the origin of the pattern if you also add temporal awareness, but they won’t figure that out for five to seven years. I could say that same thought to a ton of people, they’ll just stare at me with blank eyes. They won’t get it.
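As a rough illustration of the idea John describes (source embeddings plus an attention step over them), here is a minimal Python sketch. It is not how any released model works. The source names, the three-dimensional vectors, and the function attribute_sources are all hypothetical, and the catch-all "unlabeled_corpus" entry stands in for his "everything from before into one identifier" step.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attribute_sources(output_vec, source_embeddings):
    """Scaled dot-product attention of one output vector over source embeddings.

    Returns a dict mapping each source id to its share of attention.
    """
    names = list(source_embeddings)
    scale = math.sqrt(len(output_vec))
    scores = [
        sum(q * k for q, k in zip(output_vec, source_embeddings[name])) / scale
        for name in names
    ]
    return dict(zip(names, softmax(scores)))

# Hypothetical, hand-made embeddings; a real system would learn these.
SOURCES = {
    "unlabeled_corpus": [0.1, 0.1, 0.1],  # everything learned before registration
    "artist_a": [1.0, 0.0, 0.0],
    "artist_b": [0.0, 1.0, 0.0],
}

# Hypothetical latent vector for one generated output.
weights = attribute_sources([2.0, 0.5, 0.0], SOURCES)
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.2f}")
```

The weights always sum to one, so every output is credited to some mix of registered sources and the catch-all bucket. Whether those weights reflect real influence on the output is exactly the question Jim raises next.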
Jim: Yeah. I start thinking about how these nets are actually built as a series of subtly adjusted weights across many, many, many, many, many, many, many, many cycles. I mean, of course this is a fundamental problem with the artificial neural net models of machine learning is any kind of attribution at all is very, very difficult. I mean fundamentally, there’s some work going in there now but what are the attributes of the network that cause behavior X? Even that is hard. I’m sure thinking about it from a practical perspective, actually envisioning many layers, many nodes, many weights, and then being adjusted repeatedly, is it even reasonable to say that any particular artifact had any particular impact on output? This is much more holistic in that sense than it is easily referenceable. And if I was imagining how ChatGPT might do it, they might do another version, which is to have a latent space for papers and then compare that to their output and then do a post hoc after the fact footnoting as a way that that’s actually more tractable.
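The post hoc footnoting Jim imagines can be sketched as a nearest-neighbor lookup: embed the generated text, embed each candidate paper, and cite the closest matches. The Python below is only a toy under stated assumptions. The bag-of-words embed function stands in for a learned embedding model, and the paper titles, abstracts, and thresholds are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a learned model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def footnote(output_text, papers, top_k=2, min_score=0.1):
    """Return up to top_k (title, score) pairs most similar to the output."""
    query = embed(output_text)
    scored = sorted(
        ((cosine(query, embed(abstract)), title) for title, abstract in papers.items()),
        reverse=True,
    )
    return [(title, round(score, 3)) for score, title in scored[:top_k] if score >= min_score]

# Hypothetical paper titles and abstracts.
PAPERS = {
    "Paper A": "diffusion models generate images from noise",
    "Paper B": "attention layers in language models",
    "Paper C": "soil chemistry of alpine meadows",
}

notes = footnote("language models use attention layers", PAPERS)
print(notes)
```

This kind of lookup only says which papers resemble the output after the fact. It makes no claim that the model actually drew on them, which is why it is more tractable than tracing influence through the weights.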
John: That’s what the first implementations that I predict are going to happen, are they’re just going to register experts who are willing to register basically themselves as a source of credibility into these things. And there’s going to be all these debates because it’s going to be more gatekeepers of knowledge and people will not feel like they have any access and they’ll feel like they’re controlled by other people. And like you said, without what I described as a sourcing mechanism how do you point to whatever it is? But when you have it, you could see it very directly and you’ll actually see that it pulls from far more people than our human type egos would like to admit.
If you saw an image of the Starry Night and Mona Lisa integrated into one image, you’d be like, oh, that stole from two artists. And I was like, no. They also studied artists of their time and they brought in the patterns from the artists that they studied and all of those artists are sampled by them before them.
Jim: And that obviously many, many artists are descendant from both of those trees as well and all those were input into the database, and so those impacts are going to be there as well. So try to do an attribution, the more I think about it, if you try to do it from the corpus itself is going to be a fool’s errand. If you could build a side space, but then you get into all the problems you talked about of, okay, who’s in the side space, how was that curated, et cetera, and why is that a reasonable thing to do, even frankly?
John: Well, the reason why is because we’re having the debate and a lot of the takes are pretty bad. I don’t know. I started from this point in 2017, I’m like, this is what’s going to happen. This debate’s going to happen. We’re going to have to deal with it, tune into what I’m saying so that we can preemptively deal with it. Now we’re here. Now all these bad takes are being amplified. And I just want to say you are reacting to this moment, you are doing reactive cognition. I’m doing pre-adaptive cognition. I saw it coming, I tried to warn you about it, but we are attending the people who do not really understand how it works. We’re amplifying those voices who are amplifying misinformation and prediction as a signal for when the earliest version of a meme that could be found in a ledger of shared meaning over time, the earliest version that you can find of a meme the first time somebody uses a word is a really powerful signal. It’s right there. It’s not hidden. You can go through your Facebook feed, you could see the consistency, you could see your old thoughts and you can find the first version of a meme.
It’s a thing that we can do, but instead we live in this world that is going to try to keep commodifying everything and taking away our autonomy as individuals and stripping it away further and further and further. But I can’t say what’s going to happen because obviously artists are rebelling. So that changes the outcome of the future. I do not know how OpenAI will react to them. I don’t know how anybody’s going to react per se. But I do know that there’s this great debate, there’s a lot of misinformation and we probably should be attending to different voices in the debate.
Jim: What’s your perspective on the acuity of the artistic objections? The people that are making the objections, are those folks that don’t know what’s going on that are just reacting actively as you say? Or are there folks over in the artist world that actually do know what’s going on and have a legitimate critique?
John: Can you give an example of some artistic objections that you’ve heard specifically?
Jim: Yeah, that, hey, they used my copyrighted work of art as input therefore I should be paid, or they shouldn’t do that without my permission.
John: They probably shouldn’t have done it without their permission to start with. I mean, I didn’t go out there and steal everybody’s knowledge out there and try to say that I owned it and therefore I could profit off of it, they just kind of went for it and they’re like, woo. But now we have to integrate as a society that great debate except our mechanisms for doing that are social media, and that’s weird. Eventually this stuff will go to the government. The people get angry enough that somebody will bring a court case and then old people, far older than you-
Jim: That’s old.
John: … Will essentially determine it. And I’m not saying that the elderly don’t have wisdom, they obviously do. But we’re also dealing with a set of patterns that has only been invented basically in the last 10 years, and you’re asking people who grew up and have lived their entire life without those patterns to assess what we should do with them. That’s sort of a hard problem that we have to deal with right now. And what is going to happen is the debate is just going to boil and boil and boil about ownership, and the tools are going to get more and more complex and people will be able to do more and more.
Have you seen The Congress? It’s a movie about essentially this where Robin Wright sells her identity and everybody lives in this cartoon drug world where they are rendered as these characters in real time and they’re like cartoon characters. We’re all going into this filtered space where we can create these masks that are going to be even AI generated so we’re not even showing our face online to people. We’re not even willing to have any level of accountability. We’ll hide, hide, hide, hide, hide. That’s one potential future.
Jim: Yeah, that’s one future and not one that I necessarily think is a good one. As you and I chatted about in the pregame, the idea of full spectrum skin in the game has historically proven to be a useful model. Now that is historically proven. It may be that there are other models in the virtual realm which can provide the benefit of full skin in the game. But I look very, very carefully at the proliferation of anonymity and pseudo-anonymity and things like virtual avatars represented as if they are you, and of course now they can be good enough to look like they are you. And adding this noise into our social signal may have consequences that we have not yet figured out. I shouldn’t say may, do have consequences that we have not and indeed in principle cannot actually predict.
How about this? Let’s take the conversation where we were at, that you see these forces in contention potentially being resolved by nine old folks on the Supreme Court someday. Do you have an opinion about what the most likely trajectory of that play is?
John: The first cases are going to probably be porn. You can steal somebody’s face, you can put it on a naked person’s body using AI. And that is getting easier and easier and easier and easier and easier. That’s almost certainly going to be the first major court case that they have to resolve. And the context of that particular case will determine a lot because that’s how the legal system works, it’s you get a precedent and it’ll just be like the first thing that creates enough energy that people have to deal with. And then we’ll just start building on that as if that foundation was a perfect foundation and that somehow everything should be built on and it’s infallible in their assessment, and there couldn’t have possibly been error in deliberation.
It’s just so hard to say. I mean, we’re going into these 2024 elections where we’re getting these incredible deepfakes of all different forms. Are you familiar with Stephanie Lepp’s work?
Jim: Yes, in fact, I’ve had Stephanie on my show. We went into her deep fake interviews with the people and I’ve actually had Stephanie as a guest host interview me on my show. So I know Stephanie pretty well.
John: Yeah, she’s pretty amazing. And those things to me, it creates so many emotions in me. Because on the one side it feels really wrong to force meaning into another person’s mouth. At another angle, the way that she writes is a way that nobody talks. It’s like trying to find the center which would agree with everybody just a little bit, right? The Brett Kavanaugh one that he’s talking, she’s writing in a way that is trying to form some central representation. And I think we’re going to see more and more people thinking about these generative models like that, people doing that type of thing. Which is people really willing to listen to other people, like Peter Limberg, that sort of memetic mediator thing. Really willing to sit in between people who are in conflict and serve as sort of bridges of understanding. That, which a lot of people are talking about in the community, is really important, but I think we’re going to see sort of automated forms of that emerge in pockets while this great debate occurs at this larger scale which won’t really make sense. I think probably the real solution is going to come from people iterating and designing and then something will be put out that’s open source and once it’s out there, you won’t be able to undo it.
Jim: Yeah, that’s my point about this and a couple items here. The episode with Stephanie Lepp and her Deep Reckonings, which is really interesting, is EP 129 for those who want to go look it up.
Now, the example you just gave, putting somebody’s head on a porn, people have been doing that with Photoshop for at least 20 years. You can type into Google, Hillary being screwed by a donkey or something, I’m sure it’ll come up. And it was just done with crude Photoshop. And I’m not entirely clear that there’s ever been any court cases about that.
John: Believability is, I think the thing that’s going to be the thing. If people feel like their fundamental reputation is being slandered because it is believable to a point, then we’ll have the problem. And it might even be people like politicians doing it first. They might even try to support it being illegal for rich people or not for others. I don’t know how it’ll play out, but believability is the key thing here.
Jim: The other follow up I want to make because these are all on the use side, you did not really address what you think might happen on the claims side of the creators. And of course if the adjudication came down one way, it could actually be the end of all this stuff. If it comes down the other way, it would be very supportive of it and it could of course be somewhere in between. Any thoughts how that might play out?
John: I think that artists, traditional artists should use AI art to augment their workflow so that they can make more money faster. I think that it’s very clear that if you have a shitty client who wants 1,000 different edits, you’ve already fricking painted the thing. And because your own style is embedded in the image, if you want to in-paint something and just describe it in there, you do it three times. Be like, I’m tired of this person, just pay me my fricking money, dude, I’ll make it as whatever specific as you want. But because that artist has their own unique style, it becomes something that people still want to seek out. There will always be a sort of out of most of it center to the bell curve. Most of the art is going to be close to some central thing and things on the edge really excite people and create new representations in the mind. People want those edge representations.
Art and music, it seems like it’s just one of those things where there won’t be money in it in the same way unless you’re willing to embrace it. And if you’re one of the people who’s willing to embrace it, you probably could make more money than ever in this sort of gradual change until AI just keeps breaking and breaking, breaking and breaking more markets, more and more and more markets won’t make sense and then people will be like, oh crap.
Jim: Of course the idea of artists making big money is a very new phenomenon. Historically, artists, even the very best, made decent middle class incomes essentially being patronized by royalty or later by the rich merchant class.
John: That’s the major model now too, Patreon.
Jim: I think I subscribed to 10 or 12 people on Patreon, another 18 on Substack. So I’m doing my own little part of helping the creator economy. And that may be the future, I mean truthfully, my wife and I are both… Especially, she’s much better than I. She knows the popular music scene, especially the Americana scene really, really well. We’ll see artist X who dominates a genre economically but is only the 200th best musician in the genre. And so in some sense, breaking down these winner take all economic games, generators essentially that end up as winner take alls may actually be a good thing for the art community. Maybe more people will be able to make a modest living, let’s just say make a living doing art if there wasn’t this runaway tendency to winner take all.
John: Yeah, I don’t necessarily know where every aspect of the future is going because the reactivity of humans and their will to change the outcome of the future is real.
Jim: Of course, agentics makes prediction even harder than it otherwise would be, right?
John: I’m of Matt Perkowski’s sort of belief that the further you stretch it out, it’s like you throw a rock in the water and you’re trying to say, I predict in 100 years it’s going to be a wave, that doesn’t make sense. But, sorry, we can just make general understandings and interpretations about human behavior and we can make general predictions about money and where there will be money and we can make general predictions about what happens when people’s livelihood gets taken and the level of anger and reactivity that emerges from those. So, we can guess a few narratives that are going to play out in this and we can also actively communicate about better ideas and preempt basically what the debate will result in. Which is probably going to be some centralized authority or some thought police, which is already kind of what is happening with these models.
Jim: Though, to your point, let’s branch on this too and this is my rebuttal to people that claim that we just ratchet these things down by regulation: open source is happening. Eleuther already has that pretty decent language model and while the open source ones at least for quite a while are likely to trail behind the top of the line commercial ones, fairly soon it won’t matter, right? When you get to GPT-9, the fact that the open source one is equal to GPT-8 for most practical purposes won’t make any difference. And those things are going to be out there and once they’re out there, as you said, it’s too late to bring them back. I presume that’s practical on the art side as well, I know much less about sort of what’s necessary to do these art thingies than I do the text thingies.
John: I like to think about the world as one giant interconnected neocortex that is raising and lowering in energy, kind of like a distributed heat map of emotion. Surprise guides the flow of cognition within that distributed heat map. If we’re talking about, for example 2016 to 2020, there was basically every time the President did something that was very unpredictable or we didn’t like, there was this feeling of fear that raised up and there was this gradual communication and re-normalizing and as that happened we sort of restructured our beliefs about what is normal, what is society, what is good, and how do we function?
So that’s what we’re headed towards is that you go into this next two years, you have this election where you can have propaganda bots, which is my biggest fear. The persuasive bots that seek you out, who are trained on the beliefs of a political party or something they find susceptible people. I mean a terrorist group can find a susceptible person, stokes their frustration, stokes their rage, and radicalizes them. That’s a real thing that can happen. And then I’ve been trying to communicate for the last five years, this is avoidable. You have to be proactive and I can explain to you how it’s avoidable. But the most likely future, most likely future if you are passive as an individual is not that good.
But if we can be proactive and talk about this shit and take action, I had this conversation with Ari and Daniel Schmachtenberger, which made me make Iris. Which was they just scared the fuck out of me. They described people who were already working in the space and what was guiding them as beings, what was motivating them as beings, and I was like, fuck, it’s basically happening. So I need to serve as a counterpoint for how we can interrelate with these new tools as they emerge because the debate that’s going to play out is going to be very slow and frustrating because people do not have the meme set to even describe how things are different. They are surprised and they don’t know how to react. And when you have something that’s sending you a voice message that has this natural emotional valence and is really trying to change how you feel, people are going to be like, we have crossed the line of what spam is and what forced memetic manipulation is and there will be a breaking point because we can’t just be beings of constant manipulation.
I want us to own our cognition. I want us to be free and be able to direct ourselves and not always be attending to some global bullshit that is designed to attract my attention. It seems to be the dominant paradigm that we’re all fighting for attention because that’s how you sell things to people, that’s how you get the most valuable resource, which is money.
So we had this interesting thing with these generative models, which is you literally have an attention mechanism that cuts up that attention. And you can take from that attention mechanism and retroactively say literally who was attended to in any particular debate and if things go really well, you can attribute positive futures to some degree to that individual. If things don’t go so well, you can attribute negative futures to that individual. But that’s just me saying, okay, this is probably where we’re going into the future. You’re going to have to wait till you get there. Most people are going to have to wait till we get to the future for them to understand what I’m saying. I can use all of these tools and I will work as hard as I can. I promise you and everybody I will work as hard as I can to explain it so you can avoid the future before we get there. But I’m not that perfect of a communicator, like, I try, but when people specifically ask me questions, when people reach out to me, it’s a lot easier for me to express these things. And, people trust you. You are an important mind. Most people do associative trust evaluations. If a person says a person is safe, then that meme set becomes safe. That’s a very basic and tribal mechanism that works when you only see 100 people.
Jim: And of course now it’s generalized so that if you’re Joe Rogan, you’re pinging 10 million people.
John: Yeah. And that’s a very, very strange thing that we have no real awareness of what it’s doing to the structures of our brains, that you can have a thought as somebody like Elon and immediately impact so much energy just by saying something. You raise this energy, you raise people to do specific things. You talk about like stochastic terrorism for example.
Jim: We’ll have to wrap it up here soon. But the takeaway is this has been gradually building for 150 years, 200 years, but the cost capability curve is starting to go really, really steep right now. My good friend Peter Wang said, after playing with ChatGPT for two hours, I think he said, folks, this thing is amazing. But please remember it’s December, 1903 and he took a picture of the Wright Brothers airplane and said it’s going to go straight up from here. And then he put a picture of an F-35 or something and said in less than 100 years, we went from here to there and this is going to happen a lot faster than that.
So this future shock, because you’re absolutely right, the idea of the Supreme Court trying to get their head around adjudicating these issues is kind of crazy. As Daniel Schmachtenberger likes to say, we have stone age brains, medieval institutions and God-like power is a fairly bad combination. And you and I and Daniel, lots of other people are working on various ideas on how we can build a new set of institutions. But we’ll have that conversation on another day.
So do you have any final thoughts you wanted to leave off?
John: Yes. I think that people should get into the habit of writing down their predictions about where this is going daily and looking back on them so that you can start to hone in and narrow in on where we are going. And if we don’t attend where it’s going and we don’t hone our capacity to see the future and then take action to avoid it, then it will probably get much worse.
I’ll say this the way that maybe Forrest Landry might suggest it. That the problem of internal alignment needs to be solved before we solve AI alignment. If there’s all this conflict and cognitive dissonance within ourselves and then we try to build out these machines that reflect our inner selves, which are already in conflict, then the AI is going to be in conflict inherently. So I would like people to use these tools not to just get more lazy and make it easier for them to exist, but rather to integrate, to learn to communicate better, to become an artist if you couldn’t before. To do the things that feel uncertain, that you struggle to do. Because by constantly improving you stay ahead of the AI, because the AI is sampling from humanity.
So, work on your own capacity to integrate information and use the AI to be able to do that rather than using the AI to necessarily baby or diminish your capacity to act within this world.
Jim: An empowering thought. Thank you John. That’s John Ash, also known as Speaker John Ash in his artistic performance role. Go check out his work. It’s been great having you here.
John: Yeah, it’s been great. Really enjoyed talking to you and there’s so much more to talk about, but I enjoyed this a lot.