Using machine learning to predict how a protein folds helps solve a riddle in biology. But it is just the start.
These algorithms open up new opportunities to explore the physiological processes that have eluded research, adapt and create proteins for therapeutic purposes, and even power nano-molecular machines.
This week we speak with Professor David Baker about the enormous scope for making new proteins and how that translates into practical uses to tackle diseases, such as Covid-19.
We also discuss the “technological molecular design revolution” and how nanomachines could work like tiny vacuum cleaners to clear arteries from atherosclerosis or our brains from Alzheimer’s amyloid plaques.
David Baker also explains why none of this is possible without a sense of community in the lab.
Hello, you're listening to the third series of Theory & Practice, all about science that will be most impactful in the future. I'm Anthony Philippakis.
And I'm Alex Wiltschko.
Much of what we've been discussing in this series so far has been related to the human genome. But today, we're going to talk about the output from our genomes, proteins, and specifically about how they fold. Proteins are known as the building blocks of life. They carry and deliver oxygen, they give power to our muscles, and as enzymes, they speed up chemical reactions in our bodies that keep us alive.
Each gene codes for the amino acid sequence of an individual protein. The human body only uses around 20 amino acids. But that's just a tiny fraction of all the amino acids in the universe. The average human protein is about 100 amino acids long. So even using this small number of amino acids, you can begin to see the extent of the possibilities for a human protein sequence: 20 times 20 times 20, 100 times. But the complexity does not end there, because each protein folds like a paper chain to create an individual form: a 3D structure that drives its function.
The particular shape of the protein determines its function. For example, the shape may create a pocket that another compound can slot into. It is possible to measure the structure of each protein using physical methods like X-ray crystallography or, more recently, NMR spectroscopy or cryo-EM, but these can take years and cost hundreds of thousands of dollars. So instead, working out how to predict how a sequence of amino acids will fold using machine learning is a holy grail.
There's a biennial competition called CASP - that's “critical assessment of protein structure prediction” - that was set up 20 years ago to bring together teams of machine learning scientists with biochemists and biologists. At CASP, they tested their algorithms against 90 proteins whose form had been determined experimentally using some of those techniques Anthony just mentioned. In December of 2020, at CASP14, Demis Hassabis’ DeepMind lab won the competition, effectively solving the protein folding problem. They used their algorithm called AlphaFold2, and it scored an average of 92 out of 100 on the global distance test.
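For listeners unfamiliar with the global distance test: the GDT_TS score is commonly described as the average, over distance cutoffs of 1, 2, 4, and 8 angstroms, of the percentage of residues whose predicted position lies within that cutoff of the experimental position. The sketch below is a simplified illustration only. It assumes the two structures have already been superimposed and that per-residue distances are given; the function name `gdt_ts` and the sample distances are hypothetical, and the official evaluation searches over many superpositions rather than using a single one.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS: mean percentage of residues within each cutoff (angstroms)."""
    if not distances:
        raise ValueError("need at least one residue distance")
    percentages = [
        100.0 * sum(d <= c for d in distances) / len(distances)
        for c in cutoffs
    ]
    return sum(percentages) / len(cutoffs)


# Hypothetical per-residue distances (angstroms) between predicted and
# experimental positions, after superposition.
example_distances = [0.4, 0.9, 1.5, 3.0, 9.2]
example_score = gdt_ts(example_distances)
```

A perfect prediction, with every residue within 1 angstrom, scores 100 under this definition.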
Meanwhile, another lab, the Baker lab at the University of Washington, was also working on the same challenge. More than a decade earlier, they had established an online game called Foldit, designed so that citizen scientists could work alongside them and their computer program called Rosetta to solve the protein folding problem.
But DeepMind didn't release their winning code right away. So the Baker lab rebuilt the Rosetta programme based on what they learned at CASP14. This new algorithm was almost as accurate as AlphaFold2 and was faster. So by July 2021, the world had two useful solutions to the protein folding problem. Interestingly, DeepMind shared that the development of AlphaFold2 was partially inspired by the Baker lab's game Foldit.
Our guest today is Professor David Baker himself. With him, we'll explore the inspiration for Foldit and how he involved non-scientists in this high-powered field. We'll also discuss how his lab is using this newfound knowledge to develop de novo proteins to combat COVID-19. And we'll look into the future of what he has called the ‘technological molecular design revolution.’ David, welcome to Theory & Practice.
David Baker 03:39
Thank you very much. Glad to be here.
Let's start by talking about Foldit. I remember Foldit - it's kind of a part of, I guess, my scientific upbringing, and Demis Hassabis has said that Foldit played a part in inspiring the new version of AlphaFold. So tell us how Foldit came about and what role you think the Foldit community played in helping to solve the protein folding problem.
David Baker 04:00
As far as how Foldit came about, the calculations that my group has been doing to try and predict protein structure and design new structures are very compute intensive. So we developed a distributed computing project called Rosetta@home, where non-scientists could get involved by contributing spare cycles on their computers, and Rosetta@home is still critical to the computing that we do in my group. Then, after we had established it, there's a screensaver that goes with it and you can watch the protein folding up. And participants started writing in and saying, you know, I'm watching the computer fold up the protein, and I think I could do a better job. And that was the beginning. That was the inspiration.
I think I had Foldit on a Performa 475 or something like that back in the day. But you know, at that time, it was unusual to treat non-scientists in such a way that they're involved in the science. What gave you the idea to do that?
David Baker 04:58
Well, I think it was really Rosetta@home participants who got very excited about the scientific projects they were involved in, and they watched the proteins folding up. And when you watch, you see that the computer's doing sort of a random search through the possible shapes the protein can adopt. And so once people suggested they might be able to do better, we thought about it, and realized it's an exploration or search problem, where you're trying to find the lowest energy state of the protein. You know, humans are pretty good, if you're stuck in some valley, say if you're in the mountains, at making a guess about which paths you should go over to find a lower point. And so we thought it'd be worth it to try to involve the general public. And of course, what we had to do was to pose the problem in a way that human intuition could really be brought to bear on it.
What do you think it was about the project that led Demis to think that it was an inspirational part of solving the protein folding problem?
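The "stuck in some valley" picture can be made concrete with a toy example. The sketch below is not Rosetta's or Foldit's method; it is a minimal illustration on a made-up one-dimensional energy function. The names `energy` and `greedy_descent` and all numeric choices are assumptions for illustration. Small downhill steps from one starting point settle into the nearest valley, while trying several starting points (a crude stand-in for a guess about which ridge to cross) finds a much lower one.

```python
import math


def energy(x):
    """Toy one-dimensional 'energy landscape' with several valleys (illustrative only)."""
    return 0.1 * (x - 6.0) ** 2 + math.sin(3.0 * x)


def greedy_descent(x, step=0.01, max_steps=100_000):
    """Walk downhill in small steps until neither neighbour is lower (a local minimum)."""
    for _ in range(max_steps):
        best = min((x - step, x, x + step), key=energy)
        if best == x:
            break
        x = best
    return x


# Small steps from a single start settle into the nearest valley.
stuck = greedy_descent(0.0)

# Trying a coarse set of starting points stands in for "looking over the ridge".
starts = [i * 0.5 for i in range(25)]  # 0.0, 0.5, ..., 12.0
best = min((greedy_descent(s) for s in starts), key=energy)

print(f"single start: x = {stuck:.2f}, energy = {energy(stuck):.2f}")
print(f"many starts:  x = {best:.2f}, energy = {energy(best):.2f}")
```

A real protein has thousands of coordinates rather than one, which is why good guesses about where to look matter so much.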
David Baker 05:54
We set Foldit up as a game. So if you sign up for Foldit now, you get puzzles, and the puzzles have a score. And the higher the score, the better you do: the more plausible your protein structure prediction, or the more likely your design is to work. And we published a few papers, and are publishing papers, describing these really interesting and important problems Foldit players are solving. When we wanted to analyze how they were doing it, we could track a little bit of what they were doing. And we asked them, and they gave explanations of what they did. And it was quite complicated. And Demis pointed out that they were doing really cool stuff, but it was hard to figure out exactly what the basis for it was. And, you know, that's when he thought about, or that was one of the things that prompted him to think about, a machine learning-based approach to the same problem. And in fact, we got contacted by DeepMind a number of years ago about actually developing an AI approach to solving Foldit puzzles.
It's fascinating to hear this, and Foldit is amazing in terms of engaging the non-scientific community. I also want to talk to you a little bit about Rosetta. As someone who's quite involved in creating scientific software, I think Rosetta is one of the most utilized and loved software packages in life sciences. And your lab has been making it for a long time now. When did you first start it?
David Baker 07:18
We first started developing Rosetta in the late 1990s. At first it was a graduate student, Kim Simons, in my research group, and then more people joined him; then the students and postdoctoral fellows working on it went on to start their own groups. And we decided early on that nobody would get any money in their pocket from it. Instead, we would make it a communal, community-wide software. And we made it available to everybody. And then we also invited people to continue to contribute to it. And that's been a real theme, I mean, both within the Rosetta community and then with Rosetta@home and with Foldit, which is, I really believe, that to solve hard problems, it really takes a village, and the more people you can involve in it, and the more positive their experience, and the more they enjoy it, the more progress you'll make.
You know, one of the things that you pulled off, and I think there's maybe only two software packages that I'm aware of that have achieved this, is that it actually extended beyond your lab. It's not just the David Baker lab that makes Rosetta, but so many of your trainees have gone on and become very successful academics on their own and are still part of the Rosetta community. You know, the other software package that I think has accomplished this is the GATK, which is now made by my group at Broad, but was previously made by another person at the Broad. And they solved the problem by not having academic disciples that had to go form their own careers. How did you actually enable your students and trainees to be able to have successful careers and yet still be part of something that you founded? Because usually, you know, they say ‘You're just doing what your mentor said; you're not establishing independence.’
David Baker 08:53
I think it was in part that Rosetta opened up so many new problems that you could approach, and that's protein design. If you look at all the things that proteins do in nature, they're so broad, and there's so many applications. So recognizing that, I encouraged people leaving my group to take on faculty positions to continue to use Rosetta and develop it and to explore whole new areas with it. And that's why there are close to 70 former grad students and postdocs from my group who now have faculty positions; there's just such a huge space to explore.
Excellent. Let's start talking about making new proteins. So it's not just the Foldit community that likes to work on new proteins, but it's actually something that your lab has specialized in. Correct me if I'm wrong, but you started off as just a computational lab, and then actually started at some point to actually make real proteins and characterize them experimentally. Is that correct?
David Baker 09:45
Well, actually, when I first started my group here, we were doing primarily experiments. The idea was to study the process of protein folding and get insights and then we incorporate those insights into Rosetta for protein structure prediction. And then once we were able to go from sequence to structure, we realized we could go backwards from structure to sequence. And that's the protein design problem.
Understood. So let's kind of dive into the protein design problem a little bit, how many proteins are there in the universe? And I know it's a vast number. But can you say a little bit about how that is?
David Baker 10:17
Yes. So there's about 30,000 genes in the human genome. Let's say an upper bound on the average number of genes per organism is 100,000. And if we have, say, 10 to the ninth organisms, then we're at 10 to the 16th proteins, counting each gene in each organism as representing a different protein. So that's a big number. But the number of possible amino acid sequences: you have 20 amino acids for the first position, 20 for the second, 20 for the third, so it's 20 times 20, you know, up to the length of the protein, so it’s 20 to the 100th power, and that's a number on the order of 10 to the 130th power. So even though there's a very large number of proteins that exist on earth, it is an infinitesimal fraction of what the possible number of proteins is. And that's what makes protein design so interesting. There's an essentially infinite space to explore that nature hasn't explored.
So there's 30,000 protein-coding genes in the human genome. So we know that those work, right, and we know little changes to them cause them to work better or worse. But when you get into this, like, 10 to the 130th space of proteins, I can't imagine all of them work. So, like, how does it work to explore that space, where I'm sure there's a lot of, kind of like, emptiness in terms of utility or functionality?
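The size of that sequence space is easy to check. The short sketch below takes the figures quoted in the conversation as given (20 amino acids, a length of 100, and the rough estimate of 10 to the 16th natural proteins); the variable names are illustrative. Python's arbitrary-precision integers let the count be computed exactly.

```python
from math import log10

AMINO_ACIDS = 20               # natural amino acids, as quoted in the conversation
LENGTH = 100                   # protein length used in the conversation
NATURAL_PROTEINS = 10 ** 16    # rough estimate quoted in the conversation

possible_sequences = AMINO_ACIDS ** LENGTH    # exact integer, 20^100
exponent = LENGTH * log10(AMINO_ACIDS)        # power of ten, roughly 130
fraction_explored = NATURAL_PROTEINS / possible_sequences

print(f"20^100 has {len(str(possible_sequences))} digits (about 10^{exponent:.1f})")
print(f"fraction explored by nature: about 10^{log10(fraction_explored):.0f}")
```

Even if the estimate of natural proteins were off by many orders of magnitude, the explored fraction would remain vanishingly small.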
David Baker 11:37
That's a really good point. And that's exactly the challenge of protein design. So the first problem that we worked on, you know, when we first started doing this is - could we design completely new sequences in that space of 10 to the 130th that actually would fold up to structures that we designed. So that was a structure design problem, and we got to be quite good at that. So it's true that the majority of random sequences will not fold at all - we actually have done some experiments to get at that, and we estimate that the fraction is less than one in a billion. But, you know, what's one in a billion when you have 10 to the 130th to choose from? So the next part of your question is, as you pointed out, that most of these sequences won't fold at all. But if there's a function you want to design, that's an even rarer fraction. So the first problem is, how do you design and find sequences in this huge space that will fold up to structures? The second problem is how do you find sequences in that huge space that fold up to structures that carry out the function that you're interested in?
You know, along those lines, as you start making synthetic proteins with novel folds, do we learn anything about naturally occurring proteins, like how they're different than the ones that you've designed?
David Baker 12:47
Yes, we learned quite a bit about naturally occurring proteins. And that's one of the things about design is - you understand what you can build. And so if there's some type of structure or function, you're trying to design, and you think you understand what the principles are, and you try to design it and you fail, that means there's something missing in your understanding. And so in multiple cases, you know, over and over again, that's what we typically … when we take on a new design challenge, we first try putting in everything that we know - and anybody knows about the problem - typically, it doesn't work very well. And then it's a whole iterative process to try and figure out what the missing key concepts are. And that's how you get to ultimately succeed in design. So protein design is really more science than engineering in that sense.
Maybe double clicking on that a little bit. You've talked explicitly about that false impression that people can get when they learn about these new proteins that are being assembled or optimized for therapeutic purposes. And, you know, there's this public perception perhaps that it can look too easy. Yeah. But in fact, there's a lot of trial and error. So you know, an example from you is mini-binders for the Coronavirus spike protein. So, could you maybe use that as a concrete point to discuss? Can you tell us what this is, and then walk us through the stages that you took to create this?
David Baker 14:06
Yeah, that's a great example. So after the pandemic broke out, we, like many people, sought to try and apply our methods to helping to combat it. So the spike protein of the Coronavirus gets into our cells by engaging a protein called ACE2. And shortly after the pandemic started, the structure of the interaction between the ACE2 protein and the Coronavirus spike protein was determined. So we tried two approaches. One approach was to take a tiny part of ACE2 that was actually at the binding site and build a small protein around it. And the second approach was to make up brand new proteins and dock them against the spike protein and find the ones that had a perfect shape fit, and then we designed their interfaces to be a good chemical fit. So it's kind of like fashioning a key for a given lock, where the lock would be the spike protein. So we tested these two sets of designs. The first thing that we found is that only a very small fraction of them actually bound the spike protein. So we had to make large numbers. And so it was clear at that time that there was something we were missing. And we've since been able to go back and learn a lot from the differences between the binders and non-binders that have allowed us to improve the design process. The second thing that was interesting, and not intuitive, was that the designs from this more de novo approach, where we did not use a piece of ACE2, actually bound about 1,000-fold more tightly to the virus than those from the ACE2-based approach. In fact, the compounds we got are still, I think, the most tightly binding compounds known to the spike protein of the Coronavirus. And when the structures were determined experimentally, it was very exciting because these proteins bound exactly in the way we had designed on the computer.
And are you still continuing that work? And is it having therapeutic implications?
David Baker 16:03
Yeah, these proteins are currently in preclinical studies. And the goal is to have them in clinical trials by the end of the year.
Okay, so maybe changing gears a little bit. There's always a point of a podcast when we want to find out more about our guests and what makes them tick. So we've talked in the past about the Foldit community creating Rosetta, and one of the things that you said earlier on was the importance of creating a community within your lab and the importance of community in doing your science.
Yeah, we don't often talk about community building and organizational design. I've read that you actively work on this and actively work to create a sense of community in your lab. So, chocolate tasting Wednesdays I hear, and whole work group hikes in the mountains around you, and then daily lab meetings for cadence. So how important do you think these things are for people that obviously are brilliant, but are working in a community together in a lab like yours?
David Baker 16:57
I think it's really important. My metaphor for the research group is the communal brain. And the analogy is that if you have many sea slugs, each with a couple of neurons, you can have billions of them. But if those brains aren't communicating, those neurons can't really do much. Whereas the human brain with really high connectivity - you get these incredible emergent properties. I believe the same thing happens in research groups. So if you have 100 brilliant researchers all working completely on their own, it's absolutely different than 100 brilliant researchers who are just bouncing ideas off of each other all the time and brainstorming and experimenting with different ideas. So I think maximizing connectivity is really important. Absolutely no hierarchy, I think, is really critical. So all these neurons are at the same level.
But let me ask about that. How do you reconcile the different levels of training that people need as they kind of mature with having no hierarchy?
David Baker 17:46
Well, when new students come in, I first ask them to talk to everyone in the group to find out what they're most interested in. And then they tell me, and then we discuss with the students and postdocs who are in that area, and we come up with a project, and then they are the mentors for that person. And then when they join the group, for the first year they work closely with their mentors. And then there's this magic transition that happens when suddenly they start mentoring others. So we take mentorship very seriously, but it's really just for the first year, year and a half in the group. And by that time, you know, if you're brilliant and motivated, and you've talked to everyone, you’ve just absorbed so much that you're ready now to mentor other people, which is a learning experience in itself.
And how do you conceptualize your own role in the lab?
David Baker 18:28
Well, pre-Covid, I had my own projects, but I've gotten a little bit more scheduled since then - you know, everyone's lives changed with Covid - I spend 95% of my time talking to people in my group. I'm in the lab every day, I don't travel, I'm not on committees, I don't really do anything else. And I think one of my roles is to really maximize connectivity. So basically, I spend my day meeting with individual people or groups of people. And most often what I'll do is, since I've talked to everyone in the last two weeks, I can say, ‘Well, actually what you're working on, someone else might have a solution, or the three of you should get together to discuss this.’ So frankly, what I'll just do is maximize connectivity. And then when new people come in, I help frame projects, at least starting directions, but some of the most exciting things that are happening here are things that just bubbled up, you know, where three people working on completely different things got together and started some brand new project. And I think creating an environment where that can happen is really important.
Actually, I think it's really cool that you still do projects. What would be an example of a project you recently led?
David Baker 19:26
Well, my most recent project - I wanted to get some familiarity with deep learning. So I was taking just the backbone structure of a protein and trying to predict the energy that Rosetta would compute for that protein. And that's a time-consuming calculation because you have to search through all the different conformations of the protein side chains. It didn't work very well, but that was probably just as much me as, maybe, the idea.
Actually, while we're on the subject, I have a question for you, especially given that project that you took on. Were you surprised by the success of the deep learning methods? Especially at least to me, one of the things that I find disappointing, and maybe that's unfair is that, you know, they work much more by information theory and evolution, than based on biophysics. And at least when I was an undergrad in the 90s, you know, I followed the protein folding field and thought that our ultimate solutions would be physics-based rather than just pattern recognition. So what's your take on that?
David Baker 20:18
Yeah, it's a really good question. I think the models are capturing quite a bit of the physics. So we're now using deep learning, basically using Rosetta, Foldit, and AlphaFold, to design brand new proteins, and they fold and they're completely unrelated to naturally occurring proteins. So it's not simply coevolution. What's a little bit unsatisfying about these models is that they have hundreds of millions of parameters, so it's not very transparent what it is they're learning. But there's some extremely interesting things. For example, we talked about de novo protein design, and designing proteins that fold to new structures - these models are really good at predicting their structures. And recently, we figured out how to get these models to hallucinate brand new protein structures; we basically feed in random sequences, and then we optimize them to the point that the model really thinks that they fold to some structure. It's like having a network that classifies images as cats hallucinate images of ideal cats. And we look at the structures that the networks hallucinate, and they look eerily like our ideal protein structures we've been designing for many years. And so I think the human experts studying the physics of folding and constructing ideal proteins came up with the same solutions that the network now constructs when it's trying to build ideal proteins. I think that's eerie, but also very interesting.
There's something behind that that's really interesting to me - early on, in our conversation, you mentioned a prediction going wrong, or something not working, being an indication that there's something missing in our understanding, right? If you can't predict it, you don't get it. I'm curious what you think about what it means to build scientific understanding in an era of really highly parameterized models that, as you say, have hundreds of millions of parameters. And although they're not black boxes in the strictest sense, like those numbers are on the computer, they're all available for inspection, it's not really clear what it is that they're doing. But we can evaluate their performance pretty clearly. So when we have these tools that are predicting well, but what they're doing doesn't fit in the carrying capacity of the English language - what does it mean to understand in that context?
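The loop described here (start from a random sequence, then keep changing it until the model is confident) can be sketched in a few lines. This is a toy illustration only, not the lab's method. The function `toy_confidence` is a hypothetical stand-in that simply rewards an alternating hydrophobic/polar pattern; a real hallucination run would score sequences with a structure-prediction network instead. All names and parameters below are assumptions.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 natural amino acids
HYDROPHOBIC = set("AILMFVWY")


def toy_confidence(seq):
    """Hypothetical stand-in for a network's confidence; NOT a real structure predictor.

    Rewards an alternating hydrophobic/polar pattern so the loop has something to climb.
    """
    matches = sum((aa in HYDROPHOBIC) == (i % 2 == 0) for i, aa in enumerate(seq))
    return matches / len(seq)


def hallucinate(length=30, steps=2000, seed=0):
    """Start from a random sequence; keep single-residue mutations that do not lower the score."""
    rng = random.Random(seed)
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    start_score = toy_confidence(seq)
    for _ in range(steps):
        candidate = seq.copy()
        candidate[rng.randrange(length)] = rng.choice(ALPHABET)
        if toy_confidence(candidate) >= toy_confidence(seq):
            seq = candidate
    return "".join(seq), start_score, toy_confidence(seq)


final_seq, start_score, final_score = hallucinate()
print(f"score went from {start_score:.2f} to {final_score:.2f}")
```

The design choice worth noting is that nothing in the loop knows what a good sequence looks like; all of the knowledge sits in the scoring function, which is the role the trained network plays in the approach described above.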
David Baker 22:20
I think that is the question for our era, isn't it, in so many different areas. And I think there are a couple ways to look at it. First of all, we certainly have the notion of humans understanding something. And human brains have billions of neurons, where we also don't really understand the connectivity. But when we say that a human understands something, that means that they can solve a problem that they couldn't solve before. As far as the difference between knowledge that can be encapsulated in a few elegant formulas versus in a network with millions of parameters, I have this discussion with my students quite a bit, and I give them an analogy. Say that in the fifth century before Christ, we had ancient Greece, where truth and beauty ruled, and everything was understood. And it was a height of human civilization to some extent, at least for the time. Then later, we went into the dark ages, where everything was mysticism and magic, and there were no principles. And then later the Renaissance came and truth and beauty emerged again. And you could look at the era of deep learning models as mysticism and magic for the reasons you were alluding to. But clearly, historically, you know, there was then sort of a ‘coming together.’ And this may be just a phase in science and technology. And it could be that we'll come out at the other end with an understanding that can be encapsulated in the human language. I think that's a really interesting question, and I think time will tell.
Well, let me push you on that. So is it going to be the case that science is always going to stumble on problems that can be described succinctly from person to person, like what we're doing right now? Or are we going to push into areas of nature where an understanding of what's going on simply will not fit into the normal way that we communicate understanding?
David Baker 24:12
Well, the Rosetta energy function and any molecular mechanics energy function is complicated. There are many, many parameters. And the functional form is complex. So even something that we would call a physical model still can be very complicated. So I think probably, there are levels of understanding that can't be conveyed in a few sentences. And you know, you take one of the highlights, you know, the Standard Model of particle physics - that can't be conveyed in a few sentences either. I mean, it rests on a huge body of work. But one thing I was going to say is, I think that another thing that is going to happen is we're going to see the physical models coming together with the deep learning models. And deep learning models are very good when you have really rich and detailed training sets. But when the training sets get sparse, it's much, much harder to train the models and you get into overfitting problems. So, in the case of biology and, say, proteins, if you just have the 20 natural amino acids, we have, you know, hundreds of thousands of amazing crystal structures with the coordinates of thousands of atoms. But you start putting in unnatural amino acids or go away from protein backbones, and suddenly there's far, far less training data. And so that's where physical models - now the combination can become very powerful.
So earlier on in the podcast, we made reference to the protein design revolution. What do you mean by that? And what is the potential of it?
David Baker 25:32
Well, in nature, we see an amazing variety of proteins that solve all sorts of problems that arose during evolution. And you know, humans face all sorts of new problems now: we're heating up the planet, we're putting all sorts of toxic compounds out, people are living longer, so there's more diseases. So given the analogy of what happened during evolution, it's natural to think that new proteins could exist, or it should be possible to design new proteins to address these problems. Now, up until very recently, the way one solved a protein engineering problem was to look for the closest protein in nature and tweak it a little bit. And the protein design revolution, basically, is that now rather than starting from nature, and making small modifications, we can now build completely new solutions from scratch, which is really more the way that technological problems are solved outside of the biological realm.
I've recently learned that your lab has developed the most extraordinary thing, the so-called rotary protein machine, which would seem to look like a molecular version of a steam engine. Tell us more.
David Baker 26:36
Yeah, well, we're at the beginning. But one of the current frontiers for computational protein design is to design machines that consist of different parts that can undergo relative motion and do work. And so one of the first systems we've designed is a system of rotors that can rotate on axles. So we've been able to design such systems and the rotors rotate around the axles, but we can't yet drive directed motion. So this is really a current frontier for protein design. And you can imagine it opens up all sorts of new possibilities.
And I mean, one is just the technical achievement of making such a protein. Do you think that actually has applications in medicine or otherwise?
David Baker 27:18
Yes, I think that - one thing I've learned is that every time you make a technological advance or conceptual advance in this area, there are practical applications and translational opportunities that follow, and they can't always be anticipated in advance. So in the case of a machine, you could imagine untangling, you know, amyloid plaques, or clearing clogged arteries. But those are very fanciful. One of the things that's interesting, although I never imagined I'd be doing this 15 years ago, is that now, new students and postdocs coming into my group, rather than wanting to become professors when they leave - now many people come in wanting to start companies; they want to design proteins to solve really important pressing problems. And so these companies are in all kinds of areas I wouldn't have anticipated.
We're getting to the end of the podcast. And one question I can't resist asking is: you've seen the protein landscape in a way that few people have. What is your favorite protein, an example of one where you saw it and you said, this is truly amazing? And just to motivate, I remember seeing Rod MacKinnon show the structure of the potassium channel, and how it was able to be selective for potassium. And I just remember being like, wow, that might be one of the most beautiful scientific presentations I've ever seen. So, going off of the potassium channel, what's the equivalent of that for you?
David Baker 28:43
Well, you know, that's like asking me what my favorite child is. You know, I love them all.
So you love them all equally?
David Baker 28:53
I think my favorite one is probably, you know, I'm always sort of more looking towards the future than the past. My favorite will be the one we haven't quite designed yet. But we're actually about to do it. You know, I think that's the one I'm most excited about at any given point.
Fair enough. And actually, along those lines, what was it that actually made you decide to work on protein structure as your life's work?
David Baker 29:14
Well, it was a little bit indirect. And I think like, you know, as life goes, I started when I was in college, actually. I first was a social studies major, and then a philosophy major, and I switched to biology in my last year, because I saw that science could actually move forward where those other fields were kind of just static. That was really attractive to me. When I went to graduate school, I said I was interested in the problems of development and the brain. And I actually did my graduate work in B-cell biology. And so it took me a long time to get to protein folding and then to protein design. And I think when I finally got there, I really loved it, because there are so many different ways you could approach the problem. You know, it's a computer science problem, a physics problem, a biology problem, a genomics problem. You could approach it with computational methods, with experimental methods. And in the end, it was also at the intersection between all these different fields. And I think with protein design, now we're really seeing that we can design proteins that have impact in all these different areas. And we're really taking advantage of advances in all the different areas. So that's the central location and sort of its fundamental... It's also like the simplest possible case of biological self-organization. And that was something that really fascinated me from the beginning about biology.
If you were starting a PhD today, what would you do?
David Baker 30:32
Oh, I don't know, I would advise people who are starting PhDs just to pick whatever problem they're most excited about and not worry too much about what they're going to do long term because at least for me, it's been many twists in the road.
Thank you so much, David. It's been such a pleasure to chat with you today. Just amazing discussion.
Yeah, thanks for coming on the podcast. I really appreciate your time, David.
David Baker 30:54
Thank you very much. That was fun.
We usually take time at the end of each episode, in the spirit of regular in person meetups in Boston many years ago, to discuss a big problem - the nail and possible solutions - and the hammers - inspired by what we just heard. Alex, do you have a hammer or a nail this week?
I have a big hammer, one of the newest, biggest hammers, I think, in machine learning and artificial intelligence, which is the large language model. You know, folks might have heard about this, there's some relatively big and famous ones like GPT-3, like LaMDA. So GPT-3 from OpenAI, LaMDA from Google. And these are very, very large - as their name states - models that have been trained on massive corpuses of text, I mean, like significant fractions of the internet. And they have learned to digest all this information, and then they can actually spit out new text prompted by inputs from users. I think the link to the conversation that we just had with David Baker is that these large language models can be used to speak not just English, and not just the other human languages that are in the text corpora that they're trained on. They can also be repurposed, not without effort, you know, not without skill, but they can be repurposed to speak the language of life. So they can be used, and you know, you might say ‘fine tuned’ or ‘retrained,’ to spit out or to produce amino acid sequences, which you can think of as chains of text, or chains of letters, where each of the 20 amino acids in the human protein corpus can be represented by a letter. In fact, there are various stereotyped letters that we use and have to memorize in biochemistry, when you're an undergraduate. And you can just treat that as a big, long string of text, and it turns out these LLMs (large language models) can actually generate interesting amino acid sequences.
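[Editor's note] To make the "amino acids as letters" idea concrete, here is a minimal sketch of how a protein sequence could be turned into the integer tokens a language model consumes. The one-letter codes are the standard biochemistry ones; the names `AMINO_ACIDS`, `encode`, and `decode` and the example sequence are made up for illustration and are not taken from any particular model or from the episode.

```python
# The 20 standard amino acids, written with their conventional one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assign each letter an integer id, the way a language model vocabulary would.
TOKEN_ID = {letter: i for i, letter in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Turn a protein sequence string into a list of integer token ids."""
    return [TOKEN_ID[letter] for letter in sequence.upper()]


def decode(token_ids):
    """Turn a list of token ids back into a protein sequence string."""
    return "".join(AMINO_ACIDS[i] for i in token_ids)


# A made-up ten-residue sequence, for illustration only.
example = "MKTAYIAKQR"
tokens = encode(example)
round_trip = decode(tokens)

# Number of possible sequences of length 100 over 20 letters.
num_possible = len(AMINO_ACIDS) ** 100
```

Once a sequence is a list of token ids like this, it can in principle be fed to the same kind of model that is trained on ordinary text, which is the repurposing described above.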
Ah, this is a great topic, Alex, I can't wait to dive in. So where do you want to start?
There's a couple things that make these things new. There's a couple things that make these things interesting - these large language models. One thing that's interesting is the massive increase in scale that is required to bring them into existence. And in a way, it's kind of a continued validation of the story of deep learning, which is, as models get bigger, meaning - as the datasets get bigger, as the number of parameters in these models gets bigger - they get better. And I think there's a lot of skepticism about whether or not that trend was able to continue. And I wasn't sure, you know, like, what if you made these networks 100, or 1000 times bigger? Would they keep getting better? It wasn't clear. But I think these models like GPT-3 and LaMDA have answered that in the affirmative, these are much bigger models than what came before and they are still getting better. And what they're able to produce is kind of shocking in terms of the naturalness of the language they're able to spit out.
Yet, it's still not sorted out why such big models work. Now a lot of this is really general - tell me specifically about how this plays out in large language models.
So you know, I was walking with a friend this morning, and he shared with me the two principles that he takes to be true in machine learning - at least in deep learning these days. The first is that there are some models that work extremely well, like convnets for images and transformers for text. And the second principle is: nobody can really explain why they work so well, or at least not in detail that would be sufficiently satisfying to, for instance, a physicist or a biologist. But nonetheless, they work. So, what's interesting about these large language models is that the algorithms have been around for more than a couple of years. So there's a seminal paper called ‘Attention is all you need.’
Yeah, it was out of your friends at Google research, right?
Yeah, it’s a Google Brain paper. And so that's Ashish Vaswani, Noam Shazeer, and Niki Parmar, and some other researchers as well. Some of those folks have gone on to work on these models outside of Google. This was kind of a watershed moment in understanding new ways of building architectures for large language models. And I think the world has really taken it and run with it. And these models are a really exciting avenue for machine learning.
This is one where I remember reading the ‘Attention is all you need’ paper - I find it still very impenetrable on why it works so well. And it was clearly a group of researchers with very deep intuition, who had been building models for a long time, and had an idea of something that would work. And yet, as someone who isn't part of the field, I look at it, and it just seems out of thin air, like completely unmotivated. I mean, you know, what is attention in deep learning? And why is it ‘All we need?’
So attention is the ability for a neural network to look at the entirety of a sequence of tokens. So that might be a whole sentence, and decide to propagate information from just parts of it, and basically ignore or suppress information from other parts of it. And sometimes there's multiple heads of attention. So that would be like having multiple foveas, or multiple places where your eyes are looking, which sounds a little bit unnatural, but that's just how the engineering works. And, in order to understand the long-term structure, you might need to attend to very different parts of the sentence. And another example might be these confusing things that we say as humans, which is like, ‘I’m absolutely disgusted with how good that movie was’ - you need to know that there's a negative valence word that needs to be ignored in order to get the true valence and say that actually, this person writing the sentence in this review quite liked the movie, they just have a particular way of saying it. So it's in general a mechanism for attending to, or learning to attend to, the parts of an input that are going to give you the information you need in order to make predictions or to forecast the future of what the next sentence might be.
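[Editor's note] The description above can be sketched as scaled dot-product attention, the core operation in the ‘Attention is all you need’ paper: each token scores every other token, the scores are turned into weights that sum to one, and the output is a weighted mix of the other tokens' information. This is a minimal single-head sketch in plain Python, assuming toy two-dimensional token vectors with made-up numbers; a real transformer first passes the tokens through learned projections to get queries, keys, and values, and runs several heads in parallel, which is left out here.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key in the sequence.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        # High-scoring positions are attended to; low ones are suppressed.
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors (made-up numbers, illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Here the first token puts more weight on itself and on the third token than on the second, because its vector overlaps with theirs. Nothing restricts the weights to neighboring positions, which is what lets the model relate words that are far apart in a sentence.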
So is it fair to say like, if you think about a CNN, it kind of tells you, for a pixel in an image, the pixels around it are likely to be very important in understanding that pixel. And then, of course, you kind of hierarchically build it up with many layers. But it's kind of - imposing local geometry is important. Is it fair to say that attention is kind of the same thing, but looser? And so it doesn't have to be the words that are right next to each other, but it lets you kind of build models where - the convolutions, if you will - are farther away.
Yeah, I think it's kind of flexible and learned, instead of baking in geometric-type importances, as you say, and this is something that the model can learn from data.
You know, and it's just going back to the theme of biology, I saw a paper out of David Kelley's group. That was quite interesting, using transformers to understand gene regulation. And again, this attention thing is really interesting, because human genes have regulatory elements. This goes back to what we were talking about with Rosana that only 2% of the genome codes for proteins. And a lot of the rest is information that controls when and where to turn the gene on, so-called ‘enhancers.’ And they can be located very far away from the gene that they control, like a million or more letters away. And so it was quite interesting to see the transformers actually working so well, in predicting gene expression and a lot of the same ideas of attention showing up in this kind of very natural biological setting.
Yeah, I think we'll see more applications of this kind of technology in more and more biological areas. I mean, this has kind of been the trend for a lot of machine learning techniques, is that it gets developed in a very methodological context. So it gets published in a machine learning conference with really no applications other than the benchmarks that the field is most familiar with. So like, it gets published in NeurIPS on a natural language benchmark that everybody recognizes. And that's great. That's how progress happens. And then some number of months or quarters or years later, biologists - and this is happening increasingly fast as more biologists, I think, have real machine learning and artificial intelligence and statistics jobs - they pick up on these models, and they adapt them to the domains where they can be most useful in understanding the relationship between sequence and function, or in developing new proteins or new aptamers, or new drugs.
You know, it's interesting, this is something Carolyn Uhler and I talk about a lot - I totally agree that what you're seeing right now is very much focused on bringing machine learning into biology. You know, I do hope that we start going the other way, a little bit, too.
I think it will happen. I mean, what I've seen in my professional scientific life is - when you work on really hard problems, and you try to adapt the best methods that you can find to those problems, you generally have to invent new things, and you get new insights. I think that our conversation with Jennifer Listgarten last year is an example of that. I mean, she lives exactly at the intersection. There's a lot of wonderful researchers that live at the intersection. Returning to large language models - there's a cool property of how some of these LLMs can be used that I think might accelerate new applications and accelerate interesting new developments in crosstalk. Here's how it kind of works. So some of these models are set up so that you can feed it some text, and it's called a prompt. And then you can ask the model to continue the prompt. So it's like starting the conversation, say, ‘Hey, how are you doing?’ And then, you ask the model to continue it. And it says, ‘Doing fine. How are you?’ And that might seem kind of natural and easy, but you can prompt it with really interesting things. So I actually took one of these models - and they're on the internet, some of these models are available to try, not always the biggest, most state-of-the-art ones, but they do exist, you can go search for them - and prompted it with some Japanese poetry. Can I read you one of the poems one of these models wrote?
Oh, my God, totally. I'm dying to hear this.
Yeah, I'll read you a couple and we can talk about it. Out, left, right, fly - where the path meets a clearing - running grass sounds like thunder. Wow. It's not a perfect haiku. Right? The metering is not there. But it's evocative of a real poem. Here's another one: Call, twitter, trill - they have flown - they're gone.
I feel so zen, right now!
Sometimes I'll go and search on the internet to see if this is just copying something from somewhere. It never is. It's new - these are poems that were written by an algorithm that's read a huge portion of what humanity has written. And it's coming up with something that is prompted by myself, or I'm saying write something like this - but it's coming up with something completely new. It's one of the most exciting things in machine learning I think right now.
Oh, here’s another one. This one, actually, I missed. A newly struck match - ah, it blazes with a colorful light - in darkness again…
And this is a poem, you know! So, it's not necessarily a great one. But it's certainly more poetic than what I could come up with on my own. And, you know, you don't have to prompt it with Japanese poetry, right? You can prompt it with amino acid sequences. And you can say, look, here's 10 examples - I'm just imagining, I don't know if this works, I'm sure somebody is trying this right now - but you can imagine prompting it with 10 protein sequences that worked to solve a problem, like maybe they glow green in the presence of some molecule you want to detect, alright, that's a common protein engineering task. Prompt it with 10 amino acid sequences that are known to work, and say, give me another 10. And then you can go make those proteins. And you can see if the machine got it right. And I think that this is a really exciting way of taking these models, which have absorbed enormous amounts of information - I think that much is true, that these models have read a big chunk of the internet, they know a lot, or they have access to a lot of information. But sometimes it requires tricks in order to prompt them to provide that information. And so this is called prompt programming. And it's this idea that you can have a fixed model or a model that changes only slightly per application. But what you ask it to do and how you ask it to do it can actually be the art or the engineering discipline of a way of automating how computers provide responses.
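[Editor's note] As a rough illustration of the prompt programming idea just described, here is a sketch of how a few-shot prompt could be assembled from known working sequences, ending on an open slot for the model to continue. Everything here is hypothetical: the function name `build_few_shot_prompt`, the instruction wording, and the short placeholder sequences are invented for the example, and no real model is called. As Alex says, whether this works in practice for proteins is an open question.

```python
def build_few_shot_prompt(instruction, examples):
    """Assemble a text prompt: an instruction, numbered examples,
    and a final numbered slot left empty for the model to fill in."""
    lines = [instruction, ""]
    for number, sequence in enumerate(examples, start=1):
        lines.append(f"Sequence {number}: {sequence}")
    # The prompt stops right where the next answer should begin.
    lines.append(f"Sequence {len(examples) + 1}:")
    return "\n".join(lines)


# Placeholder sequences, NOT real fluorescent proteins.
known_working = ["MKTAYIAKQR", "MSKGEELFTG", "MVSKGEEDNM"]

prompt = build_few_shot_prompt(
    "Here are amino acid sequences of proteins that glow green "
    "in the presence of the target molecule.",
    known_working,
)
```

The model itself stays fixed. Only the text of the prompt changes from task to task, so the same function could be reused with haiku instead of protein sequences.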
Fantastic. This was a great conversation today, Alex. I'm so glad we had it.
Oh, absolutely. It's fun to talk about this stuff.
Huge thanks to Professor David Baker. Next episode, we will be speaking to Karl Deisseroth about the future of psychiatry. And later in the series, we'll be meeting Geoff Hinton, the godfather of deep learning, to talk about how humans and computers learn - and Cynthia Kenyon to talk about the molecular and cellular basis of aging. In case you missed them, we've already covered the dark genome, gene editing and the future of cancer medicine. If you have any questions for us or our guests, email email@example.com or tweet at @GVTeam. We'd love to hear from you.
This is a GV podcast and Blanchard House production. Our science producer was Hilary Guite. Executive Producers were Duncan Barber and Rosie Pye, with music by Dalo. I'm Anthony Philippakis.
I'm Alex Wiltschko.
And this is Theory & Practice.
Director, Institute for Protein Design
David Baker is the director of the Institute for Protein Design, a Howard Hughes Medical Institute Investigator, the Henrietta and Aubrey Davis Endowed Professor in Biochemistry, and an adjunct professor of genome sciences, bioengineering, chemical engineering, computer science, and physics at the University of Washington. His research group is focused on the design of macromolecular structures and functions.