What is a thought? Some may think that question is quite abstract, but it has huge implications for science and computer design.
If we cannot define a human thought, how can we know if a computer can think? Only by answering that question can we know whether true artificial intelligence has been achieved.
This week we speak to the “godfather of deep learning”, Professor Geoffrey Hinton, a cognitive psychologist and computer scientist. He is now an emeritus Professor at the University of Toronto, and engineering fellow at Google.
Our wide-ranging discussion reflects on Professor Hinton’s journey into this field, his instrumental role in the deep learning revolution, and an analysis of when, if ever, computers might achieve the next level of intelligence.
This season we'll dive deep into the languages of life through explorations of the "dark genome", genome editing, protein folding, the future of aging, and more.
Hosted by Anthony Philippakis (Venture Partner at GV) and Alex Wiltschko (Staff Research Scientist with Google AI), Theory and Practice opens the doors to the cutting edge of biology and computer science through conversations with leaders in the field.
Hello, you're listening to Theory & Practice. I'm Anthony Philippakis.
And I'm Alex Wiltschko.
Some may think our talk today is quite abstract. What is a thought? But it has huge implications for science. If we can't define a human thought, how can we know if a computer can think? How will we know when we have achieved artificial intelligence?
Who better to help us work through this than the person credited as the godfather of deep learning. Without deep learning, we would not be talking about artificial intelligence today. Our guest is Professor Geoffrey Hinton, British-Canadian cognitive psychologist and computer scientist.
Now emeritus professor at the University of Toronto, and researcher at Google, he has received many accolades for his work over the last 50 years. In 2018, he received the Turing award with his colleagues Yoshua Bengio, and Yann LeCun. The Turing Award is dubbed the Nobel Prize of computing. The three colleagues received this award for conceptual and engineering breakthroughs that have made neural networks a critical component of computing.
All these recent awards don't really capture how much resilience Professor Hinton has needed in order to overcome what have been termed 'the AI winters': several periods of little or no funding for artificial intelligence, first in the 1970s, and then in the 1980s.
Today, we'll explore not just these weighty issues of how humans and computers learn, and maybe how computers even think, but also what sustains the scientists taking on an existing paradigm.
Professor Geoffrey Hinton, welcome to Theory & Practice.
Great to have you here. So there are a lot of things we could talk about, and a lot of things I hope we will talk about. But I'd like to start as early as possible, just to kind of get a sense of where you come from intellectually and the kinds of ideas that you started thinking about. So you started as an experimental psychology student, and not many experimental psych graduates in the 1970s would have ended up working on PhDs in artificial intelligence. What led you to that choice?
I guess, while I was doing my psychology degree, I decided psychologists were completely clueless about how things worked. And before I started my psychology degree, I'd been interested in neural nets. And basically, I decided we had to understand how the brain works to understand psychology. And the only psychologist I was really interested in was Piaget, because he at least had some theories, and he distinguished assimilation and accommodation. And they're what we would now call inference and learning. And so I found him interesting but very vague, and I thought, if we could sort of make computer models of what Piaget was saying, we could make progress.
I guess maybe this is jumping ahead in time. But people who study artificial neural nets are sometimes accused of using biologically unrealistic models. What's the interaction there between these two kinds of things in your mind?
When you really don't understand something like the brain, I think it pays to make sort of simple abstract models that you know are wrong in many respects. So things that I think of are like the Bohr atom, which was wrong in a huge number of respects, but still very useful. And I think people fail to understand that, just because there's all sorts of properties of neurons you're not modeling, it doesn't mean your model might not reveal some interesting principles. For example, how could things learn with no supervision. You can get to understand more about that problem and how to solve it without modeling faithfully all the details of neurons.
So it sounds like, if you're going to be wrong, just be wrong in ways you control.
Yes, but it's not so much about being wrong. It’s more about oversimplifying. And science always oversimplifies in order to get something that you can investigate and understand. Everything is an oversimplification. It's just that some oversimplifications are sensible and lead you to understand important principles.
And you know, going back to that moment in time, when you started thinking about neural networks, at this point, there had been very little work done, correct? And what was the state of the field at that moment when you entered it?
So this would be about 1966, when a very smart friend of mine at school got me interested in neural networks and holographic memories. And there had been early work by Rosenblatt, obviously, and before him by people like Turing and von Neumann, who were very interested in neural nets. Actually, the Perceptrons book is what came after I got interested in neural nets. So basically, there weren't any good learning algorithms. And that was kind of the core problem. Turing understood that was the core problem. And he had a kind of reinforcement view of learning that really doesn't scale very well. So it was pretty primitive back then. And we certainly didn't have backpropagation.
You mentioned a very smart friend who got you into neural nets - who was that?
He's someone called Inman Harvey, who was a brilliant mathematician, and then totally lost interest in it.
And you stuck to your guns, and the world is different today as a result. So you mentioned that the Perceptrons book came out after you started. That was an important moment in artificial intelligence, I think, for many reasons. Maybe you could tell us what was in that book, why its release was important, and how you reacted to it when it came out.
It was a quite impressive book, because it had some interesting theorems in it about what you couldn't do with shallow neural nets. But the whole thrust of the book was to kind of say that neural nets were very limited - at least the ones for which there were guaranteed learning procedures, like the perceptron convergence procedure. And even though they never explicitly said that you'll never get anywhere by making these things deeper, that's what they implied. And that's what people took the message to be. And that was a completely wrong message.
So why did you go into them? What made you decide that you're going to start focusing on them, given that book?
Well, the brain has to work somehow. And it has to be that it learns. And so there has to be some way of learning from data. And because there's all sorts of things we do that we clearly didn't evolve - we haven't evolved the ability to read, we learn that. And it seemed pretty obvious to me, despite what Chomsky was saying at the time, that we learn language. So it just seemed to me, there's this basic scientific problem, how does a big network of neurons learn stuff? And I think we basically solved that problem, which is stochastic gradient descent.
So what was your entry point into the field? What was the first significant problem that you worked on? And what did you learn from it?
Okay, the very first problem I worked on was generalizing the perceptron convergence theorem from two-state neurons to multi-state neurons. I got very excited by that, but then I discovered it was a footnote in some book - there's a book by Nils Nilsson that actually had the generalization in it - and that was depressing. But then I used these multi-state neurons, which you can consider to be sort of localist embeddings of things, to do true recursion in a neural net. So I recently discovered a paper I wrote in 1973 about how to make neural nets do recursion by using fast weight memories. I didn't publish it, because that sort of thing wasn't very popular at the time. And it was actually the very first talk I gave, and at the end of the talk, people in the research group I was in kind of looked at me and said, 'Why don't you just use a computer language?' They didn't have any idea why I was trying to show how a neural net could do recursion.
Oh, there's a lot of directions I want to take this. The things that I'm hearing so far that I'd love to dig into are… the very strong and persistent belief that there's something quite deep in these ideas. But also the persistence in and of itself, to say, 'No, this is important, and regardless of what you say, and whatever the ecosystem and the weather are, we've got to work on this.' And this is all being motivated by a mission to understand how the mind works, with these seeming like the best toolkits to understand that.
I think I can explain the persistence, and the belief that I must be right that the brain is learning connection strengths, and that's how we learn everything. And it's to do with my upbringing. So my father was a Stalinist - not a very popular position right now. But they sent me to a Christian school, because the science teaching was good, at least that's what they told me. And so from the age of about seven, all the kids at the school believed in God. And I didn't believe in God; I thought God was obviously rubbish. And my view is that I turned out to be right. So from a very early age, I got used to the idea that everybody else was just wrong and obviously wrong. I think that's very good training for a scientist. I should also add, my father's position was also wrong.
Yeah. How then, in that context, do you make sure that you try to destroy the ideas you have that might not be right? I mean, I find this an interesting push-pull in doing science, where you have to have these amazing dreams, but then you have to try to kill them as quickly as possible, so you don't waste time.
Right. There's an interesting result, which is, if you ask people, whether something really bad's gonna happen to them in the next five years, like having a close relative die of cancer or being in a major car accident, normal happy people estimate the probabilities much too low. And paranoid schizophrenics estimate the probabilities much too high. The people who estimate the probability about right, are the mildly depressed. Now maybe that's why they're mildly depressed. But they sort of don't have a very positive view of things. And I tend to be in that category. So I'm always not happy with the ideas I produce, and always trying to find something better, but always worried they might all be rubbish. So even though I have this internal, very strong belief about how the brain must work, I'm always very critical of my own ideas, at least I think I am.
So your own mind creates its own ecosystem of idea generation and critique continuously - is that kind of the idea?
Well, I'm sort of mildly manic depressive. And I think that's actually quite functional, that you have manic phases when you're creative and depressive phases when you tear it all down. And what survives is good.
I see. So what's on the whiteboard might have been written by someone in a different state. And so you can come to it fresh.
Interesting. Well, actually, let me dive deep into this for a second. One of the intellectual pendulum swings that I've had the pleasure of living through - and you've much more so been a driver of - was the initial movement towards neural networks. Then there was the movement away from them, towards support vector machines and statistical learning theory. And then the pendulum swung back towards neural networks with your own work. So, when the pendulum first started swinging away from neural networks, did you start to question them as a model system for the brain? Or was that a moment where you were still convinced that you were right?
Oh, I was convinced I was still right about the brain. I started to question whether they would ever be any good at the relatively small scale machine learning problems that support vector machines are very good at. And I was very pleased about it in about 2005, when we got neural nets - actually, without using back propagation at all, just using stacks of restricted Boltzmann machines and the wake-sleep algorithm - to do slightly better than support vector machines on MNIST. So that showed that even on those relatively small problems, like MNIST, neural nets, could compete with support vector machines. And they could do that even without using back propagation.
Okay, so let's actually unpack a little bit of that. So first, tell our audience what back propagation is and why it's important. And then also, earlier on, you talked about stochastic gradient descent, so talk about that as well.
Okay, so let's start with stochastic gradient descent. The idea is you start off with a big neural network, and it just has random weights in it. And what we're going to do is adjust the weights so that it works better. And let's suppose it's the simplest kind of neural network: it has some inputs, like pixel intensities, and some layers of neurons, and then some outputs that are like classes, or decisions about what class of object is in the picture. And stochastic gradient descent means you figure out, for each of the weights in the network - each of the connections between a pair of neurons - how changing that weight would make the answers better or worse. So you have some measure of how good the answers are. And you have to solve the problem of 'how will the answers change if I change this weight?' - that's the gradient. And stochastic gradient descent says: take a few examples, figure out just on those few examples whether the answers will improve if you increase the weight or if you decrease the weight, and how much they would improve by, and then increase or decrease the weight in proportion to how much the answers would improve. And just keep doing that for all the weights and for all the examples. And after a while, it'll get to be very good. Now, it's sort of a blind act of faith to believe that any algorithm like that could work, because you're asking it to search this incredibly complicated space of possibilities just by following a gradient. And there are all sorts of arguments about why that shouldn't work, like that it'll get trapped at local optima. And it turns out those arguments are basically rubbish. That is, they don't stop it working. And it doesn't actually get trapped at local optima - it gets trapped at saddle points. But it's a big act of faith to believe that, and I kind of accepted it because I couldn't see that there could be any alternative.
I think people in symbolic AI - it was quite reasonable that they didn't believe that could ever work. And for a long time, it didn't work for really big nets, but now it does.
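The loop Hinton describes can be sketched in a few lines. This is a minimal, illustrative version: the tiny network (a single sigmoid unit), the toy data, the learning rate, and the use of a numerical gradient to answer "how would the answers change if I nudged this weight?" are all choices made here for clarity, not anything from the conversation.

```python
import numpy as np

# A minimal sketch of stochastic gradient descent as described above:
# take a few examples, estimate how nudging each weight changes a loss,
# then move every weight a little in the direction that helps.

rng = np.random.default_rng(0)

# Toy data: 2 input features, binary class given by a clean linear rule.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w = rng.normal(size=2) * 0.01   # start from small random weights
b = 0.0
lr = 0.5                        # learning rate (arbitrary choice)

def loss(w, b, Xb, yb):
    p = 1.0 / (1.0 + np.exp(-(Xb @ w + b)))   # sigmoid output
    return np.mean((p - yb) ** 2)             # how bad the answers are

for step in range(500):
    idx = rng.integers(0, len(X), size=8)     # a few examples (a mini-batch)
    Xb, yb = X[idx], y[idx]
    # "How would the answers improve if I changed this weight?" - estimate
    # the gradient of the loss with respect to each weight numerically.
    eps = 1e-5
    grad_w = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        grad_w[i] = (loss(w + dw, b, Xb, yb) - loss(w - dw, b, Xb, yb)) / (2 * eps)
    grad_b = (loss(w, b + eps, Xb, yb) - loss(w, b - eps, Xb, yb)) / (2 * eps)
    w -= lr * grad_w                          # move against the gradient
    b -= lr * grad_b

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = np.mean(preds == y)
print(accuracy)
```

Real networks use backpropagation to get the same gradients analytically and cheaply, but the logic of the update is exactly this.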
Well. Okay, so let me keep on unpacking for a second. Now, to distinguish: there are maybe two, at least possibly separable, problems. One is, how does the brain work? And another problem is training a computer to recognize digits, like the MNIST example you gave. So let's go back to the 1990s. There was this new set of ideas on the table: statistical learning theory and support vector machines. And so is it fair to say that at that moment in time, and even today, you're much more interested in the brain than in teaching a computer to do well on classification tests? You kind of accidentally got very good at that. Is that fair?
Yes, that's fair. Now I'm very pleased that as a side effect of all this, we've got nice learning algorithms that are good engineering. But that's not my main motivation.
So when you saw the rise of support vector machines - first tell our audience what they are, how you perceived them, and how they contrast with neural networks, and then what it meant for your own line of research.
Okay, I'm going to describe a support vector machine, from the perspective of someone who believes in neural nets.
So it goes like this: you have a bunch of images, and you want to classify them. You want to say which ones are dogs and which ones are cats. And so what you do is use a very clever algorithm for choosing a subset of the images. And you take that subset of images. And when you get a new image, you compare it with each of the images in your subset, using a special kind of comparison function that says how similar it is. So the subset of training images you selected are called the support vectors. Your comparison function is called your kernel. So the new image - you get to see how similar it is to all the images in this special set. And then based on those similarities, you learn some weights that allow you to make a decision. So you take a kind of weighted average of those similarities. And that tells you whether it's going to be a dog or a cat. And the problem is, it's only got this one level of learning: it's learning the weights to put on those similarities, because its recipe for which images you take, and how you measure the similarity, is a fixed recipe - that's the kernel. And then there's this clever optimization for how you choose the support vectors and how you choose the weights. And that's quite sophisticated stuff. But it's basically just: compare with a few prototypes, use appropriate weights, then decide whether it's a cat or a dog.
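The decision rule being described - compare a new example with the support vectors via a kernel, then take a weighted sum - can be sketched as below. The data, the hand-picked "support vectors", and the RBF kernel are all hypothetical choices for illustration; a real SVM would find the support vectors and weights with the clever optimization Hinton mentions, not by hand.

```python
import numpy as np

# A sketch of the "compare with prototypes" view of a kernel classifier.
# This is NOT a real SVM solver: the support vectors and weights are
# simply chosen by hand to show the shape of the decision rule.

def rbf_kernel(a, b, gamma=1.0):
    """Similarity between two examples: nearby points score near 1."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hand-picked "support vectors" for two classes (hypothetical 2-D data).
support_vectors = np.array([[0.0, 0.0], [0.2, 0.1],   # class -1 ("cat")
                            [2.0, 2.0], [1.8, 2.1]])  # class +1 ("dog")
weights = np.array([-1.0, -1.0, 1.0, 1.0])            # one weight per SV
bias = 0.0

def classify(x):
    # Weighted sum of similarities to each support vector, then a sign.
    score = sum(w * rbf_kernel(x, sv)
                for w, sv in zip(weights, support_vectors))
    return 1 if score + bias > 0 else -1

print(classify(np.array([0.1, 0.0])))   # near the "cat" cluster
print(classify(np.array([1.9, 2.0])))   # near the "dog" cluster
```

The "one level of learning" point is visible in the code: only `weights` (and `bias`) would be learned; the kernel and the comparison recipe are fixed.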
Now did the initial success of this - did it cause you to question your own line of work and neural networks?
No, it was an annoyance.
This was just a clever game and clever maths that worked well on a toy problem.
Well, can we talk about that toy problem for a second? You know, MNIST has been this workhorse. I've spent a lot of time in MNIST. And it's almost a friend, in a way. You know, you spend a lot of time looking at these digits, and I know you have as well. So tell me where MNIST comes from. First of all, what is MNIST? And then how has it figured in your work?
Okay, MNIST is a data set of handwritten digits based on a data set collected by NIST. But MNIST was made by Yann LeCun and his collaborators. And it's designed for studying shape recognition. There's 50,000 training examples, 10,000 validation examples, and 10,000 test examples. And it's a very good way of testing out a learning algorithm quickly. If you've got a decent learning algorithm, it will work on MNIST, and if it doesn't work on MNIST, you should be very suspicious. So I think of it as the kind of Drosophila of machine learning. You might ask geneticists, why do you keep working on Drosophila? Don't we know enough about Drosophila already? And the point is, there's lots of results on Drosophila. Lots is known about Drosophila, so it's a very good way of doing basic genetics. And MNIST is the equivalent for machine learning. I probably know much too much about MNIST.
Some of my colleagues at Google Brain have names even for some of the specific examples in the dataset.
Oh, I have names for many of the examples. Yes.
Oh, yes. There's the Padlock 2, then the Stretchy 2. So, I had a graduate student called Roland Memisevic. And I thought I'd impress him one day, when he asked me - thinking this was a joke - what my favorite pixel in MNIST was. And I said my favorite pixel was (28, 13): 28 across and 13 up. This was a long time ago - I think it was 13 up. And the reason is, there's only one digit in the whole of MNIST that touches that pixel; it's meant to be in the surround. And the way you can touch that pixel is to have something that's very wide but has its center of gravity a long way to the left of its tail, and that's a very scraggly 2. So basically, if that pixel is turned on, you know it's that 2.
So getting that one right can be either great generalization or horrible cheating in the case of a learning algorithm.
And when I said this to Roland Memisevic, I thought he'd be very impressed by my detailed knowledge of MNIST. Instead, he said, Geoff, get a life.
That's excellent. Now, let me take us back for a second. You know, Geoff, we've talked a little bit about some of the algorithms around neural networks. Let's get back to this question of human thought. What have you learned about human thinking by all of your work from neural nets and kind of take us through the arc of it over time.
Do you mean, what have I learned about what a thought might be?
And how the brain works?
Yes. So I think what I've learned about how the brain works is that stochastic gradient descent really does work on big data sets. And it can do amazing things. So: lots of big models at Google, and similar models like GPT-3 at OpenAI, and wonderful things like AlphaFold - that's going to have amazing effects on science. Now, there's also a whole bunch of ideas that are going to help, in addition to just using backpropagation - getting a gradient and then just going for it. And I think we'll get even more progress by getting better ideas. Like, for example, using fast weights that adapt on a different timescale from the standard weights - they adapt much faster and can act as a temporary memory. I think that's been hardly explored so far.
But these are engineering insights, if you will. What do they tell us about the brain?
Oh, the fast weights is [applicable]. The brain definitely uses adaptation of synapses at many different timescales.
And we've hardly touched on that in neural net models. And the reason we haven't touched on it is because, on the computers we have, it takes a long time to fetch a weight from memory, and so you have to use it many times. And so you have to have mini-batches, in which you apply exactly the same matrix multiplies to all the different cases in the mini-batch. And that means you can't afford for the actual weights of the neural network to be a function of the recent history of the neural network. Because then the weight matrix is different for different cases, and you can't use mini-batches. And you lose by about an order of magnitude in speed. And that's enough that you just can't compete. So we'll never be able to use fast weights in neural networks until we have different hardware, in which you don't fetch weights from memory - you just use weights in place, like the Graphcore chip, for example.
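The mini-batch point can be made concrete. With one shared weight matrix, a whole batch of inputs goes through a single matrix multiply, so the weights are fetched from memory once and reused across the batch; if fast weights made the matrix depend on each example's recent history, every case would need its own matrix and that reuse disappears. The sizes below are arbitrary illustrations.

```python
import numpy as np

# Shared weights vs. per-example fast weights: same output shape,
# but very different amounts of weight data to fetch per batch.

rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 64, 64

X = rng.normal(size=(batch, d_in))

# Shared weights: one (d_in, d_out) matrix serves the whole batch.
W_shared = rng.normal(size=(d_in, d_out))
out_shared = X @ W_shared                    # one matmul, weights reused

# Per-example fast weights: a different matrix for every batch element,
# because each example's recent history has modified its weights.
W_per_example = rng.normal(size=(batch, d_in, d_out))
out_fast = np.einsum('bi,bio->bo', X, W_per_example)

print(out_shared.shape, out_fast.shape)      # same output shape
print(W_per_example.size // W_shared.size)   # batch-times the weights to fetch
```

On weight-fetch-bound hardware, that factor-of-`batch` blowup in weight traffic is exactly why per-example weights don't pay off today, and why hardware that keeps weights in place changes the calculation.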
Let me ask about that a bit. So I know this is a bit of a side conversation off of where we're moving towards - which is, what is a thought, and even trending towards speculation - but you bring up this topic of hardware. And I don't think we would have neural networks in the form that we do today without specialized hardware that can do extremely fast matrix multiplies - originally, you know, GPUs, and now quite specialized hardware. And, you know, I think we're seeing an evolutionary dynamic emerge, where some architectures are surviving because they can take advantage of this dense arithmetic - like transformers over more recurrent architectures. How do you see the interplay of the hardware and the models that we build? And what would you like to see?
Okay, I think in the long run, we're going to have two different types of computers. And they're going to be immortal computers and mortal computers. And what I mean by an immortal computer is, you have either a computer program or a bunch of weights for a neural net. And if the hardware dies - if that particular copy of the hardware dies - you can put the same weights on different hardware, and you can still run the neural net; or you put the same program on different hardware, and you can still run whatever that program was. So the program or the weights are immortal. They don't depend on that particular copy of the hardware. And that's great for things like banking, and remembering your social security number, and stuff like that, where you don't want to make mistakes. But if you want, for example, to have a toaster that you can have a conversation with, about how you like your toast, what kind of bread it is, and how long it's been in the fridge, you need a chip that can do all that - that has sort of GPT-3-like abilities - and costs $2 to make and runs at a few watts. And you're not going to get that with our current approach to hardware. I believe in the end, we're going to have computers where the hardware and the software are not separate. And that's what I call mortal computers. So the hardware will be produced not by fabrication, which makes things accurately, but by growing the computer, just like real biological computers are made. And every one will be somewhat different. I think of it like the root ball of a plant. If you have a whole bunch of plants in the same shape of pot, the envelopes of the root balls will all have the same cone shape, but the details of the root balls are all different. Even though they're all grown with the same rules, they end up with lots of differences. And so you can only get away with that if you've got a learning algorithm that can adapt to the hardware.
And once it's learned, it's adapted to that particular hardware, and it's no good for any other hardware. So when it dies, it dies. That's why it's a mortal computer. And the best it can do to pass on its knowledge is to distill that knowledge into other, younger computers. And I believe this is going to be pure analog hardware in the end. Possibly it's going to be spiking hardware, or one-bit digital hardware, but certainly not using 16-bit numbers. And it's going to be able to run at very low power and be very cheap to produce, because you grow it, you don't manufacture it. And these chips, whatever they are - these little biological computers - when they get old, you'll throw them away.
One of the things I find quite amazing is that we started this section of the conversation talking about what we've learned about the brain by studying artificial neural networks. And now we're actually talking about how we can design better computers by studying the brain. And so there's this natural interplay…
Yes, but let's get back to what is a thought, because I wanted to say something about that. So before I studied psychology, I studied philosophy. And I formed a very strong opinion then about what a thought is. And it's based on the idea that we have language, and we use this language, but we have a model of how the language works that's just wrong. It's completely wrong; it's as wrong as religion. So let me give you an example, just an everyday example of two words where you use them properly, and you think you understand what they mean. And you do understand what they mean, in the sense that you can use them properly. But you have a model of how they work that's completely wrong. The two words are vertical and horizontal. Suppose you ask a normal person: you have a bunch of aluminum rods, you throw them up in the air, they tumble and turn, and suddenly you freeze time. And for this random collection of rods, I'm going to ask you - are there more rods within one degree of vertical, or more within one degree of horizontal? Or is it about the same? Most people would say it's about the same. Actually, there are many, many more that are within one degree of horizontal than within one degree of vertical. But if I do the same with aluminum discs, there are many, many more that are within one degree of vertical than within one degree of horizontal. So for lines, vertical is special and horizontal is very common. For planes, vertical is common and horizontal is very special.
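The rods-and-discs claim is easy to check with a quick Monte Carlo simulation. The sketch below (sample size and random seed are arbitrary choices) draws orientations uniformly over the sphere and counts how many fall within one degree of vertical versus within one degree of horizontal.

```python
import numpy as np

# Monte Carlo check of the rods-and-discs point: a rod (a line) is
# "vertical" when its direction is within 1 degree of the z-axis, and
# "horizontal" when it is within 1 degree of the horizontal plane.

rng = np.random.default_rng(0)
n = 1_000_000

# Uniform random directions on the sphere.
v = rng.normal(size=(n, 3))
v /= np.linalg.norm(v, axis=1, keepdims=True)

# Angle between the rod and the vertical axis, in degrees (0..90).
angle_from_z = np.degrees(np.arccos(np.abs(v[:, 2])))

rod_vertical = np.sum(angle_from_z <= 1.0)     # a tiny cap near the pole
rod_horizontal = np.sum(angle_from_z >= 89.0)  # a band around the equator

# A disc's orientation is given by its normal: a disc is horizontal when
# its normal is vertical, and vice versa, so the two counts simply swap.
disc_vertical, disc_horizontal = rod_horizontal, rod_vertical

print(rod_horizontal / rod_vertical)    # many more horizontal rods
print(disc_vertical / disc_horizontal)  # many more vertical discs
```

The geometry behind the numbers: directions within one degree of vertical fill a tiny spherical cap, while directions within one degree of horizontal fill a whole band around the equator, which has far more area.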
Oh, that's interesting.
Most people don't know that. Most people average across the possibilities of lines and planes. And they also use 2D to get insight into 3D terms. And they end up thinking vertical and horizontal are a sort of symmetric pair. They're not at all symmetric in 3D; it depends on whether you're talking about lines or planes. So how our language actually works, and how we think it works, are quite different things. And it's the same for all the language you use for trying to tell other people what's inside your head. Obviously, you've got a brain, and the brain has big vectors of neural activities. And if I want to tell you something about the big vector of neural activities I have, it's not much use me telling you that neuron 52 is active, because my brain is different from your brain in fine detail. It just won't convey anything to you. So what I do is, I tell you something about a state in the real world that would have caused the state I've got in my brain. Or I tell you about something in the real world that would be caused by the thing I've got in my brain.
If I say, for example, I'm seeing pink elephants, I'm experiencing pink elephants - the words pink and elephant are applied to things in the world, not to things in the brain. So what I'm doing is telling you about a hypothetical world, a world full of pink elephants. So I'm getting at what's going on in my brain via the normal causes. And those are called sensations. I can also get at what's going on in my brain via the normal effects. So if I say I'm angry, that's a feeling. And it means pretty much the same as 'I feel like hurting somebody.' And if I say I feel like hurting somebody, again, I'm talking about actions in the real world. These mental terms refer to hypothetical worlds. What's funny about mental terms is that they're about hypothetical worlds, not about the real world. They're not about funny spooky internal stuff. That's just nonsense. They are terms that refer to things in the world, but in a hypothetical world. So feelings are about getting at my internal states by telling you what they might cause. And sensations are getting at my internal states by telling you what might have caused them. And then thoughts are very special, because they're getting at my internal states by telling you a string of symbols that might have caused that internal state, and that the internal state might also cause me to add to. So with thoughts we can get at them from both ends: we can get them from what they cause in me, and from what they would cause me to say. And that's because we have audio in and we have audio out. Notice we don't have video out - life would be very different if we had video out, but we don't.
We've got many dozens of muscles in our face that can project emotional content, right? So maybe it's just a low baud rate video out.
So the idea is all this language about sensations and feelings and experiences and thoughts. There's nothing spooky about it, as soon as you see what the trick is. And the trick is, you refer to your internal states or to other people's internal states, not directly, but via what will cause them or what they would cause. And once you see it like that, there's nothing spooky about mental descriptions. Mental states: the word mental just means hypothetical.
I think of the Cartesian Dualism kind of argument that, you know, there's thoughts that are independent of the brain and there's an idea that the brain is all that there is, and I guess you're sidestepping that, in a way, saying that the best way to describe a thought is not with the implementation hardware of the thought…
…but with the substrates that thoughts trade in - I don't know if I'm getting this right or summarizing it well…
The only place words are is in the inputs and outputs. Inside, it's all big vectors. And so the idea of this kind of language of thought, I think, is just nonsense. It's just that, in order to say what I'm thinking, I do it by saying what strings of input words would have caused me to think that, or what strings of output words this thought I'm having would cause me to produce.
Let me go back to your life's work of creating artificial intelligence. What are we missing, to be able to replicate what you're seeing right now in a computer? And what will it take to get us to true artificial intelligence?
I wish I knew the answer to that. I believe that one thing is adaptation at multiple timescales - that we have to have fast weight memories for short-term memory. Currently, for short-term memory in a neural net, you can use a recurrent net where the short-term memory is in the hidden state. But that hasn't got a very high capacity - you can't remember a lot of recent history that way. Or you can have an external memory of recent vectors of activities. But then you have to decide where to write and when to write, and when to read, and so on. The third alternative is that you can be rapidly adapting the weights. So you have fast weights that adapt rapidly and decay rapidly, and are a kind of overlay on the standard weights where you learn the long-term knowledge. And those fast weight memories give you a much higher capacity memory for what happened in the recent past. In fact, there's a paper with Jimmy Ba as the first author in 2016, where we show how you can use fast weight memories to integrate information over time, over multiple fixations. And looking back on that paper, the fast weight memory is actually a very simple version of a transformer. What's happening is, you store the history of hidden states in an associative memory. And when you get a new hidden state, you feed it to the associative memory, and what it gives you back is an average of the previous hidden states, weighted by how similar they are to the current hidden state. And that's like a very simple linear version of a transformer in which the keys and values and queries are all the same vector. And so although transformers had only just been invented and published, we were using fast weights to implement them in a way that the brain would be able to do.
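The associative memory Hinton describes can be sketched in a few lines. This is a minimal illustration of the idea, not the model from the Ba et al. (2016) paper: the fast weights are a decaying sum of outer products of past hidden states (a rapid Hebbian write), and reading with the current hidden state returns past states weighted by their similarity to it - which is exactly a simple linear form of attention in which keys, values, and queries are all the same vector.

```python
import numpy as np

def fast_weight_read(hidden_states, query, decay=0.95):
    """Minimal sketch of a fast-weight associative memory.

    The fast weights A are a decaying sum of outer products of past
    hidden states (written quickly, forgotten quickly). Querying A
    with the current hidden state returns an average of past states,
    weighted by their dot-product similarity to the query.
    """
    d = query.size
    A = np.zeros((d, d))
    for h in hidden_states:
        A = decay * A + np.outer(h, h)   # rapid Hebbian write + decay
    return A @ query                     # read out with the query

def linear_attention_read(hidden_states, query, decay=0.95):
    """The same read, written as attention: keys = values = the stored
    hidden states, and the query is the current hidden state."""
    n = len(hidden_states)
    out = np.zeros_like(query)
    for t, h in enumerate(hidden_states):
        weight = (decay ** (n - 1 - t)) * (h @ query)  # similarity
        out = out + weight * h
    return out
```

The two functions compute the same thing; the second form just makes the transformer connection explicit.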
So I think, once we can exploit fast weights sufficiently, we'll be able to get things like real recursion going on in neural nets, where you reuse the same neurons and the same weights for the recursive call of a function. For example, if I give you the sentence, ‘John dislikes Bill, because he was rude to Mary,’ the ‘he was rude to Mary’ is an embedded sentence, and you want to use the same hardware for ‘he was rude to Mary’ as you used for the whole sentence. But that means you have to free it up. And so you cannot possibly be using that hardware for remembering the main sentence - you have to remember it somewhere else. And I think fast weights are where you do that. And that was the very first talk I ever gave, in 1973. And I think it's still at the cutting edge.
So, you know, you've talked about fast weights twice now. I mean, is it fair to describe them as a little bit close to working memory?
Yes, I mean, you're going to use fast weights to implement a working memory. Exactly. You're not going to try and do it with neural activities.
And so, you know, you've talked about how some forms of transformers are crudely implementing fast weights. Does this mean that we're on the cusp of artificial intelligence, or will more be required - is this necessary but not sufficient?
Well, I always believe we're gonna get there in about five years. And I'm…
…one thing about five years is that it's always five years away, right?...
Yeah, but this time, this time, I've only got five years, so we'd better.
So, reflecting on the conversation, you've mentioned some ideas that you view as, you know, necessary for pushing the field of AI forward, and also some changes to hardware that would be required. But if we open up the aperture, and make a clear demarcation in our conversation - before, we were talking about things with evidence, having a regular kind of scientific conversation, and now it's just rank speculation about the future - what do you think the future holds for artificial intelligence?
Yeah, I'm very wary about predicting beyond about five years, because you just can't see what's going to happen beyond about five years. But I don't think deep learning has hit any roadblocks. I mean, it's still the case that as you scale it up, it works better, I think. There's now a tremendous amount of effort and thought going into making better hardware, and I think we'll get radically new kinds of hardware that are much better for doing neural nets. But I also think we need more ideas - ideas like transformers, for example, transformative ideas like that. And I agree with you: transformers were impressive because they could make really good use of the matrix multipliers. But they're also impressive because of the way they work - they're a very interesting way to do processing. And I think we're just going to get more ideas like that. And now that there are so many people in the area, almost everybody is going to be working on making big language models work epsilon better. But a few people are really coming up with original ideas that really advance the field.
What do you hope for most in your lifetime to happen in AI?
It's already happened. What I hoped for most in my lifetime to happen in AI is that people would finally realize that learning by stochastic gradient descent works. And I think that's fairly decisively realized now.
Wonderful. Thank you so much, Geoff. Anything else you want to say before we end our conversation today, which has been truly inspiring and fascinating?
I intended to mention something about smell, because I know you've been working on smell…
Let’s talk about smell.
It's very impressive what you've done. But I forgot.
I would always love to talk with you, Geoff. It's something that I'm obsessed with. I have my own set of things I would like to have happen in my lifetime. And I envy and admire what you've been able to do with your life's work. I aspire to that.
We usually take time at the end of each episode, in the spirit of our regular in-person meetups in Boston many years ago, to discuss a big problem - a nail - and a possible solution - a hammer - inspired by what we just heard. So Anthony, this week, do you have a hammer or a nail?
This week, I want to talk to you about a hammer. And it's one that I'm learning about, and that you've spent an incredible amount of time on: so-called representation learning, or how we can take data and transform it, mapping it into a new space where the underlying structure becomes clearer.
I love this topic. I think it's at the center of a lot of stuff that I love to do. And I also think it's kind of at the center of a lot of science. It's happening. It's super exciting. So what are you thinking in this space? What's exciting you about this topic of representation learning?
Well, first is how powerful it is.
I guess I should ask, what is it? Anthony, what is representation learning?
I was gonna ask you, Alex, why don't you start off?
Oh, gosh, you can't Judo me into this feat.
Okay, okay, okay.
So I'll give you my perspective on this. A favorite author of mine, Jorge Luis Borges, wrote a one-paragraph short story about an ancient kingdom that's been destroyed, and whose last act was to build a map of the nation. And the map was built at a 1:1 scale - one inch of the real world was one inch of the map - and they created a map that blotted out the sun and killed everyone. Or that's implied in the story; it's a surrealist idea. But I think it makes plain, and funny, the fact that we never build maps that are 1:1; we always build representations of the world that are smaller. And we do that so that we can navigate things. I don't want to have to travel from Boston to New York to figure out what's along the way - I want to plan a route, to see how long it will take and what gas stations I have to stop at. And I can use a map to do that; I can use my finger to travel around on the phone; I don't have to actually do it. That's what representations are for. They're compressed parts of the world that we can navigate easily, and maybe even learn things from. And this is something that neural networks and other machine learning models can do for us. So this is really important. And I think this is most famous, or most well known, in the context of image modeling - I think we might have talked about this in another season. If you train a neural network on a dataset called ImageNet, which is a big collection of images and labels of what's in the images, and then you take the internal representation inside that neural network and freeze it, you can use that instead of the raw pixels of the images. On another classification task, you can just use those embeddings as representations instead of the raw pixels, and you can actually do really, really well at predicting things. And that's because the neural network has learned a way to compress what's important about what's in an image.
So hopefully, that kind of makes it a little bit clearer.
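The frozen-embedding workflow Alex describes can be sketched roughly as follows. To keep the sketch self-contained, a fixed random projection stands in for a real pretrained ImageNet encoder (which it is not - every name here is illustrative); the interface is the point: the encoder is frozen, and only a trivial classifier is trained on top of its embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen, pretrained encoder. In the real workflow this
# would be, e.g., the penultimate layer of an ImageNet-trained network;
# here a fixed random projection just illustrates the interface.
W_frozen = rng.normal(size=(64 * 64, 128)) / 64.0

def embed(images):
    """Map raw 64x64 images to frozen 128-d embeddings (no training)."""
    return images.reshape(len(images), -1) @ W_frozen

def nearest_centroid_fit(X, y):
    """Fit only a trivial classifier on top of the frozen embeddings."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, X):
    classes = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1)
                      for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]
```

The point of the exercise is that the 128 numbers per image, rather than the 4,096 raw pixel values, carry the task-relevant structure, so even a trivial classifier on top of them can do well.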
You know, one of the things where I want to get a little geeky with you is this idea from earlier in the discussion - the back and forth, with Geoff Hinton, between so-called kernel methods and neural nets. One of the ideas of kernel methods is that I start with data that's finite-dimensional, and then take a feature map - some way of mapping it - and map it into a space that's infinite-dimensional. So I take images, which are, say, 256 by 256, and now I put them into infinite dimensions, and it becomes easier to classify, or do regression, or things like that. And yet, with neural networks, I'm going in the exact opposite direction: I'm starting with the same dimensionality, and then mapping it to a much smaller dimension. One of the things I wonder about is whether or not some of these ideas of mapping to high dimensions will make a comeback. The dimensionality of the data will still be the same, but maybe it'll be easier to see that structure if it's living in a much higher space. And one of the things that's recently come about, that I've been reading about and find very fascinating, is the so-called ‘neural tangent kernel.’ Are you familiar with this, Alex?
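The computational trick behind kernel methods, alluded to here, is that many algorithms only ever need inner products between feature vectors, and those can often be computed without building the features at all. A toy sketch (illustrative, not tied to any particular library): for a degree-2 polynomial map the explicit features are all pairwise products of coordinates, yet the kernel computes the identical inner product in the original dimension, and the Gaussian kernel does the same for a genuinely infinite-dimensional feature space.

```python
import numpy as np
from itertools import product

def phi(x):
    """Explicit degree-2 feature map: all pairwise coordinate products.

    For d-dimensional x this vector has d**2 entries; for higher degrees
    the explicit map blows up combinatorially, and for the Gaussian
    kernel it is genuinely infinite-dimensional.
    """
    return np.array([xi * xj for xi, xj in product(x, x)])

def poly2_kernel(x, z):
    """The same inner product <phi(x), phi(z)>, computed in O(d) time
    without ever materializing phi."""
    return float(x @ z) ** 2

def gaussian_kernel(x, z, gamma=1.0):
    """An inner product in an infinite-dimensional feature space,
    still computed in O(d) time."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))
```

The equality `phi(x) . phi(z) == (x . z)**2` is what lets kernel machines work "in infinite dimensions" while only ever touching d-dimensional vectors.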
You know, I've seen the name around, but I would love to hear what it actually is.
Okay. So, again, let me give a brief synopsis of what I perceive to be two big recurrent themes, and try to bring them together. On the one hand, you have neural network people, who use stochastic gradient descent to optimize the weights - the neural network parameters. They map the data into a smaller-dimensional space, and then do a linear classification at the last layer. And then you have people who do kernels, where I take the data and map it into an infinite-dimensional space. But because everything there is linear, I can do tricks, and I can actually compute in infinite dimensions without ever having to explicitly realize the infinite-dimensional thing. These are two dominant traditions in machine learning that have gone back and forth for at least the last 30 years. And there was this paper that came out four years ago, where they said, ‘Alright, when I train a neural network, I do gradient descent. So let's do a Taylor approximation of my neural network, and let's let that network go infinitely wide - not infinitely deep, but infinitely wide.’ And it turns into a kernel, which is really surprising, not at all expected. And so these two very different schools of thought actually ended up converging on the same mathematical structure. The dust hasn't settled, and there's a lot of back and forth. But there's this idea, within a lot of the mathematical machine learning community, that maybe what were thought to be two very different things - neural networks and kernels - are actually converging on the same set of ideas, and what we're actually discussing is how best to optimize them numerically. But the rubber will meet the road when it comes down to the representations that we learn. We all believe that the data lies on some low-dimensional manifold. But is it better to study that low-dimensional manifold in a low-dimensional space? Or is it better to study it in an infinite-dimensional space?
Too early to know and certainly I'm not the person to answer it. But I find it a fascinating area of research right now.
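The object Anthony is describing can be written down for a finite network too: the "empirical" neural tangent kernel between two inputs is just the inner product of the network's parameter gradients at those inputs. A rough numerical sketch, a toy construction of my own rather than code from the paper, with gradients taken by central differences so it stays self-contained:

```python
import numpy as np

def tiny_net(params, x):
    """A one-hidden-layer scalar network f(x; theta)."""
    W1, b1, w2 = params
    return float(np.tanh(W1 @ x + b1) @ w2)

def param_gradient(f, params, x, eps=1e-5):
    """Central-difference gradient of f w.r.t. all parameters, flattened."""
    flat = np.concatenate([p.ravel() for p in params])
    shapes = [p.shape for p in params]

    def f_at(v):
        ps, i = [], 0
        for s in shapes:
            n = int(np.prod(s))
            ps.append(v[i:i + n].reshape(s))
            i += n
        return f(ps, x)

    grad = np.zeros_like(flat)
    for i in range(flat.size):
        up, down = flat.copy(), flat.copy()
        up[i] += eps
        down[i] -= eps
        grad[i] = (f_at(up) - f_at(down)) / (2 * eps)
    return grad

def empirical_ntk(f, params, x1, x2):
    """K(x1, x2) = <df/dtheta(x1), df/dtheta(x2)>. In the infinite-width
    limit this kernel stays fixed through training, and gradient descent
    on the network behaves like kernel regression with it."""
    return param_gradient(f, params, x1) @ param_gradient(f, params, x2)
```

For a finite network this kernel changes as the weights move; the surprising result is that as the width goes to infinity, it freezes, and the two traditions meet.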
That's interesting. I'm definitely gonna have to learn more about that. Let me express a strong opinion, weakly held…
…which is, we're going to want these representations to be low-dimensional. And I'll reason by analogy: the history of human exploration and curiosity has been performed on two-dimensional representations, like the periodic table. If you go look at Mendeleev's original publication, it's just text laid out in a grid - there are no table lines or anything like that. And it has a bunch of holes, because he didn't know all the elements at the time. And so the map that he was constructing actually expressed areas of high uncertainty - unknown regions that were worthy of exploration and expeditions, basically. And it's the same for maps. On old maps, there's the "Here Be Dragons" region, and those regions should call out to explorers. The thing I wonder about, and worry about, is whether the things that are left to explore fit on a map that can lie flat on the table, or whether the concepts that we have to explore are necessarily high-dimensional, right? Our brain builds maps - our brain is much smaller than the universe that we navigate - and we build representations of people and objects and concepts, and they're not two-dimensional representations. Why should we expect our maps of the scientific concepts that we go explore, like the identity of cell types, to be necessarily two-dimensional? So this is something I've been thinking about, or struggling with. We need to build maps, but are they gonna fit on a piece of paper? And that's what representation learning is all about, to me. It's just saying, look, it's whatever dimension it ends up being. Let's go learn it. And let's go use that representation to keep exploring.
Yeah, so there's a whole set of things I'd love to pull apart here. First, I totally agree that a lot of the maps we want to build in our lifetimes are of much higher dimension than two - certainly the underlying dimensionality of ImageNet, while not the full pixel dimension, is much higher than two. And so I definitely agree with you that being able even to read the maps that we make will be challenging. And here I'm also haunted by one of the lessons of mathematics. The field that I liked was called topology - the abstract study of shape. And there's a general ethos in the field that questions about one or two dimensions are usually quite easy, got solved early, and were kind of boring. Then questions above five dimensions actually got really easy, almost trivial, because there was so much space to move around that you could quickly close all these conjectures. And the ones that took, like, forever - generations to solve - were in dimensions three and four. I would certainly believe that, if you go above four dimensions, it's not that there aren't interesting questions to ask, and good conjectures and hard theorems to prove that are important. It's that, because we're so embedded in three and four dimensions, we can't actually ask the right questions. And so it's very plausible to me that we might be in a very similar situation with a lot of our modern data types - let's say the Human Cell Atlas, you gave that example, or ImageNet. It's some high-dimensional manifold, and we might not actually be able to ask the most interesting questions about it, because they don't occur to us; our brain isn't wired to think that way. But in some ways, I think the example of Geoff Hinton in our discussion today is maybe a reason for optimism rather than pessimism.
In the same way that AlphaGo revealed entirely new strategies of gameplay that were hitherto unappreciated, you could imagine that the next generation of artificial intelligence algorithms might similarly prompt us to ask deeper scientific questions than we would have been able to reach unaided.
That makes me super pumped for the future, which I guess is the whole darn point of this podcast. Great thoughts - thank you for sharing, Anthony. Huge thanks to Professor Geoffrey Hinton. The next episode will be the last in this series of Theory & Practice, all about futures. We'll be speaking to Professor Cynthia Kenyon about a topic that affects us all: aging. In case you missed earlier episodes, we've covered the dark genome, gene editing, the future of cancer medicine, and protein folding. In the last two weeks, we've looked into how we feel, and this week, how we think. If you've got any questions for us or our guests, email email@example.com or tweet @GVteam. We'd love to hear from you.
This is a GV podcast in a Blanchard House production. Our science producer was Hilary Guite. Executive Producers are Duncan Barber and Rosie Pye, with music by Dalo. I'm Anthony Philippakis.
And I'm Alex Wiltschko.
And this is Theory & Practice.