Season 4 of our Theory and Practice podcast investigates the powerful new world of AI applications and what it means to be human in the age of human-like artificial intelligence. Episode 6 explores what happens when AI is explicitly used to understand humans.
In this episode, we're joined by James DiCarlo, the Peter de Florez Professor of Neuroscience at Massachusetts Institute of Technology and Director of the MIT Quest for Intelligence. Trained in biomedical engineering and medicine, Professor DiCarlo brings a technical mindset to understanding the machine-like processes in human brains. His focus is on the machinery that enables us to see.
"Anything that our brain achieves is because there's a machine in there. It's not magic; there's some kind of machine running. So that means there is some machine that could emulate what we do. And our job is to figure out the details of that machine. So the problem is someday tractable. It's just a question of when."
Professor DiCarlo unpacks how well convolutional neural networks (CNNs), a form of deep learning, mimic the human brain. These networks excel at finding patterns in images to recognize objects. One key difference with humans is that our vision feeds information into different areas of the brain and receives feedback. Professor DiCarlo argues that CNNs help him and his team understand how our brains gather vast amounts of data from a limited field of vision in a millisecond glimpse.
Alex and Anthony also discuss the potential clinical applications of machine learning — from using an ECG to determine a person's biological age to understanding a person's cardiovascular health from retina images.
Anthony 00:02
Welcome to GV Theory and Practice. This series is exploring what it means to be human in the age of human-like AI. I'm Anthony Philippakis.
Alex 00:12
And I'm Alex Wiltschko.
Anthony 00:16
So, today's topic is vision. Did you know that that's the sense that most people say they would least like to lose?
Alex 00:22
Yeah, most people, I might take exception for myself, I'm particularly partial to my sense of smell, as you well know.
Anthony 00:29
So I do indeed know that my friend, and it was great to hear more about your love of smell last episode.
Alex 00:34
It was a pleasure to have you in the lab and to show you around and to tell you what I do.
So far, we've been learning about humans from human-like AI. But our guest today, and I'm a fanboy, professor Jim DiCarlo, comes at it from a different direction. So he's using machine learning and an engineering mindset to learn about human vision and seeing as opposed to the other way around, where what we learn from the brain might inform the algorithms that we build. So he has a school of thought that there's three ways to demonstrate understanding: you can explain what you observe, but what's even harder is to predict future data. And the hardest is to control, to fix a system that's broken.
Anthony 01:13
I love this idea of reverse engineering. You know, for him, as someone who trained as both a doctor and an engineer, I can totally understand that perspective of not only wanting to explain and predict, but actually to fix. So our guest today is Jim DiCarlo from MIT, where he is a professor of neuroscience in the Department of Brain and Cognitive Sciences. And he's the director of MIT's Quest for Intelligence. So he's perfectly poised to understand and explain the complex nature of human vision. Professor Jim DiCarlo, welcome.
Jim 01:46
Thank you for having me.
Anthony 01:47
Let me start off with a really light and simple question. What has been our understanding of vision and seeing over the last few decades, maybe we can go back to the 60s, and talk a little bit about the experiments of Hubel and Wiesel, and what they did and what we learned from that?
Jim 02:01
Yes, of course, I'd love to talk about Hubel and Wiesel. I think that we as lay persons do not think of vision as happening in the brain; we think of vision as happening in the eyes. And of course, Hubel and Wiesel introduced us to what was happening in the early stages of the brain. So most of us think we see with our eyes, and we do see with our eyes, but how does the information that's captured by our eyes get transformed or changed by our brain? And Hubel and Wiesel showed us some of the early processing steps in the brain that take the light information from our eyes and start to translate it into signals that could be more useful to the brain in terms of supporting visually driven behaviors. And I could talk in more detail about exactly what Hubel and Wiesel showed. But they introduced us to that idea that inside the head is a machine that is transforming those visual signals.
Anthony 02:50
And maybe without going into too much detail, you can just say a little bit about the core principles of, you know, center surround and then building up lines and this hierarchy of processing inside the brain.
Jim 03:01
Okay, the context is that light strikes the eyes, but we need to convert those patterns of light into things that can allow us to behave in the real world. And that's the function of vision. And Hubel and Wiesel suggested that light is first analyzed by doing things like detecting edges, perhaps edges of objects, that are then pieced together over successive visual processing stages in the brain into complete wholes, say things like objects or scenes, so that we could support the kind of complex behavior that vision has evolved to support.
Alex 03:34
So how do you approach studying the visual system? Actually, I was at a talk that you gave, and this might have been 10 years ago. But what you said really stuck with me, which is, there's three ways of demonstrating that we really understand the system: you can build it, you can predict it, and you can fix it. And that's a very engineering mindset to studying a real biological phenomenon. So maybe, could you tell us about how you approach thinking about scientific problems and studying vision?
Jim 04:03
Yes, I mean, I was trained as a biomedical engineer and also a medical doctor. So I've always been thinking about how an understanding would be useful in the real world. So I'm really grounded in that tradition. And I realized that science often tries to explain, predict and control. So explain is: organize the data that we observe; predict is: predict future data from other experiments; and control would be: be able to do things like fix a system that's broken.
Alex 04:29
The background of my curiosity is how do you think about studying what seems to be an insurmountably complex biological system?
Jim 04:38
Yes. I mean, you might look at vision or any complex behavior and say, Well, this is too challenging. It's too hard. But anything that our brain is achieving, there's a machine in there, right? It's not magic. So that means there is a solution. There is some machine that could emulate what we do. And our job is really to figure out the details of that machine. So the problem is someday tractable. It's just a question of when. So that's the first step, first getting over a dualist mindset that maybe minds and brains are different. First, just throw that out the door and say there's a machine, our job is to find the right machine.
So then, once you say that, well, how do I search through a space of machines? What are the alternative machines that might be executing the visual behavior? There's a large set of what we would call scientific hypotheses: each machine, each imagined machine that we could build, would be one alternative hypothesis. And there's an infinite set of such things. So our job then as scientists, in that frame, is to search through them, reject the ones that are not in line with us, and promote the ones that are most in line with us. So that's how I approach it from a scientific point of view. And then that leads you to say, well, any tools that help us search through machines, any data that helps us search through alternative machines, ways to generate alternative machines: those are all the kinds of things that you need to execute such a program. And then it gets into the details of, well, how many machines can you test? And which data should you get? And how do we build these things to decide which one is the one that's most like our brain?

But I guess the big picture view there is that there's a machine that can be understood; it's a matter of thinking of the mind as the product of a machine. And everything we do is in that frame. And vision is just a great example of where that project is making great progress. It is not done, but it is on a path to discovering the machine that's running inside our brain to support our visual behavior.
Alex 06:38
I think that's a phenomenally exciting perspective, and really clear. And it really resonates with me. And I just think you've laid it out in crystal clear terms. And I'm going to use that framing actually, in the future. Thank you. I really liked that framing. Maybe using that terminology that you laid out for us. And this is going to be your perspective. How did you think about the machine of vision when you started your training? And how has your thinking about that machine changed over 30 years?
Jim 07:07
Right. So when I started in the 90s, the kind of broad strokes of the machine of visual processing, at least of the ventral visual stream, which handles the central part of vision, had been motivated by Hubel and Wiesel in the 60s: the idea that there's a hierarchy going from edges to corners to full objects across a series of visual areas that had already been identified. And there were even computational models that started to essentially lay down the notion of that machine, how you might build it. And when I say machine, what I mean is that it's not just words on a page, but something that is executed in a computer, that could be run, that could make predictions. And that's what I mean by finding machines. And some, most notably Fukushima in the 80s, had already proposed models or machines of that type based on Hubel and Wiesel's data. My colleague at MIT, Tommy Poggio, proposed newer versions of those in the 90s when I was really starting out in this area, and those ideas really connected for me because they were aligned with the biology that we knew, again, mostly from the work of Hubel and Wiesel and others in that vein.
Alex 08:13
…and Fukushima's work you're referring to is the neocognitron?
Jim 08:20
Correct.
Alex 08:20
You know, what might be the precursor to convnets?
Jim 08:21
Like the neocognitron, exactly, which was kind of based on Hubel and Wiesel: basically covering the visual field with a set of local filters that then get accumulated into more complex filters at a slightly bigger spatial scale, which then again cover the full space of the visual field and then get accumulated into another complex set of filters. It's a series of operations that are doing local selectivity and then building up invariance, tolerance to slight position or scale shifts, over larger regions of the visual field. And alternating those operations of selectivity and invariance in a hierarchical fashion is what was motivated by Hubel and Wiesel and then executed in the first model by Fukushima, called the neocognitron, in the 1980s. And that was the form of the machine that felt right to me in terms of the biology, but the details just weren't there.
So those were the broad-strokes machines that were motivating me when I started in the field of vision. And the task seemed to be one of now starting to work out, well, what are the details of such a machine? I really believed that was the style of the machine. But how to find the parameters of such a machine was what was unknown, and all those parameters matter. Just describing it in words, as I did, is not enough to make the machine actually work or actually line up with the biology. And what's changed, Alex, to your question about what's changed over the years, is the ability to optimize the parameters of such a space of possible machines, all of that form, which are now called convolutional deep neural networks: optimize the parameters to get them in line with actually performing the behavior, and it turns out they then line up with the brain in lots of interesting ways. And that's what we've been most known for, showing that kind of alignment.
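As a rough illustration of the alternating selectivity and invariance operations Jim describes, here is a minimal sketch in the spirit of the neocognitron and modern convolutional deep networks. It assumes PyTorch, and the layer counts and sizes are purely illustrative, not parameters of any actual model of the ventral stream:

```python
# A minimal sketch of alternating "selectivity then invariance":
# local filters (convolution) followed by tolerance to small position
# shifts (pooling), stacked hierarchically, then a behavioral readout.
import torch
import torch.nn as nn

glimpse = torch.rand(1, 1, 64, 64)    # one grayscale "glimpse" of the central visual field

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),   # selectivity: local edge-like filters
    nn.ReLU(),                        # threshold-like nonlinearity
    nn.MaxPool2d(2),                  # invariance: tolerance to small position shifts
    nn.Conv2d(8, 16, kernel_size=5),  # selectivity again, at a bigger spatial scale
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 13 * 13, 10),      # readout, e.g. scores for 10 object categories
)

scores = model(glimpse)               # one feedforward pass: glimpse in, category scores out
print(scores.shape)                   # torch.Size([1, 10])
```

Optimizing the weights of a stack like this to actually perform object recognition is, in broad strokes, the step Jim says changed over the years and produced models that line up with the brain.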
Anthony 10:04
So let's keep on going with this thread. Jim, what we've seen over the last, call it 20 years, and it's been an amazing time, is the rise of deep learning. And in parallel, we've learned a lot about the cell biology of the neuron and systems neuroscience. And you've really been at the interface of these two fields. So let's start going through: where does a modern convolutional neural network look a lot like the brain?
Jim 10:30
Okay, so a modern convolutional network looks a lot like the brain in terms of its macro architecture. So first of all, it's a convolutional deep neural network. Let's unpack that a bit. What does deep mean? Well, that means it has multiple processing layers: one layer feeds the next layer, which feeds the next layer, which feeds the next layer. And typically, such networks would have at least three, but often now they can have hundreds. When we look at Hubel and Wiesel's work and the anatomy of the visual system, it would suggest roughly about six cortical processing layers. But there's a whole space of models that would have different numbers of layers. They're called deep convolutional networks. So the deep part is the fact that it's hierarchical: again, largely feed forward, one layer processing into the next, into the next. That's the deep part.
Then there's the convolutional part, and there's the neural network part. So maybe we should back up. The neural network part means that inside this network are individual elements that we'll call neurons, or artificial neurons, which are essentially little summation devices that take some set of inputs from other artificial neurons, sum them up with some weights, and then maybe have a threshold operator or some non-linearity before they issue an output and communicate to other neurons. So part of any artificial neural network is having elemental units like that, all connected with each other. Again, those units are organized into areas, if you will, and if there is a deep stack of areas in a feed-forward network, we'll call it a deep artificial network.
And then finally, there's the deep convolutional artificial neural network, or convnet, as it's sometimes called for short. What's happening there is, if we think about the artificial neurons at any one level, they don't take inputs from all the possible neurons below, but from a local subset of neurons in the area below. They're only weighting over that set of local neurons. And then the trick of convolutional networks is that whatever operation one neuron is doing at one point in visual space is copied over the entire input space for that layer. And that operation is often executed as what's called a convolution, which means you just slide that spatial filter across all those input neurons. In the brain, we think of that as being executed in parallel. And so the word convolution sounds like something that engineers are familiar with, but it's not how neuroscientists would classically describe the visual system. But that's just an implementation detail, to simulate what Hubel and Wiesel told us: a way to think about the visual system.
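To make those two ideas concrete, here is a small sketch, assuming Python with NumPy, of an artificial neuron as a weighted sum passed through a threshold non-linearity, and of convolution as sliding that same local filter across the whole input (weight sharing). The sizes and the edge-detecting filter are made up for illustration:

```python
import numpy as np

def artificial_neuron(inputs, weights, threshold=0.0):
    """Weighted sum of inputs, then a simple threshold nonlinearity."""
    activation = np.dot(inputs, weights)
    return max(activation - threshold, 0.0)

def convolve2d(image, filt):
    """Slide one local filter over every position of the image (no padding)."""
    h, w = image.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + fh, j:j + fw]        # local subset of "input neurons"
            out[i, j] = artificial_neuron(patch.ravel(), filt.ravel())
    return out

image = np.random.rand(16, 16)                       # toy pattern of photons on the retina
vertical_edge = np.array([[1., 0., -1.]] * 3)        # a crude Hubel-and-Wiesel-style edge filter
response_map = convolve2d(image, vertical_edge)      # the same operation copied across visual space
print(response_map.shape)                            # (14, 14)
```

In the brain, as Jim notes, the "sliding" would instead be many neurons with similar filters operating in parallel across the visual field; the loop here is just the engineering implementation of that idea.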
Anthony 13:05
And just as importantly, where doesn’t a modern convolutional neural network look like the brain?
Jim 13:11
The things that are most different, that are often called out, are, first of all, as I mentioned, that these are mostly feed-forward neural networks. So level one projects to level two, which then projects to level three. And level two doesn't really project back to level one, and level three doesn't project back to level two. So it's just feed-forward flow in that sense for the basic convolutional neural networks, whereas we know the visual system has anatomy that suggests connections both feed forward and feed back. And there are subtleties in there about how to think about those data and what they tell us, but that's one difference that's pointed out between convolutional networks and the neural network running in the ventral stream. One other difference is, as I mentioned already, the convolution notion: that operator, as implemented in machines, is often applied in ways that assume you have identical copies of neurons at every point in visual space. We don't think biology has exact identical copies, but it probably has something that approximates identical copies. So I think that's a small difference. But it's often pointed out, again, as a word that sounds unfamiliar to neuroscientists, though in practice I think it's a small difference. The third difference, which I think is most interesting to people working in this field, is that these networks often have to be tuned or trained in a certain way. And techniques such as backpropagation and other optimization methods are used to train all the weights inside these deep convolutional neural networks, the parameters, if you will, of these networks. And the way that training takes place is sometimes called learning, but it is probably unlike the kinds of learning that happen in the visual system. And so those are the major similarities and differences as I see them.
Anthony 14:52
Yeah, so this is fascinating. So, let's first just touch a little bit more on the structure of the network. What I heard from you is that one of the ways in which the brain is a lot like a convolutional network is actually this convolution operator, which does some kind of smoothing and keeps on propagating it upwards. Where I heard you say it was different is that the brain is feed forward and feed backward. So why would the brain want to have a feed backward step? And why don't we add that into our modern convolutional neural networks if it's a good idea?
Jim 15:24
Yeah, the question about feedback in the visual system is, of course, a great one and an open one. So this is, as you say, a structural question: hey, the networks that we use to model the ventral stream are mostly feed forward, but we know from the anatomy that there's feedback. So first of all, a couple of points. As an engineer, when you model any system, there are lots of details or choices that you might look at and ask, is that relevant to the function or not? I just want to keep that in mind. First of all, the neurons, the artificial neurons: you didn't say that they don't have dendrites, or that they don't have channels, right? There are all kinds of structural details that are ignored by these networks. Keep in mind that any model is going to be an approximation. So our choice to stick with feed-forward networks as a start point was not one of rejecting the anatomy showing that there's feedback, but of knowing that a lot of the physiology was consistent with a dominant feed-forward flow. When you show an image, you see neurons light up in a sequence across those different areas, as if the feed-forward flow were the dominant force in the ventral visual stream under natural viewing conditions. So it was data like that that suggested to us: if you're going to start modeling, start with an approximation of the feed-forward network to get going, realizing that we'll have to return to thinking about the function of the feedback system.
Alex 16:43
Jim, can I ask a question there? So in thinking about the feed-forward aspect of human vision, should I think of CNNs as us taking in a scene and scanning across what we see? Or should I think of a CNN as a glance?
Jim 17:01
Yes, great question. So when we're using CNNs to model the ventral stream, we're thinking about when a primate, you and I as human primates, looks at an image. But we don't just dwell on that image at one point; our eyes bounce around and sample from many points in the image.
Alex 17:20
How fast does that happen?
Jim 17:22
It happens several times per second, and we don't notice that we're doing it. So we're thinking about what's happening in one of those sub-second glimpses, just what's happening in that tiny glimpse, which might be on the order of 200 milliseconds. And we're all making those movements right now; we barely notice that we're doing it, especially if you're looking at a page on a screen or a scene that you haven't seen before. So as your eye reaches a new location, you can think of it as getting a new pattern of photons on the eye at that location. And that pattern is supplied to the eye for about 200 milliseconds. And then your eye moves on to go gather information somewhere else. So we are mostly interested in exactly what you are gathering in just that first 200 milliseconds. And that process is a very strongly feed-forward process. So the models that we're building are trying to capture that aspect of vision, in just the sub-second glimpses. And we think there will be other models needed that help stitch the outputs of that process together over longer timescales of seconds, or even minutes, which we would still call part of seeing, but which have not been the focus of what deep networks have been modeling. So this has both a temporal scale, about 200 milliseconds, and, I didn't mention it, also a spatial scale of around the central 10 degrees of vision. So again, we're modeling the ventral stream. Even though we have a large visual field, something like plus or minus 140 degrees, we're really only modeling the central 10 degrees: not our peripheral vision, but our central vision. So that's the space over which the ventral stream is gathering data. And that's the space of models that our team has been most focused on. So again, it's important to keep that spatial and temporal scale in mind, and the context of vision as a whole.
Alex 19:02
I think that's a really important way to anchor ourselves. But it's also, it's surprising, right? Because that's not a lot of time. And that's not a lot of spatial extent. So we can make decisions visually in a second or in a few seconds, and sometimes complicated decisions. How can we possibly have all the information that we need in order to make these decisions?
Jim 19:24
First of all, we are amazing in our decision capability, but we don't always make perfect decisions. We also have mechanisms to dwell longer when the information is uncertain, and I'm not talking about any of those processes at the moment. But I think one of the things that supports the question you asked, how can we do this, is that many studies in our lab and others, well before us, showed that even in 200 milliseconds we can do a remarkable amount of information extraction. Most notably, work by Molly (Mary) Potter and others showed, with rapid presentation of images, just 200 milliseconds, even 10 milliseconds of an image, you can estimate: oh, that was a dog, that was a cat, that was a tree. That, of course, is not everything that you described. But it's surprising how much the visual system extracts in that short amount of time. And that's what we've been focused on modeling: just accepting that fact, positioning it within the idea that we do move our eyes roughly every 200 milliseconds, and saying there's something interesting there. Just how do we get a lot of information in that one glimpse? That's what we've been focused on. And that leads to those feed-forward models that are quickly able to estimate what's out there in just that short glimpse of 200 milliseconds. And I'm being cautious here to acknowledge that this will not solve all visual intelligence problems. As you say, when we're at a street corner, we don't just get a 200 millisecond glimpse and then walk across the street, right? We have to sample the information. But I'm pointing out that each of those samples is already quite powerful. And that's where the deep networks have especially enabled us to make advances, understanding that power in just those short glimpses.
Anthony 20:57
So is it fair to say that where there's the closest resemblance between a convolutional neural network and our visual system is in this first 200 milliseconds, where we get a glimpse? And then let me go back to that question of where the feedback comes in? Is that to inform us where to look next? Or where does the feedback come into the system?
Jim 21:16
Great question. So again, there's multiple levels of feedback you mentioned, one of them is, at some point, again, our eyes move. So the fact that our eyes move, we could call that a form of feedback, something is moving the eyes and then we're resampling for another point of information. Now, it's not thought that the ventral stream and the visual system that I study is what drives those movements directly. There are other parts of the brain that redirect the eyes to other information sources. But even that organismal level feedback, if you will, is an important form of feedback, just where do the sensors, the eyes tend to point over time.
But you're asking also about the wiring inside the ventral stream that connects those hierarchical areas to each other and what is it doing? The wiring just means there's the potential for action potentials to flow backwards down the system. And then your question, really, in that frame is like, what are those action potentials doing? Like imagine what if they weren't there? Would I not be able to see? What abilities would I lose if I could cut the wires that flow backward down the visual system? So I'm just trying to kind of channel the notion that feedback as a general idea over long timescales is far beyond the ventral stream. But then there's this sort of more local question about ventral stream and deep networks about feedback within the wires of the ventral stream. So those are connected but different ideas.
Alex 22:35
As a neuroscientist, what you're describing are things that we've learned about the visual system from psychophysics, from direct examination of neural activity inside the skull, or from using MRI to see inside the brain. One thing that I think is interesting, and an approach that I think you pioneered, is using ML as a tool of discovery. That's something we want to understand in this series: how we can learn about ourselves using artificial intelligence, when in fact the script was different for much of the last 10, 20, 30 years, which was, how can we build better AI systems with what we know about ourselves? So it seems to me you've got this interesting feedback loop between natural and artificial intelligence. And many AI developers would say there's a point at which you need to stop biomimicry, that you need to spend time developing function differently for these machines. Why are you taking a different approach?
Jim 23:31
Well, for anyone who considers the brain and the mind to be a machine, I don't see how you won't end up with that approach. Because ultimately, you're just asking, where is a source of machines that can serve as hypotheses for how the system works? So you ask why I'm taking that approach: it's because I want to understand the machine in engineering terms, and that means we need a source of machines, a way of producing machines. And right now, there happens to be a good source, which people currently call AI. There have been other waves like that in the past, of machine generation by humans. So I guess I push back a little bit that this is entirely new; that idea of taking machines and then thinking about biology in that context is not really entirely new. It's just that we have better and better machines now, which we refer to as AI. And the thing that I was trained to do as a biomedical engineer is just to cross those things, to let one interact with the other. And so it's natural to think about where the source of machines is. That virtuous loop is serious science in any scientific domain. It's just a question of what machines people are using as models to work on their science. And that's a philosophy of science perspective. But in the case of brains and minds, the machines are complicated, and they tend to be called things like AI, which somehow makes people think that they're different. But they're still just machines, like any other scientific machine or hypothesis.
Anthony 24:52
Well, let me push on that just for one second, Jim. Planes can fly and birds can fly. But have we learned a lot about how birds fly from studying planes? And could it not be the same thing with the brain, that we're doing something that can think but by very different principles?
Jim 25:07
You're talking about flight now. So first of all, you know, I think it was George Cayley in the 1800s who started to sort of work out the principles of flight, and he actually studied birds. So there was, you could call it, the anatomy inspiration of flight, looking at birds 200 years ago, and then there was further work past that. But there have always been, in any of these scientific areas, and I mentioned this for the ventral stream, the inspiration of anatomy from Hubel and Wiesel that led to Fukushima, these long arcs of people taking inspiration from one side or the other and crossing it back and forth over decades. And flight follows that too. Beyond that kind of notion of how science proceeds, I think you're also asking, Anthony, if we're trying to build a visual system, how biomimetic should we make it in our AIs? And as a biomedical engineer and a human neuroscientist trying to understand how humans work, the North Star is always to come back to: if that machine is aligned with the biology, then that's a good move. If it's less aligned, then that's a bad move. And notice, that's where AI and natural intelligence can diverge, right? Our technology goals for AI don't need to be, let's build models of human brains, because those technology goals are different from my goals, which, as you, also trained as a doctor, know, might be to help people. Another way to put this: I could have a great flight model, a 747. But when the bird comes in needing a repair, a vet that only knows about 747s is going to say, what the heck is this feather? They're not going to be able to help that bird. So it's the same thing with brains and minds. We could have flying AI systems but have no ability to help us in things like mental health or education or other areas of impact. Which is why we do this work: to build scientific models of how our natural intelligence works, for all those payoffs beyond the AI technology payoffs.
Alex 27:01
I've heard the bird versus plane discussion several times, but I've never heard someone bring a veterinarian into the discussion. And I think that's really important and profound as to why we are doing this. You know, if a bird comes in and it needs fixing, what is the plane going to tell us about this? So one of the big challenges in front of us is correcting human vision loss, all the different ways that we can lose our vision. So what are you working on now in this space? What's your hope in this space?
Jim 27:30
Yeah, I must admit, I had a sort of, I don't want to call it a crisis, but a bit of a crisis, when we were working hard to find models in the sense of searching for machines that lined up with the brain's machine. That was sort of the mindset I described in our work. And I would go and present: hey, look, we found this new model that seems to line up well with the visual system that we're measuring, wow, isn't that exciting and cool? And I was struck by colleagues' reactions of, well, that's not an understanding, or almost, that's not science. And it made me sort of step back and go, wait a minute. Now I realize what I thought I was supposed to be doing isn't actually what people think we should be doing.
So that crisis sort of made me think, you know, they're right, in a way: what I need to do is show them the value of having such a model in the biology sense. And that leads to your question about, if we had a perfect digital copy of our visual system, what could we do with it in healthcare and education? Even if we somehow thought, well, I don't really understand it in the sense that I had hoped, in a Newtonian-mechanics-like way, it's still a digital copy. That's the extreme form: what could you do with such a thing in the healthcare and education space? And that leads to all kinds of interesting thoughts. There are a few that we're thinking about, but one obvious one is brain-machine interfaces, ways to inject signals into such a system that would replace the loss of sight. So if you lost your eyes, could you re-inject signals into your brain to replace your vision? That lines up with your vet analogy: if you didn't have a mapping of how our biology implements our visual processing, there'd be no way to do that without such an understanding. Again, back to 747s and birds.
But there are also things like asking how we can manipulate images to perhaps improve our lives and our mood. That's the sort of crazy idea that we're working on now, which is: the visual system connects to mood regulatory centers, and rather than just estimating affective state or controlling our affective state, could we have our computational models help us to see how we might beneficially modulate photons so that we would get improvements in our mental state? And we could imagine, you know, AR glasses that use the knowledge of those models to control the photons on our eyes to improve our mental lives. Those are the kinds of things that models might unlock, healthcare goals driven by a model-based understanding, even if it felt not Newtonian, not beautiful in the sense that scientists want their understanding to be.
Alex 30:00
You're talking about the white-hat implementation of the tech stack from Neal Stephenson's Snow Crash. What's the random imagery, or the carefully tailored imagery, that you can pipe into somebody's very, very high bandwidth sensors, right? Our visual system is taking in enormous amounts of information. How can you hitch a ride on that information in order to affect the other things in the brain that it's connected to? I think that's a fantastically inspiring idea. I mean, how far does it go in your mind? Obviously, we haven't done this in full yet. But what's the end state?
Jim 30:32
Well, the true answer is we don't know yet. And that's why the research is ongoing, there's probably limits to what we can do with those hitching a ride ideas. But that's what the models will tell us as we get those models to line up, they'll tell us what we can do and what we can't do. And exploring that is what's exciting right now. So I don't know the answer - check back in in a few years, we're working on it.
But I believe, I would just say, there's far more that can be done than we realize. And the thing I like to point out is that all humans that have ever lived, in all of human history, have never even seen a tiny fraction of the possible images that one could show to the eyes, right? So think about the number of images: even an image of 200 pixels by 200 pixels, and let's say it's black and white, that's two to the 40,000 possible images. And there's something like 10 to the 80 (10^80) protons in the universe. So this is a crazy amount of space to explore that we can't even say we've tapped at all. And the models are allowing us to tap that space and to try to ask those kinds of questions: what could we do to improve our lives by exploring that space? The models are unlocking that possibility, which is what science should do.
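The arithmetic behind those numbers is easy to check. A short Python snippet, using only the standard library, shows that two to the 40,000 is a number with roughly twelve thousand digits, dwarfing the commonly cited estimate of about 10^80 protons in the observable universe:

```python
# Count how many decimal digits 2**(200*200) has, versus 10**80.
from math import log10

pixels = 200 * 200                    # 40,000 binary (black-or-white) pixels
digits = int(pixels * log10(2)) + 1   # number of decimal digits in 2**40000
print(digits)                         # 12042, i.e. roughly 10**12041 possible images
print("versus roughly 10**80 protons in the observable universe")
```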
Alex 31:44
What's a concept in the philosophy of science that you think is under appreciated that people should know, that would help frame their thinking?
Jim 31:52
Well, one book that's influenced me a lot, and it sort of struck on something that I had been thinking but could never put into words, is Michael Strevens's book called The Knowledge Machine. He points out, in terms of scientific revolutions, that it feels irrational to want to step away from models or theories that are beautiful and understandable to humans, and to just stick close to the data and have the models simply explain the data. It sounds like what science should do, but it doesn't feel like a rational move, and humans prefer not to make it. And he points out that this is what limited science, and why a lot of scientific advances arose apparently so late in human history, really only the last few hundred years. I see that kind of thing playing out even in neuroscience, where people are reticent to just align the model with the data: they would rather debate the explanatory frameworks than work hard to just see if things are aligning with the data. What gets humans out of bed is the explanatory frameworks and the theories, and that motivates them to collect the data. But where the progress happens is in the work to align the models with the data. And then that unlocks a lot of the things we were talking about earlier. But we humans are sometimes in our own way, because we don't naturally engage in that process, even though it sounds like what we should do. That process under the hood is what's driving the progress. I think that's one takeaway I have from his book. And a big part of my efforts is to try to promote platforms that can make that easier to happen, even if it's not what motivates us as humans to get out of bed and do the work of measuring the data and building the alternative models, i.e., the alternative machines.
Anthony 33:28
I mean, just to push on that, what about the cases where a sense of mathematical beauty has been present, and led us to collect data that we might not have otherwise gone after?
Jim 33:40
Yes, and we're talking a lot about Strevens's book; he actually mentions that, and there's a big section on it. So it's not to discount the idea. And the way I think about it is: let a thousand flowers bloom. If some folks want to think about theories that are beautiful, and that inspires new machines in new ways, that's great; that will generate more machines. But if we never bring those models to bear on the empirical data, they will always live in the world of beautiful ideas and not connect to the biology. And that's what I meant: that's where the work of empirical science has to line up. And again, if you're more rewarded for thinking about your beautiful ideas than you are for bringing things into alignment with data, then you can see what happens if everybody does just the beautiful ideas. So I'm not saying we shouldn't have some of that. But we need that knowledge machine to run under the hood while also creating those beautiful ideas. So it's not an either-or for me; it's just to make sure we don't forget about aligning with the data.
Anthony 34:30
Jim, this has been an amazing conversation. Maybe just to close out, you could tell us just a little bit about what you most hope for the future of your work and where you want it to go.
Jim 34:39
Well, I think I just mentioned one area. My hope for the future of the work in visual science is that we will achieve a better and better scientific understanding of visual processing. And what I mean by that is engineering-level models, not just at the level of individual neurons and systems of neurons up to behavior, which is where we've mostly been focused, but coupling that goes all the way down, ultimately, to the molecular level. This is the long-run view: that those neurons themselves are well modeled at the molecular level, the types of neurons, the molecules that affect the neurons. And what would that enable? Well, it would enable us not just to design images, but to design other, say, molecular interventions, genetic modifications, other pharmaceutical interventions that could predictably modulate our visual cognition, maybe improve our lives based on that scientific model.
Anthony 36:18
Amazing. This has been an incredible conversation. Jim, thank you so much for joining us.
Jim 36:22
Really it was lots of fun to talk with you guys. So thank you for having me.
Anthony 36:33
Well, that was a really different take on our thesis of what do we learn about being human in the age of human-like AI and robotics? Alex, let's move on to the Hammer and Nails part of our podcast, where you and I talk about a nail, a problem, or a hammer, the solution, in honor of our in person meetups in Boston many moons ago. What has this episode inspired you to think about today?
Alex 36:54
This takes me back to some of the original hammer and nail topics we did in person, which is: what are the uses in healthcare of intelligent vision applications? So what can machine learning, what can artificial intelligence, do in the clinic? And how can it help us catch disease earlier?
Anthony 37:16
Awesome. So tell me some of the things you've been thinking about.
Alex 37:19
When I worked at Google Brain, there was an incredible study that came out that looked at images of the back of the eye. These are called fundus images. And so you can put a camera up, look right through the pupil, and see the retina and all of the vasculature, all the tissue, and it looks pretty gnarly; it's not something that you would expect to see in the back of somebody's eye if you're not a doctor. It looks like this kind of alien Mars landscape. And it turns out that the topology of that landscape, the structure of it, what it looks like, has an incredible amount of information, not just about the health of your vision, which is what the whole imaging setup is for: to say, how well can you see and how is it degrading?
It turns out that if you collect other information about patients, like their age or their gender or their smoking status, and you blind an AI platform to that information and try to predict it, so you provide some of it in a training dataset and hold some out in a test dataset, then just from an image of the back of the eye, you can predict somebody's age, you can predict someone's gender, you can predict their smoking status. So we might say that the eyes are the window to the soul, but the eyes are also the window to an incredible amount of information about who we are physically. And I just think that's incredible. I would never have expected to see that information leaking out in that particular place.
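As a rough sketch of the kind of setup Alex describes, here is what training a standard image network to predict age from fundus photographs might look like, assuming PyTorch and torchvision. This is illustrative only, not the pipeline from the actual study, and the tensors `train_images`, `train_ages`, and `test_images` are hypothetical stand-ins for real data:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# A standard image backbone with a single-output regression head for age.
model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)   # predicted age in years

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(train_images, train_ages):
    """One gradient step on a batch of fundus images with known ages."""
    model.train()
    optimizer.zero_grad()
    predicted = model(train_images).squeeze(1)
    loss = loss_fn(predicted, train_ages)
    loss.backward()
    optimizer.step()
    return loss.item()

def predict_age(test_images):
    """Predict age for held-out fundus images the model never saw in training."""
    model.eval()
    with torch.no_grad():
        return model(test_images).squeeze(1)
```

The same recipe, with the regression head swapped for a classifier, is roughly how one would approach labels like gender or smoking status.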
Anthony 38:44
You know, this is incredible. You and I clearly are thinking about things the same way. A couple of years ago, there was this wonderful paper that came out of the Mayo Clinic. A cardiologist named Peter Noseworthy, who I actually trained with at MGH before he moved on to the Mayo Clinic, had this incredible paper where they organized all of the ECGs that the Mayo Clinic had ever collected, so several million ECGs. They've actually had a great series of papers now, but one of the first ones showed that you could call a patient's age and gender from their ECG. And I have to be honest, this is something that no cardiologist can do. I was just like, this is crazy, the idea that that information is lying hidden in just a one-dimensional signal of your heart. I never would have expected that. You know, the reality is there are a lot of easier ways to get someone's age and gender,
Alex 39:33
….you can just ask them, you know.
Anthony 39:36
Exactly. But, you know, it begs the question of what other information is hidden about their health that could be used in a diagnostic setting. And in fact, this same group went on to show that there are all sorts of people who have hidden diseases of the heart muscle, so-called cardiomyopathies. So when you hear stories about the 18-year-old basketball player who suddenly drops dead on the court, they often have one of these diseases of the heart muscle. And they showed that you could actually find people from their ECG and predict who would have a cardiomyopathy. And then you could call them back and do echoes and show that, in fact, sure enough, it was real. So this is, I think, a really big sign of what's to come in the world of diagnostics, which is that these rich imaging modalities can now be explored with computer vision to find new clues about your health status.
Alex 40:26
There's something really deep and profound here. And I think the life-saving properties of these diagnostics are just the tip of the iceberg, right? So finding somebody who might die, catching that issue early, bringing them in, and then preventing it is incredibly profound. There's a mindset shift that we're at the beginning stages of making here, which is how we measure things. As a scientist, and perhaps for you, as a doctor and a scientist, we're taught that there are tools that measure one thing, like a thingometer, right? I've got this thingometer, and it just measures a thing. That's almost never the case; you can almost always get more information out of a given instrument or sensory modality than you originally think is there. Because the world is just full of signal, and it's trying to force itself into our senses, it's trying to force itself into the instruments that we have. So I don't know what kind of mindset shift is going to come. But it's so clear, there's a preponderance of evidence, layer upon layer upon layer, that there is an incredible amount of signal that is ready to be collected and ready to be put to good use, to save lives and to make people happier. And we're just beginning to scratch the surface of this.
Anthony 41:41
Totally, you know, in the spirit of hammer and nail, this is one where I think the hammer really does need to meet the nail. Because it's one thing to be able to collect the information and see new insights, but then it's a much greater bar still, in order to make them clinically actionable. So you know, go back to the example I said before, of predicting someone's age and gender, it's amazing. But clinically, not that impactful.
Alex 42:00
Yeah, right.
Anthony 42:06
On the other hand, with a small twist, it can be made clinically impactful. Let me give an example of this. And again, this is where a bit of knowledge of medicine needs to meet the knowledge of machine learning. If you go back to this idea of calling someone's age from an ECG, you can then say, well, I knew their age when that ECG was done. And so I can look at the delta age, or the age residual, where the computer said they were 83 years old, but actually they're 85 years old. And then I can start to correlate that with their health status. And you see that people who are substance abusers, their ECG often says they're older than their stated age. And people with really clean living and exercise are often called as being younger than their stated age.
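A small sketch of that delta-age idea, assuming Python with NumPy; the arrays here are hypothetical stand-ins for real cohort data, just to show the arithmetic:

```python
import numpy as np

predicted_age = np.array([83.0, 61.0, 45.0, 72.0])        # model's estimate from the ECG
chronological_age = np.array([85.0, 55.0, 46.0, 80.0])    # actual age when the ECG was taken
healthy_lifestyle_score = np.array([0.9, 0.2, 0.7, 0.8])  # hypothetical 0-1 health measure

delta_age = predicted_age - chronological_age              # positive = "older than stated age"
correlation = np.corrcoef(delta_age, healthy_lifestyle_score)[0, 1]

print(delta_age)      # e.g. [-2.  6. -1. -8.]
print(correlation)    # negative here: cleaner living tracks with a younger-looking ECG
```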
Alex 42:53
Which is the goal; we should all strive for that.
Anthony 42:56
Yeah. And so actually, with a small twist, you can take something that's clinically useless, which is age, and turn it into something very clinically meaningful, which is a biomarker of biological age, not just chronological age. I think over and over again, you see these examples where being able to understand what the actual medical or biological question is, coupled to the deep tools of computer vision, can have real impact.
Alex 43:22
I completely agree. And I was just reviewing some notes that I took on a book called The Box, which is a strange book; it's about the shipping container. And the quote, paraphrased, is: new innovations usually don't work that well, and certainly not in the areas where they end up being applied and most impactful. So I think this is a demonstration that we're in this era where we've got this new tool, and it's a toy in some regards. It's being used for things that, clinically, maybe are even frivolous, but are interesting and insightful. And with these little twists that you're talking about, I think we'll have a whole range of people working incredibly hard coming up with new ways to use this, and we'll probably be surprised at the way that it's most impactful in the world. But today, what we're looking at, I think, is a glimpse of the future.
Anthony 44:11
I couldn't agree more, my friend. This has been a great episode today.
Alex 44:14
It was super fun to talk to Jim. It's always a pleasure to talk with you.
Anthony 44:18
Until next time.
Alex 44:20
Our thanks to Professor Jim DiCarlo, for joining us this week. And finally, we'd love to know what you think of this series. You can write to us at theoryandpractice@gv.com or tweet @GVteam.
Anthony 44:33
This is a GV podcast and a Blanchard House production.
Our science producer was Hilary Guite, executive producer Duncan Barber with music by Dalo.
I'm Anthony Philippakis.
Alex 44:48
I'm Alex Wiltschko.
Anthony 44:51
And this is Theory and Practice.