About this series
Computer Science is composed of many different areas of research, such as Algorithms, Programming Languages, and Cryptography. Each of these areas has its own problems of interest, publications of record, idioms of communication, and styles of thought.
Segfault is a podcast series that serves as a map of the field, with each episode featuring discussions about the core motivations, ideas and methods of one particular area, with a mix of academics ranging from first year graduate students to long tenured professors.
I’m your host, Soham Sankaran, the founder of Pashi, a start-up building software for manufacturing. I'm on leave from the PhD program in Computer Science at Cornell, where I work on distributed systems and robotics, and I started Segfault to be the guide to CS research that I desperately wanted when I was just starting out in the field.
twitter: @sohamsankaran, website: https://soh.am, email: soham [at] soh [dot] am.
Episode 2: Computer Vision with Professor Bharath Hariharan
featuring Professor Bharath Hariharan of Cornell University
Cornell Professor and former Facebook AI Researcher Bharath Hariharan joins me to discuss what got him into Computer Vision, how the transition to deep learning has changed the way CV research is conducted, and the still-massive gap between human perception and what machines can do.
Consider subscribing via email to receive every episode and occasional bonus material in your inbox.
Soham Sankaran’s Y Combinator-backed startup, Pashi, is recruiting a software engineer to do research-adjacent work in programming languages and compilers. If you’re interested, email soham [at] pashi.com for more information.
Go to transcript
Note: If you’re in a podcast player, this will take you to the Honesty Is Best website to view the full transcript. Some players like Podcast Addict will load the whole transcript with time links below the Show Notes, so you can just scroll down to read the transcript without needing to click the link. Others like Google Podcasts will not show the whole transcript.
Show notes
Participants:
Soham Sankaran (@sohamsankaran) is the founder of Pashi, and is on leave from the PhD program in Computer Science at Cornell University.
Professor Bharath Hariharan is an Assistant Professor in the Department of Computer Science at Cornell University. He works on recognition in Computer Vision.
Material referenced in this podcast:
‘Building Rome in a Day’, a project to construct a 3D model of Rome using photographs found online from the University of Washington’s Graphics and Imaging Lab (GRAIL): project website, original paper by Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, and Richard Szeliski in ICCV 2009.
The Scale-Invariant Feature Transform (SIFT) algorithm: wikipedia, original paper by David G. Lowe in ICCV 1999.
The Perceptron: wikipedia, original paper by Cornell’s own Frank Rosenblatt in Psychological Review Vol. 65 (1958). Rosenblatt was a brilliant psychologist with exceptionally broad research interests across the social sciences, neurobiology, astronomy, and engineering. The perceptron, which is a forerunner of much of modern artificial intelligence, initially received great acclaim in academia and the popular press for accomplishing the feat of recognizing triangular shapes through training. In the 60s, however, legendary computer scientists Marvin Minsky (a high-school classmate of Rosenblatt’s) and Seymour Papert released a book, Perceptrons, that made the argument that the perceptron approach to artificial intelligence would fail at more complex tasks, resulting in it falling out of fashion for a few decades in favour of Minsky’s preferred approach, Symbolic AI. Symbolic AI famously failed to produce tangible results, resulting in the AI winter of the 80s and 90s, a fallow period for funding and enthusiasm. Rosenblatt, meanwhile, died in a boating accident in 1971 at the relatively young age of 43, 40 years too early to see himself vindicated in the battle between Minsky’s Symbolic AI and what we now call Machine Learning.
Bharath’s CVPR 2015 paper Hypercolumns for Object Segmentation and Fine-grained Localization with Pablo Arbeláez, Ross Girshick, and Jitendra Malik, in which information pulled from the middle layers of a convolutional neural network (CNN) trained for object recognition was used to establish fine-grained boundaries for objects in an image.
ImageNet, originally created by then Princeton (now Stanford) Professor Fei-Fei Li and her group in 2009: A vast database of images associated with common nouns (table, badger, ocean, etc.). The high quality & scale of this dataset, combined with the vigorous competition between groups of researchers to top the ImageNet benchmarks, fuelled massive advances in object recognition over the last decade.
Credits:
Created and hosted by Soham Sankaran.
Mixed and Mastered by Varun Patil (email).
Transcribed by Sahaj Sankaran & Soham Sankaran.
Transcript
[00:00:00]
Bharath Hariharan: Humans can basically very quickly learn new things. They know exactly when they see something new, they can learn from very few training examples, and they can learn with very little computational effort. Whereas current techniques, they can only learn a small number of things with lots of examples and lots of computational effort. That’s a big gap which causes all sorts of issues when you apply these techniques to the real world.
[ringing tone]
[00:00:38]
Soham Sankaran: Welcome to Episode 2 of Segfault, from Honesty is Best. Segfault is a podcast about Computer Science research. This episode is about computer vision, and it features Professor Bharath Hariharan from Cornell University. I’m your host, Soham Sankaran. I’m the CEO of Pashi, a start-up building software for manufacturing, and I’m on leave from the PhD program in Computer Science at Cornell University, located in perpetually sunny Ithaca, New York.
Computer Science is composed of many different areas of research, such as Operating Systems, Programming Languages, Algorithms and Cryptography. Each of these areas has its own problems of interest, publications of record, idioms of communication, and styles of thought, not to mention, one level deeper, a multitude of sub-areas, just as varied as the areas they are contained within.
This can get extremely confusing very quickly, and I certainly didn’t know how to navigate this terrain at all until I was in graduate school. Segfault, in aggregate, is intended to be a map of the field, with each episode featuring discussions about the core motivations, ideas and methods of one particular area, with a mix of academics ranging from first year graduate students to long tenured professors. I hope that listeners who have dipped their toes in computer science or programming, but haven’t necessarily been exposed to research, get a sense of not only the foundations of each area – what work is being done in it now, and what sorts of opportunities for new research exist for people just entering, but also what it is about each specific area that compelled my guests to work in it in the first place, and what the experience of doing research in that area every day actually feels like.
This is the map of CS that I didn’t even know I desperately wanted in high school and undergrad, and I hope folks who are just starting their journey in computer science will find within it ideas that will excite them so much, and so viscerally, that they can’t help but work on them.
Just a quick note. The first episode was about the research area of programming languages. If you haven’t already listened to it, you can find it at honestyisbest.com/segfault. My company, Pashi, is actually hiring someone with experience in programming languages and compilers, both in industry and in academia. If you fit this description, or know somebody that does, please reach out – send me an email at soham@pashi.com.
[ringing sound]
[00:02:44]
Soham: So I’m with Professor Bharath Hariharan, who does computer vision. If you just want to introduce yourself briefly…
[00:02:50]
Bharath: I’m Bharath. I do computer vision and machine learning. My interests are in visual recognition. I came here after a few years at FAIR – Facebook AI Research – and before that I was a Ph.D student at UC Berkeley.
[00:03:03]
Soham: Where you worked with Jitendra Malik, who is one of the pioneers of modern computer vision.
[00:03:10]
Bharath: Yeah. He’s one of the… yes.
[00:03:15]
Soham: So, what was it that got you into computer vision in the first place? What was your journey to choosing this as a research field?
[00:03:21]
Bharath: A variety of things. So I think the first source of my interest was that I was actually originally interested in psychology and how brains work. I think that’s been a longstanding interest of mine, and when I started working on computer science, when I started studying computer science, that was the thing I kept going back to. Like, why can’t computers do the things humans can? The other part of it was just images and visual media. Earlier, I had a brief infatuation with computer graphics, which also led to this question of ‘Why can’t machines understand images as well as humans do?’ So that’s sort of roughly the route I took, which is a fairly short route, but it serves as the motivation.
[00:04:12]
Soham: What was the first problem that was very interesting for you in vision? Or the first problem that you worked on?
[00:04:18]
Bharath: So the first problem I worked on was very different from the first problem that caught my interest.
[laughter]
The first problem I worked on was this problem of 3D reconstruction. I was in IIT-Delhi at the time, and one of my mentors, Professor Subhashish Banerjee – we called him Suban – he had this project where they were trying to digitize a lot of these historic monuments in Delhi and the surrounding area. If you’re not aware, IIT-Delhi is fairly close to a cluster of historical monuments that are mostly forgotten, in the Hauz Khas area. A lot of those monuments, if you go there, you can see them, but a lot of people don’t travel. So the idea was that we could create 3D models of these monuments, and then that would be educational. I was an undergraduate, I was mostly implementing techniques that others had suggested, including people in Oxford and MSR Cambridge. That was my first exposure to computer vision. One of the big things that actually caused me to consider this as a direction for graduate study was that, through some talk – I forget who gave the talk – I got to know of work that Noah Snavely did when he was a graduate student. It’s actually one of the papers that recently got the Test of Time Award in computer vision. This was the paper called ‘Building Rome in a Day’, and this was a similar idea, but what they were doing was, they took Internet photographs of tourist destinations like Rome, and basically created this pipeline that would take this collection of Internet photographs and produce a 3D model.
I was fascinated by that. I was also fascinated by the technical details in that. In particular, when you take two images of the same scene taken at different times from different locations, how do you know they refer to the same thing? That was actually the key piece that I ended up focusing on, which led me to recognition, like ‘How do we know we’ve seen the same thing before?’
[00:06:55]
Soham: I see. So let’s talk about this paper a bit more. So these were just arbitrary images they’d taken off the Internet, they had no pre-planning about which images to take, or anything like this?
Bharath: Yeah.
Soham: How many images did they need to reconstruct?
Bharath: I think it varied. They were going for the large scale. It’s useful to imagine what the environment was like at that time. It’s very different from what it is now.
Soham: What year was this?
Bharath: I think the paper came out in the late 2000s. But at that time, it was not taken as a given – surprisingly – that the Internet is a resource for data. People in computer vision were still looking at small scale problems, laboratory scale problems. Or, you know, you take five pictures of this stapler and reconstruct this.
Soham: I see.
[00:07:45]
Bharath: So the idea behind… so this paper, along with some others, particularly from Microsoft Research and the University of Washington, they were among the first to recognize that ‘Look, there is this massive resource called the Internet, which now has people posting photographs of their tourism, and you can just use this to build these 3D models.’ And later on, the same idea got morphed into ‘Okay, let’s do recognition’, blah blah blah, and so on. Till where we are now, where it’s kind of assumed that we have all this data and we’re going ‘What if we didn’t have this data?’
[laughter]
[00:08:23]
Soham: Was this around the same time that the shift was happening from classical vision techniques to ML techniques or had that already happened?
[00:08:32]
Bharath: So people were already using machine learning. People have been using machine learning in computer vision since the late 90s. This was, I would consider… so computer vision tends to have this… mood swings, is what I’d call it. So there are certain problems which people get really excited about, and people work on them for a decade. Then other problems take their place. So this paper was in the heyday of the time when people were talking about geometry. 3D reconstruction was the core problem people were interested in. There were a few machine learning techniques involved, but a lot of the work – for example Noah’s paper, a lot of it is a combination of good systems building, systems challenges, plus just optimization, mathematical optimization techniques. So there’s not much training, there’s not much machine learning, in that paper per se. So the resurgence of machine learning, the focus on recognition, was something that only started to pick up in the late 2000s.
[00:10:05]
Soham: So what was the key technical trick in Noah’s paper? What let him recognize that it was the same object that multiple images were looking at?
[00:10:15]
Bharath: Well, if you read Noah’s paper now… Noah’s paper is actually a systems paper. The key thing is just to put together components that people had explored, but put together those components in a way that was extremely robust, extremely scalable, and so on. The thing that I as an undergraduate got really excited by was another paper that was used in this, about SIFT. So SIFT is Scale-Invariant Feature Transform. SIFT is a paper which, if you read it even now… I really like the paper. It has a few key ideas, very well evaluated, very well motivated. It was, I think, 2001 or 2002 was when it came out, and we’re still writing papers trying to beat SIFT. SIFT is still a baseline for us. I read SIFT as an undergraduate, and I thought ‘Wow. This is what I want to do.’ That was what kind of started the whole thing.
[00:11:22]
Soham: Ok. Explain SIFT.
[laughter]
[00:11:27]
Bharath: So the fundamental problem SIFT was trying to tackle is that… you have two views of the same object, but they might be from very different angles, the object may appear very differently. How do we match them? There’s two parts to the SIFT paper. One component is detecting these key points, parts of the object that are distinctive enough that you can use for matching. The second is description. How do you describe these patches so you can match them reliably across two scenes? There are challenges in both, but the key way the paper describes it, which is a very useful way and is the way we describe it now in our courses, is that there are a set of invariances you want. There are certain transformations that might relate these two images. Those transformations should not cause your system to fail. So one transformation they were looking at was scale transformations – one view might be zoomed in, another might be zoomed out. The other is in-plane rotations – 2D rotations of the image, for example. The third is 3D rotations, but to a limited extent. 3D rotations are hard because you don’t know the 3D structure, you just have the image. But if you are careful about it, then small amounts of 3D rotation you can tolerate. So what they did was, they created this feature description and detection pipeline, where they reasoned about how exactly to introduce these invariances. So if what I want is scale invariance, what I should be doing is running my system at many different scales, identifying the optimal scale. That way, no matter which scale the object appears in, I’ll find it.
[00:13:30]
Soham: So sort of brute-forcing it, in some sense.
[00:13:32]
Bharath: In some sense, right. The other idea is, if I want to do invariance in perspective changes or 3D deformations, then what I need is something more sophisticated, which is discretization, quantization, binning, histogramming, those ideas. The combination of these two, search across a variety of transformations and intelligent quantization and histogramming, was something that SIFT introduced. Those ideas kept repeating in various forms in various feature extraction techniques, all the way up till neural networks.
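Ed. note: as a rough illustration of the two ideas described here, searching over scales and binning gradient orientations into histograms, here is a minimal numpy sketch. It is not Lowe’s actual SIFT implementation; the scales, bin counts, and crude strided downsampling are illustrative choices.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Bin the gradient orientations of a grayscale patch into a coarse histogram,
    weighting each pixel by its gradient magnitude (edges matter, flat regions don't)."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)                      # angles in [-pi, pi]
    bins = np.linspace(-np.pi, np.pi, n_bins + 1)
    hist, _ = np.histogram(orientation, bins=bins, weights=magnitude)
    return hist / (hist.sum() + 1e-8)                     # normalize away overall contrast

def multi_scale_descriptors(image, scales=(1, 2, 4)):
    """Describe the same image at several scales, mimicking the 'search across scales' idea."""
    return {s: orientation_histogram(image[::s, ::s]) for s in scales}

# Two views of the same structure at different zoom levels should produce
# similar histograms at the appropriately matched scales.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
for scale, hist in multi_scale_descriptors(image).items():
    print(scale, np.round(hist, 3))
```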
[00:14:10]
Soham: So they came up with a principled and reasonably applicable set of ways to describe these invariances that were useful for other applications as well?
[00:14:22]
Bharath: Yeah.
[00:14:23]
Soham: I see.
[00:14:23]
Bharath: And if you look at the SIFT paper… even when Yann LeCun talks about convolutional networks nowadays, he harkens back to the way these people were describing it. How do you get invariance? Well, you get translation invariance by something like convolution, by doing the same operation at every location. You get invariance to small shifts by doing discretization and binning and pooling and so on. So those ideas have survived, and they became the centerpoint of all of computer vision, to the extent that today no-one even thinks about them. It’s like second nature to most people that you have to do this.
[00:15:00]
Soham: Can you describe exactly what you mean by quantization and binning here?
[00:15:03]
Bharath: So the idea here is as follows. One of the biggest challenges that people don’t realize about computer vision is that, if you take an image… so an image is essentially a 3D array of pixel values. Now if I just were to slightly shift my camera, all the pixels might move one pixel over, but now if I look at the 2D array, the 2D array is completely different. If I look at corresponding indices, the pixel values are completely different. That throws everything off. So what you do to get around this – and this keeps coming up at every level of reasoning – is that you can imagine taking your image… so, the simplest way to do this is to think of reducing the size of the image, subsampling it. If you do that, then the difference becomes much less. If your image moved by one pixel, and you reduce the resolution by a factor of two, then it only moved by half a pixel, which leads to smaller changes. So that’s sort of the core idea. So what you do is, you take the image, use a grid to divide it…
[00:16:25]
Soham: A coarse grid.
[00:16:27]
Bharath: A coarse grid. And then in each cell, you would compute some aggregate value.
[00:16:32]
Soham: So four pixels becomes one pixel.
[00:16:33]
Bharath: Right. So that’s basically the idea. SIFT also had this one thing of using edges instead of image intensities. But that was something that people had thought of earlier too, that edges are more important than image intensities. But that’s the quantization and binning, that you just divide it into a coarse grid and aggregate each grid. So that gives you a fair amount of invariance.
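Ed. note: the ‘coarse grid, aggregate each cell’ operation described above is essentially what is now called pooling. A minimal numpy sketch, assuming the cell size divides the image evenly; the step-edge example is illustrative.

```python
import numpy as np

def pool_grid(image, cell=2, reduce=np.mean):
    """Divide a 2D array into cell x cell blocks and aggregate each block."""
    h, w = image.shape
    blocks = image[:h - h % cell, :w - w % cell].reshape(
        h // cell, cell, w // cell, cell)
    return reduce(blocks, axis=(1, 3))

# A vertical step edge, and the same edge shifted one pixel to the right.
img = np.zeros((4, 4)); img[:, 2:] = 1.0
shifted = np.zeros((4, 4)); shifted[:, 3:] = 1.0

print(np.abs(img - shifted).max())                         # 1.0: some pixels change completely
print(np.abs(pool_grid(img) - pool_grid(shifted)).max())   # 0.5: pooled cells change far less
```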
[00:16:55]
Soham: So now you can compare two images that are somewhat similar, like of the same object, but from slightly different perspectives. And if you have it coarse enough, and if you’re lucky, then they’re going to look the same. Or substantially the same. I see. And that was first introduced by the SIFT paper?
[00:17:09]
Bharath: Right. That was one of the key ideas. There are other related ideas that come at the same time, but SIFT was the first engineered system.
[00:17:25]
Soham: So this really caught your attention when you saw it the first time, as an undergrad.
[00:17:28]
Bharath: Yes. And then after that, it was a fairly straightforward path to where I am, in some sense.
[00:17:25]
Soham: That makes sense. Could you describe the toolkit of modern computer vision? Both the machine learning and non-machine learning components, broadly.
[0:17:46]
Bharath: So modern computer vision, right. So there’s those two kinds of things. A big part of modern computer vision is driven by learning. That includes convolutional networks, what people nowadays call deep learning, and the associated learning techniques. The other part of it is geometry and physics. How images are formed, the mathematics of that, the related properties of the geometry. There’s a lot of things that, because of the way images are taken, lead to certain geometric and physical properties images have. All computer vision papers nowadays usually have some combination of these two. If you’re doing just ‘Is this a cat or not?’ you’re mostly using convolutional networks, machine learning. If you’re doing something like ‘Reconstruct this 3D model of a car’, you might actually use a combination, you might say ‘I’ll use my understanding of geometry to say how some feature-matching thing will lead to a 3D model, but I might also use machine learning to refine the model based on some understanding of what a car looks like in general.’ Something like that. So those are the two big toolkits, geometry and learning. There used to be also a significant part of this which was based on signal processing. So a lot of classical computer vision is based on an understanding of Fourier transforms, frequency analysis, things like that. That’s much less there now, though there’s some evidence that those things are still useful.
[00:19:35]
Soham: But it’s been largely replaced by the ML component? I see. So let’s talk about ML in computer vision. Can we talk about the perceptron model? Tell me what that is, and see if you can explain it to a relatively lay audience.
[00:19:48]
Bharath: So the perceptron was actually one of the first machine learning models. The perceptron is an algorithm. It’s a simple algorithm. The idea is, you have some number of scalar inputs, and you have to predict a scalar output, which is either 0 or 1. Basically, the way you’re going to do this is, you’re going to multiply your inputs by some scalar weights, add and subtract them, and combine them in that way to get some number. So you might say, I’ll take half of input A, two times input B, a third of input C, add them all up, and then compare it to some threshold value. If it rises above that threshold, I’m going to claim the output is 1. If it falls below that threshold, I’m going to claim the output is 0. So the threshold, and the weights I’m using to mix and match different inputs, those are the things one needs to figure out for any given problem. We call those typically parameters of the model. Perceptron is the model, parameters are the things we need to figure out. Even before we figure out the parameters, we have made certain assumptions about what this decision rule looks like. There are only certain things I can capture with this decision rule, but assuming we can capture this decision rule, the next thing is, how do I figure out these parameters. Now the perceptron algorithm, the training algorithm, is a fairly straightforward algorithm. It basically involves going through some set of training examples where you have inputs and corresponding outputs. Essentially, you see, you try your current setting of the parameters, you see if you get your classification right. If you don’t get it right, then you change your weights in a particular manner so that you get it right. You keep doing this, and there is a proof that you can show which says that if there is a setting of the parameters that would solve the task, you’ll discover this.
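Ed. note: a minimal sketch of the decision rule and training loop described above. The toy dataset and learning rate are illustrative; this is the textbook perceptron update, not Rosenblatt’s original code.

```python
import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    """X: (n_samples, n_features) inputs; y: labels in {0, 1}.
    The weights w and threshold b are the 'parameters' to be figured out."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0   # weighted sum compared to a threshold
            if pred != yi:                      # on a mistake, nudge the weights
                w += lr * (yi - pred) * xi
                b += lr * (yi - pred)
    return w, b

# Toy linearly separable problem (logical OR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w, b = perceptron_train(X, y)
print([1 if x @ w + b > 0 else 0 for x in X])   # -> [0, 1, 1, 1]
```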
[00:22:10]
Soham: In some bounded time, or as time goes to infinity?
[00:22:12]
Bharath: In some bounded time. So that’s the original perceptron algorithm. After that… There’s some history around this, where people showed that the perceptron is not actually capable of capturing a wide variety of techniques, a wide variety of problems, and it was not obvious how to extend that. One of the big things that actually had to change to go from perceptron to more recent techniques is that we had to go from an algorithmic standpoint, like ‘This is the algorithm I’ll use to train’, to an optimization standpoint, like ‘This is a loss function we’re going to optimize, and this is how the optimization procedure is going to operate.’ That kind of optimization-based view leads to a whole class of techniques ranging from logistic regression, which is the next 101 machine learning model from the perceptron, to support vector machines and kernel techniques, to neural networks and beyond.
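Ed. note: the ‘optimization standpoint’ contrasted here with the perceptron update means writing down a loss function and following its gradient. A minimal logistic-regression sketch under plain gradient descent; the toy data, learning rate, and step count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lr=0.5, steps=2000):
    """Minimize the average logistic (cross-entropy) loss by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(steps):
        p = sigmoid(X @ w + b)        # predicted probability of class 1
        grad_w = X.T @ (p - y) / n    # gradient of the loss w.r.t. the weights
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w, b = logistic_regression(X, y)
print(np.round(sigmoid(X @ w + b)))   # -> [0. 1. 1. 1.]
```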
[00:23:26]
Soham: I see. And this shift to optimization techniques happened when?
[00:23:33]
Bharath: I’m not actually sure. By the 90s, there were optimization-based learning techniques. SVM was already there, logistic regression was already there, neural networks were already there. I don’t actually know exactly when it happened. Perceptron was invented fairly early on. In the middle, there was this AI winter, which was when all talk of AI died down, funding died down, and so on and so forth. When we resurfaced in the 90s, people were talking about that. So some of this also involved… In the 80s and 90s, back propagation was invented, and gradient descent was being figured out by people in optimization and control. There were a lot of things going on in control theory, and so on and so forth. A lot of things happened in that interim.
[00:24:32]
Soham: Got you. So one of the more common machine learning techniques people use are these convolutional neural nets. Can you tell me what a convolution is, and what it means to be using a convolutional neural net in vision?
[00:24:45]
Bharath: So a convolution is basically… Before we talk about a convolution, we have to talk about what a linear function is. The kind of thing we talked about when we said ‘Oh, you know, we’ll just combine, multiply some inputs with some weights and add them together.’ That’s an example of a linear function. A convolution is a special kind of a linear function. What a convolution does is, instead of thinking of your input as a long set of inputs, your input has some structure. Usually it’s a two-dimensional array or a one-dimensional array of inputs.
[00:25:26]
Soham: So this fits well with an image because you have a two-dimensional array of inputs.
[00:25:30]
Bharath: Yeah. So there’s a notion of space or time. So convolution comes first, actually, in the signal processing community. That’s where the whole idea comes from. And the idea is that if you want a linear function of these kinds of inputs that are invariant to space or time – so in a 2D array you have these two spatial dimensions. If you want a linear operation such that at every location in this 2D array it’s basically doing the same thing, that’s basically a convolution. So the operation itself looks like, at every location in this 2D array, you take the neighborhood of that location and pipe that to a linear function. So, neighborhood, pipe that to a linear function, out comes an output. Then you move one pixel over, again take the neighborhood of that pixel, pipe that to a linear function, out comes an output. You keep doing this for every location in the 2D array, and now you have a 2D array of outputs. So that’s the convolution operation…
[00:26:27]
Soham: So it’s sort of like you have a slate, and you’re moving it from pixel to pixel.
[00:26:31]
Bharath: Yeah. The other way people often think of convolution is as a sliding window. So you can think of this as being… So you have your 2D image, you have a small window through which you’re looking at, and you’re sliding that window over this image. At every location where you slide the window over, then you compute a simple linear operation of whatever you see. Convolutional neural networks are basically neural networks which have their primitive operation built up of this convolution. So they just have convolutions, a bunch of convolutions stacked on top of each other. The reason this is useful is because one property of natural images is that they tend to be translation invariant. So the spatial location (1,1) is not particularly different from the spatial location (10,10). All regions of the image are statistically the same thing, essentially. And the reason that happens is, you know, I can take a picture standing up, I can take a picture standing upside-down, I can take a picture kneeling on the ground, I can take a picture kneeling up. It’s rarely the case that you want the top of the image to be processed differently than the bottom of the image. You want everything to be processed similarly. The advantage of convolution compared to other things is that convolution can be expressed with very few parameters. Because you’re using the sliding window, you only need enough parameters to capture the function of the sliding window. So it’s a fairly small window…
[00:28:13]
Soham: So a function the size of the window, as opposed to the size of the entire input.
[00:28:16]
Bharath: Yes. So instead of a 300x300 image, you’re only looking at a 3x3 patch at any given time. So you need only nine numbers to describe this operation.
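Ed. note: a direct (slow but readable) numpy sketch of the sliding-window operation just described: a single 3x3 filter, so nine numbers, applied at every location of the image. (Strictly speaking this is cross-correlation, which is the convention convolutional networks use.)

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' convolution: at every location, take the neighborhood
    and compute a weighted sum of it with the kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 3x3 vertical-edge filter: the whole operation is described by nine numbers.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

image = np.zeros((8, 8)); image[:, 4:] = 1.0   # a vertical step edge
print(conv2d(image, sobel_x))                  # nonzero only in the columns next to the edge
```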
[00:28:29]
Soham: So what’s a sort of use-case that CNNs are actually used for? What do they actually do?
[00:28:34]
Bharath: So right now, they do almost everything. The simplest thing they do is recognition. You know, you give an image as input, and out pops a label which says ‘Dog’, ‘Cat’, whatever. The way you do this is, the model basically passes the image through a sequence of convolutions. In the middle it does a bunch of these subsamplings and discretizations, as we talked about earlier. This is the same kind of operation that SIFT does. These networks just do it lots of times to reduce a 300x300 image to a 1x1 probability distribution over class labels. Recognition is the big thing – in goes the image, out comes the class label. But more generally, you can have things where you feed in an image and out comes some featurization of the image, some description of what’s going on in the image. This you can use for basically anything. Any kind of operation you want to do on images, you want to figure out whether two images are close or not, you want to figure out if Image 1 has some artifact or not, you want to match two images, anything, you can use this kind of feature representation. And it turns out that you can train these convolutional neural networks to do one thing, like recognize cats, dogs, and other species, and in the process they produce a good intermediate representation that just ends up being useful for lots of things.
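Ed. note: in code, the two uses described here, ‘image in, label out’ and ‘image in, features out’, commonly look something like the following PyTorch sketch. This is an illustrative idiom, not code from the episode; it assumes torchvision’s pretrained ResNet-18, and cat.jpg is a placeholder filename.

```python
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("cat.jpg")).unsqueeze(0)    # placeholder input image

with torch.no_grad():
    logits = model(img)                                 # in goes the image, out comes a class score
    label = logits.argmax(dim=1)

    # Drop the final classifier and reuse the rest as a generic featurizer.
    featurizer = torch.nn.Sequential(*list(model.children())[:-1])
    features = featurizer(img).flatten(1)               # a 512-dimensional description of the image
```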
[00:30:15]
Soham: I see. And this is sort of why people train on ImageNet, which is this broad collection of images…
Ed. note: ImageNet, originally created by then Princeton (now Stanford) Professor Fei-Fei Li and her group in 2009, is a vast database of images associated with common nouns (table, badger, ocean). The easy availability of this dataset and the competition between groups of researchers to top the ImageNet benchmarks fuelled massive advances in object recognition over the last decade.
[00:30:21]
Bharath: And then test on anything they want to.
[00:30:24]
Soham: So what is going on here? If I’m thinking about a recognition task, if I feed in a bunch of images and these convolutions are happening, what are these convolutions actually doing that allows the recognition to happen?
[00:30:36]
Bharath: It’s a good question, and we don’t really know. The reason is that it’s really hard to visualize what’s going on. There are efforts to do this, but none of it is particularly interpretable in any way. And things that are interpretable tend not to be faithful to the operation of the network. There are some things that are easy to know. You can ask ‘Well, what’s the first layer doing? What’s the first convolution doing?’ It turns out that the first convolution, the first thing these networks do, is detect edges of various kinds. This is also, for example, what SIFT does. This is also what we know the human visual system does. So all of this goes well with what we expect recognition systems to do. Edges are important because edges tend to correspond to places where object boundaries exist, and object boundaries are important for capturing shape. Edges is the first thing it does…
[00:31:40]
Soham: So within each window, it’s finding edges?
[00:31:42]
Bharath: Yeah. So if you look at the output of this, you’d have one convolution operation identifying all vertical lines in the image, another convolution identifying all horizontal lines in the image. As you go deeper…
[00:31:56]
Soham: And these are not engineered? This all happens as part of the training process?
[00:32:00]
Bharath: During training, the model discovers that it has to do this. To recognize that this is a cat, it needs to first detect edges.
[00:32:08]
Soham: So each parameter, which is some factor that’s applied to the input in each of these cells of the sliding window, starts untrained with some null value or something like that… does it start with…
[00:32:20]
Bharath: It starts with random noise.
[00:32:22]
Soham: With random noise. And it becomes trained such that the first layer is detecting vertical edges? And then the second layer is detecting horizontal edges?
[00:32:30]
Bharath: Within the same layer, you have multiple filters detecting things in parallel. So you have a bunch of things detecting edges of various orientations. The next layer… there is some evidence that what it ends up doing is detecting things like corners and blobs. Things like “Oh, there’s a red corner.” or “There’s a black-ish blob in the top left corner.”
[00:32:54]
Soham: And we know this because we can output the pixel output that comes out of that layer?
[00:32:58]
Bharath: For the first layer, you can just visualize exactly what the network is doing. For the second layer onwards, because of the fact that the operation is now nonlinear… in the middle you do some nonlinear operation on each pixel, which makes it hard to actually visualize between the layers.
[00:33:17]
Soham: What kind of operation is that?
[00:33:19]
Bharath: So usually we do this thing called a rectified linear unit which, if you know signal processing, is often called half-wave rectification. So half-wave rectification means that if the output is positive you keep it, and if the output is negative you zero it out. So because this operation is now nonlinear, it’s hard to visualize what the second layer is doing. But second layer onward, what you can do is ask ‘How should my image change so that this particular pixel in this second layer output gets a really high value?’ And if you do that, you find what patches actually maximize a particular pixel value, you see blobs and corners and things. Beyond that, as you go deeper and deeper, it becomes really hard to understand what’s going on. Sometimes you see some patterns, like if it seems to like faces, or it seems to like particular colors in particular regions, but usually it becomes hard to understand. So it’s hard to figure out what the network is doing. Only thing you know is it learnt to do this to fit some labels in some training set.
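Ed. note: the rectified linear unit (half-wave rectification) mentioned here is a one-line operation: keep positive values, zero out negative ones.

```python
import numpy as np

def relu(x):
    """Half-wave rectification: positives pass through, negatives become zero."""
    return np.maximum(x, 0)

print(relu(np.array([-2.0, -0.5, 0.0, 0.7, 3.0])))   # -> [0.  0.  0.  0.7 3. ]
```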
[00:34:34]
Soham: I see. So let’s talk about a recent problem that you’ve worked on, that you published a paper on, and that you effectively solved, you’ve come up with some kind of solution to whatever you’re working on. From the beginning, to what you actually built. What was the problem, and what was the solution? What did you actually do to get there?
[00:34:52]
Bharath: That’s a good question. So given the context of where we are in this conversation, I’ll probably pick something that’s a bit older, which is from my grad school times. This was a problem we introduced, we started thinking about. The goal was basically this. I said that convolutional networks have basically been doing ‘Image in, label out.’ You can do a bit better, and you can say ‘Give me an image in, and tell me not just an image-level label, tell me where the different objects are.’ So this box, this contains a horse, and this box, this contains a person, and so on. For the purposes of this, it’s not important to know how you do this. Basically the idea is that you come up with a few candidate boxes, and then you pipe each box through a convolutional network. That’s basically the idea. So this problem is often called ‘object detection’. What we wanted to do is to say ‘Okay, we’ve figured out that this box contains a horse, but can I actually identify the boundaries of the horse?’
[00:36:05]
Soham: Wait, so you’re provided an image with a box already drawn on it?
[00:36:09]
Bharath: Yes. For the purposes of this, let’s assume we’ve already solved that, object detection. In fact, we built on top of existing object detectors. So we have an image, and we have a bunch of boxes, saying this box has a horse, this box has a person, this box has a car. You can imagine, maybe, a cropped image where there is a horse bang in the middle of it, enlarged. The problem is, you have a box, you know there’s a horse in it. What we want to do is, we want to identify the boundaries of the horse. Why is this important? This is important because for some applications, you really want to know these boundaries. For example for robotics, you may want to actually predict ‘How do I grasp this particular object?’ so you need the boundaries. Or in graphics, you want to crop an object from one image and place it into another, so you want to know the boundaries of the object.
[00:37:04]
Soham: So you want the minimal continuous boundaries of the object?
[00:37:07]
Bharath: Yeah. So you want to know ‘This is their head.’ and ‘This is their tail.’ and ‘These are the legs.’ and so on. The problem was, the way these convolutional networks operate, they collapse images – large, high-resolution images – to single labels. Boundaries are not single labels. We have to retain the resolution of the image, we need fairly precise localization. So the idea that we came up with… this was in discussion with my advisor. So we first posed this problem, and we had some initial techniques, and they were producing very imprecise boundaries. They were very blobby, it would look basically like a circle. So we thought ‘How can we get this to be more precise?’ We realized that, as I said before, the convolutional network is going through these stages of processing. Really early on, it has understandings of edges and line segments, which are very precise with respect to spatial location. You know exactly where the edge is.
[00:38:20]
Soham: In the first layer?
[00:38:22]
Bharath: Right, in the first layer. And as you go deeper into the layers, the abstraction of the model the network is producing increases. So you’re going from edges to things like object parts to things like objects. At the same time, the spatial resolution of these things decreases. And we want both. We want to know horse-ness, but we also want high spatial resolution. So what we’ll do is, we can actually take a trained network and actually tap into these earlier layers. So we can say, ‘Okay, I have a certain pixel in my image, and I want to figure out whether it lies on my horse boundary or not. I’ll basically go to each layer of my convolutional network and look at my output there.’ Early on I’ll ask ‘Does this pixel lie on this edge or not?’ In the middle I’ll be asking ‘Does this pixel lie in a horse leg or not?’ At the end I’ll be asking ‘Is this a horse or not?’ Putting that information together, I can basically solve this problem. I can basically tap into all the intermediate things the network has computed and combine that to do this kind of localization. And we did that, we really churned this out in an afternoon – the first results came out in an afternoon – and it was amazing. It was really precise localization of these boundaries that we’d not seen before. I sent the images to my advisor from a coffee shop and he was like ‘This looks great. The conference deadline is in six hours, can you write it up?’
[laughter]
[00:40:05]
Bharath: I tend to be someone who… my answer was ‘Um, no?’ So we went for the next deadline. That was the last thing I did in grad school, basically.
[00:40:19]
Soham: I see. And what you used to extract the boundaries, that was not actually machine learning? You just went in and deterministically extracted elements from each of the layers to draw out the fine boundaries.
[00:40:32]
Bharath: We just took all, we just took everything. Basically, what we did was, instead of just taking the last whatever the network has done at the very end, we take all the intermediate stuff and stack them on together. My advisor called this hypercolumn representation, but the general idea is also called skip connections. Usually skip connections went the other way, where people tried to use them for just saying whether it’s a cat or not, but our idea was you can actually get the spatial localization very well with this. So this was the idea. This was taking an existing convolutional network and using what it had already learned in an intelligent manner.
[00:41:22]
Soham: So when you say skip connection, were you connecting the earlier layer to another layer up ahead?
[00:41:26]
Bharath: You could describe the architecture in a few ways. The way we described it was that we were taking the intermediate features that the network had produced and concatenating them together, appending all of them together, to produce a new feature representation. And that was being fed to a simple machine learning model.
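Ed. note: a schematic numpy sketch of the general hypercolumn idea as described: bring each layer’s feature map back to full resolution and stack them per pixel, then hand that long per-pixel feature vector to a simple classifier. The random feature maps, shapes, and nearest-neighbor upsampling are illustrative simplifications, not the paper’s exact pipeline.

```python
import numpy as np

def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbor upsample a (channels, h, w) feature map to (channels, out_h, out_w)."""
    c, h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return fmap[:, rows][:, :, cols]

def hypercolumns(feature_maps, out_h, out_w):
    """Concatenate features from every layer at every pixel location."""
    ups = [upsample_nearest(f, out_h, out_w) for f in feature_maps]
    return np.concatenate(ups, axis=0)   # (total channels, out_h, out_w)

# Stand-ins for intermediate outputs of a network run on a 32x32 image:
layer1 = np.random.rand(16, 32, 32)    # edge-like features, high resolution
layer2 = np.random.rand(64, 16, 16)    # part-like features, medium resolution
layer3 = np.random.rand(128, 4, 4)     # object-like features, low resolution

hc = hypercolumns([layer1, layer2, layer3], 32, 32)
print(hc.shape)   # (208, 32, 32): one long feature vector per pixel
```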
[00:41:42]
Soham: I see. So you had another model?
[00:41:46]
Bharath: We tried a few variants. But now if you put all of this together, it now looks like skip connections, where earlier layers are feeding into later layers. It’s like a different interpretation of…
[00:41:58]
Soham: Isomorphic architecture. I see – and it turned out that if you had all this information, then a simple model could easily produce very accurate boundaries with some labelled examples. That’s very neat, that makes sense. I think it neatly encapsulates that we don’t really know how these things work but we can exploit some things about their structure to produce the sorts of results that we want. Is this how you describe most of the vision research that you do now? You architect some kind of network to solve some kind of problem, and then you use features of that to do specific things in interesting ways?
[00:42:50]
Bharath: Actually, nowadays the kind of work I do takes a very different flavor. In this case, I took a trained network and tried to use my domain knowledge to extract information from it. Nowadays I take a reversed standpoint: how do I, from the start, teach the network domain knowledge? The machine learning way of looking at things is ‘Give me examples, and we’ll figure it out.’ It’s kind of like teaching a kid multiplication by giving nothing except number 1, number 2, number 1 times number 2… Giving a long list of tables. And sure, they’ll figure it out… well, I don’t know if they’ll figure it out, but if you have a million examples, you can figure it out. But that’s not how we teach kids. We teach kids ‘This is what multiplication means, and this is the algorithm to do it.’ And then they practice on a few examples and they’re done. This ends up being much more efficient. The same question applies here. Can we move beyond giving them a mass of datasets – because usually we don’t have a mass of datasets, except in very few cases – can we go beyond giving neural networks a mass of datasets and teach them some domain knowledge?
[00:44:02]
Soham: So what’s an example of this? Just in short?
[00:44:05]
Bharath: So one example that we recently did was… we want to train a neural network to recognize a new class with very limited training data. And we simply said ‘What if, in addition to a limited number of training examples, for maybe one image I tell the network where the object is in the image?’ So I say, maybe ‘This region of the image, this is where you should look at.’
[00:44:34]
Soham: So you initially give it a bunch of training examples that say ‘The object is somewhere in this image’? And now you also put a bounding box around the object?
[00:44:42]
Bharath: On just one image. It’s a very simple, small additional thing, and it gives a massive improvement in performance.
[00:44:47]
Soham: And you don’t put a bounding box around every image because it’ll be more expensive to produce a training set?
[00:44:51]
Bharath: Yes. The idea is, an expert can give one image with one bounding box, and then it can get other examples through some other process.
[00:45:01]
Soham: So if you wanted to do this for diagnostics, you can say that you have all these cancer examples, and then in one case you have the diagnostician paint out exactly where the tumor is.
[00:45:09]
Bharath: Exactly. And that leads to significantly larger improvements, significantly larger performance. I don’t remember the exact number.
[00:45:17]
Soham: Does this work with just standard CNN architectures?
[00:45:20]
Bharath: There’s a few caveats to this. Right now we’re still exploring the full extent of what this can do. We’ve got one paper out of this and one submission that seems to suggest that this is actually a fairly general idea, but I wouldn’t go so far as to say this is something you should always do. It seems to be the case that it’s useful.
[00:45:44]
Soham: Interesting. So in this case you just changed something about the data representation for the training set without changing anything about the architecture?
[00:45:50]
Bharath: Yes. And that’s usually how I work nowadays. My mantra for a lot of my graduate students is ‘If at first you don’t succeed, redefine success.’
[laughter]
Bharath: So can you change the problem setup to make it work better?
[00:46:08]
Soham: Okay, so two more questions. Can you describe any piece of work that you’ve seen recently that’s not your own work that you thought was interesting? And why did you think it was interesting?
[00:46:18]
Bharath: This is a line of work that I have initial results in, initial experiments in, but I’m not an expert in this. And again, this is not one paper but a class of techniques. This has been work that’s happening on 3D reconstruction again, but from a single view. So you get a single view of an object or an image, say a single view of a cup, and you have to reconstruct the 3D model. People have been using machine learning techniques for this, and that has seen, actually, some incredible results. There’s work out of Berkeley, there’s work out of Oxford on things where basically what they’ve done is, they’ve trained the machine learning model so that, from a single image, it predicts a hypothesis for what the 3D shape should look like, and then it tries to render the hypothesis and tries to match it with the corresponding image. That’s how it learns. And it manages to learn… I think there are results now on just producing 3D models… I think one of my friends has talked about this a long time ago. Learning single-image 3D without a single 3D image. You don’t get any 3D information during training, but somehow, using knowledge of geometry, you’re able to train the network to actually do this 3D reconstruction. Original work from CMU, from David Fouhey and colleagues. Then there’s work from Berkeley, and from FAIR, from Shubham Tulsiani and colleagues, Stanford from Leo Guibas’ group, and Oxford from Andrea Vedaldi’s group. There are a bunch of techniques. And also Noah (Snavely) – I don’t mention Noah that much, because you’d just think I’m biased towards Cornell – but Noah also has a bunch of this…
[00:48:39]
Soham: Noah Snavely, who you mentioned earlier, right. So what, in your mind, are the biggest open problems in computer vision still? What are the huge problems that someone might be inspired to take on in the field?
[00:48:54]
Bharath: The biggest problem right now is that… So there’s this impression, especially in the public domain, that computer vision is solved, that we can basically do any kind of recognition task, any kind of perception task. And that’s dangerous. We’re very far from any kind of solution, so it’s more like ‘What have we done?’ rather than ‘What’s left to do?’ The big thing that’s missing is that human perception… Humans can basically very quickly learn new things. They know exactly when they see something new, they can learn from very few training examples, and they can learn with very little computational effort. Whereas current techniques, they can only learn a small number of things with lots of data and lots of computational effort. That’s a big gap which causes all sorts of issues when you apply these techniques to the real world. One of the resulting challenges is what I think of as inequity, in the sense that, right now, the kinds of techniques we’ve produced are only applicable to Google-scale or Facebook-scale problems.
[00:50:16]
Soham: So you have to be in industry or have an industry partner to run these computations?
[00:50:20]
Bharath: Right. Even if you are in academia and you are solving those problems, the kinds of challenges that we are focusing on, like ‘How can we learn to recognize these 200 classes that are relevant to Internet applications from a million examples?’ or ‘How do I use a billion examples to build a good convolutional network?’, are very different from the kind of application where a radiologist comes to you and says ‘I want to recognize this specific disease, and here’s 5 images and I can give you 5 more if you give me a year.’
[laughter]
Bharath: But I don’t have them here. And, you know, I have this Windows 98 machine lying around to train. It’s a very different kind of problem, and that’s sort of a big blind spot right now.
[00:50:56]
Soham: I see. So sort of moving closer to the human ability to learn things with few examples with not that much computation, so that we can move to smaller in scope but very important humanitarian and other problems like diagnostics. In a related question, do you think that it’s possible to do good research in computer vision outside of industry, or with no industry partnerships?
[00:51:22]
Bharath: I think it is definitely possible, but you have to have this kind of a mindset. If you talk to anyone in this area, you’ll get a similar answer which is if you try to beat industry at the industry game, that’s stupid. We have to be able to play to our strengths. And our strength in academia is (a.) Exposure to a much wider set of colleagues – people who are doing material science, who are doing agriculture, animal husbandry and all sorts of things – and (b.) The idea that we can explore ideas that don’t immediately lead to impact on benchmarks. That’s sort of the key thing that we have, this longer-term focus. As academicians, we need to focus on that, we need to push on that.
[00:52:24]
Soham: If you were to make a one-minute pitch to somebody who was considering doing computer science that they should work in vision, what would that pitch be?
[00:52:31]
Bharath: Woah. That’s a bit hard. I think the biggest… The most interesting aspect of computer vision is the fact that… To me, the biggest aspect of computer vision is this big, big gap between what I can do – or what a toddler can do – and what a machine can do. It’s so far away from what toddlers can do. I have a one-year-old niece… she was able to, when we’d come home, she would rush to us, take our suitcases, open the bag, and throw everything out. Getting there – this is both robotics and vision in this case – getting there in terms of perception, figuring out ‘How did she know, having never seen a suitcase, that this was a thing you could open, and this had things inside of it?’
[00:53:27]
Soham: Without even any actuation, just coming up with the plan at all.
[00:53:31]
Bharath: Just coming up with the plan at all. One fine day, she just decides to do this. Getting there, that’s the key question. How do we get there? We’re very far, we’re nowhere near. So I think that’s the motivation.
[00:53:42]
Soham: Great, thank you so much!
[00:53:44]
Bharath: Yeah, cool.
[ringing sound]
[00:53:45]
Soham: Thanks for listening to that episode of Segfault. You can find Professor Hariharan at his website, bharathh.info. You can find me on twitter @sohamsankaran and at my website, soh.am. If you liked this episode, listen to some more and subscribe at honestyisbest.com/segfault or via your favourite podcast app. We’d appreciate it if you shared this episode online, and in particular, with anybody that you think could benefit from it. Segfault is a production of Honesty is Best. Find more of our work at honestyisbest.com. Finally, as I mentioned at the top of the episode, Pashi is hiring someone with experience in compilers and programming languages. If you fit this description, or know someone who does, please feel free to email me at soham@pashi.com. [00:54:28]
Segfault Podcast RSS Feed / Subscribe using Apple Podcasts / Subscribe using Google Podcasts
Learn how to copy the RSS feed into your favourite podcast player here