A Neuromorphic Approach to Computer Vision

  • Here is the team that I am representing: Tomaso Poggio and Bob Desimone at MIT, Christof Koch at CalTech, and Winrich Freiwald, who used to be in Bremen, is now at CalTech, and will soon be at Rockefeller.
  • Our group has been focusing on the computational mechanisms of invariant object recognition. This is obviously a very hard computational problem: despite decades of engineering effort, we still have not been able to build a computer algorithm that can compete with the speed, robustness, and efficiency of the primate visual system.

    Our long-term goal is thus to build machines that not only mimic the processing of information in the visual cortex but also see and interpret the visual world as well as we do.

  • Over the years we have developed an initial quantitative model of information processing in the visual cortex. The model summarizes what is currently known about the anatomy, physiology, and organization of the visual cortex. It does not try to explain processing in one specific visual area; instead it spans several visual areas, with a relatively large number of units (on the order of 100 million).

    The model combines reverse engineering, in which parameters such as receptive-field (RF) sizes are derived from available data, with forward engineering, since it is also inspired by well-known principles from learning theory and computer vision.

    Together with colleagues, we have shown that the resulting architecture is surprisingly consistent with data from V1, V2, V4, MT and IT.


  • Unfortunately I am not going to have much time to give you details about this model; I would be happy to talk afterwards if anyone has questions. The key assumption is that when the visual system is flashed with an image, the visual signal is rapidly routed through a hierarchy of visual areas in a single feedforward sweep.

    Our key assumption is that the goal of the ventral stream of the visual cortex is to build, during the first ~150 ms of visual processing, a base representation in which object categories can be represented in a position- and scale-tolerant manner, before more complex routines, in particular shifts of attention and eye movements, take place.

    This base representation takes the form of a population of model units at various stages of the hierarchy, tuned to key features of natural images with different levels of complexity and invariance. Learning in the model of the ventral stream is unsupervised, so that when training the model to recognize a new object category we do not have to retrain the whole hierarchy, only the task-specific circuits that sit at the top, for instance in the PFC. You can think of these task-specific circuits as a linear classifier, if you will.
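
    As a concrete (and purely illustrative) sketch of this division of labor in Python: the hierarchy stand-in below, its dimensions, and the data are hypothetical placeholders; only the structure, a fixed task-independent feature stage followed by a small supervised linear readout, mirrors the description above.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        D_IN, D_FEAT = 32 * 32, 200                # toy image size, ~200 top-stage units
        W = rng.standard_normal((D_FEAT, D_IN))    # frozen, task-independent "hierarchy"

        def hierarchy_features(images):
            # Stand-in for the fixed feedforward hierarchy: a frozen random
            # projection plus rectification. In the model, these units are tuned
            # to natural-image fragments and are position/scale tolerant.
            return np.maximum(0.0, images.reshape(len(images), -1) @ W.T)

        # Synthetic stand-ins for animal vs. non-animal images.
        X = rng.standard_normal((400, D_IN))
        X[:200] += 0.2                             # give one class a weak signature
        y = np.r_[np.ones(200), np.zeros(200)].astype(int)

        # The only supervised, task-specific part: a linear classifier ("PFC circuit").
        idx = rng.permutation(400)
        train, test = idx[:300], idx[300:]
        clf = LogisticRegression(max_iter=1000)
        clf.fit(hierarchy_features(X[train]), y[train])
        print("held-out accuracy:", clf.score(hierarchy_features(X[test]), y[test]))
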
  • Let me show you one example of the validation we have performed on this model. Here, for instance, we considered a small population of about 200 random model units in one of the top stages of the architecture I just presented. From this population activity we can try to read out the object category of stimuli presented to the model. In fact, we can train a classifier with stimuli presented at one position and scale and see how well it generalizes to other positions and scales; this tells you how much invariance is built into the population of units. We get the results indicated here by the light gray bars, corresponding to different amounts of shift in position and scale. You can play the same game with neurons in IT, which is the highest purely visual area and has been critically linked with primates' ability to recognize objects invariant to position and scale. Here we found that the model was able to predict not only the overall level of performance but also the range of invariance to position and scale.
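
    A sketch of that train-at-one-condition, test-at-another protocol, with synthetic data: the 200 "units", the category structure, and the stand-in for a position shift (a fixed rotation of the response space) are illustrative assumptions, not the recorded data.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(2)
        n_units, n_stim = 200, 300
        prototypes = rng.standard_normal((2, n_units))            # two object categories
        y = rng.integers(0, 2, n_stim)
        center = prototypes[y] + 0.7 * rng.standard_normal((n_stim, n_units))

        # Hypothetical effect of shifting the stimulus: a fixed rotation of the
        # population response (real units are only partially shift tolerant).
        Q, _ = np.linalg.qr(np.eye(n_units) + 0.3 * rng.standard_normal((n_units, n_units)))
        shifted = center @ Q

        clf = LogisticRegression(max_iter=1000).fit(center[:200], y[:200])
        print("trained & tested at center:", clf.score(center[200:], y[200:]))
        print("tested at shifted position:", clf.score(shifted[200:], y[200:]))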

  • Another important validation is behavior, assessed here using human psychophysics.

    As I mentioned earlier, the original goal of the model was not to explain natural everyday vision, when you are free to move your eyes and shift your attention, but rather what is often called rapid or immediate recognition, which corresponds to the first 100-150 ms of visual processing when an image is briefly presented, i.e., when the visual system is forced to operate in a feedforward mode before eye movements and shifts of attention take place.

    An example is shown on the left. Here I flash an image for a few milliseconds; you probably don't have time to take in every fine detail, but most people are able to say whether it contains an animal or not.

    Here we divided our dataset into four subcategories: head... Overall, both the model and humans score about 80% on this very difficult task, and you can see that they agree quite well in terms of how they perform across the four subcategories.

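    For reference, the model-vs-human comparison on this task is usually expressed in d' units; here is the standard signal-detection computation of d' from hit and false-alarm rates (the rates below are illustrative, not the published values):

        from scipy.stats import norm

        def dprime(hit_rate, fa_rate):
            # Signal-detection sensitivity: d' = Z(hit rate) - Z(false-alarm rate)
            return norm.ppf(hit_rate) - norm.ppf(fa_rate)

        # e.g. ~80% correct on both animal and distractor trials:
        print(round(dprime(0.80, 0.20), 2))   # -> 1.68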

  • This dependence of human and model performance on clutter motivated a subsequent electrophysiology experiment done with Winrich Freiwald during the Neo2 project.

    Here we found that this trend still holds for neurons in monkey IT cortex. We used fMRI to find areas that are differentially selective for animal vs. non-animal images, and Winrich went on to record from a small population of about 200 neurons in this area. You can see the readout results here on the right: we could reliably read out the animal category from these difficult real-world images. Interestingly, there was also a surprisingly strong signal at the BOLD level (this is using a contrast agent).

  • More recently we gained access to a population of patients with intractable epilepsy who are scheduled for resective surgery. Typically these patients spend about a week at the hospital with implanted electrodes, monitored 24/7 to essentially triangulate the epileptic site. They offer a unique opportunity to obtain not only behavioral measurements but also simultaneous intracranial recordings (here we measure local field potentials from iEEG). I should emphasize that the spatial and temporal resolution we get is several orders of magnitude higher than what we could get with non-invasive imaging techniques such as fMRI.

    As an illustration, here is one electrode from one patient performing this animal vs. non-animal categorization task. The electrode location still has to be confirmed but is probably somewhere around the temporal lobe. You can see that already around 145 ms one can read out the presence or absence of an animal presented to the patient.
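
    A minimal sketch of this kind of time-resolved readout, with synthetic data standing in for the recorded field potentials; the bin size, window length, threshold, and onset are assumptions for illustration only:

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(1)
        n_trials, n_bins = 200, 60                    # 60 bins of 10 ms = 0..600 ms
        y = rng.integers(0, 2, n_trials)              # animal vs. non-animal labels
        lfp = rng.standard_normal((n_trials, n_bins)) # fake single-electrode signal
        lfp[y == 1, 15:] += 0.8                       # category signal appears ~150 ms

        win = 3                                       # 30 ms sliding window
        for t0 in range(n_bins - win):
            acc = cross_val_score(LogisticRegression(), lfp[:, t0:t0 + win], y, cv=5).mean()
            if acc > 0.6:                             # crude above-chance criterion
                print(f"category decodable from ~{t0 * 10} ms (accuracy {acc:.2f})")
                break
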
  • Of course, one key limitation of this approach is that we have no control over the location of the electrodes, which is based solely on medical criteria. However, by pooling data from multiple patients we hope to be able to reconstruct the feedforward sweep and recover readout latencies across the temporal lobe.
  • In parallel, we have used this model in real-world computer vision applications. For instance, we have developed a computer vision system for the automatic parsing of street-scene images. Here are examples of automatic parsing by the system, overlaid on the original images; the colors and bounding boxes indicate predictions from the model (e.g., green for trees).



  • We have also made a number of improvements to the implementation of this model. The original Matlab implementation was quite slow...

    We have been working on several ways to speed it up: we started with an efficient multi-threaded C/C++ implementation and finally moved to exploiting the recent gains in computational power from graphics processing hardware (GPUs).
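
    The heavy lifting in such a model is dense filtering of images with large filter banks (the S1-like stages), which maps naturally onto GPUs. As a hedged illustration, here is that kind of operation written with NumPy; to a first approximation, swapping NumPy for CuPy runs the same FFT-based filtering on the GPU, since CuPy mirrors NumPy's API. Sizes and filters are placeholders:

        import numpy as np

        def filter_bank_fft(image, filters):
            # Correlate one image with a bank of 2-D filters via the FFT
            # (circular correlation; fine for a speed illustration).
            H, W = image.shape
            F = np.fft.rfft2(image)
            out = [np.fft.irfft2(F * np.conj(np.fft.rfft2(f, s=(H, W))), s=(H, W))
                   for f in filters]
            return np.stack(out)

        image = np.random.rand(256, 256)
        bank = [np.random.rand(11, 11) for _ in range(16)]   # stand-ins for Gabor filters
        print(filter_bank_fft(image, bank).shape)            # (16, 256, 256)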

  • More recently we have extended the approach to the recognition of human actions such as running, walking, jogging, jumping, waving, etc.

    In all cases we have shown that the resulting biologically motivated computer vision systems perform on par with or better than state-of-the-art computer vision systems.
  • There are several other systems that

  • Let me switch gears and tell you a little about our work on attention. As I showed you earlier, one key limitation of this feedforward architecture is that it performs well only when the object to be recognized is large and the amount of background clutter is limited. I have shown you that, consistent with human psychophysics and monkey electrophysiology, the performance of the model decreases quite significantly as the amount of clutter increases.

    We have been working under the assumption that the way the visual system overcomes this limitation is via cortical feedback and shifts of attention. In particular, our working hypothesis is that the role of spatial attention is to suppress the clutter so that the object of interest appears as if it were presented in isolation.

    In collaboration with electrophysiology labs, we are studying the circuits and networks of visual areas involved in attention, which involve a complex interaction between the ventral stream (area V4 in particular), prefrontal areas such as the FEF, and the parietal cortex.
  • We made two key extensions to this model.

    First, we assume that feature-based attention acts through a cascade of top-down connections through the ventral stream, originating in the PFC, where a template of the target object is held in memory, all the way down to V4 and possibly lower areas.

    Second, we assume a spatial-attention modulation originating in the parietal cortex (here I am assuming LIP, based on limited experimental evidence).


  • These attentional mechanisms can be cast in a probabilistic Bayesian framework in which the parietal cortex represents location variables and the ventral stream represents feature variables; these are our image fragments.

    Variables for the target object are encoded in higher areas such as the PFC...
    This framework is inspired by an earlier model by Rao to explain spatial attention and is a special case of the computational model of the visual cortex described by David Mumford, which most of you probably know...
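
    To make the framework concrete, here is a toy discrete version. The variables follow the description above (L for location, F_i for feature fragments, O for the target object), but the numbers of locations and fragments, the probabilities, and the template-weighted combination rule are all illustrative assumptions, not the published model:

        import numpy as np

        rng = np.random.default_rng(3)
        n_loc = 5
        prior_L = np.full(n_loc, 1.0 / n_loc)   # spatial attention: prior over location L

        # Bottom-up evidence: how strongly each of 3 feature fragments F_i is
        # present at each location (stand-in for ventral-stream responses).
        feature_maps = rng.random((3, n_loc))

        # Target template held in PFC, P(F_i | O): the target is mostly fragment 0.
        template = np.array([0.8, 0.15, 0.05])

        # Posterior over where the target is, combining prior and evidence:
        # P(L | F, O) is proportional to P(L) * prod_i P(F_i | L, O).
        # Each fragment's evidence is weighted by the template (illustrative rule).
        likelihood = np.prod(feature_maps ** template[:, None], axis=0)
        post_L = prior_L * likelihood
        post_L /= post_L.sum()
        print("attend to location", post_L.argmax(), "posterior:", post_L.round(3))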


  • We have implemented the approach in the context of our animal-detection task, and the performance of the model increases with only one shift of attention. Here is the performance of the feedforward model as I showed you earlier, but averaged across all categories; here is the performance allowing one shift of attention. Just for comparison, here is the performance of human observers when images are flashed very briefly, and here is their performance when they are given just a little more time, presumably just enough to allow one shift of attention. Obviously, our long-term goal is to match human performance when observers have as much time as they need.
  • Let me just summarize some of our main achievements from phase 0 of Neo2.




  • If we want to make real progress in deciphering the computations and representations of the visual cortex, we need to study brains not just at the level of single neurons but across multiple levels of analysis.
    In particular, we need to be able to:
    1) understand how key computations for object recognition are carried out in cortical microcircuits; we have been working on new tools for optical silencing and stimulation of neurons, based on channelrhodopsin, to study these circuits;
    2) understand the interactions between networks of neurons within single cortical areas, which will require the development of multi-electrode technologies not only in lower visual areas, as is currently done, but also in higher visual areas that are more difficult to access;
    3) finally, record from multiple areas at a time rather than just one, to understand how these areas communicate with each other.
  • At the same time, these neuroscience data will allow us not only to validate but also to extend existing models of the visual cortex and, hopefully, improve their recognition capabilities. In particular, if we want computer systems that can compete with the primate visual system, we need to go beyond rapid categorization tasks and study vision in more natural settings.
    I think there are two key neuroscience questions to be studied:
    First, as I have already alluded to in this talk, cortical feedback and shifts of attention are likely the key computational mechanisms by which the visual system solves most of the difficulties inherent in vision, namely dealing with significant amounts of clutter as well as ambiguity in the visual input due to occlusion or low signal-to-noise.
    The second is the processing of image sequences: not as a succession of independent snapshots, as in the model of rapid object categorization I showed you, but with models that can exploit the temporal continuity of image sequences, both for learning invariance to transformations (zooming and looming, translation, 3D rotation, etc.) and for the recognition of objects in motion.
  • Along those lines, we have started to make significant progress in understanding the circuitry of attention, and in particular how spatial attention works to suppress clutter in image displays of this kind.
  • The next step is obviously to move towards more natural stimulus presentations.
  • I think significant progress in computer vision will come from the use of video sequences and the exploitation of temporal continuity in those sequences.

    Here is the way current computer vision systems treat the visual world: as a collection of independent frames. The visual world is obviously much richer than that, and time is an important component of visual perception. Babies do not learn to recognize giraffes from labeled examples of this kind. Instead, a baby going to the zoo, perhaps for the first time, has access to much richer information, in which giraffes undergo transformations such as rotation in depth, looming, or shifting on the retina in a smooth, continuous way. It is our belief that by exploiting these principles we will be able to build better learning algorithms.
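
    One classical way to turn this intuition into an algorithm, offered here as a hedged illustration rather than the specific method used in this work, is a slowness objective in the spirit of slow feature analysis: find projections of the input stream whose responses vary slowly in time, which tend to be invariant to the fast frame-to-frame transformations.

        import numpy as np

        def slow_feature(X):
            # X: (T, d) time series. Returns the unit-variance signal whose
            # frame-to-frame variation is smallest (a minimal linear SFA).
            X = X - X.mean(axis=0)
            U, S, Vt = np.linalg.svd(X, full_matrices=False)
            Z = U * np.sqrt(len(X) - 1)                 # whitened signals
            dZ = np.diff(Z, axis=0)                     # temporal differences
            w = np.linalg.eigh(dZ.T @ dZ)[1][:, 0]      # slowest direction
            return Z @ w

        # Toy stream: a slowly varying 'identity' signal mixed with fast noise.
        t = np.linspace(0, 2 * np.pi, 500)
        X = np.c_[np.sin(t), np.random.randn(500)]
        slow = slow_feature(X)
        print("correlation with the slow source:",
              round(abs(np.corrcoef(slow, np.sin(t))[0, 1]), 2))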

  • Most of the work in computer vision and visual neuroscience has focused on the recognition of isolated objects. However, vision is much more than classification: it involves interpreting, parsing, and navigating visual scenes. By just looking, a human observer can answer an essentially unlimited number of questions about an image, for instance about the location and boundary of an object, how to grasp it, or how to navigate around it. These are essential problems for robotics applications, and they have remained largely unaddressed in neuroscience.
  • We have implemented the approach in the context of our animal search task;
    the model mostly improves in the medium- and far-body conditions.
  • Computational considerations suggest that you need two types of operations, and therefore two functional classes of cells, for invariant object recognition. A minimal sketch of both operations follows this note.

    The Gaussian-bell tuning was motivated by a learning algorithm called Radial Basis Functions, while the max operation was motivated by the standard scanning approach in computer vision and by theoretical arguments from signal processing.

    The goal of the simple units is to increase the complexity of the representation, here by pooling together, via this Gaussian-like tuning, the activity of afferent units with different orientations. Gaussian tuning is ubiquitous in the visual cortex, from orientation tuning in V1 to tuning for complex objects around certain poses in IT.
    The complex units pool together afferent units with the same preferred stimulus (e.g., a vertical bar) but slightly different positions and scales. At the complex-unit level we thus build some tolerance to the exact position and scale of the stimulus within the receptive field of the unit.
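
    Here is that sketch; the toy patterns, template, and tuning width are hypothetical placeholders for real afferent activities:

        import numpy as np

        def simple_unit(afferents, template, sigma=1.0):
            # Gaussian-bell tuning (template matching, ~"AND"): peaks when the
            # pattern of afferent activity matches the stored template.
            return np.exp(-np.sum((afferents - template) ** 2) / (2 * sigma ** 2))

        def complex_unit(simple_responses):
            # Max-like pooling (~"OR") over simple units with the same preferred
            # stimulus at nearby positions/scales -> shift/scale tolerance.
            return max(simple_responses)

        # Toy example: a 'vertical bar' template probed at three nearby positions.
        template = np.array([0.0, 1.0, 0.0])
        patches = [np.array([0.0, 0.9, 0.1]),   # near-match
                   np.array([0.9, 0.1, 0.0]),   # bar shifted away
                   np.array([0.1, 1.0, 0.0])]   # near-match at another position
        s = [simple_unit(p, template) for p in patches]
        print("S responses:", np.round(s, 2), "-> C response:", round(complex_unit(s), 2))
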
  • A Neuromorphic Approach to Computer Vision

    1. 1. A Neuromorphic Approach to Computer Vision Thomas Serre & Tomaso Poggio Center for Biological and Computational Learning Computer Science and Artificial Intelligence Laboratory McGovern Institute for Brain Research Department of Brain & Cognitive Sciences Massachusetts Institute of Technology
    2. 2. Past Neo2 team: CalTech, Bremen & MIT Tomaso Poggio, MIT Bob Desimone, MIT Christof Koch, CalTech Expertise: Winrich Freiwald, Bremen Computational neuroscience Animal behavior Neuronal recording in IT and V4 + fMRI in monkeys Data processing Access to human recordings Multi electrodes
    3. 3. The problem: invariant recognition in natural scenes
    4. 4. The problem: invariant recognition in natural scenes Object recognition is hard!
    5. 5. The problem: invariant recognition in natural scenes Object recognition is hard! Our visual capabilities are computationally amazing
    6. 6. The problem: invariant recognition in natural scenes Object recognition is hard! Our visual capabilities are computationally amazing Long-term goal: Reverse- engineer the visual system and build machines that see and interpret the visual world as well as we do
    7. 7. Neurally plausible quantitative model of visual perception [diagram: hierarchy of model layers from V1 (S1/C1) through V2/V3/V4/MT (S2/C2) to IT (S3/C2b/S4) and PFC, with RF sizes and numbers of units per layer; simple cells (Gaussian tuning) and complex cells (MAX); main and bypass routes; dorsal "where" and ventral "what" pathways; unsupervised, task-independent learning up to IT, supervised task-dependent learning in the PFC classification units for animal vs. non-animal]
    8. 8. Neurally plausible quantitative model of visual perception [same diagram] Large-scale (10^8 units), spans several areas of the visual cortex
    9. 9. Neurally plausible quantitative model of visual perception [same diagram] Large-scale (10^8 units), spans several areas of the visual cortex. Combination of forward and reverse engineering
    10. 10. Neurally plausible quantitative model of visual perception [same diagram] Large-scale (10^8 units), spans several areas of the visual cortex. Combination of forward and reverse engineering. Shown to be consistent with many experimental data across areas of visual cortex
    11. 11. Feedforward processing and rapid recognition
    12. 12. Feedforward processing and rapid recognition
    13. 13. Feedforward processing and rapid recognition
    14. 14. Feedforward processing and rapid recognition
    15. 15. Feedforward processing and rapid recognition [category-selective units read out by a linear perceptron]
    16. 16. Model validation against electrophysiology data
    17. 17. Model validation against electrophysiology data [plot: classification performance (0-1) for IT neurons vs. model, across test conditions varying size (3.4°, 1.7°, 6.8°) and position (center, 2° horz., 4° horz.) relative to the training condition] Model data: Serre Kouh Cadieu Knoblich Kreiman & Poggio 2005; experimental data: Hung* Kreiman* Poggio & DiCarlo 2005
    18. 18. Explaining human performance in rapid categorization tasks Serre Oliva & Poggio 2007
    19. 19. Explaining human performance in rapid categorization tasks Serre Oliva & Poggio 2007
    20. 20. Explaining human performance in rapid categorization tasks [example animal and natural-distractor images for the four subcategories: head, close-body, medium-body, far-body] Serre Oliva & Poggio 2007
    21. 21. Explaining human performance in rapid categorization tasks [bar plot: performance (d', ~1.0-2.6) by subcategory (head, close-body, medium-body, far-body) for the model (82% correct) and human observers (80% correct)] Serre Oliva & Poggio 2007
    22. 22. Decoding animal category from IT cortex [recording site in monkey's IT] Meyers Freiwald Embark Kreiman Serre Poggio in prep
    23. 23. Decoding animal category from IT cortex [panels: model vs. IT neurons; recording site in monkey's IT; fMRI] Meyers Freiwald Embark Kreiman Serre Poggio in prep
    24. 24. Decoding animal category from IT cortex in humans
    25. 25. Decoding animal category from IT cortex in humans [readout trace: animal vs. non-animal decodable from ~145 ms]
    26. 26. Decoding animal category from IT cortex in humans
    27. 27. Decoding animal category from IT cortex in humans
    28. 28. Decoding animal category from IT cortex in humans
    29. 29. Bio-motivated computer vision Scene parsing and object recognition Computer vision system based on the response properties of neurons in the ventral stream of the visual cortex Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
    30. 30. Bio-motivated computer vision Scene parsing and object recognition Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
    31. 31. Bio-motivated computer vision Scene parsing and object recognition Gflops Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
    32. 32. Bio-motivated computer vision Scene parsing and object recognition. Speed improvement since 2006:
        image size   multi-thread   GPU (CUDA)
        64x64        4.5x           14x
        128x128      3.5x           14x
        256x256      1.5x           17x
        512x512      2.5x           25x
        From ~1 min down to ~1 sec !! Serre Wolf & Poggio 2005; Wolf & Bileschi 2006; Serre et al 2007
    33. 33. Bio-motivated computer vision Action recognition in video sequences, using motion-sensitive MT-like units [example actions: bend, jack, jump, run, side, walk, wave 1, wave 2, jump 2] Jhuang Serre Wolf & Poggio 2007
    34. 34. Recognition accuracy (cross-validation: 2/3 training, 1/3 testing, 10 repeats):
        dataset         Dollar et al. '05   model    chance
        KTH (human)     81.3%               91.6%    16.7%
        Weiz. (human)   86.7%               96.3%    11.1%
        UCSD (mice)     75.6%               79.0%    20.0%
        Jhuang Serre Wolf & Poggio ICCV'07
    35. 35. Automatic recognition of rodent behavior Serre Jhuang Garrote Poggio Steele in prep
    36. 36. Automatic recognition of rodent behavior. Performance (agreement): human 72%, proposed system 71%, commercial system 56%, chance 12%. Serre Jhuang Garrote Poggio Steele in prep
    37. 37. Neuroscience of attention and Bayesian inference
    38. 38. Neuroscience of attention and Bayesian inference
    39. 39. Neuroscience of attention and Bayesian inference
    40. 40. Neuroscience of attention and Bayesian inference integrated model of attention and recognition
    41. 41. Neuroscience of attention and Bayesian inference [diagram: V2 -> V4/PIT -> IT -> PFC] Integrated model of attention and recognition, in collaboration with Desimone lab (monkey electrophysiology)
    42. 42. Neuroscience of attention and Bayesian inference [diagram: feature-based attention from PFC down through IT and V4/PIT to V2] Integrated model of attention and recognition, in collaboration with Desimone lab (monkey electrophysiology)
    43. 43. Neuroscience of attention and Bayesian inference [diagram: feature-based attention from PFC; spatial attention from LIP/FEF onto V4/PIT] Integrated model of attention and recognition, in collaboration with Desimone lab (monkey electrophysiology)
    44. 44. Neuroscience of Attention and Bayesian inference [same diagram] see also Rao 2005; Lee & Mumford 2003. Chikkerur Serre & Poggio in prep
    45. 45. Neuroscience of Attention and Bayesian inference [diagram: object priors O in PFC drive feature-based attention; location priors L in LIP/FEF drive spatial attention; feature variables F_i over image fragments; image I at the bottom (V2)] see also Rao 2005; Lee & Mumford 2003. Chikkerur Serre & Poggio in prep
    46. 46. Model predicts well human eye-movements Integrating (local) feature-based + (global) context-based cues accounts for 92% of inter-subject agreement! Chikkerur Tan Serre & Poggio in sub
    47. 47. Model performance improves with attention [bar plot: performance (d') with no attention vs. one shift of attention, model vs. humans] Chikkerur Serre & Poggio in prep
    48. 48. Model performance improves with attention [same plot, d' scale 0-3] Chikkerur Serre & Poggio in prep
    49. 49. Model performance improves with attention [same plot] Chikkerur Serre & Poggio in prep
    50. 50. Model performance improves with attention [same plot] Chikkerur Serre & Poggio in prep
    51. 51. Model performance improves with attention [same plot, with mask vs. no-mask conditions] Chikkerur Serre & Poggio in prep
    52. 52. Main Achievements in Neo2
    53. 53. Main Achievements in Neo2 Extended + extensively tested feedforward model on real-world recognition tasks [Poggio]: matches neural data mimics human performance in rapid categorization performs at the level of state-of-the-art computer vision systems C++ software + interface available / 100x speed-up combined with saliency algorithm + tested on real-time street surveillance (video)
    54. 54. Main Achievements in Neo2 Extended + extensively tested feedforward model on real-world recognition tasks [Poggio]: matches neural data mimics human performance in rapid categorization performs at the level of state-of-the-art computer vision systems C++ software + interface available / 100x speed-up combined with saliency algorithm + tested on real-time street surveillance (video) Demonstrated read out of cluttered natural images from monkey fMRI and physiology recordings in inferotemporal cortex [Freiwald and Poggio]: first decoding of cluttered complex images agreement with original feedforward model
    55. 55. Main Achievements in Neo2 Extended + extensively tested feedforward model on real-world recognition tasks [Poggio]: matches neural data mimics human performance in rapid categorization performs at the level of state-of-the-art computer vision systems C++ software + interface available / 100x speed-up combined with saliency algorithm + tested on real-time street surveillance (video) Demonstrated read out of cluttered natural images from monkey fMRI and physiology recordings in inferotemporal cortex [Freiwald and Poggio]: first decoding of cluttered complex images agreement with original feedforward model Characterized neural encoding in V4, IT and FEF under passive and task- dependent viewing conditions [Desimone and Poggio]: characterized the dynamics of bottom-up vs. top-down visual information processing (characteristic timing signature of activity in V4 and IT vs. FEF) top-down, task-dependent, attention modulates features in V4 and IT
    56. 56. Main Achievements in Neo2
    57. 57. Main Achievements in Neo2 Implemented new extended model suggested by these neuroscience data from Desimone lab to include attention via feedback loops from higher areas [Poggio] predicts well human gaze in natural images significantly improves recognition performance of original model in clutter
    58. 58. Main Achievements in Neo2 Implemented new extended model suggested by these neuroscience data from Desimone lab to include attention via feedback loops from higher areas [Poggio] predicts well human gaze in natural images significantly improves recognition performance of original model in clutter Extended model for classification of video sequences (i.e., action recognition) [Poggio] tested on several video databases and shown to outperform previous algorithms
    59. 59. Main Achievements in Neo2 Implemented new extended model suggested by these neuroscience data from Desimone lab to include attention via feedback loops from higher areas [Poggio] predicts well human gaze in natural images significantly improves recognition performance of original model in clutter Extended model for classification of video sequences (i.e., action recognition) [Poggio] tested on several video databases and shown to outperform previous algorithms Demonstrated read-out from human medial temporal lobe (MTL) [Koch] Decoding of natural scenes from single neurons in human MTL Improved ability of saliency model to mimic human gaze patterns
    60. 60. Main Achievements in Neo2 Implemented new extended model suggested by these neuroscience data from Desimone lab to include attention via feedback loops from higher areas [Poggio] predicts well human gaze in natural images significantly improves recognition performance of original model in clutter Extended model for classification of video sequences (i.e., action recognition) [Poggio] tested on several video databases and shown to outperform previous algorithms Demonstrated read-out from human medial temporal lobe (MTL) [Koch] Decoding of natural scenes from single neurons in human MTL Improved ability of saliency model to mimic human gaze patterns Model used to transfer neuroscience data to biologically inspired vision systems
    61. 61. Future Directions 1-of-2 [MIT team: Poggio, Desimone, Serre, IT physiologist, + (Koch+Itti)] Develop new technologies to decode computations and representations in the visual cortex:
    62. 62. Future Directions 1-of-2 [same team] Develop new technologies to decode computations and representations in the visual cortex: optical silencing and circuit-stimulation technology based on X-rhodopsin
    63. 63. Future Directions 1-of-2 [same team] Develop new technologies to decode computations and representations in the visual cortex: optical silencing and circuit-stimulation technology based on X-rhodopsin; multi-electrode network technology
    64. 64. Future Directions 1-of-2 [same team] Develop new technologies to decode computations and representations in the visual cortex: optical silencing and circuit-stimulation technology based on X-rhodopsin; multi-electrode network technology; simultaneous recording system across areas
    65. 65. From the neuroscience data towards a system-level model of natural vision [MIT team: Poggio, Desimone, Serre, XXX] 1. Clutter and image ambiguities: attention and cortical feedback 2. Learning and recognition of objects in video sequences
    66. 66. Clutter and image ambiguities: Attention and cortical feedback [schematic: IT]
    67. 67. Clutter and image ambiguities: Attention and cortical feedback. Circuitry of attention and role of synchronization in top-down and bottom-up search tasks: monkey electrophysiology in V4, IT and FEF
    68. 68. Clutter and image ambiguities: Attention and cortical feedback [schematic: feedback (+) to IT]
    69. 69. Learning and recognition of objects in video sequences [panels: how current computer vision systems learn vs. how brains learn]
    70. 70. Learning and recognition of objects in video sequences [same panels]
    71. 71. Thank you!
    72. 72. Past Neo2 team: CalTech, Bremen & MIT Tomaso Poggio, MIT Bob Desimone, MIT Christof Koch, CalTech Winrich Freiwald, Bremen
    73. 73. IT readout improves with attention [paradigm: stim, cue, transient change; readout traces: isolated object, object not shown; n=67] Zhang Meyers Serre Bichot Desimone Poggio in prep
    74. 74. IT readout improves with attention [adds trace: attention away from object] Zhang Meyers Serre Bichot Desimone Poggio in prep
    75. 75. IT readout improves with attention [same traces] Zhang Meyers Serre Bichot Desimone Poggio in prep
    76. 76. IT readout improves with attention [MIT team: Poggio, Desimone, Serre, XXX; adds trace: attention on object] Zhang Meyers Serre Bichot Desimone Poggio in prep
    77. 77. Two functional classes of cells to explain invariant object recognition in the visual cortex: simple cells = template matching, Gaussian-like tuning (~"AND"); complex cells = invariance, max-like operation (~"OR"). Riesenhuber & Poggio 1999 (building on Fukushima 1980 and Hubel & Wiesel 1962)
