Computer Vision
1
Dark Matter of the Digital Age
• Every day, the Internet generates trillions of images, videos, and
other scraps of digital minutiae. More than 85% of the content on the
web is multimedia imagery – and it’s a chaotic mess.
2
Application Areas: Autonomous Vehicles
3
Human Intelligence and Role of Vision
• ~100,000,000,000 (100 billion) neurons
• ~1,000 trillion connections (synapses) between them
4
5
The McCulloch-Pitts Model of Neuron
6
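The McCulloch-Pitts unit can be sketched in a few lines: binary inputs are summed, and the unit fires only if the sum reaches a fixed threshold. This is a minimal illustrative sketch; the threshold values and example inputs below are chosen for demonstration, not taken from the slides.

```python
# A minimal sketch of a McCulloch-Pitts neuron: binary inputs are summed
# and the neuron fires (outputs 1) iff the sum reaches a fixed threshold.
def mp_neuron(inputs, threshold):
    """Fire (return 1) if enough inputs are active, else 0."""
    return 1 if sum(inputs) >= threshold else 0

# With threshold 2, the unit computes logical AND of two binary inputs.
print(mp_neuron([1, 1], threshold=2))  # fires: 1
print(mp_neuron([1, 0], threshold=2))  # does not fire: 0

# With threshold 1, the same unit computes logical OR.
print(mp_neuron([0, 1], threshold=1))  # fires: 1
```

Changing only the threshold turns the same unit into a different logic gate, which is the core idea of the model: computation from thresholded sums.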
Neuron in Brain
7
Neuron in Computer
8
Single-Layer Neural Network
9
How Hard Is It to Create a Neuron?
10
Reverse Engineering Vision
11
12
Mark I Perceptron [1957–1959]
13
Simple Classification Problem
14
y = xᵀw
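The linear decision rule above (classify by the sign of the inner product between input and weight vector), together with Rosenblatt's error-correction training described in the editor's notes, can be sketched as follows. The toy data points are hypothetical, chosen only to be linearly separable.

```python
import numpy as np

def predict(x, w):
    # Classify by the sign of the inner product x.w
    return 1 if np.dot(x, w) >= 0 else -1

def train_perceptron(X, y, epochs=20):
    # Rosenblatt's error-correction rule: on a mistake, nudge the
    # weights toward the misclassified point's correct side.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if predict(xi, w) != yi:
                w += yi * xi
    return w

# Toy linearly separable data (last component = 1 acts as a bias input).
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print([predict(xi, w) for xi in X])  # → [1, 1, -1, -1]
```

Because this toy data is linearly separable, the convergence result quoted in the notes guarantees the loop finds a separating weight vector in finitely many updates.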
Convolutional Neural Networks
15
Human Visual Cortex
17
18
Evolution of Computer Vision
19
20
The Visual Cortex
• Hubel and Wiesel found that the visual cortex has neurons that are
individually sensitive to different types of lines and angles.
• They noted that kittens deprived of certain types of visual
stimulation early on lost the ability to perceive these patterns.
• This suggests there is a critical period in visual development and
that the brain requires stimulation in order to delegate its resources
to different perceptual tasks: the brain “learns” to perceive
through experience.
21
Hierarchy of Vision
22
Neocognitron [Fukushima 1980]
23
LeNet-5 [LeCun, Bottou, Bengio, Haffner 1998]
24
AlexNet [Krizhevsky, Sutskever, Hinton 2012]
25
Vision Decomposition
26
27
28
29
30
31
32
33
Pooling Layer
• Makes the representation smaller and more manageable
• Operates over each activation map independently
34
Max Pooling
35
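Max pooling on a single activation map can be sketched directly: slide a window over the map and keep only the largest activation in each window. This is a minimal sketch assuming a 2×2 window with stride 2; the example values are hypothetical.

```python
import numpy as np

def max_pool2d(a, size=2, stride=2):
    """Max-pool a 2-D activation map with a size x size window."""
    h, w = a.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=a.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Take the maximum activation inside the current window.
            win = a[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = win.max()
    return out

a = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool2d(a))
# [[6 8]
#  [3 4]]
```

Note that the 4×4 map shrinks to 2×2: pooling halves each spatial dimension here, which is what makes the representation "smaller and more manageable."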
36
37
What Does a ConvNet See?
38
Rifle?
39
Conclusion
• Today, computers can spot a cat or 57 types of mushrooms, but they
are still a long way from seeing and reasoning like humans – from
understanding context, not just content.
• A knife in the kitchen and a knife at a crime scene carry two
different meanings.
40

Convolutional neural networks

Editor's Notes

  • #2 In computer vision, we aspire to develop intelligent algorithms that perform important visual perception tasks, such as object recognition, scene categorization, integrative scene understanding, and human motion recognition. Understanding vision and building visual systems is really understanding intelligence; by "see" we mean to understand, not just to record pixels.
  • #3 Teaching computers to see has applications well beyond identifying things that merely appear in our physical world. Those visual descriptors of humankind are growing faster than we can imagine. The volume of photos and videos generated in the past 30 days is bigger than all images dating back to the dawn of civilization. It's humanly impossible to document all of this data, but intelligent machines that recognize patterns and can describe visual content with natural language could well be our future historians.
  • #4 Manned and unmanned land-based vehicles (ULVs); manned and unmanned aerial vehicles (UAVs). The level of autonomy ranges from fully autonomous vehicles to vehicles where computer-vision-based systems support a driver or pilot in various situations. Fully autonomous vehicles typically use computer vision for navigation.
  • #5 Most of the information we humans possess and process is experienced through vision. If we want machines to think, we need to teach them to see. We use about half of our precious human brainpower for visual processing; it's a cognitive ability that has taken 540 million years of evolution to develop. Vision is so critical to how we understand the world that it's hard to imagine any intelligent computer of the future without it. Any decent self-driving car will eventually need to distinguish between, say, a large rock in the roadway and a similar-sized paper bag – and know that it should brake and steer to avoid the rock but ignore the bag.
  • #6 Discovery of Neurons
  • #8 Describe the working of the neuron.
  • #14 Rosenblatt's early work on perceptrons at the Cornell Aeronautical Laboratory (1957-1959) culminated in the development and hardware construction of the Mark I Perceptron. Mark I, a visual pattern classifier, had an input (sensory) layer of 400 photosensitive units in a 20x20 grid modeling a small retina, an association layer of 512 units (stepping motors), each of which could take several excitatory and inhibitory inputs, and an output (response) layer of 8 units. The connections from the input to the association layer could be altered through plug-board wiring, but once wired they were fixed for the duration of an experiment. The connections from the association to the output layer were variable weights (motor-driven potentiometers) adjusted through the perceptron error-propagating training process. Mark I consisted of 6 racks (approximately 36 square feet) of electronic equipment, and numerous experiments were conducted on this machine (can we describe some?). The machine was essentially an assembly of weight-vector representations for linear discriminations. Noting its ability to learn classification behaviours (through error-correction), Rosenblatt went on to make ambitious claims for the machine's "true originality".
  • #15 An easy way to define a linear boundary involves using the inner product. Assuming data points are fully numeric, we can calculate the inner product of any two by multiplying together their corresponding values and summing: if x and y are two data points, their inner product is x·y = Σᵢ xᵢyᵢ. In the 1950s, Frank Rosenblatt demonstrated that a version of the error-correction algorithm is guaranteed to succeed if a satisfactory set of weights exists. If there is a set of weights that correctly classifies the (linearly separable) training data points, then the learning algorithm will find one such weight set in a finite number of iterations.
  • #16 CNNs are a specialized kind of neural network for processing data that has a known, grid-like topology. CNNs have been tremendously successful in practical applications. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
  • #17 Suppose we are tracking a spaceship with a laser sensor: x(t) gives the position x at time t, where both x and t are real-valued. Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship's position, we would like to average together several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted-average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the ship.
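The weighted moving average described in this note is exactly a discrete convolution: s(t) = Σₐ x(t − a)·w(a). A minimal sketch, with hypothetical sensor readings and weights:

```python
import numpy as np

# Hypothetical noisy position readings x(t) from the laser sensor.
x = np.array([0.0, 1.1, 1.9, 3.2, 3.9, 5.1])

# Weighting function w(a), where a is the age of the measurement:
# more weight for more recent readings (ages 0, 1, 2).
w = np.array([0.5, 0.3, 0.2])

# s(t) = sum over a of x(t - a) * w(a) -- a weighted moving average,
# which np.convolve computes directly as a discrete convolution.
s = np.convolve(x, w, mode='valid')
print(np.round(s, 2))  # → [1.28 2.39 3.29 4.36]
```

Each output is smoother than the raw readings because every estimate blends the last three measurements, with the newest one weighted most heavily.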
  • #20 Most of the information we humans possess and process is experienced through vision. If we want machines to think, we need to teach them to see. In one experiment, done in 1959, Hubel and Wiesel inserted a microelectrode into the primary visual cortex of an anesthetized cat. They then projected patterns of light and dark on a screen in front of the cat. They found that some neurons fired rapidly when presented with lines at one angle, while others responded best to another angle. Some of these neurons responded to light patterns and dark patterns differently. Hubel and Wiesel called these neurons "simple cells".[14] Still other neurons, which they termed complex cells, detected edges regardless of where they were placed in the receptive field of the neuron and could preferentially detect motion in certain directions.[15] These studies showed how the visual system constructs complex representations of visual information from simple stimulus features.
  • #24 The neocognitron is a hierarchical, multilayered artificial neural network proposed by Kunihiko Fukushima in the 1980s. It has been used for handwritten character recognition and other pattern recognition tasks, and served as the inspiration for convolutional neural networks
  • #26 AlexNet is the name of a convolutional neural network, originally written with CUDA to run with GPU support, which competed in the ImageNet Large Scale Visual Recognition Challenge[1] in 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points ahead of the runner-up. AlexNet contained only 8 layers: the first 5 were convolutional layers, followed by fully connected layers.
  • #39 So it appears that what triggers this result is that the network sees enough local evidence that this is not one of the other 999 classes, and enough positive evidence from these local looks to conclude that it's a school bus. Although these systems were engineered under the inspiration of natural vision, they do not work the way natural vision does.
  • #41 Like any new tech innovation, computer vision has the potential to be used for nefarious purposes, starting with high-level, highly intrusive visual surveillance.