Attentional Object Detection - introductory slides.
Speaker Notes

  • This is related to current research work in ML on anytime algorithms. I think that the only solution to this goal is attentional detection.
  • Sequential decision problem. No post-processing, because at any point detection can be cut off.

Presentation Transcript

  • Attentional Object Detection: Why look for everything everywhere? Sergey Karayev, for the UC Berkeley Computer Vision Retreat, 2011.
  • Problem: recognition and localization of objects of multiple classes in cluttered scenes.
  • Object detection pipeline: Proposals → Detectors → Post-process.
  • Proposals: sliding window; sliding window with priors/pruning; voting; efficient search; etc.
  • Proposals: sliding window. Too slow: quadratic in the number of search dimensions (x, y, scale, class). Speed-ups: parallelization; priors/pruning with non-detector features; algorithmic efficiency. (A back-of-the-envelope count is sketched below.)
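A minimal sketch of why exhaustive sliding-window search is so costly: the number of classifier evaluations is a product over every search dimension. The image size, stride, scale count, and class count below are illustrative assumptions, not numbers from the slides.

```python
# Count of sliding-window classifier evaluations.
# All constants are illustrative assumptions.

def num_windows(img_w=640, img_h=480, stride=8, num_scales=10, num_classes=20):
    """Cost multiplies across every search dimension (x, y, scale, class)."""
    positions = (img_w // stride) * (img_h // stride)  # x-y grid of windows
    return positions * num_scales * num_classes

print(num_windows())  # 80 * 60 * 10 * 20 = 960,000 evaluations per image
```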
  • Proposals: priors/pruning. Uses non-detector features (location, geometry, context, depth, “objectness”). Often done in post-processing.
  • Proposals: voting and efficient subwindow search. Currently only works for local features.
  • Proposals: priority ordered? How? Pruned or exhaustive? Class-specific? → Detectors → Post-process.
  • Detector: template/parts models, local features, decision stumps. [Slide shows figure and text excerpts from Viola & Jones, Rapid Object Detection: Haar-like rectangle features (two-, three-, and four-rectangle sums over the detection window) and the attentional cascade of successively more complex classifiers, which rapidly rejects unpromising regions so that expensive processing is reserved for promising ones; the key measure of such an attentional stage is its false-negative rate.] (A cascade sketch follows.)
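A minimal sketch of the attentional-cascade idea the excerpt describes: cheap stages reject most windows early, so expensive stages run only on promising regions. The `stages` structure (a score function plus threshold per stage) is a hypothetical simplification of trained boosted classifiers, not code from the paper.

```python
# Sketch of a Viola-Jones-style attentional cascade.
# `stages`: list of (score_fn, threshold) pairs, cheapest first;
# both are hypothetical stand-ins for trained boosted classifiers.

def cascade_classify(window, stages):
    """Reject the window at the first stage whose score is below threshold."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early reject: no more computation spent here
    return True  # survived every stage: report a detection
```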
  • Proposals: priority ordered? How? Pruned or exhaustive? Class-specific? Detectors: local or global features? Shared parts across classes? Cascaded? Confidence ≈ likelihood? → Post-process.
  • Post-process
  • Proposals: priority ordered? How? Pruned or exhaustive? Class-specific? Detectors: local or global features? Shared parts across classes? Cascaded? Confidence ≈ likelihood? Post-process: NMS/mean-shift? Context (inter-object)?
  • Where we are: cascaded Deformable Part Models. Per class, ~1 sec per medium-sized image.
  • Where we are: PASCAL: ~5K test images, 20 classes: 28 hours to process. ImageNet ’11: ~450K test images, 3,000 classes: 375,000 hours to process.
  • Where we are: a standard movie: ~130K frames: 36 hours per object class. (The arithmetic is checked in the sketch below.)
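The run-time figures on the last three slides all follow from the ~1 sec per class per image cost of the cascaded DPM; a quick check of the arithmetic:

```python
# Reproducing the slides' run-time arithmetic from ~1 sec per class per image.
SEC_PER_CLASS_IMAGE = 1.0

pascal   = 5_000 * 20 * SEC_PER_CLASS_IMAGE / 3600        # ~28 hours
imagenet = 450_000 * 3_000 * SEC_PER_CLASS_IMAGE / 3600   # ~375,000 hours
movie    = 130_000 * SEC_PER_CLASS_IMAGE / 3600           # ~36 hours per class

print(f"PASCAL {pascal:.0f} h, ImageNet'11 {imagenet:.0f} h, movie {movie:.0f} h/class")
```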
  • So what can we do? Not look for everything everywhere!
  • New performance evaluation. Goal: be able to stop detection at any time and have the most correct detections and the fewest incorrect detections. [Plot: AP vs. time.] (A scoring sketch follows.)
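One way to make this goal concrete is to replay a detector's timestamped output, score it at a series of time checkpoints, and summarize with the area under the AP-vs-time curve. This is a hedged sketch: the `average_precision` helper and the detection-record format are assumptions, not a benchmark defined in the slides.

```python
# Sketch of anytime evaluation: AP as a function of elapsed time.
# `average_precision(dets, gt)` is a hypothetical helper (e.g., PASCAL-style AP).
import numpy as np

def anytime_score(timed_detections, ground_truth, checkpoints):
    """Score only the detections emitted by each checkpoint time."""
    curve = [average_precision([d for d in timed_detections if d["time"] <= t],
                               ground_truth)
             for t in checkpoints]
    return np.trapz(curve, checkpoints)  # area under AP vs. time
```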
  • How?
  • Attention: a natural bottleneck in animal vision. Two kinds: bottom-up (rapid, driven by featurization) and top-down (secondary, driven by task). Eye fixations are a good proxy for implicit attention; necessary because of the fovea.
  • Basic ideas: a single saliency map from which foci of attention are selected; sequential selection, due to “inhibition of return” or information maximization; influenced from the top. [Slide shows an excerpt of Judd, Ehinger, Durand, and Torralba, Learning to Predict Where Humans Look: eye-tracking data from 15 viewers on 1,003 images, used to train a saliency model on low-, middle-, and high-level image features.] (A fixation-selection sketch follows.)
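A minimal sketch of the first two ideas, assuming a precomputed saliency map: repeatedly fixate the most salient location, then suppress saliency around it (inhibition of return). The Gaussian suppression radius is an illustrative assumption.

```python
# Sequential fixation selection from a single saliency map,
# with inhibition of return; sigma is an illustrative assumption.
import numpy as np

def select_fixations(saliency, num_fixations=5, sigma=20.0):
    s = saliency.astype(float).copy()
    ys, xs = np.mgrid[0:s.shape[0], 0:s.shape[1]]
    fixations = []
    for _ in range(num_fixations):
        y, x = np.unravel_index(np.argmax(s), s.shape)  # most salient point
        fixations.append((int(y), int(x)))
        # Inhibition of return: damp saliency near the chosen fixation.
        s *= 1.0 - np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
    return fixations
```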
  • Attentional Object Detector. Assume we have a powerful but expensive per-class classifier. How should we pick locations to consider? What should we look for at a location? (A sketch of this loop follows.)
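A hedged sketch of the loop these two questions imply, not a published method: a cheap model prioritizes locations, the expensive classifier is spent on the best one, and the loop can be cut off at any time. `propose`, `cheap_score`, and `expensive_classifier` are hypothetical components.

```python
# Sketch of an anytime attentional detection loop.
# `propose`, `cheap_score`, `expensive_classifier` are hypothetical.

def attentional_detect(image, propose, cheap_score, expensive_classifier,
                       time_budget):
    detections, spent = [], 0.0
    candidates = propose(image)  # initial candidate locations (a list)
    while candidates and spent < time_budget:
        loc = max(candidates, key=cheap_score)  # most promising location
        candidates.remove(loc)
        label, cost = expensive_classifier(image, loc)  # costly evaluation
        spent += cost
        if label is not None:
            detections.append((loc, label))
    return detections  # usable whenever the loop is cut off
```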
  • Attentional Object Detector: Proposals → Detector.
  • Some related work
  • Vogel and de Freitas. Target-directed attention: sequential decision-making for gaze planning. ICRA 2008. GIST and a simple regressor to compute a likelihood map. Reinforcement learning to find the best gaze sequence. A “heavier” feature and regressor to evaluate the fixation locations.
  • Vogel and de Freitas. Target-directed attention: sequential decision-making for gaze planning. ICRA 2008. Evaluated only on Caltech Office scenes. Gaze planning improves over just using bottom-up saliency while being only slightly slower. Detection rate is lower than full-image search, but maximum precision is higher.
  • Gualdi, Prati, and Cucchiara. Multi-stage Sampling with Boosting Cascades for Pedestrian Detection in Images and Videos. ECCV 2010. LogitBoost classifier with covariance descriptors. The classifier score falls off over some region of support around a true detection. Sample points in the image to estimate P(O|I); resample close to promising points; distribute samples across the stages. [Figure: region of support for the cascade of LogitBoost classifiers trained on the INRIA pedestrian dataset, averaged over 62 pedestrian patches.] (A resampling sketch follows.)
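A rough sketch of the multi-stage sampling idea, under stated assumptions (the stage count, keep fraction, and jitter scale are illustrative, and `score` stands in for the LogitBoost cascade response): score a coarse uniform set of points first, then concentrate later samples around high-scoring ones.

```python
# Sketch of multi-stage sampling: uniform samples first, then resampling
# near promising points. `score((x, y))` stands in for the cascade response.
import random

def multistage_sample(score, width, height, n_per_stage=200, stages=3, jitter=20.0):
    samples = [(random.uniform(0, width), random.uniform(0, height))
               for _ in range(n_per_stage)]                    # stage 1: uniform
    for _ in range(stages - 1):
        seeds = sorted(samples, key=score, reverse=True)[:n_per_stage // 10]
        samples = [(x + random.gauss(0, jitter), y + random.gauss(0, jitter))
                   for x, y in seeds for _ in range(10)]       # resample nearby
    return sorted(samples, key=score, reverse=True)
```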
  • Gualdi et al. Multi-stage Sampling with Boosting Cascades for Pedestrian Detection in Images and Videos. ECCV 2010. Evaluated on INRIA Pedestrians, Graz02, and some videos. Always reduces the miss rate relative to sliding window, while being 2-6x faster.
  • Butko and Movellan. Optimal Scanning for Faster Object Detection. CVPR 2009. A digital fovea (several concentric image patches arranged around a point of fixation) is placed sequentially to maximize expected information gain; with fewer than 25 successive fixations, this foveated approach beats exhaustively applying the detector to a high-resolution image. They liken visual search to stochastic optimal control and use a “multinomial infomax POMDP” (I-POMDP) to pick the fixation sequence, extending the greedy Najemnik & Geisler infomax model with long-term POMDP planning. (A greedy infomax sketch follows.)
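A greedy one-step version of the infomax idea, as a sketch: maintain a belief over grid cells containing the target and fixate wherever the expected entropy reduction is largest. `expected_posterior_entropy` is a hypothetical stand-in for the paper's sensor model, and the actual I-POMDP plans over longer horizons rather than one step.

```python
# Greedy infomax fixation: pick the cell with the largest expected
# entropy reduction. `expected_posterior_entropy` is a hypothetical
# stand-in for the detector/sensor model.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def next_fixation(belief, expected_posterior_entropy):
    """belief: 1-D array of P(target in cell); returns the cell to fixate."""
    gains = [entropy(belief) - expected_posterior_entropy(belief, cell)
             for cell in range(belief.size)]
    return int(np.argmax(gains))
```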
  • Butko and Movellan. Optimal Scanning for Faster Object Detection. CVPR 2009. Evaluated on their own faces dataset against Viola-Jones: a 2x speedup, but a small decrease in accuracy. [Figures: successive MI-POMDP fixations find a face in six fixations with a final localization error of 1.4 grid cells; a time-error curve shows MI-POMDP gives a better speed-accuracy tradeoff than Viola-Jones alone.]
  • Vijayanarasimhan and Kapoor. Visual Recognition and Detection Under Bounded Computational Resources. CVPR 2010. Hough voting with multiple (five in their experiments) feature types to generate an initial set of hypotheses. Uses Value of Information to pick which region and feature type to evaluate next, iteratively updating the hypotheses as features are added until a fixed time budget elapses. The active approach extracts fewer features, takes less time, and has higher accuracy on ETHZ than passive or random selection baselines. [Slide shows a table of feature types with dimensions and per-feature computation times, and qualitative results on the ETHZ shape dataset (five classes: applelogos, bottles, giraffes, mugs, swans) and the INRIA horses dataset.]
  • Image Attributions: Girshick et al. - Cascaded deformable part models; Viola & Jones - Rapid object detection; Judd et al. - Learning to predict where humans look; Chikkerur et al. - What and where? A Bayesian theory of attention; ...and the papers reviewed.