Andrew Fitzgibbon - Kinect
First talk by Andrew Fitzgibbon in Microsoft Computer Vision Summer School 2011

Presentation Transcript

  • TWO PARTS: Optimization for computer vision.
  • COMPUTER VISION: TRUTH AND BEAUTY. Andrew Fitzgibbon, Microsoft Research Cambridge.
  • THE TWO FACES OF COMPUTER VISION. Beauty vs. Truth: elegant vs. messy; easy vs. difficult; linear vs. nonlinear; unoccluded vs. occluded; unlimited data vs. missing data; unrealistic vs. real.
  • THE TWO FACES OF BAYES: Truth & Beauty.
  • HOW TO GET IDEAS: Chase the high-hanging fruit. Try to make stuff really work. Look for things that confuse/annoy you. And occasionally you’ll get a good idea. Write about it well.
  • WRITE ABOUT IT WELL: Many papers on writing well, all with similar advice: state the problem; describe the state of the art; say what is wrong with the state of the art; say how you fixed it; show that you have fixed it; use notation correctly.
  • HOW TO GET IDEAS (recap): Chase the high-hanging fruit. Try to make stuff really work. Look for things that confuse/annoy you. And occasionally you’ll get a good idea. Write about it well.
  •  1998: we computed a decent 3D reconstruction of a 36-frame sequence  Giving 3D super-resolution  And set ourselves the goal of solving a 1500-frame sequence  Leading to…[FCZ98] Fitzgibbon, Cross & Zisserman, SMILE 1998CHASE THE HIGH-HANGING FRUIT
  • TRY TO MAKE STUFF REALLY WORK: The program boujou, sold by 2d3. Extremely accurate camera calibration because of bundle adjustment. Used on almost every movie you see. Emmy award for boujou in 2002.
  • DEVELOP OTHERS’ GOOD IDEAS: Image-Based Rendering Using Image-Based Priors, ICCV 2003.
  •  2004: Rigid 3D reconstruction is “solved” :)  What about nonrigid?  Bregler, Torresani, Hertzmann, Biermann are already doing it since 2000  Just one group… so OK to join in!CHASE THE HIGH-HANGING FRUIT
  • HOW TO JOIN IN: 1. Get your own giraffe. 2. Implement the existing algorithms. 3. Find out why they don’t work. 4. More on this later…
  • FROM THE LAB TO REAL LIFE: BLUE SKIES TO WORLD RECORDS. Presenter: Andrew Fitzgibbon, Principal Researcher, Microsoft Research Cambridge.
  • MSR OUTPUTS: patents & papers; technology transfers (innovation into products); IP licensing to 3rd parties; expertise for products.
  • COMPUTER VISION AT MSRC
  • COMPUTER VISION AT MSRC: semantic segmentation into classes such as building, water, car, cow, cat, bicycle, grass, road. [Shotton, Winn, Rother, Criminisi 06 + 08] [Winn & Shotton 06] [Shotton, Johnson, Cipolla 08]
  • COMPUTER VISION AT MSRC: ground truth vs. entangled vs. conventional. A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi, Entangled Decision Forests and their Application for Semantic Segmentation of CT Images, Information Processing in Medical Imaging (IPMI), July 2011.
  • COMPUTER VISION AT MSRC
  • MSR OUTPUTS (recap): patents & papers; technology transfers (innovation into products); IP licensing to 3rd parties; expertise for products.
  • THE SCENARIO
  • STATE OF THE ART: Okada & Stenger 2008; Navaratnam et al. 2007.
  • [figure: depth scale, 1500 mm to 3500 mm]
  • STATE OF THE ART: XBox prototype, Sept 2008. Incredibly impressive: real time, accurate, general poses. But… needs initialization, limited agility.
  • MODEL-BASED VISION: 1965. L. G. Roberts, Machine Perception of Three-Dimensional Solids, in Optical and Electro-Optical Information Processing, J. T. Tippett (ed.), MIT Press.
  • MODEL-BASED VISION: 1980. J. O’Rourke and N. Badler, Model-based image analysis of human motion using constraint propagation, IEEE Trans. on Pattern Analysis and Machine Intelligence.
  • D. Hogg, Image and Vision Computing, Vol. 1 (1983).
  • MODEL-BASED VISION
  • THE PROBLEM WITH MODEL-BASED VISION: The “model” is forward/generative/graphical, requiring search in many dimensions (say 10^13 for the body). Resolved using (a) clever search: gradient descent and better; and (b) temporal coherence: assume we were right in the previous frame, and search only “nearby” configurations in this one.
  • THE PROBLEM WITH TEMPORAL COHERENCE: Exponential likelihood of failure. Assume a 0.1% failure rate per frame: after n frames, chance of success = 0.999^n. At 30 frames per second, that’s a 3.0% chance of failure after 1 second, 83.5% after 1 minute, and 99.99% after 5 minutes.
  • THE PROBLEM WITH TEMPORAL COHERENCE: Even assuming a 0.01% failure rate per frame, after n frames the chance of success = 0.9999^n. At 30 frames per second, that’s a 0.3% chance of failure after 1 second, 16.5% after 1 minute, and 59.3% after 5 minutes.
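The arithmetic on these two slides can be checked directly: with independent per-frame failures at rate p, the chance of surviving n frames is (1 − p)^n. A minimal sketch (the function name is mine):

```python
def failure_after(p_fail, seconds, fps=30):
    """Chance the tracker has failed at least once after `seconds`,
    assuming independent per-frame failures at rate p_fail."""
    frames = seconds * fps
    return 1.0 - (1.0 - p_fail) ** frames

# 0.1% per-frame failure rate, as on the first of the two slides:
print(f"{failure_after(0.001, 1):.1%}")    # ~3.0% after 1 second
print(f"{failure_after(0.001, 60):.1%}")   # ~83.5% after 1 minute
```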
  • SO WE CAN’T USE TEMPORAL COHERENCE: We need a method which works on a single frame [or a small set of frames], so we can “reinitialize” the tracker. Single-frame methods are all based on machine learning, so we’ll need training data: lots of training data. And we will need to represent multiple hypotheses.
  •  “Tracking”: Elegant, Easy, Wrong.
  •  “Tracking”: Elegant, Easy, Wrong. Ignoring temporal priors: Just wrong.
  • LEARNING A FACE DETECTOR: Paul A. Viola, Michael J. Jones, Robust Real-Time Face Detection, IEEE International Conference on Computer Vision, 2001.
  • DETECTION VS. TRACKING: Andrew Blake, Kentaro Toyama, Probabilistic Tracking in a Metric Space, IEEE International Conference on Computer Vision, 2001.
  • STATE OF THE ART (recap): Okada & Stenger 2008; Navaratnam et al. 2007.
  • LEARNING A POSE ESTIMATOR: image features z = (z_1, …, z_100)^T → “pose”, i.e. joint angles θ = (θ_1, …, θ_27)^T.
  • LEARNING A POSE ESTIMATOR, STEP ZERO: training data (z_1, θ_1), (z_2, θ_2), …, (z_i, θ_i), …, (z_N, θ_N).
  • SOURCES OF VARIED DATA: Real home visits. Pose: motion capture, from the standard “CMU” database and a custom database. Body size & shape: retargeting, using an effects/games-industry tool (MotionBuilder).
  • MOTION CAPTURE: actor wearing spherical markers, observed by multiple cameras → “MoCap” 3D joint positions → computer graphics → synthetic depth image.
  • INITIAL EXPERIMENTS: Standard motion capture datasets from the web, fed to MotionBuilder to generate 3D images. Limited range of body types.
  • SIMULATING CAMERA ARTEFACTS: Synthetic data is realistic, but too clean. So artificially corrupt it: depth resolution & noise, rough edges, missing pixels (hair/beards), cropping & occlusions.
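A toy version of this corruption step, assuming made-up parameter values (the quantisation step, noise level and dropout fraction here are illustrative, not the ones used in the talk):

```python
import numpy as np

def corrupt_depth(depth, rng, quant_mm=10.0, noise_mm=5.0, dropout=0.02):
    """Corrupt a clean synthetic depth map (values in mm) with
    sensor noise, limited depth resolution, and missing pixels."""
    d = depth + rng.normal(0.0, noise_mm, depth.shape)   # sensor noise
    d = np.round(d / quant_mm) * quant_mm                # depth quantisation
    mask = rng.random(depth.shape) < dropout             # missing pixels
    d[mask] = 0.0                                        # 0 = no reading
    return d

rng = np.random.default_rng(0)
clean = np.full((240, 320), 2000.0)      # flat wall 2 m from the camera
noisy = corrupt_depth(clean, rng)
```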
  • SO WE HAVE DATA. NOW WHAT? (z_1, θ_1), (z_2, θ_2), …, (z_i, θ_i), …, (z_N, θ_N).
  • “LEARN” FUNCTION FROM DATA: (z_1, θ_1), (z_2, θ_2), …, (z_N, θ_N) → function f?
  • NEAREST NEIGHBOUR: (z_1, θ_1), (z_2, θ_2), …, (z_N, θ_N) → nearest-neighbour function f.
  • ASIDE: DEFINE NEAREST NEIGHBOUR: Given a training set T = {(z_i, θ_i)}_{i=1}^N, define f : ℝ^100 ↦ ℝ^27 such that θ = f(z) behaves as a “nearest neighbour classifier”. Assume a distance d(·, ·) : ℝ^100 × ℝ^100 ↦ ℝ^+.
  • ASIDE: DEFINE NEAREST NEIGHBOUR: f(z) = θ_{i*}, where i* = argmin_i d(z, z_i).
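The nearest-neighbour rule reduces to an argmin over training distances. A minimal numpy sketch: toy dimensions (5-D features, 3-D poses) stand in for the talk’s ℝ^100 ↦ ℝ^27, and the data is random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(1000, 5))        # training features z_i
Theta = rng.normal(size=(1000, 3))    # training poses theta_i

def nn_pose(z):
    """f(z) = theta_{i*} with i* = argmin_i d(z, z_i), Euclidean d."""
    i_star = np.argmin(np.linalg.norm(Z - z, axis=1))
    return Theta[i_star]

pose = nn_pose(Z[42] + 0.01)   # query very close to training point 42
```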
  • ALWAYS TRY NEAREST NEIGHBOUR FIRST: [plot: accuracy (1.0 is best) vs. number of training images, log scale]
  • NEAREST NEIGHBOUR DIDN’T SCALE: [plot: accuracy (0.0–1.0) vs. number of training images, log scale, 30 to 300,000] Time taken: 500 milliseconds per frame.
  • DEALING WITH (LACK OF) DATA: The Joint Manifold Model for Semi-supervised Multi-valued Regression, R. Navaratnam, A. Fitzgibbon, R. Cipolla, ICCV 2007.
  • TRAINING: PLOT THE DATA: (z_1, θ_1), (z_2, θ_2), … [plot: pose θ vs. image z]
  • TRAINING: FIT FUNCTION θ = f(z). [plot: pose θ vs. image z]
  • 3. GIVEN NEW IMAGE z_new, COMPUTE θ = f(z_new). [plot: pose θ vs. image z]
  • 3. OR, MORE USEFULLY, COMPUTE p(θ | z_new). [plot: pose θ vs. image z]
  • Multivalued f: f(z) = θ or θ′?
  • [plots: pose θ vs. image z]
  • GAUSSIAN PROCESS LATENT VARIABLE MODEL [plot: pose θ vs. image z]
  • [plots: pose θ vs. image z, with query z_new]
  • TEMPORAL FILTERING: [animation over frames t = 1, t = 2, t = 3]
  • WE DON’T HAVE THIS… [plot: pose θ vs. image z]
  • WE HAVE THIS… [plot: pose θ vs. image z]
  • OF WHICH A NOT UNREASONABLE MODEL IS… [plot: pose θ vs. image z]
  • WE HAVE TOO FEW LABELLED (z, θ) PAIRS
  • WE HAVE TOO FEW LABELLED (z, θ) PAIRS: But maybe we can capture more unlabelled images, i.e. (z, ∗) pairs.
  • WE HAVE TOO FEW LABELLED (z, θ) PAIRS: But maybe we can capture more mocap, i.e. (∗, θ) pairs.
  • MARGINAL STATISTICS: image marginal p(z) = ∫ p(z, θ) dθ; pose marginal p(θ) = ∫ p(z, θ) dz.
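Discretised, the two marginals are just row and column sums of the joint distribution; a tiny sketch with a made-up 2×2 joint:

```python
import numpy as np

# p(z) = sum_theta p(z, theta),  p(theta) = sum_z p(z, theta)
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])          # rows: z bins, cols: theta bins

p_z = joint.sum(axis=1)       # image marginal -> [0.3, 0.7]
p_theta = joint.sum(axis=0)   # pose marginal  -> [0.4, 0.6]
```

Requiring the model to reproduce marginals measured from extra unlabelled images (for p(z)) or extra mocap (for p(θ)) is exactly the consistency constraint the next slides describe.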
  • MARGINALS CONTRADICT OUR EARLIER GUESS
  • REQUIRING CONSISTENT MARGINALS GIVES THIS…
  • The Joint Manifold Model for Semi-supervised Multi-valued Regression, R. Navaratnam, A. Fitzgibbon, R. Cipolla, ICCV 2007.
  • BACK TO REALITY: That was elegant…
  • BODY PARTS: Whole body: 10^12 poses (say). Four parts: 4 × 10^3+ poses. But ambiguity increases.
  • FOUR PARTS GOOD, 32 PARTS BETTER
  • TRAINING DATA
  • Old (holistic) approach vs. new (parts) approach.
  • VIRTUAL HARLEQUIN SUIT
  • EXPANDING THE REPERTOIRE
  • 300 000 body poses × 15 body models × random camera orientations × other random parameters → synthetic image generation → camera noise simulation → 1 million image pairs.
  • TEST DATA: synthetic (held-out mocap poses) and real (from home visits).
  • EXAMPLE INPUTS & OUTPUTS
  • SLIDING WINDOW CLASSIFIER: input → output.
  • FOCUS ON A SINGLE PIXEL: WHAT PART AM I? Learn Prob(body part | window) from training data. [bar chart over parts: head, l hand, r hand, l shoulder, r shoulder, chest, l elbow, r elbow, …]
  • EXAMPLE PIXEL 1: WHAT PART AM I? Descend the tree: test D1 > 60 mm, then D2 > 20 mm, following yes/no branches to a leaf distribution over parts (head, l hand, r hand, l shoulder, r shoulder, chest, l elbow, r elbow).
  • EXAMPLE PIXEL 2: WHAT PART AM I? This pixel takes a different path: test D1 > 60 mm, then D3 > 25 mm, again ending at a leaf distribution over parts.
  • DECISION TREE CLASSIFICATION: Same tree applied at every pixel; different pixels take different paths; in practice, trees are much deeper. [tree of depth-difference tests D1 > 60 mm, D2 > 20 mm, D3 > 25 mm, with a distribution over parts at each leaf]
  • DECISION FORESTS: A forest is an ensemble of trees, tree 1 … tree T, each giving a distribution over parts (r shoulder, l shoulder, l elbow, r elbow, chest, head, l hand, r hand, …). Helps avoid over-fitting during training [Amit & Geman 97] [Breiman 01]. Testing takes the average of the leaf-node distributions [Geurts et al. 06].
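The averaging rule can be sketched with two hand-made stumps of the kind shown on the per-pixel slides; the thresholds, features (D1, D2) and leaf distributions below are illustrative inventions, not the learned ones:

```python
import numpy as np

PARTS = ["head", "l hand", "r hand", "chest"]

# Each toy "tree" tests one depth-difference feature against a threshold
# and returns the part distribution stored at the reached leaf.
def tree1(d):
    return np.array([0.7, 0.1, 0.1, 0.1]) if d["D1"] > 60 else np.array([0.1, 0.4, 0.4, 0.1])

def tree2(d):
    return np.array([0.6, 0.2, 0.1, 0.1]) if d["D2"] > 20 else np.array([0.1, 0.3, 0.3, 0.3])

def forest_predict(d, trees=(tree1, tree2)):
    """Ensemble rule: average the leaf distributions of all trees."""
    return np.mean([t(d) for t in trees], axis=0)

p = forest_predict({"D1": 80.0, "D2": 10.0})
print(PARTS[int(np.argmax(p))])   # prints "head"
```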
  • NUMBER OF TREES: [plot: accuracy (40%–55%) vs. number of trees, 1 to 6; ground truth vs. inferred body parts (most likely), shown for 1, 3 and 6 trees]
  • DEPTH OF TREES: [plot: accuracy vs. tree depth, 1 to 18; input depth, ground truth parts, inferred parts (soft)]
  • [figure: depth images, ground truth parts, inferred parts (soft)]
  • DEPTH OF TREES: [plot: accuracy (30%–65%) vs. depth of trees, 8 to 20, for 900k vs. 15k training images]
  • BODY PARTS TO JOINT POSITIONS: Given the depth image and inferred body part probabilities, cluster high-probability parts in 3D → hypothesized body joints.
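A crude stand-in for this clustering step: take the probability-weighted centroid of the 3D points whose part probability exceeds a threshold. (The talk only says “cluster”; a density-mode finder such as mean shift would be the natural choice, but a weighted centroid shows the idea, and all numbers below are made up.)

```python
import numpy as np

def joint_hypothesis(points_3d, part_probs, threshold=0.5):
    """Probability-weighted mean of high-probability 3D points."""
    keep = part_probs > threshold
    w = part_probs[keep]
    return (points_3d[keep] * w[:, None]).sum(axis=0) / w.sum()

pts = np.array([[0.0, 0.0, 2.0], [0.1, 0.0, 2.0], [1.0, 1.0, 2.5]])
probs = np.array([0.9, 0.8, 0.1])          # third point is background
print(joint_hypothesis(pts, probs))        # near [0.05, 0, 2.0]
```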
  • [figure: input depth, inferred body parts; front, side and top views; inferred joint positions, no tracking or smoothing]
  • MATCHING BODY PARTS IS BETTER: [plot: accuracy vs. number of training images, log scale, 30 to 300,000] Whole-pose chamfer NN matching: 500 milliseconds per frame. Our new body-parts approach: 5 milliseconds per frame.
  • WRAPPING UP: Joint position hypotheses are not the whole story. Follow up with skeleton fitting, incorporating kinematic constraints (limb lengths etc.) and temporal coherence (it’s back!). And of course the incredible imagination of games designers…