# Andrew Fitzgibbon - Kinect

First talk by Andrew Fitzgibbon in Microsoft Computer Vision Summer School 2011

### Andrew Fitzgibbon - Kinect

1. OPTIMIZATION for computer vision. TWO PARTS.
2. COMPUTER VISION: TRUTH AND BEAUTY. Andrew Fitzgibbon, Microsoft Research Cambridge.
3. THE TWO FACES OF COMPUTER VISION. Beauty vs. Truth: Elegant vs. Messy; Easy vs. Difficult; Linear vs. Nonlinear; Unoccluded vs. Occluded; Unlimited data vs. Missing data; Unrealistic vs. Real.
4. TRUTH & BEAUTY: THE TWO FACES OF BAYES.
5. HOW TO GET IDEAS: Chase the high-hanging fruit. Try to make stuff really work. Look for things that confuse/annoy you. And occasionally you'll get a good idea. Write about it well.
6. WRITE ABOUT IT WELL: Many papers on writing well, all with similar advice. State the problem. Describe the state of the art. Say what is wrong with the state of the art. Say how you fixed it. Show that you have fixed it. Use notation correctly.
7. HOW TO GET IDEAS: Chase the high-hanging fruit. Try to make stuff really work. Look for things that confuse/annoy you. And occasionally you'll get a good idea. Write about it well.
8. CHASE THE HIGH-HANGING FRUIT: In 1998 we computed a decent 3D reconstruction of a 36-frame sequence, giving 3D super-resolution, and set ourselves the goal of solving a 1500-frame sequence, leading to… [FCZ98] Fitzgibbon, Cross & Zisserman, SMILE 1998.
9. TRY TO MAKE STUFF REALLY WORK: Program boujou sold by 2d3. Extremely accurate camera calibration because of bundle adjustment. Used on almost every movie you see. Emmy award for boujou in 2002.
10. DEVELOP OTHERS' GOOD IDEAS: Image Based Rendering using image-based priors, ICCV 2003.
11. CHASE THE HIGH-HANGING FRUIT: 2004: rigid 3D reconstruction is "solved" :) What about nonrigid? Bregler, Torresani, Hertzmann and Biermann have been doing it since 2000. Just one group… so OK to join in!
12. HOW TO JOIN IN: 1. Get your own giraffe. 2. Implement the existing algorithms. 3. Find out why they don't work. 4. More on this later…
13. FROM THE LAB TO REAL LIFE: BLUE SKIES TO WORLD RECORDS. Presenter: Andrew Fitzgibbon, Principal Researcher, Microsoft Research Cambridge.
14. MSR OUTPUTS: Patents & papers. Technology transfers: innovation into products. IP licensing to 3rd parties. Expertise for products.
15. COMPUTER VISION AT MSRC
16. COMPUTER VISION AT MSRC
17. COMPUTER VISION AT MSRC
18. (image-only slide)
19. COMPUTER VISION AT MSRC: semantic segmentation (example labels: building, water, car, cow, cat, bicycle, grass, road). [Shotton, Winn, Rother, Criminisi 06 + 08] [Winn & Shotton 06] [Shotton, Johnson, Cipolla 08]
20. COMPUTER VISION AT MSRC: ground truth vs. entangled vs. conventional forests. A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi, Entangled Decision Forests and their Application for Semantic Segmentation of CT Images, in Information Processing in Medical Imaging (IPMI), July 2011.
21. COMPUTER VISION AT MSRC
22. MSR OUTPUTS: Patents & papers. Technology transfers: innovation into products. IP licensing to 3rd parties. Expertise for products.
23. THE SCENARIO
24. STATE OF THE ART: Okada & Stenger 2008; Navaratnam et al. 2007.
25. (figure: depth range markings from 1500 mm to 3500 mm)
26. STATE OF THE ART: Xbox prototype, Sept 2008. Incredibly impressive: real time, accurate, general poses. But… needs initialization; limited agility.
27. MODEL-BASED VISION: 1965. L. G. Roberts, Machine Perception of Three-Dimensional Solids, in Optical and Electro-Optical Information Processing, J. T. Tippett (ed.), MIT Press.
28. MODEL-BASED VISION: 1980. J. O'Rourke and N. Badler, Model-based image analysis of human motion using constraint propagation, IEEE Trans. on Pattern Analysis and Machine Intelligence.
29. D. Hogg, Image and Vision Computing, Vol 1 (1983).
30. D. Hogg, Image and Vision Computing, Vol 1 (1983).
31. MODEL-BASED VISION
32. THE PROBLEM WITH MODEL-BASED VISION: The "model" is forward/generative/graphical, requiring search in many dimensions (say 10^13 for the body). Resolved using (a) clever search: gradient descent and better; and (b) temporal coherence: assume we were right in the previous frame, and search only "nearby" configurations in this one.
33. THE PROBLEM WITH TEMPORAL COHERENCE: Exponential likelihood of failure. Assume a 0.1% failure rate per frame: after n frames, the chance of success is 0.999^n. At 30 frames per second, that's a 3.0% chance of failure after 1 second, 83.5% after 1 minute, and 99.99% after 5 minutes.
34. THE PROBLEM WITH TEMPORAL COHERENCE: Assume a 0.01% failure rate per frame: after n frames, the chance of success is 0.9999^n. At 30 frames per second, that's a 0.3% chance of failure after 1 second, 16.5% after 1 minute, and 59.3% after 5 minutes.
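The percentages on slides 33 and 34 are simply a small per-frame failure probability compounded over many frames. A minimal Python check, assuming independent per-frame failures at 30 frames per second:

```python
# Illustration of the "exponential likelihood of failure" argument above:
# if a tracker relying on temporal coherence fails independently with a small
# per-frame probability, the chance of surviving a long sequence decays
# exponentially with the number of frames.

FPS = 30  # frames per second

def failure_probability(per_frame_failure_rate, seconds):
    """Chance the tracker has failed at least once after `seconds` of video."""
    frames = FPS * seconds
    return 1.0 - (1.0 - per_frame_failure_rate) ** frames

for rate in (0.001, 0.0001):          # 0.1% and 0.01% per-frame failure rates
    for seconds in (1, 60, 300):      # 1 second, 1 minute, 5 minutes
        p = failure_probability(rate, seconds)
        print(f"rate={rate:.4%}  after {seconds:4d}s: {p:.1%} chance of failure")
```

Running this reproduces the figures quoted on the slides (3.0%, 83.5%, 99.99% and 0.3%, 16.5%, 59.3%).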
35. SO WE CAN'T USE TEMPORAL COHERENCE: We need a method which works on a single frame (or a small set of frames), so we can "reinitialize" the tracker. Single-frame methods are all based on machine learning, so we'll need training data (lots of training data), and we will need to represent multiple hypotheses.
36. "Tracking": Elegant, Easy, Wrong.
37. "Tracking": Elegant, Easy, Wrong. Ignoring temporal priors: Just wrong.
38. LEARNING A FACE DETECTOR: Paul A. Viola, Michael J. Jones, Robust Real-Time Face Detection, IEEE International Conference on Computer Vision, 2001.
39. DETECTION VS. TRACKING: Andrew Blake, Kentaro Toyama, Probabilistic Tracking in a Metric Space, IEEE International Conference on Computer Vision, 2001.
40. STATE OF THE ART: Okada & Stenger 2008; Navaratnam et al. 2007.
41. LEARNING A POSE ESTIMATOR: Image → feature vector $x = (x_1, \ldots, x_n)^\top$; "pose" → joint angles $\theta = (\theta_1, \ldots, \theta_m)^\top$.
42. LEARNING A POSE ESTIMATOR: Step zero: training data $(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_i, \theta_i), \ldots, (x_N, \theta_N)$.
43. SOURCES OF VARIED DATA: Real home visits. Pose: motion capture (standard "CMU" database plus a custom database). Body size and shape: retargeting with an effects/games industry tool, MotionBuilder.
44. MOTION CAPTURE: An actor wearing spherical markers is observed by multiple cameras; the "MoCap" 3D joint positions then drive computer graphics to produce a synthetic depth image.
45. INITIAL EXPERIMENTS: Standard motion capture datasets on the web, fed to MotionBuilder to generate 3D images. Limited range of body types.
46. SIMULATING CAMERA ARTEFACTS: Synthetic data is realistic, but too clean. So artificially corrupt the data: depth resolution, noise, rough edges, missing pixels (hair/beards), cropping, occlusions.
47. SO WE HAVE DATA. NOW WHAT? Training pairs $(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_N, \theta_N)$.
48. "LEARN" A FUNCTION FROM THE DATA: $(x_1, \theta_1), \ldots, (x_N, \theta_N)$ → function $f$?
49. NEAREST NEIGHBOUR: $(x_1, \theta_1), \ldots, (x_N, \theta_N)$ → function $f$.
50. NEAREST NEIGHBOUR: $(x_1, \theta_1), \ldots, (x_N, \theta_N)$ → function $f$.
51. NEAREST NEIGHBOUR: $(x_1, \theta_1), \ldots, (x_N, \theta_N)$ → function $f$.
52. ASIDE: DEFINE NEAREST NEIGHBOUR: Given a training set $T = \{(x_i, \theta_i)\}_{i=1}^{N}$, define $f : \mathbb{R}^{100} \mapsto \mathbb{R}^{27}$ such that $\theta = f(x)$ behaves as a "nearest neighbour classifier". Assume a "distance" $d(\cdot,\cdot) : \mathbb{R}^{100} \times \mathbb{R}^{100} \mapsto \mathbb{R}^{+}$.
53. ASIDE: DEFINE NEAREST NEIGHBOUR: $f(x) = \theta_{i^*}$, where $i^* = \arg\min_i \, d(x, x_i)$.
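As a concrete illustration of the nearest-neighbour definition on slides 52 and 53, here is a brute-force sketch in Python with NumPy; the 100-dimensional features, 27-dimensional poses, and Euclidean distance are stand-ins for whatever feature vector and distance $d$ are actually used:

```python
# A minimal nearest-neighbour pose "regressor" in the spirit of slides 49-53:
# given training pairs (x_i, theta_i), answer a query x with the pose of the
# closest training feature vector. Brute force, so query time grows linearly
# with the number of training images.
import numpy as np

def nearest_neighbour_pose(X_train, Theta_train, x_query):
    """X_train: (N, 100) feature vectors; Theta_train: (N, 27) poses; x_query: (100,)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # d(x, x_i) for every i
    i_star = np.argmin(dists)                           # i* = argmin_i d(x, x_i)
    return Theta_train[i_star]                          # f(x) = theta_{i*}

# toy usage with random stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))
Theta = rng.normal(size=(1000, 27))
print(nearest_neighbour_pose(X, Theta, rng.normal(size=100)).shape)  # (27,)
```

The linear scan over all training examples is exactly why the next two slides report that nearest neighbour did not scale.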
54. ALWAYS TRY NEAREST NEIGHBOUR FIRST: (plot: accuracy, 1.0 is best, against number of training images on a log scale)
55. NEAREST NEIGHBOUR DIDN'T SCALE: (plot: accuracy against number of training images, log scale, 30 to 300,000) Time taken: 500 milliseconds per frame.
56. DEALING WITH (LACK OF) DATA: The Joint Manifold Model for Semi-supervised Multi-valued Regression, R. Navaratnam, A. Fitzgibbon, R. Cipolla, ICCV 2007.
57. TRAINING: PLOT THE DATA: training pairs $(z_1, \theta_1), (z_2, \theta_2), \ldots$ plotted as pose $\theta$ against image $z$.
58. TRAINING: FIT A FUNCTION $\theta = f(z)$ (plot: pose $\theta$ vs. image $z$).
59. GIVEN A NEW IMAGE $z_{\text{new}}$, COMPUTE $\theta_{\text{new}} = f(z_{\text{new}})$.
60. GIVEN A NEW IMAGE $z_{\text{new}}$, COMPUTE $\theta_{\text{new}} = f(z_{\text{new}})$.
61. OR, MORE USEFULLY, COMPUTE $p(\theta \mid z_{\text{new}})$.
62. Multivalued $f$: which pose, this one or that one? (more than one pose can be consistent with the same image)
63. (plot: pose $\theta$ vs. image $z$)
64. (plot: pose $\theta$ vs. image $z$)
65. (plot: pose $\theta$ vs. image $z$)
66. GAUSSIAN PROCESS LATENT VARIABLE MODEL: (plot: pose $\theta$ vs. image $z$)
67. (plot: pose $\theta$ vs. image $z$, with a new image $z_{\text{new}}$)
68. (plot: pose $\theta$ vs. image $z$, with a new image $z_{\text{new}}$)
69. (plot: pose $\theta$ vs. image $z$, with a new image $z_{\text{new}}$)
70. TEMPORAL FILTERING: t=1
71. TEMPORAL FILTERING: t=1
72. TEMPORAL FILTERING: t=2
73. TEMPORAL FILTERING: t=2
74. TEMPORAL FILTERING: t=1, t=2
75. TEMPORAL FILTERING: t=1, t=2
76. TEMPORAL FILTERING: t=1, t=2, t=3
77. WE DON'T HAVE THIS… (plot: pose $\theta$ vs. image $z$)
78. WE HAVE THIS… (plot: pose $\theta$ vs. image $z$)
79. OF WHICH A NOT UNREASONABLE MODEL IS… (plot: pose $\theta$ vs. image $z$)
80. WE HAVE TOO FEW LABELLED $(z, \theta)$ PAIRS.
81. WE HAVE TOO FEW LABELLED $(z, \theta)$ PAIRS. But maybe we can capture more unlabelled images, i.e. $(z, *)$ pairs.
82. WE HAVE TOO FEW LABELLED $(z, \theta)$ PAIRS. But maybe we can capture more mocap, i.e. $(*, \theta)$ pairs.
83. MARGINAL STATISTICS: image marginal $p(z) = \int p(z, \theta)\, d\theta$; pose marginal $p(\theta) = \int p(z, \theta)\, dz$.
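To make the marginal-consistency idea concrete, here is a small NumPy sketch that discretizes an arbitrary made-up joint density $p(z, \theta)$ on a grid and recovers both marginals by summing out the other variable; in the talk's setting, unlabelled images constrain the first marginal and unlabelled mocap the second:

```python
# Tiny numerical illustration of the marginals on slide 83: given a discretized
# joint p(z, theta), the image marginal p(z) and pose marginal p(theta) are
# obtained by integrating (here: summing) over the other variable.
# The joint density used here is arbitrary, purely for illustration.
import numpy as np

z = np.linspace(-3, 3, 200)          # "image" axis (1-D stand-in)
theta = np.linspace(-3, 3, 200)      # "pose" axis (1-D stand-in)
dz, dtheta = z[1] - z[0], theta[1] - theta[0]

Z, T = np.meshgrid(z, theta, indexing="ij")
joint = np.exp(-0.5 * (Z**2 + (T - np.sin(2 * Z))**2))   # unnormalized p(z, theta)
joint /= joint.sum() * dz * dtheta                        # normalize

p_z = joint.sum(axis=1) * dtheta     # image marginal p(z)
p_theta = joint.sum(axis=0) * dz     # pose marginal p(theta)

print(p_z.sum() * dz, p_theta.sum() * dtheta)   # both ~1.0: marginals are consistent
```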
84. MARGINALS CONTRADICT OUR EARLIER GUESS
85. REQUIRING CONSISTENT MARGINALS GIVES THIS…
86. The Joint Manifold Model for Semi-supervised Multi-valued Regression, R. Navaratnam, A. Fitzgibbon, R. Cipolla, ICCV 2007.
87. BACK TO REALITY: That was elegant…
88. BODY PARTS: Whole body: 10^12 poses (say). Four parts: 4 × 10^3+ poses. But ambiguity increases.
89. FOUR PARTS GOOD, 32 PARTS BETTER
90. TRAINING DATA
91. Old (holistic) approach vs. new (parts) approach.
92. VIRTUAL HARLEQUIN SUIT
93. EXPANDING THE REPERTOIRE
94. 300,000 body poses × 15 body models × random camera orientations × other random parameters → synthetic image generation → camera noise simulation → 1 million image pairs.
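A sketch of the data-generation loop on slide 94, purely to show the sampling structure; `render_depth_image`, `add_camera_noise`, and the random arrays standing in for the mocap poses are hypothetical stubs, not the MotionBuilder-based pipeline actually used:

```python
# Sketch of the synthetic training-data pipeline: sample a pose, a body model,
# and camera parameters; render a clean depth image; corrupt it with simulated
# camera artefacts; keep the (image, pose) pair. Stubs only.
import numpy as np

rng = np.random.default_rng(0)
mocap_poses = rng.normal(size=(3_000, 27))     # small stand-in for ~300,000 mocap poses
body_models = list(range(15))                  # 15 body shapes/sizes

def render_depth_image(pose, model, camera):   # hypothetical renderer stub
    return rng.normal(size=(240, 320))

def add_camera_noise(depth):                   # hypothetical artefact-simulation stub
    return depth + rng.normal(scale=0.01, size=depth.shape)

def generate_pairs(n_pairs):
    for _ in range(n_pairs):
        pose = mocap_poses[rng.integers(len(mocap_poses))]
        model = body_models[rng.integers(len(body_models))]
        camera = rng.uniform(0, 2 * np.pi)     # random camera orientation
        image = add_camera_noise(render_depth_image(pose, model, camera))
        yield image, pose                      # one of the "1 million image pairs"

image, pose = next(generate_pairs(1_000_000))
print(image.shape, pose.shape)
```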
95. (image-only slide)
96. TEST DATA: synthetic (held-out mocap poses) and real (from home visits).
97. EXAMPLE INPUTS & OUTPUTS
98. SLIDING WINDOW CLASSIFIER: input → output.
99. SLIDING WINDOW CLASSIFIER: input → output.
100. FOCUS ON A SINGLE PIXEL: WHAT PART AM I? Learn Prob(body part | window) from training data. (histogram over parts: head, l hand, r hand, l shoulder, r shoulder, chest, l elbow, r elbow, …)
101. EXAMPLE PIXEL 1: WHAT PART AM I? (part-probability histogram; depth test D1, 60 mm)
102. EXAMPLE PIXEL 1: WHAT PART AM I? (depth tests D1, 60 mm and D2, 20 mm with yes/no branches)
103. EXAMPLE PIXEL 1: WHAT PART AM I? (the yes/no answers to D1 and D2 narrow the part distribution)
104. EXAMPLE PIXEL 1: WHAT PART AM I? (the yes/no answers to D1 and D2 narrow the part distribution)
105. EXAMPLE PIXEL 2: WHAT PART AM I? (part-probability histogram; depth test D1, 60 mm)
106. EXAMPLE PIXEL 2: WHAT PART AM I? (depth tests D1, 60 mm and D3, 25 mm with yes/no branches)
107. EXAMPLE PIXEL 2: WHAT PART AM I? (the yes/no answers to D1 and D3 narrow the part distribution)
108. EXAMPLE PIXEL 2: WHAT PART AM I? (the yes/no answers to D1 and D3 narrow the part distribution)
109. DECISION TREE CLASSIFICATION: The same tree is applied at every pixel. In practice, trees are much deeper. Different pixels take different paths. (tree of yes/no depth tests such as D1, 60 mm; D2, 20 mm; D3, 25 mm, with a part distribution at each leaf)
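To make slides 100 to 109 concrete, here is a minimal sketch of depth-comparison decision-tree classification at a single pixel. The offsets, the 60 mm threshold, and the two-leaf toy tree are made-up illustrations of the kind of test the slides label "D1, 60 mm"; they are not the features or trained trees of the shipped system:

```python
# Minimal sketch: each internal node probes the depth image at an offset from
# the current pixel and asks a yes/no question ("is the probed point more than
# 60 mm behind this pixel?"); each leaf stores a distribution over body parts.
import numpy as np

PARTS = ["head", "l hand", "r hand", "l shoulder", "r shoulder",
         "chest", "l elbow", "r elbow"]

class Node:
    def __init__(self, offset=None, threshold_mm=None,
                 yes=None, no=None, leaf_dist=None):
        self.offset = offset            # (dy, dx) pixel offset to probe
        self.threshold_mm = threshold_mm
        self.yes, self.no = yes, no
        self.leaf_dist = leaf_dist      # probability over PARTS at a leaf

def classify_pixel(depth, y, x, node):
    """Walk the tree for pixel (y, x) of a depth image (values in mm)."""
    while node.leaf_dist is None:
        dy, dx = node.offset
        py = np.clip(y + dy, 0, depth.shape[0] - 1)
        px = np.clip(x + dx, 0, depth.shape[1] - 1)
        diff = depth[py, px] - depth[y, x]          # depth difference in mm
        node = node.yes if diff > node.threshold_mm else node.no
    return node.leaf_dist

# toy tree: one test ("D1, 60 mm") with two leaf distributions
toy_tree = Node(offset=(0, 5), threshold_mm=60.0,
                yes=Node(leaf_dist=np.array([.4, .2, .2, .05, .05, .05, .03, .02])),
                no=Node(leaf_dist=np.array([.05, .05, .05, .15, .15, .4, .1, .05])))

depth = np.full((240, 320), 2000.0)                 # flat surface at 2 m
print(dict(zip(PARTS, classify_pixel(depth, 120, 160, toy_tree))))
```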
110. DECISION FORESTS: A forest is an ensemble of trees (tree 1 … tree T), each producing a distribution over parts. Helps avoid over-fitting during training [Amit & Geman 97] [Breiman 01]. Testing takes the average of the leaf-node distributions [Geurts et al. 06].
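Continuing the sketch above (reusing its `classify_pixel`, `toy_tree`, `depth`, and `PARTS`), the forest test on slide 110 is then just an average of the per-tree leaf distributions:

```python
# A forest is a list of trees; testing averages the per-tree leaf distributions
# for the pixel (slide 110). The repeated toy_tree is a stand-in for T trained trees.
def classify_pixel_forest(depth, y, x, trees):
    dists = [classify_pixel(depth, y, x, t) for t in trees]
    return sum(dists) / len(dists)      # average of leaf-node distributions

forest = [toy_tree, toy_tree, toy_tree]
print(dict(zip(PARTS, classify_pixel_forest(depth, 120, 160, forest))))
```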
111. NUMBER OF TREES: (plot: accuracy, roughly 40% to 55%, against number of trees, 1 to 6; ground truth vs. inferred body parts (most likely) shown for 1, 3, and 6 trees)
112. DEPTH OF TREES: (input depth, ground truth parts, and inferred parts (soft) shown as tree depth grows from 1 to 18)
113. (depth images, ground truth parts, inferred parts (soft))
114. DEPTH OF TREES: (plot: accuracy, roughly 30% to 65%, against depth of trees, 8 to 20, for 900k vs. 15k training images)
115. BODY PARTS TO JOINT POSITIONS: Given a depth image and the inferred body-part probabilities, cluster the high-probability parts in 3D to produce hypothesized body joints.
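The clustering step on slide 115 can be sketched very simply: back-project pixels to 3D and, for each part, pool the positions of pixels where that part is probable. The probability-weighted mean below is a deliberate simplification, a stand-in for proper mode-finding over the weighted 3D points, and all the toy data is made up:

```python
# Simplified stand-in for "cluster high-probability parts in 3D": for each part,
# take the probability-weighted mean of the 3D positions of pixels where that
# part is likely, giving one hypothesized joint position per part.
import numpy as np

def joint_hypotheses(points_3d, part_probs, min_prob=0.3):
    """points_3d: (P, 3) back-projected pixel positions in mm;
       part_probs: (P, K) per-pixel distributions over K body parts.
       Returns a (K, 3) array of hypothesized joint positions (NaN if no support)."""
    K = part_probs.shape[1]
    joints = np.full((K, 3), np.nan)
    for k in range(K):
        w = np.where(part_probs[:, k] >= min_prob, part_probs[:, k], 0.0)
        if w.sum() > 0:
            joints[k] = (w[:, None] * points_3d).sum(axis=0) / w.sum()
    return joints

# toy usage: 500 random 3D points and random part distributions over 8 parts
rng = np.random.default_rng(1)
pts = rng.uniform(-500, 500, size=(500, 3))
probs = rng.dirichlet(np.ones(8), size=500)
print(joint_hypotheses(pts, probs))
```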
116. (input depth → inferred body parts → inferred joint positions in front, side, and top views; no tracking or smoothing)
117. (input depth → inferred body parts → inferred joint positions in front, side, and top views; no tracking or smoothing)
118. MATCHING BODY PARTS IS BETTER: (plot: accuracy against number of training images, log scale, 30 to 300,000) Our new body-parts approach (5 milliseconds per frame) versus whole-pose chamfer NN matching (500 milliseconds per frame).
119. WRAPPING UP: Joint position hypotheses are not the whole story. Follow up with skeleton fitting incorporating kinematic constraints (limb lengths etc.) and temporal coherence (it's back!). And of course the incredible imagination of games designers…