Andrew Fitzgibbon - Kinect

First talk by Andrew Fitzgibbon in Microsoft Computer Vision Summer School 2011

1. OPTIMIZATION for computer vision: TWO PARTS
2. COMPUTER VISION: TRUTH AND BEAUTY. Andrew Fitzgibbon, Microsoft Research Cambridge
3. THE TWO FACES OF COMPUTER VISION: BEAUTY (Elegant, Easy, Linear, Unoccluded, Unlimited data, Unrealistic) vs. TRUTH (Messy, Difficult, Nonlinear, Occluded, Missing data, Real)
4. THE TWO FACES OF BAYES: TRUTH & BEAUTY
5. HOW TO GET IDEAS: • Chase the high-hanging fruit • Try to make stuff really work • Look for things that confuse/annoy you • And occasionally you'll get a good idea • Write about it well
6. WRITE ABOUT IT WELL: Many papers on writing well, all with similar advice: • State the problem • Describe the state of the art • Say what is wrong with the state of the art • Say how you fixed it • Show that you have fixed it • Use notation correctly
7. HOW TO GET IDEAS: • Chase the high-hanging fruit • Try to make stuff really work • Look for things that confuse/annoy you • And occasionally you'll get a good idea • Write about it well
8. CHASE THE HIGH-HANGING FRUIT: • 1998: we computed a decent 3D reconstruction of a 36-frame sequence • Giving 3D super-resolution • And set ourselves the goal of solving a 1500-frame sequence • Leading to… [FCZ98] Fitzgibbon, Cross & Zisserman, SMILE 1998
9. TRY TO MAKE STUFF REALLY WORK: • Program boujou sold by 2d3 • Extremely accurate camera calibration because of bundle adjustment • Used on almost every movie you see • Emmy award for boujou in 2002
10. DEVELOP OTHERS' GOOD IDEAS: Image Based Rendering using image-based priors, ICCV 2003
11. CHASE THE HIGH-HANGING FRUIT: • 2004: Rigid 3D reconstruction is "solved" :) • What about nonrigid? • Bregler, Torresani, Hertzmann and Biermann have been doing it since 2000 • Just one group… so OK to join in!
12. HOW TO JOIN IN: 1. Get your own giraffe 2. Implement the existing algorithms 3. Find out why they don't work 4. More on this later…
13. FROM THE LAB TO REAL LIFE: BLUE SKIES TO WORLD RECORDS. Presenter: Andrew Fitzgibbon, Principal Researcher, Microsoft Research Cambridge
14. MSR OUTPUTS: • Patents & papers • Innovation into products (technology transfer) • IP licensing to 3rd parties • Expertise for products
15. COMPUTER VISION AT MSRC
16. COMPUTER VISION AT MSRC
17. COMPUTER VISION AT MSRC
18. [image]
19. COMPUTER VISION AT MSRC: [semantic labelling examples: building, water, car, cow, cat, bicycle, grass, road] [Shotton, Winn, Rother, Criminisi 06 + 08] [Winn & Shotton 06] [Shotton, Johnson, Cipolla 08]
20. COMPUTER VISION AT MSRC: [Ground truth / Entangled / Conventional] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi, Entangled Decision Forests and their Application for Semantic Segmentation of CT Images, in Information Processing in Medical Imaging (IPMI), July 2011
21. COMPUTER VISION AT MSRC
22. MSR OUTPUTS: • Patents & papers • Innovation into products (technology transfer) • IP licensing to 3rd parties • Expertise for products
23. THE SCENARIO
24. STATE OF THE ART: Okada & Stenger 2008; Navaratnam et al. 2007
25. [depth range illustration: 1500 mm to 3500 mm]
26. STATE OF THE ART: Xbox prototype, Sept 2008. Incredibly impressive: • Real time • Accurate • General poses. But… • Needs initialization • Limited agility
27. MODEL-BASED VISION: 1965. L. G. Roberts, Machine Perception of Three-Dimensional Solids, in Optical and electro-optical information processing, J. T. Tippett (ed.), MIT Press.
28. MODEL-BASED VISION: 1980. J. O'Rourke and N. Badler. Model-based image analysis of human motion using constraint propagation. IEEE Trans. on Pattern Analysis and Machine Intelligence.
29. D. Hogg, Image and Vision Computing, Vol 1 (1983)
30. D. Hogg, Image and Vision Computing, Vol 1 (1983)
31. MODEL-BASED VISION
32. THE PROBLEM WITH MODEL-BASED VISION: • The "model" is forward/generative/graphical • Requiring search in many dimensions (say 10^13 for the body) • Resolved using (a) clever search: gradient descent and better, and (b) temporal coherence: assume we were right in the previous frame and search only "nearby" configurations in this one
33. THE PROBLEM WITH TEMPORAL COHERENCE: Exponential likelihood of failure. Assume a 0.1% failure rate per frame. • After n frames, chance of success = 0.999^n • At 30 frames per second, that's a 3.0% chance of failure after 1 second, 83.5% after 1 minute, and 99.99% after 5 minutes
34. THE PROBLEM WITH TEMPORAL COHERENCE: Exponential likelihood of failure. Assume a 0.01% failure rate per frame. • After n frames, chance of success = 0.9999^n • At 30 frames per second, that's a 0.3% chance of failure after 1 second, 16.5% after 1 minute, and 59.3% after 5 minutes
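The compounding arithmetic on slides 33 and 34 is easy to verify; here is a small sketch (mine, not from the talk) that reproduces the quoted percentages:

```python
# Sanity check for the compounding-failure numbers on slides 33-34:
# per-frame success (1 - failure_rate) compounds as (1 - failure_rate)**n.
def failure_after(per_frame_failure, seconds, fps=30):
    p_success = (1.0 - per_frame_failure) ** (seconds * fps)
    return 1.0 - p_success

for rate in (0.001, 0.0001):       # 0.1% and 0.01% failure per frame
    for secs in (1, 60, 300):      # 1 second, 1 minute, 5 minutes
        print(f"rate={rate:.2%}  t={secs:>3}s  "
              f"P(failure)={failure_after(rate, secs):.2%}")
# Prints roughly 2.96%, 83.48%, 99.99% and 0.30%, 16.47%, 59.34%,
# matching the slides' 3.0%, 83.5%, 99.99% and 0.3%, 16.5%, 59.3%.
```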
35. SO WE CAN'T USE TEMPORAL COHERENCE: • Need a method which works on a single frame [or a small set of frames], so we can "reinitialize" the tracker • Single-frame methods are all based on machine learning • So we'll need training data, lots of training data • And we will need to represent multiple hypotheses
36. "Tracking": Elegant, Easy, Wrong.
37. "Tracking": Elegant, Easy, Wrong. Ignoring temporal priors: Just wrong.
38. LEARNING A FACE DETECTOR: Paul A. Viola, Michael J. Jones, Robust Real-Time Face Detection, IEEE International Conference on Computer Vision, 2001
39. DETECTION VS. TRACKING: Andrew Blake, Kentaro Toyama, Probabilistic tracking in a metric space, IEEE International Conference on Computer Vision, 2001
40. STATE OF THE ART: Okada & Stenger 2008; Navaratnam et al. 2007
41. LEARNING A POSE ESTIMATOR: Image → feature vector z = (z_1, z_2, …)ᵀ, "Pose" → joint angle vector θ = (θ_1, θ_2, …)ᵀ; learn the map z → θ
42. LEARNING A POSE ESTIMATOR: Step zero: training data (z_1, θ_1), (z_2, θ_2), …, (z_i, θ_i), …, (z_N, θ_N)
43. SOURCES OF VARIED DATA: • Real home visits • Pose: motion capture (standard "CMU" database; custom database) • Body size & shape: retargeting • Effects/games industry tool: MotionBuilder
44. MOTION CAPTURE: "MoCap" (actor wearing spherical markers, observed by multiple cameras) → 3D joint positions → computer graphics → synthetic depth image
45. INITIAL EXPERIMENTS: • Standard motion capture datasets on the web • Feed to MotionBuilder to generate 3D images • Limited range of body types
46. SIMULATING CAMERA ARTEFACTS: Synthetic data are realistic, but too clean. Artificially corrupted data: • depth resolution & noise • rough edges • missing pixels (hair/beards) • cropping & occlusions
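The slide lists the kinds of corruption applied but not the exact noise model, so the following is a purely illustrative sketch of the idea; the quantisation step, noise level and dropout rate are made-up parameters, not those of the real pipeline:

```python
import numpy as np

def corrupt_depth(depth_mm, rng=np.random.default_rng(0)):
    """Roughly simulate camera artefacts on a clean synthetic depth map (in mm).

    Illustrative only: quantisation stands in for limited depth resolution,
    Gaussian noise for sensor noise, and random dropout for missing pixels.
    """
    d = depth_mm.astype(np.float32)
    step = 2.0 + 0.003 * d                    # coarser quantisation with range
    d = np.round(d / step) * step
    d += rng.normal(scale=5.0, size=d.shape)  # ~5 mm sensor noise
    d[rng.random(d.shape) < 0.02] = 0.0       # 2% of pixels drop out
    return d

clean = np.full((4, 4), 2000.0)               # toy "depth image": a wall at 2 m
print(corrupt_depth(clean))
```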
47. SO WE HAVE DATA. NOW WHAT? (z_1, θ_1), (z_2, θ_2), …, (z_i, θ_i), …, (z_N, θ_N)
48. "LEARN" FUNCTION FROM DATA: (z_1, θ_1), (z_2, θ_2), …, (z_N, θ_N) → function f?
49. NEAREST NEIGHBOUR: (z_1, θ_1), (z_2, θ_2), …, (z_N, θ_N) → function f
50. NEAREST NEIGHBOUR: (z_1, θ_1), (z_2, θ_2), …, (z_N, θ_N) → function f
51. NEAREST NEIGHBOUR: (z_1, θ_1), (z_2, θ_2), …, (z_N, θ_N) → function f
52. ASIDE: DEFINE NEAREST NEIGHBOUR: Given a training set D = {(z_i, θ_i)}_{i=1}^N, define f: ℝ^100 ↦ ℝ^27 such that θ = f(z) behaves as a "nearest neighbour classifier". Assume a "distance" d(·,·): ℝ^100 × ℝ^100 ↦ ℝ^+.
53. ASIDE: DEFINE NEAREST NEIGHBOUR: f(z) = θ_{i*}, where i* = argmin_i d(z, z_i)
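A minimal sketch of the nearest-neighbour regressor defined on slides 52 and 53 (Euclidean distance is my stand-in for the unspecified d(·,·), and the training data here are random placeholders):

```python
import numpy as np

def nearest_neighbour_pose(z, Z_train, Theta_train):
    """f(z) = theta_{i*} with i* = argmin_i d(z, z_i).

    Z_train: (N, 100) feature vectors; Theta_train: (N, 27) joint-angle vectors.
    """
    dists = np.linalg.norm(Z_train - z, axis=1)  # d(z, z_i) for every i
    return Theta_train[np.argmin(dists)]

# Toy usage with random stand-in data:
rng = np.random.default_rng(0)
Z_train = rng.normal(size=(1000, 100))
Theta_train = rng.normal(size=(1000, 27))
theta = nearest_neighbour_pose(rng.normal(size=100), Z_train, Theta_train)
print(theta.shape)  # (27,)
```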
54. ALWAYS TRY NEAREST NEIGHBOUR FIRST: [plot: accuracy (1.0 is best) vs. number of training images (log scale)]
55. NEAREST NEIGHBOUR DIDN'T SCALE: [plot: accuracy vs. number of training images (log scale, 30 to 300,000); time taken: 500 milliseconds per frame]
56. DEALING WITH (LACK OF) DATA: The Joint Manifold Model for Semi-supervised Multi-valued Regression, R. Navaratnam, A. Fitzgibbon, R. Cipolla, ICCV 2007
57. TRAINING: PLOT THE DATA: (z_1, θ_1), (z_2, θ_2), … [plot of pose θ against image z]
58. TRAINING: FIT FUNCTION θ = f(z) [plot of pose θ against image z]
59. GIVEN NEW IMAGE z_new, θ = f(z_new) [plot of pose θ against image z]
60. Step 3: GIVEN NEW IMAGE z_new, COMPUTE θ = f(z_new) [plot of pose θ against image z]
61. Step 3, OR, MORE USEFULLY: COMPUTE p(θ | z_new) [plot of pose θ against image z]
62. Multivalued f: θ or θ′? (the same image z can be explained by more than one pose)
63. [plot: pose θ against image z]
64. [plot: pose θ against image z]
65. [plot: pose θ against image z]
66. GAUSSIAN PROCESS LATENT VARIABLE MODEL: [plot: pose θ against image z]
67. [plot: pose θ against image z, with a new image z_new]
68. [plot: pose θ against image z, with a new image z_new]
69. [plot: pose θ against image z, with a new image z_new]
70. TEMPORAL FILTERING: t=1
71. TEMPORAL FILTERING: t=1
72. TEMPORAL FILTERING: t=2
73. TEMPORAL FILTERING: t=2
74. TEMPORAL FILTERING: t=1, t=2
75. TEMPORAL FILTERING: t=1, t=2
76. TEMPORAL FILTERING: t=1, t=2, t=3
77. WE DON'T HAVE THIS… [plot: pose θ against image z]
78. WE HAVE THIS… [plot: pose θ against image z]
79. OF WHICH A NOT UNREASONABLE MODEL IS… [plot: pose θ against image z]
80. WE HAVE TOO FEW LABELLED (z, θ) PAIRS
81. WE HAVE TOO FEW LABELLED (z, θ) PAIRS: But maybe we can capture more unlabelled images, i.e. (z, ∗) pairs
82. WE HAVE TOO FEW LABELLED (z, θ) PAIRS: But maybe we can capture more mocap, i.e. (∗, θ) pairs
83. MARGINAL STATISTICS: image marginal p(z) = ∫ p(z, θ) dθ; pose marginal p(θ) = ∫ p(z, θ) dz
84. MARGINALS CONTRADICT OUR EARLIER GUESS
85. REQUIRING CONSISTENT MARGINALS GIVES THIS…
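In symbols, the consistency idea behind slides 80 to 85 can be written as follows (my paraphrase; the deck itself only names the marginals):

```latex
% Sketch of the marginal-consistency requirement (paraphrase, not from the deck):
% the joint model p(z, \theta) should reproduce the marginals that can be
% estimated from unlabelled images (z, *) and from unlabelled mocap (*, \theta).
\begin{align}
  p(z)      &= \int p(z, \theta)\, \mathrm{d}\theta \;\approx\; \hat{p}_{\text{images}}(z), \\
  p(\theta) &= \int p(z, \theta)\, \mathrm{d}z      \;\approx\; \hat{p}_{\text{mocap}}(\theta).
\end{align}
```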
86. The Joint Manifold Model for Semi-supervised Multi-valued Regression, R. Navaratnam, A. Fitzgibbon, R. Cipolla, ICCV 2007
87. BACK TO REALITY: That was elegant…
88. BODY PARTS: • Whole body: 10^12 poses (say) • Four parts: 4 × 10^3+ poses • But ambiguity increases
89. FOUR PARTS GOOD, 32 PARTS BETTER
90. TRAINING DATA
91. Old (holistic) approach / New (parts) approach
92. VIRTUAL HARLEQUIN SUIT
93. EXPANDING THE REPERTOIRE
94. 300,000 body poses × 15 models × random camera orientations × other random parameters → synthetic image generation → camera noise simulation → 1 million image pairs
95. [image]
96. TEST DATA: synthetic (held-out mocap poses); real (from home visits)
97. EXAMPLE INPUTS & OUTPUTS
98. SLIDING WINDOW CLASSIFIER: Input → Output
99. SLIDING WINDOW CLASSIFIER: Input → Output
100. FOCUS ON A SINGLE PIXEL: WHAT PART AM I? • Learn Prob(body part | window) from training data [probability histogram over parts: head, l hand, r hand, l shoulder, r shoulder, chest, l elbow, r elbow, …]
101. EXAMPLE PIXEL 1: WHAT PART AM I? [probability histogram over parts: head, l hand, r hand, l shoulder, r shoulder, chest, l elbow, r elbow; depth test D1, 60 mm]
102. EXAMPLE PIXEL 1: WHAT PART AM I? [depth test D1 (60 mm): yes/no, then depth test D2 (20 mm); part probability histograms at each node]
103. EXAMPLE PIXEL 1: WHAT PART AM I? [path through the tree: D1 (60 mm), then D2 (20 mm), ending in a part probability histogram at the reached leaf]
104. EXAMPLE PIXEL 1: WHAT PART AM I? [path through the tree: D1 (60 mm), then D2 (20 mm), ending in a part probability histogram at the reached leaf]
105. EXAMPLE PIXEL 2: WHAT PART AM I? [probability histogram over parts: head, l hand, r hand, l shoulder, r shoulder, chest, l elbow, r elbow; depth test D1, 60 mm]
106. EXAMPLE PIXEL 2: WHAT PART AM I? [depth tests D1 (60 mm) and D3 (25 mm): yes/no branches; part probability histograms at each node]
107. EXAMPLE PIXEL 2: WHAT PART AM I? [path through the tree: D1 (60 mm), then D3 (25 mm), ending in a part probability histogram at the reached leaf]
108. EXAMPLE PIXEL 2: WHAT PART AM I? [path through the tree: D1 (60 mm), then D3 (25 mm), ending in a part probability histogram at the reached leaf]
109. DECISION TREE CLASSIFICATION: • Same tree applied at every pixel • In practice, trees are much deeper • Different pixels take different paths [tree with depth tests D1 (60 mm), D2 (20 mm), D3 (25 mm) and per-leaf body part histograms]
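Slides 101 to 109 illustrate per-pixel depth tests (D1 = 60 mm, D2 = 20 mm, D3 = 25 mm) but do not spell out the feature itself; the sketch below uses a depth-difference probe in that spirit. The offset convention, focal length, thresholds and the toy tree are all hypothetical:

```python
import numpy as np

def depth_probe(depth, x, y, offset_mm, focal=575.0):
    """Depth at a probe point offset from (x, y); the offset is scaled by
    1/depth so it covers a roughly constant distance in the world."""
    z = depth[y, x]
    dx, dy = offset_mm
    u = int(round(x + focal * dx / z))
    v = int(round(y + focal * dy / z))
    if 0 <= v < depth.shape[0] and 0 <= u < depth.shape[1]:
        return depth[v, u]
    return 1e6  # off-image probes read as "very far" (background)

def classify_pixel(depth, x, y, node):
    """Walk one tree; each node asks 'is probe depth minus pixel depth > threshold?'."""
    while "leaf" not in node:
        feature = depth_probe(depth, x, y, node["offset_mm"]) - depth[y, x]
        node = node["yes"] if feature > node["threshold_mm"] else node["no"]
    return node["leaf"]  # histogram over body parts

# Toy two-leaf tree loosely echoing the D1 (60 mm) test on the slides:
toy_tree = {
    "offset_mm": (0.0, -300.0), "threshold_mm": 60.0,  # probe 0.3 m "above"
    "yes": {"leaf": {"head": 0.7, "chest": 0.3}},
    "no":  {"leaf": {"chest": 0.8, "head": 0.2}},
}
depth = np.full((240, 320), 2000.0)  # toy depth image: flat wall at 2 m
print(classify_pixel(depth, 160, 120, toy_tree))
```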
110. DECISION FORESTS: • A forest is an ensemble of trees (tree 1 … tree T), each yielding a distribution over body parts (r shoulder, l shoulder, l elbow, r elbow, chest, head, l hand, r hand, …) • Helps avoid over-fitting during training [Amit & Geman 97] [Breiman 01] • Testing takes the average of the leaf node distributions [Geurts et al. 06]
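Averaging the leaf distributions across trees, as slide 110 describes, is then straightforward; this builds on the hypothetical classify_pixel and toy_tree from the sketch above and should be run after it:

```python
def classify_pixel_forest(depth, x, y, trees):
    """Average the per-tree leaf histograms for one pixel (illustrative sketch)."""
    parts = {}
    for tree in trees:
        for part, p in classify_pixel(depth, x, y, tree).items():
            parts[part] = parts.get(part, 0.0) + p / len(trees)
    return parts  # averaged Prob(body part | pixel)

print(classify_pixel_forest(depth, 160, 120, [toy_tree, toy_tree]))
```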
111. NUMBER OF TREES: [plot: accuracy (about 40% to 55%) vs. number of trees (1 to 6); ground truth vs. inferred body parts (most likely) shown for 1, 3 and 6 trees]
112. DEPTH OF TREES: [figure: input depth, ground truth parts, and inferred parts (soft) at tree depths 1 to 18]
113. Depth images / ground truth parts / inferred parts (soft)
114. DEPTH OF TREES: [plot: accuracy (about 30% to 65%) vs. depth of trees (8 to 20), for 900k vs. 15k training images]
115. BODY PARTS TO JOINT POSITIONS: Given a depth image and the inferred body part probabilities, cluster high-probability parts in 3D → hypothesized body joints
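Slide 115 says only "cluster high-probability parts in 3D"; below is a deliberately simple stand-in that takes a probability-weighted mean of the 3D points for each part (the function name, threshold and toy data are mine, and the published system's clustering details are not taken from these slides):

```python
import numpy as np

def joints_from_parts(points_3d, part_probs, part_names, min_prob=0.4):
    """Propose one 3D joint hypothesis per body part.

    points_3d:  (N, 3) back-projected 3D points, one per foreground depth pixel.
    part_probs: (N, P) per-pixel Prob(part | pixel) from the forest.
    A probability-weighted mean over high-probability pixels stands in for
    the clustering step described on the slide.
    """
    joints = {}
    for j, name in enumerate(part_names):
        w = part_probs[:, j]
        keep = w > min_prob                  # keep only confident pixels
        if keep.any():
            joints[name] = np.average(points_3d[keep], axis=0, weights=w[keep])
    return joints

# Toy usage with random stand-in data:
rng = np.random.default_rng(1)
pts = rng.normal(size=(500, 3))
probs = rng.dirichlet(np.ones(3), size=500)  # fake per-pixel part probabilities
print(joints_from_parts(pts, probs, ["head", "l hand", "r hand"]))
```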
116. Input depth → inferred body parts → inferred joint positions (front, side and top views): no tracking or smoothing
117. Input depth → inferred body parts → inferred joint positions (front, side and top views): no tracking or smoothing
118. MATCHING BODY PARTS IS BETTER: [plot: accuracy vs. number of training images (log scale, 30 to 300,000); whole-pose chamfer NN matching (500 milliseconds per frame) vs. our new body parts approach (5 milliseconds per frame)]
119. WRAPPING UP: Joint position hypotheses are not the whole story. Follow up with skeleton fitting incorporating: • kinematic constraints (limb lengths etc.) • temporal coherence (it's back!) • and of course the incredible imagination of games designers…