
Roland Memisevic at AI Frontiers: Common sense video understanding at TwentyBN



Deep learning has evolved not linearly but through a series of step-functions: sudden, unexpected outbreaks of capability, which fundamentally changed the envelope of what computers are able to do. At TwentyBN, we have created spatio-temporal video models and data infrastructure that allowed us to grow a collection of approximately one million labeled videos showing everyday common-sense scenes and situations, many of them extremely subtle. This allowed us to train neural networks end-to-end on a wide range of action understanding tasks that neither hand-engineering nor neural networks had appeared anywhere near solving just a few months ago. I will show how these recognition tasks now drive commercial value at TwentyBN, and how they drive our long-term AI agenda for learning common-sense world knowledge through video.

Published in: Data & Analytics


  1. Twenty Billion Neurons: Berlin- and Toronto-based video understanding company
  2. Domestic companions (85M smart cameras); augmented reality (6M AR glasses); automotive (10M cars); collaborative robotics (150M cobots); smartphone apps (3 BN phones). All figures are estimated numbers of devices in 2020. By 2020: consumer videos (80% of Internet traffic). Sources: KPCB, Barclays
  3. Dog, Cat
  4. 15 people, 3 street signs
  5. 1986: “Neural networks don’t work”; 2012: “Neural networks can’t do image classification”; 2014: “Neural networks can’t translate text”; 2016: “Neural networks can’t play Go”; 2017: “Neural networks don’t have common sense”; ?
  6. At TwentyBN we build the brain that allows cameras to see. Roland Memisevic, CEO & Chief Scientist: 15+ years experience in DL as Professor (MILA Montreal) & PhD student of Geoff Hinton. Moritz Müller-Freitag, COO & Head of Product: experience as data scientist (Eleven) & country manager (Savedo/HitFox Group). Ingo Bax, CTO: experience as Professor (FH Münster) & principal software architect (XING AG). Christian Thurau, CBDO: experience as co-founder, CTO (Game Analytics, exit) & researcher (Fraunhofer). Prof. Yoshua Bengio, Scientific Advisor: Professor at MILA Montréal; noted for his pioneering work on deep learning. Valentin Haenel, VP Engineering: co-initiator of PyData Berlin; contributor to more than 50 open source projects. Nathan Benaich, Advisor: VC investor, technologist, former scientist; Organizer of and RAAIS. Plus 13 full-time staff, including AI researchers, engineers and product people
  7. Integrated technology stack: (1) Research & engineering; (2) Data platform; (3) Embedded real-time net; (4) Solutions
  8. Camera-based gesture control. Existing solutions: require depth sensor devices; ~5 gestures; low accuracy; never gained traction. TwentyBN solution: RGB (for example, a cheap, built-in laptop camera); recognizes 25 hand gestures; very high accuracy; runs in real time on a laptop using RGB camera input
  9. Variations; camera angles and scene layouts; multi-person actions and localization; interactivity; complex object interactions
  10. Indoor activity monitoring. Outputs: “Person picking [something] up”; “[Something] falling like a feather or paper”; “Person leaving through a door”; “Bending [something] until it breaks”; “Trying to bend [something unbendable] so nothing happens”; “[gesture] Zooming Out With Two Fingers”
  11. We support all stages of our clients’ product cycles. Products: data licensing (high-quality labeled videos customized to support your video applications); software licensing (software that adds video capabilities to your product); hardware licensing (softcore IP)
  12. 20BN-JESTER: A crowd-acted dataset of generic human hand gestures. Number of videos: 148,094. License: Free for academic use (Creative Commons Attribution 4.0 International license CC BY-NC-ND 4.0)
  13. 20BN-SOMETHING-SOMETHING: A crowd-acted dataset of basic interactions with everyday objects. Number of videos: 108,499. License: Free for academic use (Creative Commons Attribution 4.0 International license CC BY-NC-ND 4.0)
  14. Contrastive classes make learning harder and networks stronger. Tearing [something] into two pieces vs. Tearing [something] just a little bit: 0.74 (0.52); Pretending to pick [something] up vs. Picking [something] up: 0.86 (0.75); Pretending to pour vs. Pouring: 0.82 (0.64); Pouring with overflow vs. Pouring without: 0.76 (0.54); Pretending to put [something] onto vs. Putting [something] onto [something]: 0.82 (0.64)
  15. Mistaken “opening” predictions. Ground truth: Moving [part] of [something], prediction: Opening [something]; ground truth: Unfolding [something], prediction: Opening [something]; ground truth: Putting [something] on a flat surface without letting it roll, prediction: Opening [something]
  16. Mistaken “covering” predictions. Ground truth: Putting [something] in front of [something], prediction: Covering [something]; ground truth: Turning [something] upside down, prediction: Covering [something]
  17. Transfer learning
  18. Roland Memisevic, +1 416 826 1032
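
The error examples on slides 15 and 16 suggest that a few labels ("Opening", "Covering") absorb mistakes from several different ground-truth classes. A minimal sketch of how such a tally could be computed is given here in Python. The five (ground truth, prediction) pairs are copied from the slides. The helper names `attractor_counts` and `confused_with` are hypothetical and not part of any TwentyBN tooling, and the slides do not say how the errors were actually analyzed.

```python
from collections import Counter

# (ground truth, prediction) pairs copied from slides 15 and 16.
ERRORS = [
    ("Moving [part] of [something]", "Opening [something]"),
    ("Unfolding [something]", "Opening [something]"),
    ("Putting [something] on a flat surface without letting it roll",
     "Opening [something]"),
    ("Putting [something] in front of [something]", "Covering [something]"),
    ("Turning [something] upside down", "Covering [something]"),
]


def attractor_counts(pairs):
    """Count how often each label appears as a wrong prediction.

    Hypothetical helper: pairs whose prediction equals the ground
    truth are skipped, so only mistakes are tallied.
    """
    return Counter(pred for truth, pred in pairs if truth != pred)


def confused_with(pairs, predicted_label):
    """Return the sorted ground-truth labels mistaken for predicted_label."""
    return sorted(
        truth
        for truth, pred in pairs
        if pred == predicted_label and truth != pred
    )


if __name__ == "__main__":
    # Print each absorbing label with its error count and sources.
    for label, count in attractor_counts(ERRORS).most_common():
        print(f"{label}: {count} mistaken prediction(s)")
        for truth in confused_with(ERRORS, label):
            print(f"    from: {truth}")
```

Run over a full validation set instead of these five examples, the same tally would indicate which classes most need additional contrastive examples of the kind described on slide 14.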