Breaking Through the Challenges of
Scalable Deep Learning for
Video Analytics
Steven Flores, sflores@compthree.com
Luke Hosking, lhosking@compthree.com
Use cases
A customer is somebody with a lot of unannotated video whose content they
want annotated and indexed into a searchable database. For example,
● Media: video library going back decades.
● Research institutions: video from a lecture series.
● Management and HR: conference and meeting notes.
What info do we want from video?
● What and who is in the video?
● What happens in the video?
● What is the video about?
(Example here: https://www.youtube.com/watch?v=X3a-ZX6ObJU)
Information from audio
● Topic modeling speech transcripts.
● Sentiment analysis of speech transcripts.
● Hot language and/or loud sounds heat map.
● Keywords (named entities) from transcripts.
(Figure: the excerpt “The Federal Reserve is widely expected to increase interest rates again Wednesday...” shown with the topic labels Politics and policy, Sports, and Science and Technology.)
Using keywords to extract info
Within transcripts, keywords such as people, locations, organizations, and
geo-political entities carry much of the latent information we seek from a video.
For example, a video transcript containing the excerpt
...probably confirm the North Korean side in its willingness…
should appear if we search for the term “North Korea.” Also, the presence of
this term, along with other keywords, may support a topic assignment.
Keyword extraction
Keyword extraction can be a difficult problem. Free extractors come with their
own rigid taxonomies and may not be production quality.
For example, with the Python Natural Language Toolkit (NLTK)...
...probably confirm the North Korean side in its willingness…
(NLTK tags shown on the slide: geo-socio-political group, geo-political entity)
Using a human-curated whitelist
We maintain a “whitelist” of extracted keywords. This solves two problems:
● Quality control supervision of proposed keywords.
● Better custom keyword taxonomies are assigned to keywords on the list.
NLTK finds “North Korean” in the text, and we find it in the whitelist with its tag
...probably confirm the North Korean side in its willingness…
(Whitelist tag: Ethnicity)
But we have two more problems:
● Human supervision is time-consuming (prohibitively so with a large list).
● This doesn’t solve the case of a keyword phrase incorrectly split by NLTK.
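The whitelist step can be sketched as follows. This is a minimal illustration, not the production code: the whitelist entries and their tags, the `WHITELIST` name, and the `filter_keywords` helper are all hypothetical, and the proposed keywords are assumed to come from an upstream extractor such as NLTK.

```python
# Hypothetical whitelist: normalized keyword -> custom taxonomy tag.
WHITELIST = {
    "north korean": "Ethnicity",
    "federal reserve": "Organization",
}


def filter_keywords(proposed):
    """Keep only whitelisted keywords and attach their custom tags.

    `proposed` is a list of keyword strings from an upstream extractor.
    Keywords missing from the whitelist are returned separately so a
    human can review them later.
    """
    accepted, needs_review = [], []
    for keyword in proposed:
        tag = WHITELIST.get(keyword.strip().lower())
        if tag is None:
            needs_review.append(keyword)
        else:
            accepted.append((keyword, tag))
    return accepted, needs_review


accepted, needs_review = filter_keywords(["North Korean", "willingness"])
print(accepted)      # [('North Korean', 'Ethnicity')]
print(needs_review)  # ['willingness']
```

Returning the rejected keywords separately keeps the human review queue explicit, which is where the time cost noted above comes from.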
Building a custom keyword extractor
The article Natural Language Processing (almost) from Scratch (R. Collobert et
al. 2011) introduces the “senna” named entity (keyword) extractor:
● A two-layer fully connected neural network.
● For each word, the input is its surrounding “context” words in the text.
● Input context words are mapped to 50-dim vectors in a word2vec model.
(Figure: the context words “cat sat on the mat” are fed to the network, which outputs one of the tags I, O, E, B, S.)
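The context-window input described above can be sketched as follows. This is a simplified illustration rather than the senna implementation: the window size, the `PAD` token, and the `context_windows` helper are assumptions, and the embedding lookup and the network itself are omitted.

```python
PAD = "<pad>"  # hypothetical padding token for sentence boundaries


def context_windows(tokens, half_width=2):
    """Return one window of surrounding words per token.

    Each window holds the token itself plus `half_width` neighbors on
    each side, padded at the sentence boundaries. A tagger would map
    each word in the window to its embedding vector and classify the
    center word.
    """
    padded = [PAD] * half_width + list(tokens) + [PAD] * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(tokens))]


windows = context_windows(["cat", "sat", "on", "the", "mat"])
print(windows[0])  # ['<pad>', '<pad>', 'cat', 'sat', 'on']
print(windows[2])  # ['cat', 'sat', 'on', 'the', 'mat']
```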
The senna architecture
Natural Language Processing (almost) from Scratch (R. Collobert et al. 2011)
Senna architecture advantages
● Results are often better than NLTK, thus requiring less human supervision.
● Minimal text preprocessing (for example, no chunking) is required.
● Because input is context-based, it may be possible to train a senna
network with automatically generated partially-annotated training data.
● With greater ease of generating training data, we can train keyword
extractors that are tailored to customer needs (taxonomy, jargon, etc.).
Sentiment heat maps
Sentiment heat maps indicate areas of potentially high interest in the video.
● Based on word sentiment and heated language.
● This may not be sufficient. We can also incorporate information from the
audio stream, such as loudness, to indicate areas of interest.
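One way to combine word sentiment with loudness is sketched below. The lexicon, the equal weighting, and the `heat_map` helper are assumptions made for illustration; the slides do not specify how the two signals are combined.

```python
# Hypothetical word-level sentiment lexicon (scores in [-1, 1]).
LEXICON = {"great": 0.8, "terrible": -0.9, "outrage": -0.7}


def heat_map(segments, loudness, weight=0.5):
    """Score each transcript segment for potential interest.

    `segments` is a list of word lists, one per time window, and
    `loudness` holds a normalized audio level in [0, 1] for the same
    windows. The text score is the mean absolute word sentiment, so
    strongly positive and strongly negative language both count.
    """
    scores = []
    for words, level in zip(segments, loudness):
        text = sum(abs(LEXICON.get(w.lower(), 0.0)) for w in words)
        text /= max(len(words), 1)
        scores.append((1 - weight) * text + weight * level)
    return scores


segments = [["the", "weather", "today"], ["terrible", "outrage"]]
scores = heat_map(segments, loudness=[0.2, 0.9])
print(scores)  # the second, heated and loud, segment scores higher
```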
Challenges and future work
Keyword extraction:
● Adapting the senna model for in-house custom keyword extractors.
● Improving keyword extraction for “messy” spoken-language transcripts.
● How to quickly create training data for customer-dependent taxonomies?
Topic modeling:
● Supervised for customer-dependent topics?
● Unsupervised if the user wants to discover unknown information?
● How to do good topic modeling for “messy” spoken-language transcripts?
Information from video
● Object detection
● Face recognition
● Scene recognition
Object detection
Performing object detection on frames tells you what objects appear in a video:
We use various pre-trained models from the TensorFlow detection model zoo.
Challenges with object detection
Freely-available object detection models based on ResNet and Inception
architectures are production quality. Nonetheless, there are some challenges:
● What objects do we want to detect? Is this customer dependent?
● How do we create enough training data to build custom models quickly?
Scene recognition
We train a wide-ResNet model (S. Zagoruyko et al. 2016) to recognize scenes:
We train the network using the Places365 dataset with consolidated scene
categories (for example, not distinguishing stores based on their interiors).
Face recognition
A face recognition model requires millions of faces for training and comprises
many steps: face detection, cropping and re-scaling, and classification.
Training such a model from scratch is very time-consuming. However, near
state-of-the-art models are freely available. We are using dlib face recognition.
Face embeddings
Rather than simply recognize faces from a small list of people, most face
recognition models are trained to give good face-to-vector embeddings.
The model user then provides a list of images of faces to recognize, the model
maps the faces to vectors, and query faces are identified via k-nn search.
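The lookup step can be sketched as a nearest-neighbor search over embeddings. This is a minimal sketch with k = 1 and toy vectors: the `identify` helper, the gallery names, and the `max_distance` threshold are assumptions, and in practice the vectors would come from the face recognition model.

```python
import math


def identify(query, gallery, max_distance=0.6):
    """Return the name of the nearest gallery embedding, or None.

    `gallery` maps a name to a list of embedding vectors for that
    person. `max_distance` is an assumed threshold: a query farther
    than this from every known face is reported as unknown.
    """
    best_name, best_dist = None, float("inf")
    for name, vectors in gallery.items():
        for vector in vectors:
            dist = math.dist(query, vector)  # Euclidean distance
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


# Toy 3-dimensional embeddings; real models produce much longer vectors.
gallery = {
    "person_a": [[0.1, 0.2, 0.3], [0.1, 0.25, 0.3]],
    "person_b": [[0.9, 0.8, 0.7]],
}
print(identify([0.12, 0.2, 0.31], gallery))  # person_a
print(identify([5.0, 5.0, 5.0], gallery))    # None
```

The brute-force loop is fine for a small gallery; a large gallery would likely need an indexed nearest-neighbor structure instead.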
Who should we recognize?
What faces should we recognize? The answer may be customer dependent:
In generic situations, we should recognize people who are “famous enough”
(well-known politicians, celebrities, artists, scientists, thought-leaders, etc.)
What constitutes famous enough? How do we make a list of their names?
Given the list of names, how do we get enough pictures of their faces?
(Pictured: Steven Flores and Luke Hosking, engineers at Comp Three.)
Famous enough?
Our criterion for “famous enough” is partly set by our need to get a list of names
of such famous people: famous = has a Wikipedia biography with a birthday.
We can easily pull this list of famous people from the Wikidata API. We record
each person’s name, birthday, occupation(s), and Wikipedia page address.
Brad Pitt is in... Rich Skrenta is out (no birthday on Wikipedia)
The gallery problem
Many state-of-the-art facial recognition systems are still not good at picking the
correct face from a large gallery of faces. They generate many false positives.
The rank-1 accuracy decreases as the gallery “distractor” face count increases. (The MegaFace
Benchmark: 1 Million Faces for Recognition at Scale, I. Kemelmacher-Shlizerman et al. 2015)
A potential solution...
Given some faces each with a list of candidate names, use other information
(topic modeling, co-occurrence frequency) to find optimal name assignments:
On the left, Idina Menzel is correctly tagged. On the right, Amy Grant is wrongly
tagged “Fanny Cadeo;” her name is the second choice based on the image.
Use the fact that both are musicians to correct the second tag to “Amy Grant.”
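A simple version of this idea is sketched below. It is only an illustration, not the authors' method: the `rerank` helper, the fixed `bonus`, the placeholder names, and the scores are all hypothetical. Each candidate's image score is raised when the candidate shares an occupation with people already tagged in the same video.

```python
def rerank(candidates, context_occupations, bonus=0.2):
    """Pick a name using image score plus a context bonus.

    `candidates` is a list of (name, image_score, occupations) tuples
    for one face. A candidate sharing an occupation with other people
    already tagged in the video gets `bonus` added to its score.
    """
    def combined(candidate):
        name, score, occupations = candidate
        shared = set(occupations) & set(context_occupations)
        return score + (bonus if shared else 0.0)

    return max(candidates, key=combined)[0]


# Placeholder names and scores; "name_a" is the first choice by image alone.
candidates = [
    ("name_a", 0.55, {"actor"}),
    ("name_b", 0.50, {"musician"}),
]
print(rerank(candidates, context_occupations={"musician"}))  # name_b
```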
Processing time considerations
● Estimated size of a “large” video cache: 40,000 videos
● Number of frames in a typical 30 second video: 750
● Average video frame processing time (GTX 1080 GPU): about 1 second
→ Estimated time to process the entire video cache: almost one year...
That is far too long to process this hypothetical video cache!
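The estimate follows directly from the three figures above, assuming every frame is processed sequentially on a single GPU (the variable names are ours):

```python
videos = 40_000            # size of the "large" video cache
frames_per_video = 750     # typical 30-second video
seconds_per_frame = 1.0    # approximate processing time on one GPU

total_seconds = videos * frames_per_video * seconds_per_frame
total_days = total_seconds / (60 * 60 * 24)
print(f"{total_days:.0f} days")  # 347 days
```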
Solution: only sample video keyframes (frames at shot changes or high-action
moments). These may contain most of the relevant information. For example,
● https://www.youtube.com/watch?v=_7WZ74F3j_I: 2650 frames
● Number of “irregularly spaced” keyframes processed: 10 keyframes
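Keyframe selection can be approximated by comparing consecutive frames. The sketch below is an assumption about how such sampling might work, not the pipeline's actual method: frames are plain lists of pixel intensities, and the `select_keyframes` helper and its `threshold` are hypothetical.

```python
def select_keyframes(frames, threshold=0.3):
    """Return indices of frames that differ sharply from the previous one.

    `frames` is a list of equal-length sequences of pixel intensities
    in [0, 1]. The first frame is always kept; a later frame is kept
    when its mean absolute difference from its predecessor exceeds
    `threshold`, a rough proxy for a shot change.
    """
    if not frames:
        return []
    keyframes = [0]
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            keyframes.append(i)
    return keyframes


# Toy 4-pixel frames: a cut happens between index 2 and index 3.
frames = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.12, 0.1, 0.1],
    [0.11, 0.12, 0.1, 0.1],
    [0.9, 0.8, 0.9, 0.85],
]
print(select_keyframes(frames))  # [0, 3]
```

Only the selected frames would then be passed to the object, scene, and face models, which is where the time saving comes from.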
Challenges and future work
Object detection and scene recognition:
● What do we want to detect? (Customer-dependent?)
● How do we generate enough training data quickly and efficiently?
● What benchmarks do we need to hit for production quality?
Face recognition:
● Who can we / do we want to detect? (Customer-dependent?)
● How can we use other information to improve face-to-name assignments?
● What benchmarks do we need to hit for production quality?
Scalability:
● How can we reduce the wait time for image evaluation?
● What tradeoffs must we make to minimize video processing time?
● What can we trim without compromising performance benchmarks?
Augi Demo
(Architecture diagram: Augi real-time components on a Docker host running on a Digital Ocean instance.)
● Nginx (port 80): serves index.html and bundle.js
● Augi Backend (port 5000)
● Text Annotator (port 5001)
● Image Service (port 5002)
● Elasticsearch (port 9200/videos/)
● File system: video object store
Real-time Technologies
Frontend
● React
● Apollo
● ChartJS
Backend
● Flask
● Graphene
● Elasticsearch Client
Microservices
Augi Preprocessing Pipeline
(Pipeline diagram: Python code loops over the videos in the video store on the file system (LoopOverVideos). Steps shown: video frame sampling, audio extractor, transcript extractor, text annotation, classify image, DataConsolidation, and ESDocumentInsert into Elasticsearch.)
Preprocessing Technologies
● Core pipeline
○ ffmpeg
○ Google Cloud Speech
○ Amazon S3
○ Elasticsearch
● Image classification
○ Tensorflow
○ dlib
○ flask
● Text annotation
○ pygtrie
○ flask
Where the magic happens
Augi Preprocessing Workflow
Python scripts
● download videos and video metadata (youtube, proprietary APIs)
● manage overall process for list of videos to be enriched
Docker
● text Annotator
● image Classifier
Modular architecture
● file system based cache
● orchestration with override flags
Challenges
Iterative development over tens to hundreds of thousands of videos
A file-system-based cache of the data produced by each preprocessing step,
along with granular overrides for each preprocessing method, allows for targeted
testing and implementation.
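The cache-with-override pattern can be sketched as follows. The directory layout, the `cached_step` helper, and the `fake_transcript` stand-in are hypothetical; the sketch only illustrates how a cached result is reused unless an override flag forces the step to run again.

```python
import json
import tempfile
from pathlib import Path


def cached_step(cache_dir, video_id, step_name, compute, override=False):
    """Run one preprocessing step, reusing a cached result if present.

    The result is stored as JSON at <cache_dir>/<video_id>/<step_name>.json.
    Passing override=True forces the step to run again and overwrite
    the cached file, which allows re-testing a single step.
    """
    path = Path(cache_dir) / video_id / f"{step_name}.json"
    if path.exists() and not override:
        return json.loads(path.read_text())
    result = compute(video_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result


calls = []


def fake_transcript(video_id):
    """Stand-in for an expensive step such as transcript extraction."""
    calls.append(video_id)
    return {"video": video_id, "text": "placeholder transcript"}


with tempfile.TemporaryDirectory() as cache_dir:
    cached_step(cache_dir, "vid1", "transcript", fake_transcript)
    cached_step(cache_dir, "vid1", "transcript", fake_transcript)
    cached_step(cache_dir, "vid1", "transcript", fake_transcript, override=True)

print(len(calls))  # 2
```

The second call is served from the cache, so the expensive step runs only twice: once initially and once for the override.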
On-prem challenge: no internet access
We needed the architecture to be usable on-prem for clients that require data
security (confidential/healthcare sectors). The external services currently used
are Google Cloud Speech and AWS S3; local disk storage and products like Nuance
Dragon could be run on-prem instead.
Questions?

More Related Content

Similar to Breaking Through The Challenges of Scalable Deep Learning for Video Analytics

Dl applicationlandscape-mar2018-180405144127
Dl applicationlandscape-mar2018-180405144127Dl applicationlandscape-mar2018-180405144127
Dl applicationlandscape-mar2018-180405144127
Aravindharamanan S
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
Daniel Zivkovic
 
Info Session : University Institute of engineering and technology , Kurukshet...
Info Session : University Institute of engineering and technology , Kurukshet...Info Session : University Institute of engineering and technology , Kurukshet...
Info Session : University Institute of engineering and technology , Kurukshet...
HRITIKKHURANA1
 
Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies
Andrés Leonardo Martinez Ortiz
 
Demo day
Demo dayDemo day
Demo day
DeepikaRana30
 
Xuedong Huang - Deep Learning and Intelligent Applications
Xuedong Huang - Deep Learning and Intelligent ApplicationsXuedong Huang - Deep Learning and Intelligent Applications
Xuedong Huang - Deep Learning and Intelligent Applications
Machine Learning Prague
 
Technology and AI sharing - From 2016 to Y2017 and Beyond
Technology and AI sharing - From 2016 to Y2017 and BeyondTechnology and AI sharing - From 2016 to Y2017 and Beyond
Technology and AI sharing - From 2016 to Y2017 and Beyond
James Huang
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
Dhruv Gohil
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
Trivadis
 
Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2
Sara Hooker
 
OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014
Paris Open Source Summit
 
Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
Seth Grimes
 
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019 Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geißler
 
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
Turi, Inc.
 
Chatbots and Natural Language Generation - A Bird Eyes View
Chatbots and Natural Language Generation - A Bird Eyes ViewChatbots and Natural Language Generation - A Bird Eyes View
Chatbots and Natural Language Generation - A Bird Eyes View
Mark Cieliebak
 
Machine learning 101 Talk at Freshworks
Machine learning 101 Talk at FreshworksMachine learning 101 Talk at Freshworks
Machine learning 101 Talk at Freshworks
Shanmuga(Shyam) Anandaraman
 
AI 2023.pdf
AI 2023.pdfAI 2023.pdf
AI 2023.pdf
DavidCieslak4
 
Analyzing Big Data's Weakest Link (hint: it might be you)
Analyzing Big Data's Weakest Link  (hint: it might be you)Analyzing Big Data's Weakest Link  (hint: it might be you)
Analyzing Big Data's Weakest Link (hint: it might be you)
HPCC Systems
 
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdfITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
Ortus Solutions, Corp
 
Career in Software Development
Career in Software Development  Career in Software Development
Career in Software Development
neosphere
 

Similar to Breaking Through The Challenges of Scalable Deep Learning for Video Analytics (20)

Dl applicationlandscape-mar2018-180405144127
Dl applicationlandscape-mar2018-180405144127Dl applicationlandscape-mar2018-180405144127
Dl applicationlandscape-mar2018-180405144127
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
 
Info Session : University Institute of engineering and technology , Kurukshet...
Info Session : University Institute of engineering and technology , Kurukshet...Info Session : University Institute of engineering and technology , Kurukshet...
Info Session : University Institute of engineering and technology , Kurukshet...
 
Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies Google Cloud: Data Analysis and Machine Learningn Technologies
Google Cloud: Data Analysis and Machine Learningn Technologies
 
Demo day
Demo dayDemo day
Demo day
 
Xuedong Huang - Deep Learning and Intelligent Applications
Xuedong Huang - Deep Learning and Intelligent ApplicationsXuedong Huang - Deep Learning and Intelligent Applications
Xuedong Huang - Deep Learning and Intelligent Applications
 
Technology and AI sharing - From 2016 to Y2017 and Beyond
Technology and AI sharing - From 2016 to Y2017 and BeyondTechnology and AI sharing - From 2016 to Y2017 and Beyond
Technology and AI sharing - From 2016 to Y2017 and Beyond
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
 
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - TrivadisTechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
TechEvent 2019: Artificial Intelligence in Dev & Ops; Martin Luckow - Trivadis
 
Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2
 
OWF14 - Big Data : The State of Machine Learning in 2014
OWF14 - Big Data : The State of Machine  Learning in 2014OWF14 - Big Data : The State of Machine  Learning in 2014
OWF14 - Big Data : The State of Machine Learning in 2014
 
Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019 Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019
 
Dato Keynote
Dato KeynoteDato Keynote
Dato Keynote
 
Chatbots and Natural Language Generation - A Bird Eyes View
Chatbots and Natural Language Generation - A Bird Eyes ViewChatbots and Natural Language Generation - A Bird Eyes View
Chatbots and Natural Language Generation - A Bird Eyes View
 
Machine learning 101 Talk at Freshworks
Machine learning 101 Talk at FreshworksMachine learning 101 Talk at Freshworks
Machine learning 101 Talk at Freshworks
 
AI 2023.pdf
AI 2023.pdfAI 2023.pdf
AI 2023.pdf
 
Analyzing Big Data's Weakest Link (hint: it might be you)
Analyzing Big Data's Weakest Link  (hint: it might be you)Analyzing Big Data's Weakest Link  (hint: it might be you)
Analyzing Big Data's Weakest Link (hint: it might be you)
 
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdfITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
ITB_2023_Chatgpt_Box_Scott_Steinbeck.pdf
 
Career in Software Development
Career in Software Development  Career in Software Development
Career in Software Development
 

Recently uploaded

Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
abdulrafaychaudhry
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Yara Milbes
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
Roshan Dwivedi
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Mind IT Systems
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 

Recently uploaded (20)

Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi ArabiaTop 7 Unique WhatsApp API Benefits | Saudi Arabia
Top 7 Unique WhatsApp API Benefits | Saudi Arabia
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Launch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in MinutesLaunch Your Streaming Platforms in Minutes
Launch Your Streaming Platforms in Minutes
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient...
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 

Breaking Through The Challenges of Scalable Deep Learning for Video Analytics

  • 1. Breaking Through the Challenges of Scalable Deep Learning for Video Analytics Steven Flores, sflores@compthree.com Luke Hosking, lhosking@compthree.com
  • 2. Use cases A customer is somebody with a lot of unannotated video whose content they want annotated and indexed into a searchable database. For example, ● Media: video library going back decades. ● Research institutions: video from a lecture series. ● Management and HR: conference/meetings notes.
  • 3. What info do we want from video? ● What and who is in the video? ● What happens in the video? ● What is the video about? (Example here: https://www.youtube.com/watch?v=X3a-ZX6ObJU)
  • 4. Information from audio ● Topic modeling speech transcripts. ● Sentiment analysis of speech transcripts. ● Hot language and/or loud sounds heat map. ● Keywords (named entities) from transcripts. The Federal Reserve is widely expected to increase interest rates again Wednesday... Politics and policy Sports Science and Technology
  • 5. Using keywords to extract info Within transcripts, keywords such as people, locations, organizations, and geo-political entities carry much of the latent information we seek from a video. For example, a video transcript containing the excerpt ...probably confirm the North Korean side in its willingness… should appear if we search for the term “North Korea.” Also, the presence of this term, along with other keywords, may support a topic assignment.
  • 6. Keyword extraction Keyword extraction can be a difficult problem. Free extractors always come with their own ridgid taxonomy and may not be production quality: For example, with the python natural language toolkit (NLTK)... ...probably confirm the North Korean side in its willingness… Geo-socio-political group Geo-political entity
  • 7. Using a human-curated whitelist We maintain a “whitelist” of extracted keywords. This solves two problems: ● Quality control supervision of proposed keywords. ● Better custom keyword taxonomies are assigned to keywords on the list. NLTK finds “North Korean” in the text, and we find it in the whitelist with its tag ...probably confirm the North Korean side in its willingness… Ethnicity But we have two more problems: ● Human supervision is time-consuming (prohibitively so with a large list). ● This doesn’t solve the case of a keyword phrase incorrectly split by NLTK.
  • 8. Building a custom keyword extractor The article Natural Language Processing (almost) from Scratch (R. Collobert et al. 2011) introduces the “senna” named entity (keyword) extractor: ● A two-layer fully connected neural network. ● For each word, the input is its surrounding “context” words in the text. ● Input context words are mapped to 50-dim vectors in a word2vec model. cat sat on the mat I O E B S
  • 9. The senna architecture Natural Language Processing (almost) from Scratch (R. Collobert et al. 2011)
  • 10. Senna architecture advantages ● Results are often better than NLTK, thus requiring less human supervision. ● Minimal text preprocessing (for example, no chunking) is required. ● Because input is context-based, it may be possible to train a senna network with automatically generated partially-annotated training data. ● With greater ease of generating training data, we can train keyword extractors that are tailored to customer needs (taxonomy, jargon, etc.).
  • 11. Sentiment heat maps Sentiment heat maps indicate areas of potentially high interest in the video. ● Based on word sentiment and heated language. ● This may not be sufficient. We can also incorporate information from the audio stream, such as loudness, to indicate areas of interest.
  • 12. Challenges and future work Keyword extraction: ● Adapting the senna model for in-house custom keyword extractors. ● Improving keyword extraction for “messy” spoken-language transcripts. ● How to quickly create training data for customer-dependent taxonomies? Topic modeling: ● Supervised for customer-dependent topics? ● Unsupervised if the user wants to discover unknown information? ● How to do good topic modeling for “messy” spoken-language transcripts?
  • 13. Information from video ● Object detection ● Face recognition ● Scene recognition
  • 14. Object detection Performing object detection on frames tells you what objects appear in a video: We use various pre-trained models from the TensorFlow detection model zoo.
  • 15. Challenges with object detection Freely-available object detection models based on ResNet and Inception architectures are production quality. Nonetheless, there are some challenges: ● What objects do we want to detect? Is this customer dependent? ● How do we create enough training data to build custom models quickly?
  • 16. Scene recognition We train a wide-ResNet model (S. Zagoruyko et al. 2016) to recognize scenes: We train the network using the Places365 dataset with consolidated scene categories (for example, not distinguishing stores based on their interiors).
  • 17. Face recognition A face recognition model requires millions of faces for training and comprises many steps: face detection, cropping and re-scaling, and classification. Training such a model from scratch is very time-consuming. However, near-state-of-the-art models are freely available. We are using dlib face recognition.
  • 18. Face embeddings Rather than simply recognize faces from a small list of people, most face recognition models are trained to give good face-to-vector embeddings. The model user then provides a list of images of faces to recognize, the model maps the faces to vectors, and query faces are identified via k-nn search.
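The k-nn identification step can be sketched as below. The toy 3-dimensional vectors stand in for real face embeddings, and the `knn_identify` helper, `k=3`, and the 0.6 distance cutoff are assumptions for illustration; a real system would tune these for its embedding model.

```python
import math
from collections import Counter


def knn_identify(query, gallery, k=3, max_distance=0.6):
    """Name a query embedding by majority vote among its k nearest gallery
    embeddings, ignoring neighbors farther away than max_distance.

    gallery: list of (name, vector) pairs. Returns a name, or None when no
    gallery face is close enough.
    """
    scored = sorted((math.dist(query, vec), name) for name, vec in gallery)
    nearest = [name for dist, name in scored[:k] if dist <= max_distance]
    if not nearest:
        return None
    return Counter(nearest).most_common(1)[0][0]


# Toy gallery with made-up names and vectors.
gallery = [
    ("Alice", (0.0, 0.0, 0.0)),
    ("Alice", (0.1, 0.0, 0.0)),
    ("Bob", (1.0, 1.0, 1.0)),
]
match = knn_identify((0.05, 0.0, 0.0), gallery)
no_match = knn_identify((5.0, 5.0, 5.0), gallery)
```

Returning `None` for distant queries matters in practice: most faces in a video will not belong to anyone in the gallery.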
  • 19. Who should we recognize? What faces should we recognize? The answer may be customer dependent: In generic situations, we should recognize people who are “famous enough” (well-known politicians, celebrities, artists, scientists, thought-leaders, etc.) What constitutes famous enough? How do we make a list of their names? Given the list of names, how do we get enough pictures of their faces? Steven Flores (Engineer, Comp Three) Luke Hosking (Engineer, Comp Three)
  • 20. Famous enough? Our criterion for “famous enough” is partly set by our need to get a list of names of such famous people: famous = has a Wikipedia biography with birthday. We can easily pull this list of famous people from the Wikidata API. We record each person’s name, birthday, occupation(s), and Wikipedia page address. Brad Pitt is in... Rich Skrenta is out (no birthday on Wikipedia)
  • 21. The gallery problem Many state-of-the-art facial recognition systems are still not good at picking the correct face from a large gallery of faces. They generate many false positives. The rank-1 accuracy decreases as the number of “distractor” faces in the gallery increases. (The MegaFace Benchmark: 1 Million Faces for Recognition at Scale, I. Kemelmacher-Shlizerman et al. 2015)
  • 22. A potential solution... Given some faces each with a list of candidate names, use other information (topic modeling, co-occurrence frequency) to find optimal name assignments: On the left, Idina Menzel is correctly tagged. On the right, Amy Grant is wrongly tagged “Fanny Cadeo;” her name is the second choice based on the image. Use the fact that both are musicians to correct the second tag to “Amy Grant.”
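A brute-force sketch of this idea follows. Each face comes with ranked candidate names and image scores, and a bonus is added whenever two chosen people share an occupation. The scores, the bonus value, and the occupation table are illustrative placeholders; only the shared "musician" tag for the two correct names comes from the slide.

```python
from itertools import product


def assign_names(candidates, occupations, shared_bonus=0.2):
    """Pick one name per face, maximizing the total image score plus a bonus
    for every pair of chosen people who share an occupation.

    candidates:  one list per face of (name, image_score) pairs.
    occupations: dict mapping name -> set of occupation labels.
    """
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        names = [name for name, _ in combo]
        if len(set(names)) < len(names):
            continue  # one person cannot be two different faces in a frame
        score = sum(image_score for _, image_score in combo)
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                shared = occupations.get(names[i], set()) & occupations.get(names[j], set())
                if shared:
                    score += shared_bonus
        if score > best_score:
            best, best_score = names, score
    return best


# Made-up image scores; the second face's top candidate is the wrong name.
candidates = [
    [("Idina Menzel", 0.90)],
    [("Fanny Cadeo", 0.55), ("Amy Grant", 0.50)],
]
# Placeholder occupation table (empty set = unknown in this illustration).
occupations = {
    "Idina Menzel": {"musician"},
    "Amy Grant": {"musician"},
    "Fanny Cadeo": set(),
}
assignment = assign_names(candidates, occupations)
```

Exhaustive search grows exponentially with the number of faces, so this only suits frames with a handful of faces and short candidate lists.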
  • 23. Processing time considerations ● Estimated size of a “large” video cache: 40,000 videos ● Number of frames in a typical 30 second video: 750 ● Average video frame processing time (GTX 1080 GPU): about 1 second → Estimated time to process the entire video cache: almost one year... That is far too long! Solution: only sample video keyframes (frames at shot changes or high-action moments). These may contain most of the relevant information. For example, ● https://www.youtube.com/watch?v=_7WZ74F3j_I: 2650 frames ● Number of “irregularly spaced” keyframes processed: 10 keyframes
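The estimate above can be checked with a few lines of arithmetic. The keyframe figure assumes, purely for illustration, that about 10 keyframes per video (the count from the example video) is typical, and that all frames are processed sequentially on a single GPU.

```python
NUM_VIDEOS = 40_000
FRAMES_PER_VIDEO = 750     # typical 30 second video
SECONDS_PER_FRAME = 1.0    # average processing time per frame
KEYFRAMES_PER_VIDEO = 10   # assumption: the example video's count is typical
SECONDS_PER_DAY = 86_400


def processing_days(frames_per_video):
    """Days needed to process the whole cache sequentially."""
    total_seconds = NUM_VIDEOS * frames_per_video * SECONDS_PER_FRAME
    return total_seconds / SECONDS_PER_DAY


all_frames_days = processing_days(FRAMES_PER_VIDEO)   # about 347 days
keyframe_days = processing_days(KEYFRAMES_PER_VIDEO)  # about 4.6 days
```

Under these assumptions, keyframe sampling cuts the work by a factor of 75, from roughly a year to under a week.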
  • 24. Challenges and future work Object detection and scene recognition: ● What do we want to detect? (Customer-dependent?) ● How do we generate enough training data quickly and efficiently? ● What benchmarks do we need to hit for production quality? Face recognition: ● Who can we / do we want to detect? (Customer-dependent?) ● How can we use other information to improve face-to-name assignments? ● What benchmarks do we need to hit for production quality? Scalability: ● How can we reduce the wait time for image evaluation? ● What tradeoffs must we make to minimize video processing time? ● What can we trim without compromising performance benchmarks?
  • 26. [Architecture diagram; labels in extraction order] Digital Ocean Instance; Docker Host; Augi Real-time Components; Port 5000; Augi Backend; Port 5001; Text Annotator; index.html; bundle.js; Port 80; Nginx; Elasticsearch; File System Video Object Store; Port 9200; /videos/; Port 5002; Image Service
  • 27. Real-time Technologies Frontend ● React ● Apollo ● ChartJS Backend ● Flask ● Graphene ● Elasticsearch Client
  • 28. [Pipeline diagram; labels in extraction order] Microservices; Augi Preprocessing Pipeline; Python Code; Video Frame Sampling; Transcript Extractor; Audio Extractor; Elasticsearch; Text Annotation; Video Store on File System; Classify Image; DataConsolidation; ESDocumentInsert; LoopOverVideos
  • 29. Preprocessing Technologies ● Core pipeline ○ ffmpeg ○ Google Cloud Speech ○ Amazon S3 ○ Elasticsearch ● Image classification ○ TensorFlow ○ dlib ○ Flask ● Text annotation ○ pygtrie ○ Flask
  • 30. Where the magic happens
  • 31. Augi Preprocessing Workflow Python scripts ● download videos and video metadata (YouTube, proprietary APIs) ● manage the overall process for the list of videos to be enriched Docker ● text annotator ● image classifier Modular architecture ● file-system-based cache ● orchestration with override flags
  • 32. Challenges Iterative development over tens to hundreds of thousands of videos: a file-system-based cache of the data produced by each preprocessing step, along with granular overrides for each preprocessing method, allows for targeted testing and implementation. On-prem challenge: no internet access. We needed the architecture to be usable on-prem for clients that require data security (confidential/healthcare sectors). The external services currently used are Google Cloud Speech and AWS S3; disk storage and products like Nuance Dragon could be run on-prem instead.
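The cache-plus-override pattern described here might look like the sketch below. The `cached_step` helper, the one-JSON-file-per-step layout, and the step names are hypothetical; they are not the actual Augi code.

```python
import json
import tempfile
from pathlib import Path


def cached_step(cache_dir, video_id, step_name, compute, override=False):
    """Run one preprocessing step, reusing its cached result when present.

    The result is stored at <cache_dir>/<video_id>/<step_name>.json.
    Passing override=True forces recomputation of this step only.
    """
    path = Path(cache_dir) / video_id / f"{step_name}.json"
    if path.exists() and not override:
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result


calls = []  # records how often the expensive step really runs


def fake_transcribe():
    """Stand-in for an expensive step such as speech-to-text."""
    calls.append(1)
    return {"transcript": "hello"}


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_step(cache_dir, "video123", "transcript", fake_transcribe)
    second = cached_step(cache_dir, "video123", "transcript", fake_transcribe)
    third = cached_step(cache_dir, "video123", "transcript", fake_transcribe,
                        override=True)
```

Here the second call is served from disk and only the third, with the override flag, recomputes. With per-step flags like this, one method can be re-run across the whole video list without repeating the others.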