Edge-based Discovery of Training Data for Machine Learning
Ziqiang (Edmond) Feng, Shilpa George, Jan Harkes, Padmanabhan Pillai†, Roberta Klatzky, Mahadev Satyanarayanan
Carnegie Mellon University and †Intel Labs
The New Yorker magazine April 20, 2018, p. 41
The Deep Learning Recipe
1. Collect a large amount of data and label it
2. Select a model and train a DNN
3. Deploy the DNN for inference
TPOD @ CMU
DNNs for Domain Experts
Valuable in ecology, military intelligence, medical diagnosis, etc.
• Low base rate (prevalence) in the data
• Requires expertise to identify
Examples: the masked palm civet (Paguma larvata), a transmitter of SARS during its 2003 outbreak in China; the BUK-M1, believed to have shot down MH17 and killed 298 in 2014; nuclear atypia in cancer.
Building a Training Set Is Hard
 Crowds are not experts
Crowd-sourcing (e.g., Amazon Mechanical Turk) is not applicable
 Access to data is restricted
Patient privacy, business policy, national security, etc.
In the worst case, a single domain expert has to generate an entire training set of 10³ to 10⁴ examples.
(Easily confused look-alikes: masked palm civet, red panda, raccoon)
Our Contribution: Eureka
 A system for efficient discovery of training examples from data sources dispersed over the Internet (this paper focuses on images)
 Goal: to effectively utilize an expert’s time and attention
 Key concepts:
 Early discard
 Iterative discovery workflow
 Edge computing
Eureka’s Architecture
An expert, using a domain-specific GUI, connects over the Internet to cloudlets. Each cloudlet sits on a LAN next to an archival data source or a live video feed, giving it high-bandwidth, low-latency access to the data.
• Each cloudlet executes early-discard code to drop clearly irrelevant data.
• Only a tiny fraction of the data, along with meta-data, is transmitted and shown to the user, consuming little Internet bandwidth.
Example GUI: Finding Deer
Iterative Discovery Workflow
Early-discard filters form a ladder: as the number of labeled examples grows (from 10⁰ to 10⁴, log scale), filter accuracy rises (not to scale):
1. Explicit features, manual weights (RGB histogram, SIFT, perceptual hashing)
2. Explicit features, learned weights (HOG + SVM)
3. Shallow transfer learning (MobileNet + SVM)
4. Deep transfer learning (Faster R-CNN finetuning)
5. Deep learning
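The first rung of this ladder can be sketched in a few lines. This is an illustrative simplification, not Eureka's actual filter code: an "image" here is just a list of (r, g, b) tuples, and the similarity threshold plays the role of a manually chosen weight.

```python
# Sketch of the simplest early-discard filter: a coarse RGB histogram
# compared against a positive example. Real Eureka filters run inside
# Docker containers on cloudlets; this is a toy stand-in.

def rgb_histogram(pixels, bins=4):
    """Coarse, normalized RGB histogram with bins**3 buckets."""
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    n = float(len(pixels))
    return [h / n for h in hist]

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def early_discard(pixels, reference_hist, threshold=0.5):
    """True if the image survives the filter and is shown to the user."""
    return histogram_intersection(rgb_histogram(pixels), reference_hist) >= threshold

# Usage: a mostly-brown reference (deer-like) keeps brown candidates
# and drops a solid-blue image.
reference = rgb_histogram([(139, 90, 43)] * 100)          # brown example
assert early_discard([(150, 100, 50)] * 100, reference)   # similar colors pass
assert not early_discard([(0, 0, 255)] * 100, reference)  # blue gets dropped
```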
Finding Deer (after a few iterations)
System Design and Implementation
 Software generality: allow use of CV code written in different languages, libraries and frameworks (e.g., Python, Matlab, C++, TensorFlow, PyTorch, Scikit-learn)
 Quickly empower experts with the newest CV innovations
 Encapsulate filters in Docker containers
 Runtime efficiency: rapidly process and discard large volumes of data
 Exploit specialized hardware on cloudlets (e.g., GPUs)
 Cache filter results to exploit temporal locality
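The result cache can be sketched as below. The keying scheme, a filter id plus a content hash of the image bytes, is an assumption for illustration (the paper describes the actual design); the point is that re-running an unchanged filter over unchanged data in a later iteration costs nothing.

```python
# Sketch of filter-result caching for temporal locality. Hypothetical
# key: (filter id, SHA-256 of the image bytes).
import hashlib

class FilterCache:
    def __init__(self):
        self._cache = {}
        self.misses = 0  # number of actual filter executions

    def run(self, filter_id, filter_fn, image_bytes):
        key = (filter_id, hashlib.sha256(image_bytes).hexdigest())
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = filter_fn(image_bytes)
        return self._cache[key]

# Usage: the second evaluation of the same filter on the same bytes
# is served from the cache.
cache = FilterCache()
is_large = lambda data: len(data) > 10
assert cache.run("size-filter-v1", is_large, b"0123456789abc") is True
assert cache.run("size-filter-v1", is_large, b"0123456789abc") is True
assert cache.misses == 1
```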
Matching System to User: Too Fast
The system should deliver images to the user at a rate at which the user can inspect them. When results arrive faster than that, computation and precious Internet bandwidth are wasted.
Suggestions:
1. Restrict to fewer cloudlets
2. Bias filters towards precision rather than recall
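Biasing a filter towards precision usually means raising its score threshold. A minimal sketch under that reading; the function name and the validation numbers are illustrative, not from the paper:

```python
# Pick the lowest score threshold whose precision on held-out,
# labeled items meets a target. Raising the target trades recall
# (fewer items shown) for precision (less junk shown).

def threshold_for_precision(scored, target_precision):
    """scored: list of (score, is_positive) pairs.
    Returns a threshold, or None if no threshold reaches the target."""
    for thresh in sorted({s for s, _ in scored}):
        kept = [pos for s, pos in scored if s >= thresh]
        if kept and sum(kept) / len(kept) >= target_precision:
            return thresh
    return None

validation = [(0.9, True), (0.8, True), (0.7, False),
              (0.6, True), (0.4, False), (0.2, False)]
# Demanding perfect precision keeps only the two top-scoring positives.
assert threshold_for_precision(validation, 1.0) == 0.8
# A looser target admits more items: higher recall, lower precision.
assert threshold_for_precision(validation, 0.6) == 0.4
```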
Matching System to User: Too Slow
The system should deliver images to the user at a rate at which the user can inspect them. When results arrive too slowly, expert time is wasted.
Obvious solution: scale out to more cloudlets (edge computing is your friend).
Risk: “junk” (false positives) causes user annoyance and dissatisfaction.
Rule of thumb: focus on reducing the false positive rate before scaling out.
Evaluation: Setup
Dataset: YFCC100M, 99.2 million Flickr photos with a real-life distribution of objects, evenly partitioned over the cloudlets.
Edge: 8 cloudlets with Nvidia GPUs, accessing data from local SSDs.
Client: connected to the cloudlets via the Internet.
Evaluation: Case Studies

                               Deer        Taj Mahal   Fire hydrant
Estimated base rate            0.07%       0.02%       0.005%
Collected positives            111         105         74
Images viewed by user          7,447       4,791       15,379
Images discarded by Eureka     2,104,076   2,542,889   2,734,070
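A quick check on what these numbers imply: the share of processed images the user actually had to view. Arithmetic only, using the figures from the table:

```python
# Fraction of processed images shown to the user, per case study
# (viewed / (viewed + discarded), numbers from the evaluation table).

cases = {
    "Deer":         (7_447,  2_104_076),
    "Taj Mahal":    (4_791,  2_542_889),
    "Fire hydrant": (15_379, 2_734_070),
}

for name, (viewed, discarded) in cases.items():
    fraction = viewed / (viewed + discarded)
    # In every case study, well under 1% of processed images reach
    # the user's screen.
    assert fraction < 0.01, name
```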
Eureka vs. Brute-force
Number of images the user viewed to collect ~100 true positives, compared for Deer, Taj Mahal and Fire hydrant (chart axis from 1,000 to 1,000,000, log scale):
• Brute-force: user views every image.
• Single-iteration Eureka: early-discard without iterative improvement.
• Eureka: the full iterative workflow.
Please refer to our paper for detailed results of each case study.
Iteratively Improving Productivity
The case of deer. Productivity (new true positives per minute) by iteration:

Iteration:     1     2     3     4     5
Productivity:  0.40  0.36  1.49  4.24  4.77

From the first iteration to the fifth, productivity improves ~10X.
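The ~10X figure follows directly from the per-iteration numbers above:

```python
# Productivity (new true positives per minute) for the deer case,
# iterations 1 through 5, from the slide.
productivity = [0.40, 0.36, 1.49, 4.24, 4.77]

speedup = productivity[-1] / productivity[0]
# Roughly an order of magnitude improvement over the first iteration.
assert 10 < speedup < 13
```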
Compute Must Co-locate with Data
Machine processing throughput (images/sec) of an RGB histogram filter, measured while throttling the bandwidth between compute and data to 10 Mbps, 25 Mbps, 100 Mbps and 1 Gbps (throughput axis 0 to 1000). Throughput collapses at low bandwidths. For reference, average US connectivity in 2017 was 18.7 Mbps.
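A back-of-envelope calculation shows why. The per-image size below (500 KB) is an assumption for illustration; the 18.7 Mbps figure comes from the slide:

```python
# Rough ceiling on processing throughput when images must cross the
# link between compute and data: the link itself becomes the
# bottleneck long before the filter does.

IMAGE_BYTES = 500 * 1024       # assumed average image size: 500 KB
BYTES_PER_MBPS = 1_000_000 / 8  # bytes per second per Mbps

def images_per_second(link_mbps):
    """Upper bound on images/sec imposed by link bandwidth alone."""
    return (link_mbps * BYTES_PER_MBPS) / IMAGE_BYTES

# At US-average 2017 connectivity, the link caps throughput at under
# 5 images/second, regardless of how fast the filter runs...
assert images_per_second(18.7) < 5
# ...while a 1 Gbps LAN to a co-located data source sustains hundreds.
assert images_per_second(1000) > 200
```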
More in the Paper
• Detailed system design and implementation
• An analytic model relating user wait time to base rate,
filter accuracy, cloudlet processing speed, etc.
• Detailed results of individual case studies
Conclusion
Eureka combines early discard, an iterative discovery workflow and edge computing to help domain experts efficiently discover training examples of rare phenomena from data sources on the edge.
Eureka reduces human labeling effort by two orders of magnitude compared to a brute-force approach.
Thank you!
I will also present at tomorrow’s PhD Forum to discuss related ideas.
More Related Content

What's hot

Intel 2020 Labs Day Keynote Slides
Intel 2020 Labs Day Keynote SlidesIntel 2020 Labs Day Keynote Slides
Intel 2020 Labs Day Keynote Slides
DESMOND YUEN
 
[Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network [Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network
KIMMINHA3
 
Best Practices for On-Demand HPC in Enterprises
Best Practices for On-Demand HPC in EnterprisesBest Practices for On-Demand HPC in Enterprises
Best Practices for On-Demand HPC in Enterprises
geetachauhan
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaging
geetachauhan
 
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
KIMMINHA3
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
Mahdi Hosseini Moghaddam
 
Using Simulation for Decision Support: Lessons Learned from FireGrid
Using Simulation for Decision Support: Lessons Learned from FireGridUsing Simulation for Decision Support: Lessons Learned from FireGrid
Using Simulation for Decision Support: Lessons Learned from FireGrid
gwickler
 
Making Sense of Information Through Planetary Scale Computing
Making Sense of Information Through Planetary Scale ComputingMaking Sense of Information Through Planetary Scale Computing
Making Sense of Information Through Planetary Scale Computing
Larry Smarr
 
II-SDV 2017: The Next Era: Deep Learning for Biomedical Research
II-SDV 2017: The Next Era: Deep Learning for Biomedical ResearchII-SDV 2017: The Next Era: Deep Learning for Biomedical Research
II-SDV 2017: The Next Era: Deep Learning for Biomedical Research
Dr. Haxel Consult
 
Open problems big_data_19_feb_2015_ver_0.1
Open problems big_data_19_feb_2015_ver_0.1Open problems big_data_19_feb_2015_ver_0.1
Open problems big_data_19_feb_2015_ver_0.1
Vijay Srinivas Agneeswaran, Ph.D
 
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM ProcessingHow to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
inside-BigData.com
 
AdClickFraud_Bigdata-Apic-Ist-2019
AdClickFraud_Bigdata-Apic-Ist-2019AdClickFraud_Bigdata-Apic-Ist-2019
AdClickFraud_Bigdata-Apic-Ist-2019
Neha gupta
 
Machine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsMachine Learning in Healthcare Diagnostics
Machine Learning in Healthcare Diagnostics
Larry Smarr
 
ICIC 2017: The Next Era: Deep Learning for Biomedical Research
ICIC 2017: The Next Era: Deep Learning for Biomedical ResearchICIC 2017: The Next Era: Deep Learning for Biomedical Research
ICIC 2017: The Next Era: Deep Learning for Biomedical Research
Dr. Haxel Consult
 
The Rise of Machine Intelligence
The Rise of Machine IntelligenceThe Rise of Machine Intelligence
The Rise of Machine Intelligence
Larry Smarr
 
IRJET- A Novel High Capacity Reversible Data Hiding in Encrypted Domain u...
IRJET-  	  A Novel High Capacity Reversible Data Hiding in Encrypted Domain u...IRJET-  	  A Novel High Capacity Reversible Data Hiding in Encrypted Domain u...
IRJET- A Novel High Capacity Reversible Data Hiding in Encrypted Domain u...
IRJET Journal
 
Virtualized high performance computing with mellanox fdr and ro ce
Virtualized high performance computing with mellanox fdr and ro ceVirtualized high performance computing with mellanox fdr and ro ce
Virtualized high performance computing with mellanox fdr and ro ce
inside-BigData.com
 
CI image processing mns
CI image processing mnsCI image processing mns
CI image processing mns
Meenakshi Sood
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
Edge AI and Vision Alliance
 

What's hot (20)

Intel 2020 Labs Day Keynote Slides
Intel 2020 Labs Day Keynote SlidesIntel 2020 Labs Day Keynote Slides
Intel 2020 Labs Day Keynote Slides
 
[Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network [Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network
 
Best Practices for On-Demand HPC in Enterprises
Best Practices for On-Demand HPC in EnterprisesBest Practices for On-Demand HPC in Enterprises
Best Practices for On-Demand HPC in Enterprises
 
Deep learning for medical imaging
Deep learning for medical imagingDeep learning for medical imaging
Deep learning for medical imaging
 
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Using Simulation for Decision Support: Lessons Learned from FireGrid
Using Simulation for Decision Support: Lessons Learned from FireGridUsing Simulation for Decision Support: Lessons Learned from FireGrid
Using Simulation for Decision Support: Lessons Learned from FireGrid
 
Making Sense of Information Through Planetary Scale Computing
Making Sense of Information Through Planetary Scale ComputingMaking Sense of Information Through Planetary Scale Computing
Making Sense of Information Through Planetary Scale Computing
 
II-SDV 2017: The Next Era: Deep Learning for Biomedical Research
II-SDV 2017: The Next Era: Deep Learning for Biomedical ResearchII-SDV 2017: The Next Era: Deep Learning for Biomedical Research
II-SDV 2017: The Next Era: Deep Learning for Biomedical Research
 
Open problems big_data_19_feb_2015_ver_0.1
Open problems big_data_19_feb_2015_ver_0.1Open problems big_data_19_feb_2015_ver_0.1
Open problems big_data_19_feb_2015_ver_0.1
 
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM ProcessingHow to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
How to Scale from Workstation through Cloud to HPC in Cryo-EM Processing
 
AdClickFraud_Bigdata-Apic-Ist-2019
AdClickFraud_Bigdata-Apic-Ist-2019AdClickFraud_Bigdata-Apic-Ist-2019
AdClickFraud_Bigdata-Apic-Ist-2019
 
Machine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsMachine Learning in Healthcare Diagnostics
Machine Learning in Healthcare Diagnostics
 
ICIC 2017: The Next Era: Deep Learning for Biomedical Research
ICIC 2017: The Next Era: Deep Learning for Biomedical ResearchICIC 2017: The Next Era: Deep Learning for Biomedical Research
ICIC 2017: The Next Era: Deep Learning for Biomedical Research
 
The Rise of Machine Intelligence
The Rise of Machine IntelligenceThe Rise of Machine Intelligence
The Rise of Machine Intelligence
 
IRJET- A Novel High Capacity Reversible Data Hiding in Encrypted Domain u...
IRJET-  	  A Novel High Capacity Reversible Data Hiding in Encrypted Domain u...IRJET-  	  A Novel High Capacity Reversible Data Hiding in Encrypted Domain u...
IRJET- A Novel High Capacity Reversible Data Hiding in Encrypted Domain u...
 
Virtualized high performance computing with mellanox fdr and ro ce
Virtualized high performance computing with mellanox fdr and ro ceVirtualized high performance computing with mellanox fdr and ro ce
Virtualized high performance computing with mellanox fdr and ro ce
 
CI image processing mns
CI image processing mnsCI image processing mns
CI image processing mns
 
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co..."New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
"New Dataflow Architecture for Machine Learning," a Presentation from Wave Co...
 

Similar to Edge-based Discovery of Training Data for Machine Learning

AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
Subrat Panda, PhD
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)
Ha Phuong
 
1st review android malware.pptx
1st review  android malware.pptx1st review  android malware.pptx
1st review android malware.pptx
Nambiraju
 
小數據如何實現電腦視覺,微軟AI研究首席剖析關鍵
小數據如何實現電腦視覺,微軟AI研究首席剖析關鍵小數據如何實現電腦視覺,微軟AI研究首席剖析關鍵
小數據如何實現電腦視覺,微軟AI研究首席剖析關鍵
CHENHuiMei
 
Deep Learning Based Real-Time DNS DDoS Detection System
Deep Learning Based Real-Time DNS DDoS Detection SystemDeep Learning Based Real-Time DNS DDoS Detection System
Deep Learning Based Real-Time DNS DDoS Detection System
Seungjoo Kim
 
ACTOR CRITIC APPROACH BASED ANOMALY DETECTION FOR EDGE COMPUTING ENVIRONMENTS
ACTOR CRITIC APPROACH BASED ANOMALY DETECTION FOR EDGE COMPUTING ENVIRONMENTSACTOR CRITIC APPROACH BASED ANOMALY DETECTION FOR EDGE COMPUTING ENVIRONMENTS
ACTOR CRITIC APPROACH BASED ANOMALY DETECTION FOR EDGE COMPUTING ENVIRONMENTS
IJCNCJournal
 
Actor Critic Approach based Anomaly Detection for Edge Computing Environments
Actor Critic Approach based Anomaly Detection for Edge Computing EnvironmentsActor Critic Approach based Anomaly Detection for Edge Computing Environments
Actor Critic Approach based Anomaly Detection for Edge Computing Environments
IJCNCJournal
 
RICE INSECTS CLASSIFICATION USIING TRANSFER LEARNING AND CNN
RICE INSECTS CLASSIFICATION USIING TRANSFER LEARNING AND CNNRICE INSECTS CLASSIFICATION USIING TRANSFER LEARNING AND CNN
RICE INSECTS CLASSIFICATION USIING TRANSFER LEARNING AND CNN
IRJET Journal
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
Allen Day, PhD
 
Deep learning health care
Deep learning health care  Deep learning health care
Deep learning health care
Meenakshi Sood
 
Next Century Project Overview
Next Century Project OverviewNext Century Project Overview
Next Century Project Overview
jennhunter
 
An intrusion detection system for packet and flow based networks using deep n...
An intrusion detection system for packet and flow based networks using deep n...An intrusion detection system for packet and flow based networks using deep n...
An intrusion detection system for packet and flow based networks using deep n...
IJECEIAES
 
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdfMachine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
Carlos Paredes
 
About an Immune System Understanding for Cloud-native Applications - Biology ...
About an Immune System Understanding for Cloud-native Applications - Biology ...About an Immune System Understanding for Cloud-native Applications - Biology ...
About an Immune System Understanding for Cloud-native Applications - Biology ...
Nane Kratzke
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
Colin Clark
 
How Can We Answer the Really BIG Questions?
How Can We Answer the Really BIG Questions?How Can We Answer the Really BIG Questions?
How Can We Answer the Really BIG Questions?
Amazon Web Services
 
Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage ...
Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage ...Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage ...
Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage ...
Pluribus One
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
Chicago Hadoop Users Group
 
SoleraNetworks
SoleraNetworksSoleraNetworks
SoleraNetworks
Joe Levy
 
Surveillance scene classification using machine learning
Surveillance scene classification using machine learningSurveillance scene classification using machine learning
Surveillance scene classification using machine learning
Utkarsh Contractor
 

Similar to Edge-based Discovery of Training Data for Machine Learning (20)

AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 
Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)Deep Learning And Business Models (VNITC 2015-09-13)
Deep Learning And Business Models (VNITC 2015-09-13)
 
1st review android malware.pptx
1st review  android malware.pptx1st review  android malware.pptx
1st review android malware.pptx
 
小數據如何實現電腦視覺,微軟AI研究首席剖析關鍵
小數據如何實現電腦視覺,微軟AI研究首席剖析關鍵小數據如何實現電腦視覺,微軟AI研究首席剖析關鍵
小數據如何實現電腦視覺,微軟AI研究首席剖析關鍵
 
Deep Learning Based Real-Time DNS DDoS Detection System
Deep Learning Based Real-Time DNS DDoS Detection SystemDeep Learning Based Real-Time DNS DDoS Detection System
Deep Learning Based Real-Time DNS DDoS Detection System
 
ACTOR CRITIC APPROACH BASED ANOMALY DETECTION FOR EDGE COMPUTING ENVIRONMENTS
ACTOR CRITIC APPROACH BASED ANOMALY DETECTION FOR EDGE COMPUTING ENVIRONMENTSACTOR CRITIC APPROACH BASED ANOMALY DETECTION FOR EDGE COMPUTING ENVIRONMENTS
ACTOR CRITIC APPROACH BASED ANOMALY DETECTION FOR EDGE COMPUTING ENVIRONMENTS
 
Actor Critic Approach based Anomaly Detection for Edge Computing Environments
Actor Critic Approach based Anomaly Detection for Edge Computing EnvironmentsActor Critic Approach based Anomaly Detection for Edge Computing Environments
Actor Critic Approach based Anomaly Detection for Edge Computing Environments
 
RICE INSECTS CLASSIFICATION USIING TRANSFER LEARNING AND CNN
RICE INSECTS CLASSIFICATION USIING TRANSFER LEARNING AND CNNRICE INSECTS CLASSIFICATION USIING TRANSFER LEARNING AND CNN
RICE INSECTS CLASSIFICATION USIING TRANSFER LEARNING AND CNN
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
 
Deep learning health care
Deep learning health care  Deep learning health care
Deep learning health care
 
Next Century Project Overview
Next Century Project OverviewNext Century Project Overview
Next Century Project Overview
 
An intrusion detection system for packet and flow based networks using deep n...
An intrusion detection system for packet and flow based networks using deep n...An intrusion detection system for packet and flow based networks using deep n...
An intrusion detection system for packet and flow based networks using deep n...
 
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdfMachine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
 
About an Immune System Understanding for Cloud-native Applications - Biology ...
About an Immune System Understanding for Cloud-native Applications - Biology ...About an Immune System Understanding for Cloud-native Applications - Biology ...
About an Immune System Understanding for Cloud-native Applications - Biology ...
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
 
How Can We Answer the Really BIG Questions?
How Can We Answer the Really BIG Questions?How Can We Answer the Really BIG Questions?
How Can We Answer the Really BIG Questions?
 
Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage ...
Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage ...Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage ...
Battista Biggio @ S+SSPR2014, Joensuu, Finland -- Poisoning Complete-Linkage ...
 
Chug dl presentation
Chug dl presentationChug dl presentation
Chug dl presentation
 
SoleraNetworks
SoleraNetworksSoleraNetworks
SoleraNetworks
 
Surveillance scene classification using machine learning
Surveillance scene classification using machine learningSurveillance scene classification using machine learning
Surveillance scene classification using machine learning
 

Recently uploaded

Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Communications Mining Series - Zero to Hero - Session 1
Edge-based Discovery of Training Data for Machine Learning

  • 1. Edge-based Discovery of Training Data for Machine Learning Ziqiang (Edmond) Feng, Shilpa George, Jan Harkes, Padmanabhan Pillai†, Roberta Klatzky, Mahadev Satyanarayanan Carnegie Mellon University and †Intel Labs The New Yorker magazine April 20, 2018, p. 41
  • 2. The Deep Learning Recipe Collect a large amount of data and label it Select a model and train a DNN Deploy the DNN for inference 2 TPOD @ CMU
  • 3. DNNs for Domain Experts Valuable in ecology, military intelligence, medical diagnosis, etc. • Low base rate (prevalence) in the data • Requires expertise to identify Masked palm civet (Paguma larvata). Transmitter of SARS during its 2003 outbreak in China. BUK-M1. Believed to have shot down MH17 and killed 298, 2014. 3 Nuclear atypia in cancer.
  • 4. Building a Training Set Is Hard  Crowds are not experts: crowd-sourcing (e.g., Amazon Mechanical Turk) is not applicable  Access restrictions on data: patient privacy, business policy, national security, etc. In the worst case, a single domain expert has to generate an entire training set of 10³ to 10⁴ examples. Masked palm civet Red panda Raccoon
  • 5. Our Contribution: Eureka  A system for efficient discovery of training examples from data sources dispersed over the Internet (focus on images in this paper)  Goal: to effectively utilize an expert’s time and attention  Key concepts:  Early discard  Iterative discovery workflow  Edge computing 5 (positive)
  • 6. Eureka’s Architecture Expert with domain-specific GUI cloudlet Archival Data Source LAN cloudlet LAN cloudlet Live Video Internet Archival Data Source LAN 6 Executes early-discard code to drop clearly irrelevant data Only a tiny fraction of data along with meta-data is transmitted and shown to user, consuming little Internet bandwidth. High-bandwidth, low-latency access
  • 7. Example GUI: Finding Deer 7 Early-discard filters
  • 8. Iterative Discovery Workflow. A spectrum of filters: explicit features with manual weights (RGB histogram, SIFT, perceptual hashing) → explicit features with learned weights (HOG + SVM) → shallow transfer learning (MobileNet + SVM) → deep transfer learning (Faster R-CNN finetuning) → deep learning. X-axis: Number of Examples (log scale, 10⁰–10⁴); Y-axis: Accuracy (not to scale).
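The staircase above can be read as a policy: pick the cheapest filter family that the number of labeled examples collected so far can support, and climb a step whenever you have enough new positives. A minimal sketch of that policy (the thresholds are illustrative assumptions, not taken from the paper):

```python
# Hypothetical sketch of the iterative-workflow staircase: choose an
# early-discard filter family from the number of positive examples
# collected so far. Threshold values are illustrative only.
def pick_filter_family(num_examples: int) -> str:
    if num_examples < 10:
        return "explicit features, manual weights (RGB/SIFT/pHash)"
    elif num_examples < 100:
        return "explicit features, learned weights (HOG + SVM)"
    elif num_examples < 1000:
        return "shallow transfer learning (MobileNet + SVM)"
    elif num_examples < 10000:
        return "deep transfer learning (Faster R-CNN finetuning)"
    else:
        return "deep learning (train from scratch)"

print(pick_filter_family(3))    # start with hand-tuned color/texture filters
print(pick_filter_family(250))  # enough data for shallow transfer learning
```

Each iteration both sharpens the filters and grows the training set, which is what lets the user climb to the next step.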
  • 9. Finding Deer (after a few iterations) 9
  • 10. System Design and Implementation  Software generality: allow use of CV code written in different languages, libraries and frameworks (e.g., Python, Matlab, C++, TensorFlow, PyTorch, Scikit-learn)  Empower experts with newest CV innovations quickly  Encapsulate filters in Docker containers  Runtime efficiency: be able to rapidly process and discard large volumes of data  Exploit specialized hardware on cloudlets (e.g., GPU)  Cache filter results to exploit temporal locality 10
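To illustrate the filter-result caching idea above, here is a minimal sketch (hypothetical class and key scheme, not Eureka's actual implementation): results are keyed by a digest of the filter's code combined with the item id, so re-running an unchanged filter over the same data in a later iteration is a cache hit, while an edited filter misses.

```python
import hashlib

class FilterCache:
    """Illustrative filter-result cache. The key is a digest of the
    filter's code plus the item id, so unchanged filters hit the cache
    across iterations and modified filters do not."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(filter_code: bytes, item_id: str) -> str:
        h = hashlib.sha256()
        h.update(filter_code)
        h.update(item_id.encode())
        return h.hexdigest()

    def get(self, filter_code: bytes, item_id: str):
        return self._store.get(self._key(filter_code, item_id))

    def put(self, filter_code: bytes, item_id: str, passed: bool):
        self._store[self._key(filter_code, item_id)] = passed

cache = FilterCache()
cache.put(b"rgb-histogram-v1", "img_0001.jpg", False)
print(cache.get(b"rgb-histogram-v1", "img_0001.jpg"))  # cached drop decision
print(cache.get(b"rgb-histogram-v2", "img_0001.jpg"))  # filter changed: miss
```

This exploits the temporal locality of the workflow: between iterations, most filters in the chain are unchanged and only the newest one needs to run.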
  • 11. Matching System to User: Too Fast  The system should deliver images at a rate the user can inspect them. Wasting computation and precious Internet bandwidth  Suggestions: 1. Restrict to fewer cloudlets 2. Bias filters towards precision rather than recall
  • 12. Matching System to User (cont’d): Too Slow  The system should deliver images at a rate the user can inspect them. Wasting expert time  Obvious solution: Scale out to more cloudlets (Edge computing is your friend)  Risk: “Junk” (false positives) causes user annoyance and dissatisfaction  Rule of thumb: Focus on reducing false positive rate before scaling out
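The rate-matching rule on these two slides reduces to simple arithmetic: the delivery rate is the number of cloudlets times per-cloudlet throughput times the fraction of images the filters pass, and it should roughly match the user's inspection rate. The sketch below is an illustrative model with assumed numbers, not the paper's analytic model:

```python
def delivery_rate(num_cloudlets: int, throughput_per_cloudlet: float,
                  pass_fraction: float) -> float:
    """Images per second shown to the user (illustrative model)."""
    return num_cloudlets * throughput_per_cloudlet * pass_fraction

user_rate = 0.5  # images/sec the expert can inspect (assumed)
rate = delivery_rate(num_cloudlets=8, throughput_per_cloudlet=100,
                     pass_fraction=0.01)
if rate > user_rate:
    print("too fast: use fewer cloudlets or raise filter precision")
elif rate < user_rate:
    print("too slow: reduce false positives first, then scale out")
```

With these assumed numbers the system delivers 8 images/sec against an inspection rate of 0.5/sec, so the "too fast" branch applies.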
  • 13. Evaluation: Setup
    Dataset: YFCC100M: 99.2 million Flickr photos. Real-life distribution of objects. Evenly partitioned over the cloudlets.
    Edge: 8 cloudlets with Nvidia GPUs, access data from local SSDs.
    Client: Connected to the cloudlets via the Internet.
  • 14. Evaluation: Case Studies
                                       Deer       Taj Mahal  Fire hydrant
    Estimated base rate                0.07%      0.02%      0.005%
    Collected positives in evaluation  111        105        74
    Images viewed by user              7,447      4,791      15,379
    Images discarded by Eureka         2,104,076  2,542,889  2,734,070
  • 15. Eureka vs. Brute-force. Bar chart (log-scale axis, 1,000–1,000,000): number of images the user viewed to collect ~100 true positives, for Deer, Taj Mahal and Fire hydrant, comparing Brute-force, Single-iteration Eureka and Eureka. Brute-force: User views every image. Single-iteration Eureka: Early-discard without iterative improvement. Please refer to our paper for detailed results of each case study.
  • 16. Iteratively Improving Productivity. The case of deer. Productivity (new true positives / minute) across iterations 1–5: 0.4, 0.36, 1.49, 4.24, 4.77 (~10X).
  • 17. Compute Must Co-locate with Data. Machine processing throughput (#/sec) of an RGB histogram filter while throttling bandwidth between cloudlet and data source to 10 Mbps, 25 Mbps, 100 Mbps and 1 Gbps. US average connectivity: 18.7 Mbps (2017)
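The bandwidth effect above follows from a back-of-the-envelope model: effective throughput is the minimum of the filter's compute rate and the rate at which images can cross the link. A sketch with assumed numbers (roughly 0.1 MB per image and an 800 images/sec compute-bound rate; neither figure is from the paper):

```python
def effective_throughput(compute_rate: float, bandwidth_mbps: float,
                         image_size_mb: float = 0.1) -> float:
    """Images/sec when data must first cross a link of bandwidth_mbps.
    Illustrative model: the slower of compute and data fetch wins."""
    fetch_rate = bandwidth_mbps / (image_size_mb * 8)  # Mbit/s -> images/s
    return min(compute_rate, fetch_rate)

for bw in (10, 25, 100, 1000):
    print(bw, "Mbps ->", effective_throughput(compute_rate=800,
                                              bandwidth_mbps=bw))
```

Under these assumptions the filter is compute-bound only on a ~1 Gbps LAN; at WAN bandwidths the link dominates and throughput collapses by an order of magnitude or more, which is the slide's point.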
  • 18. More in the Paper • Detailed system design and implementation • An analytic model relating user wait time to base rate, filter accuracy, cloudlet processing speed, etc. • Detailed results of individual case studies 18
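The paper's analytic model is not reproduced here, but a standard precision calculation hints at its flavor: the number of filter-passed images the user must view per true positive depends on the base rate, the filter's recall (true positive rate), and its false positive rate. A sketch with assumed filter numbers:

```python
def images_viewed_per_true_positive(base_rate: float, tpr: float,
                                    fpr: float) -> float:
    """Of the images a filter passes, the fraction that are true
    positives is base_rate*tpr / (base_rate*tpr + (1-base_rate)*fpr);
    its inverse is the expected images viewed per find."""
    precision = (base_rate * tpr) / (base_rate * tpr +
                                     (1 - base_rate) * fpr)
    return 1.0 / precision

# Deer-like scenario: 0.07% base rate, assumed 90% recall and 1% FPR
print(round(images_viewed_per_true_positive(0.0007, 0.90, 0.01)))  # -> 17
```

This shows why, at very low base rates, even a small false positive rate dominates the user's workload, and why each iteration's tighter filter pays off so sharply.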
  • 19. Conclusion Eureka combines early discard, iterative discovery workflow and edge computing to help domain experts efficiently discover training examples of rare phenomena from data sources on the edge. Eureka reduces human labeling effort by two orders of magnitude compared to a brute force approach. 19
  • 20. Thank you! I will also present on tomorrow’s PhD Forum to discuss related ideas. 20

Editor's Notes

  1. Cartoon: New Yorker magazine April 20, 2018, p. 41
  2. Deep learning has become the gold standard in many areas, especially computer vision, due to its superb accuracy. Here is the high-level recipe for applying deep learning to a problem. You collect a large amount of data and label it. Then you select a model and train a DNN. Finally you deploy the DNN for inference. Nowadays, there are many software libraries, frameworks, cloud services and web-based tools that let you do the last two steps with great convenience. Virtually all the painstaking effort is in the very first step, and it can sometimes be the showstopper of applying deep learning to your problem.
  3. In this work, we focus on DNNs used by domain experts. Here are some examples. This animal is the transmitter of the SARS disease in China, 2003. You can imagine how valuable it would be if we had an accurate DNN detector and used it in public health efforts. Likewise, this is a weapon that shot down an airplane and this is a pathological image of nuclear atypia in cancer. In all these cases, the target has a low base rate – they are pretty rare in the data you are examining. And they all require expertise to correctly identify. https://en.wikipedia.org/wiki/Masked_palm_civet#Connection_with_SARS
  4. Building a training set of this kind of target is hard. First, obviously, crowds are not experts, so crowd-sourcing methods like Amazon Mechanical Turk are not applicable to these domains. For example, only an expert can reliably and accurately distinguish between these animals. Second, there may exist access restrictions on data, such as patient privacy, business policy and national security. In the worst case, a single domain expert has to generate an entire training set of thousands to tens of thousands of examples.
  5. In this paper, we describe a system called Eureka, for efficient discovery of training examples from data sources dispersed over the Internet. The goal of Eureka is to optimally utilize an expert’s time and attention. It combines three key concepts to achieve its goal: early discard, iterative discovery workflow and edge computing, which I will describe next.
  6. Here is Eureka’s architecture. An expert user runs a GUI on her own computer. The GUI connects to a number of cloudlets across the Internet. These cloudlets are LAN-connected to some associated data sources. These data sources may be archival or live, depending on the specific use case. As the shape of these arrows indicates, connections between cloudlets and data sources are high-bandwidth and low-latency, while those over the Internet are the opposite. This high-bandwidth access is used to execute early-discard code on the cloudlets to drop clearly irrelevant data. Only a tiny fraction of the data, along with meta-data, is transmitted and shown to the user, consuming little Internet bandwidth.
  7. Here is an example of using the GUI to find images of deer in an unlabeled dataset. You can specify a list of early-discard filters, and only images passing all of the filters are transmitted and displayed. You are seeing many false positives because the filters used in this case are very weak color and texture filters. (more time: 1. extend to general logical expression; 2. 500 more – efficient use of user attention)
  8. To improve the efficacy of early discard, we introduce the iterative discovery workflow. Here you see a spectrum of computer vision algorithms and machine learning models, from simple on the left, such as RGB and SIFT, to sophisticated on the right, such as deep learning. The X-axis is the number of example images you have, and the Y-axis is the accuracy. While these numbers are not meant to be precise, the idea is that different models require different amounts of data to work properly, and they give you different levels of accuracy. When using Eureka, instead of creating a set of filters and searching for your target in one go, you iteratively change and improve your filters as you collect more examples. In the beginning, you have very few examples, so you should only use explicit features like RGB or SIFT. With these weak filters, you may be able to find a few more positives, which allows you to escalate to a slightly more advanced filter, like an SVM. The SVM is considerably more accurate, making it easier to find some more positives in a reasonable amount of time. So you iterate and climb up the stairs when you have sufficient data. In this process, you are both using more and more sophisticated filters and growing the size of the training set you collect.
  9. Here again is the case of finding deer, but after a few iterations of using Eureka, with an SVM now in use. You can see the filter has become much more accurate.
  10. When designing and implementing Eureka, we had two major concerns. First is software generality. We want to allow the use of computer vision code written in a diversity of languages, libraries and frameworks, so that we can empower experts with the newest computer vision innovations quickly. To do so, we encapsulate filters in Docker containers. Second is runtime efficiency. Eureka needs to be able to rapidly discard large volumes of data. To do so, we exploit specialized hardware such as GPUs on cloudlets when available. We also cache filter results to exploit temporal locality in typical Eureka workloads.
  11. Another interesting problem is matching the Eureka system to the user. We propose that, ideally, the system should deliver images at the rate the user can inspect them. If the system is delivering too fast, you are pumping lots of results into the network which the user may never see, so it’s a waste of computation and precious Internet bandwidth. Our suggestion in this case is to restrict your search to fewer cloudlets, or to bias your filters towards precision rather than recall.
  12. On the other hand, if the system is delivering too slowly, you are basically forcing the user to wait, and wasting an expert’s time is a really bad thing to do. An obvious solution is to scale out to more cloudlets. But there is a risk here: showing more “junk” to the user will cause annoyance and dissatisfaction. So you really need to strike a balance between avoiding user wait time and avoiding too many false positives. Our rule of thumb in this scenario is that one should focus on reducing the false positive rate before scaling out to many cloudlets.
  13. To evaluate Eureka, we used 99 million Flickr images from the YFCC100M dataset. On the edge we have 8 cloudlets with local access to data, and the client GUI connects to the cloudlets over the Internet.
  14. We conducted three case studies using these three chosen targets – deer, Taj Mahal and fire hydrant. As you can see from the base rate, these are fairly rare objects in Flickr photos. We used Eureka to collect about 100 positive examples of each. Here you can see the number of images viewed by the user, and images discarded by Eureka in the whole process. You can see how effective Eureka is in reducing the amount of data the user needs to look at and label.
  15. We compare Eureka with a brute-force method, where the user goes through the images one by one and label them. That’s basically how many datasets are curated today. For reference, we also compare with what we called “single-iteration Eureka”, which means using early-discard, but without iterative improvement. Y-axis shows how many images the user viewed in order to collect the same number of positives. As you can see, compared with brute-force, single-iteration Eureka gives you up to an order of magnitude of improvement, showing the efficacy of early-discard. On top of that, full Eureka gives another order of magnitude of improvement, showing the benefit of the iterative workflow.
  16. We show how Eureka is iteratively improving user’s productivity, in the case of deer. We measure productivity in terms of new true positives found in each Eureka iteration. Over five iterations of using Eureka, the productivity increases from 0.4 to 4.7, more than 10X improvement.
  17. Finally, we show the importance of edge computing. Specifically, we show that when the data is at the edge, the compute must also be at the edge for Eureka to be efficient. To do so, we throttled the bandwidth between the cloudlet and the data source, and measured the machine processing throughput of an RGB histogram filter. The result shows it really needs LAN connectivity at 1 Gbps to deliver sufficiently high throughput. If the data is shipped over the wide area network, it slows down by about 10X.
  18. In conclusion, … (….) Our evaluation shows ….
  19. Why is it hard? Most importantly, crowds are not experts, so crowd-sourcing approaches like Amazon Mechanical Turk are not applicable in these domains. Only an expert can reliably classify these animals. Besides, these interesting phenomena are usually rare, making it difficult to find positive examples in unlabeled data. Finally, there may exist access restrictions on data, such as patient privacy, business policy and national security. In the worst case, a single expert has to generate an entire training set of thousands to tens of thousands of examples.
  20. Here is the execution model. On the cloudlet, a component called the itemizer reads data in its raw format and emits individual items. Items are the independent units of early discard. Items are then fed into the item processor, where a chain of filters evaluates the items and tries to drop them. We encapsulate filters in Docker containers to achieve the software generality I just mentioned, and we cache filter results to improve efficiency. Finally, the filters also attach key-value attributes to each item. These attributes both facilitate communication between filters at run time and support post-analysis after items are sent back to the user.
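The itemizer/filter-chain execution model described in this note can be sketched as follows (hypothetical function names and toy filters standing in for real CV code; not Eureka's actual implementation):

```python
# Sketch of the execution model: an itemizer yields items, and a chain
# of filters tries to drop each one; filters may also attach key-value
# attributes to surviving items.
def itemizer(raw_records):
    for i, data in enumerate(raw_records):
        yield {"id": f"item_{i}", "data": data, "attrs": {}}

def run_chain(items, filters):
    for item in items:
        dropped = False
        for f in filters:
            passed, attrs = f(item)
            if not passed:
                dropped = True
                break           # early discard: stop at first rejecting filter
            item["attrs"].update(attrs)
        if not dropped:
            yield item          # only survivors are transmitted to the user

# Toy filters standing in for RGB-histogram / SVM scoring
brightness_ok = lambda it: (it["data"] > 10, {"brightness": it["data"]})
score_ok      = lambda it: (it["data"] % 2 == 0, {"score": it["data"] / 100})

survivors = list(run_chain(itemizer([5, 12, 13, 40]),
                           [brightness_ok, score_ok]))
print([s["id"] for s in survivors])  # items 12 and 40 pass both filters
```

Ordering cheap filters before expensive ones in the chain means most items are rejected before the costly models ever run, which is what makes early discard pay off.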
  21. Finally, we study the importance of edge computing for Eureka. Specifically, how necessary is high-bandwidth access to data. Here we throttle the bandwidth between the cloudlet and the data source, and measure the machine processing throughput of three filters, including cheap ones and expensive ones. As you can see, when we decrease the bandwidth, the throughput drops significantly. Under 25 Mbps, there is basically no difference between cheap filters and expensive filters, because data access time becomes the bottleneck. So we see high-bandwidth access is crucial to the efficacy of Eureka.