@lifestylesydney
Machine Learning
— for startups without PHDs
Lex Toumbourou // CTO
lex@scrunch.com
https://twitter.com/lexandstuff
https://instagram.com/lextoumbourou
https://github.com/lextoumbourou
http://lextoumbourou.github.io
© Scrunch
Who am I
— and why am I giving this talk?
•	 CTO of Scrunch
•	 Not a data scientist.
•	 Building Machine Learning products for the past 2-3 years
•	 How can people without formal training build Machine Learning
products?
What is this talk about?
1.	 What is influencer marketing
and how does Scrunch help?
2.	 How we’re using ML.
3.	 ML for startups.
4.	The future of Machine Learning
research at Scrunch.
What is influencer marketing and how
does Scrunch help?
What is an influencer?
What is influencer marketing?
•	 Brands pay for access to engaged audience.
•	 Influencers can monetize their audience.
•	 Consumers get product recommendations from people
they trust (ideally).
Influencer marketing
— at its best
Influencer marketing
— at its worst
How does Scrunch help?
Influencer
recommendations
— (search and filtering)
Brands can ask: “find me
influencers located in Queensland
who mostly talk about street wear
and whose audience is female,
aged 25-34”.
Most of our Machine Learning
here!
End-to-end campaign
workflow
Best practices for:
•	 reaching out to influencers.
•	 helping influencers pitch on
campaigns.
•	 shipping products.
•	 managing payments.
Campaign analytics
•	 ROI
•	 Engagement
•	 Gender
•	 Age
•	 Location
•	 Ethnicity
•	 Audience Interests
•	 Audience Brands
•	 Audience hashtags
2. How we use Machine
Learning at Scrunch
Image classification
Categorise ~20M blogs and social media entities using photo content.
Gender, age, “profile type” inference, lots more in dev.
Fig 1. Image from DEX: Deep EXpectation of apparent age from a single image by Rasmus Rothe
and Radu Timofte and Luc Van Gool, 2015
Tools
•	 Convolutional Neural Networks
•	 Transfer learning (lots more on
this later).
•	 Face detection models and
algorithms.
CNN diagram by Pawit Kochakarn.
Natural Language
Processing
Topic analysis, location inference,
language inference, gender
inference, profile type inference,
fake follower detection, etc.
Tools
•	 Non-Deep Learning:
Logistic Regression and SVMs.
•	 Recurrent Neural Networks
(Bidirectional LSTMs).
•	 Pretrained word embeddings.
•	 CNNs.
Recommender Systems
and Search
Mostly content-based filtering at this stage. Big plans here.
Image source: The Marketing Technologist
Examples of ML in Scrunch
— P1
•	 Influencer tagging
•	 Audience analysis
Examples of ML in Scrunch
— P2
•	 Topic analysis
•	 Profile type inference.
•	 Location inference.
•	 Lots more in dev...
3. Machine Learning for
startups on a budget
— (our notes)
1.	 Upskilling vs hiring.
2.	 Starting with no data.
3.	 Bootstrapping datasets.
4.	Starting with some data.
5.	 Compute
6.	 Productionisation.
7.	 Research.
Glossary
1.	 Model = a set of weights that can be used to predict some output
given some input. Usually trained by an “optimization” algorithm (not
covered).
2.	 Training set = dataset used to train a model.
3.	 Validation set = dataset used to validate “fit” of model while training.
4.	Test set = aka holdout set. Used to evaluate performance of model
after training.
5.	 Deep Learning = a Machine Learning technique that uses stacks of
“hidden layers” to model complex problems. Very effective for certain
problems.
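The three datasets in the glossary can be carved out of one labelled collection. A minimal sketch, assuming a simple random split; the function name `split_dataset` and the 80/10/10 proportions are illustrative choices, not something stated in the talk:

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle once, then carve out test, validation and training sets."""
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                  # held out until training is finished
    val = shuffled[n_test:n_test + n_val]     # checked while training
    train = shuffled[n_test + n_val:]         # used to fit the model
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

A random split is only a starting point: the later advice in this talk is that validation and test sets should look like production data.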
3.1 Upskilling vs hiring
Upskilling
•	 If you can learn to code, you can learn ML.
•	 “people with 1-year of coding experience can become world-class
deep learning practitioners” - Jeremy Howard
•	 Can offer business value with minimal skills (for some problems).
•	 Tip: Try to put skills into practice as soon as possible - theory will be
easier to learn with context.
Hiring experts
•	 Machine Learning and AI are electives for Computer Science degrees
at UQ, and probably at QUT too. Lots of graduates are hungry to
solve real problems.
•	 Outsourcing?
•	 Tip: if you outsource your research, you must be across the project
evaluation: test set and evaluation metrics. More on this coming up.
3.2 Starting with no data
Validate idea first
— if you can
•	 Validate idea before collecting data and setting up ML
pipeline if you can.
•	 Focus on problem not solution: “If you remove AI from the
company but it still has a valuable product, you’re on the
right track” - Siraj Raval
•	 Tip: read The Lean Startup.
Heuristics / handwritten
rules
•	 Spam filtering example: just block all hosts on DNS-based email
blacklists.
•	 Topic analysis: table of hashtags to topics. Good enough.
•	 However: “Choose machine learning over a complex heuristic.” -
Martin Zinkevich in Rules of Machine Learning: Best Practices for ML
Engineering
•	 Tip: test the accuracy of handwritten rules by hand-checking
~100 examples from a small test set, then compare against the ML
approach.
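The hashtag-to-topic table and the accuracy check can be sketched in a few lines. This is an illustration only: the table entries, the majority-vote rule in `predict_topic` and the four-example test set are all made up here, and a real check would use around 100 hand-labelled examples.

```python
# Hypothetical hashtag -> topic table; the entries are illustrative only.
HASHTAG_TOPICS = {
    "#ootd": "fashion",
    "#streetwear": "fashion",
    "#vegan": "food",
    "#brunch": "food",
    "#gym": "fitness",
}


def predict_topic(hashtags):
    """Return the most frequently matched topic, or None if nothing matches."""
    counts = {}
    for tag in hashtags:
        topic = HASHTAG_TOPICS.get(tag.lower())
        if topic is not None:
            counts[topic] = counts.get(topic, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


def accuracy(predict, labelled_examples):
    """Fraction of hand-labelled examples that the predictor gets right."""
    correct = sum(1 for features, label in labelled_examples
                  if predict(features) == label)
    return correct / len(labelled_examples)


# Tiny made-up test set of (hashtags, hand-assigned topic) pairs.
TEST_SET = [
    (["#OOTD", "#streetwear"], "fashion"),
    (["#vegan", "#brunch", "#gym"], "food"),
    (["#gym"], "fitness"),
    (["#sunset"], "travel"),  # not covered by the table, so counted as a miss
]

print(accuracy(predict_topic, TEST_SET))
```

Because `accuracy` takes any predictor, the same test set can later score an ML model and make the comparison direct.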
Third-party APIs
•	 Google Cloud Vision API,
IBM Watson, Kairos, AWS
Rekognition.
•	 Can be expensive for big data
workloads.
•	 Tip: test accuracy with your
production-like test set (again)!
Using pretrained models
•	 So many to choose from! See ModelDepot.io for a list.
•	 Don’t expect great production results! Your input data is probably
different to what the model was trained on.
•	 Transfer learning helps! - coming soon
•	 Tip: build a test set to evaluate model performance on your
production data (again).
3.3 Bootstrapping datasets
Publicly available datasets
•	 Lots of great publicly
available datasets
(see GitHub project).
•	 We’ve had incredible mileage
out of the Wikidata set.
Classifying data by hand
•	 Annotation tools.
•	 Platforms like Crowdflower or Mechanical Turk.
•	 Ask your users? Incentives?
•	 Tip: it’s not as hard as you think - just get started.
Purchasing datasets
•	 Data Marketplaces like Data
Circle.
•	 Partner with organisations that
have data they can share?
User data
•	 “Virtuous circle of AI”:
Product > Users > Data > ML loop.
•	 Tip: Andrew Ng’s talk “AI Is The New Electricity”.
General advice
•	 Just start. You’ll figure it out over time.
•	 Transfer learning can allow you to do a lot with a little data…
(up next)
3.4 Starting with a little
data
Transfer learning!
•	 Deep learning with small data.
•	 Take a model trained for a
similar problem and repurpose
it for your dataset.
Transfer learning in
image classification
•	 Take a deep model trained on a big dataset like ImageNet and retrain
only the last layers on your dataset.
•	 The more data you have, the more layers you can retrain.
•	 Slowly “unfreeze” earlier layers until validation accuracy stops
improving.
•	 Tip: Precomputing activations will speed up your training time
significantly!
Source: André Karpištšenko
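A toy sketch of the precomputing tip, with no real network or ImageNet weights involved. The hypothetical `frozen_base` function stands in for the frozen pretrained layers, and the "last layer" is a small logistic regression trained by plain gradient descent. The point it illustrates is that frozen layers give the same output every epoch, so their activations can be computed once and reused.

```python
import math


def frozen_base(x):
    """Stand-in for the frozen early layers of a pretrained network (assumed)."""
    return (x[0] + x[1], x[0] - x[1])


def train_head(features, labels, epochs=200, lr=0.5):
    """Train only the final layer (logistic regression) on cached activations."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = w[0] * f[0] + w[1] * f[1] + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log loss with respect to z
            w[0] -= lr * grad * f[0]
            w[1] -= lr * grad * f[1]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Run the frozen base, then the trained head, on a new input."""
    f = frozen_base(x)
    return 1 if w[0] * f[0] + w[1] * f[1] + b > 0 else 0


# Made-up two-feature "images" and binary labels.
inputs = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.7), (0.1, 0.9)]
labels = [1, 1, 0, 0]

# Precompute activations once; every epoch reuses them instead of
# re-running the (normally expensive) frozen layers.
cached = [frozen_base(x) for x in inputs]
w, b = train_head(cached, labels)
print([predict(w, b, x) for x in inputs])
```

With a real pretrained model the cached activations would be the outputs of the last frozen layer. Once earlier layers are unfrozen, the cache for those layers is no longer valid.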
Transfer learning in NLP
•	 Pretrained word embeddings: GloVe, word2vec, etc.
•	 Learns a vector for each word that captures relationships between
words.
•	 “Man is to woman as king is to _”.
Source: Jeffrey Pennington
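The analogy on this slide can be reproduced with vector arithmetic. The three-dimensional vectors below are invented so that the arithmetic is easy to follow; real pretrained embeddings have far more dimensions and would be loaded from a file rather than typed in.

```python
import math

# Made-up vectors for illustration; not taken from GloVe or word2vec.
VECTORS = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 0.9],
    "queen": [1.0, 1.0, 0.9],
    "apple": [0.0, 0.2, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # Exclude the query words themselves, then pick the closest remaining word.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("man", "woman", "king"))
```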
More
•	 Data augmentation.
•	 Regularization (Dropout / Weight Decay).
•	 Tip: for image classification problems read “Building powerful image
classification models using very little data” by Francois Chollet
Advice:
— focus on great validation and test set
•	 “Why is my model performing
well in dev but not in prod”?
Likely answer: the test and
validation sets do not match the
production “distribution”.
•	 Guideline from Deep Learning
Specialization - Structuring
Machine Learning Projects.
3.5 Compute
Where can a startup obtain the
compute required to train Deep
Learning models?
AWS SageMaker
Built in notebooks. Preconfigured environments.
Uses ml.p2.xlarge (Nvidia K80).
Cost per hour: $1.26
Cost per month (assuming constant training): $937.44
Floydhub
•	 Super easy to set up. Handles hosting the model after training.
•	 Total per hour: $1.20
•	 Total per month to consistently train: $892.80
Amazon EC2
(Deep Learning AMIs)
•	 Standard EC2 instances. Cheaper options, especially if using
Spot Pricing.
•	 Total per hour (p2.xlarge (Nvidia K80)): $0.90
•	 Total per month to consistently train: $669.60
Paperspace
•	 Another great choice. Recommended in Fast.AI (v2)
•	 Total per hour: $0.40
•	 Total per month to consistently train: $297.60
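The monthly figures quoted for the hosted options are the hourly price multiplied by the hours in a 31-day month (24 × 31 = 744). A short sketch that reproduces them from the hourly prices on the slides; the dictionary and function names are illustrative, and the prices are those quoted in the talk, not current ones.

```python
HOURS_PER_MONTH = 24 * 31  # the slide figures correspond to a 31-day month

# Hourly prices in USD as quoted on the slides.
HOURLY_USD = {
    "AWS SageMaker": 1.26,
    "Floydhub": 1.20,
    "Amazon EC2": 0.90,
    "Paperspace": 0.40,
}


def monthly_cost(hourly_usd, hours=HOURS_PER_MONTH):
    """Cost of training non-stop for a month at the given hourly price."""
    return round(hourly_usd * hours, 2)


# List the options from cheapest to most expensive.
for name, rate in sorted(HOURLY_USD.items(), key=lambda item: item[1]):
    print(f"{name}: ${monthly_cost(rate):.2f} per month")
```

Passing a smaller `hours` value gives a more realistic estimate for a team that trains only part of the time.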
Building your own
machine
•	 Often cheapest in the long
term. Allows you to train
continuously.
•	 See article: “The $1700 great
Deep Learning box” by Slav
Ivanov.
•	 Cons: more difficult to access
remotely (reverse SSH
tunneling)
Advice:
— make sure you’re making use of GPU
Tensorflow example:
pip install tensorflow-gpu
3.6 Productionisation
Basic approach
•	 Common approach: put your
model behind a simple Flask
API and host it somewhere.
•	 We use AWS Lambda, see our
blog post.
•	 Tip: build a catalog of models
that can be used across the
business.
Data pipelines
•	 System for ingesting and
processing data.
•	 We use a queue-based pipeline,
which means we can cope with burst
workloads and tend not to lose
data.
•	 Tip: “Keep the first
model simple and get the
infrastructure right.” - Martin
Zinkevich.
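The idea behind a queue-based pipeline can be shown with an in-process stand-in. This sketch uses Python's standard `queue` and `threading` modules; it is not the pipeline described in the talk, and a production system would normally use a durable queue service so that work survives restarts. The name `run_pipeline` is hypothetical.

```python
import queue
import threading


def run_pipeline(items, process, num_workers=2):
    """Push items onto a queue and let worker threads drain it."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this worker
                tasks.task_done()
                break
            output = process(item)
            with lock:
                results.append(output)
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    # A burst of incoming work only makes the queue longer; the workers
    # keep processing at their own pace.
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)
    tasks.join()
    for thread in threads:
        thread.join()
    return results


print(sorted(run_pipeline(range(5), lambda x: x * 2)))
```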
Live vs cached predictions
Two approaches to serving predictions:
•	 provide live predictions to customers
•	 cache predictions and let users query the cache (our approach)
	 We cache predictions and store them in Elasticsearch.
•	 Tip: store the model ID and a hash of the input data (e.g. the md5sum
of an image) with each prediction to avoid reprocessing data unnecessarily.
"gender": {
    "value": "female",
    "prob": 0.8561,
    "img_id": "3b85ec9ab2984b91070128be6aae25eb",
    "model_id": "mobilenet_20171008_v1"
}
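A minimal sketch of the caching tip, assuming an in-memory dictionary in place of Elasticsearch. The function names and the key format are hypothetical. The cache key combines the model ID with the MD5 hash of the image bytes, so an image is only reprocessed when either the image or the model changes.

```python
import hashlib

_cache = {}  # in-memory stand-in for the prediction store (assumption)


def cache_key(image_bytes, model_id):
    """Build a key from the model ID and the MD5 hash of the input."""
    img_id = hashlib.md5(image_bytes).hexdigest()
    return f"{model_id}:{img_id}"


def cached_predict(image_bytes, model_id, predict_fn):
    """Return the stored prediction, calling the model only on a cache miss."""
    key = cache_key(image_bytes, model_id)
    if key not in _cache:
        _cache[key] = predict_fn(image_bytes)
    return _cache[key]


def fake_model(image_bytes):
    """Placeholder for a real model; returns a fixed prediction."""
    return {"value": "female", "prob": 0.8561}


print(cached_predict(b"fake image bytes", "mobilenet_20171008_v1", fake_model))
```

MD5 is used here only to identify identical inputs, matching the `img_id` field in the record above; it is not a security measure.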
Monitoring
•	 Performance / latency
•	 Accuracy - often human component
•	 Ongoing retraining plan
3.7 Research
Reading academic papers
•	 Don’t need to understand all the theory
•	 Usually an open-source implementation for reference.
•	 Tip: get started by reading papers for problems you understand
•	 More info: Fast.ai Part 2, Lesson 1
Managing research
— and R&D Tax Incentives
•	 Store research in version control with results and links to artifacts.
•	 If claiming R&D Tax incentives, audits can be very time-consuming if
you aren’t meticulous with records (trust me).
•	 Questions to think about for R&D: What is your hypothesis? What new
knowledge are you generating? Why is the research novel? What are
your unknowns? What are your main observations and conclusions?
Recap
1.	 You can upskill internally.
2.	 Try to validate the product without ML.
3.	 Lots of ways to bootstrap datasets - your strategy will evolve.
4.	Transfer learning is awesome!
5.	 Lots of compute options: hard to beat building your own machine.
6.	 Consider production strategy: pipeline, monitoring, retraining.
7.	 Store research in version control and be meticulous with it.
Example at Scrunch:
— audience analysis
•	 Validate idea with customers:
will anyone pay extra for it?
•	 First version no ML: start by
sending our data to third-party
API.
•	 Use collected data to bootstrap
a dataset, augmented
with publicly available
datasets and some hand
cleaning.
•	 Find academic papers solving
similar problems and use them to
train models. Their accuracy
gives a good indication
of what to aim for. Can also
verify against third-party API.
•	 Deploy and ship into
preexisting pipeline.
What’s next for Machine Learning at
Scrunch?
Face verification and
recognition
•	 Model learns an encoding of the input
that can be used to measure the distance
to another input.
•	 Helps us “join” people across
multiple networks, so we can
understand influencers’
content across the internet
and understand audience
behaviour.
•	 Siamese Networks with
triplet loss (Andrew Ng).
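The distance idea and the triplet loss can be written down compactly. In this sketch the encodings are plain lists of floats, and the `margin` and `threshold` values are arbitrary illustrations; in a real system the encodings would come from a trained network and both values would be tuned on validation data.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    The loss is zero once the positive (same person) is closer to the
    anchor than the negative (different person) by at least the margin.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


def same_person(encoding_a, encoding_b, threshold=0.5):
    """Verification: two encodings match if they are close enough."""
    return squared_distance(encoding_a, encoding_b) < threshold


# Made-up two-dimensional encodings for illustration.
anchor, positive, negative = [0.0, 0.0], [0.0, 0.1], [1.0, 1.0]
print(triplet_loss(anchor, positive, negative))
print(same_person(anchor, positive), same_person(anchor, negative))
```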
Object detection
•	 What’s actually in an image?
•	 Much deeper understanding of
consumer preferences.
•	 YOLO and YOLO2.
Collaborative filtering
recommendation engine
•	 Collect mappings from influencer campaign
briefs to influencer selection
results.
•	 Plan: build a collaborative
filtering recommender system
that will help brands find the
perfect influencer just by
defining a brief.
Thank you
— Lex Toumbourou //CTO
We’re hiring!
dev@scrunch.com

Machine Learning for Startups Without PHDs

  • 1. @lifestylesydney Machine Learning — for startups without PHDs Lex Toumbourou // CTO lex@scrunch.com https://twitter.com/lexandstuff https://instagram.com/lextoumbourou https://github.com/lextoumbourou http://lextoumbourou.github.io
  • 2. © Scrunch Who am I — and why am I giving this talk? • CTO of Scrunch • Not a data scientist. • Building Machine Learning products for the past 2-3 years • How can people without formal training build Machine Learning products?
  • 3. © Scrunch What this talk is about? 1. What is influencer marketing and how does Scrunch help? 2. How we’re using ML. 3. ML for startups. 4. The future of Machine Learning research at Scrunch.
  • 4. © Scrunch What is influencer marketing and how does Scrunch help?
  • 5. © Scrunch What is an influencer?
  • 6. © Scrunch What is influencer marketing? • Brands pay for access to engaged audience. • Influencers can monetize their audience. • Consumers get product recommendations from people they trust (ideally).
  • 9. © Scrunch How does Scrunch help?
  • 10. © Scrunch Influencer recommendations — (search and filtering) Brands can ask: “find me influencers located in Queensland who mostly talk about street wear and whose audience is female, aged 25-34”. Most of our Machine Learning here!
  • 11. © Scrunch End-to-end campaign workflow Best practices for: reaching out to influencers, helping influencers pitch on campaigns, shipping products, managing payments.
  • 12. © Scrunch Campaign analytics • ROI • Engagement • Gender • Age • Location • Ethnicity • Audience Interests • Audience Brands • Audience hashtags
  • 13. © Scrunch 2. How we use Machine Learning at Scrunch
  • 14. © Scrunch Image classification Categorise ~20M blogs and social media entities using photo content. Gender, age, “profile type” inference, lots more in dev. Fig 1. Image from DEX: Deep EXpectation of apparent age from a single image by Rasmus Rothe and Radu Timofte and Luc Van Gool, 2015
  • 15. © Scrunch Tools • Convolutional Neural Networks • Transfer learning (lots more on this later). • Face detection models and algorithms. CNN diagram by Pawit Kochakarn.
  • 16. © Scrunch Natural Language Processing Topic analysis, location inference, language inference, gender inference, profile type inference, fake followers etc etc
  • 17. © Scrunch Tools • Non-Deep Learning: LogisticRegression and SVMs. • Recurrent Neural Networks (Bidirectional LSTMs). • Pretrained word embeddings. • CNNs.
  • 18. © Scrunch Recommender Systems and Search Mostly content-based filtering at this stage. Big plans here. Image source: The Marketing Technologist
  • 19. © Scrunch Examples of ML in Scrunch — P1 • Influencer tagging • Audience analysis
  • 20. © Scrunch Examples of ML in Scrunch — P2 • Topic analysis • Profile type inference. • Location inference. • Lots more in dev...
  • 21. © Scrunch 3. Machine Learning for startups on a budget — (our notes) 1. Upskilling vs hiring. 2. Starting with no data. 3. Bootstrapping datasets. 4. Starting with some data. 5. Compute 6. Productionisation. 7. Research.
  • 22. © Scrunch Glossary 1. Model = a set of weights that can be used to predict some output given some input. Usually trained by an “optimization” algorithm (not covered). 2. Training set = dataset used to train a model. 3. Validation set = dataset used to validate “fit” of model while training. 4. Test set = aka holdout set. Used to evaluate performance of model after training. 5. Deep Learning = a Machine Learning technique that uses stacks of “hidden layers” to model complex problems. Very effective for certain problems.
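The three dataset roles in the glossary can be illustrated with a minimal split routine. This is a sketch only: the function name `split_dataset`, the 10% validation and test fractions, and the fixed seed are assumptions for illustration, not something stated in the talk.

```python
import random


def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out test, validation and training sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                  # held out until training is finished
    val = shuffled[n_test:n_test + n_val]     # used to check "fit" while training
    train = shuffled[n_test + n_val:]         # used to fit the model weights
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

A random split like this only makes sense when the collected data already resembles production data; otherwise the validation and test sets should be drawn from production-like examples.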
  • 24. © Scrunch Upskilling • If you can learn to code, you can learn ML. • “people with 1-year of coding experience can become world-class deep learning practitioners” - Jeremy Howard • Can offer business value with minimal skills (for some problems). • Tip: Try to put skills into practice as soon as possible - theory will be easier to learn with context.
  • 25. © Scrunch Hiring experts • Machine Learning and AI are electives in Computer Science degrees at UQ (probably at QUT too), and lots of graduates are hungry to solve real problems. • Outsourcing? • Tip: if you outsource your research, you must be across the project evaluation: test set and evaluation metrics. More on this coming up.
  • 26. © Scrunch 3.2 Starting with no data
  • 27. © Scrunch Validate idea first — if you can • Validate idea before collecting data and setting up ML pipeline if you can. • Focus on problem not solution: “If you remove AI from the company but it still has a valuable product, you’re on the right track” - Siraj Raval • Tip: read The Lean Startup.
  • 28. © Scrunch Heuristics / handwritten rules • Spam filtering example: just block all hosts on DNS based email blacklists. • Topic analysis: table of hashtags to topics. Good enough. • However: “Choose machine learning over a complex heuristic.” - Martin Zinkevich in Rules of Machine Learning: Best Practices for ML Engineering • Tip: test the accuracy of handwritten rules by hand-checking ~100 examples from a small test set, and compare against an ML approach.
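As a rough illustration of the hashtag-table heuristic and the accuracy check suggested in the tip, here is a minimal sketch. The table entries, the function names and the four-example test set are all invented for illustration; the slide suggests around 100 hand-checked examples in practice.

```python
# Hypothetical hashtag-to-topic table; entries are illustrative only.
HASHTAG_TOPICS = {
    "#streetwear": "fashion",
    "#ootd": "fashion",
    "#vegan": "food",
    "#brunch": "food",
    "#gym": "fitness",
}


def topics_for(post_text):
    """Return the set of topics whose hashtags appear in the post."""
    tags = {word.lower() for word in post_text.split() if word.startswith("#")}
    return {HASHTAG_TOPICS[tag] for tag in tags if tag in HASHTAG_TOPICS}


def accuracy(predict, labelled_examples):
    """Fraction of hand-labelled examples where the prediction matches exactly."""
    correct = sum(1 for text, expected in labelled_examples if predict(text) == expected)
    return correct / len(labelled_examples)


# Tiny hand-checked test set (made up for this sketch).
TEST_SET = [
    ("New drop today #Streetwear #ootd", {"fashion"}),
    ("Sunday #brunch with friends", {"food"}),
    ("Leg day #gym then #vegan lunch", {"fitness", "food"}),
    ("Loving these new sneakers", {"fashion"}),  # no hashtag, so the rule misses it
]

print(f"heuristic accuracy: {accuracy(topics_for, TEST_SET):.2f}")
```

Because `accuracy` takes any prediction function, the same test set can later score an ML model, which makes the heuristic-versus-ML comparison direct.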
  • 29. © Scrunch Third-party APIs • Google Cloud Vision API, IBM Watson, Kairos, Amazon Rekognition. • Can be expensive for big data workloads. • Tip: test accuracy with your production-like test set (again)!
  • 30. © Scrunch Using pretrained models • So many to choose from! See ModelDepot.io for a list. • Don’t expect great production results! Your input data is probably different to what the model was trained on. • Transfer learning helps! - coming soon • Tip: build a test set to evaluate model performance on your production data (again).
  • 32. © Scrunch Publicly available datasets • Lots of great publicly available datasets with various (see Github project). • We’ve had incredible mileage out of the Wikidata set.
  • 33. © Scrunch Classifying data by hand • Annotation tools. • Platforms like Crowdflower or Mechanical Turk. • Ask your users? Incentives? • Tip: it’s not as hard as you think - just get started.
  • 34. © Scrunch Purchasing datasets • Data Marketplaces like Data Circle. • Partner with organisations that have data they can share?
  • 35. © Scrunch User data • “Virtuous circle of AI”: Product > Users > Data > ML loop. • Tip: Andrew Ng’s talk “AI Is The New Electricity”.
  • 36. © Scrunch General advice • Just start. You’ll figure it out over time. • Transfer learning can allow you to do a lot with a little data… (up next)
  • 37. © Scrunch 3.4 Starting with a little data
  • 38. © Scrunch Transfer learning! • Deep learning with small data. • Take a model trained for a similar problem and repurpose on your dataset.
  • 39. © Scrunch Transfer learning in image classification • Take a deep model trained on a big dataset like ImageNet and retrain only the last layers on your dataset. • The more data you have, the more layers you can retrain. • Slowly “unfreeze” earlier layers until validation accuracy stops improving. • Tip: Precomputing activations will speed up your training time significantly! Source: André Karpištšenko
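The "precompute activations, retrain only the last layer" idea can be shown in miniature without a deep learning framework. In this toy sketch, `frozen_features` is a hypothetical stand-in for the frozen pretrained layers (it is not a real pretrained network), the data is synthetic, and the only trained part is a logistic-regression head. It reports training accuracy only; a real project would track validation accuracy as the slide describes.

```python
import math
import random


def frozen_features(x):
    """Stand-in for frozen pretrained layers: fixed, never updated during training."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(activations, labels, epochs=200, lr=0.5):
    """Train only the final logistic layer, using precomputed activations."""
    weights = [0.0] * len(activations[0])
    bias = 0.0
    for _ in range(epochs):
        for feats, y in zip(activations, labels):
            p = sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)
            error = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


# Synthetic two-class data, invented for this sketch.
rng = random.Random(0)
inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in inputs]

# Precompute activations once: the frozen layers never change, so there is
# no need to run them again on every epoch.
activations = [frozen_features(x) for x in inputs]
weights, bias = train_head(activations, labels)

predictions = [
    1 if sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias) > 0.5 else 0
    for feats in activations
]
train_accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"training accuracy: {train_accuracy:.2f}")
```

With a real network the saving is much larger, because the frozen layers are the expensive part of each forward pass.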
  • 40. © Scrunch Transfer learning in NLP • Pretrained word embeddings: GloVe, Word2Vec etc. • Learns a vector for each word that captures relationships between words. • “Man is to woman as king is to _”. Source: Jeffrey Pennington
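The analogy on this slide is usually solved with vector arithmetic followed by a nearest-neighbour lookup. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real GloVe or Word2Vec embeddings are learned from large corpora and have hundreds of dimensions.

```python
import math

# Toy vectors invented for illustration (axes roughly: royalty, male, female).
EMBEDDINGS = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    # The query words themselves are excluded, as is common practice.
    candidates = (word for word in EMBEDDINGS if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(target, EMBEDDINGS[word]))


print(analogy("man", "woman", "king"))
```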
  • 41. © Scrunch More • Data augmentation. • Regularization (Dropout / Weight Decay). • Tip: for image classification problems read “Building powerful image classification models using very little data” by Francois Chollet
  • 42. © Scrunch Advice: — focus on great validation and test set • “Why is my model performing well in dev but not in prod”? Likely answer: the test and validation sets do not match the production “distribution”. • Guideline from Deep Learning Specialization - Structuring Machine Learning Projects.
  • 43. © Scrunch 3.5 Compute Where can a startup obtain the compute required to train Deep Learning?
  • 44. © Scrunch AWS SageMaker Built in notebooks. Preconfigured environments. Uses ml.p2.xlarge (Nvidia K80). Cost per hour: $1.26 Cost per month (assuming constant training): $937.44
  • 45. © Scrunch Floydhub • Super easy to set up. Handles hosting model after training. • Total per hour: $1.2 • Total per month to consistently train: $892.8
  • 46. © Scrunch Amazon EC2 (Deep Learning AMIs) • Standard EC2 instances. Cheaper options, especially if using Spot Pricing. • Total per hour (p2.xlarge (Nvidia K80)): $0.9 • Total per month to consistently train: $669.6 • Cheaper with spot pricing.
  • 47. © Scrunch Paperspace • Another great choice. Recommended in Fast.AI (v2) • Total per hour: $0.40 • Total per month to consistently train: $297.6
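The monthly figures quoted on the compute slides are consistent with training around the clock for a 31-day month (24 × 31 = 744 hours). The arithmetic can be reproduced as below; the hourly rates are copied from the slides and reflect prices at the time of the talk, not current pricing.

```python
HOURS_PER_MONTH = 24 * 31  # the slides' monthly totals match a 31-day month

# Hourly rates in USD as quoted on the slides.
HOURLY_RATES_USD = {
    "AWS SageMaker": 1.26,
    "Floydhub": 1.20,
    "Amazon EC2": 0.90,
    "Paperspace": 0.40,
}


def monthly_cost(hourly_rate):
    """Cost of training continuously for one 31-day month."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)


# Print the options from cheapest to most expensive.
for name, rate in sorted(HOURLY_RATES_USD.items(), key=lambda item: item[1]):
    print(f"{name}: ${monthly_cost(rate):.2f} per month")
```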
  • 48. © Scrunch Building your own machine • Often cheapest in the long term. Allows you to train continuously. • See article: “The $1700 great Deep Learning box” by Slav Ivanov. • Cons: more difficult to access remotely (reverse SSH tunneling)
  • 49. © Scrunch Advice: — make sure you’re making use of GPU Tensorflow example: pip install tensorflow-gpu
  • 51. © Scrunch Basic approach • Common approach: put your model behind a simple Flask API and host it somewhere. • We use AWS Lambda, see our blog post. • Tip: build a catalog of models that can be used across the business.
  • 52. © Scrunch Data pipelines • System for ingesting and processing data. • We use a queue-based pipeline, means we can cope with burst workloads and tend not to lose data. • Tip: “Keep the first model simple and get the infrastructure right.” Martin Zinkevich.
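A queue-based pipeline copes with bursts because producers only enqueue work while consumers drain it at their own pace. The sketch below uses Python's in-process `queue.Queue` and threads as a stand-in; the talk does not say which queueing system Scrunch uses, and a production pipeline would typically use a durable external queue. The name `run_pipeline` is hypothetical.

```python
import queue
import threading


def run_pipeline(items, process, num_workers=2):
    """Push items onto a queue and let worker threads drain it at their own pace."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work for this worker
                work.task_done()
                break
            outcome = process(item)
            with lock:  # protect the shared results list
                results.append(outcome)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for item in items:  # a burst of work is simply buffered in the queue
        work.put(item)
    for _ in threads:  # one sentinel per worker
        work.put(None)
    work.join()
    for thread in threads:
        thread.join()
    return results


doubled = run_pipeline(range(100), lambda x: x * 2)
print(len(doubled))
```

Results arrive in completion order rather than input order, so downstream code should not rely on ordering. Because `None` is used as the stop signal, it cannot also be a work item in this sketch.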
  • 53. © Scrunch Live vs cached predictions Two approaches to serving predictions: • provide live predictions to customers • cache predictions and let users query the cache (our approach) We cache predictions and store them in Elasticsearch. • Tip: store the model ID and an encoding of the input data (e.g. the md5sum of an image) with each prediction to avoid unnecessarily reprocessing data. "gender": { "value": "female", "prob": 0.8561, "img_id": "3b85ec9ab2984b91070128be6aae25eb", "model_id": "mobilenet_20171008_v1" }
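The caching tip can be sketched as a lookup keyed on the model ID plus the md5 digest of the image bytes, so an image is only re-scored when either the image or the model changes. This is an assumption-laden sketch: a plain dict stands in for Elasticsearch, and `predict_gender` is a hypothetical placeholder returning a fixed answer rather than a real model call.

```python
import hashlib

MODEL_ID = "mobilenet_20171008_v1"  # model identifier taken from the slide's example

# In-memory stand-in for the prediction store (the slides mention Elasticsearch).
prediction_cache = {}


def image_id(image_bytes):
    """md5 hex digest of the raw image bytes, used as a stable content key."""
    return hashlib.md5(image_bytes).hexdigest()


def predict_gender(image_bytes):
    """Hypothetical placeholder for a real model call; returns a fixed answer."""
    return {"value": "female", "prob": 0.8561}


def cached_prediction(image_bytes, model_id=MODEL_ID, predict=predict_gender):
    """Return the stored prediction, running the model only on a cache miss."""
    key = (model_id, image_id(image_bytes))
    if key not in prediction_cache:
        result = dict(predict(image_bytes))
        result.update(img_id=key[1], model_id=model_id)
        prediction_cache[key] = result
    return prediction_cache[key]


record = cached_prediction(b"raw image bytes go here")
print(record["img_id"], record["model_id"])
```

Including the model ID in the key means that deploying a new model naturally triggers reprocessing, while unchanged images scored by the same model are served from the cache.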
  • 54. © Scrunch Monitoring • Performance / latency • Accuracy - often human component • Ongoing retraining plan
  • 56. © Scrunch Reading academic papers • Don’t need to understand all the theory • Usually an open-source implementation for reference. • Tip: get started by reading papers for problems you understand • More info: Fast.ai Part 2, Lesson 1
  • 57. © Scrunch Managing research — and R&D Tax Incentives • Store research in version control with results and links to artifacts. • If claiming R&D Tax incentives, audits can be very time consuming if you aren’t meticulous with records (trust me). • Questions to think about for R&D: What is your hypothesis? What new knowledge are you generating? Why is the research novel? What are your unknowns? What are your main observations and conclusions?
  • 58. © Scrunch Recap 1. You can upskill internally. 2. Try to validate the product without ML. 3. Lots of ways to bootstrap datasets - your strategy will evolve. 4. Transfer learning is awesome! 5. Lots of compute options: hard to beat building your own machine. 6. Consider production strategy: pipeline, monitoring, retraining. 7. Store research in version control and be meticulous with it.
  • 59. © Scrunch Example at Scrunch: — audience analysis • Validate idea with customers: will anyone pay extra for it? • First version no ML: start by sending our data to a third-party API. • Use collected data to bootstrap a dataset, and augment it with publicly available datasets and some hand cleaning. • Find academic papers solving similar problems and use them to train models. Their accuracy gives a good indication of what to aim for. Can also verify against the third-party API. • Deploy and ship into preexisting pipeline.
  • 60. © Scrunch What’s next for Machine Learning at Scrunch?
  • 61. © Scrunch Face verification and recognition • Model learns an encoding of the input that can be used to measure the distance to another input. • Helps us “join” people across multiple networks. We can understand influencers’ content across the internet and understand audience behaviour. • Siamese Networks with Triplet Loss (Andrew Ng)
  • 62. © Scrunch Object detection • What’s actually in an image? • Much deeper understanding of consumer preferences. • YOLO and YOLO2.
  • 63. © Scrunch Collaborative filtering recommendation engine • Collect mappings from influencer campaign briefs to influencer selection results. • Plan: build a collaborative filtering recommender system that will help brands find the perfect influencer by just defining a brief.
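The triplet loss mentioned on this slide pushes the encoding of an anchor image closer to a "positive" (same person) than to a "negative" (different person) by at least a margin. Below is a minimal sketch of the loss itself using squared Euclidean distance. The 2-dimensional encodings and the margin of 0.2 are made up for illustration; in practice the encodings come from the trained network.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


# Made-up encodings for illustration.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]       # same person, close to the anchor
easy_negative = [1.0, 0.0]  # different person, already far away
hard_negative = [0.2, 0.0]  # different person, but still too close

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy triplet already satisfies the margin and contributes no loss, while the hard triplet produces a positive loss, which is why training tends to benefit from selecting hard triplets.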
  • 64. © Scrunch Thank you — Lex Toumbourou //CTO We’re hiring! dev@scrunch.com