Girish Nathan
Misha Bilenko
Microsoft Azure Machine Learning
How to Work with Large Datasets to Build Predictive Models
Agenda
1. How to Work with Large Datasets
• Sample Dataset: NYC Taxi
• HDInsight (Hadoop on Azure)
• IPython notebook and HDInsight
2. Building Predictive Models
• Azure ML Studio
• Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight
Sample Data: NYC Taxi
• One year log of NYC taxi rides
• 60GB, publicly available at http://www.andresmh.com/nyctaxitrips/
• Trip (driver id, times, locations) and fare (fare, tip, tolls)
• Rest of tutorial: data wrangling and tip prediction
• Tools: AzCopy, HDInsight, IPython, Azure ML Studio
HDInsight: Hadoop on Azure
• 100% Apache Hadoop as an Azure service
• Can deploy on Windows or Linux
• Provides MapReduce capability over big data in Azure blobs
• Head node: job and cluster monitoring
• Hive: SQL-like queries as an alternative to writing code
SELECT Col1, COUNT(*) AS Count_Col1 FROM Your_Table
GROUP BY Col1 ORDER BY Count_Col1 DESC LIMIT 10;
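For readers less familiar with Hive, the GROUP BY / ORDER BY / LIMIT query on this slide can be sketched in plain Python on a small in-memory sample. This is only an illustration of what the query computes, not how HDInsight executes it; the function name `top_values` and the sample rows are made up for the example.

```python
from collections import Counter

def top_values(rows, column, limit=10):
    """Return the `limit` most frequent values of `column` with their counts."""
    counts = Counter(row[column] for row in rows)
    # most_common sorts by descending count, like ORDER BY ... DESC LIMIT n
    return counts.most_common(limit)

# Hypothetical sample rows standing in for Your_Table
rows = [
    {"Col1": "a"}, {"Col1": "b"}, {"Col1": "a"},
    {"Col1": "c"}, {"Col1": "a"}, {"Col1": "b"},
]
print(top_values(rows, "Col1", limit=2))  # [('a', 3), ('b', 2)]
```

On the full 60GB dataset the same aggregation would be left to Hive, which distributes the grouping across the cluster.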
IPython Notebook
• Web-based Python REPL environment
• Combines authoring, execution, visualization
• Can author and execute HDInsight Hive queries
• Sample query (Python code snippet)
    import json
    import urllib2  # Python 2 standard library, as used on the slide

    def submit_hive_query(self):
        # POST the Hive parameters to the cluster URL and record the job id
        response = urllib2.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        self.hiveJobID = data['id']

    def query(self, queryString):
        # the slide elides how queryString is placed into self.hiveParams
        self.submit_hive_query()
Example query string: SELECT * FROM sample_table LIMIT 10;
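The snippet on this slide leaves out how the request is assembled. A minimal Python 3 sketch is given here, assuming the cluster exposes the WebHCat (Templeton) REST endpoint for Hive; the path `/templeton/v1/hive`, the parameter names `user.name`, `execute` and `statusdir`, the cluster URL, and both helper functions are assumptions for illustration and are not taken from the slides. The request is only built, not sent, and authentication is omitted.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_hive_request(base_url, user, query, status_dir):
    """Build (but do not send) a POST request that submits a Hive query."""
    params = urlencode({
        "user.name": user,        # assumed WebHCat parameter names
        "execute": query,
        "statusdir": status_dir,
    }).encode("utf-8")
    return Request(base_url + "/templeton/v1/hive", data=params, method="POST")

def parse_job_id(response_body):
    """Extract the job id from the JSON body returned on submission."""
    return json.loads(response_body)["id"]

req = build_hive_request(
    "https://example-cluster.azurehdinsight.net",  # hypothetical cluster URL
    "admin", "SELECT * FROM sample_table LIMIT 10;", "/example/status")
print(req.full_url)
print(parse_job_id('{"id": "job_0001"}'))  # sample response body, made up
```

Passing the request to `urllib.request.urlopen` would submit it; the returned job id can then be polled for completion, as the slide's `hiveJobID` suggests.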
What is Azure ML Studio
• Fully managed cloud service
• Browser-based authoring of dataflow
• Best-in-class machine learning algorithms
• Support for R/Python/SQL
• Collaborative data science
• Quickly deploy models as web services/REST APIs
• Publish to a gallery for collaboration with community
Learning with Counts, a.k.a. Dracula
(Distributed Robust Algorithm for CoUnt-based LeArning)
Misha Bilenko
Microsoft Azure Machine Learning
Microsoft Research
Large-scale learning in multi-entity domains
adid = 1010054353
adText = K2 ski sale!
adURL = www.k2.com/sale
userid = 0xb49129827048dd9b
IP = 131.107.65.14
query = powder skis
qCategories = {skiing, outdoor gear}
$\#\text{users} \sim 10^{9}$, $\#\text{queries} \sim 10^{9+}$, $\#\text{ads} \sim 10^{7}$, $\#(\text{ad} \times \text{query}) \sim 10^{10+}$
• Information retrieval
• Advertising, recommending, search: item, page/query, user
• Transaction classification
• Payment fraud: transaction, product, user
• Email spam: message, sender, recipient
• Intrusion detection: session, system, user
• IoT: device, location
Large-scale learning in multi-entity domains
adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale
userid: 0xb49129827048dd9b
IP: 131.107.65.14
query: powder skis
qCategories: {skiing, outdoor gear}
• Problem: representing high-cardinality attributes as features
• Scalable: to billions of attribute values
• Efficient: predictions/sec
• Flexible: for a variety of downstream learners
• Adaptive: to distribution change
• Standard approaches: binary features, hashing, projections
• What everyone uses in industry: learning with counts
• This talk: formalization and generalization
Learning with Counts
• Features are transforms of conditional statistics (per-label counts):
  $[\,N^{+},\ N^{-},\ \log N^{+} - \log N^{-},\ \mathrm{IsBackoff}\,]$
• $\log N^{+} - \log N^{-}$: log-odds (Naïve Bayes) estimate
• $N^{+}$, $N^{-}$: indicators of confidence of the naïve estimate
• IsBackoff: indicator of back-off vs. “real count”
Counts are looked up for each attribute value and combination in the example: 131.107.65.14 (IP), k2.com (ad URL), powder skis (query), and (powder skis, k2.com).
IP                N+        N-
173.194.33.9      46964     993424
87.250.251.11     31        843
131.107.65.14     12        430
…                 …         …
REST              745623    13964931
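To make the feature definition concrete, here is a minimal sketch that turns one attribute value into the four count features, using the IP table from this slide. It assumes the first numeric column is N+ and the second is N-, and that unseen values back off to the REST row; the function name `count_features` is illustrative and not part of any library.

```python
import math

# Per-value counts (N+, N-) from the slide's IP table; "REST" is the back-off bin.
# Column order (N+ first) is an assumption.
IP_COUNTS = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
    "REST": (745623, 13964931),
}

def count_features(table, value):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = table["REST"] if is_backoff else table[value]
    log_odds = math.log(n_pos) - math.log(n_neg)  # assumes both counts are > 0
    return [n_pos, n_neg, log_odds, int(is_backoff)]

print(count_features(IP_COUNTS, "131.107.65.14"))
print(count_features(IP_COUNTS, "10.0.0.1"))  # unseen value falls back to REST
```

However many distinct IPs exist, each one contributes the same four numbers, which is what keeps the representation low-dimensional.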
Learning with Counts
• Features are transforms of conditional counts:
  $[\,N^{+},\ N^{-},\ \log N^{+} - \log N^{-},\ \mathrm{IsBackoff}\,]$
• Scalable: “head” in memory + tail in back-off; or: count-min sketch
• Efficient: low cost, low dimensionality
• Flexible: low dimensionality works well with non-linear learners
• Adaptive: new values easily added, back-off for infrequent values, temporal counts
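The count-min sketch mentioned on this slide stores approximate counts in fixed memory regardless of how many distinct values appear. The following is a minimal sketch of the data structure, not the implementation used in Azure ML; the width, depth, and the salted SHA-256 hashing are arbitrary choices made for the example.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: estimates never undercount, but may overcount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, derived by salting the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, amount=1):
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += amount

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum is the tightest estimate.
        return min(self.rows[row][self._index(key, row)] for row in range(self.depth))

positives = CountMinSketch()   # one sketch per label: N+ here, N- analogously
for ip in ["131.107.65.14"] * 12 + ["87.250.251.11"] * 31:
    positives.add(ip)
print(positives.estimate("131.107.65.14"))  # at least 12
```

Keeping one sketch per label gives approximate N+ and N- for every value, as an alternative to holding only the frequent "head" values exactly and sending the rest to back-off.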
Learning with Counts: aggregation
Aggregate counts for different bin functions
• Standard MapReduce
• Bin function: any projection
• Backoff options: “tail bin”, hashing, hierarchical (shrinkage)
[Figure: timeline with counting up to T_now]
IP                N+        N-
173.194.33.9      46964     993424
87.250.251.11     31        843
131.253.13.32     12        430
…                 …         …
REST              745623    13964931
query             N+        N-
facebook          281912    7957321
dozen roses       32791     640964
…                 …         …
REST              6321789   43477252
Query × AdId      N+        N-
facebook, ad1     54546     978964
facebook, ad2     232343    8431467
dozen roses, ad3  12973     430982
…                 …         …
REST              4419312   52754683
IP[2]             N+        N-
173.194.*.*       46964     993424
87.250.*.*        6341      91356
131.253.*.*       75126     430826
…                 …         …
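The aggregation step can be sketched as a map (apply a bin function to each event) followed by a reduce (sum per-label counts per bin). The code below runs in a single process and only mirrors the shape of the MapReduce job. The function names, the `min_total` threshold for the tail bin, and the sample events are all hypothetical.

```python
from collections import defaultdict

def ip_prefix(ip, parts=2):
    """Bin function: keep the first `parts` octets, e.g. 173.194.33.9 -> 173.194.*.*"""
    octets = ip.split(".")
    return ".".join(octets[:parts] + ["*"] * (4 - parts))

def aggregate(events, bin_fn):
    """Map each (value, label) event to its bin, then reduce by summing per label."""
    counts = defaultdict(lambda: [0, 0])          # bin -> [N+, N-]
    for value, label in events:
        counts[bin_fn(value)][0 if label else 1] += 1
    return dict(counts)

def with_tail_bin(counts, min_total):
    """Back-off: fold bins with fewer than `min_total` events into a REST bin."""
    out = {"REST": [0, 0]}
    for key, (pos, neg) in counts.items():
        if pos + neg >= min_total:
            out[key] = [pos, neg]
        else:
            out["REST"][0] += pos
            out["REST"][1] += neg
    return out

# Made-up events: (IP, label)
events = [("173.194.33.9", 1), ("173.194.33.9", 0), ("173.194.7.1", 0),
          ("87.250.251.11", 0)]
print(aggregate(events, ip_prefix))
print(with_tail_bin(aggregate(events, lambda ip: ip), min_total=2))
```

The same `aggregate` call with a different bin function (identity, IP prefix, or a pair such as query and ad id) yields each of the tables on this slide.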
Learning with Counts: combiner training
[Figure: timeline to T_now showing Counting and Train predictor phases; count tables (IP, query, Query × AdId) feed the aggregated features $N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff, which are combined with the original numeric features]
Train non-linear model on count-based features
• Counts, transforms, lookup properties
• Additional features can be injected
Prediction with counts
[Figure: timeline with T_train and T_now and an ongoing Counting phase; count tables (IP, query, URL × Country) feed the aggregated features $N^{+}$, $N^{-}$, $\ln N^{+} - \ln N^{-}$, IsBackoff, which are combined with the original numeric features]
URL × Country     N+        N-
url1, US          54546     978964
url2, CA          232343    8431467
url3, FR          12973     430982
…                 …         …
REST              4419312   52754683
• Counts are updated continuously
• Combiner re-training infrequent
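The split between continuously updated counts and an infrequently retrained combiner can be illustrated with a toy example. Here the "combiner" is a fixed logistic function of the log-odds feature, standing in for whatever trained model is actually used. The class name, the starting count of 1 per label (a smoothing choice to keep logarithms finite), and the event stream are assumptions for the sketch.

```python
import math
from collections import defaultdict

class CountTable:
    """Per-value [N+, N-] counts that can be updated after the combiner is trained."""

    def __init__(self):
        # Start both counts at 1 so the log features stay finite (assumed smoothing).
        self.counts = defaultdict(lambda: [1, 1])

    def update(self, value, label):
        self.counts[value][0 if label else 1] += 1

    def features(self, value):
        n_pos, n_neg = self.counts[value]
        return [n_pos, n_neg, math.log(n_pos) - math.log(n_neg)]

def combiner(features):
    """Stand-in for the trained model: a fixed logistic function of the log-odds."""
    return 1.0 / (1.0 + math.exp(-features[2]))

table = CountTable()
before = combiner(table.features("url1, US"))
for _ in range(20):
    table.update("url1, US", label=1)      # new positive events arrive
after = combiner(table.features("url1, US"))
print(before, after)
```

The prediction for "url1, US" moves as its counts change even though `combiner` itself is never retrained, which is the behavior the two bullets on this slide describe.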
What is great about learning with counts?
• State-of-the-art accuracy
• Good fit for MapReduce
• Modular (vs. monolithic)
• Learner can be tuned/monitored/replaced in isolation
• Monitorable, debuggable (this is HUGE in practice!)
• Temporal changes easy to monitor
• Easy emergency recovery (remove bot attacks, etc.)
• Decomposable predictions
• Error debugging (which feature can we blame…)
Learning with Counts: in Azure ML
Putting it all together
• HDInsight: large data storage and MapReduce processing
• Azure ML: cloud ML and analytics accessible anywhere
• Learning with Counts: intuitive, flexible large-scale ML solution
Thanks for your time
Useful links:
http://azure.microsoft.com/ml - Sign up for your free Azure ML trial
http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
Need Azure ML for teaching in the classroom? Contact the speakers
Other questions? Contact the speakers
Speakers:
Misha Bilenko: mbilenko@Microsoft.com
Girish Nathan: ginathan@Microsoft.com
