2. Agenda
1. How to Work with Large Datasets
• Sample Dataset: NYC Taxi
• HDInsight (Hadoop on Azure)
• iPython notebook and HDInsight
2. Building Predictive Models
• Azure ML Studio
• Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight
3. Sample Data: NYC Taxi
• One-year log of NYC taxi rides
• 60 GB, publicly available at http://www.andresmh.com/nyctaxitrips/
• Trip (driver id, times, locations) and fare (fare, tip, tolls)
• Rest of tutorial: data wrangling and tip prediction
• Tools: AzCopy, HDInsight, iPython, Azure ML Studio
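The wrangling step can be pictured as joining each trip record to its fare record and deriving a tip label. A minimal sketch, assuming hypothetical column names (`medallion`, `pickup_datetime`, `tip_amount`) and made-up rows; the real files may use different names and keys:

```python
def join_trips_and_fares(trips, fares, key=("medallion", "pickup_datetime")):
    """Inner-join trip and fare records (lists of dicts) on a shared key.

    The key columns are an assumption made for illustration.
    """
    fare_by_key = {tuple(f[k] for k in key): f for f in fares}
    joined = []
    for trip in trips:
        fare = fare_by_key.get(tuple(trip[k] for k in key))
        if fare is not None:
            row = dict(trip)
            row.update(fare)
            # Binary label for tip prediction: was any tip paid?
            row["tipped"] = 1 if float(row["tip_amount"]) > 0 else 0
            joined.append(row)
    return joined


# Made-up sample rows, not taken from the dataset.
trips = [
    {"medallion": "A1", "pickup_datetime": "t0", "trip_distance": "1.2"},
    {"medallion": "B2", "pickup_datetime": "t1", "trip_distance": "3.4"},
]
fares = [
    {"medallion": "A1", "pickup_datetime": "t0", "fare_amount": "6.5", "tip_amount": "1.0"},
    {"medallion": "B2", "pickup_datetime": "t1", "fare_amount": "12.0", "tip_amount": "0"},
]
rows = join_trips_and_fares(trips, fares)
```

At 60 GB the same join would run as a Hive query on the cluster rather than in memory; the sketch only shows the shape of the result.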
4. • 100% Apache Hadoop as an Azure service
• Can deploy on Windows or Linux
• Provides Map-Reduce capability over big data in Azure blobs
• Head node: job and cluster monitoring
• Hive: SQL-like queries as an alternative to writing code
SELECT Col1, COUNT(*) AS Count_Col1 FROM Your_Table
GROUP BY Col1 ORDER BY Count_Col1 DESC LIMIT 10;
HDInsight: Hadoop on Azure
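For intuition, the Hive query on this slide returns the ten most frequent values of a column together with their counts. On a small in-memory list the same result can be sketched in Python; the sample values are made up:

```python
from collections import Counter


def top_values(values, limit=10):
    """Mirror of: SELECT Col1, COUNT(*) ... GROUP BY Col1 ORDER BY count DESC LIMIT limit."""
    return Counter(values).most_common(limit)


col1 = ["CRD", "CSH", "CRD", "CRD", "CSH", "NOC"]  # made-up sample values
result = top_values(col1)
```

Hive runs the grouping as a distributed job over the blobs, so it scales to data that would not fit in a single process.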
5. • Web-based Python REPL environment
• Combines authoring, execution, visualization
• Can author and execute HDInsight Hive queries
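• Sample query (Python 2 code snippet; methods of a helper class not shown in full)
import json
import urllib2

def submit_hive_query(self):
    # POST the Hive parameters to the cluster URL; the JSON reply carries the job id.
    response = urllib2.urlopen(self.url, self.hiveParams)
    data = json.load(response)
    self.hiveJobID = data['id']

def query(self, queryString):
    # The slide does not show how queryString is placed into self.hiveParams.
    self.submit_hive_query()
Example query string: SELECT * FROM sample_table LIMIT 10;
iPython Notebook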
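The snippet on this slide uses Python 2's `urllib2`. A Python 3 sketch of the same submission step might look as follows. It assumes the cluster exposes a WebHCat (Templeton) style REST endpoint; the URL, the parameter names (`user.name`, `execute`, `statusdir`), and the helper names are assumptions, not taken from the slide, and authentication is left out. The request is built but never sent.

```python
import json
import urllib.parse
import urllib.request


def build_hive_request(url, user, query, status_dir):
    """Build (but do not send) a POST request that submits a Hive query.

    Parameter names follow the WebHCat convention; verify them, and the
    URL, against your own cluster before use.
    """
    params = urllib.parse.urlencode(
        {"user.name": user, "execute": query, "statusdir": status_dir}
    ).encode("utf-8")
    return urllib.request.Request(url, data=params, method="POST")


def parse_job_id(response_body):
    """Extract the job id from the JSON body returned on submission."""
    return json.loads(response_body)["id"]


req = build_hive_request(
    "https://example-cluster.example.net/templeton/v1/hive",  # hypothetical URL
    "admin",                                                   # hypothetical user
    "SELECT * FROM sample_table LIMIT 10;",
    "/example/status",                                         # hypothetical path
)
```

Sending the request would be `urllib.request.urlopen(req)`, after which `parse_job_id` plays the role of the `data['id']` lookup in the original snippet.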
6. • Fully managed cloud service
• Browser-based authoring of dataflow
• Best-in-class machine learning algorithms
• Support for R/Python/SQL
• Collaborative data science
• Quickly deploy models as web services/REST APIs
• Publish to a gallery for collaboration with community
What is Azure ML Studio?
7. (Distributed Robust Algorithm for CoUnt-based LeArning)
Misha Bilenko
Microsoft Azure Machine Learning
Microsoft Research
Learning with Counts
a.k.a Dracula
9. adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale
userid: 0xb49129827048dd9b
IP: 131.107.65.14
query: powder skis
qCategories: {skiing, outdoor gear}
• Problem: representing high-cardinality attributes as features
• Scalable: to billions of attribute values
• Efficient: predictions/sec
• Flexible: for a variety of downstream learners
• Adaptive: to distribution change
• Standard approaches: binary features, hashing, projections
• What everyone uses in industry: learning with counts
• This talk: formalization and generalization
Large-scale learning in multi-entity domains
10. • Features are transforms of conditional statistics (per-label counts): $[N^+,\ N^-,\ \log(N^+) - \log(N^-),\ \mathrm{IsBackoff}]$
• $\log(N^+) - \log(N^-) = \log(N^+/N^-)$: the log-odds / Naïve Bayes estimate
• $N^+$, $N^-$: indicators of confidence of the naïve estimate
• IsBackoff (IsFromRest): indicator of back-off vs. “real count”
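[Figure: count features are looked up for each attribute of the example: IP (131.107.65.14), ad URL domain (k2.com), query (powder skis), and the pair (powder skis, k2.com).]
IP               N+       N-
173.194.33.9     46964    993424
87.250.251.11    31       843
131.107.65.14    12       430
…                …        …
REST             745623   13964931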
Learning with Counts
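The feature vector on this slide can be sketched in a few lines of Python. The counts are copied from the IP table; reading the first count column as N+ and the second as N- is an assumption, as are the function and variable names.

```python
import math

# Count table for the IP attribute: value -> (N+, N-). Column order is assumed.
IP_COUNTS = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
}
IP_REST = (745623, 13964931)  # back-off row for values not in the table


def count_features(value, table, rest):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = rest if is_backoff else table[value]
    # Assumes both counts are positive; zero counts would need smoothing.
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [n_pos, n_neg, log_odds, int(is_backoff)]


seen = count_features("131.107.65.14", IP_COUNTS, IP_REST)
unseen = count_features("10.0.0.1", IP_COUNTS, IP_REST)  # made-up, unseen IP
```

A value that is missing from the table falls back to the REST row and sets IsBackoff to 1, so a downstream learner can tell a real count from a back-off count.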
11. • Features are transforms of conditional counts: $[N^+,\ N^-,\ \log(N^+) - \log(N^-),\ \mathrm{IsBackoff}]$
• Scalable: “head” in memory + tail in back-off; or: count-min sketch
• Efficient: low cost, low dimensionality
• Flexible: low dimensionality works well with non-linear learners
• Adaptive: new values easily added, back-off for infrequent values, temporal counts
Learning with Counts
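One way to read the "head in memory + tail in back-off" option is to keep exact counts only for the most frequent values and fold everything else into a single REST row. A minimal sketch under that reading, with made-up data and hypothetical names:

```python
from collections import defaultdict


def build_count_table(examples, head_size):
    """Count (N+, N-) per value, keep the head_size most frequent values,
    and fold every other value into a single REST row."""
    counts = defaultdict(lambda: [0, 0])
    for value, label in examples:
        counts[value][0 if label else 1] += 1
    # Rank values by total number of observations, most frequent first.
    ranked = sorted(counts.items(), key=lambda kv: -(kv[1][0] + kv[1][1]))
    head = {value: tuple(c) for value, c in ranked[:head_size]}
    rest = [0, 0]
    for _, (n_pos, n_neg) in ranked[head_size:]:
        rest[0] += n_pos
        rest[1] += n_neg
    return head, tuple(rest)


# Made-up (value, label) pairs; label 1 is the positive class.
examples = [("a", 1), ("a", 0), ("a", 0), ("b", 0), ("b", 1), ("c", 0), ("d", 1)]
head, rest = build_count_table(examples, head_size=2)
```

Memory then grows with `head_size` rather than with the number of distinct values. A count-min sketch, the alternative named on the slide, trades exact head counts for approximate counts of every value in fixed memory.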
13. [Figure: timeline up to Tnow; Counting produces the aggregated features ($N^+$, $N^-$, $\ln N^+ - \ln N^-$, IsBackoff), which together with the original numeric features go to “Train predictor”.]
IP                 N+        N-
173.194.33.9       46964     993424
87.250.251.11      31        843
131.253.13.32      12        430
…                  …         …
REST               745623    13964931
query              N+        N-
facebook           281912    7957321
dozen roses        32791     640964
…                  …         …
REST               6321789   43477252
Query × AdId       N+        N-
facebook, ad1      54546     978964
facebook, ad2      232343    8431467
dozen roses, ad3   12973     430982
…                  …         …
REST               4419312   52754683
Train non-linear model on count-based features
• Counts, transforms, lookup properties
• Additional features can be injected
Learning with Counts: combiner training
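A training row for the combiner can be sketched as the concatenation of count features from every table with the original numeric features. The tables below are abbreviated from this slide (column order N+, N- is assumed); the `hour` feature and all function names are made up for illustration.

```python
import math


def count_features(value, table, rest):
    """[N+, N-, ln N+ - ln N-, IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = rest if is_backoff else table[value]
    return [n_pos, n_neg, math.log(n_pos) - math.log(n_neg), int(is_backoff)]


def build_row(example, tables, numeric_keys):
    """Concatenate count features from every table with raw numeric features."""
    row = []
    for fields, (table, rest) in tables.items():
        value = tuple(example[f] for f in fields)  # single or combined attribute
        row.extend(count_features(value, table, rest))
    row.extend(example[k] for k in numeric_keys)
    return row


# Key: tuple of attribute names. Value: (count table, REST row).
tables = {
    ("ip",): ({("173.194.33.9",): (46964, 993424)}, (745623, 13964931)),
    ("query", "adid"): ({("facebook", "ad1"): (54546, 978964)}, (4419312, 52754683)),
}
example = {"ip": "173.194.33.9", "query": "facebook", "adid": "ad2", "hour": 14}
row = build_row(example, tables, numeric_keys=["hour"])
```

Rows of this form, paired with labels, would then be handed to whichever non-linear learner is chosen; the learner itself is outside this sketch.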
14. [Figure: timeline with Ttrain and Tnow; Counting produces the aggregated features ($N^+$, $N^-$, $\ln N^+ - \ln N^-$, IsBackoff), shown alongside the original numeric features.]
IP                 N+        N-
173.194.33.9       46964     993424
87.250.251.11      31        843
131.253.13.32      12        430
…                  …         …
REST               745623    13964931
query              N+        N-
facebook           281912    7957321
dozen roses        32791     640964
…                  …         …
REST               6321789   43477252
URL × Country      N+        N-
url1, US           54546     978964
url2, CA           232343    8431467
url3, FR           12973     430982
…                  …         …
REST               4419312   52754683
• Counts are updated continuously
• Combiner re-training infrequent
Prediction with counts
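The split between continuously updated counts and an infrequently retrained combiner can be illustrated with a toy example. The linear scorer below is only a stand-in for the trained non-linear model; its weights, the add-one smoothing, and the class and function names are assumptions.

```python
import math


class CountTable:
    """Per-value (N+, N-) counts that keep updating after the model is trained."""

    def __init__(self, rest=(1, 1)):
        self.counts = {}
        self.rest = rest  # back-off counts for unseen values (kept fixed here)

    def update(self, value, label):
        n_pos, n_neg = self.counts.get(value, (0, 0))
        self.counts[value] = (n_pos + 1, n_neg) if label else (n_pos, n_neg + 1)

    def features(self, value):
        is_backoff = value not in self.counts
        n_pos, n_neg = self.rest if is_backoff else self.counts[value]
        # Add-one smoothing (an assumption) keeps the logarithms finite.
        log_odds = math.log(n_pos + 1) - math.log(n_neg + 1)
        return [n_pos, n_neg, log_odds, int(is_backoff)]


def predict(features, weights):
    """Stand-in for the trained combiner: a fixed linear score."""
    return sum(w * f for w, f in zip(weights, features))


table = CountTable()
weights = [0.0, 0.0, 1.0, -0.5]   # made-up weights, frozen at Ttrain
before = predict(table.features("url1"), weights)
for label in (1, 1, 0):           # observations arriving after Ttrain
    table.update("url1", label)
after = predict(table.features("url1"), weights)
```

The score for `url1` changes between `before` and `after` even though `weights` never does, which is the sense in which the model adapts without retraining.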
15. • State-of-the-art accuracy
• Good fit for map-reduce
• Modular (vs. monolithic)
• Learner can be tuned/monitored/replaced in isolation
• Monitorable, debuggable (this is HUGE in practice!)
• Temporal changes easy to monitor
• Easy emergency recovery (remove bot attacks, etc.)
• Decomposable predictions
• Error debugging (which feature can we blame…)
What is great about learning with Counts?
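The "good fit for map-reduce" point follows from counting being a sum per key. A single-process sketch of the map and reduce steps, with made-up examples and hypothetical function names:

```python
from collections import defaultdict
from itertools import chain


def map_example(example, label, attributes):
    """Map step: emit ((attribute, value), (pos, neg)) for each attribute."""
    delta = (1, 0) if label else (0, 1)
    return [((a, example[a]), delta) for a in attributes]


def reduce_counts(pairs):
    """Reduce step: sum the (pos, neg) deltas per (attribute, value) key."""
    totals = defaultdict(lambda: [0, 0])
    for key, (pos, neg) in pairs:
        totals[key][0] += pos
        totals[key][1] += neg
    return {key: tuple(v) for key, v in totals.items()}


log = [  # made-up labeled examples
    ({"ip": "1.2.3.4", "query": "skis"}, 1),
    ({"ip": "1.2.3.4", "query": "roses"}, 0),
    ({"ip": "5.6.7.8", "query": "skis"}, 0),
]
pairs = chain.from_iterable(map_example(x, y, ["ip", "query"]) for x, y in log)
counts = reduce_counts(pairs)
```

Because addition is associative and commutative, partial sums can be computed on separate nodes and merged in any order, which is the pattern a Hive `GROUP BY` with `COUNT` also runs.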
17. • HDInsight: large data storage and map-reduce processing
• Azure ML: cloud ML and analytics accessible anywhere
• Learning with Counts: intuitive, flexible large-scale ML solution
Putting it all together
18. Thanks for your time
Useful links:
http://azure.microsoft.com/ml - Sign up for your free Azure ML trial
http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
Need Azure ML for classroom teaching? - Contact the speakers
Other questions? - Contact the speakers
Speakers:
Misha Bilenko: mbilenko@Microsoft.com
Girish Nathan: ginathan@Microsoft.com