Cloudera Data Science Challenge
Doug Needham and Mark Nichols, P.E.
Introduction
 Doug Needham: Doug's LinkedIn, @dougneedham
 Mark Nichols: Mark’s LinkedIn
Data Science, Why does it matter?
 What is the only skill that matters for a data scientist?
 “the ability to use the resources available to them to solve a challenge.”
 Solving problems, the only skill you need to know
 The skill of solving problems.
 We both accomplished a lot in tackling this challenge. For some of the problems
we did well, for some we could improve.
 This challenge shows the ability to solve problems over and above the actual
“answers” we sought.
 I think too often we seek out people who have one particular skill or another, rather
than general problem solving abilities.
 Certainly there is a time and a place for expertise with a particular set of skills. But
the skill of adaptability is often overlooked.
 Think on this, the next time you are considering who you need to assist in solving a
problem.
Cloudera Certified Professional: Data Scientist
 Intent of CCP:DS
 Demonstrate Knowledge in a Variety of Data Science Topics
 Demonstrate the Knowledge at Scale
 Requirements
 Pass Cloudera’s Data Science Essentials Exam (DS-200)
 Pass Cloudera’s Data Science Challenge (semi-annual; use simulated data to solve real problems)
 Change coming in Q2 2015
[Figure: Venn diagram with the labels SME Expertise, Math & Statistics Knowledge, Computer Science Skills, Data Science, and Machine Learning]
Reuters Article on CCP:DS certification
Topics: Data Acquisition, Data Evaluation, Data Transformation, Machine Learning, Clustering, Classification, Model Selection, Feature Selection, Probability, Visualization, Optimization, Collaborative Filtering
Fall 2014 Data Science Challenge
 Timeline: October 21, 2014 to January 21, 2015
 Each person sitting for the challenge has to submit individual
solutions for each problem.
 Problems:
 Problem 1: Smartfly – Predict probability of a flight being delayed.
 Problem 2: Almost Famous – Statistical analysis of web log data.
 Problem 3: Winklr – Who should follow whom.
Multiple Ways to Solve a Problem
100,000-ft overview of the solution tools used, by problem:
 Smartfly (ML – binary classification)
• Mark: Hive to explore the data; Python & MapReduce to format and clean the input; Spark MLlib for the model
• Doug: Data science at the command line (scripts, counts, summaries); “Pseudo Map-Reduce”; R plotting; Spark MLlib for predictions
 Almost Famous (spam filter & statistical analysis)
• Mark: Python to explore the data; Python to filter and answer questions
• Doug: Data science at the command line (scripts, counts, summaries); “Pseudo Map-Reduce”; SciPy for particular functions
 Winklr (social network analysis)
• Mark: Hive and command line to explore the data; Mahout, Spark, command line and Python to develop a hybrid recommender
• Doug: Gephi for analysis of subgraphs; Python to format the data; Spark GraphX for the solution; shell scripts to get the data into the required format
Smartfly – Problem Summary
 Motivation
 Client is an online travel service that provides timely travel information to their
customers
 Their product team has come up with an idea of using flight data to predict whether a
flight will be delayed and use that information to respond proactively.
 Given
 7,374,365 records of historic flight data at 279 airports and 17 airlines
 566,376 records of scheduled flight data
 Requirements
 Rank all scheduled flights in order of descending probability of delay
Smartfly – Raw Data (Starting Point)
Historic and Scheduled Data was provided in CSV format with the following fields in each row:
 1-Unique Flight ID (int)
 2-Year (int)
 3-Month (int)
 4-Day of Month (int)
 5-Day of Week (int)
 6-Scheduled Departure (HHMM)
 7-Scheduled Arrival (HHMM)
 8-Airline (string)
 9-Flight Number (int)
 10-Tail Number (string)
 11-Plane Model (string)
 12-Seat Configuration (string)
 13-Departure Delay in Minutes (int)
 14-Origin Airport (string)
 15-Destination Airport (string)
 16-Distance Travelled in Miles (int)
 17-Taxi In Time in Minutes (int)
 18-Taxi Out Time in Minutes (int)
 19-Cancelled (Boolean)
 20-Cancellation Code (string)
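A minimal sketch of reading one row with the 20 fields listed on this slide. It assumes comma-separated rows with no header and no quoted commas. The column names in `FIELDS`, the helper `hhmm_to_minutes`, and the sample row are illustrative inventions, not taken from the challenge materials.

```python
import csv
import io

# Illustrative names for the 20 columns, in the order listed on the slide.
FIELDS = [
    "flight_id", "year", "month", "day_of_month", "day_of_week",
    "sched_dep", "sched_arr", "airline", "flight_number", "tail_number",
    "plane_model", "seat_config", "dep_delay_min", "origin", "dest",
    "distance_miles", "taxi_in_min", "taxi_out_min", "cancelled", "cancel_code",
]


def hhmm_to_minutes(hhmm):
    """Convert an HHMM value such as '0715' or '715' to minutes after midnight."""
    value = int(hhmm)
    return (value // 100) * 60 + value % 100


def parse_rows(text):
    """Yield one dict per CSV row, keyed by the illustrative field names."""
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(FIELDS):
            continue  # skip malformed rows; a real run should count them
        yield dict(zip(FIELDS, row))


# Made-up sample row, for illustration only.
sample = "1,2014,10,21,2,0715,0930,AA,100,N123AA,B737,F12Y138,5,DFW,ORD,802,6,14,0,\n"
rows = list(parse_rows(sample))
```

Converting the HHMM fields to minutes makes it straightforward to bucket departures into blocks of hours later on.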
Machine Learning Algorithms for
Binary Predictions (Potential Paths)
 http://spark.apache.org/docs/1.2.0/mllib-guide.html
Model Evaluation Criteria
 Set the evaluation criteria prior to running
any models, similar to setting the null and
alternate hypothesis prior to conducting an
experiment
 Selected criteria: Area Under Receiver
Operating Characteristic Curve (auROC)
 Compare different models
 Independent of cutoff
 No cutoff assumptions required
Model Evaluation Criteria
 Area Under Receiver Operating
Characteristic Curve (auROC)
 Weighted Confusion Matrix
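To make the "independent of cutoff" point concrete, here is a small pure-Python sketch of auROC computed through the rank-sum identity: the area equals the probability that a randomly chosen delayed flight is scored above a randomly chosen on-time flight, with ties counted as one half. The function name `au_roc` is an assumption for this sketch; it is not the Spark implementation used in the challenge.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores and give every member the average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # 1-based ranks i+1 .. j
        n_pos_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * n_pos_in_run
        i = j

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)
```

Only the ordering of the scores matters here, so no probability threshold has to be chosen before two models can be compared.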
Data Exploration
 Used Hive primarily
SELECT Max(distance), Min(distance)
FROM sfhist
 Determined range of values for
each field
 Looked at delays by airline, airport,
plane model…
 Are there mismatches in data (ex.
Cancelled = 0, but a valid Cancel
Code is present)
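The same checks can be sketched in Python for a sample of rows. This assumes each row is a list of 20 strings in the raw column order, that the cancelled flag is encoded as 0 or 1, and that an empty string means no cancellation code; the function names are hypothetical.

```python
def column_range(rows, index):
    """Return (min, max) of an integer column, ignoring empty cells.

    Mirrors the Hive query SELECT Max(...), Min(...) for one field.
    """
    values = [int(r[index]) for r in rows if r[index].strip() != ""]
    return min(values), max(values)


def find_cancel_mismatches(rows):
    """Return rows whose cancelled flag and cancellation code disagree.

    Column 19 (index 18) is the cancelled flag and column 20 (index 19)
    is the cancellation code, following the raw data layout.
    """
    mismatches = []
    for row in rows:
        cancelled = row[18].strip() == "1"   # assumed 0/1 encoding
        has_code = row[19].strip() != ""     # assumed: empty means no code
        if cancelled != has_code:
            mismatches.append(row)
    return mismatches
```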
Input Data Manipulation
 Format to input for ML algorithm (LIBSVM format) using Python and
Map Reduce
 Created dictionaries of airports, airlines, plane models, seat
configurations & holidays
 LIBSVM – efficient sparse matrix
 0 10:1 13:1 46:1 51:1 52:1 67:1 77:1 82:1 106:1 674:1 804:1 3225:1
 1 9:1 42:1 45:1 54:1 54:1 75:1 77:1 84:1 291:1 458:1 801:1 3891:1
 Deal with errors & omissions in data
 Validate
 Manual calculation at the head/tail/changes
 Verify the correct number of records
Response: 0 = no delay, 1 = delayed
Features (index ranges): 1-12: Month; 13-43: Day; …; 1K-7K: Tail number; 7001+: Holidays
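A minimal sketch of the one-hot LIBSVM encoding described above. The month and day index ranges come from the slide; the airline offset of 1000 and the function name `to_libsvm` are hypothetical, chosen only to show how a dictionary-backed categorical field could be placed in its own index range.

```python
MONTH_OFFSET = 0       # months 1-12 map to indices 1-12 (from the slide)
DAY_OFFSET = 12        # days 1-31 map to indices 13-43 (from the slide)
AIRLINE_OFFSET = 1000  # hypothetical offset, for illustration only


def to_libsvm(label, month, day, airline, airline_index):
    """Format one record as a LIBSVM line of one-hot features.

    airline_index maps an airline code to its 1-based dictionary position.
    Unknown airlines simply contribute no feature.
    """
    indices = [MONTH_OFFSET + month, DAY_OFFSET + day]
    if airline in airline_index:
        indices.append(AIRLINE_OFFSET + airline_index[airline])
    # LIBSVM readers expect ascending feature indices with no repeats.
    indices = sorted(set(indices))
    return str(label) + " " + " ".join("%d:1" % i for i in indices)
```

For example, `to_libsvm(0, 10, 1, "AA", {"AA": 1})` yields `0 10:1 13:1 1001:1`: only the features that are set appear in the line, which is what keeps the sparse representation small.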
Train the Model
 Split the historic data into training and testing subsets
 Split randomly
 Split based on time
 Run the model in Spark
 Load the formatted input
 Set model parameters
 Run the model (train the SVM or Logistic Regression Model)
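The two ways of splitting the historic data can be sketched as follows. This is plain Python rather than Spark, and the record layout (dicts with `year`, `month`, `day` keys), the 20% test fraction, and the function names are assumptions for illustration.

```python
import random


def random_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def time_split(records, cutoff):
    """Train on flights before the cutoff date, test on the rest.

    cutoff is a (year, month, day) tuple; tuples compare field by field.
    """
    def date_of(record):
        return (record["year"], record["month"], record["day"])

    train = [r for r in records if date_of(r) < cutoff]
    test = [r for r in records if date_of(r) >= cutoff]
    return train, test
```

A time-based split is closer to the real task, since the scheduled flights to be scored come after the historic ones, while a random split is simpler and uses every season in both subsets.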
Test the model
 Use the model to predict delays in the test data and compare to determine the
auROC (Spark)
 Repeat using a range of iterations, model types (SVM/logistic regression), regularization
parameter (size of step), and regularization technique (L1/L2)
 Results
 Worst: auROC = 0.51 SVM using only flight times and default optimization settings
 Best: auROC = 0.68
 Logistic Regression
 L2 Regularization (2000 iterations, step = 0.0001)
 Categorical input for: month, day, weekday, time of day (6hr blocks), departing airport,
arrival airport, airline, seat configuration, flight number (type of flight), & holidays
 Represents a 36% improvement over random selection (auROC of 0.68 versus 0.50)
 Predict delays for the scheduled flights for submission
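As a rough illustration of what the tuning loop adjusts, here is a small pure-Python logistic regression trained by batch gradient descent with an L2 penalty on one-hot features. It is a stand-in for the Spark MLlib model, not the code used in the challenge; the parameter names (`iterations`, `step`, `reg_param`) and default values are assumptions. The last helper reproduces the improvement figure quoted above.

```python
import math


def train_logreg(data, n_features, iterations=200, step=0.5, reg_param=0.0):
    """Batch gradient descent for logistic regression with an L2 penalty.

    data is a list of (label, indices) pairs, where indices are the 1-based
    positions of the one-hot features set for that record.
    """
    weights = [0.0] * (n_features + 1)  # slot 0 holds the intercept
    n = float(len(data))
    for _ in range(iterations):
        grad = [0.0] * (n_features + 1)
        for label, indices in data:
            margin = weights[0] + sum(weights[i] for i in indices)
            error = 1.0 / (1.0 + math.exp(-margin)) - label
            grad[0] += error
            for i in indices:
                grad[i] += error
        for i in range(n_features + 1):
            # The intercept is left unpenalized.
            penalty = reg_param * weights[i] if i > 0 else 0.0
            weights[i] -= step * (grad[i] / n + penalty)
    return weights


def predict_proba(weights, indices):
    """Predicted probability of delay for one record."""
    margin = weights[0] + sum(weights[i] for i in indices)
    return 1.0 / (1.0 + math.exp(-margin))


def relative_improvement(au_roc, baseline=0.5):
    """Improvement of an auROC over the 0.5 expected from random ranking."""
    return (au_roc - baseline) / baseline
```

Each combination of iterations, step, and regularization would be trained this way and scored by auROC on the held-out data. With the best result, `relative_improvement(0.68)` is (0.68 - 0.50) / 0.50, or about 0.36.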
Smartfly Review
 Ability to use all of the data
 Unable to run an SVM / logistic regression model in R with ~6 million rows
 Spark completed the final model in ~10 min
 Can be used for any binary decisions process
 Issue loan or not
 Purchase stock or not
 Other ML algorithms, the basic process remains the same
 Linear regression – predict a value
 Clustering – segment your data for reporting
 Collaborative filter – recommend products to customers…
Winklr – Problem Summary
 Who should Follow whom?
 Winklr is a curiously popular social network for fans of the sitcom Happy Days.
Users can post photos, write messages, and most importantly, follow each other’s
posts and content. This helps users keep up with new content from their favorite
users on the site.
 Basically Winklr is a site that is set up similar to Twitter. We want to provide
recommendations on who to follow. We know that some people have “clicked”
on another user (I interpret this as a “Favorite”, or a “Re-Tweet”)
Make sense of this:
My solution
 Type of problem: Graph Analysis
 Create a Master Graph.
 Run Page Rank to identify centrality.
 Create many small graphs for individual users.
 Mask the Master Graph, and PageRank Graph.
 For each candidate vertex to be followed, multiply together its centrality
(PageRank), its in-degree, and the inverse of the length of the path from this
particular user to that candidate.
 This code runs in about 60 hours using Spark GraphX.
 Code: Problem3.sh, and AnalyzeGraph.scala
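The scoring idea in the list above can be sketched on a toy graph in plain Python. This is not the GraphX code from AnalyzeGraph.scala: the PageRank here is the standard normalized power iteration, an edge `(src, dst)` is assumed to mean "src follows dst", and all function names and defaults are hypothetical.

```python
from collections import deque


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling nodes spread their rank evenly."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def path_lengths(edges, start):
    """Breadth-first hop counts from start, following edge direction."""
    out_links = {}
    for src, dst in edges:
        out_links.setdefault(src, []).append(dst)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in out_links.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def recommend(edges, user, top_n=3):
    """Rank candidates by centrality * in-degree * (1 / path length)."""
    rank = pagerank(edges)
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1
    already = {dst for src, dst in edges if src == user}
    scores = {}
    for cand, hops in path_lengths(edges, user).items():
        if cand == user or cand in already:
            continue
        scores[cand] = rank[cand] * in_degree.get(cand, 0) * (1.0 / hops)
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]
```

On the edges `a->b, b->c, c->d, e->c`, a call to `recommend(edges, "a")` ranks `c` ahead of `d`: `c` is closer to `a` and has two followers, while `d` is three hops away with one.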
Doug’s Problem Solving approach
 This is the approach I took, and may or may not be useful for others to apply.
 Analysis. I started with some basic numbers, and just browsing through the data with the “Data Science at the Command
line toolkit”. This is very handy for getting a feel for things.
 Based on some general understanding this analysis provided, create a “pipeline”
 Generally the data has to be transformed to a usable structure for the particular method of solving the problem.
 Do some basics with the problem solving method, Stats, ML, Graph, etc…
 Get some data back out of that tool, then format output to specification.
 Iterate.
 I did this for problem 1, moved to problem 2, then finally problem 3. Then went back to 1, back to 2, back to 3.
 This method allowed me to give myself some “space” and look at each problem with fresh eyes on more
than one occasion.
 Breaking the basics down of Input, Process, Output for each problem allowed me to have “working” code for each
problem really quickly, then through tuning, analysis, research, and some time to think about the problem, I was able to
come up with each unique solution.
 It also allows me to refactor the code, having given each problem time to “rest”.
 Very much like a painting, broad strokes first, details emerge as the painting progresses.
 Another benefit is, if I am able to get the data all the way through the pipeline, it becomes obvious where the
performance bottlenecks are for the pipeline.
 This method does take a bit of time.
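The Input, Process, Output pipeline with a view of where the time goes might look like the sketch below. The stage functions and the toy data (flight id and delay probability per line) are hypothetical placeholders; the point is only that timing each stage separately makes the bottleneck visible once data flows end to end.

```python
import time


def run_pipeline(stages, data):
    """Run (name, function) stages in order and record seconds per stage."""
    timings = []
    for name, func in stages:
        start = time.perf_counter()
        data = func(data)
        timings.append((name, time.perf_counter() - start))
    return data, timings


# Hypothetical stages: parse lines, rank by descending score, format output.
stages = [
    ("input", lambda text: [line.split(",") for line in text.splitlines()]),
    ("process", lambda rows: sorted(rows, key=lambda r: float(r[1]), reverse=True)),
    ("output", lambda rows: "\n".join(r[0] for r in rows)),
]

result, timings = run_pipeline(stages, "f1,0.2\nf2,0.9\nf3,0.5")
slowest_stage = max(timings, key=lambda t: t[1])[0]
```

Because each stage is just a function, a crude first version can be swapped for a better one later without touching the rest of the pipeline.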
Graph Analysis
 As Graphs get really large it becomes difficult to visualize them.
 However, I was able to “subset” the master graph based on the
recommendation output of my process.
 I was expecting to see one big clump of nodes tightly connected.
This would be the “Target” to follow.
 I was also expecting to see two smaller clumps of nodes, loosely
connected to the larger clump. These are the “followers”: as we
recommend that they follow the more popular node,
they become more closely connected to this user.
 Here is the output from Gephi that shows whether the code worked
or not.
This is what I expected to see
Looks good, except I was wrong.
 The challenge is looking for those “Likely” to follow someone.
 So this part called for something a little different than what I coded.
 It appears they were looking for the neighbors of the people that
were already being followed.
 This is a much less complicated problem than I actually solved.
 I look forward to seeing what Data Science Challenge 4 will look
like.
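One reading of "the neighbors of the people that were already being followed" is a followees-of-followees count, sketched below. This is an interpretation of the slide's guess, not the official scoring method; the data layout (a dict from user to the set of users they follow) and the function name are assumptions.

```python
def friends_of_friends(follows, user, top_n=3):
    """Suggest accounts followed by the people this user already follows.

    follows maps each user to the set of users they follow. Candidates are
    ranked by how many of the user's followees follow them, ties by name.
    """
    already = follows.get(user, set())
    counts = {}
    for followed in already:
        for cand in follows.get(followed, set()):
            if cand != user and cand not in already:
                counts[cand] = counts.get(cand, 0) + 1
    return sorted(counts, key=lambda c: (-counts[c], c))[:top_n]
```

Compared with the centrality-based approach, this needs only a two-hop lookup per user, which is consistent with the remark that it is a much less complicated problem.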
Where to go from here?
 Spark.
 Scala.
 Learn these topics.
 Teach these topics.
 Especially for folks planning on sitting for Data Science challenge 4: Learn
Scala. Learn Spark.
 Oh, and keep studying about Graphs…
 For an example of what not to do: Doug's github link
 Recent change – This is apparently the final Data Science challenge. Future
CCP:DS certs will be based on a testing format.

 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

Cloudera Data Science Challenge

  • 1. Cloudera Data Science Challenge
      Doug Needham
      Mark Nichols, P.E.
  • 2. Introduction
      - Doug Needham (LinkedIn, @dougneedham)
      - Mark Nichols (LinkedIn)
  • 3. Data Science, Why does it matter?
      - What is the only skill that matters for a data scientist?
          - “the ability to use the resources available to them to solve a challenge.”
      - Solving problems, the only skill you need to know
          - The skill of solving problems.
      - We both accomplished a lot in tackling this challenge. For some of the problems we did well, for some we could improve.
      - This challenge shows the ability to solve problems over and above the actual “answers” we sought.
      - I think too often we seek out people who have one particular skill or another, rather than general problem solving abilities.
      - Certainly there is a time and a place for expertise with a particular set of skills. But the skill of adaptability is often overlooked.
      - Think on this, the next time you are considering who you need to assist in solving a problem.
  • 4. Cloudera Certified Professional: Data Scientist
      - Intent of CCP:DS
          - Demonstrate Knowledge in a Variety of Data Science Topics
          - Demonstrate the Knowledge at Scale
      - Requirements
          - Pass Cloudera’s Data Science Essentials Exam (DS-200)
          - Pass Cloudera’s Data Science Challenge (semi-annual; use simulated data to solve real problems)
          - Change coming in Q2 2015
      - Diagram labels: SME Expertise; Math & Statistics Knowledge; Computer Science Skills; Data Science; Machine Learning
      - Link: Reuters Article on CCP:DS certification
      - Topics: Data Acquisition, Data Evaluation, Data Transformation, Machine Learning, Clustering, Classification, Model Selection, Feature Selection, Probability, Visualization, Optimization, Collaborative Filtering
  • 5. Fall 2014 Data Science Challenge
      - Timeline: October 21, 2014 to January 21, 2015
      - Each person sitting for the challenge has to submit individual solutions for each problem.
      - Problems:
          - Problem 1: Smartfly – Predict probability of a flight being delayed.
          - Problem 2: Almost Famous – Statistical analysis of web log data.
          - Problem 3: Winklr – Who should follow whom.
  • 6. Multiple Ways to Solve a Problem
      100,000 ft high overview of solution and tools used, by problem and by person:
      - Smartfly (ML – binary classification)
          - Mark: Hive to explore the data; Python & MapReduce to format and clean the input; Spark MLlib for the model.
          - Doug: Data Science at the Command Line (scripts, counts, summaries); “Pseudo Map-Reduce”; R plotting; Spark MLlib for predictions.
      - Almost Famous (spam filter & statistical analysis)
          - Mark: Python to explore the data; Python to filter and answer questions.
          - Doug: Data Science at the Command Line (scripts, counts, summaries); “Pseudo Map-Reduce”; SciPy for particular functions.
      - Winklr (social network analysis)
          - Mark: Hive and command line to explore the data; Mahout, Spark, command line and Python to develop a hybrid recommender.
          - Doug: Gephi for analysis of subgraphs; Python to format the data; Spark GraphX for the solution; shell scripts to get the data in the required format.
  • 7. Smartfly – Problem Summary
      - Motivation
          - Client is an online travel service that provides timely travel information to their customers
          - Their product team has come up with an idea of using flight data to predict whether a flight will be delayed and use that information to respond proactively.
      - Given
          - 7,374,365 records of historic flight data at 279 airports and 17 airlines
          - 566,376 records of scheduled flight data
      - Requirements
          - Rank all scheduled flights in order of descending probability of delay
  • 8. Smartfly – Raw Data (Starting Point)
      Historic and scheduled data was provided in CSV format with the following fields in each row:
      1. Unique Flight ID (int)
      2. Year (int)
      3. Month (int)
      4. Day of Month (int)
      5. Day of Week (int)
      6. Scheduled Departure (HHMM)
      7. Scheduled Arrival (HHMM)
      8. Airline (string)
      9. Flight Number (int)
      10. Tail Number (string)
      11. Plane Model (string)
      12. Seat Configuration (string)
      13. Departure Delay in Minutes (int)
      14. Origin Airport (string)
      15. Destination Airport (string)
      16. Distance Travelled in Miles (int)
      17. Taxi In Time in Minutes (int)
      18. Taxi Out Time in Minutes (int)
      19. Cancelled (Boolean)
      20. Cancellation Code (string)
  • 9. Machine Learning Algorithms for Binary Predictions (Potential Paths)
      - http://spark.apache.org/docs/1.2.0/mllib-guide.html
  • 10. Model Evaluation Criteria
      - Set the evaluation criteria prior to running any models, similar to setting the null and alternate hypothesis prior to conducting an experiment
      - Selected criteria: Area Under Receiver Operating Characteristic Curve (auROC)
          - Compare different models
          - Independent of cutoff
          - No cutoff assumptions required
  • 11. Model Evaluation Criteria
      - Area Under Receiver Operating Characteristic Curve (auROC)
      - Weighted Confusion Matrix
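The auROC criterion named on slides 10 and 11 can be illustrated without Spark. The following is a minimal sketch, not the presenters' code: it computes auROC from labels and scores using the rank-sum identity (the probability that a randomly chosen delayed flight scores higher than a randomly chosen on-time flight). The function name `au_roc` and the toy inputs are assumptions made for illustration.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 0/1 values (1 = delayed); scores: higher means more likely delayed.
    No cutoff is needed, which is why the metric is cutoff-independent.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied items share the average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


# Toy example (made-up labels and scores).
toy_auc = au_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A model that scores every flight identically gets 0.5 under this definition, which is the "random selection" baseline the results on slide 15 are compared against.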
  • 12. Data Exploration
      - Used Hive primarily: SELECT Max(distance), Min(distance) FROM sfhist
      - Determined range of values for each field
      - Looked at delays by airline, airport, plane model…
      - Are there mismatches in data (ex. Cancelled = 0, but a valid Cancel Code is present)?
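The mismatch check mentioned on slide 12 (Cancelled = 0 but a cancellation code present) can also be sketched in plain Python rather than Hive. This is an illustrative sketch only: the column positions follow the field list on slide 8, while the encodings ("0" for not cancelled, an empty string or "NA" for a missing code) and the sample rows are assumptions, not facts about the challenge data.

```python
import csv
import io

# Zero-based positions of fields 19 and 20 from the raw-data slide.
CANCELLED_COL = 18
CANCEL_CODE_COL = 19


def find_cancel_mismatches(rows):
    """Yield flight IDs where Cancelled is 0 but a cancellation code is present.

    Assumes "0" means not cancelled and that a missing code is "" or "NA".
    """
    for row in rows:
        cancelled = row[CANCELLED_COL].strip()
        code = row[CANCEL_CODE_COL].strip()
        if cancelled == "0" and code not in ("", "NA"):
            yield row[0]


# Two made-up rows in the 20-field layout; the second has a stray code "B".
sample = io.StringIO(
    "1,2014,1,5,7,0900,1130,DL,101,N123,B737,3-3,12,ATL,JFK,760,5,14,0,\n"
    "2,2014,1,5,7,1000,1230,B6,202,N456,A320,3-3,0,BOS,JFK,187,4,11,0,B\n"
)
mismatches = list(find_cancel_mismatches(csv.reader(sample)))
```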
  • 13. Input Data Manipulation
      - Format to input for ML algorithm (LIBSVM format) using Python and MapReduce
          - Created dictionaries of airports, airlines, plane models, seat configurations & holidays
          - LIBSVM – efficient sparse matrix
              0 10:1 13:1 46:1 51:1 52:1 67:1 77:1 82:1 106:1 674:1 804:1 3225:1
              1 9:1 42:1 45:1 54:1 54:1 75:1 77:1 84:1 291:1 458:1 801:1 3891:1
          - Response: 0 = no delay, 1 = delayed
          - Features: 1-12: Month; 13-43: Day; …; 1K-7K: Tail number; 7001+: Holidays
      - Deal with errors & omissions in data
      - Validate
          - Manual calculation at the head/tail/changes
          - Verify the correct number of records
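The dictionary-based one-hot encoding into LIBSVM lines described on slide 13 can be sketched as follows. The month (1-12) and day (13-43) index ranges come from the slide; the airline offset of 100, the helper names, and the sample values are hypothetical and chosen only to keep the example small.

```python
MONTH_OFFSET = 0       # month m maps to feature m (1-12, per the slide)
DAY_OFFSET = 12        # day d maps to feature 12 + d (13-43, per the slide)
AIRLINE_OFFSET = 100   # hypothetical start of the airline block


def build_index(values, offset):
    """Map each distinct category to a feature index starting at offset + 1."""
    return {v: offset + i + 1 for i, v in enumerate(sorted(set(values)))}


def to_libsvm(label, month, day, airline, airline_index):
    """Encode one flight as '<label> <index>:1 ...' with ascending indices."""
    indices = {
        MONTH_OFFSET + month,
        DAY_OFFSET + day,
        airline_index[airline],
    }
    features = " ".join("%d:1" % i for i in sorted(indices))
    return "%d %s" % (label, features)


# Build the airline dictionary from observed values, then encode one record.
airline_index = build_index(["DL", "B6", "DL"], AIRLINE_OFFSET)
line = to_libsvm(1, month=3, day=15, airline="DL", airline_index=airline_index)
```

Only the non-zero features are written, which is what makes the format compact for wide categorical inputs such as tail numbers.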
  • 14. Train the Model
      - Split the historic data into training and testing subsets
          - Split randomly
          - Split based on time
      - Run the model in Spark
          - Load the formatted input
          - Set model parameters
          - Run the model (train the SVM or Logistic Regression Model)
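The two splitting strategies listed on slide 14 can be sketched in plain Python. This is an assumption-laden illustration rather than the Spark code used in the talk: records are taken to be tuples whose first element is a (year, month, day) date, and the 80/20 fraction, the seed, and the cutoff date are arbitrary choices.

```python
import random


def random_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


def time_split(records, cutoff):
    """Train on flights before the cutoff date, test on flights on or after it."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


# Ten made-up records, one per month, keyed by (year, month, day).
records = [((2014, m, 1), m) for m in range(1, 11)]
rand_train, rand_test = random_split(records)
time_train, time_test = time_split(records, cutoff=(2014, 9, 1))
```

A time-based split is closer to the real task, since the scheduled flights to be ranked lie after the historic data, whereas a random split lets the model see flights from the same days it is tested on.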
  • 15. Test the Model
      - Use the model to predict delays in the test data and compare against the actual outcomes to determine the auROC (Spark)
      - Repeat using a range of iterations, model types (SVM/logistic regression), regularization parameters and step sizes, and regularization techniques (L1/L2)
      - Results
          - Worst: auROC = 0.51 (SVM using only flight times and default optimization settings)
          - Best: auROC = 0.68
              - Logistic regression
              - L2 regularization (2000 iterations, step = 0.0001)
              - Categorical input for: month, day, weekday, time of day (6hr blocks), departing airport, arrival airport, airline, seat configuration, flight number (type of flight), & holidays
              - Represents a 36% improvement over random selection (auROC of 0.50): (0.68 − 0.50) / 0.50 = 0.36
      - Predict on the scheduled data for submission
  • 16. Smartfly Review
      - Ability to use all of the data
          - Unable to run an SVM / logistic regression model in R with ~6 million rows
          - Spark completed final model in ~10 min
      - Can be used for any binary decision process
          - Issue loan or not
          - Purchase stock or not
      - Other ML algorithms: the basic process remains the same
          - Linear regression – predict a value
          - Clustering – segment your data for reporting
          - Collaborative filter – recommend products to customers…
  • 17. Winklr – Problem Summary
      - Who should follow whom?
      - Winklr is a curiously popular social network for fans of the sitcom Happy Days. Users can post photos, write messages, and most importantly, follow each other’s posts and content. This helps users keep up with new content from their favorite users on the site.
      - Basically Winklr is a site that is set up similar to Twitter. We want to provide recommendations on who to follow. We know that some people have “clicked” on another user (I interpret this as a “Favorite”, or a “Re-Tweet”).
  • 18. Make sense of this:
  • 19. My solution
      - Type of problem: Graph Analysis
      - Create a Master Graph.
      - Run PageRank to identify centrality.
      - Create many small graphs for individual users.
      - Mask the Master Graph and the PageRank Graph.
      - Multiply together the centrality, the in-degree of a possible candidate, and the inverse of the length of the path from this particular user to the candidate vertex to be followed.
      - This code runs in about 60 hours using Spark GraphX.
      - Code: Problem3.sh and AnalyzeGraph.scala
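The scoring rule on slide 19 (centrality times in-degree times the inverse path length) can be sketched on a toy graph in plain Python. This is not the GraphX code referenced on the slide (Problem3.sh and AnalyzeGraph.scala are not reproduced here); the power-iteration PageRank, the breadth-first path lengths, the function names, and the toy edges are all assumptions made for illustration.

```python
from collections import deque


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over (follower, followed) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes  # dangling nodes spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


def path_lengths(edges, start):
    """Breadth-first hop counts from start along follow edges."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in out.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def recommend(edges, user, top_n=3):
    """Score = PageRank * in-degree * (1 / hops), skipping existing follows."""
    rank = pagerank(edges)
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1
    already = {dst for src, dst in edges if src == user}
    scores = {}
    for cand, hops in path_lengths(edges, user).items():
        if cand == user or cand in already:
            continue
        scores[cand] = rank[cand] * in_degree.get(cand, 0) / hops
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


# Toy follow graph: "c" is followed by three users and is two hops from "a".
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("e", "c"), ("f", "c"), ("c", "g")]
ranks = pagerank(edges)
picks = recommend(edges, "a")
```

On a graph of the size used in the challenge, the per-user path search is the expensive part, which is consistent with the long runtime reported on the slide.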
  • 20. Doug’s Problem Solving approach
      - This is the approach I took, and may or may not be useful for others to apply.
      - Analysis. I started with some basic numbers, and just browsing through the data with the “Data Science at the Command Line” toolkit. This is very handy for getting a feel for things.
      - Based on some general understanding this analysis provided, create a “pipeline”:
          - Generally the data has to be transformed to a usable structure for the particular method of solving the problem.
          - Do some basics with the problem solving method (stats, ML, graph, etc.).
          - Get some data back out of that tool, then format output to specification.
          - Iterate.
      - I did this for problem 1, moved to problem 2, then finally problem 3. Then went back to 1, back to 2, back to 3.
      - This method allowed me to give some “space” to myself, and actually look at each problem with fresh eyes on more than one occasion.
      - Breaking the basics down into Input, Process, Output for each problem allowed me to have “working” code for each problem really quickly; then through tuning, analysis, research, and some time to think about the problem, I was able to come up with each unique solution.
      - It also allows me to refactor the code, having given each problem time to “rest”.
      - Very much like a painting: broad strokes first, details emerge as the painting progresses.
      - Another benefit is, if I am able to get the data all the way through the pipeline, it becomes obvious where the performance bottlenecks are for the pipeline.
      - This method does take a bit of time.
  • 21. Graph Analysis
      - As graphs get really large it becomes difficult to visualize them.
      - However, I was able to “subset” the master graph based on the recommendation output of my process.
      - I was expecting to see one big clump of nodes tightly connected. This would be the “Target” to follow.
      - I was also expecting to see two smaller clumps of nodes, loosely connected to the larger clump. These are the “followers”; as we make a recommendation to them to follow the more popular node, they will be more closely connected to this user.
      - Here is the output from Gephi that shows whether the code worked or not.
  • 22. This is what I expected to see
  • 23. Looks good, except I was wrong.
      - The challenge is looking for those “Likely” to follow someone.
      - So this part called for something a little different than what I coded.
      - It appears they were looking for the neighbors of the people that were already being followed.
      - This is a much less complicated problem than I actually solved.
      - I look forward to seeing what Data Science Challenge 4 will look like.
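The simpler approach suggested on slide 23, recommending the neighbors of people a user already follows, can be sketched as below. The slide does not define "neighbors" precisely, so this sketch assumes it means the accounts followed by a user's existing followees, ranked by how many followees point at them. The function name and the toy edges are hypothetical.

```python
from collections import Counter


def neighbors_of_followed(edges, user, top_n=3):
    """Recommend accounts followed by the people this user already follows.

    edges are (follower, followed) pairs. Candidates the user already
    follows, and the user themselves, are excluded.
    """
    follows = {}
    for src, dst in edges:
        follows.setdefault(src, set()).add(dst)
    mine = follows.get(user, set())

    counts = Counter()
    for followed in mine:
        for cand in follows.get(followed, set()):
            if cand != user and cand not in mine:
                counts[cand] += 1

    # Most shared candidates first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cand for cand, _ in ranked[:top_n]]


# Toy graph: "a" follows "b" and "c"; both follow "d", only "c" follows "e".
toy_edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("c", "e"), ("b", "a")]
suggestions = neighbors_of_followed(toy_edges, "a")
```

This needs only a two-hop lookup per user, with no global centrality pass, which matches the slide's remark that it is a much less complicated problem.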
  • 24. Where to go from here?
      - Spark.
      - Scala.
      - Learn these topics.
      - Teach these topics.
      - Especially for folks planning on sitting for Data Science Challenge 4: Learn Scala. Learn Spark.
      - Oh, and keep studying about graphs…
      - For an example of what not to do: Doug's GitHub link
      - Recent change: this is apparently the final Data Science Challenge. Future CCP:DS certs will be based on a testing format.

Editor's Notes

  1. Doug and Mark intro
  2. Doug and Mark intro
  3. Doug
  4. Mark
  5. Mark
  6. Doug and Mark
  7. Mark Switch to command line early – will start Model and then move back to this presentation. Wait on prompt from Mark to go to cmd line or discuss prior (dependent on how trial goes prior to presentation).
  8. Animation Blacks out the fields that are NA (null) in the scheduled data set. Click/down arrow to start this animation when Mark starts to mention the items that are not provided/nulled out in the sample
  9. Animation: 1st click when I start talking about Classification; 2nd click when I mention SVM
  10. True Positive Rate = Sensitivity = TP / (TP + FN); False Positive Rate = 1 – Specificity, where Specificity = TN / (TN + FP)
  11. Mark
  12. Mark
  13. Switch to Notepad upon cue. (“map_sfSparkHist.py”)
  14. Mark DL and B6 had the highest average probability of delay
  15. Last slide for Mark, Doug is up next
  16. Back to Doug
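
The rates in note 10 are the two axes of the ROC curve at a single cutoff. As a small sketch using the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)), they can be computed from 0/1 labels and predictions as follows; the function name and the toy data are assumptions for illustration.

```python
def confusion_rates(labels, predictions):
    """Return sensitivity, specificity and false positive rate for 0/1 data."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    sensitivity = tp / float(tp + fn)   # true positive rate
    specificity = tn / float(tn + fp)   # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1.0 - specificity,
    }


# Made-up labels (1 = delayed) and cutoff-based predictions.
rates = confusion_rates([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

Sweeping the cutoff and plotting sensitivity against the false positive rate traces out the ROC curve whose area is the auROC used on the model evaluation slides.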