Machine Learning
With Apache Spark
CodeMash, Sandusky, Ohio, Jan 5-8, 2016
David Taieb
STSM-IBM Cloud Data Services
©2015 IBM Corporation
Introduction
David Taieb
david_taieb@us.ibm.com
Developer Advocate
IBM Cloud Data Services
Our mission:
We are here to help developers realize their most ambitious projects.
https://developer.ibm.com/clouddataservices/connect/
Big data, cloud and the rise of business Analytics
‣ Data being collected by enterprises grows exponentially: ERP, embedded systems (IoT)
‣ Cloud, with high availability and huge capacity, makes more data available for analytics
‣ Big data and cloud create new opportunities:
- Organizations: more effective decision-making process, richer client interactions
- Business users: discover new insights, better decision-making process
- Developers: access to diverse data sources and new tools that increase productivity
Why Business Analytics with big data
“In God we trust.
All others bring data”
W. Edwards Deming
‣ Every day, companies make bet-the-business
decisions about their customers, competitors and
new products
‣ Time available for decision-making is shrinking
(sometimes real-time)
‣ As more and more companies go digital, data
becomes the world’s newest resource for
competitive advantage
‣ Decision making has moved from the elite few to
the empowered many
‣ Few organizations can keep pace with the
appetite for data
Business Analytics Types
Descriptive Analytics: Look at the reasons for past success or failure
Predictive Analytics: What is probably going to happen in the future?
Prescriptive Analytics: What are my best actions?
• Use interactive querying and visualization to explore and communicate data
• Discover insights and trends
• Correlation between two seemingly unrelated variables
• Data mining
• Generate hypotheses and models
• Predict occurrence of future events using probability (confidence)
• Product recommendations
• Classification
• Help make the right decision based on the data
• Find the optimal solution to a given problem
Taking Analytics a step further with Cognitive Systems
‣ Use natural language processing and machine learning algorithms to unlock knowledge from massive amounts of structured and unstructured data
Decide
• Ingest and analyze domain sources, info models
• Generate evidence based decisions with confidence
• Learn with new outcomes and actions
• e.g. - Next generation Apps → Probabilistic Apps
Ask
• Leverage vast amounts of data
• Ask questions for greater insights
• Natural language inquiries
• e.g. - Next generation Chat
Discover
• Find the rationale for given answers
• Prompt for inputs to yield improved responses
• Inspire considerations of new ideas
• e.g. - Next generation Search → Discovery
IBM Watson
IBM Cloud Data Services
Resources for developers to get, build, and analyze on the IBM Cloud
What is Spark
Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes
Spark Core Libraries
‣ Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
‣ Spark SQL: executes SQL statements
‣ Spark Streaming: performs streaming analytics using micro-batches
‣ MLlib (machine learning): common machine learning and statistical algorithms
‣ GraphX (graph): distributed graph processing framework
Key reasons for interest in Spark
‣ Fast distributed data processing
- In-memory storage greatly reduces disk I/O
- Up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce
‣ Open Source
- Largest project and one of the most active on Apache
- Vibrant, growing community of developers continuously improves the code base and extends capabilities
- Fast adoption in the enterprise (IBM, Databricks, etc.)
‣ Web Scale
- Fault tolerant: seamlessly recomputes data lost to hardware failure
- Scalable: easily increase the number of worker nodes
- Flexible job execution: batch, streaming, interactive
- Easily handles petabytes of data without special code handling
- Compatible with the existing Hadoop ecosystem
‣ Productive
- Unified programming model across a range of use cases
- Rich and expressive APIs hide the complexities of parallel computing and worker node management
- Support for Java, Scala, Python and R: less code written
- Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
IBM is all-in on its commitment to Spark
‣ Foster Community
- Educate 1M+ data scientists and engineers via online courses
- Sponsor AMPLab, creators and evangelists of Spark
‣ Infuse the Portfolio
- Integrate Spark throughout portfolio
- 3,500 employees working on Spark-related topics
- Spark however customers want it: standalone, platform or products
‣ Contribute to the Core
- Launch Spark Technology Cluster (STC), 300 engineers
- Open source SystemML
- Partner with Databricks
Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
Spark MLlib
‣ Extension to the Spark Core API that provides a library of easy-to-use machine learning algorithms
‣ Highly scalable: leverages Spark's ability to work with massive amounts of data
‣ Fast: designed for parallel computing
‣ Covers common machine learning algorithms:
- Regression
- Classification
- Clustering
- Recommender Systems
- Text Analytics
What is Machine Learning and where is it used
‣Subfield of computer science that focuses on getting computers to
learn from data:
- Recognize patterns
- Make predictions
‣Example use:
- Spam filters
- Netflix recommendations
- Self-driving cars
- Watson
- …
Typical Machine Learning Flow diagram
‣ Data Acquisition
‣ Data Preparation
- Cleansing
- Shaping
- Enrichment
‣ Data Annotation (Ground Truth)
- Training Set
- Test Set
- Blind Set
‣ Model Training: Train Model
‣ Model Testing
‣ Iterative Cross-Validation: Evaluate Performance and optimize model
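The three data sets in the flow above can be illustrated with a minimal sketch in plain Python. It is not taken from the slides: the function name `split_dataset`, the 70/20/10 proportions and the fixed seed are assumptions chosen only for illustration.

```python
import random


def split_dataset(records, train_frac=0.7, test_frac=0.2, seed=42):
    """Shuffle records and split them into training, test and blind sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and test sets becomes the blind set.
    """
    if not 0 < train_frac + test_frac < 1:
        raise ValueError("train_frac + test_frac must leave room for a blind set")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    blind = shuffled[n_train + n_test:]
    return training, test, blind


records = list(range(100))  # stand-in for labeled records
training_set, test_set, blind_set = split_dataset(records)
print(len(training_set), len(test_set), len(blind_set))
```

The training and test sets drive the iterative train-and-evaluate loop, while the blind set stays untouched until the final performance measurement.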
MLlib Algorithm Overview
• Predictive analytics
• Recommendations
- Collaborative Filtering
- Matrix Factorization
• Feature extraction and Transformation
- TF-IDF
- HashingTF
- Word2Vec
- StandardScaler
- Normalizer
• Model Evaluation/Metrics
- Binary Classification Metrics
- Multi Class Metrics
- Regression Metrics
Predictive analytics
Supervised Learning (requires Ground-Truth)
‣ Continuous Output
• Regression
- Linear
- Ridge
- Lasso
- Isotonic
• Decision Tree
• RandomForest
• GradientBoostedTree
‣ Discrete Output
• Classification
- Logistic Regression
- SVM
- NaiveBayes
• Decision Tree
• RandomForest
• GradientBoostedTree
• K-NN (available as add-on Spark package)
Unsupervised Learning (no Ground-Truth data required)
• Clustering
- KMeans
- Gaussian Mixture
• Dimensionality Reduction
- PCA
- SVD
• FP-Growth
Featured demo: Flight Delay Predictor
‣ Use training data collected from FlightStats and enriched with weather observations from the “Insight for Weather” service on Bluemix
‣ Train a multi-class classifier that, given a flight and its departure weather observations, can predict the flight delay class:
- 0 = Canceled
- 1 = On Time
- 2 = Delay less than 2 hours
- 3 = Delay between 2 and 4 hours
- 4 = Delay more than 4 hours
‣ Provide metrics for each algorithm
- Accuracy
- Precision
- Recall
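As a rough illustration of the five delay classes listed above, the labeling step could look like the sketch below. The function name `delay_class`, the minute-based input and the `on_time_threshold` parameter are assumptions; the slides do not say how "On Time" is defined or how the boundary values at exactly 2 and 4 hours are assigned.

```python
def delay_class(delay_minutes, canceled=False, on_time_threshold=0):
    """Map a flight's delay to one of the five classes used in the demo.

    on_time_threshold is a hypothetical tolerance in minutes; the boundary
    handling at exactly 120 and 240 minutes is also an assumption.
    """
    if canceled:
        return 0  # Canceled
    if delay_minutes <= on_time_threshold:
        return 1  # On Time
    if delay_minutes < 120:
        return 2  # Delay less than 2 hours
    if delay_minutes <= 240:
        return 3  # Delay between 2 and 4 hours
    return 4  # Delay more than 4 hours


sample_delays = [0, 45, 180, 300]
labels = [delay_class(minutes) for minutes in sample_delays]
print(labels)
print(delay_class(0, canceled=True))
```

Turning the raw delay into a small set of classes is what makes this a multi-class classification problem rather than a regression on delay minutes.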
Architecture
[Architecture diagram. Labels: Weather; Airports, Flight Schedules, Flight Status, Metadata; Simple Data Pipes custom connector, run every 24 hours; Training Set, Test Set, Blind Set; Notebook]
Get
‣ Identify data sources:
- flightstats.com: https://developer.flightstats.com
  - Airport metadata: FS Code, geolocation,…
  - Flight Schedules
  - Flight Status
- Weather Observations
  - Insight for Weather on Bluemix
‣ Storage:
- Cloudant
‣ Tool used:
- Simple Data Pipes custom connector to build the Training, Test and Blind data sets
‣ Constraints:
- The weather service provides past observations only as far as 24 hours back
- The FlightStats API key is a 30-day trial version, limited to 20,000 calls
Custom Pipes Connector to build training data set
https://developer.ibm.com/clouddataservices/simple-data-pipe/
Run every 24 hours
Because the weather service doesn’t return observations older than 24 hours, the connector that builds the data set must be run every 24 hours
Build: Explore the data with Notebook
Loading training data set
Build: Visualize and explore data set
Scatter plot of flight delays by temperature at the departure and arrival airports
Build: Visualize and explore data set
Scatter plot of flight delays by wind speed at the departure and arrival airports
Constraints
‣ Past weather observations provided by the “Insight for Weather” service have more details than forecast data:
- Limit the features used to train the models to the intersection of the two
‣ Restrict the training data to weather forecasts at the departure and arrival airports
- Would adding weather data from various points along the route increase model performance?
‣ Difficult to get enough representative data because I was using a trial account on FlightStats
- Ideally, I would use more airports with more representative weather
‣ Didn’t use any categorical features
‣ For simplicity: use IPython Notebook as the user interface
- Makes the experience less compelling for business users
- To avoid writing too much code in the Notebook, encapsulate some of the business logic in a Python library
- The Python API doesn’t cover as much of the Spark API as Scala
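The first constraint above, training only on features present in both past observations and forecasts, can be sketched as a set intersection. All field names below are hypothetical placeholders; the slides do not show the actual payloads of the weather service.

```python
# Hypothetical field names, used only to illustrate the idea.
OBSERVATION_FIELDS = {"temp", "wspd", "pressure", "vis", "dewpt", "rh"}
FORECAST_FIELDS = {"temp", "wspd", "vis", "pop", "rh"}


def shared_features(observation_fields, forecast_fields):
    """Return the feature names available in both sources, in a stable order."""
    return sorted(set(observation_fields) & set(forecast_fields))


def to_feature_vector(record, feature_names):
    """Build a numeric feature vector using only the shared features."""
    return [float(record[name]) for name in feature_names]


FEATURES = shared_features(OBSERVATION_FIELDS, FORECAST_FIELDS)

# A made-up past observation; fields missing from forecasts are ignored.
observation = {"temp": 21, "wspd": 14, "pressure": 1012, "vis": 10, "dewpt": 9, "rh": 55}
vector = to_feature_vector(observation, FEATURES)
print(FEATURES)
print(vector)
```

Fixing the feature order once means a model trained on observations can later be fed forecast data with the same vector layout.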
Load labeled data RDD
Build: NaiveBayes Classification
Build: Decision Tree classification
Build: Random Forest classification
Build: Performance measurements
Load blind data
Build: Compare metrics between different models
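To make the comparison concrete, here is a minimal plain-Python sketch of the three metrics named earlier (accuracy, precision, recall), computed per class for two sets of predictions. It does not use the Spark API, and the labels and predictions are made-up values, not results from the demo.

```python
from collections import Counter


def evaluate(labels, predictions):
    """Compute accuracy plus per-class precision and recall."""
    classes = sorted(set(labels) | set(predictions))
    actual = Counter(labels)          # how often each class really occurs
    predicted = Counter(predictions)  # how often each class was predicted
    true_pos = Counter(y for y, p in zip(labels, predictions) if y == p)
    accuracy = sum(true_pos.values()) / len(labels)
    precision = {c: true_pos[c] / predicted[c] if predicted[c] else 0.0 for c in classes}
    recall = {c: true_pos[c] / actual[c] if actual[c] else 0.0 for c in classes}
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up blind-set labels and predictions from two hypothetical models.
blind_labels = [1, 1, 2, 2, 3, 1]
model_predictions = {
    "model_a": [1, 1, 2, 1, 3, 1],
    "model_b": [1, 2, 2, 2, 1, 1],
}

results = {name: evaluate(blind_labels, preds) for name, preds in model_predictions.items()}
for name, metrics in results.items():
    print(name, round(metrics["accuracy"], 3))
```

Per-class precision and recall matter here because the delay classes are unlikely to be balanced, so accuracy alone can hide a model that rarely predicts the long-delay classes.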
Naïve Bayes vs Decision Tree
Naïve Bayes
‣ Probabilistic: computes the probability that a data instance belongs to a specific class
‣ Assumes that each feature (variable) is independent of the others
‣ Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy)
‣ Works well with a small amount of training data; doesn’t need examples of all the possibilities
‣ Doesn’t work with categorical features
Decision Tree
‣ Non-probabilistic: partitions the data into subsets that best describe the variable
‣ The deeper the tree, the more closely the model fits the training data
‣ Watch out for overfitting: need to prune the tree
‣ Can handle categorical or continuous features
‣ No need for input to be scaled or standardized: set your features and go!
‣ Requires a lot of data covering all possibilities
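The partitioning idea behind a decision tree can be sketched for a single continuous feature: try each candidate threshold and keep the one that yields the purest subsets. The sketch below uses Gini impurity, which is one common criterion and an assumption here, since the slides do not say which criterion was used. The wind speeds and delay classes are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    pairs = list(zip(values, labels))
    best = (None, gini(labels))  # no split at all is the baseline
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up data: wind speed at departure and the resulting delay class.
wind_speed = [5, 8, 12, 30, 35, 40]
delay_classes = [1, 1, 1, 2, 2, 3]
threshold, impurity = best_split(wind_speed, delay_classes)
print(threshold, round(impurity, 4))
```

A full tree repeats this search recursively on each subset, which is also why a very deep tree can end up memorizing the training data and needs pruning.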
Analyze: Run model
Code: Run Model
If you want to know more
‣https://developer.ibm.com/clouddataservices/
‣https://github.com/ibm-cds-labs/pipes-connector-flightstats
‣http://spark.apache.org/docs/latest/mllib-guide.html
‣https://console.ng.bluemix.net/data/analytics/
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 

Machine Learning with Apache Spark

  • 1. Machine Learning With Apache Spark CodeMash, Sandusky, Ohio, Jan 5-8, 2016 David Taieb STSM-IBM Cloud Data Services
  • 2. ©2015 IBM Corporation Introduction David Taieb david_taieb@us.ibm.com Developer Advocate IBM Cloud Data Services Our mission: We are here to help developers realize their most ambitious projects. https://developer.ibm.com/clouddataservices/connect/
  • 3. Big data, cloud and the rise of business analytics ‣ Data collected by enterprises grows exponentially: ERP, embedded systems (IoT) ‣ Cloud, with high availability and huge capacity, makes more data available for analytics ‣ Big data and cloud create new opportunities: - Organizations: more effective decision-making process, richer client interactions - Business users: discover new insights, better decision-making process - Developers: access to diverse data sources and new tools that increase productivity
  • 4. Why Business Analytics with big data “In God we trust. All others bring data” (W. Edwards Deming) ‣ Every day, companies make bet-the-business decisions about their customers, competitors and new products ‣ Time available for decision-making is shrinking (sometimes real-time) ‣ As more and more companies go digital, data becomes the world’s newest resource for competitive advantage ‣ Decision making has moved from the elite few to the empowered many ‣ Few organizations can keep pace with the appetite for data
  • 5. Business Analytics Types: Descriptive Analytics (look at the reasons for past success or failure), Predictive Analytics (what is probably going to happen in the future?), Prescriptive Analytics (what are my best actions?) • Use interactive querying and visualization to explore and communicate data • Discover insights and trends • Correlation between two seemingly unrelated variables • Data mining • Generate hypotheses and models • Predict the occurrence of future events using probability (confidence) • Product recommendations • Classification • Help make the right decision based on the data • Find the optimal solution to a given problem
  • 6. Taking Analytics a step further with Cognitive Systems (IBM Watson) ‣ Use natural language processing and machine learning algorithms to unlock knowledge from massive amounts of structured and unstructured data. Decide: • Ingest and analyze domain sources, info models • Generate evidence-based decisions with confidence • Learn with new outcomes and actions • e.g. next generation Apps → Probabilistic Apps. Ask: • Leverage vast amounts of data • Ask questions for greater insights • Natural language inquiries • e.g. next generation Chat. Discover: • Find the rationale for given answers • Prompt for inputs to yield improved responses • Inspire consideration of new ideas • e.g. next generation Search → Discovery
  • 7. IBM Cloud Data Services Resources for developers to get, build, and analyze on the IBM Cloud
  • 8. What is Spark? Spark is an open source, in-memory computing framework for distributed data processing and iterative analysis on massive data volumes
  • 9. Spark Core Libraries • Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions • Spark SQL: executes SQL statements • Spark Streaming: performs streaming analytics using micro-batches • MLlib (machine learning): common machine learning and statistical algorithms • GraphX (graph): distributed graph processing framework
  • 10. Key reasons for interest in Spark: Open Source, Fast distributed data processing, Productive, Web Scale • In-memory storage greatly reduces disk I/O • Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk • Largest project and one of the most active on Apache • Vibrant, growing community of developers continuously improves the code base and extends capabilities • Fast adoption in the enterprise (IBM, Databricks, etc.) • Fault tolerant: seamlessly recomputes data lost to hardware failure • Scalable: easily increase the number of worker nodes • Flexible job execution: batch, streaming, interactive • Easily handles petabytes of data without special code handling • Compatible with the existing Hadoop ecosystem • Unified programming model across a range of use cases • Rich and expressive APIs hide the complexities of parallel computing and worker node management • Support for Java, Scala, Python and R: less code written • Includes a set of core libraries that enable various analytic methods: Spark SQL, MLlib, GraphX
  • 11. IBM is all-in on its commitment to Spark • Contribute to the Core: launch the Spark Technology Cluster (STC) with 300 engineers; open source SystemML; partner with Databricks • Foster Community: educate 1M+ data scientists and engineers via online courses; sponsor AMPLab, creators and evangelists of Spark • Infuse the Portfolio: integrate Spark throughout the portfolio; 3,500 employees working on Spark-related topics; Spark however customers want it (standalone, platform or products) Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
  • 12. Spark MLlib ‣ Extension to the Spark Core API that provides a library of easy-to-use machine learning algorithms ‣ Highly scalable: leverages Spark's ability to work with massive amounts of data ‣ Fast: designed for parallel computing ‣ Covers common machine learning algorithms: - Regression - Classification - Clustering - Recommender Systems - Text Analytics
  • 13. What is Machine Learning and where is it used ‣ Subfield of computer science that focuses on getting computers to learn from data: - Recognize patterns - Make predictions ‣ Example uses: - Spam filters - Netflix recommendations - Self-driving cars - Watson - …
  • 14. Typical Machine Learning Flow (diagram): Data Acquisition → Data Preparation (cleansing, shaping, enrichment) → Data Annotation (Ground Truth) → Model Training → Model Testing. The data is divided into a Training Set, a Test Set and a Blind Set. Training is iterative (cross-validation): train the model, evaluate its performance and optimize the model
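The three data sets named in the flow above can be produced by shuffling the annotated records and cutting them into slices. The following is a minimal pure-Python sketch, not the code used in the talk: the function name `split_dataset`, the 60/20/20 proportions and the fixed seed are assumptions chosen for illustration. On a Spark RDD, `randomSplit` serves a similar purpose.

```python
import random

def split_dataset(records, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle records and split them into training, test and blind sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and test slices becomes the blind set.
    """
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    blind = shuffled[n_train + n_test:]
    return training, test, blind

training, test, blind = split_dataset(range(100))
print(len(training), len(test), len(blind))
```

Keeping the blind set untouched until the final comparison is what lets it give an unbiased estimate of model performance.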
  • 15. MLlib Algorithm Overview • Predictive analytics • Recommendations: Collaborative Filtering, Matrix Factorization • Feature extraction and transformation: TF-IDF, HashingTF, Word2Vec, StandardScaler, Normalizer • Model Evaluation/Metrics: Binary Classification Metrics, Multi Class Metrics, Regression Metrics
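To make the two scaling transformers in the list above concrete, here is a rough pure-Python sketch of what standardizing columns and normalizing a row mean. It is an illustration of the arithmetic only, not MLlib's implementation: the names `standardize` and `l2_normalize` are hypothetical, and the sketch both centers and scales, which may differ from a given library's defaults.

```python
import math

def standardize(rows):
    """Rescale each column to zero mean and unit sample standard deviation.

    Expects at least two rows, since the sample variance divides by n - 1.
    """
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / (len(col) - 1))
        for col, m in zip(columns, means)
    ]
    # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

def l2_normalize(row):
    """Rescale one feature vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)

scaled = standardize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
unit = l2_normalize([3.0, 4.0])
print(scaled)
print(unit)
```

Standardization works column by column across the data set, while normalization works row by row on a single vector.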
  • 16. Predictive analytics • Supervised Learning (requires Ground-Truth), continuous output: Regression (Linear, Ridge, Lasso, Isotonic), Decision Tree, RandomForest, GradientBoostedTree • Supervised Learning (requires Ground-Truth), discrete output: Classification (Logistic Regression, SVM, NaiveBayes), Decision Tree, RandomForest, GradientBoostedTree, K-NN (available as an add-on Spark package) • Unsupervised Learning (no Ground-Truth data required): Clustering (KMeans, Gaussian Mixture), Dimensionality Reduction (PCA, SVD), FP-Growth
  • 17. Featured demo: Flight Delay Predictor ‣ Use training data collected from flight stats and enriched with weather observations from the “Insight for Weather” service on Bluemix ‣ Train a multi-class classifier that, given a flight's departure weather observations, can predict the flight delay class: - 0 = Canceled - 1 = On Time - 2 = Delay less than 2 hours - 3 = Delay between 2 and 4 hours - 4 = Delay more than 4 hours ‣ Provide metrics for each algorithm: - Accuracy - Precision - Recall
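The five delay classes listed above amount to a simple labeling rule. A hedged sketch follows; the function name `delay_class`, the `on_time_threshold` parameter and the handling of the exact 2-hour and 4-hour boundaries are assumptions, since the slides do not specify them.

```python
def delay_class(canceled, delay_minutes, on_time_threshold=0):
    """Map a flight outcome to the delay classes 0-4 used in the demo.

    Assumptions (not stated in the slides): a flight counts as on time when
    its delay does not exceed on_time_threshold minutes, and delays of exactly
    2 or 4 hours fall into class 3.
    """
    if canceled:
        return 0  # Canceled
    if delay_minutes <= on_time_threshold:
        return 1  # On time
    if delay_minutes < 120:
        return 2  # Delay of less than 2 hours
    if delay_minutes <= 240:
        return 3  # Delay between 2 and 4 hours
    return 4      # Delay of more than 4 hours

for outcome in [(True, 0), (False, 0), (False, 45), (False, 180), (False, 300)]:
    print(outcome, "->", delay_class(*outcome))
```

A label produced this way is the ground truth that the classifiers are trained against.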
  • 18. Architecture (diagram components): Weather, Airports, Flight Schedules, Flight Status, Metadata; Simple Data Pipes custom connector (run every 24 hours); Training Set, Test Set, Blind Set; Notebook
  • 19. Get ‣ Identify data sources: - flightstats.com: https://developer.flightstats.com - Airport metadata: FS Code, geolocation,… - Flight Schedules - Flight Status - Weather Observations - Insight for Weather on Bluemix ‣ Storage: - Cloudant ‣ Tool used: - Simple Data Pipes custom connector to build the Training, Test and Blind data sets ‣ Constraints: - The Weather service provides past observations only up to 24 hours back - The Flightstats API key is a 30-day trial version, limited to 20,000 calls
  • 20. Custom Pipes Connector to build the training data set https://developer.ibm.com/clouddataservices/simple-data-pipe/
  • 21. Run every 24 hours: because the Weather service doesn’t return observations older than 24 hours, the connector that builds the data set must be run every 24 hours
  • 22. Build: Explore the data with Notebook
  • 23. Loading training data set
  • 24. Build: Visualize and explore data set. Scatter plot of flight delays based on temperature at the departure and arrival airports
  • 25. Build: Visualize and explore data set. Scatter plot of flight delays based on wind speed at the departure and arrival airports
  • 26. Constraints ‣ Past weather observations provided by the “Insight for Weather” service have more details than forecast data: - Limit the features used to train the models to the intersection of the two ‣ Restrict the training data to the weather forecast at the departure and arrival airports - Would adding weather data from various points along the route increase the model performance? ‣ Difficult to get enough representative data because I was using a trial account on flightstats - Ideally, I would use more airports with more representative weather ‣ Didn’t use any categorical features ‣ For simplicity: use IPython Notebook as the user interface - Makes the experience less compelling for business users - To avoid writing too much code in the Notebook, encapsulate some of the business logic in a Python library - The Python API doesn’t cover as much of the Spark API as Scala does
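The first constraint above, training only on fields present in both past observations and forecasts, can be pictured as a set intersection over field names. This is a small sketch under stated assumptions: every field name below is hypothetical and does not come from the “Insight for Weather” API, and `common_features` and `to_feature_vector` are illustrative helpers.

```python
# Hypothetical field names, for illustration only.
observation_fields = ["temp", "pressure", "wspd", "vis", "dewpt", "rh"]
forecast_fields = ["temp", "wspd", "rh", "pop"]

def common_features(observed, forecast):
    """Keep only the fields available in both sources, in observation order."""
    forecast_set = set(forecast)
    return [name for name in observed if name in forecast_set]

def to_feature_vector(record, feature_names):
    """Project one weather record onto the shared features as floats."""
    return [float(record[name]) for name in feature_names]

features = common_features(observation_fields, forecast_fields)
vector = to_feature_vector(
    {"temp": 21, "pressure": 1012, "wspd": 12, "rh": 40}, features
)
print(features, vector)
```

Restricting training to the shared fields ensures a model trained on past observations can later be applied to forecast data.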
  • 27. Load labeled data RDD
  • 28. Load labeled data RDD
  • 29. Build: NaiveBayes Classification
  • 30. Build: Decision Tree classification
  • 31. Build: Random Forest classification
  • 32. Build: Performance measurements. Load blind data
  • 33. Build: Compare metrics between different models
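Comparing models on the blind set comes down to computing accuracy, precision and recall from pairs of predicted and actual labels. The sketch below does this in plain Python; the function name `classification_metrics` and the sample pairs are illustrative assumptions, and the notebook from the talk may compute these differently (MLlib also ships a `MulticlassMetrics` class for this kind of evaluation).

```python
from collections import Counter

def classification_metrics(pairs):
    """Compute accuracy plus per-class (precision, recall).

    pairs: iterable of (predicted_label, actual_label).
    """
    pairs = list(pairs)
    correct, predicted, actual = Counter(), Counter(), Counter()
    for pred, label in pairs:
        predicted[pred] += 1
        actual[label] += 1
        if pred == label:
            correct[label] += 1
    accuracy = sum(correct.values()) / len(pairs) if pairs else 0.0
    per_class = {}
    for cls in sorted(set(predicted) | set(actual)):
        # Precision: of everything predicted as cls, how much was right.
        precision = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        # Recall: of everything that really was cls, how much was found.
        recall = correct[cls] / actual[cls] if actual[cls] else 0.0
        per_class[cls] = (precision, recall)
    return accuracy, per_class

# Made-up (predicted, actual) delay classes, for illustration only.
sample_pairs = [(1, 1), (1, 1), (2, 1), (2, 2), (3, 2), (1, 2)]
accuracy, per_class = classification_metrics(sample_pairs)
print(accuracy, per_class)
```

Running the same function over each model's predictions on the blind set gives directly comparable numbers.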
  • 34. Naïve Bayes vs Decision Tree. Naïve Bayes: ‣ Probabilistic: computes the probability of a data instance belonging to a specific class ‣ Assumes that each feature (variable) is independent of the others ‣ Performance depends on the predictive nature of the features (non-predictive features will affect the accuracy) ‣ Works well with a small amount of training data; doesn’t need all the possibilities ‣ Doesn’t work with categorical features. Decision Tree: ‣ Non-probabilistic: partitions the data into subsets that best describe the variable ‣ The deeper the tree, the better the model fits the data ‣ Watch out for overfitting: need to prune the tree ‣ Can handle categorical or continuous features ‣ No need for input to be scaled or standardized: set your features and go! ‣ Requires a lot of data covering all possibilities
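The independence assumption above means a class score is just the class prior multiplied by one likelihood per feature. The sketch below illustrates that idea with a Gaussian variant written in plain Python. This is a swapped-in technique chosen for readability: it is not the MLlib NaiveBayes implementation (which is multinomial), and the function names and the tiny temperature and wind-speed data set are made up.

```python
import math
from collections import defaultdict

def train_gaussian_nb(samples):
    """samples: list of (label, [feature values]).

    Returns {label: (log prior, [(mean, variance) per feature])}.
    """
    by_class = defaultdict(list)
    for label, features in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small variance floor avoids division by zero for constant features.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (math.log(len(rows) / len(samples)), stats)
    return model

def predict(model, features):
    """Return the label with the highest log prior plus summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, stats) in model.items():
        score = log_prior
        for x, (mean, var) in zip(features, stats):
            # Each feature contributes independently: the "naive" assumption.
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up samples: (delay class, [temperature, wind speed]).
samples = [
    (1, [20.0, 5.0]), (1, [22.0, 7.0]), (1, [21.0, 6.0]),
    (2, [0.0, 30.0]), (2, [-2.0, 35.0]), (2, [1.0, 28.0]),
]
model = train_gaussian_nb(samples)
print(predict(model, [19.0, 8.0]), predict(model, [0.0, 32.0]))
```

Because each feature is scored separately, a feature with no predictive value still shifts the total score, which is one way to read the slide's warning about non-predictive features.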
  • 37. If you want to know more ‣ https://developer.ibm.com/clouddataservices/ ‣ https://github.com/ibm-cds-labs/pipes-connector-flightstats ‣ http://spark.apache.org/docs/latest/mllib-guide.html ‣ https://console.ng.bluemix.net/data/analytics/