IBM Strategy for Spark

Introductions
Garrett Young (sgyoung@us.ibm.com)
1) Introduction to Spark (10 mins)
2) IBM's Commitment to Spark (5 mins)
3) How Predictive Analytic Lifecycles Typically Work (10 mins)
4) Using Spark to Predict Hospital Readmissions (15 mins)
5) How you can get a free-trial Spark environment from IBM (5 mins)
6) Q&A (15 mins)

What is Spark?
• In-memory data processing engine
• Open Source Apache Project
• Cluster Computing Framework
• Can be used from Scala, Java, Python, or R
• Horizontally/Vertically Scalable
• Not a data store

IBM | SPARK – The Analytics Operating System
“Enabling New Classes of Intelligent Applications Embedded with Analytics”
• Spark unifies data, enabling real-time insights
• Spark processes and analyzes data from any data source
• Spark is complementary to Hadoop, but faster with in-memory performance
• Build models quickly. Iterate faster. Apply intelligence.

How Spark Works
• Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involve lots of (slow) disk I/O
[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → HDFS Write → HDFS Read → Iteration 2 (CPU, Memory) → HDFS Write → Result]

How Spark Works
• Solution: Keep more data in-memory with a new distributed execution engine
[Diagram: Input → HDFS Read → Iteration 1 (CPU, Memory) → Iteration 2 (CPU, Memory); memory is faster than network & disk; zero read/write disk bottleneck; chain job output into new job input]

General Spark Architecture Overview
• The driver uses the SparkContext to talk to the cluster manager
• Executors run in their own JVM processes
• The cluster manager distributes the workload based on information from the workers

Key Reasons for the Interest in Spark
Performant
• In-memory architecture greatly reduces disk I/O
• Anywhere from 20-100x faster for common tasks
Productive
• Concise and expressive syntax, especially compared to prior approaches
• Single programming model across a range of use cases and steps in data lifecycle
• Integrated with common programming languages – Java, Python, Scala, R
• New tools continually reduce skill barrier for access (e.g. SQL for analysts)
Leverages existing investments
• Works well within existing Hadoop ecosystem
Improves with age
• Large and growing community of contributors continuously improve full analytics stack and extend capabilities

What is SparkML?
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy.
At a high level, it provides tools such as:
1. ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
2. Featurization: feature extraction, transformation, dimensionality reduction, and selection
3. Persistence: saving and loading algorithms, models, and Pipelines
4. Utilities: linear algebra, statistics, data handling, etc.

What is scikit-learn?
• Used for Data Mining and Data Analysis
• Open Source
• Various classification, regression and clustering algorithms

Watson Machine Learning
• Uses both Spark ML and Scikit-Learn plus others
• Built on SPSS platform
• Can pull from many different data sources
• Integrates with DSX (Beta)

IBM DSX Machine Learning
Data Access:
• Easily connect to Behind-the-Firewall and Public Cloud Data
• Catalogued and Governed Controls through Watson Data Platform
Creating Models:
• Single UI and API for creating ML Models on various Runtimes
• Auto-Modelling and Hyperparameter Optimization
Web Service:
• Real-time, Streaming, and Batch Deployment
• Continuous Monitoring and Feedback Loop
Intelligent Apps:
• Integrate ML models with apps, websites, etc.
• Continuously Improve and Adapt with Self-Learning
[Diagram labels: IMS, Web Service]

IBM Machine Learning in Data Science Experience
[Screenshots: API for Jupyter Notebooks; Wizard GUI]
IBM Machine Learning is provisioned by default in Data Science Experience
• Enables Data Scientists to deploy machine learning models as web services
• Single UI for creating, collaborating, deploying, monitoring, and feedback
• Accessible via API, Wizard GUI, and Canvas

Spark Tech Center (STC): IBM’s Commitment to Spark
[Bar chart: Top 7 Contributing Companies to Spark 2.0.0 (Databricks, IBM, Hortonworks, Cloudera, Intel, IVU Traffic Technologies, Tencent); axis ticks from 0 to 1000]
• 25,600 Spark LOC
• 606 Spark JIRAs
• 253 SystemML JIRAs
• 64 Speakers at events
• … and all that with 1 Team in 1.5 Years
[Pie chart legend: Databricks, Hortonworks, Cloudera, Intel, Tencent, NTT, Other]

IBM Spark Technology Center – San Francisco, CA
As of March 10, 2016
See what we’re up to …
IBM Spark Technology Center
http://www.spark.tc/blog/
Fixing lots of issues reported by others

Using Spark to Predict Hospital Readmissions &
How Predictive Analytic Lifecycles Typically Work

Reducing Hospital Readmissions with Predictive Analytics
An Example ‘Proof of Concept’ Using Open Data

Outline
Problem
Solution
Details
Results
Summary

Problem

Problem: 30-Day Hospital Readmissions Cost $41B Annually
Source: http://www.hcup-us.ahrq.gov/reports/statbriefs/sb172-Conditions-Readmissions-Payer.pdf

Medicare HRRP – Penalties to Hospitals
Source: Kaiser Family Foundation
http://kff.org/medicare/issue-brief/aiming-for-fewer-hospital-u-turns-the-medicare-hospital-readmission-reduction-program/

Solution

Get Data: Diabetes Readmissions Dataset
• University of California Irvine – Machine Learning Repository
• Open Data
• 130 Hospitals, 1999-2008
• 101,766 rows, 50 columns of data
• Diabetes Readmissions: top ten for Medicaid, Private Insurance and Uninsured; not in top ten for Medicare
https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Build a Predictive Model: Conceptual View
• Step 1: Model Development: Historical Data → Machine Learning (Mathematical Algorithm) → Model
• Step 2: Perform Predictions: New Case → Model → Prediction
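
The two-step flow (develop a model from historical data, then score a new case) can be sketched in plain Python. This is only an illustration under stated assumptions: the logistic-regression learner, the two feature columns, and the toy values are all hypothetical and are not the model or data used in the talk.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=2000, lr=0.1):
    """Step 1: fit weights to historical data with batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this historical case.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(model, x):
    """Step 2: score a new case with the trained model."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical historical data:
# [prior inpatient visits, number of medications / 10]
history = [[0, 0.5], [1, 0.8], [0, 0.3], [4, 1.8], [3, 2.0], [5, 1.5]]
readmitted = [0, 0, 0, 1, 1, 1]

model = train(history, readmitted)
low_risk = predict(model, [0, 0.4])
high_risk = predict(model, [4, 1.9])
print(round(low_risk, 3), round(high_risk, 3))
```

The trained model is just the pair of weights and bias, which is why it can be stored once and reused to score each new case as it arrives.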

IBM Bluemix
• Infrastructure, Watson, software and services on Bluemix Cloud Platform
• Services such as BigInsights (Hadoop), Data Connect (ETL), and Spark can be almost instantly provisioned

Data Science Experience (DSX)
• Easily execute Scala, Python and R notebooks
• Share notebooks with your data science team

Bluemix Services Architecture in the Cloud
[Architecture diagram components: BigInsights HDFS (Hadoop), Data Connect, DashDB, Data Science Experience, Cloudant, Node.js Web Form; flow labels: Training Data, Convert to CSV, New Records, Predictions]

Details

Data Science Experience – Python Code

Results

First Pass Results – Are they any good?
• AUC = Area Under the Curve
• AUC Score: 0.6514
• 0.50 = Random Guessing
• 1.00 = Perfect Prediction

2nd Pass Results – Are they any good?
• AUC = Area Under the Curve
• AUC Score: 0.6750
• 0.50 = Random Guessing
• 1.00 = Perfect Prediction

How Do Other Readmission Models Perform?
“A comparison of models for predicting early hospital readmissions”
Journal of Biomedical Informatics, Volume 56, August 2015, Pages 229–238
Source: http://www.sciencedirect.com/science/article/pii/S1532046415000969

Which Factors Affect Diabetes Readmission?
Data: Feature Importance from Random Forest Algorithm
The Algorithm can tell us which features (columns) it found important during the training process.
22 columns from original 50
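
Tree-based importances are typically accumulated from how much each split on a feature reduces impurity. The sketch below shows that building block for a single split, not a full Random Forest. The column meanings, thresholds, and toy values are hypothetical assumptions for illustration only.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def impurity_decrease(rows, labels, feature, threshold):
    """Drop in Gini impurity from splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Hypothetical columns: [prior inpatient visits, arbitrary ID digit]
rows = [[0, 3], [1, 7], [0, 1], [4, 2], [3, 8], [5, 5]]
labels = [0, 0, 0, 1, 1, 1]

visits_gain = impurity_decrease(rows, labels, feature=0, threshold=2)
id_gain = impurity_decrease(rows, labels, feature=1, threshold=4)
print(visits_gain, id_gain)
```

A column whose splits separate the classes cleanly earns a large decrease, while an uninformative column earns almost none; a forest sums such decreases over all splits and trees to rank the columns.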

Summary

Summary
• Readmissions Prediction is an important area of research for using Predictive Analytics in Healthcare
• Patient: Improved Outcome
• Hospital Providers: Avoid Penalties
• Payers: Reduce Costs
• In a short amount of time we were able to develop results comparable to leading research studies

How you can get a free-trial Spark Cluster from IBM

Sign Up for Free Account
Data Science Experience with IBM ML: http://datascience.ibm.com/
Notebook Samples: https://ibm.box.com/s/y2zvpzk8pje56lto0oja0372tnbydbom
