Meetup 2

IBM Strategy for Spark. Presented by Garrett Young at Houston Hadoop & Spark Meetup in June of 2017.

Published in: Technology
    1. Introduction to Spark
    2. Introductions
        Garrett Young (sgyoung@us.ibm.com)
        1) Introduction to Spark (10 mins)
        2) IBM's Commitment to Spark (5 mins)
        3) How Predictive Analytic Lifecycles Typically Work (10 mins)
        4) Using Spark to Predict Hospital Readmissions (15 mins)
        5) How you can get a free-trial Spark environment from IBM (5 mins)
        6) Q&A (15 mins)
    3. What is Spark?
        • In-memory data processing engine
        • Open Source Apache Project
        • Cluster Computing Framework
        • Can use Scala, Python or R Languages
        • Horizontally/Vertically Scalable
        • Not a data store
    4. IBM | SPARK – The Analytics Operating System
        “Enabling New Classes of Intelligent Applications Embedded with Analytics”
        • Spark unifies data, enabling real-time insights
        • Spark processes and analyzes data from any data source
        • Spark is complementary to Hadoop, but faster with in-memory performance
        • Build models quickly. Iterate faster. Apply intelligence.
    5. How Spark Works
        • Traditional Approach: MapReduce jobs for complex jobs, interactive query, and online event-hub processing involve lots of (slow) disk I/O
        Diagram labels: Input, HDFS Read, Iteration 1 (CPU, Memory), HDFS Write, HDFS Read, Iteration 2 (CPU, Memory), HDFS Write, Result
    6. How Spark Works
        • Solution: Keep more data in-memory with a new distributed execution engine
        • Memory is faster than network & disk
        • Zero Read/Write Disk Bottleneck
        • Chain Job Output into New Job Input
        Diagram labels: Input, HDFS Read, Iteration 1 (CPU, Memory), Iteration 2 (CPU, Memory)
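The contrast drawn on slides 5 and 6 can be illustrated with a toy sketch in plain Python. This is not Spark code: `CountingStore`, `step`, `run_via_storage`, and `run_in_memory` are hypothetical names invented here, and the store only stands in for HDFS by counting how often it is read and written.

```python
class CountingStore:
    """Hypothetical stand-in for slow storage such as HDFS; counts I/O calls."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self._records)

    def write(self, records):
        self.writes += 1
        self._records = list(records)


def step(records):
    """One iteration of some computation (here: add 1 to every record)."""
    return [x + 1 for x in records]


def run_via_storage(store, iterations):
    """Slide 5 pattern: each iteration reads its input from storage and writes its output back."""
    result = None
    for _ in range(iterations):
        result = step(store.read())
        store.write(result)
    return result


def run_in_memory(store, iterations):
    """Slide 6 pattern: read once, then chain each iteration's output into the next in memory."""
    data = store.read()
    for _ in range(iterations):
        data = step(data)
    return data


if __name__ == "__main__":
    disk_store = CountingStore([1, 2, 3])
    mem_store = CountingStore([1, 2, 3])
    print(run_via_storage(disk_store, 2), disk_store.reads, disk_store.writes)
    print(run_in_memory(mem_store, 2), mem_store.reads, mem_store.writes)
```

Both functions compute the same result; only the number of storage round trips differs, which is the point the two slides make.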
    7. General Spark Architecture Overview
        • Driver uses Spark Context to talk to the Cluster Manager
        • Executors run their own JVM Processes
        • Cluster Manager distributes the workload based on information from the Worker
    8. Key Reasons for the Interest in Spark
        Performant
        • In-memory architecture greatly reduces disk I/O
        • Anywhere from 20-100x faster for common tasks
        Productive
        • Concise and expressive syntax, especially compared to prior approaches
        • Single programming model across a range of use cases and steps in data lifecycle
        • Integrated with common programming languages – Java, Python, Scala, R
        • New tools continually reduce skill barrier for access (e.g. SQL for analysts)
        Leverages existing investments
        • Works well within existing Hadoop ecosystem
        Improves with age
        • Large and growing community of contributors continuously improves the full analytics stack and extends capabilities
    9. What is SparkML?
        MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:
        1. ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
        2. Featurization: feature extraction, transformation, dimensionality reduction, and selection
        3. Persistence: saving and loading algorithms, models, and Pipelines
        4. Utilities: linear algebra, statistics, data handling, etc.
    10. What is scikit-learn?
        • Used for Data Mining and Data Analysis
        • Open Source
        • Various classification, regression and clustering algorithms
    11. Watson Machine Learning
        • Uses both Spark ML and Scikit-Learn plus others
        • Built on SPSS platform
        • Can pull from many different data sources
        • Integrates with DSX (Beta)
    12. Data Access:
        • Easily connect to Behind-the-Firewall and Public Cloud Data
        • Catalogued and Governed Controls through Watson Data Platform
        Creating Models:
        • Single UI and API for creating ML Models on various Runtimes
        • Auto-Modelling and Hyperparameter Optimization
        Web Service:
        • Real-time, Streaming, and Batch Deployment
        • Continuous Monitoring and Feedback Loop
        Intelligent Apps:
        • Integrate ML models with apps, websites, etc.
        • Continuously Improve and Adapt with Self-Learning
        Diagram labels: Web Service, IBM DSX Machine Learning, IMS
    13. IBM Machine Learning in Data Science Experience
        IBM Machine Learning is provisioned by default in Data Science Experience
        • Enables Data Scientists to deploy machine learning models as web services
        • Single UI for creating, collaborating, deploying, monitoring, and feedback
        • Accessible via API, Wizard GUI, and Canvas
        Screenshot labels: API for Jupyter Notebooks, Wizard GUI
    14. IBM's Commitment to Spark
    15. Spark Tech Center (STC): IBM’s Commitment to Spark
        Chart: Top 7 Contributing Companies to Spark 2.0.0 (Databricks, IBM, Hortonworks, Cloudera, Intel, IVU Traffic Technologies, Tencent)
        Chart legend: Databricks, Hortonworks, Cloudera, Intel, Tencent, NTT, Other
        • 25,600 Spark LOC
        • 606 Spark JIRAs
        • 253 SystemML JIRAs
        • 64 Speakers at events
        … and all that with 1 Team, 1.5 Years
    16. IBM Spark Technology Center – San Francisco, CA
        As of March 10, 2016
        • Fixing lots of issues reported by others
        See what we’re up to … IBM Spark Technology Center http://www.spark.tc/blog/
    17. Using Spark to Predict Hospital Readmissions & How Predictive Analytic Lifecycles Typically Work
    18. Reducing Hospital Readmissions with Predictive Analytics
        An Example ‘Proof of Concept’ Using Open Data
    19. Outline: Problem, Solution, Details, Results, Summary
    20. Section: Problem
    21. Problem: 30-Day Hospital Readmissions cost $41B Annually
        Source: http://www.hcup-us.ahrq.gov/reports/statbriefs/sb172-Conditions-Readmissions-Payer.pdf
    22. Medicare HRRP – Penalties to Hospitals
        Source: Kaiser Family Foundation http://kff.org/medicare/issue-brief/aiming-for-fewer-hospital-u-turns-the-medicare-hospital-readmission-reduction-program/
    23. Section: Solution
    24. Get Data: Diabetes Readmissions Dataset
        • University of California Irvine – Machine Learning Repository
        • Open Data
        • 130 Hospitals, 1999-2008
        • 101,766 rows, 50 columns of data
        • Diabetes Readmissions
            • Top ten for Medicaid, Private Insurance and Uninsured
            • Not in top ten for Medicare
        https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
    25. Build a Predictive Model: Conceptual View
        • Step 1: Model Development: Historical Data → Machine Learning (Mathematical Algorithm) → Model
        • Step 2: Perform Predictions: New Case → Model → Prediction
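The two steps on slide 25 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses a one-feature threshold rule rather than the algorithm used in the talk, the function names `fit_threshold_model` and `predict` are hypothetical, and the data is made up (it is not drawn from the UCI dataset).

```python
def fit_threshold_model(rows):
    """Step 1 (model development): learn a cutoff on one numeric feature.

    rows is a list of (feature_value, label) pairs, where label 1 means
    "readmitted". Every observed value is tried as a cutoff and the one
    that classifies the most historical rows correctly is kept.
    """
    best_cutoff, best_correct = None, -1
    for cutoff in sorted({value for value, _ in rows}):
        correct = sum(
            1 for value, label in rows if (value >= cutoff) == (label == 1)
        )
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return {"cutoff": best_cutoff}


def predict(model, value):
    """Step 2 (perform predictions): apply the stored model to a new case."""
    return 1 if value >= model["cutoff"] else 0


# Made-up historical data: (prior inpatient visits, readmitted within 30 days?)
history = [(0, 0), (1, 0), (0, 0), (3, 1), (4, 1), (2, 0), (5, 1)]

model = fit_threshold_model(history)        # Step 1: historical data -> model
new_case_prediction = predict(model, 4)     # Step 2: new case -> prediction
```

The model is the only thing carried from step 1 to step 2, which is why it can be trained once on historical data and then reused for each new case.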
    26. IBM Bluemix
        • Infrastructure, Watson, software and services on Bluemix Cloud Platform
        • Services such as BigInsights (Hadoop), Data Connect (ETL), and Spark can be almost instantly provisioned
    27. Data Science Experience (DSX)
        • Easily execute Scala, Python and R notebooks
        • Share notebooks with your data science team
    28. Bluemix Services Architecture in the Cloud
        Diagram components: BigInsights HDFS (Hadoop), Data Connect, DashDB, Data Science Experience, Cloudant, Node.js Web Form
        Diagram flows: Training Data, Convert to CSV, Predictions, New Records, Predictions
    29. Section: Details
    30. A Look at The Raw Data
    31. Data Science Experience – Python Code
    32. Section: Results
    33. First Pass Results – Are they any good?
        • AUC = Area Under the Curve
        • AUC Score: 0.6514
        • 0.50 = Random Guessing
        • 1.00 = Perfect Prediction
    34. 2nd Pass Results – Are they any good?
        • AUC = Area Under the Curve
        • AUC Score: 0.6750
        • 0.50 = Random Guessing
        • 1.00 = Perfect Prediction
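To make the AUC scale on slides 33 and 34 concrete, here is a minimal plain-Python sketch of one standard way to compute the area under the ROC curve: the fraction of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as half. The function name `auc_score` and the toy labels and scores are assumptions made for illustration; they are not the talk's code or data.

```python
def auc_score(labels, scores):
    """Area under the ROC curve by pairwise comparison.

    Equals the probability that a randomly chosen positive case (label 1)
    is scored higher than a randomly chosen negative case (label 0).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = readmitted) and model scores, for illustration only.
labels = [1, 0, 1, 0, 1, 0]
perfect_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]   # every positive outranks every negative
constant_scores = [0.5] * 6                        # model cannot tell cases apart

print(auc_score(labels, perfect_scores))   # 1.0
print(auc_score(labels, constant_scores))  # 0.5
```

The pairwise loop is quadratic in the number of cases, so it suits small examples; on a dataset the size of the one in this talk a library implementation would be the practical choice.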
    35. How Do Other Readmission Models Perform?
        “A comparison of models for predicting early hospital readmissions”, Journal of Biomedical Informatics, Volume 56, August 2015, Pages 229–238
        Source: http://www.sciencedirect.com/science/article/pii/S1532046415000969
    36. Which Factors Affect Diabetes Readmission?
        Data: Feature Importance from Random Forest Algorithm
        • The Algorithm can tell us which features (columns) it found important during the training process.
        • 22 columns from original 50
    37. Section: Summary
    38. Summary
        • Readmissions Prediction is an important area of research for using Predictive Analytics in Healthcare
            • Patient: Improved Outcome
            • Hospital Providers: Avoid Penalties
            • Payers: Reduce Costs
        • In a short amount of time we were able to develop results comparable to leading research studies
    39. How you can get a free-trial Spark Cluster from IBM
    40. Sign Up for Free Account
        Data Science Experience with IBM ML
        Notebook Samples
        https://ibm.box.com/s/y2zvpzk8pje56lto0oja0372tnbydbom
        http://datascience.ibm.com/
