SlideShare a Scribd company logo
© 2017 IBM Corporation
Scaling Data Science on Big Data
Vikram Murali
Program Director, Data Science & Machine Learning, IBM
Sriram Srinivasan
STSM & Architect, Data Science & Machine Learning, IBM
IBM Analytics
3 © 2017 IBM Corporation<#>
Data Scientist Pain Points
 Where is the data I need to drive business insights?
 I don’t want to know Hadoop/Hive etc
How do I collaborate and share my work with others?
 What is the best visualization technique to tell my story?
 How do I bring my familiar R/Python libraries to this new Data Science platform?
 How do I learn to use the latest libraries/Technique? (TensorFlow, Scikit learn, XGBoost,…) and
how do I ensure the right set of compute resources for these ?
How are my Machine Learning Models performing & how to improve them?
 I have this Machine Learning Model, how do I deploy it in production?
Machine Learning & Data Science
IBM Analytics
4 © 2017 IBM Corporation<#>
Challenges for the Enterprise
 Ensure secure data access & auditability - for governance and compliance
 Control and Curate access to data and for all open source libraries used
 Explainability and reproducibility of machine learning activities
 Improve trust in analytics and predictions
 Efficient Collaboration and versioning of all source, sample data and models
 Repeatability of process
 Establish Continuous integration practices
 Agility in delivery
 Publish/Share and identify provenance/ lineage with confidence
 Visibility and Access control
 Effective Resource utilization and ability to scale-out on demand
 Balance resources amongst different data scientists, machine learning practioners' workloads
Machine Learning & Data Science
IBM Analytics
5 © 2017 IBM Corporation<#>
Why has this been hard ?
 Rigid toolsets & absence of an integrated platform
 Have to choose one and only one approach
 Cannot easily connect all of the capabilities required
 Difficult to navigate between the various tools used
 Fragmented and time consuming practices
 Result of using multiple disjoint environments
 Separate on-ramp/community for each tool/environment
 Does not yield meaningful meta data or complete data lineage
 Analytical Silos
 Difficult to maintain and version control project assets
 Limited means of collaborating with teams
 Results are difficult to share and audit
Machine Learning & Data Science
 Resource Management Complexity
 Lack of scalable infrastructure
 Inflexible resource prioritization
techniques
IBM Analytics
6 © 2017 IBM Corporation<#>
Introducing IBM Data Science Experience
• Projects and Version Control
• Spark-in-DSX and Remote Spark
• IBM Machine Learning tech - algorithms
& more
• Platform Manager – for easy
administration
• Compute Elasticity support
IBM Data Science Experience
Community Open Source IBM Added Value
• Find tutorials and datasets
• Connect with other Data Scientists
• IBM ML Hub for expert assistance
• Open Source evangelism
• Fork and share projects, samples
• Code in Scala/Python/R/SQL
• Zeppelin & Jupyter Notebooks
• RStudio IDE
• Anaconda distribution
• Add your favorite libraries
Machine Learning & Data Science
IBM Analytics
7 © 2017 IBM Corporation<#>
IBM Data Science Experience
DSX on Public Cloud DSX Desktop DSX Local on Private Cloud
 PayGo consumption with as-a-service
delivery, up & running in seconds
 Integrated with IBM Spark-as-a-Service for
compute, IBM Object Store for data, as well
as other platform assets
 Immediate cloud collaboration via RStudio
and Jupyter notebooks
 Easily installed on your laptop or PC
 Won’t scale beyond the hardware available on
your machine
 Access to RStudio and Jupyter notebooks,
powered by one small Spark worker operating
locally on your machine
 Load CSV data files into Data Frames
 Scalable DSX cluster deployed on your
private infrastructure
 Dockerized containers via Kubernetes
 DSX Local can also deploy with
Hortonworks Data Platform on-premises
 LDAP for user management and
authentication
 Easy collaboration, versioning with Projects
& git
Built-in Zeppelin & Jupyter Notebooks
and RStudio for visualizing and coding on
data science tasks using Python, R, &
Scala.
Built-in Spark parallelizes & accelerates
data science tasks.
Machine Learning & Data Science
IBM Analytics
8
Machine Learning Workflow in Data Science Experience
 Machine learning detects if models fall out of spec — and automatically triggers retraining
 Fully integrated model management means data scientists, app developers & operations can use
the same environment
Machine Learning & Data Science
Data
Live
SystemIngest
Data
Processing
Model
Training
Deployment &
Management
Creating samples &
Cleansing
Automating Data Science Workloads Scalable
Deployment
Feedback Loop
 Historical
 Streaming
 Data visualization
 Feature transform
and engineering
 Model selection
and evaluation
 Pipelines, not
only models
 Versioning
 Predict when
given new data
 Monitoring and
live evaluation
Models
lose accuracy
Data Scientists
+ Researchers
ML Engineers
+ Production Engineers
Data
Engineers
IBM Analytics
9
Data Science
Experience
Machine Learning Everywhere – An Open Platform
 Add your favorite libraries
 Publish Open APIs for secure ML applications
Machine Learning & Data Science
IBM Analytics
10
DSX Local Architecture
Machine Learning & Data Science
IBM Analytics
11
DSX Scale out in Kubernetes is simple
 DSX-Spark scale-out is automatically done by adding more compute nodes (via “Daemon Sets”)
 Remote Spark can be independently scaled out as usual (say in Hadoop/Yarn)
 Individual workload Isolation and scale-out in pods
 Each DSX individual user (or an entity, in general) gets a Kubernetes namespace assigned, making
metering simple.
 All containers (pods) for that user gets spawned in that namespace, such as for tools – Jupyter/Zeppelin
(Python) or R/RStudio as well as other non-spark jobs.
 Namespace provides total quota for that user with resource requests and limits set in each pod
deployment
 “Shared” services are load balanced (with HA support) across all user access by typical
Kubernetes techniques, such as via replicas of pods & DNS-routing via Kubernetes services.
Machine Learning & Data Science
IBM Analytics
12
Data Science Experience with Hortonworks Data Platform
Big Data
DSX
IBM Analytics
13
Data Science Experience with HDP –Roadmap
DSX & HDP interoperability
Side-by-Side Installation
DSX on-the-edge integrated &
optimized for HDP deployments
DSX Jupyter, RStudio &
Zeppelin and Machine Learning
services enabled for HDP data
sources
Yarn managed Spark leveraged
by DSX, via Livy
• Spark jobs pushed to HDP
cluster
Single Cluster
DSX Within HDP Cluster
Dedicated nodes for DSX in the
HDP cluster with Ambari-based
installation/configuration.
Deploy & scale DSX with Yarn
managing DSX as a top-level
application
Knox, Ranger & Atlas integration
for authentication, authorization &
governance
Fully Yarn Managed DSX
Workloads
HDP embeds Kubernetes in Yarn,
enables launch and integration of
Kubernetes pods as Yarn
containers
Yarn manages all workloads in a
granular fashion across the entire
HDP cluster
• Python & R workloads (non-
spark) also managed by Yarn
• GPU affinity , especially for
Deep Learning Jobs
Today Q4 2017 1st Half 2018
1
Machine Learning & Data Science
IBM Analytics
14
Goal: Enterprise IaaS for Data Scientists
 Efficient Compute Resource Management for large-scale Analytics, Machine Learning and Deep
Learning workloads
-Enable Data Scientists to procure resources from a shared compute “grid” for any kind of activity from
interactive notebooks & IDEs to training Jobs or scheduled scripts and Apps.
-All compute manifested as Docker containers/Kubernetes pods
 HDP/Yarn as the Resource Manager
-Enable all workloads, whether Map Reduce or Spark Jobs or DSX/ML activities to be uniformly handled by
the HDP/Yarn scheduler.
-Manage Queue Priorities, balancing of workloads and scale-out for the whole cluster providing best
utilization of all resources.
 Yarn and Kubernetes - the best of both worlds !
Machine Learning & Data Science
IBM Analytics
15 © 2017 IBM Corporation<#>
Call to Action
Experience DSX & ML Today…
IBM DSX at http://datascience.ibm.com
DSX Local recorded demos
Machine Learning: https://www.youtube.com/watch?v=htGZ1Iomeec
Connecting to external Spark: https://www.youtube.com/watch?v=rA0Rlb2M_oI
Spark submit from external app: https://www.youtube.com/watch?v=TETAT9pC9_o
Administration experience: https://www.youtube.com/watch?v=htGZ1Iomeec
Birds of a Feather
session
6pm Thursday, C 4.5
Machine Learning & Data Science
© 2017 IBM Corporation
THANK YOU
IBM Data Science Experience
Vikram Murali
Program Director, Data Science & Machine Learning, IBM
Sriram Srinivasan
STSM & Architect, Data Science & Machine Learning, IBM

More Related Content

What's hot

Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
Saptak Sen
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
DataWorks Summit
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
DataWorks Summit/Hadoop Summit
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
DataWorks Summit
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
DataWorks Summit/Hadoop Summit
 
Big SQL: Powerful SQL Optimization - Re-Imagined for open source
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceBig SQL: Powerful SQL Optimization - Re-Imagined for open source
Big SQL: Powerful SQL Optimization - Re-Imagined for open source
DataWorks Summit
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
BMC Software
 
Hadoop for the Masses
Hadoop for the MassesHadoop for the Masses
Hadoop for the Masses
DataWorks Summit/Hadoop Summit
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
Capgemini
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
Humza Naseer
 
Securing your Big Data Environments in the Cloud
Securing your Big Data Environments in the CloudSecuring your Big Data Environments in the Cloud
Securing your Big Data Environments in the Cloud
DataWorks Summit
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
Cloudera, Inc.
 
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
DataWorks Summit
 
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
DataWorks Summit
 
The EDW Ecosystem
The EDW EcosystemThe EDW Ecosystem
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
DataWorks Summit
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
DataWorks Summit
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
Mark Rittman
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS Cloud
Idan Tohami
 

What's hot (20)

Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Insights into Real-world Data Management Challenges
Insights into Real-world Data Management ChallengesInsights into Real-world Data Management Challenges
Insights into Real-world Data Management Challenges
 
High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark High Performance Spatial-Temporal Trajectory Analysis with Spark
High Performance Spatial-Temporal Trajectory Analysis with Spark
 
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudBring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
 
Big SQL: Powerful SQL Optimization - Re-Imagined for open source
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceBig SQL: Powerful SQL Optimization - Re-Imagined for open source
Big SQL: Powerful SQL Optimization - Re-Imagined for open source
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
 
Beyond TCO
Beyond TCOBeyond TCO
Beyond TCO
 
Hadoop for the Masses
Hadoop for the MassesHadoop for the Masses
Hadoop for the Masses
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Securing your Big Data Environments in the Cloud
Securing your Big Data Environments in the CloudSecuring your Big Data Environments in the Cloud
Securing your Big Data Environments in the Cloud
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
How Apache Spark and Apache Hadoop are being used to keep banking regulators ...
 
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
Integrating and Analyzing Data from Multiple Manufacturing Sites using Apache...
 
The EDW Ecosystem
The EDW EcosystemThe EDW Ecosystem
The EDW Ecosystem
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo ClinicBig Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
Big Data Platform Processes Daily Healthcare Data for Clinic Use at Mayo Clinic
 
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
Data lake – On Premise VS Cloud
Data lake – On Premise VS CloudData lake – On Premise VS Cloud
Data lake – On Premise VS Cloud
 

Similar to Scaling Data Science on Big Data

Ibm integrated analytics system
Ibm integrated analytics systemIbm integrated analytics system
Ibm integrated analytics system
ModusOptimum
 
BlueData DataSheet
BlueData DataSheetBlueData DataSheet
BlueData DataSheet
Greg Kirchoff
 
Your Self-Driving Car - How Did it Get So Smart?
Your Self-Driving Car - How Did it Get So Smart?Your Self-Driving Car - How Did it Get So Smart?
Your Self-Driving Car - How Did it Get So Smart?
Hortonworks
 
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Debraj GuhaThakurta
 
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
Abhinav Joshi
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
Mohammadhasan Farazmand
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
Seldon
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
Alluxio, Inc.
 
Building ML Pipelines with DCOS
Building ML Pipelines with DCOSBuilding ML Pipelines with DCOS
Building ML Pipelines with DCOS
QAware GmbH
 
Cisco Big Data Use Case
Cisco Big Data Use CaseCisco Big Data Use Case
Cisco Big Data Use CaseErni Susanti
 
cisco_bigdata_case_study_1
cisco_bigdata_case_study_1cisco_bigdata_case_study_1
cisco_bigdata_case_study_1Erni Susanti
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
Ajay Ohri
 
2014.07.11 biginsights data2014
2014.07.11 biginsights data20142014.07.11 biginsights data2014
2014.07.11 biginsights data2014
Wilfried Hoge
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
Anant Corporation
 
flexpod_hadoop_cloudera
flexpod_hadoop_clouderaflexpod_hadoop_cloudera
flexpod_hadoop_clouderaPrem Jain
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
James Serra
 
Above the cloud joarder kamal
Above the cloud   joarder kamalAbove the cloud   joarder kamal
Above the cloud joarder kamal
Joarder Kamal
 

Similar to Scaling Data Science on Big Data (20)

Ibm integrated analytics system
Ibm integrated analytics systemIbm integrated analytics system
Ibm integrated analytics system
 
BlueData DataSheet
BlueData DataSheetBlueData DataSheet
BlueData DataSheet
 
AnilKumarT_Resume_latest
AnilKumarT_Resume_latestAnilKumarT_Resume_latest
AnilKumarT_Resume_latest
 
Your Self-Driving Car - How Did it Get So Smart?
Your Self-Driving Car - How Did it Get So Smart?Your Self-Driving Car - How Did it Get So Smart?
Your Self-Driving Car - How Did it Get So Smart?
 
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
 
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
 
Big Data , Big Problem?
Big Data , Big Problem?Big Data , Big Problem?
Big Data , Big Problem?
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
Robin_Hadoop
Robin_HadoopRobin_Hadoop
Robin_Hadoop
 
Building ML Pipelines with DCOS
Building ML Pipelines with DCOSBuilding ML Pipelines with DCOS
Building ML Pipelines with DCOS
 
Cisco Big Data Use Case
Cisco Big Data Use CaseCisco Big Data Use Case
Cisco Big Data Use Case
 
cisco_bigdata_case_study_1
cisco_bigdata_case_study_1cisco_bigdata_case_study_1
cisco_bigdata_case_study_1
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
 
2014.07.11 biginsights data2014
2014.07.11 biginsights data20142014.07.11 biginsights data2014
2014.07.11 biginsights data2014
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
flexpod_hadoop_cloudera
flexpod_hadoop_clouderaflexpod_hadoop_cloudera
flexpod_hadoop_cloudera
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
 
Above the cloud joarder kamal
Above the cloud   joarder kamalAbove the cloud   joarder kamal
Above the cloud joarder kamal
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 

Recently uploaded (20)

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 

Scaling Data Science on Big Data

  • 1. © 2017 IBM Corporation Scaling Data Science on Big Data Vikram Murali Program Director, Data Science & Machine Learning, IBM Sriram Srinivasan STSM & Architect, Data Science & Machine Learning, IBM
  • 2. IBM Analytics 3 © 2017 IBM Corporation<#> Data Scientist Pain Points  Where is the data I need to drive business insights?  I don’t want to know Hadoop/Hive etc How do I collaborate and share my work with others?  What is the best visualization technique to tell my story?  How do I bring my familiar R/Python libraries to this new Data Science platform?  How do I learn to use the latest libraries/Technique? (TensorFlow, Scikit learn, XGBoost,…) and how do I ensure the right set of compute resources for these ? How are my Machine Learning Models performing & how to improve them?  I have this Machine Learning Model, how do I deploy it in production? Machine Learning & Data Science
  • 3. IBM Analytics 4 © 2017 IBM Corporation<#> Challenges for the Enterprise  Ensure secure data access & auditability - for governance and compliance  Control and Curate access to data and for all open source libraries used  Explainability and reproducibility of machine learning activities  Improve trust in analytics and predictions  Efficient Collaboration and versioning of all source, sample data and models  Repeatability of process  Establish Continuous integration practices  Agility in delivery  Publish/Share and identify provenance/ lineage with confidence  Visibility and Access control  Effective Resource utilization and ability to scale-out on demand  Balance resources amongst different data scientists, machine learning practioners' workloads Machine Learning & Data Science
  • 4. IBM Analytics 5 © 2017 IBM Corporation<#> Why has this been hard ?  Rigid toolsets & absence of an integrated platform  Have to choose one and only one approach  Cannot easily connect all of the capabilities required  Difficult to navigate between the various tools used  Fragmented and time consuming practices  Result of using multiple disjoint environments  Separate on-ramp/community for each tool/environment  Does not yield meaningful meta data or complete data lineage  Analytical Silos  Difficult to maintain and version control project assets  Limited means of collaborating with teams  Results are difficult to share and audit Machine Learning & Data Science  Resource Management Complexity  Lack of scalable infrastructure  Inflexible resource prioritization techniques
  • 5. IBM Analytics 6 © 2017 IBM Corporation<#> Introducing IBM Data Science Experience • Projects and Version Control • Spark-in-DSX and Remote Spark • IBM Machine Learning tech - algorithms & more • Platform Manager – for easy administration • Compute Elasticity support IBM Data Science Experience Community Open Source IBM Added Value • Find tutorials and datasets • Connect with other Data Scientists • IBM ML Hub for expert assistance • Open Source evangelism • Fork and share projects, samples • Code in Scala/Python/R/SQL • Zeppelin & Jupyter Notebooks • RStudio IDE • Anaconda distribution • Add your favorite libraries Machine Learning & Data Science
  • 6. IBM Analytics 7 © 2017 IBM Corporation<#> IBM Data Science Experience DSX on Public Cloud DSX Desktop DSX Local on Private Cloud  PayGo consumption with as-a-service delivery, up & running in seconds  Integrated with IBM Spark-as-a-Service for compute, IBM Object Store for data, as well as other platform assets  Immediate cloud collaboration via RStudio and Jupyter notebooks  Easily installed on your laptop or PC  Won’t scale beyond the hardware available on your machine  Access to RStudio and Jupyter notebooks, powered by one small Spark worker operating locally on your machine  Load CSV data files into Data Frames  Scalable DSX cluster deployed on your private infrastructure  Dockerized containers via Kubernetes  DSX Local can also deploy with Hortonworks Data Platform on-premises  LDAP for user management and authentication  Easy collaboration, versioning with Projects & git Built-in Zeppelin & Jupyter Notebooks and RStudio for visualizing and coding on data science tasks using Python, R, & Scala. Built-in Spark parallelizes & accelerates data science tasks. Machine Learning & Data Science
  • 7. IBM Analytics 8 Machine Learning Workflow in Data Science Experience  Machine learning detects if models fall out of spec — and automatically triggers retraining  Fully integrated model management means data scientists, app developers & operations can use the same environment Machine Learning & Data Science Data Live SystemIngest Data Processing Model Training Deployment & Management Creating samples & Cleansing Automating Data Science Workloads Scalable Deployment Feedback Loop  Historical  Streaming  Data visualization  Feature transform and engineering  Model selection and evaluation  Pipelines, not only models  Versioning  Predict when given new data  Monitoring and live evaluation Models lose accuracy Data Scientists + Researchers ML Engineers + Production Engineers Data Engineers
  • 8. IBM Analytics 9 Data Science Experience Machine Learning Everywhere – An Open Platform  Add your favorite libraries  Publish Open APIs for secure ML applications Machine Learning & Data Science
  • 9. IBM Analytics 10 DSX Local Architecture Machine Learning & Data Science
  • 10. IBM Analytics 11 DSX Scale out in Kubernetes is simple  DSX-Spark scale-out is automatically done by adding more compute nodes (via “Daemon Sets”)  Remote Spark can be independently scaled out as usual (say in Hadoop/Yarn)  Individual workload Isolation and scale-out in pods  Each DSX individual user (or an entity, in general) gets a Kubernetes namespace assigned, making metering simple.  All containers (pods) for that user gets spawned in that namespace, such as for tools – Jupyter/Zeppelin (Python) or R/RStudio as well as other non-spark jobs.  Namespace provides total quota for that user with resource requests and limits set in each pod deployment  “Shared” services are load balanced (with HA support) across all user access by typical Kubernetes techniques, such as via replicas of pods & DNS-routing via Kubernetes services. Machine Learning & Data Science
  • 11. IBM Analytics 12 Data Science Experience with Hortonworks Data Platform Big Data DSX
  • 12. IBM Analytics 13 Data Science Experience with HDP –Roadmap DSX & HDP interoperability Side-by-Side Installation DSX on-the-edge integrated & optimized for HDP deployments DSX Jupyter, RStudio & Zeppelin and Machine Learning services enabled for HDP data sources Yarn managed Spark leveraged by DSX, via Livy • Spark jobs pushed to HDP cluster Single Cluster DSX Within HDP Cluster Dedicated nodes for DSX in the HDP cluster with Ambari-based installation/configuration. Deploy & scale DSX with Yarn managing DSX as a top-level application Knox, Ranger & Atlas integration for authentication, authorization & governance Fully Yarn Managed DSX Workloads HDP embeds Kubernetes in Yarn, enables launch and integration of Kubernetes pods as Yarn containers Yarn manages all workloads in a granular fashion across the entire HDP cluster • Python & R workloads (non- spark) also managed by Yarn • GPU affinity , especially for Deep Learning Jobs Today Q4 2017 1st Half 2018 1 Machine Learning & Data Science
  • 13. IBM Analytics 14 Goal: Enterprise IaaS for Data Scientists  Efficient Compute Resource Management for large-scale Analytics, Machine Learning and Deep Learning workloads -Enable Data Scientists to procure resources from a shared compute “grid” for any kind of activity from interactive notebooks & IDEs to training Jobs or scheduled scripts and Apps. -All compute manifested as Docker containers/Kubernetes pods  HDP/Yarn as the Resource Manager -Enable all workloads, whether Map Reduce or Spark Jobs or DSX/ML activities to be uniformly handled by the HDP/Yarn scheduler. -Manage Queue Priorities, balancing of workloads and scale-out for the whole cluster providing best utilization of all resources.  Yarn and Kubernetes - the best of both worlds ! Machine Learning & Data Science
  • 14. IBM Analytics 15 © 2017 IBM Corporation<#> Call to Action Experience DSX & ML Today… IBM DSX at http://datascience.ibm.com DSX Local recorded demos Machine Learning: https://www.youtube.com/watch?v=htGZ1Iomeec Connecting to external Spark: https://www.youtube.com/watch?v=rA0Rlb2M_oI Spark submit from external app: https://www.youtube.com/watch?v=TETAT9pC9_o Administration experience: https://www.youtube.com/watch?v=htGZ1Iomeec Birds of a Feather session 6pm Thursday, C 4.5 Machine Learning & Data Science
  • 15. © 2017 IBM Corporation THANK YOU IBM Data Science Experience Vikram Murali Program Director, Data Science & Machine Learning, IBM Sriram Srinivasan STSM & Architect, Data Science & Machine Learning, IBM