1Copyright 2018 © Qubole
STATE OF ENTERPRISE DATA SCIENCE
David Roe
Pradeep Reddy
2Copyright 2018 © Qubole
Birth of Data Science in Life Sciences
Cholera Outbreak in 1854; London
- Prevailing Theory: Miasma Theory (Cholera was caused by bad air)
- Dr John Snow refuted Miasma Theory by marking on a map of London the locations of all known cases of cholera that led to death. This marked the birth of “Epidemiology”.
- Reference: The Ghost Map by Steven Johnson
3Copyright 2018 © Qubole
INTRODUCTION
OVERVIEW OF STATE OF DATA SCIENCE TODAY
- KEY TRENDS
- CURRENT PROBLEMS
DATA SCIENCE WORKFLOW IN MODERN ARCHITECTURE
- INSIGHTS FROM 2018 BIG DATA ACTIVATION REPORT
- HOW COMPANIES ARE BECOMING SUCCESSFUL
DEMO OF ML IMPLEMENTATION WITH HADOOP AND SPARK
- END-TO-END BATCH PIPELINE
- MODEL OUTPUTS & VISUALIZATION
4Copyright 2018 © Qubole
The transformational promise of Data Science projects remains elusive
- 85% of Data Science projects fail to meet expectations
- >70% of Analytics potential value is unrealised
5Copyright 2018 © Qubole
Data Science can be successful with a Modern Data Architecture that:
- scales to allow your models to train against production data
- enables you to iterate and prototype quickly
- provides you with a solid hand-off from training to production
6Copyright 2018 © Qubole
COMMON MACHINE LEARNING DATAFLOW
7Copyright 2018 © Qubole
- Data Preparation. Tasks: wrangling, exploration, validation
- Model Build. Tasks: split data, model specification, feature selection
- Model Validation. Tasks: train, visualize, compare / choose models, model report
- Deploy & Monitor. Tasks: build, compile/JAR, reporting dashboard, monitor
8Copyright 2018 © Qubole
Question 1:
How many of you do Big Data and/or Data Science in the Cloud?
9Copyright 2018 © Qubole
QUBOLE BIG DATA ACTIVATION STACK
Data Scientists, Data Engineers, Analysts (each with Third-Party Tools)
Qubole Cloud-Native Big Data Activation Platform: Autoscale, Caching, Spot Buying, AIR, Serverless, Monitoring, …
Cloud Data Lake
10Copyright 2018 © Qubole
AUTOSCALING BIG DATA ENGINES IN CLOUD
11Copyright 2018 © Qubole
DATA SCIENCE REQUIRES SCALABLE BIG DATA
DATA CLOUD
- 50% savings in cloud spend
- 1:65 DataOps : Users
- 10X increase in IoT data
Copyright 2017 © Qubole
STATE OF BIG DATA ADOPTION
- 1ST STAGE, ASPIRATION: Production reporting/DW; Researching
- 2ND STAGE, EXPERIMENTATION: Initial Big Data Deployment; Targeted use case
- 3RD STAGE, EXPANSION: Multiple departments; Multiple engines; Top down use cases
- 4TH STAGE, INVERSION: Enterprise transformation; Bottom up use cases
- 5TH STAGE, NIRVANA: Digital enterprise; Ubiquitous insights; True business transformation
Copyright 2017 © Qubole
MACHINE LEARNING WORKFLOW IS A PRODUCT LIFECYCLE
- 1ST STAGE, BUSINESS VALUE: Identifying stakeholders; Product roadmap
- 2ND STAGE, EXPERIMENTATION: Data Exploration; Initial Big Data deployment; Targeted use case
- 3RD STAGE, DEVELOPMENT: Multiple Departments; Model training; Multiple engines & deployments; Top Down Use Cases
- 4TH STAGE, PRODUCTION: Enterprise transformation; Bottom up use cases
- 5TH STAGE, Continuous Integration / Delivery (CI/CD): Digital enterprise; Measuring impact; True business transformation
14Copyright 2018 © Qubole
Data Science Workflow - Team Data Science Process (TDSP)
Source: Microsoft Azure
“Data that is loved
tends to survive.”
Kurt Bollacker,
Distinguished Data
Scientist
15Copyright 2018 © Qubole
Question 2:
How many of you have Big Data in the cloud?
16Copyright 2018 © Qubole
Other Data Science/Data Mining Process Models
Source: http://www.gmelli.org/RKB/SEMMA_Process_Model
Source: https://www.stellarconsulting.co.nz/blog/data/crisp-dm-still-a-leader/
17Copyright 2018 © Qubole
ENABLING DATA SCIENCE WORKFLOW
Personas Access Use Cases Engines Cloud
Data Engineering
Data Science
Data Analysts
Machine Learning
Campaign Reports
Email analytics
Fraud detection
Presto
Spark
Hive
TensorFlow
AirFlow
AWS
GCP
Marketing
Revenue
Management
Finance
Commercial teams
OUTCOMES
● Data Science teams are able to scale their products individually (rather than having one shared multi-tenant environment)
● Saw immediate cost savings on existing cloud investments, which allowed the company to focus on R&D
● Able to go to market with new Data Science products in 1-3 months
● Mitigated SLA delays on analytics reports
18Copyright 2018 © Qubole
How did they do it?
AUTOMATED EMAIL CAMPAIGN CLASSIFICATION
- Send email to request data to tag
- Attachment with untagged data
- Upload tagged data
- Cloud data lake
- Rollup tagged data
- Train Model
- Internal customer data
- Email data classified by campaign type
- Extract email text and join with tagged data
- Hive Table & Dashboards
- Browse Campaign Product
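The flow on this slide (tag a sample of emails, roll up the tags, train a model, then classify email data by campaign type) can be sketched roughly as follows. This is a minimal sketch, not the presenters' implementation: the slides do not name the model, so a multinomial naive Bayes classifier over word counts is assumed here, and the sample emails, the campaign labels and the helper names `tokenize`, `train` and `classify` are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the email text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(tagged_emails):
    """Roll up tagged emails into per-campaign word counts (hypothetical helper)."""
    word_counts = defaultdict(Counter)   # campaign type -> word frequencies
    doc_counts = Counter()               # campaign type -> number of tagged emails
    vocab = set()
    for text, campaign in tagged_emails:
        tokens = tokenize(text)
        word_counts[campaign].update(tokens)
        doc_counts[campaign] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the campaign type with the highest naive Bayes log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for campaign in doc_counts:
        # Log prior from the share of tagged emails in this campaign type.
        score = math.log(doc_counts[campaign] / total_docs)
        total_words = sum(word_counts[campaign].values())
        for token in tokenize(text):
            # Laplace (add-one) smoothing so unseen words don't zero the score.
            count = word_counts[campaign][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = campaign, score
    return best_label


# Hypothetical tagged data, standing in for the uploaded tagged emails.
tagged = [
    ("Save 20% on your next order this weekend", "promotion"),
    ("Flash sale: discount codes inside", "promotion"),
    ("Your monthly newsletter and product news", "newsletter"),
    ("News and updates from our team this month", "newsletter"),
]
model = train(tagged)
label = classify("Weekend discount sale on every order", model)
print(label)
```

In the setup the slide describes, the predicted labels would presumably be written back to a Hive table for dashboards; that step is left out of the sketch.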
19Copyright 2018 © Qubole
How did they do it?
KEY CHARACTERISTICS OF
DATA-DRIVEN ORGANIZATIONS
Copyright 2017 © Qubole
TYPICAL DATA LAKE OPERATION
[Diagram: data lake on Cloud Compute and Object Storage]
- Zones: Raw (Staged); Derived ‘Source of Truth’
- File formats: AVRO, PARQUET
- Hive / Spark
- Insert/Update/Delete
- Export: CSV, JSON
- Analytic Data Warehouse (e.g. Redshift & Snowflake environments)
- Data Serving DBs (e.g. Cassandra, DynamoDB, etc.)
- SPARK
- PRESTO: Interactive ad-hoc queries
- Data Discovery
- ML & DL
Use Cases:
- Analytics (e.g. Product Analytics, BI, User insights etc.)
- Data Products (e.g. Personalisation, Recommendation etc.)
- Data Science (e.g. Time-series Analysis, Research etc.)
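The labels on this slide suggest a flow in which raw staged records are turned into a derived "source of truth" through insert/update/delete operations and then exported as CSV or JSON. The sketch below illustrates that idea only. It assumes in-memory dictionaries in place of Avro and Parquet files, plain Python in place of Hive or Spark, and a hypothetical change-record layout (`op`, `id`, `row`); none of these details come from the slides.

```python
import csv
import io
import json


def apply_changes(derived, raw_changes):
    """Apply staged insert/update/delete records to the derived table, keyed by id."""
    table = dict(derived)  # copy so the input snapshot is left untouched
    for change in raw_changes:
        op, key = change["op"], change["id"]
        if op in ("insert", "update"):
            table[key] = change["row"]
        elif op == "delete":
            table.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


def export_csv(table, columns):
    """Serialize the derived table as CSV text, one row per key in key order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id"] + columns)
    for key in sorted(table):
        writer.writerow([key] + [table[key][col] for col in columns])
    return buffer.getvalue()


def export_json(table):
    """Serialize the derived table as a JSON array of records."""
    return json.dumps([{"id": k, **table[k]} for k in sorted(table)])


# Hypothetical raw (staged) change records.
raw = [
    {"op": "insert", "id": 1, "row": {"plan": "basic", "active": True}},
    {"op": "insert", "id": 2, "row": {"plan": "pro", "active": True}},
    {"op": "update", "id": 1, "row": {"plan": "pro", "active": True}},
    {"op": "delete", "id": 2},
]
derived = apply_changes({}, raw)
csv_text = export_csv(derived, ["plan", "active"])
json_text = export_json(derived)
print(csv_text)
print(json_text)
```

Keeping the raw change records separate from the derived table means the derived view can be rebuilt from staged data at any time, which is one plausible reading of the Raw (Staged) versus Derived split on the slide.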
21Copyright 2018 © Qubole
ON-PREMISE DATA SCIENCE APPROACH VS. CLOUD
On-premise (DATA LOCALITY):
• Impossible to scale storage without scaling compute, leading to expensive deployments
• Difficult to share HDFS data across Operating Units
• Higher Upfront Cost
• No Autoscaling
• Having to Fit Models in Fixed Infrastructure
• Fewer DS Tools
Cloud (NO DATA LOCALITY, Cloud Object Store):
• Compute & Storage Separate
• Data is easily shared across Operating Units & accessed from different locations
• Lower Cost
• More Iterative
• Scalable with Automation
• Fast Data and ML Tool Access
22Copyright 2018 © Qubole
How did they do it?
STATE OF BIG DATA TODAY
23Copyright 2018 © Qubole
2018 QUBOLE BIG DATA ACTIVATION REPORT
Download a copy of the Qubole 2018 Big Data Activation Report at
https://go.qubole.com/CA-WP-BigDataIndexReport_NewLP.html
This in-depth research is based on anonymised insights from more than 200 global Qubole customers.
24Copyright 2018 © Qubole
THE ‘BIG THREE’ OPEN SOURCE ENGINES
Characteristics and strengths
- Apache Hive/Hadoop: Workhorse for handling massive volumes of data for ETL, ELT or data preparation on structured and semi-structured information
- Apache Spark: Powerful for processing complex and memory-intensive workflows such as creating data pipelines or implementing machine learning
- Presto: Shines in interactive analytics (business intelligence (BI), data discovery tools) when data is in a semi-structured or structured form.
25Copyright 2018 © Qubole
Question 3:
How many of you use multiple big data engines (Hive, Spark & Presto)?
26Copyright 2018 © Qubole
SINGLE VS. MULTIPLE OPEN SOURCE ENGINES
Percentage of companies who use single vs. multiple big data engines
Companies are increasingly deploying multiple engines to solve specific use cases
- Multi Engine: 75.9%; Single Engine: 24.1%
- Multi Engine: 86%; Single Engine: 14%
27Copyright 2018 © Qubole
MEASURING EFFICIENCY BY COMMANDS
YOY Growth in Total No. of Commands Run:
- Apache Spark: 439%
- Presto: 365%
- Apache Hadoop/Hive: 129%
Presto: 24x more commands run per hour than Apache Spark, and 6x more commands than Apache Hadoop/Hive
28Copyright 2018 © Qubole
3 MUST-HAVES
- Movement to Multi-Engine: Companies are increasingly deploying multiple open source engines for different use cases (ML, ETL, analytics, etc.)
- Users Getting More Access: More users have access to data and are running more commands and collaborating
- Cloud Benefits Recognized: Companies are leveraging multiple clouds and automation
29Copyright 2018 © Qubole
How did they do it?
Customer Churn Model Demo
30Copyright 2018 © Qubole
Data Science Notebooks
What are they?
Notebooks are like lab books from high school science, but with a Harry Potter twist. Like the animated images printed in the Daily Prophet, the code in a notebook can be executed and results displayed as part
Purpose:
• Collaboration Suite for Data Science
projects
• Easy access to computing resources
for data science workloads.
• Building blocks that enable self
service data mining.
• Supports a variety of languages like
R, Python and Scala.
31Copyright 2018 © Qubole
Question 4:
How many of you use Data Science
Notebooks for Collaboration?
32Copyright 2018 © Qubole
ML Example: Scalable Data Science
Data Science Customer Churn Overview:
1. Ingest Telco Churn Dataset (ETL)
2. Refine/curate features and labels (ETL), often referred to as feature engineering.
3. Split dataset into test & train samples (70-30 or 60-40 splits)
4. Create multiple 3-stage ML pipelines for various models (e.g. logistic, gradient boosting, random forests)
5. Run the multiple pipelines defined above to train on predicting the churn response variable.
6. Plotly visualizations for model comparison/validation, scoring & selection
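The steps above can be sketched in plain Python as shown below. This is a rough illustration under stated assumptions, not the demo notebook itself: the demo runs on Spark, while this sketch uses only the standard library. The synthetic two-feature dataset, the three stage classes (`Standardize`, `AddBias`, then an estimator) and the majority-class baseline are all hypothetical stand-ins. Gradient boosting and random forests are omitted for brevity, and a printed accuracy score replaces the Plotly charts.

```python
import math
import random


class Standardize:
    """Stage 1: rescale each feature to zero mean and unit variance."""

    def fit(self, X, y):
        cols = list(zip(*X))
        self.means = [sum(c) / len(c) for c in cols]
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        self.stds = [
            math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, self.means)
        ]
        return self

    def transform(self, X):
        return [[(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
                for row in X]


class AddBias:
    """Stage 2: append a constant feature so the model can fit an intercept."""

    def fit(self, X, y):
        return self

    def transform(self, X):
        return [row + [1.0] for row in X]


class Logistic:
    """Stage 3 (model A): logistic regression via batch gradient descent."""

    def __init__(self, lr=0.5, epochs=200):
        self.lr, self.epochs = lr, epochs

    def _prob(self, row):
        z = sum(w * v for w, v in zip(self.w, row))
        # Clamp z so math.exp cannot overflow.
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

    def fit(self, X, y):
        self.w = [0.0] * len(X[0])
        for _ in range(self.epochs):
            grad = [0.0] * len(self.w)
            for row, label in zip(X, y):
                err = self._prob(row) - label
                for j, v in enumerate(row):
                    grad[j] += err * v
            self.w = [w - self.lr * g / len(X) for w, g in zip(self.w, grad)]
        return self

    def predict(self, X):
        return [1 if self._prob(row) >= 0.5 else 0 for row in X]


class Majority:
    """Stage 3 (model B): baseline that always predicts the most common label."""

    def fit(self, X, y):
        self.label = 1 if sum(y) * 2 >= len(y) else 0
        return self

    def predict(self, X):
        return [self.label] * len(X)


def run_pipeline(stages, X_train, y_train, X_test):
    """Fit the transformer stages in order, then fit the estimator and predict."""
    *transformers, estimator = stages
    for stage in transformers:
        X_train = stage.fit(X_train, y_train).transform(X_train)
        X_test = stage.transform(X_test)
    return estimator.fit(X_train, y_train).predict(X_test)


# Synthetic stand-in for the Telco churn data: [tenure_months, monthly_charge].
rng = random.Random(42)
rows = [[rng.uniform(1, 72), rng.uniform(20, 120)] for _ in range(100)]
labels = [1 if charge - tenure > 35 else 0 for tenure, charge in rows]

# 70-30 train/test split.
X_train, X_test = rows[:70], rows[70:]
y_train, y_test = labels[:70], labels[70:]

pipelines = {
    "logistic": [Standardize(), AddBias(), Logistic()],
    "baseline": [Standardize(), AddBias(), Majority()],
}
scores = {}
for name, stages in pipelines.items():
    predictions = run_pipeline(stages, X_train, y_train, X_test)
    scores[name] = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)

best = max(scores, key=scores.get)
print(scores, "selected:", best)
```

Fitting every transformer on the training split only, and then reusing it on the test split, keeps test data out of the fitted statistics. The same fit-then-transform discipline applies whichever pipeline library is actually used.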
33Copyright 2018 © Qubole
ML Customer Churn Pipeline
34Copyright 2018 © Qubole
Sign up at www.qubole.com
35Copyright 2018 © Qubole
Appendix: Instructions to Download the Demo Notebook
• Sign up for a Qubole free account on Azure ( www.qubole.com ). This gives you 14 days of free access to try Hive, Spark, Presto & Airflow on Qubole.
• Once signed up, navigate to “Notebooks” in the Home menu in the top-left corner.
• Click New, then Import from URL, and enter the URL below:
• https://goo.gl/ENTqo2
• Once the notebook is imported, you can start the cluster from the notebook and explore it.

More Related Content

What's hot

What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk Ellen Friedman
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Carol McDonald
 
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
On Performance Under Hotspots in Hadoop versus Bigdata Replay PlatformsOn Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
On Performance Under Hotspots in Hadoop versus Bigdata Replay PlatformsTokyo University of Science
 
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive AnalyticsInfochimps, a CSC Big Data Business
 
Digital Transformation - #StrataData London 2017 - Data101
Digital Transformation - #StrataData London 2017 - Data101Digital Transformation - #StrataData London 2017 - Data101
Digital Transformation - #StrataData London 2017 - Data101Ellen Friedman
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data ScienceDataWorks Summit
 
Moving from Artisanal to Industrial Machine Learning
Moving from Artisanal to Industrial Machine LearningMoving from Artisanal to Industrial Machine Learning
Moving from Artisanal to Industrial Machine LearningGreg Landrum
 
Big Data Scotland 2017
Big Data Scotland 2017Big Data Scotland 2017
Big Data Scotland 2017Ray Bugg
 
Real-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesReal-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesDataWorks Summit/Hadoop Summit
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCEAM Publications,India
 
DataXDay - A data scientist journey to industrialization of machine learning
DataXDay - A data scientist journey to industrialization of machine learning DataXDay - A data scientist journey to industrialization of machine learning
DataXDay - A data scientist journey to industrialization of machine learning DataXDay Conference by Xebia
 
Let’s talk about reproducible data analysis
Let’s talk about reproducible data analysisLet’s talk about reproducible data analysis
Let’s talk about reproducible data analysisGreg Landrum
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Sarah Aerni
 
Fireside Chat with Bloor Research: State of the Graph Database Market 2020
Fireside Chat with Bloor Research: State of the Graph Database Market 2020Fireside Chat with Bloor Research: State of the Graph Database Market 2020
Fireside Chat with Bloor Research: State of the Graph Database Market 2020Cambridge Semantics
 
Advanced Analytics for Any Data at Real-Time Speed
Advanced Analytics for Any Data at Real-Time SpeedAdvanced Analytics for Any Data at Real-Time Speed
Advanced Analytics for Any Data at Real-Time Speeddanpotterdwch
 
Leveraging Taxonomy Management with Machine Learning
Leveraging Taxonomy Management with Machine LearningLeveraging Taxonomy Management with Machine Learning
Leveraging Taxonomy Management with Machine LearningSemantic Web Company
 

What's hot (20)

What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk
 
7 Predictive Analytics, Spark , Streaming use cases
7 Predictive Analytics, Spark , Streaming use cases7 Predictive Analytics, Spark , Streaming use cases
7 Predictive Analytics, Spark , Streaming use cases
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
 
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
On Performance Under Hotspots in Hadoop versus Bigdata Replay PlatformsOn Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms
 
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
 
Digital Transformation - #StrataData London 2017 - Data101
Digital Transformation - #StrataData London 2017 - Data101Digital Transformation - #StrataData London 2017 - Data101
Digital Transformation - #StrataData London 2017 - Data101
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Combining hadoop with big data analytics
Combining hadoop with big data analyticsCombining hadoop with big data analytics
Combining hadoop with big data analytics
 
Moving from Artisanal to Industrial Machine Learning
Moving from Artisanal to Industrial Machine LearningMoving from Artisanal to Industrial Machine Learning
Moving from Artisanal to Industrial Machine Learning
 
Big Data Scotland 2017
Big Data Scotland 2017Big Data Scotland 2017
Big Data Scotland 2017
 
Real-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesReal-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and Challenges
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCESURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
SURVEY ON BIG DATA PROCESSING USING HADOOP, MAP REDUCE
 
DataXDay - A data scientist journey to industrialization of machine learning
DataXDay - A data scientist journey to industrialization of machine learning DataXDay - A data scientist journey to industrialization of machine learning
DataXDay - A data scientist journey to industrialization of machine learning
 
Let’s talk about reproducible data analysis
Let’s talk about reproducible data analysisLet’s talk about reproducible data analysis
Let’s talk about reproducible data analysis
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
 
Fireside Chat with Bloor Research: State of the Graph Database Market 2020
Fireside Chat with Bloor Research: State of the Graph Database Market 2020Fireside Chat with Bloor Research: State of the Graph Database Market 2020
Fireside Chat with Bloor Research: State of the Graph Database Market 2020
 
Advanced Analytics for Any Data at Real-Time Speed
Advanced Analytics for Any Data at Real-Time SpeedAdvanced Analytics for Any Data at Real-Time Speed
Advanced Analytics for Any Data at Real-Time Speed
 
Leveraging Taxonomy Management with Machine Learning
Leveraging Taxonomy Management with Machine LearningLeveraging Taxonomy Management with Machine Learning
Leveraging Taxonomy Management with Machine Learning
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 

Similar to State of enterprise data science

Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Holden Ackerman
 
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?SnapLogic
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationAbdelkrim Hadjidj
 
A New Day for Oracle Analytics
A New Day for Oracle AnalyticsA New Day for Oracle Analytics
A New Day for Oracle AnalyticsRich Clayton
 
OVH Analytics Data Compute and Apache Spark as a Service
OVH Analytics Data Compute and Apache Spark as a ServiceOVH Analytics Data Compute and Apache Spark as a Service
OVH Analytics Data Compute and Apache Spark as a ServiceMojtaba Imani
 
Embedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern StaenderEmbedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern StaenderDataconomy Media
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...Jürgen Ambrosi
 
BIG Data & Hadoop Applications in Logistics
BIG Data & Hadoop Applications in LogisticsBIG Data & Hadoop Applications in Logistics
BIG Data & Hadoop Applications in LogisticsSkillspeed
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?Denodo
 
Putting together AI pipelines with Acumos
Putting together AI pipelines with AcumosPutting together AI pipelines with Acumos
Putting together AI pipelines with AcumosPantelis Monogioudis
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Data Con LA
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Denodo
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewVMware Tanzu
 
Modern data integration expert sessions
Modern data integration expert sessionsModern data integration expert sessions
Modern data integration expert sessionsJessicaMurrell3
 
Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar ibi
 
Artificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Artificial Intelligence and Machine Learning with the Oracle Data Science CloudArtificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Artificial Intelligence and Machine Learning with the Oracle Data Science CloudJuarez Junior
 
Cwin16 - Lyon - partner mark logic - the rise of nosql
Cwin16 - Lyon - partner mark logic - the rise of nosqlCwin16 - Lyon - partner mark logic - the rise of nosql
Cwin16 - Lyon - partner mark logic - the rise of nosqlCapgemini
 
Big Data for Product Managers
Big Data for Product ManagersBig Data for Product Managers
Big Data for Product ManagersPentaho
 

Similar to State of enterprise data science (20)

Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI Top Trends in Building Data Lakes for Machine Learning and AI
Top Trends in Building Data Lakes for Machine Learning and AI
 
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
Intelligent data summit: Self-Service Big Data and AI/ML: Reality or Myth?
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
 
A New Day for Oracle Analytics
A New Day for Oracle AnalyticsA New Day for Oracle Analytics
A New Day for Oracle Analytics
 
OVH Analytics Data Compute and Apache Spark as a Service
OVH Analytics Data Compute and Apache Spark as a ServiceOVH Analytics Data Compute and Apache Spark as a Service
OVH Analytics Data Compute and Apache Spark as a Service
 
Embedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern StaenderEmbedded-ml(ai)applications - Bjoern Staender
Embedded-ml(ai)applications - Bjoern Staender
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
 
BIG Data & Hadoop Applications in Logistics
BIG Data & Hadoop Applications in LogisticsBIG Data & Hadoop Applications in Logistics
BIG Data & Hadoop Applications in Logistics
 
What is the future of data strategy?
What is the future of data strategy?What is the future of data strategy?
What is the future of data strategy?
 
Putting together AI pipelines with Acumos
Putting together AI pipelines with AcumosPutting together AI pipelines with Acumos
Putting together AI pipelines with Acumos
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
 
Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)Future of Data Strategy (ASEAN)
Future of Data Strategy (ASEAN)
 
Pivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical OverviewPivotal Big Data Suite: A Technical Overview
Pivotal Big Data Suite: A Technical Overview
 
Modern data integration expert sessions
Modern data integration expert sessionsModern data integration expert sessions
Modern data integration expert sessions
 
Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar Modern Data Integration Expert Session Webinar
Modern Data Integration Expert Session Webinar
 
Stream based Data Integration
Stream based Data IntegrationStream based Data Integration
Stream based Data Integration
 
Artificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Artificial Intelligence and Machine Learning with the Oracle Data Science CloudArtificial Intelligence and Machine Learning with the Oracle Data Science Cloud
Artificial Intelligence and Machine Learning with the Oracle Data Science Cloud
 
Cwin16 - Lyon - partner mark logic - the rise of nosql
Cwin16 - Lyon - partner mark logic - the rise of nosqlCwin16 - Lyon - partner mark logic - the rise of nosql
Cwin16 - Lyon - partner mark logic - the rise of nosql
 
Big Data for Product Managers
Big Data for Product ManagersBig Data for Product Managers
Big Data for Product Managers
 

More from Yan Xu

Kaggle winning solutions: Retail Sales Forecasting
Kaggle winning solutions: Retail Sales ForecastingKaggle winning solutions: Retail Sales Forecasting
Kaggle winning solutions: Retail Sales ForecastingYan Xu
 
Basics of Dynamic programming
Basics of Dynamic programming Basics of Dynamic programming
Basics of Dynamic programming Yan Xu
 
Walking through Tensorflow 2.0
Walking through Tensorflow 2.0Walking through Tensorflow 2.0
Walking through Tensorflow 2.0Yan Xu
 
Practical contextual bandits for business
Practical contextual bandits for businessPractical contextual bandits for business
Practical contextual bandits for businessYan Xu
 
Introduction to Multi-armed Bandits
Introduction to Multi-armed BanditsIntroduction to Multi-armed Bandits
Introduction to Multi-armed BanditsYan Xu
 
A Data-Driven Question Generation Model for Educational Content - by Jack Wang
A Data-Driven Question Generation Model for Educational Content - by Jack WangA Data-Driven Question Generation Model for Educational Content - by Jack Wang
A Data-Driven Question Generation Model for Educational Content - by Jack WangYan Xu
 
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...
Deep Learning Approach in Characterizing Salt Body on Seismic Images - by Zhe...Yan Xu
 
Deep Hierarchical Profiling & Pattern Discovery: Application to Whole Brain R...
Deep Hierarchical Profiling & Pattern Discovery: Application to Whole Brain R...Deep Hierarchical Profiling & Pattern Discovery: Application to Whole Brain R...
Deep Hierarchical Profiling & Pattern Discovery: Application to Whole Brain R...Yan Xu
 
Detecting anomalies on rotating equipment using Deep Stacked Autoencoders - b...
Detecting anomalies on rotating equipment using Deep Stacked Autoencoders - b...Detecting anomalies on rotating equipment using Deep Stacked Autoencoders - b...
Detecting anomalies on rotating equipment using Deep Stacked Autoencoders - b...Yan Xu
 
Introduction to Autoencoders
Introduction to AutoencodersIntroduction to Autoencoders
Introduction to AutoencodersYan Xu
 
Long Short Term Memory
Long Short Term MemoryLong Short Term Memory
Long Short Term MemoryYan Xu
 
Deep Feed Forward Neural Networks and Regularization
Deep Feed Forward Neural Networks and RegularizationDeep Feed Forward Neural Networks and Regularization
Deep Feed Forward Neural Networks and RegularizationYan Xu
 
Linear algebra and probability (Deep Learning chapter 2&3)
Linear algebra and probability (Deep Learning chapter 2&3)Linear algebra and probability (Deep Learning chapter 2&3)
Linear algebra and probability (Deep Learning chapter 2&3)Yan Xu
 
HML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningHML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningYan Xu
 
Secrets behind AlphaGo
Secrets behind AlphaGoSecrets behind AlphaGo
Secrets behind AlphaGoYan Xu
 
Optimization in Deep Learning
Optimization in Deep LearningOptimization in Deep Learning
Optimization in Deep LearningYan Xu
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkYan Xu
 
Convolutional neural network
Convolutional neural network Convolutional neural network
Convolutional neural network Yan Xu
 
Introduction to Neural Network
Introduction to Neural NetworkIntroduction to Neural Network
Introduction to Neural NetworkYan Xu
 
Nonlinear dimension reduction
Nonlinear dimension reductionNonlinear dimension reduction
Nonlinear dimension reductionYan Xu
 

State of enterprise data science

  • 1. Copyright 2018 © Qubole. STATE OF ENTERPRISE DATA SCIENCE. David Roe, Pradeep Reddy
  • 2. Birth of Data Science in Life Sciences. Cholera outbreak in London, 1854 - Prevailing theory: Miasma Theory (cholera was caused by bad air) - Dr John Snow refuted Miasma Theory by marking on a map of London the locations of all known cases of cholera that led to death. This marked the birth of “Epidemiology” - Reference: The Ghost Map by Steven Johnson
  • 3. INTRODUCTION. OVERVIEW OF STATE OF DATA SCIENCE TODAY - KEY TRENDS - CURRENT PROBLEMS. DATA SCIENCE WORKFLOW IN MODERN ARCHITECTURE - INSIGHTS FROM 2018 BIG DATA ACTIVATION REPORT - HOW COMPANIES ARE BECOMING SUCCESSFUL. DEMO OF ML IMPLEMENTATION WITH HADOOP AND SPARK - END-TO-END BATCH PIPELINE - MODEL OUTPUTS & VISUALIZATION
  • 4. The transformational promise of Data Science projects remains elusive. 85% of Data Science projects fail to meet expectations. >70% of Analytics potential value is unrealised.
  • 5. Data Science can be successful with a Modern Data Architecture that: scales to allow your models to train against production data; enables you to iterate and prototype quickly; provides you with a solid hand-off from training to production.
  • 6. COMMON MACHINE LEARNING DATAFLOW
  • 7. Data Preparation (tasks: wrangling, exploration, validation); Model Build (tasks: split data, model specification, feature selection); Model Validation (tasks: train, visualize, compare / choose models, model report); Deploy & Monitor (tasks: build, compile/JAR, reporting dashboard, monitor)
  • 8. Question 1: How many of you do Big Data and/or Data Science in the Cloud?
  • 9. QUBOLE BIG DATA ACTIVATION STACK. Users: Data Scientists, Data Engineers, Analysts (each with Third-Party Tools). Qubole Cloud-Native Big Data Activation Platform: Autoscale, Caching, Spot Buying, AIR, Serverless, Monitoring, … Cloud Data Lake
  • 10. AUTOSCALING BIG DATA ENGINES IN CLOUD
  • 11. DATA SCIENCE REQUIRES SCALABLE BIG DATA. DATA CLOUD. 50% savings in cloud spend. 1:65 DataOps : Users. 10X increase in IoT data
  • 12. Copyright 2017 © Qubole. STATE OF BIG DATA ADOPTION. Stages: ASPIRATION (1ST STAGE), EXPERIMENTATION (2ND STAGE), EXPANSION (3RD STAGE), INVERSION (4TH STAGE), NIRVANA (5TH STAGE). Characteristics, in stage order: Production reporting/DW; Researching; Initial Big Data Deployment; Targeted use case; Multiple departments; Multiple engines; Top down use cases; Enterprise transformation; Bottom up use cases; Digital enterprise; Ubiquitous insights; True business transformation
  • 13. MACHINE LEARNING WORKFLOW IS A PRODUCT LIFECYCLE. Phases: BUSINESS VALUE, EXPERIMENTATION, DEVELOPMENT, PRODUCTION; Continuous Integration / Delivery (CI/CD). Stages: 1ST STAGE through 5TH STAGE. Activities, in stage order: Identifying stakeholders; Product roadmap; Data Exploration; Initial Big Data deployment; Targeted use case; Multiple Departments; Model training; Multiple engines & deployments; Top Down Use Cases; Enterprise transformation; Bottom up use cases; Digital enterprise; Measuring impact; True business transformation
  • 14. Data Science Workflow - Team Data Science Process (TDSP). Source: Microsoft Azure. “Data that is loved tends to survive.” Kurt Bollacker, Distinguished Data Scientist
  • 15. Question 2: How many of you have Big Data in the cloud?
  • 16. Other Data Science/Data Mining Process Models. Source: http://www.gmelli.org/RKB/SEMMA_Process_Model Source: https://www.stellarconsulting.co.nz/blog/data/crisp-dm-still-a-leader/
  • 17. ENABLING DATA SCIENCE WORKFLOW. Personas: Data Engineering, Data Science, Data Analysts. Use Cases: Machine Learning, Campaign Reports, Email analytics, Fraud detection. Engines: Presto, Spark, Hive, TensorFlow, AirFlow. Cloud: AWS, GCP. Access: Marketing, Revenue Management, Finance, Commercial teams. OUTCOMES: ● Data Science teams are able to scale their products individually (rather than having one shared multi-tenant environment) ● Saw immediate cost savings on existing cloud investments, which allowed the company to focus on R&D ● Able to go to market with new Data Science products in 1-3 months ● Mitigated SLA delays on analytics reports
  • 18. How did they do it? AUTOMATED EMAIL CAMPAIGN CLASSIFICATION. Diagram labels: Send email to request data to tag; Attachment with untagged data; Upload tagged data; Cloud data lake; Rollup tagged data; Train Model; Internal customer data; Email data classified by campaign type; Extract email text and join with tagged data; Hive Table & Dashboards; Browse Campaign Product
  • 19. How did they do it? KEY CHARACTERISTICS OF DATA-DRIVEN ORGANIZATIONS
  • 20. TYPICAL DATA LAKE OPERATION. Diagram labels: AVRO; Raw (Staged); Derived ‘Source of Truth’; PARQUET; Hive / Spark; Insert/Update/Delete; Export (CSV, JSON); Analytic Data Warehouse (e.g. Redshift & Snowflake environments); Data Serving DBs (e.g. Cassandra, DynamoDB, etc.); SPARK; PRESTO; Interactive ad-hoc queries. Use Cases: Analytics (e.g. Product Analytics, BI, User insights etc.); Data Products (e.g. Personalisation, Recommendation etc.); Data Science (e.g. Time-series Analysis, Research etc.); Data Discovery; ML & DL. Infrastructure: Cloud Compute; Object Storage
  • 21. ON-PREMISE DATA SCIENCE APPROACH VS. CLOUD. On-premise (DATA LOCALITY): • Impossible to scale storage without scaling compute, leading to expensive deployments • Difficult to share HDFS data across Operating Units • Higher Upfront Cost • No Autoscaling • Having to Fit Models in Fixed Infrastructure • Fewer DS Tools. Cloud (NO DATA LOCALITY; Cloud Object Store): • Compute & Storage Separate • Data is easily shared across Operating Units & accessed from different locations • Lower Cost • More Iterative • Scalable with Automation • Fast Data and ML Tool Access
  • 22. How did they do it? STATE OF BIG DATA TODAY
  • 23. 2018 QUBOLE BIG DATA ACTIVATION REPORT. Download a copy of the Qubole 2018 Big Data Activation Report at https://go.qubole.com/CA-WP-BigDataIndexReport_NewLP.html This in-depth research is based on anonymised insights from more than 200 global Qubole customers.
  • 24. THE ‘BIG THREE’ OPEN SOURCE ENGINES. Characteristics and strengths. Apache Hive/Hadoop: workhorse for handling massive volumes of data for ETL, ELT or data preparation on structured and semi-structured information. Apache Spark: powerful for processing complex and memory-intensive workflows such as creating data pipelines or implementing machine learning. Presto: shines in interactive analytics (business intelligence (BI), data discovery tools) when data is in a semi-structured or structured form.
  • 25. Question 3: How many of you use multiple big data engines (Hive, Spark & Presto)?
  • 26. SINGLE VS. MULTIPLE OPEN SOURCE ENGINES. Percentage of companies who use single vs. multiple big data engines. Companies are increasingly deploying multiple engines to solve specific use cases. Multi Engine 75.9%, Single Engine 24.1%; Multi Engine 86%, Single Engine 14%
  • 27. MEASURING EFFICIENCY BY COMMANDS. YoY growth in total no. of commands run: 439% Apache Spark; 365% Presto; 129% Apache Hadoop/Hive. 24x more commands run per hour in Presto than Apache Spark; 6x more commands than Apache Hadoop/Hive
  • 28. 3 MUST-HAVES. Movement to Multi-Engine: companies are increasingly deploying multiple open source engines for different use cases (ML, ETL, analytics, etc.). Users Getting More Access: more users have access to data and are running more commands and collaborating. Cloud Benefits Recognized: companies are leveraging multiple clouds and automation
  • 29. How did they do it? Customer Churn Model Demo
  • 30. Data Science Notebooks. What are they? Notebooks are like lab books from high school science, but with a Harry Potter twist. Like the animated images printed in the Daily Prophet, the code in a notebook can be executed and results displayed as part Purpose: • Collaboration suite for Data Science projects • Easy access to computing resources for data science workloads • Building blocks that enable self-service data mining • Supports a variety of languages like R, Python and Scala
  • 31. Question 4: How many of you use Data Science Notebooks for Collaboration?
  • 32. ML Example: Scalable Data Science. Data Science Customer Churn Overview: 1. Ingest Telco Churn Dataset (ETL) 2. Refine/curate features and labels (ETL); often referred to as feature engineering 3. Split dataset into test & train samples (70-30 or 60-40 splits) 4. Create multiple 3-stage ML pipelines for various models (e.g. logistic, gradient boosting, random forests) 5. Run the multiple pipelines defined above to train on predicting the churn response variable 6. Plotly visualizations for model comparison/validation, scoring & selection
  • 33. ML Customer Churn Pipeline
  • 34. Sign up at www.qubole.com
  • 35. Appendix: Instructions to Download the Demo Notebook • Sign up for a Qubole free account on Azure ( www.qubole.com ). This gives you 14 days of free access to try Hive, Spark, Presto & Airflow on Qubole. • Once signed up, navigate to “Notebooks” in the Home menu in the top-left corner. • Click New, Import from URL and enter the URL below: https://goo.gl/ENTqo2 • Once the notebook imports, you may start the cluster from the notebook and explore the notebook.
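
Steps 3 to 6 of the churn overview on slide 32 (split, define several candidate models, train each, compare) can be sketched in a few lines. What follows is only an illustration in plain Python, not the code from the demo notebook, which per the deck runs on Spark. The toy data, the `tenure` feature, and the two stand-in models (`MajorityClass`, `TenureStump`) are all assumptions made for this sketch; the deck's own candidates are logistic regression, gradient boosting and random forests.

```python
import random


def make_toy_rows(n=200):
    """Hypothetical stand-in for the Telco churn data: (features, label) pairs.

    Tenure cycles deterministically through 1..72 months; a customer is
    labelled as churned (1) when tenure is under 12 months.
    """
    rows = []
    for i in range(n):
        tenure = (i * 7) % 72 + 1
        rows.append(({"tenure": tenure}, int(tenure < 12)))
    return rows


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


class MajorityClass:
    """Baseline model: always predict the most common training label."""

    def fit(self, rows):
        labels = [label for _, label in rows]
        self.prediction = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return self.prediction


class TenureStump:
    """One-rule model: predict churn when tenure is below a learned threshold."""

    def fit(self, rows):
        def errors(threshold):
            return sum((f["tenure"] < threshold) != bool(y) for f, y in rows)

        # Try every tenure value seen in training and keep the best threshold.
        candidates = sorted({f["tenure"] for f, _ in rows})
        self.threshold = min(candidates, key=errors)
        return self

    def predict(self, features):
        return int(features["tenure"] < self.threshold)


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model.predict(f) == y for f, y in rows) / len(rows)


if __name__ == "__main__":
    train, test = train_test_split(make_toy_rows())  # 70-30 split
    candidates = {"majority": MajorityClass(), "stump": TenureStump()}
    scores = {name: accuracy(m.fit(train), test) for name, m in candidates.items()}
    best = max(scores, key=scores.get)  # model selection on held-out data
    print(scores, "-> selected:", best)
```

Every candidate is trained on the same training sample and scored on the same held-out sample, which is what makes the comparison in step 6 fair. In a Spark setting the same shape would typically be expressed with library pipelines and evaluators rather than hand-written classes.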

Editor's Notes

  1. Want to give a bit of background as to how Qubole sits in the big data ecosystem. Our cloud-native big data activation platform has built-in tools and also connects to many 3rd party tools to support all of your use cases. The platform itself makes running workloads easy using your choice of open source technology, optimizes price performance automatically, and evolves over time through a plug-in architecture. Finally, we support multiple cloud providers, so there’s no lock-in.
  2. https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview