1© Cloudera, Inc. All rights reserved.
Josh YEH | Software Engineer
https://www.linkedin.com/in/joshyeh/
Data Science in the Enterprise
2© Cloudera, Inc. All rights reserved.
This is the age of machine learning.
2
Cost of compute
Data volume
Time
Machine
Learning
NO
Machine
Learning
1950
s
1960
s
1970
s
1980
s
1990
s
2000
s
2010
s
3© Cloudera, Inc. All rights reserved.
4© Cloudera, Inc. All rights reserved.
CDSW
Data Scientist
Machine Learning and
Advanced Analytical
Platform
CM / CDH
Data Engineer
Data Admin
ETL and ...
5© Cloudera, Inc. All rights reserved.
Machine learning opportunities are everywhere.
Data Engineering Data Science (Exploratory) Production (Operational)
Data Wrangling
Visualization
and Analysis
Model Training
& Testing
Production
Data Pipelines Batch Scoring
Online Scoring
Serving
Data GovernanceGovernance
Processing
Acquisition
Reports,
Dashboards
Data has never been
more plentiful
Open source data science and
machine learning libraries are
rapidly evolving
Commodity (and on-demand) compute
makes scalable production machine
learning affordable
6© Cloudera, Inc. All rights reserved.
But, there are challenges:
Data Engineering Data Science (Exploratory) Production (Operational)
Data Wrangling
Visualization
and Analysis
Model Training
& Testing
Production
Data Pipelines Batch Scoring
Online Scoring
Serving
Data GovernanceGovernance
Processing
Acquisition
Reports,
Dashboards
Most data science done at
small scale, individually,
and is difficult to replicate
Very few models
reach production
Teams have different,
conflicting requests for
languages & libraries
Data needs to move
across multiple different
systems
7© Cloudera, Inc. All rights reserved.
Our enterprise customers say:
Access
For sensitive data, secure clusters are
difficult to access. And IT typically
doesn’t want random packages
installed on a secure cluster.
Popular open source tools don’t easily
connect to these environments, or
always support Hadoop data formats.
Scale
Laptops rarely have capacity for
medium, let alone big data. This
leads to a lot of sampling.
Popular frameworks don’t easily
parallelize on a cluster. Typically
code has to get rewritten for
production.
Developer Experience
Notebooks, while awesome, don’t
easily support virtual environment
and dependency management,
especially for teams. This makes
sharing and reproducibility hard.
Notebooks are also challenging to
“put into production.”
8© Cloudera, Inc. All rights reserved.
Cloudera: The Enterprise Platform for
Data Science and Machine Learning
The data is now here
30B
CONNECTED DEVICES
440x
MORE DATA
Cloudera first to integrate Spark
Modern Platform for Machine Learning and Advanced Analytics
Leading adoption among enterprises
500Customers
Run Spark on
9© Cloudera, Inc. All rights reserved.
Our goal: an open data science at enterprise scale
Help more data scientists
use the power of Cloudera
Use a powerful, familiar
environment with direct access to
Cloudera data and compute
Data Scientist
Data Engineer
Make it easy and secure to
add new users, use cases
Offer secure self-service analytics
and a faster path to production on
common, affordable infrastructure
Enterprise Architect
Hadoop Admin
10© Cloudera, Inc. All rights reserved.
An open ecosystem for agility and innovation
Open Ecosystem Black Box
11© Cloudera, Inc. All rights reserved.
Rich data science libraries and the power of Spark
IT
drive adoption while maintaining compliance
Data Scientist
explore, experiment, iterate
12© Cloudera, Inc. All rights reserved.
Supports the complete data science pipeline
From data to exploration to action
Data Engineering Data Science (Exploratory) Production (Operational)
Data Wrangling
Visualization
and Analysis
Model Training
& Testing
Production
Data Pipelines Batch Scoring
Online Scoring
Serving
Data GovernanceGovernance
Processing
Acquisition
Reports,
Dashboards
CDSW
Data Scientist
Machine Learning and
Advanced Analytical Platform
CM / CDH
Data Engineer
Data Administrator,
ETL and ...
13© Cloudera, Inc. All rights reserved.
Data science at enterprise scale requires a full stack.
• Support unlimited data
• Provide sufficient tools for Analysts
• Provide sufficient tools for
Data Scientists + Data Engineers
• Enable real-time use cases
• Provide data governance
• Provide full-stack security
• Deploy in the cloud
• Integrate with partner tools
• Be easy for IT to deploy/maintain
✓Hadoop
✓Impala, Hive, Hue
✓Spark, Data Science Workbench
✓Kafka, Spark Streaming
✓Navigator + Partners
✓Kerberos, Sentry, Record Service, KMS/KTS
✓Cloudera Director
✓Rich Ecosystem
✓Cloudera Manager + Director
14© Cloudera, Inc. All rights reserved.
Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from
development to production with:
• Secure self-service environments
for data scientists to work against
Cloudera clusters
• Support for Python, R, and Scala,
plus project dependency isolation
for multiple library versions
• Workflow automation, version
control, collaboration and sharing
15© Cloudera, Inc. All rights reserved.
Data scientists can:
• Use R, Python, or Scala from a web
browser, with no desktop footprint
• Install any library or framework within
isolated project environments
• Directly access data in secure clusters
with Spark and Impala
• Share insights with their team for
reproducible, collaborative research
• Automate and monitor data pipelines
using built-in job scheduling
IT can:
• Give their data science team the
freedom to work how they want, when
they want
• Stay compliant with out-of-the-box
support for full platform security,
especially Kerberos
• Run on-premises or in the cloud,
wherever data is managed
With Cloudera Data Science Workbench…
16© Cloudera, Inc. All rights reserved.
Enable Data Scientists IT can focus data engineering
With Cloudera Data Science Workbench…
17© Cloudera, Inc. All rights reserved.
Demo
CDH security integration, Share Report, Group Collaboration
Data Visualization, Abnormality Detection,
GPU resource management, Train ML models
18© Cloudera, Inc. All rights reserved.
Connect to Kerberized Hadoop cluster with
• As easy as one single
configuration for a given
user.
• CDSW application will
refresh kerberos
authentication every 15
mins
• Support both Kerberos
password and key tab
authentication.
• Revoke is easy.
19© Cloudera, Inc. All rights reserved.
20© Cloudera, Inc. All rights reserved.
21© Cloudera, Inc. All rights reserved.
22© Cloudera, Inc. All rights reserved.
23© Cloudera, Inc. All rights reserved.
24© Cloudera, Inc. All rights reserved.
25© Cloudera, Inc. All rights reserved.
26© Cloudera, Inc. All rights reserved.
27© Cloudera, Inc. All rights reserved.
28© Cloudera, Inc. All rights reserved.
29© Cloudera, Inc. All rights reserved.
30© Cloudera, Inc. All rights reserved.
31© Cloudera, Inc. All rights reserved.
32© Cloudera, Inc. All rights reserved.
33© Cloudera, Inc. All rights reserved.
34© Cloudera, Inc. All rights reserved.
35© Cloudera, Inc. All rights reserved.
Cloudera Fast Forward Lab (CFFL)
Machine Learning Kickstart kit
36© Cloudera, Inc. All rights reserved.
What is Cloudera Fast Forward Labs (CFFL)
Fast Forward Labs (FFL) is a small data science/machine learning research and
consulting team.
FFL provides a subscription advisory service that helps technology leaders keep
current on machine learning and artificial intelligence innovations that will be
practical within the next 6-24 months
37© Cloudera, Inc. All rights reserved.
1.Executives struggle with machine learning strategy
There are thousands of different ML
techniques available, and innovation is
incredibly fast. It's impossible for any
company know about all the options, let alone
evaluate the few that may meet their needs.
The vendor ecosystem is increasingly complex,
and hype muddies the waters. Analyst firms
like Gartner and Forrester take a high-level
and long view of the market, so they don't
provide any real actionable intelligence about
ML that could be useful today.
38© Cloudera, Inc. All rights reserved.
2. Cut through the ML hype
A Cloudera Fast Forward Labs research subscription is the
best way to keep up with the latest in machine learning and
artificial intelligence. Each quarter, we publish a research
report and provide software prototypes highlighting
emerging machine learning capabilities that will be useful in
the next two years. The prototypes prove the algorithms,
reduce risk and uncertainty, and shorten time to value. Our
advising services expand on the research and help you apply
machine learning to your specific line of work, keeping you
ahead of the curve and above the confusion and hype.
39© Cloudera, Inc. All rights reserved.
3. Research and so much more
40© Cloudera, Inc. All rights reserved.
CFFL Customer Successes
• A bank's head of innovation read the CFFL report on Natural Language
Generation. He is now generating 80% of its regulatory compliance
documentation automatically, saving 30% of the hours.
• One of the big four accounting firms is using Summarization techniques to
reissue tax guidance to its customers. What were once static memos, are now
dynamically updated documents. Their ML system sits on a newsfeed of
updates to judicial interpretation of the tax law, automatically alerting the CPA
and their customers.
• Another customer read our report on Image and video analysis with deep
learning. CFFL conducted a feasibility study to get surgical robots to be able
detect when a surgery is going wrong. The findings wound up radically
changing the company's product roadmap.
41© Cloudera, Inc. All rights reserved.
Future Data Science Trending
Invest $$ on the right direction; avoid building technical debts
42© Cloudera, Inc. All rights reserved.
Trending 1: GPU for training; CPU for deployment
1. GPU is built for Machine Learning ( Deep
Neural Network Learning / AI ).
2. Data retention (/revamp) is costly.
a. Eg: the data from the development,
test, manufacture process for chips
that are used in autonomous car, need
to need 15 years of retention.
3. Re-train / Re-deploy new DL model with
more and more data set daily, weekly,
yearly etc...
43© Cloudera, Inc. All rights reserved.
Trending 2: Investment on Right Tool Sets
1. Don’t reinvent the wheel: Leverage
what’s available out there to shorten
development cycle.
2. Use community support: Mainstream
tools have ample enthusiastic support and
contribution.
3. Easier troubleshooting
44© Cloudera, Inc. All rights reserved.
Trending 3: Application Containerization
1. Maintenance: legacy application,
complicated framework, 3rd party
libraries, evaluate free open projects. Too
many things could go wrong.
2. Resource Optimization: easy resource
management via cgroup in cluster.
3. Deployment: Easy testing, validation and
deployment.
4. Scale: kubernete to manage containers
and easy scale-up.
45© Cloudera, Inc. All rights reserved.
Trending 4: Simplified Data Scientist Workflow
1. Work Anywhere 2. Train at large scale 3. Seamless deployment
90+% resource spent on
46© Cloudera, Inc. All rights reserved.
Thank you!
jjyeh@cloudera.com

Data Science in Enterprise

  • 1.
    1© Cloudera, Inc.All rights reserved. Josh YEH | Software Engineer https://www.linkedin.com/in/joshyeh/ Data Science in the Enterprise
  • 2.
    2© Cloudera, Inc.All rights reserved. This is the age of machine learning. 2 Cost of compute Data volume Time Machine Learning NO Machine Learning 1950 s 1960 s 1970 s 1980 s 1990 s 2000 s 2010 s
  • 3.
    3© Cloudera, Inc.All rights reserved.
  • 4.
    4© Cloudera, Inc.All rights reserved. CDSW Data Scientist Machine Learning and Advanced Analytical Platform CM / CDH Data Engineer Data Admin ETL and ...
  • 5.
    5© Cloudera, Inc.All rights reserved. Machine learning opportunities are everywhere. Data Engineering Data Science (Exploratory) Production (Operational) Data Wrangling Visualization and Analysis Model Training & Testing Production Data Pipelines Batch Scoring Online Scoring Serving Data GovernanceGovernance Processing Acquisition Reports, Dashboards Data has never been more plentiful Open source data science and machine learning libraries are rapidly evolving Commodity (and on-demand) compute makes scalable production machine learning affordable
  • 6.
    6© Cloudera, Inc.All rights reserved. But, there are challenges: Data Engineering Data Science (Exploratory) Production (Operational) Data Wrangling Visualization and Analysis Model Training & Testing Production Data Pipelines Batch Scoring Online Scoring Serving Data GovernanceGovernance Processing Acquisition Reports, Dashboards Most data science done at small scale, individually, and is difficult to replicate Very few models reach production Teams have different, conflicting requests for languages & libraries Data needs to move across multiple different systems
  • 7.
    7© Cloudera, Inc.All rights reserved. Our enterprise customers say: Access For sensitive data, secure clusters are difficult to access. And IT typically doesn’t want random packages installed on a secure cluster. Popular open source tools don’t easily connect to these environments, or always support Hadoop data formats. Scale Laptops rarely have capacity for medium, let alone big data. This leads to a lot of sampling. Popular frameworks don’t easily parallelize on a cluster. Typically code has to get rewritten for production. Developer Experience Notebooks, while awesome, don’t easily support virtual environment and dependency management, especially for teams. This makes sharing and reproducibility hard. Notebooks are also challenging to “put into production.”
  • 8.
    8© Cloudera, Inc.All rights reserved. Cloudera: The Enterprise Platform for Data Science and Machine Learning The data is now here 30B CONNECTED DEVICES 440x MORE DATA Cloudera first to integrate Spark Modern Platform for Machine Learning and Advanced Analytics Leading adoption among enterprises 500Customers Run Spark on
  • 9.
    9© Cloudera, Inc.All rights reserved. Our goal: an open data science at enterprise scale Help more data scientists use the power of Cloudera Use a powerful, familiar environment with direct access to Cloudera data and compute Data Scientist Data Engineer Make it easy and secure to add new users, use cases Offer secure self-service analytics and a faster path to production on common, affordable infrastructure Enterprise Architect Hadoop Admin
  • 10.
    10© Cloudera, Inc.All rights reserved. An open ecosystem for agility and innovation Open Ecosystem Black Box
  • 11.
    11© Cloudera, Inc.All rights reserved. Rich data science libraries and the power of Spark IT drive adoption while maintaining compliance Data Scientist explore, experiment, iterate
  • 12.
    12© Cloudera, Inc.All rights reserved. Supports the complete data science pipeline From data to exploration to action Data Engineering Data Science (Exploratory) Production (Operational) Data Wrangling Visualization and Analysis Model Training & Testing Production Data Pipelines Batch Scoring Online Scoring Serving Data GovernanceGovernance Processing Acquisition Reports, Dashboards CDSW Data Scientist Machine Learning and Advanced Analytical Platform CM / CDH Data Engineer Data Administrator, ETL and ...
  • 13.
    13© Cloudera, Inc.All rights reserved. Data science at enterprise scale requires a full stack. • Support unlimited data • Provide sufficient tools for Analysts • Provide sufficient tools for Data Scientists + Data Engineers • Enable real-time use cases • Provide data governance • Provide full-stack security • Deploy in the cloud • Integrate with partner tools • Be easy for IT to deploy/maintain ✓Hadoop ✓Impala, Hive, Hue ✓Spark, Data Science Workbench ✓Kafka, Spark Streaming ✓Navigator + Partners ✓Kerberos, Sentry, Record Service, KMS/KTS ✓Cloudera Director ✓Rich Ecosystem ✓Cloudera Manager + Director
  • 14.
    14© Cloudera, Inc.All rights reserved. Introducing Cloudera Data Science Workbench Self-service data science for the enterprise Accelerates data science from development to production with: • Secure self-service environments for data scientists to work against Cloudera clusters • Support for Python, R, and Scala, plus project dependency isolation for multiple library versions • Workflow automation, version control, collaboration and sharing
  • 15.
    15© Cloudera, Inc.All rights reserved. Data scientists can: • Use R, Python, or Scala from a web browser, with no desktop footprint • Install any library or framework within isolated project environments • Directly access data in secure clusters with Spark and Impala • Share insights with their team for reproducible, collaborative research • Automate and monitor data pipelines using built-in job scheduling IT can: • Give their data science team the freedom to work how they want, when they want • Stay compliant with out-of-the-box support for full platform security, especially Kerberos • Run on-premises or in the cloud, wherever data is managed With Cloudera Data Science Workbench…
  • 16.
    16© Cloudera, Inc.All rights reserved. Enable Data Scientists IT can focus data engineering With Cloudera Data Science Workbench…
  • 17.
    17© Cloudera, Inc.All rights reserved. Demo CDH security integration, Share Report, Group Collaboration Data Visualization, Abnormality Detection, GPU resource management, Train ML models
  • 18.
    18© Cloudera, Inc.All rights reserved. Connect to Kerberized Hadoop cluster with • As easy as one single configuration for a given user. • CDSW application will refresh kerberos authentication every 15 mins • Support both Kerberos password and key tab authentication. • Revoke is easy.
  • 19.
    19© Cloudera, Inc.All rights reserved.
  • 20.
    20© Cloudera, Inc.All rights reserved.
  • 21.
    21© Cloudera, Inc.All rights reserved.
  • 22.
    22© Cloudera, Inc.All rights reserved.
  • 23.
    23© Cloudera, Inc.All rights reserved.
  • 24.
    24© Cloudera, Inc.All rights reserved.
  • 25.
    25© Cloudera, Inc.All rights reserved.
  • 26.
    26© Cloudera, Inc.All rights reserved.
  • 27.
    27© Cloudera, Inc.All rights reserved.
  • 28.
    28© Cloudera, Inc.All rights reserved.
  • 29.
    29© Cloudera, Inc.All rights reserved.
  • 30.
    30© Cloudera, Inc.All rights reserved.
  • 31.
    31© Cloudera, Inc.All rights reserved.
  • 32.
    32© Cloudera, Inc.All rights reserved.
  • 33.
    33© Cloudera, Inc.All rights reserved.
  • 34.
    34© Cloudera, Inc.All rights reserved.
  • 35.
    35© Cloudera, Inc.All rights reserved. Cloudera Fast Forward Lab (CFFL) Machine Learning Kickstart kit
  • 36.
    36© Cloudera, Inc.All rights reserved. What is Cloudera Fast Forward Labs (CFFL) Fast Forward Labs (FFL) is a small data science/machine learning research and consulting team. FFL provides a subscription advisory service that helps technology leaders keep current on machine learning and artificial intelligence innovations that will be practical within the next 6-24 months
  • 37.
    37© Cloudera, Inc.All rights reserved. 1.Executives struggle with machine learning strategy There are thousands of different ML techniques available, and innovation is incredibly fast. It's impossible for any company know about all the options, let alone evaluate the few that may meet their needs. The vendor ecosystem is increasingly complex, and hype muddies the waters. Analyst firms like Gartner and Forrester take a high-level and long view of the market, so they don't provide any real actionable intelligence about ML that could be useful today.
  • 38.
    38© Cloudera, Inc.All rights reserved. 2. Cut through the ML hype A Cloudera Fast Forward Labs research subscription is the best way to keep up with the latest in machine learning and artificial intelligence. Each quarter, we publish a research report and provide software prototypes highlighting emerging machine learning capabilities that will be useful in the next two years. The prototypes prove the algorithms, reduce risk and uncertainty, and shorten time to value. Our advising services expand on the research and help you apply machine learning to your specific line of work, keeping you ahead of the curve and above the confusion and hype.
  • 39.
    39© Cloudera, Inc.All rights reserved. 3. Research and so much more
  • 40.
    40© Cloudera, Inc.All rights reserved. CFFL Customer Successes • A bank's head of innovation read the CFFL report on Natural Language Generation. He is now generating 80% of its regulatory compliance documentation automatically, saving 30% of the hours. • One of the big four accounting firms is using Summarization techniques to reissue tax guidance to its customers. What were once static memos, are now dynamically updated documents. Their ML system sits on a newsfeed of updates to judicial interpretation of the tax law, automatically alerting the CPA and their customers. • Another customer read our report on Image and video analysis with deep learning. CFFL conducted a feasibility study to get surgical robots to be able detect when a surgery is going wrong. The findings wound up radically changing the company's product roadmap.
  • 41.
    41© Cloudera, Inc.All rights reserved. Future Data Science Trending Invest $$ on the right direction; avoid building technical debts
  • 42.
    42© Cloudera, Inc.All rights reserved. Trending 1: GPU for training; CPU for deployment 1. GPU is built for Machine Learning ( Deep Neural Network Learning / AI ). 2. Data retention (/revamp) is costly. a. Eg: the data from the development, test, manufacture process for chips that are used in autonomous car, need to need 15 years of retention. 3. Re-train / Re-deploy new DL model with more and more data set daily, weekly, yearly etc...
  • 43.
    43© Cloudera, Inc.All rights reserved. Trending 2: Investment on Right Tool Sets 1. Don’t reinvent the wheel: Leverage what’s available out there to shorten development cycle. 2. Use community support: Mainstream tools have ample enthusiastic support and contribution. 3. Easier troubleshooting
  • 44.
    44© Cloudera, Inc.All rights reserved. Trending 3: Application Containerization 1. Maintenance: legacy application, complicated framework, 3rd party libraries, evaluate free open projects. Too many things could go wrong. 2. Resource Optimization: easy resource management via cgroup in cluster. 3. Deployment: Easy testing, validation and deployment. 4. Scale: kubernete to manage containers and easy scale-up.
  • 45.
    45© Cloudera, Inc.All rights reserved. Trending 4: Simplified Data Scientist Workflow 1. Work Anywhere 2. Train at large scale 3. Seamless deployment 90+% resource spent on
  • 46.
    46© Cloudera, Inc.All rights reserved. Thank you! jjyeh@cloudera.com