1© Cloudera, Inc. All rights reserved.
From Insight to Action - Using
Data Science to Transform
Your Organization
Rob Morrow, Chief Technologist US Government
Deploy on any cloud infrastructure
Cloudera Director: Management for IaaS-related and CDH cluster operations
Easy Administration
• Dynamic cluster lifecycle management
• ICD-503 Support
• Single pane of glass: multi-cluster view
• Consumption based billing and metering
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at
scale
Flexible Deployments
• No cloud vendor lock-in: open plugin
framework for IaaS platforms
• Scaling of provisioned clusters
• Spot instance provisioning
Cloudera Director
Enterprise Data Science Topics
It took Todd Lipcon 3 years to
create Kudu; 10 years of work
before that went into learning and gaining
trust in the open source community as a
committer.
Government of the future:
value created through
interesting methods.
If your organization is already
good at the 5,000 Open Source
Algorithms (Regression etc), you
now need a Data Science Cadre.
Open Source: Help Wanted.
Methods, not raw Data
Most problems are not really
Data Science “Challenges”
Data Engineering and Data Science Workloads
Data Ingestion
(Kafka, Navigator,
Search)
Cloudera enables users to build real-time, end-to-end
data pipelines to power their
business. Leadership in Apache Spark and
Kafka has made Cloudera a trusted resource
for users who want to capture real-time,
streaming, and time-series data without
gaps in security.
Data Processing
(Spark, Hive)
Cloudera is helping users accelerate
their data pipelines with leadership in
technologies like Apache
Spark. Data processing in Cloudera
Enterprise can help take processing
windows from hours to minutes and
enables faster access to data for a
variety of users and skillsets.
Data Science (Spark
MLlib)
Cloudera is bringing the most popular data
science languages/libraries to our platform
for easier collaboration, self-service
exploration, and implementation at
scale. Cloudera is advancing the state of
distributed machine learning at scale.
Cloudera enables exploratory data science
and the ability to deliver robust data
products.
Closing Gaps in Critical Skills Areas in the Govt
Data Science
High Value, Low Frequency
• Only a small set of problems requires
direct Data Science expertise (~5%)
• Domain-general, algorithm-specific
• Very high expertise
Characterized by
• Spark/Python Expertise
• Advanced Algorithms
• Hypothesis-testing
Automation/Workload
• Per-task/Algorithm automation
Data Analysis
High Frequency, Self-Service
• The “other” 95% of Problems
• More domain-specific
Characterized by
• Tools with UIs (DataRobot)
• “Exploratory” data investigation
Automation/Workload
• Easily automated
Data Science “Unicorns” are even more valuable in the Govt.
So how do you scale them out?
Two Data Science Use Cases
Improving decisions vs. improving products
Decision Science
(improving business decisions)
• User: Data scientists and analysts
• Data: New and changing; often sampled
• Environment: Local machine, sandbox cluster
• Tools: R, Python, SAS/SPSS, SQL; notebooks; data
wrangling/discovery tools, …
• Goal: Understand data, develop and improve models,
share results
• Production: Hosted/scheduled reports or dashboards
Data Products
(improving products for customers)
• User: Data engineers, developers, SREs
• Data: Known data; full scale
• Environment: Production clusters
• Tools: Java/Scala, C++; IDEs; continuous
integration, source control, …
• Goal: Build and maintain applications, improve
model performance, manage models in production
• Production: Online applications
Ingest
The Foundation of Hadoop’s Potential
Data can come from a variety of “siloed” sources
▪ Existing databases
▪ Sensor data
▪ Server logs
▪ Chat transcripts
Value of data is multiplied when combined and
correlated with other data
▪ “40% value improvement from combining data from
multiple IoT sources” McKinsey Global Institute
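The idea of combining siloed sources can be sketched in a few lines of plain Python. This is only an illustration, not a Cloudera or Hadoop API: the record layouts, the `device_id` key, and the `combine_by_key` helper are all hypothetical, and a real pipeline would do this join at scale with tools such as Kafka and Spark.

```python
from collections import defaultdict

# Hypothetical records from two "siloed" sources that share a device_id.
sensor_readings = [
    {"device_id": "a1", "temp_c": 71.5},
    {"device_id": "b2", "temp_c": 40.0},
]
server_logs = [
    {"device_id": "a1", "event": "fan_failure"},
    {"device_id": "a1", "event": "reboot"},
    {"device_id": "c3", "event": "login"},
]


def combine_by_key(left, right, key):
    """Inner-join two lists of dicts on `key` and return the merged records."""
    # Index the right-hand source once so each lookup is O(1).
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


combined = combine_by_key(sensor_readings, server_logs, "device_id")
for record in combined:
    print(record)
```

Only device `a1` appears in both sources, so only its readings are correlated with log events; the unmatched records show how much stays invisible while the data remains siloed.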
Data Processing
Leverage the right processing for your job
Data may require unique processing characteristics
▪ Batch
▪ Streaming
▪ Real-time
Hadoop arose to address batch processing, and the ecosystem
has since evolved to handle the rest.
▪ “We’re doubling down on Spark. We invested earliest,
and we’ve invested most, in making Hadoop
enterprise-grade” Mike Olson
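The difference between batch and streaming processing can be illustrated with a minimal plain-Python sketch. It does not use Spark or any Hadoop API; the function names and the sample latency values are assumptions made for the example.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Batch: the full dataset is available, so compute the answer once."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated mean as each value arrives."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Incremental update: no history of past values is kept.
        mean += (value - mean) / count
        yield mean


# Hypothetical request latencies in milliseconds.
latencies_ms = [120.0, 80.0, 100.0, 140.0]

print("batch result:", batch_mean(latencies_ms))
for step, running in enumerate(streaming_mean(latencies_ms), start=1):
    print("after", step, "events:", running)
```

The batch function needs all of the data before it can answer, while the streaming generator produces a usable answer after every event and converges on the same final value.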
Data Science
A Unified Platform to Accelerate Data Science from Exploration to Production.
Data Scientists need to use data to…
▪ Explore
▪ Model
▪ Test
The field of data science blends math and statistics
knowledge with advanced computer knowledge.
▪ “Data Scientist: Person who is better at statistics than
any software engineer and better at software
engineering than any statistician” Josh Wills
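The explore, model, and test steps can be sketched on a toy dataset in plain Python. This is a minimal illustration under stated assumptions: the data values are made up, a least-squares line stands in for whatever model a data scientist would actually choose, and none of the helper names come from a Cloudera product.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and observed values."""
    slope, intercept = model
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Explore: look at simple summaries before modeling.
print("range of y:", min(ys), "to", max(ys))

# Model: fit on the first four points only.
train_x, train_y = xs[:4], ys[:4]
model = fit_line(train_x, train_y)

# Test: measure error on the held-out points the model never saw.
test_x, test_y = xs[4:], ys[4:]
test_mse = mean_squared_error(model, test_x, test_y)
print("slope, intercept:", model)
print("held-out MSE:", test_mse)
```

Holding data back for the test step is the design choice that matters here: error measured on the training points alone would overstate how well the model generalizes.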
MLlib
Collection of mainstream machine learning algorithms built on Spark
Including:
• Classifiers: logistic regression, boosted trees, random forests, etc.
• Clustering: k-means, Latent Dirichlet Allocation (LDA)
• Recommender Systems: Alternating Least Squares
• Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value
Decomposition (SVD)
• Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc.
• Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.
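To make one of the listed statistical functions concrete, here is a minimal single-machine sketch of Pearson correlation in plain Python. It is not MLlib's API; MLlib computes the same statistic over distributed data. The function name and the sample values are assumptions made for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, without the shared 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired observations.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [2.0, 4.0, 6.0, 8.0, 10.0]

print(pearson_correlation(hours_studied, exam_score))
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.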
Data Science Track Info
Data Science Location: Severn
Matrix Decomposition at Scale
Juliet Hougland, Data Scientist, Cloudera
Large-scale Agent-Based Modeling and Simulation on High-Performance Computers
Dr. Robert Axtell, George Mason University
Random Decision Forests at Scale
Todd Boetticher, Solutions Consultant, Cloudera
Recommended Training for Data Engineering
Apache Spark and Hadoop
Learn how to identify which tool
is the right one to use in a given
situation, and gain hands-on
experience using those tools.
Data Science on Hadoop
Cloudera University’s three-day
course helps participants
understand what data scientists
do, the problems they solve,
and the tools and techniques
they use.
Cloudera Search
Learn how to increase the ROI
from big data investments by
delivering faster time to insight
for your organization.
Thank you
