From Insight to Action: Using Data Science to Transform Your Organization

© Cloudera, Inc. All rights reserved.
Rob Morrow, Chief Technologist US Government

Deploy on any cloud infrastructure
Cloudera Director: Management for IaaS-related and CDH cluster operations
Easy Administration
• Dynamic cluster lifecycle management
• ICD-503 Support
• Single pane of glass: multi-cluster view
• Consumption based billing and metering
Enterprise-grade
• Integration across Cloudera Enterprise
• Management of CDH deployments at
scale
Flexible Deployments
• No cloud vendor lock-in: open plugin
framework for IaaS platforms
• Scaling of provisioned clusters
• Spot instance provisioning
Cloudera Director

Enterprise Data Science Topics
It took Todd Lipcon 3 years to
create Kudu, after 10 years of work
learning and gaining trust in the
open source community as a
committer.
Government of the future:
value created through
interesting methods.
If your organization is already
good at the 5,000 Open Source
Algorithms (Regression etc), you
now need a Data Science Cadre.
Open Source: Help Wanted. Methods, not raw Data
Most problems are not really
Data Science “Challenges”

Data Engineering and Data Science Workloads
Data Ingestion
(Kafka, Navigator,
Search)
Cloudera enables users to build real-time, end-
to-end data pipelines to power their
business. Leadership in Apache Spark and
Kafka has made Cloudera a trusted resource
for users who want to capture real-time,
streaming, and time series data without
gaps in security.
Data Processing
(Spark, Hive)
Cloudera is helping users accelerate
their data pipelines with leadership in
technologies like Apache
Spark. Data processing in Cloudera
Enterprise can help take processing
windows from hours to minutes and
enables faster access to data for a
variety of users and skillsets.
Data Science (Spark
MLlib)
Cloudera is bringing the most popular data
science languages/libraries to our platform
for easier collaboration, self-service
exploration, and implementation at
scale. Cloudera is advancing the state of
distributed machine learning at scale.
Cloudera enables exploratory data science
and the ability to deliver robust data
products.

Closing Gaps in Critical Skills Areas in the Govt
Data Science
High Value, Low Frequency
• Only a small set of problems require
direct Data Science expertise (~5%)
• Domain-general, algorithm-specific
• Very high expertise
Characterized by
• Spark/Python Expertise
• Advanced Algorithms
• Hypothesis-testing
Automation/Workload
• Per-task/Algorithm automation
Data Analysis
High Frequency, Self-Service
• The “other” 95% of Problems
• More domain-specific
Characterized by
• Tools with UIs (DataRobot)
• “Exploratory” data investigation
Automation/Workload
• Easily automated
Data Science “Unicorns” are even more valuable in the Govt.
So how do you scale them out?

Two Data Science Use Cases
Improving decisions vs. improving products
Decision Science
(improving business decisions)
• User: Data scientists and analysts
• Data: New and changing; often sampled
• Environment: Local machine, sandbox cluster
• Tools: R, Python, SAS/SPSS, SQL; notebooks; data
wrangling/discovery tools, …
• Goal: Understand data, develop and improve models,
share results
• Production: Hosted/scheduled reports or dashboards
Data Products
(improving products for customers)
• User: Data engineers, developers, SREs
• Data: Known data; full scale
• Environment: Production clusters
• Tools: Java/Scala, C++; IDEs; continuous
integration, source control, …
• Goal: Build and maintain applications, improve
model performance, manage models in production
• Production: Online applications

Ingest
The Foundation of Hadoop’s Potential
Data can come from a variety of “siloed” sources
▪ Existing databases
▪ Sensor data
▪ Server logs
▪ Chat transcripts
Value of data is multiplied when combined and
correlated with other data
▪ “40% value improvement from combining data from
multiple IoT sources” McKinsey Global Institute
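
The idea of combining siloed sources can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layouts, the shared `host` key, and the helper name `combine_by_key` are all hypothetical and do not come from the slides or from any Cloudera tool.

```python
from collections import defaultdict

# Hypothetical records from two "siloed" sources that share a host name.
server_logs = [
    {"host": "web-1", "errors": 3},
    {"host": "web-2", "errors": 0},
]
sensor_data = [
    {"host": "web-1", "temp_c": 71.5},
    {"host": "web-2", "temp_c": 48.0},
    {"host": "web-3", "temp_c": 50.2},
]


def combine_by_key(key, *sources):
    """Merge records from several sources that share the same key value."""
    combined = defaultdict(dict)
    for source in sources:
        for record in source:
            combined[record[key]].update(record)
    return dict(combined)


combined = combine_by_key("host", server_logs, sensor_data)
for host, record in sorted(combined.items()):
    print(host, record)
```

Once the records sit side by side, questions that neither source could answer alone (for example, whether hot hosts log more errors) become simple lookups.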

Data Processing
Leverage the right processing for your job
Data may require unique processing characteristics
▪ Batch
▪ Streaming
▪ Real-time
Hadoop arose to address one and now the ecosystem
has evolved to answer the rest.
▪ “We’re doubling down on Spark. We invested earliest,
and we’ve invested most, in making Hadoop
enterprise-grade” Mike Olson
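
The difference between batch and streaming processing can be shown with a small plain-Python sketch. It is not Spark or Hadoop code; the function names and the sample readings are assumptions made for illustration. The batch version needs the full data set before it answers, while the streaming version updates its answer as each value arrives.

```python
def batch_mean(values):
    """Batch: all data is available up front; compute over the full set."""
    values = list(values)
    return sum(values) / len(values)


def streaming_means(values):
    """Streaming: update a running mean per arriving value and yield it."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings
print(batch_mean(readings))
print(list(streaming_means(readings)))
```

The last value produced by the streaming version matches the batch result; the trade-off is that the streaming version has a usable answer after every reading.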

Data Science
A Unified Platform to Accelerate Data Science from Exploration to Production.
Data Scientists need to use data to…
▪ Explore
▪ Model
▪ Test
The field of data science blends math and statistics
knowledge with advanced computer knowledge.
▪ “Data Scientist: Person who is better at statistics than
any software engineer and better at software
engineering than any statistician” Josh Wills

MLlib
Collection of mainstream machine learning algorithms built on Spark
Including:
•Classifiers: logistic regression, boosted trees, random forests, etc
•Clustering: k-means, Latent Dirichlet Allocation (LDA)
•Recommender Systems: Alternating Least Squares
•Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value
Decomposition (SVD)
•Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc
•Statistical Functions: Chi-Squared Test, Pearson Correlation, etc
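
To make one of the listed statistical functions concrete, here is a minimal single-machine sketch of Pearson correlation in plain Python. It is not MLlib's API and does not use Spark; the function name `pearson_correlation` and the sample data are assumptions for illustration. MLlib's value is in running this kind of computation across a cluster on data too large for one machine.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


hours = [1.0, 2.0, 3.0, 4.0]    # hypothetical input feature
output = [2.0, 4.0, 6.0, 8.0]   # hypothetical response, exactly 2 * hours
print(pearson_correlation(hours, output))
```

A perfectly linear increasing relationship gives a coefficient of 1 (up to floating-point rounding), and a perfectly linear decreasing one gives -1.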

Data Science Track Info
Data Science Location: Severn
Matrix Decomposition at Scale
Juliet Hougland, Data Scientist, Cloudera
Large-scale Agent-Based Modeling and Simulation on High-Performance Computers
Dr. Robert Axtell, George Mason University
Random Decision Forests at Scale
Todd Boetticher, Solutions Consultant, Cloudera

Recommended Training for Data Engineering
Apache Spark and Hadoop
Learn how to identify which tool
is the right one to use in a given
situation, and gain hands-on
experience using those tools
Data Science on Hadoop
Cloudera University’s three-day
course helps participants
understand what data scientists
do, the problems they solve,
and the tools and techniques
they use
Cloudera Search
Learn how to increase the ROI
from big data investments, by
delivering faster time to insight
for your organization.

Thank you
