This document discusses Red Hat's Open Data Hub platform for multi-tenant data analytics and machine learning. It describes the challenges of sharing data and compute resources across teams, and the Open Data Hub architecture, which allows teams to spin up and spin down their own compute clusters while sharing a common data store. Key elements of the Open Data Hub include Spark, Ceph storage, JupyterHub notebooks, and TensorFlow/Keras for modeling. The document provides an overview of data structures, analytics workflows, and the components and roadmap for the Open Data Hub platform.
Microsoft and Revolution Analytics -- what's the add-value? 2015-06-29 (Mark Tabladillo)
Microsoft has been a leader in the enterprise analytics space for years. In 2014, Microsoft had already created R language functionality within Azure Machine Learning. On April 6, 2015, Microsoft closed on a deal to acquire Revolution Analytics, a company focused on scalable processing solutions initiated by the well-known R language. Many data science projects and initial demos do not need high-volume solutions; however, having a high-volume answer for the R language allows for planning or working toward the largest data science solutions.
This presentation describes the add-value of the Revolution Analytics acquisition. The talk covers 1) an overview of current data science technologies from Microsoft; 2) a description of the R language; 3) a brief review of the add-value of R with Azure Machine Learning; and 4) a description of the performance architecture and a demo of the language constructs developed by Revolution Analytics. Most of the presentation focuses on sections two and four. It is anticipated that these technologies will be partially if not fully integrated into SQL Server 2016.
Etosha - Data Asset Manager: Status and road map (Dr. Mirko Kämpf)
Etosha is an enterprise-focused collaborative graph database containing facts about data sets, analysis procedures, and research methods. People from multiple organizations can be connected while every owner retains full control over their own data.
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous as the need to string multiple functions together to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
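The talk's Ray-based abstraction is not reproduced here, but the shared fit/transform contract it unifies is easy to see in scikit-learn (Spark ML's Pipeline exposes the same verbs). A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: 200 rows, 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),    # transformer: fit() learns stats, transform() applies them
    ("clf", LogisticRegression()),  # estimator: fit() trains on the transformed features
])
pipe.fit(X, y)              # fits each stage in sequence
print(pipe.predict(X[:3]))  # runs transform() on each stage, then predict()
```

Because both ecosystems share these verbs, a unifying layer only has to schedule the same two calls over a distributed compute model.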
Large Scale Geospatial Indexing and Analysis on Apache Spark (Databricks)
SafeGraph is a data company — just a data company — that aims to be the source of truth for data on physical places. We are focused on creating high-precision geospatial data sets specifically about places where people spend time and money. We have business listings, building footprint data, and foot traffic insights for over 7 million places across multiple countries and regions.
In this talk, we will inspect the challenges of geospatial processing at large scale. We will look at open-source frameworks like Apache Sedona (incubating) and its key improvements over conventional technology, including spatial indexing and partitioning. We will explore spatial data structures, data formats, and open-source indexing schemes like H3. We will illustrate how all of these fit together in a cloud-first architecture running on Databricks, Delta, MLflow, and AWS. We will explore examples of geospatial analysis with complex geometries and practical use cases of spatial queries. Lastly, we will discuss how this is augmented by machine learning modeling, human-in-the-loop (HITL) annotation, and quality validation.
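For a feel of the H3 indexing mentioned above, a small sketch with the open-source h3-py library (v3-style API; v4 renames geo_to_h3 to latlng_to_cell; the coordinates are illustrative, not SafeGraph data):

```python
import h3  # h3-py, Uber's hexagonal hierarchical index

lat, lng = 37.7749, -122.4194         # example point (San Francisco)
cell = h3.geo_to_h3(lat, lng, 9)      # resolution-9 hexagon (~0.1 km^2)
neighbors = h3.k_ring(cell, 1)        # the cell plus its immediate ring

print(cell, len(neighbors))           # index string and 7 cells
```

Mapping every geometry to a fixed set of cells like this is what makes spatial joins partitionable on Spark: records sharing a cell land in the same partition.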
Neo4j-Databridge: Enterprise-scale ETL for Neo4j (GraphAware)
Neo4j - London User Group Meetup - 28th March, 2018
If your data ingestion requirements have grown beyond importing occasional CSV files, then this talk is for you. Neo4j-Databridge from GraphAware is a comprehensive ETL tool built specifically for Neo4j. It has been designed for usability, expressive power, and high performance to address the most common issues faced when importing data into Neo4j: multiple data sources and types, very large data sets, bespoke data conversions, non-tabular formats, filtering, merging and de-duplication, as well as bulk imports and incremental updates.
In this talk, we'll take a quick tour of some of the main features, loading data from Kafka, Redis, JDBC, and various other data sources along the way, to understand how Neo4j-Databridge solves these problems and how it can help you import your data quickly and easily into Neo4j.
Vince Bickers is a Principal Consultant at GraphAware and the main author of Spring Data Neo4j (v4). He has been writing software and leading software development teams for over 30 years at organisations like Vodafone, Deutsche Bank, HSBC, Network Rail, UBS, VMWare, ConocoPhillips, Aviva and British Gas.
Observability for Data Pipelines With OpenLineage (Databricks)
Data is increasingly becoming core to many products, whether to provide recommendations for users, to gain insights into how they use the product, or to use machine learning to improve the experience. This creates a critical need for reliable data operations and for understanding how data flows through our systems. Data pipelines must be auditable, reliable, and run on time. This proves particularly difficult in a constantly changing, fast-paced environment.
Collecting lineage metadata as data pipelines run provides an understanding of the dependencies between the many teams consuming and producing data, and of how constant changes impact them. It is the underlying foundation that enables the many use cases related to data operations. The OpenLineage project is an API that standardizes this metadata across the ecosystem, reducing the complexity and duplicate work involved in collecting lineage information. It enables the many projects that consume lineage in the ecosystem, whether they focus on operations, governance, or security.
Marquez is an open source project, part of the LF AI & Data Foundation, which instruments data pipelines to collect lineage and metadata and enable those use cases. It implements the OpenLineage API and provides context by making dependencies across organizations and technologies visible as they change over time.
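As a sketch of what emitting lineage looks like with the openlineage-python client (assuming a local Marquez endpoint; exact constructor fields vary by client version, and the job name is hypothetical):

```python
import uuid
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Assumes a Marquez (or other OpenLineage-compatible) backend on localhost.
client = OpenLineageClient(url="http://localhost:5000")

job = Job(namespace="example", name="daily_orders_etl")  # hypothetical job name
run = Run(runId=str(uuid.uuid4()))

# Emit a START event; a matching COMPLETE or FAIL event would close the run.
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer="https://example.com/my-pipeline",  # URI identifying the emitter
))
```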
When We Spark and When We Don’t: Developing Data and ML Pipelines (Stitch Fix Algorithms)
The data platform at Stitch Fix runs thousands of jobs a day to feed data products that provide algorithmic capabilities powering nearly all aspects of the business, from merchandising to operations to styling recommendations. Many of these jobs are distributed across Spark clusters, while many others are scheduled as isolated single-node tasks in containers running Python, R, or Scala. Pipelines are often composed of a mix of task types and containers.
This talk will cover thoughts and guidelines on how we develop, schedule, and maintain these pipelines at Stitch Fix. We’ll discuss how we decide which portions of a pipeline should run on which platform (e.g., what is important to run distributed across Spark clusters versus in stand-alone containers) and how we get them to play well together. We’ll also provide an overview of the tools and abstractions developed at Stitch Fix to facilitate the process from development, to deployment, to monitoring in production.
Personalization allows Stitch Fix to style its clients and provide recommendations to help them find what they love. To do this, the company gathers information about a client’s preferences up front when they sign up for the service and learns more about them as they become longer-term customers. This information is important for making recommendations but must also be protected and managed with care.
The data science team at Stitch Fix is the primary owner of the recommendation systems. Backing them up is the data platform team, who maintain the data infrastructure, data warehouse, and supporting tools and services. This data warehouse has several different data sources that read and write into it. This includes a logging pipeline for events, every Spark-based ETL, and daily snapshots of structured data from Stitch Fix applications.
Neelesh Srinivas Salian explains Stitch Fix’s process to better understand the movement and evolution of data within its data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh also details how Stitch Fix built a service that helps the company understand the lineage information that is associated with each table in the data warehouse. This service helps the company understand the source, parentage, and journey of all data in the warehouse. Although Stitch Fix makes sure to anonymize and filter out sensitive information from this data, the company needs a more flexible long-term solution as the business expands.
Data Science Languages and Industry Analytics (Wes McKinney)
September 19, 2015 talk at Berkeley Institute for Data Science. On how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).
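One example of the kind of tooling gap the talk describes is flattening nested JSON into a tabular frame, which the pandas ecosystem later addressed with json_normalize. A small sketch with illustrative records:

```python
import pandas as pd

records = [
    {"user": {"id": 1, "name": "Ada"}, "events": 3},
    {"user": {"id": 2, "name": "Grace"}, "events": 7},
]
df = pd.json_normalize(records)  # flattens nested dicts into dotted columns
print(df.columns.tolist())       # ['events', 'user.id', 'user.name']
```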
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict... (Databricks)
The prevailing issue when working with Operating Room (OR) scheduling within a hospital setting is that it is difficult to schedule and predict available OR block times. This leads to empty, unused operating rooms and longer waiting times for patients to receive their procedures. In this three-part session, Ayad Shammout and Denny will show:
1) How we tried to solve this problem using traditional DW techniques
2) How we took advantage of the DW capabilities in Apache Spark and easily transitioned to Spark MLlib, so we could more easily predict available OR block times, resulting in better OR utilization and shorter wait times for patients.
3) Some of the key learnings we had when migrating from DW to Spark.
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa... (Databricks)
Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model, and graph algorithms to the data science toolbox. Graphs have a plethora of useful applications in recommendation, fraud detection, and research. The tutorial aims to help you understand when graphs should be used and how Spark Graph can be used to extend analytical workflows. In this tutorial we will explore the concepts and motivations behind graph querying and graph algorithms, the components of the new Spark Graph module and their APIs, and how those APIs allow you to successfully write your own graph applications and integrate them into your data science workflows.
The tutorial is a mixture of presentation, code examples, and notebooks. We will demonstrate how to write an end-to-end Graph application that operates on different kinds of input data. We will show how Spark Graph interacts with Spark SQL and openCypher Morpheus, a Spark Graph extension that allows you to easily manage multiple graphs and provides built-in Property Graph Data Sources for the Neo4j graph database as well as Cypher language extensions.
At the end of the tutorial, attendees will have a good understanding of when to apply graphs in their data science workflows, how to bring Spark Graph into an existing Spark workflow, and how to make the best use of the new APIs. The tutorial will be led by the presenters and will also include a hands-on interactive session. The tutorial material will be made available during the presentation.
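Spark Graph and Morpheus target the Scala API and were still incubating at the time; as a stand-in that runs today, the separate GraphFrames package offers comparable property-graph querying from PySpark. A sketch (assumes the graphframes Spark package is installed; the data is illustrative):

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # requires the graphframes Spark package

spark = SparkSession.builder.getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "KNOWS"), ("b", "c", "KNOWS")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.find("(x)-[e]->(y)").show()       # Cypher-like pattern matching over the graph
g.pageRank(resetProbability=0.15,   # one of the bundled graph algorithms
           maxIter=5).vertices.show()
```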
Societal Challenge 6: Social Sciences - Spending Comparison (BigData_Europe)
Jürgen Jakobitsch describes the BDE project pilot for Societal Challenge 6 (Social Sciences). The platform is being used to ingest, analyse and visualise spending data from multiple sources.
CuRious about R in Power BI? End-to-end R in Power BI for beginners (Jen Stirrup)
In this session, we will start R right from the beginning, from installing R through to data transformation and integration, through to visualizing data by using R in Power BI. Then, we will move on to powerful but simple-to-use data types in R such as data frames. We will also upgrade our data analysis skills by looking at R data transformation using a powerful set of tools that keeps things simple: the tidyverse. Then, we will look at integrating our R work into Power BI, and visualizing our data using beautiful visualizations with R and Power BI. Finally, we will share our work by publishing our Power BI project, with our R code, to the Power BI service. We will also look at refreshing our dataset so that our new dashboard has refreshed data.
This session is aimed at getting beginners up to speed as gently and quickly as possible. Join this session if you are curious about R and want to know more. If you are already a Power BI expert, join this session to open up a whole new world of Power BI to add to your skill set. If you are new to Power BI, you will still get value from this session since you'll be able to see a Power BI dashboard being built in an end-to-end solution.
Evaluation of TPC-H on Spark and Spark SQL in ALOJA (DataWorks Summit)
The Evaluation of TPC-H on Spark and Spark SQL in ALOJA was conducted at the Big Data Lab to obtain a master's degree in Management Information Systems at the Johann Wolfgang Goethe University in Frankfurt, Germany. The analysis was partially accomplished in collaboration and close coordination with the Barcelona Supercomputing Center.
The intention of this research was to integrate a TPC-H benchmark on Spark Scala into ALOJA, an open-source public platform for automated and cost-efficient benchmarks, and to evaluate the runtime of Spark Scala, with and without the Hive Metastore, compared to Spark SQL. The impact of various file formats, with different compressions applied to the underlying data, is also evaluated. The performance evaluation exposed diverse and captivating outcomes for both benchmarks. Further investigation attempts to detect possible bottlenecks and other irregularities, with the aim of enhancing knowledge of Spark's engine by examining the physical plans. Our experiments show, inter alia, that (1) Spark Scala performs better in the case of heavy expression calculation, and (2) Spark SQL is the better choice in the case of strong data access locality combined with heavyweight parallel execution. In conclusion, diverse results were observed, with the consequence that each API has its advantages and disadvantages.
Surprisingly, our findings are well spread between Spark SQL and Spark Scala: contrary to our expectations, Spark Scala did not outperform Spark SQL in all aspects. This supports the idea that the applied optimizations are implemented differently by Spark for its core and for its Spark SQL extension. The API on top of Spark provides extra information about the underlying structured data, which is probably used to perform additional optimizations.
In conclusion, our research demonstrates that there are differences in the generation of query execution plans, hand-in-hand with related findings of inefficient joins, and it underlines the value of our benchmark for identifying disparities and bottlenecks.
Speaker
Raphael Radowitz, Quality Specialist, SAP Labs Korea
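As a rough Python illustration of the two front ends the study compares (the thesis uses the Scala API; PySpark's DataFrame API is assumed here as a stand-in), the same aggregation can be expressed programmatically or as SQL, and explain() exposes the physical plans the authors examine:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (1, 5.0), (2, 7.0)], ["key", "amount"])
df.createOrReplaceTempView("orders")

# Same logical query via the programmatic API and via SQL text.
by_api = df.groupBy("key").agg(F.sum("amount").alias("total"))
by_sql = spark.sql("SELECT key, SUM(amount) AS total FROM orders GROUP BY key")

by_api.explain()  # compare the generated physical plans,
by_sql.explain()  # as the study does to explain runtime differences
```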
This presentation is an attempt to demystify the practice of building reliable data processing pipelines. We go through the pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, schemas, and pipeline development processes. The presentation also includes component choice considerations and recommendations, as well as best practices and pitfalls to avoid, most of them learned through expensive mistakes.
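For the workflow-management piece, a scheduler such as Apache Airflow is one common choice; a minimal DAG sketch under that assumption, with placeholder task bodies (not the talk's own stack):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull raw data")       # placeholder for a real ingestion step

def process():
    print("transform and load")  # placeholder for the processing engine

with DAG("reliable_pipeline",
         start_date=datetime(2023, 1, 1),
         schedule_interval="@daily",  # run once per day
         catchup=False) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    process_task = PythonOperator(task_id="process", python_callable=process)
    ingest_task >> process_task       # processing waits for ingestion
```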
Architecting an Open Source AI Platform, 2018 edition (David Talby)
How to build a scalable AI platform using open source software. The end-to-end architecture covers data integration, interactive queries & visualization, machine learning & deep learning, deploying models to production, and a full 24x7 operations toolset in a high-compliance environment.
At the Data-centric Architecture Forum 2020, Thomas Cook, our Sales Director of AnzoGraph DB, gave his presentation "Knowledge Graph for Machine Learning and Data Science". These are his slides.
Developing Enterprise Consciousness: Building Modern Open Data Platforms (ScyllaDB)
ScyllaDB, alongside some of the other major distributed real-time technologies, gives businesses a unique opportunity to achieve enterprise consciousness: a business platform that delivers data to the people who need it, when they need it, any time, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications using open source tools and technologies and more modern low-code ETL/reverse-ETL tools; a connection sketch follows the topic list below.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What ScyllaDB can do for big companies
- What ScyllaDB can do for smaller companies
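Since ScyllaDB speaks the Cassandra wire protocol, the standard Python cassandra-driver connects unchanged; a minimal sketch with an illustrative local node and schema:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])  # a local Scylla node; replace with your contact points
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("demo")
session.execute("CREATE TABLE IF NOT EXISTS events (id int PRIMARY KEY, note text)")

session.execute("INSERT INTO events (id, note) VALUES (%s, %s)", (1, "hello"))
print(session.execute("SELECT * FROM events").one())
```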
Session 8 - Creating Data Processing Services | Train the Trainers Program (FIWARE)
In this technical session for Local Experts in Data Sharing (LEBDs), we explain how to create the data processing services that are key to i4Trust.
AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Conn... (Cambridge Semantics)
Thomas Cook, director of sales, Cambridge Semantics, offers a primer on graph database technology and the rapid growth of knowledge graphs at Data Summit 2020 in his presentation titled "AnzoGraph DB: Driving AI and Machine Insights with Knowledge Graphs in a Connected World".
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da... (Abhinav Joshi)
This deck provides an overview of containers and Kubernetes and how these technologies can help solve the challenges faced by data scientists, ML engineers, and application developers. Next, it showcases the key capabilities required in a containers-and-Kubernetes platform to help data scientists easily use technologies like Jupyter notebooks, ML frameworks, and programming languages to innovate faster. Finally, it discusses the available platform options (e.g., Kubeflow, Open Data Hub, etc.) and some examples of how data scientists are accelerating their ML initiatives with a containers-and-Kubernetes platform.
Using Cloud Automation Technologies to Deliver an Enterprise Data Fabric (Cambridge Semantics)
The world of database management is changing. Cloud adoption is accelerating, offering a path for companies to increase their database capabilities while keeping costs in line. To help IT decision-makers survive and thrive in the cloud era, DBTA hosted this special roundtable webinar.
Day 13 - Creating Data Processing Services | Train the Trainers Program (FIWARE)
In this technical session for Local Experts in Data Sharing (LEBDs), we explain how to create the data processing services that are key to i4Trust.
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy W... (StampedeCon)
This session will be a detailed recount of the design, implementation, and launch of the next-generation Shutterstock Data Platform, with strong emphasis on conveying clear, understandable learnings that can be transferred to your own organizations and projects. This platform was architected around the prevailing use of Kafka as a highly-scalable central data hub for shipping data across your organization in batch or streaming fashion. It also relies heavily on Avro as a serialization format and a global schema registry to provide structure that greatly improves quality and usability of our data sets, while also allowing the flexibility to evolve schemas and maintain backwards compatibility.
As a company, Shutterstock has always focused heavily on leveraging open source technologies in developing its products and infrastructure, and open source has been a driving force in big data more so than almost any other software sub-sector. With this plethora of constantly evolving data technologies, it can be a daunting task to select the right tool for your problem. We will discuss our approach for choosing specific existing technologies and when we made decisions to invest time in home-grown components and solutions.
We will cover advantages and the engineering process of developing language-agnostic APIs for publishing to and consuming from the data platform. These APIs can power some very interesting streaming analytics solutions that are easily accessible to teams across our engineering organization.
We will also discuss some of the massive advantages a global schema for your data provides for downstream ETL and data analytics. ETL into Hadoop and the creation and maintenance of Hive databases and tables become much more reliable and easily automated with historically compatible schemas. To complement this schema-based approach, we will cover the results of performance-testing various file formats and compression schemes in Hadoop and Hive, the massive performance benefits you can gain in analytical workloads by leveraging highly optimized columnar file formats such as ORC and Parquet, and how you can use good old-fashioned Hive as a tool for easily and efficiently converting existing datasets into these formats.
Finally, we will cover lessons learned in launching this platform across our organization, future improvements and further design, and the need for data engineers to understand and speak the languages of data scientists and web, infrastructure, and network engineers.
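As one hedged illustration of the Avro-plus-schema-registry publishing pattern described above, using confluent-kafka's legacy Avro helper (newer releases favor SerializingProducer with an AvroSerializer; the endpoints, topic, and schema are placeholders, not Shutterstock's):

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer  # pip install confluent-kafka[avro]

# Schema registered centrally so downstream ETL can evolve it compatibly.
value_schema = avro.loads("""
{"type": "record", "name": "Event",
 "fields": [{"name": "id", "type": "long"},
            {"name": "action", "type": "string"}]}
""")

producer = AvroProducer(
    {"bootstrap.servers": "localhost:9092",           # illustrative broker
     "schema.registry.url": "http://localhost:8081"}, # illustrative registry
    default_value_schema=value_schema)

producer.produce(topic="events", value={"id": 1, "action": "view"})
producer.flush()  # block until the message is delivered
```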
Graph Analytics on Data from Meetup.com (Karin Patenge)
How to improve your Meetup experience by using Graph Analytics on data from Meetup.com. Slides from my session with "Women Who Code" group in Berlin on May 23, 2018.
Data Engineer’s Lunch #81: Reverse ETL Tools for Modern Data Platforms (Anant Corporation)
During this lunch, we’ll review open-source reverse ETL tools to uncover how to send data back to SaaS systems.
Similar to Red Hat infrastructure for analytics
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
JMeter webinar - integration with InfluxDB and Grafana (RTTS)
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
1. Analytics and Machine Learning with Red Hat Infrastructure
Kyle Bader, Senior Architect
Sean Pryor, AI Developer
Sherard Griffin, Senior Manager, Open Data Hub
BOSTON, 2019
2. ● PROBLEM STATEMENT
○ Multi-tenant data analytics and machine learning
○ Shared data context
○ Sensitive data can’t leave the country, data governance restrictions
● DATA STRUCTURES
○ Shared data context with Ceph
○ Preparing your data
■ Structured data with Hive Metastore*
■ Semi-structured data
■ Data processing jobs
■ Spark
○ AI/ML
■ Features/Labels/other important terms
■ Background on AI and how it works
■ TensorFlow
● DATA PLATFORM ARCHITECTURE
○ Open Data Hub (Spark, Ceph, JupyterHub, TensorFlow)
○ Follow-up slides for them to learn more
■ ISVs
■ ODH
■ Frameworks
■ Other talks, etc.
4. ANALYTICS AND ML CHALLENGES
EXPLOSIVE GROWTH
in analytics teams and analytic tools
MULTIPLE TEAMS COMPETING
for use of the same big data resources
CONGESTION
in busy analytic clusters causing frustration
and missed SLAs
HADOOP
SPARK
HIVE
PRESTO
IMPALA
KAFKA
NIFI
TENSORFLOW
PYTORCH
5. OPTIONS TO ADDRESS CHALLENGES
#1: Get a bigger cluster for many teams to share
#2: Give each team its own dedicated cluster, each with copies of PBs of data
#3: Give teams the ability to spin up/spin down clusters which can share a common data store
6. MULTI-WORKLOAD TENANCY
SHARED DATA CONTEXT
HIT SERVICE-LEVEL AGREEMENTS
Give teams their own compute clusters.
ELIMINATE IDLE RESOURCES
By right-sizing de-coupled compute and storage.
BUY 10’s OF PBS INSTEAD OF 100’s
Share data sets across clusters instead of duplicating them.
INCREASE AGILITY
With spin-up/spin-down clusters.
7. HYBRID CLOUD ANALYTICS AND ML
OPERATOR FRAMEWORK
Provides a managed service like experience
STATEFUL STORAGE SERVICES
Object, block, and file interfaces
DEVICE PLUGIN
GPU acceleration
LOCAL PVS
High performance scratch storage
20. OPEN DATA HUB
Collaborate on a Data & AI platform for the Hybrid Cloud
● Open source community for AI-as-a-service platform
● Cloud-agnostic - AI for the Hybrid Cloud
● No cloud vendor lock-in
● OpenDataHub.io
21. Sentiment analysis and entity detection on customer engagements, support tickets, marketing surveys and more. Trained on the specific Red Hat product terminology.
Laptop | Datacenter | OpenStack | AWS | Microsoft Azure
CONTAINERIZED APPS AT RED HAT'S CORE PROCESSES
Internal Use Cases
22. Laptop | Datacenter | OpenStack | AWS | Microsoft Azure
CONTAINERIZED APPS AT RED HAT'S CORE PROCESSES
Internal Use Cases
Improve Red Hat's core Engineering and Operations processes by applying analytics, machine learning, and AI.
Laptop | Datacenter | OpenStack | AWS | Microsoft Azure
CONTAINERIZED APPS
- rules
- heuristics
- ML
23. CORE DEPLOYMENT
OpenShift
● Container platform
● Certified Kubernetes
● Hybrid cloud
Ceph
● Unified, distributed storage
● RESTful gateway
● S3 and Swift compatible
Spark
● Radanalytics.io community
● Unified analytics engine
● Large-scale data
● Runs on Kubernetes
JupyterHub
● Multi-user Jupyter
● Used for data science and research
Available Now at OpenDataHub.io
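Because Ceph's RADOS Gateway is S3-compatible, any standard S3 client can reach the shared data store; a sketch with boto3, where the endpoint, credentials, and bucket are placeholders:

```python
import boto3

# Point a standard S3 client at a hypothetical Ceph RGW endpoint.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.example.com:8080",  # placeholder RGW address
    aws_access_key_id="ACCESS",                       # placeholder credentials
    aws_secret_access_key="SECRET",
)

s3.create_bucket(Bucket="shared-datasets")
s3.put_object(Bucket="shared-datasets", Key="raw/sample.csv", Body=b"a,b\n1,2\n")
print([o["Key"] for o in s3.list_objects_v2(Bucket="shared-datasets")["Contents"]])
```

The same bucket can then be read from Spark via the s3a:// connector, which is what lets spin-up/spin-down clusters share one data context.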
24. Add-Ons
AI Library
● Part of Open Data Hub
● Set of deployed pre-defined AI models available to use
Prometheus
● Monitoring and alerting toolkit
● Records numeric time series data
● Used to diagnose problems
Grafana
● Analytics platform for all metrics
● Query, visualize and alert on metrics
Seldon
● Deploying machine learning models on Kubernetes
● Expose models via REST and gRPC
● Full model lifecycle management
Available Now at OpenDataHub.io
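Seldon, listed above, serves deployed models over REST and gRPC; a hedged example of calling Seldon Core's v1 REST prediction endpoint, where the hostname, namespace, deployment name, and feature vector are all placeholders:

```python
import requests

# Seldon Core's documented v1 prediction path:
#   /seldon/<namespace>/<deployment>/api/v1.0/predictions
resp = requests.post(
    "http://seldon.example.com/seldon/default/my-model/api/v1.0/predictions",
    json={"data": {"ndarray": [[5.1, 3.5, 1.4, 0.2]]}},  # one row of input features
)
print(resp.json())  # prediction payload in the same data envelope
```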
26. PLANNED RELEASES
Highlights
January 2019: Version 0.1 - Initial ODH Release
- OCP 3.10 and 3.11 support
- JupyterHub + Spark + Ceph-nano deployment
April 2019: Operator Support + Monitoring
- OCP 4.0+ support
- Open Data Hub operator
- AI Library
- Rook for Ceph deployment
- TwoSigma BeakerX integration
- JupyterHub with GPU support
- Prometheus deployment with Spark monitoring
July 2019: Data Engineering Additions
- Cloudera Hue deployment
- Spark SQL Thrift Server deployment
- Argo deployment
- MLFlow deployment
- Kubeflow integration
- Kafka (Strimzi) deployment
- Seldon-core deployment
October 2019: To be determined
29. WHAT NEXT?
● Try Open Data Hub yourself!
○ https://try.openshift.com
○ https://gitlab.com/opendatahub/opendatahub-operator
● Building the Next Generation of Innovation Together
○ Thursday at 8:30 AM
● Kaleidoscope of Innovation: AI and Machine Learning on OpenShift
○ Part 1: Thursday at 2:00 PM
○ Part 2: Thursday at 3:15 PM
Red Hat data analytics infrastructure solution
red.ht/videos-RHDAIS