Data Operations Problems Created by Deep
Learning, and How to Fix Them!
Jim Scott
@kingmesal
2 © 2018 MapR Technologies, Inc. // MapR Confidential
Public Service Announcement
You may see Artificial Intelligence, Machine
Learning, and Deep Learning used interchangeably
within this presentation. Please feel free to
mentally substitute the phrase of your choice if
it is more applicable to you.
I’m not trying to split hairs on terminology.
Thanks for understanding!
Terminology
[Figure: Data Science, Artificial Intelligence, Machine Learning, Deep Learning]
Data Science
The use of algorithms to extract knowledge and insights from data in various forms.
Some Subfields: Statistics, Artificial Intelligence (AI), Computational Math
Artificial Intelligence (AI)
The simulation of intelligent human behavior for problem-solving and decision-making.
Some Subfields: Robotics, Natural Language Processing (NLP), Machine Learning
Machine Learning (ML)
The process by which machines are taught to make calculated suggestions and/or
predictions by examining large amounts of input data.
Some Subfields: Logistic Regression, Deep Learning, Reinforcement Learning
90% of the effort in successful
machine learning isn’t in the
training or model development…
It’s the logistics!
The Importance of Data Logistics
“Machine learning offers a fantastically powerful toolkit for
building complex systems quickly. … it is remarkably easy
to incur massive ongoing maintenance costs at the
system level when applying machine learning.”
Why?
Just getting the training data is hard:
● Which data? How to make it accessible? Multiple sources!
● New kinds of observations force restarts
● Requires a ton of domain knowledge
The myth of a single model:
● You cannot train just one
● You will have dozens of models, likely hundreds or more
● Handoff to new versions is tricky
● You need real runtime data to be sure which model is better
Problem 1
Lack of support for the
Artificial Intelligence
Software Development Lifecycle
(AI-SDLC)
Building a Machine Learning Solution
1. Identify the data sources
2. Identify the tools
3. Write some code
4. Train a model
5. Test the model
6. Analyze the output of the tests
7. Repeat steps 3 through 6 until happy-ish
a. Maybe swap out a tool if you cannot achieve happiness
8. Figure out how to get this solution into production
Choosing the Best Machine Learning Tool
Most successful groups keep several “favorite” machine learning tools on hand:
● No single tool is the best in every situation
The most important tool is a platform that supports logistics well:
● Everything does not need to be done at the application level
● Lots of what matters can be handled at the platform level
A good design for the logistics can make a big difference
Problem 2
The deep learning
workload is only one
type of workload
Workloads
Massaging your data
● This will normally include cleansing, normalizing, and even optimizing data formats
for downstream consumption by GPU-based workloads
● Distributed compute is often the best approach for this type of activity due to the
volume of data and variety of data types
● Be sure to keep your original data, in case of mistakes
Separate your training and holdback data sets
● This is based on the massaged data
Analyze model outputs to determine the quality of your models
● Especially valuable over time to know that the models are moving the right way
● Great candidate for a distributed workload
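One way to separate training and holdback data so that the split stays stable across reruns is to hash a record identifier instead of sampling at random. The sketch below is only an illustration under assumptions: the function names, the `id` field and the 20% holdback fraction are hypothetical and do not come from the talk.

```python
import hashlib


def assign_split(record_id: str, holdback_fraction: float = 0.2) -> str:
    """Deterministically map a record ID to 'train' or 'holdback'."""
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdback" if bucket < holdback_fraction else "train"


def split_records(records, key="id", holdback_fraction=0.2):
    """Split already-massaged records into (train, holdback) lists."""
    train, holdback = [], []
    for record in records:
        side = assign_split(str(record[key]), holdback_fraction)
        (holdback if side == "holdback" else train).append(record)
    return train, holdback
```

Because the assignment depends only on the record ID, re-running data preparation after fixing a cleansing mistake does not move records between the two sets.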
Problem 3
Putting machine learning into
production is not quite the same
as other enterprise software
Gotchas with Making it to Production
● Ops-oriented people will not necessarily “get it” regarding modeling subtleties
● Data scientists will not necessarily “get it” regarding operational realities
● Therefore, modelers have to deliver self-contained models
● And, ops has to provide pre-wired structure
Handling Real-time
Stream instead of database as the shared “truth”
Image © 2016 Ted Dunning & Ellen Friedman from Chap 6 of O’Reilly book Streaming Architecture used with permission
Real-time can be started on
your schedule, that is the key
Problem 4
Running machine learning
models in more than
one location is tough
Streaming Isolates Services
With MapR, Geo-Distributed Data Appears Local
With MapR, Geo-Distributed Data Appears Local
Global Data Center
Regional Data Center
Features of Good Streaming
Persistent
● Messages stick around for other consumers
● Consumers don’t affect producers
● Consumer doesn’t have to be online when message arrives
Performant
● You should NEVER need to worry if a stream can keep up
Pervasive
● It is there whenever you need it, no need to deploy anything
● How much work is it to create a new file? Why harder for a stream?
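The "persistent" properties above can be illustrated with a toy in-memory log. This is a minimal sketch and not a real streaming system: the `Stream` class and its methods are hypothetical names invented for the example. The point is that messages stay in the log and each consumer keeps its own offset, so consumers neither affect producers nor each other.

```python
class Stream:
    """Append-only log where every consumer tracks its own read offset."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer name -> index of next unread message

    def publish(self, message):
        """Producers only append; they never wait for or know about consumers."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=100):
        """Return unread messages for this consumer and advance only its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch
```

A consumer that first polls long after the messages were published still sees all of them, which is the "doesn't have to be online when message arrives" property.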
Improving Machine Learning Logistics
Stream first architecture is a powerful approach with surprisingly widespread
advantages
● Innovative technologies are emerging for streaming data
Microservices approach provides flexibility
● Streaming supports microservices (if done right)
Containers remove surprises
● Predictable environment for running models
Problem 5
Data dependencies cost more
than code dependencies,
a lot more!
Top Reason to Use a Streaming Architecture
Data dependencies cost more than code dependencies!
● Code dependencies are easy to track, because tracking them is a well-known and
well-practiced discipline
● The data may be unstable
Undeclared consumers can wreak havoc on your models
● Downstream users may create a data dependency on the output of your model
● Updates to your model may break their system if they made assumptions about
how your model behaves
● Who do you think will suffer?
First Look with Streams
Then Rendezvous
Faster Throughput Through Failure
Suppose we have one model that can handle 10,000 t/s @ 2ms
● But this isn’t the most accurate model. Not bad, but not the best.
And our champion model can handle 1,000 t/s @ 10ms
● Then imagine a burst of 2,000 t/s for several minutes
Champion can only evaluate half of all requests
● Should skip to keep up
● Fast model will cover for champion
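The arithmetic behind "half of all requests" can be written down directly. The helper names below are hypothetical; the numbers in the usage note are the ones from the slide.

```python
def champion_coverage(arrival_rate: float, champion_capacity: float) -> float:
    """Fraction of requests the champion can score at a given arrival rate."""
    if arrival_rate <= 0:
        return 1.0
    return min(1.0, champion_capacity / arrival_rate)


def fallback_rate(arrival_rate: float, champion_capacity: float) -> float:
    """Requests per second the champion must skip and leave to the fast model."""
    return max(0.0, arrival_rate - champion_capacity)
```

With a burst of 2,000 t/s and a champion capacity of 1,000 t/s, `champion_coverage(2000, 1000)` is 0.5 and `fallback_rate(2000, 1000)` is 1,000 t/s, which is well within the fast model's 10,000 t/s.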
Rendezvous – Mainly for Making Decisions
Decisioning models
● Looking for a “right answer”
● Simpler than reinforcement learning
Examples
● Fraud detection
● Predictive analytics / market prediction
● Churn prediction (as in telecommunications)
● Yield optimization
● Deep learning in form of speech or image recognition, in some cases
Some Key Points
● Note that all models see identical inputs
● All models run in production setting
● All models send scores to same stream
● The rendezvous server decides which scores to ignore
● Roll forward, roll back, correlated comparison are all now trivial
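A minimal sketch of how a rendezvous step might pick one result per request, assuming each score record carries a `request_id` and a `model` name and that preference is expressed as an ordered list. These field names and the `rendezvous` function are assumptions made for illustration, not the implementation from the talk.

```python
def rendezvous(scores, preference):
    """Pick, per request, the score from the most preferred model that answered.

    scores: iterable of dicts with 'request_id', 'model' and 'score' keys.
    preference: model names, most preferred first; other models are ignored.
    """
    rank = {model: position for position, model in enumerate(preference)}
    chosen = {}
    for record in scores:
        if record["model"] not in rank:
            continue  # still running in production, but its score is ignored
        best = chosen.get(record["request_id"])
        if best is None or rank[record["model"]] < rank[best["model"]]:
            chosen[record["request_id"]] = record
    return chosen
```

In this sketch, rolling forward or back is only a change to the `preference` list; no model is redeployed, and the ignored scores remain in the stream for correlated comparison.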
Problem 6
Wash,
Rinse,
Repeat!
Concerns About Repeatability
Are you performing all of these steps in your AI-SDLC manually?
● Consider a workflow tool
○ e.g. Airflow, Kubeflow, Argo, etc.
Is all of your data in static files or will there be real-time data?
● Prepare for real-time in development to be ready for production
Version everything!
● I’m sorry, this isn’t a job for Git!
● Includes source data: static and real-time
● Also includes models and their output
● Ensures sanity checks
● Long-term performance analytics
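To make "version everything" concrete, one illustrative approach is to fingerprint the data, the model and its output together, so that a single ID identifies the exact combination used in a run. This is a sketch under assumptions: the function names and manifest layout are hypothetical, and a real platform would more likely rely on snapshots or a similar built-in facility than on hashing files by hand.

```python
import hashlib
import json


def fingerprint(blob: bytes) -> str:
    """Content hash used as a version identifier."""
    return hashlib.sha256(blob).hexdigest()


def build_manifest(artifacts: dict) -> dict:
    """Record one version per artifact plus an ID for the whole combination.

    artifacts maps a logical name (e.g. 'training-data', 'model') to bytes.
    """
    entries = {name: fingerprint(blob) for name, blob in sorted(artifacts.items())}
    manifest_id = fingerprint(json.dumps(entries, sort_keys=True).encode("utf-8"))
    return {"id": manifest_id, "entries": entries}
```

If any input changes, the manifest ID changes, which gives a cheap sanity check that two runs really used the same data and model.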
Quality & Reproducibility of Input Data is Important!
Recording raw-ish data is really a big deal
● Data as seen by a model is worth gold
● Data reconstructed later often has time-machine leaks
● Databases were made for updates, streams are safer
Raw data is useful for non-ML cases as well (think flexibility)
Decoy model records training data as seen by models under development &
evaluation
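A decoy model can be sketched as something that looks like a model to the rest of the system but only records what it sees. The class below is a hypothetical illustration; the `score` method name and the JSON-lines format are assumptions, and the sink could be any writable file-like object.

```python
import json


class DecoyModel:
    """Looks like a model, but only archives the exact inputs it receives."""

    def __init__(self, sink):
        self.sink = sink  # any object with a write(str) method

    def score(self, request: dict):
        # Write the request as seen at scoring time, one JSON document per line.
        self.sink.write(json.dumps(request, sort_keys=True) + "\n")
        return None  # a decoy produces no score of its own
```

Because the record is written when the request arrives, later training uses the data as the models saw it instead of a reconstruction that may leak information from the future.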
A Quick Review
The Proxy Talks to the Outside World
The Input Stream Feeds All Models Identically
The Scores Stream Contains All Results
The Rendezvous Picks A Result
Results Return Via A Stream and Return Address
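The review steps can be strung together in a toy end-to-end sketch, with plain Python lists standing in for streams. Everything here is an assumption made for illustration: the stream names `input` and `scores`, the `reply_to` field used as the return address, and the three function names.

```python
import uuid
from collections import defaultdict


def submit(streams, payload, reply_to):
    """Proxy: wrap the outside request with an ID and a return address."""
    request = {"request_id": uuid.uuid4().hex, "reply_to": reply_to, "payload": payload}
    streams["input"].append(request)
    return request["request_id"]


def run_models(streams, models):
    """Every model reads the same input stream and writes to one scores stream."""
    for request in streams["input"]:
        for name, model in models.items():
            streams["scores"].append({
                "request_id": request["request_id"],
                "reply_to": request["reply_to"],
                "model": name,
                "score": model(request["payload"]),
            })


def run_rendezvous(streams, preferred):
    """Forward only the preferred model's score to the stream named in reply_to."""
    for record in streams["scores"]:
        if record["model"] == preferred:
            streams[record["reply_to"]].append(
                {"request_id": record["request_id"], "score": record["score"]}
            )
```

A caller would create `streams = defaultdict(list)`, call `submit`, then `run_models` and `run_rendezvous`, and read its answer from the stream it named as the return address.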
Problem 7
In the real world
conditions may
(will) change!
Not Such Bad Ideas
Keep models running “in the wings”
● Do not wait until conditions change to start building the next model
● Keep new short-history models ready to roll, some graybeards as well
Hot hand-off
● With rendezvous, stop ignoring the new best model
Deploy a canary server
● Keep an old model active as a reference
● If it was 90% correct, difference with any better model should be small
● Score distribution should be roughly constant
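Two simple checks in the spirit of the canary bullets are sketched below: how often a candidate's decision differs from the canary's on the same inputs, and how far the canary's mean score has moved from a reference period. The function names and the 0.5 decision threshold are hypothetical choices for the example.

```python
from statistics import mean


def disagreement_rate(canary_scores, candidate_scores, threshold=0.5):
    """Fraction of paired requests where the two models decide differently."""
    pairs = list(zip(canary_scores, candidate_scores))
    if not pairs:
        return 0.0
    differing = sum((a >= threshold) != (b >= threshold) for a, b in pairs)
    return differing / len(pairs)


def mean_shift(reference_scores, recent_scores):
    """Absolute change in the mean score between two periods of canary output."""
    return abs(mean(recent_scores) - mean(reference_scores))
```

A large disagreement rate against a canary that was already mostly correct, or a noticeable shift in the canary's own score distribution, is a hint that either the new model or the incoming data deserves a closer look.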
Prepare for Scaling Up
Model Variety
● Multiple rendezvous frameworks for different tasks
Throughput
● Fast default models
● Partition input stream to allow parallel model evaluation
● Input batching
Extreme Volumes
● Cannibalize fancy models to run more fast/simple models
● Speed before beauty
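Partitioning and batching, two of the throughput levers listed above, can be sketched in a few lines. The helper names are hypothetical, and CRC32 is used here only because it is a stable hash available in the standard library; a real streaming system would apply its own partitioner.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition so the same key always lands together."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def batches(items, size):
    """Yield consecutive slices of at most `size` items for batched scoring."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each partition can then be read by its own model instance in parallel, and each instance can score a batch per call instead of one request at a time.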
Making Improvements
1. Data + the right question + domain knowledge matter!
2. Prioritize – put serious effort into infrastructure
a. DataOps requires more than just data science
3. Persist – use streams to keep data around
4. Measure – everything, and record it
5. Analyze – understand and see what is happening
6. Containerize – make deployment predictable, repeatable and easy
Problems 8, 9 and 10
Copying data from your
streaming system,
data lake,
and edge systems to your
machine learning environment
PLEASE, PLEASE, PLEASE…
...tell me you are not copying
all your data between these systems
Traditional Storage Vendor Solution
[Diagram labels: Edge, Core, Cloud, Storage Appliance, Ingest, Unified Data Lake, Servers, Data Prep, Training + Testing, Training Cluster, Servers w/ GPU, Production Deployment; four “Copy” steps between stages]
Does NOT support real-time workflows
Doesn’t support distributed data preparation workloads
Hadoop Based Solutions
[Diagram labels: Edge, Core, Cloud, Ingest, Kafka (in-motion), Unified Data Lake, HDFS Cluster, Servers, Data Prep, Training + Testing, Training Cluster, Servers w/ GPU, Production Deployment; eight “Copy” steps between stages]
Where does the master copy of the data live?
Minimum of seven non-homogeneous environments to administer and secure
Full data copies without versioning, lineage control or multi-master support
MapR Solution
[Diagram labels: Data Fabric, Global Namespace, Edge, Core, Cloud, Data Prep, Training + Testing, Deployment]
One homogeneous environment to manage and secure
Supports real-time processing with data protection, lineage, and versioning
Runs directly on DGX servers to create a unified DGX cluster
MapR AI + RAPIDS
[Diagram labels]
Data Management: Document DB, Events, Structured Data, Unstructured Data
Typical Training and Evaluation Workflow: Data Preparation, Training Data Set, Model Training, Evaluate/Visualize
RAPIDS: cuDF (Analytics), cuML (Machine Learning), cuGRAPH (Graph Analytics), Apache Arrow (GPU Memory)
Production Deployment: Inference, Events, Applications
Advantages of the MapR Data Fabric
How Data is Stored
● Linear Scalability
● Architected for performance, scale, and reliability
● Distributed metadata in the fabric
How Data is Distributed
● Distributed location support
● Multi-master Replication
● Location awareness
How Data is Accessed
● Capability to serve as a system of record
● Data security and governance within the fabric
● Mixed Data access from multiple protocols
● Distributed Multi-tenancy
● Global Namespace
● Integrated data streaming for AI
Architecture Matters
On-premise or Cloud Infrastructure
• Combines Distributed Compute, AI, HPC, and general purpose workloads
• MapR provides complementary data logistics to better manage and deploy deep
learning across the entire ecosystem
• Enables deployment agility with data management extending from on-premise,
to cross-cloud, to the edge
MapR Advantages for AI
Simplified administration and security models
● One and done - no need for a different model in each location
● GDPR “compliant”!
Scales linearly with customer needs
● No reason to create a bunch of separate clusters
Sustainability - All data, files, database and event streaming
● Both at-rest and in-motion
An enabling and flexible architecture
● Only way to bring distributed data and GPUs together
● Easy to meet customers’ needs
● Supports both Kubernetes and containers
Low cost of entry and linear cost of scaling
On-Premise, Cloud or Both
Same platform and architecture in all locations:
● On-premise works the same as the cloud
● Second cloud works the same as a first cloud
● Data mirroring between locations is built-in
● Real-time event management and lineage is built-in
○ Scale out streaming applications without rearchitecting them
● Kubernetes is a simple way to inject MapR storage and GPU support into a container
○ Leverage resources anywhere with Global Namespace
○ Application portability across all locations, no rework required
Simplifying Model Development and Deployment
Complex data pipelines, large data volumes serving GPUs
● Mixed workloads - distributed data prep plus real-time
Simultaneous data and model versioning
● Data at-rest and in-motion
Model output lands in a stream
● Creates pluggable model flow
Works across on-premise and cloud infrastructures, simultaneously
The Key is Data Logistics
“90+% of Machine Learning Success is Data Logistics”
https://mapr.com/ebook/machine-learning-logistics
Need Help Solving Your Data Logistics Problems?
● Over 35 FREE on-demand training courses for AI and analytic development, data
engineering and administration
● Certification tracks for developers, administrators, and data scientists
● Expanded support portal and knowledge base
● Containerized clusters for free download, plus solution templates and code examples
for hands-on experience
https://mapr.com/training/
Thank you!

Big Data LDN 2018: DATA OPERATIONS PROBLEMS CREATED BY DEEP LEARNING, AND HOW TO FIX THEM!
