ODSC East virtual presentation - The best machine learning and advanced analytics projects often stall when it comes time to move into large-scale production, which prevents them from ever impacting the business in a meaningful way. Hundreds of hours of work may never get put to use.
Python is rapidly becoming the language of choice for scientists and researchers of many types to build, test, train and score models. But when data science models need to go into production, challenges of performance and scale can be a huge roadblock.
By combining a Python application with an underlying massively parallel processing (MPP) database, Python users can achieve a simplified path to production. An MPP database also lets you do data preparation and data analysis at far greater speeds, accelerating development and testing as well as production performance. It supports greater numbers of concurrent jobs while continuously loading data for IoT or other streaming use cases.
Analyze data in the database where it sits. Moving it to another framework, analyzing it there, and then moving the results back costs CPU and I/O for every move and transformation.
In this talk, you will learn about combination architectures that can get your work into production, shorten development time, and provide the performance and scale advantages of an MPP database with the convenience and power of Python. Use case examples use the open source vertica-python project created by Uber with contributions from Twitter, Palantir, Etsy, Vertica, Kayak and GoodData.
4. Production Machine Learning Needs
Speed: Fast data processing without heavy operations cost
Ease of Use: High-level abstraction functions
Features: A wide range of functionality
Flexibility: Change is constant – code, deployment, data sources, algorithms, …
Open Architecture: Being able to connect with a lot of different technologies
5. Advantages of Python
Broad Utility: Many functionalities - one of the most broadly useful programming languages.
Flexibility: Many right paths to do things, a lot of freedom, works on many platforms.
Ease of Use: High level of abstraction makes Python one of the easiest programming languages.
Strong Community: Most data scientists have mastered Python. Many useful packages (pandas, scikit-learn, …)
6. Python Uses & Challenges
Python is great for …
Predictive Maintenance
Ensuring Quality of Service
Proactive Sales
New Products & Markets
Differentiation
A/B Testing
Marketing behaviors and click analysis
… Data Science
Python has challenges with:
Performance with big data
- Global interpreter lock
- CPU Thread management
- Access to data in multiple nodes
- Methods for efficiently accessing data (indexing and data optimization)
- Concurrency
7. End-to-End Machine Learning Process
Business Understanding → Data Analysis → Data Preparation → Modeling → Evaluation → Deployment
8. End-to-End Machine Learning Process
Business Understanding → Data Analysis → Data Preparation → Modeling → Evaluation → Deployment
9. Challenges of Machine Learning at Scale
The need for speed at reasonable cost
Not easy to move big data around
Sub-sampling can compromise accuracy
10. Challenges of Machine Learning at Scale
Sub-sampling can compromise accuracy
Work with all of your data
11. Sampling vs. Full Dataset
Source: https://towardsdatascience.com/breaking-the-curse-of-small-datasets-in-machine-learning-part-1-36f28b0c044d
Data usually matters more than algorithms for complex problems
Small data sets usually lack generalization and are prone to over-fitting
Large datasets result in better model generalization
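One narrow way to see why sub-sampling can cost accuracy is to look at how the noise in a simple estimate shrinks as the row count grows. The sketch below uses the textbook standard error of a proportion; the 1% event rate and both row counts are hypothetical numbers chosen for illustration, and the example covers estimate precision only, not model generalization in general.

```python
import math


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent rows."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical numbers: an event that occurs in 1% of records.
EVENT_RATE = 0.01
FULL_ROWS = 1_000_000   # assumed size of the full dataset
SAMPLE_ROWS = 1_000     # assumed size of a down-sampled extract

se_full = proportion_standard_error(EVENT_RATE, FULL_ROWS)
se_sample = proportion_standard_error(EVENT_RATE, SAMPLE_ROWS)

print(f"standard error, full data:  {se_full:.6f}")
print(f"standard error, sub-sample: {se_sample:.6f}")
print(f"sub-sample estimate is {se_sample / se_full:.1f}x noisier")
```

The noise ratio depends only on the square root of the row-count ratio, so a 1,000x smaller sample is roughly 32x noisier for this kind of estimate.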
12. Challenges of Machine Learning at Scale
Not easy to move big data around
Bring models to the data
13. Bring Data to the Model
Slow
Data transfer is the bottleneck – fighting inertia
The need to downsample reduces accuracy
Results are not where you need them to interact with production systems
Data Has Gravity
14. Bring the Model to the Data
Fast!
Ease of integration with production systems
Parallelized
Data stays where it is – security, provenance
Model management in the database
Data Has Gravity
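The contrast between moving data to the computation and moving the computation to the data can be sketched in a few lines. This is a minimal illustration only: Python's built-in sqlite3 module stands in for the database so the example runs anywhere, it is not an MPP database, and the table and column names are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in so the sketch is runnable; it is NOT an MPP database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (device_id INTEGER, temperature REAL)")
rows = [(i % 4, 20.0 + (i % 10)) for i in range(10_000)]
conn.executemany("INSERT INTO sensor_readings VALUES (?, ?)", rows)


def average_by_moving_data(connection):
    """Pull every row to the client, then aggregate in Python."""
    totals = {}
    cursor = connection.execute(
        "SELECT device_id, temperature FROM sensor_readings"
    )
    for device_id, temperature in cursor:
        count, total = totals.get(device_id, (0, 0.0))
        totals[device_id] = (count + 1, total + temperature)
    return {d: total / count for d, (count, total) in totals.items()}


def average_in_database(connection):
    """Let the database aggregate; only one row per device is returned."""
    query = (
        "SELECT device_id, AVG(temperature) "
        "FROM sensor_readings GROUP BY device_id"
    )
    return dict(connection.execute(query).fetchall())


moved = average_by_moving_data(conn)
in_place = average_in_database(conn)
print(moved)
print(in_place)
```

Both functions return the same averages, but the first transfers 10,000 rows to the client while the second transfers four.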
15. Challenges of Machine Learning at Scale
The need for speed at reasonable cost
Pick (the right) scaling architecture
17. Advantages of MPP Analytical Database
MPP Scale: Clusters with no name node or other single point of failure allow unlimited scale
Speed and Concurrency: Query optimization and resource management across multiple nodes
Features: ML algorithm parallelization, moving windows, geospatial analysis, time series joins, fast data prep...
Open Architecture: Integration with many other applications - BI, ETL, Kafka, Spark, Data Science Labs …
18. High Performance + High Concurrency
Get data quickly enough to act on it.
Explore data interactively.
Enable everyone to make their own data-driven decisions.
Scale Data Volumes, Scale Users
SQL Database + Analytics & ML + Query Engine
19. Advantages of Python + MPP Analytical Database
MPP Scale: Clusters with no name node or other single point of failure allow unlimited scale
Speed and Concurrency: Query optimization and resource management across multiple nodes
Features: ML algorithm parallelization, moving windows, geospatial, time series, fast data prep...
Open Architecture: Integration with many other applications - BI, ETL, Kafka, Spark, Data Science Labs …
Broad Utility: Many functionalities - one of the most broadly useful programming languages.
Flexibility: Many right paths to do things, a lot of freedom, works on many platforms.
Ease of Use: High level of abstraction makes Python one of the easiest programming languages.
Strong Community: Most data scientists have mastered Python. Many useful packages (pandas, scikit-learn, …)
20. Parallelization
Predicting and scoring on multiple nodes
Python models get copied to all nodes
Different portions of data are processed simultaneously
Result: Fast response
Node 1: Data | Node 2: Data | Node 3: Data
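The scoring pattern on this slide (same model on every node, each node working on its own portion of the data) can be sketched as follows. This is a toy illustration under stated assumptions: worker threads stand in for database nodes, the linear `score` function is a hypothetical stand-in for a trained model, and because of the global interpreter lock these threads show the partitioning pattern rather than real CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def score(row):
    """Hypothetical model: a fixed linear scorer standing in for a trained model."""
    temperature, vibration = row
    return 0.5 * temperature + 2.0 * vibration


def score_partition(partition):
    """Each 'node' scores only the rows it holds, using its own copy of the model."""
    return [score(row) for row in partition]


def split_into_partitions(rows, node_count):
    """Split rows into contiguous chunks, one per node."""
    size = -(-len(rows) // node_count)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


rows = [(float(i), float(i % 3)) for i in range(12)]
partitions = split_into_partitions(rows, node_count=3)

# Threads are a stand-in for nodes; pool.map keeps partition order.
with ThreadPoolExecutor(max_workers=3) as pool:
    partition_scores = list(pool.map(score_partition, partitions))

parallel_scores = [s for part in partition_scores for s in part]
print(parallel_scores)
```

Because every partition is scored independently, the combined result matches what a single pass over all rows would produce.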
21. Built-In Statistical and Quality Functions
Business Understanding → Data Exploration → Data Preparation → Modeling → Evaluation → Deployment
Parallel Machine Learning Algorithms, Speed, ANSI SQL, Scalability, Parallel Data Preparation, Deploy Anywhere
Outlier Detection, Normalization, Imbalanced Data Processing, Sampling, Missing Value Imputation, And More…
Pattern Matching, Date/Time Algebra, Window/Partition, Date Type Handling, Sequences, Sessionize, Time Series, Statistical Summary, And More…
SQL
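Two of the preparation steps named on this slide, missing value imputation and normalization, can be expressed directly in SQL so the data never leaves the database. The sketch below is an assumption-laden illustration: it uses plain SQL through Python's sqlite3 module as a runnable stand-in, not the database's own built-in preparation functions, and the `readings` table and its columns are hypothetical.

```python
import sqlite3

# sqlite3 is used only so the sketch runs anywhere; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, pressure REAL)")
conn.executemany(
    "INSERT INTO readings (id, pressure) VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 30.0), (4, 20.0)],
)

# Imputation: replace NULL with the column mean (AVG ignores NULLs).
# Normalization: min-max scale the imputed value into the range [0, 1].
PREP_QUERY = """
WITH stats AS (
    SELECT AVG(pressure) AS mean_p,
           MIN(pressure) AS min_p,
           MAX(pressure) AS max_p
    FROM readings
)
SELECT r.id,
       COALESCE(r.pressure, s.mean_p) AS pressure_imputed,
       (COALESCE(r.pressure, s.mean_p) - s.min_p) / (s.max_p - s.min_p)
           AS pressure_scaled
FROM readings AS r CROSS JOIN stats AS s
ORDER BY r.id
"""

prepared = conn.execute(PREP_QUERY).fetchall()
for row in prepared:
    print(row)
```

The client receives only the prepared rows; the statistics used for imputation and scaling are computed where the data lives.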
22. Automate Model Training and Validation
SVM, Random Forests, Logistic Regression, Linear Regression, Ridge Regression, Naive Bayes, Cross Validation, And More…
Model-level Stats, ROC Tables, Error Rate, Lift Table, Confusion Matrix, R-Squared, MSE
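For readers less familiar with the evaluation outputs named on this slide, the sketch below computes four of them (confusion matrix, error rate, MSE, R-squared) from their standard definitions in plain Python. It illustrates what the metrics mean, not how the database computes them; the labels and predictions are made-up example data.

```python
def confusion_matrix(actual, predicted):
    """Counts for a binary classifier with labels 0 and 1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def error_rate(actual, predicted):
    """Fraction of predictions that differ from the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 minus residual sum of squares over total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up classification example.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
guesses = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(labels, guesses), error_rate(labels, guesses))

# Made-up regression example.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]
print(mean_squared_error(y_true, y_pred), r_squared(y_true, y_pred))
```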
23. Manage Model Life Cycle
In-Database Scoring: Speed, Scale, Security
24. Bring your R, TensorFlow, and Python code inside the database – analyze the data in place.
https://github.com/vertica/vertica-python
https://github.com/vertica/Vertica-ML-Python
25. Huge improvements in stability and performance after moving to Vertica
24 mins on Spark, 3 mins in Vertica
Can incorporate other data like weather to optimize predictive thermostat efficiency after moving to Vertica ML
Citing speed of analytics, ease of use when coding in SQL, and improvements in the accuracy of models after moving workloads to Vertica ML
Solving issues that were previously unsolvable
Minimal hardware, software, and personnel investments when differentiating with data science.
26. Thank you!
Learn More: academy.vertica.com
Try it Free: vertica.com/try
Paige Roberts
Open Source Relations Manager
E: Paige.Roberts@microfocus.com
28. Advantages of In-Database Machine Learning
• Eliminate overhead of data transfer
• Keep data secure with clear provenance
• Store and manage models and data together
• Serve hundreds of concurrent users
• Use highly scalable, high performance machine learning functionalities
• Avoid maintenance cost of a separate analytical system
• Increase productivity with simple SQL calls instead of coding everything
• Prep data faster
Node 1, Node 2, Node 3 (connected by network): each holds Schema, Tables, Models
29. Benefits of In-database Machine Learning
In-database machine learning transforms the way data scientists and analysts interact with data
Scale, Speed, Accuracy
Empower more users within your organization to leverage machine learning and increase data scientist productivity with a simple SQL interface
You need massively parallel processing power to build and train models at the speed of business
Run machine learning models based on all your historical data, not just a subset of down-sampled data
Democratized predictive analytics applications
Faster time to market for machine learning projects
Deploy predictive use cases and stay ahead
30. Simple SQL Execution
Put the power of predictive analytics in the hands of more analysts and database users
With Vertica, users can create, train and deploy machine learning models using simple SQL calls, at massive scale
Linear Regression, Logistic Regression, K-Means Clustering, Random Forest, Naive Bayes, Support Vector Machines
SQL
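A rough sketch of what "simple SQL calls" from Python might look like with the vertica-python driver follows. Treat it as an outline under assumptions rather than the presenter's code: `LINEAR_REG` and `PREDICT_LINEAR_REG` are Vertica's in-database linear regression functions, while the model, table, and column names and the contents of `conn_info` are hypothetical. The SQL builders run without a database; `train_and_score` needs the vertica-python package and a reachable cluster.

```python
def build_training_sql(model_name, table, response, predictors):
    """SQL that trains a linear regression model inside the database.

    Names are interpolated directly, so pass only trusted constants here.
    """
    predictor_list = ", ".join(predictors)
    return (
        f"SELECT LINEAR_REG('{model_name}', '{table}', "
        f"'{response}', '{predictor_list}')"
    )


def build_scoring_sql(model_name, table, predictors):
    """SQL that scores every row of a table with the stored model."""
    predictor_list = ", ".join(predictors)
    return (
        f"SELECT {predictor_list}, "
        f"PREDICT_LINEAR_REG({predictor_list} "
        f"USING PARAMETERS model_name='{model_name}') AS prediction "
        f"FROM {table}"
    )


def train_and_score(conn_info, model_name, table, response, predictors):
    """Run both statements through vertica-python (requires a Vertica cluster)."""
    import vertica_python  # lazy import: the builders above work without the driver

    with vertica_python.connect(**conn_info) as connection:
        cursor = connection.cursor()
        cursor.execute(build_training_sql(model_name, table, response, predictors))
        cursor.execute(build_scoring_sql(model_name, table, predictors))
        return cursor.fetchall()


# Hypothetical names, shown only to illustrate the generated SQL.
print(build_training_sql("tower_model", "cooling_towers", "failure_risk",
                         ["temperature", "vibration"]))
print(build_scoring_sql("tower_model", "cooling_towers",
                        ["temperature", "vibration"]))
```

Keeping the SQL in small builder functions makes the statements easy to inspect and test before they are sent to a live cluster.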
31. An Open Architecture with a Rich Ecosystem
Python, SQL, C++, R, Java
Geospatial, Time Series, Event Series, Real-time, Machine Learning, Text Analytics, Regression, Pattern Matching
User-Defined Storage, User-Defined Functions, User-Defined Loads
Security, External Tables: Analyze in Place, Data Transformation, Streaming, ETL, BI & Visualization
ODBC, JDBC, OLEDB, S3
32. The Vertica Analytics Platform
Native High Availability, Standard SQL Interface, Column Orientation, Machine Learning, Advanced Compression, Massively Parallel Processing (MPP)
Leverages BI, ETL, Hadoop/MapReduce and OLTP investments
No disk I/O bottleneck, simultaneously load & query
Native DB-aware clustering on low-cost x86 Linux nodes
Built-in redundancy that also speeds up queries
In-database machine learning functions for predictive analytics at scale
Up to 90% space reduction using 10+ algorithms
10-50x faster than legacy databases
Scales from TB to PB with industry-standard hardware
Simple integration with existing ETL and BI solutions
SQL-99+ compliant
Ultimate deployment flexibility
Extended analytics
In-database machine learning
24/7 Load & Query
34. Predictive Maintenance Demo
Analyze sensor data from cooling towers across the US, enabling equipment manufacturers to predict and prevent equipment failure
35. Flight Tracker Demo
Vertica operates at the “edge” with flight track detail. Sensor data is collected using a Raspberry Pi with radio receiver and antenna. Data is loaded into Vertica at thousands of records per second and builds to billions of flight data points collected within a 250-mile radius.
https://www.vertica.com/blog/blog-post-series-using-vertica-track-commercial-aircraft-near-real-time/
37. Moving data science workloads from Spark on Hadoop to in-database
Improvements in stability and performance
Creating customer segmentation via clustering algorithms on a 15 million customer dataset took 24 mins on Spark - 3 mins in database
Concurrently running other algorithms without performance impact
Cardlytics partners with more than 1,500 financial institutions to run their online and mobile banking rewards programs, which gives us a robust view into where and when consumers are spending their money.
38. Fidelis Cybersecurity protects the world's most sensitive data by identifying and removing attackers no matter where they're hiding on your network and endpoints.
Data science team was experiencing challenges with performance while using Spark ML
Moving workloads from Spark ML to in-database ML provided:
Speed of analytics
Ease of use when coding in SQL
Increased accuracy of models
39. Some Vertica IoT Customer Resources
Case Studies
Anritsu ROI case study: https://www.vertica.com/wp-content/uploads/2017/01/r24-HPE-Vertica-ROI-case-study-Anritsu.pdf
Infographic of ROI: https://www.vertica.com/wp-content/uploads/2017/03/Anritsu-v2.pdf
Nimble Storage ROI case study: https://www.vertica.com/wp-content/uploads/2017/08/Nimble-Storage-ROI.pdf
Optimal+ case study: https://www.vertica.com/wp-content/uploads/2017/06/Optimal-MF-rebrand-FINAL-lo-res.pdf
*Climate Corp case study: https://www.vertica.com/wp-content/uploads/2019/01/Climate-Corp_Success-Story-FINAL.pdf
Webcasts – Data Disruptors
Philips: https://www.brighttalk.com/webcast/10477/277693
Climate Corp: https://www.brighttalk.com/webcast/8913/336201
Nimble Storage (HPE InfoBright): https://www.brighttalk.com/webcast/8913/330769
Zebrium: https://www.brighttalk.com/webcast/8913/332838
Simpli.fi: https://www.brighttalk.com/webcast/8913/354325/simpli-fi-delivers-advertising-insights-on-billions-of-streaming-bid-messages
Videos
Optimal+: https://www.youtube.com/watch?v=IZkkoy5ZT1M&feature=youtu.be
Anritsu: https://www.youtube.com/watch?v=QZ5vWqblVXU&feature=youtu.be
40. Try Vertica
• 3 Easy ways to try Vertica (https://www.vertica.com/try/)
o Get Started in Minutes with Vertica by the Hour from AWS Marketplace, Google Cloud or Microsoft Azure
o Free Community Edition (for up to 1TB and 3-node cluster)
o Vertica Start-Up Accelerator Program (Free 1-year term, 25 TB license)
vertica.com/try