Python + MPP Database = Large Scale AI/ML Projects in Production Faster

Python + MPP Database = In Production Faster
Paige Roberts
Open Source Relations Manager, Vertica

https://www.brighttalk.com/webcast/8913/351928
Mauro Barbieri, Senior Scientist at Philips

[Architecture diagram. Labels: Data Sources (SQL Server; Philips Remote Service Network; Teradata with Salesforce and SAP data); Distributed Pub/Sub System; ETL (Extract, Transform, Load); Large-Scale Storage; MPP Analytics / Machine Learning; Visualization / Reporting / Application; Batch; Low Latency.]

Production Machine Learning Needs
Speed: Fast data processing without heavy operations cost
Ease of Use: High-level abstraction functions
Features: A wide range of functionality
Flexibility: Change is constant (code, deployment, data sources, algorithms, …)
Open Architecture: Able to connect with many different technologies

Advantages of Python
Broad Utility: Many functionalities; one of the most broadly useful programming languages.
Flexibility: Many right ways to do things, a lot of freedom, works on many platforms.
Ease of Use: A high level of abstraction makes Python one of the easiest programming languages.
Strong Community: Most data scientists have mastered Python. Many useful packages (pandas, scikit, …)

Python Uses & Challenges
Python is great for data science:
• Predictive Maintenance
• Ensuring Quality of Service
• Proactive Sales
• New Products & Markets
• Differentiation
• A/B Testing
• Marketing behaviors and click analysis
Python has challenges with:
• Performance with big data
- Global interpreter lock
- CPU thread management
- Access to data in multiple nodes
- Methods for efficiently accessing data (indexing and data optimization)
- Concurrency

End-to-End Machine Learning Process
Business Understanding → Data Analysis → Data Preparation → Modeling → Evaluation → Deployment

Challenges of Machine Learning at Scale
• The need for speed at reasonable cost
• Not easy to move big data around
• Sub-sampling can compromise accuracy

Sub-sampling can compromise accuracy
Work with all of your data

Sampling vs. Full Dataset
Source: https://towardsdatascience.com/breaking-the-curse-of-small-datasets-in-machine-learning-part-1-36f28b0c044d
• Data usually matters more than algorithms for complex problems
• Small data sets usually lack generalization and are prone to over-fitting
• Large datasets result in better model generalization
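
One way sub-sampling can hurt accuracy is with rare events, such as equipment failures. The sketch below is a hypothetical illustration: the dataset size, failure count, and sample size are made-up numbers, not figures from the talk. It shows how few rare events a small random sample can be expected to contain.

```python
import random

# Hypothetical dataset: 100,000 sensor readings, of which 100 are failures.
# These numbers are illustrative assumptions, not figures from the talk.
TOTAL_ROWS = 100_000
FAILURES = 100
SAMPLE_SIZE = 500  # a 0.5% sub-sample

labels = [1] * FAILURES + [0] * (TOTAL_ROWS - FAILURES)

rng = random.Random(42)  # fixed seed so the run is repeatable
sample = rng.sample(labels, SAMPLE_SIZE)

full_failures = sum(labels)
sample_failures = sum(sample)
# On average, the sample holds the same fraction of failures as the full data.
expected_in_sample = FAILURES * SAMPLE_SIZE / TOTAL_ROWS

print(f"failures in full dataset:    {full_failures}")
print(f"failures in sample:          {sample_failures}")
print(f"expected failures in sample: {expected_in_sample}")
```

With less than one failure expected in the sample, a model trained on it has almost nothing to learn the failure pattern from, while the full dataset still holds all 100 examples.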

Not easy to move big data around
Bring models to the data

Bring Data to the Model
Slow
• Data transfer is the bottleneck: fighting inertia
• Need to downsample reduces accuracy
• Results are not where you need them to interact with production systems
Data Has Gravity

Bring the Model to the Data
Fast!
• Ease of integration with production systems
• Parallelized
• Data stays where it is: security, provenance
• Model management in the database
Data Has Gravity
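
The contrast between the two approaches can be sketched in a few lines. This is a minimal illustration under stated assumptions: Python's built-in sqlite3 module is used only as a stand-in so the example runs anywhere (it is a single-node embedded database, not an MPP system), and the table, columns, and scoring formula are hypothetical.

```python
import sqlite3

# sqlite3 is only a stand-in so the sketch runs anywhere; it is not an MPP
# database. The table, columns, and scoring formula are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, temp REAL, vibration REAL)"
)
conn.executemany(
    "INSERT INTO readings (temp, vibration) VALUES (?, ?)",
    [(20.0 + i % 50, 0.1 * (i % 10)) for i in range(10_000)],
)

def failure_score(temp, vibration):
    """Toy 'model': a hand-written linear score standing in for a trained model."""
    return 0.02 * temp + 0.5 * vibration

# Bring the data to the model: every row crosses the database boundary.
rows = conn.execute("SELECT temp, vibration FROM readings").fetchall()
high_risk_pulled = sum(1 for temp, vib in rows if failure_score(temp, vib) > 1.5)

# Bring the model to the data: register the function inside the database
# and return only the answer.
conn.create_function("failure_score", 2, failure_score)
(high_risk_in_db,) = conn.execute(
    "SELECT COUNT(*) FROM readings WHERE failure_score(temp, vibration) > 1.5"
).fetchone()

print(f"rows transferred when pulling data:     {len(rows)}")
print("rows transferred when scoring in place: 1")
print(f"high-risk readings (both approaches):   {high_risk_pulled}, {high_risk_in_db}")
```

Both approaches give the same count, but the second moves one row instead of ten thousand. At production data volumes that difference in transfer is the "gravity" the slides refer to.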

The need for speed at reasonable cost
Pick (the right) scaling architecture

[Architecture diagram. Labels: Data Sources; RDBMS (MySQL, PostgreSQL, …); Cassandra; Key/Value DB; Distributed Pub/Sub System; Ingestion; EL (Extract, Load); Batch; Low Latency; Large-Scale Storage; Recent Data; Schema Enforced; ETL (Flattened, Modeled Tables); ETL (Extract, Transform, Load); Hive, Spark, Presto, Notebooks; MPP Analytics / Machine Learning; Visualization / Reporting / Application; Applications (ETL/Modeling, CityOps, Machine Learning, Experiments); Ad Hoc Analytics (CityOps, Data Scientists).]

Advantages of MPP Analytical Database
MPP Scale: Clusters with no name node or other single point of failure allow unlimited scale
Speed and Concurrency: Query optimization and resource management across multiple nodes
Features: ML algorithm parallelization, moving windows, geospatial analysis, time series joins, fast data prep...
Open Architecture: Integration with many other applications (BI, ETL, Kafka, Spark, Data Science Labs, …)

High Performance + High Concurrency
Get data quickly enough to act on it.
Explore data interactively.
Enable everyone to make their own data-driven decisions.
Scale Data Volumes + Scale Users
SQL Database + Analytics & ML Query Engine

Advantages of Python + MPP Analytical Database
MPP Scale: Clusters with no name node or other single point of failure allow unlimited scale
Speed and Concurrency: Query optimization and resource management across multiple nodes
Features: ML algorithm parallelization, moving windows, geospatial, time series, fast data prep...
Open Architecture: Integration with many other applications (BI, ETL, Kafka, Spark, Data Science Labs, …)
Broad Utility: Many functionalities; one of the most broadly useful programming languages.
Flexibility: Many right ways to do things, a lot of freedom, works on many platforms.
Ease of Use: A high level of abstraction makes Python one of the easiest programming languages.
Strong Community: Most data scientists have mastered Python. Many useful packages (pandas, scikit, …)

Parallelization
Predicting and scoring on multiple nodes
• Python models get copied to all nodes
• Different portions of data are processed simultaneously
• Result: Fast response
[Diagram: Node 1, Node 2, and Node 3, each holding its own data]
Built-In Statistical and Quality Functions
[Diagram: process stages (Business Understanding, Data Exploration, Data) supported by in-database SQL functions]
Qualities: Speed, ANSI SQL, Scalability, Parallel Data Preparation, Parallel Machine Learning Algorithms, Deploy Anywhere
Functions: Outlier Detection, Normalization, Imbalanced Data Processing, Sampling, Missing Value Imputation, and more
Functions: Pattern Matching, Date/Time Algebra, Window/Partition, Date Type Handling, Sequences, Sessionize, Time Series, Statistical Summary, and more
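
Three of the preparation steps named on this slide (missing value imputation, normalization, and outlier detection) are sketched below in plain Python to show what each one does. The sensor values and the threshold of 2 standard deviations are hypothetical, and this is not how the database implements these functions; in-database versions run the same kind of logic in parallel over the stored data.

```python
from statistics import mean, pstdev

# Hypothetical sensor column with one missing value (None) and one outlier.
raw = [10.0, 12.0, None, 11.0, 13.0, 95.0, 12.0]

# Missing value imputation: replace None with the mean of the observed values.
observed = [v for v in raw if v is not None]
fill = mean(observed)
imputed = [fill if v is None else v for v in raw]

# Normalization: z-score each value (subtract the mean, divide by the std dev).
mu, sigma = mean(imputed), pstdev(imputed)
z_scores = [(v - mu) / sigma for v in imputed]

# Outlier detection: flag values more than 2 standard deviations from the mean.
outliers = [v for v, z in zip(imputed, z_scores) if abs(z) > 2.0]

print(f"imputed value: {fill}")
print(f"outliers:      {outliers}")
```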

Automate Model Training and Validation
Model training: SVM, Random Forests, Logistic Regression, Linear Regression, Ridge Regression, Naive Bayes, Cross Validation, and more
Model validation: Model-level Stats, ROC Tables, Error Rate, Lift Table, Confusion Matrix, R-Squared, MSE
Manage Model Life Cycle
In-Database Scoring: Speed, Scale, Security

Bring your R, TensorFlow, and Python code inside the database and analyze the data in place.
https://github.com/vertica/vertica-python
https://github.com/vertica/Vertica-ML-Python

• Huge improvements in stability and performance after moving to Vertica
• 24 mins on Spark, 3 mins in Vertica
• Can incorporate other data like weather to optimize predictive thermostat efficiency after moving to Vertica ML
• Citing speed of analytics, ease of use when coding in SQL, and improvements in the accuracy of models after moving workloads to Vertica ML
• Solving issues that were previously unsolvable
• Minimal hardware, software, and personnel investments when differentiating with data science.

Thank you!
Learn More: academy.vertica.com
Try it Free: vertica.com/try
Paige Roberts
Open Source Relations Manager
E: Paige.Roberts@microfocus.com

Advantages of In-Database Machine Learning
• Eliminate overhead of data transfer
• Keep data secure with clear provenance
• Store and manage models and data together
• Serve hundreds of concurrent users
• Use highly scalable, high performance machine learning functionalities
• Avoid maintenance cost of a separate analytical system
• Increase productivity with simple SQL calls instead of coding everything
• Prep data faster
[Diagram: Node 1, Node 2, and Node 3 connected by a network, each holding Schema, Tables, and Models]

Benefits of In-database Machine Learning
In-database machine learning transforms the way data scientists and analysts interact with data
Scale: Empower more users within your organization to leverage machine learning and increase data scientist productivity with a simple SQL interface. Result: Democratized predictive analytics applications.
Speed: You need massively parallel processing power to build and train models at the speed of business. Result: Faster time to market for machine learning projects.
Accuracy: Run machine learning models based on all your historical data, not just a subset of down-sampled data. Result: Deploy predictive use cases and stay ahead.

Simple SQL Execution
Put the power of predictive analytics in the hands of more analysts and database users
With Vertica, users can create, train and deploy machine learning models using simple SQL calls, at massive scale
SQL algorithms: Linear Regression, Logistic Regression, K-Means Clustering, Random Forest, Naive Bayes, Support Vector Machines
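
As a rough sketch of what "simple SQL calls" can look like from Python, the block below builds a training statement and a scoring statement for linear regression. The table, column, and model names are hypothetical. The function names LINEAR_REG and PREDICT_LINEAR_REG and their argument order are taken from Vertica's SQL reference as best understood here and should be checked against the documentation for your version. The statements are only built and printed; the helper that would run them through the vertica-python client is defined but not called, since it needs a live database.

```python
# All table, column, and model names are hypothetical.
MODEL = "tower_failure_model"
TRAIN_TABLE = "cooling_tower_train"
SCORE_TABLE = "cooling_tower_new"
RESPONSE = "hours_to_failure"
PREDICTORS = ["temperature", "vibration", "pressure"]

predictor_list = ", ".join(PREDICTORS)

# Train a model inside the database (assumed Vertica function: LINEAR_REG).
train_sql = (
    f"SELECT LINEAR_REG('{MODEL}', '{TRAIN_TABLE}', "
    f"'{RESPONSE}', '{predictor_list}');"
)

# Score new rows in place (assumed Vertica function: PREDICT_LINEAR_REG).
predict_sql = (
    f"SELECT tower_id, PREDICT_LINEAR_REG({predictor_list} "
    f"USING PARAMETERS model_name='{MODEL}') AS predicted "
    f"FROM {SCORE_TABLE};"
)

def run_in_vertica(conn_info, statements):
    """Run statements with the vertica-python client (needs a live database)."""
    import vertica_python  # imported here so the sketch runs without the package

    with vertica_python.connect(**conn_info) as conn:
        cur = conn.cursor()
        for statement in statements:
            cur.execute(statement)
        return cur.fetchall()

print(train_sql)
print(predict_sql)
```

To execute the statements, `run_in_vertica` would be called with a connection dictionary (host, port, user, password, database) and the list `[train_sql, predict_sql]`.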

An Open Architecture with a Rich Ecosystem
[Diagram: Vertica ecosystem. Labels: Python; SQL; C++; R; Java; ODBC; JDBC; OLEDB; S3; Geospatial; Time Series; Event Series; Real-time; Pattern Matching; Regression; Machine Learning; Text Analytics; Data Transformation; Security; External Tables: Analyze in Place; User-Defined Storage; User-Defined Functions; User-Defined Loads; Streaming; ETL; BI & Visualization.]

The Vertica Analytics Platform
Native High Availability: Built-in redundancy that also speeds up queries
Standard SQL Interface: Leverages BI, ETL, Hadoop/MapReduce and OLTP investments
Column Orientation: No disk I/O bottleneck; simultaneously load & query
Machine Learning: In-database machine learning functions for predictive analytics at scale
Advanced Compression: Up to 90% space reduction using 10+ algorithms
Massively Parallel Processing (MPP): Native DB-aware clustering on low-cost x86 Linux nodes
• 10-50x faster than legacy databases
• Scales from TB to PB with industry-standard hardware
• Simple integration with existing ETL and BI solutions
• SQL-99+ compliant
• Ultimate deployment flexibility
• Extended analytics
• In-database machine learning
• 24/7 Load & Query

Predictive Maintenance Demo
Analyze sensor data from cooling towers across the US, enabling equipment manufacturers to predict and prevent equipment failure

Flight Tracker Demo
Vertica operates at the “edge” with flight track detail. Sensor data is collected using a Raspberry Pi with a radio receiver and antenna. Data is loaded into Vertica at thousands of records per second and builds to billions of flight data points collected within a 250-mile radius.
https://www.vertica.com/blog/blog-post-series-using-vertica-track-commercial-aircraft-near-real-time/

Moving data science workloads from Spark on Hadoop to in-database
• Improvements in stability and performance
• Creating customer segmentation via clustering algorithms on a 15 million customer dataset took 24 mins on Spark, 3 mins in database
• Concurrently running other algorithms without performance impact
Cardlytics partners with more than 1,500 financial institutions to run their online and mobile banking rewards programs, which gives us a robust view into where and when consumers are spending their money.

Fidelis Cybersecurity protects the world's most sensitive data by identifying and removing attackers no matter where they're hiding on your network and endpoints.
Data science team was experiencing challenges with performance while using Spark ML
Moving workloads from Spark ML to in-database ML provided:
• Speed of analytics
• Ease of use when coding in SQL
• Increased accuracy of models

Some Vertica IoT Customer Resources
Case Studies
• Anritsu ROI case study: https://www.vertica.com/wp-content/uploads/2017/01/r24-HPE-Vertica-ROI-case-study-Anritsu.pdf
• Infographic of ROI: https://www.vertica.com/wp-content/uploads/2017/03/Anritsu-v2.pdf
• Nimble Storage ROI case study: https://www.vertica.com/wp-content/uploads/2017/08/Nimble-Storage-ROI.pdf
• Optimal+ case study: https://www.vertica.com/wp-content/uploads/2017/06/Optimal-MF-rebrand-FINAL-lo-res.pdf
• *Climate Corp case study: https://www.vertica.com/wp-content/uploads/2019/01/Climate-Corp_Success-Story-FINAL.pdf
Webcasts – Data Disruptors
• Philips: https://www.brighttalk.com/webcast/10477/277693
• Climate Corp: https://www.brighttalk.com/webcast/8913/336201
• Nimble Storage (HPE InfoBright): https://www.brighttalk.com/webcast/8913/330769
• Zebrium: https://www.brighttalk.com/webcast/8913/332838
• Simpli.fi: https://www.brighttalk.com/webcast/8913/354325/simpli-fi-delivers-advertising-insights-on-billions-of-streaming-bid-messages
Videos
• Optimal+: https://www.youtube.com/watch?v=IZkkoy5ZT1M&feature=youtu.be
• Anritsu: https://www.youtube.com/watch?v=QZ5vWqblVXU&feature=youtu.be

Try Vertica
• 3 easy ways to try Vertica (https://www.vertica.com/try/)
o Get started in minutes with Vertica by the Hour from AWS Marketplace, Google Cloud or Microsoft Azure
o Free Community Edition (for up to 1 TB and a 3-node cluster)
o Vertica Start-Up Accelerator Program (free 1-year term, 25 TB license)
