Continuous Machine and Deep Learning with Apache Ignite

Continuous Machine and Deep Learning at Scale With Apache Ignite
Denis Magda
Apache Ignite Committer & PMC Chair
@denismagda

2019 © GridGain Systems @denismagda @ApacheIgnite
Agenda
• Why Machine Learning at Scale?
• Ignite Machine Learning Intro
• TensorFlow Integration
• Ignite Machine Learning Internals
• Q&A

5-Minute Guide to Ignite:
Overview and Why Support ML?

Why Machine Learning at Scale?
• Scalability
– Data exceeds the capacity of a single server
– Burden for developers and the business
• Models trained and deployed in different systems
– Move data out for training
– Wait for training to complete
– Redeploy models in production

Continuous Learning Approach Without ETL
Before:
• App: storing and processing working set
• Periodic ETL of terabytes of data
• Loading data for training
• Model training & testing
• Periodic update of models
After (With CL):
• App and ML/DL Engine
• Storing and processing working set
• Model training & testing
• Instant updates of models
• No ETL

Apache Ignite Overview
• Application Layer: Web, SaaS, Social, Mobile, IoT
• Key-Value, SQL, Transactions, Messaging, Streaming, Events
• Machine and Deep Learning
• Compute Grid, Service Grid
• In-Memory Data Store
• Persistent Layer: Ignite Persistence, RDBMS, NoSQL, Hadoop, Mainframe

Ignite Deployment Modes
Enhance Legacy Architecture - IMDG
• Application Layer: Web-Scale Apps, Mobile Apps, IoT, Social Media
• Ignite In-Memory Storage
• External Database: RDBMS, NoSQL, Hadoop
Simplified Modern Architecture - IMDB
• Application Layer: Web-Scale Apps, Mobile Apps, IoT, Social Media
• Ignite In-Memory Storage
• Ignite Persistence

Ignite Machine Learning:
Slightly More Details

Ignite Machine and Deep Learning
Components:
• Java, .NET, C++, Python
• Binary Protocol (Thin client)
• Ignite Machine and Deep Learning: Regressions, K-Means, Decision Trees, TensorFlow
• Distributed Machine Learning Datasets
• Compute and Service Grid
• Ignite Persistence
Highlights:
• Distributed Algorithms
• Large Scale Parallelization
• Multi-language Support
• No ETL
• Distributed Dataset based on partitioned caches

Distributed Classification
• Logistic Regression
• SVM, KNN, ANN
• Decision trees
• Random Forest
• Naive Bayes

Distributed Regression
• KNN Regression
• Linear Regression
• Decision tree regression
• Random forest regression
• Gradient-boosted tree regression

Distributed Clustering
• K-means
• GMM

Multilayer Perceptron Neural Network

Ignite ML API Usage
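// Load the Titanic passengers into a partitioned Ignite cache.
IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
// Use columns 0, 5, 6 as features and column 1 as the label.
Vectorizer<Integer, Vector, Integer, Double> vectorizer = new DummyVectorizer<Integer>(0, 5, 6).labeled(1);
// Decision tree with max depth 5 and min impurity decrease 0.
DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());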
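As a rough illustration of what the vectorizer and the accuracy metric do in the snippet above, here is a plain-Python sketch that does not use Ignite. The helper names (`vectorize`, `accuracy`), the stand-in model, and the sample rows are hypothetical; only the column indices mirror the `(0, 5, 6).labeled(1)` call.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def vectorize(row: Sequence[float], feature_idx: Sequence[int],
              label_idx: int) -> Tuple[List[float], float]:
    """Pick the feature columns and the label column out of one raw row."""
    return [row[i] for i in feature_idx], row[label_idx]


def accuracy(rows: Dict[int, Sequence[float]],
             model: Callable[[List[float]], float],
             feature_idx: Sequence[int], label_idx: int) -> float:
    """Fraction of rows whose predicted label equals the stored label."""
    hits = 0
    for row in rows.values():
        features, label = vectorize(row, feature_idx, label_idx)
        if model(features) == label:
            hits += 1
    return hits / len(rows)


# Hypothetical rows: column 1 is the label; columns 0, 5 and 6 are features.
rows = {
    1: [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    2: [3.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    3: [3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2.0],
}


def model(features: List[float]) -> float:
    """Stand-in model: predict 1.0 when the first feature is below 2."""
    return 1.0 if features[0] < 2 else 0.0


acc = accuracy(rows, model, (0, 5, 6), 1)
```

In Ignite the same two steps run against the distributed cache rather than a local dictionary.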

Machine Learning Pipelines

Pipelining with Apache Ignite
IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
// Extracts "pclass", "sibsp", "parch", "sex", "embarked", "age", "fare".
Vectorizer<Integer, Vector, Integer, Double> vectorizer
    = new DummyVectorizer<Integer>(0, 3, 4, 5, 6, 8, 10).labeled(1);
PipelineMdl<Integer, Vector> mdl =
    new Pipeline<Integer, Vector, Integer, Double>()
        .addVectorizer(vectorizer)
        .addPreprocessingTrainer(new EncoderTrainer<Integer, Vector>()
            .withEncoderType(EncoderType.STRING_ENCODER)
            .withEncodedFeature(1)
            .withEncodedFeature(6))
        .addPreprocessingTrainer(new ImputerTrainer<Integer, Vector>())
        .addPreprocessingTrainer(new MinMaxScalerTrainer<Integer, Vector>())
        .addPreprocessingTrainer(new NormalizationTrainer<Integer, Vector>()
            .withP(1))
        .addTrainer(new DecisionTreeClassificationTrainer(5, 0))
        .fit(ignite, dataCache);
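To make the numeric preprocessing stages of the pipeline above more concrete, here is a minimal plain-Python sketch of imputing, min-max scaling, and p-norm normalization chained in the same order. It is not Ignite code: the function names and sample rows are hypothetical, mean imputation is an assumption about the imputer's strategy, and the string-encoding stage is left out.

```python
from typing import List, Optional


def impute_mean(rows: List[List[Optional[float]]]) -> List[List[float]]:
    """Replace missing values (None) with the mean of the present values in that column."""
    cols = len(rows[0])
    means = []
    for c in range(cols):
        present = [r[c] for r in rows if r[c] is not None]
        means.append(sum(present) / len(present))
    return [[means[c] if r[c] is None else r[c] for c in range(cols)] for r in rows]


def min_max_scale(rows: List[List[float]]) -> List[List[float]]:
    """Rescale every column to the [0, 1] range (constant columns become 0)."""
    cols = len(rows[0])
    lo = [min(r[c] for r in rows) for c in range(cols)]
    hi = [max(r[c] for r in rows) for c in range(cols)]
    return [[(r[c] - lo[c]) / (hi[c] - lo[c]) if hi[c] > lo[c] else 0.0
             for c in range(cols)] for r in rows]


def normalize_p(rows: List[List[float]], p: int = 1) -> List[List[float]]:
    """Divide each row by its p-norm; p=1 mirrors withP(1) in the pipeline."""
    out = []
    for r in rows:
        norm = sum(abs(v) ** p for v in r) ** (1.0 / p)
        out.append([v / norm if norm else 0.0 for v in r])
    return out


# Hypothetical rows with one missing value.
raw = [[1.0, 10.0], [None, 30.0], [3.0, 20.0]]
processed = normalize_p(min_max_scale(impute_mean(raw)), p=1)
```

Each stage consumes the output of the previous one, which is the same chaining idea the `Pipeline` builder expresses.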

Continuous Learning With Apache Ignite
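SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
// Train an initial model on the first cache.
SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
// Update the existing model with new data instead of retraining from scratch.
SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);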
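The fit-then-update pattern can be sketched without Ignite as follows. This is a toy linear SVM trained with stochastic gradient descent on the hinge loss; it is an assumption made for illustration, not a description of how `SVMLinearClassificationTrainer` works internally. All names, hyperparameters, and data are hypothetical. The point is only that `update` starts from the existing weights and sees just the new batch.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (features, label in {-1, +1})


def sgd_hinge(weights: Sequence[float], data: List[Sample],
              epochs: int = 20, lr: float = 0.1, reg: float = 0.01) -> List[float]:
    """Run SGD on the L2-regularized hinge loss, starting from the given weights."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            for i in range(len(w)):
                grad = reg * w[i] - (y * x[i] if margin < 1 else 0.0)
                w[i] -= lr * grad
    return w


def fit(data: List[Sample]) -> List[float]:
    """Train from scratch (all-zero initial weights)."""
    return sgd_hinge([0.0] * len(data[0][0]), data)


def update(model: Sequence[float], new_data: List[Sample]) -> List[float]:
    """Continue training an existing model on new data only."""
    return sgd_hinge(model, new_data)


def predict(model: Sequence[float], x: Sequence[float]) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(model, x)) >= 0 else -1


# The first feature is a constant 1.0 acting as a bias term.
batch1 = [([1.0, 2.0], 1), ([1.0, -2.0], -1)]
batch2 = [([1.0, 3.0], 1), ([1.0, -1.0], -1)]
mdl1 = fit(batch1)
mdl2 = update(mdl1, batch2)
```

Because `update` never revisits `batch1`, the cost of refreshing the model depends on the size of the new data, not on the full history.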

Demo:
Payments Fraud Detection

Ignite and TensorFlow

TensorFlow Integration: Benefits
• Ignite as distributed data source
– Perfect fit for distributed TF training
• Less ETL
– TF nodes deployed together with Ignite nodes
– In-machine data movement only

TensorFlow Integration: Main Features
• Distribution of user tasks written in Python
• Automatic creation and maintenance of TF cluster
• Minimization of ETL costs
• Fault tolerance for both Ignite and TF instances
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>> iterator = dataset.make_one_shot_iterator()
>>> next_obj = iterator.get_next()
>>>
>>> with tf.Session() as sess:
...     for _ in range(3):
...         print(sess.run(next_obj))
...
{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}

Ignite Machine Learning:
Internals

Distributed In-Memory Data Store
Ignite Memory-Centric Storage (Ignite Cluster):
• Predictable Memory Consumption
• Automatic Defragmentation
• Off-heap Removes Noticeable GC Pauses
Distributed Persistent Store (one Persistent Store per Server Node):
• Fully Transactional, WAL (Write Ahead Log)
• Instantaneous Restarts
• Stores Superset of Data

Record to Node Mapping
Key → Partition → Server Node
ON-DISK
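The two-step mapping on this slide (key to partition, then partition to server node) can be sketched in a few lines of plain Python. This is a simplified illustration and not Ignite's actual affinity function: the CRC32 hash, the partition count of 1024, the round-robin node assignment, and the key and node names are all assumptions made for the example.

```python
import zlib
from typing import List

PARTITIONS = 1024  # assumed partition count for this sketch


def partition_for(key: str, partitions: int = PARTITIONS) -> int:
    """Step 1: map a key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def node_for(partition: int, nodes: List[str]) -> str:
    """Step 2 (toy version): spread partitions over nodes round-robin."""
    return nodes[partition % len(nodes)]


nodes = ["node-1", "node-2", "node-3"]  # hypothetical node names
part = partition_for("passenger:42")    # hypothetical key
owner = node_for(part, nodes)
```

Because the key-to-partition step does not depend on the set of nodes, only the second step has to change when the cluster topology changes.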

Caches and Partitions
Cache:
• Partition 1: (K1, V1), (K2, V2), (K3, V3), (K4, V4)
• Partition 2: (K5, V5), (K6, V6), (K7, V7), (K8, V8), (K9, V9)

Partitions Distribution
[Diagram: partitions 0, 1, 2 and 3 spread across Node 1, Node 2, Node 3 and Node 4; each partition has a Primary copy and a Backup copy.]
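One way to picture primary and backup placement is rendezvous (highest-random-weight) hashing, the idea behind Ignite's default affinity function. The sketch below is a minimal stand-alone version under stated assumptions: the SHA-256 scoring, the node names, and the `assign` helper are hypothetical and do not reproduce Ignite's implementation.

```python
import hashlib
from typing import Dict, List, Tuple


def score(node: str, partition: int) -> int:
    """Deterministic pseudo-random weight for a (node, partition) pair."""
    digest = hashlib.sha256(f"{node}:{partition}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(partition: int, nodes: List[str], backups: int = 1) -> Tuple[str, List[str]]:
    """Rank nodes by score: the first is the primary, the next `backups` hold copies."""
    ranked = sorted(nodes, key=lambda n: score(n, partition), reverse=True)
    return ranked[0], ranked[1:1 + backups]


nodes = ["node-1", "node-2", "node-3", "node-4"]
layout: Dict[int, Tuple[str, List[str]]] = {p: assign(p, nodes) for p in range(4)}

# When a node leaves, only partitions that had a copy on it are reassigned.
survivors = [n for n in nodes if n != "node-3"]
new_layout = {p: assign(p, survivors) for p in range(4)}
```

The useful property is stability: a partition whose primary and backup were both on surviving nodes keeps exactly the same placement after the topology change.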

Partition-Based Dataset
• Node 1: P1, C, D → Training
• Node 2: P2, C, D → Training
• Client: provides the Initial solution and REDUCEs the training results
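The train-locally-then-reduce flow on this slide can be sketched as a map-reduce gradient descent. The example below is plain Python with in-process "partitions"; the linear model, learning rate, step count, and data are hypothetical and chosen only so the result is easy to check (the data follows y = 2x + 1).

```python
from typing import List, Tuple

Row = Tuple[float, float]  # (x, y)


def local_gradient(partition_data: List[Row], w: float, b: float) -> Tuple[float, float, int]:
    """Map step: squared-error gradient sums over one partition's rows."""
    gw = gb = 0.0
    for x, y in partition_data:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(partition_data)


def train(partitions: List[List[Row]], steps: int = 1000, lr: float = 0.1) -> Tuple[float, float]:
    w = b = 0.0  # initial solution chosen by the client
    for _ in range(steps):
        # Map: in a cluster, each call would run on the node that owns the partition.
        partials = [local_gradient(p, w, b) for p in partitions]
        # Reduce: the client sums the small partial results.
        gw = sum(p[0] for p in partials)
        gb = sum(p[1] for p in partials)
        n = sum(p[2] for p in partials)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Two "partitions" of a dataset generated from y = 2x + 1.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = train(partitions)
```

Only the three numbers returned by `local_gradient` travel to the client on each step, so the raw rows never leave their partition.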

Training Failover
[Diagram: Node 3 and Node 1; one holds P C D, the other holds P C D*]
P = Partition
C = Partition Context
D = Partition Data
D* = Local ETL
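The failover idea in the legend can be sketched as follows: the partition (P) and its context (C) survive a node failure, while the partition data (D) is a local, disposable copy that is rebuilt from P by a local ETL step (D*). The class below is a hypothetical single-process model of that behavior, not Ignite's dataset API.

```python
from typing import Any, Callable, Dict, List, Optional


class PartitionDataset:
    """One partition's training state: P (upstream rows), C (context), D (data)."""

    def __init__(self, upstream_rows: List[List[int]], context: Dict[str, Any]) -> None:
        self.upstream_rows = upstream_rows             # P: rows stored in the cache partition
        self.context = context                         # C: small state kept with the partition
        self.data: Optional[List[List[float]]] = None  # D: derived copy, local to one node
        self.etl_runs = 0

    def _local_etl(self) -> List[List[float]]:
        """Rebuild D from P on the current node (the D* step)."""
        self.etl_runs += 1
        return [[float(v) for v in row] for row in self.upstream_rows]

    def compute(self, fn: Callable[[Dict[str, Any], List[List[float]]], float]) -> float:
        """Run a training step, rebuilding D first if it is missing."""
        if self.data is None:
            self.data = self._local_etl()
        return fn(self.context, self.data)

    def fail_over(self) -> None:
        """Simulate moving to a backup node: P and C survive, D is lost."""
        self.data = None


def row_sum(ctx: Dict[str, Any], data: List[List[float]]) -> float:
    return sum(sum(row) for row in data)


ds = PartitionDataset([[1, 2], [3, 4]], context={"iteration": 7})
before = ds.compute(row_sum)
ds.fail_over()
after = ds.compute(row_sum)
```

Training resumes with the same context and the same result; the only cost of the failure is one extra local ETL pass.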

To be released soon

Full Python Support and Model Importing
• Model Importing from Spark, XGBoost, etc.
• Full Python support
– https://github.com/gridgain/ml-python-api

Wrapping Up

Apache Ignite Benefits for ML Use Cases
• Massive scalability
– Horizontal + Vertical
– RAM + Disk
• Minimal ETL
– Train models and run algorithms in place
• Fault tolerance and continuous learning
– Partition-based dataset

Resources
• Documentation:
– https://apacheignite.readme.io/docs
• Examples and Tutorials:
– https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/ml
• Details on TensorFlow:
– https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb

Apache Ignite – We’re Hiring ;)
• Rapidly Growing Community
• Great Way to Learn Distributed Storage, Computing, SQL, ML, Transactions
• How To Contribute:
– https://ignite.apache.org/

Apache Ignite Is a Top 5 Apache Project
A Top 5 Apache Software Foundation Project
• Over 2M downloads per year and 4M total downloads
• Top 5 Dev Mailing Lists
• Top 5 User Mailing Lists
[Chart: Monthly Ignite/GridGain Downloads, Apr-14 to Dec-18, y-axis 0 to 200,000]
From January 1, 2019 Apache Software Foundation Blog Post: “Apache in 2018 – By The Digits”

Apache Ignite Users
• Financial Services
• FinTech
• Software/Cloud
• Telecom & Mobile
• IoT
• AdTech / Media / Entertainment
• Logistics & Transportation
• eCommerce & Retail
• Pharma & Healthcare

Any Questions?
@apacheignite
@denismagda
