Continuous Machine and Deep Learning at Scale With Apache Ignite
Denis Magda
Apache Ignite Committer & PMC Chair
@denismagda
2019 © GridGain Systems @denismagda @ApacheIgnite
Agenda
• Why Machine Learning at Scale?
• Ignite Machine Learning Intro
• TensorFlow Integration
• Ignite Machine Learning Internals
• Q&A
5-Minute Guide to Ignite: Overview and Why Support ML?
Why Machine Learning at Scale?
• Scalability
– Data exceeds the capacity of a single server
– Burden for developers and the business
• Models are trained and deployed in different systems
– Move data out for training
– Wait for training to complete
– Redeploy models in production
Continuous Learning Approach Without ETL
Before:
• App: storing and processing working set
• Periodic ETL of terabytes of data
• Loading data for training
• Model training & testing
• Periodic update of models
After (With CL):
• App and ML/DL Engine: storing and processing working set
• Model training & testing
• No ETL
• Instant updates of models
Apache Ignite Overview
• Application Layer: Web, SaaS, Social, Mobile, IoT
• Key-Value, SQL, Transactions, Messaging, Streaming, Events
• Compute Grid, Service Grid
• Machine and Deep Learning
• In-Memory Data Store
• Persistent Layer: Ignite Persistence, RDBMS, NoSQL, Hadoop, Mainframe
Ignite Deployment Modes
Enhance Legacy Architecture - IMDG:
• Application Layer: Web-Scale Apps, IoT, Mobile Apps, Social Media
• Ignite In-Memory Storage
• External Database: RDBMS, NoSQL, Hadoop
Simplified Modern Architecture - IMDB:
• Application Layer: Web-Scale Apps, IoT, Mobile Apps, Social Media
• Ignite In-Memory Storage
• Ignite Persistence
Ignite Machine Learning: Slightly More Details
Ignite Machine and Deep Learning
Stack:
• Java, .NET, C++, Python via the Binary Protocol (thin client)
• Ignite Machine and Deep Learning: Regressions, K-Means, Decision Trees, TensorFlow
• Distributed Machine Learning Datasets
• Compute and Service Grid
• In-Memory Data Store
• Ignite Persistence
Highlights:
• Distributed algorithms
• Large-scale parallelization
• Multi-language support
• No ETL
• Distributed dataset based on partitioned caches
Distributed Classification
• Logistic Regression
• SVM, KNN, ANN
• Decision trees
• Random Forest
• Naive Bayes
Distributed Regression
• KNN Regression
• Linear Regression
• Decision tree regression
• Random forest regression
• Gradient-boosted tree regression
Distributed Clustering
• K-means
• GMM
Multilayer Perceptron Neural Network
Ignite ML API Usage
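// Load the passenger records into a distributed Ignite cache.
IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
// Columns 0, 5 and 6 are the features; column 1 is the label.
Vectorizer vectorizer = new SampleVectorizer(0, 5, 6).labeled(1);
// Decision tree with max depth 5 and minimum impurity decrease 0.
DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
// Train on the cluster, directly against the cache.
DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
// Score the model's accuracy on the same cache.
double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());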
Machine Learning Pipelines
Pipelining with Apache Ignite
IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);

// Extracts "pclass", "sibsp", "parch", "sex", "embarked", "age", "fare".
Vectorizer<Integer, Vector, Integer, Double> vectorizer =
    new DummyVectorizer<Integer>(0, 3, 4, 5, 6, 8, 10).labeled(1);

PipelineMdl<Integer, Vector> mdl =
    new Pipeline<Integer, Vector, Integer, Double>()
        .addVectorizer(vectorizer)
        .addPreprocessingTrainer(new EncoderTrainer<Integer, Vector>()
            .withEncoderType(EncoderType.STRING_ENCODER)
            .withEncodedFeature(1)
            .withEncodedFeature(6))
        .addPreprocessingTrainer(new ImputerTrainer<Integer, Vector>())
        .addPreprocessingTrainer(new MinMaxScalerTrainer<Integer, Vector>())
        .addPreprocessingTrainer(new NormalizationTrainer<Integer, Vector>()
            .withP(1))
        .addTrainer(new DecisionTreeClassificationTrainer(5, 0))
        .fit(ignite, dataCache);
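The scaling and normalization stages of the pipeline above can be illustrated on a tiny in-memory dataset. This is a plain-Java sketch of the arithmetic only, assuming per-column min-max scaling followed by per-row normalization with p = 1. It is not Ignite's MinMaxScalerTrainer or NormalizationTrainer, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Apache Ignite.
final class PreprocessingSketch {
    /** Rescales every column to [0, 1] using that column's min and max over all rows. */
    static double[][] minMaxScale(double[][] rows) {
        int cols = rows[0].length;
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : rows) {
            for (int j = 0; j < cols; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
        double[][] out = new double[rows.length][cols];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols; j++) {
                double range = max[j] - min[j];
                // A constant column carries no information; map it to 0.
                out[i][j] = range == 0 ? 0 : (rows[i][j] - min[j]) / range;
            }
        }
        return out;
    }

    /** Divides a row by its L1 norm (sum of absolute values), i.e. p = 1. */
    static double[] normalizeL1(double[] row) {
        double norm = 0;
        for (double v : row) norm += Math.abs(v);
        double[] out = new double[row.length];
        for (int j = 0; j < row.length; j++) out[j] = norm == 0 ? 0 : row[j] / norm;
        return out;
    }

    public static void main(String[] args) {
        double[][] raw = {{1, 10}, {3, 30}, {2, 20}};
        for (double[] row : minMaxScale(raw)) {
            System.out.println(Arrays.toString(normalizeL1(row)));
        }
    }
}
```

In a distributed setting the column minima and maxima would have to be gathered across partitions before any row is scaled, which is why such stages are trained before they are applied.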
Continuous Learning With Apache Ignite
SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
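The fit-then-update pattern above can be sketched without a cluster. The example below is a simplified hinge-loss style SGD on a linear classifier with no regularization, where update continues from the previous weights instead of retraining from scratch. It is an assumption-laden illustration, not the algorithm inside SVMLinearClassificationTrainer, and all names are hypothetical.

```java
// Hypothetical illustration class; not part of Apache Ignite.
final class IncrementalTrainerSketch {
    /** One SGD pass starting from the given weights; labels are +1 or -1. */
    static double[] update(double[] weights, double[][] features, int[] labels, double learningRate) {
        double[] w = weights.clone(); // leave the previous model untouched
        for (int i = 0; i < features.length; i++) {
            double margin = 0;
            for (int j = 0; j < w.length; j++) margin += w[j] * features[i][j];
            // Only misclassified or low-margin points move the weights.
            if (labels[i] * margin < 1) {
                for (int j = 0; j < w.length; j++) {
                    w[j] += learningRate * labels[i] * features[i][j];
                }
            }
        }
        return w;
    }

    /** Initial training is an update that starts from all-zero weights. */
    static double[] fit(double[][] features, int[] labels, double learningRate) {
        return update(new double[features[0].length], features, labels, learningRate);
    }

    static int predict(double[] w, double[] x) {
        double s = 0;
        for (int j = 0; j < w.length; j++) s += w[j] * x[j];
        return s >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[][] batch1 = {{1, 2}, {-1, -2}};
        int[] labels1 = {1, -1};
        double[][] batch2 = {{2, 1}, {-2, -1}};
        int[] labels2 = {1, -1};

        double[] mdl1 = fit(batch1, labels1, 0.1);
        // The second call sees only batch2 and reuses what mdl1 already learned.
        double[] mdl2 = update(mdl1, batch2, labels2, 0.1);
        System.out.println(predict(mdl2, new double[] {3, 3}));
    }
}
```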
Demo: Payments Fraud Detection
Ignite and TensorFlow
TensorFlow Integration: Benefits
• Ignite as a distributed data source
– Perfect fit for distributed TF training
• Less ETL
– TF nodes deployed together with Ignite nodes
– In-machine data movement only
TensorFlow Integration: Main Features
• Distribution of user tasks written in Python
• Automatic creation and maintenance of the TF cluster
• Minimization of ETL costs
• Fault tolerance for both Ignite and TF instances
>>> import tensorflow as tf
>>> from tensorflow.contrib.ignite import IgniteDataset
>>>
>>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
>>> iterator = dataset.make_one_shot_iterator()
>>> next_obj = iterator.get_next()
>>>
>>> with tf.Session() as sess:
...     for _ in range(3):
...         print(sess.run(next_obj))
...
{'key': 1, 'val': {'NAME': b'WARM KITTY'}}
{'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
{'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
Ignite Machine Learning: Internals
Ignite Memory-Centric Storage
• Ignite Cluster: server nodes, each with an In-Memory Data Store and a Persistent Store
• Together they form the Distributed In-Memory Data Store and the Distributed Persistent Store
• Predictable memory consumption
• Fully transactional, with WAL (Write-Ahead Log)
• Instantaneous restarts
• Automatic defragmentation
• Off-heap storage removes noticeable GC pauses
• Stores superset of data
Record to Node Mapping
Key → Partition → Server Node
ON-DISK
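The key-to-partition-to-node chain on this slide can be sketched in a few lines. The example assumes a hash-modulo step for the partition and a rendezvous-style (highest weight wins) choice for the node. Ignite's default affinity function is also based on rendezvous hashing, but the hash mixing here is deliberately simplified and every name is hypothetical.

```java
import java.util.List;

// Hypothetical illustration class; not Ignite's affinity function.
final class AffinitySketch {
    /** Maps a key to one of `partitions` buckets; the result is stable for a given key. */
    static int partitionFor(Object key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Picks the node whose (node, partition) hash is highest: rendezvous hashing. */
    static String nodeFor(int partition, List<String> nodes) {
        String best = null;
        int bestWeight = Integer.MIN_VALUE;
        for (String node : nodes) {
            int weight = (node + ":" + partition).hashCode();
            if (best == null || weight > bestWeight) {
                best = node;
                bestWeight = weight;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3");
        for (int key = 1; key <= 5; key++) {
            int partition = partitionFor(key, 4);
            System.out.println("key " + key + " -> partition " + partition
                + " -> " + nodeFor(partition, nodes));
        }
    }
}
```

Because each node's weight for a partition is computed independently, removing a node that does not own a partition leaves that partition's owner unchanged, which keeps data movement small when the topology changes.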
Caches and Partitions
Cache:
• Partition 1: (K1, V1), (K2, V2), (K3, V3), (K4, V4)
• Partition 2: (K5, V5), (K6, V6), (K7, V7), (K8, V8), (K9, V9)
Partitions Distribution
[Figure: partitions 0, 1, 2 and 3 distributed across Node 1, Node 2, Node 3 and Node 4; each partition has a Primary copy and a Backup copy]
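One way to picture primary and backup placement is to rank all nodes per partition and take the top entries. The sketch below assumes a rendezvous-style ranking with one backup per partition. The weight function is a simplified stand-in, and the class and method names are hypothetical rather than Ignite APIs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class; not Ignite's affinity function.
final class BackupAssignmentSketch {
    /** Pseudo-random but deterministic weight of a node for a partition. */
    static int weight(String node, int partition) {
        return (node + ":" + partition).hashCode();
    }

    /** Returns the owners of a partition: index 0 is the primary, the rest are backups. */
    static List<String> ownersFor(int partition, List<String> nodes, int backups) {
        List<String> ranked = new ArrayList<>(nodes);
        ranked.sort(Comparator.comparingInt((String n) -> weight(n, partition)).reversed());
        return new ArrayList<>(ranked.subList(0, Math.min(backups + 1, ranked.size())));
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        for (int partition = 0; partition < 4; partition++) {
            List<String> owners = ownersFor(partition, nodes, 1);
            System.out.println("partition " + partition
                + ": primary=" + owners.get(0) + ", backup=" + owners.get(1));
        }
    }
}
```

With this ranking, if the primary node leaves, the former backup is next in line and becomes the primary for that partition.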
Partition-Based Dataset
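• Node 1: partition P1 with context C and data D; Training
• Node 2: partition P2 with context C and data D; Training
• Client: Initial solution; REDUCE of the training results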
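The train-locally-then-REDUCE flow on this slide resembles a map-reduce step repeated per iteration. The sketch below assumes a one-parameter linear model trained by gradient descent: each partition computes a partial gradient from its own rows, and the client sums the partial results and updates the solution. It runs in a single JVM, and all names are hypothetical.

```java
import java.util.List;

// Hypothetical illustration class; partitions are plain arrays, not Ignite caches.
final class PartitionedTrainingSketch {
    /** Local step: squared-error gradient for y ~ w * x over one partition's rows {x, y}. */
    static double[] localGradient(double w, double[][] partition) {
        double grad = 0;
        for (double[] row : partition) grad += 2 * (w * row[0] - row[1]) * row[0];
        return new double[] {grad, partition.length};
    }

    /** Reduce step: merge two partial results by summing gradients and row counts. */
    static double[] reduce(double[] a, double[] b) {
        return new double[] {a[0] + b[0], a[1] + b[1]};
    }

    /** Client loop: send the current solution out, reduce the partial gradients, update. */
    static double train(List<double[][]> partitions, double initial, double learningRate, int iterations) {
        double w = initial;
        for (int it = 0; it < iterations; it++) {
            final double current = w;
            double[] total = partitions.stream()
                .map(p -> localGradient(current, p))
                .reduce(new double[] {0, 0}, PartitionedTrainingSketch::reduce);
            w -= learningRate * total[0] / total[1];
        }
        return w;
    }

    public static void main(String[] args) {
        // Two partitions of rows {x, y} drawn from y = 2x.
        List<double[][]> partitions = List.of(
            new double[][] {{1, 2}, {2, 4}},
            new double[][] {{3, 6}});
        System.out.println(train(partitions, 0.0, 0.05, 200));
    }
}
```

Only the small partial results travel to the client; the rows themselves never leave the partition that holds them.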
Training Failover
[Figure: Node 3 and Node 1, one holding P C D and the other holding P C D*]
P = Partition
C = Partition Context
D = Partition Data
D* = Local ETL
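The D* (local ETL) idea in the legend can be read as: derived partition data is a node-local artifact that is rebuilt from the partition's records whenever it is missing. The sketch below assumes exactly that reading and models it as a lazily filled map. The rebuild function, the rebuild counter, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical illustration class; not Ignite's dataset implementation.
final class PartitionDataSketch {
    /** Node-local, non-replicated store of derived training data, keyed by partition id. */
    private final Map<Integer, double[]> localData = new HashMap<>();
    /** Rebuilds a partition's training data from its upstream records ("local ETL"). */
    private final IntFunction<double[]> rebuild;
    /** Counts how often the local ETL actually ran. */
    int rebuilds = 0;

    PartitionDataSketch(IntFunction<double[]> rebuild) {
        this.rebuild = rebuild;
    }

    /** Returns the data for a partition, rebuilding it only when it is missing on this node. */
    double[] dataFor(int partition) {
        return localData.computeIfAbsent(partition, p -> {
            rebuilds++;
            return rebuild.apply(p);
        });
    }

    /** Simulates losing node-local state, e.g. when a partition moves after a node failure. */
    void dropLocal(int partition) {
        localData.remove(partition);
    }

    public static void main(String[] args) {
        PartitionDataSketch node = new PartitionDataSketch(p -> new double[] {p, p * 2.0});
        node.dataFor(3);   // first access runs the local ETL
        node.dataFor(3);   // second access reuses the local copy
        node.dropLocal(3); // failover: the derived data is gone
        node.dataFor(3);   // rebuilt from upstream on demand
        System.out.println("rebuilds: " + node.rebuilds);
    }
}
```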
To be released soon
Full Python Support and Model Importing
• Model Importing from Spark, XGBoost, etc.
• Full Python support
– https://github.com/gridgain/ml-python-api
Wrapping Up
Apache Ignite Benefits for ML Use Cases
• Massive scalability
– Horizontal + Vertical
– RAM + Disk
• Minimal ETL
– Train models and run algorithms in place
• Fault tolerance and continuous learning
– Partition-based dataset
Resources
• Documentation:
– https://apacheignite.readme.io/docs
• Examples and Tutorials:
– https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/ml
• Details on TensorFlow:
– https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb
Apache Ignite – We’re Hiring ;)
• Rapidly growing community
• Great way to learn distributed storage, computing, SQL, ML, and transactions
• How to contribute:
– https://ignite.apache.org/
Apache Ignite Is a Top 5 Apache Project
• Over 2M downloads per year and 4M total downloads
• A Top 5 Apache Software Foundation project
[Figure: ranked lists "Top 5 Dev Mailing Lists" and "Top 5 User Mailing Lists"]
[Figure: chart "Monthly Ignite/GridGain Downloads", Apr-14 to Dec-18, vertical axis 0 to 200,000]
Source: January 1, 2019 Apache Software Foundation blog post, “Apache in 2018 – By The Digits”
Apache Ignite Users
• Financial Services
• FinTech
• Software/Cloud
• Telecom & Mobile
• IoT
• AdTech/Media/Entertainment
• Logistics & Transportation
• eCommerce & Retail
• Pharma & Healthcare
[Figure: user logos grouped by industry, including Reliance]
Any Questions?
@apacheignite
@denismagda

英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 

Recently uploaded (20)

Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 

Continuous Machine and Deep Learning with Apache Ignite

  • 1. Continuous Machine and Deep Learning at Scale With Apache Ignite Denis Magda Apache Ignite Committer & PMC Chair @denismagda
  • 2. 2019 © GridGain Systems @denismagda @ApacheIgnite Agenda • Why Machine Learning at Scale? • Ignite Machine Learning Intro • TensorFlow Integration • Ignite Machine Learning Internals • Q&A
  • 3. 5-Minute Guide to Ignite: Overview and Why to Support ML?
  • 4. Why Machine Learning at Scale? • Scalability – Data exceeds the capacity of a single server – Burden for dev and business • Models trained and deployed in different systems – Move data out for training – Wait for training to complete – Redeploy models in production
  • 5. Continuous Learning Approach Without ETL • Before: App – Storing and processing working set – Periodic ETL of terabytes of data – Loading data for training – Model training & testing – Periodic update of models • After (With CL): App and ML/DL Engine – Storing and processing working set – Model training & testing – Instant updates of models – No ETL
  • 6. Apache Ignite Overview • Application Layer: Web, SaaS, Social, Mobile, IoT • Machine and Deep Learning • Key-Value, SQL, Transactions, Messaging, Streaming, Events • Compute Grid, Service Grid • In-Memory Data Store • Persistent Layer: Ignite Persistence, RDBMS, NoSQL, Hadoop, Mainframe
  • 7. Ignite Deployment Modes • Enhance Legacy Architecture - IMDG: Application Layer (Web-Scale Apps, IoT, Mobile Apps, Social Media); Ignite In-Memory Storage; External Database (RDBMS, NoSQL, Hadoop) • Simplified Modern Architecture - IMDB: Application Layer (Web-Scale Apps, IoT, Mobile Apps, Social Media); Ignite In-Memory Storage; Ignite Persistence
  • 8. Ignite Machine Learning: Slightly More Details
  • 9. Ignite Machine and Deep Learning • Languages: Java, .NET, C++, Python; Binary Protocol (Thin client) • Ignite Machine and Deep Learning: K-Means, Regressions, Decision Trees, TensorFlow • Distributed Machine Learning Datasets • Compute and Service Grid • In-Memory Data Store • Ignite Persistence • Highlights: Distributed Algorithms; Large Scale Parallelization; Multi-language Support; No ETL; Distributed Dataset based on partitioned caches
  • 10. Distributed Classification • Logistic Regression • SVM, KNN, ANN • Decision trees • Random Forest • Naive Bayes
  • 11. Distributed Regression • KNN Regression • Linear Regression • Decision tree regression • Random forest regression • Gradient-boosted tree regression
  • 12. Distributed Clustering • K-means • GMM
  • 13. Multilayer Perceptron Neural Network
  • 14. Ignite ML API Usage
      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      Vectorizer vectorizer = new SampleVectorizer(0, 5, 6).labeled(1);
      DecisionTreeClassificationTrainer trainer = new DecisionTreeClassificationTrainer(5, 0);
      DecisionTreeNode mdl = trainer.fit(ignite, dataCache, vectorizer);
      double accuracy = Evaluator.evaluate(dataCache, mdl, vectorizer, new Accuracy<>());
  • 15. Machine Learning Pipelines
  • 16. Pipelining with Apache Ignite
      IgniteCache<Integer, Vector> dataCache = TitanicUtils.readPassengers(ignite);
      // Extracts "pclass", "sibsp", "parch", "sex", "embarked", "age", "fare".
      Vectorizer<Integer, Vector, Integer, Double> vectorizer =
          new DummyVectorizer<Integer>(0, 3, 4, 5, 6, 8, 10).labeled(1);
      PipelineMdl<Integer, Vector> mdl = new Pipeline<Integer, Vector, Integer, Double>()
          .addVectorizer(vectorizer)
          .addPreprocessingTrainer(new EncoderTrainer<Integer, Vector>()
              .withEncoderType(EncoderType.STRING_ENCODER)
              .withEncodedFeature(1)
              .withEncodedFeature(6))
          .addPreprocessingTrainer(new ImputerTrainer<Integer, Vector>())
          .addPreprocessingTrainer(new MinMaxScalerTrainer<Integer, Vector>())
          .addPreprocessingTrainer(new NormalizationTrainer<Integer, Vector>()
              .withP(1))
          .addTrainer(new DecisionTreeClassificationTrainer(5, 0))
          .fit(ignite, dataCache);
  • 17. Continuous Learning With Apache Ignite
      SVMLinearClassificationTrainer trainer = new SVMLinearClassificationTrainer();
      SVMLinearClassificationModel mdl1 = trainer.fit(ignite, dataCache1, vectorizer);
      SVMLinearClassificationModel mdl2 = trainer.update(mdl1, ignite, dataCache2, vectorizer);
  • 18. Demo: Payments Fraud Detection
  • 19. Ignite and TensorFlow
  • 20. TensorFlow Integration: Benefits • Ignite as distributed data source – Perfect fit for distributed TF training • Less ETL – TF nodes deployed together with Ignite nodes – In-machine data movement only
  • 21. TensorFlow Integration: Main Features • Distribution of user tasks written in Python • Automatic creation and maintenance of TF cluster • Minimization of ETL costs • Fault tolerance for both Ignite and TF instances
      >>> import tensorflow as tf
      >>> from tensorflow.contrib.ignite import IgniteDataset
      >>>
      >>> dataset = IgniteDataset(cache_name="SQL_PUBLIC_KITTEN_CACHE")
      >>> iterator = dataset.make_one_shot_iterator()
      >>> next_obj = iterator.get_next()
      >>>
      >>> with tf.Session() as sess:
      ...     for _ in range(3):
      ...         print(sess.run(next_obj))
      ...
      {'key': 1, 'val': {'NAME': b'WARM KITTY'}}
      {'key': 2, 'val': {'NAME': b'SOFT KITTY'}}
      {'key': 3, 'val': {'NAME': b'LITTLE BALL OF FUR'}}
  • 22. Ignite Machine Learning: Internals
  • 23. Distributed In-Memory Data Store • Ignite Memory-Centric Storage • In-Memory Data Store: Off-heap; Removes Noticeable GC Pauses; Automatic Defragmentation; Predictable Memory Consumption • Distributed Persistent Store: Stores Superset of Data; Fully Transactional WAL (Write Ahead Log); Instantaneous Restarts • Ignite Cluster: three Server Nodes, each with an In-Memory Data Store and a Persistent Store
  • 24. Record to Node Mapping: Key → Partition → Server Node (ON-DISK)
  • 25. Caches and Partitions • Cache • Partition 1: (K1, V1), (K2, V2), (K3, V3), (K4, V4) • Partition 2: (K5, V5), (K6, V6), (K7, V7), (K8, V8), (K9, V9)
  • 26. Partitions Distribution • Node 1, Node 2, Node 3, Node 4 • Partitions 0, 1, 2, 3, each shown as a Primary and a Backup copy
  • 27. Partition-Based Dataset • Node 1: P1, C, D (Training) • Node 2: P2, C, D (Training) • REDUCE to Client • Initial solution
  • 28. Training Failover • Node 3, Node 1 • P C D* and P C D • P = Partition, C = Partition Context, D = Partition Data, D* = Local ETL
  • 29. To be released soon
  • 30. Full Python Support and Model Importing • Model Importing from Spark, XGBoost, etc. • Full Python support – https://github.com/gridgain/ml-python-api
  • 31. Wrapping Up
  • 32. Apache Ignite Benefits for ML Use Cases • Massive scalability – Horizontal + Vertical – RAM + Disk • Minimal ETL – Train models and run algorithms in place • Fault tolerance and continuous learning – Partition-based dataset
  • 33. Resources • Documentation: – https://apacheignite.readme.io/docs • Examples and Tutorials: – https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/ml • Details on TensorFlow: – https://medium.com/tensorflow/tensorflow-on-apache-ignite-99f1fc60efeb
  • 34. Apache Ignite – We’re Hiring ;) • Rapidly Growing Community • Great Way to Learn Distributed Storages, Computing, SQL, ML, Transactions • How To Contribute: – https://ignite.apache.org/
  • 35. Apache Ignite Is a Top 5 Apache Project • Over 2M downloads per year and 4M total downloads • [Chart: Monthly Ignite/GridGain Downloads, Apr-14 to Dec-18, axis from 0 to 200,000] • A Top 5 Apache Software Foundation Project: Top 5 Dev Mailing Lists, Top 5 User Mailing Lists (list entries not recoverable) • From January 1, 2019 Apache Software Foundation Blog Post: “Apache in 2018 – By The Digits”
  • 36. Apache Ignite Users • Financial Services • FinTech • Software/Cloud • Telecom & Mobile • IoT • AdTech / Media / Entertainment • Logistics & Transportation • eCommerce & Retail • Pharma & Healthcare • Reliance
  • 37. Any Questions? @apacheignite @denismagda
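
The internals slides (Record to Node Mapping, Partition-Based Dataset) describe keys mapping to partitions, partitions mapping to nodes, local training per partition, and a REDUCE step at the client. The following is a minimal stdlib-only Python sketch of that flow, not Ignite code. The modulo mapping, the node names, and the choice of a one-feature least-squares fit are all assumptions made for illustration; Ignite's real affinity function differs.

```python
from collections import defaultdict

NUM_PARTITIONS = 4             # assumption: tiny partition count for illustration
NODES = ["node-1", "node-2"]   # hypothetical node names


def partition_of(key: int) -> int:
    """Map a record key to a partition (modulo is only a stand-in)."""
    return key % NUM_PARTITIONS


def node_of(partition: int) -> str:
    """Map a partition to the node holding its primary copy (stand-in scheme)."""
    return NODES[partition % len(NODES)]


def build_partitions(cache: dict) -> dict:
    """Group cache values by partition, as a partitioned cache would."""
    parts = defaultdict(list)
    for key, value in cache.items():
        parts[partition_of(key)].append(value)
    return dict(parts)


def local_stats(rows):
    """'Training' on one partition: sufficient statistics for y = a*x + b."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)


def reduce_stats(a, b):
    """REDUCE step: merge the statistics of two partitions."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(cache):
    """Run the per-partition step, reduce, and solve on the 'client'."""
    total = (0, 0.0, 0.0, 0.0, 0.0)
    for rows in build_partitions(cache).values():
        total = reduce_stats(total, local_stats(rows))
    n, sx, sy, sxx, sxy = total
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Records lie exactly on y = 2x + 1, keyed by integer id.
cache = {k: (float(k), 2.0 * k + 1.0) for k in range(20)}
slope, intercept = fit_line(cache)
print(slope, intercept)
```

Only the small per-partition statistics travel to the client, never the rows themselves, which is the sense in which the deck says training runs in place with no ETL.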

Editor's Notes

  1. Fraud prevention. A bank has developed a historical model of what indicates that a loan application is likely fraudulent. As the system ingests new credit applications, it continually updates the machine learning model with the new data to identify, in real time, any emerging trends that might indicate a new concerted effort to acquire credit fraudulently. Any related fraudulent activity can then be identified immediately. Ecommerce recommendations. Online shopping recommendation engines are based on historical data such as web page visits and purchase patterns, but they are far more powerful, and deliver an increased ROI, if they incorporate real-time continuous learning. Incorporating the latest web page information, referral information, and purchase patterns into the machine learning model can improve the recommendation engine in real time, so that recommendations reflect the latest data available.
  2. The GridGain Platform. GridGain is a memory-centric data platform that is used to build fast, scalable and resilient solutions. At the heart of the GridGain platform lies a distributed memory-centric data store with ACID semantics and powerful processing APIs including SQL, Compute, Key/Value and transactions. The memory-centric approach enables GridGain to leverage memory for high throughput and low latency whilst utilising local disk or SSD to provide durability and fast recovery. The GridGain platform can be integrated with third-party databases and external storage mediums and can be deployed on any infrastructure. It provides linear scalability, built-in fault tolerance, comprehensive security and auditing, alongside advanced monitoring and management. The GridGain platform caters for a range of use cases including core banking services; real-time product pricing, reconciliation and risk calculation engines; analytics; and machine learning.
  3. * Architectural simplification
  4. Apache Ignite incorporates distributed SQL database capabilities as a part of its platform. The database is horizontally scalable, fault tolerant and ANSI-99 SQL compliant. It supports all SQL DML commands, including SELECT, UPDATE, INSERT, MERGE, and DELETE queries. It also provides support for a subset of DDL commands relevant for distributed databases. Data sets as well as indexes can be stored both in RAM and on disk thanks to the durable memory architecture. This allows executing distributed SQL operations across different memory layers, achieving in-memory performance with the durability of disk. You can interact with Apache Ignite using SQL via natively developed APIs for Java, .NET and C++, or via the Ignite JDBC or ODBC drivers. This provides true cross-platform connectivity from languages such as PHP, Ruby and more.
  5. You might also expect your model to be perfect. To check, calculate a classification metric, such as accuracy, to evaluate the quality of the model.
  6. The Apache Ignite memory-centric platform is based on an in-memory architecture that allows storing and processing data and indexes both in memory and on disk when the Ignite Persistent Store feature is enabled. The memory architecture helps achieve in-memory performance with the durability of disk using all the available resources of the cluster. The GridGain in-memory data store is built and operates in a way similar to the Virtual Memory of operating systems such as Linux. However, one significant difference between these two types of architectures is that Durable Memory always keeps the whole data set and indexes on disk if the Ignite Persistent Store is used, while Virtual Memory uses the disk for swapping purposes only. In-Memory: • Off-heap memory • Removes noticeable GC pauses • Automatic defragmentation • Predictable memory consumption • Boosts SQL performance. On Disk: • Optional persistence • Support of flash, SSD, Intel 3D XPoint • Stores superset of data • Fully transactional, with a Write-Ahead Log (WAL) • Instantaneous cluster restarts
  7. Abstraction layer on top of Ignite storage and computation. MapReduce using Compute Grid. Partition data: can be recovered from another node. Partition context: ML algorithms are iterative and require context.
  8. Part of the reason behind our growth is the growth of Apache Ignite. HAVE YOU HEARD OF APACHE IGNITE? GridGain Systems donated the code to the Apache Ignite project in late 2014. It became a top-level project of the Apache Software Foundation (ASF) in mid-2015, the second fastest to do so. Apache Ignite is now one of the top 5 Apache Software Foundation projects, and has been for the last 2 years. We continue to be the leading contributor, though there are several others. With over 4 million total downloads, Ignite has reached a run rate of 2 million downloads a year. [1] http://globenewswire.com/news-release/2019/07/09/1534470/0/en/The-Apache-Software-Foundation-Announces-Annual-Report-for-2019-Fiscal-Year.html 2018 numbers: [2] https://blogs.apache.org/foundation/entry/apache-in-2018-by-the 2017 numbers: [3] https://blogs.apache.org/foundation/entry/apache-in-2017-by-the
  9. Today there are hundreds of leading companies that rely on GridGain to support their mission-critical applications. While GridGain started in Financial Services, today that is about 25% of its total business … USE THIS OPPORTUNITY TO TELL SOME OF THE RELEVANT STORIES. It is used by FinTech and SaaS companies to add speed and scale, usually to support their larger customers as they adopt the FinTech/SaaS technologies. In FinTech, Finastra, which supports 48 out of the 50 top banks worldwide, adopted GridGain for their Cloud platform to add the speed and scale needed for their offerings and to support FRTB real-time regulatory requirements. In SaaS, Microsoft Azure uses GridGain for real-time attack prevention as part of their identity services for all customer applications on Azure. In telco, all of RingCentral’s VoIP relies on GridGain for storing all call/service sessions and making sure connections continue even as calls connect through different datacenters. In IoT, Itron supports hundreds of millions of smart meters globally and relies on GridGain for real-time data ingestion at scale. They adopted GridGain at first to support their larger customers. American Airlines uses GridGain for real-time rerouting of customers and their luggage as they land. Multiplan uses GridGain to better manage healthcare costs at scale.
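
Note 5 (evaluate the model with a metric such as accuracy) and the continuous-learning slide (`trainer.fit` followed by `trainer.update`) can be illustrated with a small stdlib-only Python sketch. This is not the Ignite API: `CentroidTrainer`, `CentroidModel`, and `accuracy` are hypothetical names, and a nearest-centroid classifier is swapped in for the SVM on the slide because its state (per-class sums and counts) can be updated from a new batch without revisiting old data.

```python
from dataclasses import dataclass, field


@dataclass
class CentroidModel:
    """Hypothetical model: per-class feature sums and counts."""
    sums: dict = field(default_factory=dict)
    counts: dict = field(default_factory=dict)

    def predict(self, x):
        """Return the label whose centroid is closest to x."""
        def dist(label):
            centroid = [s / self.counts[label] for s in self.sums[label]]
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.sums, key=dist)


class CentroidTrainer:
    """Hypothetical trainer mirroring the fit/update shape on the slide."""

    def fit(self, rows):
        return self.update(CentroidModel(), rows)

    def update(self, model, rows):
        # Copy the old state so the previous model stays usable.
        new = CentroidModel({k: list(v) for k, v in model.sums.items()},
                            dict(model.counts))
        for x, label in rows:
            acc = new.sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            new.counts[label] = new.counts.get(label, 0) + 1
        return new


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for x, label in rows if model.predict(x) == label)
    return hits / len(rows)


batch1 = [([0.0, 0.0], 0), ([1.0, 1.0], 0), ([4.0, 4.0], 1)]
batch2 = [([5.0, 5.0], 1), ([6.0, 6.0], 1), ([0.5, 0.5], 0)]

trainer = CentroidTrainer()
mdl1 = trainer.fit(batch1)           # initial training
mdl2 = trainer.update(mdl1, batch2)  # incremental update with new data only
print(accuracy(mdl2, batch1 + batch2))
```

The update step reads only the new batch plus the stored model state, which is the property that lets a model be refreshed as data arrives instead of being retrained from scratch.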