OSDC 2018 | Apache Ignite - the in-memory hammer for your data science toolkit by Akmal Chaudhri


Machine learning is a method of data analysis that automates the building of analytical models. By using algorithms that iteratively learn from data, computers are able to find hidden insights without the help of explicit programming. These insights bring tremendous benefits into many different domains. For business users, in particular, these insights help organizations improve customer experience, become more competitive, and respond much faster to opportunities or threats. The availability of very powerful in-memory computing platforms, such as Apache Ignite, means that more organizations can benefit from machine learning today. In this presentation we will look at some of the main components of Apache Ignite, such as the Compute Grid, Data Grid and the Machine Learning Grid. Through examples, attendees will learn how Apache Ignite can be used for data analysis.

Published in: Software

  1. © 2018 GridGain Systems, Inc. In-Memory Performance, Durability of Disk
  2. Apache Ignite: In-Memory Hammer for Your Data Science Toolkit. Akmal Chaudhri, GridGain Systems
  3. Agenda: Apache Ignite Overview • Data Science Toolkit • Memory-Centric Storage • Compute Grid • Machine Learning • Genetic Algorithms • Summary • Q&A
  4. Apache Ignite: Database, Caching and Processing Platform • Memory-Centric Storage • Ignite Native Persistence (Flash, SSD, Intel 3D XPoint) • Third-Party Persistence (RDBMS, HDFS, NoSQL) • SQL • Transactions • Compute • Services • ML • Streaming • Key/Value • IoT • Financial Services • Pharma & Healthcare • E-Commerce • Travel & Logistics • Telco
  5. Machine Learning Business Case
     1. Models trained and deployed in different systems • Move data out for training • Wait for training to complete • Redeploy models in production
     2. Scalability • Data exceed capacity of single server • Need complex solutions • Burden for developers
  6. Memory-Centric Storage (three server nodes, each with Durable Memory): Off-heap • Removes noticeable GC pauses • Automatic Defragmentation • Stores Superset of Data • Predictable memory consumption • Fully Transactional (Write-Ahead Log) • Instantaneous Restarts
  7. Compute Grid (Ignite cluster, two nodes with Durable Memory): C = C1 + C2 • R = R1 + R2 • C = Compute (Task), C1, C2 = Jobs • R = Result, in T/2 time • Automatic Failover • Load Balancing • Zero Deployment
  8. Machine Learning (three server nodes, each with Durable Memory): Distributed Core Algebra • Distributed Algorithms: K-Means, Regressions, Decision Trees, Random Forest • Dense and Sparse Algebra • Large Scale Parallelization • Multi-Language Support: R, C++, Python, Java, Scala, REST • No ETL
  9. Key to Node Mapping: Key → Partition → Server Node (on disk)
  10. Caches and Partitions: Cache • Partition 1: (K1, V1), (K2, V2), (K3, V3), (K4, V4) • Partition 2: (K5, V5), (K6, V6), (K7, V7), (K8, V8), (K9, V9)
  11. Backup Copies: Ignite Node 1 • Ignite Node 2 • Ignite Node 3 • Ignite Node 4 (diagram labels: 0, 1, 2, 3)
  12. Backup Copies: Ignite Node 1 • Ignite Node 2 • Ignite Node 3 • Ignite Node 4 (diagram labels: Primary 0, 1, 2, 3 • Backup 0, 1, 2, 3)
  13. Partition-Based Dataset: Abstraction layer on top of Ignite storage and computation • MapReduce-like using Compute Grid • Partition data: can be recovered from another node • Partition context: ML algorithms are iterative and require context
  14. Partition-Based Dataset: Ignite Node 1 (P1, C, D) • Ignite Node 2 (P2, C, D) • Training on each node • REDUCE • Client • Initial solution
  15. Iterative Optimization
  16. Recovery after Node Failure: Ignite Node 1 • Ignite Node 2 • Diagram labels: P C D and P C D* • P = Partition • C = Partition Context • D = Partition Data • D* = Local ETL
  17. Algorithms and Applicability
      • Classification. Description: Identify to which category a new observation belongs, on the basis of a training set of data. Applicability: spam detection, image recognition, credit scoring, disease identification. Algorithms: nearest neighbor, decision tree classification, neural network.
      • Regression. Description: Modeling the relationship between a scalar dependent variable y and one or more explanatory variables x. Applicability: drug response, stock prices, supermarket revenue. Algorithms: linear regression, decision tree regression, nearest neighbor, neural network.
  18. Algorithms and Applicability
      • Clustering. Description: Grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. Applicability: customer segmentation, grouping experiment outcomes, grouping shopping items. Algorithms: k-means.
      • Preprocessing. Description: Feature extraction and normalization. Applicability: transform input data, such as text, for use with machine learning algorithms. Algorithms: Normalization preprocessor.
  19. Demo
  20. Genetic Algorithms: Find good solutions to complex problems • Simulate biological evolution • Population: consists of chromosomes • Chromosome: possible solution, consists of genes • Genes: combined to derive new chromosomes
  21. Genetic Algorithms • Fitness Calculation: chromosomes contain a fitness score, which is used to compare different solutions • Crossover: the process of combining genes to produce new chromosomes • Mutation: some genes within chromosomes are updated to produce new characteristics
  22. Genetic Algorithms (Ignite cluster, two nodes with Durable Memory): F = Fitness Calculation • C = Crossover • M = Mutation • Per node: F1, C1, M1 and F2, C2, M2 • F = F1 + F2 • C = C1 + C2 • M = M1 + M2 • Collocated Computation • Biological Evolution Simulation • Chromosome and Genes Cluster
  23. Genetic Algorithms
  24. Demo
  25. Apache Ignite Benefits: Distributed Machine Learning and Deep Learning when data do not fit within a single server unit • Zero-ETL: train models and run algorithms in place • Massive scalability: Horizontal + Vertical, RAM + Disk • Fault tolerance and continuous learning: partition-based dataset
  26. Any Questions? Thank you for joining us. Follow the conversation. #apacheignite
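
The Compute Grid slide (slide 7) describes splitting a task C into jobs C1 and C2, running them in parallel, and combining the partial results as R = R1 + R2. The idea can be sketched in plain Python as follows. This is only an illustration, not the Ignite API: local threads stand in for cluster nodes, and the function names (`split_task`, `run_job`, `run_task`) and the character-counting workload are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(words, n_jobs):
    """Split the task C into n_jobs jobs (C1, C2, ...) of near-equal size."""
    size, extra = divmod(len(words), n_jobs)
    jobs, start = [], 0
    for i in range(n_jobs):
        end = start + size + (1 if i < extra else 0)
        jobs.append(words[start:end])
        start = end
    return jobs


def run_job(words):
    """One job: count the characters in its share of the input (result R_i)."""
    return sum(len(w) for w in words)


def run_task(words, n_jobs=2):
    """Run the jobs in parallel, then reduce: R = R1 + R2 + ..."""
    # Threads are a stand-in for cluster nodes (assumption for this sketch).
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partial_results = list(pool.map(run_job, split_task(words, n_jobs)))
    return sum(partial_results)


total = run_task(["apache", "ignite", "compute", "grid"])
print(total)  # 23
```

With two equally loaded jobs, the wall-clock time approaches T/2 only when the jobs truly run in parallel on separate resources, which is the role the cluster nodes play on the slide.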
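
Slides 9 to 12 outline how a key maps to a partition, how a partition maps to a server node, and how backup copies are kept on other nodes. A deliberately simplified sketch of that chain is shown below. It is not Ignite's actual affinity function: the CRC32 hash, the round-robin placement, and the names `partition_for` and `owners_for` are assumptions chosen to keep the example short.

```python
import zlib


def partition_for(key, n_partitions):
    """Map a key to a partition using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % n_partitions


def owners_for(partition, nodes, backups=1):
    """Return [primary, backup, ...] as consecutive nodes, round-robin.

    Simplified placement for illustration only.
    """
    return [nodes[(partition + i) % len(nodes)] for i in range(backups + 1)]


nodes = ["node1", "node2", "node3", "node4"]
for key in ["K1", "K2", "K3"]:
    p = partition_for(key, n_partitions=4)
    primary, backup = owners_for(p, nodes)
    print(key, "-> partition", p, "-> primary", primary, "backup", backup)
```

Because the primary and the backup for a partition always land on different nodes here, losing one node still leaves a copy of every partition, which is the point of the Backup Copies slides.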
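
Slides 13 to 15 describe the partition-based dataset as a MapReduce-like loop: the client holds an initial solution, each node trains on its own partition, and the partial results are reduced before the next iteration. A minimal sketch of that pattern, assuming a one-variable linear regression fitted by gradient descent, might look like this. The functions `partial_gradient` and `train` are hypothetical names, the partitions are ordinary Python lists rather than Ignite partitions, and partition context and recovery are not modeled.

```python
def partial_gradient(partition, w, b):
    """Map step: sum the squared-error gradient terms over one partition."""
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(partition)


def train(partitions, lr=0.1, iterations=1000):
    """Iteratively refine (w, b), reducing per-partition gradients each step."""
    w, b = 0.0, 0.0  # initial solution held by the client
    for _ in range(iterations):
        parts = [partial_gradient(p, w, b) for p in partitions]  # map
        gw = sum(p[0] for p in parts)  # reduce
        gb = sum(p[1] for p in parts)
        n = sum(p[2] for p in parts)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Two partitions of points that lie on y = 2x + 1 (toy data for the sketch).
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = train(partitions)
print(round(w, 4), round(b, 4))
```

Only the small gradient sums travel to the client in each iteration; the data points stay in their partitions, which is what the slides mean by training in place.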
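
The genetic algorithm vocabulary on slides 20 to 22 (population, chromosomes, genes, fitness calculation, crossover, mutation) can be made concrete with a small single-process sketch. The toy problem below, maximizing the number of 1-genes in a bit string, is an assumption chosen for brevity, as are the function names, the parameter values, and the selection scheme that keeps the fitter half of the population. It does not use Ignite's genetic algorithm API, and nothing here is distributed.

```python
import random


def fitness(chromosome):
    """Fitness score: the number of 1-genes (toy objective)."""
    return sum(chromosome)


def crossover(a, b, rng):
    """Combine genes from two parents at a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome, rng, rate=0.05):
    """Flip each gene with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def evolve(n_genes=20, pop_size=30, generations=60, seed=42):
    """Evolve a population and return its fittest chromosome."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(fitness(best), best)
```

Because the fitter half survives unchanged into the next generation, the best fitness in the population never decreases. In the distributed setting on slide 22, the fitness, crossover, and mutation steps would each be split across nodes in the same way as the Compute Grid jobs.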