Machine learning is a method of data analysis that automates the building of analytical models. By using algorithms that iteratively learn from data, computers can find hidden insights without being explicitly programmed. These insights bring tremendous benefits to many different domains. For business users in particular, they help organizations improve customer experience, become more competitive, and respond much faster to opportunities or threats. The availability of very powerful in-memory computing platforms, such as the open-source Apache Ignite (https://ignite.apache.org/), means that more organizations can benefit from machine learning today.
In this presentation, Denis will look at some of the main components of Apache Ignite, such as a distributed database, distributed computations, and machine learning toolkit. Through examples, attendees will learn how Apache Ignite can be used for data analysis.
The Apache Ignite Platform
Apache Ignite is a memory-centric data platform that is used to build fast, scalable & resilient solutions.
At the heart of the Apache Ignite platform lies a distributed, memory-centric data storage layer with ACID semantics and powerful processing APIs, including SQL, compute, key-value, and transactions. This memory-centric approach enables Apache Ignite to leverage memory for high throughput and low latency while using local disk or SSD to provide durability and fast recovery.
The main difference between the memory-centric approach and the traditional disk-centric approach is that memory is treated as fully functional storage, not merely as a caching layer, as in most databases. For example, Apache Ignite can function in a pure in-memory mode, in which case it can be treated as an In-Memory Database (IMDB) and In-Memory Data Grid (IMDG) in one.
On the other hand, when persistence is turned on, Ignite begins to function as a memory-centric system where most of the processing happens in memory, but the data and indexes get persisted to disk. The main difference here from the traditional disk-centric RDBMS or NoSQL system is that Ignite is strongly consistent, horizontally scalable, and supports both SQL and key-value processing APIs.
The Apache Ignite platform can be integrated with third-party databases and external storage media and can be deployed on any infrastructure. It provides linear scalability, built-in fault tolerance, and comprehensive security and auditing, alongside advanced monitoring & management.
The Apache Ignite platform caters for a range of use cases, including core banking services, real-time product pricing, reconciliation and risk-calculation engines, analytics, and machine learning.
Ignite Data Grid is a distributed key-value store that enables storing data both in memory and on disk within distributed clusters and provides extensive APIs. Ignite Data Grid can be viewed as a distributed partitioned hash map with every cluster node owning a portion of the overall data. This way the more cluster nodes we add, the more data we can store.
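The partitioned-hash-map idea can be sketched in a few lines. This is a deliberately simplified, self-contained illustration — the node names, partition count, and hashing scheme below are invented for the example and do not reflect Ignite's actual affinity function or internals.

```python
# Toy sketch of a partitioned key-value store: each cluster node owns a
# subset of partitions, and a key's partition is derived from its hash.
# All names and sizes here are illustrative, not Ignite internals.
import hashlib

NUM_PARTITIONS = 8

def partition_of(key):
    # Stable hash -> partition id (Ignite uses its own affinity function).
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

class ToyDataGrid:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Round-robin assignment of partitions to nodes.
        self.owner = {p: self.nodes[p % len(self.nodes)]
                      for p in range(NUM_PARTITIONS)}
        self.storage = {n: {} for n in self.nodes}

    def put(self, key, value):
        # Route the entry to the node owning the key's partition.
        self.storage[self.owner[partition_of(key)]][key] = value

    def get(self, key):
        return self.storage[self.owner[partition_of(key)]].get(key)

grid = ToyDataGrid(["node-1", "node-2", "node-3"])
grid.put("alice", 30)
grid.put("bob", 25)
print(grid.get("alice"))  # -> 30
```

Because ownership is derived from the key's hash, adding nodes simply spreads the partitions (and thus the data) across more machines — the "more nodes, more capacity" property described above.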
The Apache Ignite memory-centric platform is based on a Durable Memory architecture that allows storing and processing data and indexes both in memory and on disk when the Ignite Persistent Store feature is enabled. This memory architecture helps achieve in-memory performance with the durability of disk, using all the available resources of the cluster.

Ignite's Durable Memory is built and operates much like the virtual memory of operating systems such as Linux. One significant difference between the two architectures, however, is that Durable Memory always keeps the whole data set and indexes on disk when the Ignite Persistent Store is used, while virtual memory uses the disk for swapping purposes only.
In-Memory
• Off-Heap memory
• Removes noticeable GC pauses
• Automatic Defragmentation
• Predictable memory consumption
• Boosts SQL performance
On Disk
• Optional Persistence
• Support of flash, SSD, Intel 3D XPoint
• Stores superset of data
• Fully Transactional
◦ Write-Ahead-Log (WAL)
• Instantaneous Cluster Restarts
Ignite Native Persistence is a distributed, ACID- and SQL-compliant disk store that transparently integrates with Ignite's Durable Memory as an optional disk layer, storing data and indexes on SSD, flash, 3D XPoint, and other types of non-volatile storage.
With Ignite Persistence enabled, you no longer need to keep all the data and indexes in memory, or warm them up after a node or cluster restart, because Durable Memory is tightly coupled with persistence and treats it as a secondary memory tier. This means that if a subset of the data or an index is missing in RAM, Durable Memory will read it from disk.
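The read-through behavior of a memory tier backed by a durable tier can be illustrated with a small sketch. This is a toy model only — the LRU policy and capacities are assumptions for the example, not how Ignite's page memory is actually implemented.

```python
# Toy two-tier "durable memory": a bounded in-memory cache backed by a
# durable tier that always holds the full data set. A miss in RAM falls
# through to the durable tier -- no warm-up needed after a restart.
from collections import OrderedDict

class TwoTierStore:
    def __init__(self, ram_capacity):
        self.ram = OrderedDict()          # hot tier (bounded, LRU-evicted)
        self.disk = {}                    # durable tier (full superset)
        self.ram_capacity = ram_capacity

    def put(self, key, value):
        self.disk[key] = value            # durable tier keeps everything
        self._cache(key, value)

    def get(self, key):
        if key in self.ram:               # in-memory hit
            self.ram.move_to_end(key)
            return self.ram[key]
        value = self.disk[key]            # miss: fall through to disk tier
        self._cache(key, value)
        return value

    def _cache(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        if len(self.ram) > self.ram_capacity:
            self.ram.popitem(last=False)  # evict least recently used

store = TwoTierStore(ram_capacity=2)
for i in range(5):
    store.put(i, i * 10)
print(store.get(0))  # evicted from RAM, served from the durable tier -> 0
```

Note how the caller never distinguishes between the two tiers: the store transparently promotes cold entries back into RAM on access, which is the essence of treating disk as a secondary memory tier rather than a separate database.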
Apache Ignite incorporates distributed SQL database capabilities as part of its platform. The database is horizontally scalable, fault tolerant, and ANSI SQL-99 compliant. It supports all DML commands, including SELECT, UPDATE, INSERT, MERGE, and DELETE queries, as well as the subset of DDL commands relevant for distributed databases.
Data sets as well as indexes can be stored both in RAM and on disk thanks to the durable memory architecture. This allows executing distributed SQL operations across different memory layers achieving in-memory performance with durability of disk.
You can interact with Apache Ignite using SQL via natively developed APIs for Java, .NET, and C++, or via the Ignite JDBC or ODBC drivers. This provides true cross-platform connectivity from languages such as PHP, Ruby, and more.
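To keep this snippet self-contained, the ANSI-style DML statements of the kind described above are shown here against Python's built-in sqlite3 module as a stand-in engine — the table, columns, and values are hypothetical, and in practice you would issue the same statements through Ignite's JDBC/ODBC drivers or native APIs instead.

```python
# Stand-in illustration of ANSI-style DML (SELECT/INSERT/UPDATE/DELETE)
# using the in-process sqlite3 module. The schema is made up for the
# example; against Ignite these statements would go over JDBC/ODBC.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, "
            "name TEXT, population INTEGER)")
cur.execute("INSERT INTO city (id, name, population) "
            "VALUES (1, 'New York', 8000000)")
cur.execute("UPDATE city SET population = 8400000 WHERE name = 'New York'")
cur.execute("SELECT population FROM city WHERE id = 1")
population = cur.fetchone()[0]
print(population)  # -> 8400000
cur.execute("DELETE FROM city WHERE id = 1")
remaining = cur.execute("SELECT COUNT(*) FROM city").fetchone()[0]
```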
Ignite In-Memory Compute Grid allows executing distributed computations in a parallel fashion to gain high performance, low latency, and linear scalability. The compute grid provides a set of simple APIs that let users distribute computations and data processing across multiple nodes in the cluster.
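The fork/join pattern behind such a compute grid can be sketched locally: split a job into per-chunk tasks, run them in parallel, and reduce the partial results. The word-counting job and worker count below are invented for the example; Ignite distributes the tasks across cluster nodes rather than local threads.

```python
# Toy fork/join computation: partition the work, "fork" tasks to parallel
# workers, then "join" (reduce) the partial results -- the same pattern a
# compute grid applies across cluster nodes. Illustrative only.
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    # Each worker processes only its own partition of the input.
    return sum(len(line.split()) for line in chunk)

lines = ["the quick brown fox", "jumps over", "the lazy dog"] * 100
chunks = [lines[i::4] for i in range(4)]            # partition the work

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(count_words, chunks))  # fork to workers

total = sum(partials)                               # join / reduce step
print(total)  # -> 900
```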
Disk-centric systems, such as RDBMS or NoSQL databases, generally utilize the classic client-server approach: the data is brought from the server to the client side, where it gets processed and is then usually discarded. This approach does not scale well, as moving data over the network is the most expensive operation in a distributed system.
A much more scalable approach is collocated processing that reverses the flow by bringing the computations to the servers where the data actually resides. This approach allows you to execute advanced logic or distributed SQL with JOINs exactly where the data is stored avoiding expensive serialization and network trips.
https://ignite.apache.org/collocatedprocessing.html
Collocation of computations with data minimizes data serialization over the network and can significantly improve the performance and scalability of your application. Whenever possible, you should make a best effort to collocate your computations with the cluster nodes that cache the data to be processed.
Let's assume that a blizzard is approaching New York City. As a telecommunications company, you have to warn everyone by sending a text message with precise instructions on how to behave during such weather conditions. There are around 8 million New Yorkers in your database who have to receive the message.
With the client-server approach, the company has to connect to the database and move all 8 million (!) records to a client application that texts everyone. This is highly inefficient and wastes the network and computational resources of the company's IT infrastructure.
However, if the company initially collocates each city it covers with the people who live there, it can instead send a single computation (!) to the cluster node that stores information about all New Yorkers and send the text messages from there. This approach avoids moving 8 million records over the network and puts the cluster's resources to work on the computation. That's collocated processing in action!
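The quantitative gap in the blizzard scenario can be made concrete with a back-of-the-envelope sketch. The record size and closure size below are assumptions chosen for illustration, not measured figures.

```python
# Toy contrast between the client-server flow (ship every record to the
# client) and collocated processing (ship one small closure to the node
# owning the data). Record and closure sizes are assumed for the example.
records_per_city = {"new_york": 8_000_000, "boston": 700_000}

def client_server_bytes(city, bytes_per_record=100):
    # Every record crosses the network before any processing happens.
    return records_per_city[city] * bytes_per_record

def collocated_bytes(closure_size=1_000):
    # Only the computation itself crosses the network.
    return closure_size

print(client_server_bytes("new_york"))  # -> 800000000 bytes moved
print(collocated_bytes())               # -> 1000 bytes moved
```

Under these assumptions, collocated processing moves roughly six orders of magnitude less data over the network for the New York case — which is why it scales where the client-server flow does not.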
https://github.com/techbysample/gagrid
GA Grid (Beta) is an in-memory Genetic Algorithm (GA) component for Apache Ignite. A GA is a method of solving optimization problems by simulating the process of biological evolution. GA Grid provides a distributed GA library built on top of the mature and scalable Apache Ignite platform. GAs are excellent for searching through large and complex data sets for an optimal solution. Real-world applications of GAs include automotive design, computer gaming, robotics, investments, traffic/shipment routing, and more.
Glossary
Chromosome is a sequence of Genes. A Chromosome represents a potential solution.
Crossover is the process in which the genes within chromosomes are combined to derive new chromosomes.
Fitness Score is a numerical score that measures the value of a particular Chromosome (i.e., solution) relative to the other Chromosomes in the population.
Gene is a discrete building block that makes up a Chromosome.
Genetic Algorithm (GA) is a method of solving optimization problems by simulating the process of biological evolution. A GA continuously enhances a population of potential solutions. With each iteration, a GA selects the 'best fit' individuals from the current population to create offspring for the next generation. After subsequent generations, a GA will "evolve" the population toward an optimal solution.
Mutation is the process where genes within a chromosome are randomly updated to produce new characteristics.
Population is the collection of potential solutions or Chromosomes.
Selection is the process of choosing candidate solutions (Chromosomes) for the next generation.
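The glossary terms above fit together in a short, self-contained sketch: a population of bit-string chromosomes evolves toward the all-ones solution (the classic "OneMax" toy problem). All parameters — population size, mutation rate, generation count — are illustrative choices, and this single-process loop stands in for what GA Grid runs in distributed fashion on Ignite.

```python
# Minimal GA sketch using the glossary's terms: chromosomes are bit
# strings, fitness = number of 1-bits, with selection, crossover, and
# mutation each generation. Parameters are illustrative only.
import random

GENES = 12           # genes per chromosome
POP = 20             # population size
GENERATIONS = 60

def fitness(chrom):
    return sum(chrom)                       # fitness score: count of 1s

def crossover(a, b):
    cut = random.randrange(1, GENES)        # single-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.05):
    # Randomly flip genes to introduce new characteristics.
    return [g ^ 1 if random.random() < rate else g for g in chrom]

random.seed(42)
population = [[random.randint(0, 1) for _ in range(GENES)]
              for _ in range(POP)]

for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population as parents.
    parents = sorted(population, key=fitness, reverse=True)[:POP // 2]
    elite = parents[0]                      # elitism: best survives as-is
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - 1)]
    population = [elite] + children

best = max(population, key=fitness)
print(fitness(best))
```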
DEMO: run several ML samples from the standard distribution.
Main benefits:
No ETL – online “in place” ML
In-memory speed & scale
Large scale parallelization
Optimized ML/DL algorithms
Last-mile GPU optimization
The rationale for building ML Grid is quite simple. Many users employ Ignite as the central high-performance storage and processing system for various data sets. If they wanted to perform ML or Deep Learning (DL) on these data sets (i.e., training or model inference), they first had to ETL them into other systems such as Apache Mahout or Apache Spark.
The roadmap for ML Grid is to start with a core algebra implementation based on Ignite's collocated distributed processing. The initial version was released with Ignite 2.0. Future releases will introduce custom DSLs for Python, R, and Scala, a growing collection of optimized ML algorithms such as Linear and Logistic Regression, Decision Tree/Random Forest, SVM, and Naive Bayes, as well as support for Ignite-optimized Neural Networks and integration with TensorFlow.
The current beta version of Apache Ignite Machine Learning Grid (ML Grid) provides a distributed machine learning library built on top of the highly optimized and scalable Apache Ignite platform, implementing local and distributed vector and matrix algebra operations as well as distributed versions of widely used algorithms.
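A distributed algebra primitive like the one ML Grid provides can be sketched as a map-reduce over vector partitions: each "node" computes a local partial result over its slice, and the partials are reduced into the final answer. The vectors and partition count below are made up for the example.

```python
# Sketch of a distributed vector dot product: each "node" holds one
# partition of the two vectors, computes a local partial product, and
# the partials are reduced to the final result. Illustrative only.
x = list(range(1, 9))          # [1, 2, ..., 8]
y = [2] * 8

def split(v, parts):
    # Partition a vector into equal-sized contiguous slices.
    k = len(v) // parts
    return [v[i * k:(i + 1) * k] for i in range(parts)]

def local_dot(xs, ys):
    # Per-node work: dot product over the local partition only.
    return sum(a * b for a, b in zip(xs, ys))

partials = [local_dot(xs, ys)
            for xs, ys in zip(split(x, 4), split(y, 4))]
print(sum(partials))  # -> 72
```

Because the per-partition work touches only locally owned data, this is the same collocated-processing principle described earlier, applied to linear algebra.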