GridGain & Hadoop: Differences & Synergies


Overview of differences and synergies between GridGain's In-Memory Data Platform and Hadoop.

Published in: Technology

Technical Brief
GridGain Systems, November 2012

Overview

This paper explains how Hadoop and GridGain differ and how they complement each other. It compares the main concepts of each product.

Hadoop is increasingly seen as an attractive platform for integrating and analyzing data from multiple sources, especially when traditional databases hit their limits. It provides a convenient and fast way to integrate and store data with different structures, which is then batch processed for later analysis.

As more companies realize the competitive advantage they gain from these insights, they are looking for solutions that offer faster analytic capabilities. Instead of waiting for results from batch jobs running overnight or in off-hours, they want to use their data in real time to maximize its business value and to enable additional real-time functionality for internal or client-facing systems.

Hadoop today is used in situations where high write speeds and the unstructured integration of data matter most, so its lack of ACID transactions and the latencies involved in data processing have not mattered much. However, with the focus now on real-time processing and live data analytics, companies are looking for better ways to process live data in real time.

GridGain is a modern platform designed specifically for the high-performance storage and processing of data in memory. It handles the processing of both transactional and non-transactional live data with very low latencies. GridGain typically resides between business, analytics, transactional or BI applications on one side and long-term data storage such as RDBMS, ERP or Hadoop HDFS on the other side.

As a Java-based middleware for distributed in-memory processing, GridGain integrates a fast in-memory MapReduce implementation with its advanced in-memory data grid technology. It provides companies with a complete platform for real-time processing and analytics, and it can also be integrated into their existing architecture, databases or Hadoop data stores.

GridGain can process terabytes of data, on thousands of nodes, in real time. Its architecture has been created to integrate well with traditional databases or unstructured data stores. It is a solution that scales.

GridGain In-Memory Compute Grid vs Hadoop MapReduce

MapReduce is a programming model developed by Google for processing large data sets stored on disk. Hadoop MapReduce is an implementation of this model. The model is based on the fact that the data in a single file can be distributed across multiple nodes, and hence the processing of those files has to be co-located on the same nodes to avoid moving data around. The processing consists of scanning files record by record in parallel on multiple nodes and then reducing the results, also in parallel on multiple nodes.
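The scan-then-reduce model described above can be pictured with a minimal plain-Java sketch. It does not use the Hadoop API: each inner list is an assumed stand-in for the part of a file held by one node, and the class and method names (`ScanMapReduceSketch`, `mapPartition`, `reduce`, `wordCount`) are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ScanMapReduceSketch {

    // Map step: scan every record of one partition and emit partial word counts.
    static Map<String, Integer> mapPartition(List<String> records) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String record : records) {
            for (String word : record.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    partial.merge(word, 1, Integer::sum);
                }
            }
        }
        return partial;
    }

    // Reduce step: merge the partial counts produced by all partitions.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    // Partitions are scanned in parallel (threads stand in for cluster nodes).
    static Map<String, Integer> wordCount(List<List<String>> partitions) {
        List<Map<String, Integer>> partials = partitions.parallelStream()
            .map(ScanMapReduceSketch::mapPartition)
            .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("hadoop stores data", "data on disk"),
            Arrays.asList("gridgain keeps data in memory"));
        System.out.println(wordCount(partitions));
    }
}
```

Every record is touched exactly once, which is why the model suits whole-file analysis; nothing in it offers a way to jump straight to one particular record.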
Because it works by scanning whole files, standard disk-based MapReduce is good for problem sets that require analyzing every single record in a file, and it is a poor fit for cases where direct access to a particular data record is required. Furthermore, because of Hadoop's offline batch orientation, it is not suited to low-latency applications.

GridGain In-Memory Compute Grid (IMCG), on the other hand, is geared towards in-memory computations and very low latencies. GridGain IMCG has its own implementation of MapReduce, which is designed specifically for real-time in-memory processing use cases and is very different from the Hadoop one. Its main goal is to split a task into multiple sub-tasks, load balance those sub-tasks among the available cluster nodes, execute them in parallel, then aggregate the results from those sub-tasks and return them to the user.

Splitting a task into multiple sub-tasks and assigning them to nodes is the mapping step, and aggregating the results is the reducing step. However, no concept of mandatory data is built into this design, and it can work in the absence of any data at all, which makes it a good fit for both stateless and stateful computations, like traditional HPC. In cases when data is present, GridGain IMCG will also automatically colocate computations with the nodes where the data resides, to avoid redundant data movement.

It is also worth mentioning that, unlike Hadoop, GridGain IMCG is very well suited to computations that are very short-lived in nature, e.g. below 100 milliseconds, and that may not require any mapping or reducing.

Here is a simple Java example of GridGain IMCG that counts the number of letters in a phrase by splitting it into words, assigning each word to a sub-task for parallel remote execution in the map step, and then adding up the lengths received from the remote jobs in the reduce step:

    // g is the grid instance the task is submitted to.
    int letterCount = g.reduce(
        BALANCE,
        // Mapper: each word becomes a sub-task that returns the word's length.
        new GridClosure<String, Integer>() {
            @Override public Integer apply(String s) {
                return s.length();
            }
        },
        // Input: one word per sub-task.
        Arrays.asList("GridGain Letter Count".split(" ")),
        // Reducer: sums the lengths returned by the sub-tasks.
        F.sumIntReducer());

GridGain In-Memory Data Grid vs Hadoop Distributed File System

Hadoop Distributed File System (HDFS) is designed for storing large amounts of data in files on disk. As in any file system, the data is mostly stored in textual or binary formats. Finding a single record inside an HDFS file requires a file scan. Also, because HDFS is distributed in nature, updating a single record within a file requires copying the whole file (files in HDFS can only be appended to). This makes HDFS well suited for cases when data is appended at the end of a file, but not for cases when data needs to be located and/or updated in the middle of a file. With indexing technologies like HBase or Impala, data access becomes somewhat easier because keys can be indexed, but the inability to index into values (secondary indexes) allows only primitive query execution.

GridGain In-Memory Data Grid (IMDG), on the other hand, is an in-memory key-value data store. IMDGs have their roots in distributed caching; however, GridGain IMDG also adds transactions, data partitioning, and SQL querying to cached data. The main difference from HDFS (or the Hadoop ecosystem overall) is the ability to transact on and update any data directly in real time. This makes GridGain IMDG well suited for working on operational data sets, that is, data sets that are currently being updated and queried, while HDFS is suited for working on historical data that is constant and will never change.

Unlike a file system, GridGain IMDG works with the user's domain model by directly caching user application objects. Objects are accessed and updated by key, which allows IMDG to work with volatile data that requires direct key-based access.
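The contrast between scanning an append-only file and accessing a key-value store by key can be illustrated with a small plain-Java sketch. It uses neither the HDFS nor the GridGain API: `AppendOnlyFile` is a hypothetical in-memory stand-in for a file that can only be appended to, and a `ConcurrentHashMap` stands in for a key-value store. Keys are assumed not to contain the `=` separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AccessPatternSketch {

    // Hypothetical stand-in for a file that only supports appends.
    static final class AppendOnlyFile {
        private final List<String> lines = new ArrayList<>();

        void append(String key, String value) {
            lines.add(key + "=" + value);
        }

        // Finding one record requires scanning the lines from the start.
        String find(String key) {
            for (String line : lines) {
                int sep = line.indexOf('=');
                if (line.substring(0, sep).equals(key)) {
                    return line.substring(sep + 1);
                }
            }
            return null;
        }

        // Changing a record in the middle means writing a complete new copy.
        AppendOnlyFile copyWithUpdate(String key, String newValue) {
            AppendOnlyFile copy = new AppendOnlyFile();
            for (String line : lines) {
                int sep = line.indexOf('=');
                String k = line.substring(0, sep);
                copy.append(k, k.equals(key) ? newValue : line.substring(sep + 1));
            }
            return copy;
        }
    }

    public static void main(String[] args) {
        AppendOnlyFile file = new AppendOnlyFile();
        file.append("order-1", "NEW");
        file.append("order-2", "NEW");
        AppendOnlyFile rewritten = file.copyWithUpdate("order-1", "SHIPPED");
        System.out.println(rewritten.find("order-1"));

        // Key-value store: read and update a single entry directly by key.
        Map<String, String> store = new ConcurrentHashMap<>();
        store.put("order-1", "NEW");
        store.put("order-1", "SHIPPED"); // in-place update, no copy, no scan
        System.out.println(store.get("order-1"));
    }
}
```

The file-based path touches every record for one lookup or update, while the map-based path touches only the entry for the requested key.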
GridGain IMDG allows indexing into both keys and values (i.e. primary and secondary indices) and supports native SQL for data querying and processing. One of the unique features of GridGain IMDG is its support for distributed joins, which makes it possible to execute complex SQL queries on in-memory data without limitations.

GridGain and Hadoop Working Together

To summarize: Hadoop is essentially a Big Data warehouse that is good for batch processing of historic data that never changes, while GridGain is an In-Memory Data Platform that works with your current operational data set in a transactional fashion with very low latencies. Because they focus on very different use cases, GridGain and Hadoop complement each other well.
Up-Stream Integration

The diagram above shows the integration between GridGain and Hadoop. Here, the GridGain In-Memory Compute Grid and Data Grid work directly in real time with the user application by partitioning and caching data within the data grid and executing in-memory computations and SQL queries on it. Every so often, when data becomes historic, it is snapshotted into HDFS, where it can be analyzed using Hadoop MapReduce and analytical tools from the Hadoop ecosystem.

Down-Stream Integration

Another way to integrate covers cases when data is already stored in HDFS but needs to be loaded into IMDG for faster in-memory processing. For such cases, GridGain provides fast loading mechanisms from HDFS into GridGain IMDG, where the data can be further analyzed using GridGain in-memory MapReduce and indexed SQL queries.

Conclusion

Integration between an in-memory data platform like GridGain and a disk-based data platform like Hadoop allows businesses to get valuable insights into the whole data set at once, including the volatile operational data set cached in memory as well as the historic data set stored in Hadoop. This essentially eliminates the gaps in processing time caused by the Extract-Transform-Load (ETL) process of copying data from operational systems of record, like standard databases, into historic data warehouses like Hadoop. Data can now be analyzed and processed at any point of its lifecycle, from the moment it enters the system until it is put away into a warehouse.
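The up-stream and down-stream flows described in the integration sections can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it uses neither the GridGain nor the HDFS API, the `archive` list is a stand-in for the historic store, removing snapshotted events from memory is an assumed policy that the brief does not specify, and all names (`TieringSketch`, `Event`, `snapshotOlderThan`, `loadFromArchive`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TieringSketch {

    static final class Event {
        final long id;
        final long timestamp;
        final String payload;

        Event(long id, long timestamp, String payload) {
            this.id = id;
            this.timestamp = timestamp;
            this.payload = payload;
        }
    }

    // Operational data set, held in memory and keyed by event id.
    final Map<Long, Event> memory = new TreeMap<>();
    // Stand-in for the historic, append-only store.
    final List<Event> archive = new ArrayList<>();

    void put(Event e) {
        memory.put(e.id, e);
    }

    // Up-stream: events older than the cutoff are appended to the archive.
    // Evicting them from memory afterwards is an assumption of this sketch.
    int snapshotOlderThan(long cutoff) {
        int moved = 0;
        Iterator<Event> it = memory.values().iterator();
        while (it.hasNext()) {
            Event e = it.next();
            if (e.timestamp < cutoff) {
                archive.add(e);
                it.remove();
                moved++;
            }
        }
        return moved;
    }

    // Down-stream: archived events in [from, to) are loaded back into memory.
    int loadFromArchive(long from, long to) {
        int loaded = 0;
        for (Event e : archive) {
            if (e.timestamp >= from && e.timestamp < to) {
                memory.put(e.id, e);
                loaded++;
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        TieringSketch store = new TieringSketch();
        store.put(new Event(1, 10, "old"));
        store.put(new Event(2, 200, "recent"));
        System.out.println("snapshotted: " + store.snapshotOlderThan(100));
        System.out.println("reloaded: " + store.loadFromArchive(0, 50));
    }
}
```

In this picture the same record is reachable at every stage of its lifecycle: by key while it is operational, and by scan or reload once it has become historic.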