October 2013 HUG: GridGain In-Memory Acceleration

As Apache Hadoop adoption continues to advance, customers are depending more and more on Hadoop for critical tasks, and deploying Hadoop for use cases with more real-time requirements. In this session, we will discuss the desired performance characteristics of such a deployment and the corresponding challenges. Leave with an understanding of how performance-sensitive deployments can be accelerated using In-Memory technologies that merge the Big Data capabilities of Hadoop with the unmatched performance of In-Memory data management.

Published in: Technology

  • 1. In-Memory Accelerator for Hadoop™ #gridgain
  • 2. Hadoop: Pros and Cons
    What is Hadoop?
    > Hadoop is a batch system
    > HDFS - Hadoop Distributed File System
    > Data must be ETL-ed into HDFS
    > Parallel processing over data in HDFS
    > Hive, Pig, HBase, Mahout...
    Pros:
    > Scales very well
    > Fault tolerant and resilient
    > Very active and rich eco-system
    > Process TBs/PBs in parallel fashion
    > Most popular data warehouse
    Cons:
    > Complex deployment
    > Significant execution overhead
    > Batch oriented - real time not possible
    > HDFS is IO and network bound
  • 3. In-Memory Accelerator For Hadoop: Overview
    Up To 100x Faster:
    1. In-Memory File System
       > 100% compatible with HDFS
       > Boost HDFS performance by removing IO overhead
       > Dual-mode: standalone or caching
       > Blend into Hadoop ecosystem
    2. In-Memory MapReduce
       > Eliminate Hadoop MapReduce overhead
       > Allow for embedded execution
       > Record-based
  • 4. GridGain: In-Memory Computing Platform
  • 5. In-Memory Accelerator For Hadoop: Details
    > PnP Integration
      Minimal or zero code change
    > Any Hadoop distro
      Hadoop v1 and v2
    > In-Memory File System
      100% compatible with HDFS
      Dual-mode: no ETL needed, read/write-through
      Block-level caching & smart eviction
      Automatic pre-fetching
      Background fragmentizer
      On-heap and off-heap memory utilization
    > In-Memory MapReduce
      In-process co-located computations - access GGFS in-process
      Eliminate unnecessary IPC
      Eliminate long task startup time
      Eliminate mandatory sorting and re-shuffling on reduction
  • 6. GridGain Visor: Unified DevOps
    > File Manager
    > HDFS Profiler
  • 7. Benchmarks: GGFS vs HDFS
    > 10-node cluster of Dell R610
    > Each node has dual 8-core CPUs
    > Ubuntu 12.04, Java 7
    > 10 GbE network
    > Stock Apache Hadoop 2.x
  • 8. Comparison: Hadoop Accelerator vs. Spark
    Hadoop Accelerator:
    > No ETL required
      Automatic HDFS read-through and write-through
      Data is loaded on demand
    > Per-block file caching
      Only hot data blocks are in memory
    > Strong management capabilities
      GridGain Visor - Unified DevOps
    Spark:
    > Requires data ETL-ed into Spark
      Changes to data do not get propagated to HDFS
      Explicit ETL step consumes time
    > Needs to have full file loaded
      If does not fit - gets offloaded to disk
    > No management capabilities
  • 9. Customer Use Case: Task & Challenge
    Task:
    > Real time search with MapReduce
    > Dataset size is 5TB
    > Writes 80%, reads 20%
    > Perceptual real time SLA (few seconds)
    Challenge:
    > Hadoop MapReduce too slow (> 30 sec)
    > Data scanning slow due to constant IO
    > Overall job takes > 1 minute
  • 10. Customer Use Case: Solution
    > Utilize existing servers
      Start GridGain data node on every server
    > Only put highly utilized files in GGFS
      User controlled caching
    > In-Memory MapReduce over GGFS
      Embedded processing
    > Results under 3 seconds
  • 11. @gridgain
    GridGain Systems, 1065 East Hillsdale Blvd, Suite 230, Foster City, CA 94404
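
The caching behavior listed on slides 5 and 8 (read-through and write-through to HDFS, per-block caching, data loaded on demand, only hot blocks kept in memory) can be illustrated with a small sketch. This is not GridGain's implementation and not the GGFS API: the `BlockCache` class, its method names, and the tiny block size are hypothetical, a plain dict stands in for HDFS, and least-recently-used eviction is assumed only because the slides say "smart eviction" without naming a policy.

```python
from collections import OrderedDict


class BlockCache:
    """Toy per-block cache in front of a slower store (a dict standing in for HDFS)."""

    def __init__(self, backing_store, block_size=4, max_blocks=2):
        self.store = backing_store    # path -> bytearray; hypothetical HDFS stand-in
        self.block_size = block_size  # bytes per cached block
        self.max_blocks = max_blocks  # memory budget, counted in blocks
        self.blocks = OrderedDict()   # (path, block_index) -> bytes, oldest first
        self.hits = 0
        self.misses = 0

    def _put(self, key, data):
        # Insert or refresh a block, then evict least-recently-used blocks
        # until the cache fits its budget again (assumed eviction policy).
        self.blocks[key] = data
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)

    def _load(self, path, index):
        key = (path, index)
        if key in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(key)  # mark block as recently used
            return self.blocks[key]
        # Read-through: on a miss, fetch just this block from the backing store.
        self.misses += 1
        start = index * self.block_size
        data = bytes(self.store[path][start:start + self.block_size])
        self._put(key, data)
        return data

    def read(self, path, offset, length):
        """Read a byte range, loading only the blocks that the range touches."""
        end = min(offset + length, len(self.store[path]))
        out = bytearray()
        pos = offset
        while pos < end:
            index = pos // self.block_size
            block_start = index * self.block_size
            block = self._load(path, index)
            lo = pos - block_start
            hi = min(self.block_size, end - block_start)
            out += block[lo:hi]
            pos = block_start + hi
        return bytes(out)

    def write_block(self, path, index, data):
        """Write-through: update the backing store first, then the cached block."""
        if len(data) > self.block_size:
            raise ValueError("data larger than one block")
        buf = self.store.setdefault(path, bytearray())
        start = index * self.block_size
        if len(buf) < start:
            buf.extend(b"\0" * (start - len(buf)))  # pad a sparse file with zeros
        buf[start:start + len(data)] = data
        self._put((path, index), bytes(buf[start:start + self.block_size]))
```

With a 12-byte file, 4-byte blocks, and room for two blocks, reading bytes 2 to 5 loads blocks 0 and 1 on demand; repeating the read is served from memory; touching block 2 evicts block 0; and `write_block` keeps the backing store and the cache in agreement, which is the property slide 8 contrasts with an explicit ETL step.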