November 2013 HUG: Real-time analytics with in-memory grid



Published in Technology, Business
  • 1. Enabling Real-Time Analytics Using Hadoop Map/Reduce
    Hadoop Users Group, November 20, 2013
    Bill Bain, CEO
    Copyright © 2013 by ScaleOut Software, Inc.
  • 2. Agenda
    • Quick Review of In-Memory Data Grids
    • The Need for Real-Time Analytics: Two Use Cases
    • Data-Parallel Computation on an IMDG Using Parallel Method Invocation (PMI)
    • Implementing MapReduce Using PMI: ScaleOut hServer™
    • Sample Use Cases
    • Video Demo
    • Comparison to Spark
  • 3. About ScaleOut Software
    • Develops and markets In-Memory Data Grids, software middleware for:
      • Scaling application performance
      • Performing real-time analytics using in-memory data storage and computing
    • Dr. William Bain, Founder & CEO
      • Career focused on parallel computing: Bell Labs, Intel, Microsoft
      • 3 prior start-ups; the last was acquired by Microsoft, and its product now ships as Network Load Balancing in Windows Server
    • Eight years in the market; 400 customers, 9,000 servers
  • 4. What is an In-Memory Data Grid?
    In-memory storage for fast updates and retrieval of live data
    • Fits in the business logic layer:
      • Follows object-oriented view of data (vs. relational view).
      • Stores collections of Java/.NET objects shared by multiple clients.
      • Uses create/read/update/delete and query APIs to access data.
    • Implemented across a cluster of servers or VMs:
      • Scales storage and throughput by adding servers.
      • Provides high availability in case a server fails.
  • 5. Our Focus: Real-Time Analytics
    • Real-time (“Operational Intelligence”):
      • Live data sets
      • Gigabytes to terabytes
      • In-memory storage
      • Minutes to seconds
      • Best uses: tracking live data; immediately identifying trends and capturing opportunities
    • Batch (“Business Intelligence”):
      • Static data sets
      • Petabytes
      • Disk storage
      • Hours to minutes
      • Best uses: analyzing warehoused data; mining for long-term trends
    • Diagram: Big Data Analytics, divided into Real-Time and Batch. Labels shown: Analytics Server, hServer, Hadoop, IBM, Teradata, SAS, SAP.
  • 6. Online Systems Need Real-Time Analysis
    A few examples:
    • Equity trading: to minimize risk during a trading day
    • Ecommerce: to optimize real-time shopping activity
    • Reservations systems: to identify issues, reroute, etc.
    • Credit cards: to detect fraud in real time
    • Smart grids: to optimize power distribution & detect issues
  • 7. Integrate MapReduce into IMDG for Real-Time Analytics
    Benefits:
    • Enables use of widely used Hadoop MapReduce APIs.
    • Accelerates data access by staging data in memory.
    • Eliminates batch scheduling and data shuffling overheads of standard Hadoop distributions.
    • Analyzes and updates live data.
    • Enables Hadoop deployment in live systems.
    • Hadoop MapReduce programs run without change.
    • ScaleOut’s implementation is called ScaleOut hServer™.
  • 8. Data-Parallel Analysis Is Not New
    • 1980s: Special Purpose Hardware: “SIMD”
      • Pictured: Thinking Machines Connection Machine 5
    • 1990s: General Purpose Parallel Supercomputers: “Domain Decomposition”, “SPMD”
      • Pictured: Intel iPSC/2, IBM SP1
  • 9. Data-Parallel Analysis Is Not New
    • 1990s to early 2000s: HPC on Clusters: “MPI”
      • Pictured: HP Blade Servers
    • Since 2003: Clusters, the Cloud, and IMDGs: “MapReduce”
      • Pictured: Amazon EC2, Windows Azure
  • 10. Parallel Method Invocation
    • Basic, well understood model of data-parallel computation
    • Implemented for use on objects hosted in IMDGs:
      • Executes user’s code in parallel across the grid.
      • Uses parallel query to select objects for analysis.
    • Diagram: Analyze Data (Eval), then Combine Results (Merge). The In-Memory Data Grid runs the data-parallel analysis.
  • 11. Running the Analysis
    The parallel analysis executes in three steps:
    • Step 1: The application first selects all relevant objects in the collection with a parallel query run on all grid servers.
      • Note: Query spec matches data’s object-oriented properties.
  • 12. Running the Analysis: Step 2
    • Step 2: The IMDG automatically schedules analysis operations across all grid servers and cores.
      • The analysis runs on all objects selected by the parallel query.
      • Each grid server analyzes its locally stored objects to minimize data motion.
    • Parallel execution ensures fast completion time:
      • IMDG automatically distributes workload across servers/cores.
      • Scaling the IMDG automatically handles larger data sets.
  • 13. Running the Analysis: Step 3
    • Step 3: The IMDG automatically merges all analysis results.
      • The IMDG first merges all results within each grid server in parallel.
      • It then merges results across all grid servers to create one combined result.
    • Efficient parallel merge minimizes the delay in combining all results.
    • The IMDG delivers the combined result to the trader’s display as one object.
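The three steps (query, eval, merge) can be imitated on a single machine with plain Java streams. The sketch below is only an illustration under stated assumptions: `PmiSketch`, `Stock`, `invoke`, and `selectedValue` are hypothetical names, each inner list stands in for the objects held by one grid server, and none of this is the ScaleOut API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

// Local stand-in for a grid: each inner list plays the role of the objects
// stored on one grid server. None of this is the ScaleOut API.
final class PmiSketch {

    // Hypothetical stock object; the field names are illustrative only.
    static final class Stock {
        final String ticker;
        final double price;
        final long shares;

        Stock(String ticker, double price, long shares) {
            this.ticker = ticker;
            this.price = price;
            this.shares = shares;
        }
    }

    // merge must be associative and identity must be neutral for merge,
    // otherwise the parallel reduction is not well defined.
    static <T, R> R invoke(List<List<T>> servers,
                           Predicate<T> query,
                           Function<T, R> eval,
                           BinaryOperator<R> merge,
                           R identity) {
        return servers.parallelStream()
                .map(local -> local.stream()
                        .filter(query)             // Step 1: select matching objects
                        .map(eval)                 // Step 2: analyze local objects
                        .reduce(identity, merge))  // Step 3a: merge within one server
                .reduce(identity, merge);          // Step 3b: merge across servers
    }

    // Total market value (price * shares) of the stocks with the given tickers.
    static double selectedValue(List<List<Stock>> servers, String... tickers) {
        List<String> wanted = Arrays.asList(tickers);
        return invoke(servers,
                stock -> wanted.contains(stock.ticker),
                stock -> stock.price * stock.shares,
                Double::sum,
                0.0);
    }

    // Two simulated servers holding three made-up stock objects.
    static List<List<Stock>> sampleGrid() {
        return Arrays.asList(
                Arrays.asList(new Stock("GOOG", 10.0, 3), new Stock("IBM", 5.0, 7)),
                Arrays.asList(new Stock("ORCL", 2.5, 4)));
    }

    public static void main(String[] args) {
        // 10.0 * 3 + 2.5 * 4 = 40.0; IBM is filtered out by the query.
        System.out.println(selectedValue(sampleGrid(), "GOOG", "ORCL"));
    }
}
```

Merging within each simulated server before merging across servers mirrors the two-level merge on the slide: only one partial result per server has to be combined in the final step.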
  • 14. Sample Performance Results for PMI
    Optimizing a stock trading platform with real-time analysis:
    • IMDG hosted in Amazon cloud using 75 servers.
    • IMDG holds 1 TB of stock history data in memory.
    • IMDG handles continuous stream of updates (1.1 GB/s).
    • IMDG performs real-time analysis on live data.
    • Entire data set analyzed in 4.1 seconds (250 GB/s).
    • IMDG scales linearly as workload grows.
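The quoted scan rate follows from the data size and the analysis time. As a quick check, assuming the slide counts 1 TB as 1024 GB (an assumption; with 1 TB = 1000 GB the result would be about 243.9 GB/s):

```latex
\frac{1\ \text{TB}}{4.1\ \text{s}}
  = \frac{1024\ \text{GB}}{4.1\ \text{s}}
  \approx 249.8\ \text{GB/s}
  \approx 250\ \text{GB/s}
```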
  • 15. Implementing Real-Time MapReduce
    • Goal: Run MapReduce applications from a remote workstation.
    • The IMDG automatically builds an “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars.
    • The invocation grid can be reused to shorten startup time.
    • Use PMI to implement MapReduce.
  • 16. Accelerating MapReduce Execution
    PMI is the foundation of fast execution time:
    • Data can be input from either the IMDG or an external data source.
      • Works with any input/output format compatible with the Apache distribution.
    • ScaleOut IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
      • Eliminates batch scheduling overhead.
    • Intermediate results are stored within the IMDG.
      • Minimizes data motion between the mappers and reducers.
      • Allows optional sorting.
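To make the data flow concrete, here is a minimal single-JVM sketch of a word count whose intermediate key/value pairs never leave memory. It is an illustration only, not how hServer is implemented: `InMemoryMapReduceSketch`, `mapLine`, `reduceCounts`, and `run` are hypothetical names, each inner list stands in for one server's input, and a `TreeMap` plays the role of the optional sort.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Single-JVM illustration of map, in-memory grouping, and reduce.
// The class and method names are hypothetical, not part of Hadoop or hServer.
final class InMemoryMapReduceSketch {

    // Mapper: emit (word, 1) for every whitespace-separated word in a line.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reducer: sum all counts collected for one key.
    static int reduceCounts(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<List<String>> partitions) {
        // Map phase: the partitions are processed in parallel.
        List<Map.Entry<String, Integer>> pairs = partitions.parallelStream()
                .flatMap(partition -> partition.stream())
                .flatMap(line -> mapLine(line).stream())
                .collect(Collectors.toList());

        // Intermediate results stay in memory, grouped by key.
        // TreeMap keeps the keys sorted (the optional sorting step).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce phase: one reducer call per key.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduceCounts(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = new ArrayList<>();
        partitions.add(List.of("a b a"));
        partitions.add(List.of("b c"));
        System.out.println(run(partitions)); // prints {a=2, b=2, c=1}
    }
}
```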
  • 17. Only One-Line Code Change
    ScaleOut hServer subclasses the Hadoop Job class:

        // This job will run using the Hadoop job tracker:
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
        }

        // This job will run using ScaleOut hServer:
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new HServerJob(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
        }
  • 18. Accessing IMDG Data for M/R
    • IMDG adds grid input format for accessing key/value pairs held in the IMDG.
    • MapReduce programs optionally can output results to IMDG with grid output format.
    • Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
    • Applications can access and update key/value pairs as operational data during analysis.
  • 19. Optimized In-Memory Storage
    Multiple in-memory storage models:
    • Named cache, optimized for rich semantics:
      • Property-based query
      • Distributed locking
      • Access from remote grids
    • Named map, optimized for efficient storage and bulk analysis:
      • Highly efficient object storage
      • Pipelined, bulk-access mechanisms
  • 20. Example: Ecommerce: Inventory Management
    Fast map/reduce reconciles inventory and order systems for an online retailer:
    • Challenge: Inventory and online order management are handled by different applications.
      • Reconciled once per day.
      • Inaccurate orders reduce margins.
    • Solution:
      • Host SKUs in IMDG updated in real time by order & inventory systems.
      • Use PMI to reconcile in two minutes.
    • Results: Real-time reconciliation ensures accurate orders.
  • 21. Example in Financial Services
    Integrate analysis into a stock trading platform:
    • The IMDG holds market data and hedging strategies.
    • Updates to market data continuously flow through the IMDG.
    • The IMDG performs repeated map/reduce analysis on hedging strategies and alerts traders in real time.
    • IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.
  • 22. Demo
    • Video Link
  • 23. Comparison to Spark
    • Spark is intended to accelerate data analysis using in-memory computing.
    • ScaleOut’s IMDG provides standard MapReduce for “live” systems.

    Feature                   Spark                            ScaleOut IMDG
    New MapReduce engine      Yes                              Yes
    In-memory data storage    Resilient Distributed Datasets   Distributed Objects
    Load/store from HDFS      Yes                              Yes
    Avoid disk access         Yes                              Yes
    CRUD on live data         No                               Yes
    Query on properties       No                               Yes
    High availability         Rebuild on failure               Replication and failover
    Extensibility             Additional operators             PMI methods
    Open source               Yes                              Hybrid
  • 24. Summary
    • Online systems need to analyze “live” data in real time.
    • MapReduce has traditionally focused on analyzing large, static (offline) datasets held in file systems.
    • An in-memory data grid (IMDG) can accelerate MapReduce applications, enabling real-time analytics:
      • Enables the application to analyze and update live data.
      • Leverages the IMDG’s load-balanced placement of data.
      • Avoids batch-scheduled startup delays.
      • Avoids data motion from secondary storage.
    • MapReduce can be implemented using standard data-parallel computing techniques (“parallel method invocation”):
      • Tightly integrates Map/Reduce engine with the IMDG.
      • Accelerates Map/Reduce execution by >20X in benchmark tests.
  • 25. Accelerating Start-Up Times
    • The invocation grid can be re-used across MapReduce jobs:

        public static void main(String argv[]) throws Exception {
            // Configure and load the invocation grid
            InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
                    // Add JAR files as IG dependencies
                    .addJar("main-job.jar")
                    .addJar("first-library.jar")
                    // Add classes as IG dependencies
                    .addClass(MyMapper.class)
                    .addClass(MyReducer.class)
                    // Define custom JVM parameters
                    .setJVMParameters("-Xms512M -Xmx1024M")
                    .load();

            // Run 10 jobs on the same invocation grid
            for (int i = 0; i < 10; i++) {
                Configuration conf = new Configuration();
                // The preloaded invocation grid is passed as the parameter to the job
                Job job = new HServerJob(conf, "Job number " + i, false, grid);
                // ......Configure the job here.........
                // Run the job
                job.waitForCompletion(true);
            }

            // Unload the invocation grid when we are done
            grid.unload();
        }
  • 26. Targeted Use Cases
    • Run continuous Hadoop on live data, while it’s being updated.
      • “Capture perishable business opportunities and identify issues.”
      • Real-time risk analysis
      • Credit card fraud detection
      • ...
    • Accelerate Hadoop on static data with a one-line code change.
      • “Speed up Hadoop execution by >10X for faster business insights.”
      • Financial modeling
      • Process simulations
      • ...
    • Quickly prototype Hadoop code.
      • “Validate your Hadoop code before it goes into batch processing.”
      • No need to install Hadoop stack
      • Fast-turn debug and tuning
      • ...
  • 27. The Need for Real-Time Analytics
    • Many Use Cases:
      • Authorizations / Payment Processing / Mobile Payments
      • Operational Risk Compliance
      • Financial: Risk, P&L, Pricing
      • Execution Rules
      • Market Feed / Event Handlers
      • Churn Management
      • Situational Awareness
      • Fraud Detection
      • Real Time Tracking
      • Sensor Data / SCADA
      • Inventory Management
      • Service Activation
    • Across Key Industries:
      • Health Care
      • Government
      • Life Sciences
      • IC / DoD
      • Logistics
      • Manufacturing
      • Utilities
      • Retail
      • Telco
      • Financial
      • CPG
      • Law enforcement
  • 28. Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics
    • Typically used for very large, static, offline datasets
    • Data must be copied from disk-based storage (e.g., HDFS) into memory for analysis.
    • Hadoop Map/Reduce adds lengthy batch scheduling and data shuffling overhead.
  • 29. Hadoop Users Need Real-Time Analytics
    • ScaleOut Software conducted an informal survey at the Strata 2013 Conference (Santa Clara).
    • Based on 150 responses:
      • 78% of organizations generate fast-changing data.
      • 60% use Hadoop and 78% plan to expand usage of Hadoop within 12 months.
      • Only 42% consider Hadoop to be an effective platform for real-time analysis, but…
      • 93% would benefit from real-time data analytics.
      • 71% consider a 10X improvement in performance meaningful.
    • Take-away: Hadoop users need real-time analytics.
  • 30. Optional Caching of HDFS Data
    • ScaleOut hServer adds Dataset Record Reader (wrapper) to cache HDFS data during program execution.
    • Hadoop automatically retrieves data from ScaleOut IMDG on subsequent runs.
    • Dataset Record Reader stores and retrieves data with minimum network and memory overheads.
    • Tests with Terasort benchmark have demonstrated 11X faster access latency over HDFS without IMDG.
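The wrapper idea (read from slow storage once, serve later runs from memory) can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the Dataset Record Reader itself: `CachingReaderSketch` and its members are hypothetical, the `Function` stands in for an HDFS read, and the `HashMap` stands in for the IMDG.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Read-through cache around a slow record source. All names are hypothetical;
// the Function stands in for an HDFS read and the HashMap for the IMDG.
final class CachingReaderSketch {

    private final Function<String, List<String>> slowSource;
    private final Map<String, List<String>> cache = new HashMap<>();

    // Counts how often the slow source was actually consulted.
    int sourceReads = 0;

    CachingReaderSketch(Function<String, List<String>> slowSource) {
        this.slowSource = slowSource;
    }

    // Returns the records of one input split, reading the source only once.
    List<String> read(String split) {
        List<String> cached = cache.get(split);
        if (cached != null) {
            return cached;                 // later runs are served from memory
        }
        sourceReads++;
        List<String> records = new ArrayList<>(slowSource.apply(split));
        cache.put(split, records);         // first run populates the cache
        return records;
    }

    public static void main(String[] args) {
        CachingReaderSketch reader = new CachingReaderSketch(
                split -> Arrays.asList(split + "-r1", split + "-r2"));
        reader.read("split0");
        reader.read("split0");
        System.out.println("source reads: " + reader.sourceReads); // prints source reads: 1
    }
}
```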
  • 31. Java Example: Parallel Method Invocation
    • Create a method to analyze each queried stock object and another method to pair-wise merge the results:

        public class StockAnalysis
                implements Invokable<Stock, StockCalcParams, Double> {

            // Analyze one stock object: its total market value
            public Double eval(Stock stock, StockCalcParams param)
                    throws InvokeException {
                return stock.getPrice() * stock.getTotalShares();
            }

            // Combine two partial results into one
            public Double merge(Double first, Double second)
                    throws InvokeException {
                return first + second;
            }
        }
  • 32. Java Example: Parallel Method Invocation
    • Run a parallel method invocation on the query results:

        NamedCache cache = CacheFactory.getCache("Stocks");
        InvokeResult valueOfSelectedStocks = cache.invoke(
                StockAnalysis.class,
                Stock.class,
                or(equal("ticker", "GOOG"), equal("ticker", "ORCL")),
                new StockCalcParams());
        System.out.println("The value of selected stocks is "
                + valueOfSelectedStocks.getResult());