Operational Intelligence Using Hadoop


Published in: Technology, Business

  1. Enabling Operational Intelligence Using Hadoop MapReduce
     Bill Bain, CEO (wbain@scaleoutsoftware.com)
     Hadoop Summit, June 3-5, 2014
     Copyright © 2014 by ScaleOut Software, Inc.
  2. Agenda
     • The Need for Operational Intelligence (OI)
     • Operational Intelligence vs. Business Intelligence
     • Implementing OI Using In-Memory Computing:
       • In-Memory Data Grid
       • Data-Parallel Computation: “Parallel Method Invocation”
     • Implementing MapReduce Unchanged on an IMDG
     • A Detailed Example in Financial Services
     • Video Demo
     • Examples of Applications in Operational Intelligence
  3. Online Systems Need Operational Intelligence
     Goal: Provide immediate feedback to a system handling live data. A few examples:
     • Equity trading: to minimize risk during a trading day
     • Ecommerce: to optimize real-time shopping activity
     • Reservations systems: to identify issues, reroute, etc.
     • Credit cards & wire transfers: to detect fraud in real time
     • Smart grids: to optimize power distribution & detect issues
  4. Challenges for Operational Intelligence
     • To keep up with fast-growing “live” workloads & maintain fast response times:
       • Ex.: Handle incoming data streams in real time.
       • Ex.: Process updates to the data set based on incoming data.
     • To identify and respond to trends in fast-changing data:
       • Ex.: Evaluate data set changes in real time.
       • Ex.: Respond to identified patterns within seconds.
     [Chart: Growth in Web Servers, in millions, 1995 to 2010. Source: Netcraft]
     [Chart: Growth in “Big Data”, in exabytes, 2000 to 2013]
     “More data has been created in the past three years than in the past 40,000.”
  5. Real-Time vs. Batch Analytics
     Big Data Analytics:
     • Batch (“Business Intelligence”):
       • Static data sets
       • Petabytes
       • Disk storage
       • Hours to minutes
       • Best uses: analyzing warehoused data; mining for long-term trends
       • Platforms: Hadoop, IBM, Teradata, Oracle, SAP
     • Real-time (“Operational Intelligence”):
       • Live data sets
       • Gigabytes to terabytes
       • In-memory storage
       • Minutes to seconds
       • Best uses: tracking live data; immediately identifying trends and capturing opportunities; providing immediate feedback
       • Platforms: IMDGs, Spark, Storm, CEP
  6. Design Goals for Hadoop vs. IMDGs
     • Traditional Hadoop MapReduce platforms analyze offline data:
       • Very large, disk-based datasets
       • Data repeatedly copied from disk to memory
       • Batch-scheduled (multi-tenant)
     • IMDGs store and analyze live data:
       • Fast-changing, operational data integrated with live updates
       • Data kept memory-resident (data motion is minimized)
       • Inline-scheduled (single tenant)
  7. Integrated View of Analytics
     • Operational intelligence can co-exist with business intelligence:
       • Processes streaming data close to its sources.
       • Provides real-time, “tactical” feedback (e.g., recommendations, alerts).
       • Translates data for storage in the data warehouse (ETL).
       • Data warehouse provides “strategic” guidance.
     • Using the same tool set (i.e., Hadoop MapReduce) lowers TCO:
       • Leverages common skill set.
       • Simplifies design (e.g., loading data into HDFS).
  8. In-Memory Architecture for Operational Intelligence
     • In-memory data grid (IMDG) holds active entities undergoing state changes in memory.
     • IMDG updates entities with incoming stream of state changes.
     • Backing store optionally holds large population of entities.
     • Analytics engine examines entities in real time and generates alerts within seconds as needed.
  9. In-Memory Data Grid for Live Data
     In-Memory Data Grid (IMDG) stores “live” data in a cluster:
     • Fits in the business logic layer:
       • Follows object-oriented view of data (vs. relational view).
       • Stores unstructured collections of Java/.NET objects.
       • Uses create/read/update/delete and query APIs to access data.
     • Implemented across a cluster of servers or VMs:
       • Scales storage and throughput by adding servers.
       • Provides high availability in case a server fails.
  10. IMDG Stores “Live” Data
     • IMDG’s collections of objects behave like in-memory collections:
       • Unstructured, typically instances of a class (stored as serialized blobs)
       • Individually accessible and updatable
     • IMDG adds attributes:
       • Accessible by global key
       • Queryable by properties
       • Highly available
       • Optional timeouts
       • Distributed locking
       • Integration with a backing store
       • Optional dependency relationships
       • Asynchronous event handling
     • Basic “CRUD” APIs, each addressed by object key:
       • Create(key, obj, tout)
       • Read(key)
       • Update(key, obj)
       • Delete(key)
       • and… Lock(key), Unlock(key)
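The CRUD and locking calls listed on this slide can be illustrated with a minimal, single-process sketch. It is only a stand-in for the idea: the class name `LiveObjectStore` and all method signatures are hypothetical, not ScaleOut's API, and a real IMDG distributes and replicates the objects across servers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical single-process stand-in for an IMDG object collection.
// Method names mirror the slide's CRUD list; this is not a vendor API.
class LiveObjectStore<V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis; // Long.MAX_VALUE means "no timeout"
        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry<V>> objects = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Create(key, obj, tout): succeeds only if no live object has this key.
    // A timeout of zero or less means the object never expires.
    boolean create(String key, V obj, long timeoutMillis) {
        read(key); // evicts an expired entry, if any
        long expires = timeoutMillis > 0
                ? System.currentTimeMillis() + timeoutMillis
                : Long.MAX_VALUE;
        return objects.putIfAbsent(key, new Entry<>(obj, expires)) == null;
    }

    // Read(key): returns the object, or null if it is absent or expired.
    V read(String key) {
        Entry<V> e = objects.get(key);
        if (e == null) return null;
        if (e.expiresAtMillis <= System.currentTimeMillis()) {
            objects.remove(key, e);
            return null;
        }
        return e.value;
    }

    // Update(key, obj): replaces the value and keeps the original expiry.
    boolean update(String key, V obj) {
        if (read(key) == null) return false;
        return objects.computeIfPresent(
                key, (k, old) -> new Entry<>(obj, old.expiresAtMillis)) != null;
    }

    // Delete(key): removes the object if present.
    boolean delete(String key) {
        return objects.remove(key) != null;
    }

    // Lock(key) / Unlock(key): per-key mutual exclusion for read-modify-write.
    void lock(String key) {
        locks.computeIfAbsent(key, k -> new ReentrantLock()).lock();
    }

    void unlock(String key) {
        ReentrantLock l = locks.get(key);
        if (l != null && l.isHeldByCurrentThread()) l.unlock();
    }

    public static void main(String[] args) {
        LiveObjectStore<String> store = new LiveObjectStore<>();
        store.create("GOOG", "price=1", 0);
        store.lock("GOOG");
        try {
            store.update("GOOG", "price=2");
        } finally {
            store.unlock("GOOG");
        }
        System.out.println(store.read("GOOG"));
    }
}
```

The per-key lock is what lets a live update and an analysis step touch the same object without interleaving; in a distributed grid the same role is played by the distributed lock named on the slide.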
  11. IMDG Analyzes Live Data
     • Integrated execution engine: “Parallel Method Invocation” (PMI)
       • Object-oriented version of HPC data-parallel computing model
       • Serves as a platform for implementing MapReduce and other data-parallel operators.
       • Runs user-defined methods in parallel across the cluster.
       • Globally merges results.
     • Benefits:
       • Simple, well understood model
       • Fast startup time
       • Fast global barrier
       • Minimum data motion
       • Automatic code shipping
     [Diagram: Analyze Data (Eval), then Combine Results (Merge)]
  12. PMI Enables Linear Speedup
     Avoids data motion (network or disk I/O), which limits throughput.
     [Chart not included in transcript]
  13. Comparison: IMDGs to Spark
     Spark / Spark Streaming from U.C. Berkeley AMPLab:
     • In-memory computing to accelerate and extend Hadoop MapReduce using data-parallel operators in Scala.
     • Stores data as “resilient distributed datasets” (RDDs):
       • Distributed across cluster
       • Immutable
       • Hold data from/output to HDFS.
       • Store data stream as a sequence of RDDs.
     • Comparison to IMDG:
       • Not designed for “live” data:
         • Lacks CRUD on individual objects.
         • Lacks high availability.
       • Designed for “data parallel” transformations.
  14. Implementing MapReduce on IMDG
     Run MapReduce as two PMI phases:
     • Data can be input from either the IMDG or an external data source.
     • Works with any input/output format compatible with the Apache distribution.
     • IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
       • Eliminates batch scheduling overhead.
     • Intermediate results are stored within the IMDG.
       • Minimizes data motion between the mappers and reducers.
       • Allows optional sorting.
     • Output of a single reducer/combiner optionally can be globally merged.
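To make the "two phases" structure concrete, here is a minimal sketch in plain Java, assuming a word-count job: a parallel map phase groups intermediate values by key in memory, and a parallel reduce phase sums them. It uses Java parallel streams as a stand-in for the grid's execution engine; the class and method names are illustrative and nothing here is the hServer implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: MapReduce expressed as two data-parallel phases
// with the intermediate (shuffled) data held in memory.
class TwoPhaseWordCount {

    // Phase 1 (map): every input line emits (word, 1) pairs, which are
    // grouped by word. The grouped map plays the role of the in-memory
    // intermediate result set.
    static Map<String, List<Integer>> mapPhase(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        word -> word,
                        Collectors.mapping((String word) -> 1, Collectors.toList())));
    }

    // Phase 2 (reduce): every key's value list is reduced independently,
    // so the keys can be processed in parallel.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> intermediate) {
        return intermediate.entrySet().parallelStream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to live");
        Map<String, Integer> counts = reducePhase(mapPhase(lines));
        System.out.println(counts);
    }
}
```

No sort appears between the phases, which matches the slide's point that sorting is optional when the intermediate data is held in a keyed in-memory structure.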
  15. Configuring Application for the IMDG
     • Without YARN, a one-line change swaps in HServerJob, a subclass of the Hadoop Job class:

         // This job will run using the Hadoop
         // job tracker:
         public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             Job job = new Job(conf, "wordcount");
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             job.setMapperClass(Map.class);
             job.setReducerClass(Reduce.class);
             job.setInputFormatClass(TextInputFormat.class);
             job.setOutputFormatClass(TextOutputFormat.class);
             FileInputFormat.addInputPath(job, new Path(args[0]));
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             job.waitForCompletion(true);
         }

         // This job will run using ScaleOut hServer:
         public static void main(String[] args) throws Exception {
             Configuration conf = new Configuration();
             Job job = new HServerJob(conf, "wordcount"); // the only changed line
             job.setOutputKeyClass(Text.class);
             job.setOutputValueClass(IntWritable.class);
             job.setMapperClass(Map.class);
             job.setReducerClass(Reduce.class);
             job.setInputFormatClass(TextInputFormat.class);
             job.setOutputFormatClass(TextOutputFormat.class);
             FileInputFormat.addInputPath(job, new Path(args[0]));
             FileOutputFormat.setOutputPath(job, new Path(args[1]));
             job.waitForCompletion(true);
         }
  16. Running Under YARN
     • With YARN, just replace the MapReduce execution framework:
       • YARN directs jobs to IMDG.
       • IMDG accelerates execution.
     • Example of running MapReduce on IMDG using Hortonworks YARN:

         $ hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.framework.name=hserver-yarn in out
  17. Using YARN to Run Hive on IMDG
     • With YARN, IMDG can run the Apache or another Hive distribution unchanged.
     • Accelerates queries for datasets hosted in HDFS or the IMDG.
     • Limitation: Intermediate data must fit within the IMDG.
     • Implementation note:
       • Hive is not thread-safe.
       • Requires multiple JVMs per server for one Hive query.
     • Currently seeing 3X speedup (tuning in progress).
     • More optimizations possible, but limited by the “unchanged” approach.
  18. Running MapReduce on an IMDG
     • A Hadoop distribution does not have to be installed unless HDFS is used.
     • The developer starts MapReduce applications from a remote workstation.
     • The IMDG automatically builds a reusable “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars.
     • Results are stored in the IMDG, HDFS, or optionally globally merged and returned to the remote workstation.
  19. Accelerating Start-Up Times
     The invocation grid can be re-used across MapReduce jobs:

         // Configure and load the invocation grid
         InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid").
             // Add JAR files as IG dependencies
             addJar("main-job.jar").
             addJar("mylib.jar").
             // Add classes as IG dependencies
             addClass(MyMap.class).
             addClass(MyRed.class).
             // Define custom JVM parameters
             setJVMParameters("-Xms512M -Xmx1024M").
             load();

         // Run 10 jobs on the same invocation grid
         for (int i = 0; i < 10; i++) {
             Configuration conf = new Configuration();
             // The preloaded invocation grid is passed to the job.
             Job job = new HServerJob(conf, "Job number " + i, false, grid);
             // ......Configure the job here.........
             // Run the job
             job.waitForCompletion(true);
         }

         // Unload the invocation grid when we are done
         grid.unload();
  20. Accessing In-Memory Data
     • IMDG adds grid input format for accessing key/value pairs held in the IMDG.
     • MapReduce programs optionally can output results to IMDG with grid output format.
     • Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
     • Applications can access and update key/value pairs as operational data during analysis.
  21. Optimizing In-Memory Storage for M/R
     IMDG needs multiple in-memory storage models:
     • Named cache, optimized for rich semantics on large objects:
       • Property-based query
       • Distributed locking
       • Access from remote grids
     • Named map, optimized for efficient storage and bulk analysis (e.g., MapReduce):
       • Highly efficient object storage
       • Pipelined, bulk-access mechanisms
       • Follows Java Named Map semantics.
  22. Named Map Optimizations
     In-Memory Named Map:
     • Stores key/value pairs in chunks.
     • Allows CRUD operations on key/value pairs.
     • Automatically organizes chunks into splits.
     • Uses per-split hash table to access keys and manage multi-valued keys.
     • Stores shuffled data set between mappers and reducers.
     • Pipelines chunks to mappers and from reducers.
     • Optionally uses memory-mapped files to reduce access latency.
     • Provides support for sorting keys.
  23. Optional Caching of HDFS Data
     • IMDG adds Dataset Record Reader (wrapper) to cache HDFS data during program execution.
     • Hadoop automatically retrieves data from IMDG on subsequent runs.
     • Dataset Record Reader stores and retrieves data with minimum network and memory overheads.
     • Tests with Terasort benchmark have demonstrated 11X faster access latency over HDFS without IMDG.
  24. Details of HDFS Caching
     • IMDG caches “chunks” of key/value pairs instead of HDFS records.
     • Serves key/value pairs directly to mappers on cache access.
     • Avoids overhead of reparsing records.
     [Diagram: “Record” phase and “Playback” phase]
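The record/playback idea can be sketched as a caching wrapper around a record source. This is a simplified illustration under stated assumptions: records are tab-separated key/value lines, the cache is an ordinary in-process map, and `CachingRecordReader` with its methods is a hypothetical name, not the Dataset Record Reader's actual interface.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Illustrative wrapper: the first read of a split parses raw records and
// caches the resulting key/value pairs ("record" phase); later reads serve
// the cached pairs without touching or reparsing the source ("playback").
class CachingRecordReader {
    private final Map<String, List<Map.Entry<String, String>>> cache = new HashMap<>();
    int parseCount = 0; // counts parsed records, to make the saving visible

    List<Map.Entry<String, String>> read(String splitId, Supplier<List<String>> rawRecords) {
        List<Map.Entry<String, String>> cached = cache.get(splitId);
        if (cached != null) {
            return cached; // playback phase: no source access, no parsing
        }
        List<Map.Entry<String, String>> parsed = new ArrayList<>();
        for (String record : rawRecords.get()) { // record phase
            parsed.add(parse(record));
            parseCount++;
        }
        cache.put(splitId, parsed);
        return parsed;
    }

    // Assumed record layout: "key<TAB>value"; a line without a tab is all key.
    static Map.Entry<String, String> parse(String record) {
        int tab = record.indexOf('\t');
        String key = tab < 0 ? record : record.substring(0, tab);
        String value = tab < 0 ? "" : record.substring(tab + 1);
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    public static void main(String[] args) {
        CachingRecordReader reader = new CachingRecordReader();
        List<String> raw = Arrays.asList("a\t1", "b\t2");
        reader.read("split-0", () -> raw); // first run parses
        reader.read("split-0", () -> raw); // second run plays back
        System.out.println("records parsed: " + reader.parseCount);
    }
}
```

Caching parsed pairs rather than raw bytes is the design choice the slide highlights: the second run skips both the storage read and the parsing step.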
  25. Performance & Optimizations
     • Measured performance:
       • Startup times reduced to a few milliseconds.
       • Word count benchmark shows 20X speedup.
       • Real-world example shows >40X speedup.
     • MapReduce optimizations:
       • Optional sorting
       • Optional multicast of parameters to mappers
       • Optional O(log N) global combining (avoids single reducer)
       • Optional HDFS caching
       • Optional reuse of JVMs across jobs
     • Current limitations:
       • No specific security for multi-tenancy
       • Intermediate data must fit in the IMDG
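The "O(log N) global combining" item can be read as a pairwise tree reduction: partial results are merged two at a time, so N partials need about log2(N) rounds instead of one reducer consuming all N in sequence. The sketch below assumes that reading; `TreeCombine` is a hypothetical name, and the merges within a round run sequentially here even though they are independent and could run on different servers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Illustrative pairwise (binary-tree) combining of partial results.
class TreeCombine {

    // Merges neighbours pairwise until one result remains. The merge
    // function is assumed to be associative.
    static <T> T combine(List<T> partials, BinaryOperator<T> merge) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("no partial results to combine");
        }
        List<T> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<T> next = new ArrayList<>((level.size() + 1) / 2);
            for (int i = 0; i < level.size(); i += 2) {
                // An unpaired last element is carried up unchanged.
                next.add(i + 1 < level.size()
                        ? merge.apply(level.get(i), level.get(i + 1))
                        : level.get(i));
            }
            level = next;
        }
        return level.get(0);
    }

    // Number of merge rounds needed for n partial results: ceil(log2(n)).
    static int rounds(int n) {
        int r = 0;
        while (n > 1) {
            n = (n + 1) / 2;
            r++;
        }
        return r;
    }

    public static void main(String[] args) {
        List<Integer> partials = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println("combined = " + combine(partials, Integer::sum));
        System.out.println("rounds for 75 partials = " + rounds(75));
    }
}
```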
  26. Summary of Benefits
     In-Memory MapReduce:
     • Enables use of Hadoop MapReduce for operational intelligence.
       • Accelerates data access by holding data in memory.
       • Analyzes and updates “live” data.
     • Reduces overheads of standard Hadoop distributions:
       • Batch scheduling
       • Disk access
       • Data shuffling
       • Mandatory key sorting
     • Avoids vendor-specific APIs:
       • Leverages Hadoop skill sets.
  27. Example in Financial Services
     Integrate analysis into a stock trading platform:
     • The IMDG holds market data and hedging strategies.
     • Updates to market data continuously flow through the IMDG.
     • The IMDG performs repeated data-parallel analysis on hedging strategies and alerts traders in real time.
     • The IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.
     • Measured >40X speedup over Apache Hadoop 1.2.
  28. Demo of the Finserv Application
     The Challenge: Quickly evaluate and respond to sub-second market changes:
     • Hedge fund tracks a set of hedging strategies:
       • Strategies can cover various market sectors, such as high-tech, automotive, energy, consumer, real estate, etc.
       • Each strategy contains a list of holdings and rules for managing the holdings (such as target allocations).
     • Updates to market data continuously arrive during the trading day.
     • Challenge: The hedge fund must be able to quickly update and analyze its hedging strategies and provide alerts to traders.
  29. Output: Real-Time Alerts
     • Delivers a stream of alerts to traders within a few seconds.
     • Enables the trader to examine strategy details in real time.
  30. Video
     • Video Link
  31. Example of Performance Scaling
     • Measured a similar financial services application (back testing stock trading strategies on stock histories).
     • Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock history data in memory.
     • IMDG handled a continuous stream of updates (1.1 GB/s).
     • Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling.
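The quoted throughput follows from the slide's own figures. As a worked check, assuming 1 TB is counted as 1024 GB (with 1 TB = 1000 GB the result is about 244 GB/s), and assuming the load is spread evenly over the 75 servers:

```latex
\frac{1\ \text{TB}}{4.1\ \text{s}}
  = \frac{1024\ \text{GB}}{4.1\ \text{s}}
  \approx 250\ \text{GB/s},
\qquad
\frac{250\ \text{GB/s}}{75\ \text{servers}}
  \approx 3.3\ \text{GB/s per server}
```

The per-server figure is a derived estimate, not a number reported in the deck.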
  32. Example in Ecommerce: Inventory Management
     Fast map/reduce reconciles inventory and order systems for an online retailer:
     • Challenge:
       • Inventory and online order management are handled by different applications.
       • Reconciled once per day.
       • Inaccurate orders reduce margins.
     • Solution:
       • Host SKUs in IMDG updated in real time by order & inventory systems.
       • Use MapReduce to reconcile in two minutes.
     • Results: Real-time reconciliation ensures accurate orders.
  33. Example: Web Shopping
     • IMDG holds customer information for active Web users.
     • IMDG saves/retrieves customer information from backing store.
     • Web browsers send activity information to analytics engine.
     • IMDG updates customer history and preferences.
     • Analytics engine identifies browsing and buying patterns.
     • Analytics engine makes suggestions in real time. Also sends email follow-ups.
  34. Example: Telecommunications
     • Track connectivity issues.
     • Obtain time-sensitive business data.
     • Offer enhanced services.
     • Increase security.
     [Diagram labels: Network Elements; Optimize Operations; Customer Experience; Historical queries for real-time data enrichment; Stream persistence for future analysis]
  35. Recap
     • Online systems need operational intelligence on “live” data for immediate feedback.
     • Operational intelligence can be implemented using Hadoop MapReduce unchanged.
     • In-memory data grid provides an excellent platform for MR-based operational intelligence:
       • Hosts and updates “live” data.
       • Implements high availability.
       • Offers fast MapReduce execution for immediate results.
       • Leverages Hadoop skill sets.
  36. Additional Information
  37. Comparison to Storm
     • Storm implements pipelined, task-parallel execution by “bolts” on incoming data streams.
     • Streams can be distributed to bolts with configurable mappings.
     • Developer controls the number of tasks per bolt.
     • Storm uses a centralized master node and ZooKeeper for fault tolerance.
     • Key strength: continuous processing of input streams
     • Issues:
       • Complexity / tuning
       • Minimizing data motion
       • Managing global state
  38. Java Example: Parallel Method Invocation
     • Create a method to analyze a queried stock object and another method to pair-wise merge the results:

         public class StockAnalysis
                 implements Invokable<Stock, StockCalcParams, Double> {

             // Runs on each selected Stock object, on the server that holds it
             public Double eval(Stock stock, StockCalcParams param)
                     throws InvokeException {
                 return stock.getPrice() * stock.getTotalShares();
             }

             // Combines two partial results into one
             public Double merge(Double first, Double second)
                     throws InvokeException {
                 return first + second;
             }
         }
  39. Java Example: Parallel Method Invocation (continued)
     • Run a parallel method invocation on a queried set of objects:

         NamedCache cache = CacheFactory.getCache("Stocks");
         InvokeResult valueOfSelectedStocks = cache.invoke(
             StockAnalysis.class,
             Stock.class,
             or(equal("ticker", "GOOG"), equal("ticker", "ORCL")),
             new StockCalcParams());
         System.out.println("The value of selected stocks is "
             + valueOfSelectedStocks.getResult());
  40. PMI: Running the Analysis
     • IMDG ships user’s code and libraries to its servers.
     • IMDG automatically schedules analysis operations across all grid servers and cores.
     • The analysis runs on all objects selected by the parallel query.
     • Each grid server analyzes its locally stored objects to minimize data motion.
     • Parallel execution ensures fast completion time:
       • IMDG automatically distributes workload across servers/cores.
       • Scaling the IMDG automatically handles larger data sets.
  41. PMI: Merging the Results
     • The IMDG automatically merges all analysis results.
     • The IMDG first merges all results within each grid server in parallel.
     • It then merges results across all grid servers to create one combined result.
     • Efficient parallel merge minimizes the delay in combining all results.
     • The IMDG delivers the combined result to the invoking application as one object.
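The eval-then-merge flow on the last two slides can be sketched in plain Java: each "server" evaluates its locally stored objects and merges them locally, and the per-server results are then merged into one. This is a single-process illustration using parallel streams; `PmiSketch`, its `Stock` class, and the `invoke` signature are assumptions made for the example and are not the grid's API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Illustrative two-level eval/merge over data partitioned by server.
class PmiSketch {

    // Hypothetical stand-in for the Stock objects used on the slides.
    static final class Stock {
        final double price;
        final long shares;
        Stock(double price, long shares) {
            this.price = price;
            this.shares = shares;
        }
    }

    // serverPartitions.get(i) holds the objects stored on server i.
    // merge is assumed to be associative, so local and global merges
    // can be applied in any grouping.
    static <O, R> Optional<R> invoke(List<List<O>> serverPartitions,
                                     Function<O, R> eval,
                                     BinaryOperator<R> merge) {
        return serverPartitions.parallelStream()       // one task per server
                .map(local -> local.stream()
                        .map(eval)                     // eval on local objects
                        .reduce(merge))                // merge within the server
                .filter(Optional::isPresent)           // skip empty servers
                .map(Optional::get)
                .reduce(merge);                        // merge across servers
    }

    public static void main(String[] args) {
        List<List<Stock>> partitions = Arrays.asList(
                Arrays.asList(new Stock(10.0, 2), new Stock(5.0, 4)),
                Arrays.asList(new Stock(1.0, 100)));
        Optional<Double> total = PmiSketch.<Stock, Double>invoke(
                partitions, s -> s.price * s.shares, Double::sum);
        System.out.println("The value of selected stocks is " + total.orElse(0.0));
    }
}
```

Only one value per server crosses the "network" in this model, which is the data-motion saving the slides attribute to merging locally first.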