Real-time analysis using an in-memory data grid - Cloud Expo 2013


ScaleOut technical session at Cloud Expo 2013 in NY. Covers the use of in-memory data grids for real-time analysis of fast-changing data. Includes a financial services example.

Published in: Technology, Business

  1. Performing Real-Time Analytics with In-Memory Data Grids
     Cloud Expo, June 10, 2013
     Mikhail Sobolev, David Brinker
     Copyright © 2013 by ScaleOut Software, Inc.
  2. Agenda
     • What is an In-Memory Data Grid (IMDG)?
     • Top Benefits of IMDGs
     • The Need for Real-Time Analytics
     • Example: A Platform for Managing Hedging Strategies
     • Using an IMDG to Perform Real-Time Analysis
     • Benchmark Results
     • Integrating an IMDG into Hadoop
  3. About the Speakers
     • Dr. Mikhail Sobolev, Lead Java Architect
       • Ph.D. from Moscow Institute of Physics and Technology
       • Research and consulting focus in parallel computing
       • Responsible for development of scalable software services in Java
     • David Brinker, COO
       • 20 years of software business and executive management experience
       • Mentor Graphics, Cadence, Webridge
     • Company: ScaleOut Software
       • Develops and markets IMDG products
       • Founded in September 2003
       • Offices in Bellevue, WA and Beaverton, OR
       • Eight years of market experience in Windows & Linux
  4. ScaleOut Software Products
     • ScaleOut StateServer®
       • Flagship product
       • IMDG middleware for Windows and Linux
       • Industry-leading performance and ease of use
     • ScaleOut GeoServer® adds
       • WAN-based data replication for DR
       • Breakthrough technology for global data access
     • ScaleOut Analytics Server™ adds
       • Real-time data analysis for operational data
       • Comprehensive management tools
     • ScaleOut hServer™ adds
       • 1st step for Hadoop real-time analytics
       • Accelerates data access and execution.
     [Diagram: ScaleOut StateServer In-Memory Data Grid, made up of four Grid Services]
  5. What is an In-Memory Data Grid?
     In-memory storage for fast updates and retrieval of live data
     • Fits in the business logic layer:
       • Stores collections of Java/.NET objects shared by multiple clients.
       • Uses create/read/update/delete and query APIs to access data.
     • Implemented across a cluster of servers or VMs:
       • Scales storage and throughput by adding servers.
       • Provides high availability in case a server fails.
  6. Scaling Data Access Using an IMDG
     Example: Cloud-Hosted App
     • Application runs as multiple virtual servers (VS).
     • Application instances store and retrieve LOB data from a cloud-based file system or database.
     • Applications need fast, scalable storage for live data.
     • In-memory data grid runs as multiple virtual servers to provide “elastic” in-memory storage for live data.
  7. Where IMDGs Are Deployed
     • As a “vertical” storage tier:
       • Runs as middleware software.
       • Adds missing storage layer to boost performance.
       • Uses out-of-process memory.
       • Avoids repeated trips to a backing store.
     • As a “horizontal” storage tier:
       • Allows data sharing among servers.
       • Scales performance & capacity.
       • Adds high availability.
       • Can be used independently of backing storage.
     [Diagram labels: Processor Cache; L2 Cache; Application Memory (“In-Process”); In-Memory Data Grid (“Out-of-Process”); Backing Storage]
  8. The Secret to Fast Access Time
     • IMDG incorporates a client-side in-process cache (“near cache”):
       • Transparent to the application
       • Holds recently accessed data
     • Boosts performance:
       • Eliminates repeated network data transfers & deserialization
       • Reduces access times to near “in-process” latency
     • Is automatically updated if the grid is updated
     • Supports various coherency models (coherent, polled, event-driven)
     [Diagram labels: Application Memory (“In-Process”); Client-side Cache (“In-Process”); In-Memory Data Grid (“Out-of-Process”)]
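The near-cache idea on this slide can be illustrated with a small sketch. This is not ScaleOut's implementation or API: the `Grid` and `NearCache` classes and the version-number check are assumptions made up for illustration, showing one way a "coherent" model might avoid repeated full transfers while still noticing grid updates.

```python
class Grid:
    """Hypothetical stand-in for the out-of-process grid.

    Stores a (version, value) pair per key and counts full reads, which
    represent the expensive network transfer plus deserialization.
    """

    def __init__(self):
        self._data = {}
        self.full_reads = 0

    def put(self, key, value):
        # Every update bumps the version so caches can detect staleness.
        version = self._data.get(key, (0, None))[0] + 1
        self._data[key] = (version, value)

    def version(self, key):
        # Cheap check: returns only the version, not the object.
        return self._data[key][0]

    def get(self, key):
        self.full_reads += 1
        return self._data[key]


class NearCache:
    """Hypothetical client-side, in-process cache in front of the grid."""

    def __init__(self, grid):
        self._grid = grid
        self._local = {}

    def get(self, key):
        cached = self._local.get(key)
        # Serve from local memory only if the grid still holds this version.
        if cached is not None and cached[0] == self._grid.version(key):
            return cached[1]
        entry = self._grid.get(key)  # full fetch, then remember it locally
        self._local[key] = entry
        return entry[1]


grid = Grid()
grid.put("AAPL", 100.0)
cache = NearCache(grid)
cache.get("AAPL")   # first access: full read from the grid
cache.get("AAPL")   # second access: served from the near cache
grid.put("AAPL", 101.5)
cache.get("AAPL")   # grid changed, so the cache refetches
```

A polled or event-driven model would replace the per-access version check with a timer or with notifications pushed from the grid.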
  9. Global Data Integration
     • IMDGs enable seamless data access across on-premise sites and cloud-based deployments:
       • Automatically access remote data as needed.
       • Efficiently manage WAN bandwidth.
       • Enable full data coherency across sites.
     • IMDGs support multiple usage models:
       • Replication for DR
       • Remote access
       • Synchronized read/write
  10. Example: Web Farm Cloud-Bursting
      • IMDG bridges on-premise and cloud-based in-memory storage of Web session state.
      • IMDG automatically migrates session-state objects into the cloud on demand.
      • This enables seamless access to data across multiple sites.
  11. Top Benefits of IMDGs
      An in-memory data grid is middleware software that provides:
      1. Fast access time for fast-changing, “live” data
      2. Scalable throughput and storage capacity to match a growing workload and keep response times low
      3. High availability to prevent data loss if a grid server (or network link) fails
      4. Shared access to data across the server farm
      5. Global data access across multiple sites and the cloud
      6. And … fast data analysis for quickly and easily mining data using “map/reduce”
      [Chart: Access Latency vs. Throughput, Grid compared with DBMS; annotations “Faster” and “Scales”]
  12. IMDGs Analyze Live Data
      • Traditional “big data” analysis platforms analyze offline data:
        • Example: Hadoop
        • Very large, static datasets
        • Data is often copied from other disk-based storage systems to a distributed file system for analysis.
      • IMDGs store and analyze online data:
        • Fast-changing, operational data
        • Data storage is memory-based.
        • Data motion is minimized for fast, continuous analysis.
  13. Online Systems Need Real-Time Analysis
      A few examples:
      • Equity trading: to minimize risk during a trading day
      • Ecommerce: to optimize real-time shopping activity
      • Reservations systems: to identify issues, reroute, etc.
      • Credit cards: to detect fraud in real time
      • Smart grids: to optimize power distribution & detect issues
  14. Example in Financial Services
      A platform for managing hedging strategies:
      • A hedge fund manages a set of hedging strategies:
        • Strategies can cover various market sectors, such as high-tech, automotive, energy, consumer, real estate, etc.
        • Each strategy contains a list of holdings and rules for managing the holdings (such as target allocations).
      • Updates to market data continuously arrive during the trading day.
      • Challenge: The hedge fund must be able to quickly update and analyze its hedging strategies and provide alerts to traders.
  15. The Result: Real-Time Alerts
      • Deliver a stream of alerts to traders within a few seconds.
      • Enable the trader to examine strategy details in real time.
  16. The Solution: Real-Time Analytics Using an IMDG
      • The IMDG holds the set of strategy objects as an in-memory collection.
      • Updates to market data continuously flow through the IMDG.
      • The IMDG performs repeated map/reduce analysis on hedging strategies every second.
      • Each analysis iteration both updates and analyzes every strategy object.
      • The IMDG collects alerts after each analysis and delivers them to the trader.
  17. The Analysis Algorithm
      • Analyze every selected strategy object in parallel within the IMDG:
        • Update the strategy’s positions with latest market prices.
        • Evaluate the strategy’s rules to see if a trade is needed.
          • Example: Alert if current allocation exceeds target threshold.
        • Generate an alert if holdings need to be changed.
      • Merge the results across all strategy objects to create a set of alerts.
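The analysis algorithm on this slide follows a map/reduce shape: an "analyze" step per strategy object and a "merge" step over the results. The sketch below is only an illustration of that shape in plain Python. The `Strategy` fields, the 5% drift threshold, and the symbols are all hypothetical; the slides do not show the actual strategy model or the grid's API, and this version runs sequentially rather than inside a grid.

```python
from dataclasses import dataclass, field
from functools import reduce


@dataclass
class Strategy:
    """Hypothetical strategy object: holdings plus a target-allocation rule."""
    name: str
    shares: dict              # symbol -> number of shares held
    targets: dict             # symbol -> target fraction of strategy value
    threshold: float = 0.05   # allowed drift above target before alerting
    values: dict = field(default_factory=dict)  # symbol -> market value


def analyze(strategy, prices):
    """Map step: reprice one strategy and return the alerts it generates.

    Assumes every held symbol has a positive price in `prices`.
    """
    # Update the strategy's positions with the latest market prices.
    strategy.values = {s: n * prices[s] for s, n in strategy.shares.items()}
    total = sum(strategy.values.values())
    alerts = []
    # Evaluate the rule: alert if current allocation exceeds target + threshold.
    for symbol, value in strategy.values.items():
        allocation = value / total
        if allocation > strategy.targets[symbol] + strategy.threshold:
            alerts.append((strategy.name, symbol, round(allocation, 3)))
    return alerts


def merge(alerts_a, alerts_b):
    """Reduce step: combine two alert lists into one."""
    return alerts_a + alerts_b


def run_analysis(strategies, prices):
    """One analysis iteration over all strategies."""
    return reduce(merge, (analyze(s, prices) for s in strategies), [])


strategies = [
    Strategy("tech", {"AAA": 10, "BBB": 10}, {"AAA": 0.5, "BBB": 0.5}),
    Strategy("energy", {"CCC": 5, "DDD": 5}, {"CCC": 0.5, "DDD": 0.5}),
]
prices = {"AAA": 30.0, "BBB": 10.0, "CCC": 20.0, "DDD": 20.0}
alerts = run_analysis(strategies, prices)
```

Because `merge` is associative, the per-strategy results can be combined in any grouping, which is what lets a grid merge within each server first and across servers afterwards.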
  18. Shipping Analysis Code to the IMDG
      • IMDG creates Java or .NET execution environment for analysis:
        • Spans all IMDG servers.
        • Ensures tight integration with memory-based data storage.
      • IMDG client ships jars/assemblies to IMDG servers for execution:
        • Keeps development model simple.
        • Optionally allows pre-staging for multiple runs to shorten startup time.
        • Optionally allows automatic re-staging if code changes between runs.
      • Client starts analysis:
        • Sends invocation to the IMDG.
        • IMDG returns analysis results.
  19. Running the Analysis
      The parallel analysis executes in three steps:
      • Step 1: The application first selects all relevant objects in the collection with a parallel query run on all grid servers.
      • Note: Query spec matches data’s object-oriented properties.
  20. Running the Analysis: Step 2
      • Step 2: The IMDG automatically schedules analysis operations across all grid servers and cores.
        • The analysis runs on all objects selected by the parallel query.
        • Each grid server analyzes its locally stored objects to minimize data motion.
      • Parallel execution ensures fast completion time:
        • IMDG automatically distributes workload across servers/cores.
        • Scaling the IMDG automatically handles larger data sets.
  21. IMDG Minimizes Data Motion
      • File-based map/reduce must move data to memory for analysis.
      • IMDG’s memory-based computation engine analyzes data in place.
      [Diagram labels: D; E; M/R Server; File System / Database; Server Memory; Grid Server; In-Memory Data Grid]
  22. Running the Analysis: Step 3
      • Step 3: The IMDG automatically merges all analysis results.
        • The IMDG first merges all results within each grid server in parallel.
        • It then merges results across all grid servers to create one combined result.
      • Efficient parallel merge minimizes the delay in combining all results.
      • The IMDG delivers the combined result to the trader’s display as one object.
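Steps 2 and 3 can be sketched as a two-level merge. In the sketch below, threads stand in for grid servers and each inner list stands in for the objects a server stores locally; `analyze_partition` and `parallel_analysis` are hypothetical names, not part of any grid product. It assumes `merge` is associative and does not mutate its arguments.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def analyze_partition(objects, analyze, merge, empty):
    """Work done on one 'grid server': analyze local objects, merge locally."""
    return reduce(merge, (analyze(obj) for obj in objects), empty)


def parallel_analysis(partitions, analyze, merge, empty):
    """Run every partition in parallel, then merge the per-server results.

    `partitions` must be a non-empty list of lists, one list per server.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # Level 1: each server produces one locally merged result.
        local_results = list(pool.map(
            lambda part: analyze_partition(part, analyze, merge, empty),
            partitions))
    # Level 2: merge across servers into one combined result.
    return reduce(merge, local_results, empty)


# Toy usage: "alert" on even numbers, spread across three partitions.
partitions = [[1, 2, 3], [4, 5], [6]]
combined = parallel_analysis(
    partitions,
    analyze=lambda x: [x] if x % 2 == 0 else [],
    merge=lambda a, b: a + b,
    empty=[],
)
```

Only the small per-server results cross partition boundaries; the objects themselves stay where they are stored, which is the point the slides make about minimizing data motion.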
  23. Benchmark Results
      Running a similar analysis algorithm (stock back-testing) within an IMDG:
      • IMDG hosted in Amazon cloud using 75 servers.
      • IMDG holds 1 TB of stock history data in memory.
      • IMDG handles continuous stream of updates (1.1 GB/s) while performing real-time analysis on live data.
      • Entire data set analyzed in 4.1 seconds (250 GB/s).
      • IMDG scales linearly by adding servers as workload grows.
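The benchmark figures can be cross-checked with a little arithmetic. The sketch below assumes 1 TB means 1024 GB, which reproduces the quoted 250 GB/s; with 1000 GB the result would be about 244 GB/s. The per-server rate is derived here and is not stated on the slide.

```python
servers = 75
data_gb = 1024          # assumption: 1 TB = 1024 GB
scan_seconds = 4.1
update_gb_per_s = 1.1   # incoming update stream, from the slide

# Aggregate analysis throughput: whole data set scanned in 4.1 s.
throughput_gb_per_s = data_gb / scan_seconds

# Derived figure: average scan rate each server must sustain.
per_server_gb_per_s = throughput_gb_per_s / servers

# Update stream as a fraction of the analysis throughput.
update_fraction = update_gb_per_s / throughput_gb_per_s

print(f"aggregate: {throughput_gb_per_s:.1f} GB/s")
print(f"per server: {per_server_gb_per_s:.2f} GB/s")
print(f"updates vs. scan rate: {update_fraction:.2%}")
```

Under these assumptions each server scans roughly 3.3 GB/s of its own memory, and the update stream is well under 1% of the scan rate.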
  24. Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics
      • Hadoop is typically used for very large, static, offline datasets.
      • Data is held on disk in a file system (HDFS) or DBMS.
      • Data is often copied from other disk-based storage systems to HDFS for analysis.
  25. Comparison of IMDGs and Hadoop

                               IMDG                              Hadoop
      Data set size            Gigabytes -> terabytes            Terabytes -> petabytes
      Data repository          In-memory                         File / database
      Data view                Queried object collection         File-based key/value pairs
      Development time         Low                               High
      Automatic scalability    Yes                               Application dependent
      Best use                 Real-time analysis of live,       Batch analysis of large,
                               memory-based data                 static datasets
      I/O overhead             Low                               High
      Cluster mgt.             Simple                            Complex
      High availability        Memory-based                      File-based
  26. Enabling Hadoop to Perform Real-Time Analysis
      • Survey result from Strata 2013: 93% of Hadoop users would benefit from real-time data analytics.
      • Strategy: Integrate IMDG into Hadoop.
      • How:
        • Stage data in IMDG for fast access.
        • Thereby allow updates to data during Hadoop execution.
        • Automatically retrieve data from HDFS as necessary.
        • Enable unchanged Hadoop program structure.
        • Combine scalability of Hadoop map/reduce and IMDG.
  27. Accessing IMDG Data in Hadoop
      • IMDG adds Hadoop grid record reader for accessing key/value pairs held in the IMDG.
      • Hadoop programs optionally can output results to IMDG with grid record writer.
      • Applications can access and update key/value pairs as live data during analysis.
      • Grid record reader and writer optimize access to key/value pairs to eliminate network overhead.
  28. Using IMDG as an HDFS Cache
      • IMDG adds wrapper for HDFS record reader to cache HDFS data during program execution.
      • Hadoop automatically retrieves data from IMDG on subsequent runs.
      • Wrapper accesses IMDG to store and retrieve data with minimum network overhead.
      • Useful in multiple “what-if” analyses on one data set.
      • Tests with the Terasort benchmark have demonstrated 11X lower access latency than HDFS without IMDG.
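The caching-wrapper pattern on this slide can be sketched generically. The code below does not use the Hadoop `RecordReader` API or any ScaleOut class: `CachingRecordReader`, the `read_from_hdfs` callable, and the plain dict used as the grid are all hypothetical stand-ins chosen to show the read-through idea.

```python
class CachingRecordReader:
    """Hypothetical wrapper: read through to slow storage once, then
    serve later runs from an in-memory store keyed by data set id."""

    def __init__(self, dataset_id, read_from_hdfs, grid_cache):
        self._dataset_id = dataset_id
        self._read = read_from_hdfs   # callable returning an iterable of (key, value)
        self._cache = grid_cache      # dict standing in for the in-memory grid

    def records(self):
        cached = self._cache.get(self._dataset_id)
        if cached is not None:
            # Subsequent run: no access to the underlying file system.
            yield from cached
            return
        seen = []
        for record in self._read():
            seen.append(record)       # stage the record while passing it on
            yield record
        # Publish to the cache only after the whole input has been read.
        self._cache[self._dataset_id] = seen


hdfs_reads = []

def slow_read():
    """Stand-in for an HDFS scan; records how many times it is invoked."""
    hdfs_reads.append(1)
    return iter([("k1", "v1"), ("k2", "v2")])

grid = {}
first_run = list(CachingRecordReader("input", slow_read, grid).records())
second_run = list(CachingRecordReader("input", slow_read, grid).records())
```

The calling program consumes `records()` the same way on both runs, which mirrors the slide's point that repeated "what-if" analyses on one data set benefit without the program changing.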
  29. Summary
      • IMDGs use in-memory storage to scale access to data for applications which process live, fast-changing data.
      • IMDGs can be deployed in the cloud and provide global data integration across sites.
      • Many applications need to perform real-time analytics on live data.
      • IMDGs can meet this need, delivering results in seconds instead of minutes or hours.
      • Hadoop was not designed for real-time analytics, but…
      • IMDGs can enable Hadoop to accelerate access to data.
  30. In-Memory Data Grids for Server Farms & Cloud