Using Distributed In-Memory Computing for Fast Data Analysis


An overview of how distributed data grids enable data sharing across web servers and virtual cloud environments to provide scalability and high availability. It also covers how distributed data grids support MapReduce-style analysis across large data sets.

Published in: Technology

  1. Using Distributed, In-Memory Computing for Fast Data Analysis
     WSTA Seminar, September 14, 2011
     Bill Bain
     Copyright © 2011 by ScaleOut Software, Inc.
  2. Agenda
     • The Need for Memory-Based, Distributed Storage
     • What Is a Distributed Data Grid (DDG)
     • Performance Advantages and Architecture
     • Migrating Data to the Cloud and Across Global Sites
     • Parallel Data Analysis
     • Comparison of DDG to File-Based Map/Reduce
  3. The Need for Memory-Based Storage
     Example: Web server farm
     • Load-balancer directs incoming client requests to Web servers.
     • Web and app. server farms build Web pages and run business logic.
     • Database server holds all mission-critical, LOB data.
     • Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.
     [Diagram: Internet, load-balancer, Web servers, app. servers, and a distributed, in-memory data grid connected by Ethernet; the database server with its RAID disk array is labeled as the bottleneck.]
  4. The Need for Memory-Based Storage
     Example: Cloud application
     • Application runs as multiple virtual servers (VS).
     • Application instances store and retrieve LOB data from a cloud-based file system or database.
     • Applications need fast, scalable storage for fast-changing data.
     • Distributed data grid runs as multiple virtual servers to provide “elastic,” in-memory storage.
     [Diagram: cloud application made up of App VSs, a distributed data grid made up of Grid VSs, and cloud-based storage.]
  5. What Is a Distributed Data Grid?
     • A new “vertical” storage tier:
       – Adds missing layer to boost performance.
       – Uses in-memory, out-of-process storage.
       – Avoids repeated trips to backing storage.
     • A new “horizontal” storage tier:
       – Allows data sharing among servers.
       – Scales performance & capacity.
       – Adds high availability.
       – Can be used independently of backing storage.
     [Diagram: per-server storage hierarchy of processor cache, L2 cache, “in-process” application memory, and “out-of-process” distributed cache, above shared backing storage.]
  6. Distributed Data Grids: A Closer Look
     • Incorporates a client-side, in-process cache (“near cache”):
       – Transparent to the application.
       – Holds recently accessed data.
     • Boosts performance:
       – Eliminates repeated network data transfers & deserialization.
       – Reduces access times to near “in-process” latency.
       – Is automatically updated if the distributed grid changes.
       – Supports various coherency models (coherent, polled, event-driven).
     [Diagram: “in-process” application memory and client-side cache layered above the “out-of-process” distributed cache.]
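
A minimal sketch of the near-cache idea, under stated assumptions: `GridStub`, `Versioned`, and `NearCache` are hypothetical names, not a product API, and the version check shown is only one possible way to keep a local copy coherent with the grid. The point it illustrates is that a small version comparison can replace a repeated transfer and deserialization of the whole object.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Value plus a version number that the grid bumps on every update. */
final class Versioned {
    final String value;
    final long version;

    Versioned(String value, long version) {
        this.value = value;
        this.version = version;
    }
}

/** Hypothetical stand-in for the out-of-process grid (not a real product API). */
final class GridStub {
    private final Map<String, Versioned> store = new ConcurrentHashMap<>();
    int fullReads = 0; // counts simulated transfers of whole objects

    void put(String key, String value) {
        // First write gets version 1; later writes increment the stored version.
        store.merge(key, new Versioned(value, 1),
                (old, fresh) -> new Versioned(value, old.version + 1));
    }

    Versioned read(String key) {
        fullReads++;
        return store.get(key);
    }

    long versionOf(String key) {
        Versioned v = store.get(key);
        return v == null ? -1 : v.version;
    }
}

/** In-process cache that reuses a local copy only while its version is current. */
final class NearCache {
    private final GridStub grid;
    private final Map<String, Versioned> local = new HashMap<>();

    NearCache(GridStub grid) {
        this.grid = grid;
    }

    String get(String key) {
        Versioned cached = local.get(key);
        // A small version check replaces re-fetching and deserializing the object.
        if (cached != null && cached.version == grid.versionOf(key)) {
            return cached.value;
        }
        Versioned fresh = grid.read(key);
        if (fresh == null) {
            local.remove(key);
            return null;
        }
        local.put(key, fresh);
        return fresh.value;
    }
}
```

Repeated reads of an unchanged object are served from the local copy; an update in the grid changes the version, so the next read fetches the new value.
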
  7. Performance Benefit of Client-Side Cache
     • Eliminates repeated network data transfers.
     • Eliminates repeated object deserialization.
     [Chart: average response time in microseconds (axis 0 to 3,500) for 10KB objects at a 20:1 read/update ratio, DDG vs. DBMS.]
  8. Top 5 Benefits of Distributed Data Grids
     1. Faster access time for business logic state or database data
     2. Scalable throughput to match a growing workload and keep response times low
     3. High availability to prevent data loss if a grid server (or network link) fails
     4. Shared access to data across the server farm
     5. Advanced capabilities for quickly and easily mining data using scalable “map/reduce” analysis
     [Chart: access latency (msec) vs. throughput (accesses/sec), grid vs. DBMS.]
  9. Scaling the Distributed Data Grid
     • Distributed data grid must deliver scalable throughput.
     • To do so, its architecture must eliminate bottlenecks to scaling:
       – Avoid centralized scheduling to eliminate hot spots.
       – Use data partitioning and maintain load-balance to allow scaling.
       – Use fixed rather than full replication to avoid n-fold overhead.
       – Use low-overhead heart-beating.
     • Example of linear throughput scaling:
     [Chart: read/write throughput for 10KB objects in accesses/second (axis 0 to 80,000) vs. number of nodes (4 to 64), with the object count growing from 16,000 to 256,000.]
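
To make the partitioning and replication bullets concrete, here is a small sketch under stated assumptions: `PartitionMap` is a hypothetical class, and hash-modulo placement with replicas on the next nodes in ring order is just one simple scheme, not necessarily what any commercial grid does. It shows that placement needs no central scheduler and that the number of writes per update stays fixed as nodes are added, whereas full replication would cost one write per node.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical placement map: hash partitioning with a fixed number of copies. */
final class PartitionMap {
    private final int nodeCount;
    private final int copies; // primary plus replicas, independent of nodeCount

    PartitionMap(int nodeCount, int copies) {
        if (nodeCount < 1 || copies < 1 || copies > nodeCount) {
            throw new IllegalArgumentException("need 1 <= copies <= nodeCount");
        }
        this.nodeCount = nodeCount;
        this.copies = copies;
    }

    /** Any client can compute the primary locally, so no central scheduler is needed. */
    int primaryFor(String key) {
        return Math.floorMod(key.hashCode(), nodeCount);
    }

    /** The primary plus the next (copies - 1) nodes in ring order hold the object. */
    List<Integer> nodesFor(String key) {
        List<Integer> nodes = new ArrayList<>();
        int primary = primaryFor(key);
        for (int i = 0; i < copies; i++) {
            nodes.add((primary + i) % nodeCount);
        }
        return nodes;
    }

    /** Writes needed per update: constant here, but nodeCount under full replication. */
    int writesPerUpdate() {
        return copies;
    }
}
```

With two copies, an update costs two writes whether the grid has 4 nodes or 64, which is what allows throughput to grow with node count.
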
  10. Typical Commercial Distributed Data Grids
      • Partition objects to scale throughput and avoid hot spots.
      • Synchronize access to objects across all servers.
      • Dynamically rebalance objects to avoid hot spots.
      • Replicate each cached object for high availability.
      • Detect server or network failures and self-heal.
      [Diagram: client application and client library retrieving a cached copy; the object copy and its replica are held by cache services that form the distributed cache over Ethernet.]
  11. Wide Range of Applications
      Financial Services
      • Portfolio risk analysis
      • VaR calculations
      • Monte Carlo simulations
      • Algorithmic trading
      • Market message caching
      • Derivatives trading
      • Pricing calculations
      E-commerce
      • Session-state storage
      • Application state storage
      • Online banking
      • Loan applications
      • Wealth management
      • Online learning
      • Hotel reservations
      • News story caching
      • Shopping carts
      • Social networking
      • Service call tracking
      • Online surveys
      Other Applications
      • Edge servers: chat, email
      • Online gaming servers
      • Scientific computations
      • Command and control
  12. Importance for Cloud Computing
      • Cloud computing:
        – Makes elastic resources readily available, but…
        – Clouds have relatively slow interconnects.
      • Distributed data grids add significant value in the cloud:
        – Allow data sharing across a group of virtual servers.
        – Elastically scale throughput as needed.
        – Provide low-latency, object-oriented storage.
      • Clouds provide the elastic platform for parallel data analysis.
      • DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.
  13. DDGs Simplify Data Migration to the Cloud
      • Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access.
      • This enables seamless access to data across multiple sites.
      [Diagram: user’s on-premise application (app servers and SOSS hosts forming an on-premise distributed data grid with a backing store) and a cloud application (App VSs and SOSS VSs forming a cloud-hosted distributed data grid), with data migrated automatically between them.]
  14. DDGs Enable Seamless Global Access
      [Diagram: mirrored data centers and satellite data centers, each running a distributed data grid of SOSS servers, combined into a global distributed data grid.]
  15. Introducing Parallel Data Analysis
      • The goal:
        – Quickly analyze a large set of data for patterns and trends.
        – How? Run a method E (“eval”) across a set of objects D in parallel.
        – Optionally merge the results using method M (“merge”).
      • Evolution of parallel analysis:
        – 80s: “SIMD/SPMD” (Flynn, Hillis)
        – 90s: “Domain decomposition” (Intel, IBM)
        – 00s: “Map/reduce” (Google, Hadoop, Dryad)
      • Applications:
        – Search, financial services, business intelligence, simulation
      [Diagram: method E applied to a set of objects D, with method M combining the outputs into a result.]
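
The eval/merge pattern can be sketched generically with Java parallel streams. This is an illustration on a single machine, not a grid API: `EvalMerge.run` is a hypothetical helper, and it assumes the merge method is associative so that partial results can be combined in any grouping.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class EvalMerge {
    /**
     * Runs eval (E) on every object in parallel, then combines the results
     * with merge (M). Returns an empty Optional when there are no objects.
     * The merge function must be associative for the result to be well defined.
     */
    static <D, R> Optional<R> run(List<D> objects,
                                  Function<D, R> eval,
                                  BinaryOperator<R> merge) {
        return objects.parallelStream().map(eval).reduce(merge);
    }
}
```

For example, `EvalMerge.run(words, String::length, Integer::max)` evaluates the length of each word in parallel and merges the lengths by keeping the larger of each pair.
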
  16. Example in Financial Services
      Analyze trading strategies across stock histories.
      Why?
      • Back-testing systems help guard against risks in deploying new trading strategies.
      • Performance is critical for “first to market” advantage.
      • Uses a significant amount of market data and computation time.
      How?
      • Write method E to analyze trading strategies across a single stock history.
      • Write method M to merge two sets of results.
      • Populate the data store with a set of stock histories.
      • Run method E in parallel on all stock histories.
      • Merge the results with method M to produce a report.
      • Refine and repeat…
  17. Stage the Data for Analysis
      • Step 1: Populate the distributed data grid with objects, each of which represents a price history for a ticker symbol.
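
A minimal sketch of this staging step, with loud assumptions: `PriceHistory` and its fields are a hypothetical object layout, and a `ConcurrentHashMap` keyed by ticker symbol stands in for the distributed data grid, whose real client API is not shown in the deck.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical price-history object; the field layout is an assumption. */
final class PriceHistory {
    final String ticker;
    final double[] closingPrices;

    PriceHistory(String ticker, double... closingPrices) {
        this.ticker = ticker;
        this.closingPrices = closingPrices;
    }
}

final class Staging {
    /** Stores one object per ticker symbol; the map stands in for the grid. */
    static Map<String, PriceHistory> populate(PriceHistory... histories) {
        Map<String, PriceHistory> grid = new ConcurrentHashMap<>();
        for (PriceHistory history : histories) {
            grid.put(history.ticker, history);
        }
        return grid;
    }
}
```
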
  18. Code the Eval and Merge Methods
      • Step 2: Write a method to evaluate a stock history based on parameters:
          Results EvalStockHistory(StockHistory history, Parameters params)
          {
              <analyze trading strategy for this stock history>
              return results;
          }
      • Step 3: Write a method to merge the results of two evaluations:
          Results MergeResults(Results results1, Results results2)
          {
              <merge both results>
              return results;
          }
      • Notes:
        – This code can be run as a sequential calculation on in-memory data.
        – No explicit accesses to the distributed data grid are used.
  19. Run the Analysis
      • Step 4: Invoke parallel evaluation and merging of results:
          Results Invoke(EvalStockHistory, MergeResults, querySpec, params);
      [Diagram: EvalStockHistory() feeding MergeResults().]
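
Steps 2 through 4 can be tried end to end on one machine with the sketch below. Everything beyond the method names on the slides is an assumption: the `StockHistory`, `Parameters`, and `Results` fields are invented for illustration, the "count days whose gain exceeds a threshold" rule is a toy stand-in for a real trading-strategy analysis, the query specification is modeled as a `Predicate`, and `invoke` runs over a local collection rather than a grid.

```java
import java.util.Collection;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Predicate;

/** Hypothetical data types; field choices are assumptions for illustration. */
final class StockHistory {
    final String ticker;
    final double[] prices;

    StockHistory(String ticker, double... prices) {
        this.ticker = ticker;
        this.prices = prices;
    }
}

final class Parameters {
    final double threshold;

    Parameters(double threshold) {
        this.threshold = threshold;
    }
}

final class Results {
    final int historiesEvaluated;
    final int signals;

    Results(int historiesEvaluated, int signals) {
        this.historiesEvaluated = historiesEvaluated;
        this.signals = signals;
    }
}

final class Analysis {
    /** Toy eval: counts day-over-day gains above the threshold (illustrative only). */
    static Results evalStockHistory(StockHistory history, Parameters params) {
        int signals = 0;
        for (int i = 1; i < history.prices.length; i++) {
            if (history.prices[i] - history.prices[i - 1] > params.threshold) {
                signals++;
            }
        }
        return new Results(1, signals);
    }

    /** Merge: adds the counters of two partial results (associative). */
    static Results mergeResults(Results a, Results b) {
        return new Results(a.historiesEvaluated + b.historiesEvaluated,
                           a.signals + b.signals);
    }

    /** Local stand-in for Invoke: select by query, eval in parallel, then merge. */
    static Optional<Results> invoke(BiFunction<StockHistory, Parameters, Results> eval,
                                    BinaryOperator<Results> merge,
                                    Collection<StockHistory> grid,
                                    Predicate<StockHistory> querySpec,
                                    Parameters params) {
        return grid.parallelStream()
                   .filter(querySpec)
                   .map(history -> eval.apply(history, params))
                   .reduce(merge);
    }
}
```

As the notes on the previous slide point out, the eval and merge methods contain no grid access at all, so they can be developed and tested as ordinary sequential code before being run in parallel.
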
  20. Start parallel analysis
      [Diagram: .eval() runs on six stock histories in parallel and yields six results; three .merge() calls combine them into three results, and a final .merge() produces the results returned to the client.]
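
The combining tree can be sketched as a level-by-level pairwise merge. `MergeTree.mergeAll` is a hypothetical helper that runs sequentially for clarity; in a grid, the merges within one level are independent and could run on different servers. The exact tree shape a product uses may differ from this one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

final class MergeTree {
    /**
     * Merges results pairwise, level by level, until one result remains.
     * Each level roughly halves the number of partial results.
     */
    static <R> R mergeAll(List<R> results, BinaryOperator<R> merge) {
        if (results.isEmpty()) {
            throw new IllegalArgumentException("no results to merge");
        }
        List<R> level = new ArrayList<>(results);
        while (level.size() > 1) {
            List<R> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2) {
                // Merges within one level do not depend on each other.
                next.add(merge.apply(level.get(i), level.get(i + 1)));
            }
            if (level.size() % 2 == 1) {
                // An unpaired result is carried up to the next level unchanged.
                next.add(level.get(level.size() - 1));
            }
            level = next;
        }
        return level.get(0);
    }
}
```
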
  21. DDG Minimizes Data Motion
      • File-based map/reduce must move data to memory for analysis:
        [Diagram: M/R servers run E in server memory on data D loaded from the file system / database.]
      • Memory-based DDG analyzes data in place:
        [Diagram: grid servers run E directly on data D held in the distributed data grid.]
  22. Start parallel analysis
      [Diagram: the same .eval() and .merge() tree over six stock histories, with file I/O added at the .eval() stage and between the .merge() levels before the results are returned to the client.]
  23. Performance Impact of Data Motion
      Measured random access to DDG data to simulate file I/O.
  24. Comparison of DDGs and File-Based M/R
                             DDG                                       File-Based M/R
      Data set size          Gigabytes->terabytes                      Terabytes->petabytes
      Data repository        In-memory                                 File / database
      Data view              Queried object collection                 File-based key/value pairs
      Development time       Low                                       High
      Automatic scalability  Yes                                       Application dependent
      Best use               Quick-turn analysis of memory-based data  Complex analysis of large datasets
      I/O overhead           Low                                       High
      Cluster mgt.           Simple                                    Complex
      High availability      Memory-based                              File-based
  25. Walk-Away Points
      • Developers need fast, scalable, highly available and sharable memory-based storage for scaled-out applications.
      • Distributed data grids (DDGs) address these needs with:
        – Fast access time & scalable throughput
        – Highly available data storage
        – Support for parallel data analysis
      • Cloud-based and globally distributed applications need DDGs to:
        – Support scalable data access for “elastic” applications.
        – Efficiently and easily migrate data across sites.
        – Avoid relatively slow cloud I/O storage and interconnects.
      • DDGs offer simple, fast “map/reduce” parallel analysis:
        – Make it easy to develop applications and configure clusters.
        – Avoid file I/O overhead for datasets that fit in memory-based grids.
        – Deliver automatic, highly scalable performance.
  26. Distributed Data Grids for Server Farms & High Performance Computing