Benchmarking data warehouse systems in the cloud: new requirements & new metrics
Presentation Transcript

  • Data Warehouse Systems in the Cloud: new requirements and new challenges
    Rim Moussa (rim.moussa@esti.rnu.tn)
    LaTICE Lab., University of Tunis; ESTI, University of Carthage
    10th Intl. Conference on Computer Systems and Applications (AICCSA), Fez, Kingdom of Morocco, 30th May 2013
    Keynote @ Intl. Conference on Computing, Networking and Communications, Hammamet, Tunisia
    Slide footer: 30th May 2013, DWS in the Cloud, AICCSA'13, Fez
  • Context [diagram relating Cloud Rationale, Benchmarking and Data Warehouse Systems, marked "NO"]
  • Outline
    1. Cloud Computing
    2. Data Warehouse Systems
    3. Overview of DWS Benchmarks
    4. New Requirements for DWS in the Cloud
    5. Related Work
    6. Conclusion
    7. Research Perspectives
  • Cloud Computing
    ● NIST Definition
      – Cloud computing as a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
    ● Opportunities
      – Performance: faster data analysis through usage of up-to-date hardware infrastructure made available by Cloud Service Providers (CSPs)
      – More economical: organizations no longer need to expend capital upfront for hardware and software purchases, with services provided on a pay-per-use basis
  • Cloud Computing --Market share
    ● Market Share
      – Forrester Research expects the global cloud computing market to reach $241 billion in 2020
      – Gartner: the public cloud services market is forecast to grow 18.5% in 2013 to total $131 billion worldwide, up from $111 billion in 2012
      – Gartner: the public cloud services market in the Middle East and North Africa (MENA) is expected to increase by 24.5% in 2013
      – Gartner: the public cloud services market in India is forecast to grow 36% in 2013 to total $443 million, up from $326 million in 2012
  • Data Warehouse Systems --Typical System Architecture
  • Data Warehouse Systems --Technologies
    ● Traditional Relational DBMSs & OLAP Servers
      – Mature
      – Do not scale linearly
    ● NoSQL solutions
      – Adopted by Google, Facebook, Amazon, ...
      – Dynamic horizontal scaling (scale-out): nodes are added without bringing the cluster down
      – Shared-nothing architecture: independent computing and storage nodes interconnected via a high-speed network
      – MapReduce: distributed programming framework
  • Data Warehouse Systems --Challenges with big data management
  • Data Warehouse Systems --Common Optimizations: Hardware Storage Tech.
    ● DRAM: in-memory data processing (very expensive)
    ● SSD (Solid State Drive): a non-volatile type of memory
    ● An SSD does not have a mechanical arm to read and write data

                              SSD                HDD
      Cost/GB                 $1/GB              $0.075/GB
      Typical size            512GB              Up to 2TB
      Failure rate (MTBF)     2 million hours    1.5 million hours
      Read/Write speed        200-500 MBps       120 MBps
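
To make the trade-off in the table concrete, here is a minimal Python sketch that prices a data volume on each device and estimates the time for one sequential scan. The figures come from the slide; the function names, the reading of "MBps" as megabytes per second, and the use of decimal units (1 GB = 1000 MB) are assumptions made for illustration.

```python
# Device figures taken from the slide's SSD/HDD table.
SSD = {"cost_per_gb": 1.0, "read_mbps": (200, 500)}   # (slowest, fastest) MB/s
HDD = {"cost_per_gb": 0.075, "read_mbps": (120, 120)}


def storage_cost(volume_gb, device):
    """Purchase cost in dollars for storing volume_gb on the device."""
    return volume_gb * device["cost_per_gb"]


def scan_seconds(volume_gb, device):
    """(best, worst) time in seconds for one sequential scan of the volume."""
    slow, fast = device["read_mbps"]
    megabytes = volume_gb * 1000  # assumed decimal units
    return megabytes / fast, megabytes / slow


if __name__ == "__main__":
    for name, device in (("SSD", SSD), ("HDD", HDD)):
        best, worst = scan_seconds(1000, device)
        print(name, storage_cost(1000, device), round(best), round(worst))
```

For a 1 TB warehouse this gives roughly $1000 and a scan of 2000 to 5000 seconds on SSD, against roughly $75 and about 8300 seconds on HDD.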
  • Data Warehouse Systems --Common Optimizations: Columnar Storage Principle
    ● Row-oriented storage
      – Read pages containing all columns (Date, Customer, Product, Price, Quantity)
    ● Column-oriented storage
      – Read only the columns needed for query processing
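
The difference between the two layouts can be sketched in a few lines of Python. This is only an in-memory illustration, not how any particular engine stores pages: the sample rows are made up, and "fields read" stands in for I/O volume.

```python
# Toy table with the five columns named on the slide (values are made up).
COLUMNS = ("date", "customer", "product", "price", "quantity")
rows = [
    ("2013-05-01", "alice", "laptop", 900.0, 1),
    ("2013-05-01", "bob", "mouse", 20.0, 3),
    ("2013-05-02", "carol", "laptop", 950.0, 2),
]

# Column-oriented layout: one list per column.
columns = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def revenue_row_store(table):
    """Sum price * quantity; every field of every row is touched."""
    fields_read = 0
    total = 0.0
    for row in table:
        fields_read += len(row)
        total += row[3] * row[4]
    return total, fields_read


def revenue_column_store(cols):
    """Sum price * quantity; only the two needed columns are touched."""
    prices, quantities = cols["price"], cols["quantity"]
    fields_read = len(prices) + len(quantities)
    total = sum(p * q for p, q in zip(prices, quantities))
    return total, fields_read


if __name__ == "__main__":
    print(revenue_row_store(rows))
    print(revenue_column_store(columns))
```

Both functions return the same total, but the column layout touches 6 fields where the row layout touches 15.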
  • Data Warehouse Systems --Common Optimizations: Columnar Storage Benefits
    ● Allows a better data compression rate, since data values are redundant within a single column
    ● Eliminates unnecessary I/O by retrieving only the relevant data
    ● Vectorwise is in the TPC-H Top Ten Performance Results (14-Jun-2013)
  • Data Warehouse Systems --Common Optimizations: Derived Data
    ● Derived Data:
      – Indexes
      – Derived attributes
      – Aggregate tables
    ● Pros:
      – High performance
    ● Cons:
      – Maintenance: refresh is expensive
      – Storage cost
  • Data Warehouse Systems --DWS Benchmarks
    ● APB-1 OLAP Benchmark (obsolete)
      – Released by the OLAP Council (www.olapcouncil.org) in 1998
      – A simple star schema data model
    ● TPC DSS Benchmarks
      – Released by the Transaction Processing Performance Council (www.tpc.org)
      – Examine large volumes of data (from 10GB to 100TB)
      – Complex relational data model
      – TPC-H
        ● Workload composed of 22 ad-hoc complex SQL statements
        ● The most prominent DSS benchmark
      – TPC-DS, successor of TPC-H
        ● Workload composed of 99 SQL business questions
        ● Same metrics as TPC-H
  • Data Warehouse Systems --TPC-H Benchmark Metrics (same for TPC-DS)
    ● Query-per-hour Performance Metric
      – For a given scale factor (warehouse data volume)
      – Concurrent users
    ● Price-Performance Metric
      – Ratio of the priced system (cost of ownership: hardware, software, maintenance, and everything else needed to run the TPC-H workload) to the query performance metric
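
As a rough illustration of the two metrics, the sketch below follows the TPC-H specification's composite definition, where QphH@Size is the geometric mean of the single-stream Power@Size and the multi-stream Throughput@Size, and price-performance is the priced system divided by QphH@Size. The function names and the sample numbers are hypothetical and are not taken from any published result.

```python
import math


def qphh(power_at_size, throughput_at_size):
    """Composite query-per-hour metric: geometric mean of power and throughput."""
    return math.sqrt(power_at_size * throughput_at_size)


def price_performance(total_system_price, qphh_at_size):
    """Dollars per QphH@Size; lower is better."""
    return total_system_price / qphh_at_size


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    composite = qphh(power_at_size=40000.0, throughput_at_size=10000.0)
    print(composite, price_performance(500000.0, composite))
```

With these made-up inputs the composite is 20000 QphH and the price-performance is 25 $/QphH.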
  • Data Warehouse Systems --TPC-H mismatches Cloud Rationale
    ● TPC-H does not represent BI suites
      – Integration services
      – Analytics services (Multi-dimensional eXpressions language, Mining Structures)
      – Reporting services
    ● TPC-H Workload Processing Metric
      – Qph@Size defines the number of queries processed per hour
      – The workload is assumed static, which is not realistic!
      – The benchmark should assess the scalability of the system under test (SUT) under variable and evolving workload and data volumes
  • Data Warehouse Systems --TPC-H mismatches Cloud Rationale (contd. 1)
    ● TPC-H Cost-Performance Metric
      – $/Qph@Size, where the cost covers all hardware, software and HR required for running the workload (3 yrs)
      – The cost model in the cloud is different, and does not relate to the cost of ownership
    ● TPC-H does not report a Cost-Effectiveness Metric
    ● TPC-H implementation vs. CAP theorem
      – CAP theorem: a distributed system cannot simultaneously guarantee all three of Consistency (same view of data), Availability (query response) and Partition Tolerance (cope with hardware crash)
      – Since DWS are deployed onto shared-nothing architectures, benchmarks should be CA-, CP- or AP-compliant
  • New Requirements & New Metrics
  • High Performance Requirement --Data Transfer IN/OUT CSP
    ● Data Transfer Characteristics
      – Huge data volumes are transferred in and out of the Cloud Service Provider (CSP)
      – Resulting in a network-bound DWS
      – Usually, the cost model adopted by CSPs is:
        ● Data upload into the CSP is free of charge
        ● Data download out of the CSP is priced
    ● Data Transfer Metrics in the Cloud
      – Time and cost for data upload
      – Time and cost for data download
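
The two data transfer metrics can be estimated with simple arithmetic. The sketch below assumes a sustained link bandwidth and a flat per-GB price; the bandwidth, the prices and the function names are placeholders, not figures from any provider.

```python
def transfer_time_hours(volume_gb, bandwidth_mbps):
    """Hours needed to move volume_gb over a link of bandwidth_mbps (megabits/s)."""
    megabits = volume_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / bandwidth_mbps / 3600


def transfer_cost(volume_gb, price_per_gb):
    """Dollar cost of moving volume_gb at a flat per-GB price."""
    return volume_gb * price_per_gb


if __name__ == "__main__":
    volume_gb = 1000          # hypothetical warehouse extract
    upload_price = 0.0        # upload is usually free, per the slide
    download_price = 0.10     # assumed placeholder price per GB
    print(transfer_time_hours(volume_gb, bandwidth_mbps=100))
    print(transfer_cost(volume_gb, upload_price))
    print(transfer_cost(volume_gb, download_price))
```

At 100 Mbit/s, moving 1000 GB takes a little over 22 hours in either direction, which is why the slide describes such deployments as network-bound.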
  • High Performance Requirement (contd. 1) --Workload Processing
    ● Workload Processing Characteristics
      – Both I/O-bound and CPU-bound business questions
      – Intra-query processing combined with virtual partitioning or physical processing
    ● Performance across Cluster Size
      – For each business question, the response time is optimal at a particular cluster size, and performance degrades for both smaller and larger clusters
      – Shown for both SQL and NoSQL technologies
  • High Performance Requirement (contd. 2) --Workload Processing
    ● TPC-H benchmarking of Apache Hadoop/Pig Latin on GRID5000, Bordeaux site [Moussa, ICCIT'12] (SF=10)
  • High Performance Requirement (contd. 3) --Workload Processing
    ● Workload Processing Metrics
      – Elapsed times for running business questions
      – Slope: performance vs. cost
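
One possible reading of these metrics is sketched below: given elapsed times measured at several cluster sizes, pick the size with the best response time and compute, between consecutive sizes, how many seconds are gained per extra dollar spent. The measurements and the per-node price are invented for illustration, and interpreting "slope" as seconds saved per dollar is an assumption rather than the author's stated definition.

```python
# Hypothetical measurements: cluster size -> elapsed seconds for one query.
elapsed_s = {2: 400.0, 4: 220.0, 8: 150.0, 16: 170.0, 32: 240.0}
NODE_PRICE_PER_HOUR = 0.50  # assumed flat price per node-hour


def run_cost(nodes, seconds, price=NODE_PRICE_PER_HOUR):
    """Dollar cost of keeping `nodes` busy for `seconds`."""
    return nodes * price * seconds / 3600


def best_cluster_size(measurements):
    """Cluster size with the smallest elapsed time."""
    return min(measurements, key=measurements.get)


def slopes(measurements):
    """(small, large, seconds saved per extra dollar) for consecutive sizes.

    Assumes the cost differs between consecutive sizes (no zero division).
    """
    sizes = sorted(measurements)
    result = []
    for small, large in zip(sizes, sizes[1:]):
        d_time = measurements[small] - measurements[large]
        d_cost = (run_cost(large, measurements[large])
                  - run_cost(small, measurements[small]))
        result.append((small, large, d_time / d_cost))
    return result


if __name__ == "__main__":
    print(best_cluster_size(elapsed_s))
    for small, large, slope in slopes(elapsed_s):
        print(small, large, round(slope, 1))
```

A negative slope means that the larger cluster is both slower and more expensive, which matches the degradation past the optimum described on the earlier slide.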
  • Scalability Requirement
    ● Definition
      – Scalability is the ability of a system to increase total throughput under an increased load when hardware resources are added.
    ● Scalability Metric
      – Query performance metric under
        ● Ever-increasing workload
        ● Different query frequencies
  • Elasticity Requirement
    ● Definition
      – Elasticity adjusts the system capacity at runtime by adding and removing resources without service interruption in order to handle the workload variation.
    ● Elasticity Metrics
      – Capacity to add/remove resources: (0|1)
      – Scaling latency: elapsed time to scale down and scale up
      – Impact on SUT performance during scale-up and scale-down
      – Scale-up cost (+$)
      – Scale-down gain (-$)
  • High Availability Requirement --Redundancy Strategies
    ● Redundancy Strategies
      – Replication (a.k.a. mirroring)
      – Erasure-resilient codes
    ● Redundancy Strategies vs. Workload Type
      – Replication suits OLTP workloads
      – Erasure-resilient codes suit OLAP workloads
    ● Comparison [Litwin et al., ACM TODS'05]
      – Data storage cost
      – Computation cost
      – Communication cost
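
The storage-cost side of this comparison can be sketched as follows. The sketch assumes replication keeps k extra full copies to survive k failures, and that the erasure code is a maximum-distance-separable code (Reed-Solomon is one example) protecting a group of m data buckets with k parity buckets. The function names and the group size are illustrative and are not taken from the cited paper.

```python
def replication_overhead(k):
    """Extra storage, as a fraction of the data, for k additional full copies."""
    return float(k)


def erasure_code_overhead(m, k):
    """Extra storage, as a fraction of the data, for k parity buckets
    per group of m data buckets (assumes an MDS code)."""
    return k / m


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, replication_overhead(k), erasure_code_overhead(4, k))
```

To tolerate two failures, replication stores 200% extra data, while a 4+2 erasure-coded group stores 50% extra. The price is paid in the other two rows of the comparison: encoding and recovery need computation and communication across the group.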
  • High Availability Requirement (contd. 1) --Strategies Comparison
  • High Availability Requirement (contd. 2) --Metrics for the Cloud
    ● High Availability Metrics
      – $@k: cost of different targeted levels of availability (1-available, ..., k-available, where k is the number of failures the system can tolerate)
      – Cost of recovery, expressed as
        ● Time to get the system back
        ● Decreased system productivity caused by the hardware failure ($), from the customer perspective
  • Cost Management Requirement
    ● CSP pricing model
      – Different cloud service price models (IaaS, PaaS, SaaS)
      – e.g. CPU cost for IaaS: instance-based (Amazon, MS Azure) or CPU-cycles-based (Cloud Sites, Google App Engine)
      – Query processing by Google BigQuery is priced on retrieved bytes (columnar storage)
    ● Cost-Performance ratio
    ● Cost-Effectiveness ratio
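
The pricing models above lead to different cost formulas for the same workload. The following sketch contrasts an instance-based charge with a bytes-retrieved charge and derives a cost-performance ratio. All prices, volumes and function names are placeholders chosen for illustration and do not reflect any provider's actual tariff.

```python
def instance_based_cost(instances, hours, price_per_instance_hour):
    """IaaS-style charge: pay for every instance-hour, busy or idle."""
    return instances * hours * price_per_instance_hour


def bytes_retrieved_cost(bytes_retrieved, price_per_tb):
    """Query-style charge: pay per terabyte read by the queries."""
    return bytes_retrieved / 1e12 * price_per_tb


def cost_performance(cost, queries_per_hour):
    """Dollars per unit of query throughput; lower is better."""
    return cost / queries_per_hour


if __name__ == "__main__":
    # Placeholder figures, not real prices.
    iaas = instance_based_cost(instances=4, hours=2, price_per_instance_hour=0.5)
    scan = bytes_retrieved_cost(bytes_retrieved=2.5e11, price_per_tb=5.0)
    print(iaas, scan, cost_performance(iaas, queries_per_hour=100))
```

Under a bytes-retrieved model, columnar storage lowers the bill directly, because a query that reads fewer columns retrieves fewer bytes.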
  • Related Work
    ● Benchmarking in the cloud
      – [Gray, MS'08]: TeraSort Benchmark for data sort evaluations
      – [Cooper et al., SoCC'10]: Yahoo Cloud Serving Benchmark (YCSB) for evaluating the performance of "key-value" and "cloud" serving stores
      – [Sobel et al., ICCSA'08]: CloudStone Benchmark for Web 2.0 applications
      – [Bennet et al., KDD'10]: MalStone Benchmark for data mining in the cloud
      – [Ang et al., USENIX'10]: CloudCMP project for CSP comparison
      – [Binnig et al., DBTest'09], [Kossmann et al., SIGMOD'10]: Benchmarking OLTP systems in the cloud
  • Related Work (contd. 1)
    ● NoSQL and SQL Technologies Assessment in the cloud
      – [Pavlo et al., SIGMOD'09], [Floratou et al., TPC-TC'11]
    ● More Specific Issues
      – [Forrester, 2011]: Storage on-premises vs. in the cloud
      – [Nguyen et al., EDBT Workshops'12]: Materialized Views Selection
      – [Moussa, IJWA'12]: OLAP Scenarios in the Cloud and OLAP Workload Taxonomy
  • Conclusion & Future Work
    ● Keynote scope
      – Overview of DWS
      – Insight into the new requirements and new metrics to be considered for benchmarking DWS in the cloud [Moussa, AICCSA'13]
    ● Research Perspectives
      – Assessment of OLAP systems in the cloud
        ● Amazon RDS
        ● Google BigQuery
        ● MS Azure
        ● ...
  • Research Perspectives --New OLTP Systems
    ● Classical Workload Taxonomy
      – OLTP: transactions, ACID properties
      – OLAP: complex queries, star-joins, grouping, aggregations...
    ● New OLTP Workload features:
      – OLTP
      – Big Data
      – Real-time analytics
    ● Examples of systems: Google Spanner, Clustrix, NuoDB and TransLattice
  • Thank you for your attention. Q&A?
    Rim Moussa, Data Warehouse Systems in the Cloud, N2C'2013, Hammamet, 15th June 2013
  • Data Warehouse Systems --TPC-H Benchmark Relational DB Schema
  • Data Warehouse Systems --TPC-H Benchmark Metrics