PDGF: A Data Generator for Cloud-Scale Benchmarking
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

PDGF: A Data Generator for Cloud-Scale Benchmarking

on

  • 584 views

This is a presentation that was held at the Second TPC Technology Conference on Performance Evaluation & Benchmarking, 17.09.2010, Singapore. ...

This is a presentation that was held at the Second TPC Technology Conference on Performance Evaluation & Benchmarking, 17.09.2010, Singapore.

Abstract:
In many fields of research and business data sizes are breaking the petabyte barrier. This imposes new problems and research possibilities for the database community. Usually, data of this size is stored in large clusters or clouds. Although clouds have become very popular in recent years, there is only little work on benchmarking cloud applications. We present a data generator for cloud sized applications. Its architecture makes the data generator easy to extend and to configure. A key feature is the high degree of parallelism that allows linear scaling for arbitrary numbers of nodes.We show how distributions, relationships and dependencies in data can be computed in parallel with linear speed up.

Statistics

Views

Total Views
584
Views on SlideShare
584
Embed Views
0

Actions

Likes
1
Downloads
1
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

PDGF: A Data Generator for Cloud-Scale Benchmarking Presentation Transcript

  • 1. T. Rabl, M. Frank, H. Mousselly Sergieh, and H. Kosch 17.09.2010 Second TPC Technology Conference on Performance Evaluation & Benchmarking
  • 2.  Data sizes grow beyond petabyte barrier  Cloud computing to the rescue  Need for cloud-scale benchmarking  Need for realistic, cloud-scale data sets  Problems:  Generating petabytes  Storing petabytes  Transporting petabytes  Solution:  Parallel, on-site generation A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 2
  • 3.  3 tables with primary and foreign keys Non-uniform distributions (lognormal) Replicated data  How to generate consistent data in parallel?   A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 3
  • 4.  No references  Only uncorrelated data  Simple statistical references  Scanned references  Read data to generate reference  Generate tuple pairs  Computed references  Data is generated deterministically  Compute data to generate reference Parallel Data Generation Framework A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 4
  • 5.  Parallel pseudo random number generator  Deterministic  Fast skip ahead  Generation of values as a function  Deterministic  Seeds + row id + generator allows recalculation  Every value can be computed independently A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 5
  • 6. seed t_id Table RNG seed c_id Column RNG seed r_id Row RNG rn    Generator user row_id user_id degree_program … 1 2 3 Seeding strategy All seeds can be cached Generation of value in row n with n-th random number A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 6
  • 7. seed id Table RNG seed id Column RNG seed r_id Lognormal rn seed rn    Row RNG r_id seminar_user row_id user_id degree_program … 1 2 3 Row RNG Generator Reference generation Lognormal indexes row_id of user Equal seeds and lognormal generator for user_id A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 7
  • 8.   Java based Plug-in concept  Generators  Distributions  RNGs  Output  Scheduler  TPC-H plug-in A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 8
  • 9.   XML file Reflects SQL schema  Tables  Fields     Seed Size Scale factor Output <project name="simpleUserSeminar"> [..] <table name=“seminar_user"> <size>201754</size> <fields> <field name="user_id"> <type>java.sql.Types.INTEGER</type> <reference> <referencedField>user_id</referencedField> <referencedTable>user</referencedTable> </reference> <generator name="DefaultReferenceGenerator"> <distribution name="LogNormal"> <mu>7.60021</mu><sigma>1.40058</sigma> </distribution> </generator> </field> <field name="degree_program"> [..] A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 9
  • 10.    Every value can be computed independently <?xml version="1.0" encoding="UTF-8"?> No communication <nodeConfig> <nodeNumber>1</nodeNumber> Workload distribution <nodeCount>2</nodeCount>  Configuration for every node <workers>2</workers> </nodeConfig>  Automatic distribution seminar_user Cluster/Cloud Nodes Processors 100000 rows Row 1 – 50000 Row 1 – 25000 Row 25001 – 50000 Row 50001 – 100000 Row 50001 – 75000 A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl Row 75001 – 100000 10
  • 11.  16 node HPC cluster  2 Intel Xeon QuadCore processors  16GB RAM  2 x 74GB HDD, RAID 0  SetQuery data set  1 table “Bench”, 21 columns  12 Random numbers, 8 strings  TPC-H data set  8 tables, 61 columns  Data types: integer, char, varchar, decimal, date A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 11
  • 12.    SetQuery data set 100 GB data set (SF 460) 1 to 16 nodes A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 12
  • 13.  TPC-H data set  Data sizes: 1 GB, 10 GB, 100 GB  Single node (8 cores) A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 13
  • 14.    Parallel data generation framework Linear speed up Fast generation  Large data sets  Realistic data   Independent generation of tables / colums / values Easy configuration and extension A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 14
  • 15.  More implementation      generators, distributions Benchmarks Graphical user interface Scheduler Query generator  Consistent inserts, updates, deletes  Precomputed query results  Time series A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 15
  • 16.  SetQuery data set  Data sizes: 220 MB, 2.2 GB, 10 GB, 22 GB, 100 GB  Single node A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 17
  • 17.  TPC-H data set  Data sizes: 1 GB, 10 GB, 100 GB, 1TB  1, 10, 16 nodes A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 18
  • 18. A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 19
  • 19. A Data Generator for Cloud-Scale Benchmarking - Tilmann Rabl 20