
Rapid Development of Data Generators Using Meta Generators in PDGF


This presentation was given at the Sixth International Workshop on Testing Database Systems (DBTest 2013), co-located with ACM SIGMOD 2013, on June 24, 2013, in New York, USA.

Full paper and additional information available at:
http://msrg.org/papers/dbtest13-rabl

Abstract:
Generating data sets for the performance testing of database systems on a particular hardware configuration and application domain is a very time-consuming and tedious process. It is time-consuming because of the large amount of data that needs to be generated, and tedious because new data generators might need to be developed or existing ones adjusted. The difficulty in generating this data is amplified by constant advances in hardware and software that allow the testing of ever larger and more complicated systems. In this paper, we present an approach for rapidly developing customized data generators. Our approach, which is based on the Parallel Data Generator Framework (PDGF), deploys a new concept of so-called meta generators. Meta generators extend the concept of column-based generators in PDGF. Deploying meta generators in PDGF significantly reduces the development effort of customized data generators, facilitates their debugging, and eases their maintenance.



  1. Rapid Development of Data Generators Using Meta Generators in PDGF
     Tilmann Rabl, Meikel Poess, Manuel Danisch, Hans-Arno Jacobsen
     Middleware Systems Research Group (MSRG.ORG)
     DBTest 2013, June 24, New York City
  2. DBMS Benchmarking is Increasingly Complex
     • Data volumes are skyrocketing
       - Enterprise data warehouses double every three years
       - Many enterprise data warehouses are petabytes in size
     • Systems are becoming increasingly complex
       - Large number of processor cores
       - Single systems (SMP) with a high number of cores (80 on commodity hardware, 2048 on specialized hardware)
       - Multi-node systems (the sky is the limit)
       - Large memory
       - Dell released a TPC-H benchmark with 15 TB of main memory on 64 systems
     • How to challenge these systems?
  3. Benchmarks are Increasingly Complex
     [Chart: number of tables and columns per benchmark. TPC-A: 4 tables, 10 columns; TPC-C: 9 tables, 92 columns; TPC-E: 33 tables, 188 columns; TPC-DS: 24 tables, 430 columns]
     • More tables, columns
     • More relationships, dependencies, data types, …
     • How to build these benchmarks?
     • Parallel Data Generation Framework to the rescue!
  4. Parallel Data Generation Framework
     • Generic data generation framework
     • Relational model
       - Schema specified in configuration file
       - Post-processing stage for alternative representations
     • Repeatable computation
       - Based on XORSHIFT random number generators
       - Hierarchical seeding strategy
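The repeatable computation on slide 4 can be illustrated with a minimal sketch. It uses Marsaglia's 64-bit xorshift step and derives one seed per level of a hierarchy, so any single field value can be recomputed from its position alone. The level names (project, table, column, row) and the mixing rule in `derive_seed` are assumptions made for illustration; they are not taken from PDGF's source.

```python
MASK64 = (1 << 64) - 1


def xorshift64(state: int) -> int:
    """One step of a 64-bit xorshift generator (shift triple 13, 7, 17)."""
    state &= MASK64
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state


def derive_seed(parent_seed: int, index: int) -> int:
    """Derive a child seed from a parent seed and a position.

    Hypothetical mixing rule: fold the index into the parent seed,
    then scramble it. The all-zero state is avoided because xorshift
    would stay at zero forever.
    """
    mixed = (parent_seed ^ ((index + 1) * 0x9E3779B97F4A7C15)) & MASK64
    return xorshift64(mixed or 1)


def field_value_seed(project_seed: int, table_id: int, column_id: int, row_id: int) -> int:
    """Walk the (assumed) hierarchy: project -> table -> column -> row."""
    table_seed = derive_seed(project_seed, table_id)
    column_seed = derive_seed(table_seed, column_id)
    return derive_seed(column_seed, row_id)


def random_long(seed: int, lo: int, hi: int) -> int:
    """Map a seed to a value in [lo, hi] (modulo bias ignored in this sketch)."""
    return lo + xorshift64(seed) % (hi - lo + 1)


# The same coordinates always yield the same value, in any order, on any node.
age_row_999 = random_long(field_value_seed(42, 0, 1, 999), 0, 120)
```

Because a value depends only on its coordinates, workers could generate disjoint row ranges independently and still agree on every value, which is the property that makes parallel and repeatable generation compatible.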
  5. Repeatable Data Generation
  6. PDGF Architecture
     • Controller
       - Initialization
     • Meta Scheduler
       - Inter-node scheduling
     • Scheduler
       - Inter-thread scheduling
     • Worker
       - Blockwise data generation
     • Update Black Box
       - Coordination of data updates
     • Seeding System
       - Random sequence adaption
     • Generators
       - Value generation
     • Output system
       - Data formatting
     • To generate data for a schema the user defines:
       - Schema XML file: defines the relational schema
       - Generation XML file: defines the output format (CSV, XML, merging tables)
  7. Configuring PDGF
     • Schema configuration
       - Data model
     • Relational model
       - Tables, fields
     • Properties
       - Table size, characters, …
     • Generators
       - Base generators
       - Meta generators
     • Update definition
       - Insert, update, delete
       - Generated as change data capture

     <table name="SUPPLIER">
       <size>${S}</size>
       <field name="S_SUPPKEY" size="" type="NUMERIC" primary="true" unique="true">
         <gen_IdGenerator />
       </field>
       <field name="S_NAME" size="25" type="VARCHAR">
         <gen_PrePostfixGenerator>
           <gen_PaddingGenerator>
             <gen_OtherFieldValueGenerator>
               <reference field="S_SUPPKEY" />
             </gen_OtherFieldValueGenerator>
             <character>0</character>
             <padToLeft>true</padToLeft>
             <size>9</size>
           </gen_PaddingGenerator>
           <prefix>Supplier </prefix>
         </gen_PrePostfixGenerator>
       </field>
       [..]
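To make the nesting in the S_NAME definition on slide 7 concrete, here is a minimal sketch of what the three nested generators appear to compute: take the supplier key, pad it with zeros on the left to nine characters, then prepend the prefix. The function names are hypothetical stand-ins for the configured generators, and the assumption that keys start at 1 is not stated on the slide.

```python
def id_generator(row: int) -> int:
    """Stand-in for gen_IdGenerator (assumption: keys start at 1)."""
    return row + 1


def padding(value, character: str, size: int, pad_to_left: bool) -> str:
    """Stand-in for gen_PaddingGenerator: pad the text form of value to size."""
    text = str(value)
    return text.rjust(size, character) if pad_to_left else text.ljust(size, character)


def pre_postfix(value: str, prefix: str = "", postfix: str = "") -> str:
    """Stand-in for gen_PrePostfixGenerator."""
    return f"{prefix}{value}{postfix}"


def s_name(row: int) -> str:
    # gen_OtherFieldValueGenerator: reuse the S_SUPPKEY value of the same row
    supp_key = id_generator(row)
    return pre_postfix(padding(supp_key, "0", 9, True), prefix="Supplier ")


print(s_name(0))  # Supplier 000000001
```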
  8. Base Generators in PDGF
     • DictList generator
       - Random line from file
     • Long generator
       - Random long in interval
     • Others
       - StaticValue
       - Double
       - Date
       - String
       - Text
       - …

     <table name="users">
       <size>10000</size>
       <fields>
         <field name="name">
           <type>java.sql.types.VARCHAR</type>
           <size>100</size>
           <gen_DictList>
             <file>dicts/names.dict</file>
           </gen_DictList>
         </field>
         <field name="age">
           <type>java.sql.types.NUMERIC</type>
           <gen_LongGenerator>
             <min>0</min>
             <max>120</max>
           </gen_LongGenerator>
         </field>
       </fields>
     </table>
  9. Null Generator
     • Add NULL logic to every generator?
       - Could easily be implemented in a higher class
       - Adds to the configuration file
       - Reduces performance (every time)
     • Higher-order generator NullGenerator
       - Only used if added to the schema
       - Can be added to any generator

     <field name="age">
       <type>java.sql.types.NUMERIC</type>
       <gen_NullGenerator>
         <probability>0.05</probability>
         <gen_LongGenerator>
           <min>0</min>
           <max>120</max>
         </gen_LongGenerator>
       </gen_NullGenerator>
     </field>
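The higher-order idea on slide 9 can be sketched as a wrapper that decides whether to call its sub-generator at all. This is an illustration of the pattern only, using Python's `random.Random` in place of PDGF's seeding system; `long_generator` and `null_generator` are hypothetical names.

```python
import random
from typing import Callable, Optional

Generator = Callable[[random.Random], Optional[int]]


def long_generator(lo: int, hi: int) -> Generator:
    """Base generator: random integer in the closed interval [lo, hi]."""
    def generate(rng: random.Random) -> Optional[int]:
        return rng.randint(lo, hi)
    return generate


def null_generator(probability: float, sub: Generator) -> Generator:
    """Meta generator: return None with the given probability,
    otherwise delegate to the wrapped sub-generator."""
    def generate(rng: random.Random) -> Optional[int]:
        if rng.random() < probability:
            return None  # the sub-generator is never invoked in this case
        return sub(rng)
    return generate


# Mirrors the "age" field above: 5% NULL, otherwise a value in [0, 120].
age = null_generator(0.05, long_generator(0, 120))
rng = random.Random(2013)
sample = [age(rng) for _ in range(10_000)]
null_share = sample.count(None) / len(sample)
```

Fields that never need NULLs simply omit the wrapper, so they pay no cost for the check, which is the design argument made on the slide.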
  10. Meta Generators
     • Control flow and post-processing generators
       - Null generator controls flow
     • Post-processing
       - FormattedNumberGenerator
       - PaddingGenerator
       - UpperLowerCaseGenerator
       - PrePostfixGenerator
       - FormulaGenerator
     • Flow control
       - ProbabilityGenerator
       - SequentialGenerator
       - IfGenerator
       - SwitchGenerator
       - ReferenceGenerator
  11. Post-Processing Example
     • Phone number for users
       - 10s of representations
       - PhoneNumberGenerator was too inflexible
     • Formatted long number
       - Long numbers between 10010001 and 9999999999
       - Number formatting (%d%d%d) %d%d%d-%d%d%d%d

     <field name="phonenumber">
       <type>java.sql.types.VARCHAR</type>
       <size>30</size>
       <generator name="FormattedNumberGenerator">
         <generator name="LongGenerator">
           <min>10010001</min>
           <max>9999999999</max>
         </generator>
         <format>(%d%d%d) %d%d%d-%d%d%d%d</format>
       </generator>
     </field>
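The formatting step on slide 11 can be sketched as follows. The slide does not spell out the placeholder semantics, so this sketch assumes that each `%d` consumes exactly one decimal digit and that shorter numbers are zero-padded on the left; `format_digits` is a hypothetical helper, not PDGF's FormattedNumberGenerator.

```python
PHONE_FORMAT = "(%d%d%d) %d%d%d-%d%d%d%d"


def format_digits(value: int, pattern: str) -> str:
    """Fill every %d in pattern with one decimal digit of value.

    Assumption: numbers with fewer digits than placeholders are
    zero-padded on the left; numbers with more digits are rejected.
    """
    slots = pattern.count("%d")
    digits = str(value).zfill(slots)
    if value < 0 or len(digits) > slots:
        raise ValueError(f"{value} does not fit into {slots} digit slots")
    literal_parts = pattern.split("%d")
    pieces = [literal_parts[0]]
    for digit, literal in zip(digits, literal_parts[1:]):
        pieces.append(digit)
        pieces.append(literal)
    return "".join(pieces)


print(format_digits(4165551234, PHONE_FORMAT))  # (416) 555-1234
```

Switching to another representation then only requires a different pattern string, for example `"%d%d%d-%d%d%d-%d%d%d%d"`, rather than a new generator class.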
  12. Flow Control Example
     • More elaborate name field
       - Name male or female
       - 50% chance
       - All upper case
       - Padded to 100 characters
     • Sequential generator
       - Probability generator
       - DictList generator
       - UpperLowerCase generator
       - Padding generator

     <field name="name">
       <type>java.sql.types.VARCHAR</type>
       <size>100</size>
       <generator name="SequentialGenerator">
         <generator name="ProbabilityGenerator">
           <probability value="0.5">
             <generator name="DictList">
               <file>dicts/female.dict</file>
             </generator>
           </probability>
           <probability value="0.5">
             <generator name="DictList">
               <file>dicts/male.dict</file>
             </generator>
           </probability>
         </generator>
         <generator name="UpperLowerCaseGenerator">
           <mode>uppercase</mode>
         </generator>
         <generator name="PaddingGenerator">
           <character> </character>
           <padToLeft>true</padToLeft>
         </generator>
       </generator>
     </field>
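The composition on slide 12 can be sketched as a small pipeline in which the sequential generator passes the current value from one step to the next. The in-memory name lists stand in for `dicts/female.dict` and `dicts/male.dict`, all function names are hypothetical, and the pad size of 100 is taken from the field size because the slide's PaddingGenerator element does not state one.

```python
import random

FEMALE = ["Alice", "Maria", "Chen"]   # stand-in for dicts/female.dict
MALE = ["Bob", "Omar", "Luca"]        # stand-in for dicts/male.dict


def dict_list(entries):
    """Base generator: pick a random entry, ignoring any incoming value."""
    return lambda rng, value=None: rng.choice(entries)


def probability(choices):
    """Flow control: run exactly one sub-generator, chosen by weight."""
    def generate(rng, value=None):
        threshold = rng.random()
        cumulative = 0.0
        for weight, sub in choices:
            cumulative += weight
            if threshold < cumulative:
                return sub(rng, value)
        return choices[-1][1](rng, value)  # guard against rounding error
    return generate


def upper_case():
    """Post-processing: transform the value produced so far."""
    return lambda rng, value: value.upper()


def padding(character, size, pad_to_left):
    """Post-processing: pad the value produced so far to a fixed size."""
    def generate(rng, value):
        return value.rjust(size, character) if pad_to_left else value.ljust(size, character)
    return generate


def sequential(*steps):
    """Flow control: feed each step the result of the previous one."""
    def generate(rng, value=None):
        for step in steps:
            value = step(rng, value)
        return value
    return generate


name_gen = sequential(
    probability([(0.5, dict_list(FEMALE)), (0.5, dict_list(MALE))]),
    upper_case(),
    padding(" ", 100, True),
)
example_name = name_gen(random.Random(7))
```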
  13. Core Performance
     [Chart: generation time for Static Value (no cache), Null Generator (100% NULL), and Null Generator (0% NULL), broken down into Base Time, Generator, Base Time Sub, and Sub Generator]
     • Test environment: single-core laptop, no I/O
     • Base time for framework ~ 55 ns (Base Time)
       - Seeding, method invocation, setting a value
     • Computation time for generator 50+ ns (Gen Time)
     • Cache update if referenced ~ 50 ns (Cache Update)
     • Cache lookup if intra-row reference ~ 50 ns (Cache Lookup)
     • Sub-generator invocation ~ 50 ns
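The figures on slide 13 allow a rough back-of-the-envelope estimate of per-value cost. The sketch below simply adds the quoted components; treating them as additive is an assumption of this sketch, the 50 ns generator time is the slide's lower bound, and the function is illustrative rather than a measured model.

```python
BASE_NS = 55            # framework base time per value (seeding, invocation, set)
GENERATOR_NS = 50       # lower bound per generator computation ("50+ ns")
SUB_INVOCATION_NS = 50  # per sub-generator invocation
CACHE_NS = 50           # per cache update or cache lookup


def estimate_ns(generators: int, sub_invocations: int = 0, cache_ops: int = 0) -> int:
    """Rough lower-bound cost per value, assuming the components add up."""
    return (BASE_NS
            + generators * GENERATOR_NS
            + sub_invocations * SUB_INVOCATION_NS
            + cache_ops * CACHE_NS)


def values_per_second(ns_per_value: int) -> float:
    return 1e9 / ns_per_value


simple = estimate_ns(generators=1)                      # one base generator
wrapped = estimate_ns(generators=2, sub_invocations=1)  # meta generator + sub-generator
print(simple, wrapped)  # 105 205
```

Under these assumptions a single simple generator costs at least about 105 ns per value, or roughly 9.5 million values per second on one core before any formatting or I/O.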
  14. Performance Basic Generators
     [Chart: generation time for DictList, LongGenerator, DoubleGenerator, DateGenerator, and RandomString]
     • Basic generators without formatting
       - 120 ns to 510 ns
  15. Performance Formatted Values
     [Chart: generation time for DictList, SimpleFormat Number Generator, DateGenerator (formatted), and DoubleGenerator (4 places)]
     • Basic generators with formatting
       - Usually > 1000 ns
  16. Performance Meta Generators
     [Chart: generation time for Null Generator (100% Null), Null Generator (0% Null), PrePostFix, Sequential (exec 2), Sequential (concat 2), and Sequential (2 formatted + long)]
     • Meta generator overhead:
       - Base overhead ~ 50 ns
       - Generator overhead starts from 50 ns
       - Sub-generator invocation ~ 50 ns
     • Often negligible due to lazy formatting
  17. Use Cases
     • TPC-H / SSB
       - 8 tables, 61 columns (first non-trivial example)
       - Without meta-FVGs: 26 custom FVGs
       - 2h editing: 10 custom FVGs
       - 1 day reimplementation: 0 custom FVGs, i.e. no coding
       - SSB variations: skews on dimension attributes, fact measures, references
     • TPC-DI (in progress)
       - 20 tables, 200 columns
       - 19 custom FVGs (mainly for performance in corner cases)
       - 56x NullGenerator
       - 32x ProbabilityGenerator
       - 3000 lines of config (XML import for multiple files)
  18. Conclusion & Future Work
     • Meta generators
       - Improve usability and expressiveness
       - Speed up schema definition
       - Remove necessity for coding
       - Enlarged configuration files
     • Used in TPC benchmark(s)
     • Performance overhead is small, often negligible
     • Future work
       - GUI and SQL export
       - SQL import and data extraction
  19. Thanks
     • Questions?
     • Contact: tilmann.rabl@utoronto.ca
     • Download and try PDGF: http://www.paralleldatageneration.org
     • Some big data info in our BigBench presentation
       - Tuesday, 4pm, Industry 3
