Test Data Management
Harald Kikkers, Maarten Urbach & Bert Nienhuis
DATPROF
Data Integration, Test Data Management
• Dutch software supplier
• Founded in 1998
• Partners: ITCG, Sogeti, …
…and many more!
MANY ORGANISATIONS USE MULTIPLE COPIES OF PRODUCTION DATABASES
PURPOSES:
• TESTING
• DEVELOPMENT
• OUTSOURCING
• MARKETING
• TRAINING
Agile Development
• Building the right product
• Room for change
• Every 2-4 weeks working increments of the software
• Progress in development
How to test all these iterations?
And… what data to use?
Team 1, Team 2, Team 3
Production: 6 TB, 500 GB, 10 GB
Test: 6 TB, 500 GB, 10 GB
Development: 6 TB, 500 GB, 10 GB
Total: 19.53 TB
Test: Team 1, Team 2, Team 3
Development: Team 1, Team 2, Team 3
Team 1, Team 2, Team 3
Production: 6 TB, 500 GB, 10 GB
Development: 6 TB, 500 GB, 10 GB
Test: 6 TB, 500 GB, 10 GB
Development: 6 TB, 500 GB, 10 GB
Test: 6 TB, 500 GB, 10 GB
Development: 6 TB, 500 GB, 10 GB
Test: 6 TB, 500 GB, 10 GB
Total: 45.57 TB
Team 1, Team 2, Team 3
Production: 6 TB, 500 GB, 10 GB
Development: 600 GB, 50 GB, 1 GB
Test: 600 GB, 50 GB, 1 GB
Development: 600 GB, 50 GB, 1 GB
Test: 600 GB, 50 GB, 1 GB
Development: 600 GB, 50 GB, 1 GB
Test: 600 GB, 50 GB, 1 GB
Total: 10.4 TB
10% Subset, 10% Subset, 10% Subset
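The totals on the three storage slides can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB) and that every Test or Development environment holds a copy of all three databases; the function and variable names are illustrative, not taken from the slides.

```python
# Sizes of the three production databases in GB, as shown on the slides.
PRODUCTION_GB = [6000, 500, 10]

def total_tb(environments: int, fraction: float = 1.0) -> float:
    """Production plus `environments` copies, each scaled by `fraction`."""
    production = sum(PRODUCTION_GB)
    copies = environments * production * fraction
    return (production + copies) / 1000  # assumption: 1 TB = 1000 GB

shared = total_tb(environments=2)                    # one Test + one Development
per_team = total_tb(environments=6)                  # Test + Development per team
subsetted = total_tb(environments=6, fraction=0.10)  # same, with 10% subsets

print(f"{shared:.2f} TB, {per_team:.2f} TB, {subsetted:.2f} TB")
```

Under these assumptions the three scenarios come out at 19.53 TB, 45.57 TB and roughly 10.4 TB, which matches the totals on the slides.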
How to protect sensitive customer data?
Test Test Test
Development Development Development
Subsetting: advantages of subsetting data
• Minimize data usage
• Save on hardware & infra
• Reduce throughput times
• Efficient data management
Anonymizing: advantages of scrambling & masking data
• Protect customer information
• Comply with legislation
• Prevent brand damage
• Maintain competitive advantages
DBA Tools: $100 tools
ETL Suites: IBM, Informatica, Oracle
DBA Tools ETL Suites
?
DBA Tools ETL Suites
- User Experience
- Default templates
- Easy to maintain
- Smart functionality
- Chain support
DBA Tools ETL Suites
Production (Source Database) → Test/Development (Target Database)
Data model classification
Subset – Process data
Example: Customers, Orders, Contracts, Invoices, Transactions
Full – Master data
Example: Application data, configuration, master tables
Empty – Logging, non-relevant history
Example: Logging tables, temp tables
Determine data to be subsetted
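One way to picture the three classes is as a lookup that decides how many rows of each table reach the target database. This is only a sketch: the table names, the `CLASSIFICATION` mapping and the `rows_to_copy` helper are hypothetical and do not describe how any particular tool stores its classification.

```python
from enum import Enum

class TableClass(Enum):
    SUBSET = "subset"  # process data: copy only the filtered rows
    FULL = "full"      # master data: copy every row
    EMPTY = "empty"    # logging, non-relevant history: copy no rows

# Hypothetical table names, for illustration only.
CLASSIFICATION = {
    "customers": TableClass.SUBSET,
    "orders": TableClass.SUBSET,
    "app_config": TableClass.FULL,
    "audit_log": TableClass.EMPTY,
}

def rows_to_copy(table, rows, keep):
    """Return the rows of `table` that should reach the target database."""
    table_class = CLASSIFICATION[table]
    if table_class is TableClass.FULL:
        return list(rows)
    if table_class is TableClass.EMPTY:
        return []
    # SUBSET: keep only the rows selected by the filter function.
    return [row for row in rows if keep(row)]

customers = [{"id": 1, "country": "NL"}, {"id": 2, "country": "DE"}]
dutch_only = rows_to_copy("customers", customers, lambda r: r["country"] == "NL")
```

Only tables classified as Subset need a filter; Full and Empty tables can be handled without looking at their contents.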
Chain of systems
Method for deriving consistent subsets from multiple systems
Production → Test/Development
Start Filter: All customers from The Netherlands
Start Filter: All orders from customers in the previous subset.
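The two filters above can be sketched with an in-memory SQLite database. The schema, the sample rows and the country code `NL` are assumptions made for the example; the point is only that the second filter reuses the keys selected by the first, which keeps the subsets consistent.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'A', 'NL'), (2, 'B', 'DE'), (3, 'C', 'NL');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 2, 7.5), (12, 3, 1.0), (13, 2, 2.0);
""")

# Start filter: all customers from the Netherlands.
customer_subset = source.execute(
    "SELECT id, name, country FROM customers WHERE country = ? ORDER BY id",
    ("NL",),
).fetchall()

# Dependent filter: all orders from customers in the previous subset.
customer_ids = [row[0] for row in customer_subset]
placeholders = ",".join("?" for _ in customer_ids)
order_subset = source.execute(
    "SELECT id, customer_id, amount FROM orders "
    f"WHERE customer_id IN ({placeholders}) ORDER BY id",
    customer_ids,
).fetchall()
```

In a chain of systems, the same list of customer keys could serve as the start filter for the next system, so that every system ends up holding data for the same customers.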
1. Import meta data
2. Classification
3. Deployment
Anonymization of sensitive data
- Bank account balance
- Debt
- Medication
- Illness
- Religion
- Political preference
- Salary
- Phone history
- Et cetera…
- Name
- Date of birth
- Email
- Bank account number
- Social security number
- Address
- Insurance number
- Cellphone number
- Et cetera..
Personal data
Identifying Characteristics
“Any information relating to an identified or identifiable natural person ("data subject")”
Source: Data Protection Directive - Directive 95/46/EC
Techniques
Shuffle
Shuffle values within same column
Conditional
Manipulate specified rows

First name   Last name   Type
John         Clark       Customer
Max          Smith       Customer
Joe          Williams    Customer
             DATPROF     Company
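Shuffle and Conditional can be combined: shuffle a column, but only among the rows that meet a condition. The sketch below uses the names from the slide; the `shuffle_column` helper, the dictionary layout and the placement of DATPROF in the last-name column are assumptions for illustration.

```python
import random

rows = [
    {"first_name": "John", "last_name": "Clark", "type": "Customer"},
    {"first_name": "Max", "last_name": "Smith", "type": "Customer"},
    {"first_name": "Joe", "last_name": "Williams", "type": "Customer"},
    {"first_name": "", "last_name": "DATPROF", "type": "Company"},
]

def shuffle_column(rows, column, condition, seed=None):
    """Shuffle `column` in place among the rows that satisfy `condition`."""
    rng = random.Random(seed)
    selected = [row for row in rows if condition(row)]
    values = [row[column] for row in selected]
    rng.shuffle(values)
    for row, value in zip(selected, values):
        row[column] = value

# Conditional: only customer rows take part in the shuffle.
shuffle_column(rows, "last_name", lambda row: row["type"] == "Customer", seed=1)
```

The set of last names stays the same, so the data keeps looking realistic, and the company row is left untouched. A plain shuffle can leave a value in its original row, so it is usually combined with other techniques.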
First name   Last name   Type       E-Mail
John         Smith       Customer   j.clark@live.com
Max          Williams    Customer   Smith_max@mail.com
Joe          Clark       Customer   i_am@JoeWilliams.de
             DATPROF     Company
Comment column: “Brother of J. Clark”, “Has debt”

Blank
Delete values from columns
Scramble
Replace existing characters
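Blank and Scramble can be sketched as two small functions. The slide only says that Scramble replaces existing characters; replacing each letter or digit with a random one of the same kind, while keeping punctuation such as `@` and `.`, is one possible interpretation and an assumption of this example.

```python
import random
import string

def blank(value):
    """Blank: delete the value (represented here as None)."""
    return None

def scramble(value, seed=None):
    """Scramble: replace letters and digits with random ones, keeping
    case, length and punctuation such as '@', '.' and '_'."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)

row = {"comment": "Has debt", "email": "j.clark@live.com"}
row["comment"] = blank(row["comment"])
row["email"] = scramble(row["email"], seed=42)
```

Free-text columns such as comments are natural candidates for Blank, because they can contain identifying details that no pattern-based rule would catch.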
Nr.   First name   Last name   Type       Co..   E-mail            Date of Birth
123   John         Smith       Customer          x.xxxxx@xxxx...   01-02-1954
321   Max          Williams    Customer          Xxxxx_xxx@xx...   01-11-1984
789   Joe          Clark       Customer          x_xx@XxxXxxx...   01-03-1974
456                DATPROF     Company

First day
Change dates to first day within same month and year

Share of people uniquely identifiable, postal code combined with:
Date of Birth: 87%
1st day of month: 3.7%
1st day of year: 0.04%
Source: research on anonymity by Prof. Dr. Latanya Sweeney (Harvard University)
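The First day technique is a one-line date transformation. The sketch below shows both variants mentioned on the slide (first day of the month and first day of the year); the function names and the sample date of birth are hypothetical.

```python
from datetime import date

def first_day_of_month(d: date) -> date:
    """Keep month and year, set the day to 1."""
    return d.replace(day=1)

def first_day_of_year(d: date) -> date:
    """Keep only the year, set day and month to 1 January."""
    return d.replace(month=1, day=1)

born = date(1954, 2, 17)  # hypothetical original date of birth
masked_month = first_day_of_month(born)
masked_year = first_day_of_year(born)

print(masked_month.strftime("%d-%m-%Y"))  # prints 01-02-1954
```

Age-based test cases keep working because month and year are preserved, while the exact date no longer helps to single out a person.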
Nr.   First name   Last name   Type       ..   E-mail            Date of birth
123   John         Smith       Customer        x.xxxxx@xxxx...   01-02-1954
321   Max          Williams    Customer        Xxxxx_xxx@xx...   01-11-1984
789   Joe          Clark       Customer        x_xx@XxxXxxx...   01-03-1974
                   DATPROF     Company

Look-up
Replace values with values from a lookup table
First names to replace: John, Max, Joe
Replacement values: James, Adrian, Thomas
Reference data (first names): Chris, Thomas, James, Ruben, Adrian, Michael, David
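A look-up can be sketched as picking a replacement from the reference list. The slides do not say how the replacement is chosen; the version below derives it from a hash of the original value, which is an assumption made so that the same input always maps to the same output. The `lookup` function and the `secret` parameter are hypothetical.

```python
import hashlib

# Reference data taken from the slide.
REFERENCE_FIRST_NAMES = [
    "Chris", "Thomas", "James", "Ruben", "Adrian", "Michael", "David",
]

def lookup(value, reference, secret="demo-secret"):
    """Pick a replacement from `reference`, deterministically per input."""
    digest = hashlib.sha256((secret + value).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(reference)
    return reference[index]

masked = [lookup(name, REFERENCE_FIRST_NAMES) for name in ["John", "Max", "Joe"]]
```

This sketch will not reproduce the specific replacements shown on the slide, and two different inputs can map to the same reference name. A deterministic choice is useful when the same person appears in several tables or systems and must be masked identically everywhere.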
Nr.   First name   Last name   Type       Comment   E-mail                   Date of birth
123   Thomas       Smith       Customer             T.Smith@datprof.com      01-02-1954
321   James        Williams    Customer             J.Williams@datprof.com   01-11-1984
789   Adrian       Clark       Customer             A.Clark@datprof.com      01-03-1974
456                DATPROF     Company

Expression
Use custom made functions
E-mail (previously scrambled): T.Smith@datprof.com, J.Williams@datprof.com, A.Clark@datprof.com
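The e-mail addresses on this slide follow a visible pattern: first initial, a dot, the last name, and the domain datprof.com. A custom function that builds such a value from columns that have already been masked might look like the sketch below; the function name and its parameters are hypothetical.

```python
def company_email(first_name: str, last_name: str, domain: str = "datprof.com") -> str:
    """Build an e-mail address from the (already masked) name columns."""
    return f"{first_name[0]}.{last_name}@{domain}"

masked_rows = [("Thomas", "Smith"), ("James", "Williams"), ("Adrian", "Clark")]
emails = [company_email(first, last) for first, last in masked_rows]
```

Deriving the address from the masked names keeps the row internally consistent, which a scrambled address would not.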
1. Import meta data
2. Define masking rules
3. Deployment

DATPROF Test data Management (data privacy & data subsetting) - English


Editor's Notes

  • #8 Because the team itself decides how much work from the backlog it can take on and commits to it, and because the rest of the organisation sees its progress after each sprint, the team becomes extremely motivated and effective. Building software is highly changeable. Users often do not know exactly what they want until they see it in front of them or can work with it. That is why prototyping and the ability to adjust course after a sprint are extremely important. Claiming to do Scrum, but not doing it…….. Explain which mistakes
  • #9 - Test variants -
  • #13 Ratio between production and test: currently 60-40