Become a Big Data Quality Hero

Many believe that regression testing an application with minimal data is sufficient. However, the data testing methodology becomes far more complex with big data applications. Testing can now be done within the data fabrication process as well as in the data delivery process. Today, comprehensive testing is often mandated by regulatory agencies—and more importantly by customers. Finding issues before deployment and saving your company’s reputation—and in some cases preventing litigation—is critical. Jason Rauen presents an overview of the architecture, processes, techniques, and lessons learned by an original big data company. Detecting defects up-front is vital. Learn how to test thousands, millions, and in some cases billions—yes, billions—of records directly, rendering sampling procedures obsolete. See how you can save your organization time and money—and have better data test coverage than ever before.

Published in: Technology

  1. T8 Concurrent Session
     4/8/2014 12:45 PM
     “Become a Big Data Quality Hero”
     Presented by: Jason Rauen, LexisNexis
     Brought to you by:
     340 Corporate Way, Suite 300, Orange Park, FL 32073
     888-268-8770 ∙ 904-278-0524 ∙ sqeinfo@sqe.com ∙ www.sqe.com
  2. Jason Rauen, LexisNexis
     Jason Rauen is a senior quality test analyst at Georgia-based LexisNexis Risk Solutions. With more than fifteen years of experience, Jason has led the data testing team in big data from its inception. He has presented big data scripting techniques at the HPCC Systems national Data Summit. His background includes working at Microsoft, AT&T, and LexisNexis, and instructing at Intel, Boeing, Executrain, and the Department of the Navy. Jason has worked across many areas of technology, including technical sales, customer support, training, quality control/quality assurance, and management.
  3. 2/4/2014
     Interesting Quotes
     “Quality isn’t measured by how many clients you obtain; it’s measured by how many clients you retain.”
     “QA isn’t the bottom of the totem pole; it’s the dirt holding it up.”

     Become a Big Data Quality Hero
     A look inside QA for Big Data
     Presented by 01001010 01100001 01110011 01101111 01101110 00100000 01010010 01100001 01110101 01100101 01101110 (Jason Rauen)
  4. Overview
     • Architecture and why you need to know
       – HPCC Systems/Hadoop
       – Know Your Data/Environment
     • Why Test Big Data and How it’s Different
       – Issues
       – Benefits
     • Strategies and Concepts
       – What to look for
       – Sample Gathering (AUB)
       – Stats
       – Profiling

     Architecture and why you need to know
     Data Warehouse Architecture (diagram labels): Source Files; Extract, Transform, Load; Staging (Data Cleansing); Data Warehouse
  5. Architecture and why you need to know
     Data Fabrication Engines
     • HDFS Hadoop and HPCC THOR
     • Made of several nodes
     • Where the ETL happens
     • Where the Keys are made
     Data Delivery Engines
     • HPCC ROXIE, HBASE, etc.
     • Keys moved to and referenced here
     • Queries reside here
  6. Architecture and why you need to know
     Diagram labels: HDFS, Hadoop MapReduce, HBASE
  7. Architecture and why you need to know
     Diagram labels: HDFS, Map, Shuffle, Reduce

     Architecture and why you need to know
     Diagram labels (HPCC Systems): DISTRIBUTE/PROJECT/TRANSFORM, Rollup
  8. Why Test Big Data and How it’s Different
     Why Test Big Data:
     • Traditional methods are not adequate: traditional sampling needs improvement and is scenario based, with not enough samples, human error, etc.
     • Size of the data is huge, from different sources, and inconsistent
     • Tied into current environment
     • Government regulatory compliance
     • Auditing requirements
     • Company-wide initiatives
     • The business makes crucial decisions based on it

     Why Test Big Data and How it’s Different
     Want to keep your customers?
  9. Why Test Big Data and How it’s Different
     • When?
       o Testing - SDLC
       o Routine Testing
       o Frequency - Yearly/Monthly/Weekly/Daily/Hourly/On Demand
     • What? Types of Testing
       o New Project – Source to Target (Transform)
       o Standard - Production Validation
       o Emergency releases
     • How?
       o Using what you have available
       o Freebies – Profiling tools, etc.

     Why Test Big Data and How it’s Different
     Issues:
     • Lack of control
       o Timing of builds
       o Samples and location of samples
     • 3rd Party Apps
       o Lack of licenses, Costs, Training, and existing knowledge
     • Extra hardware
     • Upgrades
  10. Why Test Big Data and How it’s Different
      Benefits:
      • Cost savings
      • Better Coverage
        o No Samples
        o Increased Sampling
        o Focused Samples
      • Faster (Time is $)
      • Quicker to diagnose issues
      • Better Data Integrity
      • Collaboration with other groups

      Strategies and Concepts
      • What to look for
        o Brand New, Incomplete, or Missing Builds (Data Cops)
        o Data progression: Today/Yesterday, FatherKey/GrandfatherKey
        o Count of Deltas in release/deploy
        o Keys updated
        o Missing keys/New keys
        o Field Validations – mandatory fields blank, consistency, etc.
        o Key Layout issues
        o Corruption: unprintable or invalid characters
        o Duplicate records of new and existing records
        o Data Fabrication Engine to Data Delivery Engine deploys/sync
        o Queries with new data
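Several of the "what to look for" checks (blank mandatory fields, unprintable characters, duplicate records) can be run against every record instead of a sample. The following is a minimal Python sketch of that idea, not the presenter's tooling; the function name `validate_records`, the field names, and the record layout are hypothetical and chosen only for illustration.

```python
def _is_blank(value):
    # A field counts as blank when it is missing, None, or whitespace only.
    return value is None or not str(value).strip()


def validate_records(records, mandatory_fields, key_field):
    """Scan every record and collect the keys of records that fail a check.

    Hypothetical helper: records are plain dicts, key_field names the key.
    """
    issues = {"blank_mandatory": [], "unprintable": [], "duplicate": []}
    seen = set()
    for rec in records:
        key = rec.get(key_field)
        # Field validation: any mandatory field left blank.
        if any(_is_blank(rec.get(f)) for f in mandatory_fields):
            issues["blank_mandatory"].append(key)
        # Corruption: any character that is not printable (control bytes, etc.).
        if any(not ch.isprintable() for v in rec.values() for ch in str(v)):
            issues["unprintable"].append(key)
        # Duplicates: the same key seen earlier in this build.
        if key in seen:
            issues["duplicate"].append(key)
        seen.add(key)
    return issues


if __name__ == "__main__":
    sample = [
        {"productcode": "R2D2C3PO", "name": "Droid"},
        {"productcode": "A1", "name": ""},
        {"productcode": "B2", "name": "Bad\x00name"},
        {"productcode": "R2D2C3PO", "name": "Droid"},
    ]
    print(validate_records(sample, ["productcode", "name"], "productcode"))
```

On a real cluster the same checks would be expressed in ECL or Pig so they run where the data lives; the Python form only shows the logic.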
  11. Strategies and Concepts
      JOIN
      • Sample gathering
      • New Key for testing
      • Deployment Validation - Data Fabrication
      • Deployment Validation - Data Delivery
      And get a free cookie…

      Strategies and Concepts
      AUB for JOIN
      A = Left key (New)
      B = Right key (Old)
      Types of JOINS
      • Inner Join
      • Left Outer Join
      • Right Outer Join
      • Full Outer Join
      • Minus or Left Only
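The join types above map directly onto set operations between the new key (A, left) and the old key (B, right). The sketch below is an illustrative Python version under the assumption that each build fits in a dict of key to record; `compare_builds` and its result names are hypothetical, and on real data volumes the same comparison would be an ECL or Pig JOIN.

```python
def compare_builds(new, old):
    """Compare a new build (A, left) with an old build (B, right).

    Both arguments are dicts mapping a key to its record.
    """
    a, b = set(new), set(old)
    both = a & b  # inner join on the key
    return {
        # Minus / left only: keys that appear only in the new build.
        "new_keys": sorted(a - b),
        # Right only: keys that were in the old build and are now missing.
        "missing_keys": sorted(b - a),
        # Inner join rows whose payload differs between builds.
        "updated_keys": sorted(k for k in both if new[k] != old[k]),
        # Inner join rows that are identical in both builds.
        "unchanged_keys": sorted(k for k in both if new[k] == old[k]),
        # Full outer join: every key seen in either build.
        "all_keys": sorted(a | b),
    }


if __name__ == "__main__":
    old_build = {"K1": "alpha", "K2": "beta", "K3": "gamma"}
    new_build = {"K2": "beta", "K3": "GAMMA", "K4": "delta"}
    for name, keys in compare_builds(new_build, old_build).items():
        print(name, keys)
```

The counts of new, missing, and updated keys are the per-release deltas that can then be tracked over time.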
  12. Strategies and Concepts
      Statistics: What you try to remember with this swimming behind you.

      Strategies and Concepts
      Statistics:
      • On data sets and keys
        - Gives you a high level look at the release
        - Ranges
        - You’ll start to notice a trend line
      • On Releases
        - Done over time you’ll see the trend of new data sets and keys
        - Done over time you’ll see the trend of changed or modified data sets and keys
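One way to turn release statistics into a check is to compute an average with a floor and ceiling around it and flag releases that fall outside the band. The slides do not say how the band is derived, so the sketch below assumes average plus or minus one population standard deviation; that formula, the `width` parameter, and the function names are assumptions for illustration only.

```python
from statistics import mean, pstdev


def release_band(counts, width=1.0):
    """Return (average, floor, ceiling) for a series of per-release counts.

    Assumption: the band is average +/- width * population standard deviation.
    """
    avg = mean(counts)
    spread = pstdev(counts) * width
    return avg, avg - spread, avg + spread


def out_of_band(counts, width=1.0):
    """List (release_number, count) pairs that fall outside the band."""
    _, floor, ceiling = release_band(counts, width)
    return [
        (release, count)
        for release, count in enumerate(counts, start=1)
        if count < floor or count > ceiling
    ]


if __name__ == "__main__":
    # Hypothetical counts of changed keys per release.
    history = [100, 100, 100, 100, 200]
    print(release_band(history))
    print(out_of_band(history))
```

A release outside the band is not necessarily a defect; it is a prompt to find out why the volume changed before the data is deployed.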
  13. Strategies and Concepts
      Chart: RELEASE NUMBERS for releases 1 through 15 (vertical axis 0 to 400), marked with AVERAGE 175.4, CEILING 210.6, FLOOR 135.1

      Strategies and Concepts
      Data Profiling:
      • Data Profiling Summary Report
      • Data Profiling Field Detail Report
      • Data Profiling Field Combination Report
      http://www.hpccsystems.com/demos/data-profiling-demo
  14. Strategies and Concepts
      Data Profiling Summary Report

      Strategies and Concepts
      Data Profiling Field Detail Report
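The report contents are not reproduced in this transcript, so the following Python sketch only illustrates the general kind of per-field summary a profiling pass produces: fill rate, distinct value count, and value length range. It is not the HPCC Systems profiling tool or its output format; `profile` and the metric names are hypothetical.

```python
def profile(records):
    """Build a simple per-field summary over a list of dict records."""
    fields = sorted({name for rec in records for name in rec})
    total = len(records)
    report = {}
    for name in fields:
        # Populated values only: skip missing, None, and empty strings.
        values = [str(rec[name]) for rec in records if rec.get(name) not in (None, "")]
        report[name] = {
            "fill_rate": len(values) / total if total else 0.0,
            "distinct": len(set(values)),
            "min_len": min(map(len, values), default=0),
            "max_len": max(map(len, values), default=0),
        }
    return report


if __name__ == "__main__":
    sample = [
        {"productcode": "R2D2C3PO", "state": "FL"},
        {"productcode": "A1", "state": ""},
        {"productcode": "A1", "state": "GA"},
        {"productcode": "B22"},
    ]
    for field, stats in profile(sample).items():
        print(field, stats)
```

A sudden drop in fill rate or a jump in maximum length between releases is the sort of signal a summary like this is meant to surface.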
  15. Strategies and Concepts
      Data Profiling Field Combination Report

      Strategies and Concepts
      SQL: SELECT * FROM Products;
      Pig: DUMP Products;
      ECL: Products;

      SQL: SELECT * FROM Products WHERE productcode = 'R2D2C3PO';
      Pig: Products = FILTER Products BY productcode == 'R2D2C3PO'; DUMP Products;
      ECL: Products(productcode = 'R2D2C3PO');

      SQL: SELECT COUNT(*) FROM Products;
      Pig: Products = GROUP Products ALL; Products = FOREACH Products GENERATE COUNT(Products); DUMP Products;
      ECL: COUNT(Products);
  16. Strategies and Concepts
      SQL: SELECT * FROM Products ORDER BY productcode;
      Pig: Products = ORDER Products BY productcode; DUMP Products;
      ECL: SORT(Products, productcode);

      SQL: SELECT * FROM Products FULL OUTER JOIN OtherProducts ON Products.col1 = OtherProducts.col1;
      Pig: Products = JOIN Products BY col1 FULL OUTER, OtherProducts BY col1; DUMP Products;
      ECL: JOIN(Products, OtherProducts, LEFT.col1 = RIGHT.col1, FULL OUTER);

      Summary
      • Why Test Big Data and How it’s Different
      • Architecture and why you need to know
      • Strategies and Concepts
  17. Questions?

      Contact / Useful links
      www.linkedin.com/in/jasonrauen
      • HPCC Systems/ECL Links:
        http://hpccsystems.com
        http://hpccsystems.com/demos
      • Hadoop/Pig Latin Links:
        http://pig.apache.org
        http://hadoop.apache.org
      • SQL Links:
        http://sql.org/
        http://msdn.microsoft.com/en-US/sqlserver/default.aspx
