Wrestling Large Data Volumes to the Ground


My presentation for Surge 2011. You can see the video here: http://omniti.com/surge/2011/speakers/daniel-austin

Published in: Technology

Notes
  • One other thing I forgot to mention: no budget
  • Some very basic architectural building blocks
  • Heresy! Normalized Dates and Times!
  • Wrestling Large Data Volumes to the Ground

    1. Wrestling Large Data Volumes to the Ground
       Daniel Austin, Yahoo! Exceptional Performance
       March 24, 2011, Large-Scale Production Engineering Meetup
    2. Agenda: A Boy and His Prototype
         • Project X: Performance + Data
         • Project X: The Prototype
         • What We Learned
    3. Project X: Hi Performance, Low Budget
         • Need to collect and store a lot of performance data for BI purposes
             • ~ 10+ TB working volume
             • ~ 3+ TB inflow daily (tiny!)
         • Cheap, fast, good (pick one)
             • Optimized for query speed
             • Closely followed by ETL speed
         • Needs to be done yesterday!
         • Next step: build a prototype
    4. Business Intelligence in Three Acts
         • Data Collection
         • Data Storage
         • Data Analysis (out of scope)
    5. Data Collection: Tools – Gomez Last Mile
         • HTTP response time testing in the wild
         • Instrumented Firefox browser
         • On a real user’s machine
         • Data collection via FTP (brute force!)
         • Custom format provides max data for every object on every page in a sequence
    6. Data Collection: Tools – Talend Open Studio
         • Open Source ETL tool
             • Based on Eclipse
             • Similar to other ETL tools from IBM, Oracle, and others
             • Java or Perl code generation
         • But easier to use!
    7. Data Storage: Tools – Infobright MySQL Engine
         • Columnar DB for MySQL
         • Open Source & Commercial versions
         • High compression
         • Knowledge grid
         • Query optimizations
             • For analytics
             • For performance
    8. Intro to Data Products
         • “An organized, verifiable dataset with specific levels of quality and abstraction”
         • For each data product:
             • Data Dictionary
             • Data Model
             • Pre- and post-validation schemas
             • Data Lifecycle Plan
         • Level 0
             • Raw data at measurement-level resolution
             • Field-level syntactic & semantic validation
         • Level 1
             • 3NF 5D Data Model
             • Concrete aggregates while retaining record-level resolution
         • Levels 2+
             • Time & space-based aggregates
             • We ended up choosing not to do this!
    9. How to design ETL chains?
         • Idempotent
             • If the process fails, restart the job
         • Intermediate steps from state to state
         • Syntactic vs. semantic validation
             • Treat separately
         • Don’t skimp on the supporting jobs!
             • Retention
             • Volumes and roll-off
             • Logging and trackback
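One way to read "idempotent" on this slide: a step can be rerun from scratch after a failure without duplicating or corrupting its output. The sketch below is only an illustration under that reading. The file-based step, the name `run_step`, the `.tmp` suffix, and the uppercase transform are all hypothetical and do not come from the talk.

```python
import os
import tempfile


def run_step(src_path, dst_path, transform):
    """Run one ETL step so that it is safe to restart after a failure.

    Output goes to a temporary file and is moved into place only after the
    whole step has succeeded. A crash never leaves a partial dst_path, and a
    rerun replaces the output instead of appending to it.
    """
    tmp_path = dst_path + ".tmp"  # hypothetical naming convention
    with open(src_path, "r", encoding="utf-8") as src, \
            open(tmp_path, "w", encoding="utf-8") as tmp:
        for line in src:
            tmp.write(transform(line))
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, dst_path)


def demo():
    """Run the same step twice and return both outputs for comparison."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "level0.txt")
    dst = os.path.join(workdir, "level1.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("a\nb\n")
    run_step(src, dst, str.upper)
    with open(dst, encoding="utf-8") as f:
        first = f.read()
    run_step(src, dst, str.upper)  # simulated restart of the job
    with open(dst, encoding="utf-8") as f:
        second = f.read()
    return first, second
```

Because each step moves the data from one complete state to the next, restarting the job is the whole recovery procedure; no cleanup of half-written output is needed.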
    10. Diagram: ETL for Data Validation (1. Syntactic Validation Step; 2. Semantic Validation Step)
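The two validation steps in the diagram can be sketched as separate functions: the first asks whether a raw line parses at all, the second whether the parsed values are plausible. This is a hedged illustration, not the author's pipeline. The tab-separated three-field record, the field names, and the ten-minute bound are assumptions made for the example.

```python
def syntactic_check(line):
    """Step 1: does the raw line parse? Returns a dict, or None on failure."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:  # assumed layout: url, status, response time
        return None
    url, status, response_ms = parts
    try:
        return {"url": url, "status": int(status),
                "response_ms": int(response_ms)}
    except ValueError:
        return None


def semantic_check(record):
    """Step 2: do the parsed values make sense?"""
    if not record["url"].startswith(("http://", "https://")):
        return False
    if not 100 <= record["status"] <= 599:
        return False
    # Assumed plausibility bound: no object takes longer than 10 minutes.
    return 0 <= record["response_ms"] <= 600_000


def validate(lines):
    """Route each line to accepted records, syntax rejects, or semantic rejects."""
    accepted, bad_syntax, bad_semantics = [], [], []
    for line in lines:
        record = syntactic_check(line)
        if record is None:
            bad_syntax.append(line)
        elif not semantic_check(record):
            bad_semantics.append(line)
        else:
            accepted.append(record)
    return accepted, bad_syntax, bad_semantics
```

Keeping the reject lists separate makes it easier to tell a broken feed (syntax failures) from a misbehaving measurement (semantic failures).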
    11. Simple 3NF Level 1 Data Model for HTTP
         • NO xrefs
         • 5D User Narrative Model
         • High levels of normalization are costly up front…
         • … but pay for themselves later when you are making queries!
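The trade-off on this slide (pay at load time, save at query time) can be illustrated with a very small normalized schema. This is a sketch under loud assumptions: the transcript does not list the five dimensions, so the two dimension tables here (`host` and `day`) are invented for illustration, and Python's built-in `sqlite3` stands in for the Infobright MySQL engine used in the project.

```python
import sqlite3

# Hypothetical schema: two dimension tables and one fact table.
SCHEMA = """
CREATE TABLE host (
    host_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE day (
    day_id INTEGER PRIMARY KEY,
    ymd    TEXT NOT NULL UNIQUE
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    host_id        INTEGER NOT NULL REFERENCES host(host_id),
    day_id         INTEGER NOT NULL REFERENCES day(day_id),
    response_ms    INTEGER NOT NULL
);
"""


def lookup_or_insert(conn, table, column, value):
    """Return the integer key for value, inserting it on first sight.

    table and column are fixed names supplied by this module, never user input.
    """
    conn.execute(f"INSERT OR IGNORE INTO {table} ({column}) VALUES (?)", (value,))
    row = conn.execute(
        f"SELECT rowid FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    return row[0]


def load(conn, rows):
    """The up-front cost: every text value is resolved to a key at load time."""
    for host, ymd, response_ms in rows:
        host_id = lookup_or_insert(conn, "host", "name", host)
        day_id = lookup_or_insert(conn, "day", "ymd", ymd)
        conn.execute(
            "INSERT INTO measurement (host_id, day_id, response_ms) "
            "VALUES (?, ?, ?)",
            (host_id, day_id, response_ms),
        )


def mean_ms_by_host(conn, ymd):
    """The payoff: queries filter and group on small integer keys."""
    return conn.execute(
        "SELECT h.name, SUM(m.response_ms) / COUNT(*) "
        "FROM measurement m "
        "JOIN host h ON h.host_id = m.host_id "
        "JOIN day d ON d.day_id = m.day_id "
        "WHERE d.ymd = ? GROUP BY h.name ORDER BY h.name",
        (ymd,),
    ).fetchall()
```

Note that the date lives in its own table rather than being repeated as text on every fact row, which is one way to read the "normalized dates and times" remark in the notes.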
    12. Level 1: The Boss Battle!
    13. Some Best Practices
         • URIs: Handle with care
             • Encode text strings in lexical order
             • Use sequential bitfields for searching
         • Integer arithmetic only
         • Combined fields for per-row consistency checks in every table
         • Don’t trade ETL time for integrity risk
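Two of these practices lend themselves to a short sketch: keeping timings as integers, and storing a check value computed from a row's combined fields. The slide does not say how either was implemented, so everything here is an assumption: microseconds as the unit, CRC32 as the check, and the function names `to_micros`, `row_check`, and `row_is_consistent`.

```python
import zlib


def to_micros(seconds_text):
    """Convert a non-negative decimal-seconds string such as '1.25' to
    integer microseconds without going through floating point."""
    whole, _, frac = seconds_text.partition(".")
    frac = (frac + "000000")[:6]  # pad or truncate to exactly 6 digits
    return int(whole or "0") * 1_000_000 + int(frac)


def row_check(*fields):
    """CRC32 over the row's fields, joined with a separator that is
    unlikely to occur in the data (ASCII unit separator)."""
    joined = "\x1f".join(str(field) for field in fields)
    return zlib.crc32(joined.encode("utf-8"))


def row_is_consistent(fields, stored_check):
    """Recompute the check from the fields and compare with the stored value."""
    return row_check(*fields) == stored_check
```

With integers, `to_micros("0.1") + to_micros("0.2")` equals `to_micros("0.3")` exactly, which is not true of the corresponding float sum. The check value would be written in the same row at load time and recomputed whenever the row is read or moved.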
    14. Project X: What We Learned
         • Design your data products up front
         • Higher-level data products are often better created downstream
         • Open Source ETL can be made to scale well
             • Requires a lot of upfront design
         • High levels of normalization may be worth pursuing
    15. Endgame! Analysis in Near Real-Time
    16. Thank You!
        Daniel Austin, Yahoo! Exceptional Performance
        @daniel_b_austin [email_address]
        March 24, 2011, Large-Scale Production Engineering Meetup
