Wrestling Large Data Volumes to the Ground


My presentation for Surge 2011. You can see the video here:

Published in: Technology

  1. Wrestling Large Data Volumes to the Ground
     Daniel Austin, Yahoo! Exceptional Performance
     March 24, 2011, Large-Scale Production Engineering Meetup
  2. Agenda: A Boy and His Prototype
      • Project X: Performance + Data
      • Project X: The Prototype
      • What We Learned
  3. Project X: High Performance, Low Budget
      • Need to collect and store a lot of performance data for BI purposes
          • ~10+ TB working volume
          • ~3+ TB inflow daily (tiny!)
      • Cheap, fast, good (pick one)
          • Optimized for query speed
          • Closely followed by ETL speed
      • Needs to be done yesterday!
      • Next step: build a prototype
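To put the volumes on this slide in perspective, here is a back-of-the-envelope sketch. It assumes decimal terabytes and an inflow spread evenly across the day, neither of which the slides state; real feeds are usually burstier, so peak ETL throughput would need to be higher.

```python
# Rough sizing from the slide's figures (assumptions: 1 TB = 10**12 bytes,
# inflow arrives at a constant rate over 24 hours).
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12

daily_inflow_bytes = 3 * TB
working_volume_bytes = 10 * TB

# Sustained rate the ETL chain must keep up with, in bytes per second.
sustained_bytes_per_sec = daily_inflow_bytes // SECONDS_PER_DAY

# How many days of inflow fit in the working volume at that rate.
days_of_inflow_in_working_set = working_volume_bytes / daily_inflow_bytes

print(sustained_bytes_per_sec)
print(round(days_of_inflow_in_working_set, 1))
```

Under these assumptions the chain has to sustain roughly 35 MB/s, and the working volume corresponds to a little over three days of raw inflow before compression or roll-off.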
  4. Business Intelligence in Three Acts
      • Data Collection
      • Data Storage
      • Data Analysis (out of scope)
  5. Data Collection: Tools – Gomez Last Mile
      • HTTP response time testing in the wild
      • Instrumented Firefox browser
      • On a real user’s machine
      • Data collection via FTP (brute force!)
      • Custom format provides max data for every object on every page in a sequence
  6. Data Collection: Tools – Talend Open Studio
      • Open Source ETL tool
          • Based on Eclipse
          • Similar to ETL tools from IBM, Oracle, and others
          • Generates Java or Perl code
      • But easier to use!
  7. Data Storage: Tools – Infobright MySQL Engine
      • Columnar DB for MySQL
      • Open Source & Commercial versions
      • High compression
      • Knowledge grid
      • Query optimizations
          • For analytics
          • For performance
  8. Intro to Data Products
      • Data product: “An organized, verifiable dataset with specific levels of quality and abstraction”
      • For each data product:
          • Data Dictionary
          • Data Model
          • Pre- and post-validation schemas
          • Data Lifecycle Plan
      • Level 0
          • Raw data at measurement-level resolution
          • Field-level syntactic & semantic validation
      • Level 1
          • 3NF 5D Data Model
          • Concrete aggregates while retaining record-level resolution
      • Levels 2+
          • Time- & space-based aggregates
          • We ended up choosing not to do this!
  9. How to design ETL chains?
      • Idempotent
          • If the process fails, restart the job
      • Intermediate steps from state to state
      • Syntactic vs. semantic validation
          • Treat separately
      • Don’t skimp on the supporting jobs!
          • Retention
          • Volumes and roll-off
          • Logging and trackback
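One way to read "idempotent" and "intermediate steps from state to state" is sketched below. This is an illustration, not the Talend jobs from the talk: the function name `run_step`, the file-per-state layout, and the line-by-line `transform` callback are all assumptions. Each step writes to a temporary file and only renames it into place once complete, so a failed job can simply be restarted.

```python
import os
import tempfile


def run_step(src_path, dst_path, transform):
    """Run one ETL step from src_path to dst_path (hypothetical helper).

    Returns True if the step did work, False if dst_path already existed.
    Rerunning after a crash is safe: a partial output never appears at
    dst_path, because the result is renamed into place only when complete.
    """
    if os.path.exists(dst_path):
        return False  # this state was already reached; nothing to redo

    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir)
    try:
        with os.fdopen(fd, "w") as out, open(src_path) as src:
            for line in src:
                out.write(transform(line))
        # Atomic when source and destination are on the same filesystem.
        os.replace(tmp_path, dst_path)
    except BaseException:
        # Clean up the partial file so a restart begins from a known state.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Chaining several such steps gives a sequence of on-disk states; restarting the whole chain skips the steps whose outputs already exist.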
  10. Diagram: ETL for Data Validation
      • 1. Syntactic Validation Step
      • 2. Semantic Validation Step
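The two-stage validation in the diagram could look something like the sketch below. The record layout (tab-separated URL, HTTP status, response time in milliseconds) and the plausibility limits are assumptions made for illustration; the actual Gomez format is not shown in the slides. The point is that shape checks and meaning checks are kept in separate functions, with separate reject piles.

```python
import re

_DIGITS = re.compile(r"[0-9]+")


def syntactic_check(raw_line):
    """Stage 1: is the line well-formed? Returns a parsed dict or None."""
    parts = raw_line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None
    url, status, response_ms = parts
    if not (_DIGITS.fullmatch(status) and _DIGITS.fullmatch(response_ms)):
        return None
    return {"url": url, "status": int(status), "response_ms": int(response_ms)}


def semantic_check(record):
    """Stage 2: do the parsed values make sense? (limits are assumptions)"""
    return (
        record["url"].startswith(("http://", "https://"))
        and 100 <= record["status"] <= 599
        and record["response_ms"] <= 600_000  # ten minutes, arbitrary cap
    )


def validate(lines):
    """Route each line to accepted, bad_syntax, or bad_semantics."""
    accepted, bad_syntax, bad_semantics = [], [], []
    for line in lines:
        record = syntactic_check(line)
        if record is None:
            bad_syntax.append(line)
        elif semantic_check(record):
            accepted.append(record)
        else:
            bad_semantics.append(record)
    return accepted, bad_syntax, bad_semantics
```

Keeping the reject piles apart makes it easier to tell a broken feed (syntax failures) from a misbehaving measurement (semantic failures).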
  11. Simple 3NF Level 1 Data Model for HTTP
      • NO xrefs
      • 5D User Narrative Model
      • High levels of normalization are costly up front…
      • …but pay for themselves later when you are making queries!
  12. Level 1: The Boss Battle!
  13. Some Best Practices
      • URIs: Handle with care
          • Encode text strings in lexical order
          • Use sequential bitfields for searching
      • Integer arithmetic only
      • Combined fields for per-row consistency checks in every table
      • Don’t trade ETL time for integrity risk
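Two of these practices lend themselves to a short sketch. The slides do not say how the "combined fields" were built, so the CRC32-over-all-columns check below is one possible interpretation, and the helper names `to_millis` and `row_check` are hypothetical. The first helper shows the integer-only idea: timings are parsed straight from text into whole milliseconds, never passing through floating point.

```python
import zlib


def to_millis(seconds_text):
    """Parse a non-negative decimal-seconds string such as '1.234' into
    integer milliseconds without using floats. Digits beyond the third
    decimal place are truncated."""
    whole, _, frac = seconds_text.partition(".")
    frac = (frac + "000")[:3]  # pad or truncate to exactly three digits
    return int(whole) * 1000 + int(frac)


def row_check(*fields):
    """Combine all fields of a row into one CRC32 value that can be stored
    in an extra column and recomputed later to detect a damaged row."""
    payload = "\x1f".join(str(f) for f in fields).encode("utf-8")
    return zlib.crc32(payload)


# Integer sums are exact, unlike the classic 0.1 + 0.2 float case.
assert to_millis("0.1") + to_millis("0.2") == to_millis("0.3")

row = (1, "http://example.com/", to_millis("1.234"))
stored_check = row_check(*row)
# Later: recompute and compare to confirm the row is intact.
assert row_check(*row) == stored_check
```

A checksum column costs a little ETL time per row, which fits the slide's last point: the extra time is the price of not taking on integrity risk.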
  14. Project X: What We Learned
      • Design your data products up front
      • Higher-level data products are often better created downstream
      • Open Source ETL can be made to scale well
          • Requires a lot of upfront design
      • High levels of normalization may be worth pursuing
  15. Endgame! Analysis in Near Real-Time
  16. Thank You!
      Daniel Austin, Yahoo! Exceptional Performance
      @daniel_b_austin [email_address]
      March 24, 2011, Large-Scale Production Engineering Meetup