Wrestling Large Data Volumes to the Ground

My presentation for Surge 2011. You can see the video here: http://omniti.com/surge/2011/speakers/daniel-austin

Speaker notes:
  • One other thing I forgot to mention: no budget
  • Some very basic architectural building blocks
  • Heresy! Normalized Dates and Times!

Presentation Transcript

  • Wrestling Large Data Volumes to the Ground. Daniel Austin, Yahoo! Exceptional Performance. March 24, 2011, Large-Scale Production Engineering Meetup
  • Agenda: A Boy and His Prototype
      • Project X: Performance + Data
      • Project X: The Prototype
      • What We Learned
  • Project X: High Performance, Low Budget
      • Need to collect and store a lot of performance data for BI purposes
        • ~ 10+ TB working volume
        • ~ 3+ TB inflow daily (tiny!)
      • Cheap, fast, good (pick one)
        • optimized for query speed
        • closely followed by ETL speed
      • Needs to be done yesterday!
      • Next Step: Build a prototype
  • Business Intelligence in Three Acts Data Collection Data Storage Data Analysis (out of scope)
  • Data Collection: Tools – Gomez Last Mile
    • HTTP response time testing in the wild
    • Instrumented Firefox browser
    • On a real user’s machine
    • Data collection via FTP (brute force! See the sketch below)
    • Custom format provides max data for every object on every page in a sequence
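
A minimal sketch of this kind of brute-force FTP pull, in Python, is shown below. The host, credentials, and directory layout are hypothetical placeholders, not Gomez's actual delivery endpoints.

    # Sketch of the brute-force FTP pull described above. Host, credentials,
    # and directory layout are hypothetical placeholders.
    import ftplib
    import os

    def pull_measurement_files(host, user, password, remote_dir, local_dir):
        """Download every file in remote_dir that has not been fetched yet."""
        os.makedirs(local_dir, exist_ok=True)
        with ftplib.FTP(host) as ftp:
            ftp.login(user, password)
            ftp.cwd(remote_dir)
            for name in ftp.nlst():
                target = os.path.join(local_dir, name)
                if os.path.exists(target):  # already pulled: skip for idempotence
                    continue
                with open(target, "wb") as f:
                    ftp.retrbinary("RETR " + name, f.write)

    # Example call (placeholder endpoint):
    # pull_measurement_files("ftp.example.com", "collector", "secret",
    #                        "/lastmile/outbox", "./landing")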
  • Data Collection: Tools – Talend Open Studio
    • Open Source ETL tool
      • Based on Eclipse
      • Similar to ETL tools from IBM, Oracle, and others
      • Java or Perl generation
    • But easier to use!
  • Data Storage: Tools – Infobright MySQL Engine (example table below)
    • Columnar DB for MySQL
    • Open Source & Commercial versions
    • High compression
    • Knowledge grid
    • Query Optimizations
      • For analytics
      • For performance
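
To make this concrete, here is a hedged sketch of a Level 0 landing table on Infobright, whose open-source engine registered with MySQL as BRIGHTHOUSE. The connection details and column names are assumptions for illustration, with pymysql standing in for whatever MySQL client library is in use.

    # Hypothetical Level 0 landing table on the Infobright engine.
    # Connection details and column names are placeholders.
    import pymysql  # assumed client library; any MySQL driver works

    DDL = """
    CREATE TABLE measurement_raw (
        test_id       INT NOT NULL,
        ts            BIGINT NOT NULL,    -- epoch milliseconds (integers only)
        object_url    VARCHAR(2048),
        dns_ms        INT,
        connect_ms    INT,
        first_byte_ms INT,
        total_ms      INT
    ) ENGINE=BRIGHTHOUSE                  -- Infobright's columnar engine
    """

    conn = pymysql.connect(host="db.example.com", user="etl",
                           password="secret", database="perf")
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()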
  • Intro to Data Products
    • “An organized, verifiable dataset with specific levels of quality and abstraction”
    • For each data product:
      • Data Dictionary
      • Data Model
      • Pre- and post-validation schemas
      • Data Lifecycle Plan
    • Level 0
      • Raw data at measurement-level resolution
      • Field-level Syntactic & Semantic Validation (sketch after this list)
    • Level 1
      • 3NF 5D Data Model
      • Concrete aggregates while retaining record-level resolution
    • Levels 2+
      • Time & Space-based aggregates
      • We ended up choosing not to do this!
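
The deck does not show the validation code itself, but field-level syntactic validation at Level 0 might look like the sketch below; the field names, widths, and ranges are invented for illustration.

    # Sketch of Level 0 field-level syntactic validation: every field must
    # parse and fall in range before a record moves downstream. Field names
    # and ranges are hypothetical.
    SCHEMA = {
        "test_id":  lambda v: v.isdigit(),
        "ts":       lambda v: v.isdigit() and len(v) == 13,  # epoch ms
        "total_ms": lambda v: v.isdigit() and int(v) < 600_000,
    }

    def validate_record(record):
        """Return the (field, value) pairs that failed validation."""
        return [(f, record.get(f, "")) for f, check in SCHEMA.items()
                if not check(record.get(f, ""))]

    rec = {"test_id": "42", "ts": "1300925000000", "total_ms": "1843"}
    assert validate_record(rec) == []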
  • How to design ETL chains?
      • Idempotent
        • If the process fails, restart the job (see the sketch after this list)
      • Intermediate steps from state to state
      • Syntactic vs. semantic validation
        • Treat separately
      • Don’t skimp on the supporting jobs!
        • Retention
        • Volumes and roll-off
        • Logging and trackback
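
One common way to get the restart-on-failure property is the pattern sketched below: write output to a temporary file, move it into place atomically, and leave a completion marker so a rerun is a no-op. This is an assumed pattern, not the deck's actual Talend jobs, and the paths are placeholders.

    # Sketch of an idempotent ETL step: output lands in a temp file and is
    # renamed into place only on success, and a marker records completion,
    # so a failed or repeated job can simply be rerun. Paths are placeholders.
    import os

    def run_step(in_path, out_path, transform):
        marker = out_path + ".done"
        if os.path.exists(marker):      # already completed: rerun is a no-op
            return
        tmp = out_path + ".tmp"
        with open(in_path) as src, open(tmp, "w") as dst:
            for line in src:
                dst.write(transform(line))
        os.replace(tmp, out_path)       # atomic move from state to state
        open(marker, "w").close()

    os.makedirs("landing", exist_ok=True)
    os.makedirs("validated", exist_ok=True)
    with open("landing/batch1.csv", "w") as f:
        f.write("a,b,c\n")
    run_step("landing/batch1.csv", "validated/batch1.csv", str.upper)
    run_step("landing/batch1.csv", "validated/batch1.csv", str.upper)  # no-op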
  • Diagram: ETL for Data Validation (1. Syntactic Validation Step → 2. Semantic Validation Step)
  • Simple 3NF Level 1 Data Model for HTTP
    • NO xrefs
    • 5D User Narrative Model
    • High levels of normalization are costly up front…
    • … but pay for themselves later when you are making queries! (schema sketch below)
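
To make "3NF, no xrefs" concrete, below is an invented fragment of what such a Level 1 schema could look like: a fact table keyed straight to its dimension tables, with no many-to-many xref tables. None of these names come from the deck.

    # Hypothetical fragment of a normalized Level 1 model: one fact table
    # keyed directly to dimension tables, no xref/join tables. All names
    # are illustrative, not the actual Project X schema.
    LEVEL1_DDL = [
        """CREATE TABLE dim_url (
               url_id INT PRIMARY KEY,
               host   VARCHAR(255) NOT NULL,
               path   VARCHAR(2048) NOT NULL
           )""",
        """CREATE TABLE dim_agent (
               agent_id INT PRIMARY KEY,
               browser  VARCHAR(64) NOT NULL,
               location VARCHAR(64) NOT NULL
           )""",
        """CREATE TABLE fact_http_request (
               request_id BIGINT PRIMARY KEY,
               url_id     INT NOT NULL REFERENCES dim_url (url_id),
               agent_id   INT NOT NULL REFERENCES dim_agent (agent_id),
               ts         BIGINT NOT NULL,  -- epoch ms, integer arithmetic only
               total_ms   INT NOT NULL
           )""",
    ]  # execute each statement once against the warehouse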
  • Level 1: The Boss Battle!
  • Some Best Practices
      • URIs: Handle with care
        • Encode text strings in lexical order
        • Use sequential bitfields for searching
      • Integer arithmetic only
      • Combined fields for per-row consistency checks in every table (sketch below)
      • Don’t trade ETL time for integrity risk
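
As one reading of the last two bullets, the sketch below folds the integer-only timing fields into a single stored check field that can be recomputed on read to catch corrupted rows. The simple additive scheme is an assumption, not the actual Project X check.

    # Illustration of two practices above: integer-only timing fields and a
    # combined per-row consistency field. The additive check is an assumed
    # scheme, not Project X's actual one.
    def row_check(dns_ms, connect_ms, first_byte_ms, total_ms):
        """Pack the component timings into one integer stored with the row;
        recomputing and comparing on read flags corrupted rows."""
        parts = (dns_ms, connect_ms, first_byte_ms, total_ms)
        assert all(isinstance(v, int) for v in parts)  # integers only
        return sum(parts)

    row = {"dns_ms": 12, "connect_ms": 40, "first_byte_ms": 180,
           "total_ms": 1843, "check": row_check(12, 40, 180, 1843)}
    assert row["check"] == row_check(row["dns_ms"], row["connect_ms"],
                                     row["first_byte_ms"], row["total_ms"])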
  • Project X: What We Learned
      • Design your data products up front
      • Higher-level data products are often better created downstream
      • Open Source ETL can be made to scale well
        • Requires a lot of upfront design
      • High levels of normalization may be worth pursuing
  • Endgame! Analysis in Near Real-Time
  • Thank You! Daniel Austin Yahoo! Exceptional Performance @daniel_b_austin [email_address] March 24, 2011 Large-Scale Production Engineering Meetup