
Data Quality Challenges & Solution Approaches in Yahoo!’s Massive Data


Data is Yahoo!'s most strategic asset - from user engagement and insights data to revenue and billing data. Three years ago, Yahoo! invested in a Data Quality program.

By applying industry principles and techniques, the Data Quality program has provided proactive and reactive system solutions to Audience data issues and their root causes: addressing the technical challenges of data quality at scale, and engaging and leveraging the rest of the organization in the solution, from product teams throughout the data stack (data sourcing, ETL, aggs, and analytics) to the analyst and sciences teams who consume the data. This methodology is now being scaled to all data across Yahoo!, including Search and Display Advertising.

  • “Riding Giants” is only possible by using a recently discovered method: tow-in surfing. Show video: http://www.youtube.com/watch?v=LhKFTqxn6qs. 70’ wave = power of data. UDA DSI = jet-ski method (unlocking the data to harness the 70’ wave). UDA DQ = getting the GPS coordinates correct so you are in the right place to catch it. Without high-quality data we miss the wave altogether!
  • Yahoo business model = advertising
  • Transcript

    • 1. Data Quality Challenges and Solution Approaches in Yahoo!'s Massive Data Environment
      Data Quality Manager: Dan Defend
      Data Quality Architect: Aparna Vani
      DataVersity Webinar
      September 29, 2011
       © 2011 Yahoo!.  All rights reserved.
    • 2. Unlocking the Power of Data
      2
    • 3. MEDIA
      Technology
      3
    • 4. The Anatomy of a Yahoo! Web Page
      Buzz
      Targeted
      Content
      Apps
      Ads
      Content
      Y!
      links
    • 5. What Does Yahoo! Do With Its Data?
      Analytics & Business Insights – data-driven decisions
      How many people visited Home Page today and what did they click on?
      What impact did the Japan tsunami have on News and global engagement?
      Targeting
      What products are you interested in based on your recent web usage?
      Advertisers pay a lot of $$ for good targeting.
      Targeted content means better user engagement.
      Experimentation
      “Live user testing”
      What layout do users like best? Which is most profitable?
      5
    • 6. Which is the Better Home Page?
      6
    • 7. Which Ad Position Makes More Money?
      7
    • 8. Yahoo! Has a LOT of Data
      Leading Internet Portal and Software Supplier[1]
      Serves 640 MM users, or 84.5% of US internet users
      Top ranked site in Mail, Messenger, Home Page, and more
      Collects over 25 terabytes of behavioral data per day
      2 U.S. Library of Congress equivalents every day
      [1] US Yahoo! Audience Measurement Report. comScore, Jan 2011
      8
    • 9. Overview of Yahoo! Data Pipelines
      • Audience (web analytics)
      • 10. Processes data from all Yahoo! properties' web server logs and delivers audience engagement metrics
      • 11. Display Advertising
      • 12. Analytics and billing
      • 13. Guaranteed and Non-guaranteed delivery ad campaigns
      • 14. Exchange networks leverage other advertisers and publishers
      • 15. Search Advertising
      • 16. Analytics and billing
      9
    • 17. Yahoo! Data Pipeline
      Data Extraction
      Web Pages
      Extract
      Transform
      Load
      Business Insights
      Data Warehouse
      Σ
      Ad Servers
      Targeting
      Reporting systems
    • 18. Dimensions of Yahoo! Data Quality
      DQ Team
      abuse traffic
      invalid events
      metadata integrity
      external src agreement
      uncollected events
      key metric agreement
      11
    • 19. DATA QUALITY = BUSINESS UP
      Impacts of poor data quality to Yahoo!
      $$ Loss: Revenue, Refunds
      Wasted resources
      • Data investigation and reprocessing
      • 20. Sciences frustration and attrition (“QA the data”)
      Incorrect insights
      • Wrong decisions
      • 21. Suboptimal targeting
      • 22. Credibility loss – customers don’t trust the data
      External impact
      • Loss of competitive ranking
      REAL $$!!! … if managed reactively
      12
    • 23. Audience DQ Solution Path
      13
    • 24. Circa 2007: Significant Opportunities for Improvement in Audience Pipeline
      • Quality of data largely unmeasured
      • 25. Property sourcing:
      • 26. Is this a page view? No standard
      • 27. Tagging & server errors
      • 28. Data dropped in collection system
      • 29. Data discrepancies found by customer end of month or quarter
      • 30. Abuse and robots skew metrics due to minimal traffic protection
      • 31. Data sources that should agree, don’t
      • 32. Internal customers don’t trust the data
      14
    • 33. We Started to Measure It
      Source: ticket volumes and root cause analysis
      Key Finding: >80% data issue from the source
      15
    • 34. Root Causes Differ Per Stage
      Source: root cause analysis
      Insights into point fixes and system solutions
      16
    • 35. Solution
      Identification of issues followed by quick wins and system solutions
      Developed DQ Methodology for large data systems
      17
    • 36. Improvements & Results
      18
    • 37. Central DQ Team Structure
      19
    • 38. DQ Proactive Standards
      Building Data Quality into Products
      20
    • 39. Property DQ Standards
      Sourcing and Consuming Clean Data
      Data Issues
      DQ Champs
      Classification, triage, drive fixes
      Metric/monitor priorities, pain points
      Server setup cookbook and validation
      Instrumentation validation: PV, URL, CSC
      Proactive
      Reactive
      Customer-Driven
      21
    • 40. DQ Standards Overview (Proactive)
      22
    • 41. Support for DQ in the QE Cycle
      Data Validation
      Test Environment
      E2E data validation tests covering major customer use cases in pre-release QE cycle
      Note: Specific tools are not currently part of the DQ standard, but partnership in this area may make sense
      23
    • 42. Data Validation Coverage in QE
      Checklist and Examples
      Checklist
      • Coverage from a data perspective, taking into account major data input scenarios and the major use cases of direct customers of the data produced
      • 43. Compare results from legacy system or previous version of system (with production data)
      • 44. Suggest organizing per DQ dimensions: completeness, accuracy, validity, consistency, integrity
      Examples
      • Validity: Check for valid data values per the schema – certain column not null, within acceptable range or of correct data type
      • 45. Completeness: Include coverage to validate that the volume sent is the amount of load received, processed and output.
      • 46. Accuracy - Test that the data input equals the data output. If data is requested for a specific day in one time zone but fetched in another the data will not be accurate.
      24
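The validity and completeness checks in the checklist above can be sketched in a few lines. This is an illustrative sketch only: the record fields (`ts`), the accepted range, and the tolerance are assumptions for the example, not Yahoo!'s actual schema or thresholds.

```python
# Hypothetical validity / completeness checks per the DQ dimensions above.
# Field names and thresholds are illustrative assumptions.

def check_validity(records, valid_range=(0, 86_400)):
    """Validity: return records whose 'ts' is null or outside the accepted range."""
    return [r for r in records
            if r.get("ts") is None
            or not (valid_range[0] <= r["ts"] <= valid_range[1])]

def check_completeness(sent_count, received_count, tolerance=0.001):
    """Completeness: volume received matches volume sent, within a tolerance."""
    if sent_count == 0:
        return received_count == 0
    return abs(sent_count - received_count) / sent_count <= tolerance

records = [{"ts": 100}, {"ts": None}, {"ts": 90_000}]
print(len(check_validity(records)))   # 2 invalid records: null ts, out-of-range ts
print(check_completeness(1000, 999))  # True: within 0.1% tolerance
```

An accuracy check of the kind described (input equals output for a requested day) would follow the same shape, comparing the two sides record by record or in aggregate.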
    • 47. Support for DQ in the QE Cycle
      QE Coverage of DQ Features
      Functional test coverage for built-in DQ features, e.g., in-line DQ checks
      25
    • 48. Data Sourcing Case Study
      • New Ad Delivery Engine. Some impression events were partially corrupted leading to failed joins. This led to missing data.
      • 49. Problem not discovered for over 2 days. Rollback occurred on the 3rd day.
      • 50. A simple/cheap in-line DQ check for % of non-billable clicks would have found the problem before any bad data was published.
      26
    • 51. Answer: In-Line DQ Checks
      Immediate Detection of Issues Before Publish
      week-over-week statistical, source comparison, def met,…
      SE&O
      • Performed in-line with processing before publishing the data.
      • 52. Designed to catch serious DQ issues needing immediate action.
      • 53. Cover processing performed in stage as well as sanity checks of upstream data (e.g. total volume).
      • 54. Cannot significantly impact the SLA of the process. This means that checks must generally be at a coarse granularity, e.g., at network level.
      • 55. Owned/implemented by Dev with review input from DQ, QE, PM, Customer
      27
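The "simple/cheap in-line check" from the case study, a percentage of non-billable clicks gating the publish step, can be sketched as below. The 5% threshold and the `billable` field are assumptions for the example; a real check would run at coarse (network-level) granularity so as not to impact the processing SLA.

```python
# Illustrative in-line DQ check: block the publish step when the fraction of
# non-billable clicks in a batch drifts past a threshold. Names and the 5%
# threshold are assumptions, not the actual production values.

def nonbillable_fraction(clicks):
    """clicks: list of dicts with a boolean 'billable' flag."""
    if not clicks:
        return 0.0
    return sum(1 for c in clicks if not c["billable"]) / len(clicks)

def inline_check(clicks, max_nonbillable=0.05):
    """Return True if the batch is safe to publish."""
    return nonbillable_fraction(clicks) <= max_nonbillable

batch = [{"billable": True}] * 97 + [{"billable": False}] * 3
print(inline_check(batch))  # True: 3% non-billable, under the 5% threshold
```

In the corrupted-impressions incident above, a check like this would have tripped on the first batch, before any bad data was published, instead of after two days.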
    • 56. Audience Case Study
      Let’s Start Looking at ALL Our Data
      X Bln/day
      Y Bln/day
      Reported
      Unreported
      Invalid?
      Tagging Error
      ETL Error
      Backend Call
      Non-PV
      Robot
      • Valid / invalid traffic improperly classified/reported
      • 57. Processing and storing billions of events/day without knowing why
      • 58. Profile the data to find out which properties/pages/servers are the biggest offenders
      28
    • 59. Answer: Data Transparency
      Where is all the data going?
      • End-to-end, transactional insight into data processed across all stages of the pipeline, based on DQ stats
      • 60. Facilitates data investigations -- immediate insight into the data processing stage/category with incorrect data
      • 61. Provides customers confidence by showing what happened to the data between the beginning of the processing and the final analytics data metrics
      • 62. Provides full, end-to-end auditability -- every event is accounted for in some stage and filter bucket
      • 63. Transparency reporting can be done in the analytic application or in DQ Central
      29
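The event-accounting idea behind data transparency, every event ends up in exactly one bucket (reported, or a named filter reason), so the input total always equals the sum of the buckets, can be sketched as follows. The bucket names and event flags are illustrative assumptions, not the actual Audience categories.

```python
# Sketch of event accounting for data transparency: classify every event into
# exactly one bucket so that input volume == sum of bucket counts.
# Bucket names and event fields are illustrative assumptions.
from collections import Counter

def account_events(events):
    """Classify each event into one bucket; return per-bucket counts."""
    buckets = Counter()
    for e in events:
        if e.get("robot"):
            buckets["filtered_robot"] += 1
        elif not e.get("valid_tag"):
            buckets["filtered_tagging_error"] += 1
        else:
            buckets["reported"] += 1
    return buckets

events = [{"valid_tag": True},
          {"valid_tag": False},
          {"robot": True, "valid_tag": True}]
b = account_events(events)
print(sum(b.values()) == len(events))  # True: every event is accounted for
```

The auditability property is exactly the final assertion: no event can disappear between collection and the final analytics metrics without showing up in some filter bucket.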
    • 64. Audience Case Study
      What’s Going On With This Network Uptrend?
      30
    • 65. Off-Line DQ Monitoring in Audience Case Study
      What’s Going On With This Network Uptrend?
      • Alert triggered first week of uptrend indicating instrumentation problem
      31
    • 66. Answer: Off-Line DQ Monitor and Alerting
      Early Detection of Issues at a Fine Grain
      DQ Central
      week-over-week statistical, source comparison, intra-pipeline…
      • In-depth checks at a finer granularity than possible in-line, based on DQ stats
      • 67. Customer-driven (“DQ Champion”) metric / dimension priorities and signoff
      • 68. Sophisticated techniques (time series modeling) to find deeper anomalies
      • 69. Next-day SLA
      • 70. Sometimes needs investigation to avoid false-positive alerts
      • 71. Owned by DQ team with review input from PM, Dev, QE, & Customer
      32
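The simplest of the off-line techniques named above, a week-over-week statistical check, can be sketched as below. The 20% threshold is an assumption for the example; the talk notes that the production system also uses more sophisticated time-series modeling to find deeper anomalies.

```python
# Sketch of a week-over-week off-line monitor: compare each day's metric to the
# same weekday one week earlier and alert when the relative change exceeds a
# threshold. The 20% threshold is an illustrative assumption.

def week_over_week_alerts(daily_counts, threshold=0.20):
    """daily_counts: chronological list of daily page-view counts.
    Returns indices of days deviating more than `threshold` from 7 days prior."""
    alerts = []
    for i in range(7, len(daily_counts)):
        prev = daily_counts[i - 7]
        if prev == 0:
            continue  # no baseline to compare against
        if abs(daily_counts[i] - prev) / prev > threshold:
            alerts.append(i)
    return alerts

counts = [100, 100, 100, 100, 100, 100, 100,  # week 1 baseline
          101,  99, 100, 150, 100, 100, 100]  # week 2: day 10 uptrends
print(week_over_week_alerts(counts))  # [10]
```

Comparing against the same weekday avoids false positives from normal weekday/weekend traffic cycles, which is why week-over-week is the baseline rather than day-over-day.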
    • 72. DQ Central – End to End Audience Data
      Features:
      Data statistics/trending of audience PVs for each property at each stage of the audience pipeline
      End to end data transparency per page and server
      Critical traffic fluctuation notification for properties and custom monitoring for any data customers
      Data issue investigation and diagnostics
      Open/overdue data quality bug tracking
      33
    • 73. DQ Central – DQ Champion Engagement
      1. Data Source Metrics are monitored and an anomaly is found
      2. Each alert is registered in the DQ database
      3. Email detailing alert(s) is sent to DQ Champion
      4. DQ Champion manages alert sign-off in DQ Central UI
      5. Sign-off information is captured for the alert in the DQ database
      6. Explanation (Reason, BugID) overlaid on data
      34
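The DQ Champion sign-off loop above amounts to a small alert-lifecycle workflow. A minimal sketch, with an in-memory list standing in for the DQ database and all field names assumed for illustration:

```python
# Minimal sketch of the alert registration / sign-off loop described above.
# The in-memory "database" and all field names are illustrative assumptions.

alerts_db = []

def register_alert(metric, value, expected):
    """Step 2: each detected anomaly is registered in the DQ database."""
    alert = {"id": len(alerts_db) + 1, "metric": metric,
             "value": value, "expected": expected,
             "signed_off": False, "reason": None, "bug_id": None}
    alerts_db.append(alert)
    return alert["id"]

def sign_off(alert_id, reason, bug_id):
    """Steps 4-5: champion sign-off is captured against the alert record."""
    for a in alerts_db:
        if a["id"] == alert_id:
            a.update(signed_off=True, reason=reason, bug_id=bug_id)
            return True
    return False

aid = register_alert("frontpage_pv", 1_500_000, 1_000_000)
sign_off(aid, "instrumentation double-count", "BUG-123")
print(alerts_db[0]["signed_off"])  # True
```

Keeping the reason and bug ID on the alert record is what makes step 6 possible: the explanation can later be overlaid on the metric chart wherever the anomaly appears.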
    • 74. DQ Central – Case Study
      Impact:
      • suboptimal BT
      • 75. PV counting error
      • 76. UU inaccuracy
      DQ ALERT!
      Fixed same day!!
      DQ Champion
      Analysis Insight
      35
    • 77. Display Sciences Case Study
      Seeing a High Percentage of Null Events
      • Sciences encounters “missing events”
      • 78. Extensive end-to-end analysis of feeds with missing data, upstream feeds, and data sources, slicing and dicing by interesting dimensions to understand the source and cause of the issue
      • 79. Conclusion: Expected behavior; field of interest was populated according to sampling rates as designed, but known only by Serving teams – not by Sciences customers
      • 80. Impact: Dozens of hours of wasted investigation
      36
    • 81. Answer: Data Lineage
      Centralized Documentation Consumable by Customers
      Covers all business logic and filtering rules (any merge, removal, addition, or other modification) as data flows end to end. Lineage answers the question: "How did my data get from A to B to C to D and all the way through Z?” Each stage must provide detail on all major filtering and columns in a form consumable by customers.
      • Resides within the metadata system
      • 82. Available to all customers and kept up-to-date
      • 83. Reduces or removes confusion regarding differences in two seemingly-similar data sets
      • 84. There is a close relationship between Data Lineage and Data Transparency: the former describes the processing rules behind the latter’s transactional data.
      37
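A lineage store of the kind described, one entry per stage with its inputs, outputs, and filtering/business rules, can be sketched as follows. The field names, the single-input/single-output simplification, and the `trace` helper are all assumptions for illustration; the talk only requires that each stage document its major filters and columns in a customer-consumable form.

```python
# Hypothetical lineage records: one entry per pipeline stage, plus a helper
# that answers "how did my data get from A to Z?" by chaining stage entries.
# Field names and the one-input/one-output simplification are assumptions.
from dataclasses import dataclass, field

@dataclass
class StageLineage:
    stage: str
    inputs: list
    outputs: list
    rules: list = field(default_factory=list)  # merges, removals, modifications

def trace(lineage, start, end):
    """Walk stage entries from a starting dataset to a final one."""
    path, current = [], start
    by_input = {s.inputs[0]: s for s in lineage}
    while current != end and current in by_input:
        s = by_input[current]
        path.append(s.stage)
        current = s.outputs[0]
    return path

pipeline = [
    StageLineage("collect", ["weblogs"], ["raw_events"]),
    StageLineage("etl", ["raw_events"], ["clean_events"], ["drop robot traffic"]),
    StageLineage("aggregate", ["clean_events"], ["daily_metrics"]),
]
print(trace(pipeline, "weblogs", "daily_metrics"))  # ['collect', 'etl', 'aggregate']
```

In the null-events case study above, a lineage entry on the serving stage documenting the sampling rule would have answered the Sciences team's question without the dozens of hours of investigation.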
    • 85. DQ Improvement Playbook
      • Metrics/dimension priorities from customers
      • 86. End-to-end profiling & DQ assessment
      • 87. Start building end-to-end monitoring & transparency
      • 88. Core In-line Checks
      • 89. Core Data Lineage
      • 90. Discover systemic solutions through root-cause analysis
      • 91. Develop standards around common causes
      • 92. Adoption as part of process for all new work
      • 93. Dashboard to surface most important DQ success drivers
      38
    • 94. Appendix - Reactive Standards - Product Team Roles & Responsibilities - DQ Program Adoption
      39
    • 95. Reactive Standards
      Dealing with DQ Issues After the Fact
      40
    • 96. DQ Defect Process and Standards
      • Product DQ keywords and distribution lists
      • 97. Triage information template
      • 98. Severity definitions according to metric (including revenue) impact
      • 99. Root-cause analysis
      • 100. Weekly bug scrub
      • 101. Goals to resolve DQ bugs faster with central team oversight
      • 102. Reduce defects bouncing between teams and get them assigned to the proper owner
      • 103. Leverage DQ tools – e2e data transparency
      • 104. Follow-up on delinquents
      41
    • 105. Product Team Roles and Responsibilities
      42
    • 106. Roles and Responsibilities
      Arch, PM, Dev, QE, SE Checklists
      43
    • 107. Data Quality of Shared Pipeline Ownership
      Approach and Inherent Challenges
      • Instrumentation methods for data capture for validation
      • 108. Inconsistent methods (client vs. server side) and business logic
      • 109. Apples-to-apples data reconciliation
      • 110. Inconsistent metrics definitions and dimensions
      • 111. Influencing partner to use best DQ practices
      • 112. Different philosophies, organizational approach, etc.
      • 113. Validate before use
      • 114. Joint data issue investigation/escalation
      44
    • 115. DQ Program Adoption
      As DQ expands across the company, organizational engagement, buy-in, and adoption are key
      45
    • 116. Communication Lessons Learned
      • Set clear expectations – methodology, milestones
      • 117. Paint clear picture of DQ standards and solution path at the top level to executives
      • 118. There are many perspectives on DQ based on product role:
      • 119. [Product] X [Arch,PM,Dev,QE,SE]
      • 120. Need focused communication and training
      46
    • 121. Driving Change in the Organization
      Shared Goals and Stakeholders
      • Need solid technical value proposition, but it won’t work without strong executive support and relationships with key stakeholders
      • 122. Top-down and bottom-up
      • 123. Org-wide shared DQ progress goals and dashboard– has to be measurable and clear to all
      • 124. Solid engagement with the right stakeholders
      • 125. Respected decision makers
      • 126. Lots of meetings initially to get on the same page and build relationships
      47
    • 127. Driving Change in the Organization
      How to Get the Engineers to Own DQ
      • DQ consulting model vs. DQ central team does all the work
      • 128. Product teams have to own DQ – needs some form of consulting model
      • 129. But DQ has to get hands dirty, demonstrate technical success and drive progress on the roadmaps – pure consulting & education doesn’t work
      • 130. DQ standards and checklists have to be specific and as prescriptive as possible
      • 131. Be prepared for pushback when asking for extra dev or QE work
      • 132. Clarify DQ vs. Product Roles and Responsibilities
      • 133. Clearly specify anticipated LOE
      • 134. Scalability: Portal, Training, Standards, Self-Service Tools
      • 135. Challenge: How do you have impact without deep technical domain knowledge?
      • 136. Challenge of working with teams with similar charters: Product Support, QE, Tools
      48
    • 137. Dealing With Customers and Stakeholders
      Beware of Hotbuttons and “Help”
      • Product teams who give you resources – importance of communication, setting expectations and measurable progress
      • 138. Properly engaging unhappy data customers
      • 139. Challenge of dealing with sr. tech/arch who want the perfect technical solution vs. the need to make progress with interim/viable ugly/manual solutions
      • 140. No silver bullets
      • 141. Just because DQ improvement approach worked in one part of the org…
      • 142. Sciences Customer & Partners (Machine Learning)
      49
