Data Quality Challenges & Solution Approaches in Yahoo!’s Massive Data

Data is among Yahoo!'s most strategic assets, from user engagement and insights data to revenue and billing data. Three years ago, Yahoo! invested in a Data Quality program.

By applying industry principles and techniques, the Data Quality program has delivered proactive and reactive system solutions to Audience data issues and their root causes, addressing the technical challenges of data quality at scale and engaging the rest of the organization in the solution: from product teams all through the data stack (data sourcing, ETL, aggregations, and analytics) to the analyst and sciences teams who consume the data. This methodology is now being scaled to all data across Yahoo!, including Search and Display Advertising.

  • “Riding Giants” is only possible using a recently-discovered method: tow-in surfing. Show video: http://www.youtube.com/watch?v=LhKFTqxn6qs
  • 70' wave = power of data
  • UDA DSI = jet-ski method (unlocking the data to harness the 70' wave)
  • UDA DQ = getting the GPS coordinates correct so you are in the right place to catch it – without high-quality data we miss the wave altogether!
  • Yahoo business model = advertising

    1. Data Quality Challenges and Solution Approaches in Yahoo!'s Massive Data Environment
       Data Quality Manager: Dan Defend; Data Quality Architect: Aparna Vani
       DataVersity Webinar, September 29, 2011
       © 2011 Yahoo!. All rights reserved.
    2. Unlocking the Power of Data
    3. MEDIA + Technology
    4. The Anatomy of a Yahoo! Web Page
       Buzz · Targeted Content · Apps · Ads · Content · Y! links
    5. What Yahoo! Does With Its Data
       • Analytics & business insights – data-driven decisions
         – How many people visited the Home Page today, and what did they click on?
         – What impact did the Japan tsunami have on News and global engagement?
       • Targeting
         – What products are you interested in, based on your recent web usage?
         – Advertisers pay a lot of $$ for good targeting.
         – Targeted content means better user engagement.
       • Experimentation
         – "Live user testing"
         – Which layout do users like best? Which is most profitable?
    6. Which Is the Better Home Page?
    7. Which Ad Position Makes More Money?
    8. Yahoo! Has a LOT of Data
       • Leading internet portal and software supplier [1]
       • Serves 640 MM users, or 84.5% of US internet users
       • Top-ranked site in Mail, Messenger, Home Page, and more
       • Collects over 25 terabytes of behavioral data per day
         – 2 U.S. Library of Congress equivalents every day
       [1] US Yahoo! Audience Measurement Report. comScore, Jan 2011
    9. Overview of Yahoo! Data Pipelines
       • Audience (web analytics)
         – Processes data from all Yahoo! properties' web server logs and delivers audience engagement metrics
       • Display Advertising
         – Analytics and billing
         – Guaranteed and non-guaranteed delivery ad campaigns
         – Exchange networks leverage other advertisers and publishers
       • Search Advertising
         – Analytics and billing
    10. Yahoo! Data Pipeline
        Web pages and ad servers → data extraction → extract/transform/load → data warehouse → business insights, targeting, and reporting systems
    11. Dimensions of Yahoo! Data Quality
        • Abuse traffic
        • Invalid events
        • Metadata integrity
        • External-source agreement
        • Uncollected events
        • Key-metric agreement
    12. DATA QUALITY = BUSINESS: Impacts of Poor Data Quality to Yahoo!
        • $$ loss: revenue, refunds
        • Wasted resources
          – Data investigation and reprocessing
          – Sciences frustration and attrition ("QA the data")
        • Incorrect insights
          – Wrong decisions
          – Suboptimal targeting
          – Credibility loss – customers don't trust the data
        • External impact
          – Loss of competitive ranking
        = REAL $$!!! … if managed reactively
    13. Audience DQ Solution Path
    14. Circa 2007: Significant Opportunities for Improvement in the Audience Pipeline
        • Quality of data largely unmeasured
        • Property sourcing:
          – Is this a page view? No standard
          – Tagging & server errors
        • Data dropped in the collection system
        • Data discrepancies found by customers at end of month or quarter
        • Abuse and robots skew metrics due to minimal traffic protection
        • Data sources that should agree, don't
        • Internal customers don't trust the data
    15. We Started to Measure It
        Source: ticket volumes and root-cause analysis
        Key finding: >80% of data issues originate at the source
    16. Root Causes Differ Per Stage
        Source: root-cause analysis
        Insights into point fixes and system solutions
    17. Solution: Identification of Issues Followed by Quick Wins and System Solutions
        Developed a DQ methodology for large data systems
    18. Improvements & Results
    19. Central DQ Team Structure
    20. DQ Proactive Standards: Building Data Quality into Products
    21. Property DQ Standards: Sourcing and Consuming Clean Data
        Proactive, reactive, and customer-driven elements:
        • Data issues: classification, triage, drive fixes
        • DQ Champs: metric/monitor priorities, pain points
        • Server setup cookbook and validation
        • Instrumentation validation: PV, URL, CSC
    22. DQ Standards Overview (Proactive)
    23. Support for DQ in the QE Cycle: Data Validation
        • E2E data-validation tests covering major customer use cases in the pre-release QE cycle
        • Note: specific tools are not currently part of the DQ standard, but partnership in this area may make sense
    24. Data Validation Coverage in QE: Checklist and Examples
        Checklist
        • Coverage from a data perspective, taking into account major data-input scenarios and the major use cases of the direct customers of the data produced
        • Compare results from the legacy system or a previous version of the system (with production data)
        • Suggest organizing per DQ dimensions: completeness, accuracy, validity, consistency, integrity
        Examples
        • Validity: check for valid data values per the schema – certain columns not null, within an acceptable range, or of the correct data type
        • Completeness: include coverage to validate that the volume sent equals the amount of load received, processed, and output
        • Accuracy: test that the data input equals the data output. If data is requested for a specific day in one time zone but fetched in another, the data will not be accurate.
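The three example check types above map naturally onto executable assertions. A minimal sketch in Python (the record fields, function names, and thresholds here are hypothetical illustrations, not Yahoo!'s actual schema or tooling):

```python
# Sketch of the three example check types: validity, completeness, accuracy.
# The "page view" record layout and the acceptable range are illustrative only.

def check_validity(records):
    """Validity: values conform to the schema."""
    for r in records:
        assert r["url"] is not None, "url must not be null"
        assert isinstance(r["pv_count"], int), "pv_count must be an integer"
        assert 0 <= r["pv_count"] <= 10_000_000, "pv_count out of acceptable range"

def check_completeness(volume_sent, volume_received, volume_output):
    """Completeness: volume sent equals volume received, processed, and output."""
    assert volume_sent == volume_received == volume_output, (
        f"volume mismatch: sent={volume_sent} received={volume_received} "
        f"output={volume_output}"
    )

def check_accuracy(input_total, output_total):
    """Accuracy: data in equals data out (watch for time-zone boundary errors)."""
    assert input_total == output_total, "input/output totals disagree"

records = [{"url": "http://www.yahoo.com/", "pv_count": 120}]
check_validity(records)
check_completeness(120, 120, 120)
check_accuracy(120, 120)
```

In a pre-release QE cycle these would run against production-like data, with a legacy-system comparison as a further completeness/accuracy cross-check.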
    25. Support for DQ in the QE Cycle: QE Coverage of DQ Features
        Functional test coverage for built-in DQ features, e.g., in-line DQ checks
    26. Data Sourcing Case Study
        • New ad delivery engine: some impression events were partially corrupted, leading to failed joins and missing data.
        • The problem was not discovered for over 2 days; rollback occurred on the 3rd day.
        • A simple, cheap in-line DQ check on the % of non-billable clicks would have found the problem before any bad data was published.
    27. Answer: In-Line DQ Checks – Immediate Detection of Issues Before Publish
        (week-over-week statistical, source comparison, def met, …)
        • Performed in-line with processing, before publishing the data
        • Designed to catch serious DQ issues needing immediate action
        • Cover processing performed in the stage as well as sanity checks of upstream data (e.g., total volume)
        • Cannot significantly impact the SLA of the process, so checks are generally at a coarse granularity, e.g., at the network level
        • Owned/implemented by Dev, with review input from DQ, QE, PM, and the customer
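The non-billable-click check from the case study could be sketched as an in-line gate that runs on the aggregated batch just before publish (the 5% threshold, exception type, and function names are hypothetical, chosen only to illustrate the pattern):

```python
# Hypothetical in-line DQ gate: a coarse, network-level check run before publish.

class DQCheckFailed(Exception):
    pass

def inline_nonbillable_click_check(total_clicks, nonbillable_clicks,
                                   max_nonbillable_rate=0.05):
    """Refuse to publish if the non-billable click rate is out of its expected band."""
    if total_clicks == 0:
        # Upstream volume sanity check: no clicks at all is itself a serious issue.
        raise DQCheckFailed("zero clicks in batch -- upstream volume check failed")
    rate = nonbillable_clicks / total_clicks
    if rate > max_nonbillable_rate:
        raise DQCheckFailed(
            f"non-billable click rate {rate:.1%} exceeds {max_nonbillable_rate:.0%}"
        )

def publish(batch):
    # Run the cheap in-line check first; publish only if it passes.
    inline_nonbillable_click_check(batch["clicks"], batch["nonbillable"])
    print("published", batch["id"])

publish({"id": "2011-09-28", "clicks": 1_000_000, "nonbillable": 12_000})
```

Because the check is a single ratio at network granularity, it adds negligible latency and cannot threaten the processing SLA, which is exactly the trade-off the slide describes.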
    28. Audience Case Study: Let's Start Looking at ALL Our Data
        Diagram: X Bln/day reported vs. Y Bln/day unreported (invalid? tagging error, ETL error, backend call, non-PV, robot)
        • Valid/invalid traffic improperly classified/reported
        • Processing and storing billions of events/day without knowing why
        • Profile the data to find out which properties/pages/servers are the biggest offenders
    29. Answer: Data Transparency – Where Is All the Data Going?
        • End-to-end, transactional insight into data processed across all stages of the pipeline, based on DQ stats
        • Facilitates data investigations: immediate insight into the processing stage/category with incorrect data
        • Gives customers confidence by showing what happened to the data between the beginning of processing and the final analytics metrics
        • Provides full, end-to-end auditability: every event is accounted for in some stage and filter bucket
        • Transparency reporting can be done in the analytic application or in DQ Central
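One way to picture the auditability requirement: every input event lands in exactly one bucket, either a published output or a named filter reason, so the bucket counts always reconcile against the input volume. A hypothetical sketch (the event fields and bucket names are invented for illustration):

```python
from collections import Counter

# Hypothetical transparency ledger: each event is assigned to exactly one
# bucket (published, or a named filter reason), so totals always reconcile.

def classify(event):
    if event.get("robot"):
        return "filtered:robot"
    if not event.get("tagged_ok"):
        return "filtered:tagging_error"
    if event.get("type") != "page_view":
        return "filtered:non_pv"
    return "published"

events = [
    {"type": "page_view", "tagged_ok": True},
    {"type": "page_view", "tagged_ok": False},
    {"type": "backend_call", "tagged_ok": True},
    {"type": "page_view", "tagged_ok": True, "robot": True},
]

ledger = Counter(classify(e) for e in events)

# Full auditability: bucket counts sum to the input volume, by construction.
assert sum(ledger.values()) == len(events)
print(dict(ledger))
```

Rolling such a ledger up per property, page, or server is what lets an investigation jump straight to the stage and category where data went missing.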
    30. Audience Case Study: What's Going On With This Network Uptrend?
    31. Off-Line DQ Monitoring in the Audience Case Study: What's Going On With This Network Uptrend?
        • Alert triggered in the first week of the uptrend, indicating an instrumentation problem
    32. Answer: Off-Line DQ Monitoring and Alerting – Early Detection of Issues at a Fine Grain
        (week-over-week statistical, source comparison, intra-pipeline, …)
        • In-depth checks at a finer granularity than is possible in-line, based on DQ stats
        • Customer-driven ("DQ Champion") metric/dimension priorities and sign-off
        • Sophisticated techniques (time-series modeling) to find deeper anomalies
        • Next-day SLA
        • Sometimes needs investigation to avoid false-positive alerts
        • Owned by the DQ team, with review input from PM, Dev, QE, & the customer
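A week-over-week statistical check of the kind listed above might look like this sketch (the four-week window and z-score threshold are hypothetical; the time-series modeling the slide mentions would be more sophisticated than a simple z-score):

```python
import statistics

# Hypothetical off-line monitor: compare today's metric against the same
# weekday over prior weeks and alert on a large standardized deviation.

def week_over_week_alert(history, today, z_threshold=3.0):
    """history: same-weekday values from prior weeks; True means anomalous."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    z = abs(today - mean) / stdev
    return z > z_threshold

# Four prior Wednesdays of page views for a property, then a sudden uptrend
# like the one in the Audience case study.
prior_weeks = [1_000_000, 1_020_000, 990_000, 1_010_000]
assert not week_over_week_alert(prior_weeks, 1_015_000)  # normal fluctuation
assert week_over_week_alert(prior_weeks, 1_400_000)      # instrumentation uptrend
```

Run per metric and per dimension (property, page, server), this is far finer-grained than an in-line check could afford, which is why it lives off-line with a next-day SLA.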
    33. DQ Central – End-to-End Audience Data
        Features:
        • Data statistics/trending of audience PVs per property at each stage of the audience pipeline
        • End-to-end data transparency per page and server
        • Critical traffic-fluctuation notification for properties, and custom monitoring for any data customer
        • Data-issue investigation and diagnostics
        • Open/overdue data-quality bug tracking
    34. DQ Central – DQ Champion Engagement
        1. Data-source metrics are monitored and an anomaly is found
        2. Each alert is registered in the DQ database
        3. An email detailing the alert(s) is sent to the DQ Champion
        4. The DQ Champion manages alert sign-off in the DQ Central UI
        5. Sign-off information for the alert is captured in the DQ database
        6. The explanation (reason, bug ID) is overlaid on the data
    35. DQ Central – Case Study
        DQ alert raised; DQ Champion provided analysis insight; fixed the same day!
        Impact:
        • Suboptimal BT
        • PV counting error
        • UU inaccuracy
    36. Display Sciences Case Study: Seeing a High Percentage of Null Events
        • Sciences encounters "missing events"
        • Extensive end-to-end analysis of the feeds with missing data, upstream feeds, and data sources, slicing and dicing by interesting dimensions to understand the source & cause of the issue
        • Conclusion: expected behavior; the field of interest was populated according to sampling rates as designed, but this was known only by the Serving teams, not by the Sciences customers
        • Impact: dozens of hours of wasted investigation
    37. Answer: Data Lineage – Centralized Documentation Consumable by Customers
        Covers all business logic and filtering rules (any merge, removal, addition, or other modification) as data flows end to end. Lineage answers the question: "How did my data get from A to B to C to D and all the way through Z?" Each stage must provide detail on all major filtering and columns in a form consumable by customers.
        • Resides within the metadata system
        • Available to all customers and kept up to date
        • Reduces or removes confusion regarding differences between two seemingly similar data sets
        • There is a close relationship between Data Lineage and Data Transparency: the former describes the processing rules behind the latter's transactional data.
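Lineage of this sort could be captured as structured metadata per stage; a hypothetical minimal shape (the stage names, fields, and helper function are invented to illustrate the idea, not the deck's actual metadata system):

```python
# Hypothetical per-stage lineage records: enough structure to answer
# "how did my data get from A to Z?" in a form customers can consume.

lineage = [
    {
        "stage": "collection",
        "inputs": ["web_server_logs"],
        "output": "raw_events",
        "filters": ["drop malformed records"],
        "notes": "some fields populated at a configured sampling rate",
    },
    {
        "stage": "ETL",
        "inputs": ["raw_events"],
        "output": "cleaned_events",
        "filters": ["remove robot traffic", "deduplicate"],
        "notes": "joins impression and click events",
    },
]

def trace(dataset, lineage):
    """Walk the lineage backwards from a dataset to its ultimate sources."""
    for stage in reversed(lineage):
        if stage["output"] == dataset:
            path = [trace(i, lineage) for i in stage["inputs"]]
            return {dataset: path}
    return dataset  # a root source with no producing stage

print(trace("cleaned_events", lineage))
```

Had the sampling-rate note in the collection stage been recorded and consumable like this, the Sciences investigation in the previous case study would have ended in minutes rather than dozens of hours.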
    38. DQ Improvement Playbook
        • Metric/dimension priorities from customers
        • End-to-end profiling & DQ assessment
        • Start building end-to-end monitoring & transparency
        • Core in-line checks
        • Core data lineage
        • Discover systemic solutions through root-cause analysis
        • Develop standards around common causes
        • Adoption as part of the process for all new work
        • Dashboard to surface the most important DQ success drivers
    39. Appendix
        • Reactive Standards
        • Product Team Roles & Responsibilities
        • DQ Program Adoption
    40. Reactive Standards: Dealing with DQ Issues After the Fact
    41. DQ Defect Process and Standards
        • Product DQ keywords and distribution lists
        • Triage information template
        • Severity definitions according to metric (including revenue) impact
        • Root-cause analysis
        • Weekly bug scrub
          – Goals to resolve DQ bugs faster, with central-team oversight
          – Reduce defects bouncing between teams; get them assigned to the proper owner
          – Leverage DQ tools – e2e data transparency
          – Follow up on delinquents
    42. Product Team Roles and Responsibilities
    43. Roles and Responsibilities: Arch, PM, Dev, QE, SE Checklists
    44. Data Quality of Shared Pipeline Ownership: Approach and Inherent Challenges
        • Instrumentation methods for data capture for validation
          – Inconsistent methods (client- vs. server-side) and business logic
        • Apples-to-apples data reconciliation
          – Inconsistent metric definitions and dimensions
        • Influencing partners to use best DQ practices
          – Different philosophies, organizational approaches, etc.
        • Validate before use
        • Joint data-issue investigation/escalation
    45. DQ Program Adoption
        As DQ expands across the company, organizational engagement, buy-in, and adoption are key
    46. Communication Lessons Learned
        • Set clear expectations – methodology, milestones
        • Paint a clear picture of DQ standards and the solution path at the top level to executives
        • There are many perspectives on DQ based on product role:
          – [Product] × [Arch, PM, Dev, QE, SE]
          – Need focused communication and training
    47. Driving Change in the Organization: Shared Goals and Stakeholders
        • You need a solid technical value proposition, but it won't work without strong executive support and relationships with key stakeholders
        • Top-down and bottom-up
        • Org-wide shared DQ progress goals and dashboard – have to be measurable and clear to all
        • Solid engagement with the right stakeholders
          – Respected decision makers
          – Lots of meetings initially to get on the same page & build relationships
    48. Driving Change in the Organization: How to Get the Engineers to Own DQ
        • DQ consulting model vs. the DQ central team doing all the work
          – Product teams have to own DQ – this needs some form of consulting model
          – But DQ has to get its hands dirty, demonstrate technical success, and drive progress on the roadmaps – pure consulting & education doesn't work
        • DQ standards and checklists have to be specific and as prescriptive as possible
        • Be prepared for pushback when asking for extra Dev or QE work
          – Clarify DQ vs. product roles and responsibilities
          – Clearly specify the anticipated LOE
        • Scalability: portal, training, standards, self-service tools
        • Challenge: how do you have impact without deep technical domain knowledge?
        • Challenge of working with teams with similar charters: Product Support, QE, Tools
    49. Dealing With Customers and Stakeholders: Beware of Hot Buttons and “Help”
        • Product teams who give you resources – the importance of communication, setting expectations, and measurable progress
        • Properly engaging unhappy data customers
        • The challenge of dealing with senior tech/architects who want the perfect technical solution vs. the need to make progress with interim, viable (if ugly/manual) solutions
        • No silver bullets
          – Just because a DQ improvement approach worked in one part of the org…
        • Sciences customers & partners (machine learning)
