Lessons Learned - Monitoring the Data Pipeline at Hulu

Published in: Technology

  1. 1. LESSONS LEARNED MONITORING THE DATA PIPELINE
  2. 2. AGENDA • Who am I? • What’s a Hulu? • Beacons & the Data Pipeline • Monitoring – Take One • Monitoring – Take Two
  3. 3. TRISTAN REID METRICS & REPORTING TOOLS TEAM LEAD
  4. 4. HULU’S MISSION: Help people find and enjoy the world’s premium content when, where and how they want it.
  5. 5. WHY IS HULU EFFECTIVE? PREMIUM CONTENT: • Premium Content • 485+ Content Partners • 6 of 6 Broadcast Networks QUALITY AD EXPERIENCE: • Ads can’t be skipped • Less ad load than TV • 100% video completion rate guarantee USER CONTROL: • On Demand • Across Devices • Choice Based Ad Formats
  6. 6. • Service Oriented • Small teams, specialized scopes • Build tools for other developers • Right tool for the job
  7. 7. Beacons & The Data Pipeline
  8. 8. External View of Beacons: Fire & Forget • HTTP Format • High Availability • Process • Transform • Collect
  9. 9. Beacons: 80 2013-04-01 00:00:00 /v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime&client=Explorer&computerguid=EA8FA1000232B8F6986C3E0BE55E9333&contentid=5003673… • Which show is the user watching? • Which pages did they visit? • How long did they stay? • Where did they come from? • Did they become Plus members?
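A beacon line like the one on this slide can be split into a timestamp, an event path, and key/value parameters. The following is a minimal sketch, assuming the layout is "leading field, date, time, request URI" separated by whitespace; the function name `parse_beacon` and the returned field names are hypothetical, not Hulu's.

```python
from urllib.parse import urlsplit, parse_qs


def parse_beacon(line: str) -> dict:
    """Split one beacon log line into its parts.

    Assumed layout (not confirmed by the slides):
    "<leading field> <date> <time> <request-uri>"
    """
    leading, date, time, uri = line.split(maxsplit=3)
    parts = urlsplit(uri)
    # parse_qs returns a list per key; keep the first value for each parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "leading_field": leading,
        "timestamp": f"{date} {time}",
        "event": parts.path,
        "params": params,
    }


sample = (
    "80 2013-04-01 00:00:00 "
    "/v3/playback/start?bitrate=650&cdn=Akamai&channel=Anime"
    "&client=Explorer&contentid=5003673"
)
beacon = parse_beacon(sample)
```

With the parameters in a dictionary, a question such as "which show is the user watching?" becomes a lookup on `beacon["params"]["contentid"]`.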
  10. 10. The pipeline: Beacon collection service • HDFS • Hive • RDBMS • Log Collector / Flume • MapReduce Jobs • Continuous Aggregation / Selective Publishing • Reporting • Monitoring • Developers • Business Analysts
  11. 11. Data Collection: Avg. 12,000 events per second • Peak: ~35K
  12. 12. Data never stops coming… and we can’t lose any data
  13. 13. HDFS: files bucketed by beacon type and partitioned by hour • Log Collection machine #1 … Log Collection machine #11 • Load balancer • Devices • CDN
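"Bucketed by beacon type and partitioned by hour" suggests a directory per beacon type with one subdirectory per hour. The sketch below shows one way such a path could be derived; the root directory, the type name, and the date/hour directory format are all assumptions for illustration, since the slides do not show the actual layout.

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_path(root: str, beacon_type: str, ts: datetime) -> PurePosixPath:
    """Build a hypothetical HDFS directory for one beacon type and hour.

    Assumed layout: <root>/<beacon_type>/<YYYY-MM-DD>/<HH>
    """
    return (
        PurePosixPath(root)
        / beacon_type
        / ts.strftime("%Y-%m-%d")
        / ts.strftime("%H")
    )


path = partition_path("/beacons", "playback_start", datetime(2013, 4, 1, 0, 30))
```

A layout of this kind lets an hourly job read exactly one directory per beacon type instead of scanning the whole data set.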
  14. 14. MapReduce - from beacons to basefacts: video_id=289696 • content_partner_id=398 • distribution_partner_id=602 • distro_platform_id=14 • is_on_hulu=0 • … • hourid=383149 • watched=76426
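The step from beacons to base facts can be pictured as a map phase that emits a dimension key plus a measure, and a reduce phase that sums the measure per key. This is a single-process sketch of that idea, not the generated Java MapReduce code the talk describes; the choice of `(video_id, hourid)` as the key and the meaning of `watched` as an additive measure are assumptions.

```python
from collections import defaultdict


def map_beacon(beacon: dict):
    """Emit ((video_id, hourid), watched) for one parsed beacon."""
    yield (beacon["video_id"], beacon["hourid"]), beacon["watched"]


def reduce_facts(pairs) -> dict:
    """Sum the measure for every key, as a reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical beacons for illustration only.
beacons = [
    {"video_id": 289696, "hourid": 383149, "watched": 30},
    {"video_id": 289696, "hourid": 383149, "watched": 45},
    {"video_id": 111111, "hourid": 383149, "watched": 10},
]
base_facts = reduce_facts(pair for b in beacons for pair in map_beacon(b))
```

Each entry in `base_facts` corresponds to one row of the hourly fact table, keyed by its dimensions.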
  15. 15. Hulu MapReduce Metrics Jobs • Definitions of beacons and base-facts • BeaconSpec compiler • MapReduce code, including metadata lookups • Job Scheduler • BeaconSpec DSL • Scala / Akka • JFlex & CUP • Java (Generated) • Documentation • Automated Validations for Beacon Generators • In Progress…
  16. 16. UserJobs • Mention the MVEL coolness • MVEL: client contains 'Chrome' && fullscreen == true && (os contains 'Windows' || os contains 'Mac')
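The MVEL expression on this slide is a boolean filter over beacon fields. For readers who do not know MVEL, here is the same condition as a plain Python predicate; it only mirrors the logic of the expression and assumes beacons arrive as dictionaries with `client`, `fullscreen`, and `os` keys.

```python
def matches(beacon: dict) -> bool:
    """Python equivalent of the slide's MVEL filter.

    MVEL: client contains 'Chrome' && fullscreen == true
          && (os contains 'Windows' || os contains 'Mac')
    """
    return (
        "Chrome" in beacon["client"]
        and beacon["fullscreen"] is True
        and ("Windows" in beacon["os"] or "Mac" in beacon["os"])
    )


# Hypothetical beacons for illustration only.
chrome_on_mac = {"client": "Chrome 26", "fullscreen": True, "os": "Mac OS X"}
chrome_windowed = {"client": "Chrome 26", "fullscreen": False, "os": "Windows 7"}
explorer = {"client": "Explorer", "fullscreen": True, "os": "Windows 7"}
```

Expressing the filter as data (a string) rather than compiled code is what would let users define their own jobs without a deployment.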
  17. 17. Aggregation & Publishing: Hourly Facts • Aggregations: Daily/Weekly/Monthly/Quarterly/Annual • Popular Data • MySQL • SQL • Publishing
  18. 18. Reporting Flow: Data API Service • Reporting Portal UI (RP2) • Report Controller • Scheduler • HiveRunner • Published DBs • RP2 DB • Available columns • Date range checks • Submit Report • Execute Report • Check Status • Queue • Run • Generate Query
  19. 19. RP2 UI
  20. 20. Monitoring
  21. 21. Some Issues… BIG DATA PIPELINE? I’LL BET THAT’S GOING GREAT FOR YOU • EMAIL EXPLOSIONS • GATEKEEPING • Overhead • Consumption • CHANGE
  22. 22. Lots of Monitoring Tools Available: Ingest • Jobs • Cluster • OpenTSDB & Graphite
  23. 23. WHAT’S GOING ON??!?? • HOW IS OUR CLUSTER? • WILL WE MEET OUR SLAs? • HOW FAST DID A JOB RUN? • HOW DID RUNTIME COMPARE TO HISTORICAL? • HOW IS THIS COMPONENT? • HOW IS OUR SYSTEM?
  24. 24. The Design… Access all your tools in one place... …but avoid multitasking • Service Oriented Architecture • Comprehensive Web UI
  25. 25. Does this solve our problems? • Single Point of Access? • Maintain services separately? • TAKE THAT, DATA PIPELINE ISSUES!!
  26. 26. Our Users’ Perspective? • We detect platform issues • We quickly troubleshoot errors • We track relative performance • We know where we are re: SLAs • …but is detection of a problem enough? • Diagram: A PROBLEM, DETECTION, USERS • We need to think of things from the report users’ perspectives
  27. 27. The User Perspective (diagram): User Group • Report User • Report • Schedule • Report Run • Data Pipeline Resources • ETC!
  28. 28. Contextual Troubleshooting Model • Connect issues to business units • Better impact assessment • Tune performance per user needs • We need a graph data structure, populated with the stuff we care about. Something like this.
  29. 29. Why a Graph? • …instead of RDBMS: indeterminate # of joins; query for graph connectedness is trivial and short; query for connectedness w/ SQL relies on knowing the intermediate resources • …instead of a tree? Data is sometimes recombinant (e.g. a metric in multiple reports to same user)
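The point about connectedness can be shown with a small traversal: given a failing resource, walk the graph backwards to find every user it touches, without knowing how many intermediate hops there are. This is a minimal in-memory sketch using breadth-first search; the node names and the edge direction (user to report to metric to table) are hypothetical, and the talk does not say which graph store was used.

```python
from collections import deque

# Hypothetical dependency edges: each node points at what it depends on.
edges = {
    "user:alice": ["report:daily_views"],
    "user:bob": ["report:daily_views", "report:ad_load"],
    "report:daily_views": ["metric:watched"],
    "report:ad_load": ["metric:watched", "metric:ad_impressions"],
    "metric:watched": ["table:hive.playback_hourly"],
    "metric:ad_impressions": ["table:hive.ads_hourly"],
}


def reverse(graph: dict) -> dict:
    """Flip every edge so we can walk from a resource back to its users."""
    flipped = {}
    for source, targets in graph.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


def reachable(graph: dict, start: str) -> set:
    """Breadth-first search; the visited set handles recombinant paths."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


def affected_users(graph: dict, resource: str) -> list:
    """Users connected to a resource through any number of hops."""
    upstream = reachable(reverse(graph), resource)
    return sorted(node for node in upstream if node.startswith("user:"))
```

Note that `metric:watched` appears in two of Bob's reports, which is the recombinant case from the slide: a tree could not represent it, and the traversal still reports Bob only once.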
  30. 30. Let’s investigate… • These failed before getting to a data store. • Most of the Hive failures were the same table, but it’s a common table. • As we filter, the matched reports show up on the bottom of the page. • The log link shows us the details.
  31. 31. Each service implements a log-fetching interface, specific to the resources used for a particular report
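One way to read "each service implements a log-fetching interface" is a shared method signature that every service satisfies, so the UI can gather logs for a report run without knowing service internals. The sketch below assumes that reading; `LogFetcher`, `fetch_logs`, and the in-memory stand-in services are hypothetical names, not the actual Hulu interface.

```python
from typing import Protocol


class LogFetcher(Protocol):
    """Hypothetical interface each pipeline service would implement."""

    def fetch_logs(self, report_run_id: str) -> list[str]:
        ...


class InMemoryLogFetcher:
    """Stand-in for a real service; returns canned log lines per run."""

    def __init__(self, logs_by_run: dict[str, list[str]]) -> None:
        self._logs_by_run = logs_by_run

    def fetch_logs(self, report_run_id: str) -> list[str]:
        return self._logs_by_run.get(report_run_id, [])


def collect_logs(fetchers: dict[str, LogFetcher], report_run_id: str) -> dict[str, list[str]]:
    """Ask every service for the logs tied to one report run."""
    return {
        name: fetcher.fetch_logs(report_run_id)
        for name, fetcher in fetchers.items()
    }


# Hypothetical services and log lines for illustration only.
fetchers = {
    "hive_runner": InMemoryLogFetcher({"run-1": ["query failed: table missing"]}),
    "scheduler": InMemoryLogFetcher({"run-1": ["queued", "started"]}),
}
logs = collect_logs(fetchers, "run-1")
```

Because each service only has to satisfy the method signature, services can be maintained separately and still be wired into a single troubleshooting view.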
  32. 32. SUCCESS!!!
  33. 33. In Summary… • Find the Important Questions => Measure the Right Data • Make troubleshooting easy • Small distinct services are easy to create, maintain, and wire together
  34. 34. Questions? Thanks to… • Muthu…the Platform GrandMaster • All of Metrics Platform, Tools, Reporting for making this stuff • Mohamed, Chris, Charlie, Robert, Phong, AJ, Ratheesh, Adi, Matt, Shashank, Joanne, Siddhartha, Tamir, Jun, James, Dr. Kevin, Hang • All of the Hulu DEV team for general awesomeness • Prasan…thanks for the impetus to do this. I’ll look u up • Kevin…thanks for Hulu. I’ll send u a snap
