OSCON 2015

Slides from my talk at OSCON 2015 about the things we measure in an attempt to understand our data.

Published in: Technology

  1. Measuring Big Data
     Understanding data by usage
     Charles Smith, Big Data Platform Architecture - Netflix
  2. About Me
     ▪ Netflix
       - I joined Netflix in 2011
       - I spend my time working to make big data easy and efficient
       - Usually from the perspective of someone trying to use the platform
     ▪ University of Florida
       - Research in Information Retrieval
       - How much information does a document have
  3. What would you measure?
  4. What do you want to know?
  5. ▪ ~20 PB of compressed data
     ▪ ~500 billion events a day
     ▪ ~18K data sets
     ▪ ~4200 nodes in our clusters
  6. Our largest two datasets:
     ▪ 1.4 PB
     ▪ 1.2 PB
  7. ▪ ~11K Hive
     ▪ ~3K Pig
     ▪ ~2.5K Presto
  8. Task Hour Cost = (cost of node) / (tasks per node) * sum(task duration ms) / (60*60*1000)
  9. 100 Jobs comprise 86% of the cost
  10. What data is important?
  11. Make people tell you the answer: tagging.
  12. Manual data doesn’t stay current unless it needs to.
  13. How do we actually use the data?
  14. Parse the job (or ask the tool that parses it)
  15. Charlotte
      ▪ Presto
      ▪ Sql Parser (Hive)
      ▪ Sql Parser (Teradata)
      ▪ Lipstick (Pig)
      ▪ Metacat*
  16. Dataset                              Distinct Queries
      …                                    2000
      …                                    1052
      prodhive/dse/geo_country_d           1009
      prodhive/dse/ttl_title_d             580
      …                                    565
      …                                    512
      …                                    466
      …                                    427
      …                                    395
      …                                    317
  17. Dataset                              Queries
      prodhive/dse/geo_country_d           11405
      prodhive/dse/ttl_title_d             8194
      …                                    5928
      …                                    5451
      …                                    4849
      …                                    4654
      …                                    4334
      …                                    3620
      …                                    3046
      …                                    2823
  18. Related To geo_country_d             Shared Queries
      prodhive/dse/ttl_title_country_r     2277
      …                                    1697
      prodhive/dse/ttl_show_d              1540
      prodhive/dse/ttl_season_d            1405
      prodhive/dse/ttl_title_d             1392
      …                                    926
      …                                    817
      …                                    743
      prodhive/dse/ttl_season_country_r    638
      …                                    628
  19. Datasets                             Input Jobs    Queries
      prodhive/cdn/occ…                    2016          66
      teradata/gdw_stg_prod/seg…           1587          36
      prodhive/dse/msg…                    1527          14
      prodhive/dse/msg…                    1512          30
      teradata/gdw_stg_prod/seg…           1043          50
      teradata/gdw_stg_prod/cdn…           970           10
      teradata/gdw_tbl_prod/seg…           903           1
      prodhive/rpt/pbe…                    811           11
      prodhive/gps/gro…                    904           137
      prodhive/cdn/ttl…                    631           39
  20. Challenges
      ▪ Knowing what questions you should try to answer.
      ▪ Getting this data isn’t easy.
      ▪ The data is noisy.
  21. Thanks
      ▪ Charles Smith – Big Data Platform Architecture, Netflix
      ▪ @charles_s_smith
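
The Task Hour Cost formula on slide 8 and the "100 jobs comprise 86% of the cost" observation on slide 9 can be sketched in a few lines of Python. This is a minimal illustration, not the code used at Netflix. The function names, the reading of "cost of node" as a per-node-hour price, and all sample numbers are assumptions made for the example.

```python
from typing import Iterable, Sequence

MS_PER_HOUR = 60 * 60 * 1000  # the (60*60*1000) divisor from slide 8


def task_hour_cost(node_cost: float,
                   tasks_per_node: int,
                   task_durations_ms: Iterable[float]) -> float:
    """Cost of one job, following the slide 8 formula.

    node_cost is assumed to be the price of one node for one hour;
    tasks_per_node is how many tasks a node runs concurrently.
    """
    cost_per_task_hour = node_cost / tasks_per_node
    task_hours = sum(task_durations_ms) / MS_PER_HOUR
    return cost_per_task_hour * task_hours


def top_n_cost_share(job_costs: Sequence[float], n: int) -> float:
    """Fraction of the total cost that comes from the n most expensive jobs."""
    total = sum(job_costs)
    if total == 0:
        return 0.0
    most_expensive = sorted(job_costs, reverse=True)[:n]
    return sum(most_expensive) / total


# Hypothetical job: two tasks of 30 and 90 minutes on a node that costs
# 2.0 per hour and runs 8 tasks at a time.
example_cost = task_hour_cost(2.0, 8, [1_800_000, 5_400_000])

# Hypothetical per-job costs: share of the total owed to the top 2 jobs.
example_share = top_n_cost_share([50, 30, 10, 5, 5], 2)
```

With real per-job costs, calling `top_n_cost_share(costs, 100)` would give the kind of figure quoted on slide 9.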
