Measuring Big Data
Understanding data by usage
Charles Smith
Big Data Platform Architecture - Netflix
About Me
▪ Netflix
- I joined Netflix in 2011
- I spend my time working to make big data easy and efficient
- Usually from the perspective of someone trying to use the platform
▪ University of Florida
- Research in Information Retrieval
- How much information does a document have?
What would you measure?
What do you want to know?
~20 PB of compressed data
~500 billion events a day
~18K data sets
~4200 nodes in our clusters
Our largest two datasets:
1.4 PB
1.2 PB
~11K Hive
~3K Pig
~2.5K Presto
Task Hour Cost = (cost of node)/(tasks per node) * sum(task duration ms)/(60*60*1000)
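The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual accounting code: the function and parameter names are hypothetical, and it assumes "cost of node" is an hourly rate and "tasks per node" is the number of concurrent task slots on a node.

```python
MS_PER_HOUR = 60 * 60 * 1000


def task_hour_cost(node_cost_per_hour, tasks_per_node, task_durations_ms):
    """Cost of one job: hourly rate of a task slot times total task hours."""
    # Assumption: a node's hourly cost is split evenly across its task slots.
    cost_per_task_hour = node_cost_per_hour / tasks_per_node
    # Sum of all task durations, converted from milliseconds to hours.
    task_hours = sum(task_durations_ms) / MS_PER_HOUR
    return cost_per_task_hour * task_hours


# Hypothetical job: three tasks (30, 60 and 30 minutes) on nodes that cost
# 2.0 per hour and run 8 tasks each.
example_cost = task_hour_cost(2.0, 8, [1_800_000, 3_600_000, 1_800_000])
```

Because the cost is built from task durations rather than wall-clock time, a wide job and a long job that consume the same number of task hours cost the same.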
100 jobs account for 86% of the cost
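A concentration figure like this one can be computed from per-job costs by ranking jobs and taking the share of the top N. The sketch below is illustrative only; the job names and costs are made up, and `top_n_cost_share` is a hypothetical helper.

```python
def top_n_cost_share(job_costs, n):
    """Fraction of total cost attributable to the n most expensive jobs."""
    total = sum(job_costs.values())
    if total == 0:
        return 0.0
    # Rank jobs by cost, most expensive first, and keep the top n.
    top = sorted(job_costs.values(), reverse=True)[:n]
    return sum(top) / total


# Hypothetical per-job costs (for example, the output of a task hour cost calculation).
job_costs = {"job_a": 50.0, "job_b": 30.0, "job_c": 15.0, "job_d": 5.0}
share = top_n_cost_share(job_costs, 2)
```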
What data is important?
Make people tell you the answer: tagging.
Manual data doesn’t stay current unless it needs to.
How do we actually use the data?
Parse the job (or ask the tool that parses it)
Charlotte
Presto
SQL Parser (Hive)
SQL Parser (Teradata)
Lipstick (Pig)
Metacat*
Dataset                      Distinct Queries
…                            2000
…                            1052
prodhive/dse/geo_country_d   1009
prodhive/dse/ttl_title_d     580
…                            565
…                            512
…                            466
…                            427
…                            395
…                            317
Dataset                      Queries
prodhive/dse/geo_country_d   11405
prodhive/dse/ttl_title_d     8194
…                            5928
…                            5451
…                            4849
…                            4654
…                            4334
…                            3620
…                            3046
…                            2823
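The difference between the two rankings (distinct queries versus total queries) can be sketched as follows. This assumes each execution has already been parsed into its query text and the datasets it reads; the record shape, the whitespace-and-case normalization, and the dataset names are all hypothetical.

```python
from collections import Counter, defaultdict


def query_counts(executions):
    """executions: iterable of (query_text, datasets_read) pairs, one per run.

    Returns (total runs per dataset, distinct query texts per dataset).
    """
    total = Counter()
    distinct = defaultdict(set)
    for query_text, datasets in executions:
        # Assumption: two runs are the "same query" if their text matches
        # after collapsing whitespace and ignoring case.
        normalized = " ".join(query_text.lower().split())
        for dataset in set(datasets):
            total[dataset] += 1
            distinct[dataset].add(normalized)
    return total, {name: len(texts) for name, texts in distinct.items()}


# Hypothetical executions; "db/a" and "db/b" are placeholder dataset names.
executions = [
    ("SELECT * FROM a", ["db/a"]),
    ("select *  from a", ["db/a"]),
    ("SELECT * FROM a JOIN b", ["db/a", "db/b"]),
]
total, distinct = query_counts(executions)
```

A dataset with many runs but few distinct queries is probably read by a handful of scheduled jobs; many distinct queries suggest broad, ad hoc use.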
Related To geo_country_d            Shared Queries
prodhive/dse/ttl_title_country_r    2277
…                                   1697
prodhive/dse/ttl_show_d             1540
prodhive/dse/ttl_season_d           1405
prodhive/dse/ttl_title_d            1392
…                                   926
…                                   817
…                                   743
prodhive/dse/ttl_season_country_r   638
…                                   628
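A "shared queries" count like the one above is a co-occurrence count: for a target dataset, tally every other dataset read by the same query. A minimal sketch, assuming each query has already been reduced to the list of datasets it reads (the function name and placeholder dataset names are hypothetical):

```python
from collections import Counter


def shared_query_counts(queries, target):
    """Count, per dataset, the queries that read it together with target."""
    shared = Counter()
    for datasets in queries:
        used = set(datasets)  # ignore repeated reads within one query
        if target in used:
            shared.update(used - {target})
    return shared


# Hypothetical queries, each given as the datasets it reads.
queries = [
    ["db/a", "db/b"],
    ["db/a", "db/b", "db/c"],
    ["db/b", "db/c"],
    ["db/a"],
]
related = shared_query_counts(queries, "db/a")
ranked = related.most_common()  # most frequently co-read datasets first
```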
Datasets                      Input Jobs   Queries
prodhive/cdn/occ…             2016         66
teradata/gdw_stg_prod/seg…    1587         36
prodhive/dse/msg…             1527         14
prodhive/dse/msg…             1512         30
teradata/gdw_stg_prod/seg…    1043         50
teradata/gdw_stg_prod/cdn…    970          10
teradata/gdw_tbl_prod/seg…    903          1
prodhive/rpt/pbe…             811          11
prodhive/gps/gro…             904          137
prodhive/cdn/ttl…             631          39
Challenges
▪ Knowing which questions you should try to answer.
▪ Getting this data isn’t easy.
▪ The data is noisy.
Thanks
▪ Charles Smith – Big Data Platform Architecture, Netflix
▪ @charles_s_smith

OSCON 2015