Your SlideShare is downloading. ×
Xldb2011 tue 1120_youtube_datawarehouse
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Introducing the official SlideShare app

Stunning, full-screen experience for iPhone and Android

Text the download link to your phone

Standard text messaging rates apply

Xldb2011 tue 1120_youtube_datawarehouse

927
views

Published on

Published in: Technology, Business

1 Comment
2 Likes
Statistics
Notes
  • haha , youtube
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
927
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
21
Comments
1
Likes
2
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Youtube Data WarehouseBiswapesh Chattopadhyaybiswapesh@google.comXLDB 2011 Google Confidential and Proprietary
  • 2. YTDW - Motivation & History● Consolidated warehouse of Youtube data● Videos, playbacks, summarized logs, etc.● Very large (X PB uncompressed, Trillion row tables)● High volume ETL (XXX TB processed / day)● 100% Google Tech Stack: ○ Query: Oracle -> MySQL -> ColumnIO ○ ETL: Python -> Sawzall + Tenzing + Python ○ Reporting: Microstrategy -> ABI● Key technologies: Sawzall, Tenzing, Dremel, ABI Google Confidential and Proprietary
  • 3. YTDW - Overall Architecture Logs Dremel ABI Service Reports ETL (Sawzall, MySQL Tenzing, YTDW DB Python, C++ (ColumnIO / MR) GFS) Tenzing Service Bigtable Google Confidential and Proprietary
  • 4. YTDW - About Sawzall● Scripting language on Google MR framework● Sawzall vs MR-Saw● Built-in security for accessing sensitive logs data● Strong support for aggregation and complex computations● Read/write various formats● Procedural language● Open sourced!● YTDW Usage: ○ ETL of Youtube logs ○ Complex one-off logs analysis Google Confidential and Proprietary
  • 5. YTDW - About Tenzing● SQL on MR - Think HIVE, HadoopSQL● Key strengths: ○ Strong SQL support ○ Highly scalable - built on Google MR ○ Read / write many formats● Weaknesses: ○ Not ideal for complex procedural code ○ Higher latency than Dremel ○ Limited support for nested-repeated structures● YTDW Usage: ○ ETL for non-logs data, denormalizations ○ Medium complexity analysis on YTDW dataGoogle Confidential and Proprietary
  • 6. YTDW - About Dremel● Current use in YTDW: ○ Reporting query engine ○ Interactive simple logs analysis● Key Strengths ○ Very low latency ○ SQL support ○ Strong nested-relational support ○ Access to logs● Limitations ○ More complex SQL constructs (joins, setops, ...) ○ Limited library of functions ○ Doesnt scale as much as MR Google Confidential and Proprietary
  • 7. YTDW - Techonology Comparison Sawzall Tenzing Dremel Latency High Medium LowScalability High High Medium SQL None High Medium Power High Medium Low Google Confidential and Proprietary
  • 8. YTDW Future: Query Engines ● Adding MR capabilities to Dremel ○ Scalable reliable shuffle ○ Materializing large result sets ○ Read / write multiple data formats ● Easier / more powerful analysis in Dremel ○ User defined scalar and table values functions ○ More SQL features: ■ Better support for joins ■ Analytic functions, set operators, etc. ● Long term for Dremel: ○ Completely replace Tenzing MR backend ○ Extend BigQuery service capabilities Google Confidential and Proprietary
  • 9. YTDW - About ABI● Complete reporting and dashboarding solution● Built on Google stack● Tight integration with Dremel and ColumnIO● Google Visualizations, some Flash● Current use in YTDW: ○ Most reports and dashboards Google Confidential and Proprietary
  • 10. YTDW - Misc Technologies● Python ○ Glue code - drivers, wrappers, etc. ○ Simple small scale extracts● Scheduler ○ In-house scheduling framework ○ Built internally by YTDW engineers● C++ MapReduce ○ Used sparingly for complex cases not possible using Sawzall /Tenzing● Query Rewriter ○ Sits between ABI and Dremel ○ Rewrites queries to be faster / cheaper Google Confidential and Proprietary
  • 11. YTDW - PerformancePerformance improvement strategies: ● Data model: ○ De-normalized aggregated materialized views ○ Range partitioning ● Query rewrite layer: ○ Use the right aggregated materialized view ○ Prune partitions based on data knowledge ● Reporting front end ○ Aggressive result caching (memcache) Google Confidential and Proprietary
  • 12. YTDW - Q & A Google Confidential and Proprietary