2008 Ur Tech Talk Zshao

Zheng Shao's talk about Hive at UIUC

Published in: Technology, Business
    1. 2. Facebook and Open Source University of Illinois, Urbana-Champaign
       • Zheng Shao
         • Facebook Data Team
         • 10/3/2008
    2. 3. Current Status of Facebook
       • #5 site on the Internet
       • 100+ million users
       • 65 billion monthly page views
       • > 2^32 photos
       • 400,000 developers around the world
       • Google/7 search queries
       • More hourly news stories than every other media source in history combined
    3. 4. Architecture
       • 3 service tiers:
         • Web/Apache
         • Memcached
         • MySQL
       • Uses:
         • Linux
         • PHP
    4. 5. Architecture (continued)
       • Backend Data Warehousing System:
         (Diagram components: Web Servers, Scribe Servers, Network Storage, Hive on Hadoop Cluster, Oracle RAC, MySQL)
    5. 6. Open Source Strategy
       • Facebook relies heavily on open source software.
       • Facebook makes big contributions to the open source community.
       • What’s the benefit of open source for you?
         • Everybody can use it, including YOU
         • Everybody can work on it, including YOU
    6. 7. Open Source at Facebook
       • Thrift
       • Cassandra
       • Facebook Open Platform
       • MemCacheD enhancements
       • PHP enhancements
       • Hive on top of Hadoop
       • Many more at http://developers.facebook.com/opensource.php
    7. 8. Hadoop and Hive Distributed File System, Map-Reduce, and SQL
    8. 9. Hadoop
       • Mimics Google File System and Map Reduce
       • Started by Yahoo in Feb 2006
       • Supported/used by 30+ companies/universities now, including Google: http://wiki.apache.org/hadoop/PoweredBy
    9. 10. Why Hadoop?
       • We want
         • Efficiency
         • Scalability
         • Redundancy
         • Load balance
       • Google File System and Map Reduce are nice
         • But not open source
    10. 11. Hive
       • A database/data warehouse on top of Hadoop
         • Rich data types (structs, lists and maps)
         • Efficient implementations of SQL filters, joins and group-by’s on top of map reduce
         • Easy interactions with different programming languages
    11. 12. Why Hive?
       • We want
         • SQL support
         • Efficiency
         • Scalability
         • Redundancy
         • Easy to add new features
       • Alternatives:
         • Traditional database systems
         • Proprietary data warehousing systems
         • Sawzall / Pig
    12. 13. Hive Architecture
       (Diagram labels: HDFS; Map Reduce; Planner; Hive CLI; DDL; Queries; Browsing; SerDe; Thrift; Jute; JSON; Thrift API; MetaStore; Web UI; Mgmt, etc; Hive QL; Planner; Execution; Parser)
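    13. 14. (Simplified) Map Reduce Review
       (Diagram: data flowing through four stages on two machines)
       • Input
         • Machine 1: <k1, v1> <k2, v2> <k3, v3>
         • Machine 2: <k4, v4> <k5, v5> <k6, v6>
       • Local Map
         • Machine 1: <nk1, nv1> <nk2, nv2> <nk3, nv3>
         • Machine 2: <nk2, nv4> <nk2, nv5> <nk1, nv6>
       • Global Shuffle
         • One machine: <nk1, nv1> <nk3, nv3> <nk1, nv6>
         • Other machine: <nk2, nv4> <nk2, nv5> <nk2, nv2>
       • Local Sort
         • One machine: <nk1, nv1> <nk1, nv6> <nk3, nv3>
         • Other machine: <nk2, nv4> <nk2, nv5> <nk2, nv2>
       • Local Reduce
         • One machine: <nk1, 2> <nk3, 1>
         • Other machine: <nk2, 3>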
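The four stages in the diagram above can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop code: the function name `run_map_reduce`, the `NEW_KEY` lookup table, and the CRC-based partitioner are assumptions made for the example. The reducer simply counts values per key, which reproduces the counts shown on the slide.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def run_map_reduce(machines, map_fn, reduce_fn, num_reducers=2):
    """Simulate local map, global shuffle, local sort, and local reduce."""
    # Local map + global shuffle: each machine maps its own records, and
    # every resulting pair is routed to a reducer chosen from its key.
    buckets = [[] for _ in range(num_reducers)]
    for records in machines:
        for record in records:
            for key, value in map_fn(record):
                target = zlib.crc32(repr(key).encode()) % num_reducers
                buckets[target].append((key, value))

    output = {}
    for bucket in buckets:
        # Local sort: bring equal keys next to each other.
        bucket.sort(key=itemgetter(0))
        # Local reduce: one call per distinct key.
        for key, group in groupby(bucket, key=itemgetter(0)):
            output[key] = reduce_fn(key, [value for _, value in group])
    return output


# Hypothetical mapping from input keys to new keys, chosen to match the slide.
NEW_KEY = {"k1": "nk1", "k2": "nk2", "k3": "nk3",
           "k4": "nk2", "k5": "nk2", "k6": "nk1"}

machines = [
    [("k1", "v1"), ("k2", "v2"), ("k3", "v3")],  # Machine 1
    [("k4", "v4"), ("k5", "v5"), ("k6", "v6")],  # Machine 2
]

counts = run_map_reduce(
    machines,
    map_fn=lambda record: [(NEW_KEY[record[0]], "n" + record[1])],
    reduce_fn=lambda key, values: len(values),
)
```

Because all pairs with the same key reach the same reducer, the per-key result does not depend on how many reducers are used.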
    14. 15. Hive QL – Join
       • SQL:
         INSERT INTO TABLE pv_users
         SELECT pv.pageid, u.age
         FROM page_view pv JOIN user u ON (pv.userid = u.userid);
       • page_view X user = pv_users

         page_view:
         pageid  userid  time
         1       111     9:08:01
         2       111     9:08:13
         1       222     9:08:14

         user:
         userid  age  gender
         111     25   female
         222     32   male

         pv_users:
         pageid  age
         1       25
         2       25
         1       32
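    15. 16. Hive QL – Join in Map Reduce
       (Diagram: page_view and user flow through Map, Shuffle/Sort and Reduce to produce pv_users)
       • Input tables

         page_view:
         pageid  userid  time
         1       111     9:08:01
         2       111     9:08:13
         1       222     9:08:14

         user:
         userid  age  gender
         111     25   female
         222     32   male

       • Map output

         From page_view:
         key  value
         111  <1, 1>
         111  <1, 2>
         222  <1, 1>

         From user:
         key  value
         111  <2, 25>
         222  <2, 32>

       • After Shuffle and Sort

         First reducer:
         key  value
         111  <1, 1>
         111  <1, 2>
         111  <2, 25>

         Second reducer:
         key  value
         222  <1, 1>
         222  <2, 32>

       • Reduce output (pv_users)

         First reducer:
         pageid  age
         1       25
         2       25

         Second reducer:
         pageid  age
         1       32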
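The join diagram above tags each value with the table it came from (1 for page_view, 2 for user) and uses userid as the shuffle key. The following is a minimal in-memory sketch of that idea, not Hive's actual implementation; the function names `map_phase`, `shuffle_sort` and `reduce_phase` are made up for the example.

```python
from collections import defaultdict

# Rows copied from the slide.
page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]  # pageid, userid, time
user = [(111, 25, "female"), (222, 32, "male")]                              # userid, age, gender


def map_phase():
    """Emit (userid, (table_tag, payload)) pairs for both input tables."""
    for pageid, userid, _time in page_view:
        yield userid, (1, pageid)   # tag 1: row comes from page_view
    for userid, age, _gender in user:
        yield userid, (2, age)      # tag 2: row comes from user


def shuffle_sort(pairs):
    """Group values by key and sort them so tag-1 values precede tag-2 values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}


def reduce_phase(groups):
    """For each userid, pair every pageid with every age (an inner join)."""
    for _userid, values in groups.items():
        pageids = [payload for tag, payload in values if tag == 1]
        ages = [payload for tag, payload in values if tag == 2]
        for pageid in pageids:
            for age in ages:
                yield pageid, age


pv_users = list(reduce_phase(shuffle_sort(map_phase())))
```

A userid that appears in only one table produces an empty `pageids` or `ages` list, so it contributes no output rows, which is the behavior expected of an inner join.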
    16. 17. Hive QL – Group By
       • SQL:
         INSERT INTO TABLE pageid_age_sum
         SELECT pageid, age, count(1)
         FROM pv_users
         GROUP BY pageid, age;

         pv_users:
         pageid  age
         1       25
         2       25
         1       32
         2       25

         pageid_age_sum:
         pageid  age  count
         1       25   1
         2       25   2
         1       32   1
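    17. 18. Hive QL – Group By in Map Reduce
       (Diagram: pv_users flows through Map, Shuffle/Sort and Reduce to produce pageid_age_sum)
       • Input (pv_users, split across two mappers)

         First mapper:
         pageid  age
         1       25
         2       25

         Second mapper:
         pageid  age
         1       32
         2       25

       • Map output

         First mapper:
         key      value
         <1, 25>  1
         <2, 25>  1

         Second mapper:
         key      value
         <1, 32>  1
         <2, 25>  1

       • After Shuffle and Sort

         First reducer:
         key      value
         <1, 25>  1
         <1, 32>  1

         Second reducer:
         key      value
         <2, 25>  1
         <2, 25>  1

       • Reduce output (pageid_age_sum)

         First reducer:
         pageid  age  count
         1       25   1
         1       32   1

         Second reducer:
         pageid  age  count
         2       25   2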
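As a rough sketch of the group-by flow above: the mapper uses the grouping columns (pageid, age) as the key and emits the value 1 for every row, and the reducer sums the values per key. The code below runs in one process and only illustrates the data flow; the names `map_row` and `mapper_inputs` are assumptions of this example.

```python
from collections import defaultdict

# pv_users rows from the slide, split across two mappers: (pageid, age).
mapper_inputs = [
    [(1, 25), (2, 25)],
    [(1, 32), (2, 25)],
]


def map_row(row):
    """Use the grouping columns as the key and emit a count of 1."""
    return row, 1


# Shuffle: collect all values that share the same (pageid, age) key.
shuffled = defaultdict(list)
for rows in mapper_inputs:
    for row in rows:
        key, value = map_row(row)
        shuffled[key].append(value)

# Sort by key, then reduce: sum the 1s to get count(1) per group.
pageid_age_sum = [
    (pageid, age, sum(values))
    for (pageid, age), values in sorted(shuffled.items())
]
```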
    18. 19. Hive QL – Group By with Distinct
       • SQL:
         SELECT pageid, COUNT(DISTINCT userid)
         FROM page_view GROUP BY pageid

         page_view:
         pageid  userid  time
         1       111     9:08:01
         2       111     9:08:13
         1       222     9:08:14
         2       111     9:08:20

         result:
         pageid  count_distinct_userid
         1       2
         2       1
    19. 20. Hive QL – Group By with Distinct in Map Reduce
       • Send your thoughts to zshao@facebook.com
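The slide leaves this as an exercise, so what follows is only one possible approach and not necessarily what Hive does. The sketch assumes a composite key (pageid, userid) that is partitioned on pageid alone: after sorting, duplicate userids for the same page sit next to each other, so the reducer can count distinct values without holding them all in memory. All function names are hypothetical.

```python
from itertools import groupby

# page_view rows from the previous slide: (pageid, userid, time).
page_view = [
    (1, 111, "9:08:01"),
    (2, 111, "9:08:13"),
    (1, 222, "9:08:14"),
    (2, 111, "9:08:20"),
]


def map_phase(rows):
    """Emit the composite key (pageid, userid); no value is needed."""
    for pageid, userid, _time in rows:
        yield pageid, userid


def partition(key, num_reducers):
    """Route by pageid only, so one reducer sees every userid of a page."""
    pageid, _userid = key
    return pageid % num_reducers


def count_distinct(rows, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for key in map_phase(rows):
        buckets[partition(key, num_reducers)].append(key)

    result = {}
    for bucket in buckets:
        bucket.sort()  # sort on the full (pageid, userid) key
        for pageid, keys in groupby(bucket, key=lambda k: k[0]):
            distinct = 0
            previous = None
            for _pageid, userid in keys:
                if userid != previous:  # a new userid starts a new run
                    distinct += 1
                    previous = userid
            result[pageid] = distinct
    return result


result = count_distinct(page_view)
```

The result matches the table on the previous slide: page 1 has two distinct users and page 2 has one.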
    20. 21. Hive Optimizations Efficient execution of SQL on Map Reduce
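    21. 22. (Simplified) Map Reduce Revisit
       (Diagram: data flowing through four stages on two machines)
       • Input
         • Machine 1: <k1, v1> <k2, v2> <k3, v3>
         • Machine 2: <k4, v4> <k5, v5> <k6, v6>
       • Local Map
         • Machine 1: <nk1, nv1> <nk2, nv2> <nk3, nv3>
         • Machine 2: <nk2, nv4> <nk2, nv5> <nk1, nv6>
       • Global Shuffle
         • One machine: <nk1, nv1> <nk3, nv3> <nk1, nv6>
         • Other machine: <nk2, nv4> <nk2, nv5> <nk2, nv2>
       • Local Sort
         • One machine: <nk1, nv1> <nk1, nv6> <nk3, nv3>
         • Other machine: <nk2, nv4> <nk2, nv5> <nk2, nv2>
       • Local Reduce
         • One machine: <nk1, 2> <nk3, 1>
         • Other machine: <nk2, 3>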
    22. 23. Hive Optimizations – Merge Sequential Map Reduce Jobs
       • SQL:
         FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT …
       (Diagram: A, B → Map Reduce → AB; AB, C → Map Reduce → ABC)

         A:
         key  av
         1    111

         B:
         key  bv
         1    222

         C:
         key  cv
         1    333

         AB:
         key  av   bv
         1    111  222

         ABC:
         key  av   bv   cv
         1    111  222  333
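Since both joins in the query above use the same key, the two sequential jobs in the diagram can be collapsed into one: all three tables are shuffled on that key once, and a single reduce step combines them. The sketch below illustrates the idea in memory; `join_in_one_job` is a made-up name and this is not Hive's planner code.

```python
from collections import defaultdict
from itertools import product

# Tables from the slide, each as (key, value) rows.
a = [(1, 111)]
b = [(1, 222)]
c = [(1, 333)]


def join_in_one_job(*tables):
    """Join any number of tables on a shared key with a single shuffle."""
    # Shuffle: group values by key, keeping one list per source table.
    groups = defaultdict(lambda: [[] for _ in tables])
    for tag, table in enumerate(tables):
        for key, value in table:
            groups[key][tag].append(value)

    # Reduce: for each key, combine one value from every table.
    joined = []
    for key in sorted(groups):
        for combo in product(*groups[key]):
            joined.append((key, *combo))
    return joined


abc = join_in_one_job(a, b, c)
```

If a key is missing from any table, its list stays empty and `product` yields nothing, so the key is dropped as an inner join requires. The intermediate table AB is never materialized.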
    23. 24. Hive Optimizations – Share Common Read Operations
       • Extended SQL:
         FROM pv_users
         INSERT INTO TABLE pv_pageid_sum
           SELECT pageid, count(1)
           GROUP BY pageid
         INSERT INTO TABLE pv_age_sum
           SELECT age, count(1)
           GROUP BY age;
       (Diagram: pv_users feeds two Map Reduce jobs, one per output table)

         pv_users:
         pageid  age
         1       25
         2       32

         pv_pageid_sum:
         pageid  count
         1       1
         2       1

         pv_age_sum:
         age  count
         25   1
         32   1
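The point of the extended syntax above is that pv_users is read once and feeds both aggregations. A minimal sketch of that shared scan, with the hypothetical function name `multi_insert` and plain Python counters standing in for the two reduce steps:

```python
from collections import Counter

# pv_users rows from the slide: (pageid, age).
pv_users = [(1, 25), (2, 32)]


def multi_insert(rows):
    """Compute both aggregates in one pass over the input."""
    pv_pageid_sum = Counter()
    pv_age_sum = Counter()
    for pageid, age in rows:  # the table is scanned exactly once
        pv_pageid_sum[pageid] += 1
        pv_age_sum[age] += 1
    return dict(pv_pageid_sum), dict(pv_age_sum)


pageid_sum, age_sum = multi_insert(pv_users)
```

Running the two queries separately would read pv_users twice; here the cost of the scan is paid once regardless of how many INSERT clauses share the FROM.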
    24. 25. Hive Optimizations – Load Balance Problem
       (Diagram: pv_users → Map-Reduce → pageid_age_partial_sum → Map-Reduce → pageid_age_sum)

         pv_users:
         pageid  age
         1       25
         1       25
         1       25
         2       32
         1       25

         pageid_age_partial_sum:
         pageid  age  count
         1       25   2
         2       32   1
         1       25   2

         pageid_age_sum:
         pageid  age  count
         1       25   4
         2       32   1
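When one group is much larger than the others, a single reducer ends up with most of the rows. The diagram above avoids this with two jobs: the first spreads rows across reducers regardless of key and produces partial counts, and the second adds the partial counts per key. The sketch below illustrates this under one assumption: round-robin assignment stands in for whatever distribution the first job really uses, so the partial counts can differ from the slide while the final result is the same.

```python
from collections import Counter

# pv_users rows from the slide: (pageid, age). The group (1, 25) is skewed.
pv_users = [(1, 25), (1, 25), (1, 25), (2, 32), (1, 25)]


def two_stage_group_by(rows, num_reducers=2):
    """Count rows per key using a partial-sum job and a final-sum job."""
    # Job 1: distribute rows evenly, ignoring the key, and count locally.
    partials = [Counter() for _ in range(num_reducers)]
    for index, row in enumerate(rows):
        partials[index % num_reducers][row] += 1

    # Job 2: group the much smaller partial results by key and add them up.
    final = Counter()
    for partial in partials:
        for key, count in partial.items():
            final[key] += count
    return partials, dict(final)


partial_sums, pageid_age_sum = two_stage_group_by(pv_users)
```

The second job still sends every partial count for a key to one reducer, but it receives at most one record per key from each first-stage reducer instead of every raw row.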
    25. 26. Future Work
       • Database management functionality
       • Optimizations using Combiners
       • Optimizations based on Table Statistics
         • (Learn more about that in CS 511)
       • Everybody can use it, including YOU
       • Everybody can work on it, including YOU
    26. 27. Questions? [email_address]
    27. 28. Credits
       • Data Management at Facebook, Jeff Hammerbacher
       • Hive – Data Warehousing & Analytics on Hadoop, Joydeep Sen Sarma, Ashish Thusoo
       • People
         • Suresh Anthony
         • Zheng Shao
         • Prasad Chakka
         • Pete Wyckoff
         • Namit Jain
         • Raghu Murthy
         • Joydeep Sen Sarma
         • Ashish Thusoo
    28. 30. Appendix Pages
    29. 31. Dealing with Structured Data
       • Type system
         • Primitive types
         • Recursively build up using Composition/Maps/Lists
       • Generic (De)Serialization Interface (SerDe)
         • To recursively list schema
         • To recursively access fields within a row object
       • Serialization families implement interface
         • Thrift DDL based SerDe
         • Delimited text based SerDe
         • You can write your own SerDe
       • Schema Evolution
    30. 32. MetaStore
       • Stores Table/Partition properties:
         • Table schema and SerDe library
         • Table Location on HDFS
         • Logical Partitioning keys and types
         • Other information
       • Thrift API
         • Current clients in PHP (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests)
       • Metadata can be stored as text files or even in a SQL backend
    31. 33. Hive CLI
       • DDL:
         • create table/drop table/rename table
         • alter table add column
       • Browsing:
         • show tables
         • describe table
         • cat table
       • Loading Data
       • Queries
    32. 34. Web UI for Hive
       • MetaStore UI:
         • Browse and navigate all tables in the system
         • Comment on each table and each column
         • Also captures data dependencies
       • HiPal:
         • Interactively construct SQL queries by mouse clicks
         • Support projection, filtering, group by and joining
         • Also support
    33. 35. Hive Query Language
       • Philosophy
         • SQL
         • Map-Reduce with custom scripts (hadoop streaming)
       • Query Operators
         • Projections
         • Equi-joins
         • Group by
         • Sampling
         • Order By
    34. 36. Hive QL – Custom Map/Reduce Scripts
       • Extended SQL:
         FROM (
           FROM pv_users
           SELECT TRANSFORM (pv_users.userid, pv_users.date)
           USING 'map_script' AS (dt, uid)
           CLUSTER BY dt) map
         INSERT INTO TABLE pv_users_reduced
           SELECT TRANSFORM (map.dt, map.uid)
           USING 'reduce_script' AS (date, count);
       • Map-Reduce: similar to hadoop streaming
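The slide does not show what 'map_script' and 'reduce_script' contain, so the two functions below are purely hypothetical stand-ins. They assume, as in Hadoop streaming, that rows arrive as tab-separated text lines and that output rows are written the same way. The mapper swaps the columns to (dt, uid), and the reducer counts rows per dt, relying on CLUSTER BY dt to deliver equal dates together.

```python
from itertools import groupby


def map_script(lines):
    """Hypothetical mapper: read 'userid<TAB>date' lines, emit 'dt<TAB>uid'."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"


def reduce_script(lines):
    """Hypothetical reducer: emit 'date<TAB>count' for each run of equal dt."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda row: row[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"


# Made-up sample input; sorting stands in for CLUSTER BY dt in this sketch.
sample = ["111\t2008-10-01", "222\t2008-10-02", "333\t2008-10-01"]
mapped = sorted(map_script(sample))
reduced = list(reduce_script(mapped))
```

As standalone scripts, each function would read from `sys.stdin` and print its results, which is how streaming-style scripts are normally wired into a job.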
