2008 Ur Tech Talk Zshao

Zheng Shao's talk about Hive at UIUC

Published in: Technology, Business

  • Transcript

    • 1.  
    • 2. Facebook and Open Source University of Illinois, Urbana-Champaign
      • Zheng Shao
        • Facebook Data Team
        • 10/3/2008
    • 3. Current Status of Facebook
      • #5 site on the Internet
      • 100+ million users
      • 65 billion monthly page views
      • > 2^32 photos
      • 400,000 developers around the world
      • Google/7 search queries
      • More hourly news stories than every other media source in history combined
    • 4. Architecture
      • 3 service tiers:
        • Web/Apache
        • Memcached
        • MySQL
      • Uses:
        • Linux
        • PHP
    • 5. Architecture (continued)
      • Backend Data Warehousing System:
        • [Diagram] Web Servers, Scribe Servers, Network Storage, Hive on Hadoop Cluster, Oracle RAC, MySQL
    • 6. Open Source Strategy
      • Facebook relies heavily on open source software.
      • Facebook makes big contributions to the open source community.
      • What’s the benefit of open source for you?
        • Everybody can use it, including YOU
        • Everybody can work on it, including YOU
    • 7. Open Source at Facebook
      • Thrift
      • Cassandra
      • Facebook Open Platform
      • Memcached enhancements
      • PHP enhancements
      • Hive on top of Hadoop
      • Many more at http://developers.facebook.com/opensource.php
    • 8. Hadoop and Hive: Distributed File System, Map-Reduce, and SQL
    • 9. Hadoop
      • Mimics Google File System and Map Reduce
      • Started by Yahoo in Feb 2006
      • Supported/used by 30+ companies/universities now, including Google: http://wiki.apache.org/hadoop/PoweredBy
    • 10. Why Hadoop?
      • We Want
        • Efficiency
        • Scalability
        • Redundancy
        • Load Balance
      • Google File System and Map Reduce are nice
        • But not open source
    • 11. Hive
      • A database/data warehouse on top of Hadoop
        • Rich data types (structs, lists and maps)
        • Efficient implementations of SQL filters, joins and group-by’s on top of map reduce
        • Easy interactions with different programming languages
    • 12. Why Hive?
      • We want
        • SQL support
        • Efficiency
        • Scalability
        • Redundancy
        • Easy to add new features
      • Alternatives:
        • Traditional database systems
        • Proprietary data warehousing systems
        • Sawzall / Pig
    • 13. Hive Architecture
      • [Diagram] Components shown:
        • Hive CLI: DDL, Queries, Browsing
        • Web UI; Mgmt, etc.
        • Thrift API
        • MetaStore
        • Hive QL: Parser, Planner, Execution
        • SerDe: Thrift, Jute, JSON
        • Map Reduce
        • HDFS
    • 14. (Simplified) Map Reduce Review
      • [Diagram] Key/value pairs on two machines at each stage (| separates the machines):
        • Input: <k1, v1> <k2, v2> <k3, v3> | <k4, v4> <k5, v5> <k6, v6>
        • Local Map: <nk1, nv1> <nk2, nv2> <nk3, nv3> | <nk2, nv4> <nk2, nv5> <nk1, nv6>
        • Global Shuffle: <nk1, nv1> <nk3, nv3> <nk1, nv6> | <nk2, nv4> <nk2, nv5> <nk2, nv2>
        • Local Sort: <nk1, nv1> <nk1, nv6> <nk3, nv3> | <nk2, nv4> <nk2, nv5> <nk2, nv2>
        • Local Reduce: <nk1, 2> <nk3, 1> | <nk2, 3>
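The four stages on slide 14 can be sketched as a single-process Python simulation. This is only an illustration, not Hadoop code: the helper names (`partition`, `run_map_reduce`) and the sample records are assumptions, and the reducer simply counts values per key, which is what the slide's final `<nk, count>` pairs suggest.

```python
import zlib
from itertools import groupby

def partition(key, num_reducers):
    """Deterministically pick the reducer that owns a key."""
    return zlib.crc32(repr(key).encode()) % num_reducers

def run_map_reduce(machines, map_fn, reduce_fn, num_reducers=2):
    # Local Map: each machine turns its own <k, v> pairs into <nk, nv> pairs.
    mapped = [[pair for k, v in records for pair in map_fn(k, v)]
              for records in machines]
    # Global Shuffle: every pair with the same nk lands on the same reducer.
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapped:
        for nk, nv in pairs:
            buckets[partition(nk, num_reducers)].append((nk, nv))
    # Local Sort, then Local Reduce over each run of equal keys.
    output = []
    for bucket in buckets:
        bucket.sort(key=lambda pair: pair[0])
        for nk, group in groupby(bucket, key=lambda pair: pair[0]):
            output.append(reduce_fn(nk, [nv for _, nv in group]))
    return output

# Hypothetical input: the value of each record names the key it maps to.
machines = [
    [("k1", "nk1"), ("k2", "nk2"), ("k3", "nk3")],
    [("k4", "nk2"), ("k5", "nk2"), ("k6", "nk1")],
]
counts = run_map_reduce(
    machines,
    map_fn=lambda k, v: [(v, 1)],
    reduce_fn=lambda nk, nvs: (nk, len(nvs)),
)
print(sorted(counts))
```

Which reducer ends up owning which key depends on the partition function, so only the merged result is meaningful here.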
    • 15. Hive QL – Join
      • SQL:
        • INSERT INTO TABLE pv_users
        • SELECT pv.pageid, u.age
        • FROM page_view pv JOIN user u ON (pv.userid = u.userid);
      • page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)
      • user (userid, age, gender): (111, 25, female), (222, 32, male)
      • pv_users = page_view X user (pageid, age): (1, 25), (2, 25), (1, 32)
    • 16. Hive QL – Join in Map Reduce
      • Inputs:
        • page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)
        • user (userid, age, gender): (111, 25, female), (222, 32, male)
      • Map: emit key = userid, with the value tagged by source table
        • page_view (value = <1, pageid>): 111 → <1, 1>, 111 → <1, 2>, 222 → <1, 1>
        • user (value = <2, age>): 111 → <2, 25>, 222 → <2, 32>
      • Shuffle Sort:
        • key 111: <1, 1>, <1, 2>, <2, 25>
        • key 222: <1, 1>, <2, 32>
      • Reduce, producing pv_users (pageid, age):
        • key 111: (1, 25), (2, 25)
        • key 222: (1, 32)
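The join plan on slide 16 can be sketched in Python as a reduce-side join. This is a minimal illustration using the slide's sample rows; the function names are hypothetical, and the tags 1 and 2 mark which table a value came from, as in the slide's `<1, pageid>` and `<2, age>` values.

```python
from collections import defaultdict

# Sample rows from the slide.
page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def map_page_view(row):
    pageid, userid, _time = row
    return (userid, (1, pageid))      # tag 1: value comes from page_view

def map_user(row):
    userid, age, _gender = row
    return (userid, (2, age))         # tag 2: value comes from user

def shuffle_sort(pairs):
    """Group values by join key and sort them so tag 1 precedes tag 2."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}

def reduce_join(values):
    """Pair every page_view value with every user value for one key."""
    pageids = [v for tag, v in values if tag == 1]
    ages = [v for tag, v in values if tag == 2]
    return [(pageid, age) for pageid in pageids for age in ages]

pairs = [map_page_view(r) for r in page_view] + [map_user(r) for r in user]
grouped = shuffle_sort(pairs)
pv_users = [row for values in grouped.values() for row in reduce_join(values)]
print(pv_users)
```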
    • 17. Hive QL – Group By
      • SQL:
        • INSERT INTO TABLE pageid_age_sum
        • SELECT pageid, age, count(1)
        • FROM pv_users
        • GROUP BY pageid, age;
      • pv_users (pageid, age): (1, 25), (2, 25), (1, 32), (2, 25)
      • pageid_age_sum (pageid, age, count): (1, 25, 1), (2, 25, 2), (1, 32, 1)
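    • 18. Hive QL – Group By in Map Reduce
      • Map: each pv_users row emits key = <pageid, age>, value = 1
        • Mapper 1 input (pageid, age): (1, 25), (2, 25); output: <1,25> → 1, <2,25> → 1
        • Mapper 2 input (pageid, age): (1, 32), (2, 25); output: <1,32> → 1, <2,25> → 1
      • Shuffle Sort:
        • Reducer 1: <1,25> → 1, <1,32> → 1
        • Reducer 2: <2,25> → 1, <2,25> → 1
      • Reduce, producing pageid_age_sum (pageid, age, count):
        • Reducer 1: (1, 25, 1), (1, 32, 1)
        • Reducer 2: (2, 25, 2)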
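The group-by plan on slide 18 reduces to "emit the grouping columns as the key, then sum per key". A short Python sketch under that reading, using the slide's two mapper inputs (the variable and function names are assumptions, and the shuffle is modeled as a plain dictionary rather than separate reducers):

```python
from collections import defaultdict

# Rows of pv_users (pageid, age) as split across the two mappers on the slide.
mapper_inputs = [
    [(1, 25), (2, 25)],
    [(1, 32), (2, 25)],
]

def map_row(row):
    pageid, age = row
    return ((pageid, age), 1)         # key = grouping columns, value = 1

# Shuffle: collect the values for each composite key.
shuffled = defaultdict(list)
for rows in mapper_inputs:
    for row in rows:
        key, value = map_row(row)
        shuffled[key].append(value)

# Reduce: sum the values per key to get count(1).
pageid_age_sum = [
    (pageid, age, sum(values))
    for (pageid, age), values in sorted(shuffled.items())
]
print(pageid_age_sum)
```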
    • 19. Hive QL – Group By with Distinct
      • SQL:
        • SELECT pageid, COUNT(DISTINCT userid)
        • FROM page_view GROUP BY pageid
      • page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14), (2, 111, 9:08:20)
      • result (pageid, count_distinct_userid): (1, 2), (2, 1)
    • 20. Hive QL – Group By with Distinct in Map Reduce
      • Send your thoughts to zshao@facebook.com
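Slide 20 leaves the plan as an exercise. One possible answer, sketched in Python, is to use (pageid, userid) as a composite key, partition on pageid only, and let the sort bring duplicate userids together so the reducer can count changes. This is an assumption for illustration; the slides do not say how Hive actually plans this query, and all names below are hypothetical.

```python
from itertools import groupby

# Sample rows from slide 19: (pageid, userid, time).
page_view = [
    (1, 111, "9:08:01"), (2, 111, "9:08:13"),
    (1, 222, "9:08:14"), (2, 111, "9:08:20"),
]

def map_row(row):
    pageid, userid, _time = row
    return (pageid, userid)           # composite key, no value needed

def partition(key, num_reducers):
    pageid, _userid = key
    return pageid % num_reducers      # all rows of a pageid share a reducer

def reduce_sorted(keys):
    """keys: sorted (pageid, userid) pairs -> (pageid, distinct userids)."""
    result = []
    for pageid, group in groupby(keys, key=lambda k: k[0]):
        distinct, previous = 0, None
        for _, userid in group:
            if userid != previous:    # sorted input: a change means a new user
                distinct += 1
                previous = userid
        result.append((pageid, distinct))
    return result

num_reducers = 2
buckets = [[] for _ in range(num_reducers)]
for row in page_view:
    key = map_row(row)
    buckets[partition(key, num_reducers)].append(key)

result = sorted(r for bucket in buckets for r in reduce_sorted(sorted(bucket)))
print(result)
```

Because the reducer only compares each userid with the previous one, it needs constant memory per pageid instead of a set of all users seen.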
    • 21. Hive Optimizations Efficient execution of SQL on Map Reduce
    • 22. (Simplified) Map Reduce Revisit
      • [Diagram repeated from slide 14: input pairs <k1, v1> … <k6, v6> on two machines pass through Local Map, Global Shuffle, Local Sort and Local Reduce, ending in <nk1, 2>, <nk2, 3>, <nk3, 1>]
    • 23. Hive Optimizations – Merge Sequential Map Reduce Jobs
      • SQL:
        • FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT …
      • A (key, av): (1, 111)
      • B (key, bv): (1, 222)
      • C (key, cv): (1, 333)
      • [Diagram] A, B → Map Reduce → AB (key, av, bv): (1, 111, 222)
      • [Diagram] AB, C → Map Reduce → ABC (key, av, bv, cv): (1, 111, 222, 333)
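Because both joins on slide 23 use the same key, the two jobs can in principle be collapsed into one: tag each row with its source table, shuffle once on the key, and combine all three sides in a single reduce. The Python sketch below illustrates that idea with the slide's sample rows; the function names are hypothetical, and the slide does not spell out the exact plan Hive generates.

```python
from collections import defaultdict
from itertools import product

# Sample rows from the slide: (key, value).
a = [(1, 111)]
b = [(1, 222)]
c = [(1, 333)]

def tagged(table, tag):
    """Map step: emit (key, (source tag, value)) for every row."""
    return [(key, (tag, value)) for key, value in table]

def join_three(a, b, c):
    # Single shuffle on the shared join key.
    groups = defaultdict(lambda: {"a": [], "b": [], "c": []})
    for key, (tag, value) in tagged(a, "a") + tagged(b, "b") + tagged(c, "c"):
        groups[key][tag].append(value)
    # Single reduce: cross the three sides for each key.
    rows = []
    for key in sorted(groups):
        sides = groups[key]
        for av, bv, cv in product(sides["a"], sides["b"], sides["c"]):
            rows.append((key, av, bv, cv))
    return rows

abc = join_three(a, b, c)
print(abc)
```

A key missing from any one table yields no output row, which matches inner-join semantics.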
    • 24. Hive Optimizations – Share Common Read Operations
      • Extended SQL
          • FROM pv_users
          • INSERT INTO TABLE pv_pageid_sum
            • SELECT pageid, count(1)
            • GROUP BY pageid
          • INSERT INTO TABLE pv_age_sum
            • SELECT age, count(1)
            • GROUP BY age;
      • pv_users (pageid, age): (1, 25), (2, 32)
      • [Diagram] pv_users → Map Reduce → pv_pageid_sum (pageid, count): (1, 1), (2, 1)
      • [Diagram] pv_users → Map Reduce → pv_age_sum (age, count): (25, 1), (32, 1)
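The point of the multi-insert syntax on slide 24 is that pv_users is read once while feeding both aggregations. A minimal Python sketch of that idea, using the slide's sample rows (the function name `multi_insert` is an assumption, and the two counters stand in for the two output tables):

```python
from collections import Counter

# Sample rows from the slide: (pageid, age).
pv_users = [(1, 25), (2, 32)]

def multi_insert(rows):
    """Scan the input once and update both aggregates per row."""
    pv_pageid_sum, pv_age_sum = Counter(), Counter()
    for pageid, age in rows:          # single pass over pv_users
        pv_pageid_sum[pageid] += 1    # SELECT pageid, count(1) GROUP BY pageid
        pv_age_sum[age] += 1          # SELECT age, count(1) GROUP BY age
    return dict(pv_pageid_sum), dict(pv_age_sum)

pageid_counts, age_counts = multi_insert(pv_users)
print(pageid_counts, age_counts)
```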
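    • 25. Hive Optimizations – Load Balance Problem
      • pv_users (pageid, age): (1, 25), (1, 25), (1, 25), (2, 32), (1, 25)
      • [Diagram] pv_users → Map-Reduce → pageid_age_partial_sum (pageid, age, count): (1, 25, 2), (2, 32, 1), (1, 25, 2)
      • [Diagram] pageid_age_partial_sum → Map-Reduce → pageid_age_sum (pageid, age, count): (1, 25, 4), (2, 32, 1)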
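Slide 25 shows a skewed key (pageid 1, age 25) being handled in two jobs: partial sums first, then a final sum. A hedged Python sketch of that two-stage idea follows. The round-robin spreading in the first job is an assumption, so the partial sums it produces need not match the ones on the slide; only the final result is expected to agree.

```python
from collections import Counter

# Sample rows from the slide: (pageid, age); one key dominates.
pv_users = [(1, 25), (1, 25), (1, 25), (2, 32), (1, 25)]

def partial_sums(rows, num_reducers=2):
    """Job 1: spread rows round-robin, ignoring the key, and pre-aggregate."""
    reducers = [Counter() for _ in range(num_reducers)]
    for i, key in enumerate(rows):
        reducers[i % num_reducers][key] += 1
    return [(pageid, age, n)
            for reducer in reducers
            for (pageid, age), n in reducer.items()]

def final_sums(partials):
    """Job 2: group the (much smaller) partial sums by key and add them up."""
    totals = Counter()
    for pageid, age, n in partials:
        totals[(pageid, age)] += n
    return sorted((pageid, age, n) for (pageid, age), n in totals.items())

pageid_age_partial_sum = partial_sums(pv_users)
pageid_age_sum = final_sums(pageid_age_partial_sum)
print(pageid_age_sum)
```

The second job still sends every partial sum for a hot key to one reducer, but it receives at most one row per first-stage reducer instead of every raw row.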
    • 26. Future Work
      • Database Management Functionalities
      • Optimizations using Combiners
      • Optimizations based on Table Statistics
        • (Learn more about that in CS 511)
      • Everybody can use it, including YOU
      • Everybody can work on it, including YOU
    • 27. Questions? [email_address]
    • 28. Credits
      • Data Management at Facebook, Jeff Hammerbacher
      • Hive – Data Warehousing & Analytics on Hadoop, Joydeep Sen Sarma, Ashish Thusoo
      • People:
        • Suresh Anthony
        • Zheng Shao
        • Prasad Chakka
        • Pete Wyckoff
        • Namit Jain
        • Raghu Murthy
        • Joydeep Sen Sarma
        • Ashish Thusoo
    • 29.  
    • 30. Appendix Pages
    • 31. Dealing with Structured Data
      • Type system
        • Primitive types
        • Recursively build up using Composition/Maps/Lists
      • Generic (De)Serialization Interface (SerDe)
        • To recursively list schema
        • To recursively access fields within a row object
      • Serialization families implement interface
        • Thrift DDL based SerDe
        • Delimited text based SerDe
        • You can write your own SerDe
      • Schema Evolution
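To make the SerDe idea on slide 31 concrete, here is a small Python sketch of a delimited-text SerDe. It is not Hive's actual interface: the class name, method names and constructor arguments are hypothetical, and it handles only flat rows of primitive types, leaving out the recursive composition of maps and lists that the slide mentions.

```python
class DelimitedTextSerDe:
    """Hypothetical SerDe: converts delimited text lines to row objects and back."""

    def __init__(self, columns, delimiter="\t"):
        self.columns = columns        # list of (column name, Python type)
        self.delimiter = delimiter

    def schema(self):
        """List the schema as (column name, type name) pairs."""
        return [(name, typ.__name__) for name, typ in self.columns]

    def deserialize(self, line):
        """Parse one text line into a row keyed by column name."""
        fields = line.rstrip("\n").split(self.delimiter)
        return {name: typ(raw) for (name, typ), raw in zip(self.columns, fields)}

    def serialize(self, row):
        """Write a row back out as one delimited line (no trailing newline)."""
        return self.delimiter.join(str(row[name]) for name, _ in self.columns)

serde = DelimitedTextSerDe([("pageid", int), ("userid", int), ("time", str)])
row = serde.deserialize("1\t111\t9:08:01\n")
line = serde.serialize(row)
print(serde.schema(), row, line)
```

Keeping the schema in the SerDe rather than in the query engine is what lets the same engine read Thrift, delimited text, or a user-written format.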
    • 32. MetaStore
      • Stores Table/Partition properties:
        • Table schema and SerDe library
        • Table Location on HDFS
        • Logical Partitioning keys and types
        • Other information
      • Thrift API
        • Current clients in PHP (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests)
      • Metadata can be stored as text files or even in a SQL backend
    • 33. Hive CLI
      • DDL:
        • create table/drop table/rename table
        • alter table add column
      • Browsing:
        • show tables
        • describe table
        • cat table
      • Loading Data
      • Queries
    • 34. Web UI for Hive
      • MetaStore UI:
        • Browse and navigate all tables in the system
        • Comment on each table and each column
        • Also captures data dependencies
      • HiPal:
        • Interactively construct SQL queries by mouse clicks
        • Support projection, filtering, group by and joining
        • Also support
    • 35. Hive Query Language
      • Philosophy
        • SQL
        • Map-Reduce with custom scripts (hadoop streaming)
      • Query Operators
        • Projections
        • Equi-joins
        • Group by
        • Sampling
        • Order By
    • 36. Hive QL – Custom Map/Reduce Scripts
      • Extended SQL:
          • FROM (
            • FROM pv_users
            • SELECT TRANSFORM (pv_users.userid, pv_users.date)
            • USING 'map_script' AS (dt, uid)
            • CLUSTER BY dt) map
          • INSERT INTO TABLE pv_users_reduced
            • SELECT TRANSFORM (map.dt, map.uid)
            • USING 'reduce_script' AS (date, count);
      • Map-Reduce: similar to hadoop streaming
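The slide does not show what 'map_script' and 'reduce_script' contain. As an illustration only, the Python sketch below assumes they are streaming-style programs that exchange tab-separated lines, that the map script swaps the columns to (dt, uid), and that the reduce script counts rows per date. Sorting the mapped lines stands in for CLUSTER BY dt.

```python
def map_script(lines):
    """Input lines: userid<TAB>date. Output lines: dt<TAB>uid."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_script(lines):
    """Input lines: dt<TAB>uid, clustered by dt. Output lines: date<TAB>count."""
    current, count = None, 0
    for line in lines:
        dt, _uid = line.rstrip("\n").split("\t")
        if dt != current:
            if current is not None:
                yield f"{current}\t{count}"   # finished the previous date
            current, count = dt, 0
        count += 1
    if current is not None:
        yield f"{current}\t{count}"           # flush the last date

# Hypothetical pv_users rows: userid<TAB>date.
rows = ["111\t2008-10-01\n", "222\t2008-10-01\n", "111\t2008-10-02\n"]
mapped = sorted(map_script(rows))             # stands in for CLUSTER BY dt
reduced = list(reduce_script(mapped))
print(reduced)
```

In a real deployment each function would live in its own executable reading `sys.stdin` and writing `sys.stdout`; they are kept as generators here so the example runs in one process.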