Hadoop and Hive


    Hadoop and Hive - Presentation Transcript

    1.  
    2. Facebook and Open Source
      • University of Illinois, Urbana-Champaign
      • Zheng Shao
        • Facebook Data Team
        • 10/3/2008
    3. Current Status of Facebook
      • #5 site on the Internet
      • 100+ million users
      • 65 billion monthly page views
      • > 2^32 photos
      • 400,000 developers around the world
      • Google/7 search queries
      • More hourly news stories than every other media source in history combined
    4. Architecture
      • 3 service tiers:
        • Web/Apache
        • Memcached
        • MySQL
      • Uses:
        • Linux
        • PHP
    5. Architecture (continued)
      • Backend Data Warehousing System:
        • (Diagram) Components: Web Servers, Scribe Servers, Network Storage, Hive on Hadoop Cluster, Oracle RAC, MySQL
    6. Open Source Strategy
      • Facebook relies heavily on open source software.
      • Facebook makes big contributions to the open source community.
      • What’s the benefit of open source for you?
        • Everybody can use it, including YOU
        • Everybody can work on it, including YOU
    7. Open Source at Facebook
      • Thrift
      • Cassandra
      • Facebook Open Platform
      • MemCacheD enhancements
      • PHP enhancements
      • Hive on top of Hadoop
      • Many more at http://developers.facebook.com/opensource.php
    8. Hadoop and Hive: Distributed File System, Map-Reduce, and SQL
    9. Hadoop
      • Mimics Google File System and Map Reduce
      • Started by Yahoo in Feb 2006
      • Supported/used by 30+ companies/universities now, including Google: http://wiki.apache.org/hadoop/PoweredBy
    10. Why Hadoop?
      • We Want
        • Efficiency
        • Scalability
        • Redundancy
        • Load Balance
      • Google File System and Map Reduce are nice
        • But not open source
    11. Hive
      • A database/data warehouse on top of Hadoop
        • Rich data types (structs, lists and maps)
        • Efficient implementations of SQL filters, joins and group-bys on top of map reduce
        • Easy interactions with different programming languages
    12. Why Hive?
      • We want
        • SQL support
        • Efficiency
        • Scalability
        • Redundancy
        • Easy to add new features
      • Alternatives:
        • Traditional database systems
        • Proprietary data warehousing systems
        • Sawzall / Pig
    13. Hive Architecture
      • (Diagram) Components: HDFS, Map Reduce, Hive CLI (DDL, Queries, Browsing), SerDe (Thrift, Jute, JSON), Thrift API, MetaStore, Web UI, Mgmt etc., Hive QL (Parser, Planner, Execution)
    14. (Simplified) Map Reduce Review
      • Input:
        • Machine 1: <k1, v1> <k2, v2> <k3, v3>
        • Machine 2: <k4, v4> <k5, v5> <k6, v6>
      • Local Map:
        • Machine 1: <nk1, nv1> <nk2, nv2> <nk3, nv3>
        • Machine 2: <nk2, nv4> <nk2, nv5> <nk1, nv6>
      • Global Shuffle:
        • Machine 1: <nk1, nv1> <nk3, nv3> <nk1, nv6>
        • Machine 2: <nk2, nv4> <nk2, nv5> <nk2, nv2>
      • Local Sort:
        • Machine 1: <nk1, nv1> <nk1, nv6> <nk3, nv3>
        • Machine 2: <nk2, nv4> <nk2, nv5> <nk2, nv2>
      • Local Reduce:
        • Machine 1: <nk1, 2> <nk3, 1>
        • Machine 2: <nk2, 3>
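The four stages on this slide can be simulated in a few lines of Python. This is only a sketch under stated assumptions: the `run_map_reduce` helper, the CRC-based partitioner and the lookup table that assigns new keys are all hypothetical stand-ins, not Hadoop APIs, and the reduce step is assumed to count values per key, as the slide's output suggests.

```python
import itertools
import zlib


def partition(key, num_reducers):
    # Hypothetical partitioner: a deterministic hash of the key picks the reducer.
    return zlib.crc32(repr(key).encode()) % num_reducers


def run_map_reduce(machines, map_fn, reduce_fn, num_reducers=2):
    # Local Map: every machine transforms its own records independently.
    mapped = [[map_fn(k, v) for k, v in records] for records in machines]

    # Global Shuffle: all pairs that share a new key go to the same reducer.
    reducers = [[] for _ in range(num_reducers)]
    for output in mapped:
        for nk, nv in output:
            reducers[partition(nk, num_reducers)].append((nk, nv))

    results = {}
    for pairs in reducers:
        # Local Sort: bring equal keys next to each other.
        pairs.sort(key=lambda pair: pair[0])
        # Local Reduce: fold each run of equal keys into one output value.
        for nk, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
            results[nk] = reduce_fn(nk, [nv for _, nv in group])
    return results


# Input laid out as on the slide: three records per machine.
machines = [
    [("k1", "v1"), ("k2", "v2"), ("k3", "v3")],
    [("k4", "v4"), ("k5", "v5"), ("k6", "v6")],
]
# Assumed mapping from input keys to new keys, chosen to match the slide.
new_key = {"k1": "nk1", "k2": "nk2", "k3": "nk3",
           "k4": "nk2", "k5": "nk2", "k6": "nk1"}

counts = run_map_reduce(
    machines,
    map_fn=lambda k, v: (new_key[k], "n" + v),
    reduce_fn=lambda nk, values: len(values),
)
```

Which reducer ends up with which key depends on the partitioner, so the machine assignment may differ from the slide, but the per-key counts do not.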
    15. Hive QL – Join
      • SQL:
        • INSERT INTO TABLE pv_users
        • SELECT pv.pageid, u.age
        • FROM page_view pv JOIN user u ON (pv.userid = u.userid);
      • page_view X user = pv_users
      • page_view (pageid, userid, time):
        • 1, 111, 9:08:01
        • 2, 111, 9:08:13
        • 1, 222, 9:08:14
      • user (userid, age, gender):
        • 111, 25, female
        • 222, 32, male
      • pv_users (pageid, age):
        • 1, 25
        • 2, 25
        • 1, 32
    16. Hive QL – Join in Map Reduce
      • Input page_view (pageid, userid, time):
        • 1, 111, 9:08:01
        • 2, 111, 9:08:13
        • 1, 222, 9:08:14
      • Input user (userid, age, gender):
        • 111, 25, female
        • 222, 32, male
      • Map output from page_view (key, value):
        • 111, <1, 1>
        • 111, <1, 2>
        • 222, <1, 1>
      • Map output from user (key, value):
        • 111, <2, 25>
        • 222, <2, 32>
      • Shuffle Sort (key, value):
        • 111, <1, 1>
        • 111, <1, 2>
        • 111, <2, 25>
        • 222, <1, 1>
        • 222, <2, 32>
      • Reduce output pv_users (pageid, age):
        • 1, 25
        • 2, 25
        • 1, 32
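A reduce-side join like the one on this slide might be sketched as follows. The reading of the value pairs (first element is a table tag, 1 for page_view and 2 for user; second element is the pageid or the age) is inferred from the slide's data, and the function names are illustrative rather than Hive internals.

```python
from collections import defaultdict

# Tables from the slide.
page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]  # pageid, userid, time
user = [(111, 25, "female"), (222, 32, "male")]  # userid, age, gender


def map_phase():
    # Key both tables by the join column (userid) and tag each value
    # with its source table: 1 = page_view, 2 = user.
    for pageid, userid, _time in page_view:
        yield userid, (1, pageid)
    for userid, age, _gender in user:
        yield userid, (2, age)


def shuffle_sort(pairs):
    # Group values by key, then sort keys and values as the slide shows.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}


def reduce_phase(groups):
    # For each userid, pair every page view with every matching user row.
    for _userid, values in groups.items():
        pageids = [v for tag, v in values if tag == 1]
        ages = [v for tag, v in values if tag == 2]
        for age in ages:
            for pageid in pageids:
                yield pageid, age


pv_users = list(reduce_phase(shuffle_sort(map_phase())))
```

A userid that appears in only one table produces an empty `pageids` or `ages` list and therefore no output rows, which gives inner-join behaviour.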
    17. Hive QL – Group By
      • SQL:
            • INSERT INTO TABLE pageid_age_sum
            • SELECT pageid, age, count(1)
            • FROM pv_users
            • GROUP BY pageid, age;
      • pv_users (pageid, age):
        • 1, 25
        • 2, 25
        • 1, 32
        • 2, 25
      • pageid_age_sum (pageid, age, count):
        • 1, 25, 1
        • 2, 25, 2
        • 1, 32, 1
    18. Hive QL – Group By in Map Reduce
      • Input pv_users (pageid, age):
        • Mapper 1: 1, 25 and 2, 25
        • Mapper 2: 1, 32 and 2, 25
      • Map output (key, value):
        • Mapper 1: <1,25>, 1 and <2,25>, 1
        • Mapper 2: <1,32>, 1 and <2,25>, 1
      • Shuffle Sort (key, value):
        • Reducer 1: <1,25>, 1 and <1,32>, 1
        • Reducer 2: <2,25>, 1 and <2,25>, 1
      • Reduce output pageid_age_sum (pageid, age, count):
        • Reducer 1: 1, 25, 1 and 1, 32, 1
        • Reducer 2: 2, 25, 2
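The group-by flow can be sketched in Python as below. The split of rows across two mappers follows the slide; routing keys to reducers by pageid is an assumption picked so the reducers match the slide, and the function names are illustrative.

```python
from collections import defaultdict

# (pageid, age) rows, split across two mappers as on the slide.
mapper_inputs = [[(1, 25), (2, 25)], [(1, 32), (2, 25)]]


def map_rows(rows):
    # Emit the group-by columns as the key and 1 as the value.
    return [((pageid, age), 1) for pageid, age in rows]


def shuffle(mapped_outputs, num_reducers=2):
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapped_outputs:
        for key, value in output:
            # Assumed partitioner: route by pageid so equal keys meet on one reducer.
            reducers[(key[0] - 1) % num_reducers][key].append(value)
    return reducers


def reduce_groups(groups):
    # Sum the 1s for each (pageid, age) key, in sorted key order.
    return [(pageid, age, sum(values))
            for (pageid, age), values in sorted(groups.items())]


shuffled = shuffle([map_rows(rows) for rows in mapper_inputs])
pageid_age_sum = [row for groups in shuffled for row in reduce_groups(groups)]
```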
    19. Hive QL – Group By with Distinct
      • SQL
        • SELECT pageid, COUNT(DISTINCT userid)
        • FROM page_view GROUP BY pageid
      • page_view (pageid, userid, time):
        • 1, 111, 9:08:01
        • 2, 111, 9:08:13
        • 1, 222, 9:08:14
        • 2, 111, 9:08:20
      • result (pageid, count_distinct_userid):
        • 1, 2
        • 2, 1
    20. Hive QL – Group By with Distinct in Map Reduce
      • Send your thoughts to zshao@facebook.com
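The slide leaves this as an exercise. One possible approach, which is not taken from the talk, is to shuffle on the composite key (pageid, userid) so that each reducer sees a page's users in sorted order and only has to count the points where the userid changes. The sketch below simulates that on the page_view data from the previous slide; all names are illustrative.

```python
from itertools import groupby

# page_view rows from the previous slide: pageid, userid, time.
page_view = [
    (1, 111, "9:08:01"),
    (2, 111, "9:08:13"),
    (1, 222, "9:08:14"),
    (2, 111, "9:08:20"),
]


def map_phase(rows):
    # Emit the composite key (pageid, userid); no value is needed.
    return [(pageid, userid) for pageid, userid, _time in rows]


def reduce_phase(sorted_keys):
    # Keys arrive sorted, so repeated userids within a page are adjacent
    # and can be skipped without keeping a set in memory.
    result = []
    for pageid, group in groupby(sorted_keys, key=lambda key: key[0]):
        distinct = 0
        previous = None
        for _pageid, userid in group:
            if userid != previous:
                distinct += 1
                previous = userid
        result.append((pageid, distinct))
    return result


# sorted() stands in for the shuffle and sort; a real job would partition
# by pageid only, so that one page's keys all reach the same reducer.
result = reduce_phase(sorted(map_phase(page_view)))
```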
    21. Hive Optimizations: Efficient execution of SQL on Map Reduce
    22. (Simplified) Map Reduce Revisit
      • Input:
        • Machine 1: <k1, v1> <k2, v2> <k3, v3>
        • Machine 2: <k4, v4> <k5, v5> <k6, v6>
      • Local Map:
        • Machine 1: <nk1, nv1> <nk2, nv2> <nk3, nv3>
        • Machine 2: <nk2, nv4> <nk2, nv5> <nk1, nv6>
      • Global Shuffle:
        • Machine 1: <nk1, nv1> <nk3, nv3> <nk1, nv6>
        • Machine 2: <nk2, nv4> <nk2, nv5> <nk2, nv2>
      • Local Sort:
        • Machine 1: <nk1, nv1> <nk1, nv6> <nk3, nv3>
        • Machine 2: <nk2, nv4> <nk2, nv5> <nk2, nv2>
      • Local Reduce:
        • Machine 1: <nk1, 2> <nk3, 1>
        • Machine 2: <nk2, 3>
    23. Hive Optimizations – Merge Sequential Map Reduce Jobs
      • SQL:
        • FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT …
      • Diagram: A and B → Map Reduce → AB; AB and C → Map Reduce → ABC
        • A (key, av): 1, 111
        • B (key, bv): 1, 222
        • C (key, cv): 1, 333
        • AB (key, av, bv): 1, 111, 222
        • ABC (key, av, bv, cv): 1, 111, 222, 333
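Because both joins use the same key, the two sequential jobs can in principle be folded into a single shuffle. A minimal sketch of that idea, assuming each table is a list of (key, value) rows and using a hypothetical `join_on_shared_key` helper (not Hive's planner code):

```python
from collections import defaultdict
from itertools import product

# Tables from the slide, as (key, value) rows.
a = [(1, 111)]
b = [(1, 222)]
c = [(1, 333)]


def join_on_shared_key(*tables):
    # One shuffle: group rows from every table by the shared join key,
    # remembering which table each value came from.
    groups = defaultdict(lambda: [[] for _ in tables])
    for tag, table in enumerate(tables):
        for key, value in table:
            groups[key][tag].append(value)

    # One reduce: emit every combination of matching rows per key.
    joined = []
    for key, per_table in sorted(groups.items()):
        for combo in product(*per_table):
            joined.append((key, *combo))
    return joined


abc = join_on_shared_key(a, b, c)
```

A key that is missing from any one table yields an empty list for that table, so `product` produces nothing and the row is dropped, as in an inner join.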
    24. Hive Optimizations – Share Common Read Operations
      • Extended SQL
          • FROM pv_users
          • INSERT INTO TABLE pv_pageid_sum
            • SELECT pageid, count(1)
            • GROUP BY pageid
          • INSERT INTO TABLE pv_age_sum
            • SELECT age, count(1)
            • GROUP BY age;
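      • Diagram: pv_users → Map Reduce → pv_pageid_sum; pv_users → Map Reduce → pv_age_sum
        • pv_users (pageid, age): 1, 25 and 2, 32
        • pv_pageid_sum (pageid, count): 1, 1 and 2, 1
        • pv_age_sum (age, count): 25, 1 and 32, 1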
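The benefit of the multi-insert form is that pv_users is read once while feeding both aggregations. The sketch below shows only that shared scan, using in-memory counters instead of real Map Reduce jobs; the `multi_insert` name is illustrative.

```python
from collections import Counter

# pv_users rows from the slide: pageid, age.
pv_users = [(1, 25), (2, 32)]


def multi_insert(rows):
    pageid_counts = Counter()
    age_counts = Counter()
    # Single pass over the input: each row updates both aggregations.
    for pageid, age in rows:
        pageid_counts[pageid] += 1
        age_counts[age] += 1
    return sorted(pageid_counts.items()), sorted(age_counts.items())


pv_pageid_sum, pv_age_sum = multi_insert(pv_users)
```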
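    25. Hive Optimizations – Load Balance Problem
      • Diagram: pv_users → Map-Reduce → pageid_age_partial_sum → Map-Reduce → pageid_age_sum
      • pv_users (pageid, age):
        • 1, 25
        • 1, 25
        • 1, 25
        • 2, 32
        • 1, 25
      • pageid_age_partial_sum (pageid, age, count):
        • 1, 25, 2
        • 2, 32, 1
        • 1, 25, 2
      • pageid_age_sum (pageid, age, count):
        • 1, 25, 4
        • 2, 32, 1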
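The two-job plan on this slide can be sketched as follows. The first job spreads rows across reducers without regard to the group key, so a frequent key such as (1, 25) cannot overload a single reducer; the second job adds up the partial counts. Round-robin distribution is an assumption made for this sketch, so the partial rows it produces may be split differently from the slide's, while the final sums agree.

```python
from collections import Counter

# pv_users rows from the slide: pageid, age. The key (1, 25) is skewed.
pv_users = [(1, 25), (1, 25), (1, 25), (2, 32), (1, 25)]


def partial_sums(rows, num_reducers=2):
    # Job 1: distribute rows round-robin (ignoring the key) and count locally.
    partials = [Counter() for _ in range(num_reducers)]
    for index, row in enumerate(rows):
        partials[index % num_reducers][row] += 1
    return [(pageid, age, count)
            for counts in partials
            for (pageid, age), count in sorted(counts.items())]


def final_sums(partial_rows):
    # Job 2: group by (pageid, age) and add up the partial counts.
    totals = Counter()
    for pageid, age, count in partial_rows:
        totals[(pageid, age)] += count
    return [(pageid, age, count)
            for (pageid, age), count in sorted(totals.items())]


pageid_age_partial_sum = partial_sums(pv_users)
pageid_age_sum = final_sums(pageid_age_partial_sum)
```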
    26. Future Work
      • Database Management Functionalities
      • Optimizations using Combiners
      • Optimizations based on Table Statistics
        • (Learn more about that in CS 511)
      • Everybody can use it, including YOU
      • Everybody can work on it, including YOU
    27. Questions? [email_address]
    28. Credits
      • Data Management at Facebook, Jeff Hammerbacher
      • Hive – Data Warehousing & Analytics on Hadoop, Joydeep Sen Sarma, Ashish Thusoo
      • People:
        • Suresh Anthony
        • Zheng Shao
        • Prasad Chakka
        • Pete Wyckoff
        • Namit Jain
        • Raghu Murthy
        • Joydeep Sen Sarma
        • Ashish Thusoo
    29.  
    30. Appendix Pages
    31. Dealing with Structured Data
      • Type system
        • Primitive types
        • Recursively build up using Composition/Maps/Lists
      • Generic (De)Serialization Interface (SerDe)
        • To recursively list schema
        • To recursively access fields within a row object
      • Serialization families implement interface
        • Thrift DDL based SerDe
        • Delimited text based SerDe
        • You can write your own SerDe
      • Schema Evolution
    32. MetaStore
      • Stores Table/Partition properties:
        • Table schema and SerDe library
        • Table Location on HDFS
        • Logical Partitioning keys and types
        • Other information
      • Thrift API
        • Current clients in PHP (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests)
      • Metadata can be stored as text files or even in a SQL backend
    33. Hive CLI
      • DDL:
        • create table/drop table/rename table
        • alter table add column
      • Browsing:
        • show tables
        • describe table
        • cat table
      • Loading Data
      • Queries
    34. Web UI for Hive
      • MetaStore UI:
        • Browse and navigate all tables in the system
        • Comment on each table and each column
        • Also captures data dependencies
      • HiPal:
        • Interactively construct SQL queries by mouse clicks
        • Support projection, filtering, group by and joining
        • Also support
    35. Hive Query Language
      • Philosophy
        • SQL
        • Map-Reduce with custom scripts (hadoop streaming)
      • Query Operators
        • Projections
        • Equi-joins
        • Group by
        • Sampling
        • Order By
    36. Hive QL – Custom Map/Reduce Scripts
      • Extended SQL:
          • FROM (
            • FROM pv_users
            • SELECT TRANSFORM (pv_users.userid, pv_users.date)
            • USING 'map_script' AS (dt, uid)
            • CLUSTER BY dt) map
          • INSERT INTO TABLE pv_users_reduced
            • SELECT TRANSFORM (map.dt, map.uid)
            • USING 'reduce_script' AS (date, count);
      • Map-Reduce: similar to hadoop streaming
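As with hadoop streaming, a TRANSFORM script reads rows as tab-separated lines on standard input and writes tab-separated lines to standard output. The slide does not show what 'map_script' and 'reduce_script' contain, so the two functions below are hypothetical: the map step is assumed to reorder (userid, date) into (dt, uid), and the reduce step is assumed to count rows per date. The sample dates are made up for the demonstration.

```python
import io
from itertools import groupby


def map_script(stdin, stdout):
    # Input columns: userid, date. Output columns: dt, uid.
    for line in stdin:
        userid, date = line.rstrip("\n").split("\t")
        stdout.write(f"{date}\t{userid}\n")


def reduce_script(stdin, stdout):
    # Input is clustered by dt, so rows for one date are adjacent.
    rows = (line.rstrip("\n").split("\t") for line in stdin)
    for dt, group in groupby(rows, key=lambda row: row[0]):
        stdout.write(f"{dt}\t{sum(1 for _ in group)}\n")


# Local dry run with in-memory streams standing in for stdin/stdout.
raw = "111\t2008-10-01\n222\t2008-10-01\n111\t2008-10-02\n"
mapped = io.StringIO()
map_script(io.StringIO(raw), mapped)

# Sorting the mapped lines stands in for CLUSTER BY dt.
clustered = io.StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = io.StringIO()
reduce_script(clustered, reduced)
```

In an actual script each function would be called with `sys.stdin` and `sys.stdout` instead of the in-memory streams.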

    + zshaozshao, 2 years ago
