Hive Percona 2009

Talk given at the Percona 2009 conference, held alongside the MySQL conference in Santa Clara, USA.

Hive Percona 2009: Presentation Transcript

  • Data Warehousing & Analytics on Hadoop
    • Ashish Thusoo, Prasad Chakka
    • Facebook Data Team
  • Why Another Data Warehousing System?
    • Data, data and more data
      • 200GB per day in March 2008
      • 2+TB(compressed) raw data per day today
  • Let's try Hadoop…
    • Pros
      • Superior in availability/scalability/manageability
      • Efficiency not that great, but can throw more hardware at it
      • Partial availability/resilience/scale more important than ACID
    • Cons: Programmability and Metadata
      • Map-reduce is hard to program (users know SQL/bash/Python)
      • Need to publish data in well-known schemas
    • Solution: HIVE
  • What is HIVE?
    • A system for managing and querying structured data built on top of Hadoop
      • Map-Reduce for execution
      • HDFS for storage
      • Metadata on raw files
    • Key Building Principles:
      • SQL as a familiar data warehousing tool
      • Extensibility – Types, Functions, Formats, Scripts
      • Scalability and Performance
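  • A minimal sketch of these principles in HiveQL (the page_view schema matches the join example later in this deck; the DDL is standard early-Hive syntax):

      -- Register a schema over raw files: the MetaStore keeps the
      -- metadata while the data stays in delimited files on HDFS.
      CREATE TABLE page_view (pageid INT, userid INT, time STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
      STORED AS TEXTFILE;

      -- Familiar SQL; Hive compiles this into map-reduce jobs.
      SELECT pageid, count(1) FROM page_view GROUP BY pageid;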
  • Simplifying Hadoop
    • hive> select key, count(1) from kv1 where key > 100 group by key;
    • vs.
    • $ cat > /tmp/reducer.sh
    • uniq -c | awk '{print $2" "$1}'
    • $ cat > /tmp/map.sh
    • awk -F '\001' '{if($1 > 100) print $1}'
    • $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
    • $ bin/hadoop dfs -cat /tmp/largekey/part*
  • Looks like this: [Diagram: rack of nodes, Node = DataNode + Map-Reduce, each with local disks; nodes connected at 1 Gigabit, with 4-8 Gigabit uplinks]
  • Data Warehousing at Facebook Today [Diagram: Web Servers → Scribe Servers → Filers → Hive on Hadoop Cluster, alongside Oracle RAC and Federated MySQL]
  • Hive/Hadoop Usage @ Facebook
    • Types of Applications:
      • Reporting
        • E.g., daily/weekly aggregations of impression/click counts
        • Complex measures of user engagement
      • Ad hoc Analysis
        • E.g., how many group admins broken down by state/country
      • Data Mining (assembling training data)
        • E.g., user engagement as a function of user attributes
      • Spam Detection
        • Anomalous patterns for Site Integrity
        • Application API usage patterns
      • Ad Optimization
      • Too many to count…
  • Hadoop Usage @ Facebook
    • Data statistics:
      • Total Data: ~1.7PB
      • Cluster capacity: ~2.4PB
      • Net Data added/day: ~15TB
        • 6TB of uncompressed source logs
        • 4TB of uncompressed dimension data reloaded daily
      • Compression Factor ~5x (gzip, more with bzip)
    • Usage statistics:
      • 3,200 jobs/day with 800K map-reduce tasks/day
      • 55TB of compressed data scanned daily
      • 15TB of compressed output data written to HDFS
      • 80 MM compute minutes/day
  • In Pictures [charts omitted]
  • HIVE Internals!!
  • HIVE: Components [Architecture diagram: CLI, Web UI, and Thrift API clients issue DDL, queries, and browsing requests; Parser, Planner, Optimizer, and Execution compile them into Map Reduce jobs over HDFS; the MetaStore, backed by a DB, holds the metadata; pluggable SerDe libraries handle formats such as Thrift, Jute, JSON..]
  • Data Model [Diagram: logical and hash partitioning. The table clicks maps to the HDFS directory /hive/clicks; logical partitioning adds subdirectories such as /hive/clicks/ds=2008-03-25; hash partitioning (bucketing) adds numbered files such as /hive/clicks/ds=2008-03-25/0 … The MetaStore DB records tables, data location, partitioning cols, and bucketing info.]
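  • A hedged sketch of the DDL behind that layout (the clicks table and ds partition column come from the diagram; the other column names and the bucket count are illustrative):

      -- One HDFS subdirectory per ds value (logical partitioning),
      -- one numbered file per hash bucket (hash partitioning).
      CREATE TABLE clicks (userid INT, url STRING)
      PARTITIONED BY (ds STRING)
      CLUSTERED BY (userid) INTO 32 BUCKETS;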
  • Hive Query Language
    • SQL
      • Subqueries in from clause
      • Equi-joins
      • Multi-table Insert
      • Multi-group-by
    • Sampling
    • Complex object types
    • Extensibility
      • Pluggable Map-reduce scripts
      • Pluggable User Defined Functions
      • Pluggable User Defined Types
      • Pluggable Data Formats
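  • A short sketch combining two of the features above, a subquery in the from clause and bucket sampling, against the hypothetical clicks table sketched earlier:

      -- Scan only bucket 1 of 32 instead of the whole table.
      SELECT t.userid, count(1)
      FROM (SELECT userid FROM clicks
            TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid)) t
      GROUP BY t.userid;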
  • Map Reduce Example [Diagram: on Machine 1 and Machine 2, Local Map turns input pairs <k1,v1> … <k6,v6> into intermediate pairs <nk1,nv1>, <nk2,nv2>, <nk3,nv3>, <nk2,nv4>, <nk2,nv5>, <nk1,nv6>; Global Shuffle routes pairs with the same key to the same machine; Local Sort groups them by key; Local Reduce emits the counts <nk2,3>, <nk1,2>, <nk3,1>.]
  • Hive QL – Join
      • INSERT OVERWRITE TABLE pv_users
      • SELECT pv.pageid, u.age
      • FROM page_view pv JOIN user u ON (pv.userid = u.userid);
  • Hive QL – Join in Map Reduce [Diagram:
    page_view: (pageid, userid, time) = (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)
    user: (userid, age, gender) = (111, 25, female), (222, 32, male)
    Map emits key = userid, value = <table tag, column>: 111 → <1,1>, <1,2> and 222 → <1,1> from page_view; 111 → <2,25> and 222 → <2,32> from user.
    Shuffle/Sort groups by key: 111 → <1,1>, <1,2>, <2,25>; 222 → <1,1>, <2,32>.
    Reduce joins the tagged values per key to produce pv_users: (pageid, age) = (1, 25), (2, 25) and (1, 32).]
  • Hive QL – Group By
          • SELECT pageid, age, count(1)
          • FROM pv_users
          • GROUP BY pageid, age;
  • Hive QL – Group By in Map Reduce [Diagram:
    pv_users split across two mappers: (pageid, age) = (1, 25), (2, 25) and (1, 32), (2, 25).
    Map emits key = <pageid, age>, value = 1: <1,25> → 1, <2,25> → 1 and <1,32> → 1, <2,25> → 1.
    Shuffle/Sort routes <1,25> and <1,32> to one reducer and both <2,25> pairs to another.
    Reduce sums the values: (pageid, age, count) = (1, 25, 1), (1, 32, 1) and (2, 25, 2).]
  • Group by Optimizations
    • Map side partial aggregations
      • Hash Based aggregates
      • Serialized key/values in hash tables
    • Optimizations being Worked On:
      • Exploit pre-sorted data for distinct counts
      • Partial aggregations and Combiners
      • Be smarter about avoiding multiple stages
      • Exploit table/column statistics for deciding strategy
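  • To make the map-side partial aggregation concrete: it is toggled by the hive.map.aggr setting (a real Hive parameter; defaults varied by release), as in this sketch reusing pv_users from the earlier examples:

      -- With map-side aggregation on, each mapper keeps a hash table of
      -- (pageid, age) groups and emits partial counts before the shuffle.
      set hive.map.aggr=true;
      SELECT pageid, age, count(1)
      FROM pv_users
      GROUP BY pageid, age;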
  • Inserts into Files, Tables and Local Files
        • FROM pv_users
        • INSERT OVERWRITE TABLE pv_gender_sum
          • SELECT pv_users.gender, count(DISTINCT pv_users.userid)
          • GROUP BY(pv_users.gender)
        • INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
          • SELECT pv_users.age, count(DISTINCT pv_users.userid)
          • GROUP BY(pv_users.age)
        • INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
        • FIELDS TERMINATED BY ',' LINES TERMINATED BY \013
          • SELECT pv_users.age, count(DISTINCT pv_users.userid)
          • GROUP BY(pv_users.age);
  • Extensibility - Custom Map/Reduce Scripts
        • FROM (
          • FROM pv_users
          • MAP (pv_users.userid, pv_users.date) USING 'map_script' AS (dt, uid)
          • CLUSTER BY(dt)) map
        • INSERT OVERWRITE TABLE pv_users_reduced
          • REDUCE (map.dt, map.uid) USING 'reduce_script'
          • AS (date, count);
  • Open Source Community
    • 21 contributors and growing
      • 6 contributors within Facebook
    • Contributors from:
      • Academia
      • Other web companies
      • Etc.
    • 7 committers
      • 1 external to Facebook, and looking to add more
  • Future Work
    • Statistics and cost-based optimization
    • Integration with BI tools (through JDBC/ODBC)
    • Performance improvements
    • More SQL constructs & UDFs
    • Indexing
    • Schema Evolution
    • Advanced operators
      • Cubes/Frequent Item Sets/Window Functions
    • Hive Roadmap
      • http://wiki.apache.org/hadoop/Hive/Roadmap
  • Information
    • Available as a subproject in Hadoop
      • http://wiki.apache.org/hadoop/Hive (wiki)
      • http://hadoop.apache.org/hive (home page)
      • http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
      • ##hive (IRC)
      • Works with hadoop-0.17, 0.18, 0.19
    • Release 0.3 is coming in the next few weeks
    • Mailing Lists:
      • hive-{user,dev,commits}@hadoop.apache.org