Hive Percona 2009
Talk given at the Percona Performance Conference 2009, held alongside the MySQL Conference in Santa Clara, USA.


1. Data Warehousing & Analytics on Hadoop
   - Ashish Thusoo, Prasad Chakka
   - Facebook Data Team
2. Why Another Data Warehousing System?
   - Data, data and more data
     - 200 GB per day in March 2008
     - 2+ TB (compressed) raw data per day today
5. Let's try Hadoop…
   - Pros
     - Superior availability/scalability/manageability
     - Efficiency not that great, but can throw more hardware at it
     - Partial availability/resilience/scale more important than ACID
   - Cons: programmability and metadata
     - Map-reduce is hard to program (users know SQL/bash/python)
     - Need to publish data in well-known schemas
   - Solution: HIVE
6. What is HIVE?
   - A system for managing and querying structured data built on top of Hadoop
     - Map-Reduce for execution
     - HDFS for storage
     - Metadata on raw files
   - Key building principles:
     - SQL as a familiar data warehousing tool
     - Extensibility – types, functions, formats, scripts
     - Scalability and performance
7. Simplifying Hadoop

   hive> select key, count(1) from kv1 where key > 100 group by key;

   vs.

   $ cat > /tmp/
   uniq -c | awk '{print $2" "$1}'
   $ cat > /tmp/
   awk -F '01' '{if($1 > 100) print $1}'
   $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper -file /tmp/ -file /tmp/ -reducer -output /tmp/largekey -numReduceTasks 1
   $ bin/hadoop dfs -cat /tmp/largekey/part*
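The mapper and reducer script names in the streaming command above were lost in extraction. As a rough sketch (function names and field layout are assumptions), the mapper filters Ctrl-A-separated rows on the key, and the reducer counts occurrences, mirroring the `awk` filter and `uniq -c` pipeline on the slide:

```python
def mapper(lines, sep="\x01"):
    """Emit keys greater than 100, mirroring the awk filter."""
    for line in lines:
        key = line.split(sep)[0]
        if int(key) > 100:
            yield key

def reducer(keys):
    """Count occurrences of each key, mirroring `uniq -c` on sorted input."""
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    return counts

rows = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]
print(reducer(mapper(rows)))  # key 86 is filtered out; 238 appears twice
```

The HiveQL one-liner compiles to essentially this same map/filter plus reduce/count plan.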
8. Looks like this…
   [Cluster diagram: each node (DataNode + Map-Reduce) with local disks, 1 Gigabit links per node, 4–8 Gigabit between racks]
9. Data Warehousing at Facebook Today
   [Architecture diagram: Web Servers, Scribe Servers, Filers feeding Hive on a Hadoop cluster, alongside Oracle RAC and Federated MySQL]
10. Hive/Hadoop Usage @ Facebook
   - Types of applications:
     - Reporting
       - E.g. daily/weekly aggregations of impression/click counts
       - Complex measures of user engagement
     - Ad hoc analysis
       - E.g. how many group admins, broken down by state/country
     - Data mining (assembling training data)
       - E.g. user engagement as a function of user attributes
     - Spam detection
       - Anomalous patterns for site integrity
       - Application API usage patterns
     - Ad optimization
     - Too many to count…
11. Hadoop Usage @ Facebook
   - Data statistics:
     - Total data: ~1.7 PB
     - Cluster capacity: ~2.4 PB
     - Net data added/day: ~15 TB
       - 6 TB of uncompressed source logs
       - 4 TB of uncompressed dimension data reloaded daily
     - Compression factor: ~5x (gzip; more with bzip)
   - Usage statistics:
     - 3200 jobs/day with 800K map-reduce tasks/day
     - 55 TB of compressed data scanned daily
     - 15 TB of compressed output data written to HDFS
     - 80 MM compute minutes/day
12. In Pictures
   [Image-only slide]
14. HIVE Internals!!
15. HIVE: Components
   [Component diagram: Web UI, Hive CLI and Thrift API (DDL, queries, browsing); Parser, Planner, Optimizer and Execution engine; MetaStore with a backing DB; SerDe layer (Thrift, Jute, JSON, …); running over Map-Reduce and HDFS]
16. Data Model
   [Diagram: tables in the MetaStore DB record data location, partitioning columns and bucketing info; logical partitioning and hash partitioning map a table such as clicks to HDFS paths: /hive/clicks, /hive/clicks/ds=2008-03-25, /hive/clicks/ds=2008-03-25/0, …]
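A minimal sketch of how a logically partitioned, hash-bucketed table resolves to the HDFS paths in the diagram. The function name and the modulo stand-in for Hive's bucketing hash are assumptions for illustration:

```python
def partition_path(table, ds, userid, num_buckets, warehouse="/hive"):
    """Resolve a row to its HDFS location: logical partition on the
    ds column, then a hash bucket within that partition."""
    bucket = userid % num_buckets  # simple stand-in for Hive's bucketing hash
    return f"{warehouse}/{table}/ds={ds}/{bucket}"

print(partition_path("clicks", "2008-03-25", 12345, 32))
# /hive/clicks/ds=2008-03-25/25
```

Queries that filter on ds (or sample by bucket) can then read only the matching directories instead of scanning the whole table.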
17. Hive Query Language
   - SQL
     - Subqueries in FROM clause
     - Equi-joins
     - Multi-table insert
     - Multi-group-by
   - Sampling
   - Complex object types
   - Extensibility
     - Pluggable map-reduce scripts
     - Pluggable user-defined functions
     - Pluggable user-defined types
     - Pluggable data formats
18. Map Reduce Example
   [Diagram: two machines each hold input pairs <k1,v1> … <k6,v6>; Local Map rewrites them to <nk,nv> pairs; Global Shuffle routes pairs to machines by key; Local Sort groups them; Local Reduce emits <nk2,3>, <nk1,2>, <nk3,1>]
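The slide's flow can be sketched in Python, with the global shuffle modeled as grouping values by the new key (the `nk1`/`nk2`/`nk3` key mapping is illustrative, matching the counts in the diagram):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Local map: rewrite each (k, v) into a (nk, nv) pair.
    mapped = [map_fn(k, v) for k, v in records]
    # Global shuffle + local sort: group values by the new key.
    groups = defaultdict(list)
    for nk, nv in mapped:
        groups[nk].append(nv)
    # Local reduce: fold each group into one output value.
    return {nk: reduce_fn(nvs) for nk, nvs in groups.items()}

# Reproduce the slide: six input pairs, reduced to counts per new key.
key_map = {"k1": "nk1", "k2": "nk2", "k3": "nk3",
           "k4": "nk2", "k5": "nk2", "k6": "nk1"}
records = [("k%d" % i, "v%d" % i) for i in range(1, 7)]
out = map_reduce(records, lambda k, v: (key_map[k], v), len)
print(out)  # {'nk1': 2, 'nk2': 3, 'nk3': 1}
```

In real Hadoop the map and reduce phases run on different machines and the shuffle moves data over the network; the grouping logic is the same.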
19. Hive QL – Join

   INSERT INTO TABLE pv_users
   SELECT pv.pageid, u.age
   FROM page_view pv JOIN user u ON (pv.userid = u.userid);
20. Hive QL – Join in Map Reduce

   page_view:                      user:
     pageid | userid | time          userid | age | gender
     1      | 111    | 9:08:01       111    | 25  | female
     2      | 111    | 9:08:13       222    | 32  | male
     1      | 222    | 9:08:14

   Map output (key = userid, value = <table tag, column>):
     from page_view: 111 → <1, 1>, 111 → <1, 2>, 222 → <1, 1>
     from user:      111 → <2, 25>, 222 → <2, 32>

   After shuffle/sort, each reducer sees all tagged rows for its keys
   (111 → <1,1>, <1,2>, <2,25>; 222 → <1,1>, <2,32>) and crosses the
   two sides, producing pv_users:
     pageid | age
     1      | 25
     2      | 25
     1      | 32
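The reduce-side join above can be sketched with Python stand-ins for the tagged map outputs (tag 1 = page_view, tag 2 = user, as on the slide):

```python
from collections import defaultdict

page_view = [(1, 111), (2, 111), (1, 222)]   # (pageid, userid)
user      = [(111, 25), (222, 32)]           # (userid, age)

# Map: emit (userid, (table_tag, payload)) for every row of both tables.
mapped  = [(uid, (1, pageid)) for pageid, uid in page_view]
mapped += [(uid, (2, age)) for uid, age in user]

# Shuffle: group the tagged rows by the join key.
groups = defaultdict(list)
for uid, tagged in mapped:
    groups[uid].append(tagged)

# Reduce: within each key, cross the page_view rows with the user rows.
pv_users = [(pageid, age)
            for rows in groups.values()
            for tag_l, pageid in rows if tag_l == 1
            for tag_r, age in rows if tag_r == 2]
print(sorted(pv_users))  # [(1, 25), (1, 32), (2, 25)]
```

The table tag is what lets a single sorted stream per key serve both join inputs without a second pass.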
21. Hive QL – Group By

   SELECT pageid, age, count(1)
   FROM pv_users
   GROUP BY pageid, age;
22. Hive QL – Group By in Map Reduce

   pv_users (two input splits):
     pageid | age       pageid | age
     1      | 25        1      | 32
     2      | 25        2      | 25

   Map output (key = <pageid, age>, value = 1):
     <1,25> → 1, <2,25> → 1   and   <1,32> → 1, <2,25> → 1

   After shuffle/sort the reducers see:
     <1,25> → 1, <1,32> → 1   and   <2,25> → 1, <2,25> → 1

   Reduce output:
     pageid | age | count
     1      | 25  | 1
     1      | 32  | 1
     2      | 25  | 2
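The same group-by can be sketched in Python, with the composite key (pageid, age) carrying a count of 1 per row and the reduce step summing them:

```python
from collections import defaultdict

# Rows seen by the two mappers, as on the slide: (pageid, age)
split1 = [(1, 25), (2, 25)]
split2 = [(1, 32), (2, 25)]

# Map: emit ((pageid, age), 1) for every row.
mapped = [(row, 1) for row in split1 + split2]

# Shuffle/sort groups by the composite key; reduce sums the 1s per key.
counts = defaultdict(int)
for key, one in mapped:
    counts[key] += one

print(dict(counts))  # {(1, 25): 1, (2, 25): 2, (1, 32): 1}
```

Note that `count(1)` needs no payload at all beyond the key, which is why the map output value is just the constant 1.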
23. Group By Optimizations
   - Map-side partial aggregations
     - Hash-based aggregates
     - Serialized key/values in hash tables
   - Optimizations being worked on:
     - Exploit pre-sorted data for distinct counts
     - Partial aggregations and combiners
     - Be smarter about avoiding multiple stages
     - Exploit table/column statistics for deciding strategy
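A sketch of the map-side partial aggregation idea: each mapper pre-aggregates its split in a hash table and emits one partial count per distinct key instead of one record per input row, shrinking the shuffle. `Counter` here stands in for Hive's serialized hash table:

```python
from collections import Counter

def map_side_partial(rows):
    """Mapper: hash-aggregate the split locally before the shuffle."""
    return Counter(rows)

def reduce_partials(partials):
    """Reducer: merge the per-mapper partial counts per key."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

split1 = [(1, 25), (1, 25), (2, 25)]
split2 = [(1, 25), (1, 32)]
partials = [map_side_partial(split1), map_side_partial(split2)]
# 5 input rows shrink to 4 shuffle records (2 distinct keys per mapper).
print(dict(reduce_partials(partials)))
```

The win grows with key repetition per split: highly skewed keys that would flood one reducer are largely collapsed before they leave the mapper.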
24. Inserts into Files, Tables and Local Files

   FROM pv_users
   INSERT INTO TABLE pv_gender_sum
     SELECT pv_users.gender, count_distinct(pv_users.userid)
     GROUP BY(pv_users.gender)
   INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
     SELECT pv_users.age, count_distinct(pv_users.userid)
     GROUP BY(pv_users.age)
   INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
     FIELDS TERMINATED BY ',' LINES TERMINATED BY 13
     SELECT pv_users.age, count_distinct(pv_users.userid)
     GROUP BY(pv_users.age);
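The point of the multi-insert is that pv_users is scanned once while feeding several group-bys. A Python sketch of that single-scan idea (rows and column layout are hypothetical):

```python
# Hypothetical pv_users rows: (gender, age, userid)
pv_users = [("female", 25, 111), ("male", 25, 222), ("female", 32, 111)]

by_gender, by_age = {}, {}
for gender, age, uid in pv_users:          # one pass over the input
    by_gender.setdefault(gender, set()).add(uid)
    by_age.setdefault(age, set()).add(uid)

gender_sum = {g: len(uids) for g, uids in by_gender.items()}
age_sum = {a: len(uids) for a, uids in by_age.items()}
print(gender_sum)  # {'female': 1, 'male': 1}
print(age_sum)     # {25: 2, 32: 1}
```

Without multi-insert, each destination would cost its own full scan of pv_users, which matters when the input is terabytes.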
25. Extensibility – Custom Map/Reduce Scripts

   FROM (
     FROM pv_users
     MAP (pv_users.userid, USING 'map_script' AS (dt, uid)
     CLUSTER BY (dt)) map
   INSERT INTO TABLE pv_users_reduced
     REDUCE (map.dt, map.uid) USING 'reduce_script'
     AS (date, count);
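In a real streaming job, 'map_script' and 'reduce_script' would read and write tab-separated lines on stdin/stdout; this sketch models only their logic as functions, and the field layout (a userid plus a timestamp whose date prefix becomes dt) is an assumption:

```python
def map_script(rows):
    """Turn (userid, timestamp) rows into (dt, uid) pairs."""
    for userid, ts in rows:
        yield ts[:10], userid            # dt = date prefix of the timestamp

def reduce_script(pairs):
    """Count uids per date; CLUSTER BY dt guarantees all rows for a
    date arrive at the same reducer, contiguously."""
    counts = {}
    for dt, uid in pairs:
        counts[dt] = counts.get(dt, 0) + 1
    return [(dt, c) for dt, c in counts.items()]

rows = [(111, "2008-03-25 09:08:01"),
        (222, "2008-03-25 09:08:14"),
        (111, "2008-03-26 10:00:00")]
# sorted() stands in for the CLUSTER BY shuffle between the scripts.
print(reduce_script(sorted(map_script(rows))))
```

Because the scripts only see lines, they can be written in any language, which is the extensibility the slide is advertising.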
26. Open Source Community
   - 21 contributors and growing
     - 6 contributors within Facebook
   - Contributors from:
     - Academia
     - Other web companies
     - Etc.
   - 7 committers
     - 1 external to Facebook, and looking to add more here
27. Future Work
   - Statistics and cost-based optimization
   - Integration with BI tools (through JDBC/ODBC)
   - Performance improvements
   - More SQL constructs & UDFs
   - Indexing
   - Schema evolution
   - Advanced operators
     - Cubes / frequent item sets / window functions
   - Hive roadmap
28. Information
   - Available as a subproject in Hadoop
     - (wiki)
     - (home page)
     - (SVN repo)
     - ##hive (IRC)
     - Works with hadoop-0.17, 0.18, 0.19
   - Release 0.3 is coming in the next few weeks
   - Mailing lists: hive-{user,dev,commits}