Hadoop Summit 2009 Hive

Hive talk at Hadoop Summit 2009

  • What is this? This is a huge amount of data. Along with the fast growth of active users on Facebook, the size of our data is exploding. In the last 12 months, the amount of data increased by 500%. This data is very valuable: it can be used to understand user behavior, measure the impact of a new product, and make data-based decisions. Traditionally, people store data in data warehouse solutions built on top of Oracle and MySQL. In recent years, we have also seen new proprietary solutions like AsterData and Netezza. However, these solutions either do not scale to the amount of data that we have, or they are too inflexible to satisfy our data analysis requirements. To provide the capability to analyze the huge amount of data that we have, we started the Hive project. Hive is based on Hadoop but does much more than Hadoop. We will show the details in the following slides.
  • Transcript

    • 1. Hive - Data Warehousing & Analytics on Hadoop. Wednesday, June 10, 2009, Santa Clara Marriott. Namit Jain, Zheng Shao (Facebook)
    • 2. Agenda
      • Introduction
      • Facebook Usage
      • Hive Progress and Roadmap
      • Open Source Community
    • 3. Introduction
    • 4. Why Another Data Warehousing System?
      • Data, data and more data
      • ~1TB per day in March 2008
      • ~10TB per day today
    • 5. (image only, no slide text)
    • 6. Let's try Hadoop…
      • Pros
        • Superior in availability/scalability/manageability
        • Efficiency not that great, but throw more hardware
        • Partial availability/resilience/scale more important than ACID
      • Cons: Programmability and Metadata
        • Map-reduce hard to program (users know SQL/bash/Python)
        • Need to publish data in well-known schemas
      • Solution: HIVE
    • 7. Let's try Hadoop… (continued)
      • RDBMS> select key, count(1) from kv1 where key > 100 group by key;
      • vs.
      • $ cat > /tmp/reducer.sh
      • uniq -c | awk '{print $2" "$1}'
      • $ cat > /tmp/map.sh
      • awk -F '\001' '{if($1 > 100) print $1}'
      • $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
      • $ bin/hadoop dfs -cat /tmp/largekey/part*
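To make the streaming job above easier to follow, here is a minimal in-process sketch of what the two scripts compute. It is not Hadoop code: `map_phase` and `reduce_phase` are hypothetical names, the sample rows are invented, and the Ctrl-A (`\x01`) field separator is an assumption about how the `kv1` files are delimited.

```python
from collections import Counter

def map_phase(lines, sep="\x01", threshold=100):
    """Mimic map.sh: emit the key of every row whose key exceeds the threshold."""
    for line in lines:
        key = line.split(sep, 1)[0]
        if int(key) > threshold:
            yield key

def reduce_phase(keys):
    """Mimic `uniq -c` plus the awk reorder: return (key, count) pairs."""
    counts = Counter(keys)
    return sorted(counts.items())

# Invented sample rows in "key<Ctrl-A>value" form (assumption).
rows = ["150\x01a", "42\x01b", "150\x01c", "200\x01d"]
result = reduce_phase(map_phase(rows))
print(result)
```

The single SQL statement on the slide expresses the same filter and count without any of this plumbing, which is the point the comparison makes.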
    • 8. What is HIVE?
      • A system for managing and querying structured data built on top of Hadoop
        • Map-Reduce for execution
        • HDFS for storage
        • Metadata on raw files
      • Key Building Principles:
        • SQL as a familiar data warehousing tool
        • Extensibility: Types, Functions, Formats, Scripts
        • Scalability and Performance
    • 9. Simplifying Hadoop
      • RDBMS> select key, count(1) from kv1 where key > 100 group by key;
      • vs.
      • hive> select key, count(1) from kv1 where key > 100 group by key;
    • 10. Facebook Usage
    • 11. Data Warehousing at Facebook Today
      • Diagram components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL
    • 12. Hive/Hadoop Usage @ Facebook
      • Types of Applications:
        • Reporting
          • E.g.: Daily/weekly aggregations of impression/click counts
          • SELECT pageid, count(1) as imps FROM imp_table WHERE date = '2009-05-01' GROUP BY pageid;
          • Complex measures of user engagement
        • Ad hoc Analysis
          • E.g.: How many group admins, broken down by state/country
        • Data Mining (assembling training data)
          • E.g.: User engagement as a function of user attributes
        • Spam Detection
          • Anomalous patterns for Site Integrity
          • Application API usage patterns
        • Ad Optimization
    • 13. Hadoop Usage @ Facebook
      • Cluster Capacity:
        • 600 nodes
        • ~2.4PB (80% used)
      • Data statistics:
        • Source logs/day: 6TB
        • Dimension data/day: 4TB
        • Compression factor ~5x (gzip)
      • Usage statistics:
        • 3200 jobs/day with 800K map-reduce tasks/day
        • 55TB of compressed data scanned daily
        • 15TB of compressed output data written to HDFS
        • 150 active users within Facebook
    • 14. Hive Progress and Roadmap
    • 15.
      • CREATE TABLE clicks(key STRING, value STRING) PARTITIONED BY (ds STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe' WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003') LOCATION '/hive/clicks';
    • 16. Data Model
      • HDFS layout for the clicks table:
        • /hive/clicks (table)
        • /hive/clicks/ds=2008-03-25 (logical partitioning)
        • /hive/clicks/ds=2008-03-25/0 … (hash partitioning)
      • MetaStore (Metastore DB):
        • Tables
        • Data Location
        • Bucketing Info
        • Partitioning Cols
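The directory layout in the data model can be sketched as two small path helpers: one for the logical partition (`ds=...`) and one for the hash bucket beneath it. This is only an illustration. The function names are hypothetical, and `zlib.crc32` stands in for whatever hash function Hive actually applies to the bucketing column.

```python
import zlib

def partition_path(table_root, ds):
    """Logical partitioning: one directory per value of the partition column."""
    return f"{table_root}/ds={ds}"

def bucket_for(key, num_buckets):
    """Hash partitioning: map a key to a bucket number in [0, num_buckets).

    crc32 is a stand-in hash (assumption), not Hive's own function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def bucket_path(table_root, ds, key, num_buckets):
    """Full location of the bucket file that would hold this key."""
    return f"{partition_path(table_root, ds)}/{bucket_for(key, num_buckets)}"

print(partition_path("/hive/clicks", "2008-03-25"))
print(bucket_path("/hive/clicks", "2008-03-25", "user42", 32))
```

Because a key always lands in the same bucket, reading one bucket out of N touches roughly 1/N of the keys, which is what makes bucket-based sampling cheap.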
    • 17. HIVE: Components
      • Hive CLI: DDL, Queries, Browsing
      • Web UI
      • Thrift API
      • Parser, Planner, Optimizer, Execution
      • SerDe: Thrift, CSV, JSON..
      • MetaStore: DB
      • Map Reduce
      • HDFS
    • 18. Hive Query Language
      • SQL
        • Subqueries in from clause
        • Equi-joins
        • Multi-table Insert
        • Multi-group-by
      • Sampling
        • SELECT s.key, count(1) FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s WHERE s.ds = '2009-04-22' GROUP BY s.key
    • 19.
      • FROM pv_users
      • INSERT INTO TABLE pv_gender_sum
        • SELECT gender, count(DISTINCT userid)
        • GROUP BY gender
      • INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
        • SELECT age, count(DISTINCT userid)
        • GROUP BY age
      • INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
        • SELECT age, count(DISTINCT userid)
        • GROUP BY age;
    • 20. Hive Query Language (continued)
      • Extensibility
        • Pluggable Map-reduce scripts
        • Pluggable User Defined Functions
        • Pluggable User Defined Types
          • Complex object types: List of Maps
        • Pluggable Data Formats
          • Apache Log Format
    • 21. Pluggable Map-Reduce Scripts
      • FROM (
        • FROM pv_users
        • MAP pv_users.userid, pv_users.date
        • USING 'map_script'
        • AS dt, uid
        • CLUSTER BY dt) map
      • INSERT INTO TABLE pv_users_reduced
        • REDUCE map.dt, map.uid
        • USING 'reduce_script'
        • AS date, count;
    • 22. Map Reduce Example (two machines, Machine 1 and Machine 2)
      • Input: <k1, v1> <k2, v2> <k3, v3> | <k4, v4> <k5, v5> <k6, v6>
      • Local Map: <nk1, nv1> <nk2, nv2> <nk3, nv3> | <nk2, nv4> <nk2, nv5> <nk1, nv6>
      • Global Shuffle: <nk2, nv4> <nk2, nv5> <nk2, nv2> | <nk1, nv1> <nk3, nv3> <nk1, nv6>
      • Local Sort: <nk1, nv1> <nk1, nv6> <nk3, nv3> | <nk2, nv4> <nk2, nv5> <nk2, nv2>
      • Local Reduce: <nk2, 3> | <nk1, 2> <nk3, 1>
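The four stages in this example (local map, global shuffle, local sort, local reduce) can be simulated in a single process. The sketch below is an illustration only: `run_map_reduce`, `KEY_MAP`, and the two-reducer setup are hypothetical, and the mapper simply renames keys so that the per-key counts match the slide.

```python
from itertools import groupby

def run_map_reduce(splits, mapper, reducer, num_reducers=2):
    """Run map, shuffle, sort and reduce over in-memory input splits."""
    # Local map: each split is processed independently.
    mapped = [[kv for record in split for kv in mapper(record)] for split in splits]
    # Global shuffle: every pair with the same key goes to the same reducer.
    partitions = [[] for _ in range(num_reducers)]
    for output in mapped:
        for key, value in output:
            partitions[hash(key) % num_reducers].append((key, value))
    # Local sort, then local reduce over each run of equal keys.
    results = {}
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
        for key, group in groupby(part, key=lambda kv: kv[0]):
            results[key] = reducer(key, [value for _, value in group])
    return results

# Hypothetical key renaming chosen to reproduce the slide's counts.
KEY_MAP = {"k1": "nk1", "k2": "nk2", "k3": "nk3", "k4": "nk2", "k5": "nk2", "k6": "nk1"}

def mapper(record):
    key, value = record
    yield KEY_MAP[key], "n" + value

def reducer(key, values):
    return len(values)

splits = [
    [("k1", "v1"), ("k2", "v2"), ("k3", "v3")],
    [("k4", "v4"), ("k5", "v5"), ("k6", "v6")],
]
counts = run_map_reduce(splits, mapper, reducer)
print(counts)
```

Which reducer receives which key depends on the hash, but the final counts do not, so the result is the same however the shuffle distributes the keys.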
    • 23. Hive QL – Join
      • INSERT INTO TABLE pv_users
      • SELECT pv.pageid, u.age
      • FROM page_view pv
      • JOIN user u
      • ON (pv.userid = u.userid);
    • 24. Hive QL – Join in Map Reduce
      • page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)
      • user (userid, age, gender): (111, 25, female), (222, 32, male)
      • Map output from page_view (key, value): (111, <1, 1>), (111, <1, 2>), (222, <1, 1>)
      • Map output from user (key, value): (111, <2, 25>), (222, <2, 32>)
      • Shuffle/Sort:
        • Key 111: <1, 1>, <1, 2>, <2, 25>
        • Key 222: <1, 1>, <2, 32>
      • Reduce output pv_users (pageid, age): (1, 25), (2, 25) and (1, 32)
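A reduce-side join like the one pictured can be sketched in a few lines. The reading assumed here is that each map output value is a pair of (table tag, column value), with tag 1 for page_view rows and tag 2 for user rows. The function names are hypothetical and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def map_page_view(row):
    """Emit userid as the join key and <tag 1, pageid> as the value."""
    pageid, userid, _time = row
    return userid, (1, pageid)

def map_user(row):
    """Emit userid as the join key and <tag 2, age> as the value."""
    userid, age, _gender = row
    return userid, (2, age)

def reduce_join(values):
    """Pair every page_view value with every user value for one key."""
    pageids = [v for tag, v in values if tag == 1]
    ages = [v for tag, v in values if tag == 2]
    return [(pageid, age) for pageid in pageids for age in ages]

def join(page_view_rows, user_rows):
    # Shuffle: group all tagged values by join key.
    shuffled = defaultdict(list)
    for row in page_view_rows:
        key, value = map_page_view(row)
        shuffled[key].append(value)
    for row in user_rows:
        key, value = map_user(row)
        shuffled[key].append(value)
    # Sort by key, then reduce each group.
    output = []
    for key in sorted(shuffled):
        output.extend(reduce_join(shuffled[key]))
    return output

pv_users = join(page_view, user)
print(pv_users)
```

The tag is what lets a single reducer tell the two inputs apart after they have been mixed together by the shuffle.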
    • 25. Join Optimizations
      • Map Joins
        • User-specified small tables are stored in hash tables on the mapper, backed by jdbm
        • No reducer needed
        • INSERT INTO TABLE pv_users
        • SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
        • FROM page_view pv JOIN user u
        • ON (pv.userid = u.userid);
      • Future
        • Exploit table/column statistics for deciding strategy
    • 26. Hive QL – Map Join
      • page_view (pageid, userid, time): (1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)
      • user (userid, age, gender): (111, 25, female), (222, 32, male)
      • Hash table (key, value): (111, <1,2>), (222, <2>)
      • pv_users (pageid, age): (1, 25), (2, 25), (1, 32)
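A map join avoids the shuffle by loading one table into an in-memory hash table and streaming the other table past it. The sketch below is a simplified illustration under assumptions: it hashes page_view by userid (the table named in the MAPJOIN hint), uses a plain Python dict where the slides mention a jdbm-backed table, and the function names are hypothetical.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def build_hash_table(page_view_rows):
    """Index the hinted table by join key: userid -> list of pageids."""
    table = defaultdict(list)
    for pageid, userid, _time in page_view_rows:
        table[userid].append(pageid)
    return table

def map_join(page_view_rows, user_rows):
    """Join entirely on the map side: probe the hash table for each streamed row."""
    table = build_hash_table(page_view_rows)
    output = []
    for userid, age, _gender in user_rows:
        for pageid in table.get(userid, []):
            output.append((pageid, age))
    return output

pv_users = map_join(page_view, user)
print(pv_users)
```

Since every streamed row is joined where it is read, no reducer is involved, at the cost of requiring the hashed table to fit on each mapper.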
    • 27. Hive QL – Group By
      • SELECT pageid, age, count(1)
      • FROM pv_users
      • GROUP BY pageid, age;
    • 28. Hive QL – Group By in Map Reduce
      • pv_users, first split (pageid, age): (1, 25), (1, 25)
      • pv_users, second split (pageid, age): (2, 32), (1, 25)
      • Map output, first split (key, value): (<1,25>, 2)
      • Map output, second split (key, value): (<1,25>, 1), (<2,32>, 1)
      • Shuffle/Sort:
        • First reducer: (<1,25>, 2), (<1,25>, 1)
        • Second reducer: (<2,32>, 1)
      • Reduce output (pageid, age, count): (1, 25, 3) and (2, 32, 1)
    • 29. Group by Optimizations
      • Map side partial aggregations
        • Hash-based aggregates
        • Serialized key/values in hash tables
        • 90% speed improvement on the query:
          • SELECT count(1) FROM t;
      • Load balancing for data skew
      • Optimizations being worked on:
        • Exploit pre-sorted data for distinct counts
        • Exploit table/column statistics for deciding strategy
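Map-side partial aggregation can be sketched as a hash-based count inside each mapper, followed by a merge on the reducer. This is a minimal illustration using the pv_users rows from the group-by example. `map_partial` and `reduce_merge` are hypothetical names, and a `Counter` stands in for the serialized hash table the slides describe.

```python
from collections import Counter

def map_partial(split):
    """Aggregate inside the mapper: one (key, partial count) per distinct key."""
    return Counter((pageid, age) for pageid, age in split)

def reduce_merge(partials):
    """Sum the partial counts that arrive for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two input splits of pv_users as (pageid, age) rows.
splits = [[(1, 25), (1, 25)], [(2, 32), (1, 25)]]
partials = [map_partial(split) for split in splits]
counts = reduce_merge(partials)
print(dict(counts))
```

The first split emits a single pair with count 2 instead of two pairs with count 1, so less data crosses the shuffle. For a query such as `SELECT count(1) FROM t` each mapper emits just one value.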
    • 30. Columnar Storage
      • CREATE TABLE columnTable (key STRING, value STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.ColumnarSerDe' STORED AS RCFILE;
      • Saved 25% of space compared with SequenceFile
        • Based on one of the largest tables (30 columns) inside Facebook
        • Both are compressed with GzipCodec
      • Speed improvements in progress
        • Need to propagate column-selection information to FileFormat
      • *Contribution from Yongqiang He (outside Facebook)
    • 31. Speed Improvements over Time
      • Query A: SELECT count(1) FROM t;
      • Query B: SELECT concat(concat(concat(a,b),c),d) FROM t;
      • Query C: SELECT * FROM t;
      • Time measured is map-side time only (to avoid unstable shuffling time on the reducer side). It includes time for decompression and compression (both using GzipCodec).
      • Results (Date, SVN Revision, Major Changes: Query A / Query B / Query C):
        • 2/22/2009, 746906, Before Lazy Deserialization: 83 sec / 98 sec / 183 sec
        • 2/23/2009, 747293, Lazy Deserialization: 40 sec / 66 sec / 185 sec
        • 3/6/2009, 751166, Map-side Aggregation: 22 sec / 67 sec / 182 sec
        • 4/29/2009, 770074, Object Reuse: 21 sec / 49 sec / 130 sec
        • 6/3/2009, 781633, Map-side Join*: 21 sec / 48 sec / 132 sec
      • * No performance benchmarks for Map-side Join yet.
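One way to read these timings is as a speed-up percentage, (old time / new time - 1) x 100. The short sketch below applies that formula to two rows of the table. The formula is an assumed interpretation of "speed improvement", and `speedup_pct` is a hypothetical helper name.

```python
def speedup_pct(before_sec, after_sec):
    """Speed improvement in percent: how much more work per second after the change."""
    return (before_sec / after_sec - 1) * 100

# Query A, before vs. after lazy deserialization (83 sec -> 40 sec).
lazy_gain = speedup_pct(83, 40)
# Query C, before vs. after object reuse (183 sec -> 130 sec).
reuse_gain = speedup_pct(183, 130)

print(f"Query A with lazy deserialization: {lazy_gain:.1f}%")
print(f"Query C with object reuse: {reuse_gain:.1f}%")
```

Under this reading the two results come out at about 107.5% and 40.8%, in line with the roughly 108% and 40% figures quoted for lazy deserialization and object reuse.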
    • 32. Overcoming Java Overhead
      • Reuse objects
        • Use Writable instead of Java primitives
        • Reuse objects across all rows
        • *40% speed improvement on Query C
      • Lazy deserialization
        • Only deserialize the column when asked
        • Very helpful for complex types (map/list/struct)
        • *108% speed improvement on Query A
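Both ideas can be shown together in a small sketch: a single row object is reused for every input line, and the line is only parsed when a column is actually requested. This is an illustration of the principle in Python, not Hive's implementation. `LazyRow`, `count_rows`, and the Ctrl-A separator are assumptions, and for simplicity the whole line is split on first access rather than one column at a time.

```python
class LazyRow:
    """Holds the raw delimited line and parses it only when a column is requested."""

    def __init__(self, sep="\x01"):
        self._sep = sep
        self._raw = ""
        self._fields = None
        self.parse_count = 0  # number of times a line was actually split

    def reset(self, raw):
        """Point this object at the next row instead of allocating a new one."""
        self._raw = raw
        self._fields = None
        return self

    def get(self, index):
        """Return one column, deserializing the line on first access only."""
        if self._fields is None:
            self._fields = self._raw.split(self._sep)
            self.parse_count += 1
        return self._fields[index]

def count_rows(lines):
    """A count(1)-style scan: no column is read, so nothing is ever parsed."""
    row = LazyRow()
    n = 0
    for line in lines:
        row.reset(line)
        n += 1
    return n, row.parse_count

print(count_rows(["1\x01a", "2\x01b", "3\x01c"]))
```

A query that never touches a column, such as `SELECT count(1) FROM t`, pays no parsing cost at all here, which is consistent with lazy deserialization helping Query A the most.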
    • 33. Generic UDF and UDAF
      • Let UDF and UDAF accept complex-type parameters
      • Integrate UDF and UDAF with Writables
        • public IntWritable evaluate(IntWritable a, IntWritable b) {
        • intWritable.set((int)(a.get() + b.get()));
        • return intWritable;
        • }
    • 34. HQL Optimizations
      • Predicate Pushdown
      • Merging n-way join
      • Column Pruning
    • 35. Open Source Community
    • 36. Open Source Community
      • 21 contributors and growing
        • 6 contributors within Facebook
      • Contributors from:
        • Academia
        • Other web companies
        • Etc.
      • 7 committers
        • 1 external to Facebook, and looking to add more here
    • 37.
      • 50 JIRAs fixed in the last month
      • 218 JIRAs still open
      • 125 mails in the last month on hive-user@
      • 600 mails in the last month on hive-dev@
      • Various companies/universities
        • Adknowledge, Admob
        • Berkeley, Chinese Academy of Science
      • Demonstration at VLDB 2009
    • 38. Deployment Options
      • EC2
        • http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely
      • Cloudera Virtual Machine
        • http://www.cloudera.com/hadoop-training-hive-tutorial
      • Your own cluster
        • http://wiki.apache.org/hadoop/Hive/GettingStarted
      • Hive can directly consume data on Hadoop
        • CREATE EXTERNAL TABLE mytable (key STRING, value STRING) LOCATION '/user/abc/mytable';
    • 39. Future Work
      • Benchmark & Performance
      • Integration with BI tools (through JDBC/ODBC)
      • Indexing
      • More on Hive Roadmap
        • http://wiki.apache.org/hadoop/Hive/Roadmap
      • Machine Learning Integration
      • Real-time Streaming
    • 40. Information
      • Available as a sub project in Hadoop
        • http://wiki.apache.org/hadoop/Hive (wiki)
        • http://hadoop.apache.org/hive (home page)
        • http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
        • ##hive (IRC)
        • Works with hadoop-0.17, 0.18, 0.19
      • Release 0.3 is out and more are coming
      • Mailing Lists:
        • hive-{user,dev,commits}@hadoop.apache.org
    • 41. Contributors
      • Aaron Newton
      • Ashish Thusoo
      • David Phillips
      • Dhruba Borthakur
      • Edward Capriolo
      • Eric Hwang
      • Hao Liu
      • He Yongqiang
      • Jeff Hammerbacher
      • Johan Oskarsson
      • Josh Ferguson
      • Joydeep Sen Sarma
      • Kim P.
      • Michi Mutsuzaki
      • Min Zhou
      • Namit Jain
      • Neil Conway
      • Pete Wyckoff
      • Prasad Chakka
      • Raghotham Murthy
      • Richard Lee
      • Shyam Sundar Sarkar
      • Suresh Antony
      • Venky Iyer
      • Zheng Shao
    • 42. Questions
