Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Transcript

  • 1. Hive Evolution
    Hadoop India Summit
    February 2011
    Namit Jain (Facebook)
  • 2. Agenda
    Hive Overview
    Version 0.6 (released!)
    Version 0.7 (under development)
    Hive is now a TLP!
    Roadmaps
  • 3. What is Hive?
    A Hadoop-based system for querying and managing structured data
    Uses Map/Reduce for execution
    Uses the Hadoop Distributed File System (HDFS) for storage
  • 4. Hive Origins
    Data explosion at Facebook
    Traditional DBMS technology could not keep up with the growth
    Hadoop to the rescue!
    Incubation with the ASF, then a Hadoop sub-project
    Now a top-level ASF project
  • 5. SQL vs MapReduce
    hive> select key, count(1) from kv1 where key > 100 group by key;
    vs.
    $ cat > /tmp/reducer.sh
    uniq -c | awk '{print $2"\t"$1}'
    $ cat > /tmp/map.sh
    awk -F '\001' '{if ($1 > 100) print $1}'
    $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
        -input /user/hive/warehouse/kv1 \
        -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh \
        -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
    $ bin/hadoop dfs -cat /tmp/largekey/part*
  • 6. Hive Evolution
    Originally:
      a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
    Now, more and more:
      a parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
  • 7. Intended Usage
    Web-scale Big Data
      100s of terabytes
    Large Hadoop cluster
      100s of nodes (heterogeneous OK)
    Data has a schema
    Batch jobs
      for both loads and queries
  • 8. So Don't Use Hive If…
    Your data is measured in GB
    You don't want to impose a schema
    You need responses in seconds
    A "conventional" analytic DBMS can already do the job (and you can afford it)
    You don't have a lot of time and smart people
  • 9. Scaling Up
    Facebook warehouse, Jan 2011:
      2750 nodes
      30 petabytes of disk space
    Data access per day:
      ~40 terabytes added (compressed)
      25000 map/reduce jobs
    300-400 users/month
  • 10. Facebook Deployment
    [Deployment diagram: Web Servers, Scribe MidTier, Scribe-Hadoop Clusters, Hive Replication, Production Hive-Hadoop Cluster, Archival Hive-Hadoop Cluster, Adhoc Hive-Hadoop Cluster, Sharded MySQL]
  • 11. System Architecture (diagram)
  • 12. Data Model (diagram)
  • 13. Column Data Types
    Primitive Types
      integer types, float, string, boolean
    Nest-able Collections
      array<any-type>
      map<primitive-type, any-type>
    User-defined types
      structures with attributes which can be of any type
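    A minimal HiveQL sketch (not from the slides; table and column names are made up) showing these types in a table definition:
      CREATE TABLE user_actions (
        userid     INT,                            -- primitive types
        score      FLOAT,
        name       STRING,
        active     BOOLEAN,
        tags       ARRAY<STRING>,                  -- nest-able collections
        properties MAP<STRING, STRING>,
        address    STRUCT<city:STRING, zip:INT>    -- structure with typed attributes
      );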
  • 14. Hive Query Language
    DDL
      {create/alter/drop} {table/view/partition}
      create table as select
    DML
      insert overwrite
    QL
      Sub-queries in the from clause
      Equi-joins (including outer joins)
      Multi-table insert
      Sampling
      Lateral views
    Interfaces
      JDBC/ODBC/Thrift
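    A short HiveQL sketch (not from the slides; table names are illustrative) of the DDL/DML forms listed above:
      -- DDL: create table as select
      CREATE TABLE top_urls AS
      SELECT url, count(1) AS cnt FROM page_views GROUP BY url;

      -- DML: insert overwrite (replaces the table's previous contents)
      INSERT OVERWRITE TABLE top_urls
      SELECT url, count(1) FROM page_views GROUP BY url;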
  • 15. Query Translation Example
    SELECT url, count(*) FROM page_views GROUP BY url
    Map tasks compute partial counts for each URL in a hash table
      "map side" pre-aggregation
    Map outputs are partitioned by URL and shipped to the corresponding reducers
    Reduce tasks tally up the partial counts to produce the final results
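    Map-side pre-aggregation is controlled by a session setting; a hedged sketch using the standard property together with the query from the slide:
      set hive.map.aggr=true;   -- compute partial counts in the mappers' hash tables
      SELECT url, count(*) FROM page_views GROUP BY url;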
  • 16. Multi-Table Insert Example
    FROM (SELECT a.status, b.school, b.gender
          FROM status_updates a JOIN profiles b
          ON (a.userid = b.userid AND
              a.ds='2009-03-20')
         ) subq1
    INSERT OVERWRITE TABLE gender_summary
      PARTITION(ds='2009-03-20')
    SELECT subq1.gender, COUNT(1)
    GROUP BY subq1.gender
    INSERT OVERWRITE TABLE school_summary
      PARTITION(ds='2009-03-20')
    SELECT subq1.school, COUNT(1)
    GROUP BY subq1.school
  • 17. It Gets Quite Complicated! (diagram)
  • 18. Behavior Extensibility
    TRANSFORM scripts (any language)
      Serialization + IPC overhead
    User defined functions (Java)
      In-process, lazy object evaluation
    Pre/Post Hooks (Java)
      Statement validation/execution
      Example uses: auditing, replication, authorization, multiple clusters
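    TRANSFORM and UDF examples follow on the next slides; for hooks, a hedged sketch of wiring in a pre-execution hook (hive.exec.pre.hooks is a standard Hive property; com.example.AuditHook is a hypothetical class):
      set hive.exec.pre.hooks=com.example.AuditHook;   -- hypothetical auditing hook class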
  • 19. Map/Reduce Scripts Examples
    add file page_url_to_id.py;
    add file my_python_session_cutter.py;
    FROM
      (SELECT TRANSFORM(user_id, page_url, unix_time)
       USING 'page_url_to_id.py'
       AS (user_id, page_id, unix_time)
       FROM mylog
       DISTRIBUTE BY user_id
       SORT BY user_id, unix_time) mylog2
    SELECT TRANSFORM(user_id, page_id, unix_time)
    USING 'my_python_session_cutter.py'
    AS (user_id, session_info);
  • 20. UDF vs UDAF vs UDTF
    User Defined Function
      One-to-one row mapping
      concat('foo', 'bar')
    User Defined Aggregate Function
      Many-to-one row mapping
      sum(num_ads)
    User Defined Table Function
      One-to-many row mapping
      explode([1,2,3])
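    A small HiveQL sketch (not from the slides; table and columns are illustrative) of a UDTF used through a lateral view:
      -- one input row per page view, one output row per tag in the array column
      SELECT pv.url, t.tag
      FROM page_views pv
      LATERAL VIEW explode(pv.tags) t AS tag;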
  • 21. UDF Example
    add jar build/ql/test/test-udfs.jar;
    CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
    SELECT testlength(src.value) FROM src;
    DROP TEMPORARY FUNCTION testlength;

    UDFTestLength.java:
      package org.apache.hadoop.hive.ql.udf;

      import org.apache.hadoop.hive.ql.exec.UDF;

      public class UDFTestLength extends UDF {
        // One-to-one row mapping: returns the length of the input string
        public Integer evaluate(String s) {
          if (s == null) {
            return null;
          }
          return s.length();
        }
      }
  • 22. Storage Extensibility
    Input/OutputFormat: file formats
      SequenceFile, RCFile, TextFile, …
    SerDe: row formats
      Thrift, JSON, ProtocolBuffer, …
    Storage Handlers (new in 0.6)
      Integrate foreign metadata, e.g. HBase
    Indexing
      Under development in 0.7
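    A minimal HiveQL sketch (not from the slides; table name is illustrative) of picking a file format at table-creation time; the storage handler form appears on slide 27:
      -- columnar RCFile instead of the default text format
      CREATE TABLE page_views_rc (url STRING, userid INT)
      STORED AS RCFILE;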
  • 23. Release 0.6
    October 2010
    Views
    Multiple Databases
    Dynamic Partitioning
    Automatic Merge
    New Join Strategies
    Storage Handlers
  • 24. Dynamic Partitions
    Automatically create partitions based on distinct values in columns
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
    FROM page_view_stg pvs
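    Dynamic partitioning is gated by session settings; a hedged sketch using the standard properties:
      set hive.exec.dynamic.partition=true;            -- enable dynamic partitions
      set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic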
  • 25. Automatic Merge
    Jobs can produce many files
    Why is this bad?
      Namenode pressure
      Downstream jobs have to deal with file-processing overhead
    So, clean up by merging results into a few large files (configurable)
    Use a conditional map-only task to do this
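    A hedged sketch of the settings that control the merge step (standard Hive properties; defaults vary by version):
      set hive.merge.mapfiles=true;            -- merge small files from map-only jobs
      set hive.merge.mapredfiles=true;         -- also merge map/reduce job outputs
      set hive.merge.size.per.task=256000000;  -- approximate target size of merged files, in bytes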
  • 26. Join Strategies
    Old join strategies
      Map-reduce join and map join
    Bucketed map join
      Allows the "small" table to be much bigger
    Sort-merge map join
    Dealing with skew in map/reduce joins
      Conditional plan step for skewed keys
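    A hedged HiveQL sketch (illustrative table names; the hint and properties are standard Hive) of requesting these strategies:
      -- map join: keep the small dimension table in memory
      SELECT /*+ MAPJOIN(d) */ f.userid, d.country
      FROM facts f JOIN dims d ON (f.dim_id = d.id);

      set hive.optimize.bucketmapjoin=true;  -- allow bucketed map joins
      set hive.optimize.skewjoin=true;       -- conditional plan step for skewed keys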
  • 27. Storage Handler Syntax
    HBase example:
    CREATE TABLE users(
      userid int, name string, email string, notes string)
    STORED BY
      'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = "small:name,small:email,large:notes")
    TBLPROPERTIES (
      "hbase.table.name" = "user_list");
  • 28. Release 0.7
    Deployed at Facebook:
      Stats Functions
      Indexes
      Local Mode
      Automatic Map Join
      Multiple DISTINCTs
      Archiving
    In development:
      Concurrency Control
      Stats Collection
      JDBC/ODBC Enhancements
      Authorization
      RCFile2
      Partitioned Views
      Security Enhancements
  • 29. Statistical Functions
    Stats 101
      stddev, var, covar
      percentile_approx
    Data Mining
      ngrams, sentences (text analysis)
      histogram_numeric
    SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
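    A hedged HiveQL sketch (illustrative columns; for histogram_numeric the second argument is the number of bins):
      SELECT relationshipstatus,
             percentile_approx(dob_year, 0.5) AS median_birth_year,
             stddev_pop(dob_year)             AS stddev_birth_year,
             histogram_numeric(dob_year, 20)  AS dob_histogram
      FROM users
      GROUP BY relationshipstatus;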
  • 30. Histogram Query Results
    "It's complicated" peaks at 18-19, but lasts into late 40s!
    "In a relationship" peaks at 20
    "Engaged" peaks at 25
    Married peaks in early 30s
    More married than single at 28
    Only teenagers use widowed?
  • 35. Pluggable Indexing
    Reference implementation
      Index is stored in a normal Hive table
      Compact: distinct block addresses
      Partition-level rebuild
    Currently in R&D
      Automatic use for WHERE, GROUP BY
      New index types (e.g. bitmap, HBase)
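    A hedged sketch of the compact-index DDL as of 0.7 (index and table names are illustrative; the handler class path may vary by version):
      CREATE INDEX idx_pv_url ON TABLE page_views (url)
      AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
      WITH DEFERRED REBUILD;

      ALTER INDEX idx_pv_url ON page_views REBUILD;  -- populate the index (per partition on partitioned tables)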
  • 36. Local Mode Execution
    Avoids map/reduce cluster job latency
    Good for jobs which process small amounts of data
    Let Hive decide when to use it:
      set hive.exec.mode.local.auto=true;
    Or force its usage:
      set mapred.job.tracker=local;
  • 37. Automatic Map Join
    Map join if the small table fits in memory
    If it can't, fall back to a reduce join
    Optimize hash table data structures
    Use the distributed cache to push out a pre-filtered lookup table
    Avoid swamping HDFS with reads from thousands of mappers
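    A hedged sketch of enabling the automatic conversion (standard Hive property introduced around 0.7):
      set hive.auto.convert.join=true;  -- convert eligible joins to map joins automatically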
  • 38. Multiple DISTINCT Aggs
    Example:
    SELECT
      view_date,
      COUNT(DISTINCT userid),
      COUNT(DISTINCT page_url)
    FROM page_views
    GROUP BY view_date
  • 39. Archiving
    Use HAR (Hadoop archive format) to combine many files into a few
    Relieves namenode memory pressure
    ALTER TABLE page_views
    {ARCHIVE|UNARCHIVE}
    PARTITION (ds='2010-10-30')
  • 40. Concurrency Control
    Pluggable distributed lock manager
      Default is ZooKeeper-based
    Simple read/write locking
      Table-level and partition-level
    Implicit locking (statement level)
      Deadlock-free via lock ordering
    Explicit LOCK TABLE (global)
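    A hedged sketch of turning on locking and taking an explicit lock (standard properties; ZooKeeper hosts are illustrative):
      set hive.support.concurrency=true;
      set hive.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;

      LOCK TABLE page_views EXCLUSIVE;  -- explicit global lock
      SHOW LOCKS;
      UNLOCK TABLE page_views;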
  • 41. Statistics Collection
    Implicit metastore update during load
    Or explicit via ANALYZE TABLE
    Table/partition-level:
      Number of rows
      Number of files
      Size in bytes
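    A hedged sketch of the explicit form (table and partition are illustrative):
      ANALYZE TABLE page_views PARTITION (ds='2010-10-30') COMPUTE STATISTICS;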
  • 42. Hive Is Now a TLP
    PMC:
      Namit Jain (chair)
      John Sichi
      Zheng Shao
      Edward Capriolo
      Raghotham Murthy
    Committers:
      Amareshwari Sriramadasu
      Carl Steinbach
      Paul Yang
      He Yongqiang
      Prasad Chakka
      Joydeep Sen Sarma
      Ashish Thusoo
      Ning Zhang
  • 43. Developer Diversity
    Recent contributors:
      Facebook, Yahoo, Cloudera
      Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems
      Numerous research projects
      Many, many more…
    Monthly San Francisco Bay Area contributor meetups
    India meetups?
  • 44. Roadmap: Heavy-Duty Tests
    Unit tests are insufficient
    What is needed:
      Real-world schemas/queries
      Non-toy data scales
      Scripted setup; configuration matrix
      Correctness/performance verification
      Automatic reports: throughput, latency, profiles, coverage, perf counters…
  • 45. Roadmap: Shared Test Site
    Nightly runs, regression alerting
    Performance trending
    Synthetic workload (e.g. TPC-H)
    Real-world workload (anonymized?)
    This is critical for:
      Non-subjective commit criteria
      Release quality
  • 46. Roadmap: New Features
    Hive Server stability/deployment
    File concatenation
      Reduce the number of files
    Performance
      Bloom filters
      Push-down filters
      Cost-based optimizer
      Column-level statistics
      Plans should be based on statistics
  • 47. Resources
    http://hive.apache.org
    user/dev@hive.apache.org
    njain@fb.com
    Questions?
