Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain



  1. 1. Hive Evolution<br />Hadoop India Summit<br />February 2011<br />Namit Jain (Facebook)<br />
  2. 2. Agenda<br />Hive Overview<br />Version 0.6 (released!)<br />Version 0.7 (under development)<br />Hive is now a TLP!<br />Roadmaps<br />
  3. 3. What is Hive?<br />A Hadoop-based system for querying and managing structured data<br />Uses Map/Reduce for execution<br />Uses Hadoop Distributed File System (HDFS) for storage<br />
  4. 4. Hive Origins<br />Data explosion at Facebook<br />Traditional DBMS technology could not keep up with the growth<br />Hadoop to the rescue!<br />Incubation with ASF, then became a Hadoop sub-project<br />Now a top-level ASF project<br />
  5. 5. SQL vs MapReduce<br />hive> select key, count(1) from kv1 where key > 100 group by key;<br /> vs.<br />$ cat > /tmp/reducer.sh<br />uniq -c | awk '{print $2"\t"$1}'<br />$ cat > /tmp/map.sh<br />awk -F '\001' '{if($1 > 100) print $1}'<br />$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1 <br />$ bin/hadoop dfs -cat /tmp/largekey/part*<br />
  6. 6. Hive Evolution<br />Originally:<br />a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs<br />Now more and more:<br />A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture<br />
  7. 7. Intended Usage<br />Web-scale Big Data<br />100s of terabytes<br />Large Hadoop cluster<br />100s of nodes (heterogeneous OK)<br />Data has a schema<br />Batch jobs<br />for both loads and queries<br />
  8. 8. So Don’t Use Hive If…<br />Your data is measured in GB<br />You don’t want to impose a schema<br />You need responses in seconds<br />A “conventional” analytic DBMS can already do the job<br />(and you can afford it)<br />You don’t have a lot of time and smart people<br />
  9. 9. Scaling Up<br />Facebook warehouse, Jan 2011:<br />2750 nodes<br />30 petabytes disk space<br />Data access per day:<br />~40 terabytes added (compressed)<br />25000 map/reduce jobs<br />300-400 users/month<br />
  10. 10. Facebook Deployment<br />Web Servers<br />Scribe MidTier<br />Scribe-Hadoop Clusters<br /> Hive <br />Replication<br />Production <br />Hive-Hadoop <br />Cluster<br />Archival <br />Hive-Hadoop <br />Cluster <br />Adhoc <br />Hive-Hadoop <br />Cluster <br />Sharded MySQL<br />
  11. 11. System Architecture<br />
  12. 12. Data Model<br />
  13. 13. Column Data Types<br />Primitive Types<br />integer types, float, string, boolean<br />Nest-able Collections<br />array<any-type><br />map<primitive-type, any-type><br />User-defined types<br />structures with attributes which can be of any-type<br />
  14. 14. Hive Query Language<br />DDL<br />{create/alter/drop} {table/view/partition}<br />create table as select<br />DML<br />Insert overwrite<br />QL<br />Sub-queries in from clause<br />Equi-joins (including Outer joins)<br />Multi-table Insert<br />Sampling<br />Lateral Views<br />Interfaces<br />JDBC/ODBC/Thrift<br />
  15. 15. Query Translation Example<br />SELECT url, count(*) FROM page_views GROUP BY url<br />Map tasks compute partial counts for each URL in a hash table<br />“map side” pre-aggregation<br />map outputs are partitioned by URL and shipped to corresponding reducers<br />Reduce tasks tally up partial counts to produce final results<br />
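The translation of the GROUP BY query above can be sketched in plain Python. This is only an illustration of the map-side pre-aggregation idea, not Hive's execution code; the function names and the toy partitioning hash are assumptions made for the example.

```python
from collections import Counter

def map_task(rows):
    """Map side: pre-aggregate counts per URL in an in-memory hash table."""
    partial = Counter()
    for url in rows:
        partial[url] += 1
    return partial

def partition(url, num_reducers):
    """Route a key to a reducer using a deterministic toy hash (assumption)."""
    return sum(url.encode("utf-8")) % num_reducers

def run_group_by_count(splits, num_reducers=2):
    # Shuffle: each reducer receives the partial counts for its URLs.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for split in splits:
        for url, count in map_task(split).items():
            reducer_inputs[partition(url, num_reducers)].append((url, count))
    # Reduce side: tally partial counts into final results.
    result = {}
    for pairs in reducer_inputs:
        totals = Counter()
        for url, count in pairs:
            totals[url] += count
        result.update(totals)
    return result

# Two input splits, each handled by its own map task.
splits = [["/home", "/about", "/home"], ["/home", "/about"]]
counts = run_group_by_count(splits)
```

Because each map task ships one partial count per URL instead of one record per row, the shuffle carries far less data when keys repeat often.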
  16. 16. FROM (SELECT a.status, b.school, b.gender <br /> FROM status_updates a JOIN profiles b <br /> ON (a.userid = b.userid and <br /> a.ds='2009-03-20' )<br /> ) subq1<br />INSERT OVERWRITE TABLE gender_summary<br /> PARTITION(ds='2009-03-20')<br />SELECT subq1.gender, COUNT(1) <br />GROUP BY subq1.gender<br />INSERT OVERWRITE TABLE school_summary <br /> PARTITION(ds='2009-03-20')<br />SELECT subq1.school, COUNT(1)<br />GROUP BY subq1.school<br />
  17. 17. It Gets Quite Complicated!<br />
  18. 18. Behavior Extensibility<br />TRANSFORM scripts (any language)<br />Serialization+IPC overhead<br />User defined functions (Java)<br />In-process, lazy object evaluation<br />Pre/Post Hooks (Java)<br />Statement validation/execution<br />Example uses: auditing, replication, authorization, multiple clusters<br />
  19. 19. Map/Reduce Scripts Examples<br />add file page_url_to_id.py;<br />add file my_python_session_cutter.py;<br />FROM<br /> (SELECT TRANSFORM(user_id, page_url, unix_time)<br /> USING 'page_url_to_id.py'<br /> AS (user_id, page_id, unix_time)<br /> FROM mylog<br /> DISTRIBUTE BY user_id<br /> SORT BY user_id, unix_time) mylog2<br /> SELECT TRANSFORM(user_id, page_id, unix_time)<br /> USING 'my_python_session_cutter.py'<br /> AS (user_id, session_info);<br />
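The slides do not show the scripts themselves. As a rough sketch, a script such as `page_url_to_id.py` could look like the following, assuming Hive's default behaviour of passing rows to the script as tab-separated text and reading tab-separated rows back. The `PAGE_IDS` table and the `-1` marker for unknown URLs are hypothetical.

```python
# Hypothetical lookup table; a real script would load its own mapping.
PAGE_IDS = {"/home": 1, "/about": 2}

def transform_line(line):
    """Map one tab-separated input row to a tab-separated output row."""
    user_id, page_url, unix_time = line.rstrip("\n").split("\t")
    page_id = PAGE_IDS.get(page_url, -1)  # -1 marks an unknown URL (assumption)
    return "\t".join([user_id, str(page_id), unix_time])

def transform(lines):
    """Apply transform_line to every row of an input stream."""
    for line in lines:
        yield transform_line(line)

# As a TRANSFORM script this would loop over sys.stdin and print each result;
# here a list stands in for the input stream.
example = list(transform(["u1\t/home\t100\n", "u2\t/missing\t101\n"]))
```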
  20. 20. UDF vs UDAF vs UDTF<br />User Defined Function<br />One-to-one row mapping<br />Concat('foo', 'bar')<br />User Defined Aggregate Function<br />Many-to-one row mapping<br />Sum(num_ads)<br />User Defined Table Function<br />One-to-many row mapping<br />Explode([1,2,3])<br />
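The three row mappings can be mimicked with ordinary Python functions. These are analogues written for illustration only; they are not Hive's UDF, UDAF or UDTF interfaces.

```python
def concat(a, b):
    """UDF analogue: one input row produces exactly one output value."""
    return a + b

def total(values):
    """UDAF analogue: many input rows collapse into one output value."""
    acc = 0
    for v in values:
        acc += v
    return acc

def explode(items):
    """UDTF analogue: one input row produces many output rows."""
    for item in items:
        yield (item,)

rows_out = list(explode([1, 2, 3]))
```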
  21. 21. UDF Example<br />add jar build/ql/test/test-udfs.jar;<br />CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';<br />SELECT testlength(src.value) FROM src;<br />DROP TEMPORARY FUNCTION testlength;<br />UDFTestLength.java:<br />package org.apache.hadoop.hive.ql.udf; <br />import org.apache.hadoop.hive.ql.exec.UDF;<br />public class UDFTestLength extends UDF {<br /> public Integer evaluate(String s) {<br /> if (s == null) {<br /> return null;<br /> }<br /> return s.length();<br /> }<br />}<br />
  22. 22. Storage Extensibility<br />Input/OutputFormat: file formats<br />SequenceFile, RCFile, TextFile, …<br />SerDe: row formats<br />Thrift, JSON, ProtocolBuffer, …<br />Storage Handlers (new in 0.6)<br />Integrate foreign metadata, e.g. HBase<br />Indexing<br />Under development in 0.7<br />
  23. 23. Release 0.6<br />October 2010<br />Views<br />Multiple Databases<br />Dynamic Partitioning<br />Automatic Merge<br />New Join Strategies<br />Storage Handlers<br />
  24. 24. Dynamic Partitions<br />Automatically create partitions based on distinct values in columns<br />INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) <br />SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country<br />FROM page_view_stg pvs<br />
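One way to picture the dynamic partition insert: the static `dt` value is fixed, and the last selected column decides which `country` partition each row lands in. The sketch below groups rows that way; the function name and the three-column row shape are simplifications made for the example.

```python
from collections import defaultdict

def dynamic_partition(rows, static_dt):
    """Group rows by the value of the trailing dynamic-partition column."""
    partitions = defaultdict(list)
    for row in rows:
        *data, country = row  # the last SELECT column feeds the dynamic partition
        path = f"dt={static_dt}/country={country}"
        partitions[path].append(tuple(data))
    return dict(partitions)

rows = [(1, "u1", "US"), (2, "u2", "IN"), (3, "u3", "US")]
parts = dynamic_partition(rows, "2008-06-08")
```

Each distinct `country` value yields its own partition without the query naming it in advance.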
  25. 25. Automatic merge<br />Jobs can produce many files<br />Why is this bad?<br />Namenode pressure<br />Downstream jobs have to deal with file processing overhead<br />So, clean up by merging results into a few large files (configurable)<br />Use conditional map-only task to do this<br />
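The conditional nature of the merge step can be sketched as a small planning function: merge only when the job's output files are small on average, and pack them into a few larger groups. The parameter names and the greedy packing rule are assumptions for illustration and do not reflect Hive's actual settings or algorithm.

```python
def plan_merge(file_sizes, avg_size_threshold, target_size):
    """Return groups of files to merge, or [] if the files are large enough."""
    if not file_sizes:
        return []
    average = sum(file_sizes) / len(file_sizes)
    if average >= avg_size_threshold:
        return []  # condition not met: skip the merge task entirely
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        # Start a new group once adding this file would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

merge_groups = plan_merge([10, 20, 30, 40], avg_size_threshold=50, target_size=60)
```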
  26. 26. Join Strategies<br />Old Join Strategies<br />Map-reduce and Map Join<br />Bucketed map-join<br />Allows “small” table to be much bigger<br />Sort Merge Map Join<br />Deal with skew in map/reduce join<br />Conditional plan step for skewed keys<br />
  27. 27. Storage Handler Syntax<br />HBase Example<br />CREATE TABLE users(<br /> userid int, name string, email string, notes string)<br />STORED BY<br /> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' <br />WITH SERDEPROPERTIES (<br />"hbase.columns.mapping" = "small:name,small:email,large:notes")<br />TBLPROPERTIES (<br />"hbase.table.name" = "user_list");<br />
  28. 28. Release 0.7<br />Deployed in Facebook<br />Stats Functions<br />Indexes<br />Local Mode<br />Automatic Map Join<br />Multiple DISTINCTs<br />Archiving<br />In development<br />Concurrency Control<br />Stats Collection<br />JDBC/ODBC Enhancements<br />Authorization<br />RCFile2<br />Partitioned Views<br />Security Enhancements<br />
  29. 29. Statistical Functions<br />Stats 101<br />Stddev, var, covar<br />Percentile_approx<br />Data Mining<br />Ngrams, sentences (text analysis)<br />Histogram_numeric<br />SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus<br />
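The slides do not say how `histogram_numeric` works internally. One common way to build an approximate histogram in a single pass is to keep a bounded list of bins and merge the two closest bins whenever the limit is exceeded. The sketch below shows that idea under that assumption; it is not taken from Hive's source.

```python
def add_value(bins, x, max_bins):
    """Insert x into a sorted list of [center, count] bins, merging if needed."""
    bins.append([float(x), 1])
    bins.sort(key=lambda b: b[0])
    if len(bins) > max_bins:
        # Find the adjacent pair with the smallest gap between centers.
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (c1, n1), (c2, n2) = bins[i], bins[i + 1]
        # Replace the pair with one bin at their count-weighted mean.
        merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
        bins[i:i + 2] = [merged]
    return bins

def histogram(values, max_bins):
    """Build an approximate histogram with at most max_bins bins."""
    bins = []
    for v in values:
        add_value(bins, v, max_bins)
    return bins

hist = histogram([1980, 1981, 1990, 1991, 1970], max_bins=3)
```

The bins end up non-uniformly spaced, clustering where the data is dense, which suits skewed columns such as birth years.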
  30. 30. Histogram query results<br />“It’s complicated” peaks at 18-19, but lasts into late 40s!<br />“In a relationship” peaks at 20<br />“Engaged” peaks at 25<br />Married peaks in early 30s<br />More married than single at 28<br />Only teenagers use widowed?<br />
  35. 35. Pluggable Indexing<br />Reference implementation<br />Index is stored in a normal Hive table<br />Compact: distinct block addresses<br />Partition-level rebuild<br />Currently in R&D<br />Automatic use for WHERE, GROUP BY<br />New index types (e.g. bitmap, HBase)<br />
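A compact index of the kind described for the reference implementation can be pictured as a mapping from each indexed value to the distinct block addresses that contain it. The sketch below is a guess at the shape of that data, with hypothetical file names and offsets; it is not the layout Hive actually stores.

```python
from collections import defaultdict

def build_compact_index(blocks):
    """blocks: iterable of (file_name, block_offset, keys_in_block)."""
    index = defaultdict(set)
    for file_name, offset, keys in blocks:
        for key in keys:
            index[key].add((file_name, offset))  # a set keeps addresses distinct
    return {key: sorted(addresses) for key, addresses in index.items()}

def blocks_to_scan(index, key):
    """Return only the block addresses that can contain the given key."""
    return index.get(key, [])

# Hypothetical data: two files, three blocks.
blocks = [
    ("part-0", 0, ["a", "b", "a"]),
    ("part-0", 1024, ["c"]),
    ("part-1", 0, ["a"]),
]
index = build_compact_index(blocks)
```

A query filtering on one value would then read only the listed blocks instead of the whole table.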
  36. 36. Local Mode Execution<br />Avoids map/reduce cluster job latency<br />Good for jobs which process small amounts of data<br />Let Hive decide when to use it<br />set hive.exec.mode.local.auto=true;<br />Or force its usage<br />set mapred.job.tracker=local;<br />
  37. 37. Automatic Map Join<br />Map-Join if small table fits in memory<br />If it can’t, fall back to reduce join<br />Optimize hash table data structures<br />Use distributed cache to push out pre-filtered lookup table<br />Avoid swamping HDFS with reads from thousands of mappers<br />
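The decision described above can be sketched as a hash join that gives up when the small table is too large. The row-count limit stands in for a memory check and is an assumption, as are the function and parameter names.

```python
from collections import defaultdict

def map_join(big_rows, small_rows, max_small_rows):
    """Hash join with the small table held in memory; None signals fallback."""
    if len(small_rows) > max_small_rows:
        return None  # caller falls back to a reduce-side join
    # Build the in-memory lookup table from the small side.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    # Stream the big side past the lookup table, as each mapper would.
    joined = []
    for key, value in big_rows:
        for small_value in lookup.get(key, []):
            joined.append((key, value, small_value))
    return joined

big = [(1, "a"), (2, "b"), (3, "c")]
small = [(1, "x"), (3, "y")]
joined_rows = map_join(big, small, max_small_rows=10)
```

No shuffle is needed on this path because every mapper holds a full copy of the small table.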
  38. 38. Multiple DISTINCT Aggs<br />Example<br />SELECT<br /> view_date, <br /> COUNT(DISTINCT userid),<br /> COUNT(DISTINCT page_url)<br />FROM page_views<br />GROUP BY view_date <br />
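What the query computes can be written out directly: for each `view_date`, two independent sets of distinct values are tracked. This sketch only shows the result semantics on a single machine and says nothing about how Hive plans the query.

```python
from collections import defaultdict

def count_distincts(rows):
    """rows: iterable of (view_date, userid, page_url) tuples."""
    users = defaultdict(set)
    urls = defaultdict(set)
    for view_date, userid, page_url in rows:
        users[view_date].add(userid)
        urls[view_date].add(page_url)
    # One output row per group: (distinct users, distinct URLs).
    return {d: (len(users[d]), len(urls[d])) for d in users}

views = [
    ("2011-02-01", "u1", "/home"),
    ("2011-02-01", "u1", "/about"),
    ("2011-02-01", "u2", "/home"),
    ("2011-02-02", "u3", "/home"),
]
distinct_counts = count_distincts(views)
```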
  39. 39. Archiving<br />Use HAR (Hadoop archive format) to combine many files into a few<br />Relieves namenode memory<br />ALTER TABLE page_views<br />{ARCHIVE|UNARCHIVE}<br />PARTITION (ds='2010-10-30')<br />
  40. 40. Concurrency Control<br />Pluggable distributed lock manager<br />Default is ZooKeeper-based<br />Simple read/write locking<br />Table-level and partition-level<br />Implicit locking (statement level)<br />Deadlock-free via lock ordering<br />Explicit LOCK TABLE (global)<br />
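Lock ordering avoids deadlock because every statement acquires its locks in the same global order, so no two statements can each hold a lock the other is waiting for. The sketch below shows the idea with in-process locks. It simplifies read/write locks to exclusive ones, and the registry and function names are hypothetical rather than Hive's lock manager API.

```python
import threading
from contextlib import contextmanager

LOCKS = {}                      # lock name -> lock object (toy registry)
REGISTRY_GUARD = threading.Lock()

def _lock_for(name):
    """Fetch or create the lock for a table or partition name."""
    with REGISTRY_GUARD:
        return LOCKS.setdefault(name, threading.Lock())

@contextmanager
def acquire_all(names):
    """Acquire every named lock in sorted order, release in reverse order."""
    ordered = sorted(set(names))  # the global order that prevents deadlock
    held = []
    try:
        for name in ordered:
            lock = _lock_for(name)
            lock.acquire()
            held.append(lock)
        yield ordered
    finally:
        for lock in reversed(held):
            lock.release()
```

Two statements that name the same partitions in opposite orders still acquire them in the same sequence, so neither can block the other indefinitely.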
  41. 41. Statistics Collection<br />Implicit metastore update during load<br />Or explicit via ANALYZE TABLE<br />Table/partition-level<br />Number of rows<br />Number of files<br />Size in bytes<br />
  42. 42. Hive is now a TLP<br />PMC<br />Namit Jain (chair)<br />John Sichi<br />Zheng Shao<br />Edward Capriolo<br />Raghotham Murthy<br />Committers<br />Amareshwari Sriramadasu<br />Carl Steinbach<br />Paul Yang<br />He Yongqiang<br />Prasad Chakka<br />Joydeep Sen Sarma<br />Ashish Thusoo<br />Ning Zhang<br />
  43. 43. Developer Diversity<br />Recent Contributors<br />Facebook, Yahoo, Cloudera<br />Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems<br />Numerous research projects<br />Many, many more…<br />Monthly San Francisco Bay Area contributor meetups<br />India meetups?<br />
  44. 44. Roadmap: Heavy-Duty Tests<br />Unit tests are insufficient<br />What is needed:<br />Real-world schemas/queries<br />Non-toy data scales<br />Scripted setup; configuration matrix<br />Correctness/performance verification<br />Automatic reports: throughput, latency, profiles, coverage, perf counters…<br />
  45. 45. Roadmap: Shared Test Site <br />Nightly runs, regression alerting<br />Performance trending<br />Synthetic workload (e.g. TPC-H)<br />Real-world workload (anonymized?)<br />This is critical for<br />Non-subjective commit criteria<br />Release quality<br />
  46. 46. Roadmap: New Features<br />Hive Server Stability/Deployment<br />File Concatenation<br />Reduce Number of Files<br />Performance<br />Bloom Filters<br />Push Down Filters<br />Cost Based Optimizer<br />Column Level Statistics<br />Plan should be based on Statistics<br />
  47. 47. Resources<br />http://hive.apache.org<br />user/dev@hive.apache.org<br />njain@fb.com<br />Questions?<br />