Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
    Presentation Transcript

    • Hive Evolution
      Hadoop India Summit
      February 2011
      Namit Jain (Facebook)
    • Agenda
      Hive Overview
      Version 0.6 (released!)
      Version 0.7 (under development)
      Hive is now a TLP!
      Roadmaps
    • What is Hive?
      A Hadoop-based system for querying and managing structured data
      Uses Map/Reduce for execution
      Uses Hadoop Distributed File System (HDFS) for storage
    • Hive Origins
      Data explosion at Facebook
      Traditional DBMS technology could not keep up with the growth
      Hadoop to the rescue!
      Incubation with ASF, then became a Hadoop sub-project
      Now a top-level ASF project
    • SQL vs MapReduce
      hive> select key, count(1) from kv1 where key > 100 group by key;
      vs.
      $ cat > /tmp/reducer.sh
      uniq -c | awk '{print $2"\t"$1}'
      $ cat > /tmp/map.sh
      awk -F '\001' '{if($1 > 100) print $1}'
      $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
      $ bin/hadoop dfs -cat /tmp/largekey/part*
    • Hive Evolution
      Originally:
      a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
      Now more and more:
      A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
    • Intended Usage
      Web-scale Big Data
      100’s of terabytes
      Large Hadoop cluster
      100’s of nodes (heterogeneous OK)
      Data has a schema
      Batch jobs
      for both loads and queries
    • So Don’t Use Hive If…
      Your data is measured in GB
      You don’t want to impose a schema
      You need responses in seconds
      A “conventional” analytic DBMS can already do the job
      (and you can afford it)
      You don’t have a lot of time and smart people
    • Scaling Up
      Facebook warehouse, Jan 2011:
      2750 nodes
      30 petabytes disk space
      Data access per day:
      ~40 terabytes added (compressed)
      25000 map/reduce jobs
      300-400 users/month
    • Facebook Deployment
      Web Servers
      Scribe MidTier
      Scribe-Hadoop Clusters
      Hive
      Replication
      Production
      Hive-Hadoop
      Cluster
      Archival
      Hive-Hadoop
      Cluster
      Adhoc
      Hive-Hadoop
      Cluster
      Sharded MySQL
    • System Architecture
    • Data Model
    • Column Data Types
      Primitive Types
      integer types, float, string, boolean
      Nest-able Collections
      array<any-type>
      map<primitive-type, any-type>
      User-defined types
      structures with attributes which can be of any-type
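These types compose in DDL; a hypothetical table sketch (column names are illustrative, not from the talk):

```sql
-- Hypothetical table using Hive's nestable collection types
CREATE TABLE user_profiles (
  userid   INT,
  name     STRING,
  friends  ARRAY<STRING>,                    -- nestable collection
  props    MAP<STRING, STRING>,              -- primitive-keyed map
  address  STRUCT<city:STRING, zip:STRING>   -- structure with typed attributes
);
```

Fields are then addressed as `friends[0]`, `props['key']`, and `address.city` in queries.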
    • Hive Query Language
      DDL
      {create/alter/drop} {table/view/partition}
      create table as select
      DML
      Insert overwrite
      QL
      Sub-queries in from clause
      Equi-joins (including Outer joins)
      Multi-table Insert
      Sampling
      Lateral Views
      Interfaces
      JDBC/ODBC/Thrift
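A few of the DDL/DML forms listed above, sketched as hedged examples (table and view names are made up):

```sql
-- create table as select (CTAS)
CREATE TABLE top_urls AS
SELECT url, COUNT(1) AS hits FROM page_views GROUP BY url;

-- view over an equi-join with a sub-query in the FROM clause
CREATE VIEW recent_views AS
SELECT v.url, u.name
FROM (SELECT * FROM page_views WHERE ds = '2011-02-01') v
JOIN users u ON (v.userid = u.userid);

-- insert overwrite
INSERT OVERWRITE TABLE top_urls
SELECT url, COUNT(1) FROM page_views GROUP BY url;
```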
    • Query Translation Example
      SELECT url, count(*) FROM page_views GROUP BY url
      Map tasks compute partial counts for each URL in a hash table
      “map side” pre-aggregation
      map outputs are partitioned by URL and shipped to corresponding reducers
      Reduce tasks tally up partial counts to produce final results
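The map-side pre-aggregation step is governed by a Hive setting; a sketch assuming the configuration name of the period:

```sql
-- enable hash-based partial aggregation in the map tasks
set hive.map.aggr = true;

-- the translated query from the slide
SELECT url, count(*) FROM page_views GROUP BY url;
```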
    • FROM (SELECT a.status, b.school, b.gender
      FROM status_updates a JOIN profiles b
      ON (a.userid = b.userid and
      a.ds='2009-03-20' )
      ) subq1
      INSERT OVERWRITE TABLE gender_summary
      PARTITION(ds='2009-03-20')
      SELECT subq1.gender, COUNT(1)
      GROUP BY subq1.gender
      INSERT OVERWRITE TABLE school_summary
      PARTITION(ds='2009-03-20')
      SELECT subq1.school, COUNT(1)
      GROUP BY subq1.school
    • It Gets Quite Complicated!
    • Behavior Extensibility
      TRANSFORM scripts (any language)
      Serialization+IPC overhead
      User defined functions (Java)
      In-process, lazy object evaluation
      Pre/Post Hooks (Java)
      Statement validation/execution
      Example uses: auditing, replication, authorization, multiple clusters
    • Map/Reduce Scripts Examples
      add file page_url_to_id.py;
      add file my_python_session_cutter.py;
      FROM
      (SELECT TRANSFORM(user_id, page_url, unix_time)
      USING 'page_url_to_id.py'
      AS (user_id, page_id, unix_time)
      FROM mylog
      DISTRIBUTE BY user_id
      SORT BY user_id, unix_time) mylog2
      SELECT TRANSFORM(user_id, page_id, unix_time)
      USING 'my_python_session_cutter.py'
      AS (user_id, session_info);
    • UDF vs UDAF vs UDTF
      User Defined Function
      One-to-one row mapping
      Concat('foo', 'bar')

      User Defined Aggregate Function
      Many-to-one row mapping
      Sum(num_ads)
      User Defined Table Function
      One-to-many row mapping
      Explode([1,2,3])
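In queries, the three kinds look like this (the tables are illustrative):

```sql
SELECT concat('foo', 'bar') FROM src;                        -- UDF: one row in, one row out
SELECT view_date, sum(num_ads) FROM ads GROUP BY view_date;  -- UDAF: many rows in, one out
SELECT explode(array(1, 2, 3)) AS n FROM src;                -- UDTF: one row in, many out

-- UDTFs are often used via LATERAL VIEW to join generated rows back to the source row
SELECT t.n FROM src LATERAL VIEW explode(array(1, 2, 3)) t AS n;
```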
    • UDF Example
      add jar build/ql/test/test-udfs.jar;
      CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
      SELECT testlength(src.value) FROM src;
      DROP TEMPORARY FUNCTION testlength;
      UDFTestLength.java:
      package org.apache.hadoop.hive.ql.udf;
      public class UDFTestLength extends UDF {
        public Integer evaluate(String s) {
          if (s == null) {
            return null;
          }
          return s.length();
        }
      }
    • Storage Extensibility
      Input/OutputFormat: file formats
      SequenceFile, RCFile, TextFile, …
      SerDe: row formats
      Thrift, JSON, ProtocolBuffer, …
      Storage Handlers (new in 0.6)
      Integrate foreign metadata, e.g. HBase
      Indexing
      Under development in 0.7
    • Release 0.6
      October 2010
      Views
      Multiple Databases
      Dynamic Partitioning
      Automatic Merge
      New Join Strategies
      Storage Handlers
    • Dynamic Partitions
      Automatically create partitions based on distinct values in columns
      INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
      SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
      FROM page_view_stg pvs
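Dynamic partitioning has to be switched on first; assuming the configuration names from that era:

```sql
set hive.exec.dynamic.partition = true;
-- 'nonstrict' allows every partition column to be dynamic
set hive.exec.dynamic.partition.mode = nonstrict;
```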
    • Automatic merge
      Jobs can produce many files
      Why is this bad?
      Namenode pressure
      Downstream jobs have to deal with file processing overhead
      So, clean up by merging results into a few large files (configurable)
      Use conditional map-only task to do this
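The merge step is driven by configuration; a sketch assuming the usual property names:

```sql
-- merge small files produced by map-only jobs
set hive.merge.mapfiles = true;
-- merge small files produced by map-reduce jobs
set hive.merge.mapredfiles = true;
-- approximate target size for the merged files, in bytes
set hive.merge.size.per.task = 256000000;
```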
    • Join Strategies
      Old Join Strategies
      Map-reduce and Map Join
      Bucketed map-join
      Allows “small” table to be much bigger
      Sort Merge Map Join
      Deal with skew in map/reduce join
      Conditional plan step for skewed keys
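A map join can be requested with a query hint, and the bucketed variant enabled by configuration (table names are illustrative):

```sql
-- hint: build an in-memory hash table from the small table b in each mapper
SELECT /*+ MAPJOIN(b) */ a.key, b.value
FROM big_table a JOIN small_table b ON (a.key = b.key);

-- allow map joins over bucketed tables, loading one bucket at a time
set hive.optimize.bucketmapjoin = true;
```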
    • Storage Handler Syntax
      HBase Example
      CREATE TABLE users(
      userid int, name string, email string, notes string)
      STORED BY
      'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES (
      "hbase.columns.mapping" = "small:name,small:email,large:notes")
      TBLPROPERTIES (
      "hbase.table.name" = "user_list");
    • Release 0.7
      Deployed in Facebook
      Stats Functions
      Indexes
      Local Mode
      Automatic Map Join
      Multiple DISTINCTs
      Archiving
      In development
      Concurrency Control
      Stats Collection
      J/ODBC Enhancements
      Authorization
      RCFile2
      Partitioned Views
      Security Enhancements
    • Statistical Functions
      Stats 101
      Stddev, var, covar
      Percentile_approx
      Data Mining
      Ngrams, sentences (text analysis)
      Histogram_numeric
      SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
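The text-analysis functions compose with each other; a hedged sketch (the table and column are made up):

```sql
-- top 10 bigrams over tokenized status text
SELECT ngrams(sentences(lower(status)), 2, 10)
FROM status_updates;
```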
    • Histogram query results
      • “It’s complicated” peaks at 18-19, but lasts into late 40s!
      • “In a relationship” peaks at 20
      • “Engaged” peaks at 25
      • Married peaks in early 30s
      • More married than single at 28
      • Only teenagers use widowed?
    • Pluggable Indexing
      Reference implementation
      Index is stored in a normal Hive table
      Compact: distinct block addresses
      Partition-level rebuild
      Currently in R&D
      Automatic use for WHERE, GROUP BY
      New index types (e.g. bitmap, HBase)
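With the reference implementation, an index is declared and then rebuilt per partition; a sketch assuming the 0.7 syntax:

```sql
CREATE INDEX pv_url_idx ON TABLE page_views (url)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- partition-level rebuild materializes the index table
ALTER INDEX pv_url_idx ON page_views REBUILD;
```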
    • Local Mode Execution
      Avoids map/reduce cluster job latency
      Good for jobs which process small amounts of data
      Let Hive decide when to use it
      set hive.exec.mode.local.auto=true;
      Or force its usage
      set mapred.job.tracker=local;
    • Automatic Map Join
      Map-Join if small table fits in memory
      If it can’t, fall back to reduce join
      Optimize hash table data structures
      Use distributed cache to push out pre-filtered lookup table
      Avoid swamping HDFS with reads from thousands of mappers
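Assuming the 0.7 configuration name, the behavior is enabled with:

```sql
-- let Hive convert a reduce join into a map join when one side fits in memory
set hive.auto.convert.join = true;
```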
    • Multiple DISTINCT Aggs
      Example
      SELECT
      view_date,
      COUNT(DISTINCT userid),
      COUNT(DISTINCT page_url)
      FROM page_views
      GROUP BY view_date
    • Archiving
      Use HAR (Hadoop archive format) to combine many files into a few
      Relieves namenode memory
      ALTER TABLE page_views
      {ARCHIVE|UNARCHIVE}
      PARTITION (ds='2010-10-30')
    • Concurrency Control
      Pluggable distributed lock manager
      Default is Zookeeper-based
      Simple read/write locking
      Table-level and partition-level
      Implicit locking (statement level)
      Deadlock-free via lock ordering
      Explicit LOCK TABLE (global)
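Locking is switched on via configuration, with explicit locks available on top of the implicit ones; a hedged sketch (the quorum hosts are hypothetical):

```sql
set hive.support.concurrency = true;
set hive.zookeeper.quorum = zk1,zk2,zk3;  -- hypothetical ZooKeeper hosts

LOCK TABLE page_views EXCLUSIVE;
SHOW LOCKS;
UNLOCK TABLE page_views;
```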
    • Statistics Collection
      Implicit metastore update during load
      Or explicit via ANALYZE TABLE
      Table/partition-level
      Number of rows
      Number of files
      Size in bytes
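The explicit path looks like this (partition value illustrative):

```sql
-- gather row count, file count, and size in bytes for one partition
ANALYZE TABLE page_views PARTITION (ds='2010-10-30')
COMPUTE STATISTICS;
```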
    • Hive is now a TLP
      PMC
      Namit Jain (chair)
      John Sichi
      Zheng Shao
      Edward Capriolo
      Raghotham Murthy
      Committers
      Amareshwari Sriramadasu
      Carl Steinbach
      Paul Yang
      He Yongqiang
      Prasad Chakka
      Joydeep Sen Sarma
      Ashish Thusoo
      Ning Zhang
    • Developer Diversity
      Recent Contributors
      Facebook, Yahoo, Cloudera
      Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems
      Numerous research projects
      Many many more…
      Monthly San Francisco bay area contributor meetups
      India meetups?
    • Roadmap: Heavy-Duty Tests
      Unit tests are insufficient
      What is needed:
      Real-world schemas/queries
      Non-toy data scales
      Scripted setup; configuration matrix
      Correctness/performance verification
      Automatic reports: throughput, latency, profiles, coverage, perf counters…
    • Roadmap: Shared Test Site
      Nightly runs, regression alerting
      Performance trending
      Synthetic workload (e.g. TPC-H)
      Real-world workload (anonymized?)
      This is critical for
      Non-subjective commit criteria
      Release quality
    • Roadmap: New Features
      Hive Server Stability/Deployment
      File Concatenation
      Reduce Number of Files
      Performance
      Bloom Filters
      Push Down Filters
      Cost Based Optimizer
      Column Level Statistics
      Plan should be based on Statistics
    • Resources
      http://hive.apache.org
      user/dev@hive.apache.org
      njain@fb.com
      Questions?