Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Transcript

  • 1. Hive Evolution
    Hadoop India Summit
    February 2011
    Namit Jain (Facebook)
  • 2. Agenda
    Hive Overview
    Version 0.6 (released!)
    Version 0.7 (under development)
    Hive is now a TLP!
    Roadmaps
  • 3. What is Hive?
    A Hadoop-based system for querying and managing structured data
    Uses Map/Reduce for execution
    Uses Hadoop Distributed File System (HDFS) for storage
  • 4. Hive Origins
    Data explosion at Facebook
    Traditional DBMS technology could not keep up with the growth
    Hadoop to the rescue!
    Incubation with ASF, then became a Hadoop sub-project
    Now a top-level ASF project
  • 5. SQL vs MapReduce
    hive> select key, count(1) from kv1 where key > 100 group by key;
    vs.
    $ cat > /tmp/reducer.sh
    uniq -c | awk '{print $2"\t"$1}'
    $ cat > /tmp/map.sh
    awk -F '\001' '{if($1 > 100) print $1}'
    $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
    $ bin/hadoop dfs -cat /tmp/largekey/part*
  • 6. Hive Evolution
    Originally:
    a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
    Now more and more:
    A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
  • 7. Intended Usage
    Web-scale Big Data
    100s of terabytes
    Large Hadoop cluster
    100s of nodes (heterogeneous OK)
    Data has a schema
    Batch jobs
    for both loads and queries
  • 8. So Don’t Use Hive If…
    Your data is measured in GB
    You don’t want to impose a schema
    You need responses in seconds
    A “conventional” analytic DBMS can already do the job
    (and you can afford it)
    You don’t have a lot of time and smart people
  • 9. Scaling Up
    Facebook warehouse, Jan 2011:
    2750 nodes
    30 petabytes disk space
    Data access per day:
    ~40 terabytes added (compressed)
    25000 map/reduce jobs
    300-400 users/month
  • 10. Facebook Deployment
    (Deployment diagram; components shown:)
    Web Servers
    Scribe MidTier
    Scribe-Hadoop Clusters
    Hive Replication
    Production Hive-Hadoop Cluster
    Archival Hive-Hadoop Cluster
    Adhoc Hive-Hadoop Cluster
    Sharded MySQL
  • 11. System Architecture
  • 12. Data Model
  • 13. Column Data Types
    Primitive Types
    integer types, float, string, boolean
    Nest-able Collections
    array<any-type>
    map<primitive-type, any-type>
    User-defined types
    structures with attributes which can be of any-type
  • 14. Hive Query Language
    DDL
    {create/alter/drop} {table/view/partition}
    create table as select
    DML
    Insert overwrite
    QL
    Sub-queries in from clause
    Equi-joins (including Outer joins)
    Multi-table Insert
    Sampling
    Lateral Views
    Interfaces
    JDBC/ODBC/Thrift
  • 15. Query Translation Example
    SELECT url, count(*) FROM page_views GROUP BY url
    Map tasks compute partial counts for each URL in a hash table
    “map side” pre-aggregation
    map outputs are partitioned by URL and shipped to corresponding reducers
    Reduce tasks tally up partial counts to produce final results
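
The three steps on this slide can be simulated in plain Java. This is only a rough in-memory sketch, assuming each "split" is a list of URLs; the class and method names (`GroupByCountSketch`, `mapTask`, `reducerFor`, `run`) are hypothetical and are not Hive's operator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation; not Hive's actual operator code.
final class GroupByCountSketch {

    // "Map side": one task pre-aggregates the URLs of its own split in a hash table.
    static Map<String, Long> mapTask(List<String> split) {
        Map<String, Long> partial = new HashMap<>();
        for (String url : split) {
            partial.merge(url, 1L, Long::sum);
        }
        return partial;
    }

    // Shuffle: every occurrence of a URL is routed to the same reducer.
    static int reducerFor(String url, int numReducers) {
        return Math.floorMod(url.hashCode(), numReducers);
    }

    // Full pipeline: map, partition by URL, then reduce by summing partial counts.
    static Map<String, Long> run(List<List<String>> splits, int numReducers) {
        List<Map<String, Long>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new HashMap<>());
        }
        for (List<String> split : splits) {
            for (Map.Entry<String, Long> e : mapTask(split).entrySet()) {
                reducers.get(reducerFor(e.getKey(), numReducers))
                        .merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        // Collect reducer outputs; a TreeMap keeps the printed order stable.
        Map<String, Long> result = new TreeMap<>();
        for (Map<String, Long> r : reducers) {
            result.putAll(r);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("/home", "/about", "/home"),
                Arrays.asList("/home", "/about"));
        System.out.println(run(splits, 2)); // {/about=2, /home=3}
    }
}
```

The first split ships two partial counts instead of three rows, which is the saving that map-side pre-aggregation is after.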
  • 16. FROM (SELECT a.status, b.school, b.gender
    FROM status_updates a JOIN profiles b
    ON (a.userid = b.userid and
    a.ds='2009-03-20' )
    ) subq1
    INSERT OVERWRITE TABLE gender_summary
    PARTITION(ds='2009-03-20')
    SELECT subq1.gender, COUNT(1)
    GROUP BY subq1.gender
    INSERT OVERWRITE TABLE school_summary
    PARTITION(ds='2009-03-20')
    SELECT subq1.school, COUNT(1)
    GROUP BY subq1.school
  • 17. It Gets Quite Complicated!
  • 18. Behavior Extensibility
    TRANSFORM scripts (any language)
    Serialization+IPC overhead
    User defined functions (Java)
    In-process, lazy object evaluation
    Pre/Post Hooks (Java)
    Statement validation/execution
    Example uses: auditing, replication, authorization, multiple clusters
  • 19. Map/Reduce Scripts Examples
    add file page_url_to_id.py;
    add file my_python_session_cutter.py;
    FROM
    (SELECT TRANSFORM(user_id, page_url, unix_time)
    USING 'page_url_to_id.py'
    AS (user_id, page_id, unix_time)
    FROM mylog
    DISTRIBUTE BY user_id
    SORT BY user_id, unix_time) mylog2
    SELECT TRANSFORM(user_id, page_id, unix_time)
    USING 'my_python_session_cutter.py'
    AS (user_id, session_info);
  • 20. UDF vs UDAF vs UDTF
    User Defined Function
    One-to-one row mapping
    Concat('foo', 'bar')
    User Defined Aggregate Function
    Many-to-one row mapping
    Sum(num_ads)
    User Defined Table Function
    One-to-many row mapping
    Explode([1,2,3])
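
The three row mappings can be pictured with ordinary Java methods. This is an analogy only: real Hive functions extend Hive's own base classes, and the class name `RowMappingSketch` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java analogy only; real Hive functions extend Hive's own base classes.
final class RowMappingSketch {

    // UDF-like: one input row yields exactly one output value.
    static String concat(String a, String b) {
        return a + b;
    }

    // UDAF-like: many input rows collapse into one output value.
    static long sum(List<Integer> numAds) {
        long total = 0;
        for (int n : numAds) {
            total += n;
        }
        return total;
    }

    // UDTF-like: one input row (holding an array) expands into many rows.
    static List<Integer> explode(int[] values) {
        List<Integer> rows = new ArrayList<>();
        for (int v : values) {
            rows.add(v);
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(concat("foo", "bar"));          // foobar
        System.out.println(sum(Arrays.asList(3, 4, 5)));   // 12
        System.out.println(explode(new int[] {1, 2, 3}));  // [1, 2, 3]
    }
}
```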
  • 21. UDF Example
    add jar build/ql/test/test-udfs.jar;
    CREATE TEMPORARY FUNCTION testlength AS 'org.apache.hadoop.hive.ql.udf.UDFTestLength';
    SELECT testlength(src.value) FROM src;
    DROP TEMPORARY FUNCTION testlength;
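    UDFTestLength.java:
    package org.apache.hadoop.hive.ql.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;

    // Returns the length of a string, or NULL for NULL input.
    public class UDFTestLength extends UDF {
      public Integer evaluate(String s) {
        if (s == null) {
          return null;
        }
        return s.length();
      }
    }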
  • 22. Storage Extensibility
    Input/OutputFormat: file formats
    SequenceFile, RCFile, TextFile, …
    SerDe: row formats
    Thrift, JSON, ProtocolBuffer, …
    Storage Handlers (new in 0.6)
    Integrate foreign metadata, e.g. HBase
    Indexing
    Under development in 0.7
  • 23. Release 0.6
    October 2010
    Views
    Multiple Databases
    Dynamic Partitioning
    Automatic Merge
    New Join Strategies
    Storage Handlers
  • 24. Dynamic Partitions
    Automatically create partitions based on distinct values in columns
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
    FROM page_view_stg pvs
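
To make the idea concrete, here is a small Java sketch of what the statement above asks for: `dt` is fixed, and one partition is created per distinct value of the trailing `country` column. The class name, the three-column row layout, and the sample rows are assumptions for illustration; this is not Hive's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of dynamic partitioning; not Hive's implementation.
final class DynamicPartitionSketch {

    // Groups rows by their last column (the dynamic partition column) and
    // returns one bucket per distinct value, keyed by a partition spec.
    static Map<String, List<String[]>> partition(String staticDt, List<String[]> rows) {
        Map<String, List<String[]>> partitions = new TreeMap<>();
        for (String[] row : rows) {
            String country = row[row.length - 1];
            String spec = "dt=" + staticDt + "/country=" + country;
            // The partition column is carried by the spec, not stored in the row.
            String[] data = Arrays.copyOf(row, row.length - 1);
            partitions.computeIfAbsent(spec, k -> new ArrayList<>()).add(data);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> staged = Arrays.asList(
                new String[] {"u1", "/home", "US"},
                new String[] {"u2", "/about", "IN"},
                new String[] {"u3", "/home", "US"});
        partition("2008-06-08", staged).forEach(
                (spec, bucket) -> System.out.println(spec + " -> " + bucket.size() + " rows"));
    }
}
```

With the sample rows, two partitions result: one for `country=IN` with one row and one for `country=US` with two rows.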
  • 25. Automatic merge
    Jobs can produce many files
    Why is this bad?
    Namenode pressure
    Downstream jobs have to deal with file processing overhead
    So, clean up by merging results into a few large files (configurable)
    Use conditional map-only task to do this
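
The "conditional" part can be read as a check on the job's output files before any merge task is launched. The sketch below assumes the check compares average file size with a threshold; that rule, the parameter names, and the class name are illustrative assumptions rather than documented Hive behavior.

```java
// Hypothetical sketch of the merge decision; the threshold is an assumed
// parameter, not a documented Hive setting.
final class MergeDecisionSketch {

    // Returns true when a job left several files whose average size is below
    // the threshold, i.e. when a follow-up map-only merge task is worthwhile.
    static boolean shouldMerge(long[] fileSizesBytes, long minAvgSizeBytes) {
        if (fileSizesBytes.length <= 1) {
            return false; // nothing to merge
        }
        long total = 0;
        for (long size : fileSizesBytes) {
            total += size;
        }
        return total / fileSizesBytes.length < minAvgSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(shouldMerge(new long[] {2 * mb, 3 * mb, 1 * mb}, 16 * mb)); // true
        System.out.println(shouldMerge(new long[] {200 * mb, 300 * mb}, 16 * mb));     // false
    }
}
```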
  • 26. Join Strategies
    Old Join Strategies
    Map-reduce and Map Join
    Bucketed map-join
    Allows “small” table to be much bigger
    Sort Merge Map Join
    Deal with skew in map/reduce join
    Conditional plan step for skewed keys
  • 27. Storage Handler Syntax
    HBase Example
    CREATE TABLE users(
    userid int, name string, email string, notes string)
    STORED BY
    'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
    "hbase.columns.mapping" = "small:name,small:email,large:notes")
    TBLPROPERTIES (
    "hbase.table.name" = "user_list");
  • 28. Release 0.7
    Deployed in Facebook
    Stats Functions
    Indexes
    Local Mode
    Automatic Map Join
    Multiple DISTINCTs
    Archiving
    In development
    Concurrency Control
    Stats Collection
    JDBC/ODBC Enhancements
    Authorization
    RCFile2
    Partitioned Views
    Security Enhancements
  • 29. Statistical Functions
    Stats 101
    Stddev, var, covar
    Percentile_approx
    Data Mining
    Ngrams, sentences (text analysis)
    Histogram_numeric
    SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
  • 30. Histogram query results
    • “It’s complicated” peaks at 18-19, but lasts into late 40s!
    • “In a relationship” peaks at 20
    • “Engaged” peaks at 25
    • Married peaks in early 30s
    • More married than single at 28
    • Only teenagers use widowed?
  • Pluggable Indexing
    Reference implementation
    Index is stored in a normal Hive table
    Compact: distinct block addresses
    Partition-level rebuild
    Currently in R&D
    Automatic use for WHERE, GROUP BY
    New index types (e.g. bitmap, HBase)
  • 36. Local Mode Execution
    Avoids map/reduce cluster job latency
    Good for jobs which process small amounts of data
    Let Hive decide when to use it
    set hive.exec.mode.local.auto=true;
    Or force its usage
    set mapred.job.tracker=local;
  • 37. Automatic Map Join
    Map-Join if small table fits in memory
    If it can’t, fall back to reduce join
    Optimize hash table data structures
    Use distributed cache to push out pre-filtered lookup table
    Avoid swamping HDFS with reads from thousands of mappers
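
A minimal Java sketch of the decision and of the map join itself follows. It assumes a row-count limit as a stand-in for the memory check; the class name, method names, and the limit are hypothetical and do not correspond to Hive settings or code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the map-join decision; the row-count limit stands in
// for a memory check and is not a real Hive setting.
final class MapJoinSketch {

    // Returns "map-join" when the small table is within the limit,
    // otherwise "reduce-join" to signal the fallback plan.
    static String chooseStrategy(int smallTableRows, int maxRowsInMemory) {
        return smallTableRows <= maxRowsInMemory ? "map-join" : "reduce-join";
    }

    // Map join: build a hash table from the small table once, then stream the
    // big table through it. No shuffle or reduce phase is needed.
    static List<String> mapJoin(Map<Integer, String> smallTable, List<int[]> bigTable) {
        Map<Integer, String> lookup = new HashMap<>(smallTable);
        List<String> joined = new ArrayList<>();
        for (int[] row : bigTable) {            // row = {userid, pageId}
            String name = lookup.get(row[0]);
            if (name != null) {                 // inner join: drop non-matches
                joined.add(name + ":" + row[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> users = new HashMap<>();
        users.put(1, "ann");
        users.put(2, "bob");
        List<int[]> views = Arrays.asList(
                new int[] {1, 10}, new int[] {3, 30}, new int[] {2, 20});
        System.out.println(chooseStrategy(users.size(), 1000)); // map-join
        System.out.println(mapJoin(users, views));              // [ann:10, bob:20]
    }
}
```

In a real cluster the lookup table would be shipped to every mapper once (the slide mentions the distributed cache) instead of being read from HDFS by each of them.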
  • 38. Multiple DISTINCT Aggs
    Example
    SELECT
    view_date,
    COUNT(DISTINCT userid),
    COUNT(DISTINCT page_url)
    FROM page_views
    GROUP BY view_date
  • 39. Archiving
    Use HAR (Hadoop archive format) to combine many files into a few
    Relieves namenode memory
    ALTER TABLE page_views
    {ARCHIVE|UNARCHIVE}
    PARTITION (ds='2010-10-30')
  • 40. Concurrency Control
    Pluggable distributed lock manager
    Default is ZooKeeper-based
    Simple read/write locking
    Table-level and partition-level
    Implicit locking (statement level)
    Deadlock-free via lock ordering
    Explicit LOCK TABLE (global)
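
"Deadlock-free via lock ordering" can be illustrated with a few lines of Java: if every statement sorts the objects it needs before locking them, no two statements can hold locks in opposite orders. The sketch assumes lock targets are identified by name strings; the class and method names are hypothetical and this is not Hive's or ZooKeeper's lock manager API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of ordered lock acquisition; the names are illustrative
// and this is not Hive's or ZooKeeper's lock manager API.
final class LockOrderSketch {

    // Deduplicates and sorts the objects a statement needs. Acquiring locks
    // in this one global order rules out circular waits between statements.
    static List<String> acquisitionOrder(List<String> neededObjects) {
        return new ArrayList<>(new TreeSet<>(neededObjects));
    }

    public static void main(String[] args) {
        // Two statements name the same objects in different orders...
        List<String> first = acquisitionOrder(Arrays.asList("users", "page_views"));
        List<String> second = acquisitionOrder(Arrays.asList("page_views", "users"));
        // ...but both end up locking page_views before users.
        System.out.println(first);  // [page_views, users]
        System.out.println(second); // [page_views, users]
    }
}
```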
  • 41. Statistics Collection
    Implicit metastore update during load
    Or explicit via ANALYZE TABLE
    Table/partition-level
    Number of rows
    Number of files
    Size in bytes
  • 42. Hive is now a TLP
    PMC
    Namit Jain (chair)
    John Sichi
    Zheng Shao
    Edward Capriolo
    Raghotham Murthy
    Committers
    Amareshwari Sriramadasu
    Carl Steinbach
    Paul Yang
    He Yongqiang
    Prasad Chakka
    Joydeep Sen Sarma
    Ashish Thusoo
    Ning Zhang
  • 43. Developer Diversity
    Recent Contributors
    Facebook, Yahoo, Cloudera
    Netflix, Amazon, Media6Degrees, Intuit, Persistent Systems
    Numerous research projects
    Many many more…
    Monthly San Francisco bay area contributor meetups
    India meetups?
  • 44. Roadmap: Heavy-Duty Tests
    Unit tests are insufficient
    What is needed:
    Real-world schemas/queries
    Non-toy data scales
    Scripted setup; configuration matrix
    Correctness/performance verification
    Automatic reports: throughput, latency, profiles, coverage, perf counters…
  • 45. Roadmap: Shared Test Site
    Nightly runs, regression alerting
    Performance trending
    Synthetic workload (e.g. TPC-H)
    Real-world workload (anonymized?)
    This is critical for
    Non-subjective commit criteria
    Release quality
  • 46. Roadmap: New Features
    Hive Server Stability/Deployment
    File Concatenation
    Reduce Number of Files
    Performance
    Bloom Filters
    Push Down Filters
    Cost Based Optimizer
    Column Level Statistics
    Plan should be based on Statistics
  • 47. Resources
    http://hive.apache.org
    user@hive.apache.org, dev@hive.apache.org
    njain@fb.com
    Questions?