Hive Evolution: ApacheCon NA 2010

Presentation on Hive by John Sichi at ApacheCon NA 2010.

    Hive Evolution: ApacheCon NA 2010 Presentation Transcript

    • Hive Evolution: A Progress Report
      • November 2010
      • John Sichi (Facebook)
    • Agenda
      • Hive Overview
      • Version 0.6 (just released!)
      • Version 0.7 (under development)
      • Hive is now a TLP!
      • Roadmaps
    • What is Hive?
      • A Hadoop-based system for querying and managing structured data
        • Uses Map/Reduce for execution
        • Uses Hadoop Distributed File System (HDFS) for storage
    • Hive Origins
      • Data explosion at Facebook
      • Traditional DBMS technology could not keep up with the growth
      • Hadoop to the rescue!
      • Incubation with ASF, then became a Hadoop sub-project
      • Now a top-level ASF project
    • Hive Evolution
      • Originally:
        • a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs
      • Now more and more:
        • A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
    • Intended Usage
      • Web-scale Big Data
        • 100’s of terabytes
      • Large Hadoop cluster
        • 100’s of nodes (heterogeneous OK)
      • Data has a schema
      • Batch jobs
        • for both loads and queries
    • So Don’t Use Hive If…
      • Your data is measured in GB
      • You don’t want to impose a schema
      • You need responses in seconds
      • A “conventional” analytic DBMS can already do the job
        • (and you can afford it)
      • You don’t have a lot of time and smart people
    • Scaling Up
      • Facebook warehouse, July 2010:
        • 2250 nodes
        • 36 petabytes disk space
      • Data access per day:
        • 80 to 90 terabytes added (uncompressed)
        • 25000 map/reduce jobs
      • 300-400 users/month
    • Facebook Deployment (diagram)
      • Web Servers
      • Scribe MidTier
      • Production Hive-Hadoop Cluster
      • Sharded MySQL
      • Scribe-Hadoop Clusters
      • Adhoc Hive-Hadoop Cluster
      • Hive replication
    • Hive Architecture (diagram)
      • Metastore
      • Query Engine
      • CLI
      • Hive Thrift API
      • Metastore Thrift API
      • JDBC/ODBC clients
      • Hadoop Map/Reduce + HDFS Clusters
      • Web Management Console
    • Physical Data Model (diagram)
      • Table
        • clicks
      • Partitions (possibly multi-level)
        • ds='2010-10-28'
        • ds='2010-10-29'
        • ds='2010-10-30'
      • HDFS Files (possibly as hash buckets)
    • Map/Reduce Plans (diagram)
      • Input Files
      • Map Tasks
      • Reduce Tasks
      • Splits
      • Result Files
    • Query Translation Example
      • SELECT url, count(*) FROM page_views GROUP BY url
      • Map tasks compute partial counts for each URL in a hash table
        • “map side” preaggregation
        • map outputs are partitioned by URL and shipped to corresponding reducers
      • Reduce tasks tally up partial counts to produce final results
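
The translation steps above can be sketched in plain Python. This is only an illustration of the idea, not Hive's actual operator code: the function names, the in-memory "splits", and the use of Python's built-in `hash` as the partitioner are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def map_task(split):
    """Map side: pre-aggregate one input split into partial counts per URL."""
    return Counter(row["url"] for row in split)

def shuffle(partials, num_reducers):
    """Partition map outputs by URL so each URL lands on exactly one reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for url, count in partial.items():
            buckets[hash(url) % num_reducers][url].append(count)
    return buckets

def reduce_task(bucket):
    """Reduce side: tally the partial counts received for each URL."""
    return {url: sum(counts) for url, counts in bucket.items()}

# Two hypothetical input splits of the page_views table.
splits = [
    [{"url": "/home"}, {"url": "/about"}, {"url": "/home"}],
    [{"url": "/home"}, {"url": "/about"}],
]
partials = [map_task(s) for s in splits]
result = {}
for bucket in shuffle(partials, num_reducers=2):
    result.update(reduce_task(bucket))
```

Because each mapper ships one partial count per distinct URL instead of one record per row, the shuffle volume shrinks when URLs repeat within a split.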
    • It Gets Quite Complicated!
    • Behavior Extensibility
      • TRANSFORM scripts (any language)
        • Serialization+IPC overhead
      • User defined functions (Java)
        • In-process, lazy object evaluation
      • Pre/Post Hooks (Java)
        • Statement validation/execution
        • Example uses: auditing, replication, authorization
    • UDF vs UDAF vs UDTF
      • User Defined Function
          • One-to-one row mapping
          • Concat(‘foo’, ‘bar’)
      • User Defined Aggregate Function
          • Many-to-one row mapping
          • Sum(num_ads)
      • User Defined Table Function
          • One-to-many row mapping
          • Explode([1,2,3])
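
The three row mappings can be illustrated with ordinary Python functions. These are stand-ins named after the slide's examples, not Hive's Java UDF/UDAF/UDTF interfaces.

```python
def concat(a, b):
    """UDF shape: one input row produces one output value."""
    return a + b

def total(values):
    """UDAF shape: many input rows are folded into one output value."""
    acc = 0
    for v in values:
        acc += v
    return acc

def explode(array):
    """UDTF shape: one input row produces many output rows."""
    for element in array:
        yield (element,)

udf_result = concat("foo", "bar")
udaf_result = total([3, 5, 2])
udtf_result = list(explode([1, 2, 3]))
```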
    • Storage Extensibility
      • Input/OutputFormat: file formats
        • SequenceFile, RCFile, TextFile, …
      • SerDe: row formats
        • Thrift, JSON, ProtocolBuffer, …
      • Storage Handlers (new in 0.6)
        • Integrate foreign metadata, e.g. HBase
      • Indexing
        • Under development in 0.7
    • Release 0.6
      • October 2010
        • Views
        • Multiple Databases
        • Dynamic Partitioning
        • Automatic Merge
        • New Join Strategies
        • Storage Handlers
    • Views: Syntax
        • CREATE VIEW [IF NOT EXISTS] view_name
        • [ (column_name [COMMENT column_comment], … ) ]
        • [COMMENT 'view_comment']
        • AS SELECT …
        • [ ORDER BY … LIMIT … ]
    • Views: Usage
      • Use Cases
        • Column/table renaming
        • Encapsulate complex query logic
        • Security (Future)
      • Limitations
        • Read-only
        • Obscures partition metadata from underlying tables
        • No dependency management
    • Multiple Databases
      • Follows MySQL convention
        • CREATE DATABASE [IF NOT EXISTS] db_name [COMMENT 'db_comment']
        • USE db_name
      • Logical namespace for tables
      • ‘default’ database is still there
      • Does not yet support queries across multiple databases
    • Dynamic Partitions: Syntax
      • Example
        • INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country )
        • SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
        • FROM page_view_stg pvs
    • Dynamic Partitions: Usage
      • Automatically create partitions based on distinct values in columns
      • Works as rudimentary indexing
        • Prune partitions via WHERE clause
      • But be careful…
        • Don’t create too many partitions!
        • Configuration parameters can be used to prevent accidents
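
A rough Python model of the behavior described above: rows are routed to a partition per distinct column value, a cap guards against creating too many partitions, and a predicate on the partition key prunes what a later query reads. The names `dynamic_partition_insert`, `max_partitions`, and `prune` are hypothetical and do not correspond to actual Hive configuration parameters.

```python
from collections import defaultdict

class TooManyPartitions(Exception):
    """Raised when an insert would exceed the configured partition cap."""

def dynamic_partition_insert(rows, partition_col, max_partitions):
    """Create one partition per distinct value found in partition_col."""
    partitions = defaultdict(list)
    for row in rows:
        key = row[partition_col]
        if key not in partitions and len(partitions) >= max_partitions:
            raise TooManyPartitions(f"more than {max_partitions} partitions")
        partitions[key].append(row)
    return dict(partitions)

def prune(partitions, predicate):
    """Keep only partitions whose key satisfies the WHERE-style predicate."""
    return {k: v for k, v in partitions.items() if predicate(k)}

rows = [
    {"userid": 1, "country": "US"},
    {"userid": 2, "country": "CA"},
    {"userid": 3, "country": "US"},
]
partitions = dynamic_partition_insert(rows, "country", max_partitions=10)
us_only = prune(partitions, lambda country: country == "US")
```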
    • Automatic merge
      • Jobs can produce many files
      • Why is this bad?
        • Namenode pressure
        • Downstream jobs have to deal with file processing overhead
      • So, clean up by merging results into a few large files (configurable)
        • Use conditional map-only task to do this
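
One way to picture the conditional merge step, assuming (hypothetically) that the condition is "average output file size is below a threshold" and that merging greedily packs files up to a target size. The thresholds and function names are illustrative, not Hive settings.

```python
def needs_merge(file_sizes, avg_size_threshold):
    """Condition for the extra step: several files whose average size is small."""
    if len(file_sizes) <= 1:
        return False
    return sum(file_sizes) / len(file_sizes) < avg_size_threshold

def plan_merge(file_sizes, target_size):
    """Greedily group files so each merged file stays within target_size."""
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

sizes = [10, 20, 30, 40, 50]  # sizes of a job's output files, arbitrary units
merge_groups = plan_merge(sizes, target_size=60) if needs_merge(sizes, 100) else []
```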
    • Join Strategies Before 0.6
      • Map/reduce join
        • Map tasks partition inputs on join keys and ship to corresponding reducers
        • Reduce tasks perform sort-merge-join
      • Map-join
        • Each mapper builds lookup hashtable from copy of small table
        • Then hash-join the splits of big table
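
The map-join described above is essentially an in-memory hash join run inside each mapper. A minimal sketch, assuming rows are Python dicts and an inner join on a single key; the table contents are made up for the example.

```python
def build_lookup(small_table, key):
    """Each mapper builds a hash table from its copy of the small table."""
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[key], []).append(row)
    return lookup

def map_join(big_split, lookup, key):
    """Probe the hash table with every row from one split of the big table."""
    for big_row in big_split:
        for small_row in lookup.get(big_row[key], []):
            yield {**small_row, **big_row}

users = [{"userid": 1, "name": "ann"}, {"userid": 2, "name": "bob"}]
page_views_split = [
    {"userid": 1, "url": "/a"},
    {"userid": 3, "url": "/b"},
    {"userid": 1, "url": "/c"},
]
joined = list(map_join(page_views_split, build_lookup(users, "userid"), "userid"))
```

No reduce phase is needed because every mapper can resolve matches locally, which is why the small table has to fit in memory.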
    • New Join Strategies
      • Bucketed map-join
        • Each mapper filters its lookup table by the bucketing hash function
        • Allows “small” table to be much bigger
      • Sorted merge in map-join
        • Requires presorted input tables
      • Deal with skew in map/reduce join
        • Conditional plan step for skew keys (after main map/reduce join step)
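
A sketch of the bucketed map-join idea, under the simplifying assumptions that both tables are bucketed on the join key with the same bucket count and that keys are integers (so `key % num_buckets` can stand in for the bucketing hash function). Each simulated mapper loads only the small-table bucket that matches its big-table bucket.

```python
def bucket_of(key, num_buckets):
    """Stand-in bucketing function; assumes integer join keys."""
    return key % num_buckets

def bucketize(table, key, num_buckets):
    """Split a table into hash buckets on the join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in table:
        buckets[bucket_of(row[key], num_buckets)].append(row)
    return buckets

def bucketed_map_join(big_buckets, small_buckets, key):
    """One 'mapper' per big-table bucket, each loading one small-table bucket."""
    joined = []
    for big_bucket, small_bucket in zip(big_buckets, small_buckets):
        lookup = {}
        for row in small_bucket:
            lookup.setdefault(row[key], []).append(row)
        for big_row in big_bucket:
            for small_row in lookup.get(big_row[key], []):
                joined.append({**small_row, **big_row})
    return joined

users = [{"userid": i, "name": f"u{i}"} for i in range(1, 5)]
views = [
    {"userid": 1, "url": "/a"},
    {"userid": 2, "url": "/b"},
    {"userid": 2, "url": "/c"},
    {"userid": 5, "url": "/d"},
]
user_buckets = bucketize(users, "userid", 2)
view_buckets = bucketize(views, "userid", 2)
bucketed_result = bucketed_map_join(view_buckets, user_buckets, "userid")
```

With two buckets, each mapper holds only half of the lookup table in memory, which is what lets the "small" table grow.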
    • Storage Handlers (diagram)
      • Hive
      • HDFS
      • Native Tables
      • Storage Handler Interface
      • HBase Handler
      • Cassandra Handler
      • Hypertable Handler
      • Hypertable API
      • Cassandra API
      • HBase API
      • HBase Tables
    • Low Latency Warehouse (diagram)
      • HBase
      • Other Files/Tables
      • Periodic Load
      • Continuous Update
      • Hive Queries
    • Storage Handler Syntax
      • HBase Example
        • CREATE TABLE users(
        • userid int, name string, email string, notes string)
        • STORED BY
        • 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        • WITH SERDEPROPERTIES (
        • "hbase.columns.mapping" = "small:name,small:email,large:notes")
        • TBLPROPERTIES (
        • "hbase.table.name" = "user_list");
    • Release 0.7
      • In development
        • Concurrency Control
        • Stats Collection
        • Stats Functions
        • Indexes
        • Local Mode
        • Faster map join
        • Multiple DISTINCT aggregates
        • Archiving
        • JDBC/ODBC improvements
    • Concurrency Control
      • Pluggable distributed lock manager
        • Default is Zookeeper-based
      • Simple read/write locking
      • Table-level and partition-level
      • Implicit locking (statement level)
        • Deadlock-free via lock ordering
      • Explicit LOCK TABLE (global)
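
The "deadlock-free via lock ordering" point can be demonstrated with a small in-process stand-in for a lock manager. This sketch uses exclusive `threading.Lock` objects only (no read/write distinction and no ZooKeeper), and the `LockManager` class is hypothetical. The key step is sorting the object names before acquiring, so two statements that name the same objects in opposite orders cannot wait on each other in a cycle.

```python
import threading
from contextlib import contextmanager

class LockManager:
    """Toy lock manager: one exclusive lock per table/partition name."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, name):
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    @contextmanager
    def acquire_all(self, names):
        ordered = sorted(set(names))  # global order prevents circular waits
        held = []
        try:
            for name in ordered:
                lock = self._lock_for(name)
                lock.acquire()
                held.append(lock)
            yield ordered
        finally:
            for lock in reversed(held):
                lock.release()

manager = LockManager()
completed = []

def statement(name, objects):
    """Simulated statement that implicitly locks its objects many times."""
    for _ in range(1000):
        with manager.acquire_all(objects):
            pass
    completed.append(name)

t1 = threading.Thread(target=statement, args=("s1", ["page_views", "users"]))
t2 = threading.Thread(target=statement, args=("s2", ["users", "page_views"]))
t1.start()
t2.start()
t1.join(timeout=10)
t2.join(timeout=10)
```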
    • Statistics Collection
      • Implicit metastore update during load
        • Or explicit via ANALYZE TABLE
      • Table/partition-level
        • Number of rows
        • Number of files
        • Size in bytes
    • Stats-driven Optimization
      • Automatic map-side join
      • Automatic map-side aggregation
      • Need column-level stats for better estimates
        • Filter/join selectivity
        • Distinct value counts
        • Column correlation
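
As an illustration of how table-level size statistics could drive the automatic map-side join decision: if the smaller input fits under a memory budget, pick a map join, otherwise fall back to a reduce join. The `total_size` field name, the byte limits, and the `choose_join` function are assumptions for this sketch, not Hive's optimizer code.

```python
def choose_join(stats, left, right, small_table_limit_bytes):
    """Pick a join strategy from size statistics alone."""
    left_size = stats[left]["total_size"]
    right_size = stats[right]["total_size"]
    small = left if left_size <= right_size else right
    if stats[small]["total_size"] <= small_table_limit_bytes:
        return ("map_join", small)   # small side becomes the in-memory lookup
    return ("reduce_join", None)     # neither side is small enough

stats = {
    "page_views": {"total_size": 10**12},
    "users": {"total_size": 5 * 10**6},
}
plan_small_budget = choose_join(stats, "page_views", "users", 10**6)
plan_large_budget = choose_join(stats, "page_views", "users", 25 * 10**6)
```

Size in bytes is a coarse proxy. The column-level statistics listed above would be needed to estimate how large an input is after filters are applied.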
    • Statistical Functions
      • Stats 101
        • Stddev, var, covar
        • Percentile_approx
      • Data Mining
        • Ngrams, sentences (text analysis)
        • Histogram_numeric
          • SELECT histogram_numeric( dob_year ) FROM users GROUP BY relationshipstatus
    • Histogram query results
      • “It’s complicated” peaks at 18-19, but lasts into late 40s!
      • “In a relationship” peaks at 20
      • “Engaged” peaks at 25
      • Married peaks in early 30s
      • More married than single at 28
      • Only teenagers use widowed?
    • Pluggable Indexing
      • Reference implementation
        • Index is stored in a normal Hive table
        • Compact: distinct block addresses
        • Partition-level rebuild
      • Currently in R&D
        • Automatic use for WHERE, GROUP BY
        • New index types (e.g. bitmap, HBase)
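
A simplified picture of a compact index: for each indexed value, store the set of distinct block addresses that contain it, then read only those blocks at query time. Representing a block address as a `(file, offset)` tuple and blocks as an in-memory dict are assumptions of this sketch.

```python
def build_compact_index(blocks, column):
    """Map each column value to the distinct block addresses containing it."""
    index = {}
    for address, rows in blocks.items():
        for row in rows:
            index.setdefault(row[column], set()).add(address)
    return index

def indexed_lookup(index, blocks, column, value):
    """Scan only the blocks the index points at, then filter rows as usual."""
    for address in sorted(index.get(value, ())):
        for row in blocks[address]:
            if row[column] == value:
                yield row

# Hypothetical table data keyed by (file, offset) block address.
blocks = {
    ("f1", 0): [{"userid": 1}, {"userid": 2}],
    ("f1", 64): [{"userid": 1}],
    ("f2", 0): [{"userid": 3}],
}
index = build_compact_index(blocks, "userid")
matches = list(indexed_lookup(index, blocks, "userid", 1))
```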
    • Local Mode Execution
      • Avoids map/reduce cluster job latency
      • Good for jobs which process small amounts of data
      • Let Hive decide when to use it
        • set hive.exec.mode.local.auto=true;
      • Or force its usage
        • set mapred.job.tracker=local;
    • Faster map join
      • Make sure small table can fit in memory
        • If it can’t, fall back to reduce join
      • Optimize hash table data structures
      • Use distributed cache to push out pre-filtered lookup table
        • Avoid swamping HDFS with reads from thousands of mappers
    • Multiple DISTINCT Aggs
      • Example
        • SELECT
        • view_date,
        • COUNT(DISTINCT userid),
        • COUNT(DISTINCT page_url)
        • FROM page_views
        • GROUP BY view_date
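
What the query above computes can be expressed directly in Python: one set per DISTINCT column within each group. This shows the semantics only (including that NULLs are not counted), not how Hive plans the query across map and reduce tasks. The sample rows are invented.

```python
from collections import defaultdict

def multi_distinct(rows, group_col, distinct_cols):
    """Per group, count distinct non-NULL values for each requested column."""
    seen = defaultdict(lambda: {c: set() for c in distinct_cols})
    for row in rows:
        for c in distinct_cols:
            if row[c] is not None:  # COUNT(DISTINCT ...) ignores NULLs
                seen[row[group_col]][c].add(row[c])
    return {g: tuple(len(sets[c]) for c in distinct_cols) for g, sets in seen.items()}

page_views = [
    {"view_date": "d1", "userid": 1, "page_url": "/a"},
    {"view_date": "d1", "userid": 1, "page_url": "/b"},
    {"view_date": "d1", "userid": 2, "page_url": "/a"},
    {"view_date": "d2", "userid": 3, "page_url": "/a"},
    {"view_date": "d2", "userid": None, "page_url": "/a"},
]
distinct_counts = multi_distinct(page_views, "view_date", ["userid", "page_url"])
```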
    • Archiving
      • Use HAR (Hadoop archive format) to combine many files into a few
      • Relieves namenode memory
      • Archived partition becomes read-only
      • Syntax:
        • ALTER TABLE page_views
        • {ARCHIVE|UNARCHIVE}
        • PARTITION (ds='2010-10-30')
    • JDBC/ODBC Improvements
      • JDBC: Basic metadata calls
        • Good enough for use with UI’s such as SQuirreL
      • JDBC: some PreparedStatement support
        • Pentaho Data Integration
      • ODBC: new driver under development (based on SQLite)
    • Hive is now a TLP
      • PMC
        • Namit Jain (chair)
        • John Sichi
        • Zheng Shao
        • Edward Capriolo
        • Raghotham Murthy
        • Ning Zhang
        • Paul Yang
        • He Yongqiang
        • Prasad Chakka
        • Joydeep Sen Sarma
        • Ashish Thusoo
      • Welcome to new committer Carl Steinbach!
    • Developer Diversity
      • Recent Contributors
        • Facebook, Yahoo, Cloudera
        • Netflix, Amazon, Media6Degrees, Intuit
        • Numerous research projects
        • Many many more…
      • Monthly San Francisco bay area contributor meetups
      • East coast meetups? 
    • Roadmap: Security
      • Authentication
        • Upgrading to SASL-enabled Thrift
      • Authorization
        • HDFS-level
          • Very limited (no ACL’s)
          • Can’t support all Hive features (e.g. views)
        • Hive-level (GRANT/REVOKE)
          • Hive server deployment for full effectiveness
    • Roadmap: Hadoop API
      • Dropping pre-0.20 support starting with Hive 0.7
        • But Hive is still using old mapred.*
      • Moving to mapreduce.* will be required in order to support newer Hadoop versions
        • Need to resolve some complications with 0.7’s indexing feature
    • Roadmap: Howl
      • Reuse metastore across Hadoop
      • (diagram) Howl, Hive, Pig, Oozie, Flume, HDFS
    • Roadmap: Heavy-Duty Tests
      • Unit tests are insufficient
      • What is needed:
        • Real-world schemas/queries
        • Non-toy data scales
        • Scripted setup; configuration matrix
        • Correctness/performance verification
        • Automatic reports: throughput, latency, profiles, coverage, perf counters…
    • Roadmap: Shared Test Site
      • Nightly runs, regression alerting
      • Performance trending
      • Synthetic workload (e.g. TPC-H)
      • Real-world workload (anonymized?)
      • This is critical for
        • Non-subjective commit criteria
        • Release quality
    • Resources
      • http://hive.apache.org
      • [email_address]
      • [email_address]
      • Questions?