• Share
  • Email
  • Embed
  • Like
  • Private Content
Hw09   Rethinking The Data Warehouse With Hadoop And Hive
 

Hw09 Rethinking The Data Warehouse With Hadoop And Hive

on

  • 6,885 views

 

Statistics

Views

Total Views
6,885
Views on SlideShare
6,790
Embed Views
95

Actions

Likes
7
Downloads
337
Comments
0

5 Embeds 95

http://www.slideshare.net 42
http://confluence.cmi-hro.nl 34
http://oqueeufizhoje.blogspot.com 9
http://oqueeufizhoje.blogspot.com.br 9
http://oqueeufizhoje.blogspot.pt 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Cost of training people is high – have to reduce cost by making system easy to use.
  • Why Hive? Petabytes of structured data User base familiar with SQL and Python/Perl/PHP Commercial Warehousing Software .. Does not scale, very expensive, inflexible Closed source, not programmable using Python/Perl/PHP Solution: SQL layer on top of scalable storage and map-reduce (Hadoop) Openness: Use any data format, embed any programming language
  • Nomenclature: Core switch and Top of Rack
  • 1GB connectivity within a rack, 100MB across racks? Are all disks 7200 SATA?

Hw09   Rethinking The Data Warehouse With Hadoop And Hive Hw09 Rethinking The Data Warehouse With Hadoop And Hive Presentation Transcript

  • Rethinking Data Warehousing & Analytics Ashish Thusoo, Facebook Data Infrastructure Team
  • Why Another Data Warehousing System?
    • Data, data and more data
      • 200GB per day in March 2008
      • 2+TB(compressed) raw data per day in April 2009
      • 4+TB(compressed) raw data per day today
  •  
  • Trends Leading to More Data Free or low cost of user services Realization that more insights are derived from simple algorithms on more data
  • Deficiencies of Existing Technologies Cost of Analysis and Storage on proprietary systems does not support trends towards more data Closed and Proprietary Systems Limited Scalability does not support trends towards more data
  • Hadoop Advantages
    • Pros
      • Superior in availability/scalability/manageability despite lower single node performance
      • Open system
      • Scalable costs
    • Cons: Programmability and Metadata
      • Map-reduce hard to program (users know sql/bash/python/perl)
      • Need to publish data in well known schemas
    • Solution: HIVE
  • What is HIVE?
    • A system for managing and querying structured data built on top of Hadoop
    • Components
      • Map-Reduce for execution
      • HDFS for storage
      • Metadata in an RDBMS
  • Hive: Simplifying Hadoop – New Technology Familiar Interfaces
    • hive> select key, count(1) from kv1 where key > 100 group by key;
    • vs.
    • $ cat > /tmp/reducer.sh
    • uniq -c | awk '{print $2" "$1}‘
    • $ cat > /tmp/map.sh
    • awk -F '01' '{if($1 > 100) print $1}‘
    • $ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
    • $ bin/hadoop dfs –cat /tmp/largekey/part*
  • Hive: Open and Extensible
    • Query your own formats and types with your own Serializer/Deserializers
    • Extend the SQL functionality through User Defined Functions
    • Do any non-SQL transformations through TRANSFORM operator that sends data from Hive to any user program/script
  • Hive: Smart Execution Plans for Performance
    • Hash based Aggregations
    • Map-side Joins
    • Predicate Pushdown
    • Partition Pruning
    • Many more to come in the future
  • Interoperability
    • JDBC and ODBC interfaces available
    • Integrations with some traditional SQL tools with some minor modifications
    • More improvements in future to support interoperability with existing front end tools
  • Information
    • Available as a sub project in Hadoop
      • http://wiki.apache.org/hadoop/Hive (wiki)
      • http://hadoop.apache.org/hive (home page)
      • http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
      • ##hive (IRC)
      • Works with hadoop-0.17, 0.18, 0.19, 0.20
    • Release 0.4.0 is coming in the next few days
    • Mailing Lists:
      • hive-{user,dev,commits}@hadoop.apache.org
  • Data Warehousing @ Facebook using Hive & Hadoop
  • Data Flow Architecture at Facebook Web Servers Scribe MidTier Filers Production Hive-Hadoop Cluster Oracle RAC Federated MySQL Scribe-Hadoop Cluster Adhoc Hive-Hadoop Cluster Hive replication
  • Looks like this .. Disks Node Disks Node Disks Node Disks Node Disks Node Disks Node 1 Gigabit 4 Gigabit Node = DataNode + Map-Reduce
  • Hadoop & Hive Cluster @ Facebook
    • Hadoop/Hive Warehouse – the new generation
      • 4800 cores, Storage capacity of 5.5 PetaBytes
      • 12 TB per node
      • Two level network topology
        • 1 Gbit/sec from node to rack switch
        • 4 Gbit/sec to top level rack switch
  • Hive & Hadoop Usage @ Facebook
    • Statistics per day:
      • 4 TB of compressed new data added per day
      • 135TB of compressed data scanned per day
      • 7500+ Hive jobs on per day
      • 80K compute hours per day
    • Hive simplifies Hadoop:
      • New engineers go though a Hive training session
      • ~200 people/month run jobs on Hadoop/Hive
      • Analysts (non-engineers) use Hadoop through Hive
      • 95% of jobs are Hive Jobs
  • Hive & Hadoop Usage @ Facebook
    • Types of Applications:
      • Reporting
        • Eg: Daily/Weekly aggregations of impression/click counts
        • Measures of user engagement
        • Microstrategy dashboards
      • Ad hoc Analysis
        • Eg: how many group admins broken down by state/country
      • Machine Learning (Assembling training data)
        • Ad Optimization
        • Eg: User Engagement as a function of user attributes
      • Many others
  • Facebook’s contributions…
    • A lot of significant contributions:
      • Hive
      • Hdfs features
      • Scheduler work
      • Etc…
    • Talks by Dhruba Borthakur and Zheng Shao in the development track for more information on these projects
  •