SlideShare a Scribd company logo
Hadoop and Hive Development at
   Facebook

Dhruba Borthakur Zheng Shao
{dhruba, zshao}@facebook.com
Presented at Hadoop World, New York
October 2, 2009
Hadoop @ Facebook
Who generates this data?

  Lots of data is generated on Facebook
    –  300+ million active users
    –  30 million users update their statuses at least once each
       day
    –  More than 1 billion photos uploaded each month
    –  More than 10 million videos uploaded each month
    –  More than 1 billion pieces of content (web links, news
       stories, blog posts, notes, photos, etc.) shared each
       week
Data Usage

  Statistics per day:
    –  4 TB of compressed new data added per day
    –  135TB of compressed data scanned per day
    –  7500+ Hive jobs on production cluster per day
    –  80K compute hours per day
  Barrier to entry is significantly reduced:
    –  New engineers go though a Hive training session
    –  ~200 people/month run jobs on Hadoop/Hive
    –  Analysts (non-engineers) use Hadoop through Hive
Where is this data stored?

  Hadoop/Hive Warehouse
  –  4800 cores, 5.5 PetaBytes
  –  12 TB per node
  –  Two level network topology
       1 Gbit/sec from node to rack switch
       4 Gbit/sec to top level rack switch
Data Flow into Hadoop Cloud
                                                     Network

                                                     Storage

                                                     and

                                                     Servers

    Web
Servers

                        Scribe
MidTier





     Oracle
RAC
   Hadoop
Hive
Warehouse
   MySQL

Hadoop Scribe: Avoid Costly Filers
                                                          Scribe
Writers


                                                                      RealBme

                                                                      Hadoop

                                                                      Cluster

    Web
Servers
               Scribe
MidTier





     Oracle
RAC
         Hadoop
Hive
Warehouse
                    MySQL

    http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
HDFS Raid

  Start the same: triplicate
   every data block
                                                 A               B                C
  Background encoding
   –  Combine third replica of                   A               B                C
      blocks from a single file to
      create parity block                        A               B                C
   –  Remove third replica
   –  Apache JIRA HDFS-503                                     A+B+C
  DiskReduce from CMU
   –  Garth Gibson research                     A file with three blocks A, B and C

   http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
Archival: Move old data to cheap storage

Hadoop
Warehouse


                                                 NFS




               Hadoop
Archive
Node
                     Cheap
NAS




                                 Hadoop
Archival
Cluster



                    hEp://issues.apache.org/jira/browse/HDFS‐220

 Hive
Query

Dynamic-size MapReduce Clusters

  Why multiple compute clouds in Facebook?
    –  Users unaware of resources needed by job
    –  Absence of flexible Job Isolation techniques
    –  Provide adequate SLAs for jobs
  Dynamically move nodes between clusters
    –  Based on load and configured policies
    –  Apache Jira MAPREDUCE-1044
Resource Aware Scheduling (Fair Share
Scheduler)
  We use the Hadoop Fair Share Scheduler
    –  Scheduler unaware of memory needed by job
  Memory and CPU aware scheduling
    –  RealTime gathering of CPU and memory usage
    –  Scheduler analyzes memory consumption in realtime
    –  Scheduler fair-shares memory usage among jobs
    –  Slot-less scheduling of tasks (in future)
    –  Apache Jira MAPREDUCE-961
Hive – Data Warehouse

  Efficient SQL to Map-Reduce Compiler

  Mar 2008: Started at Facebook
  May 2009: Release 0.3.0 available
  Now: Preparing for release 0.4.0


  Countable for 95%+ of Hadoop jobs @ Facebook
  Used by ~200 engineers and business analysts at Facebook
   every month
Hive Architecture

  Web UI + Hive CLI + JDBC/                Map Reduce       HDFS
            ODBC                       User-defined
     Browse, Query, DDL              Map-reduce Scripts

  MetaStore               Hive QL              UDF/UDAF
                                                 substr
                     Parser                       sum
  Thrift API                                    average
                    Planner    Execution
                                                          FileFormats
                                                 SerDe
                   Optimizer                                TextFile
                                                  CSV     SequenceFile
                                                 Thrift      RCFile
                                                 Regex
Hive DDL

  DDL
   –  Complex columns
   –  Partitions
   –  Buckets
  Example
   –  CREATE TABLE sales (
        id INT,
        items ARRAY<STRUCT<id:INT, name:STRING>>,
        extra MAP<STRING, STRING>
      ) PARTITIONED BY (ds STRING)
      CLUSTERED BY (id) INTO 32 BUCKETS;
Hive Query Language

  SQL
   –    Where
   –    Group By
   –    Equi-Join
   –    Sub query in from clause
  Example
   –  SELECT r.*, s.*
      FROM r JOIN (
        SELECT key, count(1) as count
        FROM s
        GROUP BY key) s
      ON r.key = s.key
      WHERE s.count > 100;
Group By

  4 different plans based on:
  –  Does data have skew?
  –  partial aggregation
  Map-side hash aggregation
  –  In-memory hash table in mapper to do partial
     aggregations
  2-map-reduce aggregation
  –  For distinct queries with skew and large cardinality
Join

  Normal map-reduce Join
  –  Mapper sends all rows with the same key to a single
     reducer
  –  Reducer does the join
  Map-side Join
  –  Mapper loads the whole small table and a portion of big
     table
  –  Mapper does the join
  –  Much faster than map-reduce join
Sampling

  Efficient sampling
   –  Table can be bucketed
   –  Each bucket is a file
   –  Sampling can choose some buckets
  Example
   –  SELECT product_id, sum(price)
      FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32)
      GROUP BY product_id
Multi-table Group-By/Insert

FROM users
INSERT INTO TABLE pv_gender_sum
   SELECT gender, count(DISTINCT userid)
   GROUP BY gender
INSERT INTO
   DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
   SELECT age, count(DISTINCT userid)
   GROUP BY age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
   SELECT country, gender, count(DISTINCT userid)
   GROUP BY country, gender;
File Formats

  TextFile:
   –  Easy for other applications to write/read
   –  Gzip text files are not splittable
  SequenceFile:
   –  Only hadoop can read it
   –  Support splittable compression
  RCFile: Block-based columnar storage
   –    Use SequenceFile block format
   –    Columnar storage inside a block
   –    25% smaller compressed size
   –    On-par or better query performance depending on the query
SerDe

  Serialization/Deserialization
  Row Format
   –    CSV (LazySimpleSerDe)
   –    Thrift (ThriftSerDe)
   –    Regex (RegexSerDe)
   –    Hive Binary Format (LazyBinarySerDe)
  LazySimpleSerDe and LazyBinarySerDe
   –  Deserialize the field when needed
   –  Reuse objects across different rows
   –  Text and Binary format
UDF/UDAF

  Features:
   –    Use either Java or Hadoop Objects (int, Integer, IntWritable)
   –    Overloading
   –    Variable-length arguments
   –    Partial aggregation for UDAF
  Example UDF:
   –  public class UDFExampleAdd extends UDF {
        public int evaluate(int a, int b) {
          return a + b;
        }
      }
Hive – Performance

  Date      SVN Revision        Major Changes            Query A   Query B   Query C
2/22/2009      746906      Before Lazy Deserialization    83 sec    98 sec   183 sec
2/23/2009      747293         Lazy Deserialization        40 sec    66 sec   185 sec
3/6/2009       751166        Map-side Aggregation         22 sec    67 sec   182 sec
4/29/2009      770074             Object Reuse            21 sec    49 sec   130 sec
6/3/2009       781633           Map-side Join *           21 sec    48 sec   132 sec
8/5/2009       801497        Lazy Binary Format *         21 sec    48 sec   132 sec

        QueryA: SELECT count(1) FROM t;
        QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
        QueryC: SELECT * FROM t;
        map-side time only (incl. GzipCodec for comp/decompression)
        * These two features need to be tested with other queries.
Hive – Future Works

    Indexes
    Create table as select
    Views / variables
    Explode operator
    In/Exists sub queries
    Leverage sort/bucket information in Join

More Related Content

What's hot

report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hive
siddharthboora
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
Paladion Networks
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
Kelly Technologies
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
soujavajug
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
Donald Miner
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
jeffturner
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
praveen bhat
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Michel Bruley
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
Portland R User Group
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
scottcrespo
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
Adeel Ahmad
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
Sandeep Deshmukh
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
npinto
 
01 hbase
01 hbase01 hbase
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
DataWorks Summit
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
Nick Dimiduk
 

What's hot (17)

report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hive
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other OptimizationsMastering Hadoop Map Reduce - Custom Types and Other Optimizations
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
01 hbase
01 hbase01 hbase
01 hbase
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 

Similar to Hadoop and Hive Development at Facebook

Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
Sperasoft
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
jerrin joseph
 
Nextag talk
Nextag talkNextag talk
Nextag talk
Joydeep Sen Sarma
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
Sreenu Musham
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
Vinoth Chandar
 
HADOOP
HADOOPHADOOP
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
Chirag Ahuja
 
מיכאל
מיכאלמיכאל
מיכאל
sqlserver.co.il
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
BOSC 2010
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
NouhaElhaji1
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
nzhang
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
appaji intelhunt
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
DataWorks Summit
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
Serkan Özal
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Jonathan Seidman
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
Steve Staso
 
Apache drill
Apache drillApache drill
Apache drill
Jakub Pieprzyk
 

Similar to Hadoop and Hive Development at Facebook (20)

Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
HADOOP
HADOOPHADOOP
HADOOP
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
מיכאל
מיכאלמיכאל
מיכאל
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
hadoop-spark.ppt
hadoop-spark.ppthadoop-spark.ppt
hadoop-spark.ppt
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Hadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologiesHadoop training in bangalore-kellytechnologies
Hadoop training in bangalore-kellytechnologies
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Apache drill
Apache drillApache drill
Apache drill
 

Recently uploaded

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfAI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
Techgropse Pvt.Ltd.
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 

Recently uploaded (20)

Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfAI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 

Hadoop and Hive Development at Facebook

  • 1. Hadoop and Hive Development at Facebook Dhruba Borthakur Zheng Shao {dhruba, zshao}@facebook.com Presented at Hadoop World, New York October 2, 2009
  • 3. Who generates this data?   Lots of data is generated on Facebook –  300+ million active users –  30 million users update their statuses at least once each day –  More than 1 billion photos uploaded each month –  More than 10 million videos uploaded each month –  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
  • 4. Data Usage   Statistics per day: –  4 TB of compressed new data added per day –  135TB of compressed data scanned per day –  7500+ Hive jobs on production cluster per day –  80K compute hours per day   Barrier to entry is significantly reduced: –  New engineers go though a Hive training session –  ~200 people/month run jobs on Hadoop/Hive –  Analysts (non-engineers) use Hadoop through Hive
  • 5. Where is this data stored?   Hadoop/Hive Warehouse –  4800 cores, 5.5 PetaBytes –  12 TB per node –  Two level network topology   1 Gbit/sec from node to rack switch   4 Gbit/sec to top level rack switch
  • 6. Data Flow into Hadoop Cloud Network
 Storage
 and
 Servers
 Web
Servers
 Scribe
MidTier
 Oracle
RAC
 Hadoop
Hive
Warehouse
 MySQL

  • 7. Hadoop Scribe: Avoid Costly Filers Scribe
Writers
 RealBme
 Hadoop
 Cluster
 Web
Servers
 Scribe
MidTier
 Oracle
RAC
 Hadoop
Hive
Warehouse
 MySQL
 http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html
  • 8. HDFS Raid   Start the same: triplicate every data block A B C   Background encoding –  Combine third replica of A B C blocks from a single file to create parity block A B C –  Remove third replica –  Apache JIRA HDFS-503 A+B+C   DiskReduce from CMU –  Garth Gibson research A file with three blocks A, B and C http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
  • 9. Archival: Move old data to cheap storage Hadoop
Warehouse
 NFS
 Hadoop
Archive
Node
 Cheap
NAS
 Hadoop
Archival
Cluster
 hEp://issues.apache.org/jira/browse/HDFS‐220
 Hive
Query

  • 10. Dynamic-size MapReduce Clusters   Why multiple compute clouds in Facebook? –  Users unaware of resources needed by job –  Absence of flexible Job Isolation techniques –  Provide adequate SLAs for jobs   Dynamically move nodes between clusters –  Based on load and configured policies –  Apache Jira MAPREDUCE-1044
  • 11. Resource Aware Scheduling (Fair Share Scheduler)   We use the Hadoop Fair Share Scheduler –  Scheduler unaware of memory needed by job   Memory and CPU aware scheduling –  RealTime gathering of CPU and memory usage –  Scheduler analyzes memory consumption in realtime –  Scheduler fair-shares memory usage among jobs –  Slot-less scheduling of tasks (in future) –  Apache Jira MAPREDUCE-961
  • 12. Hive – Data Warehouse   Efficient SQL to Map-Reduce Compiler   Mar 2008: Started at Facebook   May 2009: Release 0.3.0 available   Now: Preparing for release 0.4.0   Countable for 95%+ of Hadoop jobs @ Facebook   Used by ~200 engineers and business analysts at Facebook every month
  • 13. Hive Architecture Web UI + Hive CLI + JDBC/ Map Reduce HDFS ODBC User-defined Browse, Query, DDL Map-reduce Scripts MetaStore Hive QL UDF/UDAF substr Parser sum Thrift API average Planner Execution FileFormats SerDe Optimizer TextFile CSV SequenceFile Thrift RCFile Regex
  • 14. Hive DDL   DDL –  Complex columns –  Partitions –  Buckets   Example –  CREATE TABLE sales ( id INT, items ARRAY<STRUCT<id:INT, name:STRING>>, extra MAP<STRING, STRING> ) PARTITIONED BY (ds STRING) CLUSTERED BY (id) INTO 32 BUCKETS;
  • 15. Hive Query Language   SQL –  Where –  Group By –  Equi-Join –  Sub query in from clause   Example –  SELECT r.*, s.* FROM r JOIN ( SELECT key, count(1) as count FROM s GROUP BY key) s ON r.key = s.key WHERE s.count > 100;
  • 16. Group By   4 different plans based on: –  Does data have skew? –  partial aggregation   Map-side hash aggregation –  In-memory hash table in mapper to do partial aggregations   2-map-reduce aggregation –  For distinct queries with skew and large cardinality
  • 17. Join   Normal map-reduce Join –  Mapper sends all rows with the same key to a single reducer –  Reducer does the join   Map-side Join –  Mapper loads the whole small table and a portion of big table –  Mapper does the join –  Much faster than map-reduce join
  • 18. Sampling   Efficient sampling –  Table can be bucketed –  Each bucket is a file –  Sampling can choose some buckets   Example –  SELECT product_id, sum(price) FROM sales TABLESAMPLE (BUCKET 1 OUT OF 32) GROUP BY product_id
  • 19. Multi-table Group-By/Insert FROM users INSERT INTO TABLE pv_gender_sum SELECT gender, count(DISTINCT userid) GROUP BY gender INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir' SELECT age, count(DISTINCT userid) GROUP BY age INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir' SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;
  • 20. File Formats   TextFile: –  Easy for other applications to write/read –  Gzip text files are not splittable   SequenceFile: –  Only hadoop can read it –  Support splittable compression   RCFile: Block-based columnar storage –  Use SequenceFile block format –  Columnar storage inside a block –  25% smaller compressed size –  On-par or better query performance depending on the query
  • 21. SerDe   Serialization/Deserialization   Row Format –  CSV (LazySimpleSerDe) –  Thrift (ThriftSerDe) –  Regex (RegexSerDe) –  Hive Binary Format (LazyBinarySerDe)   LazySimpleSerDe and LazyBinarySerDe –  Deserialize the field when needed –  Reuse objects across different rows –  Text and Binary format
  • 22. UDF/UDAF   Features: –  Use either Java or Hadoop Objects (int, Integer, IntWritable) –  Overloading –  Variable-length arguments –  Partial aggregation for UDAF   Example UDF: –  public class UDFExampleAdd extends UDF { public int evaluate(int a, int b) { return a + b; } }
  • 23. Hive – Performance Date SVN Revision Major Changes Query A Query B Query C 2/22/2009 746906 Before Lazy Deserialization 83 sec 98 sec 183 sec 2/23/2009 747293 Lazy Deserialization 40 sec 66 sec 185 sec 3/6/2009 751166 Map-side Aggregation 22 sec 67 sec 182 sec 4/29/2009 770074 Object Reuse 21 sec 49 sec 130 sec 6/3/2009 781633 Map-side Join * 21 sec 48 sec 132 sec 8/5/2009 801497 Lazy Binary Format * 21 sec 48 sec 132 sec   QueryA: SELECT count(1) FROM t;   QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;   QueryC: SELECT * FROM t;   map-side time only (incl. GzipCodec for comp/decompression)   * These two features need to be tested with other queries.
  • 24. Hive – Future Works   Indexes   Create table as select   Views / variables   Explode operator   In/Exists sub queries   Leverage sort/bucket information in Join