SlideShare a Scribd company logo
1 of 32
Download to read offline
Petabyte Scale Data at Facebook


Dhruba Borthakur ,
Engineer at Facebook,
BigDataCloud, Santa Clara, Oct 18 2012
Agenda

 1   Types of Data

 2   Data Model and API for Facebook Graph Data

 3   SLTP (Semi-OLTP) data

 4   Immutable data store for photos, videos, etc

 5   Hadoop/Hive analytics storage
Four major types of storage systems
▪  Online   Transaction Processing Databases (OLTP)
 ▪    The Facebook Social Graph

▪  Semi-online    Light Transaction Processing Databases (SLTP)
 ▪    Facebook Messages and Sensory Time Series Data

▪  Immutable     DataStore
 ▪    Photos, videos, etc

▪  Analytics   DataStore
 ▪    Data Warehouse, Logs storage
Size and Scale of Databases
                       Total Size            Technology        Bottlenecks


                   Single digit petabytes   MySQL and TAO/    Random read IOPS
  Facebook Graph                              memcache

     Facebook
                                                               Write IOPS and
   Messages and      Tens of petabytes      Hadoop HDFS and   storage capacity
    Time Series                                 Hbase
       Data
    Facebook
                       High tens of            Haystack       storage capacity
     Photos             petabytes

      Data
                       Hundreds of           Hive, HDFS and   storage capacity
    Warehouse           petabytes               Hadoop
Characteristics
                 Query          Consistency      Durability
                Latency

                   < few           quickly       No data loss
 Facebook      milliseconds      consistent
  Graph                          across data
                                   centers
  Facebook
                                 consistent      No data loss
  Messages     < 200 millisec   within a data
  and Time                         center
 Series Data
 Facebook
               < 250 millisec    immutable       No data loss
  Photos

   Data
                  < 1 min       not consistent   No silent data
 Warehouse                       across data          loss
                                   centers
Facebook Graph: Objects and Associations
Data model
Objects & Associations
                                           18429207554
                                                 (page)

                                    fan
                                                                  name: Barack Obama
                    8636146                                       birthday: 08/04/1961
                                          admin
                    (user)                                        website: http://…
                                                                  verified: 1
                                               friend             …
                 likes
                         liked by     friend
                                                          604191769
                                                              (user)


   6205972929
       (story)
Facebook Social Graph: TAO and MySQL
An OLTP workload:
▪  Uneven    read heavy workload
▪  Huge   working set with creation-time locality
▪  Highly   interconnected data
▪  Constantly   evolving
▪  As   consistent as possible
Architecture
Cache & Storage

                                       MySQL Storage
     Web servers   TAO Storage Cache
Architecture
Sharding
▪  Object   ids and Assoc id1s are mapped to shard ids
                                                   MySQL Storage
            Web Servers        TAO Cache
                                                         db2 db4
                                 s1 s3
                                 s5

                                                         db1 db3
                                                         db8
                                 s2 s6

                                                          db7

                                 s4 s7
                                 s8
                                                         db5 db6
Workload

▪  Read-heavy    workload
 ▪    Significant range queries
▪  LinkBench    benchmark being open-sourced
 ▪    http://www.github.com/facebook/linkbench

▪  Real   data distribution of Assocs and their access patterns
Messages & Time Series Database
        SLTP workload
Facebook Messages

 Messages   Chats   Emails   SMS
Why we chose Hbase and HDFS
▪  High   write throughput
▪  Horizontal    scalability
▪  Automatic    Failover
▪  Strong   consistency within a data center
▪  Benefits    of HDFS : Fault tolerant, scalable, Map-Reduce toolset,
▪  Why    is this SLTP?
   ▪    Semi-online: Queries run even if part of the database is offline
   ▪    Light Transactions: single row transactions
   ▪    Storage capacity bound rather than iops or cpu bound
What we store in HBase
▪  Small   messages
▪  Message    metadata (thread/message indices)
▪  Search   index
▪  Large   attachments stored in Haystack (photo store
Size and scale of Messages Database
▪  6   Billion messages/day
▪  74    Billion operations/day
▪  At    peak: 1.5 million operations/sec
▪  55%    read, 45% write operations
▪  Average    write operation inserts 16 records
▪  All   data is lzo compressed
▪  Growing     at 8 TB/day
Haystack: The Photo Store
Facebook Photo DataStore
                         2009                     2012


    Total Size      15 billion photos
                      1.5 Petabyte        High tens of petabytes



                  30 million photos/day   300 million photos/day
   Upload Rate           3 TB/day               30 TB/day


                    555K images/sec
   Serving Rate
Haystack based Design
                        Haystack Store
        Haystack
        Directory




          Web             Haystack
         Server            Cache




        Browser             CDN
Haystack Internals
▪  Log   structured, append-only object store
▪  Built   on commodity hardware
▪  Application-aware    replication
▪  Images    stored in 100 GB xfs-files called needles
▪  An   in-memory index for each needle file
 ▪    32 bytes of index per photo
Hive Hadoop Analytics Warehouse
Life of a photo tag in Hadoop/Hive storage
             Periodic	
  Analysis	
  (HIVE)	
                               Adhoc	
  Analysis	
  (HIVE)	
  
 Daily report on count of                                                            Count photos tagged by
 photo tags by country (1day)                                                        females age 20-25 yesterday



                                                                                    Scrapes	
  
                                                                               User info reaches
                                              Hive	
  Warehouse	
              Warehouse (1day)             MySQL	
  DB	
  

                                  copier/loader	
           Log line reaches
                                                            warehouse (15 min)                                     User tags
                                                                                                                   a photo



                         puma	
  
RealHme	
  AnalyHcs	
  (HBASE)	
           Scribe	
  Log	
  Storage	
  (HDFS)	
                   www.facebook.com	
  
                                                              Log line reaches                      Log line generated:
            Count users tagging photos                        Scribeh (10s)
            in the last hour (1min)                                                                 <user_id, photo_id>
Analytics Data Growth(last 4 years)


           Facebook               Scribe Data/    Nodes in
                      Queries/Day                            Size (Total)
             Users                    Day        warehouse



  Growth     14X         60X          250X         260X        2500X
Why use Hadoop instead of Parallel DBMS?
▪    Stonebraker/DeWitt from the
     DBMS community:
     ▪    Quote “major step backwards”
     ▪    Published benchmark results
          which show that Hive is not as
          performant as a traditional DBMS
          ▪    http://database.cs.brown.edu/
               projects/mapreduce-vs-dbms/
Why Hadoop/Hive? Prospecting for Gold..
▪    “Gold in the wild-west”
▪    A platform for huge data-experiments
▪    A majority of queries are searching
     for a single gold nugget
▪    Great advantage in keeping all data in
     one queryable system
▪    No structure to data, specify
     structure at query time
Why Hadoop/Hive? Open to all users
▪  Open   Warehouse
▪  No
    need to take approval from DBA to
 create new table
▪  This
     is possible only because Hadoop is
 horizontally scalable
▪  Provisioning
              excess capacity is cheaper
 than traditional databases
▪  Lower
       $$/GB ($$/query-latency not
 relevant here)
Why Hadoop/Hive? Crowd Sourcing
▪    Crowd Sourcing for data discovery
▪    There are 50K tables in a single
     warehouse
▪    Users are DBAs themselves
▪    Questions about a table are directed
     to users of that table
▪    Automatic query lineage tools help
     here
Why Hadoop/Hive? Open Data Format
▪    No Lock-in
▪    Hive/Hadoop is open source
▪    Data is open-format, one can access
     data below the database layer
▪    HBase can query the same data that
     is in HDFS
▪    Want a new UDF? No problem, Very
     easily extendable
Why Hadoop/Hive? Benchmarks
▪    Shortcomings of existing DBMS benchmarks
     ▪    most DBMS benchmarks measure latency
     ▪    Do not measure $$/Gigabyte-of-data
     ▪    Do not test fault tolerance – kill machines
     ▪    Do not measure elasticity – add and remove machines
     ▪    Do not measure throughput – concurrent queries in parallel
New trends in storage software
▪  Trends:

 ▪    SSDs cheaper, increasing number of CPUs per server
      ▪    How can BigData use SSDs?
 ▪    SATA disk capacities reaching 4 - 8 TB per disk, falling prices $/GB
      ▪    How can BigData become even bigger?
Questions?
     dhruba@fb.com
http://hadoopblog.blogspot.com/

More Related Content

What's hot

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMongoDB
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Cloudera, Inc.
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopDataWorks Summit
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbaseRavi Veeramachaneni
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera, Inc.
 
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, InformaticaHadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, InformaticaCloudera, Inc.
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsEsther Kundin
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Soft-Shake 2013 : Enabling Realtime Queries to End Users
Soft-Shake 2013 : Enabling Realtime Queries to End UsersSoft-Shake 2013 : Enabling Realtime Queries to End Users
Soft-Shake 2013 : Enabling Realtime Queries to End UsersBenoit Perroud
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
MongoDB at eBay
MongoDB at eBayMongoDB at eBay
MongoDB at eBayMongoDB
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Jstorm introduction-0.9.6
Jstorm introduction-0.9.6Jstorm introduction-0.9.6
Jstorm introduction-0.9.6longda feng
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Stefan Lipp
 
Design Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational DatabasesDesign Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational Databasesguestdfd1ec
 

What's hot (20)

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
Hadoop World 2011: The Hadoop Stack - Then, Now and in the Future - Eli Colli...
 
Bringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on HadoopBringing OLTP woth OLAP: Lumos on Hadoop
Bringing OLTP woth OLAP: Lumos on Hadoop
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, InformaticaHadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
Hadoop World 2011: Practical HBase - Ravi Veeramchaneni, Informatica
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Soft-Shake 2013 : Enabling Realtime Queries to End Users
Soft-Shake 2013 : Enabling Realtime Queries to End UsersSoft-Shake 2013 : Enabling Realtime Queries to End Users
Soft-Shake 2013 : Enabling Realtime Queries to End Users
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
MongoDB at eBay
MongoDB at eBayMongoDB at eBay
MongoDB at eBay
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Jstorm introduction-0.9.6
Jstorm introduction-0.9.6Jstorm introduction-0.9.6
Jstorm introduction-0.9.6
 
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
Cloudera Big Data Integration Speedpitch at TDWI Munich June 2017
 
Design Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational DatabasesDesign Patterns for Distributed Non-Relational Databases
Design Patterns for Distributed Non-Relational Databases
 

Viewers also liked

Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Cloudera, Inc.
 
Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Joydeep Sen Sarma
 
Facebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeFacebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeDataWorks Summit
 
Data Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowData Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowMapR Technologies
 
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
Hw09   Rethinking The Data Warehouse With Hadoop And HiveHw09   Rethinking The Data Warehouse With Hadoop And Hive
Hw09 Rethinking The Data Warehouse With Hadoop And HiveCloudera, Inc.
 
Creating a Culture of Data @ Facebook - TCCEU13
Creating a Culture of Data @ Facebook - TCCEU13Creating a Culture of Data @ Facebook - TCCEU13
Creating a Culture of Data @ Facebook - TCCEU13Andy Kriebel
 
Project Voldemort: Big data loading
Project Voldemort: Big data loadingProject Voldemort: Big data loading
Project Voldemort: Big data loadingDan Harvey
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)tatsuya6502
 
FBTFTP: an opensource framework to build dynamic tftp servers
FBTFTP: an opensource framework to build dynamic tftp serversFBTFTP: an opensource framework to build dynamic tftp servers
FBTFTP: an opensource framework to build dynamic tftp serversAngelo Failla
 
SREConEurope15 - The evolution of the DHCP infrastructure at Facebook
SREConEurope15 - The evolution of the DHCP infrastructure at FacebookSREConEurope15 - The evolution of the DHCP infrastructure at Facebook
SREConEurope15 - The evolution of the DHCP infrastructure at FacebookAngelo Failla
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Helena Edelson
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)Sid Anand
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 

Viewers also liked (19)

Project Voldemort
Project VoldemortProject Voldemort
Project Voldemort
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010
 
ADER RRHH PRESENTACIÓN CORPORATIVA
ADER RRHH PRESENTACIÓN CORPORATIVAADER RRHH PRESENTACIÓN CORPORATIVA
ADER RRHH PRESENTACIÓN CORPORATIVA
 
Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012
 
Facebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeFacebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage Challenge
 
Data Warehouse Evolution Roadshow
Data Warehouse Evolution RoadshowData Warehouse Evolution Roadshow
Data Warehouse Evolution Roadshow
 
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
Hw09   Rethinking The Data Warehouse With Hadoop And HiveHw09   Rethinking The Data Warehouse With Hadoop And Hive
Hw09 Rethinking The Data Warehouse With Hadoop And Hive
 
Creating a Culture of Data @ Facebook - TCCEU13
Creating a Culture of Data @ Facebook - TCCEU13Creating a Culture of Data @ Facebook - TCCEU13
Creating a Culture of Data @ Facebook - TCCEU13
 
Project Voldemort: Big data loading
Project Voldemort: Big data loadingProject Voldemort: Big data loading
Project Voldemort: Big data loading
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
Tokyo HBase Meetup - Realtime Big Data at Facebook with Hadoop and HBase (ja)
 
FBTFTP: an opensource framework to build dynamic tftp servers
FBTFTP: an opensource framework to build dynamic tftp serversFBTFTP: an opensource framework to build dynamic tftp servers
FBTFTP: an opensource framework to build dynamic tftp servers
 
SREConEurope15 - The evolution of the DHCP infrastructure at Facebook
SREConEurope15 - The evolution of the DHCP infrastructure at FacebookSREConEurope15 - The evolution of the DHCP infrastructure at Facebook
SREConEurope15 - The evolution of the DHCP infrastructure at Facebook
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S...
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 

Similar to Petabyte Scale Data at Facebook

How & When to Use NoSQL at Websummit Dublin
How & When to Use NoSQL at Websummit DublinHow & When to Use NoSQL at Websummit Dublin
How & When to Use NoSQL at Websummit DublinAmazon Web Services
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseAge Mooij
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook AhmedDoukh
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summitdrewz lin
 
[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)baggioss
 
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...Spark Summit
 
Big Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxBig Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxAnonymous9etQKwW
 
Hadoop hbase introduction
Hadoop hbase introductionHadoop hbase introduction
Hadoop hbase introductionJakub Stransky
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Yahoo Developer Network
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightNilesh Gule
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014Hassan Islamov
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 

Similar to Petabyte Scale Data at Facebook (20)

How & When to Use NoSQL at Websummit Dublin
How & When to Use NoSQL at Websummit DublinHow & When to Use NoSQL at Websummit Dublin
How & When to Use NoSQL at Websummit Dublin
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Scaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBaseScaling Out With Hadoop And HBase
Scaling Out With Hadoop And HBase
 
How and when to use NoSQL
How and when to use NoSQLHow and when to use NoSQL
How and when to use NoSQL
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
Fb talk arch_summit
Fb talk arch_summitFb talk arch_summit
Fb talk arch_summit
 
[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)
 
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
 
Big Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptxBig Data & Analytics (CSE6005) L6.pptx
Big Data & Analytics (CSE6005) L6.pptx
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Hadoop hbase introduction
Hadoop hbase introductionHadoop hbase introduction
Hadoop hbase introduction
 
Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010Hadoop and Pig at Twitter__HadoopSummit2010
Hadoop and Pig at Twitter__HadoopSummit2010
 
Getting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsightGetting started with big data in Azure HDInsight
Getting started with big data in Azure HDInsight
 
NoSQL Basics - a quick tour
NoSQL Basics - a quick tourNoSQL Basics - a quick tour
NoSQL Basics - a quick tour
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Intro to big data choco devday - 23-01-2014
Intro to big data   choco devday - 23-01-2014Intro to big data   choco devday - 23-01-2014
Intro to big data choco devday - 23-01-2014
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 

More from BigDataCloud

Webinar - Comparative Analysis of Cloud based Machine Learning Platforms
Webinar - Comparative Analysis of Cloud based Machine Learning PlatformsWebinar - Comparative Analysis of Cloud based Machine Learning Platforms
Webinar - Comparative Analysis of Cloud based Machine Learning PlatformsBigDataCloud
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction SystemBigDataCloud
 
REAL-TIME RECOMMENDATION SYSTEMS
REAL-TIME RECOMMENDATION SYSTEMS REAL-TIME RECOMMENDATION SYSTEMS
REAL-TIME RECOMMENDATION SYSTEMS BigDataCloud
 
Cloud Computing Services
Cloud Computing ServicesCloud Computing Services
Cloud Computing ServicesBigDataCloud
 
Google Enterprise Cloud Platform - Resources & $2000 credit!
Google Enterprise Cloud Platform - Resources & $2000 credit!Google Enterprise Cloud Platform - Resources & $2000 credit!
Google Enterprise Cloud Platform - Resources & $2000 credit!BigDataCloud
 
Big Data in the Cloud - Solutions & Apps
Big Data in the Cloud - Solutions & AppsBig Data in the Cloud - Solutions & Apps
Big Data in the Cloud - Solutions & AppsBigDataCloud
 
Big Data Analytics in Motorola on the Google Cloud Platform
Big Data Analytics in Motorola on the Google Cloud PlatformBig Data Analytics in Motorola on the Google Cloud Platform
Big Data Analytics in Motorola on the Google Cloud PlatformBigDataCloud
 
Streak + Google Cloud Platform
Streak + Google Cloud PlatformStreak + Google Cloud Platform
Streak + Google Cloud PlatformBigDataCloud
 
Using Advanced Analyics to bring Business Value
Using Advanced Analyics to bring Business Value Using Advanced Analyics to bring Business Value
Using Advanced Analyics to bring Business Value BigDataCloud
 
Creating Business Value from Big Data, Analytics & Technology.
Creating Business Value from Big Data, Analytics & Technology.Creating Business Value from Big Data, Analytics & Technology.
Creating Business Value from Big Data, Analytics & Technology.BigDataCloud
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningBigDataCloud
 
Recommendation Engines - An Architectural Guide
Recommendation Engines - An Architectural GuideRecommendation Engines - An Architectural Guide
Recommendation Engines - An Architectural GuideBigDataCloud
 
Why Hadoop is the New Infrastructure for the CMO?
Why Hadoop is the New Infrastructure for the CMO?Why Hadoop is the New Infrastructure for the CMO?
Why Hadoop is the New Infrastructure for the CMO?BigDataCloud
 
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, PivotalHadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, PivotalBigDataCloud
 
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBig Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBigDataCloud
 
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBig Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBigDataCloud
 
What Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will WinWhat Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will WinBigDataCloud
 
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBig Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBigDataCloud
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud
 
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...BigDataCloud
 

More from BigDataCloud (20)

Webinar - Comparative Analysis of Cloud based Machine Learning Platforms
Webinar - Comparative Analysis of Cloud based Machine Learning PlatformsWebinar - Comparative Analysis of Cloud based Machine Learning Platforms
Webinar - Comparative Analysis of Cloud based Machine Learning Platforms
 
Crime Analysis & Prediction System
Crime Analysis & Prediction SystemCrime Analysis & Prediction System
Crime Analysis & Prediction System
 
REAL-TIME RECOMMENDATION SYSTEMS
REAL-TIME RECOMMENDATION SYSTEMS REAL-TIME RECOMMENDATION SYSTEMS
REAL-TIME RECOMMENDATION SYSTEMS
 
Cloud Computing Services
Cloud Computing ServicesCloud Computing Services
Cloud Computing Services
 
Google Enterprise Cloud Platform - Resources & $2000 credit!
Google Enterprise Cloud Platform - Resources & $2000 credit!Google Enterprise Cloud Platform - Resources & $2000 credit!
Google Enterprise Cloud Platform - Resources & $2000 credit!
 
Big Data in the Cloud - Solutions & Apps
Big Data in the Cloud - Solutions & AppsBig Data in the Cloud - Solutions & Apps
Big Data in the Cloud - Solutions & Apps
 
Big Data Analytics in Motorola on the Google Cloud Platform
Big Data Analytics in Motorola on the Google Cloud PlatformBig Data Analytics in Motorola on the Google Cloud Platform
Big Data Analytics in Motorola on the Google Cloud Platform
 
Streak + Google Cloud Platform
Streak + Google Cloud PlatformStreak + Google Cloud Platform
Streak + Google Cloud Platform
 
Using Advanced Analyics to bring Business Value
Using Advanced Analyics to bring Business Value Using Advanced Analyics to bring Business Value
Using Advanced Analyics to bring Business Value
 
Creating Business Value from Big Data, Analytics & Technology.
Creating Business Value from Big Data, Analytics & Technology.Creating Business Value from Big Data, Analytics & Technology.
Creating Business Value from Big Data, Analytics & Technology.
 
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher ManningDeep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
Deep Learning for NLP (without Magic) - Richard Socher and Christopher Manning
 
Recommendation Engines - An Architectural Guide
Recommendation Engines - An Architectural GuideRecommendation Engines - An Architectural Guide
Recommendation Engines - An Architectural Guide
 
Why Hadoop is the New Infrastructure for the CMO?
Why Hadoop is the New Infrastructure for the CMO?Why Hadoop is the New Infrastructure for the CMO?
Why Hadoop is the New Infrastructure for the CMO?
 
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, PivotalHadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
Hadoop : A Foundation for Change - Milind Bhandarkar Chief Scientist, Pivotal
 
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDBBig Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
Big Data Cloud Meetup - Jan 29 2013 - Mike Stonebraker & Scott Jarr of VoltDB
 
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - ZettasetBig Data Cloud Meetup - Jan 24 2013 - Zettaset
Big Data Cloud Meetup - Jan 24 2013 - Zettaset
 
What Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will WinWhat Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will Win
 
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBig Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
 
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentationBigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
BigDataCloud meetup Feb 16th - Microsoft's Saptak Sen's presentation
 
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
BigDataCloud Sept 8 2011 Meetup - Fail-Proofing Hadoop Clusters with Automati...
 

Petabyte Scale Data at Facebook

  • 1. Petabyte Scale Data at Facebook Dhruba Borthakur , Engineer at Facebook, BigDataCloud, Santa Clara, Oct 18 2012
  • 2. Agenda 1 Types of Data 2 Data Model and API for Facebook Graph Data 3 SLTP (Semi-OLTP) data 4 Immutable data store for photos, videos, etc 5 Hadoop/Hive analytics storage
  • 3.
  • 4. Four major types of storage systems ▪  Online Transaction Processing Databases (OLTP) ▪  The Facebook Social Graph ▪  Semi-online Light Transaction Processing Databases (SLTP) ▪  Facebook Messages and Sensory Time Series Data ▪  Immutable DataStore ▪  Photos, videos, etc ▪  Analytics DataStore ▪  Data Warehouse, Logs storage
  • 5. Size and Scale of Databases Total Size Technology Bottlenecks Single digit petabytes MySQL and TAO/ Random read IOPS Facebook Graph memcache Facebook Write IOPS and Messages and Tens of petabytes Hadoop HDFS and storage capacity Time Series Hbase Data Facebook High tens of Haystack storage capacity Photos petabytes Data Hundreds of Hive, HDFS and storage capacity Warehouse petabytes Hadoop
  • 6. Characteristics Query Consistency Durability Latency < few quickly No data loss Facebook milliseconds consistent Graph across data centers Facebook consistent No data loss Messages < 200 millisec within a data and Time center Series Data Facebook < 250 millisec immutable No data loss Photos Data < 1 min not consistent No silent data Warehouse across data loss centers
  • 7. Facebook Graph: Objects and Associations
  • 8. Data model Objects & Associations 18429207554 (page) fan name: Barack Obama 8636146 birthday: 08/04/1961 admin (user) website: http://… verified: 1 friend … likes liked by friend 604191769 (user) 6205972929 (story)
  • 9. Facebook Social Graph: TAO and MySQL An OLTP workload: ▪  Uneven read heavy workload ▪  Huge working set with creation-time locality ▪  Highly interconnected data ▪  Constantly evolving ▪  As consistent as possible
  • 10. Architecture Cache & Storage MySQL Storage Web servers TAO Storage Cache
  • 11. Architecture Sharding ▪  Object ids and Assoc id1s are mapped to shard ids MySQL Storage Web Servers TAO Cache db2 db4 s1 s3 s5 db1 db3 db8 s2 s6 db7 s4 s7 s8 db5 db6
  • 12. Workload ▪  Read-heavy workload ▪  Significant range queries ▪  LinkBench benchmark being open-sourced ▪  http://www.github.com/facebook/linkbench ▪  Real data distribution of Assocs and their access patterns
  • 13. Messages & Time Series Database SLTP workload
  • 14. Facebook Messages Messages Chats Emails SMS
  • 15. Why we chose Hbase and HDFS ▪  High write throughput ▪  Horizontal scalability ▪  Automatic Failover ▪  Strong consistency within a data center ▪  Benefits of HDFS : Fault tolerant, scalable, Map-Reduce toolset, ▪  Why is this SLTP? ▪  Semi-online: Queries run even if part of the database is offline ▪  Light Transactions: single row transactions ▪  Storage capacity bound rather than iops or cpu bound
  • 16. What we store in HBase ▪  Small messages ▪  Message metadata (thread/message indices) ▪  Search index ▪  Large attachments stored in Haystack (photo store
  • 17. Size and scale of Messages Database ▪  6 Billion messages/day ▪  74 Billion operations/day ▪  At peak: 1.5 million operations/sec ▪  55% read, 45% write operations ▪  Average write operation inserts 16 records ▪  All data is lzo compressed ▪  Growing at 8 TB/day
  • 19. Facebook Photo DataStore 2009 2012 Total Size 15 billion photos 1.5 Petabyte High tens of petabytes 30 million photos/day 300 million photos/day Upload Rate 3 TB/day 30 TB/day 555K images/sec Serving Rate
  • 20. Haystack based Design Haystack Store Haystack Directory Web Haystack Server Cache Browser CDN
  • 21. Haystack Internals ▪  Log structured, append-only object store ▪  Built on commodity hardware ▪  Application-aware replication ▪  Images stored in 100 GB xfs-files called needles ▪  An in-memory index for each needle file ▪  32 bytes of index per photo
  • 23. Life of a photo tag in Hadoop/Hive storage Periodic  Analysis  (HIVE)   Adhoc  Analysis  (HIVE)   Daily report on count of Count photos tagged by photo tags by country (1day) females age 20-25 yesterday Scrapes   User info reaches Hive  Warehouse   Warehouse (1day) MySQL  DB   copier/loader   Log line reaches warehouse (15 min) User tags a photo puma   RealHme  AnalyHcs  (HBASE)   Scribe  Log  Storage  (HDFS)   www.facebook.com   Log line reaches Log line generated: Count users tagging photos Scribeh (10s) in the last hour (1min) <user_id, photo_id>
  • 24. Analytics Data Growth(last 4 years) Facebook Scribe Data/ Nodes in Queries/Day Size (Total) Users Day warehouse Growth 14X 60X 250X 260X 2500X
  • 25. Why use Hadoop instead of Parallel DBMS? ▪  Stonebraker/DeWitt from the DBMS community: ▪  Quote “major step backwards” ▪  Published benchmark results which show that Hive is not as performant as a traditional DBMS ▪  http://database.cs.brown.edu/ projects/mapreduce-vs-dbms/
  • 26. Why Hadoop/Hive? Prospecting for Gold.. ▪  “Gold in the wild-west” ▪  A platform for huge data-experiments ▪  A majority of queries are searching for a single gold nugget ▪  Great advantage in keeping all data in one queryable system ▪  No structure to data, specify structure at query time
  • 27. Why Hadoop/Hive? Open to all users ▪  Open Warehouse ▪  No need to take approval from DBA to create new table ▪  This is possible only because Hadoop is horizontally scalable ▪  Provisioning excess capacity is cheaper than traditional databases ▪  Lower $$/GB ($$/query-latency not relevant here)
  • 28. Why Hadoop/Hive? Crowd Sourcing ▪  Crowd Sourcing for data discovery ▪  There are 50K tables in a single warehouse ▪  Users are DBAs themselves ▪  Questions about a table are directed to users of that table ▪  Automatic query lineage tools help here
  • 29. Why Hadoop/Hive? Open Data Format ▪  No Lock-in ▪  Hive/Hadoop is open source ▪  Data is open-format, one can access data below the database layer ▪  HBase can query the same data that is in HDFS ▪  Want a new UDF? No problem, Very easily extendable
  • 30. Why Hadoop/Hive? Benchmarks ▪  Shortcomings of existing DBMS benchmarks ▪  most DBMS benchmarks measure latency ▪  Do not measure $$/Gigabyte-of-data ▪  Do not test fault tolerance – kill machines ▪  Do not measure elasticity – add and remove machines ▪  Do not measure throughput – concurrent queries in parallel
  • 31. New trends in storage software ▪  Trends: ▪  SSDs cheaper, increasing number of CPUs per server ▪  How can BigData use SSDs? ▪  SATA disk capacities reaching 4 - 8 TB per disk, falling prices $/GB ▪  How can BigData become even bigger?
  • 32. Questions? dhruba@fb.com http://hadoopblog.blogspot.com/