Data Management at Facebook


Jeff Hammerbacher
Manager, Data
June 12, 2008
My Background
Thanks for Asking
▪ hammer@facebook.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Now manage the Facebook Data Team
    ▪ 25 amazing data engineers and scientists with more on the way
    ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
Serving Facebook.com
Data Retrieval and Hardware
▪ Three main server profiles:
    ▪ Web
    ▪ Memcached
    ▪ MySQL
▪ Simplified away:
    ▪ AJAX
    ▪ Photo and Video
    ▪ Services
[Diagram: an incoming request (GET /index.php HTTP/1.1, Host: www.facebook.com) reaches the Web Tier (more than 10,000 servers), backed by the Memcached Tier (around 1,000 servers) and the MySQL Tier (around 2,000 servers)]
Serving Facebook.com
Request Volume per Second


[Diagram: Web → 10M requests → Memcache (15 TB RAM) → 500K requests → MySQL (25 TB RAM)]
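A rough back-of-the-envelope reading of these figures, combined with the approximate server counts from the previous slide. It assumes the 10M and 500K figures are both requests per second and directly comparable, and that load and RAM are spread evenly across servers; neither assumption is stated on the slides.

```python
# Figures copied from the slides; all are approximate ("around", "more than").
MEMCACHE_REQUESTS_PER_SEC = 10_000_000
MYSQL_REQUESTS_PER_SEC = 500_000
MEMCACHED_SERVERS = 1_000
MYSQL_SERVERS = 2_000
MEMCACHE_RAM_TB = 15
MYSQL_RAM_TB = 25

# Share of request volume that reaches MySQL relative to Memcache.
mysql_share = MYSQL_REQUESTS_PER_SEC / MEMCACHE_REQUESTS_PER_SEC
memcache_per_mysql_request = MEMCACHE_REQUESTS_PER_SEC // MYSQL_REQUESTS_PER_SEC

# Average load per server, assuming an even spread (an assumption).
memcache_rps_per_server = MEMCACHE_REQUESTS_PER_SEC / MEMCACHED_SERVERS
mysql_rps_per_server = MYSQL_REQUESTS_PER_SEC / MYSQL_SERVERS

# Average RAM per server, using decimal units (1 TB = 1000 GB).
memcache_ram_gb_per_server = MEMCACHE_RAM_TB * 1000 / MEMCACHED_SERVERS
mysql_ram_gb_per_server = MYSQL_RAM_TB * 1000 / MYSQL_SERVERS

print(f"MySQL sees {mysql_share:.0%} of Memcache request volume")
print(f"{memcache_per_mysql_request} Memcache requests per MySQL request")
print(f"{memcache_rps_per_server:.0f} req/s per Memcached server")
print(f"{mysql_rps_per_server:.0f} req/s per MySQL server")
print(f"{memcache_ram_gb_per_server} GB RAM per Memcached server")
print(f"{mysql_ram_gb_per_server} GB RAM per MySQL server")
```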
Services Infrastructure
Thrift, Mainly

▪ Developing a Thrift service:
    ▪ Define your data structures
        ▪ JSON-like data model
    ▪ Define your service endpoints
    ▪ Select your languages
    ▪ Generate stub code
    ▪ Write service logic
    ▪ Write client
    ▪ Configure and deploy
    ▪ Monitor, provision, and upgrade
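The shape of these steps can be sketched in plain Python. This is a stand-in only: it does not use the Thrift library, its IDL, or its generated code. The names UserProfile, ProfileService, InMemoryProfileService, and call_get_profile are hypothetical, and JSON stands in for Thrift's serialization and transport.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass


# "Define your data structures": a hypothetical struct with a JSON-like shape.
@dataclass
class UserProfile:
    uid: int
    name: str


# "Define your service endpoints": an abstract interface, standing in for
# the stub code a generator would produce.
class ProfileService(ABC):
    @abstractmethod
    def get_profile(self, uid: int) -> UserProfile:
        ...


# "Write service logic": a toy in-memory implementation of the interface.
class InMemoryProfileService(ProfileService):
    def __init__(self, profiles):
        self._profiles = {p.uid: p for p in profiles}

    def get_profile(self, uid: int) -> UserProfile:
        return self._profiles[uid]


# "Write client": serialize the request, dispatch it, deserialize the reply.
# A real deployment would send these bytes over a network transport.
def call_get_profile(service: ProfileService, uid: int) -> UserProfile:
    request = json.dumps({"method": "get_profile", "uid": uid})
    decoded = json.loads(request)
    result = service.get_profile(decoded["uid"])
    response = json.dumps(asdict(result))
    return UserProfile(**json.loads(response))


service = InMemoryProfileService([UserProfile(uid=1, name="alice")])
print(call_get_profile(service, 1))
```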
Services Infrastructure
What’s an SOA?

▪ Almost all services written in Thrift
    ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
    ▪ Network transport libraries
    ▪ Serialization libraries
    ▪ Code generation
    ▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper
▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
Data Infrastructure
Offline Batch Processing
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
    ▪ Started at tens of GB in early 2006
    ▪ Up to about 1 TB per day by mid-2007
    ▪ Log files largest source of data growth
[Diagram: Scribe Tier and MySQL Tier → Data Collection Server → Oracle Database Server]
Data Infrastructure
Distributed Processing with Cheetah

▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
    ▪ Distributor.pl: distribute binaries to processing nodes
    ▪ C++ Binaries: parse, agg, load
[Diagram: Partitioned Log File, Filer, Cheetah Master, Processing Tier]
Data Infrastructure
Moving from Cheetah to Hadoop

▪ Cheetah limitations
    ▪ Limited filer bandwidth
    ▪ No centralized logfile metadata
    ▪ Writing a new Cheetah job requires writing C++ binaries
    ▪ Jobs are difficult to monitor and debug
    ▪ No support for ad hoc querying
    ▪ Not open source
Data Infrastructure
Hadoop as Enterprise Data Warehouse
[Diagram: Scribe Tier and MySQL Tier → Hadoop Tier → Oracle RAC Servers]
Anatomy of the Facebook Cluster
Hardware
▪ Individual nodes
    ▪ CPU: Intel Xeon dual socket quad cores (8 cores per box)
    ▪ Memory: 16 GB ECC DRAM
    ▪ Disk: 4 x 1 TB 7200 RPM SATA
    ▪ Network: 1 GbE
▪ Topology
    ▪ 320 nodes arranged into 8 racks of 40 nodes each
    ▪ 8 x 1 Gbps links out to the core switch
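The per-node figures above add up to the following cluster totals. The sketch assumes that the 8 x 1 Gbps uplinks are per rack, which the slide does not say explicitly, and it counts raw disk only (no allowance for replication or the operating system).

```python
# Per-node and topology figures copied from the slide.
RACKS = 8
NODES_PER_RACK = 40
CORES_PER_NODE = 8
RAM_GB_PER_NODE = 16
DISKS_PER_NODE = 4
DISK_TB = 1
NODE_LINK_GBPS = 1
UPLINKS = 8            # assumed to be per rack (not stated on the slide)
UPLINK_GBPS = 1

nodes = RACKS * NODES_PER_RACK                      # 320
total_cores = nodes * CORES_PER_NODE                # 2560
total_ram_gb = nodes * RAM_GB_PER_NODE              # 5120
raw_disk_tb = nodes * DISKS_PER_NODE * DISK_TB      # 1280

# If every node in a rack sends at line rate, how much more traffic is
# offered than the rack uplinks can carry?
rack_offered_gbps = NODES_PER_RACK * NODE_LINK_GBPS
rack_uplink_gbps = UPLINKS * UPLINK_GBPS
oversubscription = rack_offered_gbps / rack_uplink_gbps   # 5.0

print(nodes, total_cores, total_ram_gb, raw_disk_tb, oversubscription)
```

Under the per-rack assumption, traffic leaving a rack is oversubscribed 5:1, which is one reason a scheduler would prefer to run tasks close to their data.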
Anatomy of the Facebook Cluster
Recent Cluster Statistics
▪ From May 2nd to May 21st:
    ▪ Total jobs: 8,794
    ▪ Total map tasks: 1,362,429
    ▪ Total reduce tasks: 86,806
    ▪ Average duration of a successful job: 296 s
    ▪ Average duration of a successful map: 81 s
    ▪ Average duration of a successful reduce: 678 s
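A few derived rates follow directly from these totals. The sketch assumes the year is 2008 (the date of the talk) and that the range is inclusive of both end dates; the task totals may include failed attempts, so the per-job figures are indicative only.

```python
from datetime import date

# Totals copied from the slide.
total_jobs = 8_794
total_map_tasks = 1_362_429
total_reduce_tasks = 86_806

# May 2nd to May 21st, inclusive; the year 2008 is an assumption.
days = (date(2008, 5, 21) - date(2008, 5, 2)).days + 1

jobs_per_day = total_jobs / days
maps_per_job = total_map_tasks / total_jobs
reduces_per_job = total_reduce_tasks / total_jobs
maps_per_reduce = total_map_tasks / total_reduce_tasks

print(f"{days} days, {jobs_per_day:.1f} jobs/day")
print(f"{maps_per_job:.1f} map tasks and {reduces_per_job:.2f} reduce tasks per job")
print(f"{maps_per_reduce:.1f} map tasks per reduce task")
```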
Initial Hadoop Applications
Hadoop Streaming
▪ Almost all applications at Facebook use Hadoop Streaming
▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe
▪ Facebook users write in Python, PHP, C++
▪ Allows for library reuse, faster development
▪ Eats way too much CPU
▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
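A minimal sketch of the pipe-in, pipe-out contract, using word count as a stand-in job (not one of Facebook's applications). The mapper emits tab-separated key/value lines and the reducer relies on receiving its input sorted by key; run_locally is a hypothetical helper that simulates the sort between the two stages in one process.

```python
import io
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_stage(stage, stdin, stdout):
    """Run one stage over file-like objects: lines in from a pipe, lines out."""
    for out in stage(stdin):
        stdout.write(out + "\n")


def run_locally(text):
    """Simulate map, sort, and reduce in a single process for testing."""
    map_out = io.StringIO()
    run_stage(mapper, io.StringIO(text), map_out)
    shuffled = sorted(map_out.getvalue().splitlines(keepends=True))
    reduce_out = io.StringIO()
    run_stage(reducer, io.StringIO("".join(shuffled)), reduce_out)
    return reduce_out.getvalue()


print(run_locally("the cat\nthe dog\n"), end="")
```

In a real Streaming job each stage would be its own script, calling run_stage with sys.stdin and sys.stdout.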
Initial Hadoop Applications
Unstructured text analysis
▪ Intern asked to understand brand sentiment and influence
▪ Many tools for supporting his project had to be built
    ▪ Understanding serialization format of wall post logs
    ▪ Common data operations: project, filter, join, group by
    ▪ Developed using Hadoop streaming for rapid prototyping in Python
    ▪ Scheduling regular processing and recovering from failures
    ▪ Making it easy to regularly load new data
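The four common data operations can be sketched as small Python helpers over rows represented as dictionaries. This is an in-memory illustration, not the tooling that was built; the helper names, the field names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def project(rows, fields):
    """Keep only the named fields of each row."""
    for row in rows:
        yield {f: row[f] for f in fields}


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return (row for row in rows if predicate(row))


def join(left, right, key):
    """Inner hash join: index the right side by key, then stream the left."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    for l in left:
        for r in index.get(l[key], []):
            yield {**l, **r}


def group_by(rows, key, agg):
    """Group rows by key and apply agg to each group's list of rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: agg(v) for k, v in groups.items()}


# Hypothetical sample data: wall posts and the users who wrote them.
posts = [
    {"userid": 1, "text": "love brandx", "ts": 100},
    {"userid": 2, "text": "brandx again", "ts": 101},
    {"userid": 3, "text": "hello", "ts": 102},
    {"userid": 1, "text": "brandx!", "ts": 103},
]
users = [
    {"userid": 1, "country": "US"},
    {"userid": 2, "country": "CA"},
    {"userid": 3, "country": "US"},
]

mentions = filter_rows(project(posts, ["userid", "text"]),
                       lambda row: "brandx" in row["text"])
mentions_by_country = group_by(join(mentions, users, "userid"), "country", len)
print(mentions_by_country)
```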
Lexicon
Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
    ▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
    ▪ Can be used for regression or classification
    ▪ See “Random Forests” by Leo Breiman
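A much-simplified sketch of the "build many trees and average them" idea, for regression. Each tree here is only a depth-1 stump rather than a full decision tree, and the per-tree random vector is a bootstrap sample of the rows plus one randomly chosen feature. It illustrates the averaging principle and is not the implementation run on Hadoop; all names and the toy data are hypothetical.

```python
import random


def fit_stump(rows, feature):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    values = sorted({x[feature] for x, _ in rows})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct feature value in the sample: predict the mean.
        mean = sum(y for _, y in rows) / len(rows)
        return (feature, float("inf"), mean, mean)
    _, threshold, left_mean, right_mean = best
    return (feature, threshold, left_mean, right_mean)


def fit_forest(rows, n_trees, rng):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # sample with replacement
        feature = rng.randrange(n_features)         # random feature per tree
        forest.append(fit_stump(sample, feature))
    return forest


def predict(forest, x):
    """Average the predictions of all trees."""
    preds = [left if x[f] <= t else right for f, t, left, right in forest]
    return sum(preds) / len(preds)


# Toy data: the target steps from 0 to 10 when the first feature reaches 10;
# the second feature carries no signal.
data = [((i, i % 3), 0.0 if i < 10 else 10.0) for i in range(20)]
forest = fit_forest(data, n_trees=50, rng=random.Random(0))
print(predict(forest, (2, 0)), predict(forest, (17, 0)))
```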
More Hadoop Applications
Insights
▪ Monitor performance of your Facebook Ad, Page, Application
▪ Regular aggregation of high volumes of log file data
▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop
Insights
More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
    ▪ Notifications
    ▪ News Feed story insertion
    ▪ Invitations
    ▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application
More Hadoop Applications
Miscellaneous
▪ Experimentation Platform back end
    ▪ A/B Testing
    ▪ Champion/Challenger Testing
▪ Lots of internal analyses
    ▪ Export smaller data sets to R
▪ Ads: targeting optimization, fraud detection
▪ Search: index building, ranking optimization
▪ Load testing for new storage systems
▪ Language prediction for translation targeting
Hive
Structured Data Management with Hadoop
▪ Hadoop:
    ▪ HDFS
    ▪ MapReduce
    ▪ Resource Manager
    ▪ Job Scheduler
▪ Hive:
    ▪ Logical data partitioning
    ▪ Metadata store
    ▪ Query Operators
    ▪ Query Language
Hive
Hive
Data Model
Hive
Sample Queries
▪ CREATE TABLE page_view(viewTime DATETIME, userid MEDIUMINT,
                         page_url STRING, referrer_url STRING, ip STRING)
 COMMENT 'This is the page view table'
 PARTITIONED BY(date DATETIME, country STRING)
 BUCKETED ON (userid) INTO 32 BUCKETS
 ROW FORMAT DELIMITED FIELD DELIMITER 001 ROW DELIMITER 013
 STORED AS COMPRESSED
 LOCATION '/user/facebook/warehouse/page_view';

▪ FROM pv_users
 INSERT INTO TABLE pv_gender_sum
   SELECT pv_users.gender, count_distinct(pv_users.userid)
   GROUP BY(pv_users.gender)
 INSERT INTO FILE /user/facebook/tmp/pv_age_sum.txt
   SELECT pv_users.age, count_distinct(pv_users.userid)
   GROUP BY(pv_users.age);
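The second query reads pv_users once and feeds two aggregations. A rough Python equivalent of that single-scan, multi-output pattern is sketched below; it mirrors the query's intent only and says nothing about how Hive plans or executes it. The helper name and the sample rows are hypothetical.

```python
from collections import defaultdict


def count_distinct_by(rows, group_fields, value_field):
    """Scan rows once, computing a distinct count of value_field per group
    for every grouping field in group_fields."""
    seen = {g: defaultdict(set) for g in group_fields}
    for row in rows:
        for g in group_fields:
            seen[g][row[g]].add(row[value_field])
    return {g: {k: len(v) for k, v in groups.items()}
            for g, groups in seen.items()}


# Hypothetical rows of pv_users; user 1 viewed two pages.
pv_users = [
    {"userid": 1, "gender": "F", "age": 25},
    {"userid": 1, "gender": "F", "age": 25},
    {"userid": 2, "gender": "M", "age": 25},
    {"userid": 3, "gender": "F", "age": 31},
]

result = count_distinct_by(pv_users, ["gender", "age"], "userid")
pv_gender_sum = result["gender"]   # would go to the pv_gender_sum table
pv_age_sum = result["age"]         # would go to the pv_age_sum.txt file
print(pv_gender_sum, pv_age_sum)
```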
Hive
The Team
▪ Joydeep Sen Sarma
▪ Ashish Thusoo
▪ Pete Wyckoff
▪ Suresh Anthony
▪ Zheng Shao
▪ Venky Iyer
▪ Dhruba Borthakur
▪ Namit Jain
▪ Raghu Murthy
Cassandra
Structured Storage over a P2P Network
▪ Conceptually: BigTable data model on Dynamo infrastructure
▪ Design Goals:
    ▪ High availability
    ▪ Incremental scalability
    ▪ Eventual consistency (trade consistency for availability)
    ▪ Optimistic replication
    ▪ Low total cost of ownership
    ▪ Minimal administrative overhead
    ▪ Tunable tradeoffs between consistency, durability, and latency
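One common way Dynamo-style systems expose the consistency/latency tradeoff is through replica counts: N copies of each item, W acknowledgements required per write, and R replicas consulted per read. The sketch below checks by brute force that every read set overlaps every write set exactly when R + W > N. The slide does not say how Cassandra exposes its knobs, so treating N, W, and R as the mechanism is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """Return True if every set of w replicas shares at least one member
    with every set of r replicas, out of n replicas in total."""
    replicas = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(replicas, w)
               for read_set in combinations(replicas, r))


# With 3 replicas: W=2, R=2 always sees the latest write; W=1, R=1 is
# faster but a read may miss it.
print(quorums_always_overlap(3, 2, 2))
print(quorums_always_overlap(3, 1, 1))
```

Smaller W and R lower latency and keep the system available with more replicas down, at the cost of reads that may return stale data until replicas converge.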
Cassandra
Architecture
Cassandra
The Team
▪ Avinash Lakshman
▪ Prashant Malik
▪ Karthik Ranganathan
(c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.

 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
What is an RPA CoE? Session 2 – CoE Roles
What is an RPA CoE?  Session 2 – CoE RolesWhat is an RPA CoE?  Session 2 – CoE Roles
What is an RPA CoE? Session 2 – CoE Roles
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
 
Christine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptxChristine's Supplier Sourcing Presentaion.pptx
Christine's Supplier Sourcing Presentaion.pptx
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
What is an RPA CoE? Session 1 – CoE Vision
What is an RPA CoE?  Session 1 – CoE VisionWhat is an RPA CoE?  Session 1 – CoE Vision
What is an RPA CoE? Session 1 – CoE Vision
 
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Y-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PPY-Combinator seed pitch deck template PP
Y-Combinator seed pitch deck template PP
 
Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving | Nameplate Manufacturing Process - 2024
Northern Engraving | Nameplate Manufacturing Process - 2024
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 

20080611accel

  • 1.
  • 2. Data Management at Facebook Jeff Hammerbacher Manager, Data June 12, 2008
  • 3. My Background: Thanks for Asking
      ▪ hammer@facebook.com
      ▪ Studied Mathematics at Harvard
      ▪ Worked as a Quant on Wall Street
      ▪ Came to Facebook in early 2006 as a Research Scientist
      ▪ Now manage the Facebook Data Team
          ▪ 25 amazing data engineers and scientists with more on the way
          ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
  • 4. Serving Facebook.com: Data Retrieval and Hardware
      GET /index.php HTTP/1.1
      Host: www.facebook.com
      ▪ Three main server profiles:
          ▪ Web
          ▪ Memcached
          ▪ MySQL
      ▪ Simplified away:
          ▪ AJAX
          ▪ Photo and Video
          ▪ Services
      [Diagram labels: Web Tier (more than 10,000 servers), Memcached Tier (around 1,000 servers), MySQL Tier (around 2,000 servers)]
  • 5. Serving Facebook.com: Request Volume per Second
      [Diagram labels, top to bottom: Web; 10M requests; 15TB RAM; Memcache; 500K requests; 25TB RAM; MySQL]
  • 6. Services Infrastructure: Thrift, Mainly
      ▪ Developing a Thrift service:
          ▪ Define your data structures
              ▪ JSON-like data model
          ▪ Define your service endpoints
          ▪ Select your languages
          ▪ Generate stub code
          ▪ Write service logic
          ▪ Write client
          ▪ Configure and deploy
          ▪ Monitor, provision, and upgrade
  • 7. Services Infrastructure: What’s an SOA?
      ▪ Almost all services written in Thrift
          ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
      ▪ Batteries included
          ▪ Network transport libraries
          ▪ Serialization libraries
          ▪ Code generation
          ▪ Robust server implementations (multithreaded, nonblocking, etc.)
      ▪ Now an Apache Incubator project
      ▪ For more information, read the whitepaper
      ▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
  • 8. Data Infrastructure: Offline Batch Processing
      ▪ “Data Warehousing”
          ▪ Began with Oracle database
          ▪ Schedule data collection via cron
          ▪ Collect data every 24 hours
          ▪ “ETL” scripts: hand-coded Python
      ▪ Data volumes quickly grew
          ▪ Started at tens of GB in early 2006
          ▪ Up to about 1 TB per day by mid-2007
          ▪ Log files largest source of data growth
      [Diagram labels: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
  • 9. Data Infrastructure: Distributed Processing with Cheetah
      ▪ Goal: summarize log files outside of the database
      ▪ Solution: Cheetah, a distributed log file processing system
          ▪ Distributor.pl: distribute binaries to processing nodes
          ▪ C++ Binaries: parse, agg, load
      [Diagram labels: Partitioned Log File, Cheetah Master, Filer, Processing Tier]
  • 10. Data Infrastructure: Moving from Cheetah to Hadoop
      ▪ Cheetah limitations
          ▪ Limited filer bandwidth
          ▪ No centralized logfile metadata
          ▪ Writing a new Cheetah job requires writing C++ binaries
          ▪ Jobs are difficult to monitor and debug
          ▪ No support for ad hoc querying
          ▪ Not open source
  • 11. Data Infrastructure: Hadoop as Enterprise Data Warehouse
      [Diagram labels: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]
  • 12. Anatomy of the Facebook Cluster: Hardware
      ▪ Individual nodes
          ▪ CPU: Intel Xeon dual socket quad cores (8 cores per box)
          ▪ Memory: 16 GB ECC DRAM
          ▪ Disk: 4 x 1 TB 7200 RPM SATA
          ▪ Network: 1 gE
      ▪ Topology
          ▪ 320 nodes arranged into 8 racks of 40 nodes each
          ▪ 8 x 1 Gbps links out to the core switch
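The per-node figures on this slide can be rolled up into cluster-wide totals. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted above; the function name is hypothetical, and it assumes the "8 x 1 Gbps links" are per rack, which the slide does not state explicitly.

```python
# Figures taken from the hardware slide.
RACKS = 8
NODES_PER_RACK = 40
CORES_PER_NODE = 8          # dual socket, quad core
RAM_GB_PER_NODE = 16
DISK_TB_PER_NODE = 4 * 1    # 4 x 1 TB SATA
NODE_LINK_GBPS = 1

# Assumption (not stated on the slide): the 8 x 1 Gbps uplinks are per rack.
UPLINK_GBPS_PER_RACK = 8 * 1


def cluster_totals():
    """Return aggregate capacity implied by the per-node numbers."""
    nodes = RACKS * NODES_PER_RACK
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "ram_gb": nodes * RAM_GB_PER_NODE,
        "raw_disk_tb": nodes * DISK_TB_PER_NODE,
        # Ratio of in-rack node bandwidth to rack uplink bandwidth.
        "rack_oversubscription": NODES_PER_RACK * NODE_LINK_GBPS / UPLINK_GBPS_PER_RACK,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

The raw disk total ignores HDFS replication, so usable capacity would be lower by whatever replication factor is configured.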
  • 13. Anatomy of the Facebook Cluster: Recent Cluster Statistics
      ▪ From May 2nd to May 21st:
          ▪ Total jobs: 8,794
          ▪ Total map tasks: 1,362,429
          ▪ Total reduce tasks: 86,806
          ▪ Average duration of a successful job: 296 s
          ▪ Average duration of a successful map: 81 s
          ▪ Average duration of a successful reduce: 678 s
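A few derived rates make these totals easier to read. This is a small sketch that only divides the numbers on the slide; the function name is hypothetical, the year 2008 is assumed from the talk date, and the date range is treated as inclusive.

```python
from datetime import date

# Totals quoted on the slide.
TOTAL_JOBS = 8_794
TOTAL_MAP_TASKS = 1_362_429
TOTAL_REDUCE_TASKS = 86_806

# Assumption: the reporting window is May 2 to May 21, 2008, inclusive.
WINDOW_START = date(2008, 5, 2)
WINDOW_END = date(2008, 5, 21)


def derived_rates():
    """Compute per-day and per-job averages from the slide's totals."""
    days = (WINDOW_END - WINDOW_START).days + 1
    return {
        "days": days,
        "jobs_per_day": TOTAL_JOBS / days,
        "map_tasks_per_job": TOTAL_MAP_TASKS / TOTAL_JOBS,
        "reduce_tasks_per_job": TOTAL_REDUCE_TASKS / TOTAL_JOBS,
        "map_tasks_per_reduce_task": TOTAL_MAP_TASKS / TOTAL_REDUCE_TASKS,
    }


if __name__ == "__main__":
    for name, value in derived_rates().items():
        print(f"{name}: {value:.1f}")
```

Under those assumptions the cluster ran roughly 440 jobs per day, with about 155 map tasks and about 10 reduce tasks per job.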
  • 14. Initial Hadoop Applications: Hadoop Streaming
      ▪ Almost all applications at Facebook use Hadoop Streaming
      ▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe
      ▪ Facebook users write in Python, PHP, C++
      ▪ Allows for library reuse, faster development
      ▪ Eats way too much CPU
      ▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
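To make the pipe-in, pipe-out model concrete, here is a minimal word-count sketch in Python. It is an illustration, not code from the talk: the function names and the `map`/`reduce` command-line argument are hypothetical. It relies on two Hadoop Streaming conventions: records are lines with a tab between key and value, and the reducer receives its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Run as "script.py map" or "script.py reduce", reading stdin and writing stdout.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The same file could be supplied to a streaming job as both the `-mapper` and the `-reducer` command. Locally, piping the map step through `sort` into the reduce step approximates the shuffle.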
  • 15. Initial Hadoop Applications: Unstructured text analysis
      ▪ Intern asked to understand brand sentiment and influence
      ▪ Many tools for supporting his project had to be built
          ▪ Understanding serialization format of wall post logs
          ▪ Common data operations: project, filter, join, group by
          ▪ Developed using Hadoop streaming for rapid prototyping in Python
          ▪ Scheduling regular processing and recovering from failures
          ▪ Making it easy to regularly load new data
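The four common data operations named on this slide can be sketched on in-memory records as follows. This is an illustrative single-machine version, not the tooling built at Facebook; the function names and the dictionary-per-row representation are assumptions.

```python
from collections import defaultdict


def project(rows, fields):
    """Keep only the named fields of each row."""
    return [{field: row[field] for field in fields} for row in rows]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Inner equi-join on `key`, using a hash index built over `right`."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_by(rows, key, aggregate):
    """Group rows by `key` and apply `aggregate` to each group's row list."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {value: aggregate(members) for value, members in groups.items()}


if __name__ == "__main__":
    # Hypothetical sample data: wall posts and user profiles.
    posts = [
        {"user": 1, "text": "great phone"},
        {"user": 2, "text": "meh"},
        {"user": 1, "text": "great camera"},
    ]
    users = [{"user": 1, "country": "US"}, {"user": 2, "country": "CA"}]
    positive = filter_rows(posts, lambda row: "great" in row["text"])
    by_country = group_by(project(join(positive, users, "user"), ["country"]), "country", len)
    print(by_country)
```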
  • 17. Initial Hadoop Applications: Ensemble Learning
      ▪ Build a lot of Decision Trees and average them
      ▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
      ▪ Can be used for regression or classification
      ▪ See “Random Forests” by Leo Breiman
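The "build many trees and average them" idea can be shown with a deliberately tiny sketch: one-split regression trees (stumps), each fitted on a bootstrap resample, with predictions averaged. This is a simplified stand-in rather than Breiman's full algorithm or the implementation used at Facebook. All names are hypothetical, there is a single input feature, and the only per-tree randomness is the bootstrap sample.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample is identical: predict the mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train n_trees stumps on bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        # The bootstrap indices play the role of each tree's random vector.
        sample = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return lambda x: mean(stump(x) for stump in stumps)


if __name__ == "__main__":
    xs = list(range(20))
    ys = [0.0 if x < 10 else 1.0 for x in xs]
    model = fit_bagged_stumps(xs, ys)
    print(model(2), model(17))
```

Because each tree is trained independently of the others, the training loop is the kind of work that spreads naturally across map tasks.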
  • 18. More Hadoop Applications: Insights
      ▪ Monitor performance of your Facebook Ad, Page, Application
      ▪ Regular aggregation of high volumes of log file data
      ▪ First hourly pipelines
      ▪ Publish data back to a MySQL tier
      ▪ System currently only running partially on Hadoop
  • 20. More Hadoop Applications: Platform Application Reputation Scoring
      ▪ Users complaining about being spammed by Platform applications
      ▪ Now, every Platform Application has a set of quotas
          ▪ Notifications
          ▪ News Feed story insertion
          ▪ Invitations
          ▪ Emails
      ▪ Quotas determined by calculating a “reputation score” for the application
  • 21. More Hadoop Applications: Miscellaneous
      ▪ Experimentation Platform back end
          ▪ A/B Testing
          ▪ Champion/Challenger Testing
      ▪ Lots of internal analyses
          ▪ Export smaller data sets to R
      ▪ Ads: targeting optimization, fraud detection
      ▪ Search: index building, ranking optimization
      ▪ Load testing for new storage systems
      ▪ Language prediction for translation targeting
  • 22. Hive: Structured Data Management with Hadoop
      ▪ Hadoop:
          ▪ HDFS
          ▪ MapReduce
          ▪ Resource Manager
          ▪ Job Scheduler
      ▪ Hive:
          ▪ Logical data partitioning
          ▪ Metadata store
          ▪ Query Operators
          ▪ Query Language
  • 23. Hive
  • 25. Hive: Sample Queries
      ▪ CREATE TABLE page_view(viewTime DATETIME, userid MEDIUMINT,
            page_url STRING, referrer_url STRING, ip STRING)
        COMMENT 'This is the page view table'
        PARTITIONED BY(date DATETIME, country STRING)
        BUCKETED ON (userid) INTO 32 BUCKETS
        ROW FORMAT DELIMITED FIELD DELIMITER 001 ROW DELIMITER 013
        STORED AS COMPRESSED
        LOCATION '/user/facebook/warehouse/page_view';
      ▪ FROM pv_users
        INSERT INTO TABLE pv_gender_sum
            SELECT pv_users.gender, count_distinct(pv_users.userid)
            GROUP BY(pv_users.gender)
        INSERT INTO FILE /user/facebook/tmp/pv_age_sum.txt
            SELECT pv_users.age, count_distinct(pv_users.userid)
            GROUP BY(pv_users.age);
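As a rough illustration of what the second sample query computes, the plain-Python sketch below scans the rows once and feeds two distinct-user counts, one grouped by gender and one by age. It is not Hive code and says nothing about how Hive executes the query. The function name and the list-of-dictionaries input are assumptions; only the column names `userid`, `gender`, and `age` come from the slide.

```python
from collections import defaultdict


def summarize_pv_users(rows):
    """Single pass over pv_users rows producing two grouped distinct counts.

    Returns (distinct users per gender, distinct users per age).
    """
    users_by_gender = defaultdict(set)
    users_by_age = defaultdict(set)
    for row in rows:
        # Sets deduplicate user ids, mirroring count_distinct(userid).
        users_by_gender[row["gender"]].add(row["userid"])
        users_by_age[row["age"]].add(row["userid"])
    gender_sum = {gender: len(users) for gender, users in users_by_gender.items()}
    age_sum = {age: len(users) for age, users in users_by_age.items()}
    return gender_sum, age_sum


if __name__ == "__main__":
    # Hypothetical sample rows; user 1 viewed two pages.
    pv_users = [
        {"userid": 1, "gender": "F", "age": 25},
        {"userid": 1, "gender": "F", "age": 25},
        {"userid": 2, "gender": "M", "age": 25},
        {"userid": 3, "gender": "F", "age": 31},
    ]
    print(summarize_pv_users(pv_users))
```

Sharing one scan between both aggregations is the point of the `FROM ... INSERT ... INSERT` form: the input table is read once rather than once per output.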
  • 26. Hive: The Team
      ▪ Joydeep Sen Sarma
      ▪ Ashish Thusoo
      ▪ Pete Wyckoff
      ▪ Suresh Anthony
      ▪ Zheng Shao
      ▪ Venky Iyer
      ▪ Dhruba Borthakur
      ▪ Namit Jain
      ▪ Raghu Murthy
  • 27. Cassandra: Structured Storage over a P2P Network
      ▪ Conceptually: BigTable data model on Dynamo infrastructure
      ▪ Design Goals:
          ▪ High availability
          ▪ Incremental scalability
          ▪ Eventual consistency (trade consistency for availability)
          ▪ Optimistic replication
          ▪ Low total cost of ownership
          ▪ Minimal administrative overhead
          ▪ Tunable tradeoffs between consistency, durability, and latency
  • 29. Cassandra: The Team
      ▪ Avinash Lakshman
      ▪ Prashant Malik
      ▪ Karthik Ranganathan
  • 30. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0