Data Management at Facebook


Jeff Hammerbacher
Manager, Data
June 12, 2008
My Background
Thanks for Asking
▪ hammer@facebook.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Now manage the Facebook Data Team
    ▪ 25 amazing data engineers and scientists with more on the way
    ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
Serving Facebook.com
Data Retrieval and Hardware
▪ Three main server profiles:
    ▪ Web
    ▪ Memcached
    ▪ MySQL
▪ Simplified away:
    ▪ AJAX
    ▪ Photo and Video
    ▪ Services
[Diagram: a request (GET /index.php HTTP/1.1, Host: www.facebook.com) arrives at the Web Tier (more than 10,000 servers), backed by the Memcached Tier (around 1,000 servers) and the MySQL Tier (around 2,000 servers)]
Serving Facebook.com
Request Volume per Second


[Diagram: Web → 10M requests → Memcache (15 TB RAM) → 500K requests → MySQL (25 TB RAM)]
Services Infrastructure
Thrift, Mainly

▪ Developing a Thrift service:
    ▪ Define your data structures
        ▪ JSON-like data model
    ▪ Define your service endpoints
    ▪ Select your languages
    ▪ Generate stub code
    ▪ Write service logic
    ▪ Write client
    ▪ Configure and deploy
    ▪ Monitor, provision, and upgrade
Services Infrastructure
What’s an SOA?

▪ Almost all services written in Thrift
    ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
    ▪ Network transport libraries
    ▪ Serialization libraries
    ▪ Code generation
    ▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper
▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
Data Infrastructure
Offline Batch Processing
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
    ▪ Started at tens of GB in early 2006
    ▪ Up to about 1 TB per day by mid-2007
    ▪ Log files largest source of data growth
[Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
Data Infrastructure
Distributed Processing with Cheetah

▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
    ▪ Distributor.pl: distribute binaries to processing nodes
    ▪ C++ Binaries: parse, agg, load
[Diagram: Partitioned Log File, Filer, Cheetah Master, Processing Tier]
Data Infrastructure
Moving from Cheetah to Hadoop

▪ Cheetah limitations
    ▪ Limited filer bandwidth
    ▪ No centralized logfile metadata
    ▪ Writing a new Cheetah job requires writing C++ binaries
    ▪ Jobs are difficult to monitor and debug
    ▪ No support for ad hoc querying
    ▪ Not open source
Data Infrastructure
Hadoop as Enterprise Data Warehouse
[Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]
Anatomy of the Facebook Cluster
Hardware
▪ Individual nodes
    ▪ CPU: Intel Xeon dual socket quad cores (8 cores per box)
    ▪ Memory: 16 GB ECC DRAM
    ▪ Disk: 4 x 1 TB 7200 RPM SATA
    ▪ Network: 1 GbE
▪ Topology
    ▪ 320 nodes arranged into 8 racks of 40 nodes each
    ▪ 8 x 1 Gbps links out to the core switch
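As a back-of-the-envelope check, the per-node figures on this slide can be multiplied out to cluster totals. The sketch below is only arithmetic on the slide's numbers. It assumes that the "8 x 1 Gbps" uplink figure applies to each rack, which the slide does not state, and the variable names are illustrative.

```python
# Cluster totals implied by the per-node hardware figures on this slide.
NODES = 320
RACKS = 8
NODES_PER_RACK = 40
CORES_PER_NODE = 8       # dual socket, quad core
RAM_GB_PER_NODE = 16
DISKS_PER_NODE = 4
TB_PER_DISK = 1
NODE_LINK_GBPS = 1
# Assumption: each rack has 8 x 1 Gbps links to the core switch.
RACK_UPLINK_GBPS = 8 * 1

# Sanity check: the stated topology is self-consistent.
assert RACKS * NODES_PER_RACK == NODES

total_cores = NODES * CORES_PER_NODE
total_ram_gb = NODES * RAM_GB_PER_NODE
raw_disk_tb = NODES * DISKS_PER_NODE * TB_PER_DISK

# Ratio of in-rack node bandwidth to rack uplink bandwidth.
oversubscription = NODES_PER_RACK * NODE_LINK_GBPS / RACK_UPLINK_GBPS

print(f"cores: {total_cores}")
print(f"RAM: {total_ram_gb} GB")
print(f"raw disk: {raw_disk_tb} TB")
print(f"rack oversubscription (assumed per-rack uplinks): {oversubscription}:1")
```

Raw disk is not usable capacity: HDFS replication and space reserved for intermediate data reduce it, by an amount the slide does not give.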
Anatomy of the Facebook Cluster
Recent Cluster Statistics
▪ From May 2nd to May 21st:
    ▪ Total jobs: 8,794
    ▪ Total map tasks: 1,362,429
    ▪ Total reduce tasks: 86,806
    ▪ Average duration of a successful job: 296 s
    ▪ Average duration of a successful map: 81 s
    ▪ Average duration of a successful reduce: 678 s
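A few derived ratios make these totals easier to read. The sketch below only divides the slide's numbers. It assumes the window falls in 2008 (the year of the talk) and counts both end dates.

```python
from datetime import date

TOTAL_JOBS = 8_794
TOTAL_MAPS = 1_362_429
TOTAL_REDUCES = 86_806

# Assumption: May 2nd to May 21st, 2008, inclusive of both end dates.
days = (date(2008, 5, 21) - date(2008, 5, 2)).days + 1

jobs_per_day = TOTAL_JOBS / days
maps_per_job = TOTAL_MAPS / TOTAL_JOBS
reduces_per_job = TOTAL_REDUCES / TOTAL_JOBS
maps_per_reduce = TOTAL_MAPS / TOTAL_REDUCES

print(f"{days} days, {jobs_per_day:.1f} jobs per day")
print(f"{maps_per_job:.1f} map tasks per job")
print(f"{reduces_per_job:.1f} reduce tasks per job")
print(f"{maps_per_reduce:.1f} map tasks per reduce task")
```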
Initial Hadoop Applications
Hadoop Streaming
▪ Almost all applications at Facebook use Hadoop Streaming
▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe
▪ Facebook users write in Python, PHP, C++
▪ Allows for library reuse, faster development
▪ Eats way too much CPU
▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
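The pipe-in, pipe-out contract described above can be sketched in Python as a pair of line-oriented functions. This is a minimal illustration, not Facebook's code: the record layout (URL, then user id, tab-separated) and the function names are assumptions, and `run_locally` merely imitates the sort that the framework performs between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'key<TAB>1' per record; the key is the first tab-separated field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            yield f"{fields[0]}\t1"


def reducer(lines):
    """Sum the counts for each key. Input must arrive sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_locally(lines):
    """Imitate map -> sort -> reduce in a single process for testing."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical page-view records: URL <TAB> user id.
sample = ["/home.php\tu1", "/profile.php\tu2", "/home.php\tu3"]
result = run_locally(sample)
print(result)
```

In an actual streaming job the mapper and reducer would be separate scripts that read standard input and write standard output, handed to the streaming jar through its `-mapper` and `-reducer` options.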
Initial Hadoop Applications
Unstructured text analysis
▪ Intern asked to understand brand sentiment and influence
▪ Many tools for supporting his project had to be built
    ▪ Understanding serialization format of wall post logs
    ▪ Common data operations: project, filter, join, group by
    ▪ Developed using Hadoop streaming for rapid prototyping in Python
    ▪ Scheduling regular processing and recovering from failures
    ▪ Making it easy to regularly load new data
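The four data operations named above (project, filter, join, group by) can be sketched on in-memory rows. This is an illustration of what the operations do, not the tooling that was built. The row layout, field names, and sample data are hypothetical.

```python
from collections import defaultdict


def project(rows, columns):
    """Keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Inner hash join of two row lists on a shared key column."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_by(rows, key, aggregate):
    """Group rows by a key column and apply an aggregate to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {k: aggregate(v) for k, v in groups.items()}


# Hypothetical wall posts and user records.
posts = [
    {"userid": 1, "text": "love brandX"},
    {"userid": 2, "text": "nothing to see"},
    {"userid": 1, "text": "brandX again"},
]
users = [{"userid": 1, "country": "US"}, {"userid": 2, "country": "CA"}]

mentions = filter_rows(posts, lambda r: "brandX" in r["text"])
joined = join(mentions, users, "userid")
counts = group_by(project(joined, ["country"]), "country", len)
print(counts)
```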
Lexicon
Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
    ▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
    ▪ Can be used for regression or classification
    ▪ See “Random Forests” by Leo Breiman
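The "build many trees and average them" idea can be sketched with the standard library alone. This is a deliberately simplified stand-in, not Breiman's full algorithm and not the implementation used at Facebook: each "tree" is a single-split stump on one randomly chosen feature, trained on a bootstrap sample, and the forest prediction is the mean of the stump predictions. All names and the toy data are assumptions.

```python
import random
from statistics import mean


def fit_stump(X, y, feature):
    """Find the single threshold on one feature that minimises squared error."""
    best = None
    values = sorted({row[feature] for row in X})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        ml, mr = mean(left), mean(right)
        err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    if best is None:  # the feature is constant in this sample
        m = mean(y)
        return (feature, 0.0, m, m)
    return (feature, best[1], best[2], best[3])


def fit_forest(X, y, n_trees=50, seed=0):
    """Train stumps on bootstrap samples, each with a random feature."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(d)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps (regression)."""
    return mean(ml if row[f] <= t else mr for f, t, ml, mr in forest)


# Toy data: feature 0 determines the target, feature 1 is constant noise.
X = [[float(x), 0.0] for x in range(10)]
y = [0.0 if x < 5 else 10.0 for x in range(10)]
forest = fit_forest(X, y)
low, high = predict(forest, [0.0, 0.0]), predict(forest, [9.0, 0.0])
print(low, high)
```

For classification, the averaging step would be replaced by a majority vote over the trees.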
More Hadoop Applications
Insights
▪ Monitor performance of your Facebook Ad, Page, Application
▪ Regular aggregation of high volumes of log file data
▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop
Insights
More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
    ▪ Notifications
    ▪ News Feed story insertion
    ▪ Invitations
    ▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application
More Hadoop Applications
Miscellaneous
▪ Experimentation Platform back end
    ▪ A/B Testing
    ▪ Champion/Challenger Testing
▪ Lots of internal analyses
    ▪ Export smaller data sets to R
▪ Ads: targeting optimization, fraud detection
▪ Search: index building, ranking optimization
▪ Load testing for new storage systems
▪ Language prediction for translation targeting
Hive
Structured Data Management with Hadoop
▪ Hadoop:
    ▪ HDFS
    ▪ MapReduce
    ▪ Resource Manager
    ▪ Job Scheduler
▪ Hive:
    ▪ Logical data partitioning
    ▪ Metadata store
    ▪ Query Operators
    ▪ Query Language
Hive
Hive
Data Model
Hive
Sample Queries
▪ CREATE TABLE page_view(viewTime DATETIME, userid MEDIUMINT,
                         page_url STRING, referrer_url STRING, ip STRING)
 COMMENT 'This is the page view table'
 PARTITIONED BY(date DATETIME, country STRING)
 BUCKETED ON (userid) INTO 32 BUCKETS
 ROW FORMAT DELIMITED FIELD DELIMITER 001 ROW DELIMITER 013
 STORED AS COMPRESSED
 LOCATION '/user/facebook/warehouse/page_view';

▪ FROM pv_users
 INSERT INTO TABLE pv_gender_sum
   SELECT pv_users.gender, count_distinct(pv_users.userid)
   GROUP BY(pv_users.gender)
 INSERT INTO FILE /user/facebook/tmp/pv_age_sum.txt
   SELECT pv_users.age, count_distinct(pv_users.userid)
   GROUP BY(pv_users.age);
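The PARTITIONED BY and BUCKETED ON clauses in the CREATE TABLE statement above determine where a row is stored. The Python sketch below illustrates the idea under two assumptions that the slide does not confirm: partitions map to `key=value` subdirectories of the table location, and the bucket is the user id modulo the bucket count. The helper names are hypothetical.

```python
import posixpath

TABLE_LOCATION = "/user/facebook/warehouse/page_view"  # from the LOCATION clause
NUM_BUCKETS = 32                                       # from INTO 32 BUCKETS


def partition_dir(date, country):
    """Directory for one (date, country) partition; key=value layout is assumed."""
    return posixpath.join(TABLE_LOCATION, f"date={date}", f"country={country}")


def bucket_for(userid, num_buckets=NUM_BUCKETS):
    """Bucket index for a user id; a plain modulo hash is assumed."""
    return userid % num_buckets


# Hypothetical row.
row = {"userid": 1000, "date": "2008-06-08", "country": "US"}
print(partition_dir(row["date"], row["country"]))
print(bucket_for(row["userid"]))
```

Partitioning lets a query that filters on date or country read only the matching directories, while bucketing on userid spreads each partition's rows across a fixed number of files.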
Hive
The Team
▪ Joydeep Sen Sarma
▪ Ashish Thusoo
▪ Pete Wyckoff
▪ Suresh Anthony
▪ Zheng Shao
▪ Venky Iyer
▪ Dhruba Borthakur
▪ Namit Jain
▪ Raghu Murthy
Cassandra
Structured Storage over a P2P Network
▪ Conceptually: BigTable data model on Dynamo infrastructure
▪ Design Goals:
    ▪ High availability
    ▪ Incremental scalability
    ▪ Eventual consistency (trade consistency for availability)
    ▪ Optimistic replication
    ▪ Low total cost of ownership
    ▪ Minimal administrative overhead
    ▪ Tunable tradeoffs between consistency, durability, and latency
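One common way to express the consistency and latency tradeoff in Dynamo-style systems is through replica counts: with N replicas, a write acknowledged by W of them and a read that consults R of them must share at least one replica whenever R + W > N. The slide does not say how Cassandra exposes this, so the N/R/W parameters below follow the Dynamo formulation and are an assumption, as is the function name.

```python
def quorums_overlap(n, r, w):
    """Return True if every read set of size r intersects every write set of size w."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


# Illustrative settings for three replicas.
settings = {
    "fast reads and writes (R=1, W=1)": (3, 1, 1),
    "quorum reads and writes (R=2, W=2)": (3, 2, 2),
    "read one, write all (R=1, W=3)": (3, 1, 3),
}
for name, (n, r, w) in settings.items():
    print(f"{name}: overlap guaranteed = {quorums_overlap(n, r, w)}")
```

Smaller R and W lower latency, because fewer replicas must respond, at the cost of reads that may miss the latest write.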
Cassandra
Architecture
Cassandra
The Team
▪ Avinash Lakshman
▪ Prashant Malik
▪ Karthik Ranganathan
(c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

Human Factors of XR: Using Human Factors to Design XR Systems
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

20080611accel

  • 1.
  • 2. Data Management at Facebook
      ▪ Jeff Hammerbacher, Manager, Data
      ▪ June 12, 2008
  • 3. My Background: Thanks for Asking
      ▪ hammer@facebook.com
      ▪ Studied Mathematics at Harvard
      ▪ Worked as a Quant on Wall Street
      ▪ Came to Facebook in early 2006 as a Research Scientist
      ▪ Now manage the Facebook Data Team
          ▪ 25 amazing data engineers and scientists with more on the way
          ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
  • 4. Serving Facebook.com: Data Retrieval and Hardware
      ▪ Three main server profiles:
          ▪ Web
          ▪ Memcached
          ▪ MySQL
      ▪ Simplified away:
          ▪ AJAX
          ▪ Photo and Video
          ▪ Services
      ▪ Diagram labels: GET /index.php HTTP/1.1, Host: www.facebook.com; Web Tier (more than 10,000 servers); Memcached Tier (around 1,000 servers); MySQL Tier (around 2,000 servers)
  • 5. Serving Facebook.com: Request Volume per Second
      ▪ Diagram labels, top to bottom: Web; 10M requests; Memcache (15TB RAM); 500K requests; MySQL (25TB RAM)
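The figures on slides 4 and 5 invite some back-of-the-envelope arithmetic. The sketch below is one reading of them, not something the slides state: it assumes the 10M requests per second land on the Memcache tier and the 500K on the MySQL tier, treats the "around" server counts as exact, and uses decimal units (1 TB = 1000 GB). All constant and function names are made up for illustration.

```python
# Figures read off slides 4 and 5. The assignment of request rates to tiers
# and the decimal unit conversion are assumptions, not stated on the slides.
MEMCACHE_REQUESTS_PER_SEC = 10_000_000
MYSQL_REQUESTS_PER_SEC = 500_000
MEMCACHE_SERVERS = 1_000   # "around 1,000 servers"
MYSQL_SERVERS = 2_000      # "around 2,000 servers"
MEMCACHE_RAM_TB = 15
MYSQL_RAM_TB = 25
GB_PER_TB = 1000           # assumption: decimal units


def per_server(total, servers):
    """Spread a tier-wide total evenly across the servers in the tier."""
    return total / servers


# Share of the Memcache request volume that also shows up at MySQL.
mysql_share = MYSQL_REQUESTS_PER_SEC / MEMCACHE_REQUESTS_PER_SEC

memcache_rps_per_server = per_server(MEMCACHE_REQUESTS_PER_SEC, MEMCACHE_SERVERS)
mysql_rps_per_server = per_server(MYSQL_REQUESTS_PER_SEC, MYSQL_SERVERS)
memcache_ram_gb_per_server = per_server(MEMCACHE_RAM_TB * GB_PER_TB, MEMCACHE_SERVERS)
mysql_ram_gb_per_server = per_server(MYSQL_RAM_TB * GB_PER_TB, MYSQL_SERVERS)
```

Under those assumptions the MySQL tier sees about one request for every twenty that reach Memcache, which is the usual argument for putting a cache tier in front of the database.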
  • 6. Services Infrastructure: Thrift, Mainly
      ▪ Developing a Thrift service:
          ▪ Define your data structures
              ▪ JSON-like data model
          ▪ Define your service endpoints
          ▪ Select your languages
          ▪ Generate stub code
          ▪ Write service logic
          ▪ Write client
          ▪ Configure and deploy
          ▪ Monitor, provision, and upgrade
  • 7. Services Infrastructure: What’s an SOA?
      ▪ Almost all services written in Thrift
          ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
      ▪ Batteries included
          ▪ Network transport libraries
          ▪ Serialization libraries
          ▪ Code generation
          ▪ Robust server implementations (multithreaded, nonblocking, etc.)
      ▪ Now an Apache Incubator project
      ▪ For more information, read the whitepaper
      ▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
  • 8. Data Infrastructure: Offline Batch Processing
      ▪ “Data Warehousing”
          ▪ Began with Oracle database
          ▪ Schedule data collection via cron
          ▪ Collect data every 24 hours
          ▪ “ETL” scripts: hand-coded Python
      ▪ Data volumes quickly grew
          ▪ Started at tens of GB in early 2006
          ▪ Up to about 1 TB per day by mid-2007
          ▪ Log files largest source of data growth
      ▪ Diagram labels: Scribe Tier; MySQL Tier; Data Collection Server; Oracle Database Server
  • 9. Data Infrastructure: Distributed Processing with Cheetah
      ▪ Goal: summarize log files outside of the database
      ▪ Solution: Cheetah, a distributed log file processing system
          ▪ Distributor.pl: distribute binaries to processing nodes
          ▪ C++ Binaries: parse, agg, load
      ▪ Diagram labels: Partitioned Log File; Cheetah Master; Filer; Processing Tier
  • 10. Data Infrastructure: Moving from Cheetah to Hadoop
      ▪ Cheetah limitations
          ▪ Limited filer bandwidth
          ▪ No centralized logfile metadata
          ▪ Writing a new Cheetah job requires writing C++ binaries
          ▪ Jobs are difficult to monitor and debug
          ▪ No support for ad hoc querying
          ▪ Not open source
  • 11. Data Infrastructure: Hadoop as Enterprise Data Warehouse
      ▪ Diagram labels: Scribe Tier; MySQL Tier; Hadoop Tier; Oracle RAC Servers
  • 12. Anatomy of the Facebook Cluster: Hardware
      ▪ Individual nodes
          ▪ CPU: Intel Xeon dual socket quad cores (8 cores per box)
          ▪ Memory: 16 GB ECC DRAM
          ▪ Disk: 4 x 1 TB 7200 RPM SATA
          ▪ Network: 1 gE
      ▪ Topology
          ▪ 320 nodes arranged into 8 racks of 40 nodes each
          ▪ 8 x 1 Gbps links out to the core switch
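The per-node figures on slide 12 can be rolled up into cluster-wide totals. This is a rough sketch with hypothetical names; the disk figure is raw capacity before any HDFS replication, and the oversubscription ratio assumes the "8 x 1 Gbps links" are per rack, which the slide does not say.

```python
# Per-node and topology figures from slide 12.
RACKS = 8
NODES_PER_RACK = 40
CORES_PER_NODE = 8
RAM_GB_PER_NODE = 16
DISKS_PER_NODE = 4
DISK_TB = 1
NODE_LINK_GBPS = 1
RACK_UPLINK_GBPS = 8  # assumption: the 8 x 1 Gbps uplinks are per rack


def cluster_totals():
    """Aggregate the per-node hardware figures across the whole cluster."""
    nodes = RACKS * NODES_PER_RACK
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "ram_gb": nodes * RAM_GB_PER_NODE,
        "raw_disk_tb": nodes * DISKS_PER_NODE * DISK_TB,
        # Node bandwidth in a rack divided by that rack's uplink bandwidth.
        "rack_oversubscription": NODES_PER_RACK * NODE_LINK_GBPS / RACK_UPLINK_GBPS,
    }


totals = cluster_totals()
```

The node count reproduces the slide's own figure (8 racks of 40 nodes is 320 nodes), which is a quick consistency check on the topology bullet.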
  • 13. Anatomy of the Facebook Cluster: Recent Cluster Statistics
      ▪ From May 2nd to May 21st:
          ▪ Total jobs: 8,794
          ▪ Total map tasks: 1,362,429
          ▪ Total reduce tasks: 86,806
          ▪ Average duration of a successful job: 296 s
          ▪ Average duration of a successful map: 81 s
          ▪ Average duration of a successful reduce: 678 s
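A few derived averages make the slide 13 totals easier to interpret. The sketch below uses only the totals on the slide; the 20-day window assumes both May 2nd and May 21st are counted, and the variable names are illustrative.

```python
# Totals from slide 13 (May 2nd to May 21st).
TOTAL_JOBS = 8_794
TOTAL_MAP_TASKS = 1_362_429
TOTAL_REDUCE_TASKS = 86_806
DAYS_IN_WINDOW = 20  # assumption: both end dates are included

maps_per_job = TOTAL_MAP_TASKS / TOTAL_JOBS
reduces_per_job = TOTAL_REDUCE_TASKS / TOTAL_JOBS
jobs_per_day = TOTAL_JOBS / DAYS_IN_WINDOW
maps_per_reduce = TOTAL_MAP_TASKS / TOTAL_REDUCE_TASKS

print(f"maps per job:    {maps_per_job:.1f}")
print(f"reduces per job: {reduces_per_job:.1f}")
print(f"jobs per day:    {jobs_per_day:.1f}")
print(f"maps per reduce: {maps_per_reduce:.1f}")
```

The typical job is map-heavy: roughly 155 map tasks against about 10 reduce tasks.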
  • 14. Initial Hadoop Applications: Hadoop Streaming
      ▪ Almost all applications at Facebook use Hadoop Streaming
      ▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe
      ▪ Facebook users write in Python, PHP, C++
      ▪ Allows for library reuse, faster development
      ▪ Eats way too much CPU
      ▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
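The pipe contract on slide 14 is easiest to see in a small example. The following is a minimal sketch of a streaming-style mapper and reducer in Python, not code from the talk: it counts tokens, relies on the streaming convention of tab-separated key/value lines, and assumes the reducer's input arrives sorted by key. The function names and the `main` helper are hypothetical.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'token<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield f"{token}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key. Input must already be sorted by key,
    which is what the framework's shuffle step provides."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def main(argv, stdin, stdout):
    """Run as mapper or reducer depending on the first argument."""
    step = reduce_lines if argv[1:2] == ["reduce"] else map_lines
    for out in step(stdin):
        stdout.write(out + "\n")


# Local dry run without Hadoop: sorted() stands in for the shuffle.
mapped = list(map_lines(["hive hadoop hive\n", "hadoop\n"]))
reduced = list(reduce_lines(sorted(mapped)))
```

To use this as a real streaming job, a script would call `main(sys.argv, sys.stdin, sys.stdout)` at its entry point and be passed as the mapper and reducer commands. Because each step only reads and writes text lines, the same functions can be tested locally, which is part of the "faster development" point on the slide.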
  • 15. Initial Hadoop Applications: Unstructured text analysis
      ▪ Intern asked to understand brand sentiment and influence
      ▪ Many tools for supporting his project had to be built
          ▪ Understanding serialization format of wall post logs
          ▪ Common data operations: project, filter, join, group by
          ▪ Developed using Hadoop streaming for rapid prototyping in Python
          ▪ Scheduling regular processing and recovering from failures
          ▪ Making it easy to regularly load new data
  • 17. Initial Hadoop Applications: Ensemble Learning
      ▪ Build a lot of Decision Trees and average them
      ▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
      ▪ Can be used for regression or classification
      ▪ See “Random Forests” by Leo Breiman
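"Build a lot of Decision Trees and average them" can be sketched in a few lines. The code below is a deliberately simplified stand-in, not Breiman's full algorithm and not Facebook's implementation: each "tree" is a depth-1 regression stump fitted on a bootstrap sample using one randomly chosen feature, and the forest prediction is the plain average. All names are illustrative.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Fit a depth-1 regression tree on one feature: choose the threshold
    with the lowest squared error and predict the mean on each side."""
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # the feature is constant in this sample
        overall = mean(targets)
        return (feature, float("inf"), overall, overall)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # with replacement
        feature = rng.randrange(len(rows[0]))              # random feature
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], feature))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps in the forest."""
    return mean(left if row[feature] <= threshold else right
                for feature, threshold, left, right in forest)


# Toy data: the target jumps from 0 to 10 when the single feature reaches 5.
rows = [[x] for x in range(10)]
targets = [0] * 5 + [10] * 5
forest = fit_forest(rows, targets)
```

The random draws for the bootstrap sample and the feature play the role of the "random vector" in the slide's definition: every stump is built from an independent draw out of the same distribution.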
  • 18. More Hadoop Applications: Insights
      ▪ Monitor performance of your Facebook Ad, Page, Application
      ▪ Regular aggregation of high volumes of log file data
      ▪ First hourly pipelines
      ▪ Publish data back to a MySQL tier
      ▪ System currently only running partially on Hadoop
  • 20. More Hadoop Applications: Platform Application Reputation Scoring
      ▪ Users complaining about being spammed by Platform applications
      ▪ Now, every Platform Application has a set of quotas
          ▪ Notifications
          ▪ News Feed story insertion
          ▪ Invitations
          ▪ Emails
      ▪ Quotas determined by calculating a “reputation score” for the application
  • 21. More Hadoop Applications: Miscellaneous
      ▪ Experimentation Platform back end
          ▪ A/B Testing
          ▪ Champion/Challenger Testing
      ▪ Lots of internal analyses
          ▪ Export smaller data sets to R
      ▪ Ads: targeting optimization, fraud detection
      ▪ Search: index building, ranking optimization
      ▪ Load testing for new storage systems
      ▪ Language prediction for translation targeting
  • 22. Hive: Structured Data Management with Hadoop
      ▪ Hadoop:
          ▪ HDFS
          ▪ MapReduce
          ▪ Resource Manager
          ▪ Job Scheduler
      ▪ Hive:
          ▪ Logical data partitioning
          ▪ Metadata store
          ▪ Query Operators
          ▪ Query Language
  • 23. Hive
  • 25. Hive: Sample Queries
      ▪ CREATE TABLE page_view(viewTime DATETIME, userid MEDIUMINT,
            page_url STRING, referrer_url STRING, ip STRING)
        COMMENT 'This is the page view table'
        PARTITIONED BY(date DATETIME, country STRING)
        BUCKETED ON (userid) INTO 32 BUCKETS
        ROW FORMAT DELIMITED FIELD DELIMITER 001 ROW DELIMITER 013
        STORED AS COMPRESSED
        LOCATION '/user/facebook/warehouse/page_view';
      ▪ FROM pv_users
        INSERT INTO TABLE pv_gender_sum
            SELECT pv_users.gender, count_distinct(pv_users.userid)
            GROUP BY(pv_users.gender)
        INSERT INTO FILE /user/facebook/tmp/pv_age_sum.txt
            SELECT pv_users.age, count_distinct(pv_users.userid)
            GROUP BY(pv_users.age);
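The second sample query feeds two aggregations from a single FROM clause. As a rough illustration of what it computes (not of how Hive executes it), the Python sketch below scans the rows once and counts distinct user ids per gender and per age. The function name and the dictionary-based row format are assumptions made for the example.

```python
from collections import defaultdict


def distinct_users_by(rows, columns):
    """For each grouping column, count distinct userids per column value,
    reading the input rows only once."""
    seen = {column: defaultdict(set) for column in columns}
    for row in rows:  # one scan shared by all groupings
        for column in columns:
            seen[column][row[column]].add(row["userid"])
    return {column: {value: len(users) for value, users in groups.items()}
            for column, groups in seen.items()}


# Hypothetical pv_users rows; user 1 appears twice but is counted once.
pv_users = [
    {"userid": 1, "gender": "f", "age": 25},
    {"userid": 1, "gender": "f", "age": 25},
    {"userid": 2, "gender": "m", "age": 25},
    {"userid": 3, "gender": "f", "age": 31},
]
summary = distinct_users_by(pv_users, ["gender", "age"])
```

Here `summary["gender"]` corresponds to the rows written to pv_gender_sum and `summary["age"]` to the contents of pv_age_sum.txt.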
  • 26. Hive: The Team
      ▪ Joydeep Sen Sarma
      ▪ Ashish Thusoo
      ▪ Pete Wyckoff
      ▪ Suresh Anthony
      ▪ Zheng Shao
      ▪ Venky Iyer
      ▪ Dhruba Borthakur
      ▪ Namit Jain
      ▪ Raghu Murthy
  • 27. Cassandra: Structured Storage over a P2P Network
      ▪ Conceptually: BigTable data model on Dynamo infrastructure
      ▪ Design Goals:
          ▪ High availability
          ▪ Incremental scalability
          ▪ Eventual consistency (trade consistency for availability)
          ▪ Optimistic replication
          ▪ Low total cost of ownership
          ▪ Minimal administrative overhead
          ▪ Tunable tradeoffs between consistency, durability, and latency
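The slide does not say how the consistency tradeoff is tuned. One common mechanism in Dynamo-style systems is to choose how many of the N replicas must acknowledge a read (R) and a write (W); it is assumed here purely for illustration and is not taken from the talk. The sketch checks by brute force that every read set overlaps every write set exactly when R + W > N.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every possible set of r read replicas shares at least one
    replica with every possible set of w write replicas."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(replicas, r)
               for write_set in combinations(replicas, w))


# With three replicas, R = W = 2 guarantees overlap; R = W = 1 does not,
# trading consistency for lower latency and higher availability.
strict = quorums_always_overlap(3, 2, 2)
relaxed = quorums_always_overlap(3, 1, 1)
```

Lower R and W values respond faster and tolerate more unavailable replicas, at the cost of reads that may miss the latest write. That is the kind of tradeoff the last design goal on the slide refers to.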
  • 29. Cassandra: The Team
      ▪ Avinash Lakshman
      ▪ Prashant Malik
      ▪ Karthik Ranganathan
  • 30. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0