Measuring the Social Graph:
Facebook’s Data Infrastructure


Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008
One in a series of presentations given at the IBM Cloud Computing Center in Dublin.


  1. Measuring the Social Graph: Facebook’s Data Infrastructure. Jeff Hammerbacher, Manager, Data. May 28 - 29, 2008
  2. My Background
     ▪ hammer@facebook.com
     ▪ Studied Mathematics at Harvard
     ▪ Worked as a Quant on Wall Street
     ▪ Came to Facebook in early 2006 as a Research Scientist
     ▪ Now manage the Facebook Data Team
     ▪ 22 amazing data engineers and scientists with more on the way
     ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
  3. Developing for Facebook.com: Commonly Used Software
     ▪ Traditional LAMP stack
       ▪ Linux: mostly FC and RHEL; now testing CentOS
       ▪ Apache
       ▪ MySQL: running 4.1 for a long time, now up to 5.0
       ▪ PHP: heavily modified
     ▪ Memcached
     ▪ Subversion
     ▪ Firebug
     ▪ Thrift
  4. Serving Facebook.com: Data Retrieval and Hardware
     ▪ Session information
       ▪ Stored on client (in cookie)
       ▪ New web server for each request
     ▪ Three server profiles:
       ▪ Web (all CPU)
       ▪ Memcached (all RAM)
       ▪ MySQL (mostly RAM)
     [Diagram labels: GET /index.php HTTP/1.1, Host: www.facebook.com; Web Tier (more than 10,000 servers); Memcached Tier (around 1,000 servers); MySQL Tier (around 2,000 servers)]
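The slide names the three tiers but does not spell out the lookup order between them. A common pattern for a Memcached tier sitting in front of MySQL is cache-aside: read from the cache, fall back to the database on a miss, then populate the cache. The sketch below illustrates that pattern only, under the assumption that it applies here; `CacheAsideReader`, `fake_mysql`, and the key format are hypothetical, and plain dictionaries stand in for both Memcached and MySQL.

```python
from typing import Callable, Dict, Optional


class CacheAsideReader:
    """Try the cache first; on a miss, query the database and fill the cache."""

    def __init__(self, cache: Dict[str, str],
                 db_lookup: Callable[[str], Optional[str]]):
        self.cache = cache          # stand-in for the Memcached tier (assumption)
        self.db_lookup = db_lookup  # stand-in for a MySQL query (assumption)
        self.db_hits = 0            # counts how often the database was consulted

    def get(self, key: str) -> Optional[str]:
        if key in self.cache:
            return self.cache[key]
        self.db_hits += 1
        value = self.db_lookup(key)
        if value is not None:
            self.cache[key] = value  # later reads are served from the cache
        return value


# Hypothetical data: a dict plays the role of the database.
fake_mysql = {"user:1": "alice"}
reader = CacheAsideReader(cache={}, db_lookup=fake_mysql.get)
first = reader.get("user:1")   # miss: goes to the database
second = reader.get("user:1")  # hit: served from the cache
```

Because session state lives in the client's cookie, any web server can run this read path for any request, which fits the "new web server for each request" bullet.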
  5. Serving Facebook.com: Request Volume per Second
     [Diagram labels, in source order: Web; 10M requests; 15TB RAM; Memcache; 500K requests; 25TB RAM; MySQL]
  6. Services Infrastructure: More About Thrift
     ▪ Developing a Thrift service
       ▪ Define your data structures
         ▪ JSON-like data model
       ▪ Define your service endpoints
       ▪ Select your languages
       ▪ Generate stub code
       ▪ Write service logic
       ▪ Write client
       ▪ Configure and deploy!
       ▪ Monitor, provision, and upgrade
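The development steps on slide 6 can be sketched in plain Python. This is not Thrift's own interface definition language or its generated code; it only mirrors the shape of the workflow (data structure, service endpoints, service logic, client). All names here (`UserProfile`, `ProfileService`, `InMemoryProfileHandler`, `describe`) are hypothetical, and the code-generation, deployment, and monitoring steps are left out.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


# Step 1: define the data structure (JSON-like: scalars plus a list field).
@dataclass
class UserProfile:
    user_id: int
    name: str
    networks: List[str] = field(default_factory=list)


# Step 2: define the service endpoints as an interface.
class ProfileService(ABC):
    @abstractmethod
    def get_profile(self, user_id: int) -> UserProfile:
        ...


# Step 3: write the service logic behind that interface.
class InMemoryProfileHandler(ProfileService):
    def __init__(self, profiles: Dict[int, UserProfile]):
        self._profiles = profiles

    def get_profile(self, user_id: int) -> UserProfile:
        return self._profiles[user_id]


# Step 4: write a client that depends only on the interface.
def describe(service: ProfileService, user_id: int) -> str:
    profile = service.get_profile(user_id)
    return f"{profile.name} ({', '.join(profile.networks)})"


handler = InMemoryProfileHandler({1: UserProfile(1, "alice", ["Harvard"])})
summary = describe(handler, 1)
```

In a real Thrift project the structure and interface would be declared once and stub code generated for each selected language; here the client calls the handler in-process, with no network transport.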
  7. Services Infrastructure: Reinventing the SOA Wheel
     ▪ Almost all services written in Thrift
       ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
     ▪ Batteries included
       ▪ Network transport libraries
       ▪ Serialization libraries
       ▪ Code generation
       ▪ Robust server implementations (multithreaded, nonblocking, etc.)
     ▪ Now an Apache Incubator project
     ▪ For more information, read the whitepaper
     ▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
  8. Data Infrastructure: Offline Batch Processing
     ▪ “Data Warehousing”
       ▪ Began with Oracle database
       ▪ Schedule data collection via cron
       ▪ Collect data every 24 hours
       ▪ “ETL” scripts: hand-coded Python
     ▪ Data volumes quickly grew
       ▪ Started at tens of GB in early 2006
       ▪ Up to about 1 TB per day by mid-2007
       ▪ Log files largest source of data growth
     [Diagram labels: Scribe Tier; MySQL Tier; Data Collection Server; Oracle Database Server]
  9. Data Infrastructure: Distributed Processing with Cheetah
     ▪ Goal: summarize log files outside of the database
     ▪ Solution: Cheetah, a distributed log file processing system
       ▪ Distributor.pl: distribute binaries to processing nodes
       ▪ C++ Binaries: parse, agg, load
     [Diagram labels: Partitioned Log File; Cheetah Master; Filer; Processing Tier]
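The parse, agg, load stages named on slide 9 can be illustrated with a small sketch. Cheetah's binaries were written in C++ and their actual behavior is not described in the slides, so everything below is an assumption: the tab-separated `timestamp<TAB>event` log format, the function bodies, and the dictionary standing in for the database table. Each inner list plays the role of one partition of a partitioned log file.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def parse(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each tab-separated line into (timestamp, event); skip malformed lines."""
    records = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            records.append((parts[0], parts[1]))
    return records


def agg(records: Iterable[Tuple[str, str]]) -> Counter:
    """Summarize one partition: count occurrences of each event."""
    return Counter(event for _, event in records)


def load(partials: Iterable[Counter], table: Dict[str, int]) -> None:
    """Merge per-partition counts into the summary table (a dict stands in for the database)."""
    for partial in partials:
        for event, count in partial.items():
            table[event] = table.get(event, 0) + count


# Hypothetical partitioned log: each inner list is one partition.
partitions = [
    ["t1\tpage_view", "t2\tlogin", "bad line"],
    ["t3\tpage_view"],
]
summary_table: Dict[str, int] = {}
load((agg(parse(p)) for p in partitions), summary_table)
```

Only the small per-partition summaries reach the final table, which matches the stated goal of summarizing log files outside of the database. In this sketch the partitions are processed one after another; distributing them across processing nodes is the part the slide attributes to Distributor.pl and the processing tier.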
  10. Data Infrastructure: Moving from Cheetah to Hadoop
     ▪ Cheetah limitations
       ▪ Limited filer bandwidth
       ▪ No centralized log file metadata
       ▪ Writing a new Cheetah job requires writing C++ binaries
       ▪ No support for ad hoc querying
       ▪ Not open source
  11. Data Infrastructure: Hadoop as Enterprise Data Warehouse
     [Diagram labels: Scribe Tier; MySQL Tier; Hadoop Tier; Oracle RAC Servers]
  12. The Facebook Data team builds scalable platforms for the collection, management, and analysis of data. We use these platforms to help drive informed decisions in areas critical to the success of the company. We build tools and provide support for anyone at Facebook who would like to use our platforms to help make data-driven decisions or build data-intensive products and services.
  13. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
