One in a series of presentations given at the IBM Cloud Computing Center in Dublin.

Published in: Technology

  1. Measuring the Social Graph: Facebook’s Data Infrastructure. Jeff Hammerbacher, Manager, Data. May 28 - 29, 2008
  2. My Background (hammer@facebook.com)
     ▪ Studied Mathematics at Harvard
     ▪ Worked as a Quant on Wall Street
     ▪ Came to Facebook in early 2006 as a Research Scientist
     ▪ Now manage the Facebook Data Team
       ▪ 22 amazing data engineers and scientists with more on the way
       ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
  3. Developing for Facebook.com: Commonly Used Software
     ▪ Traditional LAMP stack
       ▪ Linux: mostly FC and RHEL; now testing CentOS
       ▪ Apache
       ▪ MySQL: running 4.1 for a long time, now up to 5.0
       ▪ PHP: heavily modified
     ▪ Memcached
     ▪ Subversion
     ▪ Firebug
     ▪ Thrift
  4. Serving Facebook.com: Data Retrieval and Hardware
     Example request: GET /index.php HTTP/1.1, Host: www.facebook.com
     ▪ Session information
       ▪ Stored on client (in cookie)
       ▪ New web server for each request
     ▪ Three server profiles:
       ▪ Web (all CPU)
       ▪ Memcached (all RAM)
       ▪ MySQL (mostly RAM)
     (Diagram: Web Tier, more than 10,000 servers; Memcached Tier, around 1,000 servers; MySQL Tier, around 2,000 servers)
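The slide says session information lives in a client cookie, so any web server can handle any request. One common way to make that safe is to sign the cookie so a server can verify it without shared session storage. The following is a minimal sketch of that idea only; the slides do not describe Facebook's actual cookie format, and `SECRET_KEY`, `encode_session`, and `decode_session` are hypothetical names chosen for illustration.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical secret; in this sketch every web server holds the same key.
SECRET_KEY = b"example-secret"


def encode_session(data: dict) -> str:
    """Serialize session data and append an HMAC so tampering is detectable."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode())
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def decode_session(cookie: str):
    """Return the session dict, or None if the cookie is malformed or forged."""
    payload, sep, signature = cookie.rpartition(".")
    if not sep:
        return None  # no signature separator at all
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison of the received and recomputed signatures.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode()))


# Any server holding SECRET_KEY can validate a cookie issued by any other.
cookie = encode_session({"user_id": 42})
session = decode_session(cookie)
```

Because the server keeps no per-user state, a request can be routed to whichever web server is free, which fits the "new web server for each request" bullet.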
  5. Serving Facebook.com: Request Volume per Second
     (Diagram: Web → 10M requests → Memcache, 15TB RAM → 500K requests → MySQL, 25TB RAM)
  6. Services Infrastructure: More About Thrift
     ▪ Developing a Thrift service
       ▪ Define your data structures
         ▪ JSON-like data model
       ▪ Define your service endpoints
       ▪ Select your languages
       ▪ Generate stub code
       ▪ Write service logic
       ▪ Write client
       ▪ Configure and deploy!
       ▪ Monitor, provision, and upgrade
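The division of labor in the steps above can be illustrated with plain Python. This is not Thrift code: in a real Thrift project the data structures and endpoints are declared in an interface definition file and the Thrift compiler generates the stubs. Here the "stubs" are written by hand, and `UserProfile`, `ProfileService`, `ProfileHandler`, and `ProfileClient` are all hypothetical names used only to show which piece corresponds to which step.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class UserProfile:
    """Step: define your data structures (hypothetical example struct)."""
    user_id: int
    name: str


class ProfileService(Protocol):
    """Step: define your service endpoints (stands in for generated stubs)."""

    def get_profile(self, user_id: int) -> UserProfile: ...


class ProfileHandler:
    """Step: write service logic behind the endpoint definition."""

    def __init__(self) -> None:
        self._profiles = {1: UserProfile(user_id=1, name="Alice")}

    def get_profile(self, user_id: int) -> UserProfile:
        return self._profiles[user_id]


class ProfileClient:
    """Step: write client. Calls in-process here; a real client would go
    through a network transport and serialization layer."""

    def __init__(self, service: ProfileService) -> None:
        self._service = service

    def get_profile(self, user_id: int) -> UserProfile:
        return self._service.get_profile(user_id)


client = ProfileClient(ProfileHandler())
profile = client.get_profile(1)
```

The point of the sketch is that the client depends only on the endpoint definition, so the service logic can be rewritten or redeployed without changing callers.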
  7. Services Infrastructure: Reinventing the SOA Wheel
     ▪ Almost all services written in Thrift
       ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
     ▪ Batteries included
       ▪ Network transport libraries
       ▪ Serialization libraries
       ▪ Code generation
       ▪ Robust server implementations (multithreaded, nonblocking, etc.)
     ▪ Now an Apache Incubator project
     ▪ For more information, read the whitepaper
     ▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
  8. Data Infrastructure: Offline Batch Processing
     ▪ “Data Warehousing”
       ▪ Began with Oracle database
       ▪ Schedule data collection via cron
       ▪ Collect data every 24 hours
       ▪ “ETL” scripts: hand-coded Python
     ▪ Data volumes quickly grew
       ▪ Started at tens of GB in early 2006
       ▪ Up to about 1 TB per day by mid-2007
       ▪ Log files largest source of data growth
     (Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server)
  9. Data Infrastructure: Distributed Processing with Cheetah
     ▪ Goal: summarize log files outside of the database
     ▪ Solution: Cheetah, a distributed log file processing system
       ▪ Distributor.pl: distribute binaries to processing nodes
       ▪ C++ binaries: parse, agg, load
     (Diagram: Partitioned Log File, Cheetah Master, Filer, Processing Tier)
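The parse, agg, load pipeline over a partitioned log file can be sketched as follows. This is an illustration of the general pattern, not Cheetah itself: the slides say the real stages were C++ binaries and give no details of their formats. The three-field log line (`timestamp user_id action`) and the function names are assumptions, and `load` merges into a dictionary as a stand-in for writing summaries to a database.

```python
from collections import Counter


def parse(line: str):
    """Parse a hypothetical 'timestamp user_id action' line; None if malformed."""
    fields = line.split()
    if len(fields) != 3:
        return None
    return fields[2]  # keep only the action for this summary


def aggregate(partition):
    """Count actions within one partition; each node could run this independently."""
    counts = Counter()
    for line in partition:
        action = parse(line)
        if action is not None:
            counts[action] += 1
    return counts


def load(partials):
    """Merge per-partition counts; a dict stands in for the database table."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Two partitions of one log file, including one malformed line.
partitions = [
    ["1 u1 view", "2 u2 click"],
    ["3 u1 view", "bad line"],
]
summary = load(aggregate(p) for p in partitions)
```

Because each partition is summarized on its own, only the small per-partition counts need to be combined at the end, which is what lets the summarization happen outside the database.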
  10. Data Infrastructure: Moving from Cheetah to Hadoop
     ▪ Cheetah limitations
       ▪ Limited filer bandwidth
       ▪ No centralized log file metadata
       ▪ Writing a new Cheetah job requires writing C++ binaries
       ▪ No support for ad hoc querying
       ▪ Not open source
  11. Data Infrastructure: Hadoop as Enterprise Data Warehouse
     (Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers)
  12. The Facebook Data team builds scalable platforms for the collection, management, and analysis of data. We use these platforms to help drive informed decisions in areas critical to the success of the company. We build tools and provide support for anyone at Facebook who would like to use our platforms to help make data-driven decisions or build data-intensive products and services.
  13. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0