20080528dublinpt1

Measuring the Social Graph:
Facebook’s Data Infrastructure

Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008

My Background
hammer@facebook.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Now manage the Facebook Data Team
  ▪ 22 amazing data engineers and scientists with more on the way
  ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more

Developing for Facebook.com
Commonly Used Software
▪ Traditional LAMP stack
  ▪ Linux: mostly FC and RHEL; now testing CentOS
  ▪ Apache
  ▪ MySQL: Running 4.1 for a long time, now up to 5.0
  ▪ PHP: heavily modified
▪ Memcached
▪ Subversion
▪ Firebug
▪ Thrift

Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
▪ Session information
  ▪ Stored on client (in cookie)
  ▪ New web server for each request
▪ Three server profiles:
  ▪ Web (all CPU)
  ▪ Memcached (all RAM)
  ▪ MySQL (mostly RAM)
[Diagram: Web Tier (more than 10,000 servers), Memcached Tier (around 1,000 servers), MySQL Tier (around 2,000 servers)]
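
Keeping session information in a cookie means any web server can handle any request without shared server-side session state. The slide does not say how the cookie is protected; the sketch below assumes an HMAC signature, which is one common way to make client-held state tamper-evident. `SECRET_KEY`, `encode_session`, and `decode_session` are hypothetical names, not Facebook's code.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared secret; every web server would need the same key.
SECRET_KEY = b"example-secret-not-from-the-talk"


def encode_session(session: dict) -> str:
    """Serialize session data and append an HMAC so tampering is detectable."""
    payload = base64.urlsafe_b64encode(json.dumps(session, sort_keys=True).encode())
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def decode_session(cookie: str):
    """Return the session dict, or None if the cookie is malformed or altered."""
    payload, separator, signature = cookie.rpartition(".")
    if not separator:
        return None
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode()))


# Any server holding SECRET_KEY can validate a cookie issued by another server.
cookie = encode_session({"user_id": 42})
session = decode_session(cookie)
```

Because validation needs only the key and the cookie itself, a request can land on a different web server each time, which matches the "new web server for each request" point.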

Serving Facebook.com
Request Volume per Second
[Diagram, top to bottom:]
Web
  10M requests
Memcache (15TB RAM)
  500K requests
MySQL (25TB RAM)
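
The deck gives request volumes but not the lookup logic. A common way to pair Memcached with MySQL is the cache-aside pattern: read from the cache first and go to the database only on a miss. The sketch below assumes that pattern and uses plain dictionaries as stand-ins for both tiers; `CacheAsideStore` is a hypothetical name.

```python
class CacheAsideStore:
    """Read path: try the cache first, fall back to the database on a miss."""

    def __init__(self, database: dict):
        self.database = database      # stand-in for the MySQL tier
        self.cache = {}               # stand-in for the Memcached tier
        self.cache_requests = 0
        self.database_requests = 0

    def get(self, key):
        self.cache_requests += 1
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read from the database and populate the cache.
        self.database_requests += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value


store = CacheAsideStore({"user:1": "Alice"})
for _ in range(20):
    store.get("user:1")
# Only the first read reaches the database; the other 19 are cache hits.
```

Under this pattern the database sees far fewer requests than the cache, which is one way a gap like the one between the two request figures on the slide can arise.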

Services Infrastructure
More About Thrift
▪ Developing a Thrift service
  ▪ Define your data structures
    ▪ JSON-like data model
  ▪ Define your service endpoints
  ▪ Select your languages
  ▪ Generate stub code
  ▪ Write service logic
  ▪ Write client
  ▪ Configure and deploy!
  ▪ Monitor, provision, and upgrade
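
The development steps above can be illustrated with a toy, in-process imitation. This is not Thrift's API or its generated code: real Thrift defines structs and services in an IDL file and generates the stubs. Here the "stub" layer (`dispatch`, `ProfileClient`) is written by hand, JSON stands in for Thrift's serialization, and every name is hypothetical.

```python
import json
from dataclasses import asdict, dataclass


# Define your data structures (Thrift would use a struct in an IDL file).
@dataclass
class UserProfile:
    user_id: int
    name: str


# Define your service endpoints (Thrift would generate this interface).
class ProfileService:
    def get_profile(self, user_id: int) -> UserProfile:
        raise NotImplementedError


# Write service logic by implementing the interface.
class ProfileHandler(ProfileService):
    def __init__(self):
        self._profiles = {1: UserProfile(user_id=1, name="Alice")}

    def get_profile(self, user_id: int) -> UserProfile:
        return self._profiles[user_id]


# Stand-in for generated server stub code: decode, call the handler, encode.
def dispatch(handler: ProfileService, request_bytes: bytes) -> bytes:
    request = json.loads(request_bytes)
    if request["method"] != "get_profile":
        raise ValueError("unknown method")
    result = handler.get_profile(*request["args"])
    return json.dumps(asdict(result)).encode()


# Stand-in for generated client stub code; transport is any bytes -> bytes callable.
class ProfileClient:
    def __init__(self, transport):
        self._transport = transport

    def get_profile(self, user_id: int) -> UserProfile:
        request = json.dumps({"method": "get_profile", "args": [user_id]}).encode()
        return UserProfile(**json.loads(self._transport(request)))


# Write client: an in-process "transport" replaces the network here.
handler = ProfileHandler()
client = ProfileClient(lambda request: dispatch(handler, request))
profile = client.get_profile(1)
```

The point of the split is that only the handler and the calling code are written by the developer; in Thrift the data classes, interface, and both stubs come out of the code generator for each selected language.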

Services Infrastructure
Reinventing the SOA Wheel
▪ Almost all services written in Thrift
  ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
  ▪ Network transport libraries
  ▪ Serialization libraries
  ▪ Code generation
  ▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper
▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch

Data Infrastructure
Offline Batch Processing
▪ “Data Warehousing”
  ▪ Began with Oracle database
  ▪ Schedule data collection via cron
  ▪ Collect data every 24 hours
  ▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
  ▪ Started at tens of GB in early 2006
  ▪ Up to about 1 TB per day by mid-2007
  ▪ Log files largest source of data growth
[Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
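
A hand-coded Python ETL script run once a day from cron might look roughly like the sketch below. The log format (tab-separated date, user ID, action), the table name, and the function names are all assumptions for illustration, and SQLite stands in for the Oracle database so the example stays self-contained.

```python
import sqlite3
from collections import Counter


def extract(log_lines):
    """Extract: split tab-separated lines, skipping malformed ones."""
    for line in log_lines:
        parts = line.strip().split("\t")
        if len(parts) == 3:
            yield parts


def transform(records):
    """Transform: count events per (day, action)."""
    counts = Counter()
    for day, _user_id, action in records:
        counts[(day, action)] += 1
    return counts


def load(connection, counts):
    """Load: write the daily summary into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS daily_action_counts "
        "(day TEXT, action TEXT, events INTEGER, PRIMARY KEY (day, action))"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO daily_action_counts VALUES (?, ?, ?)",
        [(day, action, events) for (day, action), events in counts.items()],
    )
    connection.commit()


def run_daily_job(connection, log_lines):
    load(connection, transform(extract(log_lines)))


# Example run with made-up log lines and an in-memory database.
connection = sqlite3.connect(":memory:")
sample_lines = [
    "2008-05-27\t1\tlogin\n",
    "2008-05-27\t2\tlogin\n",
    "2008-05-27\t1\tpoke\n",
    "malformed line\n",
]
run_daily_job(connection, sample_lines)
rows = connection.execute(
    "SELECT day, action, events FROM daily_action_counts ORDER BY action"
).fetchall()
```

A script like this does all of its summarizing on one machine before loading, which hints at why growing log volumes put pressure on this design.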

Data Infrastructure
Distributed Processing with Cheetah
▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
  ▪ Distributor.pl: distribute binaries to processing nodes
  ▪ C++ Binaries: parse, agg, load
[Diagram: Partitioned Log File, Cheetah Master, Filer, Processing Tier]
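
The slide names three binaries (parse, agg, load) working over a partitioned log file but gives no further detail. The sketch below shows the general shape of such a pipeline under stated assumptions: each partition is parsed and aggregated independently, and the partial results are then merged. The log format, the function names, and the use of a thread pool in place of separate processing nodes are hypothetical; Cheetah itself was written in C++ and Perl, and here `load` merges into an in-memory summary instead of writing to a database.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parse(line):
    """Parse one 'page user' log line; return None if it is malformed."""
    parts = line.split()
    return tuple(parts) if len(parts) == 2 else None


def aggregate_partition(lines):
    """'agg' step, run once per partition: count hits per page."""
    counts = Counter()
    for line in lines:
        record = parse(line)
        if record is not None:
            page, _user = record
            counts[page] += 1
    return counts


def load(partials):
    """'load' step: merge the per-partition counts into one summary."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def summarize(partitions, workers=4):
    """Process partitions in parallel, then merge the partial summaries."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return load(pool.map(aggregate_partition, partitions))


partitions = [
    ["/home u1", "/profile u2"],
    ["/home u3", "bad line here"],
    [],
]
summary = summarize(partitions)
```

Because each partition is summarized on its own, only the small per-partition counts need to be combined at the end, which is what lets the summarization happen outside the database.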

Data Infrastructure
Moving from Cheetah to Hadoop
▪ Cheetah limitations
  ▪ Limited filer bandwidth
  ▪ No centralized log file metadata
  ▪ Writing a new Cheetah job requires writing C++ binaries
  ▪ No support for ad hoc querying
  ▪ Not open source

Data Infrastructure
Hadoop as Enterprise Data Warehouse
[Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]

The Facebook Data team builds scalable platforms for the
collection, management, and analysis of data.

We use these platforms to help drive informed decisions in
areas critical to the success of the company.

We build tools and provide support for anyone at Facebook
who would like to use our platforms to help make data-driven
decisions or build data-intensive products and services.
