2. Data Management at Facebook
Jeff Hammerbacher
Manager, Data
June 12, 2008
3. My Background
Thanks for Asking
hammer@facebook.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Now manage the Facebook Data Team
  ▪ 25 amazing data engineers and scientists with more on the way
  ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
4. Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
▪ Three main server profiles:
  ▪ Web Tier (more than 10,000 servers)
  ▪ Memcached Tier (around 1,000 servers)
  ▪ MySQL Tier (around 2,000 servers)
▪ Simplified away:
  ▪ AJAX
  ▪ Photo and Video Services
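A common way the three tiers interact is a cache-aside read path: the web tier checks memcached first and falls back to MySQL on a miss. The deck does not spell this out, so the following is a hypothetical sketch, with plain dicts standing in for the memcached and MySQL tiers:

```python
# Hypothetical cache-aside read path; stores, keys, and values are
# illustrative stand-ins, not Facebook's actual code.
cache = {}                                  # "memcached tier"
database = {"user:42": {"name": "Alice"}}   # "MySQL tier"

def get(key):
    if key in cache:               # 1. try memcached first
        return cache[key]
    value = database.get(key)      # 2. on a miss, fall back to MySQL
    if value is not None:
        cache[key] = value         # 3. populate the cache for next time
    return value

print(get("user:42"))  # first call misses the cache and fills it
print(get("user:42"))  # second call is served from the cache
```

The point of the pattern is that repeated reads of hot data never touch the MySQL tier, which is why the memcached tier can be an order of magnitude smaller than the web tier.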
6. Services Infrastructure
Thrift, Mainly
Developing a Thrift service:
▪ Define your data structures
  ▪ JSON-like data model
▪ Define your service endpoints
▪ Select your languages
▪ Generate stub code
▪ Write service logic
▪ Write client
▪ Configure and deploy
▪ Monitor, provision, and upgrade
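The first two steps above are captured in a Thrift interface definition file; a minimal hypothetical example (struct, service, and field names are illustrative, not from the deck):

```thrift
// Hypothetical example.thrift -- names are illustrative.
struct UserProfile {
  1: i64 id,
  2: string name,
}

service ProfileService {
  // One endpoint; the Thrift compiler generates client and
  // server stubs for each selected language.
  UserProfile getProfile(1: i64 id),
}
```

Running the Thrift compiler on this file (e.g. with `--gen py`) produces the stub code that the remaining steps build on.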
7. Services Infrastructure
What’s an SOA?
▪ Almost all services written in Thrift
  ▪ Networks, Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
  ▪ Network transport libraries
  ▪ Serialization libraries
  ▪ Code generation
  ▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
▪ For more information, read the whitepaper
▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
8. Data Infrastructure
Offline Batch Processing
(Diagram: a Scribe Tier and a MySQL Tier feed a Data Collection Server, which loads an Oracle Database Server)
▪ “Data Warehousing”
  ▪ Began with Oracle database
  ▪ Schedule data collection via cron
  ▪ Collect data every 24 hours
  ▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
  ▪ Started at tens of GB in early 2006
  ▪ Up to about 1 TB per day by mid-2007
  ▪ Log files largest source of data growth
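A rough sketch of the hand-coded Python "ETL" style described above. The field layout and names are assumptions for illustration, not Facebook's actual scripts:

```python
from collections import defaultdict

def aggregate_page_views(log_lines):
    """Toy 'transform' step: parse tab-delimited log lines of the form
    timestamp<TAB>userid<TAB>page_url and count views per page, ready
    to be loaded into a warehouse table. (Hypothetical format.)"""
    counts = defaultdict(int)
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed records rather than abort the run
        _, _, page_url = fields
        counts[page_url] += 1
    return dict(counts)

logs = [
    "2007-06-12T00:01:02\t42\t/index.php",
    "2007-06-12T00:01:05\t43\t/index.php",
    "2007-06-12T00:01:09\t42\t/profile.php",
]
print(aggregate_page_views(logs))  # {'/index.php': 2, '/profile.php': 1}
```

A cron entry would then run such a script every 24 hours over the day's logs and load the result into the warehouse. This style works at tens of GB but, as the next slides show, it strains once log volume reaches a TB per day.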
9. Data Infrastructure
Distributed Processing with Cheetah
▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
  ▪ Distributor.pl: distribute binaries to processing nodes
  ▪ C++ binaries: parse, agg, load
(Diagram: a Cheetah Master dispatches partitioned log files from a Filer to a Processing Tier)
10. Data Infrastructure
Moving from Cheetah to Hadoop
▪ Cheetah limitations
  ▪ Limited filer bandwidth
  ▪ No centralized logfile metadata
  ▪ Writing a new Cheetah job requires writing C++ binaries
  ▪ Jobs are difficult to monitor and debug
  ▪ No support for ad hoc querying
  ▪ Not open source
12. Anatomy of the Facebook Cluster
Hardware
▪ Individual nodes
  ▪ CPU: Intel Xeon, dual socket, quad core (8 cores per box)
  ▪ Memory: 16 GB ECC DRAM
  ▪ Disk: 4 x 1 TB 7200 RPM SATA
  ▪ Network: 1 GbE
▪ Topology
  ▪ 320 nodes arranged into 8 racks of 40 nodes each
  ▪ 8 x 1 Gbps links out to the core switch
13. Anatomy of the Facebook Cluster
Recent Cluster Statistics
From May 2nd to May 21st:
▪ Total jobs: 8,794
▪ Total map tasks: 1,362,429
▪ Total reduce tasks: 86,806
▪ Average duration of a successful job: 296 s
▪ Average duration of a successful map: 81 s
▪ Average duration of a successful reduce: 678 s
14. Initial Hadoop Applications
Hadoop Streaming
▪ Almost all applications at Facebook use Hadoop Streaming
  ▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe
▪ Facebook users write in Python, PHP, C++
  ▪ Allows for library reuse, faster development
  ▪ Eats way too much CPU
▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
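In the Streaming model the mapper and reducer are ordinary scripts joined by pipes: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer sums consecutive lines per key. A word-count-style sketch in Python (the actual job wiring via the `hadoop-streaming` jar is omitted; names here are illustrative):

```python
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" line per word, as a Streaming mapper
    # would write to stdout.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(sorted_lines):
    # Streaming hands the reducer mapper output sorted by key, so
    # consecutive lines with the same word can be summed in one pass.
    for word, group in groupby(sorted_lines, key=lambda l: l.split("\t")[0]):
        total = sum(int(l.split("\t", 1)[1]) for l in group)
        yield "%s\t%d" % (word, total)

# Simulate the shuffle phase with a local sort between the two stages.
mapped = sorted(mapper(["the cat", "the dog"]))
print(list(reducer(mapped)))  # ['cat\t1', 'dog\t1', 'the\t2']
```

Because the contract is just "lines in, lines out," the same pattern works in PHP or C++, which is exactly the library-reuse advantage the slide describes; the cost is the extra process and text-parsing overhead behind "eats way too much CPU."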
15. Initial Hadoop Applications
Unstructured Text Analysis
▪ Intern asked to understand brand sentiment and influence
▪ Many tools for supporting his project had to be built
  ▪ Understanding serialization format of wall post logs
  ▪ Common data operations: project, filter, join, group by
  ▪ Developed using Hadoop Streaming for rapid prototyping in Python
  ▪ Scheduling regular processing and recovering from failures
  ▪ Making it easy to regularly load new data
17. Initial Hadoop Applications
Ensemble Learning
▪ Build a lot of Decision Trees and average them
▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
▪ Can be used for regression or classification
▪ See “Random Forests” by Leo Breiman
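The "train many randomized predictors on bootstrap samples, then average their votes" idea can be sketched in miniature. In this toy sketch each "tree" is reduced to a one-feature threshold stump on 2-D data, so it illustrates the bagging-plus-random-feature idea rather than implementing Breiman's full method:

```python
import random

def train_stump(sample):
    # sample: list of ((x0, x1), label) pairs with labels 0/1.
    # Randomize like a forest: pick one feature at random, then take
    # the threshold/orientation most accurate on this bootstrap sample.
    f = random.randrange(2)
    best = None
    for (x, _) in sample:
        for sign in (1, -1):
            t = x[f]
            correct = sum(1 for (xi, yi) in sample
                          if (1 if sign * (xi[f] - t) >= 0 else 0) == yi)
            if best is None or correct > best[0]:
                best = (correct, f, t, sign)
    _, f, t, sign = best
    return lambda x: 1 if sign * (x[f] - t) >= 0 else 0

def train_forest(data, n_trees=25):
    # Each predictor sees an independent bootstrap resample of the data.
    return [train_stump([random.choice(data) for _ in data])
            for _ in range(n_trees)]

def predict(forest, x):
    # Classification: majority vote over the ensemble.
    votes = sum(stump(x) for stump in forest)
    return 1 if 2 * votes >= len(forest) else 0

random.seed(0)
data = [((0.0, 0.1), 0), ((0.1, 0.0), 0), ((0.2, 0.3), 0), ((0.3, 0.2), 0),
        ((0.7, 0.8), 1), ((0.8, 1.0), 1), ((0.9, 0.7), 1), ((1.0, 0.9), 1)]
forest = train_forest(data)
print(predict(forest, (0.95, 0.95)), predict(forest, (0.05, 0.05)))
```

Any single stump is a weak, high-variance predictor, but the bootstrap resampling decorrelates them and the vote averages the variance away; for regression, the same scheme averages numeric predictions instead of voting.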
18. More Hadoop Applications
Insights
▪ Monitor performance of your Facebook Ad, Page, or Application
▪ Regular aggregation of high volumes of log file data
  ▪ First hourly pipelines
▪ Publish data back to a MySQL tier
▪ System currently only running partially on Hadoop
20. More Hadoop Applications
Platform Application Reputation Scoring
▪ Users complaining about being spammed by Platform applications
▪ Now, every Platform Application has a set of quotas
  ▪ Notifications
  ▪ News Feed story insertion
  ▪ Invitations
  ▪ Emails
▪ Quotas determined by calculating a “reputation score” for the application
21. More Hadoop Applications
Miscellaneous
▪ Experimentation Platform back end
  ▪ A/B Testing
  ▪ Champion/Challenger Testing
▪ Lots of internal analyses
▪ Export smaller data sets to R
▪ Ads: targeting optimization, fraud detection
▪ Search: index building, ranking optimization
▪ Load testing for new storage systems
▪ Language prediction for translation targeting
22. Hive
Structured Data Management with Hadoop
▪ Hadoop:
  ▪ HDFS
  ▪ MapReduce
  ▪ Resource Manager
  ▪ Job Scheduler
▪ Hive:
  ▪ Logical data partitioning
  ▪ Metadata store
  ▪ Query operators
  ▪ Query language
25. Hive
Sample Queries
▪ CREATE TABLE page_view(viewTime DATETIME, userid MEDIUMINT,
      page_url STRING, referrer_url STRING, ip STRING)
  COMMENT 'This is the page view table'
  PARTITIONED BY(date DATETIME, country STRING)
  BUCKETED ON (userid) INTO 32 BUCKETS
  ROW FORMAT DELIMITED FIELD DELIMITER 001 ROW DELIMITER 013
  STORED AS COMPRESSED
  LOCATION '/user/facebook/warehouse/page_view';
▪ FROM pv_users
  INSERT INTO TABLE pv_gender_sum
    SELECT pv_users.gender, count_distinct(pv_users.userid)
    GROUP BY(pv_users.gender)
  INSERT INTO FILE /user/facebook/tmp/pv_age_sum.txt
    SELECT pv_users.age, count_distinct(pv_users.userid)
    GROUP BY(pv_users.age);
26. Hive
The Team
▪ Joydeep Sen Sarma
▪ Ashish Thusoo
▪ Pete Wyckoff
▪ Suresh Anthony
▪ Zheng Shao
▪ Venky Iyer
▪ Dhruba Borthakur
▪ Namit Jain
▪ Raghu Murthy
27. Cassandra
Structured Storage over a P2P Network
▪ Conceptually: the BigTable data model on Dynamo infrastructure
▪ Design goals:
  ▪ High availability
  ▪ Incremental scalability
  ▪ Eventual consistency (trade consistency for availability)
  ▪ Optimistic replication
  ▪ Low total cost of ownership
  ▪ Minimal administrative overhead
  ▪ Tunable tradeoffs between consistency, durability, and latency
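The last bullet is often made concrete with the quorum arithmetic from the Dynamo design: with N replicas, a read from R replicas is guaranteed to overlap a write to W replicas whenever R + W > N. A minimal sketch of that rule (an illustration of the general Dynamo-style tradeoff, not Cassandra's actual API):

```python
def read_sees_latest_write(n_replicas, r, w):
    # Any R-replica read set and W-replica write set must intersect
    # when R + W > N, so the read observes the latest committed write.
    return r + w > n_replicas

# N = 3 replicas: strict quorums (R=2, W=2) stay consistent;
# dropping to R=1 lowers read latency but risks stale reads.
print(read_sees_latest_write(3, 2, 2))  # True
print(read_sees_latest_write(3, 1, 2))  # False
```

Choosing smaller R or W buys latency and availability at the cost of consistency, which is exactly the tunable tradeoff the slide lists.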