20080528dublinpt1
1.
2. Measuring the Social Graph:
Facebook’s Data Infrastructure
Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008
3. My Background
hammer@facebook.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Came to Facebook in early 2006 as a Research Scientist
▪ Now manage the Facebook Data Team
  ▪ 22 amazing data engineers and scientists with more on the way
  ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
4. Developing for Facebook.com
Commonly Used Software
▪ Traditional LAMP stack
  ▪ Linux: mostly FC and RHEL; now testing CentOS
  ▪ Apache
  ▪ MySQL: Running 4.1 for a long time, now up to 5.0
  ▪ PHP: heavily modified
▪ Memcached
▪ Subversion
▪ Firebug
▪ Thrift
5. Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
▪ Session information
  ▪ Stored on client (in cookie)
  ▪ New web server for each request
▪ Three server profiles:
  ▪ Web (all CPU)
  ▪ Memcached (all RAM)
  ▪ MySQL (mostly RAM)
[Diagram: Web Tier (more than 10,000 servers), Memcached Tier (around 1,000 servers), MySQL Tier (around 2,000 servers)]
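The slide names the three tiers but does not spell out how a request moves between them. One common arrangement for a layout like this is a look-aside cache: the web server asks the cache first and falls back to the database on a miss. The sketch below illustrates that pattern only; it is an assumption, not a description of Facebook's code. `LookAsideCache` is a hypothetical in-process stand-in for a memcached client, SQLite stands in for MySQL, and the `users` table and key format are invented for the example.

```python
import sqlite3


class LookAsideCache:
    """Hypothetical stand-in for a memcached client: a plain in-process dict."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def make_database():
    """Build a tiny in-memory SQLite database standing in for the MySQL tier."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("INSERT INTO users VALUES (1, 'alice')")
    db.commit()
    return db


def fetch_user_name(user_id, cache, db, stats):
    """Look-aside read: try the cache, fall back to the database, then fill the cache."""
    key = f"user:{user_id}:name"  # invented key format
    cached = cache.get(key)
    if cached is not None:
        stats["hits"] += 1
        return cached
    stats["misses"] += 1
    row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is None:
        return None  # unknown user: nothing to cache
    cache.set(key, row[0])
    return row[0]
```

Because session state lives in the client's cookie, any web server can run `fetch_user_name` for any request, which fits the "new web server for each request" bullet.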
7. Services Infrastructure
More About Thrift
▪ Developing a Thrift service
  ▪ Define your data structures
    ▪ JSON-like data model
  ▪ Define your service endpoints
  ▪ Select your languages
  ▪ Generate stub code
  ▪ Write service logic
  ▪ Write client
  ▪ Configure and deploy!
  ▪ Monitor, provision, and upgrade
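The workflow above can be mimicked in a few lines of standard-library Python to show how the pieces relate. This is an analogy only: real Thrift describes structures and endpoints in an interface definition file and generates the stubs with its compiler, and it does not use JSON as shown here. Every name below (`UserProfile`, `ProfileService`, `dispatch`, and so on) is hypothetical and hand-written.

```python
import json
from dataclasses import dataclass, asdict


# "Define your data structures": a hypothetical record type.
@dataclass
class UserProfile:
    user_id: int
    name: str


# "Define your service endpoints": the interface both sides agree on.
class ProfileService:
    def get_profile(self, user_id: int) -> UserProfile:
        raise NotImplementedError


# "Write service logic": the server-side implementation of the interface.
class ProfileHandler(ProfileService):
    def __init__(self):
        self._profiles = {1: UserProfile(1, "alice")}

    def get_profile(self, user_id):
        return self._profiles[user_id]


# Stand-in for generated server stub code: decode a request, call the
# handler method it names, and encode the result.
def dispatch(handler, request_bytes):
    request = json.loads(request_bytes)
    method = getattr(handler, request["method"])
    result = method(**request["args"])
    return json.dumps(asdict(result)).encode("utf-8")


# "Write client": stand-in for a generated client stub. `send` is any
# callable that carries request bytes to the server and returns the reply.
class ProfileClient(ProfileService):
    def __init__(self, send):
        self._send = send

    def get_profile(self, user_id):
        request = json.dumps(
            {"method": "get_profile", "args": {"user_id": user_id}}
        ).encode("utf-8")
        return UserProfile(**json.loads(self._send(request)))
```

Wiring the client straight to the handler, as in `ProfileClient(lambda req: dispatch(ProfileHandler(), req))`, exercises the full round trip without a network; in a deployed service the `send` callable would be replaced by a transport.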
8. Services Infrastructure
Reinventing the SOA Wheel
▪ Almost all services written in Thrift
  ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪ Batteries included
  ▪ Network transport libraries
  ▪ Serialization libraries
  ▪ Code generation
  ▪ Robust server implementations (multithreaded, nonblocking, etc.)
▪ Now an Apache Incubator project
  ▪ For more information, read the whitepaper
  ▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
9. Data Infrastructure
Offline Batch Processing
▪ “Data Warehousing”
  ▪ Began with Oracle database
  ▪ Schedule data collection via cron
  ▪ Collect data every 24 hours
  ▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
  ▪ Started at tens of GB in early 2006
  ▪ Up to about 1 TB per day by mid-2007
  ▪ Log files largest source of data growth
[Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
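A hand-coded daily ETL script of the kind described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the original scripts: the tab-separated log format, the `daily_action_counts` table, and all function names are invented, and SQLite stands in for the Oracle database so the example is self-contained.

```python
import sqlite3
from collections import Counter


def extract(lines):
    """Yield (day, action) pairs from tab-separated lines: ISO timestamp, user id, action."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines rather than abort the nightly run
        timestamp, _user_id, action = parts
        yield timestamp[:10], action  # first 10 characters are YYYY-MM-DD


def transform(events):
    """Count how often each (day, action) pair occurs."""
    return Counter(events)


def load(db, counts):
    """Write the daily counts into the warehouse table, replacing earlier runs."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS daily_action_counts "
        "(day TEXT, action TEXT, n INTEGER, PRIMARY KEY (day, action))"
    )
    db.executemany(
        "INSERT OR REPLACE INTO daily_action_counts VALUES (?, ?, ?)",
        [(day, action, n) for (day, action), n in counts.items()],
    )
    db.commit()


def run_daily_etl(lines, db):
    """One collection cycle: extract, transform, load."""
    load(db, transform(extract(lines)))
```

A cron entry that runs such a script once a day would give the 24-hour collection cycle on the slide. Since every line passes through a single process and a single database, the design has an obvious ceiling as log volume grows.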
10. Data Infrastructure
Distributed Processing with Cheetah
▪ Goal: summarize log files outside of the database
▪ Solution: Cheetah, a distributed log file processing system
  ▪ Distributor.pl: distribute binaries to processing nodes
  ▪ C++ Binaries: parse, agg, load
[Diagram: Partitioned Log File, Cheetah Master, Filer, Processing Tier]
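The parse and agg stages over a partitioned log file can be sketched as below. This only illustrates the general idea of summarizing partitions independently and merging the partial results; Cheetah itself used C++ binaries and a Perl distributor, and its internals are not shown on the slide. The log format, the function names, and the use of threads in place of separate processing nodes are all assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def parse(line):
    """Return the first whitespace-separated field as the key, or None for blank lines."""
    fields = line.split()
    return fields[0] if fields else None


def aggregate_partition(lines):
    """Summarize one partition: count occurrences of each parsed key."""
    counts = Counter()
    for line in lines:
        key = parse(line)
        if key is not None:
            counts[key] += 1
    return counts


def summarize(partitions, workers=4):
    """Aggregate each partition independently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(aggregate_partition, partitions):
            total.update(partial)
    return total
```

The merged `Counter` is what a load stage would then write into the database, so only the summary, not the raw log, reaches the warehouse.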
11. Data Infrastructure
Moving from Cheetah to Hadoop
▪ Cheetah limitations
  ▪ Limited filer bandwidth
  ▪ No centralized log file metadata
  ▪ Writing a new Cheetah job requires writing C++ binaries
  ▪ No support for ad hoc querying
  ▪ Not open source
13. The Facebook Data team builds scalable platforms for the
collection, management, and analysis of data.
We use these platforms to help drive informed decisions in
areas critical to the success of the company.
We build tools and provide support for anyone at Facebook
who would like to use our platforms to help make data-driven
decisions or build data-intensive products and services.
14. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0