Data Management at Facebook
(Back in the Day)

Jeff Hammerbacher
VP Product and Chief Scientist, Cloudera
October 22, 2008
My Background
Thanks for Asking
▪   hammer@cloudera.com
▪   Studied Mathematics at Harvard
▪   Worked as a Quant on Wall Street
▪   Came to Facebook in early 2006 as a Research Scientist
▪   Managed the Facebook Data Team through September 2008
    ▪   Over 25 amazing engineers and data scientists
▪   Now a cofounder of Cloudera
    ▪   Hadoop support and optimization
Common Themes
1. Simplicity
 ▪   Do one thing well ...
2. Scalability
 ▪   ... a lot
3. Manageability
 ▪   Remove the humans
4. Open Source
 ▪   Build a community
Serving Facebook.com
Data Retrieval and Hardware
GET /index.php HTTP/1.1
Host: www.facebook.com
Diagram: requests flow through the Web Tier (more than 10,000 servers), the Memcached Tier (around 1,000 servers), and the MySQL Tier (around 2,000 servers); a cache read-path sketch follows this slide
▪   Three main server profiles:
    ▪   Web
    ▪   Memcached
    ▪   MySQL
▪   Simplified away:
    ▪   AJAX
    ▪   Photo and Video
    ▪   Services
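The tier diagram above implies the classic cache-aside read path: the web tier checks memcached first and falls back to MySQL on a miss. A minimal Python sketch of that pattern, using the python-memcached and MySQLdb client libraries; the hosts, table, and column names are hypothetical, not Facebook's actual schema.

import memcache
import MySQLdb

mc = memcache.Client(["127.0.0.1:11211"])
db = MySQLdb.connect(host="mysql-tier", user="www", passwd="secret", db="users")

def get_profile(user_id):
    key = "profile:%d" % user_id
    profile = mc.get(key)                      # 1. try the memcached tier
    if profile is None:
        cur = db.cursor()
        cur.execute("SELECT name, status FROM profiles WHERE user_id = %s", (user_id,))
        profile = cur.fetchone()               # 2. fall back to the MySQL tier
        if profile is not None:
            mc.set(key, profile, time=300)     # 3. repopulate the cache (5 minute TTL)
    return profile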
Services Infrastructure
What’s an SOA?
▪   Almost all services written in Thrift
    ▪   Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
▪   Batteries included
    ▪   Network transport libraries
    ▪   Serialization libraries
    ▪   Code generation
    ▪   Robust server implementations (multithreaded, nonblocking, etc.)
▪   Now an Apache Incubator project
▪   For more information, read the whitepaper
Services Infrastructure
Thrift, Mainly
▪   Developing a Thrift service:
    ▪   Define your data structures
        ▪   JSON-like data model
    ▪   Define your service endpoints
    ▪   Select your languages
    ▪   Generate stub code
    ▪   Write service logic
    ▪   Write client (see the Python client sketch after this list)
    ▪   Configure and deploy
    ▪   Monitor, provision, and upgrade
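As a concrete illustration of the "write client" step, here is a minimal Python client sketch. It assumes the Thrift compiler has already generated stub code for a hypothetical SearchService; the module, service, and method names are illustrative, not one of the Facebook services listed earlier.

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

from search import SearchService  # hypothetical Thrift-generated module

def make_client(host="localhost", port=9090):
    # Buffered socket transport speaking the binary protocol.
    socket = TSocket.TSocket(host, port)
    transport = TTransport.TBufferedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = SearchService.Client(protocol)
    transport.open()
    return client, transport

if __name__ == "__main__":
    client, transport = make_client()
    try:
        print(client.query("hadoop"))  # hypothetical endpoint defined in the IDL
    finally:
        transport.close()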
Data Infrastructure
Offline Batch Processing
Diagram: the Scribe Tier and MySQL Tier feed a Data Collection Server, which loads an Oracle Database Server
▪   “Data Warehousing”
▪   Began with Oracle database
▪   Schedule data collection via cron
    ▪   Collect data every 24 hours
▪   “ETL” scripts: hand-coded Python (a sketch follows this slide)
▪   Data volumes quickly grew
    ▪   Started at tens of GB in early 2006
    ▪   Up to about 1 TB per day by mid-2007
    ▪   Log files largest source of data growth
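The "hand-coded Python ETL" mentioned above would have been simple cron-driven scripts; the following is a minimal sketch of the idea under assumed paths and log format (none of these specifics come from the deck).

#!/usr/bin/env python
"""Hypothetical daily ETL: summarize yesterday's page-view log for loading."""
import collections
import csv
import datetime
import gzip

LOG_DIR = "/data/logs/pageviews"   # hypothetical collection directory
OUT_DIR = "/data/summaries"        # hypothetical staging area for the loader

def summarize(day):
    counts = collections.Counter()
    path = "%s/%s.log.gz" % (LOG_DIR, day.strftime("%Y-%m-%d"))
    with gzip.open(path, "rt") as f:
        for line in f:
            _, _, page = line.rstrip("\n").split("\t")   # assume: timestamp, user_id, page
            counts[page] += 1
    return counts

def write_summary(day, counts):
    out = "%s/pageviews-%s.csv" % (OUT_DIR, day.strftime("%Y-%m-%d"))
    with open(out, "w", newline="") as f:
        writer = csv.writer(f)
        for page, n in counts.most_common():
            writer.writerow([day.isoformat(), page, n])
    return out   # a separate step would bulk-load this file into the warehouse

if __name__ == "__main__":
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    print(write_summary(yesterday, summarize(yesterday)))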
Data Infrastructure
Distributed Processing with Cheetah
▪   Goal: summarize log files outside of the database
▪   Solution: Cheetah, a distributed log file processing system
    ▪   Distributor.pl: distribute binaries to processing nodes
    ▪   C++ binaries: parse, agg, load
Diagram: a Cheetah Master distributes work over partitioned log files stored on a Filer to a Processing Tier
Data Infrastructure
Moving from Cheetah to Hadoop
▪   Cheetah limitations
    ▪   Limited filer bandwidth
    ▪   No centralized logfile metadata
    ▪   Writing a new Cheetah job requires writing C++ binaries
    ▪   Jobs are difficult to monitor and debug
    ▪   No support for ad hoc querying
    ▪   Not open source
Data Infrastructure
Hadoop as Enterprise Data Warehouse
Diagram: the Scribe Tier and MySQL Tier feed the Hadoop Tier, which in turn feeds Oracle RAC Servers
Initial Hadoop Applications
Unstructured text analysis
▪   Intern asked to understand brand sentiment and influence
▪   Many tools for supporting his project had to be built
    ▪   Understanding serialization format of wall post logs
    ▪   Common data operations: project, filter, join, group by
    ▪   Developed using Hadoop streaming for rapid prototyping in Python (see the mapper/reducer sketch after this slide)
    ▪   Scheduling regular processing and recovering from failures
    ▪   Making it easy to regularly load new data
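A minimal sketch of the Hadoop Streaming style referred to above: the mapper and reducer are plain Python filters reading stdin and writing tab-separated key/value pairs, here counting brand mentions in wall-post text. The input field layout and the brand list are made up for illustration.

mapper.py:

#!/usr/bin/env python
import sys

BRANDS = {"nike", "starbucks", "apple"}   # illustrative list

for line in sys.stdin:
    text = line.rsplit("\t", 1)[-1].lower()   # assume the post body is the last field
    for brand in BRANDS:
        if brand in text:
            print("%s\t1" % brand)

reducer.py:

#!/usr/bin/env python
import sys

# Streaming delivers mapper output sorted by key, so a running total per key is enough.
current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))

The pair would be launched with the streaming jar shipped with Hadoop, along the lines of: hadoop jar hadoop-streaming.jar -input <logs> -output <out> -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py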
Lexicon
Initial Hadoop Applications
Ensemble Learning
▪   Build a lot of Decision Trees and average them (a toy sketch follows this slide)
    ▪   Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest
▪   Can be used for regression or classification
▪   See “Random Forests” by Leo Breiman
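A toy Python sketch of the bagging-and-averaging idea above: fit many small trees on bootstrap samples and average their predictions. Depth-one trees (stumps) on a one-dimensional dataset keep it short; this illustrates the principle rather than Breiman's full algorithm, which also subsamples features at each split.

import random

def fit_stump(points):
    """Pick the x-split minimizing squared error; return (split, left_mean, right_mean)."""
    best = None
    xs = sorted(x for x, _ in points)
    for i in range(1, len(xs)):
        split = (xs[i - 1] + xs[i]) / 2.0
        left = [y for x, y in points if x <= split]
        right = [y for x, y in points if x > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    return best[1:]

def fit_forest(points, n_trees=50):
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(points) for _ in points]   # bootstrap sample
        forest.append(fit_stump(sample))
    return forest

def predict(forest, x):
    preds = [(lm if x <= split else rm) for split, lm, rm in forest]
    return sum(preds) / len(preds)   # average the trees

if __name__ == "__main__":
    random.seed(8)
    data = [(x / 10.0, (x / 10.0) ** 2 + random.gauss(0, 0.05)) for x in range(-20, 21)]
    forest = fit_forest(data)
    for x in (-1.5, 0.0, 1.5):
        print(x, round(predict(forest, x), 3))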
More Hadoop Applications
Insights
▪   Monitor performance of your Facebook Ad, Page, Application
▪   Regular aggregation of high volumes of log file data
    ▪   First hourly pipelines
▪   Publish data back to a MySQL tier (a publishing sketch follows this slide)
▪   System currently only running partially on Hadoop
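A minimal sketch of the "publish back to a MySQL tier" step, assuming the MySQLdb driver; the connection details, table name, and columns are hypothetical.

import MySQLdb

def publish_hourly_counts(rows):
    """rows: iterable of (page_id, hour, impressions) tuples produced by the pipeline."""
    conn = MySQLdb.connect(host="insights-db", user="etl", passwd="secret", db="insights")
    try:
        cur = conn.cursor()
        cur.executemany(
            "REPLACE INTO page_impressions_hourly (page_id, hour, impressions) "
            "VALUES (%s, %s, %s)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    publish_hourly_counts([(1, "2008-10-22 13:00:00", 1200), (2, "2008-10-22 13:00:00", 87)])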
Insights
More Hadoop Applications
Platform Application Reputation Scoring
▪   Users complaining about being spammed by Platform applications
▪   Now, every Platform Application has a set of quotas
    ▪   Notifications
    ▪   News Feed story insertion
    ▪   Invitations
    ▪   Emails
▪   Quotas determined by calculating a “reputation score” for the application (a toy sketch follows this slide)
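The deck does not give the actual scoring formula, so the following is only a made-up toy sketch of the mechanism: derive a score from engagement signals, then scale each quota by it.

def reputation_score(accept_rate, block_rate, report_rate):
    # Illustrative weights: reward accepted requests, penalize blocks and spam reports.
    return max(0.0, min(1.0, 0.6 * accept_rate + 0.4 * (1.0 - block_rate) - 2.0 * report_rate))

def daily_quotas(score, base_notifications=20, base_invitations=30):
    # Scale each per-user daily quota by the application's reputation score.
    return {
        "notifications": int(base_notifications * score),
        "invitations": int(base_invitations * score),
    }

if __name__ == "__main__":
    print(daily_quotas(reputation_score(accept_rate=0.4, block_rate=0.2, report_rate=0.01)))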
Hive
Structured Data Management with Hadoop
▪   Hadoop:
    ▪   HDFS
    ▪   MapReduce
    ▪   Resource Manager
    ▪   Job Scheduler
▪   Hive:
    ▪   Logical data partitioning
    ▪   Metadata store (command line and web interfaces)
    ▪   Query Operators
    ▪   Query Language (an example query follows this slide)
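A minimal sketch of an ad hoc query in Hive's query language, issued from Python by shelling out to the Hive CLI (hive -e). The table and column names are hypothetical, ds stands in for a daily logical partition, and the hive binary is assumed to be on PATH.

import subprocess

QUERY = """
SELECT page, COUNT(1) AS views
FROM page_view_log
WHERE ds = '2008-10-21'
GROUP BY page
ORDER BY views DESC
LIMIT 20
"""

if __name__ == "__main__":
    subprocess.check_call(["hive", "-e", QUERY])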
Hive
Hive
The Team
▪   Joydeep Sen Sarma
▪   Ashish Thusoo
▪   Pete Wyckoff
▪   Suresh Anthony
▪   Zheng Shao
▪   Venky Iyer
▪   Dhruba Borthakur
▪   Namit Jain
▪   Raghu Murthy
▪   Prasad Chakka
Hive
Some Stats
▪   Cluster size - 320 nodes, 2560 cores, 1.3 PB capacity
▪   Total data (compressed, deduplicated) - 180 TB
▪   Net data per day
    ▪   10 TB uncompressed - 4 TB from databases, 6 TB from logs
    ▪   Over 2 TB compressed
▪   Data Processing Statistics
    ▪   3,200 Jobs and 800,000 Tasks per day
    ▪   55 TB of compressed data processed per day
    ▪   15 TB of compressed data produced per day
    ▪   80 M minutes of compute time per day
Cassandra
Structured Storage over a P2P Network
▪   Conceptually: BigTable data model on Dynamo infrastructure
▪   Design Goals:
    ▪   High availability
    ▪   Incremental scalability (see the consistent-hashing sketch after this slide)
    ▪   Eventual consistency (trade consistency for availability)
    ▪   Optimistic replication
    ▪   Low total cost of ownership
    ▪   Minimal administrative overhead
    ▪   Tunable tradeoffs between consistency, durability, and latency
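Dynamo-style incremental scalability comes from placing keys and nodes on a consistent-hash ring, so adding a node relocates only the keys between it and its predecessor. A minimal Python sketch of that idea, as a concept illustration rather than Cassandra's actual implementation (which adds replication and more):

import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

class Ring(object):
    def __init__(self, nodes=()):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._ring, (_hash(node), node))

    def node_for(self, key):
        # Walk clockwise from the key's position; wrap around at the end of the ring.
        positions = [pos for pos, _ in self._ring]
        i = bisect.bisect(positions, _hash(key)) % len(self._ring)
        return self._ring[i][1]

if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])
    keys = ["user:%d:inbox" % i for i in range(10)]
    before = dict((k, ring.node_for(k)) for k in keys)
    ring.add("node-d")                 # incremental scalability: bring up one new node
    moved = [k for k in keys if ring.node_for(k) != before[k]]
    print("keys that moved:", moved)   # only a fraction of keys relocate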
Cassandra
Architecture
Cassandra
Initial Application
▪   Inbox search
Cassandra
The Team
▪   Avinash Lakshman
▪   Prashant Malik
▪   Karthik Ranganathan
▪   Kannan Muthukkaruppan
Cassandra
Some Stats
▪   Cluster size - 120 nodes
    ▪   Single instance across two data centers
▪   Total data stored - 36 TB
▪   Writes - 300 million writes per day
▪   Reads - 1 million reads per day
▪   Read Latencies
    ▪   Min - 6.03 ms
    ▪   Mean - 90.6 ms
    ▪   Median - 18.24 ms
These slides are from my presentation at CCA08 (http://www.cca08.org).

(c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
