20081022cca

    20081022cca - Presentation Transcript

    1. Data Management at Facebook (Back in the Day) Jeff Hammerbacher VP Product and Chief Scientist, Cloudera October 22, 2008
    2. My Background Thanks for Asking ▪ hammer@cloudera.com ▪ Studied Mathematics at Harvard ▪ Worked as a Quant on Wall Street ▪ Came to Facebook in early 2006 as a Research Scientist ▪ Managed the Facebook Data Team through September 2008 ▪ Over 25 amazing engineers and data scientists ▪ Now a cofounder of Cloudera ▪ Hadoop support and optimization
    3. Common Themes 1. Simplicity ▪ Do one thing well ... 2. Scalability ▪ ... a lot 3. Manageability ▪ Remove the humans 4. Open Source ▪ Build a community
    4. Serving Facebook.com Data Retrieval and Hardware ▪ GET /index.php HTTP/1.1 Host: www.facebook.com ▪ Three main server profiles: ▪ Web ▪ Memcached ▪ MySQL ▪ Simplified away: ▪ AJAX ▪ Photo and Video ▪ Services [Diagram: Web Tier (more than 10,000 servers), Memcached Tier (around 1,000 servers), MySQL Tier (around 2,000 servers)]
    5. Services Infrastructure What’s an SOA? ▪ Almost all services written in Thrift ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe ▪ Batteries included ▪ Network transport libraries ▪ Serialization libraries ▪ Code generation ▪ Robust server implementations (multithreaded, nonblocking, etc.) ▪ Now an Apache Incubator project ▪ For more information, read the whitepaper
    6. Services Infrastructure Thrift, Mainly ▪ Developing a Thrift service: ▪ Define your data structures ▪ JSON-like data model ▪ Define your service endpoints ▪ Select your languages ▪ Generate stub code ▪ Write service logic ▪ Write client ▪ Configure and deploy ▪ Monitor, provision, and upgrade
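The workflow on this slide (data structures, endpoints, stub code, service logic, client) can be sketched in plain Python. This is only an illustration of the shape of the steps: it does not use the Thrift library, and in real Thrift the struct and interface below would be generated from an IDL file rather than written by hand. The names `UserSuggestion`, `TypeaheadIface`, `TypeaheadHandler`, and `lookup` are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


# Step 1: a data structure (in Thrift, a struct declared in an IDL file).
@dataclass
class UserSuggestion:
    user_id: int
    display_name: str
    score: float = 0.0


# Step 2: the service endpoints (in Thrift, a service declaration that the
# compiler turns into stub code for each selected language).
class TypeaheadIface(ABC):
    @abstractmethod
    def suggest(self, prefix: str, limit: int) -> List[UserSuggestion]:
        ...


# Step 3: the service logic, written by hand against the interface.
class TypeaheadHandler(TypeaheadIface):
    def __init__(self, names: Dict[int, str]) -> None:
        self._names = names

    def suggest(self, prefix: str, limit: int) -> List[UserSuggestion]:
        prefix = prefix.lower()
        matches = [
            UserSuggestion(uid, name, score=1.0)
            for uid, name in sorted(self._names.items())
            if name.lower().startswith(prefix)
        ]
        return matches[:limit]


# Step 4: a client. It calls the handler in-process here; a real client
# would go through a transport and a serialization layer.
def lookup(client: TypeaheadIface, prefix: str) -> List[str]:
    return [s.display_name for s in client.suggest(prefix, limit=5)]


handler = TypeaheadHandler({1: "Alice", 2: "Alan", 3: "Bob"})
print(lookup(handler, "al"))
```

Keeping the interface separate from the handler mirrors the split between generated stubs and hand-written logic; configuration, deployment, and monitoring are outside what a short sketch can show.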
    7. Data Infrastructure Offline Batch Processing ▪ “Data Warehousing” ▪ Began with Oracle database ▪ Schedule data collection via cron ▪ Collect data every 24 hours ▪ “ETL” scripts: hand-coded Python ▪ Data volumes quickly grew ▪ Started at tens of GB in early 2006 ▪ Up to about 1 TB per day by mid-2007 ▪ Log files largest source of data growth [Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
    8. Data Infrastructure Distributed Processing with Cheetah ▪ Goal: summarize log files outside of the database ▪ Solution: Cheetah, a distributed log file processing system ▪ Distributor.pl: distribute binaries to processing nodes ▪ C++ Binaries: parse, agg, load [Diagram: Partitioned Log File, Cheetah Master, Filer, Processing Tier]
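The parse, agg, load stages named on this slide can be sketched as three small functions over a partitioned log file. This is not Cheetah's code (its binaries were C++ and their interfaces are not given in the talk); the tab-separated `timestamp, user_id, action` log format and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse(line: str) -> Tuple[str, str]:
    """Parse one tab-separated log line into (date, action)."""
    timestamp, _user_id, action = line.rstrip("\n").split("\t")
    return timestamp[:10], action


def aggregate(partition: Iterable[str]) -> Counter:
    """Count events per (date, action) within a single partition."""
    return Counter(parse(line) for line in partition)


def load(partials: Iterable[Counter]) -> List[Tuple[str, str, int]]:
    """Merge per-partition counts into sorted rows ready for a database load."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return sorted((date, action, n) for (date, action), n in total.items())


# Two hypothetical partitions of one log file, one per processing node.
partitions = [
    ["2007-06-01T00:00:01\t42\tview", "2007-06-01T00:00:02\t7\tclick"],
    ["2007-06-01T00:05:00\t42\tview", "2007-06-02T09:00:00\t9\tview"],
]
rows = load(aggregate(p) for p in partitions)
print(rows)
```

Because each partition is summarized independently, only the small per-partition counts need to be merged, which is what lets the summarization happen outside the database.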
    9. Data Infrastructure Moving from Cheetah to Hadoop ▪ Cheetah limitations ▪ Limited filer bandwidth ▪ No centralized logfile metadata ▪ Writing a new Cheetah job requires writing C++ binaries ▪ Jobs are difficult to monitor and debug ▪ No support for ad hoc querying ▪ Not open source
    10. Data Infrastructure Hadoop as Enterprise Data Warehouse Scribe Tier MySQL Tier Hadoop Tier Oracle RAC Servers
    11. Initial Hadoop Applications Unstructured text analysis ▪ Intern asked to understand brand sentiment and influence ▪ Many tools for supporting his project had to be built ▪ Understanding serialization format of wall post logs ▪ Common data operations: project, filter, join, group by ▪ Developed using Hadoop streaming for rapid prototyping in Python ▪ Scheduling regular processing and recovering from failures ▪ Making it easy to regularly load new data
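A Hadoop streaming job in Python boils down to a mapper and a reducer that exchange tab-separated lines, with the framework sorting mapper output by key in between. The sketch below covers three of the operations listed on this slide (project, filter, group by) and simulates the sort locally. The `post_id, text` record format and the `BRANDS` set are hypothetical; the actual wall post serialization format is not described in the talk.

```python
from itertools import groupby
from typing import Iterable, Iterator, List

BRANDS = {"acme", "globex"}  # hypothetical terms of interest


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Project the text field, filter to known brands, emit 'brand<TAB>1'."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed records
        _post_id, text = fields
        for word in text.lower().split():
            if word in BRANDS:
                yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Group by key and sum the counts; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for brand, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{brand}\t{sum(int(value) for _, value in group)}"


def run_local(lines: Iterable[str]) -> List[str]:
    """Simulate map, sort, and reduce on one machine for quick prototyping."""
    return list(reducer(sorted(mapper(lines))))


sample = ["1\tI love acme", "2\tacme vs globex", "malformed line"]
print(run_local(sample))
```

In an actual streaming job, each function would be wrapped in a small script that reads `sys.stdin` and prints its output lines; `run_local` exists only so the logic can be checked without a cluster.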
    12. Lexicon
    13. Initial Hadoop Applications Ensemble Learning ▪ Build a lot of Decision Trees and average them ▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest ▪ Can be used for regression or classification ▪ See “Random Forests” by Leo Breiman
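"Build a lot of decision trees and average them" can be sketched in a few lines of standard-library Python. This is a simplification for illustration, not Breiman's full algorithm and not the code used at Facebook: each tree here is a depth-1 regression stump trained on a bootstrap sample with one randomly chosen feature, whereas Random Forests grow deeper trees and sample features at every split. The function names are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (features, target)
Model = Callable[[Sequence[float]], float]


def fit_stump(rows: List[Row], feature: int) -> Model:
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        if not left or not right:
            continue
        lm, rm = mean(left), mean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    if best is None:  # no valid split: fall back to the sample mean
        constant = mean(y for _, y in rows)
        return lambda x: constant
    _, threshold, lm, rm = best
    return lambda x: lm if x[feature] <= threshold else rm


def fit_forest(rows: List[Row], n_trees: int = 50, seed: int = 0) -> Model:
    """Train n_trees stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)        # random feature per tree
        trees.append(fit_stump(sample, feature))
    return lambda x: mean(tree(x) for tree in trees)


# Toy data: the target depends only on feature 0; feature 1 is noise.
data = [((i / 10, (i * 7 % 10) / 10), 1.0 if i / 10 > 0.5 else 0.0) for i in range(11)]
forest = fit_forest(data)
print(forest((1.0, 0.5)), forest((0.0, 0.5)))
```

Each tree is trained independently of the others, which is why this kind of ensemble maps naturally onto a batch system such as Hadoop.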
    14. More Hadoop Applications Insights ▪ Monitor performance of your Facebook Ad, Page, Application ▪ Regular aggregation of high volumes of log file data ▪ First hourly pipelines ▪ Publish data back to a MySQL tier ▪ System currently only running partially on Hadoop
    15. Insights
    16. More Hadoop Applications Platform Application Reputation Scoring ▪ Users complaining about being spammed by Platform applications ▪ Now, every Platform Application has a set of quotas ▪ Notifications ▪ News Feed story insertion ▪ Invitations ▪ Emails ▪ Quotas determined by calculating a “reputation score” for the application
    17. Hive Structured Data Management with Hadoop ▪ Hadoop: ▪ HDFS ▪ MapReduce ▪ Resource Manager ▪ Job Scheduler ▪ Hive: ▪ Logical data partitioning ▪ Metadata store (command line and web interfaces) ▪ Query Operators ▪ Query Language
    18. Hive
    19. Hive The Team ▪ Joydeep Sen Sarma ▪ Ashish Thusoo ▪ Pete Wyckoff ▪ Suresh Anthony ▪ Zheng Shao ▪ Venky Iyer ▪ Dhruba Borthakur ▪ Namit Jain ▪ Raghu Murthy ▪ Prasad Chakka
    20. Hive Some Stats ▪ Cluster size - 320 nodes, 2560 cores, 1.3 PB capacity ▪ Total data (compressed, deduplicated) - 180 TB ▪ Net data per day ▪ 10 TB uncompressed - 4 TB from databases, 6 TB from logs ▪ Over 2 TB compressed ▪ Data Processing Statistics ▪ 3,200 Jobs and 800,000 Tasks per day ▪ 55 TB of compressed data processed per day ▪ 15 TB of compressed data produced per day ▪ 80 M minutes of compute time per day
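A few ratios follow directly from the figures on this slide. The short calculation below is a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1000 TB) and treats "over 2 TB compressed" as exactly 2 TB, so the compression ratio it reports is an upper bound. The function name is hypothetical.

```python
def hive_cluster_ratios() -> dict:
    """Derive per-node and per-job ratios from the slide's headline numbers."""
    nodes, cores, capacity_pb = 320, 2560, 1.3
    jobs_per_day, tasks_per_day = 3_200, 800_000
    raw_tb_per_day = 10
    compressed_tb_per_day = 2  # slide says "over 2 TB"; 2 is used as a floor
    capacity_tb = capacity_pb * 1000  # assumption: 1 PB = 1000 TB
    return {
        "cores_per_node": cores / nodes,
        "capacity_tb_per_node": capacity_tb / nodes,
        "tasks_per_job": tasks_per_day / jobs_per_day,
        "max_compression_ratio": raw_tb_per_day / compressed_tb_per_day,
    }


for name, value in hive_cluster_ratios().items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the cluster works out to 8 cores and roughly 4 TB of capacity per node, an average of 250 tasks per job, and a compression ratio of at most 5 to 1 on incoming data.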
    21. Cassandra Structured Storage over a P2P Network ▪ Conceptually: BigTable data model on Dynamo infrastructure ▪ Design Goals: ▪ High availability ▪ Incremental scalability ▪ Eventual consistency (trade consistency for availability) ▪ Optimistic replication ▪ Low total cost of ownership ▪ Minimal administrative overhead ▪ Tunable tradeoffs between consistency, durability, and latency
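One common way to picture a tunable tradeoff between consistency and latency is the quorum rule from Dynamo-style systems: with N replicas, W acknowledgements per write, and R replies per read, every read overlaps the latest write whenever R + W > N. The slide does not spell out this rule, so the sketch below is an assumption based on the Dynamo design the slide refers to, not a description of Cassandra's code. It checks the rule against brute-force enumeration of replica sets.

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Dynamo-style rule: read and write sets always intersect iff R + W > N."""
    return r + w > n


def overlap_by_enumeration(n: int, r: int, w: int) -> bool:
    """Check every possible write set against every possible read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# With N = 3: R = W = 2 always overlaps; R = W = 1 is faster but may read stale data.
print(quorums_overlap(3, 2, 2), quorums_overlap(3, 1, 1))
```

Lowering R or W reduces the number of replicas a request waits on, which is the latency side of the tradeoff; raising them so that R + W > N is the consistency side.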
    22. Cassandra Architecture
    23. Cassandra Initial Application ▪ Inbox search
    24. Cassandra The Team ▪ Avinash Lakshman ▪ Prashant Malik ▪ Karthik Ranganathan ▪ Kannan Muthukkaruppan
    25. Cassandra Some Stats ▪ Cluster size - 120 nodes ▪ Single instance across two data centers ▪ Total data stored - 36 TB ▪ Writes - 300 million writes per day. ▪ Reads - 1 million reads per day. ▪ Read Latencies ▪ Min - 6.03 ms ▪ Mean - 90.6 ms ▪ Median - 18.24 ms
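The per-day figures on this slide are easier to compare as per-second rates and per-node averages. The calculation below is a rough sketch that assumes load is spread evenly over the day and across nodes, which the slide does not claim; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def cassandra_rates() -> dict:
    """Convert the slide's daily totals into average rates and ratios."""
    nodes, stored_tb = 120, 36
    writes_per_day, reads_per_day = 300_000_000, 1_000_000
    mean_latency_ms, median_latency_ms = 90.6, 18.24
    return {
        "writes_per_second": writes_per_day / SECONDS_PER_DAY,
        "reads_per_second": reads_per_day / SECONDS_PER_DAY,
        "writes_per_read": writes_per_day / reads_per_day,
        "tb_per_node": stored_tb / nodes,
        "mean_to_median_latency": mean_latency_ms / median_latency_ms,
    }


for name, value in cassandra_rates().items():
    print(f"{name}: {value:.2f}")
```

On average that is about 3,472 writes and 12 reads per second, or 300 writes for every read, with 0.3 TB stored per node. The mean read latency being roughly five times the median suggests a long tail of slow reads.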
    26. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

    jhammerb, 2 years ago


    My presentation from CCA08 (http://www.cca08.org).
