20080528dublinpt1

    20080528dublinpt1 - Presentation Transcript

    1. Measuring the Social Graph: Facebook’s Data Infrastructure
       Jeff Hammerbacher, Manager, Data
       May 28 - 29, 2008
    2. My Background (hammer@facebook.com)
       ▪ Studied Mathematics at Harvard
       ▪ Worked as a Quant on Wall Street
       ▪ Came to Facebook in early 2006 as a Research Scientist
       ▪ Now manage the Facebook Data Team
         ▪ 22 amazing data engineers and scientists with more on the way
         ▪ Skills span databases, distributed systems, statistics, machine learning, data visualization, social network analysis, and more
    3. Developing for Facebook.com: Commonly Used Software
       ▪ Traditional LAMP stack
         ▪ Linux: mostly FC and RHEL; now testing CentOS
         ▪ Apache
         ▪ MySQL: running 4.1 for a long time, now up to 5.0
         ▪ PHP: heavily modified
       ▪ Memcached
       ▪ Subversion
       ▪ Firebug
       ▪ Thrift
    4. Serving Facebook.com: Data Retrieval and Hardware
       Request: GET /index.php HTTP/1.1, Host: www.facebook.com
       ▪ Session information
         ▪ Stored on client (in cookie)
         ▪ New web server for each request
       ▪ Three server profiles:
         ▪ Web (all CPU)
         ▪ Memcached (all RAM)
         ▪ MySQL (mostly RAM)
       [Diagram labels: Web Tier (more than 10,000 servers), Memcached Tier (around 1,000 servers), MySQL Tier (around 2,000 servers)]
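
Keeping session information in the client's cookie is what lets a different web server handle each request. A minimal Python sketch of that idea follows, assuming an HMAC-signed cookie; the secret, the session fields, and both helper functions are hypothetical illustrations, not a description of Facebook's actual scheme.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared secret; every web server would hold the same key.
SECRET = b"example-secret"

def encode_session(session: dict) -> str:
    """Serialize the session and append an HMAC so tampering is detectable."""
    payload = base64.urlsafe_b64encode(
        json.dumps(session, sort_keys=True).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def decode_session(cookie: str):
    """Return the session dict, or None if the signature does not match."""
    payload, _, sig = cookie.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode()))
```

Any server holding the key can call `decode_session` on the incoming cookie, so no server needs to remember the previous request.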
    5. Serving Facebook.com: Request Volume per Second
       [Diagram: Web → 10M requests → Memcache (15TB RAM) → 500K requests → MySQL (25TB RAM)]
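
The two request figures on this slide (10M and 500K per second) can be compared with a little arithmetic. The sketch below assumes the 10M figure is traffic reaching memcache and that every MySQL request is a memcache miss that fell through; the slide states neither, and MySQL also serves writes, so treat the result as a rough reading only.

```python
# Figures quoted on the slide, in requests per second.
memcache_rps = 10_000_000
mysql_rps = 500_000

# Assumption: every MySQL request is a memcache miss that fell through.
miss_ratio = mysql_rps / memcache_rps
hit_ratio = 1 - miss_ratio

print(f"MySQL sees {miss_ratio:.0%} of memcache volume")
print(f"Implied cache hit ratio: {hit_ratio:.0%}")
```

Under that assumption MySQL sees about one request for every twenty that reach memcache.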
    6. Services Infrastructure: More About Thrift
       ▪ Developing a Thrift service
         ▪ Define your data structures
           ▪ JSON-like data model
         ▪ Define your service endpoints
         ▪ Select your languages
         ▪ Generate stub code
         ▪ Write service logic
         ▪ Write client
         ▪ Configure and deploy!
         ▪ Monitor, provision, and upgrade
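
The development steps on this slide can be sketched in plain Python. This is not the Thrift API: the dataclass stands in for a generated struct, `handle_request` stands in for generated stub code, JSON stands in for Thrift's serialization, and every name here (`UserProfile`, `ProfileService`, and so on) is hypothetical.

```python
import json
from dataclasses import dataclass, asdict

# Define a data structure (Thrift would generate this from an IDL struct).
@dataclass
class UserProfile:
    uid: int
    name: str

# Define the service endpoints as an interface.
class ProfileService:
    def get_profile(self, uid: int) -> UserProfile:
        raise NotImplementedError

# Write the service logic.
class InMemoryProfileService(ProfileService):
    def __init__(self):
        self._profiles = {1: UserProfile(uid=1, name="alice")}

    def get_profile(self, uid: int) -> UserProfile:
        return self._profiles[uid]

# Stand-in for generated stubs: decode the call, dispatch, encode the reply.
def handle_request(service: ProfileService, raw: str) -> str:
    request = json.loads(raw)
    result = getattr(service, request["method"])(*request["args"])
    return json.dumps(asdict(result))

# Write the client.
class ProfileClient:
    def __init__(self, service: ProfileService):
        self._service = service  # a real client would hold a network transport

    def get_profile(self, uid: int) -> UserProfile:
        raw = json.dumps({"method": "get_profile", "args": [uid]})
        return UserProfile(**json.loads(handle_request(self._service, raw)))
```

With a real code generator, only the struct, the interface, and the service logic would be written by hand; the serialization and dispatch would come from the generated stubs.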
    7. Services Infrastructure: Reinventing the SOA Wheel
       ▪ Almost all services written in Thrift
         ▪ Networks Type-ahead, Search, Ads, SMS Gateway, Chat, Notes Import, Scribe
       ▪ Batteries included
         ▪ Network transport libraries
         ▪ Serialization libraries
         ▪ Code generation
         ▪ Robust server implementations (multithreaded, nonblocking, etc.)
       ▪ Now an Apache Incubator project
       ▪ For more information, read the whitepaper
       ▪ Related projects: Sun-RPC, CORBA, RMI, ICE, XML-RPC, JSON, Cisco’s Etch
    8. Data Infrastructure: Offline Batch Processing
       ▪ “Data Warehousing”
         ▪ Began with Oracle database
         ▪ Schedule data collection via cron
         ▪ Collect data every 24 hours
         ▪ “ETL” scripts: hand-coded Python
       ▪ Data volumes quickly grew
         ▪ Started at tens of GB in early 2006
         ▪ Up to about 1 TB per day by mid-2007
         ▪ Log files largest source of data growth
       [Diagram labels: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
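
A hand-coded daily ETL script of the kind this slide mentions might look roughly like the sketch below. It is an illustration under assumptions: `sqlite3` stands in for both the source database and the Oracle warehouse, and the table and column names (`events`, `daily_action_counts`) are hypothetical.

```python
import sqlite3

def extract(source, day):
    """Pull one day's raw rows from the source database."""
    cur = source.execute(
        "SELECT user_id, action FROM events WHERE day = ?", (day,))
    return cur.fetchall()

def transform(rows):
    """Summarize raw events into per-action counts."""
    counts = {}
    for _user_id, action in rows:
        counts[action] = counts.get(action, 0) + 1
    return sorted(counts.items())

def load(warehouse, day, summary):
    """Write the summary rows into the warehouse table."""
    warehouse.executemany(
        "INSERT INTO daily_action_counts (day, action, n) VALUES (?, ?, ?)",
        [(day, action, n) for action, n in summary])
    warehouse.commit()

def run_daily_etl(source, warehouse, day):
    """One 24-hour collection cycle: extract, transform, load."""
    load(warehouse, day, transform(extract(source, day)))
```

Scheduling it every 24 hours would then be a single crontab line such as `0 2 * * * python etl.py`, where the script name and the 02:00 start time are placeholders.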
    9. Data Infrastructure: Distributed Processing with Cheetah
       ▪ Goal: summarize log files outside of the database
       ▪ Solution: Cheetah, a distributed log file processing system
         ▪ Distributor.pl: distribute binaries to processing nodes
         ▪ C++ Binaries: parse, agg, load
       [Diagram labels: Partitioned Log File, Cheetah Master, Filer, Processing Tier]
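
The slides do not say how Cheetah's parse and agg binaries divided the work, so the following Python sketch only illustrates the general pattern of summarizing a partitioned log file: each partition is parsed and aggregated independently, then the partial summaries are merged. The log line layout and all function names are assumptions, and the final load into the warehouse is left out.

```python
from collections import Counter

def parse(line):
    """Extract the field to summarize; here, the second whitespace-separated token."""
    return line.split()[1]

def agg(partition):
    """Summarize one partition of the log independently of the others."""
    return Counter(parse(line) for line in partition if line.strip())

def merge(partials):
    """Combine per-partition summaries into one result ready for loading."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

def summarize(log_lines, n_partitions):
    # Round-robin split stands in for the partitioned log file on the filer.
    partitions = [log_lines[i::n_partitions] for i in range(n_partitions)]
    # Each agg call could run on a separate processing node.
    return merge(agg(p) for p in partitions)
```

Because `agg` touches only its own partition, the per-partition work can be spread across nodes and only the small summaries need to travel back.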
    10. Data Infrastructure: Moving from Cheetah to Hadoop
        ▪ Cheetah limitations
          ▪ Limited filer bandwidth
          ▪ No centralized log file metadata
          ▪ Writing a new Cheetah job requires writing C++ binaries
          ▪ No support for ad hoc querying
          ▪ Not open source
    11. Data Infrastructure: Hadoop as Enterprise Data Warehouse
        [Diagram labels: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]
    12. The Facebook Data team builds scalable platforms for the collection, management, and analysis of data. We use these platforms to help drive informed decisions in areas critical to the success of the company. We build tools and provide support for anyone at Facebook who would like to use our platforms to help make data-driven decisions or build data-intensive products and services.
    13. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

    jhammerb, 2 years ago

    One in a series of presentations given at the IBM C more
