Published on

One in a series of presentations given at the IBM Cloud Computing Center in Dublin.

Published in: Business, Technology
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Hadoop Applications at Facebook Jeff Hammerbacher Manager, Data May 28 - 29, 2008
  2. 2. Initial Hadoop Deployment ▪ Tested in mid-2006: not great performance, small community ▪ Already had Cheetah and another Hadoop-like project underway ▪ Strong resistance to Java ▪ Early adopters: Yahoo!, Powerset, Quantcast, Last.fm ▪ First serious cluster: spring 2007 ▪ Pulled sixty web server boxes and put 3 x 500 GB SATA disks in the back ▪ Loaded two separate log files: clickstream and activity logs ▪ Clickstream was nearly 600 GB per day, activity logs around 200 GB ▪ Lots of difficulties just getting data into the system ▪ All sorts of fun learning to operate the file system
  3. 3. Initial Hadoop Applications Hadoop Streaming ▪ Almost all applications at Facebook use Hadoop Streaming ▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe ▪ Facebook users write in Python, PHP, C++ (though Pipes would be better) ▪ Allows for library reuse, faster development ▪ Eats way too much CPU ▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
  4. 4. Initial Hadoop Applications Unstructured text analysis ▪ Intern asked to understand brand sentiment and influence ▪ First began by building an online language classifier for wall posts ▪ Ported application to Hadoop for offline processing ▪ Many tools for supporting his project had to be built ▪ Understanding serialization format of wall post logs ▪ Common data operations: project, filter, join, group by ▪ Developed using Hadoop streaming for rapid prototyping in Python ▪ Scheduling regular processing and recovering from failures ▪ Making it easy to regularly load new data
  5. 5. Lexicon
  6. 6. Initial Hadoop Applications Lexicon: Future Directions ▪ Further segmentation and visualization of term intensities ▪ Age ▪ Gender ▪ Geography ▪ TF-IDF ▪ Topic modeling ▪ Sentiment analysis ▪ Augment with data sources from around the internet
  7. 7. Initial Hadoop Applications Ensemble Learning ▪ Build a lot of Decision Trees and average them ▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest ▪ Can be used for regression or classification ▪ See “Random Forests” by Leo Breiman
  8. 8. More Hadoop Applications Insights ▪ Monitor performance of your Facebook Ad, Page, Application ▪ Regular aggregation of high volumes of log file data ▪ First hourly pipelines ▪ Publish data back to a MySQL tier ▪ System currently only running partially on Hadoop
  9. 9. Insights
  10. 10. More Hadoop Applications Platform Application Reputation Scoring ▪ Users complaining about being spammed by Platform applications ▪ Now, every Platform Application has a set of quotas ▪ Notifications ▪ News Feed story insertion ▪ Invitations ▪ Emails ▪ Quotas determined by calculating a “reputation score” for the application
  11. 11. Platform Application Reputation Scoring
  12. 12. More Hadoop Applications Recommendation Engines and Affinity Scores ▪ People You May Know (PYMK) ▪ Other application areas ▪ Pages ▪ Applications ▪ News Feed ▪ Search ▪ Ads ▪ Chat
  13. 13. More Hadoop Applications Miscellaneous ▪ Experimentation Platform back end ▪ A/B Testing ▪ Champion/Challenger Testing ▪ Lots of internal analyses ▪ Export smaller data sets to R ▪ Ad targeting optimization ▪ Search index building ▪ Load testing for new storage systems ▪ Language prediction for translation targeting
  14. 14. (c) 2008 Facebook, Inc. or its licensors.  quot;Facebookquot; is a registered trademark of Facebook, Inc.. All rights reserved. 1.0