20080528dublinpt2

One in a series of presentations given at the IBM Cloud Computing Center in Dublin.


20080528dublinpt2 Presentation Transcript

  • 1. Hadoop Applications at Facebook Jeff Hammerbacher Manager, Data May 28 - 29, 2008
  • 2. Initial Hadoop Deployment ▪ Tested in mid-2006: not great performance, small community ▪ Already had Cheetah and another Hadoop-like project underway ▪ Strong resistance to Java ▪ Early adopters: Yahoo!, Powerset, Quantcast, Last.fm ▪ First serious cluster: spring 2007 ▪ Pulled sixty web server boxes and put 3 x 500 GB SATA disks in the back ▪ Loaded two separate log files: clickstream and activity logs ▪ Clickstream was nearly 600 GB per day, activity logs around 200 GB ▪ Lots of difficulties just getting data into the system ▪ All sorts of fun learning to operate the file system
  • 3. Initial Hadoop Applications Hadoop Streaming ▪ Almost all applications at Facebook use Hadoop Streaming ▪ Mapper and Reducer take inputs from a pipe and write outputs to a pipe ▪ Facebook users write in Python, PHP, C++ (though Pipes would be better) ▪ Allows for library reuse, faster development ▪ Eats way too much CPU ▪ More info: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
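The pipe-based contract on slide 3 can be illustrated with a minimal word-count sketch in Python. This is not Facebook's code: the function names `map_lines` and `reduce_lines` are hypothetical, and the sketch assumes the usual Streaming convention of one tab-separated key/value record per line, with the reducer receiving its input sorted by key.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word read from the pipe."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each key; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Simulate map -> sort -> reduce in one process (assumption: for testing only)."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

In a real job each function would live in its own small script that iterates over `sys.stdin` and prints each record, and the scripts would be handed to the streaming jar through its `-mapper` and `-reducer` options. `run_local` only imitates the sort that Hadoop performs between the two steps.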
  • 4. Initial Hadoop Applications Unstructured text analysis ▪ Intern asked to understand brand sentiment and influence ▪ First began by building an online language classifier for wall posts ▪ Ported application to Hadoop for offline processing ▪ Many tools for supporting his project had to be built ▪ Understanding serialization format of wall post logs ▪ Common data operations: project, filter, join, group by ▪ Developed using Hadoop streaming for rapid prototyping in Python ▪ Scheduling regular processing and recovering from failures ▪ Making it easy to regularly load new data
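The four common data operations named on slide 4 (project, filter, join, group by) can be sketched over in-memory records as follows. The record fields (`post_id`, `user_id`, `lang`, `country`) and helper names are invented for illustration; the slides do not describe the actual wall post log format or the tools that were built.

```python
from collections import defaultdict

# Hypothetical records; the real log schema is not given in the slides.
posts = [
    {"post_id": 1, "user_id": 10, "lang": "en", "text": "hello"},
    {"post_id": 2, "user_id": 11, "lang": "fr", "text": "bonjour"},
    {"post_id": 3, "user_id": 10, "lang": "en", "text": "hi again"},
]
users = [
    {"user_id": 10, "country": "IE"},
    {"user_id": 11, "country": "FR"},
]


def project(rows, fields):
    """Keep only the named fields of each row."""
    return [{f: row[f] for f in fields} for row in rows]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Inner hash join of two row lists on a shared key field."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def group_count(rows, key):
    """Group rows by a field and count the rows in each group."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    return dict(counts)


english = filter_rows(posts, lambda row: row["lang"] == "en")
slim = project(english, ["user_id", "lang"])
joined = join(slim, users, "user_id")
posts_per_country = group_count(joined, "country")
```

On a cluster the join and group by would be expressed through the shuffle (records keyed by `user_id` or `country`) instead of an in-memory index, but the logic per key is the same.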
  • 5. Lexicon
  • 6. Initial Hadoop Applications Lexicon: Future Directions ▪ Further segmentation and visualization of term intensities ▪ Age ▪ Gender ▪ Geography ▪ TF-IDF ▪ Topic modeling ▪ Sentiment analysis ▪ Augment with data sources from around the internet
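TF-IDF, listed on slide 6 as a future direction, weights a term by how often it appears in one document and how rare it is across all documents. A minimal sketch, assuming the common textbook definition (term frequency as a share of the document's tokens, inverse document frequency as log(N / df)); the slides do not say which variant Lexicon would use, and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. The score is
    (count of term in doc / tokens in doc) * log(number of docs / docs containing term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

A term that appears in every document gets a score of zero under this definition, which is the intended effect: it carries no information about any single document.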
  • 7. Initial Hadoop Applications Ensemble Learning ▪ Build a lot of Decision Trees and average them ▪ Random Forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest ▪ Can be used for regression or classification ▪ See “Random Forests” by Leo Breiman
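The "build a lot of trees and average them" idea on slide 7 can be sketched with bootstrap-aggregated one-split trees (stumps) on a single numeric feature. This is a deliberately simplified stand-in, not Breiman's full Random Forest algorithm (it omits random feature selection and deep trees) and not Facebook's implementation; all function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not right:  # the largest x leaves nothing on the right; skip it
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # every x is identical: fall back to the overall mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_ensemble(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_ensemble(stumps, x):
    """Average the individual predictions (regression, or a class-1 vote share)."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)
```

Because each tree is trained independently on its own sample, the training loop parallelizes naturally: each map task could fit one or more trees, and averaging happens at prediction time.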
  • 8. More Hadoop Applications Insights ▪ Monitor performance of your Facebook Ad, Page, Application ▪ Regular aggregation of high volumes of log file data ▪ First hourly pipelines ▪ Publish data back to a MySQL tier ▪ System currently only running partially on Hadoop
  • 9. Insights
  • 10. More Hadoop Applications Platform Application Reputation Scoring ▪ Users complaining about being spammed by Platform applications ▪ Now, every Platform Application has a set of quotas ▪ Notifications ▪ News Feed story insertion ▪ Invitations ▪ Emails ▪ Quotas determined by calculating a “reputation score” for the application
  • 11. Platform Application Reputation Scoring
  • 12. More Hadoop Applications Recommendation Engines and Affinity Scores ▪ People You May Know (PYMK) ▪ Other application areas ▪ Pages ▪ Applications ▪ News Feed ▪ Search ▪ Ads ▪ Chat
  • 13. More Hadoop Applications Miscellaneous ▪ Experimentation Platform back end ▪ A/B Testing ▪ Champion/Challenger Testing ▪ Lots of internal analyses ▪ Export smaller data sets to R ▪ Ad targeting optimization ▪ Search index building ▪ Load testing for new storage systems ▪ Language prediction for translation targeting
  • 14. (c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0