Hadoopsummitfb09 090611023401-phpapp02

  • 1. Hadoop and Hive at Facebook: Data and Applications. Dhruba Borthakur, Ding Zhou. Wednesday, June 10, 2009, Santa Clara Marriott
  • 2. Who generates this data? Lots of data is generated on Facebook »  200 million active users »  20 million users update their statuses at least once each day »  More than 850 million photos uploaded to the site each month »  More than 8 million videos uploaded each month »  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
  • 3. Where do we store parts of this data? »  Hadoop/Hive Warehouse ›  4800 cores, 2 petabytes total size »  Other Hadoop Clusters ›  HDFS-Scribe cluster: 320 cores, 160 TB total size ›  Hadoop Archival Cluster: 80 cores, 200 TB total size ›  Test cluster: 800 cores, 150 TB total size
  • 4. Data Collection using Scribe [Diagram labels: Network Storage and Servers; Web Servers; Scribe MidTier; Oracle RAC; Hadoop Hive Warehouse; MySQL]
  • 5. Data Collection using Scribe and HDFS [Diagram labels: Scribe MidTier; Realtime Hadoop Cluster; Web Servers; Oracle RAC; Hadoop Hive Warehouse; Hadoop Scribe Integration; MySQL]
  • 6. Data Archive: Move old data to cheap storage [Diagram labels: Hadoop Warehouse; distcp; NFS; Hadoop Archive Node; Cheap NAS; Hadoop Archival Cluster; 20 TB per node; Hive Query; HADOOP-5048]
  • 7. Hive User Interfaces »  Hive shell access »  Hive Web UI
  • 8. Data Analysis at Facebook »  Business Intelligence ›  Growth and monetization strategies ›  Product insights & decisions ›  Philosophy: build meta tools and provide easy access to data »  Artificial Intelligence ›  Recommendation & ranking products ›  Advertising optimization ›  Text analytics ›  Philosophy: model inference; data preparation; model building;
  • 9. BI: Build centralized reporting tools »  Top-level site metrics »  Bird's-eye view of user growth by country »  Comparing certain metrics between user groups
  • 10. BI: Make AdHoc reporting easy »  Example: “Find the number of status updates mentioning ‘swine flu’ per day last month” »  SELECT, count(1) FROM status_updates a WHERE a.status LIKE “%swine flu%” AND >= ‘2009-05-01’ AND <= ‘2009-05-31’ GROUP BY
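The query on slide 10 lost its date column name in the transcript, so it cannot be run as shown. The following is a minimal sketch of the same kind of ad hoc query, assuming a hypothetical column named `ds` for the date and using Python's built-in `sqlite3` as a stand-in for Hive. The table contents are made up for illustration; none of this is the schema actually used at Facebook.

```python
import sqlite3

# Assumption: `ds` is a hypothetical date column; the real name is missing
# from the slide transcript. sqlite3 stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_updates (ds TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-05-01", "worried about swine flu"),
        ("2009-05-01", "swine flu is all over the news"),
        ("2009-05-02", "lovely weather today"),
        ("2009-05-02", "stayed home because of the swine flu scare"),
        ("2009-06-01", "swine flu again"),  # outside the date range
    ],
)


def mentions_per_day(conn, phrase, start, end):
    """Count status updates containing `phrase` for each day in [start, end]."""
    return conn.execute(
        """
        SELECT a.ds, count(1)
        FROM status_updates a
        WHERE a.status LIKE ?
          AND a.ds >= ? AND a.ds <= ?
        GROUP BY a.ds
        ORDER BY a.ds
        """,
        (f"%{phrase}%", start, end),
    ).fetchall()


result = mentions_per_day(conn, "swine flu", "2009-05-01", "2009-05-31")
print(result)  # [('2009-05-01', 2), ('2009-05-02', 1)]
```

Storing dates as ISO-formatted strings keeps the range comparison correct under plain string ordering, which is one reason a date-string column is a convenient choice in a sketch like this.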
  • 11. Build site metric dashboard in a day »  Data collection: ›  Define metrics and log format (Hive schema) ›  Add logging to the site (Scribe logging) ›  Create a Hive table partitioned by date ›  Set up metric ETL cron job (Hive -> mysql/oracle) »  Data visualization (using mysql) »  Data access (adhoc query using Hive)
  • 12. Build Machine Learning Products on Hadoop/Hive •  Recommendation & ranking •  Advertising optimization •  Text analytics
  • 13. What applications the user may like »  Recommend apps based on social and demographic popularity »  User-app log is huge »  Joining user-app log with user demographics is difficult »  Hive for data aggregation
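The aggregation described on slide 13 can be illustrated at toy scale. This is only a sketch under stated assumptions: the log rows, the demographic segments, and the function names `popularity_by_segment` and `recommend` are all hypothetical, and it covers demographic popularity only, not the social signal the slide also mentions. At Facebook's scale the join and counting would be done in Hive rather than in memory.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (user_id, app) usage events and user demographics.
user_app_log = [
    (1, "chess"), (2, "chess"), (1, "farm"),
    (3, "farm"), (4, "farm"), (4, "quiz"),
]
demographics = {
    1: ("18-24", "US"),
    2: ("18-24", "US"),
    3: ("25-34", "UK"),
    4: ("25-34", "UK"),
}


def popularity_by_segment(log, demo):
    """Join each log row to its user's segment and count apps per segment."""
    counts = defaultdict(Counter)
    for user_id, app in log:
        segment = demo.get(user_id)
        if segment is None:  # inner join: drop rows with no demographic record
            continue
        counts[segment][app] += 1
    return counts


def recommend(segment, counts, k=1):
    """Return the k most popular apps within a demographic segment."""
    return [app for app, _ in counts[segment].most_common(k)]


counts = popularity_by_segment(user_app_log, demographics)
print(recommend(("18-24", "US"), counts))  # ['chess']
print(recommend(("25-34", "UK"), counts))  # ['farm']
```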
  • 14. Who the user wants to connect with »  Take existing edges and user feedback as labels »  Build regression models based on user profile and local graph features »  Too many friends of friends »  Model trained by sampling »  Hive for model inference »  Hive for feature selection
  • 15. What users are talking about (Lexicon) »  Market research & ad tool »  Extract popular words from user content »  Slice by age, gender, region »  Sentiment analysis »  Keyword association »  Hadoop used for text analytics [Figure labels: laid-off; Words associated with vodka]
  • 16. What ads the user might click on »  Predict user-ad click-through »  Ad click data is sparse, so sampling can miss information »  Many ML algorithms are iterative and thus not easy to run on Hadoop »  Hadoop for model training
  • 17. Build ensemble ML models on Hadoop »  Each mapper trains a number of models »  Each model's output is used as an intermediate feature »  Model selection at reducer »  A regression model is built on selected features [Diagram labels: train models locally; cross-test models locally; ds1, ds2, ds3, ds4; ensembles; models assembled by ensemble methods; model inference in a second Hadoop job]
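The flow on slide 17 can be sketched in plain Python without Hadoop. Everything concrete here is an assumption made for illustration: the synthetic data, the choice of single-feature linear models as the per-mapper models, the half-and-half train/test split, and the function names `mapper`, `reducer`, and `fit_ensemble`. The slide does not say which model families or selection rules were used, and the inference step that the slide places in a second Hadoop job is done inline here.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit y ~ a + b*x for one feature; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mapper(shard):
    """Train one single-feature model per feature on half of the shard,
    cross-test on the other half, and emit (test_mse, feature, a, b)."""
    half = len(shard) // 2
    train, test = shard[:half], shard[half:]
    out = []
    for j in range(len(shard[0][0])):
        a, b = fit_line([x[j] for x, _ in train], [y for _, y in train])
        mse = sum((a + b * x[j] - y) ** 2 for x, y in test) / len(test)
        out.append((mse, j, a, b))
    return out


def reducer(candidates, k):
    """Keep the lowest-error model per feature, then the k best of those."""
    best = {}
    for cand in candidates:
        j = cand[1]
        if j not in best or cand[0] < best[j][0]:
            best[j] = cand
    return sorted(best.values())[:k]


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for t in range(c, n + 1):
                M[r][t] -= f * M[c][t]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][t] * w[t] for t in range(r + 1, n))) / M[r][r]
    return w


def features(models, x):
    """Intermediate features: a constant plus each selected model's output."""
    return [1.0] + [a + b * x[j] for _, j, a, b in models]


def fit_ensemble(models, rows):
    """Least-squares regression of y on the selected models' outputs."""
    F = [features(models, x) for x, _ in rows]
    n = len(F[0])
    A = [[sum(f[i] * f[j] for f in F) for j in range(n)] for i in range(n)]
    b = [sum(f[i] * y for f, (_, y) in zip(F, rows)) for i in range(n)]
    return solve(A, b)


def predict(models, w, x):
    return sum(wi * fi for wi, fi in zip(w, features(models, x)))


rng = random.Random(0)


def make_row():
    # Synthetic data: features 0 and 1 drive y, feature 2 is pure noise.
    x = [rng.uniform(-1, 1) for _ in range(3)]
    return x, 3 * x[0] - 3 * x[1] + rng.gauss(0, 0.1)


shards = [[make_row() for _ in range(400)] for _ in range(4)]  # ds1..ds4
candidates = [c for shard in shards for c in mapper(shard)]    # map phase
selected = reducer(candidates, k=2)                            # reduce phase
all_rows = [row for shard in shards for row in shard]
weights = fit_ensemble(selected, all_rows)
selected_features = sorted(j for _, j, _, _ in selected)
mse = sum((predict(selected, weights, x) - y) ** 2 for x, y in all_rows) / len(all_rows)
print(selected_features, mse)
```

Keeping only one model per feature in the reducer is a design choice of this sketch: two models trained on the same feature in different shards would produce nearly collinear intermediate features, which makes the final regression ill-conditioned.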
  • 18. In summary »  Hadoop and Hive at Facebook »  Support product strategy and decision; »  Recommendation & ranking products; »  Advertising optimization; »  Text analytics tools; »  So Zuckerberg’s urgent questions are answered; »  So celebrities know where their fans are from; »  So we know one can like vodka and lemonade at the same time; »  It’s fun playing with the data; Dhruba Borthakur, Ding Zhou dhruba@, dzhou@