Facebook Hadoop Data & Applications
Published in: Technology, Education
Transcript

  • 1. Hadoop and Hive at Facebook: Data and Applications. Dhruba Borthakur, Ding Zhou. Wednesday, June 10, 2009, Santa Clara Marriott
  • 2. Who generates this data? Lots of data is generated on Facebook »  200 million active users »  20 million users update their statuses at least once each day »  More than 850 million photos uploaded to the site each month »  More than 8 million videos uploaded each month »  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week http://www.slideshare.net/guest5b1607/text-analytics-summit-2009-roddy-lindsay-social-media-happiness-petabytes-and-lols
  • 3. Where do we store parts of this data? » Hadoop/Hive Warehouse › 4800 cores, 2 PB total size » Other Hadoop clusters › HDFS-Scribe cluster: 320 cores, 160 TB total size › Hadoop Archival Cluster: 80 cores, 200 TB total size › Test cluster: 800 cores, 150 TB total size
  • 4. Data Collection using Scribe [Diagram labels: Web Servers, Scribe MidTier, Network Storage and Servers, Hadoop Hive Warehouse, Oracle RAC, MySQL]
  • 5. Data Collection using Scribe and HDFS [Diagram labels: Web Servers, Scribe MidTier, Realtime Hadoop Cluster, Hadoop Hive Warehouse, Oracle RAC, MySQL, Hadoop Scribe Integration]
  • 6. Data Archive: Move old data to cheap storage [Diagram labels: Hadoop Warehouse, distcp, Hadoop Archival Cluster (20 TB per node), NFS, Hadoop Archive Node, Cheap NAS, Hive Query, HADOOP-5048]
  • 7. Hive User Interfaces » Hive shell access » Hive Web UI
  • 8. Data Analysis at Facebook »  Business Intelligence ›  Growth and monetization strategies ›  Product insights & decisions ›  Philosophy: build meta tools and provide easy access to data »  Artificial Intelligence ›  Recommendation & ranking products ›  Advertising optimization ›  Text analytics ›  Philosophy: model inference; data preparation; model building;
  • 9. BI: Build centralized reporting tools » Top-level site metrics » Bird's-eye view of user growth by country » Comparing certain metrics between user groups
  • 10. BI: Make ad hoc reporting easy » Example: "Find the number of status updates mentioning 'swine flu' per day last month" » SELECT a.date, count(1) FROM status_updates a WHERE a.status LIKE '%swine flu%' AND a.date >= '2009-05-01' AND a.date <= '2009-05-31' GROUP BY a.date
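
The query on slide 10 can be tried on a toy table. The following is a minimal sketch, assuming Python's built-in `sqlite3` module as a stand-in for Hive (the slides run the query in Hive itself). Only the table and column names come from the slide; the function name `daily_mentions`, the sample rows, and the added `ORDER BY` are illustrative assumptions.

```python
import sqlite3


def daily_mentions(rows, phrase, start, end):
    """Count status updates containing `phrase` per day within [start, end].

    `rows` is an iterable of (date, status) pairs with ISO-formatted dates,
    so plain string comparison orders them correctly.
    """
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
    conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
    conn.executemany("INSERT INTO status_updates VALUES (?, ?)", rows)
    query = """
        SELECT a.date, count(1)
        FROM status_updates a
        WHERE a.status LIKE ?
          AND a.date >= ? AND a.date <= ?
        GROUP BY a.date
        ORDER BY a.date  -- added so the toy output is deterministic
    """
    result = conn.execute(query, (f"%{phrase}%", start, end)).fetchall()
    conn.close()
    return result


# Invented sample data, for illustration only.
sample_rows = [
    ("2009-05-01", "worried about swine flu"),
    ("2009-05-01", "swine flu in the news again"),
    ("2009-05-02", "great weather today"),
    ("2009-05-03", "no swine flu here"),
    ("2009-06-01", "swine flu update"),  # outside the date range
]

if __name__ == "__main__":
    for day, count in daily_mentions(
        sample_rows, "swine flu", "2009-05-01", "2009-05-31"
    ):
        print(day, count)
```

The phrase and date bounds are passed as bound parameters rather than pasted into the SQL text, which avoids the quoting problems that curly quotes in a copied query would cause.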
  • 11. Build a site metric dashboard in a day » Data collection: › Define metrics and log format (Hive schema) › Add logging to the site (Scribe logging) › Create a Hive table partitioned by date › Set up a metric ETL cron job (Hive -> MySQL/Oracle) » Data visualization (using MySQL) » Data access (ad hoc query using Hive)
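
The data collection steps on slide 11 can be sketched end to end on a small scale. This is a rough illustration, not the pipeline described in the talk: the tab-separated log format (`date`, `user_id`, `event`), the "daily active users" metric, the `site_metrics` table, and all function names are assumptions, and an in-memory SQLite database stands in for the MySQL/Oracle reporting store.

```python
import sqlite3
from collections import defaultdict


def partition_by_date(log_lines):
    """Group raw log lines by date, mimicking a table partitioned by date."""
    partitions = defaultdict(list)
    for line in log_lines:
        ds, user_id, event = line.rstrip("\n").split("\t")
        partitions[ds].append((user_id, event))
    return partitions


def daily_active_users(partitions):
    """Compute one metric per partition: the number of distinct users."""
    return {ds: len({user for user, _ in rows}) for ds, rows in partitions.items()}


def load_metrics(conn, metric_name, values):
    """Load metric values into the reporting database (the ETL step).

    INSERT OR REPLACE keeps the job idempotent, so a cron rerun for the
    same date overwrites the row instead of duplicating it.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS site_metrics ("
        "ds TEXT, metric TEXT, value INTEGER, PRIMARY KEY (ds, metric))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO site_metrics VALUES (?, ?, ?)",
        [(ds, metric_name, value) for ds, value in sorted(values.items())],
    )
    conn.commit()
```

A dashboard would then read from `site_metrics` only, while ad hoc questions go back to the full date-partitioned logs.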
  • 12. Build Machine Learning Products on Hadoop/Hive •  Recommendation & ranking •  Advertising optimization •  Text analytics
  • 13. What applications the user may like »  Recommend apps based on social and demographic popularity »  User-app log is huge »  Joining user-app log with user demographics is difficult »  Hive for data aggregation
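
The aggregation behind slide 13 is a join of the user-app log with user demographics followed by a count per group. The sketch below shows that shape in plain Python, assuming small in-memory data; at Facebook's scale the slide says this is done in Hive. It covers demographic popularity only (the "social" signal is left out), and the function names and data layout are hypothetical.

```python
from collections import Counter, defaultdict


def popularity_by_group(user_app_log, demographics):
    """Join (user, app) log entries with demographics and count distinct
    users per app within each demographic group."""
    counts = defaultdict(Counter)
    seen = set()
    for user, app in user_app_log:
        if (user, app) in seen or user not in demographics:
            continue  # count each user once per app; skip unknown users
        seen.add((user, app))
        counts[demographics[user]][app] += 1
    return counts


def recommend(user, user_app_log, demographics, k=2):
    """Recommend the k most popular apps in the user's group that the
    user has not used yet."""
    counts = popularity_by_group(user_app_log, demographics)
    used = {app for u, app in user_app_log if u == user}
    ranked = counts[demographics[user]].most_common()
    return [app for app, _ in ranked if app not in used][:k]
```

For example, with users `u1`, `u2`, `u3` in one group, where all three use a chess app, two use a quiz app, and one uses a farm app, `recommend("u1", ...)` for a user who only has chess returns the quiz app first and the farm app second.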
  • 14. Who the user wants to connect with » Take existing edges and user feedback as labels » Build regression models based on user profile and local graph features » Too many friends of friends » Model trained by sampling » Hive for model inference » Hive for feature selection
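
One local graph feature that fits slide 14 is the number of mutual friends between a user and each friend-of-friend candidate. The sketch below generates candidates with that feature and draws a reproducible sample from them, since the slide notes there are too many friends of friends to train on all of them. The adjacency-set graph representation, the choice of mutual-friend count as the feature, and the function names are assumptions; the slides do not say which features or sampling scheme were used.

```python
import random
from collections import Counter


def friend_of_friend_candidates(graph, user):
    """Return a Counter mapping each friend-of-friend to the number of
    mutual friends shared with `user`.

    `graph` maps each user to the set of that user's friends.
    """
    friends = graph.get(user, set())
    mutual = Counter()
    for friend in friends:
        for fof in graph.get(friend, set()):
            if fof != user and fof not in friends:
                mutual[fof] += 1
    return mutual


def sample_candidates(candidates, n, seed=0):
    """Draw at most n candidates uniformly at random for model training."""
    pool = sorted(candidates)  # sort first so a fixed seed is reproducible
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))
```

Each sampled (user, candidate) pair would then become one training row, labeled by whether the edge exists or by user feedback, as the slide describes.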
  • 15. What users are talking about (Lexicon) » Market research & ad tool » Extract popular words from user content » Slice by age, gender, region » Sentiment analysis [Figure label: laid-off] » Keyword association » Hadoop used for text analytics [Figure: words associated with vodka]
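
The "extract popular words, slice by age, gender, region" step on slide 15 has the shape of a word count keyed by demographic slice. Here is a minimal map/reduce-style sketch in plain Python, assuming posts arrive as dictionaries with hypothetical fields `text`, `age_group`, `gender`, and `region`; the tokenizer and function names are also illustrative, and the real Lexicon pipeline on Hadoop is not described in the slides.

```python
import re
from collections import Counter, defaultdict


def map_post(post):
    """Map step: emit (slice_key, word) once per distinct word in a post."""
    slice_key = (post["age_group"], post["gender"], post["region"])
    for word in set(re.findall(r"[a-z']+", post["text"].lower())):
        yield slice_key, word


def reduce_counts(pairs):
    """Reduce step: count, per slice, how many posts mention each word."""
    counts = defaultdict(Counter)
    for slice_key, word in pairs:
        counts[slice_key][word] += 1
    return counts


def lexicon_counts(posts):
    """Run the map and reduce steps over an in-memory list of posts."""
    return reduce_counts(pair for post in posts for pair in map_post(post))
```

Counting each word once per post measures how many posts mention a word rather than how often it is repeated, which keeps one repetitive post from dominating a slice.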
  • 16. What ads the user might click on » Predict user-ad click-through » Ad click data is sparse, so sampling can miss information » Many ML algorithms are iterative and thus not easy to run on Hadoop » Hadoop for model training
  • 17. Build ensemble ML models on Hadoop » Each mapper trains a number of models » Each model output as an intermediate feature » Model selection at reducer » A regression model is built on selected features [Diagram labels: data shards ds1, ds2, ds3, ds4; train models locally; cross-test models locally; models assembled by ensemble methods; model inference in a second Hadoop job]
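
The mapper/reducer split on slide 17 can be imitated in a single process. The sketch below is a simplified stand-in, not the authors' method: each "mapper" trains one-feature threshold classifiers (decision stumps) on half of its shard and cross-tests them on the other half, and the "reducer" keeps the best-scoring models. The final step uses a majority vote where the slide builds a regression model on the selected features. All function names and the choice of stumps are assumptions.

```python
def train_stump(rows, feature):
    """Fit a threshold on one feature, halfway between the class means.

    `rows` is a list of (features, label) pairs with labels 0 or 1 and
    must contain at least one example of each class.
    """
    pos = [x[feature] for x, y in rows if y == 1]
    neg = [x[feature] for x, y in rows if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2
    sign = 1 if mean_pos >= mean_neg else -1  # which side predicts class 1
    return (feature, threshold, sign)


def predict(model, x):
    """Predict 0 or 1 for one feature vector with a single stump."""
    feature, threshold, sign = model
    return 1 if sign * (x[feature] - threshold) > 0 else 0


def accuracy(model, rows):
    """Fraction of rows the stump classifies correctly."""
    return sum(predict(model, x) == y for x, y in rows) / len(rows)


def mapper(shard, n_features):
    """Train one stump per feature on the first half of the shard and
    cross-test it locally on the second half."""
    split = len(shard) // 2
    train, test = shard[:split], shard[split:]
    models = (train_stump(train, f) for f in range(n_features))
    return [(accuracy(m, test), m) for m in models]


def reducer(scored_models, keep):
    """Model selection: keep the `keep` models with the best test accuracy."""
    ranked = sorted(scored_models, key=lambda sm: sm[0], reverse=True)
    return [model for _, model in ranked[:keep]]


def ensemble_predict(models, x):
    """Combine the selected models by majority vote (a simplification of
    the regression over model outputs described on the slide)."""
    votes = [predict(m, x) for m in models]
    return 1 if sum(votes) * 2 > len(votes) else 0
```

In a real Hadoop job each shard would be processed by a separate map task and the scored models shuffled to a reducer; here the same flow is `reducer([sm for shard in shards for sm in mapper(shard, n_features)], keep)`.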
  • 18. In summary »  Hadoop and Hive at Facebook »  Support product strategy and decision; »  Recommendation & ranking products; »  Advertising optimization; »  Text analytics tools; »  So Zuckerberg’s urgent questions are answered; »  So celebrities know where their fans are from; »  So we know one can like vodka and lemonade at the same time; »  It’s fun playing with the data; Dhruba Borthakur, Ding Zhou dhruba@, dzhou@
