Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petabytes and LOLs"


Presentation at the 2009 Text Analytics Summit


  1. Social Media, Happiness, Petabytes and LOLs Roddy Lindsay, Data Scientist, Facebook June 1, 2009
  2. Lots of data is generated on Facebook ▪ 200 million active users ▪ More than 20 million users update their statuses at least once each day ▪ More than 850 million photos uploaded to the site each month ▪ More than 8 million videos uploaded each month ▪ More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week ▪ More than 2.5 million events created each month ▪ More than 25 million active user groups exist on the site
  3. Lots of data is generated on Facebook ▪ Undoubtedly a very rich data set (and large...we’re talking petabytes) ▪ Many different groups clamoring for data: ▪ Internal analysts ▪ FB Engineers ▪ Advertisers ▪ Page owners ▪ Platform/Connect developers ▪ Marketers ▪ Academics
  4. Challenges ▪ How can Facebook satisfy all the different consumers of data? ▪ What are the challenges? ▪ 1. Infrastructure ▪ 2. Infrastructure ▪ 3. Infrastructure
  5. Facebook’s Data Infrastructure ▪ Attempt 1: Oracle Data Warehouse (2005) ▪ Business analysts already familiar with tools, SQL ▪ Fast JOINs for data slicing ideal for dashboards (home-rolled in PHP) ▪ e.g. growth by country and demographic ▪ When growth took off (2007), ETL processes to load and roll up data started taking a very long time ▪ A single machine (or even several machines) was not going to cut it much longer for data volumes at that scale...
  6. Facebook’s Data Infrastructure ▪ Attempt 2: Hadoop (2007) ▪ Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage ▪ Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster... if your data doubles in size, just buy twice as many computers ▪ Hadoop initially developed by Doug Cutting, now an Apache project led by the Grid Computing team at Yahoo! ▪ Much faster ETL when transform and load is distributed across a cluster ▪ Engineers able to write jobs in Java and Python ▪ Not a viable solution for analysts who can write SQL but not code
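
The Map-Reduce model described on this slide can be sketched on a single machine. The example below is only an illustration of the map, shuffle, and reduce phases using a word count; it is not Hadoop code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and sample records are assumptions made for this sketch. On a real cluster the framework would run the map and reduce calls on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence (a cluster would parallelize these)."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input records, invented for illustration.
counts = run_job(["swine flu update", "flu season", "Update status"])
```

Because each record is mapped independently and each key is reduced independently, both phases can in principle be split across more machines as the data grows.
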
  7. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop ▪ Developed at Facebook, now an Apache subproject ▪ Also includes: ▪ Web interface for constructing queries on the fly without using a shell ▪ Live support for query problems from the data team ▪ Easy integration with charts and dashboards ▪ One-click scheduling ▪ CSV/Excel export
  8. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Example: “Find the number of status updates mentioning ‘swine flu’ per day last month” ▪ SELECT a.date, count(1) ▪ FROM status_updates a ▪ WHERE a.status LIKE '%swine flu%' ▪ AND a.date >= '2009-05-01' AND a.date <= '2009-05-31' ▪ GROUP BY a.date
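
To try the query from this slide without a Hive installation, one option is to run the same SQL against an in-memory SQLite table from Python. This is a sketch under assumptions: SQLite stands in for Hive, the `status_updates` rows are invented sample data, and an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

# SQLite is a stand-in here; the slide's query targets Hive on Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")

# Invented sample rows for illustration only.
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-05-01", "worried about swine flu"),
        ("2009-05-01", "swine flu closed my school"),
        ("2009-05-02", "great weather today"),
        ("2009-05-03", "is swine flu overhyped?"),
        ("2009-06-01", "swine flu again"),  # outside the date range
    ],
)

# Same shape as the slide's query: filter by phrase and month, count per day.
rows = conn.execute(
    """
    SELECT a.date, COUNT(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date
    ORDER BY a.date
    """
).fetchall()

for day, mentions in rows:
    print(day, mentions)
```
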
  9. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Easily extendable to new operators ▪ Hypothetical example: “Find the sentiment of the ‘Terminator’ movie” ▪ FROM ( ▪ FROM status_updates b ▪ SELECT SENTIMENT(b.status, 'terminator') AS sentiment ▪ WHERE b.status LIKE '%terminator%' ▪ AND b.date >= '2009-05-01' AND b.date <= '2009-05-31') a ▪ SELECT a.sentiment, count(1) ▪ GROUP BY a.sentiment
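
The idea of extending the query language with a new operator can be imitated with SQLite's user-defined functions. The sketch below registers a toy `SENTIMENT` function and runs a query shaped like the slide's hypothetical one. Everything here is an assumption for illustration: the cue-word lists, the scoring rule, and the sample rows are invented, SQLite replaces Hive, and the query is written in standard SELECT-first order rather than Hive's FROM-first form.

```python
import sqlite3

# Tiny invented cue-word lists; a real sentiment model would be far richer.
POSITIVE = {"great", "awesome", "loved"}
NEGATIVE = {"boring", "terrible", "hated"}


def sentiment(status, keyword):
    """Toy scorer: label a status that mentions keyword by counting cue words."""
    if keyword.lower() not in status.lower():
        return None
    words = set(status.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


conn = sqlite3.connect(":memory:")
# Register the Python function as a two-argument SQL function.
conn.create_function("SENTIMENT", 2, sentiment)
conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-05-22", "terminator was awesome"),
        ("2009-05-23", "loved the new terminator"),
        ("2009-05-23", "terminator was boring"),
        ("2009-05-24", "saw terminator tonight"),
        ("2009-05-25", "great day at the beach"),
    ],
)

rows = conn.execute(
    """
    SELECT a.sentiment, COUNT(1)
    FROM (
        SELECT SENTIMENT(b.status, 'terminator') AS sentiment
        FROM status_updates b
        WHERE b.status LIKE '%terminator%'
          AND b.date >= '2009-05-01' AND b.date <= '2009-05-31'
    ) a
    GROUP BY a.sentiment
    ORDER BY a.sentiment
    """
).fetchall()
```
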
  10. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Successfully decentralized the querying and consumption of data across the company ▪ Instead of 10 dedicated data analysts, we trained a few hundred ▪ Everyone is able to answer 95% of his or her data questions with minimal training ▪ Dedicated data scientists, instead of working on an endless queue of ad-hoc requests, can spend their time performing complex analyses and building scalable systems on top of Hadoop/Hive ▪ Machine Learning systems ▪ Rich reporting for clients + Page owners ▪ Text analytics
  11. Facebook text analytics ▪ Lexicon (Spring 2008) ▪ Started as an intern project to test Hadoop ▪ First external deployment of a Hadoop-powered system at Facebook (and one of the first anywhere) ▪ Simple idea: count the number of occurrences of words and bigrams on Facebook Walls per day, plot them on a line graph
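
The "simple idea" behind Lexicon, counting words and bigrams per day, might look like the following on a small scale. This is a single-machine sketch, not the Hadoop job itself; the whitespace tokenizer, the helper names, and the sample posts and dates are assumptions invented for illustration.

```python
from collections import Counter, defaultdict


def ngrams(text):
    """Return the words and adjacent word pairs (bigrams) in one post."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def daily_counts(posts):
    """Map each day to a Counter of word and bigram occurrences."""
    counts = defaultdict(Counter)
    for day, text in posts:
        counts[day].update(ngrams(text))
    return counts


# Invented sample Wall posts as (day, text) pairs.
posts = [
    ("2008-05-20", "watching american idol tonight"),
    ("2008-05-20", "american idol finale"),
    ("2008-05-21", "who won american idol"),
]

counts = daily_counts(posts)
# One point per day for a single term: the data behind a line graph.
series = [(day, counts[day]["american idol"]) for day in sorted(counts)]
```
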
  12. “american idol”
  13. Facebook text analytics ▪ “New” Lexicon (Fall 2008), beta preview ▪ Leveraged Hive’s structured metadata and the raw computational power of a 600-node Hadoop cluster ▪ Slices by age, gender, region ▪ Sentiment analysis ▪ Common user interests ▪ Associations graph of similar keywords, with age and gender axes
  14. Dashboard: “economy”
  15. Demographics: “economy”
  16. Map: “laid off”
  17. Sentiment: “iron man” (blue) vs. “indiana jones” (yellow)
  18. Associations: “marriage”
  19. Associations: “vodka”
  20. Facebook text analytics ▪ Hadoop and Hive make this all possible ▪ Consider “Associations” (similar words and phrases) ▪ Need to compare the co-occurrence of each term with every single other word and bigram, compared to baseline probability of occurrence (TF-IDF)... and keep demographic metadata around for fun ▪ Typical job generates several TB of data along the way ▪ Absolutely need a cluster of machines ▪ Distributed computation opens up the possibilities for text analytics algorithms! ▪ And... the software is free!
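
The "Associations" computation compares how often a term co-occurs with a target against how often that term appears overall. The slide names TF-IDF but gives no formula, so the sketch below substitutes a simple lift ratio, P(term | posts containing target) / P(term in any post), as a stand-in. The function names, the tokenizer, and the sample posts are assumptions; the production job ran on Hadoop and also carried demographic metadata, which is omitted here.

```python
from collections import Counter


def terms(text):
    """Return the set of distinct words and bigrams in one post."""
    words = text.lower().split()
    return set(words) | {" ".join(pair) for pair in zip(words, words[1:])}


def associations(posts, target, top=3):
    """Rank terms by lift: co-occurrence rate with target over baseline rate."""
    doc_terms = [terms(p) for p in posts]
    baseline = Counter(t for ts in doc_terms for t in ts)
    with_target = [ts for ts in doc_terms if target in ts]
    if not with_target:
        return []
    co = Counter(t for ts in with_target for t in ts if t != target)
    n, m = len(doc_terms), len(with_target)
    lift = {t: (c / m) / (baseline[t] / n) for t, c in co.items()}
    # Highest lift first; ties broken alphabetically for stable output.
    return sorted(lift.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Invented sample posts for illustration.
posts = ["vodka and tonic", "vodka and soda", "coffee and cake", "tea and cake"]
scores = dict(associations(posts, "vodka", top=10))
```

A common word such as "and" appears with the target but also everywhere else, so its lift stays at the baseline, while "tonic" scores higher because it only appears alongside "vodka". Comparing every term against every other term is what makes the full job so data-hungry.
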
  21. Text Analytics ▪ Text analytics is clearly useful in the “macro”: ▪ Big data sets ▪ Big compute clusters ▪ Big consumers (corporations) ▪ What about in the micro? ▪ Small data sets ▪ B, not PB ▪ Small consumers ▪ Individual people analyzing their own data
  22. HappyFactor ▪ Facebook Application (personal project, not associated with Facebook) ▪ Idea: ask people privately how happy they are and what they are doing ▪ Uses random text messages to ensure a good sample and to collect data easily ▪ Provide users with trends on their happiness (by day, week, month, etc.) ▪ When are you happiest? ▪ Sift through the unstructured text to find patterns in behavior that correlate with happiness and unhappiness ▪ Which activities make you happiest? ▪ Which people in your life make you happiest?
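
The two kinds of analysis listed on this slide, trends over time and patterns in the free text, can be sketched as simple averages. This is not HappyFactor's actual code; the entry format, the function names, and the sample entries are assumptions made for illustration, and averaging scores per word is only one naive way to look for activities or people associated with higher scores.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_by_weekday(entries):
    """Average happiness score for each day of the week."""
    by_day = defaultdict(list)
    for day, score, _ in entries:
        by_day[WEEKDAYS[day.weekday()]].append(score)
    return {name: mean(scores) for name, scores in by_day.items()}


def mean_by_word(entries):
    """Average happiness score of the entries in which each word appears."""
    by_word = defaultdict(list)
    for _, score, text in entries:
        for word in set(text.lower().split()):
            by_word[word].append(score)
    return {word: mean(scores) for word, scores in by_word.items()}


# Invented sample entries: (date, score from 1 to 10, description).
entries = [
    (date(2009, 6, 1), 4, "stuck in meetings"),
    (date(2009, 6, 6), 9, "hiking with sam"),
    (date(2009, 6, 8), 5, "meetings all day"),
    (date(2009, 6, 13), 8, "brunch with sam"),
]

weekday_trend = mean_by_weekday(entries)
word_trend = mean_by_word(entries)
```
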
  23. HappyFactor ▪ Just like corporations can learn about (and improve) themselves through text analytics.... ▪ Why not humans?
  24. On a scale from 1 to 10, how happy are you right now? Reply with your score and an optional description of what you are doing.
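
A reply to this prompt has to be split into a score and an optional description before it can be analyzed. The sketch below shows one way that could be done with a regular expression. The accepted reply format (score first, then free text) and the name `parse_reply` are assumptions; the slides do not describe how HappyFactor actually parses messages.

```python
import re

# A score from 1 to 10 at the start, optional punctuation, then free text.
REPLY = re.compile(r"^\s*(10|[1-9])\b[\s,.:;-]*(.*)$", re.DOTALL)


def parse_reply(message):
    """Return (score, description), or None if no valid score leads the reply."""
    match = REPLY.match(message)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


parsed = parse_reply("8 hiking with sam")
```

Replies that do not start with a number from 1 to 10, such as "hello" or "11 out of 10", return `None` so the application can ask again rather than store a bad score.
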
  25. In sum... ▪ Analyzing large data sets is a challenging problem that requires significant investment (both human and financial) in infrastructure ▪ We’re now just learning what we can do with Facebook data since we developed the infrastructure to support it ▪ Distributed computation and structured metadata allow for a powerful new class of text analytics algorithms ▪ Text analytics has applications well beyond enterprise data-mining... ▪ ...could it potentially make the world a happier place?
  26. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.