Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petabytes and LOLs"


Presentation at the 2009 Text Analytics Summit

Published in: Technology

  1. Social Media, Happiness, Petabytes and LOLs Roddy Lindsay, Data Scientist, Facebook June 1, 2009
  2. Lots of data is generated on Facebook ▪ 200 million active users ▪ More than 20 million users update their statuses at least once each day ▪ More than 850 million photos uploaded to the site each month ▪ More than 8 million videos uploaded each month ▪ More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week ▪ More than 2.5 million events created each month ▪ More than 25 million active user groups exist on the site
  3. Lots of data is generated on Facebook ▪ Undoubtedly a very rich data set (and large...we’re talking petabytes) ▪ Many different groups clamoring for data: ▪ Internal analysts ▪ FB Engineers ▪ Advertisers ▪ Page owners ▪ Platform/Connect developers ▪ Marketers ▪ Academics
  4. Challenges ▪ How can Facebook satisfy all the different consumers of data? ▪ What are the challenges? ▪ 1. Infrastructure ▪ 2. Infrastructure ▪ 3. Infrastructure
  5. Facebook’s Data Infrastructure ▪ Attempt 1: Oracle Data Warehouse (2005) ▪ Business analysts already familiar with tools, SQL ▪ Fast JOINs for data slicing ideal for dashboards (home-rolled in PHP) ▪ e.g. growth by country and demographic ▪ When growth took off (2007), ETL processes to load and roll up data started taking a very long time ▪ A single machine (or even several machines) was not going to cut it much longer for data volumes at that scale...
  6. Facebook’s Data Infrastructure ▪ Attempt 2: Hadoop (2007) ▪ Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage ▪ Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster: if your data doubles in size, just buy twice as many computers ▪ Hadoop initially developed by Doug Cutting, now an Apache project led by the Grid Computing team at Yahoo! ▪ Much faster ETL when transform and load are distributed across a cluster ▪ Engineers able to write jobs in Java and Python ▪ Not a viable solution for analysts who can write SQL but not code
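
The Map-Reduce model described on slide 6 can be sketched on a single machine. This is only an illustration of the map, shuffle, and reduce steps in plain Python, not Hadoop's API; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    return [(word, 1) for word in record.lower().split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a total."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over all records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


print(word_count(["lol lol", "LOL ok"]))
```

On a real cluster the map and reduce calls would run on different machines, each over its own slice of the data, which is where the scaling comes from.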
  7. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop ▪ Developed at Facebook, now an Apache subproject ▪ Also includes: ▪ Web interface for constructing queries on the fly without using a shell ▪ Live support for query problems from the data team ▪ Easy integration with charts and dashboards ▪ One-click scheduling ▪ CSV/Excel export
  8. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Example: “Find the number of status updates mentioning ‘swine flu’ per day last month” ▪ SELECT, count(1) ▪ FROM status_updates a ▪ WHERE a.status LIKE “%swine flu%” ▪ AND >= ‘2009-05-01’ AND <= ‘2009-05-31’ ▪ GROUP BY
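
The query on slide 8 filters and groups by a date column whose name does not survive in this transcript. As a rough sketch of the same idea, the following uses Python's standard-library sqlite3 module (not Hive) on a tiny in-memory table. The column name `ds`, the sample rows, and the helper name are assumptions for illustration only.

```python
import sqlite3


def count_mentions_per_day(rows, phrase, start, end):
    """Count status updates containing `phrase` per day within [start, end].

    `rows` is an iterable of (ds, status) pairs. `ds` is a hypothetical
    'YYYY-MM-DD' date column, not a name taken from the slides.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE status_updates (ds TEXT, status TEXT)")
    conn.executemany("INSERT INTO status_updates VALUES (?, ?)", rows)
    query = """
        SELECT a.ds, count(1)
        FROM status_updates a
        WHERE a.status LIKE ?
          AND a.ds >= ? AND a.ds <= ?
        GROUP BY a.ds
        ORDER BY a.ds
    """
    result = conn.execute(query, (f"%{phrase}%", start, end)).fetchall()
    conn.close()
    return result


# Made-up sample data, only to exercise the query.
sample = [
    ("2009-05-01", "worried about swine flu"),
    ("2009-05-01", "Swine flu is all over the news"),
    ("2009-05-02", "beautiful day outside"),
    ("2009-05-03", "no swine flu here"),
    ("2009-06-01", "swine flu again"),
]
print(count_mentions_per_day(sample, "swine flu", "2009-05-01", "2009-05-31"))
```

SQLite's LIKE is case-insensitive for ASCII text by default, so both spellings on 2009-05-01 are counted; Hive's matching rules may differ.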
  9. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Easily extendable to new operators ▪ Hypothetical example: “Find the sentiment of the ‘Terminator’ movie” ▪ FROM ( ▪ FROM status_updates b ▪ SELECT SENTIMENT(b.status, ‘terminator’) AS sentiment ▪ WHERE b.status LIKE “%terminator%” ▪ AND >= ‘2009-05-01’ AND <= ‘2009-05-31’) a ▪ SELECT a.sentiment, count(1) ▪ GROUP BY a.sentiment
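
The slides call the SENTIMENT operator hypothetical, so the following is only a sketch of what "extending the query language with a new operator" can look like. It registers a user-defined function with Python's sqlite3 module rather than Hive, and the word lists and scoring rule are toy assumptions, not a real sentiment model. The date filter is omitted for brevity.

```python
import sqlite3

# Toy word lists, assumed purely for illustration.
POSITIVE = {"great", "awesome", "loved"}
NEGATIVE = {"boring", "awful", "hated"}


def toy_sentiment(status, keyword):
    """Toy stand-in for SENTIMENT(status, keyword): label by word lists."""
    if keyword.lower() not in status.lower():
        return "none"
    words = set(status.lower().split())
    pos = len(words & POSITIVE)
    neg = len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def sentiment_counts(statuses, keyword):
    """Count statuses mentioning `keyword`, grouped by toy sentiment label."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL as a two-argument operator.
    conn.create_function("SENTIMENT", 2, toy_sentiment)
    conn.execute("CREATE TABLE status_updates (status TEXT)")
    conn.executemany(
        "INSERT INTO status_updates VALUES (?)", [(s,) for s in statuses]
    )
    query = """
        SELECT a.sentiment, count(1)
        FROM (
            SELECT SENTIMENT(b.status, ?) AS sentiment
            FROM status_updates b
            WHERE b.status LIKE ?
        ) a
        GROUP BY a.sentiment
        ORDER BY a.sentiment
    """
    result = conn.execute(query, (keyword, f"%{keyword}%")).fetchall()
    conn.close()
    return result


statuses = [
    "Terminator was awesome",
    "terminator was boring and awful",
    "saw terminator today",
    "loved the new terminator",
    "nice weather",
]
print(sentiment_counts(statuses, "terminator"))
```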
  10. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Successfully decentralized the querying and consumption of data across the company ▪ Instead of 10 dedicated data analysts, we trained a few hundred ▪ Everyone is able to answer 95% of his or her data questions with minimal training ▪ Dedicated data scientists, instead of working on an endless queue of ad-hoc requests, can spend their time performing complex analyses and building scalable systems on top of Hadoop/Hive ▪ Machine Learning systems ▪ Rich reporting for clients + Page owners ▪ Text analytics
  11. Facebook text analytics ▪ Lexicon (Spring 2008) ▪ Started as an intern project to test Hadoop ▪ First external deployment of a Hadoop-powered system at Facebook (and one of the first anywhere) ▪ Simple idea: count the number of occurrences of words and bigrams on Facebook Walls per day, plot them on a line graph
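
The "simple idea" behind Lexicon can be sketched in a few lines: count words and bigrams per day, then read off one term's daily counts as the series to plot. The tokenizer, function name, and sample posts below are assumptions for illustration; the real system ran on Hadoop over far larger data.

```python
import re
from collections import Counter, defaultdict


def count_terms_per_day(posts):
    """Return {day: Counter} of word and bigram counts.

    `posts` is an iterable of (day, text) pairs.
    """
    counts = defaultdict(Counter)
    for day, text in posts:
        words = re.findall(r"[a-z']+", text.lower())
        counts[day].update(words)
        # A bigram is each pair of adjacent words, joined by a space.
        counts[day].update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts


# Made-up sample posts.
posts = [
    ("day 1", "American Idol finale tonight"),
    ("day 1", "watching american idol"),
    ("day 2", "idol was great"),
]
counts = count_terms_per_day(posts)
# Daily series for one term, ready to plot as a line graph.
series = [(day, counts[day]["american idol"]) for day in sorted(counts)]
print(series)
```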
  12. “american idol”
  13. Facebook text analytics ▪ “New” Lexicon (Fall 2008), beta preview ▪ Leveraged Hive’s structured metadata and the raw computational power of a 600-node Hadoop cluster ▪ Slices by age, gender, region ▪ Sentiment analysis ▪ Common user interests ▪ Associations graph of similar keywords, with age and gender axes
  14. Dashboard: “economy”
  15. Demographics: “economy”
  16. Map: “laid off”
  17. Sentiment: “iron man” (blue) vs. “indiana jones” (yellow)
  18. Associations: “marriage”
  19. Associations: “vodka”
  20. Facebook text analytics ▪ Hadoop and Hive make this all possible ▪ Consider “Associations” (similar words and phrases) ▪ Need to compare the co-occurrence of each term with every single other word and bigram, compared to baseline probability of occurrence (TF-IDF)... and keep demographic metadata around for fun ▪ Typical job generates several TB of data along the way ▪ Absolutely need a cluster of machines ▪ Distributed computation opens up the possibilities for text analytics algorithms! ▪ And... the software is free!
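
The "Associations" idea on slide 20, comparing how often a term co-occurs with a keyword against its baseline rate, can be sketched as a simple ratio. This is an assumed scoring rule for illustration, not necessarily the one Lexicon used; it handles single words only, and the `min_count` threshold and sample posts are hypothetical.

```python
from collections import Counter


def associations(posts, keyword, min_count=2):
    """Rank terms by co-occurrence rate with `keyword` relative to baseline.

    score = P(term | post contains keyword) / P(term in any post)
    """
    tokenized = [set(post.lower().split()) for post in posts]
    baseline = Counter()      # number of posts containing each term
    with_keyword = Counter()  # same, restricted to posts with the keyword
    n_keyword = 0
    for terms in tokenized:
        baseline.update(terms)
        if keyword in terms:
            n_keyword += 1
            with_keyword.update(terms - {keyword})
    n_posts = len(tokenized)
    scores = {}
    for term, count in with_keyword.items():
        if count < min_count:
            continue  # skip rare terms, whose ratios are noisy
        scores[term] = (count / n_keyword) / (baseline[term] / n_posts)
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample posts.
posts = [
    "vodka and tonic tonight",
    "vodka tonic please",
    "coffee and work tonight",
    "work and more work",
    "vodka and friends",
]
print(associations(posts, "vodka"))
```

In this toy sample "tonic" outranks "and" because "and" is common everywhere, which is the point of dividing by the baseline. Doing this for every term against every other term is what makes the real job need a cluster.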
  21. Text Analytics ▪ Text analytics is clearly useful in the “macro”: ▪ Big data sets ▪ Big compute clusters ▪ Big consumers (corporations) ▪ What about in the micro? ▪ Small data sets ▪ B, not PB ▪ Small consumers ▪ Individual people analyzing their own data
  22. HappyFactor ▪ Facebook Application (personal project, not associated with Facebook) ▪ Idea: ask people privately how happy they are and what they are doing ▪ Uses random text messages to ensure a good sample and to collect data easily ▪ Provide users with trends on their happiness (by day, week, month, etc.) ▪ When are you happiest? ▪ Sift through the unstructured text to find patterns in behavior that correlate with happiness and unhappiness ▪ Which activities make you happiest? ▪ Which people in your life make you happiest?
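
One simple way to "find patterns that correlate with happiness" is to average the reported score over every reply that mentions a given word. The slides do not say how HappyFactor does this, so the following is an assumed approach with hypothetical names and made-up replies.

```python
from collections import defaultdict


def happiness_by_term(replies):
    """Return {term: mean happiness score of replies mentioning the term}.

    `replies` is an iterable of (score, description) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # term -> [sum of scores, count]
    for score, description in replies:
        # Use a set so a repeated word counts once per reply.
        for term in set(description.lower().split()):
            totals[term][0] += score
            totals[term][1] += 1
    return {term: total / n for term, (total, n) in totals.items()}


# Made-up sample replies.
replies = [
    (9, "hiking with anna"),
    (8, "dinner with anna"),
    (3, "commuting"),
    (4, "commuting alone"),
]
means = happiness_by_term(replies)
print(means["anna"], means["commuting"])
```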
  23. HappyFactor ▪ Just as corporations can learn about (and improve) themselves through text analytics... ▪ Why not humans?
  24. On a scale from 1 to 10, how happy are you right now? Reply with your score and an optional description of what you are doing.
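
A reply to the prompt on slide 24 has to be split into a score and an optional description. The parser below is a hypothetical sketch of that step; the accepted format (a leading integer from 1 to 10, then free text) is an assumption based only on the wording of the prompt.

```python
import re


def parse_reply(message):
    """Parse 'score [description]' into (score, description).

    Returns None when the message does not start with an integer
    between 1 and 10.
    """
    match = re.match(r"\s*(\d{1,2})\b\s*(.*)", message, re.DOTALL)
    if not match:
        return None
    score = int(match.group(1))
    if not 1 <= score <= 10:
        return None
    return score, match.group(2).strip()


print(parse_reply("8 having lunch with Sam"))
print(parse_reply("hello"))
```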
  25. In sum... ▪ Analyzing large data sets is a challenging problem that requires significant investment (both human and financial) in infrastructure ▪ We’re only now learning what we can do with Facebook data, having developed the infrastructure to support it ▪ Distributed computation and structured metadata allow for a powerful new class of text analytics algorithms ▪ Text analytics has applications well beyond enterprise data-mining... ▪ ...could it potentially make the world a happier place?
  26. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.