Social Media, Happiness,
Petabytes and LOLs
Roddy Lindsay, Data Scientist, Facebook


June 1, 2009
Lots of data is generated on Facebook
▪   200 million active users
▪   More than 20 million users update their statuses at least once each day
▪   More than 850 million photos uploaded to the site each month
▪   More than 8 million videos uploaded each month
▪   More than 1 billion pieces of content (web links, news stories, blog
    posts, notes, photos, etc.) shared each week
▪   More than 2.5 million events created each month
▪   More than 25 million active user groups exist on the site
Lots of data is generated on Facebook
▪   Undoubtedly a very rich data set (and large...we’re talking petabytes)
▪   Many different groups clamoring for data:
    ▪   Internal analysts
    ▪   FB Engineers
    ▪   Advertisers
    ▪   Page owners
    ▪   Platform/Connect developers
    ▪   Marketers
    ▪   Academics
Challenges
▪   How can Facebook satisfy all the different consumers of data?
▪   What are the challenges?
    ▪   1. Infrastructure
    ▪   2. Infrastructure
    ▪   3. Infrastructure
Facebook’s Data Infrastructure
▪   Attempt 1: Oracle Data Warehouse (2005)
    ▪   Business analysts already familiar with tools, SQL
    ▪   Fast JOINs for data slicing ideal for dashboards (home-rolled in PHP)
        ▪   e.g. growth by country and demographic
    ▪   When growth took off (2007), ETL processes to load and roll-up data
        started taking a very long time
    ▪   A single machine (or even several machines) was not going to cut it
        much longer for data volumes at that scale...
Facebook’s Data Infrastructure
▪   Attempt 2: Hadoop (2007)
    ▪   Open-source framework for running Map-Reduce on a cluster of
        commodity machines, as well as a distributed file system for long-term
        storage
        ▪   Map-Reduce (invented at Google) provides a way to process large data sets
            that scales linearly with the number of machines in the cluster: if your
            data doubles in size, just buy twice as many computers
        ▪   Hadoop initially developed by Doug Cutting, now an Apache project led by
            the Grid Computing team at Yahoo!
    ▪   Much faster ETL when transform and load is distributed across a
        cluster
    ▪   Engineers able to write jobs in Java and Python
    ▪   Not a viable solution for analysts who can write SQL but not code
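The Map-Reduce model described above can be sketched on a single machine. This is a minimal illustration in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample records are assumptions made up for the example. On a real cluster the framework runs many map and reduce tasks in parallel and performs the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input records, one status update per record.
counts = run_job(["swine flu is scary", "no flu here", "Flu season"])
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across machines, which is where the near-linear scaling comes from.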
Facebook’s Data Infrastructure
▪   Attempt 3: Hive (2008)
    ▪   SQL-like query language, table partitioning schema, and metadata
        store built on top of Hadoop
    ▪   Developed at Facebook, now an Apache subproject
    ▪   Also includes:
        ▪   Web interface for constructing queries on the fly without using a shell
        ▪   Live support for query problems from the data team
        ▪   Easy integration with charts and dashboards
        ▪   One-click scheduling
        ▪   CSV/Excel export
Facebook’s Data Infrastructure
▪   Attempt 3: Hive (2008)
    ▪   Example: “Find the number of status updates mentioning ‘swine flu’
        per day last month”

    SELECT a.date, count(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date
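To try the shape of this query without a Hadoop cluster, the sketch below runs it against an in-memory SQLite database through Python's standard sqlite3 module. SQLite stands in for Hive here purely for illustration; the table contents are invented, an ORDER BY is added so the output order is fixed, and SQLite's LIKE is case-insensitive for ASCII text, which may differ from Hive's behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")

# Invented sample rows; the real table holds Facebook status updates.
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-04-30", "worried about swine flu"),      # outside date range
        ("2009-05-01", "swine flu is all over the news"),
        ("2009-05-01", "home sick, hope it's not swine flu"),
        ("2009-05-02", "beautiful day outside"),         # no match
        ("2009-05-03", "enough about swine flu already"),
    ],
)

# Same query as on the slide, with ORDER BY added for a stable result.
rows = conn.execute(
    """
    SELECT a.date, count(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date
    ORDER BY a.date
    """
).fetchall()
conn.close()
```

Storing dates as ISO-formatted strings (YYYY-MM-DD) is what makes the plain string comparison in the WHERE clause behave like a date range.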
Facebook’s Data Infrastructure
▪   Attempt 3: Hive (2008)
    ▪   Easily extendable to new operators
    ▪   Hypothetical example: “Find the sentiment of the ‘Terminator’ movie”

    FROM (
      FROM status_updates b
      SELECT SENTIMENT(b.status, 'terminator') AS sentiment
      WHERE b.status LIKE '%terminator%'
        AND b.date >= '2009-05-01' AND b.date <= '2009-05-31') a
    SELECT a.sentiment, count(1)
    GROUP BY a.sentiment
Facebook’s Data Infrastructure
▪   Attempt 3: Hive (2008)
    ▪   Successfully decentralized the querying and consumption of data
        across the company
    ▪   Instead of 10 dedicated data analysts, we trained a few hundred
    ▪   Everyone is able to answer 95% of his or her data questions with
        minimal training
    ▪   Dedicated data scientists, instead of working on an endless queue of
        ad-hoc requests, can spend their time performing complex analyses
        and building scalable systems on top of Hadoop/Hive
        ▪   Machine Learning systems
        ▪   Rich reporting for clients + Page owners
        ▪   Text analytics
Facebook text analytics
▪   Lexicon (Spring 2008)
    ▪   Started as an intern project to test Hadoop
    ▪   First external deployment of a Hadoop-powered system at Facebook
        (and one of the first anywhere)
    ▪   Simple idea: count the number of occurrences of words and bigrams
        on Facebook Walls per day, plot them on a line graph
“american idol”
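The counting idea behind Lexicon can be sketched in a few lines. This is a single-machine illustration, not the Hadoop job that powered the product: the whitespace tokenizer, the function names, and the sample Wall posts are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def terms(text):
    """Return the words and bigrams (adjacent word pairs) in one post."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def daily_counts(posts):
    """posts: iterable of (date, text). Returns {date: Counter of terms}."""
    counts = defaultdict(Counter)
    for day, text in posts:
        counts[day].update(terms(text))
    return counts


# Invented sample data.
posts = [
    ("2009-05-19", "watching american idol tonight"),
    ("2009-05-20", "american idol finale"),
    ("2009-05-20", "who won american idol"),
]
counts = daily_counts(posts)

# One (day, count) point per day: the series a line graph would plot.
series = [(day, counts[day]["american idol"]) for day in sorted(counts)]
```

Each post is processed independently and the per-day totals are simple sums, so the same logic maps naturally onto a Map-Reduce job keyed by (date, term).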
Facebook text analytics
▪   “New” Lexicon (Fall 2008), beta preview
    ▪   Leveraged Hive’s structured metadata and the raw computational
        power of a 600-node Hadoop cluster
        ▪   Slices by age, gender, region
        ▪   Sentiment analysis
        ▪   Common user interests
        ▪   Associations graph of similar keywords, with age and gender axes
Dashboard: “economy”
Demographics: “economy”
Map: “laid off”
Sentiment: “iron man” (blue) vs. “indiana jones” (yellow)
Associations: “marriage”
Associations: “vodka”
Facebook text analytics
▪   Hadoop and Hive make this all possible
▪   Consider “Associations” (similar words and phrases)
    ▪   Need to compare the co-occurrence of each term with every single
        other word and bigram, compared to baseline probability of
        occurrence (TF-IDF)... and keep demographic metadata around for fun
    ▪   Typical job generates several TB of data along the way
    ▪   Absolutely need a cluster of machines
▪   Distributed computation opens up the possibilities for text analytics
    algorithms!
▪   And.....the software is free!
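One way to read the "Associations" idea is as a ratio: how often a term appears alongside the query, relative to how often it appears at all. The sketch below computes that ratio (often called lift) over a handful of invented posts. It is an assumption about the general approach, not Facebook's actual scoring; the function name, the min_count threshold, and the word-level tokenization (no bigrams) are simplifications for the example.

```python
from collections import Counter


def associations(posts, query, min_count=2):
    """Rank terms by how much more often they appear alongside `query`
    than in the corpus overall (a simple lift ratio)."""
    tokenized = [set(p.lower().split()) for p in posts]
    with_query = [words for words in tokenized if query in words]
    if not with_query:
        return []

    baseline = Counter(t for words in tokenized for t in words)
    co = Counter(t for words in with_query for t in words if t != query)

    scores = {}
    for term, n in co.items():
        if n < min_count:  # skip terms seen too rarely to be meaningful
            continue
        p_given_query = n / len(with_query)
        p_baseline = baseline[term] / len(tokenized)
        scores[term] = p_given_query / p_baseline
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented sample data.
posts = [
    "vodka tonic tonight",
    "vodka tonic with friends tonight",
    "work tonight",
    "long day at work",
    "the vodka was a mistake",
    "tonight we work late",
]
ranked = associations(posts, "vodka")
```

A score above 1 means the term shows up with the query more than its baseline rate would predict. At Facebook's scale the co-occurrence table covers every term against every other term, which is why the intermediate data runs to terabytes.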
Text Analytics
▪   Text analytics is clearly useful in the “macro”:
    ▪   Big data sets
    ▪   Big compute clusters
    ▪   Big consumers (corporations)
▪   What about in the micro?
    ▪   Small data sets
        ▪   B, not PB
    ▪   Small consumers
        ▪   Individual people analyzing their own data
HappyFactor
▪   Facebook Application (personal project, not associated with Facebook)
▪   Idea: ask people privately how happy they are and what they are doing
▪   Uses random text messages to ensure a good sample and to collect data
    easily
▪   Provide users with trends on their happiness (by day, week, month, etc.)
    ▪   When are you happiest?
▪   Sift through the unstructured text to find patterns in behavior that
    correlate with happiness and unhappiness
    ▪   Which activities make you happiest?
    ▪   Which people in your life make you happiest?
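At this small scale the analysis fits in a few lines of Python. The sketch below is a guess at the kind of summary HappyFactor might produce, not its actual code: the reply format (date, score, free text), the function names, and the sample replies are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def by_weekday(replies):
    """Average happiness score for each day of the week."""
    buckets = defaultdict(list)
    for day, score, _ in replies:
        buckets[WEEKDAYS[day.weekday()]].append(score)
    return {weekday: mean(scores) for weekday, scores in buckets.items()}


def by_word(replies):
    """Average happiness score of the replies that mention each word."""
    buckets = defaultdict(list)
    for _, score, text in replies:
        for word in set(text.lower().split()):
            buckets[word].append(score)
    return {word: mean(scores) for word, scores in buckets.items()}


# Hypothetical text-message replies: (date, score 1-10, description).
replies = [
    (date(2009, 5, 29), 8, "dinner with sarah"),   # a Friday
    (date(2009, 5, 30), 9, "hiking with sarah"),   # a Saturday
    (date(2009, 6, 1), 4, "commuting"),            # a Monday
    (date(2009, 6, 1), 5, "meeting"),              # a Monday
]
weekday_avg = by_weekday(replies)
word_avg = by_word(replies)
```

With only a handful of replies per word these averages are noisy, so a real version would need a minimum sample size before reporting that an activity or person correlates with happiness.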
HappyFactor
▪   Just like corporations can learn about (and improve) themselves through
    text analytics....
▪   Why not humans?
On a scale from 1 to 10, how happy are
you right now? Reply with your score and
an optional description of what you are
doing.
In sum...
▪   Analyzing large data sets is a challenging problem that requires
    significant investment (both human and financial) in infrastructure
▪   We’re only now learning what we can do with Facebook data, since we
    developed the infrastructure to support it
▪   Distributed computation and structured metadata allow for a powerful
    new class of text analytics algorithms
▪   Text analytics has applications well beyond enterprise data-mining...
▪   ...could it potentially make the world a happier place?
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.

DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Text Analytics Summit 2009 - Roddy Lindsay - "Social Media, Happiness, Petabytes and LOLs"

  • 2. Social Media, Happiness, Petabytes and LOLs Roddy Lindsay, Data Scientist, Facebook June 1, 2009
  • 3. Lots of data is generated on Facebook ▪ 200 million active users ▪ More than 20 million users update their statuses at least once each day ▪ More than 850 million photos uploaded to the site each month ▪ More than 8 million videos uploaded each month ▪ More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week ▪ More than 2.5 million events created each month ▪ More than 25 million active user groups exist on the site
  • 4. Lots of data is generated on Facebook ▪ Undoubtedly a very rich data set (and large...we’re talking petabytes) ▪ Many different groups clamoring for data: ▪ Internal analysts ▪ FB Engineers ▪ Advertisers ▪ Page owners ▪ Platform/Connect developers ▪ Marketers ▪ Academics
  • 5. Challenges ▪ How can Facebook satisfy all the different consumers of data? ▪ What are the challenges? ▪ 1. Infrastructure ▪ 2. Infrastructure ▪ 3. Infrastructure
  • 6. Facebook’s Data Infrastructure ▪ Attempt 1: Oracle Data Warehouse (2005) ▪ Business analysts already familiar with tools, SQL ▪ Fast JOINs for data slicing ideal for dashboards (home-rolled in PHP) ▪ e.g. growth by country and demographic ▪ When growth took off (2007), ETL processes to load and roll up data started taking a very long time ▪ A single machine (or even several machines) was not going to cut it much longer for data volumes at that scale...
  • 7. Facebook’s Data Infrastructure ▪ Attempt 2: Hadoop (2007) ▪ Open-source framework for running Map-Reduce on a cluster of commodity machines, as well as a distributed file system for long-term storage ▪ Map-Reduce (invented at Google) provides a way to process large data sets that scales linearly with the number of machines in the cluster: if your data doubles in size, just buy twice as many computers ▪ Hadoop initially developed by Doug Cutting, now an Apache project led by the Grid Computing team at Yahoo! ▪ Much faster ETL when transform and load are distributed across a cluster ▪ Engineers able to write jobs in Java and Python ▪ Not a viable solution for analysts who can write SQL but not code
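
The Map-Reduce model on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are assumptions made for illustration. On a real cluster the map and reduce calls would run in parallel on different machines, which is where the linear scaling comes from.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence (hypothetical driver, one process)."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Made-up input records standing in for status updates.
word_counts = run_job(["swine flu is scary", "flu season again"])
print(word_counts)
```

Because each mapper only sees its own record and each reducer only sees one key, the work can be split across machines without any shared state.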
  • 8. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ SQL-like query language, table partitioning schema, and metadata store built on top of Hadoop ▪ Developed at Facebook, now an Apache subproject ▪ Also includes: ▪ Web interface for constructing queries on the fly without using a shell ▪ Live support for query problems from the data team ▪ Easy integration with charts and dashboards ▪ One-click scheduling ▪ CSV/Excel export
  • 9. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Example: “Find the number of status updates mentioning ‘swine flu’ per day last month”
        SELECT a.date, count(1)
        FROM status_updates a
        WHERE a.status LIKE '%swine flu%'
          AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
        GROUP BY a.date
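
To see what the query on this slide returns, here is a small sketch that runs it against a handful of made-up rows. It uses Python's built-in `sqlite3` module as a stand-in for Hive, and assumes a two-column `status_updates` table (`date`, `status`); the real table's schema is not given in the slides. An `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-05-01", "worried about swine flu"),
        ("2009-05-01", "swine flu shuts down my school"),
        ("2009-05-02", "beautiful day outside"),
        ("2009-05-02", "is swine flu overhyped?"),
        ("2009-06-01", "swine flu again"),  # outside the date range
    ],
)

daily_counts = conn.execute(
    """
    SELECT a.date, count(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date
    ORDER BY a.date
    """
).fetchall()

print(daily_counts)
```

Dates stored as `YYYY-MM-DD` strings compare correctly as text, which is why the range filter works without a date type.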
  • 10. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Easily extendable to new operators ▪ Hypothetical example: “Find the sentiment of the ‘Terminator’ movie”
        FROM (
          FROM status_updates b
          SELECT SENTIMENT(b.status, 'terminator') AS sentiment
          WHERE b.status LIKE '%terminator%'
            AND b.date >= '2009-05-01' AND b.date <= '2009-05-31') a
        SELECT a.sentiment, count(1)
        GROUP BY a.sentiment
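
The idea of extending the query language with a new operator can be sketched with SQLite's user-defined functions, again as a stand-in for Hive. Everything here is an assumption for illustration: the cue-word lists, the toy `sentiment` scorer, and the sample rows. SQLite does not accept Hive's FROM-first form, so the query is rewritten as an ordinary subquery with the same meaning.

```python
import sqlite3

# Tiny hand-picked cue-word lists (illustrative only, not a real sentiment model).
POSITIVE = {"loved", "great", "awesome"}
NEGATIVE = {"hated", "boring", "awful"}


def sentiment(status, topic):
    """Toy scorer: label a status mentioning `topic` by counting cue words."""
    if topic.lower() not in status.lower():
        return None
    words = set(status.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as SENTIMENT(status, topic).
conn.create_function("SENTIMENT", 2, sentiment)
conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-05-21", "loved the new terminator movie"),
        ("2009-05-22", "terminator was boring"),
        ("2009-05-23", "terminator was awesome"),
        ("2009-05-23", "seeing terminator tonight"),
    ],
)

sentiment_counts = dict(
    conn.execute(
        """
        SELECT a.sentiment, count(1)
        FROM (
            SELECT SENTIMENT(b.status, 'terminator') AS sentiment
            FROM status_updates b
            WHERE b.status LIKE '%terminator%'
              AND b.date >= '2009-05-01' AND b.date <= '2009-05-31'
        ) a
        GROUP BY a.sentiment
        """
    ).fetchall()
)

print(sentiment_counts)
```

The analyst writing the query only needs to know the operator's name and arguments; the scoring logic stays with whoever maintains the function.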
  • 11. Facebook’s Data Infrastructure ▪ Attempt 3: Hive (2008) ▪ Successfully decentralized the querying and consumption of data across the company ▪ Instead of 10 dedicated data analysts, we trained a few hundred ▪ Everyone is able to answer 95% of his or her data questions with minimal training ▪ Dedicated data scientists, instead of working on an endless queue of ad-hoc requests, can spend their time performing complex analyses and building scalable systems on top of Hadoop/Hive ▪ Machine Learning systems ▪ Rich reporting for clients + Page owners ▪ Text analytics
  • 12. Facebook text analytics ▪ Lexicon (Spring 2008) ▪ Started as an intern project to test Hadoop ▪ First external deployment of a Hadoop-powered system at Facebook (and one of the first anywhere) ▪ Simple idea: count the number of occurrences of words and bigrams on Facebook Walls per day, plot them on a line graph
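
The "simple idea" behind Lexicon, counting words and bigrams per day, can be sketched in a few lines. This is a single-machine illustration, not the Hadoop job itself; the helper names (`terms`, `daily_term_counts`), the whitespace tokenization, and the sample Wall posts are assumptions.

```python
from collections import Counter, defaultdict


def terms(post):
    """Return the words and bigrams of one post (naive whitespace tokenizer)."""
    words = post.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def daily_term_counts(posts):
    """Map each day to a Counter of term occurrences on that day."""
    counts = defaultdict(Counter)
    for day, text in posts:
        counts[day].update(terms(text))
    return counts


# Made-up (day, text) pairs standing in for Wall posts.
wall_posts = [
    ("2008-04-01", "happy birthday"),
    ("2008-04-01", "happy birthday dude"),
    ("2008-04-02", "party tonight"),
]

counts = daily_term_counts(wall_posts)
# One point per day: the series that would be plotted on the line graph.
series = [counts[day]["happy birthday"] for day in sorted(counts)]
print(series)
```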
  • 14. Facebook text analytics ▪ “New” Lexicon (Fall 2008), beta preview ▪ Leveraged Hive’s structured metadata and the raw computational power of a 600-node Hadoop cluster ▪ Slices by age, gender, region ▪ Sentiment analysis ▪ Common user interests ▪ Associations graph of similar keywords, with age and gender axes
  • 18. Sentiment: “iron man” (blue) vs. “indiana jones” (yellow)
  • 21. Facebook text analytics ▪ Hadoop and Hive make this all possible ▪ Consider “Associations” (similar words and phrases) ▪ Need to compare the co-occurrence of each term with every single other word and bigram, compared to baseline probability of occurrence (TF-IDF)... and keep demographic metadata around for fun ▪ Typical job generates several TB of data along the way ▪ Absolutely need a cluster of machines ▪ Distributed computation opens up the possibilities for text analytics algorithms! ▪ And... the software is free!
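
Comparing co-occurrence against a baseline probability can be sketched as follows. The slides do not give the exact scoring used for Associations, so this example swaps in a simple lift ratio, P(other | term) / P(other), rather than TF-IDF; the function name, the `min_count` threshold, and the sample posts are all assumptions.

```python
from collections import Counter


def associations(posts, term, min_count=2):
    """Rank terms by lift: how much more often they appear alongside `term`
    than their baseline frequency across all posts would predict."""
    doc_freq = Counter()  # number of posts containing each word
    co_freq = Counter()   # number of posts containing both `term` and the word
    for text in posts:
        words = set(text.lower().split())
        doc_freq.update(words)
        if term in words:
            co_freq.update(words - {term})

    total = len(posts)
    scores = {}
    for other, together in co_freq.items():
        if together < min_count:
            continue  # skip rare pairs, whose ratios are unreliable
        p_given_term = together / doc_freq[term]
        p_baseline = doc_freq[other] / total
        scores[other] = p_given_term / p_baseline
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up posts for illustration.
posts = [
    "the birthday party",
    "birthday party tonight",
    "the party was fun",
    "the exam was hard",
    "the game was long",
    "the bus was late",
]

ranked = associations(posts, "party")
print(ranked)
```

A lift above 1 means the word appears with the term more often than chance would suggest, while a common word like "the" scores below 1 even though it co-occurs just as often. Doing this for every term against every other word and bigram is what makes the real job generate terabytes of intermediate data.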
  • 22. Text Analytics ▪ Text analytics is clearly useful in the “macro”: ▪ Big data sets ▪ Big compute clusters ▪ Big consumers (corporations) ▪ What about in the micro? ▪ Small data sets ▪ B, not PB ▪ Small consumers ▪ Individual people analyzing their own data
  • 23. HappyFactor ▪ Facebook Application (personal project, not associated with Facebook) ▪ Idea: ask people privately how happy they are and what they are doing ▪ Uses random text messages to ensure a good sample and to collect data easily ▪ Provide users with trends on their happiness (by day, week, month, etc.) ▪ When are you happiest? ▪ Sift through the unstructured text to find patterns in behavior that correlate with happiness and unhappiness ▪ Which activities make you happiest? ▪ Which people in your life make you happiest?
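
Finding activities and people that correlate with happiness can be sketched as a per-word average compared with the person's overall average. HappyFactor's actual method is not described in the slides; the function name, the `min_mentions` cutoff, and the sample replies are assumptions.

```python
from collections import defaultdict
from statistics import mean


def word_happiness(replies, min_mentions=2):
    """For each word, return its mean score minus the overall mean score."""
    overall = mean(score for score, _ in replies)
    scores_by_word = defaultdict(list)
    for score, text in replies:
        for word in set(text.lower().split()):
            scores_by_word[word].append(score)
    return {
        word: mean(scores) - overall
        for word, scores in scores_by_word.items()
        if len(scores) >= min_mentions  # ignore words seen only once
    }


# Made-up (score, description) replies to the text-message prompt.
replies = [
    (9, "hiking with sam"),
    (8, "dinner with sam"),
    (3, "stuck in traffic"),
    (4, "traffic again"),
]

deltas = word_happiness(replies)
print(deltas)
```

Positive values mark words that show up in happier-than-usual replies and negative values the opposite. With so few data points per person this is only suggestive, which is why the sketch drops words mentioned once.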
  • 24. HappyFactor ▪ Just like corporations can learn about (and improve) themselves through text analytics.... ▪ Why not humans?
  • 25. On a scale from 1 to 10, how happy are you right now? Reply with your score and an optional description of what you are doing.
  • 30. In sum... ▪ Analyzing large data sets is a challenging problem that requires significant investment (both human and financial) in infrastructure ▪ We’re only now learning what we can do with Facebook data, having developed the infrastructure to support it ▪ Distributed computation and structured metadata allow for a powerful new class of text analytics algorithms ▪ Text analytics has applications well beyond enterprise data-mining... ▪ ...could it potentially make the world a happier place?
  • 31. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0