BUZZ FEEDER
FINDING OUT THE TRENDS BEHIND WHAT’S TRENDING
TEAM
➔Anurag Khaitan
➔Josh Erb
➔Walter Tyrna
CONTEXT
WHAT IS BUZZFEED?
“BuzzFeed is a cross-platform, global
network for news and entertainment that
generates seven billion views each month.
BuzzFeed creates and distributes content
for a global audience and utilizes
proprietary technology to continuously
test, learn and optimize.”
(buzzfeed website)
More than 7 billion monthly
global content views
More than 200M monthly
unique visitors to
BuzzFeed.com
11 international editions
including US, UK,
Germany, Espanol, France,
Spain, India, Canada,
Mexico, Brazil, Australia
and Japan
(buzzfeed website)
PROBLEM
• There is good money to be made from
consistently generating popular content on the
internet.
• A significant portion (20%-30%) of BuzzFeed’s
articles generates very little traffic.
HYPOTHESIS
We believe there may be a
correlation between the
content of the language
associated with an article (title,
description, tags, etc.) and how
likely it is to go viral.
We also believe that this
likelihood is tied to the country
in which an article goes viral
WHY DOES IT MATTER?
• 20%-30% of the articles we pulled
gained little traction
• BuzzFeed could hypothetically save
money and improve user experience by
letting the topics that consistently draw
readership inform its content
SOLUTION APPROACH
• Visualization to help identify underlying themes
in a given dataset through three lenses: the title,
the content of the article itself, or the tags
ascribed to it by the author.
• Title Generator to suggest topics and themes
based upon recent trends on social media, to
guide the editing staff in writing content that is
likely to generate significant online traffic.
• Given a sufficient number of articles in our data
and trending topics, we believe that the output
of a reasonable title generator can be fed into a
predictor to help assess its potential virality.
OUR EXPERIENCE
INGESTION
“You need to start pulling
data, like, now.”
- Ben Bengfort, 1st Day of Class
➔ Project required us to gather data from 5 separate
public APIs
➔ Before anything else, it was necessary to
automate the process of querying the APIs
➔ Set up an Ubuntu instance on Amazon Web
Services’ Elastic Compute Cloud (EC2)
➔ Ran a Python script hourly (crontab) to capture
.json files in a server-side WORM (write once, read
many) store: 5 calls/hour, one each for Australia,
Canada, India, UK and US
Data Collection began: May 18, 2016.
Data Collection ended: Aug 31, 2016
Total raw data size in WORM: 1.16GB.
Number of records pulled: 330,000
(25 articles/hr each for 5 countries for
100 days)
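The hourly capture step can be sketched roughly as follows. This is a minimal illustration, not the team's script: the edition codes, the file layout, and the `fetch` callable (a stand-in for the real API request, whose URL and payload shape the slides do not show) are all assumptions.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

COUNTRIES = ["au", "ca", "in", "uk", "us"]  # hypothetical edition codes


def store_snapshot(root, country, payload, when):
    """Write one API response to a timestamped file that is never overwritten."""
    folder = Path(root) / country
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / (when.strftime("%Y%m%dT%H") + ".json")
    # Mode "x" raises FileExistsError if the file already exists,
    # which gives the write-once behaviour expected of a WORM store.
    with open(path, "x", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return path


def run_hourly_pull(root, fetch, when=None):
    """Make one call per country; `fetch` stands in for the real API request."""
    when = when or datetime.now(timezone.utc)
    return [store_snapshot(root, c, fetch(c), when) for c in COUNTRIES]


def fake_fetch(country):
    # Placeholder response; the real payload shape is not shown in the slides.
    return {"country": country, "buzzes": []}


demo_root = tempfile.mkdtemp()
demo_time = datetime(2016, 5, 18, 12, tzinfo=timezone.utc)
written = run_hourly_pull(demo_root, fake_fetch, demo_time)
```

A crontab entry of the form `0 * * * * python3 /path/to/pull.py` (path hypothetical) would run such a script at the top of every hour.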
ARCHITECTURE
WRANGLING
➔ Clean Raw Data
◆ Remove tags, images and other content outside the scope of our analysis
◆ Used insight from this to drop irrelevant variables and identify gaps that
could be accounted for
➔ Understand Target Variable (Measure of Virality)
◆ A frequency column to understand how each article was “persisting”, as a
measure of virality
◆ Understand the accuracy and applicability of Number of Impressions
provided in the data
➔ Capture all Instances, Features and Target Variables in Postgres Table to use
downstream in the pipeline
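One way to picture the frequency column is as a count of the hourly snapshots in which an article appears for a given country. The sketch below assumes each snapshot row reduces to a `(country, article_id)` pair; those field names and the sample ids are hypothetical.

```python
from collections import Counter

# Each tuple is one appearance in an hourly snapshot: (country, article_id).
# The field names and ids are hypothetical.
snapshots = [
    ("us", "a1"), ("us", "a1"), ("us", "a1"),
    ("us", "a2"),
    ("uk", "a1"), ("uk", "a1"),
]


def frequency_by_article(rows):
    """Count the hourly snapshots in which each (country, article) pair appears."""
    return Counter(rows)


frequency = frequency_by_article(snapshots)
```

Counting per country rather than globally keeps the measure tied to each edition's page, which matches the per-country framing of the hypothesis.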
WHAT DOES THE DATA LOOK LIKE?
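[Pie chart: share of records by country. Legend: Australia, Canada, India, UK, US. Slice labels as extracted: 9%, 5%, 7%, 17%, 62%. The pairing of slice labels to countries is not recoverable from the text.]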
ANALYSIS
➔ Word Clouds
◆ What terms “jump out”?
➔ Natural Language Toolkit
◆ What sorts of analysis can we run
on our textual data?
➔ scikit-learn
◆ What can Machine Learning
models help us predict?
TOP TERMS
Tags: Australia
1. game
2. thrones
3. australia
4. season
5. 6
6. fan
7. twitter
8. quiz
9. stark
10. hot
Canada
1. canada
2. canadian
3. news
4. social
5. quiz
6. animals
7. twitter
8. funny
9. lol
10. food
India
1. social
2. news
3. india
4. bollywood
5. indian
6. twitter
7. desi
8. khan
9. stories
10. women
UK
1. quiz
2. british
3. uk
4. food
5. trivia
6. twitter
7. you
8. funny
9. celebrity
10. 00s
US
1. test
2. quiz
3. food
4. recipes
5. you
6. funny
7. news
8. social
9. summer
10. music
● The United States, United Kingdom, and Canada share the most similar top tags (as well as titles)
while Australia and India have more distinct preferences.
● Articles about Game of Thrones - and television in general - fare better in BuzzFeed Australia
● “Women/woman” only appears on the top list for India, perhaps reflective of readership
● Twitter does well across all five groups - evidence of the popularity of listicles (“27 Times Mindy
Kaling Was Just Too Relatable On Twitter”)
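Top-term lists like the ones above can be produced with a simple per-country word count over the tags. The following is a sketch under assumptions: the record layout `(country, tags)` and the sample tags are invented for illustration, and no stop-word filtering is applied.

```python
from collections import Counter, defaultdict


def top_terms(records, n=10):
    """Return the n most frequent lower-cased tag words for each country."""
    counts = defaultdict(Counter)
    for country, tags in records:
        for tag in tags:
            counts[country].update(tag.lower().split())
    return {c: [w for w, _ in ctr.most_common(n)] for c, ctr in counts.items()}


# Hypothetical records: (country, list of tag phrases).
records = [
    ("Australia", ["Game of Thrones", "quiz"]),
    ("Australia", ["game of thrones", "season 6"]),
    ("India", ["Bollywood", "quiz"]),
]
ranked = top_terms(records, n=2)
```

Splitting tag phrases into words is what lets "game" and "thrones" appear as separate entries, as they do in the Australia list.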
WORDCLOUDS
Tags
WORDCLOUDS
Titles
TITLE GENERATOR
• Generated a corpus of all the unique
titles from API pulls
• Natural Language Toolkit: Trigram
Collocation Finder & Trigram Association Measures
• Picked the most likely subsequent words
using likelihood ratios
• Introduced minor stochasticity to
keep it from always producing the same
titles
• Notable Examples:
– “Canada Goose Is Most Calories”
– “You More Hilary Duff or Lohan?”
– “What Game of Thrones Fan if You
Guess We Thrones”
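The generation loop can be illustrated with a dependency-free stand-in. Note the swap: instead of NLTK's likelihood-ratio scoring, this sketch weights candidate next words by raw trigram counts, and it gets its randomness from weighted sampling. The sample titles are invented.

```python
import random
from collections import Counter, defaultdict


def build_trigram_model(titles):
    """Map each pair of consecutive words to a Counter of the words that follow."""
    model = defaultdict(Counter)
    for title in titles:
        words = title.split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            model[(w1, w2)][w3] += 1
    return model


def generate_title(model, start, max_words=8, rng=None):
    """Extend a two-word start by sampling followers in proportion to their counts."""
    rng = rng or random.Random()
    words = list(start)
    while len(words) < max_words:
        followers = model.get((words[-2], words[-1]))
        if not followers:
            break  # no observed continuation for this word pair
        choices, weights = zip(*followers.items())
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)


# Hypothetical titles standing in for the corpus of unique titles.
titles = [
    "Which Game Of Thrones Character Are You",
    "Which Game Of Thrones House Are You In",
    "Which Disney Character Are You",
]
model = build_trigram_model(titles)
title = generate_title(model, ("Which", "Game"), rng=random.Random(0))
```

Because each step only looks at the previous two words, output can drift into ungrammatical mixes of several source titles, which is consistent with the notable examples on the slide.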
FEATURE SELECTION WHAT FEATURES ARE THE
MOST TELLING - HYPOTHESIS
CATEGORY: SOME SIGNAL
There are 140+ categories on
BuzzFeed. Is there a relationship
between the categories and
virality?
METAVALUE: TOO BROAD - NO
SIGNAL
How many keywords are there?
What is the relationship between
virality and certain keywords?
➔ Each “Buzz” had 36 data points
◆ Some of these data points were standardized
◆ Some of them were not
➔ A significant number of these data points did
not contain any signal
➔ Other than category, the only fields that
contained signal held text/words that also
appear in the article:
◆ Description, Title, Primary Keywords
◆ Tags, containing phrases and words
TARGET
MEASURE OF VIRALITY
IMPRESSIONS
Number of times an article is
viewed
FREQUENCY
Number of hours an article stays
on a country’s BuzzFeed page.
➔ Impressions: Inaccurate and aggregated
measure in the snapshot
➔ Frequency: Another measure but not always
aligned with the corresponding impression
provided in the instance
➔ Some f(Impressions, Frequency) worked
➔ Needed to use the function to identify classes
➔ Log Transformation to account for wide
variability and skewed distribution as follows:
Virality = Log (Impressions * Frequency)
Non-Viral: Virality < mean - standard deviation
Viral: Virality >= mean - standard deviation
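The target definition above can be worked through on a handful of made-up articles. Two choices here are assumptions, since the slides do not specify them: the natural logarithm is used for "Log", and the population standard deviation is used for the cut-off. Impressions and frequency are assumed to be strictly positive.

```python
import math
import statistics


def label_virality(articles):
    """Return {article_id: (virality, label)} using the mean - 1 SD cut-off."""
    scores = {aid: math.log(imp * freq) for aid, (imp, freq) in articles.items()}
    values = list(scores.values())
    cutoff = statistics.mean(values) - statistics.pstdev(values)
    return {
        aid: (s, "viral" if s >= cutoff else "non-viral")
        for aid, s in scores.items()
    }


# Hypothetical (impressions, frequency-in-hours) pairs.
articles = {
    "a1": (120_000, 30),
    "a2": (80_000, 12),
    "a3": (95_000, 20),
    "a4": (150, 1),
}
labels = label_virality(articles)
```

With this cut-off only articles well below the average are labelled non-viral, so the two classes are expected to be unbalanced.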
FEATURE ENGINEERING
ATTEMPTED OBVIOUS ONES
STOP WORDS OR COMMON
WORDS COULD HAVE HELPED
➔ Title Length: Fairly constant and not a good indicator.
➔ Lists vs. Non-Lists: Contrary to our hypothesis, no such
correlation in the data.
➔ Words in tags: To retain the context in the tags, we
used individual phrases, as provided (simulated n-grams),
and individual words (1-grams).
➔ Low Document Frequency: No positive impact on
predictability.
➔ High Document Frequency: Negative impact on the
predictability of the model.
➔ Stop Words OR Common Words: Did not attempt it
due to time constraints.
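The hand-built features listed above could look something like this. It is a sketch only: treating a title that opens with a number as a "list", and joining a tag phrase with underscores to keep it as one token, are both assumptions rather than the team's documented choices.

```python
import re


def title_features(title):
    """Title length in words, plus a flag for titles that open with a number."""
    words = title.split()
    return {
        "title_length": len(words),
        "is_list": bool(re.match(r"\d+\b", title.strip())),
    }


def tag_tokens(tags):
    """Keep each tag phrase whole and also emit its individual words."""
    tokens = []
    for tag in tags:
        phrase = tag.lower().strip()
        tokens.append(phrase.replace(" ", "_"))  # whole phrase as one token
        parts = phrase.split()
        if len(parts) > 1:
            tokens.extend(parts)  # individual words (1-grams)
    return tokens


features = title_features("27 Times Mindy Kaling Was Just Too Relatable On Twitter")
tokens = tag_tokens(["Game of Thrones", "quiz"])
```

Emitting both the whole phrase and its words lets a downstream vectorizer see "game_of_thrones" as one feature while still counting "thrones" on its own.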
MODELING WITH SCIKIT-LEARN
Multinomial Naive Bayes and Logistic Regression, as follows:
Feature Selection: For each instance, we used all the text contained in Title, Description, Category,
Primary Keywords and Phrases in Tags.
Document Frequency: Maximum and minimum document frequency, in increments of 10%...No Impact
vect = CountVectorizer()
Output Number of Features in vect: 70,000 or more features
Model Selection: For both models, we did 12-fold cross-validation as follows:
skf = StratifiedKFold(y, n_folds=12, shuffle=True)
for train, test in skf: …
Another cross-validation for both Multinomial NB and Logistic Regression as follows:
cross_val_score(pipe,X,y,cv=12,scoring='accuracy').mean()
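The calls on the slide use the scikit-learn API as it stood in 2016. In current releases `StratifiedKFold` lives in `sklearn.model_selection`, takes `n_splits`, and is iterated through `.split(X, y)`. A minimal sketch of the same workflow with that API follows; the toy documents and labels are invented, and 3 folds are used instead of 12 only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy documents standing in for the concatenated text fields; labels are invented.
X = [
    "quiz which food are you", "quiz game thrones fan", "funny twitter reactions",
    "food recipes summer quiz", "celebrity trivia quiz", "funny animals lol",
    "local council budget news", "quarterly market report", "transport policy update",
    "regional weather summary", "committee meeting minutes", "annual tax filing guide",
]
y = [1] * 6 + [0] * 6  # 1 = viral, 0 = non-viral

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

scores = {}
for name, clf in [
    ("nb", MultinomialNB()),
    ("logreg", LogisticRegression(max_iter=1000)),
]:
    # Vectorizing inside the pipeline keeps each fold's vocabulary
    # restricted to that fold's training documents.
    pipe = make_pipeline(CountVectorizer(), clf)
    scores[name] = cross_val_score(pipe, X, y, cv=skf, scoring="accuracy").mean()
```

Passing the `StratifiedKFold` object as `cv` makes the shuffling and fold count explicit; an integer `cv` also gives stratified folds for classifiers, but without shuffling.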
MODEL RESULTS
Metric       Multinomial NB   Logistic Regression
Accuracy     0.839620         0.865165
AUC          0.699976         0.677515
F1           0.904905         0.922518
Precision    0.908419         0.898182
Recall       0.901438         0.948231
CROSS VALIDATION
ACCURACY SCORES
Multinomial Naive Bayes:
0.840168
Logistic Regression:
0.864645
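A useful reference point for these accuracy scores is the share of articles that the target definition labels viral. If the log-virality values were roughly normally distributed (an assumption the slides do not confirm), the share at or above mean minus one standard deviation can be computed directly:

```python
from statistics import NormalDist

# Share of a normal distribution lying at or above (mean - 1 standard deviation).
viral_share = 1 - NormalDist().cdf(-1)
```

Under that assumption about 84% of articles would be labelled viral, so a classifier that always predicts "viral" would score about 0.84 accuracy. Read that way, the AUC figures may say more about how well the models separate the classes than the accuracy figures do.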
TOOLS
NLTK
Word Cloud
WHAT COULD BE
DONE BETTER?
ROOM FOR IMPROVEMENT
• BuzzFeed’s public API does not share the whole
story; include data points from other sources
• Limiting our focus to English-speaking countries limited
our ability to see the impact of cultural context outside of
the US content engine’s orbit.
• With more time, we might apply a better methodology
to the Title Generator
• With more time, we might stand up the user-facing web
application and capture user data to improve the
model and generate better recommendations
QUESTIONS?

Georgetown Data Science - Team BuzzFeed
