BUZZ FEEDER
FINDING OUT THE TRENDS BEHIND WHAT’S TRENDING
TEAM
➔ Anurag Khaitan
➔ Josh Erb
➔ Walter Tyrna
CONTEXT
WHAT IS BUZZFEED?
“BuzzFeed is a cross-platform, global
network for news and entertainment that
generates seven billion views each month.
BuzzFeed creates and distributes content
for a global audience and utilizes
proprietary technology to continuously
test, learn and optimize.”
(buzzfeed website)
● More than 7 billion
monthly global content
views
● More than 200M monthly
unique visitors to
BuzzFeed.com
● 11 international editions
including US, UK,
Germany, Espanol, France,
Spain, India, Canada,
Mexico, Brazil, Australia
and Japan
(buzzfeed website)
PROBLEM
➔ There is good money to be made from
consistently generating popular content on the
internet.
➔ A significant portion (20%-30%) of BuzzFeed’s
articles generate very little traffic.
Hypothesis
We believe there may be a
correlation between the
content of the language
associated with an article (title,
description, tags, etc.) and how
likely it is to go viral.
We also believe that this
likelihood is tied to the country
in which an article goes viral
WHY DOES IT MATTER?
➔ BuzzFeed could hypothetically save
money and improve user experience by
letting the topics that consistently
draw readership inform its content.
OUR APPROACH
➔ Visualization to help identify underlying themes
in a given dataset through three lenses: the title,
the content of the article itself, and the tags
ascribed to it by the author.
➔ Title Generator to suggest topics and themes
based upon recent trends on social media, to
guide the editing staff in writing content that is
likely to generate significant online traffic.
➔ Given a sufficient number of articles in our data
and trending topics, we believe that the output
of a reasonable title generator can be fed into a
predictor to help assess its potential virality.
OUR EXPERIENCE
INGESTION
“You need to start pulling
data, like, now.”
- Ben Bengfort, 1st Day of Class
➔ Project required us to gather data from 5 separate
public APIs
➔ Before anything else, it was necessary to
automate the process of querying the APIs
➔ Set up an Ubuntu instance on Amazon Web
Services’ Elastic Compute Cloud (EC2)
➔ Ran a Python script hourly (crontab) to capture
.json files in a server-side WORM (write-once,
read-many) store: 5 calls/hour, one each for
Australia, Canada, India, UK and US
Data Collection began: May 18, 2016.
Data Collection ended: Aug 31, 2016
Total raw data size in WORM: 1.16GB.
Number of records pulled: 330,000
(25 articles/hr each for 5 countries for
100 days)
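The hourly capture step above might be sketched as follows. This is a minimal illustration, not the team's script: the edition codes, the file-naming scheme, and the `save_snapshot` helper are all assumptions, and the API request itself is left out because the slides do not give the endpoint.

```python
import json
import os
from datetime import datetime, timezone

# Hypothetical edition codes; the slides only name the five countries.
COUNTRIES = ["au", "ca", "in", "uk", "us"]


def save_snapshot(payload, country, root, now=None):
    """Write one API response to <root>/<country>/<UTC timestamp>.json.

    Opening the file in "x" mode refuses to overwrite an existing
    snapshot, which approximates write-once (WORM) behaviour.
    """
    now = now or datetime.now(timezone.utc)
    folder = os.path.join(root, country)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, now.strftime("%Y%m%dT%H%M%SZ") + ".json")
    with open(path, "x", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return path
```

A crontab entry that runs once an hour would call a fetch function for each country and pass the decoded JSON to `save_snapshot`.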
ARCHITECTURE
WRANGLING
➔ Clean Raw Data
◆ Remove tags, images and other content outside the scope of our analysis
◆ Used insight from this to drop irrelevant variables and identify gaps that
could be accounted for
➔ Understand Target Variable (Measure of Virality)
◆ A frequency column to understand how each article was “persisting”, as a
measure of virality
◆ Understand the accuracy and applicability of Number of Impressions
provided in the data
➔ Capture all Instances, Features and Target Variables in a Postgres table to use
downstream in the pipeline
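One way to derive the frequency column described above is to count the hourly snapshots in which each article appears. The sketch below assumes each snapshot has already been reduced to a country plus a list of article ids; the `count_frequency` name and that input shape are illustrative, not taken from the project.

```python
from collections import Counter


def count_frequency(snapshots):
    """Count, per (country, article id), how many hourly pulls contained it.

    `snapshots` is an iterable of (country, article_ids) pairs, one per
    hourly API call. Duplicate ids within one pull are counted once.
    """
    frequency = Counter()
    for country, article_ids in snapshots:
        for article_id in set(article_ids):
            frequency[(country, article_id)] += 1
    return frequency
```

An article that stays on a country's page for three consecutive pulls would get a frequency of 3 for that country.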
WHAT DOES THE DATA LOOK LIKE?
[Pie chart: share of records by country. Legend order: Australia, Canada, India, UK, US. Slice labels as extracted: 9%, 5%, 7%, 17%, 62%.]
ANALYSIS
➔ Word Clouds
◆ What terms “jump out”?
➔ Natural Language Toolkit
◆ What sorts of analysis can we run
on our textual data?
➔ Sci-Kit Learn
◆ What can Machine Learning
models help us predict?
TOP TERMS
Tags
Rank  Australia  Canada    India      UK         US
1     game       canada    social     quiz       test
2     thrones    canadian  news       british    quiz
3     australia  news      india      uk         food
4     season     social    bollywood  food       recipes
5     6          quiz      indian     trivia     you
6     fan        animals   twitter    twitter    funny
7     twitter    twitter   desi       you        news
8     quiz       funny     khan       funny      social
9     stark      lol       stories    celebrity  summer
10    hot        food      women      00s        music
● The United States, United Kingdom, and Canada share the most similar top tags (as well as titles)
while Australia and India have more distinct preferences.
● Articles about Game of Thrones - and television in general - fare better in BuzzFeed Australia
● “Women/woman” only appears on the top list for India, perhaps reflective of readership
● Twitter does well across all five groups - evidence of the popularity of listicles (“27 Times Mindy
Kaling Was Just Too Relatable On Twitter”)
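A per-country top-terms list like the one above can be produced with a simple word count over the tags. This is a rough sketch under assumed inputs: records are (country, list of tag strings) pairs, tags are lowercased and split on whitespace, and no stop-word filtering is applied. The slides do not say how the original tokenization was done.

```python
from collections import Counter, defaultdict


def top_terms(records, n=10):
    """Return the n most frequent tag words for each country."""
    counts = defaultdict(Counter)
    for country, tags in records:
        for tag in tags:
            # A multi-word tag such as "Game of Thrones" contributes each word.
            counts[country].update(tag.lower().split())
    return {
        country: [term for term, _ in counter.most_common(n)]
        for country, counter in counts.items()
    }
```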
WORDCLOUDS
[Word clouds of tags, one panel per country: Australia, Canada, India, United Kingdom, United States]
[Word clouds of titles, one panel per country: Australia, Canada, India, United Kingdom, United States]
TITLE GENERATOR
➔ Generated a corpus of all the unique
titles from API pulls
➔ Natural Language Toolkit:
TrigramCollocationFinder & TrigramAssocMeasures
➔ Picked the most likely subsequent words
using likelihood ratios
➔ Introduced minor stochasticity to
keep it from always producing the same
titles
➔ Notable Examples:
◆ “Canada Goose Is Most Calories”
◆ “You More Hilary Duff or Lohan?”
◆ “What Game of Thrones Fan if You
Guess We Thrones”
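The generator idea can be illustrated with a small standard-library sketch. Note the substitution: instead of NLTK's trigram collocation finder and likelihood-ratio scoring, this version ranks followers by raw trigram counts, and the "minor stochasticity" is modelled as weighted sampling among the top few candidates. The function names, the `top_k` parameter, and the start/end markers are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def build_trigram_model(titles):
    """Map each (w1, w2) pair to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for title in titles:
        words = ["<s>", "<s>"] + title.lower().split() + ["</s>"]
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            model[(w1, w2)][w3] += 1
    return model


def generate_title(model, rng, top_k=3, max_words=12):
    """Walk the model from the start marker, sampling among the top_k followers."""
    w1, w2 = "<s>", "<s>"
    out = []
    while len(out) < max_words:
        followers = model.get((w1, w2))
        if not followers:
            break
        words, weights = zip(*followers.most_common(top_k))
        nxt = rng.choices(words, weights=weights, k=1)[0]
        if nxt == "</s>":
            break
        out.append(nxt)
        w1, w2 = w2, nxt
    return " ".join(out).title()
```

Passing a seeded `random.Random` makes a run repeatable, while a fresh generator gives different titles each time. With `top_k=1` the walk becomes deterministic, which is the behaviour the stochastic step is meant to avoid.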
FEATURE SELECTION WHAT FEATURES ARE THE
MOST TELLING - HYPOTHESIS
CATEGORY: SOME SIGNAL
There are 140+ categories on
BuzzFeed. Is there a relationship
between the categories and
virality?
METAVALUE: TOO BROAD - NO
SIGNAL
How many keywords are there?
What is the relationship between
virality and certain keywords?
➔ Each “Buzz” had 36 data points
◆ Some of these data points were standardized
◆ Some of them were not
➔ A significant number of these data points did
not contain any signal
➔ Other than category, the only fields that
contained signal were those holding text from
the article:
◆ Description, Title, Primary Keywords
◆ Tags, containing phrases and words
TARGET
MEASURE OF VIRALITY
IMPRESSIONS
Number of times an article is
viewed
FREQUENCY
Number of hours an article stays
on a country’s BuzzFeed page.
➔ Impressions: Inaccurate and aggregated
measure in the snapshot
➔ Frequency: Another measure but not always
aligned with the corresponding impression
provided in the instance
➔ Some f(Impressions, Frequency) worked
➔ Needed to use the function to identify classes
➔ Log transformation to account for wide
variability and skewed distribution, as follows:
Virality = log(Impressions * Frequency)
Non-Viral: Virality < mean - standard deviation
Viral: Virality >= mean - standard deviation
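The labelling rule above translates directly into a few lines of Python. Two details are assumptions, since the slides do not specify them: the natural logarithm is used, and the standard deviation is the population form (`statistics.pstdev`). Both impressions and frequency must be positive for the log to be defined.

```python
import math
import statistics


def label_virality(records):
    """Label each (impressions, frequency) pair as viral (True) or not.

    Virality = log(impressions * frequency); an article is viral when its
    score is at least (mean - standard deviation) of all scores.
    """
    scores = [math.log(impressions * frequency) for impressions, frequency in records]
    threshold = statistics.mean(scores) - statistics.pstdev(scores)
    labels = [score >= threshold for score in scores]
    return labels, threshold
```

Because the cut-off sits one standard deviation below the mean, most articles end up in the viral class, so the two classes are unbalanced.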
FEATURE ENGINEERING
FEATURE ENGINEERING
ATTEMPTED OBVIOUS ONES
STOP WORDS OR COMMON
WORDS COULD HAVE HELPED
➔ Title Length: Fairly constant and not a good indicator.
➔ Lists vs. Non-Lists: Contrary to our hypothesis, no such
correlation in the data.
➔ Words in tags: To retain the context in the tags, we
used individual phrases, as provided (simulated
n-grams) and individual words (1-gram).
➔ Low Document Frequency: No positive impact on the
predictability.
➔ High Document Frequency: Negative impact on the
predictability of the model.
➔ Stop Words OR Common Words: Did not attempt it
due to time constraints.
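Two of the hand-built features above, title length and lists vs. non-lists, could be computed as in this sketch. The definitions are assumptions: length is measured in whitespace-separated words, and a title counts as a list when it starts with a number (as in "27 Times ..."). The slides do not state how either was measured.

```python
import re


def title_features(title):
    """Return simple hand-built features for one article title."""
    stripped = title.strip()
    return {
        # Number of whitespace-separated words in the title.
        "title_length": len(stripped.split()),
        # True when the title opens with a number followed by a space.
        "is_list": bool(re.match(r"\d+\s", stripped)),
    }
```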
MODELING WITH SCI-KIT LEARN
Multinomial Naive Bayes and Logistic Regression:
Feature Selection: For each instance, we used all the text contained in Title, Description, Category,
Primary Keywords and Phrases in Tags.
Document Frequency: Varied the maximum and minimum document frequency in increments of 10%. No impact.
vect = CountVectorizer()
Number of features in vect: 70,000 or more
Model Selection: For both models, we did 12-fold cross-validation as follows:
skf = StratifiedKFold(y, n_folds=12, shuffle=True)
for train, test in skf: …
A second cross-validation for both Multinomial NB and Logistic Regression as follows:
cross_val_score(pipe,X,y,cv=12,scoring='accuracy').mean()
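The `StratifiedKFold(y, n_folds=12, ...)` call on the slide uses the older `sklearn.cross_validation` interface, which later scikit-learn releases removed. A rough equivalent with the current `sklearn.model_selection` API might look like the sketch below. The function name, the `seed` and `max_iter` settings, and the use of `make_pipeline` are illustrative choices, not the team's code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def cross_validate_models(texts, labels, n_splits=12, seed=0):
    """Return the mean cross-validated accuracy of each classifier."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    models = {
        "multinomial_nb": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        # Vectorizing inside the pipeline keeps each fold's vocabulary
        # limited to that fold's training split.
        pipe = make_pipeline(CountVectorizer(), model)
        scores[name] = cross_val_score(
            pipe, texts, labels, cv=cv, scoring="accuracy"
        ).mean()
    return scores
```

Here `texts` would be one string per article (title, description, category, keywords and tags joined together) and `labels` the viral / non-viral classes. Each class needs at least `n_splits` examples for stratified folding to work.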
MODEL RESULTS
Metric     Multinomial NB  Logistic Regression
Accuracy   0.839620        0.865165
AUC        0.699976        0.677515
F1         0.904905        0.922518
Precision  0.908419        0.898182
Recall     0.901438        0.948231
CROSS VALIDATION ACCURACY SCORES
Multinomial Naive Bayes: 0.840168
Logistic Regression: 0.864645
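For readers less familiar with the metrics in the table, the sketch below shows how precision, recall and F1 relate for a binary viral / non-viral prediction. It is a plain-Python illustration with an assumed function name. The reported numbers themselves presumably came from scikit-learn's metric functions.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive (viral) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for truth, pred in pairs if truth and pred)
    fp = sum(1 for truth, pred in pairs if not truth and pred)
    fn = sum(1 for truth, pred in pairs if truth and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```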
TOOLS
NLTK, Word Cloud
WHAT COULD BE
DONE BETTER?
ROOM FOR IMPROVEMENT
➔ BuzzFeed’s public API does not share the whole
story. Include data points from other sources.
➔ Limiting our focus to English-speaking countries
limited our ability to see the impact of cultural
context outside of the US content engine’s orbit.
➔ With more time, might apply a better methodology
to the Title Generator
➔ With more time, might stand up the user-facing web
application and capture user data to improve the
model and generate better recommendations
QUESTIONS?
Team BuzzFeed: Project Presentation