Applying Machine-Learning and Natural Language Processing tools in an attempt to better predict article virality for BuzzFeed; a Data Science capstone project.
The document discusses harnessing the power of content by translating metadata into Ruby scripts. It notes that the amount of content always increases over time, and that the search engine may contain bugs. While a few tags have high frequencies, most tags have low frequencies. The conclusion states that obscurity is not an issue, since hits from a wide range of popular and unpopular content will be displayed in search results, and growth in available content shows no sign of slowing down.
Learn how leading organizations are leveraging Crafter CMS in combination with the Alfresco ECM platform to manage and optimize enterprise websites, intranets, portal applications, digital assets, and more!
This document discusses getting to know data using R. It begins by outlining the typical steps in a data analysis, including defining the question, obtaining and cleaning the data, performing exploratory analysis, modeling, interpreting results, and creating reproducible code. It then describes different types of data science questions from descriptive to mechanistic. The remainder of the document provides more details on descriptive, exploratory, inferential, predictive, causal, and mechanistic analysis. It also discusses R, including its design, packages, data types like vectors, matrices, factors, lists, and data frames.
NEW LAUNCH! Natural Language Processing for Data Analytics - MCL343 - re:Inve...Amazon Web Services
The need for Natural Language Processing (NLP) is gaining more importance as the amount of unstructured text data doubles every 18 months and customers are looking to extend their existing analytics workloads to include natural language capabilities. Historically, this data had been prohibitively expensive to store and early manual processing evolved into rule-based systems, which were expensive to operate and inflexible. In this session we will show you how you can address this problem using Amazon Comprehend.
An AI Use Case: Market Event Impact Determination via Sentiment and Emotion A...Databricks
1) The document describes an AI use case to analyze news articles and social media sentiment to determine potential impacts on financial markets and predict market movements.
2) Natural language processing is used to analyze news feeds for event timing, keywords, emotions, and sentiment scores, which are then combined with market data.
3) Machine learning algorithms are used to build predictive models and determine if sentiment analysis can provide additional insights into potential market impacts from events and supplement traditional market data analysis.
IOT313_AWS IoT and Machine Learning for Building Predictive Applications with...Amazon Web Services
In this session, we present AWS IoT and Amazon Machine Learning (Amazon ML) to demonstrate how you can use these services together to build smart applications. Customer SKF presents their use case around AWS IoT and Amazon ML in their wind turbines.
The document describes a project using tweets to predict stock price fluctuations of Microsoft (MSFT). It discusses:
1. Using sentiment analysis and naive Bayes classification to determine if tweets in a 1-minute period are positive or negative, and linking this to stock price movement in the next minute.
2. Pre-implementation considerations like data sources, filtration of tweets by keywords, language, encoding, and normalization.
3. The implementation steps of downloading tweet data, processing and filtering it, splitting into 1-minute blocks and linking to intraday stock prices.
ExperTwin: An Alter Ego in Cyberspace for Knowledge WorkersCarlos Toxtli
ExperTwin is a Knowledge Advantage Machine (KAM) that is able to collect data from your areas of interest and present it in time, in context, and in place in the worker's workspace. This research paper describes how workers can benefit from having a personal net of crawlers (as Google does) collecting and organizing updated data relevant to their areas of interest and delivering it to their workspace.
This document introduces khmer, a platform for scalable sequence analysis. It discusses how khmer uses k-mers to provide implicit read alignments and assemble sequences using de Bruijn graphs. It also describes some of the challenges with k-mers, such as each sequencing error resulting in novel k-mers. The document outlines khmer's data structures and algorithms for efficiently counting k-mers and representing de Bruijn graphs. It discusses how khmer has been applied to real biological problems and highlights areas of current research using khmer, such as error correction, variant calling, and assembly-free comparisons of data sets.
Probabilistic programming is a new approach to machine learning and data science that is currently the focus of intense academic research, including an ongoing DARPA program. If successful, probabilistic programming systems will allow sophisticated predictive models to be written by a wide range of domain experts. Before we get to the promised land, though, some basic challenges need to be addressed, including performance on real-world datasets, programming tools support, and education.
2018 NYC Localogy: Using Data to Build Exceptional Local PagesLocalogy
This document discusses using data-driven approaches to generate localized content at scale for local business pages. It begins by outlining the types of competition on local search engine results pages. It then discusses what makes a good local page, focusing on relevance, authority and uniqueness. The document proposes using natural language generation techniques to transform local landing pages by drawing on relevant data sources to create customized, location-specific content fragments. It outlines a process for identifying locations, brainstorming content topics, connecting data to content structures, and generating unique pages for each location based on the location's numeric representation. Provided the content is properly attributed and overseen for accuracy, this approach aims to better serve customers with more useful local information than generic templates.
Boosting Product Categorization with Machine LearningAmadeus Magrabi
Talk on using machine learning to build a category recommendation system for e-commerce. Presented at Open Data Science Conference West in San Francisco (11/2017).
This webinar discusses how to perform sentiment analysis on large datasets using Apache Hive. It provides an overview of sentiment analysis and demonstrates useful Hive UDFs for preprocessing text data and extracting n-grams. The webinar also includes a tutorial analyzing sentiment around the topic of "mortgage" using the MemeTracker dataset containing 90 million records of URLs, timestamps, memes and links over 36GB of JSON data. Advanced custom sentiment analysis can be developed by extending Hive's extensibility framework.
Serverless Text Analytics with Amazon ComprehendDonnie Prakoso
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text.
This deck shows how to build your own text analytics using Amazon Comprehend and how to integrate it with other AWS services. On top of that, it also provides an introduction to Amazon Lex.
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Kris Jack
Presentation given at Workshop on Academic-Industrial Collaborations for Recommender Systems 2013 (http://bit.ly/114XDsE), JCDL'13. A walk through Mendeley as a platform, growing pains involved with engineering at a large scale, the data that we're making publicly available and some demos that have come out of academic collaborations.
Utilizing the natural language toolkit for keyword researchErudite
This document discusses using the Natural Language Toolkit (NLTK) for keyword research and analysis. It provides instructions on installing NLTK and other Python libraries, preparing keyword data, and running scripts to classify and cluster keywords to identify trends and topics. The document demonstrates how to automate aspects of keyword research using NLTK to help analyze large datasets.
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook:
- Sharp intro to genomics data
- What are the challenges
- Distributed machine learning to the rescue
- Projects: distributed teams
- Research: long process
- Towards maximum share for efficiency
This was a workshop I gave at http://csforum.eu in 2011.
DESIGNING NARRATIVE CONTENT
---------------------------------------
How can you be sure your content reaches the largest audience possible? By designing content for all contexts, so that it reaches your audience via any device, any phone, any laptop, anywhere.
This workshop will discuss how to create a content strategy for narrative content. We'll explore how to tailor your content, as well as your editorial workflows, for different devices and audiences. We'll use Treesaver, an open-source content layout framework to illustrate narrative content principles.
Publishing usually comes at the end of your content strategy, but by orchestrating your process for narrative content, you can ensure your stories, news, product descriptions, and more will be tailored for your audience wherever they are.
What you’ll learn
How to optimise workflow, production, and deployment for narrative content.
How to use the technology behind narrative content.
How to customise content for different contexts.
Making Reddit Search Relevant and Scalable - Anupama Joshi & Jerry Bao, RedditLucidworks
The document discusses improving Reddit's search capabilities. It describes Reddit's search architecture, including how they improved relevance through signals like click data and comments. It also discusses scaling the infrastructure through techniques like Terraform, autoscaling replicas across availability zones, and building a faster data ingestion pipeline.
Slides for VU Web Technology course lecture on "Search on the Web". Explaining how search engines work, some basic information laws and inverted indices.
Talk on SEO for aggregation websites such as comparison search engines, marketplaces, or classifieds platforms, covering topics including Panda diet and internal linking.
This document provides guidance on how to effectively research topics online. It discusses the importance of using reliable sources like purchased databases, library books, and teacher-approved websites. It also notes the challenges of finding quality information on the open web. The document offers tips for crafting effective search queries, carefully examining search results, and evaluating websites based on factors like the author, date, content quality, and domain extension. It promotes the "triangle method" of cross-referencing information from multiple sources. Key strategies presented include using domain extensions to identify source types, checking for author information and site currency, and looking for inconsistencies or errors.
This document describes an approach for bridging the gap between natural language queries and linked data concepts using BabelNet. The approach uses BabelNet for word sense disambiguation, named entity recognition and disambiguation. It parses queries, matches terms to ontology concepts and properties, generates candidate triples, and integrates the triples to produce SPARQL queries. The approach was evaluated on test data from QALD-2, achieving a promising 76% of questions answered correctly.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Similar to Georgetown Data Science - Team BuzzFeed (20)
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
4. WHAT IS BUZZFEED?
“BuzzFeed is a cross-platform, global network for news and entertainment that generates seven billion views each month. BuzzFeed creates and distributes content for a global audience and utilizes proprietary technology to continuously test, learn and optimize.” (BuzzFeed website)
• More than 7 billion monthly global content views
• More than 200M monthly unique visitors to BuzzFeed.com
• 11 international editions including US, UK, Germany, Espanol, France, Spain, India, Canada, Mexico, Brazil, Australia and Japan
(BuzzFeed website)
5. PROBLEM
• There is good money to be made from consistently generating popular content on the internet.
• A significant portion (20%-30%) of BuzzFeed’s articles generate very little traffic.
6. HYPOTHESIS
We believe there may be a correlation between the content of the language associated with an article (title, description, tags, etc.) and how likely it is to go viral. We also believe that this likelihood is tied to the country in which an article goes viral.
7. WHY DOES IT MATTER?
• 20%-30% of the articles we pulled gained little traction.
• BuzzFeed could hypothetically save money and improve user experience by informing content decisions with the topics that consistently draw readership.
8. SOLUTION APPROACH
• Visualization to help identify underlying themes in a given dataset through three lenses: the title, the content of the article itself, or the tags ascribed to it by the author.
• Title Generator to suggest topics and themes based upon recent trends in social media, to guide the editing staff in writing content that is likely to generate significant online traffic.
• Given a sufficient number of articles in our data and trending topics, we believe the output of a reasonable title generator can be fed into a predictor to help assess its potential virality.
10. INGESTION
“You need to start pulling
data, like, now.”
- Ben Bengfort, 1st Day of Class
➔ Project required us to gather data from 5 separate
public APIs
➔ Before anything else, it was necessary to
automate the process of querying the APIs
➔ Set up an ubuntu instance on Amazon Web
Services’ Elastic Compute Cloud (EC2)
➔ Run Python Script hourly (crontab) to capture
.json files on a server-side WORM -- 5 calls/hour,
each for Australia, Canada, India, UK and US
Data Collection began: May 18, 2016.
Data Collection ended: Aug 31, 2016
Total raw data size in WORM: 1.16 GB.
Number of records pulled: 330,000
(25 articles/hr for each of 5 countries
over 100 days)
12. WRANGLING
➔ Clean Raw Data
◆ Removed tags, images, and other content outside the scope of our analysis
◆ Used insight from this to drop irrelevant variables and identify gaps that
could be accounted for
➔ Understand the Target Variable (Measure of Virality)
◆ Added a frequency column to capture how long each article was “persisting”, as a
measure of virality
◆ Assessed the accuracy and applicability of the Number of Impressions
provided in the data
➔ Captured all instances, features, and target variables in a Postgres table for use
downstream in the pipeline
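The cleaning step above can be sketched as a simple record filter: keep only the fields treated as features or target, and drop everything else before loading into Postgres. The field names in `KEEP` are illustrative assumptions, not the actual schema -- the real payload had ~36 data points per "buzz".

```python
# Illustrative field list -- the actual kept columns are not named in the deck.
KEEP = ["id", "title", "description", "category", "tags", "impressions"]


def clean_record(raw):
    """Reduce one raw API record to the in-scope fields only."""
    row = {k: raw.get(k) for k in KEEP}
    # Tags arrive as a list of phrases; join them for flat-table storage.
    if isinstance(row.get("tags"), list):
        row["tags"] = ";".join(row["tags"])
    return row
```

The resulting rows could then be bulk-inserted into the Postgres table (e.g. with psycopg2's `executemany`).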
13. WHAT DOES THE DATA LOOK LIKE?
Share of articles pulled, by country:
Australia: 9% · Canada: 5% · India: 7% · UK: 17% · US: 62%
14. ANALYSIS
➔ Word Clouds
◆ What terms “jump out”?
➔ Natural Language Toolkit
◆ What sorts of analysis can we run
on our textual data?
➔ Sci-Kit Learn
◆ What can Machine Learning
models help us predict?
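The per-country "top terms" lists on the next slide (and the word clouds behind them) amount to a term-frequency count over each country's tags, which can be sketched in a few lines:

```python
from collections import Counter


def top_terms(tag_lists, n=10):
    """Return the n most frequent tag words across a country's articles.

    tag_lists: one list of tag words per article.
    """
    counts = Counter(word.lower() for tags in tag_lists for word in tags)
    return [term for term, _ in counts.most_common(n)]
```

A word cloud is simply a visualization of the same counts, with font size proportional to frequency.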
15. TOP TERMS
Tags: Australia
1. game
2. thrones
3. australia
4. season
5. 6
6. fan
7. twitter
8. quiz
9. stark
10. hot
Canada
1. canada
2. canadian
3. news
4. social
5. quiz
6. animals
7. twitter
8. funny
9. lol
10. food
India
1. social
2. news
3. india
4. bollywood
5. indian
6. twitter
7. desi
8. khan
9. stories
10. women
UK
1. quiz
2. british
3. uk
4. food
5. trivia
6. twitter
7. you
8. funny
9. celebrity
10. 00s
US
1. test
2. quiz
3. food
4. recipes
5. you
6. funny
7. news
8. social
9. summer
10. music
● The United States, United Kingdom, and Canada share the most similar top tags (as well as titles)
while Australia and India have more distinct preferences.
● Articles about Game of Thrones - and television in general - fare better on BuzzFeed Australia
● “Women/woman” only appears on the top list for India, perhaps reflective of its readership
● Twitter does well across all five groups - evidence of the popularity of listicles (“27 Times Mindy
Kaling Was Just Too Relatable On Twitter”)
18. TITLE GENERATOR
• Generated a corpus of all the unique
titles from API pulls
• Natural Language Toolkit: Trigram
Collocation Finder & Trigram Assoc Measures
• Grabbed the most likely subsequent words
using likelihood ratios
• Introduced minor stochasticity to
prevent it from always producing the same
titles
• Notable Examples:
– “Canada Goose Is Most Calories”
– “You More Hilary Duff or Lohan?”
– “What Game of Thrones Fan if You
Guess We Thrones”
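The approach above can be illustrated with a simplified, dependency-free stand-in for NLTK's `TrigramCollocationFinder`: count trigrams over the title corpus, then walk the chain from a seed bigram, sampling among the top continuations so repeated runs differ (the "minor stochasticity" on the slide).

```python
import random
from collections import Counter, defaultdict


def build_trigrams(titles):
    """Map each (word, word) bigram to a Counter of following words."""
    nxt = defaultdict(Counter)
    for title in titles:
        words = title.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            nxt[(a, b)][c] += 1
    return nxt


def generate_title(nxt, seed, max_words=8, k=3, rng=random):
    """Extend a two-word seed by repeatedly picking a likely next word."""
    words = list(seed)
    while len(words) < max_words:
        candidates = nxt.get((words[-2], words[-1]))
        if not candidates:
            break
        # Minor stochasticity: sample from the top-k continuations instead
        # of always taking the single most likely word.
        top = [w for w, _ in candidates.most_common(k)]
        words.append(rng.choice(top))
    return " ".join(words)
```

The real implementation scored continuations with NLTK's likelihood-ratio association measure rather than raw counts; this sketch keeps only the chain-walking idea.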
19. FEATURE SELECTION - WHICH FEATURES ARE THE
MOST TELLING? - HYPOTHESES
CATEGORY: SOME SIGNAL
There are 140+ categories on
BuzzFeed. Is there a relationship
between the categories and
virality?
METAVALUE: TOO BROAD - NO
SIGNAL
How many keywords are there?
What is the relationship between
virality and certain keywords?
➔ Each “Buzz” had 36 data points
◆ Some of these data points were standardized
◆ Some of them were not
➔ A significant number of these data points did
not contain any signal
➔ Other than category, the only fields that
carried signal contained text/words from
the article itself:
◆ Description, Title, Primary Keywords
◆ Tags, containing phrases and words
20. TARGET
MEASURE OF VIRALITY
IMPRESSIONS
Number of times an article is
viewed
FREQUENCY
Number of hours an article stays
on a country’s BuzzFeed page.
➔ Impressions: An inaccurate, aggregated
measure in the snapshot
➔ Frequency: Another measure, but not always
aligned with the corresponding impressions
provided in the instance
➔ Some f(Impressions, Frequency) worked
➔ Needed to use that function to identify classes
➔ Log transformation to account for wide
variability and a skewed distribution, as follows:
Virality = log(Impressions × Frequency)
Non-Viral: Virality < mean − standard deviation
Viral: Virality ≥ mean − standard deviation
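The labeling rule on this slide can be sketched directly from its formula, using only the standard library:

```python
import math
from statistics import mean, stdev


def label_virality(impressions, frequencies):
    """Label each article per the slide: virality = log(impressions x
    frequency), with the non-viral cutoff one standard deviation below
    the mean virality score."""
    scores = [math.log(i * f) for i, f in zip(impressions, frequencies)]
    cutoff = mean(scores) - stdev(scores)
    return ["viral" if s >= cutoff else "non-viral" for s in scores]
```

The log transform compresses the long right tail of heavily viewed articles, so the mean-minus-one-sigma cutoff isolates only the articles that clearly failed to spread.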
21. FEATURE ENGINEERING
FEATURE ENGINEERING
ATTEMPTED OBVIOUS ONES
STOP WORDS OR COMMON
WORDS COULD HAVE HELPED
➔ Title Length: Fairly constant and not a good indicator.
➔ Lists vs. Non-Lists: Contrary to our hypothesis, no such
correlation in the data.
➔ Words in tags: To retain the context in the tags, we
used individual phrases as provided (simulated n-grams)
and individual words (1-grams).
➔ Low Document Frequency: No positive impact on
predictability.
➔ High Document Frequency: Negative impact on the
model’s predictability.
➔ Stop Words / Common Words: Did not attempt,
due to time constraints.
22. MODELING WITH SCI-KIT LEARN
Multinomial Naive Bayes and Logistic Regression, as follows:
Feature Selection: For each instance, we used all the text contained in Title, Description, Category,
Primary Keywords, and Phrases in Tags.
Document Frequency: Maximum and minimum document frequency, in increments of 10% ... no impact
vect = CountVectorizer()
Number of features output by vect: 70,000+ features
Model Selection: For both models, we did 12-fold cross-validation as follows:
skf = StratifiedKFold(n_splits=12, shuffle=True)
for train, test in skf.split(X, y): …
Another cross-validation for both Multinomial NB and Logistic Regression, as follows:
cross_val_score(pipe, X, y, cv=12, scoring='accuracy').mean()
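A runnable toy version of the pipeline sketched above, with illustrative data (the real run used 12 folds over the full corpus of article text; the six documents and labels here are invented for demonstration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Toy corpus: each document is the concatenated text fields of one article.
X = ["quiz food funny", "game thrones season", "quiz trivia you",
     "news social india", "recipes summer food", "bollywood khan desi"]
y = ["viral", "non-viral", "viral", "non-viral", "viral", "non-viral"]

# CountVectorizer turns text into token counts; MultinomialNB classifies.
pipe = make_pipeline(CountVectorizer(), MultinomialNB())

# Stratified k-fold CV (cv=2 here because the toy set is tiny; 12 in the deck).
score = cross_val_score(pipe, X, y, cv=2, scoring="accuracy").mean()
```

Swapping `MultinomialNB()` for `LogisticRegression()` reuses the same vectorizer and cross-validation scaffolding, which is how the two models on this slide can be compared.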
27. ROOM FOR IMPROVEMENT
• BuzzFeed’s public API does not tell the whole
story -- include data points from other sources
• Limiting our focus to English-speaking countries
limited our ability to see the impact of cultural
context outside the US content-engine’s orbit.
• With more time, might apply a better methodology
to the Title Generator
• With more time, might stand up the user-facing web
application and capture user data to improve the
model and generate better recommendations