Online text data for machine learning, data science, and research - Who can provide data? What data can't you get? What about data hygiene?

Fredrik Olsson, PhD
Senior Research Scientist
RISE AI

Take-home message
Web scraping is non-trivial: data always noisy.
Social data is expensive! Think of it as an extra co-worker.
Contracts with providers are mostly long-term: pay-as-you-go
pricing makes data even more expensive.
Large overhead in integration: requires a lot from the buyer.
Content-based enrichments, e.g., entities, sentiment, not
available in all languages.
Few vendors provide service level agreements concerning
data coverage, or latency. None concerning quality.

In God we trust; all
others bring data
— W. Edwards Deming

Preliminaries
Getting good data is an iterative process. It
is imperative for empirical research.
Do not cut the wrong corners!
After this talk, you’ll know who can provide
online text data, what types of data are hard
to get, and principal data hygiene factors.

Machine learning needs data
But relevant text data is surprisingly hard to
get your hands on.
Tech giants open-source software, e.g.
TensorFlow, FastText, CNTK, but their data
remain well-protected.
Read: The value of data (1|2|3)

What text data to aim for?
Editorial news, individuals’ blogs, social networks, and targeted
surveys are all different beasts with respect to readership,
purpose, trustworthiness, political bias, reach, etc.
Questions along the way:
Is internet penetration in your region of interest high
enough? See Internet World Stats.
What sites are popular in the region? See Alexa, SimilarWeb,
Quantcast and International Media & Newspapers.
Collect continuously or once?

From: World Map of Social Networks

What type of data is hard to get?
Chat app data from, e.g., WhatsApp, WeChat, Kik,
Facebook Messenger, Viber, Line, Telegram.
Historical data older than 30 days.
Geo-tagged data.
Individual posts (public and private) from Facebook,
Instagram, LinkedIn.
Demographic variables, e.g., gender, age, income.

Who can provide data?
Online data providers: Gnip, Meltwater (DataSift, fairhair.ai),
StockTwits, Twingly, webhose.io, Talkwalker, Socialgist,
PublicNow, LexisNexis, Dow Jones, glean.info, DataStreamer,
InfoNgen, News API, EventRegistry, Common Crawl. 
Crawler as a service: import.io, 80legs.com, Connotate,
Promptcloud, Diffbot, Scrapy.
Survey panels: Dynata (formerly SSI), CINT, Tawasol.
Crowdsourced data: Appen (acquired Figure Eight in 2019),
Amazon Mechanical Turk, Annotell.

gnip.com
• Twitter’s enterprise API platform - the only
provider of tweets, including historical
data.
• Acquired by Twitter for $134M in 2014.
• Firehose access to Twitter (500M tweets/
day), WordPress (2-5M blogs and
comments/day), Disqus.
• Managed access to Facebook, YouTube,
G+, Vimeo, VK, Reddit, Instagram etc.

datasift.com
• Facebook Topic Data provider — anonymized,
aggregated actions from 1.7B users, across 60+
attributes.
• Acquired by Meltwater in April 2018.
• LinkedIn Engagement Insights — aggregated
actions from 460M users, across 130 attributes.
22k interactions/minute, 11M posts.
• Firehose access: WordPress, IntenseDebate,
Tumblr, Disqus.
• Managed access: G+, Instagram etc.
• News articles: LexisNexis, NewsCred.

Talkwalker
• 187 languages.
• 40M+ documents/day.
• Multiple data enrichments.
• Provides data from 150M sources — news,
blogs, discussion boards, forums etc.
• Monitoring/analytics is their primary
business.

Twingly
• Blogs: 1.2M posts/day, 10k new blogs
added/day.
• Forum: 30M posts/day from 9k forums.
• News: 3M stories/day, 135k sources, 100
countries.
• Social Feed: Facebook Public Pages,
approx. 17M posts/day.
• Several categories, 35 languages, entities
and sentiment for some languages.

webhose.io
• Sources: news, blogs, forums, reviews, e-
commerce, dark web, broadcast (US tv &
radio).
• Enrichments: named entities, sentiment,
categories, countries.
• 80 languages.
• Reasonably priced and easy to get going.
• Live data with 30 days of history, and historical
data going back to December 2014.

Common Crawl
• Non-profit organization.
• “… web crawl data that can be accessed
and analyzed by anyone” and “… years of
free web page data…”.
• 40 languages.
• 8 years history, petabytes of data.
• Raw data, metadata, text data.

import.io
“Create your own datasets within minutes, no
coding required.”
Good for getting clean data from individual
web sites, without a large overhead. We’ve
used it for, e.g., hotel reviews, Glassdoor
data.
Acquired Connotate in 2019.

diffbot.com
“Using AI, computer vision, machine learning
and natural language processing, Diffbot
provides software developers with tools to
extract and understand objects from any web
page.”
Good for programmatic integration, large-
scale extraction of web contents.

cint.com
Good for reaching and querying target
audiences, based on a range of variables.
Not a surveying company.

Other types of data to complement
online media data
Open data: in science and government.
Financial data: xignite.com.
Data for sales and marketing: mixrank.com.
Company web data: yipitdata.com.
App usage data: 7parkdata.com.

Other types of data to complement
online media data
Data set search engines:
• Google Dataset Search
• Microsoft Research Open Data
• Quandl
• Shovel AI
Existing datasets, see:
• The greatest Public Datasets for AI.
• Awesome Public Datasets
• Linguistic Data Consortium.
• European Language Resources Association.
• CLARIN Virtual Language Observatory.

From: The New Gold Rush? Wall Street Wants your Data

Data providers round-up
Data is expensive! Think of it as an extra co-
worker.
Contracts with providers are mostly long-term:
pay-as-you-go pricing makes data even more expensive.
Usually large overhead in integration: requires a lot
from the buyer.
Content-based enrichments, e.g., entities,
sentiment, not available in all languages.

Data providers round-up
Few vendors provide service level
agreements concerning data coverage, or
latency. None concerning quality.
Web scraping is non-trivial: data always
noisy, requires processing before use.
Read the terms of service carefully. Example:
Facebook, Twitter cut off data access for
Geofeedia, a social media surveillance startup.

Nuisances
Data that matter will be harder for outsiders to get.
Sharing data is hard due to its business value for data
creators. Inimical to reproducibility of scientific results.
Political factors impact the data landscape, e.g.:
• China: How the Chinese Government Fabricates Social
Media Posts for Strategic Distraction, not Engaged
Argument. Chinese govt. fabricated 448M comments.
Affects representativity.
• USA: Diehard Coders just Rescued NASA’s Earth Science
Data and Empty search results at US govt Open Data site.
Anecdotal evidence of US govt. reducing access to data.

Data processing hygiene factors
• Collect early, collect all (depending, of course, on GDPR).
• Your data will be noisy: clean it.
• Your initial hypotheses will be wrong: immerse yourself in
data!
• Data provenance: who touched it, and what did they do?
• Versioning of data and the software that processes it, e.g.,
pachyderm.io, DataKit.
• Keep track of data characteristics, e.g., Great Expectations.
• Facilitate collaboration: “The most important collaborator is
your future self.”
• Strive for reproducibility: your data is an integral part.
• Talk about data readiness: what can you expect to achieve
with your data [1, 2]? Akin to NASA’s TRL.
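
Several of the hygiene factors above (provenance, versioning, tracking data characteristics) can be sketched in a few lines of plain Python. This is an illustrative stand-in only: it does not use the pachyderm.io, DataKit, or Great Expectations APIs, and the function names, field names ("text", "source"), and actor labels are assumptions made up for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(records):
    """Return a SHA-256 hex digest of the records in a canonical JSON form."""
    canonical = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def provenance_entry(records, actor, action, parent=None):
    """Describe who did what to the data, and which version it came from."""
    return {
        "actor": actor,
        "action": action,
        "parent": parent,  # fingerprint of the previous version, if any
        "fingerprint": fingerprint(records),
        "n_records": len(records),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def check_characteristics(records, required_fields=("text", "source"), min_chars=1):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for i, rec in enumerate(records):
        missing = [f for f in required_fields if f not in rec]
        if missing:
            problems.append(f"record {i}: missing fields {missing}")
            continue
        if len(str(rec.get("text", "")).strip()) < min_chars:
            problems.append(f"record {i}: text shorter than {min_chars} characters")
    return problems


# Hypothetical example data: one usable record and one empty one.
raw = [
    {"text": " Great hotel ", "source": "example-site"},
    {"text": "", "source": "example-site"},
]
log = [provenance_entry(raw, actor="crawler-v1", action="collected")]

cleaned = [dict(r, text=r["text"].strip()) for r in raw if r["text"].strip()]
log.append(
    provenance_entry(
        cleaned,
        actor="analyst",
        action="stripped whitespace, dropped empty texts",
        parent=log[-1]["fingerprint"],
    )
)

print(check_characteristics(raw))      # reports the empty record
print(check_characteristics(cleaned))  # no problems left
```

Chaining each entry to its parent fingerprint gives a simple, inspectable history of the data. Dedicated tools add storage, diffing, and richer checks on top of the same idea.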

Continue with…
Subscribe to newsletters: Data Elixir, Data
Science Weekly, Data is Plural.
Listen to: Raw Data, Data Skeptic.
Use: metacurate.io
