Online text data for machine learning,
data science, and research — Who can
provide data? What data can’t you get?
What about data hygiene?
Fredrik Olsson, PhD
Senior Research Scientist
RISE AI
Take-home message
Web scraping is non-trivial: data always noisy.
Social data is expensive! Think of it as an extra co-worker.
Contracts with providers are mostly long-term: pay-as-you-go
makes data even more expensive.
Large overhead in integration: requires a lot of the buyer.
Content-based enrichments, e.g., entities, sentiment, not
available in all languages.
Few vendors provide service level agreements concerning
data coverage or latency. None concerning quality.
In God we trust; all
others bring data
— W. Edwards Deming
Preliminaries
Getting good data is an iterative process. It
is imperative for empirical research.
Do not cut the wrong corners!
After this talk, you’ll know who can provide
online text data, what types of data are hard
to get, and principal data hygiene factors.
Machine learning needs data
But relevant text data is surprisingly hard to
get your hands on.
Tech giants open-source software, e.g.
TensorFlow, FastText, CNTK, but their data
remain well-protected.
Read: The value of data (1|2|3)
What text data to aim for?
Editorial news, individuals’ blogs, social networks, and targeted
surveys are all different beasts with respect to readership, purpose,
trustworthiness, political bias, reach, etc.
Questions along the way:
Is internet penetration in your region of interest high
enough? See Internet World Stats.
What sites are popular in the region? See Alexa, SimilarWeb,
Quantcast and International Media & Newspapers.
Collect continuously or once?
From: World Map of Social Networks
What type of data is hard to get?
Chat app data from, e.g., WhatsApp, WeChat, Kik,
Facebook Messenger, Viber, Line, Telegram.
Historical data older than 30 days.
Geo-tagged data.
Individual posts (public and private) from Facebook,
Instagram, LinkedIn.
Demographic variables, e.g., gender, age, income.
Who can provide data?
Online data providers: Gnip, Meltwater (DataSift, fairhair.ai),
StockTwits, Twingly, webhose.io, Talkwalker, Socialgist,
PublicNow, LexisNexis, Dow Jones, glean.info, DataStreamer,
InfoNgen, News API, EventRegistry, Common Crawl.

Crawler as a service: import.io, 80legs.com, Connotate,
Promptcloud, Diffbot, Scrapy.
Survey panels: Dynata (formerly SSI), CINT, Tawasol.
Crowdsourced data: Appen (acquired Figure Eight in 2019),
Amazon Mechanical Turk, Annotell.
gnip.com
• Twitter’s enterprise API platform - the only
provider of tweets, including historical
data.
• Acquired by Twitter for $134M in 2014.
• Firehose access to Twitter (500M tweets/day),
WordPress (2-5M blogs and comments/day), Disqus.
• Managed access to Facebook, YouTube,
G+, Vimeo, VK, Reddit, Instagram etc.
datasift.com
• Facebook Topic Data provider — anonymized,
aggregated actions from 1.7B users, across 60+
attributes.
• Acquired by Meltwater in April 2018.
• LinkedIn Engagement Insights — aggregated
actions from 460M users, across 130 attributes.
22k interactions/minute, 11M posts.
• Firehose access: WordPress, IntenseDebate,
Tumblr, Disqus.
• Managed access: G+, Instagram etc.
• News articles: LexisNexis, NewsCred.
Talkwalker
• 187 languages.
• 40M+ documents/day.
• Multiple data enrichments.
• Provides data from 150M sources — news,
blogs, discussion boards, forums etc.
• Monitoring/analytics is their primary
business.
Twingly
• Blogs: 1.2M posts/day, 10k new blogs
added/day.
• Forum: 30M posts/day from 9k forums.
• News: 3M stories/day, 135k sources, 100
countries.
• Social Feed: Facebook Public Pages,
approx. 17M posts/day.
• Several categories, 35 languages, entities
and sentiment for some languages.
webhose.io
• Sources: news, blogs, forums, reviews, e-commerce,
dark web, broadcast (US TV & radio).
• Enrichments: named entities, sentiment,
categories, countries.
• 80 languages.
• Reasonably priced and easy to get going.
• Live data with 30 days of history, and historical
data going back to December 2014.
Common Crawl
• Non-profit organization.
• “… web crawl data that can be accessed
and analyzed by anyone” and “… years of
free web page data…”.
• 40 languages.
• 8 years history, petabytes of data.
• Raw data, metadata, text data.
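As a rough illustration of how the crawl can be explored before downloading anything large, the sketch below builds a query URL for Common Crawl's public URL index. It only constructs the URL and does not fetch it. The crawl identifier is a placeholder (an assumption, pick a current one from the list published on the index site), and the helper name `cc_index_query_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder crawl identifier (assumption): look up a current one
# in the list of crawls published by Common Crawl.
CRAWL_ID = "CC-MAIN-2024-10"


def cc_index_query_url(url_pattern, crawl_id=CRAWL_ID, limit=10):
    """Build a URL-index query for pages matching url_pattern.

    The index is expected to answer with one JSON record per captured
    page, pointing at the archive file that holds the raw content.
    """
    params = urlencode({"url": url_pattern, "output": "json", "limit": limit})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


if __name__ == "__main__":
    # Query for every captured page under one site (wildcard pattern).
    print(cc_index_query_url("example.com/*"))
```

Querying the index first keeps the download small: only the archive segments that contain the pages of interest need to be retrieved.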
import.io
“Create your own datasets within minutes, no
coding required.”
Good for getting clean data from individual
web sites, without a large overhead. We’ve
used it for, e.g., hotel reviews, Glassdoor
data.
Acquired Connotate in 2019.
diffbot.com
“Using AI, computer vision, machine learning
and natural language processing, Diffbot
provides software developers with tools to
extract and understand objects from any web
page.”
Good for programmatic integration, large-scale
extraction of web contents.
cint.com
Good for reaching and querying target
audiences, based on a range of variables.
Not a surveying company.
Other types of data to complement
online media data
Open data: in science and government.
Financial data: xignite.com.
Data for sales and marketing: mixrank.com.
Company web data: yipitdata.com.
App usage data: 7parkdata.com.
Data set search engines:
• Google Dataset Search
• Microsoft Research Open Data
• Quandl
• Shovel AI
Existing datasets, see:
• The greatest Public Datasets for AI.
• Awesome Public Datasets
• Linguistic Data Consortium.
• European Language Resources Association.
• CLARIN Virtual Language Observatory.
From: The New Gold Rush? Wall Street Wants your Data
Data providers round-up
Data is expensive! Think of it as an extra co-worker.
Contracts with providers are mostly long-term:
pay-as-you-go makes data even more expensive.
Usually large overhead in integration: requires a lot
of the buyer.
Content-based enrichments, e.g., entities,
sentiment, not available in all languages.
Few vendors provide service level
agreements concerning data coverage or
latency. None concerning quality.
Web scraping is non-trivial: data always
noisy, requires processing before use.
Read the terms of service carefully. Example:
Facebook, Twitter cut off data access for
Geofeedia, a social media surveillance startup.
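To make the point about noisy scraped data concrete, here is a minimal cleaning sketch using only the Python standard library. It is one possible first pass, not a complete pipeline: it strips markup, drops script and style content, normalizes Unicode and whitespace, and removes exact duplicates. The function names `clean_html` and `deduplicate` are made up for this illustration.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(raw):
    """Return the visible text of an HTML fragment on a single line."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = " ".join(parser.parts)
    # NFKC folds compatibility characters (e.g. non-breaking spaces).
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(docs):
    """Drop empty documents and case-insensitive exact duplicates, keeping order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    pages = ["<p>Great hotel!</p>", "<div>great hotel!</div>", "<script>x=1</script>"]
    print(deduplicate([clean_html(p) for p in pages]))
```

Joining text fragments with a space is a deliberate simplification: it can split a word that is broken up by inline tags, which is usually an acceptable trade-off for a first pass. Near-duplicate detection and language filtering would come later.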
Nuisances
Data that matter will be harder for outsiders to get.
Sharing data is hard due to its business value for data
creators. Inimical to reproducibility of scientific results.
Political factors impact the data landscape, e.g.:
• China: How the Chinese Government Fabricates Social
Media Posts for Strategic Distraction, not Engaged
Argument. Chinese govt. fabricated 448M comments.
Affects representativity.
• USA: Diehard Coders just Rescued NASA’s Earth Science
Data and Empty search results at US govt Open Data site.
Anecdotal evidence of US govt. reducing access to data.
Data processing hygiene factors
• Collect early, collect all (depending, of course, on GDPR).
• Your data will be noisy — Clean it.
• Your initial hypotheses will be wrong — Immerse yourself in
data!
• Data provenance — Who touched it? What did they do?
• Versioning of data and the software that processes it, e.g.,
pachyderm.io, DataKit
• Keep track of data characteristics, e.g., Great Expectations
• Facilitate collaboration — “The most important collaborator is
your future self.”
• Strive for reproducibility — Your data is an integral part.
• Talk about data readiness — What can you expect to achieve
with your data [1, 2]? Akin to NASA’s TRL.
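Several of the hygiene factors above (provenance, versioning, tracking data characteristics) can be approximated with very little code before adopting a dedicated tool. The sketch below is a hand-rolled illustration under that assumption and does not use the API of any of the tools named on the slide. The record fields and the function names `provenance_record` and `check_characteristics` are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(docs, source, processed_by, step):
    """Describe one version of a dataset: its content hash, origin, and who touched it."""
    payload = json.dumps(docs, ensure_ascii=False, sort_keys=True).encode("utf-8")
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),  # identifies this exact content
        "n_docs": len(docs),
        "source": source,
        "processed_by": processed_by,
        "step": step,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def check_characteristics(docs, min_docs=1, max_empty_ratio=0.05):
    """Return a list of violated expectations (an empty list means all passed)."""
    problems = []
    if len(docs) < min_docs:
        problems.append(f"expected at least {min_docs} docs, got {len(docs)}")
    empty = sum(1 for d in docs if not d.strip())
    if docs and empty / len(docs) > max_empty_ratio:
        problems.append(f"{empty} of {len(docs)} docs are empty")
    return problems


if __name__ == "__main__":
    docs = ["first post", "second post"]
    print(json.dumps(provenance_record(docs, "vendor-x", "alice", "raw"), indent=2))
    print(check_characteristics(docs))
```

Storing one such record next to every processed version of the data gives the future self a way to tell which content a result was computed on, and the expectation check catches silent changes in a vendor feed.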
From: NASA
Continue with…
Subscribe to newsletters: Data Elixir, Data
Science Weekly, Data is Plural.
Listen to: Raw Data, Data Skeptic.
Use: metacurate.io

Online text data for machine learning, data science, and research - Who can provide data? What data can't you get? What about data hygiene?

  • 1. Online text data for machine learning, data science, and research — Who can provide data? What data can’t you get? What about data hygiene? Fredrik Olsson, PhD Senior Research Scientist RISE AI
  • 2. Take-home message Web scraping is non-trivial: data always noisy. Social data is expensive! Think of it as an extra co-worker. Contracts with providers are mostly long-term: pay as you go makes data even more expensive. Large overhead in integration: requires a lot of the buyer. Content-based enrichments, e.g., entities, sentiment, not available in all languages. Few vendors provide service level agreements concerning data coverage, or latency. None concerning quality.
  • 3. In God we trust; all others bring data — W. Edwards Deming
  • 4. Preliminaries Getting good data is an iterative process. It is imperative for empirical research. Do not cut the wrong corners! After this talk, you’ll know who can provide online text data, what types of data is hard to get, and principal data hygiene factors.
  • 5. Machine learning needs data But relevant text data is surprisingly hard to get your hands on. Tech giants open-source software, e.g. TensorFlow, FastText, CNTK, but their data remain well-protected. Read: The value of data (1|2|3)
  • 6. What text data to aim for? Editorial news, individuals’ blogs, social networks, targeted surveys are all different beasts wrt readership, purpose, trustworthiness, political bias, reach etc. Questions along the way: Is internet penetration in your region of interest high enough? See Internet World Stats. What sites are popular in the region? See Alexa, SimilarWeb, Quantcast and International Media & Newspapers. Collect continuously or once?
  • 7. From: World Map of Social Networks
  • 8. From: World Map of Social Networks
  • 9. What type of data is hard to get? Chat app data from, e.g., WhatsApp, WeChat, Kik, Facebook Messenger, Viber, Line, Telegram. Historical data older than 30 days. Geo-tagged data. Individual posts (public and private) from Facebook, Instagram, LinkedIn. Demographic variables, e.g., gender, age, income.
  • 10. Who can provide data? Online data providers: Gnip, Meltwater (DataSift, fairhair.ai), StockTwits, Twingly, webhose.io, Talkwalker, Socialgist, PublicNow, LexisNexis, Dow Jones, glean.info, DataStreamer, InfoNgen, News API, EventRegistry, Common Crawl. Crawler as a service: import.io, 80legs.com, Connotate, Promptcloud, Diffbot, Scrapy. Survey panels: Dynata (formerly SSI), CINT, Tawasol. Crowdsourced data: Appen (acquired Figure Eight in 2019), Amazon Mechanical Turk, Annotell.
  • 11. gnip.com • Twitter’s enterprise API platform - the only provider of tweets, including historical data. • Acquired by Twitter for $134M in 2014. • Firehose access to Twitter (500M tweets/day), WordPress (2-5M blogs and comments/day), Disqus. • Managed access to Facebook, YouTube, G+, Vimeo, VK, Reddit, Instagram etc.
  • 12. datasift.com • Facebook Topic Data provider: anonymized, aggregated actions from 1.7B users, across 60+ attributes. • Acquired by Meltwater in April 2018. • LinkedIn Engagement Insights: aggregated actions from 460M users, across 130 attributes. 22k interactions/minute, 11M posts. • Firehose access: WordPress, IntenseDebate, Tumblr, Disqus. • Managed access: G+, Instagram etc. • News articles: LexisNexis, NewsCred.
  • 13. Talkwalker • 187 languages. • 40M+ documents/day. • Multiple data enrichments. • Provides data from 150M sources: news, blogs, discussion boards, forums etc. • Monitoring/analytics is their primary business.
  • 14. Twingly • Blogs: 1.2M posts/day, 10k new blogs added/day. • Forum: 30M posts/day from 9k forums. • News: 3M stories/day, 135k sources, 100 countries. • Social Feed: Facebook Public Pages, approx. 17M posts/day. • Several categories, 35 languages, entities and sentiment for some languages.
  • 15. webhose.io • Sources: news, blogs, forums, reviews, e-commerce, dark web, broadcast (US TV & radio). • Enrichments: named entities, sentiment, categories, countries. • 80 languages. • Reasonably priced and easy to get going. • Live data with 30 days of history, and historical data going back to December 2014.
  • 16. Common Crawl • Non-profit organization. • ”… web crawl data that can be accessed and analyzed by anyone” and ”… years of free web page data…”. • 40 languages. • 8 years history, petabytes of data. • Raw data, metadata, text data.
  • 17. import.io ”Create your own datasets within minutes, no coding required.” Good for getting clean data from individual web sites, without a large overhead. We’ve used it for, e.g., hotel reviews, Glassdoor data. Acquired Connotate in 2019.
  • 18. diffbot.com ”Using AI, computer vision, machine learning and natural language processing, Diffbot provides software developers with tools to extract and understand objects from any web page.” Good for programmatic integration, large-scale extraction of web contents.
  • 19. cint.com Good for reaching and querying target audiences, based on a range of variables. Not a surveying company.
  • 20. Other types of data to complement online media data Open data: in science and government. Financial data: xignite.com. Data for sales and marketing: mixrank.com. Company web data: yipitdata.com. App usage data: 7parkdata.com.
  • 21. Other types of data to complement online media data Data set search engines: • Google Dataset Search • Microsoft Research Open Data • Quandl • Shovel AI Existing datasets, see: • The greatest Public Datasets for AI. • Awesome Public Datasets • Linguistic Data Consortium. • European Language Resources Association. • CLARIN Virtual Language Observatory.
  • 22. From: The New Gold Rush? Wall Street Wants your Data
  • 23. From: The New Gold Rush? Wall Street Wants your Data
  • 24. Data providers round-up Data is expensive! Think of it as an extra co-worker. Contracts with providers are mostly long-term: pay as you go makes data even more expensive. Usually large overhead in integration: requires a lot of the buyer. Content-based enrichments, e.g., entities, sentiment, not available in all languages.
  • 25. Data providers round-up Few vendors provide service level agreements concerning data coverage, or latency. None concerning quality. Web scraping is non-trivial: data always noisy, requires processing before use. Read the terms of service carefully. Example: Facebook, Twitter cut off data access for Geofeedia, a social media surveillance startup.
  • 26. Nuisances Data that matter will be harder for outsiders to get. Sharing data is hard due to its business value for data creators. This is inimical to the reproducibility of scientific results. Political factors impact the data landscape, e.g.: • China: How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, not Engaged Argument. Chinese govt. fabricated 448M comments. Affects representativity. • USA: Diehard Coders just Rescued NASA’s Earth Science Data and Empty search results at US govt Open Data site. Anecdotal evidence of US govt. reducing access to data.
  • 27. Data processing hygiene factors • Collect early, collect all (depending, of course, on GDPR). • Your data will be noisy: clean it. • Your initial hypotheses will be wrong: immerse yourself in data! • Data provenance: who touched it? What did they do? • Versioning of data and the software that processes it, e.g., pachyderm.io, DataKit. • Keep track of data characteristics, e.g., Great Expectations. • Facilitate collaboration: ”The most important collaborator is your future self.” • Strive for reproducibility: your data is an integral part. • Talk about data readiness: what can you expect to achieve with your data [1, 2]? Akin to NASA’s TRL.
  • 29. Continue with… Subscribe to newsletters: Data Elixir, Data Science Weekly, Data is Plural. Listen to: Raw Data, Data Skeptic. Use: metacurate.io
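
Slides 25 and 27 note that scraped data is always noisy, that provenance should be recorded, and that data characteristics should be tracked. A minimal Python sketch of those three steps follows, using only the standard library. All names here (`clean_text`, `ProvenanceRecord`, `check_characteristics`, the `min_chars` threshold) are hypothetical and chosen for illustration; this is not the API of Great Expectations, pachyderm.io, or any vendor mentioned in the talk, and the regex-based tag stripping is a crude stand-in for a proper HTML parser.

```python
import hashlib
import html
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Crude patterns: good enough for a sketch, not for arbitrary real-world HTML.
SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style)\b.*?</\1>")
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Drop script/style blocks and tags, decode entities, collapse whitespace."""
    without_scripts = SCRIPT_STYLE_RE.sub(" ", raw)
    without_tags = TAG_RE.sub(" ", without_scripts)
    return WHITESPACE_RE.sub(" ", html.unescape(without_tags)).strip()


def fingerprint(text: str) -> str:
    """Content hash used to identify one exact version of a document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class ProvenanceRecord:
    """Answers 'who touched it, and what did they do?' for one processing step."""
    source: str
    step: str
    actor: str
    sha256: str
    timestamp: str


def record_step(source: str, step: str, actor: str, text: str) -> ProvenanceRecord:
    """Create a provenance entry for the text as it looks after `step`."""
    return ProvenanceRecord(
        source=source,
        step=step,
        actor=actor,
        sha256=fingerprint(text),
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def check_characteristics(docs: list[str], min_chars: int = 20) -> list[tuple[int, str]]:
    """Return (index, problem) pairs for documents that are too short or duplicated."""
    problems = []
    seen = set()
    for index, doc in enumerate(docs):
        if len(doc) < min_chars:
            problems.append((index, "too short"))
        digest = fingerprint(doc)
        if digest in seen:
            problems.append((index, "duplicate"))
        seen.add(digest)
    return problems


if __name__ == "__main__":
    raw_page = "<p>Hello &amp; <b>world</b></p><script>x=1</script>"
    cleaned = clean_text(raw_page)
    entry = record_step("example.org/page", "clean_text", "alice", cleaned)
    print(cleaned)
    print(entry)
    print(check_characteristics([cleaned, cleaned], min_chars=5))
```

Hashing the cleaned text gives each processing step a stable identifier, so a later result can be traced back to the exact data version it was computed on. In practice the provenance records would be stored alongside the data, and the checks would likely be replaced by a dedicated tool of the kind the slides mention.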