Conceptual text mining
Pim Huijnen
Utrecht University & University of Sheffield Digital Humanities Workshop
May 12, 2016
What to do with 11 million newspaper
pages?
a) distant reading
In: Het Centrum, 10 October 1919, p. 4.
b) finding the needle
in the haystack
In: Het Volk: Dagblad voor de
arbeiderspartij, 29 January 1921, p. 3.
How to know what to look at?
1 How to define a concept?

Efficiency ≠ “efficiency”
Eugenetica ≠ “eugenetica” +
“eugenetiek” + “eugeniek” + “rassenleer”

2 How to study its changing uses,
contexts, and meaning over time?
1) Eugenics
* topic modeling newspaper articles containing "eugenics"
* using meaningful words to look for eugenics without
“eugenics”
* in the given example: querying Texcavator with
‘regulation AND health AND race’ (575 results)
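
The query in the last bullet can be read as a Boolean AND over the words of each article. The sketch below illustrates this on made-up English snippets. The `matches_all` helper and the sample texts are hypothetical, and nothing here is meant to describe how Texcavator executes queries internally.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def matches_all(text, terms):
    """True if every query term occurs in the text (Boolean AND)."""
    tokens = tokenize(text)
    return all(term.lower() in tokens for term in terms)

# Made-up snippets standing in for newspaper articles.
articles = [
    "Regulation of marriage is needed for the health of the race.",
    "The horse race was cancelled because of new regulation.",
    "Public health and the future of the race demand regulation.",
]

query = ["regulation", "health", "race"]
hits = [a for a in articles if matches_all(a, query)]
print(len(hits), "of", len(articles), "articles match")
```

None of the matching snippets contains the word "eugenics", which is the point of querying with meaningful context words instead of the term itself.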
Texcavator
* plotting the results on a time scale (relative to the total
number of articles per year)
* extracting distinctive words from the query results per year
(tf-idf)
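
Both steps can be sketched in a few lines of Python. The data below are invented toy values. The tf-idf formula is one common variant (term frequency times the logarithm of inverse document frequency), with all hits from one year treated as a single document. Whether Texcavator uses this exact weighting is an assumption.

```python
import math
from collections import Counter

# Hypothetical toy data: (year, text) pairs for the query hits, plus the
# total number of articles in the whole corpus for each year.
hits = [
    (1919, "race health regulation marriage"),
    (1919, "health regulation housing"),
    (1921, "race health regulation sterilisation"),
]
totals_per_year = {1919: 1000, 1921: 500}

# 1) Relative frequency: hits per year divided by all articles of that year.
hits_per_year = Counter(year for year, _ in hits)
relative = {y: hits_per_year[y] / totals_per_year[y] for y in totals_per_year}

# 2) tf-idf, treating all hits from one year as a single document.
docs = {}
for year, text in hits:
    docs.setdefault(year, Counter()).update(text.split())

def tf_idf(word, year):
    """Term frequency in one year times log inverse document frequency."""
    tf = docs[year][word] / sum(docs[year].values())
    df = sum(1 for counts in docs.values() if word in counts)
    return tf * math.log(len(docs) / df)

def distinctive(year, n=3):
    """The n highest-scoring words of a year (ties broken alphabetically)."""
    scores = {w: tf_idf(w, year) for w in docs[year]}
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]

print(relative)
print(distinctive(1919, 2), distinctive(1921, 1))
```

Words that occur in every year score zero, so only the vocabulary specific to a year rises to the top. Dividing by the yearly totals keeps a growing corpus from looking like growing interest in the topic.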
2) Scientific management
* using close reading to find all significant Dutch
equivalents for “scientific management”
* extracting the results, dividing them per year and uploading
them to Voyant Tools
* studying the changing vocabulary in the subset over time
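
The per-year split in the second bullet could look like the following sketch, which writes one plain-text file per year so that each file can be uploaded as a separate document. The sample results, the `split_per_year` helper and the file naming are hypothetical.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical query results: (year, article text).
results = [
    (1919, "wetenschappelijke bedrijfsleiding in de fabriek"),
    (1919, "het taylorstelsel en de arbeider"),
    (1921, "taylorisme en efficiency"),
]

def split_per_year(results, out_dir):
    """Write one plain-text file per year and return the paths by year."""
    by_year = defaultdict(list)
    for year, text in results:
        by_year[year].append(text)
    paths = {}
    for year, texts in sorted(by_year.items()):
        path = Path(out_dir) / f"{year}.txt"
        path.write_text("\n".join(texts), encoding="utf-8")
        paths[year] = path
    return paths

out_dir = tempfile.mkdtemp()  # temporary directory, for illustration only
paths = split_per_year(results, out_dir)
print(sorted(p.name for p in paths.values()))
```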
Scientific management query
“wetenschappelijke bedrijfsleiding” (233)
“wetenschappelijke bedrijfsorganisatie” (216)
“wetenschappelijke bedrijfsvoering” (32)
“scientific management” (28)
‘taylorstelsel OR taylor-stelsel’ (330)
‘taylorsysteem OR taylor-systeem’ (369)
‘taylorisme’ (42)
Combining these in a single query results in 1175 hits
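
The seven individual counts add up to 1250, more than the 1175 hits of the combined query. Presumably an article that uses several of the terms is counted only once when the terms are joined with OR. The sketch below checks the arithmetic and builds such a combined query string. Quoting phrases with straight double quotes is an assumption about the query syntax, and the article ids in the small example are invented.

```python
# Hit counts per query, as listed above.
counts = {
    "wetenschappelijke bedrijfsleiding": 233,
    "wetenschappelijke bedrijfsorganisatie": 216,
    "wetenschappelijke bedrijfsvoering": 32,
    "scientific management": 28,
    "taylorstelsel OR taylor-stelsel": 330,
    "taylorsysteem OR taylor-systeem": 369,
    "taylorisme": 42,
}
combined_hits = 1175

total = sum(counts.values())
overlap = total - combined_hits
print(total, overlap)  # 1250 75

# Build one OR query: multi-word phrases are quoted, single words are not.
phrases = [q for q in counts if " OR " not in q and " " in q]
words = [w for q in counts if q not in phrases for w in q.split(" OR ")]
query = " OR ".join([f'"{p}"' for p in phrases] + words)
print(query)

# Why the union is smaller than the sum: invented article ids per term.
hits_a = {1, 2, 3}
hits_b = {3, 4}
print(len(hits_a) + len(hits_b), len(hits_a | hits_b))  # 5 4
```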
The third way: distributional semantics
* Our implementation combines a) creating dictionaries and
b) tracing meaning over time in a single workflow
* by finding ‘most similar words’ (i.e. words with similar vectors /
words used with similar meaning in sentences)
* Use the cluster of most similar words from one ten-year time period
to find the most similar words in the next (and partly overlapping)
time frame
* Trace word use of concepts over time without being
dependent on single terms or predefined dictionaries
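
The idea of carrying a cluster of similar words from one time frame into the next can be sketched as follows. This is a toy illustration, not the actual implementation. The two-dimensional vectors are invented, real models would be trained on the newspapers of each period, and averaging the seed vectors and ranking by cosine similarity are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(seeds, vectors, n=2):
    """The n words closest to the mean vector of the seeds known in this period."""
    known = [vectors[s] for s in seeds if s in vectors]
    if not known:
        return []
    centre = [sum(dim) / len(known) for dim in zip(*known)]
    ranked = sorted(vectors, key=lambda w: (-cosine(vectors[w], centre), w))
    return ranked[:n]

# Invented word vectors for two partly overlapping ten-year periods.
models = {
    "1910-1920": {"efficiency": [1.0, 0.1], "taylorstelsel": [0.9, 0.2],
                  "paard": [0.0, 1.0]},
    "1915-1925": {"taylorstelsel": [1.0, 0.0], "rationalisatie": [0.9, 0.1],
                  "paard": [0.1, 1.0]},
}

def trace(seed, models, n=2):
    """Follow a concept through the periods, reseeding with each result."""
    seeds, result = [seed], {}
    for period, vectors in models.items():
        seeds = most_similar(seeds, vectors, n) or seeds
        result[period] = seeds
    return result

vocab = trace("efficiency", models)
print(vocab)
```

In the toy data the seed word "efficiency" is absent from the second period, yet the concept is still followed through "taylorstelsel" to a new neighbour, which is what makes the approach independent of single terms.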
Shico
