Search engines for the humanities that go beyond Google
Suzan Verberne
Centre for Language and Speech Technology, Radboud University Nijmegen
Brainstorm Meeting e-Humanities, March 29, 2011

Outline
- Searching with Google
- Limitations of Google search
- Searching in text collections
- Better guidance through texts
- What technology is needed?

Searching with Google

How does Google work?
Diagram labels: query, index, relevance model.

How does Google work?
Google calculates the relevance of web pages using word counts and popularity estimates. So Google does not ‘understand’ the texts it sees; it efficiently estimates a document’s relevance based on the words it contains. This is very effective and efficient for retrieving full documents (web pages).

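As a rough illustration of relevance from word counts and popularity, here is a minimal Python sketch. It is not Google's actual formula: the inverse-document-frequency weighting, the way popularity is multiplied in, the function names and the three toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def score(query, doc_text, popularity, doc_freq, n_docs):
    """Sum of (term count * rarity weight), boosted by a popularity estimate."""
    counts = Counter(tokenize(doc_text))
    total = 0.0
    for term in tokenize(query):
        # Rare terms weigh more than terms that occur in many documents.
        idf = math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1))
        total += counts[term] * idf
    # Assumed combination: popularity in [0, 1] scales the word-based score.
    return total * (1.0 + popularity)


def rank(query, docs):
    """docs maps a document id to (text, popularity); returns ids, best first."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text, _ in docs.values():
        doc_freq.update(set(tokenize(text)))
    scored = {
        doc_id: score(query, text, popularity, doc_freq, n_docs)
        for doc_id, (text, popularity) in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical toy collection: (text, popularity estimate).
DOCS = {
    "a": ("multatuli wrote max havelaar", 0.9),
    "b": ("max havelaar is a novel", 0.1),
    "c": ("coffee auctions in amsterdam", 0.5),
}

if __name__ == "__main__":
    print(rank("multatuli havelaar", DOCS))
```

Note that nothing in this scoring looks at meaning: a document ranks highly only because it contains the query words and is popular, which is the point the slide makes.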
Limitations of Google
But what if I have a more specific information need:
- Which books did Multatuli write?
- How did other writers respond to Multatuli’s work?
- To which events did Multatuli refer in ‘Max Havelaar’?
Then I need:
- a more specialized text collection than the web
- a search engine that guides me through the retrieved documents

Specialized text collection: DBNL
DBNL: The Digital Library of Dutch Literature. A website about Dutch literature, language and cultural history. It contains literary texts, secondary literature and additional information such as biographies, portraits and hyperlinks.
http://www.dbnl.org

Searching in DBNL
- 4 pages of results, not sorted by relevance.
- The document is 6 pages long.
- Only the query term is highlighted.

Better guidance through texts
Step 1: Label important terms and entities in the text:
- Person and place names
- Book and journal titles
- Events
- Other terms of interest
This task is called ‘named entity recognition’. It is well developed in the field of computational linguistics.

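The labelling step can be sketched with a simple name-list lookup. This is only an illustration: practical named entity recognizers are usually statistical models, and the gazetteer contents, label names and function name below are assumptions, not part of the talk.

```python
import re

# Hypothetical gazetteer: known names and the label to attach to each.
GAZETTEER = {
    "Multatuli": "PERSON",
    "Max Havelaar": "BOOK",
    "Vaderlandsche Letteroefeningen": "JOURNAL",
}


def label_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form, label) for every known name in text."""
    # Try longer names first so that a long name wins over a shorter prefix.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return [
        (m.start(), m.end(), m.group(0), gazetteer[m.group(0)])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for entity in label_entities("Multatuli published Max Havelaar in 1860."):
        print(entity)
```

A lookup like this only finds names it already knows; recognizing unseen names and resolving ambiguous ones is what trained recognizers add.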
Better guidance through texts
Labels shown on the example text: journal title, book title, person name, person name.

Better guidance through texts
Step 2: Collect information about entities in the text:
- Factual information: what is it and to whom does it relate?
- Links to external sources (biographies, encyclopaedias)
- Links to other mentions in the collection
Automatically collecting large amounts of factual information is a current research topic in computational linguistics and artificial intelligence.

Better guidance through texts
Vaderlandsche Letteroefeningen was for more than a century one of the leading literary-cultural journals in the Netherlands. It appeared monthly. The last issue came off the press in December 1876. Its aim was first and foremost to point readers to useful publications, both recent works and books that had appeared long ago and no longer received attention.
http://www.kb.nl/dossiers/vaderlandscheletteroefeningen/

Better guidance through texts
Max Havelaar, of de koffij-veilingen der Nederlandsche Handel-Maatschappij is a novel by Multatuli, published in 1860. The book is about a man who tries to fight the corrupt government system of the Dutch East Indies; it would go on to have great influence on Dutch literature and also on Dutch colonial politics. Max Havelaar is regarded as one of the most important works of Dutch literature.
http://nl.wikipedia.org/wiki/Max_Havelaar_(boek)

Collecting facts from text
- Dutch Wikipedia: 678,683 articles (March 2011)
- Articles are categorized by topic
- Number of articles about Dutch writers: 439

Collecting facts from text
- Split the texts into sentences.
- Analyze the sentences with a parser that indicates the most important syntactic parts of each sentence.
- Generate (nuclear) facts from the syntactic analysis:
SUBJECT | VERB | OBJECT/PREDICATE | COMPLEMENTS
Multatuli | write | Max Havelaar | in 1860, in Java

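The first step in this pipeline, sentence splitting, can be approximated with a regular expression. The sketch below is a naive stand-in under stated assumptions (a sentence ends at `.`, `!` or `?` followed by whitespace and a capital letter); the function name is hypothetical and the talk does not say which splitter was used.

```python
import re


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    print(split_sentences("Multatuli wrote Max Havelaar. It appeared in 1860."))
    # Initials are a known weakness of this rule: the name is cut in two.
    print(split_sentences("P.F. Thomése received awards."))
```

The second call shows why real splitters need abbreviation handling: the initials in a name such as P.F. Thomése are mistaken for a sentence boundary.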
Collecting facts from text
Hans Dekkers
http://nl.wikipedia.org/wiki/Hans_Dekkers_(1954)
“Hij schrijft romans, korte verhalen, gedichten en theaterstukken”
“He writes novels, short stories, poems and plays”
Factoids:
hij | schrijven | theaterstukken | |
hij | schrijven | gedichten | |
hij | schrijven | romans | |
hij | schrijven | korte verhalen | |

Collecting facts from text
P.F. Thomése
http://nl.wikipedia.org/wiki/P.F._Thom%C3%A9se
“In 1991 en 2003 ontving hij literaire prijzen.”
“In 1991 and 2003, he received literary awards.”
Factoids:
hij | ontvangen | literaire prijzen | in 1991, in 2003 |

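The two examples above suggest how factoids follow from a parsed clause: one factoid per coordinated object, with the complements attached to each. The sketch below assumes the parser output is already available as a small dictionary; that input format, the `Factoid` class and the function name are hypothetical, since the talk does not describe the parser's actual output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factoid:
    subject: str
    verb: str  # lemma of the main verb
    obj: str
    complements: tuple = ()

    def as_row(self):
        """Render in the SUBJECT | VERB | OBJECT | COMPLEMENTS notation."""
        fields = [self.subject, self.verb, self.obj, ", ".join(self.complements)]
        return " | ".join(fields)


def factoids_from_clause(clause):
    """Emit one factoid per coordinated object in an (assumed) parsed clause."""
    complements = tuple(clause.get("complements", ()))
    return [
        Factoid(clause["subject"], clause["verb"], obj, complements)
        for obj in clause["objects"]
    ]


# Hand-written stand-ins for parser output on the two example sentences.
DEKKERS = {
    "subject": "hij",
    "verb": "schrijven",
    "objects": ["romans", "korte verhalen", "gedichten", "theaterstukken"],
}
THOMESE = {
    "subject": "hij",
    "verb": "ontvangen",
    "objects": ["literaire prijzen"],
    "complements": ["in 1991", "in 2003"],
}

if __name__ == "__main__":
    for clause in (DEKKERS, THOMESE):
        for factoid in factoids_from_clause(clause):
            print(factoid.as_row())
```

In both examples the subject is still the pronoun hij; linking it back to the writer the article is about is a separate step that this sketch does not attempt.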
Better guidance through texts
Step 3: Enrich the text collection with this factual information. When the user clicks one of the labelled terms, the most important factual information is shown, together with links to sources.

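One way to picture this enrichment step is a lookup from a labelled term to a short info card. The sketch below is an assumption about how such a store could look, not the design from the talk; the `KNOWLEDGE` dictionary and `info_card` function are hypothetical, and the single entry reuses the Max Havelaar facts and source link from the earlier slide.

```python
# Hypothetical store of collected facts, keyed by the labelled term.
KNOWLEDGE = {
    "Max Havelaar": {
        "summary": "Novel by Multatuli, published in 1860.",
        "sources": ["http://nl.wikipedia.org/wiki/Max_Havelaar_(boek)"],
    },
}


def info_card(term, knowledge=KNOWLEDGE):
    """Return the text to show when a labelled term is clicked, or None."""
    entry = knowledge.get(term)
    if entry is None:
        return None
    lines = [term, entry["summary"]]
    lines += ["Source: " + url for url in entry["sources"]]
    return "\n".join(lines)


if __name__ == "__main__":
    print(info_card("Max Havelaar"))
```

Returning `None` for unknown terms lets the interface leave a term unlinked instead of showing an empty card.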
Better guidance through texts
Example: clicking the labelled title Max Havelaar shows its summary (a novel by Multatuli, published in 1860, regarded as one of the most important works of Dutch literature) together with the source link http://nl.wikipedia.org/wiki/Max_Havelaar_(boek)

How to proceed?
There are multiple initiatives (also in the Netherlands) to develop the described techniques.
Challenges:
- What are the needs of the target group? Collaboration is essential.
- Older varieties of Dutch: development of resources and tools is needed (some already exist).
- User interfacing is very important: specialist knowledge is needed.
- …

Thank you!
You can find more information on my web site (Google my name and you will get there).
