Timo Honkela: Semantic and pragmatic representations of large text corpora
1. Timo Honkela, FIN-CLARIN Jubilee seminar, 9.6.2016
Timo Honkela
FIN-CLARIN Jubilee Seminar and
Nordic CLARIN Network Seminar
University of Helsinki, 9 Jun 2016
Semantic and pragmatic
representations
of large text corpora
timo.honkela@helsinki.fi
2.
Agenda
● Digital humanities in Finland
● Strategic role of humanities and
social sciences
● Research using text corpora
3.
Digital humanities in Finland
● Research in humanities and social sciences is
increasingly using digitally stored resources
and computational analysis tools
5.
Varieng - Research Unit for the Study of
Variation, Contacts and Change in English
Big Data, Rich Data,
Uncharted Data
19–22 October 2015
Helsinki, Finland
Terttu Nevalainen
Irma Taavitsainen
Tanja Säily
et al.
http://www.helsinki.fi/varieng/
http://www.helsinki.fi/varieng/people/varieng_saily.html
6.
Multilingual
language technology
Jörg Tiedemann
Mathias Creutz
et al.
7.
Text mining historical newspapers
Mikko Tolonen
Kimmo Kettunen
8.
Citizen Mindscapes
Analysis of large social media corpora
in order to increase understanding of
social and societal phenomena
9.
Educational efforts:
e.g. Digital Humanities Hackathon
11.
In many such research efforts and
educational activities, FIN-CLARIN
serves as an essential resource
and infrastructure.
Let's celebrate and
have a moment
of applause
http://375humanistia.helsinki.fi/en/humanists/kimmo-koskenniemi
12.
Complexity associated with
different areas of science
Biological phenomena
Physical phenomena
Cultural phenomena
13.
Importance of
humanities and social sciences
● As surprising as it may at first sound, one can
claim that humanities and social sciences are
the most important sciences
● These disciplines deal with topics like language
and communication, social condition, historical
developments, economy, etc.
● Due to the complexity of these phenomena, research
in these areas is challenging; generalizations of the
kind commonplace in physics are rarely possible
14.
Understanding
the phenomena
Theory and
knowledge
formation
Qualitative Quantitative
Open data:
corpora
Open
methods
Computational
resources
15.
Lars Borin
Linguistics has
been the first
e-science
16.
Challenges:
“Language is BIG”
“Human INTERPRETATION is
inherently involved”
Importance of language:
“Language is involved in most
relevant human activities”
17.
Example:
Complexity of
Finnish at the
level of word
forms
Kimmo Koskenniemi (2013):
Johdatus kieliteknologiaan,
sen merkitykseen ja sovelluksiin
(Introduction to language
technology, its significance and
applications)
https://helda.helsinki.fi/bitstream/handle/10138/38503/kt-johd.pdf?sequence=1
18.
> 6000 languages,
many more dialects
Billions of people
blogs.state.gov
en.wikipedia.org
A large number of
different cultures
en.wikipedia.org
A vast number of ways to relate
language, concepts and
the world to each other
19. Simulating processes of language emergence and communication
Language as a system
● Considering natural language as a signal and dynamic
system at cognitive and social levels (also in its written
form) rather than a symbolic and logical system
● Importance of embodiment (cf. e.g. Harnad) and
embeddedness (cf. e.g. Edelman)
● Learning and pattern recognition processes are
essential (as opposed to the theories presented e.g. by
Chomsky, Fodor, Pinker); much of the learning is bound
to be unsupervised
21.
Complexity of language
regarding different areas and levels
Structure:
morphology and syntax
Meaning:
semantics and
pragmatics
What are the nature,
granularity, type,
metadata, etc. of the
data needed for different
research purposes in different
areas of linguistics and
other areas of humanities
and social sciences?
23.
Need to
harmonize,
build shared
terminologies,
theories,
frameworks, etc.
Need to model
contextuality,
ambiguity, vagueness,
history-dependence,
change,
etc.
The same medium, language, is
the object of study as well as the
basis for theory formation,
representing the ideas and resources, etc.
24.
Philosophy of science
is essential to
understand what
is going on...
Data-driven
inductive mode
Hypothesis-driven,
deductive mode
25.
An old research example:
Data-driven emergence
of implicit word
categories that match with
human syntactic
and semantic intuitions
26.
Classical example: Learning meaning from context:
Maps of words in Grimm fairy tales
Honkela, Pulkki & Kohonen 1995
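The idea of learning meaning from context can be illustrated with a minimal sketch. It is not the method of the cited work, which organized word contexts on a self-organizing map; here the map is replaced by a plain cosine-similarity comparison. The toy corpus, the window size and the function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens, window=1):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus, invented for illustration (not the Grimm fairy tales data).
tokens = ("the king rode home . the queen rode home . "
          "the king sang loudly . the queen sang loudly .").split()

vecs = context_vectors(tokens)
sim_king_queen = cosine(vecs["king"], vecs["queen"])
sim_king_home = cosine(vecs["king"], vecs["home"])
print(sim_king_queen, sim_king_home)
```

In this toy corpus "king" and "queen" occur in identical contexts, so their similarity is higher than that of "king" and "home", even though no category labels were given.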
27.
Research example:
Multimodally
grounded
models
of meaning
28.
Labeling movements: Associating
high-dim. kinesthetic time series
with linguistic labels
Förger & Honkela 2014
30.
Research example:
Tensor-based analysis of
subjective aspect
of interpreting linguistic
expressions
31.
GICA: Grounded Intersubjective
Concept Analysis
Honkela et al. 2012
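A tensor-based view of subjective interpretation can be sketched as follows, assuming the data form a three-way array of subjects × items × contexts. This is one plausible reading of the slide, not the published GICA procedure. The ratings, the subject and context names, and the helper functions are all hypothetical.

```python
from math import sqrt

# Hypothetical ratings (scale 0-5), invented for illustration:
# ratings[subject][item][context]
ratings = {
    "subject_a": {"health": {"exercise": 5, "food": 4, "work": 1}},
    "subject_b": {"health": {"exercise": 2, "food": 5, "work": 4}},
    "subject_c": {"health": {"exercise": 5, "food": 3, "work": 1}},
}
contexts = ["exercise", "food", "work"]


def flatten(tensor, contexts):
    """Unfold the 3-way array into rows keyed by (subject, item)."""
    rows = {}
    for subject, items in tensor.items():
        for item, by_context in items.items():
            rows[(subject, item)] = [by_context[c] for c in contexts]
    return rows


def distance(u, v):
    """Euclidean distance between two context profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


rows = flatten(ratings, contexts)
d_ac = distance(rows[("subject_a", "health")], rows[("subject_c", "health")])
d_ab = distance(rows[("subject_a", "health")], rows[("subject_b", "health")])
print(d_ac, d_ab)
```

Each row of the flattened matrix is one subject's view of one item across the contexts, so distances between rows give a rough measure of how differently two people use the same word.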
32.
Analysis of the word 'health'
Honkela et al. 2012
33.
Ideas for building corpora
● Expansion of the contextual framework
● Enriching metadata
● Increasing multimodal data sources
that associate linguistic data with other
modalities
● Involving a large number of people
in labeling data to model variation
● Collecting data in real world contexts