We introduce four research questions that can be addressed using log files of online dictionaries:
(1) Are words that occur more frequently in everyday language also looked up more frequently in a dictionary? (2) Are polysemic words visited more frequently than monosemic words? (3) How can we investigate temporal effects on visiting frequency? (4) What portions of Wiktionary stay “in the dark” (i.e., are not visited at all or only very infrequently)? Almost all analyses of log file data require additional information, such as the corpus frequency of headwords or information that can be extracted from the dictionary article itself (e.g., the part of speech of the headword or its number of senses). We focus on the methodological side of the analyses and propose a quantitative view of the data. We also discuss the limitations we face when dealing with log file data.
2. • Lew (2015a): “Until fairly recently, dictionary users were not usually of
central concern in the process of dictionary making […].”
• Advantages of focusing on the user:
Discover the challenges users face when accessing and using dictionaries
user instruction, usability
Learn how users are working with the dictionary
Discover what users are interested in the most/least
Test preconceptions of the lexicographer about the users
User studies enable us to make better dictionaries.
RESEARCH INTO DICTIONARY USE
04.04.2016 ScotLex-1 - Müller-Spitzer & Wolfer - Log file analyses 2
Lew, R. (2015a). Dictionaries and their users. In P. Hanks & G.-M. De Schryver (Eds.), International Handbook of Modern
Lexis and Lexicography (pp. 1–9). Berlin/Heidelberg: Springer.
3. • Main aim: Collect empirical data to gain insights into dictionary usage
• Multiple methods of data collection:
(Web) questionnaires, eye tracking studies, usability studies,
log file analyses, …
• Choice of method depends on the research question we want to
address.
Lew, R. (2015b). Opportunities and limitations of user studies. In C. Tiberius & C. Müller-Spitzer (Eds.),
Research into dictionary use / Wörterbuchbenutzungsforschung. 5. Arbeitsbericht des wissenschaftlichen
Netzwerks „Internetlexikografie“ (Vol. 2/2015, pp. 6–16). Mannheim: Institut für Deutsche Sprache.
Retrieved from http://pub.ids-mannheim.de/laufend/opal/pdf/opal15-2.pdf
Müller-Spitzer, C. (2014). Using Online Dictionaries. Berlin, New York: De Gruyter.
RESEARCH INTO DICTIONARY USE
4. • Log files: records of search requests or article look-ups.
• Varying amount of information:
Minimum: Article ID, Timestamp
Possibly more: user information, article history, technical information (e.g.,
browser, device), ...
Some log files are already aggregated (e.g., per hour).
• Mind the legal framework of your country: What
kind of information are you allowed to use without
explicit user consent?
LOG FILE ANALYSES
5. • Bergenholtz, H., & Johnsen, M. (2005). Log Files as a Tool for Improving Internet Dictionaries. Hermes, 34,
117–141.
• Bergenholtz, H., & Johnsen, M. (2007). Log files can and should be prepared for a functionalistic approach.
Lexikos, 17, 1–21.
• Verlinde, S., & Binon, J. (2010). Monitoring Dictionary Use in the Electronic Age. In A. Dykstra & T. Schoonheim
(Eds.), Proceedings of the XIV Euralex International Congress (pp. 1144–1151). Ljouwert: Afûk.
• Hult, A.-K. (2012). Old and New User Study Methods Combined ‒ Linking Web Questionnaires with Log Files
from the Swedish Lexin Dictionary. In J. M. Torjusen & R. V. Fjeld (Eds.), Proceedings of the 15th EURALEX
International Congress 2012 (pp. 922–928). Oslo: Universitetet i Oslo, Institutt for lingvistiske og nordiske studier.
Retrieved from http://www.euralex.org/elx_proceedings/Euralex2012/pp922-928%20Hult.pdf
• Schoonheim, T., Tiberius, C., Niestadt, J., & Tempelaars, R. (2012). Dictionary Use and Language Games:
Getting to Know the Dictionary as Part of the Game. In R. Vatvedt Fjeld & J. M. Torjusen (Eds.), Proceedings of
the 15th EURALEX International Congress. 7-11 August 2012 (pp. 974–979). Oslo: Department of Linguistics and
Scandinavian Studies, University of Oslo.
• De Schryver, G.-M., Joffe, D., Joffe, P., & Hillewaert, S. (2006). Do dictionary users really look up frequent
words?—on the overestimation of the value of corpus-based lexicography. Lexikos, 16, 67–83.
• Koplenig, A., Meyer, P., & Müller-Spitzer, C. (2014). Dictionary users do look up frequent words. A log file
analysis. In C. Müller-Spitzer (Ed.), Using Online Dictionaries (pp. 229–250). Berlin, Boston: de Gruyter.
LOG FILE ANALYSES: PREVIOUS RESEARCH
6. • The Wikimedia Foundation provides log files for all
its sites, including all language editions
of Wiktionary.
https://dumps.wikimedia.org/other/pagecounts-raw/
STUDIES USING WIKTIONARY LOG FILES
• One file per hour with all projects.
• Approx. 66 GB (gzipped) per month.
7. DATA PREPARATION
Downloaded files → relevant rows (e.g., „de.d“) → daily aggregates → weekly aggregates → yearly aggregates
Additional information (some extracted from Wiktionary):
• part-of-speech
• # of senses
• headword frequency
• ...
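The filtering and aggregation steps above can be sketched as follows. This is a minimal sketch, assuming that each line of an hourly file has the space-separated form `project page_title view_count bytes` and that rows of the German Wiktionary carry the project code `de.d`. The function names and the sample rows are hypothetical and are not taken from the slides.

```python
from collections import Counter
from typing import Iterable


def filter_project(lines: Iterable[str], project: str = "de.d") -> Counter:
    """Sum view counts per page title for one project in one hourly file."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split(" ")
        # Assumed layout: project, page title, view count, bytes transferred.
        if len(parts) != 4 or parts[0] != project:
            continue
        try:
            counts[parts[1]] += int(parts[2])
        except ValueError:
            continue  # skip rows whose view count is not a number
    return counts


def aggregate(hourly: Iterable[Counter]) -> Counter:
    """Add up hourly counters into one daily (or weekly, yearly) counter."""
    total: Counter = Counter()
    for counter in hourly:
        total.update(counter)
    return total


# Hypothetical sample rows, not real log data.
hour_1 = ["de.d Haus 3 12000", "en.d house 7 30000", "de.d larmoyant 1 4000"]
hour_2 = ["de.d Haus 2 8000", "de Haus 9 99999"]
daily = aggregate(filter_project(h) for h in (hour_1, hour_2))
```

For real files, the lines could be read with `gzip.open(path, "rt", encoding="utf-8", errors="replace")`. The same `aggregate` function can then be applied again to daily counters to obtain weekly and yearly aggregates.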
9. • Are more frequent words visited more frequently?
• Are polysemic words visited more frequently than
monosemic words?
• How can we investigate temporal effects on visiting
frequency?
• What portions of Wiktionary stay „in the dark“
(i.e., are not visited at all or very seldom)?
• Database: German-language edition of Wiktionary
RESEARCH QUESTIONS
10. • If we compile a general dictionary from scratch, does it make
sense to include more frequent words first?
• Analyses of Wiktionary and DWDS log files suggest:
Yes, words that occur more frequently in everyday language are
also visited more frequently.
CORPUS AND LOOK-UP FREQUENCY
11. • Corpus frequency still matters even if the most frequent words
are excluded.
CORPUS AND LOOK-UP FREQUENCY
[Figure: the 10,000 most frequent words, followed by two selections labelled A and B (10,000 words randomly sampled from the rest; 10,000 most frequent words from the rest). Values shown: 34% and 56% successful searches.]
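One way to read the comparison above is as coverage: what share of all look-ups targets a headword that the dictionary actually contains? The following is a minimal sketch, assuming that look-up counts per headword and a corpus frequency list are available. All names and numbers are hypothetical, and this is not necessarily the procedure used in the studies cited.

```python
from typing import AbstractSet


def coverage(lookups: dict[str, int], headwords: AbstractSet[str]) -> float:
    """Share of look-ups that target a headword contained in `headwords`."""
    total = sum(lookups.values())
    if total == 0:
        return 0.0
    hits = sum(n for word, n in lookups.items() if word in headwords)
    return hits / total


def top_n(corpus_freq: dict[str, int], n: int,
          exclude: AbstractSet[str] = frozenset()) -> set[str]:
    """The n most frequent corpus words that are not in `exclude`."""
    ranked = sorted(
        (w for w in corpus_freq if w not in exclude),
        key=lambda w: corpus_freq[w],
        reverse=True,
    )
    return set(ranked[:n])


# Hypothetical toy data: corpus frequencies and look-up counts.
corpus_freq = {"und": 900, "Haus": 300, "gehen": 200, "Zebra": 5, "larmoyant": 1}
lookups = {"und": 10, "Haus": 40, "gehen": 30, "Zebra": 15, "larmoyant": 5}

core = top_n(corpus_freq, 2)                         # most frequent words
extended = core | top_n(corpus_freq, 1, exclude=core)  # next most frequent added
core_coverage = coverage(lookups, core)
extended_coverage = coverage(lookups, extended)
```

Replacing the second `top_n` call with a random sample from the remaining words (for example via `random.sample`) would give the other arm of the comparison.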
12. • Are polysemic words visited more often than monosemic
words?
• Challenge: Polysemic words are also more frequent. So, we have
to control for the effect of frequency just shown.
POLYSEMIC AND MONOSEMIC WORDS
[Figure legend: monosemic, polysemic]
13. POLYSEMIC AND MONOSEMIC WORDS
• Effect of frequency still visible.
• Effect of polysemy
• Interaction effect: The polysemy contrast tends to be more pronounced in
higher frequency bands (especially in the highest decile).
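Controlling for frequency can be sketched by comparing monosemic and polysemic headwords within the same frequency band. This is a minimal sketch, assuming that each headword comes with a corpus frequency, a number of senses, and a look-up count. Equal-sized rank bands stand in for deciles, the field names are hypothetical, and the analysis behind the slides may have used a different model.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class Entry(NamedTuple):
    headword: str
    corpus_freq: int
    senses: int
    lookups: int


def mean_lookups_by_band(entries: list[Entry],
                         bands: int = 10) -> dict[tuple[int, bool], float]:
    """Mean look-ups per (frequency band, is_polysemic) cell.

    Band 0 holds the least frequent headwords, band `bands - 1` the most
    frequent ones. A headword counts as polysemic if it has more than one sense.
    """
    ranked = sorted(entries, key=lambda e: e.corpus_freq)
    cells: dict[tuple[int, bool], list[int]] = defaultdict(list)
    for rank, entry in enumerate(ranked):
        band = rank * bands // len(ranked)
        cells[(band, entry.senses > 1)].append(entry.lookups)
    return {cell: mean(values) for cell, values in cells.items()}


# Hypothetical toy entries, split into two bands instead of ten.
toy = [
    Entry("a", 1, 1, 2),
    Entry("b", 2, 2, 4),
    Entry("c", 10, 1, 10),
    Entry("d", 20, 3, 30),
]
cells = mean_lookups_by_band(toy, bands=2)
```

Comparing the `True` and `False` cells within one band isolates the polysemy contrast; comparing the size of that contrast across bands corresponds to the interaction effect.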
14. • If we want to extract temporary effects, we have to
take time into consideration.
Interactive visualisation (German Wiktionary, more to come):
http://www.owid.de/plus/wikivi2015/
• We employed a trend-residualisation technique.
Calculate the current trend of visitation frequency.
Calculate the deviations from this trend („residuals“) at
specific points in time.
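The two steps can be sketched as follows. This is a minimal sketch, assuming that the trend is estimated as a trailing moving average over the preceding time steps. The slides do not say how the trend was computed, so the window size and the function name are hypothetical.

```python
from statistics import mean
from typing import Optional


def trend_residuals(counts: list[float], window: int = 4) -> list[Optional[float]]:
    """Deviation of each value from the mean of the `window` values before it.

    The first `window` positions have no complete history and yield None.
    """
    residuals: list[Optional[float]] = []
    for i, value in enumerate(counts):
        if i < window:
            residuals.append(None)
        else:
            trend = mean(counts[i - window:i])  # trend from preceding values only
            residuals.append(value - trend)
    return residuals


# Hypothetical weekly look-up counts with one spike in the fifth week.
weekly = [10, 12, 11, 11, 95, 13]
res = trend_residuals(weekly, window=4)
```

A large positive residual marks a point in time where a headword was visited far more often than its recent trend suggests. Note that in this simple variant the spike itself enters the trend of the following steps, so a more robust estimator (for example a median) might be preferable.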
TEMPORARY EFFECTS ON LOOK-UP FREQUENCY
19. TEMPORARY EFFECTS: ‚LARMOYANT‘
„Der ist jetzt aber richtig sauer. Das passt dem gar
nicht. Und wenn ich das richtig deute, blickt er da eher
Richtung Toni Kroos. Das ist ihm ein bisschen zu
larmoyant... und ... der ist vielleicht noch eher im
Freundschaftsspielmodus …“
“He is really peeved now. That really doesn’t suit him. And
if I interpret this correctly, he is looking in the direction
of Toni Kroos. That’s a little too lachrymose for him.
And ... maybe he’s more in exhibition mode …”
20. • How many and which articles are not visited at all?
We consider the years 2013, 2014 and 2015.
Account for the fact that the number of articles is rising.
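Accounting for the rising number of articles can be sketched by judging each article only in the years in which it already existed. This is a minimal sketch, assuming that a creation year per article and a set of visited titles per year are available. All names and data are hypothetical.

```python
def never_visited(created: dict[str, int], visited: dict[int, set[str]]) -> set[str]:
    """Articles with no visit in any observed year in which they already existed."""
    years = sorted(visited)
    dark = set()
    for title, year_created in created.items():
        relevant = [y for y in years if y >= year_created]
        # Articles created after the last observed year are not judged at all.
        if relevant and all(title not in visited[y] for y in relevant):
            dark.add(title)
    return dark


def dark_share(created: dict[str, int], visited: dict[int, set[str]], year: int) -> float:
    """Share of the articles existing in `year` that were not visited in `year`."""
    existing = [t for t, y in created.items() if y <= year]
    if not existing:
        return 0.0
    seen = visited.get(year, set())
    unvisited = [t for t in existing if t not in seen]
    return len(unvisited) / len(existing)


# Hypothetical toy data: creation years and visited titles per year.
created = {"Haus": 2010, "larmoyant": 2012, "casa": 2014, "Hauses": 2015, "neu": 2016}
visited = {2013: {"Haus"}, 2014: {"Haus", "larmoyant"}, 2015: {"Haus"}}
dark = never_visited(created, visited)
```

Dividing by the number of articles that existed in the respective year, as in `dark_share`, keeps the yearly shares comparable even though the dictionary keeps growing.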
THE DARK SIDE OF WIKTIONARY
21. THE DARK SIDE OF WIKTIONARY
?
22. • Approx. 25,000 articles were not visited
during 2013, 2014 and 2015.
Mostly newer
Mostly non-German
German idioms
Inflected forms
THE DARK SIDE OF WIKTIONARY
23. • Log files are well suited to investigate effects on the
„macro user“ level:
Corpus frequency and look-up frequency
Polysemy and look-up frequency
Temporary effects
„Dark side“ of dictionaries
Collaborative dictionaries: Look-up and revision frequency
…
SUMMARY
24. • Lew (2015b: 11-12): „[…] we need to be aware of the limitations of
the approach.
One such limitation is that server log files will rarely tell us what the context
of dictionary use is:
what activity the user is involved in,
what particular problem they are trying to solve,
and the levels of success and satisfaction achieved in the consultation.
Nothing is known about the user, either, such as their age, languages spoken,
proficiency in them, or professional background. […]
Issues of data privacy can also be a limiting factor in log file analysis.“
OUTLOOK / LIMITATIONS
25. • Little can be inferred from a small number of log file
events.
Research based on individual cases is virtually impossible.
Log file analyses work best if many cases are available for
longer periods.
Quantitative methods
• Log files might be integrated with other methodologies
to gain an even broader insight into dictionary usage.
Test hypotheses generated by log file analyses with methods
that assess individual performances or preferences.
OUTLOOK / LIMITATIONS