2. AARHUS
UNIVERSITET
WHAT DOES A NATIONAL WEB TALK ABOUT? — DIGGING INTO BILLIONS OF WORDS, THE DANISH CASE
NIELS BRÜGGER
4 NOVEMBER 2021
AGENDA
›the project
›the data
›the analytical design — four approaches
›the methods
›next steps
THE PROJECT
›RQ: What has the Danish web talked about, where
have specific topics 'lived', and how has this
developed?
›Empirical result: A mapping of the textual web
landscape 2006-2016
›Methodological result: Develop and test methods for
large scale textual analysis
[Figure: ten items, each labelled 'Topic x']
THE DATA
›Corpus extracted from the Danish web archive
Netarkivet
›One 'time slice' from each year
›Versions removed, one version of each web domain
›Part of the larger study of the Danish web, started
some years ago
›Read about the first digs in Brügger, N., Nielsen, J.,
Laursen, D. (2020). Big data experiments with the
archived Web: Methodological reflections on
studying the development of a nation's Web, First
Monday, 25(3)
THE DATA
Acknowledgements:
›corpus extraction and selection of versions: Janne
Nielsen, Ditte Laursen, Ulrich Have, Per Møldrup-
Dalum
›initial calculations of words: Janne & Ulrich
›selected events to be studied: Ditte & me
›Textual analyses: Kristoffer Nielbo, Peter Vahlstrup,
the Centre for Humanities Computing Aarhus
(CHCAA, chcaa.io)
THE DATA
The corpus:
›all words on the Danish web as
it has been archived by
Netarkivet,
›in one time slice from each
year, 2006 to 2016
›language recognition
performed, only analyse Danish
words
THE ANALYTICAL DESIGN — FOUR
APPROACHES
›First, we will follow the talk of the web itself by
identifying a number of significant words among the
most used words per year (minus stop words)
›e.g. 10-20 words chosen from the 1,000 most used
words
rank word n
1 til 735099980
2 med 444560089
3 for 396849909
4 det 394031045
5 der 273008926
6 den 267696098
7 har 251044875
8 kan 225782096
9 ikke 214874628
10 som 205071867
11 jeg 204010581
12 fra 199932948
13 mere 159701573
14 alle 134942476
15 kontakt 125492896
16 eller 124463101
17 dkk 119472355
18 din 112618070
19 her 104361397
20 2016 104005585
21 skal 102123933
22 ved 100361724
23 efter 95686049
24 pris 94400983
25 2015 93486153
26 men 90079959
27 læs 87427063
28 man 82128864
29 vil 80107116
30 vores 79158246
31 dig 77971690
32 også 76196557
33 var 75266165
34 dette 71606610
35 fragt 69648534
36 min 67659330
37 ind 67579071
38 søg 67341460
39 produkter 66794635
40 forside 65254918
41 2014 65073113
42 siden 64703860
43 the 64441149
44 tilbehør 62739776
45 hvis 60727133
› most used words 2016, stop words
included
› lots of prepositions, pronouns, etc.
› also a few indicating commercial
websites: dkk (Danish kroner), pris (price),
fragt (freight), produkter (products);
these could be used to identify where
trade takes place
› other interesting words: nyheder (news, no
53), indlæg (comment, no 54), børn
(children, no 105), cookies (no 162,
interesting to see development), blog (no
212), spil (games, no 233), sport (no 237),
and many more about trade: tilbud
(offer), kurv (basket), køb (buy), læg (put)
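A ranking like the one above, with or without stop words, can be sketched in a few lines of Python. This is only an illustration, not the project's actual pipeline: the function name `top_words`, the tiny stop-word list and the sample pages are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real study would use a full Danish list.
STOP_WORDS = {"til", "med", "for", "det", "der", "den", "har", "kan", "ikke", "som"}

TOKEN_RE = re.compile(r"\w+")


def top_words(texts, n=10, stop_words=frozenset()):
    """Return the n most frequent lower-cased tokens, skipping stop words."""
    counts = Counter()
    for text in texts:
        for token in TOKEN_RE.findall(text.lower()):
            if token not in stop_words:
                counts[token] += 1
    return counts.most_common(n)


# Made-up sample pages standing in for one yearly time slice.
pages_2016 = [
    "Pris 199 DKK med fragt til din adresse",
    "Læs mere om pris og fragt for vores produkter",
    "Kontakt os for pris",
]

with_stops = top_words(pages_2016, n=3)
without_stops = top_words(pages_2016, n=3, stop_words=STOP_WORDS)
```

Running the same function once per time slice would give one ranking per year, from which the 10-20 significant words could then be picked by hand.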
THE ANALYTICAL DESIGN — FOUR
APPROACHES
›Second, we will use 'the word of the year' for each
year
›'the word of the year' has been chosen since 2006
›including the other candidates that were not
selected
›e.g. in 2020 that would be 'samfundssind' (public
spirit), and 'afstand' (distance), 'albuehilsen' (elbow
bump), 'flokimmunitet' (herd immunity), and
'mundbind' (face mask)
THE ANALYTICAL DESIGN — FOUR
APPROACHES
›Third, a number of discussions, topics or events that
have set the agenda throughout each year are
identified
›e.g. 2006: 'Muhammedkrise' (cartoon crisis), 2008:
'finanskrise' (financial crisis), 2012: 'lukkelov' (Shops
Act), 2015: 'flygtningekrise' (refugee crisis)
›a 'dictionary' of synonyms will be established for
each event
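One way such an event 'dictionary' could be applied is to count, per year, how many tokens match any term listed for the event. The sketch below assumes this simple exact-match approach; the terms in `EVENT_TERMS`, the function name and the sample pages are hypothetical and only illustrate the idea.

```python
import re
from collections import defaultdict

# Hypothetical dictionaries: the terms listed per event are illustrative only.
EVENT_TERMS = {
    "finanskrise": {"finanskrise", "finanskrisen", "bankkrise"},
    "flygtningekrise": {"flygtningekrise", "flygtningekrisen"},
}

TOKEN_RE = re.compile(r"\w+")


def count_event_mentions(pages_by_year, event_terms):
    """Count, per event and year, the tokens that match the event's dictionary."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, pages in pages_by_year.items():
        for page in pages:
            for token in TOKEN_RE.findall(page.lower()):
                for event, terms in event_terms.items():
                    if token in terms:
                        counts[event][year] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {event: dict(per_year) for event, per_year in counts.items()}


# Made-up pages standing in for two yearly time slices.
pages_by_year = {
    2008: ["Finanskrisen rammer bankerne", "En bankkrise er under opsejling"],
    2015: ["Flygtningekrisen fylder i debatten", "Efter finanskrise kom opsving"],
}

mentions = count_event_mentions(pages_by_year, EVENT_TERMS)
```

Exact matching misses inflected or compound forms that are not in the dictionary, which is one reason the dictionary of synonyms has to be built with care.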
THE ANALYTICAL DESIGN — FOUR
APPROACHES
›Fourth, the most used search terms on Google from
Denmark are identified (either at Google or in legacy
media where they are often mentioned at the end of
each year)
THE METHODS
›A variety of calculations of word occurrences
›Train embeddings of words and documents to
represent the lexical co-occurrence structure within
and between websites
›Possibly, model and predict information propagation
across the Danish web
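To give a feel for what 'lexical co-occurrence structure' means, the sketch below builds plain count-based co-occurrence vectors and compares them with cosine similarity. This is a simple stand-in for trained embeddings, not the method used in the project; the window size, the function names and the toy sentences are assumptions.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up, pre-tokenised sentences.
sentences = [
    ["pris", "og", "fragt", "på", "produkter"],
    ["pris", "og", "fragt", "på", "tilbehør"],
    ["nyheder", "om", "sport", "og", "spil"],
]

vectors = cooccurrence_vectors(sentences)
sim_trade = cosine(vectors["produkter"], vectors["tilbehør"])  # shared contexts
sim_mixed = cosine(vectors["produkter"], vectors["nyheder"])   # no shared contexts
```

Words that appear in the same contexts end up with similar vectors. Trained embeddings capture the same kind of signal in a dense, lower-dimensional form.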
[Figure: word cloud, with the following labels]
›Training in recognising topics: topic model +
neural word embeddings
›'Dictionary' of words, identification of lexical
co-occurrence structure
›Graph of where topics 'live'
Biggest challenge: training efficiency and speed
due to the size of data => training algorithms.
Estimated training time: 4-6 months
NEXT STEPS
›The textual analyses to be supplemented by a
hyperlink network analysis
›Identify the link relations between websites where a
given topic is talked about
›Hyperlinks have already been extracted as a
separate dataset
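Combining the two datasets could look roughly like this: find the websites whose text mentions a topic, then keep only the hyperlinks that run between those websites. This is a minimal sketch under stated assumptions; the function name, the whole-word matching on whitespace-split text and the `.example` domains are all made up for illustration.

```python
def topic_link_network(links, site_texts, topic_terms):
    """Keep only links whose source and target sites both mention the topic."""
    topic_sites = {
        site
        for site, text in site_texts.items()
        if any(term in text.lower().split() for term in topic_terms)
    }
    edges = [
        (src, dst)
        for src, dst in links
        if src in topic_sites and dst in topic_sites
    ]
    return topic_sites, edges


# Made-up domains and texts.
site_texts = {
    "avis.example": "finanskrise rammer bankerne",
    "blog.example": "min mening om finanskrise",
    "shop.example": "pris og fragt",
}

# Made-up hyperlinks as (source domain, target domain) pairs.
links = [
    ("blog.example", "avis.example"),
    ("shop.example", "avis.example"),
    ("avis.example", "blog.example"),
]

sites, edges = topic_link_network(links, site_texts, {"finanskrise"})
```

The resulting edge list could then be passed to a network analysis tool to study how the websites discussing a topic link to each other.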
[Figure: ten items, each labelled 'Topic x']
NEXT STEPS
Possible relevance in WARCnet:
›track topics from other countries on the Danish web,
possibly on web pages of the country's language
›replicate the study based on holdings from other
national web archives