Faculty of Technology and Society
Department of Computer Science
Master Thesis Project, 15 ECTS, DA613A, Spring 2015
Identifying Single and Stacked News Triangles
in Online News Articles
- an Analysis of 31 Danish Online News Articles Annotated by 68 Journalists
By Miklas Njor
Supervisor:
Daniel Spikol
Examiner:
Bengt Nilsson
Contact Information:
Author:
Miklas Njor
miklas@miklasnjor.com
Supervisor:
Daniel Spikol
daniel.spikol@mah.se
Examiner:
Bengt Nilsson
bengt.nilsson.ts@mah.se
Abstract: While news articles for print use one News Triangle, where important information is at the top of the
article, online news articles are supposed to use a series of Stacked News Triangles, due to online readers'
text-skimming habits [1]. To identify the presence of Stacked News Triangles, we analyse how 68 Danish journalists
annotate 31 articles. We use keyword frequency as the measure of popularity. To explore whether Named Entities
influence News Triangle presence, we analyse Named Entities found in the articles and keywords.
We find the presence of an overall News Triangle in 30 of 31 articles, while 14 of the 31 articles have
Stacked News Triangles. For Named Entities in News Triangles we cannot see what their influence is.
Nonetheless, we find differences in Named Entity Types between the categories (Culture,
Domestic, Economy, Sports).
Keywords: Keyword popularity, Keyword frequency, Folksonomy, Named Entity, Online News
Popular Science Summary
The Internet forced the media to be more streamlined. It also resulted in a huge decline in
circulation and revenue for newspapers, which in turn led to staff layoffs. The staff that are
left are pressed for time, both when producing news and when putting the articles online
and adding metadata to the content.
With the shift from reading news in a physical newspaper, to reading news online on computers
or mobile phones, there has also been a change in reading habits, where readers now skim text
when reading news online. To serve the reader the most important information first, when
journalists write a news article for print, they use an overall News Triangle, that is, the most
important information is at the beginning of the story. But for online news articles, the new
idiom is to use a series of stacked News Triangles, so that each section is a News Triangle in
itself, as this allows the reader to skim and understand the text more easily.
From a text-mining perspective this is interesting, because knowing how a piece of text is
supposed to be structured will make it more straightforward to teach a computer how to read
and learn from the text, for instance to automatically add keywords and taxonomies. Dividing
the text into even smaller, more manageable chunks could allow for even better output.
However, how things are planned is one thing; how they play out in reality is
another.
This paper finds, with the help of 68 journalists, that although the News Triangle does exist, the
idea of dividing the text into smaller News Triangles is not so clear-cut and is problematic to
identify. Fewer than half of the articles we look at (14 of 31) consist of smaller News Triangles
throughout the article, and exactly what influences whether or not there is a series of News Triangles
remains unclear. Thus more research has to be done to understand the complexity of how
online news articles are structured.
Identifying Single and Stacked News Triangles in online news articles. Master Thesis 15 ECTS, DA613A, 2015, Miklas Njor 2 of 73
Extended Abstract
ABSTRACT: The concept of a News Triangle is to place the most important information at the
top of the story. This is how news articles for print have traditionally been written. Due to
online readers' text-skimming habits, and to make it easier for readers to understand the text,
online news articles supposedly use a series of Stacked News Triangles [1].
To identify and test if this pattern is meaningful for an automated process of annotating articles,
we analyse how 68 Danish journalists annotate 31 articles and analyse where these keywords
appear in each article. We use annotation keyword frequency as a measure of popularity.
To see if Named Entities (Persons, Places, Organisations) influence News Triangle presence, we
analyse Named Entities found in the articles and keywords. We also analyse Named Entities (NE)
across article subject categories.
Motivation: Each day numerous articles are published in online newspapers. To make search
retrieval and recommendation of related articles easier, each article is manually annotated with
relevant keywords, taxonomies and a category. This process is tedious and subjective, and over
time keywords and taxonomies can become stale.
Automatically annotating articles with relevant keywords and taxonomies could help organise
content better and ensure that keywords are always relevant. Furthermore, it could prevent
high bounce rates, by suggesting more relevant articles to readers who enter the site via
external links, such as social media[2], which in turn could lead to more page-views and
revenue from online advertising.
Problem statement: While it is tempting to count names or word frequencies in the text and
conclude that the names or the most common words are what the article is about, the reality is more
complex. Our intuition is that structuring online news articles by stacking News Triangles upon
each other will form an even more semi-structured document and ease automated annotation,
since algorithms will more easily identify pivoting points. It is, however, unknown whether News Triangles or
Stacked News Triangles exist, are measurable or are useful to identify.
Methodology: Sixty-eight Danish journalists each annotate 8 articles from a set of 31 articles. From
the annotation data-set, the popular keywords in the 3rd quartile for each article are used to
measure keyword popularity across the article's content.
Each article is divided into partitions (Headpiece, Intro and subsequent Section blocks according
to the original HTML markup) and the placement of keywords is mapped. The position and
popularity count for each keyword are used to create a graph showing the distribution of
keyword popularity in each section block.
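The selection and mapping steps described above could be sketched as follows. This is a minimal illustration and not the thesis implementation: the function names, the sample data, the reading of "3rd quartile" as "annotation count at or above the third quartile", and the single-word matching are all assumptions made for the example.

```python
from collections import Counter
from statistics import quantiles


def popular_keywords(annotations):
    """Keep keywords whose annotation count is at or above the third quartile.

    Assumption: "popular" means count >= Q3 of all keyword counts.
    """
    counts = Counter(keyword.lower() for keyword in annotations)
    q3 = quantiles(counts.values(), n=4, method="inclusive")[2]
    return {keyword: count for keyword, count in counts.items() if count >= q3}


def map_keywords_to_blocks(blocks, popular):
    """For each block, list (token_index, keyword, popularity) for every hit.

    Assumption: keywords are single words; multi-word keywords would need
    phrase matching, which this sketch does not attempt.
    """
    hits = {}
    for name, text in blocks:
        tokens = [raw.strip(".,;:!?\"'()") for raw in text.lower().split()]
        hits[name] = [
            (index, token, popular[token])
            for index, token in enumerate(tokens)
            if token in popular
        ]
    return hits


# Hypothetical annotations for one article (one entry per journalist tag).
annotations = (
    ["Messi"] * 5 + ["Barcelona"] * 4 + ["football"] * 2 + ["goal", "stadium"]
)
# Hypothetical article partitions, in reading order.
blocks = [
    ("Headpiece", "Messi sends Barcelona to the final"),
    ("Intro", "Barcelona won after Messi scored twice."),
    ("Section 1", "The stadium was sold out."),
]

popular = popular_keywords(annotations)
hits = map_keywords_to_blocks(blocks, popular)
```

Each hit carries the position and popularity that a distribution graph would need for its two axes.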
A linear fit across each block and the entire article acts as a boolean value, to identify the
presence of one overall News Triangle and Stacked News Triangles. We also extract Named
Entities from the articles and annotations, and analyse their presence, position and Named
Entity Type.
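One possible way to turn a linear fit into a boolean value is sketched below. It rests on an assumption that is not stated in the text above: a News Triangle is counted as present when the least-squares slope of keyword popularity against position is negative, that is, when popular keywords cluster near the top. The function names and sample points are hypothetical.

```python
def linear_fit_slope(points):
    """Ordinary least-squares slope for (position, popularity) points.

    Returns None when a slope cannot be computed (fewer than two points,
    or all points at the same position).
    """
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def has_news_triangle(points):
    """Assumed criterion: popularity decreases with position (negative slope)."""
    slope = linear_fit_slope(points)
    return slope is not None and slope < 0


# Hypothetical (token position, keyword popularity) pairs for two blocks.
falling_block = [(0, 5), (3, 5), (10, 2), (20, 1)]
rising_block = [(0, 1), (5, 2), (10, 4)]

falling_result = has_news_triangle(falling_block)
rising_result = has_news_triangle(rising_block)
```

The same function can be applied to a single block or to the points of the entire article, which matches the two levels of analysis described above.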
Results: We find an overall News Triangle in 30 of 31 articles. For the presence of a series of
stacked News Triangles within the content, we find
that 14 of the 31 articles have stacked News Triangles in all sections. However, it is not clear
what influences this behaviour. For Named Entities in annotations we cannot see what
influences them either. We find differences in Named Entity Types between the categories.
Conclusion: To our knowledge, this is the first time that News Triangles and stacked News
Triangles have been identified within Computer Science. Looking at the block level, it is
however difficult to see what influences News Triangle presence. We find that annotations
quickly group around the same words. We also find that there is a difference in the Types of Named
Entities used for each category and that Named Entity mentions follow a Pareto-like power-law
distribution similar to Zipf's law. We identify future work that needs to be done within this area
of mining online news.
In the name of transparency and reproducibility we have uploaded much of the data and
illustrations to http://plot.ly. Where possible, we will link from within the captions of the tables
and illustrations. The raw data (annotations and participant data) can be found here:
http://figshare.com/account/projects/4414
Table of Contents
1 Introduction ...................................................................................8
1.1 Background .........................................................................................................8
1.2 The News Triangle ...............................................................................................9
1.3 Related Work .....................................................................................................11
1.4 Research Questions ............................................................................................14
1.4.1 Research Question 1 – News Triangles ...................................................................................................14
1.4.2 Research Question 2 and 3 – Named Entities .........................................................................................16
2 Methodology ................................................................................17
2.1 Articles - criteria and selection methods ..........................................................18
2.2 Preprocessing of articles ...................................................................................18
2.3 Participants - criteria and selection methods ...................................................19
2.4 Collecting keywords via a web questionnaire ....................................................19
2.5 Analysis of collected tags ..................................................................................21
2.6 Identifying News Triangles. ...............................................................................22
3 Results ..........................................................................................22
3.1 Participants ......................................................................................................23
3.2 Articles and Annotations ..................................................................................25
3.3 Keyword Distribution ........................................................................................28
3.4 Named Entities in Detail ....................................................................................39
3.5 Named Entity Occurrence in Articles ................................................................40
3.6 Named Entities in Keywords ..............................................................................42
4 Analysis ........................................................................................45
4.1 Analysis of Keyword Distribution and News Triangle Presence ........................46
4.2 Analysis of Named Entities ................................................................................47
5 Discussion ....................................................................................48
5.1 Future Work .......................................................................................................50
A Appendix ......................................................................................52
B Appendix ......................................................................................53
B.A Keyword Distribution – Culture .......................................................................................................................... 53
B.B Keyword Distribution – Domestic ....................................................................................................................... 57
B.C Keyword Distribution – Economy ....................................................................................................................... 64
B.D Keyword Distribution – Sports ............................................................................................................................ 67
C Appendix ......................................................................................69
C.A List of articles ....................................................................................................................................................... 69
C.B Grouping of articles in Categories ...................................................................................................................... 70
C.C Participants Job Functions and Job Titles .......................................................................................................... 71
C.D Detailed look at count of Annotations of Keywords ........................................................................................71
C.E Occurrence of Named Entities in Culture, Domestic, Economy, Sports .........................................................72
Illustration Index
Illustration 1: Re-drawn example of “The News Triangle”...................................................................10
Illustration 2: Inverted News Triangle for print.....................................................................................15
Illustration 3: Inverted News triangles for online news........................................................................15
Illustration 4: Article from Politiken.dk, explanations of elements, a Stacked News Triangle.......16
Illustration 5: Annotations by article Category......................................................................................28
Illustration 6: Article 27 - Keyword Distribution incl. All Popularity Fit............................................29
Illustration 7: Article 08 – Keyword Distribution incl. All Popularity Fit...........................................31
Illustration 8: Article 29 – Keyword Distribution incl. All Popularity Fit...........................................32
Illustration 9: Article 02 - Keyword Distribution incl. All Popularity Fit............................................33
Illustration 10: Article 31 - Keyword Distribution incl. All Popularity Fit..........................................34
Illustration 11: Mean, std.dev, variance, std. error of NT presence Per Category and Blocks........39
Illustration 12: Frequency and Occurrences of Named Entities (NE) from all articles.....................40
Illustration 13: Normalised Percentage of Occurrence of Named Entities per Category.................42
Illustration 14: Named Entity Type averages per Category for Named Entities keywords.............43
Illustration 15: Normalised Percentage of Named Entity Types per Category..................................44
Illustration 16: Named Entity Types per Category & std. dev. Of Annotations.................................45
Illustration 17: Example of an article as presented to the participants.............................................52
Illustration 18: Article 16 - Keyword Distribution incl. All Popularity Fit..........................................53
Illustration 19: Article 17 - Keyword Distribution incl. All Popularity Fit..........................................53
Illustration 20: Article 18 - Keyword Distribution incl. All Popularity Fit..........................................54
Illustration 21: Article 22 - Keyword Distribution incl. All Popularity Fit..........................................54
Illustration 22: Article 28 - Keyword Distribution incl. All Popularity Fit..........................................55
Illustration 23: Article 23 - Keyword Distribution incl. All Popularity Fit..........................................55
Illustration 24: Article 28 - Keyword Distribution incl. All Popularity Fit..........................................56
Illustration 25: Article 27 - Keyword Distribution incl. All Popularity Fit..........................................56
Illustration 26: Article 03 - Keyword Distribution incl. All Popularity Fit..........................................57
Illustration 27: Article 05 – Keyword Distribution incl. All Popularity Fit.........................................57
Illustration 28: Article 04 – Keyword Distribution incl. All Popularity Fit.........................................58
Illustration 29: Article 07 – Keyword Distribution incl. All Popularity Fit.........................................58
Illustration 30: Article 08 – Keyword Distribution incl. All Popularity Fit.........................................59
Illustration 31: Article 13 – Keyword Distribution incl. All Popularity Fit.........................................59
Illustration 32: Article 19 – Keyword Distribution incl. All Popularity Fit.........................................60
Illustration 33: Article 20 – Keyword Distribution incl. All Popularity Fit.........................................60
Illustration 34: Article 21 – Keyword Distribution incl. All Popularity Fit.........................................61
Illustration 35: Article 24 – Keyword Distribution incl. All Popularity Fit.........................................61
Illustration 36: Article 25 – Keyword Distribution incl. All Popularity Fit.........................................62
Illustration 37: Article 26 – Keyword Distribution incl. All Popularity Fit.........................................62
Illustration 38: Article 29 – Keyword Distribution incl. All Popularity Fit.........................................63
Illustration 39: Article 30 – Keyword Distribution incl. All Popularity Fit.........................................63
Illustration 40: Article 01 - Keyword Distribution incl. All Popularity Fit..........................................64
Illustration 41: Article 02 - Keyword Distribution incl. All Popularity Fit..........................................64
Illustration 42: Article 06 - Keyword Distribution incl. All Popularity Fit..........................................65
Illustration 43: Article 09 - Keyword Distribution incl. All Popularity Fit..........................................65
Illustration 44: Article 10 - Keyword Distribution incl. All Popularity Fit..........................................66
Illustration 45: Article 11 - Keyword Distribution incl. All Popularity Fit..........................................66
Illustration 46: Article 12 - Keyword Distribution incl. All Popularity Fit..........................................67
Illustration 47: Article 14 - Keyword Distribution incl. All Popularity Fit..........................................67
Illustration 48: Article 15 - Keyword Distribution incl. All Popularity Fit..........................................68
Illustration 49: Article 31 - Keyword Distribution incl. All Popularity Fit..........................................68
Index of Tables
Table 1: Participant's Ages.........................................................................................................................23
Table 2: Participant's Graduation Year....................................................................................................23
Table 3: Participant's Education Length..................................................................................................23
Table 4: Question: "For how many years have you worked with annotating news articles?"........24
Table 5: Question: "When did you last work with annotating news articles?".................................24
Table 6: Participant's Browsers and Operating Systems.......................................................................25
Table 7: Averages and Five Number Summary of Article Categories Annotation Rate....................26
Table 8: Annotations Overall......................................................................................................................26
Table 9: Keyword Annotations per Category: Culture 7, Domestic 14, Economy 6, Sports 4..........27
Table 10: Presence of News Triangles across all Sections and per Single Sections basis.................36
Table 11: Percentage of News Triangles per block.................................................................................37
Table 12: Mean, Standard Deviation, Variance and Standard Error for all Blocks/Categories.......38
Table 13: Grouping of articles in Categories............................................................................................70
Table 14: Job Functions and Job Titles......................................................................................................71
Table 15: Detailed look at counts of Annotations of Keywords............................................................72
Table 16: Occurrences of Named Entities per article for Culture........................................................72
Table 17: Occurrences of Named Entities per article for Domestic.....................................................73
Table 18: Occurrences of Named Entities per article for Economy.....................................................73
Table 19: Occurrences of Named Entities per article for Sports..........................................................73
List of acronyms
ML Machine Learning
NE Named Entity
NLP Natural Language Processing
NLTK Natural Language Toolkit (a Python programming language library)
NT News Triangle
POS Part of Speech
TF-IDF Term Frequency Inverse Document Frequency
LF Linguistic Features
LA Lexical Affiliates
HCA Hierarchical Clustering Algorithm
CMS Content Management System
HTML Hyper Text Markup Language
Concepts
Latent Dirichlet allocation (each document is a mixture of smaller topics)
Entropy (information gain)
1 Introduction
Each day numerous articles are published in online newspapers. In order to make search
retrieval and recommendations of related articles to readers easier, each of these articles is
manually annotated with relevant metadata, including the article's keywords, taxonomies and
the category it belongs to.
The information in newspaper articles is presented in descending order of importance, where
important information is relayed first. Within journalism this is known as the “News Triangle”,
where the writer explains “What happened”, “How it happened”, “Amplify the point”, “Tie up
loose ends” (WHAT) [3]. The article is written using one large News Triangle, with a headpiece
and an introduction followed by smaller section blocks, where the level of information diminishes
the further you read. This writing process has worked well for more than a hundred years.
However, the process of writing for online is different from writing for print. According to
Tverskov & Tverskov [1] (2004) and Sissons [3] (2006), the classic print article changed
somewhat when it entered the online arena. The setup of a headline, a sub headline and an
introduction followed by several section blocks still holds. But whereas articles written for print,
where space is limited and not known in advance, are produced so that they can be quickly cut
from the bottom up, online articles are not limited in the same way. Moreover, readers of online
news tend to skim text instead of reading from top to bottom. This has supposedly led to a shift
from using one overall News Triangle towards using smaller News Triangles for each sub-section [1] (p. 40).
If it is the case that articles are partitioned into smaller News Triangles where the level of
important information decreases the further into the article we read, this could make keyword
extraction and knowledge discovery easier. When designing algorithms, we would be able to
add weights to each section and compare keyword placement. With this in mind, this paper
investigates what the structure of a set of online newspaper articles actually looks like.
1.1 Background
For many years the newspaper was a distribution channel for news and advertising. With the
Internet, online newspapers became just one of many options for users to spend their time and
advertisers to spend their money. This change brought on a huge decline in circulation and
revenue for newspapers, and many newspapers are today struggling to survive. A consequence
of the decline is staff layoffs, and the staff that are left are pressed for time, both when
producing news and when placing the articles online and adding metadata.
In a physical newspaper there is a clear priority of content and each article undergoes an
editorial process before publication. On the front page there are a number of top stories
prioritised as A (above the fold), and B and C stories (below the fold). Inside the newspaper,
stories are prioritised using page numbers, images, headline sizes, position on page etc. There is
a clear indication of the newspaper sections (Domestic, International, Sports, Culture, etc.).
Physical newspapers have a finite state, a beginning and end, for a given period, usually a day.
For online newspapers the story is slightly different. The newspaper's homepage (front-page)
can feature 50-100 stories shown in semi-prioritised chronological order, changing throughout
the day. The newspaper sections are different, in that they follow an identical design resembling
that of the homepage, where the articles are displayed in a chronological manner. Thus the
hierarchy and prioritisation on newspaper websites are different from those of a physical newspaper, and
news media are well aware of the fact [4].
To meet the need for better navigation, findability and recommendations for users,
newspapers employ Information Architecture tools like keywords, hierarchies, taxonomies,
automated recommendation systems, and classification (sections), among other aspects [5].
These tools are managed according to journalistic principles, and are set and prioritised by
journalists who have knowledge of the domain area. This can be a laborious and repetitive task,
which often needs careful planning [6][7].
1.2 The News Triangle
“News stories should flow logically from the first paragraph. They should have pace
and no unnecessary elements should slow the story down. And even though readers
won’t see a structure, there is one. For hard news there is quite a strict structure.
One way of looking at it is through the News Triangle or inverted pyramid.
Generations of journalists have been brought up on this.” - Sissons [3] (p. 70).
Placing the less important information at the bottom of the article is useful when it is not
known in advance how much space there is on the physical newspaper page where the
article is supposed to be placed. The story's priority might change, or the allotted space might
have to give way to an advertisement. Knowing that content placed at the bottom of the article
is less important makes it easier and quicker for editors, who might not know what is important
to the story, to edit the text. As such, the structure serves as a guideline.
It is also useful to the reader that information is presented in a logical order, especially for
readers of online news. According to Sissons and Tverskov & Tverskov, when a journalist writes
for online news, the text should be even more structured, consisting of not one giant News
Triangle, but many. Condensed blocks of information allow readers to skim the article and
quickly go back and forth in the text.
Looking at Illustration 1 above, we could also conclude that, although the information should be
more concentrated at the top of the triangle, the parts differ in content: “What happened” could possibly contain more
Named Entities (Names, Places, Organisations), since something happened to someone or something;
“How it happened” could contain more verbs, since these often describe some kind of action; and
“Amplify the point” could, together with “Tie up loose ends”, be considered a summary.
Another strong point of online news as a structured landscape for information
is that space is endless. Journalists are not forced to cram all information into one article, but
can divide it into several articles and link them together via hyperlinks. This allows
journalists to dig a little deeper into the subject of each piece, which in turn
could mean that each article is more to the point and the information is condensed [3] (p. 143)
[1] (p. 40). Knowing how a text within a certain domain is supposed to be structured could
greatly enhance data mining, text mining and Natural Language Processing (NLP).
Common data- and text-mining, and NLP techniques
There exist many data-mining [8] and Natural Language Processing (NLP) techniques and tools [9],
and frameworks like NLTK [10], for automating the extraction of keywords and classifying text.
Below we highlight some of the many techniques and concepts used within machine learning,
data- and text-mining, and NLP.
Common Preprocessing Steps: Text is normalised so that tokens in the text can be counted and
compared more reliably. The text is separated into sentences, and words are tokenised (within NLP,
words are called tokens). Common stop words (“a”, “I”, “it”, “he”, “she”, etc.) are removed.
Each remaining token is stemmed, that is, reduced to its stem, for example by removing a plural “s”.
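The preprocessing steps can be illustrated with a short, dependency-free sketch. It is deliberately simplified: the stop-word list is a tiny sample, the sentence splitter is a plain regular expression, and `naive_stem` only strips a plural "s", whereas a real pipeline would typically use a proper stemmer such as those shipped with NLTK. All names here are made up for the example.

```python
import re

# A tiny illustrative stop-word list; real lists are far longer.
STOP_WORDS = {"a", "i", "it", "he", "she", "the", "and", "is", "to", "of"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenise(sentence):
    """Lower-case the sentence and keep runs of letters (incl. Danish æøå)."""
    return re.findall(r"[a-zæøå]+", sentence.lower())


def naive_stem(token):
    """Toy stemmer: drop one trailing plural 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = [t for t in tokenise(sentence) if t not in STOP_WORDS]
        result.append([naive_stem(t) for t in tokens])
    return result


processed = preprocess("She reads the newspapers. It is a story.")
```

Note that the toy stemmer also strips the verb ending in "reads"; real stemmers apply many more rules and can over-stem in similar ways.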
Part of Speech (POS): Part of Speech (POS) tagging is the process of identifying whether each token
is a noun, a verb, an adjective, etc. It is also a common basis for recognising Named Entities
(Names, Places and Organisations). Tagging is done by comparing each token with the tokens that
surround it and estimating the likelihood of the token having a certain POS-tag given that context.
Context matters in the same way for meaning: “gay men” can mean either happy males or homosexual
males, depending on the context of the surrounding text.
Frequency Count and Frequency Distribution: A common step in identifying what a text is
about is to count and rank the occurrence of tokens. The idea is that the more often a word
appears, the more likely it is to be what the text is about. Frequency Count can be combined with
Frequency Distribution, where the occurrences of the most common tokens are mapped across the text corpus.
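As a small illustration of counting and then mapping token occurrences, the sketch below uses only the Python standard library. The token list and function names are invented for the example, and stop words are left in to keep it short.

```python
from collections import Counter


def frequency_count(tokens, top_n=2):
    """Rank tokens by how often they occur."""
    return Counter(tokens).most_common(top_n)


def frequency_distribution(tokens, targets):
    """Record the positions at which each target token occurs."""
    positions = {target: [] for target in targets}
    for index, token in enumerate(tokens):
        if token in positions:
            positions[token].append(index)
    return positions


tokens = (
    "court rules on tax case as tax minister rejects "
    "court criticism of tax law"
).split()

top = frequency_count(tokens)
where = frequency_distribution(tokens, [token for token, _ in top])
```

The position lists are what a distribution plot across the text would be drawn from.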
Unigrams, Bigrams and Trigrams: A single token (one word) is called a unigram. Two tokens
next to each other are called a bigram and three tokens next to each other are called a trigram.
Unigrams, bigrams, trigrams etc. are often used together with Frequency Count and
Frequency Distribution.
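A generic n-gram helper, combined with a frequency count, might look like the following sketch. The `ngrams` function and the sample tokens are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["the", "prime", "minister", "met", "the", "prime", "minister"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Counting bigrams shows which adjacent pairs recur.
bigram_counts = Counter(bigrams)
```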
Collocations: Collocations (also known as Lexical Affiliations) are separate words that appear in
conjunction. The notion of conjunction can be tuned according to how far from each other the
words are “allowed” to appear. The idea is, that if words like “software” and “upgrade” often
appear fairly close together, they should be considered as a single concept: “software upgrade”.
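The tunable notion of closeness can be sketched as a sliding window over the tokens. This is a simple co-occurrence count under assumed parameters (`window`, `min_count`), not a statistical association measure such as those offered by NLP toolkits, and the sample sentence is made up.

```python
from collections import Counter


def collocations(tokens, window=2, min_count=2):
    """Count ordered token pairs that occur within `window` tokens of each other.

    Only pairs seen at least `min_count` times are returned.
    """
    pairs = Counter()
    for index, left in enumerate(tokens):
        for right in tokens[index + 1 : index + 1 + window]:
            if left != right:
                pairs[(left, right)] += 1
    return {pair: count for pair, count in pairs.items() if count >= min_count}


tokens = (
    "install the software upgrade today because "
    "the software security upgrade failed"
).split()

found = collocations(tokens, window=2, min_count=2)
```

With a window of 2, "software" and "upgrade" are paired both when adjacent and when one word apart, which is how a pair like "software upgrade" would surface as a candidate concept. Stop words were not removed here, so "the software" also appears.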
TF-IDF: Term Frequency Inverse Document Frequency (TF-IDF) is a technique where terms or
tokens that occur often in one document, but in few of the other documents in the corpus, are
ranked higher than terms that occur frequently everywhere. The notion is that a term which is
used throughout the corpus says little about any single text, whereas a term concentrated in a
few texts must be important to them.
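A minimal sketch of the standard TF-IDF weighting follows: the term frequency within a document is multiplied by the logarithm of the inverse document frequency across the corpus. Several variants of the formula exist; the unsmoothed form used here, the function name and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score each term in each tokenised document.

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


documents = [
    ["minister", "tax", "tax", "reform"],
    ["minister", "football", "match"],
    ["minister", "film", "premiere"],
]

scores = tf_idf(documents)
best_term = max(scores[0], key=scores[0].get)
```

In this toy corpus "minister" appears in every document and therefore scores zero, while "tax" is both frequent in the first document and absent from the others, so it ranks highest there.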
Organisation of content
The paper is organised as follows: Section 1 includes 1.3 Related Work and 1.4 Research
Questions. Methods (section 2) are explained on pages 17-22. Results (section 3) are shown on
pages 22-44. The analysis (section 4) starts on page 45 and the discussion (section 5) starts on
page 48. To not interrupt the flow, illustrations of Keyword Popularity Distributions have for the
most part been moved to Appendix B on page 53. A list with links to the articles used in this paper, and some
larger tables, can be found in Appendix C on page 69.
1.3 Related Work
Extracting relevant keywords from texts is not a new area of research, and much research within
machine learning (ML), data- and text mining, and Natural Language Processing (NLP) focuses
on this area. However, we have not been able to find research that collects input from human
annotators or evaluators before any algorithm is run, which is our approach. We use the
annotators' input as the measure for identifying News Triangles, not as a means of evaluating
whether a model is correct.
Dividing the text into smaller chunks for weighing output of keywords or selecting appropriate
taxonomies or classification is an area with some research. A common denominator for most
studies is using online news articles and algorithms to tackle specific domains of knowledge.
While the concept of a News Triangle is well known within journalism, there is, to our
knowledge, no research that identifies News Triangles in a scientific manner within the domains
of machine learning, data- and text-mining, or NLP. A common technique within text-mining
and NLP is to look at frequencies of occurrence across a text corpus as a means to understand
what the text is about. However, we have not been able to find research that uses the notion of
keyword popularity from annotations as a way to measure information density.
How we searched
We have searched for original research articles using Malmö University's Summon search
service, the websites of IEEE, ACM and ScienceDirect, and Google Scholar, but found very little
relevant research, so our strategy was to follow citations upstream and downstream from the
articles we did find. The main search terms are “News Triangle”, “HTML blocks”, “Keyword
Popularity”, “Keyword Extraction”, “Information Density” and “Named Entities”, either as
standalone searches, in combination with each other, or with “data mining”, “text mining”,
“Natural Language Processing” or “NLP” attached to the search term. Below we present eight
relevant research papers.
Categorisation and topic identification
Muller, Dörre, Gerstl & Seiffert (1999) [11] describe the use of TaxGen, an automated taxonomy
generator based on a hierarchical clustering algorithm (HCA). By using a bottom-up iterative
approach, where each text is analysed, the system slowly builds a taxonomy. The results are
analysed against the training set and show above 99% positive results. The authors preprocess
the text to find Lexical Affiliates (LA) and meaningful subjects in the text, limited to a maximum
of the top five keywords. They also use Linguistic Features (LF), also known as Named Entities
(NE), to extract names of people and places, and go into great detail about how difficult these
are to use as taxonomies, since the clustering algorithm chokes. This could prove problematic
for news articles, where journalists, in order to spice up the language, use synonyms, or only
first or last names. The text documents in the paper are from news-wire services, where the
language is much more compact. There also seems to be some confusion about what classes and
taxonomies are, where the authors appear to mix the two, but this is not explicitly clear from
the text.
In explaining TopCat [12], Clifton, Cooley & Rennie (2004) give a very thorough walkthrough of
the problems of NLP and text-mining and suggest possible workarounds. Their process, too,
consists of removing stop words along with NE extraction to get a more coherent set of text
bodies. The NE's are used to map articles to topics, e.g. “Sampras”, the American tennis player
Petros Sampras, is mapped to the “tennis” concept. They also use TF-IDF to find important
words in each text, which substantially improves the results. TopCat finds around 30% more
similar documents than a human process does, however the authors are clear in stating that
their results are difficult to evaluate. Nonetheless it is a solid piece of work that shows the
difficulties in automation and text mining. They also note, like Muller, Dörre, Gerstl &
Seiffert above, that computation takes a very long time, and conclude that the experiment is a
success even though the output is small. But this is based on algorithm evaluation by users,
which could be a sign that any improvement is seen as good, since annotating content is a
complicated and time-consuming task for users.
Denecke & Brosowski also follow a recipe of preprocessing, where stop words are removed,
words are stemmed and sentences processed to find the related topic, in the 2010 paper “Topic
Detection in Noisy Data Sources” [13]. Only sentences with a minimum of four words are
considered, and Latent Dirichlet Allocation (where each document is a mixture of smaller
topics) is performed on each sentence. A keyword is only considered if it is contained in a
minimum of 15 sentences. Finally, the top five keywords are chosen as the keywords describing
the article. The algorithm is tested on medical blogs, slashdot.org and 14 products from
Amazon.com. The algorithm performs best on blogs and Amazon.com products, which might be due
to the stricter structure of blogs and Amazon.com product pages. The authors test the output
against annotators, but these have difficulty with medical blogs due to unfamiliarity with the
domain. Again there is confusion about classes, taxonomies and keywords.
Reza & Matin discuss in “Application of Data Mining For Identifying Topics at the Document
Level” (2013) how to identify topics at the document level [14]. The authors start by looking at
the sentence level using unigrams and bigrams, and Named Entity extraction and analysis, but
do not get satisfactory results. The authors then move on to the paragraph level and begin to
see good results on their test corpus and in the feedback from their test audience. This is a
promising result for our purpose. However, the authors do not take into account the use of
semantics and synonyms to get the true meaning of concepts and words. It is also unclear if they
have used stemming of words to find similarities.
Keyword extraction
Although they do not seem to be aware of the concept behind News Triangles, Nørvåg and Øyri
[15] describe a process where only the front page of an online newspaper is used to extract news
from. Their intuition is that headlines are short, to the point and created by humans, which
leads to higher classification accuracy than an automated process. They text-mine the front
page of online newspapers, and each news item is added to a database with a link to the article.
The headpiece is used to categorise the article. This process takes up a lot less hard-disk space
and saves time regarding data cleaning. Their approach is somewhat problematic, in that not all
newspapers link their headlines to the article, or they might use different HTML markup
depending on where the article is placed on the webpage, which changes as the news item
slides down the newspaper's homepage during the course of the day. Apart from the problem of
creating a template for each online newspaper, the logic behind the extraction process can be
complex. Furthermore, although news headlines are dense with information and are
written by a person with domain knowledge of what the article is about, the process loses out
on possibly valuable knowledge, which could have been mined from the article itself.
By looking at each block (section element) and analysing the particular block's entropy
(information gain), Huang, Yen, Hung, Chuang & Lee (2006) [16] propose to consider both
structure and information by adding weights to each block of text and letting the algorithm
decide which block is more informative. The authors conclude that using only entropy to
classify block importance gives poor results when tested on live websites, and that further
work is needed in this area. They also note that a side effect of a boolean selection is that
some blocks do not reach the threshold for selection and are deemed uninformative, even though
a manual human process deems the block to be informative. On this point they conclude that a
ranking of blocks could be more useful.
In “Automatic free-text-tagging of online news archives” from 2010 by Farkas, Berand, Hegedús,
Kárpárti & Krich [17], the authors extend the above-mentioned set-ups by using semantics from
Wikipedia. From the corpus they filter the keyword list to a fixed size based on statistics of the
site average. The headlines, sub-headlines etc. contain important words, and the authors
perform TF-IDF on these parts to determine importance. The authors find that the raw text and
links to other articles are mostly noise. The authors also touch on the problem of using the
available tools and frameworks on non-English languages. The use of wikipedia.org for
semantic extraction seems to go a long way towards circumventing this challenge.
In “Topic identification based on document coherence and spectral analysis” [18] from 2011,
D'hondt, Verhaegen, Vertommen, Cattrysse & Duflou treat texts as a non-sequential stream of
words, where the best parts of information could lie anywhere in the text. The authors use a
technique of lexical chains to quantify and describe similar keywords from different text blocks.
They add a score to keywords that appear closer to each other, word by word, sentence by
sentence, so occurrences of the same word very close to each other receive a higher score than
words occurring a similar number of times but further apart. Their technique achieves good
precision and recall on both large randomized and standard test sets. The paper goes into great
detail in explaining the set-up of their experiment along with the algorithms used. Their
approach is different to our line of thought. Our intuition is that information is grouped
into blocks of knowledge, and that each section carries and adds weight to each keyword, so we
choose to partition the article into sections.
1.4 Research Questions
The research questions set out to identify: the presence of News Triangles and Stacked News
Triangles; if Named Entities influence the presence of Stacked News Triangles; and what the
differences are among Named Entities across categories.
1.4.1 Research Question 1 – News Triangles
Research Question 1: To what extent do online news articles follow the idiom of many News
Triangles, instead of only one News Triangle, where information is distributed at the beginning
of the text? I.e. do the keyword candidates appear less frequently the further we move away
from the start of each element block?
Methodology and assumptions for Research Question 1
The participants annotate a selection of articles with keywords. The more often a keyword is
mentioned, the more popular it is, i.e. it has a higher value. To make sure only the most
important keywords with the highest value are used for looking at keyword popularity, only
keywords that belong to the top 3rd quartile are considered.
A News Triangle (an inverse triangle) is a triangle where most of the information is in the top
part (see Illustrations 2, 3 and 4). Thus, what we are looking for is whether the majority of
popular keywords in a text, plotted across a vertical time line, have spikes of popularity that
follow what is seen in Illustration 3. Moreover, we want to see if what is taught about writing
for online media is in fact true. We will look for patterns that resemble those seen in
Illustrations 3 and 4, where the spike starts at the left side and descends downwards. The
hypothesis is that online news follows a pattern that resembles Illustrations 3 and 4.
Note that we are not measuring where the keywords begin or end. We are looking for a boolean
value, a True or False, of whether there is an overall News Triangle and/or a series of Stacked
News Triangles.
Below we have inserted the News Triangles as per the theory of Sissons [3] and Tverskov &
Tverskov [1]. Note that an article may be organised in exactly the same way in its online and
print versions. We are concerned only with online news and do not take the print version into
consideration.
1.4.2 Research Question 2 and 3 – Named Entities
Research Question 2: Given that much news concerns something that happened to someone
somewhere, what influence do Named Entity keywords have on the presence of News Triangles
and Stacked News Triangles?
Research Question 3: Is there a distinct variance of Named Entity Type (Persons, Places or
Organisations) in keywords within the categories (Culture, Domestic, Economy, Sports)?
Methodology and assumptions for Research Questions 2 and 3
Named Entities from the 3rd quartile keywords are extracted and divided into the following
types: Person; Place; Organisation. The NE's are gone over manually to find and correct
misspelled variants such as “Georg W Bush”, “Georg W. Bush” and “George W. Bush”. Each NE is
grouped by type and category. Standard Deviation across NE and Non-NE is used to identify the
influence of NE's on News Triangle presence.
The concept of the News Triangle is to describe what happened to someone/something. As
described in the Background section above, the “What happened” part of the News Triangle
could possibly contain more NE's, since something happened to someone/something.
Investigating whether the “How it happened” contains more verbs since these often describe
some kind of action, or if the “Amplify the point” and the “Tie up loose ends” are used for
summarisation, is however out of the scope of this paper.
Relevance of research questions
Gaining knowledge of how news articles, which are semi structured texts, are partitioned into
smaller but discernible parts, is of great value to automated keyword extraction and the process
of automatically tagging news articles in a more relevant way, which could produce better
results when fetching related and relevant content.
Contribution
We expect to find that all news articles use an overall News Triangle to present information,
and that to a large extent most element blocks contain News Triangles as per our
definition in Illustration 3 and Illustration 4 above. We expect a majority of keywords to be NE's
and that there is a distinct difference between NE Types for each category.
2 Methodology
At the outset of this paper the idea and scope were to explore more than keywords and News
Triangles, thus the questionnaire, which forms the basis of the data collection method, also
asked the participants for input about taxonomies and categories. The data about taxonomies and
categories has not been used in this paper.
Quick overview of methodology
Sixty-eight Danish journalists are each asked to add keywords to a random selection of eight
news articles from a set of 31 articles. Each article is divided into partitions (Headpiece, Intro
and subsequent Section blocks according to HTML markup). The keywords for each article are
ranked according to occurrence, and the popular keywords in the 3rd quartile are used to
measure keyword popularity in the article. The popular keywords from each article are
searched for in the article, and their position, if found, is mapped for each section block. The
position and popularity count for each keyword are used to create a graph, showing the
distribution of keyword popularity in each section block. A linear fit, acting as a boolean value,
is set across each section block and the entire article, to measure if the keyword distribution
ascends or descends. An ascent from left to right indicates that there is no News Triangle
present. A descent from left to right indicates that there is a News Triangle.
For analysis of NE's we extract NE's from both articles and annotations using NLTK and our own
automated method. Only NE annotations found in the 3rd quartile are considered.
2.1 Articles - criteria and selection methods
We use 41 articles from the online edition of the Danish newspaper Politiken¹, of which we have
selected 31 articles to test on. The original intent for this paper was to also build and test an
algorithm based on the initial results, using the remaining ten articles as a test-set. However, on
first inspection of the annotations by the participants and the mapping of keywords to each
article, we decided to instead dig deeper into what caused the presence and non-presence of News
Triangles and Stacked News Triangles.
The criterion for selecting articles is to get recent articles from the categories: domestic news
(14 articles), economy (6 articles), sports (4 articles) and culture (7 articles). By selecting
articles from four common newspaper categories we are able to generalise the results better. The
reason for choosing recent articles is that we wanted the participants to be reasonably familiar
with what the article was about. The articles' subjects are: terror in Denmark, politics,
integration, music streaming, schools, education, the Copenhagen metro, credit cards, sports games.
We have not looked at data about the articles' authors, nor do we consider it relevant to include,
since this should not influence the presence of News Triangles.
All articles are written and edited by Politiken staff and the articles have been approved for use
in this thesis by Politiken's copyright office. No articles are from news-wire services. The full list
with links to articles can be found in Appendix C on page 69.
2.2 Preprocessing of articles
Each article's URL is accessed via a browser and the full HTML source code is copied to a text
document and saved. The reason for not scraping the content automatically is
Politiken.dk's paywall system, where users are only allowed to access five articles per month.
This can, however, be bypassed by using the Chrome browser's “incognito” functionality.
Using the Python library BeautifulSoup4², each text file is processed to remove excess elements
in the HTML, such as links to advertisements, navigation, semantic content etc. Only the HTML
surrounding the content is extracted. Based on the extracted HTML and content, a new HTML
page is generated, where the content is marked up to only contain H1 HTML-tags for the title,
H2 HTML-tags for the sub-headline, H3 HTML-tags for the subsequent section block
sub-headlines and HTML p-tags for paragraphs. The new HTML page does not contain images, image
captions, author bylines, dates, category identification or links to related stories.
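The clean-up step can be sketched as follows. Note that this is not the code used in the experiment: to keep the example self-contained it swaps BeautifulSoup4 for the HTMLParser class from the Python standard library, and the class name ContentExtractor, the function extract_content and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser

KEEP = {"h1", "h2", "h3", "p"}  # the only elements carried over to the new page

class ContentExtractor(HTMLParser):
    """Collect (tag, text) pairs for headline and paragraph elements only."""

    def __init__(self):
        super().__init__()
        self.items = []    # extracted (tag, text) pairs in document order
        self._tag = None   # the kept element currently open, if any
        self._parts = []   # text fragments collected inside that element

    def handle_starttag(self, tag, attrs):
        if tag in KEEP and self._tag is None:
            self._tag, self._parts = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._parts).split())  # normalise whitespace
            if text:
                self.items.append((tag, text))
            self._tag = None

def extract_content(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.items

sample = """<html><body><nav><a href="/">Front page</a></nav>
<h1>Metro delayed</h1><h2>New opening date</h2>
<p>First <a href="#">paragraph</a>.</p><div class="ad">Advert</div>
<h3>Background</h3><p>Second paragraph.</p></body></html>"""
for tag, text in extract_content(sample):
    print(tag, text)
```

Text inside navigation and advertisement elements is dropped because it never sits inside one of the four kept elements, while inline links inside a paragraph keep their text.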
1 http://politiken.dk
2 http://beautiful-soup-4.readthedocs.org/en/latest/
2.3 Participants - criteria and selection methods
The participants were chosen via the author's network of former colleagues in the Danish
media business. The participants have a wide background in the field of journalism, ranging
from broadcasting to traditional print and online media. Their ages, annotation experience and
type of education vary. Having said that, there is a bias towards participants being forty years
of age, working with print media and having completed an education from the Danish School of
Journalism in Aarhus around the year 2001.
113 journalists were invited via email and Facebook chat to participate. 37 journalists never
replied, nine journalists declined and 71 journalists opted to participate, of which 68
completed the test. There does not seem to be a difference in opt-in rate between the invitation
methods used.
During the invitation process, several participants showed great interest in the research at hand,
since tagging (and updating taxonomy catalogues) is a daily chore which, according to some
participants, does not work to its full potential. Many felt that a lot of repetitive work was done
(and lost), and that the newspapers' Content Management Systems (CMS) were not good at
matching keywords and taxonomies or suggesting related articles. There were also many who
did not know what a taxonomy was, even though they, one imagines, work with one daily. This
could be due to online newspapers still holding on to the notion of classifying content into strict
sections, as they have done with print, where there is a natural physical affordance.
2.4 Collecting keywords via a web questionnaire
Questionnaire
The URL to the questionnaire (http://tagging.miklasnjor.com) is emailed or posted via Facebook
chat to the participant, and each URL contains a unique participant ID (example:
http://tagging.miklasnjor.com/index.php?userid=MN123). As mentioned earlier, we initially
wanted to also test an algorithm, and to be able to make comparisons between test one and two
(test two never took place) each participant is also assigned an internal user ID.
For copyright reasons the questionnaire is not made public. The article layout for the
questionnaire is responsive so as to fit all device types, although participants are informed that
the test is best taken on laptop/desktop computers or tablets. This step is done since a large
portion of the participants are contacted via Facebook and there is a chance that they received
the URL via their mobile phones.
To avoid looking like the most common Danish newspapers, the typography used is that of
the WordPress theme TwentyThirteen³, which we assume has undergone tests for readability.
On the introduction page, the participant is told what the experiment is about, what the data
will be used for and that the data will be treated anonymously. Definitions of what is meant by
keyword, taxonomy and section are also given.
3 https://theme.wordpress.com/themes/twentythirteen/
On the next page, participants are asked to enter data about their education type, year of
finishing their education, current job title, current job function, overall editorial tagging
experience, and years since working editorially with annotating content. This is done to see if
there are differences in annotation rate and keywords entered depending on the participants'
tagging experience and years since tagging articles, or on education, job function and job title.
After the second page, participants continue to the main test, where they are asked to annotate
eight articles. The data was collected over a three-week period from mid-March 2015.
Layout of articles
For each article page, the participant is presented with the article, where the markup follows
common HTML principles: the headline gets an <H1> HTML tag, sub-headlines get an <H2>
HTML tag and so forth, as described earlier in 2.2 Preprocessing of articles on page 18. See an
example of a page in Illustration 17 in Appendix A on page 52.
Reading wide pieces of text on a screen is cumbersome, so we choose to set the max width of the
text to 760 pixels. Alongside the article is a box for writing keywords, taxonomies and
categories. The box follows the top of the web-page, so participants avoid scrolling up and down
when they read the article and need to enter data.
Underneath the box is a link to the bottom of the article where the difference between
keywords, taxonomies and categories is explained. This is done to make sure that if doubt or
uncertainties arise, participants can quickly get information about definitions, a need that was
raised by some participants prior to taking the test.
Choice of articles presented to the participant
The first time a participant is presented with an article, the article with the fewest annotations
is fetched from the stack of articles, to make sure that as many articles as possible are annotated
by a wide group of participants. This process continues except for the third and fifth article,
where the participant is presented with the most annotated article, as we want to see if there is a
difference in keywords between articles with fewer and more annotations.
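The selection rule can be sketched as below. The function name next_article and its parameters are hypothetical, and excluding articles the participant has already seen is an assumption added for the example rather than a description of the questionnaire's code.

```python
def next_article(annotation_counts, seen, position):
    """Pick the next article ID for a participant.

    annotation_counts: {article_id: number of annotations collected so far}
    seen: set of article IDs already shown to this participant (assumed rule)
    position: 1-based index of the article about to be shown
    """
    candidates = {a: c for a, c in annotation_counts.items() if a not in seen}
    if not candidates:
        return None
    # The third and fifth article are the most annotated; all others the least.
    pick = max if position in (3, 5) else min
    return pick(candidates, key=candidates.get)

counts = {"article-a": 5, "article-b": 1, "article-c": 9}
print(next_article(counts, seen=set(), position=1))          # article-b
print(next_article(counts, seen={"article-b"}, position=3))  # article-c
```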
Questions asked
The participants are asked to read the article and annotate the article with:
• Keywords: Which keywords are relevant to the article.
• Taxonomies: Which taxonomies they would place the article in (based on their
assumption).
• Classification: In which section the article belongs (based on their assumption).
The data is entered into the on-page box mentioned in Layout of articles above, with either one
keyword per line or keywords comma-separated. Note that a keyword can consist of several
words, e.g. “Football Stadium”, “Terror in Denmark” or “Klaus Riskær Pedersen”.
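A minimal sketch of how such input could be split into individual keywords follows. The function name parse_keywords is an assumption; the only rules taken from the text are that keywords are separated by line breaks or commas and may contain several words.

```python
import re

def parse_keywords(raw):
    """Split on commas or line breaks, trim whitespace and drop empty entries."""
    parts = re.split(r"[,\n]", raw)
    return [" ".join(part.split()) for part in parts if part.strip()]

raw = "Football Stadium, Terror in Denmark\nKlaus Riskær Pedersen\n\n"
print(parse_keywords(raw))
# ['Football Stadium', 'Terror in Denmark', 'Klaus Riskær Pedersen']
```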
When the participant has annotated the article, they click the “Next article” button. The data
they entered is saved to a database and they are presented with a new article which they are
asked to add keywords to. We choose to save the data after each annotation, and there is no
minimum or maximum number of articles that need to be tagged. Our intuition is that
the task of tagging is seen by participants as a chore, and we feared that if they exited the test
halfway through, we would lose valuable data. The users are informed of how many articles
they have tagged.
Minor problems
A. There was a coding error which once in a while showed the same article twice. Some
participants would skip the article, some would annotate it again, and some would write in the
data collection box, that they had seen this article before. These entries have been removed.
B. Since there was no finish button on the questionnaire, some participants annotated
more than eight articles.
2.5 Analysis of collected tags
The keywords belonging to each article are collected and gone over manually to make sure that
strange html entities or other oddities inside or surrounding the keywords are caught and
normalised. This is to avoid complications further down the preprocessing pipeline.
Removing Noise from the Data
All keywords for each article are collected and made lower case, after which they are compared
and ranked according to frequency. The frequency counts of the keywords are divided into 1st,
2nd and 3rd quartiles. Keywords not belonging to the 3rd quartile for each article are discarded,
i.e. we only consider keywords with high frequency and a strong presence. This is done to avoid
outliers in the data. The reason for making all keywords lowercase is that participants
may have spelled the same keyword in title-, upper- or lowercase.
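The filtering step can be sketched as below. How the quartile boundary was computed in the experiment is not stated, so the use of statistics.quantiles with the "inclusive" method, the rule of keeping counts at or above the third quartile, and the function name popular_keywords are all assumptions for illustration.

```python
from collections import Counter
from statistics import quantiles

def popular_keywords(annotations):
    """Lowercase and count keywords, then keep those at or above the 3rd quartile."""
    counts = Counter(k.strip().lower() for k in annotations if k.strip())
    if len(counts) < 2:          # too few distinct keywords to compute quartiles
        return dict(counts)
    q3 = quantiles(counts.values(), n=4, method="inclusive")[2]
    return {keyword: c for keyword, c in counts.items() if c >= q3}

annotations = [
    "Metro", "metro", "METRO", "metro",
    "Copenhagen", "copenhagen", "copenhagen",
    "delay", "delay", "budget", "traffic", "city",
]
print(popular_keywords(annotations))  # {'metro': 4, 'copenhagen': 3}
```

Lowercasing first means that "Metro", "metro" and "METRO" are counted as one keyword before the ranking is made.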
Named Entities
Named Entities (Persons, Organisations, Locations) are extracted from all of the articles we train on.
We intended to do this in one swoop, to avoid repetition, by using the NLTK ne_chunk method⁴. On
closer inspection we notice that the NLTK ne_chunk method does not collect all NE's, which
could possibly be due to the anglocentric nature of NLTK, whereas our text corpus is in Danish. We
found a POS-tagging web service from “Center For Sprogteknologi”⁵, however it would be cumbersome
and error-prone to copy-paste data back and forth between the web form and text sheets. Likewise,
training a POS-tagger from scratch to identify named entities etc. is out of the scope of this paper, so
we write our own function to collect the rest of the NE's.
4 http://www.nltk.org/api/nltk.chunk.html
5 http://cst.dk/online/pos_tagger/
The function collects n-grams (of 1 to 5 tokens) that are either Titlecase or UPPERCASE. Naturally
we rake in many false positives. The tokens are gone over manually and false positives are removed
from the list. NE's from the NLTK ne_chunk function and our function are joined and edited again.
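A sketch of such a candidate collector is given below. It is a reconstruction from the description above rather than the function used in the experiment; the name ne_candidates and the test of only the first character of each token (which covers both Titlecase and UPPERCASE) are assumptions.

```python
import re

def ne_candidates(text, max_len=5):
    """Collect n-grams of 1 to max_len tokens in which every token starts uppercase."""
    tokens = re.findall(r"\w+", text)
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if all(token[0].isupper() for token in window):
                found.add(" ".join(window))
    return found

text = "Yesterday Klaus Riskær Pedersen visited the FCK stadium."
candidates = ne_candidates(text)
print("Klaus Riskær Pedersen" in candidates)  # True
print("Yesterday" in candidates)              # True, a false positive
```

Sentence-initial words such as "Yesterday" are picked up as well, which is why a manual pass over the candidates is needed.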
2.6 Identifying News Triangles
To map where keywords are found across the articles, we partition the articles into: header and
sub-headline; intro; and subsequent section blocks including the section block sub-headline. The
intro and section blocks are divided into 20 buckets, based on the fact that most section blocks
consist of roughly 20 sentences.
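The partitioning can be sketched as follows, assuming the article is available as a list of (tag, text) pairs in document order, with the markup described in 2.2 (h1 for the title, h2 for the sub-headline, h3 for section sub-headlines and p for paragraphs). The function name partition_article and the dictionary layout are hypothetical.

```python
def partition_article(items):
    """Group (tag, text) pairs into headpiece, intro and section blocks."""
    blocks = {"headpiece": [], "intro": [], "sections": []}
    for tag, text in items:
        if tag in ("h1", "h2"):
            blocks["headpiece"].append(text)
        elif tag == "h3":
            blocks["sections"].append([text])    # a sub-headline opens a new block
        elif blocks["sections"]:
            blocks["sections"][-1].append(text)  # paragraph inside a section block
        else:
            blocks["intro"].append(text)         # paragraph before the first h3
    return blocks

items = [
    ("h1", "Title"), ("h2", "Sub-headline"), ("p", "Intro paragraph."),
    ("h3", "Section A"), ("p", "A1"), ("p", "A2"),
    ("h3", "Section B"), ("p", "B1"),
]
blocks = partition_article(items)
print(len(blocks["sections"]))  # 2
```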
We write a Python program to go through each sentence in each section block of the article, and
if a single- or multiple-token keyword in the 3rd quartile is found, the program marks the
keyword's position. The resulting data-set is sent off programmatically to plot.ly, where we
manually add the linear fit to each block. The values of the Squared Correlation Coefficient (R²)
and Mean Squared Error (MSE), and a boolean value of whether there is a News Triangle present,
are read and entered into Table 10 on page 36.
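One way the position marking could work is sketched below. The thesis does not specify how a position is converted into one of the 20 buckets, so scaling the character offset of each match to the length of the block, adding the keyword's popularity count to the bucket, and matching on plain substrings are all assumptions, as is the function name keyword_buckets.

```python
def keyword_buckets(block_text, keyword_counts, n_buckets=20):
    """Sum keyword popularity per bucket, using each match's relative position."""
    buckets = [0] * n_buckets
    text = block_text.lower()
    for keyword, popularity in keyword_counts.items():
        start = text.find(keyword)
        while start != -1:
            # Scale the character offset to a bucket index in 0..n_buckets-1.
            index = min(start * n_buckets // len(text), n_buckets - 1)
            buckets[index] += popularity
            start = text.find(keyword, start + 1)
    return buckets

block = "The metro opens late. Commuters are unhappy, but the metro company stays calm."
print(keyword_buckets(block, {"metro": 4, "commuters": 2}))
```

The keys of keyword_counts are expected in lowercase, matching the lowercased keywords from 2.5. The resulting list is the kind of per-block series a linear fit can then be drawn through.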
For identifying News Triangles in the intro and each section block, the linear fit is calculated for
each section block. For calculating the linear fit across the entire article, the header and sub-
headline are included, but since this part of the article is so compact, we do not calculate a
linear fit across it.
Note that we use the linear fit as a boolean value to identify whether there is a News Triangle or
not. We considered using alternative methods for measuring the presence of News Triangles,
but decided a linear fit is the best method to see if there is an ascent or descent across the
sections. We do find that this set-up has certain drawbacks, since the linear fit does not show
where the keywords start or stop across the 20 partitions, which could be valuable information.
For section blocks where the linear fit is almost horizontal, this is problematic. However, we
choose to partition the section blocks into 20 partitions, and this allows for visual feedback, so
the reader can see what the distribution looks like. A later step could be investigating the slopes
of the fit across the data or where the ascents or descents start and stop. We choose to
concentrate on the preliminary step of identifying if there even are News Triangles or Stacked
News Triangles.
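The boolean reading of the linear fit can be sketched with an ordinary least-squares line through the bucket values. In the experiment the fit was added in plot.ly, so the code below is an illustrative equivalent under that assumption, not the procedure that was run; the function names linear_fit and has_news_triangle are hypothetical.

```python
def linear_fit(values):
    """Least-squares line through (0, v0), (1, v1), ...

    Returns (slope, intercept, r_squared, mse); needs at least two values.
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, values))
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    r_squared = 1 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r_squared, ss_res / n

def has_news_triangle(values):
    """True when popularity descends from left to right (negative slope)."""
    return linear_fit(values)[0] < 0

print(has_news_triangle([9, 7, 4, 4, 1, 0]))  # True
print(has_news_triangle([0, 1, 1, 3, 6, 8]))  # False
```

A completely flat series gives a slope of zero and is therefore read as no News Triangle, which mirrors the drawback noted above for almost horizontal fits.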
3 Results
Since we will show many tables and illustrations, we also choose to analyse part of the data in
the results section. For brevity, we have moved the majority of the illustrations for Keyword
Popularity Distribution to Appendix B on page 53.
3.1 Participants
All participants are from Denmark. Tables 1, 2 and 3 show the participants' ages, graduation
years and education lengths. The participants' ages range from 30 to 59, the year of graduation
ranges from 1980 to 2011, and the education length ranges from 2 to 6 years, with the majority
of education lengths being four years.
Participants' Ages
Min    Max    Count    Mean     1st Median    Median    3rd Median
30     59     67*      43.66    40            42        46
Table 1: Participants' ages. The oldest participant is 59, and the youngest is 30 years old. The majority of
participants are close to 40. * Note that one participant did not specify his or her age.
Participants' Graduation Year
Min    Max    Count    Mean     1st Median    Median    3rd Median
1980   2011   68       2000     2001          2001      2002
Table 2: Participants' graduation year. The majority of participants graduated in 2001.
Participants' Education Length within Journalism and Communication
Min    Max    Count    Mean     1st Median    Median    3rd Median
2      6      68       4.088    4             4         4
Table 3: Participants' education length. All participants have an education related to the fields of either
journalism or communication, of which the majority have studied for 4 years.
Participants' Keyword Experience
The participants were asked to select for how long they had worked with annotating articles
and when they had last worked with this. In all, roughly half (54.4%) had experience with adding
keywords to journalistic articles, 42.6% had no experience and 2.9% answered “N/A”.
Question: "For how many years have you worked with annotating news articles?"
Answer                          Count
"I have never worked with it"   29
"1 - 3 years"                   17
"4 - 6 years"                   11
"7 - 9 years"                   5
"More than 10 years"            4
"N/A"                           2
Table 4: More than half of the participants have experience with adding keywords to articles.
As for when they last worked with annotating articles, 29 participants (42.65%) are currently
working with it or have worked with it within the past 3 years. 3 participants (4.41%) last
worked with it seven or more years ago. 26 participants (38.24%) have never worked with it and 6
participants (8.82%) chose “N/A”.
Question: "When did you last work with annotating news articles?"
Answer                          Count
"I have never worked with it"   26
"I work with it daily"          15
"1 - 3 years ago"               14
"4 - 6 years ago"               4
"7 - 9 years ago"               1
"More than 10 years ago"        2
"N/A"                           6
Table 5: Close to half have current or recent keyword experience.
Participants' Job Functions and Job Titles
The 68 participants label themselves with 33 different job titles and 52 different job functions,
ranging from journalist to CEO. Job titles and job functions are shown in Table 14 on page 71.
Participants' Device and OS
One concern was that participants would take the test on their phones, which we feel hinders
the annotation of articles. Each participant's browser type and device type were collected during
the test. The split is as follows: Desktop/Laptop: 60 (88.2%); Tablet: 4 (5.9%); Mobile Phone: 4
(5.9%). From Table 6 we find a wide range of browsers used by the participants across the
operating systems.
Participants' Browsers and Operating Systems
Operating System   IE     Safari   Chrome   Firefox   Others   Total
Windows            4.4%   -        30.9%    2.9%      -        38.2%
Apple              -      29.4%    5.9%     14.7%     -        50%
iOS                -      5.9%     -        -         -        5.9%
Other              -      -        -        -         5.9%     5.9%
Table 6: Participants' Browsers and Operating Systems. Windows, Chrome: 21 (30.9%); Apple, Safari: 20 (29.4%);
Apple, Firefox: 10 (14.7%); iOS iPad, Safari: 4 (5.9%); Apple, Chrome: 4 (5.9%); Windows, IE: 3 (4.4%); Windows,
Firefox: 2 (2.9%); Various others: 4 (5.9%)
3.2 Articles and Annotations
This first part of the results serves to give an overview of the data and to make
sure that everything is aligned, so that comparisons are easier when we explore the data further.
Even though the content of the articles varies and as such could produce a wide spectrum of
annotations, both in type and count, we choose to use averages to compare article categories.
We also look at the respective group's median when possible. The reason for analysing the
participants' annotations is to understand what defines a “good” keyword. Later we will analyse
where the most popular annotations among participants are placed throughout the article text.
Article Groups
We group the articles into categories to understand differences among categories and
annotations. This is also a sanity check to see if there are any outliers in our data that might
become a problem later. Table 13 on page 70 (in Appendix C) lists the articles with their
categories, article ID, a rough translation of the headline from Danish to English, and the
original URL, which is the namespace that comes after “http://politiken.dk”.
We group the articles based on their content and the first taxon in the URL. Articles could be
grouped further by URL, however we would be left with categories so small that comparison
would be difficult. The placement of articles in the Sports category is clear-cut, while
the articles in Culture, Domestic and Economy could be placed differently due to certain
subjects overlapping, e.g. an article about economy is likely to be concerned with Danish politics,
and an article about education is placed in domestic news, although that area, to some extent, is
governed by politicians. We chose to group articles about lifestyle and consumerism in the
Culture category. Table 13 on page 70 shows that seven articles belong to Culture, 14
belong to Domestic News, six articles belong to the Economy category and four articles
belong to the Sports category.
From Table 7 below we see that the Domestic and Sports categories each feature an article that has
been annotated more than twice as often as the 2nd median for that group, and roughly twice as often
as the group average (Domestic max = 39, Sports max = 37). However, looking at the 1st, 2nd and 3rd
medians for all categories, we can see that the majority of articles in each group have been annotated
by 13 – 16 participants. This makes comparison across article groups easier.
Averages and Five Number Summary of Article Categories Annotation Rate
Categories   Average   Min   Max   Median   3rd Median   1st Median
Culture      15.57     13    19    16       16           16
Domestic     18.71     12    39    15.5     16           15
Economy      14.83     13    18    14.5     15           14
Sports       18.75     12    37    13       13           13
Table 7: Domestic and Sports receive a higher Max annotation count from the 68 participants, however the 1st,
2nd and 3rd medians for all categories group closely, showing that the annotation rate for each article lies
between 13 – 16 annotations from participants. Notice also how close the 1st, 2nd and 3rd medians for
each category are.
Keyword Annotations
Table 8 and Table 9 show: the annotation types; the count of all annotations for that annotation
type; the average annotations per article; the count of unique annotations; and the average number of
unique keywords per article.
The process of finding unique annotations is as follows: all annotations are made lower case and
duplicates are removed, so we have a set of non-matching annotations. Having said that, it is
possible that annotations with the same semantic meaning exist alongside each other, or,
since annotations are not stemmed, that the same word exists in several inflected forms. Table 8 shows
that the 31 articles have collected 4930 keywords, of which 1467 are unique. Per article, on
average, there are 159.03 annotations, of which 47.32 are unique.
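The de-duplication step described above can be sketched as follows. This is a minimal illustration and not the code used in the thesis: the function name `unique_annotations` and the sample annotations are assumptions made for the example, while the totals are the ones reported in Table 8.

```python
def unique_annotations(annotations):
    """Lower-case every annotation and drop exact duplicates."""
    return {a.lower() for a in annotations}

# Hypothetical annotations for one article (not taken from the thesis data).
sample = ["Politik", "politik", "Skat", "skat", "skatter", "Folketinget"]
unique = unique_annotations(sample)
print(len(sample), len(unique))

# Collection-level averages, using the totals reported in Table 8.
total_annotations, total_unique, n_articles = 4930, 1467, 31
print(round(total_annotations / n_articles, 2))
print(round(total_unique / n_articles, 2))
```

Note that "skat" and "skatter" remain separate entries, which illustrates the point about un-stemmed annotations: inflected forms of one word count as distinct unique annotations.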
Annotations Overall
               All Annotations   Avg. Annotations per   Unique Annotations   Avg. Unique Annotations
                                 article collectively                        per article collectively
All Keywords   4930              159.03                 1467                 47.32
Table 8: Averages of annotations for keywords.
Keywords per Category
For Table 9, averages are the sum of all annotations divided by the number of articles in that
category, e.g. “Culture Keywords” = 993 / 7 (articles) = 141.86. The averages for each section that
are above average have been bolded. Illustration 5 on page 28 summarises Table 9.
In Table 9 we find that the categories receive between 691 and 2510 annotations in total, and that
Unique Annotations for each Category vary from 176 to 703. Despite this wide spread, the Average
Unique Annotations per Article collectively lie close together (44 – 52.14), with a Standard Deviation
of 4.04. In other words, although the overall and unique keyword counts per category span a wide
range, the per-article averages settle close to each other.
Keyword Annotations per Category: Culture 7, Domestic 14, Economy 6, Sports 4
Category and        All Annotations     Avg. Annotations per   Unique Annotations   Avg. Unique Annotations
Annotation Type     for each category   article collectively   for each category    per article collectively
Culture Keywords    993                 141.86                 365                  52.14
Domestic Keywords   2510                179.29                 703                  50.21
Economy Keywords    755                 125.83                 268                  44.67
Sports Keywords     691                 172.75                 176                  44
Table 9: Keyword annotations per Category of articles. For Average Annotations per article collectively:
Min 125.83, Max 179.29, Mean 154.93, Q1 133.85, Median 157.30, Q3 176.02, Std Dev 25.35. For Average Unique
Annotations per article collectively: Min 44, Max 52.14, Mean 47.75, Q1 44.34, Median 47.44, Q3 51.18, Std Dev
4.04
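The per-category averages and the summary statistics in the caption of Table 9 can be recomputed from the category totals. The sketch below is an illustration under stated assumptions, not the thesis' own code: it assumes a sample standard deviation and quartiles taken as the medians of the lower and upper halves, which are the choices that appear to match the reported figures within rounding. The helper name `quartiles` is introduced only for this example.

```python
from statistics import mean, median, stdev

# (all annotations, unique annotations, number of articles) per category, from Table 9.
table9 = {
    "Culture": (993, 365, 7),
    "Domestic": (2510, 703, 14),
    "Economy": (755, 268, 6),
    "Sports": (691, 176, 4),
}

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed method)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

# Average per article = category total divided by number of articles in the category.
avg_all = [total / n for total, _, n in table9.values()]
avg_unique = [unique / n for _, unique, n in table9.values()]

for label, values in (("all", avg_all), ("unique", avg_unique)):
    q1, q3 = quartiles(values)
    print(label, round(min(values), 2), round(max(values), 2),
          round(mean(values), 2), round(q1, 2), round(median(values), 2),
          round(q3, 2), round(stdev(values), 2))
```

Small differences in the last decimal can occur, depending on whether the statistics are computed from the rounded averages shown in the table or from the unrounded quotients.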
Also evident from Table 9 is that although articles receive different amounts of attention from
participants, the averages group close to each other. Illustration 5 shows that when the lists of
keywords are culled to contain only unique annotations, the count drops quickly and all categories group
fairly evenly within the same range.
To understand the variation within each group, Table 15 (on page 71, Detailed look at
count of Annotations of Keywords) shows in detail the number of annotations added to each
article, along with the count of unique annotations. We calculate a one-way ANOVA test for the
null hypothesis (H0) of no connection between the number of annotations added and the
number of unique annotations. We find no significant connection between the number of
annotations added and the number of unique annotations in any category. The Sports category comes
closest, but its result remains above the 0.05 threshold (P<0.07117) for keywords added vs. unique
keywords. Moreover, the score for the Sports category is based on only four articles.
3.3 Keyword Distribution
For brevity, the majority of illustrations of Keyword Popularity Distribution have been moved to
Appendix B on page 53. The article IDs, illustration numbers and page numbers are shown in the
footnotes for each results category.
As described in the methodology, the keywords in the 3rd quartile for each article are searched
for and mapped along a line representing the entire article. Each article is divided into sections,
and each section (except the header and sub-headline) consists of 20 partitions. Each partition
can feature from zero to many keywords. To identify the presence of overall News Triangles and
Stacked News Triangles, the mapping results are sent off to Plot.ly, where a linear fit is added
and used as a boolean value. A descent from left to right indicates that there is a News Triangle,
while an ascent from left to right, indicates that there is no News Triangle.
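The boolean reading of the linear fit can be sketched in a few lines. The thesis takes its fit from Plot.ly. The ordinary least-squares slope below is a stand-in for that fit, and the function names, the sample counts, and the choice to treat a perfectly flat line as "no News Triangle" are all assumptions made for illustration.

```python
def linear_slope(ys):
    """Least-squares slope of ys against partition positions 0..n-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx

def has_news_triangle(popularity):
    """True when the fitted line descends from left to right."""
    return linear_slope(popularity) < 0

# Hypothetical keyword popularity counts for one block of 20 partitions.
descending = [9, 8, 8, 7, 6, 6, 5, 5, 4, 4, 3, 3, 3, 2, 2, 1, 1, 1, 0, 0]
ascending = list(reversed(descending))
print(has_news_triangle(descending))  # True
print(has_news_triangle(ascending))   # False
```

Because only the sign of the slope is used, an almost horizontal fit is classified just as firmly as a steep one, so the slope itself is worth inspecting alongside the boolean value.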
Keyword Distribution – Culture
Articles 18, 23 and 28 have an overall News Triangle and a News Triangle for each block. Articles
16 and 22 have an overall News Triangle, but not a News Triangle for their only block (the Intro).
For article 16, the linear fits over All - Popularity and Intro Popularity Count follow each other
closely. The Header popularity in article 16 is lower than some of the spikes in the Intro.
Article 17 has an overall News Triangle and a News Triangle for the Intro and Section 2 blocks.
Section 1 does not have a News Triangle. Section 2 has less popular keywords than Section 1 and
especially the Intro block. The Header popularity is lower than some of the spikes in the Intro.
Article 27 (Illustration 6 below) has an overall News Triangle and a News Triangle for Sections 1, 3
and 4. The Intro and Section 2 have no News Triangles. Article 27 is very jagged, both in its spikes
in popularity and in how the Keyword Popularity Distribution ascends and descends in the
Intro and in Sections 1 and 2.
A list of illustration references for Culture can be found in footnote 6.
6 Culture Articles (pp. 53):
Article 16: Illustration 18 - pp. 53, Article 17: Illustration 19 - pp. 53, Article 18: Illustration 20 - pp. 54,
Article 22: Illustration 21 - pp. 54, Article 23: Illustration 23 - pp. 55, Article 27: Illustration 25 - pp. 56,
Article 28: Illustration 22 - pp. 55
Keyword Distribution – Domestic News
Articles 03, 04, 05, 08, 13, 19 and 26 all have an overall News Triangle and a News Triangle for
each block. For articles 05, 08, 19 and 26, the linear fits over All - Popularity and the Intro Popularity
Count follow each other closely.
In article 04 the Intro, Section 2 and Section 3 have an almost horizontal linear fit, indicating that the
distribution of popular keywords in those blocks is fairly even. Note how in article 08
(Illustration 7 below on page 31) the article starts with more popular keywords than there are
in the header, which by convention is supposed to be the most condensed part. This is also
apparent in article 20 (Illustration 32 page 60).
In article 13 the keyword distribution for all blocks is dispersed, and the header block has the
densest concentration of popular keywords.
Article 07 has an overall News Triangle and a News Triangle for Sections 1 and 2, but not for the
Intro block, where the majority of the popular keywords are in the rear partitions, in contrast to
Sections 1 and 2, where the popular keywords are at the beginning.
Article 20 has an overall News Triangle and a News Triangle for each block, except for the last
section (Section 2), where the popularity ascends. The popularity of keywords in the Intro and
Section 1 descends. Note how the Header popularity is lower than many of the popular partitions
in the blocks.
Article 21 has an overall News Triangle and a News Triangle for the Intro and Section 3. Sections
1, 2, 4 and 5 ascend and have no News Triangle. Sections 2, 3 and 4 have an almost horizontal
linear fit.
Article 24 has an overall News Triangle and a News Triangle for Sections 1, 2 and 3. The keyword
popularity in the Intro and Section 4 ascends, and the article has few popular keywords.
Article 25 has an overall News Triangle and a News Triangle for Sections 1 and 3. The Intro and
Section 2 ascend and have no News Triangles.
Article 29 (Illustration 8 below on page 32) has an overall News Triangle and a News Triangle for
the Intro and Section 1. Section 2 does not have a News Triangle. Sections 1 and 2 have an almost
horizontal linear fit. All blocks are condensed and the keyword popularity scores are high,
especially for the Header block, which reaches above 160. Some of the popular keywords for
article 29 are "S" and "R", which are abbreviations for two Danish political parties, namely S:
“Socialdemokraterne” (Social Democrats), and R: “De Radikale” (Liberal Democrats).
Due to the processing and tokenisation of the text data, it is likely that these abbreviations interfere
with the scoring process, especially "S", since the tokenisation process could have separated
apostrophe s's from their main word. We have not been able to verify their influence. Looking at
the number of words for each block and partition, we cannot see any abnormality in
comparison to the other articles. Nor can we see any apparent pattern in the spikes in each
partition position. Nonetheless, the results from article 29 should be approached with caution.
Article 30 has an overall News Triangle but no News Triangle for its only block (the Intro). The linear
fits across All - Popularity and Intro Popularity Count follow each other in an almost horizontal line.
A list of illustration references for Domestic can be found in footnote 7 below.
7 Domestic Articles (pp. 57):
Article 03: Illustration 26 - pp. 66, Article 04: Illustration 28 - pp. 58, Article 05: Illustration 27 - pp. 57,
Article 07: Illustration 29 - pp. 58, Article 08: Illustration 30 - pp. 59, Article 13: Illustration 31 - pp. 59,
Article 19: Illustration 32 - pp. 60, Article 20: Illustration 33 - pp. 60, Article 21: Illustration 34 - pp. 61,
Article 24: Illustration 35 - pp. 61, Article 25: Illustration 36 - pp. 62, Article 26: Illustration 37 - pp. 62,
Article 29: Illustration 38 - pp. 63, Article 30: Illustration 39 - pp. 63
Keyword Distribution – Economy
For the Economy category, all articles have an overall News Triangle. Article 09 and Article 10
also feature News Triangles for every section.
In article 01 three out of four blocks have News Triangles. In article 02 half of the four blocks
have News Triangles. Notice how the first two blocks (Intro and Section 1) in article 02
(Illustration 9 below) ascend steeply due to the popular keywords being situated in the last
partitions. Note also that the spike in Intro - Number of Words and the last spike for Section 1 -
Number of Words are a lot higher than the rest. Sections 2 and 3 show the presence of News
Triangles, with a linear fit over each section that is almost horizontal.
Article 06 has News Triangles for the first and third blocks (Intro and Section 2), but not for the
second and last (Section 1 and 3). The linear fit in Section 1 is almost horizontal, whereas for the
Intro and Section 2 blocks it descends somewhat steeply. Section 3, which has no News Triangle,
ascends somewhat steeply. Article 09 has News Triangles overall and for each block.
In article 10 there is an overall News Triangle and a News Triangle for each section, where the
Intro, Section 2 and Section 3 blocks almost follow the overall linear fit.
Article 11 shows the presence of an overall News Triangle and a News Triangle for the Intro,
Section 1 and 3. Section 2 ascends slightly and has no News Triangle. Section 1 has few popular
keywords and the linear fit floats a lot lower than the overall linear fit.
A list of illustration references for Economy can be found in footnote 8 below.
Keyword Distribution – Sports
For the Sports category there is an overall News Triangle for articles 12, 14 and 15, but not for
article 31 (Illustration 10 on page 34), which is the only one of the 31 articles that does not
have an overall News Triangle. Article 14 has an overall News Triangle even though its main
block, the Intro block, ascends. It is interesting that in article 14 the Header block has so much
influence over the overall linear fit that it manages to create a descending linear fit, even though
the majority of the popular keywords are placed at the end of the article. Article 12 and
Article 15 both have an overall News Triangle and a News Triangle for each block. Looking closer
at article 31, it is apparent that the Intro block has less popular keywords across partitions,
which might be what causes the All Popularity Fit to ascend.
A list of illustration references for Sports can be found in footnote 9 below.
8 Economy Articles (pp. 64):
Article 01: Illustration 40 - pp. 64, Article 02: Illustration 41 - pp. 64, Article 06: Illustration 42 - pp. 65,
Article 09: Illustration 43 - pp. 65, Article 10: Illustration 44 - pp. 66, Article 11: Illustration 45 - pp. 66
9 Sports Articles (pp. 67):
Article 12: Illustration 46 - pp. 67, Article 14: Illustration 47 - pp. 67, Article 15: Illustration 48 - pp. 68,
Article 31: Illustration 49 - pp. 68
Keyword Popularity across all articles
Table 10 below (on page 36) shows that there is an overall News Triangle
for every article except article 31. However, on a per-article basis, we cannot see a distinct
pattern of News Triangle presence across all section blocks. Fourteen of the 31 articles feature a
News Triangle for each block and have the Article ID bolded.
Whether a News Triangle is present is shown in Table 10 by the field
Valued, where “1” indicates that the keyword popularity descends over the course of the block
or entire article (there is the presence of a News Triangle), and “0” indicates that the keyword
popularity ascends over the course of the block or entire article (there is no presence of a News
Triangle). The Valued score is determined by looking at each article's distribution of keyword
popularity and by reading off the Squared Correlation Coefficients and Mean Squared Error for
each article.
In Culture, articles 18, 23 and 28 have News Triangles present in all blocks, whereas articles 16, 17,
22 and 27 do not. In Domestic, articles 03, 04, 05, 08, 13, 19 and 26 have News Triangles present in
all blocks, whereas articles 07, 20, 21, 24, 25, 29 and 30 do not. In Economy, articles 09 and 10 have
News Triangles present in all blocks, whereas articles 01, 02, 06 and 11 do not. In Sports,
articles 12 and 15 have News Triangles present in all blocks, whereas articles 14 and 31 do not.
From Table 10 below, we see that 5 of 7 articles in
Culture end with a block where there is a News Triangle, 9 of 14 articles in Domestic end with
News Triangles, 5 of 6 articles in Economy end with News Triangles, and 3 of 4 articles in Sports
end with News Triangles. So all in all 22 of 31 articles end with a News Triangle.
Articles with News Triangle in all blocks.
For the articles with News Triangles present in all blocks, in Culture, articles 18 and 23
consist of the Intro only, and article 28 also has a Section 1 block. For Domestic, articles 08 and 26
consist of an Intro block only, whereas articles 03, 05 and 19 consist of Intro and Section 1 blocks.
Articles 04 and 13 (Domestic) go all the way to Section 3 and 4 respectively. For Economy, article
09 has an Intro and Section 1 block with News Triangles and article 10 has News Triangles all the
way to Section 3. In Sports, articles 12 and 15 have News Triangles for each block, where article 12
has only an Intro block and article 15 goes all the way to Section 2.
Articles without News Triangle in all blocks.
For articles without the presence of News Triangles in every block, in Culture, articles 16, 17, 22
and 27 do not have a series of Stacked News Triangles present across all blocks. However, going
back to article 16 and article 22, it is clear that the linear fit is almost horizontal. In the blocks of
articles 17 and 27 where the popularity of keywords ascends, it does so distinctly. In Domestic, articles
07, 20, 21, 24, 25, 29 and 30 do not have News Triangles present in all blocks. For article 21 (Sections 1, 2
and 3), article 29 (Sections 1 and 2) and article 30 (the Intro), there is an almost horizontal
linear fit for each block. In the blocks of articles 07, 20 and 24 where the popularity of keywords
ascends, it does so distinctly. In Economy, articles 01, 02, 06 and 11 do not have News Triangles
present in all blocks. For article 06, Section 1, the linear fit is almost horizontal. For
articles 01, 02 and 11 the popularity of keywords ascends fairly distinctly. In Sports, articles 14 and
31 do not have News Triangles in all blocks. The blocks ascend fairly distinctly.
Presence of News Triangles across all Sections and per Single Sections basis
Columns: Article ID, followed by R2, MSE and Valued for each block the article has, in the order
All Sections, Intro, Section 1, Section 2, Section 3, Section 4, Section 5.
Culture
Article 16 0.0026 11.79 1 0.0009 11.84 0
Article 17 0.1762 8.527 1 0.1768 10.71 1 0.0253 7.494 0 0.0498 5.132 1
Article 18 0.0936 8.619 1 0.0127 7.378 1
Article 22 0.0575 7.629 1 0.0001 5.975 0
Article 23 0.1420 10.79 1 0.0334 8.559 1
Article 27 0.0706 7.219 1 0.0372 9.019 0 0.2497 7.748 1 0.1420 5.595 0 0.0053 5.512 1 0.0055 5.309 1
Article 28 0.0155 10.29 1 0.0077 8.602 1 0.0928 8 1
Domestic
Article 03 0.0365 12.18 1 0.0207 10.51 1 0.0743 10.57 1
Article 04 0.1060 12.81 1 0.0005 12.16 1 0.0549 12.18 1 0.0126 6.106 1 0.0105 6.959 1
Article 05 0.1329 10.5 1 0.0061 9.446 1 0.0146 5.184 1
Article 07 0.0543 6.436 1 0.0154 7.34 0 0.1775 3.453 1 0.3084 4.12 1
Article 08 0.4150 15.75 1 0.4117 16.05 1
Article 13 0.0393 5.665 1 0.0048 4.995 1 0.0393 3.976 1 0.1517 1.232 1 0.0174 6.487 1 0.3162 3.132 1
Article 19 0.2474 8.191 1 0.0832 7.752 1 0.3484 4.644 1
Article 20 0.0153 19.49 1 0.1846 20.63 1 0.0808 15.55 1 0.0218 19.39 0
Article 21 0.0265 5.487 1 0.0891 6.592 1 0.0731 5.42 0 0.0000 3.271 0 0.0061 3.503 1 0.0071 3.286 0 0.0623 6.084 0
Article 24 0.0459 5.733 1 0.0397 6.88 0 0.0507 2.887 1 0.0890 1.638 1 0.0059 4.035 1 0.0099 2.515 0
Article 25 0.1050 4.887 1 0.0030 5.433 0 0.0191 3.933 1 0.0059 4.668 0 0.1501 3.426 1
Article 26 0.3197 11.49 1 0.2347 11.3 1
Article 29 0.2863 22.07 1 0.1272 19.19 1 0.0089 10.27 1 0.0016 9.882 0
Article 30 7.3050 9.673 1 0.0027 9.844 0
Economy
Article 01 0.0130 7.173 1 0.0226 8.903 1 0.0093 7.219 0 0.0067 5.72 1 0.1092 6.722 1
Article 02 0.0038 3.815 1 0.2559 3.406 0 0.1798 4.743 0 0.0138 2.591 1 0.0003 3.041 1
Article 06 0.0138 7.077 1 0.0148 7.561 1 4.0820 7.153 0 0.0758 7.612 1 0.0668 5.228 0
Article 09 0.0791 8.223 1 0.0954 6.107 1 0.1119 6.18 1
Article 10 0.0857 8.645 1 0.0026 8.338 1 0.1458 7.522 1 0.0190 6.176 1 0.0001 4.082 1
Article 11 0.1402 8.45 1 0.0065 8.64 1 0.0052 6.652 1 0.0001 7.314 0 0.0055 7.35 1
Sports
Article 12 0.0778 8.854 1 0.0258 8.643 1
Article 14 0.0140 24.78 1 0.0111 21.4 0
Article 15 0.0486 7.196 1 0.2669 5.567 1 0.0171 5.338 1 0.2257 4.199 1
Article 31 0.0008 9.1910 0 0.2033 7.988 1 0.0572 9.686 0 0.0643 6.957 1 0.0145 8.882 1
Table 10: Squared Correlation Coefficients (R2), Mean Squared Error (MSE) and a boolean value (Valued) of
whether there is a News Triangle pattern present, where a value of 1 equals a News Triangle pattern
(Keyword Popularity descends from left to right), and a value of 0 equals no News Triangle presence (Keyword
Popularity ascends from left to right). R2 and MSE are based on the linear fit of all sections or on a per section
element basis. The values for Squared Correlation Coefficients and Mean Squared Error are taken from each
“Keyword Distribution incl. All Popularity Fit” illustration (see Appendix B) on Plot.ly. Articles with News
Triangles in all blocks are shown with the respective article's Article ID bolded.
In Table 11 (on page 37) the count of News Triangle presence vs. number of blocks is presented.
96.77% of the articles have an overall News Triangle, however there is not a clear picture for any
section blocks past Section 1. For Intro and Section 1 blocks the percentage of News Triangles
found is 70.97% and 72.73% respectively, while for Section 2 (57.89%), Section 3 (91.67%),
Section 4 (50%) and Section 5 (0%), there is a much greater variance in News Triangle presence.
There is no apparent pattern.
Percentage of News Triangles per block
                       Overall Articles   Intro    Section 1   Section 2   Section 3   Section 4   Section 5
News Triangles Found   30                 22       16          11          11          2           0
Number of Blocks       31                 31       22          17          12          4           0
Percent                96.77%             70.97%   72.73%      57.89%      91.67%      50%         0%
Table 11: Summary and Percentage of News Triangles in all articles and per block. Percentage is calculated
by dividing the number of blocks where we found a News Triangle by the Number of Blocks with descending
or ascending keyword popularity.
Looking at Table 12 below and Illustration 11 on page 39 in conjunction, it is apparent that not
all articles have an equal amount of section blocks. Economy and Sports do not have any article
blocks in section 4 or 5. All articles have an Intro block, however not all articles go beyond the
intro block and we denote this by adding (block: no. of blocks) after the mean value.
We find that the mean values for all blocks (first column) follow each other closely. The exception is
Sports, where article 31 has no overall News Triangle. The mean value for all articles is 0.9677
(Culture: mean 1, Domestic: mean 1, Economy: mean 1, Sports: mean 0.7500).
For the Intro blocks, Domestic (mean: 0.7143), Economy (mean: 0.8333) and Sports (mean: 0.7500) are
above the overall mean of 0.7097, and Culture (mean: 0.5714) is below the mean average.
For the Section 1 blocks, Domestic (mean: 0.9091, blocks: 11) is above the overall mean of 0.7273.
Culture (mean: 0.6667, blocks: 3), Economy (mean: 0.5000, blocks: 6) and Sports (mean: 0.5000, blocks: 2)
are below the mean average.
For the Section 2 blocks, Economy (mean: 0.8000, blocks: 5) and Sports (mean: 1, blocks: 2) are well
above the overall mean of 0.5789. Culture (mean: 0.5000, blocks: 2) and Domestic (mean: 0.5000,
blocks: 8) are below the mean average.
For the Section 3 blocks, Culture (mean: 1, blocks: 1), Domestic (mean: 1, blocks: 5), Sports (mean: 1,
blocks: 1) are above the overall mean of 0.9167. Economy (mean: 0.8000, blocks: 5) is below the mean
average.
For the Section 4 blocks, Culture (mean: 1, blocks: 1) is above the overall mean of 0.5000, where
Domestic (mean: 0.3333, blocks: 3) is below.
For the Section 5 blocks, only Domestic is present with one article, where it has no News
Triangle and the mean for this block and the overall block is 0.
Mean, Standard Deviation, Variance and Standard Error for all Blocks/Categories
All Blocks Intro Section 1 Section 2 Section 3 Section 4 Section 5
Culture
Mean 1 0.5714 0.6667 0.5000 1 1 -
Std Dev 0 0.4949 0.4714 0.5000 0 0 -
Variance 0 0.2449 0.2222 0.2500 0 0 -
Std Error 0 0.1870 0.2722 0.3536 0 0 -
Domestic
Mean 1 0.7143 0.9091 0.5000 1 0.3333 0
Std Dev 0 0.4518 0.2875 0.5000 0 0.4714 0
Variance 0 0.2041 0.0826 0.2500 0 0.2222 0
Std Error 0 0.1207 0.0867 0.1768 0 0.2722 0
Economy
Mean 1 0.8333 0.5000 0.8000 0.8000 - -
Std Dev 0 0.3727 0.5000 0.4000 0.4000 - -
Variance 0 0.1389 0.2500 0.1600 0.1600 - -
Std Error 0 0.1521 0.2041 0.1789 0.1789 - -
Sports
Mean 0.7500 0.7500 0.5000 1 1 - -
Std Dev 0.4330 0.4330 0.5000 0 0 - -
Variance 0.1875 0.1875 0.2500 0 0 - -
Std Error 0.2165 0.2165 0.3536 0 0 - -
All
Mean 0.9677 0.7097 0.7273 0.5789 0.9167 0.5000 0
Std Dev 0.1767 0.4539 0.4454 0.4937 0.2764 0.5000 0
Variance 0.0312 0.2060 0.1983 0.2438 0.0764 0.2500 0
Std Error 0.0317 0.0815 0.0950 0.1133 0.0798 0.2500 0
Table 12: Mean, Standard Deviation, Variance and Standard Error for all blocks in each category. Only
blocks with values are considered. Values have been rounded to four decimals. This table goes with
Illustration 11 on page 39 below.
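The statistics in Table 12 are computed over the 0/1 Valued scores of the blocks in a category. The sketch below shows one way to compute them. It assumes population formulas (dividing by n) and a standard error of the standard deviation divided by the square root of n, which is an inference from the reported figures rather than something the text states. The function name `block_stats` is introduced only for this example.

```python
from math import sqrt

def block_stats(valued):
    """Mean, population std dev, population variance and std error of 0/1 scores."""
    n = len(valued)
    mean = sum(valued) / n
    variance = sum((v - mean) ** 2 for v in valued) / n
    std_dev = sqrt(variance)
    std_error = std_dev / sqrt(n)
    return mean, std_dev, variance, std_error

# Valued scores of the Culture Intro blocks (articles 16, 17, 18, 22, 23, 27, 28) in Table 10.
culture_intro = [0, 1, 1, 0, 1, 0, 1]
print([round(x, 4) for x in block_stats(culture_intro)])
# [0.5714, 0.4949, 0.2449, 0.187]
```

With these assumptions the output agrees with the Culture Intro column of Table 12. With so few blocks per category, the standard errors are large and the means should be read with caution.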
3.4 Named Entities in Detail
The annotations entered by the participants take many forms: verbs, nouns or names of people,
places or organisations. In this section we dig deeper into Named Entities (NE).
Automatically identifying which keywords are nouns, verbs, NE's, etc. is difficult. The common
first step is Part of Speech (POS) tagging, where a token's position in the text helps identify
which POS-tag the token should receive. Since our keywords are entered without surrounding
context, POS-tagging them is difficult. We therefore concentrate only on NE's, since these are
easier to identify. The following results are divided into NE occurrence in the articles
and NE's in keywords per article.
Named Entities
A Named Entity (NE) is a Person, Place or Organisation. NE's are a central part of news articles,
as they describe that something has happened to someone. It is worth noting that parts of an NE
may occur elsewhere in the text, for example when a person is mentioned by first or last name
only. We find that this occurs in some texts, e.g. “Klaus Riskær Pedersen” is also referred to as
only “Riskær”. We also note that names in news articles are sometimes misspelled: “Roberto Ferhi”,
“Roberto Mehri”, “Roberto Merhi”. It is difficult to extract this information in a useful manner,
since both first and last names could also be nouns or verbs, and we could end up with false
positives in our dataset.
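
To make the problem concrete, the following Python sketch groups partial mentions and near-identical spellings of the same name. It is only an illustration and is not part of the method used in this thesis: the helper names, the token-subset rule and the similarity threshold of 0.85 are all assumptions, and a rule this simple would also produce the false positives mentioned above when a name coincides with an ordinary word.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_partial_mention(short, full):
    """True if every token of `short` also occurs as a token of `full`."""
    short_tokens = short.lower().split()
    full_tokens = full.lower().split()
    return bool(short_tokens) and all(t in full_tokens for t in short_tokens)


def group_variants(names, threshold=0.85):
    """Greedily group names that look like variants of the same entity."""
    groups = []
    for name in names:
        for group in groups:
            representative = group[0]
            if (similarity(name, representative) >= threshold
                    or is_partial_mention(name, representative)
                    or is_partial_mention(representative, name)):
                group.append(name)
                break
        else:
            groups.append([name])  # no existing group matched
    return groups


names = ["Klaus Riskær Pedersen", "Riskær",
         "Roberto Merhi", "Roberto Ferhi", "Roberto Mehri"]
for group in group_variants(names):
    print(group)
```

With these five names the sketch yields two groups, one per person. The result depends on the order of the input, since each group is compared only through its first member.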
From the list of NE's extracted from the entire collection of 31 articles, we find 341 NE's,
of which 298 are unique. For article 29 we find that the annotation “S”, an abbreviation for the
Danish party “Socialdemokratiet” (Social Democrats), appears 264 times in the text. We regard
this as an abnormality (and an indication of why text mining and NLP are difficult): the NLP
steps of tokenising sentences and words, and converting words to lowercase, might have split
plural s's off from words in the text, so that they are counted as the token “s”. The count for
the token “s” has therefore been removed from the data. We also notice that the word “OL”, the
Danish abbreviation for the Olympic Games, is mentioned 60 times in article 31. Since the
abbreviation “OL” is unique, we do not regard this as an abnormality, and it has not been
removed from the data.
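
The kind of inflated count described for “S” can be reproduced on a toy example. The Python sketch below uses a hypothetical tokeniser and a made-up English sentence, not the pipeline or the articles from this thesis. Here the stray “s” fragments come from English possessives rather than from plural endings, but the effect is the same: once tokens are lowercased, the abbreviation can no longer be told apart from the fragments.

```python
import re
from collections import Counter


def tokenise(text, lowercase):
    """Split on anything that is not a letter; optionally lowercase tokens."""
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t.lower() for t in tokens] if lowercase else tokens


# Made-up sentence; "S" stands in for the party abbreviation.
sample = "S wants lower taxes. The party's leader said S's plan is ready."

cased = Counter(tokenise(sample, lowercase=False))
lowered = Counter(tokenise(sample, lowercase=True))

print(cased["S"])    # 2: genuine mentions of the abbreviation
print(lowered["s"])  # 4: abbreviation plus two stray "s" fragments
```

A one-letter abbreviation is especially exposed to this effect, whereas a longer abbreviation such as “OL” is unlikely to collide with a word fragment.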
3.5 Named Entity Occurrence in Articles
Overall Named Entity Occurrence in Articles
Illustration 12 below shows a line and bar chart, and a box plot, of the distribution of
frequencies of all types of NE in the 31 articles, where frequency is the number of times an NE
is mentioned per article. We find that the occurrences follow a Pareto power-law distribution,
where the majority of NE's in an article occur two to eight times in the text.
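
As a minimal sketch of how such a distribution can be tabulated, the Python snippet below counts how many NE's are mentioned exactly k times in one article. The function name and the mention list are made up for illustration, and the entity names are placeholders.

```python
from collections import Counter


def frequency_of_frequencies(mentions):
    """Map k to the number of NE's that are mentioned exactly k times."""
    per_entity = Counter(mentions)            # NE -> mentions in the article
    histogram = Counter(per_entity.values())  # mentions -> number of NE's
    return dict(sorted(histogram.items()))


# Made-up mention list for a single article.
mentions = ["NE_A"] * 5 + ["NE_B"] * 2 + ["NE_C"] * 2 + ["NE_D"]
print(frequency_of_frequencies(mentions))  # {1: 1, 2: 2, 5: 1}
```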
By conducting a Chi-2 test (F-value 15.00143 / p-value 0.00027), we conclude that there is no
connection between the number of times an article has been annotated by participants and the
count of NE's in the article; thus we are certain that we are not working with skewed data.
  • 1. Faculty of Technology and Society Department of Computer Science Master Thesis Project, 15 ECTS, DA613A, Spring 2015 Identifying Single and Stacked News Triangles in Online News Articles - an Analysis of 31 Danish Online News Articles Annotated by 68 Journalists By Miklas Njor Supervisor: Daniel Spikol Examiner: Bengt Nilsson Contact Information: Author: Miklas Njor miklas@miklasnjor.com Supervisor: Daniel Spikol daniel.spikol@mah.se Examiner: Bengt Nilsson bengt.nilsson.ts@mah.se Abstract: While news articles for print use one News Triangle, where important information is at the top of the article, online news articles are supposed to use a series of Stacked News Triangles, due to online readers text- skimming habits[1]. To identify Stacked News Triangles presence, we analyse how 68 Danish journalists annotate 31 articles. We use keyword frequency as the measure of popularity. To explore if Named Entities influence News Triangle presence, we analyse Named Entities found in the articles and keywords. We find the presence of an overall News Triangle in 30 of 31 articles, while, for the presence of Stacked News Triangles, 14 of the 31 articles have Stacked News Triangles. For Named Entities in News Triangles we cannot see what their influences is. Nonetheless, we find difference in Named Entity Types in each category (Culture, Domestic, Economy, Sports). Keywords: Keyword popularity, Keyword frequency, Folksonomy, Named Entity, Online News,
  • 2. Popular Science Summary The Internet forced the media to be more streamlined. It also resulted in a huge decline in circulation and revenue for newspapers, which further led to layoff of staff. The staff that are left are pressed for time, both when it comes to producing news, and putting the articles online and adding metadata to the content. With the shift from reading news in a physical newspaper, to reading news online on computers or mobile phones, there has also been a change in reading habits, where readers now skim text when reading news online. To serve the reader the most important information first, when journalists write a news article for print, they use an overall News Triangle, that is, the most important information is at the beginning of the story. But for online news articles, the new idiom is to use a series of stacked News Triangles, so that each section is a News Triangle in itself, as this allows the reader to skim and understand the text more easily. From a text-mining perspective this is interesting, because knowing how a piece of text is supposed to be structured, will make it more straightforward to teach a computer how to read and learn from the text, for instance to automatically add keywords and taxonomies. Dividing the text into even smaller more manageable chunks, could allow for even better output. However, one thing is planning how things should be, another thing is how reality actually plays out. This paper finds, with the help of 68 journalists, that although the News Triangle does exist, the idea of dividing the text into smaller News Triangles is not so clear-cut and is problematic to identify. Only half of the articles we look at, consist of smaller News Triangles throughout the article, and exactly what influences whether or not there is a series of News Triangles still remains unclear. Thus more research has to be done to understand the complexity of how online news articles are structured. 
Identifying Single and Stacked News Triangles in online news articles. Master Thesis 15 ECTS, DA613A, 2015, Miklas Njor 2 of 73
  • 3. Extended Abstract
ABSTRACT: The concept of a News Triangle is to place the most important information at the top of the story. This is how news articles for print have traditionally been written. Because online readers skim text, and to make the text easier to understand, online news articles supposedly use a series of Stacked News Triangles [1]. To identify this pattern and test whether it is meaningful for an automated process of annotating articles, we analyse how 68 Danish journalists annotate 31 articles and where these keywords appear in each article. We use annotation keyword frequency as a measure of popularity. To see if Named Entities (Persons, Places, Organisations) influence News Triangle presence, we analyse Named Entities found in the articles and keywords. We also analyse Named Entities (NE) across article subject categories.
Motivation: Each day numerous articles are published in online newspapers. To make search retrieval and recommendation of related articles easier, each article is manually annotated with relevant keywords, taxonomies and a category. This process is tedious and subjective, and over time keywords and taxonomies can become stale. Automatically annotating articles with relevant keywords and taxonomies could help organise content better and ensure that keywords are always relevant. Furthermore, it could prevent high bounce rates by suggesting more relevant articles to readers who enter the site via external links, such as social media [2], which in turn could lead to more page-views and revenue from online advertising.
Problem statement: While one could count names or word frequencies in a text and conclude that the names or the most common words are what the article is about, the reality is more complex. Our intuition is that structuring online news articles by stacking News Triangles upon each other forms an even more semi-structured document and eases automated annotation, since algorithms can more easily identify pivoting points. It is however unknown if News Triangles or Stacked News Triangles exist, are measurable or are useful to identify.
Methodology: Sixty-eight Danish journalists annotate 8 articles from a set of 31 articles. From the annotation data-set, the popular keywords in the 3rd quartile from each article are used to measure keyword popularity across the article's content. Each article is divided into partitions (Headpiece, Intro and subsequent Section blocks according to the original HTML markup) and the placement of keywords is mapped. The position and popularity count for each keyword are used to create a graph showing the distribution of keyword popularity in each section block. A linear fit across each block and across the entire article acts as a boolean value to identify the presence of one overall News Triangle and of Stacked News Triangles.
Identifying Single and Stacked News Triangles in online news articles. Master Thesis 15 ECTS, DA613A, 2015, Miklas Njor 3 of 73
  • 4. We also extract Named Entities from the articles and annotations, and analyse their presence, position and Named Entity Type.
Results: We find an overall News Triangle in 30 of 31 articles. For a series of Stacked News Triangles within the content, we find that 14 of the 31 articles have Stacked News Triangles in all sections. However, it is not clear what influences this behaviour. For Named Entities in annotations we cannot see what influences them either. We find differences in Named Entity Types between the categories.
Conclusion: To our knowledge, this is the first time that News Triangles and Stacked News Triangles have been identified within Computer Science. Looking at the block level, it is however difficult to see what influences News Triangle presence. We find that annotations quickly group around the same words. We also find that the Types of Named Entities used differ between categories and that Named Entity mentions follow a Pareto-like power-law distribution, similar to Zipf's law. We identify future work that needs to be done within this area of mining online news.
In the name of transparency and reproducibility we have uploaded much of the data and illustrations to http://plot.ly. Where possible, we link from within the captions of the tables and illustrations. The raw data (annotations and participant data) can be found here: http://figshare.com/account/projects/4414
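The linear-fit test described in the methodology can be illustrated with a minimal sketch. It assumes, since the abstract does not spell this out, that a block counts as a News Triangle when the least-squares slope of keyword popularity against keyword position is negative, and that an article has Stacked News Triangles when every block does. The function names and the sample numbers are hypothetical and are not taken from the thesis data.

```python
from statistics import mean


def slope(popularity):
    """Least-squares slope of popularity plotted against position 0..n-1."""
    n = len(popularity)
    if n < 2:
        return 0.0  # a single point has no trend
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(popularity)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, popularity))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def has_news_triangle(popularity):
    """Assumption: falling popularity (negative slope) signals a News Triangle."""
    return slope(popularity) < 0


def triangle_report(blocks):
    """blocks maps a block name to keyword popularity counts in reading order."""
    overall = [p for block in blocks.values() for p in block]
    per_block = {name: has_news_triangle(b) for name, b in blocks.items()}
    return {
        "overall": has_news_triangle(overall),
        "per_block": per_block,
        "stacked": all(per_block.values()),
    }


# Hypothetical popularity counts for one article.
example = {
    "headpiece": [9, 7, 4],
    "intro": [6, 5, 5, 2],
    "section_1": [1, 3, 4],
}
report = triangle_report(example)
```

In this made-up article the popularity falls across the whole text and within the headpiece and intro, but rises within the first section, so the sketch would report an overall News Triangle without Stacked News Triangles in all sections.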
  • 5. Table of Contents
1 Introduction (p. 8)
1.1 Background (p. 8)
1.2 The News Triangle (p. 9)
1.3 Related Work (p. 11)
1.4 Research Questions (p. 14)
1.4.1 Research Question 1 – News Triangles (p. 14)
1.4.2 Research Question 2 and 3 – Named Entities (p. 16)
2 Methodology (p. 17)
2.1 Articles - criteria and selection methods (p. 18)
2.2 Preprocessing of articles (p. 18)
2.3 Participants - criteria and selection methods (p. 19)
2.4 Collecting keywords via a web questionnaire (p. 19)
2.5 Analysis of collected tags (p. 21)
2.6 Identifying News Triangles (p. 22)
3 Results (p. 22)
3.1 Participants (p. 23)
3.2 Articles and Annotations (p. 25)
3.3 Keyword Distribution (p. 28)
3.4 Named Entities in Detail (p. 39)
3.5 Named Entity Occurrence in Articles (p. 40)
3.6 Named Entities in Keywords (p. 42)
4 Analysis (p. 45)
4.1 Analysis of Keyword Distribution and News Triangle Presence (p. 46)
4.2 Analysis of Named Entities (p. 47)
5 Discussion (p. 48)
5.1 Future Work (p. 50)
A Appendix (p. 52)
B Appendix (p. 53)
B.A Keyword Distribution – Culture (p. 53)
B.B Keyword Distribution – Domestic (p. 57)
B.C Keyword Distribution – Economy (p. 64)
B.D Keyword Distribution – Sports (p. 67)
C Appendix (p. 69)
C.A List of articles (p. 69)
C.B Grouping of articles in Categories (p. 70)
C.C Participants Job Functions and Job Titles (p. 71)
C.D Detailed look at count of Annotations of Keywords (p. 71)
C.E Occurrence of Named Entities in Culture, Domestic, Economy, Sports (p. 72)
  • 6. Illustration Index
Illustration 1: Re-drawn example of “The News Triangle” (p. 10)
Illustration 2: Inverted News Triangle for print (p. 15)
Illustration 3: Inverted News triangles for online news (p. 15)
Illustration 4: Article from Politken.dk, explanations of elements, a Stacked News Triangle (p. 16)
Illustration 5: Annotations by article Category (p. 28)
Illustration 6: Article 27 - Keyword Distribution incl. All Popularity Fit (p. 29)
Illustration 7: Article 08 – Keyword Distribution incl. All Popularity Fit (p. 31)
Illustration 8: Article 29 – Keyword Distribution incl. All Popularity Fit (p. 32)
Illustration 9: Article 02 - Keyword Distribution incl. All Popularity Fit (p. 33)
Illustration 10: Article 31 - Keyword Distribution incl. All Popularity Fit (p. 34)
Illustration 11: Mean, std.dev, variance, std. error of NT presence Per Category and Blocks (p. 39)
Illustration 12: Frequency and Occurrences of Named Entities (NE) from all articles (p. 40)
Illustration 13: Normalised Percentage of Occurrence of Named Entities per Category (p. 42)
Illustration 14: Named Entity Type averages per Category for Named Entities keywords (p. 43)
Illustration 15: Normalised Percentage of Named Entity Types per Category (p. 44)
Illustration 16: Named Entity Types per Category & std. dev. Of Annotations (p. 45)
Illustration 17: Example of an article as presented to the participants (p. 52)
Illustration 18: Article 16 - Keyword Distribution incl. All Popularity Fit (p. 53)
Illustration 19: Article 17 - Keyword Distribution incl. All Popularity Fit (p. 53)
Illustration 20: Article 18 - Keyword Distribution incl. All Popularity Fit (p. 54)
Illustration 21: Article 22 - Keyword Distribution incl. All Popularity Fit (p. 54)
Illustration 22: Article 28 - Keyword Distribution incl. All Popularity Fit (p. 55)
Illustration 23: Article 23 - Keyword Distribution incl. All Popularity Fit (p. 55)
Illustration 24: Article 28 - Keyword Distribution incl. All Popularity Fit (p. 56)
Illustration 25: Article 27 - Keyword Distribution incl. All Popularity Fit (p. 56)
Illustration 26: Article 03 - Keyword Distribution incl. All Popularity Fit (p. 57)
Illustration 27: Article 05 – Keyword Distribution incl. All Popularity Fit (p. 57)
Illustration 28: Article 04 – Keyword Distribution incl. All Popularity Fit (p. 58)
Illustration 29: Article 07 – Keyword Distribution incl. All Popularity Fit (p. 58)
Illustration 30: Article 08 – Keyword Distribution incl. All Popularity Fit (p. 59)
Illustration 31: Article 13 – Keyword Distribution incl. All Popularity Fit (p. 59)
Illustration 32: Article 19 – Keyword Distribution incl. All Popularity Fit (p. 60)
Illustration 33: Article 20 – Keyword Distribution incl. All Popularity Fit (p. 60)
Illustration 34: Article 21 – Keyword Distribution incl. All Popularity Fit (p. 61)
Illustration 35: Article 24 – Keyword Distribution incl. All Popularity Fit (p. 61)
Illustration 36: Article 25 – Keyword Distribution incl. All Popularity Fit (p. 62)
Illustration 37: Article 26 – Keyword Distribution incl. All Popularity Fit (p. 62)
Illustration 38: Article 29 – Keyword Distribution incl. All Popularity Fit (p. 63)
Illustration 39: Article 30 – Keyword Distribution incl. All Popularity Fit (p. 63)
Illustration 40: Article 01 - Keyword Distribution incl. All Popularity Fit (p. 64)
Illustration 41: Article 02 - Keyword Distribution incl. All Popularity Fit (p. 64)
Illustration 42: Article 06 - Keyword Distribution incl. All Popularity Fit (p. 65)
Illustration 43: Article 09 - Keyword Distribution incl. All Popularity Fit (p. 65)
Illustration 44: Article 10 - Keyword Distribution incl. All Popularity Fit (p. 66)
Illustration 45: Article 11 - Keyword Distribution incl. All Popularity Fit (p. 66)
Illustration 46: Article 12 - Keyword Distribution incl. All Popularity Fit (p. 67)
Illustration 47: Article 14 - Keyword Distribution incl. All Popularity Fit (p. 67)
Illustration 48: Article 15 - Keyword Distribution incl. All Popularity Fit (p. 68)
Illustration 49: Article 31 - Keyword Distribution incl. All Popularity Fit (p. 68)
  • 7. Index of Tables
Table 1: Participant's Ages (p. 23)
Table 2: Participant's Graduation Year (p. 23)
Table 3: Participant's Education Length (p. 23)
Table 4: Question: "For how many years have you worked with annotating news articles?" (p. 24)
Table 5: Question: "When did you last work with annotating news articles?" (p. 24)
Table 6: Participant's Browsers and Operating Systems (p. 25)
Table 7: Averages and Five Number Summary of Article Categories Annotation Rate (p. 26)
Table 8: Annotations Overall (p. 26)
Table 9: Keyword Annotations per Category: Culture 7, Domestic 14, Economy 6, Sports 4 (p. 27)
Table 10: Presence of News Triangles across all Sections and per Single Sections basis (p. 36)
Table 11: Percentage of News Triangles per block (p. 37)
Table 12: Mean, Standard Deviation, Variance and Standard Error for all Blocks/Categories (p. 38)
Table 13: Grouping of articles in Categories (p. 70)
Table 14: Job Functions and Job Titles (p. 71)
Table 15: Detailed look at counts of Annotations of Keywords (p. 72)
Table 16: Occurrences of Named Entities per article for Culture (p. 72)
Table 17: Occurrences of Named Entities per article for Domestic (p. 73)
Table 18: Occurrences of Named Entities per article for Economy (p. 73)
Table 19: Occurrences of Named Entities per article for Sports (p. 73)
List of acronyms
ML: Machine Learning
NE: Named Entity
NLP: Natural Language Processing
NLTK: Natural Language Tool Kit (a Python programming language library)
NT: News Triangle
POS: Part of Speech
TF-IDF: Term Frequency Inverse Document Frequency
LF: Linguistic Features
LA: Lexical Affiliates
HCA: Hierarchical Clustering Algorithm
CMS: Content Management System
HTML: Hyper Text Markup Language
Concepts
Latent Dirichlet allocation (each document is a mixture of smaller topics)
Entropy (information gain)
  • 8. 1 Introduction
Each day numerous articles are published in online newspapers. To make search retrieval and recommendation of related articles easier for readers, each of these articles is manually annotated with relevant metadata, including the article's keywords, taxonomies and the category it belongs to.
The information in newspaper articles is presented in descending order of importance, so that important information is relayed first. Within journalism this is known as the “News Triangle”, where the writer explains “What happened” and “How it happened”, and then proceeds to “Amplify the point” and “Tie up loose ends” (WHAT) [3]. The article is written as one large News Triangle, with a headpiece and an introduction followed by smaller section blocks, where the level of information diminishes the further you read.
This writing process has worked well for more than a hundred years. However, writing for online is different from writing for print. According to Tverskov & Tverskov [1] (2004) and Sissons [3] (2006), the classic print article changed somewhat when it entered the online arena. The setup of a headline, a sub-headline and an introduction followed by several section blocks still holds. But where articles written for print, with a limited space not known in advance, are produced so that they can quickly be edited bottom-up, online articles are not limited in the same way. Moreover, readers of online news tend to skim text instead of reading from top to bottom. This has supposedly led to a shift from using one overall News Triangle towards using smaller News Triangles for each sub-section [1] (p. 40).
If articles are indeed partitioned into smaller News Triangles, where the level of important information decreases the further into each block we read, this could make keyword extraction and knowledge discovery easier. When designing algorithms, we would be able to add weights to each section and compare keywords' placement.
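As a rough illustration of what adding weights to each section could look like, the sketch below scores every keyword by the weight of the section in which it occurs. The section names, the weights and the tokens are hypothetical; the thesis does not prescribe a weighting scheme, so this is only one possible reading of the idea.

```python
def weighted_keyword_scores(sections, weights, default_weight=1.0):
    """Score keywords by the weight of the section they occur in.

    sections: list of (section_name, tokens) pairs in reading order.
    weights:  dict mapping a section name to its weight (an assumption,
              e.g. higher weights for sections nearer the top).
    """
    scores = {}
    for name, tokens in sections:
        weight = weights.get(name, default_weight)
        for token in tokens:
            scores[token] = scores.get(token, 0.0) + weight
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical article, already tokenised per section.
article_sections = [
    ("headpiece", ["election", "minister"]),
    ("intro", ["minister", "budget"]),
    ("section_1", ["budget", "budget", "tax"]),
]
section_weights = {"headpiece": 3.0, "intro": 2.0, "section_1": 1.0}

ranking = weighted_keyword_scores(article_sections, section_weights)
```

Here "minister" outranks "budget" although "budget" occurs more often, because its occurrences sit higher up in the article.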
With this in mind, this paper investigates what the structure of a set of online newspaper articles actually looks like.
1.1 Background
For many years the newspaper was a distribution channel for news and advertising. With the Internet, online newspapers became just one of many options for users to spend their time and for advertisers to spend their money. This change brought on a huge decline in circulation and revenue for newspapers, and many newspapers are today struggling to survive. A consequence of the decline is layoffs, and the staff that are left are pressed for time, both when producing news and when placing the articles online and adding metadata.
In a physical newspaper there is a clear priority of content, and each article undergoes an editorial process before publication. On the front page there are a number of top stories prioritised as A (above the fold), and B and C stories (below the fold). Inside the newspaper, stories are prioritised using page numbers, images, headline sizes, position on the page etc. There is a clear indication of the newspaper sections (Domestic, International, Sports, Culture, etc.). Physical newspapers have a finite state, a beginning and an end, for a given period, usually a day.
  • 9. For online newspapers the story is slightly different. The newspaper's homepage (front page) can feature 50-100 stories shown in semi-prioritised chronological order, changing throughout the day. The newspaper sections are different too, in that they follow an identical design resembling that of the homepage, where the articles are displayed chronologically. Thus the hierarchy and prioritisation on newspaper websites differ from those of a physical newspaper, and news media are well aware of this [4]. To meet the need for better navigation, findability and recommendations for users, newspapers employ Information Architecture tools such as keywords, hierarchies, taxonomies, automated recommendation systems and classification (sections), among others [5]. These tools are managed according to journalistic principles, and are set and prioritised by journalists who have knowledge of the domain area. This can be a laborious and repetitive task, which often needs careful planning [6][7].
1.2 The News Triangle
“News stories should flow logically from the first paragraph. They should have pace and no unnecessary elements should slow the story down. And even though readers won’t see a structure, there is one. For hard news there is quite a strict structure. One way of looking at it is through the News Triangle or inverted pyramid. Generations of journalists have been brought up on this.” - Sissons [3] (pp. 70).
Placing the less important news at the bottom of the article is useful because it is not known in advance how much space there is on the physical newspaper page where the article is to be placed. The story's priority might change, or the allotted space might have to give way to an advertisement.
Knowing that content placed at the bottom of the article is less important makes it easier and quicker for editors, who might not know what is important to the story, to edit the text. As such, the structure serves as a guideline. It is also useful to the reader, especially the reader of online news, that information is presented in a logical order. According to Sissons and Tverskov & Tverskov, when a journalist writes for online news, the text should be even more structured, consisting of not one giant News Triangle, but many. Condensed blocks of information allow readers to skim the article and quickly go back and forth in the text.
  • 10. Looking at Illustration 1 above, we could also expect that, although the information should be more concentrated at the top of the triangle, each part has its own character. “What happened” could contain more Named Entities (Names, Places, Organisations), since something happened to someone or something. “How it happened” could contain more verbs, since these often describe some kind of action. “Amplify the point” could, together with “Tie up loose ends”, be considered a summary.
Another strong point to highlight about online news as a structured landscape for information is that space is endless. Journalists are not forced to cram all information into one article, but can divide the article into several articles and link them together via hyperlinks. This allows journalists to dig a little deeper into the subject of each piece, which in turn could mean that each article is more to the point and the information is condensed [3] (pp. 143) [1] (pp. 40). Knowing how a text within a certain domain is supposed to be structured could greatly enhance data-mining, text-mining and Natural Language Processing (NLP).
Common data- and text-mining, and NLP techniques
There exist many Data Mining [8] and Natural Language Processing (NLP) techniques and tools [9], and frameworks like NLTK [10], for automating the extraction of keywords and classifying text. Below we highlight some of the many techniques and concepts used within machine learning, data- and text-mining, and NLP.
Common Preprocessing Steps: Text is normalised in order to better count and compare tokens in the text. First, common stop words (“a”, “I”, “it”, “he”, “she”, etc.) are removed. The text is separated into sentences, and words are tokenised (within NLP, words are called tokens). Each token is stemmed, that is, reduced to its stem, for example by removing plural endings.
Part of Speech (POS): Part of Speech tagging (POS) is the process of identifying nouns, verbs and Named Entities (Names, Places and Organisations) etc.
This is done by comparing each token with the tokens that surround it, and estimating the likelihood of the token being a certain POS-tag given a certain context. E.g. “gay men” can mean both happy males and homosexual males, depending on the context of the surrounding text.
Frequency Count and Frequency Distribution: A common step in identifying what a text is about is to count and rank the occurrence of tokens. The idea is that the more often a word appears, the more likely it is that this is what the text is about. Frequency Count can be combined with Frequency Distribution, where the occurrences of the most common tokens are mapped across the text corpus.
Unigrams, Bigrams and Trigrams: A single token (one word) is called a unigram. Two tokens next to each other are called a bigram and three tokens next to each other are called a trigram. Unigrams, bigrams, trigrams etc. are often used together with Frequency Count and Frequency Distribution.
Collocations: Collocations (also known as Lexical Affiliations) are separate words that appear in conjunction. The notion of conjunction can be tuned according to how far from each other the words are “allowed” to appear. The idea is that if words like “software” and “upgrade” often appear fairly close together, they should be considered a single concept: “software upgrade”.
TF-IDF: Term Frequency Inverse Document Frequency (TF-IDF) is a technique where a term is weighted by how often it occurs in a document, scaled down by how many documents in the corpus contain it. Terms that are characteristic of one document are thus ranked higher than terms that occur frequently everywhere. The notion is that a term used often in one text, but rarely in other texts, must be important to that text.
Organisation of content
The paper is organised as follows: Section 1 features 1.3 Related Work and 1.4 Research Questions. Methods (section 2) are explained on pages 17 - 22. Results (section 3) are shown on pages 22 - 44. The analysis (section 4) starts on page 45 and the discussion (section 5) starts on page 48. To not interrupt the flow, illustrations of Keyword Popularity Distributions are for a large part moved to Appendix B on page 53.
A list with links to the articles used in this paper, and some larger tables, can be found in Appendix C on page 69.
1.3 Related Work
Extracting relevant keywords from texts is not a new area of research, and much research within machine learning (ML), data- and text-mining, and Natural Language Processing (NLP) focuses on this area. However, we have not been able to find research which uses human annotators or evaluators before the algorithms have run. This is our approach, since we use the human input as a measure to identify News Triangles, not as a measure for evaluating whether our model is correct. Dividing the text into smaller chunks for weighing the output of keywords, or for selecting appropriate taxonomies or classifications, is an area with some research. A common denominator for most studies is the use of online news articles and of algorithms that tackle specific domains of knowledge. While the concept of a News Triangle is well known within journalism, there is, to our knowledge, no research that identifies News Triangles in a scientific manner within the domains of machine learning, data- and text-mining, or NLP.
  • 12. A common technique within text-mining and NLP is to look at frequencies of occurrence across a text corpus as a means to understand what the text is about. However, we have not been able to find research that uses the notion of keyword popularity from annotations as a way to measure information density.
How we searched
We have searched for original research articles using Malmö University's Summon search service, the websites of IEEE, ACM and ScienceDirect, and Google Scholar, but found very little relevant research, so our strategy was to follow citations up and downstream from the articles we did find. The main search terms are “News Triangle”, “HTML blocks”, “Keyword Popularity”, “Keyword Extraction”, “Information Density” and “Named Entities”, either as standalone searches, in combination with each other, or with “data mining”, “text mining”, “Natural Language Processing” or “NLP” attached to the search term. Below we present eight relevant research papers.
Categorisation and topic identification
Muller, Dörre, Gerstl & Seiffert (1999) [11] describe the use of TaxGen, an automated taxonomy generator based on a hierarchical clustering algorithm (HCA). By using a bottom-up iterative approach, where each text is analysed, the system slowly builds a taxonomy. The results are analysed against the training set and show above 99% positive results. The authors preprocess the text to find Lexical Affiliates (LA) and meaningful subjects in the text, limited to at most the top five keywords. They also use Linguistic Features (LF), also known as Named Entities (NE), to extract names of people and places, and go into great detail about how difficult these are to use as taxonomies, since the clustering algorithm chokes on them.
This could prove problematic for news articles, where journalists, in order to spice up the language, use synonyms, or only first or last names. The text documents in the paper are from news-wire services, where the language is much more compact. There is also a possible misunderstanding of what classes and taxonomies are, where the authors seem to mix the two, but this is not explicitly clear from the text.
In explaining TopCat [12], Clifton, Cooley & Rennie (2004) give a very thorough walkthrough of the problems of NLP and text-mining, explaining the problems and suggesting possible workarounds. Their process, too, is removal of stop words along with NE extraction to get more coherent sets of text bodies. The NEs are used to map articles to topics, e.g. “Sampras”, the American tennis player Pete Sampras, is mapped to the “tennis” concept. They also use TF-IDF to find important words in each text, which substantially improves the results. TopCat finds around 30% more similar documents than a human process, however the authors are clear in stating that their results are difficult to evaluate. Nonetheless it is a solid piece of work that shows the difficulties in automation and text-mining. They also note, like Muller, Dörre, Gerstl & Seiffert above, that computation takes a very long time, and conclude that the experiment is a success even though the output is small. But this is based on algorithm evaluation by users, which could be a sign that any improvement is welcome, since annotating content is a complicated and time-consuming task for users.
  • 13. Denecke & Brosowski also follow a recipe of preprocessing, where stop words are removed, words are stemmed and sentences are processed to find the related topic, in the 2010 paper “Topic Detection in Noisy Data Sources” [13]. Only sentences with a minimum of four words are considered, and a Latent Dirichlet allocation (each document is a mixture of smaller topics) is performed on each sentence. A keyword is only considered if it is contained in at least 15 sentences. Finally, the top five keywords are chosen as the keywords describing the article. The algorithm is tested on medical blogs, slashdot.org and 14 products from Amazon.com. The algorithm performs best on blogs and Amazon.com products, which might be due to the stricter structure of blogs and Amazon.com product pages. The authors use annotators to test the output against, but these have difficulty with medical blogs due to unfamiliarity with the domain. Again there is confusion about classes, taxonomies and keywords.
Reza & Matin discuss in “Application of Data Mining For Identifying Topics at the Document Level” (2013) how to identify topics at the document level [14]. The authors start by looking at the sentence level, using unigrams and bigrams together with Named Entity extraction and analysis, but do not get satisfactory results. The authors then move on to the paragraph level and begin to see good results on their test corpus and in the feedback from their test audience. This is a promising result for our purpose. However, the authors do not take into account the use of semantics and synonyms to get the true meaning of concepts and words. It is also unclear if they have used stemming of words to find similarities.
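Several of the studies above share a pipeline of stop-word removal followed by counting unigrams or bigrams within a unit of text and keeping only the top few. The sketch below shows that idea at the paragraph level using only the Python standard library. It is not the implementation of any of the cited papers: the stop-word list is a tiny hypothetical sample, paragraphs are assumed to be separated by blank lines, and stemming and sentence boundaries are ignored for brevity.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "it", "he", "she", "i"}


def tokenise(text):
    """Lower-case the text, split it into word tokens and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all runs of n adjacent tokens (n=1 unigrams, n=2 bigrams, ...)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams_per_paragraph(article, n=2, k=5):
    """Count n-grams separately in each paragraph and keep the k most common."""
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    return [Counter(ngrams(tokenise(p), n)).most_common(k) for p in paragraphs]


sample = (
    "The software upgrade failed. The software upgrade was late.\n\n"
    "The minister resigned."
)
top_bigrams = top_ngrams_per_paragraph(sample, n=2, k=1)
```

Counting per paragraph rather than per sentence gives each n-gram more chances to repeat, which is one plausible reason a paragraph-level approach could produce more stable keywords.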
Keyword extraction
Although they do not seem to be aware of the concept behind News Triangles, Nørvåg and Øyri [15] describe a process where only the front page of an online newspaper is used to extract news from. Their intuition is that headlines are short, to the point and created by humans, which leads to higher classification accuracy than an automated process. They text-mine the front page of online newspapers, and each news item is added to a database with a link to the article. The headpiece is used to categorise the article. This process takes up a lot less hard-disk space and saves time on data cleaning. Their approach is somewhat problematic, in that not all newspapers link their headlines to the article, or they might use different HTML markup depending on where the article is placed on the webpage, which changes as the news item slides down the newspaper's homepage during the course of the day. Apart from the problem of creating a template for each online newspaper, the logic behind the extraction process can be complex. Furthermore, although news headlines are dense with information and are written by a person with domain knowledge of what the article is about, the process loses out on possibly valuable knowledge, which could have been mined from the article itself.
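To make the template problem concrete, the sketch below pulls headlines and their links from a front page using Python's standard `html.parser` module. It is not the extraction process of Nørvåg and Øyri: it assumes, purely for illustration, that headlines are marked up as `h1` to `h3` elements that may or may not wrap an `a` element, and the sample markup is invented.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect (headline text, link or None) pairs from heading elements."""

    def __init__(self, heading_tags=("h1", "h2", "h3")):
        super().__init__()
        self.heading_tags = heading_tags  # assumption: headlines use these tags
        self._in_heading = False
        self._href = None
        self._text = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag in self.heading_tags:
            self._in_heading = True
            self._href = None
            self._text = []
        elif tag == "a" and self._in_heading:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_heading:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.heading_tags and self._in_heading:
            # Collapse runs of whitespace left over from the markup.
            title = " ".join("".join(self._text).split())
            if title:
                self.headlines.append((title, self._href))
            self._in_heading = False


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


# Invented front-page fragment: one linked headline, one without a link.
front_page = (
    '<div><h2><a href="/a1">Minister  resigns</a></h2>'
    "<p>teaser text</p><h3>No link here</h3></div>"
)
headlines = extract_headlines(front_page)
```

The second headline comes back without a link, which mirrors the case where a newspaper does not link its headlines to the article. A site that marks up headlines differently would need its own variant of the parser.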
By looking at each block (section element) and analysing the particular block's entropy (information gain), Huang, Yen, Hung, Chuang, & Lee (2006) [16] propose to consider both structure and information by adding weights to each block of text and letting the algorithm decide which block is more informative. The authors conclude that using only entropy to classify block importance gives poor results when tested on live websites, and that future work in this area is needed. They also note that a side effect of a boolean selection is that some blocks do not reach the threshold for selection and are deemed not useful, even though a human reader judges the block to be informative. On this point they conclude that a ranking of blocks could be more useful.

In “Automatic free-text-tagging of online news archives” from 2010 by Farkas, Berand, Hegedús, Kárpárti & Krich [17], the authors extend the above-mentioned set-ups by using semantics from Wikipedia. From the corpus they filter the keyword list to a fixed size based on statistics of the site average. The headlines, sub-headlines etc. contain important words, and the authors perform TF-IDF on these parts to determine importance. The authors find that the raw text and links to other articles are mostly noise. They also touch on the problem of applying the available tools and frameworks to non-English languages. The use of wikipedia.org for semantic extraction seems to circumvent this challenge to a large degree.

In “Topic identification based on document coherence and spectral analysis” [18] from 2011, D'hondt, Verhaegen, Vertommen, Cattrysse & Duflou treat texts as a non-sequential stream of words, where the best parts of information could lie anywhere in the text. The authors use a technique of lexical chains to quantify and describe similar keywords from different text blocks.
They add a score to keywords that appear closer to each other, word by word and sentence by sentence, so occurrences of the same word very close to each other receive a higher score than words occurring a similar number of times but further apart. Their technique achieves good precision and recall on both large randomized and standard test sets. The paper goes into great detail in explaining the set-up of their experiment, along with explanations of the algorithms used. Their approach differs from our line of thought. Our intuition is that information is grouped into blocks of knowledge, and that each section carries and adds weight to each keyword, so we choose to partition the article into sections.

1.4 Research Questions

The research questions set out to identify: the presence of News Triangles and Stacked News Triangles; whether Named Entities influence the presence of Stacked News Triangles; and what the differences are among Named Entities across categories.

1.4.1 Research Question 1 – News Triangles

Research Question 1: To what extent do online news articles follow the idiom of many News Triangles, instead of only one News Triangle, where information is distributed at the beginning of the text? I.e. do the keyword candidates appear less frequently the further we move away from the start of each element block?

Methodology and assumptions for Research Question 1

The participants annotate a selection of articles with keywords. The more often a keyword is mentioned, the more popular it is, i.e. it has a higher value. To make sure only the most important keywords with the highest value are used for looking at keyword popularity, only keywords that belong to the 3rd quartile are considered.

A News Triangle (an inverse triangle) is a triangle where most of the information is in the top part (see Illustration 2, 3 and 4). Thus, we expect that the majority of popular keywords, mapped along a vertical timeline of the text, will show spikes of popularity that follow the pattern seen in Illustration 3. Moreover, we want to see if what is taught about writing for online is in fact true. We will look for patterns that resemble those seen in Illustration 3 and 4, where the spike starts at the left side and descends downwards. The hypothesis is that online news follows a pattern that resembles Illustration 3 and 4.

Note that we are not measuring where the keywords begin or end. We are looking for a boolean value, a True or False, of whether there is an overall News Triangle and/or a series of Stacked News Triangles.
Below we have inserted the News Triangles as per the theory of Sissons [3] and Tverskov & Tverskov [1]. Note that the article may be organised exactly the same in the online and print versions. We are concerned only with online news and do not take the print version into consideration.

1.4.2 Research Question 2 and 3 – Named Entities

Research Question 2: Given that much news concerns something that happened to someone somewhere, what influence do Named Entity keywords have on the presence of News Triangles and Stacked News Triangles?

Research Question 3: Is there a distinct variance of Named Entity Type (Persons, Places or Organisations) in keywords within the categories (Culture, Domestic, Economy, Sports)?

Methodology and assumptions for Research Questions 2 and 3

Named Entities from the 3rd quartile keywords are extracted and divided into the following types: Person; Place; Organisation. The NEs are gone over manually to find and correct misspellings like “Georg W Bush”, “Georg W. Bush”, “George W. Bush”. The NEs are then grouped by type and category. Standard Deviation across NE and Non-NE keywords is used to identify the influence of NEs on News Triangle presence.

The concept of the News Triangle is to describe what happened to someone/something. As described in the Background section above, the “What happened” part of the News Triangle could possibly contain more NEs, since something happened to someone/something. Investigating whether the “How it happened” part contains more verbs, since these often describe some kind of action, or whether the “Amplify the point” and the “Tie up loose ends” parts are used for summarisation, is however out of the scope of this paper.

Relevance of research questions

Gaining knowledge of how news articles, which are semi-structured texts, are partitioned into smaller but discernible parts is of great value to automated keyword extraction and to the process of automatically tagging news articles in a more relevant way, which could produce better results when fetching related and relevant content.

Contribution

We expect to find that all news articles use an overall News Triangle to present information, and that most element blocks contain News Triangles as per our definition in Illustration 3 and Illustration 4 above. We expect a majority of keywords to be NEs and that there is a distinct difference between NE Types for each category.

2 Methodology

At the onset of this paper the idea and scope were to explore more than keywords and News Triangles, thus the questionnaire, which forms the basis of the data collection method, also asked the participants for input about taxonomies and category. The data about taxonomies and categories has not been used in this paper.

Quick overview of methodology

Sixty-eight Danish journalists are each asked to add keywords to a random selection of eight news articles from a set of 31 articles.
Each article is divided into partitions (Headpiece, Intro and subsequent Section blocks according to HTML markup). The keywords for each article are ranked according to occurrence, and the popular keywords in the 3rd quartile are used to measure keyword popularity in the article. The popular keywords from each article are searched for in the article, and their position, if found, is mapped for each section block. The position and popularity count for each keyword is used to create a graph showing the distribution of keyword popularity in each section block. A linear fit, acting as a boolean value, is set across each section block and the entire article, to measure whether the keyword distribution ascends or descends. An ascent from left to right indicates that there is no News Triangle present. A descent from left to right indicates that there is a News Triangle.

For analysis of NEs we extract NEs from both articles and annotations using NLTK and our own automated method. Only NE annotations found in the 3rd quartile are considered.

2.1 Articles - criteria and selection methods

We use 41 articles from the online edition of the Danish newspaper Politiken¹, of which we have selected 31 articles to test on. The original intent for this paper was to also build and test an algorithm based on the initial results, using the remaining ten articles as a test set. However, on first inspection of the annotations by the participants and the mapping of keywords to each article, we decided instead to dig deeper into what caused the presence and non-presence of News Triangles and Stacked News Triangles.

The criterion for selecting articles is to get recent articles from the categories: domestic news (14 articles), economy (6 articles), sports (4 articles) and culture (7 articles). By selecting articles from four common newspaper categories we are able to generalise the results better. The reason for choosing recent articles is that we wanted the participants to be fairly familiar with what the article was about. The articles' subjects are: terror in Denmark, politics, integration, music streaming, schools, education, the Copenhagen metro, credit cards, sports games.

We have not looked at data about the articles' authors, nor do we consider it relevant to include, since this should not influence the presence of News Triangles. All articles are written and edited by Politiken staff, and the articles have been approved for use in this thesis by Politiken's copyright office. No articles are from news-wire services. The full list with links to articles can be found in Appendix C on page 69.

2.2 Preprocessing of articles

Each article's URL is accessed via a browser and the full HTML source code is copied to a text document and saved.
The reason for not scraping the content automatically is Politiken.dk's paywall system, where users are only allowed to access five articles per month. This can, however, be bypassed by using the Chrome browser's “incognito” functionality.

Using the Python library BeautifulSoup4², each text file is processed to remove excess elements in the HTML, such as links to advertisements, navigation, semantic content etc. Only the HTML surrounding the content is extracted. Based on the extracted HTML and content, a new HTML page is generated, where the content is marked up to only contain H1 HTML-tags for the title, H2 HTML-tags for the sub-headline, H3 HTML-tags for the subsequent section block sub-headlines and HTML p-tags for paragraphs. The new HTML page does not contain images, image captions, author bylines, dates, category identification or links to related stories.

¹ http://politiken.dk
² http://beautiful-soup-4.readthedocs.org/en/latest/
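The reduction to headline, sub-headline and paragraph tags can be sketched as follows. This is a minimal illustration, not the thesis code: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup4, the names `ArticleExtractor` and `clean_article` are hypothetical, and it assumes the input has already been cut down to the HTML surrounding the article content.

```python
from html import escape
from html.parser import HTMLParser

# Tags kept in the regenerated page: title, sub-headline,
# section sub-headlines and paragraphs.
KEEP = {"h1", "h2", "h3", "p"}


class ArticleExtractor(HTMLParser):
    """Collect (tag, text) pairs for the kept tags, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []    # finished (tag, text) pairs
        self._tag = None    # kept tag currently being read, if any
        self._parts = []    # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag in KEEP and self._tag is None:
            self._tag, self._parts = tag, []

    def handle_data(self, data):
        # Text outside a kept tag (navigation, ads, captions) is ignored.
        if self._tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._parts).split())
            if text:
                self.blocks.append((tag, text))
            self._tag = None


def clean_article(html):
    """Return a stripped-down HTML string with only h1, h2, h3 and p tags."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(
        f"<{tag}>{escape(text)}</{tag}>" for tag, text in parser.blocks
    )
```

Inline markup inside a paragraph, such as a link, is flattened to its text, which matches the goal of a page without links to related stories.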
2.3 Participants - criteria and selection methods

The participants were chosen via the author's network of former colleagues in the Danish media business. The participants have a wide background in the field of journalism, ranging from broadcasting to traditional print and online media. Their ages, annotation experience and type of education vary. Having said that, there is a bias towards participants being forty years of age, working with print media and having completed an education from the Danish School of Journalism in Aarhus around the year 2001.

113 journalists were invited via email and Facebook chat to participate, 37 journalists never replied, nine journalists declined and 71 journalists opted to participate, of which 68 completed the test. There does not seem to be a difference in opt-in rate depending on which invitation method was used.

During the invitation process, several participants showed great interest in the research at hand, since tagging (and updating taxonomy catalogues) is a daily chore which, according to some participants, does not work to its full potential. Many felt that a lot of repetitive work was done (and lost), and that the newspapers' Content Management Systems (CMS) were not good at matching keywords and taxonomies or suggesting related articles. There were also many who did not know what a taxonomy was, even though they, one imagines, work with one daily. This could be due to online newspapers still holding on to the notion of classifying content into strict sections, as they have done with print, where there is a natural physical affordance.

2.4 Collecting keywords via a web questionnaire

Questionnaire

The URL to the questionnaire (http://tagging.miklasnjor.com) is emailed or posted via Facebook chat to the participant, and each URL contains a unique participant ID (example: http://tagging.miklasnjor.com/index.php?userid=MN123).
As mentioned earlier, we initially wanted to also test an algorithm, and to be able to make comparisons between test one and two (test two never took place) each participant is also assigned an internal user ID. For copyright reasons the questionnaire is not made public.

The article layout for the questionnaire is responsive so as to fit all device types, although participants are informed that the test is best taken on laptop/desktop computers or tablets. This step is taken since a large portion of the participants are contacted via Facebook and there is a chance that they received the URL on their mobile phones. To avoid looking like the most common Danish newspapers, the typography used is that of the WordPress theme TwentyThirteen³, which we presume has undergone tests for readability.

On the introduction page, the experiment is explained to the participant, along with what the data will be used for and that the data will be treated anonymously. Definitions of what is meant by keyword, taxonomy and section are also explained. On the next page, participants are asked to enter data about their education type, year of finishing their education, current job title, current job function, overall editorial tagging experience, and years since working editorially with annotating content. This is done to see if there are differences in annotation rate and keywords entered, depending on participants' tagging experience and years since tagging articles, or their education, job function and job title. After the second page, participants continue to the main test, where they are asked to annotate eight articles. The data was collected over a three-week period from mid-March 2015.

³ https://theme.wordpress.com/themes/twentythirteen/

Layout of articles

For each article page, the participant is presented with the article, where the markup follows common HTML principles: the headline gets an <H1> HTML tag, sub-headlines get an <H2> HTML tag and so forth, as described earlier in 2.2 Preprocessing of articles on page 18. See the example of a page in Illustration 17 in Appendix A on page 52. Reading wide pieces of text on a screen is cumbersome, so we choose to set the max width of the text to 760 pixels.

Alongside the article is a box for writing keywords, taxonomies and categories. The box follows the top of the web page, so participants avoid scrolling up and down when they read the article and need to enter data. Underneath the box is a link to the bottom of the article, where the difference between keywords, taxonomies and categories is explained. This is done to make sure that if doubts or uncertainties arise, participants can quickly get information about definitions, a need that was raised by some participants prior to taking the test.

Choice of articles presented to the participant

The first time a participant is presented with an article, the article with the fewest annotations is fetched from the stack of articles, to make sure that we get as many articles as possible annotated by a wide group of participants.
This process continues except for the third and fifth article, where the participant is presented with the most annotated article, as we want to see if there is a difference in keywords between articles with fewer and more annotations.

Questions asked

The participants are asked to read the article and annotate the article with:
• Keywords: Which keywords are relevant to the article.
• Taxonomies: Which taxonomies they would place the article in (based on their assumption).
• Classification: In which section the article belongs (based on their assumption).

The data is entered into the on-page box mentioned in Layout of articles above, with either one keyword per line or keywords comma-separated. Note that a keyword can consist of several words, e.g. “Football Stadium”, “Terror in Denmark” or “Klaus Riskær Pedersen”.

When the participant has annotated the article they click the “Next article” button. The data they entered is saved to a database and they are presented with a new article which they are asked to add keywords to. We choose to save the data after each annotation. As such, there is no minimum or maximum number of articles that need to be tagged, since our intuition is that the task of tagging is seen by participants as a chore, and we feared that if they exited the test halfway through, we would lose valuable data. The users are informed of how many articles they have tagged.

Minor problems

A. There was a coding error which once in a while showed the same article twice. Some participants would skip the article, some would annotate it again, and some would write in the data collection box that they had seen this article before. These entries have been removed.
B. Since there was no finish button on the questionnaire, some participants annotated more than eight articles.

2.5 Analysis of collected tags

The keywords belonging to each article are collected and gone over manually to make sure that strange HTML entities or other oddities inside or surrounding the keywords are caught and normalised. This is to avoid complications further down the preprocessing pipeline.

Removing Noise from the Data

All keywords for each article are collected and made lower case, after which they are compared and ranked according to frequency. The frequency counts of the keywords are divided into 1st, 2nd and 3rd quartiles. Keywords not belonging to the 3rd quartile for each article are discarded, i.e. we only consider keywords with high frequency and a strong presence. This is done to avoid outliers in the data. The reason for making all keywords lowercase is that participants may have spelled the same keyword in title-, upper- or lowercase.
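The noise-removal step can be sketched as below. This is a minimal sketch under stated assumptions rather than the code used in the thesis: the function names are hypothetical, each participant counts at most once per keyword, and "belonging to the 3rd quartile" is read as having a frequency at or above the third quartile of all keyword frequencies for the article (here computed with the inclusive method of `statistics.quantiles`).

```python
import re
from collections import Counter
from statistics import quantiles


def parse_keywords(raw):
    """Split one participant's box input on newlines and commas.

    Keywords are lowercased so that title-, upper- and lowercase
    spellings of the same keyword are counted together.
    """
    return [part.strip().lower()
            for part in re.split(r"[,\n]", raw)
            if part.strip()]


def popular_keywords(entries):
    """Return {keyword: frequency} for keywords in the top quartile.

    `entries` holds the raw box input of each participant for one article.
    """
    counts = Counter()
    for raw in entries:
        # A set, so one participant cannot count the same keyword twice.
        counts.update(set(parse_keywords(raw)))
    if len(counts) < 2:
        return dict(counts)  # too few keywords to form quartiles
    q3 = quantiles(list(counts.values()), n=4, method="inclusive")[2]
    return {kw: n for kw, n in counts.items() if n >= q3}
```

Multi-word keywords such as “Terror in Denmark” survive as a single entry because the input is only split on commas and line breaks, never on spaces.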
Named Entities

Named Entities (Persons, Organisations, Locations) are extracted from all of the articles we train on. We intended to do this in one sweep to avoid repetition by using the NLTK ne_chunk method⁴. On closer inspection we notice that the NLTK ne_chunk method does not collect all NEs, which could possibly be due to the anglocentric nature of NLTK, whereas our text corpus is in Danish. We found a POS-tagging web service from “Center For Sprogteknologi”⁵, however it would be cumbersome and error-prone to copy-paste data back and forth between the web form and text sheets. Likewise, training a POS-tagger from scratch to identify named entities etc. is out of the scope of this paper, so we write our own function to collect the rest of the NEs.

⁴ http://www.nltk.org/api/nltk.chunk.html
⁵ http://cst.dk/online/pos_tagger/
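A collector of this kind could look like the sketch below. It is an illustration only: it assumes that a candidate is any run of one to five consecutive tokens that are all Titlecase or UPPERCASE, it uses a simple letters-only tokeniser, and the names `is_capitalised` and `ne_candidates` are hypothetical. As with any such heuristic, sentence-initial words are picked up as false positives and need manual review.

```python
import re


def is_capitalised(token):
    """True for Titlecase ("Helle") and UPPERCASE ("DR") tokens."""
    return token.istitle() or token.isupper()


def ne_candidates(text, max_len=5):
    """Collect capitalised n-grams of 1 to `max_len` tokens as NE candidates."""
    # Letters only (including Danish æ, ø, å); digits and punctuation
    # are dropped by this deliberately simple tokeniser.
    tokens = re.findall(r"[^\W\d_]+", text)
    found = set()
    for start in range(len(tokens)):
        for size in range(1, max_len + 1):
            gram = tokens[start:start + size]
            # Stop growing the n-gram at the end of the text or at the
            # first token that is not capitalised.
            if len(gram) < size or not all(is_capitalised(t) for t in gram):
                break
            found.add(" ".join(gram))
    return found
```

The returned set can be merged with the output of an existing NE chunker before the combined list is edited by hand.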
The function collects n-grams (from 1 – 5 tokens) that are either Titlecase or UPPERCASE. Naturally we rake in many false positives. The tokens are gone over manually and false positives are removed from the list. NEs from the NLTK ne_chunk function and our function are joined and edited again.

2.6 Identifying News Triangles

To map where keywords are found across the articles, we partition the articles into: header and sub-headline; intro; and subsequent section blocks including the section block sub-headline. The intro and section blocks are divided into 20 buckets, based on the fact that most section blocks consist of roughly 20 sentences. We write a Python program to go through each sentence in each section block of the article, and if a single- or multiple-token keyword in the 3rd quartile is found, the program marks the keyword's position. The resulting data set is sent off programmatically to plot.ly, where we manually add the linear fit to each block. The values of the Squared Correlation Coefficient (R²), the Mean Squared Error (MSE), and a boolean value of whether there is a News Triangle present, are read and entered into Table 10 on page 36.

For identifying News Triangles in the intro and each section block, the linear fit is calculated for each section block. For calculating the linear fit across the entire article, the header and sub-headline are included, but since this part of the article is so compact, we do not calculate a linear fit across it alone.

Note that we use the linear fit as a boolean value to identify whether there is a News Triangle or not. We considered using alternative methods for measuring the presence of News Triangles, but decided a linear fit is the best method to see if there is an ascent or descent across the sections. We do find that this set-up has certain drawbacks, since the linear fit does not show where the keywords start or stop across the 20 partitions, which could be valuable information.
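The bucket mapping and the boolean reading of the linear fit can be sketched as follows. This is a simplified sketch, not the thesis program: it assumes a keyword's position is its character offset within the block (rather than its sentence), it sums the popularity counts per bucket, and it computes an ordinary least-squares slope locally instead of reading the fit from plot.ly. All function names are hypothetical.

```python
def bucket_popularity(block, popularity, buckets=20):
    """Sum keyword popularity per bucket along one section block.

    `popularity` maps each popular keyword to its frequency count.
    """
    text = block.lower()
    totals = [0] * buckets
    for keyword, count in popularity.items():
        keyword = keyword.lower()
        if not keyword:
            continue  # an empty keyword would match everywhere
        start = text.find(keyword)
        while start != -1:
            index = min(start * buckets // len(text), buckets - 1)
            totals[index] += count
            start = text.find(keyword, start + 1)
    return totals


def slope(values):
    """Least-squares slope of values against their bucket index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def has_news_triangle(values):
    """A descent from left to right (negative slope) counts as a News Triangle."""
    return slope(values) < 0
```

Applying `has_news_triangle` to each block and to the concatenated buckets of the whole article gives the per-block and overall boolean values. A slope of exactly zero is treated here as no News Triangle, which is one possible convention for the near-horizontal case.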
This drawback is most problematic for section blocks where the linear fit is almost horizontal. However, we choose to partition the section blocks into 20 partitions, and this allows for visual feedback, so the reader can see what the distribution looks like. A later step could be investigating the slopes of the fit across the data, or where the ascents or descents start and stop. We choose to concentrate on a preprocessing step of identifying whether there even are News Triangles or Stacked News Triangles.

3 Results

Since we will show many tables and illustrations, we also choose to analyse part of the data in the results section. For brevity, we have moved the majority of the illustrations for Keyword Popularity Distribution to Appendix B on page 53.
3.1 Participants

All participants are from Denmark. Table 1, 2 and 3 show the participants' ages, graduation year and education length. The participants' ages range from 30 to 59, the year of graduation ranges from 1980 to 2011, and the education length ranges from 2 to 6 years, with the majority having studied for four years.

Participant's Ages
Min    Max    Count   Mean    1st quartile   Median   3rd quartile
30     59     67*     43.66   40             42       46
Table 1: Participant's ages. The oldest participant is 59, and the youngest is 30 years old. The majority of participants are close to 40. * Note that one participant did not specify his or her age.

Participant's Graduation Year
Min    Max    Count   Mean    1st quartile   Median   3rd quartile
1980   2011   68      2000    2001           2001     2002
Table 2: Participant's Graduation year. The majority of participants graduated in 2001.

Participant's Education Length within Journalism and Communication
Min    Max    Count   Mean    1st quartile   Median   3rd quartile
2      6      68      4.088   4              4        4
Table 3: Participant's education length. All participants have an education related to the fields of either journalism or communication, of which the majority have studied for 4 years.

Participant's Keyword Experience

The participants were asked to select for how long they had worked with annotating articles and when they had last worked with this. In all, roughly half (54.5%) have experience with adding keywords to journalistic articles, 42.6% have no experience and 2.9% answered “N/A”.
Question: "For how many years have you worked with annotating news articles?"
Answer                            Count
"I have never worked with it"     29
"1 - 3 years"                     17
"4 - 6 years"                     11
"7 - 9 years"                     5
"More than 10 years"              4
"N/A"                             2
Table 4: More than half of the participants have experience with adding keywords to articles.

For when they last worked with annotating articles, 29 participants (42.64%) are currently working with it or have worked with it within the past 3 years. 3 participants (4.41%) have not worked with it since seven years ago. 26 participants (38.23%) have never worked with it and 6 participants (8.82%) chose “N/A”.

Question: "When did you last work with annotating news articles?"
Answer                            Count
"I have never worked with it"     26
"I work with it daily"            15
"1 - 3 years ago"                 14
"4 - 6 years ago"                 4
"7 - 9 years ago"                 1
"More than 10 years ago"          2
"N/A"                             6
Table 5: Close to half have current or recent keyword experience.

Participant's Job Functions and Job Titles

The 68 participants label themselves with 33 different job titles and 52 different job functions, ranging from journalist to CEO. Job titles and job functions are shown in Table 14 on page 71.

Participant's Device and OS

One concern was that participants would take the test on their phones, which we feel hinders the annotation of articles. Each participant's browser type and device type is collected during the test. The split is as follows: Desktop/Laptop: 60 (88.2%); Tablet: 4 (5.9%); Mobile Phone: 4 (5.9%). From Table 6 we find a wide range of browsers used by the participants across both operating systems.
Participants Browsers and Operating Systems
Operating System   IE      Safari   Chrome   Firefox   Others   Total
Windows            4.4%    -        30.9%    2.9%      -        38.2%
Apple              -       29.4%    5.9%     14.7%     -        50%
IOS                -       5.9%     -        -         -        5.9%
Other              -       -        -        -         5.9%     5.9%
Table 6: Participant's Browser and Operating System. Windows, Chrome: 21 (30.9%); Apple, Safari: 20 (29.4%); Apple, Firefox: 10 (14.7%); IOS iPad, Safari: 4 (5.9%); Apple, Chrome: 4 (5.9%); Windows, IE: 3 (4.4%); Windows, Firefox: 2 (2.9%); Various others: 4 (5.9%)

3.2 Articles and Annotations

This first part of the results serves the purpose of getting an overview of the data and making sure that everything is aligned, to better make comparisons when we explore the data further. Even though the content of the articles varies and as such could produce a wide spectrum of annotations, both in type and count, we choose to use averages to compare article categories. We also look at the respective group's median when possible. The reason for analysing the participants' annotations is to understand what defines a “good” keyword. Later we will analyse where the most popular annotations among participants are placed throughout the article text.

Article Groups

We group the articles into categories to understand differences among categories and annotations. This is also a sanity check to see if there are any outliers in our data that might become a problem later. Table 13 on page 70 (in Appendix C) lists the articles with their categories, article ID, a rough translation of the headline from Danish to English, and the original URL, which is the namespace that comes after “http://politiken.dk”. We group the articles based on their content and the first taxon in the URL. Articles could be grouped further by URL, however we would be left with such small categories that comparison would be difficult to calculate.
The placement of articles in the sports section is clear-cut, while the articles in Culture, Domestic, and Economy could be placed differently due to certain subjects overlapping, e.g. an article about economy is likely to be concerned with Danish politics, and an article about education is placed in domestic news, although that area, to some extent, is governed by politicians. We chose to group articles about lifestyle and consumerism in the Culture category. Table 13 on page 70 shows that seven articles belong to Culture, 14 belong to Domestic News, six articles belong to the Economy category and four articles belong to the Sports category.

From Table 7 below we see how the Domestic and Sports categories feature articles which have been annotated more than twice as often as the median, and about twice as often as the average, for that particular group (Domestic max = 39, Sports max = 37). However, looking at the 1st quartile, median and 3rd quartile for all categories, we can see that the majority of articles have been annotated by 13 – 16 participants for each group respectively. This indicates easier comparison across article groups.

Averages and Five Number Summary of Article Categories Annotation Rate
Categories   Average   Min   Max   Median   3rd quartile   1st quartile
Culture      15.57     13    19    16       16             16
Domestic     18.71     12    39    15.5     16             15
Economy      14.83     13    18    14.5     15             14
Sports       18.75     12    37    13       13             13
Table 7: Domestic and Sports receive higher Max annotation by the 68 participants, however the 1st quartile, median and 3rd quartile for all categories group closely, showing that the annotation rate for each article lies between 13 – 16 annotations from participants. Notice also how close the 1st quartile, median and 3rd quartile for each category are.

Keyword Annotations

Table 8 and Table 9 show: annotation types; the count of all annotations for that annotation type; the average annotations per article; the count of unique annotations; and averages of unique keywords per article. The process of finding unique annotations is as follows: all annotations are made lower case and duplicates are removed, so we have a set of non-matching annotations. Having said that, it is possible that annotations with the same semantic meaning exist alongside each other, or, since annotations are not stemmed, that the same keyword exists in several inflected forms.

Table 8 shows that the 31 articles have collected 4930 keywords, of which 1467 are unique. Per article, on average, there are 159.03 annotations, of which 47.32 are unique.

Annotations Overall
               All Annotations   Avg. Annotations per article collectively   Unique Annotations   Avg. Unique Annotations per article collectively
All Keywords   4930              159.03                                      1467                 47.32
Table 8: Averages of annotations for keywords.
Keywords per Category

For Table 9, averages are the sum of all annotations divided by the number of articles in that category, e.g. “Culture Keywords” = 993 / 7 (articles) = 141.86. The averages for each section that are above average have been bolded. Illustration 5 on page 28 summarises Table 9.

In Table 9 we find that the articles in each category receive from 691 to 2510 annotations, and that the Unique Annotations for each Category vary from 176 to 703. Despite this wide spread, the Average Unique Annotations per Article collectively lie close together (44 to 52.14), with a Standard Deviation of 4.04. In other words, the overall and unique keyword counts per category span a wide range, yet the per-article averages of the categories settle close to each other.

Keyword Annotations per Category (number of articles: Culture 7, Domestic 14, Economy 6, Sports 4)

Category and         All Annotations     Avg. Annotations   Unique Annotations   Avg. Unique Annotations
Annotation Type      for each category   per article        for each category    per article
                                         collectively                            collectively
Culture Keywords      993                141.86             365                  52.14
Domestic Keywords    2510                179.29             703                  50.21
Economy Keywords      755                125.83             268                  44.67
Sports Keywords       691                172.75             176                  44

Table 9: Keyword annotations per Category of articles. For Average Annotations per article collectively: Min 125.83, Max 179.29, Mean 154.93, Q1 133.85, Median 157.30, Q3 176.02, Std Dev 25.35. For Average Unique Annotations per article collectively: Min 44, Max 52.14, Mean 47.75, Q1 44.34, Median 47.44, Q3 51.18, Std Dev 4.04.

Also evident from Table 9 is that although articles receive different amounts of attention from participants, the averages group around each other. Illustration 5 shows that when the lists of keywords are culled to contain only unique annotations, the count drops quickly and all categories group fairly evenly within the same range.
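The per-category averages in Table 9 can be recomputed from the raw counts. The sketch below is only an illustration: the dictionary layout and the function name per_article_averages are our own, and using the sample standard deviation (n - 1) is an assumption that happens to reproduce the reported 25.35 and 4.04.

```python
from statistics import mean, median, stdev

# Counts copied from Table 9: (articles, all annotations, unique annotations).
TABLE_9 = {
    "Culture": (7, 993, 365),
    "Domestic": (14, 2510, 703),
    "Economy": (6, 755, 268),
    "Sports": (4, 691, 176),
}


def per_article_averages(table):
    """Return {category: (avg annotations, avg unique annotations)}."""
    return {
        name: (total / articles, unique / articles)
        for name, (articles, total, unique) in table.items()
    }


averages = per_article_averages(TABLE_9)
avg_all = [a for a, _ in averages.values()]
avg_unique = [u for _, u in averages.values()]

for name, (a, u) in averages.items():
    print(f"{name:<9} {a:7.2f} {u:6.2f}")

# Summary over the four category averages (sample std dev is an assumption).
print(f"mean {mean(avg_all):.2f}, median {median(avg_all):.2f}, "
      f"std dev {stdev(avg_all):.2f}")
```

Dividing by the number of articles is what makes the categories comparable, since Domestic has 14 articles and Sports only four.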
To understand the variation within each group, Table 15 (on page 71, Detailed look at count of Annotations of Keywords) shows in detail the number of annotations added to each article, along with the count of unique annotations. We calculate a one-way ANOVA test for the null hypothesis (H0) of no connection between the number of annotations added and the number of unique annotations. We find no significant connection between the number of annotations added and the number of unique annotations, except for the Sports category, where there is a significance above 0.05 (P<0.07117) for keywords added vs. unique keywords. However, the score for the Sports category is based on only four articles.

3.3 Keyword Distribution

For brevity, the majority of the illustrations of Keyword Popularity Distribution have been moved to Appendix B on page 53. The article IDs, illustration numbers and page numbers are shown in the footnotes for each results category.

As described in the methodology, the keywords in the 3rd quartile for each article are searched for and mapped along a line representing the entire article. Each article is divided into sections, and each section (except the header and sub-headline) consists of 20 partitions. Each partition can feature from zero to many keywords.

To identify the presence of overall News Triangles and Stacked News Triangles, the mapping results are sent to Plot.ly, where a linear fit is added and used as a boolean value. A descent from left to right indicates that there is a News Triangle, while an ascent from left to right indicates that there is no News Triangle.

Keyword Distribution – Culture

Articles 18, 23 and 28 have an overall News Triangle and a News Triangle for each block.

Articles 16 and 22 have an overall News Triangle, but not a News Triangle for their only block (the Intro). For article 16, the linear fits over All - Popularity and Intro Popularity Count follow each other closely. The Header popularity in article 16 is lower than some of the spikes in the Intro.

Article 17 has an overall News Triangle and a News Triangle for the Intro and Section 2 blocks. Section 1 does not have a News Triangle. Section 2 has less popular keywords than Section 1 and especially the Intro block. The Header popularity is lower than some of the spikes in the Intro.

Article 27 (Illustration 6 below) has an overall News Triangle and a News Triangle for Sections 1, 3 and 4. The Intro and Section 2 have no News Triangles. Article 27 is very jagged, both in its spikes in popularity and in how the Keyword Popularity Distribution ascends and descends in the Intro and in Sections 1 and 2.

A list of illustration references for Culture can be found in footnote 6.

6 Culture Articles p. 53: Article 16: Illustration 18 - p. 53, Article 17: Illustration 19 - p. 53, Article 18: Illustration 20 - p. 54, Article 22: Illustration 21 - p. 54, Article 23: Illustration 23 - p. 55, Article 27: Illustration 25 - p. 56, Article 28: Illustration 22 - p. 55
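The linear-fit rule described at the start of section 3.3 (a descending fit counts as a News Triangle) can be sketched as follows. This is not the Plot.ly procedure itself: it is a plain least-squares fit written out by hand, the function names linear_fit and valued are ours, the popularity counts are made up, MSE is assumed to be the mean of the squared residuals, and a perfectly flat fit is treated as "no News Triangle" by assumption.

```python
def linear_fit(values):
    """Least-squares line through (0, v0), (1, v1), ...

    Returns (slope, intercept, r2, mse). Needs at least two values.
    """
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, values)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((y - y_mean) ** 2 for y in values)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r2, ss_res / n


def valued(values):
    """1 if the fit descends (News Triangle present), otherwise 0."""
    slope, _, _, _ = linear_fit(values)
    return 1 if slope < 0 else 0


# Hypothetical keyword popularity per partition, not taken from any article.
descending_block = [9, 7, 8, 5, 4, 2, 3, 1]
ascending_block = list(reversed(descending_block))

print(valued(descending_block), valued(ascending_block))
```

Only the sign of the slope decides the boolean value; R2 and MSE describe how well the line fits, which is why an almost horizontal fit can still be scored either way.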
Keyword Distribution – Domestic News

Articles 03, 04, 05, 08, 13, 19 and 26 all have an overall News Triangle and a News Triangle for each block. For articles 05, 08, 19 and 26, the linear fits over All - Popularity and the Intro Popularity Count follow each other closely. In article 04 the Intro, Section 2 and Section 3 have an almost horizontal linear fit, indicating that the distribution of popular keywords in those blocks is fairly even. Note how article 08 (Illustration 7 below on page 31) starts with more popular keywords than there are in the header, which by convention is supposed to be the most condensed part. This is also apparent in article 20 (Illustration 32, page 60). In article 13 the keyword distribution for all blocks is dispersed, and the header block is the area most condensed with popular keywords.

Article 07 has an overall News Triangle and a News Triangle for Sections 1 and 2, but not for the Intro block, where the majority of the popular keywords are in the rear partitions, in contrast to Sections 1 and 2, where the popular keywords are at the beginning.

Article 20 has an overall News Triangle and a News Triangle for each block except the last section (Section 2), where the popularity ascends. The popularity of keywords in the Intro and Section 1 descends. Note how the Header popularity is lower than many of the popular partitions in the blocks.

Article 21 has an overall News Triangle and a News Triangle for the Intro and Section 3. Sections 1, 2, 4 and 5 ascend and have no News Triangle. Sections 2, 3 and 4 have an almost horizontal linear fit.

Article 24 has an overall News Triangle and a News Triangle for Sections 1, 2 and 3. The keyword popularity in the Intro and Section 4 ascends, and the article has few popular keywords.

Article 25 has an overall News Triangle and a News Triangle for Sections 1 and 3. The Intro and Section 2 ascend and have no News Triangles.
Article 29 (Illustration 8 below on page 32) has an overall News Triangle and a News Triangle for the Intro and Section 1. Section 2 does not have a News Triangle. Sections 1 and 2 have an almost horizontal linear fit. All blocks are condensed and the keyword popularity scores are high, especially for the Header block, which reaches above 160. Some of the popular keywords for article 29 are "S" and "R", which are acronyms for two Danish political parties, namely S: “Socialdemokraterne” (Social Democrats) and R: “De Radikale” (Liberal Democrats). Due to the processing and tokenisation of the text data, it is likely that these acronyms interfere with the scoring process, especially "S", since the tokenisation process could have separated apostrophe s's from their main word. We have not been able to verify their influence. Looking at the number of words for each block and partition, we cannot see any abnormality in comparison to the other articles. Nor can we see any apparent pattern in the spikes in each partition position. Nonetheless, the results from article 29 should be approached with caution.

Article 30 has an overall News Triangle but no News Triangle for its only block (the Intro). The linear fits across All - Popularity and Intro Popularity Count follow each other in an almost horizontal line.

A list of illustration references for Domestic can be found in footnote 7.

7 Domestic Articles p. 57: Article 03: Illustration 26 - p. 66, Article 04: Illustration 28 - p. 58, Article 05: Illustration 27 - p. 57, Article 07: Illustration 29 - p. 58, Article 08: Illustration 30 - p. 59, Article 13: Illustration 31 - p. 59, Article 19: Illustration 32 - p. 60, Article 20: Illustration 33 - p. 60, Article 21: Illustration 34 - p. 61, Article 24: Illustration 35 - p. 61, Article 25: Illustration 36 - p. 62, Article 26: Illustration 37 - p. 62, Article 29: Illustration 38 - p. 63, Article 30: Illustration 39 - p. 63
Keyword Distribution – Economy

For the Economy category, all articles have an overall News Triangle. Article 09 and article 10 also feature News Triangles for every section. In article 01 three out of four blocks have a News Triangle. In article 02 half of the four blocks have News Triangles. Notice how the first two blocks (Intro and Section 1) in article 02 (Illustration 9 below) ascend steeply because the popular keywords are situated in the last partitions. Note also that the spike in Intro - Number of Words and the last spike for Section 1 - Number of Words are a lot higher than the rest. Sections 2 and 3 show the presence of News Triangles, with a linear fit over each section that is almost horizontal.

Article 06 has a News Triangle for the first and third blocks (Intro and Section 2), but not for the second and last (Sections 1 and 3). The linear fit in Section 1 is almost horizontal, whereas for the Intro and Section 2 blocks it descends somewhat steeply. Section 3, which has no News Triangle, ascends somewhat steeply.

Article 09 has News Triangles overall and for each block. In article 10 there is an overall News Triangle and a News Triangle for each section, where the Intro, Section 2 and Section 3 blocks almost follow the overall linear fit.

Article 11 shows the presence of an overall News Triangle and a News Triangle for the Intro and Sections 1 and 3. Section 2 ascends slightly and has no News Triangle. Section 1 has few popular keywords, and its linear fit lies a lot lower than the overall linear fit.

A list of illustration references for Economy can be found in footnote 8.

Keyword Distribution – Sports

For the Sports category there is an overall News Triangle for articles 12, 14 and 15, but not for article 31 (Illustration 10 on page 34), which is the only one of the 31 articles that does not have an overall News Triangle.

Article 14 has an overall News Triangle even though its main block, the Intro block, ascends. It is interesting that in article 14 the Header block has so much power over the overall linear fit that it manages to create a descending linear fit, since we can see that the majority of the popular keywords are placed at the end of the article.

Article 12 and article 15 both have an overall News Triangle and a News Triangle for each block. Looking closer at article 31, it is apparent that the Intro block has less popular keywords across partitions, which might be what causes the All Popularity Fit to ascend.

A list of illustration references for Sports can be found in footnote 9.

8 Economy Articles (p. 64): Article 01: Illustration 40 - p. 64, Article 02: Illustration 41 - p. 64, Article 06: Illustration 42 - p. 65, Article 09: Illustration 43 - p. 65, Article 10: Illustration 44 - p. 66, Article 11: Illustration 45 - p. 66

9 Sports Articles (p. 67): Article 12: Illustration 46 - p. 67, Article 14: Illustration 47 - p. 67, Article 15: Illustration 48 - p. 68, Article 31: Illustration 49 - p. 68
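The observation about article 14, where a popular Header block pulls the overall linear fit downwards although the Intro ascends, can be reproduced with made-up numbers. The sketch below uses a hand-written least-squares slope; the function name slope and all values are hypothetical and are not read off Illustration 47.

```python
def slope(values):
    """Least-squares slope of values against their positions 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    return sxy / sxx


# Hypothetical popularity counts, not taken from article 14.
header = [40]                              # one very popular header partition
intro = [1 + 0.1 * i for i in range(20)]   # 20 partitions, slowly ascending

intro_slope = slope(intro)                 # positive: no News Triangle
overall_slope = slope(header + intro)      # negative: overall News Triangle

print(f"intro slope {intro_slope:+.3f}, overall slope {overall_slope:+.3f}")
```

A single extreme value at the far left of the line has high leverage on a least-squares fit, so the overall score and the per-block scores can disagree, as they do for article 14.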
Keyword Popularity across all articles

Table 10 below (on page 36) shows that there is an overall News Triangle for each article except article 31. However, on a per-article basis, we cannot see a distinct pattern of News Triangle presence across all section blocks. Fourteen of the 31 articles feature a News Triangle for each block and have the Article ID bolded. The score of whether a News Triangle is present is shown in Table 10 by the field Valued, where “1” indicates that the keyword popularity descends over the course of the block or entire article (a News Triangle is present), and “0” indicates that the keyword popularity ascends over the course of the block or entire article (no News Triangle is present). The Valued score is taken from looking at each article's distribution of keyword popularity and reading off the Squared Correlation Coefficients and Mean Squared Error for each article.

In Culture, articles 18, 23 and 28 have News Triangles present in all blocks, whereas articles 16, 17, 22 and 27 do not. In Domestic, articles 03, 04, 05, 08, 13, 19 and 26 have News Triangles present in all blocks, whereas articles 07, 20, 21, 24, 25, 29 and 30 do not. In Economy, articles 09 and 10 have News Triangles present in all blocks, whereas articles 01, 02, 06 and 11 do not. In Sports, articles 12 and 15 have News Triangles present in all blocks, whereas articles 14 and 31 do not.

From Table 10 below, we see that 5 of 7 articles in Culture end with a block where there is a News Triangle, 9 of 14 articles in Domestic end with News Triangles, 5 of 6 articles in Economy end with News Triangles, and 3 of 4 articles in Sports end with News Triangles. All in all, 22 of 31 articles end with a News Triangle.

Articles with News Triangle in all blocks.

For the articles with News Triangles present in all blocks, in Culture, articles 18 and 23 consist only of the Intro, and article 28 also has a Section 1 block. For Domestic, articles 08 and 26 consist of an Intro block only, whereas articles 03, 05 and 19 consist of Intro and Section 1 blocks. Articles 04 and 13 (Domestic) go all the way to Section 3 and Section 4 respectively. For Economy, article 09 has an Intro and a Section 1 block with News Triangles, and article 10 has News Triangles all the way to Section 3. In Sports, articles 12 and 15 have News Triangles for each block, where article 12 only has an Intro block and article 15 goes all the way to Section 2.

Articles without News Triangle in all blocks.

For articles without News Triangles in every block, in Culture, articles 16, 17, 22 and 27 do not have a series of Stacked News Triangles present across all blocks. However, going back to articles 16 and 22, it is clear that the linear fit is almost horizontal. In the blocks of articles 17 and 27 where the popularity of keywords ascends, it does so distinctly.

In Domestic, articles 07, 20, 21, 24, 25, 29 and 30 do not have News Triangles in all blocks. For article 21 (Sections 1, 2 and 3), article 29 (Sections 1 and 2) and article 30 (the Intro), there is an almost horizontal linear fit for each block. In the blocks of articles 07, 20 and 24 where the popularity of keywords ascends, it does so distinctly.

In Economy, articles 01, 02, 06 and 11 do not have News Triangles present in all blocks. For article 06, Section 1, the linear fit is almost horizontal. For articles 01, 02 and 11 the popularity of keywords ascends fairly distinctly.

In Sports, articles 14 and 31 do not have News Triangles in all blocks.
The blocks ascend fairly distinctly.

Presence of News Triangles across all Sections and per Single Sections basis
Each block lists R2, MSE and Valued (V).

Article ID   All Sections      Intro             Section 1         Section 2         Section 3         Section 4         Section 5
             R2     MSE    V   R2     MSE    V   R2     MSE    V   R2     MSE    V   R2     MSE    V   R2     MSE    V   R2     MSE    V
Culture
Article 16   0.0026 11.79  1   0.0009 11.84  0
Article 17   0.1762 8.527  1   0.1768 10.71  1   0.0253 7.494  0   0.0498 5.132  1
Article 18   0.0936 8.619  1   0.0127 7.378  1
Article 22   0.0575 7.629  1   0.0001 5.975  0
Article 23   0.1420 10.79  1   0.0334 8.559  1
Article 27   0.0706 7.219  1   0.0372 9.019  0   0.2497 7.748  1   0.1420 5.595  0   0.0053 5.512  1   0.0055 5.309  1
Article 28   0.0155 10.29  1   0.0077 8.602  1   0.0928 8      1
Domestic
Article 03   0.0365 12.18  1   0.0207 10.51  1   0.0743 10.57  1
Article 04   0.1060 12.81  1   0.0005 12.16  1   0.0549 12.18  1   0.0126 6.106  1   0.0105 6.959  1
Article 05   0.1329 10.5   1   0.0061 9.446  1   0.0146 5.184  1
Article 07   0.0543 6.436  1   0.0154 7.34   0   0.1775 3.453  1   0.3084 4.12   1
Article 08   0.4150 15.75  1   0.4117 16.05  1
Article 13   0.0393 5.665  1   0.0048 4.995  1   0.0393 3.976  1   0.1517 1.232  1   0.0174 6.487  1   0.3162 3.132  1
Article 19   0.2474 8.191  1   0.0832 7.752  1   0.3484 4.644  1
Article 20   0.0153 19.49  1   0.1846 20.63  1   0.0808 15.55  1   0.0218 19.39  0
Article 21   0.0265 5.487  1   0.0891 6.592  1   0.0731 5.42   0   0.0000 3.271  0   0.0061 3.503  1   0.0071 3.286  0   0.0623 6.084  0
Article 24   0.0459 5.733  1   0.0397 6.88   0   0.0507 2.887  1   0.0890 1.638  1   0.0059 4.035  1   0.0099 2.515  0
Article 25   0.1050 4.887  1   0.0030 5.433  0   0.0191 3.933  1   0.0059 4.668  0   0.1501 3.426  1
Article 26   0.3197 11.49  1   0.2347 11.3   1
Article 29   0.2863 22.07  1   0.1272 19.19  1   0.0089 10.27  1   0.0016 9.882  0
Article 30   7.3050 9.673  1   0.0027 9.844  0
Economy
Article 01   0.0130 7.173  1   0.0226 8.903  1   0.0093 7.219  0   0.0067 5.72   1   0.1092 6.722  1
Article 02   0.0038 3.815  1   0.2559 3.406  0   0.1798 4.743  0   0.0138 2.591  1   0.0003 3.041  1
Article 06   0.0138 7.077  1   0.0148 7.561  1   4.0820 7.153  0   0.0758 7.612  1   0.0668 5.228  0
Article 09   0.0791 8.223  1   0.0954 6.107  1   0.1119 6.18   1
Article 10   0.0857 8.645  1   0.0026 8.338  1   0.1458 7.522  1   0.0190 6.176  1   0.0001 4.082  1
Article 11   0.1402 8.45   1   0.0065 8.64   1   0.0052 6.652  1   0.0001 7.314  0   0.0055 7.35   1
Sports
Article 12   0.0778 8.854  1   0.0258 8.643  1
Article 14   0.0140 24.78  1   0.0111 21.4   0
Article 15   0.0486 7.196  1   0.2669 5.567  1   0.0171 5.338  1   0.2257 4.199  1
Article 31   0.0008 9.1910 0   0.2033 7.988  1   0.0572 9.686  0   0.0643 6.957  1   0.0145 8.882  1

Table 10: Squared Correlation Coefficients (R2), Mean Squared Error (MSE) and a boolean value (Valued) of whether there is a News Triangle pattern present, where a value of 1 equals a News Triangle pattern (Keyword Popularity descends from left to right), and a value of 0 equals no News Triangle presence (Keyword Popularity ascends from left to right). R2 and MSE are based on the linear fit of all sections or on a per section element basis. The values for Squared Correlation Coefficients and Mean Squared Error are taken from each “Keyword Distribution incl. All Popularity Fit” illustration (see Appendix B) on Plot.ly. Articles with News Triangles in all blocks are shown with the respective article's Article ID bolded.

In Table 11 (on page 37) the count of News Triangle presence vs. number of blocks is presented. 96.77% of the articles have an overall News Triangle; however, there is not a clear picture for any section blocks past Section 1. For Intro and Section 1 blocks the percentage of News Triangles found is 70.97% and 72.73% respectively, while for Section 2 (57.89%), Section 3 (91.67%), Section 4 (50%) and Section 5 (0%) there is a much greater variance in News Triangle presence. There is no apparent pattern.

Percentage of News Triangles per block

                       Overall Articles   Intro    Section 1   Section 2   Section 3   Section 4   Section 5
News Triangles Found   30                 22       16          11          11          2           0
Number of Blocks       31                 31       22          17          12          4           0
Percent                96.77%             70.97%   72.73%      57.89%      91.67%      50%         0%

Table 11: Summary and Percentage of News Triangles in all articles and per block. The percentage is calculated by dividing the number of blocks where we found a News Triangle by the Number of Blocks with descending or ascending keyword popularity.

Looking at Table 12 below and Illustration 11 on page 39 in conjunction, it is apparent that not all articles have an equal number of section blocks. Economy and Sports do not have any article blocks in Section 4 or 5. All articles have an Intro block; however, not all articles go beyond the Intro block, and we denote this by adding (blocks: no. of blocks) after the mean value.

We find that the mean values for all blocks (first column) lie close to each other. The exception is Sports, where article 31 has no overall News Triangle. The mean value for all articles is 0.9677 (Culture: mean 1, Domestic: mean 1, Economy: mean 1, Sports: mean 0.7500).

For the Intro blocks, Domestic (mean: 0.7143), Economy (mean: 0.8333) and Sports (mean: 0.7500) are above the overall mean of 0.7097, and Culture (mean: 0.5714) is below the mean average.

For the Section 1 blocks, Domestic (mean: 0.9091, blocks: 11) is above the overall mean of 0.7273. Culture (mean: 0.6667, blocks: 3), Economy (mean: 0.5000, blocks: 6) and Sports (mean: 0.5000, blocks: 2) are below the mean average.

For the Section 2 blocks, Economy (mean: 0.8000, blocks: 5) and Sports (mean: 1, blocks: 2) are well above the overall mean of 0.5789. Culture (mean: 0.5000, blocks: 2) and Domestic (mean: 0.5000, blocks: 8) are below the mean average.
For the Section 3 blocks, Culture (mean: 1, blocks: 1), Domestic (mean: 1, blocks: 5) and Sports (mean: 1, blocks: 1) are above the overall mean of 0.9167. Economy (mean: 0.8000, blocks: 5) is below the mean average.

For the Section 4 blocks, Culture (mean: 1, blocks: 1) is above the overall mean of 0.5000, whereas Domestic (mean: 0.3333, blocks: 3) is below.

For the Section 5 blocks, only Domestic is present with one article, where it has no News Triangle and the mean for this block and the overall block is 0.

Mean, Standard Deviation, Variance and Standard Error for all Blocks/Categories

Category   Statistic   All Blocks   Intro    Section 1   Section 2   Section 3   Section 4   Section 5
Culture    Mean        1            0.5714   0.6667      0.5000      1           1           -
           Std Dev     0            0.4949   0.4714      0.5000      0           0           -
           Variance    0            0.2449   0.2222      0.2500      0           0           -
           Std Error   0            0.1870   0.2722      0.3536      0           0           -
Domestic   Mean        1            0.7143   0.9091      0.5000      1           0.3333      0
           Std Dev     0            0.4518   0.2875      0.5000      0           0.4714      0
           Variance    0            0.2041   0.0826      0.2500      0           0.2222      0
           Std Error   0            0.1207   0.0867      0.1768      0           0.2722      0
Economy    Mean        1            0.8333   0.5000      0.8000      0.8000      -           -
           Std Dev     0            0.3727   0.5000      0.4000      0.4000      -           -
           Variance    0            0.1389   0.2500      0.1600      0.1600      -           -
           Std Error   0            0.1521   0.2041      0.1789      0.1789      -           -
Sports     Mean        0.7500       0.7500   0.5000      1           1           -           -
           Std Dev     0.4330       0.4330   0.5000      0           0           -           -
           Variance    0.1875       0.1875   0.2500      0           0           -           -
           Std Error   0.2165       0.2165   0.3536      0           0           -           -
All        Mean        0.9677       0.7097   0.7273      0.5789      0.9167      0.5000      0
           Std Dev     0.1767       0.4539   0.4454      0.4937      0.2764      0.5000      0
           Variance    0.0312       0.2060   0.1983      0.2438      0.0764      0.2500      0
           Std Error   0.0317       0.0815   0.0950      0.1133      0.0798      0.2500      0

Table 12: Mean, Standard Deviation, Variance and Standard Error for all blocks in each category. Only blocks with values are considered. Values have been rounded to four decimals. This table goes with Illustration 11 on page 39 below.
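The entries of Table 12 follow from the 0/1 Valued scores in Table 10. As a sketch, the Culture Intro column can be recomputed as below. The function name block_stats is ours, and the choice of the population standard deviation with Std Error = Std Dev / sqrt(number of blocks) is an assumption that reproduces the rounded figures in the table.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance


def block_stats(valued):
    """Summarise a list of 0/1 Valued scores for one block and category."""
    n = len(valued)
    std_dev = pstdev(valued)  # population standard deviation (assumption)
    return {
        "blocks": n,
        "mean": mean(valued),
        "std_dev": std_dev,
        "variance": pvariance(valued),
        "std_error": std_dev / sqrt(n),
    }


# Valued scores of the seven Culture Intro blocks, read off Table 10
# (articles 16, 17, 18, 22, 23, 27, 28).
culture_intro = [0, 1, 1, 0, 1, 0, 1]

stats = block_stats(culture_intro)
print({key: round(value, 4) for key, value in stats.items()})
```

Because the scores are binary, the mean is simply the share of blocks with a News Triangle, which is also the quantity reported as a percentage in Table 11.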
3.4 Named Entities in Detail

The annotations entered by the participants take many forms: verbs, nouns or names of people, places or organisations. In this section we dig deeper into Named Entities (NE). Automatically identifying which keywords are nouns, verbs, NE's, etc. is difficult, since the common step is Part of Speech (POS) tagging, where the token's position in the text helps identify which POS-tag a word belongs to. Since we deal with keywords that are added without surrounding context, POS-tagging is difficult. Thus we concentrate only on NE's, since these are easier to identify. The following results are divided into NE occurrence in the articles and NE's in keywords per article.

Named Entities

A Named Entity (NE) is a Person, Place or Organisation. NE's are a central part of news articles, as they describe that something has happened to someone. It is worth noting that parts of an NE may occur at other places in the text if a person is mentioned by only first or last name. We find that this occurs in some texts, e.g. “Klaus Riskær Pedersen” is also referred to as only “Riskær”. We also note that names in news articles are sometimes misspelled: “Roberto Ferhi”, “Roberto Mehri”, “Roberto Merhi”. It is difficult to extract this information in a useful manner, since both first and last names could be nouns or verbs, and we could end up with false positives in our dataset.
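To illustrate the two name problems mentioned above, the following sketch groups spelling variants by string similarity and checks whether a short mention is part of a full name. It is not the method used in this thesis: the similarity measure (Python's difflib), the threshold of 0.8 and the function names are all assumptions chosen for the example, and such matching can itself introduce the false positives the text warns about.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two names, ignoring case (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_variants(names, threshold=0.8):
    """Greedily group names that resemble the first member of a group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


def is_partial_mention(mention, full_name):
    """True if every token of the mention is a token of the full name."""
    full_tokens = full_name.lower().split()
    return all(token in full_tokens for token in mention.lower().split())


names = ["Roberto Ferhi", "Roberto Mehri", "Roberto Merhi",
         "Klaus Riskær Pedersen"]
groups = group_variants(names)
print(groups)
print(is_partial_mention("Riskær", "Klaus Riskær Pedersen"))
```

The threshold is the weak point of this approach: set too low, unrelated names merge, and set too high, genuine misspellings stay apart.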
From the list of NE's extracted from the entire collection of 31 articles, we find 341 NE's, of which 298 are unique. For article 29 we find that the annotation “S”, an abbreviation for the Danish party “Socialdemokratiet” (Social Democrats), appears 264 times in the text. We regard this as an abnormality (and an indication of why text mining and NLP are difficult), since the NLP process of tokenising sentences and words, and converting words to lowercase, might have removed plural s's in the text. Thus the count for the token “s” has been removed from the data. We also notice that the word “OL”, the Danish abbreviation for the Olympic Games, is mentioned 60 times in article 31. Since the abbreviation “OL” is unique, we do not regard this as an abnormality and it has not been removed from the data.

3.5 Named Entity Occurrence in Articles

Overall Named Entity Occurrence in Articles

Illustration 12 below shows a line and bar chart, and a box plot, of the distribution of frequencies of all types of NE in the 31 articles, where frequency is the number of times an NE is mentioned per article. We find that the occurrences follow a Pareto power-law distribution, where the majority of NE's in an article occur two to eight times in the text. By conducting a Chi-2 test (F-value 15.00143 / p-value 0.00027), we draw the conclusion that there is no connection between the number of times an article has been annotated by participants and the count of NE's in the article; thus we are certain that we are not working with skewed data.
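The frequency of an NE per article, and the way a simple tokeniser can inflate the count of the token “s”, can be sketched as follows. The tokeniser here is an assumption (lowercasing and splitting on everything that is not a letter); the actual pipeline is described in the methodology. The example sentence is made up and only handles single-token entities.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and split on anything that is not a letter."""
    return re.findall(r"[a-zæøå]+", text.lower())


def entity_frequencies(text, entities, ignore=("s",)):
    """Count how often each single-token entity occurs in one article."""
    wanted = {entity.lower() for entity in entities} - set(ignore)
    return Counter(token for token in tokenise(text) if token in wanted)


# Hypothetical snippet, not taken from the article collection.
text = "S's forslag blev vedtaget. OL i Rio: Danmark sender 120 atleter til OL."

tokens = tokenise(text)
freqs = entity_frequencies(text, ["S", "OL", "Rio", "Danmark"])

print(tokens.count("s"))   # the apostrophe s becomes a second "s" token
print(freqs)
```

With this kind of tokenisation the party acronym and a genitive ending become indistinguishable, which is consistent with the decision to drop the count for “s” while keeping the unambiguous “OL”.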