A practical application of Topic Mining on disaster data
Robert Monné
Master of Business Informatics, Capita Selecta (7.5 ECTS)
January 2016
Intro
As described in our previous research (Monné, van den Homberg & Spruit, 2016), a disaster situation poses many information challenges, one of which is the vast amount of unstructured information. We want to solve a small piece of that puzzle by creating a method to quickly analyze unstructured documents; this analysis can then be used to extract the most relevant information for a specific audience.
In earlier research we identified 84 information needs, which depict the information required by disaster responders in a flood situation in Bangladesh. We use these in the current research to extract relevant information from the unstructured data. A disaster puts time pressure on the decisions required for an effective response, and therefore on the timely retrieval of the information those decisions depend on. In our (and similar) contexts, NGOs and governments produce PDF reports describing the disaster situation, or the situation before the disaster. These documents run to hundreds of pages and are therefore not easily handled in a disaster situation; at the same time, they contain a great deal of useful information for the disaster responder. We want to use algorithms to quickly extract the information the disaster responder requires.
Scope
There are multiple ways to analyze textual data: we can read the text manually, or we can try to analyze it automatically. One of the research subjects focusing on the latter is Text Analytics. We briefly introduce the field to scope our research.
Text analytics algorithms can be applied to data in several ways. The first is clustering, which calculates the distances between documents and then tries to group them into coherent clusters. Second, there is classification, which automatically assigns documents to separate classes based on their characteristics. There is also sentiment analysis, which tries to find the opinions and sentiments of the writer of a text; this could, for example, concern the writer's attitude towards specific functionality of a laptop. This overview is merely meant to position our paper and does not aim to be exhaustive.
Finally, there is topic mining, which is the process of extracting words from a text that have a high probability of representing the core of the document. We chose this approach to experiment with in our prototype.
Literature
We used the text analytics course provided by the University of Illinois, available on Coursera at https://www.coursera.org/course/textanalytics. This course gave us a clear understanding of the field and pointed us in the direction of Topic Mining. We used the tm and quanteda packages and the Xpdf software for the preprocessing of the data, and the topicmodels package for extracting the topics; references for these can be found in the reference section.
Research Goal
Our goal is to create a replicable, practice-oriented text-mining method that can be applied across cases. Users of our method would know beforehand which practical challenges and considerations occur when applying a topic model. In addition, we want to apply topic-mining algorithms in a prototype, to validate the applicability of our method and of the related techniques in this specific situation. This prototype can be further developed and applied in similar (disaster) situations: in a new situation, only the input data needs to be preprocessed in the same way as we did, after which the script can predict topics for the text segments.
Research Method
We conducted experimental and explorative research, using the CRISP-DM process to help determine the steps needed in a data-oriented experiment. We specifically do not aim to create new algorithms; we merely want to apply readily available ones in a new context. Based on the keywords for the techniques we found in the Coursera course, we used plain internet search to find packages that matched the functionality we required. From this experiment we deduced the lessons learned and created a replicable method for text mining projects.
To implement the topic model in our experiment we used RStudio, a powerful user interface for R. R is a functional language used for statistical computing and graphics.
Results
Our method can be found in figure 1, and is described and validated
with an example in the following sections.
Document retrieval
The first step is to identify and retrieve the documents of interest. Two documents were already identified in earlier research (Monné et al., 2016); together these cover a large part of the information needs of disaster responders. In our case the documents are downloaded and saved to a local hard disk, but for larger cases we envision many more documents, which could for example be stored in a document-store database.
Pre-processing
First the documents need to be converted to a format that is easily handled; the format (PDF) in which we downloaded them is not directly usable. We decided to convert the documents to TXT format, the most basic representation of text. For this step we used the tm package available for R, which first requires creating a corpus from the PDF documents. The PDFs are converted to the corpus using the Xpdf software, which is available online for free. This software only needs to be unzipped on your machine, after which you adjust your Windows system path to the respective folder so RStudio can find it. After the corpus is created we write it completely to TXT files, because in the next step we use a different package that cannot handle a tm-oriented corpus.
library(tm)
setwd("C:/R/")
fp <- file.path(".", "docs")
# read every PDF in ./docs into a corpus (uses Xpdf's pdftotext via readPDF)
corp <- Corpus(DirSource(fp), readerControl = list(reader = readPDF))
# write each document out as a TXT file for the next step
writeCorpus(corp, "C:/R/preprocessed")
[Figure 1. Stepwise text mining process: Document retrieval → Pre-processing → Split documents → Process segments → Fitting the model → Process results → Validate results]
Split documents
There are multiple ways to look at the data, ranging from a very high-level perspective to a very low and granular one. At the highest level we can identify two units of interest in our case, namely the District Disaster Management Plan (DDMP) and the JNA (Joint Needs Assessment). We could try to identify the topic of each whole document based on its contents; however, this analysis would yield no useful insight, since we already know the topics of these documents. We could also divide the documents solely by page, so an 80-page document would yield 80 units to be tagged. However, we deem this division impractical, since related information can be split by a page break, giving a higher probability of incorrectly assigning a topic to the text. The smallest unit would be a single word, which is not feasible since the information needs are far more complex than single words. Tagging individual sentences would yield a sufficient sample size, since the number of sentences in the documents is fairly large (a couple of thousand). We chose instead to use the information encapsulated in the table of contents to divide the documents into portions that can be tagged. This way we are sure all the information within a unit is related, and it leaves an interesting number of analyzable elements (116 to be precise).
So we wanted to split the corpus based on the table of contents. We extracted and manually cleaned the table of contents to create a usable format for the splitting (we removed page numbers and separated the headings with commas). Unfortunately, we discovered that the table-of-contents headings do not exactly match the headings in the text. Therefore we chose to manually copy the relevant split points (i.e. the headers inside the text) and store them in a CSV file, which is then used to actually segment the texts. We also needed to manually adjust some text, because the string “Annexure 2” occurs multiple times in the document (for example also in “Annexure 23” etc.), resulting in wrong splits. We therefore modified “Annexure 2” to “Annexure 2-1” in the pre-processed TXT file, which yields a unique string that can be conveniently used for splitting.
The algorithm works as follows: first we read in the text from the earlier pre-processing step and convert it to a quanteda corpus. Then we use the CSV file described above to segment the corpus with a for-loop (printing some status indicators along the way). We clean up our workspace with the rm function. Finally we use the CSV split file to label the documents, and clean up the workspace again. We also found out that our documents are encoded in a non-standard format, which initially produced wrong results.
library(quanteda) # segmenting into blocks

# import the JNA in quanteda format for splitting into blocks
JNAtxt <- textfile("C:/R/preprocessed/JNA.pdf.txt", encodingFrom = "ASCII")
qcorpJNA <- corpus(JNAtxt)

# splitting the JNA on the split points from the CSV file
splitsJNA <- read.csv("C:/R/splitpoints/JNAsplits2.txt", header = T)
JNASplitted <- qcorpJNA
for (i in 1:length(splitsJNA[,1])) {
  JNASplitted <- segment(JNASplitted, "other", delimiter = toString(splitsJNA[i,1]))
  # status indicators
  print("i")
  print(i)
  print("length after")
  print(length(JNASplitted$documents[,1]))
}
rm(i)
rm(JNAtxt)
rm(qcorpJNA)

# creating names for the documents
JNAnames <- c("JNA intro")
for (i in 1:length(splitsJNA[,1])) {
  JNAnames <- c(JNAnames, paste("JNA", toString(splitsJNA[i,1])))
}
docnames(JNASplitted) <- JNAnames
rm(i)
rm(splitsJNA)
rm(JNAnames)
We used the same algorithm for segmenting the District Disaster Management Plan; only the input file and the split points differ.
Process segments
To fit a model that can predict the topic of the segments, we need to process the documents further and finally create a document-feature matrix from them. These steps can be performed with the quanteda package. We start by combining the two corpora. Then we create a document-feature matrix with some processing settings. One option is stemming, which brings the words in a document back to their root form so all words can be interpreted on the same level, removing differences like Walk vs Walking. However, we did not use this option, because it gave strange results, like “disast” instead of “disaster”. We do remove punctuation, because it is irrelevant for the analysis. We also remove stop words, because these are not plausible segment topics; stop words in English include “I”, “me” and “yourself”. We furthermore remove words that occur frequently in the corpus but are irrelevant as a topic, for example Sirajganj (the area the documents are written about) or Upazila (which means something like a state in the US). Finally we remove numbers from the corpus, because these could lead the algorithm to be fitted on unique numbers, which are really not a topic. Number and punctuation removal are standard in the dfm function.
We convert the quanteda document-feature matrix to a tm-style document-term matrix, because that is what the topicmodels package, which we use to fit a topic model, can handle.
Then we calculate a TF-IDF table, which is used to remove terms that are “too frequent”, for example terms that occur in nearly every document; these are not suitable for fitting. In our case we draw the line at a TF-IDF of 0.012, which is just over the median over all terms and documents. The TF-IDF value increases when a word is frequent in a document, but decreases when the word is also frequent in the rest of the corpus; it is a statistic of the importance of a term in a corpus. Finally we remove documents that are left with no frequent terms.
Before the TF-IDF removal we had a matrix of 109 documents and 6412 features (words); afterwards we have 105 documents and 3204 features.
DDMPJNA <- DDMPSplitted + JNASplitted
Bothdfm <- dfm(DDMPJNA,
               ignoredFeatures = c(stopwords("english"), "sirajganj", "sirajgonj",
                                   "district", "upazila", "flood", "unions",
                                   "assessment", "jna", "md"),
               stem = F)
Bothdfm <- convert(Bothdfm, "tm")

# calculating tf-idf (row_sums/col_sums come from the slam package)
library(slam)
tfidf <- tapply(Bothdfm$v / row_sums(Bothdfm)[Bothdfm$i], Bothdfm$j, mean) *
  log2(nDocs(Bothdfm) / col_sums(Bothdfm > 0))
summary(col_sums(Bothdfm))
summary(tfidf)

# removing too-frequent terms (and documents left with 0 terms)
dim(Bothdfm)
Bothdfm2 <- Bothdfm[, tfidf >= 0.012]
rm(Bothdfm)
dim(Bothdfm2)
Bothdfm2 <- Bothdfm2[row_sums(Bothdfm2) > 0, ]
dim(Bothdfm2)
Fitting the model
In this step we fit a Latent Dirichlet Allocation (LDA) model. The main reason we chose this model is that it supports multiple topics (in fact the result is a probability distribution over topics), as opposed to the single topic resulting from a unigram model. There are many settings we could tweak and modify; however, we don't go into too much depth, because we want the algorithm to be easily applicable by non-technical users. We are free to choose the number of topics we want; in the example code below there are 40 topics. We set the seed to get replicable results.
library(topicmodels)
k <- 40       # number of topics
SEED <- 2015  # fixed seed for replicable results
VEM <- LDA(Bothdfm2, k = k, control = list(seed = SEED))
Process results
The topicmodels package delivers a fitted model with the results: a vector in which every document/segment is related to a topic, and a structure in which every topic is related to its most likely terms. A disadvantage is that the package does not support a combination of the two objects, which makes the results not easily and quickly understandable. For this reason we wrote a script that combines the two, so they can be easily analyzed.
The 5 most likely terms per topic are extracted with the terms function, while the most likely topic for every document is extracted with the topics function. Now we are able to process the results further.
First we transpose the two objects to be column-oriented instead of row-oriented. We need to use the transpose function twice for the Topics vector, because the first run only transforms it to a matrix and does not switch rows with columns. Then we append a column to the Terms object with the topic numbers (which are depicted in the row names of the result set; however, row names are omitted when using the merge function, so we set them as a separate column). Then we append the row names of the Topics object (which are the related documents). Now we have two tables: one of 104 rows and 2 columns (104 documents; the columns are topic number and document title), and one of 15 rows and 6 columns (15 topics; the columns are the topic number and the 5 related terms). We use the colnames function to set the column names of the two tables. Finally we use the merge function from base R to combine the two tables.
Terms <- terms(VEM, 5)    # 5 most likely terms per topic
Topics <- topics(VEM, 1)  # most likely topic per document
rm(k, SEED, tfidf)
Terms <- t(Terms)
Topics <- t(Topics)  # first t() turns the vector into a 1 x n matrix ...
Topics <- t(Topics)  # ... the second t() makes it column-oriented
Terms <- cbind(Terms, c(1:length(Terms[,1])))  # add topic numbers as a column
Topics <- cbind(Topics, rownames(Topics))      # add document titles as a column
colnames(Topics) <- c("Topic nr", "Heading")
colnames(Terms) <- c("Topic1", "Topic2", "Topic3", "Topic4", "Topic5", "Topic nr")
TopicTerms <- merge(Topics, Terms, by = c("Topic nr"),
                    all.x = T, all.y = F, sort = F)
rm(Terms, Topics)
Validate Results
The result table can be found in the appendix.
We determined by trial and error that 40 topics is the most suitable number for this case: 10, 15 and 20 topics gave far too few distinct topics as a result, which led to text segments sharing the same topic while being totally unrelated.
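The trial-and-error choice of the number of topics could be made more systematic by comparing model perplexity (lower is better) for several candidate values of k, using the perplexity function from the topicmodels package. The sketch below runs on the AssociatedPress example corpus bundled with topicmodels so it is self-contained; in our own setting one would substitute Bothdfm2 from the earlier steps.

```r
library(topicmodels)

# small bundled example corpus, subset to keep the sketch fast
data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:50, ]

candidates <- c(5, 10, 20)
perp <- sapply(candidates, function(k) {
  model <- LDA(dtm, k = k, control = list(seed = 2015))
  perplexity(model)  # perplexity on the training data; lower is better
})
names(perp) <- candidates
perp  # pick a k where the perplexity levels off
```

This does not replace the manual usefulness check, but it narrows down the candidate values of k before inspecting topics by hand.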
Semantics is an issue: the model cannot link the topics to real-world information. For example:
DDMP 1.3.2 Area: char
From this example we know the word “char” is very relevant for the area segment, because “char” is the specific (Bangladeshi) term for the ground situation. An untrained responder would not know this, and we would therefore have liked to see a topic like “Area” for this specific chapter. But that term is not frequent enough in this segment, and therefore it will never be the topic using the LDA method.
We manually analysed every segment of text to see whether the topics from the algorithm matched our understanding of the text. Every topic-segment combination we found useful we marked with a Yes, and all others with a No. The results are mixed: we see a very clear distinction between the results for the JNA and the DDMP. The DDMP has only 19/75 useful topics assigned, whilst the JNA has 24/30 topics usefully assigned. This also means that the results for the total set are a bit unsatisfactory; only 43/105 topics are usefully assigned.
Count of Useful   DDMP   JNA   Grand Total
No                  56     6            62
Yes                 19    24            43
Grand Total         75    30           105
Because we want to understand the results in more detail, we gave every “not useful” marking a reason. These reasons, with their occurrence counts, can be found in the table below. We see a very clear leader among the reasons for incorrect topic assignment: “Topic is not mentioned in the text”. This basically means that we believe the text is really about something else than the most frequent term makes us believe. The LDA algorithm is solely based on the words mentioned in the actual text, and is therefore incapable of suggesting the terms we find more probable as a topic.
The second most frequent issue is “Table as content”, which also occurs in combination with “Numbers as content” and “Picture as content”. These are segments that are not recognized correctly by the algorithm. In a table, the header row has the highest probability of being related to the topic; however, this additional information is not used by the algorithm, which weights all words equally. The numbers in the text are removed in the pre-processing part of the analysis, because these can never be a topic; this, however, leaves some segments low in content, which leads to a wrong topic. The algorithm does not recognize images, and therefore cannot correctly assign topics to text segments with a high number of images.
The third most frequent issue is “Topics are not related to information need”, where the contents of the text are not related to an information need expressed in our previous research. These are, for example, text segments like “Shortcoming of assessment”.
Finally, in the DDMP a lot of region names are mentioned; this leads to a high frequency of these words, which therefore seem probable topics for a segment, while they are not the actual subject. We could counter this by removing all region names in the “process segments” step.
Count of Reason                                             DDMP   JNA   Grand Total
Topic is not mentioned in text (related information need)     14              14
Table as content                                               9               9
Topics are not related to information need                     9               9
Table as content and Region names                              6               6
Picture as content                                             4     1         5
Region names frequently occurs                                 5               5
Cannot reproduce                                               2     2         4
Not related to information need                                3               3
Table and Numbers                                              3               3
Table with much numbers                                        1               1
Numbers as content                                             1               1
Schools are shelters                                           1               1
Grand Total                                                   55     6        61
Next steps
For further extraction of relevant details we propose to use an intelligent search engine that builds on the topic models we created. We now have broad categories into which the documents can be divided; however, we cannot get exact statistics, like the number of people affected, based on this analysis. Such statistics can be highly useful for disaster responders.
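As a toy illustration of the kind of detail extraction such a search layer could add on top of the topic labels, one could pull the numeric figures out of a segment and link them to nearby keywords (the sentence below is invented):

```r
# extract numeric figures from a segment as candidates for "exact statistics"
seg <- "An estimated 12500 people were affected and 300 shelters opened."
figures <- as.integer(regmatches(seg, gregexpr("[0-9]+", seg))[[1]])
figures  # 12500 and 300, to be linked to keywords like "affected" or "shelters"
```

A real search engine would of course need far more context than a regular expression provides; this only sketches the direction.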
We could also have used a categorization package like RTextTools, which can be trained to predict the category a document belongs to. However, we did not have enough data to create both a training and a test set for this specific case. Nonetheless, for future research we see the possibility of creating a training set based on Wikipedia articles. This way we can create custom categories like baseline information, situation overview, needs of affected, etc., and train an algorithm to recognize and categorize these texts.
I perceive text mining as a very interesting field, and will continue to explore it in my professional career.
Conclusion and Research suggestions
For every interesting step in the process we draw conclusions and provide suggestions for improvement.
Splitting
We suggest incorporating a string-similarity function for the split points in the “splitting” step. In our case we got incorrect results because the strings from the table of contents did not match the strings in the actual text. We could counter this by applying a string-similarity algorithm, selecting the most similar sentence in the text, and using that as a split point.
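A minimal sketch of this suggestion using base R's adist (edit distance); the headings and text lines below are invented for illustration:

```r
# pick, for each table-of-contents heading, the most similar line in the text
toc_headings <- c("Annexure 2: Relief distribution", "Hazard analysis")
text_lines   <- c("Annexure 2 : Relief  distribution",
                  "Hazard analyses",
                  "Union-wise population")
find_split <- function(heading, lines) {
  d <- adist(heading, lines, ignore.case = TRUE)  # edit distance to every line
  lines[which.min(d)]                             # most similar line wins
}
sapply(toc_headings, find_split, lines = text_lines)
```

This tolerates the small spacing and spelling mismatches we ran into, at the cost of possibly picking a wrong line when two candidates are similarly close; a distance threshold could guard against that.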
Process segments
The results provided by the topicmodels package are not really intuitive to use. We needed to process the two result datasets to analyse them more effectively.
Validating
Unfortunately the topics we identified were not nearly a 100% match; this is mostly due to 4 reasons shared in the validation of our results. These 4 are:
1. Topic incorrect because: not mentioned in text
This is basically a disagreement between the authors of this article and the assumptions of the algorithm: the LDA algorithm assumes that the topic of a text is mentioned in the text itself, which is not always the case.
Suggestion: we could develop an algorithm that matches the topics of the text to the related information needs we are trying to find, for example by incorporating information from a dictionary or a thesaurus.
2. Topic incorrect due to: Table as content
Tables carry very valuable information in their headers; this is not recognized by this LDA algorithm, which leads to incorrect assignment of the topic. The LDA algorithm values every word equally, irrespective of its position.
Suggestion: develop an algorithm that takes the position of words in a table into account. We know from earlier encounters that algorithms exist which assign a higher weight to words in certain locations of a sentence; we do not know of any research that incorporates the position of a word in a table.
3. Topic useless because: not related to information need
This basically means that the topic is correctly assigned; it is, however, not applicable to our specific case, because it is not related to an “information need” we are interested in.
Suggestion: use the topics to filter out the data not required by the disaster responders.
4. Topic incorrect due to wrong stop-word removal
We removed some clearly wrong topics from the text (disaster, Sirajganj, etc.); we did, however, not remove all sub-area names. This led to the name of a small town in the region being assigned as the topic.
Suggestion: use an iterative approach to remove the useless stop words specific to the text. We partly applied this iterative approach, since later in the process we identified stop words (like disaster, Sirajganj, etc.) for our case, and then applied them again in the “process segments” step.
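The iterative stop-word step can be supported by inspecting the most frequent remaining features of the document-feature matrix and judging by hand which of them are corpus-specific stop words, as we did with "sirajganj" and "upazila". A sketch with a recent quanteda API (toy texts invented; note that in newer quanteda versions dfm() takes a tokens object rather than a corpus):

```r
library(quanteda)

# invented miniature corpus standing in for our segmented documents
toy <- corpus(c("sirajganj flood shelter",
                "sirajganj flood relief",
                "sirajganj school shelter"))
toydfm <- dfm(tokens(toy))
topfeatures(toydfm, 5)  # candidates for the next round of stop-word removal
```

A term like "sirajganj" that tops this list in every round is a strong candidate for the ignored-features list of the next iteration.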
References
Bettina Gruen and Kurt Hornik (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30. URL: http://www.jstatsoft.org/v40/i13/
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5), 1-54. URL: http://www.jstatsoft.org/v25/i05/
Kenneth Benoit and Paul Nulty (2015). quanteda: Quantitative Analysis of Textual Data. R package version 0.9.0-1. URL: https://CRAN.R-project.org/package=quanteda
R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/