A practical application of Topic Mining on disaster data
Robert Monné
Master of Business Informatics
Capita Selecta (7.5 ECTS)
January 2016
Intro
As described in our previous research (Monné, van den Homberg & Spruit, 2016), a disaster situation poses many information challenges, one of which is the vast amount of unstructured information. We want to solve a small piece of that puzzle by creating a method to quickly analyze unstructured documents; this analysis can then be used to extract the most relevant information for a specific audience.
In earlier research we identified 84 information needs that describe the information required by disaster responders during floods in Bangladesh. We use these needs in the current research to extract relevant information from unstructured data. A disaster puts time pressure on the decisions required for an effective response, and therefore on the timely retrieval of the information those decisions require. In our (and similar) contexts, NGOs and governments produce PDF reports describing the disaster situation or the situation before the disaster. These documents span hundreds of pages and are therefore not easily handled during a disaster, yet they contain a great deal of information that is useful to the disaster responder. We want to use algorithms to quickly extract the information the disaster responder requires.
Scope
There are multiple ways to analyze textual data: we can read the text manually, or we can try to analyze it automatically. One of the research fields focusing on the latter is Text Analytics. We briefly introduce the field to scope our research.
Text analytics algorithms can be applied to data in several ways. The first is clustering, which calculates the distances between documents and then groups them into coherent clusters. The second is classification, which automatically assigns documents to predefined classes based on their characteristics. There is also sentiment analysis, which tries to find the opinions and sentiments of the writer of a text, for example the writer's attitude towards specific functionality of a laptop. This overview merely positions our paper and does not aim to be exhaustive.
Finally, there is topic mining, which extracts words from a text that have a high probability of representing the core of the document. We chose this approach to experiment with in our prototype.
Literature
We used the text analytics course provided by the University of Illinois, available on Coursera at https://www.coursera.org/course/textanalytics. This course gave us a clear understanding of the field and pointed us in the direction of topic mining. We used the tm and quanteda packages and the xpdf software for preprocessing the data, and the topicmodels package for extracting the topics; references for all of these can be found in the reference section.
Research Goal
Our goal is to create a replicable, practice-oriented text-mining method that can be applied across cases. Users of our method would know beforehand which practical challenges and considerations occur when applying a topic model. In addition, we want to apply topic-mining algorithms in a prototype, to validate the applicability of our method and the related techniques for this specific situation. This prototype can be further developed and applied in similar (disaster) situations. In a new situation, only the input data needs to be preprocessed in the same way as we did; the script can then predict topics for the text segments.
Research Method
We conducted experimental and explorative research, using the CRISP-DM process to determine the steps needed in a data-oriented experiment. We specifically do not aim to create new algorithms; we merely want to apply readily available ones in a new context. Based on the keywords for the techniques we found in the Coursera course, we used plain internet search to find packages that matched the functionality we required. From this experiment we deduced the lessons learned and created a replicable method for text mining projects.
To implement the topic model in our experiment we used RStudio, a powerful user interface for R. R is a functional language used for statistical computing and graphics.
Results
Our method is shown in Figure 1 and is described and validated with an example in the following sections.
Document retrieval
The first step is to identify and retrieve the documents of interest. Two documents were already identified in earlier research (Monné et al., 2016); together they cover a large part of the information needs of disaster responders. In our case the documents are downloaded and saved to a local hard disk, but for larger cases we envision many more documents, which could, for example, be stored in a document-store database.
Pre-processing
First the documents need to be converted to a format that is easy to handle; the format in which we downloaded them (PDF) is not directly usable. We decided to convert the documents to TXT, the most basic representation of text. For this step we used the tm package for R, which first requires creating a corpus from the PDF documents. The PDFs are converted into the corpus using the xpdf software, which is freely available online. The software only needs to be unzipped on your machine, after which you add the respective folder to your Windows system path so RStudio can find it. Once the corpus is created we write it out completely to TXT files, because in the next step we use a different package that cannot handle a tm corpus.
library(tm)

# Working directory contains a "docs" folder with the downloaded PDF reports
setwd("C:/R/")
fp <- file.path(".", "docs")

# Build a corpus from the PDFs; readPDF relies on the xpdf tools found on the system path
corp <- Corpus(DirSource(fp), readerControl = list(reader = readPDF))

# Write each document out as a plain TXT file for the next (quanteda) step
writeCorpus(corp, "C:/R/preprocessed")
Figure 1. Stepwise text mining process: document retrieval, pre-processing, split documents, process segments, fitting the model, process results, validate results.
Split documents
There are multiple ways to look at the data, ranging from a very high-level perspective to a very low and granular one. At the highest level we can identify two units of interest in our case, namely the District Disaster Management Plan (DDMP) and the Joint Needs Assessment (JNA). We could try to identify the topic of each whole document based on its contents, but this analysis would yield no useful insight, since we already know the topics of these documents. We could also divide the documents purely by page, so an 80-page document would yield 80 units to be tagged. However, we deem this division impractical: related information could be split by a page break, and there would be a higher probability of assigning the wrong topic to the text. The smallest unit could be a single word, which is not feasible since the information needs are far more complex than single words. Tagging individual sentences would yield a sufficient sample size, since the number of sentences in the documents is fairly large (a couple of thousand). We chose to use the information encapsulated in the table of contents to divide the documents into portions that can be tagged. This way we are sure all information within a unit is related, and it leaves a workable number of analyzable elements (116 to be precise).
So we wanted to split the corpus based on the table of contents. We extracted and manually cleaned the table of contents to create a usable format for the splitting (removing page numbers and separating the headings with commas). Unfortunately, we discovered that the table-of-contents headings do not exactly match the headings in the text. We therefore chose to manually copy the relevant split points (i.e. the headers as they appear inside the text) and store them in a CSV file, which is used to perform the actual segmentation of the texts. We also had to manually adjust some text, because the string “Annexure 2” occurs multiple times in the document (for example also within “Annexure 23”), resulting in wrong splits. We therefore changed “Annexure 2” to “Annexure 2-1” in the pre-processed TXT file, which yields a unique string that can be conveniently used for splitting.
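To illustrate the format (the heading names below are hypothetical examples, not the actual split points), the split file is a single column with a header row (read with header = TRUE), one heading per line, copied verbatim from the pre-processed TXT so the strings match exactly:
splitpoint
Executive summary
Annexure 2-1
Shortcoming of assessment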
The algorithm works as follows: first we read in the text from the earlier pre-processing step and convert it to a quanteda corpus. Then we use the CSV file described above to segment the corpus in a for-loop (printing some status indicators along the way). We clean up our workspace with the rm function. Finally we use the CSV split file to label the resulting documents and clean up the workspace again. We also found out that our documents were encoded in a non-standard format, which initially produced wrong results; this is why the encoding is specified explicitly when reading the file.
library(quanteda) # segment the pre-processed text into blocks based on the split points

# Import the pre-processed JNA text in quanteda format for splitting into blocks
JNAtxt <- textfile("C:/R/preprocessed/JNA.pdf.txt", encodingFrom = "ASCII")
qcorpJNA <- corpus(JNAtxt)

# Read the manually prepared split points and split the corpus on each of them
splitsJNA <- read.csv("C:/R/splitpoints/JNAsplits2.txt", header = T)
JNASplitted <- qcorpJNA
for (i in 1:length(splitsJNA[, 1])) {
  JNASplitted <- segment(JNASplitted, "other", delimiter = toString(splitsJNA[i, 1]))
  # status indicators: current index and number of segments so far
  print("i")
  print(i)
  print("length after")
  print(length(JNASplitted$documents[, 1]))
}
rm(i)
rm(JNAtxt)
rm(qcorpJNA)

# Name every segment after the heading (split point) that precedes it
JNAnames <- c("JNA intro")
for (i in 1:length(splitsJNA[, 1])) {
  JNAnames <- c(JNAnames, paste("JNA", toString(splitsJNA[i, 1])))
}
docnames(JNASplitted) <- JNAnames
rm(i)
rm(splitsJNA)
rm(JNAnames)
We used the same algorithm for segmenting the District Disaster Management Plan; it differs only in the input file and the split points, as sketched below.
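A minimal sketch of that variant, assuming hypothetical file names (DDMP.pdf.txt and DDMPsplits.txt; the actual names may differ):
# Same procedure as for the JNA, only with the DDMP input file and split points
DDMPtxt <- textfile("C:/R/preprocessed/DDMP.pdf.txt", encodingFrom = "ASCII")
DDMPSplitted <- corpus(DDMPtxt)
splitsDDMP <- read.csv("C:/R/splitpoints/DDMPsplits.txt", header = T)
for (i in 1:length(splitsDDMP[, 1])) {
  DDMPSplitted <- segment(DDMPSplitted, "other", delimiter = toString(splitsDDMP[i, 1]))
}
DDMPnames <- c("DDMP intro")
for (i in 1:length(splitsDDMP[, 1])) {
  DDMPnames <- c(DDMPnames, paste("DDMP", toString(splitsDDMP[i, 1])))
}
docnames(DDMPSplitted) <- DDMPnames
rm(i, DDMPtxt, splitsDDMP, DDMPnames)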
Process segments
To fit a model that can predict the topic of the segments, we need to process the documents further and finally create a document-feature matrix (dfm) from them. These steps can be performed with the quanteda package. We start by combining the two corpora, then create a document-feature matrix with a number of processing settings. Stemming brings the words in a document back to their root form, so that variants such as "walk" and "walking" are treated as the same word; however, we did not use this option, because it gave strange results, such as "disast" instead of "disaster". We do remove punctuation, because it is irrelevant for the analysis. We also remove stop words, because these are not plausible segment topics; English stop words include "I", "me" and "yourself". Furthermore, we remove words that occur frequently throughout the corpus and are irrelevant as topics, such as Sirajganj (the area the documents are written about) or Upazila (a sub-district level administrative unit). Finally we remove numbers from the corpus, because these could lead the algorithm to fit on unique numbers that are not topics. Number and punctuation removal are standard in the dfm function.
We convert the quanteda document-feature matrix to a tm-style document-term matrix, because that is the format the topicmodels package, which we use to fit the topic model, can handle.
We then calculate a TF-IDF value per term and use it to remove terms that are "too frequent", for example terms that occur in nearly every document; such terms are not suitable for fitting. In our case we draw the line at a TF-IDF of 0.012, which is just over the median over all terms and documents. The TF-IDF value increases when a word is frequent within a document, but decreases when the word is also frequent across the corpus; it is thus a measure of the importance of a term in the corpus. Finally, we remove documents that are left with no terms.
Before the TF-IDF removal we had a matrix of 109 documents and 6412 features (words); afterwards we have 105 documents and 3204 features.
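For reference, the statistic computed in the code below is, for each term t in a corpus of N documents (our reading of the script; the same construction appears in the topicmodels documentation):
tf-idf(t) = mean over documents d containing t of ( count(t, d) / total number of terms in d ) * log2( N / number of documents containing t )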
# Combine the two segmented corpora and build a document-feature matrix,
# dropping English stop words and frequent domain terms that cannot be topics
library(slam) # row_sums/col_sums for sparse matrices
DDMPJNA <- DDMPSplitted + JNASplitted
Bothdfm <- dfm(DDMPJNA,
               ignoredFeatures = c(stopwords("english"), "sirajganj", "sirajgonj",
                                   "district", "upazila", "flood",
                                   "unions", "assessment", "jna", "md"),
               stem = F)

# Convert to a tm-style matrix so the topicmodels package can use it
Bothdfm <- convert(Bothdfm, "tm")

# TF-IDF per term: mean within-document relative frequency times log2 inverse document frequency
tfidf <- tapply(Bothdfm$v / row_sums(Bothdfm)[Bothdfm$i], Bothdfm$j, mean) *
  log2(nDocs(Bothdfm) / col_sums(Bothdfm > 0))
summary(col_sums(Bothdfm))
summary(tfidf)

# Keep only terms above the TF-IDF threshold, then drop documents left without terms
dim(Bothdfm)
Bothdfm2 <- Bothdfm[, tfidf >= 0.012]
rm(Bothdfm)
dim(Bothdfm2)
Bothdfm2 <- Bothdfm2[row_sums(Bothdfm2) > 0, ]
dim(Bothdfm2)
Fitting the model
In this step we fit a Latent Dirichlet Allocation (LDA) model. The main reason we chose this model is that it supports multiple topics per document (the result is in fact a probability distribution over topics), as opposed to the single topic produced by a unigram model. There are many settings we could tweak, but we do not go into much depth here, because we want the algorithm to remain easily applicable by non-technical users. We are free to choose the number of topics; in the example code below there are 40. We set the seed to obtain replicable results.
library(topicmodels)
# 40 topics, fixed seed for reproducible results (default VEM estimation)
k = 40
SEED = 2015
VEM = LDA(Bothdfm2, k = k, control = list(seed = SEED))
Process results
The topicmodels package delivers a fitted model from which the results can be extracted: a vector relating every document/segment to its most likely topic, and a matrix relating every topic to its most likely terms. A disadvantage is that the package does not combine the two objects, which makes the results hard to interpret quickly. For this reason we wrote a script that merges the two, so the results can be analyzed easily.
The 5 most likely terms per topic are extracted with the terms function, while the most likely topic for every document is extracted with the topics function. We can then process the results further.
First we transpose the two objects to be column-oriented instead of row-oriented. We need to apply the transpose function twice to the Topics vector, because the first call only turns it into a data structure without switching rows and columns. Then we append a column to the Terms object with the topic numbers (these are held in the row names of the result set, but row names are lost when using the merge function, so we store them as a separate column). We likewise append the row names of the Topics object (the related document names) as a column. We now have two tables: one of 2 x 104 (104 documents; the two columns are topic number and document title) and one of 6 x 15 (the six columns are the topic number and the 5 related terms; 15 is the number of topics in that run). We use the colnames function to set the column names of both tables, and finally use the merge function from base R to combine them.
# Extract the 5 most likely terms per topic and the most likely topic per document
Terms <- terms(VEM, 5)
Topics <- topics(VEM, 1)
rm(k, SEED, tfidf)

# Make both objects column-oriented (Topics needs two passes)
Terms <- t(Terms)
Topics <- t(Topics)
Topics <- t(Topics)

# Keep topic numbers and document names as ordinary columns, since merge() drops row names
Terms <- cbind(Terms, c(1:length(Terms[, 1])))
Topics <- cbind(Topics, rownames(Topics))
colnames(Topics) <- c("Topic nr", "Heading")
colnames(Terms) <- c("Topic1", "Topic2", "Topic3", "Topic4", "Topic5", "Topic nr")

# Combine the two tables on the topic number
TopicTerms <- merge(Topics, Terms, by = c("Topic nr"), all.x = T, all.y = F, sort = F)
rm(Terms, Topics)
Validate Results
The result table can be found in the appendix.
We determined by trial and error that 40 topics is the most suitable number for this case; 10, 15 and 20 topics produced far too few distinct topics, which led to text segments sharing the same topic while being totally unrelated.
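A more systematic alternative to this trial and error (not used in this project) would be to compare models over a grid of topic counts with the perplexity function of the topicmodels package; a minimal sketch, assuming the Bothdfm2 matrix from the earlier steps:
# Fit candidate models and compare their perplexity on the training data (lower is better)
library(topicmodels)
ks <- c(10, 15, 20, 40)
perp <- sapply(ks, function(k) {
  fit <- LDA(Bothdfm2, k = k, control = list(seed = 2015))
  perplexity(fit)
})
names(perp) <- ks
perp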
Semantics is an issue: the algorithm cannot link topics to real-world information. For example:
DDMP 1.3.2 Area: char
From this example we know that the word "char" is highly relevant for the Area segment, because "char" is the specific local term for the ground situation (a Bangladeshi term). An untrained responder would not know this, so for this chapter we would have preferred a topic such as "Area". But that term is not frequent enough in the segment, and therefore it will never become the topic under the LDA approach.
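A quick way to see why a term such as "char" dominates a segment is to list that segment's most frequent terms with quanteda's topfeatures function, applied to a plain document-feature matrix before the tm conversion (the document name below is illustrative):
# Rebuild a plain quanteda dfm for inspection and list the ten most frequent terms of one segment
segdfm <- dfm(DDMPJNA)
topfeatures(segdfm["DDMP 1.3.2 Area", ], 10)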
We manually analysed every text segment to see whether the topics from the algorithm matched our understanding of the text. Every topic-segment combination we found useful was marked Yes, all others No. The results are mixed, with a very clear distinction between the JNA and the DDMP: the DDMP has only 19 of 75 topics usefully assigned, while the JNA has 24 of 30. This also means that the results for the total set are somewhat unsatisfactory, with only 43 of 105 topics usefully assigned.
Count of useful topic assignments
Useful        DDMP   JNA   Total
No              56     6      62
Yes             19    24      43
Total           75    30     105
To understand the results better, we gave every "not useful" marking a reason. These reasons and their occurrence counts can be found in the table below. There is a very clear leader among the reasons for incorrect topic assignment: "Topic is not mentioned in the text". This means that we believe the text is really about something other than what the most frequent terms suggest. The LDA algorithm is based solely on the words that appear in the text, and is therefore incapable of suggesting the terms we consider more probable as a topic.
The second most frequent issue is "Table as content", which also occurs in combination with "Numbers as content" and "Picture as content". These are segments whose structure is not recognized by the algorithm. In a table, the header row has the highest probability of being related to the topic, but this additional information is not used: the algorithm weighs all words equally. Numbers are removed in the pre-processing part of the analysis, because they can never be a topic; this, however, leaves some segments with little content, which in turn leads to a wrong topic. The algorithm does not recognize images either, and therefore cannot correctly assign topics to text segments dominated by images.
The third most frequent issue is "Topics are not related to information need", where the content of the text is not related to any information need identified in our previous research. Examples are text segments such as "Shortcoming of assessment".
Finally, the DDMP mentions many region names; their high frequency makes them seem probable topics for a segment, even though they are not the actual subject. We could counter this by removing all region names in the "process segments" step.
Count of reasons for "not useful" assignments
Reason                                                      DDMP   JNA   Total
Topic is not mentioned in text (related information need)    14           14
Table as content                                               9            9
Topics are not related to information need                     9            9
Table as content and Region names                              6            6
Picture as content                                             4     1      5
Region names frequently occur                                  5            5
Cannot reproduce                                                2     2      4
Not related to information need                                3            3
Table and Numbers                                               3            3
Table with many numbers                                         1            1
Numbers as content                                              1            1
Schools are shelters                                            1            1
Total                                                          55     6     61
Next steps
For further extraction of relevant details we propose an intelligent search engine that uses the topic models we created. We now have broad categories into which the documents can be divided, but we cannot extract exact statistics from this analysis, such as the number of people affected. Such statistics would be highly useful to disaster responders.
We could also have used a categorization package such as RTextTools, which can be trained to predict which category a document belongs to. However, we did not have enough data to create both a training and a test set for this specific case. For future research we see the possibility of creating a training set based on Wikipedia articles. This way we can define custom categories such as baseline information, situation overview, and needs of the affected, and train an algorithm to recognize and categorize these texts; a rough sketch follows below.
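A minimal sketch of this idea, based on our understanding of the RTextTools interface (the texts and labels are toy placeholders, and the calls should be checked against the package documentation):
library(RTextTools)
# Toy training data: texts labelled with custom categories
# (1 = baseline information, 2 = situation overview)
texts  <- c("population and infrastructure baseline of the district",
            "baseline data on education and health facilities",
            "current flood extent and displacement of households",
            "situation overview of the affected unions")
labels <- c(1, 1, 2, 2)
mat       <- create_matrix(texts, language = "english", removeStopwords = TRUE)
container <- create_container(mat, labels, trainSize = 1:3, testSize = 4, virgin = FALSE)
model     <- train_model(container, "SVM")
results   <- classify_model(container, model)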
I perceive the field of text mining as a very interesting field, and will continue to explore it in my
professional career.
Conclusion and Research suggestions
For every relevant step in the process we draw conclusions and provide suggestions for improvement.
Splitting
We suggest incorporating a string-similarity function for the split points in the "splitting" step. In our case we got incorrect results because strings from the table of contents did not match the strings in the actual text. This could be countered by applying a string-similarity algorithm, selecting the most similar sentence in the text, and using it as the split point.
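A minimal sketch of this idea with base R's adist edit distance (the function and variable names are ours, not part of the original script):
# For each table-of-contents heading, find the most similar line in the text
# and return that line, to be used as the split point
findSplitPoints <- function(tocHeadings, textLines) {
  sapply(tocHeadings, function(h) {
    d <- adist(tolower(h), tolower(textLines))
    textLines[which.min(d)]
  })
}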
Process segments
The results provided by the topicmodels package are not very intuitive to use. We needed to process the two result objects before the dataset could be analysed effectively.
Validating
Unfortunately the topics we identified were far from a 100% match. This is mostly due to four reasons observed during the validation of our results:
1. Topic incorrect because it is not mentioned in the text
   - This is essentially a disagreement between the authors of this article and the assumptions of the algorithm. The LDA algorithm assumes that the topic of a text is mentioned in the text itself, which is not always the case.
   - Suggestion: develop an algorithm that matches the topics of the text to the related information needs we are trying to find, for example by incorporating information from a dictionary or a thesaurus.
2. Topic incorrect due to a table as content
   - Tables carry very valuable information in their headers, but this is not recognized by the LDA algorithm, which weighs every word equally irrespective of its position, and therefore assigns incorrect topics.
   - Suggestion: develop an algorithm that takes the position of words in a table into account. We know that algorithms exist which assign higher weight to words in certain positions within a sentence; we do not know of any research that incorporates the position of a word in a table.
3. Topic useless because it is not related to an information need
   - The topic is correctly assigned, but it is not applicable for our specific case because it is not related to an "information need" we are interested in.
   - Suggestion: use the topics to filter out the data not required by the disaster responders.
4. Topic incorrect due to incomplete stop-word removal
   - We removed some clearly wrong topic words from the text (disaster, Sirajganj, etc.), but we did not remove all sub-area names. This led to the name of a small town in the region being assigned as the topic.
   - Suggestion: use an iterative approach to remove the useless stop words specific to the text. We partly applied this approach, since later in the process we identified stop words (such as disaster and Sirajganj) for our case and fed them back into the "process segments" step; a sketch of such a domain-specific stop-word list follows below.
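A minimal sketch of extending the stop-word list with region names in the dfm call (the region names shown are illustrative examples, not the list used in this project):
# Hypothetical hand-curated list of sub-district and town names to ignore
regionNames <- c("belkuchi", "kazipur", "shahjadpur")
Bothdfm <- dfm(DDMPJNA,
               ignoredFeatures = c(stopwords("english"), "sirajganj", "sirajgonj",
                                   "district", "upazila", "flood", regionNames),
               stem = F)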
References
Bettina Gruen, Kurt Hornik (2011). topicmodels: An R Package for Fitting Topic Models. Journal of
Statistical Software, 40(13), 1-30. URL http://www.jstatsoft.org/v40/i13/.
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text Mining Infrastructure in R. Journal of Statistical
Software 25(5): 1-54. URL: http://www.jstatsoft.org/v25/i05/.
Kenneth Benoit and Paul Nulty (2015). quanteda: Quantitative Analysis of Textual Data. R package version
0.9.0-1. https://CRAN.R-project.org/package=quanteda
R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. URL https://www.R-project.org/.