Technical documents contain a fair amount of unnatural language, such as tables, formulas, and pseudo-code. Unnatural language can be a significant source of confusion for existing NLP tools. This paper presents an effective method for distinguishing unnatural language from natural language, and evaluates the impact of unnatural language detection on NLP tasks such as document clustering. We view this problem as an information extraction task and build a multiclass classification model that identifies unnatural language components as one of four categories. First, we create a new annotated corpus by collecting slides and papers in various formats (PPT, PDF, and HTML) in which unnatural language components are annotated into four categories. We then explore features available from plain text to build a statistical model that can handle any format, as long as it is converted into plain text. Our experiments show that removing unnatural language components improves document clustering by up to 15% absolute. Our corpus and tool are publicly available.
Improving Document Clustering by Eliminating Unnatural Language
Myungha Jang¹, Jinho D. Choi², James Allan¹
¹ College of Information and Computer Sciences, Center for Intelligent Information Retrieval, University of Massachusetts Amherst
² Mathematics and Computer Science, Emory University
Motivation
Features
Dataset
Classification Results
Conclusion
• Existing NLP tools frequently fail on technical documents (e.g., lecture slides in math/engineering, academic papers) because they contain many “unnatural language” blocks.
• We initially aimed to build an application that clusters CS lecture slides by topic. Unlike other text documents, NLP techniques such as clustering or similarity functions fail on technical documents because they contain “non-text” components such as code, formulas, tables, and diagrams.
• In particular, when technical documents in PPT and PDF are converted to raw text, these components lose their semantics along with their form. In a bag-of-words representation, they are merely treated as noise.
Two code snippets may be semantically different yet share much of the same vocabulary.
A math formula loses its original format when converted to raw text; the vocabulary alone carries no meaning without human intelligence to interpret it.
Diagrams and tables, when converted to raw text, lose their layout and meaning.
• N-gram Features (N): Unigrams, bigrams, and trigrams of each line.
• Parsing Features (P): Unnatural language is unlikely to form a valid grammatical structure. When we attempt to parse an unnatural language line, the resulting parse tree exhibits an unusual syntactic structure.
• Table String Layout (T): While text extracted from a table loses its visual layout, extracted tables tend to convey the same type of data along the same column or row. We encode each line by replacing each token with either S (String) or N (Numeral) and compute the edit distance among neighboring lines.
• Word Embedding Features (E): We train a word embedding model on the technical document corpus and generate an embedding for each line by averaging the embeddings of its words.
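The Table String Layout feature can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization; the function names are illustrative, not taken from the paper's code.

```python
# Sketch of the Table String Layout (T) feature: encode each token as
# S (String) or N (Numeral), then compare neighboring lines by edit distance.

def encode_line(line):
    """Map each token to 'N' if it parses as a number, else 'S'."""
    def tag(tok):
        try:
            float(tok.replace(",", ""))
            return "N"
        except ValueError:
            return "S"
    return "".join(tag(tok) for tok in line.split())

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

# Neighboring table rows tend to share the same S/N layout,
# so their edit distance is small.
row1 = encode_line("Alice 23 91.5")   # -> "SNN"
row2 = encode_line("Bob 31 88.0")     # -> "SNN"
```

A small edit distance between the encodings of adjacent lines is evidence that those lines belong to the same table block.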
Problem Definition
The input to our task is the plain text extracted from PDF or PPT documents. The goal is to assign a class label to each line of that plain text, identifying it as natural language (regular text) or one of the four types of unnatural language block components: table, code, formula, or miscellaneous text.
[Pipeline figure: *.PDF/*.PPT documents are converted via text extraction to *.txt, then component identification labels each line as Table, Text, Code, or Formula.]
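The input/output contract of the task can be sketched as below. This is a minimal illustration only; `classify_line` is a trivial stand-in for the statistical model, not the paper's classifier, and the identifier names are our own.

```python
# Minimal sketch of the per-line labeling task. The five labels come from
# the problem definition; classify_line is a placeholder heuristic, not
# the actual model (which uses N-gram, parsing, table-layout, embedding,
# and sequential features).

LABELS = {"text", "table", "code", "formula", "misc"}

def classify_line(line):
    """Placeholder: assign 'text' to non-empty lines, 'misc' otherwise."""
    return "text" if line.strip() else "misc"

def label_document(plain_text):
    """Assign one label per line of the extracted plain text."""
    return [(line, classify_line(line)) for line in plain_text.splitlines()]
```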
• Sequential Features (S): The sequential nature of the lines is also an important signal, because a component most likely spans a block of contiguous lines.
The Effect of Unnatural Language on Document Clustering
• Unnatural language should be distinguished from natural language in technical documents for NLP tools to work effectively.
• We presented an approach to identify four types of unnatural language blocks from plain text, independent of the document format.
• Our method extracts five sets of line-based textual features and achieved an F1 score above 82% for the four categories of unnatural language.
• We demonstrated that removing unnatural language improved document clustering in many settings, by up to 15% and 11% in the best cases.
We use the Liblinear SVM to run 5-fold cross-validation for evaluation. To improve the robustness of structured prediction, we combine the SVM with a learning-to-search algorithm known as DAgger.
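The evaluation setup can be sketched with scikit-learn's liblinear-backed `LinearSVC`. This is a hedged sketch only: the feature extraction and the DAgger refinement are omitted, and the toy feature matrix below is illustrative, not the paper's data.

```python
# Sketch of 5-fold cross-validation with a liblinear SVM, using synthetic
# stand-in data in place of the paper's line-based feature vectors.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # stand-in per-line feature vectors
y = (X[:, 0] > 0).astype(int)        # stand-in binary labels

clf = LinearSVC()                    # liblinear solver under the hood
scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold
```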
Datasets used in our paper. All data are available for download at http://cs.umass.edu/~mhjang/publications.html
• We prepared two lecture-slide datasets collected from the Web: D_OS (lecture slides on Operating Systems) and D_DSA (lecture slides on Data Structures and Algorithms).
• We applied seeded K-means clustering to the lecture slide data, where gold-standard clusters were manually curated, to compare clustering performance with and without unnatural language components. We experiment with three types of seeding: for each final cluster, (1) topic keywords are given as seeds, (2) the top-1 document is given as a seed, or (3) several relevant documents selected by users are given as seeds.
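Document-based seeding can be sketched with scikit-learn's `KMeans`, whose `init` parameter accepts explicit starting centroids. This is a minimal illustration under assumed choices (TF-IDF document vectors, arbitrary toy documents and seed indices), not the paper's experimental code.

```python
# Sketch of seeded K-means: seed documents' vectors initialize the centroids.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "process scheduling and context switch",
    "preemptive process scheduling priority",
    "binary search tree insertion",
    "tree rotation and balancing",
]
X = TfidfVectorizer().fit_transform(docs).toarray()

# One seed document per target cluster; its vector becomes the initial centroid.
seed_idx = [0, 2]
init = X[seed_idx]
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
```

Keyword seeding works analogously: a pseudo-document built from the topic keywords is vectorized and used as the initial centroid.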