Improving Document Clustering by Removing Unnatural Language
Myungha Jang 1, Jinho D. Choi 2, and James Allan 1
1 College of Information and Computer Sciences, Center for Intelligent Information Retrieval, University of Massachusetts Amherst
2 Mathematics and Computer Science, Emory University
Motivation
• Existing NLP tools frequently fail on technical documents (e.g., lecture slides in math/engineering, academic papers) because these documents contain many “unnatural language blocks”.
• We initially aimed to build an application that clusters CS lecture slides by topic. Unlike other text documents, technical documents break NLP techniques such as clustering and similarity functions because they contain “non-text” components such as code, formulas, tables, and diagrams.
• In particular, when technical documents in PPT or PDF are converted to raw text, these components lose their semantics because they lose their form. In a bag-of-words representation, they are nothing but noise.
Two code snippets can be semantically different yet share much of the same vocabulary.
A math formula loses its original format when converted to raw text, and its vocabulary alone means nothing without human intelligence to interpret it.
Diagrams and tables, when converted to raw text, lose their layout and meaning.
• N-gram Features (N): Unigrams, bigrams, and trigrams of each line.
• Parsing Features (P): Unnatural language is unlikely to form a grammatical structure; when we attempt to parse an unnatural language line, the resulting parse tree shows an unusual syntactic structure.
• Table String Layout (T): While text extracted from tables loses its visual table layout, extracted tables still tend to convey the same type of data along each column or row. We encode each line by replacing each token with either S (string) or N (numeral), and compute the edit distance between neighboring lines (see the first sketch after this list).
• Word Embedding Features (E): We train a word embedding model on the technical document corpus and generate an embedding for each line by averaging the embeddings of its words (see the second sketch after this list).
• Sequential Features (S): The sequence of lines is also an important signal, because each component most likely spans a block of contiguous lines.
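To make the N-gram and table-layout features concrete, here is a minimal sketch; all function names and the numeral test are ours, not from the paper:

```python
# Sketch of the N-gram (N) and table-string-layout (T) features.
# All names and details here are illustrative assumptions.

def ngram_features(line, n_values=(1, 2, 3)):
    """Unigrams, bigrams, and trigrams of a line's tokens."""
    tokens = line.split()
    feats = []
    for n in n_values:
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def sn_encode(line):
    """Encode each token as S (string) or N (numeral), e.g. 'size 42 kb' -> 'SNS'."""
    return "".join("N" if tok.replace(".", "", 1).isdigit() else "S"
                   for tok in line.split())

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def table_layout_feature(lines, i):
    """Edit distance between the S/N encodings of line i and its neighbors;
    a small distance suggests rows of a table with a repeated column layout."""
    here = sn_encode(lines[i])
    neighbors = [lines[j] for j in (i - 1, i + 1) if 0 <= j < len(lines)]
    return min(edit_distance(here, sn_encode(n)) for n in neighbors)
```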
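And a sketch of the line-embedding feature, assuming a gensim Word2Vec model trained on the technical-document corpus; the toy corpus, dimensionality, and zero-vector fallback are our assumptions:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy stand-in for the model trained on the technical-document corpus.
corpus = [["binary", "search", "tree"], ["process", "scheduling", "queue"]]
model = Word2Vec(corpus, vector_size=50, min_count=1)

def line_embedding(line, model):
    """Average the embeddings of the line's in-vocabulary words;
    zero vector if no word is known (our fallback assumption)."""
    vecs = [model.wv[w] for w in line.split() if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

print(line_embedding("binary search tree", model))
```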
Problem Definition
Input to our task is the plain text extracted from PDF or PPT documents. The goal is to assign a class label to each line of that plain text, identifying it as natural language (regular text) or as one of four types of unnatural language block components: table, code, formula, or miscellaneous text.
[Pipeline figure: *.PDF, *.PPT → text extraction → *.txt → component identification → Text / Table / Code / Formula]
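As a sketch, the task boils down to the following per-line interface; the label set is from the problem definition above, while the code itself is illustrative:

```python
from enum import Enum
from typing import Callable, List

class Label(Enum):
    TEXT = "natural language"       # regular text
    TABLE = "table"
    CODE = "code"
    FORMULA = "formula"
    MISC = "miscellaneous text"

def classify_lines(lines: List[str],
                   predict: Callable[[str], Label]) -> List[Label]:
    """Assign one label to each line of the extracted plain text."""
    return [predict(line) for line in lines]
```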
Conclusion
• Unnatural language should be distinguished from natural language in technical documents for NLP tools to work effectively.
• We presented an approach that identifies four types of unnatural language blocks in plain text and does not depend on the document format.
• Our method extracts five sets of line-based textual features and achieved an F1 score above 82% for the four categories of unnatural language.
• We demonstrated that removing unnatural language improves document clustering in many settings, with gains of up to 15% and 11%.
Classification Results
We use the LIBLINEAR SVM with 5-fold cross-validation for evaluation. To make the structured prediction more robust, we apply DAGGER, a learning-to-search algorithm, on top of the SVM.
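A minimal sketch of this evaluation setup using scikit-learn's LIBLINEAR-backed LinearSVC; the feature and label arrays are random placeholders, and the DAGGER refinement step is omitted:

```python
import numpy as np
from sklearn.svm import LinearSVC                 # scikit-learn wraps LIBLINEAR
from sklearn.model_selection import cross_val_score

# Placeholders: one feature vector per line (N, P, T, E, S concatenated)
# and one gold label per line (5 classes). Real data would replace these.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 5, 200)

clf = LinearSVC()
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print("5-fold macro-F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```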
Dataset
Datasets used in our paper. All data are available for download at http://cs.umass.edu/~mhjang/publications.html
• We prepared two lecture slide datasets collected from the Web: D_OS (lecture slides on Operating Systems) and D_DSA (lecture slides on Data Structures and Algorithms).
The Effect of Unnatural Language on Document Clustering
• We applied seeded K-means clustering to the lecture slide data, where gold-standard clusters were manually curated, to compare clustering performance with and without unnatural language components. We experiment with three types of seeding: for each final cluster, (1) topic keywords are given as the seed, (2) the top-1 document is given as the seed, or (3) several relevant documents selected by users are given as seeds. A sketch of the document-seeded variant follows.
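A sketch of seeding type (2), assuming TF-IDF document vectors; scikit-learn's KMeans accepts an explicit array of initial centroids, which is how the seeds enter. The toy documents and seeds are ours, not from the datasets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins for the lecture-slide corpora; in the paper the documents
# are lecture slides and the seeds come from the gold-standard clusters.
docs = ["process scheduling and context switch",
        "binary search tree rotation",
        "virtual memory page replacement",
        "graph traversal breadth first search"]
seed_docs = ["cpu scheduling", "tree and graph algorithms"]  # one per cluster

vec = TfidfVectorizer()
X = vec.fit_transform(docs + seed_docs)
X_docs, X_seeds = X[:len(docs)], X[len(docs):]

# Seeded K-means: initialize each centroid from a seed, then run K-means.
km = KMeans(n_clusters=len(seed_docs), init=X_seeds.toarray(), n_init=1)
labels = km.fit_predict(X_docs)
print(labels)
```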