Document Clustering
By:
Ankur Shrivastava
Ritesh Modi
Vinayak Bharti
Introduction
• A document clustering scheme aims to minimize intra-cluster (within-cluster)
distances and maximize inter-cluster (between-cluster) distances.
• Given a heterogeneous data set, clustering is performed on relevant
features.
• The document clusters are represented in different visual forms as per
requirements.
Block Diagram
• Raw corpus (heterogeneous documents): text data extraction from
multimedia documents
• Homogeneous data: documents in plain-text format
• Preprocessing: removal of stop words from the documents, and stemming
• Feature extraction: relevant features of the documents
• Document clustering: the clustered documents
Part 1: Conversion to Homogeneous form
The heterogeneous data is converted into plain text using Apache Tika.
Tika provides a number of different ways to parse a file, offering
different levels of control, flexibility, and complexity.
• Parsing: the AutoDetectParser automatically identifies the type of
content, such as a PDF or HTML file, and parses it with the appropriate
parser.
• Plain-text conversion: a function returns the content of the document's
body as a plain-text string.
Together, these steps produce a plain-text file, as sketched below.
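As a rough sketch, the whole conversion can be driven through Tika's facade
class, which wraps the AutoDetectParser; the input file name below is only
illustrative:

    import java.io.File;
    import org.apache.tika.Tika;

    public class ToPlainText {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika(); // facade over AutoDetectParser
            // Detects the content type (PDF, HTML, ...) and returns the
            // document body as a plain-text string.
            String text = tika.parseToString(new File("report.pdf"));
            System.out.println(text);
        }
    }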
Part 2: Feature Extraction
Apache UIMA (Unstructured Information Management Architecture) and the
Stanford NLP library are used to extract the following features from the
text files:
• Unigrams, bigrams, trigrams: an n-gram is a contiguous sequence of n
words; n-grams of sizes 1, 2, and 3 are extracted from the corpus (see
the sketch after this list).
• Punctuation: number of punctuation marks in the text.
• Capitals: number of words written entirely in capital letters.
• #Sentences: number of sentences in the text.
Preprocessing: stop-word removal and stemming (Porter stemmer), as
required.
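A minimal sketch of the n-gram extraction described above, over an already
tokenized document (the token list here is illustrative):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class NGrams {
        // All contiguous n-grams of the given size from a token sequence.
        static List<String> ngrams(List<String> tokens, int n) {
            List<String> out = new ArrayList<>();
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
            return out;
        }

        public static void main(String[] args) {
            List<String> tokens =
                Arrays.asList("document", "clustering", "on", "heterogeneous", "data");
            for (int n = 1; n <= 3; n++) { // unigrams, bigrams, trigrams
                System.out.println(n + "-grams: " + ngrams(tokens, n));
            }
        }
    }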
Part 2: Feature Extraction (continued)
• Part-of-speech (POS) tagging: identification of words as nouns, verbs,
adjectives, adverbs, etc. The Stanford POS Tagger is used and a count of
POS tags is maintained.
• Named entities: identification of named entities such as Person,
Location, or Organization. The Stanford NER is used and a count of named
entities is maintained (see the sketch below).
• Positive and negative words: count of positive and negative words in
the text.
• URLs: count of URLs in the text.
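A minimal sketch of the POS and named-entity counting with the Stanford
CoreNLP pipeline (the annotator set and example sentence are assumptions):

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    public class PosNerCounts {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,ner");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            CoreDocument doc = new CoreDocument("Barack Obama visited Paris in May.");
            pipeline.annotate(doc);

            Map<String, Integer> posCounts = new HashMap<>();
            Map<String, Integer> nerCounts = new HashMap<>();
            for (CoreLabel tok : doc.tokens()) {
                posCounts.merge(tok.tag(), 1, Integer::sum); // POS tag counts
                if (!"O".equals(tok.ner())) {                // skip non-entities
                    nerCounts.merge(tok.ner(), 1, Integer::sum);
                }
            }
            System.out.println("POS: " + posCounts);
            System.out.println("NER: " + nerCounts);
        }
    }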
Part 3: Clustering
K-means clustering is performed on the feature space using Weka, as
sketched below.
• Documents are assigned to clusters based on Euclidean distance to the
cluster means.
• The algorithm automatically normalizes numerical attributes when
computing distances.
• The input documents are stored in folders named after their cluster
number.
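A minimal sketch of this step with Weka's SimpleKMeans; the ARFF file name
and cluster count are assumptions:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClusterDocs {
        public static void main(String[] args) throws Exception {
            // features.arff is an assumed file holding one feature vector per document.
            Instances data = DataSource.read("features.arff");

            SimpleKMeans kmeans = new SimpleKMeans();
            kmeans.setNumClusters(5); // assumed k
            // Uses Euclidean distance; Weka normalizes numeric attributes by default.
            kmeans.buildClusterer(data);

            for (int i = 0; i < data.numInstances(); i++) {
                int cluster = kmeans.clusterInstance(data.instance(i));
                System.out.println("document " + i + " -> cluster " + cluster);
            }
        }
    }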
Thank You
