A small tutorial on Mallet

  1. Tutorial on MALLET
     Shatakirti MT2011096
  2. MALLET

     Contents
       1 Introduction to MALLET
       2 Where do we use MALLET?
       3 Getting Started
         3.1 Installing MALLET
         3.2 Using the Script
       4 Importing Data files
       5 Natural Language Processing
       6 Document classification
       7 Sequence Tagging
       8 Topic Models

     List of Figures
       1 Natural Language Processing using MALLET
       2 Document classification
       3 Sequence Tagging
  3. 1 Introduction to MALLET

     MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

     2 Where do we use MALLET?

     1. Historical Topics and Trends
        Our aim here is to automatically discover general topics that appear in a large newspaper corpus. MALLET is run over the documents from a period of interest to find the top general topic groups. For example, if we wish to know the top ten topic groups between the years 1965-1901, MALLET is run over the documents from that period. In addition, we can find the topics most strongly associated with a given word, say "iron": we extract 5 lines on each side of every line containing "iron" and run MALLET again on those extracts to find their top general topic groups.

     2. Detect spam mails
        We can use the document classification capabilities of MALLET to detect spam mails. A simple example is the spam classifier you would find in your email inbox. Since we know what good mail looks like and what spam typically looks like, we can train a Naive Bayes classifier to estimate statistically whether or not a new message is spam.

     3. Extract important information
        We can use the sequence tagging functionality that MALLET provides to extract important information from data. By employing named-entity recognition techniques, we can figure out what a document is talking about without having to read through the entire text ourselves. Imagine someone hands you a book and asks you for all the characters and locations featured throughout the text. Using named-entity recognition, a computer can accomplish that task in mere seconds, compared to the hours it would take a human.
  4. 3 Getting Started

     3.1 Installing MALLET

     1. Download the latest version of MALLET.
     2. To build MALLET 2.0, you must have Apache Ant installed.
     3. Set the environment variables pointing to Java Home, Ant Home and Mallet Home (the MALLET directory).
     4. Change to the MALLET directory and type: ant
        Example:

          C:\Users\VAIO\WebIR\mallet-2.0.7>ant

        If ant finishes with "BUILD SUCCESSFUL", MALLET is ready to use.

     3.2 Using the Script

     MALLET is driven from the command line by a script named mallet. If you installed MALLET in the directory WebIR\mallet-2.0.7, this script is in WebIR\mallet-2.0.7\bin. If the current working directory is the MALLET directory, you can use the script in this pattern:

          bin\mallet [command] --option value --option value ...

     Type bin\mallet to get a list of commands. Add the option --help to any command to get a description of its valid options.

     4 Importing Data files

     To import a data file, use the command:

          bin\mallet import-file --input [filename] --output [output filename] [options]
  5. Similarly, to import an entire directory, use:

          bin\mallet import-dir --input [dir path] --output [output filename] [options]

     For example:

          bin\mallet import-file --input sample-data\web\en\hill.txt --output output.mallet

     In the above example, the input data is hill.txt and the imported data is written to the output.mallet file. Stopwords are removed only if the --remove-stopwords option is also given.

          bin\mallet import-dir --input sample-data\web\* --output output.mallet

     In the above example, the input data is the set of folders inside the web folder and the imported data is written to the output.mallet file. For more options, use the help by typing:

          bin\mallet import-file --help
          bin\mallet import-dir --help

     5 Natural Language Processing

     MALLET includes routines for transforming text documents into numerical representations that can then be processed efficiently. This process is implemented through a flexible system called "pipes", which handle distinct tasks such as tokenizing strings, removing stopwords, and converting sequences into count vectors. MALLET reads Unicode files, so we can use files in various languages and give MALLET rules for processing the data. We can use regular expressions to tokenize word segments in any language. For example, if we type:

          bin\mallet import-file --input sample-data\web\en\hill.txt --output output.mallet --print-output --remove-stopwords

     MALLET removes the stopwords, prints the result, and also writes it to the output.mallet file. A sample output with and without removing stopwords is shown below:

  6. [Figure 1: Natural Language Processing using MALLET. (a) Without removing stopwords. (b) Removing stopwords.]

     The above figure shows MALLET's support for the English language. In this snapshot, a simple text file "hill.txt" written in English is imported. The words are numbered and the number of occurrences of each is also shown. MALLET recognizes the stopwords, and they can be kept in or left out of the output file as the user requires. Currently, Chinese and Japanese are the only languages whose text MALLET does not support.

     6 Document classification

     A classifier is an algorithm that distinguishes between a fixed set of classes, such as "spam" vs. "non-spam", based on some previous training (note that MALLET is also a machine learning tool). MALLET includes implementations of several classification algorithms, among them Naive Bayes, Maximum Entropy, and Decision Trees.

     To get started with the document classifier, first load the data into MALLET format. Then follow these steps:
  7. 1. Train the classifier:
        Suppose you have a MALLET data file called train.mallet. Use the command:

          bin\mallet train-classifier --input train.mallet --output-classifier my.classifier

     2. Choose the algorithm:
        The default classification algorithm is Naive Bayes. To select a different algorithm, use the --trainer option. For example, to use the MaxEnt algorithm, use the following command:

          bin\mallet train-classifier --input train.mallet --output-classifier my.classifier --trainer MaxEnt

        You can also try NaiveBayes, C45, and DecisionTree. To compare multiple training algorithms, give --trainer more than once:

          bin\mallet train-classifier --input labeled.mallet --training-portion 0.9 --trainer MaxEnt --trainer NaiveBayes

        This command will compare the MaxEnt and the NaiveBayes algorithms.

     3. Evaluation:
        If we wish to know whether the classifier produces good results on data not used in training, we can split a single set of instances into training and testing lists. For this purpose, you can use a command like:

          bin\mallet train-classifier --input labeled.mallet --training-portion 0.9

        This command randomly splits the data into 90% training instances, which are used to train the classifier, and 10% testing instances. MALLET uses the classifier to predict the class labels of the testing instances, compares those to the true labels, and reports the results. You can also try the other training options listed in the MALLET help.
  8. For example, you can try the following command:

          bin\mallet train-classifier --input web.mallet --trainer MaxEnt --trainer NaiveBayes --training-portion 0.9 --num-trials 10

     This command runs 10 trials, in each of which the input data is randomly split into 90% training instances and 10% testing instances. For each trial, MALLET trains a MaxEnt classifier and a Naive Bayes classifier on the training instances, then prints accuracy results and a matrix of correct versus predicted labels for each classifier. An illustration is shown on the next page.
  9. [Figure 2: Document classification, panels (a) and (b)]
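The evaluation procedure from Section 6 (repeated random 90/10 splits, with accuracy and a matrix of true versus predicted labels) can be sketched in a few lines of Python. This is only an illustration of the idea, not MALLET's implementation: the tiny add-one-smoothed Naive Bayes, the function names, and the toy "spam"/"ham" documents are all assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs is a list of (tokens, label). Returns {label: (log_prior, log_likelihoods)}."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n_docs in label_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Add-one (Laplace) smoothing so unseen words never get probability zero.
        log_like = {w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
                    for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict(model, tokens):
    """Return the label with the highest log score; words outside the vocabulary are skipped."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[w] for w in tokens if w in log_like)
    return max(sorted(model), key=score)

def run_trials(docs, training_portion=0.9, num_trials=10, seed=0):
    """Mimics the idea behind --training-portion and --num-trials (hypothetical helper)."""
    rng = random.Random(seed)
    accuracies, confusion = [], Counter()
    for _ in range(num_trials):
        shuffled = docs[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * training_portion)
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_naive_bayes(train)
        correct = 0
        for tokens, label in test:
            guess = predict(model, tokens)
            confusion[(label, guess)] += 1  # key is (true label, predicted label)
            correct += guess == label
        accuracies.append(correct / len(test))
    return accuracies, confusion

# Toy corpus invented for this sketch: 10 "spam" and 10 "ham" documents.
spam_words = ["win", "cash", "prize", "free", "offer"]
ham_words = ["meeting", "report", "lunch", "project", "notes"]
docs = []
for i in range(10):
    docs.append(([spam_words[i % 5], spam_words[(i + 1) % 5], "the"], "spam"))
    docs.append(([ham_words[i % 5], ham_words[(i + 1) % 5], "the"], "ham"))

accuracies, confusion = run_trials(docs)
print("mean accuracy:", sum(accuracies) / len(accuracies))
for (true_label, predicted), count in sorted(confusion.items()):
    print(f"true={true_label} predicted={predicted}: {count}")
```

Averaging over several random splits, as --num-trials does, gives a steadier estimate than a single split, because any one 10% test set may happen to be unusually easy or hard.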
  10. 7 Sequence Tagging

     Sometimes the data is a long sequence of items, each of which needs its own label; a large gene database is one example. MALLET includes implementations of widely used sequence algorithms, including hidden Markov models (HMMs) and linear-chain conditional random fields (CRFs). These algorithms support applications such as gene finding and named-entity recognition.

     Simple Tagger

     SimpleTagger is a command-line interface to the MALLET CRF class. To use it, each line in the input file should represent one token, in this format:

          feature1 feature2 ... featuren label

     For example, write the following in a file named "sample" and put it in the MALLET directory:

          Kirti CAPITALIZED noun
          slept non-noun
          here LOWERCASE STOPWORD non-noun

     To train the CRF, use the following command while in the MALLET directory:

          java -cp class;lib\mallet-deps.jar cc.mallet.fst.SimpleTagger --train true --model-file nouncrf sample

     This command trains the CRF. The --train true option specifies that this is a training run, and --model-file names the file in which the trained CRF is saved. Here the CRF file nouncrf is created in the MALLET directory itself, but we can specify another location if that is more convenient.
  11. [Figure 3: Sequence Tagging, panels (a) and (b)]
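To make the SimpleTagger input format concrete, here is a small Python sketch that reads lines of the form "feature1 feature2 ... featuren label" into (features, label) pairs. It is not part of MALLET. The function name is hypothetical, and treating a blank line as the boundary between two sequences is an assumption of this sketch.

```python
def parse_tagger_lines(text, labeled=True):
    """Split SimpleTagger-style text into sequences of (features, label) pairs.

    In labeled data the last field on each line is the label. In unlabeled
    data every field is a feature and the label is None.
    Assumption: a blank line ends the current sequence.
    """
    sequences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            if current:
                sequences.append(current)
                current = []
            continue
        if labeled:
            current.append((fields[:-1], fields[-1]))
        else:
            current.append((fields, None))
    if current:
        sequences.append(current)
    return sequences

# The training file "sample" from the tutorial.
sample = """Kirti CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun
"""

training = parse_tagger_lines(sample)
for features, label in training[0]:
    print(label, features)
```

Note that a token may carry any number of features, so only the position of the label (last on the line) is fixed.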
  12. Now that we have trained the CRF, we can put it to the test by creating a new file called "test". Inside this file, we write:

          CAPITAL Al
          slept
          here

     To have this file labelled, we apply the CRF saved in nouncrf by typing:

          java -cp class;lib\mallet-deps.jar cc.mallet.fst.SimpleTagger --model-file nouncrf test

     which produces the following output:

          Number of predicates: 5
          noun CAPITAL Al
          non-noun slept
          non-noun here

     8 Topic Models

     Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.

     The first step in building a topic model is to import a set of documents. Suppose we want to import the files in the folder "en". Type the command:

          bin\mallet import-dir --input sample-data\web\en --output output.mallet --keep-sequence --remove-stopwords

     This command removes all the stopwords, keeps each document as a sequence of words rather than a vector of word counts (topic training requires this), and writes the result to the file "output.mallet" in the MALLET directory.
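What the import step does to each document (tokenize, drop stopwords, then keep either the word sequence or a count vector) can be sketched in Python as follows. This is a rough illustration of the pipe idea from Section 5, not MALLET's code: the token pattern, the lowercasing, the five-word stopword list, and the function name are all assumptions chosen for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; MALLET ships its own, much longer one.
STOPWORDS = {"the", "of", "and", "in", "is"}

# Assumed token pattern: runs of Unicode letters (no digits or underscores).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+")

def import_document(text, remove_stopwords=True, keep_sequence=True):
    """Apply the pipe steps in order: tokenize, lowercase, filter, then shape the output."""
    tokens = [t.lower() for t in TOKEN_PATTERN.findall(text)]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    # keep_sequence=True preserves word order; otherwise collapse to counts.
    return tokens if keep_sequence else Counter(tokens)

text = "The hill is in the north of the town and the hill is green"
sequence = import_document(text)
counts = import_document(text, keep_sequence=False)
print(sequence)
print(counts)
```

A classifier can work from the count vector alone, whereas a topic model assigns a topic to each individual word occurrence, which is why the sequence form is kept for topic modeling.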
  13. Now, type in the command:

          bin\mallet train-topics --input output.mallet --num-topics 100 --output-state topic-state.gz

     Here --num-topics [NUMBER] sets the number of topics to use. The larger the number, the more fine-grained the resulting topics. The --output-state option writes a compressed text file containing every word in the corpus together with its topic assignment. This file format can easily be parsed and used by non-Java-based software. Note that the state file will be gzipped, so it is helpful to provide a filename that ends in .gz.
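As a sketch of how non-Java software might read the state file, the Python below counts the most frequent words per topic. It assumes each data line holds six whitespace-separated fields (document, source, position, type index, word, topic) and that header lines start with "#". Check this layout against a real topic-state.gz before relying on it. The function name and the small demo file are made up for the example.

```python
import gzip
import os
import tempfile
from collections import Counter, defaultdict

def top_words_per_topic(state_path, n=3):
    """Return {topic: [up to n most frequent words]} from a gzipped state file.

    Assumed line layout: doc source pos typeindex word topic
    Lines starting with '#' are treated as headers and skipped.
    """
    counts = defaultdict(Counter)
    with gzip.open(state_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            _doc, _source, _pos, _type_index, word, topic = line.split()
            counts[int(topic)][word] += 1
    return {topic: [w for w, _ in c.most_common(n)] for topic, c in counts.items()}

# Build a tiny stand-in state file so the sketch runs without MALLET.
demo_lines = [
    "#doc source pos typeindex type topic",
    "0 NA 0 0 iron 1",
    "0 NA 1 1 steel 1",
    "0 NA 2 0 iron 1",
    "1 NA 0 2 vote 0",
    "1 NA 1 2 vote 0",
    "1 NA 2 3 election 0",
]
demo_path = os.path.join(tempfile.mkdtemp(), "topic-state.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write("\n".join(demo_lines) + "\n")

result = top_words_per_topic(demo_path)
for topic in sorted(result):
    print(topic, result[topic])
```

To use it on real output, pass the path given to --output-state instead of demo_path.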