RapidMiner5 2.9 - Word vector tool and RapidMiner
Word Vector tool The Word & Web Vector Tool is a flexible Java library for statistical language modeling and for integrating Web and Web-service-based data sources. It supports the creation of word vector representations of text documents in the vector space model, which is the point of departure for many text processing applications.
Installation 1. Download the archive from the WVTool SourceForge website.
Installation 2. Put it into the lib/plugins directory of your RapidMiner installation, for example: D:\Program Files\Rapid-I\RapidMiner5\lib\plugins
Word Vector tool The aim of the WVTool is to provide a pure Java library for text and web mining that is simple to use and simple to extend. It can easily be invoked from any Java application.
Word Vector tool WVTool bridges the gap between highly sophisticated linguistic packages, such as the GATE system, on the one side and the many partial solutions that are part of diverse text and information retrieval applications on the other side.
Word List A word list contains all terms used for vectorization together with some statistics (e.g. in how many documents a term appears). The word list is needed for vectorization to define which terms are considered as dimensions of the vector space and for weighting purposes.
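As an illustration of what a word list holds, here is a minimal, library-independent sketch in Python. It does not use the WVTool API; the function names `tokenize` and `build_word_list` and the simple lowercase tokenization are assumptions made for this example only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens (assumed tokenization)."""
    return re.findall(r"[a-z]+", text.lower())


def build_word_list(documents):
    """Map each term to the number of documents in which it appears."""
    doc_freq = Counter()
    for text in documents:
        # A set ensures each term is counted at most once per document.
        doc_freq.update(set(tokenize(text)))
    return dict(doc_freq)


docs = ["the cat sat", "the dog sat down", "a cat and a dog"]
word_list = build_word_list(docs)
```

The keys of `word_list` are the dimensions of the vector space, and the stored document counts are the kind of statistic a weighting scheme can draw on.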
WVTool functions WVTool function inputs: (1) an input list that tells the system which text documents to process; (2) a configuration object that tells the system which methods to use in the individual steps.
Defining the input The input list tells the WVTool which texts should be processed. Every item in the list contains the following information: a URI; the language the document is written in (optional); the type of the document (optional); the character encoding of the document, e.g. UTF-8 (optional); a class label.
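The fields of an input item can be pictured as a small record. The sketch below is a Python illustration only, not the WVTool classes: the name `InputItem`, its field names, the string-valued class labels, and the file URIs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InputItem:
    """One entry of the input list (hypothetical structure for illustration)."""
    uri: str                          # where the document can be read from
    class_label: str                  # e.g. "positive" or "negative"
    language: Optional[str] = None    # optional
    doc_type: Optional[str] = None    # optional
    encoding: Optional[str] = None    # optional, e.g. "UTF-8"


# Hypothetical example entries; the paths are made up.
input_list = [
    InputItem(uri="file:///data/reviews/good_01.txt", class_label="positive",
              language="english", encoding="UTF-8"),
    InputItem(uri="file:///data/reviews/bad_01.txt", class_label="negative"),
]
```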
Using Predefined Word Lists In some cases it is necessary to define the dimensions of the vector space exactly, while leaving the counting of terms and documents to the WVTool. This can be achieved by calling the word list creation function with a list of String values.
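The idea of fixing the dimensions up front while still counting from the corpus can be sketched as follows. This is a Python illustration under assumptions, not the WVTool word list creation function: `count_with_predefined_terms` and its tokenization are invented for this example.

```python
import re
from collections import Counter


def count_with_predefined_terms(documents, terms):
    """Count term and document occurrences, restricted to the given terms."""
    allowed = set(terms)
    # Start every predefined term at zero so it stays a dimension even if unseen.
    term_counts = Counter({t: 0 for t in terms})
    doc_counts = Counter({t: 0 for t in terms})
    for text in documents:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t in allowed]
        term_counts.update(tokens)
        doc_counts.update(set(tokens))
    return dict(term_counts), dict(doc_counts)


docs = ["the cat sat on the cat mat", "the dog sat"]
term_counts, doc_counts = count_with_predefined_terms(docs, ["cat", "dog", "bird"])
```

Terms outside the predefined list (such as "sat") are ignored, and a predefined term that never occurs (such as "bird") remains a dimension with a count of zero.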
Text Input The TextInput operator creates an ExampleSet from a collection of texts. The output ExampleSet contains one row for each text document and one column for each term.
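The shape of such an output can be sketched in a few lines of Python. This is not the TextInput operator itself: the function name `to_example_set` is hypothetical, and raw term counts are assumed as cell values because the weighting scheme is not stated here.

```python
import re


def to_example_set(documents):
    """Return (terms, rows): one row per document, one column per term."""
    tokenized = [re.findall(r"[a-z]+", text.lower()) for text in documents]
    # Columns: every distinct term in the collection, in sorted order.
    terms = sorted({tok for toks in tokenized for tok in toks})
    # Cells: how often the column's term occurs in the row's document.
    rows = [[toks.count(term) for term in terms] for toks in tokenized]
    return terms, rows


terms, rows = to_example_set(["good good movie", "bad movie"])
```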
Text Classification, Clustering and Visualization For text classification, the class labels (e.g. positive, negative) are defined in the TextInput operator, as described above. After clustering or dimensionality reduction, text documents can be visualized directly in the RapidMiner Visualization panel.
Creating and Maintaining Word Lists Creating an Initial Word List: An initial word list can be created by using the following chain of operators:
Creating and Maintaining Word Lists Applying a Word List: You can apply a word list in two ways: 1. To use the actual weights, first create word vectors using the TextInput operator and then use the AttributeWeightsLoader and AttributeWeightsApplier on the resulting ExampleSet.
Creating and Maintaining Word Lists Applying a Word List: 2. To use the word list only as a selection of relevant terms and leave the actual weighting to the TextInput operator, use the AttributeWeightsLoader before the TextInput operator. The TextInput operator will then create vectors whose dimensions are only those terms in the word list that have a weight larger than zero.
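The difference between the two ways can be sketched in Python. This is a conceptual illustration, not the RapidMiner operators: the function names are hypothetical, and it is an assumption here that applying weights means multiplying each term value by its word-list weight.

```python
def apply_weights(vector, weights):
    """Way 1: scale existing term values by the word-list weights (assumed: multiply)."""
    return {term: value * weights.get(term, 0.0) for term, value in vector.items()}


def select_terms(tokens, weights):
    """Way 2: keep only tokens with a weight larger than zero, then count them."""
    counts = {}
    for tok in tokens:
        if weights.get(tok, 0.0) > 0:
            counts[tok] = counts.get(tok, 0) + 1
    return counts


weights = {"good": 1.0, "bad": 0.5, "movie": 0.0}
weighted = apply_weights({"good": 2, "movie": 1}, weights)
selected = select_terms(["good", "good", "movie", "plot"], weights)
```

In the first case the vector is built first and the weights change its values; in the second the word list only decides which terms become dimensions, and the values come from the vectorization step.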
Creating and Maintaining Word Lists Updating a Word List: If you add new documents to your corpus, additional terms will usually become relevant and should be added to the word list. After the InteractiveAttributeWeighting operator pops up, use the load function to load your original word list.