Prerequisites:
Minorthird[7] jar file
Mixup files
Project pattern
External dependencies:
JAXB[4]: used to output results in XML format. JAXB was only used to generate the classes and is not
needed to run the U.K. address extraction package. The generated classes must, however, be on your
classpath. To do this, add [u.k extraction package]buildclasses to the classpath.
Overview of the package:
Src: contains the Java source files of the package.
Its contents:
Default package: contains the mixup Java program. This program writes the results to a text file as plain
text.
1) Mixupjaxb program: writes the results as a tagged XML file.
2) Labeler demo: a simple program demonstrating how to use mixup programs to label text data.
A good starting point for beginners who want to understand minorthird and its functionality.
3) Websearch program: extracts links from the web for a given search string. It uses the Yahoo web
search API[8]. The first 1000 links are written to searchlinks.txt. This is done by websearch.java in
the default package. These links are then accessed inside the mixup jdom file. The resulting HTML files
are parsed and the text is extracted using an HTML parser[5]. The extracted text is written to a text
file, which is then loaded by the text base loader. The loader works in tandem with the mixup
interpreter to extract labels from the text.
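The link-to-text step described above can be sketched roughly as follows. This is only an illustration,
not the package's websearch or mixup code: the class name LinkTextCollector, the output path taken from
the command line, and the regex-based tag stripping are all assumptions. The real package uses the HTML
parser library[5], which handles malformed markup far better than a regex.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Scanner;

// Hypothetical sketch: reads links from searchlinks.txt, fetches each page,
// strips the markup and appends the remaining text to one output file.
class LinkTextCollector {

    // Crude stand-in (an assumption) for the HTML parser library.
    // Line breaks are kept because address patterns often span lines.
    static String stripTags(String html) {
        return html
            .replaceAll("(?is)<(script|style)\\b.*?</\\1>", " ") // drop script/style bodies
            .replaceAll("<[^>]*>", " ")                          // drop remaining tags
            .replaceAll("[ \\t]+", " ")                          // collapse spaces and tabs
            .trim();
    }

    // Append mode: the file grows on every run unless it is deleted first.
    static void appendText(Path out, String text) throws IOException {
        Files.write(out,
            (text + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Downloads a page as a string; assumes the page is UTF-8 encoded.
    static String fetch(String link) throws IOException {
        try (InputStream in = new URL(link).openStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        Path links = Paths.get("searchlinks.txt"); // written by the websearch program
        Path out = Paths.get(args[0]);             // output text file (name is an assumption)
        for (String link : Files.readAllLines(links, StandardCharsets.UTF_8)) {
            if (link.trim().isEmpty()) continue;
            try {
                appendText(out, stripTags(fetch(link.trim())));
            } catch (IOException e) {
                System.err.println("Skipping " + link + ": " + e.getMessage());
            }
        }
    }
}
```

Unreachable links are skipped rather than aborting the run, since a list of up to 1000 search results
is likely to contain some dead pages.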
Evaluation package:
Contains Java code that counts how often each word occurs in a text file and prints the counts. It
may be useful for evaluation purposes.
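A word-frequency count of this kind might look like the sketch below. It is not the evaluation
package's actual code: the class name WordFrequency, the tokenisation rule (lower-cased runs of
letters, digits and apostrophes) and the UTF-8 assumption are all illustrative choices.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the evaluation code; not the package's actual class.
class WordFrequency {

    // Counts case-insensitive word occurrences. A "word" here is a run of
    // letters, digits or apostrophes (an assumed tokenisation rule).
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted alphabetically
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the text file to analyse (assumed to be UTF-8).
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        count(text).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

Running it on the extracted text file and on the labelled output gives a quick way to compare which
terms dominate each.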
Lib: contains all the required libraries. The HTML parser, minorthird and Yahoo jar files are the most
important. All these jar files can also be downloaded from the internet, except the Yahoo jar file,
which was generated from source using NetBeans[6]. Be cautious about using the latest minorthird jar:
it appears to have many bugs, and it is safer to use the older minorthird jar.
Xmlresources: contains the XML schema file used to generate the JAXB classes. Looking at the schema
gives an idea of the output result format. The result format can be modified and extended by
changing the XML schema file and regenerating the JAXB code.
Mixup:
The pattern rules:
Patterns are written using mixup[1]. Mixup is a simple pattern-matching and information extraction
language included in minorthird. The name is an acronym for My Information eXtraction and
Understanding Package. Using mixup, we can easily write extraction pattern rules in extended
Backus-Naur form (EBNF).
Mixup files and their arrangement:
1) Cities.txt: contains a list of all cities in the United Kingdom[2].
2) Counties.txt: contains a list of all counties in the United Kingdom[3].
3) County.mixup: contains rules to extract postcodes, phone numbers, counties and cities. These are
central to the extraction process.
4) POIDes.mixup: contains rules to extract poiname. Uses the street address and poidesignators to
extract it. The rules in this file are quite generic, but the outliers get filtered in the
POI mixup file.
5) Streetaddress.mixup: contains rules to extract the street address. Uses county and city names as
well as street indicators to extract it.
6) POI.mixup: the file that extracts the actual poi tag. Uses the tags generated by the rest of the
mixup files.
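To give a feel for how list files such as Cities.txt and Counties.txt feed the rules, here is a
plain-Java analogue of a dictionary lookup. It is not Mixup syntax and not how minorthird implements
dictionaries. The class name GazetteerLookup and the normalisation rule (lower-case, punctuation
treated as spaces) are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical plain-Java analogue of a dictionary lookup; not Mixup itself.
class GazetteerLookup {

    // Lower-cases and turns every run of non-alphanumerics into one space,
    // so "Stoke-on-Trent" and "stoke on trent" compare equal (assumed rule).
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", " ").trim();
    }

    // Builds the lookup set from the lines of a list file such as Cities.txt.
    static Set<String> load(List<String> lines) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : lines) {
            String name = normalize(line);
            if (!name.isEmpty()) names.add(name);
        }
        return names;
    }

    // Returns the (normalised) dictionary entries that occur in the text as whole words.
    static List<String> find(Set<String> names, String text) {
        String haystack = " " + normalize(text) + " ";
        List<String> found = new ArrayList<>();
        for (String name : names) {
            if (haystack.contains(" " + name + " ")) found.add(name);
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: list file (e.g. Cities.txt), args[1]: text file to search.
        Set<String> names = load(Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        String text = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        find(names, text).forEach(System.out::println);
    }
}
```

In the package itself, the matched city and county names become tags that the street address and POI
rules build on.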
Running from command line:
1) Add the minorthird jar file to your classpath. As the software is built on top of minorthird, it
must be on your classpath at least when running the mixup and mixupjaxb programs. It is not required
for the websearch program. However, the websearch program needs the Yahoo jar file on its classpath.
Note: you can find all the relevant jar files in the lib directory of the address extraction package.
2) Make sure the folder containing the mixup files is on your classpath. Under Windows you could
set the environment variable CLASSPATH to point to the mixup directory in the package.
3) Allocate extra memory if you get an out-of-memory error while running the mixup program. The error
arises because the text base loader loads the file using the DOC_PER_FILE option instead of the
DOC_PER_LINE option. The DOC_PER_FILE option treats the whole file as a single document, whereas the
DOC_PER_LINE option treats each line as a separate document. DOC_PER_LINE is not used because it
prevents the extraction of patterns that span several lines, which is how address patterns generally
occur.
4) Note that the text file into which the extracted text from multiple links is written is opened in
append mode, so its size grows each time you run the program. Because this file is loaded by the text
base loader, the Java memory that must be allocated grows in proportion to its size. The best approach
is to delete the file before running the mixup program and to check its size once it has been
written, so that you can allocate memory proportional to its size.
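The trade-off in step 3 can be illustrated without minorthird. The sketch below is not the text base
loader. It applies one regular expression, a simplified stand-in for an address rule, first to the
whole text as one document and then to each line separately. The class name, the pattern and the
sample text are all assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of why per-line documents miss multi-line addresses.
class DocumentGranularityDemo {

    // Simplified stand-in for an address rule: a street line followed,
    // after whitespace (possibly a line break), by a UK-style postcode.
    static final Pattern ADDRESS = Pattern.compile(
        "\\d+ [A-Za-z ]+ (?:Street|Road)\\s+[A-Z]{1,2}\\d{1,2}[A-Z]? \\d[A-Z]{2}");

    // Whole file as a single document: matches may span line breaks.
    static int countPerFile(String fileContents) {
        return countMatches(fileContents);
    }

    // Each line as its own document: a match must fit inside one line.
    static int countPerLine(String fileContents) {
        int total = 0;
        for (String line : fileContents.split("\\R")) {
            total += countMatches(line);
        }
        return total;
    }

    private static int countMatches(String document) {
        Matcher m = ADDRESS.matcher(document);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String text = "Visit us at\n10 Downing Street\nSW1A 2AA\nLondon";
        System.out.println("one document per file: " + countPerFile(text)); // prints 1
        System.out.println("one document per line: " + countPerLine(text)); // prints 0
    }
}
```

The address is found only when the street line and the postcode line belong to the same document,
which is why the larger memory footprint of whole-file loading is accepted.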
The command for allocating extra memory:
java -Xmx500M mixup
(if you want to allocate 500 MB to your mixup program)
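The delete-then-check routine from step 4 could be scripted along these lines. This is a sketch under
stated assumptions: the class name, the --delete flag, the multiplier of 20 and the 64 MB floor are
all made up for illustration and are not measured values. Tune the multiplier against the file sizes
at which you actually see out-of-memory errors.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: resets the append-mode text file, or reports its size
// together with a rough -Xmx suggestion.
class ExtractedTextFileCheck {

    // ASSUMPTION: heap needed is about 20x the file size. Not a measured figure.
    static final long ASSUMED_FACTOR = 20;
    // ASSUMPTION: never suggest less than 64 MB.
    static final long MIN_HEAP_MB = 64;

    // Heap suggestion proportional to the file size, rounded up to whole MB.
    static long suggestHeapMb(long fileSizeBytes, long factor) {
        long fileMb = (fileSizeBytes + (1L << 20) - 1) >> 20;
        return Math.max(MIN_HEAP_MB, fileMb * factor);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]); // path of the extracted-text file
        if (args.length > 1 && "--delete".equals(args[1])) {
            // Run before a fresh extraction so old appended text does not pile up.
            boolean removed = Files.deleteIfExists(file);
            System.out.println(removed ? "Deleted " + file : "Nothing to delete at " + file);
            return;
        }
        long bytes = Files.size(file);
        System.out.println(file + " is " + bytes + " bytes; try: java -Xmx"
            + suggestHeapMb(bytes, ASSUMED_FACTOR) + "M mixup");
    }
}
```

Call it once with --delete before the extraction run, and once without the flag after the text file
has been written, then use the printed value as a starting point for -Xmx.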
References:
[1] http://minorthird.sourceforge.net/tutorials/Mixup%20Tutorial.htm
[2] http://www.gbet.com/AtoZ_cities/
[3] http://www.gbet.com/AtoZ_counties/
[4] https://jaxb.dev.java.net/
[5] http://htmlparser.sourceforge.net/
[6] http://www.netbeans.org/
[7] http://minorthird.sourceforge.net/
[8] http://developer.yahoo.com/search/