Prerequisites:
Minorthird[7] jar file
Mixup files
Project pattern
External dependencies:
JAXB[4]: used to output results in XML format. JAXB was only used to generate the classes and is not
needed at run time by the U.K. address extraction package. The generated classes, however, must be on
your classpath. To do this, add [u.k extraction package]\build\classes to the classpath.
Overview of the package:
Src: contains the Java source files of the package.
Its contents:
Default package: contains the mixup Java program. This program writes the results to a text file as
plain text.
1) Mixupjaxb program: writes the results as a tagged XML file.
2) Labeler demo: a simple program demonstrating how to use mixup programs to label text data.
A good starting point for beginners who want to understand Minorthird and its functionality.
3) Websearch program: extracts links from the web for a given search string, using the Yahoo web
search API[8]. The first 1000 links are written to searchlinks.txt. This step is performed by
websearch.java in the default package. The links are then accessed inside the mixup jdom file. The
resulting HTML files are parsed and their text is extracted using an HTML parser[5]. The extracted
text is written to a text file, which is in turn loaded by the text base loader. The loader works in
tandem with the mixup interpreter to extract labels from the text.
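The text extraction step can be pictured with a small stand-alone sketch. It is only an illustration under stated assumptions: it does not use the HTML parser library[5] that the package relies on, but a naive regular-expression approach, and the class and method names are hypothetical. Line breaks are kept because addresses usually span several lines.

```java
// Hypothetical stand-in for the HTML text extraction step.
// The real package uses the HTML parser library [5]; this sketch uses plain regular expressions.
class HtmlTextSketch {

    // Returns the visible text of an HTML fragment, keeping line breaks.
    static String extractText(String html) {
        // Drop script and style elements together with their contents.
        String text = html.replaceAll("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", " ");
        // Turn common block and line-break tags into newlines.
        text = text.replaceAll("(?i)<br\\s*/?>|</(p|div|li|tr)\\s*>", "\n");
        // Remove every remaining tag.
        text = text.replaceAll("(?s)<[^>]*>", " ");
        // Collapse runs of spaces and tabs, then tidy spaces around newlines.
        text = text.replaceAll("[ \\t]+", " ");
        text = text.replaceAll(" ?\\n ?", "\n");
        return text.trim();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>10 Downing Street<br>London</p>"));
    }
}
```

A regular-expression approach like this breaks on malformed or unusual markup, which is one reason to prefer a real parser for pages fetched from the web.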
Evaluation package:
Contains Java code that calculates and prints the frequency of occurrence of each word in a text file.
It might be useful for evaluation purposes.
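A word-frequency count of this kind can be sketched in a few lines. This is not the code shipped in the evaluation package; the class name, the tokenisation rule (lower-cased runs of letters, digits and apostrophes) and the UTF-8 assumption are all illustrative choices.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name; not the class from the evaluation package.
class WordFrequency {

    // Counts case-insensitive word occurrences, sorted alphabetically by word.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the text file to analyse (assumed to be UTF-8 encoded).
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        count(text).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```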
Lib: contains all the required libraries. The HTML parser, Minorthird and Yahoo jar files are the most
important. All of these jar files can also be downloaded from the internet, except the Yahoo jar file,
which was generated from source using NetBeans[6]. Be cautious about using the latest Minorthird jar:
it seems to have too many bugs. It is safer to use the older Minorthird jar.
Xmlresources: contains the XML schema file used to generate the JAXB classes. Looking at the schema
gives an idea of the output result format. The result format can be modified and extended by
changing the XML schema file and regenerating the JAXB code.
Mixup:
The pattern rules:
Patterns are written using Mixup[1], a simple pattern-matching and information extraction
language included in Minorthird. The name is an acronym for My Information eXtraction and
Understanding Package. Using Mixup we can easily write extraction pattern rules in extended
Backus-Naur form (EBNF).
Mixup files and their arrangement:
1) Cities.txt: contains a list of all cities in the United Kingdom[2].
2) Counties.txt: contains a list of all counties in the United Kingdom[3].
3) County.mixup: contains rules to extract postcodes, phone numbers, counties and cities. These are
central to the extraction process.
4) POIDes.mixup: contains rules to extract poiname. The rules use the street address and
poidesignators. They are quite generic in this file, but the outliers get filtered in the POI mixup
file.
5) Streetaddress.mixup: contains rules to extract the street address. The rules use county and city
names as well as street indicators.
6) POI.mixup: the file which extracts the actual poi tag. It uses the tags generated by the rest of
the mixup files.
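To give a feel for the two kinds of cue these files combine, dictionary lookups (cities, counties) and shape-based patterns (postcodes), here is a rough Java sketch. It is not a translation of the Mixup rules in the package: the postcode expression is a simplified shape that does not cover every valid U.K. postcode, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of dictionary-based and pattern-based cues; not the package's Mixup rules.
class AddressCuesSketch {

    // Simplified U.K. postcode shape, e.g. "SW1A 2AA" or "M1 1AE".
    static final Pattern POSTCODE =
            Pattern.compile("\\b[A-Z]{1,2}\\d[A-Z\\d]? ?\\d[A-Z]{2}\\b");

    // Returns every substring of the text that has the postcode shape.
    static List<String> findPostcodes(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = POSTCODE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Returns the dictionary entries (for example the lines of Cities.txt)
    // that occur in the text as whole words, ignoring case.
    static List<String> findDictionaryEntries(String text, Collection<String> dictionary) {
        List<String> found = new ArrayList<>();
        for (String entry : dictionary) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(entry) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            if (p.matcher(text).find()) {
                found.add(entry);
            }
        }
        return found;
    }
}
```

In the package itself the later files build on tags produced by the earlier ones, so a street address rule can require a nearby city or county tag rather than matching raw text as this sketch does.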
Running from command line:
1) Add the Minorthird jar file to your classpath. As the software is built on top of Minorthird, it must
be on your classpath at least when running the mixup and mixupjaxb programs. It is not required for
the websearch program, which instead needs the Yahoo jar file on its classpath.
Note: You can find all the relevant jar files in the lib directory of the address extraction package.
2) Make sure the folder containing the mixup files is on your classpath. Under Windows you could
set the environment variable CLASSPATH to point to the mixup directory in the package.
3) Allocate extra memory if you get an out-of-memory error while running the mixup program. The
error occurs because the text base loader loads the file using the DOC_PER_FILE option instead of the
DOC_PER_LINE option. DOC_PER_FILE treats the whole file as a single document, whereas
DOC_PER_LINE treats each line as a separate document. DOC_PER_LINE would inhibit the extraction of
patterns that span several lines, which is how address patterns generally occur, so DOC_PER_FILE
is used.
4) Note that the test file into which the extracted text from multiple links is written is opened in
append mode, so its size increases each time you run the program. As this file is loaded by the text
base loader, the amount of Java memory that must be allocated grows in proportion to its size. The
best approach is to delete the file before running the mixup program and check its size once it has
been written, so that you can allocate memory proportional to its size.
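The trade-off between the two loading options in step 3 can be illustrated without Minorthird at all. The sketch below uses plain java.util.regex and a toy address pattern (both are assumptions made for illustration, not the package's rules) to show that a pattern spanning two lines is found when the file is one document and lost when every line is its own document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; it does not call the Minorthird text base loader.
class DocumentGranularitySketch {

    // Toy address pattern: a street line, a line break, then a town and a postcode.
    static final Pattern ADDRESS = Pattern.compile(
            "\\d+ [A-Z][a-z]+ (?:Street|Road)\\s+[A-Z][a-z]+ [A-Z]{1,2}\\d[A-Z\\d]? \\d[A-Z]{2}");

    // Counts pattern matches inside a single document.
    static int countMatches(String document) {
        Matcher m = ADDRESS.matcher(document);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Whole file as one document, analogous to DOC_PER_FILE.
    static int matchesPerFile(String fileContents) {
        return countMatches(fileContents);
    }

    // Each line as its own document, analogous to DOC_PER_LINE.
    static int matchesPerLine(String fileContents) {
        int total = 0;
        for (String line : fileContents.split("\\R")) {
            total += countMatches(line);
        }
        return total;
    }

    public static void main(String[] args) {
        String file = "Visit us at\n10 Downing Street\nLondon SW1A 2AA\n";
        System.out.println("per file: " + matchesPerFile(file));
        System.out.println("per line: " + matchesPerLine(file));
    }
}
```

The price of the per-file view is that the whole file has to be held in memory as one document, which is where the extra heap requirement comes from.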
The command for allocating extra memory:
java -Xmx500M mixup
(if you want to allocate 500 MB to your mixup program)
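The advice in step 4 (delete the output file before a run, then compare its size with the available heap) can be sketched as follows. The class name and the default file name extracted.txt are hypothetical; the package's actual output file name is not stated in this manual.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper; the class and file names are illustrative only.
class ExtractedTextFile {

    // Removes output left over from a previous run; returns true if a file was deleted.
    static boolean reset(Path file) throws IOException {
        return Files.deleteIfExists(file);
    }

    // Appends one chunk of extracted text, creating the file on first use.
    static void append(Path file, String text) throws IOException {
        Files.write(file, (text + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "extracted.txt");
        reset(file); // start from an empty file so the size reflects this run only
        append(file, "10 Downing Street");
        append(file, "London SW1A 2AA");
        long fileBytes = Files.size(file);
        long maxHeapBytes = Runtime.getRuntime().maxMemory(); // reflects -Xmx when it is set
        System.out.println("file size: " + fileBytes + " bytes");
        System.out.println("max heap:  " + maxHeapBytes + " bytes");
    }
}
```

Running it with, for example, java -Xmx500M ExtractedTextFile shows the heap limit the JVM actually applied, which makes it easy to confirm the option took effect before starting a long extraction run.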
References:
[1] http://minorthird.sourceforge.net/tutorials/Mixup%20Tutorial.htm
[2] http://www.gbet.com/AtoZ_cities/
[3] http://www.gbet.com/AtoZ_counties/
[4] https://jaxb.dev.java.net/
[5] http://htmlparser.sourceforge.net/
[6] http://www.netbeans.org/
[7] http://minorthird.sourceforge.net/
[8] http://developer.yahoo.com/search/
