Documentation of changes on document parsing of BlueHOUND
Version 1.2 Date 2016-06-02
Mentor: Long Yin
Interns: Wan Xulang, Zhang Yuxi
Author: Wan Xulang
1. Files
To improve the performance of structure detection & clause extraction in BlueHOUND, we identified a
list of issues and tested solutions for 4 of them. Changes have been made in two files. One is the
configuration file (.txt), which contains the parameter settings of the structure detection & clause extraction
module. The other is the source file (.java) of the structure detection & clause extraction module, which contains the detailed
rules of structure detection & clause extraction. There are 6 files in the document folder.
ConfigFile-Original.txt
- the original configuration file
ConfigFile-Updated.txt
- the updated configuration file
DetectStructure-Original.java
- the original source code of the structure detection & clause extraction module
DetectStructure-Updated.java
- the updated source code of the structure detection & clause extraction module
StructureDetection&ClauseExtractionModule.pptx
- Flow charts that describe the procedures in the structure detection & clause extraction module. This
file can be used as a reference to quickly understand DetectStructure-Original.java
DocumentationofchangesondocumentparsingofBlueHOUND.docx
- Documents the changes we made in the configuration file (.txt) and the structure detection & clause
extraction module (.java)
2. Problems and solutions
We tested the performance of the document parsing function by reviewing 31 documents. We found that not
all documents could be parsed 100% correctly. A list of issues has been identified, and we proposed 4
changes to fix some of these issues. Note that for problems 2 and 3, we need to modify both the
configuration file and the source code.
| # | Problems | Changes | Modifications of Configuration File | Modifications of Source Code |
|---|----------|---------|-------------------------------------|------------------------------|
| 1 | Clauses/Subclauses started with ‘section #.#’ cannot be detected | Add regular expression in config file for section #.# | 3.1. TAB Punctuation; 3.3. Section #.# | |
| 2 | Clauses/Subclauses started with ‘Article one/two/three’ cannot be detected | Add regular expression in config file for Article One/Two/Three | 3.2. Article one/two/three… | 4.1. Recognize LetterNumeric |
| 3 | Titles of Clauses/Subclauses ended with a colon cannot be detected | Turn off the filtering rule which excludes clauses whose title ends with a colon | 3.4. ColonTitles | 4.2. ColonTitles Filter |
| 4 | Missing Clauses/Subclauses if there is a gap in numbering | Set continuous key | | 4.3. Numbering Gap |
3. Modifications of Config File
3.1. TAB Punctuation
Motivation: In document parsing, titles which follow a tab punctuation are set to priority 3. However,
these are important titles, and we don’t want them to be priority 3.
Solution: Add TAB to the important punctuation in the config file.
3.2. Article one/two/three…
Motivation: Titles with a “letter number”, like article one/two/three, can’t be detected by the tool.
Solution: Add a new regular expression in the config file to extract such titles. In addition, to make them
recognizable, a mapping function is needed to map a “letter number” like one/two/three to a “key
value” like 1, 2, 3. This mapping function is added in the source code. Please refer to 4.1 for more
information.
Regular Expression:
sectionRegexp \b((?:[Aa][Rr][Tt][Ii][Cc][Ll][Ee])\s+(([TtWwEeNnHhIiRrFfOoGg]{4,5}[Yy]){0,1}-{0,1}[OoNnTtWwRrEeFfSsVvXxGgHhLlIiUuYy]{3,9}))\b
Explanation: This regular expression combines two parts. The first part is the prefix “article”, while
the second part is a letter set covering all possible combinations of letter numbers from one to fifty-nine.
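The behaviour of this expression can be checked with a short Java sketch. It assumes the escapes in the pattern are `\b` and `\s`; the class name `ArticleTitleDemo` and the method `letterNumber` are hypothetical and are not taken from DetectStructure-Updated.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not part of the BlueHOUND source code.
final class ArticleTitleDemo {
    // Pattern from section 3.2, with the escapes assumed to be \b and \s.
    static final Pattern ARTICLE = Pattern.compile(
        "\\b((?:[Aa][Rr][Tt][Ii][Cc][Ll][Ee])\\s+"
        + "(([TtWwEeNnHhIiRrFfOoGg]{4,5}[Yy]){0,1}-{0,1}"
        + "[OoNnTtWwRrEeFfSsVvXxGgHhLlIiUuYy]{3,9}))\\b");

    /** Returns the letter-number part of an "Article ..." title, or null if there is no match. */
    static String letterNumber(String line) {
        Matcher m = ARTICLE.matcher(line);
        // Group 2 is the part that follows the "article" prefix.
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(letterNumber("Article One"));                  // One
        System.out.println(letterNumber("ARTICLE twenty-three Payment")); // twenty-three
        System.out.println(letterNumber("Article 5"));                    // null
        System.out.println(letterNumber("Article then"));                 // then
    }
}
```

Because the second part is a set of letters rather than a list of words, the pattern also accepts non-numbers such as “then”. This is presumably why a separate check like isLetterNumeral (see 4.1) is applied before a key is assigned.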
3.3. Section #.# (# refers to numeral symbols)
Motivation: When extracting multilevel titles, the system gives low priority to titles which follow a
word like ‘section’ or ‘article’.
Solution: Add new regular expressions which contain these words as prefixes to avoid this situation.
Figure 1 - TAB Punctuation
Regular Expression:
multilevelRegexp (([Ss][Ee][Cc][Tt][Ii][Oo][Nn])\s(\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})\.)((\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})(\.))*((\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})(\.?))\s+
multilevelRegexp (([Ss][Ee][Cc][Tt][Ii][Oo][Nn])\s(\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})-)((\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})(-))*(\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})\s+
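As a rough check of what these two expressions accept, the sketch below compiles them in Java and tests a few titles. It assumes the escapes are `\s`, `\d` and `\.`, and that a title is matched at the start of a line; the class name `SectionTitleDemo`, the constant `NUM` and the method `startsWithSectionTitle` are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: not part of the BlueHOUND source code.
final class SectionTitleDemo {
    // One numbering level: digits, one or two letters, or a Roman numeral.
    static final String NUM =
        "(\\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})";

    // "Section 3.1", "Section 3.1.2." and so on (levels separated by dots).
    static final Pattern DOTTED = Pattern.compile(
        "(([Ss][Ee][Cc][Tt][Ii][Oo][Nn])\\s" + NUM + "\\.)"
        + "(" + NUM + "(\\.))*(" + NUM + "(\\.?))\\s+");

    // "Section 4-2" and so on (levels separated by dashes).
    static final Pattern DASHED = Pattern.compile(
        "(([Ss][Ee][Cc][Tt][Ii][Oo][Nn])\\s" + NUM + "-)"
        + "(" + NUM + "(-))*" + NUM + "\\s+");

    /** True if the line begins with a multilevel "section" title. */
    static boolean startsWithSectionTitle(String line) {
        return DOTTED.matcher(line).lookingAt() || DASHED.matcher(line).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWithSectionTitle("Section 3.1 Payment"));   // true
        System.out.println(startsWithSectionTitle("Section 3.1.2. Scope"));  // true
        System.out.println(startsWithSectionTitle("section 4-2 Fees"));      // true
        System.out.println(startsWithSectionTitle("Section 3 Payment"));     // false
    }
}
```

Both expressions require at least two numbering levels, so a single-level title such as “Section 3” is not matched by them.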
3.4. ColonTitles
Motivation: When extracting items, the system excludes titles followed by colons when the
“LongTitleAndBlankDescription” filter is enabled.
Solution: The “LongTitleAndBlankDescription” filter contains a set of rules: exclude clauses/subclauses
with a long title; exclude clauses/subclauses with a blank description; and exclude clauses/subclauses whose title
ends with a colon or semicolon. If we simply disable this filter, it will have bad effects on other documents.
So we separated the colon rule from the “LongTitleAndBlankDescription” filter and added a new filter named
“ColonTitles” in the config file. We also adjusted the source code to separate the rule about colons/semicolons
from the set of rules of the “LongTitleAndBlankDescription” filter. Please refer to 4.2 for more information.
4. Modifications of Source Code
4.1. Recognize LetterNumeric
Motivation: As we’ve added a new regular expression in the config file in 3.2, the system can now extract
titles with letter numerals. But the system can’t give them a numeric key, as there is no function in
the source code to do so.
Solution: Build a function to give keys to such titles. We simply use a “decode” function to achieve this.
Figure 3 shows two lists which are pre-built for the decoding function.
Figure 2 - ColonTitles Filter
Figure 3 – Preparation for Decoding
Figure 4 shows the code of the decoding function. This function divides the numeric symbol into different
parts by “-” and converts each part into a numeric value by going through the two lists we set before.
Finally, it sums up the numeric values of the parts to give the key.
Before this function, another function named “isLetterNumeral” is used to identify whether a title is a letter
number or not.
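The decoding steps described above can be sketched as follows. This is a minimal illustration, not the code shown in Figure 4: the contents of the two lists (`UNITS` and `TENS`), the class name `LetterNumeralDecoder`, and the way `isLetterNumeral` is implemented here are all assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: not part of the BlueHOUND source code.
final class LetterNumeralDecoder {
    // Assumed first list: "one" .. "nineteen"; value = index + 1.
    static final List<String> UNITS = Arrays.asList(
        "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen");

    // Assumed second list: "twenty" .. "fifty"; value = (index + 2) * 10.
    static final List<String> TENS = Arrays.asList("twenty", "thirty", "forty", "fifty");

    /** Splits the symbol by "-", looks up each part, and sums the values; -1 if a part is unknown. */
    static int decode(String symbol) {
        int key = 0;
        for (String part : symbol.toLowerCase(Locale.ROOT).split("-")) {
            int unit = UNITS.indexOf(part);
            int ten = TENS.indexOf(part);
            if (unit >= 0) {
                key += unit + 1;
            } else if (ten >= 0) {
                key += (ten + 2) * 10;
            } else {
                return -1; // not a letter number
            }
        }
        return key;
    }

    /** Assumed implementation: a title is a letter number if every part decodes. */
    static boolean isLetterNumeral(String symbol) {
        return decode(symbol) > 0;
    }

    public static void main(String[] args) {
        System.out.println(decode("One"));             // 1
        System.out.println(decode("twenty-three"));    // 23
        System.out.println(decode("fifty-nine"));      // 59
        System.out.println(isLetterNumeral("then"));   // false
    }
}
```

The sketch does not validate the order of the parts, so an input such as “three-twenty” would also decode to 23; a stricter check would be needed if such titles can occur.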
4.2. ColonTitles Filter
Motivation: To separate the rule that excludes colon titles from the “LongTitleAndBlankDescription” filter,
we made a new filter named “ColonTitles”. Thus, the “LongTitleAndBlankDescription” filter will only exclude
clauses which have long titles or blank descriptions.
Solution: Add a new function in the filtering module to act as a new filter.
Figure 4 – Convert Function
Figure 5 shows the code that identifies a title followed by a colon and marks it. When this filter is
set to “true”, these marked titles will be excluded from the final output. At the same time, the corresponding rules
in the long title filter have been removed.
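The mark-then-exclude behaviour can be sketched in Java as below. This is a simplified illustration under stated assumptions, not the code in Figure 5: the `Clause` class, the `colonMarked` field and the `apply` method are hypothetical names, and the flag passed to `apply` stands in for the “ColonTitles” setting in the config file.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not part of the BlueHOUND source code.
final class ColonTitlesFilterDemo {
    /** Hypothetical minimal clause record. */
    static final class Clause {
        final String title;
        final String description;
        boolean colonMarked; // set when the title ends with ':' or ';'

        Clause(String title, String description) {
            this.title = title;
            this.description = description;
        }
    }

    static boolean endsWithColon(String title) {
        String t = title.trim();
        return t.endsWith(":") || t.endsWith(";");
    }

    /** Marks colon titles and drops them only when the filter is enabled. */
    static List<Clause> apply(List<Clause> clauses, boolean colonTitlesEnabled) {
        List<Clause> kept = new ArrayList<>();
        for (Clause c : clauses) {
            c.colonMarked = endsWithColon(c.title);
            if (colonTitlesEnabled && c.colonMarked) {
                continue; // excluded from the final output
            }
            kept.add(c);
        }
        return kept;
    }
}
```

Keeping the colon rule behind its own flag means it can be switched off for documents whose titles legitimately end with a colon, without also disabling the long-title and blank-description rules.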
4.3. Numbering Gap
Motivation: In the item extraction part, the system does a fast filtering pass to check the continuity of the extracted
titles. However, documents with missing titles are badly affected by this. Also, the pruning function
is based on continuous keys of the extracted titles, so we can’t simply disable the fast filtering function to avoid
this.
Solution: Build a function that bridges the numbering gap. Note that this change may have a negative
impact on parsing other documents, and thus it needs more validation before implementation.
Figure 5 – ColonTitles Filter
Figure 6 shows the code of the Givekeys function. This function checks the priority and prefix of
neighboring titles. If they are the same, the system changes the key of the second title to one plus the
key of the first title. Note that the lists (e.g. LastKn[], LastPrfx[]) in this function are reset when a
new regular expression is used.
This function currently cannot reliably distinguish English letters from Roman numerals. When both are used in the
same article, it gets confused. In the attached code, this function is therefore used only for digit titles. It is
not used for letter numerals, English letters, or Roman numerals. (However, if only one of them
is used in a given article, this function is still a good solution.) You can search for “givekeys” to see
where to enable this function for these kinds of titles (the related statements are commented out in the code).
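The key reassignment described above can be sketched as follows. It is a simplified illustration, not the Givekeys function in Figure 6: the `Title` class and its fields are hypothetical, and the sketch compares each title directly with its predecessor instead of using the LastKn[] and LastPrfx[] lists.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: not part of the BlueHOUND source code.
final class GiveKeysDemo {
    /** Hypothetical minimal title record. */
    static final class Title {
        final int priority;
        final String prefix;
        int key;

        Title(int priority, String prefix, int key) {
            this.priority = priority;
            this.prefix = prefix;
            this.key = key;
        }
    }

    /** If two neighboring titles share priority and prefix, the second gets key = first key + 1. */
    static void giveKeys(List<Title> titles) {
        for (int i = 1; i < titles.size(); i++) {
            Title prev = titles.get(i - 1);
            Title cur = titles.get(i);
            if (prev.priority == cur.priority && prev.prefix.equals(cur.prefix)) {
                cur.key = prev.key + 1;
            }
        }
    }

    public static void main(String[] args) {
        // Extracted keys 1, 2, 4, 5: title 3 is missing from the document.
        List<Title> titles = Arrays.asList(
            new Title(1, "Section", 1), new Title(1, "Section", 2),
            new Title(1, "Section", 4), new Title(1, "Section", 5));
        giveKeys(titles);
        for (Title t : titles) {
            System.out.print(t.key + " "); // 1 2 3 4
        }
    }
}
```

After the call the keys are continuous, so a continuity check no longer rejects the titles after the gap. The same renumbering also overwrites keys that were correct, which is one way the change could hurt other documents.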
Figure 6 – Givekeys Function
5. Outcome and Defects
Figure 7 shows the performance of document parsing after applying changes 1-4. It shows that changes 1, 2 and 3
improve the overall performance of document parsing.
Change 4 has a negative impact on several documents, especially “Google Play Terms of Service”. More
validation is needed before we implement this change in production. After improving change 4, some of the
negative impacts have been resolved.
Figure 7 – Performance on Test Documents
