Documentation of changes on document parsing of BlueHOUND
Version 1.2 Date 2016-06-02
Mentor: Long Yin
Interns: Wan Xulang, Zhang Yuxi
Author: Wan Xulang
1. Files
To improve the performance of structure detection & clause extraction in BlueHOUND, we identified a
list of issues and tested solutions for 4 of them. Changes have been made in two files. One is the
configuration file (.txt), which contains the parameter settings of the structure detection & clause
extraction module. The other file (.java) is the structure detection & clause extraction module itself,
which contains the detailed rules of structure detection & clause extraction. There are 6 files in the
document folder.
ConfigFile-Original.txt
- the original configuration file
ConfigFile-Updated.txt
- the updated configuration file
DetectStructure-Original.java
- the original source code of structure detection & clause extraction module
DetectStructure-Updated.java
- the updated source code of structure detection & clause extraction module
StructureDetection&ClauseExtractionModule.pptx
- Flow charts to describe the procedures in structure detection & clause extraction module. This
file can be used as a reference to quickly understand DetectStructure-Original.java
DocumentationofchangesondocumentparsingofBlueHOUND.docx
- Documents the changes we made in the configuration file (.txt) and the structure detection & clause
extraction module (.java)
2. Problems and solutions
We tested the performance of the document parsing function by reviewing 31 documents. We found that not
all documents could be parsed 100% correctly. A list of issues has been identified, and we proposed 4
changes to fix some of these issues. Note that for problems 2 and 3, we need to modify both the
configuration file and the source code.
# | Problems | Changes | Modifications of Configuration File | Modifications of Source Code
--+----------+---------+-------------------------------------+-----------------------------
1 | Clauses/Subclauses started with ‘section #.#’ cannot be detected | Add regular expression in config file for section #.# | 3.1. TAB Punctuation; 3.3. Section #.# |
2 | Clauses/Subclauses started with ‘Article one/two/three’ cannot be detected | Add regular expression in config file for Article One/Two/Three | 3.2. Article one/two/three… | 4.1. Recognize LetterNumeric
3 | Title of Clauses/Subclauses ended with colon cannot be detected | Turn off the filtering rule which excludes Clauses whose title ended with colon | 3.4. Colon Titles | 4.2. Colon Titles Filter
4 | Missing Clauses/Subclauses if there is a gap in numbering | Set continuous key | | 4.3. Numbering Gap
3. Modifications of Config File
3.1. TAB Punctuation
Motivation: In document parsing, titles which follow a TAB character are set as priority 3. These are
in fact important titles, so we don’t want them to be priority 3.
Solution: Add TAB to the important punctuation in the config file.
3.2. Article one/two/three…
Motivation: Titles with a “letter number”, like article one/two/three, can’t be detected by the tool.
Solution: Add a new regular expression in the config file to extract such titles. In addition, to make
them recognizable, a mapping function is needed to map a “letter number” like one/two/three to a “key
value” like 1, 2, 3. This mapping function is added in the source code. Please refer to 4.1 for more
information.
Regular Expression:
sectionRegexp \b((?:[Aa][Rr][Tt][Ii][Cc][Ll][Ee])\s+(([TtWwEeNnHhIiRrFfOoGg]{4,5}[Yy]){0,1}-{0,1}[OoNnTtWwRrEeFfSsVvXxGgHhLlIiUuYy]{3,9}))\b
Explanation: This regular expression combines two parts. The first part is the prefix “article”, while
the second part is a letter set covering all possible combinations of letter numbers from one to fifty-nine.
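
To make the behaviour of this expression easier to check, here is a minimal Java sketch that compiles it and extracts the letter number from a few sample lines. It assumes the word-boundary and whitespace tokens are written `\b` and `\s+` (doubled inside a Java string literal). The class name, the method name and the sample titles are hypothetical and are not taken from DetectStructure-Updated.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticleTitleRegexDemo {
    // The config-file expression, with backslashes doubled for a Java string literal.
    static final Pattern ARTICLE = Pattern.compile(
        "\\b((?:[Aa][Rr][Tt][Ii][Cc][Ll][Ee])\\s+"
        + "(([TtWwEeNnHhIiRrFfOoGg]{4,5}[Yy]){0,1}-{0,1}"
        + "[OoNnTtWwRrEeFfSsVvXxGgHhLlIiUuYy]{3,9}))\\b");

    /** Returns the letter number that follows "Article", or null if the line has no such title. */
    static String letterNumber(String line) {
        Matcher m = ARTICLE.matcher(line);
        // Group 2 is the whole letter number (optional tens word, optional hyphen, units word).
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(letterNumber("Article One  Definitions"));
        System.out.println(letterNumber("ARTICLE Twenty-One Termination"));
        System.out.println(letterNumber("Article 5 Fees"));
    }
}
```

Because the second part is only a letter set, an ordinary word built from the same letters (for example "Article Venue") also matches, so a later validation step is presumably still needed to confirm that the match is a real number.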
3.3. Section #.# (# refers to numeral symbols)
Motivation: When extracting multilevel titles, the system gives low priority to titles which follow a
word like ‘section’ or ‘article’.
Solution: Add new regular expressions which contain these words as prefixes to avoid this situation.
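
The exact expressions added to the config file are not reproduced in this document. As an illustration only, a prefix-aware expression for ‘section #.#’ titles could look like the following Java sketch. The pattern, the class name and the method name are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionPrefixRegexDemo {
    // Hypothetical pattern: the word "section" or "article" (any case) at the start
    // of the line, followed by a dotted number such as 3 or 3.1 or 12.4.2.
    static final Pattern SECTION = Pattern.compile(
        "^\\s*(?i:section|article)\\s+(\\d+(?:\\.\\d+)*)");

    /** Returns the dotted number after the prefix word, or null if the line does not start with one. */
    static String numbering(String line) {
        Matcher m = SECTION.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(numbering("Section 3.1 Payment"));
        System.out.println(numbering("  SECTION 12.4.2. Notices"));
        System.out.println(numbering("3.1 Payment"));
    }
}
```

Making the prefix word part of the expression means the title is matched as a whole, instead of being treated as a number that happens to follow another word.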
Figure 1 - TAB Punctuation
Figure 4 shows the code of the decoding function. This function divides the numeric symbol into different
parts at “-” and converts each part into a numeric value by going through the two lists we set before.
Finally, it sums up the numeric values of the parts to give the key.
Before this function, another function named ”isLetterNumeral” is used to identify whether a title is a
letter number or not.
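
The decoding steps described above (split at “-”, look each part up, sum the values) can be sketched in Java as follows. This is not the code from Figure 4. The contents of the two lists, the class name and the simplified `isLetterNumeral` check are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LetterNumberDecoder {
    // Assumed first list: index i holds the word for the value i + 1.
    static final List<String> UNITS = Arrays.asList(
        "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
        "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen");
    // Assumed second list: index i holds the word for the value (i + 2) * 10.
    static final List<String> TENS = Arrays.asList("twenty", "thirty", "forty", "fifty");

    /** Splits the symbol at "-", converts each part and sums the values; returns -1 if a part is unknown. */
    static int decode(String symbol) {
        int key = 0;
        for (String part : symbol.trim().toLowerCase(Locale.ROOT).split("-")) {
            int u = UNITS.indexOf(part);
            int t = TENS.indexOf(part);
            if (u >= 0) {
                key += u + 1;
            } else if (t >= 0) {
                key += (t + 2) * 10;
            } else {
                return -1; // not a letter number
            }
        }
        return key;
    }

    /** Simplified stand-in for the check performed before decoding. */
    static boolean isLetterNumeral(String symbol) {
        return decode(symbol) > 0;
    }

    public static void main(String[] args) {
        System.out.println(decode("Twenty-One"));
        System.out.println(isLetterNumeral("Venue"));
    }
}
```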
4.2. Colon Titles Filter
Motivation: To separate the colon-title exclusion from the “LongTitleAndBlankDescription” filter,
we made a new filter named “ColonTitles”. Thus, the “LongTitleAndBlankDescription” filter will only
exclude clauses which have long titles and blank descriptions.
Solution: Add a new function in the filtering module to serve as the new filter.
Figure 4 – Convert Function
Figure 5 shows the code that identifies a title followed by a colon and gives it a mark. When this filter
is set to “true”, these marked titles will be excluded from the final output. At the same time, the
corresponding logic in the long title filter has been removed.
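
A rough Java sketch of this mark-then-exclude behaviour is given below. It is not the code from Figure 5. The `Clause` class, the field names and the boolean parameter standing in for the config setting are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ColonTitlesFilterDemo {
    /** Hypothetical minimal clause record: a title plus the mark set by the filter. */
    static class Clause {
        final String title;
        boolean colonTitle; // true when the title ends with a colon

        Clause(String title) {
            this.title = title;
        }
    }

    /** Marks every clause whose title ends with a colon. */
    static void markColonTitles(List<Clause> clauses) {
        for (Clause c : clauses) {
            c.colonTitle = c.title.trim().endsWith(":");
        }
    }

    /** Returns the clauses to output; marked clauses are dropped only when the filter is on. */
    static List<Clause> apply(List<Clause> clauses, boolean colonTitlesFilter) {
        markColonTitles(clauses);
        List<Clause> kept = new ArrayList<>();
        for (Clause c : clauses) {
            if (!(colonTitlesFilter && c.colonTitle)) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new Clause("1. Definitions"));
        clauses.add(new Clause("2. Fees:"));
        System.out.println(apply(clauses, true).size());
        System.out.println(apply(clauses, false).size());
    }
}
```

Keeping the mark separate from the exclusion step is what lets the filter be switched off in the config file, as change 3 requires, without touching the long title filter.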
4.3. Numbering Gap
Motivation: In the item extraction part, the system does a fast filtering to check the continuity of
extracted titles. However, documents with missing titles are badly affected by this. The pruning function
is also based on continuous keys of extracted titles, so we can’t simply turn off the fast filtering to
avoid this.
Solution: Build a function to bridge this numbering gap. Note that this change may have a negative
impact on parsing other documents, and thus it needs more validation before implementation.
Figure 5 – ColonTitles Filter
Figure 6 shows the code of the Givekeys function. This function checks the priority and prefix of
neighboring titles. If they are the same, the system changes the key of the second title to one plus the
key of the first title. Note that the lists (e.g.: LastKn[], LastPrfx[]..) in this function are reset when
a new regular expression is used.
This function cannot yet distinguish English characters from Roman numerals reliably. When both are used
in the same article, it gets confused. In the attached code, this function is therefore used only for
digit titles. It is not used for letter numerals, English characters or Roman numerals for now. (However,
if only one of them is used in a given article, this function is still a good solution.) You can simply
search for “givekeys” to see where to enable this function for these kinds of titles (the related lines
are commented in the code).
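
The key-assignment rule described at the start of this section can be sketched in Java as follows. This is a simplified illustration, not the code from Figure 6. The `Title` class and its fields are hypothetical, and the per-expression reset of lists such as LastKn[] and LastPrfx[] is left out.

```java
import java.util.ArrayList;
import java.util.List;

class GiveKeysDemo {
    /** Hypothetical minimal title record. */
    static class Title {
        final int priority;
        final String prefix;
        int key;

        Title(int priority, String prefix, int key) {
            this.priority = priority;
            this.prefix = prefix;
            this.key = key;
        }
    }

    /** For neighboring titles with the same priority and prefix, sets the second key to the first key plus one. */
    static void giveKeys(List<Title> titles) {
        for (int i = 1; i < titles.size(); i++) {
            Title prev = titles.get(i - 1);
            Title cur = titles.get(i);
            if (cur.priority == prev.priority && cur.prefix.equals(prev.prefix)) {
                cur.key = prev.key + 1;
            }
        }
    }

    public static void main(String[] args) {
        // Sections numbered 1, 2, 4, 5 in the document: the gap at 3 is closed.
        List<Title> titles = new ArrayList<>();
        for (int k : new int[] {1, 2, 4, 5}) {
            titles.add(new Title(1, "Section", k));
        }
        giveKeys(titles);
        for (Title t : titles) {
            System.out.println(t.prefix + " key=" + t.key);
        }
    }
}
```

After this pass the keys are continuous, so a continuity check no longer rejects the titles that follow a gap. The same renumbering also hides any real difference between neighboring titles, which may be one reason the change needs more validation.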
Figure 6 – Givekeys Function
5. Outcome and Defects
Figure 7 shows the performance of document parsing after applying changes 1-4. It shows that changes 1, 2
and 3 improve the overall performance of document parsing.
Change 4 has a negative impact on several documents, especially “Google Play Terms of Service”. More
validation is needed before we implement this change in production. After improving change 4, some of the
negative impacts have been resolved.
Figure 7 – Performance on Test Documents