Structure Detection & Clause Extraction Module
- 1. 1
© 2015 IBM Corporation
About Item's Priority
During extraction, each item's priority is determined by the character to its left:
Priority 1: The left-hand character is an important punctuation mark.
Priority 2: The left-hand character is a normal punctuation mark.
Priority 3: The left-hand character is not a defined punctuation mark (it may be part of a word).
[Figure: examples of important punctuation marks and normal punctuation marks]
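The priority rule can be sketched as a small lookup. This is only an illustration: the slides do not list which characters count as important or normal punctuation, so the two sets below are hypothetical, as is the function name `item_priority`.

```python
# Hypothetical punctuation sets; the slides do not enumerate the real ones.
IMPORTANT_PUNCTUATION = set(".:;!?")
NORMAL_PUNCTUATION = set(",()[]\"'")


def item_priority(text: str, start: int) -> int:
    """Return 1, 2, or 3 for the item that begins at text[start]."""
    if start == 0:
        # Assumption: an item at the very start of the text has no left-hand
        # character, so it is treated as "not a defined punctuation mark".
        return 3
    left = text[start - 1]
    if left in IMPORTANT_PUNCTUATION:
        return 1
    if left in NORMAL_PUNCTUATION:
        return 2
    return 3
```

For example, in the string `"Intro.1 Scope,(2 x3"` the item `1` follows a period and would get priority 1 under these assumed sets, while `3` follows a letter and would get priority 3.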
- 2. 2
About Item’s Key and Prefix
For an item:
Prefix: the item's list mark. (An item may not have one.)
Key: the item's sequence mark within a list.
Example prefixes: Section, Chapter, 1., 1.1., 2-, 2-2-, …
Example keys: 2, A, 1, 3, 2, 5, Ii, …
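One way to picture the split is to treat the last number or letter run of an item's label as the key and everything before it as the prefix. The sketch below rests on that assumption; the regular expression and the name `split_item` are illustrative and are not taken from the module.

```python
import re
from typing import Tuple

# Assumption: the key is the final run of digits or letters in the label,
# optionally followed by one closing mark such as "." or "-" or ")".
_ITEM_RE = re.compile(r"^(?P<prefix>.*?)(?P<key>[0-9]+|[A-Za-z]+)[.\-)]?$")


def split_item(label: str) -> Tuple[str, str]:
    """Split an item label into (prefix, key); the prefix may be empty."""
    match = _ITEM_RE.match(label.strip())
    if match is None:
        raise ValueError(f"cannot find a key in {label!r}")
    return match.group("prefix").strip(), match.group("key")
```

Under this sketch, `split_item("Section 2")` yields the prefix `Section` and the key `2`, and a bare label such as `A` has an empty prefix.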
- 3. 3
About Tree & Linear Chain
1. Items: titles extracted from the input document by regular expression (figure labels: A, 1.1, 2, ……, 2.3).
2. Tree: the potential assignments of items, determined by the items' key sequence, format, and prefix. Child nodes are subsections.
[Figure: example tree over nodes A, B, C, D, with the child nodes marked as subsections]
3. Linear chain: a tree in which each node has only one branch (figure labels: 1.1, 1.2, 1.3).
Note: in the tree-building phase, one child can have many parents. After the pruning phase, each child has exactly one parent, i.e. the result is a linear chain.
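The difference between a general tree and a linear chain can be shown with a minimal node type. This is a sketch under the assumption that "only one branch in each node" means at most one child per node; `Node` and `is_linear_chain` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def is_linear_chain(root: Node) -> bool:
    """True if every node reachable from root has at most one child."""
    node = root
    while node.children:
        if len(node.children) > 1:
            return False
        node = node.children[0]
    return True
```

A chain 1.1 → 1.2 → 1.3 passes this check, while a node A with two children B and C does not.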
- 4. 4
Extraction
Input: Document Converter (.XML file)
Extraction Module (based on RegExp or style information):
1. Section Pattern
2. Patent Pattern
3. Multilevel Pattern
4. Item Pattern
5. TOC Pattern
6. Sort and remove overlaps
A fast filter is applied in steps 1-5 to check each item's continuity.
Output: Build Forest
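Step 6 can be illustrated with a short sketch. The slides do not say which match wins when two overlap, so the tie-breaking here (earliest start first, then the longer match) is an assumption, and `Match` and `sort_and_remove_overlaps` are hypothetical names.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    start: int    # offset of the first character
    end: int      # offset one past the last character
    pattern: str  # which pattern produced the match


def sort_and_remove_overlaps(matches: List[Match]) -> List[Match]:
    """Sort matches by position and drop any that overlap a kept match."""
    # Assumption: earlier start wins; on equal starts the longer match wins.
    ordered = sorted(matches, key=lambda m: (m.start, -(m.end - m.start)))
    kept: List[Match] = []
    for m in ordered:
        if kept and m.start < kept[-1].end:
            continue  # overlaps the previously kept match
        kept.append(m)
    return kept
```

With this policy, matches from the five patterns can be collected in any order and reduced to one non-overlapping, position-ordered list before the forest is built.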
- 5. 5
Build Forest
Input: Items from Extractor
1. Detect Subsections: detect subsections and build potential trees.
2. The Forest (Priority 1) and The Low Forest (Priority 2 and 3): build the forests according to each tree's priority.
3. The Forest: items from the low forest are either added or dropped, depending on the result of the in-line list check for the low forest.
4. Prune Forest
Output
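The priority split and the add-or-drop step can be sketched as follows. The in-line list check itself is not described on the slide, so it is passed in as a caller-supplied predicate; `Tree`, `build_forest`, and `keep_low_tree` are hypothetical names.

```python
from typing import Callable, List, NamedTuple, Tuple


class Tree(NamedTuple):
    root_label: str
    priority: int  # 1, 2, or 3, as defined for items


def build_forest(
    trees: List[Tree],
    keep_low_tree: Callable[[Tree], bool],
) -> Tuple[List[Tree], List[Tree]]:
    """Return (forest, dropped) after merging the low forest into the forest."""
    forest = [t for t in trees if t.priority == 1]
    low_forest = [t for t in trees if t.priority in (2, 3)]
    dropped: List[Tree] = []
    for tree in low_forest:
        # Assumption: the predicate stands in for the in-line list check and
        # returns True when the tree should be added to the forest.
        if keep_low_tree(tree):
            forest.append(tree)
        else:
            dropped.append(tree)
    return forest, dropped
```

Keeping the check as a parameter avoids guessing at its logic; only the routing between the two forests is shown.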
- 6. 6
Prune Forest (detect structure.java)
Input: The Forest
1. Linear Chains: get all linear chains.
2. Valid Linear Chains: verify and validate the chains (filters).
3. Prune with Linear Chains: prune the trees in The Forest with the linear chains.
4. Arrange the output into a hierarchy.
5. Clause Extraction & Filtering
6. Create Clause Annotation
Output for the user
The pruning is performed iteratively.
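Step 1 can be pictured as enumerating every root-to-leaf path of a tree. That reading of "get all linear chains" is an assumption, as is the adjacency-map representation; `linear_chains` is a hypothetical name and the sketch expects an acyclic structure.

```python
from typing import Dict, List


def linear_chains(children: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return every root-to-leaf path as a list of node labels."""
    kids = children.get(root, [])
    if not kids:
        return [[root]]  # a leaf is a chain of length one
    chains: List[List[str]] = []
    for kid in kids:
        for tail in linear_chains(children, kid):
            chains.append([root] + tail)
    return chains
```

Each returned path has exactly one child per node, so it could be passed to a validation filter before being used for pruning.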
- 7. 7
Prune Theory
1. The linear chain ends before the tree's start: nothing to prune.
2. The linear chain starts before the tree's start.
3. The tree starts before the linear chain's start.
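The three cases can be told apart by comparing positions. The sketch below assumes positions are document offsets and that case 2 means the chain starts before the tree but does not end before it. The slide does not cover equal starts, so the function returns `None` there. All names are hypothetical.

```python
from enum import Enum
from typing import Optional


class PruneCase(Enum):
    CHAIN_ENDS_BEFORE_TREE = 1    # nothing to prune
    CHAIN_STARTS_BEFORE_TREE = 2
    TREE_STARTS_BEFORE_CHAIN = 3


def classify(chain_start: int, chain_end: int, tree_start: int) -> Optional[PruneCase]:
    """Map a (linear chain, tree) pair to one of the three slide cases."""
    if chain_end < tree_start:
        return PruneCase.CHAIN_ENDS_BEFORE_TREE
    if chain_start < tree_start:
        return PruneCase.CHAIN_STARTS_BEFORE_TREE
    if tree_start < chain_start:
        return PruneCase.TREE_STARTS_BEFORE_CHAIN
    return None  # equal starts: not covered by the slide
```

Only case 1 has a stated outcome (nothing to prune); what happens in cases 2 and 3 is not given in the text, so the sketch stops at classification.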