SlideShare a Scribd company logo
1 of 8
Download to read offline
I NS TITUTE FOR DE FE NSE ANAL YSE S
Mining Measured Information from Text
Arun S. Maiya
Dale Visser
Andrew Wan
February 18, 2015
Approved for public
release; distribution is
unlimited.
IDA Non-Standard
D-5436
Log: H 15-000157
Copy
INSTITUTE FOR DEFENSE
ANALYSES
4850 Mark Center Drive
Alexandria, Virginia 22311-1882
About This Publication
This work was conducted by the Institute for Defense Analyses (IDA) under contract HQ0034-
14-D-0001, Task AK-5-3153, “DAMO Implementation of Phase 3 Operational Capabilities,”
for Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics
Defense Science Board (AT&L/DSB). The views, opinions, and findings should not be
construed as representing the official position of either the Department of Defense or the
sponsoring organization.
Copyright Notice
© 2015 Institute for Defense Analyses
4850 Mark Center Drive, Alexandria, Virginia 22311-1882 • (703) 845-2000.
Acknowledgments
«Reviewers»
This material may be reproduced by or for the U.S. Government pursuant to the
copyright license under the clause at DFARS 252.227-7013 (a)(16) [Jun 2013].
Mining Measured Information from Text
Arun S. Maiya, Dale Visser, and Andrew Wan
Institute for Defense Analyses — Alexandria, VA, USA
{amaiya, dvisser, awan}@ida.org
ABSTRACT
We present an approach to extract measured information
from text (e.g., a 1370 ◦
C melting point, a BMI greater
than 29.9 kg/m2
). Such extractions are critically important
across a wide range of domains — especially those involv-
ing search and exploration of scientific and technical doc-
uments. We first propose a rule-based entity extractor to
mine measured quantities (i.e., a numeric value paired with
a measurement unit), which supports a vast and comprehen-
sive set of both common and obscure measurement units.
Our method is highly robust and can correctly recover valid
measured quantities even when significant errors are intro-
duced through the process of converting document formats
like PDF to plain text. Next, we describe an approach to
extracting the properties being measured (e.g., the property
pixel pitch in the phrase “a pixel pitch as high as 352 µm”).
Finally, we present MQSearch: the realization of a search
engine with full support for measured information.
Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Process-
ing—Text Analysis; H.3.3 [Information Storage and Re-
trieval]: Information Search and Retrieval—Search Process
Keywords
text mining, information retrieval, information extraction,
measured quantities, numerical queries
1. INTRODUCTION AND MOTIVATION
Scientific and technical documents describe methods and
results using measured quantities: a numeric value paired
with a unit of measurement. Examples of text snippets con-
taining such measured quantities include:
• average gravity curvature ζ = (1.3999±0.003)×10−5
s−2
m−1
• 12 ◦
C melting point
• distance from Earth to the Sun is 9.3 × 107
miles
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from Permissions@acm.org.
SIGIR ’15, August 9–13, 2015, Santiago, Chile.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3621-5/15/08...$15.00
DOI: http://dx.doi.org/10.1145/2766462.2767789.
• average responsivity as low as 6.2 pA/K
Note that these measured quantities (e.g., 6.2 pA/K) are
typically associated with a specific measured property (e.g., av-
erage responsivity). In this paper, we study ways in which
to extract these kinds of measured information from doc-
uments.1
The mining of such information is critically im-
portant across many domains — especially those involving
search and exploration of scientific and technical articles.
For instance, an optics researcher may wish to know if the
performance of Nd:YAG laser-pumped KTP parametric os-
cillators has ever been tested at wavelengths longer than
2.4 µm. Full-text search engines using inverted indexes al-
low ad hoc queries on terms such as “KTP parametric oscil-
lator”, but the ability to further filter search results based
on wavelengths greater than 2.4 µm is not typically sup-
ported. To accomplish this, one must first identify and ex-
tract valid measured quantities (e.g., 2.4 µm) in unstruc-
tured text and, then, identify and extract the properties be-
ing measured (e.g., wavelength). These extractions could
then be stored in the index of a search engine in a way that
supports subsequent document queries on measured infor-
mation (e.g., faceted navigation, numeric range queries).
Surprisingly, there is very little existing work on how best
to realize this process. Lines of research most closely re-
lated to the present work include extracting numerical at-
tributes (e.g., [1,4]), supporting numerical document queries
(e.g., [5,12]), and formula identification (e.g., [7]). However,
none of these existing works address the comprehensive ex-
traction of and search for measured information in document
data, as described above. Indeed, numerous challenges exist
in such scenarios. Many widely-used, full-text search engines
(e.g., Apache Solr) convert the original document format
to plain text prior to indexing and storage — an extremely
error-ridden process. For instance, in the extracted text,
exponents are typically lost (e.g., 105
becomes 105, s−2
be-
comes s–2). Moreover, the conversion of some characters can
be highly inconsistent and unpredictable. A simple minus
sign can be converted to a range of different dash characters
or even “garbage” characters. The same is true for other
symbols such as µ, multiplication signs, and degree sym-
bols. It is this inconsistent and error-ridden text, then, that
is ultimately stored in the index of the search engine mak-
ing it virtually impossible to adequately locate documents
by measured quantities. Without the correct identification
of measured quantities, it is virtually impossible to identify
properties being measured, which are critical in efficiently
1
We define measured information as measured quantities and
the measured properties to which they are associated.
navigating scientific and technical articles for state-of-the-
art information. In general, there is a great deal of het-
erogeneity in how measured quantities and measured proper-
ties appear in text – both naturally and through corruption.
This, then, motivates the current investigation of how best
to extract such information.
Recent studies [3, 6] have revealed that rule-based ap-
proaches to information extraction tend to be more effec-
tive, interpretable, and customizable than state-of-the-art
machine learning approaches. We employ rule-based extrac-
tion methods in this work. Our contributions are as follows:
• We propose and describe a rule-based entity extractor to
identify measured quantities in unstructured text docu-
ments. Our method includes an error-correcting proce-
dure that recovers from aforementioned text conversion
errors by 1) reverse engineering the corrupted and man-
gled measured quantities back to their original, correct
form and 2) standardizing this form for storage in an
inverted index and subsequent query processing.
• Using these extracted measured quantities, we show how
to further extract the measured properties to which they
are associated.
• Finally, we present MQSearch: the realization of a
search engine with full support for measured informa-
tion. MQSearch is a facet-based navigation system that
allows users to navigate large document sets based on
measured quantities, measured properties, and the top-
ics and themes to which they are associated. To the best
of our knowledge, no other search engine in existence
fully supports such a capability.
We begin with describing the extraction of measured quan-
tities.
2. MEASURED QUANTITIES
We view measured quantities as a 5-tuple of the form:
(sign, number, error, scientific notation, units), where un-
derlined elements are mandatory and others are optional.
As an example, a team of researchers in Italy recently re-
ported the first direct measurement of gravity’s curvature
as (1.3999 ± 0.003) × 10−5
s−2
m−1
[10]. The corresponding
5-tuple representation of this2
is:
(<empty>, 1.3999, 0.003, 10−5
, s−2
m−1
).
5-tuples such as this are populated using a series of extrac-
tion rules that operate on individual sentences. These rules
fall into four broad categories: 1) pre-processing, 2) units,
3) quantities, and 4) post-processing. Simplified forms of
some of the rules for units and quantities are shown in Ta-
ble 1.3
We refer to the algorithm implementing such rules
as Measured Quantity Extractor or MQE. We begin with
pre-processing rules.
Pre-Processing. As mentioned previously, when extract-
ing text from various document formats (e.g., PDF, MS Of-
fice), characters often appear inconsistently. Minus signs,
2
Since there is no explicit sign in this example, the first
element is left empty.
3
Rules are shown in Perl-like syntax, the de facto standard
for regular expressions.
multiplication signs (e.g., ×, ·), equal-like symbols (e.g., ≈,
, ∼=), degree symbols, and the µ character can appear in
a variety of ways or, in some cases, as “garbage” characters.
For instance, minus can appear as the en dash character or
appear corrupted as ˆae. Pre-processing rules identify these
variations in text and perform the necessary normalization
for accurate extraction of units and quantities.
Units. A measurement unit preceded by a numeric string
conforming to the 5-tuple structure is the base indicator of
a measured quantity. Thus, to identify valid measured quan-
tities, we require a comprehensive ontology of units. We
obtained an initial units ontology from the OBO Foundry,4
but this was quite incomplete. We, then, expanded the on-
tology using largely public sources (e.g., convert-me.com,
DoD technical reports, Physical Review, Nature Commu-
nications). Each unit has an associated rule. An example
rule for m (i.e., symbol for meters) is shown in Rule 5 of
Table 1. Note that such rules include optional prefixes for
submultiples and multiples (e.g., µ before m, kilo before me-
ter). Unit rules, when combined with pre-processing rules
described previously, can accurately extract units under a
range of noisy conditions. For instance, the corrupted unit
mˆae1 is correctly recovered as m−1
by MQE. Finally, as
shown in Rules 6 and 7, compound units are also supported
(e.g., km/h, kilometer per hour, s−2
· m−1
).
Quantities. Like units, quantities (i.e., numbers with op-
tional error ranges and scientific notation) can appear in
a range of ways due to both corruptions and natural vari-
ation. These variations are collectively captured by rules
such as those shown in Table 1 (i.e., Rules 1-4), which pop-
ulate the remainder of the 5-tuple structure. As shown in
Table 1, such rules capture a wide range of quantity formats
(e.g., 10, 000 with a comma, 1.3999±0.003×10−5
with both
an error range and scientific notation, 1.23×105 with lost ex-
ponent in 105
). To support numeric range queries, extracted
quantities are standardized prior to storage in a search en-
gine index (e.g., the extracted quantity 1.3999±0.003×10−5
is stored simply as 0.000013999) [11].
Post-Processing. We have already seen that text extracted
from various document formats can be noisy. For instance,
information from tables, headers, and figures can sometimes
result in seemingly random sequences of numbers and let-
ters in extracted text. In some cases, such information can
erroneously be picked up by aforementioned rules as mea-
sured quantities. This is especially true for single letter units
(e.g., m for meters, A for Ampere, etc.). Post-processing
rules are employed to reject such extractions and minimize
false positives. Examples of such rejection rules include
context-based rules (e.g., reject when preceded by “Table”
or “Figure”), repetition-based rules such as rejecting com-
pound units consisting of repeated single letter units (e.g., 3
AJmm), and allowing a dash only between certain quantities
and units (e.g., 10-cm is okay but not 10-A).
As we will show in Section 4, when used in combination,
these rules collectively enable highly accurate extractions of
measured quantities – which, in turn, can be exploited to
extract the properties being measured, as described next.
4
http://www.obofoundry.org/
Rule Pattern Example Matches
1) number [+−]?(d((d?d?[, ]d{2,3}([, ]d{2,3})*)|d*))(.(d[ds]*d|d))? 1000.05, +5, -0.2, and 1,000
2) number (leading point) [+-]?.d(dd(sd{3})+(sd{1,3})?|d*) -.98, .04, +.755
3) error (s{0,2} ± s{0,2}[d.]+)? ±0.003 in “1.3999 ± 0.003”
4) sci. notation. (s*[eE]|s*([xX× ])s*10 *ˆ? [+-]?d+)? e.g., forms of ×10−5
: ×105, e-5, E-5
5) unit e.g., [fpnµmcdk]?m([ˆ]?[2-6] | [-][1-6]) — m#
normalized to mˆ# µm, m–1 (m−1
), cm2 (cm2
), cmˆ2
6) connector (s?/s? | [Pp]er |-per-| [-s× ·*])? per, /, ·, ×
7) compound unit <unit>(<connector><unit>)+ km/h, kilometer per hour,km·h−1
Table 1: [MQE Rules.] Simplified forms of some rules for extraction of measured quantities.
Pattern Example Matches (two examples shown for each rule)
NP SYM{0,2} EQ mq 1) gravity curvature ζ = 1.4 × 10−5
s−2
m−1
2) floor area ≈ 32m2
mq IN? NP 1) a 352 µm pixel pitch 2) 50mL of 30% fuming sulfuric acid
NP IN DT? NP VP+ (TO|IN|RB|JJ)* mq 1) strength of panel was set to 9 ksi 2) freq. of scans was roughly 300 Hz
NP (IN DT? NP)* VP+ (IN|TO|RB|JJ)* mq 1) pixel pitch employed was 352 µm. 2) panel strength was recorded at 9 ksi.
NP (CC|IN|TO|RB|JJ)* (?mq)? 1) wavelengths of at least 2.4 µm 2) panel strength (9 ksi)
Table 2: [MPE Rules.] Simplified forms of some syntactic patterns to extract measured properties.
3. MEASURED PROPERTIES
We now turn our attention to the extraction of measured
properties. To better illustrate the problem, we show several
example snippets containing measured quantities. In each
example, the measured quantity is shown in blue, the prop-
erty being measured is highlighted in red, and the characters
connecting them are underlined:
• a pixel pitch as high as roughly 352 µm
• a 352 µm pixel pitch
• The pixel pitch employed was 352 µm.
• average gravity curvature ζ =(1.3999±0.003)×10−5
s−2
m−1
• with 50mL of 30% fuming sulfuric acid
• size ∼= 0.1m2
• frequency of longitudinal scan was approximately 300 Hz.
• a nominal current density of 1.3 A/cm2
to 0.03 A/cm2
• panel strength lower than 8.90 ksi (61.4 MPa)
• wavelengths at least 2.4 µm
• large fields of about, or above 10 kV/cm
From just the examples shown, it is easy to see that there
is an extremely high degree of variability in the words con-
necting a measured property with a measured quantity. These
examples represent just a small sample of the many possible
variations. However, upon closer inspection, we find that
this variability can be reduced to a small number of syn-
tactic patterns based on part-of-speech (POS) that capture
most scenarios. Table 2 shows some syntactic patterns that
we employ to extract measured properties. We refer to the
extractor applying such syntactic rules as Measured Property
Extractor or MPE.
In Table 2, noun phrases shown in red (i.e., NP) are
extracted and taken as the measured property. Measured
quantities are represented in blue by mq. The EQ tag rep-
resents all symbols related to ’=’ (e.g., ≈, ). The SYM tag
matches one or two character symbols (e.g., a greek letter).
Other symbols (e.g., JJ, RB, IN, CC, VP) are part-of-speech
tags in Penn Treebank format. Note that tags such as RB
(i.e., an adverb) should be taken to include variations such
as the comparative and superlative forms. This is not explic-
itly shown for reasons of brevity. This small set of patterns
matches a very wide range of possible phrase combinations
for measured properties and are executed sequentially in the
order shown. We implemented MPE using the Brill part-
of-speech tagger [2]. As we will show in the next section,
the accuracy with which our algorithms are able to extract
measured properties and measured quantities is remarkable
— especially given the aforementioned issues with noisy and
corrupted input text.
4. EXPERIMENTAL EVALUATION
Since our research is sponsored by the U.S. Department
of Defense (DoD), we evaluate our approach on a text cor-
pus consisting of 40,807 unclassified research reports pub-
lished in the 2008-2010 time frame and hosted by the De-
fense Technical Information Center (DTIC). This rich col-
lection describes a wide range of research funded by the DoD
spanning numerous fields from engineering and physical sci-
ence to biomedical research and social science. The DTIC
documents considered in this paper have been approved for
public release and unlimited distribution. All documents
are in PDF format, and text was extracted from them us-
ing the pdftotext utility.5
From this collection, we gener-
ated samples using the following procedure. To evaluate the
ability of MQE to extract measured quantities, we sampled
uniform random sentences from the population of all sen-
tences containing a numeric value. By examining sentences
with a number (but not necessarily a measurement unit),
we are able to accurately identify false negatives in addition
to false positives. Next, to evaluate the ability of MPE to
extract measured properties, we generated a random sample
of sentences from the population of all sentences containing
a measured quantity, as identified by MQE. We employed
sample sizes of 1000 and 500 for MQE and MPE, respec-
tively. This produced sufficient 95% confidence bounds on
our estimates for precision and recall over the entire corpus.
Different fields employ different measures in different ways.
By considering sentences sampled randomly in this fashion,
we are able to evaluate our methods on text data that cap-
ture the diverse ways in which measured information is re-
ported across different fields. To the best of our knowledge,
no other approaches exist for extracting such measured in-
formation from scientific and technical documents. Thus,
there are no appropriate baselines against which our meth-
5
http://www.foolabs.com/xpdf/home.html
ods can be compared. Table 3 shows the precision and recall
estimates for both the measured quantity extractor and the
measured property extractor over the entire corpus.
Extractor Precision Recall
MQE (0.93, 0.99) (0.92, 0.99)
MPE (0.93, 0.97) (0.88, 0.94)
Table 3: 95% Confidence Intervals for precision and re-
call when extracting measured quantities (using MQE) and
measured properties (using MPE) from the DTIC corpus.
As can be seen in the table, both MQE and MPE per-
form extraordinarily well in extracting measured quantities
and the properties they describe from documents across dis-
parate fields. Having demonstrated the success with which
measured information can be mined, we now demonstrate
how these extractions can be exploited in novel search ap-
plications.
5. AN APPLICATION: MQSEARCH
Here, we present MQSearch: a realization of a search en-
gine with full support for measured information. MQSearch
is implemented using Apache Solr6
and AJAX Solr7
, both
of which support full-text search, faceted navigation, and
numeric range queries. During the process of indexing and
ingesting the DTIC document set into our search engine, we
apply our extractors to encountered text and store both mea-
sured quantities and measured properties in the search engine
index. In addition, the search engine performs keyphrase ex-
traction on documents using the KERA algorithm described
in [8]. Using Solr filter queries, extracted keyphrases can be
used to produce a tag cloud for any subset of the docu-
ment set. Figure 1 shows the faceted navigation panel of
MQSearch, which allows users to filter documents based
on discovered measurement units, quantity ranges, and mea-
sured properties. In Figure 1, the measurement unit U/mL
is selected. We see that there are 153 documents (out of
roughly 40,000) mentioning this unit with quantities rang-
ing from 0.001 U/mL to 10, 000 U/mL. The property most
frequently measured in U/mL is penicillin. From the tag
cloud, we see that documents containing quantities mea-
sured in U/mL tend to cover topics such as breast cancer
and prostate cancer research.8
The search results can be
filtered further along any of these dimensions. Filtering by
LDA-discovered topics is also supported but not shown in
the figure [9]. To the best of our knowledge, ours is the first
search engine with such support for measured information.
6. CONCLUSION
In this paper, we have proposed a demonstrably effec-
tive approach to extracting measured information from un-
structured text data. We showed both how to extract mea-
sured quantities and the properties being measured. We fur-
ther demonstrated how such extractions might be used in a
search engine for documents rich in measured information.
To the best of our knowledge, no other search engine in ex-
istence supports such functionality. Our extraction methods
6
http://lucene.apache.org/solr/
7
https://github.com/evolvingweb/ajax-solr
8
This cancer research was funded by U.S. Army MRMC
through a congressionally directed research program.
Figure 1: [MQSearch.] The measurement unit
U/mL is selected, which reveals the associated topics
(e.g., breast/prostate cancer), associated measured proper-
ties (e.g., concentrations of penicillin), and associated quan-
tity ranges (i.e., 0.001 to 10,000).
have the potential to substantially improve search, naviga-
tion, and exploratory analysis of large or even massive col-
lections of scientific and technical articles. For future work,
we plan on marrying our proposed approaches with other
well-studied techniques for exploratory search.
7. REFERENCES
[1] A. Bakalov, A. Fuxman, P. P. Talukdar, and S. Chakrabarti.
Scad: collective discovery of attribute values. In WWW ’11.
[2] E. Brill. A Simple Rule-based Part of Speech Tagger. In ANLC
’92.
[3] L. Chiticariu, Y. Li, and F. R. Reiss. Rule-based Information
Extraction is Dead! Long Live Rule-based Information
Extraction Systems! In EMNLP ’13.
[4] D. Davidov and A. Rappoport. Extraction and Approximation
of Numerical Attributes from the Web. In ACL ’10.
[5] M. Fontoura, R. Lempel, R. Qi, and J. Zien. Inverted Index
Support for Numeric Search. Internet Math., 3(2):153–186,
2006.
[6] S. Gupta and C. Manning. SPIED: Stanford Pattern based
Information Extraction and Diagnostics. In Proc. 2014
Workshop on Interactive Language Learning, Visualization,
and Interfaces.
[7] X. Lin, L. Gao, Z. Tang, X. Lin, and X. Hu. Mathematical
Formula Identification in PDF Documents. In ICDAR ’11.
[8] A. S. Maiya, J. P. Thompson, F. L. Lemos, and R. M. Rolfe.
Exploratory Analysis of Highly Heterogeneous Document
Collections. In KDD ’13.
[9] A. K. McCallum. MALLET: A Machine Learning for Language
Toolkit, 2002.
[10] G. Rosi, L. Cacciapuoti, F. Sorrentino, M. Menchetti,
M. Prevedelli, and G. M. Tino. Measurement of the
Gravity-Field Curvature by Atom Interferometry. Physical
Review Letters, 114(1), Jan. 2015.
[11] U. Schindler and M. Diepenbroek. Generic XML-based
Framework for Metadata Portals. Comput. Geosci.,
34(12):1947–1955, Dec. 2008.
[12] H. Seidl, F. I. Informatik, T. Schwentick, and A. Muscholl.
Numerical Document Queries. In PODS 2003. ACM, June
2003.
REPORT DOCUMENTATION PAGE
Form Approved
OMB No. 0704-0188
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching
existing data sources, gathering and maintaining the data needed, and completing and reviewing this collection of information. Send comments regarding this
burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden to Department of Defense, Washington
Headquarters Services, Directorate for Information Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-
4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a
collection of information if it does not display a currently valid OMB control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.
1. REPORT DATE (DD-MM-YY) 2. REPORT TYPE 3. DATES COVERED (From – To)
02/18/2015 Non-Standard
4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER
Mining Measured Information from Text HQ0034-14-D-0001
5b. GRANT NUMBER
5c. PROGRAM ELEMENT NUMBERS
6. AUTHOR(S) 5d. PROJECT NUMBER
Arun S. Maiya, Dale Visser, Andrew Wan AK-5-3153
5e. TASK NUMBER
5f. WORK UNIT NUMBER
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESSES 8. PERFORMING ORGANIZATION REPORT
NUMBER
NS D-5436
H 15-000157
Institute for Defense Analyses
4850 Mark Center Drive
Alexandria, VA 22311-1882
9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR’S / MONITOR’S ACRONYM
OUSD(AT&L/DSB)Brian Hughes Executive Director
Office of the Under Secretary of Defense for Acquisition, Technology, and
Logistics Defense Science Board (AT&L/DSB)
DSB, The Pentagon, OUSD (AT&L) Room 3B888A, Washington, DC 20310
11. SPONSOR’S / MONITOR’S REPORT
NUMBER(S)
12. DISTRIBUTION / AVAILABILITY STATEMENT
13. SUPPLEMENTARY NOTES
Project Leader: Arun S. Maiya
14. ABSTRACT
We present an approach to extract measured information from text (e.g., a 1370 °C melting point, a BMI greater than 29.9
kg/m2
). Such extractions are critically important across a wide range of domains – especially those involving search and
exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured
quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both
common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities
even when significant errors are introduced through the process of converting document formats like PDF to plain text.
Next, we describe an approach to extracting the properties being measured (e.g., the property pixel pitch in the phrase “a
pixel pitch as high as 352 µm”). Finally we present MQSEARCH: the realization of a search engine with full support for
measured information.
15. SUBJECT TERMS
natural language processing, information retrieval, information extraction, text mining
16. SECURITY CLASSIFICATION OF:
17. LIMITATION OF
ABSTRACT
Unlimited
18. NUMBER
OF PAGES
6
19a. NAME OF RESPONSIBLE PERSON
Brian Hughes Executive Director
a. REPORT b. ABSTRACT c. THIS PAGE 19b. TELEPHONE NUMBER (Include Area
Code)
703-571-0079Unclassified Unclassified Unclassified
D-5436

More Related Content

What's hot

Enhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaEnhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaIJDKP
 
Implementation of query optimization for reducing run time
Implementation of query optimization for reducing run timeImplementation of query optimization for reducing run time
Implementation of query optimization for reducing run timeAlexander Decker
 
MAPPING DATA BETWEEN PROBABILITY SPACES IN PROBABILISTIC DATABASES
MAPPING DATA BETWEEN PROBABILITY SPACES IN PROBABILISTIC DATABASESMAPPING DATA BETWEEN PROBABILITY SPACES IN PROBABILISTIC DATABASES
MAPPING DATA BETWEEN PROBABILITY SPACES IN PROBABILISTIC DATABASESijdms
 
Document clustering for forensic analysis
Document clustering for forensic analysisDocument clustering for forensic analysis
Document clustering for forensic analysissrinivasa teja
 
Multidimensioal database
Multidimensioal  databaseMultidimensioal  database
Multidimensioal databaseTPO TPO
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
 
11.query optimization to improve performance of the code execution
11.query optimization to improve performance of the code execution11.query optimization to improve performance of the code execution
11.query optimization to improve performance of the code executionAlexander Decker
 
Query optimization to improve performance of the code execution
Query optimization to improve performance of the code executionQuery optimization to improve performance of the code execution
Query optimization to improve performance of the code executionAlexander Decker
 
QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE cscpconf
 
Performance Evaluation: A Comparative Study of Various Classifiers
Performance Evaluation: A Comparative Study of Various ClassifiersPerformance Evaluation: A Comparative Study of Various Classifiers
Performance Evaluation: A Comparative Study of Various Classifiersamreshkr19
 
Parametric comparison based on split criterion on classification algorithm
Parametric comparison based on split criterion on classification algorithmParametric comparison based on split criterion on classification algorithm
Parametric comparison based on split criterion on classification algorithmIAEME Publication
 
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET Journal
 
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...ijseajournal
 
Converting UML Class Diagrams into Temporal Object Relational DataBase
Converting UML Class Diagrams into Temporal Object Relational DataBase Converting UML Class Diagrams into Temporal Object Relational DataBase
Converting UML Class Diagrams into Temporal Object Relational DataBase IJECEIAES
 
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataMPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataIRJET Journal
 
Optimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database DesignOptimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database DesignWaqas Tariq
 

What's hot (20)

Enhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging areaEnhancement techniques for data warehouse staging area
Enhancement techniques for data warehouse staging area
 
Implementation of query optimization for reducing run time
Implementation of query optimization for reducing run timeImplementation of query optimization for reducing run time
Implementation of query optimization for reducing run time
 
MAPPING DATA BETWEEN PROBABILITY SPACES IN PROBABILISTIC DATABASES
MAPPING DATA BETWEEN PROBABILITY SPACES IN PROBABILISTIC DATABASESMAPPING DATA BETWEEN PROBABILITY SPACES IN PROBABILISTIC DATABASES
MAPPING DATA BETWEEN PROBABILITY SPACES IN PROBABILISTIC DATABASES
 
Document clustering for forensic analysis
Document clustering for forensic analysisDocument clustering for forensic analysis
Document clustering for forensic analysis
 
Multidimensioal database
Multidimensioal  databaseMultidimensioal  database
Multidimensioal database
 
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringA Novel Multi- Viewpoint based Similarity Measure for Document Clustering
A Novel Multi- Viewpoint based Similarity Measure for Document Clustering
 
11.query optimization to improve performance of the code execution
11.query optimization to improve performance of the code execution11.query optimization to improve performance of the code execution
11.query optimization to improve performance of the code execution
 
Query optimization to improve performance of the code execution
Query optimization to improve performance of the code executionQuery optimization to improve performance of the code execution
Query optimization to improve performance of the code execution
 
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
EXECUTION OF ASSOCIATION RULE MINING WITH DATA GRIDS IN WEKA 3.8
 
QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE QUERY INVERSION TO FIND DATA PROVENANCE
QUERY INVERSION TO FIND DATA PROVENANCE
 
Performance Evaluation: A Comparative Study of Various Classifiers
Performance Evaluation: A Comparative Study of Various ClassifiersPerformance Evaluation: A Comparative Study of Various Classifiers
Performance Evaluation: A Comparative Study of Various Classifiers
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
Parametric comparison based on split criterion on classification algorithm
Parametric comparison based on split criterion on classification algorithmParametric comparison based on split criterion on classification algorithm
Parametric comparison based on split criterion on classification algorithm
 
ChemEngine ACS
ChemEngine ACSChemEngine ACS
ChemEngine ACS
 
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
IRJET- Sampling Selection Strategy for Large Scale Deduplication of Synthetic...
 
G0354451
G0354451G0354451
G0354451
 
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
HOLISTIC EVALUATION OF XML QUERIES WITH STRUCTURAL PREFERENCES ON AN ANNOTATE...
 
Converting UML Class Diagrams into Temporal Object Relational DataBase
Converting UML Class Diagrams into Temporal Object Relational DataBase Converting UML Class Diagrams into Temporal Object Relational DataBase
Converting UML Class Diagrams into Temporal Object Relational DataBase
 
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace DataMPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
MPSKM Algorithm to Cluster Uneven Dimensional Time Series Subspace Data
 
Optimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database DesignOptimized Access Strategies for a Distributed Database Design
Optimized Access Strategies for a Distributed Database Design
 

Similar to D-5436

Smart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisSmart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisIRJET Journal
 
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...IOSRjournaljce
 
Automatic Text Classification Of News Blog using Machine Learning
Automatic Text Classification Of News Blog using Machine LearningAutomatic Text Classification Of News Blog using Machine Learning
Automatic Text Classification Of News Blog using Machine LearningIRJET Journal
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...theijes
 
A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET Journal
 
Query Processing with k-Anonymity
Query Processing with k-AnonymityQuery Processing with k-Anonymity
Query Processing with k-AnonymityWaqas Tariq
 
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESA SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESIJCSES Journal
 
IRJET- A Survey on Searching of Keyword on Encrypted Data in Cloud using ...
IRJET-  	  A Survey on Searching of Keyword on Encrypted Data in Cloud using ...IRJET-  	  A Survey on Searching of Keyword on Encrypted Data in Cloud using ...
IRJET- A Survey on Searching of Keyword on Encrypted Data in Cloud using ...IRJET Journal
 
Text Document Classification System
Text Document Classification SystemText Document Classification System
Text Document Classification SystemIRJET Journal
 
Ijsrdv1 i2039
Ijsrdv1 i2039Ijsrdv1 i2039
Ijsrdv1 i2039ijsrd.com
 
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)IRJET Journal
 
IRJET- Data Mining - Secure Keyword Manager
IRJET- Data Mining - Secure Keyword ManagerIRJET- Data Mining - Secure Keyword Manager
IRJET- Data Mining - Secure Keyword ManagerIRJET Journal
 
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data SetsHortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data SetsIJMER
 
A tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataA tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataredpel dot com
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismseSAT Journals
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsIRJET Journal
 

Similar to D-5436 (20)

Smart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend AnalysisSmart E-Logistics for SCM Spend Analysis
Smart E-Logistics for SCM Spend Analysis
 
Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...Different Classification Technique for Data mining in Insurance Industry usin...
Different Classification Technique for Data mining in Insurance Industry usin...
 
130509
130509130509
130509
 
130509
130509130509
130509
 
Automatic Text Classification Of News Blog using Machine Learning
Automatic Text Classification Of News Blog using Machine LearningAutomatic Text Classification Of News Blog using Machine Learning
Automatic Text Classification Of News Blog using Machine Learning
 
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
Applying K-Means Clustering Algorithm to Discover Knowledge from Insurance Da...
 
A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching technique
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
 
rscript_paper-1
rscript_paper-1rscript_paper-1
rscript_paper-1
 
Query Processing with k-Anonymity
Query Processing with k-AnonymityQuery Processing with k-Anonymity
Query Processing with k-Anonymity
 
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIESA SURVEY ON DATA MINING IN STEEL INDUSTRIES
A SURVEY ON DATA MINING IN STEEL INDUSTRIES
 
IRJET- A Survey on Searching of Keyword on Encrypted Data in Cloud using ...
IRJET-  	  A Survey on Searching of Keyword on Encrypted Data in Cloud using ...IRJET-  	  A Survey on Searching of Keyword on Encrypted Data in Cloud using ...
IRJET- A Survey on Searching of Keyword on Encrypted Data in Cloud using ...
 
Text Document Classification System
Text Document Classification SystemText Document Classification System
Text Document Classification System
 
Ijsrdv1 i2039
Ijsrdv1 i2039Ijsrdv1 i2039
Ijsrdv1 i2039
 
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)
IRJET- Top-K Query Processing using Top Order Preserving Encryption (TOPE)
 
IRJET- Data Mining - Secure Keyword Manager
IRJET- Data Mining - Secure Keyword ManagerIRJET- Data Mining - Secure Keyword Manager
IRJET- Data Mining - Secure Keyword Manager
 
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data SetsHortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data Sets
 
A tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big dataA tutorial on secure outsourcing of large scalecomputation for big data
A tutorial on secure outsourcing of large scalecomputation for big data
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive Graphs
 

D-5436

  • 1. I NS TITUTE FOR DE FE NSE ANAL YSE S Mining Measured Information from Text Arun S. Maiya Dale Visser Andrew Wan February 18, 2015 Approved for public release; distribution is unlimited. IDA Non-Standard D-5436 Log: H 15-000157 Copy INSTITUTE FOR DEFENSE ANALYSES 4850 Mark Center Drive Alexandria, Virginia 22311-1882
  • 2. About This Publication This work was conducted by the Institute for Defense Analyses (IDA) under contract HQ0034- 14-D-0001, Task AK-5-3153, “DAMO Implementation of Phase 3 Operational Capabilities,” for Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics Defense Science Board (AT&L/DSB). The views, opinions, and findings should not be construed as representing the official position of either the Department of Defense or the sponsoring organization. Copyright Notice © 2015 Institute for Defense Analyses 4850 Mark Center Drive, Alexandria, Virginia 22311-1882 • (703) 845-2000. Acknowledgments «Reviewers» This material may be reproduced by or for the U.S. Government pursuant to the copyright license under the clause at DFARS 252.227-7013 (a)(16) [Jun 2013].
  • 3. Mining Measured Information from Text Arun S. Maiya, Dale Visser, and Andrew Wan Institute for Defense Analyses — Alexandria, VA, USA {amaiya, dvisser, awan}@ida.org ABSTRACT We present an approach to extract measured information from text (e.g., a 1370 ◦ C melting point, a BMI greater than 29.9 kg/m2 ). Such extractions are critically important across a wide range of domains — especially those involv- ing search and exploration of scientific and technical doc- uments. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehen- sive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are intro- duced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property pixel pitch in the phrase “a pixel pitch as high as 352 µm”). Finally, we present MQSearch: the realization of a search engine with full support for measured information. Categories and Subject Descriptors I.2.7 [Artificial Intelligence]: Natural Language Process- ing—Text Analysis; H.3.3 [Information Storage and Re- trieval]: Information Search and Retrieval—Search Process Keywords text mining, information retrieval, information extraction, measured quantities, numerical queries 1. INTRODUCTION AND MOTIVATION Scientific and technical documents describe methods and results using measured quantities: a numeric value paired with a unit of measurement. Examples of text snippets con- taining such measured quantities include: • average gravity curvature ζ = (1.3999±0.003)×10−5 s−2 m−1 • 12 ◦ C melting point • distance from Earth to the Sun is 9.3 × 107 miles Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. SIGIR ’15, August 9–13, 2015, Santiago, Chile. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-3621-5/15/08...$15.00 DOI: http://dx.doi.org/10.1145/2766462.2767789. • average responsivity as low as 6.2 pA/K Note that these measured quantities (e.g., 6.2 pA/K) are typically associated with a specific measured property (e.g., av- erage responsivity). In this paper, we study ways in which to extract these kinds of measured information from doc- uments.1 The mining of such information is critically im- portant across many domains — especially those involving search and exploration of scientific and technical articles. For instance, an optics researcher may wish to know if the performance of Nd:YAG laser-pumped KTP parametric os- cillators has ever been tested at wavelengths longer than 2.4 µm. Full-text search engines using inverted indexes al- low ad hoc queries on terms such as “KTP parametric oscil- lator”, but the ability to further filter search results based on wavelengths greater than 2.4 µm is not typically sup- ported. To accomplish this, one must first identify and ex- tract valid measured quantities (e.g., 2.4 µm) in unstruc- tured text and, then, identify and extract the properties be- ing measured (e.g., wavelength). These extractions could then be stored in the index of a search engine in a way that supports subsequent document queries on measured infor- mation (e.g., faceted navigation, numeric range queries). Surprisingly, there is very little existing work on how best to realize this process. Lines of research most closely re- lated to the present work include extracting numerical at- tributes (e.g., [1,4]), supporting numerical document queries (e.g., [5,12]), and formula identification (e.g., [7]). However, none of these existing works address the comprehensive ex- traction of and search for measured information in document data, as described above. Indeed, numerous challenges exist in such scenarios. Many widely-used, full-text search engines (e.g., Apache Solr) convert the original document format to plain text prior to indexing and storage — an extremely error-ridden process. For instance, in the extracted text, exponents are typically lost (e.g., 105 becomes 105, s−2 be- comes s–2). Moreover, the conversion of some characters can be highly inconsistent and unpredictable. A simple minus sign can be converted to a range of different dash characters or even “garbage” characters. The same is true for other symbols such as µ, multiplication signs, and degree sym- bols. It is this inconsistent and error-ridden text, then, that is ultimately stored in the index of the search engine mak- ing it virtually impossible to adequately locate documents by measured quantities. Without the correct identification of measured quantities, it is virtually impossible to identify properties being measured, which are critical in efficiently 1 We define measured information as measured quantities and the measured properties to which they are associated.
  • 4. navigating scientific and technical articles for state-of-the- art information. In general, there is a great deal of het- erogeneity in how measured quantities and measured proper- ties appear in text – both naturally and through corruption. This, then, motivates the current investigation of how best to extract such information. Recent studies [3, 6] have revealed that rule-based ap- proaches to information extraction tend to be more effec- tive, interpretable, and customizable than state-of-the-art machine learning approaches. We employ rule-based extrac- tion methods in this work. Our contributions are as follows: • We propose and describe a rule-based entity extractor to identify measured quantities in unstructured text docu- ments. Our method includes an error-correcting proce- dure that recovers from aforementioned text conversion errors by 1) reverse engineering the corrupted and man- gled measured quantities back to their original, correct form and 2) standardizing this form for storage in an inverted index and subsequent query processing. • Using these extracted measured quantities, we show how to further extract the measured properties to which they are associated. • Finally, we present MQSearch: the realization of a search engine with full support for measured informa- tion. MQSearch is a facet-based navigation system that allows users to navigate large document sets based on measured quantities, measured properties, and the top- ics and themes to which they are associated. To the best of our knowledge, no other search engine in existence fully supports such a capability. We begin with describing the extraction of measured quan- tities. 2. MEASURED QUANTITIES We view measured quantities as a 5-tuple of the form: (sign, number, error, scientific notation, units), where un- derlined elements are mandatory and others are optional. As an example, a team of researchers in Italy recently re- ported the first direct measurement of gravity’s curvature as (1.3999 ± 0.003) × 10−5 s−2 m−1 [10]. The corresponding 5-tuple representation of this2 is: (<empty>, 1.3999, 0.003, 10−5 , s−2 m−1 ). 5-tuples such as this are populated using a series of extrac- tion rules that operate on individual sentences. These rules fall into four broad categories: 1) pre-processing, 2) units, 3) quantities, and 4) post-processing. Simplified forms of some of the rules for units and quantities are shown in Ta- ble 1.3 We refer to the algorithm implementing such rules as Measured Quantity Extractor or MQE. We begin with pre-processing rules. Pre-Processing. As mentioned previously, when extract- ing text from various document formats (e.g., PDF, MS Of- fice), characters often appear inconsistently. Minus signs, 2 Since there is no explicit sign in this example, the first element is left empty. 3 Rules are shown in Perl-like syntax, the de facto standard for regular expressions. multiplication signs (e.g., ×, ·), equal-like symbols (e.g., ≈, , ∼=), degree symbols, and the µ character can appear in a variety of ways or, in some cases, as “garbage” characters. For instance, minus can appear as the en dash character or appear corrupted as ˆae. Pre-processing rules identify these variations in text and perform the necessary normalization for accurate extraction of units and quantities. Units. A measurement unit preceded by a numeric string conforming to the 5-tuple structure is the base indicator of a measured quantity. Thus, to identify valid measured quan- tities, we require a comprehensive ontology of units. We obtained an initial units ontology from the OBO Foundry,4 but this was quite incomplete. We, then, expanded the on- tology using largely public sources (e.g., convert-me.com, DoD technical reports, Physical Review, Nature Commu- nications). Each unit has an associated rule. An example rule for m (i.e., symbol for meters) is shown in Rule 5 of Table 1. Note that such rules include optional prefixes for submultiples and multiples (e.g., µ before m, kilo before me- ter). Unit rules, when combined with pre-processing rules described previously, can accurately extract units under a range of noisy conditions. For instance, the corrupted unit mˆae1 is correctly recovered as m−1 by MQE. Finally, as shown in Rules 6 and 7, compound units are also supported (e.g., km/h, kilometer per hour, s−2 · m−1 ). Quantities. Like units, quantities (i.e., numbers with op- tional error ranges and scientific notation) can appear in a range of ways due to both corruptions and natural vari- ation. These variations are collectively captured by rules such as those shown in Table 1 (i.e., Rules 1-4), which pop- ulate the remainder of the 5-tuple structure. As shown in Table 1, such rules capture a wide range of quantity formats (e.g., 10, 000 with a comma, 1.3999±0.003×10−5 with both an error range and scientific notation, 1.23×105 with lost ex- ponent in 105 ). To support numeric range queries, extracted quantities are standardized prior to storage in a search en- gine index (e.g., the extracted quantity 1.3999±0.003×10−5 is stored simply as 0.000013999) [11]. Post-Processing. We have already seen that text extracted from various document formats can be noisy. For instance, information from tables, headers, and figures can sometimes result in seemingly random sequences of numbers and let- ters in extracted text. In some cases, such information can erroneously be picked up by aforementioned rules as mea- sured quantities. This is especially true for single letter units (e.g., m for meters, A for Ampere, etc.). Post-processing rules are employed to reject such extractions and minimize false positives. Examples of such rejection rules include context-based rules (e.g., reject when preceded by “Table” or “Figure”), repetition-based rules such as rejecting com- pound units consisting of repeated single letter units (e.g., 3 AJmm), and allowing a dash only between certain quantities and units (e.g., 10-cm is okay but not 10-A). As we will show in Section 4, when used in combination, these rules collectively enable highly accurate extractions of measured quantities – which, in turn, can be exploited to extract the properties being measured, as described next. 4 http://www.obofoundry.org/
  • 5. Rule Pattern Example Matches 1) number [+−]?(d((d?d?[, ]d{2,3}([, ]d{2,3})*)|d*))(.(d[ds]*d|d))? 1000.05, +5, -0.2, and 1,000 2) number (leading point) [+-]?.d(dd(sd{3})+(sd{1,3})?|d*) -.98, .04, +.755 3) error (s{0,2} ± s{0,2}[d.]+)? ±0.003 in “1.3999 ± 0.003” 4) sci. notation. (s*[eE]|s*([xX× ])s*10 *ˆ? [+-]?d+)? e.g., forms of ×10−5 : ×105, e-5, E-5 5) unit e.g., [fpnµmcdk]?m([ˆ]?[2-6] | [-][1-6]) — m# normalized to mˆ# µm, m–1 (m−1 ), cm2 (cm2 ), cmˆ2 6) connector (s?/s? | [Pp]er |-per-| [-s× ·*])? per, /, ·, × 7) compound unit <unit>(<connector><unit>)+ km/h, kilometer per hour,km·h−1 Table 1: [MQE Rules.] Simplified forms of some rules for extraction of measured quantities. Pattern Example Matches (two examples shown for each rule) NP SYM{0,2} EQ mq 1) gravity curvature ζ = 1.4 × 10−5 s−2 m−1 2) floor area ≈ 32m2 mq IN? NP 1) a 352 µm pixel pitch 2) 50mL of 30% fuming sulfuric acid NP IN DT? NP VP+ (TO|IN|RB|JJ)* mq 1) strength of panel was set to 9 ksi 2) freq. of scans was roughly 300 Hz NP (IN DT? NP)* VP+ (IN|TO|RB|JJ)* mq 1) pixel pitch employed was 352 µm. 2) panel strength was recorded at 9 ksi. NP (CC|IN|TO|RB|JJ)* (?mq)? 1) wavelengths of at least 2.4 µm 2) panel strength (9 ksi) Table 2: [MPE Rules.] Simplified forms of some syntactic patterns to extract measured properties. 3. MEASURED PROPERTIES We now turn our attention to the extraction of measured properties. To better illustrate the problem, we show several example snippets containing measured quantities. In each example, the measured quantity is shown in blue, the prop- erty being measured is highlighted in red, and the characters connecting them are underlined: • a pixel pitch as high as roughly 352 µm • a 352 µm pixel pitch • The pixel pitch employed was 352 µm. • average gravity curvature ζ =(1.3999±0.003)×10−5 s−2 m−1 • with 50mL of 30% fuming sulfuric acid • size ∼= 0.1m2 • frequency of longitudinal scan was approximately 300 Hz. • a nominal current density of 1.3 A/cm2 to 0.03 A/cm2 • panel strength lower than 8.90 ksi (61.4 MPa) • wavelengths at least 2.4 µm • large fields of about, or above 10 kV/cm From just the examples shown, it is easy to see that there is an extremely high degree of variability in the words con- necting a measured property with a measured quantity. These examples represent just a small sample of the many possible variations. However, upon closer inspection, we find that this variability can be reduced to a small number of syn- tactic patterns based on part-of-speech (POS) that capture most scenarios. Table 2 shows some syntactic patterns that we employ to extract measured properties. We refer to the extractor applying such syntactic rules as Measured Property Extractor or MPE. In Table 2, noun phrases shown in red (i.e., NP) are extracted and taken as the measured property. Measured quantities are represented in blue by mq. The EQ tag rep- resents all symbols related to ’=’ (e.g., ≈, ). The SYM tag matches one or two character symbols (e.g., a greek letter). Other symbols (e.g., JJ, RB, IN, CC, VP) are part-of-speech tags in Penn Treebank format. Note that tags such as RB (i.e., an adverb) should be taken to include variations such as the comparative and superlative forms. This is not explic- itly shown for reasons of brevity. This small set of patterns matches a very wide range of possible phrase combinations for measured properties and are executed sequentially in the order shown. We implemented MPE using the Brill part- of-speech tagger [2]. As we will show in the next section, the accuracy with which our algorithms are able to extract measured properties and measured quantities is remarkable — especially given the aforementioned issues with noisy and corrupted input text. 4. EXPERIMENTAL EVALUATION Since our research is sponsored by the U.S. Department of Defense (DoD), we evaluate our approach on a text cor- pus consisting of 40,807 unclassified research reports pub- lished in the 2008-2010 time frame and hosted by the De- fense Technical Information Center (DTIC). This rich col- lection describes a wide range of research funded by the DoD spanning numerous fields from engineering and physical sci- ence to biomedical research and social science. The DTIC documents considered in this paper have been approved for public release and unlimited distribution. All documents are in PDF format, and text was extracted from them us- ing the pdftotext utility.5 From this collection, we gener- ated samples using the following procedure. To evaluate the ability of MQE to extract measured quantities, we sampled uniform random sentences from the population of all sen- tences containing a numeric value. By examining sentences with a number (but not necessarily a measurement unit), we are able to accurately identify false negatives in addition to false positives. Next, to evaluate the ability of MPE to extract measured properties, we generated a random sample of sentences from the population of all sentences containing a measured quantity, as identified by MQE. We employed sample sizes of 1000 and 500 for MQE and MPE, respec- tively. This produced sufficient 95% confidence bounds on our estimates for precision and recall over the entire corpus. Different fields employ different measures in different ways. By considering sentences sampled randomly in this fashion, we are able to evaluate our methods on text data that cap- ture the diverse ways in which measured information is re- ported across different fields. To the best of our knowledge, no other approaches exist for extracting such measured in- formation from scientific and technical documents. Thus, there are no appropriate baselines against which our meth- 5 http://www.foolabs.com/xpdf/home.html
  • 6. ods can be compared. Table 3 shows the precision and recall estimates for both the measured quantity extractor and the measured property extractor over the entire corpus. Extractor Precision Recall MQE (0.93, 0.99) (0.92, 0.99) MPE (0.93, 0.97) (0.88, 0.94) Table 3: 95% Confidence Intervals for precision and re- call when extracting measured quantities (using MQE) and measured properties (using MPE) from the DTIC corpus. As can be seen in the table, both MQE and MPE per- form extraordinarily well in extracting measured quantities and the properties they describe from documents across dis- parate fields. Having demonstrated the success with which measured information can be mined, we now demonstrate how these extractions can be exploited in novel search ap- plications. 5. AN APPLICATION: MQSEARCH Here, we present MQSearch: a realization of a search en- gine with full support for measured information. MQSearch is implemented using Apache Solr6 and AJAX Solr7 , both of which support full-text search, faceted navigation, and numeric range queries. During the process of indexing and ingesting the DTIC document set into our search engine, we apply our extractors to encountered text and store both mea- sured quantities and measured properties in the search engine index. In addition, the search engine performs keyphrase ex- traction on documents using the KERA algorithm described in [8]. Using Solr filter queries, extracted keyphrases can be used to produce a tag cloud for any subset of the docu- ment set. Figure 1 shows the faceted navigation panel of MQSearch, which allows users to filter documents based on discovered measurement units, quantity ranges, and mea- sured properties. In Figure 1, the measurement unit U/mL is selected. We see that there are 153 documents (out of roughly 40,000) mentioning this unit with quantities rang- ing from 0.001 U/mL to 10, 000 U/mL. The property most frequently measured in U/mL is penicillin. From the tag cloud, we see that documents containing quantities mea- sured in U/mL tend to cover topics such as breast cancer and prostate cancer research.8 The search results can be filtered further along any of these dimensions. Filtering by LDA-discovered topics is also supported but not shown in the figure [9]. To the best of our knowledge, ours is the first search engine with such support for measured information. 6. CONCLUSION In this paper, we have proposed a demonstrably effec- tive approach to extracting measured information from un- structured text data. We showed both how to extract mea- sured quantities and the properties being measured. We fur- ther demonstrated how such extractions might be used in a search engine for documents rich in measured information. To the best of our knowledge, no other search engine in ex- istence supports such functionality. Our extraction methods 6 http://lucene.apache.org/solr/ 7 https://github.com/evolvingweb/ajax-solr 8 This cancer research was funded by U.S. Army MRMC through a congressionally directed research program. Figure 1: [MQSearch.] The measurement unit U/mL is selected, which reveals the associated topics (e.g., breast/prostate cancer), associated measured proper- ties (e.g., concentrations of penicillin), and associated quan- tity ranges (i.e., 0.001 to 10,000). have the potential to substantially improve search, naviga- tion, and exploratory analysis of large or even massive col- lections of scientific and technical articles. For future work, we plan on marrying our proposed approaches with other well-studied techniques for exploratory search. 7. REFERENCES [1] A. Bakalov, A. Fuxman, P. P. Talukdar, and S. Chakrabarti. Scad: collective discovery of attribute values. In WWW ’11. [2] E. Brill. A Simple Rule-based Part of Speech Tagger. In ANLC ’92. [3] L. Chiticariu, Y. Li, and F. R. Reiss. Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems! In EMNLP ’13. [4] D. Davidov and A. Rappoport. Extraction and Approximation of Numerical Attributes from the Web. In ACL ’10. [5] M. Fontoura, R. Lempel, R. Qi, and J. Zien. Inverted Index Support for Numeric Search. Internet Math., 3(2):153–186, 2006. [6] S. Gupta and C. Manning. SPIED: Stanford Pattern based Information Extraction and Diagnostics. In Proc. 2014 Workshop on Interactive Language Learning, Visualization, and Interfaces. [7] X. Lin, L. Gao, Z. Tang, X. Lin, and X. Hu. Mathematical Formula Identification in PDF Documents. In ICDAR ’11. [8] A. S. Maiya, J. P. Thompson, F. L. Lemos, and R. M. Rolfe. Exploratory Analysis of Highly Heterogeneous Document Collections. In KDD ’13. [9] A. K. McCallum. MALLET: A Machine Learning for Language Toolkit, 2002. [10] G. Rosi, L. Cacciapuoti, F. Sorrentino, M. Menchetti, M. Prevedelli, and G. M. Tino. Measurement of the Gravity-Field Curvature by Atom Interferometry. Physical Review Letters, 114(1), Jan. 2015. [11] U. Schindler and M. Diepenbroek. Generic XML-based Framework for Metadata Portals. Comput. Geosci., 34(12):1947–1955, Dec. 2008. [12] H. Seidl, F. I. Informatik, T. Schwentick, and A. Muscholl. Numerical Document Queries. In PODS 2003. ACM, June 2003.
  • 7. REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing this collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports (0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202- 4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS. 1. REPORT DATE (DD-MM-YY) 2. REPORT TYPE 3. DATES COVERED (From – To) 02/18/2015 Non-Standard 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER Mining Measured Information from Text HQ0034-14-D-0001 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBERS 6. AUTHOR(S) 5d. PROJECT NUMBER Arun S. Maiya, Dale Visser, Andrew Wan AK-5-3153 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESSES 8. PERFORMING ORGANIZATION REPORT NUMBER NS D-5436 H 15-000157 Institute for Defense Analyses 4850 Mark Center Drive Alexandria, VA 22311-1882 9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR’S / MONITOR’S ACRONYM OUSD(AT&L/DSB)Brian Hughes Executive Director Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics Defense Science Board (AT&L/DSB) DSB, The Pentagon, OUSD (AT&L) Room 3B888A, Washington, DC 20310 11. SPONSOR’S / MONITOR’S REPORT NUMBER(S) 12. DISTRIBUTION / AVAILABILITY STATEMENT 13. SUPPLEMENTARY NOTES Project Leader: Arun S. Maiya 14. ABSTRACT We present an approach to extract measured information from text (e.g., a 1370 °C melting point, a BMI greater than 29.9 kg/m2 ). Such extractions are critically important across a wide range of domains – especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property pixel pitch in the phrase “a pixel pitch as high as 352 µm”). Finally we present MQSEARCH: the realization of a search engine with full support for measured information. 15. SUBJECT TERMS natural language processing, information retrieval, information extraction, text mining 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT Unlimited 18. NUMBER OF PAGES 6 19a. NAME OF RESPONSIBLE PERSON Brian Hughes Executive Director a. REPORT b. ABSTRACT c. THIS PAGE 19b. TELEPHONE NUMBER (Include Area Code) 703-571-0079Unclassified Unclassified Unclassified