Physical quantity search with the simplicity of
keyword searching
Daniel Lowe
IC-SDV 2019, Nice, France 9th April 2019
IC-SDV 2019: Physical Quantity Search with the Simplicity of Keyword Searching - Daniel Lowe (Minesoft, UK)

When searching across full text patent data, the numerical data given in a document is often at least as important in determining a document’s relevancy as the keywords present. Chemists or metallurgists may be interested in compositions or alloys citing specific amounts or ranges of a certain metal or compound, and engineers may be interested in specific dimensions, current measurements or temperature ranges. Unfortunately, searching comprehensively for physical quantities using conventional text-searching is practically impossible as many lexically distinct quantities can be matches. The fact that measurements of the same type may use different units further complicates matters.
Minesoft have been working on the problem of facilitating searches including physical quantity criteria, and here we report on our success with using text-mining to automate the identification and interpretation of quantities in patents in our new PatDocs tool.
All indexed terms and user queries are converted into ranges in standardized units; for example, “>5 to ≤10 miles per hour” is interpreted as the range (2.2352, 4.4704] m/s. Rather than forcing the user to learn a search-engine-specific syntax, queries are written in the same formats that appear in actual documents. The search then finds all indexed ranges, with the same standardized units, that intersect with the user’s query range.
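The intersection test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Range` class, the `intersects` function and the `MPH_TO_M_PER_S` constant are hypothetical names introduced here, not part of PatDocs.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Range:
    """A numeric range in a standardized unit; each bound may be open or closed."""
    lo: float
    hi: float
    lo_inclusive: bool = True
    hi_inclusive: bool = True


def intersects(a: Range, b: Range) -> bool:
    """Return True if ranges a and b share at least one value."""
    if a.hi < b.lo or b.hi < a.lo:
        return False  # one range ends before the other begins
    if a.hi == b.lo:
        return a.hi_inclusive and b.lo_inclusive  # ranges touch at a single point
    if b.hi == a.lo:
        return b.hi_inclusive and a.lo_inclusive
    return True


MPH_TO_M_PER_S = 0.44704  # 1 mile = 1609.344 m, 1 hour = 3600 s

# ">5 to ≤10 miles per hour" becomes (2.2352, 4.4704] m/s
query = Range(5 * MPH_TO_M_PER_S, 10 * MPH_TO_M_PER_S, lo_inclusive=False)

print(intersects(query, Range(3.0, 3.0)))       # a single value inside the query
print(intersects(query, Range(0.0, query.lo)))  # only touches the open lower bound
print(intersects(query, Range(4.0, math.inf, hi_inclusive=False)))  # unbounded range
```

Tracking whether each bound is inclusive matters only when two ranges meet at exactly one point, which is why the touching cases are handled separately.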
It is also important to know what a quantity is referring to. For specific cases, such as alloy compositions, this is captured during the text-mining, e.g. “2 wt% Fe” refers specifically to a weight percentage of iron. For the general case, we now allow searching for quantities in close proximity to arbitrary phrases, or even other quantities. The tool also assists the user by showing the context in which their query matched, and allows quantity queries to be combined with metadata queries.


  1. Physical quantity search with the simplicity of keyword searching
     Daniel Lowe
     IC-SDV 2019, Nice, France, 9th April 2019
  2. Minesoft - who are we?
  3. Inadequacies of Boolean searching
     • Find an alloy containing 1-5% vanadium; a match is any range that overlaps
       – Many semantically distinct matches
       – Multiple ways of writing the same thing
         • 2-7%, 2 to 7 %, 2.0% to 7.0%, between 2 and 7%
         • 3.3%, 3.3 percent
         • ≤ 8%, less than or equal to 8%, no greater than 8%
  4. Same quantity, many units
     • Find a clamping device which exerts a pressure of 5 MPa; example matches:
       – about 2 to 50 bars
       – between 0.1 MPa and 10 MPa
       – at least 50 p.s.i.
       – 20 to 300 kgf/cm²
       – at least 10⁵ Newtons/m²
       – 1 N/mm² to 30 N/mm²
  5. Implementation
     [Architecture diagram; labels: Patents, Text-mining, Normalized quantities, SQL database, Apache Solr, Custom Solr tokenizer (quantities normalized), Client query, Custom Solr parser (quantity normalized), Relevant quantities retrieved]
  6. Text-mining
     [Diagram; labels: Formal grammar rules, State-machine (for recognition), Parser (for normalization)]
     • Same rules used for recognition and parsing
     • Rules describe syntax of quantities and map units to their semantic interpretation
  7. Ways of expressing quantities
     Quantity type        Example      Standardized form
     Single value         5 kg         [5,5] kg
     Bounded range        1-2 kg       [1,2] kg
     Unbounded range      >5 kg        (5,) kg
     Measurement error    4 ± 1 kg     [3,5] kg
     • Practically all quantities can be expressed as ranges
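To make the table in slide 7 concrete, here is a toy Python sketch that maps its four example forms onto ranges. It uses a handful of regular expressions purely for illustration; the slides describe formal grammar rules with a state machine and parser, which this does not attempt to reproduce. The function name `to_range` and the tuple layout are assumptions made for this example.

```python
import math
import re

NUM = r"(\d+(?:\.\d+)?)"   # plain decimal number
UNIT = r"([^\d\s]\S*)"     # unit token: starts with a non-digit, no spaces


def to_range(text: str):
    """Return (lo, hi, lo_inclusive, hi_inclusive, unit) for a few simple forms."""
    text = text.strip()
    m = re.fullmatch(rf"{NUM}\s*±\s*{NUM}\s*{UNIT}", text)
    if m:  # measurement error: value ± error
        value, err = float(m[1]), float(m[2])
        return (value - err, value + err, True, True, m[3])
    m = re.fullmatch(rf"{NUM}\s*-\s*{NUM}\s*{UNIT}", text)
    if m:  # bounded range: lo-hi
        return (float(m[1]), float(m[2]), True, True, m[3])
    m = re.fullmatch(rf">\s*{NUM}\s*{UNIT}", text)
    if m:  # unbounded range: open lower bound, no upper bound
        return (float(m[1]), math.inf, False, False, m[2])
    m = re.fullmatch(rf"{NUM}\s*{UNIT}", text)
    if m:  # single value: degenerate closed range
        return (float(m[1]), float(m[1]), True, True, m[2])
    raise ValueError(f"unrecognized quantity: {text!r}")


for example in ["5 kg", "1-2 kg", ">5 kg", "4 ± 1 kg"]:
    print(example, "->", to_range(example))
```

Representing "no upper bound" as `math.inf` keeps every quantity in the same shape, so later comparisons need no special cases.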
  8. Value variants
     • 1000000 = 1,000,000 = one million = 1E6 = 1×10⁶
     • 1.25 = 1,25
     • 0.5 = ½ ≟ 1/2
     • 120-125 = 120-5
  9. Unit variants
     • Same unit can have many written forms:
       – g = gram = gramme = gm = gr
     • Many different units of same type e.g. mass
       – SI multiples: kg, kilogram
       – Non-SI units: pound (lb), ounce (oz), stone
     • Units are grouped into about 60 types
  10. Unit normalization
      • Converted into SI units
      • Where possible, composite units normalized e.g. 1 N/m² = 1 Pa (Force / Area = Pressure)
      • Note: Units are not normalized by dimensional analysis as distinct units can have identical dimensional analysis
        – Torque = N·m/rad = kg⋅m²⋅s⁻²
        – Energy = J = kg⋅m²⋅s⁻²
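One way to picture normalization by unit type rather than by dimensional analysis is a lookup table keyed on the written unit. The sketch below is an assumption-laden illustration: the `UNITS` table, its entries and the `normalize` function are invented for this example, and the psi factor is approximate.

```python
# Hypothetical lookup: written form -> (unit type, SI unit, factor to SI).
UNITS = {
    "MPa":     ("pressure", "Pa", 1e6),
    "bar":     ("pressure", "Pa", 1e5),
    "bars":    ("pressure", "Pa", 1e5),
    "psi":     ("pressure", "Pa", 6894.757293168),  # approximate
    "p.s.i.":  ("pressure", "Pa", 6894.757293168),
    "kgf/cm2": ("pressure", "Pa", 98066.5),
    "N/m2":    ("pressure", "Pa", 1.0),
    "N/mm2":   ("pressure", "Pa", 1e6),
    # Same dimensions (kg·m²·s⁻²) but different unit types, so never merged:
    "J":       ("energy", "J", 1.0),
    "N·m":     ("torque", "N·m", 1.0),
}


def normalize(lo: float, hi: float, unit: str):
    """Convert a range to its SI unit, keeping the unit type alongside it."""
    unit_type, si_unit, factor = UNITS[unit]
    return unit_type, si_unit, lo * factor, hi * factor


# "about 2 to 50 bars" and a query for 5 MPa end up in the same unit.
print(normalize(2, 50, "bars"))
print(normalize(5, 5, "MPa"))
print(normalize(1, 1, "J")[0] == normalize(1, 1, "N·m")[0])  # types differ
```

Because the unit type travels with the normalized value, a torque is never compared against an energy even though both reduce to the same base dimensions.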
  11. • Essential for dimensionless quantities e.g. BMI, LogP, pH, pKI
      • Useful for distinguishing intent e.g. melting point, boiling point, flash point
      • More precise than proximity searching, but can’t cover all use cases
      [Annotated examples: “melting point ranging from about 105° C. to about 125° C.”; “pKa of 2.0 or less”; “mp 200°-220°C”. Annotation labels: Quantity type, Lower bound value + unit, Upper bound value + unit, Upper bound value]
  12. Database details
      • All quantities are stored as a range and normalized unit in the database
        – 230-500 mm/second → [0.23,0.5] m/s
      • Ranges indicate whether they are inclusive or exclusive
      • 16.7 million US patents, but after normalization, only 4.4 million distinct range-unit pairs actually observed
  13. Solr details
      • Tokenization modified such that each quantity is a single token, e.g. “potential of 1.6 V or less”
      Token position    Value(s)
      0                 potential
      1                 of
      2                 1.6 V or less
                        Code for range + unit
                        Code for bin covered by quantity
                        Code for bin covered by quantity
                        Code for bin covered by quantity
                        ….
  14. Range binning
      • A broad range search, e.g. ≥ 1 gram, would still match 100,000s of distinct ranges
        – Problematic for proximity searching
      • Split -∞ to +∞ into a finite number of ranges (bins), and index each quantity with whether it fully or partially overlaps each bin
      • A broad search can then be done mostly using the bin values, with only quantity values close to the bounds searched explicitly
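The binning idea can be sketched as follows. This is a guess at one workable scheme, not the actual Solr implementation: the bin edges, the `bin_codes` and `search` names, and the use of closed ranges in kilograms are all assumptions, and a real index would store the codes at index time rather than recompute them per query.

```python
import math

# Hypothetical bin edges (kg); the slides do not say how bins are chosen.
EDGES = [-math.inf, 0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0, math.inf]
BINS = list(zip(EDGES, EDGES[1:]))  # bin i covers [EDGES[i], EDGES[i+1])


def bin_codes(lo: float, hi: float) -> dict:
    """Map bin index -> 'full' or 'partial' for the closed range [lo, hi]."""
    codes = {}
    for i, (b_lo, b_hi) in enumerate(BINS):
        if hi < b_lo or lo >= b_hi:
            continue  # no overlap with this half-open bin
        codes[i] = "full" if lo <= b_lo and hi >= b_hi else "partial"
    return codes


def search(index, q_lo: float, q_hi: float) -> list:
    """Return positions of indexed closed ranges that intersect [q_lo, q_hi]."""
    q_codes = bin_codes(q_lo, q_hi)
    hits = []
    for doc_id, (lo, hi) in enumerate(index):
        codes = bin_codes(lo, hi)  # would be precomputed in a real index
        shared = q_codes.keys() & codes.keys()
        if any(q_codes[b] == "full" or codes[b] == "full" for b in shared):
            hits.append(doc_id)  # decided from bin codes alone
        elif shared and lo <= q_hi and q_lo <= hi:
            hits.append(doc_id)  # both partial: compare the bounds explicitly
    return hits


index = [(0.0005, 0.002), (0.5, 0.5), (0.0001, 0.0002), (2.0, 20.0)]
print(search(index, 0.001, math.inf))  # query "≥ 1 gram", expressed in kg
```

If either side fully covers a shared bin, the two ranges must intersect, so only pairs that are both partial in their shared bins need their bounds compared.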
  15. Proximity searching
      • Internally implemented using the same syntax as PatBase e.g.
        30-50 m/s w5 (typhoon or hurricane)
      • Both sides of the proximity operator can have quantities e.g.
        6-12 volts w10 1-5 amps
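Because each quantity is indexed as a single token, a proximity check reduces to comparing token positions. The sketch below assumes that `wN` means "within N token positions, in either order"; that reading, the `proximity_match` function and the sample sentence are assumptions for illustration rather than documented PatBase behaviour.

```python
def proximity_match(tokens, query_range, terms, n):
    """tokens: list of (text, range_or_None), one entry per token position.

    Returns True if a quantity intersecting query_range occurs within
    n positions of any of the given terms.
    """
    q_lo, q_hi = query_range
    quantity_positions = [
        pos for pos, (_, rng) in enumerate(tokens)
        if rng is not None and rng[0] <= q_hi and q_lo <= rng[1]
    ]
    term_positions = [
        pos for pos, (text, _) in enumerate(tokens) if text.lower() in terms
    ]
    return any(
        abs(q - t) <= n for q in quantity_positions for t in term_positions
    )


# "winds of 40 m/s during the typhoon": the quantity occupies one position.
tokens = [
    ("winds", None), ("of", None), ("40 m/s", (40.0, 40.0)),
    ("during", None), ("the", None), ("typhoon", None),
]

# Roughly: 30-50 m/s w5 (typhoon or hurricane)
print(proximity_match(tokens, (30.0, 50.0), {"typhoon", "hurricane"}, 5))
```

If the quantity were split into separate tokens ("40", "m/s"), the distance to neighbouring words would depend on how the quantity was written, which is presumably why single-token quantities are used.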
  16. Query assistance
      • Same quantity interpretation used for text-mining and query parsing
      • Immediate feedback on how a query will be interpreted
      • Can spell-correct user queries
  17. Example usage (dose 50-90 mg/day)
  18. Example usage (alloy searching)
  19. Proximity constraints
  20. Increasing result precision
      • A query like “at least 0.2 picofarads” will match ANY value greater than this
      • Results can be restricted to just bounded ranges or exact values
      Query                 > 5 Pascals    5-15 Pascals    12 Pascals
      10-20 Pa (default)    ✔              ✔               ✔
      10-20 Pa (bounded)    ✘              ✔               ✔
      10-20 Pa (exact)      ✘              ✘               ✔
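The three result modes in slide 20 can be reproduced with a small filter on top of the intersection test. The `matches` function and mode names below are assumptions chosen to mirror the table, and inclusive versus exclusive bounds are ignored for brevity.

```python
import math


def matches(query, indexed, mode="default"):
    """query, indexed: (lo, hi) tuples in the same normalized unit."""
    q_lo, q_hi = query
    lo, hi = indexed
    if not (lo <= q_hi and q_lo <= hi):
        return False  # ranges do not intersect at all
    if mode == "bounded":
        return math.isfinite(lo) and math.isfinite(hi)  # drop open-ended ranges
    if mode == "exact":
        return lo == hi  # keep only single values
    return True


query = (10.0, 20.0)  # 10-20 Pa
indexed = {
    "> 5 Pascals": (5.0, math.inf),
    "5-15 Pascals": (5.0, 15.0),
    "12 Pascals": (12.0, 12.0),
}

for mode in ("default", "bounded", "exact"):
    print(mode, [name for name, rng in indexed.items() if matches(query, rng, mode)])
```

Each stricter mode only removes results from the default set, so the user can start broad and narrow down without rewriting the quantity itself.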
  21. Combination with other searches
      • Quantity searches are just another field in the search system
  22. Combination with faceting
  23. Future work
      • Multilingual ranges and units
        – 50至500毫克 [毫 = milli, 克 = gram, 至 = to]
        – 50 à 500 milligrammes
        – 50 bis 500 Milligramm
        – 50~500ミリグラム [ミリ = milli, グラム = gram]
        – In practice SI symbols commonly used in all languages
      • Tables with quantity split between header and cell
  24. Conclusions
      • Text-mining combined with a traditional full-text indexer allows quantities to be seamlessly searched
      • Currently available on all US patents through our PatDocs service
      • Planned for inclusion into PatBase later this year
  25. Thank you
      For more information, visit the Minesoft stand or contact us at:
      Head Office (London)
      Boston House, Little Green
      Richmond-upon-Thames, TW9 1QE
      Tel: +44 (0)20 8404 0651
      General enquiries: info@minesoft.com
      Support: support@minesoft.com
      German Office
      Hesemannstraße 1A
      41460 Neuss
      Tel: +49 (0)211 7495 0930
      General enquiries: germany@minesoft.com
