
ICIC 2017: Babies and bathwater: Keeping linguistics alongside machine learning in patent search


David Woolls (CFL Software, UK)
The growth in computing power has made it possible to gain insights into very large quantities of text through both statistical and neural methods, and linguistics, the study of the way languages work for humans, is not a major part of that process. However, decisions on freedom to operate (FTO), invalidity and competition are still made by humans, which means reading the patents identified by the machines. Because humans are endlessly creative even in the apparently constrained world of patent writing, and different human languages have different ways of expressing similar concepts, identifying ranges in alloys, compounds, formulations and the like is a complex challenge for computer programs. This paper explores how the same hardware advances that have enabled machine learning can be exploited to produce overall solutions to the problems that natural languages present to humans and computers alike. It identifies those areas in which computer programs can outmatch human capability in identifying and assessing complex interactions of molecules, elements and the like and comparing them with potential or actual specifications, a capability that leaves humans much more time to focus on interpretation rather than finding. It also illustrates the application of such programs to the main European languages and to Chinese, Japanese and Korean.

Published in: Internet

  1. Babies and Bathwater
     Keeping linguistics alongside machine learning in patent search
     David Woolls, CFL Software Limited, UK
  2. Matter
     • "Therefore, we cannot think that matter is made of points without extension, because no matter how many of these we manage to put together, we never obtain something with an extended dimension." Carlo Rovelli, Reality is not what it seems (2016, p. 12)
     • Quindi non si può pensare che la materia sia fatta di punti senza estensione, perché, per quanti ne mettessimo insieme, non otterremmo mai qualcosa con una dimensione estesa.
     • What is the matter with this sentence? Does this matter? As a matter of fact it does. That's another matter.
     • What does 'matter' mean on this page?
  3. Imagined Readers – Text differences
     "It was a dark and stormy night, the rain came down in torrents, there were brigands on the mountains, and wolves, and the chief of the brigands said to Antonio, 'I'm bored - tell us a story!'"
     Janet and Allan Ahlberg
     From "Paul Clifford"
  4. LSTM and linguistics
     • But there are also cases where we need more context.
     • Consider trying to predict the last word in the text "I grew up in France… I speak fluent French."
     Humans usually provide linguistic assistance in the form of function words (grammar):
     • I grew up in France so I speak fluent … → Definitely French
     • I grew up in France and I speak fluent … → Possibly French but maybe another
     • I grew up in France but I speak fluent … → Definitely not French
     • I grew up in France but I also speak fluent … → Very definitely not French
     • I grew up in France but I don't speak fluent … → Definitely French
     • I grew up in France so I don't speak fluent … → Definitely not French
  5. Babies, bathwater, stems, lemmas and function words
     Each sentence is followed by what it becomes after function words are removed and the remaining words are stemmed or lemmatised:
     • I think Christoph is brilliant → Think Christoph brilli
     • I thought Christoph was brilliant → Think Christoph brilli
     • I thought Christoph was brilliant but now I'm not so sure. → Think Christoph brilli sure
     • Hearing Christoph's brilliance I asked him to speak. → Hear Christoph brilli ask speak
     • I wouldn't do that if I were you! → !
     This is called telegraphic language and is spoken by children between 18 months and three years old during language acquisition. Perhaps not ideal for computers and comprehension.
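The reduction on this slide can be approximated in a few lines. The sketch below is illustrative only: the function-word list, the lemma table and the suffix rules are small hypothetical stand-ins chosen to cover the slide's sentences, not a real stemmer and not the author's software. Output is left in its original case apart from the lemma lookups.

```python
import re

# Hypothetical, hand-picked resources that cover only the slide's examples.
FUNCTION_WORDS = {
    "i", "is", "was", "but", "now", "i'm", "not", "so", "him", "to",
    "wouldn't", "do", "that", "if", "were", "you",
}
LEMMAS = {"thought": "think", "asked": "ask", "hearing": "hear"}
SUFFIXES = ("ance", "ant")  # crude suffix stripping, standing in for a stemmer


def telegraphic(sentence: str) -> str:
    """Drop function words, then lemmatise or stem whatever is left."""
    words = re.findall(r"[A-Za-z]+(?:'[a-z]+)?", sentence.replace("’", "'"))
    kept = []
    for word in words:
        lower = word.lower()
        if lower in FUNCTION_WORDS:
            continue
        if lower.endswith("'s"):  # discard the possessive marker
            word, lower = word[:-2], lower[:-2]
        if lower in LEMMAS:
            word = LEMMAS[lower]
        else:
            for suffix in SUFFIXES:
                if lower.endswith(suffix):
                    word = word[: -len(suffix)]
                    break
        kept.append(word)
    return " ".join(kept)


for s in ("I think Christoph is brilliant",
          "I thought Christoph was brilliant but now I'm not so sure.",
          "I wouldn't do that if I were you!"):
    print(repr(telegraphic(s)))
```

The last sentence consists entirely of function words, so nothing survives the filter, which is the point the slide makes about discarding them.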
  6. Linguistic LSTM with real sentences
     It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
     • It is a truth universally acknowledged, [6]
     • that a single man [4]
     • in possession of a good fortune, [6]
     • must be in want of a wife. [7]
     • 23 words / 4 chunks ≈ 6 words per chunk
     George A. Miller, "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information", originally published in The Psychological Review, 1956, vol. 63, pp. 81-97. http://www.musanim.com/miller1956/
  7. LSTM
     However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.
     • However little known the feelings or views [7]
     • of such a man may be [6]
     • on his first entering a neighbourhood, [6]
     • this truth is so well fixed [6]
     • in the minds of the surrounding families, [7]
     • that he is considered the rightful property [7]
     • of some one or other of their daughters. [8]
     • 47 words / 7 chunks ≈ 7 words per chunk
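The word counts per chunk on the two preceding slides can be recomputed mechanically. The sketch below is only a check of the arithmetic: the chunk boundaries are copied from the slides, and the helper name `chunk_lengths` is made up for illustration.

```python
import re


def chunk_lengths(chunks):
    """Return the number of words in each chunk (punctuation is ignored)."""
    return [len(re.findall(r"[A-Za-z']+", chunk)) for chunk in chunks]


first_sentence = [
    "It is a truth universally acknowledged,",
    "that a single man",
    "in possession of a good fortune,",
    "must be in want of a wife.",
]
second_sentence = [
    "However little known the feelings or views",
    "of such a man may be",
    "on his first entering a neighbourhood,",
    "this truth is so well fixed",
    "in the minds of the surrounding families,",
    "that he is considered the rightful property",
    "of some one or other of their daughters.",
]

for chunks in (first_sentence, second_sentence):
    lengths = chunk_lengths(chunks)
    mean = sum(lengths) / len(lengths)
    print(lengths, "total:", sum(lengths), "mean:", round(mean))
```

Both means fall inside Miller's range of seven plus or minus two, which is the comparison the slides draw.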
  8. Why linguistics?
     • Patents are communicative documents, written in many languages.
     • Communication is achieved by context, which can be close or distant.
     • Boolean searching gives results by document; range searching needs to be done by claim.
     • There are distractor numbers in a claim (e.g. claim numbers, temperatures, lengths).
     • There are potential data quality or format problems introduced by OCR, machine translation or extraction from a database.
     • All these and others need to be taken into account to find only relevant material.
     ICIC 2017
  9. Why linguistics for ranges?
     • Range information is in the unstructured text
       – The location and referent of ranges are signalled by linguistic structures and forms:
         • Range then element, element then range, or both: 0,80 < Si < 1,20
         • Elements by symbol (Si) or in full (Silicon or silicon)
         • Implicit or explicit marking: 1-5 or between 1 and 5
         • Symbolic or lexical marking: <2.5 or less than 2.5; ≥ .76 or greater than or equal to 0.76
         • Variation in proximity of additional markings: 0.5%, 0.5wt%, 0.5 wt %
       – There can be mixtures of these forms in a single claim.
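To make the variety of range forms concrete, here is a minimal Python sketch that maps the slide's examples onto one `(low, high)` representation. It is an illustration under stated assumptions, not the method used by the author's software: the function name `parse_range`, the regular expressions and the decision to ignore whether a bound is inclusive are all choices made for this example, and lexical markers are assumed to be lower case.

```python
import re
from typing import Optional, Tuple

NUM = r"(\d*[.,]?\d+)"  # accepts 1, 2.5, .76 and decimal commas such as 0,80

# Lexical markers are rewritten to symbols first, longest phrases first.
LEXICAL = [
    ("greater than or equal to", "≥"),
    ("less than or equal to", "≤"),
    ("greater than", ">"),
    ("less than", "<"),
]

Bounds = Tuple[Optional[float], Optional[float]]


def to_float(text: str) -> float:
    return float(text.replace(",", "."))


def parse_range(expr: str) -> Bounds:
    """Return (low, high); None marks an open end. Inclusivity is ignored."""
    s = expr.strip()
    for phrase, symbol in LEXICAL:
        s = s.replace(phrase, symbol)
    s = re.sub(r"\s*(wt)?\s*%", "", s)  # drop %, wt% and wt %

    # Range on both sides of the element: 0,80 < Si < 1,20
    m = re.fullmatch(rf"{NUM}\s*[<≤]\s*[A-Za-z]+\s*[<≤]\s*{NUM}", s)
    if m:
        return to_float(m.group(1)), to_float(m.group(2))
    # Explicit marking: between 1 and 5
    m = re.fullmatch(rf"between\s+{NUM}\s+and\s+{NUM}", s)
    if m:
        return to_float(m.group(1)), to_float(m.group(2))
    # Implicit marking: 1-5
    m = re.fullmatch(rf"{NUM}\s*[-~]\s*{NUM}", s)
    if m:
        return to_float(m.group(1)), to_float(m.group(2))
    # Upper bound only: <2.5
    m = re.fullmatch(rf"[<≤]\s*{NUM}", s)
    if m:
        return None, to_float(m.group(1))
    # Lower bound only: ≥ .76
    m = re.fullmatch(rf"[>≥]\s*{NUM}", s)
    if m:
        return to_float(m.group(1)), None
    raise ValueError(f"unrecognised range expression: {expr!r}")


for example in ("0,80 < Si < 1,20", "between 1 and 5", "less than 2.5", "≥ .76"):
    print(example, "->", parse_range(example))
```

A production reader would also have to decide which element a range refers to and cope with several forms mixed in one claim, which this sketch does not attempt.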
  10. Reading
      The program is a linear text reader because we need to:
      1. Identify claims
      2. Identify pairs of elements and ranges in each claim.
      So each line in the file is read word by word just once, in the same sequence as a human reader.
  11. Reading
      • Items are identified as numbers, range indicators or elements in sequence.
      • As each element/range pair is identified, its relationship with the specification is calculated.
      • Following calculation, the element and the range are colour-coded and the claim is built for potential display.
      • At the conclusion of each claim, the total found is compared with the total specification.
      • If the claim meets the overall specification requirement, it is added to the list for display.
      • At the conclusion of the reading process, all the results are ranked and displayed.
      • The program can process the full claims of around 300 patents per second.
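The reading steps above can be sketched as a single left-to-right pass over a claim. This is a simplified illustration, not the author's program: the element subset, the three relationship labels (`inside`, `overlap`, `outside`) and the rule that a claim meets the specification when every specified element is found and none falls outside are assumptions made for the example, and colour-coding and ranking are omitted.

```python
import re

ELEMENTS = {"C", "Si", "Mn", "Ni", "Cr"}  # tiny illustrative subset

TOKEN = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|[-~<>]")


def read_claim(text):
    """Read tokens once, in order, collecting (element, low, high) triples."""
    pairs, element, numbers = [], None, []
    for token in TOKEN.findall(text):
        if token in ELEMENTS:
            element, numbers = token, []
        elif token[0].isdigit() and element is not None:
            numbers.append(float(token))
            if len(numbers) == 2:
                pairs.append((element, numbers[0], numbers[1]))
                element = None
        # Numbers seen before any element (e.g. the claim number) are skipped.
    return pairs


def relation(low, high, spec_low, spec_high):
    """Classify a claimed range against the specified range."""
    if spec_low <= low and high <= spec_high:
        return "inside"
    if high < spec_low or low > spec_high:
        return "outside"
    return "overlap"


def assess(text, spec):
    """Return per-element relations and whether the claim meets the spec."""
    found = {el: relation(lo, hi, *spec[el])
             for el, lo, hi in read_claim(text) if el in spec}
    meets = len(found) == len(spec) and "outside" not in found.values()
    return found, meets


claim = ("1. A non-magnetic alloy, characterized by "
         "C: 0.14-0.30, Si: 0.15-0.80, Mn: 20.00-27.00")
print(assess(claim, {"C": (0.10, 0.40), "Si": (0.50, 1.00)}))
```

The leading claim number is an example of the distractor numbers mentioned earlier: it is ignored here only because no element has been seen yet, which is a much weaker test than a real reader would need.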
  12. Native languages v Machine Translation
      Here is the problem from the PatBase collection.
      <Claims><![CDATA[<CLA_MT><XXC1> <p> CN 1. A non-magnetic alloy of high strength and toughness, characterized in that the chemical composition in weight percent of: C:.. 0 14 ~0 30 percent, Si:.. 0 15 ~0 80 percent,.. Mn: 20 00 ~27 00 percent; Ni:.. 0 60 ~2 00 percent; Cr:.. 12 50 ~19 00 percent; </CLA_MT><CLA_CN><XXC1><p>CN 1. 一种高强度韧性无磁合金,其特征在于,化学成分重量百分数为: C :0. 14 〜0. 30%, Si :0. 15 〜0. 80%, Mn :20. 00 〜27. 00% ; Ni :0. 60 〜2. 00% ; Cr :12. 50 〜19. 00% ;
      You can see that the machine translation into English is appalling: the decimal points have been lost. You can also see that the original claim will be understandable by the program because the presentation is clear.
  13. Detailed example (continued)
      It is not practicable to write a program that takes account of all the things that might go wrong without also introducing potential errors into data that is actually OK. But it is possible for SpanMatch to recognise the original as correct, as you see here. So, given clean data, or data cleaned up as best we can, we can do this in all the languages.
      Once you have an indication of potential interest, you can use a good MT program to translate just the claims of interest. This is Google Translate translating the claim; you can see that it is struggling, but it is better than the PatBase one.
      CN is a high strength toughness nonmagnetic alloy characterized in that the chemical composition is in a weight percentage of C: 0.14 to 0. 30 Si: 0.015 to 80 Mn: 20 to 0000. Ni: 0.60 ~ 2.00; Cr: 12. 50 ~ 1900; Mo or W elements of one or two: 0. 60 ~ 2.50 ;; 0.8 ~ [0. LXMn (% - 0.5); 0 20 to 0.50; Ca, rare earth elements of one or two: 0. 003 ~ 0.05;: 彡 0.03:: 彡 0.03; Fe: balance.
  14. Use of CN, JP, KR originals - rationale
      • Machine translation is often hard to understand and sometimes incomprehensible
      • Using native language patents ensures data quality
      • Limited inbuilt knowledge required for numerical searching
        – Searching for elements requires only that a program has the CJK equivalents for full element names; international symbols are identical.
        – Searching for ranges requires knowledge of potential CJK equivalent codes for digits
        – Searching for range indicators requires language-specific identification of hyphen, <, > and words.
      • Accurate identification of the search specification with display of the claims means only those claims of interest need translation by machine or human
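The three kinds of inbuilt knowledge listed above can be illustrated with a short Python sketch that reads the native Chinese claim from the PatBase example directly. It is a hedged illustration, not SpanMatch: the lookup table holds only five Chinese element names as an assumed sample, the wave dash is the only range indicator handled, and Japanese or Korean names would need their own entries.

```python
import re
import unicodedata

# Assumed sample of full element names; international symbols need no lookup.
CJK_ELEMENT_NAMES = {"碳": "C", "硅": "Si", "锰": "Mn", "镍": "Ni", "铬": "Cr"}

# Range indicator used in the example claim (wave dash), mapped to ASCII.
RANGE_MARKS = str.maketrans({"〜": "~"})


def normalise(text):
    """Map CJK digit codes, range marks and element names to ASCII forms."""
    text = unicodedata.normalize("NFKC", text)  # full-width digits and signs
    text = text.translate(RANGE_MARKS)
    for name, symbol in CJK_ELEMENT_NAMES.items():
        text = text.replace(name, symbol)
    # Close the gaps that extraction leaves inside decimals: "0. 14" -> "0.14"
    return re.sub(r"(\d)\.\s+(\d)", r"\1.\2", text)


PAIR = re.compile(
    r"([A-Z][a-z]?)\s*:\s*(\d+(?:\.\d+)?)\s*~\s*(\d+(?:\.\d+)?)"
)


def element_ranges(text):
    """Return (symbol, low, high) for each 'element: low ~ high' pair found."""
    return [(el, float(lo), float(hi))
            for el, lo, hi in PAIR.findall(normalise(text))]


native_claim = ("CN 1. 一种高强度韧性无磁合金,其特征在于,化学成分重量百分数为: "
                "C :0. 14 〜0. 30%, Si :0. 15 〜0. 80%, Mn :20. 00 〜27. 00% ; "
                "Ni :0. 60 〜2. 00% ; Cr :12. 50 〜19. 00% ;")
print(element_ranges(native_claim))
```

Because the numbers and symbols are recovered from the original text, nothing in this step depends on understanding the surrounding Chinese prose, which is why so little language-specific knowledge is needed.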
  15. Thank you
      Contact: d.woolls@cflsoftware.com
      Website: www.cflsoftware.com
