
II-SDV 2017: Semantic Search Jargon - A short Guide

In the early 1990s, the term 'semantic' appeared in the context of text retrieval tools. However, the idea of semantics was there from the very beginning of Information Retrieval as a research field (i.e. the computer-assisted identification of relevant documents), as the articles of Vannevar Bush (As We May Think) and Luhn (The Automatic Creation of Literature Abstracts) from the 1940s and '50s show.

So where are we now in terms of semantics? The `latent semantic indexing` of the 1990s faded away, and the first decade of the millennium enthusiastically studied semantic web technologies. Now, in the second decade, `deep learning` is the new star. In this talk I will give a high-level overview of what has been done already, particularly in the context of the patent domain, what the main techniques are, and in which directions the scientific community is looking today. Ultimately, there will be no single answer to the question 'What is semantic search?'. Instead, my aim is to empower the audience to ask the right questions the next time somebody mentions the term.

Published in: Internet

  1. 1. Semantic Search Jargon – a short guide Mihai Lupu TU Wien / RSA Data Science mihai.lupu@researchstudio.at
  2. 2. “Semantic” ▪ adjective – dictionary.com: of, relating to, or arising from the different meanings of words or other symbols – Merriam-Webster: of or relating to the meanings of words and phrases – Cambridge: connected with the meanings of words – Oxford: connected with the meaning of words and sentences
  3. 3. They are among us
  4. 4. A human characteristic
  5. 5. Counting words (aka Statistics)
  6. 6. … semantics
  7. 7. The geometric metaphor of meaning “Meanings are locations in a semantic space, and semantic similarity is proximity between the locations” (Sahlgren, 2006)
  8. 8. Hans Peter Luhn
  9. 9. [Timeline figure, 1950s to 2020s] and others: pure counting; term frequency; position in sentence; SMART; IDF; cosine similarity; and many more; from counting to predicting; Latent Semantic Analysis; Random Indexing; WWW appears; Semantic Web appears; Deep Learning (Speech, Vision, NLP, IR); The Golden Age of Artificial Intelligence; Expert Systems, Knowledge bases (e.g. Cyc); Inference on billions of tuples, on trillions; Probabilistic models for IR; Language Models
  10. 10. where are we now?
      ▪ Inference directly from text
      ▪ [Bowman et al. 2016]
      Sentence pairs:
      - A man rides a bike on a snow covered road / A man is outside
      - 2 female babies eating chips / Two female babies are enjoying chips
      - A man in an apron shopping at a market / A man in an apron is preparing dinner
      Model                                                        % Accuracy
      Feature-based classifier                                     78.2
      Previous SOTA sentence encoder [Mou et al. 2016]             82.1
      LSTM RNN sequence model                                      80.6
      Tree LSTM                                                    80.9
      SPINN                                                        83.2
      SOTA (sentence pair alignment model) [Parikh et al. 2016]    86.8
  11. 11. where are we now?
      ▪ Inference directly from text
      ▪ [Bowman et al. 2016]
      Sentence pairs:
      - A man rides a bike on a snow covered road / A man is outside
      - 2 female babies eating chips / Two female babies are enjoying chips
      - A man in an apron shopping at a market / A man in an apron is preparing dinner
      Model                                                        % Accuracy
      Feature-based classifier                                     78.2
      Previous SOTA sentence encoder [Mou et al. 2016]             82.1
      LSTM RNN sequence model                                      80.6
      Tree LSTM                                                    80.9
      SPINN                                                        83.2
      SOTA (sentence pair alignment model) [Parikh et al. 2016]    86.8
      Particular success cases:
      Negation:
      - The rhythmic gymnast completes her floor exercise at the competition
      - The gymnast cannot finish her exercise
      Long examples (>20 words):
      - A man wearing glasses and a ragged costume is playing a Jaguar electric guitar and singing with the accompaniment of a drummer
      - A man with glasses and a disheveled outfit is playing a guitar and singing along with a drummer.
  12. 12. Where are we for patents? ▪ Latent Semantic Indexing – Some commercial systems claim to use it ▪ “Latent semantic analysis uses sophisticated statistical analysis of language to search on concepts, not just words, to help you find those documents - even if they don't contain any of the words you used in your search” – Minimal improvements found in experiments ▪ [Moldovan:2005]
  13. 13. Random Indexing ▪ Initial experiments using the Semantic Vectors package – Unsatisfactory results for document similarity – Noticeably good results for term similarity Term vectors Document vectors [Lupu et al.:2013]
  14. 14. Random Indexing ▪ Initial experiments using the Semantic Vectors package – Unsatisfactory results for document similarity – Noticeably good results for term similarity Term vectors Document vectors [Lupu et al.:2013]
      Nearest terms to "coatings":
        1.0:coatings 0.9999339:rubs 0.9999338:coating 0.9999328:acrylics 0.9999271:vinyls 0.9999268:cratering 0.9999251:distinctness 0.9999246:blistering 0.9999235:pompano 0.9999234:cyanamid
      Nearest terms to "crystal":
        1.0:crystal
        0.9999378:cyrstal
        0.9999305:crytal
        0.9999022:nicol // a type of prism
        0.9999014:jjap
        0.9999006:nicols
        0.9998996:nematic // a type of liquid crystal
        0.9998943:uniaxial // minerals that form crystals used in optics
        0.9998894:cb15 // a particular liquid crystal
        0.9998887:anisotropy
      Nearest terms to "crystals":
        1.0:crystals 0.9998632:supersaturation 0.9998519:crystallizing 0.9998281:supersaturated 0.9998213:crys 0.9998193:purer 0.9998166:soda 0.9998120:crystallize 0.9998105:crystallizers 0.9998081:tals
  15. 15. [Rekabsaz et al.:2016]
  16. 16. CLEF-IP patent collection
  17. 17. looks like we have a problem
  18. 18. Why words are too simple and documents are too large
  19. 19. documents are too large Particular success cases: Negation: - The rhythmic gymnast completes her floor exercise at the competition - The gymnast cannot finish her exercise Long examples (>20 words): - A man wearing glasses and a ragged costume is playing a Jaguar electric guitar and singing with the accompaniment of a drummer - A man with glasses and a disheveled outfit is playing a guitar and singing along with a drummer.
  20. 20. words are too simple “In a railroad car truck, a windowed side frame, a bolster extending through the window, a wedge pocket in said bolster having an upwardly and outwardly inclined floor in opposition to a vertical wear surface on the side frame, a stabilizing wedge in the pocket having a vertical friction surface in contact with the wear surface on the side frame and an inclined wedging surface in opposition to the floor of the pocket, a removable wear plate inset in a recess in said inclined floor, said recess having a horizontal lower edge, said wear plate having an inclined lower edge formed and adapted to engage and be supported on said horizontal lower edge of said recess, said wear plate being held in said recess by a weldment located between the upper edge of said recess and the lower edge of said wear plate, and, a spring biasing the wedge upwardly against the removable wear plate to cam the wedge laterally against the wear surface on the side frame.”
      How much of the patent corpus is covered by the CELEX lexical database? [Verberne et al., 2010]
                 Patent data    COBUILD corpus
      Tokens     96%            92%
      Types      55%            (?)
  21. 21. What to do? Research Evaluation
  22. 22. words are too simple Query Generation [Andersson:2016] – Baseline, NLP: (word, phrases) and Statistically: (unigram, bigram) – Section Claims or entire document – Termhood ▪ Experiment to learn termhoodness, two sample sets: – 637 with C-value and 4,400 without C-value ▪ upper boundary (manual list) versus machine learning ▪ Skip-gram versus exact phrase ▪ Technical terms versus non-technical
  23. 23. Continuous and objective evaluation Search Engine Effectiveness Test
  24. 24. Artificial Intelligence - Will it ever come? a machine will pass the Turing test by 2029 (Kurzweil 1999, pp. 189-235.) * The Turing Test does not specify the use of patents in the conversation
  25. 25. Thank you
  26. 26. Glossary
      ▪ CBOW: Continuous Bag-of-Words
      ▪ DBPedia: automatically extracted knowledge resource from Wikipedia
      ▪ dimensionality reduction: any procedure that takes as input a vector of size N and outputs a vector of size M<N
      ▪ feed-forward: a particular type of neural network, which does not contain cycles between its neurons
      ▪ hypernym: a term denoting a broader category than another
      ▪ hyponym: a term denoting a narrower category than another
      ▪ LOD: Linked Open Data
      ▪ LSA: Latent Semantic Analysis
      ▪ LSI: Latent Semantic Indexing
      ▪ LSTM: Long Short Term Memory
      ▪ matrix decomposition: a mathematical procedure to represent a matrix as the product of two or more matrices
      ▪ matrix factorization: matrix decomposition
      ▪ neural networks: an algorithmic model (loosely) simulating brain structures
      ▪ ontology: (here) a knowledge representation resource
      ▪ OWL: Web Ontology Language
      ▪ PCA: Principal Component Analysis
      ▪ PMI: Pointwise Mutual Information
      ▪ RDF: Resource Description Framework
      ▪ recurrent nn: a particular type of neural network, which contains cycles between its neurons
      ▪ RI: Random Indexing
      ▪ skip-grams: method to predict a context from a word
      ▪ SVD: Singular Value Decomposition
      ▪ WordNet: a large lexical database of English
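
Random Indexing (RI in the glossary, slides 13 and 14) and the geometric metaphor of slide 7 can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: it is not the Semantic Vectors package and not the setup of [Lupu et al.:2013]. The dimensionality, the number of non-zero entries, the window size, the toy corpus and all function names are hypothetical choices made for this example. Each term receives a sparse random index vector, a term's context vector is the sum of the index vectors of its neighbours, and term similarity is the cosine between context vectors.

```python
import math
import random
from collections import defaultdict

DIM = 300      # size of the reduced space (assumed value)
NONZERO = 10   # number of +1/-1 entries per index vector (assumed value)
WINDOW = 2     # neighbours considered on each side (assumed value)


def index_vector(rng):
    """Sparse ternary random vector: mostly zeros, half +1 and half -1 entries."""
    vec = [0.0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec


def train(sentences, seed=0):
    """Return one context vector per term: the sum of its neighbours' index vectors."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for tok in tokens:
            if tok not in index:
                index[tok] = index_vector(rng)
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                ctx = context[tok]
                for k, value in enumerate(index[tokens[j]]):
                    ctx[k] += value
    return dict(context)


def cosine(a, b):
    """Cosine similarity: proximity of two locations in the semantic space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(term, vectors, k=3):
    """Return the k (score, term) pairs most similar to `term`."""
    scored = [(cosine(vectors[term], vec), other)
              for other, vec in vectors.items() if other != term]
    return sorted(scored, reverse=True)[:k]


# Hypothetical toy corpus; real experiments would use a patent collection.
corpus = [
    "the coating resisted blistering".split(),
    "the coatings resisted blistering".split(),
    "a nematic crystal shows anisotropy".split(),
]
vectors = train(corpus)
for score, term in nearest("coating", vectors):
    print(f"{score:.4f}:{term}")
```

In this toy corpus "coating" and "coatings" occur in identical contexts, so their context vectors coincide, while "crystal" shares no neighbours with either and ends up far away. This is the sense in which terms with similar usage become neighbours in the space, and it also shows why spelling variants of a term can surface as its closest neighbours.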
