The document discusses the role of natural language processing (NLP) in information retrieval. It provides background on NLP, describing some of the fundamental problems in processing text like ambiguity and the contextual nature of language. It then outlines several common NLP tools and techniques used to analyze text at different levels, from part-of-speech tagging to named entity recognition and information extraction. The document concludes that NLP can help address some of the limitations of traditional document retrieval models by identifying implicit meanings and relationships within text.
The Role of Natural Language Processing in Information Retrieval
1. The Role of Natural Language Processing in Information Retrieval: Searching for Meaning in Text. Tony Russell-Rose, PhD, 21-Mar-2011
2. Contents: Introduction & Overview (Background, Terminology); NLP Fundamentals (Paradigms & Perspectives); NLP Tools & Techniques; NLP Applications (Text mining, Question Answering, etc.); Conclusions & Further Reading
3. Introduction & Overview
4. Information Retrieval: models, techniques & frameworks for satisfying an information need. Common paradigm: document retrieval.
5. Document Retrieval: Methods include Boolean, vector space and probabilistic models. They rely on index terms (a “bag of words”), stop lists and stemming. But text is “unstructured”; information may be “hidden”.
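To make the index-term idea concrete, here is a minimal sketch of Boolean retrieval over an inverted index with a stop list. The stop list, the three toy documents and the function names are all assumptions made for illustration; none of them come from the slides.

```python
from collections import defaultdict

# Hypothetical stop list; real systems use longer, collection-specific lists.
STOP_WORDS = {"the", "a", "on", "in", "of", "is"}

def index_terms(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return the ids of documents that contain every query term."""
    terms = index_terms(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Toy collection, invented for this example.
docs = {
    1: "the cat sat on the mat",
    2: "the dog sat on the log",
    3: "a cat chased the dog",
}
index = build_index(docs)
print(boolean_and(index, "cat sat"))  # documents containing both terms
print(boolean_and(index, "the dog"))  # "the" is removed by the stop list
```

The index only records which terms occur in which documents, so anything expressed through word order or phrasing is invisible to it.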
6. NLP – a solved problem? As humans we do it effortlessly ... don’t we? DRUNK GETS NINE YEARS IN VIOLIN CASE; PROSTITUTES APPEAL TO POPE; STOLEN PAINTING FOUND BY TREE; RED TAPE HOLDS UP NEW BRIDGE; DEER KILL 300,000; RESIDENTS CAN DROP OFF TREES; INCLUDE CHILDREN WHEN BAKING COOKIES; MINERS REFUSE TO WORK AFTER DEATH
7. Problems with Text (1): Polysemy: one word maps to many concepts, e.g. bat. Synonymy: one concept maps to many words.
8. Problems with Text (2): Word order: Venetian blind vs. blind Venetian.
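A short sketch of why word order is a problem for a bag-of-words representation: once each phrase is reduced to term counts, the two phrases become indistinguishable. The `bag_of_words` helper is a hypothetical name used only here.

```python
from collections import Counter

def bag_of_words(text):
    """Reduce a text to lower-cased term counts, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("Venetian blind")
b = bag_of_words("Blind venetian")
print(a == b)  # the two phrases have identical bags
```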
9. Problems with Text (3): Language is generative. "Starbucks coffee is the best" vs. "The place I like most when I need to feed my caffeine addiction is the company from Seattle with branches everywhere". Many different ways to express a given idea: synonymy, paraphrase, metaphor, etc.
10.
11. The meaning of a sentence is completely determined by the meaning of its symbols and the syntax used to combine them. Language is a form of communication. All communication has a context: time and place of utterance, the writer, the reader, their background knowledge, intentions, assumptions re the reader’s knowledge/intentions, etc.
12. Problems with Text (5): Language is changing: "I want to buy a mobile". Ill-formed input: “accomodation office”. Co-ordination, negation, etc.: "This is not a talk about neuro-linguistic programming". Multi-linguality: "Claudia Schiffer is on the cover of Elle". Also sarcasm, irony, slang, jargon, etc.: "That was a wicked lecture"; "Yep – the coffee break was the best part".
13. Enter NLP / Text Analytics… Text Analytics: a set of linguistic, analytical and predictive techniques to extract structure and meaning from unstructured documents. NLP: the academic term for Text Analytics, analogous to “search” vs. “IR”. Text Analytics ~= Natural Language Processing ~= Text Mining. Text Mining -> scientific / technical context, automated processing. Text Analytics -> business context, interactive apps.
14. NLP Fundamentals
15. NLP Perspectives: Computational linguistics: use of computational techniques to study linguistic phenomena. Cognitive science: study of human information processing (perception, language, reasoning, etc.). Computer science: theoretical foundations of computation and practical techniques for implementation. Information science: analysis, classification, manipulation, retrieval and dissemination of information.
16. NLP Paradigms: Symbolic approaches: rule-based, hand coded (by linguists), knowledge-intensive. Statistical approaches: shallow statistical models, trained on annotated corpora, data-intensive.
17. NLP Fundamentals (word level): Language is AMBIGUOUS. To find structure, we must remove ambiguity! Lexical analysis (tokenisation): "The cat sat on the mat" vs. "I can’t tokenise this sentence". Stop word removal: no definitive list; consider The Who, The The, Take That… and "To be or not to be".
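The tokenisation and stop-word points can be sketched as follows. The regular expression, the stop list and the function names are assumptions chosen for illustration, and the example uses a straight apostrophe in "can't" (a typographic apostrophe would need its own handling).

```python
import re

# Hypothetical stop list; as the slide notes, there is no definitive one.
STOP_WORDS = {"to", "be", "or", "not", "the", "on"}

def regex_tokens(text):
    """Words (optionally with an internal apostrophe) or single punctuation marks."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\w\s]", text)

def remove_stop_words(tokens):
    """Drop every token whose lower-cased form is in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(regex_tokens("I can't tokenise this sentence."))
print(remove_stop_words(regex_tokens("The cat sat on the mat")))
print(remove_stop_words(regex_tokens("To be or not to be")))  # nothing survives
```

With this stop list the last query is reduced to an empty list, which is the difficulty the slide is pointing at: a phrase made entirely of stop words can still be a meaningful query.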
18. NLP Fundamentals (word level): Stemming: fishing, fished, fish, fisher -> fish. Lemmatization: stemming + context, e.g. Passing -> pass + ING; Were -> be + PAST. Delegate = de-leg-ate (?); Ratify = rat-ify (?). Morphology (prefixes, suffixes, etc.): Gebäudereinigungsfirmenangestellter -> Gebäude + Reinigung + Firma + Angestellter (building + cleaning + company + employee).
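A crude suffix-stripping stemmer is enough to reproduce the "fish" example and to show where stemming stops short of lemmatization. This is a sketch under stated assumptions: the suffix list and the minimum stem length are invented here, and it is far simpler than established stemmers such as Porter's.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "fished", "fish", "fisher", "passing"]:
    print(w, "->", crude_stem(w))

# Irregular forms are untouched: mapping "were" to "be" needs a lexicon,
# which is what lemmatization adds on top of suffix stripping.
print("were", "->", crude_stem("were"))
```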
19. NLP Fundamentals (sentence level): Syntax (part of speech tagging): book -> NOUN, VERB; that -> DETERMINER; flight -> NOUN; Book that flight -> VB DT NN. Ambiguity problem: Time flies like an arrow -> ? Fruit flies like a banana -> ? Eats shoots and leaves -> ?
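The ambiguity problem can be illustrated by enumerating every tag sequence a lexicon allows for a sentence. The toy lexicon below is an assumption made for this sketch (the tag names and the readings given to each word are not from the slides), and a real tagger would go on to rank the candidates rather than just list them.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the tags it may take.
LEXICON = {
    "book": ["NOUN", "VERB"],
    "that": ["DET"],
    "flight": ["NOUN"],
    "time": ["NOUN", "VERB"],
    "flies": ["NOUN", "VERB"],
    "like": ["VERB", "PREP"],
    "an": ["DET"],
    "arrow": ["NOUN"],
}

def candidate_taggings(sentence):
    """Return every (word, tag) sequence the lexicon permits for the sentence."""
    words = sentence.lower().split()
    tag_options = [LEXICON[w] for w in words]
    return [list(zip(words, tags)) for tags in product(*tag_options)]

print(len(candidate_taggings("Book that flight")))          # 2 candidates
print(len(candidate_taggings("Time flies like an arrow")))  # 8 candidates
```

The number of candidates is the product of the per-word options, so it grows quickly with sentence length even for a tiny lexicon.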
20. NLP Fundamentals (sentence level): Parsing (grammar): "I saw a venetian blind"; "I saw a blind venetian"; "I saw the man on the hill with a telescope"; "Rugby is a game played by men with odd-shaped balls". Sentence boundary detection: Punctuation denotes the end of a sentence! “But not always!”, said Fred...
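A naive rule for sentence boundary detection, sketched below, treats every run of terminal punctuation as the end of a sentence. The regular expression and function name are assumptions for illustration; the input is adapted from the slide's own example, using straight quotes.

```python
import re

def naive_sentences(text):
    """Split after every run of '.', '!' or '?', whatever the context."""
    pieces = re.findall(r"[^.!?]+[.!?]+", text)
    return [p.strip() for p in pieces]

text = 'Punctuation denotes the end of a sentence! "But not always!", said Fred.'
for piece in naive_sentences(text):
    print(repr(piece))
```

The text contains two sentences, but the rule returns three pieces because it also splits at the exclamation mark inside the quotation.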
21. NLP Tools & Techniques
22. Early NLP Systems: ELIZA (Weizenbaum 1966): simple pattern matching. PARRY (Colby et al 1971): pattern matching & planning. SHRDLU (Winograd 1972): natural language understanding, comprehensive grammar of English.
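The "simple pattern matching" style of system can be sketched with a handful of regular-expression rules. The rules, the default reply and the `respond` function are invented for this example and are not taken from the original ELIZA script, which also handled pronoun reflection and keyword ranking.

```python
import re

# Hypothetical rules: (pattern, reply template). The first matching rule wins.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (\w+)", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."

def respond(utterance):
    """Return the reply for the first rule that matches, else a default."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    return DEFAULT_REPLY

print(respond("I need a holiday."))
print(respond("My mother hates me"))
print(respond("Hello"))
```

Nothing here models meaning: the reply is assembled from whatever text the pattern captured, which is why such systems appear fluent only within a narrow range of inputs.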
23. Word Prediction: Assistive technologies (Aurora Systems, TextHelp). Google, Bing, Yahoo query suggestions.
25. Text Categorization: News agencies: classify incoming news stories. Search engines: classify queries, e.g. search Google for ‘LOG313’. Identifying spam emails, e.g. http://www.paulgraham.com/spam.html. Routing email or documents to appropriate people.
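As one possible illustration of text categorization, here is a small multinomial naive Bayes classifier with add-one smoothing, applied to spam detection. It is a sketch under stated assumptions: the four training messages and the labels "spam" and "ham" are invented, and this is not claimed to be the exact method described at the linked page.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns per-label word counts and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data, for illustration only.
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
word_counts, doc_counts = train(TRAINING)
print(classify("buy cheap pills", word_counts, doc_counts))
print(classify("monday meeting", word_counts, doc_counts))
```

The same structure applies to routing news stories or queries: only the labels and the training examples change.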
26. Terminology Extraction: Differentiate between useful index terms and ‘noise’. Help lexicographers identify new terminology, e.g. "the C4 carbons of the nicotinamide rings"; X-ray therapy vs. X-ray therapies; PAHO vs. Pan-American Health Organization. Term extraction systems process scientific papers to identify terminology, possibly comparing it with a known list.
27. Speech Recognition: Spoken dialogue systems: directory enquiries (AT&T), railways (Germany, Switzerland, Italy), airlines (USA). iPhone Voice Search. IBM Watson.
28. Named Entity Recognition: Identification of key concepts, e.g. people, places, organisations, etc. Also postcodes, temporal/numerical expressions, etc. Example: “Mexico has been trying to stage a recovery since the beginning of this year and it's always been getting ahead of itself in terms of fundamentals,” said Matthew Hickman of Lehman Brothers in New York.
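A very rough baseline for spotting candidate entities is to collect runs of capitalised words. The sketch below applies that heuristic to the slide's example sentence (with the quotation marks dropped); the pattern and function name are assumptions, and real NER systems also assign a type (person, place, organisation) to each entity, which this does not attempt.

```python
import re

# One or more capitalised words in a row, separated by whitespace.
CAPITALISED_RUN = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")

def candidate_entities(text):
    """Return each run of capitalised words as a candidate entity."""
    return CAPITALISED_RUN.findall(text)

sentence = (
    "Mexico has been trying to stage a recovery since the beginning of "
    "this year and it's always been getting ahead of itself in terms of "
    "fundamentals, said Matthew Hickman of Lehman Brothers in New York."
)
print(candidate_entities(sentence))

# The heuristic is easily fooled: any sentence-initial word is capitalised.
print(candidate_entities("The market fell in York"))
```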
29. Named Entity Recognition: Applications. Increase precision of IR: "New companies in York" vs. "Companies in New York". Support navigation: SMART tags, Apture, etc. Improve machine translation: “I live in a new house in new york”. Speech synthesis, auto-summarisation, etc.
30. Named Entity Recognition: Systems are usually developed for a specific domain and are typically language- and domain-specific. Accuracy is above 90% for newswire text: the best system at MUC-7 scored 93.39% F-measure, while human annotators scored 97.60% and 96.95%. Accuracy is lower for certain domains, e.g. life sciences (species, substances, proteins, genes, etc.), which have extended naming conventions, extensive use of conjunction and disjunction, a high occurrence of neologisms and abbreviations, etc.
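For readers unfamiliar with the F-measure quoted above, it is the weighted harmonic mean of precision and recall; with equal weighting it is F1 = 2PR / (P + R). The sketch below computes it for a hypothetical system output; the gold and predicted entity sets are invented and are unrelated to the MUC-7 results.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted set against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical annotations: three of the four predictions are correct.
gold = {"Mexico", "Matthew Hickman", "Lehman Brothers", "New York"}
predicted = {"Mexico", "Matthew Hickman", "New York", "Brothers"}

p, r = precision_recall(predicted, gold)
print(p, r, f_measure(p, r))
```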
34. Summarisation. Section 3 discusses two information access applications (text mining and question answering) closely associated with NLP. NLP Techniques: Named entity recognition; Information Extraction. Current document retrieval technologies could not identify information as specific as this within text. Word Sense Disambiguation; Text Mining; Question Answering: “Wh” questions, List questions. Examples include those mentioned here: text mining, question answering and cross-language information retrieval. Ponte and Croft (1998) “A Language Modeling Approach to Information Retrieval” SIGIR. Sanderson, M. (1994) “Word sense disambiguation and information retrieval” Proceedings of the 17th ACM SIGIR Conference. Sanderson, M. (2000) “Retrieving with Good Sense” Information Retrieval 2(1):49-69. Question Answering (Section 3.2). Summarization: single-document vs. multi-document; MS Word, Bing.
35.
36. Based on pre-defined structures, e.g. movements of company executives, victims of terrorist attacks, corporate mergers / acquisitions, interactions between genes and proteins. Example (from MUC-6): Now, Mr. James is preparing to sail into the sunset, and Mr. Dooner is poised to rev up the engines to guide Interpublic Group's McCann-Erickson into the 21st century. Yesterday, McCann made official what had been widely anticipated: Mr. James, 57 years old, is stepping down as chief executive officer on July 1 and will retire as chairman at the end of the year. He will be succeeded by Mr. Dooner, 45.
37. Information Extraction
<SUCCESSION_EVENT-2> :=
    SUCCESSION_ORG: <ORGANIZATION-1>
    POST: "chairman"
    IN_AND_OUT: <IN_AND_OUT-4>
    VACANCY_REASON: DEPART_WORKFORCE
<IN_AND_OUT-4> :=
    IO_PERSON: <PERSON-1>
    NEW_STATUS: IN
    ON_THE_JOB: NO
    OTHER_ORG: <ORGANIZATION-1>
    REL_OTHER_ORG: SAME_ORG
<ORGANIZATION-1> :=
    ORG_NAME: "McCann-Erickson"
    ORG_ALIAS: "McCann"
    ORG_TYPE: COMPANY
<PERSON-1> :=
    PER_NAME: "John J. Dooner Jr."
    PER_ALIAS: "John Dooner" "Dooner"
38. Information Extraction: the extracted templates can now be used as metadata (for retrieval), or stored in a DB and queried, e.g. “show me all the events where a finance officer left a company to take up the position of CEO in another company”.
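The “store in DB and query against it” idea can be sketched with an in-memory SQLite table. The `succession` schema, the `ceo_moves` helper and the rows are all hypothetical placeholders invented for this sketch; they are not the MUC-6 template format and do not describe real people or companies.

```python
import sqlite3


def ceo_moves(rows):
    """Return (person, old_org, new_org) for CFOs who became CEO elsewhere."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE succession ("
        "person TEXT, old_org TEXT, old_post TEXT, new_org TEXT, new_post TEXT)"
    )
    conn.executemany("INSERT INTO succession VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT person, old_org, new_org FROM succession "
        "WHERE old_post = 'CFO' AND new_post = 'CEO' AND old_org <> new_org "
        "ORDER BY person"
    ).fetchall()
    conn.close()
    return result


# Placeholder rows standing in for the output of an IE system.
ROWS = [
    ("A. Example", "Alpha Corp", "CFO", "Beta Ltd", "CEO"),
    ("B. Sample", "Gamma Inc", "CFO", "Gamma Inc", "CEO"),      # same company
    ("C. Placeholder", "Delta plc", "CTO", "Epsilon AG", "CEO"),  # not a finance officer
]

print(ceo_moves(ROWS))
```

Once events are in structured form, a question that keyword search cannot express becomes a three-condition filter.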
42. Information Extraction: but IE is difficult when concepts are spread across multiple sentences. Example: Pace American Group Inc. said it notified two top executives it intends to dismiss them because an internal investigation found evidence of “self-dealing” and “undisclosed financial relationships.” The executives are Don H. Pace, president and CEO; and Greg S. Kaplan, SVP and CFO. The event and organisation are in the first sentence, but the names are in the second, so the system needs to identify the connection between “two top executives” and “the executives”.
43. Information Extraction: anaphora resolution relies on context (semantic / pragmatic knowledge): “We gave the bananas to the monkeys because they were hungry.” “We gave the bananas to the monkeys because they were ripe.” “We gave the bananas to the monkeys because they were here.”
44. WordNet: semantic database for English (see also EuroWordNet). Based on synsets, e.g. car = automobile | railway car | elevator car | cable car; 1st synset = car, motorcar, machine, auto, automobile. Synsets are connected via relationships, e.g. hypernymy (“a kind of”), with separate hierarchies for nouns, verbs, adjectives and adverbs.
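To make the synset and hypernymy ideas concrete, here is a minimal sketch using a tiny hand-made fragment. The identifiers (`car#1` etc.), the `SYNSETS` table and the simplified “kind of” chain are assumptions for illustration only; they are not the actual WordNet data or API.

```python
# Hand-made toy fragment; the real WordNet hierarchy is much deeper.
SYNSETS = {
    "car#1": {"lemmas": ["car", "motorcar", "machine", "auto", "automobile"],
              "hypernym": "motor_vehicle#1"},
    "car#2": {"lemmas": ["car", "railway car"], "hypernym": "vehicle#1"},
    "motor_vehicle#1": {"lemmas": ["motor vehicle"], "hypernym": "vehicle#1"},
    "vehicle#1": {"lemmas": ["vehicle"], "hypernym": None},
}


def synsets_for(word):
    """All synset ids whose lemma list contains the word (one per sense)."""
    return sorted(sid for sid, s in SYNSETS.items() if word in s["lemmas"])


def hypernym_path(synset_id):
    """Follow 'a kind of' links from a synset up to the root."""
    path = []
    while synset_id is not None:
        path.append(synset_id)
        synset_id = SYNSETS[synset_id]["hypernym"]
    return path


print(synsets_for("car"))
print(hypernym_path("car#1"))
```

A polysemous word such as “car” maps to several synsets, while synonyms such as “automobile” share one, which is what makes synsets useful for query expansion.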
45. Word Sense Disambiguation: resolve ambiguities due to polysemy. Senses are often characterised by syntax, e.g. “Magnesium is a light metal” (ADJ) vs. “The light in the kitchen is quite bright” (NOUN). Can use a part-of-speech tagger to resolve broad ambiguities; PoS tags = NN, NNP, NNS, NNPS, etc.
46.
47. 22. to adjust (a timer, alarm of a clock, etc.) so as to sound when desired: He set the alarm for seven o'clock.
48. 23. to fix or mount (a gem or the like) in a frame or setting.
49. 24. to ornament or stud with gems or the like: a bracelet set with pearls.
50. 25. to cause to sit; seat: to set a child in a highchair.
70. 41. to cause (glue, mortar, or the like) to become fixed or hard.
71. 42. to urge, goad, or encourage to attack: to set the hounds on a trespasser.
72. 43. Bridge. to cause (the opposing partnership or their contract) to fall short: We set them two tricks at four spades. Only perfect defense could set four spades.
73. 44. to affix or apply, as by stamping: The king set his seal to the decree.
74. 45. to fix or engage (a fishhook) firmly into the jaws of a fish by pulling hard on the line once the fish has taken the bait.
75. 46. to sharpen or put a keen edge on (a blade, knife, razor, etc.) by honing or grinding.
76. 47. to fix the length, width, and shape of (yarn, fabric, etc.).
82. 53. to assume a fixed or rigid state, as the countenance or the muscles.
83. 54. (of the hair) to be placed temporarily on rollers, in clips, or the like, in order to assume a particular style: Long hair sets more easily than short hair.
84. 55. to become firm, solid, or permanent, as mortar, glue, cement, or a dye, due to drying or physical or chemical change.
87. 58. to begin to move; start (usually followed by forth, out, off, etc.).
88. 59. (of a flower's ovary) to develop into a fruit.
89. 60. (of a hunting dog) to indicate the position of game.
90.
91. 77. an apparatus for receiving radio or television programs; receiver.
92. 78. Philately. a group of stamps that form a complete series.
93. 79. Tennis. a unit of a match, consisting of a group of not fewer than six games with a margin of at least two games between the winner and loser: He won the match in straight sets of 6–3, 6–4, 6–4.
94. 80. a construction representing a place or scene in which the action takes place in a stage, motion-picture, or television production.
95. 81. Machinery. a. the bending out of the points of alternate teeth of a saw in opposite directions.
96. b. a permanent deformation or displacement of an object or part.
97. c. a tool for giving a certain form to something, as a saw tooth.
98. 82. a chisel having a wide blade for dividing bricks.
99. 83. Horticulture. a young plant, or a slip, tuber, or the like, suitable for planting.
100. 84. Dance. a. the number of couples required to execute a quadrille or the like.
101. b. a series of movements or figures that make up a quadrille or the like.
102. 85. Music. a. a group of pieces played by a band, as in a night club, and followed by an intermission.
103. b. the period during which these pieces are played.
104. 86. Bridge. a failure to take the number of tricks specified by one's contract: Our being vulnerable made the set even more costly.
105. 87. Nautical. a. the direction of a wind, current, etc.
106. b. the form or arrangement of the sails, spars, etc., of a vessel.
121. –interjection 100. (in calling the start of a race): Ready! Set! Go! Also, get set!
122. Word Sense Disambiguation: surely PoS tagging would help IR performance? Could index docs using words AND PoS tags, but this is very sensitive to tagger performance. Does WSD improve IR performance? Conflicting evidence: even a “perfect” WSD engine may provide only marginal gain, e.g. >76% accuracy needed for homonymy, >55% for polysemy.
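Indexing on words AND PoS tags can be sketched as an inverted index keyed by (word, tag) pairs. The tags below are assigned by hand using Penn Treebank-style labels; in practice they would come from a tagger, and any tagging errors would propagate straight into the index, which is the sensitivity the slide warns about. The `build_index` and `lookup` helpers are hypothetical names for this sketch.

```python
from collections import defaultdict


def build_index(tagged_docs):
    """Map each (lowercased word, PoS tag) pair to the set of docs containing it."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, tag in tokens:
            index[(word.lower(), tag)].add(doc_id)
    return dict(index)


def lookup(index, word, tag):
    """Documents in which the word occurs with the given PoS tag."""
    return index.get((word.lower(), tag), set())


# Hand-tagged versions of the two "light" sentences from the WSD slide.
DOCS = {
    "d1": [("Magnesium", "NN"), ("is", "VBZ"), ("a", "DT"),
           ("light", "JJ"), ("metal", "NN")],
    "d2": [("The", "DT"), ("light", "NN"), ("in", "IN"), ("the", "DT"),
           ("kitchen", "NN"), ("is", "VBZ"), ("quite", "RB"), ("bright", "JJ")],
}

index = build_index(DOCS)
print(lookup(index, "light", "JJ"))  # adjective sense
print(lookup(index, "light", "NN"))  # noun sense
```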
123. NLP Applications: The Role of Natural Language Processing in Information Retrieval
124. NLP Applications: web & enterprise search; text classification; text summarisation; machine translation; human-computer interfaces (spelling correction etc.); speech recognition & synthesis; natural language generation (e.g. in games); text mining; question answering.
125. Finally becoming mainstream? ‘80% of corporate information is unstructured’. Entire value chain for some organisations (media / publishing etc.). Retail / eCommerce: product reviews. User-generated content: blogs, forums, wikis. Voice of the Customer: social media + sentiment analysis. 161 billion gigabytes of digital information in 2006; approximately 988 exabytes by 2010. Audio / video still needs summaries & tags etc.
126. Market Outlook. Combine component technologies, e.g. PoS tagging + NER, etc. Healthy growth: massive expansion in social media; lightweight NLP for buzz analysis, brand monitoring, etc. Dominant markets: Customer Experience (VoC), media/publishing, financial services & insurance, intelligence, life sciences, e-discovery. Solutions still not standardized: need for self-service tuning & configuration. Partner ecosystem developing: marketing services providers, platform vendors, CRM + call centre vendors, system integrators.
127. Text Mining: analogy with data mining. Discover or infer new knowledge from unstructured text resources: given A <-> B and B <-> C, infer A <-> C? E.g. the link between migraine headaches and magnesium deficiency. Applications in life sciences, media/publishing, counter-terrorism and competitive intelligence.
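The A <-> B, B <-> C inference pattern can be sketched as a search for pairs of terms that share an intermediate term but are not linked directly. The `infer_indirect_links` helper is a hypothetical name, and the intermediate term in the example is a placeholder rather than a finding from the literature.

```python
from collections import defaultdict
from itertools import combinations


def infer_indirect_links(links):
    """Propose A-C pairs that share a linking term B but have no direct link.

    links: iterable of (term, term) pairs, treated as undirected.
    Returns {(a, c): {b, ...}} with a < c alphabetically.
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    neighbours = dict(neighbours)  # freeze: no keys added while iterating

    candidates = defaultdict(set)
    for b, linked in neighbours.items():
        for a, c in combinations(sorted(linked), 2):
            if c not in neighbours[a]:
                candidates[(a, c)].add(b)
    return dict(candidates)


# "intermediate term" is a placeholder for whatever concept co-occurs with both.
LINKS = [
    ("migraine", "intermediate term"),
    ("intermediate term", "magnesium deficiency"),
]

print(infer_indirect_links(LINKS))
```

Each proposed pair is only a hypothesis to be checked by a domain expert; co-occurrence alone does not establish a real relationship.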
128. Text Mining: levels of text mining, e.g. for bioscience. Tagging entities: e.g. genes, proteins, diseases and chemical compounds. Information extraction: identify meaningful structural relations within the text (e.g. interactions between proteins). Knowledge discovery: combine with other knowledge sources (such as external ontologies) to infer or discover new knowledge.
129. Question Answering: going beyond the document retrieval paradigm to provide specific answers to specific questions. Introduced as a TREC track in 1999.
130. Question Answering. Yes/no questions: “Is George W. Bush the current president of the USA?”, “Is the Sea of Tranquility deep?” “Wh” questions: “Who was the British Prime Minister before Margaret Thatcher?”, “When was the Battle of Hastings?” List questions: “Which football teams have won the Champions League this decade?”, “Which roads lead to Rome?” Instruction-based questions: “How do I cook lasagne?”, “What is the best way to build a bridge?” Explanation questions: “Why did World War I start?”, “How does a computer process floating point numbers?” Commands: “Tell me the height of the Eiffel Tower.”, “Name all the Kings of England”.
131. Question Answering. Common use cases are handled by the major search engines: Google, Yahoo!, Bing, Baidu, ... Specialist QA systems: Wolfram Alpha (http://www.wolframalpha.com), True Knowledge (http://www.trueknowledge.com), Powerset (http://www.powerset.com/), IBM Watson.
132. Question Answering. Basic approach: (1) question analysis: predict the type of answer expected and create a query; (2) document retrieval; (3) answer extraction: based on the output of (1), varies from regex to deep linguistic analysis.
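The question analysis step can be sketched at the “regex” end of the spectrum: a few patterns predict the expected answer type, and stop-word removal turns the question into query terms. The rule list, the answer-type labels and the stop-word set are all assumptions made for this sketch; real systems use much richer analysis.

```python
import re

# Hypothetical rules: first matching pattern decides the expected answer type.
ANSWER_TYPE_RULES = [
    (re.compile(r"^(is|are|was|were|does|did|can)\b", re.I), "YES_NO"),
    (re.compile(r"^who\b", re.I), "PERSON"),
    (re.compile(r"^when\b", re.I), "DATE"),
    (re.compile(r"^where\b", re.I), "LOCATION"),
    (re.compile(r"^which\b", re.I), "LIST"),
    (re.compile(r"^why\b", re.I), "EXPLANATION"),
    (re.compile(r"^how do i\b", re.I), "INSTRUCTION"),
]

STOP_WORDS = {"a", "an", "the", "is", "was", "were", "of", "do", "did",
              "i", "to", "who", "when", "where", "which", "why", "how"}


def analyse_question(question):
    """Return (predicted answer type, list of query terms)."""
    question = question.strip()
    answer_type = next(
        (label for pattern, label in ANSWER_TYPE_RULES if pattern.search(question)),
        "OTHER",
    )
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    terms = [t for t in tokens if t not in STOP_WORDS]
    return answer_type, terms


print(analyse_question("When was the Battle of Hastings?"))
print(analyse_question("How do I cook lasagne?"))
```

The predicted type then constrains answer extraction: for a `DATE` question, only date expressions in the retrieved documents are candidate answers.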
133. Evaluation: how do we measure NLP accuracy? Compare against human annotation. However, humans often disagree, particularly for complex tasks, and the comparison can be measured in many ways, e.g. “Bill Gates is chairman of Microsoft” -> “Gates” = person?
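One common way to quantify how much human annotators disagree is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The slides do not name a specific measure, so kappa is chosen here only as an illustration, and the two token-level annotations of the “Bill Gates” sentence are made up for this sketch.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Tokens: Bill Gates is chairman of Microsoft
annotator_a = ["PER", "PER", "O", "O", "O", "ORG"]  # tags "Bill Gates"
annotator_b = ["O", "PER", "O", "O", "O", "ORG"]    # tags only "Gates"

print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Raw agreement here is 5/6, but kappa is 5/7 because much of that agreement comes from the frequent `O` label, which both annotators would often choose by chance.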
134. Conclusions: The Role of Natural Language Processing in Information Retrieval
135. Conclusions. NLP’s contribution to IR is not always apparent: overemphasis on precision & recall? NLP provides benefits that go beyond quantitative metrics to support broader tasks & richer experiences: insight, analysis, visualisation, etc. NLP is required for text mining, question answering & cross-language IR: the bag-of-words model is insufficient, and the web is becoming increasingly multilingual.
136. State of the Art. Commodity technologies: tokenisation, normalization, regular expression search. Mainstream, but room for improvement: lemmatization, spelling correction, speech synthesis. Mainstream but needs improvement: PoS tagging, term extraction, summarization, speech recognition, text categorization, spoken dialogue systems. Almost there: information extraction, NL generation, simple speech translation. Don’t hold your breath: full machine translation, advanced dialogue systems.
137. Platforms & Toolkits. Many commercial vendors. Open source/access: GATE (Sheffield University), RapidMiner, Stanford NLP, OpenNLP, OpenCalais, LingPipe, NLTK.
138. Further Reading. A. Goker & J. Davies, Information Retrieval: Searching in the 21st Century. Wiley-Blackwell, 2009. D. Jurafsky and J. Martin, Speech and Language Processing. Prentice-Hall, 2nd edition, 2008. C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
139. Thank you. Tony Russell-Rose, PhD; Vice-chair, BCS IRSG; Chair, IEHF HCI Group. Email: tgr@uxlabs.co.uk Blog: http://isquared.wordpress.com LinkedIn: http://www.linkedin.com/tonyrussellrose/ Twitter: @tonygrr
Editor's Notes
Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stop words. Try queries that are dominated by stop words, with and without quotes. How do they differ?
Exercise: experiment with the 3 major search engines (Google, Bing, Yahoo) to see how they handle stemming. Try queries that are dominated by inflected forms, with and without quotes. How do they differ?
Do as class exercise
Demo: ELIZA http://nlp-addiction.com/eliza/
Demo Google suggest
Demo Auto-correct & DYM
Demo: try Google Translate (or Babelfish) with sentences containing ambiguous terms or NEs, e.g. “I live in a new house in new york”. Use the speech synthesizer: http://translate.google.com/#en|de|I%20live%20in%20a%20new%20house%20in%20new%20york
Demo Yahoo search assist + “explore concepts”
Paired exercise: what is the most polysemous English word you can think of? How many senses do you think it has in an average dictionary? Think of some polysemous words (examples include “ball”, “bank”, “port” but there are many others). Use these as queries in a search engine and comment on what you notice about the results.
Paired exercise: try each of the above queries in Google, Yahoo, Bing. Now try to answer the same questions with a normal web search (by entering a set of query terms instead of a natural language question). Are the results of this search any better or worse?
Paired exercise: try previous queries on these sites