Making Knowledge Discoverable:
The Role of Agile Text Mining
David Milward
Linguamatics
II-SDV, Nice, April 2012
Click to edit Master title styleClick to edit Master title style
• Search vs. Text Mining
• Agile Text Mining
– Linguamati...
Click to edit Master title styleClick to edit Master title styleSearch vs. Text Mining
Text MiningSearch Engine
Filter to ...
Click to edit Master title styleClick to edit Master title style
• Text mining provides ability to discover
– but typicall...
Click to edit Master title styleClick to edit Master title style
• Linguamatics I2E first adopted in pharma/biotech, inclu...
Click to edit Master title styleClick to edit Master title style
Scientific Literature
Abstracts e.g. MEDLINE
Full text jo...
Click to edit Master title styleClick to edit Master title style
Actionable
Information
Ad hoc queries Smart queries Multi...
Using I2E to Extract Relationships
8
Click to edit Master title styleClick to edit Master title styleImprove Recall Using Terminologies
Which genes are known t...
Click to edit Master title styleClick to edit Master title styleExtracting Relationships using NLP: Biomarkers
Gene
(from
...
Click to edit Master title styleClick to edit Master title styleExtracting Numerical Data in Context: Safety
Compound
Pote...
Click to edit Master title styleClick to edit Master title styleStandardizing/Clustering to Save Review
• Go directly to a...
Finding the Most Relevant Documents
13
Click to edit Master title styleClick to edit Master title styleFinding Relevant Documents
• More comprehensive results
– ...
Click to edit Master title styleClick to edit Master title styleMore Comprehensive Results: Terminologies
• Re-use termino...
Click to edit Master title styleClick to edit Master title styleChemical Text Mining
• Ability to efficiently answer preci...
Accelerating a Search Strategy
17
Click to edit Master title styleClick to edit Master title style
Looking for words before or
after the word of interest
18...
Click to edit Master title styleClick to edit Master title style
• In search we can often restrict to a particular region ...
Click to edit Master title styleClick to edit Master title style
• Find most frequent codes in patents where a company or ...
Summarizing Results from Multiple
Documents
for Efficient Review and Integration
Click to edit Master title styleClick to edit Master title style
• Find and extract patterns
to find a particular
concept
...
Click to edit Master title styleClick to edit Master title style
• Terminologies can be used to link individual synonyms t...
Click to edit Master title styleClick to edit Master title style
• Extract the same relationship, even if it is expressed ...
Finding Information Across Multiple
Documents
Click to edit Master title styleClick to edit Master title style
• Categorizing Patents according to Disease Area
– Lingui...
Click to edit Master title styleClick to edit Master title styleTell me everything about X
• Search provides most relevant...
Click to edit Master title styleClick to edit Master title style
Visualization via Cytoscape
Linking Knowledge: Indirect R...
Within a Document
Click to edit Master title styleClick to edit Master title styleLinking Within Complex Patent Documents
• Linking informat...
Click to edit Master title styleClick to edit Master title styleFrom One Claim to the Next
• For information in claims, of...
Reproducible Workflows
32
Click to edit Master title styleClick to edit Master title style
• Queries can be run regularly as
documents get updated
•...
Click to edit Master title styleClick to edit Master title styleClinical Trials Analysis from I2E Text Mining Results
Inte...
Click to edit Master title styleClick to edit Master title styleClinical Trials Analysis from I2E Text Mining Results
Date...
Click to edit Master title styleClick to edit Master title style
• Agile Text Mining provides query power and flexibility
...
Upcoming SlideShare
Loading in...5
×

II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

257

Published on

Published in: Internet
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
257
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining

  1. 1. Making Knowledge Discoverable: The Role of Agile Text Mining David Milward Linguamatics II-SDV, Nice, April 2012
  2. 2. Click to edit Master title styleClick to edit Master title style • Search vs. Text Mining • Agile Text Mining – Linguamatics I2E • Relationship Extraction: Text Mining + Search • Finding the Most Relevant Documents: Search + Text Mining • Accelerating a Search Strategy • Example Results from – multiple documents – within complex documents • Reproducible Workflows 2 Overview
  3. 3. Click to edit Master title styleClick to edit Master title styleSearch vs. Text Mining Text MiningSearch Engine Filter to find most relevant documents, then read News Feeds Manipulate the text to discover what is there company activity company Sanofi bid Aventis Roche partner Antisoma Scientific Literature Patents Internal Reports Clinical Trials 3 Natural Language Processing (NLP) to understand meaning Statistics to provide trends
  4. 4. Click to edit Master title styleClick to edit Master title style • Text mining provides ability to discover – but typically queries have to be programmed in, and processing is slow • Search provides ability to filter quickly to relevant documents – but poor at answering open questions e.g. “what are biomarkers for breast cancer” • Combine text mining with search to discover within specific contexts e.g. What is a risk factor for diabetes 4 Agile Text Mining Filter to the context of interestDiscover what is available
  5. 5. Click to edit Master title styleClick to edit Master title style • Linguamatics I2E first adopted in pharma/biotech, including 9 of top 10 • Used to answer a wide range of questions e.g. – Dose durations of follow-on clinical trials – Therapeutic usages of recombinant proteins – Cofactors for Nuclear Receptors • Wide range of application areas across the drug pipeline e.g. – Target Prioritization, Safety/Toxicity, Clinical Trial Design – Competitive Intelligence, Marketing • Are the Life Sciences special? – Particularly knowledge intensive, so high demand – A lot of complex, ambiguous terminology, balanced by good resources • I2E is a generic platform – Now being used in chemicals, consumer products, health … 5 Technology Adoption
  6. 6. Click to edit Master title styleClick to edit Master title style Scientific Literature Abstracts e.g. MEDLINE Full text journal articles e.g. via Quosa Patents News Conference Abstracts Internal documents Social Media Twitter 6 Example Data Sources Competitive Intelligence, R&D, Marketing Clinical Trial Design, Safety, Relative Efficacy of Drugs Clinical Trials FDA Drug Label Inserts Electronic Health Records MEDLINE (20 million abstracts) Patent Full Text USPTO, WIPO, EPO … Clinical Trials … Cloud Based Service
  7. 7. Click to edit Master title styleClick to edit Master title style Actionable Information Ad hoc queries Smart queries Multi queries Batch queries NLP Class, concept Regular Expression Chemical Structure Agile NLP Querying Linguamatics I2E: An Agile Text Mining Platform Internal/External Sources Decision Support Documents Domain knowledge – ontologies, thesauri, dictionariesOntologies Flexible, Highly Scalable Rich IndexesIndexing Structured Results (Actionable Information) 7
  8. 8. Using I2E to Extract Relationships 8
  9. 9. Click to edit Master title styleClick to edit Master title styleImprove Recall Using Terminologies Which genes are known to affect breast cancer? (ESR1 OR ERBB2 OR CHEK2 OR BRCA1...) Affect Breast Cancer (ESR1 OR ERBB2 OR CHEK2 OR BRCA1...) Affect (“Breast Cancer” OR “Breast Carcinoma” OR “Cancer of Breast” OR ...) Could be 10,000s of terms Could be 100s of terms 9 Any Gene/Protein Relationship Breast Cancer
  10. 10. Click to edit Master title styleClick to edit Master title styleExtracting Relationships using NLP: Biomarkers Gene (from Entrez) Complex linguistic relationship Disease (from MedDRA) Relevant sentence extracted with terms highlighted Link to source document
  11. 11. Click to edit Master title styleClick to edit Master title styleExtracting Numerical Data in Context: Safety Compound Potential safety issues In this organ At this dosage 11
  12. 12. Click to edit Master title styleClick to edit Master title styleStandardizing/Clustering to Save Review • Go directly to answers, e.g. find all the genes associated with a specific disease • Highlighted evidence and link to the document • Save time in review: – Rather than reading all 470 documents for ERBB2, just read enough to check relationship exists – Can then concentrate on the longer tail of less well-known genes/proteins • Customer reports of order of magnitude speed-up 12
  13. 13. Finding the Most Relevant Documents 13
  14. 14. Click to edit Master title styleClick to edit Master title styleFinding Relevant Documents • More comprehensive results – Terminologies – Regular Expressions • Precise expressions e.g. for miRNA (simplified) » let-?d+.* » mirn?a?-?d+.* – High throughput searches • lists of 500 genes, chemicals etc. – Chemical substructure and similarity searching • Reduced noise in results – Use of linguistics rather than distance e.g. n words – Regions • abstract, methods, claims, claim, table – Local negation e.g. “dead” for death, but not “dead time” 14
  15. 15. Click to edit Master title styleClick to edit Master title styleMore Comprehensive Results: Terminologies • Re-use terminologies with 10s of thousands of concepts, and 100s of thousands of synonyms • If we are interested in genes associated with cancer, we don’t just want synonyms of “cancer” e.g. “malignant neoplasm”, but also any specific cancer Malignant neoplasm Malignant tumor Malignancy Carcinoma …. Cancer Leukaemia Hamartoma Paraneoplastic Syndromes Adamintonoma Peutz-Jeghers Syndromes Hydatiform Mole Plus 1000s more… Ways of Expressing concept of “Cancer” Different kinds of Cancer 15
  16. 16. Click to edit Master title styleClick to edit Master title styleChemical Text Mining • Ability to efficiently answer precise queries e.g. – What chemicals with this substructure act as inhibitors 16
  17. 17. Accelerating a Search Strategy 17
  18. 18. Click to edit Master title styleClick to edit Master title style Looking for words before or after the word of interest 18 Contextual Landscaping Results from 100K USPTO Patents Linguistics to reduce noise, to find types of chocolate
  19. 19. Click to edit Master title styleClick to edit Master title style • In search we can often restrict to a particular region of a document • In text mining we can first check all the regions that the term appears in – E.g. look for the regions that IC50 appears in • See what you want and what you might lose – We then have better evidence to restrict to particular regions 19 Which Regions Does this Term Appear In Results from 100K 2011 Patents
  20. 20. Click to edit Master title styleClick to edit Master title style • Find most frequent codes in patents where a company or set of companies (e.g. pharma) is the Applicant or Assignee 20 What IPC Codes are Assigned to a Company’s Patents Results from recent USPTO Patents
  21. 21. Summarizing Results from Multiple Documents for Efficient Review and Integration
  22. 22. Click to edit Master title styleClick to edit Master title style • Find and extract patterns to find a particular concept • For example: how do we distinguish people: – willing to get a vaccine – not willing • If we can partition the two populations we can then see how they are influenced Semantics: 1000s of ways of saying the same thing 22 Examples from Twitter Concept of getting a vaccine
  23. 23. Click to edit Master title styleClick to edit Master title style • Terminologies can be used to link individual synonyms to a semantic concept • Concept is uniquely identified by the ontology and the node identifier (and, typically, by the preferred term) • This allows better clustering of results, better statistics, and allows us to connect results from text mining with other databases, or the semantic web Semantics: Standardized Identifiers for Concepts 23
  24. 24. Click to edit Master title styleClick to edit Master title style • Extract the same relationship, even if it is expressed very differently e.g. – SRC phosphorylates EGFR – phosphorylation of EGFR by SRC – EGFR is phosphorylated by SRC • Establish the direction of the relationship – SRC phosphorylates EGFR, not EGFR phosphorylates SRC Semantics: Directed Relationships 24
  25. 25. Finding Information Across Multiple Documents
  26. 26. Click to edit Master title styleClick to edit Master title style • Categorizing Patents according to Disease Area – Linguistics and ontologies provide better context, greater insight 26 Document Categorization Freemind 400K Patents
  27. 27. Click to edit Master title styleClick to edit Master title styleTell me everything about X • Search provides most relevant documents mentioning X • Text mining can summarize distinct properties of X by clustering facts extracted from all documents 27
  28. 28. Click to edit Master title styleClick to edit Master title style Visualization via Cytoscape Linking Knowledge: Indirect Relationships 28 Thalidomide to a Gene/Protein A Gene/Protein to Angiogenesis
  29. 29. Within a Document
  30. 30. Click to edit Master title styleClick to edit Master title styleLinking Within Complex Patent Documents • Linking information in one part of a patent to another e.g. – Finding compounds with a particular substructure where a value is reported … 30
  31. 31. Click to edit Master title styleClick to edit Master title styleFrom One Claim to the Next • For information in claims, often want to work back along the chain of claims 31
  32. 32. Reproducible Workflows 32
  33. 33. Click to edit Master title styleClick to edit Master title style • Queries can be run regularly as documents get updated • Integrate text mining with workflow tools for – up-to-date dashboards – alerts – integration with other structured data sources – web portals • Provide a range of analytics and visualization 33 Automation Pipeline Pilot
  34. 34. Click to edit Master title styleClick to edit Master title styleClinical Trials Analysis from I2E Text Mining Results Intervention Using Pipeline Pilot Visualization 34
  35. 35. Click to edit Master title styleClick to edit Master title styleClinical Trials Analysis from I2E Text Mining Results Dates Colours indicate Phase Using Pipeline Pilot visualization 35
  36. 36. Click to edit Master title styleClick to edit Master title style • Agile Text Mining provides query power and flexibility – Can address the long tail of less predictable questions – Allows “queries of arbitrary complexity” combining • NLP, ontologies, regular expressions, regions, numerical expressions, chemical substructure/similarity, disambiguation – Uses the data itself to • inform search strategies • build terminologies – Provides flexible, structured output to • allow integration with existing structured data • fit into existing workflows • This enables very wide application of text mining: wherever there is a need to search, extract, or categorize information Conclusions 36
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×