Vast amounts of new information and data are generated every day through scientific research. More and more of this data is stored in rapidly growing but siloed databases, creating “Big Data” challenges. New technologies such as text and data mining make it possible to efficiently search and improve knowledge by applying analytics across these data sources. Research-intensive companies in the pharmaceutical and chemical industries are exploring the use of text and data mining (TDM) techniques to glean new insights from patents, clinical data, scientific literature, and other data sources. These insights are seen as critical to accelerating the process of drug and product discovery. As these researchers leverage TDM techniques, obtaining easy, centralized access to TDM-ready full-text content from multiple publishers becomes more and more important. What will be the future role of TDM in 2014 and beyond? What are the major TDM trends, and what solutions are companies looking for to accelerate their R&D efforts? Based on the experience gathered in a text and data mining pilot program successfully run by RightsDirect’s parent company Copyright Clearance Center (CCC) in 2013, RightsDirect’s General Manager Kim Zwollo will give an overview of current market needs, options and trends in text and data mining. Using CCC’s TDM solution as an example, the presentation outlines critical success factors in technology and business models that need to be part of a comprehensive approach to text and data mining.
ICIC 2014 Finding Answers in the Data – The Future Role of Text and Data Mining (TDM)
1. Finding Answers in the Data
The Future Role of Text and Data Mining
Kim Zwollo
General Manager, RightsDirect
Andrew Hinton
Linguamatics
2. Making Copyright Work – CCC and RightsDirect
Rightsholders
Content Users
600+ million rights from:
•Publishers
•Authors
•Creators
•35,000 companies
•Employees worldwide
•Users in 180 countries
•Licensing Solutions
•Rights Management
•Content Delivery
•Copyright Education
10/15/2014
3. Overview
•What is Text and Data Mining
•Why text mining is useful
•Technology Trends
•Information Retrieval Challenges
•Publisher perspective
•Emerging solutions
•Use cases from Linguamatics
4. What is Text and Data Mining
Interpret Meaning, Identify & Extract
•Facts
•Relationships
•Assertions
Linguamatics 2014
5. Application Areas for text mining
Protein-Protein Interactions
Vocabulary Development
Target Identification & Prioritization
Conference Abstract Mining
Key Opinion Leader Identification
Safety/Tox
In-licensing Opportunities
Gene Profiling
Systems Biology
Mining FDA Drug Labels
Extracting Numerical and Experimental Data
Mutations and Gene Expression
Sentiment Analysis in Social Media
Workflow Integration
Mining Electronic Medical Records
Clinical Trial Analysis
Patent Analysis
Biomarker Discovery
Competitive Intelligence
Drug Repositioning
6. “Drug Discovery” Process
•Goal: Develop new treatments for diseases through hypothesis formation.
•Methodology:
–Keyword/Database Searching
–Review Literature
–Find relationships
–Develop hypothesis
–Test
–Product development
Etc.
8. Problem: Too Much Research
•53M Records in Scopus
•800,000 Journal Articles published per year
http://altmetrics.org/manifesto/ October 26, 2010
9. Even within one disease area…
•Angina
•Acute coronary syndrome
•Alexia
•Anomic aphasia
•Aortic dissection
•Aortic regurgitation
•Aortic stenosis
•Apoplexy
•Apraxia
•Arrhythmias
•Asymmetric septal hypertrophy (ASH)
•Atherosclerosis
•Atrial flutter
•Atrial septal defect
•Atrioventricular canal defect
•Atrioventricular septal defect
•Avascular necrosis
–Etc…
Lots of disorders …
Lots of documents…
•35,000+ on Improve Circulation
•7,000+ per disease area
10. Literature Based Discovery
Don Swanson (1924-2012)
[1986] Blood viscosity served as a bridge between the topics of Raynaud’s disease and dietary fish oil.
A
B
C
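Swanson's A-B-C pattern can be sketched in a few lines of Python. This is a minimal illustration, assuming each document has already been reduced to a set of topic terms. The toy corpus, the term names and the functions `cooccurring` and `open_discovery` are hypothetical; they are not taken from the presentation or from any specific tool.

```python
from collections import defaultdict

# Hypothetical toy corpus: each document is reduced to a set of topic terms.
DOCS = [
    {"raynaud's disease", "blood viscosity"},
    {"raynaud's disease", "platelet aggregation"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"aspirin", "platelet aggregation"},
]


def cooccurring(term, docs):
    """Return every term that shares at least one document with `term`."""
    linked = set()
    for doc in docs:
        if term in doc:
            linked |= doc - {term}
    return linked


def open_discovery(a_term, docs):
    """Map each candidate C term to the B terms that bridge it to A.

    A candidate C never appears in the same document as A; the two are
    connected only indirectly, through shared B terms.
    """
    b_terms = cooccurring(a_term, docs)
    bridges = defaultdict(set)
    for b in b_terms:
        for c in cooccurring(b, docs):
            # Skip A itself and anything already directly linked to A.
            if c != a_term and c not in b_terms:
                bridges[c].add(b)
    return dict(bridges)


result = open_discovery("raynaud's disease", DOCS)
for c_term, via in sorted(result.items()):
    print(c_term, "via", sorted(via))
```

Here A is the starting topic, the B terms are the bridging concepts, and each C term is a candidate hypothesis for a researcher to review. Real systems rank such candidates, since plain co-occurrence produces many spurious links.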
11. Information Retrieval and Discovery Process
*http://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining
Software Platforms for TDM
Information Retrieval
Knowledge Discovery
12. Challenges for Text Mining Researchers
•Many sources of content
•Many formats
•Difficult to obtain full-text in XML
•Difficult to integrate content into TDM software.
•Hard to negotiate and manage licenses and feeds from all publishers.
13. STM Publisher Perspective
•Concern about disruptive nature of TDM to subscription business
•Access problem, more than a copyright problem
•Technical challenges with formats and authentication
•More industry education needed
•Top STM Publishers are making their content available for mining
14. Background: Timeline
•JISC paper May 2011
•First PDR-TDM meeting Nov 2011
•CCC TDM Event – March 2012
•CCC White Paper on TDM Issues and Solutions – May 2012
•CCC Pilot 2013
•Second PDR-TDM meeting Nov 2013
•Content acquisition 2014
•Launch CCC service for mining full text (2015)
15. Helping TDM Researchers
Publisher 1
Publisher 2
Publisher 3
Rightsholders provide CCC with a feed of their content in XML.
<XML>
16. Helping TDM Researchers
Company A
Company B
Company C
Companies provide CCC with information about their subscriptions and holdings, using our automated tools in DirectPath.
17. Helping TDM Researchers
Company A
Company B
Company C
Publisher 1
Publisher 2
Publisher 3
Companies request article sets for each TDM project.
CCC manages access based on subscription information.
<XML>
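The access step on this slide can be pictured as a filter over the requested article set. The sketch below assumes a deliberately simple data model, in which a subscription is a journal plus a range of years. The `Article` and `Holding` types and the `entitled_articles` function are hypothetical and do not describe CCC's actual service or DirectPath.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    doi: str
    journal: str
    year: int


@dataclass(frozen=True)
class Holding:
    """One subscription: a journal and the years the company may access."""
    journal: str
    first_year: int
    last_year: int


def entitled_articles(requested, holdings):
    """Keep only the requested articles covered by one of the holdings."""
    by_journal = {}
    for holding in holdings:
        by_journal.setdefault(holding.journal, []).append(holding)
    return [
        article
        for article in requested
        if any(
            h.first_year <= article.year <= h.last_year
            for h in by_journal.get(article.journal, [])
        )
    ]


# Hypothetical request for one TDM project (placeholder DOIs and journals).
holdings = [Holding("Journal X", 2005, 2014)]
requested = [
    Article("10.0000/x1", "Journal X", 2010),  # inside the subscribed range
    Article("10.0000/x2", "Journal X", 2001),  # before the subscription starts
    Article("10.0000/y1", "Journal Y", 2012),  # journal not subscribed
]
granted = entitled_articles(requested, holdings)
print([article.doi for article in granted])
```

In practice, entitlement rules are likely to be richer than a year range (licence terms, embargoes, open access content), so this shows only the shape of the check.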
18. Looking Ahead: Emerging Solutions for Information Retrieval
•Open Access Content
•Publisher-specific capabilities for delivering content (Elsevier and others)
•Industry-wide content access solutions by intermediaries
–CrossRef
–CCC
–PLS
19. A look at a Text Mining Application
A presentation by Linguamatics
Andrew Hinton, Linguamatics
20. About Linguamatics
Boston
Cambridge
I2E: agile, scalable, real-time NLP-based text mining. Fact extraction and knowledge synthesis.
Fortune 500
•Pharma/Biotech: including 17 of the top 20
•Healthcare: including Kaiser Permanente
•Government: including FDA
Software
Consulting
Hosted Content
21. Linguistic Processing Using NLP
•Groups words into meaningful units
•Morphology allows search for different forms of words
We find that p42mapk phosphorylates c-Myb on serine and threonine.
Purified recombinant p42 MAPK was found to phosphorylate Wee1.
•Sentences
•Morphology: different forms
•Noun groups match entities
•Verb groups match actions
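The two example sentences on this slide show why morphology and entity matching matter: the same relationship appears with different verb forms ("phosphorylates", "phosphorylate") and different spellings of the same protein ("p42mapk", "p42 MAPK"). The sketch below illustrates the idea with hand-written lookup tables and a nearest-entity heuristic. It is an assumption-laden toy, not how I2E works; the tables and function names are hypothetical, and a real NLP system would use proper parsing and ontologies.

```python
import re

# Hypothetical mini-lexicon: surface forms of one verb mapped to its lemma.
VERB_FORMS = {
    "phosphorylate": "phosphorylate",
    "phosphorylates": "phosphorylate",
    "phosphorylated": "phosphorylate",
    "phosphorylating": "phosphorylate",
}

# Hypothetical synonym table: lower-cased spellings -> canonical entity name.
ENTITY_SYNONYMS = {
    "p42mapk": "p42 MAPK",
    "p42 mapk": "p42 MAPK",
    "c-myb": "c-Myb",
    "wee1": "Wee1",
}


def find_entities(sentence):
    """Return sorted (position, canonical name) pairs for known entities."""
    lowered = sentence.lower()
    hits = []
    for surface, canonical in ENTITY_SYNONYMS.items():
        for match in re.finditer(re.escape(surface), lowered):
            hits.append((match.start(), canonical))
    return sorted(hits)


def extract_relations(sentence):
    """Return (agent, lemma, target) triples using a nearest-entity heuristic."""
    entities = find_entities(sentence)
    relations = []
    for word in re.finditer(r"[A-Za-z]+", sentence):
        lemma = VERB_FORMS.get(word.group().lower())
        if lemma is None:
            continue
        before = [name for pos, name in entities if pos < word.start()]
        after = [name for pos, name in entities if pos > word.start()]
        if before and after:
            # Closest entity on each side of the verb.
            relations.append((before[-1], lemma, after[0]))
    return relations


s1 = "We find that p42mapk phosphorylates c-Myb on serine and threonine."
s2 = "Purified recombinant p42 MAPK was found to phosphorylate Wee1."
print(extract_relations(s1))
print(extract_relations(s2))
```

Both sentences yield a triple with the same agent and the same verb lemma, which is the point of normalising word forms and entity names before searching.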
23. Biomarker Discovery - Genes
Gene (from Entrez)
Complex linguistic relationship
Disease (from MedDRA)
Relevant sentence extracted with terms highlighted
Link to source document
24. Categorizing Relationships
Use of NLP allows accurate and precise identification of biomarker relationships
25. Patent Applications and Grants: Companies vs. Diseases
[Chart: patent applications and grants per company, broken down by disease category; axis from 0 to 25,000]
Companies: Abbott, AZ, Bayer, BMS, GSK, Roche, Merck, Novartis, Pfizer
Disease categories: Virus Diseases; Substance-Related Disorders; Stomatognathic Diseases; Skin and Connective Tissue Diseases; Respiratory Tract Diseases; Parasitic Diseases; Otorhinolaryngologic Diseases; Occupational Diseases; Nutritional and Metabolic Diseases; Nervous System Diseases; Neoplasms; Musculoskeletal Diseases; Mental Disorders; Male Urogenital Diseases; Immune System Diseases; Hemic and Lymphatic Diseases; Female Urogenital Diseases and Pregnancy Complications; Eye Diseases; Endocrine System Diseases
26. Find properties
Melting Points for Exemplified Compounds
Output to e.g. Excel
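As a rough illustration of pulling melting points out of running text and writing them to a file Excel can open, the sketch below uses a regular expression and CSV output. The assumed phrasing ("m.p." or "melting point", a value or range, then "°C"), the sample text and the function names are all hypothetical; real patents use many more variants, and this is not the method used by any particular product.

```python
import csv
import io
import re

# Assumed phrasing: "m.p." / "mp" / "melting point", a value or range, "°C".
MP_PATTERN = re.compile(
    r"\b(?:m\.?p\.?|melting point)\s*[:=]?\s*"
    r"(\d+(?:\.\d+)?)"                            # single value or lower bound
    r"(?:\s*(?:-|\u2013|to)\s*(\d+(?:\.\d+)?))?"  # optional upper bound
    r"\s*°\s*C",
    re.IGNORECASE,
)


def extract_melting_points(text):
    """Return (low, high) pairs in °C; a single value gives low == high."""
    results = []
    for match in MP_PATTERN.finditer(text):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


def to_csv(rows):
    """Serialise the pairs as CSV text, a format Excel can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["mp_low_c", "mp_high_c"])
    writer.writerows(rows)
    return buffer.getvalue()


# Invented sample text, for illustration only.
sample = (
    "Example 12: white solid, m.p. 142-144 °C. "
    "Example 13: melting point: 98.5 °C."
)
points = extract_melting_points(sample)
print(to_csv(points))
```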
27. Linking from Definitions to Table Values
Connecting information found in different parts of the document, for example finding a compound as “Example 12” in a patent and linking to a table where numerical data is reported.
Patent document
…
Combined into a row of data in the structured results table
Patent Data from IFI Claims Direct
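The linking step amounts to a join on the example number: one part of the document defines what "Example 12" is, another reports a value for it, and the two are merged into one row. The sketch below assumes a very regular phrasing ("Example N: <name> was ...") and a table already parsed into tuples. The text, the values and the function names are invented for illustration.

```python
import re

# Invented patent fragments, for illustration only.
DESCRIPTION = (
    "Example 12: 4-chloro-N-methylbenzamide was prepared as described. "
    "Example 13: 4-fluoro-N-methylbenzamide was prepared analogously."
)
TABLE_ROWS = [  # (example number, reported value) as read from a results table
    ("12", "0.8"),
    ("13", "2.4"),
    ("14", "5.1"),  # no matching definition in DESCRIPTION
]


def example_definitions(text):
    """Map example number -> compound name from 'Example N: <name> was ...'."""
    pattern = re.compile(r"Example (\d+): (.+?) was ")
    return {number: name for number, name in pattern.findall(text)}


def link_rows(text, table_rows):
    """Join table values to compound definitions via the example number."""
    names = example_definitions(text)
    return [
        {"example": number, "compound": names[number], "value": value}
        for number, value in table_rows
        if number in names
    ]


rows = link_rows(DESCRIPTION, TABLE_ROWS)
for row in rows:
    print(row)
```

Rows without a matching definition are dropped here; a production workflow would more likely keep them and flag the missing link.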
28. Claim Chain Information
•For information in claims, we often want to work back along the chain of claims to see what the current claim depends on
Compounds
Treats cervical cancer
Peptide Seq
Residues 33-176
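Working back along a claim chain can be sketched as following "claim N" references until an independent claim is reached. The claim texts below are invented, loosely echoing the labels on this slide, and the functions are hypothetical. The sketch follows only the first referenced claim, so multiple dependencies ("claims 1 or 2") are out of scope.

```python
import re

# Invented claim set, for illustration only.
CLAIMS = {
    1: "A compound comprising a peptide sequence.",
    2: "The compound of claim 1, wherein the peptide consists of residues 33-176.",
    3: "The compound of claim 2 for use in treating cervical cancer.",
}

DEPENDS_ON = re.compile(r"\bclaim (\d+)\b", re.IGNORECASE)


def parent_claim(text):
    """Return the first claim number referenced in `text`, or None."""
    match = DEPENDS_ON.search(text)
    return int(match.group(1)) if match else None


def claim_chain(number, claims):
    """Return claim numbers from `number` back to its independent claim."""
    chain = []
    seen = set()
    current = number
    # Stop at an independent claim, an unknown claim number, or a cycle.
    while current is not None and current in claims and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_claim(claims[current])
    return chain


chain = claim_chain(3, CLAIMS)
print(chain)
for number in chain:
    print(number, CLAIMS[number])
```

Reading the chain in reverse gives the full context of claim 3: the compound, the residue range that narrows it, and the therapeutic use.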
29. Benefits of Text Mining Using Full Text
•Analysis of PubMed Central records
•Look for mentions of analytical chemistry techniques
•Identify concepts in abstract vs. body
Many more mentions of experimental techniques in full text compared to abstract alone!
[Chart: Analytical Chemistry Techniques by Section]
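The abstract-versus-body comparison can be approximated by counting technique mentions per section. The sketch below assumes an article already split into named sections and uses a tiny, hypothetical term list with an invented article. Exact-string matching is a simplification; it misses synonyms and abbreviations that a full NLP pipeline would catch.

```python
import re
from collections import Counter

# Hypothetical term list; a real vocabulary would be far larger.
TECHNIQUES = ["HPLC", "NMR", "mass spectrometry"]


def count_mentions(text, terms=TECHNIQUES):
    """Count case-insensitive whole-word mentions of each term."""
    counts = Counter()
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = len(pattern.findall(text))
    return counts


def compare_sections(sections):
    """Return mention counts for each named section of one article."""
    return {name: count_mentions(text) for name, text in sections.items()}


# Invented article, for illustration only.
article = {
    "abstract": "We characterised the product by NMR.",
    "body": (
        "Purity was checked by HPLC. NMR spectra were recorded. "
        "HPLC conditions are listed below. "
        "Mass spectrometry confirmed the mass."
    ),
}
per_section = compare_sections(article)
for name, counts in per_section.items():
    print(name, dict(counts))
```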