Strategic partnerships between pharmaceutical companies and medical experts lead to more effective medical and marketing activities throughout a product life cycle. Identifying such medical experts, that is, key opinion leaders (KOLs), through bibliometric analysis is challenging because of the volume and variety of data. Today, the research community is flooded with scientific literature, with thousands of journals and over 20 million abstracts in PubMed. Developing a holistic framework to identify, profile and update KOLs is the need of the hour. Customers want digestible information: everything relevant. In this talk, we will present case studies on how we used ontologies and disambiguation techniques to address KOL identification for different therapeutic areas.
II-SDV 2016 Srinivasan Parthiban - KOL Analytics from Biomedical Literature
1. KOL Analytics from Biomedical Literature
II-SDV Conference
Nice, France
18 - 19 April 2016
Srinivasan Parthiban
Thava Alagu
New York, USA
2. • Working with pharmaceutical Medical Affairs, Clinical, R&D, and commercial organizations since 2005
• Working with more than half of the Top 50 Companies, 16 of the top 25 (17 and 18 contracting now!)
• The only completely integrated Scientific Information Solution
Provides timely insights and facilitates strategic decision making from the vast amount of publicly available scientific information
Medmeme
Meme (noun): An idea or behavior that spreads in a manner analogous to the biological transmission of genes.
3. Bottom Up vs. Top Down
• As each scientific dissemination is captured, it is normalized and disambiguated before being placed into the master data warehouse
• Matching, tagging and synonyms are added at this stage
• Data is mapped to all relevant areas of interest:
  • People
  • Places
  • Institutions & Companies
  • Drugs
  • Keywords: Mechanism of Action, treatment paradigms, etc.
Building the Scientific Data Warehouse
4. Data Sources
• Grants: over 1,128,000
• Patents: over 800,000
• Clinical Trials: over 280,000
• Publications: over 8,930,000 abstracts from 5,760 journals
• Meetings: over 11,870,000 abstracts; monitoring 14,000+ meetings/year
• Treatment Guidelines: over 36,480
Rolling 10 years · Continuously Updated · Scientifically Credible Sources
Aligned to the Scientific Discovery Process, from Grants to Guidelines
5. Impactmeme: The ultimate tool for constantly keeping on top of who is saying what, and where. It captures all available scientific dissemination regardless of source
Profilememe: Complete, detailed profiles of virtually all significant publishing and presenting activities for up to 10 years, at one's fingertips and continuously updated
Insightmeme: A virtual medical librarian on a desktop that lets a user search, on almost any dimension, the entirety of medical journal contents and congress outputs for the past 10 years up to the past month, all normalized and indexed
Conferencememe: The most comprehensive database of medical congress output available anywhere, accessible to users everywhere. See trends in content, as well as where the opinion leaders of interest are presenting
Medmeme Products
6. • An industry term and acronym: KOL = Key Opinion Leader
• KOLs are influential doctors, physicians and members of the medical community whose opinions are highly regarded and who influence other doctors and physicians.
• KOLs advise companies as to where unmet medical needs lie, choose drug targets, help to define potential product profiles and shape clinical programs, run clinical trials, and may be involved in a drug's regulatory or reimbursement review process.
• Peer-to-peer relationships with KOLs are maintained by Medical Science Liaisons (MSLs) from pharma and healthcare companies. MSLs are therapeutic specialists (e.g., oncology, cardiology, neurology)
What is a KOL?
8. Different Levels of KOLs

| Tier | Geographic Influence | Does the physician have to lead clinical research studies? | Is the physician an early adopter of new drugs? | Education Level | Level of Annual Advising Services Funding | Level of Annual Grant Funding |
|---|---|---|---|---|---|---|
| Tier 1 | Global | Yes | Yes | Medical Doctor | $25,000 to $50,000 | $100,000 to $250,000 |
| Tier 2 | National (US) | Yes | Yes | Medical Doctor | $10,000 to $25,000 | Less than $100,000 |
| Tier 3 | Regional | No | Yes | Medical Doctor | Less than $10,000 | Less than $100,000 |
| Tier 4 | Local | No | Not necessarily | Medical Doctor | Less than $10,000 | Less than $100,000 |
| Tier 5 | Local or National (non-USA) | No | No | PharmD | Less than $10,000 | Less than $100,000 |
9. Average Number of Publications per Year by Thought Leader Tier
[Bar chart; x-axis: Thought Leader Tier, y-axis: Number of Publications per Year]
Tier-1: 8.2, Tier-2: 5.7, Tier-3: 4.8, Tier-4: 2.9, Tier-5: 1.7
10. Average Years of Clinical Experience by Thought Leader Tier
[Bar chart; x-axis: Thought Leader Tier, y-axis: Clinical Experience in Years]
Tier-1: 12.9, Tier-2: 9, Tier-3: 7.4, Tier-4: 7.3, Tier-5: 5.2
11. Average Number of Promotional Speeches per Year by Thought Leader Tier
[Bar chart; x-axis: Thought Leader Tier, y-axis: Speeches]
Tier-1: 9.2, Tier-2: 6, Tier-3: 3.6, Tier-4: 3.9, Tier-5: 2.2
13. Average Amount of Hours Spent per Thought Leader Activity
[Horizontal bar chart; x-axis: Hours (0 to 25)]
• Delivering a Promotional Speech: 1.85
• Delivering a Scientific Speech: 2.32
• Leading an Advisory Panel (Chair): 7.17
• Moderating an Advisory Panel: 6.79
• Participating in an Advisory Panel: 6.69
• Authoring a Manuscript: 20.65
• Authoring an Abstract: 7.38
• Thought Leader Training (General): 5.52
• Compliance Training: 2.17
15. Three Challenges
1. Synonymy: A single individual may publish under multiple names. This includes a) orthographic and spelling variants, b) spelling errors, c) name changes over time, as may occur with marriage, religious conversion or gender reassignment, and d) the use of pen names.
2. Homonymy: Many different individuals have the same name. In fact, common names may comprise several thousand individuals.
3. The necessary metadata are often incomplete or lacking entirely. For example, some publishers and bibliographic databases did not record authors' first names, their geographical locations, or identifying information such as their degrees or their positions.
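One common way to absorb the orthographic variants listed under synonymy is to map every name to a normalized key before any comparison. The sketch below is only an illustration: the `name_key` helper and the "Last, First" input form are assumptions, not something described in the talk.

```python
import unicodedata

def name_key(full_name: str) -> str:
    """Map 'Last, First Middle' to a 'last_f' key (hypothetical helper)."""
    # Strip accents so that 'Müller' and 'Muller' compare equal.
    ascii_name = (
        unicodedata.normalize("NFKD", full_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    last, _, given = ascii_name.partition(",")
    # Keep letters only, so apostrophes, hyphens and spaces do not matter.
    last = "".join(ch for ch in last.lower() if ch.isalpha())
    given = given.strip().lower()
    initial = given[0] if given else ""
    return f"{last}_{initial}"

variants = ["Müller, Jürgen", "Muller, J.", "MULLER, Jurgen K"]
keys = {name_key(v) for v in variants}  # all three collapse to one key
```

Such a key reduces synonymy but makes homonymy worse, because every "Muller, J" now shares one key. It is therefore better suited to gathering candidates than to deciding identity.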
17. ...mistaken identity has resulted in the wrong person being invited to work on a project [...] or to undertake the peer review of an article
18. Type I error
False Positive: Different author instances are identified as a single author entity. This results in clusters that are bigger than they should be.
Type II error
False Negative: Different instances of the same author are not identified as such. This results in too many small clusters.
What Can Go Wrong?
19. Percentage of author names in Medline that include a full first name instead of an initial
[Two line charts; x-axis: Year (1995 to 2015, and 2000 to 2012), y-axis: percentage (%)]
• Full names work much better than initials
• Only 5% of the author names on your institution's articles are people in your instance of Profiles. The rest are former faculty or external collaborators whom you have never heard of.
20. Can never be 100% accurate
85% is considered quite good
Further manual disambiguation is optional
Close enough
21. Who is John Smith and what is he talking about?
Retrieve all clusters with the same author name
What Do You Want to Know?
Who is this John Smith, the author of Article X?
Retrieve other PubMed ids of the same cluster
22. Give me top 10 KOLs in the field of Cancer!
The DISA Platform retrieves the top 10 Unique-Author-IDs (UAIDs).
Each UAID is associated with one cluster of articles and its identity information (affiliations and e-mails).
DISA uses the keywords associated with articles to pre-index the authors by keyword.
What Do You Want to Know?
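The three questions on these two slides can be read as lookups over a table of disambiguated clusters. The sketch below is a toy illustration only: the in-memory `CLUSTERS` dictionary, the function names, the sample data, and ranking by keyword-tagged article count are all assumptions, not the DISA API.

```python
from collections import Counter

# Hypothetical disambiguated data: UAID -> cluster record.
# "keywords" holds one entry per keyword-tagged article in the cluster.
CLUSTERS = {
    "UAID-1": {"name": "John Smith", "pmids": [101, 102, 103],
               "keywords": ["cancer", "cancer", "immunology"]},
    "UAID-2": {"name": "John Smith", "pmids": [201],
               "keywords": ["cardiology"]},
    "UAID-3": {"name": "Jane Doe", "pmids": [301, 302, 303],
               "keywords": ["cancer", "cancer", "cancer"]},
}

def clusters_for_name(name):
    """'Who is John Smith?': every cluster that carries this name."""
    return sorted(u for u, c in CLUSTERS.items() if c["name"] == name)

def other_pmids(pmid):
    """'Who is the author of Article X?': other PubMed IDs in the same cluster."""
    for cluster in CLUSTERS.values():
        if pmid in cluster["pmids"]:
            return [p for p in cluster["pmids"] if p != pmid]
    return []

def top_kols(keyword, n=10):
    """Rank UAIDs by the number of articles tagged with the keyword."""
    counts = Counter({u: c["keywords"].count(keyword) for u, c in CLUSTERS.items()})
    return [u for u, k in counts.most_common(n) if k > 0]
```

In this toy data, the name "John Smith" resolves to two separate clusters, which is the homonymy case the platform is meant to keep apart.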
23. • High precision and recall are the goal.
• Precision
  • Accuracy ratio: be correct in grouping.
  • precision = # of correctly clustered pairs / # of clustered pairs
  • The stricter the condition, the higher the precision
• Recall
  • Efficiency ratio: do not miss the matches.
  • recall = # of correctly clustered pairs / # of pairs that truly belong together
  • The more liberal the condition, the higher the recall
Disambiguation Goal
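The two pairwise ratios above can be computed directly from a predicted clustering and a hand-labelled one. A minimal sketch, assuming each clustering is given as a dictionary from record ID to cluster label (the function names and sample data are illustrative):

```python
from itertools import combinations

def same_cluster_pairs(labels):
    """Return the set of record pairs that share a cluster label."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }

def pairwise_precision_recall(predicted, truth):
    pred_pairs = same_cluster_pairs(predicted)   # pairs the system clustered
    true_pairs = same_cluster_pairs(truth)       # pairs that truly belong together
    correct = pred_pairs & true_pairs            # correctly clustered pairs
    precision = len(correct) / len(pred_pairs) if pred_pairs else 1.0
    recall = len(correct) / len(true_pairs) if true_pairs else 1.0
    return precision, recall

# Hypothetical example: four article records, two true authors X and Y.
truth = {"a1": "X", "a2": "X", "a3": "X", "a4": "Y"}
predicted = {"a1": "c1", "a2": "c1", "a3": "c2", "a4": "c2"}
precision, recall = pairwise_precision_recall(predicted, truth)
```

Here the system clustered two pairs, of which one is correct (precision 1/2), and found one of the three pairs that truly belong together (recall 1/3).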
24. • Total manual disambiguation is infeasible
• Automation is great, but can't be 100% accurate
• Manual process is hard, uncertain, subjective
• Manual after automation is pragmatic
Manual Vs Automated Disambiguation
25. • Group all publications into author clusters
• Match person to clusters
Clustering Methods
26. Clustering based on a similarity probability model
Available factors:
• Co-authors
• Affiliation
• Journal
• MeSH Terms
• Publication Date
Automation Approach
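To make the list of factors concrete, here is one way they could be combined into a single pairwise score. This is a simplified weighted sum, not a calibrated probability model, and the weights, field names and 10-year decay are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical weights; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {"coauthors": 0.35, "affiliation": 0.25, "journal": 0.10,
           "mesh": 0.20, "year": 0.10}

def similarity(x, y):
    """Score how likely two author instances are the same person."""
    feats = {
        "coauthors": jaccard(x["coauthors"], y["coauthors"]),
        "affiliation": 1.0 if x["affiliation"] == y["affiliation"] else 0.0,
        "journal": 1.0 if x["journal"] == y["journal"] else 0.0,
        "mesh": jaccard(x["mesh"], y["mesh"]),
        # Closeness in time is weak evidence; it decays to 0 after 10 years.
        "year": max(0.0, 1.0 - abs(x["year"] - y["year"]) / 10),
    }
    return sum(WEIGHTS[k] * v for k, v in feats.items())

p = {"coauthors": ["Lee K", "Chen Y"], "affiliation": "mayo clinic",
     "journal": "Blood", "mesh": ["Leukemia", "Humans"], "year": 2010}
q = {"coauthors": ["Lee K"], "affiliation": "mayo clinic",
     "journal": "Lancet", "mesh": ["Leukemia"], "year": 2012}
score = similarity(p, q)
```

Two instances would then be placed in the same cluster when the score exceeds a chosen threshold. A stricter threshold favors precision, a looser one favors recall.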
27. • Self-learning system possible: learns from a Gold Set
• Creating a proper training set is the biggest challenge
• Manual creation of a proper training set is costly
• The higher the complexity, the more vulnerable the system is to bugs
• The main goal is to find the relative importance of the criteria
  • Co-author vs. Affiliation vs. MeSH Terms vs. Journal, etc.
Machine Learning
28. • Extensive affiliation disambiguation is more challenging
• Affiliation normalization helps in author disambiguation
• Involves recognizing countries and cities, and normalizing addresses into canonical form.
• Fuzzy matching is possible after normalization, for smaller buckets only.
Affiliation Disambiguation
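A rough sketch of what normalizing an affiliation into canonical form can look like. The abbreviation and country tables are tiny hypothetical stand-ins, and the comma-separated layout of the input is an assumption about the raw strings.

```python
import re

# Hypothetical lookup tables; real ones would be far larger.
ABBREVIATIONS = {"univ": "university", "dept": "department",
                 "inst": "institute", "hosp": "hospital"}
COUNTRIES = {"usa": "United States", "united states": "United States",
             "uk": "United Kingdom", "france": "France"}

def normalize_affiliation(raw):
    """Return (canonical_text, country) for a free-text affiliation."""
    text = re.sub(r"\S+@\S+", " ", raw.lower())   # drop e-mail addresses
    text = re.sub(r"[^a-z0-9, ]+", " ", text)     # drop punctuation except commas
    parts = [p.strip() for p in text.split(",") if p.strip()]
    country = None
    if parts and parts[-1] in COUNTRIES:          # trailing part names a country
        country = COUNTRIES[parts.pop()]
    canonical = ", ".join(
        " ".join(ABBREVIATIONS.get(tok, tok) for tok in part.split())
        for part in parts
    )
    return canonical, country

a = normalize_affiliation("Dept. of Medicine, Univ. of Nice, France. jdoe@example.org")
b = normalize_affiliation("Department of Medicine, University of Nice, France")
```

After normalization both spellings compare equal, so exact matching handles the easy cases and the expensive fuzzy comparison is left for what remains.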
29. • Remember: it is a costly operation!
• Scalability hazard!
• Algorithms:
  • Monge-Elkan, Jaro-Winkler, Levenshtein: based on edit distance.
  • Jaccard, TF-IDF: based on token multisets (the order of words is not important).
  • Some hybrid techniques are also common.
Fuzzy Matching
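For illustration, here is one representative of each family in plain Python: Levenshtein for the edit-distance group and Jaccard for the token group. The Jaccard version uses plain sets rather than multisets, which is a simplification of what the slide describes.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cs != ct),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

def token_jaccard(a, b):
    """Token-set overlap; word order does not matter."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0
```

`levenshtein` does work proportional to the product of the two string lengths for every pair compared, which is why the slide flags fuzzy matching as a scalability hazard on large buckets.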
30. Article-1 Authors: X, Y
Article-2 Authors: X, Z (1 and 2 seem disconnected)
Article-3 Authors: X, Y, Z (it is likely that X is the same author in all 3 articles)
Note: The clustering algorithm recognizes and handles this appropriately.
Transitivity Fixing
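The talk does not say which algorithm resolves this. One standard way to merge such transitive links is a union-find (disjoint-set) structure, sketched here with the three articles from the slide:

```python
class UnionFind:
    """Disjoint sets; items are added lazily on first use."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Pairwise links for author X found through shared co-authors:
# articles 1 and 3 share Y, articles 2 and 3 share Z, 1 and 2 share nobody.
links = [("Article-1", "Article-3"), ("Article-2", "Article-3")]
uf = UnionFind()
for a, b in links:
    uf.union(a, b)
same_author = uf.find("Article-1") == uf.find("Article-2")
```

Article-1 and Article-2 end up in the same cluster through Article-3. The same mechanism also propagates a single false-positive link across a whole cluster, so the pairwise links should be high-confidence ones.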
31. Introducing DISA
• DISA stands for Disambiguation Automated Platform.
• DISA provides a powerful core software system backed by the author database.
• DISA enables applications to be developed on this platform to explore KOLs based on PubMed and conference information.
32. DISA Platform Architecture
[Architecture diagram; components: PubMed Data, Conference Data, ETL (Extract, Transform and Load), Explode to Author Instances, Author Instances, Rule-Based Unification Engine, Unique Authors, DISA API Layer for Application Access, DISA Application]
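The "Explode to Author Instances" component can be pictured as turning each article record into one row per listed author. A minimal sketch, with hypothetical record fields (`pmid`, `authors`) and a hypothetical `AuthorInstance` type that are not taken from the DISA implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuthorInstance:
    pmid: int
    position: int      # 1-based position in the author list
    name: str
    coauthors: tuple   # the other names on the same article

def explode(article):
    """Turn one article record into one AuthorInstance per listed author."""
    names = article["authors"]
    return [
        AuthorInstance(article["pmid"], i + 1, n,
                       tuple(m for m in names if m != n))
        for i, n in enumerate(names)
    ]

article = {"pmid": 12345, "authors": ["Smith J", "Lee K", "Chen Y"]}
instances = explode(article)
```

Each instance keeps its co-author list, so a later unification step can compare instances without going back to the article record.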
34. • Disambiguation is restricted to authors with the same last name.
• This "blocking" mechanism prevents combinatorial explosion.
• It still poses problems for common names
• Fuzzy algorithms are very expensive on large buckets/blocks.
Scalability Issues
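Blocking by last name can be sketched in a few lines. The record layout ("Last, First" names with an `id`) is assumed for illustration.

```python
from collections import defaultdict
from itertools import combinations

def block_by_last_name(instances):
    """Group instance IDs by lower-cased last name."""
    blocks = defaultdict(list)
    for inst in instances:
        last = inst["name"].split(",")[0].strip().lower()
        blocks[last].append(inst["id"])
    return blocks

def candidate_pairs(blocks):
    """Yield only the pairs that fall inside the same block."""
    for ids in blocks.values():
        yield from combinations(ids, 2)

instances = [
    {"id": 1, "name": "Smith, John"},
    {"id": 2, "name": "Smith, J"},
    {"id": 3, "name": "Lee, K"},
    {"id": 4, "name": "smith, Jane"},
]
pairs = list(candidate_pairs(block_by_last_name(instances)))
```

Four instances would give six pairs without blocking. With blocking, only the three "smith" pairs remain. Inside a block the count is still n(n-1)/2, so a common name with 5,000 instances yields 12,497,500 pairs, which is the problem the slide points out.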
35. • Relatively less researched so far.
• Need faster updates for delta additions.
• Reconstruct the clusters of the given name spaces.
• Use incremental clustering
• Embedded database to store and retrieve the disambiguated author data.
Incremental Disambiguation
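One way to picture an incremental update: when a new author instance arrives, only the clusters in its own name block are examined, and the instance either joins the best match or opens a new cluster. The co-author similarity, the 0.3 threshold and the data layout are assumptions chosen for the sketch.

```python
def coauthor_overlap(a, b):
    """Jaccard overlap of two co-author collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def add_instance(clusters, instance, threshold=0.3):
    """Attach `instance` to the best-matching cluster in its name block,
    or open a new cluster. No other block is touched."""
    block = clusters.setdefault(instance["last_name"], [])
    best, best_score = None, 0.0
    for cluster in block:
        score = coauthor_overlap(cluster["coauthors"], instance["coauthors"])
        if score > best_score:
            best, best_score = cluster, score
    if best is not None and best_score >= threshold:
        best["pmids"].append(instance["pmid"])
        best["coauthors"] |= set(instance["coauthors"])
        return best
    new_cluster = {"pmids": [instance["pmid"]],
                   "coauthors": set(instance["coauthors"])}
    block.append(new_cluster)
    return new_cluster

clusters = {}
add_instance(clusters, {"last_name": "smith", "pmid": 1, "coauthors": ["Lee K", "Chen Y"]})
add_instance(clusters, {"last_name": "smith", "pmid": 2, "coauthors": ["Lee K"]})
add_instance(clusters, {"last_name": "smith", "pmid": 3, "coauthors": ["Gupta R"]})
```

The second instance joins the first cluster (overlap 0.5) and the third opens a new one. A greedy assignment like this never merges two existing clusters, so an occasional rebuild of the affected name space would still be needed.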
36. • We need both higher precision and recall.
• But precision is more important.
• Precision errors are more permanent and harder to fix.
• Recall misses may be fixed in future or by manual disambiguation.
Being Conservative: Precision Vs Recall
37. Cannot Fix Impossible Situations
It is not possible to identify these without the author's voluntary disclosures.