Advancing the chemical sciences through big data
Dr Aileen Day
Data Science, Royal Society of Chemistry
The Royal Society of Chemistry
Who are Data Science’s customers?
• The Royal Society of Chemistry
  • We help other teams make evidence-based decisions
• The chemical community:
  • RSC members, authors, readers
  • We help them to easily find our articles and compound data
What big data do we have?
We have access to:
• ChemSpider
• RSC publishing
• logs
2016-06-24 00:05:07 192.168.0.1 pubs.rsc.org – GET /en/content/articlepdf/2007/sm/b704827k - - - XXX.XXX.XXX.XXX -
Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/50.0.2661.102+Safari/537.36
ShowEUCookieLawBanner=true;+X-Mapping-hhmaobcf=5EFF013F0F2EB5C7479A967277AFB2F4;+ASP.NET_SessionId=tzmjdkojxv2jcqh25omqerui;
+Branding=50000XXX;+AuthSystemSessionId=261e0a91-7d73-4fd7-9380-e73e298d6047;+__utmt=1;+__utma=1.2022872114.1464909160
.XXXXXXXXXX.XXXXXXXXXX.X;+__utmb=X.X.X.XXXXXXXXXXXXX;+__utmc=1;+__utmz=X.XXXXXXXXXX.X.X.utmcsr=google|utmccn=(organic)|utmcmd
=organic|utmctr=(not%20provided);+iislog-host=pubs.rsc.org;+iislog-s-ip=172.30.229.101
http://pubs.rsc.org/en/Content/ArticleLanding/2007/SM/b704827k - 200 - - - 409353 - -
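As a rough illustration of how a log line like this might be turned into structured data, the sketch below pulls the year, journal code and article identifier out of a request path. The regular expression and the name `parse_article_path` are assumptions made for this example, inferred only from the paths shown in the log entry; they are not the RSC's actual log-processing code.

```python
import re

# Pattern inferred from the example paths in the log entry (an assumption,
# not a documented URL scheme):
# /en/content/<view>/<year>/<journal code>/<article id>
ARTICLE_PATH = re.compile(
    r"^/en/content/(?P<view>[a-z]+)/(?P<year>\d{4})/"
    r"(?P<journal>[a-z0-9]+)/(?P<article_id>[a-z0-9]+)$",
    re.IGNORECASE,
)


def parse_article_path(path):
    """Return a dict of path components, or None if the path does not match."""
    match = ARTICLE_PATH.match(path)
    if match is None:
        return None
    # Lower-case everything so that the PDF request and the landing-page
    # referrer map to the same article key.
    return {key: value.lower() for key, value in match.groupdict().items()}


print(parse_article_path("/en/content/articlepdf/2007/sm/b704827k"))
```

Normalising case matters here because the same article appears as `/en/content/articlepdf/2007/sm/...` in the request and as `/en/Content/ArticleLanding/2007/SM/...` in the referrer.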
What can this big data do for chemistry?
• Not massive, but big enough!
• A combination of structured (ChemSpider) and unstructured (articles and logs)
• We can exploit this unstructured data…
What can Data Science and this big data do for Chemistry?
Tech Development
• Data processing pipeline
• Term extraction from literature
Applications
• Citation velocity
• Recommending papers
Cheminformatics
• Molecular characterisation
• Chemical similarity
• Molecule recommender
Business analytics
• Lead generation
• Data dashboards
• Trend analysis
• Category dashboard
Customers for trend and category analysis
• The RSC editorial teams, to help boost RSC publishing
• Identify hot emerging topics
Categorisation of RSC journals and articles
Categories: partly defined by RSC journal categorisation (top level), partly generated, then reviewed and organised
Category dashboard
• Aim: a tool for seeing how highly accessed various subsets of our papers are
• Based on:
  • article information (title, abstract, year and journal, but not full text)
  • access data for those papers
[Dashboard screenshots: user inputs; categories of results (annotation: upward trend, but note the small total number of accesses); subcategories]
Hot terms
Words that are found in the titles and abstracts of papers in that category that are accessed more frequently than usual
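The slides do not say how hot terms are scored. One possible reading, sketched below under that assumption, is to compare the mean access count of the papers containing a term with the mean access count of all papers in the category. The function name `hot_terms`, the `min_papers` cut-off and the toy data are all invented for illustration.

```python
from collections import defaultdict


def hot_terms(papers, min_papers=2):
    """Rank terms by how much more their papers are accessed than average.

    `papers` is a list of (text, access_count) pairs, where text stands for
    a paper's title plus abstract. Returns (term, ratio) pairs, highest first.
    """
    if not papers:
        return []
    category_mean = sum(count for _, count in papers) / len(papers)
    if category_mean == 0:
        return []
    accesses_by_term = defaultdict(list)
    for text, count in papers:
        # Count each term once per paper, however often it occurs in the text.
        for term in set(text.lower().split()):
            accesses_by_term[term].append(count)
    scores = []
    for term, counts in accesses_by_term.items():
        if len(counts) >= min_papers:  # ignore terms seen in too few papers
            scores.append((term, (sum(counts) / len(counts)) / category_mean))
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Invented toy data, purely for illustration.
toy_papers = [
    ("perovskite solar cells", 90),
    ("perovskite film stability", 70),
    ("polymer solar cells", 20),
    ("polymer film coatings", 20),
]
print(hot_terms(toy_papers)[:3])
```

A ratio above 1 marks a term whose papers are accessed more than the category average; a production version would also need stop-word removal and some guard against small-sample noise.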
Trend visualisation
• Aim: to identify and understand emergent research trends in the chemical sciences
• Things people have searched articles for internally
• Words that are over-represented in the articles which are returned
• Occurrences of each in articles over time
• Uses entropy-based measures (mutual information)
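The slides name mutual information but not the exact measure. As a minimal sketch, assuming pointwise mutual information (PMI) is an acceptable stand-in, the over-representation of a word in the articles returned by a search can be scored as log2( P(word | returned) / P(word) ). The function name and the counts below are invented for illustration.

```python
import math


def pointwise_mutual_information(n_returned_with_word, n_returned,
                                 n_corpus_with_word, n_corpus):
    """PMI in bits between 'article is returned' and 'article contains word'.

    Positive values mean the word is over-represented in the returned
    articles relative to the whole corpus.
    """
    counts = (n_returned_with_word, n_returned, n_corpus_with_word, n_corpus)
    if min(counts) <= 0:
        raise ValueError("all counts must be positive")
    p_word_given_returned = n_returned_with_word / n_returned
    p_word = n_corpus_with_word / n_corpus
    return math.log2(p_word_given_returned / p_word)


# Invented counts: the word occurs in 40 of 100 returned articles,
# but in only 500 of 10,000 articles overall.
print(pointwise_mutual_information(40, 100, 500, 10_000))
```

With these counts the word is eight times more frequent in the returned articles than in the corpus, so the score is about 3 bits.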
Trend visualisation - example
• Shows the dramatic emergence of perovskites for solar applications
• Enables editorial staff to commission reviews and special issues in these more specific areas
[Diagram labels: Unstructured Data, NER, Structured Data]
Chemlistem: Named Entity Recognition (NER)
• Participated in a public, competitive evaluation of extracting chemical names from patents:
  • BioCreative V.5 (Critical Assessment of Information Extraction in Biology), a community-wide effort to evaluate biomedical text mining and information extraction tools; systems were submitted and evaluated using the Becalm platform
  • CEMP (chemical entity mention in patents) task
• Using deep learning techniques: recurrent artificial neural networks
Traditional CRF approach - sequence labelling
“… the quisqualic acid-induced increase in the intracellular calcium ion concentration …”
Tokenise: “… the quisqualic acid - induced increase in the intracellular calcium ion concentration …”
Tag: “… the_O quisqualic_B acid_E -_O induced_O increase_O in_O the_O intracellular_O calcium_S ion_O concentration_O …”
O = outside, B = begin, I = inside, E = end, S = singleton
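The tagging scheme itself can be sketched in a few lines. The code below only encodes known entity spans as S/O/B/I/E tags, reproducing the tagged example on this slide; it is not a tagger, and the function name `bioes_tags` and the token-index span format are assumptions made for the example.

```python
def bioes_tags(tokens, entity_spans):
    """Tag tokens with the S/O/B/I/E scheme.

    `entity_spans` is a list of (start, end) token index pairs, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "S"  # single-token entity
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "E"
    return tags


tokens = ["the", "quisqualic", "acid", "-", "induced", "increase", "in",
          "the", "intracellular", "calcium", "ion", "concentration"]
spans = [(1, 3), (9, 10)]  # "quisqualic acid" and "calcium"
tagged = zip(tokens, bioes_tags(tokens, spans))
print(" ".join(f"{token}_{tag}" for token, tag in tagged))
```

A sequence labeller is then trained to predict one of these five tags for every token.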
Chemlistem Methods
• Compared 3 methods:
  • “Traditional”: a Conditional Random Fields (CRF) approach translated to deep learning
  • “Minimalist” approach
  • Ensemble: a combination of the previous two methods:
    • Run the Traditional and Minimalist systems with a low threshold to generate 2 lists of entities
    • Combine the scores of the entities in the lists and apply a threshold of 0.475
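The ensemble step can be sketched as follows. The slides give the final threshold (0.475) but not how the two scores are combined, so the simple mean used here, and the treatment of an entity missing from one list as scoring 0.0, are assumptions; the entity keys and scores are invented.

```python
ENSEMBLE_THRESHOLD = 0.475  # threshold quoted on the slide


def ensemble(traditional, minimalist, threshold=ENSEMBLE_THRESHOLD):
    """Combine two {entity: score} dicts and keep entities at or above threshold.

    ASSUMPTION: scores are combined by a simple mean, and an entity missing
    from one list counts as 0.0 there. The slides do not state the real rule.
    """
    combined = {}
    for entity in set(traditional) | set(minimalist):
        total = traditional.get(entity, 0.0) + minimalist.get(entity, 0.0)
        combined[entity] = total / 2
    return {e: score for e, score in combined.items() if score >= threshold}


# Invented scores; entities are keyed by (start, end, text) so that the two
# lists can be matched up.
trad = {(4, 19, "quisqualic acid"): 0.9, (60, 67, "calcium"): 0.5, (0, 3, "the"): 0.2}
mini = {(4, 19, "quisqualic acid"): 0.8, (60, 67, "calcium"): 0.6}
print(sorted(ensemble(trad, mini)))
```

Running each system with a low threshold first means that an entity one system is unsure about can still be accepted if the other system scores it highly.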
“Traditional” CRF neural network
[Diagram, layers as labelled: Inputs (Token n-1, Token n, Token n+1); Embeddings; Other Features; Convolutional Layer; Merge; Bidirectional LSTM; Final Layer; Outputs (one of S, O, B, I, E per token)]
• Word inputs and output
• Word-level embeddings (GloVe)
• Lots of chemical knowledge (features and chemical dictionaries)
• Single LSTM layer
“Minimalist” neural network
[Diagram, layers as labelled: Inputs (Character n-1, Character n, Character n+1); Embeddings; Bidirectional LSTM 1; Bidirectional LSTM 2; Bidirectional LSTM 3; Final Layer; Outputs (one of S, O, B, I, E per character)]
• Character inputs and output
• Character-level embeddings
• No chemical knowledge
• Relies only on learning outputs from inputs during training
• 3 LSTM layers
Results
System     Official F-score   Official Precision   Official Recall   Internal F-score   Internal Precision   Internal Recall
Trad       .8919              .8867                .8971             .8703              .8648                .8758
Minimal    .8901              .8865                .8936             .8664              .8479                .8858
Ensemble   .9032              .9002                .9062             .8807              .8646                .8976
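The F-scores in this table are consistent with the balanced F-score, the harmonic mean of precision and recall: F = 2PR / (P + R). The short check below recomputes two of the official values from the table; agreement is only expected to within about 0.0001, since the tabulated precision and recall are themselves rounded. The function name `f_score` is chosen for this example.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (reported F, precision, recall) for the official results in the table.
official = {
    "Trad": (0.8919, 0.8867, 0.8971),
    "Ensemble": (0.9032, 0.9002, 0.9062),
}
for system, (reported_f, precision, recall) in official.items():
    print(system, reported_f, round(f_score(precision, recall), 4))
```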
• Participating in public, competitive evaluation (BioCreative V.5 Becalm)
  • 0.9002 precision, 0.9062 recall, 0.9032 F
  • 3rd place out of 17 (0.1% off 1st; “differences in the top three weren’t statistically significant”)
  • Compare with typical inter-annotator agreement for manual annotators of 90-93% (human level)*
*Peter Corbett, Colin Batchelor, and Simone Teufel. Annotation of chemical named entities. Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing. Association for Computational Linguistics, 2007.
Chemlistem
• Peter Corbett, John Boyle. “Chemlistem - chemical named entity recognition using recurrent neural networks” (2017)
  http://www.biocreative.org/media/store/files/2017/BioCreative_V5_paper8.pdf
• Open source:
  • http://bitbucket.org/rscapplications/chemlistem
  • pip install chemlistem
Additional output - Word embeddings?
• e.g. word2vec and GloVe
• High-dimensional (often 300) vector per token, but can be reduced and visualised
• Trainable inside neural network…
• e.g. king - man + woman = queen
• Could our trained GloVe word embeddings be another useful output?
  • query a word to find and visualise related words
  • also e.g.:
    • benzene - hazard + solvent = ?
    • KOH - base + acid = ?
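Analogy queries of this kind are usually answered by vector arithmetic: compute vec(a) - vec(b) + vec(c) and return the nearest remaining word by cosine similarity. The sketch below uses hand-made two-dimensional toy vectors rather than trained GloVe embeddings, so the vectors, the vocabulary and the function names are all invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(embeddings, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(embeddings[word], target))


# Hand-made 2-D toy vectors (axes: royalty, gender); real embeddings are
# learned from text and have hundreds of dimensions.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
}
print(analogy(toy, "king", "man", "woman"))
```

The same call with chemistry terms, for example `analogy(embeddings, "KOH", "base", "acid")`, would need embeddings trained on a chemistry corpus.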
Article Recommendation
A New Article Recommender
• Built in Elasticsearch
• Four alternative methods considered
• User testing…
[Figure: the four methods (Method 1 to Method 4), labelled content similarity, cited together, readers also read, and combined results]
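The slides say only that the recommender was built in Elasticsearch, not which queries it uses. As one hedged possibility for the content-similarity method, Elasticsearch offers a `more_like_this` query; the sketch below just builds such a request body as a Python dict. The index name `articles`, the field names `title` and `abstract`, and the parameter values are hypothetical.

```python
import json


def more_like_this_query(index, article_id, fields=("title", "abstract"), size=5):
    """Build a request body asking for articles similar to an indexed one.

    ASSUMPTION: index and field names are placeholders, and more_like_this is
    only one way content similarity could be expressed in Elasticsearch.
    """
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": [{"_index": index, "_id": article_id}],
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


print(json.dumps(more_like_this_query("articles", "b704827k"), indent=2))
```

The other methods (cited together, readers also read) would draw on citation and access data rather than on article text.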
Key user observations
• All find interesting papers
• Early-stage reading/review writing
• Recent is good
• Different preferred methods before and after they know which is which…
User Preferences
[Chart: user preferences for each method (combined results, cited together, readers also read, content similarity), Before vs After; vertical axis 0 to 7]
What about a molecule recommender?
What other molecules are “related” to vancomycin?
[Structure image: Vancomycin]
• ChemSpider web logs (2015-2016), molecules grouped by user IDs, anonymised, aggregated
• RSC corpus (2000-2012), text-mined for chemical compounds, molecules grouped by article
• Morgan (radius=2) fingerprinting
• Topology fingerprinting
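The slides list the data sources and fingerprint types but not how relatedness is scored. The sketch below assumes two common choices: co-occurrence counts for molecules grouped by user or by article, and Tanimoto similarity for fingerprints represented as sets of "on" bit indices. The groups, bit sets and function names are invented; real Morgan fingerprints would come from a cheminformatics toolkit.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(groups):
    """Count how often each pair of molecules appears in the same group.

    A group is the set of molecules viewed by one user, or found in one article.
    """
    counts = Counter()
    for group in groups:
        for a, b in combinations(sorted(set(group)), 2):
            counts[(a, b)] += 1
    return counts


def related(molecule, counts, top_n=3):
    """Molecules that most often share a group with `molecule`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == molecule:
            scores[b] += n
        elif b == molecule:
            scores[a] += n
    return [name for name, _ in scores.most_common(top_n)]


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


# Invented groups and bit sets, for illustration only.
groups = [
    ["vancomycin", "teicoplanin", "linezolid"],
    ["vancomycin", "teicoplanin"],
    ["vancomycin", "caffeine"],
]
print(related("vancomycin", cooccurrence_counts(groups), top_n=1))
print(tanimoto({1, 5, 9, 12}, {1, 5, 9, 40}))
```

Co-occurrence captures how people actually use or discuss molecules together, whereas fingerprint similarity captures structural resemblance, so the two kinds of method can return quite different suggestions.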
User testing
• Initial user testing indicates that researchers prefer a range of methods and molecules to eyeball
  • Not just one “I feel lucky…” guess
• Primarily used as a tool to decide what to research next, but within that there are many different questions
For example…
Most of this work was done by…
Colin Batchelor
Peter Corbett
John Boyle
Nicholas Bailey
Jeff White
Aileen Day
With help from the rest of RSC Data Science…
www.rsc.org/data-science

Advancing the chemical sciences through big data

  • 1. Advancing the chemical sciences through big data Dr Aileen Day Data Science, Royal Society of Chemistry
  • 2. The Royal Society of Chemistry
  • 3. • The Royal Society of Chemistry • We help other teams make evidence- based decisions • The chemical community: • RSC members, authors, readers • We help them to easily find our articles and compound data Who are Data Science’s customers?
  • 4. We have access to: • ChemSpider What big data do we have?
  • 5. What big data do we have? We have access to: • ChemSpider • RSC publishing
  • 6. What big data do we have? We have access to: • ChemSpider • RSC publishing • logs 2016-06-24 00:05:07 192.168.0.1 pubs.rsc.org – GET /en/content/articlepdf/2007/sm/b704827k - - - XXX.XXX.XXX.XXX - Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/50.0.2661.102+Safari/537.36 ShowEUCookieLawBanner=true;+X-Mapping-hhmaobcf=5EFF013F0F2EB5C7479A967277AFB2F4;+ASP.NET_SessionId=tzmjdkojxv2jcqh25omqerui; +Branding=50000XXX;+AuthSystemSessionId=261e0a91-7d73-4fd7-9380-e73e298d6047;+__utmt=1;+__utma=1.2022872114.1464909160 .XXXXXXXXXX.XXXXXXXXXX.X;+__utmb=X.X.X.XXXXXXXXXXXXX;+__utmc=1;+__utmz=X.XXXXXXXXXX.X.X.utmcsr=google|utmccn=(organic)|utmcmd =organic|utmctr=(not%20provided);+iislog-host=pubs.rsc.org;+iislog-s-ip=172.30.229.101 http://pubs.rsc.org/en/Content/ArticleLanding/2007/SM/b704827k - 200 - - - 409353 - -
  • 7. • Not massive, but big enough! • A combination of structured (ChemSpider) and unstructured (articles and logs) • We can exploit this unstructured data… What can this big data do for chemistry?
  • 8. What can Data Science and this big data do for Chemistry? Tech Development • Data processing pipeline • Term extraction from literature Applications • Citation velocity • Recommending papers Cheminformatics • Molecular characterisation • Chemical similarity • Molecule recommender Business analytics • Lead generation • Data dashboards • Trend analysis • Category dashboard
  • 9. What can Data Science and this big data do for Chemistry? Tech Development • Data processing pipeline • Term extraction from literature Applications • Citation velocity • Recommending papers Cheminformatics • Molecular characterisation • Chemical similarity • Molecule recommender Business analytics • Lead generation • Data dashboards • Trend analysis • Category dashboard
  • 10. • The RSC editorial teams boost RSC publishing • Identify hot emerging topics Customers for trend and category analysis
  • 11. Categorisation of RSC journals and articles Categories – partly defined by RSC journal categorization (top level), partly generated, reviewed, organised
  • 12. • Aim: tool for seeing how highly accessed various subsets of our papers are • Based on: • article information title, abstract, year, journal but not full text • access data for those papers Category dashboard
  • 14. Categories of results Upward trend, but note small total number of accesses
  • 16. Hot terms words that are found in the titles and abstracts of papers in that category that are accessed more frequently than usual
  • 17. What Data Science and this big data do for Chemistry? Tech Development • Data processing pipeline • Term extraction from literature Applications • Citation velocity • Recommending papers Cheminformatics • Molecular characterisation • Chemical similarity • Molecule recommender Business analytics • Lead generation • Data dashboards • Trend analysis • Category dashboard
  • 18. Trend visualisation • Aim: to identify and understand emergent research trends in the chemical sciences
  • 19. Trend visualisation • Things people have searched articles for internally • Words that are over-represented in the articles which are returned • Occurrences of each in articles over time • Uses entropy-based measures (mutual information)
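The slide names mutual information but does not say exactly which measure is used. As an illustration of the general idea, pointwise mutual information (PMI) scores how much more often a word appears in the returned articles than chance would predict. The function below is a minimal sketch under that assumption; its name and its count-based inputs are hypothetical.

```python
import math


def pointwise_mutual_information(n_both, n_term, n_returned, n_total):
    """PMI (in bits) between 'article contains the term' and 'article is returned by the search'.

    n_both:     articles that contain the term AND are returned
    n_term:     articles that contain the term
    n_returned: articles returned by the search
    n_total:    all articles in the corpus
    """
    p_both = n_both / n_total
    p_term = n_term / n_total
    p_returned = n_returned / n_total
    # Positive: over-represented in the results; near zero: independent.
    return math.log2(p_both / (p_term * p_returned))
```

For example, a term found in 50 of 1000 articles overall, 40 of which fall among the 100 returned articles, is eight times more frequent in the results than expected, which is 3 bits.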
  • 20. Trend visualisation - example • Shows the dramatic emergence of perovskites for solar applications • Enables editorial staff to commission reviews and special issues in these more specific areas
  • 23. Chemlistem Named Entity Recognition (NER) • Participated in a public, competitive evaluation of extracting chemical names from patents: • BioCreative V.5 (Critical Assessment of Information Extraction in Biology): a community-wide effort to evaluate biomedical text mining and information extraction tools, with systems submitted and evaluated using the Becalm platform • CEMP (chemical entity mention in patents) task • Using deep learning techniques: recurrent artificial neural networks
  • 24. Traditional CRF approach - sequence labelling
      Text: “… the quisqualic acid-induced increase in the intracellular calcium ion concentration …”
      Tokenise: “… the quisqualic acid - induced increase in the intracellular calcium ion concentration …”
      Tag: “… the_O quisqualic_B acid_E -_O induced_O increase_O in_O the_O intracellular_O calcium_S ion_O concentration_O …”
      Tags: O = outside, B = begin, I = inside, E = end, S = singleton
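The tokenise-then-tag step on this slide can be sketched in a few lines. The regular-expression tokeniser and the helper names below are assumptions made for illustration (real chemical tokenisers are considerably more careful), and the entity spans are supplied by hand here rather than predicted by a model.

```python
import re


def tokenise(text):
    """Split into word tokens and single punctuation tokens ('acid-induced' -> acid, -, induced)."""
    return re.findall(r"\w+|[^\w\s]", text)


def bioes_tags(tokens, entity_spans):
    """Assign B/I/E/S/O tags given entity spans as (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "S"          # single-token entity
        else:
            tags[start] = "B"          # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "I"          # tokens strictly inside the entity
            tags[end - 1] = "E"        # last token of the entity
    return tags


text = "the quisqualic acid-induced increase in the intracellular calcium ion concentration"
tokens = tokenise(text)
# "quisqualic acid" covers tokens 1-2, "calcium" is token 9.
tagged = [f"{tok}_{tag}" for tok, tag in zip(tokens, bioes_tags(tokens, [(1, 3), (9, 10)]))]
```

Joining `tagged` with spaces gives the same tagged sequence as the slide's example.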
  • 25. Chemlistem Methods • Compared 3 methods: • “Traditional”: Conditional Random Fields (CRF) approach translated to deep learning • “Minimalist” approach • Ensemble: combination of the previous two methods: • Run the Traditional and Minimalist systems with a low threshold to generate 2 lists of entities • Combine the scores of the entities in the lists and apply a threshold of 0.475
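The ensemble step can be pictured as merging two scored entity lists. The slide gives the 0.475 threshold but not the rule for combining scores, so the sketch below assumes a simple average, with a score of 0.0 when a span is missing from one list. The function name and the (start, end) character-offset keys are hypothetical.

```python
def ensemble_entities(trad, minimal, threshold=0.475):
    """Merge two {entity_span: score} dicts and keep spans whose combined score passes the threshold.

    ASSUMPTION: scores are combined by averaging; a span absent from one list scores 0.0 there.
    """
    combined = {}
    for span in set(trad) | set(minimal):
        score = (trad.get(span, 0.0) + minimal.get(span, 0.0)) / 2
        if score >= threshold:
            combined[span] = score
    return combined


# Each system is run with a low threshold, so weak candidates still appear in its list.
trad_entities = {(4, 19): 0.9, (40, 47): 0.3, (60, 66): 0.5}
minimal_entities = {(4, 19): 0.8, (40, 47): 0.7}
kept = ensemble_entities(trad_entities, minimal_entities)
```

Under this averaging rule, a span that only one system is lukewarm about is dropped, while a span one system rates weakly and the other strongly can still survive.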
  • 26. “Traditional” CRF neural network • Word inputs and output • Word-level embeddings (GloVe) • Lots of chemical knowledge (features and chemical dictionaries) • Single LSTM layer [Diagram labels: Inputs (Token n-1, Token n, Token n+1); Embeddings; Other Features; Convolutional Layer; Merge; Bidirectional LSTM; Final Layer; Outputs (one of S, O, B, I, E per token)]
  • 27. “Minimalist” neural network • Character inputs and output • Character-level embeddings • No chemical knowledge • Just relies on training outputs from inputs • 3 LSTM layers [Diagram labels: Inputs (Character n-1, Character n, Character n+1); Embeddings; Bidirectional LSTM 1; Bidirectional LSTM 2; Bidirectional LSTM 3; Final Layer; Outputs (one of S, O, B, I, E per character)]
  • 28. Results
      System     Official F  Official P  Official R  Internal F  Internal P  Internal R
      Trad       .8919       .8867       .8971       .8703       .8648       .8758
      Minimal    .8901       .8865       .8936       .8664       .8479       .8858
      Ensemble   .9032       .9002       .9062       .8807       .8646       .8976
      (F = F-score, P = precision, R = recall)
      • Participating in public, competitive evaluation (BioCreative V.5 Becalm):
        - 0.9002 precision, 0.9062 recall, 0.9032 F-score
        - 3rd place out of 17 (0.1% off 1st, “differences in the top three weren’t statistically significant”)
        - Compare with typical inter-annotator agreement studies for manual annotators: 90-93% (human level)*
      *Peter Corbett, Colin Batchelor, and Simone Teufel. Annotation of chemical named entities. Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing. Association for Computational Linguistics, 2007.
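The F-scores in the table are consistent with the usual balanced F-score, the harmonic mean of precision and recall. The short check below assumes that definition (the slide does not state it) and uses the official precision and recall figures from the Trad and Ensemble rows.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Official figures from the results table.
ensemble_f = round(f_score(0.9002, 0.9062), 4)
trad_f = round(f_score(0.8867, 0.8971), 4)
```

Both values reproduce the official F-score column for those rows. Because the published precision and recall are themselves rounded to four places, recomputing F from them can differ in the last digit for other rows.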
  • 29. Chemlistem • Peter Corbett, John Boyle. “Chemlistem - chemical named entity recognition using recurrent neural networks” (2017) http://www.biocreative.org/media/store/files/2017/BioCreative_V5_paper8.pdf • Open source: • http://bitbucket.org/rscapplications/chemlistem • pip install chemlistem
  • 30. Additional output - Word embeddings? • e.g. word2vec and GloVe • High-dimensional (often 300) vector per token, but can be reduced and visualised • Trainable inside neural network… • e.g. king - man + woman ≈ queen
  • 31. Additional output - Word embeddings? • Could our trained GloVe word embeddings be another useful output? • Query a word to find and visualise related words • also e.g.: • benzene – hazard + solvent = ? • KOH – base + acid = ?
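Queries of the form "benzene – hazard + solvent = ?" amount to vector arithmetic followed by a nearest-neighbour search. The sketch below shows the mechanics on toy two-dimensional vectors that are invented purely for illustration; trained embeddings would have hundreds of dimensions, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(embeddings, a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Toy vectors, made up for this example only.
TOY = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}
result = analogy(TOY, "king", "man", "woman")
```

With chemistry embeddings the same call would take the form `analogy(vectors, "benzene", "hazard", "solvent")`, where `vectors` stands for whatever trained embedding table is available.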
  • 34. A New Article Recommender • Built in Elasticsearch • Four alternative methods considered • User testing…
  • 36. Key user observations • All find interesting papers • Early-stage reading/review writing • Recent is good • Different preferred methods before and after they know which is which…
  • 37. User Preferences [Bar chart: axis 0 to 7; series: Before, After; methods: combined results, cited together, readers also read, content similarity]
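Of the four methods named on the chart, "readers also read" is the easiest to picture: articles are recommended when they turn up in the same reading sessions as the current one. The sketch below is a plain-Python illustration of that idea, not the Elasticsearch implementation; the function name, the session layout and the article IDs are hypothetical.

```python
from collections import Counter


def readers_also_read(sessions, article, top_n=3):
    """Rank other articles by the number of sessions they share with `article`."""
    co_counts = Counter()
    for session in sessions:
        articles = set(session)            # ignore repeat views within one session
        if article in articles:
            co_counts.update(articles - {article})
    return [other for other, _ in co_counts.most_common(top_n)]


# Each inner list is one anonymised reading session of article IDs.
sessions = [["A", "B", "C"], ["A", "B"], ["B", "D"], ["A", "C", "B"]]
recommended = readers_also_read(sessions, "A")
```

"Cited together" could be sketched the same way, with reference lists in place of reading sessions.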
  • 39. What about a molecule recommender? What other molecules are “related” to vancomycin? Vancomycin
  • 40. • ChemSpider web logs (2015-2016), molecules grouped by user IDs, anonymised, aggregated • RSC corpus (2000-2012), text-mined for chemical compounds, molecules grouped by article • Morgan (radius=2) fingerprinting • Topology fingerprinting
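Fingerprint-based methods recommend molecules whose fingerprints overlap most with the query. The slide does not name a similarity measure, so the sketch below assumes Tanimoto similarity, a common choice for comparing fingerprints. Fingerprints are represented as sets of "on" bit indices; in practice they would come from a cheminformatics toolkit, and the bit values and molecule names here are invented.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        return 0.0                         # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def most_similar(query_bits, library, top_n=3):
    """Rank the molecules in `library` ({name: on-bits}) by similarity to the query fingerprint."""
    ranked = sorted(library, key=lambda name: tanimoto(query_bits, library[name]), reverse=True)
    return ranked[:top_n]


# Hypothetical fingerprints, for illustration only.
library = {"mol_a": {1, 2, 3}, "mol_b": {7, 8}, "mol_c": {1, 2, 3, 4}}
neighbours = most_similar({1, 2, 3, 4}, library)
```

The log-based and corpus-based methods on the slide work differently: they group molecules by user or by article rather than by structure, so they can surface related compounds that look nothing alike.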
  • 41. User testing • Initial user testing indicates that researchers prefer a range of methods and molecules to eyeball • Not just one “I feel lucky…” guess • Primarily used as a tool to decide what to research next, but within that, many different questions
  • 44. Most of this work was done by… Colin Batchelor, Peter Corbett, John Boyle, Nicholas Bailey, Jeff White, Aileen Day • With help from the rest of RSC Data Science…

Editor's Notes

  1. Tech Dev: efficient pipelines to turn log file data into e.g. a NoSQL DB of download activity, session info etc., including efficient bot removal and IP lookup; data mining and metadata enrichment. ChemInformatics: chemical similarity using fingerprinting; chemical validation. Analytics: find potential new customers, or customers who want/need to upgrade or expand their accessible products; give internal teams access to good info to enable them to make better decisions. Products: citation velocity (identify and promote trending articles); recommenders (online tools to help users and encourage them to stay on-site).
  5. Ensemble Paper, blog post?
  9. Recommenders are one way to drive up usage. CLICK Amazon: “customers who bought this also looked at...” Netflix: the next box set... CLICK Facebook: “You’ve been tagged at four heavy metal concerts this month, would you like to join one of these groups?” CLICK Even in the scientific “space”, we have recommenders: Sigma-Aldrich (“Customers also viewed”), Nature (recommender just launched), ScienceDirect (“recommended articles” and “related book content”). This last example is the focus of the work that follows... We want to get the great science that we publish to the maximum possible audience, to bring serendipity to the user, and to help readers answer their questions. In this case we’re primarily interested in recommending further articles based on whatever the reader is currently looking at. Let’s examine the basic underlying principles of how we might make recommendations...
  10. Good framework to build a system good enough to test user opinions
  12. Recommender systems are an active area of research; molecule-based systems are a specialist subset within the chemistry domain with special features of their own. Determining “related-ness” is somewhat complicated by the three-dimensional nature of compound structures, the ability of different structures to have similar actions, and the tendency of chemists to create entirely new structures on a regular basis.
  13. Which is the best?!