Text and Data Mining (TDM) 
SciDataCon 2014 Workshop 
Jenny Molloy (@jenny_molloy) | Puneet Kishor (@punkish) 
https://github.com/ContentMine/SciDataCon2014 #SciDataCon2014
What is MINING? 
1982 
“Automatically generating logical representations of text passages... by means of an 
analysis of the coherence structure of the passages.” 
Jerry R. Hobbs, Donald E. Walker, and Robert A. Amsler. 1982. Natural language access to structured text. In Proceedings of the 9th conference on Computational linguistics - 
Volume 1 (COLING '82), Ján Horecký (Ed.), Vol. 1. Academia Praha, Czechoslovakia, 127-132. DOI=10.3115/991813.991833 http://dx.doi.org/10.3115/991813.991833 
1999 
“(semi)automated discovery of trends and patterns across very large datasets” 
“Use of large online text collections to discover new facts and trends...” 
“(Automating) the tedious parts of the text manipulation process and (integrating) 
underlying computationally-driven text analysis with human-guided decision making within 
exploratory data analysis over text” 
Marti A. Hearst. 1999. Untangling text data mining. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL 
'99). Association for Computational Linguistics, Stroudsburg, PA, USA, 3-10. DOI=10.3115/1034678.1034679 http://dx.doi.org/10.3115/1034678.1034679 
2008 
“The use of automated methods for exploiting the enormous amount of knowledge 
available in the biomedical literature.” 
Cohen, K. Bretonnel; Hunter, Lawrence (2008). "Getting Started in Text Mining". PLoS Computational Biology 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 
18225946. 
What is CONTENT? 
● Images 
● Photos 
● Graphs 
● Figures 
● Captions 
● Sound 
● Video 
● Tables 
● Datasets 
● Supplementary information 
● Metadata 
● Text 
101 uses for content mining (nearly)... 
Which universities in SE Asia do scientists from Cambridge work with? (We get asked this 
sort of thing regularly by Vice-Chancellors.) By examining the list of authors of papers from Cambridge and the affiliations of 
their co-authors we can get a very good approximation. (Feasible now.) 
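One way the co-author approach might be sketched, assuming each paper has already been reduced to a list of author records with `institution` and `country` fields. That record layout, the `SE_ASIA` set, and the function name are illustrative assumptions, not part of the workshop material.

```python
from collections import Counter

# Hypothetical, hand-picked set of country names; a real list would need care.
SE_ASIA = {"Singapore", "Thailand", "Vietnam", "Malaysia", "Indonesia", "Philippines"}


def partner_institutions(papers, home="University of Cambridge", countries=SE_ASIA):
    """Count co-author institutions in `countries` on papers with a `home` author."""
    partners = Counter()
    for paper in papers:
        # A set ensures each institution is counted at most once per paper.
        affiliations = {(a["institution"], a["country"]) for a in paper["authors"]}
        if not any(inst == home for inst, _ in affiliations):
            continue  # no author from the home institution on this paper
        for inst, country in affiliations:
            if inst != home and country in countries:
                partners[inst] += 1
    return partners
```

Calling `partner_institutions(papers).most_common(10)` would then list the most frequent partners. In practice the hard part is normalising institution names, which this sketch takes as given.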
Which papers contain grayscale images which could be interpreted as gels? A 
polyacrylamide gel (http://en.wikipedia.org/wiki/Polyacrylamide_gel) is a universal method of identifying proteins and other biomolecules. 
[Image: a typical gel (Wikipedia, CC-BY-SA)] 
Find me papers in subjects which are (not) editorials, news, corrections, retractions, 
reviews, etc. Slightly journal/publisher-dependent but otherwise very simple. 
Find papers about chemistry in the German language. Highly tractable. A typical approach would be 
to find the 50 commonest words (e.g. “ein”, “das”, …) in a paper and show that their frequency is very different from English 
(“one”, “the”, …). 
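A minimal sketch of the commonest-words idea, assuming plain text has already been extracted from the paper. The two stopword sets are tiny illustrative samples and the function name is hypothetical; a real run would use fuller word lists.

```python
import re
from collections import Counter

# Tiny illustrative stopword lists (assumptions, not a vetted resource).
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ein", "eine", "ist", "nicht", "mit", "von", "zu", "den"}
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "one", "with", "for", "not"}


def guess_language(text, top_n=50):
    """Return 'de', 'en', or 'unknown' based on the commonest words in `text`."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    commonest = Counter(words).most_common(top_n)
    # Sum the occurrences of each language's stopwords among the commonest words.
    german = sum(count for word, count in commonest if word in GERMAN_STOPWORDS)
    english = sum(count for word, count in commonest if word in ENGLISH_STOPWORDS)
    if german == english:
        return "unknown"
    return "de" if german > english else "en"
```

The subject filter ("about chemistry") is a separate step and is not attempted here.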
Find references to papers by a given author. This is metadata and therefore FACTual. It is usually trivial 
to extract references and authors. More difficult, of course, to disambiguate. 
Find uses of the term “Open Data” before 2006. Remarkably the term was almost unknown before 
2006 when I started a Wikipedia article on it. 
Find papers where authors come from chemistry department(s) and a linguistics 
department. Easyish (assuming the departments have reasonable names and you have some aliases (“Molecular 
Sciences”, “Biochemistry”)…) 
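The alias idea could look roughly like this, assuming affiliation strings are already available per paper. The alias tuples and function names are illustrative assumptions; the chemistry aliases simply reuse the examples given above.

```python
# Hypothetical alias lists; real department names vary far more than this.
CHEMISTRY_ALIASES = ("chemistry", "molecular sciences", "biochemistry")
LINGUISTICS_ALIASES = ("linguistics",)


def matches_any(affiliation, aliases):
    """True if any alias occurs in the affiliation string (case-insensitive)."""
    lowered = affiliation.lower()
    return any(alias in lowered for alias in aliases)


def spans_both(affiliations):
    """True if a paper's affiliations include both a chemistry and a linguistics department."""
    has_chemistry = any(matches_any(a, CHEMISTRY_ALIASES) for a in affiliations)
    has_linguistics = any(matches_any(a, LINGUISTICS_ALIASES) for a in affiliations)
    return has_chemistry and has_linguistics
```

Substring matching is deliberately crude: it is cheap to run over many papers, at the cost of occasional false positives.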
Find papers acknowledging support from the Wellcome Trust. (So we can check for OA 
compliance…) 
Find papers with supplemental data files. Journal-specific but easily scalable. 
Find papers with embedded mathematics. Lots of possible approaches. Equations are often set off by whitespace; 
text contains non-ASCII characters (e.g. Greek letters, scripts, aleph, etc.) and makes heavy use of sub- and superscripts. A fun project for an 
enthusiast. 
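One of the possible approaches, sketched under the assumption that the paper is available as Unicode text: score a passage by the share of non-ASCII characters that look mathematical. The function name and the choice of character classes are illustrative; whitespace layout is not examined.

```python
import unicodedata


def looks_mathematical(char):
    """True for non-ASCII math symbols, Greek letters, sub/superscripts and aleph."""
    if ord(char) < 128:
        return False  # only non-ASCII characters are counted
    name = unicodedata.name(char, "")
    return (
        unicodedata.category(char) == "Sm"  # Unicode math symbols such as ∑ or ≤
        or name.startswith("GREEK")
        or "SUPERSCRIPT" in name
        or "SUBSCRIPT" in name
        or name.startswith("ALEF SYMBOL")
    )


def math_score(text):
    """Fraction of non-whitespace characters in `text` that look mathematical."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(looks_mathematical(c) for c in chars) / len(chars)
```

A threshold on `math_score` would have to be chosen empirically, for example by comparing scores on passages known to contain equations with scores on ordinary prose.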
