Large-scale integration of data and text




Lars Juhl Jensen
association networks
text mining
localization and diseases
me
promoter analysis
Jensen & Knudsen, Bioinformatics, 2000
function prediction
Jensen, Gupta et al., Journal of Molecular Biology, 2002
protein networks
de Lichtenberg, Jensen et al., Science, 2005
chemoinformatics
Campillos, Kuhn et al., Science, 2008
data mining
text mining
electronic health records
association networks
guilt by association
STRING
~2.6 million proteins
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
STITCH
~300,000 small molecules
Kuhn et al., Nucleic Acids Research, 2012
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
operons
Korbel et al., Nature Biotechnology, 2004
bidirectional promoters
Korbel et al., Nature Biotechnology, 2004
metagenome neighborhood
Harrington et al., PNAS, 2007
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
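
The idea behind phylogenetic profiles can be sketched in a few lines: genes that are present and absent in the same genomes are candidates for a functional association. The sketch below is only an illustration, not the method of the cited paper; the gene names, the genome labels, and the choice of Jaccard similarity are all assumptions made for the example.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two sets of genomes in which a gene is present."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


# Hypothetical presence/absence data: gene -> set of genomes containing an ortholog.
profiles = {
    "geneA": {"g1", "g2", "g5"},
    "geneB": {"g1", "g2", "g5"},
    "geneC": {"g3", "g4"},
}


def rank_partners(query, profiles):
    """Rank all other genes by profile similarity to the query gene."""
    scores = [
        (other, jaccard(profiles[query], profile))
        for other, profile in profiles.items()
        if other != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Calling `rank_partners("geneA", profiles)` puts `geneB` first because the two genes occur in exactly the same genomes.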
a real example
[Figure: Cell, Cellulosomes, Cellulose]
experimental data
gene coexpression
protein interactions
Jensen & Bork, Science, 2008
curated knowledge
drug targets
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
hard work
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
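
One way to picture calibration against a gold standard: rank the pairs by raw score, cut the ranking into bins, and use the fraction of gold-standard positives in each bin as the calibrated score. This is a minimal sketch under stated assumptions, not the procedure of the cited paper. It assumes every scored pair can be judged true or false by the gold standard, and the function names and the equal-sized binning are hypothetical.

```python
def calibrate(scored_pairs, gold_positive, bin_size):
    """Map raw scores to the fraction of gold-standard positives per bin.

    scored_pairs: list of (pair, raw_score); gold_positive: set of pairs.
    Returns (lowest_raw_score_in_bin, precision) tuples, best bin first.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    table = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = sum(1 for pair, _ in chunk if pair in gold_positive)
        table.append((chunk[-1][1], hits / len(chunk)))
    return table


def calibrated_score(raw, table):
    """Return the precision of the first bin whose lower bound the raw score reaches."""
    for lower_bound, precision in table:
        if raw >= lower_bound:
            return precision
    return 0.0
```

Because every evidence type ends up on the same scale (the estimated fraction of correct pairs), scores from different sources become comparable.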
missing most of the data
text mining
>10 km
too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive lexicon
cyclin dependent kinase 1
CDK1
CDC2
flexible matching
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
orthographic variation
CDC2
hCdc2
“black list”
SDS
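
The ingredients above (lexicon, flexible matching, orthographic variation, black list) can be combined into a small dictionary-based tagger. The following is a rough sketch, not the actual tagger behind these slides. The mini-lexicon, the one-letter species prefixes `h` and `m`, the four-word limit, and all function names are assumptions made for illustration.

```python
import re

# Hypothetical mini-lexicon: every synonym maps to one canonical identifier.
LEXICON = {
    "cyclin dependent kinase 1": "CDK1",
    "CDK1": "CDK1",
    "CDC2": "CDK1",
    "SDS": "SDS",
}
# Names that usually mean something else in running text (assumed list).
BLACKLIST = {"sds"}
MAX_WORDS = 4


def normalize(name):
    """Lower-case and drop spaces and hyphens so spelling variants collide."""
    return re.sub(r"[\s-]+", "", name.lower())


INDEX = {
    normalize(name): ident
    for name, ident in LEXICON.items()
    if normalize(name) not in BLACKLIST
}


def tag(text):
    """Return (matched_text, identifier) pairs, preferring the longest match."""
    tokens = list(re.finditer(r"[A-Za-z0-9]+", text))
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            span = tokens[i:i + n]
            key = "".join(m.group().lower() for m in span)
            keys = [key]
            # Orthographic variation: also try without a species prefix (hCdc2 -> cdc2).
            if n == 1 and key[0] in "hm":
                keys.append(key[1:])
            ident = next((INDEX[k] for k in keys if k in INDEX), None)
            if ident:
                hits.append((text[span[0].start():span[-1].end()], ident))
                i += n
                break
        else:
            i += 1
    return hits
```

In this sketch, `hCdc2` and `cyclin-dependent kinase 1` both resolve to the same identifier, while `SDS` is skipped because it is on the black list.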
information extraction
count co-mentioning
within documents
within paragraphs
within sentences
scoring scheme
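
Counting co-mentions at the three levels can be sketched as a weighted count per pair of names. The weights below are hypothetical, since the slides do not give the actual values. Sentence splitting is deliberately naive, and names are matched as exact whole words rather than with a real tagger.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical weights; the slides do not state the real scoring scheme.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 0.5}


def entities_in(text, names):
    """Return the set of known names that occur as whole words in the text."""
    words = set(re.findall(r"[A-Za-z0-9]+", text))
    return {name for name in names if name in words}


def comention_counts(documents, names):
    """Weighted co-mention count per unordered pair of names."""
    counts = Counter()
    for doc in documents:
        paragraphs = doc.split("\n\n")
        sentences = [
            s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p)
        ]
        levels = (
            ("document", [doc]),
            ("paragraph", paragraphs),
            ("sentence", sentences),
        )
        for level, units in levels:
            # A pair scores at most once per level per document.
            pairs = set()
            for unit in units:
                found = sorted(entities_in(unit, names))
                pairs.update(combinations(found, 2))
            for pair in pairs:
                counts[pair] += WEIGHTS[level]
    return counts
```

A pair mentioned in the same sentence also shares a paragraph and a document, so it collects all three weights. Turning these raw counts into a comparable score would still need a normalization step that the slides do not describe.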
corpora
~22 million abstracts
no access
~4 million full-text articles
augmented browsing
Reflect
browser add-on
real-time text mining
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
localization and disease
small molecules
proteins
compartments
tissues
diseases
organisms
environments
suite of web resources
common backend database
jensenlab.org
text mining
curated knowledge
experimental data
computational predictions
quality scores
web-centric databases
DISEASES
visualization
COMPARTMENTS
compartments.jensenlab.org
TISSUES
tissues.jensenlab.org
project onto networks
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
compartments.jensenlab.org
tissues.jensenlab.org
diseases.jensenlab.org
summary
bioinformatics
more than alignment
data/text mining
save you much time
Acknowledgments
STRING/STITCH
Christian von Mering
Damian Szklarczyk
Michael Kuhn
Manuel Stark
Samuel Chaffron
Chris Creevey
Jean Muller
Tobias Doerks
Philippe Julien
Alexander Roth
Milan Simonovic
Jan Korbel
Berend Snel
Martijn Huynen
Peer Bork
Literature mining
Sune Frankild
Evangelos Pafilis
Janos Binder
Kalliopi Tsafou
Alberto Santos
Heiko Horn
Michael Kuhn
Nigel Brown
Reinhardt Schneider
Sean O’Donoghue
Questions?
