Large-scale integration of data and text

Lars Juhl Jensen

This document discusses large-scale integration of data and text in bioinformatics. It describes using text mining on millions of abstracts and articles to extract information on biological entities and their associations in order to build networks of proteins, genes, diseases and small molecules. This information is integrated with experimental data and computational predictions into web-centric databases and resources that can help researchers by saving them time over manually reviewing the literature. Visualization tools are also provided to project network data onto tissue and subcellular localization information extracted from text.

association networks

text mining

localization and diseases

promoter analysis

Jensen & Knudsen, Bioinformatics, 2000

function prediction

Jensen, Gupta et al., Journal of Molecular Biology, 2002

protein networks

de Lichtenberg, Jensen et al., Science, 2005

chemoinformatics

Campillos, Kuhn et al., Science, 2008

data mining

text mining

electronic health records

association networks

guilt by association
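
A minimal sketch of guilt by association, assuming a toy network and toy annotations (the protein "X", the edge confidences, and the labels are all hypothetical): an unannotated protein inherits candidate labels from its network neighbours, weighted by edge confidence. This illustrates the idea only and is not the method used by any particular resource.

```python
from collections import defaultdict

def predict_annotations(protein, edges, annotations):
    """Score each label by the summed confidence of the neighbours that carry it."""
    scores = defaultdict(float)
    for a, b, confidence in edges:
        if protein not in (a, b):
            continue  # edge does not touch the query protein
        neighbour = b if a == protein else a
        for label in annotations.get(neighbour, ()):
            scores[label] += confidence
    # Best-supported label first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: (protein, protein, confidence score)
edges = [("X", "CDK1", 0.9), ("X", "CCNB1", 0.8), ("X", "ALB", 0.2)]
annotations = {
    "CDK1": {"cell cycle"},
    "CCNB1": {"cell cycle"},
    "ALB": {"transport"},
}

ranked = predict_annotations("X", edges, annotations)
```

Here "cell cycle" ranks first for "X" because two high-confidence neighbours carry it.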

STRING

~2.6 million proteins

Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011

STITCH

~300,000 small molecules

Kuhn et al., Nucleic Acids Research, 2012

genomic context

gene fusion

Korbel et al., Nature Biotechnology, 2004

operons

Korbel et al., Nature Biotechnology, 2004

bidirectional promoters

Korbel et al., Nature Biotechnology, 2004

metagenome neighborhood

Harrington et al., PNAS, 2007

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
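
The phylogenetic-profile idea can be sketched as follows, assuming each gene is described by the set of genomes in which it occurs (gene and genome names are hypothetical). Genes with similar presence/absence patterns are candidate functional partners. Jaccard similarity is used here purely for illustration; the slides do not say which measure is actually applied.

```python
from itertools import combinations

def profile_similarity(a, b):
    """Jaccard similarity between two sets of genomes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_pairs(profiles):
    """Return (similarity, gene, gene) tuples, most similar pair first."""
    pairs = [
        (profile_similarity(profiles[x], profiles[y]), x, y)
        for x, y in combinations(sorted(profiles), 2)
    ]
    return sorted(pairs, reverse=True)

# Hypothetical toy data: gene -> genomes in which an ortholog is present
profiles = {
    "geneA": {"g1", "g2", "g4"},
    "geneB": {"g1", "g2", "g4"},
    "geneC": {"g3", "g5"},
}

ranking = rank_pairs(profiles)
```

geneA and geneB occur in exactly the same genomes, so they come out as the top-ranked pair.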

a real example

Cell

Cellulosomes

Cellulose

experimental data

gene coexpression

protein interactions

Jensen & Bork, Science, 2008

curated knowledge

drug targets

complexes

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

different formats

different identifiers

variable quality

not comparable

hard work

quality scores

von Mering et al., Nucleic Acids Research, 2005

calibrate vs. gold standard
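
One way to picture calibration against a gold standard, as a rough sketch: bin the raw scores from one evidence source and ask what fraction of pairs in each bin are known true associations. The pairs, raw scores, gold-standard set, and bin width below are all hypothetical, and a real pipeline would presumably fit a smooth curve rather than use coarse bins.

```python
from collections import defaultdict

def calibrate(scored_pairs, gold, bin_width=1.0):
    """Map each raw-score bin to the fraction of its pairs found in the gold standard."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pair, raw in scored_pairs:
        b = int(raw // bin_width)
        totals[b] += 1
        if frozenset(pair) in gold:
            hits[b] += 1
    return {b: hits[b] / totals[b] for b in totals}

# Hypothetical toy data: ((protein, protein), raw score from one evidence source)
scored_pairs = [
    (("A", "B"), 2.5),
    (("A", "C"), 2.1),
    (("B", "C"), 0.4),
    (("C", "D"), 0.9),
]
gold = {frozenset(("A", "B"))}  # assumed set of known true associations

calibration = calibrate(scored_pairs, gold)
```

Once every source is expressed as "probability of being in the gold standard", scores from different databases become comparable.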

missing most of the data

text mining

>10 km

too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

comprehensive lexicon

cyclin dependent kinase 1

CDK1

CDC2

flexible matching

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1

orthographic variation

CDC2

hCdc2

“black list”

SDS
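
The named entity recognition steps above can be sketched as a small dictionary-based tagger. This is an illustration under stated assumptions, not the actual tagger: the mini-lexicon, the canonical name, the "h" species-prefix rule, and the helper names `normalize` and `tag` are hypothetical. Flexible matching is done by ignoring case, spaces, and hyphens, and black-listed names are dropped from the index.

```python
import re

# Hypothetical mini-lexicon: every synonym maps to one canonical name
LEXICON = {
    "cyclin dependent kinase 1": "CDK1",
    "CDK1": "CDK1",
    "CDC2": "CDK1",
    "SDS": "SDS",
}
# Names that are too ambiguous to tag (SDS is the black-list example on the slides)
BLACKLIST = {"sds"}

def normalize(name):
    """Lower-case and drop spaces and hyphens so spelling variants collide."""
    return re.sub(r"[\s-]+", "", name.lower())

INDEX = {
    normalize(name): canonical
    for name, canonical in LEXICON.items()
    if normalize(name) not in BLACKLIST
}

def tag(text, max_words=4):
    """Return (matched text, canonical name) pairs, preferring the longest match."""
    tokens = re.findall(r"[\w-]+", text)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            key = normalize(candidate)
            # Assumed rule for orthographic variation: strip a species
            # prefix such as the "h" in "hCdc2" before looking up the name
            if key not in INDEX and n == 1 and re.match(r"h[A-Z]", candidate):
                key = normalize(candidate[1:])
            if key in INDEX:
                found.append((candidate, INDEX[key]))
                i += n
                break
        else:
            i += 1  # no name starts at this token
    return found

matches = tag("hCdc2 and cyclin-dependent kinase 1 were run on SDS gels")
```

Both "hCdc2" and "cyclin-dependent kinase 1" resolve to the same canonical entry, while "SDS" is skipped because of the black list.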

information extraction

count co-mentioning

within documents

within paragraphs

within sentences

scoring scheme
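
Counting co-mentions at the three levels above can be sketched like this, assuming the text has already been tagged so that each sentence is a set of entity names. The weights are hypothetical and only express that a shared sentence is stronger evidence than a shared document. A complete scoring scheme would also need to account for how often each entity is mentioned overall, which this sketch leaves out.

```python
from collections import Counter
from itertools import combinations

def pairs(entities):
    """All unordered entity pairs, as alphabetically sorted tuples."""
    return set(combinations(sorted(entities), 2))

def weighted_counts(documents, w_doc=1, w_par=2, w_sent=4):
    """Sum, over documents, a weight for each level at which a pair co-occurs."""
    counts = Counter()
    for doc in documents:
        par_pairs, sent_pairs = set(), set()
        doc_entities = set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs |= pairs(sentence)
                par_entities |= sentence
            par_pairs |= pairs(par_entities)
            doc_entities |= par_entities
        for p in pairs(doc_entities):
            # Each level contributes at most once per document
            counts[p] += w_doc + w_par * (p in par_pairs) + w_sent * (p in sent_pairs)
    return counts

# Hypothetical tagged corpus: document -> paragraphs -> sentences -> entity sets
documents = [
    [[{"CDK1", "CCNB1"}, {"TP53"}]],  # one paragraph, two sentences
    [[{"CDK1"}], [{"TP53"}]],         # two paragraphs, one sentence each
]

counts = weighted_counts(documents)
```

CDK1 and CCNB1 share a sentence and so receive the highest weighted count, whereas CDK1 and TP53 are only ever co-mentioned at the paragraph or document level.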

corpora

~22 million abstracts

no access

~4 million full-text articles

augmented browsing

Reflect

browser add-on

real-time text mining

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010

localization and disease

small molecules

proteins

compartments

tissues

diseases

organisms

environments

suite of web resources

common backend database

jensenlab.org

text mining

curated knowledge

experimental data

computational predictions

quality scores

web-centric databases

DISEASES

visualization

COMPARTMENTS

compartments.jensenlab.org

TISSUES

tissues.jensenlab.org

project onto networks

Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011

compartments.jensenlab.org

tissues.jensenlab.org

diseases.jensenlab.org

summary

bioinformatics

more than alignment

data/text mining

save you much time

Acknowledgments

STRING/STITCH: Christian von Mering, Damian Szklarczyk, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork

Literature mining: Sune Frankild, Evangelos Pafilis, Janos Binder, Kalliopi Tsafou, Alberto Santos, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider, Sean O’Donoghue

Questions?
