Large-scale integration of data and text

1. Large-scale integration of data and text Lars Juhl Jensen

2. data integration

3. text mining

4. molecular biology

5. medicine

6. association networks

7. guilt by association

9. STRING

10. Szklarczyk et al., Nucleic Acids Research, 2015 (string-db.org)

11. 2000+ genomes

12. genomic context

13. gene fusion

14. Korbel et al., Nature Biotechnology, 2004

15. operons

17. bidirectional promoters

19. phylogenetic profiles
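A minimal sketch of the phylogenetic-profile idea: genes that are present and absent in the same genomes are candidates for being functionally associated. It assumes each gene is represented by the set of genomes in which it occurs and uses the Jaccard index as the similarity measure. The gene names, genome labels, and the choice of Jaccard are illustrative assumptions, not the scoring scheme used by STRING.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two presence/absence profiles."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical profiles: gene -> genomes in which an ortholog is present.
profiles = {
    "geneA": {"g1", "g2", "g5"},
    "geneB": {"g1", "g2", "g5"},
    "geneC": {"g3", "g4"},
}


def profile_similarities(profiles):
    """Return the profile similarity for every unordered pair of genes."""
    genes = sorted(profiles)
    return {
        (x, y): jaccard(profiles[x], profiles[y])
        for i, x in enumerate(genes)
        for y in genes[i + 1:]
    }
```

Pairs with identical profiles (here geneA and geneB) score highest; pairs that never occur in the same genome score zero.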

21. a real example

25. Cell Cellulosomes Cellulose

26. experimental data

27. gene coexpression

29. physical interactions

30. Jensen & Bork, Science, 2008

31. genetic interactions

32. Beyer et al., Nature Reviews Genetics, 2007

33. curated knowledge

34. pathways

35. Letunic & Bork, Trends in Biochemical Sciences, 2008

36. many databases

37. different formats

38. different identifiers

39. variable quality

40. not comparable

41. not same species

42. hard work

43. (Ph.D. students)

44. quality scores

45. von Mering et al., Nucleic Acids Research, 2005

46. calibrate vs. gold standard

47. von Mering et al., Nucleic Acids Research, 2005
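One way to picture calibration against a gold standard is to bin the raw scores from an evidence channel and ask what fraction of the pairs in each bin is confirmed by the gold standard. The sketch below does only that; the function name, the fixed-width binning, and the toy data are assumptions for illustration and are not taken from the cited papers.

```python
from collections import defaultdict


def calibrate(scored_pairs, gold_standard, bin_width=0.25):
    """Map the lower edge of each raw-score bin to the fraction of its
    pairs that are found in the gold standard."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pair, raw in scored_pairs:
        b = int(raw // bin_width)  # index of the bin this raw score falls in
        totals[b] += 1
        if frozenset(pair) in gold_standard:
            hits[b] += 1
    return {b * bin_width: hits[b] / totals[b] for b in sorted(totals)}
```

The resulting fractions can be read as rough, comparable confidence values: raw scores from very different evidence types end up on the same scale because each is expressed as agreement with the same gold standard.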

48. homology-based transfer

49. Franceschini et al., Nucleic Acids Research, 2013

50. missing most of the data

51. text mining

52. >10 km

53. too much to read

54. computer

55. as smart as a dog

56. teach it specific tricks

59. named entity recognition

60. comprehensive lexicon

61. cyclin dependent kinase 1

62. CDC2

63. flexible matching

64. cyclin dependent kinase 1

65. cyclin-dependent kinase 1

66. orthographic variation

67. CDC2

68. hCdc2

69. “black list”

70. SDS
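The named entity recognition slides above (lexicon, flexible matching, orthographic variation, black list) can be sketched as a small dictionary-based tagger. Everything specific here is an assumption made for illustration: the three-entry lexicon, the identifier strings, the normalization rule (ignore case, spaces, and hyphens), and the handling of a leading "h" prefix as in hCdc2. SDS is black-listed in the sketch on the assumption that it usually refers to the detergent rather than a gene.

```python
import re

# Hypothetical lexicon: surface name -> identifier (illustrative only).
LEXICON = {
    "cyclin dependent kinase 1": "CDK1",
    "CDC2": "CDK1",
    "SDS": "SDS",
}
# Names assumed to be too ambiguous to tag.
BLACKLIST = {"sds"}


def normalize(name: str) -> str:
    """Lower-case and drop spaces and hyphens so spelling variants collapse."""
    return re.sub(r"[\s\-]+", "", name.lower())


INDEX = {
    normalize(name): ident
    for name, ident in LEXICON.items()
    if normalize(name) not in BLACKLIST
}


def tag(text: str, max_words: int = 4):
    """Return (matched text, identifier) for lexicon names found in text."""
    words = re.findall(r"[A-Za-z0-9\-]+", text)
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so multi-word names win.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            key = normalize(candidate)
            ident = INDEX.get(key)
            # Assumed orthographic rule: allow a species prefix "h" (hCdc2).
            if ident is None and n == 1 and key.startswith("h"):
                ident = INDEX.get(key[1:])
            if ident is not None:
                found.append((candidate, ident))
                i += n - 1  # skip the remaining words of this match
                break
        i += 1
    return found
```

For example, `tag("The cyclin-dependent kinase 1 (hCdc2) was run on an SDS gel.")` matches the hyphenated name and the prefixed symbol but leaves SDS untagged.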

71. information extraction

72. co-mentioning

73. within documents

74. within paragraphs

75. within sentences
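Co-mentioning at the three levels listed above can be sketched as simple pair counting. The sketch assumes paragraphs are separated by blank lines, sentences end in ".", "!" or "?", and names are found by case-insensitive substring search; these are simplifications chosen for brevity, and the function name is hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def comention_counts(documents, names):
    """Count name pairs co-mentioned within documents, paragraphs, and sentences."""
    counts = {"document": Counter(), "paragraph": Counter(), "sentence": Counter()}

    def pairs_in(text):
        # Crude matching: a name counts as mentioned if it occurs as a substring.
        lowered = text.lower()
        present = sorted(n for n in names if n.lower() in lowered)
        return combinations(present, 2)

    for doc in documents:
        counts["document"].update(pairs_in(doc))
        for paragraph in doc.split("\n\n"):
            counts["paragraph"].update(pairs_in(paragraph))
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
                counts["sentence"].update(pairs_in(sentence))
    return counts
```

Keeping the three counters separate makes it possible to weight a sentence-level co-mention more heavily than one that only holds at the document level.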

76. NLP Natural Language Processing

77. grammatical analysis

78. Gene and protein names; cue words for entity recognition; verbs for relation extraction. Example: [nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]. Saric et al., Proceedings of ACL, 2004

79. more precise

80. worse recall

81. related web resources

82. STITCH

83. STRING + 300k chemicals

84. Kuhn et al., Nucleic Acids Research, 2014 (stitch-db.org)

85. COMPARTMENTS

86. Binder et al., Database, 2014 (compartments.jensenlab.org)

87. TISSUES

88. tissues.jensenlab.org Santos et al., submitted, 2015

89. DISEASES

90. diseases.jensenlab.org Frankild et al., Methods, 2015

91. general framework

92. curated knowledge

93. experimental data

94. text mining

95. computational predictions

96. common identifiers

97. quality scores

98. visualization

99. web resources

100. download files

101. why so many?

102. Swiss army knife syndrome

104. targeted resources

105. common infrastructure

106. medical data mining

107. Jensen et al., Nature Reviews Genetics, 2012

109. opt-out

110. opt-in

111. structured data

113. civil registration system

114. established in 1968

116. national discharge registry

117. 14 years

118. 6.2 million patients

119. 119 million diagnoses

121. guilt by association

122. naïve approach

123. comorbidity
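The naïve approach to comorbidity can be sketched as a relative risk: the number of patients observed with both diagnoses divided by the number expected if the two diagnoses were independent. The data layout (patient ID mapped to a set of diagnosis codes) and the function name are assumptions for illustration, and the calculation deliberately ignores confounding factors.

```python
def relative_risk(patients, dx_a, dx_b):
    """Observed / expected number of patients having both diagnoses,
    where the expectation assumes the diagnoses occur independently."""
    n = len(patients)
    n_a = sum(1 for dx in patients.values() if dx_a in dx)
    n_b = sum(1 for dx in patients.values() if dx_b in dx)
    n_ab = sum(1 for dx in patients.values() if dx_a in dx and dx_b in dx)
    expected = n_a * n_b / n if n else 0.0
    return n_ab / expected if expected else float("nan")
```

A value above 1 means the two diagnoses co-occur more often than chance would predict, which is a guilt-by-association signal rather than evidence of a causal link.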

125. confounding factors

126. “known knowns”

127. gender

128. age

129. type of hospital encounter

130. Jensen et al., Nature Communications, 2014

131. “known unknowns”

132. smoking

133. diet

134. “unknown unknowns”

135. reporting biases

136. matched controls

137. temporal correlations

138. trajectories
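A temporal correlation adds direction to a comorbidity: does diagnosis A tend to come before diagnosis B? The sketch below counts, for every ordered pair of diagnoses, the patients in whom the first occurrence of one precedes the first occurrence of the other. The input layout (patient ID mapped to a list of (day, diagnosis) events) and the function name are hypothetical, and no statistical test of directionality is included.

```python
from collections import Counter
from itertools import permutations


def directed_pair_counts(histories):
    """Count patients in whom the first diagnosis of A precedes the first of B."""
    counts = Counter()
    for events in histories.values():
        first = {}  # diagnosis -> day of its earliest occurrence
        for day, dx in events:
            if dx not in first or day < first[dx]:
                first[dx] = day
        for a, b in permutations(first, 2):
            if first[a] < first[b]:
                counts[(a, b)] += 1
    return counts
```

Ordered pairs whose counts are strongly skewed in one direction are the natural building blocks to chain into longer trajectories.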

140. trajectory networks

142. complex networks

143. key diagnoses

145. direct medical implications

146. medical text mining

147. pharmacovigilance

148. unstructured data

150. Danish

151. comprehensive lexicon

152. drugs

153. Clozapine

154. Clozapine: clozapin, clossapin, klozapine, chlosapin, chlosapine, chlozapin, chlozapine, klossapin, closapine, klozapin, klosapin

155. adverse drug events

156. rule-based system

157. Eriksson et al., Drug Safety, 2014. Figure labels: Drug introduction; Drug discontinuation; Adverse event; Negative modifier; Indication; Pre-existing condition; Adverse drug reaction; Possible adverse drug reaction; ADR of additional drug

158. Eriksson et al., Drug Safety, 2014. Figure labels: Drug introduction; Drug discontinuation; Identification start; Adverse event; Negative modifier; Indication; Pre-existing condition; Adverse drug reaction; Possible adverse drug reaction; ADR of additional drug

159. Eriksson et al., Drug Safety, 2014. Figure labels: Drug introduction; Drug discontinuation; Identification start; Adverse event; Negative modifier; Indication; Pre-existing condition; Adverse drug reaction; Possible adverse drug reaction; ADR of additional drug

160. Eriksson et al., Drug Safety, 2014. Figure labels: Drug introduction; Drug discontinuation; Identification start; Adverse event; Negative modifier; Indication; Pre-existing condition; Adverse drug reaction; Possible adverse drug reaction; ADR of additional drug

161. new adverse drug reactions

162. Eriksson et al., Drug Safety, 2014
Drug substance        ADE                   p-value
Chlordiazepoxide      Nystagmus             4.0e-8
Simvastatin           Personality changes   8.4e-8
Dipyridamole          Visual impairment     4.4e-4
Citalopram            Psychosis             8.8e-4
Bendroflumethiazide   Apoplexy              8.5e-3

163. estimate ADR frequencies

164. Eriksson et al., Drug Safety, 2014

165. Acknowledgments
STRING/STITCH: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Milan Simonovic, Alexander Roth, Sune Pletscher-Frankild, Jianyi Lin, Pablo Minguez, Christian von Mering, Peer Bork
Text mining: Sune Pletscher-Frankild, Jasmin Saric, Evangelos Pafilis, Alberto Santos, Janos Binder, Kalliopi Tsafou, Heiko Horn, Michael Kuhn, Reinhardt Schneider, Sean O’Donoghue
EHR mining: Anders Boeck Jensen, Robert Eriksson, Peter Bjødstrup Jensen, Andreas Bok Andersen, Sabrina Gade Ellesøe, Henriette Schmock, Tudor Oprea, Pope Moseley, Thomas Werge, Søren Brunak
