Leveraging Big Data to Accelerate Biomedical Research:
Overlaying Computational Knowledge, Natural Language Processing, Artificial Intelligence, and Data Standardization onto Pubmed and Affiliated Databases to shift from a Google Paradigm to a Wolfram Alpha Paradigm

Milap Thaker et al.: Leveraging Big Data in Biological Science. Presented at Johns Hopkins University, 2013.

Speaker notes
  • Life science today depends more and more on data. This slide shows a flow chart from BioVance, a full-service CRO (contract research organization) that provides research services for pharmaceutical companies. Fundamentally, the drug discovery process follows a repeating pattern: form a hypothesis, test it, expand the sample size, and test again, over and over through every phase of the trials. From preclinical work to Phases 1, 2, and 3, each step enlarges the sample and expands the quantity of data.
  • In fact, long before preclinical trials in animals, many experiments are performed at the molecular level on cells and tissues to identify biomarkers. This research takes place in university labs and in institutions such as the NIH and NCI. The resulting discoveries are published in journals such as Nature and Science, presented at international conferences such as AACR or Neuroscience, and some are collected into public databases, such as miRBase for microRNAs, maintained by the Sanger Institute in the UK. Throughout this whole process, more and more data is generated every day, and with new technologies such as next-generation sequencing, the amount of data an everyday biologist must handle is growing at an unexpected speed.
  • Open science: For biologists, the sheer size of the data and the variety of sources make it a challenge to access its full benefit. A few examples of data sources: NCBI RefSeq, UCSC Known Gene 6.0, Gencode v13, plus hundreds of thousands of publications, from SCI-indexed journals to school publications, from keynote speeches to conference posters. All of these are constantly updated and evolving. Today, the best way to reach, compare, and summarize these resources still depends heavily on diligent human work; because of the difficulty and scale of that work, many open resources are never actually reached or utilized, but sit in unattended corners as if they did not exist.
    Too many types to standardize: Biological data is difficult to organize because of its many types and forms, from a Western blot gel picture to the hundreds of millions of reads produced by next-generation sequencing. Adding clinical symptoms, chemical and other exposures, and demographics makes it a very complicated analysis problem.
    Utility of data is low: Beyond its complexity, the sheer amount of data keeps its utility low. Simply put, for most biologists it is too costly to compute over the data that is available or generated, and the tools for computing at this scale are still primitive and uncoordinated.
    Requires high analytic skills: Unlike a social network analyst, who knows exactly what the collected data means (each node in the network represents, say, a Facebook account), biologists often do not know exactly what they are looking at. The data used to construct gene networks is noisy and imprecise, and we do not yet have a good understanding of how the many variables in these networks interact. Although a gene regulatory network is smaller than a social network, it is harder to determine which gene controls the expression of which other genes.
  • New search engine: A Google search only returns a pile of results matching the keywords you type. To benefit from all of the databases available in the field, we need computational capacity that lets scientists ask precise academic questions and get accurate snapshots of specific biological topics, e.g., "What is the statistical confidence that p53 is related to breast cancer, based on SCI publications with an impact factor of 4 or above during the past 5 years?" Instead of offering a stack of papers or records containing the keyword "p53", the system should be able to answer that the statistical confidence of a positive relation between p53 and breast cancer is 71%, based on 205 such publications. (A minimal sketch of how such a confidence figure might be aggregated follows this note.)
    Tap into massive existing data first: Another untapped resource is the huge body of clinical data. We need a system that can analyze it and answer questions such as "What is the best treatment plan for a woman with breast cancer, age 60-80, with high blood pressure?" Traditionally, doctors have had to memorize their patients' histories to accumulate experience and sharpen their skills; this system would give physicians and researchers a new perspective, available instantly.
    A standardized operation model: Unifying and standardizing the data of all kinds produced in biological experiments, such as images, graphs, curves, numbers, binary factors, and descriptions in words, is the first step toward using the massive data in the field. The traditionally descriptive nature of biomedicine is easy to appreciate from everyday experience with ultrasound: the doctor writes a descriptive paragraph under the ultrasound image, and the numbers are derived from observation. Even though genomic data is purely numerical, with A/T/G/C counted in sequences, interpreting gene expression levels within the functioning of a gene network becomes descriptive again. We cannot simply state that gene A regulates gene B as a definite fact; sometimes gene A regulates gene B indirectly through other mechanisms, such as particular epigenetic factors, or only under a certain disease state.
    New analytic tools: New approaches to modeling and simulation are already popular in other fields; computing can simulate many real-world situations, from car crashes to robotic assembly lines. Using computer-generated models and simulations instead of live animal experiments to simulate the regulatory state of a gene network is possible. Mapping the network against all available data sources would give biomedical researchers a reliable foundation and save them tremendous time otherwise spent fishing in oceans of information and guessing at the relationships.
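To make the "71% confidence based on 205 publications" idea concrete, here is a minimal sketch, in Python, of one way such a figure could be aggregated from per-paper findings. The `Finding` record, the impact-factor and year filters, and the `supports` flag are hypothetical; the notes do not specify an actual statistical model.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One publication's extracted conclusion about a gene-disease pair (hypothetical record)."""
    gene: str
    disease: str
    supports: bool       # True if the paper reports a positive association
    impact_factor: float
    year: int

def association_confidence(findings, gene, disease, min_if=4.0, since=2008):
    """Fraction of qualifying publications that support the association.
    A deliberately naive proportion, not a real meta-analysis."""
    relevant = [f for f in findings
                if f.gene == gene and f.disease == disease
                and f.impact_factor >= min_if and f.year >= since]
    if not relevant:
        return None, 0
    positive = sum(f.supports for f in relevant)
    return positive / len(relevant), len(relevant)

# Example query: "statistical confidence that p53 relates to breast cancer" (toy data).
findings = [
    Finding("p53", "breast cancer", True, 6.2, 2011),
    Finding("p53", "breast cancer", False, 4.8, 2012),
    Finding("p53", "breast cancer", True, 9.1, 2013),
]
conf, n = association_confidence(findings, "p53", "breast cancer")
print(f"confidence = {conf:.0%} based on {n} publications")
```

A real system would presumably weight papers by impact factor, effect size, and study design rather than take a raw proportion, but the shape of the computation would be similar.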
  • I use the IPSO (input-process-storage-output) diagram/model to portray how our computational knowledge engine would function (a rough pipeline sketch follows this note):
    System boundary – data bits and queries moving across the system boundary.
    Input – a keyword or question typed into the search field (what you want answered). Example: "What is p53 in cancer?"
    Process – digitize the keyword(s) or question, then perform algorithmic keyword matching and/or queries by sending bits of data to the data mart for retrieval.
    Storage – a storage warehouse containing all indexed "human expert knowledge" material related to health and medicine (for example, genetic analyses and gene studies). Each indexed item is linked to graphs, images, definitions, and directly relevant material, so that comprehensive answers can be delivered to inquirers.
    Output – bits of data travel from the storage warehouse back to the inquirer, carrying answers and directly related material. No noise, just pertinent results (graphs, definitions, etc.).
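As a rough illustration of the IPSO flow described above, the sketch below wires the four stages together. The in-memory `knowledge_store` dictionary and the keyword-matching step are stand-ins invented for this example; the notes do not prescribe a concrete data mart or matching algorithm.

```python
# Storage: a toy "data mart" of indexed expert-knowledge items (hypothetical entries).
knowledge_store = {
    "p53": {
        "definition": "Tumor-suppressor protein encoded by the TP53 gene.",
        "related": ["apoptosis", "breast cancer", "cell cycle arrest"],
    },
}

def process(query: str) -> list[str]:
    """Process: digitize the question into keywords for matching."""
    return [w.strip("?.,").lower() for w in query.split()]

def retrieve(keywords: list[str]) -> dict:
    """Match keywords against the storage warehouse and pull the linked material."""
    return {kw: knowledge_store[kw] for kw in keywords if kw in knowledge_store}

def answer(query: str) -> dict:
    """Input -> Process -> Storage lookup -> Output, crossing the system boundary once each way."""
    return retrieve(process(query))

# Input: the inquirer's question.  Output: pertinent results only, no noise.
print(answer("What is p53 in cancer?"))
```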
  • In class we briefly covered the topic of "expert systems", and our desired product relates directly to it. We want to use approved human expertise within the health/medicine field: accredited material from sources such as PubMed, the CDC, the NIH, and WebMD will comprise our human-expertise storage warehouse. An existing "computational knowledge engine", Wolfram Alpha, is an excellent example of what we are looking to create; I have included a video in our PowerPoint presentation that explains what a computational knowledge engine does and how it differs from a standard search engine. Wolfram Alpha covers many fields but does not have a very extensive health/medicine database; our product will focus solely on health/medicine materials.
  • The purpose of education is to test the boundaries of what we know and then to expand those boundaries into what we do not know. The problem we run into is learning that drifts into irrelevant areas (pure science) as opposed to applied science (technology). The vision of our software is to convert qualitative and other data into quantitative data that can be housed in a table, so that it can be manipulated by other, higher-order software to produce trends (a sketch of one possible table layout follows this note). The ideal software takes us from what we know to what we need to know in a way that is logical and does not waste limited scientific resources or time on endless quests or irrelevant research; it tells scientists what to study next in order to drive scientific research forward.
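The notes do not define the table itself, so the following is only a minimal sketch, assuming a single flat layout in which every finding, whatever its original form (gel image, sequence count, clinical observation), is reduced to one quantitative row. The column names and the example row are invented for illustration.

```python
import sqlite3

# One flat table in which heterogeneous evidence is reduced to quantitative rows.
# Columns are hypothetical; a real schema would carry far more metadata.
schema = """
CREATE TABLE findings (
    source_id     TEXT,    -- e.g. a PubMed identifier
    entity        TEXT,    -- gene, protein, or compound studied
    context       TEXT,    -- disease or experimental condition
    measure       TEXT,    -- what was quantified (expression fold-change, band intensity, ...)
    value         REAL,    -- the number extracted from the figure, gel, or sequence data
    supports_link INTEGER  -- 1 if the finding supports the association, 0 otherwise
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(schema)
conn.execute("INSERT INTO findings VALUES (?, ?, ?, ?, ?, ?)",
             ("PMID:0000000", "p53", "breast cancer", "expression fold-change", 2.4, 1))
# Higher-order software can now compute trends with ordinary queries.
print(conn.execute(
    "SELECT AVG(supports_link), COUNT(*) FROM findings WHERE entity = 'p53'").fetchone())
```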
  • The purpose of our project is to create a tool that rapidly improves the productivity of scientists by arming them with better data. Our tool is a meta-level scientific-publication crawler that uses natural language processing to determine the meanings of phrases in scientific publications and converts them into logical statements, which are then aggregated. Once aggregated, the tool applies basic logic functions and statistics to determine probable scientific truths, and these truths become the output of our web tool (a sketch of the extraction step follows this note). The program is intended to apply computational knowledge and organization to scientific publications, much as Wolfram Alpha provides computational knowledge for mathematics. Given the query "square root of 2," Wolfram Alpha does not simply respond with all kinds of data, including irrelevant results such as papers on "a computational method for fact-checking the square root of two" or images of four-equal-sided polygons; rather, it gives a very specific output, because it treats the words "square root" and "two" as specific sub-parameters of a function it is to evaluate.
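Below is a minimal sketch of the extraction step described above: turning a sentence from a paper into a logical statement that can later be aggregated. It uses simple pattern matching rather than a real NLP pipeline, and the cue-word list and fixed context are assumptions made for this example only.

```python
import re
from typing import NamedTuple

class Statement(NamedTuple):
    """A bivalent logical statement extracted from one publication."""
    subject: str
    relation: str   # e.g. "upregulated" or "downregulated"
    context: str
    source: str

# Cue phrases mapped to relations: a toy stand-in for real natural language processing.
CUES = {
    "upregulation": "upregulated", "increases": "upregulated",
    "downregulation": "downregulated", "decreases": "downregulated",
}

def extract(sentence: str, source: str) -> Statement | None:
    """Convert one sentence into a logical statement, if a known pattern is found."""
    gene = re.search(r"\bp\d+\b", sentence, re.IGNORECASE)
    cue = next((rel for word, rel in CUES.items() if word in sentence.lower()), None)
    if gene and cue:
        # The context is fixed here; a real system would extract it from the sentence too.
        return Statement(gene.group(0).lower(), cue, "cancer", source)
    return None

print(extract("p53 in cancer - qPCR data reveals gene upregulation in metastasizing cells",
              source="University X"))
# -> Statement(subject='p53', relation='upregulated', context='cancer', source='University X')
```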
  • Our software, when given an input such as "role of protein p53 in cancer amongst women over age 50," will not simply return endless lists of articles on various p53 studies; rather, it will give a very specific output, such as "high correlation (ideally quantified as a percentage) likelihood of p53 impact in cancer for this age group." It will do this by crawling the endless lists mentioned earlier and using basic natural language processing to convert papers such as "p53 in cancer - qPCR data reveals gene upregulation in metastasizing cells" from University X and "p53 increases in densitometric analyses of western blots of metastasizing cancers" from University Y into logical statements that agree, and can therefore be aggregated into a single quantified answer (a sketch of this aggregation follows this note).
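Continuing the sketch above, the fragment below aggregates extracted statements from the two hypothetical papers into the kind of quantified answer this note describes. The statement shape mirrors the earlier `Statement` record, and the agreement percentage and output wording are assumptions for illustration.

```python
from collections import Counter
from typing import NamedTuple

class Statement(NamedTuple):
    subject: str
    relation: str
    context: str
    source: str

def aggregate(statements):
    """Combine per-paper logical statements about the same gene into one quantified claim."""
    if not statements:
        return "no evidence found"
    relations = Counter(s.relation for s in statements)
    top_relation, count = relations.most_common(1)[0]
    agreement = count / len(statements)
    return (f"{statements[0].subject} is {top_relation} in {statements[0].context} "
            f"({agreement:.0%} agreement across {len(statements)} papers)")

# Two statements as they might come out of the extraction step (University X and Y example).
papers = [
    Statement("p53", "upregulated", "metastasizing cancer", "University X"),
    Statement("p53", "upregulated", "metastasizing cancer", "University Y"),
]
print(aggregate(papers))
# -> "p53 is upregulated in metastasizing cancer (100% agreement across 2 papers)"
```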

Slide transcript

    1. Leveraging Big Data to Accelerate Biomedical Research: Overlaying Computational Knowledge, Natural Language Processing, Artificial Intelligence, and Data Standardization onto Pubmed and Affiliated Databases to shift from a Google Paradigm to a Wolfram Alpha Paradigm. Steven Koval, Milap Thaker, and Ping Zhu. Information Systems 350.620.72, The Johns Hopkins University. William Agresti, Professor. 12 December 2013.
    2. How is Data Used in the Biomedical Field?
    3. Types of Data (Data Diversity). Challenge: data is both quantitative and visual. Examples shown on the slide: cell images, gene amplifications, blots/gels, and sequence data (a table of sample IDs such as Adrenal_ALP_L311D with read counts, alongside the corresponding small-RNA sequences and their lengths).
    4. Big Data vs. Biology - Challenges.
       Open science: biological research and discovery call for access to all kinds of data sources.
       Too many types to standardize: from small-scale Western blot gel pictures to the big data sets of next-generation sequencing.
       Utility of data is low: too much data to be analyzed and utilized effectively.
       Requires high analytic skills: with so many variables, a gene regulatory network, for example, is far more complicated than a social network of Facebook accounts.
    5. Big Data and Biology: Recommendations.
       New search engine: go beyond Google search and allow scientists to ask precise academic questions and get accurate snapshots of specific biological topics, e.g., "What is the statistical confidence of p53 as it relates to breast cancer, based on scientific publications with an IF (impact factor) of 4 and above during the past 5 years?"
       Tap into massive existing data first: by analyzing massive clinical data, can we answer questions such as the best treatment plan for a woman aged 60-80 with breast cancer and high blood pressure?
       A standardized operation model: unify and standardize data of all kinds from biological experiments, such as images, graphs, curves, numbers, binary factors, and descriptions in words.
       New analytic tools: new ways of modeling and simulation, e.g., using computer-generated models and simulations rather than live animal experiments to simulate a gene network's regulatory state; map the network with all the data sources and set a reliable research foundation.
    6. Big Data – "Computational Knowledge Engine" (Health/Medicine) – IPSO Model.
       1. System Boundary – data bits/queries moving across the system boundary.
       2. Input – keyword or question entered into the search field (what you want answered). Example: What is p53 in cancer?
       3. Process – digitize the keyword(s) or question; algorithmic keyword match and/or queries by sending bits of data to the data mart for retrieval.
       4. Storage – storage warehouse includes all indexed "human expert knowledge" material related to health/medicine. Example: genetic analysis and gene study.
       5. Output – bits of data travel from the storage warehouse to the inquirer with answers and directly related material. No noise, just pertinent results (graphs, definitions, etc.).
    7. Big Data – "Computational Knowledge Engine" (Health/Medicine).
       Expert system – software that uses a knowledge base of human expertise for problem solving.
       Different from a search engine, this is a computational knowledge engine!
       Leverage big data and expert knowledge from Pubmed, PLoS, CDC, NIH, etc.
    8. What Does Our Software Do?
       Full corpus of knowledge: everything that can be known about biology; most of this will not be relevant to practical studies; a lot of research is wasted time or re-proving the known.
       What we need to know: this is where our artificial intelligence will lead the researcher – from what they know to what they need to know, through practical knowledge expansion.
       What we know: information we already have about our topic; subject knowledge on a subset of biology.
    9. How Will It Do This?
       Uses natural language processing to determine the meanings of phrases in scientific publications and converts them to logical statements that are then aggregated (semantic search applied to Pubmed).
       Inputs such as "p53 in cancer X" no longer lead to endless lists of papers – instead, the output is "p53 is found to be downregulated in cancer X." This is a YES/NO system allowing subsequent conditional logic to inform determinations in future NIH funding (a small conditional-logic sketch follows this transcript).
       Professors use their PhD postdocs as reading machines – this will provide an actual quality-controlled data-scouring system that probabilistically aggregates data on their subject as bivalent statements ("p53 causes cell death") as opposed to abstract observations ("p53 may have a role in cell death").
       Quantitation will allow the AI to say, "there is a 95% correlation between p53 upregulation and cell death in breast cancer."
    10. Conclusion.
       Every single piece of public data found by a biological researcher (Pubmed/ExPASy/Human Genome/NIH) is converted into data that can go into a database cell. Gene sequences, western blots, gel images, chemical interactions – literally everything is reduced to data that can be housed in a single super-database.
       Outputs can be simple Wolfram Alpha-style bivalent responses.
       Simultaneously, simple artificial intelligence leverages the big data and produces new areas of tactical research – expanding from what we know to identifying what we should know.
       In essence, we are applying the conditional logic of mathematics to biology.
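Slides 9 and 10 describe a YES/NO system whose bivalent outputs can feed conditional logic about what to study or fund next. The fragment below is a minimal sketch of that idea; the statements in `truths` and the decision rule are assumptions made for this example, not content from the slides.

```python
# Aggregated bivalent statements as the engine might emit them (hypothetical values).
truths = {
    ("p53", "downregulated in cancer X"): True,
    ("compound Y", "restores p53 expression"): True,
    ("compound Y", "toxic at therapeutic dose"): False,
}

def worth_following_up(gene: str, compound: str) -> bool:
    """Toy decision rule: recommend follow-up work only if all three conditions line up."""
    return (truths.get((gene, "downregulated in cancer X"), False)
            and truths.get((compound, f"restores {gene} expression"), False)
            and not truths.get((compound, "toxic at therapeutic dose"), True))

print(worth_following_up("p53", "compound Y"))  # True: all three conditions are satisfied
```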
