Master's Thesis - deep genomics: harnessing the power of deep neural networks... - Enrico Busto
The Human Genome Project [1], an international scientific research project with the goal of determining the sequence of nucleotide base pairs that make up human DNA, lasted roughly 15 years and cost $5 billion (adjusted for inflation). With recent advances in genome sequencing technology, that cost has fallen to a few hundred dollars [2], and a genome can be sequenced overnight.
Being able to access this kind of information may have a deep impact on the way complex diseases are treated: physicians will shift from general-purpose treatments to specific ones, tailored to the individual patient's genomic features. This approach is referred to as precision medicine.
There are, however, several caveats. First, due to the nature of the problem, knowledge of both the biomedical and the computer science domains is required to approach it correctly. Second, unlike more classical scenarios such as image classification or object detection, it is much harder to determine the accuracy of the system, owing to the complex and multifactorial nature of diseases such as cancer and neurodegenerative disorders.
Moreover, a black-box solution is unlikely to be of any use, for legal and ethical reasons: interpretability of the model is more crucial than ever.
The goal of this thesis is to explore the possibilities and the limits of techniques based on deep neural networks for the analysis of biomolecular data, experimenting with publicly available datasets.
This is the slide deck that we used when we raised $1.2 million from investors for the angel round of IMSafer, back in 2006. The original company name was Collabarent.
This presentation describes two modes of web-based knowledge acquisition in the domain of bioinformatics: "pull" models, such as social tagging systems that engage passive altruism, and "push" models, such as the Mechanical Turk, that actively guide and incentivise the knowledge acquisition process.
Gene Wiki and Mark2Cure update for BD2K - Benjamin Good
An introduction to the Gene Wiki project with an emphasis on the use of the new WikiData project. Also describes mark2cure, a citizen science initiative oriented on biomedical text mining.
Microtask crowdsourcing for disease mention annotation in PubMed abstracts - Benjamin Good
Microtask crowdsourcing for disease mention annotation in PubMed abstracts
Benjamin M. Good, Max Nanis, Andrew I. Su
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses that would otherwise be impossible. As a result, many biological natural language processing (BioNLP) projects attempt to address this challenge. However, the state of the art in BioNLP still leaves much room for improvement in terms of precision, recall and the complexity of knowledge structures that can be extracted automatically. Expert curators are vital to the process of knowledge extraction but are always in short supply. Recent studies have shown that workers on microtasking platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text.
Here, we investigated the use of AMT for capturing disease mentions in PubMed abstracts. We used the recently published NCBI Disease corpus as a gold standard for refining and benchmarking the crowdsourcing protocol. After merging the responses from 5 AMT workers per abstract with a simple voting scheme, we were able to achieve a maximum F-measure of 0.815 (precision 0.823, recall 0.807) over 593 abstracts as compared to the NCBI annotations on the same abstracts. Comparisons were based on exact matches to annotation spans. The results can also be tuned to optimize for precision (max = 0.98 when recall = 0.23) or recall (max = 0.89 when precision = 0.45). It took 7 days and cost $192.90 to complete all 593 abstracts considered here (at $0.06/abstract, with 50 additional abstracts used for spam detection).
This experiment demonstrated that microtask-based crowdsourcing can be applied to the disease mention recognition problem in the text of biomedical research articles. The F-measure of 0.815 indicates that there is room for improvement in the crowdsourcing protocol but that, overall, AMT workers are clearly capable of performing this annotation task.
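To make the aggregation and scoring concrete, here is a minimal sketch (not the authors' actual pipeline) of majority voting over exact annotation spans and the precision/recall/F-measure computation used to benchmark it; the spans, worker annotations, and vote threshold below are invented for illustration.

```python
# Illustrative sketch: merge disease-mention annotations from several workers by
# majority vote over exact (start, end) character spans, then score against gold spans.
from collections import Counter

def aggregate_spans(worker_spans, min_votes=3):
    """worker_spans: one set of (start, end) spans per worker."""
    votes = Counter(span for spans in worker_spans for span in spans)
    return {span for span, n in votes.items() if n >= min_votes}

def prf(predicted, gold):
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Example: 5 workers annotate one abstract (hypothetical offsets).
workers = [
    {(0, 14), (40, 52)},
    {(0, 14)},
    {(0, 14), (40, 52)},
    {(40, 52)},
    {(0, 14), (40, 52), (90, 99)},
]
gold = {(0, 14), (40, 52)}
merged = aggregate_spans(workers, min_votes=3)
print(prf(merged, gold))  # -> (1.0, 1.0, 1.0) on this toy example
```

Raising or lowering the vote threshold is one simple way to trade recall for precision, in the spirit of the precision- and recall-optimized operating points reported above.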
Introduction to Big Data and its Potential for Dementia Research - David De Roure
Presentation at Dementia Conference (Evington Initiative) held at Wellcome Trust, 22-23 October 2012. Acknowledgements to McKinsey & Company, also Tim Clark (MGH) and Iain Buchan (University of Manchester), for input to slides.
Docker in Open Science Data Analysis Challenges by Bruce Hoff - Docker, Inc.
Typically in predictive data analysis challenges, participants are provided a dataset and asked to make predictions. Participants include with their prediction the scripts/code used to produce it. Challenge administrators validate the winning model by reconstructing and running the source code.
Often data cannot be provided to participants directly, e.g. due to data sensitivity (data may be from living human subjects) or data size (tens of terabytes). Further, predictions must be reproducible from the code provided by participants. Containerization is an excellent solution to these problems: rather than providing the data to the participants, we ask the participants to provide a Dockerized "trainable" model. We run both the training and validation phases of machine learning and guarantee reproducibility 'for free'.
We use the Docker tool suite to spin up and run servers in the cloud to process the queue of submitted containers, each essentially a batch job. This fleet can be scaled to match the level of activity in the challenge. We have used Docker successfully in our 2015 ALS Stratification Challenge and our 2015 Somatic Mutation Calling Tumour Heterogeneity (SMC-HET) Challenge, and are starting an implementation for our 2016 Digital Mammography Challenge.
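As a rough sketch of how such a harness might invoke one submitted container, here is a minimal example using the Docker SDK for Python; the image names, mount paths, and commands are hypothetical placeholders, not the actual challenge infrastructure.

```python
# Minimal sketch of running a participant's Dockerized model against data it never
# sees directly: the private dataset is bind-mounted read-only into the container.
import docker

client = docker.from_env()

def run_submission(image, data_dir, output_dir, command):
    return client.containers.run(
        image,
        command,
        volumes={
            data_dir: {"bind": "/data", "mode": "ro"},      # private challenge data
            output_dir: {"bind": "/output", "mode": "rw"},  # model writes predictions here
        },
        network_disabled=True,  # keep the sensitive data from leaking out
        remove=True,
    )

# Train, then predict, with the same submitted image (illustrative phases only).
run_submission("participant/model:latest", "/secure/train", "/runs/42", "train")
run_submission("participant/model:latest", "/secure/test", "/runs/42", "predict")
```

Each such run is one batch job pulled off the submission queue, so the worker fleet can be scaled up or down with challenge activity.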
The presentation by Klaus Gottlieb highlights human thinking tools that maintain advantages over AI, focusing on critical thinking, creativity, and problem-solving strategies. It showcases how these cognitive skills enable humans to interpret, innovate, and navigate complex scenarios more effectively than current AI capabilities, underscoring the importance of leveraging human intellect alongside technological advancements.
Keywords: Critical Thinking, Creativity, Problem-Solving, Human Intellect, Cognitive Skills, Innovation, AI Limitations.
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought around how to manage knowledge, ranging from collection development to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering is targeted at helping to answer a human's information need.
Increasingly, however, the demand is for data: data needed not for people's consumption but to drive machines. As an example of this demand, there has been explosive growth in job openings for Data Engineers, professionals who prepare data for machine consumption. In this talk, I give an overview of the information needs of machine intelligence and ask the question: are our knowledge management techniques applicable for serving this new consumer?
Establishing an Online Access Panel for Interactive Information Retrieval Res... - GESIS
We propose an online access panel to support the evaluation process of Interactive Information Retrieval (IIR) systems. By maintaining an online access panel of IIR system users, we expect that the recurring effort of recruiting participants for both web-based and lab studies can be minimized. We intend to use the online access panel not only for our own development processes but also to open it to other interested researchers in the field of IIR. In this paper we present the concept of the online access panel as well as first implementation details.
The Impact of Information Technology on Chemistry and Related Sciences - Ashutosh Jogalekar
This is a copy of an invited talk I gave at the ACS meeting in Dallas in March 2014. The talk was about the impact of information technology on chemistry and related sciences. I interpreted 'information technology' broadly and divided the talk into three sections: Data, Simulation and Sociology.
'Data' talks about how chemical information has grown exponentially and how chemists are coming up with new techniques to store, organize and understand this information.
'Simulation' talks about how chemists are using the last two decades' spectacular progress in hardware and software to understand the behavior of molecules in a variety of applications ranging from drug design to new materials.
'Sociology' talks about the impact of blogs and social media on the practice of chemistry. More specifically I talk about how social media is serving as a 'second tier' of peer review and how this new medium is having an increasingly influential impact on many issues close to chemists' hearts including lab safety, 'chemophobia' and the public appreciation of chemistry.
Professor Carole Goble, University of Manchester, talks at the RIN "Research data: policies & behaviour" event as part of a series on Research Information in Transition.
Where are we going and how are we going to get there? - David De Roure
Keynote from JISC Projects start-up meeting
Information Environment 2009-11 & Virtual Research Environment http://www.jisc.ac.uk/whatwedo/programmes/inf11/inf11startup.aspx
Integrating Pathway Databases with Gene Ontology Causal Activity Models - Benjamin Good
The Gene Ontology (GO) Consortium (GOC) is developing a new knowledge representation approach called ‘causal activity models’ (GO-CAM). A GO-CAM describes how one or several gene products contribute to the execution of a biological process. In these models (implemented as OWL instance graphs anchored in Open Biological Ontology (OBO) classes and relations), gene products are linked to molecular activities via semantic relationships like ‘enables’, molecular activities are linked to each other via causal relationships such as ‘positively regulates’, and sets of molecular activities are defined as ‘parts’ of larger biological processes. This approach provides the GOC with a more complete and extensible structure for capturing knowledge of gene function. It also allows for the representation of knowledge typically seen in pathway databases.
Here, we present details and results of a rule-based transformation of pathways represented using the BioPAX exchange format into GO-CAMs. We have automatically converted all Reactome pathways into GO-CAMs and are currently working on the conversion of additional resources available through Pathway Commons. By converting pathways into GO-CAMs, we can leverage OWL description logic reasoning over OBO ontologies to infer new biological relationships and detect logical inconsistencies. Further, the conversion helps to increase standardization for the representation of biological entities and processes. The products of this work can be used to improve source databases, for example by inferring new GO annotations for pathways and reactions and can help with the formation of meta-knowledge bases that integrate content from multiple sources.
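A toy version of the instance-graph structure described above can be sketched with rdflib. The URIs below are illustrative placeholders in an example namespace; real GO-CAMs anchor instances in OBO classes and relations (GO, RO, BFO) with stable identifiers.

```python
# Toy GO-CAM-style instance graph: a gene product enables a molecular activity,
# activities are causally linked, and both are parts of one biological process.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/go-cam/")
g = Graph()

# Instances of molecular activities and their types (placeholder classes).
g.add((EX.kinase_activity_1, RDF.type, EX.ProteinKinaseActivity))
g.add((EX.tf_activity_1, RDF.type, EX.TranscriptionFactorActivity))

# Gene product linked to its activity via an 'enables'-style relation.
g.add((EX.gene_product_MAPK1, EX.enables, EX.kinase_activity_1))

# Causal edge between activities, and both activities as parts of one process.
g.add((EX.kinase_activity_1, EX.positively_regulates, EX.tf_activity_1))
g.add((EX.kinase_activity_1, EX.part_of, EX.signal_transduction_1))
g.add((EX.tf_activity_1, EX.part_of, EX.signal_transduction_1))

print(g.serialize(format="turtle"))
```

Representing the model as an OWL/RDF instance graph is what makes description logic reasoning and consistency checking over the converted pathways possible.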
Pathways2GO: Converting BioPax pathways to GO-CAMs - Benjamin Good
Presentation at the Gene Ontology Consortium Annual Meeting. Describing the automatic conversion of biochemical pathways in the Reactome Knowledge Base into the Gene Ontology 'Causal Activity Model' representation.
Building a Biomedical Knowledge Garden - Benjamin Good
Describes the tribulations of building a large biomedical knowledge graph. Provides a comparison between the UMLS and Wikidata in terms of content and structure. Concludes with the idea of anchoring the knowledge graph in Wikidata items and properties.
When the Heart BD2K grant was originally written, we proposed to build something called “Big Data World” to help advance citizen science, scientific crowdsourcing, and science education, especially in bioinformatics. This past year, this idea has become Science Game Lab (https://sciencegamelab.org), a collaboration between the Su laboratory at Scripps Research, Playmatics LLC, and, more recently, the creators of WikiPathways.
Opportunities and challenges presented by Wikidata in the context of biocuration - Benjamin Good
Abstract—Wikidata is a world readable and writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open access knowledge graph spanning biology, medicine, and all other domains of knowledge. To meet this potential, social and technical challenges must be overcome - many of which are familiar to the biocuration community. These include community ontology building, high precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data.
(Poster) Knowledge.Bio: an Interactive Tool for Literature-based Discovery - Benjamin Good
PubMed now indexes roughly 25 million articles and is growing by more than a million per year. The scale of this “Big Knowledge” repository renders traditional, article-based modes of user interaction unsatisfactory, demanding new interfaces for integrating and summarizing widely distributed knowledge. Natural language processing (NLP) techniques coupled with rich user interfaces can help meet this demand, providing end-users with enhanced views into public knowledge, stimulating their ability to form new hypotheses.
Knowledge.Bio provides a Web interface for exploring the results of text-mining PubMed. It works with subject-predicate-object assertions (triples) extracted from individual abstracts and with predicted statistical associations between pairs of concepts. While agnostic to the NLP technology employed, the current implementation is loaded with triples from the SemRep-generated SemMedDB database and putative gene-disease pairs obtained using Leiden University Medical Center's 'Implicitome' technology.
Users of Knowledge.Bio begin by identifying a concept of interest using text search. Once a concept is identified, associated triples and concept-pairs are displayed in tables. These tables have text-based and semantic filters to help refine the list of triples to relations of interest. The user then selects relations for insertion into a personal knowledge graph implemented using cytoscape.js. The graph is used as a note-taking or ‘mind-mapping’ structure that can be saved offline and then later reloaded into the application. Clicking on edges within a graph or on the ‘evidence’ element of a triple displays the abstracts where that relation was detected, thus allowing the user to judge the veracity of the statement and to read the underlying articles.
Knowledge.Bio is a free, open-source application that can provide, deep, personal, concise, shareable views into the “Big Knowledge” scattered across the biomedical literature.
Application: http://knowledge.bio
Source code: https://bitbucket.org/sulab/kb1/
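The data shapes involved can be illustrated with a small, hypothetical sketch: triples filtered by a concept of interest and converted into the element list that cytoscape.js expects. The field names and triples here are illustrative, not Knowledge.Bio's actual schema.

```python
# Filter text-mined triples by a concept and build cytoscape.js-style graph elements.
triples = [
    {"s": "BRCA1", "p": "ASSOCIATED_WITH", "o": "breast cancer", "pmid": "23456789"},
    {"s": "tamoxifen", "p": "TREATS", "o": "breast cancer", "pmid": "12345678"},
    {"s": "TP53", "p": "CAUSES", "o": "Li-Fraumeni syndrome", "pmid": "34567890"},
]

def filter_by_concept(triples, concept):
    return [t for t in triples if concept in (t["s"], t["o"])]

def to_cytoscape_elements(triples):
    """cytoscape.js takes a flat list of {'data': {...}} node and edge objects."""
    nodes = {t[k] for t in triples for k in ("s", "o")}
    elements = [{"data": {"id": n}} for n in nodes]
    elements += [
        {"data": {"source": t["s"], "target": t["o"], "label": t["p"], "evidence": t["pmid"]}}
        for t in triples
    ]
    return elements

selected = filter_by_concept(triples, "breast cancer")
print(to_cytoscape_elements(selected))
```

Keeping the source PMID on each edge is what lets a user click an edge and read the abstracts supporting that relation.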
Update on the Gene Wiki project, introduction to the knowledge.bio semantic search application, and introduction to the biobranch.org collaborative decision tree creator.
Building a massive biomedical knowledge graph with citizen science - Benjamin Good
The life sciences are faced with a rapidly growing array of technologies for measuring the molecular states of living things. From sequencing platforms that can assemble the complete genome sequence of a complex organism involving billions of nucleotides in a few days to imaging systems that can just as rapidly churn out millions of snapshots of cells, biology is truly faced with a data deluge. To translate this information into new knowledge that can guide the search for new medicines, biomedical researchers increasingly need to build on the existing knowledge of the broad community. Prior knowledge can help guide searches through the masses of new data. Unfortunately, most biomedical knowledge is represented solely in the text of journal articles. Given that more than a million such articles are published every year, the challenge of using this knowledge effectively is substantial. Ideally, knowledge such as the interrelations between genes, drugs and diseases would be represented in a knowledge graph that enabled queries like: “show me all the genes related to this disease or related to any drugs used to treat this disease”. Systems exist that attempt to extract this information automatically from text, but the quality of their output remains far below what can be obtained by human readers. We are developing a new platform that taps the language comprehension abilities of citizen scientists to help excavate a queryable knowledge graph from the biomedical literature. In proof-of-concept experiments, we have demonstrated that lay-people are capable of extracting meaningful information from complex biological text. The information extracted using this community intelligence framework can surpass the efforts of individual experts in quality while also offering the potential to achieve massive scale. In this presentation we will describe the results of early experiments and introduce our prototype citizen science platform: http://mark2cure.org.
Branch: An interactive, web-based tool for building decision tree classifiers - Benjamin Good
A crucial task in modern biology is the prediction of complex phenotypes, such as breast cancer prognosis, from genome-wide measurements. Machine learning algorithms can sometimes infer predictive patterns, but there is rarely enough data to train and test them effectively, and the patterns they identify are often expressed in forms (e.g. support vector machines, neural networks, random forests composed of tens of thousands of trees) that are very difficult to understand. In addition, it is generally unclear how to include prior knowledge in the course of their construction.
Decision trees provide an intuitive visual form that can capture complex interactions between multiple variables. Effective methods exist for inferring decision trees automatically but it has been shown that these techniques can be improved upon via the manual interventions of experts. Here, we introduce Branch, a new Web-based tool for the interactive construction of decision trees from genomic datasets. Branch offers the ability to: (1) upload and share datasets intended for classification tasks (in progress), (2) construct decision trees by manually selecting features such as genes for a gene expression dataset, (3) collaboratively edit decision trees, (4) create feature functions that aggregate content from multiple independent features into single decision nodes (e.g. pathways) and (5) evaluate decision tree classifiers in terms of precision and recall. The tool is optimized for genomic use cases through the inclusion of gene and pathway-based search functions.
Branch enables expert biologists to easily engage directly with high-throughput datasets without the need for a team of bioinformaticians. The tree building process allows researchers to rapidly test hypotheses about interactions between biological variables and phenotypes in ways that would otherwise require extensive computational sophistication. In so doing, this tool can both inform biological research and help to produce more accurate, more meaningful classifiers.
A prototype of Branch is available at http://biobranch.org/
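To illustrate the kind of artifact a Branch user builds, here is a hand-made, two-split decision tree evaluated for precision and recall; the gene names, thresholds, and tiny dataset are invented for the sketch, not taken from the tool.

```python
# A manually constructed decision tree of the sort an expert might assemble in Branch.
samples = [  # toy expression values and observed outcome (1 = poor prognosis)
    {"MKI67": 8.2, "ESR1": 2.1, "outcome": 1},
    {"MKI67": 3.0, "ESR1": 7.5, "outcome": 0},
    {"MKI67": 7.9, "ESR1": 6.8, "outcome": 0},
    {"MKI67": 6.5, "ESR1": 1.9, "outcome": 1},
]

def manual_tree(sample):
    """Two expert-chosen splits; a 'feature function' could aggregate a whole pathway here."""
    if sample["MKI67"] > 6.0:                       # high proliferation marker
        return 1 if sample["ESR1"] < 4.0 else 0     # ER-low -> predict poor prognosis
    return 0

predictions = [manual_tree(s) for s in samples]
tp = sum(p == 1 and s["outcome"] == 1 for p, s in zip(predictions, samples))
precision = tp / max(sum(predictions), 1)
recall = tp / max(sum(s["outcome"] for s in samples), 1)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The point of the tool is that each split corresponds to an interpretable biological hypothesis, unlike the opaque ensembles mentioned above.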
The Cure: Making a game of gene selection for breast cancer survival prediction - Benjamin Good
Background: Molecular signatures for predicting breast cancer prognosis could greatly improve care through personalization of treatment. Computational analyses of genome-wide expression datasets have identified such signatures, but these signatures leave much to be desired in terms of accuracy, reproducibility and biological interpretability. Methods that take advantage of structured prior knowledge (e.g. protein interaction networks) show promise in helping to define better signatures but most knowledge remains unstructured. Crowdsourcing via scientific discovery games is an emerging methodology that has the potential to tap into human intelligence at scales and in modes previously unheard of.
Objective: The main objective of this study was to test the hypothesis that knowledge linking expression patterns of specific genes to breast cancer outcomes could be captured from players of an open, Web-based game. We envisioned capturing knowledge both from the player’s prior experience and from their ability to interpret text related to candidate genes presented to them in the context of the game.
Methods: We developed and evaluated an online game called “The Cure” that captured information from players regarding genes for use in predictors of breast cancer survival. Information gathered from game play was aggregated using a voting approach and used to create rankings of genes. The top genes from these rankings were evaluated using annotation enrichment analysis, comparison to prior predictor gene sets, and by using them to train and test machine learning systems for predicting 10-year survival.
Results: Between its launch in Sept. 2012 and Sept. 2013, The Cure attracted more than 1,000 registered players who collectively played nearly 10,000 games. Gene sets assembled through aggregation of the collected data showed significant enrichment for genes known to be related to key concepts such as Cancer, Disease Progression, and Recurrence (P < 1.1e-07). In terms of the accuracy of models trained using them, these gene sets provided comparable performance to gene sets generated using other methods including those used in commercial tests. The Cure is available at http://genegames.org/cure/
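A minimal sketch of the voting-based aggregation described in the Methods, with made-up gene selections: each completed game contributes votes for the genes the player chose, and genes are ranked by total votes.

```python
# Aggregate player gene selections into a ranked gene list (toy data).
from collections import Counter

game_selections = [          # one list of selected genes per completed game
    ["AURKA", "BIRC5", "CCNB1"],
    ["AURKA", "ESR1"],
    ["BIRC5", "AURKA", "MKI67"],
]

votes = Counter(gene for selection in game_selections for gene in selection)
ranking = [gene for gene, _ in votes.most_common()]
print(ranking[:2])  # top-ranked genes would feed enrichment analysis and classifier training
```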
Poster: Microtask crowdsourcing for disease mention annotation in PubMed abst... - Benjamin Good
Benjamin M. Good, Max Nanis, Andrew I. Su
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses that would otherwise be impossible. As a result, many biological natural language processing (BioNLP) projects attempt to address this challenge. However, the state of the art in BioNLP still leaves much room for improvement in terms of precision, recall and the complexity of knowledge structures that can be extracted automatically. Expert curators are vital to the process of knowledge extraction but are always in short supply. Recent studies have shown that workers on microtasking platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text.
Here, we investigated the use of AMT for capturing disease mentions in PubMed abstracts. We used the recently published NCBI Disease corpus as a gold standard for refining and benchmarking the crowdsourcing protocol. After merging the responses from 5 AMT workers per abstract with a simple voting scheme, we were able to achieve a maximum F-measure of 0.815 (precision 0.823, recall 0.807) over 593 abstracts as compared to the NCBI annotations on the same abstracts. Comparisons were based on exact matches to annotation spans. The results can also be tuned to optimize for precision (max = 0.98 when recall = 0.23) or recall (max = 0.89 when precision = 0.45). It took 7 days and cost $192.90 to complete all 593 abstracts considered here (at $0.06/abstract, with 50 additional abstracts used for spam detection).
This experiment demonstrated that microtask-based crowdsourcing can be applied to the disease mention recognition problem in the text of biomedical research articles. The F-measure of 0.815 indicates that there is room for improvement in the crowdsourcing protocol but that, overall, AMT workers are clearly capable of performing this annotation task.
2. Types of Biomedical Citizen Science
• Personal data: 23andMe surveys, uBiome, …
• Microtasks: large in number, small in difficulty
• Megatasks: small in number, high in difficulty
Good & Su 2013. Crowdsourcing for Bioinformatics. Bioinformatics
3. Citizens enhance the capacity of traditional science by performing microtasks
• Processing images (tagging, tracing)
• Annotating concepts in text: mark2cure.org
Keys for success: volume, redundancy, aggregation, gold standards for training and quality assessment
4. More than microtasks, not quite megatasks: Visual Reasoning
• Foldit: 3D protein structure puzzles
• DNA sequence alignment
Key to making it fun and productive is a good automated scoring function (see the sketch below)
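A toy scoring function of the kind such puzzles rely on, assuming a simple match/mismatch/gap scheme; real games use richer scoring.

```python
# Score two equal-length aligned sequences ('-' marks a gap) with a toy scheme.
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

print(alignment_score("ACGT-A", "ACGTTA"))  # 5 matches and 1 gap -> 3
```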
5. Citizen scientists performing the megatasks usually reserved for professional scientists
(Khatib et al 2011) “Algorithm discovery by protein folding game players” PNAS
Open Innovation Contest Platforms
6. Successful megatasks
• Large, diverse user population
• Well-defined problem
• Rapid, high-quality feedback on proposed solutions
7. EteRNA
http://eterna.cmu.edu/
1. It's a game! Rapid feedback, visual, beautiful.
2. It's a weekly competition! The winners are rewarded with real laboratory tests of their hypotheses about how a string of RNA will fold in three dimensions.
3. It's a community forum for creating RNA design rules.
4. It's a machine learning system for learning RNA design rules.
5. It's a MASSIVE OPEN ONLINE LABORATORY!
6. It's working! (Lee 2013) RNA design rules from a massive open laboratory, PNAS
8. Identifying new opportunities
• “Measurement is the beginning of [citizen] science”
• Look for problems where it is possible to provide rapid, high-quality feedback about progress
9. Needs
• Frame the process
• Reduce barriers to entry
10. Games are the Gateway Drugs!
• Short, high-quality feedback loops
• Deeper learning, discovery paths
11. Opportunities
• Biology and medicine provide a heroic purpose for citizen science, not unlike the more traditional purpose of saving the world from aliens.