Gene Wiki and Mark2Cure update for BD2K Benjamin Good
An introduction to the Gene Wiki project with an emphasis on the use of the new Wikidata project. Also describes Mark2Cure, a citizen science initiative focused on biomedical text mining.
Presentation for teaching faculty about resources, data, issues, and strategies for including personal genomics in the classroom, within the context of precision medicine as an overarching theme.
Presentation at the Canadian Cancer Research Conference satellite bioinformatics.ca workshop. This one is an introduction to the TCGA, ICGC and COSMIC databases.
Personal Genomes: what can I do with my data? Melanie Swan
Biology evolved to be just good enough to survive and genomics provides the critical next-generation toolkit for its greater exploitation. Genomics is already starting to be medically actionable and is likely to become increasingly useful over time. This presentation discusses how your genetic information is already useful today,
Next-generation sequencing has enabled clinicians and researchers alike to identify novel genetic variants associated with rare Mendelian diseases across the human genome. To help researchers and clinicians understand the role of CNVs in human health and disease, Golden Helix has integrated a specialized NGS-based CNV caller capable of detecting deletion and duplication events as small as single exons and as large as whole-chromosome aneuploidy events. In this webcast, we will present our workflows that integrate the NGS-based CNV caller into SVS.
Cancer genome databases & Ecological databases Waliullah Wali
Introduction
Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis.
Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.
Cancer genome databases
COSMIC cancer database
COSMIC is an online database of somatically acquired mutations found in human cancer.
The database is freely available.
Types of data
Expert curation data
Genome-wide screen data
Expert curation data
Manually input by COSMIC expert curators.
Consists of comprehensive literature curation followed by subsequent updates.
Includes additional data points relevant to each disease and publication.
Provides accurate frequency data, as mutation-negative samples are specified.
Genome-wide screen data
Uploaded from publications reporting large-scale genome screening data or imported from other databases such as TCGA and ICGC.
Provides unbiased molecular profiling of diseases while covering the whole genome.
Provides objective frequency data by interpreting non-mutant genes across each genome.
Facilitates finding novel driver genes in cancer.
To open the COSMIC cancer database, type http://cancer.sanger.ac.uk/cosmic in the browser's address bar.
Searching Process
Examples
Ecological databases
Ecological databases are a source for finding ecological datasets and quickly figuring out the best ways to use them.
BioOne
DataONE
GEOBASE
BioOne
BioOne is a nonprofit publisher that aims to make scientific research more accessible.
BioOne was established in 1999 in Washington, DC.
BioOne is Complete and open-access.
It serves a community of over 140 society and institutional publishers, 4,000 accessing institutions, and millions of researchers worldwide.
To open the BioOne ecological database, type http://www.bioone.org/ in the browser's address bar.
The key considerations of CRISPR genome editing Chris Thorne
While CRISPR is simple to use, widely applicable and often highly efficient, there are a number of things to keep in mind to maximise experimental success. Here's what we recommend...
Presentation for Network Biology SIG 2013 by Thomas Kelder, Bioinformatics Scientist at TNO in The Netherlands. “Functional Network Signatures Link Anti-diabetic Interventions with Disease Parameters”
Errors and Limitations of Next Generation Sequencing Nixon Mendez
High-throughput sequencing technologies have made whole genome sequencing and resequencing available to many more researchers and projects.
Cost and time have been greatly reduced.
The error profiles and limitations of the new platforms differ significantly from those of previous sequencing technologies.
The selection of an appropriate sequencing platform for particular types of experiments is an important consideration.
NGS sequencing errors fall mainly into the following categories:
1. Low-quality bases
2. PCR errors
3. High error rate
NGS has inherent limitations, as follows:
1. Sequence properties and algorithmic challenges
2. Contamination or new insertions
3. Repeat content
4. Segmental duplications
5. Missing and fragmented genes
6. Reference index
ScholarMate - A social research management tool Jing Wang
ScholarMate.com is a professional research social network and a social research management tool.
ScholarMate’s vision is to connect people to research and innovate smarter.
eGrant.cn, ScholarMate.com and InnoCity.com are trademarks of IRIS Systems (Shenzhen) Ltd.
Compiled while the 2014 outbreak is still ongoing. Although labeled as Ebola, it includes one or two slides about viral hemorrhagic fevers and some more about Marburg virus as well. Being a budding microbiologist, I have focused on disease, agent and prevention. Statistics up to 31.10.2014 are included with references. The Indian scenario is also considered. Let us all hope that this will be the last update for this presentation. Suggestions are welcome.
Since Mars was discovered, mankind has been interested in this planet. Many people believe that saving humanity depends on the colonization of the Red Planet. Here are 10 interesting facts about Mars.
The role and importance of social media in science Jari Laru
The role and importance of social media in science presentation in the course: 920001J - Introduction to Doctoral Training (1 ECTS credit). UNIOGS, University of Oulu, Finland.
Advice on writing for the web, a discussion of the special considerations of the medium, and some best practices for developing and delivering online content.
A brief history of the most well-documented and provocative UFO sightings (along with declassified government documents) and a discussion of the extraterrestrial hypothesis (ETH) and alternatives, such as the interdimensional/control system theory promoted by Jacques Vallee.
The video of this presentation is also available, and will help make sense of some of the slides that lack text and descriptions.
https://www.youtube.com/watch?v=4MA0zQdbFfY&feature=youtu.be
Update on the gene wiki project, introduction to knowledge.bio semantic search application, introduction to biobranch.org collaborative decision tree creator
A Centralized Model Organism Database (CMOD) for the Long Tail of Sequenced G... Andrew Su
Keynote talk given at GMOD 2014
Video of talk at: https://www.youtube.com/watch?v=RVijs5ry05E
Video of QA at: https://www.youtube.com/watch?v=dGHXo-iNsyU
Blog post: http://sulab.org/2013/06/creating-a-centralized-model-organism-database-cmod/
Quantified Self On Being A Personal Genomic Observatory Larry Smarr
Larry Smarr's presentation on the "Quantified Self On Being A Personal Genomic Observatory", Keynote in the "Humans as Genomic Observatories" Meeting Session in the Genomics Standards Consortium, GSC 15, April 24, 2013
Introduction to Gene Mining Part A: BLASTn-off! adcobb
In this lesson, students will learn to use bioinformatics portals and tools to mine plant versions of human genes. Student handout and teacher resource materials are available at www.Araport.org, Teaching Resources (Community tab). Suitable for grades 9-12 or first year undergraduate students.
Microbiome Isolation and DNA Enrichment Protocol: Pathogen Detection Webinar ... QIAGEN
This slidedeck presents an easy-to-use workflow that allows selective isolation of microbial DNA from samples that are intrinsically rich in host DNA. This protocol includes steps for efficient depletion of host DNA while providing optimized conditions specific for bacterial lysis. This workflow is also specific for the identification of live bacteria, avoiding false results due to nucleic acids from dead bacteria. Enriched microbial DNA can be directly used in other molecular methods such as whole genome sequencing, qPCR and microarray assays.
Microtask crowdsourcing for disease mention annotation in PubMed abstracts Benjamin Good
Microtask crowdsourcing for disease mention annotation in PubMed abstracts
Benjamin M. Good, Max Nanis, Andrew I. Su
Identifying concepts and relationships in biomedical text enables knowledge to be applied in computational analyses that would otherwise be impossible. As a result, many biological natural language processing (BioNLP) projects attempt to address this challenge. However, the state of the art in BioNLP still leaves much room for improvement in terms of precision, recall and the complexity of knowledge structures that can be extracted automatically. Expert curators are vital to the process of knowledge extraction but are always in short supply. Recent studies have shown that workers on microtasking platforms such as Amazon’s Mechanical Turk (AMT) can, in aggregate, generate high-quality annotations of biomedical text.
Here, we investigated the use of AMT in capturing disease mentions in PubMed abstracts. We used the recently published NCBI Disease corpus as a gold standard for refining and benchmarking the crowdsourcing protocol. After merging the responses from 5 AMT workers per abstract with a simple voting scheme, we were able to achieve a maximum F-measure of 0.815 (precision 0.823, recall 0.807) over 593 abstracts as compared to the NCBI annotations on the same abstracts. Comparisons were based on exact matches to annotation spans. The results can also be tuned to optimize for precision (max = 0.98 when recall = 0.23) or recall (max = 0.89 when precision = 0.45). It took 7 days and cost $192.90 to complete all 593 abstracts considered here (at $0.06/abstract with 50 additional abstracts used for spam detection).
This experiment demonstrated that microtask-based crowdsourcing can be applied to the disease mention recognition problem in the text of biomedical research articles. The f-measure of 0.815 indicates that there is room for improvement in the crowdsourcing protocol but that, overall, AMT workers are clearly capable of performing this annotation task.
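As a quick check of the figures quoted in this abstract, the sketch below recomputes the F-measure from the reported precision and recall, and the total cost from the per-abstract rate. It assumes the balanced F-measure (harmonic mean of precision and recall) and assumes that the $0.06 rate was paid to each of the 5 workers for each abstract; neither assumption is spelled out in the abstract, although both reproduce the reported numbers.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
best_f = f_measure(0.823, 0.807)

# Assumed cost model: every abstract (593 evaluated + 50 used for spam
# detection) is annotated by 5 workers at $0.06 per worker per abstract.
abstracts = 593 + 50
workers_per_abstract = 5
rate_per_annotation = 0.06
total_cost = abstracts * workers_per_abstract * rate_per_annotation

print(f"F-measure: {best_f:.3f}")
print(f"Total cost: ${total_cost:.2f}")
```

Under these assumptions the script prints an F-measure of 0.815 and a total cost of $192.90, matching the abstract.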
Microbiome Profiling with the Microbial Genomics Pro Suite QIAGEN
In this slide deck, we introduce the scientist-friendly Microbial Genomics Pro Suite offering workflows optimized for microbiome profiling, microbial typing and outbreak analysis. The workflows and tools for microbial genomics introduced with this software package are further extending the comprehensive set of genomics, transcriptomics and epigenomics analysis solutions that researchers know from CLC Genomics Workbench.
Slides contain information about why bioinformatics appeared, who bioinformaticians are, what they do, and what kind of cool applications and challenges in bioinformatics there are.
Slides were prepared for the Bioinformatics seminar 2016, Institute of Computer Science, University of Tartu.
From Genomics to Medicine: Advancing Healthcare at Scale Databricks
With the exponential growth of genomic data sets, healthcare practitioners now have the opportunity to improve human outcomes at an unprecedented pace. These outcomes are difficult to realize in the existing ecosystem of genomic tools, where biostatisticians regularly chain together command-line interfaces based on a single-node setup on premise. The Databricks Unified Analytics Platform for Genomics empowers users to perform end-to-end analysis on our massively scalable platform in the cloud: in only minutes, a data scientist can visualize an individual’s disease risk based on their raw genomic data. Built on Apache Spark, we provide click-button implementations of accepted best practice workflows, as well as low-level Spark SQL optimizations for common genomics operations.
Jake Lever - University of Glasgow
Will artificial intelligence change how readers use the research literature?
Huge advances in machine learning and natural language processing are set to upend how researchers search and consume research articles as well as change how articles are written. These new approaches are becoming adept at summarising and rewriting text, answering questions about it and extracting key information. These abilities will enable humans to search for information in new ways, such as the new ChatGPT system. They are valuable tools for researchers who curate the research literature to build knowledge bases particularly in biomedicine. Nevertheless, these approaches suffer from large problems including their computational cost and that they can confidently output incorrect information. This session will provide background on how these new methods work and discuss their benefits, challenges and potential impact.
Similar to Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science (20)
Citizen Science and Rare Disease Research Andrew Su
Talk given at "Personalized Health in the Digital Age" September 22, 2016 at Campus Biotech in Geneva, Switzerland https://www.personalizedhealth2016.ch/
Centralized Model Organism Database (Biocuration 2014 poster) Andrew Su
A Centralized Model Organism Database (CMOD) for the Long Tail of Genomes
Presented at Biocuration 2014 in Toronto http://biocuration2014.events.oicr.on.ca/
See related slides at http://www.slideshare.net/andrewsu/20140116-gmod-short
NCBO Webinar: Translating unstructured, crowdsourced content into structured ... Andrew Su
The use of crowdsourcing in biology is gaining popularity as a mechanism to tackle challenges of massive scale. However, to maximize participation and lower the barriers to entry, contributions to crowdsourcing efforts are typically not well-structured, which makes computing on these data challenging and difficult. The presentation will discuss strategies for translating this unstructured content into structured data. Three vignettes (in varying degrees of completion) will be described, one each from our Gene Wiki [1], BioGPS [2], and serious gaming [3] initiatives.
[1]: http://en.wikipedia.org/wiki/Portal:Gene_Wiki
[2]: http://biogps.org
[3]: http://genegames.org
Toxic effects of heavy metals: Lead and Arsenic sanjana502982
Heavy metals are naturally occurring metallic chemical elements that have relatively high density and are toxic even at low concentrations. All toxic metals are termed heavy metals irrespective of their atomic mass and density, e.g. arsenic, lead, mercury, cadmium, thallium, chromium, etc.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep... University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) with the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and solution of frictionless reproducibility, calling the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
Exposé invité Journées Nationales du GDR GPL 2024
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... Wasswaderrick3
In this book, we use conservation of energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity and then from this we derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our techniques of energy conservation to a sphere falling in a viscous medium under the effect of gravity. We demonstrate the Stokes equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium. We also look at the general equation of terminal velocity.
Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science
1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
May 14, 2014
CBIIT
Slides: slideshare.net/andrewsu
Citizen Science!
2. Few genes are well annotated…
[Figure: GO annotation counts per gene, with the 20,473 protein-coding genes sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Callouts: 41%, 65%. Data: NCBI, February 2013.]
3. … because the literature is sparsely curated?
[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]
4. … because the literature is sparsely curated?
[Figure: average capacity of human scientist, 1983 to 2013; y-axis from 0 to 40.]
6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
7. The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), divided into a Short Head and a Long Tail.]
Category: Short Head / Long Tail
News: Newspapers / Blogs
Video: TV/Hollywood / YouTube
Product reviews: Consumer reports / Amazon reviews
Food reviews: Food critics / Yelp
Talent judging: Olympics / American Idol
9. Wikipedia has breadth and depth
[Figure: articles and words (millions), Wikipedia vs. Britannica Online. Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008.]
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
14. Wiki success depends on a positive feedback
[Diagram: feedback loop linking Gene wiki page utility, number of users, and number of contributors.]
15. 10,000 gene “stubs” within Wikipedia
[Figure: annotated Gene Wiki stub page. Labels: protein structure; symbols and identifiers; tissue expression pattern; Gene Ontology annotations; links to structured databases; gene summary; protein interactions; linked references.]
Huss, PLoS Biol, 2008
16. Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
17. Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
[Figure: editor count and edit count.]
Good, NAR, 2011
18. A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
19. Making the Gene Wiki more computable
[Diagram labels: Structured annotations, Free text]
20. Filling the gaps in gene annotation
[Diagram labels: Wikilink; GO exact match (GO:0006897); Gene Wiki mapping (NCBI Entrez Gene: 334); Candidate assertion]
6319 novel GO annotations
2147 novel DO annotations
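The slide only names the pieces of this mining step, so the following is a minimal sketch of how they might fit together, not the authors' implementation. It assumes that a candidate assertion is produced whenever the text of a wikilink on a gene's page exactly matches the name of a GO term (ignoring case). The function name, the wikilink list, and the one-entry GO table are hypothetical; the gene ID and GO ID are the ones shown on the slide, and the label "endocytosis" for GO:0006897 should be checked against the ontology.

```python
def candidate_assertions(gene_id, wikilinks, go_names):
    """Return (gene_id, go_id) pairs for wikilinks that exactly match a GO term name.

    gene_id   -- Entrez Gene ID that the Gene Wiki page is mapped to
    wikilinks -- titles of pages linked from the gene's article
    go_names  -- mapping of GO ID -> term name
    """
    # Index GO terms by lower-cased name so matching is case-insensitive.
    name_to_id = {name.lower(): go_id for go_id, name in go_names.items()}
    return [
        (gene_id, name_to_id[link.lower()])
        for link in wikilinks
        if link.lower() in name_to_id
    ]


# Hypothetical inputs for illustration only.
go_table = {"GO:0006897": "endocytosis"}
links_on_page = ["Endocytosis", "Protein", "Cell membrane"]

print(candidate_assertions(334, links_on_page, go_table))
```

Each pair is only a candidate; exact string matching says nothing about whether the article actually asserts that the gene participates in the process, so some form of review would still be needed.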
21. Gene Wiki content improves enrichment analysis
[Figure: enrichment p-values, p-value (PubMed only) vs. p-value (PubMed + GW). Regions labeled "More significant PubMed + GW" and "More significant PubMed only". Highlighted term: Muscle contraction.]
Good BM et al., BMC Genomics, 2011
22. Making the Gene Wiki more computable
[Diagram labels: Structured annotations, Free text, Analyses]
36. Utility: A simple and universal plugin interface
Total of > 540 gene-centric online databases registered as BioGPS plugins
37. Users: BioGPS has critical mass
• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
[Figure: daily pageviews.]
38. Contributors: Explicit and implicit knowledge
540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains
39. Gene Annotation Query as a Service
http://mygene.info
• High performance
• 3M hits/month
• Highly scalable
• 13k species
• 16M genes
• Weekly data updates
• JSON output
• REST interface
• Python/R/JS libraries
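To make the "REST interface" and "JSON output" bullets concrete, here is a small sketch that builds a query URL for the service and, optionally, fetches the JSON response. The base address comes from the slide; the `/v3/query` path and the `q`, `species`, and `fields` parameters are assumptions about the public service and should be checked against its documentation. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://mygene.info"  # service address shown on the slide


def build_query_url(term, species="human", fields="symbol,name,entrezgene"):
    """Build a gene query URL (path and parameter names are assumed)."""
    params = {"q": term, "species": species, "fields": fields}
    return f"{BASE_URL}/v3/query?{urlencode(params)}"


def fetch_json(url, timeout=10.0):
    """Send a GET request and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


url = build_query_url("symbol:CDK2")
print(url)
# To run the query against the live service:
# result = fetch_json(url)
```

Keeping URL construction separate from the network call makes the request easy to inspect and test offline.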
42. The biomedical literature is growing fast
[Figure: number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]
43. Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
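The three steps above can be sketched with a toy dictionary-based pipeline. This is only an illustration under strong simplifying assumptions: mentions are found by exact lexicon lookup, the ontology identifiers are made-up placeholders (the `EX:` prefix is not a real ontology), and "relationships" are reduced to co-occurrence within one sentence, which is a crude stand-in for real relation extraction.

```python
import re
from itertools import combinations

# Hypothetical lexicon: surface form -> placeholder ontology ID.
LEXICON = {
    "brca1": "EX:G001",
    "breast cancer": "EX:D001",
    "ovarian cancer": "EX:D002",
}


def find_mentions(sentence):
    """Step 1: locate lexicon terms, returning (start, end, text) tuples."""
    mentions = []
    for term in LEXICON:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            mentions.append((match.start(), match.end(), match.group(0)))
    return sorted(mentions)


def map_to_terms(mentions):
    """Step 2: normalize each mention to its ontology ID."""
    return [LEXICON[text.lower()] for _, _, text in mentions]


def cooccurrence_relations(term_ids):
    """Step 3 (stand-in): pair distinct concepts found in the same sentence."""
    return list(combinations(sorted(set(term_ids)), 2))


sentence = "Germline mutations in BRCA1 are responsible for most cases of inherited breast cancer."
mentions = find_mentions(sentence)
term_ids = map_to_terms(mentions)
relations = cooccurrence_relations(term_ids)
print(mentions, term_ids, relations)
```

Exact lookup misses abbreviations, spelling variants, and composite mentions such as "breast and ovarian cancer", which is part of why this task is hard to automate.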
44. Disease mentions in PubMed abstracts
NCBI Disease corpus
• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
45. Four types of disease mentions
Specific Disease:
• “Diastrophic dysplasia”
Disease Class:
• “Cancers”
Composite Mention:
• “prostatic , skin , and lung cancer”
Modifier:
• ..the “familial breast cancer” gene , BRCA2..
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
46. Question: Can a group of non-scientists
collectively perform concept recognition in
biomedical texts?
49
49. Amazon Mechanical Turk (AMT)
52
Requester
Amazon
For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task
Manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising
Workers
1. Create tasks
2. Execute
3. Aggregate
50. Instructions to workers
53
• Highlight all diseases and disease abbreviations
– “...are associated with Huntington disease ( HD )... HD patients received...”
– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
– “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans.
– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms - physical results of having a disease
– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
51. Qualification test
54
Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG )n
trinucleotide repeat expansion in the 3'-untranslated region of a protein
kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”
Test #2: “Germline mutations in BRCA1 are responsible for most cases of
inherited breast and ovarian cancer . However , the function of the BRCA1
protein has remained elusive . As a regulated secretory protein , BRCA1
appears to function by a mechanism not previously described for tumour
suppressor gene products.”
Test #3: “We report about Dr . Kniest , who first described the condition in
1952 , and his patient , who , at the age of 50 years is severely
handicapped with short stature , restricted joint mobility , and blindness but
is mentally alert and leads an active life . This is in accordance with
molecular findings in other patients with Kniest dysplasia and…”
26 yes / no questions
54. Experimental design
• Task: Identify the disease mentions in
the 593 abstracts from the NCBI disease
corpus
– $0.06 per Human Intelligence Task (HIT)
– HIT = annotate one abstract from PubMed
– 5 workers annotate each abstract
57
55. Aggregation function based on simple voting
58
[Figure: one example abstract shown four times, with the annotations kept when at least K of the 5 workers agree, for K = 1 (1 or more votes), K = 2, K = 3 and K = 4]
Example abstract:
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
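A minimal sketch of vote-based aggregation, assuming each worker's annotation is a list of (start, end) character spans and that two workers agree only when their spans match exactly. The slides do not say whether votes were counted per span or per token, so the exact-match rule, the function name, and the sample data are all illustrative assumptions.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, k):
    """Keep every span marked by at least k workers.

    worker_annotations: one collection of (start, end) spans per worker.
    Assumption: spans count as the same vote only if they match exactly.
    """
    votes = Counter()
    for spans in worker_annotations:
        # set() so a worker who marks the same span twice casts one vote
        votes.update(set(spans))
    return sorted(span for span, count in votes.items() if count >= k)


# Made-up example: 5 workers annotating one abstract.
workers = [
    [(10, 18), (40, 62)],
    [(10, 18), (40, 62)],
    [(10, 18)],
    [(10, 18), (70, 75)],
    [(40, 62)],
]

for k in range(1, 6):
    print(k, aggregate_by_votes(workers, k))
```

Raising K trades recall for precision: K = 1 keeps anything any worker marked, while higher thresholds keep only spans that several workers agree on.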
56. Comparison to gold standard
59
F = 0.81 at K = 2 votes, N = 5 workers per abstract
• 593 documents
• 7 days
• 17 workers
• $192.90
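The F score reported here is the harmonic mean of precision and recall against the gold standard. The sketch below shows one way to compute it, assuming annotations are (start, end) spans and that only exact matches count as correct; the matching rule, the function name, and the sample spans are assumptions for illustration, not the evaluation code used in the study.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F) for exact-match span comparison."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up spans: 2 of 3 predictions are correct, 2 of 4 gold spans are found.
predicted_spans = [(10, 18), (40, 62), (70, 75)]
gold_spans = [(10, 18), (40, 62), (90, 99), (100, 110)]
precision, recall, f_score = precision_recall_f(predicted_spans, gold_spans)
print(f"P = {precision:.2f}, R = {recall:.2f}, F = {f_score:.2f}")
```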
62. Comparisons to human annotators
65
Average level of agreement between expert annotators (stage 1): F = 0.76
63. Comparisons to human annotators
66
Stage 1: F = 0.76
Average level of agreement between expert annotators (stage 2): F = 0.87
64.
67
In aggregate, our worker ensemble is faster and cheaper than, and as accurate as, a single expert annotator for disease concept recognition.
65. Information Extraction
68
1. Find mentions of high-level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
66. Annotating the relationships
69
This molecule inhibits the growth of a broad
panel of cancer cell lines, and is particularly
efficacious in leukemia cells, including
orthotopic leukemia preclinical models as
well as in ex vivo acute myeloid leukemia
(AML) and chronic lymphocytic leukemia
(CLL) patient tumor samples. Thus, inhibition
of CDK9 may represent an interesting
approach as a cancer therapeutic target
especially in hematologic malignancies.
therapeutic target
subject
predicate
object
GENE
DISEASE
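One way to hold such an annotation is as a typed subject/predicate/object triple. The sketch below is only an illustration: the class and field names are invented here, and choosing CDK9 as the subject and "hematologic malignancies" as the object is an assumed reading of the example abstract rather than something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    text: str          # mention as it appears in the abstract
    concept_type: str  # e.g. "GENE" or "DISEASE"


@dataclass(frozen=True)
class Relation:
    subject: Concept
    predicate: str
    object_: Concept  # trailing underscore avoids shadowing the builtin

    def as_triple(self):
        """Return the relation as a plain (subject, predicate, object) tuple."""
        return (self.subject.text, self.predicate, self.object_.text)


# Assumed reading of the example abstract.
relation = Relation(
    subject=Concept("CDK9", "GENE"),
    predicate="therapeutic target",
    object_=Concept("hematologic malignancies", "DISEASE"),
)
print(relation.as_triple())
```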
69.
72
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizarro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Lynn Schriml, U Maryland
Paul Pavlidis, U British Columbia
Peipei Ping, UCLA
Many Wikipedia editors
WP:MCB Project
Group members
Katie Fisch
Karthik Gangavarapu
Louis Gioia
Ben Good
Salvatore Loguercio
Adam Mark
Max Nanis
Ginger Tseung
Chunlei Wu
Key group alumni
Adriel Carolino
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)
Citizen Science logo based on http://thenounproject.com/term/teamwork/39543/
70. Related AMT work
73
• [1] Zhai et al. (2013) used a similar protocol to tag medication names in clinical trial descriptions; F = 0.88 compared to the gold standard.
• [2] Burger et al. (2014) used microtask workers to identify relationships between genes and mutations.
• [3] Aroyo & Welty (2013) used workers to identify relations between concepts in medical text.
[1] Zhai H. et al. (2013) “Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing.” J Med Internet Res.
[2] Burger J. et al. (2014) “Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing.” MITRE technical report.
[3] Aroyo L., Welty C. (2013) “Harnessing disagreement in crowdsourcing a relation extraction gold standard.” Tech. Rep. RC25371 (WAT1304-058), IBM Research.
Editor's Notes
We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes: huge opportunity for biomedical discovery. No IEA.
If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
For each resource: briefly describe the unstructured resource; describe the structuring approach.
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization
Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
Developer resources do not scale with usage. Practical effects: core developers’ time is always the rate-limiting step; addition of new features and data always feels slow; eventually, new databases are created to fill the gap. 80% duplication for 20% innovation.
MODs and portals
Genetics resources
Literature resources
Protein resources
Pathway and expression databases
… but the amount of knowledge that is amenable to query and computation is tiny. We would like to have more efficient methods for information extraction.
F is the harmonic mean of precision and recall. Evaluated on the 593-abstract training corpus.
On 100 development data set
Phase 1: pairs of annotators work independently on computationally pre-annotated documents. Phase 2: annotators get to see each other’s annotations and then make changes. Phase 3: all remaining inconsistencies resolved collaboratively.