"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
"Spark, DeepLearning and Life Sciences, Systems Biology in the Big Data age" Dev Lakhani, Founder of Batch Insights
YouTube Link: https://www.youtube.com/watch?v=z6aTv0ZKndQ
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS
About the author:
Dev Lakhani has a background in Software Engineering and Computational Statistics and is a founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion. Dev has been actively working with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon communities.
VariantSpark: a library for genomics, by Lynn Langit (Data Con LA)
VariantSpark is a library for scalable genomic analysis that can process large genomic datasets containing millions of variants and thousands of samples. It uses machine learning techniques like k-means clustering and random forests for unsupervised and supervised analysis. VariantSpark can analyze whole-genome datasets faster than other methods and scales to process 100% of genomic data. It also integrates with cloud platforms like AWS and Databricks, making its capabilities easy to access and demo through Jupyter notebooks.
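The unsupervised step the abstract mentions (k-means over genotype calls) can be sketched in miniature. VariantSpark runs this on Spark over millions of variants; the toy version below is plain Python, and the sample genotypes (encoded 0/1/2 copies of the alternate allele), sample count, and cluster count are all invented for illustration.

```python
# Toy illustration (not VariantSpark itself): k-means clustering of samples
# by genotype, the kind of unsupervised analysis the abstract describes.
# Genotypes are encoded 0/1/2 (copies of the alternate allele); the data
# below is invented for the example.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each sample to its nearest centre (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Recompute each centre as the mean of its assigned samples.
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious groups of samples across 4 variants.
samples = [[0, 0, 2, 2], [0, 1, 2, 2], [2, 2, 0, 0], [2, 2, 0, 1]]
centers, clusters = kmeans(samples, k=2)
print([len(c) for c in clusters])  # two clusters of two samples each
```

At scale the same assign/recompute loop is what a distributed k-means parallelizes: assignment is embarrassingly parallel per sample, and centre updates are a per-cluster aggregation.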
Architecture of ContentMine Components (contentmine.org), by Peter Murray-Rust
This is the evolving architecture of ContentMine (contentmine.org). It includes an overview (slide 2) showing getpapers, quickscrape, norma and ami.
The key container is the CTree, and the architecture shows where components are added to it or transformed.
These slides are dated and may be out of date with respect to the code. Some diagrams are autogenerated from *.dot files.
Please use http://discuss.contentmine.org/c/software as the main source of up-to-date info. Feel free to ask questions, offer help, critique, etc.
All software is open source (BSD, Apache 2.0).
The document describes the Cassava Genome Hub, which provides big genomic data management and analysis resources for cassava. It discusses how the hub handles big data through its architecture and tools. The hub stores terabytes of cassava genomic, transcriptomic and other omics data. It provides tools like JBrowse, SNiPlay, GIGWA and Galaxy to enable visualization, exploration and analysis of the large datasets.
Proteomics and the "big data" trend: challenges and new possibilities (Talk ...), by Juan Antonio Vizcaino
The document discusses the challenges and opportunities of big data in proteomics. It describes how proteomics data volumes are growing rapidly due to technological advances, creating both computational challenges for data analysis and opportunities to reuse large amounts of public data. The PRIDE Archive at EBI stores over 4,000 proteomics datasets and provides tools like PRIDE Inspector to help analyze and validate large datasets. However, challenges remain around data standardization, metadata completeness, and the need for greater computational infrastructure and expertise to fully leverage the large amounts of shared proteomics data.
Data analysis & integration challenges in genomics, by Mikael Huss
Presentation given at the Genomics Today and Tomorrow event in Uppsala, Sweden, 19 March 2015. (http://connectuppsala.se/events/genomics-today-and-tomorrow/) Topics include APIs, "querying by data set", machine learning.
What is Reproducibility? The R* brouhaha (and how Research Objects can help), by Carole Goble
Presented at the 1st International Workshop on Reproducible Open Science @ TPDL, 9 September 2016, Hannover, Germany.
http://repscience2016.research-infrastructures.eu/
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...Dataconomy Media
"Spark, DeepLearning and Life Sciences, Systems Biology in the Big Data age" Dev Lakhani, Founder of Batch Insights
YouTube Link: https://www.youtube.com/watch?v=z6aTv0ZKndQ
Watch more from Data Natives 2015 here: http://bit.ly/1OVkK2J
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2016: http://bit.ly/1WMJAqS
About the author:
Dev Lakhani has a background in Software Engineering and Computational Statistics and is a founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion. Dev has been actively working with the Hadoop infrastructure since it’s inception and is currently researching and contributing to the Apache Spark and Tachyon community.
VariantSpark a library for genomics by Lynn LangitData Con LA
VariantSpark is a library for scalable genomic analysis that can process large genomic datasets containing millions of variants and thousands of samples. It uses machine learning techniques like k-means clustering and random forests for unsupervised and supervised analysis. VariantSpark can analyze whole genome datasets faster than other methods and scale to process 100% of genomic data. It also integrates with cloud platforms like AWS and Databricks for easy access and demo of its capabilities through Jupyter notebooks.
Architecture of ContentMine Components contentmine.orgpetermurrayrust
This is the evolving architecture of ContentMine (contentmine.org) architecture. It includes an overview ( slide #2, ) showing getpapers, quickscrape, norma and ami.
The key container is the CTree and the architecture shows where components are added or transformed to this.
These slides are dated and may be out-of-date wrt code. Some diagrams are autogenerated from *.dot files.
Please use http://discuss.contentmine.org/c/software as the main source of up-to-date info. Feel free to ask questions, offer help, critique, etc.
All s/w is Open (BSD, Apache2)
The document describes the Cassava Genome Hub, which provides big genomic data management and analysis resources for cassava. It discusses how the hub handles big data through its architecture and tools. The hub stores terabytes of cassava genomic, transcriptomic and other omics data. It provides tools like JBrowse, SNiPlay, GIGWA and Galaxy to enable visualization, exploration and analysis of the large datasets.
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Juan Antonio Vizcaino
The document discusses the challenges and opportunities of big data in proteomics. It describes how proteomics data volumes are growing rapidly due to technological advances, creating both computational challenges for data analysis and opportunities to reuse large amounts of public data. The PRIDE Archive at EBI stores over 4,000 proteomics datasets and provides tools like PRIDE Inspector to help analyze and validate large datasets. However, challenges remain around data standardization, metadata completeness, and the need for greater computational infrastructure and expertise to fully leverage the large amounts of shared proteomics data.
Data analysis & integration challenges in genomicsmikaelhuss
Presentation given at the Genomics Today and Tomorrow event in Uppsala, Sweden, 19 March 2015. (http://connectuppsala.se/events/genomics-today-and-tomorrow/) Topics include APIs, "querying by data set", machine learning.
What is Reproducibility? The R* brouhaha (and how Research Objects can help)Carole Goble
presented at 1st First International Workshop on Reproducible Open Science @ TPDL, 9 Sept 2016, Hannover, Germany
http://repscience2016.research-infrastructures.eu/
Presentation from the "Demystifying Big Data" Technical Conference (Universidad de La Laguna, Spain, June 2014).
Biomedical sciences rely on massive data sets. With machines capable of generating large amounts of data at low cost, science has entered the 'Big Data' era, making computational infrastructures essential to maintain, transfer and analyze all this information.
Spark Summit Europe: Share and analyse genomic data at scale, by Andy Petrella
Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook
- Sharp intro to genomics data
- What are the challenges
- Distributed machine learning to the rescue
- Projects: distributed teams
- Research: long process
- Towards maximum share for efficiency
Reproducibility of model-based results: standards, infrastructure, and recogn... (FAIRDOM)
Written and presented by Dagmar Waltemath (University of Rostock) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
Improving the Management of Computational Models -- Invited talk at the EBI, by Martin Scharm
Improving the Management of Computational Models:
storage – retrieval & ranking – version control
More information and slides to download at http://sems.uni-rostock.de/2013/12/martin-visits-the-ebi/
The document discusses the ISA infrastructure, which provides a generic format for experimental description and data exchange. The ISA infrastructure aims to support bio-scientists from experimental design to data publication. It does this through developing community standards, open source software tools, and engaging communities. The infrastructure provides a common framework to describe experiments in a way that allows data to flow between different systems and communities.
This document discusses challenges with the current scientific publishing system and proposes a vision for next generation scientific publishing (NGSP). Some key problems include retractions due to misconduct, lack of reproducibility, and non-reusable data and methods. NGSP would feature transparent and computable data and methods, open annotation of narratives and objects, and no restrictions on text mining or remixing. It would move information more quickly and allow verification through an open, service-oriented system without walled gardens. Taking NGSP forward will require collaboration across stakeholders in research communications.
Annotopia open annotation services platform, by Tim Clark
Annotopia is an open-access, open-source, open annotation services platform developed for scientific annotation of documents and datasets on the web using the W3C Open Annotation model http://www.openannotation.org/spec/core/.
Using Annotopia, virtually any client application including lightweight web clients, can create, selectively share, and access annotation of web documents and data. This can be done regardless of the ownership of the base objects being annotated.
Annotopia supports unstructured, semi-structured and fully-structured (semantic) annotation; manual and automated (textmining) annotation; permissions, groups, and sharing. It also provides access to specialized vocabulary and text analytics services.
Annotopia is an open source platform licensed under Apache 2.0.
The document discusses using microformats as an alternative to more complex semantic web standards to integrate existing biological web resources. It proposes hAction, a microformat for biology, that could hook together disparate biological resources more simply than existing options. A demo is shown as a proof of concept that microformats may provide a way to share biological data across the web without large overheads.
Being Reproducible: SSBSS Summer School 2017, by Carole Goble
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transfer between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Presentation from Strata-Hadoop 2015 (http://strataconf.com/big-data-conference-ny-2015/public/schedule/speaker/197575) -- a brief introduction to genomics followed by an overview of approaches to bioinformatics coding using Spark. Pretty high-level.
Aspects of Reproducibility in Earth Science, by Raul Palma
The document discusses aspects of reproducibility in earth science research within the European Virtual Environment for Research - Earth Science Themes (EVEREST) project. The key objectives of EVEREST are to establish an e-infrastructure to facilitate collaborative earth science research through shared data, models, and workflows. Research Objects (ROs) will be used to capture and share workflows, processes, and results to help ensure reproducibility and preservation of earth science research. An example RO is described for mapping volcano deformation using satellite imagery and other data sources. Issues around reproducibility related to data access, software dependencies, and manual intervention in workflows are also discussed.
The document summarizes updates and new features in the latest release (Araport11) of the Arabidopsis Information Portal (Araport). Key points include:
1) Araport assumed responsibility for the Arabidopsis thaliana Col-0 genome sequence and annotation.
2) The Araport11 release incorporates 113 RNA-seq datasets and contributions from NCBI, UniProt, and Arabidopsis researchers; structural and functional annotation were performed.
3) Araport provides a "one-stop shop" for Arabidopsis data including updated gene models, protein coding genes, transcripts, community curation tools, and over 70 tracks of data in JBrowse.
This document summarizes a presentation on analyzing microbial communities using QIIME (Quantitative Insights Into Microbial Ecology). It discusses how to [1] summarize taxonomy from an OTU table, [2] calculate beta diversity using UniFrac to compare communities, and [3] visualize diversity through emperor plots and networks. Additional analysis techniques like sampling design and network analysis are also briefly covered.
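The beta-diversity step in the QIIME summary can be illustrated in a self-contained way. UniFrac needs a phylogenetic tree, so this sketch instead computes Bray-Curtis dissimilarity, a tree-free beta-diversity metric QIIME also supports; the OTU table, sample names, and counts are invented for the example.

```python
# Illustration only: beta diversity from a toy OTU table. QIIME's UniFrac
# needs a phylogenetic tree, so this sketch uses Bray-Curtis dissimilarity,
# a tree-free beta-diversity metric. All counts below are invented.

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples of OTU counts (0..1)."""
    shared = sum(min(a, b) for a, b in zip(u, v))
    total = sum(u) + sum(v)
    return 1 - 2 * shared / total

# Rows = samples, columns = OTUs (counts per operational taxonomic unit).
otu_table = {
    "gut_1":  [10, 4, 0, 6],
    "gut_2":  [8, 5, 1, 6],
    "soil_1": [0, 1, 12, 2],
}

for a, b in [("gut_1", "gut_2"), ("gut_1", "soil_1")]:
    print(a, b, round(bray_curtis(otu_table[a], otu_table[b]), 3))
# The two gut samples are close (0.1); gut vs soil is far more dissimilar.
```

A full beta-diversity matrix of such pairwise values is what then feeds ordination plots like the Emperor visualizations mentioned above.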
Araport is a one-stop community platform for Arabidopsis thaliana data integration, sharing, and analysis. It contains gene reports, expression data, sequences, variants, and community-contributed data tracks and modules. Key features include the ThaleMine gene search and analysis tool, JBrowse genome browser with over 100 tracks, and regularly updated Araport11 genome annotation. The platform is built and maintained by a collaboration between academic institutions and is intended to support open data sharing across the Arabidopsis research community.
FAIRPORT: domain-specific metadata using W3C DCAT & SKOS with ontology views, by Tim Clark
FAIRPORT is an international project to develop a lightweight interoperability architecture for biomedical - and potentially other - data repositories.
This slide deck is a presentation to the FAIRPORT technical team. It describes a proposed model for supporting domain-specific search metadata using a common schema model across all repositories.
The proposal makes use of the following existing technologies, with minor extensions:
- the W3C DCAT model for dataset description
- the W3C SKOS knowledge organization system
- the OWL 2 Web Ontology Language
- the Dublin Core vocabulary
- the NCBO BioPortal biomedical ontologies collection
GBIF-Norway status for the 6th European GBIF nodes meeting, April 2014, by Dag Endresen
Slides prepared for the 6th European GBIF nodes meeting in Brussels. At the meeting these slides were replaced by a live online demo of the tools. Topics include citizen science transcription of specimen labels, persistent identifiers and custom collection portals. All slides are CC-BY.
This document provides a summary of the Scalable Genome Analysis with ADAM project. ADAM is an open-source, high-performance, distributed platform for genomic analysis that defines a data schema, data layout on disk, and programming interface for distributed processing of genomic data using Spark and Scala. The goal of ADAM is to integrate across terabyte and petabyte-scale datasets to enable the discovery of low frequency genetic variants linked to traits and diseases.
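A minimal sketch of the kind of cohort-wide query ADAM's schema is built for. ADAM itself is Spark/Scala over a Parquet-backed schema; the toy below is plain Python, and the record layout, genotype coding (0/1/2 copies of the alternate allele), cohort, and 5% cutoff are all invented for illustration.

```python
# Sketch of the kind of query ADAM's design enables: compute alternate-allele
# frequency per variant across a cohort and keep the low-frequency variants.
# (ADAM itself runs on Spark/Scala; records and the cutoff here are invented.)
from dataclasses import dataclass
from typing import List

@dataclass
class Variant:
    variant_id: str
    genotypes: List[int]  # one entry per sample, value in {0, 1, 2}

def allele_frequency(v: Variant) -> float:
    """Alternate-allele frequency: alt copies over total chromosomes."""
    return sum(v.genotypes) / (2 * len(v.genotypes))

def rare_variants(variants, max_freq=0.05):
    """Keep polymorphic variants whose alternate allele is at most max_freq."""
    return [v for v in variants if 0 < allele_frequency(v) <= max_freq]

cohort = [
    Variant("chr1:1012", [0] * 19 + [1]),  # freq 1/40 = 0.025 -> rare
    Variant("chr1:2044", [1] * 20),        # freq 20/40 = 0.5  -> common
    Variant("chr2:3310", [0] * 20),        # freq 0 -> monomorphic, dropped
]
print([v.variant_id for v in rare_variants(cohort)])  # ['chr1:1012']
```

In a distributed setting this is a per-variant map (frequency) followed by a filter, which is why a columnar, variant-oriented layout pays off at terabyte scale.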
eXframe: a Semantic Web Platform for Genomics Experiments, by Tim Clark
slides from talk given at Bio-ontologies 2013, Berlin DE, 20 July 2013
Emily Merrill*, Stephane Corlosquet*, Paolo Ciccarese†*, Tim Clark*†‡, Sudeshna Das†*
* Massachusetts General Hospital
† Harvard Medical School
‡ School of Computer Science, University of Manchester
eXframe: A Semantic Web Platform for Genomic Experiments, by Tim Clark
eXframe is a reusable framework for creating online repositories of genomics experiments. It uses Drupal to structure annotations of experiments, biomaterials, and assays. eXframe automatically publishes this data as RDF and provides a SPARQL endpoint. The first instance is the Stem Cell Commons, which deeply annotates experiments, organisms, tissues, and more using ontologies. It allows flexible querying of the data via SPARQL and integration with other endpoints. eXframe creates both public and private RDF stores to selectively share experimental data with researchers.
PMR metabolomics and transcriptomics database and its RESTful web APIs: A dat... (Araport)
The PMR database is a community resource for the deposition and analysis of metabolomics data and related transcriptomics data. PMR currently houses metabolomics data from over 25 species of eukaryotes. In this talk, we introduce PMR's RESTful web APIs for data sharing and demonstrate their applications in research, using Araport to provide Arabidopsis metabolomics data.
TranSMART: How open source software revolutionizes drug discovery through cro..., by Kees van Bochove
Presentation about the use of open source software in pharmaceutical companies at Global Discovery & Development Innovation Summit (GDDIS) in Princeton, NY, fall 2013.
Open Source Collaboration in Drug Discovery in Pharma, by Kees van Bochove
How pre-competitive collaboration in the pharmaceutical sector through open source platforms enables joint innovation of academics, pharma, SMEs and non-profits.
Presentation from the "Demystifying Big Data" Technical Conference (Universidad de La Laguna, Spain, June 2014).
Biomedical sciences rely on massive data sets. By using machines capable of generating large amounts of data with low cost, science has entered the 'Big Data' era, making computational infrastructures essential to maintain, transfer and analyze all this information.
Spark Summit Europe: Share and analyse genomic data at scaleAndy Petrella
Share and analyse genomic data
at scale with Spark, Adam, Tachyon & the Spark Notebook
Sharp intro to Genomics data
What are the Challenges
Distributed Machine Learning to the rescue
Projects: Distributed teams
Research: Long process
Towards Maximum Share for efficiency
Reproducibility of model-based results: standards, infrastructure, and recogn...FAIRDOM
Written and presented by Dagmar Waltemath (University of Rostock) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
Improving the Management of Computational Models -- Invited talk at the EBIMartin Scharm
Improving the Management of Computational Models:
storage – retrieval & ranking – version control
More information and slides to download at http://sems.uni-rostock.de/2013/12/martin-visits-the-ebi/
The document discusses the ISA infrastructure, which provides a generic format for experimental description and data exchange. The ISA infrastructure aims to support bio-scientists from experimental design to data publication. It does this through developing community standards, open source software tools, and engaging communities. The infrastructure provides a common framework to describe experiments in a way that allows data to flow between different systems and communities.
This document discusses challenges with the current scientific publishing system and proposes a vision for next generation scientific publishing (NGSP). Some key problems include retractions due to misconduct, lack of reproducibility, and non-reusable data and methods. NGSP would feature transparent and computable data and methods, open annotation of narratives and objects, and no restrictions on text mining or remixing. It would move information more quickly and allow verification through an open, service-oriented system without walled gardens. Taking NGSP forward will require collaboration across stakeholders in research communications.
Annotopia open annotation services platformTim Clark
Annotopia is an open-access, open-source, open annotation services platform developed for scientific annotation of documents and datasets on the web using the W3C Open Annotation model http://www.openannotation.org/spec/core/.
Using Annotopia, virtually any client application including lightweight web clients, can create, selectively share, and access annotation of web documents and data. This can be done regardless of the ownership of the base objects being annotated.
Annotopia supports unstructured, semi-structured and fully-structured (semantic) annotation; manual and automated (textmining) annotation; permissions, groups, and sharing. It also provides access to specialized vocabulary and text analytics services.
Annotopia is an open source platform licensed under Apache 2.0.
The document discusses using microformats as an alternative to more complex semantic web standards to integrate existing biological web resources. It proposes hAction, a microformat for biology, that could hook together disparate biological resources more simply than existing options. A demo is shown as a proof of concept that microformats may provide a way to share biological data across the web without large overheads.
Being Reproducible: SSBSS Summer School 2017Carole Goble
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is a R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield raising concerns of credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Presentation from Strata-Hadoop 2015 (http://strataconf.com/big-data-conference-ny-2015/public/schedule/speaker/197575) -- a brief introduction to genomics followed by an overview of approaches to bioinformatics coding using Spark. Pretty high-level.
Aspects of Reproducibility in Earth ScienceRaul Palma
The document discusses aspects of reproducibility in earth science research within the European Virtual Environment for Research - Earth Science Themes (EVEREST) project. The key objectives of EVEREST are to establish an e-infrastructure to facilitate collaborative earth science research through shared data, models, and workflows. Research Objects (ROs) will be used to capture and share workflows, processes, and results to help ensure reproducibility and preservation of earth science research. An example RO is described for mapping volcano deformation using satellite imagery and other data sources. Issues around reproducibility related to data access, software dependencies, and manual intervention in workflows are also discussed.
The document summarizes updates and new features in the latest release (Araport11) of the Arabidopsis Information Portal (Araport). Key points include:
1) Araport assumed responsibility for the Arabidopsis thaliana Col-0 genome sequence and annotation.
2) The Araport11 release incorporates 113 RNA-seq datasets, contributions from NCBI, UniProt, and Arabidopsis researchers. Structural and functional annotation were performed.
3) Araport provides a "one-stop shop" for Arabidopsis data including updated gene models, protein coding genes, transcripts, community curation tools, and over 70 tracks of data in JBrowse.
This document summarizes a presentation on analyzing microbial communities using QIIME (Quantitative Insights Into Microbial Ecology). It discusses how to [1] summarize taxonomy from an OTU table, [2] calculate beta diversity using UniFrac to compare communities, and [3] visualize diversity through emperor plots and networks. Additional analysis techniques like sampling design and network analysis are also briefly covered.
Araport is a one-stop community platform for Arabidopsis thaliana data integration, sharing, and analysis. It contains gene reports, expression data, sequences, variants, and community-contributed data tracks and modules. Key features include the ThaleMine gene search and analysis tool, JBrowse genome browser with over 100 tracks, and regularly updated Araport11 genome annotation. The platform is built and maintained by a collaboration between academic institutions and is intended to support open data sharing across the Arabidopsis research community.
Fairport domain specific metadata using w3 c dcat & skos w ontology viewsTim Clark
FAIRPORT is an international project to develop a lightweight interoperability architecture for biomedical - and potentially other - data repositories.
This slide deck is a presentation to the FAIRPORT technical team. It describes a proposed model for supporting domain-specific search metadata using a common schema model across all repositories.
The proposal makes use of the following existing technologies, with minor extensions:
- the W3C DCAT model for dataset description
- the W3C SKOS knowledge organization system
- OWL2 Ontology Language
- Dublin Core Vocabulary
- NCBO Bioportal biomedical ontologies collection
GBIF-Norway status for the 6th European GBIF nodes meeting April 2014Dag Endresen
Slides prepared for the 6th European GBIF nodes meeting in Brussels. At the meeting these slides was replaced by a live online demo of these tools. Topics include citizen science transcription of specimen labels, persistent identifiers and custom collection portals. All slides are CC-by.
This document provides a summary of the Scalable Genome Analysis with ADAM project. ADAM is an open-source, high-performance, distributed platform for genomic analysis that defines a data schema, data layout on disk, and programming interface for distributed processing of genomic data using Spark and Scala. The goal of ADAM is to integrate across terabyte and petabyte-scale datasets to enable the discovery of low frequency genetic variants linked to traits and diseases.
exFrame: a Semantic Web Platform for Genomics ExperimentsTim Clark
slides from talk given at Bio-ontologies 2013, Berlin DE, 20 July 2013
Emily Merrill*, Stephane Corlosquet*, Paolo Ciccarese†*, Tim Clark*†‡, Sudeshna Das†*
* Massachusetts General Hospital
† Harvard Medical School
‡ School of Computer Science, University of Manchester
eXframe: A Semantic Web Platform for Genomic ExperimentsTim Clark
eXframe is a reusable framework for creating online repositories of genomics experiments. It uses Drupal to structure annotations of experiments, biomaterials, and assays. eXframe automatically publishes this data as RDF and provides a SPARQL endpoint. The first instance is the Stem Cell Commons, which deeply annotates experiments, organisms, tissues, and more using ontologies. It allows flexible querying of the data via SPARQL and integration with other endpoints. eXframe creates both public and private RDF stores to selectively share experimental data with researchers.
PMR metabolomics and transcriptomics database and its RESTful web APIs: A dat...Araport
PMR database is a community resource for deposition and analysis of metabolomics data and related transcriptomics data. PMR currently houses metabolomics data from over 25 species of eukaryotes. In this talk, we introduce PMRs RESTful web APIs for data sharing, and demonstrate its applications in research using Araport to provide Arabidopsis metabolomics data.
TranSMART: How open source software revolutionizes drug discovery through cro...keesvb
Presentation about the use of open source software in pharmaceutical companies at Global Discovery & Development Innovation Summit (GDDIS) in Princeton, NY, fall 2013.
Open Source Collaboration in Drug Discovery in PharmaKees van Bochove
How pre-competitive collaboration in the pharmaceutical sector through open source platforms enables joint innovation of academics, pharma, SMEs and non-profits.
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ..., by Bonnie Hurwitz
The document discusses extending the iPlant cyberinfrastructure to support microbes in addition to plants. It provides an overview of iPlant, including its funding from NSF, collaborations, resources like data storage and computing platforms, and applications for analysis. Future plans are outlined to build tools and streamline workflows for metagenomics and enable high-throughput computing for microbial data.
The pulse of cloud computing with bioinformatics as an example, by Enis Afgan
The document discusses how cloud computing can enable large-scale genomic analysis by providing on-demand access to computational resources and petabytes of reference data. It describes how tools like Galaxy and CloudMan allow researchers to perform genomic analysis in the cloud through a web browser by automating the provisioning and configuration of cloud resources. This approach makes genomic research more accessible and enables the elastic scaling of analysis as needed.
The Institut de Biologie Paris Seine provides bioinformatics support and expertise using the Galaxy platform. They assist users with computational analyses of sequencing data and ensure transparency and reproducibility. They also provide training in Galaxy usage, conduct research in RNA biology and epigenetics, and follow advances in software and methods. The institute has expertise in programming languages, version control systems, and virtualization/container technologies. They have supported many Next Generation Sequencing analysis projects and developed publicly available tools.
Cool Informatics Tools and Services for Biomedical Research, by David Ruau
This document provides an overview of bioinformatics tools and services for analyzing big data in biomedical research. It discusses traditional bioinformatics tools, analyzing genomic data from microarrays and next-generation sequencing both with and without code, interpreting results using protein interaction networks and pathways, tools for data storage, cleaning and visualization, and making research reproducible. Galaxy, R, and programming are presented as useful for automated, reproducible analysis of large genomic datasets.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
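The digital normalization mentioned in point 1 can be sketched in a few lines of Python. This is a simplified illustration of the general technique (discard a read when the median abundance of its k-mers, counted so far, already meets a coverage cutoff), not the authors' actual implementation; the k-mer size and cutoff are arbitrary toy values.

```python
from collections import Counter

def kmers(read, k=4):
    """Return all overlapping k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only if the median abundance of its k-mers,
    as counted over the reads kept so far, is below the cutoff.
    Redundant reads are discarded while novel sequence is retained."""
    counts = Counter()
    kept = []
    for read in reads:
        km = kmers(read, k)
        if not km:
            continue
        abundances = sorted(counts[x] for x in km)
        median = abundances[len(abundances) // 2]
        if median < cutoff:
            kept.append(read)
            counts.update(km)
    return kept
```

On a stream of five identical reads plus one novel read, only the first two copies and the novel read survive: the data shrinks while the distinct sequence content is preserved.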
IDB-Cloud Providing Bioinformatics Services on Cloud - stratuslab
A presentation of IDB (Infrastructure Distributed for Biology) using StratusLab technology by Christophe Blanchet and Clément Gauthey at Lille, France, May 2013.
The suite of free software tools created within the OpenCB (Open Computational Biology – https://github.com/opencb) initiative makes it possible to efficiently manage large genomic databases.
These tools are not yet widely used, since their adoption involves quite a steep learning curve owing to the complexity of the software stack, but they can be very cost-effective for hospitals, research institutions and similar organizations.
The objective of the talk is to show the potential of the OpenCB suite, how to start using it, and the advantages for end users. BioDec is currently deploying a large OpenCGA installation for the Genetic Unit of one of the main Italian hospitals, where data on the order of hundreds of TBs will be managed and analyzed by bioinformaticians.
Scaling People, Not Just Systems, to Take On Big Data Challenges - Matthew Vaughn
Here, I describe how the Texas Advanced Computing Center has shifted its focus from traditional modeling and simulation towards fully embracing big data analytics performed by users with diverse technical backgrounds.
Collaborations in the Extreme: The rise of open code development in the scientific community - Kelle Cruz
Video: https://www.simonsfoundation.org/event/collaborations-in-the-extreme-the-rise-of-open-code-development-in-the-scientific-community/
The internet is changing the scientific landscape by fostering international, interdisciplinary and collaborative software development. More than ever before, software is a crucial component of any scientific result. The ability to easily share code is reshaping expectations about reproducibility -- a fundamental tenet of the scientific process. In this lecture, Kelle Cruz will briefly provide the backstory of how these shifts have come about, describe some of the most impactful open source projects, and discuss efforts currently underway aimed at ensuring these community-led projects are sustainable and receive support.
Data-intensive applications on cloud computing resources: Applications in lif... - Ola Spjuth
Presentation at the de.NBI 2017 symposium “The Future Development of Bioinformatics in Germany and Europe” held at the Center for Interdisciplinary Research (ZiF) of Bielefeld University, October 23-25, 2017.
https://www.denbi.de/symposium2017
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA, European Open Science Agenda - BigData_Europe
Slides for keynote talk at the Big Data Europe workshop nr 3 on 11.9.2017 in Amsterdam co-located with SEMANTiCS2017 conference by Ron Dekker, Director CESSDA: European Open Science Agenda: where we are and where we are going?
- The Broad Institute is a non-profit biomedical research institute founded in 2004 with 50 core faculty members from Harvard and MIT and over 1000 research personnel.
- It focuses on specific disease areas through various programs and initiatives and technological innovation through several platforms.
- In order to take advantage of cloud technologies, organizations need to fundamentally change how they engage with technology and technologists to collaborate effectively in this new environment.
Provenance for Data Munging Environments - Paul Groth
Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e. its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
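The fine-grained capture discussed in the abstract can be illustrated with a minimal, hypothetical provenance log in Python: each munging step records a content fingerprint of its input and output, so a cleaned dataset can be traced back through the operations that produced it. All names here are illustrative sketches, not the system described in the talk.

```python
import hashlib
import json
import time

def fingerprint(obj):
    """Content hash used to identify a dataset version."""
    payload = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

class ProvenanceLog:
    """Records, for every munging step, which input produced which output."""
    def __init__(self):
        self.records = []

    def apply(self, name, func, data):
        result = func(data)
        self.records.append({
            "operation": name,
            "input": fingerprint(data),
            "output": fingerprint(result),
            "timestamp": time.time(),
        })
        return result

# Two toy munging steps: drop rows with missing values, then cast types.
log = ProvenanceLog()
raw = [{"age": "42"}, {"age": None}, {"age": "17"}]
cleaned = log.apply("drop_missing",
                    lambda rows: [r for r in rows if r["age"] is not None], raw)
typed = log.apply("cast_age_int",
                  lambda rows: [{"age": int(r["age"])} for r in rows], cleaned)
```

Because the output fingerprint of one step equals the input fingerprint of the next, the records chain together into a lineage for the final dataset.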
This document discusses using cloud computing for bioinformatics. It begins by defining cloud computing and describing its key characteristics like on-demand access to computing resources and rapid elasticity. It then discusses different cloud delivery models like Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). The document provides examples of public cloud providers for each delivery model. It also introduces tools like CloudBridge that help make applications cloud-independent and CloudLaunch, a portal for deploying cloud-enabled bioinformatics applications. Finally, it briefly discusses how these tools and cloud resources can help improve bioinformatics workflows by providing scalable infrastructure for processing large genomic datasets.
This document provides an overview and agenda for a GenomeSpace workshop. It introduces GenomeSpace as an online community for sharing diverse computational genomics tools. The document reviews several popular tools integrated with GenomeSpace, including Cytoscape, Galaxy, Genomica, and GenePattern. It also outlines basic recipes for using GenomeSpace, such as uploading data, launching tools, and transferring data between tools. The workshop aims to demonstrate the GenomeSpace user interface and provide hands-on experience with key tools and integrative analysis workflows.
Big data solution for NGS data analysis - Yun Lung Li
This document outlines a presentation on big data solutions for NGS data analysis using software containerization and distributed analytics. It discusses Docker containerization and its use at Atgenomix to simplify cluster environments. It also covers NGS genome analysis techniques like read mapping, variant calling, and using Spark and Hadoop for data parallelization. Elasticsearch is introduced for distributed, RESTful search and analytics of variant data.
This document summarizes bioinformatics tools that can be used for analysis of high-throughput sequencing data for molecular diagnostics. It discusses databases for virulence factors and antimicrobial resistance as well as tools for assembly, annotation, pan-genome analysis, visualization, and commercial solutions. The presentation emphasizes that there is no single best tool and different approaches are needed for different questions. Collaboration with other researchers is recommended.
Similar to tranSMART Community Meeting 5-7 Nov 13 - Session 1: Chilly-Mazarin Meeting Objectives (20)
tranSMART Community Meeting 5-7 Nov 13 - Session 3: The TraIT user stories fo... - David Peyruc
This document provides an overview of the TraIT project and existing demonstrators using tranSMART. It discusses the TraIT roadmap and user stories being implemented at the Netherlands Cancer Institute. Key points include:
- TraIT aims to support translational research through integrated data and tools across clinical, imaging, biobanking and experimental domains.
- Existing demonstrators using tranSMART include DeCoDe (colorectal cancer) and PCMM (prostate cancer).
- The roadmap involves enhancing tranSMART functionality based on user needs and integrating additional data sources.
- At NKI, tranSMART will provide an integrated research data warehouse with clinical and research data from various sources and departments.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Characterization of the cell phenotypes involved in metastasis - David Peyruc
Characterization of the cell phenotypes involved in metastasis: Using tranSMART to enable high-throughput heterogeneous data integration and analysis
Brian Athey, University of Michigan
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analytical Capabilities with Knowledge Content - David Peyruc
Sirimon Ocharoen, Thomson Reuters
To analyze data in tranSMART effectively, a biological knowledge-based approach is needed. Through a case study, we will demonstrate how systems biology content can be integrated into tranSMART to enable functional analysis and biological interpretation. We will also share our experience and user feedback from various projects.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Recent tranSMART Lessons Learned in Academic and Life Science Settings - David Peyruc
Dan Housman, Recombinant by Deloitte
The Recombinant by Deloitte team has worked with organizations such as Kimmel Cancer Center as a model to adapt existing mature i2b2 implementations to meet business and scientific needs. Other organizations are increasingly focused on how to use cloud and high-performance computing models to achieve different performance levels. Advanced initiatives are progressing to link commercial tools such as Qlikview to explore tranSMART data and to close key gaps in scientific pipelines. Dan will present recent lessons learned, new capabilities, and some of the impact on the path forward for future tranSMART updates.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: EMIF (European Medical Information Framework) - David Peyruc
The document discusses the European Medical Information Framework (EMIF) project. EMIF aims to create a platform and framework to integrate patient-level health data from across Europe to enable new research insights. Specifically, EMIF is developing tools and standards to pool data from various sources on over 48 million subjects from 7 EU countries. This will support research on predictors of metabolic diseases and Alzheimer's disease. EMIF is using the tranSMART platform to load clinical trial data and cohorts on over 33,000 subjects for analysis. The goal is for EMIF to become a trusted European hub for healthcare data to optimize clinical research.
tranSMART Community Meeting 5-7 Nov 13 - Session 5: The Accelerated Cure Project MS Repository Dataset as a Case Study - David Peyruc
Stephen Wicks, Rancho Biosciences
The Accelerated Cure Project for Multiple Sclerosis is a non-profit focused on accelerating research for a cure for MS. One of their major projects over the last decade has been the generation of the ACP Repository, a collection of biological samples and associated clinical data from approximately 3200 case or control participants. More than 75 studies are underway or have been completed, in both industry and academic settings, using samples from the ACP Repository. Rancho BioSciences has partnered with ACP through Orion Bionetworks to curate and load these datasets and associated clinical CRFs into tranSMART. In this talk, we will describe the rich ACP dataset and discuss our experiences in preparing the data for analysis in tranSMART.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Modularization (Plug‐Ins,... - David Peyruc
The document discusses the development of new plugins for the TranSMART platform to add genomic visualization capabilities. It describes requirements like adding an HTML5 genome browser and supporting visualization of genomic variants and copy number variation data. It then details the process of consulting the community to choose the Dalliance genome browser and MyDAS backend, and extending the core API to support these plugins. The plugins were implemented and added to TranSMART to provide the new genomic visualization features.
The document outlines the key roles and values of a foundation to support the TRANSMART platform including:
- Stimulating awareness of project activities, functionalities, and data standards through communications
- Coordinating data curation and identifying opportunities for collaboration or common interest data sets
- Providing an app store for translational research plugins with various pricing models
- Ensuring quality, education, and training
It proposes establishing working groups and hiring a full-time community manager to address issues like lack of data transparency, siloed development, and ineffective project communications. The manager would facilitate engagement, updates, and synergies across stakeholders.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Pfizer’s Recent Use of tranSMART - David Peyruc
The document summarizes Pfizer's use of the tranSMART platform for various genomics and clinical data analyses including genome-wide association studies (GWAS), supporting exploratory data types like metabolomics and FACS data, and large collaborative efforts like the Alzheimer's Disease Neuroimaging Initiative (ADNI) and Parkinson's Progression Markers Initiative (PPMI) datasets. It also discusses analytical integration with Genedata Expressionist and plans for future enhancements to tranSMART like improved GWAS support and additional genotype data. Contributors to these efforts are acknowledged.
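Since several of these efforts revolve around VCF-formatted genotype data, a minimal sketch of parsing one VCF data record may be helpful. It follows the standard VCF column layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); it is a simplified illustration, not tranSMART's actual loader.

```python
def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into a variant record."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    # INFO is a semicolon-separated list of key=value pairs or bare flags.
    info_map = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_map[key] = value if value else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),          # ALT may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_map,
    }
```

A real loader would also handle the header lines (starting with `#`), per-sample genotype columns, and the `.` missing-value convention for every field.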
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART and the One Mind for Research Data Exchange Portal - David Peyruc
Jeff Grethe, One Mind for Research
One Mind for Research (http://1mind4research.org) is an independent, non-partisan, nonprofit organization dedicated to curing the diseases of the brain and eliminating the stigma and discrimination associated with mental illness and brain injuries. tranSMART will be a core application within the One Mind Brain Data Exchange Portal, scheduled to launch publicly in 2014. Traumatic Brain Injury (TBI) affects an estimated 10 million people worldwide, and tranSMART is one of the core applications within the portal used by researchers who are looking to improve diagnostics and discover more effective treatments for patients suffering from CNS- and TBI-related diseases.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART a Data Warehouse for Translational Medicine at Takeda Pharmaceuticals International - David Peyruc
Dave Marberg, Takeda
We have used the tranSMART platform to construct a warehouse containing data from several Takeda clinical trials, proprietary preclinical drug activity studies, 1600 Gene Expression Omnibus studies, and data from TCGA, CCLE, and other sources. All gene expression data has been globally normalized. We extended the tranSMART platform with a set of R function calls to enable cross-study queries and analysis via the rich toolset available in R. The utility of the data warehouse is exemplified by a study in which we built a predictive model for drug sensitivities. The model was trained on gene expression and IC50 data from cell lines and was found to correctly predict drug activity in oncology indications.
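The cross-study queries described above were implemented as R function calls against the warehouse. As a language-neutral illustration, here is a hypothetical Python sketch of what such a query does conceptually: collect comparable values for one gene across studies, assuming each study maps samples to expression profiles and that values are already globally normalized. The data structures and names are invented for this sketch.

```python
def cross_study_expression(studies, gene):
    """Collect normalized expression values for one gene across studies.
    `studies` maps study name -> {sample -> {gene -> value}}; values are
    assumed globally normalized, so they are comparable across studies."""
    rows = []
    for study_name, samples in studies.items():
        for sample, profile in samples.items():
            if gene in profile:
                rows.append((study_name, sample, profile[gene]))
    return rows

# Toy warehouse: one clinical trial plus one public GEO-style study.
studies = {
    "trial_A": {"s1": {"TP53": 2.1, "EGFR": 0.4}, "s2": {"TP53": 1.8}},
    "GEO_GSE_X": {"g1": {"TP53": 2.4}},
}
```

The result is a flat table of (study, sample, value) rows, which is the shape a downstream model, such as the drug-sensitivity predictor mentioned above, would train on.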
tranSMART Community Meeting 5-7 Nov 13 - Session 3: tranSMART’s Application to Clinical Biomarker Discovery Studies in Sanofi - David Peyruc
Sherry Cao, Sanofi
This presentation will discuss challenges we are encountering in clinical biomarker discovery studies and how we are using tranSMART to help address them.
tranSMART Community Meeting 5-7 Nov 13 - Session 3: Simulation in tranSMART - David Peyruc
Dave King gave a presentation on November 6th, 2013 about interactive visualization with tranSMART. It explained that tranSMART allows for modular and abstracted visualization through application programming interfaces, improving connectivity, and concluded that these features of tranSMART are important for interactive data visualization.
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Developing a Translational Research Community around the tranSMART Platform - David Peyruc
Keith Elliston, tranSMART Foundation
tranSMART Community Meeting 5-7 Nov 13 - Session 2: Herding Cats - David Peyruc
This document discusses managing open source communities and projects. It notes that open source communities involve not just developers but also users, installers, documentation writers, and support staff. Contributions come from new code, bug fixes, documentation, training materials, and feature requests. Projects need coordination, communication through mailing lists and meetings, and quality assurance through testing. Both incentives like acknowledging contributions and treats like involvement opportunities help encourage participation and "herd the cats" of an open source community.
tranSMART Community Meeting 5-7 Nov 13 - Session 2: MongoDB: What, Why And When - David Peyruc
Massimo Brignoli, MongoDB Inc
The presentation will illustrate what MongoDB is, the advantages of the document based approach and some of the use cases where MongoDB is a perfect fit.
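The document-based approach the presentation advocates can be shown with a small sketch: one self-contained document holds what a relational schema would spread over several joined tables. The `matches` helper below is only an in-memory stand-in for a MongoDB `find()` filter on top-level keys, not the real driver API; field names are invented for illustration.

```python
# One self-contained document replaces several joined relational rows:
# the patient, the diagnosis, and the visit history all live together.
patient = {
    "_id": "P-001",
    "diagnosis": "T2D",
    "visits": [
        {"date": "2013-01-10", "hba1c": 7.9},
        {"date": "2013-06-02", "hba1c": 7.1},
    ],
}

def matches(doc, query):
    """Tiny in-memory stand-in for a find() filter on top-level keys."""
    return all(doc.get(k) == v for k, v in query.items())
```

In a real deployment the same shape of query-by-example dict would be passed to the database driver; the point here is only that reads and writes operate on whole documents rather than reassembling rows.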
3. • Microarray data analysis support
• Load public microarray data from GEO
• Store and retrieve saved analyses
• Search on gene name, disease name etc.
• Genomic variants and VCF support
• Load TCGA studies we have access to
• Load 1000 Genomes data
9. Not Invented Here Syndrome
Image from Rob Hooft, CTO Netherlands Bioinformatics Centre
http://nothinkingbeyondthispoint.blogspot.nl/2011/11/decision-tree-for-scientific.html
13. Phenotype Database
Written in Grails, supports several types of omics data, provides data integration and visualization, and has R, Groovy and PHP APIs.
Sounds familiar?
http://phenotypefoundation.org
16. So far…
• TranSMART has huge business potential. It’s no silver bullet though.
• Scientists sometimes have trouble reusing each other’s work, especially when it comes to open source software.
21. Governance of the R community
Brian Ripley: “The R Project is governed by a self-perpetuating oligarchy, a group with a lot of power. R was principally developed for the benefit of the core team.”
As cited on http://blog.revolutionanalytics.com/2011/08/brian-ripley-onthe-r-development-process.html
23. Galaxy is the most widely used open source bioinformatics web interface AFAIK.
Probably in no small amount thanks to their continuous dedication to improving the UI.
But there’s something else.
25. • An open source CMS (Content Management System) written in Python, nowadays backing thousands of production-grade websites
• Started by 2 developers in 2000, now an active open source project with hundreds of active developers
• In 2004, the Plone Foundation was formed to formalize IP and secure the future of Plone
• The Plone Collective has hundreds of plugins
27. What do all these success stories have in common?
Bioconductor Packages
Galaxy Toolshed
Plone Collective
Drupal Modules
30. TranSMART Contributions - Pharma
• Janssen
– Initial version of tranSMART
– Genomics viewer using IGV and GenePattern
– Faceted search interface (results browsing)
• Millennium
– Loading TCGA and many GEO studies
– R interface for interacting with data directly in R
– Several R analyses available directly in the GUI
31. TranSMART Contributions - Pharma
• Sanofi
– Cleaner user interface
– Added metadata layer for all concepts
– Study/Program categorization & file management
• Pfizer
– GWAS upload (VCF), data storage and analysis
– Enhanced data export capabilities
33. This is a mess.
Another reason why we need that core.
34. Start the Core: I2B2 Refactoring
1. I2B2 was integrated with tranSMART, but the I2B2 API abstractions were leaked all over the place in the tranSMART application.
2. We agreed in the London meeting that all parties would set some time apart for working on the core.
3. Combined, it made sense to start working on the clinical data API, properly using the I2B2 API where possible, and re-implement all I2B2 functionality in a new ‘core-db’ plugin.
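The abstraction leak described in point 1, and the core-db fix in point 3, can be sketched as a thin adapter layer: callers talk only to a core API, and only the backend knows about i2b2 details. The class and method names here are hypothetical illustrations, not the actual tranSMART core API.

```python
class ClinicalDataResource:
    """Hypothetical core API: callers see concepts and observations,
    never the backing i2b2 web-service details."""

    def __init__(self, backend):
        self._backend = backend  # the only object that knows about i2b2

    def observations_for(self, concept_path):
        # Translate the core-level concept path into whatever the backend
        # understands, and normalize the raw result into plain records.
        raw = self._backend.query(concept_path)
        return [{"patient": patient, "value": value} for patient, value in raw]


class FakeI2b2Backend:
    """Stand-in for the i2b2 CRC cell; a real backend would issue
    XML web-service calls here instead of returning canned data."""

    def query(self, path):
        return [("patient-1", 7.9), ("patient-2", 7.1)]
```

Because callers depend only on `ClinicalDataResource`, the i2b2 backend can be versioned, swapped, or re-implemented (as core-db did) without touching application code, and the fake backend makes the layer trivially testable.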
35. The first version of core integration was completed in mid-April.
By then, all web service calls to what formerly was an outdated version of the I2B2 Ontology and CRC cells were handled by the newly implemented core-db plugin.
Also, a set of tests was written in the process and API documentation was generated.
36. In the long run, I believe forming a good distributed working group on the core API is a more important deliverable of this workshop than crunching out a stable 1.1 version.
That’s how we write that history.
39. TranSMART’s Strong Points
• Powerful, ready to go user interface for common analyses (survival analysis, gene expression heatmaps etc.)
• Leverages the i2b2 data model for clinical data and offers a unified view over different studies
• Uses a lot of good open source technology under the hood (Grails, R, SOLR, Pentaho), leveraging existing community developments
40. TranSMART Building Blocks
• R: open source statistics package with CRAN,
an active repository in which many algorithms
and statistical packages are published
• Grails: a rapid application development
framework in Groovy leveraging Java
technology such as Hibernate, Spring, Quartz
• I2b2: domain specific open source package for
storing and querying clinical data
• GenePattern, maybe soon: Galaxy, KNIME?
41. TranSMART’s Weaknesses
• Large monolithic codebase with little modularization beyond the standard Grails MVC setup
• Code quality is problematic, especially the JavaScript
• Test coverage is low: no functional / web tests and few unit and integration tests
• No clear internal APIs, only a service level that does the plumbing
• I2b2 integration violates i2b2 abstractions
42. tranSMART Plans
• Use a clearly modularized architecture with separation of clinical, high dimensional, search and metadata storage; workflow execution engines and knowledge repository
• Define clear APIs and rewrite current implementations with good test coverage
• Use the i2b2 data model, re-harmonize with the latest i2b2 APIs, and don’t use i2b2 binaries directly
• Separate analysis definitions and abstract from the workflow execution engine
http://prezi.com/t6twshyctdsk/transmart-core-refactoring
44. Further reading
• Description of core API efforts: http://thehyve.nl/rewiring-transmart
• In-depth description of the i2b2 refactoring: http://thehyve.nl/inital-work-on-transmarts-core
• Overview of the tranSMART Core API so far: http://thehyve.github.io/transmart-core-api/
• Example continuous integration test suite (of core-db): https://ci.ctmmtrait.nl/browse/TMCOREDB-JOB1-51/test