The GenePattern Notebook Environment for Open Science and Reproducible Research, presentation by Michael Reich, UCSD, at BioITWorld West, San Francisco, March 11, 2019
2. Jill P. Mesirov Lab
• New statistical and machine learning techniques for the genomic analysis of cancer and other diseases
• Gene Set Enrichment Analysis (GSEA)
• Software and reproducible research environments
• Integrative Genomics Viewer (IGV)
• GenePattern
4. The Notebook Environment Explosion
• Jupyter
• Beaker
• Zeppelin
• CoCalc (SageMath)
• MS Azure Notebooks
• IBM PixieDust
• Google CoLaboratory
• Google Cloud DataLab
5. Published Jupyter Notebooks
Books or other large collections
• Introductory Tutorials
• Programming and Computer Science
• Statistics, Machine Learning and Data Science
• Mathematics, Physics, Chemistry, Biology
• Earth Science and Geo-Spatial data
• Linguistics and Text Mining
• Signal Processing
• Engineering Education
Scientific computing and data analysis
• General topics in scientific computing
• Social data
• Psychology and Neuroscience
• Machine Learning, Statistics and Probability
• Physics, Chemistry and Biology
• Economics and Finance
• Earth science and geo-spatial data
• Data visualization and plotting
• Mathematics
• Signal, Sound and Image Processing
• Natural Language Processing
https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
7. Notebook Benefits
• Combines scientific narrative with executable code
• No space limitation
• Include process as well as results
• Provides a reproducible* record of analysis
• Inception-to-publication environment for science
• Highly-detailed tutorials
8. Notebook environments require programming expertise
• Python, R, or other supported languages
• Analysis packages: scikit-learn, etc.
• Plotting packages: matplotlib, plotly, etc.
12. What is needed for bioinformatics notebooks
• Access to the wealth of existing analysis tools
• Seamless flow of datasets between cells
• Simplified user interface to code
• Biologist-friendly text formatting tools
• Easy publication and dissemination of new work and adoption of others’ work
13. The GenePattern Notebook Environment
for collaborative and reproducible in silico research
• Shareable, executable notebook documents
• Support in silico research from inception through publication
• Seamless integration of GenePattern with Jupyter notebook
• No programming required
• Access to compute resources
www.genepattern-notebook.org
Reich, M., Tabor, T., Liefeld, T., Thorvaldsdóttir, H., Hill, B., Tamayo, P., & Mesirov, J. P. (2017). The GenePattern Notebook Environment. Cell Systems, 5(2), 149-151.
14. Access to analysis tools: GenePattern
• Module Integrator
– Easy addition of new tools
• Analysis Engine
– Record/replay analyses
– Versioning of methods
– Web service access
• Clients: Programming, Web, Notebook
– Access for all users
• Module Repository: GSEA, MuTect, HISAT2
– Hundreds of analysis and visualization tools
www.genepattern.org
15. GenePattern wraps software tools in an accessible visible format
Standard “command-line” method for running analysis:
> java -Djava.awt.headless=true -Dwin=cluster.exe -Dmac=clusterMac -Dlinux=clusterLinux -Dlinux64=clusterLinux64 -cp hcl.jar/legacy-gp-modules.jar/ant.jar org.genepattern.modules.hcl.RunCluster -f input.filename log.transform row.center row.normalize column.center column.normalize -u output.base.name -e column.distance.measure -g row.distance.measure -m clustering.method
Corresponding GenePattern visual representation (shown on the slide)
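The idea of wrapping a command line behind named parameters can be sketched in a few lines of Python. This is an illustration only, not GenePattern's implementation: the angle-bracket placeholder syntax, the shortened classpath, the `build_command` helper, and the sample values are all assumptions. Only the parameter names (`input.filename`, `output.base.name`, `clustering.method`) come from the slide.

```python
import shlex

# Hypothetical, shortened command template. Each <name> placeholder is
# filled from a user-supplied parameter value (e.g. from a form field).
TEMPLATE = (
    "java -cp hcl.jar org.genepattern.modules.hcl.RunCluster "
    "-f <input.filename> -u <output.base.name> -m <clustering.method>"
)


def build_command(template, params):
    """Return an argv list with every <name> placeholder replaced by its value."""
    argv = []
    for token in shlex.split(template):
        if token.startswith("<") and token.endswith(">"):
            name = token[1:-1]
            if name not in params:
                raise KeyError(f"missing parameter: {name}")
            argv.append(str(params[name]))
        else:
            argv.append(token)
    return argv


# Illustrative values only; a real module defines its own accepted values.
params = {
    "input.filename": "all aml.gct",
    "output.base.name": "clustered",
    "clustering.method": "m",
}
command = build_command(TEMPLATE, params)

# shlex.join quotes arguments that contain spaces, so the printed line
# could be pasted into a shell.
print(shlex.join(command))
```

Building an argument list rather than a single string keeps values with spaces intact, so the user never has to think about shell quoting.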
17. Access to multiple compute resources
GenePattern users can connect to several servers, each with its own pipelines, modules, and jobs:
• IU HPC GenePattern Server
• Local GP Server
• GenePattern Server on Amazon Cloud
18. GenePattern Notebook features
• GenePattern login cell
– Input user credentials
– Access any GenePattern server
– Connect through GenePattern API
• Module browser window
– View, search, and select available modules and pipelines
• GenePattern analysis cell
– Upload datasets and enter parameters
– Submit jobs to GenePattern server
– View job status
– Link to result files and share results
19. Seamless Data Flow within notebooks
Send a result to the input of another notebook cell
result_file.gct
24. Seamless data flow
Send a result file to a dataframe
• Pandas is a popular Python library for working with data.
• Certain GenePattern file formats can be automatically imported into a pandas DataFrame by selecting “Send to Dataframe” in the menu.
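To give a sense of what such an import involves, here is a minimal sketch that parses a GCT file by hand. It assumes the GCT 1.2 layout (a `#1.2` version line, a line with the row and column counts, then a tab-separated table whose first two columns are `Name` and `Description`). The `parse_gct` helper and the sample data are hypothetical, and the sketch uses only the standard library; it is not the code the notebook runs. With pandas installed, `pd.read_csv(path, sep="\t", skiprows=2)` reads the same layout.

```python
import csv
import io

# Tiny made-up GCT 1.2 document: 2 rows (genes) by 3 columns (samples).
SAMPLE_GCT = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tsample_A\tsample_B\tsample_C\n"
    "GENE1\tfirst gene\t1.5\t2.0\t3.25\n"
    "GENE2\tsecond gene\t0.0\t4.5\t6.0\n"
)


def parse_gct(text):
    """Parse GCT 1.2 text into (sample_names, {row_name: {sample: value}})."""
    lines = io.StringIO(text)
    version = lines.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unsupported GCT version line: {version!r}")
    n_rows, n_cols = (int(x) for x in lines.readline().split()[:2])

    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # skip the Name and Description columns

    table = {}
    for row in reader:
        if not row:
            continue  # tolerate trailing blank lines
        table[row[0]] = dict(zip(samples, (float(v) for v in row[2:])))

    # Check the declared dimensions against what was actually read.
    if len(table) != n_rows or len(samples) != n_cols:
        raise ValueError("declared dimensions do not match the data")
    return samples, table


samples, table = parse_gct(SAMPLE_GCT)
print(samples)
print(table["GENE2"])
```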
29. GenePattern Python Library
import gp
# Create a GenePattern server proxy instance
gpserver = gp.GPServer('http://localhost:8080/gp','myusername', 'mypassword')
# Obtain GPTask by module name
module = gp.GPTask(gpserver, "PreprocessDataset")
# Load module parameter data
module.param_load()
# Create a job specification
job_spec = module.make_job_spec()
# Upload a file to the server
uploaded_file = gpserver.upload_file("file_name", "/path/to/the/file/file_name")
job_spec.set_parameter("input.filename", uploaded_file.get_url())
# Submit the job to the GenePattern server
job = gpserver.run_job(job_spec)
• Complete programmatic access
• Automatic integration with GenePattern cell data
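When a module takes many parameters, the `set_parameter` calls can be driven from a dictionary. The sketch below relies only on the `set_parameter(name, value)` method used in the code above. The `apply_parameters` helper and the `RecordingJobSpec` stand-in class are hypothetical and not part of the `gp` library; the stand-in exists so the example runs without a GenePattern server. Parameter names are taken from the slides and the values are placeholders.

```python
def apply_parameters(job_spec, params):
    """Set each name/value pair on a job specification and return it."""
    for name, value in params.items():
        job_spec.set_parameter(name, value)
    return job_spec


class RecordingJobSpec:
    """Stand-in for a job specification; records what would be sent."""

    def __init__(self):
        self.params = {}

    def set_parameter(self, name, value):
        self.params[name] = value


# Placeholder values; a real module documents which values it accepts.
spec = apply_parameters(
    RecordingJobSpec(),
    {
        "input.filename": "https://example.org/uploads/file_name",
        "output.base.name": "file_name.preprocessed",
    },
)
print(spec.params)
```

In a notebook, the object returned by `module.make_job_spec()` would take the place of `RecordingJobSpec()`, assuming it exposes `set_parameter` as in the code above.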
30. GenePattern Notebook Repository
• Create, execute, and share GenePattern notebooks
• Zero-install usage of GenePattern Notebook
• Python mathematical, statistical, machine learning, and graphics packages
• Save and adapt shared notebooks
• Runs in Amazon cloud utilizing scalable compute
• Open-source, available on GitHub
www.genepattern-notebook.org
31. Public Notebook Library
• Featured notebooks created by GenePattern team and collaborators
• Templates for common bioinformatics analysis scenarios
• Community-contributed published notebooks
• Tutorial notebooks teaching basic principles of the environment
• Adaptable to your own research
• Search and tag functions
40. Cloud Architecture
• Web browser
• GenePattern Notebook Repository and Workspace
– EC2 instance: JupyterHub Server
– Docker swarm running on EC2 autoscale group: notebook kernels with Python packages and GenePattern extensions
– EFS: notebooks and user data
• GenePattern Server
– EC2 instance
– AWS Batch: analysis modules running on EC2 compute nodes
– S3: data transfer to compute nodes
41. Local GenePattern Notebooks
Docker image
• genepattern-notebook on Docker Hub
Python
• pip install genepattern-notebook
• conda install genepattern-notebook
42. In development
• Release of in-depth analysis notebooks
– CoGAPS latent space analysis (Elana Fertig Lab)
– GSEA with information-theoretic metrics (Pablo Tamayo lab)
– REVEALER identifies the context of genomic alterations that drive tumor formation (Pablo Tamayo lab)
– AMARETTO multi-omics analysis to identify cancer driver genes and mutations (Nathalie Pochet, Olivier Gevaert lab)
• Incorporation of additional tools
– IGV.js
– NDEx Network Data Exchange
– Etc.
• Real-time collaboration
43. GenePattern Community
• First public release in 2004
• Current version: 3.9.10
• >65,000 registered users
• Public server runs 2000-5000 analyses/week
• Open source – BSD-style license
• https://github.com/genepattern
• Community-contributed methods
o CRISPR analysis
o Bisulfite sequencing
o Flow cytometry
o RNAi screens
o etc.
Web site: www.genepattern.org
Community repository: www.gparc.org
Public server: cloud.genepattern.org
44. Acknowledgements
Mesirov Lab
Peter Carr
Lukas Chavez
David Eby
Barbara Hill
Edwin Juarez
Forrest Kim
Ted Liefeld
Clarence Mah
Jim Robinson
Thorin Tabor
Helga Thorvaldsdottir
Douglass Turner
Jill P. Mesirov, PI
Collaborators
Pablo Tamayo Lab, UCSD
Fernando Perez, Lawrence Berkeley Lab