Big data has arrived in the life science research domain and has driven the need for optimized high-performance networks in these research environments. Petabyte-scale data transfer, storage, and analytics are now a reality because data is being produced cheaply and rapidly, at unprecedented rates, in academic, commercial, and clinical laboratories. These data flows are complicated by the mixture of high-frequency mouse flows and high-volume elephant flows, sometimes generated by the same application running in parallel environments. Additional complicating factors include collaborative research efforts on large data stores that utilize both common and disparate compute resources, the need for high-performance in-flight data encryption to cover the transmission and handling of clinical data, and the relatively poor state of algorithm development, from an I/O standpoint, throughout the industry. This presentation will cover representative advanced networking use cases from life sciences research, the challenges they present in networking environments, some solutions being deployed within both small and large institutions, and an overview of a few problems that remain unresolved to date.
From the Benchtop to the Datacenter: HPC Requirements in Life Science Research - Ari Berman
Ari Berman discussed the growing requirements for high performance computing (HPC) in life science research. Large amounts of data are being generated through techniques like next generation sequencing, imaging, and medical scanning. However, infrastructure is struggling to keep up with the rapid changes in instrumentation and data volumes. Proper HPC design requires understanding specific use cases to solve problems rather than just achieve performance. Life scientists can be tough on systems and want everything immediately with limited budgets.
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life ... - Ari Berman
Talk given at the Leverage Big Data '14 Event in May 2014.
Big data has arrived in the life science research domain, and it has caught researchers and IT professionals alike off-guard. Workstations, Excel, and even small clusters under people's desks are no longer sufficient to meet the data storage and processing needs of modern biological research techniques. Data is being produced cheaply and rapidly at unprecedented rates in academic, commercial and clinical laboratories while budgets in those spaces continue to be slashed. Despite the reduced budgets, it is predicted that 25% of all researchers will require HPC to analyze their data in the coming year. Research organizations are starting to realize they have to run to catch up, or face failure in the wake of old-school IT infrastructures and policies. IT organizations have been forced to get creative and build amazing infrastructures for pennies, or fail under the user pressure being generated by the laboratories. Converged infrastructure is the present and the future for biomedical, clinical, and life sciences research. In this talk, I'll cover the IT challenges in life sciences, how and where they are being met, and talk about the near-future trends in IT infrastructure, services, and informatics and how they will affect medical discoveries in the next 5-10 years.
2014 BioIT World - Trends from the trenches - Annual presentation - Chris Dagdigian
Talk slides from the annual "trends from the trenches" address at BioITWorld Expo. 2014 Edition.
### Email chris@bioteam.net if you'd like a PDF copy of this deck ###
Mapping Life Science Informatics to the Cloud - Chris Dagdigian
This document discusses strategies for mapping informatics to the cloud. It provides 9 tips for doing so effectively. Tip 1 advises that high-performance computing and clouds require a new model where resources are dedicated to each application. Tip 2 recommends hybrid cloud approaches but cautions they are less usable than claimed and practical only sometimes. The document emphasizes the need to handle legacy codes in addition to new "big data" approaches.
RAPIDS 2018 - Keynote - How I learned to stop worrying and love version control - dotmesh
The keynote talk from RAPIDS 2018 in London.
Dr Stephen J Newhouse and Luke Marsden explain why now is the moment to take Reproducibility and Provenance in Data Science (RAPIDS) seriously, and how this can be achieved with process and tooling.
Stephen shares his experiences of the challenges in the industry and Luke introduces the beta version of Dotscience, a tool for model tracking and collaboration through RAPIDS.
Bio-IT & Cloud Sobriety: 2013 Beyond The Genome Meeting - Chris Dagdigian
October 2013 "Beyond the Genome" presentation slides. Talk is mostly focused on issues around IaaS cloud usage for "Bio-IT" and life science informatics & scientific computing.
PDF SLIDES AVAILABLE DIRECTLY - PLEASE EMAIL "CHRIS@BIOTEAM.NET" FOR SLIDES
This is a custom "Bio IT trends/problems" deck that I did for a general but highly technical audience at the 2014 Internet2 Technology Exchange conference.
Download of the raw PPT is disabled; contact me at chris@bioteam.net if a direct copy or PDF of the presentation would be useful.
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science - Ari Berman
We'll explore current and future considerations in advanced computing architectures that empower the conversion of data into knowledge. Life sciences produce more data than any other major science domain, making analytics and scientific computing cornerstones of modern research programs and methodologies. We'll highlight the remarkable biomedical discoveries that are emerging through combined efforts, and discuss where and how the right infrastructure can catalyze the advancement of human knowledge. On-premises architectures as well as cloud, hybrid, and exotic architectures will all be discussed. It's likely that all life science researchers will require advanced computing to perform their research within the next year. However, there has been less focus on advanced computing infrastructures across the industry due to the increased availability of public cloud infrastructure and anything-as-a-service models.
This was a 30 min talk intended as one of the opening/overview presentations before a full-day deep dive into ScienceDMZ design patterns and architectures.
Direct downloads are not enabled. Contact me directly (chris@bioteam.net) if you for some odd reason want a copy of this slide deck!
Next generation genomics: Petascale data in the life sciences - Guy Coates
Keynote presentation at OGF 28.
The year 2000 saw the release of "The" human genome, the product of the combined sequencing effort of the whole planet. In 2010, single institutions are sequencing thousands of genomes a year, producing petabytes of data. Furthermore, many of the large-scale sequencing projects are based around international collaboration and consortia. The talk will explore how Grid and Cloud technologies are being used to share genomics data around the planet, revolutionizing life science research.
- The speaker observes trends in how research infrastructure is changing more rapidly than IT can refresh systems, creating challenges. This includes new instruments generating vastly more data.
- There is a blurring of roles between scientists, sysadmins, and programmers as everything becomes more automated and "scriptable." Sysadmins must learn programming and researchers can now self-provision resources.
- Virtualization is widely used even in HPC to provide flexibility and address business needs. Very large "fat node" servers are replacing clusters of smaller nodes. Local disk is coming back as a hedge against big data requirements.
- Object storage is becoming more viable and approachable on commodity hardware.
A presentation I gave at the 2018 Molecular Med Tri-Con in San Francisco, February 2018. This addresses the general challenge of biomedical data management, some of the things to consider when evaluating solutions in this space, and concludes with a brief summary of some of the available tools and platforms.
Timothy Dawes of Genentech and Elliot Hui of the University of California, Irvine share their well-received presentation from SLAS2017 in Washington, DC.
O'Reilly Webcast: Organizing the Internet of Things - Actionable Insight Thro... - Boris Adryan
- A biologist is interested in large, unstructured IoT data to gain insights from connections between different data points, similar to how biologists study connections between biological entities.
- Currently the IoT lacks common data formats, ontologies to provide context to things, and data repositories, limiting analytical flexibility and insights that can be gained.
- Biologists overcame similar problems by developing standards like gene ontologies, metadata requirements, and public data repositories, enabling knowledge inference from large, diverse datasets.
- Key concepts from biology that could help the IoT include developing ontologies to define thing functions, processes, and localizations in order to organize knowledge and enable inferencing across the large, diverse data generated by the IoT.
This document discusses the challenges and opportunities presented by the increasing volume and complexity of biological data. It outlines four main areas: 1) Developing methods to efficiently store, access, and analyze large datasets; 2) Broadening our understanding of gene function beyond a small number of well-studied genes; 3) Accelerating research through improved sharing of data, results, and methods; and 4) Leveraging exploratory analysis of integrated datasets to generate new insights. The author advocates for lossy data compression, streaming analysis, preprint sharing, improved metadata collection, and incentivizing open data practices.
The document discusses big data sources, challenges, and analytics. It describes how big data is too large to be managed by traditional databases due to its volume, velocity, variety, and veracity. Big data comes from sources like web pages, social media, sensors, and financial transactions. Analyzing big data requires distributed computing across clusters of servers to store and process the data in parallel. Frameworks like MapReduce and Hadoop were developed to perform big data analytics across clusters and address challenges of node failures, network bottlenecks, and distributed programming.
The document discusses the future of networks and big data transmission over the next 10 years. It notes that the volume of data being generated is increasing exponentially and will grow 44 times from 2009 to 2020. This data comes from various sources like social media, sensors, mobile devices, and scientific instruments. It is also generated in many different formats. Networks will need to adapt to handle the massive scale and speed of big data by moving to a more flexible software-defined networking model that can push computing resources closer to where the data is located. This represents a shift away from traditional specialized network hardware towards a network operating system that provides a global view and control of the network.
Building a Distributed Collaborative Data Pipeline with Apache Spark - Databricks
The year of the COVID-19 pandemic has spotlighted, as never before, the many shortcomings of the world's data management workflows. The lack of established ways to exchange and access data was a highly recognized contributing factor in our poor response to the pandemic. On multiple occasions we have witnessed how our poor practices around reproducibility and provenance have completely sidetracked major vaccine research efforts, prompting many calls for action from scientific and medical communities to address these problems.
This presentation discusses standards for sharing functional genomics data. It summarizes lessons learned from the Minimum Information About a Microarray Experiment (MIAME) standard, including that simply depositing data is not enough - metadata, analysis code, and usable formats are also needed for reproducibility. For high-throughput sequencing data, a Minimum Information about a high-throughput Nucleotide Sequencing Experiment (MINSEQE) standard is proposed with similar requirements as MIAME. The presentation emphasizes keeping standards simple while ensuring machine-readability for reuse.
TranSMART: How open source software revolutionizes drug discovery through cro... - keesvb
Presentation about the use of open source software in pharmaceutical companies at Global Discovery & Development Innovation Summit (GDDIS) in Princeton, NY, fall 2013.
The phrase "Big Data" is generally used to describe a large volume of structured and/or unstructured data that cannot be processed using traditional database and software techniques. In the domain of chemistry the Royal Society of Chemistry certainly hosts large structured databases of chemistry data, for example ChemSpider, as well as unstructured content, in the form of our collection of scientific articles. Our research literature provides value to its readership and, at present, as an example of one of our databases, ChemSpider is accessed by many tens of thousands of scientists every day. But do these collections constitute "Big Data", or is it the potential that lies within the collections that can contribute to the Big Data movement? This presentation will discuss our activities to contribute both data, and service-based access to our data sets, to support grant-based projects such as the Innovative Medicines Initiative Open PHACTS project (to support drug discovery) and the PharmaSea initiative (to identify novel natural products from the ocean). We will also provide an overview of our activities to perform data mining of public patent collections and examine what can be done with the data. We are presently extracting physicochemical properties and textual forms of NMR spectra and, with the resulting data, are building predictive models (for melting points at present) and assembling a large NMR spectral database containing many hundreds of thousands of spectral-structure pairs. Our experiences to date have demonstrated that we are working at the edge of current algorithmic and computing capabilities for predictive model building, with over a quarter of a million melting points producing a matrix of over 200 billion descriptors. Our work to produce the NMR spectral database will necessitate batch processing of the data to examine consistency between the spectral-structure pairs and other forms of data validation. The intention is to take our experience from this work on a public patent corpus and apply it to the RSC back file of publications to mine data and enable new paths to the discoverability of both data and the associated publications.
Facilitating Collaborative Life Science Research in Commercial & Enterprise E... - Chris Dagdigian
This is a talk I put together for a http://www.neren.org/ seminar called "Bridging the Gap: Research Facilitation". Tried to give a biotech/pharma view for a mostly academic audience.
High-Performance Networking Use Cases in Life Sciences
1. High-Performance Networking Use Cases in Life Sciences
2014 Internet2 Technology Exchange; Indianapolis, IN
Slides available at http://www.slideshare.net/arieberman
2. Who am I?
Director of Government Services, Principal Investigator
I'm a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics
I'm an HPC/Infrastructure geek - 15 years
I help enable science!
I'm Ari
3. BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments
‣ 11+ years bridging the "gap" between science, IT & high performance computing
‣ Our wide-ranging work is what gets us invited to speak at events like this ...
10. BioTeam: What do we do?
(diagram labels: Laboratory Knowledge; Converged Solution)
12. Our domain coverage
Mostly work in Life Sciences
• Government
• Universities
• Big pharma
• Biotech
• Private institutes
• Diagnostic startups
• Oil and Gas
• Geospatial
• Hollywood Animation
• Law Enforcement
14. We've noticed a few things
We have a unique perspective across much of life sciences
‣ Big Data has arrived in Life Sciences
‣ Data is being generated at unprecedented rates
‣ Research and Biomedical Orgs were caught off guard
‣ IT running to catch up, limited budgets
‣ Money is tight, Orgs reluctant to invest in Bio-IT
25% of all Life Scientists will require HPC in 2015!
15. Big Picture / Meta Issue
‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
‣ IT not a part of the conversation, running to catch up
16. The Central Problem Is ...
Science progressing way faster than IT can refresh/change
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
17. It's a risky time to be doing Bio-IT
18. What are the drivers in Bio-IT today?
19. It's like the hard drive of life
The big deal about DNA
‣ DNA is the template of life
‣ DNA is read --> RNA
‣ RNA is read --> Proteins
‣ Proteins are the functional machinery that make life possible
‣ Understanding the template = understanding basis for disease
21. How does NGS work?
Reference assembly, variant calling
24. The Human Genome
Gateway to personalized medicine
‣ 3.2 Gbp
‣ 23 chromosomes
‣ ~21,000 genes
‣ Over 55M known variations
26. ...and why NGS is the primary driver
The Problem...
‣ Sequencers are now relatively cheap and fast
‣ Some can generate a human genome in 18 hours, for $2,000
‣ Everyone is doing it
‣ Can generate 3TB of data in that time
‣ First genome took 13 years and $2.7B to complete
‣ Know of 10 organizations: 100,000 genomes over 5 years
That's 14PB of data, folks
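The 14 PB punchline is easy to reconstruct. A minimal sketch of the arithmetic, assuming roughly 140 GB of retained data per genome (an inferred figure, implied by the slide rather than stated in the talk):

```python
# Back-of-envelope check of the slide's "14 PB" figure (a sketch; the
# ~140 GB/genome of retained data is an assumption, not a stated number).
genomes = 100_000                 # one organization's 5-year target
bytes_per_genome = 140e9          # assumed retained data per genome (~140 GB)
total_pb = genomes * bytes_per_genome / 1e15
print(f"{total_pb:.0f} PB")       # -> 14 PB, matching the slide
```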
28. High-throughput Imaging
‣ Robotics screening millions of compounds on live cells 24/7
• Not as much data as genomics in volume, but just as complex
• Data volumes in the 10's TB/week
‣ Confocal Imaging
• Scanning 100's of tissue sections/week, each with 10's of scans, each with 20-40 layers and multiple fluorescent channels
• Data volumes in the 1's - 10's TB/week
29. High-res medical imaging
High-power, dense detector MRI scanners in use 24/7 at large research hospitals
‣ Creating 3D models of brains, comparing large datasets
‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from supercomputer in the OR (cool stuff)
‣ Also generates 10's of TB/week
30. This is a huge problem
‣ Causing a deluge of data, in the 10's of Petabytes
‣ NIH generating 1.5PB of data/month
‣ First real case in life science where 100Gb networking might really be needed
‣ But, not enough storage or compute
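A quick sketch of what 1.5 PB/month implies as a sustained rate; it suggests 100Gb links are about burst headroom and concurrent elephant flows rather than the monthly average:

```python
# Sanity check (sketch): sustained rate implied by 1.5 PB/month.
pb_per_month = 1.5
bits = pb_per_month * 1e15 * 8        # decimal petabytes -> bits
seconds = 30 * 86_400                 # ~one month
print(f"{bits / seconds / 1e9:.1f} Gb/s sustained")  # -> ~4.6 Gb/s
# The average fits in 10GbE; 100Gb is for bursts and parallel flows.
```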
32. File & Data Types
We have them all
‣ Massive text files
‣ Massive binary files
‣ Flatfile 'databases'
‣ Spreadsheets everywhere
‣ Directories w/ 6 million files
‣ Large files: 600GB+
‣ Small files: 30kb or smaller
33. Why, giant meta-analyses, of course
What to do with all that data?
‣ Typical problem across all of big data: how do you use it?
‣ In life sciences: no real standards of data formats
‣ Data scattered all over, despite push for Data Commons
‣ Not always accessible
‣ Combining the data if you have it all is a real challenge
34. A Compounding Problem...
Scientists don't like to share (really!)
‣ The fear:
• if someone sees data before it is published, they might steal it and publish it themselves (getting scooped)
‣ Causes:
• Long time to publication
• Outdated methods of assigning scientific credit
• Not properly incentivized
35. A Problem for Data Commons
Sharing required
‣ Data piling up (scientists are hoarders)
‣ Bad network infrastructures
‣ Few central analytics platforms
‣ Wild-west file formats/algorithms
‣ No sharing
36. A Problem for Data Commons
Sharing required
Hyperscale analytics will only work if the data is accessible!
37. Clear issue for Networking
Every kind of flow imaginable
‣ Mouse —> Elephant
‣ Typical problem: firewalls not designed for this
‣ Potentially massive amount of constant data movement
‣ How are people handling all of this?
40. Laboratories not Integrated
Usually very little IT infrastructure in labs
‣ Tons of data generating equipment going in now
‣ Can generate 15GB of data in 50 hours
‣ Others can generate 64GB/day
‣ Labs are not designed to transmit data, lucky if wired for ethernet
43. Getting data out
OK, so write data over ethernet to network drive…
‣ Sounds good, 64GB in 24 hours ~= 6Mb/s
‣ Problem: desktop class ethernet adaptors
‣ No error checking, no retries, no MD5, no local buffer
‣ If network goes, whole run is lost
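The ~6Mb/s figure checks out as an average rate; a minimal sketch of the arithmetic:

```python
# Average-rate arithmetic for an instrument writing 64 GB over a 24-hour run.
gigabytes = 64
bits = gigabytes * 1e9 * 8
seconds = 24 * 3600
print(f"{bits / seconds / 1e6:.1f} Mb/s")  # -> ~5.9 Mb/s, the slide's ~6 Mb/s
```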
44. Getting data out
Scientists have to get creative, but not in a good way
‣ Usually ends up going to local workstation
‣ Go buy the cheapest disks they can
‣ Carry it somewhere, transfer the data to a workstation
‣ Put the disk in a drawer under a sink (really)
‣ Works if lab only does one or two runs/month, fails if more
45. Lab data transit not huge!
Unless you're dealing with a bigger lab with lots of equipment, or a core facility
‣ Fast networking not required, 100Mb OK
‣ Just GOOD networking
‣ ….for now (more later)
46. Successful models
Some generalized network models that have successfully solved the problem
‣ Most of it is protocol and topology
‣ Quality of Service (QoS)
‣ Appropriate segmentation (L2 and/or L3)
‣ MPLS paths
‣ Intermediate protocols (e.g., Aspera FASP)
‣ One way or another, guarantee transfer
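A minimal sketch of what "guarantee transfer" can mean at the application layer: a checksummed, retrying copy that supplies the error checking, retries, and MD5 that slide 43 found missing. This is not Aspera FASP or any vendor product, and the paths are hypothetical:

```python
# Sketch: checksum-verified copy with retries (paths are hypothetical).
import hashlib, shutil, time
from pathlib import Path

def md5sum(path: Path) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def copy_verified(src: Path, dst: Path, retries: int = 3) -> None:
    for attempt in range(1, retries + 1):
        shutil.copy2(src, dst)
        if md5sum(src) == md5sum(dst):    # verify what actually landed
            return
        time.sleep(2 ** attempt)          # back off before retrying
    raise IOError(f"checksum mismatch after {retries} attempts: {src}")

dst_dir = Path("/mnt/lab-nas/run42")      # hypothetical network share
dst_dir.mkdir(parents=True, exist_ok=True)
for f in Path("/instrument/run42").iterdir():  # hypothetical run directory
    if f.is_file():
        copy_verified(f, dst_dir / f.name)
```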
48. Storage: a networking problem
As storage needs increase, the need to move that data goes up too
‣ Networking will quickly replace storage as #1 headache in Bio-IT
‣ Petascale storage is useless without high-performance networking
‣ Most enterprise networks won't cut it
49. Storage: an Org Problem
Most single laboratories don't have an immediate need for peta-scale storage
‣ BUT - labs need to be peta-capable
‣ Can't predict how much or what kind of equipment
‣ Have to build for an indeterminate future
‣ Does it make sense for each lab to buy own storage?
• Probably not, doesn't scale well financially
50. Storage: an Org Problem
Orgs that don't invest will find themselves in a mess of storage support
‣ This is when the storage problem becomes a networking problem
‣ Scientists need to share, collaborate
‣ Lab with 100TB of data, needs to share with offsite or onsite scientist
‣ Also: backups and disaster recovery: data is the new commodity
51. Storage: a networking problem
Without high-performance networking, petascale anything is useless
‣ Traditional enterprise networks don't cut it
‣ Large single-stream flows get squashed through firewalls and IDS
‣ Centralized: 10's of PBs
‣ Distributed: 100's of PBs
• Likely a lot of duplication
‣ Network becomes key
‣ Cloud use makes this an even bigger problem
52. Storage: options!
‣ There are a ton of options for storage
• Local: small and large
• Institutional: mostly large
• Distributed Institutional: distributed NAS (GPFS over WAN), Object store networks, iRODS
• Public clouds: block and object storage
‣ All require high-performance networking
‣ Anything external requires awesome external connection
53. Storage networking: solutions
External connections that make petascale storage useful to scientists
‣ OC-192
• Works for large institutions willing to make investment
• Cost prohibitive: $200-$300k/month
• Start-up cost of at least $1-2M for border equipment
‣ Internet2 10/100Gb Hybrid ports
• Much better cost, fewer routing options
• $200k/year
‣ Google Fiber, AT&T Gigapower?
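Using only the figures quoted on this slide, a rough five-year comparison (a sketch; real pricing varies by institution and year, and the two options are not functionally equivalent):

```python
# Rough 5-year cost comparison using the slide's own figures (a sketch).
years = 5
oc192 = 1.5e6 + 250_000 * 12 * years      # ~$1-2M border gear + $200-300k/month
internet2 = 200_000 * years               # $200k/year hybrid 10/100Gb ports
print(f"OC-192:    ${oc192/1e6:.1f}M over {years} years")    # -> ~$16.5M
print(f"Internet2: ${internet2/1e6:.1f}M over {years} years")  # -> ~$1.0M
```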
54. Storage networking: solutions
Internal networking more critical than external for petascale storage
‣ Infrastructure must be able to support the inevitable 1PB transit
• Disaster recovery
• High-availability
• Backup
‣ Need at least 10Gb
• Probably dedicated 10Gb per >1PB storage facility: 40Gb min —> 1Tb backbone
‣ 1Gb will not cut it for that data size
• ~97 days to transmit at saturation
• 10Gb: ~9.7 days
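A sketch of the transfer-time math behind these bullets. Raw wire-speed arithmetic gives ~93 days at 1Gb; the slide's ~97 days presumably allows a few percent of protocol overhead (an assumption):

```python
# Time to move 1 PB at various line rates, ignoring protocol overhead.
petabytes = 1
bits = petabytes * 1e15 * 8
for gbps in (1, 10, 40):
    days = bits / (gbps * 1e9) / 86_400
    print(f"{gbps:>3} Gb/s: {days:5.1f} days")
# ->   1 Gb/s:  92.6 days
# ->  10 Gb/s:   9.3 days
# ->  40 Gb/s:   2.3 days
```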
55. Storage networking: solutions
And now, the real problem: topology and logical design
‣ Need a scaling internal topology
‣ One core switch doing all routing and packet transit == bad
‣ More advanced designs needed
‣ Also: prioritize performance over security
• Nearly impossible for most orgs
‣ Most implemented option: Science DMZ
56. Science DMZ: not for everything
Sensitive data come with policies and compliance requirements; breaking them can be illegal
‣ Need logical topology flexible enough for security AND performance
‣ Best example: ISP model
• Collapsed PE/CE on single router at edge
• OSPF routing at edge, fast label switching on dual 100Gb cores
• VRF for network segments
• MPLS for fast transit and bandwidth guarantees
‣ Side benefit: trusted and untrusted Science DMZ
58. Compute == Answers!
The pinnacle of data transit, the reason we store it in the first place
‣ High performance computing: clusters, supercomputers, single servers, powerful workstations, etc.
‣ Mostly a datacenter issue
‣ Unless…
• Storage not centralized or co-located: data gets duplicated unless you have a killer network
• New methods: data doesn't move, compute moves to data
59. Use Case: Get data to cluster
Assumes the use of central high-performance storage system
‣ Easier problem within the same datacenter
‣ Large data needs large pipe
‣ Output of storage device needs to be fast
• Needs to drive data to/from all compute nodes simultaneously
‣ Large clusters: big problem
• Needs parallel filesystems: GPFS, Lustre
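A sketch of why central storage must be parallel: the aggregate rate needed to feed every node at once. The node count and per-node rate are hypothetical examples, not figures from the talk:

```python
# Aggregate bandwidth a central storage system must sustain (sketch;
# node count and per-node rate are hypothetical examples).
nodes = 500
gbps_per_node = 1.0          # sustained per-node read rate during a job wave
aggregate = nodes * gbps_per_node
print(f"storage must source ~{aggregate:.0f} Gb/s aggregate")  # -> ~500 Gb/s
# A single NAS head cannot do this; striping across many servers is why
# parallel filesystems such as GPFS and Lustre show up here.
```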
60. Internal network esp. important
Use of local disk in newer clusters
‣ Implementation of storage/analytics systems for Big Data/HDFS
‣ Hadoop, Gluster, local ZFS volumes, virtual disk pools
‣ Now storage can be both internal and external
‣ I/O throughput is critical
61. Application characteristics
‣ Mostly single process apps
‣ Some SMP/threaded apps performance bound by IO and/or RAM
‣ Lots of Perl/Python/R
‣ Hundreds of apps, codes & toolkits
‣ 1TB - 2TB RAM "High Memory" nodes becoming essential
‣ MPI is rare
• Well written MPI is even rarer
‣ Few MPI apps actually benefit from expensive low-latency interconnects*
• *Chemistry, modeling and structure work is the exception
62. Life Science very I/O bound
Genomics especially
‣ Sync time for data often takes longer than the job itself
‣ Have to load up to 300GB into memory, for 1min process
‣ Do this thousands of times
‣ Largely due to bad programming and improperly configured systems
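A sketch quantifying the imbalance the slide describes: staging 300GB versus one minute of compute, at several network speeds:

```python
# Staging time for 300 GB vs. the slide's one minute of actual compute.
data_bits = 300e9 * 8
compute_s = 60.0
for gbps in (1, 10, 40):
    stage_s = data_bits / (gbps * 1e9)
    print(f"{gbps:>3} Gb/s: stage {stage_s/60:5.1f} min "
          f"({stage_s / compute_s:4.1f}x the compute time)")
# ->   1 Gb/s: stage  40.0 min (40.0x the compute time)
# ->  10 Gb/s: stage   4.0 min ( 4.0x the compute time)
# ->  40 Gb/s: stage   1.0 min ( 1.0x the compute time)
```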
63. Cluster networking Solutions
Interconnects between the nodes and the cluster's connection to the main network critical
‣ Optimal cluster networks: fat tree and torus topologies
• All layer 2, internally
‣ Most keep subscription to 1:4, depending on usage
‣ Top-level switches connect at high speed to datacenter network
• Newest are multiple 10Gb or 40Gb
• Infiniband internal networks: Mellanox ConnectX3 - ethernet and IB capable switch ports
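A sketch of the arithmetic behind the 1:4 subscription bullet; the port counts are hypothetical examples, not figures from the talk:

```python
# Leaf-switch oversubscription ratio (sketch; port counts are hypothetical).
downlinks = 48 * 10      # 48 nodes at 10 Gb/s each
uplinks = 3 * 40         # 3 x 40 Gb/s to the spine/core
ratio = downlinks / uplinks
print(f"oversubscription {ratio:.0f}:1")  # -> 4:1, the slide's 1:4 subscription
# Tolerable for mostly single-node, I/O-streaming jobs; MPI-heavy sites
# push closer to 1:1 (non-blocking).
```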
65. Collaboration
Fundamental to science
‣ Now that data production is reaching petascale, collaboration is getting harder
‣ Projects are getting more complex, more data is being generated, takes more people to work on the science
‣ Journal authorships: common to see 40+ authors now
‣ Clearly a networking problem at its core
‣ Let's face it, doing this right is expensive!
66. Data Movement & Data Sharing
The gist of collaborative data sharing in life sciences
‣ Peta-scale data movement needs
• Within an organization
• To/from collaborators
• To/from suppliers
• To/from public data repos
‣ Peta-scale data sharing needs
• Collaborators and partners may be all over the world
68. We Have Both Ingest Problems
Physical & Network
‣ Significant physical ingest occurring in Life Science
• Standard media: naked SATA drives shipped via Fedex
‣ Cliche example:
• 30 genomes outsourced means 30 drives will soon be sitting in your mail pile
‣ Organizations often use similar methods to freight data between buildings and among geographic sites
69. Physical Ingest Just Plain Nasty
‣ Easy to talk about in theory
‣ Seems "easy" to scientists and even IT at first glance
‣ Really really nasty in practice
• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data
70. Collaboration Solutions
Science DMZ: making it easier to collaborate
(Image source: "The Science DMZ: Introduction & Architecture" -- esnet)
71. Collaboration Solutions
Internet2: making data accessible and affordable
‣ Internet2 is bringing Research and Education together
• High-speed, clean networking at its core
• Novel and advanced uses of SDN
• Subsidized rates: national high-performance networking affordable
‣ AL2S: quickly establish national networks at high-speed
‣ Combined with Science DMZ: platform for collaboration
72. Collaboration Solutions
Push for Cloud use: Most use Amazon Web Services, Google Cloud not far behind
‣ Many Orgs are pushing for cloud
‣ Unsupported scientists end up using cloud
‣ It's fast, flexible, affordable, if done right
‣ Great place for large public datasets to live
‣ Has existing high(ish)-performance networking
‣ If done wrong, way more expensive than local compute
‣ Biggest problem: getting data to it!
73. Collaboration Solutions
Hybrid HPC: Also known as hybrid clouds
‣ Relatively new idea
• small local footprint
• large, dynamic, scalable, orchestrated public cloud component
‣ DevOps is key to making this work
‣ High-speed network to public cloud required
‣ Software interface layer acting as the mediator between local and public resources
‣ Good for tight budgets, has to be done right to work
‣ Not many working examples yet
74. Data Commons
Central storage of knowledge with compute
‣ Common structure for data storage and indexing (a cloud?)
‣ Associated compute for analytics
‣ Development platform for application development (PaaS)
‣ Make discovery more possible
76. USDA: Agricultural Research Service
Huge Government Agency trying to make agriculture better in every way
‣ Researchers doing amazing research on how crops and animals can be better farmed
‣ Lower environmental impacts
‣ Better economic returns
‣ How to optimize how agriculture functions in the US
‣ But, there's a problem…
77. They're doing all the things!
They are doing every kind of high-throughput research discussed here, and more, on a massive scale
78. Just to list a few…
‣ Genomics (a lot of de novo assembly)
‣ Large scale imaging
• LIDAR
• Satellite
‣ Simulations
‣ Climatology
‣ Remote sensing
‣ Farm equipment sensors (IoT)
79. Their current network
• Upgrading to DS3
• Still a lot of T1
• Won't cut it for science
80. The new initiative
Build a Science DMZ: SciNet, on an Internet2 AL2S Backbone
89. Converged Infrastructure
The meta issue
‣ Individual technologies and their general successful use are fine
‣ Unless they all work together as a unified solution, it all means nothing
‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure
94. Universal Truth
"The network IS the computer" - John Gage, Sun Microsystems
‣ Convergence is not possible without networking
‣ Also not possible without GOOD networking
‣ Life Sciences is learning lessons learned by physics and astronomy 5-10 years ago
‣ Biggest problem is Org acceptance and investment in personnel and equipment
‣ Next-Gen biomedical research advancing too quickly: must invest now
95. end; Thanks!
slides at http://www.slideshare.net/arieberman