Presented at PEARC21.
Most experimental sciences now rely on computing, and biological sciences are no exception. As datasets get bigger, so do the computing costs, making proper optimization of the codes used by scientists increasingly important. Many of the codes developed in recent years are based on the Python library NumPy, due to its ease of use and good performance characteristics. The composable nature of NumPy, however, does not generally play well with the multi-tier nature of modern CPUs, making any non-trivial multi-step algorithm limited by the external memory access speeds, which are hundreds of times slower than the CPU's compute capabilities. In order to fully utilize the CPU compute capabilities, one must keep the working memory footprint small enough to fit in the CPU caches, which requires splitting the problem into smaller portions and fusing together as many steps as possible. In this paper, we present changes based on these principles to two important functions in the scikit-bio library, principal coordinates analysis and the Mantel test, that resulted in over 100x speed improvement in these widely used, general-purpose tools.
Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
1. Accelerating Key Bioinformatics Tasks 100-fold by Improving Memory Access
Igor Sfiligoi, Daniel McDonald and Rob Knight
University of California San Diego
2. High level overview
• Microbiome studies recently moved from 100s to 10k-100k samples
• But the tools were built with 1k samples in mind
• Many of the tools are written in Python
• With NumPy used for compute-intensive portions
• This talk focuses on pcoa and mantel, which are part of scikit-bio
• The above codes work well at small scales
• NumPy takes good care of using the compute resources
• But very slow at large(r) scales
• Lack of memory locality due to multi-step logic
• Re-implemented in Cython, with emphasis on memory locality
• Now scales well into the 100k samples range
3. A word about microbiome studies
• Microbiome studies are composed of samples from the real world, often collected from many different environments
• DNA sequencing technology is used to observe what microorganisms are present in a given sample
• Typical questions:
• How similar are the samples to each other?
• Can these similarities be explained by study variables? (e.g., salinity, pH, host-associated or not, etc.)
• How correlated are distance matrices created by different methods?
• Often comparing with samples from other studies, too, which increases the total number of samples to consider
4. A word about CPU memory subsystem
• CPUs compute several orders of magnitude faster than DRAM
• Intel Xeon Gold 6242 takes ~100 cycles to load from DRAM
• CPUs thus have internal caches to speed up access
• But limited in size, due to high cost
• L1 cache: 4 cycles, 32 KiB per core
• L2 cache: 14 cycles, 1 MiB per core
• L3 cache: ~50 cycles, 22.5 MiB per chip
(the above is for the Xeon, other CPUs similar)
• Minimizing off-cache access thus essential
• Not a new concept, but can be overlooked when datasets grow
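To make the cache-vs-DRAM gap concrete, here is a minimal timing sketch (ours, not from the deck; the array sizes are illustrative assumptions, and hardware prefetching narrows the gap relative to the raw latencies above):

import time
import numpy as np

def read_bandwidth(a, passes):
    # stream over the same buffer repeatedly, report effective GiB/s
    t0 = time.perf_counter()
    s = 0.0
    for _ in range(passes):
        s += a.sum()
    dt = time.perf_counter() - t0
    return a.nbytes * passes / 2**30 / dt

cache_resident = np.random.rand(4 * 1024 * 1024 // 8)   # 4 MiB, fits in L3
dram_resident = np.random.rand(2 * 1024**3 // 8)        # 2 GiB, DRAM-resident

print("cache-resident:", read_bandwidth(cache_resident, 2000), "GiB/s")
print("DRAM-resident: ", read_bandwidth(dram_resident, 4), "GiB/s")

On a typical server CPU the cache-resident sweep comes out several times faster, which is exactly the effect the re-implementations below exploit.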
5. Principal Coordinates Analysis
• The pcoa function is used to perform a Principal Coordinates Analysis of a distance matrix
• Two-step process: centering of the matrix, then computing approximate eigenvalues
• Surprisingly, the first step was the most expensive!
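For context, this is the scikit-bio entry point being discussed; a minimal usage sketch (ours), with a random symmetric, hollow matrix standing in for real data:

import numpy as np
from skbio import DistanceMatrix
from skbio.stats.ordination import pcoa

n = 100
m = np.random.rand(n, n)
dm = DistanceMatrix((m + m.T) / 2 * (1 - np.eye(n)))  # symmetric, zero diagonal

res = pcoa(dm)                        # centering, then eigendecomposition
print(res.proportion_explained[:3])   # variance explained by the first 3 axes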
6. Centering the matrix
• Original implementation simple and elegant
• But not memory efficient
• Multi-step process, with each arithmetic operation traversing the whole NumPy matrix
• Reads 8 and writes 5 matrix-sized buffers

import numpy as np

def e_matrix(distance_matrix):
    return distance_matrix * distance_matrix / -2

def f_matrix(E_matrix):
    row_means = E_matrix.mean(axis=1, keepdims=True)
    col_means = E_matrix.mean(axis=0, keepdims=True)
    matrix_mean = E_matrix.mean()
    return E_matrix - row_means - col_means + matrix_mean

def center_distance_matrix(distance_matrix):
    # distance_matrix must be of type np.ndarray and
    # be a 2D floating point matrix
    return f_matrix(e_matrix(distance_matrix))

OK for small matrices, but slow once the data cannot fit in cache
7. Centering the matrix
• To keep the active buffer in cache, we must expose parallelism
• Move the compute into the inner loop
• And use a tiling pattern
• Cannot do this in NumPy
• Re-implemented in Cython

def center_distance_matrix(mat, centered):
    row_means = np.empty(mat.shape[0])
    global_mean = e_matrix_means_cy(mat, centered, row_means)
    f_matrix_inplace_cy(row_means, global_mean, centered)
8. Centering the matrix
• To keep the active buffer in cache, we must expose parallelism
• Move the compute into the inner loop
• And use a tiling pattern
• Re-implemented in Cython (1/2): compute the means while creating E_matrix
• Harder to read, but much faster

(e_matrix and f_matrix from slide 6 were shown alongside for comparison)

def e_matrix_means_cy(float[:, :] mat,
                      float[:, :] centered, float[:] row_means):
    n_samples = mat.shape[0]
    global_sum = 0.0
    for row in prange(n_samples, nogil=True):
        row_sum = 0.0
        for col in range(n_samples):
            val = mat[row, col]
            val2 = -0.5 * val * val
            centered[row, col] = val2
            row_sum = row_sum + val2
        global_sum += row_sum
        row_means[row] = row_sum / n_samples
    return (global_sum / n_samples) / n_samples
9. Centering the matrix
• To keep the active buffer in cache, we must expose parallelism
• Move the compute into the inner loop
• And use a tiling pattern
• Re-implemented in Cython (2/2): extensively use tiling
• Harder to read, but much faster

(f_matrix from slide 6 was shown alongside for comparison)

def f_matrix_inplace_cy(float[:] row_means, float global_mean,
                        float[:, :] centered):
    n_samples = centered.shape[0]
    # use a tiled pattern to maximize locality of row_means,
    # since they are also column means
    for trow in prange(0, n_samples, 16, nogil=True):
        trow_max = min(trow+16, n_samples)
        for tcol in range(0, n_samples, 16):
            tcol_max = min(tcol+16, n_samples)
            for row in range(trow, trow_max, 1):
                gr_mean = global_mean - row_means[row]
                for col in range(tcol, tcol_max, 1):
                    centered[row, col] = centered[row, col] + (gr_mean - row_means[col])
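Why keeping only row_means is valid: a distance matrix is symmetric, so its column means equal its row means. A small NumPy sketch (ours, not from the deck) checking the fused math against the slide-6 version:

import numpy as np

n = 500
d = np.random.rand(n, n)
d = (d + d.T) / 2                      # symmetric, like a distance matrix

# multi-step version from slide 6
e = d * d / -2
ref = e - e.mean(axis=1, keepdims=True) - e.mean(axis=0, keepdims=True) + e.mean()

# fused version: compute E and its row means in one pass,
# then center in a second pass; col means == row means by symmetry
centered = d * d / -2
row_means = centered.mean(axis=1)
global_mean = row_means.mean()
centered += global_mean - row_means[:, None] - row_means[None, :]

print(np.allclose(ref, centered))      # True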
10. Observed speedup
• Between 2x and 30x, depending on available HW
• Bigger improvement on larger problems in full-CPU mode

CPU, number of used cores, matrix size        Original   Latest
AMD EPYC 7252, 1 core, 25k x 25k               4.1        2.1
AMD EPYC 7252, 8 cores, 25k x 25k              4.0        0.3
Intel Xeon Gold 6242, 1 core, 25k x 25k        7.7        2.3
Intel Xeon Gold 6242, 16 cores, 25k x 25k      7.2        0.3
AMD EPYC 7252, 1 core, 70k x 70k               100        30
AMD EPYC 7252, 8 cores, 70k x 70k              125        4.2
Intel Xeon Gold 6242, 1 core, 100k x 100k      255        78
Intel Xeon Gold 6242, 16 cores, 100k x 100k    254        9.1

Note: the original implementation cannot make effective use of multiple cores
11. The Mantel test
• The mantel function computes the correlation between two distance matrices using the Mantel test
• The statistical significance is derived from a Monte-Carlo procedure, where the input matrices are permuted K times (default K=999)
• Implication: the correlation function is typically computed 1k times on a whole matrix
• Extremely slow if the matrix cannot fit in cache
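As with pcoa, the function under discussion is the scikit-bio mantel entry point; a minimal usage sketch (ours), on random stand-in matrices:

import numpy as np
from skbio import DistanceMatrix
from skbio.stats.distance import mantel

def random_dm(n):
    m = np.random.rand(n, n)
    return DistanceMatrix((m + m.T) / 2 * (1 - np.eye(n)))

x, y = random_dm(60), random_dm(60)
corr, p_value, n_samples = mantel(x, y, method='pearson', permutations=999)
print(corr, p_value, n_samples)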
12. Original mantel implementation
• Again, simple and elegant, but not memory efficient
• Note that it relies on external functions for its core compute

import numpy as np
from scipy.stats import pearsonr

def _mantel_stats_pearson(x, y, permutations=999):
    # pearsonr operates on the flat representation
    # of the upper triangle of the symmetric matrix
    x_flat = x.condensed_form()
    y_flat = y.condensed_form()
    orig_stat = pearsonr(x_flat, y_flat)
    perm_gen = (pearsonr(x.permute(condensed=True), y_flat)
                for _ in range(permutations))
    permuted_stats = np.fromiter(perm_gen, np.float,
                                 count=permutations)
    return [orig_stat, permuted_stats]

def mantel(x, y, permutations=999):
    orig_stat, permuted_stats = _mantel_stats_pearson(x, y, permutations)
    count_better = (np.absolute(permuted_stats)
                    >= np.absolute(orig_stat)).sum()
    p_value = (count_better + 1) / (permutations + 1)
    return orig_stat, p_value
13. Original mantel implementation
• Same bullets and code as slide 12
• Memory locality requires partial compute, so the external functions must be inlined and re-implemented, too
• Simplified, the external pearsonr boils down to:
from numpy.linalg import norm

def pearsonr(x_flat, y_flat):
    xm = x_flat - x_flat.mean()
    ym = y_flat - y_flat.mean()
    normxm = norm(xm)
    normym = norm(ym)
    xnorm = xm / normxm
    ynorm = ym / normym
    return np.dot(xnorm, ynorm)
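As a sanity check (our sketch, not from the deck), the normalized-dot-product formulation above reproduces SciPy's Pearson r:

import numpy as np
from numpy.linalg import norm
from scipy.stats import pearsonr

x = np.random.rand(1000)
y = np.random.rand(1000)

xm = x - x.mean()
ym = y - y.mean()
r_manual = np.dot(xm / norm(xm), ym / norm(ym))

print(np.allclose(r_manual, pearsonr(x, y)[0]))   # True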
14. Additional optimization opportunities
• With fewer black boxes, we can optimize more, too
• The 2nd argument to pearsonr (y_flat) is always the same
• The mean and norm of a matrix do not change when the matrix is permuted
• Those results can therefore be computed once and reused across all permutations

(same code as slides 12 and 13, with the reusable values highlighted)
15. Optimized mantel implementation
• Inlining makes it harder to read
• But not much longer
• Pre-compute anything that can be reused
• A Cython kernel performs the permuted multi-matrix dot operation

def mantel_perm_cy(float[:, :] x_data, long[:, :] perm_order,
                   float xmean, float normxm,
                   float[:] ym_normalized,
                   float[:] permuted_stats):
    perms_n = perm_order.shape[0]
    out_n = perm_order.shape[1]
    # normalize x on the fly: (val - xmean)/normxm == val*mul + add
    mul = 1.0 / normxm
    add = -xmean / normxm
    for p in prange(perms_n, nogil=True):
        my_ps = 0.0
        for row in range(out_n - 1):
            vrow = perm_order[p, row]
            # offset of this row in the condensed (upper-triangle) form
            idx = row*(out_n-1) - ((row-1)*row)//2
            for icol in range(out_n - row - 1):
                col = icol + row + 1
                yval = ym_normalized[idx + icol]
                xval = x_data[vrow, perm_order[p, col]]*mul + add
                my_ps = yval*xval + my_ps
        permuted_stats[p] = my_ps

from numpy.linalg import norm

def _mantel(x, y, permutations):
    x_flat = x.condensed_form()
    y_flat = y.condensed_form()
    xmean = x_flat.mean()
    xm = x_flat - xmean
    ym = y_flat - y_flat.mean()
    normxm = norm(xm)
    normym = norm(ym)
    xnorm = xm / normxm
    ynorm = ym / normym
    orig_stat = np.dot(xnorm, ynorm)
    mat_n = x._data.shape[0]
    perm_order = np.empty([permutations, mat_n], dtype=np.int)
    for row in range(permutations):
        perm_order[row, :] = np.random.permutation(mat_n)
    permuted_stats = np.empty([permutations])
    mantel_perm_cy(x._data, perm_order, xmean, normxm,
                   ynorm, permuted_stats)
    return [orig_stat, permuted_stats]
16. Optimized mantel implementation
(repeats the code from slide 15, without the callouts)
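A pure-NumPy sketch (ours, not the scikit-bio code) of the trick the Cython kernel exploits: because x's mean and norm are permutation-invariant, re-running pearsonr on a permuted matrix reduces to a plain dot product against the pre-normalized y vector:

import numpy as np
from numpy.linalg import norm
from scipy.spatial.distance import squareform
from scipy.stats import pearsonr

n = 50
x = np.random.rand(n, n); x = (x + x.T) / 2; np.fill_diagonal(x, 0)
y = np.random.rand(n, n); y = (y + y.T) / 2; np.fill_diagonal(y, 0)

x_flat = squareform(x, checks=False)   # condensed upper triangle
y_flat = squareform(y, checks=False)

# precompute once, reuse for every permutation
xmean = x_flat.mean()
normxm = norm(x_flat - xmean)
ym = y_flat - y_flat.mean()
ynorm = ym / norm(ym)

p = np.random.permutation(n)
xp_flat = squareform(x[p][:, p], checks=False)   # permuted, condensed

r_fast = np.dot((xp_flat - xmean) / normxm, ynorm)
r_ref = pearsonr(xp_flat, y_flat)[0]
print(np.allclose(r_fast, r_ref))                # True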
17. Observed speedup
• Up to 200x
• Bigger improvement on larger problems in full-CPU mode
• Game changing for large problems: we would not even have considered computing those before

Again, the original implementation cannot make effective use of multiple cores

CPU, number of used cores, matrix size        Original   Latest
AMD EPYC 7252, 1 core, 10k^2                   2.19k      149
AMD EPYC 7252, 8 cores, 10k^2                  2.17k      24
Intel Xeon Gold 6242, 1 core, 10k^2            4.2k       170
Intel Xeon Gold 6242, 16 cores, 10k^2          4.2k       20
AMD EPYC 7252, 1 core, 20k^2                   8.9k       615
AMD EPYC 7252, 8 cores, 20k^2                  8.8k       97
Intel Xeon Gold 6242, 1 core, 20k^2            17.9k      933
Intel Xeon Gold 6242, 16 cores, 20k^2          17.5k      108
AMD EPYC 7252, 8 cores, 70k^2                  -          1.44k
Intel Xeon Gold 6242, 16 cores, 100k^2         -          4.2k
18. Validation costs
• Even something as trivial as checking for symmetry can be expensive if one does not consider memory locality
• The check is performed on each object instantiation!
• A 100k samples distance matrix would take several minutes!

CPU, number of used cores, matrix size        Original   Latest
AMD EPYC 7252, 1 core, 25k^2                   1.8        2.6
AMD EPYC 7252, 8 cores, 25k^2                  1.8        0.4
Intel Xeon Gold 6242, 1 core, 25k^2            4.4        3.2
Intel Xeon Gold 6242, 16 cores, 25k^2          4.4        0.24
AMD EPYC 7252, 1 core, 70k^2                   18         21
AMD EPYC 7252, 8 cores, 70k^2                  18         3.2
Intel Xeon Gold 6242, 1 core, 100k^2           219        78
Intel Xeon Gold 6242, 16 cores, 100k^2         212        5.4
19. Validation costs
• The solution, again, was Cython with tiling

import numpy as np

def is_symmetric_and_hollow(mat):
    # mat must be of type np.ndarray
    # and be a 2D floating point matrix
    not_sym = (mat.T != mat).any()
    not_hollow = (np.trace(mat) != 0)
    return [not not_sym, not not_hollow]

import numpy as np
from cython.parallel import prange

def is_symmetric_and_hollow_cy(float[:, :] mat):
    in_n = mat.shape[0]
    is_sym = True
    is_hollow = True
    # use a tiled approach to maximize memory locality
    for trow in prange(0, in_n, 16, nogil=True):
        trow_max = min(trow+16, in_n)
        for tcol in range(0, in_n, 16):
            tcol_max = min(tcol+16, in_n)
            for row in range(trow, trow_max, 1):
                for col in range(tcol, tcol_max, 1):
                    is_sym &= (mat[row, col] == mat[col, row])
            if trow == tcol:  # diagonal block
                for col in range(tcol, tcol_max, 1):
                    is_hollow &= (mat[col, col] == 0)
    return [(is_sym == True), (is_hollow == True)]
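The same tiling idea can be sketched in pure NumPy (our illustration, not the scikit-bio code): compare each block against its transposed counterpart while both are cache-resident, and bail out early on the first mismatch:

import numpy as np

def is_symmetric_tiled(mat, bs=1024):
    # block-wise comparison: only upper-triangle blocks are visited,
    # and the first asymmetric block short-circuits the scan
    n = mat.shape[0]
    for i in range(0, n, bs):
        for j in range(i, n, bs):
            if not np.array_equal(mat[i:i+bs, j:j+bs],
                                  mat[j:j+bs, i:i+bs].T):
                return False
    return True

m = np.random.rand(4000, 4000)
m = (m + m.T) / 2
print(is_symmetric_tiled(m))    # True
m[7, 3] = -1.0
print(is_symmetric_tiled(m))    # False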
20. Conclusions
• We re-implemented two often-used functions, pcoa and mantel, plus the related object validation function
• Now up to 100x faster
• Root cause was over-reliance on standard libraries, in particular Python's NumPy
• Code was simple and elegant, but not memory efficient
• New implementation uses lower-level constructs
• Used Cython to push most of the compute into the inner loop
• This will greatly help the microbiome community
• Improvements accepted into the scikit-bio library code-base
21. Acknowledgements
• This work was partially funded by the US National Science Foundation (NSF) under grants DBI-2038509, OAC-1541349 and OAC-1826967.