This document discusses extending high-performance computing (HPC) to integrate data analytics and connect to edge computing. It presents two use cases: 1) augmenting molecular dynamics workflows with in situ and in transit analytics to capture protein structural information, and 2) connecting HPC to sensors at the edge for precision farming applications involving soil moisture data prediction. The document outlines approaches for building closed-loop workflows that integrate simulation, data generation, analytics, and data feedback between HPC and edge resources to enable real-time decision making.
A brief intro about me and my past experience, a short history of data analytics and data science, and the latest trends in data science such as deep learning, graph analysis, distributed computing with Spark, and streaming computing.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer-reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
Thoughts on Knowledge Graphs & Deeper Provenance, by Paul Groth
Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
Statistics has never been a hacker's business.
In recent years, however, a need for "data science" has emerged.
This is the unprecedented intersection of computer science (or hacking), statistically relevant-looking graphics, and the so-called "Big Data".
The latter are heaps of data that are somewhat larger than we were used to.
With Perl and with Raku.
This presentation gives an overview of the main concepts introduced at the EDBT 2015 Summer School, which took place in Palamos. For each area, we summarize the main issues and current approaches. We also describe the challenges and the main activities that were undertaken at the summer school.
Visualizing and Clustering Life Science Applications in Parallel, by Geoffrey Fox
HiCOMB 2015, the 14th IEEE International Workshop on High Performance Computational Biology at IPDPS 2015, Hyderabad, India. This talk covers parallel data analytics for bioinformatics. Key messages:
- Always run MDS; it gives insight into the data and into the performance of the machine learning.
- MDS leads to a data browser, as GIS does for spatial data.
- 3D is better than 2D.
- ~20D better than MSA?
Clustering observations:
- Do you care about quality, or are you just cutting space into parts?
- Deterministic clustering is always more robust.
- Continuous clustering enables hierarchy.
- Trimmed clustering cuts off tails.
- Distinct O(N) and O(N^2) algorithms.
- Use conjugate gradient.
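The "use conjugate gradient" message refers to the iterative solver typically used inside large MDS and clustering computations. As a minimal illustration (a serial textbook sketch, not the parallel implementation the talk describes), here is conjugate gradient for a small symmetric positive-definite system:

```python
# Conjugate gradient for A x = b, with A symmetric positive-definite.
# Pure-Python sketch; real MDS codes run this in parallel on huge matrices.

def cg(A, b, tol=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (since x = 0)
    p = r[:]                      # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# 2x2 SPD system [[4,1],[1,3]] x = [1,2] has solution x = [1/11, 7/11].
x = cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

For MDS the matrix is far larger and the solve is parallelized, but the iteration structure is the same.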
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine learning-based systems relying on a wide range of different data sources. To be effective, these must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality, and traceability.
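As a deliberately tiny illustration of the embedding-based link prediction used in knowledge graph refinement, here is a TransE-style scoring sketch: a triple (h, r, t) is plausible when the embedding of h plus the embedding of r lands near the embedding of t. The entities and embeddings below are hand-set and hypothetical, not the inductive representations from the talk:

```python
import math

# TransE-style link-prediction score: plausible triples have h + r ≈ t,
# so a *lower* distance means a *more* plausible link.
# Toy hand-set 2-D embeddings (hypothetical entities/relations).
entity = {
    "Paris":  [1.0, 2.0],
    "France": [2.0, 4.0],
    "Berlin": [5.0, 0.0],
}
relation = {"capital_of": [1.0, 2.0]}

def score(h, r, t):
    hv, rv, tv = entity[h], relation[r], entity[t]
    return math.dist([hv[0] + rv[0], hv[1] + rv[1]], tv)

# (Paris, capital_of, France) should score better (lower) than
# (Paris, capital_of, Berlin):
good = score("Paris", "capital_of", "France")
bad = score("Paris", "capital_of", "Berlin")
```

Refinement systems train such embeddings from the existing graph and then rank candidate links by this kind of score.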
Efficient Index for Very Large Datasets with Higher Dimensions, AM Publications, India
The main aim of this paper is to develop a new dynamic indexing structure that supports very large datasets and high dimensionality. The new structure is tree-based, facilitates efficient access, and is highly adaptable to many types of applications. It is based on the nearest-neighbour method and avoids linearly scanning the very large dataset. The NewTree minimizes the adverse effects of the curse of dimensionality, under which most existing indexing techniques degrade rapidly as dimensionality grows; a major drawback there is the retrieval of subsets from the huge storage system. The NewTree handles insertions efficiently and effectively: when new data are added, the shape of the structure does not change. The performance of the new structure is evaluated against the SR-tree, an existing indexing structure, and the results show that it is superior to the SR-tree in both time and memory complexity.
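The NewTree itself is not specified in enough detail here to reproduce, so as a generic sketch of what tree-based nearest-neighbour indexing buys over a linear scan, here is a minimal k-d tree (illustrative only; the NewTree's actual layout and split rules will differ):

```python
import math

# Minimal k-d tree: median split on cycling axes, nearest-neighbour query
# that prunes subtrees whose splitting plane is farther than the best so far.

def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
        "axis": axis,
    }

def nearest(node, target, best=None):
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Only descend the far side if the splitting plane is closer than best.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
```

A linear scan touches every point on every query; the tree prunes whole subtrees, which is the behaviour that degrades as dimensionality grows (the curse of dimensionality the paper targets).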
The Challenge of Deeper Knowledge Graphs for Science, by Paul Groth
Over the past 5 years, we have seen multiple successes in the development of knowledge graphs for supporting science in domains ranging from drug discovery to social science. However, to really improve scientific productivity, we need to expand and deepen our knowledge graphs. To do so, I believe we need to address two critical challenges: 1) dealing with low-resource domains; and 2) improving quality. In this talk, I describe these challenges in detail and discuss some efforts to overcome them through the application of techniques such as unsupervised learning, the use of non-experts in expert domains, and the integration of action-oriented knowledge (i.e. experiments) into knowledge graphs.
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Webinar: Using R for Advanced Analytics with MongoDB, by MongoDB
In the age of big data, organizations rely on data scientists to provide critical decision support and predictive analysis. The life sciences now leverage new kinds of data to innovate, address unmet medical needs, and capture new markets.
MongoDB’s flexible schema and incredible scalability make it a natural choice for storing the diverse data sets needed to accomplish these tasks. In this session, we will explore the tools and design patterns available to the data scientist to harness the power of MongoDB for data preparation and enrichment. We will also discuss R for advanced analytics, utilizing the MongoDB R connector for data movement.
In this webinar, we will cover:
- Design your MongoDB schema and utilize the aggregation framework for data preparation and enrichment.
- Connect R/RStudio to your MongoDB environment. Understand connectors available and be able to choose the right deployment topology for production.
- Recognize the analytical patterns and apply MongoDB and R together to deliver cutting-edge insights in your organization.
- Learn to communicate the value to both technical and research audiences. Examine use cases such as genome annotation, disease spread via geospatial analysis, drug side-effects monitoring in social media using text analytics, and more.
Addresses streaming data challenges in sampling rates, cache maintenance, deductive reasoning, and the surrounding Semantic Web framework. Using a fixed-size cache, the challenge is to identify and preserve assertions within a stream. Deductive reasoning is continuously performed over the cache to draw relevant conclusions as quickly as possible. The use of a cache differentiates our work from state-of-the-art approaches to deductive stream reasoning in that the cache enables us to temporarily store propositions that are no longer in the stream window.
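A minimal sketch of the fixed-size cache idea (illustrative only, using a simple least-recently-used eviction policy; the work's actual strategy for deciding which assertions to preserve may be more sophisticated):

```python
from collections import OrderedDict

# Fixed-size cache over a stream of assertions. Unlike a pure sliding
# window, an assertion stays cached after it leaves the window, until
# capacity forces eviction of the least-recently-observed entry.
class AssertionCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # assertion -> present (ordered by recency)

    def observe(self, assertion):
        if assertion in self.cache:
            self.cache.move_to_end(assertion)   # re-observation refreshes recency
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least-recently-used
            self.cache[assertion] = True

    def assertions(self):
        return list(self.cache)

cache = AssertionCache(capacity=3)
for triple in [("a", "type", "Sensor"),
               ("a", "reading", "42"),
               ("b", "type", "Sensor"),
               ("a", "type", "Sensor"),    # refresh, no eviction
               ("c", "type", "Sensor")]:   # evicts ("a", "reading", "42")
    cache.observe(triple)
```

A reasoner would then run continuously over `cache.assertions()` rather than over the raw stream window.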
Using R for Advanced Analytics with MongoDB, by MongoDB
Speaker: Jane Uyvova, Sr. Solutions Architect, MongoDB
Level: 300 (Advanced)
Track: Data Analytics
In the age of big data, organizations rely on data scientists to provide critical decision support and predictive analysis. Most industries now leverage new kinds of data to innovate, understand their customers, and capture new markets. MongoDB’s flexible schema and scalability make it a natural choice for storing the diverse data sets needed to accomplish these tasks.
In this session, we will explore the tools and design patterns available to the data scientist to harness the power of MongoDB for data preparation and enrichment. We will focus on R for advanced analytics, utilizing mongolite as well as the MongoDB Spark Connector R API.
What You Will Learn:
- How to connect R to your MongoDB environment. Understand connectors available and be able to choose the right deployment topology for production.
- How to design your MongoDB schema and utilize the aggregation framework for data preparation and enrichment.
- How to recognize the analytical patterns required and apply MongoDB and R together to deliver key insights in your organization.
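The session itself does not include code, but the aggregation-framework data-preparation step it describes can be sketched as a pipeline document. The collection fields below are hypothetical, and in R the same pipeline JSON would be handed to mongolite's aggregate method; the snippet also emulates in plain Python what MongoDB would return, so the logic is checkable without a server:

```python
# Hypothetical gene-annotation documents; the pipeline is the kind of
# $match/$group data preparation the session covers.
pipeline = [
    {"$match": {"species": "human"}},
    {"$group": {"_id": "$chromosome", "genes": {"$sum": 1}}},
]

docs = [
    {"gene": "TP53",  "species": "human", "chromosome": "17"},
    {"gene": "BRCA2", "species": "human", "chromosome": "13"},
    {"gene": "BRCA1", "species": "human", "chromosome": "17"},
    {"gene": "Trp53", "species": "mouse", "chromosome": "11"},
]

# Plain-Python emulation of what the server would compute for this pipeline:
matched = [d for d in docs if d["species"] == "human"]
counts = {}
for d in matched:
    counts[d["chromosome"]] = counts.get(d["chromosome"], 0) + 1
```

Pushing the `$match`/`$group` work to the database like this is the "data preparation and enrichment" pattern: R then only receives the small aggregated result.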
Rattle is Free (as in Libre) Open Source Software and the source code is available from the Bitbucket repository. We give you the freedom to review the code, to use it for whatever purpose you like, and to extend it however you like, without restriction, except that if you then distribute your changes you must also distribute your source code.
Rattle - the R Analytical Tool To Learn Easily - is a popular GUI for data mining using R. It presents statistical and visual summaries of data, transforms data so that it can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets. One of its most important features (in my view) is that all of your interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface.
Rattle clocks between 10,000 and 20,000 installations per month from the RStudio CRAN node (one of over 100 nodes). Rattle has been downloaded several million times overall.
In this deck from the Stanford HPC Conference, Shahin Khan from OrionX describes major market shifts in IT.
"We will discuss the digital infrastructure of the future enterprise and the state of these trends."
"We work with clients on the impact of Digital Transformation (DX) on them, their customers, and their messages. Generally, they want to track, in one place, trends like IoT, 5G, AI, Blockchain, and Quantum Computing. And they want to know what these trends mean, how they affect each other, and when they demand action, and how to formulate and execute an effective plan. If that describes you, we can help."
Watch the video: https://wp.me/p3RLHQ-lPP
Learn more: http://orionx.net
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Preparing to program Aurora at Exascale - Early experiences and future directions, inside-BigData.com
In this deck from IWOCL / SYCLcon 2020, Hal Finkel from Argonne National Laboratory presents: Preparing to program Aurora at Exascale - Early experiences and future directions.
"Argonne National Laboratory’s Leadership Computing Facility will be home to Aurora, our first exascale supercomputer. Aurora promises to take scientific computing to a whole new level, and scientists and engineers from many different fields will take advantage of Aurora’s unprecedented computational capabilities to push the boundaries of human knowledge. In addition, Aurora’s support for advanced machine-learning and big-data computations will enable scientific workflows incorporating these techniques along with traditional HPC algorithms. Programming the state-of-the-art hardware in Aurora will be accomplished using state-of-the-art programming models. Some of these models, such as OpenMP, are long-established in the HPC ecosystem. Other models, such as Intel’s oneAPI, based on SYCL, are relatively-new models constructed with the benefit of significant experience. Many applications will not use these models directly, but rather, will use C++ abstraction libraries such as Kokkos or RAJA. Python will also be a common entry point to high-performance capabilities. As we look toward the future, features in the C++ standard itself will become increasingly relevant for accessing the extreme parallelism of exascale platforms.
This presentation will summarize the experiences of our team as we prepare for Aurora, exploring how to port applications to Aurora’s architecture and programming models, and distilling the challenges and best practices we’ve developed to date. oneAPI/SYCL and OpenMP are both critical models in these efforts, and while the ecosystem for Aurora has yet to mature, we’ve already had a great deal of success. Importantly, we are not passive recipients of programming models developed by others. Our team works not only with vendor-provided compilers and tools, but also develops improved open-source LLVM-based technologies that feed both open-source and vendor-provided capabilities. In addition, we actively participate in the standardization of OpenMP, SYCL, and C++. To conclude, I’ll share our thoughts on how these models can best develop in the future to support exascale-class systems."
Watch the video: https://wp.me/p3RLHQ-lPT
Learn more: https://www.iwocl.org/iwocl-2020/conference-program/
and
https://www.anl.gov/topic/aurora
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Greg Wahl from Advantech presents: Transforming Private 5G Networks.
Advantech Networks & Communications Group is driving innovation in next-generation network solutions with their High Performance Servers. We provide business-critical hardware to the world's leading telecom and networking equipment manufacturers with both standard and customized products. Our High Performance Servers are highly configurable platforms designed to balance the best in x86 server-class processing performance with maximum I/O and offload density. The systems are cost-effective, highly available and optimized to meet next-generation networking and media processing needs.
“Advantech’s Networks and Communication Group has been both an innovator and trusted enabling partner in the telecommunications and network security markets for over a decade, designing and manufacturing products for OEMs that accelerate their network platform evolution and time to market,” said Ween Niu, Advantech Vice President of Networks & Communications Group. “In the new IP Infrastructure era, we will be expanding our expertise in Software Defined Networking (SDN) and Network Function Virtualization (NFV), two of the essential conduits to 5G infrastructure agility, making networks easier to install, secure, automate and manage in a cloud-based infrastructure.”
In addition to innovation in air interface technologies and architecture extensions, 5G will also need a new generation of network computing platforms to run the emerging software-defined infrastructure, one that provides the greater topology flexibility essential to deliver on the promises of high availability, high coverage, low latency and high bandwidth connections. This will open up new parallel industry opportunities through dedicated 5G network slices reserved for specific industries: video traffic, augmented reality, IoT, connected cars, etc. 5G unlocks many new doors, and one of the keys to its enablement lies in the elasticity and flexibility of the underlying infrastructure.
Advantech’s corporate vision is to enable an intelligent planet. The company is a global leader in the fields of IoT intelligent systems and embedded platforms. To embrace the trends of IoT, big data, and artificial intelligence, Advantech promotes IoT hardware and software solutions with the Edge Intelligence WISE-PaaS core to assist business partners and clients in connecting their industrial chains. Advantech is also working with business partners to co-create business ecosystems that accelerate the goal of industrial intelligence.
Watch the video: https://wp.me/p3RLHQ-lPQ
* Company website: https://www.advantech.com/
* Solution page: https://www2.advantech.com/nc/newsletter/NCG/SKY/benefits.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The Incorporation of Machine Learning into Scientific Simulations at Lawrence Livermore National Laboratory, inside-BigData.com
In this deck from the Stanford HPC Conference, Katie Lewis from Lawrence Livermore National Laboratory presents: The Incorporation of Machine Learning into Scientific Simulations at Lawrence Livermore National Laboratory.
"Scientific simulations have driven computing at Lawrence Livermore National Laboratory (LLNL) for decades. During that time, we have seen significant changes in hardware, tools, and algorithms. Today, data science, including machine learning, is one of the fastest growing areas of computing, and LLNL is investing in hardware, applications, and algorithms in this space. While the use of simulations to focus and understand experiments is well accepted in our community, machine learning brings new challenges that need to be addressed. I will explore applications for machine learning in scientific simulations that are showing promising results and further investigation that is needed to better understand its usefulness."
Watch the video: https://youtu.be/NVwmvCWpZ6Y
Learn more: https://computing.llnl.gov/research-area/machine-learning
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems? (inside-BigData.com)
In this deck from the Stanford HPC Conference, DK Panda from Ohio State University presents: How to Achieve High-Performance, Scalable and Distributed DNN Training on Modern HPC Systems?
"This talk will start with an overview of challenges being faced by the AI community to achieve high-performance, scalable and distributed DNN training on Modern HPC systems with both scale-up and scale-out strategies. After that, the talk will focus on a range of solutions being carried out in my group to address these challenges. The solutions will include: 1) MPI-driven Deep Learning, 2) Co-designing Deep Learning Stacks with High-Performance MPI, 3) Out-of- core DNN training, and 4) Hybrid (Data and Model) parallelism. Case studies to accelerate DNN training with popular frameworks like TensorFlow, PyTorch, MXNet and Caffe on modern HPC systems will be presented."
Watch the video: https://youtu.be/LeUNoKZVuwQ
Learn more: http://web.cse.ohio-state.edu/~panda.2/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ..., inside-BigData.com
In this deck from the Stanford HPC Conference, Nick Nystrom and Paola Buitrago provide an update from the Pittsburgh Supercomputing Center.
Nick Nystrom is Chief Scientist at the Pittsburgh Supercomputing Center (PSC). Nick is the architect and PI for Bridges, PSC's flagship system that successfully pioneered the convergence of HPC, AI, and Big Data. He is also PI for the NIH Human Biomolecular Atlas Program’s HIVE Infrastructure Component and co-PI for projects that bring emerging AI technologies to research (Open Compass), apply machine learning to biomedical data for breast and lung cancer (Big Data for Better Health), and identify causal relationships in biomedical big data (the Center for Causal Discovery, an NIH Big Data to Knowledge Center of Excellence). His current research interests include hardware and software architecture, applications of machine learning to multimodal data (particularly for the life sciences) and to enhance simulation, and graph analytics.
Watch the video: https://youtu.be/LWEU1L1o7yY
Learn more: https://www.psc.edu/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Stanford HPC Conference, Ryan Quick from Providentia Worldwide describes how DNNs can be used to improve EDA simulation runs.
"Systems Intelligence relies on a variety of methods for providing insight into the core mechanisms for driving automated behavioral changes in self-healing command and control platforms. This talk reports on initial efforts with leveraging Semiconductor Electronic Design Automation (EDA) telemetry data from cross-domain sources including power, network, storage, nodes, and applications in neural networks as a driving method for insight into SI automation systems."
Watch the video: https://youtu.be/2WbR8tq-XbM
Learn more: http://www.providentiaworldwide.com/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring, inside-BigData.com
In this deck from the Stanford HPC Conference, Nicole Xu from Stanford University describes how she transformed a common jellyfish into a bionic creature that is part animal and part machine.
"Animal locomotion and bioinspiration have the potential to expand the performance capabilities of robots, but current implementations are limited. Mechanical soft robots leverage engineered materials and are highly controllable, but these biomimetic robots consume more power than corresponding animal counterparts. Biological soft robots from a bottom-up approach offer advantages such as speed and controllability but are limited to survival in cell media. Instead, biohybrid robots that comprise live animals and self- contained microelectronic systems leverage the animals’ own metabolism to reduce power constraints and body as an natural scaffold with damage tolerance. We demonstrate that by integrating onboard microelectronics into live jellyfish, we can enhance propulsion up to threefold, using only 10 mW of external power input to the microelectronics and at only a twofold increase in cost of transport to the animal. This robotic system uses 10 to 1000 times less external power per mass than existing swimming robots in literature and can be used in future applications for ocean monitoring to track environmental changes."
Watch the video: https://youtu.be/HrmJFyvInj8
Learn more: https://sanfrancisco.cbslocal.com/2020/02/05/stanford-research-project-common-jellyfish-bionic-sea-creatures/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Stanford HPC Conference, Peter Dueben from the European Centre for Medium-Range Weather Forecasts (ECMWF) presents: Machine Learning for Weather Forecasts.
"I will present recent studies that use deep learning to learn the equations of motion of the atmosphere, to emulate model components of weather forecast models and to enhance usability of weather forecasts. I will than talk about the main challenges for the application of deep learning in cutting-edge weather forecasts and suggest approaches to improve usability in the future."
Peter is contributing to the development and optimization of weather and climate models for modern supercomputers. He is focusing on a better understanding of model error and model uncertainty, on the use of reduced numerical precision that is optimised for a given level of model error, on global cloud-resolving simulations with ECMWF's forecast model, and on the use of machine learning, in particular deep learning, to improve workflows and predictions. Peter graduated in Physics and wrote his PhD thesis at the Max Planck Institute for Meteorology in Germany. He worked as a postdoc with Tim Palmer at the University of Oxford and took up a position as University Research Fellow of the Royal Society at the European Centre for Medium-Range Weather Forecasts (ECMWF) in 2017.
Watch the video: https://youtu.be/ks3fkRj8Iqc
Learn more: https://www.ecmwf.int/
and
http://www.hpcadvisorycouncil.com/events/2020/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Gilad Shainer from the HPC AI Advisory Council describes how this organization fosters innovation in the high performance computing community.
"The HPC-AI Advisory Council’s mission is to bridge the gap between high-performance computing (HPC) and Artificial Intelligence (AI) use and its potential, bring the beneficial capabilities of HPC and AI to new users for better research, education, innovation and product manufacturing, bring users the expertise needed to operate HPC and AI systems, provide application designers with the tools needed to enable parallel computing, and to strengthen the qualification and integration of HPC and AI system products."
Watch the video: https://wp.me/p3RLHQ-lNz
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Today RIKEN in Japan announced that the Fugaku supercomputer will be made available for research projects aimed at combating COVID-19.
"Fugaku is currently being installed and is scheduled to be available to the public in 2021. However, faced with the devastating disaster unfolding before our eyes, RIKEN and MEXT decided to make a portion of the computational resources of Fugaku available for COVID-19-related projects ahead of schedule while continuing the installation process.
Fugaku is being developed not only for progress in science, but also to help build what the Japanese government has dubbed “Society 5.0”, a society in which all people live safe and comfortable lives. The current initiative to fight the novel coronavirus is driven by the philosophy behind the development of Fugaku."
Initial Projects
- Exploring new drug candidates for COVID-19 by "Fugaku" (Yasushi Okuno, RIKEN / Kyoto University)
- Prediction of conformational dynamics of proteins on the surface of SARS-CoV-2 using Fugaku (Yuji Sugita, RIKEN)
- Simulation analysis of pandemic phenomena (Nobuyasu Ito, RIKEN)
- Fragment molecular orbital calculations for COVID-19 proteins (Yuji Mochizuki, Rikkyo University)
In this deck from the Performance Optimisation and Productivity group, Lubomir Riha from IT4Innovations presents: Energy Efficient Computing using Dynamic Tuning.
"We now live in a world of power-constrained architectures and systems and power consumption represents a significant cost factor in the overall HPC system economy. For these reasons, in recent years researchers, supercomputing centers and major vendors have developed new tools and methodologies to measure and optimize the energy consumption of large-scale high performance system installations. Due to the link between energy consumption, power consumption and execution time of an application executed by the final user, it is important for these tools and the methodology used to consider all these aspects, empowering the final user and the system administrator with the capability of finding the best configuration given different high level objectives.
This webinar focused on tools designed to improve the energy-efficiency of HPC applications using a methodology of dynamic tuning of HPC applications, developed under the H2020 READEX project. The READEX methodology has been designed for exploiting the dynamic behaviour of software. At design time, different runtime situations (RTS) are detected and optimized system configurations are determined. RTSs with the same configuration are grouped into scenarios, forming the tuning model. At runtime, the tuning model is used to switch system configurations dynamically.
The MERIC tool, that implements the READEX methodology, is presented. It supports manual or binary instrumentation of the analysed applications to simplify the analysis. This instrumentation is used to identify and annotate the significant regions in the HPC application. Automatic binary instrumentation annotates regions with significant runtime. Manual instrumentation, which can be combined with automatic, allows code developer to annotate regions of particular interest."
Watch the video: https://wp.me/p3RLHQ-lJP
Learn more: https://pop-coe.eu/blog/14th-pop-webinar-energy-efficient-computing-using-dynamic-tuning
and
https://code.it4i.cz/vys0053/meric
Sign up for our insideHPC Newsletter: http://insidehpc.com/newslett
In this deck from GTC Digital, William Beaudin from DDN presents: HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD.
Enabling high performance computing through the use of GPUs requires an incredible amount of IO to sustain application performance. We'll cover architectures that enable extremely scalable applications through the use of NVIDIA’s SuperPOD and DDN’s A3I systems.
The NVIDIA DGX SuperPOD is a first-of-its-kind artificial intelligence (AI) supercomputing infrastructure. DDN A³I with the EXA5 parallel file system is a turnkey, AI data storage infrastructure for rapid deployment, featuring faster performance, effortless scale, and simplified operations through deeper integration. The combined solution delivers groundbreaking performance, deploys in weeks as a fully integrated system, and is designed to solve the world's most challenging AI problems.
Watch the video: https://wp.me/p3RLHQ-lIV
Learn more: https://www.ddn.com/download/nvidia-superpod-ddn-a3i-ai400-appliance-with-the-exa5-filesystem/
and
https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck, Paul Isaacs from Linaro presents: State of ARM-based HPC. This talk provides an overview of applications and infrastructure services successfully ported to Aarch64 and benefiting from scale.
"With its debut on the TOP500, the 125,000-core Astra supercomputer at New Mexico’s Sandia Labs uses Cavium ThunderX2 chips to mark Arm’s entry into the petascale world. In Japan, the Fujitsu A64FX Arm-based CPU in the pending Fugaku supercomputer has been optimized to achieve high-level, real-world application performance, anticipating up to one hundred times the application execution performance of the K computer. K was the first computer to top 10 petaflops in 2011."
Watch the video: https://wp.me/p3RLHQ-lIT
Learn more: https://www.linaro.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
Today Xilinx announced Versal Premium, the third series in the Versal ACAP portfolio. The Versal Premium series features highly integrated, networked and power-optimized cores and the industry’s highest bandwidth and compute density on an adaptable platform. Versal Premium is designed for the highest bandwidth networks operating in thermally and spatially constrained environments, as well as for cloud providers who need scalable, adaptable application acceleration.
Versal is the industry’s first adaptive compute acceleration platform (ACAP), a revolutionary new category of heterogeneous compute devices with capabilities that far exceed those of conventional silicon architectures. Developed on TSMC’s 7-nanometer process technology, Versal Premium combines software programmability with dynamically configurable hardware acceleration and pre-engineered connectivity and security features to enable a faster time-to- market. The Versal Premium series delivers up to 3X higher throughput compared to current generation FPGAs, with built-in Ethernet, Interlaken, and cryptographic engines that enable fast and secure networks. The series doubles the compute density of currently deployed mainstream FPGAs and provides the adaptability to keep pace with increasingly diverse and evolving cloud and networking workloads.
Learn more: https://insidehpc.com/2020/03/xilinx-announces-versal-premium-acap-for-network-and-cloud-acceleration/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
In this video from the Rice Oil & Gas Conference, Chin Fang from Zettar presents: Moving Massive Amounts of Data across Any Distance Efficiently.
The objective of this talk is to present two on-going projects aiming at improving and ensuring highly efficient bulk transferring or streaming of massive amounts of data over digital connections across any distance. It examines the current state of the art, a few very common misconceptions, the differences among the three major type of data movement solutions, a current initiative attempting to improve the data movement efficiency from the ground up, and another multi-stage project that shows how to conduct long distance large scale data movement at speed and scale internationally. Both projects have real world motivations, e.g. the ambitious data transfer requirements of Linac Coherent Light Source II (LCLS-II) [1], a premier preparation project of the U.S. DOE Exascale Computing Initiative (ECI) [2]. Their immediate goals are described and explained, together with the solution used for each. Findings and early results are reported. Possible future works are outlined.
Watch the video: https://wp.me/p3RLHQ-lBX
Learn more: https://www.zettar.com/
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from the Rice Oil & Gas Conference, Bradley McCredie from AMD presents: Scaling TCO in a Post Moore's Law Era.
"While foundries bravely drive forward to overcome the technical and economic challenges posed by scaling to 5nm and beyond, Moore’s law alone can provide only a fraction of the performance / watt and performance / dollar gains needed to satisfy the demands of today’s high performance computing and artificial intelligence applications. To close the gap, multiple strategies are required. First, new levels of innovation and design efficiency will supplement technology gains to continue to deliver meaningful improvements in SoC performance. Second, heterogenous compute architectures will create x-factor increases of performance efficiency for the most critical applications. Finally, open software frameworks, APIs, and toolsets will enable broad ecosystems of application level innovation."
Watch the video:
Learn more: http://amd.com
and
https://rice2020oghpc.rice.edu/program-2/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
In this deck from the ECSS Symposium, Abe Stern from NVIDIA presents: CUDA-Python and RAPIDS for blazing fast scientific computing.
"We will introduce Numba and RAPIDS for GPU programming in Python. Numba allows us to write just-in-time compiled CUDA code in Python, giving us easy access to the power of GPUs from a powerful high-level language. RAPIDS is a suite of tools with a Python interface for machine learning and dataframe operations. Together, Numba and RAPIDS represent a potent set of tools for rapid prototyping, development, and analysis for scientific computing. We will cover the basics of each library and go over simple examples to get users started. Finally, we will briefly highlight several other relevant libraries for GPU programming."
Watch the video: https://wp.me/p3RLHQ-lvu
Learn more: https://developer.nvidia.com/rapids
and
https://www.xsede.org/for-users/ecss/ecss-symposium
Sign up for our insideHPC Newsletter: http://insidehp.com/newsletter
In this deck from FOSDEM 2020, Colin Sauze from Aberystwyth University describes the development of a RaspberryPi cluster for teaching an introduction to HPC.
"The motivation for this was to overcome four key problems faced by new HPC users:
* The availability of a real HPC system and the effect running training courses can have on the real system, conversely the availability of spare resources on the real system can cause problems for the training course.
* A fear of using a large and expensive HPC system for the first time and worries that doing something wrong might damage the system.
* That HPC systems are very abstract systems sitting in data centres that users never see, it is difficult for them to understand exactly what it is they are using.
* That new users fail to understand resource limitations, in part because of the vast resources in modern HPC systems a lot of mistakes can be made before running out of resources. A more resource constrained system makes it easier to understand this.
The talk will also discuss some of the technical challenges in deploying an HPC environment to a Raspberry Pi and attempts to keep that environment as close to a "real" HPC as possible. The issue to trying to automate the installation process will also be covered."
Learn more: https://github.com/colinsauze/pi_cluster
and
https://fosdem.org/2020/schedule/events/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
In this deck from ATPESC 2019, Ken Raffenetti from Argonne presents an overview of HPC interconnects.
"The Argonne Training Program on Extreme-Scale Computing (ATPESC) provides intensive, two-week training on the key skills, approaches, and tools to design, implement, and execute computational science and engineering applications on current high-end computing systems and the leadership-class computing systems of the future."
Watch the video: https://wp.me/p3RLHQ-luc
Learn more: https://extremecomputingtraining.anl.gov/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
2. Acknowledgements
T. Johnston, T. Estrada, M. Cuendet, E. Deelman, R. da Silva, H. Weinstein, T. Do, S. Thomas, B. Mulligan, A. Razavi, H. Carrillo, R. Vargas, D. Rorabaugh, R. Llamas, M. Guevara
Sponsors:
3. Trends in Next-Generation Systems: IO Gap and Ensembles
[Figure: rising importance of ensembles. System scale (FLOPS) vs. number of jobs (few to many), from Purple (2005, terascale) to Sequoia (2013, petascale) to Sierra (2018, exascale). Source: https://wci.llnl.gov/simulation/computer-codes/uncertainty-quantification]
[Figure: widening IO gap, with compute outgrowing IO by factors of 5x, 12x, 64x, and 13x across generations. Source: Lucy Nowell (DOE)]
3
5. Extending HPC to Connect to the “Edge”
Image Source: https://iiot-world.com/digital-disruption/the-right-representation-of-digital-twins-for-data-analytics/
Simulations ↔ Data Analytics ↔ Real System with Sensors
5
6. Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
6
7. Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
7
8. Classical Molecular Dynamics Simulations
Source: Vincent Voelz Group - CC BY-SA 3.0, http://www.voelzlab.org/
• An MD simulation comprises hundreds of thousands of MD jobs
• Each job performs hundreds of thousands of MD steps
8
9. Classical Molecular Dynamics Simulations
Forces on single atoms → Acceleration → Velocity → Position
• An MD step computes forces on single atoms (e.g., bond, dihedral, nonbonded terms)
• Forces are added to compute acceleration
• Acceleration is used to update velocities
• Velocities are used to update the atom positions
• Every n steps (the stride), a 3D snapshot or frame is stored
9
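This update cycle can be sketched with a velocity-Verlet integrator. The harmonic force below is a toy stand-in for a real force field's bond, dihedral, and nonbonded terms, and all names and parameters here are illustrative:

```python
import numpy as np

def harmonic_force(pos, k=1.0):
    # Toy stand-in for the real per-atom forces (bond, dihedral, nonbonded).
    return -k * pos

def run_job(pos, vel, mass=1.0, dt=0.01, n_steps=100, stride=5):
    """One MD job: velocity-Verlet steps, storing a frame every `stride` steps."""
    frames = []
    acc = harmonic_force(pos) / mass              # forces -> acceleration
    for step in range(1, n_steps + 1):
        pos = pos + vel * dt + 0.5 * acc * dt**2  # acceleration, velocity -> position
        new_acc = harmonic_force(pos) / mass
        vel = vel + 0.5 * (acc + new_acc) * dt    # acceleration -> velocity
        acc = new_acc
        if step % stride == 0:                    # every n steps (the stride)...
            frames.append(pos.copy())             # ...store a 3D snapshot (frame)
    return frames

frames = run_job(pos=np.ones((4, 3)), vel=np.zeros((4, 3)))
print(len(frames))  # 100 steps / stride 5 -> 20 stored frames
```

In a production code like GROMACS this loop runs for hundreds of thousands of steps per job; the stride controls how many frames reach the analytics downstream.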
10. Building a Closed-loop Workflow
Data Generation: MD code (e.g., GROMACS) runs n-stride simulation steps
10
11. Building a Closed-loop Workflow
Data Generation: MD code (e.g., GROMACS) runs n-stride simulation steps
Data Storage: a dataflow Ingestor (Plumed) pushes frames to an in-memory staging area (DataSpaces), a burst buffer, or the parallel file system
11
12. Building a Closed-loop Workflow
Data Generation: MD code (e.g., GROMACS) runs n-stride simulation steps
Data Storage: a dataflow Ingestor (Plumed) pushes frames to an in-memory staging area (DataSpaces), a burst buffer, or the parallel file system
Data Analytics: a dataflow Retriever feeds A4MD analytics (analytics representations + algorithms, ML-inferred algorithms, collective variables)
12
13. Building a Closed-loop Workflow
Data Generation: MD code (e.g., GROMACS) runs n-stride simulation steps
Data Storage: a dataflow Ingestor (Plumed) pushes frames to an in-memory staging area (DataSpaces), a burst buffer, or the parallel file system
Data Analytics: a dataflow Retriever feeds A4MD analytics (analytics representations + algorithms, ML-inferred algorithms, collective variables)
Data Feedback: analytics results are fed back to steer the running simulation, closing the loop
13
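The generation-storage-analytics-feedback loop can be sketched in a minimal single-process form, using a bounded deque as a stand-in for the in-memory staging area; `generate_frame`, the analytics step, and the feedback rule are hypothetical stand-ins, not the A4MD API:

```python
from collections import deque

def generate_frame(step, bias):
    # Stand-in for the MD code producing one frame per stride; `bias` plays
    # the role of a feedback term steering the simulation.
    return {"step": step, "value": bias + step}

staging = deque(maxlen=8)   # staging area keeps only the most recent frames
bias = 0.0                  # stand-in for a collective-variable feedback term

for step in range(1, 31):
    frame = generate_frame(step, bias)   # data generation
    staging.append(frame)                # ingestor: push frame into staging
    if step % 10 == 0:                   # analytics fires every 10 steps
        latest = staging.pop()           # retriever: pull the latest frame
        result = latest["value"] / step  # stand-in analytics
        bias = result                    # data feedback closes the loop
```

The bounded staging area is the key design point: frames that are overwritten before the retriever reads them are lost, which is exactly the behavior modeled later in the deck.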
14. Extending HPC to Integrate Data Analytics
[Diagram: two configurations of a cluster of compute nodes (CN). Left: the application runs on all CNs, writing data from node memory to local storage, and analysis runs separately against the parallel file system. Right: analysis runs on dedicated CNs alongside the application CNs, exchanging data + metadata through memory and the parallel file system.]
14
15. Augmenting HPC with In Situ and In Transit Analytics
[Diagram: in transit, the simulation on Nodes 1-3 streams data over the network interconnect to a separate analysis node; in situ, the simulation and the analysis share the cores and shared memory of a single node.]
Examples of tools:
• DataSpaces (Rutgers U.)
• DataStager (Georgia Tech)
15
16. In Situ and In Transit Analytics for MD Simulations
• We want to capture what is going on in each frame without:
§ Disrupting the simulation (e.g., stealing CPU and memory on the node)
§ Moving all the frames to a central file system and analyzing them once the simulation is over
§ Comparing each frame with past frames of the same job
§ Comparing each frame with frames of other jobs
Frames (or snapshots) of an MD trajectory with a stride of 5 steps:
Frame 55  Frame 60  Frame 65  Frame 70  Frame 75  Frame 80
16
17. Building a Closed-loop Workflow
Data Generation: MD code (e.g., GROMACS) runs n-stride simulation steps
Data Storage: a dataflow Ingestor (Plumed) pushes frames to an in-memory staging area (DataSpaces), a burst buffer, or the parallel file system
Data Analytics: a dataflow Retriever feeds A4MD analytics (analytics representations + algorithms, ML-inferred algorithms, collective variables)
Data Feedback: analytics results are fed back to steer the running simulation, closing the loop
17
18. Building a Closed-loop Workflow
Data Analytics in focus: the dataflow Retriever pulls frames from Data Storage into A4MD analytics (analytics representations + algorithms, ML-inferred algorithms, collective variables), whose results drive the Data Feedback to Data Generation
18
19. Proteins with Similar Functions
Key principle: proteins with similar structures have similar functions
• Measure millions of protein variants expressed from yeast or bacteria
• Design protein structures that produce desired properties (protein engineering)
19
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
20. Protein Representations
3D Cartesian representation, Surface representation, Multi-fold representation
20 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
21. From Multi-fold Representation to Image Encoding
21 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
22. From Multi-fold Representation to Image Encoding
22 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
23. From Multi-fold Representation to Image Encoding
23 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
24. High-Throughput Protein Analysis
• Eight biological processes from the biological process taxonomy in RCSB-PDB
• 62,991 proteins from the PDB
Proteins as 3D tensors are classified by a convolutional neural network (Google's Inception-v3, Gem-Net); predictions vs. ground truth: normalized confusion matrix, accuracy 80.66%
24
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
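The deck's multi-fold graphic encoding is not reproduced here; as a simplified illustration of the general idea of turning atom coordinates into a fixed-size 3D tensor that a CNN can consume, an occupancy-grid encoding might look like this (the `voxelize` helper and grid size are hypothetical):

```python
import numpy as np

def voxelize(coords, grid=16):
    """Map per-atom 3D coordinates into a grid x grid x grid occupancy tensor
    (atom counts per voxel), normalized to the frame's bounding box."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)            # avoid division by zero
    idx = ((coords - lo) / span * (grid - 1)).astype(int)
    tensor = np.zeros((grid, grid, grid), dtype=np.float32)
    for i, j, k in idx:
        tensor[i, j, k] += 1.0                        # count atoms per voxel
    return tensor

coords = np.random.default_rng(1).normal(size=(100, 3))  # toy "protein"
t = voxelize(coords)
print(t.shape, t.sum())  # (16, 16, 16) 100.0
```

A fixed tensor shape is what makes high-throughput analysis possible: proteins of very different sizes all map to the same input dimensions for the network.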
25. Capturing Changes in Folding with Transfer Learning
25
T. Estrada, et al. A Graphic Encoding Method for Quantitative Classification of Protein Structure and Representation of Conformational Changes. 2019
Protein: Opsin (Frame 50, Frame 1500, Frame 1950)
26. Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
26
27. Building a Closed-loop Workflow
Data Generation: MD code (e.g., GROMACS) runs n-stride simulation steps
Data Storage: a dataflow Ingestor (Plumed) pushes frames to an in-memory staging area (DataSpaces), a burst buffer, or the parallel file system
Data Analytics: a dataflow Retriever feeds A4MD analytics (analytics representations + algorithms, ML-inferred algorithms, collective variables)
Data Feedback: analytics results are fed back to steer the running simulation, closing the loop
27
28. Modeling Lost Frames
[Diagram: single node (in situ). The MD simulation and the data analytics share one node; a DataSpaces server manages a shared-memory region in main memory.]
28 S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
S1, S2, S3: generate MD frame
W1, W2, W3: write to shared memory
R1, R2, R3: read from shared memory
A1, A3: analyze frame (note the missing A2: frame 2 is overwritten before the analytics can process it, i.e., it is lost)
29. Dataflow for Modeling Lost Frames
Data Generation: generator of MD frames, feeding the dataflow Ingestor
Data Storage: Plumed in-memory staging area (DataSpaces)
Data Analytics: the dataflow Retriever feeds analytics representations: distance matrices and their λmax eigenvalues
29
T. Johnston et al. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
30. Eigenvalues: Proxy for Structural Changes
Frames (or snapshots) of an MD trajectory with a stride of 5 steps:
Frame 55  Frame 60  Frame 65  Frame 70  Frame 75  Frame 80
λmax,55  λmax,60  λmax,65  λmax,70  λmax,75  λmax,80
The distance between two max eigenvalues can serve as a proxy for the distance between the two associated conformations
30
T. Johnston et al. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
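The λmax proxy can be sketched as follows; toy random coordinates stand in for Cα positions, and the point is that a small perturbation of a conformation changes λmax only slightly, so |Δλmax| tracks conformational distance:

```python
import numpy as np

def lambda_max(coords):
    """Largest eigenvalue of the pairwise distance matrix of one frame."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # symmetric, zero diagonal
    return np.linalg.eigvalsh(dist)[-1]        # eigvalsh sorts ascending

rng = np.random.default_rng(0)
frame_a = rng.normal(size=(20, 3))                   # toy "conformation"
frame_b = frame_a + 0.01 * rng.normal(size=(20, 3))  # slightly perturbed copy

# |Δλmax| between two similar conformations stays small.
delta = abs(lambda_max(frame_a) - lambda_max(frame_b))
```

Reducing each frame to one scalar per distance matrix is what makes this usable in situ: comparing eigenvalues is far cheaper than comparing full 3D structures.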
31. [Figure: a single frame at time t holds Nα Cα atoms, Cα_1 through Cα_Nα.]
31
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
32. [Figure: the chain Cα_1 … Cα_Nα of a single frame at time t is split into two segments of Nα/2 Cα atoms each; their distance matrix [[0, d_ij], [d_ji, 0]] yields a single λmax.]
32
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
33. [Figure: alternatively, the same frame is split into Nα/2 segments of 2 Cα atoms each: (Cα_1 - Cα_2)(Cα_3 - Cα_4), (Cα_5 - Cα_6)(Cα_7 - Cα_8), …; each small distance matrix [[0, d_ij], [d_ji, 0]] yields one eigenvalue, giving λmax,1, λmax,2, …, λmax,Nα/2.]
33
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
34. [Figure: comparison of the two decompositions. Small segments produce many small distance matrices; large segments produce few large matrices. The segment size is a proxy for the number of matrices and their sizes.]
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
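The segment-size trade-off can be sketched in NumPy. This is a toy example with synthetic coordinates; `segment_lambdas` is a hypothetical helper for illustration, not part of the A4MD code.

```python
import numpy as np

def segment_lambdas(coords, seg_len):
    """Split the Cα chain into consecutive segments of seg_len atoms,
    then compute one λmax per pair of adjacent segments.

    Small seg_len -> many small matrices; large seg_len -> few large ones.
    """
    n = (len(coords) // seg_len) * seg_len            # drop any ragged tail
    segments = coords[:n].reshape(-1, seg_len, 3)
    lams = []
    for a, b in zip(segments[0::2], segments[1::2]):  # adjacent segment pairs
        pts = np.concatenate([a, b])
        dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        lams.append(np.linalg.eigvalsh(dist)[-1])     # largest eigenvalue
    return np.array(lams)

coords = np.random.default_rng(1).normal(size=(64, 3))  # synthetic chain
many_small = segment_lambdas(coords, 2)    # 16 small matrices, 16 λmax values
few_large = segment_lambdas(coords, 16)    # 2 large matrices, 2 λmax values
```

Varying `seg_len` is exactly the knob the slides describe: it controls both the count and the size of the matrices the analytics must process per frame.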
35. 2-step Model: Fraction of Analyzed Frames
Systems: Trp cage (12,619 atoms), T cell receptor (81,092 atoms), GltPh (270,088 atoms)
Observables: small segment lengths (i.e., 2, 4, 6, 8, 10, 12, 14, 16, and 18)
[Figure: matrix period (s/matrix) vs. matrix size (# elements) for each system.]
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
36. 2-step Model: Fraction of Analyzed Frames
Observables: segment lengths (i.e., 2, 4, 6, 8, 10, 12, 14, 16, and 18)
Model: degree-2 polynomial obtained by least-squares fitting of the observables (R² = 0.88)
[Figure: matrix period (s/matrix) vs. matrix size (# elements), showing the observables and the fitted model.]
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
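The least-squares fit itself is a one-liner with NumPy. The numbers below are illustrative, not the measured data from the paper:

```python
import numpy as np

# Hypothetical observables: matrix size (# elements) vs. matrix period (s/matrix).
sizes = np.array([4, 16, 36, 64, 100, 144, 196, 256, 324], dtype=float)
periods = 1e-4 * sizes**2 + 5e-3 * sizes + 0.02      # synthetic quadratic trend
periods += np.random.default_rng(2).normal(scale=0.01, size=sizes.size)

# Least-squares fit of a degree-2 polynomial, as in the 2-step model.
coeffs = np.polyfit(sizes, periods, deg=2)
predicted = np.polyval(coeffs, sizes)

# Coefficient of determination R².
ss_res = np.sum((periods - predicted) ** 2)
ss_tot = np.sum((periods - periods.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

With real observables in place of the synthetic arrays, `r2` would be the slide's reported 0.88.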
37. 2-step Model: Fraction of Analyzed Frames
[Figure: the observables, the degree-2 polynomial model (R² = 0.88), and the absolute error between data and model; axes: fraction of analyzed frames and matrix period (s/matrix) vs. matrix size (# elements).]
38. 2-step Model: Frames Distribution
• Given a trajectory, we model the proportions p and q of analyzed frames (f) with periods k and k+1
§ Example: GltPh (270,088 atoms, TPS 318), trajectory of 1,000 frames.
[Figure: number of analyzed frames vs. period between analyzed frames, for strides 2, 6, and 12.]
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
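One way to read the p/q mixture: if the analytics keep up with an average period of T frames between analyzed frames and T is not an integer, the observed periods mix k = ⌊T⌋ and k+1. Under that assumption, which is a reading of the model rather than code from the paper:

```python
import math

def period_mix(avg_period):
    """Proportions (p, q) of inter-analysis periods k and k+1 that
    reproduce a non-integer average period, with p + q = 1."""
    k = math.floor(avg_period)
    q = avg_period - k          # fraction of gaps of length k + 1
    p = 1.0 - q                 # fraction of gaps of length k
    return k, p, q

k, p, q = period_mix(6.25)
# The mixture reproduces the average period: p*k + q*(k+1) == 6.25
assert abs(p * k + q * (k + 1) - 6.25) < 1e-12
```

For an average period of 6.25 frames, three quarters of the gaps have length 6 and one quarter have length 7, matching the two-bar histograms shown per stride.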
39. Case Study: 1BDD Protein Conformations
[Figure: frames 1330, 1360, and 1390.]
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
40. Case Study: 1BDD Protein Conformations
[Figure: frames 1330, 1360, and 1390, with λmax traces for helix pairs 1-2, 1-3, and 2-3.]
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
42. Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
43. Soil Moisture Data for Precision Farming
• Satellites collect raster data across the surface of the Earth
Image source: http://www.esa-soilmoisture-cci.org/
44. Soil Moisture: Incomplete Data
Image source: Ricardo Llamas, University of Delaware
Data source: ESA-CCI soil moisture database (http://www.esa-soilmoisture-cci.org/)
[Figure: December 2000 averages (m³/m³).]
45. Soil Moisture: Incomplete Data
Image source: Ricardo Llamas, University of Delaware
Data source: ESA-CCI soil moisture database (http://www.esa-soilmoisture-cci.org/)
[Figure: December 2000 averages (m³/m³).]
Causes of missing data:
● dense vegetation
● snow/ice cover
● frozen surface
● extremely dry surface
46. Soil Moisture: Coarse-grained Data
Original resolution: 27 km × 27 km
Desired resolution: 1 km × 1 km
Image source: McPherson et al., Using coarse-grained occurrence data to predict species distributions at finer spatial resolutions—possibilities and limitations, Ecological Modelling 192:499-522, 2006.
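A downscaling step from the 27 km grid toward a finer grid can be sketched with a simple k-nearest-neighbor regression, one of the model families discussed later in the deck. The grids and moisture values below are synthetic toys, and `knn_downscale` is an illustrative helper, not the production engine:

```python
import numpy as np

def knn_downscale(coarse_xy, coarse_vals, fine_xy, k=3):
    """Predict soil moisture on a fine grid from coarse-grid samples
    by averaging the k nearest coarse cells."""
    preds = np.empty(len(fine_xy))
    for i, p in enumerate(fine_xy):
        d = np.linalg.norm(coarse_xy - p, axis=1)     # distance to coarse cells
        nearest = np.argsort(d)[:k]                   # indices of k closest
        preds[i] = coarse_vals[nearest].mean()
    return preds

# Toy grids: 3x3 coarse cells downscaled onto a 9x9 fine grid.
cx, cy = np.meshgrid(np.arange(3), np.arange(3))
coarse_xy = np.column_stack([cx.ravel(), cy.ravel()]).astype(float)
coarse_vals = 0.1 + 0.05 * coarse_xy[:, 0]            # moisture in m³/m³
fx, fy = np.meshgrid(np.linspace(0, 2, 9), np.linspace(0, 2, 9))
fine_xy = np.column_stack([fx.ravel(), fy.ravel()])
fine_vals = knn_downscale(coarse_xy, coarse_vals, fine_xy)
```

In practice the predictors would also include the landscape and weather covariates named on the next slides, not just spatial position.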
47. Building a Closed-loop Workflow
[Figure: the data generation stage. Satellite and sensor data: soil moisture (ESA-CCI), landscape (surface DSM), and weather data (NOAA), yielding coarse-grained, incomplete data.]
48. Building a Closed-loop Workflow
[Figure: data generation and data analytics. The coarse-grained, incomplete satellite and sensor data (soil moisture: ESA-CCI; landscape: surface DSM; weather data: NOAA) feed the A4MD analytics (representations + algorithms), which produce data predictions.]
49. Building a Closed-loop Workflow
[Figure: data generation, data analytics, and compute. The A4MD analytics run on containerized compute (Kubernetes, Singularity, Shifter, and Docker); the predicted fine-grained, complete soil moisture data are integrated in the Fire Dynamics Simulator (FDS) for controlled combustion and wildfire propagation.]
50. Building a Closed-loop Workflow
[Figure: the complete loop. Coarse-grained, incomplete satellite and sensor data (soil moisture: ESA-CCI; landscape: surface DSM; weather data: NOAA) feed the A4MD analytics (representations + algorithms) on containerized compute (Kubernetes, Singularity, Shifter, and Docker); the resulting fine-grained, complete data are integrated in the Fire Dynamics Simulator (FDS) for controlled combustion and wildfire propagation, and data feedback closes the cycle.]
53. Hybrid Algorithms for Analytics
[Figure: the data analytics stage of the closed-loop workflow: coarse-grained, incomplete data enter the A4MD analytics (representations + algorithms), and fine-grained, complete data are fed back.]
56. SBM vs. k-NN
[Figure: observations on piecewise linear data, with the surrogate-based model and the 2-nearest-neighbor model overlaid.]
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
61. Hybrid Piecewise Polynomial Modeling (HYPPO)
Surrogate-Based Modeling
• Use all sampled data
• Construct one polynomial
(single complex global model)
k Nearest Neighbors
• Use local data
• Compute the average
(many simple local models)
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
62. Hybrid Piecewise Polynomial Modeling (HYPPO)
Surrogate-Based Modeling
• Use all sampled data
• Construct one polynomial
(single complex global model)
k Nearest Neighbors
• Use local data
• Compute the average
(many simple local models)
Hybrid modeling → HYPPO
• Use local data
• Construct many polynomials
(many complex local models)
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
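The three approaches can be contrasted on toy non-smooth data. This is an illustrative sketch, not the HYPPO implementation from the paper; `knn_predict` and `hyppo_predict` are hypothetical helpers:

```python
import numpy as np

def knn_predict(x, xs, ys, k=2):
    """k-nearest-neighbor model: average the k closest observations."""
    idx = np.argsort(np.abs(xs - x))[:k]
    return ys[idx].mean()

def hyppo_predict(x, xs, ys, k=4, deg=2):
    """HYPPO-style hybrid: fit a local polynomial to the k nearest points."""
    idx = np.argsort(np.abs(xs - x))[:k]
    coeffs = np.polyfit(xs[idx], ys[idx], deg=min(deg, k - 1))
    return np.polyval(coeffs, x)

# Piecewise-linear toy data with a kink at x = 0 (a non-smooth surface).
xs = np.linspace(-2, 2, 21)
ys = np.abs(xs)

global_fit = np.polyval(np.polyfit(xs, ys, 2), 0.1)  # SBM: one global polynomial
local_knn = knn_predict(0.1, xs, ys)                 # k-NN: simple local average
local_poly = hyppo_predict(0.1, xs, ys)              # HYPPO: local polynomial
```

The global polynomial smooths over the kink, the k-NN average is piecewise constant, and the local polynomial tracks the non-smooth surface on both sides, which is the trade-off the slides above describe.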
63. [Figure: the sampled observations alone.]
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
66. [Figure: observations with the surrogate-based model, the k-nearest-neighbors model, and the hybrid piecewise polynomial model (HYPPO) overlaid.]
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
67. Case Study: Fine-grained Modeling of Mid-Atlantic Region
[Figure: soil moisture predictions colored by model degree (0-3).]
D. Rorabaugh, M. Guevara, R. Llamas, J. Kitson, R. Vargas, M. Taufer. SOMOSPIE: A modular SOil MOisture SPatial Inference Engine based on data driven decision. arXiv:1904.07754, 2019
68. Challenges and Opportunities (I)
• Two trends in HPC are impacting scientific applications:
§ Convergence of simulations and data analytics
§ Emergence of edge computing
• Applications in precision medicine and precision farming
are leveraging these trends
• There is a need to integrate these trends further in HPC
§ New challenges and opportunities for the HPC community
69. Challenges and Opportunities (II)
• Efficiency: Optimize the performance and power usage associated with data
generation, movement, and analytics
• Non-invasiveness: Capture knowledge from data without rewriting legacy
simulation codes or scripts
• Generality: Build workflows that support different types of analytics
across different applications and different data
• Portability: Execute combined compute and analytics across different
systems, including the edge, and with heterogeneous resources
• Scalability: Design methods for knowledge discovery at scale (e.g.,
scalable ML algorithms) for “compute + analytics + data” workflows