Over the past decade, the fourth paradigm of data-intensive science has rapidly become a major driving concept across application domains built around large-scale instruments, such as light sources and cutting-edge telescopes, that generate data at unprecedented scale. The success of data-intensive projects subsequently triggered the next generation of machine learning approaches. These new AI systems represent a paradigm shift from data processing pipelines towards a fifth paradigm of knowledge-centric cognitive applications, one that requires the integration of Big Data processing platforms and HPC technologies.
The talk addresses the impedance mismatch between the data-intensive and compute-intensive ecosystems by presenting the Spark-MPI approach, based on the MPI Process Management Interface (PMI). The approach was originally designed for building high-performance streaming image reconstruction pipelines at light source facilities. This talk will demonstrate Spark-MPI in the context of distributed deep learning applications by integrating the Apache Spark platform, PMI Exascale (PMIx), and the Horovod MPI-based training framework for TensorFlow.
A RAPID DEPLOYMENT BIG DATA COMPUTING PLATFORM FOR CLOUD ROBOTICS – IJCNCJournal
The primary contribution of this research is the production of a general cloud robotics architecture that leverages the established and evolving big data technologies. Prior research in this area has not released all details of their deployed architectures, which prevents experimental results from being replicated and verified. By providing a general-purpose architecture, it is hoped that this framework will allow future research to build upon and begin to create a standardised platform, where research can be easily repeated, validated and compared.

The secondary contribution is the critical evaluation of the design of cloud robotic architectures. Whilst prior research has demonstrated that cloud-based robotic processing is achievable via big data technologies, such research has not discussed the choice in design. With the ecosystem of big data technologies expanding in recent years, a review of the most relevant technologies for cloud robotics is appropriate to demonstrate and validate the proposed architectural design.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover a complete schedule of upcoming events, using OpenACC for a biophysics problem, HPC Summit Digital, an overview of the SDSC GPU Hackathon, the OmpSs-2 programming model, new resources and more!
Toffee – A highly efficient, lossless file format for DIA-MS – Brett Tully
The closed nature of vendor file formats in mass spectrometry is a significant barrier to progress in developing robust bioinformatics software. In response, the community has developed the open mzML format, implemented in XML and based on controlled vocabularies [1]. Widely adopted, mzML is an important step forward; however, it suffers from two challenges that are particularly apparent as the field moves to high-throughput proteomics: a) a large increase in file size, with a corresponding increase in CPU time devoted to I/O, and b) a largely sequential I/O access pattern. Described here is ‘toffee’, an open, random-I/O format backed by HDF5, with lossless compression that gives file sizes similar to the original vendor format and can be reconverted back to mzML without penalty. In addition to the file format, there are C++ and Python libraries for creating and accessing it, along with a wrapper around OpenSWATH [2] that enables SWATH-MS data to be analyzed with standard algorithms. Using this library, the files can be accessed in the same scan-by-scan manner as the vendor file (or mzML); however, by accepting a degree of mass approximation (<5 parts per million), toffee enables data to be extracted as a two-dimensional slice analogous to an image, and thus amenable to deep-learning based peptide identification strategies. Documentation and examples are available at https://toffee.readthedocs.io, and all code is MIT licensed at https://bitbucket.org/cmriprocan/toffee.
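The image-style access pattern described above can be sketched in a few lines. This is purely illustrative, not toffee's actual API: it models a run as a dense matrix of intensities indexed by (scan index, m/z bin), which is what makes two-dimensional, image-like slicing possible. All sizes and the Poisson noise model are made up for the example.

```python
import numpy as np

# Model a DIA-MS run as a dense (scan x m/z bin) intensity matrix.
rng = np.random.default_rng(0)
n_scans, n_mz_bins = 100, 500
intensities = rng.poisson(5.0, size=(n_scans, n_mz_bins)).astype(np.float64)

def extract_slice(data, scan_range, mz_range):
    """Return a 2D retention-time x m/z window, analogous to an image patch."""
    s0, s1 = scan_range
    m0, m1 = mz_range
    return data[s0:s1, m0:m1]

# A rectangular window that a convolutional model could consume directly.
patch = extract_slice(intensities, (10, 40), (100, 164))
print(patch.shape)  # (30, 64)
```

A scan-by-scan reader would instead return one row at a time; the point of the 2D view is that both axes are available at once.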
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights cover pseudo-random number generation, the first-ever MONAI Bootcamp, upcoming GPU Hackathons and Bootcamps, and new resources!
Controlled vocabularies and ontologies in Dataverse data repository – vty
Support for external controlled vocabularies is one of the features most requested by research communities. Slides for the Dataverse Community Meeting 2021 at Harvard University.
Clariah Tech Day: Controlled Vocabularies and Ontologies in Dataverse – vty
This presentation covers support for external controlled vocabularies in Dataverse, an open-source data repository. Data Archiving and Networked Services (DANS-KNAW) decided to use Dataverse as the base technology for building Data Stations and providing FAIR data services for various Dutch research communities.
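What external controlled-vocabulary support gives a repository can be sketched as follows: metadata values resolve to stable ontology URIs instead of staying free-text strings. The vocabulary, term names and URIs below are all hypothetical, chosen only to illustrate the lookup; they are not Dataverse's actual configuration format.

```python
# Hypothetical controlled vocabulary: label -> stable term URI.
VOCAB = {
    "chemistry": "https://example.org/vocab/fields/chemistry",
    "computer science": "https://example.org/vocab/fields/computer-science",
}

def resolve_term(value, vocab):
    """Return (normalized label, uri) for a known controlled term, else None."""
    key = value.strip().lower()
    uri = vocab.get(key)
    return (key, uri) if uri else None

# Free-text input is normalized and mapped to a machine-actionable URI.
print(resolve_term("Computer Science ", VOCAB))
print(resolve_term("astrology", VOCAB))  # not in the vocabulary
```

The normalization step is what makes deposited metadata interoperable: two datasets tagged "Chemistry" and "chemistry " end up pointing at the same term.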
Collective Mind: a collaborative curation tool for program optimization – Grigori Fursin
Designing and optimizing applications is becoming increasingly tedious, time-consuming, ad hoc and error-prone due to ever-changing and complex hardware and software stacks. At the same time, it is becoming difficult or even impossible to validate, reproduce and extend many of the optimization and auto-tuning techniques proposed in numerous publications. One of the main reasons is the lack of a common, practical way to preserve, systematize and reuse available knowledge and artifacts, including developments, optimizations and experimental data.
In this talk, I will present the modular, extensible, Python-based Collective Mind framework and its web-based, schema-free repository (c-mind.org), which I originally developed to systematize my own research and experimentation on machine-learning-based program optimization and compiler design. This infrastructure can be customized to preserve, describe, share and reproduce whole experimental setups, including benchmarks, data sets, libraries, tools, predictive models and optimization information, together with related JSON-based meta-data. I will also discuss the positive and negative feedback from several recent academic and industrial uses of this framework to systematize benchmarking and program optimization, and to initiate a new publication model in which experimental results and related research artifacts are shared, reproduced and validated by the community. In the long term, I hope that this approach and the resulting collective knowledge will help us squeeze maximum performance from computer systems while minimizing energy, development time and other costs.
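The schema-free, JSON-based meta-data idea can be sketched in a few lines: an experiment (benchmark, compiler, flags, measured result) is preserved as a plain JSON record that anyone can reload and re-validate. The field names below are illustrative, not Collective Mind's actual record layout.

```python
import json
import os
import tempfile

# Hypothetical experiment record; schema-free, so fields can vary per entry.
record = {
    "benchmark": "susan_corners",
    "compiler": "gcc-13",
    "flags": "-O3 -funroll-loops",
    "execution_time_s": 0.412,
}

# Preserve the record as JSON meta-data on disk...
path = os.path.join(tempfile.mkdtemp(), "experiment.json")
with open(path, "w") as f:
    json.dump(record, f, indent=2)

# ...and reload it later, e.g. when another group re-validates the result.
with open(path) as f:
    reloaded = json.load(f)

assert reloaded == record  # the record round-trips losslessly
```

Because the records are plain JSON, they can be indexed, diffed and shared without any framework-specific tooling, which is what makes community validation of results practical.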
Presentation of the paper "Primers or Reminders? The Effects of Existing Review Comments on Code Review" published at ICSE 2020.
Authors:
Davide Spadini, Gül Calikli, Alberto Bacchelli
Link to the paper: https://research.tudelft.nl/en/publications/primers-or-reminders-the-effects-of-existing-review-comments-on-c
Benchmarking open source deep learning frameworks – IJECEIAES
Deep Learning (DL) is one of the hottest fields in computing. To foster its growth, several open source frameworks have appeared, providing implementations of the most common DL algorithms. These frameworks vary in the algorithms they support and in the quality of their implementations. The purpose of this work is to provide a qualitative and quantitative comparison of three such frameworks: TensorFlow, Theano and CNTK. To make the study as comprehensive as possible, we consider multiple benchmark datasets from different fields (image processing, NLP, etc.) and measure the performance of the frameworks’ implementations of different DL algorithms. For most of our experiments, we find that CNTK’s implementations are superior to the others under consideration.
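The measurement methodology behind such a comparison can be sketched simply: time repeated runs of each framework's implementation of the same workload and compare a robust statistic such as the median. The "frameworks" below are stand-in Python functions, not TensorFlow, Theano or CNTK, and the harness is a generic sketch rather than the paper's actual setup.

```python
import statistics
import time

def benchmark(fn, repeats=5):
    """Median wall-clock time of `fn` over several runs (reduces noise)."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

# Stand-ins for two implementations of the same algorithm.
impl_small = lambda: sum(i * i for i in range(50_000))
impl_large = lambda: sum(i * i for i in range(100_000))

t_small = benchmark(impl_small)
t_large = benchmark(impl_large)
print(t_small < t_large)  # the lighter implementation should win
```

Using the median rather than the mean, and warming up before timing, are standard precautions when comparing framework implementations whose first run often includes graph compilation.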
Stay up-to-date on the latest news, research, and resources. This month's edition covers 2024 predictions across the HPC and AI industry, NSF's National Artificial Intelligence Research Resource (NAIRR) pilot, the role of compilers in scientific computing, on-demand and upcoming webinars, and more!
Enabling Application Integrated Proactive Fault Tolerance – Dai Yang
Exascale computing is the next major milestone for the HPC community. Due to a steadily increasing probability of failures, current applications must be made malleable so that they can cope with dynamic resource changes. In this paper, we show first results with LAIK, a lightweight library for dynamically re-distributable application data, which allows compute nodes to be freed of workload before a predicted failure. For a real-world application, we show that LAIK adds negligible overhead. In addition, we show the effect of different re-distribution strategies.
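The core idea can be sketched without the library: application data is described as re-distributable partitions, so when a failure predictor flags a node, that node's share is simply reassigned across the survivors. This is a minimal illustration of the concept, not the LAIK API; the round-robin strategy and node names are made up.

```python
def partition(items, nodes):
    """Round-robin items across the given node names (one re-distribution strategy)."""
    mapping = {n: [] for n in nodes}
    for i, item in enumerate(items):
        mapping[nodes[i % len(nodes)]].append(item)
    return mapping

data = list(range(12))
before = partition(data, ["node0", "node1", "node2", "node3"])

# Failure predictor flags node2: re-partition over the remaining nodes,
# freeing node2 of workload before it actually fails.
after = partition(data, ["node0", "node1", "node3"])

assert sorted(sum(after.values(), [])) == data  # no data lost
assert "node2" not in after                     # flagged node is now empty
```

Different re-distribution strategies (block, round-robin, locality-aware) trade migration volume against load balance, which is exactly the effect the paper measures.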
Programming Modes and Performance of Raspberry-Pi Clusters – AM Publications
Up-to-date information and knowledge on many subjects, for a wide range of applications, is now readily accessible over the Internet to researchers, enthusiasts, developers, and academics. The underlying framework facilitating this is the networking of servers, nodes, and personal computers. However, such setups, comprising mainframes, servers and networking devices, are costly, not portable, and inaccessible to many. In addition, students and lab-level enthusiasts lack the access needed to modify their functionality for specific purposes. The Raspberry-Pi (R-Pi) is a small device capable of many functions akin to supercomputing while being portable, economical and flexible. It runs open source Linux, making it a preferred choice for lab-level research and study. Users have started using its embedded networking capability to design portable clusters that replace costlier machines. This paper introduces new users to the most commonly used frameworks and some recent developments that best exploit the capabilities of the R-Pi when used in clusters. It also introduces some of the tools and measures that rate the efficiency of clusters, to help users assess the quality of a cluster design, and aims to make users aware of the various parameters in a cluster environment.
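Two of the standard measures used to rate cluster designs are parallel speedup and efficiency: speedup S = T1 / Tn (serial time over parallel time) and efficiency E = S / n for n nodes. A minimal sketch, with made-up example timings for a hypothetical R-Pi cluster:

```python
def speedup(t_serial, t_parallel):
    """Parallel speedup: serial runtime divided by parallel runtime."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_nodes):
    """Fraction of ideal linear scaling actually achieved (1.0 = perfect)."""
    return speedup(t_serial, t_parallel) / n_nodes

# Example: a job taking 120 s on one Pi finishes in 20 s on an 8-Pi cluster.
s = speedup(120.0, 20.0)        # 6.0x faster
e = efficiency(120.0, 20.0, 8)  # 0.75, i.e. 75% of ideal scaling
print(f"speedup={s:.1f}, efficiency={e:.2f}")
```

An efficiency well below 1.0, as here, usually points at communication overhead on the cluster's network fabric, which on R-Pi clusters is often the built-in Ethernet and thus the dominant bottleneck.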
OpenACC and Open Hackathons Monthly Highlights: May 2023 – OpenACC
Stay up-to-date on the latest news, research, and resources. This month's edition covers the call for speakers for the Open Accelerated Computing Summit, scheduled Open Hackathons and Bootcamps, an interview with Sunita Chandrasekaran, a call for proposals for the DOE's INCITE program, upcoming webinars, and more!
Fluid Network Planes – An overview of Network Refactoring and Offloading Trends. Keynote at IEEE Netsoft'19, Paris, 2019.
Keynote Description
Ten years have passed since the term SDN was coined in 2009. Since then, the three-letter acronym has kept evolving through ever-broadening definitions, to the point where SDN means little unless it is adequately elaborated technically.
Among the key aspects of SDN is the refactoring of the network control plane. At this crossroads, NFV introduces new ways to refactor network functions, be they control- or data-plane related.
Despite the softwarization flag of SDN and NFV, the hardware/software continuum is as relevant as ever, offering new offloading opportunities at node and network-wide scales.
In this talk, we will review the evolving transformations behind network softwarization, with a special focus on the network refactoring and offloading trends leading to “fluid network planes”, where the location and HW/SW embodiment of network functions become blurry: from edge to core, from one administrative provider to another, and from programmable silicon to portable, lightweight virtualized containers.
Bringing HPC Algorithms to Big Data Platforms: Spark Summit East talk by Niko... – Spark Summit
The talk will present an MPI-based extension of the Spark platform developed in the context of light source facilities. The background and rationale of this extension are described in the paper “Bringing the HPC reconstruction algorithms to Big Data platforms” [1], presented at the New York Scientific Data Summit (NYSDS), August 14-17, 2016 (talk: https://www.bnl.gov/nysds16/files/pdf/talks/NYSDS16%20Malitsky.pdf). Specifically, the paper highlighted a gap between two modern driving forces of the scientific discovery process: HPC and Big Data technologies. As a result, it proposed to extend the Spark platform with inter-worker communication to support scientifically oriented parallel applications. The approach was illustrated with a Spark-based deployment of the SHARP MPI/GPU ptychographic solver. Aside from its practical value, this application represents a reference use case that captures the major technical aspects of other reconstruction tasks. In the NYSDS’16 paper, the implemented approach followed the CaffeOnSpark RDMA peer-to-peer model and augmented it with an RDMA address exchange server. By the Spark Summit, we plan to further advance this direction with a generic Spark-MPI solution based on the Hydra process management framework, supporting two major MPI implementations: MPICH and MVAPICH.
AN EMPIRICAL STUDY OF USING CLOUD-BASED SERVICES IN CAPSTONE PROJECT DEVELOPMENT – csandit
Cloud computing is gaining prominence and popularity in three important forms: Software as a Service, Platform as a Service, and Infrastructure as a Service. In this paper, we present an empirical study of how these cloud-based services were used in an undergraduate Computer Science capstone class to enable agile and effective development, testing, and deployment of sophisticated software systems, to facilitate team collaboration among students, and to ease project assessment and grading for teachers. In particular, students and teachers in this class could leverage time, talent, and resources collaboratively and in a distributed fashion, on their own schedules, from convenient locations, and across heterogeneous programming platforms, thanks to a completely All-In-Cloud environment. This eliminated the need to spend valuable development time on local setup, configuration, and maintenance, streamlined version control and group management, and greatly increased the collective productivity of student groups. Despite the relatively steep learning curve at the beginning of the semester, all nine groups of students benefitted tremendously from the All-In-Cloud experience, and eight of them completed their substantial software projects successfully. The paper concludes with a vision for expanding and standardizing the adoption of the Cloud ecosystem in other Computer Science classes in the future.
OpenACC and Hackathons Monthly Highlights: April 2023 – OpenACC
Stay up-to-date on the latest news, research and resources. This month's edition covers the Open Hackathon Mentor Program, a highlight from the recent UK National Hackathon, upcoming Open Hackathon and Bootcamp events, and more!
UberCloud HPC Experiment Introduction for Beginners – hpcexperiment
An introduction for beginners covering what the HPC Experiment is, how it works, how to participate in it, and an example project.
The PlanetData project was presented by Elena Simperl and Barry Norton of the Karlsruhe Institute of Technology at the 1st International Symposium on Data-driven Process Discovery and Analysis on June 30, 2011, in Campione d’Italia, Italy.
These slides were used at the ISO/IEC JTC1 SC36 Plenary Meeting on June 22, 2015.
The title of the slides is 'Proof of Concept for Learning Analytics Interoperability' and the subtitle is 'Reference Model based on open source SW'.
Similar to Spark-MPI: Approaching the Fifth Paradigm with Nikolay Malitsky
Data Lakehouse Symposium | Day 1 | Part 1 – Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 – Databricks
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop - Databricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform - Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
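The expectation-definition capability above can be sketched in miniature. The names below (`validate`, the expectation tuples) are hypothetical illustrations, not Zillow's platform API; a production system would run such checks as Spark jobs rather than plain Python:

```python
# Minimal sketch of producer-defined data quality expectations,
# checked before data reaches downstream consumers.
# All names here are hypothetical.

def validate(rows, expectations):
    """Return a list of violation messages for rows failing expectations."""
    violations = []
    for i, row in enumerate(rows):
        for column, check, description in expectations:
            if not check(row.get(column)):
                violations.append(f"row {i}: column '{column}' failed: {description}")
    return violations

# Producers declare expectations once; consumers can inspect them.
expectations = [
    ("price", lambda v: v is not None and v > 0, "price must be positive"),
    ("zipcode", lambda v: v is not None and len(str(v)) == 5, "zipcode must be 5 digits"),
]

rows = [
    {"price": 350000, "zipcode": "98101"},
    {"price": -1, "zipcode": "981"},  # fails both expectations
]
bad = validate(rows, expectations)
```

Flagging violations this early lets producers fix issues before any downstream consumer reads the data, which is the point of the platform capability described above.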
Learn to Use Databricks for Data Science - Databricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
Why APM Is Not the Same As ML Monitoring - Databricks
Application performance monitoring (APM) has become the cornerstone of software engineering, allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix - Databricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration - Databricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorch - Databricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may run into a problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes - Databricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with the scalable data processing of Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: understanding key traits of Apache Spark on Kubernetes; things to know when running Apache Spark on Kubernetes, such as autoscaling; and a demonstration of analytics pipelines running on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark Pipelines - Databricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
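The shared “fit”/“transform” abstraction the talk builds on can be sketched in a few lines of plain Python. This is a hypothetical minimal pipeline, not the Ray-based implementation described in the talk; Ray would additionally schedule the stages as parallel tasks:

```python
# Minimal sketch of the fit/transform pipeline abstraction shared by
# Scikit-Learn and Spark ML. Stage and class names are hypothetical.

class Scale:
    def fit(self, data):
        self.max = max(data) or 1.0  # fall back to 1.0 if all zeros
        return self
    def transform(self, data):
        return [x / self.max for x in data]

class Shift:
    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self
    def transform(self, data):
        return [x - self.mean for x in data]

class Pipeline:
    """Chain stages: each stage is fit on, then transforms, the running data."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

result = Pipeline([Scale(), Shift()]).fit_transform([1.0, 2.0, 3.0, 4.0])
```

Because every stage exposes the same two methods, the pipeline driver does not care which backend implements a stage, which is the unification property the talk exploits.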
Sawtooth Windows for Feature Aggregations - Databricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations over change data that are not abelian groups.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
· Why? Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working solution using Redis
Niche 2: Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
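The retry and speculative-execution precaution can be sketched without a Redis server: key each count by (partition, attempt) so re-executions overwrite rather than double-increment. The dict below is a stand-in for a Redis hash, and all names are hypothetical; real code would use HSET/HGETALL via a Redis client:

```python
# Sketch of retry-safe distributed counting. A Redis hash keyed by
# (partition, attempt) lets retries and speculative copies of a task
# overwrite their own slot instead of double-incrementing a counter.
# The dict below stands in for the Redis hash; all names are hypothetical.

counter_hash = {}

def record_count(partition_id, attempt, count):
    # HSET semantics: overwriting the same field is idempotent,
    # so a retried task cannot inflate the total.
    counter_hash[(partition_id, attempt)] = count

def total_count():
    # Keep only the latest attempt per partition when summing.
    latest = {}
    for (partition_id, attempt), count in counter_hash.items():
        if attempt >= latest.get(partition_id, (-1, 0))[0]:
            latest[partition_id] = (attempt, count)
    return sum(count for _, count in latest.values())

record_count(0, 0, 100)
record_count(0, 0, 100)   # retry of the same attempt: no double count
record_count(0, 1, 100)   # speculative re-execution: replaces attempt 0
record_count(1, 0, 50)
```

This is exactly the failure mode that makes plain Spark accumulators unreliable for exact counts: they are added on every task attempt, including speculative ones.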
Re-imagine Data Monitoring with whylogs and Spark - Databricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
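The key property that makes such lightweight profiling work at Spark scale is that profiles are mergeable: each partition can be profiled independently and the results combined. A minimal sketch of that idea follows; it is an illustration only, not the whylogs API:

```python
# Sketch of a mergeable statistical profile, the property that lets a
# library like whylogs profile each Spark partition independently and
# combine the results. This is an illustration, not the whylogs API.

class Profile:
    def __init__(self):
        self.count, self.total, self.min, self.max = 0, 0.0, None, None

    def track(self, value):
        self.count += 1
        self.total += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    def merge(self, other):
        # Counts, sums, mins, and maxes all combine associatively,
        # so profiles can be reduced across partitions in any order.
        merged = Profile()
        merged.count = self.count + other.count
        merged.total = self.total + other.total
        merged.min = min(x for x in (self.min, other.min) if x is not None)
        merged.max = max(x for x in (self.max, other.max) if x is not None)
        return merged

    def mean(self):
        return self.total / self.count

# Profile two "partitions" independently, then merge.
p1, p2 = Profile(), Profile()
for v in [1.0, 2.0, 3.0]:
    p1.track(v)
for v in [10.0, 20.0]:
    p2.track(v)
combined = p1.merge(p2)
```

Real profilers track richer sketches (histograms, cardinality estimates), but the merge-across-partitions structure is the same.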
Raven: End-to-end Optimization of ML Prediction Queries - Databricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
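Rule (ii) above, turning a decision tree into a SQL expression, can be sketched on a toy tree. The tuple-based tree encoding and the `tree_to_sql` helper are hypothetical illustrations, not Raven's actual intermediate representation:

```python
# Sketch of one operator transformation: compiling a small decision
# tree into an equivalent SQL CASE expression, so a SQL engine can
# evaluate the model without a separate ML runtime.
# The tuple-based tree format here is hypothetical.

def tree_to_sql(node):
    """node is ('leaf', value) or ('split', column, threshold, left, right)."""
    if node[0] == "leaf":
        return str(node[1])
    _, column, threshold, left, right = node
    return (f"CASE WHEN {column} <= {threshold} "
            f"THEN {tree_to_sql(left)} ELSE {tree_to_sql(right)} END")

tree = ("split", "age", 30,
        ("leaf", 0),
        ("split", "income", 50000, ("leaf", 0), ("leaf", 1)))

sql = tree_to_sql(tree)
```

Once the tree is ordinary SQL, the relational optimizer can push filters through it, prune branches, and pick the execution engine, which is the holistic optimization the presentation describes.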
Processing Large Datasets for ADAS Applications using Apache Spark - Databricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake - Databricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade-Offs with Various Formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list-based graph representation used by graph algorithms such as PageRank.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
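The CSR representation the notes refer to can be sketched as follows; `build_csr` is a hypothetical helper written for illustration, not code from the report:

```python
# Minimal sketch of the Compressed Sparse Row (CSR) representation:
# all adjacency lists packed into one array, with an offsets array
# marking where each vertex's neighbor range begins.

def build_csr(num_vertices, edges):
    # Count out-degrees, then prefix-sum them into offsets.
    degree = [0] * num_vertices
    for u, _ in edges:
        degree[u] += 1
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    # Scatter each edge target into its source vertex's range.
    neighbors = [0] * len(edges)
    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    for u, v in edges:
        neighbors[cursor[u]] = v
        cursor[u] += 1
    return offsets, neighbors

def out_neighbors(offsets, neighbors, v):
    return neighbors[offsets[v]:offsets[v + 1]]

offsets, neighbors = build_csr(4, [(0, 1), (0, 2), (2, 3), (3, 0)])
```

Packing adjacency into two flat arrays gives the cache-friendly sequential access that the sequential, OpenMP, and CUDA experiments above depend on.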
As Europe's leading economic powerhouse and the fourth-largest #economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like #Russia and #China, #Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in #cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to #AdvancedPersistentThreats (#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Techniques to optimize the pagerank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
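The first optimization above, skipping computation on already-converged vertices, can be sketched in plain Python. This is an illustrative power-iteration implementation, not the STICD algorithm itself, and it assumes a vertex marked converged stays converged (its neighbors' ranks having also settled):

```python
# Sketch of skipping rank computation for converged vertices during
# power-iteration PageRank. Illustration only, not the STICD algorithm;
# a vertex is frozen once its rank change falls below the tolerance.

def pagerank(in_links, out_degree, n, damping=0.85, tol=1e-10):
    ranks = {v: 1.0 / n for v in range(n)}
    converged = set()
    while len(converged) < n:
        new_ranks = dict(ranks)
        for v in range(n):
            if v in converged:
                continue  # skip the per-vertex work: the saving described above
            rank = (1 - damping) / n + damping * sum(
                ranks[u] / out_degree[u] for u in in_links.get(v, []))
            if abs(rank - ranks[v]) < tol:
                converged.add(v)
            new_ranks[v] = rank
        ranks = new_ranks
    return ranks

# 3-cycle: by symmetry every vertex should end up with rank 1/3.
in_links = {0: [2], 1: [0], 2: [1]}
out_degree = {0: 1, 1: 1, 2: 1}
ranks = pagerank(in_links, out_degree, 3)
```

On real graphs the saving comes from the long tail of vertices that converge many iterations before the slowest ones do.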
3. Four Science Paradigms*
1. Experimental: describe empirical facts and test hypotheses
since: thousand years ago
2. Theoretical: explain and predict natural phenomena using
models and abstractions
since: several hundred years ago
3. Computational: simulate theoretical models using computers
since: second half of the 20th century
4. Data-Intensive: scientific discoveries based on Big Data analytics
since: around 15 years ago
*Jim Gray and Alex Szalay, eScience – A Transformed Scientific Method, NRC-CSTB, 2007
#SAISEco4
4. Paradigm Shift
▪ The fourth paradigm of data-intensive science rapidly became a major conceptual approach for multiple application domains encompassing and generating large-scale scientific drivers such as fusion reactors and light source facilities.
▪ The success of data-intensive projects subsequently triggered an explosion of numerous machine learning approaches addressing a wide range of industrial and scientific applications such as computer vision, self-driving cars, and brain modelling.
▪ The next generation of artificial intelligent systems clearly represents a paradigm shift from data processing pipelines towards cognitive knowledge-centric applications.
▪ As shown in Fig. 1, AI systems broke the boundaries of the computational and data-intensive paradigms and began to form a new ecosystem by merging and extending existing technologies.
Figure 1: The Fifth Paradigm* (3rd Paradigm: Computational Science; 4th Paradigm: Data-Intensive Science; 5th Paradigm: Cognitive Computing, e.g. DeepMind AlphaGo, IBM Watson DeepQA, Human Brain Project)
*N. Malitsky, R. Castain, and M. Cowan, Spark-MPI: Approaching the Fifth Paradigm of Cognitive Applications, arXiv:1806.01110, 2018
5. Knowledge
▪ In his original talk, Jim Gray discussed “objectifying” knowledge within the field of ontology for providing a structured representation of abstract concepts and physical entities. This direction is related to the development of structured knowledge bases and associated technologies such as the Semantic Web and Linked Data.
▪ Existing structured resources however only capture a tiny subset of available information. Therefore, advanced question-answering (QA) systems* augmented them with corpora of raw text and processing pipelines consisting of multiple stages that combine hundreds of different cooperating algorithms from various fields.
As a result, emerging AI-oriented applications imply a more general and practical knowledge definition: Knowledge is a multifaceted substance distributed among heterogeneous information networks and associated processing platforms. The structure and relationship between different components of such a composite representation is dynamic, continuously shaped and consolidated by machine learning processes.
*D. A. Ferrucci, Introduction to “This is Watson”, IBM Journal of Research and Development, 2012
6. From Processing Pipelines to Rational Agents
[Figure: three stages of evolution, from data-intensive processing pipelines, through deep learning model-centric applications, to reinforcement learning agent-oriented applications]
7. Approaching the Fifth Paradigm of Cognitive Applications
*Dharshan Kumaran, Demis Hassabis, and James L. McClelland, What Learning Systems do Intelligent Agents Need?
Complementary Learning Systems, Trends in Cognitive Sciences, 2016
Figure 2: Complementary Learning Systems* (Neocortex / Heterogeneous Knowledge and Information Network; Hippocampus / Streaming Pipeline)
The consolidation of HPC and Big Data machine learning technologies represents the prerequisite for developing the next paradigm of cognitive applications.
Figure 1: The Fifth Paradigm (3rd Paradigm: Computational Science; 4th Paradigm: Data-Intensive Science; 5th Paradigm: Cognitive Applications)
9. Closing the gap between Big Data and HPC computing
*Geoffrey Fox et al. HPC-ABDS High Performance Computing Enhanced Apache Big Data Stack, CCGrid, 2015
[Figure: the Big Data and HPC computing ecosystems*, with Spark and MPI bridging them toward new frontiers]
10. MPI: Message Passing Interface
Application Programming Interface:
▪ peer-to-peer: allreduce
▪ master-workers: scatter, gather, reduce
▪ point-to-point: send, receive
▪ remote memory access: put, get
Portable Access Layer for various communication protocols:
▪ RDMA
▪ GPUDirect RDMA
▪ TCP/IP
▪ shared memory
Process Management Interface:
▪ address exchange service
▪ …
11. PMI-based Spark-MPI Approach
[Figure: a Spark Driver and PMI Server coordinating four Spark Workers through three interfaces: Spark driver-worker, PMI server-worker, and MPI inter-worker]
▪ Process Management Interface (PMI): originally developed by the MPICH team2 and used for exchanging wireup information among processes.
▪ PMI-Exascale (PMIx): created by the Open MPI team3 in response to the ever-increasing scale of supercomputing clusters.
(2) P. Balaji et al. PMI: A Scalable Parallel Process-Management Interface for Extreme-Scale Systems, 2010
(3) R. Castain, D. Solt, J. Hursey, and A. Bouteiller, PMIx: Process Management for Exascale Environment, 2017
▪ The PMIx community has therefore focused on extending the earlier PMI work, adding flexibility to existing APIs (e.g., to support asynchronous operations) as well as new APIs that broaden the range of interactions with the resident resource manager.
▪ Spark-MPI1 encompasses three interfaces. Specifically, it complements the Spark conventional driver-worker model with the PMI server-worker interface for establishing MPI inter-worker communications.
(1) N. Malitsky et al. Building Near-Real-Time Processing Pipelines with the Spark-MPI platform, NYSDS, 2017
12. Open MPI*
Open MPI was derived as a generalization of four projects, bringing together over 40 frameworks. It introduced a Modular Component Architecture (MCA) that utilizes components (a.k.a. plugins) to provide alternative implementations of key functional blocks such as message transport, mapping, algorithms, and collective operations.
*E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall, Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation, 2004
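The MCA pattern can be sketched as a small plugin registry: a framework names a functional block, alternative components register implementations, and one is selected at run time. The Python below is a toy analogy introduced here, not Open MPI's C implementation:

```python
# Toy analogy of the Modular Component Architecture: a framework holds
# interchangeable components (plugins), selected by name at run time.
class Framework:
    def __init__(self, name):
        self.name = name
        self.components = {}

    def register(self, comp_name, impl):
        self.components[comp_name] = impl

    def select(self, comp_name):
        return self.components[comp_name]

# A 'btl' (byte transfer layer) framework with two alternative transports
btl = Framework("btl")
btl.register("tcp", lambda msg: "tcp:" + msg)
btl.register("ofi", lambda msg: "ofi:" + msg)
send = btl.select("tcp")
```

In the same spirit, Spark-MPI plugs a sparkmpi component into an existing Open MPI framework rather than modifying the core layers.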
[Diagram — Architecture: an MPI application on top of the Modular Component Architecture (MCA), where each framework comprises a base and alternative components. Implementation: an MPI application on the Open MPI core (OPAL, ORTE, and OMPI layers), with a sparkmpi component added to the OpenRTE Daemon's Launch Subsystem (odls) framework and tcp, ofi, and smcuda components in the MPI byte transfer layer (btl) framework.]
13. Spark-MPI Integrated Platform
N. Malitsky, R. Castain, and M. Cowan, Spark-MPI: Approaching the Fifth Paradigm of Cognitive Applications, arXiv:1806.01110, 2018
[Diagram: the Spark-MPI integrated platform — MPI-based algorithms sit on the Process Management Interface (PMI) above the Spark platform's Resilient Distributed Dataset API; Receivers connect Streaming Sources, while Connectors provide HPC extensions such as SLURM and parallel file systems.]
15. Deep Learning Training as a Third Paradigm Computational Application
[Diagram: two distributed training topologies — the parameter server-based data parallel model*, with DL workers (W) exchanging gradients through a parameter server (P), and the all-reduce model, with workers (W) connected in a ring.]
*Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2015
16. (Some of the) MPI DL Projects
▪ CNTK1: Microsoft Cognitive Toolkit
▪ TensorFlow-MaTEx2: added two new TensorFlow operators, Global_Broadcast and MPI_Allreduce
▪ S-Caffe3: scaled Caffe with an MPI-level hierarchical reduction design
▪ Horovod4: adopted Baidu's approach based on the ring-allreduce algorithm and further developed its implementation with NVIDIA's NCCL library for collective operations
▪ CPE ML Plugin5: Cray Programming Environment Machine Learning Plugin
(1) A. Agarwal et al. An Introduction to Computational Networks and the Computational Network Toolkit, 2014
(2) A. Vishnu et al. User-transparent Distributed TensorFlow, 2017
(3) A. A. Awan et al. S-Caffe: Co-designing MPI Runtime and Caffe for Scalable Deep Learning on Modern GPU Clusters, 2017
(4) A. Sergeev and M. Del Balso. Horovod: Fast and Easy Distributed Deep Learning in TensorFlow, 2018
(5) P. Mendygral. Scaling Deep Learning, 2018
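Since several of these projects center on the ring-allreduce algorithm, a pure-Python simulation may help convey it: each of the P workers owns a gradient vector split into P segments; P−1 reduce-scatter steps sum each segment around the ring, and P−1 allgather steps circulate the finished segments so every worker ends with the averaged gradients. This is a didactic sketch introduced here, not Horovod's implementation:

```python
def ring_allreduce(grads):
    """Average P equal-length gradient vectors via a simulated ring.

    grads: list of P lists whose length is divisible by P.
    """
    p = len(grads)
    seg = len(grads[0]) // p
    bounds = [(i * seg, (i + 1) * seg) for i in range(p)]
    data = [list(g) for g in grads]

    # Reduce-scatter: in step s, worker w sends segment (w - s) mod p to
    # worker w+1, which adds it in. Afterwards, worker w holds the fully
    # summed segment (w + 1) mod p.
    for s in range(p - 1):
        for w in range(p):
            lo, hi = bounds[(w - s) % p]
            dst = (w + 1) % p
            for i in range(lo, hi):
                data[dst][i] += data[w][i]

    # Allgather: circulate the finished segments around the ring so that
    # every worker ends up with the complete sum.
    for s in range(p - 1):
        for w in range(p):
            lo, hi = bounds[(w + 1 - s) % p]
            dst = (w + 1) % p
            data[dst][lo:hi] = data[w][lo:hi]

    # Each worker divides by P to turn the sum into the average.
    return [[x / p for x in d] for d in data]
```

With two workers holding [1, 2, 3, 4] and [5, 6, 7, 8], both end up with the average [3.0, 4.0, 5.0, 6.0]. Each worker sends only 2(P−1)/P of its vector per allreduce, which is why the pattern scales well for large gradients.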
17. Spark-MPI-Horovod
[Script annotations: initialize the PMI environmental variables; initialize Horovod and MPI; extract the MNIST dataset; build the DL model; create the TF optimizer; wrap it with Horovod; run the Horovod MPI-based training on Spark workers.]
The Horovod MPI-based training framework replaces the TensorFlow parameter servers with the ring-allreduce approach for averaging gradients among TensorFlow workers.
For users, the corresponding integration consists of two primary steps, as illustrated by the script: (1) initializing Horovod with hvd.init() and (2) wrapping the TensorFlow worker's optimizer with hvd.DistributedOptimizer().
The Spark-MPI pipelines make it possible to run the Horovod training on Spark workers with Map operations. To establish MPI communication among the Spark workers, the Map operation (e.g. train()) needs only to define the PMI-related environmental variables (such as PMIX_RANK and a port number).
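The pattern just described might be sketched as follows. This is an illustrative standalone sketch, not the talk's actual script: the Horovod and TensorFlow calls are left as comments so it runs without either library, and every environment-variable name other than PMIX_RANK is an assumption:

```python
import os

PMI_PORT = "5000"  # hypothetical PMI server port

def train(rank):
    """Body of the Spark Map operation, e.g. rdd.map(train).collect()."""
    # (1) Wire-up: define the PMI-related environment variables so the
    #     MPI runtime can connect this Spark worker to its peers.
    os.environ["PMIX_RANK"] = str(rank)
    os.environ["PMI_PORT"] = PMI_PORT  # assumed variable name
    # (2) import horovod.tensorflow as hvd; hvd.init()
    # (3) opt = hvd.DistributedOptimizer(optimizer)
    # (4) build the MNIST model and run the training loop
    return rank  # a real pipeline would return training metrics

# With Spark: sc.parallelize(range(4), 4).map(train).collect()
# Emulated locally here with plain map:
results = list(map(train, range(4)))
```

The point of the design is that the Map function itself stays ordinary Spark user code; only the environment setup in step (1) is Spark-MPI specific.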
19. Deep Reinforcement Learning
Figure 1: Gorila* (General Reinforcement Learning Architecture)
[Diagram: within an Agent, an Actor interacts with the Environment, selecting actions via argmax_a Q(s, a; θ) and receiving (r, s’); transitions (s, a, r, s’) flow into a Replay Memory, from which a DQN Learner samples and synchronizes with a Parameter Server.]
*A. Nair et al. Massively Parallel Methods for Deep Reinforcement Learning, ICML, 2015
**R. Nishihara et al. Real-Time Machine Learning: The Missing Pieces, arXiv:1703.03924, 2017
System Requirements**:
• Low latency
• High throughput
• Dynamic task creation
• Heterogeneous tasks
• Arbitrary dataflow dependencies
• Transparent fault tolerance
• Debuggability and profiling
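Two of Gorila's central data structures — the replay memory of (s, a, r, s′) transitions and greedy action selection argmax_a Q(s, a; θ) — can be sketched in a few lines. This is a didactic toy with a dictionary-backed Q function introduced here, not the paper's distributed implementation:

```python
import random
from collections import deque

class ReplayMemory:
    """Bounded buffer of (s, a, r, s_next) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniformly sample a training batch for the DQN learner."""
        return random.sample(self.buffer, batch_size)

def greedy_action(q, state, actions):
    """argmax_a Q(s, a); q maps (state, action) pairs to values."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))

# The actor picks actions greedily and logs the transition:
memory = ReplayMemory(capacity=2)
q = {("s0", "left"): 0.1, ("s0", "right"): 0.9}
a = greedy_action(q, "s0", ["left", "right"])
memory.add("s0", a, 1.0, "s1")
```

Decoupling acting (filling the buffer) from learning (sampling it) is exactly what creates the dynamic, heterogeneous tasks listed above.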
20. (Some of the) RL Applications*
(1) V. Mnih et al. Playing Atari with Deep Reinforcement Learning, NIPS, 2013
(2) D. Silver et al. Mastering the Game of Go with Deep Neural Networks and Tree Search, Nature, 2016
▪ Atari Games1
▪ AlphaGo2
▪ Robotics
▪ Self-driving vehicles
▪ Autonomous UAVs
…
“Pterodactylus antiquus, the first pterosaur species to be named and identified as a flying reptile … 150.8–148.5 million years ago” (Wikipedia)
21. Summary
▪ Emerging AI projects represent a paradigm shift from data processing pipelines towards
the fifth paradigm of cognitive knowledge-centric applications.
▪ The new generation of AI composite applications requires the integration of Big Data and HPC technologies. For example, MPI was originally introduced within the computational paradigm ecosystem for developing HPC scientific applications, but it has recently been applied successfully to extend the scale of deep learning applications.
▪ Knowledge is a multifaceted substance distributed among heterogeneous information networks and associated processing platforms. The structure of and relationships between its components are dynamic, continuously shaped and consolidated by machine learning processes.
▪ Spark-MPI addresses this strategic direction by extending the Spark platform with MPI-
based HPC applications using the Process Management Interface (PMI).