This document discusses modeling systems at the end of Dennard scaling and approaches to modeling in a post-Dennard era. It covers the end of steady CPU performance improvements and the rise of specialized computing, with GPUs and deep learning as industry drivers. It also discusses using fewer bits in calculations, exploring uncertainties, and generating low-dimensional representations from complex models to address growing computational demands. Learning algorithms may help build emulators and surrogates of Earth system models to enable fit-for-purpose simulation.
1. Modeling Systems at the end of Dennard Scaling
Future of Fluids: Big Data and Big Computation
Aviation Forum
Atlanta, Georgia
V. Balaji
NOAA/GFDL and Princeton University
28 June 2018
2. Outline
1 Earth system modeling
2 Hardware evolution at the end of Dennard scaling
    The end of Dennard scaling
    Specialized and commodity computing
    Increased concurrency, slower arithmetic
    Deep learning is an industry driver
3 Approaches to modeling post-Dennard
    Uncertainty exploration
    Use fewer bits
    Generate low-dimensional representations from higher-dimensional ones
4 Ideas and challenges
4. Atmospheric response to doubled CO2
Fig. 5 from Manabe and Wetherald (1975): equilibrium response to doubled CO2.
5. History of GFDL Computing
Courtesy Brian Gross, NOAA/GFDL.
6. NGGPS: Next-Generation Global Prediction System
FV3 dynamical core from GFDL for the next-generation forecast model
(target: 3 km non-hydrostatic in 10 years, running at ∼200 simulated days per wall-clock day)
7. Passing the climate Turing test?
We may be able to simulate everything in great detail, but do we
understand how it works?
9. Moore’s Law and End of Dennard scaling
Figure courtesy Moore 2011: Data processing in exascale-class
systems.
Processor concurrency: Intel Xeon-Phi.
Fine-grained thread concurrency: Nvidia GPU.
10. Top500 revisited
HPCG/HPL ratio is a measure of “percent of peak” (Dongarra and
Heroux 2013).
All recent HPC acquisitions in climate/weather have been on
conventional Intel Xeon (see Balaji et al 2017).
11. The inexorable triumph of commodity computing
From The Platform, Hemsoth (2015).
12. The "Navier-Stokes Computer" of 1986
“The Navier-Stokes computer (NSC)
has been developed for solving
problems in fluid mechanics involving
complex flow simulations that require
more speed and capacity than
provided by current and proposed
Class VI supercomputers. The
machine is a parallel processing
supercomputer with several new
architectural elements which can be
programmed to address a wide range
of problems meeting the following
criteria: (1) the problem is
numerically intensive, and (2) the
code makes use of long vectors.”
Nosenchuck and Littman (1986)
13. The Caltech "Cosmic Cube" (1986)
“Caltech is at its best blazing new trails; we are not the best place for
programmatic research that dots i’s and crosses t’s”. Geoffrey Fox,
pioneer of the Caltech Concurrent Computation Program, in 1986.
17. Processors for Deep Learning
Deep learning is a layered NN approach with hidden layers. Figure
courtesy NVidia.
18. Google TPU (Tensor Processing Unit)
Figure courtesy Google.
19. Google TPU (Tensor Processing Unit)
Hardware pipelining of steps in matrix-multiply. Figure courtesy
Google.
21. No separation of "large" and "small" scales
Nastrom and Gage (1985).
22. Multi-model “skill scores”
Based on RMS error of surface temperature and precipitation. (Fig. 3
from Knutti et al, GRL, 2013).
23. Multi-model skill scores?
More complex models that show the same skill represent an “advance”!
24. Model tuning
Model tuning or “calibration” consists of reducing overall model bias
(usually relative to 20th century climatology) by modifying parameters.
In principle, minimizing some cost function:
$$ C(p_1, p_2, \ldots) = \sum_{i=1}^{N} \omega_i \,\lVert \phi_i - \phi_i^{\mathrm{obs}} \rVert $$
Usually the $p$ must be chosen within some observed or theoretical range $p_{\min} \le p \le p_{\max}$.
“Fudge factors” (applying values known to be wrong) are generally frowned upon (see Shackley et al 1999 for a discussion of the history of “flux adjustments”; more on that later...)
The choice of $\omega_i$ is part of the lab’s “culture”!
The choice of $\phi_i^{\mathrm{obs}}$ is also troublesome:
overlap between “tuning” metrics and “evaluation” metrics.
“Over-tuning”: remember “reality” is but one ensemble member!
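To make the calibration loop concrete, here is a minimal sketch in Python, assuming a hypothetical two-parameter toy model and synthetic “observations”; a real calibration would run the full model against 20th-century climatology.

```python
# Minimal calibration sketch: minimize C(p) = sum_i w_i * (phi_i(p) - phi_i_obs)**2
# subject to p_min <= p <= p_max. The "model" is a hypothetical toy stand-in for a GCM.
import numpy as np
from scipy.optimize import minimize

def toy_model(p):
    """Hypothetical model: maps two parameters to two diagnostics."""
    entrainment, albedo = p
    return np.array([288.0 + 2.0 * entrainment - 5.0 * albedo,   # e.g. surface temperature
                     3.0 - 0.5 * entrainment + 1.0 * albedo])    # e.g. precipitation

phi_obs = np.array([287.5, 2.8])    # synthetic "observed" climatology
weights = np.array([1.0, 4.0])      # the omega_i: their choice is part of the lab's "culture"
bounds  = [(0.0, 1.0), (0.2, 0.4)]  # observed/theoretical ranges p_min <= p <= p_max

def cost(p):
    return float(np.sum(weights * (toy_model(p) - phi_obs) ** 2))

result = minimize(cost, x0=[0.5, 0.3], bounds=bounds, method="L-BFGS-B")
print("tuned parameters:", result.x, "residual cost:", result.fun)
```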
25. Model choice: culture and constraints
GFDL: models built on FMS. Goals: dec-cen, carbon cycle, seasonal prediction, decadal predictability, TC climatology, aerosol-cloud feedbacks, ozone climate, regional climate.
IITM (8 SYPD on 164p; 500 CHSY). Goals: DECK experiments, monsoons under climate change.
IPSL: IPSLCM6-VLR (38 SYPD on 160p; 100 CHSY) to IPSLCM6-LR (6 SYPD on 550p; 2200 CHSY). Goals: WCRP grand challenge on clouds; dec-cen climate change; carbon cycle; ozone climate; paleoclimate.
Strategies of model building (choices of $\omega_i$).
Thought experiment: if two different labs started at the same point in Knutti’s genealogy, would they build the same model?
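The throughput and cost figures quoted above are tied together by the CPMIP metrics of Balaji et al (2017): CHSY, core-hours per simulated year, equals 24 × (core count) / SYPD, since a model running at S simulated years per wall-clock day spends 24/S hours per simulated year. A quick sanity check against this slide's numbers:

```python
# CPMIP relation: CHSY (core-hours per simulated year) = 24 * ncores / SYPD.
def chsy(ncores, sypd):
    return 24.0 * ncores / sypd

print(chsy(164, 8))    # IITM:        492  (quoted as ~500 CHSY)
print(chsy(160, 38))   # IPSLCM6-VLR: ~101 (quoted as ~100 CHSY)
print(chsy(550, 6))    # IPSLCM6-LR:  2200 (quoted as 2200 CHSY)
```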
26. Objective methods of tuning?
Neelin et al (2010) construct “metamodels” to aid in multi-parameter
optimization. Metamodel generation is expensive (as in deep learning),
and varies with cost function.
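As a toy illustration of the idea (a generic one-parameter response surface, not Neelin et al's actual construction): fit a cheap quadratic to a handful of expensive model evaluations, then optimize the surrogate instead of the model.

```python
# Sketch of a quadratic "metamodel" over one tuning parameter. The function
# expensive_cost is a hypothetical stand-in for a full GCM run plus cost function.
import numpy as np

def expensive_cost(p):
    return (p - 0.37) ** 2 + 0.05 * np.sin(20 * p)   # pretend each call is a model run

p_design = np.linspace(0.0, 1.0, 7)                     # a handful of design points
c_design = np.array([expensive_cost(p) for p in p_design])

coeffs = np.polyfit(p_design, c_design, deg=2)          # fit the metamodel
p_grid = np.linspace(0.0, 1.0, 1001)
p_best = p_grid[np.argmin(np.polyval(coeffs, p_grid))]  # optimize the cheap surrogate
print("surrogate optimum near p =", p_best)
```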
27. Low precision arithmetic for Deep Learning
Figure 1 from Gupta et al (2015).
28. Low precision arithmetic for Deep Learning
Figure courtesy NVidia. Low-precision arithmetic.
29. Irreproducible Computing, Inexact Hardware
Figure 1 from Düben et al, Phil. Trans. A, 2016. Which bits can we allow to be “inexactly” flipped? Lorenz ’96 serves as a canonical test case of non-linearity and chaos.
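To convey the flavor of such experiments, here is a minimal sketch integrating Lorenz '96 in float64 and float16; software half precision stands in for Düben et al's inexact hardware, and forward Euler with a small step is used purely for brevity.

```python
# Lorenz '96 at two precisions: dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F.
# float16 is a software stand-in for "fewer bits", not the pruned hardware itself.
import numpy as np

def l96_euler_step(x, dt=0.005, F=8.0):
    dxdt = (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F
    return (x + dt * dxdt).astype(x.dtype)  # keep each run in its own precision

x64 = np.full(40, 8.0); x64[0] += 0.01      # perturb the F = 8 fixed point
x16 = x64.astype(np.float16)

for _ in range(2000):                        # 10 model time units
    x64 = l96_euler_step(x64)
    x16 = l96_euler_step(x16)

rmsd = np.sqrt(np.mean((x64 - x16.astype(np.float64)) ** 2))
print("RMS trajectory divergence:", rmsd)    # chaos amplifies the rounding differences
```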
30. Irreproducible Computing, Inexact Hardware
Figure 2 from Düben et al, Phil. Trans. A, 2016.
31. Generating parameterizations from CRMs and
super-parameterization
(Courtesy: S-J Lin, NOAA/GFDL).
(Courtesy: D. Randall, CSU;
CMMAP).
Global-scale CRMs (e.g. the 7 km simulation on the left) and even super-parameterization using embedded cloud models (right) remain prohibitively expensive.
Use emulators (genetic programming, or DL using GCM-resolution predictors) to emulate columns of a cloud field.
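A toy sketch of the emulator idea using a small neural network: the four predictors and the synthetic “cloud response” below are hypothetical stand-ins for GCM-resolution inputs and CRM training output.

```python
# Toy emulator: learn a cheap map from column predictors to a sub-grid response.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 4))                      # hypothetical T, q, omega, SST columns
y = np.sin(3.0 * X[:, 0]) * X[:, 1] + 0.1 * X[:, 2]  # synthetic "cloud response"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
emulator = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", emulator.score(X_te, y_te))   # the surrogate is fast to evaluate
```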
33. Ideas and Challenges
No scale separation implies a catastrophic cascade of dimensionality: we’re off by a factor of $10^{10}$ from the required flops (Schneider et al 2017).
Multiple “fit-for-purpose” cost functions depending on the question
asked.
Learning algorithms may play multiple roles:
Building emulators, fast surrogate models of low dimensionality.
Early detection of “viable” models
Other fields exploring the same terrain face substantial difficulties: see
Frégnac (2017): “Big data and the industrialization of
neuroscience: A safe roadmap for understanding the brain?” See
also Jonas and Kording (2017): “Could a Neuroscientist
Understand a Microprocessor?”
In the face of the above, we must regard it as a success that we hold the line on Manabe’s results despite a vast increase in dimensionality!
Need unified modeling system across the model hierarchy.
34. What would future infrastructure look like?
A unified modeling infrastructure with:
≤∼1 SYPD models, “LES”, “DNS” for generating training data
∼10 SYPD comprehensive models for “doing science” – e.g. climate
sensitivity, detection-attribution, predictability, prediction, projection,
...
≥∼100-1000 SYPD fast approximate models for uncertainty
exploration
Massive re-engineering to speed up the 10 SYPD model by a few
X will not be transformational (scientists will add to it to bring it
back to ∼10 SYPD)
A flexible open evaluation and testing framework where metrics
can be added with little effort (see e.g. Pangeo)
A system of composing cost functions at will and generating the
learnt models within a period attuned to human attention span
35. Bibliography
Schneider et al (2017), “Climate goals and computing the future of clouds”.
Balaji (2015), “Climate Computing: The State of Play”.
Frégnac (2017), “Big data and the industrialization of neuroscience: A safe roadmap for understanding the brain?”
Hourdin et al (2016), “The Art and Science of Climate Model Tuning”.
Düben et al (2014), “On the use of inexact, pruned hardware in atmospheric modelling”.
Balaji et al (2017), “CPMIP: measurements of real computational performance of Earth system models in CMIP6”.