The document discusses the history and structure of dense linear algebra and parallel matrix multiplication algorithms. It outlines the topics to be covered, including the history of and motivation for optimizing dense linear algebra, what constitutes a dense linear algebra problem, why minimizing communication matters, lower bounds on communication, and parallel matrix multiplication algorithms. It also briefly reviews blocked matrix multiplication and its goal of minimizing data movement between levels of the memory hierarchy.
lecture12_densela_1_jwd16.ppt
1. 02/25/2016 CS267 Lecture 12 1
CS 267
Dense Linear Algebra:
History and Structure,
Parallel Matrix Multiplication
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr16
2. Quick review of earlier lecture
• What do you call
• A program written in PyGAS, a Global Address
Space language based on Python…
• That uses a Monte Carlo simulation algorithm to
approximate π …
• That has a race condition, so that it gives you a
different funny answer every time you run it?
Monte - π - thon
02/25/2016 CS267 Lecture 12 2
3. 02/25/2016 CS267 Lecture 12 3
Outline
• History and motivation
• What is dense linear algebra?
• Why minimize communication?
• Lower bound on communication
• Parallel Matrix-matrix multiplication
• Attaining the lower bound
• Other Parallel Algorithms (next lecture)
4. 02/25/2016 CS267 Lecture 12 4
Outline
• History and motivation
• What is dense linear algebra?
• Why minimize communication?
• Lower bound on communication
• Parallel Matrix-matrix multiplication
• Attaining the lower bound
• Other Parallel Algorithms (next lecture)
5. 5
Motifs
The Motifs (formerly “Dwarfs”) from
“The Berkeley View” (Asanovic et al.)
Motifs form key computational patterns
6. What is dense linear algebra?
• Not just matmul!
• Linear Systems: Ax=b
• Least Squares: choose x to minimize ||Ax-b||₂
• Overdetermined or underdetermined; Unconstrained, constrained, or weighted
• Eigenvalues and vectors of Symmetric Matrices
• Standard (Ax = λx), Generalized (Ax=λBx)
• Eigenvalues and vectors of Unsymmetric matrices
• Eigenvalues, Schur form, eigenvectors, invariant subspaces
• Standard, Generalized
• Singular Values and vectors (SVD)
• Standard, Generalized
• Different matrix structures
• Real, complex; Symmetric, Hermitian, positive definite; dense, triangular, banded …
• 27 types in LAPACK (and growing…)
• Level of detail
• Simple Driver (“x = A\b”)
• Expert Drivers with error bounds, extra-precision, other options
• Lower level routines (“apply certain kind of orthogonal transformation”, matmul…)
02/25/2016 CS267 Lecture 12 6
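To make the “simple driver” level concrete, here is a minimal sketch (not from the slides) of solving Ax = b with dgesv, LAPACK's simple driver for linear systems, via the LAPACKE C interface; the 3×3 matrix and right-hand side are made-up illustrative values.

    /* Sketch: solve Ax = b with LAPACK's simple driver dgesv via LAPACKE.
       Assumes LAPACKE is installed; compile with e.g.
       gcc solve.c -llapacke -llapack -lblas */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void) {
        double A[9] = { 4, 1, 0,   /* row-major 3x3 matrix (illustrative values) */
                        1, 3, 1,
                        0, 1, 2 };
        double b[3] = { 1, 2, 3 }; /* right-hand side; overwritten with the solution x */
        lapack_int ipiv[3];        /* pivot indices from the LU factorization */

        /* dgesv = LU factorization with partial pivoting + triangular solves */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
        if (info != 0) {
            fprintf(stderr, "dgesv failed: info = %d\n", (int) info);
            return 1;
        }
        printf("x = %g %g %g\n", b[0], b[1], b[2]);
        return 0;
    }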
7. Organizing Linear Algebra – in books
www.netlib.org/lapack www.netlib.org/scalapack
www.cs.utk.edu/~dongarra/etemplates
www.netlib.org/templates
gams.nist.gov
8. A brief history of (Dense) Linear Algebra software (1/7)
• In the beginning was the do-loop…
• Libraries like EISPACK (for eigenvalue problems)
• Then the BLAS (1) were invented (1973-1977)
• Standard library of 15 operations (mostly) on vectors
• “AXPY” ( y = α·x + y ), dot product, scale (x = α·x ), etc
• Up to 4 versions of each (S/D/C/Z), 46 routines, 3300 LOC
• Goals
• Common “pattern” to ease programming, readability
• Robustness, via careful coding (avoiding over/underflow)
• Portability + Efficiency via machine specific implementations
• Why BLAS 1 ? They do O(n¹) ops on O(n¹) data
• Used in libraries like LINPACK (for linear systems)
• Source of the name “LINPACK Benchmark” (not the code!)
02/25/2016 CS267 Lecture 12 8
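As a reminder of what a BLAS-1 routine does, here is a plain-C sketch of AXPY (a stand-in for the optimized library code): it performs 2n flops but makes 3n memory accesses, which is why its computational intensity is only 2/3.

    /* Sketch of the BLAS-1 operation AXPY: y = alpha*x + y.
       2n flops (a multiply and an add per element) against
       3n memory operations (read x[i], read y[i], write y[i]). */
    void axpy(int n, double alpha, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }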
9. 02/25/2016 CS267 Lecture 12 9
Current Records for Solving Dense Systems (11/2015)
• Linpack Benchmark
• Fastest machine overall (www.top500.org)
• Tianhe-2 (Guangzhou, China)
• 33.9 Petaflops out of 54.9 Petaflops peak (n=10M)
• 3.1M cores, of which 2.7M are accelerator cores
• Intel Xeon E5-2692 (Ivy Bridge) and
Xeon Phi 31S1P
• 1 Pbyte memory
• 17.8 MWatts of power, 1.9 Gflops/Watt
• Historical data (www.netlib.org/performance)
• Palm Pilot III
• 1.69 Kiloflops
• n = 100
10. A brief history of (Dense) Linear Algebra software (2/7)
• But the BLAS-1 weren’t enough
• Consider AXPY ( y = α·x + y ): 2n flops on 3n read/writes
• Computational intensity = (2n)/(3n) = 2/3
• Too low to run near peak speed (read/write dominates)
• Hard to vectorize (“SIMD’ize”) on supercomputers of
the day (1980s)
• So the BLAS-2 were invented (1984-1986)
• Standard library of 25 operations (mostly) on
matrix/vector pairs
• “GEMV”: y = α·A·x + β·y, “GER”: A = A + α·x·yᵀ, x = T⁻¹·x
• Up to 4 versions of each (S/D/C/Z), 66 routines, 18K LOC
• Why BLAS 2 ? They do O(n²) ops on O(n²) data
• So computational intensity still just ~(2n²)/(n²) = 2
• OK for vector machines, but not for machine with caches
02/25/2016 CS267 Lecture 12 10
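A corresponding plain-C sketch of GEMV (again a simplified stand-in for the tuned BLAS-2 routine) shows why the intensity stays near 2: roughly 2n² flops, but the matrix alone is n² words.

    /* Sketch of the BLAS-2 operation GEMV: y = alpha*A*x + beta*y,
       with A stored column-major as in the BLAS/LAPACK convention.
       ~2n^2 flops over ~n^2 words of matrix data => intensity ~2. */
    void gemv(int n, double alpha, const double *A, const double *x,
              double beta, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = beta * y[i];
        for (int j = 0; j < n; j++)          /* walk A one column at a time */
            for (int i = 0; i < n; i++)
                y[i] += alpha * A[i + j * n] * x[j];
    }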
11. A brief history of (Dense) Linear Algebra software (3/7)
• The next step: BLAS-3 (1987-1988)
• Standard library of 9 operations (mostly) on matrix/matrix pairs
• “GEMM”: C = α·A·B + β·C, C = α·A·Aᵀ + β·C, B = T⁻¹·B
• Up to 4 versions of each (S/D/C/Z), 30 routines, 10K LOC
• Why BLAS 3 ? They do O(n³) ops on O(n²) data
• So computational intensity (2n³)/(4n²) = n/2 – big at last!
• Good for machines with caches, other mem. hierarchy levels
• How much BLAS1/2/3 code so far (all at www.netlib.org/blas)
• Source: 142 routines, 31K LOC, Testing: 28K LOC
• Reference (unoptimized) implementation only
• Ex: 3 nested loops for GEMM
• Lots more optimized code (eg Homework 1)
• Motivates “automatic tuning” of the BLAS
• Part of standard math libraries (eg AMD ACML, Intel MKL)
02/25/2016 CS267 Lecture 12 11
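The “3 nested loops” reference implementation of GEMM mentioned above looks roughly like this in C (simplified to the α = β = 1 case, square column-major matrices):

    /* Reference-style GEMM sketch: C = C + A*B for n-by-n column-major matrices.
       2n^3 flops on 3n^2 words, but this unblocked loop nest reuses cached data
       poorly -- hence the blocked version later in the lecture and the tuned BLAS
       shipped in vendor math libraries. */
    void gemm_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i + j * n] += A[i + k * n] * B[k + j * n];
    }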
12. 02/25/2009 CS267 Lecture 8 12
BLAS Standards Committee to start meeting again May 2016:
Batched BLAS: many independent BLAS operations at once
Reproducible BLAS: getting bitwise identical answers from
run-to-run, despite nonassociative floating point, and dynamic
scheduling of resources (bebop.cs.berkeley.edu/reproblas)
Low-Precision BLAS: 16 bit floating point
See www.netlib.org/blas/blast-forum/ for previous extension attempt
New functions, Sparse BLAS, Extended Precision BLAS
13. A brief history of (Dense) Linear Algebra software (4/7)
• LAPACK – “Linear Algebra PACKage” - uses BLAS-3 (1989 – now)
• Ex: Obvious way to express Gaussian Elimination (GE) is adding
multiples of one row to other rows – BLAS-1
• How do we reorganize GE to use BLAS-3 ? (details later)
• Contents of LAPACK (summary)
• Algorithms that are (nearly) 100% BLAS 3
– Linear Systems: solve Ax=b for x
– Least Squares: choose x to minimize ||Ax-b||₂
• Algorithms that are only 50% BLAS 3
– Eigenproblems: Find λ and x where Ax = λx
– Singular Value Decomposition (SVD)
• Generalized problems (eg Ax = λBx)
• Error bounds for everything
• Lots of variants depending on A’s structure (banded, A=AT, etc)
• How much code? (Release 3.6.0, Nov 2015) (www.netlib.org/lapack)
• Source: 1750 routines, 721K LOC, Testing: 1094 routines, 472K LOC
• Ongoing development (at UCB and elsewhere) (class projects!)
• Next planned release June 2016 13
14. A brief history of (Dense) Linear Algebra software (5/7)
• Is LAPACK parallel?
• Only if the BLAS are parallel (possible in shared memory)
• ScaLAPACK – “Scalable LAPACK” (1995 – now)
• For distributed memory – uses MPI
• More complex data structures, algorithms than LAPACK
• Only subset of LAPACK’s functionality available
• Details later (class projects!)
• All at www.netlib.org/scalapack
02/25/2016 CS267 Lecture 12 14
15. 02/25/2016 CS267 Lecture 12 15
Success Stories for Sca/LAPACK (6/7)
Cosmic Microwave Background
Analysis, BOOMERanG
collaboration, MADCAP code (Apr.
27, 2000).
• Widely used
• Adopted by Mathworks, Cray,
Fujitsu, HP, IBM, IMSL, Intel,
NAG, NEC, SGI, …
• 7.5M webhits/year @ Netlib
(incl. CLAPACK, LAPACK95)
• New Science discovered through the
solution of dense matrix systems
• Nature article on the flat
universe used ScaLAPACK
• Other articles in Physics
Review B that also use it
• 1998 Gordon Bell Prize
• www.nersc.gov/assets/NewsImages/2003/
newNERSCresults050703.pdf
16. A brief future look at (Dense) Linear Algebra software (7/7)
• PLASMA, DPLASMA and MAGMA (now)
• Ongoing extensions to Multicore/GPU/Heterogeneous
• Can one software infrastructure accommodate all algorithms
and platforms of current (future) interest?
• How much code generation and tuning can we automate?
• Details later (Class projects!) (icl.cs.utk.edu/plasma, /dplasma, /magma)
• Other related projects
• Elemental (libelemental.org)
• Distributed memory dense linear algebra
• “Balance ease of use and high performance”
• FLAME (z.cs.utexas.edu/wiki/flame.wiki/FrontPage)
• Formal Linear Algebra Method Environment
• Attempt to automate code generation across multiple platforms
• So far, none of these libraries minimize communication in all
cases (not even matmul!)
17. 17
Back to basics:
Why avoiding communication is important (1/3)
Algorithms have two costs:
1.Arithmetic (FLOPS)
2.Communication: moving data between
• levels of a memory hierarchy (sequential case)
• processors over a network (parallel case).
[Figure: sequential case — CPU, cache, DRAM; parallel case — several CPU+DRAM nodes connected by a network]
18. Why avoiding communication is important (2/3)
• Running time of an algorithm is sum of 3 terms:
• # flops * time_per_flop
• # words moved / bandwidth
• # messages * latency
• The last two terms are the communication cost
• Time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time
Annual improvements:
• Time_per_flop: 59%
• DRAM: bandwidth 26%, latency 15%
• Network: bandwidth 23%, latency 5%
• Minimize communication to save time
21. Goal:
Organize Linear Algebra to Avoid Communication
• Between all memory hierarchy levels
• L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Not just hiding communication (overlap with arithmetic)
• Speedup ≤ 2x at best
• Arbitrary speedups / energy savings possible
• Later: same goal for other computational patterns
• Lots of open problems
22. Review: Blocked Matrix Multiply
• Blocked Matmul C = A·B breaks A, B and C into blocks
with dimensions that depend on cache size
… Break A (n×n), B (n×n), C (n×n) into b×b blocks labeled A(i,j), etc
… b chosen so 3 b×b blocks fit in cache
for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
C(i,j) = C(i,j) + A(i,k)·B(k,j) … b×b matmul, 4b² reads/writes
• When b=1, get “naïve” algorithm, want b larger …
• (n/b)³ · 4b² = 4n³/b reads/writes altogether
• Minimized when 3b² = cache size = M, yielding O(n³/M^(1/2)) reads/writes
• What if we had more levels of memory? (L1, L2, cache etc)?
• Would need 3 more nested loops per level
• Recursive (cache-oblivious algorithm) also possible
02/25/2016 CS267 Lecture 12
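A minimal NumPy sketch of the blocked loop nest above (n and b are hypothetical; in practice b would be chosen so 3 b×b blocks fit in cache, i.e. 3b² ≤ M):

import numpy as np

def blocked_matmul(A, B, b):
    # C = A @ B computed block by block; each update touches only
    # three b x b blocks, so they can stay resident in cache.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # b x b matmul: C(i,j) += A(i,k) * B(k,j)
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 256, 32   # hypothetical sizes; b = 1 degenerates to the naive algorithm
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, b), A @ B)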
23. Communication Lower Bounds: Prior Work on Matmul
• Assume n3 algorithm (i.e. not Strassen-like)
• Sequential case, with fast memory of size M
• Lower bound on #words moved to/from slow memory = Ω(n³ / M^(1/2)) [Hong & Kung, 81]
• Attained using blocked or cache-oblivious algorithms
• Parallel case on P processors:
• Let M be memory per processor; assume load balanced
• Lower bound on #words moved = Ω((n³/p) / M^(1/2)) [Irony, Tiskin, Toledo, 04]
• If M = 3n²/p (one copy of each matrix), then lower bound = Ω(n²/p^(1/2))
• Attained by SUMMA, Cannon’s algorithm
02/25/2016 CS267 Lecture 12
24. New lower bound for all “direct” linear algebra
• Holds for
• Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
• Some whole programs (sequences of these operations,
no matter how they are interleaved, e.g. computing Aᵏ)
• Dense and sparse matrices (where #flops << n³)
• Sequential and parallel algorithms
• Some graph-theoretic algorithms (eg Floyd-Warshall)
• Generalizations later (Strassen-like algorithms, loops accessing arrays)
Let M = “fast” memory size per processor
= cache size (sequential case) or O(n²/p) (parallel case)
#flops = number of flops done per processor
#words_moved per processor = Ω(#flops / M^(1/2))
#messages_sent per processor = Ω(#flops / M^(3/2))
25. New lower bound for all “direct” linear algebra
• Sequential case, dense n x n matrices, so O(n³) flops
• #words_moved = Ω(n³ / M^(1/2))
• #messages_sent = Ω(n³ / M^(3/2))
• Parallel case, dense n x n matrices
• Load balanced, so O(n³/p) flops per processor
• One copy of data, load balanced, so M = O(n²/p) per processor
• #words_moved = Ω(n² / p^(1/2))
• #messages_sent = Ω(p^(1/2))
02/25/2016 CS267 Lecture 12
SIAM Linear Algebra Prize, 2012
26. Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and
ScaLAPACK attain these bounds?
• Mostly not yet, work in progress
• If not, are there other algorithms that do?
• Yes
• Goals for algorithms:
• Minimize #words_moved
• Minimize #messages_sent
• Need new data structures
• Minimize for multiple memory hierarchy levels
• Cache-oblivious algorithms would be simplest
• Fewest flops when matrix fits in fastest memory
• Cache-oblivious algorithms don’t always attain this
• Attainable for nearly all dense linear algebra
• Just a few prototype implementations so far (class projects!)
• Only a few sparse algorithms so far (eg Cholesky)
26
02/25/2016 CS267 Lecture 12
27. 02/25/2016 CS267 Lecture 12 27
Outline
• History and motivation
• What is dense linear algebra?
• Why minimize communication?
• Lower bound on communication
• Parallel Matrix-matrix multiplication
• Attaining the lower bound
• Other Parallel Algorithms (next lecture)
29. 02/25/2016 CS267 Lecture 12 29
Parallel Matrix-Vector Product
• Compute y = y + A*x, where A is a dense matrix
• Layout:
• 1D row blocked
• A(i) refers to the n/p by n block row that processor i owns
• x(i) and y(i) similarly refer to segments of x, y owned by i
• Algorithm:
• For each processor i
• Broadcast x(i)
• Compute y(i) = y(i) + A(i)*x
• Algorithm uses the formula
y(i) = y(i) + A(i)*x = y(i) + Σj A(i,j)*x(j)
[Figure: 1D row-blocked layout — processors P0–P3 each own one block row A(i) and the matching segments x(i), y(i)]
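The row-blocked algorithm above can be checked with a small sequential simulation (p and n hypothetical); the broadcast of every x(i) is modeled by simply assembling the full vector x:

import numpy as np

n, p = 8, 4                        # hypothetical sizes, p divides n
A = np.random.rand(n, n)
x, y = np.random.rand(n), np.random.rand(n)

# 1D row-blocked layout: "processor" i owns block row A(i) and segments x(i), y(i)
A_blk = np.array_split(A, p, axis=0)
x_blk = np.array_split(x, p)
y_blk = [seg.copy() for seg in np.array_split(y, p)]

x_full = np.concatenate(x_blk)     # stands in for broadcasting every x(i)
for i in range(p):                 # each processor computes its own y(i)
    y_blk[i] += A_blk[i] @ x_full

assert np.allclose(np.concatenate(y_blk), y + A @ x)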
30. 02/25/2016 CS267 Lecture 12 30
Matrix-Vector Product y = y + A*x
• A column layout of the matrix eliminates the broadcast of x
• But adds a reduction to update the destination y
• A 2D blocked layout uses a broadcast and reduction, both
on a subset of processors
• sqrt(p) for square processor grid
[Figure: 1D column layout across P0–P3 vs. a 4×4 2D blocked layout of P0–P15]
31. 02/25/2016 CS267 Lecture 12 31
Parallel Matrix Multiply
• Computing C=C+A*B
• Using basic algorithm: 2n³ flops
• Variables are:
• Data layout: 1D? 2D? Other?
• Topology of machine: Ring? Torus?
• Scheduling communication
• Use of performance models for algorithm design
• Message Time = “latency” + #words * time-per-word
= α + n·β
• Efficiency (in any model):
• serial time / (p * parallel time)
• perfect (linear) speedup efficiency = 1
32. 02/25/2016 CS267 Lecture 12 32
Matrix Multiply with 1D Column Layout
• Assume matrices are n x n and n is divisible by p
• A(i) refers to the n by n/p block column that processor i owns (similarly for B(i) and C(i))
• B(i,j) is the n/p by n/p subblock of B(i)
• in rows j*n/p through (j+1)*n/p - 1
• Algorithm uses the formula
C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)
[Figure: 1D column layout — block columns owned by p0 … p7]
May be a reasonable
assumption for analysis,
not for code
33. 02/25/2016 CS267 Lecture 12 33
Matrix Multiply: 1D Layout on Bus or Ring
• Algorithm uses the formula
C(i) = C(i) + A*B(i) = C(i) + Σj A(j)*B(j,i)
• First consider a bus-connected machine without
broadcast: only one pair of processors can
communicate at a time (ethernet)
• Second consider a machine with processors on a ring:
all processors may communicate with nearest neighbors
simultaneously
34. 02/25/2016 CS267 Lecture 12 34
MatMul: 1D layout on Bus without Broadcast
Naïve algorithm:
C(myproc) = C(myproc) + A(myproc)*B(myproc,myproc)
for i = 0 to p-1
for j = 0 to p-1 except i
if (myproc == i) send A(i) to processor j
if (myproc == j)
receive A(i) from processor i
C(myproc) = C(myproc) + A(i)*B(i,myproc)
barrier
Cost of inner loop:
computation: 2·n·(n/p)² = 2n³/p²
communication: α + β·n²/p
35. 02/25/2016 CS267 Lecture 12 35
Naïve MatMul (continued)
Cost of inner loop:
computation: 2·n·(n/p)² = 2n³/p²
communication: α + β·n²/p … approximately
Only 1 pair of processors (i and j) are active on any iteration,
and of those, only i is doing computation
=> the algorithm is almost entirely serial
Running time:
= (p·(p-1) + 1)·computation + p·(p-1)·communication
≈ 2n³ + p²·α + p·n²·β
This is worse than the serial time and grows with p.
36. 02/25/2016 CS267 Lecture 12 36
Matmul for 1D layout on a Processor Ring
• Pairs of adjacent processors can communicate simultaneously
Copy A(myproc) into Tmp
C(myproc) = C(myproc) + Tmp*B(myproc , myproc)
for j = 1 to p-1
Send Tmp to processor myproc+1 mod p
Receive Tmp from processor myproc-1 mod p
C(myproc) = C(myproc) + Tmp*B( myproc-j mod p , myproc)
• Same idea as for gravity in simple sharks and fish algorithm
• May want double buffering in practice for overlap
• Ignoring deadlock details in code
• Time of inner loop = 2·(α + β·n²/p) + 2·n·(n/p)²
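A sequential simulation of the ring algorithm above (sizes hypothetical); the list Tmp holds the block column each "processor" currently has, and the shift models the send/receive around the ring:

import numpy as np

n, p = 8, 4                                  # hypothetical sizes, p divides n
A, B = np.random.rand(n, n), np.random.rand(n, n)
nb = n // p
Acol = [A[:, i*nb:(i+1)*nb].copy() for i in range(p)]   # A(i): block column i
Bcol = [B[:, i*nb:(i+1)*nb].copy() for i in range(p)]
Ccol = [np.zeros((n, nb)) for _ in range(p)]

def Bsub(j, i):
    # B(j,i): rows j*nb .. (j+1)*nb-1 of block column B(i)
    return Bcol[i][j*nb:(j+1)*nb, :]

Tmp = [Acol[i].copy() for i in range(p)]     # each processor starts with its own A(i)
for i in range(p):
    Ccol[i] += Tmp[i] @ Bsub(i, i)
for j in range(1, p):
    Tmp = [Tmp[(i - 1) % p] for i in range(p)]      # shift Tmp one step around the ring
    for i in range(p):
        Ccol[i] += Tmp[i] @ Bsub((i - j) % p, i)    # Tmp[i] now holds A((i-j) mod p)

assert np.allclose(np.hstack(Ccol), A @ B)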
37. 02/25/2016 CS267 Lecture 12 37
Matmul for 1D layout on a Processor Ring
• Time of inner loop = 2·(α + β·n²/p) + 2·n·(n/p)²
• Total Time = 2·n·(n/p)² + (p-1) · Time of inner loop
• ≈ 2n³/p + 2·p·α + 2·β·n²
• (Nearly) Optimal for 1D layout on Ring or Bus, even with Broadcast:
• Perfect speedup for arithmetic
• A(myproc) must move to each other processor, costs at least
(p-1) * cost of sending n·(n/p) words
• Parallel Efficiency = 2n³ / (p * Total Time)
= 1/(1 + α·p²/(2n³) + β·p/(2n))
= 1/(1 + O(p/n))
• Grows to 1 as n/p increases (or α and β shrink)
• But far from communication lower bound
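Plugging numbers into the total-time and efficiency formulas above shows the 1/(1 + O(p/n)) behavior; the machine parameters below (time per flop, α, β) are purely hypothetical:

def ring_matmul_model(n, p, t_flop, alpha, beta):
    # Model of the 1D-ring matmul above, with an explicit time per flop.
    time = 2 * n**3 / p * t_flop + 2 * p * alpha + 2 * beta * n**2
    efficiency = (2 * n**3 * t_flop) / (p * time)
    return time, efficiency

# Hypothetical machine: 1 ns per flop, 10 us latency, 1 ns per word
for n in (1_000, 10_000, 100_000):
    t, e = ring_matmul_model(n, p=64, t_flop=1e-9, alpha=1e-5, beta=1e-9)
    print(f"n={n:>7}  time={t:9.3f} s  efficiency={e:.3f}")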
39. Summary of Parallel Matrix Multiply
• SUMMA
• Scalable Universal Matrix Multiply Algorithm
• Attains communication lower bounds (within log p)
• Cannon
• Historically first, attains lower bounds
• More assumptions
• A and B square
• P a perfect square
• 2.5D SUMMA
• Uses more memory to communicate even less
• Parallel Strassen
• Attains different, even lower bounds
02/25/2016 CS267 Lecture 12 39
40. 02/25/2016 CS267 Lecture 12 40
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
• Presentation from van de Geijn and Watts
• www.netlib.org/lapack/lawns/lawn96.ps
• Similar ideas appeared many times
• Used in practice in PBLAS = Parallel BLAS
• www.netlib.org/lapack/lawns/lawn100.ps
41. SUMMA uses Outer Product form of MatMul
• C = A*B means C(i,j) = Σk A(i,k)*B(k,j)
• Column-wise outer product:
C = A*B
= Σk A(:,k)*B(k,:)
= Σk (k-th col of A)*(k-th row of B)
• Block column-wise outer product
(block size = 4 for illustration)
C = A*B
= A(:,1:4)*B(1:4,:) + A(:,5:8)*B(5:8,:) + …
= Σk (k-th block of 4 cols of A)*(k-th block of 4 rows of B)
02/25/2016 CS267 Lecture 12 41
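The block outer-product identity above is easy to verify numerically (block size 4 as in the illustration; sizes hypothetical):

import numpy as np

n, b = 16, 4                          # hypothetical sizes; b = block of columns/rows
A, B = np.random.rand(n, n), np.random.rand(n, n)

C = np.zeros((n, n))
for k in range(0, n, b):
    # (k-th block of b cols of A) * (k-th block of b rows of B)
    C += A[:, k:k+b] @ B[k:k+b, :]

assert np.allclose(C, A @ B)          # sum of block outer products equals A*B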
42. SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
• C[i,j] is the n/P^(1/2) x n/P^(1/2) submatrix of C on processor Pij
• A[i,k] is an n/P^(1/2) x b submatrix of A
• B[k,j] is a b x n/P^(1/2) submatrix of B
• C[i,j] = C[i,j] + Σk A[i,k]*B[k,j]
• summation over submatrices
• Need not be a square processor grid
43. SUMMA – n x n matmul on P^(1/2) x P^(1/2) grid
For k = 0 to n/b-1
for all i = 1 to P^(1/2)
owner of A[i,k] broadcasts it to whole processor row (using binary tree)
for all j = 1 to P^(1/2)
owner of B[k,j] broadcasts it to whole processor column (using bin. tree)
Receive A[i,k] into Acol
Receive B[k,j] into Brow
C_myproc = C_myproc + Acol * Brow
02/25/2016 CS267 Lecture 12
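A sequential simulation of the SUMMA loop above on a virtual P^(1/2) x P^(1/2) grid (P, n, and b hypothetical); the row and column broadcasts are modeled by simply reading the owner's block:

import numpy as np

n, P, b = 12, 9, 2                       # hypothetical: 3 x 3 processor grid, block size b
q = int(P**0.5)                          # q = sqrt(P); n divisible by q and by b
s = n // q                               # each C[i,j] is an s x s submatrix

A, B = np.random.rand(n, n), np.random.rand(n, n)
C = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]

for k in range(0, n, b):
    for i in range(q):
        Acol = A[i*s:(i+1)*s, k:k+b]     # A[i,k], "broadcast" along processor row i
        for j in range(q):
            Brow = B[k:k+b, j*s:(j+1)*s] # B[k,j], "broadcast" along processor column j
            C[i][j] += Acol @ Brow       # local rank-b update on processor (i,j)

assert np.allclose(np.block(C), A @ B)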
44. 44
SUMMA Costs
For k=0 to n/b-1
for all i = 1 to P1/2
owner of A[i,k] broadcasts it to whole processor row (using binary tree)
… #words = log P^(1/2) · b·n/P^(1/2), #messages = log P^(1/2)
for all j = 1 to P^(1/2)
owner of B[k,j] broadcasts it to whole processor column (using bin. tree)
… same #words and #messages
Receive A[i,k] into Acol
Receive B[k,j] into Brow
C_myproc = C_myproc + Acol * Brow … #flops = 2n²·b/P
• Total #words = log P · n²/P^(1/2)
• Within a factor of log P of the lower bound
• (a more complicated implementation removes the log P factor)
• Total #messages = log P · n/b
• Choose b close to its maximum, n/P^(1/2), to approach the lower bound P^(1/2)
• Total #flops = 2n³/P
45. 02/25/2016 CS267 Lecture 8 45
PDGEMM = PBLAS routine
for matrix multiply
Observations:
For fixed N, as P increases
Mflops increases, but
less than 100% efficiency
For fixed P, as N increases,
Mflops (efficiency) rises
DGEMM = BLAS routine
for matrix multiply
Maximum speed for PDGEMM
= # Procs * speed of DGEMM
Observations (same as above):
Efficiency always at least 48%
For fixed N, as P increases,
efficiency drops
For fixed P, as N increases,
efficiency increases
46. 46
Can we do better?
• Lower bound assumed 1 copy of data: M = O(n²/P) per proc.
• What if the matrix is small enough to fit c > 1 copies, so M = cn²/P ?
• #words_moved = Ω(#flops / M^(1/2)) = Ω(n² / (c^(1/2) P^(1/2)))
• #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2))
• Can we attain the new lower bound?
• Special case: “3D Matmul”: c = P^(1/3)
• Bernsten 89, Agarwal, Chandra, Snir 90, Aggarwal 95
• Processors arranged in a P^(1/3) x P^(1/3) x P^(1/3) grid
• Processor (i,j,k) performs C(i,j) = C(i,j) + A(i,k)*B(k,j), where
each submatrix is n/P^(1/3) x n/P^(1/3)
• Not always that much memory available…
02/25/2016 CS267 Lecture 12
47. 2.5D Matrix Multiplication
• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
[Figure: processor grid with c layers, each (P/c)^(1/2) x (P/c)^(1/2); example: P = 32, c = 2]
02/25/2016 CS267 Lecture 12
48. 2.5D Matrix Multiplication
• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along k-axis so P(i,j,0) owns C(i,j)
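The essence of steps (1)–(3) can be checked with a tiny sequential sketch (sizes hypothetical): each of the c layers computes a disjoint 1/c of the k-sum, and a reduction combines the partial products:

import numpy as np

n, c = 8, 2                                    # hypothetical: c replicated layers
A, B = np.random.rand(n, n), np.random.rand(n, n)

layers = np.array_split(np.arange(n), c)       # split the summation index among the c layers
partial = [A[:, ks] @ B[ks, :] for ks in layers]   # step (2): each layer does 1/c of the sum
C = sum(partial)                                   # step (3): sum-reduce along the k-axis

assert np.allclose(C, A @ B)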
49. 2.5D Matmul on IBM BG/P, n=64K
• As P increases, available memory grows, so c increases proportionally to P
• #flops, #words_moved, #messages per proc all decrease proportionally to P
• #words_moved = Ω(#flops / M^(1/2)) = Ω(n² / (c^(1/2) P^(1/2)))
• #messages = Ω(#flops / M^(3/2)) = Ω(P^(1/2) / c^(3/2))
• Perfect strong scaling! But only up to c = P^(1/3)
50. 2.5D Matmul on IBM BG/P, 16K nodes / 64K cores
02/25/2016 CS267 Lecture 12
51. 2.5D Matmul on IBM BG/P, 16K nodes / 64K cores
c = 16 copies
Distinguished Paper Award, EuroPar’11
SC’11 paper by Solomonik, Bhatele, D.
02/25/2016
52. Perfect Strong Scaling – in Time and Energy
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n²
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
• γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n³/(cP) · [γT + βT/M^(1/2) + αT/(m·M^(1/2))]
= T(P)/c
• Notation for energy model:
• γE, βE, αE = joules for same operations
• δE = joules per word of memory used per sec
• εE = joules per sec for leakage, etc.
• E(cP) = cP · { n³/(cP) · [γE + βE/M^(1/2) + αE/(m·M^(1/2))] + δE·M·T(cP) + εE·T(cP) }
= E(P)
• c cannot increase forever: c ≤ P^(1/3) (3D algorithm)
• Corresponds to lower bound on #messages hitting 1
• Perfect scaling extends to Strassen’s matmul, direct N-body, …
• “Perfect Strong Scaling Using No Additional Energy”
• “Strong Scaling of Matmul and Memory-Indep. Comm. Lower Bounds”
• Both at bebop.cs.berkeley.edu
53. Classical Matmul vs Parallel Strassen
• Complexity of classical Matmul vs Strassen
• Flops: O(n³/p) vs O(n^ω/p) where ω = log₂7 ≈ 2.81
• Communication lower bound on #words:
Ω((n³/p)/M^(1/2)) = Ω(M·(n/M^(1/2))³/p) vs Ω(M·(n/M^(1/2))^ω/p)
• Communication lower bound on #messages:
Ω((n³/p)/M^(3/2)) = Ω((n/M^(1/2))³/p) vs Ω((n/M^(1/2))^ω/p)
• All attainable as M increases past O(n²/p), up to a limit:
can increase M by a factor of up to p^(1/3) vs p^(1-2/ω)
#words as low as Ω(n²/p^(2/3)) vs Ω(n²/p^(2/ω))
• Best Paper Prize, SPAA’11, Ballard, D., Holtz, Schwartz
• How well does parallel Strassen work in practice?
02/27/2014 CS267 Lecture 12 53
54. Strong scaling of Matmul on Hopper (n=94080)
02/25/2016 54
G. Ballard, D., O. Holtz, B. Lipshitz, O. Schwartz
“Communication-Avoiding Parallel Strassen”
bebop.cs.berkeley.edu, Supercomputing’12
56. Extensions of Lower Bound and
Optimal Algorithms
• For each processor that does G flops with fast memory of size M
#words_moved = Ω(G / M^(1/2))
• Extension: for any program that “smells like”
• Nested loops …
• That access arrays …
• Where array subscripts are linear functions of loop indices
• Ex: A(i,j), B(3*i-4*k+5*j, i-j, 2*k, …), …
• There is a constant s such that
#words_moved = Ω(G / M^(s-1))
• s comes from recent generalization of Loomis-Whitney (s=3/2)
• Ex: linear algebra, n-body, database join, …
• Lots of open questions: deriving s, optimal algorithms …
02/25/2016 CS267 Lecture 12 56
57. Proof of Communication Lower Bound on C = A·B (1/4)
• Proof from Irony/Toledo/Tiskin (2004)
• Think of instruction stream being executed
• Looks like “ … add, load, multiply, store, load, add, …”
• Each load/store moves a word between fast and slow memory
• We want to count the number of loads and stores, given that we are
multiplying n-by-n matrices C = A·B using the usual 2n³ flops, possibly
reordered assuming addition is commutative/associative
• Assuming that at most M words can be stored in fast memory
• Outline:
• Break instruction stream into segments, each with M loads and stores
• Somehow bound the maximum number of flops that can be done in
each segment, call it F
• So F · #segments ≥ T = total flops = 2n³, so #segments ≥ T/F
• So #loads & stores = M · #segments ≥ M · T/F
59. Proof of Communication Lower Bound on C = A·B (2/4)
[Figure: the 2n³ multiply-adds drawn as unit cubes in an n×n×n cube; the cube at (i,j,k) represents C(1,1) += A(1,3)·B(3,1)-style updates, with entries of A, B, C appearing as squares on the “A face”, “B face”, and “C face”]
• If we have at most 2M “A squares”, 2M “B squares”, and
2M “C squares” on faces, how many cubes can we have?
60. Proof of Communication Lower Bound on C = A·B (3/4)
[Figure: a 3D set of unit cubes indexed by (i,j,k) and its three 2D shadows — the A shadow on the (i,k) plane, the B shadow on the (j,k) plane, the C shadow on the (i,j) plane — together with a black box of side lengths x, y, z]
# cubes in black box with side lengths x, y and z
= Volume of black box
= x·y·z
= (xz · zy · yx)^(1/2)
= (#A squares · #B squares · #C squares)^(1/2)
(i,k) is in the A shadow if (i,j,k) is in the 3D set
(j,k) is in the B shadow if (i,j,k) is in the 3D set
(i,j) is in the C shadow if (i,j,k) is in the 3D set
Thm (Loomis & Whitney, 1949):
# cubes in 3D set = Volume of 3D set
≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
61. Proof of Communication Lower Bound on C = A·B (4/4)
• Consider one “segment” of instructions with M loads, stores
• Can be at most 2M entries of A, B, C available in one segment
• Volume of set of cubes representing possible multiply/adds in
one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
• # Segments ≥ 2n³ / F
• # Loads & Stores = M · #Segments ≥ M · 2n³/F ≥ n³/(2M)^(1/2) – M = Ω(n³ / M^(1/2))
• Parallel Case: apply reasoning to one processor out of P
• # Adds and Muls ≥ 2n³/P (at least one proc does this)
• M = n²/P (each processor gets an equal fraction of the matrix)
• # “Loads & Stores” = # words moved from or to other procs
≥ M · (2n³/P) / F = M · (2n³/P) / (2M)^(3/2) = n²/(2P)^(1/2)
63. 2/27/08 CS267 Guest Lecture 1 91
Recursive Layouts
• For both cache hierarchies and parallelism, recursive
layouts may be useful
• Z-Morton, U-Morton, and X-Morton Layout
• Also Hilbert layout and others
• What about the user’s view?
• Fortunately, many problems can be solved on a
permutation
• Never need to actually change the user’s layout
64. 02/09/2006 CS267 Lecture 8 92
Gaussian Elimination
[Figure: three ways to organize Gaussian Elimination]
• Standard way: subtract a multiple of a row
• LINPACK: apply a sequence of updates to a column (a3 = a3 - a1*a2)
• LAPACK: apply the sequence to a block of nb columns (a2 = L⁻¹·a2), then apply the nb updates to the rest of the matrix
Slide source: Dongarra
65. 02/09/2006 CS267 Lecture 8 93
LU Algorithm:
1: Split matrix into two rectangles (m x n/2)
if only 1 column, scale by reciprocal of pivot & return
2: Apply LU Algorithm to the left part
3: Apply transformations to right part
(triangular solve A12 = L⁻¹·A12 and
matrix multiplication A22 = A22 - A21·A12)
4: Apply LU Algorithm to right part
Gaussian Elimination via a Recursive Algorithm
[Figure: matrix partitioned into blocks L (already factored), A12, A21, A22]
F. Gustavson and S. Toledo
Most of the work in the matrix multiply
Matrices of size n/2, n/4, n/8, …
Slide source: Dongarra
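A minimal NumPy sketch of the recursive scheme above, ignoring pivoting for brevity (the Gustavson/Toledo algorithm pivots within the panel); L is stored below the diagonal with an implicit unit diagonal, U on and above it:

import numpy as np

def rec_lu(A):
    # Recursive LU, no pivoting: split, factor left, update right, recurse.
    A = A.copy()
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                              # scale by reciprocal of the pivot
        return A
    k = n // 2
    A[:, :k] = rec_lu(A[:, :k])                          # step 2: LU of the left part
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])          # step 3: A12 = L^-1 * A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]                   #         A22 = A22 - A21*A12
    A[k:, k:] = rec_lu(A[k:, k:])                        # step 4: LU of the right part
    return A

n = 8
M = np.random.rand(n, n) + n * np.eye(n)                 # diagonally dominant: safe without pivoting
F = rec_lu(M)
L, U = np.tril(F, -1) + np.eye(n), np.triu(F)
assert np.allclose(L @ U, M)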
66. 02/09/2006 CS267 Lecture 8 94
Recursive Factorizations
• Just as accurate as conventional method
• Same number of operations
• Automatic variable blocking
• Level 1 and 3 BLAS only !
• Extreme clarity and simplicity of expression
• Highly efficient
• The recursive formulation is just a rearrangement of the point-wise
LINPACK algorithm
• The standard error analysis applies (assuming the matrix
operations are computed the “conventional” way).
Slide source: Dongarra
68. 02/09/2006 CS267 Lecture 8 96
Review: BLAS 3 (Blocked) GEPP
for ib = 1 to n-1 step b … Process matrix b columns at a time
end = ib + b-1 … Point to end of block of b columns
apply BLAS2 version of GEPP to get A(ib:n , ib:end) = P’ * L’ * U’
… let LL denote the strict lower triangular part of A(ib:end , ib:end) + I
A(ib:end , end+1:n) = LL⁻¹ * A(ib:end , end+1:n) … update next b rows of U
A(end+1:n , end+1:n ) = A(end+1:n , end+1:n )
- A(end+1:n , ib:end) * A(ib:end , end+1:n)
… apply delayed updates with single matrix-multiply
… with inner dimension b
BLAS 3
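A runnable NumPy sketch of this blocked organization (not the LAPACK code itself): the panel is factored with an unblocked BLAS-2 style loop, and the trailing matrix receives the delayed BLAS-3 updates:

import numpy as np

def blocked_getrf(A, b=4):
    # Blocked LU with partial pivoting, following the loop above.
    # Returns the packed LU factors and the row permutation piv.
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)
    for ib in range(0, n, b):
        end = min(ib + b, n)
        # BLAS-2 style GEPP on the panel A(ib:n, ib:end)
        for k in range(ib, end):
            p = k + np.argmax(np.abs(A[k:, k]))
            if p != k:
                A[[k, p], :] = A[[p, k], :]              # swap full rows
                piv[[k, p]] = piv[[p, k]]
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:end] -= np.outer(A[k+1:, k], A[k, k+1:end])
        if end < n:
            # LL = unit lower triangle of A(ib:end, ib:end)
            LL = np.tril(A[ib:end, ib:end], -1) + np.eye(end - ib)
            # update next b rows of U
            A[ib:end, end:] = np.linalg.solve(LL, A[ib:end, end:])
            # delayed update with a single matrix multiply (inner dimension b)
            A[end:, end:] -= A[end:, ib:end] @ A[ib:end, end:]
    return A, piv

M = np.random.rand(8, 8)
F, piv = blocked_getrf(M)
L = np.tril(F, -1) + np.eye(8)
assert np.allclose(L @ np.triu(F), M[piv])               # P*M = L*U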
69. 02/09/2006 CS267 Lecture 8 97
Review: Row and Column Block Cyclic Layout
processors and matrix blocks
are distributed in a 2d array
pcol-fold parallelism
in any column, and calls to the
BLAS2 and BLAS3 on matrices of
size brow-by-bcol
serial bottleneck is eased
need not be symmetric in rows and
columns
70. 02/09/2006 CS267 Lecture 8 98
Distributed GE with a 2D Block Cyclic Layout
block size b in the algorithm and the block sizes brow
and bcol in the layout satisfy b=brow=bcol.
shaded regions indicate busy processors or
communication performed.
unnecessary to have a barrier between each
step of the algorithm, e.g.. step 9, 10, and 11 can be
pipelined
73. 02/09/2006 CS267 Lecture 8 101
PDGESV = ScaLAPACK
parallel LU routine
Since it can run no faster than its
inner loop (PDGEMM), we measure:
Efficiency =
Speed(PDGESV)/Speed(PDGEMM)
Observations:
Efficiency well above 50% for large
enough problems
For fixed N, as P increases,
efficiency decreases
(just as for PDGEMM)
For fixed P, as N increases
efficiency increases
(just as for PDGEMM)
From bottom table, cost of solving
Ax=b about half of matrix multiply
for large enough matrices.
From the flop counts we would
expect it to be (2n³)/((2/3)n³) = 3
times faster, but communication
makes it a little slower.
76. 02/09/2006 CS267 Lecture 8 104
Old version,
pre 1998 Gordon Bell Prize
Still have ideas to accelerate
Project Available!
Old Algorithm,
plan to abandon
77. 02/09/2006 CS267 Lecture 8 105
Have good ideas to speedup
Project available!
Hardest of all to parallelize
Have alternative, and
would like to compare
Project available!
78. 02/09/2006 CS267 Lecture 8 106
Out-of-core means
matrix lives on disk;
too big for main mem
Much harder to hide
latency of disk
QR much easier than LU
because no pivoting
needed for QR
Moral: use QR to solve Ax=b
Projects available
(perhaps very hard…)
80. 02/09/2006 CS267 Lecture 8 108
Work-Depth Model of Parallelism
• The work-depth model:
• The simplest model, used for algorithm design independent of any machine
• The work, W, is the total number of operations
• The depth, D, is the longest chain of dependencies
• The parallelism, P, is defined as W/D
• Specific examples include:
• circuit model, each input defines a graph with ops at
nodes
• vector model, each step is an operation on a vector of
elements
• language model, where set of operations defined by
language
81. 02/09/2006 CS267 Lecture 8 109
Latency Bandwidth Model
• Network of fixed number P of processors
• fully connected
• each with local memory
• Latency (α)
• accounts for varying performance with number of messages
• gap (g) in the LogP model may be a more accurate cost if
messages are pipelined
• Inverse bandwidth (β)
• accounts for performance varying with volume of data
• Efficiency (in any model):
• serial time / (p * parallel time)
• perfect (linear) speedup efficiency = 1
84. 2/25/2009 CS267 Lecture 8 112
Motivation (1)
3 Basic Linear Algebra Problems
1. Linear Equations: Solve Ax=b for x
2. Least Squares: Find x that minimizes ||r||₂ = (Σᵢ rᵢ²)^(1/2)
where r = Ax - b
• Statistics: Fitting data with simple functions
3a. Eigenvalues: Find λ and x where Ax = λx
• Vibration analysis, e.g., earthquakes, circuits
3b. Singular Value Decomposition: AᵀA·x = σ²·x
• Data fitting, Information retrieval
Lots of variations depending on structure of A
• A symmetric, positive definite, banded, …
85. 2/25/2009 CS267 Lecture 8 113
Motivation (2)
•Why dense A, as opposed to sparse A?
• Many large matrices are sparse, but …
• Dense algorithms easier to understand
• Some applications yield large dense
matrices
• LINPACK Benchmark (www.top500.org)
• “How fast is your computer?” =
“How fast can you solve dense Ax=b?”
• Large sparse matrix algorithms often yield
smaller (but still large) dense problems
• Do ParLab Apps most use small dense matrices?
86. 02/25/2009 CS267 Lecture 8
Algorithms for 2D (3D) Poisson Equation (N = n² (n³) vars)
Algorithm — Serial, PRAM, Memory, #Procs:
• Dense LU: N³, N, N², N²
• Band LU: N² (N^(7/3)), N, N^(3/2) (N^(5/3)), N (N^(4/3))
• Jacobi: N² (N^(5/3)), N (N^(2/3)), N, N
• Explicit Inv.: N², log N, N², N²
• Conj. Gradients: N^(3/2) (N^(4/3)), N^(1/2) (N^(1/3)) · log N, N, N
• Red/Black SOR: N^(3/2) (N^(4/3)), N^(1/2) (N^(1/3)), N, N
• Sparse LU: N^(3/2) (N²), N^(1/2), N·log N (N^(4/3)), N
• FFT: N·log N, log N, N, N
• Multigrid: N, log² N, N, N
• Lower bound: N, log N, N
PRAM is an idealized parallel model with zero cost communication
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997
(Note: corrected complexities for 3D case from last lecture!).
87. Lessons and Questions (1)
• Structure of the problem matters
• Cost of solution can vary dramatically (n3 to n)
• Many other examples
• Some structure can be figured out automatically
• “A\b” can figure out symmetry, some sparsity
• Some structures known only to (smart) user
• If performance not critical, user may be happy to settle for A\b
• How much of this goes into the motifs?
• How much should we try to help user choose?
• Tuning, but at algorithmic choice level (SALSA)
• Motifs overlap
• Dense, sparse, (un)structured grids, spectral
88. Organizing Linear Algebra (1)
• By Operations
• Low level (eg mat-mul: BLAS)
• Standard level (eg solve Ax=b, Ax=λx: Sca/LAPACK)
• Applications level (eg systems & control: SLICOT)
• By Performance/accuracy tradeoffs
• “Direct methods” with guarantees vs “iterative methods” that
may work faster and accurately enough
• By Structure
• Storage
• Dense
– columnwise, rowwise, 2D block cyclic, recursive space-filling curves
• Banded, sparse (many flavors), black-box, …
• Mathematical
• Symmetries, positive definiteness, conditioning, …
• As diverse as the world being modeled
89. Organizing Linear Algebra (2)
• By Data Type
• Real vs Complex
• Floating point (fixed or varying length), other
• By Target Platform
• Serial, manycore, GPU, distributed memory, out-of-
DRAM, Grid, …
• By programming interface
• Language bindings
• “A\b” versus access to details
90. For all linear algebra problems:
Ex: LAPACK Table of Contents
• Linear Systems
• Least Squares
• Overdetermined, underdetermined
• Unconstrained, constrained, weighted
• Eigenvalues and vectors of Symmetric Matrices
• Standard (Ax = λx), Generalized (Ax = λBx)
• Eigenvalues and vectors of Unsymmetric matrices
• Eigenvalues, Schur form, eigenvectors, invariant subspaces
• Standard, Generalized
• Singular Values and vectors (SVD)
• Standard, Generalized
• Level of detail
• Simple Driver
• Expert Drivers with error bounds, extra-precision, other options
• Lower level routines (“apply certain kind of orthogonal transformation”)
91. For all matrix/problem structures:
Ex: LAPACK Table of Contents
• BD – bidiagonal
• GB – general banded
• GE – general
• GG – general , pair
• GT – tridiagonal
• HB – Hermitian banded
• HE – Hermitian
• HG – upper Hessenberg, pair
• HP – Hermitian, packed
• HS – upper Hessenberg
• OR – (real) orthogonal
• OP – (real) orthogonal, packed
• PB – positive definite, banded
• PO – positive definite
• PP – positive definite, packed
• PT – positive definite, tridiagonal
• SB – symmetric, banded
• SP – symmetric, packed
• ST – symmetric, tridiagonal
• SY – symmetric
• TB – triangular, banded
• TG – triangular, pair
• TP – triangular, packed
• TR – triangular
• TZ – trapezoidal
• UN – unitary
• UP – unitary packed
98. For all data types:
Ex: LAPACK Table of Contents
• Real and complex
• Single and double precision
• Arbitrary precision in progress
99. Organizing Linear Algebra (3)
www.netlib.org/lapack www.netlib.org/scalapack
www.cs.utk.edu/~dongarra/etemplates
www.netlib.org/templates
gams.nist.gov
100. 2/27/08 CS267 Guest Lecture 1 128
Review of the BLAS
BLAS level, example, # mem refs, # flops, q:
• Level 1 (“Axpy”, dot product): 3n mem refs, 2n flops, q = 2/3
• Level 2 (matrix-vector mult): n² mem refs, 2n² flops, q = 2
• Building blocks for all linear algebra
• Parallel versions call serial versions on each processor
• So they must be fast!
• Define q = # flops / # mem refs = “computational intensity”
• The larger is q, the faster the algorithm can go in the
presence of memory hierarchy
• “axpy”: y = α·x + y, where α is a scalar and x, y are vectors
101. Summary of Parallel Matrix Multiplication so far
• 1D Layout
• Bus without broadcast - slower than serial
• Nearest neighbor communication on a ring (or bus with
broadcast): Efficiency = 1/(1 + O(p/n))
• 2D Layout – one copy of all matrices (O(n²/p) per processor)
• Cannon
• Efficiency = 1/(1 + O(α·(√p/n)³ + β·√p/n)) – optimal!
• Hard to generalize for general p, n, block cyclic, alignment
• SUMMA
• Efficiency = 1/(1 + O(α·log p·p/(b·n²) + β·log p·√p/n)), where b is the block size
• Very general
• b small => less memory, lower efficiency
• b large => more memory, higher efficiency
• Used in practice (PBLAS)
Why?