What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...Geoffrey Fox
Advances in high-performance/parallel computing in the 1980s and 1990s were spurred by the development of quality high-performance libraries, e.g., ScaLAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that such benchmarks should be driven by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the Apache big data stack as well as the most common application requirements, while building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks – by no means currently complete – that covers these characteristics.
We hope this will encourage debate.
Matching Data Intensive Applications and Hardware/Software Architectures – Geoffrey Fox
There is perhaps a broad consensus as to the important issues in practical parallel computing as applied to large-scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However, the same is not true for data-intensive problems, even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data-intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate the issues with examples including kernels like clustering and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep learning.
Classification of Big Data Use Cases by Different Facets – Geoffrey Fox
Ogres classify Big Data applications by multiple facets – each with several exemplars and features. This gives a guide to the breadth and depth of Big Data and allows one to examine which Ogres a particular architecture/software stack supports.
Comparing Big Data and Simulation Applications and Implications for Software ... – Geoffrey Fox
At eScience in the Cloud 2014, Redmond, WA, April 30, 2014.
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not true for data-intensive problems, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC clusters are presented.
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC – Geoffrey Fox
This talk proposes an integration of HPC and Apache technologies. HPC-ABDS integration areas include:
File systems
Cluster resource management
File and object data management
Inter-process and thread communication
Analytics libraries
Workflow
Monitoring
Studies of HPCC Systems from Machine Learning Perspectives – HPCC Systems
Ying Xie & Pooja Chenna, Kennesaw State University, present at the 2015 HPCC Systems Engineering Summit Community Day. Deep learning has emerged as a new breakthrough in the area of Machine Learning and has revitalized research in Artificial Intelligence (AI). Deep learning techniques have been widely used in image recognition and natural language processing. In this presentation, we will show our implementation in ECL of an important deep neural network architecture called the Deep Belief Network. We will further illustrate how to apply our implementation to solve the problem of network intrusion detection based on massive data sets on HPCC Systems. Additionally, we will report our study on how to optimally configure a stacked autoencoder, another deep learning architecture, on HPCC Systems for the purpose of both supervised and unsupervised learning on different types of data sets.
High Performance Data Analytics and a Java Grande Run Time – Geoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not true for data-intensive problems, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep learning and multi-dimensional scaling.
One suggestion from this work is the value of a high-performance Java (Grande) runtime that supports both simulations and big data.
We present a software model built on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss the layers in this stack.
We give examples of integrating ABDS with HPC.
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators.
We present Cloudmesh as supporting Software-Defined Distributed System as a Service, or SDDSaaS, with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported.
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...Geoffrey Fox
Advances in high-performance/parallel computing in the 1980's and 90's was spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we motivate that such benchmarks should be motivated by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" bigdata stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
Classification of Big Data Use Cases by different FacetsGeoffrey Fox
Ogres classify Big Data applications by multiple facets – each with several exemplars and features. This gives a
guide to breadth and depth of Big Data and allows one to examine which ogres a particular architecture/software support.
Comparing Big Data and Simulation Applications and Implications for Software ...Geoffrey Fox
At eScience in the Cloud 2014, Redmond WA, April 30 2014
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However the same is not so true for data intensive, even though commercially clouds devote much more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC Clusters are presented
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC Geoffrey Fox
This proposes an integration of HPC and Apache Technologies. HPC-ABDS+ Integration areas include
File systems,
Cluster resource management,
File and object data management,
Inter process and thread communication,
Analytics libraries,
Workflow
Monitoring
Studies of HPCC Systems from Machine Learning PerspectivesHPCC Systems
Ying Xie & Pooja Chenna, Kennesaw State University, present at the 2015 HPCC Systems Engineering Summit Community Day. Deep learning has been emerged as a new breakthrough in the area of Machine Learning and revitalized the research in Artificial Intelligence (AI). Deep learning techniques have been widely used in image recognition and natural language processing. In this presentation, we will show our implementation of an important deep neural network architecture that is called Deep Belief Network in ECL. We will further illustrate how to apply our implementation to solve the problem of network intrusion detection based on massive data sets on HPCC Systems. Additionally, we will report our study on how to optimally configure a stacked auto encoder, another deep learning architecture, on HPCC Systems for the purpose of both supervised and unsupervised learning on different types of data sets.
High Performance Data Analytics and a Java Grande Run TimeGeoffrey Fox
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However the same is not so true for data intensive even though commercially clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is value of a high performance Java (Grande) runtime that supports simulations and big data
We present a software model built on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ... – Geoffrey Fox
Keynote at the Sixth International Workshop on Cloud Data Management, CloudDB 2014, Chicago, March 31, 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view, or facets, covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and, very importantly, data source.
We then propose that in many cases it is wise to combine the well-known commodity best-practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC Apache integration is particularly important: File systems, Cluster resource management, File and object data management, Inter process and thread communication, Analytics libraries, Workflow and Monitoring.
See
[1] A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, accepted in IEEE BigData 2014, available at: http://arxiv.org/abs/1403.1528
[2] High Performance High Functionality Big Data Software Stack, G Fox, J Qiu and S Jha, in Big Data and Extreme-scale Computing (BDEC), 2014. Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
Hadoop - Architectural road map for Hadoop Ecosystem – nallagangus
This document provides an overview of an architectural roadmap for implementing a Hadoop ecosystem. It begins with definitions of big data and Hadoop's history. It then describes the core components of Hadoop, including HDFS, MapReduce, YARN, and ecosystem tools for abstraction, data ingestion, real-time access, workflow, and analytics. Finally, it discusses security enhancements that have been added to Hadoop as it has become more mainstream.
The document introduces the Hadoop ecosystem, which provides an approach for handling large amounts of data across commodity hardware. Hadoop is an open source software framework that uses MapReduce and HDFS to allow distributed storage and processing of large datasets across clusters of computers. It has been adopted by many large companies as a standard for batch processing of big data. The document describes how Hadoop is used by organizations to combine different datasets, remove data silos, and enable new types of experiments and analyses.
High Performance Data Analytics with Java on Large Multicore HPC Clusters – Saliya Ekanayake
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytic applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper presents the implementation of a high performance big data analytics library - SPIDAL Java - with a comprehensive discussion on five performance challenges, solutions, and speedup results. SPIDAL Java captures a class of global machine learning applications with significant computation and communication that can serve as a yardstick in studying performance bottlenecks with Java big data analytics. The five challenges presented here are the cost of intra-node messaging, inefficient cache utilization, performance costs with threads, overhead of garbage collection, and the costs of heap-allocated objects. SPIDAL Java presents its solutions to these and demonstrates significant performance gains and scalability when running on up to 3072 cores on one of the latest Intel Haswell-based multicore clusters.
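To make two of these challenges concrete, here is a minimal, illustrative Java sketch, not SPIDAL Java's actual code: replacing one small heap object per data point with a flat primitive array removes millions of allocations (less garbage-collection pressure) and yields contiguous, cache-friendly access.

```java
// Illustrative sketch (NOT SPIDAL Java code) of garbage-collection overhead
// and heap-allocated object cost: object-per-point vs. flat primitive array.
public class HeapLayoutDemo {
  static final int N = 1_000_000, DIM = 3;

  static final class Point { double x, y, z; } // object-per-point layout

  public static void main(String[] args) {
    // Layout A: N small heap objects (GC churn, pointer chasing).
    Point[] boxed = new Point[N];
    for (int i = 0; i < N; i++) { boxed[i] = new Point(); boxed[i].x = i; }

    // Layout B: one contiguous double[] holding all coordinates.
    double[] flat = new double[N * DIM];
    for (int i = 0; i < N; i++) flat[i * DIM] = i; // x-coordinate of point i

    double sa = 0, sb = 0;
    for (int i = 0; i < N; i++) sa += boxed[i].x;    // scattered reads
    for (int i = 0; i < N; i++) sb += flat[i * DIM]; // sequential reads
    System.out.println(sa == sb ? "same result, very different memory behavior" : "bug");
  }
}
```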
High Performance Processing of Streaming Data – Geoffrey Fox
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack – SLAM (Simultaneous Localization & Mapping) and collision avoidance. Performance (response time) is studied and improved as an example of the HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes... – Geoffrey Fox
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing current capabilities of Apache Hadoop, Spark, Flink and Heron as well as MPI and Asynchronous Many-Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function-as-a-Service architecture. Note this "new grid" is focused on data and IoT, not computing. Use interoperable common abstractions but multiple polymorphic implementations.
Apache Hadoop is a software framework that supports distributed applications under a free license. It allows applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by the Google papers on MapReduce and the Google File System (GFS).
Hadoop is a top-level Apache project being built and used by a global community of contributors, in the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively in its business.
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... – inside-BigData.com
DK Panda from Ohio State University presented this deck at the Switzerland HPC Conference.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Mem- cached on modern HPC clusters. An overview of RDMA-based designs for multiple com- ponents of Hadoop (HDFS, MapReduce, RPC and HBase), Spark, and Memcached will be presented. Enhanced designs for these components to exploit in-memory technology and parallel file systems (such as Lustre) will be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video presentation: https://www.youtube.com/watch?v=glf2KITDdVs
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Apache Hadoop introduction and architecture – Harikrishnan K
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. The core of Hadoop is a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce. HDFS provides distributed storage and MapReduce enables distributed processing of large datasets in a reliable, fault-tolerant and scalable manner. Hadoop has become popular for distributed computing because it is reliable, economical and scalable enough to handle large and varying amounts of data. A sketch of the storage half follows.
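As a hedged, minimal illustration, the Java sketch below reads a file from HDFS with the standard Hadoop client API; the /data/sample.txt path is hypothetical, and block locations and replica selection are handled transparently by the client.

```java
// Minimal HDFS read using the standard Hadoop FileSystem client API.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader in = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) { // hypothetical path
      // Blocks are fetched transparently from whichever DataNodes hold replicas.
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```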
Big data | Hadoop | components of Hadoop – Rahul Singh
This document provides an overview of big data concepts and the Hadoop framework. It discusses the volume, variety, velocity, and veracity challenges of big data. It introduces Hadoop distributed file system (HDFS) and MapReduce programming model for processing large datasets across clusters of computers. Examples are given of how Google, Facebook, and the New York Times use Hadoop to manage petabytes of data and perform tasks like processing web searches, photos, and converting historical newspaper articles to PDF.
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C... – Geoffrey Fox
Describes the relations between Big Data and Big Simulation applications and how these can guide a Big Data - Exascale (Big Simulation) convergence (as in the National Strategic Computing Initiative) and lead to a "complete" set of benchmarks. The basic idea is to view use cases as "Data" + "Model".
Visualizing and Clustering Life Science Applications in Parallel – Geoffrey Fox
HiCOMB 2015, 14th IEEE International Workshop on High Performance Computational Biology, at IPDPS 2015, Hyderabad, India. This talk covers parallel data analytics for bioinformatics. Messages are:
Always run MDS. Gives insight into data and performance of machine learning
Leads to a data browser as GIS gives for spatial data
3D better than 2D
~20D better than MSA?
Clustering Observations
Do you care about quality, or are you just cutting up space into parts?
Deterministic clustering is always more robust
Continuous clustering enables hierarchy
Trimmed clustering cuts off tails
Distinct O(N) and O(N²) algorithms
Use Conjugate Gradient
Big data is being collected from many sources like the web, social networks, and businesses. Hadoop is an open source software framework that can process large datasets across clusters of computers. It uses a programming model called MapReduce that allows automatic parallelization and fault tolerance. Hadoop uses commodity hardware and can handle various data formats and large volumes of data distributed across clusters. Companies like Cloudera provide tools and services to help users manage and analyze big data with Hadoop.
Big-data analytics beyond Hadoop - Big data is not equal to Hadoop, especially for iterative algorithms! Many alternatives have emerged. Spark and GraphLab are the most interesting next-generation platforms for analytics.
This document provides an introduction to Hadoop and big data concepts. It discusses what big data is, the four V's of big data (volume, velocity, variety, and veracity), different data types (structured, semi-structured, unstructured), how data is generated, and the Apache Hadoop framework. It also covers core Hadoop components like HDFS, YARN, and MapReduce, common Hadoop users, the difference between Hadoop and RDBMS systems, Hadoop cluster modes, the Hadoop ecosystem, HDFS daemons and architecture, and basic Hadoop commands.
No, combiner and reducer logic are not in general the same, although when the reduce operation is associative and commutative (e.g., summing counts) the same class can safely be reused for both.
Combiner is an optional step that performs local aggregation of the intermediate key-value pairs generated by the mappers. Its goal is to reduce the amount of data transferred from the mapper to the reducer.
Reducer performs the final aggregation of the values associated with a particular key. It receives the intermediate outputs from all the mappers, groups them by key, and produces the final output.
So while combiner and reducer both perform aggregation, their scopes of operation are different – the combiner works locally on mapper output to minimize data transfer, whereas the reducer operates globally on all mapper outputs to produce the final output. The logic needs to be optimized for their respective purposes. A minimal example follows.
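As a concrete sketch, here is the classic WordCount job, essentially the canonical Hadoop MapReduce example: because integer summing is associative and commutative, the same IntSumReducer class is registered as both combiner and reducer.

```java
// Classic Hadoop WordCount: the reducer doubles as the combiner because
// summing is associative and commutative.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1) for each token
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation on mapper output
    job.setReducerClass(IntSumReducer.class);  // global aggregation across all mappers
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

For a non-associative computation such as an average, reusing the reducer this way would give wrong results; the combiner would instead have to emit partial (sum, count) pairs.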
Matching Data Intensive Applications and Hardware/Software Architectures – Geoffrey Fox
This document discusses matching data intensive applications to hardware and software architectures. It provides examples of over 50 big data applications and analyzes their characteristics to identify common patterns. These patterns are used to propose a "big data version" of the Berkeley dwarfs and NAS parallel benchmarks for evaluating data-intensive systems. The document also analyzes hardware architectures from clouds to HPC and proposes integrating HPC concepts into the Apache software stack to develop an HPC-ABDS software stack for high performance data analytics. Key aspects of applications, hardware, and software architectures are illustrated with examples and diagrams.
High Performance Computing and Big Data – Geoffrey Fox
This document proposes a hybrid software stack that combines large-scale data systems from both research and commercial applications. It runs the commodity Apache Big Data Stack (ABDS) using enhancements from High Performance Computing (HPC) to improve performance. Examples are given from bioinformatics and financial informatics. Parallel and distributed runtimes like MPI, Storm, Heron, Spark and Flink are discussed, distinguishing between parallel (tightly-coupled) and distributed (loosely-coupled) systems. The document also discusses optimizing Java performance and differences between capacity and capability computing. Finally, it explains how this HPC-ABDS concept allows convergence of big data, big simulation, cloud and HPC systems.
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ... – MLconf
Building a Recommender System for Publications using Vector Space Model and Python: In recent years, it has become very common to have access to a large number of publications on similar or related topics. Recommendation systems for publications are needed to locate appropriate published articles among a large number of publications on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users' search profiles. The first type, content-based recommendation, can efficiently search for material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user. The content-based recommendation system uses a Vector Space model to rank PubMed articles based on the similarity of content items. To implement the second recommendation mechanism, we use Python libraries and frameworks: we find the profile similarity of users, and recommend additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommendation systems, and discuss the implementation of this PubMed recommendation system with examples. A sketch of the core vector-space ranking appears below.
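The core of the content-based mechanism is cosine similarity over term vectors. The talk implements this in Python; purely for consistency with the other examples on this page, here is a minimal, self-contained Java sketch of that ranking step, with a made-up query and corpus for illustration.

```java
// Illustrative vector-space ranking: documents become term-frequency vectors
// over a shared vocabulary; candidates are ranked by cosine similarity to the
// query article. Query/corpus strings are hypothetical.
import java.util.*;

public class VectorSpaceRanker {
  // Build a term-frequency vector from tokenized text.
  static Map<String, Double> tf(String text) {
    Map<String, Double> v = new HashMap<>();
    for (String tok : text.toLowerCase().split("\\W+"))
      if (!tok.isEmpty()) v.merge(tok, 1.0, Double::sum);
    return v;
  }

  // Cosine similarity between two sparse vectors.
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0, na = 0, nb = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      na += e.getValue() * e.getValue();
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
    }
    for (double x : b.values()) nb += x * x;
    return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  public static void main(String[] args) {
    String query = "deep learning for biomedical image analysis";
    List<String> corpus = List.of(
        "convolutional networks for medical image segmentation",
        "statistical methods in population genetics",
        "deep neural networks in radiology image analysis");
    Map<String, Double> q = tf(query);
    corpus.stream() // rank candidates by descending similarity to the query
        .sorted(Comparator.comparingDouble((String d) -> cosine(q, tf(d))).reversed())
        .forEach(d -> System.out.printf("%.3f  %s%n", cosine(q, tf(d)), d));
  }
}
```

A production system would use TF-IDF weights and a sparse index rather than recomputing vectors, but the ranking principle is the same.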
This talk will cover, via live demo & code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll incrementally build a hybrid machine learned model for fraud detection, combining features from natural language processing, topic modeling, time series analysis, link analysis, heuristic rules & anomaly detection. We’ll be looking for fraud signals in public email datasets, using Python & popular open-source libraries for data science and Apache Spark as the compute engine for scalable parallel processing.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. MapReduce divides applications into parallelizable map and reduce tasks that process key-value pairs across large datasets in a reliable and fault-tolerant manner. HDFS stores multiple replicas of data blocks for reliability and allows processing of data in parallel on nodes where the data is located. Hadoop can reliably store and process petabytes of data on thousands of low-cost commodity hardware nodes.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce and HDFS to parallelize tasks, distribute data storage, and provide fault tolerance. Applications of Hadoop include log analysis, data mining, and machine learning using large datasets at companies like Yahoo!, Facebook, and The New York Times.
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB... – inside-BigData.com
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiBD Project.
"This talk will provide an overview of challenges in designing convergent HPC and BigData software stacks on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit HPC scheduler (SLURM), parallel file systems (Lustre) and NVM-based in-memory technology will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown.
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,950 organizations worldwide (in 85 countries). More than 518,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 3rd, 14th, 17th, and 27th ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 300 organizations in 35 countries. More than 28,900 downloads of these libraries have taken place. High-performance and scalable versions of the Caffe and TensorFlow framework are available from https://hidl.cse.ohio-state.edu.
Prof. Panda is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Watch the video: https://youtu.be/1QEq0EUErKM
Learn more: http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Big data: Knowledge discovery in big data environments and computing... – Rio Info
This document discusses big data and intensive data processing. It defines big data and compares it to traditional analytics. It discusses technologies used for big data like Hadoop, MapReduce, and machine learning. It also discusses frameworks for analyzing big data like Apache Mahout and how Mahout is moving away from MapReduce to platforms like Apache Spark.
This document provides an introduction and overview of Apache Hadoop. It discusses how Hadoop provides the ability to store and analyze large datasets in the petabyte range across clusters of commodity hardware. It compares Hadoop to other systems like relational databases and HPC and describes how Hadoop uses MapReduce to process data in parallel. The document outlines how companies are using Hadoop for applications like log analysis, machine learning, and powering new data-driven business features and products.
FedCentric Technologies provides high performance computing solutions using large memory systems and in-memory databases to solve data analytics problems that exceed the capabilities of traditional approaches. They worked with the National Cancer Institute to develop a graph database running on high memory hardware to enable complex queries and analysis of genomic variant data to further cancer research. FedCentric also works with pharmaceutical clients to utilize high performance data analytics for applications such as microarray imaging and analysis.
The document describes the R User Conference 2014 which was held from June 30 to July 3 at UCLA in Los Angeles. The conference included tutorials on the first day covering topics like applied predictive modeling in R and graphical models. Keynote speeches and sessions were held on subsequent days covering various technical and statistical topics as well as best practices in R programming. Tutorials and sessions demonstrated tools and packages in R like dplyr and Shiny for data analysis and interactive visualizations.
The document summarizes an Open Data Science Conference and iRODS User Group meeting. It discusses technologies like Julia, Stan, Scikit-learn, Apache Spark, Apache Hadoop, and Apache Hive that were presented. It provides information on keynote speakers and their affiliated companies. The document also lists topics for training workshops and good talks available online. Finally, it summarizes questions asked about iRODS and provides information on implementing data policy rules.
Designing Software Libraries and Middleware for Exascale Systems: Opportuniti... – inside-BigData.com
This talk will focus on challenges in designing software libraries and middleware for upcoming exascale systems with millions of processors and accelerators. Two kinds of application domains - Scientific Computing and Big Data - will be considered. For the scientific computing domain, we will discuss challenges in designing runtime environments for MPI and PGAS (UPC and OpenSHMEM) programming models, taking into account support for multi-core, high-performance networks, GPGPUs and Intel MIC. Features and sample performance numbers from the MVAPICH2 and MVAPICH2-X (supporting hybrid MPI and PGAS (UPC and OpenSHMEM)) libraries will be presented. For the Big Data domain, we will focus on high-performance and scalable designs of Hadoop (including HBase) and Memcached using native RDMA support of InfiniBand and RoCE.
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity ... – Renato Bonomini
The document discusses capacity planning and performance tuning for Hadoop big data systems. It begins with an agenda that covers why capacity planners need to prepare for Hadoop, an overview of the Hadoop ecosystem, capacity planning and performance tuning of Hadoop, getting started, and the importance of measurement. The document then discusses various components of the Hadoop ecosystem and provides guidance on analyzing different types of workloads and components.
This document summarizes a research paper on analyzing and visualizing Twitter data using the R programming language with Hadoop. The goal was to leverage Hadoop's distributed processing capabilities to support analytical functions in R. Twitter data was analyzed and visualized in a distributed manner using R packages that connect to Hadoop. This allowed large-scale Twitter data analysis and visualizations to be built as a R Shiny application on top of results from Hadoop.
Supermicro designed and implemented a rack-level cluster solution for the San Diego Supercomputing Center (SDSC), optimized for their custom and experimental AI training and inferencing workloads and meeting their environmental and TCO requirements. The project team will discuss the journey of designing and deploying our Rack Plug and Play cluster, and Shawn Strande, Deputy Director, SDSC, will share his experience of partnering with the Supermicro team to solve his challenges in HPC and AI.
The team will also share the technology that powers the SDSC Voyager Supercomputer, the Habana Gaudi AI system with 3rd Gen Intel® Xeon® Scalable processors for Deep Learning Training, and Habana Goya for Inferencing.
Watch the webinar: https://www.brighttalk.com/webcast/17278/517013
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence... – Perficient, Inc.
This document discusses big data tools and trends that enable real-time business intelligence from machine logs. It provides an overview of Perficient, a leading IT consulting firm, and introduces the speakers Eric Roch and Ben Hahn. It then covers topics like what constitutes big data, how machine data is a source of big data, and how tools like Hadoop, Storm, Elasticsearch can be used to extract insights from machine data in real-time through open source solutions and functional programming approaches like MapReduce. It also demonstrates a sample data analytics workflow using these tools.
Azure Machine Learning - Deep Learning with Python, R, Spark, and CNTK – Herman Wu
The document discusses Microsoft's Cognitive Toolkit (CNTK), an open source deep learning toolkit developed by Microsoft. It provides the following key points:
1. CNTK uses computational graphs to represent machine learning models like DNNs, CNNs, RNNs in a flexible way.
2. It supports CPU and GPU training and works on Windows and Linux.
3. CNTK achieves state-of-the-art accuracy and is efficient, scaling to multi-GPU and multi-server settings.
Bitkom Cray presentation - on HPC affecting big data analytics in FS – Philip Filleul
High-value analytics in financial services are being enabled by graph, machine learning and Spark technologies. To make these real at production scale, HPC technologies are more appropriate than commodity clusters.
Scientific data management for in vivo imaging – scientific day organized by PIV on December 7, 2017 at PARCC-HEGP
Marie-Christine Jacquemot
OPIDOR
1. Scientific data management for in vivo imaging
https://piv2017.sciencesconf.org/
Christophe Cérin, Université Paris 13
christophe.cerin@lipn.univ-paris13.fr
Digital resource platforms
The USPC experience
Paris, December 7, 2017
2. Outline
• Motivations related to large-scale platforms
• Background:
• Criteria for observing platforms
• High Performance Computing (HPC)
• Apache suite for Big Data
• Some projects
• Contribution: USPC platform
• Conclusion and perspective
3. Motivations related to large-scale platforms
In the experimental sciences we need tips / best practices for using them appropriately
4. Definition of a large-scale system
• System with unprecedented amounts of hardware (>1M cores, >100 PB storage, >100 Gbit/s interconnect);
• Ultra-large-scale system: hardware + lines of source code + numbers of users + volumes of data;
• New problems:
• built across multiple organizations;
• often with conflicting purposes and needs;
• constructed from heterogeneous parts with complex dependencies and emergent properties;
• continuously evolving;
• software, hardware and human failures will be the norm, not the exception.
5. Already in our lives (Cloud Computing)
• Software as a Service (SaaS)
• Platform as a Service (PaaS)
• Infrastructure as a Service (IaaS)
6. Some sales pitches are ready for you
• Don’t believe them;
• Variety in the ecosystem is the norm.
7. In the Cloud, innovation drives technology/science: containers, and no more Virtual Machines
[Diagram: Conventional Stack - Hardware → Operating System → Apps; Virtualized Stack - Hardware → Hypervisor → guest OSes → Apps]
▪ Abstraction of resources in the past: VM technologies hide physical properties and the interactions between the system layer, the application layer and the end-users
9. In the sciences (not in business)
• Experimental approach: modeling, …, experiments, …
• What is an experimental plan for large-scale systems and reproducible research?
• Answers depend on the system used:
• Cluster (dedicated; managed)
• Cloud (almost unsupervised… or supervised by you)
11. Performance criteria for observing large-scale systems
▪ Cluster ➔ Performance ➔ Flops - Top500
▪ Cluster ➔ Power ➔ Flops per watt - Green500
▪ Cloud ➔ Price ➔ for computing, for storing…
▪ Cloud: the economic model may be strange (Amazon Spot Instances)
▪ Scientific issues: how to compare cluster and cloud? (be fair)
14. Productivity
▪ Productivity traditionally refers to the ratio between the quantity of software produced and the cost spent on it.
▪ Factors: programming language used, program size, the experience of programmers and design personnel, the novelty of requirements, the complexity of the program and its data, the use of structured programming methods, program class or the distribution method, program type of the application area, tools and environmental conditions, enhancing existing programs or systems, maintaining existing programs or systems, reusing existing modules and standard designs, program generators, fourth-generation languages, geographic separation of development locations, defect potentials and removal methods, (existing) documentation, prototyping before main development begins, project teams and organization structures, morale and compensation of staff
15. Programming model
▪ Parallel (http://press3.mcs.anl.gov/atpesc/files/2015/03/ESCTP-snir_215.pdf):
▪ Shared memory (OpenMP): CPU+vector, CPU+GPU, CPU+accelerators (Google Tensor Processing Unit…)
▪ Message passing (MPI)
▪ Sequential with calls to a «parallel data structure» (Scala); see the sketch after this list
▪ Map-reduce: Hadoop, Spark (in addition to Map and Reduce operations, Spark supports SQL queries, streaming data, machine learning and graph data processing.)
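As a tiny illustration of the "sequential code over a parallel data structure" model above (the slide cites Scala; Java parallel streams are used here for consistency with the other examples on this page): the program reads sequentially, while the map and the associative reduce execute in parallel.

```java
// Sequential-looking code over a parallel data structure (Java streams).
import java.util.stream.IntStream;

public class ParallelDataStructureDemo {
  public static void main(String[] args) {
    long sumOfSquares = IntStream.rangeClosed(1, 1_000_000)
        .parallel()                   // switch the data structure to parallel mode
        .mapToLong(i -> (long) i * i) // "map" phase, applied concurrently
        .sum();                       // "reduce" phase (associative, so safe in parallel)
    System.out.println("Sum of squares = " + sumOfSquares);
  }
}
```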
16. Hardware landscape
▪ Even more complicated: Cray has added an "ARM Option" (i.e., a CPU blade option using the ThunderX2) to their XC50 supercomputers
▪ We need consolidation: in hardware, in software, in algorithms, in applications, in experimental methods, in HPC facilities, in training
17. Apache suite for Big Data
1 - Databases
1.1 Apache HBase
1.2 Apache Cassandra
1.3 CouchDB
1.4 MongoDB
2 - Data access / querying
2.1 Pig
2.2 Hive
2.3 Livy
3 - Data intelligence
3.1 Apache Drill
3.2 Apache Mahout
3.3 H2O
4 - Data serialization
4.1 Apache Thrift
4.2 Apache Avro
5 - Data integration
5.1 Apache Sqoop
5.2 Apache Flume
5.3 Apache Chukwa
6 - Querying
6.1 Presto
6.2 Impala
6.3 Dremel
7 - Data security
7.1 Apache Metron
7.2 Sqrrl
8.2 MapReduce
8.3 Spark
9.1 Elasticsearch
10.1 Apache Hadoop
10.6 Apache Mesos
10.10 Apache Flink
10.12 Apache ZooKeeper
10.15 Apache Storm
10.20 Apache Kafka
18. Example of a project mixing HPC and Big Data considerations
• Where theory comes before data: physics, chemistry, materials, ...
• Where data comes before theory: medicine, business insights, autonomy (smart cars, buildings, cities)
19. Harp: Hadoop and Collective Communication for Iterative Computation
• Project: http://salsaproj.indiana.edu/harp/
• Motivation:
• Communication patterns are not abstracted and defined in Hadoop, Pregel/Giraph or Spark
• By contrast, MPI has very fine-grained (collective) communication primitives (based on arrays and buffers)
• Harp provides data abstractions and communication abstractions on top of them. It can be plugged into the Hadoop runtime to enable efficient in-memory communication and avoid HDFS reads/writes.
21. Harp
• Harp works as a plugin in Hadoop. The goal is to let a Hadoop cluster schedule original MapReduce jobs and Map-Collective jobs at the same time.
• Collective communication is defined as movement of partitions within tables.
• Collective communication requires synchronization; the Hadoop scheduler is modified to schedule tasks in BSP style.
• Fault tolerance with checkpointing.
A concept sketch of such a BSP-style collective follows.
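The sketch below is self-contained Java and is not Harp's actual API (Harp defines its own table and partition classes); it only illustrates the idea of an in-memory allreduce: each worker reduces its own partition, all workers meet at a barrier, and afterwards every worker reads the same combined result, with no HDFS traffic.

```java
// Concept sketch of a BSP-style allreduce over partitioned data (NOT Harp's API).
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.DoubleAdder;

public class AllReduceSketch {
  public static void main(String[] args) throws InterruptedException {
    int workers = 4;
    double[][] partitions = {{1, 2}, {3, 4}, {5, 6}, {7, 8}}; // per-worker data
    DoubleAdder globalSum = new DoubleAdder();                // reduction target
    CyclicBarrier barrier = new CyclicBarrier(workers);       // BSP superstep boundary

    Thread[] pool = new Thread[workers];
    for (int w = 0; w < workers; w++) {
      final int id = w;
      pool[w] = new Thread(() -> {
        // Superstep 1: local computation on this worker's partition.
        double local = 0;
        for (double x : partitions[id]) local += x;
        globalSum.add(local);
        try {
          barrier.await(); // synchronize: all contributions are in
        } catch (InterruptedException | BrokenBarrierException e) {
          throw new RuntimeException(e);
        }
        // Superstep 2: every worker now reads the same reduced value in memory.
        System.out.println("worker " + id + " sees allreduce result " + globalSum.sum());
      });
      pool[w].start();
    }
    for (Thread t : pool) t.join();
  }
}
```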
22. RosettaHUB
• For the data scientist of today;
• A hub to facilitate access to Amazon cloud resources, to share data, and to collaborate among many people through notebooks;
• RosettaHUB: https://www.rosettahub.com
• Available through the Amazon Educate program (https://s3.amazonaws.com/awseducate-list/AWS_Educate_Institutions.pdf)
39. Conclusion and perspective
• Computer hardware will continue to evolve; trend: accelerators instead of GPUs;
• Platforms will continue to evolve to facilitate the management of data and the computation on data;
• Need for convergence/resonance between the HPC and Big Data ecosystems;
• SPC: needs for training, using, and transforming the scientific approaches… with research engineers to accompany the change;
• SPC: needs for projects… and for sharing between communities.