This document proposes a hybrid software stack for large-scale data systems in both research and commercial applications: they run on the commodity Apache Big Data Stack (ABDS) with enhancements from High Performance Computing (HPC) to improve performance. Examples are given from bioinformatics and financial informatics. Parallel and distributed runtimes such as MPI, Storm, Heron, Spark and Flink are discussed, distinguishing between parallel (tightly coupled) and distributed (loosely coupled) systems. The document also discusses optimizing Java performance and the differences between capacity and capability computing. Finally, it explains how the HPC-ABDS concept allows convergence of big data, big simulation, cloud and HPC systems.
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... (inside-BigData.com)
DK Panda from Ohio State University presented this deck at the Switzerland HPC Conference.
"This talk will provide an overview of challenges in accelerating Hadoop, Spark and Mem- cached on modern HPC clusters. An overview of RDMA-based designs for multiple com- ponents of Hadoop (HDFS, MapReduce, RPC and HBase), Spark, and Memcached will be presented. Enhanced designs for these components to exploit in-memory technology and parallel file systems (such as Lustre) will be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown."
Watch the video presentation: https://www.youtube.com/watch?v=glf2KITDdVs
See more talks in the Swiss Conference Video Gallery: http://insidehpc.com/2016-swiss-hpc-conference/
High Performance Processing of Streaming Data (Geoffrey Fox)
Describes two parallel robot planning algorithms implemented with Apache Storm on OpenStack -- SLAM (Simultaneous Localization & Mapping) and collision avoidance. Performance (response time) studied and improved as example of HPC-ABDS (High Performance Computing enhanced Apache Big Data Software Stack) concept.
We present a software model built on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Comparing Big Data and Simulation Applications and Implications for Software ... (Geoffrey Fox)
At eScience in the Cloud 2014, Redmond WA, April 30 2014
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive problems, even though commercial clouds devote much more resources to data analytics than supercomputers devote to simulations.
We look at a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We suggest a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on combining HPC and the Apache software stack that is well used in modern cloud computing.
Initial results on Azure and HPC Clusters are presented
Predicting failure in power networks, detecting fraudulent activities in payment card transactions, and identifying next logical products targeted at the right customer at the right time all require machine learning around massive data sets. This form of artificial intelligence requires complex self-learning algorithms, rapid data iteration for advanced analytics and a robust big data architecture that’s up to the task.
Learn how you can quickly exploit your existing IT infrastructure and scale operations in line with your budget to enjoy advanced data modeling, without having to invest in a large data science team.
The presentation covers the following topics: 1) Hadoop introduction 2) Hadoop nodes and daemons 3) Architecture 4) Hadoop best features 5) Hadoop characteristics. For further details on Hadoop, see: http://data-flair.training/blogs/hadoop-tutorial-for-beginners/
State of the Art Robot Predictive Maintenance with Real-time Sensor Data (Mathieu Dumoulin)
Our Strata Beijing 2017 presentation slides where we show how to use data from a movement sensor, in real-time, to do anomaly detection at scale using standard enterprise big data software.
This lecture aims to give some food for thought on how current High Performance Computing systems (hardware and software) tend to merge with Big Data systems (Machine Learning, Analytics and Enterprise workloads) in order to meet the demands of both workloads sharing the same clusters.
Matching Data Intensive Applications and Hardware/Software Architectures (Geoffrey Fox)
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However the same is not so true for data intensive problems even though commercial clouds presumably devote more resources to data analytics than supercomputers devote to simulations. We try to establish some principles that allow one to compare data intensive architectures and decide which applications fit which machines and which software.
We use a sample of over 50 big data applications to identify characteristics of data intensive applications and propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. We consider hardware from clouds to HPC. Our software analysis builds on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We illustrate issues with examples including kernels like clustering, and multi-dimensional scaling; cyberphysical systems; databases; and variants of image processing from beam lines, Facebook and deep-learning.
5th Multicore World
15-17 February 2016 – Shed 6, Wellington, New Zealand
http://openparallel.com/multicore-world-2016/
We start by dividing applications into data plus model components and classifying each component (whether from Big Data or Big Simulations) in the same way. This leads to 64 properties divided into 4 views, which are Problem Architecture (Macro pattern); Execution Features (Micro patterns); Data Source and Style; and finally the Processing (runtime) View.
We discuss convergence software built around HPC-ABDS (High Performance Computing enhanced Apache Big Data Stack) http://hpc-abds.org/kaleidoscope/ and show how one can merge Big Data and HPC (Big Simulation) concepts into a single stack.
We give examples of data analytics running on HPC systems including details on persuading Java to run fast.
Some details can be found at http://dsc.soic.indiana.edu/publications/HPCBigDataConvergence.pdf
Virtual machines are a mainstay in the enterprise, while Apache Hadoop is normally run on bare machines. This talk walks through the convergence and the use of virtual machines for running Apache Hadoop. We describe the results from various tests and benchmarks which show that the overhead of using VMs is small, a small price to pay for the advantages offered by virtualization. The second half of the talk compares multi-tenancy with VMs versus multi-tenancy with Hadoop's Capacity Scheduler. We follow on with a comparison of resource management in vSphere and the finer-grained resource management and scheduling in NextGen MapReduce, which supports a general notion of a container (such as a process, JVM, or virtual machine) in which tasks are run. We then consider the role of such first-class VM support in Hadoop.
Introduction to Designing and Building Big Data Applications (Cloudera, Inc.)
Learn what the course covers, from capturing data to building a search interface; the spectrum of processing engines, Apache projects, and ecosystem tools available for converged analytics; who is best suited to attend the course and what prior knowledge you should have; and the benefits of building applications with an enterprise data hub.
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ... (MLconf)
Building a Recommender System for Publications using Vector Space Model and Python: In recent years, it has become very common to have access to a large number of publications on similar or related topics. Recommendation systems for publications are needed to locate appropriate published articles from a large number of publications on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users' search profiles. The first type, content-based recommendation, can efficiently search for material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user. The content-based recommendation system uses a Vector Space Model to rank PubMed articles based on the similarity of content items. To implement the second recommendation mechanism, we use Python libraries and frameworks: we find the profile similarity of users and recommend additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommendation systems, and discuss the implementation of this PubMed recommendation system with examples.
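For readers unfamiliar with the vector space ranking step described above, a minimal sketch follows (in Java for consistency with the other code sketches on this page, whereas the talk itself uses Python; the toy query and corpus are invented): documents become term-frequency vectors and are ranked by cosine similarity to the query article.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineRank {
  // Build a simple term-frequency vector for one document.
  static Map<String, Integer> termFreq(String text) {
    Map<String, Integer> tf = new HashMap<>();
    for (String tok : text.toLowerCase().split("\\W+"))
      if (!tok.isEmpty()) tf.merge(tok, 1, Integer::sum);
    return tf;
  }
  // Cosine similarity between two sparse term-frequency vectors.
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dotProd = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      normA += e.getValue() * e.getValue();
      dotProd += e.getValue() * b.getOrDefault(e.getKey(), 0);
    }
    for (int v : b.values()) normB += v * v;
    return dotProd / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-12);
  }
  public static void main(String[] args) {
    // Hypothetical query article abstract and tiny corpus, for illustration only.
    Map<String, Integer> query = termFreq("gene expression in breast cancer tissue");
    String[] corpus = {
      "breast cancer gene expression profiling study",
      "deep learning for image segmentation",
    };
    for (String doc : corpus)
      System.out.printf("%.3f  %s%n", cosine(query, termFreq(doc)), doc);
  }
}
```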
This talk will cover, via live demo & code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll incrementally build a hybrid machine learned model for fraud detection, combining features from natural language processing, topic modeling, time series analysis, link analysis, heuristic rules & anomaly detection. We’ll be looking for fraud signals in public email datasets, using Python & popular open-source libraries for data science and Apache Spark as the compute engine for scalable parallel processing.
High Performance Data Analytics and a Java Grande Run Time (Geoffrey Fox)
There is perhaps a broad consensus as to important issues in practical parallel computing as applied to large scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development.
However, the same is not so true for data-intensive problems, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations.
Here we use a sample of over 50 big data applications to identify characteristics of data intensive applications and to deduce needed runtime and architectures.
We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks.
Our analysis builds on the Apache software stack that is well used in modern cloud computing.
We give some examples including clustering, deep-learning and multi-dimensional scaling.
One suggestion from this work is the value of a high performance Java (Grande) runtime that supports simulations and big data.
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...Geoffrey Fox
Advances in high-performance/parallel computing in the 1980s and 90s were spurred by the development of quality high-performance libraries, e.g., SCALAPACK, as well as by well-established benchmarks, such as Linpack.
Similar efforts to develop libraries for high-performance data analytics are underway. In this talk we argue that such benchmarks should be driven by frequent patterns encountered in high-performance analytics, which we call Ogres.
Based upon earlier work, we propose that doing so will enable adequate coverage of the "Apache" bigdata stack as well as most common application requirements, whilst building upon parallel computing experience.
Given the spectrum of analytic requirements and applications, there are multiple "facets" that need to be covered, and thus we propose an initial set of benchmarks - by no means currently complete - that covers these characteristics.
We hope this will encourage debate
Big Data HPC Convergence and a bunch of other things (Geoffrey Fox)
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre (HPCC Systems)
Data Centric Approach: our platform is built on the premise of absorbing data from multiple data sources and transforming them into highly intelligent social network graphs that can be processed to reveal non-obvious relationships.
Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiB... (inside-BigData.com)
In this deck from the 2019 Stanford HPC Conference, DK Panda from Ohio State University presents: Designing Convergent HPC and Big Data Software Stacks: An Overview of the HiBD Project.
"This talk will provide an overview of challenges in designing convergent HPC and BigData software stacks on modern HPC clusters. An overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC and HBase), Spark, Memcached, Swift, and Kafka using native RDMA support for InfiniBand and RoCE will be presented. Enhanced designs for these components to exploit HPC scheduler (SLURM), parallel file systems (Lustre) and NVM-based in-memory technology will also be presented. Benefits of these designs on various cluster configurations using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu) will be shown.
DK Panda is a Professor and University Distinguished Scholar of Computer Science and Engineering at the Ohio State University. He has published over 450 papers in the area of high-end computing and networking. The MVAPICH2 (High Performance MPI and PGAS over InfiniBand, Omni-Path, iWARP and RoCE) libraries, designed and developed by his research group (http://mvapich.cse.ohio-state.edu), are currently being used by more than 2,950 organizations worldwide (in 85 countries). More than 518,000 downloads of this software have taken place from the project's site. This software is empowering several InfiniBand clusters (including the 3rd, 14th, 17th, and 27th ranked ones) in the TOP500 list. The RDMA packages for Apache Spark, Apache Hadoop and Memcached together with OSU HiBD benchmarks from his group (http://hibd.cse.ohio-state.edu) are also publicly available. These libraries are currently being used by more than 300 organizations in 35 countries. More than 28,900 downloads of these libraries have taken place. High-performance and scalable versions of the Caffe and TensorFlow framework are available from https://hidl.cse.ohio-state.edu.
Prof. Panda is an IEEE Fellow. More details about Prof. Panda are available at http://www.cse.ohio-state.edu/~panda.
Watch the video: https://youtu.be/1QEq0EUErKM
Learn more: http://hpcadvisorycouncil.com/events/2019/stanford-workshop/
Classifying Simulation and Data Intensive Applications and the HPC-Big Data C... (Geoffrey Fox)
Describes relations between Big Data and Big Simulation Applications and how this can guide a Big Data - Exascale (Big Simulation) Convergence (as in National Strategic Computing Initiative) and lead to a "complete" set of Benchmarks. Basic idea is to view use cases as "Data" + "Model"
Time to Science/Time to Results: Transforming Research in the Cloud (Amazon Web Services)
This session demonstrates how cloud can accelerate breakthroughs in scientific research by providing on-demand access to powerful computing. You will gain insight into how scientific researchers are using the cloud to solve complex science, engineering, and business problems that require high bandwidth, low latency networking and very high compute capabilities. You will hear how leveraging the cloud reduces the costs and time to conduct large scale, worldwide collaborative research. Researchers can then access computational power, data storage, and supercomputing resources, and data sharing capabilities in a cost-efficient manner without implementation delays. Disease research can be accomplished in a fraction of the time, and innovative researchers in small schools or distant corners of the world have access to the same computing power as those at major research institutions by leveraging Amazon EC2, Amazon S3, optimizing C3 instances and more to increase collaboration. This session will provide best practices and insight from UC Berkeley AMP Lab on the services used to connect disparate sets of data to drive meaningful new insight and impact.
Similar to High Performance Computing and Big Data
AI-Driven Science and Engineering with the Global AI and Modeling Supercomput... (Geoffrey Fox)
Most things are dominated by Artificial Intelligence (AI). Technology Companies like Amazon, Google, Facebook, and Microsoft are AI First organizations.
Engineering achievement today is highlighted by the AI buried in a vehicle or machine. Industry (Manufacturing) 4.0 focuses on the AI-Driven future of the Industrial Internet of Things.
Software is eating the world.
We can describe much computer systems work as designing, building and using the Global AI and Modelling supercomputer which itself is autonomously tuned by AI. We suggest that this is not just a bunch of buzzwords but has profound significance and examine consequences of this for education and research.
Naively high-performance computing should be relevant for the AI supercomputer but somehow the corporate juggernaut is not making so much use of it. We discuss how to change this.
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes... (Geoffrey Fox)
“Next Generation Grid – HPC Cloud” proposes a toolkit capturing current capabilities of Apache Hadoop, Spark, Flink and Heron as well as MPI and Asynchronous Many Task systems from HPC. This supports a Cloud-HPC-Edge (Fog, Device) Function as a Service architecture. Note this "new grid" is focused on data and IoT, not computing. It uses interoperable common abstractions but multiple polymorphic implementations.
Spidal Java: High Performance Data Analytics with Java on Large Multicore HPC... (Geoffrey Fox)
Within the last few years, there have been significant contributions to Java-based big data frameworks and libraries such as Apache Hadoop, Spark, and Storm. While these systems are rich in interoperability and features, developing high performance big data analytic applications is challenging. Also, the study of performance characteristics and high performance optimizations is lacking in the literature for these applications. By contrast, these features are well documented in the High Performance Computing (HPC) domain and some of the techniques have potential performance benefits in the big data domain as well. This paper identifies a class of machine learning applications with significant computation and communication as a yardstick and presents five optimizations to yield high performance in Java big data analytics. Also, it incorporates these optimizations in developing SPIDAL Java - a highly optimized suite of Global Machine Learning (GML) applications. The optimizations include intra-node messaging through memory maps over network calls, improving cache utilization, reliance on processes over threads, zero garbage collection, and employing offheap buffers to load and communicate data. SPIDAL Java demonstrates significant performance gains and scalability with these techniques when running on up to 3072 cores in one of the latest Intel Haswell-based multicore clusters.
http://dsc.soic.indiana.edu/publications/hpc2016-spidal-high-performance-submit-18-public.pdf
http://dsc.soic.indiana.edu/presentations/SPIDALJava.pptx
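Two of the optimizations listed above, off-heap buffers and memory-mapped intra-node messaging, can be illustrated with a short hedged Java sketch (this is not SPIDAL source; the file name and sizes are invented for the example):

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class OffHeapSketch {
  public static void main(String[] args) throws Exception {
    // Off-heap buffer: allocated outside the Java heap, so it adds no GC pressure.
    ByteBuffer offHeap = ByteBuffer.allocateDirect(8 * 1024);
    for (int i = 0; i < 1024; i++) offHeap.putDouble(i * 0.5);
    offHeap.flip();

    // Memory-mapped region: another process mapping the same (hypothetical) file
    // sees these bytes, giving shared-memory style intra-node communication
    // instead of a network call.
    try (RandomAccessFile raf = new RandomAccessFile("/tmp/spidal-demo.bin", "rw");
         FileChannel ch = raf.getChannel()) {
      MappedByteBuffer shared = ch.map(FileChannel.MapMode.READ_WRITE, 0, 8 * 1024);
      shared.put(offHeap);          // copy the off-heap data into the shared mapping
      shared.force();               // flush so a peer process can read it
    }
    System.out.println("wrote 1024 doubles to the shared mapping");
  }
}
```

The point of both techniques is to keep large communication buffers out of the garbage-collected heap and to let processes on the same node exchange data through shared pages rather than the network stack.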
DTW: 2015 Data Teaching Workshop – 2nd IEEE STC CC and RDA Workshop on Curricula and Teaching Methods in Cloud Computing, Big Data, and Data Science
as part of CloudCom 2015 (http://2015.cloudcom.org/), Vancouver, Nov 30-Dec 3, 2015.
Discusses Indiana University Data Science Program and experience with online education; the program is available in both online and residential modes. We end by discussing two classes taught both online and residentially by Geoffrey Fox. One is BDAA: Big Data Applications & Analytics; the other is BDOSSP: Big Data Open Source Software and Projects. Links are
http://openedx.scholargrid.org/ BDAA Fall 2015
http://datascience.scholargrid.org/ BDOSSP Spring 2016
http://bigdataopensourceprojects.soic.indiana.edu/ Spring 2015
Visualizing and Clustering Life Science Applications in Parallel (Geoffrey Fox)
HiCOMB 2015 14th IEEE International Workshop on
High Performance Computational Biology at IPDPS 2015
Hyderabad, India. This talk covers parallel data analytics for bioinformatics. Messages are
Always run MDS. Gives insight into data and performance of machine learning
Leads to a data browser as GIS gives for spatial data
3D better than 2D
~20D better than MSA?
Clustering Observations
Do you care about quality or are you just cutting up space into parts
Deterministic clustering is always more robust
Continuous clustering enables hierarchy
Trimmed Clustering cuts off tails
Distinct O(N) and O(N2) algorithms
Use Conjugate Gradient
Lessons from Data Science Program at Indiana University: Curriculum, Students... (Geoffrey Fox)
Invited talk at the NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar) at IPDPS 2015, May 25, 2015, Hyderabad
Discusses Indiana University Data Science Program and experience with online education; the program is available in both online and residential modes. We end by discussing two classes taught both online and residentially by Geoffrey Fox. One is BDAA: Big Data Applications & Analytics https://bigdatacourse.appspot.com/course. The other is BDOSSP: Big Data Open Source Software and Projects http://bigdataopensourceprojects.soic.indiana.edu/
Experience with Online Teaching with Open Source MOOC Technology (Geoffrey Fox)
This memo describes experiences with online teaching in Spring Semester 2014. We discuss the technologies used and the approach to teaching/learning.
This work is based on Google Course Builder for a Big Data overview course
Big Data and Clouds: Research and Education (Geoffrey Fox)
Presentation September 9 2013 PPAM 2013 Warsaw
Economic Imperative: There are a lot of data and a lot of jobs
Computing Model: Industry adopted clouds which are attractive for data analytics. HPC also useful in some cases
Progress in scalable robust Algorithms: new data need different algorithms than before
Progress in Data Intensive Programming Models
Progress in Data Science Education: opportunities at universities
Multi-faceted Classification of Big Data Use Cases and Proposed Architecture ... (Geoffrey Fox)
Keynote at Sixth International Workshop on Cloud Data Management CloudDB 2014 Chicago March 31 2014.
Abstract: We introduce the NIST collection of 51 use cases and describe their scope over industry, government and research areas. We look at their structure from several points of view or facets covering problem architecture, analytics kernels, micro-system usage such as flops/bytes, application class (GIS, expectation maximization) and very importantly data source.
We then propose that in many cases it is wise to combine the well known commodity best practice (often Apache) Big Data Stack (with ~120 software subsystems) with high performance computing technologies.
We describe this and give early results based on clustering running with different paradigms.
We identify key layers where HPC Apache integration is particularly important: File systems, Cluster resource management, File and object data management, Inter process and thread communication, Analytics libraries, Workflow and Monitoring.
See
[1] A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures, Shantenu Jha, Judy Qiu, Andre Luckow, Pradeep Mantha and Geoffrey Fox, accepted in IEEE BigData 2014, available at: http://arxiv.org/abs/1403.1528
[2] High Performance High Functionality Big Data Software Stack, G Fox, J Qiu and S Jha, in Big Data and Extreme-scale Computing (BDEC), 2014. Fukuoka, Japan. http://grids.ucs.indiana.edu/ptliupages/publications/HPCandApacheBigDataFinal.pdf
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC (Geoffrey Fox)
This proposes an integration of HPC and Apache Technologies. HPC-ABDS+ Integration areas include
File systems,
Cluster resource management,
File and object data management,
Inter process and thread communication,
Analytics libraries,
Workflow
Monitoring
Classification of Big Data Use Cases by different Facets (Geoffrey Fox)
Ogres classify Big Data applications by multiple facets – each with several exemplars and features. This gives a guide to the breadth and depth of Big Data and allows one to examine which Ogres a particular architecture/software supports.
FutureGrid Computing Testbed as a Service (Geoffrey Fox)
Describes FutureGrid and its role as a Computing Testbed as a Service. FutureGrid is user-customizable, accessed interactively and supports Grid, Cloud and HPC software with and without VM’s. Lessons learnt and example use cases are described
Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerp...Geoffrey Fox
Motivating Introduction to MOOC on Big Data from an applications point of view https://bigdatacoursespring2014.appspot.com/course
Course says:
Geoffrey motivates the study of X-informatics by describing data science and clouds. He starts with striking examples of the data deluge from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trends in both clouds and big data.
He introduces the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described with growing importance of data oriented version. He covers 3 major X-informatics areas: Physics, e-Commerce and Web Search followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on a data science education and the benefits of using MOOC's.
High Performance Computing and Big Data
1. Big Data Institute, Seoul National University, Korea
Geoffrey Fox August 22, 2016
gcf@indiana.edu
http://www.dsc.soic.indiana.edu/, http://spidal.org/ http://hpc-abds.org/kaleidoscope/
Department of Intelligent Systems Engineering
School of Informatics and Computing, Digital Science Center
Indiana University Bloomington
High Performance Computing and Big Data
2. Abstract
• We propose a hybrid software stack with large-scale data systems for both research and commercial applications running on the commodity (Apache) Big Data Stack (ABDS), using High Performance Computing (HPC) enhancements typically to improve performance. We give several examples taken from bio and financial informatics.
• We look in detail at parallel and distributed run-times including MPI from HPC and Apache Storm, Heron, Spark and Flink from ABDS, stressing that one needs to distinguish the different needs of parallel (tightly coupled) and distributed (loosely coupled) systems.
• We also study "Java Grande", or the principles to use to allow Java codes to perform as fast as those written in more traditional HPC languages. We also note the differences between capability (individual jobs using many nodes) and capacity (lots of independent jobs) computing.
• We discuss how this HPC-ABDS concept allows one to discuss convergence of Big Data, Big Simulation, Cloud and HPC Systems.
See http://hpc-abds.org/kaleidoscope/
3. Why Connect (“Converge”) Big Data and HPC
• Two major trends in computing systems are
– Growth in high performance computing (HPC) with an international exascale initiative (China in the lead)
– The big data phenomenon with an accompanying cloud infrastructure of well-publicized, dramatic and increasing size and sophistication.
• Note “Big Data” is largely an industry initiative although the software used is often open source
– So the HPC label overlaps with “research”, e.g. the HPC community is largely responsible for Astronomy and Accelerator (LHC, Belle, BEPC ..) data analysis
• Merge HPC and Big Data to get
– More efficient sharing of large-scale resources running simulations and data analytics
– Higher performance Big Data algorithms
– A richer software environment for the research community building on many big data tools
– An easier sustainability model for HPC – HPC does not have the resources to build and maintain a full software stack
4. Convergence Points (Nexus) for HPC-Cloud-Big Data-Simulation
• Nexus 1: Applications – Divide use cases into Data and Model and compare characteristics separately in these two components with 64 Convergence Diamonds (features)
• Nexus 2: Software – High Performance Computing (HPC) Enhanced Big Data Stack HPC-ABDS. 21 layers adding high performance runtime to Apache systems (Hadoop is fast!). Establish principles to get good performance from Java or C programming languages
• Nexus 3: Hardware – Use Infrastructure as a Service (IaaS) and DevOps to automate deployment of software-defined systems on hardware designed for functionality and performance, e.g. appropriate disks, interconnect, memory
5. SPIDAL Project - Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
• NSF14-43054, started October 1, 2014
• Indiana University (Fox, Qiu, Crandall, von Laszewski)
• Rutgers (Jha)
• Virginia Tech (Marathe)
• Kansas (Paden)
• Stony Brook (Wang)
• Arizona State (Beckstein)
• Utah (Cheatham)
• A co-design project: software, algorithms, applications
7. Main Components of SPIDAL Project
• Design and build a scalable high performance data analytics library
• SPIDAL (Scalable Parallel Interoperable Data Analytics Library): scalable analytics for:
– Domain-specific data analytics libraries – mainly from the project.
– Added core machine learning libraries – mainly from the community.
– Performance of Java and MIDAS inter- and intra-node.
• NIST Big Data Application Analysis – features of data-intensive applications deriving 64 Convergence Diamonds. Application Nexus.
• HPC-ABDS: Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. Software Nexus.
• MIDAS: integrating middleware – from the project.
• Applications: Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Geographical Information Systems, Remote Sensing for Polar Science, Pathology Informatics, streaming for robotics, streaming stock analytics
• Implementations: HPC as well as clouds (OpenStack, Docker). Convergence with a common DevOps tool. Hardware Nexus.
9. Data and Model in Big Data and Simulations I
• Need to discuss Data and Model as problems have both intermingled, but we can get insight by separating them, which allows better understanding of Big Data - Big Simulation “convergence” (or differences!)
• The Model is a user construction: it has a “concept” and parameters, and gives results determined by the computation. We use the term “model” in a general fashion to cover all of these.
• Big Data problems can be broken up into Data and Model
– For clustering, the model parameters are the cluster centers while the data is the set of points to be clustered (a minimal sketch follows this slide)
– For queries, the model is the structure of the database and the results of the query, while the data is the whole database queried plus the SQL query
– For deep learning with ImageNet, the model is the chosen network with the network link weights as model parameters. The data is the set of images used for training or classification
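To make the Data/Model split above concrete, here is a minimal Java sketch (not taken from the talk) of one k-means iteration: the points array is the Data, read-only within the iteration, while the centers array is the Model that the iteration updates.

```java
import java.util.Arrays;

public class KMeansStep {
  /** data: n x d points (fixed); centers: k x d model parameters (updated in place). */
  static void iterate(double[][] data, double[][] centers) {
    int k = centers.length, d = centers[0].length;
    double[][] sums = new double[k][d];
    int[] counts = new int[k];
    for (double[] point : data) {                 // Data: read-only per iteration
      int best = 0; double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        double dist = 0;
        for (int j = 0; j < d; j++) { double diff = point[j] - centers[c][j]; dist += diff * diff; }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      counts[best]++;
      for (int j = 0; j < d; j++) sums[best][j] += point[j];
    }
    for (int c = 0; c < k; c++)                   // Model: new cluster centers
      if (counts[c] > 0)
        for (int j = 0; j < d; j++) centers[c][j] = sums[c][j] / counts[c];
  }

  public static void main(String[] args) {
    double[][] data = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };   // toy Data
    double[][] centers = { {0, 0.5}, {9, 9} };                  // initial Model
    for (int iter = 0; iter < 5; iter++) iterate(data, centers);
    System.out.println(Arrays.deepToString(centers));
  }
}
```

In a parallel (Map-Collective) setting each worker would compute sums and counts for its own partition of the points, and a global reduction would combine them before the centers are updated.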
10. Data and Model in Big Data and Simulations II
• Simulations can also be considered as Data plus Model
– The Model can be a formulation with particle dynamics or partial differential equations, defined by parameters such as particle positions and discretized velocity, pressure and density values (see the sketch after this slide)
– The Data can be small when it is just boundary conditions
– The Data is large with data assimilation (weather forecasting) or when data visualizations are produced by the simulation
• Big Data implies the Data is large but the Model varies in size
– e.g. LDA with many topics or deep learning has a large model
– Clustering or dimension reduction can be quite small in model size
• Data is often static between iterations (unless streaming); Model parameters vary between iterations
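The same split can be read off a toy simulation. The following sketch (illustrative only; grid size, coefficient and boundary values are made up) advances a 1D heat equation: the discretized temperature field and its update rule are the Model, and the boundary conditions are the (small) Data.

```java
public class Heat1D {
  public static void main(String[] args) {
    int n = 64;                 // grid points (model size)
    double alpha = 0.25;        // stability requires alpha = k*dt/dx^2 <= 0.5
    double leftBC = 1.0, rightBC = 0.0;   // Data: boundary conditions
    double[] u = new double[n]; // Model parameters: the field values
    u[0] = leftBC; u[n - 1] = rightBC;
    double[] next = new double[n];
    for (int step = 0; step < 1000; step++) {
      for (int i = 1; i < n - 1; i++)
        next[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1]);
      next[0] = leftBC; next[n - 1] = rightBC;   // re-impose the (small) Data
      double[] tmp = u; u = next; next = tmp;
    }
    System.out.printf("u[n/2] = %.4f%n", u[n / 2]); // approaches the linear steady state
  }
}
```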
12. 51 Detailed Use Cases: Contributed July-September 2013
Covers goals, data features such as the 3 V's, software, hardware
• Government Operation (4): National Archives and Records Administration, Census Bureau
• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
• Defense (3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences (10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media (6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light source experiments
• Astronomy and Physics (5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy (1): Smart grid
• Published by NIST as http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-3.pdf with a common set of 26 features recorded for each use case; “Version 2” being prepared
26 features for each use case. Biased to science.
13. Sample Features of 51 Use Cases I
• PP (26) “All” Pleasingly Parallel or Map Only
• MR (18) Classic MapReduce MR (add MRStat below for full count)
• MRStat (7) Simple version of MR where key computations are simple reductions as found in statistical averages such as histograms and averages (a minimal sketch follows this list)
• MRIter (23) Iterative MapReduce or MPI (Flink, Spark, Twister)
• Graph (9) Complex graph data structure needed in analysis
• Fusion (11) Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
• Streaming (41) Some data comes in incrementally and is processed this way
• Classify (30) Classification: divide data into categories
• S/Q (12) Index, Search and Query
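As an illustration of the MRStat pattern, here is a hedged Hadoop sketch (class names and paths are hypothetical, not from the use-case survey) in which maps emit one numeric value per record and a single reduce performs the simple statistical reduction, in this case a mean.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MeanStat {
  // Map: parse one numeric value per input line and emit it under a single key.
  public static class ValueMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private static final Text KEY = new Text("mean");
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(KEY, new DoubleWritable(Double.parseDouble(line.toString().trim())));
    }
  }
  // Reduce: the "statistical" reduction, a running sum and count giving the mean.
  public static class MeanReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0; long n = 0;
      for (DoubleWritable v : values) { sum += v.get(); n++; }
      ctx.write(key, new DoubleWritable(sum / n));
    }
  }
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "mrstat-mean");
    job.setJarByClass(MeanStat.class);
    job.setMapperClass(ValueMapper.class);
    job.setReducerClass(MeanReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In practice a combiner emitting partial (sum, count) pairs would be added so the reduction scales, which is exactly the kind of simple aggregation MRStat denotes.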
14. Sample Features of 51 Use Cases II
• CF (4) Collaborative Filtering for recommender engines
• LML (36) Local Machine Learning (independent for each parallel entity) – the application could have GML as well
• GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS,
– Large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can call this EGO or Exascale Global Optimization with scalable parallel algorithm (a small illustration follows this list)
• Workflow (51) Universal
• GIS (16) Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer etc.
• HPC (5) Classic large-scale simulation of cosmos, materials, etc. generating (visualization) data
• Agent (2) Simulations of models of data-defined macroscopic entities represented as agents
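A minimal sketch of the LML/GML distinction, under the assumption of a single shared-memory node with two workers (not SPIDAL code): each worker computes a gradient on its own data partition (the local step), and a global reduction averages the gradients so that every worker shares one model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class GlobalSgdSketch {
  public static void main(String[] args) throws Exception {
    double[][] x = { {1}, {2}, {3}, {4} };        // toy features
    double[] y = { 2, 4, 6, 8 };                  // toy targets (y = 2x)
    double[] w = { 0.0 };                         // the single shared model parameter
    int workers = 2;
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    for (int iter = 0; iter < 200; iter++) {
      List<Callable<double[]>> tasks = new ArrayList<>();
      for (int p = 0; p < workers; p++) {
        final int part = p;
        tasks.add(() -> {                         // local gradient over this partition (LML step)
          double g = 0;
          for (int i = part; i < x.length; i += workers)
            g += (w[0] * x[i][0] - y[i]) * x[i][0];
          return new double[] { g };
        });
      }
      double g = 0;
      for (Future<double[]> f : pool.invokeAll(tasks)) g += f.get()[0];
      w[0] -= 0.01 * (g / x.length);              // global reduction + model update (GML step)
    }
    pool.shutdown();
    System.out.printf("learned w = %.3f (expect ~2.0)%n", w[0]);
  }
}
```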
15. 7 Computational Giants of NRC Massive Data Analysis Report
1) G1: Basic Statistics, e.g. MRStat
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations, e.g. Linear Programming
6) G6: Integration, e.g. LDA and other GML
7) G7: Alignment Problems, e.g. BLAST
http://www.nap.edu/catalog.php?record_id=18374 (Big Data Models?)
16. HPC (Simulation) Benchmark Classics
• Linpack or HPL: parallel LU factorization for the solution of linear equations; HPCG
• NPB version 1: mainly classic HPC solver kernels
– MG: Multigrid
– CG: Conjugate Gradient
– FT: Fast Fourier Transform
– IS: Integer sort
– EP: Embarrassingly Parallel
– BT: Block Tridiagonal
– SP: Scalar Pentadiagonal
– LU: Lower-Upper symmetric Gauss-Seidel
Simulation Models
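As a reminder of what these solver kernels look like, here is a plain-Java conjugate gradient loop in the spirit of the NPB CG benchmark (the matrix here is a small 1D Laplacian chosen for the example, not the NPB-generated sparse matrix).

```java
public class ConjugateGradient {
  static double[] matVec(double[][] a, double[] v) {
    double[] out = new double[v.length];
    for (int i = 0; i < v.length; i++)
      for (int j = 0; j < v.length; j++) out[i] += a[i][j] * v[j];
    return out;
  }
  static double dot(double[] a, double[] b) {
    double s = 0; for (int i = 0; i < a.length; i++) s += a[i] * b[i]; return s;
  }
  public static void main(String[] args) {
    int n = 8;
    double[][] a = new double[n][n];             // 1D Laplacian: symmetric positive definite
    for (int i = 0; i < n; i++) {
      a[i][i] = 2;
      if (i > 0) a[i][i - 1] = -1;
      if (i < n - 1) a[i][i + 1] = -1;
    }
    double[] b = new double[n]; java.util.Arrays.fill(b, 1.0);
    double[] x = new double[n];
    double[] r = b.clone();                      // r = b - A*0
    double[] p = r.clone();
    double rsold = dot(r, r);
    for (int it = 0; it < n; it++) {             // converges in at most n steps in exact arithmetic
      double[] ap = matVec(a, p);
      double alpha = rsold / dot(p, ap);
      for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * ap[i]; }
      double rsnew = dot(r, r);
      if (Math.sqrt(rsnew) < 1e-10) break;
      for (int i = 0; i < n; i++) p[i] = r[i] + (rsnew / rsold) * p[i];
      rsold = rsnew;
    }
    System.out.println(java.util.Arrays.toString(x));
  }
}
```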
17. 13 Berkeley Dwarfs
1) Dense Linear Algebra
2) Sparse Linear Algebra
3) Spectral Methods
4) N-Body Methods
5) Structured Grids
6) Unstructured Grids
7) MapReduce
8) Combinational Logic
9) Graph Traversal
10) Dynamic Programming
11) Backtrack and Branch-and-Bound
12) Graphical Models
13) Finite State Machines
The first 6 of these correspond to Colella’s original list (classic simulations); Monte Carlo was dropped, and N-body methods are a subset of Particle in Colella. Note the list is a little inconsistent in that MapReduce is a programming model while spectral method is a numerical method. We need multiple facets to classify use cases! These are largely Models, for Data or Simulation.
19. Classifying Use Cases
• The Big Data Ogres built on a collection of 51 big data uses gathered by
the NIST Public Working Group where 26 properties were gathered for each
application.
• This information was combined with other studies including the Berkeley
dwarfs, the NAS parallel benchmarks and the Computational Giants of
the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features divided into four views that
could be used to categorize and distinguish between applications.
• The four views are Problem Architecture (Macro pattern); Execution
Features (Micro patterns); Data Source and Style; and finally the
Processing View or runtime features.
• We generalized this approach to integrate Big Data and Simulation
applications into a single classification looking separately at Data and
Model with the total facets growing to 64 in number, called convergence
diamonds, and split between the same 4 views.
• A mapping of facets into work of the SPIDAL project has been given.
21. 64 Features in 4 views for Unified Classification of Big Data
and Simulation Applications
[Figure: the 64 convergence diamond facets ("Convergence Diamonds: Views and Facets") arranged in four views, each facet labelled D (Data) or M (Model).
– Problem Architecture View: Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow.
– Execution View: Performance Metrics; Flops per Byte / Memory IO / Flops per watt; Execution Environment and core libraries; Data Volume and Model Size; Data Velocity; Data Variety and Model Variety; Veracity; Communication Structure; Dynamic vs. Static; Regular vs. Irregular; Iterative vs. Simple; Data Abstraction and Model Abstraction; Metric vs. Non-Metric data; O(N²) vs. O(N) model complexity.
– Data Source and Style View: SQL/NoSQL/NewSQL; Enterprise Data Model; Files/Objects; HDFS/Lustre/GPFS; Archived/Batched/Streaming (S1–S5); Shared/Dedicated/Transient/Permanent; Metadata/Provenance; Internet of Things; HPC Simulations; Geospatial Information System.
– Processing View: Micro-benchmarks; Local and Global (Analytics/Informatics/Simulations); Recommender Engine; Base Data Statistics; Data Search/Query/Index; Data Classification; Learning; Optimization Methodology; Streaming Data Algorithms; Data Alignment; Linear Algebra Kernels; Graph Algorithms; Visualization; Core Libraries (the Big Data Processing Diamonds), plus Multiscale Method; Iterative PDE Solvers; Nature of mesh if used; Evolution of Discrete Systems; Particles and Fields; N-body Methods; Spectral Methods (the Simulation/Exascale Processing Diamonds).]
22. Examples in Problem Architecture View PA
• The facets in the Problem architecture view include 5 very common ones
describing synchronization structure of a parallel job:
– MapOnly or Pleasingly Parallel (PA1): the processing of a collection of
independent events;
– MapReduce (PA2): independent calculations (maps) followed by a final
consolidation via MapReduce;
– MapCollective (PA3): parallel machine learning dominated by scatter,
gather, reduce and broadcast;
– MapPoint-to-Point (PA4): simulations or graph processing with many
local linkages in points (nodes) of the studied system.
– MapStreaming (PA5): The fifth important problem architecture is seen in
recent approaches to processing real-time data.
– We do not focus on pure shared memory architectures (PA6) but look at
hybrid architectures with clusters of multicore nodes and find important
performance issues dependent on the node programming model.
• Most of our codes are SPMD (PA-7) and BSP (PA-8).
23. 6 Forms of MapReduce
• Describes the architecture of the problem (model reflecting data), the machine, and the software
• 2 important (software) variants of Iterative MapReduce and Map-Streaming:
a) "In-place" HPC
b) Flow for model and data
[Figure: the six forms – 1) Map-Only, Pleasingly Parallel (input, maps, output); 2) Classic MapReduce (input, maps, reduce); 3) Iterative MapReduce or Map-Collective (input, maps, reduce, iterations); 4) Map Point-to-Point Communication (maps linked by a local graph, with iterations); 5) Map-Streaming (maps fed events by brokers); 6) Shared-Memory Map Communication (map and communication within shared memory).]
24. Examples in Execution View EV
• The Execution view is a mix of facets describing either data or model; PA
was largely the overall Data+Model
• EV-M14 is the complexity of the model (O(N²) for N points), seen in the non-metric
space models EV-M13 such as one gets with DNA sequences.
• EV-M11 describes iterative structure distinguishing Spark, Flink, and Harp
from the original Hadoop.
• The facet EV-M8 describes the communication structure, which is a focus
of our research: much data analytics relies on collective communication,
which is understood in principle, but we find that significant new work is
needed compared to basic HPC releases, which tend to address point-to-point
communication.
• The model size EV-M4 and data volume EV-D4 are important in describing
the algorithm performance as just like in simulation problems, the grain size
(the number of model parameters held in the unit – thread or process – of
parallel computing) is a critical measure of performance.
25. Examples in Data View DV
• We can highlight DV-5 streaming where there is a lot of recent
progress;
• DV-9 categorizes our Biomolecular simulation application with
data produced by an HPC simulation
• DV-10 is Geospatial Information Systems covered by our
spatial algorithms.
• DV-7, provenance, is an example of an important feature that
we are not covering.
• The data storage and access facets DV-3 and DV-4 are covered in our
pilot data work.
• The Internet of Things DV-8 is not a focus of our project
although our recent streaming work relates to this and our
addition of HPC to Apache Heron and Storm is an example of
the value of HPC-ABDS to IoT.
26. Examples in Processing View PV
• The Processing view PV characterizes algorithms and is only Model (no
Data features) but covers both Big data and Simulation use cases.
• Graph PV-M13 and Visualization PV-M14 are covered in SPIDAL.
• PV-M15 directly describes SPIDAL which is a library of core and other
analytics.
• This project covers many aspects of PV-M4 to PV-M11 as these
characterize the SPIDAL algorithms (such as optimization, learning,
classification).
– We are of course NOT addressing PV-M16 to PV-M22 which are
simulation algorithm characteristics and not applicable to data analytics.
• Our work largely addresses Global Machine Learning PV-M3 although
some of our image analytics are local machine learning PV-M2 with
parallelism over images and not over the analytics.
• Many of our SPIDAL algorithms have linear algebra PV-M12 at their core;
one nice example is multi-dimensional scaling MDS which is based on
matrix-matrix multiplication and conjugate gradient.
27. Comparison of Data Analytics with Simulation I
• Simulations (models) produce big data as visualization of results – they
are data sources
– Or consume often smallish data to define a simulation problem
– HPC simulation in (weather) data assimilation is data + model
• Pleasingly parallel often important in both
• Both are often SPMD and BSP
• Non-iterative MapReduce is major big data paradigm
– not a common simulation paradigm except where “Reduce” summarizes
pleasingly parallel execution as in some Monte Carlos
• Big Data often has large collective communication
– Classic simulation has a lot of smallish point-to-point messages
– Motivates MapCollective model
• Simulations characterized often by difference or differential operators
leading to nearest neighbor sparsity
• Some important data analytics can be sparse as in PageRank and "Bag of words"
algorithms, but many involve full matrix algorithms
28. Comparison of Data Analytics with Simulation II
• There are similarities between some graph problems and particle
simulations with a particular cutoff force.
– Both have the MapPoint-to-Point problem architecture
• Note many big data problems are “long range force” (as in
gravitational simulations) as all points are linked.
– Easiest to parallelize. Often full matrix algorithms
– e.g. in DNA sequence studies, distance (i, j) defined by BLAST,
Smith-Waterman, etc., between all sequences i, j.
– Opportunity for “fast multipole” ideas in big data. See NRC report
• Current Ogres/Diamonds do not have facets to designate the underlying
hardware: GPU vs. many-core (Xeon Phi) vs. multi-core, as these
define how maps are processed while keeping the map-X structure fixed; maybe this
should change, as the ability to exploit vector or SIMD parallelism could
be a model facet.
29. Comparison of Data Analytics with Simulation III
• In image-based deep learning, neural network weights are block sparse
(corresponding to links to pixel blocks) but can be formulated as full
matrix operations on GPUs and MPI in blocks.
• In HPC benchmarking, Linpack is being challenged by a new sparse
conjugate gradient benchmark, HPCG, while I am diligently using non-
sparse conjugate gradient solvers in clustering and multidimensional
scaling.
• Simulations tend to need high precision and very accurate results –
partly because of differential operators
• Big Data problems often don’t need high accuracy as seen in trend to
low precision (16 or 32 bit) deep learning networks
– There are no derivatives and the data has inevitable errors
• Note parallel machine learning (GML not LML) can benefit from HPC
style interconnects and architectures as seen in GPU-based deep
learning
– So commodity clouds not necessarily best
31. Clustering
• The SPIDAL Library includes several clustering algorithms with
sophisticated features
– Deterministic Annealing
– Radius cutoff in cluster membership
– Elkan's algorithm using the triangle inequality (sketched below)
• They also cover two important cases
– Points are vectors – the algorithm is O(N) for N points
– Points are not vectors – all we know is the distance δ(i, j) between each pair of
points i and j; the algorithm is O(N²) for N points
• We find visualization important to judge quality of clustering
• As data typically not 2D or 3D, we use dimension reduction to project data
so we can then view it
• Have a browser viewer WebPlotViz that replaces an older Windows system
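As a rough illustration of the Elkan idea mentioned above, the sketch below uses the triangle inequality to skip center-point distance evaluations during assignment; it is a simplified, single-bound version of Elkan's method, with illustrative function and variable names.

```python
# Simplified triangle-inequality pruning in the spirit of Elkan's k-means (illustrative sketch)
import numpy as np

def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping distances ruled out by the bound."""
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)  # center-center gaps
    labels = np.empty(len(points), dtype=int)
    skipped = 0
    for i, x in enumerate(points):
        best = 0
        best_d = np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            # if d(x, best) <= 0.5 * d(best, j), center j cannot be closer than the current best
            if best_d <= 0.5 * cc[best, j]:
                skipped += 1
                continue
            d = np.linalg.norm(x - centers[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```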
32. 2D Vector Clustering with cutoff at 3σ
LCMS Mass Spectrometer Peak Clustering. Charge 2 Sample with 10.9 million points
and 420,000 clusters visualized in WebPlotViz
Orange stars – points outside all clusters; yellow circles – cluster centers
33. Dimension Reduction
• Principal Component Analysis (linear mapping) and Multidimensional
Scaling MDS (nonlinear and applicable to non-Euclidean spaces) are
methods to map abstract spaces to three dimensions for visualization
• Both run well in parallel and give great results
• Semimetric spaces have pairwise distances δ(i, j) defined between points i and j in the
space
• But the data is typically in a high dimensional or non-vector space, so we use
dimension reduction: associate each point i with a vector Xi in a Euclidean
space of dimension K so that δ(i, j) ≈ d(Xi, Xj), where d(Xi, Xj) is the Euclidean
distance between the mapped points i and j in the K-dimensional space.
• K = 3 natural for visualization but other values interesting
• Principal Component analysis is best known dimension reduction approach
but a) linear b) requires original points in a vector space
• There are many other nonlinear vector space methods such as GTM
Generative Topographic Mapping
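As a small, hedged illustration of the two approaches above, the sketch below uses scikit-learn: PCA needs the original vectors, while MDS can work directly from a precomputed distance matrix δ(i, j). The random data is a stand-in for a real high-dimensional or non-vector dataset.

```python
# Illustrative 3D projections with PCA (linear, needs vectors) and MDS (nonlinear, distances only)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

X = np.random.rand(200, 50)                                  # stand-in high-dimensional data
X3_pca = PCA(n_components=3).fit_transform(X)                # linear mapping of the original vectors

delta = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances delta(i, j)
X3_mds = MDS(n_components=3, dissimilarity="precomputed").fit_transform(delta)  # distance-only mapping
```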
34. WDA-SMACOF “Best” MDS
• MDS minimizes the Stress σ(X) with pairwise distances δ(i, j):
σ(X) = Σ_{i<j≤N} weight(i, j) (δ(i, j) − d(Xi, Xj))²
• SMACOF is a clever Expectation Maximization method that chooses a good steepest
descent
• Improved by Deterministic Annealing gradually reducing the Temperature
distance scale; DA does not impact compute time much and gives DA-
SMACOF
– Deterministic Annealing is like Simulated Annealing but with no Monte Carlo
• Classic SMACOF is O(N²) for uniform weights and O(N³) for nontrivial weights,
but we get nonuniform weights from
– The preferred Sammon method weight(i, j) = 1/δ(i, j), or
– Missing distances put in as weight(i, j) = 0
• Use conjugate gradient – converges in 5-100 iterations – a big gain for a
matrix with a million rows. This removes a factor of N in time complexity and
gives WDA-SMACOF
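A minimal sketch of the weighted stress defined above, with the Sammon weighting and zero weight for missing distances; delta is the original pairwise distance matrix, X a candidate 3D embedding, and the function and argument names are illustrative (this is the objective only, not the SMACOF or conjugate gradient solver).

```python
# Weighted MDS stress sigma(X) with Sammon weights 1/delta(i, j) (illustrative sketch)
import numpy as np

def weighted_stress(delta, X, missing_mask=None):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # distances of mapped points
    with np.errstate(divide="ignore"):
        w = np.where(delta > 0, 1.0 / delta, 0.0)                # Sammon weight 1/delta(i, j)
    if missing_mask is not None:
        w[missing_mask] = 0.0                                    # missing distances get weight 0
    iu = np.triu_indices_from(delta, k=1)                        # sum over pairs i < j
    return np.sum(w[iu] * (delta[iu] - d[iu]) ** 2)
```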
36. Fungi -- 4 Classic Clustering Methods plus Species
Coloring
37. Heatmap of original distance vs 3D Euclidean
Distances for Sequences and Stocks
• One can visualize the quality of dimension reduction by comparing, as a scatterplot
or heatmap, the distances δ(i, j) before and after mapping to 3D.
• Perfection is a diagonal straight line, and the results seem good in general
Proteomics Example
Stock Market Example
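A small sketch of this quality check under the same assumptions: given the original distance matrix delta and the mapped 3D points X3, it draws a 2D histogram (heatmap) of original versus mapped distances, where a tight diagonal indicates a faithful mapping. Names are illustrative.

```python
# Heatmap of original distances delta(i, j) vs. 3D Euclidean distances of the mapped points (sketch)
import numpy as np
import matplotlib.pyplot as plt

def distance_heatmap(delta, X3, bins=100):
    iu = np.triu_indices_from(delta, k=1)                        # pairs i < j
    d3 = np.linalg.norm(X3[:, None, :] - X3[None, :, :], axis=2)
    plt.hist2d(delta[iu], d3[iu], bins=bins)
    plt.xlabel("original distance delta(i, j)")
    plt.ylabel("3D Euclidean distance d(Xi, Xj)")
    plt.colorbar(label="pair count")
    plt.show()
```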
40. Quality of 3D Phylogenetic Tree
• 3 different MDS implementations and 3 different distance
measures
• EM-SMACOF is basic SMACOF for MDS
• LMA was the previous best method, using a Levenberg-Marquardt
nonlinear χ² solver
• WDA-SMACOF finds best result
[Charts: sum of branch lengths of the Spherical Phylogram generated in 3D space on two datasets (599nts and 999nts), comparing WDA-SMACOF, LMA, and EM-SMACOF for the MSA, SWG, and NW distance measures.]
41. HTML5 web viewer WebPlotViz
• Supports visualization of 3D point sets (typically derived by mapping from
abstract spaces) for streaming and non-streaming case
– Simple data management layer
– 3D web visualizer with various capabilities such as defining color
schemes, point sizes, glyphs, labels
• Core Technologies
– MongoDB management
– Play Server side framework
– Three.js
– WebGL
– JSON data objects
– Bootstrap Javascript web pages
• Open Source
http://spidal-gw.dsc.soic.indiana.edu/
• ~10,000 lines of extra code
[Architecture: a browser front end (plot visualization and time-series animation with Three.js) sends requests to web request controllers (Play Framework) on the server; uploads go through an upload-format-to-JSON converter into the MongoDB data layer, and plot requests are answered with JSON-format plots.]
42. Stock Daily Data Streaming Example
• Example is collection of around 7000 distinct stocks with daily values
available at ~2750 distinct times
– Clustering as provided by Wall Street – Dow Jones set of 30
stocks, S&P 500, various ETF’s etc.
• The Center for Research in Security Prices (CRSP) database, accessed through
the Wharton Research Data Services (wrds) web interface
• Available for free to Indiana University students for research
• Daily stock prices from 2004 Jan 01 to 2015 Dec 31 in the form of a
CSV file
• We use the information
– ID, Date, Symbol, Factor to Adjust Volume, Factor to Adjust Price,
Price, Outstanding Stocks
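A hedged sketch of the relative-change calculation behind the next two slides, assuming a CSV with columns like those listed above; the file name, column names, and the adjustment convention (adjusted price = Price / Factor to Adjust Price) are illustrative assumptions, not the actual CRSP layout.

```python
# Illustrative relative-change calculation from a daily-price CSV (column names are assumptions)
import pandas as pd

df = pd.read_csv("crsp_daily.csv", parse_dates=["Date"])
df["AdjPrice"] = df["Price"] / df["Factor to Adjust Price"]      # assumed adjustment convention

# baseline: first adjusted price on or after 2005-01-01 for each stock symbol
base = (df[df["Date"] >= "2005-01-01"]
        .sort_values("Date")
        .groupby("Symbol")["AdjPrice"]
        .first())

# relative change vs. the baseline; 0.10 corresponds to the "+10%" level on the plots
df["RelChange"] = df["AdjPrice"] / df["Symbol"].map(base) - 1.0
```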
43. Relative Changes in Stock Values using one day values, measured from January 2004
and starting after one year, in January 2005; filled circles are final values
[Plot: trajectories for Apple, Mid Cap, Energy, S&P, Dow Jones, and Finance relative to the origin (0% change), with +10% and +20% reference levels marked.]
44. Relative Changes in Stock Values using one day values – expansion of the previous data
[Plots: the Mid Cap, Energy, S&P, Dow Jones, and Finance trajectories relative to the origin (0% change), with a +10% reference level marked.]
45. Algorithm Challenge
• The NRC Massive Data Analysis report stresses the importance of finding O(N)
or O(N log N) algorithms for O(N²) problems
– N is the number of points
• This is well understood for O(N²) simulation problems where there is a long
range force, as in gravitational (cosmology) simulations for N stars or
galaxies
• Simulations are governed by equations that allow a systematic "multipole
expansion" with O(N) as the first term plus corrections
– Has been used successfully in parallel for 25 years
• O(N²) big data problems don't have a systematic practical approach even
though there is a qualitative argument shown on the next slide.
• The work w(i, j) is labelled by two indices i and j, each running from 1 to N.
• If points i and j are near each other, need to perform accurate calculations
• If far apart, can use approximations and for example, replace points in a far
away cluster of M particles by their cluster center weighted by M
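As a hedged sketch of the idea above (exact work for nearby pairs, a single centroid term weighted by cluster size for far-apart pairs), the code below assumes points already grouped into clusters with known centroids; the kernel, threshold, and function names are illustrative, and this is a qualitative illustration rather than a working fast-multipole scheme.

```python
# Illustrative "centroid instead of all pairs" approximation for far-apart clusters
import numpy as np

def approximate_pair_sum(clusters, kernel, far_threshold):
    """clusters: list of (points, centroid) pairs; kernel(x, y) -> scalar pair contribution."""
    total = 0.0
    for i, (pts_i, c_i) in enumerate(clusters):
        for pts_j, c_j in clusters[i + 1:]:
            if np.linalg.norm(c_i - c_j) > far_threshold:
                # far apart: one centroid-centroid term weighted by the cluster sizes M_i * M_j
                total += len(pts_i) * len(pts_j) * kernel(c_i, c_j)
            else:
                # nearby: accurate O(|i| * |j|) pairwise calculation
                total += sum(kernel(x, y) for x in pts_i for y in pts_j)
    return total
```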
46. [Figure: the O(N²) interactions between the green and purple clusters should be representable by centroids, as in Barnes-Hut. This is hard because there is no Gauss theorem and no multipole expansion, and the points really live in a 1000-dimensional space, having been clustered before the 3D projection. The O(N²) green-green and purple-purple interactions have value, but the green-purple ones are "wasted". For a "clean" sample of 446K points, the O(N²) work is reduced to O(N) times the cluster size.]
47. Software Nexus
Application Layer on
Big Data Software Components for Programming and Data Processing on
HPC for the runtime on
IaaS and DevOps Hardware and Systems
• HPC-ABDS
• MIDAS
• Java Grande
48. HPC-ABDS
Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies
Cross-Cutting Functions:
1) Message and Data Protocols: Avro, Thrift, Protobuf
2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
4) Monitoring: Ambari, Ganglia, Nagios, Inca
17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad,
Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA),
Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML
16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA MAGMA,
Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j,
H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables,
CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud
Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT,
Agave, Atmosphere
15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq,
Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird
14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn, Twitter Heron, Databus, Facebook
Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco,
Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, HPX-5, Argo BEAST HPX-5 BEAST PULSAR, Harp, Netty,
ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective, Public Cloud: Amazon
SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB,
H-Store
12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC
12) Extraction Tools: UIMA, Tika
11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal
Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB,
Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J,
graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame
Public Cloud: Azure Table, Amazon Dynamo, Google DataStore
11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm,
Torque, Globus Tools, Pilot Jobs
8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS
Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage
7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat,
Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes,
Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula,
Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds
Networking: Google Cloud DNS, Amazon Route 53
21 layers, over 350 software packages (January 29, 2016)
49. Functionality of 21 HPC-ABDS Layers
1) Message Protocols:
2) Distributed Coordination:
3) Security & Privacy:
4) Monitoring:
5) IaaS Management from HPC to
hypervisors:
6) DevOps:
7) Interoperability:
8) File systems:
9) Cluster Resource
Management:
10) Data Transport:
11) A) File management
B) NoSQL
C) SQL
12) In-memory databases & caches /
Object-relational mapping / Extraction
Tools
13) Inter process communication
Collectives, point-to-point, publish-
subscribe, MPI:
14) A) Basic Programming model and
runtime, SPMD, MapReduce:
B) Streaming:
15) A) High level Programming:
B) Frameworks
16) Application and Analytics:
17) Workflow-Orchestration:
Lesson of the large number (~350): this is a rich software
environment that HPC cannot "compete" with; we need to
use it, not regenerate it.
Note level 13, Inter-process communication, was added
50. HPC-ABDS SPIDAL Project Activities
• Level 17: Orchestration: Apache Beam (Google Cloud Dataflow) integrated
with Heron/Flink and Cloudmesh on HPC cluster
• Level 16: Applications: Datamining for molecular dynamics, Image processing
for remote sensing and pathology, graphs, streaming, bioinformatics, social
media, financial informatics, text mining
• Level 16: Algorithms: Generic and custom for applications SPIDAL
• Level 14: Programming: Storm, Heron (Twitter replaces Storm), Hadoop,
Spark, Flink. Improve Inter- and Intra-node performance; science data structures
• Level 13: Runtime Communication: Enhanced Storm and Hadoop (Spark,
Flink, Giraph) using HPC runtime technologies, Harp
• Level 12: In-memory Database: Redis + Spark used in Pilot-Data Memory
• Level 11: Data management: Hbase and MongoDB integrated via use of Beam
and other Apache tools; enhance Hbase
• Level 9: Cluster Management: Integrate Pilot Jobs with Yarn, Mesos, Spark,
Hadoop; integrate Storm and Heron with Slurm
• Level 6: DevOps: Python Cloudmesh virtual Cluster Interoperability
Green is MIDAS; black is SPIDAL
51. Java Grande
Revisited on 3 data analytics codes – Clustering, Multidimensional Scaling, and
Latent Dirichlet Allocation – all sophisticated algorithms
52. Java MPI performs better than FJ Threads
128 24-core Haswell nodes on the SPIDAL 200K DA-MDS code
[Chart: speedup compared to 1 process per node on 48 nodes, for three configurations – best FJ threads intra-node with MPI inter-node; best MPI inter- and intra-node; and MPI inter/intra-node with Java not optimized.]
53. Investigating Process and Thread Models
• FJ (Fork Join) threads give lower performance than Long Running Threads (LRT)
• Results
– Large effects for Java
– Best affinity is process and thread binding to cores (CE)
– At best, LRT mimics the performance of "all processes"
• 6 thread/process affinity models compared
54. Java and C K-Means, LRT-FJ and LRT-BSP, with different
affinity patterns over varying threads and processes
[Charts: Java and C results for 10⁶ points and 1,000 centers on 16 nodes, and for 10⁶ points with 50k and 500k centers on 16 nodes.]
55. Java versus C Performance
• C and Java Comparable with Java doing better on larger problem
sizes
• All data from one million point dataset with varying number of centers
on 16 nodes 24 core Haswell
56. Performance Dependence on Number of Cores inside node (16 nodes total)
• Long-Running Threads (LRT), Java: all processes; all threads internal to node; hybrid – one process per chip
• Fork Join, Java: all threads; hybrid – one process per chip
• Fork Join, C: all threads
• All use MPI internode
58. HPC-ABDS Parallel Computing
• Both simulations and data analytics use similar parallel computing ideas
• Both do decomposition of both model and data
• Both tend to use SPMD and often use BSP (Bulk Synchronous Processing)
• Both have computing phases (called maps in big data terminology) and
communication/reduction (more generally, collective) phases
• Big data thinks of problems as multiple linked queries even when queries
are small and uses dataflow model
• Simulation uses dataflow for multiple linked applications but small steps
such as iterations are done in place
• Reduction in HPC (MPIReduce) done as optimized tree or pipelined
communication between same processes that did computing
• Reduction in Hadoop or Flink done as separate map and reduce processes
using dataflow
– This leads to 2 forms (In-Place and Flow) of Map-X mentioned earlier
• Interesting Fault Tolerance issues highlighted by Hadoop-MPI comparisons
– not discussed here!
60. Breaking Programs into Parts
[Figure: coarse grain dataflow (HPC or ABDS) linking program components, each decomposed internally by fine grain parallel computing over data/model parameters.]
61. K-means Clustering with Flink and MPI:
one million 2D points fixed; various numbers of centers;
24 cores per node on 16 nodes
62. HPC-ABDS Parallel Computing I
• MPI is designed for the fine grain case and is typical of parallel computing
used in large scale simulations
– Only change in model parameters are transmitted
– In-place implementation
• Dataflow typical of distributed or Grid computing paradigms
– Data sometimes and model parameters certainly transmitted
– Caching in iterative MapReduce avoids data communication and
in fact systems like TensorFlow, Spark or Flink are called dataflow
but usually implement “model-parameter” flow
• We quantify this by an overhead analysis on next slide that works
for “in-place” runtimes. Flow implementations have additional
sources of overhead that we know are large but haven’t studied as
quantitatively
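To make the "in-place, only model parameters transmitted" point concrete, here is a hedged sketch of one k-means iteration with mpi4py: each rank keeps its own block of points, and only the O(k·d) center sums and counts move over the network. The function and variable names are illustrative; this is not the Flink/MPI benchmark code used later.

```python
# One "in-place" k-means iteration: data stays put, only model parameters are allreduced (sketch)
import numpy as np
from mpi4py import MPI

def kmeans_step(points, centers, comm=MPI.COMM_WORLD):
    """points: (n_local, d) block owned by this rank; centers: (k, d) replicated model."""
    k, d = centers.shape
    # assign each local point to its nearest center
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # local partial sums of the model update
    local_sums = np.zeros((k, d))
    local_counts = np.zeros(k)
    for c in range(k):
        mask = labels == c
        local_sums[c] = points[mask].sum(axis=0)
        local_counts[c] = mask.sum()
    # collective communication transmits only the model parameters, never the data
    global_sums = np.empty_like(local_sums)
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_sums, global_sums, op=MPI.SUM)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    return global_sums / np.maximum(global_counts, 1.0)[:, None]
```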
63. HPC-ABDS Parallel Computing II
• Overheads are given by similar formulae for big data and
simulations
Overhead f = (1 / model parameter size in each map)^n × (typical hardware communication cost / typical computing cost)
• Index n > 0 depends on the communication structure
– n = 0.5 for matrix problems; n = 1 for O(N²) problems
• Large f: intra-job reduction, such as K-means clustering,
where one has center changes at the end of each iteration
• Small f: inter-job reduction, as at the end of a query, as seen in
workflow
• Increasing the grain size (= model parameter size in each map)
decreases the overhead since n > 0
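A toy sketch of the overhead formula above; the cost constants are illustrative placeholders, not measured values, and the point is only the qualitative trend that larger grain sizes drive the overhead down.

```python
# Illustrative evaluation of f = (1 / grain_size)^n * (t_comm / t_calc); constants are placeholders
def parallel_overhead(grain_size, n=0.5, t_comm=1.0e-6, t_calc=1.0e-9):
    return (1.0 / grain_size) ** n * (t_comm / t_calc)

# overhead falls as the model-parameter grain size per map grows (since n > 0)
for grain in (10**3, 10**5, 10**7):
    print(grain, parallel_overhead(grain))
```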
64. HPC-ABDS Parallel Computing III
• For a given application, we need to understand:
– Are we using Data Flow or “Model-parameter” Flow
– Requirements of compute/communication ratio
• Inefficient to use same runtime mechanism independent of
characteristics
– Use In-Place or Flow Software implementations
• Classic dataflow is the approach of Spark and Flink, so we need to add
parallel in-place computing as done by Harp for Hadoop
– TensorFlow also uses In-Place technology
• HPC-ABDS plan is to keep current user interfaces (say to Spark
Flink Hadoop Storm Heron) and transparently use HPC to improve
performance exploiting added level 13 in HPC-ABDS
• We have done this to Hadoop (next Slide), Spark, Storm, Heron
– Working on further HPC integration with ABDS
65. Harp (Hadoop Plugin) brings HPC to ABDS
• Basic Harp: Iterative HPC communication; scientific data abstractions
• Careful support of distributed data AND distributed model
• Avoids parameter server approach but distributes model over worker nodes
and supports collective communication to bring global model to each node
• Applied first to Latent Dirichlet Allocation LDA with large model and data
[Figure: the MapReduce model (maps M feeding a shuffle into reducers R) alongside the MapCollective model (maps M linked by collective communication); in the software stack, Harp sits beside MapReduce V2 on YARN, supporting both MapReduce applications and MapCollective applications.]
69. Adding HPC to Storm & Heron for Streaming
• Robotics applications: a robot with a laser range finder builds a map from the robot data (Simultaneous Localization and Mapping); robots need to avoid collisions when they move (N-Body Collision Avoidance)
• Time series data visualization in real time: map high dimensional data to a 3D visualizer; apply to stock market data tracking 6000 stocks
70. Data Pipeline
• Hosted on HPC and an OpenStack cloud; end-to-end delay without any processing is less than 10 ms
• A gateway sends data to pub-sub message brokers (RabbitMQ, Kafka) and to persisting storage
• Streaming workflows (Apache Heron and Storm) consume the brokered streams; a stream application runs some tasks in parallel, and multiple streaming workflows are supported
• Storm does not support "real parallel processing" within bolts – add optimized inter-bolt communication
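A hedged sketch of the gateway-to-broker step in this pipeline, publishing one message to a RabbitMQ queue with the pika client; the queue name and payload fields are illustrative and not taken from the actual gateway code.

```python
# Publish one illustrative sensor message to a RabbitMQ queue with pika (names are assumptions)
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="robot-scans")

payload = json.dumps({"robot_id": 1, "ranges": [0.8, 1.2, 2.5], "timestamp": 1472342400.0})
channel.basic_publish(exchange="", routing_key="robot-scans", body=payload)
connection.close()
```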
73. Workflow in HPC-ABDS
• HPC is familiar with Taverna, Pegasus, Kepler, Galaxy, etc., and
• ABDS has many workflow systems with recent Apache
systems being Crunch, NiFi and Beam (open source version
of Google Cloud Dataflow)
– Use ABDS for sustainability reasons?
– ABDS approaches are better integrated than HPC approaches with
ABDS data management like Hbase and are optimized for distributed
data.
• Heron, Spark and Flink provide distributed dataflow runtime
which is needed for workflow
• Beam uses Spark or Flink as runtime and supports streaming
and batch data
• Needs more study
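As a small, hedged illustration of the Beam model mentioned above, the pipeline below runs on the local DirectRunner by default and, with different pipeline options, the same code can be submitted to a Spark or Flink runner; the data and transform labels are illustrative.

```python
# Minimal Apache Beam pipeline: the same code can target the Direct, Spark, or Flink runner
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create(["a,2.5", "b,3.1", "a,0.4"])        # stand-in for a real source
     | "Parse" >> beam.Map(lambda line: line.split(","))
     | "ToKeyValue" >> beam.Map(lambda kv: (kv[0], float(kv[1])))
     | "SumPerKey" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```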
74. Automatic parallelization
• Database community looks at big data job as a dataflow of (SQL) queries
and filters
• Apache projects like Pig, MRQL and Flink aim at automatic query
optimization by dynamic integration of queries and filters including
iteration and different data analytics functions
• Going back to ~1993, High Performance Fortran (HPF) compilers optimized
sets of array and loop operations for large scale parallel execution of
optimized vector and matrix operations
• HPF worked fine for initial simple regular applications but ran into trouble
for cases where parallelism is hard (irregular, dynamic)
• Will same happen in Big Data world?
• It is straightforward to parallelize k-means clustering, but sophisticated
algorithms like Elkan's method (using the triangle inequality) and fuzzy
clustering are much harder (though not used much NOW)
• Will Big Data technology run into HPF-style trouble with growing use of
sophisticated data analytics?
76. Constructing HPC-ABDS Exemplars
• This is one of next steps in NIST Big Data Working Group
• Jobs are defined hierarchically as a combination of Ansible scripts (preferred over Chef or
Puppet as it is Python-based)
• Scripts are invoked on Infrastructure (Cloudmesh Tool)
• The INFO 524 "Big Data Open Source Software Projects" IU Data Science class
required the final project to be defined in Ansible, and a decent grade required that the script
worked (on NSF Chameleon and FutureSystems)
– 80 students gave 37 projects with ~15 pretty good such as
– “Machine Learning benchmarks on Hadoop with HiBench”, Hadoop/Yarn, Spark,
Mahout, Hbase
– “Human and Face Detection from Video”, Hadoop (Yarn), Spark, OpenCV,
Mahout, MLLib
• Build up curated collection of Ansible scripts defining use cases for
benchmarking, standards, education
https://docs.google.com/document/d/1INwwU4aUAD_bj-XpNzi2rz3qY8rBMPFRVlx95k0-xc4
• The Fall 2015 INFO 523 introductory data science class was less constrained;
students just had to run a data science application, but the resulting catalog is interesting
– 140 students: 45 projects (NOT required) with 91 technologies, 39 datasets
77. Cloudmesh Interoperability DevOps Tool
• Model: Define software configuration with tools like Ansible (Chef,
Puppet); instantiate on a virtual cluster
• Save scripts not virtual machines and let script build applications
• Cloudmesh is an easy-to-use command line program/shell and portal to
interface with heterogeneous infrastructures taking script as input
– It first defines virtual cluster and then instantiates script on it
– It has several common Ansible defined software built in
• Supports OpenStack, AWS, Azure, SDSC Comet, virtualbox, libcloud
supported clouds as well as classic HPC and Docker infrastructures
– Has an abstraction layer that makes it possible to integrate other IaaS
frameworks
• Managing VMs across different IaaS providers is easier
• Demonstrated interaction with various cloud providers:
– FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera,
AWS, Azure, virtualbox
• Status: AWS, Azure, VirtualBox, and Docker need improvements; we
currently focus on SDSC Comet and NSF resources that use OpenStack
HPC Cloud Interoperability Layer
78. Cloudmesh Architecture
• We define a basic virtual cluster which is a set of instances with a common security context
• We then add basic tools including languages Python Java etc.
• Then add management tools such as Yarn, Mesos, Storm, Slurm, etc.
• Then add roles for different HPC-ABDS PaaS subsystems such as Hbase, Spark
– There will be dependencies e.g. Storm role uses Zookeeper
• Any one project picks some of HPC-ABDS PaaS Ansible roles and adds >=1 SaaS that are
specific to their project and for example read project data and perform project analytics
• E.g. there will be an OpenCV role used in Image processing applications
Software Engineering Process
79. Summary of Big Data - Big Simulation Convergence?
• HPC-Clouds convergence? (easier than converging higher levels in the stack)
• Can HPC continue to do it alone?
• Convergence Diamonds
• HPC-ABDS software on differently optimized hardware infrastructure
80. General Aspects of Big Data HPC Convergence
• Applications, Benchmarks and Libraries
– 51 NIST Big Data Use Cases, 7 Computational Giants of the NRC Massive Data Analysis,
13 Berkeley dwarfs, 7 NAS parallel benchmarks
– Unified discussion by separately discussing data & model for each application;
– 64 facets– Convergence Diamonds -- characterize applications
– Characterization identifies hardware and software features for each application across big
data, simulation; “complete” set of benchmarks (NIST)
• Software Architecture and its implementation
– HPC-ABDS: Cloud-HPC interoperable software: performance of HPC (High Performance
Computing) and the rich functionality of the Apache Big Data Stack.
– Added HPC to Hadoop, Storm, Heron, Spark; could add to Beam and Flink
– Could work in Apache model contributing code
• Run same HPC-ABDS across all platforms but “data management” nodes have different
balance in I/O, Network and Compute from “model” nodes
– Optimize to data and model functions as specified by convergence diamonds
– Do not optimize for simulation and big data
• Convergence Language: Make C++, Java, Scala, Python (R) … perform well
• Training: Students prefer to learn Big Data rather than HPC
• Sustainability: research/HPC communities cannot afford to develop everything (hardware and
software) from scratch
81. Typical Convergence Architecture
• Running same HPC-ABDS software across all platforms but data
management machine has different balance in I/O, Network and Compute
from “model” machine
– Note the data storage approach: the choice among HDFS vs. Object Store vs. Lustre-style file
systems is still rather unclear
• The Model behaves similarly whether from Big Data or Big Simulation.
[Figure: the Data Management machine shown as racks of nodes each pairing compute (C) with data (D), alongside the Model machine shown as racks of compute-only (C) nodes – "Data Management" and "Model" for Big Data and Big Simulation.]