Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016. ODPi, Big Data, Hadoop, Spark, H2O, Sparkling Water, performance benchmarking on ARM64/AArch64.
My presentation for the first user group meeting of our lab's Big Data IWT TETRA project [*]. In the presentation, I gave a demo of Cloudera Manager, discussed four micro-benchmarks, and concluded with an overview of the BigBench benchmark.
[*] For more information on what IWT TETRA funding exactly is, see http://www.iwt.be/english/funding/subsidy/tetra
Parallel Linear Regression in Iterative Reduce and YARN - DataWorks Summit
Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we will take a look at how we parallelized linear regression parameter optimization on the next-gen YARN framework Iterative Reduce.
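The map-reduce-iterate pattern described above can be sketched in plain Python. This is a conceptual illustration of parameter averaging for parallel SGD, not the talk's actual Iterative Reduce code; the data, learning rate, and partitioning are made up for the example.

```python
import random

def sgd_partition(w, data, lr=0.01, epochs=5):
    """Run plain SGD for simple linear regression (y = w*x) on one partition."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

def parallel_sgd(partitions, rounds=10):
    """Iterative-Reduce style loop: broadcast w, run SGD per partition (map),
    then average the partial models (reduce), and iterate."""
    w = 0.0
    for _ in range(rounds):
        partials = [sgd_partition(w, part) for part in partitions]  # map phase
        w = sum(partials) / len(partials)                           # reduce phase
    return w

# Synthetic data with true slope 3.0
random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(400)]]
parts = [data[i::4] for i in range(4)]  # 4 simulated workers
w = parallel_sgd(parts)
```

In a real YARN deployment the per-partition SGD runs on separate containers; the loop structure stays the same.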
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling - Databricks
Methods that scale with available computation are the future of AI. Distributed deep learning is one such method that enables data scientists to massively increase their productivity by (1) running parallel experiments over many devices (GPUs/TPUs/servers) and (2) massively reducing training time by distributing the training of a single network over many devices. Apache Spark is a key enabling platform for distributed deep learning, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end pipeline. In this talk, we examine the different ways in which TensorFlow can be included in Spark workflows to build distributed deep learning applications.
We will analyse the different frameworks for integrating Spark with TensorFlow, from Horovod to TensorFlowOnSpark to Databricks' Deep Learning Pipelines. We will also look at where you will find the bottlenecks when training models (in your frameworks, the network, GPUs, and with your data scientists) and how to get around them. We will look at how to use the Spark Estimator model to perform hyper-parameter optimization with Spark/TensorFlow and model-architecture search, where Spark executors perform experiments in parallel to automatically find good model architectures.
The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline written in a Jupyter notebook on the Hops platform. We will show how to debug the application using both the Spark UI and TensorBoard, and how to examine logs and monitor training. The demo will be run on the Hops platform, currently used by over 450 researchers and students in Sweden, as well as at companies such as Scania and Ericsson.
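The "executors perform experiments in parallel" pattern can be sketched locally with a thread pool standing in for Spark executors. The loss surface and learning-rate grid below are toy values chosen for illustration, not anything from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(lr):
    """Stand-in for training one model configuration and returning its
    validation loss. In the talk's setting, each call would run on a
    separate Spark executor instead of a local thread."""
    loss = (lr - 0.1) ** 2  # toy loss surface with a minimum near lr = 0.1
    return loss, lr

grid = [0.001, 0.01, 0.05, 0.1, 0.5, 1.0]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_experiment, grid))  # experiments in parallel
best_loss, best_lr = min(results)                   # pick the best config
```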
Spark Streaming is an extension of the core Spark API that enables continuous data stream processing. It is particularly useful when data needs to be processed in real-time. Carol McDonald, HBase Hadoop Instructor at MapR, will cover:
+ What is Spark Streaming and what is it used for?
+ How does Spark Streaming work?
+ Example code to read, process, and write the processed data
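The read/process/write cycle above rests on Spark Streaming's core idea: discretizing a continuous stream into micro-batches. A dependency-free sketch of that model (not actual Spark API code) is:

```python
def micro_batches(stream, batch_size):
    """Group an unbounded iterator into fixed-size micro-batches,
    mimicking how Spark Streaming discretizes a stream into small RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def process(batch):
    # the "transform" step: e.g. parse and filter events
    return [line.upper() for line in batch if line]

sink = []
source = iter(["a", "b", "", "c", "d"])
for batch in micro_batches(source, batch_size=2):
    sink.extend(process(batch))  # the "write" step
```

In real Spark Streaming the batching interval is time-based and the processing is distributed, but the per-batch transform-then-write shape is the same.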
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...DataWorks Summit
Using the latest advancements from TensorFlow, including the Accelerated Linear Algebra (XLA) framework, the JIT/AOT compiler, and the Graph Transform Tool, I'll demonstrate how to optimize, profile, and deploy TensorFlow models in a GPU-based production environment.
This talk contains many Spark ML and TensorFlow AI demos using PipelineIO's 100% open source Community Edition. All code and Docker images are available so you can reproduce the demos on your own CPU- or GPU-based cluster.
* Bio *
Chris Fregly is Founder and Research Engineer at PipelineIO, a streaming machine learning and artificial intelligence startup based in San Francisco. He is also an Apache Spark contributor, a Netflix Open Source committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O'Reilly video series High Performance TensorFlow in Production.
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member of the IBM Spark Technology Center in San Francisco.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison - J On The Beach
Apache Spark has rocked the big data landscape, quickly becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark's core tenets of speed, ease of use, and its unified programming model fit neatly with the high performance, scalable, and manageable characteristics of modern Java runtimes. In this talk we introduce the Spark programming model, and describe some unique Java runtime capabilities in the JIT, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. We will show how solutions, previously infeasible with regular Java programming, become possible with a high performance Spark core runtime, enabling you to solve problems smarter and faster.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark).

RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute.

Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allow both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform.

By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are able to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytics, and enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientists to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Lambda Architecture using Google Cloud plus Apps - Simon Su
This is a demo that uses Apps Script to build a Lambda dashboard. The Apps Script publishes an endpoint, and clients use Fluentd to post data to Apps Script and also to BigQuery. You can then see the real-time and batch query results in the same view.
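The serving side of the Lambda pattern described above boils down to merging a precomputed batch view with a live speed-layer view. A minimal sketch of that merge, with hypothetical page-view counts standing in for the BigQuery and streaming data:

```python
from collections import Counter

# Batch layer: counts precomputed over historical data (e.g. a BigQuery job).
batch_view = Counter({"page_a": 100, "page_b": 40})

# Speed layer: counts over events that arrived since the last batch run.
realtime_view = Counter({"page_a": 3, "page_c": 1})

def serve(key):
    """Serving layer: answer a query by merging both views, so the
    dashboard shows batch and real-time results together."""
    return batch_view[key] + realtime_view[key]
```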
WBDB 2015 Performance Evaluation of Spark SQL using BigBench - t_ivanov
In this paper we present the initial results of our work to run BigBench on Spark. First, we evaluated the data scalability behavior of the existing MapReduce implementation of BigBench. Next, we executed the group of 14 pure HiveQL queries on Spark SQL and compared the results with the respective Hive results. Our experiments show that: (1) for both MapReduce and Spark SQL, BigBench queries scale on average better than linearly as the data size increases, and (2) pure HiveQL queries run faster on Spark SQL than on Hive.
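"Better than linear scaling" here means runtime grows more slowly than data size. A small helper makes the comparison concrete; the 10x/6x numbers below are hypothetical, not from the paper.

```python
def scaling_factor(size_ratio, time_ratio):
    """Compare runtime growth to data growth. Values below 1.0 mean the
    query scales better than linearly (the paper's observation for
    BigBench); values above 1.0 mean worse than linear."""
    return time_ratio / size_ratio

# Hypothetical measurement: data grows 10x, runtime grows only 6x.
factor = scaling_factor(10.0, 6.0)  # 0.6 -> better than linear
```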
http://clds.sdsc.edu/wbdb2015.ca/program
RAPIDS – Open GPU-accelerated Data Science - Data Works MD
RAPIDS is an initiative driven by NVIDIA to accelerate the complete end-to-end data science ecosystem with GPUs. It consists of several open source projects that expose familiar interfaces, making it easy to accelerate the entire data science pipeline - from ETL and data wrangling to feature engineering, statistical modeling, machine learning, and graph analysis.
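The "familiar interfaces" point is the key design choice: cuDF mirrors the pandas API, so (ideally) only the import changes. A tiny ETL-to-feature step, shown here with pandas and assuming pandas is installed; the data is invented for illustration:

```python
import pandas as pd  # with RAPIDS installed, this could become: import cudf as pd

# A small aggregation step; the same code shape runs on the GPU with cuDF.
df = pd.DataFrame({"user": ["a", "a", "b"], "amount": [10.0, 20.0, 5.0]})
features = df.groupby("user")["amount"].mean().reset_index()
```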
Corey J. Nolet
Corey has a passion for understanding the world through the analysis of data. He is a developer on the RAPIDS open source project focused on accelerating machine learning algorithms with GPUs.
Adam Thompson
Adam Thompson is a Senior Solutions Architect at NVIDIA. With a background in signal processing, he has spent his career participating in and leading programs focused on deep learning for RF classification, data compression, high-performance computing, and managing and designing applications targeting large collection frameworks. His research interests include deep learning, high-performance computing, systems engineering, cloud architecture/integration, and statistical signal processing. He holds a Masters degree in Electrical & Computer Engineering from Georgia Tech and a Bachelors from Clemson University.
An introduction to MapReduce (MR) and Hadoop, and a view on the opportunities to use MR with databases, i.e., SQL-MapReduce by Teradata and in-database MR by Oracle.
The presentation was used during a class of Datenbanken Implementierungstechniken in 2013.
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, DataFrames, and some other high-level stuff, and can be used as an introduction to Apache Spark.
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs - Chris Fregly
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs @ Strata London, May 24 2017
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs - Advanced Spark and TensorFlow Meetup May 23 2017 @ Hotels.com London
We'll discuss how to deploy TensorFlow, Spark, and Scikit-learn models on GPUs with Kubernetes across multiple cloud providers including AWS, Google, and Azure - as well as on-premise.
In addition, we'll discuss how to optimize TensorFlow models for high-performance inference using the latest TensorFlow XLA (Accelerated Linear Algebra) framework including the JIT and AOT Compilers.
Github Repo (100% Open Source!)
https://github.com/fluxcapacitor/pipeline
http://pipeline.io
These are slides from our recent HadoopIsrael meetup, dedicated to a comparison of the Spark and Tez frameworks.
At the end of the meetup there is a small update about our ImpalaToGo project.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... - Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
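For Prometheus to scrape Spark metrics, something must expose them in the Prometheus text exposition format. A minimal stdlib sketch of that formatting step follows; the metric name and labels are hypothetical, chosen only to illustrate the format, not actual Spark sink output.

```python
def to_prometheus(metrics):
    """Render a dict of gauge metrics in the Prometheus text exposition
    format: name{label="value",...} value, one metric per line."""
    lines = []
    for name, (labels, value) in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

# Hypothetical executor metric, named for illustration only.
sample = {
    "spark_executor_memory_used_bytes": ({"executor": "1"}, 1048576),
}
exposition = to_prometheus(sample)
```

A real setup would serve this text on an HTTP `/metrics` endpoint for Prometheus to scrape, with Grafana querying Prometheus for dashboards.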
Intro to big data analytics using Microsoft Machine Learning Server with Spark - Alex Zeltov
Alex Zeltov - Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark
By combining enterprise-scale R analytics software with the power of Apache Hadoop and Apache Spark, Microsoft R Server for HDP or HDInsight gives you the scale and performance you need. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data and up to 50x faster speeds than open-source R, which helps you to train more accurate models for better predictions. R Server works with the open-source R language, so all of your R scripts run without changes.
Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of open source and proprietary worlds together.
R support is built on a legacy of Microsoft R Server 9.x and Revolution R Enterprise products. Significant machine learning and AI capabilities enhancements have been made in every release. In 9.2.1, Machine Learning Server adds support for the full data science lifecycle of your Python-based analytics.
This meetup will NOT be a data science intro or an R programming intro. It is about working with data and big data on MLS.
- How to scale R
- Work with R and Hadoop + Spark
- Demo of MLS on HDP/HDInsight server with RStudio
- How to operationalize model deployment using the MLS web service operationalization features on MLS Server or on the cloud Azure ML (PaaS) offering.

Speaker Bio:
Alex Zeltov is a Big Data Solutions Architect / Software Engineer / Programmer Analyst / Data Scientist with over 19 years of industry experience in Information Technology, most recently in Big Data and Predictive Analytics. He currently works as a Global Black Belt Technical Specialist at Microsoft, where he concentrates on Big Data and Advanced Analytics use cases. Prior to joining Microsoft, he worked as a Sr. Solutions Engineer at Hortonworks, where he specialized in the HDP and HDF platforms.
TDWI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ... - Debraj GuhaThakurta
Event: TDWI Accelerate Seattle, October 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Description: How to develop scalable and in-DB analytics using R in Spark and SQL Server
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics... - Debraj GuhaThakurta
Event: TDWI Accelerate, Seattle, Oct 16, 2017
Topic: Distributed and In-Database Analytics with R
Presenter: Debraj GuhaThakurta
Tags: R, Spark, SQL Server
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor... - Data Con LA
Apache Tez is a library to build data processing engines in Hadoop/YARN. It takes care of many common building blocks like scheduling, fault tolerance, speculation, security etc. so that the engine can focus on its core features. E.g. Apache Hive can focus on SQL optimization. There has been rapid adoption in projects like Hive, Pig, Flink, Cascading, Scalding and commercial products like Datameer and Syncsort. We will provide a brief overview of Tez and then look at new features for job monitoring in the Tez UI and performance debugging tools for Tez applications. Finally we will explore upcoming features like hybrid scheduling that open up new areas of performance and functionality.
Everything you wanted to know about Apache Tez:
-- Distributed execution framework targeted towards data-processing applications.
-- Based on expressing a computation as a dataflow graph.
-- Highly customizable to meet a broad spectrum of use cases.
-- Built on top of YARN – the resource management framework for Hadoop.
-- Open source Apache incubator project and Apache licensed.
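The "computation as a dataflow graph" idea above can be sketched with the standard library: vertices are processing steps, edges are data movements, and the runtime executes vertices in dependency order. This is a toy illustration of the model, not Tez's API; the vertex names and logic are invented.

```python
from graphlib import TopologicalSorter

# A computation expressed as a dataflow graph, Tez-style: two map vertices
# feed a join vertex, which feeds a sink. The dict maps vertex -> upstream deps.
edges = {"map1": set(), "map2": set(), "join": {"map1", "map2"}, "sink": {"join"}}

def run_vertex(name, inputs):
    """Toy vertex logic: record which upstream outputs this vertex consumed."""
    return f"{name}({','.join(sorted(inputs))})" if inputs else name

outputs = {}
for vertex in TopologicalSorter(edges).static_order():  # dependency order
    upstream = [outputs[dep] for dep in sorted(edges[vertex])]
    outputs[vertex] = run_vertex(vertex, upstream)
```

An engine built on Tez supplies real vertex logic (e.g. Hive operators) while Tez handles the scheduling, fault tolerance, and data movement between vertices.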
Deadline-aware MapReduce Job Scheduling with Dynamic Resource Availability - JAYAPRAKASH JPINFOTECH
Deadline-aware MapReduce Job Scheduling with Dynamic Resource Availability
To buy this project online, contact:
Email: jpinfotechprojects@gmail.com,
Website: https://www.jpinfotech.org
We present a software model built on the Apache software stack (ABDS) that is well used in modern cloud computing, which we enhance with HPC concepts to derive HPC-ABDS.
We discuss layers in this stack
We give examples of integrating ABDS with HPC
We discuss how to implement this in a world of multiple infrastructures and evolving software environments for users, developers and administrators
We present Cloudmesh as supporting Software-Defined Distributed System as a Service or SDDSaaS with multiple services on multiple clouds/HPC systems.
We explain the functionality of Cloudmesh as well as the 3 administrator and 3 user modes supported
Apache Ambari on ARM Server - Linaro Connect - Ganesh Raju
Apache Ambari on ARM Server - Linaro Connect. Using Apache Bigtop as an open source big data distro. Installation, deployment, configuration, and metrics monitoring of Hadoop, Spark, HBase, and Hive.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Building RAG with self-deployed Milvus vector database and Snowpark Container...Zilliz
This talk will give hands-on advice on building RAG applications with an open-source Milvus database deployed as a docker container. We will also introduce the integration of Milvus with Snowpark Container Services.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
20 Comprehensive Checklist of Designing and Developing a WebsitePixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Smart City Big Data Visualization on 96Boards - Linaro Connect Las Vegas 2016
1. Demo - Smart City Use-case
Using ODPi Hadoop, Spark, H2O and Sparkling Water
Ganesh Raju
2. ENGINEERS AND DEVICES
WORKING TOGETHER
ODPi
● Simplifies and standardizes the big data ecosystem with a common reference specification and test suites.
● Reduces cost and complexity and accelerates the development of Big Data solutions.
● Provides cross-compatibility between different distributions of Hadoop and big data technologies.
● Has two stacks: Runtime and Operations.
● V2.0 alpha release coming soon.
● Linaro is a member of ODPi.
www.odpi.org
3. Spark
● Distributed, fast, in-memory data processing engine.
● Provides development APIs to efficiently execute iterative streaming, machine learning, or SQL workloads.
● Spark was developed as an alternative approach to MapReduce, with ease of use in mind.
● Code in Java, Scala, or Python.
4. H2O Sparkling Water
● H2O is an in-memory, user-friendly machine learning API.
● Compatible with Hadoop and Spark.
● Spark + H2O = Sparkling Water.
● Sparkling Water combines the fast, scalable machine learning algorithms of H2O with the high-performance distributed processing capabilities of the Spark engine.
● Spark's RDDs and DataFrames and H2O's H2OFrame are interoperable.
● Users can use the H2O Flow UI to drive Scala / R / Python computation from Spark.
5. Demo
● Utilizes ODPi v1 based native Hadoop, Spark, H2O Sparkling Water, and H2O Flow.
● All compiled on ARM: ODPi Hadoop 2.7, Spark 1.6 with Scala 2.10 (Scala 2.11 is not supported by Sparkling Water).
● 3-node cluster running on Linaro Developer Cloud (HP Moonshot machines).
● Dataset files stored in HDFS.
● Spark uses YARN as the resource manager.
● H2O Sparkling Water uses Spark as the execution engine.
● H2O Flow uses the Spark SQL API and Scala code.
● Pipeline: .csv data -> HDFS -> Spark RDD -> H2O H2OFrame
https://wiki.linaro.org/LEG/Engineering/BigData
7. Abstract
● Various benchmarking tools.
● Types of benchmarks and standards.
● Challenges of Big Data benchmarking on ARM.
● Some of the tools we will be covering: TPC (Transaction Processing Performance Council) based TPCx-HS, TPC-DS, and TPC-H benchmarks, HiBench (TestDFSIO), Spark-Bench for Apache Spark, MRBench for MapReduce, NNBench for HDFS, etc.
8. Why Benchmarking?
● Measure performance and scale.
● Simulate higher load
○ Find bottlenecks/limits.
● Evaluate different hardware/software
○ OS, Java, VM
○ Hadoop, Spark, Pig, Hive, ...
● Validate reliability.
● Validate assumptions/configurations.
● Compare two different deployments.
● Performance tuning.
9. Challenges of Big Data Benchmarking
● System diversity
○ Variety of solutions: data read, I/O, streaming, data warehousing, machine learning
● Rapid data evolution (velocity)
● System and data scale
● System complexity
○ Multiple pipelines (layers of transformations)
10. Types of Benchmarks and Standards
● Micro benchmarks: evaluate specific lower-level system operations
○ E.g. Hadoop workload examples (sort, grep, wordcount, TeraSort, GridMix, PigMix), HiBench, HDFS DFSIO, AMPLab Big Data Benchmark
● Functional/component benchmarks: specific to a low-level function
○ E.g. basic SQL queries (select, join, etc.)
○ Synthetic benchmarks
● Application-level benchmarks
○ BigBench
○ Spark-Bench
11. Benchmark Efforts - Microbenchmarks

| Benchmark | Workloads | Software Stacks | Metrics |
| --- | --- | --- | --- |
| HiBench | Sort, WordCount, TeraSort, PageRank, K-means, Bayes classification, Index | Hadoop and Hive | Execution time, throughput, resource utilization |
| DFSIO | Generate, read, write, append, and remove data for MapReduce jobs | Hadoop | Execution time, throughput |
| AMPLab benchmark | Part of CALDA workloads (scan, aggregate and join) and PageRank | Hive, Tez | Execution time |
12. Benchmark Efforts - TPC

| Benchmark | Workloads | Software Stacks | Metrics |
| --- | --- | --- | --- |
| TPCx-HS | HSGen, HSDataCheck, HSSort and HSValidate | Hadoop | Performance, price and energy |
| TPC-H | Data warehousing operations | Hive, Pig | Execution time, throughput |
| TPC-DS | Decision support benchmark: data loading, queries and maintenance | Hive, Pig | Execution time, throughput |
13. Benchmark Efforts - Synthetic

| Benchmark | Workloads | Software Stacks | Metrics |
| --- | --- | --- | --- |
| SWIM | Synthetic user-generated MapReduce jobs of reading, writing, shuffling and sorting | Hadoop | Multiple metrics |
| GridMix | Synthetic and basic operations to stress test job scheduler and compression/decompression | Hadoop | Memory, execution time, throughput |
| PigMix | 17 Pig-specific queries | Hadoop, Pig | Execution time |
| MRBench | MapReduce benchmark as a complement to TeraSort; data warehouse operations with 22 TPC-H queries | Hadoop | Execution time |
| NNBench and NNBenchWithoutMR | Load testing NameNode and HDFS I/O with small payloads | Hadoop | I/O |
| SparkBench | CPU-, memory-, shuffle- and I/O-intensive workloads; machine learning, streaming, graph computation and SQL workloads | Spark | Execution time, data process rate |
| BigBench | Interactive-based queries based on synthetic data | Hadoop, Spark | Execution time |
14. Benchmark Efforts

| Benchmark | Workloads | Software Stacks | Metrics |
| --- | --- | --- | --- |
| BigDataBench | 1. Micro benchmarks (sort, grep, WordCount); 2. Search engine workloads (index, PageRank); 3. Social network workloads (connected components (CC), K-means and BFS); 4. E-commerce site workloads (relational database queries (select, aggregate and join), collaborative filtering (CF) and Naive Bayes); 5. Multimedia analytics workloads (speech recognition, ray tracing, image segmentation, face detection); 6. Bioinformatics workloads | Hadoop, DBMSs, NoSQL systems, Hive, Impala, HBase, MPI, Libc, and other real-time analytics systems | Throughput, memory, CPU (MIPS, MPKI - misses per kilo-instruction) |
15. Hadoop Benchmark and Test Tools
● The Hadoop distribution comes with a number of benchmarks.
● TestDFSIO, NNBench, and MRBench are in hadoop-*test*.jar.
● TeraGen, TeraSort, and TeraValidate are in hadoop-*examples*.jar.
● You can check with the commands:
$ cd /usr/local/hadoop
$ bin/hadoop jar hadoop-*test*.jar
$ bin/hadoop jar hadoop-*examples*.jar
● While running the benchmarks you may want to use the time command, which measures the elapsed time. This saves you the hassle of navigating to the Hadoop JobTracker web interface. The relevant metric is the real value in the first row.
$ time hadoop jar hadoop-*examples*.jar ...
[...]
real 9m15.510s
user 0m7.075s
sys 0m0.584s
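To connect that output to a number you can plot, the real line can be converted to plain seconds. This is a small illustrative helper (not part of any Hadoop tool), assuming the usual `XmY.YYYs` format of the shell's time builtin:

```python
import re

def real_time_to_seconds(line):
    """Convert a `time` output line like 'real 9m15.510s' into seconds."""
    m = re.search(r"(?:(\d+)m)?([\d.]+)s", line)
    minutes = int(m.group(1) or 0)
    seconds = float(m.group(2))
    return minutes * 60 + seconds

print(real_time_to_seconds("real 9m15.510s"))  # 555.51
```

So the run shown above took roughly 555 seconds of wall-clock time.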
16. TeraGen, TeraSort and TeraValidate
● This is the most well-known Hadoop benchmark.
● The goal of TeraSort is to sort the data as fast as possible.
● This test suite exercises both the HDFS and MapReduce layers of a Hadoop cluster.
● The TeraSort benchmark consists of 3 steps:
○ Generate input via TeraGen
○ Run TeraSort on the input data
○ Validate the sorted output data via TeraValidate
https://wiki.linaro.org/LEG/Engineering/BigData/HadoopBuildInstallAndRunGuide
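The three steps above are typically driven with the stock examples jar. As a minimal sketch (the jar name pattern, HDFS paths, and row count here are placeholders, not values from the deck):

```python
# Sketch: assemble the three TeraSort phases as hadoop CLI invocations.
# Adjust the jar glob and HDFS base directory for your installation.

def terasort_commands(rows, jar="hadoop-*examples*.jar",
                      base="/benchmarks/terasort"):
    gen_dir = f"{base}/input"
    sort_dir = f"{base}/output"
    report_dir = f"{base}/report"
    return [
        ["hadoop", "jar", jar, "teragen", str(rows), gen_dir],      # 1. generate
        ["hadoop", "jar", jar, "terasort", gen_dir, sort_dir],      # 2. sort
        ["hadoop", "jar", jar, "teravalidate", sort_dir, report_dir],  # 3. validate
    ]

for cmd in terasort_commands(10_000_000):
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run` on a cluster node; wrapping each call with the time measurement from the previous slide gives per-phase elapsed times.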
17. HiBench
● Contains 9 typical Hadoop and Spark workloads (including micro benchmarks, HDFS benchmarks, web search benchmarks, machine learning benchmarks using Mahout, and data analytics benchmarks).
● Sort, WordCount, TeraSort, TestDFSIO, Nutch indexing (search indexing using the Nutch engine), PageRank (an implementation of Google's web page ranking algorithm), hivebench.
● Uses zlib compression for input and output.
● Metrics: time (sec) and throughput (bytes/sec); memory, partitions, parallelism.
● Cons: lack of AArch64 bits, lack of documentation.
https://wiki.linaro.org/LEG/Engineering/BigData/HiBench
18. TestDFSIO
● Part of hadoop-mapreduce-client-jobclient.jar.
● Stress tests I/O performance (throughput and latency) on a clustered setup.
● This test will shake out the hardware, OS, and Hadoop setup of your cluster machines (NameNode/DataNode).
● The tests run as a MapReduce job using a 1:1 mapping (1 map per file).
● Helpful for discovering performance bottlenecks in your network.
● Benchmark a write test followed by a read test.
● Use -write for write tests and -read for read tests.
● Results are stored in TestDFSIO_results.log; use -resFile to choose a different file name.
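TestDFSIO reports both a cluster-wide throughput and an average per-task IO rate, and the two can differ noticeably. The sketch below shows the usual way these two numbers are derived from per-file task results (the exact formulas live in the Hadoop source; the task values here are made up):

```python
# Sketch: two TestDFSIO-style summary numbers from per-task (MB, seconds) pairs.

def dfsio_summary(tasks):
    total_mb = sum(mb for mb, _ in tasks)
    total_sec = sum(sec for _, sec in tasks)
    rates = [mb / sec for mb, sec in tasks]
    return {
        # cluster-wide throughput: total data over total task time
        "throughput_mb_per_sec": total_mb / total_sec,
        # mean of the individual per-task IO rates
        "avg_io_rate_mb_per_sec": sum(rates) / len(rates),
    }

print(dfsio_summary([(1024, 60.0), (1024, 80.0), (1024, 70.0)]))
```

When comparing two cluster configurations, compare like with like: the throughput figure weights slow tasks more heavily than the average IO rate does.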
19. Hive Testbench
● Based on the TPC-H and TPC-DS benchmarks.
● Experiment with Apache Hive at any data scale.
● Contains a data generator and a set of queries.
● Tests basic Hive performance on large data sets.
https://wiki.linaro.org/LEG/Engineering/BigData/HiveTestBench
20. MRBench: MapReduce Benchmark
● Loops a small job a number of times.
● Checks whether small job runs are responsive and running efficiently on your cluster.
● Puts the focus on the MapReduce layer, as its impact on the HDFS layer is very limited.
● The multiple parallel MRBench issue is resolved, hence you can run it from different boxes.
● Test command to run 50 small test jobs:
$ hadoop jar hadoop-*test*.jar mrbench -numRuns 50
● Exemplary output, which means the job finished in about 31 seconds:
DataLines Maps Reduces AvgTime (milliseconds)
1 2 1 31414
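To see where the "31 seconds" comes from, the AvgTime column is in milliseconds. A hypothetical one-line parser (not part of MRBench) makes the conversion explicit:

```python
# Sketch: read the AvgTime column (milliseconds) from an MRBench output row
# and convert it to seconds. Row values taken from the slide above.

def mrbench_avg_seconds(row):
    # row format: DataLines Maps Reduces AvgTime(milliseconds)
    data_lines, maps, reduces, avg_ms = (int(x) for x in row.split())
    return avg_ms / 1000.0

print(mrbench_avg_seconds("1 2 1 31414"))  # 31.414
```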
21. NNBench and NNBenchWithoutMR
● Load tests the NameNode through continuous read, write, rename and delete operations on small files.
● Stress tests HDFS (I/O).
● To increase stress, multiple instances of NNBenchWithoutMR can be run simultaneously from several machines, or increase the map tasks for NNBench.
● All write tests run first, followed by the read tests.
● Test command: the command below runs a NameNode benchmark that creates 1000 files using 12 maps and 6 reducers.
$ hadoop jar hadoop-*test*.jar nnbench -operation create_write \
    -maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 \
    -replicationFactorPerFile 3 -readFileAfterOpen true \
    -baseDir /benchmarks/NNBench-`hostname -s`
22. TPC Benchmarks
● TPCx-HS - https://wiki.linaro.org/LEG/Engineering/BigData/TPCxHS
○ Currently facing problems with cluster shell configuration
● TPC-H
○ The TPC-H benchmark focuses on ad-hoc queries
● TPC-DS
○ "The" standard benchmark for decision support
● TPC-C
○ An on-line transaction processing (OLTP) benchmark
23. TPCx-HS Benchmark
x: Express, H: Hadoop, S: Sort
The TPCx-HS kit contains:
● TPCx-HS specification documentation
● TPCx-HS user's guide documentation
● Scripts to run the benchmark
● Java code to execute the benchmark load
TPCx-HS execution:
● A valid run consists of 5 separate phases run sequentially with no overlap in their execution
● The benchmark test consists of 2 runs; the run with the lower TPCx-HS Performance Metric is the reported one
● No configuration or tuning changes or reboot are allowed between the two runs
24. TPC vs SPEC Models
TPC model:
● Specification based
● Performance, price, energy in one benchmark
● End-to-end
● Multiple tests (ACID, load)
● Independent review
● Full disclosure
● TPC Technology Conference
SPEC model:
● Kit based
● Performance and energy in separate benchmarks
● Server centric
● Single test
● Summary disclosure
● SPEC Research Group, ICPE
25. BigBench
● BigBench is a joint effort with partners in industry and academia to create a comprehensive and standardized Big Data benchmark.
● BigBench builds upon and borrows elements from existing benchmarking efforts (such as TPCx-HS, GridMix, PigMix, HiBench, Big Data Benchmark, YCSB and TPC-DS).
● BigBench is a specification-based benchmark with an open-source reference implementation kit.
● As a specification-based benchmark, it is technology-agnostic and provides the necessary formalism and flexibility to support multiple implementations.
● Focused on execution time calculation.
● Consists of 30 queries/workloads (10 of them are from TPC).
● Drawback: it is structured-data-intensive.
26. Spark-Bench for Apache Spark
● Build on ARM works.
● FAIL: when the Spark-Bench examples are run, a KILL signal is observed which terminates all workers.
● This is still under investigation, as there are no useful logs to debug; the absence of a proper error description and the lack of documentation is a challenge.
● A ticket has already been filed on the Spark-Bench git repository and is unresolved.
● Con: lack of documentation.
27. GridMix
● Mix of synthetic MapReduce jobs (sorting text data and SequenceFiles).
● Evaluates MapReduce and HDFS performance.
● The input file needs to be in JSON format.
● Jobs can be either LOADJOB (a trace of history logs generated with Rumen) or SLEEPJOB (a synthetic job where each task does *nothing* but sleep for a certain duration).
● Jobs can be run in STRESS, REPLAY or SERIAL mode.
● You can emulate the number of users, the number of job queries and resource usage (CPU, memory, JVM heap).
● Basic command-line usage (provided as part of the hadoop command):
$ hadoop gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
● Con: challenging to explore the performance impact of combining or separating workloads, e.g., through consolidating from many clusters.
28. PigMix
● PigMix is a set of queries used to test Apache Pig performance.
● There are queries that test latency (how long does it take to run this query?).
● And queries that test scalability (how many fields or records can Pig handle before it fails?).
● Usage: run the commands below from the Pig home directory:
$ ant -Dharness.hadoop.home=$HADOOP_HOME pigmix-deploy (generate the test dataset)
$ ant -Dharness.hadoop.home=$HADOOP_HOME pigmix (run the PigMix benchmark)
29. SWIM (Statistical Workload Injector for MapReduce)
● Enables rigorous performance measurement of MapReduce systems.
● Contains suites of workloads of thousands of jobs, with complex data, arrival, and computation patterns.
● Informs highly targeted, workload-specific optimizations.
● Highly recommended for MapReduce operators.
● Performance measurement:
https://github.com/SWIMProjectUCB/SWIM/wiki/Performance-measurement-by-executing-synthetic-or-historical-workloads
30. AMPLab
● The Big Data Benchmark from AMPLab, UC Berkeley, provides quantitative and qualitative comparisons of five systems:
○ Redshift - a hosted MPP database offered by Amazon.com based on the ParAccel warehouse
○ Hive - a Hadoop-based data warehousing system
○ Shark - a Hive-compatible SQL engine which runs on top of the Spark computing framework
○ Impala - a Hive-compatible* SQL engine with its own MPP-like execution engine
○ Stinger/Tez - Tez is a next-generation Hadoop execution engine used by Hive
● This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDFs, across different data sizes.
31. BigDataBench
● BigDataBench is a benchmark suite for scale-out workloads, different from SPEC CPU (sequential workloads) and PARSEC (multithreaded workloads).
● Currently it simulates five typical and important big data applications: search engine, social network, e-commerce, multimedia data analytics, and bioinformatics.
● Includes 15 real-world data sets and 34 big data workloads.