This document summarizes a presentation on benchmarking Hadoop and big data systems. It gives an overview of common Hadoop micro-benchmarks, including TestDFSIO, TeraSort, NNBench and MRBench, which each stress individual Hadoop components. It also describes BigBench, a benchmark based on TPC-DS that aims to cover a more complete big data analytics workload using MapReduce, Hive and Mahout across structured, semi-structured and unstructured data. The presentation recommends using Hadoop distributions for cluster administration, and using both micro-benchmarks and end-to-end benchmarks such as BigBench for evaluation.
Hadoop & Big Data benchmarking

1. Hadoop & Big Data benchmarking
Dr. ir. ing. Bart Vandewoestyne
Sizing Servers Lab, Howest, Kortrijk
IWT TETRA User Group Meeting, November 28, 2014
1 / 62
5. Intro: Hadoop essentials: Hadoop 1.0
MapReduce and HDFS are the core components; the other components are built around this core.
(Source: Apache Hadoop YARN: moving beyond MapReduce and batch processing with Apache Hadoop 2, Hortonworks, 2014)
6. Intro: Hadoop essentials: Hadoop 2.0
YARN adds a more general interface to run non-MapReduce jobs within the Hadoop framework.
(Source: Apache Hadoop YARN: moving beyond MapReduce and batch processing with Apache Hadoop 2, Hortonworks, 2014)
21. Benchmarks: Why benchmark?
My three reasons for using benchmarks:
1. Evaluating the effect of a hardware/software upgrade:
   OS, Java VM, ...
   Hadoop, Cloudera CDH, Pig, Hive, Impala, ...
2. Debugging: compare with other clusters or with published results.
3. Performance tuning: e.g. the Cloudera CDH default configuration.
25. Micro Benchmarks: TestDFSIO
Read and write test for HDFS. Helpful for:
- getting an idea of how fast your cluster is in terms of I/O,
- stress testing HDFS,
- discovering network performance bottlenecks,
- shaking out the hardware, OS and Hadoop setup of your cluster machines (particularly the NameNode and the DataNodes).
27. Micro Benchmarks: TestDFSIO: write test
Write 10 files of size 1 GB for a total of 10 GB:

$ hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000

TestDFSIO is designed to use 1 map task per file.
30. Micro Benchmarks: TestDFSIO: write test output
Typical output of a write test:

----- TestDFSIO ----- : write
           Date & time: Mon Oct 06 10:21:28 CEST 2014
       Number of files: 10
Total MBytes processed: 10000.0
     Throughput mb/sec: 12.874702111579893
Average IO rate mb/sec: 13.013071060180664
 IO rate std deviation: 1.4416050051562712
    Test exec time sec: 114.346
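The "Throughput mb/sec" figure is a per-task number: TestDFSIO divides the total MB by the summed task times. Since the 10 map tasks write concurrently, the wall-clock throughput of the cluster during the run is higher. A quick sanity check on the numbers above (a sketch of the arithmetic, not part of the TestDFSIO output):

```shell
# Wall-clock write throughput is roughly total MB / test exec time,
# which is higher than the per-task "Throughput mb/sec" figure because
# the 10 map tasks run concurrently:
awk -v mb=10000 -v t=114.346 \
    'BEGIN { printf "wall-clock write throughput: %.1f MB/s\n", mb / t }'
```

The gap between this and 10 × 12.9 MB/s reflects job startup overhead and the fact that not all tasks run perfectly in parallel.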
37. Micro Benchmarks: TestDFSIO: read test
Read 10 files, each of size 1 GB:

$ hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
38. Micro Benchmarks: TestDFSIO: read test output
Typical output of a read test:

----- TestDFSIO ----- : read
           Date & time: Mon Oct 06 10:56:15 CEST 2014
       Number of files: 10
Total MBytes processed: 10000.0
     Throughput mb/sec: 402.4306813151435
Average IO rate mb/sec: 492.8257751464844
 IO rate std deviation: 196.51233829270575
    Test exec time sec: 33.206
39. Micro Benchmarks: Influence of HDFS replication factor
When interpreting TestDFSIO results, keep in mind that the HDFS replication factor plays an important role: a higher replication factor leads to slower writes.
For three identical TestDFSIO write runs (units are MB/s):

    HDFS replication factor      1          2         3
    Throughput                 190         25        13
    Average IO rate            190 ± 10    25 ± 3    13 ± 1
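One way to see why writes slow down: with replication factor r, every block is pipelined through r DataNodes, so the cluster physically stores r times the benchmark's nominal data volume. A small sketch of the disk traffic involved (the `-D dfs.replication` flag in the comment is an assumption; whether TestDFSIO picks it up depends on the Hadoop version):

```shell
# A 10 GB TestDFSIO write stores r * 10 GB of raw data on disk:
for r in 1 2 3; do
  awk -v r="$r" 'BEGIN { printf "replication %d: %d GB written to disk\n", r, r * 10 }'
done

# Rerunning the write test with a different replication factor might look like
# this (treat the -D flag as a sketch; support varies by Hadoop version):
#   hadoop jar hadoop-*test*.jar TestDFSIO -D dfs.replication=1 \
#       -write -nrFiles 10 -fileSize 1000
```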
40. Micro Benchmarks: TeraSort
Goal: sort 1 TB of data (or any other amount of data) as fast as possible.
- Probably the most well-known Hadoop benchmark.
- Combines testing the HDFS and MapReduce layers of a Hadoop cluster.
Typical areas where TeraSort is helpful:
- Iron out your Hadoop configuration.
49. Micro Benchmarks: NNBench
Goal: load test the NameNode hardware and software.
- Generates a lot of HDFS-related requests with normally very small payloads.
- Purpose: put a high HDFS management stress on the NameNode.
- Can simulate requests for creating, reading, renaming and deleting files.
52. Micro Benchmarks: NNBench: example
Create and write 1000 files using 12 maps and 6 reducers:

$ hadoop jar hadoop-*test*.jar nnbench \
    -operation create_write \
    -maps 12 \
    -reduces 6 \
    -blockSize 1 \
    -bytesToWrite 0 \
    -numberOfFiles 1000 \
    -replicationFactorPerFile 3 \
    -readFileAfterOpen true \
    -baseDir /user/bart/NNBench-`hostname -s`
53. Micro Benchmarks: MRBench
Goal: loop a small job a number of times.
- Checks whether small job runs are responsive and running efficiently on the cluster.
- Complementary to TeraSort: it puts its focus on the MapReduce layer, while the impact on the HDFS layer is very limited.
54. Micro Benchmarks: MRBench: example
Run a loop of 50 small test jobs:

$ hadoop jar hadoop-*test*.jar mrbench \
    -baseDir /user/bart/MRBench \
    -numRuns 50

Example output:

DataLines  Maps  Reduces  AvgTime (milliseconds)
        1     2        1                   28822

→ the average finish time of the executed jobs was 28 seconds.
59. BigBench
- Big Data benchmark based on TPC-DS.
- Focus is mostly on MapReduce engines.
- Collaboration between industry and academia.
- https://github.com/intel-hadoop/Big-Bench/
History:
- Launched at the First Workshop on Big Data Benchmarking (May 8-9, 2012).
- Full kit released at the Fifth Workshop on Big Data Benchmarking (August 5-6, 2014).
60. BigBench data model
Source: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics, Ghazal et al., 2013.
61. BigBench: Data Model, the 3 V's
Variety:
- BigBench data is structured, semi-structured and unstructured.
Velocity:
- Periodic refreshes for all data.
- A different velocity for each area (V_structured, V_semi-structured, V_unstructured).
Volume:
- TPC-DS: discrete scale factors (100, 300, 1000, 3000, 10000, 30000 and 100000).
- BigBench: continuous scale factor.
80. Conclusions
- Use Hadoop distributions!
- For Hadoop cluster administration: Cloudera Manager.
- Combine micro-benchmarks with BigBench; they complement each other.
- Your best benchmark is your own application!