The document discusses capacity planning and performance tuning for Hadoop big data systems. It begins with an agenda that covers why capacity planners need to prepare for Hadoop, an overview of the Hadoop ecosystem, capacity planning and performance tuning of Hadoop, getting started, and the importance of measurement. The document then discusses various components of the Hadoop ecosystem and provides guidance on analyzing different types of workloads and components.
Operating multi-tenant clusters requires careful planning of capacity for on-time launch of big data projects and applications within the expected budget and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operating big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques, and methodology applied on a per-project or per-user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm, chosen for the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and for cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share the considerations incorporated to arrive at the most appropriate calculations across these three primary deployments. We will discuss the data sources for the calculations, the resource drivers for different use cases, and how to plan optimum capacity allocation per project with respect to given standard hardware configurations.
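To make this concrete, here is a minimal sketch of the kind of per-project calculation such an estimation tool can perform against a standard hardware configuration. The node specs, resource drivers, and all numbers below are illustrative assumptions, not the tool demoed in the talk:

```python
# Hypothetical per-project capacity estimate against a standard hardware
# configuration; every figure here is an illustrative assumption.
import math

STANDARD_DATA_NODE = {"cores": 24, "ram_gb": 128, "disk_tb": 36}

def nodes_for_project(raw_data_tb, replication=3, temp_overhead=0.25,
                      peak_containers=200, container_gb=4):
    """Node count satisfying both the storage and the memory driver."""
    # Storage driver: replicated data plus scratch space for shuffle/temp.
    storage_tb = raw_data_tb * replication * (1 + temp_overhead)
    nodes_storage = math.ceil(storage_tb / STANDARD_DATA_NODE["disk_tb"])
    # Memory driver: host the project's peak concurrent YARN containers.
    nodes_memory = math.ceil(peak_containers * container_gb
                             / STANDARD_DATA_NODE["ram_gb"])
    return max(nodes_storage, nodes_memory)

print(nodes_for_project(raw_data_tb=100))  # 11 nodes: storage-bound here
```

The estimate is the maximum across the per-resource drivers, which is the usual shape of such calculators: whichever driver dominates for a given use case dictates the node count.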
John Sing's Edge 2013 presentation, detailing when, where, and how external storage products and/or system software (i.e. GPFS) can be effectively used in a Hadoop storage environment. Many Hadoop situations absolutely require direct-attached storage; however, there are many situations where shared external storage may make sense in a Hadoop environment. This presentation details how, why, and where, and promotes taking an intelligent, Hadoop-aware approach to deciding between internal storage and external shared storage. Full awareness of Hadoop considerations is essential to selecting either internal or external shared storage in a Hadoop environment.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
This talk gives an introduction to Hadoop 2 and YARN, then explains the changes in MapReduce 2. Finally, Tez and Spark are explained and compared in detail.
The talk was held at the Parallel 2014 conference in Karlsruhe, Germany, on 06.05.2014.
Agenda:
- Introduction to Hadoop 2
- MapReduce 2
- Tez, Hive & Stinger Initiative
- Spark
Top Hadoop Big Data Interview Questions and Answers for Freshers - JanBask Training
Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements brought to Hive by the Stinger initiative.
MapR-DB is an enterprise-grade, high-performance, in-Hadoop NoSQL ("Not Only SQL") database management system. It is used to add real-time, operational analytics capabilities to Hadoop and now natively supports JSON.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark delivers lightning-fast results, and show how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Apache Hadoop, since its humble beginning as an execution engine for web crawling and building search indexes, has matured into a general-purpose distributed application platform and data store. Large Scale Machine Learning (LSML) techniques and algorithms have proved quite tricky for Hadoop to handle, ever since we started offering Hadoop as a service at Yahoo in 2006. In this talk, I will discuss early experiments of implementing LSML algorithms on Hadoop at Yahoo. I will describe how this changed Hadoop and led to a generalization of the Hadoop platform to accommodate programming paradigms other than MapReduce. I will unveil some of our recent efforts to incorporate diverse LSML runtimes into Hadoop, evolving it to become *THE* LSML platform. I will also make a case for an industry-standard LSML benchmark, based on common deep analytics pipelines that utilize LSML workloads.
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez) - Sudhir Mallem
Interactive SQL POC on Hadoop (Hive, Presto, and Hive-on-Tez) using the storage formats Parquet, ORC, RCFile, and Avro, with Snappy, zlib, and default (gzip) compression.
Scaling Spark Workloads on YARN - Boulder/Denver July 2015 - Mac Moore
Hortonworks presentation at the Boulder/Denver Big Data Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN, covering Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, and tuning.
Cost savings and expert system advice with athene ES/1 - Metron
athene® ES/1 provides analysis of current and recent system performance activity. It identifies problems as they are reported and uses expert-system techniques to recommend the courses of action required to restore service levels. Severity-level reporting and tuning hints enable attention to be focused where it is needed. Detailed drill-down facilities are available to analyze problems, plot trends, and report on the most important metrics of all z/OS subsystems.
Hive - Apache Hadoop Big Data training by Design Pathshala
Learn Hadoop and Big Data analytics: join Design Pathshala training programs on big data and analytics.
This slide deck covers advanced knowledge about Apache Hive.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Hadoop Basics - Apache Hadoop Big Data training by Design Pathshala
Learn Hadoop and Big Data analytics: join Design Pathshala training programs on big data and analytics.
This slide deck covers the basics of Hadoop and Big Data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Recommendation and graph algorithms in Hadoop and SQL - David Gleich
A talk I gave at ancestry.com on Hadoop, SQL, recommendation, and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.
Structuring Spark: DataFrames, Datasets, and Streaming - Databricks
As Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Streaming DataFrames/Datasets. Datasets provide an evolution of the RDD API by allowing users to express computation as type-safe lambda functions on domain objects, while still leveraging the powerful optimizations supplied by the Catalyst optimizer and Tungsten execution engine. I will describe the high-level concepts as well as dive into the details of the internal code generation that enables us to provide good performance automatically. Streaming DataFrames/Datasets let developers seamlessly turn their existing structured pipelines into real-time incremental processing engines. I will demonstrate this new API's capabilities and discuss future directions, including easy sessionization and event-time-based windowing.
Bigdata Hadoop project in the payment gateway domain - Kamal A
Live Hadoop project in the payment gateway domain for people seeking real-time work experience in big data. Email: Onlinetraining2011@gmail.com
Skypeid: onlinetraining2011
My profile: www.linkedin.com/pub/kamal-a/65/2b2/2b5
Ansible for Drupal infrastructure and deployments - Jeff Geerling
Let's talk Ansible!
Drupal 8 uses YAML. Ansible uses YAML.
Drupal 8 makes it easy to build awesome websites. Ansible makes it easy to build awesome infrastructure.
Let's get together and discuss how you can use (and are already using) Ansible for your infrastructure, continuous integration, deployments, etc., with a focus on things like:
- Ansible for local development environments (e.g. Drupal VM, Vlad)
- Ansible on a cluster of Raspberry Pis (seriously! I'm bringing the Dramble with me)
- Ansible for provisioning and managing hundreds of cloud servers.
Jeff Geerling will be leading the BoF, but hopefully we'll end up with a good discussion about how Ansible can help you solve some pain points in infrastructure management, deployments, and more!
From DrupalCon LA BoF on Ansible and Drupal: https://events.drupal.org/losangeles2015/bofs/ansible-drupal-infrastructure-and-deployments
A talk given at JCConf 2015 on 2015/12/05.
In programming, "immutable objects" are an important design pattern. Likewise, in the virtualization and cloud era, "immutable infrastructure" has become the new state of the art. With the right resources and processes in place, it greatly reduces system complexity and significantly improves stability.
Starting from the underlying concepts and adding some practical implementation advice, this talk gives the audience enough information to evaluate the benefits of this architecture.
Video: https://youtu.be/9j008nd6-A4
DevOps for Humans - Ansible for Drupal Deployment Victory! - Jeff Geerling
Everyone knows it's a Good Idea™ to use a configuration management system (e.g. Puppet, Chef) to manage your Drupal infrastructure. But many people (myself included) have run into a wall of #wtfmoments when trying to learn the vagaries of traditional CM systems and their vendor-specific syntaxes.
In 2012, Ansible was released, enabling normal human beings to manage their servers with an easy, but powerful, CM system that uses YAML (just like Drupal 8!) to define configuration and Jinja2 (very much like Twig!) for templates. Not only that, but Ansible is also an incredibly simple and very flexible Drupal deployment and continuous delivery tool.
Learn how you can use Ansible to manage your infrastructure—including local development environments—and stop letting servers and deployments get in the way of development.
Ataas2016 - Big Data, Hadoop and MapReduce: new-age tools to aid testing and QA - Agile Testing Alliance
Big Data, with its slew of technologies and terms, has been the most talked-about area in the last couple of years. It has evolved into big data science and analytics, and now into IoT and automation. Testers and QA teams need not only to get used to this new-age digital transformation area but also to embrace the technology to their own advantage. We experimented with and successfully used the big data technologies Hadoop and MapReduce in a recent testing engagement. The actual application was implemented using classic technologies like CentOS and C++; the testing team implemented Hadoop and MapReduce to enable quick turnaround for the testing.
How to build and run a big data platform in the 21st century - Ali Dasdan
This tutorial was presented at the IEEE Big Data Conference in 2019. It shows that building and running a big data platform for both real-time streaming and batch data processing, for all kinds of applications involving analytics, data science, reporting, and the like, can today be as easy as following a checklist. We live in a fortunate time: many of the components needed are already available in open source or as a service from commercial vendors. The tutorial shows how to put these components together at multiple levels of sophistication, covering the spectrum from a basic reporting need to a full-fledged operation across geographically distributed regions with business-continuity measures in place. It provides enough information and checklists that it can also serve as a go-to reference while actually building and running such a platform.
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc... - Sarah Aerni
Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"
As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos) along with their benefits and limitations by sharing examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?
Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. In Biology with a specialization in Bioinformatics and minor in French Literature from UCSD, and an M.S. and Ph.D in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare building models to derive insight and business value from their data.
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ... - MLconf
Building a Recommender System for Publications using Vector Space Model and Python: In recent years, it has become very common to have access to a large number of publications on similar or related topics. Recommendation systems for publications are needed to locate appropriate published articles among a large number of publications on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users' search profiles. The first type, content-based recommendation, can efficiently search for material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user. The content-based recommendation system uses a Vector Space Model to rank PubMed articles based on the similarity of content items. To implement the second recommendation mechanism, we use Python libraries and frameworks: we find the profile similarity of users and recommend additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommendation systems and discuss the implementation of this PubMed recommendation system with examples.
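As a rough illustration of the content-based scoring step described in this abstract, here is a minimal vector-space sketch using TF-IDF and cosine similarity from scikit-learn. The abstracts and query are made-up placeholders, not PubMed data, and this is not the talk's actual implementation:

```python
# Vector Space Model sketch: rank documents by cosine similarity of
# TF-IDF vectors to an input publication. Toy data, illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "gene expression profiling in breast cancer",
    "deep learning for medical image segmentation",
    "breast cancer risk prediction from gene variants",
]
query = ["gene variants and breast cancer risk"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(abstracts)   # documents -> TF-IDF space
query_vector = vectorizer.transform(query)          # same vocabulary

# Rank articles by cosine similarity to the input publication.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranked = scores.argsort()[::-1]
print([(int(i), round(float(scores[i]), 2)) for i in ranked])
```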
This talk will cover, via live demo & code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll incrementally build a hybrid machine learned model for fraud detection, combining features from natural language processing, topic modeling, time series analysis, link analysis, heuristic rules & anomaly detection. We’ll be looking for fraud signals in public email datasets, using Python & popular open-source libraries for data science and Apache Spark as the compute engine for scalable parallel processing.
General overview of the Big Data Concept.
Presentation of the Hierarchical Linear Subspace Indexing Method to perform exact similarity search in high dimensional data
Data Engineer's Lunch #85: Designing a Modern Data Stack - Anant Corporation
What are the design considerations that go into architecting a modern data warehouse? This presentation will cover some of the requirements analysis, design decisions, and execution challenges of building a modern data lake/data warehouse.
On a business level, everyone wants to get hold of the business value and other organizational advantages that big data has to offer. Analytics has arisen as the primary path to business value from big data. Hadoop is not just a storage platform for big data; it's also a computational and processing platform for business analytics. Hadoop is, however, unsuccessful in fulfilling business requirements when it comes to live data streaming: the initial architecture of Apache Hadoop did not solve the problem of live stream data mining. In summary, the traditional assumption that big data equals Hadoop is false, and focus needs to be given to business value as well. Data warehousing, Hadoop, and stream processing complement each other very well. In this paper, we review a few frameworks and products that enable real-time data streaming by providing modifications to Hadoop.
How to add Artificial Intelligence Capabilities to Existing Software Platforms - Harish Nalagandla
Artificial Intelligence is real and over the next few years, will significantly change the world as we know it. Even though there is some hype around this technology, many companies have put this to practical use, helping businesses delight their customers, increase engagement on their Platforms, and ultimately delivering positive financial results. Many companies have already invested into IT Platforms and it is not practical to invest in bringing up an Artificial Intelligence Platform in parallel. So, the more practical approach is to add Artificial Intelligence capabilities to existing Platforms.
In this keynote session, Harish Nalagandla, Director of Engineering, Enterprise Services Platform, PayPal, will discuss the pillars that form the foundation of Artificial Intelligence and will describe an approach towards how companies with existing systems can add Artificial Intelligence capabilities to their existing Platforms. This session will present a high level blueprint to help guide organizations in their own journey towards adoption of Artificial Intelligence technologies as they make progress towards digital transformation. This is a high level presentation targeted towards technology executives.
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop - Hazelcast
In this webinar
This talk identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next generation computing paradigms, such as in-memory data/compute grids. The focus of the talk is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations; these goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps reduce duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
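As a rough illustration of the first technique above, here is a minimal PageRank sketch that skips rank updates for vertices that have already converged. It is a toy heuristic, not the STICD implementation: it assumes no dangling vertices and never re-activates a vertex once it is marked converged, even if its neighbours later drift.

```python
# PageRank with per-vertex convergence skipping (toy sketch).
def pagerank(graph, damping=0.85, tol=1e-8, max_iters=100):
    """graph: dict mapping vertex -> list of out-neighbours (no dangling)."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    # Reverse adjacency: who links to me.
    in_links = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            in_links[v].append(u)
    converged = set()
    for _ in range(max_iters):
        new_rank = {}
        for v in graph:
            if v in converged:        # skip the work for settled vertices
                new_rank[v] = rank[v]
                continue
            contrib = sum(rank[u] / len(graph[u]) for u in in_links[v])
            new_rank[v] = (1 - damping) / n + damping * contrib
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:       # all vertices settled: stop early
            break
    return rank

print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))
```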
Capacity Management and BigData/Hadoop - Hitchhiker's guide for the Capacity Planner
1. moviri.com
Hitchhiker’s guide for the Capacity Planner
Connecticut Computer Measurement Group
Cromwell CT – April 2015
Renato Bonomini renato.bonomini@moviri.com
Capacity Management and BigData
2. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
3. Brought to you by…
Renato Bonomini, Lead of US operations for Moviri, @renatobonomini
Mattia Berlusconi, Capacity Management Consultant
Giulia Rumi, Capacity Management Analyst
4. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
5. Handling large amounts of data? High Performance Computing?
Is it new? Where does it come from? Why do I have to listen to this?
Cray 1, 80 MFLOPS, 1975 [A bunch of engineers on a field trip in Silicon Valley, Renato]
IBM 350, 3.56 Mb, 1956 [Wikipedia]
6. The need for Analytics: the new "machine revolution"
"When will computer hardware match the human brain?"
Hans Moravec, Robotics Institute, Carnegie Mellon University
http://www.transhumanist.com/volume1/moravec.htm
7. 1964: Isaac Asimov on the 2014 World's Fair
"The world of A.D. 2014 will have few routine jobs that cannot be done better by some machine than by any human being. Mankind will therefore have become largely a race of machine tenders."
"When will computer hardware match the human brain?"
February 2015: http://reuvengorsht.com/2015/02/07/machines-replace-middle-management
8. "Divide et Impera" / "Map and Reduce"
• Julius Caesar arrives in Alexandria after defeating the Egyptian army and enters the Ancient Library
• Surprise: there are millions of copies in the library; how many of those are in Latin?
• Caesar arranges a Centuria (80 soldiers) to each inspect a batch of books and report to their Centurion the number of pages written in Latin for their book
• The Centurion writes on a tabula the count from each soldier; when finished, he sums the parts up
All I need to know I learned from Rome
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
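The Centuria story has exactly the shape of the canonical WordCount example of MapReduce: soldiers map, the Centurion reduces. A toy, single-process sketch of the two phases (illustrative only, not Hadoop code):

```python
# Single-process WordCount in the map/reduce shape.
from collections import defaultdict
from itertools import chain

def map_phase(book):
    # Each "soldier" scans one book and emits (key, 1) pairs.
    for word in book.split():
        yield (word, 1)

def reduce_phase(pairs):
    # The "Centurion" sums the parts for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

books = ["salve salve civis", "veni vidi vici", "salve vici"]
mapped = chain.from_iterable(map_phase(b) for b in books)
print(reduce_phase(mapped))
# {'salve': 3, 'civis': 1, 'veni': 1, 'vidi': 1, 'vici': 2}
```

In real Hadoop the map outputs are shuffled across the network so that all values for a key land on the same reducer; here the grouping happens in one dictionary.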
9. Message Passing Interface / "Map and Reduce"
Wow, so "Map and Reduce" was a revolution? In one sense; which one?
MPI tutorial, Blaise Barney, Lawrence Livermore National Laboratory (C, Fortran)
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat (Java, Python)
10. What are the revolutions brought by MapReduce and BigData?
1. MapReduce makes these technologies available to a wide audience. We saw that MPI already handled similar use cases, but it was restricted mostly to university research and large R&D facilities.
2. Reliability and commodity hardware are at its base.
3. It moves the needle on how to handle large amounts of data: a database organizes first, then loads; Hadoop loads first, then organizes.
11. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
12. The most famous open-source implementation of a MapReduce framework is Apache Hadoop
● "Hardware" contains libraries and utilities, stores data, and supports job execution
● HDFS is the fault-tolerant, replicated distributed file system
● YARN (Yet Another Resource Negotiator) supports several programming models that can co-exist in the cluster; MapReduce is only one of them
● The Application layer is composed of several frameworks, among which Pig and Hive are the most used
Hadoop workflow (a toy sketch of this write path follows the table below):
● clients break data into small chunks to be loaded onto different data nodes
● for each data block, the client contacts the namenode, which answers with a sorted list of 3 data nodes (every block is replicated on more than one machine)
● the client writes the block directly to the first datanode, which replicates the data onto the two other nodes
Optimization Techniques within the Hadoop Eco-system: a Survey, Giulia Rumi, Claudia Colella, Danilo Ardagna
LAYERS              | HADOOP 1.X           | HADOOP 2.X
Users               |                      |
Application layer   | Hive/Pig             | Hive/Pig
Programming Models  | Hadoop 1.X MapReduce | MapReduce
Resource Management | (MapReduce)          | YARN
File system         | HDFS                 | HDFS
Hardware            |                      |
(In Hadoop 1.X, MapReduce covered both the programming model and resource management; in 2.X, YARN takes over resource management.)
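A toy sketch of the write workflow described above: a "namenode" hands out three datanodes per block, and the block is then pipelined to each replica. The least-loaded placement policy and the 128 MB block size are simplifying assumptions; real HDFS placement is rack-aware.

```python
# Toy model of HDFS block placement: client asks the namenode for 3
# replica targets per block. Real HDFS is rack-aware; this is not.
import heapq

class MiniNameNode:
    def __init__(self, datanodes):
        self.used = {dn: 0 for dn in datanodes}   # blocks held per datanode

    def allocate(self, block_id, replication=3):
        # Pick the least-loaded datanodes for this block's replicas.
        targets = heapq.nsmallest(replication, self.used, key=self.used.get)
        for dn in targets:
            self.used[dn] += 1
        return targets

nn = MiniNameNode(["dn1", "dn2", "dn3", "dn4"])
file_size = 300 * 2**20                    # a 300 MB file
block_size = 128 * 2**20                   # common 128 MB block size
n_blocks = -(-file_size // block_size)     # ceiling division -> 3 blocks
for b in range(n_blocks):
    print(f"block {b} -> replicas on {nn.allocate(b)}")
```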
14. Geek Fun
"A DBA walks into a NOSQL bar, but turns and leaves because he couldn't find a table" (webtonull)
15. We are going to focus on a few specific "animals" of this zoo
● HDFS (Hadoop Distributed File System) is where the Hadoop cluster stores data
● YARN is the architectural center of Hadoop that allows multiple data processing engines
● MapReduce is a programming paradigm
● Hive provides a warehouse structure and SQL-like access for data in HDFS
● Pig is a high-level data-flow language
● HBase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS
• Apache Spark is an open-source big data real-time processing framework
• ZooKeeper is an open-source Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster
• Apache Cassandra is an open-source distributed database management system designed to handle large amounts of data across many commodity servers
• Solr is an open-source enterprise search platform from the Apache Lucene project. It provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling
16. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
17. Why do you need to get on board soon?
"We'll start the new Hadoop cluster with 500 TB and then we'll see how much we need" (real conversation at a customer)
There are significant resources and areas of improvement:
● Significant investments are being directed towards these initiatives
● They are complex and large, with hundreds of configuration parameters: a little help from an experienced capacity planner can save a lot of money
18. Role of the Capacity Planner and Performance Analyst
● Shouldn't the 'Hadoop user/owner' take care of this? Distributed machine learning is still an active research topic, related to both machine learning and systems. While Hadoop users don't develop systems, they need to know how to choose systems. An important fact is that existing distributed systems and parallel frameworks are not particularly designed for machine learning algorithms.
● Hadoop users can: help to affect how systems are designed; design new algorithms for existing systems
20. Current performance tuning opportunities: Scheduling
● Scheduling is one of the most important tasks in a multi-concurrent-task system: see the research from our colleague Giulia (and others), "Optimization Techniques within the Hadoop Eco-system: a Survey" [DOI: 10.1109/SYNASC.2014.65]
● It illustrates the typical optimization problems:
  - data locality
  - sticky-slot problems
  - poor system utilization because of suboptimal distribution of tasks
  - unbalanced jobs
  - starvation and even fairness (be fair to your users)
● There are hundreds of configuration variables available to the end user: rule-of-thumb vs. optimal configuration can make a big difference
21. How are other configuration opportunities being pursued?
● Other initiatives:
  - Starfish, http://www.cs.duke.edu/starfish/index.html [Towards Automatic Optimization of MapReduce Programs, Shivnath Babu, Duke University, Durham, North Carolina, USA, shivnath@cs.duke.edu]
  - Research from Dominique A. Heger of DHT [Workload Dependent Hadoop MapReduce Application Performance Modeling]
● The common result of most research initiatives is that one size does not fit all. For classic MapReduce, for example, there is not a single behavior: you have to know your workload characterization.
"Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment"
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
22. Profiling your workload
● You want to know what the limiting factor of each workload is
● Examples are:
  - CPU performance
  - Disk I/O
  - Memory (bandwidth and latency)
  - Network (bandwidth, delay, packet loss)
  - Storage space
● This is nothing new for the wise Capacity Planner!
Courtesy of Intel
23. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
24. Hadoop is a "zoo" of several different applications
Different points of view for analysis:
• Interest in fast response for "interactive workloads": CPU, memory, network, and I/O utilization levels to respond to queries in a quick and effective way
• Interest in high throughput for "batch workloads": maximize the utilization levels; response time is not the concern
• Interest in storage capacity: understand and plan the file system and HDFS
Different types of workload:
• Most companies are simply using Hadoop to store information (HDFS) for big data sets
• Vendors incorporate many other components: HDFS, Hive, Spark, Solr, Flume, etc.
• For example, there are significant differences between Hadoop and HBase workloads:
  - Hadoop MapReduce is a framework to process large sets of data using distributed and parallel algorithms
  - HBase is much better for real-time read/write/modify access to tabular data
25. Get your feet wet!
3 standard types of analyses: we'll check what's underneath each component to file them under 3 simple analyses we are all friends with:
a. interactive workloads > you are interested in a good response time
b. batch workloads > you are interested in maximizing utilization, optimal concurrency, and the best volume/duration ratio
c. storage > used/free space
For each component, let's make a summary of:
• how it works, so that we can focus on the type of workload
• what the bottlenecks could be, in the order we usually find them
• which technique, (a), (b), or (c), could apply
• what similar 'traditional' technology could be used as an analogy
26. Online vs streaming vs batch – frame the problem as you already know
http://www.hadoop360.com/blog/batch-vs-real-time-data-processing
27. HDFS: Hadoop Distributed File System
What it is:
• where the Hadoop cluster stores data; functions include storage of the file metadata, overseeing the health of datanodes, and coordinating access to data
• 2 main components:
  - NameNode: the master of HDFS, memory- and I/O-intensive
  - DataNode: manages the storage attached to the nodes
• HDFS is an append-only file system; it does not allow data modification
How to get started:
• HDFS is a write-once, read-many (or WORM-ish) filesystem: you can only append to a file, so it keeps growing and growing!
• NameNode: monitor the disk space available to the NameNode (local or remote when diversified storage is used for resilience, as recommended)
• DataNode: I/O is important; disk space is another dimension
28. HDFS: Hadoop Distributed File System / 2
Bottlenecks:
• Disk I/O (volume of IOPS and response time)
• Network bandwidth
• Storage space [you need about 4x the raw size of the data you will store in HDFS; however, on average we have seen compression ratios of up to 10-20x for the text files stored in HDFS, so the actual raw disk space required is only about 30-50% of the original uncompressed size; see the sketch below]
Capacity analysis approach: (a), (b), or (c)
Similar technology: at a high level, manage it as any logical storage device
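A back-of-the-envelope sketch of the storage bullet above, taking the slide's rules of thumb at face value (about 4x raw size for replication plus overhead, 10-20x compression on text). The numbers are the slide's estimates, not fixed laws:

```python
# HDFS raw-disk sizing from the slide's rules of thumb.
def hdfs_raw_disk_needed(uncompressed_tb, compression_ratio=10.0,
                         replication_overhead=4.0):
    """Compress first, then pay ~4x for 3-way replication plus headroom."""
    return uncompressed_tb / compression_ratio * replication_overhead

raw_tb = 100.0
for ratio in (10.0, 20.0):
    need = hdfs_raw_disk_needed(raw_tb, compression_ratio=ratio)
    pct = 100.0 * need / raw_tb
    print(f"{raw_tb:.0f} TB of text at {ratio:.0f}x compression -> "
          f"{need:.0f} TB raw disk ({pct:.0f}% of original)")
# The slide's 30-50% band corresponds to the lower end of the
# compression range; at 20x compression you land below it.
```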
29. YARN: Yet Another Resource Negotiator
What it is:
• YARN is the architectural center of Hadoop; it allows multiple data processing engines, such as interactive SQL, real-time streaming, data science, and batch processing, to handle data stored in a single platform
How to get started:
• Bottlenecks: for every component, disk I/O and network; for the NodeManager (slave), CPU
• Capacity analysis approach: (b)
• Similar technology: job scheduler
30. Map & Reduce
What it is:
• Remember: it is a programming paradigm, not a standalone application. It mainly consists of two phases:
  - In the Map phase, the main work is reading data blocks and splitting them into Map tasks for parallel processing; the results are temporarily stored in memory and on disk
  - In the Reduce phase, the work is concentrating the output for the same key in the same Reduce task, processing it, and emitting the final result
How to get started:
• Bottlenecks: JVM memory metrics; very much workload-dependent! You have to profile your application
31. Pig & Hive
What they are:
• Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive's query language, HiveQL, compiles to MapReduce
• Pig is a high-level language for writing queries over large datasets. A query planner compiles queries written in this language (called "Pig Latin") into maps and reduces, which are then executed on a Hadoop cluster. Pig's main features are ease of programming, optimization opportunities, customization, and extensibility
How to get started:
• Possible bottlenecks: memory, disk I/O, network
• Capacity analysis approach: (a) or (b)
• Similar technology: data warehouse
32. HBase
What it is:
• HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets
• HBase runs directly on top of HDFS
• It scales linearly by requiring all tables to have a primary key. The key space is divided into sequential blocks that are then allotted to a region. RegionServers own one or more regions, so the load is spread uniformly across the cluster. HBase can further subdivide a region by splitting it automatically, so that manual data sharding is not necessary
How to get started:
• Possible bottlenecks: memory (be careful of swapping, JVM memory metrics, and GC: GC pauses longer than 60 seconds can cause a RegionServer to go offline), disk I/O (in case data is spooled to disk), network (latency)
• Capacity analysis approach: (a)
• Similar technology: distributed DBMS
33. Agenda
● Why as Capacity Planners do we need to be prepared?
● The Hadoop ecosystem - or, better said, the zoo
● Capacity Planning and Performance Tuning of Hadoop
● How to get started
● Measure, measure, measure
34. It sounds all good so far, but which metrics do I need? Laundry list – generic metrics
• CPU: utilization (user/sys/wio), load
• Memory: utilization; used (cached, user, sys); swap in/out
• Disk I/O: read/write ops rate; read/write byte rate
• Network: sent/received packets and bits
• Garbage Collection: collection counts and times; overhead (percentage of time spent in GC) is very important (see the sketch below)
• Heap memory: size, used; used after GC (much more valuable, since you can correlate it with workload); Perm Gen/Code Cache/Eden Space 'used'; PS Old/Perm Gen 'used'; Tenured Gen 'used'; PS Eden/Survivor/PS Survivor Space 'used'
• JVM threads: count, daemon count
• JVM files: open/max open files
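A sketch of the GC "overhead" metric called out above: the percentage of wall-clock time a JVM spends in garbage collection between two samples. The field names are illustrative; real values would come from JMX or jstat on the JVM in question:

```python
# GC overhead between two monitoring samples (illustrative field names).
def gc_overhead_pct(sample_prev, sample_curr):
    """Each sample: {'ts': seconds, 'gc_time_ms': cumulative GC millis}."""
    wall_ms = (sample_curr["ts"] - sample_prev["ts"]) * 1000.0
    gc_ms = sample_curr["gc_time_ms"] - sample_prev["gc_time_ms"]
    return 100.0 * gc_ms / wall_ms

prev = {"ts": 0, "gc_time_ms": 1_200}
curr = {"ts": 60, "gc_time_ms": 4_800}
print(f"{gc_overhead_pct(prev, curr):.1f}% of the interval spent in GC")  # 6.0%
```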
35. It sounds all good so far, but which metrics do I need? Laundry list – specific metrics
• HDFS NameNode: storage (total and used capacity); files created/total/deleted
• HDFS DataNode: fs bytes read/written; fs reads/writes from local/remote clients; "map reduce blocks": volume read/written/removed/replicated/verified; "map reduce block operations": copy/read/replace/write, average time/volume
• YARN ResourceManager: active/decommissioned/unhealthy NodeManagers; active applications/users; applications submitted, completed, failed, killed; applications pending, running; containers allocated/released/pending/reserved
• HBase: requests (total/read/write); memory store size and upper limit; flush queue length; compaction queue length
• ZooKeeper: sent/received packets; request latency; outstanding requests; JVM pool size
• Solr: request rate/latency; JVM pool size; added docs rate; query result cache size, hit %, response time; document cache size
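Many of these metrics can be scraped from the Hadoop daemons' built-in JMX-over-HTTP endpoint (/jmx). A minimal sketch for NameNode capacity follows; the host, the port (50070 is the classic NameNode web UI port), and the bean name should be checked against your distribution:

```python
# Pull HDFS capacity from the NameNode's /jmx endpoint (host, port and
# bean name are assumptions to verify against your cluster).
import json
from urllib.request import urlopen

def namenode_capacity(host="namenode.example.com", port=50070):
    url = (f"http://{host}:{port}/jmx"
           "?qry=Hadoop:service=NameNode,name=FSNamesystem")
    beans = json.load(urlopen(url))["beans"]
    fs = beans[0]                     # the FSNamesystem bean
    used_pct = 100.0 * fs["CapacityUsed"] / fs["CapacityTotal"]
    return fs["CapacityTotal"], fs["CapacityUsed"], used_pct

total, used, pct = namenode_capacity()
print(f"HDFS used: {used / 2**40:.1f} of {total / 2**40:.1f} TiB ({pct:.1f}%)")
```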
36. Headquarters
Via Schiaffino 11C
20158 Milan MI
Italy
T +39-024951-7001
USA East
One Boston Place, Floor 26
Boston, MA 02108
USA
T +1-617-936-0212
USA West
425 Broadway Street
Redwood City, CA 94063
USA
T +1-650-226-4274
moviri.com
37. Spark
● Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.
● Features:
  - everything is in memory
  - data is held in memory as resilient distributed datasets (RDDs)
  - best for cyclic jobs (performance up to 100x better than Hadoop MapReduce on cyclic jobs)
● Possible bottlenecks: memory; network and disk I/O (remote/local files); CPU
● Capacity analysis approach: (a) or (b), depending on the workload
● Similar technology: similar to the generic Hadoop MapReduce case
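A minimal PySpark sketch of why "everything is in memory" helps cyclic jobs: a cached RDD is reused across iterations instead of being re-read from HDFS each pass. The path and the toy thresholding loop are placeholders:

```python
# Iterative reuse of a cached RDD (toy sketch; path is a placeholder).
from pyspark import SparkContext

sc = SparkContext(appName="cyclic-job-sketch")

# Load once, keep the working set in memory across iterations.
points = sc.textFile("hdfs:///data/points.txt").map(float).cache()

for i in range(5):
    threshold = 10.0 * i
    # Each pass reuses the cached RDD instead of re-reading from HDFS.
    above = points.filter(lambda x, t=threshold: x > t).count()
    print(f"values above {threshold}: {above}")

sc.stop()
```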
38. ZooKeeper
● Applications can leverage these services to coordinate distributed processing across large clusters. A very large Hadoop cluster can be supported by multiple ZooKeeper servers.
● Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information. Often network and memory problems manifest themselves first in ZooKeeper.
● Possible bottlenecks: CPU wio; memory (JVM) latency, with GC pauses longer than 60 seconds able to cause a RegionServer to go offline; network (latency)
● Capacity analysis approach: (a)
● Similar technology: in-memory database
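A sketch of the coordination pattern the slide describes, using the kazoo Python client (an assumption; the slide does not prescribe a client library). The hosts, paths, and payload are placeholders:

```python
# Cluster-membership coordination via ZooKeeper (kazoo client assumed).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# Advertise this worker: an ephemeral node vanishes if the session dies,
# and a sequence node gets a unique monotonically increasing suffix.
zk.ensure_path("/app/workers")
zk.create("/app/workers/worker-", b"host=worker-7",
          ephemeral=True, sequence=True)

# React to membership changes; the callback fires on every join/leave.
@zk.ChildrenWatch("/app/workers")
def on_members_change(children):
    print("live workers:", sorted(children))

# ... do work; call zk.stop() on shutdown, which also removes the
# ephemeral node and releases the watch.
```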
39. Cassandra
● Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients
● Possible bottlenecks: memory; disk I/O; network
● Capacity analysis approach: (a) and (c)
● Similar technology: distributed DBMS
40. Solr
● Solr is highly scalable: it provides distributed search and index replication. It is one of the most popular enterprise search platforms
● Possible bottlenecks: memory (at the JVM level); CPU; disk I/O
● Capacity analysis approach: (a)
● Similar technology: distributed DBMS
Editor's Notes
Abstract
Hadoop is a zoo of different types of workloads; even if most companies are simply using Hadoop to store information (HDFS), there are many other applications, to name a few: HDFS, Hive, Pig, Impala, Spark, Solr, Flume.
Each animal in this zoo behaves differently; for example, there are significant differences between the two most common workloads, "MapReduce" and "HBase".
This leads to three main points of view for analysis to make sure service levels are achieved:
Interest in response time for "interactive workloads": CPU, memory, network, and I/O utilization levels to respond to queries in a quick and effective way
Interest in high throughput for "batch workloads": maximize the utilization levels, not interested in response time
Interest in planning storage capacity (filesystem and HDFS)
This talk focuses on providing guidelines for the capacity planner to understand how to translate existing techniques and frameworks and adapt them to these new technologies: in most cases, "what's old is new again".
Renato Bonomini, Lead of US operations for Moviri: an engineer at heart and by training at the Politecnico of Milan; my specialties are Digital Signal Processing and, in a previous life, High Performance Computing. Now I help companies achieve alignment between Business and IT using optimization techniques, Capacity Management, and Performance Management.
Mattia Berlusconi holds a Degree in IT Engineering from the Politecnico of Milan. In 2012 he joined the Consulting Department of Moviri, working in the Capacity Management Business Unit as an IT Performance Optimization consultant. He participates in national and international projects focused on designing and implementing capacity management solutions that allow customers to effectively manage the capacity of their on-premises and cloud IT environments. He likes photography and hiking in the mountains.
Giulia Rumi is a member of the Moviri Capacity Management Team. Giulia holds a MS degree in Computer Engineering from Politecnico of Milan with a thesis work focused on energy consumption in mobile devices. She joined Moviri straight out of college in 2015. She plays piano and likes sweets and comics.
Cray 1: 80 MFLOPS
Neptuny/Moviri field trip at the computer museum in San Jose, in front of the Cray 1
IBM 350: 3.56 Mb
Today: iPhone 5s GPU: 76.8 GFLOPS, 16 GB of storage
Applications: Predictive Analytics, Machine Learning
A few applications or analytics that revolutionized the way we live
- Moneyball
- Recommendation engines
analytics: the “new machine revolution”
http://www.transhumanist.com/volume1/moravec.htm “When will computer hardware match the human brain?”
In 1964, Isaac Asimov, wrote about a visit to the World’s Fair of 2014:
“The world of A.D. 2014 will have few routine jobs that cannot be done better by some machine than by any human being. Mankind will therefore have become largely a race of machine tenders.” http://reuvengorsht.com/2015/02/07/machines-replace-middle-management/
Bad news for humankind: it has already happened; think of Uber and other companies delegating middle management to algorithms
http://reuvengorsht.com/2015/02/07/machines-replace-middle-management
Early example of Map Reduce process – or late example of divide et impera principle?
Compare Map and Reduce functions between the Caesar example and the WordCount example typical of M&R
Compare MPI's Broadcast, Gather, Scatter, and Reduce with Map & Reduce's Map and Reduce.
A huge difference is how modern and enterprise-ready M&R is: ask a young developer to code in F77 (not the only choice, but a common one for MPI) or even in C, and compare that to the ease of development in, for example, Java.
3 important points of why M&R made a difference that MPI could not popularize
the MapReduce principle http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
what is hadoop in the latest version http://www.slideshare.net/cloudera/introduction-to-yarn-and-mapreduce-2
diagram to illustrate architecture ‘hadoop architecture’ [Giulia’s paper and http://wiki.apache.org/hadoop/PoweredByYarn ]
HDFS -> YARN -> {MR2, Impala, Spark, Hbase, MPI, hive, pig}
What we are focusing on: top list (HDFS, YARN, MapReduce, Hive/Pig, Hbase)
HDFS (Hadoop distributed filesystem) is where Hadoop cluster stores data
YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform
MapReduce is a programming paradigm
Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3).
Pig A high-level data-flow language and execution framework for parallel computation.
Hbase is an open-source, distributed, versioned, column-oriented store that sits on top of HDFS.
Other interesting ones: Solr, Spark, ZooKeeper, Impala, Cassandra
Spark: Apache Spark is an open-source big data real-time processing framework built around speed, ease of use, and sophisticated analytics
ZooKeeper: an open-source Apache project that provides a centralized infrastructure and services that enable synchronization across a cluster. It maintains data like configuration information, hierarchical naming space, and so on.
Apache Cassandra: an open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Solr: an open-source enterprise search platform from the Apache Lucene project. It provides full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling.
Why do you need to get on board soon?
Conversation heard at a company: "we'll start the new Hadoop cluster with 500 TB and then we'll see how much we need". There is a significant amount of resources ($$$) involved in these infrastructures.
As a capacity planner, don’t miss the boat!
Role of the Capacity Planner and Performance Analyst
Shouldn’t the ‘hadoop user/owner’ take care of this? Distributed machine learning is still an active research topic, It is related to both machine learning and systems
While hadoop users don’t develop systems, they need to know how to choose systems. An important fact is that existing distributed systems or parallel frameworks are not particularly designed for machine learning algorithms; hadoop users can
help to affect how systems are designed
design new algorithms for existing systems
What's available as guidelines: http://blog.cloudera.com/blog/2010/08/hadoophbase-capacity-planning and http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster
starting point planner: http://hortonworks.com/cluster-sizing-guide/
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/ch_hardware-recommendations.html
CURRENT PERFORMANCE ISSUES - research being developed; see
“Optimization Techniques within the Hadoop Eco-system: A Survey”, DOI: 10.1109/SYNASC.2014.65
scheduling performance: scheduling is one of the most important tasks in a multi-concurrent-task system; see the paper from our colleague Giulia (and others), “Optimization Techniques within the Hadoop Eco-system: A Survey”
this shows the typical optimization problems:
data locality
sticky slots problems
poor system utilization because of suboptimal distribution of tasks
unbalanced jobs
starvation and fairness issues (be fair to your users)
others: pushing the envelope with existing initiatives:
Starfish: http://www.cs.duke.edu/starfish/index.html [“Towards Automatic Optimization of MapReduce Programs”, Shivnath Babu, Duke University, Durham, North Carolina, USA, shivnath@cs.duke.edu]
documents from DHT: “Workload Dependent Hadoop MapReduce Application Performance Modeling”, Dominique A. Heger, www.cmg.org/wp-content/uploads/2013/07/m_101_61.pdf
“One size does not fit all”: even for classic MapReduce there is not a single behaviour
you have to know your workload characterization
“Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment.”
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.7/bk_cluster-planning-guide/content/typical-workloads.html
3 standard types of analyses: we’ll check what’s underneath each component to file it under one of 3 simple analyses we are all friends with:
(a) interactive workload > you are interested in a good response time
(b) batch workloads > you are interested in maximizing utilization, optimal concurrency and best volume/duration ratio
(c) storage > used/free space
for each component, let’s try to make a summary of
how they work so that we can focus on the type of workload
what the bottlenecks could be, in the order we usually find them
what technique (a) (b) or (c) could apply
what similar ‘traditional’ technology could be used as analogy
HDFS has 2 main components
NameNode
it is the master of HDFS that directs the DataNode daemons to perform the low-level I/O tasks
it was a single point of failure for HDFS in Hadoop 1 (MR1); from v2 this is mitigated by HDFS High Availability (an active/standby NameNode pair); as an alternative, MapR has developed a "distributed NameNode," where the HDFS metadata is distributed across the cluster in "Containers"
it maps the blocks onto the datanodes
The function of the NameNode is memory and I/O intensive.
Memory hungry!
Monitor the JVM heap size (see the sketch below)
Monitor the disk space available to the NameNode (local or remote when diversified storage is used for resilience as recommended)
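A minimal sketch of how such monitoring could be automated, assuming the NameNode web UI on the default port 50070 (the hostname is hypothetical): Hadoop daemons expose their JMX beans as JSON under /jmx, including the JVM heap figures.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class NameNodeHeapCheck {
  public static void main(String[] args) throws Exception {
    // Query only the java.lang:type=Memory bean; the response is JSON
    // containing HeapMemoryUsage (used/committed/max), ready to be
    // parsed and fed into your capacity database
    URL url = new URL(
        "http://namenode.example.com:50070/jmx?qry=java.lang:type=Memory");
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(url.openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}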
DataNode
usually there is a DataNode for each node in the cluster
it manages storage attached to the nodes
IO is important
disk space another dimension
HDFS is an append-only file system; it does not allow data modification
HDFS is a write-once, read-many (WORM-ish) filesystem: once a file is created, the filesystem API only allows you to append to the file, not to overwrite it >> it keeps growing and growing! (see the sketch at the end of this section)
possible bottlenecks:
Disk IO (volume of IOps and response time)
Network bandwidth
Storage
Capacity analysis approach: (a),(b) or (c)
Similar technology: high level, manage it as any logical storage device
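To make the write-once/append-only contract concrete, here is a minimal sketch against the HDFS FileSystem API (the path is made up, and on older releases append must be enabled via dfs.support.append):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWormExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/worm-demo.log");  // hypothetical path

    // A first writer creates the file...
    try (FSDataOutputStream out = fs.create(p)) {
      out.writeBytes("first record\n");
    }
    // ...later writers can only append to it; there is no API to
    // overwrite bytes in place, so the file only grows
    try (FSDataOutputStream out = fs.append(p)) {
      out.writeBytes("appended record\n");
    }
    fs.close();
  }
}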
YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing.
YARN’s original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
a global ResourceManager
a per-application ApplicationMaster
a per-node slave NodeManager
a per-application Container running on a NodeManager
The ResourceManager and the NodeManager form the new generic system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The ApplicationMaster is a framework-specific entity that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a scheduler, which is responsible for allocating resources to the various applications running in the cluster, according to constraints such as queue capacities and user limits. The scheduler schedules based on the resource requirements of each application.

Each ApplicationMaster is responsible for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. From the system perspective, the ApplicationMaster runs as a normal container. The NodeManager is the per-machine slave, responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager. The NodeManager and the DataNode run together on the same machine.
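For the capacity planner, the ResourceManager can also be queried programmatically. A minimal sketch using the YarnClient API to pull cluster-wide numbers (this is generic API usage, not one of our estimation tools):

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnClusterMetrics;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterMetricsProbe {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());  // reads yarn-site.xml from the classpath
    yarn.start();

    // Cluster-wide view from the ResourceManager
    YarnClusterMetrics metrics = yarn.getYarnClusterMetrics();
    System.out.println("NodeManagers: " + metrics.getNumNodeManagers());

    // One report per application known to the ResourceManager
    List<ApplicationReport> apps = yarn.getApplications();
    System.out.println("Applications: " + apps.size());

    yarn.stop();
  }
}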
Hive provides a warehouse structure and SQL-like access for data in HDFS and other Hadoop input sources (e.g. Amazon S3). Hive’s query language, HiveQL, compiles to MapReduce. It also allows user-defined functions (UDFs). Hive is widely used, and has itself become a "sub-platform" in the Hadoop ecosystem. It is best suited for batch jobs over large sets of append-only data; Apache Hive automatically manages the compilation, optimization, and execution of a HiveQL statement.
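A minimal sketch of that access path through the HiveServer2 JDBC driver; host, table and column names are invented for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // Default HiveServer2 port is 10000; the hostname is hypothetical
    try (Connection conn = DriverManager.getConnection(
            "jdbc:hive2://hiveserver.example.com:10000/default", "user", "");
         Statement stmt = conn.createStatement();
         // This HiveQL aggregate is compiled into MapReduce jobs by Hive
         ResultSet rs = stmt.executeQuery(
            "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}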
Pig is a high-level language for writing queries over large datasets. A query planner compiles queries written in this language (called "Pig Latin") into maps and reduces, which are then executed on a Hadoop cluster. Pig’s main features are ease of programming, optimization opportunities, customization, and extensibility. It abstracts the procedural style of MapReduce in the direction of the declarative style of SQL. A Pig program generally goes through three steps: load, transform, and store. First, the data the program works on is loaded (in Hadoop, the objects are stored in HDFS); then a set of transformations is applied to the loaded data, with the mappers and reducers handled transparently to the user; finally, if needed, the results are stored in a local file or in HDFS.
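The same load-transform-store pattern sketched through Pig’s embedded Java API (PigServer); the input path and field names are hypothetical:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigPipeline {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    // load: read records from HDFS
    pig.registerQuery(
        "logs = LOAD '/data/weblogs' AS (user:chararray, bytes:long);");
    // transform: grouping and aggregation; the mappers and reducers
    // are generated by Pig's query planner, not written by hand
    pig.registerQuery("byUser = GROUP logs BY user;");
    pig.registerQuery(
        "totals = FOREACH byUser GENERATE group, SUM(logs.bytes);");
    // store: write the result back to HDFS
    pig.store("totals", "/data/bytes_per_user");
    pig.shutdown();
  }
}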
possible bottlenecks:
Memory
Disk IO
Network
Capacity analysis approach: (a) or (b)
Similar technology: data warehouse
HBase is column-based rather than row-based, which enables high-speed execution of operations performed over similar values across massive data sets, e.g. read/write operations that involve all rows but only a small subset of all columns. HBase runs directly on top of HDFS.
HBase scales linearly by requiring all tables to have a primary key. The key space is divided into sequential blocks that are then allotted to a region. RegionServers own one or more regions, so the load is spread uniformly across the cluster. If the keys within a region are frequently accessed, HBase can further subdivide the region by splitting it automatically, so that manual data sharding is not necessary.
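A minimal sketch of that column-oriented access pattern with the HBase 1.x client API; table, column family and qualifier names are hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnScanExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
            ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("metrics"))) {
      // Touch every row, but ship only one column over the wire:
      // family "d", qualifier "cpu"
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("cpu"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(
              Bytes.toString(r.getRow()) + " = " + Bytes.toString(r.value()));
        }
      }
    }
  }
}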
possible bottlenecks:
Memory (be careful of swapping) - it is recommended to discourage swapping on HBase nodes and to enable GC logging to look for large GC pauses in the log; GC pauses longer than 60 seconds can cause a RegionServer to go offline
Disk IO (in case data is spooled to disk)
Network (latency)
Capacity analysis approach: (a)
Similar technology: distributed DBMS
Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators, and you can use it interactively to query data within the shell. It is a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing. Developers can use these capabilities stand-alone or combine them to run in a single data pipeline use case.
everything is in memory
data is held in memory as Resilient Distributed Datasets (RDDs), partitioned across the cluster
best for cyclic/iterative jobs: performance up to 100 times better than classic Hadoop MapReduce (see the sketch below)
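A minimal sketch in Spark’s Java API of why in-memory caching pays off for cyclic jobs (the application name and input path are hypothetical):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeExample {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("iterative-demo"));

    // cache() keeps the RDD partitions in memory after the first pass
    JavaRDD<String> lines = sc.textFile("hdfs:///data/events").cache();

    // Every later pass reuses the in-memory partitions instead of
    // re-reading the input from HDFS, which is where the speedup
    // over classic MapReduce comes from on iterative workloads
    for (int i = 0; i < 10; i++) {
      long errors = lines.filter(s -> s.contains("ERROR")).count();
      System.out.println("pass " + i + ": " + errors);
    }
    sc.stop();
  }
}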
possible bottlenecks:
Memory
Network + Disk IO (remote/local files)
CPU
Capacity analysis approach: (a) or (b) depending on the workload
Similar technology: similar to Hadoop MapReduce generic case