Operating multi-tenant clusters requires careful capacity planning for the on-time launch of big data projects and applications, within the expected budget and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operating big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques, and methodology applied on a per-project or per-user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm, chosen for the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes, respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and for cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share the considerations incorporated to arrive at the most appropriate calculations for these three primary deployments. We will discuss the data sources for the calculations, the resource drivers for different use cases, and how to plan for optimum capacity allocation per project with respect to given standard hardware configurations.
Hadoop Summit San Jose 2014: Costing Your Big Data Operations (Sumeet Singh)
As organizations begin to make use of large data sets, approaches to understanding and managing the true costs of big data become an important facet of operations with increasing scale.
Whether an on-premises or cloud-based platform is used for storing, processing, and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run big data operations with their own P&L, full transparency in costs, and metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm, chosen for the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes, respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
Scale 12x: Efficient Multi-tenant Hadoop 2 Workloads with YARN (David Kaiser)
Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources. YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing. This talk will discuss the new design and components of Hadoop 2, and examples of Modern Data Architectures that leverage Hadoop for maximum business efficiency.
YARN - Hadoop Next Generation Compute Platform (Bikas Saha)
The presentation emphasizes the new mental model of YARN as the cluster OS, where one can write and run different applications in Hadoop in a cooperative multi-tenant cluster.
Capacity Management and BigData/Hadoop - Hitchhiker's Guide for the Capacity ... (Renato Bonomini)
Hadoop is a zoo of different types of workloads; even if most companies are simply using Hadoop to store information (HDFS), there are many other applications, to name a few: HDFS, Hive, Pig, Impala, Spark, Solr, Flume.
Each animal in this zoo behaves differently; for example, there are significant differences between the two most common workloads, MapReduce and HBase.
This leads to mainly three points of view for analysis to make sure service levels are achieved:
- Interest in response time for interactive workloads: CPU, memory, network, and I/O utilization levels needed to respond to queries quickly and effectively
- Interest in high throughput for batch workloads: maximize utilization levels; response time is not a concern
- Interest in planning storage capacity (filesystem and HDFS)
This talk focuses on providing guidelines for the capacity planner on how to translate existing techniques and frameworks and adapt them to these new technologies: in most cases, "what's old is new again".
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters (Sumeet Singh)
In this talk, we look at the YARN scheduler choices available today for Apache Hadoop 2 and discuss their pros and cons. We dive deeper into the Capacity Scheduler, providing a comprehensive overview of its various settings with examples from real large-scale Hadoop clusters, to promote a broader understanding of the schedulers' current state and the best practices in place today for queue nomenclature, planning, allocations, and ongoing management. We present detailed cluster, queue, and job behaviors from several different capacity management philosophies.
We then propose practical solutions, without any change to the scheduler or core Hadoop, that allow managing queue creation and capacity allocations while optimizing for cluster utilization and maintaining SLA guarantees. A unified queue nomenclature and admission and capacity re-allocation policies across BUs, applications, and clusters make service automation possible. Transparency in resources consumed allows for defining realistic SLA expectations. Finally, consistent application tagging completes the feedback loop, with SLAs observed through application-level reporting.
PRACE Autumn school 2021 - Big Data with Hadoop and Keras
27-30 September 2021
Fakulteta za strojništvo
Europe/Ljubljana
Data and scripts are available at: https://www.events.prace-ri.eu/event/1226/timetable/
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security, and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base.
This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Hive - Apache Hadoop Big Data Training by Design Pathshala (Desing Pathshala)
Learn Hadoop and big data analytics: join Design Pathshala training programs on big data and analytics.
This slide deck covers advanced knowledge of Apache Hive.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
Scaling Spark Workloads on YARN - Boulder/Denver July 2015 (Mac Moore)
Hortonworks presentation at the Boulder/Denver Big Data Meetup on July 22nd, 2015. Topic: Scaling Spark Workloads on YARN; Spark as a workload in a multi-tenant Hadoop infrastructure, scaling, cloud deployment, and tuning.
Structuring Spark: DataFrames, Datasets, and Streaming (Databricks)
As Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk I give an overview of some of the exciting new APIs available in Spark 2.0, namely Datasets and Streaming DataFrames/Datasets. Datasets provide an evolution of the RDD API by allowing users to express computation as type-safe lambda functions on domain objects, while still leveraging the powerful optimizations supplied by the Catalyst optimizer and Tungsten execution engine. I will describe the high-level concepts as well as dive into the details of the internal code generation that enables us to provide good performance automatically. Streaming DataFrames/Datasets let developers seamlessly turn their existing structured pipelines into real-time incremental processing engines. I will demonstrate this new API's capabilities and discuss future directions, including easy sessionization and event-time-based windowing.
Building Robust, Adaptive Streaming Apps with Spark Streaming (Databricks)
As the adoption of Spark Streaming increases rapidly, the community has been asking for greater robustness and scalability from Spark Streaming applications in a wider range of operating environments. To fulfill these demands, we have steadily added a number of features in Spark Streaming. We have added backpressure mechanisms that allow Spark Streaming to dynamically adapt to changes in incoming data rates and maintain the stability of the application. In addition, we are extending Spark's Dynamic Allocation to Spark Streaming, so that streaming applications can elastically scale based on processing requirements. In my talk, I am going to explore these mechanisms and explain how developers can write robust, scalable, and adaptive streaming applications using them. Presented by Tathagata "TD" Das from Databricks.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17 (spark-project)
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Beyond SQL: Speeding up Spark with DataFrames (Databricks)
In this talk I describe how you can use Spark SQL DataFrames to speed up Spark programs, even without writing any SQL. By writing programs using the new DataFrame API you can write less code, read less data and let the optimizer do the hard work.
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
The story of how to figure out what to measure and how you can benchmark it. This slide deck presents the idea of benchmarking; it does not cover actual commercial or open-source benchmark tools.
From common errors seen in running Spark applications (e.g., OutOfMemory, NoClassFound, disk I/O bottlenecks, History Server crashes, cluster under-utilization) to the advanced settings used to resolve large-scale Spark SQL workloads (such as HDFS block size vs. Parquet block size and how best to run the HDFS Balancer to redistribute file blocks), you will get the full scoop in this information-packed presentation.
Impetus provides expert consulting services around Hadoop implementations, including R&D, assessment, deployment (on private and public clouds), and optimizations for enhanced static shared-data implementations.
This presentation covers advanced Hadoop tuning and optimization.
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop (Hazelcast)
This webinar identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next-generation computing paradigms such as in-memory data/compute grids. The focus is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. Hadoop's scalability, efficiency, built-in reliability, and cost effectiveness have made it an enterprise-wide platform that web-scale cloud operations run on. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers on a daily basis, including the challenges that come from scale, security, and multi-tenancy that we have dealt with in the last several years of operating one of the largest Hadoop footprints in the world. We will cover the current technology stack that Yahoo has built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base at Yahoo. Throughout the talk, we will highlight relevant use cases from Yahoo's Mobile, Search, Advertising, Personalization, Media, and Communications businesses that may make these considerations more pertinent to your situation.
Keynote Hadoop Summit San Jose 2017: Shaping Data Platform To Create Lasting... (Sumeet Singh)
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review (Sumeet Singh)
Over the past year, a lot of progress has been made in advancing the Apache Hadoop platform at Yahoo. We underwent a massive infrastructure consolidation to lower the platform TCO. CaffeOnSpark was open-sourced for distributed deep learning on existing infrastructure with a combination of CPU and GPU-based computing. Traditional compute on MapReduce continues to shift to Apache Tez and Apache Spark for lower processing time. Our internal security, multi-tenancy, and scale changes to Apache Storm got pushed to the community in Storm 0.10. Omid was open-sourced for managing transactions reliably on Apache HBase. Multi-tenancy with region groups, splittable META, ZooKeeper-less assignment manager, favored nodes with HDFS block placement, and support for humongous tables have taken Apache HBase scale to new heights. Dependency management in Apache Oozie for combinatorial, conditional, and optional processing gives increased flexibility to our data pipelines teams in maintaining SLAs. Focus on ease of use and onboarding improvements have brought in a whole new class of use cases and users to the platform. In this talk, we will provide a comprehensive overview of the platform technology stack, recent developments, metrics, and share thoughts on where things are headed when it comes to big data at Yahoo.
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In this talk, Sumeet Singh will present some of the recent innovations, open source contributions, and where things are headed when it comes to Hadoop at Yahoo.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... (Sumeet Singh)
Building a real-time monitoring service that handles millions of custom events per second while satisfying complex rules, varied throughput requirements, and numerous dimensions simultaneously is a complex endeavor. Sumeet Singh and Mridul Jain explain how Yahoo approached these challenges with Apache Storm Trident, Kafka, HBase, and OpenTSDB and discuss the lessons learned along the way.
Sumeet and Mridul explain scaling patterns backed by real scenarios and data to help attendees develop their own architectures and strategies for dealing with the scale challenges that come with real-time big data systems. They also explore the tradeoffs made in catering to a diverse set of daily users and the associated usability challenges that motivated Yahoo to build a self-serve, easy-to-use platform that requires minimal programming experience. Sumeet and Mridul then discuss event-level tracking for debugging and troubleshooting problems that our users may encounter at this scale. Over the course of their talk, they also address building infrastructure and operational intelligence with anomaly detection, alert correlation, and trend analysis based on the monitoring platform.
HUG Meetup 2013: HCatalog / Hive Data Out (Sumeet Singh)
The Yahoo! Hadoop grid makes use of a managed service to pull data into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and the underlying storage format of data for users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplishes the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data-out capabilities and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Senior Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
Hadoop Summit San Jose 2014: Data Discovery on Hadoop (Sumeet Singh)
In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs have become critical with the increasing scale of operations.
In this talk, we will share our approach in tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach allows ever improving Hive performance to open up easy adhoc access to analyze and visualize data through SQL on Hadoop and popular BI tools.
As we discuss our approach, we will also highlight how it minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage, and integrity.
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop (Sumeet Singh)
Hadoop has allowed us to move towards a unified source of truth for all of the organization's data. Managing data location, schema knowledge and evolution, fine-grained business-rules-based access control, and audit and compliance needs will become critical with increasing scale of operations.
In this talk, we will share an approach in tackling the above challenges. We will explain how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data. In addition, the approach allows us to open up easy adhoc access to analyze and visualize data through SQL on Hadoop and popular BI tools. As we discuss our approach, we will also highlight how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
URL: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38768
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo... (Sumeet Singh)
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo's Hadoop clusters. A key component that enables this efficient operation is data compression.
With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented.
The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! (Sumeet Singh)
The Hadoop project is an integral part of Yahoo!'s cloud infrastructure and is at the heart of many of Yahoo!'s important business processes. Sumeet Singh, the Head of Products for Cloud Services and Hadoop at Yahoo!, explains how Yahoo! leverages Hadoop and cloud platforms to process and serve Internet-scale data.
Yahoo! operates one of the world's largest private cloud infrastructures. Learn how technologies scale out for building enterprise-wide trusted platforms with tight SLAs.
URL: http://www.saptechnologyservice.com/track1.html
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi... (Sumeet Singh)
Cloud-based architectures of Hadoop have made it attractive for public cloud service providers to offer hosted Hadoop services and charge customers on a pay-for-what-you-use basis. For enterprises that have already adopted Hadoop, the data infrastructure has long been seen as a cost element in their budgets. As a result, enterprises thinking of adopting Hadoop are increasingly debating between on-premise and cloud-based models for their data processing needs.
We lay out a set of criteria and methodical approaches to help enterprises that have not yet adopted Hadoop evaluate their options, and discuss the pros and cons of both models. For enterprises that have already made significant investments or have plans to build a Hadoop-based infrastructure, we present an approach to manage Hadoop as a Service with a P&L, transparency in costs, and metering & billing provisions.
As we discuss these approaches, we will share insights gathered from the exercise conducted on one of the largest Hadoop footprints in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs for usage, measure the resource usage for services, optimize for higher utilization, and benchmark costs.
URL: http://strataconf.com/stratany2013/public/schedule/detail/30824
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! (Sumeet Singh)
Yahoo! has been using HBase for a long time in isolated instances, most notably for the personalization platform powering its homepage experiences. The introduction of multi-tenancy has lowered the barriers for all Hadoop users to use HBase. We will cover traditional use cases for HBase at Yahoo!, and new use cases as a result in content management, advertising, log processing, analytics and reporting, recommendation graphs, and dimension data stores.
We will then talk about the deployment strategy and the enhancements made to facilitate multi-tenancy: Region Server groups provide a coarse level of isolation among tenants by designating a subset of region servers to serve designated tables, and Namespaces provide logical grouping of resources (region servers, tables) and privileges (quotas, ACLs).
We'll also share our experiences in operating HBase with security enabled and contributions made in this area, and results from performance runs conducted to validate customer expectations in a multi-tenant environment.
URL: http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--multi-tenant-apache-hbase-at-yahoo-video.html
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deployments
1. Capacity Planning in Multi-tenant Hadoop, HBase and Storm Deployments
PRESENTED BY Amrit Lal and Sumeet Singh | April 02, 2014
2014 Hadoop Summit, Amsterdam, Netherlands
2. Introduction
Sumeet Singh, Senior Director, Product Management, Hadoop and Big Data Platforms, Cloud Engineering Group
§ Manages the Hadoop products team at Yahoo!
§ Responsible for Product Management, Strategy and Customer Engagements
§ Managed the Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
§ M.B.A. from UCLA and M.S. from Rensselaer (RPI)
701 First Avenue, Sunnyvale, CA 94089 USA | @sumeetksingh
Amrit Lal, Product Manager, Hadoop and Big Data Platforms, Cloud Engineering Group
§ Product Manager at Yahoo engaged in building high-class and robust Hadoop infrastructure services
§ Eight years of experience across HSBC, Oracle and Google in developing products and platforms for high-growth enterprises
§ M.B.A. from Carnegie Mellon University
701 First Avenue, Sunnyvale, CA 94089 USA | @amritasshwar
3. Agenda
1. The Need for Capacity Planning
2. Big Data Platform Deployment Models
3. Resource Drivers and Data Sources
4. Capacity Models and Tools
5. SLA Management
4. Multi-tenant Apache Hadoop Platform Evolution
[Chart: number of servers (DataNodes) and raw HDFS storage (in PB) by year, 2006-2014]
Milestones: Yahoo! commits to scaling Hadoop for production use; research workloads in Search and Advertising; production (modeling) with machine learning & WebMap; revenue systems with security, multi-tenancy, and SLAs; open sourced with Apache; Hortonworks spinoff for enterprise hardening; next-gen Hadoop (H 0.23 YARN); new services (HBase, Storm, Hive, etc.); increased user base with partitioned namespaces; Apache H2.x (low latency, utilization, HA, etc.)
6. Multi-tenant Apache HBase Growth at Yahoo
Zero to "20" use cases (60,000 regions) in a year
[Chart: number of servers (RegionServers) and data stored (in PB), Q1-13 through Q1-14, reaching 1,140 RegionServers and 33.6 PB]
7. Multi-tenant Apache Storm Growth at Yahoo
Zero to "175" production topologies in a year
[Chart: number of servers (Supervisors) and number of topologies, Q1-13 through Q1-14, reaching 760 Supervisors and 175 topologies; multi-tenancy release marked]
8. Where Does Capacity Planning Fit
Project lifecycle support stages: technology choice, architecture validation, capacity planning, phased environment, production on-boarding
9. Big Data Platform Technology Stack at Yahoo
Layers: Compute Services, Storage Infrastructure, Services
Components: Pig, Hive, Oozie, HCatalog, GDM, HDFS Proxy, YARN, MapReduce, Tez, Spark, Storm, HDFS, HBase, ZooKeeper, Messaging Service, Monitoring, Starling, Support Shop
(The slide highlights the components relevant for capacity planning.)
10. Deployment Model
- Hadoop clusters: NameNode and RM masters; DataNode + NodeManager worker nodes
- HBase clusters: NameNode and HBase Master; DataNode + RegionServer worker nodes
- Storm clusters: Nimbus master; Supervisor worker nodes
- Shared infrastructure: administration, management and monitoring; ZooKeeper pools; HTTP/HDFS/GDM load proxies; Oozie server; HS2/HCat; applications and data (data feeds, data stores)
(The slide highlights the components relevant for capacity planning.)
11. Capacity Drivers That Matter
Drivers and measures:
- Data (Storage): volume of data to be stored and processed
- Memory: container for direct and faster access to stored data
- CPU: cores (and threads) available for processing
- Throughput: number of transactions per second
- Latency: time taken to complete a request or operation (includes processing, disk, and network I/O time)
12. Apache Hadoop Resources (in the order of importance for Hadoop)
- Data (Storage): data stored in HDFS (disk). Measures: frequency, size, retention, # files; replication factor
- Memory: Map and Reduce containers (in H 0.23/2.0). Measures: map memory; reduce memory
- CPU: YARN-2 for Capacity Scheduler; Yahoo is not using it yet. Measure: N/A
- Throughput: data processed per second with concurrent Mappers and Reducers. Measures: total data processed; Maps and Reduces to run (simple or complex DAGs)
- Latency: time taken for the jobs to complete. Measures: individual job run times; time to finish all jobs (when run in parallel), peak usage
13. Working Through a Use Case (ILLUSTRATIVE)
Pig Mail needs to process 30 TB of data every day in about 6 hours so that it can develop algorithms that detect spam more effectively. A Pig script will parse the data in sequential phases to finally materialize the features of the mail that decide whether the mail is SPAM.
[Pig DAG: Stage 1 (node 1) feeds Stage 2 (nodes 2-L and 2-R), which feed Stage 3 (node 3)]
14. Data (Storage)
Step 1: Pig Mail Project Info (User Input)
- Data upload frequency: once daily
- Data added per upload: 1 TB/day
- Data retention (input): 30 days
- Data output: 50 GB
- Data retention (output): 1 day
- Anticipated growth in data volume (3-6 months): 20%
Step 2: # Servers Based on Storage (default values in hdfs-site.xml)
- HDFS replication factor: dfs.replication, default <3>
- HDFS required: (30 + 0.05) x 1.2 x 3 = 108 TB
- Suggested server config (based on total cost): C-xxx/48/4000 (four 4 TB disks)
- Storage available per server: 12 TB out of 16 TB (rest for OS, temp, swap, etc.); dfs.datanode.du.reserved <107374182400> 1 TB
- Servers required: 108 / 12 = 9 servers
Step 3: Namespace Needed (default values in hdfs-site.xml)
- HDFS block size: dfs.blocksize, default <134217728> (128 MB)
- Average file size: 1.5 x 128 MB = 200 MB (assumed)
- Namespace for files: 108 TB / 200 MB = 540,000
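To make the arithmetic repeatable, here is a minimal sketch (not from the deck) of the Step 2 and Step 3 calculations in Python; the growth factor, replication, usable storage per node, and average file size are the illustrative values from the Pig Mail example above, not recommendations.
```python
# Storage-based sizing per the worked example; all constants are the
# slide's illustrative figures.

def hdfs_raw_tb(input_tb, output_tb, growth=0.20, replication=3):
    # (input + output) x growth headroom x replication factor
    return (input_tb + output_tb) * (1 + growth) * replication

def servers_for_storage(raw_tb, usable_tb_per_node=12):
    # 12 TB usable out of 16 TB per node (rest for OS, temp, swap, etc.);
    # the slide rounds 108 / 12 to 9, so we round rather than take a ceiling.
    return round(raw_tb / usable_tb_per_node)

def namespace_objects(raw_tb, avg_file_mb=200):
    # NameNode namespace estimate: total raw bytes / average file size
    return int(raw_tb * 1e6 / avg_file_mb)

raw = hdfs_raw_tb(30, 0.05)                    # ~108 TB
print(round(raw, 1), servers_for_storage(raw)) # 108.2 TB -> 9 servers
print(namespace_objects(raw))                  # ~540,000 namespace objects
```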
15. Memory
Step 1: Cluster/Node Level Info (configured values in yarn-site.xml) – Admins Only
- Max memory on the node for containers: yarn.nodemanager.resource.memory-mb, conf <45056> (44 GB out of 48 GB, rest for the OS)
- Virtual to physical memory ratio: yarn.nodemanager.vmem-pmem-ratio, default <2.1> (2:1 virtual may exceed physical by)
- Min allocatable memory for containers: yarn.scheduler.minimum-allocation-mb, default <512> (0.5 GB)
- Max allocatable memory for containers: yarn.scheduler.maximum-allocation-mb, default <8192> (8 GB)
Step 2: Container Level Info (default values in mapred-site.xml)
- Map task container size: mapreduce.map.memory.mb, default <1536> (1.5 GB)
- Reduce task container size: mapreduce.reduce.memory.mb, default <2048> (2 GB)
- MR AppMaster memory size: yarn.app.mapreduce.am.resource.mb, default <1536> (1.5 GB)
- Map task JVM heap size: mapreduce.map.java.opts, default -Xmx1024m
- Reduce task JVM heap size: mapreduce.reduce.java.opts, default -Xmx1536m
Map and Reduce container sizes are determined by users developing the app based on the memory needs of the tasks.
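One consequence of these settings worth spelling out (a hedged aside, not on the slide): the node memory and container sizes together bound how many tasks can run concurrently per node.
```python
# Concurrent containers per node implied by the values quoted above.
node_mb = 45056    # yarn.nodemanager.resource.memory-mb (44 GB of 48 GB)
map_mb = 1536      # mapreduce.map.memory.mb (1.5 GB)
reduce_mb = 2048   # mapreduce.reduce.memory.mb (2 GB)

print(node_mb // map_mb)     # 29 map containers per node
print(node_mb // reduce_mb)  # 22 reduce containers per node
```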
16. Throughput
Step 1: Estimating Number of Mappers
- Upper bound on input splits: mapreduce.input.fileinputformat.split.maxsize
- Lower bound on input splits: mapreduce.input.fileinputformat.split.minsize
- Number of mappers = number of input splits (e.g., 8,192 maps = 1 TB of data / 128 MB split size)
Step 2A: Estimating Number of Reducers
- Limit on the input size to reducers: mapreduce.reduce.input.limit, default <10737418240> (10 GB)
- Fixed number of reducers: mapreduce.job.reduces
- Number of reducers = min(fixed reducers, total input size / reducer size)
Step 2B: Estimating Number of Reducers (Pig and Hive)
- Pig: min(fixed reducers, pig.exec.reducers.max, total input size / pig.exec.reducers.bytes.per.reducer); defaults: max 999, 1 GB per reducer
- Hive: min(fixed reducers, hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer); defaults: max 999, 1 GB per reducer
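These formulas are easy to check with a short sketch; the 128 MB split size and Pig's 999/1 GB defaults are the values quoted above.
```python
import math

def num_mappers(input_bytes, split_bytes=128 * 1024**2):
    # One mapper per input split.
    return math.ceil(input_bytes / split_bytes)

def num_reducers_pig(input_bytes, fixed=None, reducers_max=999,
                     bytes_per_reducer=1024**3):
    # min(fixed reducers, pig.exec.reducers.max,
    #     total input size / pig.exec.reducers.bytes.per.reducer)
    estimate = math.ceil(input_bytes / bytes_per_reducer)
    return min(c for c in (fixed, reducers_max, estimate) if c is not None)

one_tb = 1024**4
print(num_mappers(one_tb))       # 8192 maps, matching the slide's example
print(num_reducers_pig(one_tb))  # 999, capped by pig.exec.reducers.max
```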
17. Throughput and Latency
Step 1: Sample Run (with a tenth of the data on a sandbox cluster)
Stages | # Map | Map Size | Map Time | # Reduce | Reduce Size | Reduce Time
Stage 1 | 100 | 1.5 GB | 10 min | 50 | 2 GB | 5 min
Stage 2-L | 50 | 1.5 GB | 10 min | 20 | 2 GB | 10 min
Stage 2-R | 30 | 1.5 GB | 5 min | 10 | 2 GB | 5 min
Stage 3 | 70 | 1.5 GB | 5 min | 30 | 2 GB | 5 min
Notes:
§ SLOT_MILLIS_MAPS and SLOT_MILLIS_REDUCES from Job Counters give the time spent
§ TOTAL_LAUNCHED_MAPS and TOTAL_LAUNCHED_REDUCES from Job Counters give # Map and # Reduce
§ Reduce time includes the sort and shuffle time; shuffle time is data per reducer / est. 4 MB/s (bandwidth for data transfer from Map to Reduce)
§ Add 10% for speculative execution (failed/killed task attempts)
Step 2: Mappers and Reducers for SLA and Full Dataset
Stages | Mins | SLA Share | # Map | # Reduce | Map Total | Reduce Total | Total Mem. | # Servers
Stage 1 | 15 / 45 min | 120 / 360 min | 138 = (100 x 11) / 8 | 69 = (50 x 11) / 8 | 207 GB | 138 GB | 345 GB | 8
Project Pig Mail Capacity Ask = MAX(Compute <8 servers>, Storage <9 servers>) = 9 servers
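The jump from the sample run to the 138/69 concurrent tasks deserves one spelled-out step: scale the sample counts by 11 (10x data plus 10% speculative execution) and divide by the number of sequential waves the SLA allows (120 minutes / 15 minutes per Stage 1 pass = 8). A sketch, using the slide's Stage 1 numbers and the 44 GB usable node memory from the earlier Memory slide:
```python
import math

sample_maps, sample_reduces = 100, 50   # Stage 1 of the sample run (1/10 data)
map_gb, reduce_gb = 1.5, 2.0            # container sizes
stage_min = 10 + 5                      # map + reduce time per pass
sla_share_min = 120                     # Stage 1's slice of the 360-min SLA

scale = 10 * 1.10                       # full data (10x) + 10% speculation
waves = sla_share_min // stage_min      # 8 sequential waves fit the SLA

maps = math.ceil(sample_maps * scale / waves)        # 138
reduces = math.ceil(sample_reduces * scale / waves)  # 69
mem_gb = maps * map_gb + reduces * reduce_gb         # 207 + 138 = 345 GB
servers = math.ceil(mem_gb / 44)                     # 8 compute nodes

print(maps, reduces, mem_gb, servers)
print("ask =", max(servers, 9))         # MAX(compute 8, storage 9) = 9 servers
```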
19. Apache HBase Resources (in the order of importance for HBase)
- Throughput: supported frequency of data read or written per second (for a given record size). Measure: number of reads, writes, or scans per second per server
- Latency: time taken for read, write, or scan operations to complete. Measure: read or write time in ms (typically) per record
- Memory: BlockCache; data that needs to be served through cache. Measures: % of data read from cache; MemStore/BlockCache ratio, RegionServer heap
- Data (Storage): total data stored in HDFS (disk). Measure: avg. record size x avg. number of records stored
- CPU: N/A
20. Working Through a Use Case (ILLUSTRATIVE)
Awesome eCommerce needs to process about 200 M records daily, somewhere between 6:00 and 10:00 AM, to update product information. About 50% of the data relates to existing products, whose price may need to be updated by comparing the current price with the new offer price. The remaining 50% of the offers are new products and will be written without price comparison.
There are three separate tables for product, price, and offers, with a 3 KB avg. record size. Writes are on the order of 500 million records and reads 250 million across each of the three tables.
21. Throughput & Latency
21 2014 Hadoop Summit, Amsterdam, Netherlands
Step 1: Project Info – User Input
Active reads/writes per day 4 Hrs.
Avg. writes / day (all three tables) 1,500 M
Avg. reads / day (all three tables) 750 M
Average record size 3 KB
Records cached / warmed on start 50%
Step 2: # Servers Based on Write Throughput
Peak concurrent writes required | 1,500 M x 3 KB / (4 x 3,600 sec) = ~300 MB / sec
Peak write throughput per RegionServer | 45 MB / sec (based on performance benchmarks)
Servers required | 300 / 45 = ~7 RegionServers
Step 3: # Servers Based on Read Throughput
Peak concurrent reads required | 750 M x 3 KB / (4 x 3,600 sec) = ~160 MB / sec
Peak cold random read throughput | 10 MB / sec (based on performance benchmarks)
Peak hot random read throughput | 200 MB / sec (based on performance benchmarks)
RegionServers for cold reads | 160 x 50% / 10 = 8
RegionServers for hot reads | 160 x 50% / 200 = 1
Servers required | MAX (8, 1) = 8 RegionServers
Performance benchmarks were conducted by simulating HBase workloads through YCSB on dedicated servers
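The write and read sizing steps reduce to a peak-throughput calculation. A minimal Python sketch, using this example's benchmark figures (45, 10, and 200 MB/sec are workload-specific, not universal constants):

```python
import math

def peak_mb_per_sec(records, record_kb, window_hours):
    """Peak concurrent throughput when all traffic lands in the active window."""
    return records * record_kb / 1024 / (window_hours * 3600)

write_mb_s = peak_mb_per_sec(1_500e6, 3, 4)   # ~300 MB/sec
write_rs = math.ceil(write_mb_s / 45)         # 45 MB/sec per RS -> 7 RegionServers

read_mb_s = peak_mb_per_sec(750e6, 3, 4)      # ~160 MB/sec
cold_rs = math.ceil(read_mb_s * 0.5 / 10)     # 50% cold reads at 10 MB/sec -> 8
hot_rs = math.ceil(read_mb_s * 0.5 / 200)     # 50% hot reads at 200 MB/sec -> 1
read_rs = max(cold_rs, hot_rs)                # -> 8 RegionServers
print(write_rs, read_rs)
```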
22. Memory
Step 1: RegionServer Info (configured values in hbase-site.xml and hbase-env.sh) – Admins Only
Max memory available per RegionServer | C-xxx/64/4000 <64 GB>
Heap size of the RegionServer JVM | export HBASE_HEAPSIZE=59392 (58 GB); Default: <1000> (1000 MB)
Memory allocated to BlockCache | hfile.block.cache.size = 0.8 (80%); Default: <0.4> (40% of heap)
Memory allocated to MemStore | hbase.regionserver.global.memstore.size = 0.2 (20%); Default: <0.4> (40% of heap)
Step 2: Servers Required to Serve from BlockCache
Total records | 200 M
Average record size | 3 KB
Total data served | 200 M x 3 KB = 0.55 TB
Total data served through BlockCache | 0.55 TB x 50% = 0.28 TB
Loading factor of the (LRU) BlockCache (in HBase 0.94) | 85%
Total BlockCache available per RegionServer | 58 GB x 0.8 x 85% = 40 GB
Servers required | 0.28 TB / 40 GB = 7 RegionServers
BlockCache allocation depends on the mix of read and write access patterns. The remainder of the LRU cache is used by other resident users such as catalog tables, HFile indexes, and bloom filters.
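A short Python sketch of the BlockCache sizing, with the configured values above as assumed inputs:

```python
heap_gb = 58                             # HBASE_HEAPSIZE per RegionServer
usable_cache_gb = heap_gb * 0.8 * 0.85   # cache share x LRU loading factor (~40 GB)

total_data_gb = 200e6 * 3 / 1024 ** 2    # 200 M records x 3 KB ~ 0.55 TB
hot_set_gb = total_data_gb * 0.50        # 50% served through BlockCache

# The worked example rounds to the nearest server; a more conservative
# plan would take math.ceil() here instead.
servers = round(hot_set_gb / usable_cache_gb)
print(servers)                           # -> 7 RegionServers
```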
23. Data
Step 1: RegionServer Info (configured values in hbase-site.xml & hbase-env.sh) – Admins Only
Max memory available per RegionServer | C-xxx/64/4000 (four 4 TB disks) = 64 GB
Heap size of the RegionServer JVM | export HBASE_HEAPSIZE=59392 (58 GB); Default: <1000> (1000 MB)
Region size | hbase.hregion.max.filesize = 10737418240; Default: <10737418240> (10 GB)
Memory allocated to MemStore | hbase.regionserver.global.memstore.size = 0.2 (20%); Default: <0.4> (40% of heap)
MemStore flush size | hbase.hregion.memstore.flush.size = 134217728; Default: <134217728> (128 MB)
HDFS replication factor | dfs.replication = 3; Default: <3>
Step 2: # Servers Based on Data Served
Raw disk space to JVM heap ratio per RegionServer | 10 GB / 128 MB x 3 x 0.2 = 48
Raw disk space available per RegionServer | 48 x 58 GB x 0.2 = 0.56 TB
Total data served through tables | 0.55 TB
Total raw data served | 0.55 TB x 3 = 1.65 TB
Servers required | 1.65 / 0.56 = ~3 servers
Project Awesome eCommerce Ask = MAX (Write <7 RS>, Read <8 RS>, Cached <7 RS>, Data <3 RS>) = 8 RS
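The storage-driven count and the final ask follow the same pattern. An illustrative Python sketch that mirrors the slide's arithmetic (including the second x 0.2 heap factor):

```python
import math

# Configured values from Step 1 above.
region_gb, flush_mb = 10, 128    # region size, MemStore flush size
memstore, replication = 0.2, 3   # global MemStore fraction, dfs.replication
heap_gb = 58                     # HBASE_HEAPSIZE

# Raw disk each GB of heap can sustain (the slide's 48x ratio).
disk_per_heap = region_gb * 1024 / flush_mb * replication * memstore  # 48.0

# Raw disk per RegionServer, applying the heap's MemStore share again
# (x 0.2, mirroring the slide's arithmetic).
raw_tb_per_rs = disk_per_heap * heap_gb * memstore / 1000             # ~0.56 TB

data_rs = math.ceil(0.55 * replication / raw_tb_per_rs)               # 1.65 TB -> 3

# Final ask: the max across all four sizing dimensions.
ask = max(7, 8, 7, data_rs)  # write, read, BlockCache, data -> 8 RegionServers
print(data_rs, ask)
```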
25. Apache Storm Resources
Resource | Drivers | Measure
Throughput | Events processed per second or parallel workers | § # events, # messages / sec § Tuples / sec
Memory | Worker / slot memory for spouts and bolts | § Spout and bolt JVM size § Message and tuple size
CPU | CPU threads needed for workers / executors | § Cores for spout and bolt processes, inter- and intra-worker communication
Latency | Time taken for processing the input stream of events | § Execute / complete latency
Data (Storage) | N/A | § N/A
(In the order of importance for Storm)
26. Working Through a Use Case
Wonder Search wants to index editorial content in near real-time so that users can search it. The editorial content is available in Apache HBase.
Spout: Scans HBase from the last scan time to the current time to pick up new editorial content.
Bolt 1: Builds the index and stores it back in HBase.
Bolt 2: Pushes the index out for serving.
ILLUSTRATIVE
27. Throughput and Latency
Step 1: Supervisor Level Info (configured values in storm.yaml or multitenant-scheduler.yaml) – Admins Only
Incoming (worker) message queue size | topology.receiver.buffer.size, Default: <8>
Outgoing (worker) message queue size | topology.transfer.buffer.size, Default: <1024>
Incoming (executor) tuple queue size | topology.executor.receive.buffer.size, Default: <1024>
Outgoing (executor) tuple queue size | topology.executor.send.buffer.size, Default: <1024>
Slots available per supervisor | supervisor.slots.ports: <24>, hyper-threaded cores for dual hex-core machines
Multi-tenant scheduler (user isolation scheduler) | multitenant.scheduler.user.pools: <users>: <# nodes>, topology.isolate.machines: <number of nodes>
Step 2: # Servers Based on Throughput
Events processed with a single spout per worker | 1,000 messages / sec
Target throughput required | 8,000 messages / sec
Number of spout executors required | 8,000 / 1,000 = 8 (across 8 slots)
Tuples executed across the 1st bolt (5 executors) | 10,000 tuples / sec
Total executors required for the 1st bolt | 8 x 5 = 40 (across 40 slots)
Tuples executed across the 2nd bolt (5 executors) | 15,000 tuples / sec
Total executors required for the 2nd bolt | 8 x 5 = 40 (across 40 slots)
Total slots based on executors | 8 + 40 + 40 = 88 slots
Number of supervisors required | 88 / 24 = ~4 servers
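A minimal Python sketch of the slot math, assuming the single-spout benchmark of 1,000 messages/sec and 5 bolt executors per spout for each bolt:

```python
import math

def slots_needed(target_msg_s, per_spout_msg_s, bolt_execs_per_spout):
    """Total slots for a topology sized from single-spout throughput."""
    spouts = math.ceil(target_msg_s / per_spout_msg_s)      # 8 spout executors
    bolts = sum(spouts * n for n in bolt_execs_per_spout)   # 40 + 40 bolt executors
    return spouts + bolts

slots = slots_needed(8_000, 1_000, [5, 5])  # 8 + 40 + 40 = 88 slots
supervisors = math.ceil(slots / 24)         # 24 slots per supervisor -> 4 servers
print(slots, supervisors)
```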
28. CPU vs. Throughput
Step 1: Track CPU Usage with JVM Tools (jmap / jstack)
Max CPU cores per supervisor | C-xxx/48/4000 (12 physical cores)
CPU usage for 1,000 messages / sec | 4 physical cores (32.12%); includes 1 spout and 5 bolt executors each for bolts 1 and 2, plus CPU usage for inter-messaging (ZeroMQ or Netty)
Executor CPU needs (assuming an equal CPU division between spout and bolt executors) | 4 / (1 + 5 + 5) = 4/11 cores
Total workers | TOPOLOGY_WORKERS, Config#setNumWorkers()
Tasks per component | TOPOLOGY_TASKS, ComponentConfigurationDeclarer#setNumTasks()
Step 2: Extrapolate for Target Throughput (assuming a linear increase)
Target spout executors | 8, TopologyBuilder#setSpout()
Target bolt executors | 40 per bolt, TopologyBuilder#setBolt()
CPU needed for spout executors | 8 x 4/11 = ~3 cores
CPU needed for 1st bolt executors | 40 x 4/11 = ~15 cores
CPU needed for 2nd bolt executors | 40 x 4/11 = ~15 cores
CPU needed for the topology | 3 + 15 + 15 = 33 cores
Total supervisors needed | 33 / 12 = ~3 servers
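The extrapolation is linear in the measured per-executor cost. An illustrative Python sketch (rounding each stage up to whole cores, as the figures above do):

```python
import math

# 4 physical cores handled 1 spout + 10 bolt executors at 1,000 msg/sec,
# assuming an equal CPU split across executors.
cores_per_executor = 4 / 11

spout_cores = math.ceil(8 * cores_per_executor)   # ~3 cores
bolt_cores = math.ceil(40 * cores_per_executor)   # ~15 cores per bolt
topology_cores = spout_cores + 2 * bolt_cores     # 3 + 15 + 15 = 33 cores

supervisors = math.ceil(topology_cores / 12)      # 12 physical cores each -> 3
print(topology_cores, supervisors)
```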
29. Memory vs. Throughput
Step 1: Supervisor Level Info
Max memory available per supervisor node | C-xxx/48/4000 <48 GB> (42 GB usable out of 48 GB; the rest is reserved for the OS)
Step 2: # Servers Based on Memory Needs
Events processed across spout executors | 8,000 messages / sec
Avg. event or message size | 3 MB
Data processed per second across spout executors | 8,000 x 3 MB = 24 GB / sec
Tuples processed per second across 1st bolt executors | 10,000 x 8 = 80,000 tuples / sec
Average tuple size | 100 KB
Data processed per second across 1st bolt executors | 80,000 tuples / sec x 100 KB = 8 GB / sec
Data processed per second across 2nd bolt executors | 15,000 x 8 tuples / sec x 100 KB = 12 GB / sec
Total data processed | 24 GB / sec + 8 GB / sec + 12 GB / sec = 44 GB / sec
Number of supervisors required to process data | 44 / 42 = ~2 servers
Project Wonder Search Ask = MAX (Throughput <4 Servers>, CPU <3 Servers>, Memory <2 Servers>) = 4 Servers
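A short Python sketch of the memory-based count and the final ask (the 42 GB usable figure and the per-stage rates are this example's assumptions):

```python
import math

usable_gb = 42                        # 48 GB supervisor minus the OS reserve

spout_gb = 8_000 * 3 / 1_000          # 8,000 msg/sec x 3 MB = 24 GB/sec
bolt1_gb = 80_000 * 100 / 1_000_000   # 80,000 tuples/sec x 100 KB = 8 GB/sec
bolt2_gb = 120_000 * 100 / 1_000_000  # 120,000 tuples/sec x 100 KB = 12 GB/sec

total_gb = spout_gb + bolt1_gb + bolt2_gb      # 44 GB/sec in flight
memory_sup = math.ceil(total_gb / usable_gb)   # -> 2 servers

# Final ask across the three Storm sizing dimensions.
ask = max(4, 3, memory_sup)  # throughput, CPU, memory -> 4 servers
print(total_gb, memory_sup, ask)
```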
32. Growing with YARN
[Diagram: YARN (Resource Manager) layered over HDFS (File System), hosting MapReduce (Batch), Tez (DAGs), Spark (Iterative), and Storm (Stream), with HBase, Giraph, R, OpenMPI, indexing, and other new services on YARN; some available today, others coming soon on YARN]
33. Near Future for Capacity Planning
Hadoop
§ CPU as a resource
§ Container reuse
§ Long-running jobs
§ Other potential resources such as disk, network, GPUs, etc.
§ Tez as the execution engine
§ Spark-on-YARN etc.
HBase
§ BlockCache implementations: LRU, Slab, Bucket
§ Short-circuit reads
§ Bloom filters and co-processors
§ HBase-on-YARN
Storm
§ Storm-on-YARN
§ More experience with multi-tenancy
34. Acknowledgement
Hadoop Capacity Planning
Nathan Roberts Hadoop Core Architect
Koji Noguchi Software Engineer
Viraj Bhat Software Engineer
Ryota Egashiri Software Engineer
Balaji Narayan Service Engineer
Anish Matthew Service Engineer
Rajiv Chittajallu SE Architect
HBase Capacity Planning
Francis Liu Software Engineer
Dheeraj Kapur Service Engineer
Storm Capacity Planning
Bobby Evans Software Engineer
Dheeraj Kapur Service Engineer