As organizations begin to make use of large data sets, approaches to understanding and managing the true costs of big data become increasingly important as operations scale.
Whether an on-premise or cloud-based platform is used for storing, processing, and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run big data operations with their own P&L, full transparency in costs, and metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm, because of the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes, respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile the required data from typical sources, develop TCO models tailored to individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the costs.
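As a back-of-the-envelope illustration of the costing steps above, the Python sketch below rolls a handful of annualized cost categories into a TCO figure and derives unit costs of storage and compute that projects can then be billed against. Every figure, category name, and the 40/60 storage/compute split are hypothetical assumptions for illustration, not actual numbers from this exercise.

```python
# Minimal TCO / unit-cost sketch. All figures are illustrative assumptions.

# Annualized cost categories for a hypothetical cluster (USD/year).
tco = {
    "servers_depreciation":   2_400_000,   # capex spread over useful life
    "network_depreciation":     300_000,
    "datacenter_space_power":   600_000,
    "operations_engineering":   900_000,
    "software_support":         200_000,
}
total_cost = sum(tco.values())

# Hypothetical installed capacity of the cluster.
usable_storage_tb = 20_000               # HDFS usable capacity after replication
compute_core_hours = 5_000 * 24 * 365    # total core-hours available per year

# Split TCO between storage and compute (assumed 40/60 split for illustration).
storage_share, compute_share = 0.4, 0.6

unit_storage_cost = total_cost * storage_share / (usable_storage_tb * 12)  # $/TB-month
unit_compute_cost = total_cost * compute_share / compute_core_hours        # $/core-hour

print(f"Total annual TCO:  ${total_cost:,.0f}")
print(f"Storage unit cost: ${unit_storage_cost:.2f} per TB-month")
print(f"Compute unit cost: ${unit_compute_cost:.4f} per core-hour")

# Charging a project then reduces to metering its consumption.
project_usage = {"tb_months": 850, "core_hours": 1_200_000}
monthly_bill = (project_usage["tb_months"] * unit_storage_cost
                + project_usage["core_hours"] * unit_compute_cost)
print(f"Example project bill: ${monthly_bill:,.0f}")
```

In practice, the cost categories and the storage/compute split come out of the tailored TCO model, and the usage numbers come from cluster metering.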
The Yahoo! Hadoop grid makes use of a managed service to pull data into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts away file locations and the underlying storage format of data for users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplishes the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data-out capabilities and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
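As a rough client-side illustration of the HiveServer2 data-out path described in the abstract above, the sketch below pulls a result set over the thrift interface using the open-source PyHive client. The host, credentials, table, and query are hypothetical, and this is only one of several JDBC/ODBC-style clients that could be used.

```python
# Minimal sketch of pulling data out through HiveServer2 instead of a file proxy.
# Requires the PyHive package; host, credentials, and table names are hypothetical.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# HCatalog/Hive metadata resolves files, partitions, and storage format for us;
# the client only needs a table name and a predicate.
cursor.execute("""
    SELECT page_id, SUM(clicks) AS clicks
    FROM   web_events
    WHERE  dt = '2014-06-01'
    GROUP  BY page_id
""")

for page_id, clicks in cursor.fetchall():
    print(page_id, clicks)

cursor.close()
conn.close()
```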
Operating multi-tenant clusters requires careful capacity planning for the on-time launch of big data projects and applications within the expected budget and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operating big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques, and methodology applied on a per-project or per-user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm, because of the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes, respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and for cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share the considerations incorporated into the calculations across these three primary deployments. We will discuss the data sources for the calculations, the resource drivers for different use cases, and how to plan optimum capacity allocation per project with respect to given standard hardware configurations.
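For a flavor of the kind of estimation such tools perform, here is a minimal Python sketch that sizes a project against a standard data-node configuration from its storage and compute needs. The hardware specification, growth figures, and headroom targets are all illustrative assumptions, not the actual standard configurations discussed in the talk.

```python
# Back-of-the-envelope node estimation against a standard hardware configuration.
# All hardware specs and project inputs are illustrative assumptions.
import math

# Standard data node configuration (assumed).
DISK_TB_PER_NODE = 48          # raw disk per data node
MEM_GB_PER_NODE = 128

# Project inputs (assumed).
raw_data_tb_per_day = 5.0      # ingested per day
retention_days = 90
replication = 3
temp_overhead = 0.25           # scratch/intermediate space
target_disk_utilization = 0.70 # headroom for growth and rebalancing

needed_raw_tb = raw_data_tb_per_day * retention_days
needed_hdfs_tb = needed_raw_tb * replication * (1 + temp_overhead)
nodes_for_storage = math.ceil(needed_hdfs_tb / (DISK_TB_PER_NODE * target_disk_utilization))

# Compute check: peak concurrent containers the project needs vs. what those nodes give.
peak_containers = 800
container_mem_gb = 4
nodes_for_compute = math.ceil(peak_containers * container_mem_gb / MEM_GB_PER_NODE)

nodes = max(nodes_for_storage, nodes_for_compute)
print(f"Storage-driven estimate: {nodes_for_storage} nodes")
print(f"Compute-driven estimate: {nodes_for_compute} nodes")
print(f"Plan for {nodes} data nodes of the standard configuration")
```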
PRACE Autumn school 2021 - Big Data with Hadoop and Keras
27-30 September 2021
Fakulteta za strojništvo
Europe/Ljubljana
Data and scripts are available at: https://www.events.prace-ri.eu/event/1226/timetable/
The document provides an introduction to Apache Hadoop, including:
1) It describes Hadoop's architecture, which uses HDFS for distributed storage and MapReduce for distributed processing of large datasets across commodity clusters.
2) It explains that Hadoop solves issues of hardware failure and combining data through replication of data blocks and a simple MapReduce programming model.
3) It gives a brief history of Hadoop originating from Doug Cutting's Nutch project and the influence of Google's papers on distributed file systems and MapReduce.
This document discusses challenges faced when running Hive at large scale at Yahoo. It describes how Yahoo runs Hive on 18 Hadoop clusters with over 40,000 nodes and 580 PB of data. Even with optimizations like Tez, ORC, and vectorization, Yahoo encountered slow queries, out-of-memory errors, and slow partition pruning for queries on tables with millions of partitions. Fixes involved throwing more hardware at the metastore, client-side tuning, and addressing memory leaks and inefficiencies in the metastore and filesystem cache.
The document provides an overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets. It describes how Hadoop uses HDFS for distributed file storage across clusters and MapReduce for parallel processing of data. Key components of Hadoop include HDFS for storage, YARN for resource management, and MapReduce for distributed computing. The document also discusses some popular Hadoop distributions and real-world uses of Hadoop by companies.
This document provides an overview of using Hadoop and MapReduce for ETL processes. It begins with brief introductions to Hadoop, HDFS, and MapReduce programming models. It then demonstrates a MapReduce job written in Java that analyzes weather data to find the maximum and minimum daily temperatures for each weather station. The document also discusses Hive and Pig for SQL-like querying of data in Hadoop and provides an example MapReduce job written in Java that performs a multi-step ETL process.
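The weather example in that document is a Java MapReduce job; as a rough Python equivalent, the sketch below expresses the same max/min-per-station logic as a Hadoop Streaming mapper and reducer in one file. The input layout (CSV with the station id in column 0 and the temperature in column 3) is an assumption.

```python
#!/usr/bin/env python
# Max/min daily temperature per weather station, as a Hadoop Streaming job.
# Rough Python equivalent of the Java example; the input layout is an assumption.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        try:
            station, temp = fields[0], float(fields[3])
        except (IndexError, ValueError):
            continue                      # skip malformed records
        print(f"{station}\t{temp}")

def reducer():
    # Streaming delivers the mapper output grouped and sorted by key.
    station, tmax, tmin = None, None, None
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        temp = float(value)
        if key != station:
            if station is not None:
                print(f"{station}\t{tmax}\t{tmin}")
            station, tmax, tmin = key, temp, temp
        else:
            tmax, tmin = max(tmax, temp), min(tmin, temp)
    if station is not None:
        print(f"{station}\t{tmax}\t{tmin}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

A typical invocation ships this script to the cluster and passes it as both the mapper and the reducer to the hadoop-streaming jar.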
Hadoop Vaidya is a tool that analyzes Hadoop job performance and provides targeted advice to address issues. It contains a set of diagnostic rules to detect performance problems by analyzing job execution counters. The rules can identify issues such as unbalanced reduce partitioning and map/reduce tasks reading HDFS files as side effects. When run on over 22,000 jobs at Yahoo, it found that 18.79% had unbalanced reduce partitioning and 91% had map/reduce tasks reading HDFS data unnecessarily.
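Vaidya's actual rules are Java classes evaluated against job-history counters; purely as an illustration of the idea, the Python sketch below implements an unbalanced-reduce-partitioning check over per-reducer input sizes. The threshold, the counter values, and the advice text are assumptions.

```python
# Illustrative diagnostic rule in the spirit of Hadoop Vaidya: flag jobs whose
# reduce input is unevenly partitioned. Thresholds and inputs are assumptions.

def unbalanced_reduce_partitioning(reduce_input_bytes, skew_threshold=2.0):
    """Return a finding if the largest reducer saw far more input than the average."""
    if not reduce_input_bytes:
        return None
    avg = sum(reduce_input_bytes) / len(reduce_input_bytes)
    if avg == 0:
        return None
    worst = max(reduce_input_bytes)
    skew = worst / avg
    if skew > skew_threshold:
        return (f"Unbalanced reduce partitioning: max reducer read {skew:.1f}x the "
                f"average ({worst:,} vs {avg:,.0f} bytes). Consider a better "
                f"partitioner or a combiner.")
    return None

# Example: per-reducer shuffle bytes pulled from job counters (hypothetical job).
job_counters = [1_200_000, 900_000, 1_000_000, 11_500_000, 950_000]
finding = unbalanced_reduce_partitioning(job_counters)
if finding:
    print(finding)
```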
Apache Drill [1] is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel technology. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Since its inception in mid-2012, Apache Drill has gained widespread interest in the community. In this talk we focus on how Apache Drill enables interactive analysis and query at scale. First we walk through typical use cases and then delve into Drill's architecture, the data flow, and the query languages and data sources supported.
[1] http://incubator.apache.org/drill/
Analyzing Real-World Data with Apache Drill (tshiran)
This document provides an overview of Apache Drill, an open source SQL query engine for analysis of both structured and unstructured data. It discusses how Drill allows for schema-free querying of data stored in Hadoop, NoSQL databases and other data sources using SQL. The document outlines some key features of Drill, such as its flexible data model, ability to discover schemas on the fly, and distributed execution architecture. It also presents examples of using Drill to analyze real-world data from sources like HDFS, MongoDB and more.
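To make the "query files in place with SQL" idea concrete, here is a minimal sketch that submits a query to Drill's REST endpoint from Python. The drillbit host, the file path, and the exact shape of the response handling are assumptions about a typical Drill setup; a JDBC/ODBC client or the Drill shell would work equally well.

```python
# Minimal sketch of querying Apache Drill over its REST API; the host, file path,
# and response handling are assumptions about a typical Drill setup.
import json
import urllib.request

query = {
    "queryType": "SQL",
    # Drill queries files in place; dfs is the default filesystem storage plugin.
    "query": "SELECT t.user_id, COUNT(*) AS events "
             "FROM dfs.`/data/logs/2015/01/events.json` t "
             "GROUP BY t.user_id LIMIT 10",
}

req = urllib.request.Request(
    "http://drillbit.example.com:8047/query.json",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

for row in result.get("rows", []):
    print(row)
```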
Hadoop, being a disruptive data processing framework, has made a large impact on the data ecosystems of today. Enabling business users to translate existing skills to Hadoop is necessary to encourage adoption and allow businesses to get value out of their Hadoop investment quickly. R, being a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem. With the advent of technologies such as RHadoop, optimizing R workloads for use on Hadoop has become much easier. This session will help you understand how RHadoop projects such as RMR and RHDFS work with Hadoop, and will show you examples of using these technologies on the Hortonworks Data Platform.
Scale 12x - Efficient Multi-tenant Hadoop 2 Workloads with YARN (David Kaiser)
Hadoop is about so much more than batch processing. With the recent release of Hadoop 2, there have been significant changes to how a Hadoop cluster uses resources. YARN, the new resource management component, allows for a more efficient mix of workloads across hardware resources, and enables new applications and new processing paradigms such as stream-processing. This talk will discuss the new design and components of Hadoop 2, and examples of Modern Data Architectures that leverage Hadoop for maximum business efficiency.
YARN - Hadoop Next Generation Compute Platform (Bikas Saha)
The presentation emphasizes the new mental model of YARN as the cluster OS, where one can write and run different applications on Hadoop in a cooperative multi-tenant cluster.
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran (MapR Technologies)
(1) The amount of data in the world is growing exponentially, with unstructured data making up over 80% of collected data by 2020. (2) Apache Drill provides data agility for Hadoop by enabling self-service data exploration through a flexible data model and schema discovery. (3) Drill allows business users to rapidly query diverse data sources like files, HBase tables, and Hive without requiring IT, through a simple SQL interface.
This document provides an overview of several advanced Hadoop topics, including:
- YARN, the resource manager that allocates resources and manages job scheduling in Hadoop. It uses a global ResourceManager and per-application ApplicationMasters.
- Testing HDFS I/O throughput with TestDFSIO, a tool that measures read and write performance through MapReduce jobs. It reports metrics like throughput and IO rates.
- The mrjob Python library, which provides a framework for writing multi-step MapReduce jobs in Python that can be run locally or on a Hadoop cluster. Sample code demonstrates defining a job class with mapper, reducer, and step methods.
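Expanding on the last bullet, the snippet below is a minimal two-step mrjob job in the style of the library's documentation: it counts words and then finds the most frequent one. The input data set is hypothetical; the job runs locally by default and on a Hadoop cluster with the appropriate runner option.

```python
# Minimal two-step mrjob sketch: count words, then find the most frequent one.
# Run locally with: python most_common_word.py input.txt
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

class MRMostCommonWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max),
        ]

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer_count_words(self, word, counts):
        # Route every (count, word) pair to a single key for the second step.
        yield None, (sum(counts), word)

    def reducer_find_max(self, _, count_word_pairs):
        # The largest (count, word) pair is the most common word.
        yield max(count_word_pairs)

if __name__ == "__main__":
    MRMostCommonWord.run()
```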
The document discusses the MapR Big Data platform and Apache Drill. It provides an overview of MapR's M7 which makes HBase enterprise-grade by eliminating compactions and enabling a unified namespace. It also describes Apache Drill, an interactive query engine inspired by Google's Dremel that supports ad-hoc queries across different data sources at scale through its logical and physical query planning. The document demonstrates simple queries and provides details on contributing to and using Apache Drill.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It was created to support applications handling large datasets operating on many servers. Key Hadoop technologies include MapReduce for distributed computing, and HDFS for distributed file storage inspired by Google File System. Other related Apache projects extend Hadoop capabilities, like Pig for data flows, Hive for data warehousing, and HBase for NoSQL-like big data. Hadoop provides an effective solution for companies dealing with petabytes of data through distributed and parallel processing.
The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.
This document discusses the integration of Apache Pig with Apache Tez. Pig provides a procedural scripting language for data processing workflows, while Tez is a framework for executing directed acyclic graphs (DAGs) of tasks. Migrating Pig to use Tez as its execution engine provides benefits like reduced resource usage, improved performance, and container reuse compared to Pig's default MapReduce execution. The document outlines the design changes needed to compile Pig scripts to Tez DAGs and provides examples and performance results. It also discusses ongoing work to achieve full feature parity with MapReduce and further optimize performance.
Introduction to Hadoop and Hadoop component (rebeccatho)
This document provides an introduction to Apache Hadoop, which is an open-source software framework for distributed storage and processing of large datasets. It discusses Hadoop's main components of MapReduce and HDFS. MapReduce is a programming model for processing large datasets in a distributed manner, while HDFS provides distributed, fault-tolerant storage. Hadoop runs on commodity computer clusters and can scale to thousands of nodes.
The document discusses linking the statistical programming language R with the Hadoop platform for big data analysis. It introduces Hadoop and its components like HDFS and MapReduce. It describes three ways to link R and Hadoop: RHIPE which performs distributed and parallel analysis, RHadoop which provides HDFS and MapReduce interfaces, and Hadoop streaming which allows R scripts to be used as Mappers and Reducers. The goal is to use these methods to analyze large datasets with R functions on Hadoop clusters.
This document provides an overview of 4 solutions for processing big data using Hadoop and compares them. Solution 1 involves using core Hadoop processing without data staging or movement. Solution 2 uses BI tools to analyze Hadoop data after a single CSV transformation. Solution 3 creates a data warehouse in Hadoop after a single transformation. Solution 4 implements a traditional data warehouse. The solutions are then compared based on benefits like cloud readiness, parallel processing, and investment required. The document also includes steps for installing a Hadoop cluster and running sample MapReduce jobs and Excel processing.
1. The document discusses the evolution of computing from mainframes to smaller commodity servers and PCs. It then introduces cloud computing as an emerging technology that is changing the technology landscape, with examples like Google File System and Amazon S3.
2. It discusses the need for large data processing due to increasing amounts of data from sources like the stock exchange, Facebook, genealogy sites, and scientific experiments.
3. Hadoop is introduced as a framework for distributed computing and reliable shared storage and analysis of large datasets using its Hadoop Distributed File System (HDFS) for storage and MapReduce for analysis.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
The document discusses a presentation about practical problem solving with Hadoop and Pig. It provides an agenda that covers introductions to Hadoop and Pig, including the Hadoop distributed file system, MapReduce, performance tuning, and examples. It discusses how Hadoop is used at Yahoo, including statistics on usage. It also provides examples of how Hadoop has been used for applications like log processing, search indexing, and machine learning.
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop (Sumeet Singh)
Hadoop has allowed us to move towards a unified source of truth for all of an organization's data. Managing data location, schema knowledge and evolution, fine-grained business-rules-based access control, and audit and compliance needs will become critical with increasing scale of operations.
In this talk, we will share an approach to tackling the above challenges. We will explain how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase, and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data. In addition, the approach allows us to open up easy ad hoc access to analyze and visualize data through SQL on Hadoop and popular BI tools. As we discuss our approach, we will also highlight how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage, and integrity.
URL: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38768
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi... (Sumeet Singh)
Cloud-based architectures of Hadoop have made it attractive for public cloud service providers to offer hosted Hadoop services and charge customers on a pay-for-what-you-use basis. For enterprises that have already adopted Hadoop, the data infrastructure has long been seen as a cost element in their budgets. As a result, enterprises thinking of adopting Hadoop are increasingly debating between on-premise and cloud-based models for their data processing needs.
We lay out a set of criteria and methodical approaches to help enterprises that have not yet adopted Hadoop evaluate their options, and discuss the pros and cons of both models. For enterprises that have already made significant investments or have plans to build a Hadoop-based infrastructure, we present an approach to manage Hadoop as a Service with a P&L, transparency in costs, and metering & billing provisions.
As we discuss these approaches, we will share insights gathered from the exercise conducted on one of the largest Hadoop footprints in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs for usage, measure the resource usage for services, optimize for higher utilization, and benchmark costs.
URL: http://strataconf.com/stratany2013/public/schedule/detail/30824
Sumeet Singh and Amrit Lal presented on costing big data operations at the 2014 Hadoop Summit. They discussed developing total cost of ownership models to understand resource usage and the costs associated with Hadoop infrastructure. They covered benchmarking costs and improving utilization and return on investment. Metrics included storage, compute, networking, and operational costs, which can be tracked for planning, transparency, and improved efficiency.
This document presents a maturity model for big data asset management with six levels: business monitoring, business insights, business excellence, insights monetization, business metamorphosis, and core business processes. It describes using data as an asset and applying analytics to business models at different degrees, from backward-looking to predictive to prescriptive analytics.
Beyond a Big Data Pilot: Building a Production Data Infrastructure - Stampede... (StampedeCon)
This document discusses building a production data infrastructure beyond a big data pilot project. It examines the data value chain from data acquisition to analytics. The key components discussed include data acquisition, ingestion, storage, data services, analytics, and data management. Various options for these components are explored, with considerations for batch, interactive and real-time workloads. The goal is to provide a framework for understanding the options and making choices to support different use cases at scale in a production environment.
The document discusses resource tracking for Hadoop and Storm clusters at Yahoo. It describes how Yahoo developed tools over three years to track resource usage at the application, cluster, queue, user and project levels. This includes capturing CPU and memory usage for Hadoop YARN applications and Storm topologies. The data is stored and made available through dashboards and APIs. Yahoo also calculates total cost of ownership for Hadoop and converts resource usage to estimated monthly costs for projects. This visibility into usage and costs helps with capacity planning, operational efficiency, and ensuring fairness across grid users.
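The conversion from metered usage to a monthly charge can be as simple as multiplying aggregate YARN allocation figures by unit rates from a TCO model. The Python sketch below shows the shape of such a calculation; the unit rates and the project's usage numbers are illustrative assumptions, not Yahoo's actual rates.

```python
# Minimal sketch of turning metered YARN usage into an estimated monthly charge.
# The unit rates and the project's usage numbers are illustrative assumptions.

# Unit rates derived from a TCO model (see the earlier costing sketch).
RATE_PER_GB_HOUR_MEMORY = 0.002   # $ per GB-hour of allocated container memory
RATE_PER_VCORE_HOUR     = 0.010   # $ per vcore-hour
RATE_PER_TB_MONTH_HDFS  = 18.00   # $ per TB-month of replicated HDFS storage

def monthly_cost(mb_seconds, vcore_seconds, hdfs_tb):
    """YARN reports aggregate memory as MB-seconds and CPU as vcore-seconds."""
    gb_hours = mb_seconds / 1024 / 3600
    vcore_hours = vcore_seconds / 3600
    return (gb_hours * RATE_PER_GB_HOUR_MEMORY
            + vcore_hours * RATE_PER_VCORE_HOUR
            + hdfs_tb * RATE_PER_TB_MONTH_HDFS)

# Hypothetical project usage for one month.
bill = monthly_cost(mb_seconds=9.5e12, vcore_seconds=1.1e9, hdfs_tb=350)
print(f"Estimated monthly cost: ${bill:,.0f}")
```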
Contexti / Oracle - Big Data: From Pilot to Production (Contexti)
The document discusses challenges in moving big data projects from pilots to production. It highlights that pilots have loose SLAs and focus on a few use cases and demonstrated insights, while production requires enforced SLAs, supporting many use cases and delivering actionable insights. Key challenges in the transition include establishing governance, skills, funding models and integrating insights into operations. The document also provides examples of technology considerations and common operating models for big data analytics.
A comprehensive overview of the security concepts in the open source Hadoop stack in mid 2015 with a look back into the "old days" and an outlook into future developments.
Hadoop meets Agile! - An Agile Big Data Model (Uwe Printz)
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
This document provides an overview of Apache Spark, including its core concepts, transformations and actions, persistence, parallelism, and examples. Spark is introduced as a fast and general engine for large-scale data processing, with advantages like in-memory computing, fault tolerance, and rich APIs. Key concepts covered include its resilient distributed datasets (RDDs) and lazy evaluation approach. The document also discusses Spark SQL, streaming, and integration with other tools.
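As a minimal PySpark illustration of the concepts that overview covers (RDDs, lazy transformations, actions, and persistence), consider the word-count sketch below; the input path is an assumption.

```python
# Minimal PySpark sketch: lazy transformations on an RDD, an action that triggers
# execution, and persistence. The input path is an assumption.
from pyspark import SparkContext

sc = SparkContext(appName="word-count-sketch")

lines = sc.textFile("hdfs:///data/sample.txt")          # RDD; nothing runs yet

counts = (lines.flatMap(lambda line: line.split())      # transformation (lazy)
               .map(lambda word: (word.lower(), 1))     # transformation (lazy)
               .reduceByKey(lambda a, b: a + b))        # transformation (lazy)

counts.persist()                                        # keep in memory for reuse

# Actions force evaluation of the whole lineage.
print("distinct words:", counts.count())
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

sc.stop()
```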
The document discusses building a big data maturity model for Sri Lanka's tourism industry. It proposes a 5-level model moving from basic enterprise infrastructure to optimized intelligent insights. Level 1 involves siloed data on volumes, velocities and varieties. Level 2 adds semantic connectivity through a linked travel ontology. Level 3 incorporates social layers for collaboration. Level 4 provides intelligent insights. Level 5 optimizes through scaling, tooling and a marketplace to connect users, devices, apps and data. The model is intended to help Sri Lanka's tourism industry better utilize big data.
The TCO Calculator - Estimate the True Cost of Hadoop (MapR Technologies)
http://bit.ly/1wsAuRS - There are many hidden costs for Apache Hadoop that have different effects across different Hadoop distributions. With the new MapR TCO calculator, organisations have a simple, fact-based tool to compare costs.
How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given some data volume estimates and a rough use case?
The presentation attempts to compare the different options available on AWS.
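A rough sketch of the kind of comparison such a presentation might make is shown below: an always-on EC2 cluster with data in HDFS versus a transient EMR cluster with data at rest in S3. Every price in it is a placeholder assumption, not a current AWS rate, and real estimates should use the published price list for the chosen region and instance types.

```python
# Rough monthly cost comparison for a Hadoop workload on AWS.
# Every price below is a placeholder assumption, not a quote.

data_tb = 50                 # data kept at any time
daily_batch_hours = 6        # cluster busy time per day for a transient cluster
nodes = 20

# Placeholder rates (replace with the current AWS price list).
EC2_NODE_HOUR = 0.50
EMR_SURCHARGE_PER_NODE_HOUR = 0.10
S3_PER_TB_MONTH = 23.0
EBS_PER_TB_MONTH = 100.0

# Option A: long-running cluster on EC2 with 3x-replicated data in HDFS on EBS.
always_on = nodes * EC2_NODE_HOUR * 24 * 30 + data_tb * 3 * EBS_PER_TB_MONTH

# Option B: transient EMR cluster, data at rest in S3, cluster up only for batches.
transient = (nodes * (EC2_NODE_HOUR + EMR_SURCHARGE_PER_NODE_HOUR)
             * daily_batch_hours * 30
             + data_tb * S3_PER_TB_MONTH)

print(f"Always-on EC2 cluster: ${always_on:,.0f}/month")
print(f"Transient EMR + S3:    ${transient:,.0f}/month")
```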
An IT operating model defines the framework an IT organization uses to interface with business, develop applications to meet requirements, and deliver services to customers. It establishes key elements like processes, governance, sourcing, service support, service delivery, and organizational structure. Designing an effective operating model is important because it provides a standard interface between business and IT, standardized functions and processes, and an approved view of how IT operates. The methodology to design an operating model involves conducting interviews with stakeholders to understand roles, functions, information exchange, processes, procedures, governance, and responsibilities.
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find use cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I’ll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems, and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
How to Build & Sustain a Data Governance Operating Model (DATUM LLC)
Learn how to execute a data governance strategy through creation of a successful business case and operating model.
Originally presented to an audience of 400+ at the Master Data Management & Data Governance Summit.
Visit www.datumstrategy.com for more!
This document discusses total cost of ownership considerations for Hadoop implementations. It outlines different deployment methods like on-premise Hadoop, Hadoop appliances, and Hadoop as a service through cloud providers. For on-premise implementations, it identifies key cost categories and provides a sample TCO calculation over 36 months. It also discusses factors for managing implementation risks from vendors and internal IT. The document concludes by outlining scenarios for when on-premise or Hadoop as a service may be preferable based on organizational needs and IT resources.
What it takes to run Hadoop at Scale: Yahoo! Perspectives (DataWorks Summit)
This document discusses considerations for scaling Hadoop platforms at Yahoo. It covers topics such as deployment models (on-premise vs. public cloud), total cost of ownership, hardware configuration, networking, software stack, security, data lifecycle management, metering and governance, and debunking myths. The key takeaways are that utilization matters for cost analysis, hardware becomes increasingly heterogeneous over time, advanced networking designs are needed to avoid bottlenecks, security and access management must be flexible, and data lifecycles require policy-based management.
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone into making the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security, and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base.
The document provides an overview of an experimentation platform built on Hadoop. It discusses experimentation workflows, why Hadoop was chosen as the framework, the system architecture, and challenges faced and lessons learned. Key points include:
- The platform supports A/B testing and reporting on hundreds of metrics and dimensions for experiments.
- Data is ingested from various sources and stored in Hadoop for analysis using technologies like Hive, Spark, and Scoobi.
- Challenges included optimizing joins and jobs for large datasets, addressing data skew, and ensuring job resiliency. Tuning configuration parameters and job scheduling helped improve performance.
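One common remedy for the join-skew problem mentioned in the bullets above is key salting: spreading a hot join key across several synthetic sub-keys so that no single task receives all of its rows. The PySpark sketch below shows the pattern; the table paths, column names, and salt factor are assumptions, and this is not necessarily the fix the eBay team applied.

```python
# Salted join sketch for skewed keys: spread a hot join key across N sub-keys.
# Paths, column names, and the salt factor are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salted-join-sketch").getOrCreate()
SALT = 16

events = spark.read.parquet("hdfs:///exp/events")      # large fact table, skewed on exp_id
metadata = spark.read.parquet("hdfs:///exp/metadata")   # smaller dimension keyed by exp_id

# Add a random salt to the big side, and replicate the small side across all salts.
events_salted = events.withColumn("salt", (F.rand() * SALT).cast("int"))
salts = spark.range(SALT).withColumnRenamed("id", "salt")
metadata_salted = metadata.crossJoin(salts)

joined = events_salted.join(metadata_salted, on=["exp_id", "salt"], how="inner")
joined.groupBy("exp_id").count().show()
```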
Experimentation plays a vital role in business growth at eBay by providing valuable insights and predictions on how users will react to changes made to the eBay website and applications. On a given day, eBay has several hundred experiments running at the same time. Our experimentation data processing pipeline handles billions of rows of user behavioral and transactional data per day to generate detailed reports covering 100+ metrics over 50 dimensions.
In this session, we will share our journey of how we moved this complex process from Data warehouse to Hadoop. We will give an overview of the experimentation platform and data processing pipeline. We will highlight the challenges and learnings we faced implementing this platform in Hadoop and how this transformation led us to build a scalable, flexible and reliable data processing workflow in Hadoop. We will cover our work done on performance optimizations, methods to establish resilience and configurability, efficient storage formats and choices of different frameworks used in the pipeline.
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10... (Sumeet Singh)
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. Hadoop’s scalability, efficiency, built-in reliability, and cost effectiveness have made it an enterprise-wide platform that web-scale cloud operations run on. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone into making the platform indispensable for nearly 1,000 active developers on a daily basis, including the challenges that come from scale, security, and multi-tenancy we have dealt with in the last several years of operating one of the largest Hadoop footprints in the world. We will cover the current technology stack that Yahoo has built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base at Yahoo. Throughout the talk, we will highlight relevant use cases from Yahoo’s Mobile, Search, Advertising, Personalization, Media, and Communications businesses that may make these considerations more pertinent to your situation.
Architecting a Scalable Hadoop Platform: Top 10 Considerations for Success (DataWorks Summit)
This document discusses 10 considerations for architecting a scalable Hadoop platform:
1. Choosing between on-premise or public cloud deployment.
2. Evaluating total cost of ownership which includes hardware, software, support and other recurring costs.
3. Configuring hardware including servers, storage, networking and heterogeneous resources.
4. Ensuring a high performance network backbone that avoids bottlenecks.
5. Maintaining a software stack that focuses on use cases over specific technologies.
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac... (Cloudera, Inc.)
Michael Sun presented on CBS Interactive's use of Hadoop for web analytics processing. Some key points:
- CBS Interactive processes over 1 billion web logs daily from hundreds of websites on a Hadoop cluster with over 1PB of storage.
- They developed an ETL framework called Lumberjack in Python for extracting, transforming, and loading data from web logs into Hadoop and databases.
- Lumberjack uses streaming, filters, and schemas to parse, clean, lookup dimensions, and sessionize web logs before loading into a data warehouse for reporting and analytics.
- Migrating to Hadoop provided significant benefits including reduced processing time, fault tolerance, scalability, and cost effectiveness compared to their
M.V. Rama Kumar has 3 years of experience in application development using Java and big data technologies like Hadoop. He has 1.6 years of experience using Hadoop components such as HDFS, MapReduce, Pig, Hive, Sqoop, HBase and Oozie. He has extensive experience setting up Hadoop clusters and processing large, structured and unstructured data.
The Common BI/Big Data Challenges and Solutions presented by seasoned experts, Andriy Zabavskyy (BI Architect) and Serhiy Haziyev (Director of Software Architecture).
This was a complimentary workshop where attendees had the opportunity to learn, network and share knowledge during the lunch and education session.
Architecting the Future of Big Data and Search (Hortonworks)
The document discusses the potential for integrating Apache Lucene and Apache Hadoop technologies. It covers their histories and current uses, as well as opportunities and challenges around making them work better together through tighter integration or code sharing. Developers and businesses are interested in ways to improve searching large amounts of data stored using Hadoop technologies.
Prashanth Shankar Kumar has over 8 years of experience in data analytics, Hadoop, Teradata, and mainframes. He currently works as a Hadoop Developer/Tech Lead at Bank of America where he develops Hive queries, Impala queries, MapReduce programs, and Oozie workflows. Previously he worked as a Hadoop Developer at State Farm Insurance where he installed and managed Hadoop clusters and developed solutions using Hive, Pig, Sqoop, and HBase. He has expertise in Teradata, SQL, Java, Linux, and agile methodologies.
How Pig and Hadoop fit in data processing architecture (Kovid Academy)
Pig, developed by Yahoo research in 2006, enables programmers to write data transformation programs for Hadoop quickly and easily without the cost and complexity of map-reduce programs.
This document summarizes a research paper on analyzing and visualizing Twitter data using the R programming language with Hadoop. The goal was to leverage Hadoop's distributed processing capabilities to support analytical functions in R. Twitter data was analyzed and visualized in a distributed manner using R packages that connect to Hadoop. This allowed large-scale Twitter data analysis and visualizations to be built as a R Shiny application on top of results from Hadoop.
Eliminating the Challenges of Big Data Management Inside Hadoop (Hortonworks)
Your Big Data strategy is only as good as the quality of your data. Today, deriving business value from data depends on how well your company can capture, cleanse, integrate and manage data. During this webinar, we discuss how to eliminate the challenges to Big Data management inside Hadoop.
Go over these slides to learn:
· How to use the scalability and flexibility of Hadoop to drive faster access to usable information across the enterprise.
· Why a pure-YARN implementation for data integration, quality and management delivers competitive advantage.
· How to use the flexibility of RedPoint and Hortonworks to create an enterprise data lake where data is captured, cleansed, linked and structured in a consistent way.
C-BAG Big Data Meetup Chennai Oct. 29, 2014 - Hortonworks and Concurrent on Casca... (Hortonworks)
The document discusses a Big Data Meetup organized by C-BAG (Chennai Big Data Analytic Group) on October 29, 2014 in Chennai. It provides details about two speakers, Dhruv Kumar from Concurrent Inc. and Vinay Shukla from Hortonworks, who will discuss reducing development time for production-grade Hadoop applications and Hortonworks' Hadoop platform respectively. The remainder of the document consists of presentation slides that cover topics including the modern data architecture with Hadoop, enterprise goals for data architecture, unlocking applications from new data types, and case studies.
Eric Baldeschwieler Keynote from Storage Developers Conference (Hortonworks)
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable storage of petabytes of data and large-scale computations across commodity hardware.
- Apache Hadoop is used widely by internet companies to analyze web server logs, power search engines, and gain insights from large amounts of social and user data. It is also used for machine learning, data mining, and processing audio, video, and text data.
- The future of Apache Hadoop includes making it more accessible and easy to use for enterprises, addressing gaps like high availability and management, and enabling partners and the community to build on it through open APIs and a modular architecture.
This document provides a summary of M.V. Rama Kumar's professional experience and qualifications. He has over 3 years of experience in application development using Java and big data technologies like Hadoop, HDFS, MapReduce, Apache Pig, Hive and Sqoop. Some of his key responsibilities have included writing Pig scripts to optimize job execution time, creating Hive tables and queries, and using Sqoop to transfer data between HDFS and relational databases. He is currently working as a Software Engineer with Tata Consultancy Services on projects involving XML analytics using Hadoop and sentiment analysis on customer data in the banking domain.
Similar to Hadoop Summit San Jose 2014: Costing Your Big Data Operations
This document discusses Hadoop at Yahoo, including:
- Yahoo has built a large multi-tenant Apache Hadoop deployment that powers many of its businesses and use cases.
- Over the years, Yahoo has scaled its Hadoop infrastructure significantly, now consisting of over 50,000 servers and 50PB of storage.
- Yahoo uses Hadoop for a wide range of use cases across advertising, search, personalization, anti-spam, and more, processing data at massive scales of billions of records daily.
Keynote Hadoop Summit San Jose 2017: Shaping Data Platform To Create Lasting... (Sumeet Singh)
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review (Sumeet Singh)
Over the past year, a lot of progress has been made in advancing the Apache Hadoop platform at Yahoo. We underwent a massive infrastructure consolidation to lower the platform TCO. CaffeOnSpark was open-sourced for distributed deep learning on existing infrastructure with a combination of CPU and GPU-based computing. Traditional compute on MapReduce continues to shift to Apache Tez and Apache Spark for lower processing time. Our internal security, multi-tenancy, and scale changes to Apache Storm got pushed to the community in Storm 0.10. Omid was open-sourced for managing transactions reliably on Apache HBase. Multi-tenancy with region groups, splittable META, ZooKeeper-less assignment manager, favored nodes with HDFS block placement, and support for humongous tables have taken Apache HBase scale to new heights. Dependency management in Apache Oozie for combinatorial, conditional, and optional processing gives increased flexibility to our data pipelines teams in maintaining SLAs. Focus on ease of use and onboarding improvements have brought in a whole new class of use cases and users to the platform. In this talk, we will provide a comprehensive overview of the platform technology stack, recent developments, metrics, and share thoughts on where things are headed when it comes to big data at Yahoo.
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In this talk, Sumeet Singh will present some of the recent innovations, open source contributions, and where things are headed when it comes to Hadoop at Yahoo.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable... Sumeet Singh
This document discusses lessons learned from building a scalable, self-serve, real-time, multi-tenant monitoring service at Yahoo. It describes transitioning from a classical architecture to one based on real-time big data technologies like Storm and Kafka. Key lessons include properly handling producer-consumer problems at scale, challenges of debugging skewed data, strategically managing multi-tenancy and resources, issues optimizing asynchronous systems, and not neglecting assumptions outside the application.
HUG Meetup 2013: HCatalog / Hive Data Out Sumeet Singh
Yahoo! Hadoop grid makes use of a managed service to get the data pulled into the clusters. However, when it comes to getting the data-out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and underlying storage format of data for the users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplished the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about the files, partitions and their location. We will also demo the data out capabilities, and go through other nice properties of the data out feature.
Presenter(s):
Sumeet Singh, Senior Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh
In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs have become critical with the increasing scale of operations.
In this talk, we will share our approach in tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach allows ever improving Hive performance to open up easy adhoc access to analyze and visualize data through SQL on Hadoop and popular BI tools.
As we discuss our approach, we will also highlight how it minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage, and integrity.
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Sumeet Singh
In this talk, we look at the YARN scheduler choices available today for Apache Hadoop 2 and discuss their pros and cons. We dive deeper into the Capacity Scheduler, providing a comprehensive overview of its various settings with examples from real large-scale Hadoop clusters to promote a broader understanding of schedulers' current state and the best practices in place today for queue nomenclature, planning, allocations, and ongoing management. We present detailed cluster, queue, and job behaviors from several different capacity management philosophies.
We then propose practical solutions, without any change to the scheduler or core Hadoop, that allow managing queue creation and capacity allocation while optimizing for cluster utilization and maintaining SLA guarantees. A unified queue nomenclature, along with admission and capacity re-allocation policies across BUs, applications, and clusters, makes service automation possible. Transparency in resources consumed allows for defining realistic SLA expectations. Finally, consistent application tagging completes the feedback loop, with SLAs observed through application-level reporting.
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo... Sumeet Singh
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo's Hadoop clusters. A key component that enables this efficient operation is data compression.
With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented.
The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! Sumeet Singh
The Hadoop project is an integral part of Yahoo!'s cloud infrastructure and is at the heart of many of Yahoo!'s important business processes. Sumeet Singh, the Head of Products for Cloud Services and Hadoop at Yahoo!, explains how Yahoo! leverages Hadoop and Cloud Platforms to process and serve Internet-scale data.
Yahoo! operates one of the world's largest private cloud infrastructures. Learn how technologies scale out for building enterprise-wide trusted platforms with tight SLAs.
URL: http://www.saptechnologyservice.com/track1.html
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! Sumeet Singh
Yahoo! has been using HBase for a long time in isolated instances, most notably for the personalization platform powering its homepage experiences. The introduction of multi-tenancy has lowered the barriers for all Hadoop users to use HBase. We will cover traditional use cases for HBase at Yahoo!, and new use cases as a result in content management, advertising, log processing, analytics and reporting, recommendation graphs, and dimension data stores.
We will then talk about the deployment strategy and the enhancements made to facilitate multi-tenancy: Region Server groups provide a coarse level of isolation among tenants by designating a subset of region servers to serve designated tables, while Namespaces provide logical grouping of resources (region servers, tables) and privileges (quotas, ACLs).
We'll also share our experiences in operating HBase with security enabled and contributions made in this area, and results from performance runs conducted to validate customer expectations in a multi-tenant environment.
URL: http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--multi-tenant-apache-hbase-at-yahoo-video.html
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
1. Costing Your Big Data Operations
PRESENTED BY Sumeet Singh, Amrit Lal ⎪ June 5, 2014
2014 Hadoop Summit, San Jose, California
2. Introduction
Sumeet Singh, Senior Director, Product Management, Hadoop and Big Data Platforms, Cloud Engineering Group, 701 First Avenue, Sunnyvale, CA 94089 USA (@sumeetksingh)
§ Manages the Hadoop products team at Yahoo!
§ Responsible for Product Management, Strategy and Customer Engagements
§ Managed the Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
§ MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
Amrit Lal, Product Manager, Hadoop and Big Data Platforms, Cloud Engineering Group, 701 First Avenue, Sunnyvale, CA 94089 USA (@amritasshwar)
§ Product Manager at Yahoo engaged in building high-class and robust Hadoop infrastructure services
§ Eight years of experience across HSBC, Oracle and Google in developing products and platforms for high-growth enterprises
§ MBA from Carnegie Mellon University
3. Agenda
1. Total Cost of Ownership (TCO) Models
2. Deeper Understanding of (Resource) Usage
3. P&L, Metering and Billing Provisions
4. Benchmark Costs
5. Improve Utilization and ROI
4. Why do Costing?
§ Profitability: understanding the data services costs (an element of your total project cost) to determine how profitable the project is
§ ROI: investment decisions both at the platform and app / project level
§ Operational Efficiency: benchmark and improve ops by focusing on average utilization, increasing the number of hosted apps, storage efficiencies, job performance, etc.
§ Planning: capital planning and budgeting, product improvements
§ Cost Transparency: metering / usage metrics, billing, chargeback / showback, P&L
5. Costing is Relevant Irrespective of the Service Model
Private Cloud:
§ Fixed costs that favor scale and 24x7 operations
§ Centralized operations
§ Multi-tenant clusters with security and data sharing
§ Cost a function of desired SLA
§ Utilization and the number of hosted apps a primary lever
§ Tenants often tend to ignore costs
Public Cloud:
§ Variable with usage and favors a run-and-done model
§ Decentralized operations; ops / headcount costs still relevant
§ Dedicated virtual clusters
§ Monthly bills!
§ Releasing cluster instances when not needed is a wise idea
§ Users often overlook the peripheral costs
6. Important with Multi-tenancy and Scale
[Chart: number of servers (up to ~45,000) and raw HDFS storage in PB (up to ~500) by year, 2006 through 2014]
Milestones called out along the growth curve include:
§ Yahoo! commits to scaling Hadoop for production use
§ Research workloads in Search and Advertising
§ Production (modeling) with machine learning & WebMap
§ Revenue systems with security, multi-tenancy, and SLAs
§ Open sourced with Apache
§ Hortonworks spinoff for enterprise hardening
§ Nextgen Hadoop (H 0.23 YARN)
§ New services (HBase, Storm, Hive etc.)
§ Increased user-base with partitioned namespaces
§ Apache H 2.x (low latency, util, HA etc.)
7. Hosted Apps Growth on Apache Hadoop
[Chart: number of new customer apps (projects) on-boarded, Q1-11 through Q1-14, with data labels 272, 330, 382, 495, and 525]
§ 58 projects in 2011
§ 52 projects in 2012
§ 113 projects in 2013
8. Multi-tenant Apache HBase Growth
Zero to "20" use cases (60,000 regions) in a year
[Chart: number of region servers and data stored in PB, Q1-13 through Q1-14, reaching 1,140 region servers and 33.6 PB stored]
10. Capital Deployment for Big Data Infrastructure
[Diagram: capital deployed across the clusters: Hadoop (NameNode, RM, and pools of DataNode / NodeManager hosts), HBase (NameNode, HBase Master, DataNodes / RegionServers), and Storm (Nimbus, Supervisors); shared administration, management and monitoring; ZooKeeper pools; HTTP / HDFS / GDM load proxies; Oozie Server and HS2 / HCat; applications and data (data feeds, data stores); network backplane]
11. Big Data Platforms Technology Stack at Yahoo
[Diagram: technology stack: compute services (YARN, MapReduce, Tez, Spark, Storm, Pig, Hive, Oozie, HCatalog), storage (HDFS, HBase), and infrastructure services (GDM, HDFS Proxy, ZooKeeper, Support Shop, Monitoring, Starling, Messaging Service)]
12. Resources Consumed in Big Data Operations
[Diagram: clusters in datacenters (Colo 1 with Rack 1 through Rack N) on one side; server resources (CPU, Memory, Storage, Bandwidth) on the other]
13. Elements of a TCO Model (ILLUSTRATIVE)
[Pie chart: a $2.1 M illustrative monthly TCO broken down across the seven components below, with shares of roughly 60%, 12%, 10%, 7%, 6%, 3%, and 2%]
1. Cluster Hardware: data nodes, name nodes, job trackers, gateways, load proxies, monitoring, aggregator, and web servers
2. R&D HC: headcount for platform software development, quality, and release engineering
3. Active Use and Operations (Recurring): recurring datacenter ops cost (power, space, labor support, and facility maintenance)
4. Network Hardware: aggregated network component costs, including switches, wiring, terminal servers, power strips etc.
5. Acquisition / Install (One-time): labor, POs, transportation, space, support, upgrades, decommissions, shipping / receiving etc.
6. Operations Engineering: headcount for service engineering and data operations teams responsible for day-to-day ops and support
7. Network Bandwidth: data transferred into and out of clusters for all colos, including cross-colo transfers
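The seven components simply add up to the monthly TCO. A minimal sketch of such an additive model follows; the dictionary keys and zero placeholders are ours, not figures from the deck:

```python
# Illustrative monthly TCO model: the total is the sum of the seven component
# costs listed above. All dollar figures are placeholders to be filled in.

MONTHLY_TCO_COMPONENTS = {
    "cluster_hardware": 0.0,        # 1: data/name nodes, job trackers, gateways, proxies, web servers
    "rnd_headcount": 0.0,           # 2: platform software development, quality, release engineering
    "active_use_and_ops": 0.0,      # 3: recurring datacenter ops (power, space, labor, facilities)
    "network_hardware": 0.0,        # 4: switches, wiring, terminal servers, power strips
    "acquisition_install": 0.0,     # 5: one-time labor, POs, transportation, space, decommissions
    "operations_engineering": 0.0,  # 6: service engineering and data operations headcount
    "network_bandwidth": 0.0,       # 7: data transferred into/out of clusters, incl. cross-colo
}

def monthly_tco(components: dict) -> float:
    """Total cost of ownership for the month is the sum of all components."""
    return sum(components.values())
```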
15. Unit Costs for Hadoop Operations
Compute: containers where apps can perform computation and access HDFS if needed. Unit: $ / GB-Hour (H 0.23/2.0), i.e., GBs of memory available for an hour. Unit cost = Monthly Compute Cost / Avail. Compute Capacity.
Storage: HDFS (usable) space needed by an app with the default replication factor of three. Unit: $ / GB stored, i.e., usable storage space (less replication and overheads). Unit cost = Monthly Storage Cost / Avail. Usable Storage.
Bandwidth: network bandwidth needed to move data into / out of the clusters by the app. Unit: $ / GB for inter-region data transfers against inter-region (peak) link capacity. Cost = [Monthly GB In + Out] x $ / GB.
Namespace: files and directories used by the apps, tracked to understand / limit the load on the NameNode (unit, capacity, and unit cost are N/A).
16. Working Through A Hadoop Example (ILLUSTRATIVE)
Compute: Monthly TCO (less bandwidth) = $2 M; compute @ 50% = $1 M. Monthly capacity: 315 TB of memory = 315 TB x 24 x 30 = 227 M GB-Hours. Unit cost: $1 M / 227 M GB-Hours = $0.004 / GB-Hour / Month.
Storage: Monthly TCO (less bandwidth) = $2 M; storage @ 50% = $1 M. Monthly capacity: raw HDFS = 200 PB; usable HDFS = [200 x 0.8 (20% overhead)] / 3 = 53.3 PB. Unit cost: $1 M / 53.3 PB = $0.019 / GB / Month.
Bandwidth: monthly charges = $0.1 M. Monthly capacity: total data in + out = 5 PB. Unit cost: $0.1 M / 5 PB = $0.02 / GB transferred.
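A minimal sketch of this unit-cost arithmetic, assuming the illustrative figures above; the function names and the decimal unit conversions (1 TB = 1,000 GB, 1 PB = 1,000,000 GB, matching the slide's arithmetic) are ours:

```python
def compute_unit_cost(monthly_tco_usd, compute_share, memory_tb):
    """$ per GB-hour per month: compute's share of the TCO over available GB-hours."""
    gb_hours = memory_tb * 1_000 * 24 * 30        # GBs of memory available over a 30-day month
    return monthly_tco_usd * compute_share / gb_hours

def storage_unit_cost(monthly_tco_usd, storage_share, raw_hdfs_pb,
                      overhead=0.20, replication=3):
    """$ per GB per month: storage's share of the TCO over usable (post-overhead, post-replication) GBs."""
    usable_gb = raw_hdfs_pb * 1_000_000 * (1 - overhead) / replication
    return monthly_tco_usd * storage_share / usable_gb

def bandwidth_unit_cost(monthly_charges_usd, data_in_out_pb):
    """$ per GB transferred: monthly bandwidth charges over total GBs moved in and out."""
    return monthly_charges_usd / (data_in_out_pb * 1_000_000)

# Reproduces the worked example: ~$0.004/GB-hour, ~$0.019/GB, $0.02/GB.
print(compute_unit_cost(2_000_000, 0.5, 315))
print(storage_unit_cost(2_000_000, 0.5, 200))
print(bandwidth_unit_cost(100_000, 5))
```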
17. Measuring Hadoop Resource Consumption
Compute: Map GB-Hours = GB(M1) x T(M1) + GB(M2) x T(M2) + …; Reduce GB-Hours = GB(R1) x T(R1) + GB(R2) x T(R2) + …; Cost = (M + R) GB-Hours x $0.004 / GB-Hour / Month = $ for the job / month. Monthly roll-ups: (M + R) GB-Hours for all jobs can be summed up for the month for a user, app, BU, or the entire platform.
Storage: /project (app) directory quota in GB (peak monthly storage used); /user directory quota in GB (peak monthly storage used); /data is accounted for with each user accountable for their portion of use, e.g., GB Read (U1) / [GB Read (U1) + GB Read (U2) + …]. Monthly roll-ups through the relationship among user, file ownership, app, and their BU.
Bandwidth: measured at the cluster level and divided among select apps and users of data based on average volume in / out. Monthly roll-ups through the relationship among user, app, and their BU.
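The per-job compute metering described above can be sketched as follows; the (container GB, runtime hours) task tuples and the roll-up structure are illustrative assumptions, not the schema of any actual metering pipeline:

```python
from collections import defaultdict

COMPUTE_RATE = 0.004  # $ per GB-hour per month, from the unit-cost example above

def job_gb_hours(tasks):
    """tasks: iterable of (container_gb, runtime_hours) for all map and reduce tasks of a job."""
    return sum(gb * hours for gb, hours in tasks)

def monthly_rollup(jobs):
    """jobs: iterable of (bu, tasks). Returns the compute cost per BU for the month."""
    cost = defaultdict(float)
    for bu, tasks in jobs:
        cost[bu] += job_gb_hours(tasks) * COMPUTE_RATE
    return dict(cost)

# A job with two 2 GB mappers running 0.5 h each and one 4 GB reducer running 1 h
# consumes 2*0.5 + 2*0.5 + 4*1 = 6 GB-hours, i.e. ~$0.024 at $0.004/GB-hour.
print(monthly_rollup([("BU1", [(2, 0.5), (2, 0.5), (4, 1.0)])]))
```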
19. Measuring Hadoop Resource Consumption
[Screenshot: SLA Dashboard on the Hadoop Analytics Warehouse]
20. Putting it Together for Hadoop Services (ILLUSTRATIVE)
Hadoop Services Billing Rate Card [Monthly Rates]:
§ HDFS (Storage): unit GB; measured as monthly peak storage used; $0.019 / GB
§ Compute: unit Map-Reduce GB-Hours; measured as the number of GBs used by mappers and reducers and the hours they ran for; $0.004 / GB-Hour
§ Network Bandwidth: unit GB; measured as monthly total in / out; $0.02 / GB
Monthly Bill for May 2014 (per BU):
§ BU1: HDFS 15 PB used (3.45 PB effective, $0.065 M); compute 12.5 M GB-hours ($0.05 M); bandwidth 1.25 PB ($0.025 M); total $0.15 M
§ BU2: HDFS 10 PB used (2.65 PB effective, $0.05 M); compute 6.25 M GB-hours ($0.025 M); bandwidth 0.5 PB ($0.01 M); total $0.085 M
§ … (through BU N)
§ Total: HDFS 148 PB used (39.5 PB effective, $0.75 M); compute 125 M GB-hours ($0.5 M); bandwidth 5 PB ($0.1 M); total $1.35 M
21. Multi-Tenant Deployment For Apache HBase
[Diagram: projects X, Y & Z share a pool of region servers behind a single HMaster and ZooKeeper; each shared region server (1 through N) hosts regions of tables X, Y, and Z in one RegionServer JVM, with HDFS reads / writes underneath]
22. Understanding Apache HBase Resources
[Diagram: region-level reads and writes are served from the RegionServer JVM (heap), while each region's HFiles live in HDFS storage (disk)]
Reads: Total Reads @ RS = Reads (Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N); Read Share (X) = Total Table X reads / Total Table (X, Y, Z) reads.
Writes: Total Writes @ RS = Writes (Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N); Write Share (X) = Total Table X writes / Total Table (X, Y, Z) writes.
Data Stored: Total Table Data @ RS = Table X: Reg 1 + Table X: Reg 2 + … + Table Z: Reg N.
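The read, write, and data-stored shares above can be computed from per-region counters on each region server. A minimal sketch with an assumed metrics layout (the field names are ours):

```python
def table_shares(region_metrics):
    """
    region_metrics: list of dicts such as
      {"table": "X", "reads": 120, "writes": 40, "hfile_gb": 512}
    with one entry per region hosted on the region server.
    Returns each table's share of total reads, writes, and stored data on that server.
    """
    totals = {"reads": 0, "writes": 0, "hfile_gb": 0}
    per_table = {}
    for region in region_metrics:
        t = per_table.setdefault(region["table"], {"reads": 0, "writes": 0, "hfile_gb": 0})
        for key in totals:
            t[key] += region[key]
            totals[key] += region[key]
    return {
        table: {key: (vals[key] / totals[key] if totals[key] else 0.0) for key in totals}
        for table, vals in per_table.items()
    }
```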
23. Unit Costs for HBase Operations
Writes: write operations performed on the Region Server while writing to individual table regions. Unit: $ / 1000 writes against total write operations across Region Servers. Unit cost = Monthly Write TCO / Total Write Ops (K).
Reads: read operations performed on the Region Server while reading from individual table regions. Unit: $ / 1000 reads against total read operations across Region Servers. Unit cost = Monthly Read TCO / Total Read Ops (K).
Storage: HDFS (usable) space needed by a table region's HFiles with the default replication factor. Unit: $ / GB stored against usable storage space (less replication and overheads). Unit cost = Monthly Storage Cost / Avail. Usable Storage.
Bandwidth: network bandwidth needed to move data into / out of the clusters by clients. Unit: $ / GB for inter-region data transfers against inter-region (peak) link capacity. Cost = Monthly GB [In + Out] x $ / GB.
24. Working Through An HBase Example (ILLUSTRATIVE)
Writes: Monthly TCO (less bandwidth) = $60 K; write serving @ 25% = $15 K. Monthly capacity: total write operations across Region Servers = 100 M. Unit cost: $15 K / 100 M = $0.15 per 1000 writes per month.
Reads: Monthly TCO (less bandwidth) = $60 K; read serving @ 25% = $15 K. Monthly capacity: total read operations across Region Servers = 200 M. Unit cost: $15 K / 200 M = $0.075 per 1000 reads per month.
Storage: Monthly TCO (less bandwidth) = $60 K; storage @ 50% = $30 K. Monthly capacity: raw HDFS = 10 PB; usable HDFS = [10 x 0.8 (20% overhead)] / 3 = 2.67 PB. Unit cost: $30 K / 2.67 PB = $0.011 / GB / Month.
Bandwidth: monthly charges = $5 K. Monthly capacity: total data in + out = 0.25 PB. Unit cost: $5 K / 0.25 PB = $0.02 / GB transferred.
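A sketch of the HBase unit-cost arithmetic, mirroring the worked example above; the 25/25/50 split across write serving, read serving, and storage comes from the slide, while the function names are ours:

```python
def per_1000_ops_cost(monthly_tco_usd, serving_share, total_ops):
    """$ per 1,000 read or write operations per month."""
    return monthly_tco_usd * serving_share / total_ops * 1000

def hbase_storage_unit_cost(monthly_tco_usd, storage_share, raw_hdfs_pb,
                            overhead=0.20, replication=3):
    """$ per GB per month of usable HBase storage (HFiles on HDFS)."""
    usable_gb = raw_hdfs_pb * 1_000_000 * (1 - overhead) / replication
    return monthly_tco_usd * storage_share / usable_gb

# Reproduces the worked example: $0.15 / 1000 writes, $0.075 / 1000 reads, ~$0.011 / GB.
print(per_1000_ops_cost(60_000, 0.25, 100_000_000))   # writes
print(per_1000_ops_cost(60_000, 0.25, 200_000_000))   # reads
print(hbase_storage_unit_cost(60_000, 0.50, 10))
```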
25. Measuring HBase Resource Consumption
Writes: write ops per Region Server per table region = #W(R1:RS1) + #W(R2:RS1) + …; Cost = total writes x $0.15 / 1000 writes / month = $ for the table / RS / month. Monthly roll-ups: write ops cost for all tables across all region servers for a user, app, BU, or the platform.
Reads: read ops per Region Server per table region = #R(R1:RS1) + #R(R2:RS1) + …; Cost = total reads x $0.075 / 1000 reads / month = $ for the table / RS / month. Monthly roll-ups: read ops cost for all tables across all region servers for a user, app, BU, or the platform.
Storage: HDFS size of regions under hbase/table/<regions> in GBs; Cost = total HDFS size x $0.011 / GB / month = $ for the table / month. Monthly roll-ups: total HDFS size for all tables across all region servers for a user, app, BU, or the platform.
Bandwidth: measured at the cluster level and divided among select apps and users of data based on average volume in / out. Roll-ups through the relationship among user, app, and their BU.
26. Putting it Together for HBase Services (ILLUSTRATIVE)
HBase Services Billing Rate Card [Monthly Rates]:
§ Write Operations: unit count of operations; measured as monthly total write operations across regions of a table; $0.15 / 1000 writes
§ Read Operations: unit count of operations; measured as monthly total read operations across regions of a table; $0.075 / 1000 reads
§ HDFS (Storage): unit GB; measured as monthly peak storage used; $0.011 / GB
§ Network Bandwidth: unit GB; measured as monthly total in / out; $0.02 / GB
Monthly Bill for May 2014 (per BU):
§ BU 1: writes 30 M ($4.5 K); reads 20 M ($1.5 K); HDFS 3 PB used (0.8 PB effective, $8.80 K); bandwidth 1.25 PB ($0.025 K); total $14.82 K
§ BU 2: writes 10 M ($1.5 K); reads 60 M ($4.5 K); HDFS 1 PB used (0.27 PB effective, $2.93 K); bandwidth 0.5 PB ($0.01 K); total $8.94 K
§ … (through BU N)
§ Total: writes 100 M ($15 K); reads 200 M ($15 K); HDFS 10 PB used (2.67 PB effective, $29.4 K); bandwidth 0.25 PB ($5 K); total $64.4 K
27. Multi-Tenant Deployment For Apache Storm
[Diagram: topologies X, Y & Z share a pool of supervisors behind Nimbus and ZooKeeper; each shared supervisor (1 through N) runs a worker for topologies X, Y, and Z]
28. Understanding Apache Storm Resources
[Diagram: a supervisor with fixed worker slots running a worker for Topology A and a worker for Topology B, each worker executing several tasks]
§ A supervisor runs one or more worker processes for one or more topologies
§ Each supervisor has a fixed number of worker slots
§ A worker process belongs to a specific topology
§ Workers from topologies are distributed randomly across supervisors
§ Tasks perform the actual data processing
29. Unit Costs for Storm Operations
Compute: worker slots where topology workers execute the actual logic / tasks of spouts and bolts in parallel. Unit: $ / Slot-Hour against the total number of slots (monthly slots used vs. available slots).
Bandwidth: network bandwidth needed to move data into / out of the clusters by topologies. Unit: $ / GB for inter-region data transfers against inter-region (peak) link capacity. Cost = [Monthly GB In + Out] x $ / GB.
30. Working Through a Storm Example (ILLUSTRATIVE)
Compute: Monthly TCO (less bandwidth) = $30 K; 24 slots per supervisor @ 100% = $30 K. Monthly capacity: 19.2 K slots = 19.2 K x 24 x 30 = 13.8 M slot-hours. Unit cost: $30 K / 13.8 M slot-hours = $0.002 / Slot-Hour / Month.
Bandwidth: monthly charges = $2.5 K. Monthly capacity: total data in + out = 0.12 PB. Unit cost: $2.5 K / 0.12 PB = $0.02 / GB transferred.
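A sketch of the Storm unit-cost arithmetic above; the slot count, hours, and TCO figure come from the example, and the names are ours:

```python
def slot_hour_unit_cost(monthly_tco_usd, total_slots, hours_per_month=24 * 30):
    """$ per worker slot-hour per month: compute TCO over available slot-hours."""
    return monthly_tco_usd / (total_slots * hours_per_month)

# 19.2 K slots * 720 hours = ~13.8 M slot-hours; $30 K / 13.8 M = ~$0.002 per slot-hour.
print(slot_hour_unit_cost(30_000, 19_200))
```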
31. Measuring Storm Resource Consumption (ILLUSTRATIVE)
Compute: worker slot-hours for topologies = #W(TP1) x T(TP1) + #W(TP2) x T(TP2) + …; Cost = worker slot-hours x $0.002 / Slot-Hour / Month = $ for the topology / month. Monthly roll-ups: worker slot-hours for all topologies can be summed up for the month for a user, app, BU, or the entire platform.
Bandwidth: measured at the cluster level and divided among select apps and users of data based on average volume in / out. Roll-ups through the relationship among user, app, and their BU.
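The topology metering above can be sketched as follows; the (workers, hours) input shape is an illustrative assumption:

```python
SLOT_HOUR_RATE = 0.002  # $ per slot-hour per month, from the unit-cost example above

def topology_monthly_cost(worker_runs, rate=SLOT_HOUR_RATE):
    """worker_runs: iterable of (num_workers, hours_run) intervals for one topology over the month."""
    slot_hours = sum(workers * hours for workers, hours in worker_runs)
    return slot_hours * rate

# A topology that ran 10 workers for the full 720-hour month used 7,200 slot-hours (~$14.40);
# topology costs can then be summed per user, app, or BU exactly as with MapReduce jobs.
print(topology_monthly_cost([(10, 720)]))
```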
32. Putting it Together for Storm Services (ILLUSTRATIVE)
Storm Services Billing Rate Card [Monthly Rates]:
§ Compute: unit worker slot-hours; measured as the number of slots used by topology workers and the hours they ran for; $0.002 / Slot-Hour
§ Network Bandwidth: unit GB; measured as monthly total in / out; $0.02 / GB
Monthly Bill for May 2014 (per BU):
§ BU1: compute 2.5 M slot-hours ($5 K); bandwidth 0.02 PB ($0.4 K); total $5.4 K
§ BU2: compute 1.25 M slot-hours ($2.5 K); bandwidth 0.04 PB ($0.8 K); total $3.3 K
§ … (through BU N)
§ Total: compute 10 M slot-hours ($20 K); bandwidth 0.12 PB ($2.4 K); total $22.4 K
33. Project Based Costing for Grid Services (ILLUSTRATIVE)
Project summary: Grid Services cost for May 2014 = $165.5 K.
Project usage details (Data Center DC1):
Apache Hadoop Services: $126 K
§ Compute (Map & Reduce GB-Hours consumed @ $0.004 / GB-Hour): 12.5 M, $50 K
§ Storage (GBs of peak storage used @ $0.019 / GB): 3.45 PB, $66 K
§ Network (GBs in / out @ $0.02 / GB): 0.5 PB, $10 K
Apache HBase Services: $34.1 K
§ Reads (number of read operations @ $0.075 / 1000 reads): 30 M, $2.2 K
§ Writes (number of write operations @ $0.15 / 1000 writes): 20 M, $3.0 K
§ Storage (GBs of peak storage used @ $0.011 / GB): 2.45 PB, $26.9 K
§ Network (GBs in / out @ $0.02 / GB): 0.1 PB, $2 K
Apache Storm Services: $5.4 K
§ Compute (slot-hours consumed @ $0.002 / Slot-Hour): 2.5 M, $5 K
§ Network (GBs in / out @ $0.02 / GB): 0.02 PB, $0.4 K
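A project-level bill like the one above is just the measured usage multiplied by the rate-card entries. A sketch using the illustrative rates from the preceding slides; the key names and input structure are ours:

```python
# Illustrative monthly rate card, consolidating the Hadoop, HBase, and Storm rate cards above.
RATES = {
    "hadoop_compute_gb_hour": 0.004,
    "hadoop_storage_gb":      0.019,
    "hbase_reads_per_1000":   0.075,
    "hbase_writes_per_1000":  0.15,
    "hbase_storage_gb":       0.011,
    "storm_slot_hour":        0.002,
    "network_gb":             0.02,
}

def project_bill(usage):
    """usage: dict keyed like RATES with measured quantities (GB-hours, GBs, ops in thousands, slot-hours)."""
    return {k: qty * RATES[k] for k, qty in usage.items()}

# The Hadoop portion of the illustrative project above: ~$50 K + ~$66 K + $10 K.
bill = project_bill({
    "hadoop_compute_gb_hour": 12_500_000,   # 12.5 M GB-hours
    "hadoop_storage_gb":      3_450_000,    # 3.45 PB peak storage used
    "network_gb":             500_000,      # 0.5 PB in/out
})
print(bill, sum(bill.values()))
```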
34. Platform P&L (ILLUSTRATIVE, values left blank on purpose)
Line items, tracked by quarter (Q4'12, Q1'13, Q2'13, Q3'13) with totals and total %:
§ Y! Gross Revenues
§ Cost of revenues (less Grid CapEx)
§ Gross Profit
§ Grid OpEx: R&D headcount, SE&O headcount, acquisition / install, active use / ops, network bandwidth; Total Grid OpEx
§ Grid CapEx: Grid Services; Total Grid CapEx
§ Contribution Margin
§ Indirect costs: G&A, Sales and Marketing
35. Hadoop Cost Benchmarking – An Approach (ILLUSTRATIVE)
Quantity equivalent (monthly):
§ M/R: 71.4 M used, 61.6 M unused, 133 M total; public cloud equivalent: compute instances (normalized for time, RAM, 32/64-bit ops, I/O etc.) at 1,000 instances / hr.
§ HDFS: 148 PB used, 52 PB unused, 200 PB total; public cloud equivalent: storage (accounting for 3x replication and job / app space) of 30 PB / month
§ Avg. data processed: 75 PB total; public cloud equivalent: instance storage of 2.5 PB daily
Cost equivalent (monthly):
§ M/R: $0.50 M used, $0.50 M unused, $1 M total; public cloud: 1,000 x $0.70 / instance / hr. x 24 x 30 = $0.5 M
§ HDFS: $0.75 M used, $0.25 M unused, $1 M total; public cloud: 30 PB x $0.04 / GB / month = $1.2 M
§ Other costs (if any) such as reads, writes, data services / hour etc.: $0.25 M
§ Total*: $1.25 M used, $0.75 M unused, $2 M total on-premise; public cloud total $1.95 M
* Bandwidth ignored, assumed equivalent
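A sketch of the benchmarking arithmetic: translate on-premise usage into public-cloud instance-hours and storage, then price both. The prices and sizes below are the illustrative figures from this slide, not quotes from any provider:

```python
def public_cloud_equivalent_cost(instances_per_hour, price_per_instance_hour,
                                 storage_pb_per_month, price_per_gb_month,
                                 other_costs=0.0, hours_per_month=24 * 30):
    """Monthly cost of running the used (on-premise-equivalent) workload on a public cloud."""
    compute = instances_per_hour * price_per_instance_hour * hours_per_month
    storage = storage_pb_per_month * 1_000_000 * price_per_gb_month
    return compute + storage + other_costs

# 1,000 instances/hr at $0.70, 30 PB/month at $0.04/GB, plus $0.25 M of other costs:
# ~$1.95 M, versus the $1.25 M "used" share of the on-premise monthly TCO.
cloud = public_cloud_equivalent_cost(1_000, 0.70, 30, 0.04, other_costs=250_000)
print(f"public cloud equivalent: ${cloud / 1e6:.2f} M")
```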
36. HBase and Storm Cost Benchmarking (ILLUSTRATIVE)
Quantity equivalent, HBase:
§ Reads: peak concurrent reads for a given record size, 300 MB/s; reads on chosen instances (benchmarks 45 MB/s): 300 / 45 = 7 instances
§ Writes: peak concurrent writes for a given record size, 160 MB/s; writes on chosen instances (benchmarks 10 MB/s): 160 / 10 = 16 instances
§ Storage: data storage in tables (incl. replication), 1.6 TB; data served per instance (benchmarks 0.5 TB incl. repl.): 1.6 / 0.5 = 3 instances
§ Instances required based on throughput and storage needs: 16 instances / hour; cost calculations stay the same as Hadoop
Quantity equivalent, Storm:
§ Slot-hours: 2.5 M slot-hours per month; instance-hours based on memory and CPU requirements (12 slots / instance): 0.21 M instance-hours; cost calculations stay the same as Hadoop
* Bandwidth ignored, assumed equivalent
37. Improving Utilization favors the on-premise setup
[Chart: cost ($) vs. utilization / consumption (compute and storage) for on-premise Hadoop as a Service (high starting cost, then scaling up), an on-demand public cloud service, and a terms-based public cloud service; below the crossover points the public cloud services are favored, above them on-premise Hadoop as a Service is favored]
Sensitivity analysis on costs based on current and expected (or target) utilization can provide further insights into your operations and cost competitiveness.
38. Improving Utilization improves ROI
[Chart: cost amortized over apps ($) vs. time, from Phase I through 2012-2013 (H 0.23) into 2014 & future, as the number of apps continues to grow on the platform; cost falls from C at time t to C' at time t']
At time t, BU profits are R(t) - C(t) = π(t). The platform's goal is to continue to increase ROI while supporting new technology and services: R(t') - C(t') = π(t'), where C(t') < C(t) and π(t') > π(t) for the same or bigger revenues.
39. Going Forward
Hadoop (YARN):
§ CPU as a resource
§ Pre-emption and priority
§ Long-running jobs
§ Other potential resources such as disk, network, GPUs etc.
§ Tez as the execution engine / container reuse
HBase:
§ Multiple Region Servers per node
§ Larger JVMs / GC improvements
§ HBase-on-YARN
§ cgroup profiles
Storm:
§ Storm-on-YARN
§ Resource-aware scheduling (memory, CPU, network)
§ cgroup profiles
§ More experience with multi-tenancy
Notes: co-exist with HBase to share compute and memory, using cgroup profiles at the Storm JVM level and topology worker level; resource-aware scheduling (memory & CPU) on YARN.