Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about GemFire XD at: http://www.gopivotal.com/big-data/pivotal-hd
A Secure Public Cache for YARN Application Resources (DataWorks Summit)
This document discusses YARN's shared cache feature for application resources. It provides an overview of how YARN localizes resources for each application and its containers. The shared cache aims to address inefficiencies in this process by caching identical resources on NodeManagers and sharing them between applications and containers. The design goals are for the shared cache to be scalable, secure, fault-tolerant and transparent. It works by having a shared cache client interface with a shared cache manager that maintains metadata and persisted resources. This can significantly reduce data transfer and localization costs for applications that reuse common resources.
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses Google's MapReduce programming model and Google File System for reliability. The Hadoop architecture includes a distributed file system (HDFS) that stores data across clusters and a job scheduling and resource management framework (YARN) that allows distributed processing of large datasets in parallel. Key components include the NameNode, DataNodes, ResourceManager and NodeManagers. Hadoop provides reliability through replication of data blocks and automatic recovery from failures.
Ted Dunning presents information on Drill and Spark SQL. Drill is a query engine that operates on batches of rows in a pipelined and optimistic manner, while Spark SQL provides SQL capabilities on top of Spark's RDD abstraction. The document discusses the key differences in their approaches to optimization, execution, and security. It also explores opportunities for unification by allowing Drill and Spark to work together on the same data.
Scaling HDFS to Manage Billions of Files with Key-Value Stores (DataWorks Summit)
The document discusses scaling HDFS to manage billions of files. It describes how HDFS usage has grown from millions of files in 2007 to potentially billions of files in the future. To address this, the speakers propose storing HDFS metadata in a key-value store like LevelDB instead of solely in memory. They evaluate this approach and find comparable performance to HDFS for most operations. Future work includes improving operations like compaction and failure recovery in the new architecture.
Hadoop is an open source framework for distributed storage and processing of large datasets across commodity hardware. It has two main components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing. HDFS stores data across clusters in a redundant and fault-tolerant manner. MapReduce allows distributed processing of large datasets in parallel using map and reduce functions. The architecture aims to provide reliable, scalable computing using commodity hardware.
The document provides an overview of MapR's distributed file system and improvements over traditional Hadoop implementations. Key points include:
- MapR partitions files into containers that are distributed across nodes, improving performance over HDFS which requires multiple copies.
- MapReduce on MapR is faster through direct RPC to receivers during shuffling, very wide merges, and leveraging the distributed file system.
- Benchmark results show MapR outperforming Hadoop on streaming workloads, TeraSort, HBase random reads, and small file creation rates.
- The container architecture is said to scale to exabyte-sized clusters with modest memory requirements for metadata caching.
This document summarizes Syncsort's high-performance data integration solutions for Hadoop. Syncsort has over 40 years of experience innovating performance solutions. Their DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS 6x faster than native methods. DMExpress also enhances usability with a graphical interface and accelerates MapReduce jobs by replacing sort functions. Customers report TCO reductions of 50-75% and ROI within 12 months by using DMExpress to optimize their Hadoop deployments.
- Hadoop was created to allow processing of large datasets in a distributed, fault-tolerant manner. It was originally developed by Doug Cutting and Mike Cafarella at Nutch in response to the growing amounts of data and computational needs at Google and other companies.
- The core of Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for distributed processing. It also includes utilities like Hadoop Common for file system access and other basic functionality.
- Hadoop's goals were to process multi-petabyte datasets across commodity hardware in a reliable, flexible and open source way. It assumes failures are expected and handles them to provide fault tolerance.
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging and so on. This talk is based on the research for the upcoming second edition of the speaker's HBase book, combined with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
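As a taste of those basic premises, here is a minimal sketch using the standard HBase Java client API from the 1.x line the talk covers; the table name "metrics" and column family "d" are assumptions for illustration and must already exist:

```java
// Minimal HBase client round-trip: write one cell, read it back.
// Assumes hbase-site.xml on the classpath and an existing table
// "metrics" with column family "d" (illustrative names).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHelloWorld {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("metrics"))) {

      // Write one cell: row "host1", column d:cpu, value "0.93".
      Put put = new Put(Bytes.toBytes("host1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("cpu"), Bytes.toBytes("0.93"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("host1")));
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("cpu"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```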
This document provides an overview and introduction to Hadoop, HDFS, and MapReduce. It covers the basic concepts of HDFS, including how files are stored in blocks across data nodes, and the role of the name node and data nodes. It also explains the MapReduce programming model, including the mapper, reducer, and how jobs are split into parallel tasks. The document discusses using Hadoop from the command line and writing MapReduce jobs in Java. It also mentions some other projects in the Hadoop ecosystem like Pig, Hive, HBase and Zookeeper.
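For readers new to the mapper/reducer model described here, a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API may help; the input and output paths are assumed to arrive as command-line arguments:

```java
// Classic MapReduce word count: the mapper emits (word, 1) pairs,
// the framework shuffles them by key, and the reducer sums each word's counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word after the shuffle phase.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically submitted from the command line with `hadoop jar`; the jar name and HDFS paths are placeholders.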
This document discusses the integration of Apache Pig with Apache Tez. Pig provides a procedural scripting language for data processing workflows, while Tez is a framework for executing directed acyclic graphs (DAGs) of tasks. Migrating Pig to use Tez as its execution engine provides benefits like reduced resource usage, improved performance, and container reuse compared to Pig's default MapReduce execution. The document outlines the design changes needed to compile Pig scripts to Tez DAGs and provides examples and performance results. It also discusses ongoing work to achieve full feature parity with MapReduce and further optimize performance.
The document provides an overview of Hadoop and its ecosystem. It discusses the history and architecture of Hadoop, describing how it uses distributed storage and processing to handle large datasets across clusters of commodity hardware. The key components of Hadoop include HDFS for storage, MapReduce for processing, and additional tools like Hive, Pig, HBase, Zookeeper, Flume, Sqoop and Oozie that make up its ecosystem. Advantages are its ability to handle unlimited data storage and high speed processing, while disadvantages include lower speeds for small datasets and limitations on data storage size.
Hadoop in Practice (SDN Conference, Dec 2014) - Marcel Krcah
Do you sit on a big pile of data and want to know how to leverage it in your company? Are you interested in use-cases, examples and practical demos of the full Hadoop stack? Looking for big-data inspiration?
In this talk we will cover:
- Use-cases showing how implementing a Hadoop stack at TheNewMotion drastically helped us, software engineers, with our everyday challenges, and how Hadoop enables our management, marketing and operations teams to become more data-driven.
- Practical introduction into our data warehouse, analytical and visualization stack: Apache Pig, Impala, Hue, Apache Spark, IPython notebook and Angular with D3.js.
- Easy deployment of the Hadoop stack to the cloud.
- Hermes - our homegrown command-line tool which helps us automate data-related tasks.
- Examples of exciting machine learning challenges that we are currently tackling
- Hadoop with Azure and Microsoft stack.
The document provides an overview of the Hadoop ecosystem. It introduces Hadoop and its core components, including MapReduce and HDFS. It describes other related projects like HBase, Pig, Hive, Mahout, Sqoop, Flume and Nutch that provide data access, algorithms, and data import capabilities to Hadoop. The document also discusses hosted Hadoop frameworks and the major Hadoop providers.
Apache Drill is the next generation of SQL query engines. It builds on ANSI SQL 2003, and extends it to handle new formats like JSON, Parquet, ORC, and the usual CSV, TSV, XML and other Hadoop formats. Most importantly, it melts away the barriers that have caused databases to become silos of data. It does so by being able to handle schema changes on the fly, enabling a whole new world of self-service and data agility never seen before.
Apache Tez: Accelerating Hadoop Query Processing (Bikas Saha)
Apache Tez is the new data processing framework in the Hadoop ecosystem. It runs on top of YARN - the new compute platform for Hadoop 2. Learn how Tez is built from the ground up to tackle a broad spectrum of data processing scenarios in Hadoop/BigData - ranging from interactive query processing to complex batch processing. With a high degree of automation built-in, and support for extensive customization, Tez aims to work out of the box for good performance and efficiency. Apache Hive and Pig are already adopting Tez as their platform of choice for query execution.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of a few minutes, and the chaining of incremental processing in Hadoop.
Flexible and Real-Time Stream Processing with Apache Flink (DataWorks Summit)
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
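To make those capabilities concrete, here is a minimal streaming word count using Flink's standard DataStream API, with checkpointing enabled for the fault tolerance described above; the socket source host and port are assumptions for illustration:

```java
// Pipelined streaming word count: an unbounded socket source, a flatMap
// that emits (word, 1), and a keyed running sum held as managed state.
// Checkpointing snapshots that state periodically for recovery.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(10_000); // snapshot state every 10s for fault tolerance

    DataStream<Tuple2<String, Integer>> counts = env
        .socketTextStream("localhost", 9999)      // unbounded text source (illustrative)
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
              if (!word.isEmpty()) {
                out.collect(new Tuple2<>(word, 1));
              }
            }
          }
        })
        .keyBy(t -> t.f0)  // partition the stream by word
        .sum(1);           // running count, kept as keyed state

    counts.print();
    env.execute("Streaming word count");
  }
}
```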
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses challenges in big data by providing reliability, scalability, and fault tolerance. Hadoop allows distributed processing of large datasets across clusters using MapReduce and can scale from single servers to thousands of machines, each offering local computation and storage. It is widely used for applications such as log analysis, data warehousing, and web indexing.
Hadoop institutes: Kelly Technologies is the best Hadoop training institute in Hyderabad, providing Hadoop training by real-time faculty.
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers common MapReduce concepts like HDFS, the MapReduce programming model, and Hadoop distributions. The document includes several code examples and screenshots related to Hadoop and MapReduce.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Jim Scott, CHUG co-founder and Director, Enterprise Strategy and Architecture for MapR presents "Using Apache Drill". This presentation was given on August 13th, 2014 at the Nokia office in Chicago, IL.
Jim has held positions running Operations, Engineering, Architecture and QA teams. He has worked in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. His work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.
Apache Drill brings the power of standard ANSI:SQL 2003 to your desktop and your clusters. It is like AWK for Hadoop. Drill supports querying schemaless systems like HBase, Cassandra and MongoDB. Use standard JDBC and ODBC APIs to use Drill from your custom applications. Leveraging an efficient columnar storage format, an optimistic execution engine and a cache-conscious memory layout, Apache Drill is blazing fast. Coordination, query planning, optimization, scheduling, and execution are all distributed throughout nodes in a system to maximize parallelization. This presentation contains live demonstrations.
The video can be found here: http://vimeo.com/chug/using-apache-drill
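As a sketch of the standard JDBC access mentioned above, the following queries a raw JSON file through Drill's JDBC driver; the drillbit host, the /data/employee.json path, and the column names are assumptions for illustration:

```java
// Query a schemaless JSON file in place with plain ANSI SQL via Drill's
// JDBC driver (org.apache.drill.jdbc.Driver). No schema is declared up front;
// Drill discovers it while reading.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
  public static void main(String[] args) throws Exception {
    // "drillbit=..." connects directly to one drillbit; "zk=..." would go
    // through ZooKeeper for cluster coordination.
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT name, salary FROM dfs.`/data/employee.json` "
                 + "WHERE salary > 50000")) {
      while (rs.next()) {
        System.out.println(rs.getString("name") + "\t" + rs.getLong("salary"));
      }
    }
  }
}
```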
Have you ever heard the buzzword "big data"? Briefly described, big data means collecting massive amounts of data, extracting all the small details and larger trends available in it, summarizing the output, and generating important insights about customers and competitors.
Enterprises seem to have sensed that something is in the air and have started to shop for technology. So what does the world have to offer enterprises that have an unknown number of petabytes flowing through their systems on a daily basis? There are a few options, but very few that can match the popularity of Hadoop. Hadoop can store and process large amounts of data. It has a large and diverse toolset for integration, operations and processing, and it is open source!
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN (DataWorks Summit)
DeathStar is a system that runs HBase on YARN to provide easy, dynamic multi-tenant HBase clusters via YARN. It allows different applications to run HBase in separate application-specific clusters on a shared HDFS and YARN infrastructure. This provides strict isolation between applications and enables dynamic scaling of clusters as needed. Some key benefits are improved cluster utilization, easier capacity planning and configuration, and the ability to start new clusters on demand without lengthy provisioning times.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos (Lester Martin)
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Serialization is the process of converting an object into a byte stream to store or transmit the object. The document discusses three serialization frameworks: Avro, MessagePack, and Kryo. Avro uses a JSON-defined schema and was created by the creator of Hadoop. MessagePack supports rich data structures like JSON and has interfaces for RPC. Kryo makes serialization easy by collecting serializers by class and supports compression.
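A minimal round-trip with Kryo, one of the three frameworks summarized above, illustrates the object-to-byte-stream process; the Point class is a hypothetical example type, while Kryo, Output, and Input are the library's standard classes:

```java
// Serialize an object to bytes with Kryo and read it back.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

public class KryoRoundTrip {
  static class Point {
    int x, y;
    Point() {}                       // Kryo needs a no-arg constructor by default
    Point(int x, int y) { this.x = x; this.y = y; }
  }

  public static void main(String[] args) {
    Kryo kryo = new Kryo();
    kryo.register(Point.class);      // registering classes keeps the stream compact

    // Serialize: object -> byte stream.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (Output output = new Output(bytes)) {
      kryo.writeObject(output, new Point(3, 4));
    }

    // Deserialize: byte stream -> object.
    try (Input input = new Input(new ByteArrayInputStream(bytes.toByteArray()))) {
      Point p = kryo.readObject(input, Point.class);
      System.out.println(p.x + "," + p.y);   // prints 3,4
    }
  }
}
```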
This slide deck is used as an introduction to the internals of Hadoop MapReduce, as part of the Distributed Systems and Cloud Computing course I teach at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
Analysing Banking Data to Provide Relevant Offers to Customers (Marc Torrens)
This document discusses Strands' solutions for analyzing banking customer data to provide relevant offers. Strands provides personalization and recommendation solutions for financial institutions and retailers. Their solutions include Card-Linked Offers (CLO), which enables retailers to target deals to bank customers. CLO allows customers to accept offers in digital banking and get cash back for purchases, while retailers can monitor campaign performance. Strands' solutions analyze transactional and other banking customer data using machine learning to determine customers' likelihood of purchasing certain categories and maximize campaign performance and relevance of offers provided.
By analysing the copious amounts of in-store customer navigation data generated by intelligent devices, retailers can improve the customer's in-store shopping experience.
Introduction To Big Data Analytics On Hadoop (SpringPeople)
Big data analytics uses tools like Hadoop and its components HDFS and MapReduce to store and analyze large datasets in a distributed environment. HDFS stores very large data sets reliably and streams them at high speeds, while MapReduce allows developers to write programs that process massive amounts of data in parallel across a distributed cluster. Other concepts discussed in the document include data preparation, visualization, hypothesis testing, and deductive vs. inductive reasoning as they relate to big data analytics. The document aims to introduce readers to big data analytics using Hadoop and identifies its intended audience as data analysts, scientists, database managers, and consultants.
The document summarizes Oracle's Big Data Appliance and solutions. It discusses the Big Data Appliance hardware which includes 18 servers with 48GB memory, 12 Intel cores, and 24TB storage per node. The software includes Oracle Linux, Apache Hadoop, Oracle NoSQL Database, Oracle Data Integrator, and Oracle Loader for Hadoop. Oracle Loader for Hadoop can be used to load data from Hadoop into Oracle Database in online or offline mode. The Big Data Appliance provides an optimized platform for storing and analyzing large amounts of data and is integrated with Oracle Exadata.
SOC Presentation: Building a Security Operations Center (Michael Nickle)
Presentation I used to give on the topic of using a SIM/SIEM to unify the information stream flowing into the SOC. This piece of collateral was used to help close the largest SIEM deal (Product and services) that my employer achieved with this product line.
Key Findings:
- We are in an industrial revolution… right now!
- User Experience must extend “from the screen to the shop”
- Security considerations continue to increase in scope
- AWS infrastructure and SMART COSMOS platform services are powerful options for realizing Industry 4.0
We follow John and Jane through a typical day and explore their everyday banking needs. The aim of this presentation is to showcase how modern-day banking is transforming, today and in the future.
Many believe Big Data is a brand new phenomenon. It isn't; it is part of an evolution that reaches far back in history. Here are some of the key milestones in this development.
The European data economy: European policies and funding opportunities... (Data Driven Innovation)
The European data economy: policy and legal solutions for realizing an EU-wide data economy, within the Digital Single Market strategy. The public consultation "Building the European Data Economy". The Big Data Value public-private partnership (PPP) and funding opportunities in Horizon 2020. The Data Pitch incubator: opportunities for start-ups and small and medium-sized enterprises.
Industry 4.0: Merging Internet and Factories (Fabernovel)
Industrial IoT and connected objects for factories are part of our research at FABERNOVEL OBJET, our activity dedicated to IoT.
The future of industry is at the crossroads of internet and factories. Some call it INDUSTRY 4.0 or FACTORY 4.0 in reference to the upcoming fourth industrial revolution. Governments and private companies in Germany, UK and the USA have acknowledged the importance of industrial IoT and its central role in future industrial transformation.
The adoption of Industrial Internet has both near-term and long-term impacts and will be characterized by the emergence of new models such as the “Outcome Economy” and the “Autonomous, Pull Economy”.
We believe that INDUSTRY 4.0 is a growth opportunity for industrial companies, and have deciphered this very phenomenon in the following presentation.
This presentation, by big data guru Bernard Marr, outlines in simple terms what Big Data is and how it is used today. It covers the 5 V's of Big Data as well as a number of high value use cases.
This document provides an overview of big data and Hadoop. It discusses what big data is, why it has become important recently, and common use cases. It then describes how Hadoop addresses challenges of processing large datasets by distributing data and computation across clusters. The core Hadoop components of HDFS for storage and MapReduce for processing are explained. Example MapReduce jobs like wordcount are shown. Finally, higher-level tools like Hive and Pig that provide SQL-like interfaces are introduced.
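As a sketch of the SQL-like interfaces mentioned above, the following submits a HiveQL query through HiveServer2's standard JDBC driver; the host, credentials, and the words table are assumptions for illustration:

```java
// Run a HiveQL query over JDBC. Hive compiles the query down to
// distributed jobs on the cluster, so plain SQL scales to big datasets.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // jdbc:hive2:// is HiveServer2's standard URL scheme; host/user are illustrative.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS freq FROM words "
                 + "GROUP BY word ORDER BY freq DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("freq"));
      }
    }
  }
}
```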
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It implements Google's MapReduce programming model and the Hadoop Distributed File System (HDFS) for reliable data storage. Key components include a JobTracker that coordinates jobs, TaskTrackers that run tasks on worker nodes, and a NameNode that manages the HDFS namespace and DataNodes that store application data. The framework provides fault tolerance, parallelization, and scalability.
Apache Hadoop, HDFS and MapReduce Overview (Nisanth Simon)
This document provides an overview of Apache Hadoop, HDFS, and MapReduce. It describes how Hadoop uses a distributed file system (HDFS) to store large amounts of data across commodity hardware. It also explains how MapReduce allows distributed processing of that data by allocating map and reduce tasks across nodes. Key components discussed include the HDFS architecture with NameNodes and DataNodes, data replication for fault tolerance, and how the MapReduce engine works with a JobTracker and TaskTrackers to parallelize jobs.
We provide Hadoop training in Hyderabad and Bangalore, with corporate training by faculty with 12+ years of experience:
- Real-time industry experts from MNCs
- Resume preparation by expert professionals
- Lab exercises
- Interview preparation
- Expert advice
This document provides an introduction to Hadoop and big data. It discusses the new kinds of large, diverse data being generated and the need for platforms like Hadoop to process and analyze this data. It describes the core components of Hadoop, including HDFS for distributed storage and MapReduce for distributed processing. It also discusses some of the common applications of Hadoop and other projects in the Hadoop ecosystem like Hive, Pig, and HBase that build on the core Hadoop framework.
If you are searching for the best engineering college in India, you can trust the services and facilities of RCE (Roorkee College of Engineering). It provides the best education, highly educated and experienced faculty, well-furnished hostels for both boys and girls, a top computerized library, great placement opportunities and more, at an affordable fee.
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins from Google and Apache projects
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Yahoo, Facebook, and Amazon use Hadoop for applications like log processing, searches, and advertisement targeting
The document provides an overview of Hadoop, including:
- A brief history of Hadoop and its origins at Google and Yahoo
- An explanation of Hadoop's architecture including HDFS, MapReduce, JobTracker, TaskTracker, and DataNodes
- Examples of how large companies like Facebook and Amazon use Hadoop to process massive amounts of data
Hadoop Administrator online training course by Knowledgebee Trainings, covering mastery of the Hadoop cluster: planning & deployment, monitoring, performance tuning, security using Kerberos, HDFS high availability using Quorum Journal Manager (QJM), and Oozie and HCatalog/Hive administration.
Contact: knowledgebee@beenovo.com
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
This document provides an overview of Hadoop fundamentals including:
- Why Hadoop is used for big data applications due to its ability to handle petabytes of data across commodity hardware in a scalable and economical way.
- What Hadoop is and how it provides a distributed storage and processing infrastructure based on Google's papers using HDFS for storage and MapReduce for processing.
- How HDFS stores and replicates blocks of data across nodes to provide fault tolerance and how MapReduce uses a simple programming model of map and reduce functions to distribute processing.
- An example word count application is described to illustrate how MapReduce can be used to count word frequencies by mapping words to counts and then reducing them to a total count per word. (A client-side sketch of the HDFS mechanics these bullets describe follows this list.)
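Here is that client-side sketch, using the standard org.apache.hadoop.fs.FileSystem API; the file path, replication factor, and block size are assumptions for illustration:

```java
// Write a file to HDFS and read it back. The FileSystem API hides the
// block/DataNode mechanics: the NameNode assigns blocks, and the write
// pipeline replicates each block across DataNodes.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf)) {
      Path file = new Path("/tmp/hello.txt");

      // Write: 3 replicas per block, 128 MB block size (illustrative values).
      try (FSDataOutputStream out = fs.create(file, true, 4096, (short) 3,
          128L * 1024 * 1024)) {
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      }

      // Read: the client fetches block locations from the NameNode and
      // streams the data directly from a nearby DataNode.
      try (BufferedReader reader = new BufferedReader(
          new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
        System.out.println(reader.readLine());
      }

      FileStatus status = fs.getFileStatus(file);
      System.out.println("replication=" + status.getReplication());
    }
  }
}
```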
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a master-slave architecture with the NameNode as master and DataNodes as slaves. The NameNode manages file system metadata and the DataNodes store data blocks. Hadoop also includes a MapReduce engine where the JobTracker splits jobs into tasks that are processed by TaskTrackers on each node. Hadoop saw early adoption from companies handling big data like Yahoo!, Facebook and Amazon and is now widely used for applications like advertisement targeting, search, and security analytics.
- Data is a precious resource that can last longer than the systems themselves (Tim Berners-Lee)
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides reliability, scalability and flexibility.
- Hadoop consists of HDFS for storage and MapReduce for processing. The main nodes include NameNode, DataNodes, JobTracker and TaskTrackers. Tools like Hive, Pig, HBase extend its capabilities for SQL-like queries, data flows and NoSQL access.
Scaling Storage and Computation with Hadoop (yaevents)
Hadoop provides distributed storage and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. Hadoop partitions data and computation across thousands of hosts and executes application computations in parallel, close to their data. A Hadoop cluster scales computation capacity, storage capacity and IO bandwidth by simply adding commodity servers. Hadoop is an Apache Software Foundation project; it unites hundreds of developers, and hundreds of organizations worldwide report using Hadoop. This presentation will give an overview of the Hadoop family of projects with a focus on its distributed storage solutions.
What AI Means For Your Product Strategy And What To Do About It (VMware Tanzu)
The document summarizes Matthew Quinn's presentation on "What AI Means For Your Product Strategy And What To Do About It" at Denver Startup Week 2023. The presentation discusses how generative AI could impact product strategies by potentially solving problems companies have ignored or allowing competitors to create new solutions. Quinn advises product teams to evaluate their strategies and roadmaps, ensure they understand user needs, and consider how AI may change the problems being addressed. He provides examples of how AI could influence product development for apps in home organization and solar sales. Quinn concludes by urging attendees not to ignore AI's potential impacts and to have hard conversations about emerging threats and opportunities.
Make the Right Thing the Obvious Thing at Cardinal Health 2023 (VMware Tanzu)
This document discusses the evolution of internal developer platforms and defines what they are. It provides a timeline of how technologies like infrastructure as a service, public clouds, containers and Kubernetes have shaped developer platforms. The key aspects of an internal developer platform are described as providing application-centric abstractions, service level agreements, automated processes from code to production, consolidated monitoring and feedback. The document advocates that internal platforms should make the right choices obvious and easy for developers. It also introduces Backstage as an open source solution for building internal developer portals.
Enhancing DevEx and Simplifying Operations at Scale (VMware Tanzu)
Cardinal Health introduced Tanzu Application Service in 2016 and set up foundations for cloud native applications in AWS and later migrated to GCP in 2018. TAS has provided Cardinal Health with benefits like faster development of applications, zero downtime for critical applications, hosting over 5,000 application instances, quicker patching for security vulnerabilities, and savings through reduced lead times and staffing needs.
Dan Vega discussed upcoming changes and improvements in Spring including Spring Boot 3, which will have support for JDK 17, Jakarta EE 9/10, ahead-of-time compilation, improved observability with Micrometer, and Project Loom's virtual threads. Spring Boot 3.1 additions were also highlighted such as Docker Compose integration and Spring Authorization Server 1.0. Spring Boot 3.2 will focus on embracing virtual threads from Project Loom to improve scalability of web applications.
Platforms, Platform Engineering, & Platform as a Product (VMware Tanzu)
This document discusses building platforms as products and reducing developer toil. It notes that platform engineering now encompasses PaaS and developer tools. A quote from Mercedes-Benz emphasizes building platforms for developers, not for the company itself. The document contrasts reactive, ticket-driven approaches with automated, self-service platforms and products. It discusses moving from considering platforms as a cost center to experts that drive business results. Finally, it provides questions to identify sources of developer toil, such as issues with workstation setup, running software locally, integration testing, committing changes, and release processes.
This document provides an overview of building cloud-ready applications in .NET. It defines what makes an application cloud-ready, discusses common issues with legacy applications, and recommends design patterns and practices to address these issues, including loose coupling, high cohesion, messaging, service discovery, API gateways, and resiliency policies. It includes code examples and links to additional resources.
Dan Vega discussed new features and capabilities in Spring Boot 3 and beyond, including support for JDK 17, Jakarta EE 9, ahead-of-time compilation, observability with Micrometer, Docker Compose integration, and initial support for Project Loom's virtual threads in Spring Boot 3.2 to improve scalability. He provided an overview of each new feature and explained how they can help Spring applications.
Spring Cloud Gateway - SpringOne Tour 2023, Charles Schwab (VMware Tanzu)
Spring Cloud Gateway is a gateway that provides routing, security, monitoring, and resiliency capabilities for microservices. It acts as an API gateway and sits in front of microservices, routing requests to the appropriate microservice. The gateway uses predicates and filters to route requests and modify requests and responses. It is lightweight and built on reactive principles to enable it to scale to thousands of routes.
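As a minimal sketch of that predicate-and-filter model, here is a route defined with Spring Cloud Gateway's standard Java DSL; the route id, path, and downstream URI are assumptions for illustration:

```java
// One gateway route: a path predicate selects matching requests, filters
// rewrite them, and the uri() forwards to the downstream microservice.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class GatewayApplication {

  public static void main(String[] args) {
    SpringApplication.run(GatewayApplication.class, args);
  }

  @Bean
  public RouteLocator routes(RouteLocatorBuilder builder) {
    return builder.routes()
        .route("users_route", r -> r
            .path("/api/users/**")               // predicate: match on request path
            .filters(f -> f.stripPrefix(1)       // filter: drop the /api prefix
                           .addResponseHeader("X-Gateway", "scg"))
            .uri("http://localhost:8081"))       // downstream users service (illustrative)
        .build();
  }
}
```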
This document appears to be from a VMware Tanzu Developer Connect presentation. It discusses Tanzu Application Platform (TAP), which provides a developer experience on Kubernetes across multiple clouds. TAP aims to unlock developer productivity, build rapid paths to production, and coordinate the work of development, security and operations teams. It offers features like pre-configured templates, integrated developer tools, centralized visibility and workload status, role-based access control, automated pipelines and built-in security. The presentation provides examples of how these capabilities improve experiences for developers, operations teams and security teams.
The document provides information about a Tanzu Developer Connect Workshop on Tanzu Application Platform. The agenda includes welcome and introductions on Tanzu Application Platform, followed by interactive hands-on workshops on the developer experience and operator experience. It will conclude with a quiz, prizes and giveaways. The document discusses challenges with developing on Kubernetes and how Tanzu Application Platform aims to improve the developer experience with features like pre-configured templates, developer tools integration, rapid iteration and centralized management.
The Tanzu Developer Connect is a hands-on workshop that dives deep into TAP. Attendees receive a hands on experience. This is a great program to leverage accounts with current TAP opportunities.
Simplify and Scale Enterprise Apps in the Cloud | Dallas 2023 (VMware Tanzu)
This document discusses simplifying and scaling enterprise Spring applications in the cloud. It provides an overview of Azure Spring Apps, which is a fully managed platform for running Spring applications on Azure. Azure Spring Apps handles infrastructure management and application lifecycle management, allowing developers to focus on code. It is jointly built, operated, and supported by Microsoft and VMware. The document demonstrates how to create an Azure Spring Apps service, create an application, and deploy code to the application using three simple commands. It also discusses features of Azure Spring Apps Enterprise, which includes additional capabilities from VMware Tanzu components.
SpringOne Tour: Deliver 15-Factor Applications on Kubernetes with Spring Boot (VMware Tanzu)
The document discusses 15 factors for building cloud native applications with Kubernetes based on the 12 factor app methodology. It covers factors such as treating code as immutable, externalizing configuration, building stateless and disposable processes, implementing authentication and authorization securely, and monitoring applications like space probes. The presentation aims to provide an overview of the 15 factors and demonstrate how to build cloud native applications using Kubernetes based on these principles.
SpringOne Tour: The Influential Software Engineer (VMware Tanzu)
The document discusses the importance of culture in software projects and how to influence culture. It notes that software projects involve people and personalities, not just technology. It emphasizes that culture informs everything a company does and is very difficult to change. It provides advice on being aware of your company's culture, finding ways to inculcate good cultural values like writing high-quality code, and approaches for influencing decision makers to prioritize culture.
SpringOne Tour: Domain-Driven Design: Theory vs Practice (VMware Tanzu)
This document discusses domain-driven design, clean architecture, bounded contexts, and various modeling concepts. It provides examples of an e-scooter reservation system to illustrate domain modeling techniques. Key topics covered include identifying aggregates, bounded contexts, ensuring single sources of truth, avoiding anemic domain models, and focusing on observable domain behaviors rather than implementation details.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... (sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
The Ipsos - AI - Monitor 2024 Report (Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
End-to-end pipeline agility - Berlin Buzzwords 2024 (Lars Albertsson)
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... (Social Samosa)
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Codeless Generative AI Pipelines (GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
4. Hadoop Core
• Open-source Apache project out of Yahoo! in 2006
• Distributed fault-tolerant data storage and batch processing
• Provides linear scalability on commodity hardware
• Adopted by many:
– Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more
6. Overview
• Great at
– Reliable storage for multi-petabyte data sets
– Batch queries and analytics
– Complex hierarchical data structures with changing schemas, unstructured and structured data
• Not so great at
– Changes to files (can’t do it…)
– Low-latency responses
– Analyst usability
• This is less of a concern now due to higher-level languages
7. Data Structure
• Bytes!
• No more ETL necessary
• Store data now, process later
• Structure on read
– Built-in support for common data types and formats
– Extendable
– Flexible
8. Versioning
• Version 0.20.x, 0.21.x, 0.22.x, 1.x.x
– Two main MR packages:
• org.apache.hadoop.mapred (deprecated)
• org.apache.hadoop.mapreduce (new hotness)
• Version 2.x.x, alpha’d in May 2012
– NameNode HA
– YARN – Next Gen MapReduce
10. HDFS Overview
• Hierarchical UNIX-like file system for data storage
– sort of
• Splitting of large files into blocks
• Distribution and replication of blocks to nodes
• Two key services
– Master NameNode
– Many DataNodes
• Checkpoint Node (Secondary NameNode)
11. NameNode
• Single master service for HDFS
• Single point of failure (HDFS 1.x)
• Stores file to block to location mappings in the namespace
• All transactions are logged to disk
• NameNode startup reads namespace image and logs
12. Checkpoint Node (Secondary NN)
• Performs checkpoints of the NameNode’s namespace and logs
• Not a hot backup!
1. Loads up namespace
2. Reads log transactions to modify namespace
3. Saves namespace as a checkpoint
13. DataNode
• Stores blocks on local disk
• Sends frequent heartbeats to NameNode
• Sends block reports to NameNode
• Clients connect to DataNode for I/O
14. How HDFS Works - Writes
(Diagram: Client, NameNode, and DataNodes A–D)
1. Client contacts NameNode to write data
2. NameNode says write it to these nodes
3. Client sequentially writes blocks to the DataNodes
15. How HDFS Works - Writes
(Diagram: blocks A1–A4 replicated across DataNodes A–D)
• DataNodes replicate data blocks, orchestrated by the NameNode
16. How HDFS Works - Reads
(Diagram: Client, NameNode, and DataNodes A–D)
1. Client contacts NameNode to read data
2. NameNode says you can find it here
3. Client sequentially reads blocks from the DataNodes
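Under the hood, a client drives both flows through the org.apache.hadoop.fs.FileSystem API, which hides the NameNode and DataNode conversations. A minimal sketch, assuming a 2.x-style configuration; the NameNode address and path are made up:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/tmp/hello.txt");

    // Write: the NameNode picks target DataNodes, the client streams blocks to them
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello, HDFS");
    }

    // Read: the NameNode returns block locations, the bytes come from the DataNodes
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}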
17. How HDFS Works - Failure
(Diagram: Client, NameNode, and DataNodes A–D)
• Client connects to another node serving that block
18. Block Replication
• Default of three replicas
• Rack-aware system
– One block on same rack
– One block on same rack, different host
– One block on another rack
• Automatic re-copy by NameNode, as needed
(Diagram: two racks, each holding multiple DataNodes)
19. HDFS 2.0 Features
• NameNode High-Availability (HA)
– Two redundant NameNodes in active/passive configuration
– Manual or automated failover
• NameNode Federation
– Multiple independent NameNodes using the same collection of DataNodes
21. Hadoop MapReduce 1.x
• Moves the code to the data
• JobTracker
– Master service to monitor jobs
• TaskTracker
– Multiple services to run tasks
– Same physical machine as a DataNode
• A job contains many tasks
• A task contains one or more task attempts
22. JobTracker
• Monitors job and task progress
• Issues task attempts to TaskTrackers
• Re-tries failed task attempts
• Four failed attempts = one failed job
• Schedules jobs in FIFO order by default
– Fair Scheduler available as a pluggable alternative
• Single point of failure for MapReduce
23. TaskTrackers
• Runs on same node as DataNode service
• Sends heartbeats and task reports to JobTracker
• Configurable number of map and reduce slots
• Runs map and reduce task attempts
– Separate JVM!
24. Exploiting Data Locality
• JobTracker will schedule task on a TaskTracker that is local to the block
– 3 options!
• If TaskTracker is busy, selects TaskTracker on same rack
– Many options!
• If still busy, chooses an available TaskTracker at random
– Rare!
25. How MapReduce Works
(Diagram: Client, JobTracker, and TaskTrackers A–D co-located with DataNodes A–D)
1. Client submits job to JobTracker
2. JobTracker submits tasks to TaskTrackers
3. Job output is written to DataNodes w/ replication
4. JobTracker reports metrics
26. How MapReduce Works - Failure
(Diagram: Client, JobTracker, and TaskTrackers A–D co-located with DataNodes A–D)
• JobTracker assigns task to a different node
27. YARN
• Abstract framework for distributed application development
• Split functionality of JobTracker into two components
– ResourceManager
– ApplicationMaster
• TaskTracker becomes NodeManager
– Containers instead of map and reduce slots
• Configurable amount of memory per NodeManager
28. MapReduce 2.x on YARN
• MapReduce API has not changed
– Rebuild required to upgrade from 1.x to 2.x
• Application Master launches and monitors job via YARN
• MapReduce History Server to store… history
30. Hadoop Ecosystem
• Core Technologies
– Hadoop Distributed File System
– Hadoop MapReduce
• Many other tools…
– Which I will be describing… now
31. Moving Data
• Sqoop
– Moving data between RDBMS and HDFS
– Say, migrating MySQL tables to HDFS
• Flume
– Streams event data from sources to sinks
– Say, weblogs from multiple servers into HDFS
33. Higher Level APIs
• Pig
– Data-flow language – aptly named PigLatin – to generate one or more MapReduce jobs against data stored locally or in HDFS
• Hive
– Data warehousing solution, allowing users to write SQL-like queries to generate a series of MapReduce jobs against data stored in HDFS
34. Pig Word Count
A = LOAD '$input';
B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group AS word, COUNT(B);
STORE D INTO '$output';
35. Key/Value Stores
• HBase
• Accumulo
• Implementations of Google’s Bigtable for HDFS
• Provides random, real-time access to big data
• Supports updates and deletes of key/value pairs
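To make the random-access model concrete, here is a minimal sketch against the era’s HTable client API (the table, column family, and row key are invented; newer HBase versions use Connection/Table instead):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users"); // hypothetical table

    // Update a single cell in real time -- no MapReduce job involved
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
    table.put(put);

    // Random read by row key
    Result result = table.get(new Get(Bytes.toBytes("row-42")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    table.close();
  }
}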
37. Data Structure
• Avro
– Data serialization system designed for the Hadoop ecosystem
– Schemas are expressed as JSON
• Parquet
– Compressed, efficient columnar storage for Hadoop and other systems
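As an illustration of Avro’s JSON-based schemas, a minimal sketch (the User record type is invented for the example):
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) {
    // An Avro schema is plain JSON
    String json = "{\"type\":\"record\",\"name\":\"User\","
        + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";
    Schema schema = new Schema.Parser().parse(json);

    // Records can be built generically against the schema, no code generation needed
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Ada");
    user.put("age", 36);
    System.out.println(user);
  }
}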
38. Scalable Machine Learning
• Mahout
– Library for scalable machine learning written in Java
– Very robust examples!
– Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more
39. Workflow Management
• Oozie
– Scheduling system for Hadoop Jobs
– Support for:
• Java MapReduce
• Streaming MapReduce
• Pig, Hive, Sqoop, Distcp
• Any ol’ Java or shell script program
40. Real-time Stream Processing
• Storm
– Open-source project that streams data from sources, called spouts, to a series of execution agents called bolts
– Scalable and fault-tolerant, with guaranteed processing of data
– Benchmarks of over a million tuples processed per second per node
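A minimal spout-and-bolt topology sketch, using the pre-Apache backtype.storm API current when this talk was given (the sentence stream is hard-coded purely for illustration):
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class SentenceTopology {
  // Spout: the source of the stream
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
      this.collector = collector;
    }
    public void nextTuple() {
      Utils.sleep(1000);
      collector.emit(new Values("the quick brown fox")); // same sentence, forever
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: an execution agent that processes each tuple
  public static class SplitBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      for (String word : tuple.getString(0).split(" "))
        collector.emit(new Values(word));
    }
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 1);
    builder.setBolt("words", new SplitBolt(), 2).shuffleGrouping("sentences");
    new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
  }
}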
41. Distributed Application Coordination
• ZooKeeper
– An effort to develop and maintain an open-source server which enables highly reliable distributed coordination
– Designed to be simple, replicated, ordered, and fast
– Provides configuration management, distributed synchronization, and group services for applications
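A minimal sketch of the configuration-management use case with the plain ZooKeeper client (the ensemble address and znode name are made up):
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    // Connect to a (hypothetical) three-node ZooKeeper ensemble
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {});

    // Publish a small piece of shared configuration as a znode
    zk.create("/app-config", "maxConnections=10".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any process in the cluster can read (and watch) the same znode
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}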
43. Hadoop Streaming
• Write MapReduce mappers and reducers using stdin and stdout
• Execute on command line using the Hadoop Streaming JAR
hadoop jar hadoop-streaming.jar -input input -output outputdir \
  -mapper /bin/cat -reducer /usr/bin/wc
44. SQL on Hadoop
• Apache Drill
• Cloudera Impala
• Hive Stinger
• Pivotal HAWQ
• MPP execution of SQL queries against HDFS data
46. That’s a lot of projects
• I am likely missing several (Sorry, guys!)
• Each cropped up to solve a limitation of Hadoop Core
• Know your ecosystem
• Pick the right tool for the right job
49. MapReduce Paradigm
• Data processing system with two key phases
• Map
– Perform a map function on input key/value pairs to generate intermediate key/value pairs
• Reduce
– Perform a reduce function on intermediate key/value groups to generate output key/value pairs
• Groups created by sorting map output
53. InputFormat
public abstract class InputFormat<K, V> {
public abstract List<InputSplit> getSplits(JobContext context);
public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context);
}
54. RecordReader
public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
public abstract void initialize(InputSplit split, TaskAttemptContext context);
public abstract boolean nextKeyValue();
public abstract KEYIN getCurrentKey();
public abstract VALUEIN getCurrentValue();
public abstract float getProgress();
public abstract void close();
}
56. Partitioner
public abstract class Partitioner<KEY, VALUE> {
public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
• Default HashPartitioner uses key’s hashCode() % numPartitions
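Overriding it is a one-method job. A hypothetical partitioner that routes all keys with the same first letter to the same reducer (it would be registered on the job with job.setPartitionerClass):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    int letter = key.getLength() > 0 ? key.charAt(0) : 0;
    // Mask off the sign bit so the partition index is never negative
    return (letter & Integer.MAX_VALUE) % numPartitions;
  }
}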
57. Reducer
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
protected void setup(Context context) { /* NOTHING */ }
protected void cleanup(Context context) { /* NOTHING */ }
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) {
for (VALUEIN value : values)
context.write((KEYOUT) key, (VALUEOUT) value);
}
public void run(Context context) {
setup(context);
while (context.nextKey())
reduce(context.getCurrentKey(), context.getValues(), context);
cleanup(context);
}
}
58. OutputFormat
public abstract class OutputFormat<K, V> {
public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context);
public abstract void checkOutputSpecs(JobContext context);
public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context);
}
59. RecordWriter
public abstract class RecordWriter<K, V> {
public abstract void write(K key, V value);
public abstract void close(TaskAttemptContext context);
}
61. Problem
• Count the number of times each word is used in a body of text
• Uses TextInputFormat and TextOutputFormat
map(byte_offset, line)
foreach word in line
emit(word, 1)
reduce(word, counts)
sum = 0
foreach count in counts
sum += count
emit(word, sum)
62. Mapper Code
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
private final static IntWritable ONE = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, ONE);
}
}
}
64. Reducer Code
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable outvalue = new IntWritable();
private int sum = 0;
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
outvalue.set(sum);
context.write(key, outvalue);
}
}
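The slides show the Mapper and Reducer but not the driver that wires them together; a minimal sketch of one, using the 2.x-style Job.getInstance (older releases used new Job(conf, name)):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer doubles as a combiner
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}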
65. So what’s so hard about it?
(Diagram: a tiny box labeled “MapReduce” – “that’s a tiny box” – next to a much bigger box labeled “All the problems you’ll ever have ever”)
66. So what’s so hard about it?
• MapReduce is a limitation
• Entirely different way of thinking
• Simple processing operations such as joins are not so easy when expressed in MapReduce
• Proper implementation is not so easy
• Lots of configuration and implementation details for optimal performance
– Number of reduce tasks, data skew, JVM size, garbage collection
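Most of those knobs end up on the job or cluster configuration; a couple of illustrative (not exhaustive) examples, with property names from the 1.x era:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Child JVM heap for task attempts (1.x name; 2.x splits this into
    // mapreduce.map.java.opts and mapreduce.reduce.java.opts)
    conf.set("mapred.child.java.opts", "-Xmx1024m");

    Job job = Job.getInstance(conf, "tuned job");
    // Too few reducers overloads each one; too many wastes scheduling
    // overhead, and key skew can swamp individual reducers either way
    job.setNumReduceTasks(16);
  }
}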
67. So what does this mean for you?
• Hadoop is written primarily in Java
• Components are extendable and configurable
• Custom I/O through Input and Output Formats
– Parse custom data formats
– Read and write using external systems
• Higher-level tools enable rapid development of big data analysis
68. Resources, Wrap-up, etc.
• http://hadoop.apache.org
• Very supportive community
• Strata + Hadoop World Oct. 28th – 30th in Manhattan
• Plenty of resources available to learn more
– Blogs
– Email lists
– Books
– Shameless Plug -- MapReduce Design Patterns
69. Getting Started
• Pivotal HD Single-Node VM and Community Edition
– http://gopivotal.com/pivotal-products/data/pivotal-hd
• For the brave and bold -- Roll-your-own!
– http://hadoop.apache.org/docs/current
70. Acknowledgements
• Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation
• Cloudera Impala is a trademark of Cloudera
• Parquet is copyright Twitter, Cloudera, and other contributors
• Storm is licensed under the Eclipse Public License
71. Learn More. Stay Connected.
• Talk to us on Twitter: @springcentral
• Find Session replays on YouTube: spring.io/video
Editor's Notes
Apache project based on two Google papers in 2003 and 2004 on the Google File System and MapReduce
Spawned off of Nutch, open-source web-search software, when looking to store the data
Linear scalability using commodity hardware – Facebook adopted hadoop, 10TB to 15PB
Not for random reads/writes, not for updates – batch processing of large amounts of data
Fault tolerant system primarily around distribution and replication of resources
There is no true backup of your Hadoop cluster
Use HDFS to store petabytes of data using thousands of nodes
Largest Hadoop cluster (March 2011) could hold 30 PB of data
Looks a lot like UNIX file system – contains files, folders, permissions, users, and groups
Isn’t actually stored that way – large data files are split into blocks and placed on DataNode services
NameNode is the name server for the file name to block mapping – it knows how the file is split and where the data is in the cluster
All read and write requests go through the namenode, but data is served from the DataNodes via HTTP
Namespace is stored in memory, transactions are logged on the local file system
Secondary NameNode or checkpoint node creates snapshots of the NameNode namespace for fault tolerance and faster restarts
Note that the NameNode does not persist the data block locations themselves (it only keeps them in memory); the DataNodes tell the NameNode what blocks they have
Block reports contain all the block IDs it is holding onto, md5 checksums, etc.
Client contacts the namenode with a request to write some data
Namenode responds and says okay write it to these data nodes
Client connects to each data node and writes out four blocks, one per node
After the file is closed, the data nodes traffic data around to replicate the blocks to a triplicate, all orchestrated by the namenode
In the event of a node failure, data can be accessed on other nodes and the namenode will move data blocks to other nodes
Client contacts the namenode with a request to read some data
Namenode responds with the nodes holding the blocks
Client connects to the data nodes and reads the blocks in sequence
If a node fails mid-read, the client simply connects to another node serving a replica of that block
Job tracker takes submitted jobs from clients and determines the locations of the blocks that make up the input
One data block equals one task
Task attempts are distributed to TaskTrackers running in parallel with each DataNode, thus giving data locality for reading the data
Successful task attempts are good! Failed task attempts are given to another TaskTracker for processing
Four failed attempts of a single task equals one failed job
Client submits a job to the JobTracker for processing
JobTracker uses the input of the job to determine where the blocks are located (through the NameNode), and then distributes task attempts to the task trackers
TaskTrackers coordinate the task attempts and data output is written back to the datanodes, which is distributed and replicated as normal HDFS operations
Job statistics – not output – is reported back to the client upon job completion
RM manages global assignment of compute resources to applications
AM manages application life cycle – tasked to negotiate resources from the RM and works with NM to execute and monitor tasks
NodeManager executes containers, which take the place of the fixed map and reduce slots
In YARN, a MapReduce application is equivalent to a job, executed by the MapReduce AM
HBase Master
Can run multiple HBase Masters for high availability with automatic failover
HBase RegionServer
Hosts a table’s Regions, much as a DataNode hosts a file’s blocks
How do I get data into this file system?
How do we take all the boiler-plate code away?
How do we update data in HDFS?
Common data format for the ecosystem that plays well with Hadoop
Stream processing, etc
Common Data Format
Uses key value pairs as input and output to both phases
Highly parallelizable paradigm – very easy choice for data processing on a Hadoop cluster
Talk about each piece
Talk about each piece, mention keys must be writable comparable instances
Defines the splits for the mapreduce job as well as the record reader
Given an input split, used to create key value pairs out of the logical split.
Responsible for respecting record boundaries
Mapper is where the good stuff happens
This is the identity Mapper, and the class that is overridden.
Talk about HashPartitioner
Context is a nested class of Reducer
Mapper outputs data to a single file that is logically partitioned by key
Reducers copy their partition over to the local machine – “shuffle”
Each reducer then sorts their partitions into a single sorted local file for processing
HDFS is great at storing data, and MapReduce is great at scaling out processing
However, MapReduce is a limitation as everything needs to be expressed in key/value pairs, and needs to fit in this box of map, shuffle, sort, reduce
Many different ways to execute a join using MapReduce, need to choose which you want to do based on type of join, data size, etc
Number of reducers is configurable
Fewer reducers get more data to process
Key skew will cause some reducers to get too much work
Java is nice, since a lot of other things are written in Java and it is pretty easy to get something going quickly
Because of this, anything you can do in Java, you can apply to MapReduce
Just need to ensure you don’t break the paradigm and keep everything parallel