This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level topics, and can be used as an introduction to Apache Spark.
Compression Options in Hadoop - A Tale of Tradeoffs (DataWorks Summit)
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpora of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
Agenda:
1. Data Flow Challenges in an Enterprise
2. Introduction to Apache NiFi
3. Core Features
4. Architecture
5. Demo – Simple Lambda Architecture
6. Use Cases
7. Q & A
This Hadoop Hive Tutorial will unravel the complete Introduction to Hive, Hive Architecture, Hive Commands, Hive Fundamentals & HiveQL. In addition, the fundamental concepts of Big Data & Hadoop are also covered extensively.
At the end, you'll have a strong understanding of Hadoop Hive basics.
PPT Agenda
✓ Introduction to BIG Data & Hadoop
✓ What is Hive?
✓ Hive Data Flows
✓ Hive Programming
----------
What is Apache Hive?
Apache Hive is a data warehousing infrastructure built on top of Hadoop and targeted at SQL programmers. Hive lets SQL programmers enter the Hadoop ecosystem directly, without any prerequisites in Java or other programming languages. HiveQL, which is similar to SQL, is used to manage and query data for Hadoop & MapReduce operations.
----------
Hive has the following 5 Components:
1. Driver
2. Compiler
3. Shell
4. Metastore
5. Execution Engine
----------
Applications of Hive
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor-led training in Big Data & Hadoop featuring real-time projects, 24/7 lifetime support & 100% placement assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
Iceberg: A modern table format for big data (Strata NY 2018), Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores, such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where HDFS stores all data inside organizations. The new use case presents a new set of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS - the centralized scheme within the NameNode becomes a main bottleneck which limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient to handle large amounts of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
This presentation about Hadoop architecture will help you understand the architecture of Apache Hadoop in detail. In this video, you will learn what is Hadoop, components of Hadoop, what is HDFS, HDFS architecture, Hadoop MapReduce, Hadoop MapReduce example, Hadoop YARN and finally, a demo on MapReduce. Apache Hadoop offers a versatile, adaptable and reliable distributed computing big data framework for a group of systems with capacity limit and local computing power. After watching this video, you will also understand the Hadoop Distributed File System and its features along with the practical implementation.
Below are the topics covered in this Hadoop Architecture presentation:
1. What is Hadoop?
2. Components of Hadoop
3. What is HDFS?
4. HDFS Architecture
5. Hadoop MapReduce
6. Hadoop MapReduce Example
7. Hadoop YARN
8. Demo on MapReduce
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Who should take up this Big Data and Hadoop Certification Training Course?
Big Data career opportunities are on the rise, and Hadoop is quickly becoming a must-know technology for the following professionals:
1. Software Developers and Architects
2. Analytics Professionals
3. Senior IT professionals
4. Testing and Mainframe professionals
5. Data Management Professionals
6. Business Intelligence Professionals
7. Project Managers
8. Aspiring Data Scientists
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Cutting-edge Hadoop clusters are bound to need custom (add-on) services that are not available in the Hadoop distribution of their choice. Agility is crucial for companies to integrate any service into existing large-scale Hadoop clusters with ease.
Apache Ambari manages the Hadoop cluster and solves this problem by extending the stack with add-on services, which can be a new Apache project, different Hadoop file system, or internal tool. This talk covers how to create a service definition in Ambari to manage lifecycle commands and configs, plus advanced topics like packaging, installing from multiple repositories, recommending and validating configs using Service Advisor, running custom commands, defining dependencies on configs and other services, and more. We will also cover how to create custom metrics and dashboards using Ambari Metric System and Grafana, generating alerts, and enabling security by authenticating with Kerberos.
Further, we will discuss the future of service definitions and how Ambari 3.0 will support custom services through Management Packs to enable Hadoop vendors to release software faster.
Speaker
Jayush Luniya, Principal Software Engineer, Hortonworks
This presentation briefly describes the key features of Apache Cassandra. It was given at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
Introduction to memcached, a caching service designed for optimizing performance and scaling in the web stack, seen from the perspective of MySQL/PHP users. Given for 2nd-year students of the professional bachelor in ICT at Kaho St. Lieven, Gent.
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop. Areas covered include pros and cons of different extract and load strategies, best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
How Apache Drives Music Recommendations At Spotify (Josh Baer)
The slides go through the high-level process of generating personalized playlists for all Spotify's users, using Apache big data products extensively.
Presentation given at Apache: Big Data Europe conference on September 29th, 2015 in Budapest.
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022 (Hosted by Confluent)
Back in 2016, Apache Hudi brought transactions and change capture on top of data lakes, what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions and optimized table metadata to data lakes, along with powerful storage layout optimizations, moving them closer to the cloud warehouses of today. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, serverless "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table, instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency control pitfalls in data lakes, and show how Hudi's hybrid approach, combining MVCC with optimistic concurrency control, lowers contention and unlocks minute-level near real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management such as cleaning older file versions, compaction of delta logs into base files, dynamic re-clustering for faster query performance, or the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg has sprung up. Along with the Hive Metastore, these table formats are trying to solve problems that have stood in the traditional data lake for a long time, with declared features like ACID, schema evolution, upsert, time travel, incremental consumption, etc.
Hadoop, Pig, and Twitter (NoSQL East 2009), Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ... (Simplilearn)
This video on Hadoop interview questions part-1 will take you through the general Hadoop questions and questions on HDFS, MapReduce and YARN, which are very likely to be asked in any Hadoop interview. It covers all the topics on the major components of Hadoop. This Hadoop tutorial will give you an idea about the different scenario-based questions you could face and some multiple-choice questions as well. Now, let us dive into this Hadoop interview questions video and gear up for your next Hadoop interview.
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
"An Elephan can't jump. But can carry heavy load".
Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations: Amazon.com, Apple, eBay, IBM, ImageShack, LinkedIn, Microsoft, Twitter, The New York Times...
We have entered an era of Big Data. Big data is, for the most part, a collection of data sets so large and complex that they are very hard to handle using on-hand database management tools. The main challenges with big databases include creation, curation, storage, sharing, querying, analysis and visualization. To deal with these databases we need highly parallel software. First of all, data is acquired from diverse sources, for example social media, traditional enterprise data or sensor data. Flume can be used to acquire data from social media such as Twitter. This data can then be organized using distributed file systems such as the Hadoop File System. These file systems are very efficient when the number of reads is high compared to writes.
These slides cover the very basics of Hadoop architecture, in particular HDFS. This was my presentation at the first Delhi Hadoop User Group (DHUG) meetup held in Gurgaon on 10th September 2011. Loved the positive feedback. I'll also upload a more elaborate version covering the Hadoop MapReduce architecture soon. Most of the stuff covered in these slides can be found in Tom White's book as well (see the last slide).
At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of our most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it. Surprisingly or not, Apache Hadoop generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner we use … also Hadoop! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing, when to buy new nodes, how to calculate the empirical retention policy for each dataset, optimize the scheduler, benchmark the cluster, find its biggest offenders (both people and datasets) and more.
Hadoop Playlist (Ignite talks at Strata + Hadoop World 2013), Adam Kawa
Link to video: http://www.youtube.com/watch?v=_GNbn4RzZcQ
A typical day of a data engineer at Spotify revolves around Hadoop and music. However, after some time of simultaneously developing MapReduce jobs, maintaining a cluster and listening to perfect music, something surprising might happen... What? Well, a data engineer starts discovering Hadoop concepts in the lyrics of many songs! How can Coldplay, The Black Eyed Peas or Michael Jackson sing about Hadoop? (more at blog: http://hakunamapdata.com/hadoop-playlist-at-spotify/)
Introduction To Elastic MapReduce at WHUG, Adam Kawa
Elastic MapReduce presentation given at the 2nd meeting of Warsaw Hadoop User Group.
Also watch the demonstration at www.youtube.com/watch?v=Azwilbn8GCs (it shows how to create a Hadoop cluster on Amazon Elastic MapReduce with Karmasphere Studio for EMR, a plugin for Eclipse, to launch big calculations quickly and easily).
Search and Society: Reimagining Information Access for Radical Futures (Bhaskar Mitra)
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
2. Why Data?
Get insights to offer a better product
“More data usually beats better algorithms”
Get insights to make better decisions
Avoid “guesstimates”
Take a competitive advantage
3. What Is Challenging?
Store data reliably
Analyze data quickly
Cost-effective way
Use expressible and high-level language
4. Fundamental Ideas
A big system of machines, not a big machine
Failures will happen
Move computation to data, not data to computation
Write complex code only once, but right
A system of multiple animals
5. Apache Hadoop
An open-source Java software
Storing and processing of very large data sets
A cluster of commodity machines
A simple programming model
6. Apache Hadoop
Two main components:
HDFS - a distributed file system
MapReduce – a distributed processing layer
Many other tools belong to “Apache Hadoop Ecosystem”
8. The Purpose Of HDFS
Store large datasets in a distributed, scalable and fault-tolerant way
High throughput
Very large files
Streaming reads and writes (no edits)
Write once, read many times
It is like a big truck to move heavy stuff (not a Ferrari)
9. HDFS Mis-Usage
Do NOT use, if you have
Low-latency requests
Random reads and writes
Lots of small files
Then better to consider
RDBMSs,
File servers,
HBase or Cassandra...
10. Splitting Files And Replicating Blocks
Split a very large file into smaller (but still large) blocks
Store them redundantly on a set of machines
11. Splitting Files Into Blocks
The default block size is 64MB
Today, 128MB or 256MB is recommended
Minimize the overhead of a disk seek operation (less than 1%)
A file is just “sliced” into chunks after each 64MB (or so)
It does NOT matter whether it is text/binary, compressed or not
It does matter later (when reading the data)
12. Replicating Blocks
The default replication factor is 3
It can be changed per file or directory (see the code sketch after this slide)
It can be increased for “hot” datasets (temporarily or permanently)
Trade-off between
Reliability, availability, performance
Disk space
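For illustration, here is a minimal Java sketch (assuming a reachable cluster; the path is borrowed from the shell examples later in the deck) of raising the replication factor of a single “hot” file; the shell equivalent is hadoop fs -setrep:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the Hadoop config files on the classpath
    FileSystem fs = FileSystem.get(new Configuration());
    // Temporarily keep more replicas of a "hot" dataset; affects only this file,
    // not the cluster-wide default replication factor
    fs.setReplication(new Path("/toplist/2013-05-15/poland.txt"), (short) 5);
  }
}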
13. Master And Slaves
The Master node keeps and manages all metadata information
The Slave nodes store blocks of data and serve them to the client
Master node (called NameNode)
Slave nodes (called DataNodes)
14. Classical* HDFS Cluster
*no NameNode HA, no HDFS Federation
NameNode: manages metadata
Secondary NameNode: does some “house-keeping” operations for the NameNode
DataNodes: store and retrieve blocks of data
15. HDFS NameNode
Performs all the metadata-related operations
Keeps information in RAM (for fast lookup)
The filesystem tree
Metadata for all files/directories (e.g. ownership, permissions)
Names and locations of blocks
Metadata (not all) is additionally stored on disk(s) (for reliability)
The filesystem snapshot (fsimage) + editlog (edits) files
16. HDFS DataNode
Stores and retrieves blocks of data
Data is stored as regular files on a local filesystem (e.g. ext4)
e.g. blk_-992391354910561645 (+ checksums in a separate file)
A block itself does not know which file it belongs to!
Sends a heartbeat message to the NN to say that it is still alive
Sends a block report to the NN periodically
17. HDFS Secondary NameNode
NOT a failover NameNode
Periodically merges a prior snapshot (fsimage) and editlog(s) (edits)
Fetches current fsimage and edits files from the NameNode
Applies edits to fsimage to create the up-to-date fsimage
Then sends the up-to-date fsimage back to the NameNode
We can configure frequency of this operation
Reduces the NameNode startup time
Prevents edits from becoming too large
18. Exemplary HDFS Commands
hadoop fs -ls -R /user/kawaa
hadoop fs -cat /toplist/2013-05-15/poland.txt | less
hadoop fs -put logs.txt /incoming/logs/user
hadoop fs -count /toplist
hadoop fs -chown kawaa:kawaa /toplist/2013-05-15/poland.avro
It is distributed, but it gives you a beautiful abstraction!
19. Reading A File From HDFS
Block data is never sent through the NameNode
The NameNode redirects a client to an appropriate DataNode
The NameNode chooses a DataNode that is as “close” as possible
$ hadoop fs -cat /toplist/2013-05-15/poland.txt
Block locations come from the NameNode
Lots of data comes from DataNodes to the client
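As a rough sketch of the client side of this read path (same file as above; error handling omitted): FileSystem.open() asks the NameNode for the block locations, and the returned stream then pulls the block data directly from DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CatFromHdfs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() contacts the NameNode for block locations;
    // reading the stream fetches data directly from DataNodes
    try (FSDataInputStream in = fs.open(new Path("/toplist/2013-05-15/poland.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}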
20. HDFS Network Topology
Network topology defined by an administrator in a supplied script
Convert IP address into a path to a rack (e.g /dc1/rack1)
A path is used to calculate distance between nodes
Image source: “Hadoop: The Definitive Guide” by Tom White
21. HDFS Block Placement
Pluggable (default in BlockPlacementPolicyDefault.java)
1st replica on the same node where a writer is located
Otherwise a “random” (but not too “busy” or almost full) node is used
2nd and the 3rd replicas on two different nodes in a different rack
The rest are placed on random nodes
No DataNode with more than one replica of any block
No rack with more than two replicas of the same block (if possible)
22. HDFS Balancer
Moves blocks from over-utilized DNs to under-utilized DNs
Maintains the block placement policy
Stops when HDFS is balanced: the utilization of every DN differs from the utilization of the cluster by no more than a given threshold
25. HDFS Block
Question
Why does a block itself NOT know which file it belongs to?
Answer
Design decision → simplicity, performance
Filename, permissions, ownership etc might change
It would require updating all block replicas that belong to a file
27. HDFS Metadata
Question
Why does the NN NOT store information about block locations on disk?
Answer
Design decision → simplicity
They are sent by DataNodes as block reports periodically
Locations of block replicas may change over time
A change in IP address or hostname of DataNode
Balancing the cluster (moving blocks between nodes)
Moving disks between servers (e.g. failure of a motherboard)
29. HDFS Replication
Question
How many files represent a block replica in HDFS?
Answer
Actually, two files:
The first file for the data itself
The second file for the block’s metadata (by default less than 1% of the actual data):
Checksums for the block data
The block’s generation stamp
30. HDFS Block Placement
Question
Why does the default block placement strategy NOT take the disk space utilization (%) into account?
It only checks whether a node
a) has enough disk space to write a block, and
b) does not serve too many clients ...
31. HDFS Block Placement
Question
Why does the default block placement strategy NOT take the disk space utilization (%) into account?
Answer
Some DataNodes might become overloaded by incoming data
e.g. a newly added node to the cluster
33. HDFS And Local File System
Runs on top of a native file system (e.g. ext3, ext4, xfs)
HDFS is simply a Java application that uses a native file system
34. HDFS Data Integrity
HDFS detects corrupted blocks
When writing
Client computes the checksums for each block
Client sends checksums to a DN together with data
When reading
Client verifies the checksums when reading a block
If verification fails, NN is notified about the corrupt replica
Then a DN fetches a different replica from another DN
35. HDFS NameNode Scalability
Stats based on Yahoo! Clusters
An average file ≈ 1.5 blocks (block size = 128 MB)
An average file ≈ 600 bytes in RAM (1 file and 2 blocks objects)
100M files ≈ 60 GB of metadata
1 GB of metadata ≈ 1 PB physical storage (but usually less*)
*Sadly, based on practical observations, the block to file ratio tends to decrease during the lifetime of the cluster (Dekel Tankel, Yahoo!)
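A quick back-of-the-envelope check of these figures, assuming that “physical storage” includes the default 3x replication: 1 GB of metadata / ~600 bytes per file ≈ 1.7M files; 1.7M files × 1.5 blocks × 128 MB ≈ 320 TB of logical data; 320 TB × 3 replicas ≈ 1 PB of raw disk.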
36. HDFS NameNode Performance
Read/write operations throughput limited by one machine
~120K read ops/sec
~6K write ops/sec
MapReduce tasks are also HDFS clients
Internal load increases as the cluster grows
More block reports and heartbeats from DataNodes
More MapReduce tasks sending requests
Bigger snapshots transferred from/to Secondary NameNode
37. HDFS Main Limitations
Single NameNode
Keeps all metadata in RAM
Performs all metadata operations
Becomes a single point of failure (SPOF)
38. HDFS Main Improvements
Introduce multiple NameNodes in form of:
HDFS Federation
HDFS High Availability (HA)
Find more: http://slidesha.re/15zZlet
41. Usually it means some kind of disk failure
$ ls /disk/hd12/
ls: reading directory /disk/hd12/: Input/output error
org.apache.hadoop.util.DiskChecker$DiskErrorException:
Too many failed volumes - current valid volumes: 11,
volumes configured: 12, volumes failed: 1, Volume failures tolerated: 0
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
Increase dfs.datanode.failed.volumes.tolerated
to avoid expensive block replication when a disk fails
(and just monitor failed disks)
44. Problem
A user cannot run resource-intensive Hive queries
It happened immediately after expanding the cluster
45. Description
The queries are valid
The queries are resource-intensive
The queries run successfully on a small dataset
But they fail on a large dataset
Surprisingly they run successfully through other user accounts
The user has right permissions to HDFS directories and Hive tables
46. The NameNode is throwing thousands of warnings and exceptions
14592 times in only 8 minutes (4768/min at peak)
47. Normally
Hadoop is a very trusty elephant
The username comes from the client machine (and is not verified)
The groupname is resolved on the NameNode server
Using the shell command 'id -Gn <username>'
If a user does not have an account on the NameNode server
The ExitCodeException exception is thrown
48. Possible Fixes
Create a user account on the NameNode server (dirty, insecure)
Use AD/LDAP for a user-group resolution
hadoop.security.group.mapping.ldap.* settings
If you also need the full-authentication, deploy Kerberos
49. Our Fix
We decided to use LDAP for a user-group resolution
However, LDAP settings in Hadoop did not work for us
Because posixGroup is not a supported filter group class in
hadoop.security.group.mapping.ldap.search.filter.group
We found a workaround using nsswitch.conf
50. Lesson Learned
Know who is going to use your cluster
Know who is abusing the cluster (HDFS access and MR jobs)
Parse the NameNode logs regularly
Look for FATAL, ERROR, Exception messages
Especially before and after expanding the cluster
52. MapReduce Model
Programming model inspired by functional programming
map() and reduce() functions processing <key, value> pairs
Useful for processing very large datasets in a distributed way
Simple, but very expressive
55. MapReduce Job
Input data is divided into splits and converted into <key, value> pairs
Invokes map() function multiple times
Invokes reduce() function multiple times
Keys are sorted, values not (but could be)
56. MapReduce Example: ArtistCount
Artist, Song, Timestamp, User
Key is the offset of the line from the beginning of the file
We could specify which artist goes to which reducer (HashPartitioner is the default one)
57. MapReduce Example: ArtistCount
map(Integer key, EndSong value, Context context):
context.write(value.artist, 1)
reduce(String key, Iterator<Integer> values, Context context):
int count = 0
for each v in values:
count += v
context.write(key, count)
Pseudo-code in a non-existing language ;) (a Java API version follows below)
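For reference, a minimal sketch of the same job written against the Hadoop Java API, assuming the endsong data arrives as plain tab-separated lines (Artist, Song, Timestamp, User) rather than a real EndSong record type:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ArtistCount {

  // With TextInputFormat: key in = byte offset of the line, value in = the line itself
  public static class ArtistMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text artist = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed TSV layout: artist <TAB> song <TAB> timestamp <TAB> user
      String[] fields = value.toString().split("\t");
      artist.set(fields[0]);
      context.write(artist, ONE);        // emit <artist, 1>
    }
  }

  public static class ArtistReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable count = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      count.set(sum);
      context.write(key, count);         // emit <artist, total play count>
    }
  }
}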
59. Data Locality in HDFS and MapReduce
By default, three replicas should be available somewhere on the cluster
Ideally, Mapper code is sent to a node that has the replica of this block
60. MapReduce Implementation
Batch processing system
Automatic parallelization and distribution of computation
Fault-tolerance
Deals with all messy details related to distributed processing
Relatively easy to use for programmers
Java API, Streaming (Python, Perl, Ruby …)
Apache Pig, Apache Hive, (S)Crunch
Status and monitoring tools
62. JobTracker Reponsibilities
Manages the computational resources
Available TaskTrackers, map and reduce slots
Schedules all user jobs
Schedules all tasks that belong to a job
Monitors tasks executions
Restarts failed and speculatively runs slow tasks
Calculates job counters totals
63. TaskTracker Reponsibilities
Runs map and reduce tasks
Reports to JobTracker
Heartbeats saying that it is still alive
Number of free map and reduce slots
Task progress, status, counters etc
65. MapReduce Job Submission
They are copied with a higher replication factor (by default, 10)
Image source: “Hadoop: The Definitive Guide” by Tom White
Tasks are started in a separate JVM to isolate user code from Hadoop code
66. MapReduce: Sort And Shuffle Phase
Map phase
Reduce phase
Other map tasks
Image source: “Hadoop: The Definitive Guide” by Tom White
Other reduce tasks
67. MapReduce: Partitioner
Specifies which Reducer should get a given <key, value> pair
Aim for an even distribution of the intermediate data
Skewed data may overload a single reducer
and make a job run much longer
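A minimal sketch of a custom Partitioner for the hypothetical ArtistCount job above. HashPartitioner already does essentially this hashing by default, so you would only plug in your own class when you need to control the key-to-reducer mapping (e.g. to spread out skewed keys):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reducer receives a given <artist, count> pair
public class ArtistPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is never negative
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered on the job with job.setPartitionerClass(ArtistPartitioner.class).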
68. Speculative Execution
Scheduling a redundant copy of the remaining, long-running task
The output from the one that finishes first is used
The other one is killed, since it is no longer needed
An optimization, not a feature to make jobs run more reliably
69. Speculative Execution
Enable, if tasks often experience “external” problems, e.g. hardware degradation (disk, network card), system problems, memory unavailability...
Otherwise
Speculative execution can reduce overall throughput
Redundant tasks run with similar speed as non-redundant ones
Might help one job, all the others have to wait longer for slots
Redundantly running reduce tasks will
transfer over the network all intermediate data
write their output redundantly (for a moment) to a directory in HDFS
71. Java API
Very customizable
Input/Output Format, Record Reader/Writer,
Partitioner, Writable, Comparator …
Unit testing with MRUnit
HPROF profiler can give a lot of insights
Reuse objects (especially keys and values) when possible
Split String efficiently e.g. StringUtils instead of String.split
More about the Hadoop Java API: http://slidesha.re/1c50IPk
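Since the slide mentions MRUnit, here is a hedged sketch of an in-process unit test for the hypothetical ArtistMapper shown earlier (no cluster needed; the input record is made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class ArtistMapperTest {

  @Test
  public void emitsArtistWithCountOne() throws Exception {
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new ArtistCount.ArtistMapper());

    // One input record (offset, line) should produce exactly one <artist, 1> pair
    driver.withInput(new LongWritable(0L),
                     new Text("Coldplay\tParadise\t1368576000\tkawaa"))
          .withOutput(new Text("Coldplay"), new IntWritable(1))
          .runTest();
  }
}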
72. MapReduce Job Configuration
Tons of configuration parameters
Input split size (~implies the number of map tasks)
Number of reduce tasks
Available memory for tasks
Compression settings
Combiner
Partitioner
and more...
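A hedged sketch of how a few of these knobs could be set through the Java API when submitting the hypothetical ArtistCount job; the paths and numbers are made up, and the reducer is reused as a combiner only because the count aggregation is associative and commutative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ArtistCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "artist-count");
    job.setJarByClass(ArtistCountDriver.class);

    job.setMapperClass(ArtistCount.ArtistMapper.class);
    job.setCombinerClass(ArtistCount.ArtistReducer.class);  // local pre-aggregation
    job.setReducerClass(ArtistCount.ArtistReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input split size (~implies the number of map tasks)
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    // Number of reduce tasks
    job.setNumReduceTasks(50);
    // Compress the final output
    FileOutputFormat.setCompressOutput(job, true);

    FileInputFormat.addInputPath(job, new Path("/incoming/logs/endsong"));
    FileOutputFormat.setOutputPath(job, new Path("/toplist/artist-count"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}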
74. MapReduce Input <Key, Value> Pairs
Question
Why is each line in a text file, by default, converted to <offset, line> instead of <line_number, line>?
75. MapReduce Input <Key, Value> Pairs
Question
Why is each line in a text file, by default, converted to <offset, line> instead of <line_number, line>?
Answer
If your lines are not fixed-width, you need to read the file from the beginning to the end to find the line_number of each line (thus it cannot be parallelized).
77. “I noticed that a bottleneck seems to be coming from the map tasks. Is there any reason that we can't open any of the allocated reduce slots to map tasks?”
Regards,
Chris
How to hard-code the number of map and reduce slots efficiently?
78. This may change again soon ...
We initially started with 60/40
But today we are closer to something like 70/30
(Charts: Time Spent In Occupied Slots; Occupied Map And Reduce Slots)
79. We are currently introducing a new feature to Luigi
Automatic settings of
Maximum input split size (~implies the number of map tasks)
Number of reduce tasks
More settings soon (e.g. size of map output buffer)
The goal is each task running 5-15 minutes on average
Because even a perfect manual setting may become outdated as the input size grows
80. The current PoC ;)
type  | # map | # reduce | avg map time | avg reduce time     | job execution time
old_1 | 4826  | 25       | 46sec        | 1hrs, 52mins, 14sec | 2hrs, 52mins, 16sec
new_1 | 391   | 294      | 4mins, 46sec | 8mins, 24sec        | 23mins, 12sec
old_2 | 4936  | 800      | 7mins, 30sec | 22mins, 18sec       | 5hrs, 20mins, 1sec
new_2 | 4936  | 1893     | 8mins, 52sec | 7mins, 35sec        | 1hrs, 18mins, 29sec
It should help in extreme cases:
short-living maps
short-living and long-living reduces
83. Only 1 task failed, but 2x more tasks were killed than were completed
85. Logs show that the JobTracker gets a request to kill the tasks
Who actually can send a kill request?
User (using e.g. mapred job -kill-task)
JobTracker (a speculative duplicate, or when a whole job fails)
Fair Scheduler
Diplomatically, it's called “preemption”
86. Key Observations
Killed tasks came from ad-hoc and resource-intensive Hive queries
Tasks are usually killed quickly after they start
Surviving tasks are running fine for long time
Hive queries are running in their own Fair Scheduler's pool
87. Eureka!
FairScheduler has a license to kill!
Preempt the newest tasks in an over-share pool to forcibly make some room for starving pools
88. Hive pool was running over its minimum and fair shares
Other pools were running under their minimum and fair shares
So that
Fair Scheduler was (legally) killing Hive tasks from time to time
Fair Scheduler can kill to be KIND...
89. Possible Fixes
Disable the preemption
Tune minimum shares based on your workload
Tune preemption timeouts based on your workload
Limit the number of map/reduce tasks in a pool
Limit the number of jobs in a pool
Switch to Capacity Scheduler
90. Lessons Learned
A scheduler should NOT be considered a “black box”
It is so easy to implement a long-running Hive query