The document describes the Hadoop ecosystem and its core components. It discusses HDFS, which stores large files across clusters and is made up of a NameNode and DataNodes. It also discusses MapReduce, which allows distributed processing of large datasets using a map and reduce function. Other components discussed include Hive, Pig, Impala, and Sqoop.
This presentation about Hadoop architecture will help you understand the architecture of Apache Hadoop in detail. In this video, you will learn what Hadoop is, the components of Hadoop, what HDFS is, HDFS architecture, Hadoop MapReduce, a Hadoop MapReduce example, Hadoop YARN, and finally a demo on MapReduce. Apache Hadoop offers a versatile, adaptable, and reliable distributed computing big data framework for a group of systems, each contributing storage capacity and local computing power. After watching this video, you will also understand the Hadoop Distributed File System and its features, along with the practical implementation.
Below are the topics covered in this Hadoop Architecture presentation:
1. What is Hadoop?
2. Components of Hadoop
3. What is HDFS?
4. HDFS Architecture
5. Hadoop MapReduce
6. Hadoop MapReduce Example
7. Hadoop YARN
8. Demo on MapReduce
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand Resilient Distributed Datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Who should take up this Big Data and Hadoop Certification Training Course?
Big Data career opportunities are on the rise, and Hadoop is quickly becoming a must-know technology for the following professionals:
1. Software Developers and Architects
2. Analytics Professionals
3. Senior IT professionals
4. Testing and Mainframe professionals
5. Data Management Professionals
6. Business Intelligence Professionals
7. Project Managers
8. Aspiring Data Scientists
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for the deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (akin to RAID at a server level).
What is Hadoop? The data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10 GB, then 100 GB, and you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. When your data grows to 10 TB, and then 100 TB, you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!
Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Hadoop is also not suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data; it complements OnLine Transaction Processing and OnLine Analytical Processing.
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
( Hadoop Training: https://www.edureka.co/hadoop )
This Edureka "What is Hadoop" tutorial (Hadoop Blog series: https://goo.gl/LFesy8) helps you understand how Big Data emerged as a problem and how Hadoop solved it. This tutorial discusses Hadoop architecture, HDFS and its architecture, YARN, and MapReduce in detail. Below are the topics covered in this tutorial:
1) 5 V’s of Big Data
2) Problems with Big Data
3) Hadoop-as-a solution
4) What is Hadoop?
5) HDFS
6) YARN
7) MapReduce
8) Hadoop Ecosystem
This Hadoop ecosystem video will help you understand the different tools present in the Hadoop ecosystem. It will take you through an overview of the important tools of the Hadoop ecosystem, which include Hadoop HDFS, Hadoop Pig, Hadoop YARN, Hadoop Hive, Apache Spark, Mahout, Apache Kafka, Storm, Sqoop, Apache Ranger, and Oozie, and it will also discuss the architecture of these tools. It will cover the different tasks of Hadoop such as data storage, data processing, cluster resource management, data ingestion, machine learning, streaming, and more. Now, let us get started and understand each of these tools in detail.
Below topics are explained in this Hadoop ecosystem presentation:
1. What is Hadoop ecosystem?
1. Pig (Scripting)
2. Hive (SQL queries)
3. Apache Spark (Real-time data analysis)
4. Mahout (Machine learning)
5. Apache Ambari (Management and monitoring)
6. Kafka & Storm
7. Apache Ranger & Apache Knox (Security)
8. Oozie (Workflow system)
9. Hadoop MapReduce (Data processing)
10. Hadoop Yarn (Cluster resource management)
11. Hadoop HDFS (Data storage)
12. Sqoop & Flume (Data collection and ingestion)
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand Resilient Distributed Datasets (RDDs) in detail
12. Implement and build Spark applications
13. Learn Spark SQL, creating, transforming, and querying Data frames
14. Understand the common use-cases of Spark and the various interactive algorithms
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial | Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn about the different components in Spark and how Spark works with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn's Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Hadoop | Simplilearn
This presentation about Hive will help you understand the history of Hive, what Hive is, Hive architecture, data flow in Hive, Hive data modeling, Hive data types, the different modes in which Hive can run, differences between Hive and RDBMS, features of Hive, and a demo on HiveQL commands. Hive is a data warehouse system used for querying and analyzing large datasets stored in HDFS. Hive uses a query language called HiveQL, which is similar to SQL. Hive provides a SQL abstraction so that SQL-like queries (HiveQL) can be integrated with Java without the necessity of implementing queries in the low-level Java API. Now, let us get started and understand Hadoop Hive in detail.
Below topics are explained in this Hive presentation:
1. History of Hive
2. What is Hive?
3. Architecture of Hive
4. Data flow in Hive
5. Hive data modeling
6. Hive data types
7. Different modes of Hive
8. Difference between Hive and RDBMS
9. Features of Hive
10. Demo on HiveQL
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand Resilient Distributed Datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
Hadoop Institutes: Kelly Technologies is one of the best Hadoop training institutes in Hyderabad, providing Hadoop training by real-time faculty in Hyderabad.
http://www.kellytechno.com/Hyderabad/Course/Hadoop-Training
Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses by real-time faculty in Bangalore.
Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computation and can also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
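To make the convergence-skipping idea concrete, here is a minimal, hypothetical Python sketch of power-iteration PageRank that stops updating vertices once their rank change falls below a tolerance; the example graph, damping factor, and tolerance are illustrative assumptions, not details taken from the deck being described.

# Toy PageRank with per-vertex convergence skipping (illustrative sketch only).
def pagerank(graph, damping=0.85, tol=1e-6, max_iters=100):
    # graph: dict mapping each vertex to the list of vertices it links to.
    vertices = list(graph)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    converged = set()              # vertices whose rank we stop recomputing
    for _ in range(max_iters):
        # Contribution each vertex pushes to its out-neighbours this iteration.
        contrib = {v: 0.0 for v in vertices}
        for v in vertices:
            out = graph[v]
            if out:
                share = rank[v] / len(out)
                for w in out:
                    contrib[w] += share
        changed = False
        for v in vertices:
            if v in converged:
                continue           # skip work for already-converged vertices
            new_rank = (1.0 - damping) / n + damping * contrib[v]
            if abs(new_rank - rank[v]) < tol:
                converged.add(v)
            else:
                changed = True
            rank[v] = new_rank
        if not changed:
            break                  # every vertex has converged
    return rank

# Example: a tiny four-page web graph.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}))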
4. HDFS
Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
- Large data files are distributed into blocks.
- Blocks are managed by different nodes in the cluster.
- Each block is replicated on multiple nodes.
- The name node stores metadata information about files and blocks.
5. Hadoop Distributed File System (HDFS)
- Centralized namenode: maintains metadata information about files.
- Many datanodes (thousands): store the actual data; files are divided into blocks, and each block is replicated N times (default = 3).
(Diagram: a file F is divided into 64 MB blocks 1-5 spread across datanodes.)
6. HDFS Architecture
HDFS consists of a name node and data nodes.
- Name node: remembers where the data is stored in the cluster; it is the master node through which clients must initiate reads and writes.
- Data node: stores the actual data in the cluster.
7. Name Node
- Holds metadata information about a file: file name, permissions, directory, and which nodes contain which blocks.
- A disk backup of the metadata is very important: if you lose the name node, you lose HDFS.
9. HDFS: Comparing Versions
- Disaster/failure handling: in HDFS 1.0 the name node is a single point of failure; HDFS 2.0 provides name node high availability.
- Resource management: HDFS 1.0 couples the resource manager with MapReduce; HDFS 2.0 uses a resource manager with YARN.
- Scalability and performance: HDFS 1.0 suffers with larger clusters; HDFS 2.0 scales and performs well with larger clusters.
10. Fault Tolerance
HDFS was built under the premise that hardware will fail.
- Ensures that when hardware fails, users can still have their data available.
- Achieved by storing multiple copies throughout the cluster.
14. What's MapReduce?
- A programming model for expressing distributed computations at a massive scale.
- A patented software framework introduced by Google; it processes 20 petabytes of data per day.
- Popularized by the open-source Hadoop project.
- Used at Yahoo!, Facebook, Amazon, and elsewhere.
16. MapReduce Core Functionality (I)
Code is usually written in Java, though it can be written in other languages with the Hadoop Streaming API.
Two fundamental components:
• Map step: the master node takes a large problem, slices it into smaller sub-problems, and distributes these to worker nodes. A worker node may do this again if necessary. Each worker processes its smaller problem and hands the result back to the master.
• Reduce step: the master node takes the answers to the sub-problems and combines them in a predefined way to get the output/answer to the original problem.
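Since the slide mentions that non-Java languages can be used via the Hadoop Streaming API, here is a minimal, hedged Python sketch of a streaming-style word count: a mapper that emits (word, 1) pairs and a reducer that sums counts from sorted input. The script name and the command-line flag are illustrative assumptions, not part of the original deck.

#!/usr/bin/env python3
# word_count_streaming.py - illustrative sketch for a Hadoop Streaming job.
import sys

def mapper(lines):
    # Map step: emit a (word, 1) pair for every word on every input line.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer(lines):
    # Reduce step: input arrives sorted by key, so counts for the same
    # word are adjacent and can be summed in a single pass.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    if len(sys.argv) > 2 and sys.argv[1] == "--mode" and sys.argv[2] == "reduce":
        reducer(sys.stdin)
    else:
        mapper(sys.stdin)

Locally, the same dataflow can be tested with a shell pipeline such as: cat input.txt | python3 word_count_streaming.py | sort | python3 word_count_streaming.py --mode reduce.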
18. Input Reader
The input reader reads a block and divides it into splits; each split is sent to a map function, and a line is one input of the map function. The key could be some internal number (filename - block id - line id); the value is the content of the text line.
Example input: Block 1 holds the lines "Apple Orange Mongo" and "Orange Grapes Plum"; Block 2 holds "Apple Plum Mongo" and "Apple Apple Plum".
19. Mapper: Map Function
The mapper takes the output generated by the input reader and outputs a list of intermediate <key, value> pairs. For the example, mappers m1-m4 (one per line) emit: (Apple, 1), (Orange, 1), (Mongo, 1); (Orange, 1), (Grapes, 1), (Plum, 1); (Apple, 1), (Plum, 1), (Mongo, 1); and (Apple, 1), (Apple, 1), (Plum, 1).
20. Reducer: Reduce Function
The reducer takes the output generated by the mapper, aggregates the values for each key, and outputs the final result. There is a shuffle/sort step before reducing: the intermediate pairs are grouped by key, so reducers r1-r5 receive (Apple, [1,1,1,1]), (Orange, [1,1]), (Grapes, [1]), (Mongo, [1,1]), and (Plum, [1,1,1]) and emit (Apple, 4), (Orange, 2), (Grapes, 1), (Mongo, 2), and (Plum, 3).
23. MapReduce: Execution Details
Input reader
Divide input into splits, assign each split to a Map task.
Map task
Apply the Map function to each record in the split.
Each Map function returns a list of (key, value) pairs.
Shuffle/Partition and Sort
Shuffle distributes sorting & aggregation to many reducers.
All records for key k are directed to the same reduce processor.
Sort groups the same keys together, and prepares for aggregation.
Reduce task
Apply the Reduce function to each key.
The result of the Reduce function is a list of (key, value) pairs.
25. MapReduce - Group AVG Example
Input data: New York, US, 10; Los Angeles, US, 40; London, GB, 20; Berlin, DE, 60; Glasgow, GB, 10; Munich, DE, 30; ...
MAP(k, v) produces the intermediate (K, V) pairs (US, 10), (US, 40), (GB, 20), (GB, 10), (DE, 60), (DE, 30).
REDUCE(k, list(v)) produces the result: (DE, 45), (GB, 15), (US, 25).
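To make the dataflow concrete, here is a minimal in-memory Python simulation of this Group AVG job, assuming the records shown on the slide; it is a sketch of the map, shuffle, and reduce steps, not Hadoop code.

from collections import defaultdict

records = [
    ("New York", "US", 10), ("Los Angeles", "US", 40),
    ("London", "GB", 20), ("Berlin", "DE", 60),
    ("Glasgow", "GB", 10), ("Munich", "DE", 30),
]

# Map: emit a (country, value) pair for each record.
intermediate = [(country, value) for _, country, value in records]

# Shuffle/sort: group all values by key.
groups = defaultdict(list)
for country, value in intermediate:
    groups[country].append(value)

# Reduce: average the values for each key.
result = {country: sum(vals) / len(vals) for country, vals in groups.items()}
print(result)   # {'US': 25.0, 'GB': 15.0, 'DE': 45.0}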
26. Map-Reduce Execution Engine (Example: Color Count)
Input blocks on HDFS feed the map tasks, which parse/hash each record and produce (k, v) pairs such as (color, 1). Shuffle & sorting based on k routes all pairs for a key to the same reduce task, which consumes (k, [v]) such as (color, [1, 1, 1, 1, 1, 1, ...]) and produces (k', v') such as (color, 100).
Users only provide the "Map" and "Reduce" functions.
27. Properties of MapReduce Engine
The JobTracker is the master node (it runs with the namenode):
- Receives the user's job.
- Decides how many tasks will run (the number of mappers).
- Decides where to run each mapper (concept of locality). For example, if a file has 5 blocks, 5 map tasks run, and the task reading block 1 should be placed on a node that holds a replica of that block (e.g., Node 1 or Node 3 in the diagram).
28. Properties of MapReduce Engine (Cont'd)
The TaskTracker is the slave node (it runs on each datanode):
- Receives tasks from the JobTracker.
- Runs each task until completion (either a map or a reduce task).
- Is always in communication with the JobTracker, reporting progress.
In the diagram's example, one map-reduce job consists of 4 map tasks and 3 reduce tasks.
29. Example - Word Count
Input text containing the words Hello, Cloud, TA, and cool is split across three mappers, each of which emits (word, 1) pairs. Sort/copy and merge group the pairs by word (Hello [1,1], TA [1,1], Cloud [1], cool [1,1]), and two reducers produce the output: Hello 2, TA 2, Cloud 1, cool 2.
31. Example 2: Color Count
Job: count the number of each color in a data set.
As in the execution-engine diagram, map tasks read input blocks from HDFS, parse/hash each record, and produce (color, 1) pairs; shuffle & sorting on the key sends all pairs for a color to the same reduce task, which consumes (color, [1, 1, 1, 1, 1, 1, ...]) and produces (color, count).
The output file has 3 parts (Part0001-Part0003), probably on 3 different machines.
32. Example 3: Color Filter
Job: select only the blue and the green colors.
Each map task reads input blocks from HDFS, selects only the blue or green records, and writes its output directly to HDFS; there is no need for a reduce phase.
The output file has 4 parts (Part0001-Part0004), probably on 4 different machines.
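A minimal sketch of such a map-only filter in Hadoop Streaming style, assuming one color per input line; the script name and record format are illustrative assumptions.

#!/usr/bin/env python3
# color_filter_mapper.py - map-only job: keep only blue and green records.
import sys

KEEP = {"blue", "green"}

for line in sys.stdin:
    color = line.strip().lower()
    if color in KEEP:
        # With no reducers configured, mapper output is written straight to HDFS.
        print(color)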
33. Word Count Execution
Input: "the quick brown fox", "the fox ate the mouse", "how now brown cow". In the map phase, each map task emits (word, 1) pairs for its split. Shuffle & sort groups the pairs by word, and the reduce phase outputs: brown 2, fox 2, how 1, now 1, the 3, ate 1, cow 1, mouse 1, quick 1.
34. Word Count with Combiner
Same input, but each map task combines its own output before the shuffle: for example, the map task that sees "the" twice emits (the, 2) instead of two separate (the, 1) pairs. Shuffle & sort and the reduce phase then produce the same final result: brown 2, fox 2, how 1, now 1, the 3, ate 1, cow 1, mouse 1, quick 1.
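A small Python sketch of combiner-style local aggregation for the streaming word count, assuming the same line-oriented input. In a real Hadoop job the combiner is usually the reducer applied to each map task's output; this in-mapper variant only illustrates the effect of pre-aggregating before the shuffle.

#!/usr/bin/env python3
# word_count_mapper_with_combining.py - in-mapper combining (illustrative).
import sys
from collections import Counter

local_counts = Counter()
for line in sys.stdin:
    local_counts.update(line.strip().split())

# Emit one pre-aggregated pair per word seen by this mapper,
# reducing the volume of data sent through shuffle & sort.
for word, count in local_counts.items():
    print(f"{word}\t{count}")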
37. Why Hive?
Problem: data, data, and more data - 200 GB per day in March 2008, growing to 1 TB compressed per day today.
The Hadoop experiment. Problem: Map/Reduce (MR) is great, but not everyone is a Map/Reduce expert ("I know SQL and I am a Python and PHP expert").
So what do we do? HIVE.
38. What is HIVE?
• A system for querying and managing structured data built on top of Map/Reduce and Hadoop.
• MapReduce (MR) is very low level and requires customers to write custom programs.
• HIVE supports queries expressed in a SQL-like language called HiveQL, which are compiled into MR jobs that are executed on Hadoop.
• Data model: Hive structures data into well-understood database concepts such as tables, rows, and columns. It supports primitive types: integers, floats, doubles, and strings.
39. Hive Components
- Shell interface: like the MySQL shell.
- Driver: session handles, fetch, execution.
- Compiler: parse, plan, optimize.
- Execution engine: builds a DAG of stages and runs the map or reduce jobs.
41. Architecture
(Diagram: the Web UI, Hive CLI, and JDBC/ODBC clients issue browse, query, and DDL requests; Hive QL passes through the Parser, Planner, and Optimizer to Execution on top of MapReduce and HDFS; the MetaStore is reached through the Thrift API; SerDes (CSV, Thrift, Regex), UDFs/UDAFs (substr, sum, average), file formats (TextFile, SequenceFile, RCFile), and user-defined map-reduce scripts plug into the pipeline.)
42. Hive Metastore
Stores Hive metadata. The default metastore database uses Apache Derby.
Various configurations:
- Embedded (in-process metastore, in-process database): mainly for unit tests; only one process can connect to the metastore at a time.
- Local (in-process metastore, out-of-process database): each Hive client connects to the metastore database directly.
- Remote (out-of-process metastore, out-of-process database): each Hive client connects to a metastore server, which connects to the metadata database itself. The metastore server and client communicate using the Thrift protocol.
43. Hive Warehouse
- Hive tables are stored in the Hive "warehouse"; the default HDFS location is /user/hive/warehouse.
- Tables are stored as sub-directories in the warehouse directory, and partitions are sub-directories of tables.
- External tables are supported in Hive.
- The actual data is stored in flat files.
44. Hive Schemas
Hive is schema-on-read:
- The schema is only enforced when the data is read (at query time).
- This allows greater flexibility: the same data can be read using multiple schemas.
Contrast with an RDBMS, which is schema-on-write:
- The schema is enforced when the data is loaded.
- This speeds up queries at the expense of load times.
45. Data Hierarchy
Hive is organised hierarchically into:
- Databases: namespaces that separate tables and other objects.
- Tables: homogeneous units of data with the same schema; analogous to tables in an RDBMS.
- Partitions: determine how the data is stored and allow efficient access to subsets of the data.
- Buckets/clusters: for subsampling within a partition and for join optimization.
46. HiveQL
HiveQL (HQL) provides the basic SQL-like operations:
- Select columns using SELECT.
- Filter rows using WHERE.
- JOIN between tables.
- Evaluate aggregates using GROUP BY.
- Store query results into another table.
- Download results to a local directory (i.e., export from HDFS).
- Manage tables and queries with CREATE, DROP, and ALTER.
47. Primitive Data Types
- TINYINT, SMALLINT, INT, BIGINT: 1-, 2-, 4-, and 8-byte integers.
- BOOLEAN: TRUE/FALSE.
- FLOAT, DOUBLE: single- and double-precision real numbers.
- STRING: character string.
- TIMESTAMP: Unix-epoch offset or datetime string.
- DECIMAL: arbitrary-precision decimal.
- BINARY: binary data.
48. Complex Data Types
- STRUCT: a collection of elements. If S is of type STRUCT {a INT, b INT}, then S.a returns element a.
- MAP: a key-value tuple. If M is a map from 'group' to GID, then M['group'] returns the value of GID.
- ARRAY: an indexed list. If A is an array of elements ['a','b','c'], then A[0] returns 'a'.
49. Create Table
CREATE TABLE is the statement used to create a table in Hive; its syntax and an example follow.
50. Create Table Example
Let us assume you need to create a table named employee using a CREATE TABLE statement. The employee table has the following fields and data types:
- eid: int
- name: String
- salary: Float
- designation: String
51. Create Table Example
The following query creates the table named employee (a representative sketch is given below). If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already exists. On successful creation of the table, Hive responds with OK.
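The original slide showed the query as a screenshot; the following HiveQL is a representative sketch consistent with the field list above (the comment and the delimiter clauses are illustrative assumptions):

CREATE TABLE IF NOT EXISTS employee (
  eid INT,
  name STRING,
  salary FLOAT,
  designation STRING)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;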
52. Load Data
Generally, after creating a table in SQL we can insert data using the INSERT statement. In Hive, we can instead load data using the LOAD DATA statement.
- LOCAL is an identifier to specify the local path; it is optional.
- OVERWRITE is optional and overwrites the data already in the table.
- PARTITION is optional.
53. Load Data Example
We will insert the contents of a text file named sample.txt, located in the /home/user directory, into the table. The following query loads the given text into the table (a representative sketch is given below); on a successful load, Hive responds with OK.
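The load statement itself was a screenshot in the original deck; a representative HiveQL sketch for the scenario above is:

LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE employee;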
55. Select-Where Example
Assume we have the employee table with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the details of employees who earn a salary of more than Rs 30000. The following query retrieves the employee details for this scenario (a representative sketch is given below):
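The query screenshot is not reproduced here; a representative HiveQL sketch, using the column names listed above, is:

SELECT * FROM employee
WHERE Salary > 30000;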
56. Select-Where Example
On successful execution of the query, Hive returns the matching employee rows.
58. HiveQL Limitations
- HQL only supports equi-joins, outer joins, and left semi-joins.
- Because it is only a shell over MapReduce, complex queries can be hard to optimise.
- Missing large parts of the full SQL specification: correlated sub-queries, sub-queries outside FROM clauses, updatable or materialized views, and stored procedures.
59. External Table
CREATE EXTERNAL TABLE page_view_stg
(viewTime INT,
 userid BIGINT,
 page_url STRING,
 referrer_url STRING,
 ip STRING COMMENT 'IP Address of the User')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/staging/page_view';
60. Browsing Tables and Partitions
- SHOW TABLES; shows all the tables in the database.
- SHOW TABLES 'page.*'; shows tables matching the specification (uses regex syntax).
- SHOW PARTITIONS page_view; shows the partitions of the page_view table.
- DESCRIBE page_view; lists the columns of the table.
- DESCRIBE EXTENDED page_view; gives more information on columns (useful only for debugging).
- DESCRIBE page_view PARTITION (ds='2008-10-31'); lists information about a partition.
61. Loading Data
Use LOAD DATA to load data from a file or directory.
- Will read from HDFS unless the LOCAL keyword is specified.
- Will append data unless OVERWRITE is specified.
- PARTITION is required if the destination table is partitioned.
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-8_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (date='2008-06-08', country='US')
62. Inserting Data
Use INSERT to load data from a Hive query.
- Will append data unless OVERWRITE is specified.
- PARTITION is required if the destination table is partitioned.
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view
PARTITION (dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid,
       pvs.page_url, pvs.referrer_url
WHERE pvs.country = 'US';
64. What is Apache Pig?
Pig is a high-level platform for creating MapReduce programs; it is a tool/platform used to analyze larger sets of data, representing them as data flows.
Pig is made up of two components:
- Pig Latin.
- The runtime environment.
65. Why Apache Pig?
Programmers who are not so good at Java normally used to struggle working with Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all such programmers: using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex Java code. Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
66. Features of Pig
- Rich set of operators: it provides many operators to perform operations like join, sort, filter, etc.
- Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
- Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured, and stores the results in HDFS.
- UDFs: Pig provides the facility to create User-Defined Functions in other programming languages such as Java, and to invoke or embed them in Pig scripts.
67. Apache Pig vs MapReduce
- Apache Pig is a data flow language; MapReduce is a data processing paradigm.
- Pig is a high-level language; MapReduce is low level and rigid.
- Performing a join operation in Apache Pig is pretty simple; it is quite difficult in MapReduce to perform a join between datasets.
- Any programmer with a basic knowledge of SQL can work conveniently with Apache Pig; exposure to Java is a must to work with MapReduce.
- Apache Pig uses a multi-query approach, reducing the length of the code to a great extent; MapReduce requires almost 20 times more lines of code to perform the same task.
- There is no need for compilation: on execution, every Apache Pig operator is converted internally into a MapReduce job; MapReduce jobs have a long compilation process.
68. Apache Pig vs Hive
- Apache Pig uses a language called Pig Latin and was originally created at Yahoo; Hive uses a language called HiveQL and was originally created at Facebook.
- Pig Latin is a data flow language; HiveQL is a query processing language.
- Pig Latin is a procedural language and fits the pipeline paradigm; HiveQL is a declarative language.
- Apache Pig can handle structured, unstructured, and semi-structured data; Hive is mostly for structured data.
69. Apache Pig - Architecture
Apache Pig converts scripts into a series of MapReduce jobs, making the programmer's job easy.
- Parser: checks the syntax, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) representing the Pig Latin statements and logical operators.
- Optimizer: the logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
70. Apache Pig - Architecture (continued)
- Compiler: compiles the optimized logical plan into a series of MapReduce jobs.
- Execution engine: finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.
71. Pig Latin Data Model
The data model of Pig Latin is fully nested, and it allows complex non-atomic data types such as map and tuple.
- A bag is a collection of tuples.
- A tuple is an ordered set of fields.
- A field is a piece of data.
72. Pig Latin Statements
Basic constructs: these statements work with relations; they include expressions and schemas. Every statement ends with a semicolon (;). Pig Latin statements take a relation as input and produce another relation as output.
Pig Latin example:
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',') AS
       (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
73. Pig Latin Data Types
- int: a signed 32-bit integer. Example: 8
- long: a signed 64-bit integer. Example: 5L
- float: a signed 32-bit floating point number. Example: 5.5F
- double: a 64-bit floating point number. Example: 10.5
- chararray: a character array (string) in Unicode UTF-8 format. Example: 'tutorials point'
- bytearray: a byte array (blob).
- boolean: a Boolean value. Example: true/false
- datetime: a date-time. Example: 1970-01-01T00:00:00.000+00:00
- biginteger: a Java BigInteger. Example: 60708090709
- bigdecimal: a Java BigDecimal. Example: 185.98376256272893883
74. Pig Latin Complex Types
- Tuple: an ordered set of fields. Example: (raja, 30)
- Bag: a collection of tuples. Example: {(raju,30),(Mohhammad,45)}
- Map: a set of key-value pairs. Example: ['name'#'Raju', 'age'#30]
75. Apache Pig FILTER Operator
The FILTER operator is used to select the required tuples from a relation based on a condition.
Example: assume that we have a file named student_details.txt in the HDFS directory /pig_data/.
76. FILTER Operator Example
We have loaded this file into Pig with the relation name student_details as shown below:
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
       PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
       city:chararray);
Now use the FILTER operator to get the details of the students who belong to the city Chennai, and verify the relation filter_data using the DUMP operator (a sketch is given below); it will display the contents of the relation filter_data.
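The filter and DUMP statements were screenshots in the original deck; a representative Pig Latin sketch consistent with the text is:

grunt> filter_data = FILTER student_details BY city == 'Chennai';
grunt> DUMP filter_data;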
77. Apache Pig DISTINCT Operator
The DISTINCT operator is used to remove redundant (duplicate) tuples from a relation.
Example: assume the same file student_details.txt in the HDFS directory /pig_data/, loaded into Pig as the relation student_details.
78. DISTINCT Operator Example
Remove the redundant (duplicate) tuples from the relation named student_details using the DISTINCT operator, then verify the relation distinct_data using the DUMP operator (a sketch is given below); it will display the contents of the relation distinct_data.
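A representative Pig Latin sketch for the step above (the statements were screenshots in the original deck):

grunt> distinct_data = DISTINCT student_details;
grunt> DUMP distinct_data;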
79. Apache Pig GROUP Operator
The GROUP operator is used to group the data in one or more relations; it collects the data having the same key.
Example: assume the same file student_details.txt in the HDFS directory /pig_data/, loaded into Pig as the relation student_details.
80. GROUP Operator Example
Group the records/tuples in the relation by age, then verify the relation group_data using the DUMP operator (a sketch is given below); it will display the contents of the relation group_data.
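A representative Pig Latin sketch for the step above (the statements were screenshots in the original deck):

grunt> group_data = GROUP student_details BY age;
grunt> DUMP group_data;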
81. Apache Pig JOIN Operator
The JOIN operator is used to combine records from two or more relations. Joins can be of the following types:
- Self-join: used to join a table with itself.
- Inner join: returns rows when there is a match in both tables.
- Left outer join: returns all rows from the left table, even if there are no matches in the right relation.
- Right outer join: returns all rows from the right table, even if there are no matches in the left table.
- Full outer join: returns rows when there is a match in one of the relations.
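The deck does not show a join example; the following hypothetical Pig Latin sketch joins two assumed relations, customers and orders, on a shared customer id (the relation and field names are illustrative only, not from the deck):

grunt> customers = LOAD 'customers.txt' USING PigStorage(',') AS (id:int, name:chararray);
grunt> orders = LOAD 'orders.txt' USING PigStorage(',') AS (oid:int, customer_id:int, amount:int);
grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
grunt> DUMP customer_orders;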
83. What is Impala?
- Cloudera Impala is a query engine that runs on Apache Hadoop.
- Its query language is similar to HiveQL.
- It does not use MapReduce and is optimized for low-latency queries.
- It is an open source Apache project, developed by Cloudera.
- It is much faster than Hive or Pig.
84. Comparing Pig, Hive, and Impala
- SQL-based query language: Pig no; Hive yes; Impala yes.
- Schema: Pig optional; Hive required; Impala required.
- Process data with external scripts: Pig yes; Hive yes; Impala no.
- Extensible file format support: Pig yes; Hive yes; Impala no.
- Query speed: Pig slow; Hive slow; Impala fast.
- Accessible via ODBC/JDBC: Pig no; Hive yes; Impala yes.
86. What is Sqoop?
A command-line interface for transferring data between relational databases and Hadoop.
- Supports incremental imports.
- Imports are used to populate tables in Hadoop.
- Exports are used to put data from Hadoop into a relational database such as SQL Server.
(Diagram: Sqoop sits between Hadoop and the RDBMS, moving data in both directions.)
89. Sqoop - Example
An example Sqoop command to load data from MySQL into Hive:
bin/sqoop-import
--connect jdbc:mysql://<mysql host>:<mysql port>/db3
--username <username>
--password <password>
--table <tableName>
--hive-table <Hive tableName>
--create-hive-table
--hive-import
--hive-home <hive path>
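For the export direction mentioned on the "What is Sqoop?" slide, a counterpart command would look roughly like the sketch below; the database, table name, and HDFS directory are placeholders, not values from the deck:

bin/sqoop-export
--connect jdbc:mysql://<mysql host>:<mysql port>/db3
--username <username>
--password <password>
--table <tableName>
--export-dir <HDFS directory holding the data to export>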
90. How Sqoop Works
- The dataset being transferred is broken into small blocks.
- A map-only job is launched.
- An individual mapper is responsible for transferring a block of the dataset.