YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop Components" gives you a detailed look at the top Hadoop components and helps you understand the categories they fall into. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
www.edureka.co
WHAT IS HADOOP?
Hadoop is an open source distributed processing
framework that manages data processing and
storage for big data applications running in clustered
systems.
MAJOR HADOOP COMPONENTS
HDFS
• Hadoop Distributed File System.
• Primary Data Storage Unit in Hadoop.
• Used in distributed data-processing environments.
HCATALOG
• Hadoop Storage Management layer.
• Exposes the tabular data of the Hive metastore to other
applications such as Pig and MapReduce.
ZOOKEEPER
• Centralized open-source server.
• Provides distributed configuration, synchronization,
and naming registry services for large distributed
systems.
OOZIE
• Server-based workflow scheduling system
• Schedules Apache Hadoop jobs.
• Workflows are managed as Directed Acyclic Graphs (DAGs) of actions.
MAPREDUCE
• Software framework for distributed processing.
• Splits data into chunks so that map, filter, and
other operations can run on each chunk in parallel.
• Based on the map and reduce functions of functional programming.
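The map/shuffle/reduce flow above can be sketched on a single machine with plain Python (a toy word count; real MapReduce distributes these phases across a cluster):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: sum all counts collected for one key
    return (word, sum(counts))

lines = ["Hadoop stores big data", "Hadoop processes big data"]

# Shuffle/sort: group intermediate pairs by key, as the framework would
pairs = sorted(kv for line in lines for kv in map_phase(line))
result = dict(reduce_phase(k, (c for _, c in g))
              for k, g in groupby(pairs, key=itemgetter(0)))

print(result)
```

In the real framework each map task runs on one HDFS block and the shuffle moves pairs across the network; the logic per phase is the same.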
SPARK
• General Purpose Cluster Computing Framework.
• It can perform Real-time data streaming and ETL
• Used for Micro-Batch Processing.
TEZ
• High-performance data processing tool.
• Executes a series of MapReduce jobs as a single job.
• Used in batch-processing environments.
HIVE
• Data warehouse software project.
• Enables SQL-like queries (HiveQL) over data stored in Hadoop.
• Used for ETL; supports Hive DDL and DML.
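To give a feel for the SQL-like queries the slide mentions, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Hive (the table and query are illustrative; Hive itself runs HiveQL over data in HDFS):

```python
import sqlite3

# sqlite3 is only a stand-in here to show SQL-style (HiveQL-like) queries;
# Hive executes such queries over large datasets on a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("ERROR", "disk full"), ("INFO", "job started"),
                  ("ERROR", "timeout")])

# Aggregation in plain SQL, much like a HiveQL GROUP BY
rows = conn.execute(
    "SELECT level, COUNT(*) FROM logs GROUP BY level ORDER BY level"
).fetchall()
print(rows)
```

The point is that analysts can use familiar SQL syntax while Hive translates the query into distributed jobs behind the scenes.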
SPARK SQL
• Distributed SQL Query engine
• Enables Structured Data Processing.
• Used for importing data from RDDs, Hive, Parquet
files, etc.
IMPALA
• In-Memory Processing Query engine
• Integrates with the Hive metastore to share table
information between components.
• Used to process data in Hadoop Clusters
APACHE DRILL
• Low Latency Distributed Query engine
• Combines a variety of data stores in a single query.
• Supports many kinds of NoSQL databases.
HBASE
• Open source, non-relational distributed database
• Runs on top of HDFS and provides random, real-time
read/write access to large datasets.
APACHE PIG
• High level scripting language
• Enables users to write complex data
transformations
• Performs ETL and analyses huge Datasets.
APACHE SQOOP
• Command-line interface application for
transferring data between relational databases
and Hadoop.
• Data Ingesting tool.
• Enables importing and exporting structured data at
an enterprise scale.
SPARK STREAMING
• Spark Streaming is an extension of the
core Spark API.
• Enables scalable, high-throughput, fault-
tolerant stream processing of live data streams.
• Provides a high-level abstraction called a
discretized stream (DStream) for continuous data
streams.
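The discretized-stream idea (chop a continuous stream into small batches and run the same transformation on each) can be illustrated with a toy single-machine sketch in plain Python; real Spark Streaming distributes this work across a cluster:

```python
# Toy illustration of the DStream idea: a stream is split into small
# batches and the same transformation is applied to every batch.
stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]   # illustrative event values
batch_size = 3

def batches(events, size):
    # Chop the incoming stream into fixed-size micro-batches
    for i in range(0, len(events), size):
        yield events[i:i + size]

# Per-batch aggregation, as a DStream transformation would apply
batch_sums = [sum(batch) for batch in batches(stream, batch_size)]
print(batch_sums)
```

This micro-batch model is why the Spark slide earlier describes Spark as doing "micro-batch processing": each small batch is handled as an ordinary Spark job.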
APACHE KAFKA
• Open-source stream-processing software
• Ingests and moves large amounts of data very
quickly.
• Uses a publish-subscribe model for streams of records.
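A toy in-process sketch of the publish-subscribe model (the MiniBroker class and topic names are illustrative; real Kafka persists records in partitioned, replicated logs on a broker cluster):

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-process broker showing Kafka's publish/subscribe model.
    Kafka itself keeps records in durable, partitioned, replicated logs."""
    def __init__(self):
        self.topics = defaultdict(list)    # topic -> append-only log
        self.offsets = defaultdict(int)    # (group, topic) -> next offset

    def publish(self, topic, record):
        # Producers append records to a topic's log
        self.topics[topic].append(record)

    def consume(self, group, topic):
        # Each consumer group reads from its own offset in the log
        log = self.topics[topic]
        offset = self.offsets[(group, topic)]
        records = log[offset:]
        self.offsets[(group, topic)] = len(log)
        return records

broker = MiniBroker()
broker.publish("clicks", {"user": "a"})
broker.publish("clicks", {"user": "b"})
print(broker.consume("analytics", "clicks"))  # both records
print(broker.consume("analytics", "clicks"))  # nothing new yet
```

Because offsets are tracked per consumer group, many independent applications can read the same stream without interfering with each other, which is central to Kafka's design.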
APACHE FLUME
• Open-source, distributed, and reliable software.
• Architecture is based on streaming data flows.
• Used for collecting, aggregating, and moving large
amounts of log data.
APACHE GIRAPH
• Iterative graph processing framework.
• Utilizes Apache Hadoop's MapReduce
implementation to process graphs.
• Used to analyse social media data
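Giraph's vertex-centric supersteps can be mimicked on one machine; the toy loop below runs a simple PageRank-style iteration over a three-vertex graph (the graph and constants are illustrative):

```python
# Each pass of the outer loop plays the role of one Giraph "superstep":
# every vertex sends its rank along its outgoing edges, then updates
# itself from the messages it received.
edges = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # toy directed graph
rank = {v: 1.0 for v in edges}
damping = 0.85

for _ in range(20):  # supersteps
    incoming = {v: 0.0 for v in edges}
    for src, outs in edges.items():
        share = rank[src] / len(outs)   # vertex distributes its rank
        for dst in outs:
            incoming[dst] += share
    rank = {v: (1 - damping) + damping * incoming[v] for v in edges}

print({v: round(r, 3) for v, r in rank.items()})
```

In Giraph the vertices live on different machines and the "messages" travel over the network between supersteps, but the per-vertex logic looks much like this.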
APACHE GRAPHX
• GraphX is Apache Spark's API for graphs and
graph-parallel computation.
• Comparable performance to the fastest
specialized graph processing systems.
• Works seamlessly with both graphs and collections.
• Offers a growing library of graph algorithms.
H2O
• H2O is open-source software for big-data analysis.
• H2O allows users to fit thousands of candidate
models while discovering patterns in data.
• H2O uses iterative methods that provide quick
answers using all of the client's data.
ORYX
• A generic lambda architecture tier providing
batch, speed, and serving layers.
• Designed for real-time, large-scale machine
learning.
• Provides end-to-end implementations of standard
ML algorithms as applications.
SPARK MLlib
• Spark MLlib is a scalable Machine Learning
Library.
• It enables us to perform Machine Learning
operations in Spark.
AVRO
• Avro is a row-oriented data serialization and
remote procedure call (RPC) framework.
• Supports dynamic typing, schema evolution, and
more.
• Used for data serialization and RPC across Hadoop services.
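Avro schemas are written in JSON; a minimal record schema might look like this (the record and field names are illustrative):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The union type with a null default is the usual way to add an optional field, which is what makes the schema evolution mentioned above possible: old readers can still decode new records.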
THRIFT
• It is an Interface definition language and binary
communication protocol.
• It allows users to define data types and service
interfaces in a simple definition file
• Thrift is used in building RPC Clients and Servers.
MAHOUT
• Provides implementations of distributed machine
learning algorithms.
• Runs on top of distributed backends such as
Hadoop, so models can be trained on big data
spread across clusters of computers.
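As a single-machine sketch of one algorithm Mahout implements in distributed form, here is plain-Python k-means on toy 1-D data (the points and initial centroids are illustrative):

```python
# Single-machine k-means on toy 1-D data; Mahout runs the same
# assignment/update loop in distributed form over cluster data.
points = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
centroids = [1.0, 10.0]  # initial guesses

for _ in range(10):
    # Assignment step: each point joins its nearest centroid's cluster
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)
```

In a distributed setting the assignment step parallelizes naturally (each worker assigns its own partition of the points), which is exactly why k-means suits frameworks like Mahout.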