HADOOP ECOSYSTEM
Sandip K. Darwade
MNIT Jaipur
May 27, 2014
Outline
Hadoop
Hadoop Ecosystem
HDFS
MapReduce
YARN
Avro
Pig
Hive
HBase
Mahout
Sqoop
ZooKeeper
Chukwa
HCatalog
References
What is Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
Hadoop is best known for MapReduce, its distributed filesystem (HDFS), and large-scale data processing.
What is the Hadoop Ecosystem?
An introduction to the world of Hadoop and the core related software projects.
There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and accessible to non-specialists, but the projects covered here were chosen because they provide core functionality and speed to Hadoop; together they form the so-called Hadoop Ecosystem.
Hadoop Ecosystem
Figure: Hadoop Ecosystem Architecture
HDFS
Hadoop Distributed File System.
Files stored in HDFS are divided into blocks, which are then copied to multiple DataNodes.
A Hadoop cluster contains a single NameNode and many DataNodes.
Data blocks are replicated for high availability and fast access.
Figure: HDFS Architecture
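The block and replication scheme above can be sketched in a few lines. This is toy code, not the real HDFS API: the round-robin placement and the node names are illustrative assumptions, but the 64 MB block size and 3-way replication match HDFS defaults.

```python
# Toy sketch (not the real HDFS API): split a file into 64 MB blocks and
# assign each block to 3 DataNodes round-robin, mimicking replication.
BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size
REPLICATION = 3                 # HDFS default replication factor

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [DataNodes holding a replica])."""
    n_blocks = -(-file_size // block_size)   # ceiling division
    placement = []
    for b in range(n_blocks):
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        placement.append((b, replicas))
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(200 * 1024 * 1024, nodes)   # a 200 MB file -> 4 blocks
for block, replicas in plan:
    print(block, replicas)
```

Losing any single node still leaves two replicas of every block, which is the point of the replication factor.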
HDFS (continued)
NameNode
Runs on a separate machine.
Manages the file system namespace and controls access by external clients.
Stores file system metadata in memory: file information, the blocks of each file, and the location of every block on the DataNodes.
DataNode
Runs on a separate machine; the basic unit of file storage.
Periodically reports all of its blocks to the NameNode.
Serves read and write requests, and carries out block create, delete, and copy commands from the NameNode.
MapReduce
Programming model for data processing.
Hadoop can run MapReduce programs written in various languages, such as Java and Python.
Parallel processing makes MapReduce well suited to very large-scale data analysis.
Mappers produce intermediate results.
Reducers aggregate the results.
MapReduce
Files are split into fixed-size blocks and stored on DataNodes (default 64 MB).
Programs can process the blocks on distributed clusters in parallel.
The input data is a set of key/value pairs, and the output is also key/value pairs.
Two main phases: Map and Reduce.
MapReduce (continued)
Figure: MapReduce Process Architecture
MapReduce (continued)
Map
Processes each block separately, in parallel.
Generates a set of intermediate key/value pairs.
The results of these logical blocks are reassembled.
Reduce
Accepts an intermediate key and its related values.
Processes the intermediate key and values.
Produces a relatively small set of result values.
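The two phases above can be sketched in pure Python with the classic word count. This collapses the distributed framework into one process: the real framework runs map tasks on separate blocks across the cluster and shuffles by key before reduce, but the data flow is the same.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit an intermediate (word, 1) pair for every word in the block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the value list for one key into a small result."""
    return key, sum(values)

blocks = ["hadoop stores blocks", "hadoop maps blocks"]   # stand-ins for HDFS blocks
intermediate = [pair for b in blocks for pair in map_phase(b)]   # maps run in parallel
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)   # {'hadoop': 2, 'stores': 1, 'blocks': 2, 'maps': 1}
```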
YARN
YARN (Yet Another Resource Negotiator).
MapReduce 1.0 had issues with scalability, memory usage, and synchronization.
YARN addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service.
YARN splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons.
Rather than burdening a single node with scheduling and resource management for the entire cluster, YARN distributes this responsibility across the cluster.
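A minimal sketch of the resource-management half of that split, under simplifying assumptions (memory as the only resource, first-fit placement; the class and node names are hypothetical, not YARN's API): the ResourceManager grants containers out of per-node capacity instead of one JobTracker tracking every task itself.

```python
# Hypothetical sketch of YARN-style resource management (not the real API).
class ResourceManager:
    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)       # NodeManager -> free memory (MB)

    def allocate(self, app, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] -= memory_mb
                return {"app": app, "node": node, "memory_mb": memory_mb}
        return None                            # request waits until capacity frees up

rm = ResourceManager({"node1": 2048, "node2": 1024})
c1 = rm.allocate("app-1", 1536)   # fits on node1
c2 = rm.allocate("app-2", 1024)   # node1 has only 512 MB left -> goes to node2
print(c1["node"], c2["node"])
```

Per-application scheduling and monitoring would then live in a separate ApplicationMaster, which is the other half of the JobTracker split described above.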
YARN (continued)
Figure: YARN Architecture (via Apache)
Avro
Avro is a framework for remote procedure calls and data serialization.
It can be used to pass data from one program or language to another, e.g. from C to Pig.
Well suited for use with scripting languages such as Pig, because Avro data is always stored with its schema and is therefore self-describing.
Avro can also handle changes in schema while still preserving access to the data.
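The self-describing idea can be illustrated without Avro itself. Real Avro uses a compact binary encoding; this JSON stand-in (all names hypothetical) only shows what "data stored with its schema" means: any reader can interpret the bytes without an external description.

```python
import json

# Illustration only: real Avro is a compact binary format, not JSON.
schema = {"name": "User", "fields": [{"name": "id", "type": "int"},
                                     {"name": "email", "type": "string"}]}

def serialize(record, schema):
    """Write schema + data together, so the payload is self-describing."""
    return json.dumps({"schema": schema, "data": record}).encode()

def deserialize(blob):
    """A reader recovers both the schema and the record from the bytes alone."""
    payload = json.loads(blob.decode())
    return payload["schema"], payload["data"]

blob = serialize({"id": 7, "email": "a@b.c"}, schema)
read_schema, record = deserialize(blob)
print(read_schema["name"], record)
```

Because the writer's schema travels with the data, a reader with a newer schema can reconcile the two, which is how Avro handles schema evolution.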
Pig
Pig is a framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that allows users to execute MapReduce on a Hadoop cluster.
Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce.
Pig is more flexible than Hive with respect to possible data formats.
Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested.
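The nested data model is the key difference from SQL, and a small sketch makes it concrete. A Pig `GROUP` produces, for each key, a tuple whose second field is a bag of whole input tuples (modeled here as a Python list; the data and names are illustrative).

```python
from collections import defaultdict

# Pig Latin equivalent (illustrative): grouped = GROUP visits BY user;
visits = [("alice", "home"), ("bob", "cart"), ("alice", "cart")]

def group_by(tuples, key_index):
    """Group tuples by one field, keeping whole tuples nested under each key."""
    bags = defaultdict(list)
    for t in tuples:
        bags[t[key_index]].append(t)   # nested tuples, unlike flat SQL rows
    return sorted(bags.items())

grouped = group_by(visits, 0)
print(grouped)
```

A flat relational result would have to join the key back onto each row; Pig instead keeps the grouped rows nested and lets later operators iterate over the bag.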
Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Using Hadoop was not easy for end users who were not familiar with the MapReduce framework.
A Hive query is converted to MapReduce tasks.
Figure: Hive Architecture
Hive (continued)
Building blocks of Hive:
Metastore: stores the system catalog and metadata about tables, columns, partitions, etc.
Driver: manages the lifecycle of a HiveQL statement as it moves through Hive.
Query Compiler: compiles HiveQL into a directed acyclic graph of MapReduce tasks.
Execution Engine: executes the tasks produced by the compiler in proper dependency order.
Hive Server: provides a Thrift interface and a JDBC/ODBC server.
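A hypothetical sketch of what the compiler step produces for a simple aggregate query (the table and column names are invented; real Hive emits compiled MapReduce jobs, not Python functions): for `SELECT dept, COUNT(*) FROM emp GROUP BY dept`, the GROUP BY column becomes the shuffle key and COUNT(*) becomes a sum in the reducer.

```python
from collections import defaultdict

rows = [{"name": "ann", "dept": "eng"}, {"name": "bo", "dept": "hr"},
        {"name": "cy", "dept": "eng"}]

def mapper(row):
    return (row["dept"], 1)        # GROUP BY column becomes the shuffle key

def reducer(dept, ones):
    return (dept, sum(ones))       # COUNT(*) becomes a sum over each group

shuffled = defaultdict(list)
for key, value in map(mapper, rows):
    shuffled[key].append(value)
result = dict(reducer(k, v) for k, v in shuffled.items())
print(result)   # {'eng': 2, 'hr': 1}
```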
HBase
HBase is a distributed, column-oriented database built on top of HDFS.
HBase is not relational and does not support SQL but, given the proper problem space, it is able to do what an RDBMS cannot.
HBase is modeled with an HBase master node orchestrating a cluster of one or more RegionServer slaves.
The HBase master is responsible for bootstrapping a fresh install, assigning regions to registered RegionServers, and recovering from RegionServer failures.
HBase manages a ZooKeeper instance as the authority on cluster state.
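The column-oriented data model can be sketched as a toy in-memory table (this is the data model only, not HBase's API or storage format): values live under a (row key, column family:qualifier) address, and every write keeps a timestamped version.

```python
# Toy sketch of HBase's data model (not its API or on-disk format).
class ToyTable:
    def __init__(self):
        self.cells = {}              # (row, column) -> list of (timestamp, value)

    def put(self, row, column, value, ts):
        self.cells.setdefault((row, column), []).append((ts, value))

    def get(self, row, column):
        """Return the newest version, as HBase does by default."""
        versions = self.cells.get((row, column), [])
        return max(versions)[1] if versions else None

t = ToyTable()
t.put("row1", "info:city", "Jaipur", ts=1)
t.put("row1", "info:city", "Delhi", ts=2)   # newer version shadows the old one
print(t.get("row1", "info:city"))
```

Rows are sparse: a row stores only the cells it actually has, which is what makes this model workable for problem spaces an RDBMS handles poorly.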
HBase (continued)
Figure: HBase Architecture
Mahout
Mahout is a scalable machine-learning and data mining library.
There are currently four main groups of algorithms in Mahout:
Recommendations, a.k.a. collaborative filtering.
Classification, a.k.a. categorization.
Clustering.
Frequent itemset mining, a.k.a. parallel frequent pattern mining.
Mahout is not simply a collection of pre-existing algorithms: algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion and have been written to be executable in MapReduce.
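A minimal sketch of the recommendation idea from the first group above, using item co-occurrence (a deliberately simplified stand-in; Mahout's actual recommenders are richer and run as distributed MapReduce jobs):

```python
from collections import Counter

# Each set is one user's "basket" of items (illustrative data).
baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"bread", "eggs"},
           {"milk", "bread", "eggs"}]

def recommend(item, baskets):
    """Rank items that co-occur most often with `item` across baskets."""
    co = Counter()
    for basket in baskets:
        if item in basket:
            co.update(basket - {item})   # count everything bought alongside `item`
    return [other for other, _ in co.most_common()]

print(recommend("milk", baskets))
```

Counting co-occurrences is trivially parallel per basket, which is why this family of algorithms maps well onto MapReduce.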
Mahout (continued)
Figure: Mahout Architecture
Sqoop
Sqoop allows easy import and export of data from structured data stores.
Command-line tool to import any JDBC-supported database into Hadoop.
Generates Writables for use in MapReduce jobs.
High-performance connectors for some RDBMSs.
A distributed, reliable, available service for efficiently moving large amounts of data as it is produced.
Suited for gathering logs from multiple systems and inserting them into HDFS as they are generated.
Design goals: reliability, scalability, manageability, extensibility.
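The core import step can be sketched as "read a JDBC table, write delimited records". In this stand-in, sqlite3 plays the RDBMS and a list of CSV lines plays the HDFS file; the table and columns are invented, and this is not Sqoop's implementation.

```python
import sqlite3

# Stand-in for the source RDBMS that Sqoop would reach over JDBC.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bo")])

def import_table(conn, table):
    """Dump every row of `table` as one comma-separated record, Sqoop-style."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()   # illustrative query
    return [",".join(str(field) for field in row) for row in rows]

records = import_table(db, "users")
print(records)   # ['1,ann', '2,bo']
```

The real tool parallelizes this by splitting the table on a key column and running one map task per split, which is where the MapReduce Writables mentioned above come in.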
Sqoop (continued)
Figure: Sqoop Architecture
ZooKeeper
ZooKeeper is a distributed, open-source coordination service for distributed applications.
Distributed applications are especially prone to errors such as race conditions and deadlock.
ZooKeeper relieves distributed applications of the responsibility of implementing coordination services from scratch.
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace.
The namespace consists of data registers called znodes, which are similar to files and directories.
ZooKeeper data is kept in memory, which means it can achieve high throughput and low latency.
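The hierarchical namespace can be sketched as a toy in-memory tree (not the ZooKeeper client API; the paths and class name are invented): znodes are addressed by slash-separated paths, hold small data payloads, and must be created under an existing parent.

```python
# Toy sketch of ZooKeeper's znode namespace (not its API).
class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}          # root znode always exists

    def create(self, path, data=b""):
        """Create a znode; like ZooKeeper, the parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent {parent} does not exist")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

zk = ZNodeTree()
zk.create("/locks")
zk.create("/locks/job-1", b"owner=worker-7")   # processes coordinate via such nodes
print(zk.get("/locks/job-1"))
```

On top of this small primitive (plus watches and ephemeral nodes in the real system), applications build locks, leader election, and configuration sharing without reimplementing coordination from scratch.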
ZooKeeper (continued)
Figure: ZooKeeper Architecture
Chukwa
Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis.
Chukwa is built on top of HDFS and the MapReduce framework and inherits Hadoop's scalability and robustness.
Four components of Chukwa:
Agents that run on each machine and emit data.
Collectors that receive data from the agents and write it to stable storage.
MapReduce jobs for parsing and archiving the data.
HICC, the Hadoop Infrastructure Care Center: a web-portal-style interface for displaying data.
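The first two components of that pipeline can be sketched in one process (everything here, including the chunk size and record shape, is an illustrative assumption, not Chukwa's format): agents tag each log line with its source, and a collector batches records from many agents into chunks for stable storage.

```python
# Illustrative agent/collector pipeline, collapsed into one process.
def agent(machine, lines):
    """An agent emits its machine's log lines, tagged with their source."""
    return [{"source": machine, "line": line} for line in lines]

def collector(streams, chunk_size=2):
    """A collector merges agent streams and batches them into storage chunks."""
    records = [r for stream in streams for r in stream]
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

streams = [agent("web1", ["GET /", "GET /a"]), agent("web2", ["POST /b"])]
chunks = collector(streams)
print(len(chunks), chunks[0][0]["source"])
```

Downstream, the MapReduce jobs named above would parse and archive these chunks, and HICC would display the results.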
Chukwa (continued)
Figure: Chukwa Architecture
HCatalog
An incubator-level project at Apache.
HCatalog is a metadata and table storage management service for HDFS.
HCatalog depends on the Hive metastore and exposes it to other services such as MapReduce and Pig.
HCatalog's goal is to simplify the user's interaction with HDFS data and enable data sharing between tools and execution platforms.
References
G. Yang, "The application of MapReduce in the cloud computing," Intelligence Information Processing and Trusted Computing (IPTC) 2011, vol. 9, pp. 154-156, Oct. 2011.
T. White, Hadoop: The Definitive Guide, Third Edition. Sebastopol, CA: O'Reilly Media, Inc., 2012.
