Hadoop Ecosystem
Hadoop Ecosystem and Hadoop-Related Projects at Apache (excluding Cloudera projects related to Hadoop)

Presentation Transcript

  • HADOOP ECOSYSTEM. Sandip K. Darwade, MNIT Jaipur. May 27, 2014.
  • Outline: Hadoop, Hadoop Ecosystem, HDFS, MapReduce, YARN, Avro, Pig, Hive, HBase, Mahout, Sqoop, ZooKeeper, Chukwa, HCatalog, References.
  • What is Hadoop? The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. Hadoop is best known for MapReduce, its distributed file system (HDFS), and large-scale data processing.
  • What is the Hadoop Ecosystem? An introduction to the world of Hadoop and the core related software projects. There are countless commercial Hadoop-integrated products focused on making Hadoop more usable and accessible, but the projects covered here were chosen because they provide core functionality and speed to Hadoop; together they make up the so-called Hadoop Ecosystem.
  • Hadoop Ecosystem. Figure: Hadoop Ecosystem Architecture.
  • HDFS: the Hadoop Distributed File System. Files stored in HDFS are divided into blocks, which are then copied to multiple DataNodes. A Hadoop cluster contains a single NameNode and many DataNodes. Data blocks are replicated for high availability and fast access. Figure: HDFS Architecture.
  • HDFS. NameNode: runs on a separate machine; manages the file system namespace and controls access by external clients; stores file system metadata in memory, including file information and the location of every block of every file on the DataNodes. DataNode: runs on a separate machine and is the basic unit of file storage; periodically reports all of its existing blocks to the NameNode; serves read and write requests, and carries out create, delete, and copy block commands from the NameNode. A minimal client sketch follows.
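    A minimal sketch of writing and reading an HDFS file through the Hadoop FileSystem Java API. The NameNode address and file path are illustrative assumptions, not from the slides; the NameNode resolves the path to blocks while DataNodes serve the bytes.

    // Minimal HDFS client sketch; cluster address and path are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/tmp/hello.txt");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello hdfs"); // block placement and replication are handled by the NameNode
            }
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF()); // read served by whichever DataNode holds the block
            }
            fs.close();
        }
    }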
  • MapReduce: a programming model for data processing. Hadoop can run MapReduce programs written in various languages, such as Java and Python. Parallel processing makes MapReduce well suited to very large-scale data analysis. Mappers produce intermediate results; reducers aggregate them.
  • MapReduce. Files are split into fixed-size blocks (64 MB by default) and stored on DataNodes. Programs written against the model run on distributed clusters in parallel. The input is a set of key/value pairs, and the output is also a set of key/value pairs. There are two main phases: Map and Reduce.
  • MapReduce (continued). Figure: MapReduce Process Architecture.
  • MapReduce (continued). Map: processes each block separately, in parallel; generates a set of intermediate key/value pairs; the results of these logical blocks are then reassembled. Reduce: accepts an intermediate key and its related values; processes them; forms a (typically smaller) set of output values. The classic word-count job below sketches both phases.
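    A sketch of the classic word-count job against the standard Hadoop MapReduce API (not part of the original slides); input and output paths come from the command line, and the class names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
                }
            }
        }
        // Reduce phase: sum the intermediate counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // local pre-aggregation on map output
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }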
  • YARN (Yet Another Resource Negotiator). MapReduce 1.0 had issues with scalability, memory usage, and synchronization. YARN addresses problems with MapReduce 1.0's architecture, specifically with the JobTracker service: it splits the JobTracker's two major functions, resource management and job scheduling/monitoring, into separate daemons. Rather than burdening a single node with scheduling and resource management for the entire cluster, YARN distributes this responsibility across the cluster.
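    A minimal sketch, assuming a reachable ResourceManager and a yarn-site.xml on the classpath, of querying YARN for its applications through the YarnClient API; it is illustrative, not from the slides.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListYarnApps {
        public static void main(String[] args) throws Exception {
            YarnConfiguration conf = new YarnConfiguration(); // reads yarn-site.xml from the classpath
            YarnClient client = YarnClient.createYarnClient();
            client.init(conf);
            client.start();
            // The ResourceManager (global resource manager/scheduler) answers this query;
            // per-job scheduling and monitoring live in each ApplicationMaster.
            for (ApplicationReport app : client.getApplications()) {
                System.out.println(app.getApplicationId() + " " + app.getName()
                        + " " + app.getYarnApplicationState());
            }
            client.stop();
        }
    }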
  • YARN (continued). Figure: YARN Architecture (via Apache).
  • Avro: a framework for performing remote procedure calls and data serialization. It can be used to pass data from one program or language to another, e.g. from C to Pig. It is well suited for use with scripting languages such as Pig because data is always stored with its schema in Avro, making the data self-describing (sketched below). Avro can also handle schema changes while still preserving access to the data.
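    A minimal sketch of Avro serialization with a generic record; the User schema and file name are invented for illustration. The point to notice is that the writer embeds the schema in the file, so a reader needs no outside information.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "sandip");
            user.put("age", 30);

            File file = new File("users.avro");
            // The writer stores the schema in the file header: self-describing data.
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, file);
                writer.append(user);
            }
            // The reader recovers the schema from the file itself.
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
                for (GenericRecord r : reader) System.out.println(r);
            }
        }
    }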
  • Pig: a framework consisting of a high-level scripting language (Pig Latin) and a run-time environment that lets users execute MapReduce jobs on a Hadoop cluster. Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce. Pig is more flexible than Hive with respect to possible data formats. Pig's data model is similar to the relational data model, except that tuples (a.k.a. records or rows) can be nested. A small example follows.
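    A minimal sketch of driving Pig Latin from Java through the PigServer class; the access.log input and its field layout are hypothetical. The same statements could equally be run directly in Pig's shell.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // ExecType.LOCAL runs without a cluster; ExecType.MAPREDUCE compiles
            // the statements into MapReduce jobs on the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);
            pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') "
                    + "AS (ip:chararray, url:chararray);");
            pig.registerQuery("grouped = GROUP logs BY ip;");
            pig.registerQuery("hits = FOREACH grouped GENERATE group, COUNT(logs);");
            pig.store("hits", "hits_by_ip"); // triggers compilation and execution
            pig.shutdown();
        }
    }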
  • Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop to provide data summarization, query, and analysis. Using Hadoop was not easy for end users who were not familiar with the MapReduce framework; a Hive query is converted into MapReduce tasks. Figure: Hive Architecture.
  • Hive (continued). Building blocks of Hive. Metastore: stores the system catalog and metadata about tables, columns, partitions, etc. Driver: manages the lifecycle of a HiveQL statement as it moves through Hive. Query Compiler: compiles HiveQL into a directed acyclic graph of MapReduce tasks. Execution Engine: executes the tasks produced by the compiler in proper dependency order. Hive Server: provides a Thrift interface and a JDBC/ODBC server, used in the sketch below.
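    A minimal sketch, assuming a HiveServer2 instance on localhost:10000 and a hypothetical access_logs table, of issuing HiveQL through the JDBC interface mentioned above.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // The Driver and Query Compiler turn this HiveQL into MapReduce tasks.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT ip, COUNT(*) FROM access_logs GROUP BY ip")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }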
  • HBase: a distributed, column-oriented database built on top of HDFS. HBase is not relational and does not support SQL, but given the proper problem space it is able to do what an RDBMS cannot. HBase is modeled with an HBase master node orchestrating a cluster of one or more regionserver slaves. The HBase master is responsible for bootstrapping a fresh install, for assigning regions to registered regionservers, and for recovering from regionserver failures. HBase manages a ZooKeeper instance as the authority on cluster state. A minimal client sketch follows.
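    A minimal client sketch against the HBase Java API of this era (HTable was later superseded); the users table and its info column family are assumed to already exist.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml; the client locates the cluster via ZooKeeper.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");

            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("sandip"));
            table.put(put); // routed to the regionserver owning row1's region

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            table.close();
        }
    }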
  • HBase (continued). Figure: HBase Architecture.
  • Mahout: a scalable machine-learning and data mining library. There are currently four main groups of algorithms in Mahout: recommendations (a.k.a. collaborative filtering), classification (a.k.a. categorization), clustering, and frequent itemset mining (a.k.a. parallel frequent pattern mining). Mahout is not simply a collection of pre-existing algorithms: algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion, and have been written to be executable in MapReduce. A recommender sketch follows.
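    A minimal sketch of the recommendation (collaborative filtering) group using Mahout's Taste API; ratings.csv is a hypothetical file of userID,itemID,preference lines.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class RecommenderExample {
        public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("ratings.csv"));
            PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
            NearestNUserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model); // 10 most similar users
            GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 item recommendations for user 42.
            List<RecommendedItem> items = recommender.recommend(42L, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " " + item.getValue());
            }
        }
    }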
  • Mahout (continued). Figure: Mahout Architecture.
  • Sqoop: allows easy import and export of data from structured data stores. A command-line tool that can import any JDBC-supported database into Hadoop (see the sketch below) and generate Writables for use in MapReduce jobs, with high-performance connectors for some RDBMSs. A distributed, reliable, available service for efficiently moving large amounts of data as they are produced; suited to gathering logs from multiple systems and inserting them into HDFS as they are generated. Design goals: reliability, scalability, manageability, extensibility.
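    Sqoop is driven from the command line; the sketch below simply shells out to a hypothetical sqoop import invocation from Java via ProcessBuilder. The JDBC URL, credentials, table, and HDFS target directory are all invented for illustration.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class SqoopImportExample {
        public static void main(String[] args) throws Exception {
            ProcessBuilder pb = new ProcessBuilder(
                "sqoop", "import",
                "--connect", "jdbc:mysql://dbhost/sales", // hypothetical source database
                "--username", "etl", "--password-file", "/user/etl/.pw",
                "--table", "orders",                      // table to import
                "--target-dir", "/data/orders",           // HDFS destination
                "-m", "4");                               // four parallel map tasks
            pb.redirectErrorStream(true);
            Process p = pb.start();
            try (BufferedReader r = new BufferedReader(
                     new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) System.out.println(line);
            }
            System.exit(p.waitFor());
        }
    }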
  • Sqoop (continued). Figure: Sqoop Architecture.
  • ZooKeeper: a distributed, open-source coordination service for distributed applications, which are especially prone to errors such as race conditions and deadlock. ZooKeeper's aim is to relieve distributed applications of the responsibility of implementing coordination services from scratch. It allows distributed processes to coordinate with each other through a shared hierarchical namespace. The namespace consists of data registers called znodes, which are similar to files and directories. ZooKeeper data is kept in memory, which means it can achieve high throughput and low latency. A minimal client sketch follows.
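    A minimal sketch of creating and reading a znode through the ZooKeeper Java client; the connection string and the /app-config path are illustrative.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await(); // wait for the session before issuing operations

            // znodes form a file-system-like hierarchy of small data registers.
            if (zk.exists("/app-config", false) == null) {
                zk.create("/app-config", "v1".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }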
  • ZooKeeper (continued). Figure: ZooKeeper Architecture.
  • Chukwa: a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of HDFS and the MapReduce framework and inherits Hadoop's scalability and robustness. Chukwa has four components: agents that run on each machine and emit data; collectors that receive data from the agents and write it to stable storage; MapReduce jobs for parsing and archiving the data; and HICC, the Hadoop Infrastructure Care Center, a web-portal-style interface for displaying data.
  • Chukwa (continued). Figure: Chukwa Architecture.
  • HCatalog: an incubator-level project at Apache. HCatalog is a metadata and table storage management service for HDFS. It depends on the Hive metastore and exposes it to other services, such as MapReduce and Pig. HCatalog's goal is to simplify the user's interaction with HDFS data and to enable data sharing between tools and execution platforms.
  • Bibliography. G. Yang, "The application of MapReduce in the cloud computing," Intelligence Information Processing and Trusted Computing (IPTC) 2011, vol. 9, pp. 154–156, Oct. 2011. T. White, Hadoop: The Definitive Guide, Third Edition. Sebastopol, CA: O'Reilly Media, Inc., 2012.