APACHE HADOOP
JERRIN JOSEPH
CSU ID#2578741
CONTENTS
 Hadoop
 Hadoop Distributed File System (HDFS)
 Hadoop MapReduce
 Apache Hive
 Apache HBase
 ZooKeeper
 Hortonworks Data Platform
 Cloudera Hadoop Solution
ABSTRACT
 Hadoop is an efficient tool for handling Big Data.
 It reduces data processing time from days to hours.
 The Hadoop Distributed File System (HDFS) is the data storage unit of Hadoop.
 Hadoop MapReduce is the data processing unit, which works on the distributed processing principle.
INTRODUCTION
 What is Big Data?
 Bulk amounts of data
 Largely unstructured
 Many applications need to handle huge amounts of data (on the order of 500+ TB per day)
 If a regular machine needs to transmit 1 TB of data through 4 channels, it takes about 43 minutes.
 What about 500 TB?
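As a rough back-of-the-envelope check (a sketch assuming transfer time scales linearly with data volume, per the 1 TB figure above):

\[ t_{500\,\mathrm{TB}} \approx 500 \times 43\ \mathrm{min} = 21500\ \mathrm{min} \approx 358\ \mathrm{h} \approx 15\ \mathrm{days} \]

A single machine becomes the bottleneck; distributing storage and processing across a cluster is Hadoop's answer.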
HADOOP
 “The Apache Hadoop software library is a
framework that allows for the distributed processing
of large data sets across clusters of computers
using simple programming models”[1]
 Core components:
 HDFS: stores large data sets across clusters of computers.
 Hadoop MapReduce: performs distributed processing using simple programming models.
HADOOP : KEY FEATURES
 High Scalability
 Highly Tolerant to Software & Hardware Failures
 High Throughput
 Best for a small number of large files
 Performs fast, parallel execution of jobs
 Provides Streaming access to data
 Can be built out of commodity hardware
HADOOP: DRAWBACKS
 Not good for low-latency data access
 Not good for a large number of small files
 Not good for files with multiple writers
 No encryption at the storage or network level
 Has a highly complex security model
 Hadoop is not a database: hence a file cannot be altered in place.
HADOOP ARCHITECTURE
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
 Storage unit of Hadoop
 Relies on the principles of a distributed file system.
 HDFS has a master-slave architecture
 Main components:
 Name Node : Master
 Data Node : Slave
 3+ replicas of each block (the default replication factor is 3)
 Default block size : 64 MB
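A minimal sketch of reading these defaults through the HDFS Java client API (assuming Hadoop client libraries on the classpath and cluster settings in core-site.xml/hdfs-site.xml; the class name and path are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/");  // defaults are reported per path
        System.out.println("Block size  : " + fs.getDefaultBlockSize(p));
        System.out.println("Replication : " + fs.getDefaultReplication(p));
        fs.close();
    }
}
```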
HDFS: KEY FEATURES
 Highly fault tolerant (automatic failure recovery)
 High throughput
 Designed for very large files (sizes in the TB range) that are few in number.
 Provides streaming access to file system data. Especially good for write-once, read-many files (for example, log files).
 Can be built out of commodity hardware; HDFS doesn't need highly expensive storage devices.
HDFS ARCHITECTURE
NAME NODE
 Master of HDFS
 Maintains and manages the data on the Data Nodes
 Highly reliable machine (can even use RAID)
 Expensive hardware
 Stores NO data; just holds metadata!
 Secondary Name Node:
 Periodically reads the metadata from the Name Node's RAM and stores it to hard disk.
 Active & passive Name Nodes from Gen2 Hadoop
DATA NODES
 Slaves in HDFS
 Provide data storage
 Deployed on independent machines
 Responsible for serving read/write requests from clients.
 Data processing is done on the Data Nodes.
HDFS OPERATION
 Client makes a write request to the Name Node
 Name Node responds with information about the available Data Nodes and where the data should be written.
 Client writes the data to the addressed Data Node.
 Replicas of all blocks are automatically created by the data pipeline.
 If a write fails, the Data Node notifies the Client, which obtains a new location to write to.
 If the write completes successfully, an acknowledgement is given to the Client.
 Hadoop uses non-posted writes (each write waits for an acknowledgement).
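A hedged sketch of this write path using the FileSystem API: create() first contacts the Name Node, the stream then pipelines bytes to the Data Nodes, and close() waits for the final acknowledgement (the file path and content are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() asks the Name Node for block locations on Data Nodes
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/log.txt"))) {
            out.writeBytes("first log line\n"); // buffered, then pipelined to Data Nodes
        }                                       // close() waits for the final ack (non-posted write)
        fs.close();
    }
}
```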
HDFS: FILE WRITE
HDFS: FILE READ
HADOOP MAPREDUCE
 Simple programming model
 Hadoop's processing unit
 MapReduce also has a master-slave architecture
 Main components:
 Job Tracker : Master
 Task Tracker : Slave
 Based on Google's MapReduce
 Does not fetch data to the master node; data is processed at the slave nodes and the output is returned to the master
 Implemented using Maps and Reduces
 Input is split by FileInputFormat
 Maps
 Implemented by extending the Mapper class
 Produce (key, value) pairs as intermediate results from the data.
 Reduces
 Implemented by extending the Reducer class
 Produce the required output from the intermediate results of the Maps.
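To make the Mapper/Reducer split concrete, here is the classic word-count job sketched with the org.apache.hadoop.mapreduce API. A minimal sketch, not a tuned production job; input and output paths are supplied as arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate (key, value) pair
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);     // final output
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // split by FileInputFormat
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```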
JOB TRACKER
 Master in MapReduce
 Receives the job request from Client
 Governs execution of jobs
 Makes task scheduling decisions
TASK TRACKER
 Slave in MapReduce
 Governs execution of Tasks
 Periodically reports the progress of tasks
MAPREDUCE ARCHITECTURE
MAPREDUCE OPERATIONS
APACHE HIVE
HIVE
 Built on top of Hadoop
 Supports an SQL-like query language: HiveQL
 Data in Hive is organized into tables
 Provides structure for unstructured Big Data
 Works with data inside HDFS
 Data : a file or group of files in HDFS
 Schema : metadata stored in a relational database
 Data and schema are separated
 Schema applies only to existing data
 Supports primitive column types and nestable collection types (Array and Map)
HIVE QUERY LANGUAGE
 SQL-like language
 DDL : create tables with specific serialization formats
 DML : load data from external sources and insert query results into Hive tables
 Does not support updating or deleting rows in existing tables
 Supports multi-table insert
 Supports custom map-reduce scripts written in any language
 Can be extended with custom functions:
 User-Defined Function (UDF)
 User-Defined Table-generating Function (UDTF)
 User-Defined Aggregation Function (UDAF)
 External interfaces:
 Web UI : management
 Hive CLI : run queries, browse tables, etc.
 API : JDBC, ODBC
 Metastore :
 System catalog containing metadata about Hive tables
 Driver :
 Manages the life cycle of a HiveQL statement through compilation, optimization and execution
 Compiler :
 Translates a HiveQL statement into a plan consisting of a DAG of map-reduce jobs
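A hedged sketch of the JDBC interface listed above (assuming a HiveServer2 instance at localhost:10000 and the Hive JDBC driver on the classpath; the logs table and file path are illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // DDL: a table backed by files in HDFS
            stmt.execute("CREATE TABLE IF NOT EXISTS logs (line STRING)");

            // DML: load data from HDFS into the table
            stmt.execute("LOAD DATA INPATH '/user/demo/log.txt' INTO TABLE logs");

            // HiveQL query: compiled into a DAG of MapReduce jobs
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM logs")) {
                while (rs.next()) System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```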
HIVE ARCHITECTURE
HIVE ACHIEVEMENTS & FUTURE PLANS
 First step toward a warehousing layer for Hadoop (a web-based MapReduce data processing system)
 Accepts only a subset of SQL; working to subsume more SQL syntax
 Uses a rule-based optimizer; plans to build a cost-based optimizer
 Enhancing the JDBC and ODBC drivers to improve interaction with commercial BI tools
 Ongoing work on performance
APACHE HBASE
HBASE
 Distributed, column-oriented database on top of Hadoop/HDFS
 Provides low-latency access to single rows from billions of records
 Column oriented:
 Suited for OLAP
 Best for aggregation
 High compression rate when columns have few distinct values
 No fixed schema or data types
 Built for wide tables : millions of columns, billions of rows
 Denormalized data
 Master-slave architecture
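A hedged sketch of the low-latency single-row access described above, using the HBase Java client (assuming a cluster reachable through the configuration in hbase-site.xml; the events table, d column family and row key are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Write one cell: (row key, column family, qualifier) -> value
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"), Bytes.toBytes("hello"));
            table.put(put);

            // Low-latency read of a single row out of potentially billions
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("msg"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```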
HBASE ARCHITECTURE
HMASTER SERVER
 Like the Name Node in HDFS
 Manages and monitors HBase cluster operations
 Assigns regions to Region Servers
 Handles load balancing and region splitting
REGION SERVER
 Like the Data Node in HDFS
 Highly scalable
 Handles read/write requests
 Communicates directly with clients
INTERNAL ARCHITECTURE
 Tables → Regions
 Region → Stores (one per column family)
 Store → MemStore + StoreFiles → Blocks
APACHE ZOOKEEPER
ZOOKEEPER
 Challenges for distributed applications:
 Coordination
 Race conditions
 Deadlocks
 Partial failure
 Inconsistency
 What is ZooKeeper?
 A distributed coordination service for distributed applications
 Acts like a centralized repository
 ZooKeeper goals:
 Serialization
 Atomicity
 Reliability
 Simple API
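A hedged sketch of coordination through a shared znode, using the ZooKeeper Java client (assuming an ensemble at localhost:2181; the znode path and payload are illustrative):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSketch {
    public static void main(String[] args) throws Exception {
        // Session with a 5 s timeout; the watcher ignores events for brevity
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Create a znode all cooperating processes can see (atomic, ordered)
        if (zk.exists("/demo-config", false) == null) {
            zk.create("/demo-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back; a watch could be set here to react to changes
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```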
ZOOKEEPER ARCHITECTURE
HORTONWORKS DATA PLATFORM
1. Governance and Integration
2. Data Access
3. Data Management
4. Security
5. Operations
 YARN : Data Operating System between Data
Storage and Data Access.
HDP: YARN
 A data operating system on Hadoop
 Enables data to be processed simultaneously in multiple ways
 Provides resource management and a pluggable architecture.
 The data processing engines work with YARN.
HDP: GOVERNANCE AND INTEGRATION
 Provides a reliable, repeatable, and simple framework for managing the flow of data into and out of Hadoop
 Falcon: Framework for simplifying data
management and pipeline processing in Hadoop.
 Sqoop: Tool used to transfer data between Hadoop
and Relational Databases.
 Flume: Service for effectively collecting,
aggregating, and moving large amounts of
streaming data into Hadoop.
HDP: DATA ACCESS
 Batch: MapReduce
 Script: Pig
 Pig Latin defines a set of transformations on data, such as aggregate, join and sort.
 SQL: Hive
 NoSQL: HBase
 Stream: Storm
 Distributed real-time computation system for processing fast, large streams of data.
 Search: Solr
 Advanced full-text search and near-real-time indexing.
HDP: SECURITY
 Critical features for authentication, authorization, accountability and data protection.
HDP: OPERATIONS
 Deploy, monitor and manage a Hadoop cluster within the enterprise data ecosystem.
 Oozie: Java web application used to schedule Hadoop jobs.
CLOUDERA HADOOP
CLOUDERA HADOOP ARCHITECTURE
 Cloudera solution high-level architecture
 Cloudera solution taxonomy
DEPLOYMENT
 Crowbar:
 A complete, automated operations platform.
 Provisions hardware, configures it, and installs Red Hat Enterprise Linux and Cloudera Manager.
 Designed to deploy layers of infrastructure on bare-metal servers, all the way up the stack.
MANAGEMENT
 Ganglia:
 Gathers metrics and tracks them over time.
 Designed to scale to thousands of nodes.
 Nagios:
 Powerful monitoring system.
 Enables organizations to identify and resolve IT infrastructure problems.
 Covers monitoring, alerting, response, reporting, maintenance, and planning.
CDH : COMPONENTS
 Avro: Serialization system.
 Crunch: Java library for more easily writing, testing,
and running MR pipelines.
 DataFu: library of UDFs for data mining and
statistics in Apache Pig.
 Cloudera Impala: Interactive SQL query engine for data in HDFS and HBase.
 Kite SDK: APIs, examples, and docs for building
apps on top of Hadoop.
 Cloudera Search: Offers free-text, Google-style
search of Hadoop data for business users.
CLOUDERA: SOFTWARE LOCATIONS
 JobTracker : Master Name Node
 TaskTracker : Data Node(x)
 NameNode : Master Name Node
 Secondary NameNode : Secondary Name Node
 Operating System Provisioning : Admin Node
 Chef : Admin Node
 Yum Repositories : Admin Node
 Cloudera Manager : Edge Node(x)
 ZooKeeper : Data Node(x)
 HMaster : Master Name Node
 RegionServer : Data Node(x)
 Crowbar Admin : Admin Node
 Journal : Master Name Node, Secondary Name Node, HA Node
CLOUDERA VS HORTONWORKS
 Cloudera uses Crowbar and Cloudera Manager to set up hardware and connect the different tools to the cluster; HDP deploys the YARN data operating system with resource management and a pluggable architecture.
 Cloudera focuses on business solutions, while Hortonworks focuses on the research stream.
 Both support almost all the tools on a Hadoop cluster.
 Cloudera Search in CDH corresponds to Solr search in HDP.
 Cloudera Hadoop defines some extra tools for working on a Hadoop cluster beyond the common Apache Hadoop tools.
CONCLUSION
 Hadoop is a successful solution for handling Big Data
 Hadoop has expanded from a simple project to a full platform
 The projects and tools built on Hadoop are proof of its success.
REFERENCES
[1] "Apache Hadoop", http://hadoop.apache.org/
[2] “Apache Hive”, http://hive.apache.org/
[3] “Apache HBase”, https://hbase.apache.org/
[4] “Apache ZooKeeper”, http://zookeeper.apache.org/
[5] Jason Venner, "Pro Hadoop", Apress Books, 2009
[6] "Hadoop Wiki", http://wiki.apache.org/hadoop/
[7] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, Xiao Qin, "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters", 19th International Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010
[8] Dhruba Borthakur, The Hadoop Distributed File
System: Architecture and Design, The Apache
Software Foundation 2007.
[9] "Apache Hadoop",
http://en.wikipedia.org/wiki/Apache_Hadoop
[10] "Hadoop Overview",
http://www.revelytix.com/?q=content/hadoop-
overview
[11] Konstantin Shvachko, Hairong Kuang, Sanjay
Radia, Robert Chansler, The Hadoop Distributed
File System, Yahoo!, Sunnyvale, California USA,
Published in: Mass Storage Systems and
Technologies (MSST), 2010 IEEE 26th Symposium.
[12] Vinod Kumar Vavilapalli, Arun C Murthy, Chris
Douglas, Sharad Agarwal, Mahadev Konar, Robert
Evans, Thomas Graves, Jason Lowe, Hitesh Shah,
Siddharth Seth, Bikas Saha, Carlo Curino, Owen
O’Malley, Sanjay Radia, Benjamin Reed, Eric
Baldeschwieler, Apache Hadoop YARN: Yet Another
Resource Negotiator, ACM Symposium on Cloud
Computing 2013, Santa Clara, California.
[13] Raja Appuswamy, Christos Gkantsidis, Dushyanth
Narayanan, Orion Hodson, and Antony Rowstron,
Scale-up vs Scale-out for Hadoop: Time to rethink?,
Microsoft Research, ACM Symposium on Cloud
Computing 2013, Santa Clara, California.
[14] “Hortonworks Data Platform”,
http://www.hortonworks.com/hdp/
[15] Dell | Cloudera Solution Reference Architecture
v2.1.0, A Dell Reference Architecture Guide, Nov
2012
[16] “Cloudera”, http://www.cloudera.com/