Hadoop Introduction
Ziv Huang (黃玄超)
2013/12/05
Outline
 Overview of Hadoop
 Overview of HDFS
 Overview of MapReduce
What is Hadoop?
 The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across clusters of
computers using simple programming models.
What is Hadoop?
 Created by Doug Cutting,
Chief Architect at Cloudera and
Director at the Apache Software Foundation
 An Apache top-level project
 Inspired by Google papers:
–SOSP 2003: "The Google File System"
–OSDI 2004: "MapReduce: Simplified Data Processing on Large Clusters"
 Written in Java
 Designed to scale up from single servers to thousands of
machines, each offering local computation and storage
 Handles petabytes of data
 October 2013: release 2.2.0 available
The Core of Hadoop
 Hadoop Common:
The common utilities that support the other Hadoop modules.
 Hadoop Distributed File System (HDFS™):
A distributed file system that provides high-throughput access to
application data.
 Hadoop YARN:
A framework for job scheduling and cluster resource management.
 Hadoop MapReduce:
A YARN-based system for parallel processing of large data sets.
After installing Hadoop (the core library), one can start storing
data in HDFS and writing MapReduce programs to analyze those data.
The Core of Hadoop
Software Packages for Hadoop
 Other Hadoop-related projects at Apache include:
 HBase: storing very big tables
 Hive: querying and managing data using a SQL-like language
 Sqoop: transferring data between RDBMSs and Hadoop
 Pig: a language that makes MapReduce programming easier
 Mahout: machine learning libraries
 Zookeeper: a centralized service maintaining configuration information
 Ambari: a web interface for managing Hadoop services
 Flume: aggregating and moving large amounts of log data
 Avro: a data serialization system
 Whirr: a set of libraries for running cloud services
 Accumulo: another HBase-like store
 Nutch: a web search engine based on Lucene
Cloudera further provides: Hue, Impala, Sentry, Search
Google vs. Hadoop
Google's paper:
OSDI 2006: "Bigtable: A Distributed Storage System for Structured Data"
Selected News Stories of Hadoop Use
 Wal-Mart, the world's largest supermarket chain, uses Hadoop to
mine for new business opportunities; Wal-Mart could even learn of a
daughter's pregnancy before her father did, and proactively mail
promotions for related products
 eBay uses Hadoop to analyze buyer and seller behavior on its site;
Hadoop pre-processes the data, breaking large chunks of
unstructured data into smaller pieces before loading them into the
data warehouse's data models, which speeds up analysis and reduces
the analytical load on the warehouse
 Visa adopted Hadoop in 2009, building two Hadoop clusters (each
under 50 nodes); analysis time shrank from one month to 13
minutes, letting Visa find suspicious transactions faster, warn banks
sooner, and even stop fraudulent transactions in time
 TSMC sent staff to learn Hadoop analytics, even going to the US to
obtain professional Hadoop certifications, to strengthen its capacity
for analyzing process log data
Selected News Stories of Hadoop Use
 Chunghwa Telecom is trying the platform to analyze traffic data,
daily MOD viewership, audio/video content, and other unstructured
data that traditional relational databases struggle to handle.
 Facebook ships the data stored in its MySQL databases to Hadoop
for computation, then moves the results back into MySQL to serve
useful information on users' pages.
 Mainstream database systems such as Oracle Database, Microsoft
SQL Server, IBM DB2, and the open-source MySQL now support
Hadoop; data warehouse products such as Teradata's EDW, EMC's
Greenplum, and IBM's Netezza have likewise embraced Hadoop,
though each vendor's approach differs.
Distributions of Hadoop
 Hadoop distribution vendors:
 Cloudera: the company that Hadoop creator Doug Cutting joined
 Hortonworks: spun off from Yahoo's internal Hadoop team
 SYSTEX (精誠資訊): hardware plus CDH (Cloudera's product) big
data solutions
 Intel: software-plus-hardware big data solutions
 MapR: rewrote the kernel of Hadoop / HDFS / HBase in C/C++;
the external API is identical to standard Hadoop's, but the internals
are almost entirely different. Essentially a closed-source
Hadoop-like product
 IBM: InfoSphere BigInsights
 Talend: Talend Open Studio for Big Data
 For more related information, see
http://wiki.apache.org/hadoop/Distributions%20and%20Commercial
%20Support
 http://en.wikipedia.org/wiki/Apache_Hadoop
What is HDFS?
 A distributed, scalable, and portable file system written in Java for
the Hadoop framework
 HDFS stores metadata on a dedicated server, called the NameNode;
application data are stored on other servers called DataNodes
 All servers are fully connected and communicate with each other
using TCP-based protocols
 DataNodes in HDFS do not rely on data protection mechanisms
such as RAID to make the data durable. Instead, the file content is
replicated on multiple DataNodes for reliability.
 A rack-aware file system
 HDFS is designed for write-once, read-many access patterns on files
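Rack awareness and replication work together: HDFS's default policy places the first replica on the writer's node, the second on a node in a different rack, and the third on another node in that second rack, trading some bandwidth for rack-level fault tolerance. A minimal sketch of that policy (the cluster map and node names are illustrative, not a real HDFS API):

```python
import random

def place_replicas(writer_node, nodes_by_rack):
    """Sketch of HDFS's default 3-replica placement policy:
    1st replica on the writer's node, 2nd on a node in a
    different rack, 3rd on another node in the 2nd's rack."""
    writer_rack = next(r for r, ns in nodes_by_rack.items()
                       if writer_node in ns)
    replicas = [writer_node]
    # 2nd replica: any node on a remote rack (rack fault tolerance)
    rack2 = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second = random.choice(nodes_by_rack[rack2])
    replicas.append(second)
    # 3rd replica: a different node on the same rack as the 2nd
    replicas.append(random.choice(
        [n for n in nodes_by_rack[rack2] if n != second]))
    return replicas
```

For example, with `{"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}` and the writer on `dn1`, the first replica stays on `dn1` and the other two land on `rack2`.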
Writing data in HDFS clusters
 The client asks the NameNode for DataNodes to write to
 The NameNode replies with the addresses of 3 DataNodes, sorted
in increasing distance from the client
 The client streams each data block (64 MB or 128 MB) to the 1st
DataNode in 64 KB packets; the 1st DataNode forwards the block to
the 2nd, and the 2nd forwards it to the 3rd
 The DataNodes inform the client when a block is done
 Once all blocks are written, the client closes the file
 The NameNode stores all metadata on hard disks
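The block and packet sizes above determine how much work a write involves: the NameNode allocates one set of DataNodes per block, while the pipeline streams packets. A back-of-the-envelope sketch (using the 64 MB / 64 KB figures from this slide; real clusters configure these):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB per HDFS block (128 MB also common)
PACKET_SIZE = 64 * 1024         # 64 KB per packet on the write pipeline

def write_units(file_size_bytes):
    """How many blocks the NameNode must allocate, and how many
    packets are streamed through the DataNode pipeline."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    packets = math.ceil(file_size_bytes / PACKET_SIZE)
    return blocks, packets
```

A 200 MB file therefore needs 4 blocks and 3200 packets; with 3-way replication, each packet is forwarded twice more down the pipeline.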
Reading data in HDFS clusters
 The client asks the NameNode for the file's info
 The NameNode replies with (1) the list of all blocks and (2) the list
of DataNodes holding each block, sorted in increasing distance from
the client
 The client asks the nearest DataNode for each block i, j, k, ..., and
the DataNodes send the blocks back
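Because the NameNode returns each block's replica list already sorted by distance, the read path reduces to "take the first live replica, fall back to the next." A small sketch of that selection logic (the block and node names are made up for illustration):

```python
def plan_reads(block_locations, failed=frozenset()):
    """Sketch of the HDFS read path: block_locations maps each
    block id to its DataNodes, sorted by increasing distance from
    the client (as the NameNode returns them). Each block is read
    from the closest replica that is not known to have failed."""
    plan = {}
    for block, datanodes in block_locations.items():
        for dn in datanodes:            # closest first
            if dn not in failed:
                plan[block] = dn        # farther replicas are the fallback
                break
        else:
            raise IOError("no live replica for block " + block)
    return plan
```

If `dn1` dies mid-read, the client simply retries the affected blocks against the next replica in each list; the file stays readable as long as one replica per block survives.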
Management of HDFS
 DataNodes send the NameNode block reports, containing the list of
blocks they hold
 Hadoop 2.2.0: multiple federated NameNodes
 Better at handling massive numbers of small files
 Increased throughput
 A crash of one NameNode won't bring down the whole HDFS
Suggested Readings:
 For more information, see
 http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html
(comic, rough ideas but very easy to understand)
 http://bradhedlund.com/2011/09/10/understanding-hadoop-
clusters-and-the-network/
 http://www.aosabook.org/en/hdfs.html
 Hadoop: The Definitive Guide, 3rd Edition
What is MapReduce?
 MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster
 A MapReduce program is composed of
 a Map() procedure that performs filtering and sorting
 a Reduce() procedure that performs a summary operation
 The MapReduce framework plans and directs the computation by
 arranging the distributed servers in a suitable manner,
 running the various tasks in parallel,
 managing all communications and data transfers between the
various parts of the system, and
 providing for redundancy and fault tolerance.
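The canonical illustration of this model is word count: Map() emits one <key, value> pair per word, the framework groups pairs by key, and Reduce() sums each group. A single-process Python sketch of the model (not Hadoop's actual Java API):

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map(): emit an intermediate <word, 1> pair per word
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce(): summarize all values collected for one key
    return word, sum(counts)

def map_reduce(records, mapper, reducer):
    # Shuffle: group intermediate pairs by key (the framework's job)
    groups = defaultdict(list)
    for key, rec in records:
        for k, v in mapper(key, rec):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))
```

Running `map_reduce(enumerate(["the cat", "the dog"]), map_fn, reduce_fn)` yields `{"cat": 1, "dog": 1, "the": 2}`; on a real cluster the same mapper and reducer would run on many machines in parallel, with the framework handling the grouping and fault tolerance.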
File Flow View of MapReduce
 The input is split into M pieces, typically 16~64 MB each
 One master assigns the M map tasks to workers; each Map
program runs on the machine that stores the corresponding file piece
 Map tasks emit intermediate <key, value> pairs, which are
consumed by R reduce workers
 Reduce starts only when all map jobs are done
 Ideally, M, R >> the number of worker machines
 Google: M = 200,000, R = 5,000, with 2,000 worker machines
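How the M map outputs find their way to the R reduce workers is typically a hash partition on the intermediate key, hash(key) mod R. A sketch of that routing step (using CRC32 as a stand-in for the partitioning hash, so the example is deterministic):

```python
from zlib import crc32

def partition(key, num_reducers):
    """Assign an intermediate key to one of R reduce tasks,
    mirroring the usual hash(key) mod R partitioning."""
    return crc32(key.encode()) % num_reducers

def shuffle(intermediate_pairs, num_reducers):
    """Route every <key, value> pair to its reduce partition;
    all pairs sharing a key land in the same partition."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in intermediate_pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions
```

Because the hash is a pure function of the key, every mapper independently sends a given key's pairs to the same reducer, with no coordination beyond agreeing on R.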
Job Assignment View of MapReduce
 The MapReduce program is replicated to the master and every
worker machine
 The job tracker (on the master) monitors the job, manages the
Map/Reduce phases, and manages retries in case of errors
 Each task tracker (one per worker) executes the tasks of the job
on the locally stored data
YARN –
Apache Hadoop NextGen MapReduce
 Key idea: the job tracker's two responsibilities, resource
management and job scheduling/monitoring, are divided between
 ResourceManager: global assignment of compute resources to
applications
 ApplicationMaster: manages the application's scheduling and
coordination
YARN –
Apache Hadoop NextGen MapReduce
 The ResourceManager has two main components:
 Scheduler: allocates resources to the running applications;
pluggable policies include the FIFO, Capacity, and Fair schedulers
 ApplicationsManager: accepts job submissions, negotiates the first
container for executing the application-specific ApplicationMaster,
and provides the service for restarting the ApplicationMaster
container on failure
 The per-node NodeManager is responsible for containers,
monitoring their resource usage (CPU, memory, disk, network)
Thank you!
