Hadoop/MapReduce/HDFS
Team:
Wasnaa AL-Mawee
Praveen Bhat
Class: CS6550
Department of Computer Science
Western Michigan University
Introduction
• We live in the data age
  - Facebook: 1.01 billion daily active users
  - New York Stock Exchange: about 1 terabyte of new trade data per day
  - Internet Archive: stores approximately 2 petabytes of data
[Figure: sources of data: enterprise, social media, sensor, public, transaction]
Introduction
• Characteristics of data
  - Huge in volume
  - Structured, semi-structured, and unstructured
  - Growing faster than one can imagine
• We call it Big Data!
[Figure: Big Data: velocity, variety, volume]
What is the problem?
Year    Drive capacity    Transfer speed
1990    1,370 MB          4.4 MB/s
2010    1 terabyte        100 MB/s
2013    4 terabytes       146 MB/s
• Capacity has grown much faster than transfer speed, so reading a full drive takes far longer.
• Traditional data storage mechanisms are insufficient.
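The gap between capacity and transfer speed can be made concrete with a little arithmetic. The sketch below uses the figures from this slide and assumes decimal units (1 terabyte = 1,000,000 MB); the function names and the 100-drive example are illustrative, not part of the original slides.

```python
# Drive capacity (MB) and transfer speed (MB/s) taken from the slide.
# Assumption: 1 terabyte = 1,000,000 MB (decimal units).
DRIVES = {
    1990: (1_370, 4.4),
    2010: (1_000_000, 100.0),
    2013: (4_000_000, 146.0),
}


def full_read_seconds(capacity_mb, speed_mb_per_s):
    """Time to read one whole drive sequentially, in seconds."""
    return capacity_mb / speed_mb_per_s


def parallel_read_seconds(capacity_mb, speed_mb_per_s, drives):
    """Same data spread evenly over `drives` disks that are read in parallel."""
    return full_read_seconds(capacity_mb, speed_mb_per_s) / drives


for year, (capacity, speed) in DRIVES.items():
    minutes = full_read_seconds(capacity, speed) / 60
    print(f"{year}: {minutes:,.1f} minutes to read the full drive")

# Hypothetical cluster: the 2010 drive's data split across 100 drives.
print(parallel_read_seconds(1_000_000, 100.0, 100), "seconds with 100 drives")
```

Under these assumptions a full read goes from about 5 minutes in 1990 to more than 7 hours in 2013, which is the motivation for spreading data over many drives and reading them in parallel.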
What do we do ?
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for
more systems of computers.”
—Grace Hopper, Computer Scientist
• Create a cluster of systems
• Store data in clustered systems
• Process data sets independently of one another
Hadoop
Hadoop is a framework for running applications on large clusters built from
commodity hardware.
In other words, it is a reliable shared storage and analysis system.
Hadoop Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Journey of Hadoop
• 2002: Started by Doug Cutting and Mike Cafarella as a text search library
• 2003: Google's distributed file system paper published
• 2006: Yahoo hired Doug Cutting and supported Hadoop
• 2008: Yahoo announced that its search index was generated by a 10,000-core Hadoop cluster
• 2009: Won the minute sort by sorting 500 GB in 59 seconds
• 2013: More than half of the Fortune 50 use Hadoop
Hadoop-related projects at Apache
• Avro
• Cassandra
• Chukwa
• HBase
• Hive
• Mahout
• Pig
• Spark
• Tez
• ZooKeeper
Hadoop Distributed File System (HDFS)
• A file system that manages storage across a network of machines
• Designed to handle
  - Very large files: terabytes, petabytes
  - Streaming data access: write once, read many times
  - Commodity hardware: commonly available hardware
Namenodes and Datanodes
• Two types of nodes operating in a master-worker pattern
• Namenode
  - Master node
  - Manages the filesystem namespace
  - Maintains metadata for all the files and directories in the tree
• Datanode
  - Workhorses of the file system
  - Store and retrieve blocks when told to by a client or the Namenode
  - Periodically report to the Namenode
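To illustrate the division of labour, here is a minimal in-memory sketch. It is not the HDFS API: `ToyNameNode`, `ToyDataNode`, and all their methods are hypothetical names invented for this example. The Namenode stand-in keeps only metadata, while the Datanode stand-ins keep the block contents and report what they hold.

```python
from collections import defaultdict


class ToyDataNode:
    """Hypothetical stand-in for a Datanode: stores block contents by id."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]

    def block_report(self):
        # What this node periodically tells the Namenode.
        return sorted(self.blocks)


class ToyNameNode:
    """Hypothetical stand-in for a Namenode: holds metadata, never file data."""

    def __init__(self):
        self.namespace = {}                       # file path -> ordered block ids
        self.block_locations = defaultdict(set)   # block id -> node ids

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_report(self, node_id, block_ids):
        for block_id in block_ids:
            self.block_locations[block_id].add(node_id)

    def locate(self, path):
        return {b: sorted(self.block_locations[b]) for b in self.namespace[path]}


namenode = ToyNameNode()
dn1, dn5 = ToyDataNode(1), ToyDataNode(5)
dn1.store("A", b"first block")
dn5.store("A", b"first block")
dn5.store("B", b"second block")
namenode.add_file("/results.txt", ["A", "B"])
for dn in (dn1, dn5):
    namenode.receive_report(dn.node_id, dn.block_report())
print(namenode.locate("/results.txt"))
```

The point of the design is that a client only needs the small metadata answer from the master and then talks to the workers directly for the data itself.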
HDFS Architecture
Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Client reading files from HDFS
[Figure: the client asks the Name Node, "Tell me the block locations of results.txt". Data Nodes 1, 2, 5, 6, 8, and 9 sit behind three switches and hold the blocks.]
Name Node metadata for results.txt:
• Blk A: DN1, DN5, DN6
• Blk B: DN8, DN1, DN2
• Blk C: DN5, DN8, DN9
• Client receives a list of Data Nodes for each block
• Picks the first Data Node for each block
• Reads blocks sequentially
Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Client-Read-from-HDFS.PNG
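The read steps above can be sketched as follows. This is an illustration only, using the block locations from the slide's Name Node metadata; `plan_read`, `read_file`, and the `fetch_block` callback are hypothetical names, and no real network or HDFS client call is involved.

```python
# Block locations as listed in the Name Node metadata on this slide.
BLOCK_LOCATIONS = {
    "results.txt": [
        ("A", [1, 5, 6]),
        ("B", [8, 1, 2]),
        ("C", [5, 8, 9]),
    ]
}


def plan_read(filename, locations=BLOCK_LOCATIONS):
    """Return (block, data node) pairs in the order the client reads them."""
    plan = []
    for block_id, datanodes in locations[filename]:
        plan.append((block_id, datanodes[0]))  # pick the first Data Node listed
    return plan


def read_file(filename, fetch_block, locations=BLOCK_LOCATIONS):
    """Read the blocks one after another and concatenate their contents.

    `fetch_block(node, block)` is a caller-supplied stand-in for the
    transfer from a Data Node.
    """
    return b"".join(
        fetch_block(node, block) for block, node in plan_read(filename, locations)
    )


def fake_fetch(node, block):
    # Pretend transfer: returns a label instead of real block data.
    return f"{block}@{node};".encode()


print(plan_read("results.txt"))
print(read_file("results.txt", fake_fetch))
```

The Name Node is consulted once for locations; the bytes themselves flow only between the client and the Data Nodes.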
Writing files to HDFS
[Figure: the client tells the Name Node, "I want to write blocks A, B, C of file.txt". The Name Node answers, "OK. Write to Data Nodes 1, 5, 6". Data Nodes 1, 5, 6, and N are shown.]
• Client consults Name Node
• Writes block directly to one Data Node
• Data Node replicates block
• Cycle repeats for next block
Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Writing-Files-to-HDFS.PNG
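The write steps above can be sketched as a small simulation. It is a sketch under stated assumptions, not HDFS client code: `write_file`, the `choose_datanodes` callback (standing in for the Name Node), and the dictionaries used as Data Node storage are all hypothetical.

```python
def write_file(blocks, choose_datanodes, datanodes):
    """Write each block to one Data Node, which then passes it on to the others.

    blocks           -- dict of block id -> bytes
    choose_datanodes -- Name Node stand-in: block id -> list of node ids
    datanodes        -- dict of node id -> dict acting as that node's storage
    """
    placement = {}
    for block_id, data in blocks.items():      # cycle repeats for each block
        targets = choose_datanodes(block_id)   # client consults the Name Node
        first, rest = targets[0], targets[1:]
        datanodes[first][block_id] = data      # client writes to one Data Node
        source = first
        for node in rest:                      # Data Nodes replicate the block
            datanodes[node][block_id] = datanodes[source][block_id]
            source = node
        placement[block_id] = targets
    return placement


storage = {node_id: {} for node_id in (1, 5, 6)}
placement = write_file(
    {"A": b"aaa", "B": b"bbb", "C": b"ccc"},
    lambda block_id: [1, 5, 6],   # the answer shown on the slide
    storage,
)
print(placement)
```

Note that the client sends each block only once; the copies are made by the Data Nodes among themselves.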
What is MapReduce?
• MapReduce is a programming model for processing
large data sets with a parallel, distributed algorithm
on a cluster.
• Published in 2004 by Google engineers Jeffrey
Dean and Sanjay Ghemawat.
MapReduce Features
• Large-scale distributed data processing
• Parallel programming.
• Simple but restricted.
• Load Balancing
• Handling machine failure
When should we use MapReduce ?
Query
• Index and search such as inverted index
• Classification
• Filtering
Analytics
• Sorting and merging
• Frequency distribution
• Summarization and statistics
• SQL-based queries: group by, having, etc.
• Generation of graphics
Others
• Message passing, such as the breadth-first search algorithm
MapReduce Inspiration!
- Read massive data
- Map: extract data from each record
  map (in_key, in_value) -> (out_key, intermediate_value) list
- Shuffle and sort
- Reduce: aggregate, filter, summarize, and transform
  reduce (out_key, intermediate_value list) -> out_value list
- Write the result
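A single-process sketch may help make the map and reduce signatures concrete. It imitates the phases in memory and is not the Hadoop API; `run_mapreduce` and the word-length example (a small frequency distribution) are illustrative assumptions.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of map -> shuffle and sort -> reduce."""
    # Map: each (in_key, in_value) yields (out_key, intermediate_value) pairs.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(mapper(in_key, in_value))

    # Shuffle and sort: group intermediate values by key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)

    # Reduce: each key with its value list yields a list of output values.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def length_mapper(line_number, line):
    # Emit (word length, 1) for every word in the record.
    return [(len(word), 1) for word in line.split()]


def sum_reducer(length, ones):
    # Aggregate: how many words have this length.
    return [sum(ones)]


records = [(0, "big data needs big clusters")]
result = run_mapreduce(records, length_mapper, sum_reducer)
print(result)
```

In a real cluster the map calls run on many machines and the shuffle moves data over the network, but the contract between the two user-supplied functions is the same.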
MapReduce Process Architecture
MapReduce Examples
1. Word Counting
2. Inverted indexes
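Both examples fit in a few lines when run in one process. The sketch below is an illustration, not Hadoop code: `map_reduce`, the mapper and reducer names, and the two sample documents are made up for this example, and the shuffle is imitated by sorting the pairs and grouping equal keys.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Map every record, sort the pairs by key, then reduce each key group."""
    pairs = sorted(pair for key, value in records for pair in mapper(key, value))
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    }


# 1. Word counting: emit (word, 1), then add up the ones.
def wc_map(doc_id, text):
    return [(word, 1) for word in text.lower().split()]


def wc_reduce(word, counts):
    return sum(counts)


# 2. Inverted index: emit (word, document id), then collect the ids.
def ii_map(doc_id, text):
    return [(word, doc_id) for word in set(text.lower().split())]


def ii_reduce(word, doc_ids):
    return sorted(set(doc_ids))


DOCS = [
    ("d1", "hadoop stores data"),
    ("d2", "hadoop processes data in parallel"),
]

word_counts = map_reduce(DOCS, wc_map, wc_reduce)
inverted_index = map_reduce(DOCS, ii_map, ii_reduce)
print(word_counts)
print(inverted_index)
```

The two jobs share the same driver and differ only in what the mapper emits as the value and how the reducer combines the values for a key.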
MapReduce Algorithms
1. MapReduce-based disease propagation detection
2. MapReduce-based trading strategies
3. MapReduce-based graph processing algorithm
Final Note !
• The open source community is taking newer and larger steps
– Spark, Ceph, OpenStack
• Need for better processing
– Batch processing + Streaming
• Time to move on from Hadoop?
References
• http://www.intelligententerprise.com/showArticle.jhtml?articleID=207800705.
• http://mashable.com/2008/10/15/facebook-10-billion-photos/.
• http://blog.familytreemagazine.com/insider/Inside+Ancestrycoms+TopSecret+Data+Center.aspx.
• http://www.archive.org/about/faqs.php.
• http://www.interactions.org/cms/?pid=1027032.
• Hadoop The Definitive Guide 2nd Edition by Tom White
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” October 2003
• http://www.forbes.com/sites/teradata/2015/05/22/the-future-of-hadoop-is-cloudy-with-a-chance-of-growing-ecosystem/
• R. Ranjan and R. Misra, “Epidemic Disease Propagation Detection Algorithm using MapReduce for Realistic Social Contact
Networks,” IEEE Int. Conf. on High Performance Computing and Applications, vol. 2, Bhubaneswar, Dec. 2014, pp. 1-6.
• X. Qin et al., “Optimizing Parameters of Algorithm Trading Strategies using MapReduce,” 9th IEEE Int. Conf. Fuzzy
Systems and Knowledge Discovery, Sichuan, May 2012, pp. 2738-274.
• K. Shirahata, H. Sato, T. Suzumura, and S. Matsuoka, “A Scalable Implementation of a MapReduce-based Graph Processing
Algorithm for Large Scale Heterogeneous Supercomputers,” 13th IEEE/ACM Int. Symp. on Cluster, Cloud, and Grid
Computing, Delft, May 2013, pp. 277-284.
• G. Yang, “The Application of MapReduce in the Cloud Computing,” 2nd IEEE Int. Symp. on Intelligence Information
Processing and Trusted, Hubei, Oct. 2011, pp. 154-156.
• C. Goncalves, L. Assuncao, and J.C. Cunha, “Data Analytics in the Cloud with Flexible MapReduce Workflows,” 4th IEEE Int.
Conf. on Cloud Computing Technology and Science, Taipei, Dec. 2012, pp. 427-434.
• Count Frequencies of Words in Document. Last accessed Nov. 15, 2015. Available
on: http://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf.
• Link Elevation. Last accessed Nov. 15, 2015. Available on: http://www.slideshare.net/ChicagoHUG/mr.
• Inverted indexes. Last accessed Nov. 15, 2015. Available on:
http://blog.cloudera.com/wp-content/uploads/2010/01/InvertedIndex.pdf.
