Anatomy of distributed computing with Hadoop

What is Hadoop?
• Hadoop started out as a subproject of Nutch, created by Doug Cutting

• Hadoop boosted Nutch’s scalability

• Enhanced by Yahoo!, it became an Apache top-level project

• A system for distributed big data processing
• Big data means terabytes, petabytes and more…
• Exabyte and zettabyte datasets?

Hadoop basics
• Implements the model described in Google’s MapReduce whitepaper:
  http://research.google.com/archive/mapreduce.html

• Hadoop is a combination of:
  HDFS: storage
  MapReduce: computation

HDFS
Hadoop Distributed File System
• It’s a file system
  bin/hadoop dfs -<command> <options>

  <command>
  cat             expunge         put
  chgrp           get             rm
  chmod           getmerge        rmr
  chown           ls              setrep
  copyFromLocal   lsr             stat
  copyToLocal     mkdir           tail
  cp              moveFromLocal   test
  du              moveToLocal     text
  dus             mv              touchz

• It’s accessible

• It’s distributed
  - It employs a master-slave architecture

• Name Node:
  Stores file system metadata

• Secondary Name Node(s):
  Periodically merge the file system image with the edit log

• Data Node(s):
  Store the actual data (blocks)
  Allow data to be replicated

MapReduce
• A programming model for distributed data processing

• Its data processing primitives are functions:
  Mappers and Reducers

MapReduce

! To decompose a problem for MapReduce, think of the data in
terms of keys and values:

<key, value>
<user id, user profile>
<timestamp, apache log entry>
<tag, list of tagged images>

MapReduce
• Mapper
  Function that takes a key and a value and emits
  zero or more keys and values

• Reducer
  Function that takes a key and all the values “mapped”
  to it and emits zero or more new keys and values
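The two function shapes above can be sketched in plain Java without any Hadoop classes. This is only an illustration under stated assumptions: the class name `MapReduceShapes`, the word-count logic, and the use of `Map.Entry` as a `<key, value>` pair are hypothetical choices made for the sketch, not part of the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the Mapper and Reducer shapes; no Hadoop classes are used.
class MapReduceShapes {

    // Mapper: one <key, value> in, zero or more <key, value> out.
    // Here: <line offset, line text> -> <word, 1> for every word on the line.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) { // a blank line emits nothing
                out.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    // Reducer: one key plus all values mapped to it in, zero or more <key, value> out.
    // Here: <word, [1, 1, ...]> -> <word, sum>.
    static List<Map.Entry<String, Integer>> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<String, Integer>(word, sum));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "hello hadoop hello"));
        System.out.println(reduce("hello", Arrays.asList(1, 1)));
    }
}
```

Note that the mapper ignores its input key here; that is common when the key only identifies the position of a record in the input.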

MapReduce example
• “Hello World” for Hadoop:
  http://wiki.apache.org/hadoop/WordCount

• “Tag Cloud” example for Hadoop:

  [Figure: a tag cloud of tag1 … tag6, each tag sized by weight(tag_i)]

Tag Cloud example
• Input is taggable content (images, posts,
  videos) with space-separated tags:
  <post_i, “tag_1 tag_2 … tag_n”>

• Output is each tag_i with its count, plus the total number of tags:
  <tag_i, tag count>
  <total tags, total tags count>

• Results:
  weight(tag_i) = count(tag_i) / total tags
  font(tag_i) = fn(weight(tag_i))

Tag Cloud Mapper
• Mapper extends the class:
  org.apache.hadoop.mapreduce.Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

• Mapper input:
  <post1, “tag1 tag3”>
  <post2, “tag3”>
  <post3, “tag2 tag3 tag4”>
  <post4, “tag1 tag2 tag3”>

  Simplify the model and make the line number the key:

  <line1, “tag1 tag3”>
  <line2, “tag3”>
  <line3, “tag2 tag3 tag4”>
  <line4, “tag1 tag2 tag3”>

  Write the raw tags to the input file

Tag Cloud Mapper
• Mapper input (values are tags read from the file; the line number is the key):
  <0, “tag1 tag3”>
  <1, “tag3”>
  <2, “tag2 tag3 tag4”>
  <3, “tag1 tag2 tag3”>

• map() body for one value, e.g. “tag1 tag3” (space-separated tags):
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line, " ");
  // emit the number of tags on this line under the shared "total tags" key
  context.write(TOTAL_TAGS_KEY,
      new IntWritable(tokenizer.countTokens()));
  while (tokenizer.hasMoreTokens()) {
    Text tag = new Text(tokenizer.nextToken());
    context.write(tag, new IntWritable(1)); // emit <tag, 1> as intermediate output
  }

• Mapper output, emitted via context.write() (one row per input line):
  <“total tags”, 2> <“tag1”, 1> <“tag3”, 1>
  <“total tags”, 1> <“tag3”, 1>
  <“total tags”, 3> <“tag2”, 1> <“tag3”, 1> <“tag4”, 1>
  <“total tags”, 3> <“tag1”, 1> <“tag2”, 1> <“tag3”, 1>

Reducer phases
• 1. Shuffle or Copy phase:
  Copies output from the Mappers to the Reducer’s local file system

• 2. Sort phase:
  Sorts Mapper output by key. This becomes the Reducer input

  Mapper output:
  <“total tags”, 2> <“tag1”, 1> <“tag3”, 1>
  <“total tags”, 1> <“tag3”, 1>
  <“total tags”, 3> <“tag2”, 1> <“tag3”, 1> <“tag4”, 1>
  <“total tags”, 3> <“tag1”, 1> <“tag2”, 1> <“tag3”, 1>

  shuffle & sort by key

  Reducer input:
  <“tag1”, 1> <“tag1”, 1>
  <“tag2”, 1> <“tag2”, 1>
  <“tag3”, 1> <“tag3”, 1> <“tag3”, 1> <“tag3”, 1>
  <“tag4”, 1>
  <“total tags”, 2> <“total tags”, 1> <“total tags”, 3> <“total tags”, 3>

• 3. Reduce or Emit phase:
  Performs reduce() once for each key, with its sorted group of values as input
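The three phases can be imitated in a single process to see what the Reducer receives. The sketch below is a toy model, not Hadoop code: the class `ShuffleSortSketch` and its method names are hypothetical, a `TreeMap` stands in for the sort phase, and the key string "total tags" is assumed from the slides.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-process model of the three reducer phases; not Hadoop code.
class ShuffleSortSketch {

    // Phases 1 and 2: gather all pairs and group them by key.
    // TreeMap iterates its keys in sorted order, mimicking the sort phase.
    static TreeMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapperOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Phase 3: one reduce() call per key; here reduce() sums the group.
    static TreeMap<String, Integer> reduceAll(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            result.put(group.getKey(), sum);
        }
        return result;
    }

    // Mapper output for the four input lines used on the slides.
    static List<Map.Entry<String, Integer>> sampleMapperOutput() {
        String[] lines = {"tag1 tag3", "tag3", "tag2 tag3 tag4", "tag1 tag2 tag3"};
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines) {
            String[] tags = line.split(" ");
            out.add(new SimpleEntry<String, Integer>("total tags", tags.length));
            for (String tag : tags) {
                out.add(new SimpleEntry<String, Integer>(tag, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> grouped = shuffleAndSort(sampleMapperOutput());
        System.out.println(grouped);            // Reducer input, grouped and sorted by key
        System.out.println(reduceAll(grouped)); // one summed value per key
    }
}
```

In a real cluster the grouping happens across machines and on disk; the in-memory map only shows the resulting shape of the data.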

Tag Cloud Reduce phase
• Reducer extends the class:
  org.apache.hadoop.mapreduce.Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

• Reducer input (pairs grouped by tag_i):
  <“tag1”, 1> <“tag1”, 1>
  <“tag2”, 1> <“tag2”, 1>
  <“tag3”, 1> <“tag3”, 1> <“tag3”, 1> <“tag3”, 1>
  <“tag4”, 1>
  <“total tags”, 2> <“total tags”, 1> <“total tags”, 3> <“total tags”, 3>

• reduce() body for one group, e.g. [<“tag1”, 1>, <“tag1”, 1>]:
  int tagsCount = 0;
  for (IntWritable value : values) {
    tagsCount += value.get(); // sum all counts for this key
  }
  context.write(key, new IntWritable(tagsCount));

• Reducer output, emitted via context.write():
  <tag1, 2>
  <tag2, 2>
  <tag3, 4>
  <tag4, 1>
  <total tags, 9>

Tag Cloud Output
• Reducer output is a weighted list:
  <tag1, 2>
  <tag2, 2>
  <tag3, 4>
  <tag4, 1>
  <total tags, 9>

• Tag’s weight:
  weight(tag_i) = count(tag_i) / total tags

  <weight(tag1), 2/9>
  <weight(tag2), 2/9>
  <weight(tag3), 4/9>
  <weight(tag4), 1/9>

• Size of font:
  font(tag_i) = fn(weight(tag_i))
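The weight and font formulas can be worked through with the counts above. The slides leave fn unspecified, so the linear scale between a minimum and maximum point size used here is purely an assumption, as are the class name `TagWeights`, the method names, and the literal "total tags" key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Post-processing of the reducer output; plain Java, no Hadoop classes.
class TagWeights {
    // Assumed to match the key the job uses for the running total.
    static final String TOTAL_TAGS_KEY = "total tags";

    // weight(tag_i) = count(tag_i) / total tags
    static Map<String, Double> weights(Map<String, Integer> reducerOutput) {
        int total = reducerOutput.get(TOTAL_TAGS_KEY);
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : reducerOutput.entrySet()) {
            if (!entry.getKey().equals(TOTAL_TAGS_KEY)) {
                result.put(entry.getKey(), entry.getValue() / (double) total);
            }
        }
        return result;
    }

    // Hypothetical fn: scales a weight in [0, 1] linearly between minPt and maxPt.
    static int fontSize(double weight, int minPt, int maxPt) {
        return (int) Math.round(minPt + weight * (maxPt - minPt));
    }

    public static void main(String[] args) {
        Map<String, Integer> reducerOutput = new LinkedHashMap<>();
        reducerOutput.put("tag1", 2);
        reducerOutput.put("tag2", 2);
        reducerOutput.put("tag3", 4);
        reducerOutput.put("tag4", 1);
        reducerOutput.put(TOTAL_TAGS_KEY, 9);
        for (Map.Entry<String, Double> entry : weights(reducerOutput).entrySet()) {
            System.out.println(entry.getKey() + " weight=" + entry.getValue()
                    + " font=" + fontSize(entry.getValue(), 10, 28) + "pt");
        }
    }
}
```

Any monotonic function would do for fn; a logarithmic scale is a common alternative when a few tags dominate the counts.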

Between Map and Reduce
• Combiner:
  - extends the class org.apache.hadoop.mapreduce.Reducer
  - works as an in-memory Reducer on the Mapper’s output
  - serves as an additional optimization

  Mapper output:
  <“total tags”, 2> <“tag1”, 1> <“tag1”, 1> <“tag3”, 1>

  in-memory combine

  Combiner output:
  <“total tags”, 2> <“tag1”, 2> <“tag3”, 1>

• Partitioner:
  - extends the class org.apache.hadoop.mapreduce.Partitioner
  - assigns each intermediate <key, value> pair from the
    Mapper to its designated Reducer partition
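Both steps can be sketched in plain Java. The combiner below pre-sums values per key within one Mapper's output, and the partition function applies the same formula as Hadoop's default HashPartitioner, here to `String.hashCode()` rather than to a Hadoop `Text` key. The class `CombinePartitionSketch` and its method names are hypothetical, and the Tag Cloud project may well configure these steps differently.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the combine and partition steps; not Hadoop code.
class CombinePartitionSketch {

    // Combiner: sums the values per key within a single mapper's output,
    // so fewer pairs have to be copied to the reducers.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    // Partitioner: hash of the key, made non-negative, modulo the reducer count.
    // Equal keys always land in the same partition.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = new ArrayList<>();
        mapperOutput.add(new SimpleEntry<String, Integer>("total tags", 2));
        mapperOutput.add(new SimpleEntry<String, Integer>("tag1", 1));
        mapperOutput.add(new SimpleEntry<String, Integer>("tag1", 1));
        mapperOutput.add(new SimpleEntry<String, Integer>("tag3", 1));

        Map<String, Integer> combined = combine(mapperOutput);
        System.out.println(combined);
        for (String key : combined.keySet()) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

Summing is safe to run as a combiner because addition is associative and commutative; a reduce function without those properties cannot be reused this way.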

Time for a Workshop
Standalone mode
• Build the “Tag Cloud” project jar:
  cd $TAG_CLOUD_HOME
  mvn clean install

• Check the input directory:
  $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/input/

• Check the input file:
  $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/input/tags01

• Submit TagCloudJob to Hadoop:
  $HADOOP_HOME/bin/hadoop jar $TAG_CLOUD_HOME/target/tagcloud-1.0.jar \
      com.altoros.rnd.hadoop.tagcloud.TagCloudJob $TAG_CLOUD_HOME/input \
      $TAG_CLOUD_HOME/output

• Check the output directory:
  $HADOOP_HOME/bin/hadoop fs -ls $TAG_CLOUD_HOME/output/

• Check the output file:
  $HADOOP_HOME/bin/hadoop fs -cat $TAG_CLOUD_HOME/output/part-r-00000

Apache Pig
• Higher-level data processing layer on top of Hadoop
• Data-flow oriented language (Pig scripts)
• Data types include sets, associative arrays and tuples
• Developed at Yahoo!

Apache Hive
• Feature set is similar to Pig’s
• SQL-like data warehouse infrastructure
• Its language is closer to standard SQL
• Supports SELECT, JOIN, GROUP BY, etc.
• Developed at Facebook

Apache HBase
• Column-store database (modeled after Google BigTable)
• HDFS is the underlying file system
• Holds extremely large datasets (multi-TB)
• Constrained access model

Apache Mahout
• Scalable machine learning algorithms on top of Hadoop:
  - filtering,
  - recommendations,
  - classifiers,
  - clustering

Apache ZooKeeper
• Common services for distributed applications:
  - group services,
  - configuration management,
  - naming services,
  - synchronization

Oozie
• Workflow engine for Hadoop
• Orchestrates dependencies between jobs running on Hadoop
  (including HDFS, Pig and MapReduce)
• Another query processing API
• Developed at Yahoo!

Apache Chukwa
• System for reliable large-scale log collection
• Displays, monitors and analyzes results
• Built on top of the Hadoop Distributed File System (HDFS)
  and MapReduce
• Incubated at apache.org

Questions

links:
http://www.slideshare.net/tazija/anatomy-of-distributed-computing-with-hadoop
https://github.com/tazija/TagCloud

skype: siarhei_bushyk
mailto: tazija@gmail.com
mailto: sergey.bushik@altoros.com
