Big Data & Analytics:
MapReduce/Hadoop – A Programmer’s Perspective

Tushar Telichari
Principal Engineer – NetWorker Development
EMC Proven Specialist – Data Center Architect

Abstract: In this session, two of the most prominent technologies in the realm of Big Data are covered, namely MapReduce and Hadoop. We will take an in-depth look at MapReduce, Hadoop, and the Hadoop ecosystem, including Hadoop setup and maintenance, MapReduce/Hadoop programming, and interacting with the Hadoop Distributed File System (HDFS).

@tushartelichari



© Copyright 2012 EMC Corporation. All rights reserved.                                                         1
Agenda
    What is Big Data?
    Introduction
    MapReduce Framework
    MapReduce/Hadoop Programming
    Interacting with Hadoop Distributed File
    System (HDFS)
    Demo



What is Big Data?
In information technology, big data is a collection of data sets so
large and complex that it becomes awkward to work with using
on-hand database management tools. Difficulties include
capture, storage, search, sharing, analysis, and visualization. The
trend to larger data sets is due to the additional information
derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of
data, allowing correlations to be found to "spot business trends,
determine quality of research, prevent diseases, link legal
citations, combat crime, and determine real-time roadway traffic
conditions." - Wikipedia




Introduction
    Volume of data being generated is growing
    exponentially and enterprises are struggling
    to manage and analyze it
    Most existing tools and methodologies for
    filtering and analyzing this data lack the
    speed and performance needed to yield
    meaningful results
    Big Data has significant potential to create
    value for both businesses and consumers


Introduction
Continued


    MapReduce is a software framework introduced by
    Google for processing huge datasets for certain
    kinds of problems on a distributed system
    Hadoop is an open source software framework
    inspired by Google’s MapReduce and Google File
    System




MapReduce Framework
    A parallel programming model developed by
    Google as a mechanism for processing large
    amounts of raw data, e.g., web pages the
    search engine has crawled
    This data is so large that it must be
    distributed across thousands of machines in
    order to be processed in a reasonable time
     This distribution implies parallel computing
    since the same computations are performed
    on each CPU, but with a different dataset

MapReduce Framework
Continued


    MapReduce is an abstraction that allows simple
    computations to be performed while hiding the
    details of parallelization, data distribution, load
    balancing, and fault tolerance




Programming model & constructs
    MapReduce works by breaking the processing
    into two phases: the map phase and the
    reduce phase
    Each phase has key-value pairs as input and
    output, the types of which may be chosen by
    the programmer
    The programmer also specifies two functions:
    the map function and the reduce function



Steps in MapReduce
    Map works independently to convert input
    data to key-value pairs
    Reduce works independently on all values for
    a given key and transforms them into a single
    output set (possibly even an empty one) per key

                       Step                              Input            Output
                       map                               <k1, v1>         list <k2, v2>
                       reduce                            <k2, list(v2)>   list <k3, v3>
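The grouping between the two steps (the shuffle) can be sketched in plain Java. This is a minimal illustration, not part of the Hadoop API; the class and method names are assumptions for the example:

```java
import java.util.*;

// Sketch of the shuffle between map and reduce: the map output
// list<k2, v2> is grouped into <k2, list(v2)> for the reducer.
class Shuffle {
    // Groups intermediate (key, value) pairs by key, as the
    // framework does between the map and reduce phases.
    static Map<String, List<String>> group(List<Map.Entry<String, String>> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = List.of(
            Map.entry("apple", "1"), Map.entry("ball", "1"), Map.entry("apple", "1"));
        System.out.println(group(pairs)); // prints {apple=[1, 1], ball=[1]}
    }
}
```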




“Hello World”: Word Count Program
    Word count is the traditional “hello world”
    program for MapReduce
    The problem definition is to count the
    number of times each word occurs in a set of
    documents
    The program reads in a stream of text and
    emits each word as a key with a value of 1




“Hello World”: Word Count Program
    Map(String input_key, String input_value) {
        // input_key: document name
        // input_value: document contents
        for each word w in input_value {
            EmitIntermediate(w, "1");
        }
    }

    Reduce(String key, Iterator intermediate_values) {
        // key: a word, same for input and output
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values {
            result += ParseInt(v);
        }
        Emit(AsString(result));
    }
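The pseudocode above can be turned into a small, self-contained Java simulation of the three phases (map, shuffle, reduce). The class name WordCountSim is illustrative; the real Hadoop version appears on the following slides:

```java
import java.util.*;

// Simulation of word count: map emits (word, "1") per word,
// the shuffle groups values by word, and reduce sums the counts.
class WordCountSim {
    static Map<String, Integer> wordCount(List<String> documents) {
        // Map phase: emit (w, "1") for each word in each document.
        List<String[]> intermediate = new ArrayList<>();
        for (String doc : documents) {
            for (String w : doc.split("\\s+")) {
                if (!w.isEmpty()) intermediate.add(new String[]{w, "1"});
            }
        }
        // Shuffle: group intermediate values by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : intermediate) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        // Reduce phase: sum the "1"s for each word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            int result = 0;
            for (String v : e.getValue()) result += Integer.parseInt(v);
            counts.put(e.getKey(), result);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("some_text a some_text", "some_text b")));
    }
}
```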




Map function – Word Count Program
    Input parameters:
<String input_key, String input_value>
    Output
A list of <String word, Integer count>




Reduce function – Word Count Program
    The map output for one document may be a
    list with the pair <"some_text", 1> three times,
    and the map output for another document
    may be a list with the pair <"some_text", 1>
    twice. The aggregated pair the reducer will
    see is <"some_text", list(1,1,1,1,1)>
    The output of the reducer function is
    <"some_text", 5>, which is the total number
    of times "some_text" has occurred in the
    document set

MapReduce/Hadoop Programming
    WordCount program
    Source code –
    /usr/local/Hadoop/src/examples/org/apache/
    hadoop/examples/WordCount.java




MapReduce/Hadoop Programming
    Job configuration
        – Identify classes implementing Mapper and
          Reducer interfaces
                 ▪ job.setMapperClass(TokenizerMapper.class);
                 ▪ job.setCombinerClass(IntSumReducer.class);
                 ▪ job.setReducerClass(IntSumReducer.class);
        – Specify inputs, outputs
                 ▪ job.setOutputKeyClass(Text.class);
                 ▪ job.setOutputValueClass(IntWritable.class);
                 ▪ FileInputFormat.addInputPath(job, new
                   Path(otherArgs[0]));
                 ▪ FileOutputFormat.setOutputPath(job, new
                   Path(otherArgs[1]));
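Put together, these calls form the driver of the bundled WordCount example. The sketch below follows the Hadoop 1.x API and needs the Hadoop libraries on the classpath; TokenizerMapper and IntSumReducer are the classes named above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    // Identify the Mapper, Combiner and Reducer classes.
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    // Specify output types and input/output paths.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Submit the job to the cluster and wait for it to finish.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```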


MapReduce/Hadoop Programming
    Job submission
        – Submit the job to the cluster and wait for it to
          finish.
                 ▪ job.waitForCompletion(true)




MapReduce/Hadoop Programming
    Mapper class TokenizerMapper
        – The Mapper implementation, TokenizerMapper,
          processes one line at a time. It splits the line
          into tokens separated by whitespace, via
          StringTokenizer, and emits a key-value pair
          <word, 1> (context.write(word, one))
    Reducer class IntSumReducer
        – The Reducer implementation, IntSumReducer,
          simply sums the values, which are the
          occurrence counts for each key (i.e. words,
          in this example).
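The two classes, essentially as they appear in the bundled WordCount example (Hadoop 1.x API; in the original source they are declared as static nested classes of WordCount, and compiling them requires the Hadoop libraries):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Splits each input line into whitespace-separated tokens and
// emits <word, 1> for every token.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

// Sums the counts for each word; also used as the combiner.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```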

Hadoop Daemons
    NameNode
    DataNode
    Secondary NameNode
    JobTracker
    TaskTracker




Hadoop Cluster
    A cluster is built by configuring a Hadoop
    environment on two or more individual
    machines and then linking them together.
    The link is achieved by configuring the
    machines in master/slave mode
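In Hadoop 1.x, this linking is typically done with the conf/slaves file on the master node (one worker hostname per line) plus a common HDFS address in conf/core-site.xml on every node. The hostname and port below are illustrative, not values from this deck:

```xml
<!-- conf/core-site.xml on every node: point HDFS at the master
     (hostname "master" and port 54310 are illustrative) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>
```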




Hadoop Cluster




Interacting with HDFS
    The HDFS operations are performed via the
    "hadoop dfs" command:
    hduser@ncdqd110:/usr/local/Hadoop>
    hadoop dfs
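Running hadoop dfs with no arguments prints the list of available file system commands. A few common operations look like this against a running cluster (the paths are illustrative):

```
# Copy a local file into HDFS, list it, and print its contents.
hadoop dfs -mkdir /user/hduser/input
hadoop dfs -put local.txt /user/hduser/input/
hadoop dfs -ls /user/hduser/input
hadoop dfs -cat /user/hduser/input/local.txt
```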




Demo
    Hadoop Setup & Maintenance
    Setting up a Hadoop cluster
    Hadoop in action




Additional Information
• Visit
         –    http://academy.mapr.com
         –    http://www.datasciencecentral.com/
         –    http://datascienceseries.com/
         –    http://gigaom.com/data/




• Get Started
         – Greenplum HD Community Edition (available soon)
         – Data Science and Big Data Analytics Certification from
           EMC Education Services


Q&A

Get Social @EMCAcademics




Next Session:
                   Webinar: Cloud
                Computing Demystified
                         on
                    30 Aug 2012


