Big Data & Analytics:
MapReduce/Hadoop – A Programmer’s Perspective

Tushar Telichari
Principal Engineer – NetWorker Development
EMC Proven Specialist – Data Center Architect

Abstract: In this session, two of the most prominent technologies in the realm of Big Data are covered, namely MapReduce and Hadoop. We will take an in-depth look at MapReduce, Hadoop, and the Hadoop ecosystem, including Hadoop setup and maintenance, MapReduce/Hadoop programming, and interacting with the Hadoop Distributed File System (HDFS).

@tushartelichari



© Copyright 2012 EMC Corporation. All rights reserved.                                                         1
Agenda
    What is Big Data?
    Introduction
    MapReduce Framework
    MapReduce/Hadoop Programming
    Interacting with Hadoop Distributed File
    System (HDFS)
    Demo



What is Big Data?
In information technology, big data is a collection of data sets so
large and complex that it becomes awkward to work with using
on-hand database management tools. Difficulties include
capture, storage, search, sharing, analysis, and visualization. The
trend to larger data sets is due to the additional information
derivable from analysis of a single large set of related data, as
compared to separate smaller sets with the same total amount of
data, allowing correlations to be found to "spot business trends,
determine quality of research, prevent diseases, link legal
citations, combat crime, and determine real-time roadway traffic
conditions." - Wikipedia




Introduction
    Volume of data being generated is growing
    exponentially and enterprises are struggling
    to manage and analyze it
    Most existing tools and methodologies for
    filtering and analyzing this data lack the
    speed and performance needed to yield
    meaningful results
    Big Data has significant potential to create
    value for both businesses and consumers


Introduction
Continued


    MapReduce is a software framework introduced by
    Google for processing huge datasets for certain
    kinds of problems on a distributed system
    Hadoop is an open source software framework
    inspired by Google’s MapReduce and Google File
    System




MapReduce Framework
    A parallel programming model developed by
    Google as a mechanism for processing large
    amounts of raw data, e.g., web pages the
    search engine has crawled
    This data is so large that it must be
    distributed across thousands of machines in
    order to be processed in a reasonable time
     This distribution implies parallel computing
    since the same computations are performed
    on each CPU, but with a different dataset

MapReduce Framework
Continued


    MapReduce is an abstraction that allows simple
    computations to be performed while hiding the
    details of parallelization, data distribution, load
    balancing, and fault tolerance




Programming model & constructs
    MapReduce works by breaking the processing
    into two phases: the map phase and the
    reduce phase
    Each phase has key-value pairs as input and
    output, the types of which may be chosen by
    the programmer
    The programmer also specifies two functions:
    the map function and the reduce function



Steps in MapReduce
    Map works independently to convert input
    data to key-value pairs
    Reduce works independently on all values for
    a given key and transforms them into a single
    output set (possibly even an empty one) per key

                       Step                              Input            Output
                       map                               <k1, v1>         list <k2, v2>
                       reduce                            <k2, list(v2)>   list <k3, v3>
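The grouping between the two steps (the shuffle) can be sketched in plain Java. This is a minimal illustration, not part of the Hadoop API; the class and method names are assumptions for the example:

```java
import java.util.*;

// Sketch of the shuffle between map and reduce: the map output
// list<k2, v2> is grouped into <k2, list(v2)> for the reducer.
class Shuffle {
    // Groups intermediate (key, value) pairs by key, as the
    // framework does between the map and reduce phases.
    static Map<String, List<String>> group(List<Map.Entry<String, String>> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = List.of(
            Map.entry("apple", "1"), Map.entry("ball", "1"), Map.entry("apple", "1"));
        System.out.println(group(pairs)); // prints {apple=[1, 1], ball=[1]}
    }
}
```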




“Hello World”: Word Count Program
    Word count is the traditional “hello world”
    program for MapReduce
    The problem definition is to count the
    number of times each word occurs in a set of
    documents
    The program reads in a stream of text and
    emits each word as a key with a value of 1




“Hello World”: Word Count Program
    Map(String input_key, String input_value) {
        // input_key: document name
        // input_value: document contents
        for each word w in input_value {
            EmitIntermediate(w, "1");
        }
    }

    Reduce(String key, Iterator intermediate_values) {
        // key: a word, same for input and output
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values {
            result += ParseInt(v);
        }
        Emit(AsString(result));
    }
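The pseudocode above can be turned into a small, self-contained Java simulation of the three phases (map, shuffle, reduce). The class name WordCountSim is illustrative; the real Hadoop version appears on the following slides:

```java
import java.util.*;

// Simulation of word count: map emits (word, "1") per word,
// the shuffle groups values by word, and reduce sums the counts.
class WordCountSim {
    static Map<String, Integer> wordCount(List<String> documents) {
        // Map phase: emit (w, "1") for each word in each document.
        List<String[]> intermediate = new ArrayList<>();
        for (String doc : documents) {
            for (String w : doc.split("\\s+")) {
                if (!w.isEmpty()) intermediate.add(new String[]{w, "1"});
            }
        }
        // Shuffle: group intermediate values by key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : intermediate) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        // Reduce phase: sum the "1"s for each word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            int result = 0;
            for (String v : e.getValue()) result += Integer.parseInt(v);
            counts.put(e.getKey(), result);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("some_text a some_text", "some_text b")));
    }
}
```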




Map function – Word Count Program
    Input parameters:
<String input_key, String input_value>
    Output
A list of <String word, Integer count>




Reduce function – Word Count Program
    The map output for one document may be a
    list with the pair <"some_text", 1> three times,
    and the map output for another document
    may be a list with the pair <"some_text", 1>
    twice. The aggregated pair the reducer will
    see is <"some_text", list(1,1,1,1,1)>
    The output of the reducer function is
    <"some_text", 5>, which is the total number
    of times "some_text" has occurred in the
    document set

MapReduce/Hadoop Programming
    WordCount program
    Source code –
    /usr/local/Hadoop/src/examples/org/apache/
    hadoop/examples/WordCount.java




MapReduce/Hadoop Programming
    Job configuration
        – Identify classes implementing Mapper and
          Reducer interfaces
                 ▪ job.setMapperClass(TokenizerMapper.class);
                 ▪ job.setCombinerClass(IntSumReducer.class);
                 ▪ job.setReducerClass(IntSumReducer.class);
        – Specify inputs, outputs
                 ▪ job.setOutputKeyClass(Text.class);
                 ▪ job.setOutputValueClass(IntWritable.class);
                 ▪ FileInputFormat.addInputPath(job, new
                   Path(otherArgs[0]));
                 ▪ FileOutputFormat.setOutputPath(job, new
                   Path(otherArgs[1]));
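Put together, these calls form the driver of the bundled WordCount example. The sketch below follows the Hadoop 1.x API and needs the Hadoop libraries on the classpath; TokenizerMapper and IntSumReducer are the classes named above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    // Identify the Mapper, Combiner and Reducer classes.
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    // Specify output types and input/output paths.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Submit the job to the cluster and wait for it to finish.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```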


MapReduce/Hadoop Programming
    Job submission
        – Submit the job to the cluster and wait for it to
          finish.
                 ▪ job.waitForCompletion(true)




MapReduce/Hadoop Programming
    Mapper class TokenizerMapper
        – The Mapper implementation, TokenizerMapper,
          processes one line at a time. It splits the line
          into tokens separated by whitespace, via
          StringTokenizer, and emits a key-value pair
          <word, 1> (context.write(word, one))
    Reducer class IntSumReducer
        – The Reducer implementation, IntSumReducer,
          simply sums the values, which are the
          occurrence counts for each key (i.e. words,
          in this example).
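The two classes, essentially as they appear in the bundled WordCount example (Hadoop 1.x API; in the original source they are declared as static nested classes of WordCount, and compiling them requires the Hadoop libraries):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Splits each input line into whitespace-separated tokens and
// emits <word, 1> for every token.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

// Sums the counts for each word; also used as the combiner.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```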

Hadoop Daemons
    NameNode
    DataNode
    Secondary NameNode
    JobTracker
    TaskTracker




Hadoop Cluster
    A cluster is built by configuring a Hadoop
    environment on two or more individual
    machines and then linking them together.
    The link is achieved by configuring the
    machines in master/slave mode
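In Hadoop 1.x, this linking is typically done with the conf/slaves file on the master node (one worker hostname per line) plus a common HDFS address in conf/core-site.xml on every node. The hostname and port below are illustrative, not values from this deck:

```xml
<!-- conf/core-site.xml on every node: point HDFS at the master
     (hostname "master" and port 54310 are illustrative) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>
```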




Hadoop Cluster




Interacting with HDFS
    The HDFS operations are performed via the
    "hadoop dfs" command:
    hduser@ncdqd110:/usr/local/Hadoop>
    hadoop dfs
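Running hadoop dfs with no arguments prints the list of available file system commands. A few common operations look like this against a running cluster (the paths are illustrative):

```
# Copy a local file into HDFS, list it, and print its contents.
hadoop dfs -mkdir /user/hduser/input
hadoop dfs -put local.txt /user/hduser/input/
hadoop dfs -ls /user/hduser/input
hadoop dfs -cat /user/hduser/input/local.txt
```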




Demo
    Hadoop Setup & Maintenance
    Setting up a Hadoop cluster
    Hadoop in action




Additional Information
• Visit
         –    http://academy.mapr.com
         –    http://www.datasciencecentral.com/
         –    http://datascienceseries.com/
         –    http://gigaom.com/data/




• Get Started
         – Greenplum HD Community Edition (available soon)
         – Data Science and Big Data Analytics Certification from
           EMC Education Services


Q&A

Get Social @EMCAcademics




Next Session:
                   Webinar: Cloud
                Computing Demystified
                         on
                    30 Aug 2012


