Big Data & Analytics MapReduce/Hadoop – A programmer’s perspective
 


In this session two of the most prominent technologies in the realm of Big Data are covered; namely MapReduce and Hadoop.
We will take an in-depth look at MapReduce, Hadoop, and the Hadoop ecosystem, including:
1. Hadoop Setup and Maintenance
2. MapReduce/Hadoop Programming
3. Interacting with the Hadoop Distributed File System (HDFS)

    Presentation Transcript

    • Big Data & Analytics: MapReduce/Hadoop – A Programmer’s Perspective
      Tushar Telichari, Principal Engineer – NetWorker Development, EMC Proven Specialist – Data Center Architect
      Abstract: In this session two of the most prominent technologies in the realm of Big Data are covered, namely MapReduce and Hadoop. We will take an in-depth look at MapReduce, Hadoop, and the Hadoop ecosystem, including: Hadoop Setup and Maintenance, MapReduce/Hadoop Programming, and Interacting with the Hadoop Distributed File System (HDFS). @tushartelichari
      © Copyright 2012 EMC Corporation. All rights reserved.
    • Agenda
      – What is Big Data?
      – Introduction
      – MapReduce Framework
      – MapReduce/Hadoop Programming
      – Interacting with the Hadoop Distributed File System (HDFS)
      – Demo
    • What is Big Data?
      “In information technology, big data is a collection of data sets so large and complex that it becomes awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to ‘spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.’” – Wikipedia
    • Introduction
      – The volume of data being generated is growing exponentially, and enterprises are struggling to manage and analyze it
      – Most existing tools and methodologies for filtering and analyzing this data offer inadequate speed and performance to yield meaningful results
      – Big Data has significant potential to create value for both businesses and consumers
    • Introduction (continued)
      – MapReduce is a software framework introduced by Google for processing huge datasets for certain kinds of problems on a distributed system
      – Hadoop is an open source software framework inspired by Google’s MapReduce and the Google File System
    • MapReduce Framework
      – A parallel programming model developed by Google as a mechanism for processing large amounts of raw data, e.g., web pages the search engine has crawled
      – This data is so large that it must be distributed across thousands of machines in order to be processed in a reasonable time
      – This distribution implies parallel computing, since the same computations are performed on each CPU, but with a different dataset
    • MapReduce Framework (continued)
      – MapReduce is an abstraction that allows simple computations to be performed while hiding the details of parallelization, data distribution, load balancing, and fault tolerance
    • Programming Model & Constructs
      – MapReduce works by breaking the processing into two phases: the map phase and the reduce phase
      – Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer
      – The programmer also specifies two functions: the map function and the reduce function
    • Steps in MapReduce
      – Map works independently to convert input data into key-value pairs
      – Reduce works independently on all values for a given key and transforms them into a single output set (possibly an empty one) per key

        Step      Input             Output
        map       <k1, v1>          list<k2, v2>
        reduce    <k2, list(v2)>    list<k3, v3>
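The two steps and their type signatures can be sketched in plain Java as a local, single-process simulation (not the Hadoop API; the document names and the word-length example are illustrative). Note that the intermediate key type k2 need not match the input key type k1:

```java
import java.util.*;

// Local single-process sketch of the map -> shuffle -> reduce flow.
// Here <k1, v1> = <docName, contents>, <k2, v2> = <wordLength, word>,
// and <k3, v3> = <wordLength, numberOfWordsOfThatLength>.
public class MapReduceSteps {

    // map: (k1, v1) -> list<k2, v2>
    static List<Map.Entry<Integer, String>> map(String docName, String contents) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        for (String w : contents.split("\\s+")) {
            if (!w.isEmpty()) out.add(Map.entry(w.length(), w));
        }
        return out;
    }

    // reduce: (k2, list(v2)) -> list<k3, v3> (here exactly one pair per key)
    static Map.Entry<Integer, Integer> reduce(int length, List<String> words) {
        return Map.entry(length, words.size());
    }

    public static void main(String[] args) {
        // Shuffle: group all intermediate values by their key
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<Integer, String> p : map("doc1", "to be or not to be")) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        grouped.forEach((k, vs) -> System.out.println(reduce(k, vs)));
    }
}
```

On a real cluster the map calls and the reduce calls each run in parallel across machines; the grouping step above is what Hadoop's shuffle phase performs over the network.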
    • “Hello World”: Word Count Program
      – Word count is the traditional “hello world” program for MapReduce
      – The problem definition is to count the number of times each word occurs in a set of documents
      – The program reads in a stream of text and emits each word as a key with a value of 1
    • “Hello World”: Word Count Program

        Map(String input_key, String input_value) {
          // input_key: document name
          // input_value: document contents
          for each word w in input_value {
            EmitIntermediate(w, "1");
          }
        }

        Reduce(String key, Iterator intermediate_values) {
          // key: a word, same for input and output
          // intermediate_values: a list of counts
          int result = 0;
          for each v in intermediate_values {
            result += ParseInt(v);
          }
          Emit(AsString(result));
        }
    • Map Function – Word Count Program
      – Input parameters: <String input_key, String input_value>
      – Output: a list of <String word, Integer count>
    • Reduce Function – Word Count Program
      – The map output for one document may be a list containing the pair <"some_text", 1> three times, and the map output for another document may be a list containing the pair <"some_text", 1> twice
      – The aggregated pair the reducer will see is <"some_text", list(1, 1, 1, 1, 1)>
      – The output of the reduce function is <"some_text", 5>, which is the total number of times "some_text" has occurred in the document set
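This map/shuffle/reduce aggregation can be reproduced in a few lines of plain Java (a local sketch of the pseudocode above, not the Hadoop API; the document names and contents are illustrative):

```java
import java.util.*;

// Local sketch of the word-count Map and Reduce functions from the
// pseudocode slide (plain Java, not the Hadoop API).
public class WordCountLocal {

    // Map: emit (word, 1) for every word in the document contents
    static List<Map.Entry<String, Integer>> map(String docName, String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : contents.split("\\s+")) {
            if (!w.isEmpty()) pairs.add(Map.entry(w, 1));
        }
        return pairs;
    }

    // Reduce: sum all the 1s collected for a single word
    static int reduce(String word, List<Integer> counts) {
        int result = 0;
        for (int v : counts) result += v;
        return result;
    }

    public static void main(String[] args) {
        // Two documents; "some_text" occurs three times in one, twice in the other
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        intermediate.addAll(map("doc1", "some_text a some_text some_text"));
        intermediate.addAll(map("doc2", "some_text b some_text"));

        // Shuffle: aggregate values per key -> <"some_text", list(1,1,1,1,1)>
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : intermediate) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }

        System.out.println(reduce("some_text", grouped.get("some_text"))); // prints 5
    }
}
```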
    • MapReduce/Hadoop Programming
      – WordCount program source code:
        /usr/local/Hadoop/src/examples/org/apache/hadoop/examples/WordCount.java
    • MapReduce/Hadoop Programming
      – Job configuration: identify the classes implementing the Mapper and Reducer interfaces
        ▪ job.setMapperClass(TokenizerMapper.class);
        ▪ job.setCombinerClass(IntSumReducer.class);
        ▪ job.setReducerClass(IntSumReducer.class);
      – Specify inputs and outputs
        ▪ job.setOutputKeyClass(Text.class);
        ▪ job.setOutputValueClass(IntWritable.class);
        ▪ FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        ▪ FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
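Put together, these configuration calls form the driver's main method, roughly as follows (a sketch in the Hadoop 1.x API; it needs the Hadoop jars on the classpath and a running cluster, so it is not a standalone program; TokenizerMapper and IntSumReducer are the classes from WordCount.java):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");     // Hadoop 1.x style job creation
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class); // classes from WordCount.java
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```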
    • MapReduce/Hadoop Programming
      – Job submission: submit the job to the cluster and wait for it to finish
        ▪ job.waitForCompletion(true);
    • MapReduce/Hadoop Programming
      – Mapper: TokenizerMapper
        ▪ The Mapper implementation, TokenizerMapper, processes one line at a time. It splits the line into tokens separated by whitespace, via StringTokenizer, and emits a key-value pair of <word, 1> (context.write(word, one))
      – Reducer: IntSumReducer
        ▪ The Reducer implementation, IntSumReducer, just sums the values, which are the occurrence counts for each key (i.e., each word, in this example)
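A side note on why the same IntSumReducer class can be configured as both the combiner and the reducer: addition is associative, so pre-summing each map task's output and then summing the partial sums yields the same total as summing all the raw 1s at the reducer. A minimal plain-Java illustration (not the Hadoop API):

```java
import java.util.*;

// Why IntSumReducer works as both combiner and reducer: summation is
// associative, so combining (pre-summing) per map task does not change
// the final total the reducer computes.
public class CombinerSketch {

    static int sum(List<Integer> values) {
        int s = 0;
        for (int v : values) s += v;
        return s;
    }

    public static void main(String[] args) {
        // Raw map output for one word, produced by two different map tasks
        List<Integer> mapTask1 = List.of(1, 1, 1);
        List<Integer> mapTask2 = List.of(1, 1);

        // Without a combiner: the reducer sees all five raw 1s
        List<Integer> all = new ArrayList<>(mapTask1);
        all.addAll(mapTask2);
        int withoutCombiner = sum(all);

        // With a combiner: each map task pre-sums, the reducer sums the partials
        int withCombiner = sum(List.of(sum(mapTask1), sum(mapTask2)));

        System.out.println(withoutCombiner + " == " + withCombiner); // 5 == 5
    }
}
```

The practical benefit on a cluster is that the combiner shrinks the intermediate data shipped across the network during the shuffle.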
    • Hadoop Daemons
      – NameNode
      – DataNode
      – Secondary NameNode
      – JobTracker
      – TaskTracker
    • Hadoop Cluster
      – A cluster is built by configuring a single-node Hadoop environment on two or more individual machines and then linking them together
      – The link is achieved by configuring the machines in master/slave mode
    • Hadoop Cluster (diagram)
    • Interacting with HDFS
      – HDFS operations are performed via the hadoop dfs command:
        hduser@ncdqd110:/usr/local/Hadoop> hadoop dfs
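Running `hadoop dfs` with no arguments prints the list of available subcommands. A few commonly used ones, in the Hadoop 1.x syntax the slide shows (the HDFS paths below are illustrative):

```
hadoop dfs -mkdir /user/hduser/input              # create a directory in HDFS
hadoop dfs -put localfile.txt /user/hduser/input  # copy a local file into HDFS
hadoop dfs -ls /user/hduser/input                 # list the contents of an HDFS directory
hadoop dfs -cat /user/hduser/output/part-00000    # print a file stored in HDFS
hadoop dfs -rmr /user/hduser/output               # remove an HDFS directory recursively
```

Note that `part-00000` is the default name of the first reducer's output file, which is why it appears when inspecting job results.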
    • Demo
      – Hadoop setup & maintenance
      – Setting up a Hadoop cluster
      – Hadoop in action
    • Additional Information
      – Visit
        ▪ http://academy.mapr.com
        ▪ http://www.datasciencecentral.com/
        ▪ http://datascienceseries.com/
        ▪ http://gigaom.com/data/
      – Get Started
        ▪ Greenplum HD Community Edition (available soon)
        ▪ Data Science and Big Data Analytics Certification from EMC Education Services
    • Q&A
    • Get Social: @EMCAcademics
    • Next Session: Webinar – Cloud Computing Demystified, on 30 Aug 2012