Are you new to Hadoop and need to start processing data fast and effectively? Have you been playing with CDH and are ready to move on to development supporting a technical or business use case? Are you prepared to unlock the full potential of all your data by building and deploying powerful Hadoop-based applications?
If you're wondering whether Cloudera's Developer Training for Apache Hadoop is right for you and your team, this presentation will help you decide. You will learn who is best suited to attend the live training, what prior knowledge you should have, and what topics the course covers. Cloudera Curriculum Manager, Ian Wrigley, will discuss the skills you will attain during the course and how they will help you become a full-fledged Hadoop application developer.
During the session, Ian will also present a short portion of the actual Cloudera Developer course, discussing the difference between New and Old APIs, why there are different APIs, and which you should use when writing your MapReduce code. Following the presentation, Ian will answer your questions about this or any of Cloudera’s other training courses.
Visit the resources section of cloudera.com to view the on-demand webinar.
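As a companion to the New-versus-Old API discussion above, here is a minimal, hypothetical sketch of a word-count mapper written against the newer org.apache.hadoop.mapreduce API; the class and field names are illustrative, and the older org.apache.hadoop.mapred API would instead implement a Mapper interface and write through an OutputCollector.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mapper: extends the abstract Mapper class and writes through a Context object.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // Context replaces the old API's OutputCollector and Reporter
        }
    }
}
```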
Learn who is best suited to attend the full training, what prior knowledge you should have, and what topics the course covers. Cloudera Curriculum Developer, Jesse Anderson, will discuss the skills you will attain during the course and how they will help you make the most of your HBase deployment in development or production and prepare for the Cloudera Certified Specialist in Apache HBase (CCSHB) exam.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of its open source components have pushed Hadoop to where it is today.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark achieves its lightning-fast results, and show how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Arun C Murthy, Founder and Architect at Hortonworks Inc., talks about the upcoming Next Generation Apache Hadoop MapReduce framework at the Hadoop Summit, 2011.
Spark-on-Yarn: The Road Ahead (Marcelo Vanzin, Cloudera) - Spark Summit
Spark on YARN provides resource management and security features through YARN, but still has areas for improvement. Dynamic allocation in YARN allows Spark applications to grow and shrink executors based on task demand, though latency and data locality could be enhanced. Security supports Kerberos authentication and delegation tokens, but long-lived applications face token expiration issues and encryption needs improvement for control plane, shuffle files, and user interfaces. Overall, usability, security, and performance remain areas of focus.
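As an illustration of the dynamic allocation behaviour described above, here is a minimal sketch using Spark's standard configuration keys; the application name, executor bounds, and timeout are placeholder values, and the job is assumed to be submitted to a YARN cluster (e.g. via spark-submit --master yarn) with the external shuffle service enabled on the NodeManagers.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DynamicAllocationDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("dynamic-allocation-demo")            // placeholder name
                .set("spark.dynamicAllocation.enabled", "true")
                .set("spark.shuffle.service.enabled", "true")     // keeps shuffle files available when executors are released
                .set("spark.dynamicAllocation.minExecutors", "2")
                .set("spark.dynamicAllocation.maxExecutors", "50")
                .set("spark.dynamicAllocation.executorIdleTimeout", "60s");

        // With these settings, YARN grants and reclaims executors as the number of pending tasks changes.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println("dynamic allocation: " + sc.getConf().get("spark.dynamicAllocation.enabled"));
        }
    }
}
```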
a Secure Public Cache for YARN Application Resources - DataWorks Summit
This document discusses YARN's shared cache feature for application resources. It provides an overview of how YARN localizes resources for each application and its containers. The shared cache addresses inefficiencies in this process by caching identical resources on NodeManagers and sharing them between applications and containers. The design goals are for the shared cache to be scalable, secure, fault-tolerant, and transparent. It works by having a shared cache client that interfaces with a shared cache manager, which maintains the metadata and persisted resources. This can significantly reduce data transfer and localization costs for applications that reuse common resources.
The document discusses functional programming concepts and their application to big data problems. It provides an overview of functional programming foundations and languages. Key functional programming concepts discussed include first-class functions, pure functions, recursion, and immutability. These concepts are well-suited for data-centric applications like Hadoop MapReduce. The document also presents a case study comparing an imperative approach to a transaction processing problem to a functional approach, showing that the functional version was faster and avoided side effects.
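To make the contrast concrete, here is a small, hypothetical Java sketch of the same aggregation written imperatively and then functionally with a pure function plus map/reduce; the discount rule and values are invented for illustration, but the functional shape is the one that transfers directly to MapReduce-style processing.

```java
import java.util.List;

public class PureFunctionDemo {
    // A pure function: its result depends only on its input and it has no side effects.
    static double applyDiscount(double amount) {
        return amount > 100.0 ? amount * 0.9 : amount;
    }

    public static void main(String[] args) {
        List<Double> orders = List.of(50.0, 120.0, 300.0);

        // Imperative style: a mutable accumulator updated in a loop.
        double totalImperative = 0.0;
        for (double order : orders) {
            totalImperative += applyDiscount(order);
        }

        // Functional style: map then reduce over immutable data, the same shape as a MapReduce job.
        double totalFunctional = orders.stream()
                .map(PureFunctionDemo::applyDiscount)
                .reduce(0.0, Double::sum);

        System.out.println(totalImperative + " == " + totalFunctional);
    }
}
```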
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN - DataWorks Summit
DeathStar is a system that runs HBase on YARN to provide easy, dynamic multi-tenant HBase clusters via YARN. It allows different applications to run HBase in separate application-specific clusters on a shared HDFS and YARN infrastructure. This provides strict isolation between applications and enables dynamic scaling of clusters as needed. Some key benefits are improved cluster utilization, easier capacity planning and configuration, and the ability to start new clusters on demand without lengthy provisioning times.
Spark is an open-source software framework for rapid calculations on in-memory datasets. It uses Resilient Distributed Datasets (RDDs) that can be recreated if lost and supports transformations and actions on RDDs. Spark is useful for batch, interactive, and real-time processing across various problem domains like SQL, streaming, and machine learning via MLlib.
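Here is a minimal, hypothetical word-count-style sketch of those ideas in Java (Spark 2.x-style API): transformations such as flatMap and filter are lazy, and nothing executes until an action such as count is called.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasicsDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("spark is fast", "spark keeps data in memory"));

            // Transformations are lazy: they only describe the lineage used to rebuild lost partitions.
            JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
            JavaRDD<String> sparkWords = words.filter(w -> w.equals("spark"));

            // count() is an action: it triggers execution and returns a result to the driver.
            System.out.println("occurrences of 'spark': " + sparkWords.count());
        }
    }
}
```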
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra... - Edureka!
This Edureka "What is Spark" tutorial will introduce you to big data analytics framework - Apache Spark. This tutorial is ideal for both beginners as well as professionals who want to learn or brush up their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Analytics
2) What is Apache Spark?
3) Why Apache Spark?
4) Using Spark with Hadoop
5) Apache Spark Features
6) Apache Spark Architecture
7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX
8) Demo: Analyze Flight Data Using Apache Spark
It’s no longer a world of just relational databases. Companies are increasingly adopting specialized datastores such as Hadoop, HBase, MongoDB, Elasticsearch, Solr and S3. Apache Drill, an open source, in-memory, columnar SQL execution engine, enables interactive SQL queries against more datastores.
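As a rough sketch of what that looks like from an application, the snippet below queries a JSON file in place through Drill's JDBC driver; the ZooKeeper address, file path, and field names are placeholders, and the Drill JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a Drill cluster through its ZooKeeper quorum (placeholder address).
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=localhost:2181");
             Statement stmt = conn.createStatement();
             // Query raw JSON in place; no schema definition or ETL step is required first.
             ResultSet rs = stmt.executeQuery(
                     "SELECT t.user_id, COUNT(*) AS events "
                     + "FROM dfs.`/data/events.json` t GROUP BY t.user_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + " -> " + rs.getLong("events"));
            }
        }
    }
}
```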
Applied Deep Learning with Spark and Deeplearning4j - DataWorks Summit
This document discusses deep learning and DL4J. It begins with an overview of deep learning, describing it as automated feature engineering through chained techniques like restricted Boltzmann machines. It then introduces DL4J, describing it as an enterprise-grade deep learning library for Java, Scala, and Python that supports parallelization on YARN and Spark as well as GPUs. The rest of the document discusses using DL4J with Spark for deep learning workflows on large datasets and provides an example of using the DL4J tool suite to perform vectorization, training, and evaluation on the Iris dataset.
Introduction to Pig | Pig Architecture | Pig Fundamentals - Skillspeed
This Hadoop Pig tutorial will unravel Pig Programming, Pig Commands, Pig Fundamentals, Grunt Mode, Script Mode & Embedded Mode.
At the end, you'll have a strong knowledge regarding Hadoop Pig Basics.
PPT Agenda:
✓ Introduction to BIG Data & Hadoop
✓ What is Pig?
✓ Pig Data Flows
✓ Pig Programming
----------
What is Pig?
Pig is an open source data flow language that expresses data management operations as simple scripts written in Pig Latin. Pig works closely with MapReduce.
----------
Applications of Pig
1. Data Cleansing
2. Data Transfers via HDFS
3. Data Factory Operations
4. Predictive Modelling
5. Business Intelligence
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
This document discusses Hivemall, an open source machine learning library for Apache Hive, Spark, and Pig.
Hivemall is a scalable machine learning library built as a collection of Hive UDFs that allows users to perform machine learning tasks like classification, regression, and recommendation using SQL queries. Hivemall supports many popular machine learning algorithms and can run in parallel on large datasets using Apache Spark, Hive, Pig, and other big data frameworks. The document outlines how to run a machine learning workflow with Hivemall on Spark, including loading data, building a model, and making predictions.
Hadoop clusters are operated on an ephemeral basis in the cloud by Qubole, processing over 300 petabytes of data per month across over 100 customers. Qubole addresses challenges of ephemeral clusters through auto-scaling of resources using YARN, optimizing performance for cloud storage, and storing job history remotely. Volatile low-cost nodes are leveraged through policies that ensure data replication despite potential node failures.
Impala is a massively parallel processing SQL query engine for Hadoop. It allows users to issue SQL queries directly to their data in Apache Hadoop. Impala uses a distributed architecture where queries are executed in parallel across nodes by Impala daemons. It uses a new execution engine written in C++ with runtime code generation for high performance. Impala also supports commonly used Hadoop file formats and can query data stored in HDFS and HBase.
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters - DataWorks Summit
This document discusses the Cloud and Information Services Lab (CISL) and its vision of having one cluster that can run all workloads. It describes CISL's research into improving resource management in shared clusters through projects like Mercury, which aims to improve cluster utilization by opportunistically using otherwise idle resources. The document outlines Mercury's architecture and how it implements a hybrid of centralized and distributed scheduling to better schedule tasks with short durations.
Transitioning Compute Models: Hadoop MapReduce to Spark - Slim Baltagi
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... - Edureka!
This Edureka Spark Tutorial will help you to understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners as well as professionals who want to learn or brush up Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks - DataWorks Summit
This document discusses using Apache Drill and business intelligence (BI) tools to analyze network data stored in Hadoop. It provides examples of querying network packet captures and APIs directly using SQL without needing to transform or structure the data first. This allows gaining insights into issues like dropped sensor readings by analyzing packets alongside other data sources. The document concludes that SQL-on-Hadoop technologies allow network analysis to be done in a BI context more quickly than traditional specialized tools.
Performance tuning your Hadoop/Spark clusters to use cloud storage - DataWorks Summit
Remote storage provides the ability to separate compute and storage, which ushers in a new world of infinitely scalable and cost-effective storage. Remote storage in the cloud built to the HDFS standard has unique features that make it a great choice for storing and analyzing petabytes of data at a time. Customers can have unlimited storage capacity without any limit to the number or size of the files. With such scale, superior I/O performance becomes an increasingly important consideration when performing analysis on this data. For all workloads, remote storage in the cloud can provide excellent performance when the various knobs are tuned correctly...
Speaker
Stephen Wu, Senior Program Manager, Microsoft
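The talk above covers cloud-storage tuning in general; as one concrete, hedged illustration, the sketch below sets a few of the S3A connector knobs exposed by Hadoop's hadoop-aws module from Java. The bucket name and values are placeholders rather than recommendations, and other connectors (ADLS, WASB, GCS) have their own analogous settings.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloudStorageTuningDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative starting points for highly parallel jobs reading from object storage.
        conf.set("fs.s3a.connection.maximum", "200");      // allow more concurrent connections
        conf.set("fs.s3a.threads.max", "64");              // larger upload/IO thread pool
        conf.set("fs.s3a.multipart.size", "134217728");    // 128 MB multipart upload chunks

        // Bucket and path are placeholders.
        try (FileSystem fs = FileSystem.get(new URI("s3a://my-analytics-bucket/"), conf)) {
            for (FileStatus status : fs.listStatus(new Path("s3a://my-analytics-bucket/data/"))) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }
}
```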
Sanjay Radia presents on evolving HDFS to support a generalized storage subsystem. HDFS currently scales well to large clusters and storage sizes but faces challenges with small files and blocks. The solution is to (1) only keep part of the namespace in memory to scale beyond memory limits and (2) use block containers of 2-16GB to reduce block metadata and improve scaling. This will generalize the storage layer to support containers for multiple use cases beyond HDFS blocks.
Hortonworks Big Data Career Paths and Training - Aengus Rooney
Hortonworks provides training and resources for working with big data and Apache Hadoop. It employs many of the committers to the Apache Hadoop project and influences the project's roadmap. Hortonworks nurtures the open source community through resources like community forums, documentation, and a large partner network. It offers full lifecycle support for customers through subscriptions, consulting, training programs, and certifications.
Hadoop is a distributed processing framework for large datasets. It utilizes HDFS for storage and MapReduce as its programming model. The Hadoop ecosystem has expanded to include many other tools. YARN was developed to address limitations in the original Hadoop architecture. It provides a common platform for various data processing engines like MapReduce, Spark, and Storm. YARN improves scalability, utilization, and supports multiple workloads by decoupling cluster resource management from application logic. It allows different applications to leverage shared Hadoop cluster resources.
This document provides an overview of big data and the Spark framework. It discusses the big data ecosystem, including file systems, data ingestion tools, batch and real-time data processing frameworks, visualization tools, and support technologies. It outlines common big data job roles and their associated skills. The document then focuses on Spark, describing its core functionality, modules like DataFrames and MLlib, and execution modes. It provides guidance on learning Spark, emphasizing programming skills and Spark APIs. A demo of Spark fundamentals on a big data lab is also proposed.
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
End-to-End Deep Learning with Horovod on Apache Spark - Databricks
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training.
These are slides from a lecture given at the UC Berkeley School of Information for the Analyzing Big Data with Twitter class. A video of the talk can be found at http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/
This document provides an overview of Cloudera's "Data Analyst Training: Using Pig, Hive, and Impala with Hadoop" course. The course teaches data analysts how to use Pig, Hive, and Impala for large-scale data analysis on Hadoop. It covers loading and analyzing data with these tools, choosing the best tool for different jobs, and includes hands-on exercises. The target audience is data analysts and others interested in using Pig, Hive and Impala for big data analytics.
Deploying Enterprise-grade Security for Hadoop - Cloudera, Inc.
Deploying enterprise-grade security for Hadoop, or: six security problems with Apache Hive. In this talk we discuss the security problems with Hive and then secure Hive with Apache Sentry. Additional topics include Hadoop security and Role-Based Access Control (RBAC).
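For flavour, here is a minimal, hypothetical sketch of the role-based access control side: Sentry policies for Hive are managed with GRANT/REVOKE statements issued through HiveServer2 (here via JDBC). The host, database, group, and role names are placeholders, and a Sentry-enabled cluster plus the Hive JDBC driver on the classpath are assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SentryRbacDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 endpoint; must be run as a user allowed to administer roles.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver2.example.com:10000/default", "admin", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE ROLE analyst_role");
            stmt.execute("GRANT ROLE analyst_role TO GROUP analysts");
            // Grant read-only access at the database level instead of to individual users.
            stmt.execute("GRANT SELECT ON DATABASE sales TO ROLE analyst_role");
        }
    }
}
```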
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX - BMC Software
Learn how CARFAX utilized the power of Control-M to help drive big data processing via Cloudera. See why it was a no-brainer to choose Control-M to help manage workflows through Hadoop, some of the challenges faced, and the benefits the business received by using an existing, enterprise-wide workload management system instead of choosing “yet another tool.”
Introduction to Apache Spark Developer Training - Cloudera, Inc.
Apache Spark is a next-generation processing engine optimized for speed, ease of use, and advanced analytics well beyond batch. The Spark framework supports streaming data and complex, iterative algorithms, enabling applications to run 100x faster than traditional MapReduce programs. With Spark, developers can write sophisticated parallel applications for faster business decisions and better user outcomes, applied to a wide variety of architectures and industries.
Learn what Apache Spark is and how it compares to Hadoop MapReduce; how to filter, map, reduce, and save Resilient Distributed Datasets (RDDs); who is best suited to attend the course and what prior knowledge you should have; and the benefits of building Spark applications as part of an enterprise data hub.
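A minimal, hypothetical Java sketch of that filter / map / reduce / save flow is shown below; the input and output paths are placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddPipelineDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-pipeline-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("input/access.log");               // placeholder input path

            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")); // keep only error lines
            JavaRDD<Integer> lengths = errors.map(String::length);                 // transform each element
            int totalChars = lengths.reduce(Integer::sum);                         // action: aggregate on the driver

            errors.saveAsTextFile("output/errors");                                // action: write results out
            System.out.println("total characters in error lines: " + totalChars);
        }
    }
}
```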
Introduction to data processing using Hadoop and Pig - Ricardo Varela
In this talk we give an introduction to data processing with big data and review the basic concepts of MapReduce programming with Hadoop. We also discuss the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized on the first Tuesday of each month by YDN in London.
Hadoop, Pig, and Twitter (NoSQL East 2009) - Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
Cloudera Impala: A Modern SQL Engine for Hadoop - Cloudera, Inc.
Cloudera Impala is a modern SQL query engine for Apache Hadoop that provides high performance for both analytical and transactional workloads. It runs directly within Hadoop clusters, reading common Hadoop file formats and communicating with Hadoop storage systems. Impala uses a C++ implementation and runtime code generation for high performance compared to other Hadoop SQL query engines like Hive that use Java and MapReduce.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
The document discusses big data and distributed computing. It provides examples of the large amounts of data generated daily by organizations like the New York Stock Exchange and Facebook. It explains how distributed computing frameworks like Hadoop use multiple computers connected via a network to process large datasets in parallel. Hadoop's MapReduce programming model and HDFS distributed file system allow users to write distributed applications that process petabytes of data across commodity hardware clusters.
As part of the recent release of Hadoop 2 by the Apache Software Foundation, YARN and MapReduce 2 deliver significant upgrades to scheduling, resource management, and execution in Hadoop.
At their core, YARN and MapReduce 2’s improvements separate cluster resource management capabilities from MapReduce-specific logic. YARN enables Hadoop to share resources dynamically between multiple parallel processing frameworks such as Cloudera Impala, allows more sensible and finer-grained resource configuration for better cluster utilization, and scales Hadoop to accommodate more and larger jobs.
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
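To ground the HDFS side, here is a minimal, hypothetical sketch of a client writing a file and listing a directory through the FileSystem API; the NameNode address and paths are placeholders. The client only talks to the NameNode for metadata, while the data itself is streamed to and replicated across DataNodes in blocks.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode URI

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");
            // The NameNode records the file's metadata; the bytes land in replicated blocks on DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }
}
```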
The document describes how to use Gawk to perform data aggregation from log files on Hadoop by having Gawk act as both the mapper and reducer to incrementally count user actions and output the results. Specific user actions are matched and counted using operations like incrby and hincrby and the results are grouped by user ID and output to be consumed by another system. Gawk is able to perform the entire MapReduce job internally without requiring Hadoop.
Big-Data Hadoop Training Institutes in Pune | CloudEra Certification courses ... - mindscriptsseo
MindScripts is the best Big-Data Hadoop Training Institute/Center in Pune, providing complete courses including Cloudera, Hortonworks, HDFS, MapReduce, Pig, Hive, Sqoop, and ZooKeeper. The course is designed with the Cloudera certification syllabus in mind.
Datascience Training with Hadoop, Python Machine Learning & Scala, Spark - SequelGate
Hadoop Data Science Training
Microsoft consulted data scientists and the companies that employ them to identify the core skills they need to be successful. This informed the curriculum used to teach key functional and technical skills, combining highly rated online courses with hands-on labs, concluding in a final capstone project.
Build your operator with the right tool - Rafał Leszko
The document discusses different tools that can be used to build Kubernetes operators, including the Operator SDK, Helm, Ansible, Go, and operator frameworks like KOPF. It provides an overview of how each tool can be used to generate the scaffolding and implement the logic for a sample Hazelcast operator.
Build Your Kubernetes Operator with the Right Tool! - Rafał Leszko
The document discusses different tools and frameworks for building Kubernetes operators, including the Operator SDK, Helm, Ansible, Go, KOPF, Java Operator SDK, and using bare programming languages. It provides examples of creating operators using the Operator SDK with Helm, Ansible and Go plugins, and also using the KOPF Python framework. The document highlights the key steps and capabilities of each approach.
Hello everyone,
For this meetup, we have the good fortune of being hosted at Richemont's offices.
Special thanks to Cédric Georg and the Richemont team for their welcome.
For this DevOps meetup, we will have two experience reports; here is the agenda for the evening:
18:30 - Doors open
(you will need to give your first and last name, as well as your license plate number if you came by car; it's for security, and yes, we don't joke around here :-))
18:50 - Introduction by Matthieu and Cédric
19:00 - Richemont and its DevOps transformation
Richemont, in the midst of its digital transformation, had to adapt to get its development and operations teams working together using automation and communication tools.
Squads, DevOps, testing, security, Agile and Scrum: how all these terms became everyday practice at Richemont in just a few years.
We will look at how we put this in place, and at the positive and negative aspects of this transformation.
19:40 - SixSq and Docker automation on edge points (DEMO)
Edge computing is gaining in popularity to address the explosion of data produced by IoT sensors, and the need to better manage AI both in the cloud and at the edge. To address this paradigm shift, SixSq has launched two open source projects: Nuvla for managing applications, and NuvlaBox, a cloud-in-a-box edge solution.
Using these open source projects, in this session we'll demonstrate how edge computing can now be integrated to agnostically operate containerized applications on CaaS infrastructures anywhere, using a Raspberry Pi-based platform.
The UberCloud - From Project to Product - From HPC Experiment to HPC Marketpl... - Wolfgang Gentzsch
The UberCloud online marketplace for engineers and scientists to discover, try, and buy compute power on demand, in the cloud. Starting with free experiments in the cloud, including application software, cloud hardware, and expertise. Learning by doing how to use your application in the cloud.
info.theubercloud.com/case-studies-and-resources
Developer joy for distributed teams with CodeReady Workspaces | DevNation Tec... - Red Hat Developers
Natale Vinto discusses CodeReady Workspaces, a developer environment tool that runs on OpenShift. It provides containerized developer workspaces that enable coding directly within Kubernetes clusters. Key features include the Eclipse Che IDE, compatibility with VSCode extensions, and use of "devfiles" to define and standardize reproducible developer environments. CodeReady Workspaces aims to improve productivity for remote and distributed teams by reducing setup times and enabling self-service access to development environments.
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers common MapReduce concepts like HDFS, the MapReduce programming model, and Hadoop distributions. The document includes several code examples and screenshots related to Hadoop and MapReduce.
Todd Lipcon gives a presentation introducing Apache Spark. He begins with an overview of Spark, explaining that it is a general purpose computational framework that improves on MapReduce by leveraging distributed memory for better performance and providing a more developer-friendly API. Lipcon then discusses Spark's Resilient Distributed Datasets (RDDs) and its expressive transformations and actions API. He provides examples of word count programs in Java and Scala. Lipcon also highlights Spark's integration with Hadoop, built-in machine learning library MLlib, and streaming capabilities through Spark Streaming.
Kubernetes is much more than just a container orchestration platform … alongside the Cloud Native Landscape, Kubernetes is the equivalent of the Linux kernel, with an ecosystem of apps and utilities that enrich it.
This document proposes a project called Learn By Doing (LBD) to demonstrate an "Acquisition 2.0" approach to cloud computing procurement. The LBD project would involve standing up a hybrid cloud using open source software to provide infrastructure and platform services. This cloud environment would serve as an "innovation sandbox" and procurement example. The project aims to help agencies better understand cloud types and procurement while providing a working system to develop requirements and contracting documents in collaboration with stakeholders.
GraphQL can be one of the best ways to make your product development more fun and productive. In this presentation I talk about how GraphQL makes your life simpler, and how to write and deploy a GraphQL API with Apollo Server 2.0 and serverless deployment via Netlify Functions.
This document provides an agenda and overview for a webinar on Kubernetes. The agenda includes an introduction to Kabisa, an introduction to Kubernetes concepts, and a hands-on Kubernetes workshop. Kabisa is introduced as a software development agency specialized in custom web and mobile app development with over 14 years of experience. Key Kubernetes concepts are then summarized, including clusters, nodes, pods, namespaces, replica sets, load balancers, and deployments. Finally, the hands-on workshop is outlined which will have participants claim a Kubernetes cluster and complete tasks like creating pods, services, and using deployments, environment variables, secrets, and config maps.
This document discusses Apigility-powered RESTful APIs on IBM i systems. It covers API concepts, installing Apigility, creating RESTful web services, using the Apigility toolkit, and error handling. The presentation discusses installing Apigility locally or remotely, designing URI patterns, using the admin interface to create services, adding database and toolkit services, and calling the toolkit from PHP, CL, and RPG code. It also provides tips on best practices like abstracting toolkit calls and using commands and queries.
Kubernetes has become the defacto standard as a platform for container orchestration. Its ease of extending and many integrations has paved the way for a wide variety of data science and research tooling to be built on top of it.
From all encompassing tools like Kubeflow that make it easy for researchers to build end-to-end Machine Learning pipelines to specific orchestration of analytics engines such as Spark; Kubernetes has made the deployment and management of these things easy. This presentation will showcase some of the larger research tools in the ecosystem and go into how Kubernetes has enabled this easy form of application management.
Hybrid Cloud, Kubeflow and Tensorflow Extended [TFX] - Animesh Singh
Kubeflow Pipelines and TensorFlow Extended (TFX) together form an end-to-end platform for deploying production ML pipelines. It provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system. In this talk we describe how to run TFX in hybrid cloud environments.
K8s in 3h - Kubernetes Fundamentals Training - Piotr Perzyna
Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. This training helps you understand key concepts within 3 hours.
[Srijan Wednesday Webinars] How to Build a Cloud Native Platform for Enterpri... - Srijan Technologies
Drupal has been a consistent leader in the Gartner Magic Quadrant for Web Content Management. However, enterprises leveraging Drupal have traditionally relied on PaaS providers for their hosting, scaling and lifecycle management. And that usually leads to enterprise applications being locked-in with a particular cloud or vendor.
As container and container orchestration technologies disrupt the cloud and platform landscape, there’s a clear way to avoid this state of affairs. In this webinar, we discuss why it's important to build a cloud-native Drupal platform, and exactly how to do that.
Join the webinar to understand how you can avoid vendor lock-in, and create a secure platform to manage, operate and scale your Drupal applications in a multi-cloud portable manner.
Key Takeaways:
- Why you need a cloud-native Drupal platform and how to build one
- How to craft an idiomatic development workflow
- Understanding infrastructure and cloud engineering - under the hood
- Demystifying the art and science of Docker and Kubernetes: deep dive into scaling the LAMP stack
- Exploring cost optimization and cloud governance
- Understand portability of applications
- A hands-on demo of how the platform works
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
Cloudera Data Impact Awards 2021 - Finalists - Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards Finalists - Cloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19 - Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 - Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 - Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 - Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 - Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 - Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 - Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 - Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the Platform - Cloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond it. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18 - Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360 - Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18 - Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18 - Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Introduction to Hadoop Developer Training Webinar
1. An Introduction to Cloudera’s Hadoop Developer Training Course
Ian Wrigley, Curriculum Manager
2. Welcome to the Webinar!
All lines are muted
Q & A after the presentation
Ask questions at any time by typing them in the WebEx panel
A recording of this Webinar will be available on demand at cloudera.com
3. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
4. Cloudera’s Training is the Industry Standard
Big Data professionals from 55% of the Fortune 100 have attended live Cloudera training
Cloudera has trained employees from 100% of the top 20 global technology firms to use Hadoop
Cloudera has trained over 15,000 students
5. Cloudera Training: The Benefits
1 Broadest Range of Courses – Cover all the key Hadoop components
2 Most Experienced Instructors – Over 15,000 students trained since 2009
3 Leader in Certification – Over 5,000 accredited Cloudera professionals
4 State of the Art Curriculum – Classes updated regularly as Hadoop evolves
5 Widest Geographic Coverage – Most classes offered: 20 countries plus virtual classroom
6 Most Relevant Platform & Community – CDH deployed more than all other distributions combined
7 Depth of Training Material – Hands-on labs and VMs support live instruction
8 Ongoing Learning – Video tutorials and e-learning complement training
6. The professionalism and expansive technical knowledge of our classroom instructor was incredible. The quality of the training was on par with a university.
7. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
8. Common Attendee Profiles
Software Developers/Engineers
Business analysts
IT managers
Hadoop system administrators
9. Course Prerequisites
Programming experience
Knowledge of Java highly recommended
Understanding of common computer science principles is helpful
Prior knowledge of Hadoop is not required
10. Who Should Not Attend?
If you have no programming experience, you’re likely to find the course very difficult
You might consider our Hive and Pig training course instead
If you will be focused solely on configuring and managing your cluster, our Administrator training course would probably be a better alternative
11. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
12. Developer Training: Overview
The course assumes no pre-existing knowledge of Hadoop
Starts by discussing the motivation for Hadoop
What problems exist that are difficult (or impossible) to solve with existing systems
Explains basic Hadoop concepts
The Hadoop Distributed File System (HDFS)
MapReduce
Introduces the Hadoop API (Application Programming Interface)
13. Developer Training: Overview (cont’d)
Moves on to discuss more complex Hadoop concepts
Custom Partitioners
Custom Writables and WritableComparables
Custom InputFormats and OutputFormats
Investigates common MapReduce algorithms
Sorting, searching, indexing, joining data sets, etc.
Then covers the Hadoop ‘ecosystem’
Hive, Pig, Sqoop, Flume, Mahout, Oozie
15. Hands-On Exercises
The course features many Hands-On Exercises
Analyzing log files
Unit-testing Hadoop code
Writing and implementing Combiners
Writing custom Partitioners
Using SequenceFiles and file compression
Creating an inverted index
Creating custom WritableComparables
Importing data with Sqoop
Writing Hive queries
…and more
16. Certification
Our Developer course is good preparation for the Cloudera Certified Developer for Apache Hadoop (CCDH) exam
A voucher for one attempt at the exam is currently included in the course fee
17. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
23. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
30. Topics
Why Cloudera Training?
Who Should Attend Developer Training?
Developer Course Contents
A Deeper Dive: The New API vs The Old API
A Deeper Dive: Determining the Optimal Number of Reducers
Conclusion
31. Conclusion
Cloudera’s Developer training course is:
Technical
Hands-on
Interactive
Comprehensive
Attendees leave the course with the skillset required to write, test, and run Hadoop jobs
The course is good preparation for the CCDH certification exam
32. Questions?
For more information on Cloudera’s training courses, or to book a place on an upcoming course: http://university.cloudera.com
My e-mail address: ian@cloudera.com
Feel free to ask questions!
Hit the Q&A button, and type away
Editor's Notes
This topic is discussed in further detail in TDG 3e on pages 27-30 (TDG 2e, 25-27). NOTE: The New API / Old API is completely unrelated to MRv1 (MapReduce in CDH3 and earlier) / MRv2 (next-generation MapReduce, also called YARN, which will be available along with MRv1 starting in CDH4). Instructors are advised to avoid confusion by not mentioning MRv2 during this section of class, and if asked about it, to simply say that it’s unrelated to the old/new API and defer further discussion until later.
On this slide, you should point out the similarities as well as the differences between the two APIs. You should emphasize that they are both doing the same thing and that there are just a few differences in how they go about it.
You can tell whether a class belongs to the “Old API” or the “New API” based on the package name. The old API contains “mapred” while the new API contains “mapreduce” instead. This is the most important thing to keep in mind, because some classes/interfaces have the same name in both APIs. Consequently, when you are writing your import statements (or generating them with the IDE), you will want to be cautious and use the one that corresponds to whichever API you are using to write your code.
The functions of the OutputCollector and Reporter objects have been consolidated into a single Context object. For this reason, the new API is sometimes called the “Context Objects” API (TDG 3e, page 27 or TDG 2e, page 25).
NOTE: The “Keytype” and “Valuetype” shown in the map method signature aren’t actual classes defined in the Hadoop API. They are just placeholders for whatever types you use for key and value (e.g. IntWritable and Text). Also, the generics for the keys and values are not shown in the class definition for the sake of brevity, but they are used in the new API just as they are in the old API.
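For reference, here is a minimal sketch of the same word-count-style Mapper written against each API, as two separate source files. The class names and the simple whitespace tokenization are illustrative assumptions, not code from the course materials.

// Old API (org.apache.hadoop.mapred): Mapper is an interface; output goes to an
// OutputCollector and progress/status to a Reporter.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiWordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Emit (word, 1) for every whitespace-separated token in the input line.
    for (String word : value.toString().split("\\s+")) {
      output.collect(new Text(word), new IntWritable(1));
    }
  }
}

// New API (org.apache.hadoop.mapreduce): Mapper is a class, and the roles of
// OutputCollector and Reporter are consolidated into a single Context object.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiWordMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Same logic as above, but output goes through the Context object.
    for (String word : value.toString().split("\\s+")) {
      context.write(new Text(word), new IntWritable(1));
    }
  }
}

Note how the import statements (mapred vs. mapreduce) are the quickest way to confirm which API a class belongs to.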
An example of maintaining sorted order globally across all reducers was given earlier in the course when Partitioners were introduced.
NOTE: worker nodes are configured to reserve a portion (typically 20% - 30%) of their available disk space for storing intermediate data. If too many Mappers are feeding into too few reducers, you can produce more data than the reducer(s) could store. That’s a problem.
At any rate, having all your mappers feeding into a single reducer (or just a few reducers) isn’t spreading the work efficiently across the cluster.
Use of the TotalOrderPartitioner is described in detail on pages 274-277 of TDG 3e (TDG 2e, 237-241). It is essentially based on sampling your keyspace so you can divide it up efficiently among several reducers, based on the global sort order of those keys.
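A minimal driver fragment along these lines might look like the sketch below. The class name, partition-file path, reducer count, and sampling parameters are assumptions for illustration (not values from the course or from TDG), and it assumes a job whose input and map output keys are both Text so the sampled input keys match the partitioner’s key type.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSetup {
  // Configure an already-created Job (input/output formats and Text keys set
  // elsewhere) to produce globally sorted output across several reducers.
  public static void configure(Job job) throws Exception {
    job.setNumReduceTasks(4);                              // illustrative reducer count
    job.setPartitionerClass(TotalOrderPartitioner.class);

    // The partition file stores the sampled key-range boundaries.
    Path partitionFile = new Path("/tmp/partitions");      // assumed path
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);

    // Sample roughly 1% of the keys (up to 1,000 samples from at most 10 splits)
    // so the boundaries spread keys evenly across the reducers.
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<>(0.01, 1000, 10);
    InputSampler.writePartitionFile(job, sampler);
  }
}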
But beware that this can be a naïve approach. If sales data is processed this way, business-to-business operations (like plumbing supply warehouses) would likely have little or no data for the weekend, since they will likely be closed. Conversely, a retail store in a shopping mall will likely have far more data for a Saturday than a Tuesday.
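As a hypothetical illustration of that naïve scheme, a custom Partitioner that routes records to one of seven reducers by day of week could look like this (the class name and the assumption that the map output key is a Text such as "MONDAY" are mine, not course code); the skew described above is exactly what such a partitioner produces.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DayOfWeekPartitioner extends Partitioner<Text, IntWritable> {
  private static final String[] DAYS = {
    "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"
  };

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // One reducer per day when the job is configured with 7 reducers.
    for (int i = 0; i < DAYS.length; i++) {
      if (DAYS[i].equalsIgnoreCase(key.toString())) {
        return i % numPartitions;
      }
    }
    return 0;  // unknown keys fall back to the first reducer
  }
}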
The upper bound on the number of reducers is based on your cluster (machines are configured to have a certain number of “reduce slots” based on the CPU, RAM and other performance characteristics of the machine). The general advice is to choose something a bit less than the max number of reduce slots to allow for speculative execution.
One factor in determining the reducer count is the reduce capacity the developer has access to (or the number of "reduce slots" in either the cluster or the user's pool). One technique is to make the reducer count a multiple of this capacity. If the developer has access to N slots but picks N+1 reducers, the reduce phase will go into a second "wave", and that one extra reducer can potentially double the execution time of the reduce phase. However, if the developer chooses 2N or 3N reducers, each wave takes less time; because there are more waves, you don't see a big degradation in job performance if you need an extra wave due to an extra reducer, a failed task, etc.
Suggestion: draw a picture on the whiteboard that shows reducers running in waves, showing cluster slot count, reducer execution times, etc., to tie together the explanation of performance issues as they have been explained in the last few slides:
A single reducer will run very slowly on an entire data set.
Setting the number of reducers to the available slot count can maximize parallelism in one reducer wave. However, if you have a failure, the reduce phase of the job runs into a second wave, and that will double the execution time of the reduce phase.
Setting the number of reducers to a high number will mean many waves of shorter-running reducers. This scales nicely because you don't have to be aware of the cluster size and you don't pay the cost of a second wave, but it might be more inefficient for some jobs.
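A minimal driver sketch of that technique follows; the class name, job name, and the figure of 20 reduce slots are illustrative assumptions, not values from the course.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "reducer count example");

    // Assumed capacity; check the reduce-slot count of your own cluster or pool.
    int reduceSlots = 20;
    // A multiple of the slot count gives complete waves of shorter reducers
    // instead of one long wave plus a straggler.
    job.setNumReduceTasks(2 * reduceSlots);

    // ... set mapper, reducer, input/output paths as usual, then submit:
    // System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}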