Introduction to Hadoop Developer Training Webinar
Are you new to Hadoop and need to start processing data fast and effectively? Have you been playing with CDH and are ready to move on to development supporting a technical or business use case? Are you prepared to unlock the full potential of all your data by building and deploying powerful Hadoop-based applications?

If you're wondering whether Cloudera's Developer Training for Apache Hadoop is right for you and your team, then this presentation is for you. You will learn who is best suited to attend the live training, what prior knowledge you should have, and what topics the course covers. Cloudera Curriculum Manager Ian Wrigley will discuss the skills you will attain during the course and how they will help you become a full-fledged Hadoop application developer.

During the session, Ian will also present a short portion of the actual Cloudera Developer course, discussing the difference between New and Old APIs, why there are different APIs, and which you should use when writing your MapReduce code. Following the presentation, Ian will answer your questions about this or any of Cloudera’s other training courses.

Visit the resources section of cloudera.com to view the on-demand webinar.


Speaker Notes
  • This topic is discussed in further detail in TDG 3e on pages 27-30 (TDG 2e, 25-27). NOTE: The New API / Old API distinction is completely unrelated to MRv1 (MapReduce in CDH3 and earlier) / MRv2 (next-generation MapReduce, also called YARN, which will be available along with MRv1 starting in CDH4). Instructors are advised to avoid confusion by not mentioning MRv2 during this section of class, and if asked about it, to simply say that it’s unrelated to the Old/New API and defer further discussion until later.
  • On this slide, you should point out the similarities as well as the differences between the two APIs. You should emphasize that they are both doing the same thing and that there are just a few differences in how they go about it. You can tell whether a class belongs to the “Old API” or the “New API” based on the package name: the old API uses “mapred” while the new API uses “mapreduce” instead. This is the most important thing to keep in mind, because some classes/interfaces have the same name in both APIs. Consequently, when you are writing your import statements (or generating them with the IDE), you will want to be cautious and use the one that corresponds to whichever API you are using to write your code. The functions of the OutputCollector and Reporter objects have been consolidated into a single Context object. For this reason, the new API is sometimes called the “Context Objects” API (TDG 3e, page 27; TDG 2e, page 25). NOTE: The “Keytype” and “Valuetype” shown in the map method signature aren’t actual classes defined in the Hadoop API. They are just placeholders for whatever types you use for key and value (e.g. IntWritable and Text). Also, the generics for the keys and values are not shown in the class definition for the sake of brevity, but they are used in the new API just as they are in the old API.
  • An example of maintaining sorted order globally across all Reducers was given earlier in the course when Partitioners were introduced. NOTE: worker nodes are configured to reserve a portion (typically 20% - 30%) of their available disk space for storing intermediate data. If too many Mappers feed into too few Reducers, you can produce more data than the Reducer(s) can store. That’s a problem. At any rate, having all your Mappers feed into a single Reducer (or just a few Reducers) doesn’t spread the work efficiently across the cluster.
  • Use of the TotalOrderPartitioner is described in detail on pages 274-277 of TDG 3e (TDG 2e, 237-241). It is essentially based on sampling your keyspace so you can divide it up efficiently among several reducers, based on the global sort order of those keys.
  • But beware that this can be a naïve approach. If you process sales data this way, business-to-business operations (like plumbing supply warehouses) would likely have little or no data for the weekend, since such businesses are generally closed. Conversely, a retail store in a shopping mall will likely have far more data for a Saturday than for a Tuesday.
  • The upper bound on the number of reducers is based on your cluster (machines are configured to have a certain number of “reduce slots” based on the CPU, RAM and other performance characteristics of the machine). The general advice is to choose something a bit less than the max number of reduce slots to allow for speculative execution.
  • One factor in determining the reducer count is the reduce capacity the developer has access to (or the number of "reduce slots" in either the cluster or the user's pool). One technique is to make the reducer count a multiple of this capacity. If the developer has access to N slots but picks N+1 reducers, the reduce phase will go into a second "wave", and that one extra reducer can potentially double the execution time of the reduce phase. However, if the developer chooses 2N or 3N reducers, each wave takes less time but there are more waves, so you don't see a big degradation in job performance if you need an extra wave due to an extra reducer, a failed task, etc. (see the sketch after these notes). Suggestion: draw a picture on the whiteboard that shows reducers running in waves, showing cluster slot count, reducer execution times, etc., to tie together the explanation of performance issues as they have been explained in the last few slides: 1 reducer will run very slowly on an entire data set. Setting the number of reducers to the available slot count maximizes parallelism in one reducer wave, but if you have a failure then the reduce phase runs into a second wave, which doubles its execution time. Setting the number of reducers to a high number means many waves of shorter-running reducers; this scales nicely because you don't have to be aware of the cluster size and you don't pay the full cost of a second wave, but it might be less efficient for some jobs.
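
A minimal back-of-the-envelope sketch of the reducer-wave arithmetic described in these notes (all numbers are assumed for illustration; real slot counts come from the cluster configuration, not from code):

    public class ReducerWaves {
        public static void main(String[] args) {
            int reduceSlots = 100;          // assumed cluster reduce capacity (N)
            double totalWorkMinutes = 1000; // assumed total reduce work, spread across tasks

            for (int reducers : new int[] {100, 101, 300}) {
                // Each wave runs at most reduceSlots Reducers in parallel
                int waves = (int) Math.ceil((double) reducers / reduceSlots);
                double minutesPerTask = totalWorkMinutes / reducers;
                System.out.printf("%d reducers -> %d wave(s), ~%.0f minutes%n",
                        reducers, waves, waves * minutesPerTask);
            }
            // 100 reducers -> 1 wave,  ~10 minutes (N: maximum parallelism)
            // 101 reducers -> 2 waves, ~20 minutes (N+1: one straggler doubles the phase)
            // 300 reducers -> 3 waves, ~10 minutes (3N: shorter tasks, extra waves cost little)
        }
    }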

Introduction to Hadoop Developer Training Webinar: Presentation Transcript

  • An Introduction to Cloudera’s Hadoop Developer Training Course  Ian Wrigley, Curriculum Manager
  • Welcome to the Webinar!  All lines are muted  Q & A after the presentation  Ask questions at any time by typing them in the WebEx panel  A recording of this Webinar will be available on demand at cloudera.com
  • Topics  Why Cloudera Training?  Who Should Attend Developer Training?  Developer Course Contents  A Deeper Dive: The New API vs The Old API  A Deeper Dive: Determining the Optimal Number of Reducers  Conclusion
  • Cloudera’s Training is the Industry Standard  Big Data professionals from 55% of the Fortune 100 have attended live Cloudera training  Cloudera has trained employees from 100% of the top 20 global technology firms to use Hadoop  Cloudera has trained over 15,000 students
  • Cloudera Training: The Benefits  1 Broadest Range of Courses: Cover all the key Hadoop components  2 Most Experienced Instructors: Over 15,000 students trained since 2009  3 Leader in Certification: Over 5,000 accredited Cloudera professionals  4 State of the Art Curriculum: Classes updated regularly as Hadoop evolves  5 Widest Geographic Coverage: Most classes offered: 20 countries plus virtual classroom  6 Most Relevant Platform & Community: CDH deployed more than all other distributions combined  7 Depth of Training Material: Hands-on labs and VMs support live instruction  8 Ongoing Learning: Video tutorials and e-learning complement training
  • “The professionalism and expansive technical knowledge of our classroom instructor was incredible. The quality of the training was on par with a university.”
  • Topics  Why Cloudera Training?  Who Should Attend Developer Training?  Developer Course Contents  A Deeper Dive: The New API vs The Old API  A Deeper Dive: Determining the Optimal Number of Reducers  Conclusion
  • Common Attendee Profiles  Software Developers/Engineers  Business analysts  IT managers  Hadoop system administrators
  • Course Pre-Requisites  Programming experience  Knowledge of Java highly recommended  Understanding of common computer science principles is helpful  Prior knowledge of Hadoop is not required
  • Who Should Not Attend?  If you have no programming experience, you’re likely to find the course very difficult  You might consider our Hive and Pig training course instead  If you will be focused solely on configuring and managing your cluster, our Administrator training course would probably be a better alternative
  • Topics  Why Cloudera Training?  Who Should Attend Developer Training?  Developer Course Contents  A Deeper Dive: The New API vs The Old API  A Deeper Dive: Determining the Optimal Number of Reducers  Conclusion
  • Developer Training: Overview  The course assumes no pre-existing knowledge of Hadoop  Starts by discussing the motivation for Hadoop  What problems exist that are difficult (or impossible) to solve with existing systems  Explains basic Hadoop concepts  The Hadoop Distributed File System (HDFS)  MapReduce  Introduces the Hadoop API (Application Programming Interface)
  • Developer Training: Overview (cont’d)  Moves on to discuss more complex Hadoop concepts  Custom Partitioners  Custom Writables and WritableComparables  Custom InputFormats and OutputFormats  Investigates common MapReduce algorithms  Sorting, searching, indexing, joining data sets, etc.  Then covers the Hadoop ‘ecosystem’  Hive, Pig, Sqoop, Flume, Mahout, Oozie
  • Course Contents
  • Hands-On Exercises  The course features many Hands-On Exercises  Analyzing log files  Unit-testing Hadoop code  Writing and implementing Combiners  Writing custom Partitioners  Using SequenceFiles and file compression  Creating an inverted index  Creating custom WritableComparables  Importing data with Sqoop  Writing Hive queries  …and more
  • Certification  Our Developer course is good preparation for the Cloudera Certified Developer for Apache Hadoop (CCDH) exam  A voucher for one attempt at the exam is currently included in the course fee
  • Topics  Why Cloudera Training?  Who Should Attend Developer Training?  Developer Course Contents  A Deeper Dive: The New API vs The Old API  A Deeper Dive: Determining the Optimal Number of Reducers  Conclusion
  • Chapter Topics: Writing a MapReduce Program (Basic Programming with the Hadoop Core API)  The MapReduce flow  Basic MapReduce API concepts  Writing MapReduce applications in Java – The driver – The Mapper – The Reducer  Writing Mappers and Reducers in other languages with the Streaming API  Speeding up Hadoop development by using Eclipse  Hands-On Exercise: Writing a MapReduce Program  Differences between the Old and New MapReduce APIs  Conclusion
  • What Is The Old API?  When Hadoop 0.20 was released, a ‘New API’ was introduced – Designed to make the API easier to evolve in the future – Favors abstract classes over interfaces  Some developers still use the Old API – Until CDH4, the New API was not absolutely feature-complete  All the code examples in this course use the New API – Old API-based solutions for many of the Hands-On Exercises for this course are available in the sample_solutions_oldapi directory
  • New API vs. Old API: Some Key Differences

    New API:

        import org.apache.hadoop.mapreduce.*

        // Driver code
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(Driver.class);
        job.setSomeProperty(...);
        ...
        job.waitForCompletion(true);

        // Mapper
        public class MyMapper extends Mapper {
          public void map(Keytype k, Valuetype v, Context c) {
            ...
            c.write(key, val);
          }
        }

    Old API:

        import org.apache.hadoop.mapred.*

        // Driver code
        JobConf conf = new JobConf(Driver.class);
        conf.setSomeProperty(...);
        ...
        JobClient.runJob(conf);

        // Mapper
        public class MyMapper extends MapReduceBase implements Mapper {
          public void map(Keytype k, Valuetype v, OutputCollector o, Reporter r) {
            ...
            o.collect(key, val);
          }
        }
  • New API vs. Old API: Some Key Differences (cont’d)

    New API Reducer:

        public class MyReducer extends Reducer {
          public void reduce(Keytype k, Iterable<Valuetype> v, Context c) {
            for (Valuetype eachval : v) {
              // process eachval
              c.write(key, val);
            }
          }
        }

    Old API Reducer:

        public class MyReducer extends MapReduceBase implements Reducer {
          public void reduce(Keytype k, Iterator<Valuetype> v,
                             OutputCollector o, Reporter r) {
            while (v.hasNext()) {
              // process v.next()
              o.collect(key, val);
            }
          }
        }

    Also: setup(Context c) replaces configure(JobConf job), and cleanup(Context c) replaces close() (see later).
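To tie the fragments above together, here is a complete, minimal New API example: the canonical word count, written out for this transcript rather than taken from the course materials (class names and structure are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class WordMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE); // Context replaces OutputCollector/Reporter
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) { // Iterable, not the Old API's Iterator
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration()); // new Job(conf), as on the slides
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }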
  • MRv1 vs MRv2, Old API vs New API  There is a lot of confusion about the New and Old APIs, and MapReduce version 1 and MapReduce version 2  The chart below should clarify what is available with each version of MapReduce

                       Old API    New API
        MapReduce v1      ✔          ✔
        MapReduce v2      ✔          ✔

     Summary: Code using either the Old API or the New API will run under MRv1 and MRv2 – You will have to recompile the code to move from MRv1 to MRv2, but you will not have to change the code itself
  • Topics  Why Cloudera Training?  Who Should Attend Developer Training?  Developer Course Contents  A Deeper Dive: The New API vs The Old API  A Deeper Dive: Determining the Optimal Number of Reducers  Conclusion
  • Chapter Topics: Practical Development Tips and Techniques (Basic Programming with the Hadoop Core API)  Strategies for debugging MapReduce code  Testing MapReduce code locally using LocalJobRunner  Writing and viewing log files  Retrieving job information with Counters  Determining the optimal number of Reducers for a job  Reusing objects  Creating Map-only MapReduce jobs  Hands-On Exercise: Using Counters and a Map-Only Job  Conclusion
  • How Many Reducers Do You Need?  An important consideration when creating your job is determining the number of Reducers to specify  The default is a single Reducer  With a single Reducer, one task receives all keys in sorted order – This is sometimes advantageous if the output must be in completely sorted order – Can cause significant problems if there is a large amount of intermediate data – The node on which the Reducer is running may not have enough disk space to hold all the intermediate data – The Reducer will take a long time to run
  • Jobs Which Require a Single Reducer  If a job needs to output a file where all keys are listed in sorted order, a single Reducer must be used  Alternatively, the TotalOrderPartitioner can be used – Uses an externally generated file which contains information about intermediate key distribution – Partitions data such that all keys which go to the first Reducer are smaller than any which go to the second, etc. – In this way, multiple Reducers can be used – Concatenating the Reducers’ output files results in a totally ordered list
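As a rough sketch of how those pieces fit together (assumptions: New API package locations as in CDH4-era Hadoop, SequenceFile input with Text keys, identity map/reduce, and illustrative paths and sampling parameters; this is not code from the course):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalSortDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf);
            job.setJarByClass(TotalSortDriver.class);
            job.setInputFormatClass(SequenceFileInputFormat.class); // sampler reads keys via the InputFormat
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(4);                               // assumed reducer count
            job.setPartitionerClass(TotalOrderPartitioner.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // The externally generated file of boundary keys lives here
            TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                    new Path(args[1] + "_partitions"));             // assumed location

            // Sample ~10% of input keys (up to 10,000, from at most 10 splits) to
            // choose boundaries so all keys sent to Reducer i sort before Reducer i+1
            InputSampler.writePartitionFile(job,
                    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Concatenating the resulting output files in partition order then yields a totally sorted list, as the slide describes.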
  • Jobs Which Require a Fixed Number of Reducers  Some jobs will require a specific number of Reducers  Example: a job must output one file per day of the week – The key will be the weekday – Seven Reducers will be specified – A Partitioner will be written which sends one key to each Reducer
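A sketch of such a Partitioner, assuming the intermediate key is an upper-case weekday name (the class name and key format are illustrative, not from the course):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WeekdayPartitioner extends Partitioner<Text, IntWritable> {
        private static final String[] DAYS = {
            "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
            "FRIDAY", "SATURDAY", "SUNDAY"
        };

        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Keys are assumed to be upper-case weekday names (e.g. "FRIDAY")
            for (int i = 0; i < DAYS.length; i++) {
                if (DAYS[i].equals(key.toString())) {
                    return i % numPartitions;
                }
            }
            return 0; // fallback for unexpected keys; a real job might count these
        }
    }

The driver would then call job.setNumReduceTasks(7) and job.setPartitionerClass(WeekdayPartitioner.class). As the speaker notes above caution, one key per Reducer can be heavily skewed: a busy Saturday gives one Reducer far more work than the others.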
  • Jobs With a Variable Number of Reducers  Many jobs can be run with a variable number of Reducers  The developer must decide how many to specify – Each Reducer should get a reasonable amount of intermediate data, but not too much – Chicken-and-egg problem  Typical way to determine how many Reducers to specify: – Test the job with a relatively small test data set – Extrapolate to calculate the amount of intermediate data expected from the ‘real’ input data – Use that to calculate the number of Reducers which should be specified
  • Jobs With a Variable Number of Reducers (cont’d)  Note: you should take into account the number of Reduce slots likely to be available on the cluster – If your job requires one more Reduce slot than is available, a second ‘wave’ of Reducers will run, consisting of just that single Reducer, potentially doubling the amount of time spent on the Reduce phase – In this case, increasing the number of Reducers further may cut down the time spent in the Reduce phase: two or more waves will run, but the Reducers in each wave will have to process less data
  • Topics  Why Cloudera Training?  Who Should Attend Developer Training?  Developer Course Contents  A Deeper Dive: The New API vs The Old API  A Deeper Dive: Determining the Optimal Number of Reducers  Conclusion
  • Conclusion  Cloudera’s Developer training course is:  Technical  Hands-on  Interactive  Comprehensive  Attendees leave the course with the skillset required to write, test, and run Hadoop jobs  The course is good preparation for the CCDH certification exam
  • Questions?  For more information on Cloudera’s training courses, or to book a place on an upcoming course: http://university.cloudera.com  My e-mail address: ian@cloudera.com  Feel free to ask questions!  Hit the Q&A button, and type away