Introduction to Hadoop Developer Training Webinar

Are you new to Hadoop and need to start processing data fast and effectively? Have you been playing with CDH and are ready to move on to development supporting a technical or business use case? Are you prepared to unlock the full potential of all your data by building and deploying powerful Hadoop-based applications?

If you're wondering whether Cloudera's Developer Training for Apache Hadoop is right for you and your team, then this presentation is right for you. You will learn who is best suited to attend the live training, what prior knowledge you should have, and what topics the course covers. Cloudera Curriculum Manager Ian Wrigley will discuss the skills you will attain during the course and how they will help you become a full-fledged Hadoop application developer.

During the session, Ian will also present a short portion of the actual Cloudera Developer course, discussing the difference between New and Old APIs, why there are different APIs, and which you should use when writing your MapReduce code. Following the presentation, Ian will answer your questions about this or any of Cloudera’s other training courses.

Visit the resources section of cloudera.com to view the on-demand webinar.

  1. An Introduction to Cloudera’s Hadoop Developer Training Course (Ian Wrigley, Curriculum Manager)
  2. Welcome to the Webinar!
     • All lines are muted
     • Q&A after the presentation
     • Ask questions at any time by typing them in the WebEx panel
     • A recording of this webinar will be available on demand at cloudera.com
  3. Topics
     • Why Cloudera Training?
     • Who Should Attend Developer Training?
     • Developer Course Contents
     • A Deeper Dive: The New API vs. the Old API
     • A Deeper Dive: Determining the Optimal Number of Reducers
     • Conclusion
  4. Cloudera’s Training is the Industry Standard
     • Employees from 55% of the Fortune 100 have attended live Cloudera training
     • Cloudera has trained Big Data professionals from 100% of the top 20 global technology firms to use Hadoop
     • Cloudera has trained over 15,000 students
  5. Cloudera Training: The Benefits
     1. Broadest Range of Courses: covers all the key Hadoop components
     2. Most Experienced Instructors: over 15,000 students trained since 2009
     3. Leader in Certification: over 5,000 accredited Cloudera professionals
     4. State of the Art Curriculum: classes updated regularly as Hadoop evolves
     5. Widest Geographic Coverage: most classes offered; 20 countries plus virtual classroom
     6. Most Relevant Platform & Community: CDH deployed more than all other distributions combined
     7. Depth of Training Material: hands-on labs and VMs support live instruction
     8. Ongoing Learning: video tutorials and e-learning complement training
  6. “The professionalism and expansive technical knowledge of our classroom instructor was incredible. The quality of the training was on par with a university.”
  7. Topics
     • Why Cloudera Training?
     • Who Should Attend Developer Training?
     • Developer Course Contents
     • A Deeper Dive: The New API vs. the Old API
     • A Deeper Dive: Determining the Optimal Number of Reducers
     • Conclusion
  8. Common Attendee Profiles
     • Software developers/engineers
     • Business analysts
     • IT managers
     • Hadoop system administrators
  9. Course Prerequisites
     • Programming experience
     • Knowledge of Java highly recommended
     • Understanding of common computer science principles is helpful
     • Prior knowledge of Hadoop is not required
  10. Who Should Not Attend?
      • If you have no programming experience, you’re likely to find the course very difficult
        – You might consider our Hive and Pig training course instead
      • If you will be focused solely on configuring and managing your cluster, our Administrator training course would probably be a better alternative
  11. Topics
      • Why Cloudera Training?
      • Who Should Attend Developer Training?
      • Developer Course Contents
      • A Deeper Dive: The New API vs. the Old API
      • A Deeper Dive: Determining the Optimal Number of Reducers
      • Conclusion
  12. Developer Training: Overview
      • The course assumes no pre-existing knowledge of Hadoop
      • Starts by discussing the motivation for Hadoop
        – What problems exist that are difficult (or impossible) to solve with existing systems
      • Explains basic Hadoop concepts
        – The Hadoop Distributed File System (HDFS)
        – MapReduce
      • Introduces the Hadoop API (Application Programming Interface)
  13. Developer Training: Overview (cont’d)
      • Moves on to discuss more complex Hadoop concepts
        – Custom Partitioners
        – Custom Writables and WritableComparables
        – Custom InputFormats and OutputFormats
      • Investigates common MapReduce algorithms
        – Sorting, searching, indexing, joining data sets, etc.
      • Then covers the Hadoop ‘ecosystem’
        – Hive, Pig, Sqoop, Flume, Mahout, Oozie
  14. Course Contents
  15. Hands-On Exercises
      • The course features many Hands-On Exercises, including:
        – Analyzing log files
        – Unit-testing Hadoop code
        – Writing and implementing Combiners
        – Writing custom Partitioners
        – Using SequenceFiles and file compression
        – Creating an inverted index
        – Creating custom WritableComparables
        – Importing data with Sqoop
        – Writing Hive queries
        – …and more
  16. Certification
      • Our Developer course is good preparation for the Cloudera Certified Developer for Apache Hadoop (CCDH) exam
      • A voucher for one attempt at the exam is currently included in the course fee
  17. Topics
      • Why Cloudera Training?
      • Who Should Attend Developer Training?
      • Developer Course Contents
      • A Deeper Dive: The New API vs. the Old API
      • A Deeper Dive: Determining the Optimal Number of Reducers
      • Conclusion
  18. Chapter Topics: Writing a MapReduce Program (from “Basic Programming with the Hadoop Core API”)
      • The MapReduce flow
      • Basic MapReduce API concepts
      • Writing MapReduce applications in Java
        – The driver
        – The Mapper
        – The Reducer
      • Writing Mappers and Reducers in other languages with the Streaming API
      • Speeding up Hadoop development by using Eclipse
      • Hands-On Exercise: Writing a MapReduce Program
      • Differences between the Old and New MapReduce APIs
      • Conclusion
  19. What Is the Old API?
      • When Hadoop 0.20 was released, a ‘New API’ was introduced
        – Designed to make the API easier to evolve in the future
        – Favors abstract classes over interfaces
      • Some developers still use the Old API
        – Until CDH4, the New API was not absolutely feature-complete
      • All the code examples in this course use the New API
        – Old API-based solutions for many of the Hands-On Exercises are available in the sample_solutions_oldapi directory
  20. New API vs. Old API: Some Key Differences

      New API:

          import org.apache.hadoop.mapreduce.*;

          // Driver code:
          Configuration conf = new Configuration();
          Job job = new Job(conf);
          job.setJarByClass(Driver.class);
          job.setSomeProperty(...);
          ...
          job.waitForCompletion(true);

          // Mapper:
          public class MyMapper extends Mapper {
            public void map(Keytype k, Valuetype v, Context c) {
              ...
              c.write(key, val);
            }
          }

      Old API:

          import org.apache.hadoop.mapred.*;

          // Driver code:
          JobConf conf = new JobConf(Driver.class);
          conf.setSomeProperty(...);
          ...
          JobClient.runJob(conf);

          // Mapper:
          public class MyMapper extends MapReduceBase implements Mapper {
            public void map(Keytype k, Valuetype v, OutputCollector o, Reporter r) {
              ...
              o.collect(key, val);
            }
          }
  21. New API vs. Old API: Some Key Differences (cont’d)

      New API:

          // Reducer:
          public class MyReducer extends Reducer {
            public void reduce(Keytype k, Iterable<Valuetype> v, Context c) {
              for (Valuetype eachval : v) {
                // process eachval
                c.write(key, val);
              }
            }
          }

          setup(Context c)      // (see later)
          cleanup(Context c)    // (see later)

      Old API:

          // Reducer:
          public class MyReducer extends MapReduceBase implements Reducer {
            public void reduce(Keytype k, Iterator<Valuetype> v, OutputCollector o, Reporter r) {
              while (v.hasNext()) {
                // process v.next()
                o.collect(key, val);
              }
            }
          }

          configure(JobConf job)
          close()
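To tie the two comparisons together, here is a minimal, self-contained New API job: the classic word count. This sketch is not taken from the slides; it simply assembles the New API driver, Mapper, and Reducer shapes shown above into one runnable whole, and the class names are arbitrary.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for each token in the input line
      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reducer: sums the counts for each word
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      // Driver: wires the job together, New API style
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }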
  22. MRv1 vs. MRv2, Old API vs. New API
      • There is a lot of confusion about the New and Old APIs, and about MapReduce version 1 and MapReduce version 2
      • The chart below should clarify what is available with each version of MapReduce

                          Old API   New API
          MapReduce v1       ✔         ✔
          MapReduce v2       ✔         ✔

      • Summary: code using either the Old API or the New API will run under MRv1 and MRv2
        – You will have to recompile the code to move from MRv1 to MRv2, but you will not have to change the code itself
  23. Topics
      • Why Cloudera Training?
      • Who Should Attend Developer Training?
      • Developer Course Contents
      • A Deeper Dive: The New API vs. the Old API
      • A Deeper Dive: Determining the Optimal Number of Reducers
      • Conclusion
  24. Chapter Topics: Practical Development Tips and Techniques (from “Basic Programming with the Hadoop Core API”)
      • Strategies for debugging MapReduce code
      • Testing MapReduce code locally using LocalJobRunner
      • Writing and viewing log files
      • Retrieving job information with Counters
      • Determining the optimal number of Reducers for a job
      • Reusing objects
      • Creating Map-only MapReduce jobs
      • Hands-On Exercise: Using Counters and a Map-Only Job
      • Conclusion
  25. How Many Reducers Do You Need?
      • An important consideration when creating your job is the number of Reducers to specify
      • The default is a single Reducer
      • With a single Reducer, one task receives all keys in sorted order
        – This is sometimes advantageous if the output must be in completely sorted order
        – It can cause significant problems if there is a large amount of intermediate data
          – The node on which the Reducer is running may not have enough disk space to hold all the intermediate data
          – The Reducer will take a long time to run
  26. Jobs Which Require a Single Reducer
      • If a job needs to output a file where all keys are listed in sorted order, a single Reducer must be used
      • Alternatively, the TotalOrderPartitioner can be used, as sketched below
        – Uses an externally generated file which contains information about the intermediate key distribution
        – Partitions data such that all keys which go to the first Reducer are smaller than any which go to the second, etc.
        – In this way, multiple Reducers can be used
        – Concatenating the Reducers’ output files results in a totally ordered list
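As a rough illustration of this approach, here is a minimal driver sketch, not taken from the course materials. It assumes Text keys and values arriving in SequenceFile input and relies on the default identity Mapper and Reducer; the class name, paths, reducer count, and sampler settings are all hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalSortSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "total sort");
        job.setJarByClass(TotalSortSketch.class);

        // Default (identity) Mapper and Reducer: keys simply pass through, sorted
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Multiple Reducers, yet totally ordered output overall
        job.setNumReduceTasks(4);
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Sample the input to build the externally generated partition file
        // describing the intermediate key distribution
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
            new Path("/tmp/partitions.lst"));
        InputSampler.Sampler<Text, Text> sampler =
            new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Concatenating the four Reducers’ output files, in order, would then yield one totally sorted result.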
  27. Jobs Which Require a Fixed Number of Reducers
      • Some jobs will require a specific number of Reducers
      • Example: a job must output one file per day of the week (see the sketch below)
        – The key will be the weekday
        – Seven Reducers will be specified
        – A Partitioner will be written which sends one key to each Reducer
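A sketch of such a Partitioner, assuming the keys are weekday names and the values are counts; the class name WeekdayPartitioner and the value type are hypothetical, not from the course materials.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes each weekday key to its own Reducer (one partition per day)
    public class WeekdayPartitioner extends Partitioner<Text, IntWritable> {
      private static final String[] DAYS = {
        "Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"
      };

      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        for (int i = 0; i < DAYS.length; i++) {
          if (DAYS[i].equalsIgnoreCase(key.toString())) {
            return i % numPartitions;  // exactly one day per Reducer when numPartitions == 7
          }
        }
        return 0;  // fall back to the first Reducer for unexpected keys
      }
    }

In the driver, this would be paired with job.setNumReduceTasks(7) and job.setPartitionerClass(WeekdayPartitioner.class).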
  28. Jobs With a Variable Number of Reducers
      • Many jobs can be run with a variable number of Reducers; the developer must decide how many to specify
        – Each Reducer should get a reasonable amount of intermediate data, but not too much
        – This is a chicken-and-egg problem
      • A typical way to determine how many Reducers to specify (illustrated below):
        – Test the job with a relatively small test data set
        – Extrapolate to calculate the amount of intermediate data expected from the ‘real’ input data
        – Use that to calculate the number of Reducers which should be specified
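The arithmetic behind that extrapolation, as a small sketch; all figures here are hypothetical, and the 10 GB per-Reducer target is an assumed rule of thumb rather than a course recommendation.

    public class ReducerCountEstimate {
      public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        long testInputBytes        = 1 * gb;    // size of the small test data set
        long testIntermediateBytes = 3 * gb;    // intermediate data observed, e.g. from job counters
        long realInputBytes        = 500 * gb;  // size of the 'real' input data

        // Assume intermediate data scales roughly linearly with input size
        double ratio = (double) testIntermediateBytes / testInputBytes;
        long expectedIntermediate = (long) (realInputBytes * ratio);  // roughly 1.5 TB here

        // Target a reasonable amount of intermediate data per Reducer, say 10 GB
        long perReducerBytes = 10 * gb;
        int numReducers = (int) Math.ceil((double) expectedIntermediate / perReducerBytes);

        System.out.println("Suggested number of Reducers: " + numReducers);  // 150 here
        // In the driver: job.setNumReduceTasks(numReducers);
      }
    }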
  29. Jobs With a Variable Number of Reducers (cont’d)
      • Note: you should take into account the number of Reduce slots likely to be available on the cluster
        – If your job requires one more Reduce slot than is available, a second ‘wave’ of Reducers will run, consisting of just that single Reducer, potentially doubling the time spent on the Reduce phase
        – In this case, increasing the number of Reducers further may cut down the time spent in the Reduce phase: two or more waves will run, but the Reducers in each wave will have less data to process
  30. Topics
      • Why Cloudera Training?
      • Who Should Attend Developer Training?
      • Developer Course Contents
      • A Deeper Dive: The New API vs. the Old API
      • A Deeper Dive: Determining the Optimal Number of Reducers
      • Conclusion
  31. Conclusion
      • Cloudera’s Developer training course is:
        – Technical
        – Hands-on
        – Interactive
        – Comprehensive
      • Attendees leave the course with the skillset required to write, test, and run Hadoop jobs
      • The course is good preparation for the CCDH certification exam
  32. Questions?
      • For more information on Cloudera’s training courses, or to book a place on an upcoming course: http://university.cloudera.com
      • My e-mail address: ian@cloudera.com
      • Feel free to ask questions! Hit the Q&A button, and type away
