Hadoop MapReduce Fundamentals


  1. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 1 of 5
  2. Course Outline
  3. What is Hadoop?  Open-source data storage and processing API  Massively scalable, automatically parallelizable  Based on work from Google  GFS + MapReduce + BigTable  Current Distributions based on Open Source and Vendor Work  Apache Hadoop  Cloudera – CDH4 w/ Impala  Hortonworks  MapR  AWS  Windows Azure HDInsight
  4. Why Use Hadoop?  Cheaper  Scales to Petabytes or more  Faster  Parallel data processing  Better  Suited for particular types of BigData problems
  5. What types of business problems for Hadoop? Source: Cloudera “Ten Common Hadoopable Problems”
  6. Companies Using Hadoop  Facebook  Yahoo  Amazon  eBay  American Airlines  The New York Times  Federal Reserve Board  IBM  Orbitz
  7. Forecast growth of Hadoop Job Market Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
  8. Hadoop is a set of Apache Frameworks and more…  Data storage (HDFS)  Runs on commodity hardware (usually Linux)  Horizontally scalable  Processing (MapReduce)  Parallelized (scalable) processing  Fault Tolerant  Other Tools / Frameworks  Data Access  HBase, Hive, Pig, Mahout  Tools  Hue, Sqoop  Monitoring  Greenplum, Cloudera Hadoop Core - HDFS MapReduce API Data Access Tools & Libraries Monitoring & Alerting
  9. What are the core parts of a Hadoop distribution?
  10. Hadoop Cluster HDFS (Physical) Storage
  11. MapReduce Job – Logical View Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  12. Hadoop Ecosystem
  13. Common Hadoop Distributions  Open Source  Apache  Commercial  Cloudera  Hortonworks  MapR  AWS MapReduce  Microsoft HDInsight (Beta)
  14. A View of Hadoop (from Hortonworks) Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
  15. Setting up Hadoop Development
  16. Demo – Setting up Cloudera Hadoop Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
  17. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 2 of 5
  18. So, what’s the problem?  “I can just use some ‘SQL-like’ language to query Hadoop, right?”  “Yeah, SQL-on-Hadoop…that’s what I want.”  “I don’t want to learn a new query language and….”  “I want massive scale for my shiny, new BigData.”
  19. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  20. Demo – Using Hive QL on CDH4
  21. What is Hive?  a data warehouse system for Hadoop that  facilitates easy data summarization  supports ad-hoc queries (still batch though…)  created by Facebook  a mechanism to project structure onto this data and query the data using a SQL-like language – HiveQL  Interactive-console –or-  Execute scripts  Kicks off one or more MapReduce jobs in the background  an ability to use indexes, built-in and user-defined functions
  22. Is HQL == ANSI SQL? – NO!
      -- non-equality joins ARE allowed in ANSI SQL
      -- but are NOT allowed in Hive (HQL)
      SELECT a.* FROM a JOIN b ON (a.id <> b.id)
      Note: Joins are quite different in MapReduce, more on that coming up…
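One way to see why equality joins fit MapReduce so naturally is a reduce-side join: the map step emits each row under its join key, so rows that share a key meet at the same reducer. The sketch below simulates that idea in plain Python. It is an illustration only, not how Hive implements joins; the function names and sample rows are hypothetical.

```python
from collections import defaultdict


def map_tagged(table_name, rows, key_field):
    """Map: emit (join key, (source table, row)) for every row."""
    return [(row[key_field], (table_name, row)) for row in rows]


def shuffle(pairs):
    """Shuffle: gather all values that share a key, as a reducer would see them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(values):
    """Reduce: pair every row from table 'a' with every row from table 'b'."""
    left = [row for tag, row in values if tag == "a"]
    right = [row for tag, row in values if tag == "b"]
    return [(l, r) for l in left for r in right]


# Hypothetical sample tables.
a = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
b = [{"id": 1, "qty": 10}, {"id": 3, "qty": 30}]

pairs = map_tagged("a", a, "id") + map_tagged("b", b, "id")
joined = [p for values in shuffle(pairs).values() for p in reduce_join(values)]
print(joined)
```

A condition such as a.id <> b.id gives no single key to partition on, so matching rows could sit on any reducer; that is one plausible reading of why the slide flags it as unsupported.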
  23. Preparing for MapReduce
  24. Common Hadoop Shell Commands
      hadoop fs -cat file:///file2
      hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
      hadoop fs -copyFromLocal <fromDir> <toDir>
      hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
      sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
      hadoop fs -ls /user/hadoop/dir1
      hadoop fs -cat hdfs://nn1.example.com/file1
      hadoop fs -get /user/hadoop/file <localfile>
      Tips
      -- ‘sudo’ means ‘run as administrator’ (super user)
      -- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths differ for the former, see the link included for more detail
  25. Demo – Working with Files and HDFS
  26. Thinking in MapReduce  Hint: “It’s Functional”
  27. Understanding MapReduce – P1/3  Map>>  (K1, V1)   Info in  Input Split  list (K2, V2)  Key / Value out (intermediate values)  One list per local node  Can implement local Reducer (or Combiner)
  28. Understanding MapReduce – P2/3  Map>>  (K1, V1)   Info in  Input Split  list (K2, V2)  Key / Value out (intermediate values)  One list per local node  Can implement local Reducer (or Combiner)  Shuffle/Sort>>
  29. Understanding MapReduce – P3/3  Map>>  (K1, V1)  Info in  Input Split  list (K2, V2)  Key / Value out (intermediate values)  One list per local node  Can implement local Reducer (or Combiner)  Shuffle/Sort>>  Reduce  (K2, list(V2))  Shuffle / Sort phase precedes Reduce phase  Combines Map output into a list  list (K3, V3)  Usually aggregates intermediate values  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
  30. Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png MapReduce Example - WordCount
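The Map, Combine, Shuffle/Sort and Reduce steps from the preceding slides can be tried out without a cluster. The following is a minimal in-memory sketch of WordCount in Python, assuming each inner list stands in for one input split; it imitates the data flow only and does not use any Hadoop API. All function names and the sample text are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(_offset, line):
    """Map: (K1, V1) -> list(K2, V2). Emits (word, 1) for each word."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: a local reduce over the output of one split."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def shuffle_sort(all_pairs):
    """Shuffle/Sort: order by key and group values -> (K2, list(V2))."""
    ordered = sorted(all_pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_phase(word, counts):
    """Reduce: (K2, list(V2)) -> (K3, V3)."""
    return (word, sum(counts))


def word_count(splits):
    intermediate = []
    for split in splits:
        mapped = []
        for offset, line in enumerate(split):
            mapped.extend(map_phase(offset, line))
        intermediate.extend(combine(mapped))  # one combined list per split
    return dict(reduce_phase(k, vs) for k, vs in shuffle_sort(intermediate))


# Two hypothetical input splits.
splits = [["Deer Bear River", "Car Car River"], ["Deer Car Bear"]]
result = word_count(splits)
print(result)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

Because addition is associative and commutative, the combiner can reuse the reducer's logic here; that is not true of every reduce function.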
  31. MapReduce Objects Each daemon spawns a new JVM
  32. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  33. Demo – Running MapReduce WordCount
  34. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 3 of 5
  35. Ways to run MapReduce Jobs  Configure JobConf options  From Development Environment (IDE)  From a GUI utility  Cloudera – Hue  Microsoft Azure – HDInsight console  From the command line  hadoop jar <filename.jar> input output
  36. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  37. Setting up Hadoop On Windows Azure  About HDInsight
  38. Demo – MapReduce in the Cloud  WordCount MapReduce using HDInsight
  39. MapReduce (WordCount) with JavaScript Note: JavaScript is part of the Azure Hadoop distribution
  40. Common Data Sources for MapReduce Jobs
  41. Where is your Data coming from?  On premises  Local file system  Local HDFS instance  Private Cloud  Cloud storage  Public Cloud  Input Storage buckets  Script / Code buckets  Output buckets
  42. Common Data Jobs for MapReduce
  43. Demo – Other Types of MapReduce Tip: Review the Java MapReduce code in these samples as well.
  44. Methods to write MapReduce Jobs  Typical – usually written in Java  MapReduce 2.0 API  MapReduce 1.0 API  Streaming  Uses stdin and stdout  Can use any language to write Map and Reduce Functions  C#, Python, JavaScript, etc…  Pipes  Often used with C++  Abstraction libraries  Hive, Pig, etc… write in a higher level language, generate one or more MapReduce jobs
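The Streaming option above relies on a simple contract: the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout, and the reducer receives its input sorted by key. Below is a hedged Python sketch of that contract for WordCount. The two functions take any iterable of lines so they can be exercised locally; wiring them to sys.stdin in a real job is left as an assumption, and the names are hypothetical.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts for runs of identical keys; input must be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the framework: map, sort by key, then reduce.
sample = ["a b a", "b c"]
output = list(reducer(sorted(mapper(sample))))
print(output)  # ['a\t2', 'b\t2', 'c\t1']
```

The sorted() call plays the role of the shuffle/sort phase; on a cluster the framework performs that step between the two programs.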
  45. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  46. Demo – MapReduce via C# & PowerShell
  47. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  48. Using AWS MapReduce Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  49. What is Pig?  ETL Library for HDFS developed at Yahoo  Pig Runtime  Pig Language  Generates MapReduce Jobs  ETL steps  LOAD <file>  FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…  DUMP {to screen for testing}  STORE <newFile>
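To make the ETL steps listed for Pig concrete, here is a rough Python analogy of a LOAD, FILTER, GROUP BY, COUNT, STORE pipeline. It is not Pig Latin and says nothing about how Pig compiles scripts into MapReduce jobs; the record layout (user, url) and every function name are assumptions made for illustration.

```python
from collections import Counter


def load(lines):
    """LOAD: parse tab-separated (user, url) records."""
    return [tuple(line.split("\t")) for line in lines]


def filter_rows(rows, predicate):
    """FILTER: keep only rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def group_count(rows, key_index):
    """GROUP BY + FOREACH ... GENERATE COUNT: rows per distinct key."""
    return Counter(row[key_index] for row in rows)


def store(counts):
    """STORE: render results as tab-separated lines, sorted by key."""
    return [f"{key}\t{n}" for key, n in sorted(counts.items())]


# Hypothetical click log.
raw = [
    "alice\tnews.example",
    "bob\tshop.example",
    "alice\tshop.example",
    "carol\tnews.example",
]
rows = filter_rows(load(raw), lambda row: row[0] != "bob")
stored = store(group_count(rows, key_index=1))
print(stored)  # ['news.example\t2', 'shop.example\t1']
```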
  50. MapReduce Python Sample Remember that white space matters in Python!
  51. Demo – Using AWS MapReduce with Pig Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  52. AWS Data Pipeline with HIVE
  53. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 4 of 5
  54. Better MapReduce - Optimizations
  55. Optimization BEFORE running a MapReduce Job
  56. More about Input File Compression  From Cloudera…  Their version of LZO ‘splittable’
      Type   File      Size GB   Compress   Decompress
      None   Log       8.0       -          -
      Gzip   Log.gz    1.3       241        72
      LZO    Log.lzo   2.0       55         35
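The trade-off in the compression figures is easier to read as ratios. The short Python sketch below derives them from the numbers on the slide; the slide does not state the unit of the Compress and Decompress columns, so only relative values are computed. The variable and function names are hypothetical.

```python
ORIGINAL_GB = 8.0  # size of the uncompressed log from the slide

# Figures copied from the slide; time units are not stated there.
codecs = {
    "gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "lzo": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}


def compression_ratio(codec):
    """How many times smaller the file is after compression."""
    return ORIGINAL_GB / codecs[codec]["size_gb"]


def lzo_speedup(metric):
    """How many times faster LZO is than Gzip for the given column."""
    return codecs["gzip"][metric] / codecs["lzo"][metric]


for name in codecs:
    print(name, round(compression_ratio(name), 2))
print("compress speedup", round(lzo_speedup("compress"), 2))
print("decompress speedup", round(lzo_speedup("decompress"), 2))
```

On these numbers Gzip shrinks the file by roughly 6.2x against 4x for LZO, while LZO compresses about 4.4x faster and decompresses about 2.1x faster.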
  57. Optimization WITHIN a MapReduce Job
  58. 59
  59. Mapper Task Optimization
  60. Data Types  Writable  Text (String)  IntWritable  LongWritable  FloatWritable  BooleanWritable  WritableComparable for keys  Custom Types supported – write RawComparator
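In Hadoop's Java API, a Writable serializes itself with write and restores itself with readFields, and a WritableComparable additionally defines an ordering so keys can be sorted. The Python class below is only an analogy of that contract for a custom composite key, assuming a hypothetical (year, temperature) key packed as two 32-bit integers; it is not the Hadoop interface itself.

```python
import struct
from functools import total_ordering


@total_ordering
class YearTempKey:
    """Hypothetical composite key: (year, temperature)."""

    _LAYOUT = ">ii"  # two big-endian signed 32-bit integers

    def __init__(self, year, temp):
        self.year = year
        self.temp = temp

    def write(self):
        """Serialize the key to bytes (analogous to Writable.write)."""
        return struct.pack(self._LAYOUT, self.year, self.temp)

    @classmethod
    def read_fields(cls, data):
        """Rebuild a key from bytes (analogous to Writable.readFields)."""
        return cls(*struct.unpack(cls._LAYOUT, data))

    def _as_tuple(self):
        return (self.year, self.temp)

    def __eq__(self, other):
        return self._as_tuple() == other._as_tuple()

    def __lt__(self, other):
        """Ordering used for sorting keys (analogous to compareTo)."""
        return self._as_tuple() < other._as_tuple()

    def __hash__(self):
        return hash(self._as_tuple())


keys = [YearTempKey(1950, 22), YearTempKey(1949, 111), YearTempKey(1950, -11)]
restored = [YearTempKey.read_fields(key.write()) for key in keys]
ordered = [(key.year, key.temp) for key in sorted(restored)]
print(ordered)  # [(1949, 111), (1950, -11), (1950, 22)]
```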
  61. Reducer Task Optimization
  62. MapReduce Job Optimization
  63. Demo – Unit Testing MapReduce  Using MRUnit + Asserts  Optionally using ApprovalTests Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
  64. A note about MapReduce 2.0  Splits the existing JobTracker’s roles  resource management  job lifecycle management  MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability  through distributed job lifecycle management  support for multiple Hadoop MapReduce API versions in a single cluster
  65. What is Mahout?  Library with common machine learning algorithms  Over 20 algorithms  Recommendation (likelihood – Pandora)  Classification (known data and new data – spam id)  Clustering (new groups of similar data – Google news)  Can non-statisticians find value using this library?
  66. Mahout Algorithms
  67. Setting up Hadoop on Windows  For local development  Install binaries from Web Platform Installer  Install .NET Azure SDK (for Azure BLOB storage)  Install other tools  Neudesic Azure Storage Viewer
  68. Demo – Mahout  Using HDInsight
  69. What about the output?
  70. Clients (Visualizations) for HDFS  Many clients use Hive  Often included in GUI console tools for Hadoop distributions as well  Microsoft includes clients in Office (Excel 2013)  Direct Hive client  Connect using ODBC  PowerPivot – data mashups and presentation  Data Explorer – connect, transform, mashup and filter  Hadoop SDK on Codeplex  Other popular clients  Qlikview  Tableau  Karmasphere
  71. Demo – Executing Hive Queries
  72. Demo – Using HDFS output in Excel 2013 To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
  73. About Visualization
  74. Demo – New Visualizations – D3
  75. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 5 of 5
  76. Limitations of MapReduce
  77. Comparing: RDBMS vs. Hadoop
                            Traditional RDBMS         Hadoop / MapReduce
      Data Size             Gigabytes (Terabytes)     Petabytes (Exabytes)
      Access                Interactive and Batch     Batch – NOT Interactive
      Updates               Read / Write many times   Write once, Read many times
      Structure             Static Schema             Dynamic Schema
      Integrity             High (ACID)               Low
      Scaling               Nonlinear                 Linear
      Query Response Time   Can be near immediate     Has latency (due to batch processing)
  78. Microsoft alternatives to MapReduce  Use existing relational system  Scale via cloud or edition (i.e. Enterprise or PDW)  Use in memory OLAP  SQL Server Analysis Services Tabular Models  Use “productized” Dremel  Microsoft Polybase – status = beta?
  79. Looking Forward - Dremel or Apache Drill  Based on original research from Google
  80. Apache Drill Architecture
  81. In-market MapReduce Alternatives Cloudera  Impala Google  Big Query
  82. Demo – Google’s BigQuery  Dremel for the rest of us
  83. Hadoop MapReduce Call to Action
  84. More MapReduce Developer Resources  Based on the distribution – on premises  Apache  MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html  Cloudera  Cloudera University - http://university.cloudera.com/  Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html  Hortonworks  MapR  Based on the distribution – cloud  AWS MapReduce  Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs  Windows Azure HDInsight  Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/  More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
  85. The Changing Data Landscape

Editor's Notes

  • http://en.wikipedia.org/wiki/MapReduce
  • http://allthingsd.com/files/2012/04/big-numbers.jpg
  • http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf Also -- http://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
  • Image: http://siliconangle.com/files/2012/08/hadoop-300x300.jpg
  • http://www.platfora.com/wp-content/themes/PlatforaV2.0/img/enter/deployment_pick_graphic.png
  • http://indoos.files.wordpress.com/2010/08/hadoop_map1.png?w=819&h=612
  • http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://datameer2.datameer.com/blog/wp-content/uploads/2013/01/hadoop_ecosystem_clean.png http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
  • Image from: http://vichargrave.com/wp-content/uploads/2013/02/Hadoop-Development.png http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
  • https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
  • http://queryio.com/hadoop-big-data-images/hadoop-sql.jpg
  • http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • http://hive.apache.org/ https://cwiki.apache.org/confluence/display/Hive/GettingStarted
  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual http://en.wikipedia.org/wiki/Apache_Hive
  • http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://nsinfra.blogspot.in/2012/06/difference-between-hadoop-dfs-and.html
  • http://www.fincher.org/tips/General/SoftwareEngineering/FunctionalProgramming.shtml http://rbxbx.info/images/fault-tolerance.png
  • The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  • http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/
  • Image from http://curiousellie.typepad.com/.a/6a0133ec911c1f970b0168ebe6a2e4970c-500wi
  • http://hadoop.apache.org/docs/r1.1.2/streaming.html How to run and compile a Hadoop Java program -- https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program Sample code to compile a Java class: javac -classpath ~/hadoop/hadoop-core-1.0.1.jar:commons-cli-1.2.jar -d classes <nameOfJavaFile>.java && jar -cvf <nameOfJarFile>.jar -C classes/ .
  • http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • http://blogs.msdn.com/b/carlnol/archive/2013/02/05/submitting-hadoop-mapreduce-jobs-using-powershell.aspx
  • http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • About: Pig - http://en.wikipedia.org/wiki/Pig_(programming_tool) PigLatin language reference - http://pig.apache.org/docs/r0.10.0/start.html#pl-statements
  • http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
  • http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/cloudera/mr-perf
  • http://4.bp.blogspot.com/-2S6IuPD71A8/TZiNw8AyWkI/AAAAAAAAB0k/tS5QTP9SzHA/s1600/Detailed%2BHadoop%2BMapreduce%2BData%2BFlow.png
  • The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
  • Tips from Cloudera -- http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ & http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
  • http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/ http://hadoop.apache.org/docs/r0.23.6/api/index.html
  • http://mahout.apache.org/
  • Download local Hadoop via the Web Platform Installer. Also download the Azure .NET SDK for VS 2012. Link to download Windows Azure storage explorer: http://azurestorageexplorer.codeplex.com/ Link for downloading .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=roadmap&referringTitle=Home
  • Image from - http://bluewatersql.files.wordpress.com/2013/04/image12.png
  • http://www.research-live.com/Journals/1/Files/2013/1/11/covermania.jpg
  • https://github.com/mbostock/d3/wiki/Gallery
  • Original Reference: Tom White’s Hadoop: The Definitive Guide (I made some modifications based on my experience)
  • http://research.google.com/pubs/pub36632.html
  • https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
  • http://cloudera.com/content/cloudera/en/campaign/introducing-impala.html GigaOm ‘The Future…of Hadoop is real-time’ -- http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/ http://devopsangle.com/2012/08/20/googles-dremel-here-comes-a-new-challenger-to-yarnhadoop/
