Hadoop MapReduce Fundamentals

108,706 views

Deck from my five-part YouTube series (SoCalDevGal channel) on Hadoop MapReduce.

Published in: Technology, Education

  1. Hadoop MapReduce Fundamentals, by @LynnLangit: a five-part series – Part 1 of 5
  2. Course Outline
  3. What is Hadoop?
      • Open-source data storage and processing API
      • Massively scalable, automatically parallelizable
      • Based on work from Google: GFS + MapReduce + BigTable
      • Current distributions based on open source and vendor work: Apache Hadoop, Cloudera (CDH4 with Impala), Hortonworks, MapR, AWS, Windows Azure HDInsight
  4. Why Use Hadoop?
      • Cheaper: scales to petabytes or more
      • Faster: parallel data processing
      • Better: suited for particular types of Big Data problems
  5. What types of business problems for Hadoop? Source: Cloudera, “Ten Common Hadoopable Problems”
  6. Companies Using Hadoop: Facebook, Yahoo, Amazon, eBay, American Airlines, The New York Times, Federal Reserve Board, IBM, Orbitz
  7. Forecast growth of Hadoop Job Market. Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
  8. Hadoop is a set of Apache Frameworks and more…
      • Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable
      • Processing (MapReduce): parallelized (scalable) processing; fault tolerant
      • Other tools / frameworks: data access (HBase, Hive, Pig, Mahout); tools (Hue, Sqoop); monitoring (Greenplum, Cloudera)
      • Stack layers: Hadoop Core - HDFS, MapReduce API, Data Access, Tools & Libraries, Monitoring & Alerting
  9. What are the core parts of a Hadoop distribution?
  10. Hadoop Cluster HDFS (Physical) Storage
  11. MapReduce Job – Logical View. Image from http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  12. Hadoop Ecosystem
  13. Common Hadoop Distributions
      • Open source: Apache
      • Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight (Beta)
  14. A View of Hadoop (from Hortonworks). Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
  15. Setting up Hadoop Development
  16. Demo – Setting up Cloudera Hadoop. Note: Demo VMs can be downloaded from https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
  17. Hadoop MapReduce Fundamentals, by @LynnLangit: a five-part series – Part 2 of 5
  18. So, what’s the problem?
      • “I can just use some ‘SQL-like’ language to query Hadoop, right?”
      • “Yeah, SQL-on-Hadoop…that’s what I want.”
      • “I don’t want to learn a new query language and….”
      • “I want massive scale for my shiny, new BigData.”
  19. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
  20. Demo – Using Hive QL on CDH4
  21. What is Hive?
      • A data warehouse system for Hadoop that facilitates easy data summarization and supports ad-hoc queries (still batch though…); created by Facebook
      • A mechanism to project structure onto this data and query the data using a SQL-like language, HiveQL: interactive console or execute scripts; kicks off one or more MapReduce jobs in the background
      • An ability to use indexes, built-in and user-defined functions
  22. Is HQL == ANSI SQL? – NO!
        -- non-equality joins ARE allowed in ANSI SQL
        -- but are NOT allowed in Hive (HQL)
        SELECT a.*
        FROM a
        JOIN b ON (a.id <> b.id)
      Note: Joins are quite different in MapReduce, more on that coming up…
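The non-equality join on slide 22 can be tried locally. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for an engine that accepts the ANSI-style query; it is not Hive, and the tables `a` and `b` with their sample rows are made up for illustration. It also shows a rewrite that is often suggested for engines restricted to equality conditions in `ON`: a cross join with the inequality moved into `WHERE`. Whether a given Hive version accepts that rewrite is an assumption to verify against its documentation.

```python
import sqlite3

# In-memory database with two tiny, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
    """
)

# ANSI-style non-equality join, as written on the slide.
non_equi = conn.execute(
    "SELECT a.id, b.id FROM a JOIN b ON (a.id <> b.id) ORDER BY a.id, b.id"
).fetchall()

# Equivalent formulation: cross join, then filter in WHERE.
cross_filtered = conn.execute(
    "SELECT a.id, b.id FROM a CROSS JOIN b WHERE a.id <> b.id ORDER BY a.id, b.id"
).fetchall()

print(non_equi)
print(cross_filtered)
```

Both queries return the same pairs here, which is the point of the rewrite: the result is unchanged, only the place where the condition is stated moves.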
  23. Preparing for MapReduce
  24. Common Hadoop Shell Commands
        hadoop fs -cat file:///file2
        hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
        hadoop fs -copyFromLocal <fromDir> <toDir>
        hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
        sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
        hadoop fs -ls /user/hadoop/dir1
        hadoop fs -cat hdfs://nn1.example.com/file1
        hadoop fs -get /user/hadoop/file <localfile>
      Tips:
      • ‘sudo’ means ‘run as administrator’ (super user)
      • Some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to Hadoop differ for the former, see the link included for more detail
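When these shell commands need to be scripted, one option is to build the argument lists in Python and hand them to `subprocess`. The sketch below only constructs and prints the commands, so it runs without a Hadoop installation. The helper name `hadoop_fs` and the local file name `localfile.txt` are hypothetical; the HDFS paths are the ones from the slide.

```python
import shlex


def hadoop_fs(action, *args):
    """Build the argv list for `hadoop fs -<action> <args...>` (ASCII hyphen)."""
    return ["hadoop", "fs", f"-{action}", *args]


commands = [
    hadoop_fs("mkdir", "/user/hadoop/dir1", "/user/hadoop/dir2"),
    hadoop_fs("put", "localfile.txt", "/user/hadoop/dir1"),
    hadoop_fs("ls", "/user/hadoop/dir1"),
    hadoop_fs("get", "/user/hadoop/file", "localfile.txt"),
]

for argv in commands:
    # Print a shell-safe rendering; nothing is executed here.
    print(shlex.join(argv))
```

On a machine where the `hadoop` binary is on the PATH, each list could be passed to `subprocess.run(argv, check=True)` to execute it.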
  25. Demo – Working with Files and HDFS
  26. Thinking in MapReduce Hint: “It’s Functional”
  27. Understanding MapReduce – P1/3
      • Map >> (K1, V1): info in, from the input split; list (K2, V2): key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner)
  28. Understanding MapReduce – P2/3
      • Map >> (K1, V1): info in, from the input split; list (K2, V2): key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner)
      • Shuffle/Sort >>
  29. Understanding MapReduce – P3/3
      • Map >> (K1, V1): info in, from the input split; list (K2, V2): key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner)
      • Shuffle/Sort >>
      • Reduce (K2, list(V2)): Shuffle / Sort phase precedes Reduce phase; combines Map output into a list; list (K3, V3); usually aggregates intermediate values
      • (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
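The map, combine, shuffle/sort and reduce stages above can be imitated in a few lines of plain Python. This is a single-process sketch of the data flow only, not Hadoop code: the function names are invented for illustration, each inner list stands in for one input split, and the map key is simply the line index within the split (an assumption; a real job's input format decides what K1 is).

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_fn(key, line):
    """(K1, V1) -> list(K2, V2): emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_fn(pairs):
    """Local reducer: pre-aggregate one split's map output before the shuffle."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


def shuffle_sort(pairs):
    """Sort by key and group the values: list(K2, V2) -> list(K2, list(V2))."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (word, [count for _, count in group])
        for word, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_fn(word, counts):
    """(K2, list(V2)) -> (K3, V3): aggregate the intermediate values."""
    return (word, sum(counts))


def run_job(splits):
    """Run map and combine per split, then one shuffle/sort and reduce pass."""
    combined = []
    for split in splits:
        mapped = []
        for index, line in enumerate(split):
            mapped.extend(map_fn(index, line))
        combined.extend(combine_fn(mapped))
    return [reduce_fn(word, counts) for word, counts in shuffle_sort(combined)]


splits = [["the quick brown fox", "the lazy dog"], ["the fox"]]
result = run_job(splits)
print(result)
```

Note that the reducer for `the` receives `[2, 1]` rather than `[1, 1, 1]`: the combiner has already summed the two occurrences in the first split, which is how a combiner cuts down the data moved during the shuffle.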
  30. MapReduce Example - WordCount. Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
  31. MapReduce Objects. Each daemon spawns a new JVM.
  32. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
  33. Demo – Running MapReduce WordCount
  34. Hadoop MapReduce Fundamentals, by @LynnLangit: a five-part series – Part 3 of 5
  35. Ways to run MapReduce Jobs
      • Configure JobConf options
      • From a development environment (IDE)
      • From a GUI utility: Cloudera – Hue; Microsoft Azure – HDInsight console
      • From the command line: hadoop jar <filename.jar> input output
  36. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
  37. Setting up Hadoop on Windows Azure: About HDInsight
  38. Demo – MapReduce in the Cloud: WordCount MapReduce using HDInsight
  39. MapReduce (WordCount) with JavaScript. Note: JavaScript is part of the Azure Hadoop distribution.
  40. Common Data Sources for MapReduce Jobs
  41. Where is your Data coming from?
      • On premises: local file system, local HDFS instance
      • Private cloud: cloud storage
      • Public cloud: input storage buckets, script / code buckets, output buckets
  42. Common Data Jobs for MapReduce
  43. Demo – Other Types of MapReduce. Tip: Review the Java MapReduce code in these samples as well.
  44. Methods to write MapReduce Jobs
      • Typical: usually written in Java (MapReduce 2.0 API, MapReduce 1.0 API)
      • Streaming: uses stdin and stdout; can use any language to write Map and Reduce functions (C#, Python, JavaScript, etc…)
      • Pipes: often used with C++
      • Abstraction libraries (Hive, Pig, etc…): write in a higher-level language, generate one or more MapReduce jobs
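The Streaming method can be illustrated with a word-count mapper and reducer that only read and write lines of text. The sketch below assumes the usual streaming convention of one tab-separated `key<TAB>value` pair per line and reducer input that arrives sorted by key. The function names and the local `simulate` helper are made up for illustration, and the sort inside `simulate` stands in for the framework's shuffle/sort.

```python
import io
from itertools import groupby


def mapper(lines, out):
    """Map step: write one "word<TAB>1" line per word read."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")


def reducer(lines, out):
    """Reduce step: sum the counts for each run of identical keys.

    Relies on the input already being sorted by key, which the
    framework's shuffle/sort provides between the two steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        out.write(f"{word}\t{total}\n")


def simulate(lines):
    """Local stand-in for a streaming job: map, sort by key, then reduce."""
    mapped = io.StringIO()
    mapper(lines, mapped)
    shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    reducer(shuffled, reduced)
    return reduced.getvalue()


print(simulate(["a b a\n", "b c\n"]), end="")
```

In an actual streaming job the two functions would live in separate scripts, each called with `sys.stdin` and `sys.stdout` in place of the in-memory buffers used here.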
  45. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
  46. Demo – MapReduce via C# & PowerShell
  47. Ways to MapReduce: Libraries, Languages. Note: Java is most common, but other languages can be used.
  48. Using AWS MapReduce. Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud.
  49. What is Pig?
      • ETL library for HDFS developed at Yahoo: Pig runtime, Pig language, generates MapReduce jobs
      • ETL steps: LOAD <file>; FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…; DUMP {to screen for testing}; STORE <newFile>
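To make the ETL steps concrete, the sketch below walks the same LOAD, FILTER, GROUP BY, FOREACH/COUNT and STORE sequence in plain Python on a handful of in-memory lines. It is an analogy for the shape of a Pig dataflow, not Pig Latin and not how Pig executes. The field names (`user`, `status`), the sample records and the helper names are all hypothetical.

```python
from collections import defaultdict


def load(lines, fields):
    """LOAD: parse tab-separated lines into dicts keyed by field name."""
    return [dict(zip(fields, line.rstrip("\n").split("\t"))) for line in lines]


def filter_rows(rows, predicate):
    """FILTER: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def group_by(rows, field):
    """GROUP BY: collect rows into a bag per distinct field value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    return dict(groups)


def foreach_count(groups):
    """FOREACH ... GENERATE group, COUNT(bag): one (key, count) per group."""
    return [(key, len(rows)) for key, rows in sorted(groups.items())]


def store(records):
    """STORE: render records as tab-separated output lines."""
    return [f"{key}\t{count}" for key, count in records]


raw = ["alice\t200", "bob\t404", "alice\t200", "carol\t200", "bob\t200"]
rows = load(raw, ["user", "status"])
ok_rows = filter_rows(rows, lambda row: row["status"] == "200")
counts = foreach_count(group_by(ok_rows, "user"))
output = store(counts)
print("\n".join(output))  # comparable to DUMP: print to the screen for testing
```

Each helper corresponds to one relational step, which is also how Pig can translate a script into one or more MapReduce jobs: filters fit in the map phase, while grouping and counting need the shuffle and reduce phases.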
  50. MapReduce Python Sample. Remember that white space matters in Python!
  51. Demo – Using AWS MapReduce with Pig. Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud.
  52. AWS Data Pipeline with HIVE
  53. Hadoop MapReduce Fundamentals, by @LynnLangit: a five-part series – Part 4 of 5
  54. Better MapReduce - Optimizations
  55. Optimization BEFORE running a MapReduce Job
  56. More about Input File Compression. From Cloudera… Their version of LZO is ‘splittable’.
        Type   File      Size (GB)   Compress   Decompress
        None   Log       8.0         -          -
        Gzip   Log.gz    1.3         241        72
        LZO    Log.lzo   2.0         55         35
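The trade-off in the compression table is easier to see as ratios. The sketch below recomputes them from the slide's numbers. The slide does not state units for the Compress and Decompress columns, so they are treated here as relative costs only (an assumption); the dictionary layout and function names are illustrative.

```python
# Figures copied from the table on slide 56; units for the
# compress/decompress columns are not given on the slide.
rows = {
    "none": {"size_gb": 8.0, "compress": None, "decompress": None},
    "gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "lzo": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}


def size_ratio(codec):
    """Compressed size as a fraction of the uncompressed log."""
    return rows[codec]["size_gb"] / rows["none"]["size_gb"]


def cost_ratio(metric, slower="gzip", faster="lzo"):
    """How many times more costly `slower` is than `faster` for a metric."""
    return rows[slower][metric] / rows[faster][metric]


for codec in ("gzip", "lzo"):
    print(f"{codec}: {size_ratio(codec):.1%} of original size")
print(f"gzip vs lzo compress cost: {cost_ratio('compress'):.1f}x")
print(f"gzip vs lzo decompress cost: {cost_ratio('decompress'):.1f}x")
```

On these numbers Gzip yields the smaller file, while LZO is several times cheaper to compress and about twice as cheap to decompress.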
  57. Optimization WITHIN a MapReduce Job
  59. Mapper Task Optimization
  60. Data Types
      • Writable: Text (String), IntWritable, LongWritable, FloatWritable, BooleanWritable
      • WritableComparable for keys
      • Custom types supported – write RawComparator
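The idea behind Writable types and a raw comparator can be sketched outside Java. The example below serializes integers as 4-byte big-endian values, which is the layout `IntWritable` produces via `DataOutput.writeInt`, and compares two keys by reading the integers straight from their byte buffers. It is a Python analogy under those assumptions, not Hadoop's API; the function names are invented.

```python
import struct
from functools import cmp_to_key


def write_int(value):
    """Serialize a signed 32-bit int as 4 big-endian bytes."""
    return struct.pack(">i", value)


def read_int(buf, offset=0):
    """Read a signed 32-bit big-endian int starting at `offset`."""
    return struct.unpack_from(">i", buf, offset)[0]


def compare_raw(buf1, start1, buf2, start2):
    """Compare two serialized keys by decoding only the int at each offset.

    Returns -1, 0 or 1, in the style of a comparator.
    """
    left = read_int(buf1, start1)
    right = read_int(buf2, start2)
    return (left > right) - (left < right)


keys = [write_int(n) for n in (7, -1, 3)]
ordered = sorted(keys, key=cmp_to_key(lambda x, y: compare_raw(x, 0, y, 0)))
print([read_int(key) for key in ordered])
```

Decoding the integer matters for signed values: a plain byte-by-byte comparison would place `-1` (bytes `ff ff ff ff`) after `1`, which is why the comparison reads the number rather than comparing the raw bytes lexicographically.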
  61. Reducer Task Optimization
  62. MapReduce Job Optimization
  63. Demo – Unit Testing MapReduce. Using MRUnit + Asserts; optionally using ApprovalTests. Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
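The testing approach named on this slide (feed a mapper or reducer a known input, then assert on its exact output) carries over to any language. MRUnit itself is a Java library, so the sketch below is a Python analogy of the pattern rather than MRUnit's API; the function names and test data are made up.

```python
def map_words(key, line):
    """Mapper under test: emit (word, 1) for each word in the line."""
    for word in line.split():
        yield (word, 1)


def reduce_counts(word, counts):
    """Reducer under test: emit the total count for one word."""
    yield (word, sum(counts))


def test_mapper_emits_one_pair_per_word():
    assert list(map_words(0, "cat cat dog")) == [("cat", 1), ("cat", 1), ("dog", 1)]


def test_mapper_ignores_blank_lines():
    assert list(map_words(0, "   ")) == []


def test_reducer_sums_counts():
    assert list(reduce_counts("cat", [1, 1])) == [("cat", 2)]


# Run the checks directly; a test runner would normally discover them.
test_mapper_emits_one_pair_per_word()
test_mapper_ignores_blank_lines()
test_reducer_sums_counts()
print("all checks passed")
```

Testing the map and reduce functions in isolation like this keeps the feedback loop short, since no cluster or job submission is involved.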
  64. A note about MapReduce 2.0
      • Splits the existing JobTracker’s roles: resource management, job lifecycle management
      • MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability through distributed job lifecycle management, and support for multiple Hadoop MapReduce API versions in a single cluster
  65. What is Mahout?
      • Library with common machine learning algorithms
      • Over 20 algorithms: recommendation (likelihood – Pandora), classification (known data and new data – spam id), clustering (new groups of similar data – Google News)
      • Can non-statisticians find value using this library?
  66. Mahout Algorithms
  67. Setting up Hadoop on Windows
      • For local development
      • Install from binaries from Web Platform Installer
      • Install .NET Azure SDK (for Azure BLOB storage)
      • Install other tools: Neudesic Azure Storage Viewer
  68. Demo – Mahout Using HDInsight
  69. What about the output?
  70. Clients (Visualizations) for HDFS
      • Many clients use Hive; often included in GUI console tools for Hadoop distributions as well
      • Microsoft includes clients in Office (Excel 2013): direct Hive client (connect using ODBC); PowerPivot (data mashups and presentation); Data Explorer (connect, transform, mashup and filter); Hadoop SDK on Codeplex
      • Other popular clients: Qlikview, Tableau, Karmasphere
  71. Demo – Executing Hive Queries
  72. Demo – Using HDFS output in Excel 2013. To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
  73. About Visualization
  74. Demo – New Visualizations – D3
  75. Hadoop MapReduce Fundamentals, by @LynnLangit: a five-part series – Part 5 of 5
  76. Limitations of MapReduce
  77. Comparing: RDBMS vs. Hadoop
                              Traditional RDBMS          Hadoop / MapReduce
        Data Size             Gigabytes (Terabytes)      Petabytes (Exabytes)
        Access                Interactive and Batch      Batch – NOT Interactive
        Updates               Read / Write many times    Write once, Read many times
        Structure             Static Schema              Dynamic Schema
        Integrity             High (ACID)                Low
        Scaling               Nonlinear                  Linear
        Query Response Time   Can be near immediate      Has latency (due to batch processing)
  78. Microsoft alternatives to MapReduce
      • Use existing relational system: scale via cloud or edition (e.g. Enterprise or PDW)
      • Use in-memory OLAP: SQL Server Analysis Services Tabular Models
      • Use “productized” Dremel: Microsoft Polybase – status = beta?
  79. Looking Forward - Dremel or Apache Drill. Based on original research from Google.
  80. Apache Drill Architecture
  81. In-market MapReduce Alternatives: Cloudera Impala, Google BigQuery
  82. Demo – Google’s BigQuery: Dremel for the rest of us
  83. Hadoop MapReduce Call to Action
  84. More MapReduce Developer Resources
      • Based on the distribution – on premises
          • Apache: MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
          • Cloudera: Cloudera University - http://university.cloudera.com/
          • Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
          • Hortonworks
          • MapR
      • Based on the distribution – cloud
          • AWS MapReduce: Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
          • Windows Azure HDInsight: Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
          • More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
  85. The Changing Data Landscape
