Hadoop MapReduce Fundamentals

Deck from my 5-part YouTube series (SoCalDevGal channel) on Hadoop MapReduce

  • http://en.wikipedia.org/wiki/MapReduce
  • http://allthingsd.com/files/2012/04/big-numbers.jpg
  • http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf Also -- http://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
  • Image: http://siliconangle.com/files/2012/08/hadoop-300x300.jpg
  • http://www.platfora.com/wp-content/themes/PlatforaV2.0/img/enter/deployment_pick_graphic.png
  • http://indoos.files.wordpress.com/2010/08/hadoop_map1.png?w=819&h=612
  • http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://datameer2.datameer.com/blog/wp-content/uploads/2013/01/hadoop_ecosystem_clean.png http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
  • Image from: http://vichargrave.com/wp-content/uploads/2013/02/Hadoop-Development.png http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
  • https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
  • http://queryio.com/hadoop-big-data-images/hadoop-sql.jpg
  • http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • http://hive.apache.org/ https://cwiki.apache.org/confluence/display/Hive/GettingStarted
  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual http://en.wikipedia.org/wiki/Apache_Hive
  • http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://nsinfra.blogspot.in/2012/06/difference-between-hadoop-dfs-and.html
  • http://www.fincher.org/tips/General/SoftwareEngineering/FunctionalProgramming.shtml http://rbxbx.info/images/fault-tolerance.png
  • The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  • http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/
  • Image from http://curiousellie.typepad.com/.a/6a0133ec911c1f970b0168ebe6a2e4970c-500wi
  • http://hadoop.apache.org/docs/r1.1.2/streaming.html
    How to run and compile a Hadoop Java program -- https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program
    Sample code to compile a Java class: javac -classpath ~/hadoop/hadoop-core-1.0.1.jar;commons-cli-1.2.jar -d classes .java && jar -cvf .jar -C classes/
  • http://blogs.msdn.com/b/carlnol/archive/2013/02/05/submitting-hadoop-mapreduce-jobs-using-powershell.aspx
  • About: Pig - http://en.wikipedia.org/wiki/Pig_(programming_tool) PigLatin language reference - http://pig.apache.org/docs/r0.10.0/start.html#pl-statements
  • http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
  • http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/cloudera/mr-perf
  • http://4.bp.blogspot.com/-2S6IuPD71A8/TZiNw8AyWkI/AAAAAAAAB0k/tS5QTP9SzHA/s1600/Detailed%2BHadoop%2BMapreduce%2BData%2BFlow.png
  • The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
  • Tips from Cloudera -- http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ & http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
  • http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/ http://hadoop.apache.org/docs/r0.23.6/api/index.html
  • http://mahout.apache.org/
  • Download local Hadoop via the Web Platform Installer
    Also download the Azure .NET SDK for VS 2012
    Link to download Windows Azure storage explorer: http://azurestorageexplorer.codeplex.com/
    Link for downloading .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=roadmap&referringTitle=Home
  • Image from - http://bluewatersql.files.wordpress.com/2013/04/image12.png
  • http://www.research-live.com/Journals/1/Files/2013/1/11/covermania.jpg
  • https://github.com/mbostock/d3/wiki/Gallery
  • Original Reference: Tom White’s Hadoop: The Definitive Guide (I made some modifications based on my experience)
  • http://research.google.com/pubs/pub36632.html
  • https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
  • http://cloudera.com/content/cloudera/en/campaign/introducing-impala.html GigaOm ‘The Future…of Hadoop is real-time’ -- http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/ http://devopsangle.com/2012/08/20/googles-dremel-here-comes-a-new-challenger-to-yarnhadoop/

Hadoop MapReduce Fundamentals: Presentation Transcript

  • Hadoop MapReduce Fundamentals
    @LynnLangit
    a five part series – Part 1 of 5
  • Course Outline
  • What is Hadoop?
    Open-source data storage and processing API
    Massively scalable, automatically parallelizable
    Based on work from Google: GFS + MapReduce + BigTable
    Current distributions based on open source and vendor work:
        Apache Hadoop
        Cloudera – CDH4 w/ Impala
        Hortonworks
        MapR
        AWS
        Windows Azure HDInsight
  • Why Use Hadoop?
    Cheaper: scales to petabytes or more
    Faster: parallel data processing
    Better: suited for particular types of BigData problems
  • What types of business problems for Hadoop?
    Source: Cloudera “Ten Common Hadoopable Problems”
  • Companies Using Hadoop
    Facebook, Yahoo, Amazon, eBay, American Airlines, The New York Times, Federal Reserve Board, IBM, Orbitz
  • Forecast growth of Hadoop Job Market
    Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
  • Hadoop is a set of Apache Frameworks and more…
    Data storage (HDFS)
        Runs on commodity hardware (usually Linux)
        Horizontally scalable
    Processing (MapReduce)
        Parallelized (scalable) processing
        Fault tolerant
    Other Tools / Frameworks
        Data Access: HBase, Hive, Pig, Mahout
        Tools: Hue, Sqoop
        Monitoring: Greenplum, Cloudera
    Layers shown: Hadoop Core - HDFS, MapReduce API, Data Access, Tools & Libraries, Monitoring & Alerting
  • What are the core parts of a Hadoop distribution?
  • Hadoop Cluster HDFS (Physical) Storage
  • MapReduce Job – Logical View
    Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  • Hadoop Ecosystem
  • Common Hadoop Distributions
    Open Source: Apache
    Commercial: Cloudera, Hortonworks, MapR, AWS MapReduce, Microsoft HDInsight (Beta)
  • A View of Hadoop (from Hortonworks)
    Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
  • Setting up Hadoop Development
  • Demo – Setting up Cloudera Hadoop
    Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
  • Hadoop MapReduce Fundamentals
    @LynnLangit
    a five part series – Part 2 of 5
  • So, what’s the problem?
    “I can just use some ‘SQL-like’ language to query Hadoop, right?”
    “Yeah, SQL-on-Hadoop…that’s what I want.”
    “I don’t want to learn a new query language and…”
    “I want massive scale for my shiny, new BigData.”
  • Ways to MapReduce
    Libraries
    Languages
    Note: Java is most common, but other languages can be used
  • Demo – Using Hive QL on CDH4
  • What is Hive?
    a data warehouse system for Hadoop that
        facilitates easy data summarization
        supports ad-hoc queries (still batch though…)
        was created by Facebook
    a mechanism to project structure onto this data and query the data using a SQL-like language – HiveQL
        Interactive console, or execute scripts
        Kicks off one or more MapReduce jobs in the background
    an ability to use indexes, built-in and user-defined functions
  • Is HQL == ANSI SQL? – NO!
    -- non-equality joins ARE allowed on ANSI SQL
    -- but are NOT allowed on Hive (HQL)
    SELECT a.*
    FROM a
    JOIN b ON (a.id <> b.id)
    Note: Joins are quite different in MapReduce, more on that coming up…
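One way to see why equality joins fit MapReduce more naturally than non-equality joins: a reduce-side join can use the join column as the shuffle key, so rows from both tables meet in the same reduce call only when their keys are equal. The sketch below simulates that idea in plain Python. The sample tables and all function names are made up for illustration, and it is not a description of how Hive itself plans joins.

```python
from collections import defaultdict
from itertools import product


def map_rows(table_name, rows, key_field):
    """Map step: emit (join_key, (table_name, row)) for every row."""
    for row in rows:
        yield row[key_field], (table_name, row)


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, tagged_rows):
    """Reduce step: pair every 'a' row with every 'b' row for this key."""
    left = [row for tag, row in tagged_rows if tag == "a"]
    right = [row for tag, row in tagged_rows if tag == "b"]
    for a_row, b_row in product(left, right):
        yield {**a_row, **b_row}


# Hypothetical sample tables (not from the slides)
table_a = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
table_b = [{"id": 1, "city": "LA"}, {"id": 3, "city": "NYC"}]

pairs = list(map_rows("a", table_a, "id")) + list(map_rows("b", table_b, "id"))
joined = [
    row
    for key, rows in sorted(shuffle(pairs).items())
    for row in reduce_join(key, rows)
]
print(joined)
```

A condition such as a.id <> b.id gives the shuffle no single key to group on, which is one plausible reading of why the slide calls joins in MapReduce "quite different".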
  • Preparing for MapReduce
  • Common Hadoop Shell Commands
    hadoop fs -cat file:///file2
    hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
    hadoop fs -copyFromLocal <fromDir> <toDir>
    hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
    sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
    hadoop fs -ls /user/hadoop/dir1
    hadoop fs -cat hdfs://nn1.example.com/file1
    hadoop fs -get /user/hadoop/file <localfile>
    Tips
    -- ‘sudo’ means ‘run as administrator’ (super user)
    -- some hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to hadoop differ for the former, see the link included for more detail
  • Demo – Working with Files and HDFS
  • Thinking in MapReduce
    Hint: “It’s Functional”
  • Understanding MapReduce – P1/3
    Map >>
        (K1, V1): info in, from the input split
        list (K2, V2): key / value out (intermediate values), one list per local node
        Can implement local Reducer (or Combiner)
  • Understanding MapReduce – P2/3
    Map >>
        (K1, V1): info in, from the input split
        list (K2, V2): key / value out (intermediate values), one list per local node
        Can implement local Reducer (or Combiner)
    Shuffle/Sort >>
  • Understanding MapReduce – P3/3
    Map >>
        (K1, V1): info in, from the input split
        list (K2, V2): key / value out (intermediate values), one list per local node
        Can implement local Reducer (or Combiner)
    Shuffle/Sort >>
    Reduce
        (K2, list(V2)): Shuffle / Sort phase precedes Reduce phase and combines Map output into a list
        list (K3, V3): usually aggregates intermediate values
    (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
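The three phases on these slides can be simulated in a few lines of plain Python. This is only a single-process sketch of the data flow (map, local combine, shuffle/sort, reduce) using word count as the job; the function names and the two sample "splits" are invented for illustration, and the line index stands in for the byte offset a real input format would supply as K1.

```python
from collections import defaultdict


def map_words(offset, line):
    """Map: (K1, V1) = (line offset, line text) -> list of (K2, V2) = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one map task's output locally, before the shuffle."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle_sort(all_pairs):
    """Shuffle/Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in all_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(word, counts):
    """Reduce: (K2, list(V2)) -> (K3, V3) = (word, total)."""
    return word, sum(counts)


# Two hypothetical input splits, each handled by its own map task
splits = [["the quick brown fox", "the lazy dog"], ["the fox jumps"]]

intermediate = []
for split in splits:
    mapped = []
    for offset, line in enumerate(split):
        mapped.extend(map_words(offset, line))
    intermediate.extend(combine(mapped))

result = [reduce_counts(word, counts) for word, counts in shuffle_sort(intermediate)]
print(result)
```

Because addition is associative and commutative, the same summing logic can serve as both combiner and reducer here; that does not hold for every job.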
  • MapReduce Example - WordCount
    Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
  • MapReduce Objects
    Each daemon spawns a new JVM
  • Demo – Running MapReduce WordCount
  • Hadoop MapReduce Fundamentals
    @LynnLangit
    a five part series – Part 3 of 5
  • Ways to run MapReduce Jobs
    Configure JobConf options
    From Development Environment (IDE)
    From a GUI utility
        Cloudera – Hue
        Microsoft Azure – HDInsight console
    From the command line
        hadoop jar <filename.jar> input output
  • Setting up Hadoop On Windows Azure
    About HDInsight
  • Demo – MapReduce in the Cloud
    WordCount MapReduce using HDInsight
  • MapReduce (WordCount) with JavaScript
    Note: JavaScript is part of the Azure Hadoop distribution
  • Common Data Sources for MapReduce Jobs
  • Where is your Data coming from?
    On premises
        Local file system
        Local HDFS instance
    Private Cloud
        Cloud storage
    Public Cloud
        Input storage buckets
        Script / Code buckets
        Output buckets
  • Common Data Jobs for MapReduce
  • Demo – Other Types of MapReduce
    Tip: Review the Java MapReduce code in these samples as well.
  • Methods to write MapReduce Jobs
    Typical – usually written in Java
        MapReduce 2.0 API
        MapReduce 1.0 API
    Streaming
        Uses stdin and stdout
        Can use any language to write Map and Reduce functions: C#, Python, JavaScript, etc…
    Pipes
        Often used with C++
    Abstraction libraries
        Hive, Pig, etc… write in a higher level language, generate one or more MapReduce jobs
  • Demo – MapReduce via C# & PowerShell
  • Using AWS MapReduce
    Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  • What is Pig?
    ETL Library for HDFS developed at Yahoo
        Pig Runtime
        Pig Language
        Generates MapReduce Jobs
    ETL steps
        LOAD <file>
        FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
        DUMP {to screen for testing}
        STORE <newFile>
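To make the shape of those ETL steps concrete, here is the same load, filter, group, count, dump, store flow written as ordinary Python over an in-memory CSV. It is an analogy only: Pig would compile comparable steps into MapReduce jobs, whereas this runs in one process, and the sample data and column names are invented.

```python
import csv
import io
from collections import Counter

# Hypothetical input standing in for a file in HDFS
raw = io.StringIO("user,status\nann,200\nbob,404\nann,200\ncat,200\nann,404\n")

# LOAD: read records with named fields
records = list(csv.DictReader(raw))

# FILTER: keep only successful requests
ok = [r for r in records if r["status"] == "200"]

# GROUP BY user, then FOREACH group GENERATE COUNT
counts = Counter(r["user"] for r in ok)
grouped = sorted(counts.items())

# DUMP: print to screen for testing
print(grouped)

# STORE: write the result out (to a string buffer here, not a real file)
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(grouped)
stored = out.getvalue()
```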
  • MapReduce Python Sample
    Remember that white space matters in Python!
  • Demo – Using AWS MapReduce with Pig
    Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  • AWS Data Pipeline with HIVE
  • Hadoop MapReduce Fundamentals
    @LynnLangit
    a five part series – Part 4 of 5
  • Better MapReduce - Optimizations
  • Optimization BEFORE running a MapReduce Job
  • More about Input File Compression
    From Cloudera… their version of LZO is ‘splittable’
    Type    File       Size GB    Compress    Decompress
    None    Log        8.0        -           -
    Gzip    Log.gz     1.3        241         72
    LZO     Log.lzo    2.0        55          35
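The trade-off in those figures is easier to read as ratios. The short calculation below uses only the numbers from the slide; the slide does not state units for the Compress and Decompress columns, so they are treated here as unnamed time units, and the variable names are my own.

```python
# Figures copied from the slide; timing units are not stated there.
rows = {
    "None": {"size_gb": 8.0, "compress": None, "decompress": None},
    "Gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "LZO": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}

original_gb = rows["None"]["size_gb"]

# Compression ratio: original size divided by compressed size
ratios = {
    name: round(original_gb / rows[name]["size_gb"], 2) for name in ("Gzip", "LZO")
}

# How many times longer Gzip takes than LZO, per the slide's timings
compress_factor = round(rows["Gzip"]["compress"] / rows["LZO"]["compress"], 2)
decompress_factor = round(rows["Gzip"]["decompress"] / rows["LZO"]["decompress"], 2)

print(ratios, compress_factor, decompress_factor)
```

On these numbers Gzip produces the smaller file, while LZO is several times quicker to compress and about twice as quick to decompress.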
  • Optimization WITHIN a MapReduce Job
  • Mapper Task Optimization
  • Data Types WritableText (String)IntWritableLongWritableFloatWritableBooleanWritable WritableComparable for keys Custom Types supported – write RawComparator
  • Reducer Task Optimization
  • MapReduce Job Optimization
  • Demo – Unit Testing MapReduce
    Using MRUnit + Asserts
    Optionally using ApprovalTests
    Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
  • A note about MapReduce 2.0
    Splits the existing JobTracker’s roles
        resource management
        job lifecycle management
    MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as
        better scalability through distributed job lifecycle management
        support for multiple Hadoop MapReduce API versions in a single cluster
  • What is Mahout?
    Library with common machine learning algorithms
    Over 20 algorithms
        Recommendation (likelihood – Pandora)
        Classification (known data and new data – spam id)
        Clustering (new groups of similar data – Google news)
    Can non-statisticians find value using this library?
  • Mahout Algorithms
  • Setting up Hadoop on Windows
    For local development
    Install binaries from Web Platform Installer
    Install .NET Azure SDK (for Azure BLOB storage)
    Install other tools
        Neudesic Azure Storage Viewer
  • Demo – Mahout Using HDInsight
  • What about the output?
  • Clients (Visualizations) for HDFS
    Many clients use Hive
        Often included in GUI console tools for Hadoop distributions as well
    Microsoft includes clients in Office (Excel 2013)
        Direct Hive client
        Connect using ODBC
        PowerPivot – data mashups and presentation
        Data Explorer – connect, transform, mashup and filter
        Hadoop SDK on Codeplex
    Other popular clients
        Qlikview
        Tableau
        Karmasphere
  • Demo – Executing Hive Queries
  • Demo – Using HDFS output in Excel 2013
    To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
  • About Visualization
  • Demo – New Visualizations – D3
  • Hadoop MapReduce Fundamentals
    @LynnLangit
    a five part series – Part 5 of 5
  • Limitations of MapReduce
  • Comparing: RDBMS vs. Hadoop
                           Traditional RDBMS          Hadoop / MapReduce
    Data Size              Gigabytes (Terabytes)      Petabytes (Exabytes)
    Access                 Interactive and Batch      Batch – NOT Interactive
    Updates                Read / Write many times    Write once, Read many times
    Structure              Static Schema              Dynamic Schema
    Integrity              High (ACID)                Low
    Scaling                Nonlinear                  Linear
    Query Response Time    Can be near immediate      Has latency (due to batch processing)
  • Microsoft alternatives to MapReduce
    Use existing relational system
        Scale via cloud or edition (i.e. Enterprise or PDW)
    Use in-memory OLAP
        SQL Server Analysis Services Tabular Models
    Use “productized” Dremel
        Microsoft Polybase – status = beta?
  • Looking Forward - Dremel or Apache Drill
    Based on original research from Google
  • Apache Drill Architecture
  • In-market MapReduce Alternatives
    Cloudera Impala
    Google BigQuery
  • Demo – Google’s BigQuery
    Dremel for the rest of us
  • Hadoop MapReduce Call to Action
  • More MapReduce Developer Resources
    Based on the distribution – on premises
        Apache
            MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
        Cloudera
            Cloudera University - http://university.cloudera.com/
            Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
        Hortonworks
        MapR
    Based on the distribution – cloud
        AWS MapReduce
            Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
        Windows Azure HDInsight
            Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
            More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
  • The Changing Data Landscape