Hadoop MapReduce Fundamentals
@LynnLangit
a five-part series – Part 1 of 5
Course Outline
What is Hadoop?
 Open-source data storage and processing API
 Massively scalable, automatically parallelizable

Based on work from Google

GFS + MapReduce + BigTable

Current Distributions based on Open Source and Vendor Work

Apache Hadoop

Cloudera – CDH4 w/ Impala

Hortonworks

MapR

AWS

Windows Azure HDInsight
Why Use Hadoop?
 Cheaper

Scales to Petabytes or more
 Faster

Parallel data processing
 Better

Suited for particular types of BigData problems
What types of business problems for Hadoop?
Source: Cloudera “Ten Common Hadoopable Problems”
Companies Using Hadoop
 Facebook
 Yahoo
 Amazon
 eBay
 American Airlines
 The New York Times
 Federal Reserve Board
 IBM
 Orbitz
Forecast growth of Hadoop Job Market
Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
Hadoop is a set of Apache Frameworks and more…
 Data storage (HDFS)

Runs on commodity hardware (usually Linux)

Horizontally scalable
 Processing (MapReduce)

Parallelized (scalable) processing

Fault Tolerant
 Other Tools / Frameworks

Data Access

HBase, Hive, Pig, Mahout

Tools

Hue, Sqoop

Monitoring

Greenplum, Cloudera
Layered stack: Hadoop Core – HDFS, MapReduce API, Data Access, Tools & Libraries, Monitoring & Alerting
What are the core parts of a Hadoop distribution?
Hadoop Cluster HDFS (Physical) Storage
MapReduce Job – Logical View
Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
Hadoop Ecosystem
Common Hadoop Distributions
 Open Source

Apache
 Commercial

Cloudera

Hortonworks

MapR

AWS MapReduce

Microsoft HDInsight (Beta)
A View of Hadoop (from Hortonworks)
Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
Setting up Hadoop Development
Demo – Setting up Cloudera Hadoop
Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
Hadoop MapReduce Fundamentals
@LynnLangit
a five-part series – Part 2 of 5
So, what’s the problem?
 “I can just use some ‘SQL-like’ language to query Hadoop, right?”
 “Yeah, SQL-on-Hadoop… that’s what I want.”
 “I don’t want to learn a new query language and…”
 “I want massive scale for my shiny, new BigData”
Ways to MapReduce
Libraries / Languages
Note: Java is most common, but other languages can be used
Demo – Using Hive QL on CDH4
What is Hive?
 a data warehouse system for Hadoop that

facilitates easy data summarization

supports ad-hoc queries (still batch though…)

created by Facebook
 a mechanism to project structure onto this data and query the data using a
SQL-like language – HiveQL

Interactive console – or –

Execute scripts

Kicks off one or more MapReduce jobs in the background
 an ability to use indexes, built-in user-defined functions
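A minimal HiveQL sketch of that flow (table, column, and file names here are illustrative, not from the course demo):

-- project structure onto a tab-delimited file already in HDFS
CREATE TABLE page_views (user_id STRING, url STRING, view_time INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/hadoop/page_views.tsv' INTO TABLE page_views;

-- the SELECT kicks off one or more MapReduce jobs in the background
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;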
Is HQL == ANSI SQL? – NO!
-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)
Note: Joins are quite different in MapReduce, more on that coming up…
Preparing for MapReduce
Common Hadoop Shell Commands
hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>
Tips
-- ‘sudo’ means ‘run as administrator’ (super user)
-- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’ – file paths differ for the former; see the link included for more detail
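A possible end-to-end session combining the commands above (file and directory names are illustrative; relative paths resolve under /user/<you>):

hadoop fs -mkdir input
hadoop fs -put ./shakespeare.txt input
hadoop fs -ls input
hadoop jar wordcount.jar WordCount input output
hadoop fs -cat output/part-r-00000
hadoop fs -get output/part-r-00000 ./results.txt

The result file is named part-00000 with the old API and part-r-00000 with the new one.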
Demo – Working with Files and HDFS
Thinking in MapReduce
 Hint: “It’s Functional”
Understanding MapReduce – P1/3
 Map>>

(K1, V1) → list (K2, V2)

Info in: Input Split

Key / Value out (intermediate values)

One list per local node

Can implement local Reducer (or Combiner)
Understanding MapReduce – P2/3
 Map>>

(K1, V1) → list (K2, V2)

Info in: Input Split

Key / Value out (intermediate values)

One list per local node

Can implement local Reducer (or Combiner)
 Shuffle/Sort>>
Understanding MapReduce – P3/3
 Map>>

(K1, V1) → list (K2, V2)

Info in: Input Split

Key / Value out (intermediate values)

One list per local node

Can implement local Reducer (or Combiner)
 Shuffle/Sort>>
 Reduce

(K2, list(V2)) → list (K3, V3)

Shuffle / Sort phase precedes Reduce phase

Combines Map output into a list

Usually aggregates intermediate values

(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
MapReduce Example - WordCount
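For a concrete feel for those (K, V) pairs, here is roughly how WordCount flows for two short input lines (keys on the map input are byte offsets into the file):

Input splits:     (0, "the cat sat")   (12, "the dog sat")
Map output:       ("the",1) ("cat",1) ("sat",1) ("the",1) ("dog",1) ("sat",1)
Shuffle / Sort:   ("cat",[1]) ("dog",[1]) ("sat",[1,1]) ("the",[1,1])
Reduce output:    ("cat",1) ("dog",1) ("sat",2) ("the",2)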
MapReduce Objects
Each daemon spawns a new JVM
Ways to MapReduce
Libraries / Languages
Note: Java is most common, but other languages can be used
Demo – Running MapReduce WordCount
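A condensed sketch of the Mapper and Reducer behind a WordCount job, written against the newer org.apache.hadoop.mapreduce API; class names are the usual tutorial ones and not necessarily identical to the demo code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1 = offset, V1 = line of text) -> list(K2 = word, V2 = 1)
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);     // emit intermediate (word, 1)
    }
  }
}

// Reduce: (K2 = word, list(V2)) -> (K3 = word, V3 = total count)
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);     // emit final (word, count)
  }
}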
Hadoop MapReduce Fundamentals
@LynnLangit
a five-part series – Part 3 of 5
Ways to run MapReduce Jobs
 Configure JobConf options
 From Development Environment (IDE)
 From a GUI utility

Cloudera – Hue

Microsoft Azure – HDInsight console
 From the command line

hadoop jar <filename.jar> input output
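A minimal driver sketch wiring those options together with the MapReduce 2.0 Job API (the older JobConf API is analogous); it assumes the TokenizerMapper / IntSumReducer classes from the WordCount sketch above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");  // older releases: new Job(conf, "word count")
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);      // optional local Reducer (Combiner)
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, this is the class the command line above invokes, e.g. hadoop jar wordcount.jar WordCount input output.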
Ways to MapReduce
Libraries / Languages
Note: Java is most common, but other languages can be used
Setting up Hadoop On Windows Azure
 About HDInsight
Demo – MapReduce in the Cloud
 WordCount MapReduce using HDInsight
MapReduce (WordCount) with JavaScript
Note: JavaScript is part of the Azure Hadoop distribution
Common Data Sources for MapReduce Jobs
Where is your Data coming from?
 On premises

Local file system

Local HDFS instance
 Private Cloud

Cloud storage
 Public Cloud

Input Storage buckets

Script / Code buckets

Output buckets
Common Data Jobs for MapReduce
Demo – Other Types of MapReduce
Tip: Review the Java MapReduce code in these samples as well.
Methods to write MapReduce Jobs
 Typical – usually written in Java

MapReduce 2.0 API

MapReduce 1.0 API
 Streaming

Uses stdin and stdout

Can use any language to write Map and Reduce Functions

C#, Python, JavaScript, etc…
 Pipes

Often used with C++
 Abstraction libraries

Hive, Pig, etc… write in a higher-level language, generate one or more MapReduce jobs
Ways to MapReduce
Libraries / Languages
Note: Java is most common, but other languages can be used
Demo – MapReduce via C# & PowerShell
Ways to MapReduce
Libraries / Languages
Note: Java is most common, but other languages can be used
Using AWS MapReduce
Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
What is Pig?
 ETL Library for HDFS developed at Yahoo

Pig Runtime

Pig Language

Generates MapReduce Jobs
 ETL steps

LOAD <file>

FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…

DUMP {to screen for testing} → STORE <newFile>
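A small Pig Latin sketch tying those steps together (relation names, file paths, and the schema are illustrative):

logs   = LOAD '/user/hadoop/access_log.txt' AS (user:chararray, url:chararray, bytes:int);
big    = FILTER logs BY bytes > 1024;
by_url = GROUP big BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(big) AS hits;
DUMP counts;                               -- to screen, for testing
STORE counts INTO '/user/hadoop/url_hits';

Only DUMP and STORE trigger execution; at that point Pig turns the script into one or more MapReduce jobs.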
MapReduce Python Sample
Remember that white space matters in Python!
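The slide's Python sample isn't reproduced here; below is a generic Hadoop Streaming word count in Python as a hedged sketch (the mapper and reducer simply read stdin and write stdout, as described earlier; file names are illustrative):

# mapper.py -- emit (word, 1) for every word read from stdin
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

# reducer.py -- input arrives sorted by key, so counts can be summed per word
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip('\n').split('\t', 1)
    if word != current:
        if current is not None:
            print('%s\t%d' % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print('%s\t%d' % (current, count))

A typical invocation (the streaming jar location varies by distribution):
hadoop jar hadoop-streaming.jar -input input -output output -mapper 'python mapper.py' -reducer 'python reducer.py' -file mapper.py -file reducer.py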
Demo – Using AWS MapReduce with Pig
Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
AWS Data Pipeline with HIVE
Hadoop MapReduce Fundamentals
@LynnLangit
a five-part series – Part 4 of 5
Better MapReduce - Optimizations
Optimization BEFORE running a MapReduce Job
More about Input File Compression
 From Cloudera…
 Their version of LZO is ‘splittable’

Type   File      Size (GB)   Compress   Decompress
None   Log       8.0         -          -
Gzip   Log.gz    1.3         241        72
LZO    Log.lzo   2.0         55         35
Optimization WITHIN a MapReduce Job
Mapper Task Optimization
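One common mapper-side optimization (from the Cloudera tips linked in the notes) is compressing the intermediate map output so less data crosses the network during the shuffle. A hedged fragment for the driver sketch shown earlier, set before the Job is created; the property names below are the MapReduce 2.0 ones (MapReduce 1.0 uses mapred.compress.map.output / mapred.map.output.compression.codec):

// inside the driver, before Job.getInstance(conf, ...)
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setClass("mapreduce.map.output.compress.codec",
    org.apache.hadoop.io.compress.SnappyCodec.class,
    org.apache.hadoop.io.compress.CompressionCodec.class);   // Snappy assumes the native codec is installed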
Data Types
 Writable

Text (String)

IntWritable

LongWritable

FloatWritable

BooleanWritable
 WritableComparable for keys
 Custom Types supported – write RawComparator
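A minimal sketch of a custom key type: implementing WritableComparable is what lets the framework serialize the key and sort it during shuffle (a RawComparator can additionally be registered to compare serialized bytes without deserializing). Field names are illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: (airline, flightNumber)
public class FlightKey implements WritableComparable<FlightKey> {
  private String airline = "";
  private int flightNumber;

  public void set(String airline, int flightNumber) {
    this.airline = airline;
    this.flightNumber = flightNumber;
  }

  @Override
  public void write(DataOutput out) throws IOException {    // serialize
    out.writeUTF(airline);
    out.writeInt(flightNumber);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialize
    airline = in.readUTF();
    flightNumber = in.readInt();
  }

  @Override
  public int compareTo(FlightKey other) {                   // sort order used by shuffle/sort
    int cmp = airline.compareTo(other.airline);
    return cmp != 0 ? cmp : Integer.compare(flightNumber, other.flightNumber);
  }

  @Override
  public int hashCode() {                                    // used by the default partitioner
    return airline.hashCode() * 163 + flightNumber;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof FlightKey)) return false;
    FlightKey k = (FlightKey) o;
    return airline.equals(k.airline) && flightNumber == k.flightNumber;
  }
}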
Reducer Task Optimization
MapReduce Job Optimization
Demo – Unit Testing MapReduce
 Using MRUnit + Asserts
 Optionally using ApprovalTests
Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
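A hedged sketch of such a test, written against the TokenizerMapper / IntSumReducer classes from the earlier WordCount sketch (MRUnit 1.x new-API drivers plus JUnit):

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

  @Test
  public void mapperEmitsOnePerWord() throws Exception {
    MapDriver.newMapDriver(new TokenizerMapper())
        .withInput(new LongWritable(0), new Text("cat cat dog"))
        .withOutput(new Text("cat"), new IntWritable(1))
        .withOutput(new Text("cat"), new IntWritable(1))
        .withOutput(new Text("dog"), new IntWritable(1))
        .runTest();                 // asserts expected output == actual output, in order
  }

  @Test
  public void reducerSumsCounts() throws Exception {
    ReduceDriver.newReduceDriver(new IntSumReducer())
        .withInput(new Text("cat"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
        .withOutput(new Text("cat"), new IntWritable(2))
        .runTest();
  }
}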
A note about MapReduce 2.0
 Splits the existing JobTracker’s roles

resource management

job lifecycle management
 MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability

through distributed job lifecycle management

support for multiple Hadoop MapReduce API versions in a single cluster
What is Mahout?
 Library with common machine learning algorithms
 Over 20 algorithms

Recommendation (likelihood – Pandora)

Classification (known data and new data – spam id)

Clustering (new groups of similar data – Google news)
 Can non-statisticians find value using this library?
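On that last question: Mahout's non-distributed Taste recommender API is quite approachable without a statistics background. A minimal hedged sketch; the ratings file and neighborhood size are illustrative:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv: userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // top 5 item recommendations for user 42
    List<RecommendedItem> recs = recommender.recommend(42, 5);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}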
Mahout Algorithms
Setting up Hadoop on Windows
 For local development
 Install binaries from the Web Platform Installer
 Install .NET Azure SDK (for Azure BLOB storage)
 Install other tools

Neudesic Azure Storage Viewer
Demo – Mahout
 Using HDInsight
What about the output?
Clients (Visualizations) for HDFS
 Many clients use Hive

Often included in GUI console tools for Hadoop distributions as well
 Microsoft includes clients in Office (Excel 2013)

Direct Hive client

Connect using ODBC

PowerPivot – data mashups and presentation

Data Explorer – connect, transform, mashup and filter

Hadoop SDK on Codeplex
 Other popular clients

Qlikview

Tableau

Karmasphere
Demo – Executing Hive Queries
Demo – Using HDFS output in Excel 2013
To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
About Visualization
Demo – New Visualizations – D3
Hadoop MapReduce Fundamentals
@LynnLangit
a five-part series – Part 5 of 5
Limitations of MapReduce
Comparing: RDBMS vs. Hadoop
                       Traditional RDBMS          Hadoop / MapReduce
Data Size              Gigabytes (Terabytes)      Petabytes (Exabytes)
Access                 Interactive and Batch      Batch – NOT Interactive
Updates                Read / Write many times    Write once, Read many times
Structure              Static Schema              Dynamic Schema
Integrity              High (ACID)                Low
Scaling                Nonlinear                  Linear
Query Response Time    Can be near immediate      Has latency (due to batch processing)
Microsoft alternatives to MapReduce
 Use existing relational system

Scale via cloud or edition (i.e. Enterprise or PDW)
 Use in-memory OLAP

SQL Server Analysis Services Tabular Models
 Use “productized” Dremel

Microsoft Polybase – status = beta?
Looking Forward - Dremel or Apache Drill
 Based on original research from Google
Apache Drill Architecture
In-market MapReduce Alternatives
Cloudera
 Impala
Google
 BigQuery
Demo – Google’s BigQuery
 Dremel for the rest of us
Hadoop MapReduce Call to Action
More MapReduce Developer Resources
 Based on the distribution – on premises

Apache

MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html

Cloudera

Cloudera University - http://university.cloudera.com/

Cloudera Developer Course (4 day) - *RECOMMENDED* -
http://university.cloudera.com/training/apache_hadoop/developer.html

Hortonworks

MapR
 Based on the distribution – cloud

AWS MapReduce

Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs

Windows Azure HDInsight

Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/

More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
The Changing Data Landscape

Editor's Notes

  • #4 http://en.wikipedia.org/wiki/MapReduce
  • #5 http://allthingsd.com/files/2012/04/big-numbers.jpg
  • #6 http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf Also -- http://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
  • #7 Image: http://siliconangle.com/files/2012/08/hadoop-300x300.jpg
  • #9 http://www.platfora.com/wp-content/themes/PlatforaV2.0/img/enter/deployment_pick_graphic.png
  • #13 http://indoos.files.wordpress.com/2010/08/hadoop_map1.png?w=819&h=612
  • #14 http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://datameer2.datameer.com/blog/wp-content/uploads/2013/01/hadoop_ecosystem_clean.png http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
  • #17 Image from: http://vichargrave.com/wp-content/uploads/2013/02/Hadoop-Development.png http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
  • #18 https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
  • #20 http://queryio.com/hadoop-big-data-images/hadoop-sql.jpg
  • #21 http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • #23 http://hive.apache.org/ https://cwiki.apache.org/confluence/display/Hive/GettingStarted
  • #24 https://cwiki.apache.org/confluence/display/Hive/LanguageManual http://en.wikipedia.org/wiki/Apache_Hive
  • #26 http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://nsinfra.blogspot.in/2012/06/difference-between-hadoop-dfs-and.html
  • #28 http://www.fincher.org/tips/General/SoftwareEngineering/FunctionalProgramming.shtml http://rbxbx.info/images/fault-tolerance.png
  • #29 The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  • #30 The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  • #31 The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  • #34 http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • #38 http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • #39 http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/
  • #43 Image from http://curiousellie.typepad.com/.a/6a0133ec911c1f970b0168ebe6a2e4970c-500wi
  • #46 http://hadoop.apache.org/docs/r1.1.2/streaming.html How to run and compile a Hadoop Java program -- https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program Sample code to compile a Java class: javac -classpath ~/hadoop/hadoop-core-1.0.1.jar:commons-cli-1.2.jar -d classes <nameOfJavaFile>.java && jar -cvf <nameOfJarFile>.jar -C classes/ .
  • #47 http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • #48 http://blogs.msdn.com/b/carlnol/archive/2013/02/05/submitting-hadoop-mapreduce-jobs-using-powershell.aspx
  • #49 http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  • #53 About: Pig - http://en.wikipedia.org/wiki/Pig_(programming_tool) PigLatin language reference - http://pig.apache.org/docs/r0.10.0/start.html#pl-statements
  • #58 http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
  • #59 http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/cloudera/mr-perf
  • #60 http://4.bp.blogspot.com/-2S6IuPD71A8/TZiNw8AyWkI/AAAAAAAAB0k/tS5QTP9SzHA/s1600/Detailed%2BHadoop%2BMapreduce%2BData%2BFlow.png
  • #62 The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
  • #64 Tips from Cloudera -- http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ & http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
  • #66 http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/ http://hadoop.apache.org/docs/r0.23.6/api/index.html
  • #67 http://mahout.apache.org/
  • #69 Download local Hadoop via the Web Platform Installer. Also download the Azure .NET SDK for VS 2012. Link to download Windows Azure storage explorer: http://azurestorageexplorer.codeplex.com/ Link for downloading the .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=roadmap&referringTitle=Home
  • #71 Image from - http://bluewatersql.files.wordpress.com/2013/04/image12.png
  • #75 http://www.research-live.com/Journals/1/Files/2013/1/11/covermania.jpg
  • #76 https://github.com/mbostock/d3/wiki/Gallery
  • #79 Original Reference: Tom White’ s Hadoop: The Definitive Guide (I made some modifications based on my experience)
  • #81 http://research.google.com/pubs/pub36632.html
  • #82 https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
  • #83 http://cloudera.com/content/cloudera/en/campaign/introducing-impala.html GigaOm ‘The Future…of Hadoop is real-time’ -- http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/ http://devopsangle.com/2012/08/20/googles-dremel-here-comes-a-new-challenger-to-yarnhadoop/