Hadoop 101

Mohit Soni
eBay Inc.




BarCamp Chennai - 5                Mohit Soni
About Me

• I work as a Software Engineer at eBay
• Worked on large-scale data processing with
  eBay Research Labs




First Things First




MapReduce
• Inspired by functional operations
  – Map
  – Reduce
• Functional operations do not modify data,
  they generate new data
• Original data remains unmodified




Functional Operations
Map                                Reduce
def sqr(n):                        def add(i, j):
   return n * n                      return i + j

list = [1,2,3,4]                   list = [1,2,3,4]

map(sqr, list) -> [1,4,9,16]       reduce(add, list) -> 10



                             MapReduce
        def MapReduce(data, mapper, reducer):
          return reduce(reducer, map(mapper, data))

        MapReduce(list, sqr, add) -> 30


Python code (Python 2 shown; in Python 3, reduce lives in functools and map returns a lazy iterator)
What is Hadoop ?


•   Framework for large-scale data processing
•   Based on Google’s MapReduce and GFS
•   An Apache Software Foundation project
•   Open Source!
•   Written in Java
•   Oh, btw



Why Hadoop ?


• Need to process lots of data (petabyte scale)
• Need to parallelize processing across a
  multitude of CPUs
• Achieves the above while KeepIng Software
  Simple (KISS)
• Gives scalability with low-cost commodity
  hardware


Hadoop fans




Source: Hadoop Wiki
When to use and not-use Hadoop ?
Hadoop is a good choice for:
• Indexing data
• Log Analysis
• Image manipulation
• Sorting large-scale data
• Data Mining
Hadoop is not a good choice:
• For real-time processing
• For processing-intensive tasks with little data
• If you have a Jaguar or RoadRunner (supercomputer) at your disposal

HDFS – Overview


•   Hadoop Distributed File System
•   Based on Google’s GFS (Google File System)
•   Write-once, read-many access model
•   Fault tolerant
•   Efficient for batch-processing




HDFS – Blocks
[Diagram: Input Data is split into Block 1, Block 2, Block 3]
• HDFS splits input data into blocks
• Block size in HDFS: 64/128 MB (configurable; see the sketch below)
• Block size on typical *nix filesystems: 4 KB
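A minimal sketch (not from the original slides) of overriding the block size for a single file through the Hadoop FileSystem Java API; the path, buffer size, and 128 MB figure are illustrative, and cluster-wide defaults come from the HDFS configuration (hdfs-site.xml):

    // Uses org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.*
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize = 128L * 1024 * 1024;            // 128 MB blocks for this file
    FSDataOutputStream out = fs.create(
        new Path("/user/demo/big.log"),             // illustrative path
        true,                                       // overwrite if it exists
        4096,                                       // I/O buffer size
        (short) 3,                                  // replication factor
        blockSize);
    out.write("hello hdfs".getBytes());
    out.close();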

HDFS – Replication
[Diagram: Block 1, Block 2, Block 3 each stored on more than one node]
• Blocks are replicated across nodes to handle hardware failure; the replication factor is set per file (see the sketch below)
• Node failure is handled gracefully, without loss of data
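As a rough sketch (illustrative path, same FileSystem API as above), the replication factor can be read and changed per file; the NameNode then re-replicates in the background:

    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/user/demo/big.log");              // illustrative path

    short current = fs.getFileStatus(p).getReplication(); // e.g. 3
    fs.setReplication(p, (short) 5);                      // request 5 replicas; applied asynchronously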

HDFS – Architecture

[Diagram: the Client contacts the NameNode for metadata and reads/writes data directly from the DataNodes that form the cluster]

HDFS – NameNode
• NameNode (Master)
   – Manages filesystem metadata
   – Manages replication of blocks
   – Manages read/write access to files
• Metadata
   – List of files
   – List of blocks that constitute a file
   – List of DataNodes on which each block resides, etc. (see the lookup sketch below)
• Single Point of Failure (candidate for spending $$)
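To make the metadata above concrete, here is a small sketch (illustrative path, standard FileSystem API) that asks the NameNode which DataNodes hold each block of a file:

    // FileStatus and BlockLocation come from org.apache.hadoop.fs; Arrays from java.util
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/demo/big.log"));

    BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
    for (BlockLocation b : blocks) {
        // block offset within the file -> DataNode hosts that store a replica
        System.out.println(b.getOffset() + " -> " + Arrays.toString(b.getHosts()));
    }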



HDFS – DataNode
• DataNode (Slave)
   –   Contains actual data
   –   Manages data blocks
   –   Informs NameNode about block IDs stored
   –   Clients read/write data blocks from DataNodes
   –   Performs block replication as instructed by NameNode
• Block Replication
   – Supports various pluggable replication strategies
   – Clients read blocks from the nearest DataNode (see the read sketch after this list)
• Data Pipelining
   – Client writes a block to the first DataNode
   – First DataNode forwards the data to the next DataNode in the pipeline
   – Once a block is replicated to all replicas, the next block is sent
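For completeness, a minimal read sketch (illustrative path): open() asks the NameNode for block locations, and the bytes are then streamed from a nearby DataNode behind an ordinary input stream:

    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/user/demo/big.log"));
    IOUtils.copyBytes(in, System.out, 4096, true);   // org.apache.hadoop.io.IOUtils; copies then closes the stream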


Hadoop - Architecture

[Diagram: the User submits jobs to the JobTracker, which drives the TaskTrackers; the NameNode manages the DataNodes that hold the data]


Hadoop - Terminology
• JobTracker (Master)
   –   1 JobTracker per cluster
   –   Accepts job requests from users (see the submission sketch after this list)
   –   Schedules Map and Reduce tasks on TaskTrackers
   –   Monitors task and TaskTracker status
   –   Re-executes tasks on failure
• TaskTracker (Slave)
   – Multiple TaskTrackers in a cluster
   – Run Map and Reduce tasks
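A rough sketch of how a client program talks to the JobTracker, using the old org.apache.hadoop.mapred API (the same API as the Word Count code later); submitJob hands the job to the JobTracker and returns a handle that can be polled while TaskTrackers execute the tasks:

    JobClient client = new JobClient(conf);          // conf is a JobConf (see Word Count – Config)
    RunningJob job = client.submitJob(conf);
    while (!job.isComplete()) {
        System.out.printf("map %.0f%%  reduce %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
        Thread.sleep(5000);                          // assumes the enclosing method declares InterruptedException
    }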




MapReduce – Flow
[Diagram: Input Data → parallel Map tasks → Shuffle + Sort → Reduce tasks → Output Data]
Word Count
                Hadoop’s HelloWorld




Word Count Example
• Input
  – Text files
• Output
  – Single file containing (Word <TAB> Count)
• Map Phase
  – Generates (Word, Count) pairs
  – [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
• Shuffle + Sort
  – Groups all pairs with the same word together before they reach a reducer
• Reduce Phase
  – For each word, calculates the aggregate count
  – [{a,7}, {b,5}, {c,6}]

Word Count – Mapper
// Old org.apache.hadoop.mapred API
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out,
                    Reporter reporter) throws IOException {
         // key is the byte offset of the line, value is the line itself
         String line = value.toString();
         StringTokenizer t = new StringTokenizer(line);
         while (t.hasMoreTokens()) {
                word.set(t.nextToken());
                out.collect(word, one);   // emit (word, 1)
         }
    }
}




Word Count – Reducer
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out,
                       Reporter reporter) throws IOException {
         // values holds every count emitted for this word
         int sum = 0;
         while (values.hasNext()) {
                sum += values.next().get();
         }
         out.collect(key, new IntWritable(sum));   // emit (word, total)
    }
}




Word Count – Config
public class WordCountConfig {
   public static void main(String[] args) throws Exception {
        if (args.length != 2) {
               System.err.println("Usage: WordCountConfig <input path> <output path>");
               System.exit(1);
        }
        JobConf conf = new JobConf(WordCountConfig.class);
        conf.setJobName("Word Counter");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);   // local pre-aggregation on the map side
        conf.setReducerClass(WordCountReducer.class);

        conf.setOutputKeyClass(Text.class);              // output is (Text word, IntWritable count)
        conf.setOutputValueClass(IntWritable.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        JobClient.runJob(conf);                          // blocks until the job completes
    }
}
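Once compiled and packaged into a jar, the job is typically launched from the command line (paths and jar name are illustrative):

    hadoop jar wordcount.jar WordCountConfig /user/demo/books /user/demo/wc-out

With the default single reducer, the result appears as /user/demo/wc-out/part-00000, one Word<TAB>Count line per word.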

Diving Deeper
• http://hadoop.apache.org/
• Jeffrey Dean and Sanjay Ghemawat, MapReduce:
  Simplified Data Processing on Large Clusters
• Tom White, Hadoop: The Definitive Guide, O’Reilly
• Setting up a Single-Node Cluster: http://bit.ly/glNzs4
• Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP




Catching-Up


• Follow me on twitter @mohitsoni
• http://mohitsoni.com/




