

                    0-60 Hadoop Development
                      in 60 minutes or less
                                    Abe Taha
                            abetaha@karmasphere.com




Saturday, August 14, 2010
Agenda
      • Background
      • Motivation for Hadoop
      • Hadoop Architecture
         - HDFS
         - MapReduce framework
      • Example Jobs
      • Karmasphere Studio
      • Ancillary Hadoop technologies
      • Questions


Background
      • Worked at Yahoo on search and social search
      • Worked at Google on App infrastructure
      • Worked at Ning on Hadoop for analytics and
        system management services
      • Worked at Ask on Dictionary.com and
        Reference.com properties
      • Now at Karmasphere




Motivation for Hadoop
      • Data is growing fast
         - Website usage increasing
         - Logging user events on the rise
         - Disks are becoming cheaper
         - Companies realize insights buried in the data
      • Era of Big Data
         - You know big data when you see it
         - Data so large that extracting insights from it
                in a reasonable amount of time is hard



Big Data example
      • Apache log files are common for web properties
      • Simple format
            -   127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /search?q=book HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"


      • Contains a wealth of information
         - IP address of the client
         - User requesting the resource
         - Date and Time
         - URL Path
         - Result code
         - Object size returned to the client
         - Referrer
         - User-Agent
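The field list above can be pulled out of a log line with one regular expression. This is a minimal sketch, assuming the Apache Combined Log Format with exactly nine fields; the class name LogLineParser and the pattern are illustrative, not part of the original talk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogLineParser {
    // Assumed layout: ip identity user [time] "request" status size "referrer" "user-agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    /** Returns the nine fields in order, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /search?q=book HTTP/1.0\" 200 2326 "
            + "\"http://www.example.com/start.html\" "
            + "\"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
        String[] f = parse(line);
        System.out.println("client=" + f[0] + " user=" + f[2] + " status=" + f[5]);
    }
}
```

Field 0 is the client IP, 2 the user, 3 the timestamp, 4 the request line, 5 the result code, 6 the object size, 7 the referrer, and 8 the user agent.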

Insights in log data
      • The log data contains a wealth of information
         - Duration of user’s visit
         - Most popular queries/pages
         - Most common browsers
         - Geo location of users
         - Flow analysis of user sessions




Typical log data lifecycle
      • Instead of gaining these insights
         - Logs are kept for 30 days
         - Then sent to tape
            ‣ Where they die
            ‣ Except if the government needs to access
              them
      • Sometimes
         - Data is extracted
         - Placed into a data warehouse for future
           processing
         - Not very flexible if data fields change

Solution?
      • Problem prevalent in a lot of search companies
        and at a very large scale
      • In 2004 Google published their take on the
        problem
         - Paper in OSDI ’04
         - MapReduce: Simplified Data Processing on
           Large Clusters
      • System built on cheap commodity hardware,
        and horizontally scalable
      • New paradigm for solving problems
         - Map
         - Reduce

What is MapReduce
      • Old paradigm from functional languages
      • Works on data tuples
      • For each tuple apply a mapper function f: [k1, v1]
        -> [k2, v2]
      • Collect tuples with the same key and apply a
        reduce function g: [k2, [v1, v2, …, vn]] -> [k3, v3]
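The two functions f and g can be written down directly in plain Java. This is a single-machine sketch, not Hadoop code; the class TinyMapReduce and its run method are hypothetical names chosen for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

final class TinyMapReduce {
    /**
     * f: one input record -> zero or more (key, value) pairs.
     * g: (key, [value, ...]) -> one result for that key.
     */
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> f,
            BiFunction<K, List<V>, R> g) {
        // Map step plus grouping: collect every emitted value under its key.
        Map<K, List<V>> groups = new TreeMap<>();
        for (I record : input) {
            for (Map.Entry<K, V> pair : f.apply(record)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // Reduce step: apply g once per distinct key.
        Map<K, R> result = new TreeMap<>();
        for (Map.Entry<K, List<V>> e : groups.entrySet()) {
            result.put(e.getKey(), g.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Group numbers by parity, then add up each group.
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        Function<Integer, List<Map.Entry<String, Integer>>> f =
            n -> Collections.singletonList(
                new AbstractMap.SimpleEntry<>(n % 2 == 0 ? "even" : "odd", n));
        BiFunction<String, List<Integer>, Integer> g =
            (key, values) -> values.stream().mapToInt(Integer::intValue).sum();
        System.out.println(run(numbers, f, g)); // prints {even=6, odd=9}
    }
}
```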




MapReduce (cont’d)
      • To speed up the computation we divide and
        conquer
         - Divide the tuples into manageable groups
         - Process each group of tuples separately
         - Collect tuples with the same key and send them to
           the reduce phase
         - Combine the results together
      • Luckily in most data problems the data records
        are independent




MapReduce Framework
      • Takes care of the scaffolding around the map/reduce
          functions
           - Partition the data across multiple machines
           - Run a function (Map) on each partition in
             parallel
           - Collect the results, and sort them
           - Send the results to multiple machines that
             run a Reduce function
           - Rinse and repeat if needed
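The "collect, sort, and send to multiple machines" step can be sketched as a hash partitioner feeding sorted buckets. This is an in-memory illustration under assumed names (ShuffleSketch, partitionFor, shuffle); hashing the key modulo the number of reducers is a common routing rule, used here as an assumption rather than as a description of any particular Hadoop version.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleSketch {
    /** Picks the reducer for a key; the same key always lands on the same reducer. */
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /**
     * Routes (key, value) pairs emitted by mappers into numReducers buckets.
     * Each bucket is a TreeMap, so a reducer sees its keys in sorted order.
     */
    static List<TreeMap<String, List<Long>>> shuffle(
            List<Map.Entry<String, Long>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Long>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new TreeMap<>());
        }
        for (Map.Entry<String, Long> pair : mapOutput) {
            buckets.get(partitionFor(pair.getKey(), numReducers))
                   .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> mapOutput = new ArrayList<>();
        for (String word : "to be or not to be".split(" ")) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1L));
        }
        List<TreeMap<String, List<Long>>> buckets = shuffle(mapOutput, 2);
        for (int r = 0; r < buckets.size(); r++) {
            System.out.println("reducer " + r + " receives " + buckets.get(r));
        }
    }
}
```

Because routing depends only on the key, every value for a given key reaches the same reducer, which is what lets reducers work independently.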



MapReduce Framework

    Input -> Map --+
    Input -> Map --+                     +-> Reduce -> Output
    Input -> Map --+-> Shuffle & Sort ---+
    Input -> Map --+                     +-> Reduce -> Output
    Input -> Map --+

Example
      • Find the maximum number in a list
      • Luckily max(A) = max(max(A[1..k]), max(A[k+1..N]))
      • A = [1, 2, 3, 4, 5, …, 10]
      • Divide A into chunks
         - A1=[1,..,5]
         - A2=[6,…,10]
      • Map max on A1 to get 5
      • Map max on A2 to get 10
      • Reduce [5,10] by using max to get 10
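The same decomposition in plain Java, as a sketch: take the max of each chunk (the map step), then the max of those partial results (the reduce step). ChunkedMax and chunkedMax are illustrative names, and the chunks run sequentially here although nothing stops them running in parallel.

```java
import java.util.Arrays;

final class ChunkedMax {
    /** Max of each chunk (the "map" step), then max of those (the "reduce" step). */
    static long chunkedMax(long[] a, int chunkSize) {
        if (a.length == 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("need a non-empty array and a positive chunk size");
        }
        int chunks = (a.length + chunkSize - 1) / chunkSize;
        long[] partial = new long[chunks];
        for (int c = 0; c < chunks; c++) {
            int from = c * chunkSize;
            int to = Math.min(a.length, from + chunkSize);
            // Chunks are independent, so each iteration could run on a separate machine.
            partial[c] = Arrays.stream(a, from, to).max().getAsLong();
        }
        return Arrays.stream(partial).max().getAsLong();
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(chunkedMax(a, 5)); // prints 10
    }
}
```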


Another example
      • Add Numbers from 1..100
      • Sum of A[1..100] = Sum of A[1..k] + Sum of A[k+1..p]
          + Sum of A[p+1..100]


             1   2   3   4   5   6   7   8   9   .   .   .   100

             partial sums:   15      N      M

             total:          5050
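A short sketch of the three-way split in Java. The first chunk boundary k = 5 matches the partial sum 15 on the slide; the second boundary p = 50 is an arbitrary assumption, since the slide leaves N and M unspecified. ChunkedSum is an illustrative name.

```java
import java.util.stream.LongStream;

final class ChunkedSum {
    /** Sum of the integers from..to, both inclusive. */
    static long sumRange(long from, long to) {
        return LongStream.rangeClosed(from, to).sum();
    }

    /** Sum of 1..n computed as three independent partial sums split at k and p. */
    static long chunkedSum(long n, long k, long p) {
        long first = sumRange(1, k);       // "map" on chunk 1
        long second = sumRange(k + 1, p);  // "map" on chunk 2
        long third = sumRange(p + 1, n);   // "map" on chunk 3
        return first + second + third;     // "reduce" by adding the partial sums
    }

    public static void main(String[] args) {
        System.out.println(sumRange(1, 5));           // prints 15
        System.out.println(chunkedSum(100, 5, 50));   // prints 5050
    }
}
```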

Another example
      • Canonical word count
      • Divide a text into lowercase words
         - “To be or not to be”
         - to, be, or, not, to, be
      • Mapper
         - For every word emit a tuple (word, 1)
         - (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
      • Collect output by word
         - (to, [1, 1]), (be, [1, 1]), (or, [1]), (not, [1])
      • Reduce the tuples
         - (to, 2), (be, 2), (or, 1), (not, 1)

So how do we run the examples?
      • Using Hadoop
         - Open source implementation of the MR framework
         - Two major components
            ‣ Distributed file system: HDFS
            ‣ Code execution framework: MR




HDFS
      • Stores data in files that are divided into blocks
      • Blocks are large, 64MB by default, to amortize
        the cost of seeks
      • Blocks are stored on multiple machines called
        “Data Nodes”
      • One master node, the “Name Node”, stores
        filesystem metadata including the directory
        hierarchy, file names, and file-to-block mapping
      • All metadata operations go through the Name
        Node; data access, however, goes directly to the
        Data Nodes
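A quick worked example of the block arithmetic, as a sketch. It assumes the 64MB block size quoted above and ignores replication; BlockMath and its methods are illustrative names, not HDFS APIs.

```java
final class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB, the figure quoted above

    /** Number of blocks needed for a file; the last block may be only partly full. */
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    /** Index of the block that holds the given byte offset. */
    static long blockIndex(long offset) {
        return offset / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGiB));                  // prints 16
        System.out.println(blockCount(oneGiB + 1));              // prints 17
        System.out.println(blockIndex(200L * 1024 * 1024));      // prints 3
    }
}
```

The Name Node only has to remember which blocks make up a file and where they live; a client reading byte offset 200MB would ask for block 3 and then fetch it straight from a Data Node.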


HDFS
      • Single point of failure because of the single Name
        Node
         - The Secondary Name Node periodically checkpoints
           the Name Node's metadata; it is not a hot standby
      • Limit on the number of files in the file system,
        as all metadata is held in memory on the
        Name Node
         - Hadoop archive (HAR) files pack many small files
           into one to ease this




MapReduce Framework
      • Execution framework that orchestrates the MR
        jobs
         - Takes care of running the code where the
           data is
         - Partitions the input into chunks
         - Runs the user provided Mappers and collects
           the output, sorts and combines the
           intermediate results
         - Takes care of job failures and task laggards
         - Runs Reducers to summarize results
      • Supports streaming for scripting languages and
        Pipes for C/C++

And how would WordCount look?

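      import java.io.IOException;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reporter;

      public class HadoopMapper extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, LongWritable> {

          // Emits (word, 1) for every token in the input line.
          @Override
          public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
                  throws IOException {

              // Split on commas and whitespace.
              String[] tokens = value.toString().split("[,\\s]+");

              for (String token : tokens) {
                  if (token.isEmpty()) {
                      continue; // a leading separator yields one empty token
                  }
                  output.collect(new Text(token), new LongWritable(1));
              }
          }
      }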




Word Count-Reducer

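      import java.io.IOException;
      import java.util.Iterator;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      public class HadoopReducer extends MapReduceBase
              implements Reducer<Text, LongWritable, Text, LongWritable> {

          @Override
          public void reduce(Text key, Iterator<LongWritable> values,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
                  throws IOException {

              long sum = 0;

              // Add the values instead of counting them, so the result stays
              // correct if partial counts arrive (e.g. from a combiner).
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              output.collect(key, new LongWritable(sum));
          }
      }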




Max - Mapper

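      public void map(LongWritable key, Text value,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          // Each input line holds numbers separated by commas or whitespace.
          String[] numbers = value.toString().trim().split("[,\\s]+");

          // Start from the smallest long so negative inputs are handled.
          long max = Long.MIN_VALUE;

          for (String token : numbers) {
              long number = Long.parseLong(token);
              if (number > max) {
                  max = number;
              }
          }

          // A single constant key sends every partial maximum to one reducer.
          output.collect(new Text("k"), new LongWritable(max));
      }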




Max-Reducer

      public void reduce(Text key, Iterator<LongWritable> values,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          // Start from the smallest long so all-negative inputs are handled.
          long max = Long.MIN_VALUE;

          while (values.hasNext()) {
              long number = values.next().get();
              if (number > max) {
                  max = number;
              }
          }

          output.collect(key, new LongWritable(max));
      }




Sum - Mapper


      public void map(LongWritable key, Text value,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          // Each input line holds numbers separated by commas or whitespace.
          String[] numbers = value.toString().trim().split("[,\\s]+");

          long sum = 0;

          for (String token : numbers) {
              sum += Long.parseLong(token);
          }

          // A single constant key sends every partial sum to one reducer.
          output.collect(new Text("k"), new LongWritable(sum));
      }




Sum - Reducer


      public void reduce(Text key, Iterator<LongWritable> values,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          long sum = 0;

          while (values.hasNext()) {
              sum += values.next().get();
          }

          output.collect(key, new LongWritable(sum));
      }




First Impressions
      • Lots of overhead even for simple examples
      • Can’t test on data before deploying to cluster
         - Bugs
         - Prototyping for data format changes
         - Testing different versions of Hadoop runtime
      • Tools like Karmasphere help with that




Karmasphere Studio
      • For NetBeans and Eclipse
      • Two editions
         - Community (Free)
         - Professional




Community Edition
      • Community edition focuses on
         - Development and prototyping
            ‣ MR workflow development
            ‣ Local execution with multiple Hadoop
              versions
         - Packaging jars
      • Eclipse
         - http://www.hadoopstudio.org/dist/eclipse-community/site.xml
      • NetBeans
         - http://hadoopstudio.org/updates/updates.xml

Professional Edition
      • Professional edition focuses on what happens to
        the job after initial development
         - Profiling and tuning
         - Packaging and deployment (local/colo/ssh
           tunnel/EMR)
         - Support
      • Sign up for the beta on our site
         - http://karmasphere.com/Products-Information/karmasphere-studio.html




Workflow demo (screenshot slides; images not preserved)
      • Create new Java project
      • Add Hadoop libraries
      • Add library
      • Client and MR libraries
      • Demo
      • Create Job
      • Hadoop Jobs
      • MR Job
      • Hadoop Workflow
      • Input Format
      • Mapper
      • Partitioner
      • Comparator
      • Combiner
      • Reducer
      • Output

What happened?
      • Without deploying anything to the cluster, we
        can:
         - See how the job behaves locally
         - Fix bugs if data output does not match
           expectation
         - Experiment with different versions of Hadoop
      • We can also write custom code for each MR
        stage or use the ones provided by Hadoop




Running locally
      • Studio comes with 3 versions of Hadoop
        runtime libraries
         - 0.18
         - 0.19
         - 0.20
      • Can run a job locally as an in-process thread using
        the exported jar
         - Test behavior on different Hadoop runtimes
           without deploying
         - Just need to supply input/output


Looking at file systems (screenshot slides; images not preserved)
      • Local/HDFS/Amazon S3
      • Direct connection or SSH Tunnel
      • Browse, Drag Drop, Copy
      • Monitor File System
      • Amazon Elastic MapReduce (EMR)
      • Amazon S3
      • S3 credentials
      • Monitoring Job Flows
      • Diagnostics
      • Summary
      • Logs
      • Tasks
      • Config

Other Hadoop technologies
      • Cascading
            -   Higher-level data flow language
            -   Operates on sources and sinks
            -   Turns workflows into jobs
            -   Studio includes Cascading support
      • Hive
            -   High-level SQL-like language
            -   Concepts such as tables and queries
            -   Converts SQL to MapReduce
            -   Working on an enterprise-quality Hive-based SQL product
      • Pig
            -   Scripting language
            -   Converts scripts to MR jobs


Questions





Slack Application Development 101 Slidespraypatel2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Seattle hug 2010

  • 1. 0-60 Hadoop Development in 60 minutes or less Abe Taha abetaha@karmasphere.com Saturday, August 14, 2010
  • 2. Agenda • Background • Motivation for Hadoop • Hadoop Architecture - HDFS - MapReduce framework • Example Jobs • Karmasphere Studio • Ancillary Hadoop technologies • Questions
  • 3. Background • Worked at Yahoo on search and social search • Worked at Google on App infrastructure • Worked at Ning on Hadoop for analytics and system management services • Worked at Ask on Dictionary.com and Reference.com properties • Now at Karmasphere
  • 4. Motivation for Hadoop • Data is growing fast - Website usage increasing - Logging user events on the rise - Disks are becoming cheaper - Companies realize insights are buried in the data • Era of Big Data - You know big data when you see it - Data large enough that extracting insights from it in a reasonable amount of time is hard
  • 5. Big Data example • Apache log files are common for web properties • Simple format - 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /search?q=book HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)" • Contains a wealth of information - IP address of the client - User requesting the resource - Date and Time - URL Path - Result code - Object size returned to the client - Referrer - User-Agent
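The fields listed on this slide can be pulled out of a log line with a single regular expression. The following is a minimal sketch in plain Java, not code from the talk: the class name `AccessLogLine`, its field names, and the exact pattern are assumptions made for illustration, and the pattern only covers lines shaped like the sample above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not from the talk): parses one log line in the format shown on the slide.
class AccessLogLine {
    // Groups: 1 client IP, 2 identity, 3 user, 4 timestamp, 5 request line,
    //         6 result code, 7 object size, 8 referrer, 9 user agent.
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    final String clientIp;
    final String user;
    final String timestamp;
    final String request;
    final int status;
    final long bytes;
    final String referrer;
    final String userAgent;

    private AccessLogLine(Matcher m) {
        clientIp = m.group(1);
        user = m.group(3);
        timestamp = m.group(4);
        request = m.group(5);
        status = Integer.parseInt(m.group(6));
        // Apache writes "-" when no body was returned; treat that as zero bytes.
        bytes = "-".equals(m.group(7)) ? 0L : Long.parseLong(m.group(7));
        referrer = m.group(8);
        userAgent = m.group(9);
    }

    static AccessLogLine parse(String line) {
        Matcher m = COMBINED.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized log line: " + line);
        }
        return new AccessLogLine(m);
    }

    // The URL path is the second token of the request line ("GET /path HTTP/1.0").
    String urlPath() {
        String[] parts = request.split(" ");
        return parts.length >= 2 ? parts[1] : "";
    }
}
```

Calling `AccessLogLine.parse(line).urlPath()` on a line like the sample gives the requested path, which is the kind of field a mapper would typically emit as a key.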
  • 6. Insights in log data • The log data contains a wealth of information - Duration of user’s visit - Most popular queries/pages - Most common browsers - Geo location of users - Flow analysis of user sessions
  • 7. Typical log data lifecycle • Instead of gaining these insights - Logs are kept for 30 days - Then sent to tape ‣ Where they die ‣ Except if the government needs to access them • Sometimes - Data is extracted - Placed into a data warehouse for future processing - Not very flexible, if data fields change
  • 8. Solution? • Problem prevalent in a lot of search companies and at a very large scale • In 2004 Google published their take on the problem - Paper in OSDI ’04 - MapReduce: Simplified Data Processing on Large Clusters • System built on cheap commodity hardware, and horizontally scalable • New paradigm for solving problems - Map - Reduce
  • 9. What is MapReduce • Old paradigm from functional languages • Works on data tuples • For each tuple apply a mapper function f: [k1, v1] -> [k2, v2] • Collect tuples with the same key and apply a combining (reduce) function g: [k2, [v1, v2, …, vn]] -> [k3, v3]
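The two functions f and g on this slide can be sketched as ordinary higher-order functions. The block below is a single-machine illustration in plain Java (Java 9 or later is assumed for `Map.entry`), not Hadoop's API: the class `MiniMapReduce` and its `run` method are hypothetical names, and the mapper here takes a whole input record rather than a [k1, v1] pair to keep the signature short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical single-process model of the map -> group by key -> reduce flow.
class MiniMapReduce {
    // mapper  plays the role of f: one record  -> zero or more (k2, v2) tuples
    // reducer plays the role of g: (k2, [v2s]) -> one result per key
    static <I, K extends Comparable<K>, V, R> Map<K, R> run(
            List<I> input,
            Function<I, List<Map.Entry<K, V>>> mapper,
            BiFunction<K, List<V>, R> reducer) {
        // Map phase, with grouping by key done as tuples arrive.
        // A TreeMap keeps keys sorted, loosely mirroring the sort step of a real framework.
        Map<K, List<V>> groups = new TreeMap<>();
        for (I record : input) {
            for (Map.Entry<K, V> tuple : mapper.apply(record)) {
                groups.computeIfAbsent(tuple.getKey(), k -> new ArrayList<>())
                      .add(tuple.getValue());
            }
        }
        // Reduce phase: one call per distinct key.
        Map<K, R> output = new TreeMap<>();
        for (Map.Entry<K, List<V>> group : groups.entrySet()) {
            output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
        }
        return output;
    }
}
```

As a usage example, feeding it records such as `"200 2326"` (result code, bytes) with a mapper that emits `(code, bytes)` and a reducer that sums the list yields total bytes served per result code.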
  • 10. MapReduce (cont’d) • To speed up the computation we divide and conquer - Divide the tuples into manageable groups - Process each group of tuples separately - Collect tuples with the same key and send them to the reduce phase - Combine the results together • Luckily in most data problems the data records are independent
  • 11. MapReduce Framework • Takes care of the scaffolding around the map/reduce functions - Partition the data across multiple machines - Run a function (Map) on each partition in parallel - Collect the results, and sort them - Send the results to multiple machines that run a Reduce function - Rinse and repeat if needed
  • 12. MapReduce Framework [Diagram: five Input -> Map pairs feed a Shuffle & Sort stage, which feeds two Reduce -> Output pairs]
  • 13. Example • Find the maximum number in a list • Luckily max(A) = max(max(A[1..k]), max(A[k+1..N])) • A = [1, 2, 3, 4, 5, …, 10] • Divide A into chunks - A1=[1,…,5] - A2=[6,…,10] • Map max on A1 to get 5 • Map max on A2 to get 10 • Reduce [5,10] by using max to get 10
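The decomposition on this slide can be checked with a few lines of plain Java. This is an illustrative sketch only: `ChunkedMax` and its methods are hypothetical names, the chunks are processed sequentially, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "max of the per-chunk maxima".
class ChunkedMax {
    // Used both as the map step (per chunk) and as the reduce step (over partial results).
    static long maxOf(List<Long> numbers) {
        if (numbers.isEmpty()) {
            throw new IllegalArgumentException("empty list has no maximum");
        }
        long max = Long.MIN_VALUE; // also correct when every value is negative
        for (long n : numbers) {
            if (n > max) {
                max = n;
            }
        }
        return max;
    }

    // Split the list into consecutive chunks of at most chunkSize elements.
    static List<List<Long>> chunks(List<Long> a, int chunkSize) {
        List<List<Long>> out = new ArrayList<>();
        for (int i = 0; i < a.size(); i += chunkSize) {
            out.add(a.subList(i, Math.min(i + chunkSize, a.size())));
        }
        return out;
    }

    static long mapReduceMax(List<Long> a, int chunkSize) {
        List<Long> partialMaxima = new ArrayList<>();
        for (List<Long> chunk : chunks(a, chunkSize)) {
            partialMaxima.add(maxOf(chunk)); // map: one maximum per chunk
        }
        return maxOf(partialMaxima);         // reduce: maximum of the maxima
    }
}
```

With the list 1 to 10 and a chunk size of 5, the partial maxima are 5 and 10, and the reduce step returns 10, matching the slide.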
  • 14. Another example • Add Numbers from 1..100 • Sum of A[1..100] = Sum of A[1..k] + Sum of A[k+1..p] + Sum of A[p+1..100] [Diagram: the numbers 1 to 100 are split into chunks whose partial sums 15, N, and M add up to 5050]
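The same identity can be verified numerically. The sketch below uses plain Java streams; the class `ChunkedSum` and the split points k = 5 and p = 50 are assumptions chosen so that the first partial sum is the 15 shown in the diagram.

```java
import java.util.stream.LongStream;

// Hypothetical sketch: sum 1..n as three independently computed partial sums.
class ChunkedSum {
    // Map step: sum one contiguous chunk [from, to].
    static long rangeSum(long from, long to) {
        return LongStream.rangeClosed(from, to).sum();
    }

    // Reduce step: add the partial sums of [1..k], [k+1..p], and [p+1..n].
    static long sumInThreeChunks(long n, long k, long p) {
        long[] partialSums = {
            rangeSum(1, k),
            rangeSum(k + 1, p),
            rangeSum(p + 1, n)
        };
        return LongStream.of(partialSums).sum();
    }
}
```

Because addition is associative, any choice of split points gives the same total, which is what lets a framework hand each chunk to a different machine.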
  • 15. Another example • Canonical word count • Divide a text into words - “To be or not to be” - To, be, or, not, to, be • Mapper - For every word emit a tuple (word, 1), ignoring case - (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1) • Collect output by word - (to, [1, 1]), (be, [1, 1]), (or, [1]), (not, [1]) • Reduce the tuples - (to, 2), (be, 2), (or, 1), (not, 1)
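The three stages on this slide (map, collect by word, reduce) can be written out explicitly. This is a single-process sketch in plain Java (Java 9 or later for `Map.entry`), not the Hadoop job itself: `WordCountSketch` and its method names are hypothetical, and lowercasing each word is an assumption made so that "To" and "to" land in the same group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of word count as three explicit phases.
class WordCountSketch {
    // Map: emit (word, 1) for every word in the text.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tuples.add(Map.entry(word.toLowerCase(), 1));
            }
        }
        return tuples;
    }

    // Shuffle: collect the values of all tuples that share a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> tuples) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> tuple : tuples) {
            grouped.computeIfAbsent(tuple.getKey(), k -> new ArrayList<>())
                   .add(tuple.getValue());
        }
        return grouped;
    }

    // Reduce: sum the list of ones for each word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int one : group.getValue()) {
                sum += one;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> count(String text) {
        return reduce(shuffle(map(text)));
    }
}
```

Running `WordCountSketch.count("To be or not to be")` walks through the same intermediate tuples listed on the slide and ends with a count of 2 for "to" and "be" and 1 for "or" and "not".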
  • 16. So how do we run the examples • Using Hadoop - Open source implementation of MR framework - Two major components ‣ Distributed file system--HDFS ‣ Code execution framework--MR
  • 17. HDFS • Stores data in files that are divided into blocks • Blocks are large, usually 64MB, to amortize the cost of seeks • Blocks are stored on multiple machines called “Data Nodes” • One master node, the “Name Node”, stores filesystem meta-data including the directory hierarchy, file names, and file to block mapping • All meta-data operations go through the Name Node; however, data access goes directly to the Data Nodes
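To make the block arithmetic concrete, the sketch below computes how many blocks a file occupies at the 64MB block size quoted on the slide. It is plain Java with hypothetical names (`BlockMath`, `blockCount`, `lastBlockSize`) and does not call any HDFS API; real clusters may be configured with a different block size.

```java
// Hypothetical helper: block arithmetic for a file split into fixed-size blocks.
class BlockMath {
    // 64MB, the typical block size mentioned on the slide (an assumption for any given cluster).
    static final long DEFAULT_BLOCK_SIZE = 64L * 1024 * 1024;

    // Number of blocks needed to hold fileSize bytes (ceiling division).
    static long blockCount(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative and block size positive");
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Size in bytes of the final block, which holds only the leftover data.
    static long lastBlockSize(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }
}
```

For example, a 200MB file needs four blocks at this size: three full blocks and a final block holding the remaining 8MB. Each block is one entry in the Name Node's file-to-block mapping.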
  • 18. HDFS • Single point of failure because of single Name Node - Mitigation: Secondary Name Node that periodically checkpoints the Name Node's meta-data (it is not a hot standby) • Limitation on number of files in the file system as all meta-data is stored in memory on the Name Node - Mitigation: Hadoop archive files
  • 19. MapReduce Framework • Execution framework that orchestrates the MR jobs - Takes care of running the code where the data is - Partitions the input into chunks - Runs the user provided Mappers and collects the output, sorts and combines the intermediate results - Takes care of job failures and task laggards - Runs Reducers to summarize results • Supports streaming for scripting languages and Pipes for C/C++
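  • 20. And how would WordCount look?
        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        public class HadoopMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, LongWritable> output, Reporter reporter)
                    throws IOException {
                // Split the input line on commas and whitespace.
                String[] line = value.toString().split("[,\\s]+");
                for (String token : line) {
                    if (token.isEmpty()) {
                        continue; // a leading separator produces an empty first token
                    }
                    // Emit (word, 1) for every word.
                    output.collect(new Text(token), new LongWritable(1));
                }
            }
        }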
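  • 21. Word Count-Reducer
        import java.io.IOException;
        import java.util.Iterator;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reducer;
        import org.apache.hadoop.mapred.Reporter;

        public class HadoopReducer extends MapReduceBase
                implements Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            public void reduce(Text key, Iterator<LongWritable> value,
                               OutputCollector<Text, LongWritable> output, Reporter reporter)
                    throws IOException {
                long sum = 0;
                while (value.hasNext()) {
                    // Add the emitted counts instead of counting the values, so the
                    // result stays correct if a combiner has already pre-aggregated them.
                    sum += value.next().get();
                }
                output.collect(key, new LongWritable(sum));
            }
        }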
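  • 22. Max - Mapper
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            // Split the input line on commas and whitespace.
            String[] numbers = value.toString().split("[,\\s]+");
            // Start from the smallest long so negative inputs are handled too.
            long max = Long.MIN_VALUE;
            for (String token : numbers) {
                if (token.isEmpty()) {
                    continue; // a leading separator produces an empty first token
                }
                long number = Long.parseLong(token);
                if (number > max) {
                    max = number;
                }
            }
            // A single constant key sends every partial maximum to the same reducer.
            output.collect(new Text("k"), new LongWritable(max));
        }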
  • 23. Max-Reducer
        public void reduce(Text key, Iterator<LongWritable> value,
                           OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            // Start from the smallest long so negative inputs are handled too.
            long max = Long.MIN_VALUE;
            while (value.hasNext()) {
                long number = value.next().get();
                if (number > max) {
                    max = number;
                }
            }
            output.collect(key, new LongWritable(max));
        }
  • 24. Sum - Mapper
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            // Split the input line on commas and whitespace.
            String[] numbers = value.toString().split("[,\\s]+");
            long sum = 0;
            for (String token : numbers) {
                if (token.isEmpty()) {
                    continue; // a leading separator produces an empty first token
                }
                sum += Long.parseLong(token);
            }
            // A single constant key sends every partial sum to the same reducer.
            output.collect(new Text("k"), new LongWritable(sum));
        }
  • 25. Sum - Reducer
        public void reduce(Text key, Iterator<LongWritable> value,
                           OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            long sum = 0;
            while (value.hasNext()) {
                sum += value.next().get();
            }
            output.collect(key, new LongWritable(sum));
        }
  • 26. First Impressions • Lots of overhead even for simple examples • Can’t test on data before deploying to cluster - Bugs - Prototyping for data format changes - Testing different versions of Hadoop runtime • Tools like Karmasphere help with that
  • 27. Karmasphere Studio • For NetBeans and Eclipse • Two editions - Community (Free) - Professional
  • 28. Community Edition • Community edition focuses on - Development and prototyping ‣ MR workflow development ‣ Local execution with multiple Hadoop versions - Packaging jars • Eclipse - http://www.hadoopstudio.org/dist/eclipse-community/site.xml • NetBeans - http://hadoopstudio.org/updates/updates.xml
  • 29. Professional Edition • Professional edition focuses on what happens to the job after initial development - Profiling and tuning - Packaging and deployment (local/colo/ssh tunnel/EMR) - Support • Sign up for beta on our site - http://karmasphere.com/Products-Information/karmasphere-studio.html
  • 30. Workflow demo
  • 31. Create new Java project
  • 32. Add Hadoop libraries
  • 33. Add library
  • 34. Client and MR libraries
  • 35. Demo
  • 36. Create Job
  • 37. Hadoop Jobs
  • 38. MR Job
  • 39. Hadoop Workflow
  • 40. Input Format
  • 41. Mapper
  • 42. Partitioner
  • 43. Comparator
  • 44. Combiner
  • 45. Reducer
  • 46. Output
  • 47. What happened? • Without deploying anything to the cluster, we can: - See how the job behaves locally - Fix bugs if data output does not match expectation - Experiment with different versions of Hadoop • We can also write custom code for each MR stage or use the ones provided by Hadoop
  • 48. Running locally • Studio comes with 3 versions of Hadoop runtime libraries - 0.18 - 0.19 - 0.20 • Can run the job locally as an in-process thread using the exported jar - Test behavior on different Hadoop runtimes without deploying - Just need to supply input/output
  • 49. Looking at file systems
  • 50. Local/HDFS/Amazon S3
  • 51. Direct connection or SSH Tunnel
  • 52. Browse, Drag Drop, Copy
  • 53. Monitor File System
  • 54. Amazon Elastic MapReduce (EMR)
  • 55. Amazon S3
  • 56. S3 credentials
  • 57. Monitoring Job Flows
  • 58. Diagnostics
  • 59. Summary
  • 60. Logs
  • 61. Tasks
  • 62. Config
  • 63. Other Hadoop technologies • Cascading - Higher level data flow language - Operates on sources and sinks - Turns workflows into jobs - Studio includes Cascading support • Hive - High level SQL-like language - Concepts such as tables and queries - Converts SQL to MapReduce - Working on enterprise quality Hive-based SQL product • Pig - Scripting language - Converts script to MR job
  • 64. Questions