

                    0-60 Hadoop Development
                     in 60 minutes or less

                           Abe Taha
                    abetaha@karmasphere.com

                 Seattle HUG, August 14, 2010




Agenda
      • Background
      • Motivation for Hadoop
      • Hadoop Architecture
         - HDFS
         - MapReduce framework
      • Example Jobs
      • Karmasphere Studio
      • Ancillary Hadoop technologies
      • Questions


Background
      • Worked at Yahoo on search and social search
      • Worked at Google on App infrastructure
      • Worked at Ning on Hadoop for analytics and
        system management services
      • Worked at Ask on Dictionary.com and
        Reference.com properties
      • Now at Karmasphere




Motivation for Hadoop
      • Data is growing fast
         - Website usage increasing
         - Logging user events on the rise
         - Disks are becoming cheaper
         - Companies realize insights buried in the data
      • Era of Big Data
         - You know big data when you see it
          - Data large enough that extracting insights
            in a reasonable amount of time becomes a
            challenge



Big Data example
      • Apache log files are common for web properties
      • Simple format
            -   127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /search?q=book HTTP/1.0" 200
                2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"


      • Contains a wealth of information
         - IP address of the client
         - User requesting the resource
         - Date and Time
         - URL Path
         - Result code
         - Object size returned to the client
         - Referrer
         - User-Agent
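
      • The fields can be pulled apart with one regular expression; a
        minimal Java sketch (the class name and pattern are illustrative,
        with one capture group per field in the order listed above):

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class LogLineParse {
          // Groups: IP, ident, user, timestamp, request, status, bytes,
          // referrer, user-agent
          private static final Pattern COMBINED = Pattern.compile(
              "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" " +
              "(\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

          public static void main(String[] args) {
              String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                  + "\"GET /search?q=book HTTP/1.0\" 200 2326 "
                  + "\"http://www.example.com/start.html\" "
                  + "\"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
              Matcher m = COMBINED.matcher(line);
              if (m.matches()) {
                  System.out.println("ip=" + m.group(1) + " status="
                      + m.group(6) + " bytes=" + m.group(7));
              }
          }
      }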
Insights in log data
      • The log data contains a wealth of information
         - Duration of user’s visit
         - Most popular queries/pages
         - Most common browsers
         - Geo location of users
         - Flow analysis of user sessions




Typical log data lifecycle
      • Instead of gaining these insights
         - Logs are kept for 30 days
         - Then sent to tape
            ‣ Where they die
            ‣ Except if the government needs to access
              them
      • Sometimes
         - Data is extracted
         - Placed into a data warehouse for future
           processing
          - Not very flexible if data fields change

Solution?
      • Problem prevalent in a lot of search companies
        and at a very large scale
      • In 2004 Google published their take on the
        problem
         - Paper in OSDI ’04
         - MapReduce: Simplified Data Processing on
           Large Clusters
      • System built on cheap commodity hardware,
        and horizontally scalable
      • New paradigm for solving problems
         - Map
         - Reduce
What is MapReduce
      • Old paradigm from functional languages
      • Works on data tuples
      • For each tuple apply a mapper function
        f: [k1, v1] -> [k2, v2]
      • Collect tuples with the same key and apply a
        combine function g: [k2, [v1, v2, ..., vn]] -> [k3, v3],
        as sketched below
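
      • A plain-Java sketch of those two shapes (hypothetical interfaces
        for illustration, not the Hadoop API):

      import java.util.List;
      import java.util.Map;

      // f: (k1, v1) -> (k2, v2)
      interface MapFn<K1, V1, K2, V2> {
          Map.Entry<K2, V2> apply(K1 key, V1 value);
      }

      // g: (k2, [v1, v2, ..., vn]) -> (k3, v3)
      interface ReduceFn<K2, V2, K3, V3> {
          Map.Entry<K3, V3> apply(K2 key, List<V2> values);
      }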




MapReduce (cont’d)
      • To speed up the computation we divide and
        conquer
         - Divide the tuples into manageable groups
         - Process each group of tuples separately
         - Collect similar tuples and send them to the
           reduce phase
         - Combine the results together
      • Luckily in most data problems the data records
        are independent




MapReduce Framework
      • Takes care of the scaffolding around the
        map/reduce functions
           - Partition the data across multiple machines
           - Run a function (Map) on each partition in
             parallel
           - Collect the results, and sort them
           - Send the results to multiple machines that
             run a Reduce function
           - Rinse and repeat if needed



MapReduce Framework

    [Diagram: five Input splits feed five parallel Map tasks; all Map
    output passes through a single Shuffle & Sort step, which feeds
    two Reduce tasks, each writing its own Output]
Example
      • Find the maximum number in a list
      • Luckily max(A) = max(max(A[1..k]), max(A[k+1..N]))
      • A = [1, 2, 3, 4, 5, …, 10]
      • Divide A into chunks
         - A1=[1,..,5]
         - A2=[6,…,10]
      • Map max on A1 to get 5
      • Map max on A2 to get 10
      • Reduce [5,10] by using max to get 10
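
      • The same flow in plain Java (a sketch, not Hadoop code):

      import java.util.Collections;
      import java.util.List;

      public class MaxSketch {
          public static void main(String[] args) {
              // Divide A into chunks
              List<List<Integer>> chunks = List.of(
                  List.of(1, 2, 3, 4, 5),    // A1
                  List.of(6, 7, 8, 9, 10));  // A2

              int max = chunks.stream()
                  .map(Collections::max)                  // map: per-chunk max
                  .reduce(Integer.MIN_VALUE, Math::max);  // reduce: max of maxes

              System.out.println(max); // prints 10
          }
      }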


Another example
      • Add numbers from 1..100
      • Sum(A[1..100]) = Sum(A[1..k]) + Sum(A[k+1..p])
        + Sum(A[p+1..100])


             [Diagram: the numbers 1..100 are split into chunks; each
             chunk is summed independently (1..5 gives 15, later chunks
             give N and M), and the partial sums combine to 5050]

Another example
      • Canonical word count
      • Divide a text into words
         - “To be or not to be”
         - To, be, or, not, to, be
      • Mapper
         - For every word emit a tuple (word, 1)
         - (To, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
      • Collect output by word (ignoring case)
         - (To, [1, 1]), (be, [1, 1]), (or, [1]), (not, [1])
      • Reduce the tuples
         - (To, 2), (be, 2), (or, 1), (not, 1)
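
      • The same walk-through in plain Java (again a sketch, not Hadoop
        code; the input is lower-cased so grouping by key is exact):

      import java.util.Arrays;
      import java.util.List;
      import java.util.Map;
      import java.util.stream.Collectors;

      public class WordCountSketch {
          public static void main(String[] args) {
              // Mapper: for every word emit a tuple (word, 1)
              List<Map.Entry<String, Integer>> emitted =
                  Arrays.stream("to be or not to be".split(" "))
                        .map(w -> Map.entry(w, 1))
                        .collect(Collectors.toList());

              // Collect output by word: (word, [1, 1, ...])
              Map<String, List<Integer>> grouped = emitted.stream()
                  .collect(Collectors.groupingBy(Map.Entry::getKey,
                      Collectors.mapping(Map.Entry::getValue,
                                         Collectors.toList())));

              // Reduce: (word, count)
              grouped.forEach((word, ones) ->
                  System.out.println(word + " -> " + ones.size()));
          }
      }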
So how do we run the examples?
      • Using Hadoop
         - Open source implementation of the MR
           framework
         - Two major components
            ‣ Distributed file system--HDFS
            ‣ Code execution framework--MR




HDFS
      • Stores data in files that are divided into blocks
      • Blocks are large, usually 64MB, to amortize
        the cost of seeks
      • Blocks are stored on multiple machines called
        “Data Nodes”
      • One master node, the “Name Node”, stores
        filesystem metadata including the directory
        hierarchy, file names, and file-to-block mapping
      • All metadata operations go through the Name
        Node; data access, however, goes directly to the
        data nodes, as the sketch below shows
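
      • A minimal client sketch (the /logs path is hypothetical): the
        listStatus call below is a pure Name Node operation, while opening
        and reading a file would stream from the Data Nodes:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsList {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration(); // picks up fs.default.name
              FileSystem fs = FileSystem.get(conf);
              // Directory listing only talks to the Name Node (metadata)
              for (FileStatus s : fs.listStatus(new Path("/logs"))) {
                  System.out.println(s.getPath() + "\t" + s.getLen());
              }
              fs.close();
          }
      }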


HDFS
      • Single point of failure because of the single
        Name Node
         - Mitigated by a Secondary Name Node that
           replicates transactions from the Name Node
      • Limit on the number of files in the file system,
        as all metadata is stored in memory on the
        Name Node
         - Mitigated by Hadoop archive files




MapReduce Framework
      • Execution framework that orchestrates the MR
        jobs
         - Takes care of running the code where the
           data is
         - Partitions the input into chunks
         - Runs the user provided Mappers and collects
           the output, sorts and combines the
           intermediate results
         - Takes care of job failures and task laggards
         - Runs Reducers to summarize results
      • Supports streaming for scripting languages and
        Pipes for C/C++
And how would WordCount look?

      import java.io.IOException;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reporter;

      public class HadoopMapper extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, LongWritable> {

          @Override
          public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
                  throws IOException {
              // Split the line on commas and whitespace
              String[] tokens = value.toString().split("[,\\s]+");

              // Emit (word, 1) for every token
              for (String token : tokens) {
                  output.collect(new Text(token), new LongWritable(1));
              }
          }
      }




Word Count-Reducer

      import java.io.IOException;
      import java.util.Iterator;

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      public class HadoopReducer extends MapReduceBase
              implements Reducer<Text, LongWritable, Text, LongWritable> {

          @Override
          public void reduce(Text key, Iterator<LongWritable> values,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
                  throws IOException {
              // Each incoming value is a 1 emitted by the mapper, so
              // counting the values counts the word's occurrences
              long sum = 0;
              while (values.hasNext()) {
                  ++sum;
                  values.next();
              }
              output.collect(key, new LongWritable(sum));
          }
      }
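
      • The slides stop at the mapper and reducer; a minimal driver wiring
        the two together (a sketch against the same old mapred API; the
        class name and argument handling are illustrative):

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.FileInputFormat;
      import org.apache.hadoop.mapred.FileOutputFormat;
      import org.apache.hadoop.mapred.JobClient;
      import org.apache.hadoop.mapred.JobConf;

      public class WordCountDriver {
          public static void main(String[] args) throws Exception {
              JobConf conf = new JobConf(WordCountDriver.class);
              conf.setJobName("wordcount");

              conf.setMapperClass(HadoopMapper.class);
              conf.setReducerClass(HadoopReducer.class);
              conf.setOutputKeyClass(Text.class);
              conf.setOutputValueClass(LongWritable.class);

              // Input and output locations come from the command line
              FileInputFormat.setInputPaths(conf, new Path(args[0]));
              FileOutputFormat.setOutputPath(conf, new Path(args[1]));

              JobClient.runJob(conf);
          }
      }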




Max - Mapper

      public void map(LongWritable key, Text value,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          // Split on commas and whitespace
          String[] numbers = value.toString().split("[,\\s]+");

          long max = -1; // assumes the inputs are non-negative

          for (String token : numbers) {
              long number = Long.parseLong(token);
              if (number > max) {
                  max = number;
              }
          }

          // Constant key so a single reducer sees every per-split maximum
          output.collect(new Text("k"), new LongWritable(max));
      }




Max-Reducer

      public void reduce(Text key, Iterator<LongWritable> values,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          long max = 0; // assumes at least one value is non-negative

          while (values.hasNext()) {
              long number = values.next().get();
              if (number > max) {
                  max = number;
              }
          }

          output.collect(key, new LongWritable(max));
      }




Sum - Mapper


      public void map(LongWritable key, Text value,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          // Split on commas and whitespace
          String[] numbers = value.toString().split("[,\\s]+");

          long sum = 0;
          for (String token : numbers) {
              sum += Long.parseLong(token);
          }

          // Constant key so a single reducer sees every partial sum
          output.collect(new Text("k"), new LongWritable(sum));
      }




Sum - Reducer


      public void reduce(Text key, Iterator<LongWritable> values,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          // Add up the partial sums from the mappers
          long sum = 0;
          while (values.hasNext()) {
              sum += values.next().get();
          }
          output.collect(key, new LongWritable(sum));
      }




First Impressions
      • Lots of overhead even for simple examples
      • Can’t test on data before deploying to cluster
         - Bugs
         - Prototyping for data format changes
         - Testing different versions of Hadoop runtime
      • Tools like Karmasphere help with that




Karmasphere Studio
      • For NetBeans and Eclipse
      • Two editions
         - Community (Free)
         - Professional




Community Edition
      • Community edition focuses on
         - Development and prototyping
            ‣ MR workflow development
            ‣ Local execution with multiple Hadoop
              versions
         - Packaging jars
      • Eclipse
         - http://www.hadoopstudio.org/dist/eclipse-community/site.xml
      • NetBeans
         - http://hadoopstudio.org/updates/updates.xml
Professional Edition
      • Professional edition focuses on what happens to
        the job after initial development
         - Profiling and tuning
         - Packaging and deployment (local/colo/ssh
           tunnel/EMR)
         - Support
      • Sign-up for beta on our site
         - http://karmasphere.com/Products-Information/karmasphere-studio.html




Workflow demo




Create new Java project




Add Hadoop libraries




Add library




Client and MR libraries




Demo




Create Job




Hadoop Jobs




MR Job




Hadoop Workflow




Input Format




Mapper




Partitioner




Comparator




Combiner




Reducer




Output




What happened?
      • Without deploying anything to the cluster, we
        can:
         - See how the job behaves locally
         - Fix bugs if data output does not match
           expectation
         - Experiment with different versions of Hadoop
      • We can also write custom code for each MR
        stage or use the ones provided by Hadoop




Running locally
      • Studio comes with 3 versions of the Hadoop
        runtime libraries
         - 0.18
         - 0.19
         - 0.20
      • Can run the job locally, as an in-process thread,
        using the exported jar
         - Test behavior on different Hadoop runtimes
           without deploying
         - Just need to supply input/output paths
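
      • Outside Studio, plain Hadoop offers a similar local mode; a sketch
        assuming the 0.20-era property names:

      import org.apache.hadoop.mapred.JobConf;

      public class LocalRun {
          public static void main(String[] args) {
              JobConf conf = new JobConf(LocalRun.class);
              conf.set("mapred.job.tracker", "local"); // run tasks in this JVM
              conf.set("fs.default.name", "file:///"); // use the local filesystem
              // ...then set the mapper/reducer and input/output paths as in
              // the WordCountDriver sketch and call JobClient.runJob(conf)
          }
      }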


Looking at file systems




Local/HDFS/Amazon S3




Direct connection or SSH Tunnel




Browse, Drag Drop, Copy




Monitor File System




Amazon Elastic MapReduce (EMR)




Amazon S3




S3 credentials




Monitoring Job Flows




Diagnostics




Summary




Logs




Tasks




Config




Other Hadoop technologies
      • Cascading
         - Higher-level data flow language
         - Operates on sources and sinks
         - Turns workflows into jobs
         - Studio includes Cascading support
      • Hive
         - High-level SQL-like language
         - Concepts such as tables and queries
         - Converts SQL to MapReduce
         - Working on an enterprise-quality, Hive-based
           SQL product
      • Pig
         - Scripting language
         - Converts scripts to MR jobs

Questions




