0-60: Hadoop Development in 60 Minutes or Less
Abe Taha
abetaha@karmasphere.com
Agenda
      • Background
      • Motivation for Hadoop
      • Hadoop Architecture
         - HDFS
         - MapReduce framework
      • Example Jobs
      • Karmasphere Studio
      • Ancillary Hadoop technologies
      • Questions
Background
      • Worked at Yahoo on search and social search
      • Worked at Google on App infrastructure
      • Worked at Ning on Hadoop for analytics and system
        management services
      • Worked at Ask on the Dictionary.com and
        Reference.com properties
      • Now at Karmasphere
Motivation for Hadoop
      • Data is growing fast
         - Website usage increasing
         - Logging of user events on the rise
         - Disks are becoming cheaper
         - Companies realize there are insights buried in the data
      • Era of Big Data
         - You know big data when you see it
         - Data large enough that extracting insights from it
           in a reasonable amount of time becomes hard
Big Data example
      • Apache log files are common for web properties
      • Simple format
         - 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /search?q=book
           HTTP/1.0" 200 2326 "http://www.example.com/start.html"
           "Mozilla/4.08 [en] (Win98; I ;Nav)"
      • Contains a wealth of information (pulled apart in the
        sketch below)
         - IP address of the client
         - User requesting the resource
         - Date and time
         - URL path
         - Result code
         - Object size returned to the client
         - Referrer
         - User-Agent
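      • A sketch of extracting those fields with a regular
        expression (illustrative only; the pattern and class
        name are assumptions, not from the deck):

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      public class LogLineParser {
          // Combined Log Format:
          // host ident user [date] "request" status bytes "referrer" "user-agent"
          private static final Pattern LOG = Pattern.compile(
              "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" " +
              "(\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

          public static void main(String[] args) {
              String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                  + "\"GET /search?q=book HTTP/1.0\" 200 2326 "
                  + "\"http://www.example.com/start.html\" \"Mozilla/4.08 [en] (Win98; I ;Nav)\"";
              Matcher m = LOG.matcher(line);
              if (m.matches()) {
                  System.out.println("client IP : " + m.group(1));  // 127.0.0.1
                  System.out.println("user      : " + m.group(3));  // frank
                  System.out.println("request   : " + m.group(5));  // GET /search?q=book HTTP/1.0
                  System.out.println("status    : " + m.group(6));  // 200
                  System.out.println("bytes     : " + m.group(7));  // 2326
                  System.out.println("referrer  : " + m.group(8));
                  System.out.println("user-agent: " + m.group(9));
              }
          }
      }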
Insights in log data
      • The log data contains a wealth of information
         - Duration of a user's visit
         - Most popular queries/pages
         - Most common browsers
         - Geo location of users
         - Flow analysis of user sessions
Typical log data lifecycle
      • Instead of gaining these insights
         - Logs are kept for 30 days
         - Then sent to tape
            ‣ Where they die
            ‣ Unless the government needs to access them
      • Sometimes
         - Data is extracted
         - Placed into a data warehouse for future processing
         - Not very flexible if data fields change
Solution?
      • Problem prevalent in a lot of search companies,
        and at a very large scale
      • In 2004 Google published their take on the problem
         - Paper in OSDI '04
         - "MapReduce: Simplified Data Processing on Large
           Clusters"
      • System built on cheap commodity hardware, and
        horizontally scalable
      • New paradigm for solving problems
         - Map
         - Reduce
What is MapReduce
      • Old paradigm from functional languages
      • Works on data tuples
      • For each tuple apply a mapper function
        f: [k1, v1] -> [k2, v2]
      • Collect tuples with the same key and apply a
        combine (reduce) function
        g: [k2, [v1, v2, ..., vn]] -> [k3, v3]
        (sketched in code below)
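      • To make the f/g signatures concrete, a toy single-
        machine sketch (illustrative only; plain Java, not the
        Hadoop API that appears later in the deck):

      import java.util.*;
      import java.util.function.*;

      public class MiniMapReduce {
          // f maps each (k1, v1) tuple to a (k2, v2) tuple;
          // g reduces all values that share a key into a single v3.
          static <K1, V1, K2, V2, V3> Map<K2, V3> mapReduce(
                  List<Map.Entry<K1, V1>> tuples,
                  Function<Map.Entry<K1, V1>, Map.Entry<K2, V2>> f,
                  BiFunction<K2, List<V2>, V3> g) {
              Map<K2, List<V2>> grouped = new HashMap<>();
              for (Map.Entry<K1, V1> t : tuples) {                     // map phase
                  Map.Entry<K2, V2> m = f.apply(t);
                  grouped.computeIfAbsent(m.getKey(), k -> new ArrayList<>())
                         .add(m.getValue());
              }
              Map<K2, V3> out = new HashMap<>();                       // reduce phase
              grouped.forEach((k, vs) -> out.put(k, g.apply(k, vs)));
              return out;
          }
      }

      • For word count, f would emit (word, 1) and g would sum
        the list; that is exactly the shape of the Hadoop
        examples coming up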
MapReduce (cont’d)
      • To speed up the computation we divide and
        conquer
         - Divide the tuples into manageable groups
         - Process each group of tuples separately
         - Collect similar tuples and send them to the
           reduce phase
         - Combine the results together
      • Luckily, in most data problems the data records are
        independent
MapReduce Framework
      • Takes care of the scaffolding around the map/
        reduce functions
         - Partitions the data across multiple machines
         - Runs a function (Map) on each partition in parallel
         - Collects the results and sorts them
         - Sends the results to multiple machines that run a
           Reduce function
         - Rinse and repeat if needed
MapReduce Framework
      [Diagram: several Input splits each feed a parallel Map
      task; a Shuffle & Sort stage groups the intermediate
      tuples by key and routes them to Reduce tasks, which
      write the final Output]
Example
      • Find the maximum number in a list
      • Luckily, max(A) = max(max(A[1..k]), max(A[k+1..N]))
      • A = [1, 2, 3, 4, 5, ..., 10]
      • Divide A into chunks
         - A1 = [1, ..., 5]
         - A2 = [6, ..., 10]
      • Map max over A1 to get 5
      • Map max over A2 to get 10
      • Reduce [5, 10] using max to get 10
Another example
      • Add the numbers from 1..100
      • Sum of A[1..100] = Sum of A[1..k] + Sum of
        A[k+1..p] + Sum of A[p+1..100]
      [Diagram: the list 1, 2, 3, ..., 100 split into chunks;
      partial sums (e.g. 15 for the first chunk 1..5) are
      reduced to the grand total 5050]
Another example
      • Canonical word count (treating "To" and "to" as the
        same word)
      • Divide a text into words
         - "To be or not to be"
         - To, be, or, not, to, be
      • Mapper
         - For every word emit a tuple (word, 1)
         - (To, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
      • Collect output by word
         - (To, [1, 1]), (be, [1, 1]), (or, [1]), (not, [1])
      • Reduce the tuples
         - (To, 2), (be, 2), (or, 1), (not, 1)
So how do we run the examples
      • Using Hadoop
         - Open source implementation of the MR
           framework
         - Two major components
            ‣ Distributed file system -- HDFS
            ‣ Code execution framework -- MR
HDFS
      • Stores data in files that are divided into blocks
      • Blocks are large, usually 64MB, to amortize
        the cost of seeks
      • Blocks are stored on multiple machines called
        "Data Nodes"
      • One master node, the "Name Node", stores filesystem
        metadata including the directory hierarchy, file
        names, and the file-to-block mapping
      • All metadata operations go through the Name Node;
        data access, however, goes directly to the Data Nodes
        (see the sketch below)
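      • A minimal client-side sketch of that split (illustrative
        only, using the era-appropriate org.apache.hadoop.fs
        API): opening a file consults the Name Node for block
        locations, while the bytes stream from the Data Nodes

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsCat {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();  // picks up fs settings from the classpath
              FileSystem fs = FileSystem.get(conf);      // metadata ops go to the Name Node
              BufferedReader in = new BufferedReader(
                  new InputStreamReader(fs.open(new Path(args[0]))));
              String line;
              while ((line = in.readLine()) != null) {   // block data streams from the Data Nodes
                  System.out.println(line);
              }
              in.close();
          }
      }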
HDFS
      • Single point of failure because of the single Name
        Node
         - A Secondary Name Node replicates all transactions
           from the Name Node
      • Limit on the number of files in the file system, as all
        metadata is stored in memory on the Name Node
         - Mitigated by Hadoop archive files
MapReduce Framework
      • Execution framework that orchestrates the MR
        jobs
         - Takes care of running the code where the data is
         - Partitions the input into chunks
         - Runs the user-provided Mappers and collects the
           output, sorts and combines the intermediate results
         - Takes care of job failures and task laggards
         - Runs Reducers to summarize results
      • Supports Streaming for scripting languages and
        Pipes for C/C++
And how would WordCount look?

      // Imports for this and the following code slides
      import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      public class HadoopMapper extends MapReduceBase implements
              Mapper<LongWritable, Text, Text, LongWritable> {
          @Override
          public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
                  throws IOException {
              // Split the line on commas and whitespace
              String[] line = value.toString().split("[,\\s]+");
              for (String token : line) {
                  output.collect(new Text(token), new LongWritable(1));  // emit (word, 1)
              }
          }
      }
Word Count - Reducer

      public class HadoopReducer extends MapReduceBase implements
              Reducer<Text, LongWritable, Text, LongWritable> {
          @Override
          public void reduce(Text key, Iterator<LongWritable> value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
                  throws IOException {
              long sum = 0;
              while (value.hasNext()) {
                  sum += value.next().get();  // sum the counts (each is 1 from the mapper;
              }                               // also correct if a combiner pre-aggregates)
              output.collect(key, new LongWritable(sum));
          }
      }
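      • The deck shows only the mapper and reducer; submitting
        the job also needs a small driver. A minimal sketch against
        the old JobConf API (the class name WordCountJob and the
        argument handling are assumptions, not from the deck):

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      public class WordCountJob {
          public static void main(String[] args) throws Exception {
              JobConf conf = new JobConf(WordCountJob.class);
              conf.setJobName("wordcount");
              conf.setOutputKeyClass(Text.class);
              conf.setOutputValueClass(LongWritable.class);
              conf.setMapperClass(HadoopMapper.class);
              conf.setCombinerClass(HadoopReducer.class);  // safe: the reducer sums its values
              conf.setReducerClass(HadoopReducer.class);
              FileInputFormat.setInputPaths(conf, new Path(args[0]));
              FileOutputFormat.setOutputPath(conf, new Path(args[1]));
              JobClient.runJob(conf);                      // submit and block until done
          }
      }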
Max - Mapper

      public void map(LongWritable key, Text value,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          String[] numbers = value.toString().split("[,\\s]+");
          long max = Long.MIN_VALUE;  // so negative inputs are handled too
          for (String token : numbers) {
              long number = Long.parseLong(token);
              if (number > max) {
                  max = number;
              }
          }
          // Single constant key so all partial maxima meet in one reducer
          output.collect(new Text("k"), new LongWritable(max));
      }
Max - Reducer

      public void reduce(Text key, Iterator<LongWritable> value,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          long max = Long.MIN_VALUE;
          while (value.hasNext()) {
              long number = value.next().get();  // each value is a partial maximum
              if (number > max) {
                  max = number;
              }
          }
          output.collect(key, new LongWritable(max));
      }
Sum - Mapper

      public void map(LongWritable key, Text value,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          String[] numbers = value.toString().split("[,\\s]+");
          long sum = 0;
          for (String token : numbers) {
              sum += Long.parseLong(token);  // partial sum for this input split
          }
          output.collect(new Text("k"), new LongWritable(sum));
      }
Sum - Reducer

      public void reduce(Text key, Iterator<LongWritable> value,
              OutputCollector<Text, LongWritable> output, Reporter reporter)
              throws IOException {
          long sum = 0;
          while (value.hasNext()) {
              sum += value.next().get();  // combine the partial sums
          }
          output.collect(key, new LongWritable(sum));
      }
First Impressions
      • Lots of overhead even for simple examples
      • Can't test on data before deploying to the cluster
         - Bugs
         - Prototyping for data format changes
         - Testing different versions of the Hadoop runtime
      • Tools like Karmasphere help with that
Karmasphere Studio
      • For NetBeans and Eclipse
      • Two editions
         - Community (Free)
         - Professional
Community Edition
      • The Community edition focuses on
         - Development and prototyping
            ‣ MR workflow development
            ‣ Local execution with multiple Hadoop versions
         - Packaging jars
      • Eclipse
         - http://www.hadoopstudio.org/dist/eclipse-community/site.xml
      • NetBeans
         - http://hadoopstudio.org/updates/updates.xml
Professional Edition
      • The Professional edition focuses on what happens to
        the job after initial development
         - Profiling and tuning
         - Packaging and deployment (local/colo/SSH tunnel/EMR)
         - Support
      • Sign up for the beta on our site
         - http://karmasphere.com/Products-Information/karmasphere-studio.html
Workflow demo
      [Screenshot walkthrough: Create new Java project; Add
      Hadoop libraries; Add library; Client and MR libraries;
      Demo; Create Job; Hadoop Jobs; MR Job; Hadoop Workflow;
      Input Format; Mapper; Partitioner; Comparator; Combiner;
      Reducer; Output]
What happened?
      • Without deploying anything to the cluster, we can:
         - See how the job behaves locally
         - Fix bugs if the data output does not match expectations
         - Experiment with different versions of Hadoop
      • We can also write custom code for each MR stage, or
        use the ones provided by Hadoop
Running locally
      • Studio comes with 3 versions of the Hadoop
        runtime libraries
         - 0.18
         - 0.19
         - 0.20
      • Can run the job locally as an in-proc thread using the
        exported jar
         - Test behavior on different Hadoop runtimes without
           deploying
         - Just need to supply input/output
Looking at file systems
      [Screenshots: Local/HDFS/Amazon S3; Direct connection or
      SSH Tunnel; Browse, Drag Drop, Copy; Monitor File System]
Amazon Elastic MapReduce (EMR)
      [Screenshots: Amazon S3; S3 credentials; Monitoring Job
      Flows; Diagnostics; Summary; Logs; Tasks; Config]
Other Hadoop technologies
      • Cascading
         - Higher level data flow language
         - Operates on sources and sinks
         - Turns workflows into jobs
         - Studio includes Cascading support
      • Hive
         - High level SQL-like language
         - Concepts such as tables and queries
         - Converts SQL to MapReduce
         - Working on an enterprise-quality Hive-based SQL product
      • Pig
         - Scripting language
         - Converts scripts to MR jobs
Questions