MapReduce Application Scripting

Workshop conducted for Teradata, Islamabad

Published in: Technology

  1. 8: MapReduce Application Scripting
     Zubair Nabi (zubair.nabi@itu.edu.pk), May 25, 2013
  2. Outline
     1. Pig Latin
     2. Cascading
  4. Introduction
     • MapReduce is too low-level and rigid, which leads to a lot of custom user code
     • Pig Latin is a dataflow language atop MapReduce, designed at Yahoo!
     • It finds the sweet spot between the declarative style of SQL and the low-level interface of MapReduce
     • The Pig system compiles Pig Latin queries into physical plans that are executed atop Hadoop
  5. SQL query to find the average pagerank for each large category of URLs

         SELECT category, AVG(pagerank)
         FROM urls WHERE pagerank > 0.2
         GROUP BY category HAVING COUNT(*) > 1000000
  6. Equivalent Pig query

         good_urls = FILTER urls BY pagerank > 0.2;
         groups = GROUP good_urls BY category;
         big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
         output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
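The four steps of the Pig query can be traced in plain Java over an in-memory list. This is only a sketch of the dataflow, not how Pig executes it: the class `PigQuerySketch`, the `Url` record type, and the method `averagePagerank` are hypothetical names, and the size threshold is a parameter so that a toy input produces a result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of the four Pig Latin steps; every name here is hypothetical.
class PigQuerySketch {

    // One row of the hypothetical `urls` dataset.
    static final class Url {
        final String category;
        final double pagerank;

        Url(String category, double pagerank) {
            this.category = category;
            this.pagerank = pagerank;
        }
    }

    // Returns category -> average pagerank, for categories with more than minCount good URLs.
    static Map<String, Double> averagePagerank(List<Url> urls, double minRank, long minCount) {
        // Step 1 (FILTER) and step 2 (GROUP): keep good URLs and bucket them by category.
        Map<String, List<Url>> groups = new TreeMap<>();
        for (Url u : urls) {
            if (u.pagerank > minRank) {
                groups.computeIfAbsent(u.category, k -> new ArrayList<>()).add(u);
            }
        }
        // Step 3 (FILTER on group size) and step 4 (FOREACH ... GENERATE AVG).
        Map<String, Double> output = new TreeMap<>();
        for (Map.Entry<String, List<Url>> e : groups.entrySet()) {
            List<Url> bag = e.getValue();
            if (bag.size() > minCount) {
                double sum = 0.0;
                for (Url u : bag) {
                    sum += u.pagerank;
                }
                output.put(e.getKey(), sum / bag.size());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Url> urls = Arrays.asList(
            new Url("news", 0.5), new Url("news", 0.25), new Url("news", 0.1),
            new Url("blog", 0.75));
        // A threshold of 1 instead of one million, so the toy input yields a row.
        System.out.println(averagePagerank(urls, 0.2, 1));
    }
}
```

Each intermediate variable in the Pig query corresponds to one loop or map here, which is the sense in which a Pig Latin program reads as a sequence of steps.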
  7. Pig Interface
     • A Pig Latin program is a sequence of steps, reminiscent of traditional programming languages
     • In contrast, SQL consists of declarative constraints that collectively define the result
     • Each step carries out a single data transformation
     • Writing a Pig Latin program is similar to specifying a query execution plan or a dataflow graph
     • Due to this dataflow model, it is easier for programmers to understand and control how their data processing task is executed
  8. Features
     • Support for a fully nested data model with complex data types
     • Extensive support for user-defined functions
     • Ability to operate over plain, schema-less input files
     • Open-source Apache project
  9. Interoperability
     • Queries can be performed directly atop raw data dumps
     • The user needs to provide a function to parse the content of the file into tuples
     • Similarly, the user also needs to provide a function to convert tuples into a byte sequence
     • Datasets can be laid out across diverse data storage sources and applications
  10. UDFs as first-class citizens
     • A significant part of large-scale data analysis relies on custom processing
     • For instance, the user may be interested in figuring out whether a particular website is spam
     • All aspects of processing in Pig Latin, including grouping, filtering, joining, and per-tuple processing, can be customized via UDFs
     • UDFs can take non-atomic parameters as input and produce non-atomic values as output
     • UDFs are defined in Java

         groups = GROUP urls BY category;
         output = FOREACH groups GENERATE category, top10(urls);
  11. Data Model
     Pig has four data types:
     1. Atom: A single atomic value such as a string or an integer
     2. Tuple: A sequence of values, each with possibly a different data type
     3. Bag: A collection of tuples
     4. Map: A collection of data items, each with an associated key
  12. Commands
     • LOAD: Load and deserialize an input file
     • FOREACH: Process each tuple of a dataset
     • FILTER: Filter a dataset based on some condition or UDF
     • COGROUP: Group together tuples which are related in some way from one or more datasets
     • GROUP: Group together tuples which are related in some way from one dataset
     • STORE: Materialize the output of a Pig Latin expression to a file
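COGROUP is the least familiar of these commands, so a small plain-Java sketch of its result shape may help. It assumes two datasets of (key, value) rows and produces, for every key, one bag of values per input dataset. The class `CogroupSketch` and its methods are hypothetical and only illustrate the semantics; they are not part of Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of COGROUP over two datasets of (key, value) rows.
class CogroupSketch {

    // Returns key -> [bag of values from left, bag of values from right].
    static Map<String, List<List<String>>> cogroup(List<String[]> left, List<String[]> right) {
        Map<String, List<List<String>>> groups = new TreeMap<>();
        collect(groups, left, 0);
        collect(groups, right, 1);
        return groups;
    }

    // Adds each row's value to the bag of its dataset (side 0 or 1) under the row's key.
    private static void collect(Map<String, List<List<String>>> groups,
                                List<String[]> rows, int side) {
        for (String[] row : rows) {
            List<List<String>> bags = groups.get(row[0]);
            if (bags == null) {
                bags = new ArrayList<>();
                bags.add(new ArrayList<>());
                bags.add(new ArrayList<>());
                groups.put(row[0], bags);
            }
            bags.get(side).add(row[1]);
        }
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
            new String[] {"a", "1"}, new String[] {"a", "2"}, new String[] {"b", "3"});
        List<String[]> right = Arrays.asList(
            new String[] {"a", "x"}, new String[] {"c", "y"});
        System.out.println(cogroup(left, right));
    }
}
```

For this input the program prints `{a=[[1, 2], [x]], b=[[3], []], c=[[], [y]]}`. A key that appears in only one dataset still gets a group, with an empty bag for the other side. GROUP corresponds to the case with a single input dataset.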
  13. Other Commands
     • UNION: Return the union of two or more bags
     • CROSS: Return the cross product of two or more bags
     • ORDER: Order a bag by a specified field
     • DISTINCT: Eliminate duplicate tuples in a bag
  14. MapReduce in Pig Latin

         map_result = FOREACH input GENERATE FLATTEN(map(*));
         key_groups = GROUP map_result BY $0;
         output = FOREACH key_groups GENERATE reduce(*);
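The same three steps can be mimicked in plain Java to show what each line contributes. This is a sketch under simplifying assumptions (everything fits in memory, keys are processed in sorted order); `MapReduceSketch`, `mapReduce`, and `wordCount` are hypothetical names and the word-count map and reduce functions are chosen only for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical in-memory rendering of the three Pig Latin steps above.
class MapReduceSketch {

    static <I, K, V, O> List<O> mapReduce(List<I> input,
                                          Function<I, List<Map.Entry<K, V>>> map,
                                          BiFunction<K, List<V>, O> reduce) {
        // map_result = FOREACH input GENERATE FLATTEN(map(*));
        List<Map.Entry<K, V>> mapResult = new ArrayList<>();
        for (I record : input) {
            mapResult.addAll(map.apply(record));
        }
        // key_groups = GROUP map_result BY $0;
        Map<K, List<V>> keyGroups = new TreeMap<>();
        for (Map.Entry<K, V> kv : mapResult) {
            keyGroups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // output = FOREACH key_groups GENERATE reduce(*);
        List<O> output = new ArrayList<>();
        for (Map.Entry<K, List<V>> group : keyGroups.entrySet()) {
            output.add(reduce.apply(group.getKey(), group.getValue()));
        }
        return output;
    }

    // Word count expressed as a map function and a reduce function.
    static List<String> wordCount(List<String> lines) {
        return MapReduceSketch.<String, String, Integer, String>mapReduce(
            lines,
            line -> {
                List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                for (String word : line.split(" ")) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
                return pairs;
            },
            (word, ones) -> word + "=" + ones.size());
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c")));
    }
}
```

FLATTEN matters in the first step because one input record may emit several key-value pairs; without it each record would yield a single nested bag rather than separate tuples to group.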
  15. Outline, part 2: Cascading
  16. Introduction
     • Many applications require a chain of MapReduce jobs
     • Cascading allows the creation of processing pipelines using languages that run atop the JVM
     • Source-pipe-sink paradigm
         • Data comes from sources
         • Pipes perform data analysis
         • Results are written to sinks
  17. Terminology
     • Pipe: data stream
     • Tuple: data record
     • Branch: chain of pipes
     • Pipe Assembly: set of pipe branches
     • Tap: data source or sink
     • Flow: pipe assembly bound to its source and sink taps
     • Cascade: a collection of flows, in which one flow depends on the output of another
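Because the flows in a cascade depend on each other's output, they have to run in dependency order. The sketch below shows one way such an order could be derived by matching each flow's sources against other flows' sinks. It is an illustration of the idea only, not Cascading's scheduler; `CascadeOrderSketch`, `FlowSpec`, and the resource names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: order flows so that each runs after the flows that produce its inputs.
class CascadeOrderSketch {

    // A flow reduced to its name, the resource it writes, and the resources it reads.
    static final class FlowSpec {
        final String name;
        final String sink;
        final List<String> sources;

        FlowSpec(String name, String sink, String... sources) {
            this.name = name;
            this.sink = sink;
            this.sources = Arrays.asList(sources);
        }
    }

    static List<String> executionOrder(List<FlowSpec> flows) {
        Set<String> allSinks = new HashSet<>();
        for (FlowSpec f : flows) {
            allSinks.add(f.sink);
        }
        Set<String> produced = new HashSet<>();
        List<FlowSpec> pending = new ArrayList<>(flows);
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            boolean progressed = false;
            Iterator<FlowSpec> it = pending.iterator();
            while (it.hasNext()) {
                FlowSpec f = it.next();
                if (isReady(f, allSinks, produced)) {
                    order.add(f.name);
                    produced.add(f.sink);
                    it.remove();
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic dependency between flows");
            }
        }
        return order;
    }

    // A flow is ready when every source is external input or has already been produced.
    private static boolean isReady(FlowSpec f, Set<String> allSinks, Set<String> produced) {
        for (String source : f.sources) {
            if (allSinks.contains(source) && !produced.contains(source)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<FlowSpec> flows = Arrays.asList(
            new FlowSpec("report", "report.txt", "counts"),
            new FlowSpec("count", "counts", "words"),
            new FlowSpec("tokenize", "words", "raw.txt"));
        System.out.println(executionOrder(flows));
    }
}
```

With the three flows listed in reverse order, the sketch prints `[tokenize, count, report]`.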
  18. Pipes
     • Base class: Pipe
     • Each: Analyze, transform, or filter individual tuples
     • Merge: Combine streams with the same fields into one
     • GroupBy: Group tuples based on common values in a specified field
     • CoGroup: Join streams (similar to SQL join)
     • Every: Aggregate tuples
     • HashJoin: Similar to CoGroup but more efficient if one stream can be held in memory
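The reason a join gets cheaper when one stream fits in memory can be seen in a generic hash join: the small side is loaded into a hash table once, and the large side is streamed past it without any grouping or sorting. The sketch below is a textbook inner hash join in plain Java, not Cascading's HashJoin implementation; `HashJoinSketch` and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical textbook hash join over (key, value) rows.
class HashJoinSketch {

    // Inner join: emits "key,largeValue,smallValue" for every matching pair.
    static List<String> hashJoin(List<String[]> large, List<String[]> small) {
        // Build phase: the small stream is held entirely in memory.
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : small) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Probe phase: the large stream is read once, row by row.
        List<String> joined = new ArrayList<>();
        for (String[] row : large) {
            List<String> matches = table.get(row[0]);
            if (matches == null) {
                continue;
            }
            for (String match : matches) {
                joined.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[] {"1", "click"}, new String[] {"2", "view"}, new String[] {"1", "buy"});
        List<String[]> users = Arrays.asList(
            new String[] {"1", "alice"}, new String[] {"3", "carol"});
        System.out.println(hashJoin(events, users));
    }
}
```

Here the program prints `[1,click,alice, 1,buy,alice]`; rows whose key appears on only one side are dropped, as in an inner join.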
  19. Pipe Assemblies
     • Define the processing of tuple streams
     • Tuples are read from and written to taps
     • Processing includes filtering, transforming, organizing, and calculating
     • Can use multiple taps
     • May also define splits, merges, and joins to manipulate tuple streams
  20. Example: Pipe Assembly (figure not captured in this transcript)
  21. Example: Pipe Assembly (2)

         // SomeFunction, SomeFilter, and SomeAggregator are placeholder operations
         // the "left hand side" branch
         Pipe lhs = new Pipe( "lhs" );
         lhs = new Each( lhs, new SomeFunction() );
         lhs = new Each( lhs, new SomeFilter() );

         // the "right hand side" branch
         Pipe rhs = new Pipe( "rhs" );
         rhs = new Each( rhs, new SomeFunction() );

         // join the two branches, then aggregate, regroup, and aggregate again
         Pipe join = new CoGroup( lhs, rhs );
         join = new Every( join, new SomeAggregator() );
         join = new GroupBy( join );
         join = new Every( join, new SomeAggregator() );

         // the tail of the assembly
         join = new Each( join, new SomeFunction() );
  22. Data Processing
     • Operation: Accept an input tuple, process it, and output zero or more tuples
     • Tuple: Array of fields
     • Field: Defines a data type, such as string, integer, etc.
  23. Taps
     • Data flows in and out of taps
     • Represent data sources and sinks, such as local files, distributed FS files, etc.
     • Each tap is associated with a scheme that describes the data, such as TextLine, TextDelimited, etc.
     • Sinks have modes such as SinkMode.KEEP, SinkMode.REPLACE, and SinkMode.UPDATE
  24. Flows
     • Represent entire pipelines
     • A pipeline reads data from a source, processes it, and then writes it to a sink
  25. Example: Flow

         // the pipe assembly: two branches joined and aggregated
         Pipe lhs = new Pipe( "lhs" );
         lhs = new Each( lhs, new SomeFunction() );
         lhs = new Each( lhs, new SomeFilter() );
         Pipe rhs = new Pipe( "rhs" );
         rhs = new Each( rhs, new SomeFunction() );
         Pipe join = new CoGroup( lhs, rhs );
         join = new Every( join, new SomeAggregator() );

         // taps: two sources and one sink on the Hadoop file system
         Tap lhsSource = new Hfs( new TextLine(), "lhs.txt" );
         Tap rhsSource = new Hfs( new TextLine(), "rhs.txt" );
         Tap sink = new Hfs( new TextLine(), "output" );

         // bind each branch head to its source and the tail to the sink
         FlowDef flowDef = new FlowDef()
             .setName( "flow-name" )
             .addSource( rhs, rhsSource )
             .addSource( lhs, lhsSource )
             .addTailSink( join, sink );
         Flow flow = new HadoopFlowConnector().connect( flowDef );
  26. Operations
     • Operations manipulate data
     • Four kinds:
         1. Function
         2. Filter
         3. Aggregator
         4. Buffer
     • Take an input tuple and emit zero or more tuples
     • Filter returns a Boolean
     • Must be wrapped in either an Each or an Every pipe
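How the operation kinds differ is easiest to see side by side. The sketch below imitates, in plain Java, a function and a filter applied per tuple (the Each case) and a counting aggregator applied per group (the Every case). It deliberately uses `java.util.function` types rather than the Cascading interfaces, so `OperationsSketch` and its method names are hypothetical. The filter follows the convention that returning true discards the tuple. Buffer is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-ins for Function, Filter, and Aggregator operations on string tuples.
class OperationsSketch {

    // Each + Function: every input tuple is replaced by zero or more output tuples.
    static List<String> eachFunction(List<String> tuples, Function<String, List<String>> fn) {
        List<String> out = new ArrayList<>();
        for (String tuple : tuples) {
            out.addAll(fn.apply(tuple));
        }
        return out;
    }

    // Each + Filter: a tuple is dropped when the predicate returns true.
    static List<String> eachFilter(List<String> tuples, Predicate<String> isRemove) {
        List<String> out = new ArrayList<>();
        for (String tuple : tuples) {
            if (!isRemove.test(tuple)) {
                out.add(tuple);
            }
        }
        return out;
    }

    // GroupBy + Every + counting Aggregator: one count per distinct tuple value.
    static Map<String, Integer> everyCount(List<String> tuples) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tuple : tuples) {
            counts.merge(tuple, 1, Integer::sum);
        }
        return counts;
    }

    // Chains the three steps: split lines into words, drop the word "or", count the rest.
    static Map<String, Integer> wordCounts(List<String> lines) {
        List<String> words = eachFunction(lines, line -> Arrays.asList(line.split(" ")));
        List<String> kept = eachFilter(words, word -> word.equals("or"));
        return everyCount(kept);
    }

    public static void main(String[] args) {
        System.out.println(wordCounts(Arrays.asList("to be or", "not to be")));
    }
}
```

The per-tuple steps never need to see more than one tuple at a time, while the aggregator only makes sense after grouping, which is why functions and filters go in Each pipes and aggregators in Every pipes.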
  27. Example: Wordcount

         // source reads lines of text; sink writes (word, count) pairs
         Scheme sourceScheme = new TextLine( new Fields( "line" ) );
         Tap source = new Hfs( sourceScheme, inputPath );
         Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
         Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

         Pipe assembly = new Pipe( "wordcount" );
         // RegexGenerator emits one tuple per regex match, so the pattern must
         // match the words themselves: here, runs of non-whitespace characters
         String regex = "\\S+";
         Function function = new RegexGenerator( new Fields( "word" ), regex );
         assembly = new Each( assembly, new Fields( "line" ), function );
         // group by word and count the tuples in each group
         assembly = new GroupBy( assembly, new Fields( "word" ) );
         Aggregator count = new Count( new Fields( "count" ) );
         assembly = new Every( assembly, count );

         // in Cascading 2.x the Hadoop connector is HadoopFlowConnector
         FlowConnector flowConnector = new HadoopFlowConnector();
         Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
         flow.complete();
  28. References
     1. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). ACM, New York, NY, USA, 1099-1110.
     2. Cascading 2.1 User Guide: http://docs.cascading.org/cascading/2.1/userguide/pdf/userguide.pdf
