Cascading introduction


Cascading is a data processing API, process planner, and process scheduler used for defining and executing complex, scale-free, and fault-tolerant data processing workflows on an Apache Hadoop cluster.

Published in: Technology, Business

  • Cascading and its extensions have their own Maven/Ivy jar repository. The 1.2 release runs against Hadoop 0.19.x and 0.20.x, including Amazon Elastic MapReduce, as well as 0.21. Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad-hoc querying, and analysis of large datasets stored in Hadoop files. At one level Cascading is a MapReduce query planner, just like Pig, except that the Cascading API is meant for public consumption and is fully extensible: in Pig you typically interact with the PigLatin text syntax, while with Cascading you can layer your own syntax on top of the API. Given a data set on which you want to run a number of groupBys (i.e. group by key1, generate value1, ... group by keyN, generate valueN), Cascading's primary programming model is similar to Pig's but with a Java API; Pig would optimize from N down to a smaller number (e.g. 1) of reduce runs. Oozie workflows are actions arranged in a control-dependency DAG (Directed Acyclic Graph). Cascading runs as a client from the command line, whereas Oozie is a server system (like the Hadoop JobTracker) to which you submit workflow jobs and later check their status.
  • By providing a clean API to the core Cascading model, tools like Jython, Groovy, and JRuby can be used instead to define complex processing flows.
  • The MapReduce job planner is an internal feature of Cascading. Every job is delimited by a temporary file that is the sink of the first job and the source of the next; the temporary file is deleted whether the flow succeeds or fails, though this is configurable. If two or more Flow instances have no dependencies, they are submitted together so they can execute in parallel. Internally a DAG (directed acyclic graph) is built that makes each Flow a vertex and each file an edge; when a vertex has all its incoming edges (files) available, it is scheduled on the cluster, in topological order. By default, if any outputs of a Flow are newer than its inputs, the Flow is skipped. Note that you cannot customize the combiner or the partitioner.
  • DOT is a plain-text graph description language that many tools can parse. To see how your Flows are partitioned, call the Flow#writeDOT() method; this writes a DOT file. The writeDOT API is not useful for logging.
  • All Taps must have a Scheme associated with them. If the Tap is about where the data is and how to get it, the Scheme is about what the data is. TextLine reads and writes raw text files and returns Tuples with two field names by default, "offset" and "line". TextDelimited handles delimited text files (CSV, TSV, etc.). SequenceFile is based on the Hadoop sequence file, which is a binary format. WritableSequenceFile is like the SequenceFile Scheme, except that it was designed to read and write key and/or value Hadoop Writable objects directly.
  • MultiSourceTap: cascading.tap.MultiSourceTap ties multiple Tap instances into a single Tap for use as an input source. The only restriction is that all the Tap instances passed to a new MultiSourceTap share the same Scheme classes (not necessarily the same Scheme instance). MultiSinkTap: cascading.tap.MultiSinkTap ties multiple Tap instances into a single Tap for use as an output sink; at runtime, for every Tuple output by the pipe assembly, each child tap of the MultiSinkTap sinks the Tuple. TemplateTap: writes tuple streams out to subdirectories based on the values in the Tuple instance. The constructor takes an Hfs Tap and a Formatter format syntax String, allowing Tuple values at given positions to be used as directory names. Note that Hadoop can only sink to directories, and all files in those directories are "part-xxxxx" files. openTapsThreshold limits the number of open output files; it defaults to 300, and each time the threshold is exceeded, 10% of the least recently used open files are closed. For example:

        TextDelimited scheme = new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
        Hfs tap = new Hfs( scheme, path );
        String template = "%s-%s"; // dirs named "year-month"
        Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );

    GlobHfs (extends MultiSourceTap): cascading.tap.GlobHfs accepts Hadoop-style 'file globbing' expression patterns. This allows multiple paths to be used as a single source, where all paths match the given pattern. The semantics of file globbing with a PathFilter (using the globStatus method of FileSystem) changed: previously the filtering was too restrictive, so that a glob of /*/* and a filter that only accepts /a/b would not have matched /a/b; with this change, /a/b does match.
  • SinkMode.KEEP: the default behavior; if the resource exists, attempting to write to it will fail. SinkMode.REPLACE: allows Cascading to delete the file immediately after the Flow is started. SinkMode.UPDATE: allows for new Tap types that have a concept of update or append, for example updating records in a database. It is up to the Tap to decide how to implement its "update" semantics; when Cascading sees the update mode, it knows not to attempt to delete the resource first and not to fail because it already exists.
  • Avro is a data serialization system providing functionality similar to systems such as Thrift and Protocol Buffers. Cascading.SimpleDB provides integration with Amazon SimpleDB.
  • It is not required that an Every follow either a GroupBy or a CoGroup; an Each may follow immediately after them (for example, DISTINCT). But an Every may not follow an Each. The Each pipe may only apply Functions and Filters to the tuple stream, as these operations operate on one Tuple at a time. The Every pipe may only apply Aggregators and Buffers, as these operations operate on groups of tuples, one grouping at a time. GroupBy supports ordering.
  • Self joins are supported, though in practice a naive self join would fail since the result Tuple has duplicate field names. A mixed join is where three or more tuple streams are joined and each pair must be joined differently; see the cascading.pipe.cogroup.MixedJoin class for more details. When joining two streams via a CoGroup pipe, attempt to place the largest stream in the left-most argument to the CoGroup: joining multiple streams requires some accumulation of values before the join operator can begin, but the left-most stream will not be accumulated, so this should improve the performance of most joins.
  • Operation is a superclass of Function, Filter, Aggregator, Buffer, and Assertion. Function and Filter are Each operations; Aggregator and Buffer are Every operations. Custom operations usually extend the BaseOperation class.
  • The Identity function can discard unused fields, rename all fields, or rename a single field. Debug output is controlled by the DebugLevel enum (NONE, DEFAULT, or VERBOSE), e.g. FlowConnector.setDebugLevel( properties, DebugLevel.NONE ). The cascading.operation.filter.Sample filter allows a percentage of tuples to pass; the cascading.operation.filter.Limit filter allows a set number of Tuples to pass. Insert is useful when some missing parameter or value, like a date String for the current date, needs to be inserted. Text functions include DateParser and DateFormatter; regular expression operations include RegexParser and RegexSplitter; Java expression operations include ExpressionFunction and ExpressionFilter, e.g. ExpressionFilter filter = new ExpressionFilter( "status != 200", Integer.TYPE ). Note that some characters in field names will cause compilation errors in expressions.
  • Operations (Function, Filter, Aggregator, or Buffer) must not store operation state in class fields. For example, if implementing a custom 'counter' Aggregator, do not create a field named 'count' and increment it on every Aggregator.aggregate() call: there is no guarantee your Operation will be called from a single thread in a JVM. Instead, there is a context in which you can record the aggregation value, much as in Hadoop.
  • A Buffer may only be used with an Every pipe, and it may only follow a GroupBy or CoGroup pipe type. It differs from an Aggregator in that an Iterator is provided, and it is the responsibility of the operate(cascading.flow.FlowProcess, BufferCall) method to iterate over all the input arguments returned by this Iterator, if any. This makes it suitable for emitting headers and footers, e.g. fields like document_id, term, term_count_in_document, total_terms_in_document.
  • A Buffer may only be used with an Every pipe, and it may only follow a GroupBy or CoGroup pipe type. AggregateBy is a SubAssembly.
  • Input and output schemas are verified before the flow runs. The start() method is an asynchronous call. A Properties object can be set on the FlowConnector, just as you would set a Hadoop JobConf.
  • Riffle is a lightweight Java library for executing collections of dependent processes as a single process. The library provides Java annotations for tagging classes and methods supporting the required life-cycle stages:

        import riffle.process.DependencyIncoming;
        import riffle.process.DependencyOutgoing;
        import riffle.process.ProcessCleanup;
        import riffle.process.ProcessComplete;
        import riffle.process.ProcessPrepare;
        import riffle.process.ProcessStart;
        import riffle.process.ProcessStop;
  • Assertions aren't pipes. When running tests against regression data, it makes sense to use strict assertions; this regression data should be small and represent many of the edge cases the processing assembly must support robustly. When running tests in staging, or with data that may vary in quality because it comes from an unmanaged source, validating assertions make much sense. And there are obvious cases where assertions just get in the way, slow down processing, and should simply be bypassed.
  • Traps were not designed as a filtering mechanism
  • Cascading does not support the so-called MapReduce Combiners. Combiners are very powerful in that they reduce the IO between mappers and reducers: why send all your mapper data to the reducers when you can compute some values map-side and combine them in the reducer? But Combiners are limited to associative and commutative functions only, like 'sum' and 'max', and in order to work, values emitted from the map task must be serialized, sorted (deserialized and compared), deserialized again, and operated on, after which the results are serialized and sorted once more. Combiners trade CPU for gains in IO. Since version 1.2, Cascading takes a different approach by providing a mechanism to perform partial aggregations map-side and combine them reduce-side, choosing to trade memory for IO gains by caching values (up to a threshold). This approach bypasses the unnecessary serialization, deserialization, and sorting steps, and it allows any aggregate function to be implemented, not just associative and commutative ones. The AggregateBy class is a SubAssembly that serves two roles for handling aggregate operations; implementations include AverageBy, CountBy, and SumBy.
  • ClusterTestCase provides a MiniDFSCluster, a MiniMRCluster, and FileSystem helper functions: copyFromLocal, getFileSystem, getJobConf, getProperties. Note that in version 1.1, Limit would only pass half the records.
  • WireIt supports Firefox 3.5 and above; it does not work on Firefox 3.0. WireIt is released under the MIT License.

    1. Cascading. Alex Su, 2011/02/11. Copyright 2010 TCloud Computing Inc.
    2. Agenda • Introduction • How it works • Data Processing • Advanced Processing • Monitoring • Testing • Best Practices • Cascading GUI. Trend Micro Confidential
    3. Introduction • Hadoop coding is non-trivial • Hadoop is looking for a class to do the Map step and a class to do the Reduce step • What if you need multiple steps in your application? Who coordinates what can be run in parallel? • What if you need to do non-Hadoop logic between Hadoop steps? • Chain the Operations into data processing workflows
    4. Introduction • Operations are chained together to define a Pipe assembly or a reusable sub-assembly
    5. Introduction

        Pipe lhs = new Pipe( "lhs" );
        lhs = new Each( lhs, new SomeFunction() );
        lhs = new Each( lhs, new SomeFilter() );

        // the "right hand side" assembly head
        Pipe rhs = new Pipe( "rhs" );
        rhs = new Each( rhs, new SomeFunction() );

        // joins the lhs and rhs
        Pipe join = new CoGroup( lhs, rhs );
        join = new Every( join, new SomeAggregator() );
        join = new GroupBy( join );
        join = new Every( join, new SomeAggregator() );

        // the tail of the assembly
        join = new Each( join, new SomeFunction() );

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass( properties, Main.class );

        FlowConnector flowConnector = new FlowConnector( properties );
        Flow flow = flowConnector.connect( "join", source, sink, join );

        // execute the flow, block until complete
        flow.complete();
    6. How it works • Pipe assemblies become Flows • Translates a DAG of operations into a DAG of MapReduce jobs • All MapReduce jobs in a Flow are scheduled in dependency order
    7. How it works

        digraph G {
          1 [label = "Every(akamaiPipe*whiteListPipe)[Count[decl:count]]"];
          2 [label = "Hfs[TextLine[[host, count]->[ALL]]][/user/alex/output]]"];
          3 [label = "GroupBy(akamaiPipe*whiteListPipe)[by:[host]]"];
          4 [label = "Each(akamaiPipe*whiteListPipe)[NotMatchedFilter[decl:host, offset, line]]"];
          5 [label = "CoGroup(akamaiPipe*whiteListPipe)[by:whiteListPipe:[line]akamaiPipe:[host]]"];
          6 [label = "Hfs[TextLine[[offset, line]->[ALL]]][/user/alex/whitelist/whitelist.txt]]"];
          7 [label = "Each(akamaiPipe)[RegexParser[decl:host][args:1]]"];
          8 [label = "Hfs[TextLine[[line]->[ALL]]][/user/alex/input/akamai.log]]"];
          9 [label = "[head]"];
          10 [label = "[tail]"];
          11 [label = "TempHfs[SequenceFile[[host, offset, line]]][akamaiPipe_whiteListPipe/52729/]"];
          12 [label = "Hfs[TextLine[[offset, line]->[ALL]]][/user/alex/trap]]"];
          1 -> 2 [label = "[{2}:host, count]\n[{3}:host, offset, line]"];
          7 -> 5 [label = "[{1}:host]\n[{1}:host]"];
          5 -> 4 [label = "whiteListPipe[{1}:line],akamaiPipe[{1}:host]\n[{3}:host, offset, line]"];
          3 -> 1 [label = "akamaiPipe*whiteListPipe[{1}:host]\n[{3}:host, offset, line]"];
          9 -> 6 [label = ""];
          9 -> 8 [label = ""];
          2 -> 10 [label = "[{?}:ALL]\n[{?}:ALL]"];
          4 -> 11 [label = "[{3}:host, offset, line]\n[{3}:host, offset, line]"];
          11 -> 3 [label = "[{3}:host, offset, line]\n[{3}:host, offset, line]"];
          8 -> 7 [label = "[{1}:line]\n[{1}:line]"];
          6 -> 5 [label = "[{2}:offset, line]\n[{2}:offset, line]"];
          7 -> 12 [label = ""];
        }
    8. Data Processing • Tuple • A single 'row' of data being processed • Each column is named • Data can be accessed by name or position
    9. Data Processing • Tap • Abstraction on top of Hadoop files • Allows you to define your own parser for files • Scheme • TextLine • TextDelimited • SequenceFile • WritableSequenceFile • Example:

        Hfs input = new Hfs( new TextLine(), a_hdfsDirectory + "/" + name );
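The Tap/Scheme split on this slide can be sketched in a few lines. This is a minimal sketch assuming the Cascading 1.x API on the classpath; the HDFS paths are hypothetical.

```java
import cascading.scheme.TextDelimited;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class SchemeSketch
  {
  public static void main( String[] args )
    {
    // TextLine: Tuples carry the default "offset" and "line" fields
    Tap rawLogs = new Hfs( new TextLine(), "/data/raw/access.log" );

    // TextDelimited: a tab-separated file with three named columns
    Fields columns = new Fields( "year", "month", "entry" );
    Tap entries = new Hfs( new TextDelimited( columns, "\t" ), "/data/entries.tsv" );
    }
  }
```

The Tap decides where the data lives (an HDFS path here), while the Scheme decides how each record is parsed into a Tuple.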
    10. Data Processing • Tap • Lfs • Dfs • Hfs • MultiSourceTap • MultiSinkTap • TemplateTap • GlobHfs • S3fs (deprecated)
    11. Data Processing • TemplateTap • TemplateTap can be used to write tuple streams out to subdirectories based on the values in the Tuple instance.
    12. Data Processing • TemplateTap

        TextDelimited scheme = new TextDelimited( new Fields( "year", "month", "entry" ), "\t" );
        Hfs tap = new Hfs( scheme, path );
        String template = "%s-%s"; // dirs named "year-month"
        Tap months = new TemplateTap( tap, template, SinkMode.REPLACE );
    13. Data Processing • Tap types • SinkMode.KEEP • SinkMode.REPLACE • SinkMode.UPDATE
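The three sink modes are chosen per Tap at construction time. A minimal sketch, assuming the Cascading 1.x Hfs constructor that accepts a SinkMode; the output path is hypothetical.

```java
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;

public class SinkModeSketch
  {
  public static void main( String[] args )
    {
    // KEEP (the default): fail if /data/out already exists
    Tap keep = new Hfs( new TextLine(), "/data/out", SinkMode.KEEP );

    // REPLACE: delete /data/out as soon as the Flow starts
    Tap replace = new Hfs( new TextLine(), "/data/out", SinkMode.REPLACE );

    // UPDATE: leave the resource in place; the Tap implements its own
    // update/append semantics (e.g. updating rows in a database)
    Tap update = new Hfs( new TextLine(), "/data/out", SinkMode.UPDATE );
    }
  }
```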
    14. Data Processing • Integration • Cascading.Avro • Cascading.Hbase • Cascading.JDBC • Cascading.Memcached • Cascading.SimpleDB
    15. Data Processing • Pipe
    16. Data Processing • Pipe • a base class for the core processing model types • Each • for each "tuple" in the data, do this to it • GroupBy • similar to a 'group by' in SQL • CoGroup • joins tuple streams together • Every • applies an Aggregator (like count, or sum) or Buffer (a sliding window) Operation to every group of Tuples that passes through it • SubAssembly • allows nesting reusable pipe assemblies into a Pipe class
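The ordering rules from the notes (Each works on one Tuple at a time; an Every must follow a GroupBy or CoGroup) can be illustrated with a small assembly. A sketch assuming the Cascading 1.x API; the pipe name, field names, and regex are made up.

```java
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class PipeOrdering
  {
  public static Pipe build()
    {
    Pipe pipe = new Pipe( "events" );

    // Each applies a Function or Filter to one Tuple at a time
    pipe = new Each( pipe, new Fields( "line" ), new RegexFilter( "^ERROR" ) );

    // an Every must follow a GroupBy (or CoGroup) ...
    pipe = new GroupBy( pipe, new Fields( "host" ) );

    // ... because Aggregators operate on one grouping at a time
    pipe = new Every( pipe, new Count( new Fields( "count" ) ) );

    return pipe;
    }
  }
```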
    17. Data Processing • CoGroup • InnerJoin • OuterJoin • LeftJoin • RightJoin • MixedJoin

        Fields lhsFields = new Fields( "url", "word", "count" );
        Fields rhsFields = new Fields( "url", "sentence", "count" );
        Fields common = new Fields( "url" );
        Fields declared = new Fields( "url1", "word", "wd_count", "url2", "sentence", "snt_count" );
        Pipe join = new CoGroup( lhs, common, rhs, common, declared, new InnerJoin() );
    18. Data Processing • Operation • Defines what to do on the data • Each operations allow logic on the row, such as parsing dates, creating new attributes, etc. • Every operations allow you to iterate over the 'group' of rows to do non-trivial operations
    19. Data Processing • Function • Identity Function • Debug Function • Sample and Limit Functions • Insert Function • Text Functions • Regular Expression Operations • Java Expression Operations • "first-name" is a valid field name for use with Cascading, but the expression first-name.trim() will fail
    20. Data Processing • Filter • And • Or • Not • Xor • NotNull • Null • RegexFilter
    21. Data Processing • Aggregator • Average • Count • First • Last • Max • Min • Sum
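As the notes above stress, a custom Aggregator must keep its running state in the call context rather than in a class field, since there is no guarantee the Operation runs on a single thread. A sketch of a 'count' Aggregator under the Cascading 1.x API; the class name and output field are made up.

```java
import cascading.flow.FlowProcess;
import cascading.operation.Aggregator;
import cascading.operation.AggregatorCall;
import cascading.operation.BaseOperation;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;

public class MyCount extends BaseOperation<long[]> implements Aggregator<long[]>
  {
  public MyCount()
    {
    super( new Fields( "count" ) );
    }

  public void start( FlowProcess flowProcess, AggregatorCall<long[]> call )
    {
    call.setContext( new long[]{ 0 } ); // fresh context for each grouping
    }

  public void aggregate( FlowProcess flowProcess, AggregatorCall<long[]> call )
    {
    call.getContext()[0]++; // no class field is ever mutated
    }

  public void complete( FlowProcess flowProcess, AggregatorCall<long[]> call )
    {
    call.getOutputCollector().add( new Tuple( call.getContext()[0] ) );
    }
  }
```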
    22. Data Processing • Buffer • Very similar to the typical Reducer interface • Very useful when header or footer values need to be inserted into a grouping, or when values need to be inserted into the middle of the group values
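A Buffer that wraps each grouping with header and footer values might look like the following. This is a sketch against the Cascading 1.x API; the "value" argument field and the emitted header/footer strings are made up for illustration.

```java
import java.util.Iterator;

import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Buffer;
import cascading.operation.BufferCall;
import cascading.tuple.Fields;
import cascading.tuple.Tuple;
import cascading.tuple.TupleEntry;

public class HeaderFooterBuffer extends BaseOperation implements Buffer
  {
  public HeaderFooterBuffer()
    {
    super( new Fields( "entry" ) );
    }

  public void operate( FlowProcess flowProcess, BufferCall call )
    {
    // emit a header before the grouping's values
    call.getOutputCollector().add( new Tuple( "header" ) );

    // unlike an Aggregator, the Buffer must drain the Iterator itself
    Iterator<TupleEntry> arguments = call.getArgumentsIterator();

    while( arguments.hasNext() )
      call.getOutputCollector().add( new Tuple( arguments.next().get( "value" ) ) );

    // emit a footer after the grouping's values
    call.getOutputCollector().add( new Tuple( "footer" ) );
    }
  }
```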
    23. Data Processing • Buffer
    24. Data Processing
    25. Data Processing • Flow • To create a Flow, it must be planned through the FlowConnector object. The connect() method is used to create new Flow instances based on a set of sink Taps, source Taps, and a pipe assembly.

        Flow flow = new FlowConnector( new Properties() ).connect( "flow-name", source, sink, pipe );
        flow.complete();
    26. Data Processing • MapReduceFlow • a Flow subclass that supports custom MapReduce jobs pre-configured via the JobConf object • ProcessFlow • a Flow subclass that supports custom Riffle jobs
    27. Data Processing • Cascades • Groups of Flows are called Cascades • Custom MapReduce jobs can participate in a Cascade

        Cascade cascade = cascadeConnector.connect( flow1, flow2, flow3 );
        cascade.complete();
    28. Advanced Processing • Stream Assertions • Unit and regression tests for Flows • The planner can remove 'strict', 'validating', or all assertions
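Assertion levels are resolved at planning time, so the same assembly can keep its assertions in tests and drop them in production. A sketch assuming the Cascading 1.x AssertionLevel API; the method names are made up.

```java
import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.operation.AssertionLevel;
import cascading.operation.assertion.AssertNotNull;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

public class AssertionSketch
  {
  public static Pipe addStrictAssertion( Pipe pipe )
    {
    // planned into the Flow only when STRICT assertions are enabled
    return new Each( pipe, AssertionLevel.STRICT, new AssertNotNull() );
    }

  public static FlowConnector productionConnector()
    {
    Properties properties = new Properties();
    // production: the planner strips all assertions from the assembly
    FlowConnector.setAssertionLevel( properties, AssertionLevel.NONE );
    return new FlowConnector( properties );
    }
  }
```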
    29. Advanced Processing • Failure Traps • Catch data causing Operations or Assertions to fail • Allow processes to continue without data loss
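A trap is just another Tap handed to the planner: Tuples that cause a failure are diverted to it instead of killing the job. A sketch assuming the Cascading 1.x four-Tap connect() overload; the paths and flow name are hypothetical.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;

public class TrapSketch
  {
  public static Flow plan( Pipe assembly )
    {
    Tap source = new Hfs( new TextLine(), "/data/input" );
    Tap sink = new Hfs( new TextLine(), "/data/output" );
    Tap trap = new Hfs( new TextLine(), "/data/trap" );

    // failing Tuples land in /data/trap; the rest of the Flow continues
    return new FlowConnector( new Properties() )
      .connect( "with-trap", source, sink, trap, assembly );
    }
  }
```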
    30. Advanced Processing • Partial Aggregation instead of Combiners • Trade memory for IO gains by caching values

        Fields groupingFields = new Fields( "date" );
        Fields valueField = new Fields( "size" );
        Fields sumField = new Fields( "total-size" );
        assembly = new SumBy( assembly, groupingFields, valueField, sumField, long.class );
    31. Monitoring • Implement the FlowListener interface • onStarting • onStopping • onCompleted • onThrowable
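A minimal listener implementing the four callbacks might look like this; a sketch against the Cascading 1.x API, with the log messages purely illustrative.

```java
import cascading.flow.Flow;
import cascading.flow.FlowListener;

public class LoggingFlowListener implements FlowListener
  {
  public void onStarting( Flow flow )
    {
    System.out.println( "starting: " + flow.getName() );
    }

  public void onStopping( Flow flow )
    {
    System.out.println( "stopping: " + flow.getName() );
    }

  public void onCompleted( Flow flow )
    {
    System.out.println( "completed: " + flow.getName() );
    }

  public boolean onThrowable( Flow flow, Throwable throwable )
    {
    throwable.printStackTrace();
    return false; // false: not handled here, let the Flow rethrow it
    }
  }
```

The listener would be registered with flow.addListener( new LoggingFlowListener() ) before calling flow.complete().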
    32. Monitoring • Polling FlowStats

        Flow ID: 756271765aa375773f9bbb5570de4d2a
        StepStats Count: 2
        cascading.flow.FlowStepJob$1: 1, Step{status=RUNNING, startTime=1297344994624}
        Name: (1/2) ...SequenceFile[[host, offset, line]]"][akamaiPipe_whiteListPipe/52729/]
        Status: RUNNING
        Num Mappers: 2
        Num Reducers: 1
        Task ID: task_201102101702_0002_m_000003
        Task ID: task_201102101702_0002_m_000000
        Task ID: task_201102101702_0002_m_000001
        Task ID: task_201102101702_0002_r_000000
        Task ID: task_201102101702_0002_m_000002
        cascading.flow.FlowStepJob$1: 2, Step{status=PENDING, startTime=0}
        Name: (2/2) Hfs["TextLine[[host, count]->[ALL]]"]["/user/alex/output"]"]
        Status: PENDING
        Num Mappers: 0
        Num Reducers: 0
    33. Testing • Use ClusterTestCase if you want to launch an embedded Hadoop cluster inside your TestCase • A few validation and Hadoop helper functions are provided • The Hadoop 0.21 testing library is not supported
    34. Cascading GUI • Yahoo! Pipes • Pipes is a powerful composition tool to aggregate, manipulate, and mashup content from around the web.
    35. Cascading GUI • WireIt • WireIt is an open-source JavaScript library for creating web wirable interfaces for dataflow applications, visual programming languages, graphical modeling, and graph editors.
    36. Cascading GUI
    37. Live Demo
    38. THANK YOU!