Cascading on Starfish

Fei Dong
Duke University
December 10, 2011

1 Introduction

Hadoop [6] is a software framework installed on a cluster to permit large-scale distributed data analysis. It provides the robust Hadoop Distributed File System (HDFS) as well as a Java-based API that allows parallel processing across the nodes of the cluster. Programs employ a Map/Reduce execution engine, which functions as a fault-tolerant distributed computing system over large data sets.

In addition to Hadoop, which is a top-level Apache project, there are sub-projects related to Hadoop workflows, such as Hive [8], a data warehouse framework used for ad hoc querying (with an SQL-like query language), and Pig [9], a high-level data-flow language and execution framework whose compiler produces sequences of Map/Reduce programs for execution within Hadoop. Cascading [2] is an API for defining and executing fault-tolerant data processing workflows on a Hadoop cluster. All of these projects simplify some of the work for developers, allowing them to write more traditional procedural or SQL-style code that, under the covers, creates a sequence of Hadoop jobs. In this report, we focus on Cascading as the main data-parallel workflow choice.

1.1 Cascading Introduction

Cascading is a Java application framework that makes it easier to write scripts that access and manipulate data inside Hadoop. The API provides a number of key features:

  • Dependency-Based 'Topological Scheduler' and MapReduce Planning - Two key capabilities of the Cascading API are dependency-based scheduling and MapReduce planning. Flows are invoked based on their dependencies, so the execution order is independent of the construction order, which often allows portions of flows and cascades to run concurrently. In addition, the steps of the various flows are converted into MapReduce invocations against the Hadoop cluster.
  • Event Notification - The various steps of a flow can send notifications via callbacks, allowing the host application to report and respond to the progress of the data processing.
  • Scriptable - The Cascading API has scriptable interfaces for Jython, Groovy, and JRuby.

Although Cascading provides the above benefits, we still need to consider the balance between performance and productivity. Marz [5] gives some rules for optimizing Cascading flows. Experienced Cascading users can gain some performance improvement by following those principles at a high level. One interesting question is whether there are ways to improve workflow performance without expert knowledge. In other words, we want to optimize the workflow at the physical level. In Starfish [7], the authors demonstrate the power of self-tuning jobs on Hadoop, and Herodotou has successfully applied the optimization technology to Pig. This report discusses auto-optimization of Cascading with the help of Starfish.

2 Terminology

First, we introduce some concepts widely used in Cascading.

  • Stream: data input and output.
  • Tuple: a stream is composed of a series of Tuples. Tuples are sets of ordered data.
  • Tap: an abstraction on top of Hadoop files.
    Source - A source tap is read from and acted upon. Actions on source taps result in pipes.
    Sink - A sink tap is a location to be written to. A sink tap can later serve as a source in the same script.
  • Operations: define what to do with the data, e.g. Each(), Group(), CoGroup(), Every().
  • Pipe: ties Operations together. When an operation is executed upon a Tap, the result is a Pipe. In other words, a flow is a pipe with data flowing through it. Pipes can use other Pipes as input, thereby wrapping themselves into a series of operations.
  • Filter: removes useless records from the data passing through it, e.g. RegexFilter(), And(), Or().
  • Aggregator: a function applied after a group operation, e.g. Count(), Average(), Min(), Max().
  • Step: a logical unit in a Flow. It represents a Map-only or MapReduce job.
  • Flow: a combination of a Source, a Sink and Pipes.
  • Cascade: a series of Flows.

3 Cascading Structure

Figure 1: A Typical Cascading Structure.

Figure 1 shows the Cascading structure. The top level is called a Cascade, which is composed of several flows. Each flow defines a source Tap, a sink Tap and Pipes. Note that one flow can have multiple pipes to perform data operations such as filtering, grouping and aggregation.

Internally, a Cascade is constructed through the CascadeConnector class, by building an internal graph that makes each Flow a 'vertex' and each file an 'edge'. A topological walk on this graph will touch each vertex in order of its dependencies. When a vertex has all its incoming edges available, it will be scheduled on the cluster.

Figure 2 gives an example whose goal is to compute per-second and per-minute request counts from Apache logs. The dataflow is represented as a graph. The first step is to import and parse the source data. Next, it generates two following steps to process "second" and "minute" respectively.

The execution order for Log Analysis is:

1. Calculate the dependency between flows, so we get Flow1 → Flow2.
2. Start to call Flow1.
   2.1 Initialize the "import" flowStep and construct Job1.
   2.2 Submit the "import" Job1 to Hadoop.
3. Start to call Flow2.
   3.1 Initialize the "minute and second statistics" flowSteps and construct Job2 and Job3.
   3.2 Submit Job2 and Job3 to Hadoop.

The complete code is attached in the Appendix.
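The topological walk described in Section 3 can be illustrated with a small, self-contained Java sketch. It is not Cascading's CascadeConnector implementation: the class names FlowSketch and CascadeOrderSketch, and the file names "access.log", "logs", "sec" and "min", are hypothetical and chosen only to mirror the Log Analysis example. Each flow is a vertex, a file written by one flow and read by another is an edge, and a flow becomes ready once all flows producing its inputs have run.

```java
import java.util.*;

/** Hypothetical stand-in for a Cascading Flow: a name plus the files it reads and writes. */
final class FlowSketch {
    final String name;
    final Set<String> sources;
    final Set<String> sinks;

    FlowSketch(String name, Set<String> sources, Set<String> sinks) {
        this.name = name;
        this.sources = sources;
        this.sinks = sinks;
    }
}

final class CascadeOrderSketch {
    /** Returns flow names so that every flow comes after the flows that produce its inputs. */
    static List<String> executionOrder(List<FlowSketch> flows) {
        Map<String, Integer> pending = new LinkedHashMap<>();   // unmet dependencies per flow
        Map<String, List<String>> dependents = new HashMap<>(); // flow -> flows reading its output
        for (FlowSketch f : flows) {
            pending.put(f.name, 0);
            dependents.put(f.name, new ArrayList<>());
        }
        // A shared file (sink of one flow, source of another) is an edge between two flows.
        for (FlowSketch producer : flows) {
            for (FlowSketch consumer : flows) {
                if (producer != consumer
                        && !Collections.disjoint(producer.sinks, consumer.sources)) {
                    dependents.get(producer.name).add(consumer.name);
                    pending.merge(consumer.name, 1, Integer::sum);
                }
            }
        }
        // Kahn's algorithm: a flow is ready once all of its incoming edges are available.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String flow = ready.remove();
            order.add(flow);
            for (String next : dependents.get(flow)) {
                if (pending.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != flows.size()) {
            throw new IllegalStateException("cyclic flow dependencies");
        }
        return order;
    }

    public static void main(String[] args) {
        // Flows are deliberately listed in the "wrong" construction order.
        FlowSketch arrivalRate = new FlowSketch("arrivalRate",
                Collections.singleton("logs"), new HashSet<>(Arrays.asList("sec", "min")));
        FlowSketch importLog = new FlowSketch("import",
                Collections.singleton("access.log"), Collections.singleton("logs"));
        System.out.println(executionOrder(Arrays.asList(arrivalRate, importLog)));
        // prints [import, arrivalRate]
    }
}
```

The sketch reflects the point made in Section 1.1 that execution order is independent of construction order: the dependent flow is constructed first, yet it is scheduled only after the flow that writes its input.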
Figure 2: Workflow Sample: Log Analysis.

4 Cascading on Starfish

4.1 Change to new Hadoop API

The current Cascading is based on the Hadoop Old-API. Since Starfish only works with the New-API, the first task is to connect these heterogeneous systems. Herodotos works on supporting the Hadoop Old-API in Starfish, and I work on replacing the Old-API in Cascading with the New-API. Although the Hadoop community recommends the new API and provides some upgrade advice [11], the translation still took considerable effort. One reason is the system's complexity (40K lines); to make the change work, we sacrificed some advanced features such as S3fs, TemplateTap, ZipSplit, Stats reports and Strategy. In the end, we provide a revised version of Cascading that uses only the Hadoop New-API. In the meantime, Herodotos recently updated Starfish to support the Old-API. This report, however, only considers the New-API version of Cascading.

4.2 Cascading Profiler

First, we need to decide when to capture the profiles. Since the modified Cascading uses the Hadoop New-API, the position at which to enable the Profiler is the same as for a single MapReduce job. We choose the return point of blockTillCompleteOrStoped in cascading.flow.FlowStepJob to collect the job execution files when the job completes. When all jobs have finished and the execution files have been collected, we build a profile graph to represent the dataflow dependencies among the jobs. In order to build the job DAG, we decouple the hierarchy of Cascades and Flows. As we saw before, the Log Analysis workflow has two dependent Flows and ultimately submits three MapReduce jobs to Hadoop. Figure 3 shows the original workflow in Cascading and the translated JobGraph in Starfish. We propose the following algorithm to build the job DAG.

Algorithm 1 Build Job DAG Pseudo-Code

    procedure BuildJobDAG(flowGraph)
        for flow ∈ flowGraph do                          // Iterate over all flows
            for flowStep ∈ flow.flowStepGraph do         // Add the job vertices
                Create the jobVertex from the flowStep
            end for
            for edge ∈ flow.flowStepGraph.edgeSet do     // Add the job edges within a flow
                Create the corresponding edge in the jobGraph
            end for
        end for
        for flowEdge ∈ flowGraph.edgeSet do              // Iterate over all flow edges (source → target)
            sourceFlowSteps ← flowEdge.sourceFlow.getLeafFlowSteps
            targetFlowSteps ← flowEdge.targetFlow.getRootFlowSteps
            for sourceFS ∈ sourceFlowSteps do
                for targetFS ∈ targetFlowSteps do
                    Create the job edge from corresponding source to target
                end for
            end for
        end for
    end procedure

4.3 Cascading What-if Engine and Optimizer

The What-if Engine predicts the behavior of a workflow W. To achieve that, the DAG Profilers, Data Model, Cluster and DAG Configurations are given as parameters. Building the Conf Graph follows the same idea as building the Job Graph. We capture the return point of initializeNewJobMap in cascading.cascade, where we process what-if requests and exit the program afterwards.
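Algorithm 1 can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the code used in Starfish or Cascading: the Digraph class, the JobDagSketch class and the use of strings as vertex labels are all hypothetical simplifications. The first loop copies the steps and intra-flow edges of every flow into the job graph; the second loop connects the leaf steps of each source flow to the root steps of each dependent flow.

```java
import java.util.*;

/** Minimal directed graph; a hypothetical stand-in for the real graph classes. */
final class Digraph<V> {
    final Set<V> vertices = new LinkedHashSet<>();
    final Map<V, Set<V>> successors = new LinkedHashMap<>();

    void addVertex(V v) {
        vertices.add(v);
        successors.computeIfAbsent(v, k -> new LinkedHashSet<>());
    }

    void addEdge(V from, V to) {
        addVertex(from);
        addVertex(to);
        successors.get(from).add(to);
    }

    boolean hasEdge(V from, V to) {
        return successors.containsKey(from) && successors.get(from).contains(to);
    }

    /** Vertices with no incoming edge. */
    List<V> roots() {
        Set<V> targets = new HashSet<>();
        for (Set<V> out : successors.values()) targets.addAll(out);
        List<V> result = new ArrayList<>();
        for (V v : vertices) if (!targets.contains(v)) result.add(v);
        return result;
    }

    /** Vertices with no outgoing edge. */
    List<V> leaves() {
        List<V> result = new ArrayList<>();
        for (V v : vertices) if (successors.get(v).isEmpty()) result.add(v);
        return result;
    }
}

final class JobDagSketch {
    /**
     * flowGraph:  flows as vertices, flow dependencies as edges.
     * stepGraphs: for each flow, the graph of its steps (one step = one MapReduce job).
     */
    static Digraph<String> buildJobDag(Digraph<String> flowGraph,
                                       Map<String, Digraph<String>> stepGraphs) {
        Digraph<String> jobGraph = new Digraph<>();
        // Add the job vertices and the job edges within each flow.
        for (String flow : flowGraph.vertices) {
            Digraph<String> steps = stepGraphs.get(flow);
            for (String step : steps.vertices) jobGraph.addVertex(step);
            for (String from : steps.vertices) {
                for (String to : steps.successors.get(from)) jobGraph.addEdge(from, to);
            }
        }
        // For every flow edge, connect leaf steps of the source to root steps of the target.
        for (String sourceFlow : flowGraph.vertices) {
            for (String targetFlow : flowGraph.successors.get(sourceFlow)) {
                for (String leaf : stepGraphs.get(sourceFlow).leaves()) {
                    for (String root : stepGraphs.get(targetFlow).roots()) {
                        jobGraph.addEdge(leaf, root);
                    }
                }
            }
        }
        return jobGraph;
    }

    public static void main(String[] args) {
        // Log Analysis: the import flow has one job, the arrival rate flow has two.
        Digraph<String> flows = new Digraph<>();
        flows.addEdge("import", "arrivalRate");
        Digraph<String> importSteps = new Digraph<>();
        importSteps.addVertex("Job1");
        Digraph<String> rateSteps = new Digraph<>();
        rateSteps.addVertex("Job2");
        rateSteps.addVertex("Job3");
        Map<String, Digraph<String>> steps = new HashMap<>();
        steps.put("import", importSteps);
        steps.put("arrivalRate", rateSteps);
        System.out.println(buildJobDag(flows, steps).successors);
        // prints {Job1=[Job2, Job3], Job2=[], Job3=[]}
    }
}
```

For the Log Analysis workflow this yields three job vertices with edges from Job1 to both Job2 and Job3, which matches the dependency stated in the text: two dependent Flows that submit three MapReduce jobs.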
(a) Cascading Representation (b) Dataflow Translation
Figure 3: Log Analysis.

For the Cascading optimizer, I make use of the data flow optimizer and feed it the related interface. When running the Optimizer, we keep the default Optimizer mode as crossjob + dynamic.

4.4 Program Interface

The usage of Cascading on Starfish is simple and user-friendly. Users do not need to change the source code or import a new package. Some use cases are listed below.

    profile cascading jar loganalysis.jar

Profiler: collect task profiles when running a workflow and generate the profile files in PROFILER_OUTPUT_DIR.

    execute cascading jar loganalysis.jar

Execute: only run the program, without collecting profiles.

    analyze cascading details workflow_20111017205527

Analyze: list basic or detailed statistical information regarding all jobs found in PROFILER_OUTPUT_DIR.

    whatif details workflow_20111018014128 cascading jar loganalysis.jar

What-if Engine: ask a hypothetical question about a particular workflow and return the predicted profiles.

    optimize run workflow_20111018014128 cascading jar loganalysis.jar

Optimizer: execute a MapReduce workflow using the configuration parameter settings automatically suggested by the Cost-based Optimizer.
5 Evaluation

5.1 Experiment Environment

In the experimental evaluation, we used Hadoop clusters running on Amazon EC2. The setup is detailed below.

  • Cluster Type: 10 m1.large nodes. Each node has 7.5 GB memory, 2 virtual cores and 850 GB storage, and is set to run 3 map tasks and 2 reduce tasks concurrently.
  • Hadoop Version: 0.20.203.
  • Cascading Version: modified V1.2.4 (uses Hadoop New-API).
  • Data Set: 20G TPC-H [10], 10G random text, 10G page graphs for PageRank, 5G paper-author pairs.
  • Optimizer Type: cross-job and dynamic.

5.2 Description of Data-parallel Workflows

We evaluate the end-to-end performance of the optimizers on seven representative workflows used in different domains.

Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF calculates weights representing the importance of each word to a document in a collection. The workflow contains three jobs: 1) count the total terms in each document; 2) calculate the number of documents containing each term; 3) calculate tf * idf. Job 2 depends on Job 1 and Job 3 depends on Job 2.

Top 20 Coauthor Pairs: Suppose you have a large dataset of papers and authors. You want to know whether there is any correlation between being collaborative and being a prolific author. It can take three jobs: 1) group authors by paper; 2) generate co-authorship pairs (map) and count (reduce); 3) sort by count. Job 2 depends on Job 1 and Job 3 depends on Job 2.

Log Analysis: Given an Apache log, parse it with a specified format, compute the per-minute and per-second counts separately, and dump each result. There are three jobs: 1) import and parse the raw log; 2) group by minute and compute counts; 3) group by second and compute counts. Job 2 and Job 3 depend on Job 1.

PageRank: The goal is to find the ranking of web pages. The algorithm can be implemented as an iterative workflow containing two jobs: 1) join two datasets on the pageId; 2) calculate the new ranking of each web page. Job 2 depends on Job 1.

TPC-H: We use the TPC-H benchmark as a representative example of a complex SQL query. Query 3 is implemented as a four-job workflow: 1) join the order and customer tables, with filter conditions on each table; 2) join the lineitem table with the result of Job 1; 3) calculate the volume by discount; 4) compute the sum after grouping by some keys. Job 2 depends on Job 1, Job 3 depends on Job 2, and Job 4 depends on Job 3.

HTML Parser and WordCount: The workflow processes a collection of web source pages. It has three jobs: 1) parse the raw data with an HTML SAX parser; 2) count the words for each URL; 3) aggregate the total word count. Job 2 depends on Job 1 and Job 3 depends on Job 2.

User-defined Partition: It splits the dataset into three parts by key range. Some statistics are collected on each part. In general, it runs as three jobs, and each job is responsible for one part of the dataset. There is no dependency between the three jobs, which means they can run in parallel.

The source code for the experiment groups has been submitted to the Starfish repository.

5.3 Speedup with Starfish Optimizer

Figure 4 shows the timeline for the TPC-H Query 3 workflow. With the profiler enabled, the run takes 20% more time. The final cross-job optimizer yields a 1.3x speedup. Figure 5 analyzes the speedup for six workflows. The optimizer is effective for most workflows, the only exception being the user-defined partition. One possible reason is that this workflow generates three jobs in parallel, which compete with each other for the limited cluster resources (30 available map slots and 20 available reduce slots).

Figure 4: Running TPC-H Query 3 with no Optimizer, with Profiler and with Optimizer.
(a) Log Analysis (b) Coauthor Pairs (c) PageRank (d) User-defined Partition (e) TF-IDF (f) HTML Parser and Wordcount
Figure 5: Speedup with Starfish Optimizer.

5.4 Overhead on Profiler

Figure 6 shows the profiling overhead by comparing against the same job run with profiling turned off. On average, profiling consumes 20% of the running time.

5.5 Compare with Pig

We are very interested in comparing various workflow frameworks on the same datasets. We run the identical workflows written by Harold Lim. Figure 7 shows that Pig outperforms Cascading even when Cascading is optimized. We consider several possible reasons.

  • Cascading does not support Combiners. One article [4] discusses hand-rolled join optimizations.
  • Pig performs many optimizations at the physical and logical layers, while Cascading does not optimize the planner well. In the user-defined partition workflow, Pig generates only one MapReduce job while Cascading generates three.
  • Cascading only uses a custom InputFormat and InputSplit, called MultiInputFormat and MultiInputSplit, even for a single job or a single input source.
  • Cascading's CoGroup() join is not meant to be used with large data files.
  • Using RegexSplit to parse files into tuples is not efficient.
  • Compression is disabled.

(a) Log Analysis (b) Coauthor Pairs (c) PageRank (d) User-defined Partition (e) TF-IDF (f) HTML Parser and Wordcount
Figure 6: Overhead to measure profile.

Figure 7: Pig Versus Cascading on Performance.
6 Conclusion

Cascading aims to help developers build powerful applications quickly and simply, through a well-reasoned API, without needing to think in MapReduce, while leaving the heavy lifting of data distribution, replication, distributed process management, and liveness to Hadoop.

With the Starfish Optimizer, we can speed up the original Cascading programs by 20% to 200% without modifying any source code. The experiments also show that, although Cascading programs resemble Pig scripts in how they express a workflow, the results differ markedly: Pig performs much better than Cascading in most cases.

Considering code size, learning cost and performance, we recommend Pig for simple queries, as it is more suitable and performant. We also find that Cascalog [3], a data processing and querying library for Clojure, is another choice for writing workflows on Hadoop.

We notice that Cascading 2.0 [1] is about to be released, which should greatly improve performance and fix bugs of the previous version. As future work, once Starfish supports the old API, we can import the latest version of Cascading and rerun the experiments.

7 Acknowledgement

I would like to thank Herodotos Herodotou, the lead contributor of Starfish, who gave me so much help with the system design and Hadoop's internal mechanisms. The report could not have been done without him. I also want to thank Harold Lim, who gave me support on the benchmarks.

I thank Professor Shivnath Babu for his help in supervising this work and for holding meetings for us to exchange ideas.

References

[1] Tips for Optimizing Cascading Flows. 20-early-access.html.
[2] Cascading.
[3] Cascalog.
[4] Pseudo Combiners in Cascading. combiners-in-cascading/.
[5] Tips for Optimizing Cascading Flows. optimizing-cascading-flows.html.
[6] Apache Hadoop.
[7] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In CIDR, 2011.
[8] Hive.
[9] Pig.
[10] TPC. TPC Benchmark H Standard Specification, 2009.
[11] Upgrading to the New Map Reduce API.

Appendix

8.1 Complete Source Code of Log Analysis

Listing 1:

    package loganalysis;

    import java.util.*;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.util.*;

    import cascading.cascade.*;
    import cascading.flow.*;
    import cascading.operation.aggregator.Count;
    import cascading.operation.expression.ExpressionFunction;
    import cascading.operation.regex.RegexParser;
    import cascading.operation.text.DateParser;
    import cascading.pipe.*;
    import cascading.scheme.TextLine;
    import cascading.tap.*;
    import cascading.tuple.Fields;

    public class LogAnalysis extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // set the Hadoop parameters
            Properties properties = new Properties();
            Iterator<Map.Entry<String, String>> iter = getConf().iterator();
            while (iter.hasNext()) {
                Map.Entry<String, String> entry = iter.next();
                properties.put(entry.getKey(), entry.getValue());
            }

            FlowConnector.setApplicationJarClass(properties, LogAnalysis.class);
            FlowConnector flowConnector = new FlowConnector(properties);
            CascadeConnector cascadeConnector = new CascadeConnector();

            String inputPath = args[0];
            String logsPath = args[1] + "/logs/";
            String arrivalRatePath = args[1] + "/arrivalrate/";
            String arrivalRateSecPath = arrivalRatePath + "sec";
            String arrivalRateMinPath = arrivalRatePath + "min";

            // create an assembly to import an Apache log file and store on DFS
            // declares: "ip", "time", "method", "event", "status", "size"
            Fields apacheFields = new Fields("ip", "time", "method", "event",
                    "status", "size");
            String apacheRegex = "^([^ ]*) +[^ ]* +[^ ]* +\\[([^]]*)\\] +\"([^ ]*) ([^ ]*) [^ ]*\" ([^ ]*) ([^ ]*).*$";
            int[] apacheGroups = { 1, 2, 3, 4, 5, 6 };
            RegexParser parser = new RegexParser(apacheFields, apacheRegex,
                    apacheGroups);
            Pipe importPipe = new Each("import", new Fields("line"), parser);

            // create a tap to read the raw log from the default filesystem
            Tap logTap = new Hfs(new TextLine(), inputPath);
            // create a tap to read/write from the default filesystem
            Tap parsedLogTap = new Hfs(apacheFields, logsPath);

            // connect the assembly to source and sink taps
            Flow importLogFlow = flowConnector.connect(logTap, parsedLogTap,
                    importPipe);

            // create an assembly to parse out the time field into a timestamp
            // then count the number of requests per second and per minute

            // apply a text parser to create a timestamp with 'second' granularity
            // declares field "ts"
            DateParser dateParser = new DateParser(new Fields("ts"),
                    "dd/MMM/yyyy:HH:mm:ss Z");
            Pipe tsPipe = new Each("arrival rate", new Fields("time"), dateParser,
                    Fields.RESULTS);

            // name the per second assembly and split on tsPipe
            Pipe tsCountPipe = new Pipe("tsCount", tsPipe);
            tsCountPipe = new GroupBy(tsCountPipe, new Fields("ts"));
            tsCountPipe = new Every(tsCountPipe, Fields.GROUP, new Count());

            // apply expression to create a timestamp with 'minute' granularity
            // declares field "tm"
            Pipe tmPipe = new Each(tsPipe, new ExpressionFunction(new Fields("tm"),
                    "ts - (ts % (60 * 1000))", long.class));

            // name the per minute assembly and split on tmPipe
            Pipe tmCountPipe = new Pipe("tmCount", tmPipe);
            tmCountPipe = new GroupBy(tmCountPipe, new Fields("tm"));
            tmCountPipe = new Every(tmCountPipe, Fields.GROUP, new Count());

            // create taps to write the results to the default filesystem, using the
            // given fields
            Tap tsSinkTap = new Hfs(new TextLine(), arrivalRateSecPath, true);
            Tap tmSinkTap = new Hfs(new TextLine(), arrivalRateMinPath, true);

            // a convenience method for binding taps and pipes, order is significant
            Map<String, Tap> sinks = Cascades.tapsMap(Pipe.pipes(tsCountPipe,
                    tmCountPipe), Tap.taps(tsSinkTap, tmSinkTap));

            // connect the assembly to the source and sink taps
            Flow arrivalRateFlow = flowConnector.connect(parsedLogTap, sinks,
                    tsCountPipe, tmCountPipe);

            // optionally print out the arrivalRateFlow to a graph file for import
            // into a graphics package
            //arrivalRateFlow.writeDOT( "" );

            // connect the flows by their dependencies, order is not significant
            Cascade cascade = cascadeConnector.connect(importLogFlow,
                    arrivalRateFlow);

            // execute the cascade, which in turn executes each flow in dependency
            // order
            cascade.complete();
            return 0;
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new LogAnalysis(), args);
            System.exit(res);
        }
    }