Dr. Kostas Tzoumas: Big Data Looks Tiny From Stratosphere at Big Data Beers (Nov. 20, 2013)

Presentation by Dr. Kostas Tzoumas at the Big Data Beers Meetup [1] (Nov. 20, 2013) introducing the Stratosphere Platform for Big Data Analytics.

Check out http://stratosphere.eu for more information.

[1] http://www.meetup.com/Big-Data-Beers/events/147397982/


Transcript

  • 1. Big Data looks tiny from Stratosphere Kostas Tzoumas kostas.tzoumas@tu-berlin.de
  • 2. Data is an important asset: video & audio streams, sensor data, RFID, GPS, user online behavior, scientific simulations, web archives, ... Volume: handle petabytes of data. Velocity: handle high data arrival rates. Variety: handle many heterogeneous data sources. Veracity: handle inherent uncertainty of data.
  • 3. Data Analysis
  • 4. Four “I”s for Big Analysis: text mining, interactive and ad hoc analysis, machine learning, graph analysis, statistical algorithms. Iterative: model the data, do not just describe it. Incremental: maintain the model under high arrival rates. Interactive: step-by-step data exploration on very large data. Integrative: fluent unified interfaces for different data models.
  • 5. Hadoop: Hadoop’s selling point is its low effective storage cost. Hadoop clusters are becoming a data vortex, attracting cross-departmental data and changing the data usage culture in companies. Hadoop MapReduce was the wrong abstraction and implementation to begin with and will be superseded by better systems.
  • 6. Advanced Analytics: analytics that model the data to reveal hidden relationships, not just describe the data, e.g., machine learning, predictive statistics, graph analysis. Increasingly important from a market perspective. Very different from SQL analytics: different languages and access patterns (iterative vs. one-pass programs). The Hadoop toolchain is poor; R, Matlab, etc. are not parallel.
  • 7. [Diagram: SQL · MapReduce · Big Analytics · Big SQL · No MapReduce]
  • 8. [Word-cloud slide: scripting, XQuery?, SQL--, column store++, query plan, wrong platform, scalable parallel sort]
  • 9. “Data Scientist: The Sexiest Job of the 21st Century — Meet the people who can coax treasure out of messy, unstructured data,” by Thomas H. Davenport and D.J. Patil, shown side by side (and marked “≠”) with Hive and Pig scripts:

    FROM (
      FROM pv_users
      MAP pv_users.userid, pv_users.date
      USING 'map_script'
      AS dt, uid
      CLUSTER BY dt) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
      REDUCE map_output.dt, map_output.uid
      USING 'reduce_script'
      AS date, count;

    A = load 'WordcountInput.txt';
    B = MAPREDUCE wordcount.jar store A into 'inputDir' load 'outputDir'
        as (word:chararray, count: int) 'org.myorg.WordCount inputDir outputDir';
    C = sort B by count;

    Article excerpt: “When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren’t seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, ‘It was like arriving at a conference reception and realizing you don’t know anyone. So you just stand in the corner sipping your drink—and you …’”
  • 10. Taken from http://www.oracle.com/technetwork/java/jvmls2013vitek-2013524.pdf
  • 11. Hadoop is... 1. A programming model called MapReduce. 2. An implementation of said programming model, called Hadoop MapReduce. 3. A file system, called HDFS. 4. A resource manager, called Yarn. 5. Interfaces to Hadoop MapReduce (Pig, Hive, Cascading, ...). 6. An ML library called Mahout. 7. Recently, a collection of runtime systems (Tez, Impala, Spark, Stratosphere, ...). (* Inspired by Jens Dittrich)
  • 12. 1. A programming model called MapReduce

    val input  = TextFile(textInput)
    val words  = input.flatMap { line => line.split(" ") }
    val counts = words.groupBy { word => word }.count()
    val output = counts.write(wordsOutput, CsvOutputFormat())

    map: “Romeo, Romeo, wherefore art thou Romeo?” → [(Romeo,1), (Romeo,1), (wherefore,1), (art,1), (thou,1), (Romeo,1)]
    reduce: (Romeo,(1,1,1)) → (Romeo,3); (wherefore,1) → (wherefore,1); (art,1) → (art,1); (thou,1) → (thou,1)
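The model on this slide can be sketched in plain Java (hypothetical helper names; not the Hadoop or Stratosphere API): the user supplies only a map and a reduce function, and grouping by key is entirely the framework's job.

```java
import java.util.*;

// Word count in the MapReduce programming model, as a plain-Java sketch.
public class WordCountModel {

    // map: emit (word, 1) for every word in a line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.replaceAll("[^A-Za-z ]", "").split(" ")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce: sum the values of one key group
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    // the framework's part: group map output by key, apply reduce per group
    static Map<String, Integer> run(String... lines) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> kv : map(line))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> result = new HashMap<>();
        groups.forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run("Romeo, Romeo, wherefore art thou Romeo?"));
    }
}
```

On the slide's example input this yields (Romeo,3), (wherefore,1), (art,1), (thou,1), matching the reduce output shown above.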
  • 13. 2. An implementation of said programming model, called Hadoop MapReduce

    Map(“Romeo, Romeo, wherefore art thou Romeo?”) → (Romeo,1) (Romeo,1) (wherefore,1) (art,1) (thou,1) (Romeo,1)
    Map(“What, art thou hurt?”) → (What,1) (art,1) (thou,1) (hurt,1)
    [data written to disk; data shuffled over the network]
    Reduce((Romeo,(1,1,1)), (art,(1,1)), (thou,(1,1))) → (Romeo,3) (art,2) (thou,2)
    Reduce((wherefore,1), (What,1), (hurt,1)) → (wherefore,1) (What,1) (hurt,1)
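The "data shuffled over the network" step routes every (key, value) pair to a reducer by hashing the key, so all pairs with the same key meet at the same reducer. A minimal sketch of that routing (illustrative; not the Hadoop Partitioner API, though the hash formula mirrors Hadoop's default HashPartitioner):

```java
import java.util.*;

// Routing map-output keys to reducer buckets by key hash.
public class ShuffleSketch {

    // same formula as Hadoop's default HashPartitioner
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // distribute keys into one bucket per reducer
    static List<List<String>> shuffle(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) buckets.add(new ArrayList<>());
        for (String key : keys) buckets.get(partition(key, numReducers)).add(key);
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("Romeo", "art", "thou", "Romeo", "What", "Romeo");
        System.out.println(shuffle(keys, 2));
    }
}
```

Because the partition function is deterministic, all three "Romeo" pairs land in the same bucket, which is what makes the per-key reduce possible.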
  • 14. Hand-coded join in Hadoop MapReduce

    import java.io.*;
    import java.util.Map;
    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import com.google.common.base.Preconditions;
    import com.google.common.collect.ComparisonChain;
    import com.google.common.primitives.Longs;

    public class ReduceSideBookAndAuthorJoin extends HadoopJob {
      private static final Pattern SEPARATOR = Pattern.compile("\t");

      @Override
      public int run(String[] args) throws Exception {
        Map<String,String> parsedArgs = parseArgs(args);
        Path authors = new Path(parsedArgs.get("--authors"));
        Path books = new Path(parsedArgs.get("--books"));
        Path outputPath = new Path(parsedArgs.get("--output"));
        Job join = new Job(new Configuration(getConf()));
        Configuration jobConf = join.getConfiguration();
        MultipleInputs.addInputPath(join, authors, TextInputFormat.class, ConvertAuthorsMapper.class);
        MultipleInputs.addInputPath(join, books, TextInputFormat.class, ConvertBooksMapper.class);
        join.setMapOutputKeyClass(SecondarySortedAuthorID.class);
        join.setMapOutputValueClass(AuthorOrTitleAndYearOfPublication.class);
        jobConf.setBoolean("mapred.compress.map.output", true);
        join.setReducerClass(JoinReducer.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(NullWritable.class);
        join.setJarByClass(JoinReducer.class);
        join.setJobName("reduceSideBookAuthorJoin");
        join.setOutputFormatClass(TextOutputFormat.class);
        jobConf.set("mapred.output.dir", outputPath.toString());
        join.setGroupingComparatorClass(SecondarySortedAuthorID.GroupingComparator.class);
        join.waitForCompletion(true);
        return 0;
      }

      static class ConvertAuthorsMapper
          extends Mapper<Object,Text,SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication> {
        @Override
        protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
          String line = value.toString();
          if (line.length() > 0) {
            String[] tokens = SEPARATOR.split(line);
            long authorID = Long.parseLong(tokens[0]);
            String author = tokens[1];
            ctx.write(new SecondarySortedAuthorID(authorID, true),
                new AuthorOrTitleAndYearOfPublication(author));
          }
        }
      }

      static class ConvertBooksMapper
          extends Mapper<Object,Text,SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication> {
        @Override
        protected void map(Object key, Text line, Context ctx) throws IOException, InterruptedException {
          String[] tokens = SEPARATOR.split(line.toString());
          long authorID = Long.parseLong(tokens[0]);
          short yearOfPublication = Short.parseShort(tokens[1]);
          String title = tokens[2];
          ctx.write(new SecondarySortedAuthorID(authorID, false),
              new AuthorOrTitleAndYearOfPublication(title, yearOfPublication));
        }
      }

      static class JoinReducer
          extends Reducer<SecondarySortedAuthorID,AuthorOrTitleAndYearOfPublication,Text,NullWritable> {
        @Override
        protected void reduce(SecondarySortedAuthorID key, Iterable<AuthorOrTitleAndYearOfPublication> values, Context ctx)
            throws IOException, InterruptedException {
          String author = null;
          for (AuthorOrTitleAndYearOfPublication value : values) {
            if (author == null && !value.containsAuthor()) {
              throw new IllegalStateException("No author found for book: " + value.getTitle());
            } else if (author == null && value.containsAuthor()) {
              author = value.getAuthor();
            } else {
              ctx.write(new Text(author + '\t' + value.getTitle() + '\t' + value.getYearOfPublication()),
                  NullWritable.get());
            }
          }
        }
      }

      static class SecondarySortedAuthorID implements WritableComparable<SecondarySortedAuthorID> {
        private boolean containsAuthor;
        private long id;

        static {
          WritableComparator.define(SecondarySortedAuthorID.class, new SecondarySortComparator());
        }

        SecondarySortedAuthorID() {}

        SecondarySortedAuthorID(long id, boolean containsAuthor) {
          this.id = id;
          this.containsAuthor = containsAuthor;
        }

        @Override
        public int compareTo(SecondarySortedAuthorID other) {
          return ComparisonChain.start()
              .compare(id, other.id)
              .result();
        }

        @Override
        public void write(DataOutput out) throws IOException {
          out.writeBoolean(containsAuthor);
          out.writeLong(id);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
          containsAuthor = in.readBoolean();
          id = in.readLong();
        }

        @Override
        public boolean equals(Object o) {
          if (o instanceof SecondarySortedAuthorID) {
            return id == ((SecondarySortedAuthorID) o).id;
          }
          return false;
        }

        @Override
        public int hashCode() {
          return Longs.hashCode(id);
        }

        static class SecondarySortComparator extends WritableComparator implements Serializable {
          protected SecondarySortComparator() {
            super(SecondarySortedAuthorID.class, true);
          }

          @Override
          public int compare(WritableComparable a, WritableComparable b) {
            SecondarySortedAuthorID keyA = (SecondarySortedAuthorID) a;
            SecondarySortedAuthorID keyB = (SecondarySortedAuthorID) b;
            return ComparisonChain.start()
                .compare(keyA.id, keyB.id)
                .compare(!keyA.containsAuthor, !keyB.containsAuthor)
                .result();
          }
        }

        static class GroupingComparator extends WritableComparator implements Serializable {
          protected GroupingComparator() {
            super(SecondarySortedAuthorID.class, true);
          }
        }
      }

      static class AuthorOrTitleAndYearOfPublication implements Writable {
        private boolean containsAuthor;
        private String author;
        private String title;
        private Short yearOfPublication;

        AuthorOrTitleAndYearOfPublication() {}

        AuthorOrTitleAndYearOfPublication(String author) {
          this.containsAuthor = true;
          this.author = Preconditions.checkNotNull(author);
        }

        AuthorOrTitleAndYearOfPublication(String title, short yearOfPublication) {
          this.containsAuthor = false;
          this.title = Preconditions.checkNotNull(title);
          this.yearOfPublication = yearOfPublication;
        }

        public boolean containsAuthor() { return containsAuthor; }
        public String getAuthor() { return author; }
        public String getTitle() { return title; }
        public Short getYearOfPublication() { return yearOfPublication; }

        @Override
        public void write(DataOutput out) throws IOException {
          out.writeBoolean(containsAuthor);
          if (containsAuthor) {
            out.writeUTF(author);
          } else {
            out.writeUTF(title);
            out.writeShort(yearOfPublication);
          }
        }

        @Override
        public void readFields(DataInput in) throws IOException {
          author = null;
          title = null;
          yearOfPublication = null;
          containsAuthor = in.readBoolean();
          if (containsAuthor) {
            author = in.readUTF();
          } else {
            title = in.readUTF();
            yearOfPublication = in.readShort();
          }
        }
      }
    }
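For contrast, the actual join logic buried in the listing above is small once the Writable, comparator, and job-wiring boilerplate is stripped away. A minimal sketch with hypothetical record tagging (an "A" record for the author, "B" records for books), relying on the same secondary-sort guarantee that the author record arrives first in each reduce group:

```java
import java.util.*;

// Core of a reduce-side join, without Hadoop's serialization plumbing.
public class ReduceSideJoinSketch {

    // values for one author ID, in secondary-sort order:
    // {"A", authorName} first, then {"B", title, year} records
    static List<String> reduce(List<String[]> values) {
        String author = null;
        List<String> joined = new ArrayList<>();
        for (String[] v : values) {
            if (author == null && !v[0].equals("A")) {
                throw new IllegalStateException("No author found for book: " + v[1]);
            } else if (author == null) {
                author = v[1];              // first record carries the author
            } else {
                joined.add(author + "\t" + v[1] + "\t" + v[2]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> group = List.of(
            new String[]{"A", "Shakespeare"},
            new String[]{"B", "Romeo and Juliet", "1597"},
            new String[]{"B", "Hamlet", "1603"});
        for (String line : reduce(group)) System.out.println(line);
    }
}
```

Everything else in the 200-line listing exists only to make this loop run on Hadoop: key serialization, secondary-sort comparators, and job configuration.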
  • 15. 5. Interfaces to Hadoop MapReduce (Pig, Hive, Cascading, ...): lacking in declarativity; operators exchange data via HDFS; sort is the only grouping operator; need many MapReduce rounds. [Diagram: chain of Map/Reduce stages]
  • 16. 6. An ML library called Mahout. Iterative programs in Hadoop: the client drives each iteration as a separate MapReduce job. [Diagram: Client → Map/Reduce (Iteration 1) → Map/Reduce (Iteration 2) → Map/Reduce (Iteration 3)]
  • 17. Iterations in MapReduce are too slow. Design a new runtime system and use the Hadoop scheduler. Incremental iterations matter: exploit sparse computational dependencies. [Chart: changes to the iteration’s result for Connected Components in each superstep, number of vertices (thousands) vs. superstep, naïve (bulk) vs. incremental; the incremental variant touches far fewer vertices per superstep]
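The workload behind the chart can be sketched minimally: label-propagation connected components, where the number of label changes per superstep shrinks quickly; those sparse dependencies are what an incremental iteration exploits, since only changed vertices and their neighbors need work in the next superstep. (Illustrative plain Java; not the Stratosphere API.)

```java
import java.util.*;

// Connected components by label propagation on a tiny undirected graph.
public class ComponentsSketch {

    // returns each vertex's final component label (the minimum vertex id)
    static int[] components(int n, int[][] edges) {
        int[] label = new int[n];
        for (int i = 0; i < n; i++) label[i] = i;   // start with own id
        List<Integer> changesPerStep = new ArrayList<>();
        int changes;
        do {
            changes = 0;
            for (int[] e : edges) {                 // lower labels along edges
                int lo = Math.min(label[e[0]], label[e[1]]);
                if (label[e[0]] > lo) { label[e[0]] = lo; changes++; }
                if (label[e[1]] > lo) { label[e[1]] = lo; changes++; }
            }
            changesPerStep.add(changes);            // activity dies down fast
        } while (changes > 0);
        System.out.println("changes per superstep: " + changesPerStep);
        return label;
    }

    public static void main(String[] args) {
        // a path 0-1-2-3 plus an isolated pair 4-5
        int[] labels = components(6, new int[][]{{0,1},{1,2},{2,3},{4,5}});
        System.out.println(Arrays.toString(labels));
    }
}
```

A bulk iteration re-examines every edge each superstep even when almost nothing changes; tracking only the changed vertices gives the incremental curve in the chart.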
  • 18. Observations: 1. The MapReduce programming model is good for grouping and counting. 2. The MapReduce programming model is not good for much else. 3. The Hadoop implementation of MapReduce trades performance for fault tolerance (disk-based data shuffling). 4. The MapReduce programming model is not suited for SQL; one needs to hack around it with multiple MapReduce rounds. 5. Hadoop’s implementation of MapReduce is not suited for SQL. 6. The MapReduce programming model and its Hadoop implementation are not suited for iterations; one needs to hack around it by implementing iterations in the client or embedding a new runtime in a Map function.
  • 19. Stratosphere Big Data
  • 20. Stratosphere: a brief history. 2009: DFG-funded research group from TUB, HUB, HPI starts research on “Information Management in the Cloud.” 2010-2012: Stratosphere released as open source (v0.1, v0.2) and becomes known in the academic community. Companies and universities in Europe become part of Stratosphere. 2013 and beyond: transition from a research project to a stable and usable open source system, developer community, and real-world use cases.
  • 21. Stratosphere status: next stable release (v0.4) coming up around end of November. Snapshot available to download; maturity equivalent to Apache incubator projects. Community picking up: external developers from universities (KTH, SICS, Inria, and others), hackathons in Berlin, Paris, Budapest; companies are starting to use Stratosphere (Deutsche Telekom, Internet Memory, Mediaplus).
  • 22. [Image slide]
  • 23. Desiderata for next-gen big data platforms: usability. 10 million Excel users, 3 million R users, 70,000 Hadoop users. “the market faces certain challenges such as unavailability of qualified and experienced work professionals, who can effectively handle the Hadoop architecture.”
  • 24. Desiderata for next-gen big data platforms: performance. [Bar chart: Stratosphere vs. Hadoop runtime] Performance difference from days to minutes enables real-time decision making and widespread use of data within the organization.
  • 25. Query optimizers: the enabling technology for SQL data warehousing and BI; a successful industrial application of artificial intelligence. Data characteristics change. Each color is a differently written program that produces the same result but has very different performance depending on small changes in the data set and the analysis requirements. [Figure 2: Complex Plan and Reduced Plan Diagram (Query 8, OptA)] Currently, only Stratosphere can optimize non-relational data analysis programs.
  • 26. Use a combination of compiler and database technology to lift optimization beyond relational algebra. Derive properties of user-defined functions via code analysis and use these to mimic a relational database optimizer.

    UDF Code Analysis: a static code analysis framework provides control-flow, def-use, and use-def lists; prerequisite is a fixed API to access records. Extracted information: read and write field sets track accesses on records; upper and lower output cardinality bounds. Safety: all record access instructions are detected; supersets of the actual read/write sets are returned; supersets allow fewer but always safe transformations. Details in [HPS+12].

    Data Flow Transformations: an enumeration algorithm descends the data flow recursively top-down, checks reorder conditions, and switches successive operators. Reorder conditions: 1. no write-read / write-write conflicts on record fields (similar to conflict detection in optimistic concurrency control); 2. preservation of groups for grouping operators (groups must remain unchanged or be completely removed; property preservation reasoning with write sets). Supported transformations: filter push-down, join reordering, invariant group transformations; non-relational operators are integrated. Details in [HPS+12] and [HKT12].

    Physical Optimization: execution plan selection chooses execution strategies for second-order functions (e.g., sort vs. hybrid-hash) and shipping strategies to distribute data (partition, forward); strategies known from parallel databases; interesting properties: sorting, grouping, partitioning. Cost-based plan selection exploits UDF annotations for size estimates; the cost model combines network, disk I/O, and CPU costs. Details in [BEH+10].

    Parallel Execution: the execution engine supports massively parallel execution of DAG-structured data flows, with sequential processing tasks and pipelining. [WK09] Warneke, Kao.
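The first reorder condition above can be sketched as a simple set test over the read/write field sets that code analysis extracts (illustrative representation with fields as integers; not Stratosphere's internal data structures):

```java
import java.util.*;

// Two successive UDF operators may be swapped only if neither writes a
// record field the other reads or writes: the write-read / write-write
// conflict test, as in optimistic concurrency control.
public class ReorderCheck {

    static boolean disjoint(Set<Integer> a, Set<Integer> b) {
        for (int x : a) if (b.contains(x)) return false;
        return true;
    }

    // op1 precedes op2 in the data flow; can they be reordered?
    static boolean canReorder(Set<Integer> read1, Set<Integer> write1,
                              Set<Integer> read2, Set<Integer> write2) {
        return disjoint(write1, read2)      // no write-read conflict
            && disjoint(write2, read1)      // no read-write conflict
            && disjoint(write1, write2);    // no write-write conflict
    }

    public static void main(String[] args) {
        // e.g., a filter reading field 8 vs. a map reading 6,7 and writing 4
        System.out.println(canReorder(Set.of(8), Set.of(), Set.of(6, 7), Set.of(4)));
    }
}
```

Because the analysis returns supersets of the true read/write sets, this test can only reject safe reorderings, never accept unsafe ones, which is the safety property the slide states.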
  • 27. MapReduce vs. one-pass dataflow (Impala, ...) vs. many-pass dataflow (Stratosphere):

    Text:               ✔ | ✔ | ✔
    Aggregation:        ✔ | ✔ | ✔
    ETL:                ✔ | ✔ | ✔
    SQL:                Hive is too slow | ✔ | ✔
    Advanced analytics: Mahout is slow and low level | Madlib is too slow | ✔

    A fast, massively parallel database-inspired backend. Truly scales to disk-resident large data sets using database technology (e.g., hybrid hashing and external sort-merge for implementing key matching). Built-in support for iterative programs via an “iterate” operator: predictive and advanced analytics (machine learning, graph processing, stats) are all iterative.
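The "iterate" operator boils down to a fixed-point loop over a data set: apply a step function until the result stops changing or a maximum number of supersteps is reached. A minimal generic sketch (illustrative; not the Stratosphere API):

```java
import java.util.function.UnaryOperator;

// A generic "iterate" operator as a fixed-point loop.
public class IterateSketch {

    static <T> T iterate(T initial, UnaryOperator<T> step, int maxSteps) {
        T current = initial;
        for (int i = 0; i < maxSteps; i++) {
            T next = step.apply(current);
            if (next.equals(current)) break;   // convergence criterion reached
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // toy use: repeatedly halve until the fixed point 0 is reached
        System.out.println(iterate(Integer.valueOf(40), x -> x / 2, 100));
    }
}
```

Making the loop an operator inside the dataflow, rather than a client-side driver resubmitting jobs as on slide 16, is what removes the per-iteration scheduling and I/O overhead.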
  • 28. Stratosphere Incremental Iterations: doing Pregel. Giraph is a Stratosphere program. [Dataflow diagram: the working set Wi holds the messages sent by the vertices; a CoGroup (left outer) with the solution set Si aggregates messages and derives the new vertex state; the delta set Di+1 holds the state of changed vertices; a Match with the graph topology creates messages from the new state, yielding Wi+1.] (Stratosphere – Parallel Analytics Beyond MapReduce)
  • 29. To recap: Stratosphere is an open-source system that runs on top of Hadoop Yarn and HDFS, but replaces Hadoop MapReduce with a new runtime engine designed for iterative and DAG-shaped programs; offers a program optimizer that frees the programmer from low-level decisions; is scalable to large clusters and disk-resident data sets; and is programmable in Java and Scala (with more to come).
  • 30. A next-generation Big Data platform is being developed in Berlin. Help us shape the future of Stratosphere! http://www.flickr.com/photos/andiearbeit/4354455624/lightbox/