Stratosphere Intro (Java and Scala Interface)

A quick overview of Stratosphere, including our Scala programming interface.

See also bigdataclass.org for two self-paced Stratosphere Big Data exercises.
More information about Stratosphere: stratosphere.eu

Slide notes:
  • Google: Search results, Spam Filter
    Amazon: Recommendations
    Soundcloud: Recommendations
    Spotify: Recommendations
    Youtube: Recommendations, Adverts
    Netflix: Recommendations, compare to Maxdome :D
    Twitter: Just everything … :D
    Facebook: Adverts, Graph Search, Friend suggestions,
    Filtering (for annoying friends)
    Instagram: They have lots of data, there's gotta be something …
    Bioinformatics: DNA, 1 TB per genome, 1000 genomes
  • Transcript

    • 1. Introduction to Stratosphere. Aljoscha Krettek, DIMA / TU Berlin
    • 2. What is this?
      ● Distributed data processing system
      ● DAG (directed acyclic graph) of sources, sinks, and operators: a “data flow”
      ● Handles distribution, fault tolerance, network transfer
      [Diagram: source → map: “split words” → reduce: “count words” → sink]
    • 3. Why would I use this?
      ● Automatic parallelization / because you are told to
      [Diagram: the same data flow with three parallel instances each of source, map: “split words”, reduce: “count words”, and sink]
    • 4. So how do I use this? (from Java)
      ● How is data represented in the system?
      ● How do I create data flows?
      ● Which types of operators are there?
      ● How do I write operators?
      ● How do I run the whole shebang?
    • 5. How do I move my data?
      ● Data is stored in fields in a PactRecord
      ● Basic data types: PactString, PactInteger, PactDouble, PactFloat, PactBoolean, …
      ● New data types must implement the Value interface
    • 6. PactRecord
        PactRecord rec = ...;
        PactInteger foo = rec.getField(0, PactInteger.class);
        int i = foo.getValue();
        PactInteger foo2 = new PactInteger(3);
        rec.setField(1, foo2);
    • 7. Creating Data Flows
      ● Create one or several sources
      ● Create operators:
        – Input is/are the preceding operator(s)
        – Specify a class/object with the operator implementation
      ● Create one or several sinks:
        – Input is some operator
    • 8. WordCount Example Data Flow
        FileDataSource source = new FileDataSource(TextInputFormat.class, dataInput, "Input Lines");

        MapContract mapper = MapContract.builder(TokenizeLine.class)
            .input(source)
            .name("Tokenize Lines")
            .build();

        ReduceContract reducer = ReduceContract.builder(CountWords.class, PactString.class, 0)
            .input(mapper)
            .name("Count Words")
            .build();

        FileDataSink out = new FileDataSink(RecordOutputFormat.class, output, reducer, "Word Counts");
        RecordOutputFormat.configureRecordFormat(out)
            .recordDelimiter('\n')
            .fieldDelimiter(' ')
            .field(PactString.class, 0)
            .field(PactInteger.class, 1);

        Plan plan = new Plan(out, "WordCount Example");
    • 9. Operator Types
      ● We call them second-order functions (SOFs)
      ● Code inside the operator is the first-order function, or user-defined function (UDF)
      ● Currently five SOFs: map, reduce, match, cogroup, cross
      ● The SOF describes how PactRecords are handed to the UDF
    • 10. Map Operator
      ● User code receives one record at a time (per call to the user code function)
      ● Not really a functional map, since all operators can output an arbitrary number of records
    • 11. Map Operator Example
        public static class TokenizeLine extends MapStub {
            private final AsciiUtils.WhitespaceTokenizer tokenizer = new AsciiUtils.WhitespaceTokenizer();
            private final PactRecord outputRecord = new PactRecord();
            private final PactString word = new PactString();
            private final PactInteger one = new PactInteger(1);

            @Override
            public void map(PactRecord record, Collector<PactRecord> collector) {
                PactString line = record.getField(0, PactString.class);
                this.tokenizer.setStringToTokenize(line);
                while (tokenizer.next(word)) {
                    outputRecord.setField(0, word);
                    outputRecord.setField(1, one);
                    collector.collect(outputRecord);
                }
            }
        }
    • 12. Reduce Operator
      ● User code receives a group of records with the same key
      ● Must specify which fields of a record are the key
    • 13. Reduce Operator Example
        public static class CountWords extends ReduceStub {
            private final PactInteger cnt = new PactInteger();

            @Override
            public void reduce(Iterator<PactRecord> records, Collector<PactRecord> out) throws Exception {
                PactRecord element = null;
                int sum = 0;
                while (records.hasNext()) {
                    element = records.next();
                    PactInteger i = element.getField(1, PactInteger.class);
                    sum += i.getValue();
                }
                cnt.setValue(sum);
                element.setField(1, cnt);
                out.collect(element);
            }
        }
    • 14. Specifying the Key Fields
        ReduceContract reducer = ReduceContract.builder(Foo.class, PactString.class, 0)
            .input(mapper)
            .keyField(PactInteger.class, 1)
            .name("Count Words")
            .build();
    • 15. Cross Operator
      ● Two-input operator
      ● Cartesian product: every record from the left combined with every record from the right
      ● One record from the left, one record from the right per user code call
      ● Implement CrossStub
    • 16. Match Operator
      ● Two-input operator with keys
      ● Join: every record from the left combined with every record from the right with the same key
      ● Implement MatchStub
    • 17. CoGroup Operator
      ● Two-input operator with keys
      ● Records from the left combined with all records from the right with the same key
      ● User code gets an iterator for the left and the right records
      ● Implement CoGroupStub
    • 18. How to execute a data flow plan
      ● Either use the LocalExecutor: LocalExecutor.execute(plan)
      ● Or implement PlanAssembler.getPlan(String... args) and run on a local or a proper cluster
      ● See: http://stratosphere.eu/quickstart/ and http://stratosphere.eu/docs/gettingstarted.html
    • 19. Getting Started
        https://github.com/stratosphere/stratosphere
        https://github.com/stratosphere/stratosphere-quickstart
    • 20. And Now for Something Completely Different
        val input = TextFile(textInput)
        val words = input
          .flatMap { _.split(" ") map { (_, 1) } }
        val counts = words
          .groupBy { case (word, _) => word }
          .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
        val output = counts
          .write(wordsOutput, CsvOutputFormat())
        val plan = new ScalaPlan(Seq(output))
    • 21. (Very) Short Introduction to Scala 21
    • 22. Anatomy of a Scala Class
        package foo.bar

        import something.else

        class Job(arg1: Int) {
          def map(in: Int): String = {
            val i: Int = in + 2
            var a = "Hello"
            i.toString
          }
        }
    • 23. Singletons
      ● Similar to Java singletons and/or static methods
        object Job {
          def main(args: Array[String]) {
            println("Hello World")
          }
        }
    • 24. Collections
        val a = Seq(1, 2, 4)
        List("Hallo", 2)
        Array(2, 3)
        Map(1 -> "1", 2 -> "2")

        val b = a map { x => x + 2 }
        val c = a map { _ + 2 }      // same as b, with placeholder syntax
        val d = a.map({ _ + 2 })     // same again, with explicit method-call syntax
    • 25. Generics and Tuples
        val a: Seq[Int] = Seq(1, 2, 4)
        val tup = (3, "a")
        val tup2: (Int, String) = (3, "a")
    • 26. Stratosphere Scala Front End 26
    • 27. Skeleton of a Stratosphere Program
      ● Input: a text file/JDBC source/CSV, etc.
        – loaded into the internal representation: the DataSet
      ● Transformations on the DataSet
        – map, reduce, join, etc.
      ● Output: the program results in a DataSink
        – text file, JDBC, CSV, etc.
      (A minimal end-to-end sketch follows below.)
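      Putting the three stages together, a minimal sketch. It only rearranges the WordCount pieces shown on the surrounding slides; the import paths, object name, and file paths are assumptions for illustration and varied between Stratosphere releases:

        // a minimal sketch, assuming the Scala front end shown in this deck;
        // import paths are assumptions and differed between releases
        import eu.stratosphere.api.scala._
        import eu.stratosphere.api.scala.operators._
        import eu.stratosphere.client.LocalExecutor

        object WordCountSkeleton {
          def main(args: Array[String]) {
            // 1) Input: load a text file into a DataSet[String]
            val input = TextFile("file:///tmp/input")      // hypothetical path

            // 2) Transformations on the DataSet
            val counts = input
              .flatMap { _.split(" ") map { (_, 1) } }
              .groupBy { case (word, _) => word }
              .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }

            // 3) Output: write the result through a DataSink
            val sink = counts.write("file:///tmp/counts", CsvOutputFormat())

            // wrap the sink(s) in a plan and run it locally (cf. slide 18)
            val plan = new ScalaPlan(Seq(sink))
            LocalExecutor.execute(plan)
          }
        }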
    • 28. The Almighty DataSet
      ● Operations are methods on DataSet[A]
      ● Working with DataSet[A] feels like working with Scala collections
      ● DataSet[A] is not an actual collection but represents computation on a collection
      ● Stringing together operations creates a data flow graph that can be executed
    • 29. An Important Difference
      Immediately executed (a plain Scala collection):
        val input: List[String] = ...
        val mapped = input.map { s => (s, 1) }
      Executed only when the data flow is executed (a DataSet):
        val input: DataSet[String] = ...
        val mapped = input.map { s => (s, 1) }
        val result = mapped.write("file", ...)
        val plan = new Plan(result)
        execute(plan)
    • 30. Usable Data Types
      ● Primitive types
      ● Tuples
      ● Case classes
      ● Custom data types that implement the Value interface
      (A short sketch of the first three follows below.)
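      For illustration, a short sketch of the first three kinds of element types, reusing TextFile and map from the neighboring slides; the case class WordWithCount and the file path are hypothetical:

        // assumes the same imports as the skeleton after slide 27
        case class WordWithCount(word: String, count: Int)      // hypothetical case class

        val lines: DataSet[String] = TextFile("file:///tmp/input")   // primitive element type
        val pairs: DataSet[(String, Int)] =                          // tuple element type
          lines.map { l => (l, 1) }
        val words: DataSet[WordWithCount] =                          // case class element type
          pairs.map { case (w, c) => WordWithCount(w, c) }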
    • 31. Creating Data Sources
        val input = TextFile("file://")

        val input2: DataSet[(Int, String)] =
          DataSource("hdfs://", CsvInputFormat[(Int, String)]())

        def parseInput(line: String): (Int, Int) = {…}
        val input3 = DataSource("hdfs://", DelimitedInputFormat(parseInput))
    • 32. Interlude: Anonymous Functions
        var fun: ((Int, String)) => String = ...
        fun = { t => t._2 }
        fun = { _._2 }
        fun = { case (i, w) => w }
    • 33. Map
        val input: DataSet[(Int, String)] = ...
        val mapper = input
          .map { case (a, b) => (a + 2, b) }
        val mapper2 = input
          .flatMap { _._2.split(" ") }
        val filtered = input
          .filter { case (a, b) => a > 3 }
    • 34. Reduce
        val input: DataSet[(String, Int)] = ...
        val reducer = input
          .groupBy { case (w, _) => w }
          .groupReduce { _.minBy {...} }
        val reducer2 = input
          .groupBy { case (w, _) => w }
          .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
    • 35. Cross
        val left: DataSet[(String, Int)] = ...
        val right: DataSet[(String, Int)] = ...
        val cross = left.cross(right)
          .map { (l, r) => ... }
        val cross2 = left.cross(right)
          .flatMap { (l, r) => ... }
    • 36. Join (Match)
        val counts: DataSet[(String, Int)] = ...
        val names: DataSet[(Int, String)] = ...
        val join = counts
          .join(names)
          .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
          .map { (l, r) => (l._1, r._2) }
        val join2 = counts
          .join(names)
          .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
          .flatMap { (l, r) => ... }
    • 37. CoGroup
        val counts: DataSet[(String, Int)] = ...
        val names: DataSet[(Int, String)] = ...
        val cogroup = counts
          .cogroup(names)
          .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
          .map { (l, r) => (l.minBy {...}, r.minBy {...}) }
        val cogroup2 = counts
          .cogroup(names)
          .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
          .flatMap { (l, r) => ... }
    • 38. Creating Data Sinks
        val counts: DataSet[(String, Int)] = ...
        val sink = counts.write("<>", CsvOutputFormat())

        def formatOutput(a: (String, Int)): String = {
          "Word " + a._1 + " count " + a._2
        }
        val sink2 = counts.write("<>", DelimitedOutputFormat(formatOutput))
    • 39. Word Count example
        val input = TextFile(textInput)
        val words = input
          .flatMap { _.split(" ") map { (_, 1) } }
        val counts = words
          .groupBy { case (word, _) => word }
          .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
        val output = counts
          .write(wordsOutput, CsvOutputFormat())
        val plan = new ScalaPlan(Seq(output))
    • 40. Things not mentioned
      ● There is support for iterations (both in Java and Scala); a rough sketch follows below
      ● Many more data source/sink formats
      ● Look at the examples in the Stratosphere source
      ● Don't be afraid to write on the mailing list and on GitHub:
        – http://stratosphere.eu/quickstart/scala.html
      ● Or come directly to us
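      For completeness, a rough sketch of what an iteration might look like in the Scala front end. The iterate method and its signature here are an assumption based on Stratosphere-era examples, not on these slides; check the examples in the Stratosphere source:

        // a rough sketch only: `iterate` and its exact signature are assumptions
        val initial: DataSet[(Int, Double)] = ...   // some initial solution
        val result = initial.iterate(10, { current: DataSet[(Int, Double)] =>
          // one step of the loop: placeholder logic that halves every value
          current.map { case (id, value) => (id, value / 2.0) }
        })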
    • 41. End.