Stratosphere Intro (Java and Scala Interface)

A quick overview of Stratosphere, including our Scala programming interface.

See also bigdataclass.org for two self-paced Stratosphere Big Data exercises.
More information about Stratosphere: stratosphere.eu

Speaker notes:
  • Google: search results, spam filter
    Amazon: recommendations
    SoundCloud: recommendations
    Spotify: recommendations
    YouTube: recommendations, adverts
    Netflix: recommendations (compare to Maxdome :D)
    Twitter: just about everything … :D
    Facebook: adverts, Graph Search, friend suggestions, filtering (for annoying friends)
    Instagram: they have lots of data, there's got to be something …
    Bioinformatics: DNA, 1 TB per genome, 1,000 genomes
Transcript

    1. Introduction to Stratosphere
       Aljoscha Krettek, DIMA / TU Berlin
    2. What is this?
        ● Distributed data processing system
        ● A DAG (directed acyclic graph) of sources, sinks, and operators: a "data flow"
        ● Handles distribution, fault tolerance, and network transfer
        [Diagram: source → map: "split words" → reduce: "count words" → sink]
    3. Why would I use this?
        ● Automatic parallelization / because you are told to
        [Diagram: the same source → map: "split words" → reduce: "count words" → sink flow, replicated three times and running in parallel]
    4. So how do I use this? (from Java)
        ● How is data represented in the system?
        ● How do I create data flows?
        ● Which types of operators are there?
        ● How do I write operators?
        ● How do I run the whole shebang?
    5. How do I move my data?
        ● Data is stored in fields of a PactRecord
        ● Basic data types: PactString, PactInteger, PactDouble, PactFloat, PactBoolean, …
        ● New data types must implement the Value interface
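        A custom Value type is not shown in the deck. Below is a minimal sketch, assuming the Value interface follows the Hadoop-Writable-style write/read serialization contract used by the built-in Pact types; the PactPoint class, its fields, and the method signatures are assumptions, so check the actual interface in the Stratosphere source.

            import java.io.DataInput;
            import java.io.DataOutput;
            import java.io.IOException;

            // Hypothetical custom data type (a sketch, not the confirmed API).
            public class PactPoint implements Value {
                private double x;
                private double y;

                // No-argument constructor so the runtime can instantiate the
                // type during deserialization.
                public PactPoint() {}

                @Override
                public void write(DataOutput out) throws IOException {
                    out.writeDouble(x);
                    out.writeDouble(y);
                }

                @Override
                public void read(DataInput in) throws IOException {
                    x = in.readDouble();
                    y = in.readDouble();
                }
            }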
    6. PactRecord

            PactRecord rec = ...;
            PactInteger foo = rec.getField(0, PactInteger.class);
            int i = foo.getValue();
            PactInteger foo2 = new PactInteger(3);
            rec.setField(1, foo2);
    7. Creating Data Flows
        ● Create one or several sources
        ● Create operators:
          – input is/are the preceding operator(s)
          – specify a class/object with the operator implementation
        ● Create one or several sinks:
          – input is some operator
    8. WordCount Example Data Flow

            FileDataSource source = new FileDataSource(TextInputFormat.class, dataInput, "Input Lines");

            MapContract mapper = MapContract.builder(TokenizeLine.class)
                .input(source)
                .name("Tokenize Lines")
                .build();

            ReduceContract reducer = ReduceContract.builder(CountWords.class, PactString.class, 0)
                .input(mapper)
                .name("Count Words")
                .build();

            FileDataSink out = new FileDataSink(RecordOutputFormat.class, output, reducer, "Word Counts");
            RecordOutputFormat.configureRecordFormat(out)
                .recordDelimiter('\n')
                .fieldDelimiter(' ')
                .field(PactString.class, 0)
                .field(PactInteger.class, 1);

            Plan plan = new Plan(out, "WordCount Example");
    9. Operator Types
        ● We call them second-order functions (SOFs)
        ● Code inside the operator is the first-order function, or user-defined function (UDF)
        ● Currently five SOFs: map, reduce, match, cogroup, cross
        ● The SOF describes how PactRecords are handed to the UDF
    10. Map Operator
        ● User code receives one record at a time (per call to the user code function)
        ● Not really a functional map, since all operators can output an arbitrary number of records
    11. Map Operator Example

            public static class TokenizeLine extends MapStub {
                private final AsciiUtils.WhitespaceTokenizer tokenizer = new AsciiUtils.WhitespaceTokenizer();
                private final PactRecord outputRecord = new PactRecord();
                private final PactString word = new PactString();
                private final PactInteger one = new PactInteger(1);

                @Override
                public void map(PactRecord record, Collector<PactRecord> collector) {
                    PactString line = record.getField(0, PactString.class);
                    this.tokenizer.setStringToTokenize(line);
                    while (tokenizer.next(word)) {
                        outputRecord.setField(0, word);
                        outputRecord.setField(1, one);
                        collector.collect(outputRecord);
                    }
                }
            }
    12. Reduce Operator
        ● User code receives a group of records with the same key
        ● Must specify which fields of a record are the key
    13. Reduce Operator Example

            public static class CountWords extends ReduceStub {
                private final PactInteger cnt = new PactInteger();

                @Override
                public void reduce(Iterator<PactRecord> records, Collector<PactRecord> out) throws Exception {
                    PactRecord element = null;
                    int sum = 0;
                    while (records.hasNext()) {
                        element = records.next();
                        PactInteger i = element.getField(1, PactInteger.class);
                        sum += i.getValue();
                    }
                    cnt.setValue(sum);
                    element.setField(1, cnt);
                    out.collect(element);
                }
            }
    14. Specifying the Key Fields

            ReduceContract reducer = ReduceContract.builder(Foo.class, PactString.class, 0)
                .input(mapper)
                .keyField(PactInteger.class, 1)
                .name("Count Words")
                .build();
    15. Cross Operator
        ● Two-input operator
        ● Cartesian product: every record from the left combined with every record from the right
        ● One record from the left, one record from the right per user code call
        ● Implement CrossStub
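        No Cross UDF example appears in the deck; the following is a minimal sketch by analogy with the MapStub example on slide 11. The cross(...) callback name and signature are assumptions, and the field types are illustrative only.

            // Hypothetical Cross UDF: assumed to be invoked once per
            // (left, right) record pair of the Cartesian product.
            public static class CombinePairs extends CrossStub {
                private final PactRecord result = new PactRecord();

                @Override
                public void cross(PactRecord left, PactRecord right, Collector<PactRecord> out) {
                    // Illustrative: pair the first field of each input record.
                    result.setField(0, left.getField(0, PactString.class));
                    result.setField(1, right.getField(0, PactString.class));
                    out.collect(result);
                }
            }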
    16. Match Operator
        ● Two-input operator with keys
        ● Join: each record from the left combined with every record from the right that has the same key
        ● Implement MatchStub
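        Again no example in the deck; a sketch by analogy, assuming MatchStub hands the UDF one left and one right record with equal keys per call (the match(...) name, signature, and field layout are assumptions).

            // Hypothetical Match (join) UDF: assumed to be called once for
            // each pair of records whose key fields are equal.
            public static class JoinWithNames extends MatchStub {
                @Override
                public void match(PactRecord left, PactRecord right, Collector<PactRecord> out) {
                    // Illustrative: append a field of the right record to the left one.
                    left.setField(2, right.getField(1, PactString.class));
                    out.collect(left);
                }
            }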
    17. CoGroup Operator
        ● Two-input operator with keys
        ● Records from the left combined with all records from the right with the same key
        ● User code gets an iterator for the left and the right records
        ● Implement CoGroupStub
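        A sketch by analogy with the ReduceStub example on slide 13, assuming CoGroupStub hands the UDF one iterator per input (the coGroup(...) name and signature are assumptions).

            // Hypothetical CoGroup UDF: assumed to receive all records that
            // share one key, from both inputs; emits the two group sizes.
            public static class CountBothSides extends CoGroupStub {
                private final PactRecord result = new PactRecord();

                @Override
                public void coGroup(Iterator<PactRecord> left, Iterator<PactRecord> right,
                                    Collector<PactRecord> out) {
                    int leftCount = 0;
                    while (left.hasNext()) { left.next(); leftCount++; }
                    int rightCount = 0;
                    while (right.hasNext()) { right.next(); rightCount++; }
                    result.setField(0, new PactInteger(leftCount));
                    result.setField(1, new PactInteger(rightCount));
                    out.collect(result);
                }
            }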
    18. How to execute a data flow plan
        ● Either use the LocalExecutor: LocalExecutor.execute(plan)
        ● Or implement PlanAssembler.getPlan(String... args) and run on a local cluster or a proper cluster
        ● See: http://stratosphere.eu/quickstart/ and http://stratosphere.eu/docs/gettingstarted.html
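        Putting slide 8 and this slide together, a minimal sketch of a job class: the getPlan(String...) signature comes from this slide and the operator wiring from slide 8, while the WordCountJob class name and the argument handling are hypothetical.

            // Sketch: wiring the WordCount data flow from slide 8 into a
            // PlanAssembler so it can run via LocalExecutor or on a cluster.
            public class WordCountJob implements PlanAssembler {
                @Override
                public Plan getPlan(String... args) {
                    String dataInput = args[0];  // hypothetical argument layout
                    String output = args[1];
                    FileDataSource source = new FileDataSource(TextInputFormat.class, dataInput, "Input Lines");
                    MapContract mapper = MapContract.builder(TokenizeLine.class)
                        .input(source).name("Tokenize Lines").build();
                    ReduceContract reducer = ReduceContract.builder(CountWords.class, PactString.class, 0)
                        .input(mapper).name("Count Words").build();
                    FileDataSink out = new FileDataSink(RecordOutputFormat.class, output, reducer, "Word Counts");
                    return new Plan(out, "WordCount Example");
                }
            }

            // Quick local test (paths are placeholders):
            // LocalExecutor.execute(new WordCountJob().getPlan("file:///tmp/in.txt", "file:///tmp/out"));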
    19. Getting Started
        https://github.com/stratosphere/stratosphere
        https://github.com/stratosphere/stratosphere-quickstart
    20. And Now for Something Completely Different

            val input = TextFile(textInput)

            val words = input
              .flatMap { _.split(" ") map { (_, 1) } }

            val counts = words
              .groupBy { case (word, _) => word }
              .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }

            val output = counts
              .write(wordsOutput, CsvOutputFormat())

            val plan = new ScalaPlan(Seq(output))
    21. (Very) Short Introduction to Scala
    22. Anatomy of a Scala Class

            package foo.bar

            import something.else

            class Job(arg1: Int) {
              def map(in: Int): String = {
                val i: Int = in + 2
                var a = "Hello"
                i.toString
              }
            }
    23. Singletons
        ● Similar to Java singletons and/or static methods

            object Job {
              def main(args: Array[String]) {
                println("Hello World")
              }
            }
    24. Collections

            val a = Seq(1, 2, 4)
            List("Hallo", 2)
            Array(2, 3)
            Map(1 -> "1", 2 -> "2")

            val b = a map { x => x + 2 }
            val c = a map { _ + 2 }
            val d = a.map({ _ + 2 })  // the same map as c, with explicit dot syntax
    25. Generics and Tuples

            val a: Seq[Int] = Seq(1, 2, 4)

            val tup = (3, "a")
            val tup2: (Int, String) = (3, "a")
    26. Stratosphere Scala Front End
    27. Skeleton of a Stratosphere Program
        ● Input: a text file / JDBC source / CSV, etc.
          – loaded into the internal representation: the DataSet
        ● Transformations on the DataSet
          – map, reduce, join, etc.
        ● Output: the program results in a DataSink
          – text file, JDBC, CSV, etc.
    28. The Almighty DataSet
        ● Operations are methods on DataSet[A]
        ● Working with DataSet[A] feels like working with Scala collections
        ● DataSet[A] is not an actual collection but represents computation on a collection
        ● Stringing together operations creates a data flow graph that can be executed
    29. An Important Difference

        Immediately executed:

            val input: List[String] = ...
            val mapped = input.map { s => (s, 1) }

        Executed only when the data flow is executed:

            val input: DataSet[String] = ...
            val mapped = input.map { s => (s, 1) }
            val result = mapped.write("file", ...)
            val plan = new Plan(result)
            execute(plan)
    30. Usable Data Types
        ● Primitive types
        ● Tuples
        ● Case classes
        ● Custom data types that implement the Value interface
    31. Creating Data Sources

            val input = TextFile("file://")

            val input: DataSet[(Int, String)] =
              DataSource("hdfs://", CsvInputFormat[(Int, String)]())

            def parseInput(line: String): (Int, Int) = {…}
            val input = DataSource("hdfs://", DelimitedInputFormat(parseInput))
    32. Interlude: Anonymous Functions

            var fun: ((Int, String)) => String = ...

            fun = { t => t._2 }
            fun = { _._2 }
            fun = { case (i, w) => w }
    33. Map

            val input: DataSet[(Int, String)] = ...

            val mapper = input
              .map { case (a, b) => (a + 2, b) }

            val mapper2 = input
              .flatMap { _._2.split(" ") }

            val filtered = input
              .filter { case (a, b) => a > 3 }
    34. Reduce

            val input: DataSet[(String, Int)] = ...

            val reducer = input
              .groupBy { case (w, _) => w }
              .groupReduce { _.minBy {...} }

            val reducer2 = input
              .groupBy { case (w, _) => w }
              .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
    35. Cross

            val left: DataSet[(String, Int)] = ...
            val right: DataSet[(String, Int)] = ...

            val cross = left cross right
              .map { (l, r) => ... }

            val cross = left cross right
              .flatMap { (l, r) => ... }
    36. Join (Match)

            val counts: DataSet[(String, Int)] = ...
            val names: DataSet[(Int, String)] = ...

            val join = counts
              .join(names)
              .where { case (_, c) => c }.isEqualsTo { case (n, _) => n }
              .map { (l, r) => (l._1, r._2) }

            val join = counts
              .join(names)
              .where { case (_, c) => c }.isEqualsTo { case (n, _) => n }
              .flatMap { (l, r) => ... }
    37. CoGroup

            val counts: DataSet[(String, Int)] = ...
            val names: DataSet[(Int, String)] = ...

            val cogroup = counts
              .cogroup(names)
              .where { case (_, c) => c }.isEqualsTo { case (n, _) => n }
              .map { (l, r) => (l.minBy {...}, r.minBy {...}) }

            val cogroup = counts
              .cogroup(names)
              .where { case (_, c) => c }.isEqualsTo { case (n, _) => n }
              .flatMap { (l, r) => ... }
    38. Creating Data Sinks

            val counts: DataSet[(String, Int)] = ...

            val sink = counts.write("<>", CsvOutputFormat())

            def formatOutput(a: (String, Int)): String = {
              "Word " + a._1 + " count " + a._2
            }
            val sink = counts.write("<>", DelimitedOutputFormat(formatOutput))
    39. Word Count example

            val input = TextFile(textInput)

            val words = input
              .flatMap { _.split(" ") map { (_, 1) } }

            val counts = words
              .groupBy { case (word, _) => word }
              .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }

            val output = counts
              .write(wordsOutput, CsvOutputFormat())

            val plan = new ScalaPlan(Seq(output))
    40. Things not mentioned
        ● There is support for iterations (both in Java and Scala)
        ● Many more data source/sink formats
        ● Look at the examples in the Stratosphere source
        ● Don't be afraid to write on the mailing list and on GitHub:
          – http://stratosphere.eu/quickstart/scala.html
        ● Or come directly to us
    41. End.
