Stratosphere Intro (Java and Scala Interface)

Introduction to Stratosphere
Aljoscha Krettek
DIMA / TU Berlin

What is this?
● Distributed data processing system
● DAG (directed acyclic graph) of sources, sinks, and operators: “data flow”
● Handles distribution, fault tolerance, network transfer

[Figure: example data flow: source → map: “split words” → reduce: “count words” → sink]

Why would I use this?
Automatic parallelization / Because you are told to

[Figure: three parallel instances of the data flow, each with its own source and sink]

So how do I use this?
(from Java)
● How is data represented in the system?
● How do I create data flows?
● Which types of operators are there?
● How do I write operators?
● How do I run the whole shebang?

How do I move my data?
● Data is stored in fields in PactRecord
● Basic data types: PactString, PactInteger, PactDouble, PactFloat, PactBoolean, …
● New data types must implement the Value interface

PactRecord
PactRecord rec = ...

// Read field 0 as a PactInteger and unwrap its value
PactInteger foo = rec.getField(0, PactInteger.class);
int i = foo.getValue();

// Write a new PactInteger into field 1
PactInteger foo2 = new PactInteger(3);
rec.setField(1, foo2);

Creating Data Flows
● Create one or several sources
● Create operators:
  – Input is/are preceding operator(s)
  – Specify a class/object with the operator implementation
● Create one or several sinks:
  – Input is some operator

WordCount Example Data Flow
// Source: read the input as lines of text
FileDataSource source = new FileDataSource(TextInputFormat.class, dataInput, "Input Lines");

// Map: split each line into words
MapContract mapper = MapContract.builder(TokenizeLine.class)
    .input(source)
    .name("Tokenize Lines")
    .build();

// Reduce: count words, keyed on field 0 (the word)
ReduceContract reducer = ReduceContract.builder(CountWords.class, PactString.class, 0)
    .input(mapper)
    .name("Count Words")
    .build();

// Sink: write one "word count" pair per line
FileDataSink out = new FileDataSink(RecordOutputFormat.class, output, reducer, "Word Counts");
RecordOutputFormat.configureRecordFormat(out)
    .recordDelimiter('\n')
    .fieldDelimiter(' ')
    .field(PactString.class, 0)
    .field(PactInteger.class, 1);

Plan plan = new Plan(out, "WordCount Example");

Operator Types
● We call them second order functions (SOF)
● Code inside the operator is the first order function or user defined function (UDF)
● Currently five SOFs: map, reduce, match, cogroup, cross
● The SOF describes how PactRecords are handed to the UDF

Map Operator
● User code receives one record at a time (per call to user code function)
● Not really a functional map since all operators can output an arbitrary number of records
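
The split between second order function and UDF can be sketched in plain Java. This is only an illustration of the idea, not the Stratosphere API: the class MapSofSketch and its methods are hypothetical names, and a java.util.function.Consumer stands in for the collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical sketch, not part of Stratosphere.
final class MapSofSketch {

    // The "second order function": hands each input record to the UDF,
    // one record per call. The UDF may emit zero, one, or many records.
    static <I, O> List<O> map(List<I> input, BiConsumer<I, Consumer<O>> udf) {
        List<O> output = new ArrayList<>();
        for (I record : input) {
            udf.accept(record, output::add);
        }
        return output;
    }

    // A "first order function" (UDF): splits one line into words.
    static void splitWords(String line, Consumer<String> collector) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                collector.accept(word);
            }
        }
    }

    public static void main(String[] args) {
        List<String> words = MapSofSketch.<String, String>map(
            List.of("to be", "", "or not"), MapSofSketch::splitWords);
        System.out.println(words); // [to, be, or, not]
    }
}
```

The empty input line produces no output record and the other lines produce two each, which is why this is closer to a flatMap than to a functional map.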

Map Operator Example
public static class TokenizeLine extends MapStub {
    // Objects are created once and reused for every record
    private final AsciiUtils.WhitespaceTokenizer tokenizer =
        new AsciiUtils.WhitespaceTokenizer();
    private final PactRecord outputRecord = new PactRecord();
    private final PactString word = new PactString();
    private final PactInteger one = new PactInteger(1);

    @Override
    public void map(PactRecord record, Collector<PactRecord> collector) {
        PactString line = record.getField(0, PactString.class);
        this.tokenizer.setStringToTokenize(line);
        // Emit one (word, 1) record per token
        while (tokenizer.next(word)) {
            outputRecord.setField(0, word);
            outputRecord.setField(1, one);
            collector.collect(outputRecord);
        }
    }
}

Reduce Operator
● User code receives a group of records with the same key
● Must specify which fields of a record are the key

Reduce Operator Example
public static class CountWords extends ReduceStub {
    private final PactInteger cnt = new PactInteger();

    @Override
    public void reduce(Iterator<PactRecord> records, Collector<PactRecord> out)
            throws Exception {
        PactRecord element = null;
        int sum = 0;
        // Sum the counts (field 1) of all records in the group
        while (records.hasNext()) {
            element = records.next();
            PactInteger i = element.getField(1, PactInteger.class);
            sum += i.getValue();
        }
        // Reuse the last record: field 0 still holds the word
        cnt.setValue(sum);
        element.setField(1, cnt);
        out.collect(element);
    }
}

Specifying the Key Fields
ReduceContract reducer =
    ReduceContract.builder(
            Foo.class,
            PactString.class, 0)
        .input(mapper)
        .keyField(PactInteger.class, 1)
        .name("Count Words")
        .build();

Cross Operator
● Two input operator
● Cartesian product: every record from left combined with every record from right
● One record from left, one record from right per user code call
● Implement CrossStub
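
The cross semantics can be sketched in plain Java as a nested loop. This is an illustration only and does not use CrossStub or any Stratosphere class; CrossSketch and cross are hypothetical names, and lists stand in for the two inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical sketch, not part of Stratosphere.
final class CrossSketch {

    // Calls the UDF once for every (left, right) pair: the Cartesian product.
    static <L, R, O> List<O> cross(List<L> left, List<R> right, BiFunction<L, R, O> udf) {
        List<O> out = new ArrayList<>();
        for (L l : left) {
            for (R r : right) {
                out.add(udf.apply(l, r));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> pairs = cross(List.of("a", "b"), List.of(1, 2, 3), (s, n) -> s + n);
        System.out.println(pairs); // [a1, a2, a3, b1, b2, b3]
    }
}
```

Two records on the left and three on the right give six calls to the user code, so the output grows with the product of the input sizes.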

Match Operator
● Two input operator with keys
● Join: record from left combined with every record from right with same key
● Implement MatchStub
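
The match semantics correspond to an equi-join, sketched here in plain Java as a simple hash join. The sketch says nothing about how Stratosphere executes a match internally; MatchSketch, match, and demo are hypothetical names, and Map.Entry pairs stand in for records.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical sketch, not part of Stratosphere.
final class MatchSketch {

    // Calls the UDF once for every (left, right) pair whose keys are equal.
    static <L, R, K, O> List<O> match(List<L> left, List<R> right,
                                      Function<L, K> leftKey, Function<R, K> rightKey,
                                      BiFunction<L, R, O> udf) {
        // Index the right input by key
        Map<K, List<R>> index = new HashMap<>();
        for (R r : right) {
            index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        // Probe the index with every left record
        List<O> out = new ArrayList<>();
        for (L l : left) {
            for (R r : index.getOrDefault(leftKey.apply(l), Collections.emptyList())) {
                out.add(udf.apply(l, r));
            }
        }
        return out;
    }

    static List<String> demo() {
        List<Map.Entry<String, Integer>> counts =
            List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("c", 2));
        List<Map.Entry<Integer, String>> names =
            List.of(Map.entry(1, "one"), Map.entry(2, "two"));
        // Join the count of the left record with the id of the right record
        return match(counts, names, c -> c.getValue(), n -> n.getKey(),
                     (c, n) -> c.getKey() + "=" + n.getValue());
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [a=one, b=two, c=two]
    }
}
```

Unlike cross, the user code is only called for pairs with equal keys; a left record without a matching right record produces no call at all.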

CoGroup Operator
● Two input operator with keys
● Records from left combined with all records from right with same key
● User code gets an iterator for left and right records
● Implement CoGroupStub
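
The difference from match can be sketched in plain Java: the user code is called once per key and sees both groups at once. This is an illustration only; CoGroupSketch, coGroup, and demo are hypothetical names, lists stand in for the iterators, and calling the function for keys that occur on only one side (with an empty group for the other side) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical sketch, not part of Stratosphere.
final class CoGroupSketch {

    // Groups both inputs by key and calls the UDF once per key with both groups.
    static <L, R, K, O> List<O> coGroup(List<L> left, List<R> right,
                                        Function<L, K> leftKey, Function<R, K> rightKey,
                                        BiFunction<List<L>, List<R>, O> udf) {
        Map<K, List<L>> leftGroups = new LinkedHashMap<>();
        for (L l : left) {
            leftGroups.computeIfAbsent(leftKey.apply(l), k -> new ArrayList<>()).add(l);
        }
        Map<K, List<R>> rightGroups = new LinkedHashMap<>();
        for (R r : right) {
            rightGroups.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        // Union of keys, in first-seen order (left keys first)
        Set<K> keys = new LinkedHashSet<>(leftGroups.keySet());
        keys.addAll(rightGroups.keySet());

        List<O> out = new ArrayList<>();
        for (K key : keys) {
            out.add(udf.apply(
                leftGroups.getOrDefault(key, Collections.emptyList()),
                rightGroups.getOrDefault(key, Collections.emptyList())));
        }
        return out;
    }

    static List<String> demo() {
        List<Map.Entry<String, Integer>> left =
            List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        List<Map.Entry<String, String>> right =
            List.of(Map.entry("a", "x"), Map.entry("c", "y"));
        // Report the size of the left and right group for each key
        return coGroup(left, right, l -> l.getKey(), r -> r.getKey(),
                       (ls, rs) -> ls.size() + ":" + rs.size());
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [2:1, 1:0, 0:1]
    }
}
```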

How to execute a data flow plan
● Either use LocalExecutor: LocalExecutor.execute(plan)
● Or implement PlanAssembler.getPlan(String... args) and run on a local cluster or proper cluster
● See: http://stratosphere.eu/quickstart/ and http://stratosphere.eu/docs/gettingstarted.html

Getting Started

https://github.com/stratosphere/stratosphere
https://github.com/stratosphere/stratosphere-quickstart


And Now for Something Completely Different
val input = TextFile(textInput)
val words = input
  .flatMap { _.split(" ") map { (_, 1) } }
val counts = words
  .groupBy { case (word, _) => word }
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
val output = counts
  .write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))

(Very) Short Introduction to Scala


Anatomy of a Scala Class
package foo.bar
import something.else
class Job(arg1: Int) {
  def map(in: Int): String = {
    val i: Int = in + 2
    var a = "Hello"
    i.toString
  }
}

Singletons
● Similar to Java singletons and/or static methods

object Job {
  def main(args: Array[String]): Unit = {
    println("Hello World")
  }
}

Collections
val a = Seq(1, 2, 4)
List("Hallo", 2)
Array(2, 3)
Map(1 -> "1", 2 -> "2")
val b = a map { x => x + 2 }
val c = a map { _ + 2 }
// Same as c, written with explicit method-call syntax
val d = a.map({ _ + 2 })

Generics and Tuples
val a: Seq[Int] = Seq(1, 2, 4)
val tup = (3, "a")
val tup2: (Int, String) = (3, "a")

Stratosphere Scala Front End


Skeleton of a Stratosphere Program
● Input: a text file/JDBC source/CSV, etc.
  – loaded in internal representation: the DataSet
● Transformations on the DataSet
  – map, reduce, join, etc.
● Output: program results in a DataSink
  – Text file, JDBC, CSV, etc.

The Almighty DataSet
● Operations are methods on DataSet[A]
● Working with DataSet[A] feels like working with Scala collections
● DataSet[A] is not an actual collection but represents computation on a collection
● Stringing together operations creates a data flow graph that can be executed

An Important Difference
Immediately executed:

val input: List[String] = ...
val mapped = input.map { s => (s, 1) }

Executed when data flow is executed:

val input: DataSet[String] = ...
val mapped = input.map { s => (s, 1) }
val result = mapped.write("file", ...)
val plan = new Plan(result)
execute(plan)
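
The same distinction can be observed in plain Java, where java.util.stream pipelines are also only descriptions until a terminal operation runs. Streams are used here purely as an analogy and are not the Stratosphere API; LazyPlanSketch and demo are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not part of Stratosphere.
final class LazyPlanSketch {

    // Returns how often the map function had run before and after execution.
    static int[] demo() {
        AtomicInteger calls = new AtomicInteger();
        List<String> input = List.of("a", "b", "c");

        // Building the pipeline only describes the computation
        Stream<String> mapped = input.stream().map(s -> {
            calls.incrementAndGet();
            return s + "!";
        });
        int beforeExecution = calls.get();

        // The terminal operation plays the role of execute(plan)
        List<String> result = mapped.collect(Collectors.toList());
        int afterExecution = calls.get();

        System.out.println(result); // [a!, b!, c!]
        return new int[] { beforeExecution, afterExecution };
    }

    public static void main(String[] args) {
        int[] calls = demo();
        System.out.println(calls[0] + " calls before, " + calls[1] + " calls after"); // 0 calls before, 3 calls after
    }
}
```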

Usable Data Types
● Primitive types
● Tuples
● Case classes
● Custom data types that implement the Value interface

Creating Data Sources
val input = TextFile("file://")

val input: DataSet[(Int, String)] =
  DataSource("hdfs://",
    CsvInputFormat[(Int, String)]())

def parseInput(line: String): (Int, Int) = {…}
val input = DataSource("hdfs://",
  DelimitedInputFormat(parseInput))

Interlude: Anonymous Functions
var fun: ((Int, String)) => String = ...
fun = { t => t._2 }
fun = { _._2 }
fun = { case (i, w) => w }


Map
val input: DataSet[(Int, String)] = ...
val mapper = input
  .map { case (a, b) => (a + 2, b) }
val mapper2 = input
  .flatMap { _._2.split(" ") }
val filtered = input
  .filter { case (a, b) => a > 3 }

Reduce
val input: DataSet[(String, Int)] = ...
val reducer = input
  .groupBy { case (w, _) => w }
  .groupReduce { _.minBy {...} }
val reducer2 = input
  .groupBy { case (w, _) => w }
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }

Cross
val left: DataSet[(String, Int)] = ...
val right: DataSet[(String, Int)] = ...
val cross = left.cross(right)
  .map { (l, r) => ... }
val cross2 = left.cross(right)
  .flatMap { (l, r) => ... }

Join (Match)
val counts: DataSet[(String, Int)] = ...
val names: DataSet[(Int, String)] = ...
val join = counts
  .join(names)
  .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
  .map { (l, r) => (l._1, r._2) }
val join2 = counts
  .join(names)
  .flatMap { (l, r) => ... }

CoGroup
val counts: DataSet[(String, Int)] = ...
val names: DataSet[(Int, String)] = ...
val cogroup = counts
  .cogroup(names)
  .map { (l, r) => (l.minBy {...}, r.minBy {...}) }
val cogroup2 = counts
  .cogroup(names)
  .flatMap { (l, r) => ... }

Creating Data Sinks
val counts: DataSet[(String, Int)] = ...
val sink = counts.write("<>", CsvOutputFormat())

def formatOutput(a: (String, Int)): String = {
  "Word " + a._1 + " count " + a._2
}
val sink = counts.write("<>",
  DelimitedOutputFormat(formatOutput))

Word Count example
val input = TextFile(textInput)
val words = input
  .flatMap { _.split(" ") map { (_, 1) } }
val counts = words
  .groupBy { case (word, _) => word }
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
val output = counts
  .write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))

Things not mentioned
● There is support for iterations (both in Java and Scala)
● Many more data source/sink formats
● Look at the examples in the stratosphere source
● Don't be afraid to write on the mailing list and on GitHub:
  – http://stratosphere.eu/quickstart/scala.html
● Or come directly to us
