
Stratosphere System Overview Big Data Beers Berlin. 20.11.2013

Stratosphere is the next-generation big data processing engine.

These slides introduce the most important features of Stratosphere by comparing it with Apache Hadoop.

For more information, visit stratosphere.eu

Stratosphere grew out of university research and is now a fully open-source, community-driven project with a focus on stability and usability.

Published in: Business, Technology, Education

Transcript of "Stratosphere System Overview Big Data Beers Berlin. 20.11.2013"

  1. Stratosphere: System Overview. Robert Metzger, mail@robertmetzger.de, Twitter: @rmetzger_. Big Data Beers Meetup, Nov. 19th, 2013
  2. Stratosphere
     … is a distributed data processing engine
     … automatically handles parallelization
     … brings database technology to the world of big data
  3. Overview
     ● Extends MapReduce with more operators
       ○ Known from Hadoop: map, reduce
       ○ New in Stratosphere: cross, join, cogroup
     ● Support for advanced data flow graphs
       [Diagram: a single M → R flow (known from Hadoop) next to a graph of several M, R and J operators (new in Stratosphere)]
     ● Compiler/Optimizer, Java/Scala Interface, YARN
  4. Stratosphere System Stack
     ● APIs: Java API, Scala API, Meteor, ... (on Stratosphere); Hive (on Hadoop MR)
     ● Stratosphere Optimizer
     ● Stratosphere Runtime, alongside Hadoop MR
     ● Cluster Manager: YARN, Direct, EC2
     ● Storage: Local Files, HDFS, S3, ...
  5. Stratosphere in a Cluster
     ● Operators are executed over the whole cluster
     ● Side by side with Hadoop
     ● Scales by adding more nodes
     ● Support for YARN is in development
     ● We have a LocalExecutor
     [Diagram: a master node (job submission, JobManager, resource management, compiler, web interface) and 4 worker nodes, each running a Stratosphere TaskManager next to a Hadoop DataNode]
  6. Outline: (1) Data Flows, (2) Optimizer, (3) Iterations, (4) Scala Interface
  7. Data Flows: Execution Models
     ● Apache Hadoop MR is limited to one data flow: M → R
     ● One of many possible data flows in Stratosphere: a graph combining M, R and J operators
  8. Complex Data Flows in Hadoop
     [Diagram: the example job (filtering M, grouping R, joining J, grouping R) and its implementation in Hadoop as a chain of several M and R jobs]
  9. Data Flows: Lessons Learned
     1. Most tasks do not fit the MapReduce model
     2. Very expensive
        ○ Always go to disk and HDFS
     3. Tedious to implement
        ○ Custom data types and file formats between jobs
     That's why higher-level abstractions for MR exist.
  10. Advanced Data Flows in Stratosphere
      ● Data flow graphs are supported natively
      ● Stratosphere only writes to disk if necessary; otherwise data stays in memory
      [Diagram: data flow graph of M, R, J and R operators]
  11. Skeleton of a Stratosphere Program
      ● Input: text file, JDBC source, CSV, etc.
      ● Transformations
        ○ map, reduce, join, iterate, etc.
      ● Output: to file, etc.
      ● Data Types
        ○ PactRecord: tuples with n fields
        ○ Custom data types for vectors, images, audio (only serialization and comparison are required)
  12. Data Flows: Code Example
      [Diagram: data flow of M, R, J and R operators]

          FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
          FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);

          // Filter mapper
          MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
              .input(orders).build();

          // Define group key
          ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
              .input(customers)
              .keyField(PactInteger.class, 0).build();

          MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
              .input1(ordersFiltered)
              .input2(groupedCustomers).build();

          ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
              .input(joined)
              .keyField(PactInteger.class, 0).build();

          FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy);
  13. Map Stub and PactRecord by Example

          MapContract ordersFiltered = MapContract.builder(FilterOrders.class)
              .input(orders).build();

          public class FilterOrders extends MapStub {
              @Override
              public void map(PactRecord order, Collector<PactRecord> out) throws Exception {
                  // Keep only the orders placed on the given date
                  PactString date = order.getField(Orders.DATE_IDX, PactString.class);
                  if (date.getValue().equals("11.20.2013")) {
                      out.collect(order);
                  }
              }
          }
  14. Outline: (1) Data Flows, (2) Optimizer, (3) Iterations, (4) Scala Interface
  15. Joins in Hadoop
      ● Map (Broadcast) Join
      ● Reduce (Repartition) Join
      ● Which strategy to choose?
      ● How to configure it?
      Lessons Learned:
      ● Joins do not naturally fit MapReduce
      ● Very time-consuming to implement
      ● Hand optimization necessary
      Source: Sebastian Schelter, TU Berlin
  16. Joins with Stratosphere
      ● Natively implemented in the system
      ● Optimizer decides join strategy:
        ○ Sort-merge join
        ○ Hybrid hash join
        ○ Data shipping strategy
      ● Hybrid hash join starts in memory and gracefully degrades to disk
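The two local join strategies named on this slide can be sketched in plain Java. This is only an illustration of the general techniques, not Stratosphere code: the class `JoinSketch`, the helper `row`, and the `key:left|right` output format are all hypothetical, and the hash join shown here is a simple in-memory one without the spilling behaviour of a hybrid hash join.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these names come from Stratosphere. */
final class JoinSketch {

    /** Builds a (key, value) row. */
    static Map.Entry<Integer, String> row(int key, String value) {
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    /** Hash join: build a hash table on one input, then probe it with the other. */
    static List<String> hashJoin(List<Map.Entry<Integer, String>> build,
                                 List<Map.Entry<Integer, String>> probe) {
        Map<Integer, List<String>> table = new HashMap<>();
        for (Map.Entry<Integer, String> b : build) {
            table.computeIfAbsent(b.getKey(), k -> new ArrayList<>()).add(b.getValue());
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> p : probe) {
            for (String b : table.getOrDefault(p.getKey(), Collections.emptyList())) {
                out.add(p.getKey() + ":" + b + "|" + p.getValue());
            }
        }
        return out;
    }

    /** Sort-merge join: sort both inputs by key, then merge runs of equal keys. */
    static List<String> sortMergeJoin(List<Map.Entry<Integer, String>> left,
                                      List<Map.Entry<Integer, String>> right) {
        List<Map.Entry<Integer, String>> l = new ArrayList<>(left);
        List<Map.Entry<Integer, String>> r = new ArrayList<>(right);
        l.sort(Map.Entry.comparingByKey());
        r.sort(Map.Entry.comparingByKey());
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < l.size() && j < r.size()) {
            int cmp = Integer.compare(l.get(i).getKey(), r.get(j).getKey());
            if (cmp < 0) {
                i++;
            } else if (cmp > 0) {
                j++;
            } else {
                int key = l.get(i).getKey();
                int runStart = j;
                // Pair every left row of this key with every right row of this key.
                while (i < l.size() && l.get(i).getKey() == key) {
                    for (j = runStart; j < r.size() && r.get(j).getKey() == key; j++) {
                        out.add(key + ":" + l.get(i).getValue() + "|" + r.get(j).getValue());
                    }
                    i++;
                }
            }
        }
        return out;
    }
}
```

Both methods return the same set of joined rows; the sort-merge variant additionally emits them in key order, which is the property a downstream grouping step can reuse.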
  17. Optimizer Magic
      Recap example job: [Diagram: filtering (M), grouping (R), joining (J), grouping (R)]
      ● We require a grouped input for the reducer (sorting or hashing)
      ● Optimizer chooses sort-merge join → no sorting for reduce
  18. Stratosphere Optimizer
      ● Cost-based optimizer
        ○ Enumerate different execution plans
        ○ Choose the cheapest one
      ● Optimizer collects statistics
        ○ Size of input and output
      ● Operators (Map, Reduce, Join) tell how they modify fields
      ● In-memory chaining of operators
      ● Memory distribution
      ⇒ Focus on your application logic rather than parallel execution.
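The "enumerate plans, choose the cheapest" idea can be illustrated with a toy cost model for the data shipping strategy of a join. Everything here is an assumption made for the sketch: the class `PlanChooserSketch`, the three strategies, and the cost formula (bytes sent over the network) are hypothetical and are not taken from the Stratosphere optimizer.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical toy optimizer; not the Stratosphere cost model. */
final class PlanChooserSketch {

    /** Candidate data shipping strategies for a two-input join. */
    enum Strategy { BROADCAST_FIRST, BROADCAST_SECOND, REPARTITION_BOTH }

    /**
     * Enumerates all candidate plans with an estimated cost in bytes shipped.
     * Assumption: broadcasting sends one input to every node, while
     * repartitioning sends each input over the network once.
     */
    static Map<Strategy, Long> estimateCosts(long firstBytes, long secondBytes, int nodes) {
        Map<Strategy, Long> costs = new EnumMap<>(Strategy.class);
        costs.put(Strategy.BROADCAST_FIRST, firstBytes * nodes);
        costs.put(Strategy.BROADCAST_SECOND, secondBytes * nodes);
        costs.put(Strategy.REPARTITION_BOTH, firstBytes + secondBytes);
        return costs;
    }

    /** Picks the plan with the lowest estimated cost. */
    static Strategy cheapest(long firstBytes, long secondBytes, int nodes) {
        Map<Strategy, Long> costs = estimateCosts(firstBytes, secondBytes, nodes);
        return Collections.min(costs.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```

With these assumptions, a tiny first input joined with a large second input on 4 nodes selects `BROADCAST_FIRST`, while two equally large inputs select `REPARTITION_BOTH`. The input sizes play the role of the statistics mentioned on the slide.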
  19. Outline: (1) Data Flows, (2) Optimizer, (3) Iterations, (4) Scala Interface
  20. Algorithms that need iterations
      ● K-Means
      ● Gradient descent
      ● Page-Rank
      ● Logistic Regression
      ● Path algorithms on graphs
      ● Graph communities / dense sub-components
      ● Inference (belief propagation)
  21. Why Iterations?
      ● Many algorithms loop over the data
        ○ Machine learning: iteratively refine the model
        ○ Graph processing: propagate information hop by hop
      [Figure: Connected Components example, showing the vertex labels in the initial input, after the 1st iteration and after the 2nd iteration]
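The "propagate information hop by hop" pattern can be sketched for Connected Components in plain Java. This is a single-machine illustration under stated assumptions, not Stratosphere code: the class `ConnectedComponentsSketch` is hypothetical, vertices are assumed to be numbered from 0, and each component ends up labelled with its smallest vertex id.

```java
import java.util.Arrays;

/** Hypothetical single-machine illustration of iterative label propagation. */
final class ConnectedComponentsSketch {

    /**
     * Returns, for every vertex 0..vertexCount-1, the smallest vertex id in its
     * connected component. Edges are undirected pairs {a, b}.
     */
    static int[] components(int vertexCount, int[][] edges) {
        int[] label = new int[vertexCount];
        for (int v = 0; v < vertexCount; v++) {
            label[v] = v; // every vertex starts in its own component
        }
        boolean changed = true;
        while (changed) { // one pass of this loop corresponds to one iteration
            changed = false;
            int[] next = Arrays.copyOf(label, label.length);
            for (int[] e : edges) {
                // Each endpoint adopts the smaller label seen in the previous iteration.
                int min = Math.min(label[e[0]], label[e[1]]);
                if (min < next[e[0]]) { next[e[0]] = min; changed = true; }
                if (min < next[e[1]]) { next[e[1]] = min; changed = true; }
            }
            label = next;
        }
        return label;
    }
}
```

For the chain 0-1-2-3, the pair 4-5, and the isolated vertex 6, `components(7, new int[][]{{0, 1}, {1, 2}, {2, 3}, {4, 5}})` returns `[0, 0, 0, 0, 4, 4, 6]`. Because labels travel only one hop per pass, the loop runs until no label changes, which is why such algorithms need iteration support.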
  22. Iterations in Hadoop
      ● Loop is outside the system
        ○ Hard to program
        ○ Very poor performance
      ● Usually each iteration is more than a single map and reduce!
      [Diagram: a driver spawns the 1st, 2nd, ..., n-th iteration, each as a separate M → R job]
  23. Iterations in Stratosphere
      ● Loop is inside the system
        ○ Easy to program
        ○ Huge performance gains
      [Diagram: an Iterate operator wrapping a data flow of M, C and R operators]
  24. Outline: (1) Data Flows, (2) Optimizer, (3) Iterations, (4) Scala Interface
  25. Scala
      ● Functional, object-oriented programming language
      ● ScaLa = Scalable Language
      ● Very productive (few LOC)
      ● Feels like a scripting language
      ● No more UDFs
      ● Easy to integrate
      ● Runs on the JVM and is compatible with regular Java classes
      ● Basis for developing embedded domain-specific languages (DSLs)
  26. Do more, write less!

      Scala:

          class Person(val firstName: String, val lastName: String)

      Java:

          public class Person {
              private final String firstName;
              private final String lastName;

              public Person(String firstName, String lastName) {
                  this.firstName = firstName;
                  this.lastName = lastName;
              }

              public String getFirstName() {
                  return firstName;
              }

              public String getLastName() {
                  return lastName;
              }
          }
  27. Let the code speak

          val input = TextFile(textInput)
          val words = input.flatMap { line => line.split(" ") }
          val counts = words
            .groupBy { word => word }
            .count()
          val output = counts.write(wordsOutput, CsvOutputFormat())
          val plan = new ScalaPlan(Seq(output))
  28. Example in Scala
      [Diagram: data flow of M, R, J and R operators]

      Java:

          FileDataSource customers = new FileDataSource(TextInputFormat.class, customersPath);
          FileDataSource orders = new FileDataSource(TextInputFormat.class, ordersPath);
          MapContract ordersFiltered = MapContract.builder(FilterOrders.class).input(orders).build();
          ReduceContract groupedCustomers = ReduceContract.builder(GroupCustomers.class)
              .input(customers)
              .keyField(PactInteger.class, 0)
              .build();
          MatchContract joined = MatchContract.builder(JoinOnCustomerid.class, PactInteger.class, 0, 0)
              .input1(ordersFiltered).input2(groupedCustomers).build();
          ReduceContract orderBy = ReduceContract.builder(MaxSum.class)
              .input(joined)
              .keyField(PactInteger.class, 0)
              .build();
          FileDataSink sink = new FileDataSink(RecordOutputFormat.class, outputPath, orderBy, "output: word counts");

      Scala:

          val customers = DataSource(customersPath, CsvInputFormat[Customer])
          val orders = DataSource(ordersPath, CsvInputFormat[Order])
          val ordersFiltered = orders filter { order => order.date.equals("11.20.2013") }
          val groupedCustomers = customers groupBy { cust => cust.zip } reduceGroup { grp =>
            (grp.buffered.head.zip, grp.maxBy { _.total })
          }
          val joined = ordersFiltered
            .join(groupedCustomers)
            .where { ord => ord.c_id }
            .isEqualTo { cust => cust._1 }
            .map { (orders, cust) => cust }
          val max = joined groupBy { cust => cust.category_id } reduceGroup { _.maxBy { _.sum } }
          val output = max.write(wordsOutput, DelimitedOutputFormat(formatOutput.tupled))
          val plan = new ScalaPlan(Seq(output), "BDB Example")
  29. Summary: Feature Matrix
      Stratosphere: Database inspired Big Data Analytics

                      Map Reduce            Stratosphere
      Operators       Map, Reduce           Map, Reduce (multiple sort keys), Cross, Join, CoGroup, Union, Iterate, Iterate Delta
      Composition     Only MapReduce        Arbitrary data flows
      Data Exchange   Batch through disk    Pipelined, in-memory (automatic spilling to disk)
  30. Get In Touch
      Stratosphere is the next-generation open source Big Data Analytics Platform.
      Quickstart: http://stratosphere.eu/quickstart
      Website: http://stratosphere.eu
      GitHub: https://github.com/stratosphere
      Mailing List: https://groups.google.com/d/forum/stratosphere-dev
      Twitter: @stratosphere_eu