
- 1. Flink Forward 2015, Berlin, October 13, 2015
  Data flow vs. procedural programming: How to put your algorithms into Flink
  Mikio L. Braun, Zalando SE (@mikiobraun)
- 2. Python vs Flink
  ● Coming from Python, what are the differences in programming style I have to know to get started with Flink?
- 3. Programming how we're used to
  ● Computing a sum
  ● Tools at our disposal:
    – variables
    – control flow (loops, if)
    – function calls as the basic unit of abstraction

    def computeSum(a):
        sum = 0
        for i in range(len(a)):
            sum += a[i]
        return sum
- 4. Data Analysis Algorithms
  ● Let's consider centering, which becomes

    def centerPoints(xs):
        sum = xs[0].copy()
        for i in range(1, len(xs)):
            sum += xs[i]
        mean = sum / len(xs)
        for i in range(len(xs)):
            xs[i] -= mean
        return xs

  or, with NumPy, even just

    xs - xs.mean(axis=0)
- 5. Don't use for-loops
  ● Put your data into a matrix
  ● Don't use for loops
- 6. Least Squares Regression
  ● Compute w = (X'X + λI)⁻¹ X'y, which becomes

    def lsr(X, y, lam):
        d = X.shape[1]
        C = X.T.dot(X) + lam * np.eye(d)
        w = np.linalg.solve(C, X.T.dot(y))
        return w

  ● What you learn is thinking in matrices, breaking computations down in terms of matrix algebra
- 7. Basic tools
  ● Basic procedural programming paradigm
  ● Variables
  ● Ordered arrays and efficient functions on those
  ● Advantage: very familiar, close to the math
  ● Disadvantage: hard to scale
- 8. Parallel Data Flow
  ● Often you have stuff like

    for i in someSet:
        map x[i] to y[i]

  which is inherently easy to scale
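A minimal Python sketch of why this pattern scales: because the elements are independent, the sequential loop can be swapped for a parallel map without changing the result. The function `f` and the data are illustrative only.

```python
from multiprocessing.dummy import Pool  # thread pool; a process pool works the same way

def f(x):
    # any per-element transformation; elements are independent of each other
    return x * x

data = [1, 2, 3, 4]

# sequential form: for i in range(len(data)): y[i] = f(data[i])
# parallel form: the runtime decides how to distribute the work
with Pool(2) as pool:
    result = pool.map(f, data)

print(result)  # [1, 4, 9, 16]
```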
- 9. New Paradigm
  ● Basic building block is an (unordered) set
  ● Basic operations are inherently parallel
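A small Python illustration of the paradigm: operating on an unordered set with a reduce. The only requirement on the reducer is that it is associative and commutative, so a runtime is free to combine elements in any order, on any worker.

```python
from functools import reduce

xs = {3, 1, 4, 5}  # an unordered set: no indices, no guaranteed iteration order

# associative + commutative reducer => order of combination does not matter
total = reduce(lambda x, y: x + y, xs)
print(total)  # 13
```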
- 10. Computing, Data Flow Style
  ● Computing a sum:

    sum(xs) = xs.reduce((x, y) => x + y)

  ● Computing a mean:

    mean(xs) = xs.map(x => (x, 1))
                 .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
                 .map(xc => xc._1 / xc._2)
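The mean trick above, transliterated to plain Python so the (value, count) pairing is easy to step through: each element becomes a pair, the reduce adds partial sums and counts componentwise, and the final division happens once at the end.

```python
from functools import reduce

xs = [1.0, 2.0, 3.0, 6.0]

# map each element to a (partial_sum, count) pair ...
pairs = map(lambda x: (x, 1), xs)

# ... reduce pairs componentwise: sums add up, counts add up ...
s, n = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs)

# ... and divide once at the end
mean = s / n
print(mean)  # 3.0
```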
- 11. Apache Flink
  ● Data flow system
  ● Basic building block is a DataSet[X]
  ● For execution, sets up all computing nodes and streams the data through them
- 12. Apache Flink: Getting Started
  ● Use the Scala API
  ● Minimal project with Maven or Gradle (build tools)
  ● Use an IDE like IntelliJ
  ● Always import org.apache.flink.api.scala._
- 13. Centering (First Try)

    def computeMean(xs: DataSet[DenseVector]) =
      xs.map(x => (x, 1))
        .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
        .map(xc => xc._1 / xc._2)

    def centerPoints(xs: DataSet[DenseVector]) = {
      val mean = computeMean(xs)
      xs.map(x => x - mean)  // does not work!
    }

  ● You cannot nest DataSet operations!
- 14. Sorry, restrictions apply
  ● Variables hold (lazy) computations
  ● You can't work with DataSets inside the operations
  ● Even if the result is just a single element, it's a DataSet[Elem]
  ● So what to do?
    – cross joins
    – broadcast variables
- 15. Centering (Second Try)

    def computeMean(xs: DataSet[DenseVector]) =
      xs.map(x => (x, 1))
        .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
        .map(xc => xc._1 / xc._2)

    def centerPoints(xs: DataSet[DenseVector]) = {
      val mean = computeMean(xs)
      xs.crossWithTiny(mean).map(xm => xm._1 - xm._2)
    }

  ● Works, but seems excessive because the mean is copied to each data element.
- 16. Broadcast Variables
  ● Side information sent to all worker nodes
  ● Can be a DataSet
  ● Gets accessed as a Java collection
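Conceptually, a broadcast variable is just a side value that every worker invocation of the mapper sees alongside its own element. A hedged Python sketch, where `map_with_bc_var` is a hypothetical helper (not a Flink API) emulating that behavior in a single process:

```python
def map_with_bc_var(data, bc_var, fun):
    # every "worker" call of `fun` receives the same broadcast value
    # alongside its own element
    return [fun(x, bc_var) for x in data]

xs = [1.0, 2.0, 3.0]
mean = sum(xs) / len(xs)          # the side information to broadcast
centered = map_with_bc_var(xs, mean, lambda x, m: x - m)
print(centered)  # [-1.0, 0.0, 1.0]
```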
- 17. Broadcast Variables

    class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O)
        extends RichMapFunction[T, O] {

      var broadcastVariable: B = _

      @throws(classOf[Exception])
      override def open(configuration: Configuration): Unit = {
        broadcastVariable = getRuntimeContext
          .getBroadcastVariable[B]("broadcastVariable")
          .get(0)
      }

      override def map(value: T): O = {
        fun(value, broadcastVariable)
      }
    }
- 18. Centering (Third Try)

    def computeMean(xs: DataSet[DenseVector]) =
      xs.map(x => (x, 1))
        .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
        .map(xc => xc._1 / xc._2)

    def centerPoints(xs: DataSet[DenseVector]) = {
      val mean = computeMean(xs)
      xs.mapWithBcVar(mean)((x, m) => x - m)
    }
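The whole third-try pipeline, mirrored in plain Python to make the structure explicit: the mean is computed with the map/reduce trick from slide 10, then used as the "broadcast" side value in a final map. This is a single-process analogue, not Flink code.

```python
from functools import reduce

def compute_mean(xs):
    # same shape as the DataSet version: map to (x, 1), reduce, divide
    s, n = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                  map(lambda x: (x, 1), xs))
    return s / n

def center_points(xs):
    mean = compute_mean(xs)        # the "broadcast" value
    return [x - mean for x in xs]  # map each element with the shared mean

print(center_points([2.0, 4.0, 9.0]))  # [-3.0, -1.0, 4.0]
```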
- 19. Intermediate Results
  ● The procedural pattern

    x = someComputation()
    y = someOtherComputation()
    z = someComputationOn(dataSet, x)
    result = moreComputationOn(y, z)

  becomes, with broadcast variables:

    val x = someDataSetComputation()
    val y = someOtherDataSetComputation()
    val z = dataSet.mapWithBcVar(x)((d, x) => …)
    val result = anotherDataSet.mapWithBcVar((y, z)) { (d, yz) =>
      val (y, z) = yz
      …
    }
- 20. Matrix Algebra
  ● No ordered sets per se in a data flow context.
- 21. Vector operations by explicit joins
  ● Encode a vector (a1, a2, …, an) as the set {(1, a1), (2, a2), …, (n, an)}
  ● Addition:

    a.join(b).where(0).equalTo(0)
     .map(ab => (ab._1._1, ab._1._2 + ab._2._2))

  after the join: {((1, a1), (1, b1)), ((2, a2), (2, b2)), …}
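The index-value encoding and join-based addition, sketched in plain Python. The `vector_add` helper is illustrative: it plays the role of the join on field 0 followed by the map that adds the matched values.

```python
# encode vectors as sets of (index, value) pairs
a = {(1, 10.0), (2, 20.0), (3, 30.0)}
b = {(1, 1.0), (2, 2.0), (3, 3.0)}

def vector_add(a, b):
    # "join" on the index (field 0), then add the joined values
    b_by_index = dict(b)
    return {(i, v + b_by_index[i]) for (i, v) in a}

print(sorted(vector_add(a, b)))  # [(1, 11.0), (2, 22.0), (3, 33.0)]
```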
- 22. Back to Least Squares Regression
  ● Two operations: computing X'X and X'y

    def lsr(xys: DataSet[(DenseVector, Double)]) = {
      val XTX = xys.map(xy => xy._1.outer(xy._1)).reduce(_ + _)
      val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _)
      XTX.mapWithBcVar(XTY) { (xtx, xty) =>
        val weight = xtx \ xty  // solve xtx * weight = xty
        weight
      }
    }
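A Python analogue of the two map/reduce passes, restricted (as an assumption for readability) to the one-dimensional case where X'X and X'y are scalars, so the final solve collapses to a division. The data here is made up for illustration.

```python
from functools import reduce

# (x, y) pairs; here y = 2x, so the fitted weight should be 2.0
xys = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# pass 1: X'X as a map over the pairs followed by a reduce
xtx = reduce(lambda a, b: a + b, map(lambda xy: xy[0] * xy[0], xys))

# pass 2: X'y the same way
xty = reduce(lambda a, b: a + b, map(lambda xy: xy[0] * xy[1], xys))

# in the matrix case this division would be a linear solve
w = xty / xtx
print(w)  # 2.0
```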
- 23. Summary and Outlook
  ● Procedural vs. data flow
    – basic building blocks: elementwise operations on unordered sets
    – operations can't be nested
    – combine intermediate results via broadcast variables
  ● Iterations
  ● Beware of TypeInformation implicits.
