SlideShare a Scribd company logo
1 of 22
Download to read offline
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 1
Flink Meetup #8
Data flow vs. procedural
programming: How to put your
algorithms into Flink
June 23, 2015
Mikio L. Braun
@mikiobraun
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 2
Programming how we're used to
● Computing a sum
● Tools at our disposal:
– variables
– control flow (loops, if)
– function calls as basic piece of abstraction
def computeSum(a):
sum = 0
for i in range(len(a))
sum += a[i]
return sum
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 3
Data Analysis Algorithms
Let's consider centering
becomes
or even just
def centerPoints(xs):
sum = xs[0].copy()
for i in range(1, len(xs)):
sum += xs[i]
mean = sum / len(xs)
for i in range(len(xs)):
xs[i] -= mean
return xs
xs -
xs.mean(axis=0)
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 4
Don't use for-loops
● Put your data into a matrix
● Don't use for loops
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 5
Least Squares Regression
● Compute
● Becomes
What you learn is thinking in matrices, breaking
down computations in terms of matrix algebra
def lsr(X, y, lam):
d = X.shape[1]
C = X.T.dot(X) + lam * pl.eye(d)
w = np.linalg.solve(C, X.T.dot(y))
return w
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 6
Basic tools
Advantage
– very familiar
– close to math
Disadvantage
– hard to scale
● Basic procedural programming paradigm
● Variables
● Ordered arrays and efficient functions on those
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 7
Parallel Data Flow
Often you have stuff like
Which is inherently easy to scale
for i in someSet:
map x[i] to y[i]
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 8
New Paradigm
● Basic building block is an (unordered) set.
● Basic operations inherently parallel
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 9
Computing, Data Flow Style
Computing a sum
Computing a mean
sum(x) = xs.reduce((x,y) => x + y)
mean(x) = xs.map(x => (x,1))
.reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
.map(xc => xc._1 / xc._2)
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 10
Apache Flink
● Data Flow system
● Basic building block is a DataSet[X]
● For execution, sets up all computing nodes,
streams through data
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 11
Apache Flink: Getting Started
● Use Scala API
● Minimal project with Maven (build tool) or
Gradle
● Use an IDE like IntelliJ
● Always import
org.apache.flink.api.scala._
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 12
Centering (First Try)
def computeMeans(xs: DataSet[DenseVector]) =
xs.map(x => (x,1))
.reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
.map(xc => xc._1 / xc._2)
def centerPoints(xs: DataSet[DenseVector]) = {
val mean = computeMean(xs)
xs.map(x => x – mean)
}
You cannot nest DataSet operations!
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 13
Sorry, restrictions apply.
● Variables hold (lazy) computations
● You can't work with sets within the operations
● Even if result is just a single element, it's a
DataSet[Elem].
● So what to do?
– cross joins
– broadcast variables
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 14
Centering (Second Try)
Works, but seems excessive because the mean
is copied to each data element.
def computeMeans(xs: DataSet[DenseVector]) =
xs.map(x => (x,1))
.reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
.map(xc => xc._1 / xc._2)
def centerPoints(xs: DataSet[DenseVector]) = {
val mean = computeMean(xs)
xs.crossWithTiny(mean).map(xm => xm._1 – xm._2)
}
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 15
Broadcast Variables
● Side information sent to all worker nodes
● Can be a DataSet
● Gets accessed as a Java collection
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 16
class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O)
extends RichMapFunction[T, O] {
var broadcastVariable: B = _
@throws(classOf[Exception])
override def open(configuration: Configuration): Unit = {
broadcastVariable = getRuntimeContext
.getBroadcastVariable[B]("broadcastVariable")
.get(0)
}
override def map(value: T): O = {
fun(value, broadcastVariable)
}
}
Broadcast Variables
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 17
Centering (Third Try)
def computeMeans(xs: DataSet[DenseVector]) =
xs.map(x => (x,1))
.reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2))
.map(xc => xc._1 / xc._2)
def centerPoints(xs: DataSet[DenseVector]) = {
val mean = computeMean(xs)
xs.mapWithBcVar(mean).map((x, m) => x – m)
}
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 18
Intermediate Results pattern
val x = someDataSetComputation()
val y = someOtherDataSetComputation()
val z = dataSet.mapWithBcVar(x)((d, x) => …)
val result = anotherDataSet.mapWithBcVar((y,z)) {
(d, yz) =>
val (y,z) = yz
…
}
x = someComputation()
y = someOtherComputation()
z = someComputationOn(dataSet, x)
result = moreComputationOn(y, z)
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 19
Matrix Algebra
● No ordered sets per se in Data Flow context.
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 20
Vector operations by explicit joins
● Encode vector (a1, a2, …, an) with
{(1, a1), (2, a2), … (n, an)}
● Addition:
– a.join(b).where(0).equalTo(0)
.map((ab) => (ab._1._1, ab._1._2 + ab._2._2))
after join: {((1, a1), (1, b1)), ((2, a1), (2, b1)), … }
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 21
Back to Least Squares Regression
Two operations: computing X'X and X'Y
def lsr(xys: DataSet[(DenseVector, Double)]) = {
val XTX = xs.map(x => x.outer(x)).reduce(_ + _)
val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _)
C = XTX.mapWithBcVar(XTY) { vars =>
val XTX = vars._1
val XTY = var.s_2
val weight = XTX  XTY
}
}
June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 22
Summary and Outlook
● Procedural vs. Data Flow
– basic building blocks elementwise operations on
unordered sets
– can't be nested
– combine intermediate results via broadcast vars
● Iterations
● Beware of TypeInformation implicits.

More Related Content

What's hot

GIS Work Example Portfolio
GIS Work Example PortfolioGIS Work Example Portfolio
GIS Work Example Portfolio
Nicholas Raio
 
Graphing Functions and Their Transformations
Graphing Functions and Their TransformationsGraphing Functions and Their Transformations
Graphing Functions and Their Transformations
zacho1c
 

What's hot (20)

Probability
ProbabilityProbability
Probability
 
Data Visualization With R
Data Visualization With RData Visualization With R
Data Visualization With R
 
Computing and Data Analysis for Environmental Applications
Computing and Data Analysis for Environmental ApplicationsComputing and Data Analysis for Environmental Applications
Computing and Data Analysis for Environmental Applications
 
Lecture 3.6 bt
Lecture 3.6 btLecture 3.6 bt
Lecture 3.6 bt
 
Dma
DmaDma
Dma
 
CLASS XII COMPUTER SCIENCE MONTHLY TEST PAPER
CLASS XII COMPUTER SCIENCE MONTHLY TEST  PAPERCLASS XII COMPUTER SCIENCE MONTHLY TEST  PAPER
CLASS XII COMPUTER SCIENCE MONTHLY TEST PAPER
 
Your first TensorFlow programming with Jupyter
Your first TensorFlow programming with JupyterYour first TensorFlow programming with Jupyter
Your first TensorFlow programming with Jupyter
 
GIS Work Example Portfolio
GIS Work Example PortfolioGIS Work Example Portfolio
GIS Work Example Portfolio
 
Lecture 3.2 bt
Lecture 3.2 btLecture 3.2 bt
Lecture 3.2 bt
 
Assignment premier academic writing agency with industry
Assignment premier academic writing agency with industry Assignment premier academic writing agency with industry
Assignment premier academic writing agency with industry
 
R class 5 -data visualization
R class 5 -data visualizationR class 5 -data visualization
R class 5 -data visualization
 
2 transformation computer graphics
2 transformation computer graphics2 transformation computer graphics
2 transformation computer graphics
 
Data visualization
Data visualizationData visualization
Data visualization
 
System approach in civil engg slideshare.vvs
System approach in civil engg slideshare.vvsSystem approach in civil engg slideshare.vvs
System approach in civil engg slideshare.vvs
 
Functional Programming for Fun and Profit
Functional Programming for Fun and ProfitFunctional Programming for Fun and Profit
Functional Programming for Fun and Profit
 
Enhancing Partition Crossover with Articulation Points Analysis
Enhancing Partition Crossover with Articulation Points AnalysisEnhancing Partition Crossover with Articulation Points Analysis
Enhancing Partition Crossover with Articulation Points Analysis
 
Graphing Functions and Their Transformations
Graphing Functions and Their TransformationsGraphing Functions and Their Transformations
Graphing Functions and Their Transformations
 
Non Deterministic and Deterministic Problems
Non Deterministic and Deterministic Problems Non Deterministic and Deterministic Problems
Non Deterministic and Deterministic Problems
 
R programmingmilano
R programmingmilanoR programmingmilano
R programmingmilano
 
Machine Learning lecture5(octave)
Machine Learning lecture5(octave)Machine Learning lecture5(octave)
Machine Learning lecture5(octave)
 

Viewers also liked

Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Cloudera, Inc.
 
Presentación de Moodle
Presentación de MoodlePresentación de Moodle
Presentación de Moodle
cruizgaray
 
Hpca2012 facebook keynote
Hpca2012 facebook keynoteHpca2012 facebook keynote
Hpca2012 facebook keynote
parallellabs
 

Viewers also liked (20)

Realtime Data Analysis Patterns
Realtime Data Analysis PatternsRealtime Data Analysis Patterns
Realtime Data Analysis Patterns
 
Hardcore Data Science - in Practice
Hardcore Data Science - in PracticeHardcore Data Science - in Practice
Hardcore Data Science - in Practice
 
Individual and societal risk
Individual and societal riskIndividual and societal risk
Individual and societal risk
 
REDES NEURONALES
REDES NEURONALESREDES NEURONALES
REDES NEURONALES
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
 
Presentación de Moodle
Presentación de MoodlePresentación de Moodle
Presentación de Moodle
 
El cambio
El cambioEl cambio
El cambio
 
Realtime Apache Hadoop at Facebook
Realtime Apache Hadoop at FacebookRealtime Apache Hadoop at Facebook
Realtime Apache Hadoop at Facebook
 
The influence-of-prayer-coping-on-patients
The influence-of-prayer-coping-on-patientsThe influence-of-prayer-coping-on-patients
The influence-of-prayer-coping-on-patients
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Building Distributed Data Streaming System
Building Distributed Data Streaming SystemBuilding Distributed Data Streaming System
Building Distributed Data Streaming System
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)Cloudera Impala Overview (via Scott Leberknight)
Cloudera Impala Overview (via Scott Leberknight)
 
Hpca2012 facebook keynote
Hpca2012 facebook keynoteHpca2012 facebook keynote
Hpca2012 facebook keynote
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Introduction to Real-Time Data Processing
Introduction to Real-Time Data ProcessingIntroduction to Real-Time Data Processing
Introduction to Real-Time Data Processing
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Intro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big DataIntro to Apache Apex @ Women in Big Data
Intro to Apache Apex @ Women in Big Data
 
Deep Dive into Apache Apex App Development
Deep Dive into Apache Apex App DevelopmentDeep Dive into Apache Apex App Development
Deep Dive into Apache Apex App Development
 

Similar to Data flow vs. procedural programming: How to put your algorithms into Flink

Similar to Data flow vs. procedural programming: How to put your algorithms into Flink (20)

Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming Mikio Braun – Data flow vs. procedural programming
Mikio Braun – Data flow vs. procedural programming
 
slides.07.pptx
slides.07.pptxslides.07.pptx
slides.07.pptx
 
Hands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in PythonHands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in Python
 
Pytorch for tf_developers
Pytorch for tf_developersPytorch for tf_developers
Pytorch for tf_developers
 
ML .pptx
ML .pptxML .pptx
ML .pptx
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
Matlab for beginners, Introduction, signal processing
Matlab for beginners, Introduction, signal processingMatlab for beginners, Introduction, signal processing
Matlab for beginners, Introduction, signal processing
 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with python
 
Class 26: Objectifying Objects
Class 26: Objectifying ObjectsClass 26: Objectifying Objects
Class 26: Objectifying Objects
 
Introducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlowIntroducton to Convolutional Nerural Network with TensorFlow
Introducton to Convolutional Nerural Network with TensorFlow
 
Linear Regression (Machine Learning)
Linear Regression (Machine Learning)Linear Regression (Machine Learning)
Linear Regression (Machine Learning)
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)
 
Introduction to Matlab
Introduction to MatlabIntroduction to Matlab
Introduction to Matlab
 
Deep Learning, Scala, and Spark
Deep Learning, Scala, and SparkDeep Learning, Scala, and Spark
Deep Learning, Scala, and Spark
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf
 
Matlab solved problems
Matlab solved problemsMatlab solved problems
Matlab solved problems
 
Automatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSELAutomatic Task-based Code Generation for High Performance DSEL
Automatic Task-based Code Generation for High Performance DSEL
 
Aggregation computation over distributed data streams
Aggregation computation over distributed data streamsAggregation computation over distributed data streams
Aggregation computation over distributed data streams
 
Matlab1
Matlab1Matlab1
Matlab1
 

More from Mikio L. Braun (7)

Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020
 
Academia to industry looking back on a decade of ml
Academia to industry looking back on a decade of mlAcademia to industry looking back on a decade of ml
Academia to industry looking back on a decade of ml
 
Architecting AI Applications
Architecting AI ApplicationsArchitecting AI Applications
Architecting AI Applications
 
Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018Machine Learning for Time Series, Strata London 2018
Machine Learning for Time Series, Strata London 2018
 
Scalable Machine Learning
Scalable Machine LearningScalable Machine Learning
Scalable Machine Learning
 
Cassandra - An Introduction
Cassandra - An IntroductionCassandra - An Introduction
Cassandra - An Introduction
 
Cassandra - Eine Einführung
Cassandra - Eine EinführungCassandra - Eine Einführung
Cassandra - Eine Einführung
 

Recently uploaded

Jax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckJax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined Deck
Marc Lester
 

Recently uploaded (20)

The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
Reinforcement Learning – a Rewards Based Approach to Machine Learning - Marko...
Reinforcement Learning – a Rewards Based Approach to Machine Learning - Marko...Reinforcement Learning – a Rewards Based Approach to Machine Learning - Marko...
Reinforcement Learning – a Rewards Based Approach to Machine Learning - Marko...
 
What is an API Development- Definition, Types, Specifications, Documentation.pdf
What is an API Development- Definition, Types, Specifications, Documentation.pdfWhat is an API Development- Definition, Types, Specifications, Documentation.pdf
What is an API Development- Definition, Types, Specifications, Documentation.pdf
 
Weeding your micro service landscape.pdf
Weeding your micro service landscape.pdfWeeding your micro service landscape.pdf
Weeding your micro service landscape.pdf
 
Sourcing Success - How to Find a Clothing Manufacturer
Sourcing Success - How to Find a Clothing ManufacturerSourcing Success - How to Find a Clothing Manufacturer
Sourcing Success - How to Find a Clothing Manufacturer
 
how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdf
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
Odoo vs Shopify: Why Odoo is Best for Ecommerce Website Builder in 2024
 
The Evolution of Web App Testing_ An Ultimate Guide to Future Trends.pdf
The Evolution of Web App Testing_ An Ultimate Guide to Future Trends.pdfThe Evolution of Web App Testing_ An Ultimate Guide to Future Trends.pdf
The Evolution of Web App Testing_ An Ultimate Guide to Future Trends.pdf
 
The mythical technical debt. (Brooke, please, forgive me)
The mythical technical debt. (Brooke, please, forgive me)The mythical technical debt. (Brooke, please, forgive me)
The mythical technical debt. (Brooke, please, forgive me)
 
A Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfA Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdf
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
 
Lessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdfLessons Learned from Building a Serverless Notifications System.pdf
Lessons Learned from Building a Serverless Notifications System.pdf
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Jax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckJax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined Deck
 
Malaysia E-Invoice digital signature docpptx
Malaysia E-Invoice digital signature docpptxMalaysia E-Invoice digital signature docpptx
Malaysia E-Invoice digital signature docpptx
 
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdfImplementing KPIs and Right Metrics for Agile Delivery Teams.pdf
Implementing KPIs and Right Metrics for Agile Delivery Teams.pdf
 
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
 
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
Optimizing Operations by Aligning Resources with Strategic Objectives Using O...
 
What need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java DevelopersWhat need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java Developers
 

Data flow vs. procedural programming: How to put your algorithms into Flink

  • 1. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 1 Flink Meetup #8 Data flow vs. procedural programming: How to put your algorithms into Flink June 23, 2015 Mikio L. Braun @mikiobraun
  • 2. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 2 Programming how we're used to ● Computing a sum ● Tools at our disposal: – variables – control flow (loops, if) – function calls as basic piece of abstraction def computeSum(a): sum = 0 for i in range(len(a)) sum += a[i] return sum
  • 3. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 3 Data Analysis Algorithms Let's consider centering becomes or even just def centerPoints(xs): sum = xs[0].copy() for i in range(1, len(xs)): sum += xs[i] mean = sum / len(xs) for i in range(len(xs)): xs[i] -= mean return xs xs - xs.mean(axis=0)
  • 4. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 4 Don't use for-loops ● Put your data into a matrix ● Don't use for loops
  • 5. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 5 Least Squares Regression ● Compute ● Becomes What you learn is thinking in matrices, breaking down computations in terms of matrix algebra def lsr(X, y, lam): d = X.shape[1] C = X.T.dot(X) + lam * pl.eye(d) w = np.linalg.solve(C, X.T.dot(y)) return w
  • 6. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 6 Basic tools Advantage – very familiar – close to math Disadvantage – hard to scale ● Basic procedural programming paradigm ● Variables ● Ordered arrays and efficient functions on those
  • 7. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 7 Parallel Data Flow Often you have stuff like Which is inherently easy to scale for i in someSet: map x[i] to y[i]
  • 8. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 8 New Paradigm ● Basic building block is an (unordered) set. ● Basic operations inherently parallel
  • 9. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 9 Computing, Data Flow Style Computing a sum Computing a mean sum(x) = xs.reduce((x,y) => x + y) mean(x) = xs.map(x => (x,1)) .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2)) .map(xc => xc._1 / xc._2)
  • 10. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 10 Apache Flink ● Data Flow system ● Basic building block is a DataSet[X] ● For execution, sets up all computing nodes, streams through data
  • 11. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 11 Apache Flink: Getting Started ● Use Scala API ● Minimal project with Maven (build tool) or Gradle ● Use an IDE like IntelliJ ● Always import org.apache.flink.api.scala._
  • 12. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 12 Centering (First Try) def computeMeans(xs: DataSet[DenseVector]) = xs.map(x => (x,1)) .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2)) .map(xc => xc._1 / xc._2) def centerPoints(xs: DataSet[DenseVector]) = { val mean = computeMean(xs) xs.map(x => x – mean) } You cannot nest DataSet operations!
  • 13. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 13 Sorry, restrictions apply. ● Variables hold (lazy) computations ● You can't work with sets within the operations ● Even if result is just a single element, it's a DataSet[Elem]. ● So what to do? – cross joins – broadcast variables
  • 14. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 14 Centering (Second Try) Works, but seems excessive because the mean is copied to each data element. def computeMeans(xs: DataSet[DenseVector]) = xs.map(x => (x,1)) .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2)) .map(xc => xc._1 / xc._2) def centerPoints(xs: DataSet[DenseVector]) = { val mean = computeMean(xs) xs.crossWithTiny(mean).map(xm => xm._1 – xm._2) }
  • 15. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 15 Broadcast Variables ● Side information sent to all worker nodes ● Can be a DataSet ● Gets accessed as a Java collection
  • 16. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 16 class BroadcastSingleElementMapper[T, B, O](fun: (T, B) => O) extends RichMapFunction[T, O] { var broadcastVariable: B = _ @throws(classOf[Exception]) override def open(configuration: Configuration): Unit = { broadcastVariable = getRuntimeContext .getBroadcastVariable[B]("broadcastVariable") .get(0) } override def map(value: T): O = { fun(value, broadcastVariable) } } Broadcast Variables
  • 17. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 17 Centering (Third Try) def computeMeans(xs: DataSet[DenseVector]) = xs.map(x => (x,1)) .reduce((xc, yc) => (xc._1 + yc._1, xc._2 + yc._2)) .map(xc => xc._1 / xc._2) def centerPoints(xs: DataSet[DenseVector]) = { val mean = computeMean(xs) xs.mapWithBcVar(mean).map((x, m) => x – m) }
  • 18. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 18 Intermediate Results pattern val x = someDataSetComputation() val y = someOtherDataSetComputation() val z = dataSet.mapWithBcVar(x)((d, x) => …) val result = anotherDataSet.mapWithBcVar((y,z)) { (d, yz) => val (y,z) = yz … } x = someComputation() y = someOtherComputation() z = someComputationOn(dataSet, x) result = moreComputationOn(y, z)
  • 19. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 19 Matrix Algebra ● No ordered sets per se in Data Flow context.
  • 20. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 20 Vector operations by explicit joins ● Encode vector (a1, a2, …, an) with {(1, a1), (2, a2), … (n, an)} ● Addition: – a.join(b).where(0).equalTo(0) .map((ab) => (ab._1._1, ab._1._2 + ab._2._2)) after join: {((1, a1), (1, b1)), ((2, a1), (2, b1)), … }
  • 21. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 21 Back to Least Squares Regression Two operations: computing X'X and X'Y def lsr(xys: DataSet[(DenseVector, Double)]) = { val XTX = xs.map(x => x.outer(x)).reduce(_ + _) val XTY = xys.map(xy => xy._1 * xy._2).reduce(_ + _) C = XTX.mapWithBcVar(XTY) { vars => val XTX = vars._1 val XTY = var.s_2 val weight = XTX XTY } }
  • 22. June 23, 2015Mikio L. Braun, Data Flow vs. Procedural Programming, Berlin Flink Meetup 22 Summary and Outlook ● Procedural vs. Data Flow – basic building blocks elementwise operations on unordered sets – can't be nested – combine intermediate results via broadcast vars ● Iterations ● Beware of TypeInformation implicits.