The Approximate Filter, Join, and GroupBy
David Rodriguez, Scott Sitar, Dhia Mahjoub
Cisco Systems
About Us
• David Rodriguez [davrodr3@cisco.com, @dvidrdgz]

Data Scientist / Security Researcher. MA Mathematics.
• Scott Sitar [ssitar@cisco.com]

Engineer Tech Lead. Hadoop Based Technologies.
Mathematics.
• Dhia Mahjoub [dmahjoub@cisco.com, @dhialite]

Head of Security Research for Cisco Umbrella. PhD.
Mathematics.
One Problem
Motivating Problem: group similar spikes
“Key, simple idea : Partition data into buckets by
hashing, then analyze.”
Michael Mitzenmacher
Some Uses of Hashing in Networking Problems
Overview
Outline
• Primitives Filter, Join, GroupBy

examples: clickstream and cpu usage
• Non-Primitive Filter, Join, GroupBy

examples: signal clustering
• Locality Sensitive Hashing (LSH)

examples: cosine measure (minhash if time permits)
• Application: Botnet Detection
Primitives
Clickstream Counts
• Identify most frequent words occurring in clickstreams
public static void main(String[] args) throws Exception {
// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// fake clickstream
DataSet<String> text = env.fromElements("http://www.google.de/login",
"http://www.google.de/search?q=flink+forward",
"http://www.google.de/search?q=simple+string",
"http://www.google.de/search?q=simple+count"
);
DataSet<Tuple2<String, Integer>> counts =
// extract words embedded in url
text.flatMap(new URLExtractor())
// group over words extracted
.groupBy(0)
.sum(1);
// execute and print result
counts.print();
}
Questions
• If this URLExtractor is used, what is the output of the
previous program?
public static final class URLExtractor implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
// split on '/', '+', '=', '?' and ':' ('+' and '?' must be escaped in a regex)
String[] tokens = value.toLowerCase().split("(/|\\+|=|\\?|:)");
for (String token : tokens) {
if (token.length() > 0) {
out.collect(new Tuple2<String, Integer>(token, 1));
}
}
}
}
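To make the question concrete, here is a minimal plain-Java sketch of the same flatMap, groupBy, and sum steps, without a Flink runtime. The class name `ClickstreamWordCount` and its `count` method are hypothetical stand-ins, and the sketch assumes the delimiter pattern was meant to escape `+` and `?`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for flatMap(new URLExtractor()).groupBy(0).sum(1).
class ClickstreamWordCount {

    // Splits on '/', '+', '=', '?' and ':' as URLExtractor does.
    // '.' is not a delimiter, so "www.google.de" stays a single token.
    static Map<String, Integer> count(List<String> urls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {
            for (String token : url.toLowerCase().split("(/|\\+|=|\\?|:)")) {
                if (token.length() > 0) {
                    // emit (token, 1) and sum per token
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> clicks = Arrays.asList(
                "http://www.google.de/login",
                "http://www.google.de/search?q=flink+forward",
                "http://www.google.de/search?q=simple+string",
                "http://www.google.de/search?q=simple+count");
        count(clicks).forEach((word, n) ->
                System.out.println("(" + word + "," + n + ")"));
    }
}
```

Under these assumptions, `http` and `www.google.de` are each counted 4 times, `search` and `q` 3 times, `simple` twice, and every other word once. Flink does not guarantee the order in which the groups are printed.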
Challenges
• Group similar browsing history: e.g. google searches.
• In general, we need an unsupervised method of
detecting similar browsing patterns.
// fake clickstream
DataSet<String> text = env.fromElements("http://www.google.de/login",
"http://www.google.de/search?q=flink+forward",
"http://www.google.de/search?q=simple+string",
"http://www.google.de/search?q=simple+count",
"https://berlin.flink-forward.org/kb_sessions"
);
CPU Usage
• Identify most frequent buckets occurring in cpu usage
public static void main(String[] args) throws Exception {
// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// fake cpu usage
DataSet<Integer> text = env.fromElements(1, 10, 100, 20, 15,
20000, 20001, 10001, 1000, 99, 98, 10, 2
);
DataSet<Tuple2<Integer, Integer>> counts =
// discretize values
text.flatMap(new Histogram())
// group over buckets in histogram
.groupBy(0)
.sum(1);
// execute and print result
counts.print();
}
Questions
• If this Histogram is used, what is the output of the
previous program?
public static final class Histogram implements FlatMapFunction<Integer,
Tuple2<Integer, Integer>> {
@Override
public void flatMap(Integer value, Collector<Tuple2<Integer, Integer>>
out) {
Double value_log = Math.log10(value);
Integer bucket =(int)Math.floor(value_log);
out.collect(new Tuple2<Integer, Integer>(bucket, 1));
}
}
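The bucketing can be checked without Flink. The sketch below applies the same `floor(log10(value))` rule to the fake CPU readings and sums per bucket; the class name `Log10Histogram` is a hypothetical stand-in, and it assumes all readings are positive.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for flatMap(new Histogram()).groupBy(0).sum(1).
class Log10Histogram {

    // Assumes every value is >= 1; log10 of 0 or a negative number
    // would not give a usable bucket.
    static Map<Integer, Integer> count(int[] values) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int value : values) {
            // bucket = order of magnitude of the reading
            int bucket = (int) Math.floor(Math.log10(value));
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] cpu = {1, 10, 100, 20, 15, 20000, 20001, 10001, 1000, 99, 98, 10, 2};
        System.out.println(count(cpu)); // prints {0=2, 1=6, 2=1, 3=1, 4=3}
    }
}
```

Readings of very different magnitude land in different buckets, while 20000, 20001 and 10001 share bucket 4. This is exact-match grouping on a discretized key, which is what the approximate methods later in the talk generalize.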
Challenges
• Group multivariate signal readings.
• In general, we need an unsupervised method of
detecting similar readings.
// fake cpu usage
DataSet<Tuple2<Integer, Integer>> text3 = env.fromElements(
new Tuple2<Integer, Integer>(10, 0),
new Tuple2<Integer, Integer>(10, 1),
new Tuple2<Integer, Integer>(5, 5),
new Tuple2<Integer, Integer>(4, 5)
);
Approximate
Signal Clustering
• Identify similar signals: similar wave forms possibly
dilated or contracted
public static void main(String[] args) throws Exception {
// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// gen fake signals with event name
DataSet<Event> text = env.fromElements(
new Event("flinker1", new Signal(100, 0, 100)),
new Event("flinker2", new Signal(100, 0, 101)),
new Event("flinker3", new Signal(50, 0, 50)),
new Event("unflinker1", new Signal(100, 0, 50)));
DataSet<OutPreview> counts =
// create LSH hash by random hyperplanes
text.flatMap(new RandHyperplanes())
// group by hash signature
.groupBy(0)
.sum(2);
// execute and print result
counts.print();
}
Questions
• Are (100, 0, 100) and (100, 0, 101) similar?
• Are (100, 0, 100) and (50, 0, 50) similar?
• Are (100, 0, 100) and (100, 0, 50) similar?
• In general, we need an unsupervised method of
detecting similar readings.
// gen fake signals with event name
DataSet<Event> text = env.fromElements(
new Event("flinker1", new Signal(100, 0, 100)),
new Event("flinker2", new Signal(100, 0, 101)),
new Event("flinker3", new Signal(50, 0, 50)),
new Event("unflinker1", new Signal(100, 0, 50)));
Locality Sensitive Hashing -
LSH
Cosine Similarity
• Given two vectors: 

(1, 0) = x and (1, 0) = y
• Define the angle theta between x and y from
calculus:



theta = arccos(x dot y / (||x|| * ||y||))
• theta is a distance measure:

1. d(x,x) = 0

2. d(x,y) = d(y,x)

3. d(x,y) >= 0 w/ 0 <= theta <= 180

Theta and Hashes
• Define a function f_1 for a random hyperplane with normal vector h_1:

f_1(x) = 0 , if sgn(h_1 dot x) > 0

f_1(x) = 1 , otherwise
• Then, with theta in degrees:

P(f_1(x) = f_1(y)) = 1 - theta / 180
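The collision probability can be checked empirically. The sketch below is a small Monte Carlo experiment, assuming hyperplane normals with independent Gaussian components (as in the Scala alternative later in the deck). The class name `HyperplaneCollision`, the seed, and the trial count are illustrative choices only.

```java
import java.util.Random;

// Hypothetical experiment: how often does one random hyperplane
// give two vectors the same bit?
class HyperplaneCollision {

    // f(x) = 0 if h dot x > 0, otherwise 1.
    static int hash(double[] h, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += h[i] * x[i];
        }
        return dot > 0 ? 0 : 1;
    }

    // Fraction of random hyperplanes on which x and y collide.
    static double collisionRate(double[] x, double[] y, int trials, long seed) {
        Random rng = new Random(seed);
        int same = 0;
        for (int t = 0; t < trials; t++) {
            double[] h = new double[x.length];
            for (int i = 0; i < h.length; i++) {
                h[i] = rng.nextGaussian(); // random direction
            }
            if (hash(h, x) == hash(h, y)) {
                same++;
            }
        }
        return (double) same / trials;
    }

    public static void main(String[] args) {
        double theta = 60.0; // degrees between x and y
        double[] x = {1.0, 0.0};
        double[] y = {Math.cos(Math.toRadians(theta)), Math.sin(Math.toRadians(theta))};
        System.out.println("estimated: " + collisionRate(x, y, 200_000, 42L));
        System.out.println("expected:  " + (1 - theta / 180.0));
    }
}
```

With enough trials the estimate should settle close to 1 - 60/180, about 0.667. The exact digits depend on the seed.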
Theta and Deltas
• Are (100, 0, 100) and (100, 0, 101)
similar?
• Are (100, 0, 100) and (50, 0, 50)
similar?
• Are (100, 0, 100) and (100, 0, 50)
similar?
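These three questions can be answered numerically with the angle formula above. The following is a minimal sketch; the class name `CosineAngle` is hypothetical, and the cosine is clamped to [-1, 1] to guard against floating-point rounding before `acos`.

```java
// Hypothetical helper: angle in degrees between two vectors.
class CosineAngle {

    static double angleDegrees(double[] x, double[] y) {
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        double cos = dot / (Math.sqrt(normX) * Math.sqrt(normY));
        // rounding can push cos slightly outside [-1, 1]
        cos = Math.max(-1.0, Math.min(1.0, cos));
        return Math.toDegrees(Math.acos(cos));
    }

    public static void main(String[] args) {
        double[] base = {100, 0, 100};
        double[][] others = {{100, 0, 101}, {50, 0, 50}, {100, 0, 50}};
        for (double[] other : others) {
            double theta = angleDegrees(base, other);
            // single-hyperplane collision probability: 1 - theta / 180
            System.out.printf("theta = %.3f deg, P(same bit) = %.4f%n",
                    theta, 1 - theta / 180.0);
        }
    }
}
```

The angles come out at roughly 0.29°, 0° and 18.4°. Under the cosine measure the scaled signal (50, 0, 50) is indistinguishable from (100, 0, 100), while (100, 0, 50) is clearly further away.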
p-Stable Hashes
• A family H of hash functions is said to be

(d_1, d_2, p_1, p_2) - sensitive

if for any x and y in a possible space of points

1. if d(x,y)<= d_1, then P(h(x)=h(y)) >= p_1

2. if d(x,y)>= d_2, then P(h(x)=h(y)) <= p_2
Finding the Number of Hyperplanes
• Let the family be

(20°, 120°, 1 - 20°/180°, 1 - 120°/180°)-sensitive

for one hyperplane.
• From this we can determine the number of
hyperplanes.
theta    (1-(1-p)^2)^2
20°      0.0440
40°      0.1560
60°      0.3086
80°      0.4779
100°     0.6439
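The table can be recomputed with a few lines of code. In the sketch below, `AmplifiedProbability` and `amplify` are hypothetical names. Taking p = theta / 180 is an assumption; it is the reading under which the formula matches the five listed rows to within 0.0001.

```java
// Hypothetical helper that recomputes the (1-(1-p)^2)^2 column.
class AmplifiedProbability {

    // Combines 2 x 2 single-hyperplane hashes: (1 - (1 - p)^2)^2.
    static double amplify(double p) {
        double inner = 1 - (1 - p) * (1 - p);
        return inner * inner;
    }

    public static void main(String[] args) {
        for (int theta = 20; theta <= 100; theta += 20) {
            double p = theta / 180.0; // assumption: p is theta / 180
            System.out.println(theta + " deg -> " + amplify(p));
        }
    }
}
```

Changing the exponents in `amplify` shows how adding hyperplanes sharpens the gap between small and large angles. That gap is what drives the choice of how many hyperplanes to use.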
Signal Clustering
Flink Main
public class ApproximateSignals {
public static class Event {
String name;
Signal signal;
public Event() {}
public Event(String name, Signal signal) {
this.name = name;
this.signal = signal;
}
}
public static class Signal extends Tuple3<Integer, Integer, Integer> {
Signal(Integer x, Integer y, Integer z) {
super(x, y, z);
}
}
public static class OutPreview extends Tuple3<String, String, Integer> {
OutPreview(String hash, String preview, Integer count) {
super(hash, preview, count);
}
}
public static void main(String[] args) throws Exception {
// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// get input data
DataSet<Event> text = env.fromElements(
new Event("flinker1", new Signal(100, 0, 100)),
new Event("flinker2", new Signal(100, 0, 101)),
new Event("flinker3", new Signal(50, 0, 50)),
new Event("unflinker1", new Signal(100, 0, 50)));
DataSet<OutPreview> counts =
// create LSH hash by random hyperplanes
text.flatMap(new RandHyperplanes())
// group by hash signature
.groupBy(0)
.sum(2);
// execute and print result
counts.print();
}
public static final class RandHyperplanes implements FlatMapFunction<Event,
OutPreview> {
@Override
public void flatMap(Event event, Collector<OutPreview> out) {
String hash = HyperplaneSignature.matVProd(event.signal);
String name = event.name;
out.collect(new OutPreview(hash, name, 1));
}
}
}
Hyperplane Generation
public class HyperplaneSignature {
static Integer[][] hyperplanes = {{-1, 1, 1},
{-1, -1, 1},
{1, 1, 1},
{1, -1, 1},
{-1, 1, -1},
{1, -1, -1}};
static Integer product(Integer[] w, Integer[] v) {
Integer n = w.length;
Integer sum = 0;
for (int i = 0; i < n; i++) {
sum += w[i] * v[i];
}
return sum;
}
static String sign(Integer value) {
if (value < 0.0) {
return "a";
} else {
return "b";
}
}
public static String matVProd(Tuple3 v) {
Integer[] vhat = {(Integer) v.f0, (Integer) v.f1, (Integer) v.f2};
String signature = "";
for (Integer[] hyperplane : hyperplanes) {
Integer magnitude = product(hyperplane, vhat);
signature += sign(magnitude);
}
return signature;
}
}
Hyperplane Generation - alternative
import breeze.linalg._
import scala.util.Random
object LSH {
val hyperplanes_n = 10
val series_len = 3
val hyperplanes = __initHyperplanes
def __initHyperplanes: DenseMatrix[Double] = {
val planes = (1 to hyperplanes_n).map(x =>
(1 to series_len).map(y =>
Random.nextGaussian()
)
)
DenseMatrix(planes.map(_.toArray): _*)
}
def hash(volume: DenseVector[Double]): Array[Int] = {
val mult = hyperplanes * volume
mult
.toArray
.map({ x =>
if (x > 0) 0 else 1
})
}
}
Signal Clusters
• Surprised?
(aabbab,unflinker1,1)
(bbbbab,flinker3,3)
Program execution finished
Job with JobID 9658d0fb4037ec75e24b244479ef4b39 has finished.
Job Runtime: 469 ms
Pipelines
Traditional Pipelines
• Count events, group related items, etc.
Adversarial Pipelines
• Pattern recognition, similarity searching, cohort analysis
Adversarial Join
public class ApproximateArraysJoin {
public static class Event {
String name;
Signal signal;
public Event() {
}
public Event(String name, Signal signal) {
this.name = name;
this.signal = signal;
}
}
public static class Signal extends Tuple3<Integer, Integer, Integer> {
Signal(Integer x, Integer y, Integer z) {
super(x, y, z);
}
}
public static void main(String[] args) throws Exception {
// set up the execution environment
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Event> control = env.fromElements(
new Event("badflinker", new Signal(99, 0, 100)),
new Event("badflinker!", new Signal(77, 0, 75)));
// get input data
DataStream<Event> data = env.fromElements(
new Event("flinker1", new Signal(100, 0, 100)),
new Event("flinker2", new Signal(100, 0, 101)),
new Event("flinker3", new Signal(50, 0, 50)),
new Event("unflinker2", new Signal(100, 0, 50)));
DataStream<String> result = control
.broadcast()
.connect(data)
.flatMap(new RandHyperplaneCoFlatMap());
result.print();
env.execute();
}
public static final class RandHyperplaneCoFlatMap implements
CoFlatMapFunction<Event, Event, String> {
// signatures of known-bad signals (requires java.util.HashSet)
HashSet<String> blacklist = new HashSet<>();
@Override
public void flatMap1(Event event, Collector<String> out) {
String hash = HyperplaneSignature.matVProd(event.signal);
blacklist.add(hash);
out.collect("added to blacklist " + hash + " " + event.name);
}
@Override
public void flatMap2(Event event, Collector<String> out) {
String hash = HyperplaneSignature.matVProd(event.signal);
if (blacklist.contains(hash)) {
out.collect("alas! caught another bad one " + hash + " " + event.name);
} else {
out.collect("bummer! this isn't bad " + hash + " " + event.name);
}
}
}
}
Application
Architecture
• Decouples and plays nicely with legacy MapReduce Oozie workflows
• Projections enable passing:
• User-Item Connected Components of Graphs
• Achieves
• 5 million+ domains per batch
• Fetching signals with 700+ readings
• Subdomain aggregation
• Cohort analysis
Detecting Botnets
• 200 Randomly sampled Algorithmically Generated
Domains from: Pykspa, Necurs, Suppobox, Pushdo, Zeus
Gameover.
Detecting Botnets
Pushdo:
zyawafdeqer.kz
pokhuwyad.kz
pelzobdath.kz
zohevesvylu.kz
bosutymcure.kz
Necurs:
qcakimrtsrarcqjcutxmb.cc
ltcolhdcvv.nf
memeugampjhdbnwhxxsa.nu
vexbklktsbfyuv.la
uyndateoeqshbeytpwwb.jp
Gameover Zeus:
xzaulqbqojdyzpdykzhojduhml.net
mjtyphaealldutxcdscnrh.org
dyugcqmzzhkduzplteabqgidfa.info
pjscugmuaynbfqwoqcqcfmhtfyo.net
ividkvrcfehoveavsljaelvro.com
Pykspa:
nemynulcnx.net
aszzfvzqb.net
oahebrlytur.net
nalznbmmhvn.com
lhdajsdenwb.com
Signal Pre-Processing
• The idea: test whether more compressed versions of the
signals could be stored.
Detecting Botnets
• Each of the 8,475 signals is assigned a cluster using different LSH and pre-processing techniques. We label the signals, then randomly project the 500-dimensional signals into 2 dimensions. The plots are as follows: (1) domains hand-labeled into 9 families, (2) raw counts hashed, then clustered, (3) histogram of signals hashed, then clustered, (4) rolling average of signals hashed, then clustered, (5) rolling-average histogram of signals hashed, then clustered.
Recap
Outline: recap
• Primitives Filter, Join, GroupBy

examples: clickstream and cpu usage
• Non-Primitive Filter, Join, GroupBy

examples: signal clustering
• Locality Sensitive Hashing (LSH)

examples: cosine measure
• Application: Botnet Detection
The End
Bibliography
Michael Mitzenmacher and Eli Upfal. 2005. Probability and Computing: Randomized
Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY,
USA.

Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets.
Cambridge University Press, New York, NY, USA.
S. Chaudhuri, V. Ganti, and R. Kaushik, “A primitive operator for similarity joins in data
cleaning,” Proc. Intl. Conf. on Data Engineering, 2006.
A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher, “Min-wise independent
permutations,” ACM Symposium on Theory of Computing, pp. 327–336, 1998.
A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions,” Comm. ACM 51:1, pp. 117– 122, 2008.
Appendix
Flink Main
public class ApproximateStringsNew {
public static class OutPreview extends Tuple3<String, String, Integer> {
OutPreview(String hash, String preview, Integer count) {
super(hash, preview, count);
}
}
public static void main(String[] args) throws Exception {
// set up the execution environment
final ExecutionEnvironment env =
ExecutionEnvironment.getExecutionEnvironment();
// get input data
DataSet<String> text = env.fromElements(
"http://www.google.de/login",
"http://www.google.de/search?q=flink+forward",
"http://www.google.de/search?q=simple+string",
"http://www.google.de/search?q=simple+count"
);
DataSet<Tuple3<String, String, Integer>> counts =
text.flatMap(new MinHasher())
.groupBy(0)
.sum(2);
// execute and print result
counts.print();
}
public static final class MinHasher implements FlatMapFunction<String,
Tuple3<String, String, Integer>> {
@Override
public void flatMap(String value, Collector<Tuple3<String, String, Integer>>
out) {
// simple url word extractor
String[] tokens = value.toLowerCase().split("(/|\\+|=|\\?|:)");
String hash = MinHasherStatic.minHash(tokens);
out.collect(new Tuple3<String, String, Integer>(hash,
value.toLowerCase(), 1));
}
}
}
MinHash Generation
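import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class MinHasherStatic {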
static int VOCAB_SIZE = 12;
static int KHASHES = 2;
static private List<Map<String, Integer>> kHashes = setup();
static private List<Map<String, Integer>> setup() {
Map<String, Integer> hm1 = new HashMap<>();
hm1.put("http", 1);
hm1.put("www", 7);
hm1.put("google", 6);
hm1.put("de", 5);
hm1.put("login", 4);
hm1.put("search", 2);
hm1.put("q", 9);
hm1.put("flink", 10);
hm1.put("forward", 3);
hm1.put("simple", 8);
hm1.put("string", 11);
hm1.put("count", 12);
Map<String, Integer> hm2 = new HashMap<>();
hm2.put("http", 8);
hm2.put("www", 9);
hm2.put("google", 12);
hm2.put("de", 3);
hm2.put("login", 4);
hm2.put("search", 11);
hm2.put("q", 7);
hm2.put("flink", 10);
hm2.put("forward", 5);
hm2.put("simple", 1);
hm2.put("string", 2);
hm2.put("count", 6);
List<Map<String, Integer>> hashes = new ArrayList<>();
hashes.add(hm1);
hashes.add(hm2);
return hashes;
}
static String minHash(String[] words) {
Integer[] mh = new Integer[KHASHES];
Arrays.fill(mh, VOCAB_SIZE + 1);
for (String word : words) {
for (int i = 0; i < KHASHES; i++) {
Map<String, Integer> m = kHashes.get(i);
if (m != null){
if (m.containsKey(word)){
int h = m.get(word);
int c = mh[i];
if (h < c) {
mh[i] = h;
}
}
}
}
}
return Arrays.toString(mh);
}
}
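For context on what the min-hash signature approximates: when each hash is a random permutation of the vocabulary, two token sets get the same min-hash with probability equal to their Jaccard similarity. The sketch below computes that similarity directly for the clickstream URLs. The class name `UrlJaccard` is hypothetical, and the two hand-picked maps above only loosely stand in for random permutations.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: exact Jaccard similarity of URL token sets.
class UrlJaccard {

    // Same tokenization as MinHasher, with empty tokens dropped.
    static Set<String> tokens(String url) {
        Set<String> out = new HashSet<>();
        for (String token : url.toLowerCase().split("(/|\\+|=|\\?|:)")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // |A intersect B| / |A union B|
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> string = tokens("http://www.google.de/search?q=simple+string");
        Set<String> count = tokens("http://www.google.de/search?q=simple+count");
        Set<String> login = tokens("http://www.google.de/login");
        System.out.println("string vs count: " + jaccard(string, count));
        System.out.println("string vs login: " + jaccard(string, login));
    }
}
```

The two "simple" searches share 5 of 7 distinct tokens (about 0.71), while the login URL shares only 2 of 7 with either of them.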
ClickStream Clusters
• Surprised?
([1, 1],http://www.google.de/search?q=simple+count,2)
([1, 4],http://www.google.de/login,1)
([1, 5],http://www.google.de/search?q=flink+forward,1)
Program execution finished
Job with JobID 0f8fac5481bb2ccd1acd772dca0759cc has finished.
Job Runtime: 698 ms

 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Recently uploaded

Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 

Recently uploaded (20)

Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 

Flink Forward Berlin 2017: David Rodriguez - The Approximate Filter, Join, and GroupBy

  • 1. The Approximate Filter, Join, and GroupBy David Rodriguez, Scott Sitar, Dhia Mahjoub Cisco Systems
  • 2. About Us
    • David Rodriguez [davrodr3@cisco.com, @dvidrdgz]
      Data Scientist / Security Researcher. MA Mathematics.
    • Scott Sitar [ssitar@cisco.com]
      Engineer Tech Lead. Hadoop Based Technologies. Mathematics.
    • Dhia Mahjoub [dmahjoub@cisco.com, @dhialite]
      Head of Security Research for Cisco Umbrella. PhD. Mathematics.
  • 4. Motivating Problem: group similar spikes
  • 5. “Key, simple idea: Partition data into buckets by hashing, then analyze.”
      Michael Mitzenmacher, Some Uses of Hashing in Networking Problems
  • 7. Outline
    • Primitives Filter, Join, GroupBy
      examples: clickstream and cpu usage
    • Non-Primitive Filter, Join, GroupBy
      examples: signal clustering
    • Locality Sensitive Hashing (LSH)
      examples: cosine measure (minhash if time permits)
    • Application: Botnet Detection
  • 9. Clickstream Counts
    • Identify most frequent words occurring in clickstreams

    public static void main(String[] args) throws Exception {
        // set up the execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // fake clickstream
        DataSet<String> text = env.fromElements(
            "http://www.google.de/login",
            "http://www.google.de/search?q=flink+forward",
            "http://www.google.de/search?q=simple+string",
            "http://www.google.de/search?q=simple+count"
        );

        DataSet<Tuple2<String, Integer>> counts =
            // extract words embedded in url
            text.flatMap(new URLExtractor())
                // group over words extracted
                .groupBy(0)
                .sum(1);

        // execute and print result
        counts.print();
    }
  • 10. Questions
    • If this URLExtractor is used, what is the output of the previous program?

    public static final class URLExtractor implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            // '+' and '?' are regex metacharacters and must be escaped
            String[] tokens = value.toLowerCase().split("(/|\\+|=|\\?|:)");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<String, Integer>(token, 1));
                }
            }
        }
    }
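To work through the question without a Flink cluster, the same tokenize-and-count logic can be sketched in plain Java. This is an illustrative stand-in, not the talk's code: the class name `UrlTokenCounts` is hypothetical, and the delimiter pattern assumes that `+` and `?` were meant to be escaped (unescaped, `String.split` rejects the pattern with a `PatternSyntaxException`).

```java
import java.util.Map;
import java.util.TreeMap;

class UrlTokenCounts {
    // Assumption: the slide's delimiter set with '+' and '?' escaped.
    static final String DELIMITERS = "(/|\\+|=|\\?|:)";

    // Lower-case each URL, split on the delimiters, and count non-empty tokens.
    static Map<String, Integer> count(String... urls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {
            for (String token : url.toLowerCase().split(DELIMITERS)) {
                if (token.length() > 0) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(
            "http://www.google.de/login",
            "http://www.google.de/search?q=flink+forward",
            "http://www.google.de/search?q=simple+string",
            "http://www.google.de/search?q=simple+count");
        // Print in the (token,count) shape that a tuple print would use.
        counts.forEach((token, n) -> System.out.println("(" + token + "," + n + ")"));
    }
}
```

Because `.` is not among the delimiters, the host `www.google.de` stays a single token and is counted once per URL.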
  • 11. Challenges
    • Group similar browsing history: e.g. google searches.
    • In general, we need an unsupervised method of detecting similar browsing patterns.

    // fake clickstream
    DataSet<String> text = env.fromElements(
        "http://www.google.de/login",
        "http://www.google.de/search?q=flink+forward",
        "http://www.google.de/search?q=simple+string",
        "http://www.google.de/search?q=simple+count",
        "https://berlin.flink-forward.org/kb_sessions"
    );
  • 12. CPU Usage
    • Identify most frequent buckets occurring in cpu usage

    public static void main(String[] args) throws Exception {
        // set up the execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // fake cpu usage
        DataSet<Integer> text = env.fromElements(
            1, 10, 100, 20, 15, 20000, 20001, 10001, 1000, 99, 98, 10, 2
        );

        DataSet<Tuple2<Integer, Integer>> counts =
            // discretize values
            text.flatMap(new Histogram())
                // group over buckets in histogram
                .groupBy(0)
                .sum(1);

        // execute and print result
        counts.print();
    }
  • 13. Questions
    • If this Histogram is used, what is the output of the previous program?

    public static final class Histogram implements FlatMapFunction<Integer, Tuple2<Integer, Integer>> {
        @Override
        public void flatMap(Integer value, Collector<Tuple2<Integer, Integer>> out) {
            // bucket = order of magnitude of the reading
            Double value_log = Math.log10(value);
            Integer bucket = (int) Math.floor(value_log);
            out.collect(new Tuple2<Integer, Integer>(bucket, 1));
        }
    }
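The bucketing can be checked by hand with a small plain-Java sketch of the same floor(log10) rule. The class name `LogHistogram` is hypothetical and the code only mirrors the Histogram logic above; it does not use Flink.

```java
import java.util.Map;
import java.util.TreeMap;

class LogHistogram {
    // Same rule as the Histogram function: bucket by order of magnitude.
    static int bucket(int value) {
        return (int) Math.floor(Math.log10(value));
    }

    // Count how many readings fall into each bucket.
    static Map<Integer, Integer> count(int... values) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int value : values) {
            counts.merge(bucket(value), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts =
            count(1, 10, 100, 20, 15, 20000, 20001, 10001, 1000, 99, 98, 10, 2);
        counts.forEach((b, n) -> System.out.println("(" + b + "," + n + ")"));
    }
}
```

For the thirteen fake readings this yields five buckets: two readings in bucket 0, six in bucket 1, one each in buckets 2 and 3, and three in bucket 4. Readings of 0 or below are not handled, since log10 is undefined there.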
  • 14. Challenges
    • Group multivariate signal readings.
    • In general, we need an unsupervised method of detecting similar readings.

    // fake cpu usage
    DataSet<Tuple2<Integer, Integer>> text3 = env.fromElements(
        new Tuple2<>(10, 0),
        new Tuple2<>(10, 1),
        new Tuple2<>(5, 5),
        new Tuple2<>(4, 5)
    );
  • 16. Signal Clustering
    • Identify similar signals: similar wave forms possibly dilated or contracted

    public static void main(String[] args) throws Exception {
        // set up the execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // gen fake signals with event name
        DataSet<Event> text = env.fromElements(
            new Event("flinker1", new Signal(100, 0, 100)),
            new Event("flinker2", new Signal(100, 0, 101)),
            new Event("flinker3", new Signal(50, 0, 50)),
            new Event("unflinker1", new Signal(100, 0, 50)));

        DataSet<OutPreview> counts =
            // create LSH hash by random hyperplanes
            text.flatMap(new RandHyperplanes())
                // group by hash signature
                .groupBy(0)
                .sum(2);

        // execute and print result
        counts.print();
    }
  • 17. Questions
    • Are (100, 0, 100) and (100, 0, 101) similar?
    • Are (100, 0, 100) and (50, 0, 50) similar?
    • Are (100, 0, 100) and (100, 0, 50) similar?
    • In general, we need an unsupervised method of detecting similar readings.

    // gen fake signals with event name
    DataSet<Event> text = env.fromElements(
        new Event("flinker1", new Signal(100, 0, 100)),
        new Event("flinker2", new Signal(100, 0, 101)),
        new Event("flinker3", new Signal(50, 0, 50)),
        new Event("unflinker1", new Signal(100, 0, 50)));
  • 19. Cosine Similarity
    • Given two vectors:
      x = (1, 0) and y = (1, 0)
    • Define the angle between x and y, from calculus:
      theta = arccos(x dot y / (||x|| * ||y||))
    • theta is a distance measure d(x,y) = theta:
      1. d(x,x) = 0
      2. d(x,y) = d(y,x)
      3. d(x,y) >= 0, with 0° <= theta <= 180°

  • 20. Theta and Hashes
    • Define a function f_1 for a random hyperplane h_1:
      f_1(x) = 0, if sgn(h_1 dot x) > 0
      f_1(x) = 1, otherwise
    • Then, for theta in degrees between x and y:
      P(f_1(x) = f_1(y)) = 1 - theta / 180
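The collision probability can be checked empirically. The following is a minimal Monte Carlo sketch, assuming hyperplane normals drawn with independent Gaussian components (a standard choice, not stated on the slide); the class name `HyperplaneCollision`, the trial count, and the seed are all illustrative.

```java
import java.util.Random;

class HyperplaneCollision {
    // One hash bit, following the slide's convention: 0 if h dot x > 0, else 1.
    static int bit(double[] h, double[] x) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += h[i] * x[i];
        }
        return dot > 0 ? 0 : 1;
    }

    // Fraction of random hyperplanes that assign x and y the same bit.
    static double estimateAgreement(double[] x, double[] y, int trials, long seed) {
        Random rng = new Random(seed);
        int agree = 0;
        for (int t = 0; t < trials; t++) {
            double[] h = new double[x.length];
            for (int i = 0; i < h.length; i++) {
                h[i] = rng.nextGaussian();
            }
            if (bit(h, x) == bit(h, y)) {
                agree++;
            }
        }
        return (double) agree / trials;
    }

    public static void main(String[] args) {
        double theta = 60.0; // degrees between x and y
        double[] x = {1.0, 0.0};
        double[] y = {Math.cos(Math.toRadians(theta)), Math.sin(Math.toRadians(theta))};
        System.out.println("estimated: " + estimateAgreement(x, y, 200_000, 7L));
        System.out.println("expected:  " + (1.0 - theta / 180.0));
    }
}
```

With enough trials the estimate should settle near 1 - 60/180, about 0.667; the exact digits depend on the seed.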
  • 21. Theta and Deltas
    • Are (100, 0, 100) and (100, 0, 101) similar?
    • Are (100, 0, 100) and (50, 0, 50) similar?
    • Are (100, 0, 100) and (100, 0, 50) similar?
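One way to approach the three questions is to compute theta for each pair. A minimal sketch in plain Java follows; the class and method names are illustrative and not from the talk.

```java
class CosineAngle {
    // Angle in degrees between x and y: arccos(x dot y / (||x|| * ||y||)).
    static double angleDegrees(double[] x, double[] y) {
        double dot = 0.0, normX = 0.0, normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        double cos = dot / (Math.sqrt(normX) * Math.sqrt(normY));
        // Clamp to [-1, 1] so rounding error cannot push acos into NaN.
        cos = Math.max(-1.0, Math.min(1.0, cos));
        return Math.toDegrees(Math.acos(cos));
    }

    public static void main(String[] args) {
        double[] base = {100, 0, 100};
        System.out.println(angleDegrees(base, new double[]{100, 0, 101}));
        System.out.println(angleDegrees(base, new double[]{50, 0, 50}));
        System.out.println(angleDegrees(base, new double[]{100, 0, 50}));
    }
}
```

Under this measure the second pair is at essentially 0°: (50, 0, 50) is (100, 0, 100) contracted by half, so the angle treats them as the same wave form. The first pair is roughly 0.29° apart and the third roughly 18.4°.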
  • 22. p-Stable Hashes
    • A family H of hash functions is said to be
      (d_1, d_2, p_1, p_2)-sensitive
      if, for any x and y in the space of points and h drawn from H:
      1. if d(x,y) <= d_1, then P(h(x) = h(y)) >= p_1
      2. if d(x,y) >= d_2, then P(h(x) = h(y)) <= p_2
  • 23. Finding the Number of Hyperplanes
    • Let the family be
      (20°, 120°, 1 - 20°/180°, 1 - 120°/180°)-sensitive
      for one hyperplane.
    • We can determine the number of hyperplanes.
      (The tabulated values correspond to p = theta/180°.)

      theta    (1-(1-p)^2)^2
      20°      0.0440
      40°      0.1560
      60°      0.3086
      80°      0.4779
      100°     0.6439
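The table can be regenerated with a few lines of Java. This sketch assumes, based on the tabulated numbers, that p is taken as theta/180°; the class name `Amplification` is illustrative.

```java
class Amplification {
    // The slide's expression: (1 - (1 - p)^2)^2.
    static double amplified(double p) {
        double inner = 1.0 - (1.0 - p) * (1.0 - p);
        return inner * inner;
    }

    public static void main(String[] args) {
        System.out.println("theta  (1-(1-p)^2)^2");
        for (int theta = 20; theta <= 100; theta += 20) {
            double p = theta / 180.0; // assumption: p = theta / 180 degrees
            System.out.printf("%3d°   %.4f%n", theta, amplified(p));
        }
    }
}
```

Rounded to four places, the computed values agree with the slide to within 0.0001; the slide's figures look truncated rather than rounded. Changing the two exponents shows how the curve steepens or flattens as hyperplanes are combined.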
  • 25. Flink Main

    public class ApproximateSignals {

        public static class Event {
            String name;
            Signal signal;

            public Event() {}

            public Event(String name, Signal signal) {
                this.name = name;
                this.signal = signal;
            }
        }

        public static class Signal extends Tuple3<Integer, Integer, Integer> {
            Signal(Integer x, Integer y, Integer z) {
                super(x, y, z);
            }
        }

        public static class OutPreview extends Tuple3<String, String, Integer> {
            OutPreview(String hash, String preview, Integer count) {
                super(hash, preview, count);
            }
        }

        public static void main(String[] args) throws Exception {
            // set up the execution environment
            final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // get input data
            DataSet<Event> text = env.fromElements(
                new Event("flinker1", new Signal(100, 0, 100)),
                new Event("flinker2", new Signal(100, 0, 101)),
                new Event("flinker3", new Signal(50, 0, 50)),
                new Event("unflinker2", new Signal(100, 0, 50)));

            DataSet<OutPreview> counts =
                // create LSH hash by random hyperplanes
                text.flatMap(new RandHyperplanes())
                    // group by hash signature
                    .groupBy(0)
                    .sum(2);

            // execute and print result
            counts.print();
        }

        public static final class RandHyperplanes implements FlatMapFunction<Event, OutPreview> {
            @Override
            public void flatMap(Event event, Collector<OutPreview> out) {
                String hash = HyperplaneSignature.matVProd(event.signal);
                String name = event.name;
                out.collect(new OutPreview(hash, name, 1));
            }
        }
    }
  • 26. Hyperplane Generation

    public class HyperplaneSignature {

        // fixed hyperplanes, one normal vector per row
        static Integer[][] hyperplanes = {
            {-1, 1, 1}, {-1, -1, 1}, {1, 1, 1},
            {1, -1, 1}, {-1, 1, -1}, {1, -1, -1}};

        // dot product of a hyperplane normal w with a vector v
        static Integer product(Integer[] w, Integer[] v) {
            Integer n = w.length;
            Integer sum = 0;
            for (int i = 0; i < n; i++) {
                sum += w[i] * v[i];
            }
            return sum;
        }

        // "a" for the negative side of a hyperplane, "b" otherwise (zero included)
        static String sign(Integer value) {
            if (value < 0) {
                return "a";
            } else {
                return "b";
            }
        }

        // signature = one letter per hyperplane
        public static String matVProd(Tuple3<Integer, Integer, Integer> v) {
            Integer[] vhat = {v.f0, v.f1, v.f2};
            String signature = "";
            for (Integer[] hyperplane : hyperplanes) {
                Integer magnitude = product(hyperplane, vhat);
                signature += sign(magnitude);
            }
            return signature;
        }
    }
  • 27. Hyperplane Generation - alternative

    import breeze.linalg._
    import scala.util.Random

    object LSH {
      val hyperplanes_n = 10
      val series_len = 3
      val hyperplanes = __initHyperplanes

      def __initHyperplanes: DenseMatrix[Double] = {
        val planes = (1 to hyperplanes_n).map(x =>
          (1 to series_len).map(y => Random.nextGaussian())
        )
        DenseMatrix(planes.map(_.toArray): _*)
      }

      def hash(volume: DenseVector[Double]): Array[Int] = {
        val mult = hyperplanes * volume
        mult
          .toArray
          .map({ x => if (x > 0) 0 else 1 })
      }
    }
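For readers following the Java examples, the same idea as the Scala alternative can be sketched without Breeze. This is an illustrative translation, not the talk's code: the class name `RandomHyperplaneLsh` and the explicit seed (added so runs are repeatable) are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

class RandomHyperplaneLsh {
    // One row per hyperplane; each row is a Gaussian normal vector.
    final double[][] hyperplanes;

    RandomHyperplaneLsh(int planes, int dimensions, long seed) {
        Random rng = new Random(seed);
        hyperplanes = new double[planes][dimensions];
        for (double[] plane : hyperplanes) {
            for (int i = 0; i < dimensions; i++) {
                plane[i] = rng.nextGaussian();
            }
        }
    }

    // One bit per hyperplane: 0 if the dot product is positive, else 1.
    int[] hash(double[] v) {
        int[] bits = new int[hyperplanes.length];
        for (int p = 0; p < hyperplanes.length; p++) {
            double dot = 0.0;
            for (int i = 0; i < v.length; i++) {
                dot += hyperplanes[p][i] * v[i];
            }
            bits[p] = dot > 0 ? 0 : 1;
        }
        return bits;
    }

    public static void main(String[] args) {
        RandomHyperplaneLsh lsh = new RandomHyperplaneLsh(10, 3, 42L);
        System.out.println(Arrays.toString(lsh.hash(new double[]{100, 0, 100})));
        System.out.println(Arrays.toString(lsh.hash(new double[]{50, 0, 50})));
    }
}
```

Scaling a vector by a positive factor does not change the sign of any dot product, so (100, 0, 100) and (50, 0, 50) receive the same signature regardless of which hyperplanes are drawn.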
  • 28. Signal Clusters
    • Surprised?

    (aabbab,unflinker1,1)
    (bbbbab,flinker3,3)
    Program execution finished
    Job with JobID 9658d0fb4037ec75e24b244479ef4b39 has finished.
    Job Runtime: 469 ms
  • 30. Traditional Pipelines • Count events, group related items, etc.
  • 31. Adversarial Pipelines • Pattern recognition, similarity searching, cohort analysis
  • 32. Adversarial Join

    public class ApproximateArraysJoin {

        public static class Event {
            String name;
            Signal signal;

            public Event() { }

            public Event(String name, Signal signal) {
                this.name = name;
                this.signal = signal;
            }
        }

        public static class Signal extends Tuple3<Integer, Integer, Integer> {
            Signal(Integer x, Integer y, Integer z) {
                super(x, y, z);
            }
        }

        public static void main(String[] args) throws Exception {
            // set up the execution environment
            final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // known-bad signals
            DataStream<Event> control = env.fromElements(
                new Event("badflinker", new Signal(99, 0, 100)),
                new Event("badflinker!", new Signal(77, 0, 75)));

            // get input data
            DataStream<Event> data = env.fromElements(
                new Event("flinker1", new Signal(100, 0, 100)),
                new Event("flinker2", new Signal(100, 0, 101)),
                new Event("flinker3", new Signal(50, 0, 50)),
                new Event("unflinker2", new Signal(100, 0, 50)));

            DataStream<String> result = control
                .broadcast()
                .connect(data)
                .flatMap(new RandHyperplaneCoFlatMap());

            result.print();
            env.execute();
        }

        public static final class RandHyperplaneCoFlatMap implements CoFlatMapFunction<Event, Event, String> {
            HashSet<String> blacklist = new HashSet<>();

            // control stream: remember the signature of each bad signal
            @Override
            public void flatMap1(Event event, Collector<String> out) {
                String hash = HyperplaneSignature.matVProd(event.signal);
                blacklist.add(hash);
                out.collect("added to blacklist " + hash + " " + event.name);
            }

            // data stream: flag events whose signature is on the blacklist
            @Override
            public void flatMap2(Event event, Collector<String> out) {
                String hash = HyperplaneSignature.matVProd(event.signal);
                if (blacklist.contains(hash)) {
                    out.collect("alas! caught another bad one " + hash + " " + event.name);
                } else {
                    out.collect("bummer! this isn't bad " + hash + " " + event.name);
                }
            }
        }
    }
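The join logic can be traced without a streaming runtime. The sketch below is a batch stand-in under stated assumptions: it reuses the six fixed hyperplanes and the a/b sign rule from the Hyperplane Generation slide, processes all control signals before any data, and uses a hypothetical class name `SignatureBlacklist`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SignatureBlacklist {
    // Fixed hyperplanes copied from the Hyperplane Generation slide.
    static final int[][] HYPERPLANES = {
        {-1, 1, 1}, {-1, -1, 1}, {1, 1, 1},
        {1, -1, 1}, {-1, 1, -1}, {1, -1, -1}};

    // One letter per hyperplane: 'a' if the dot product is negative, else 'b'.
    static String signature(int[] v) {
        StringBuilder sb = new StringBuilder();
        for (int[] h : HYPERPLANES) {
            int dot = 0;
            for (int i = 0; i < v.length; i++) {
                dot += h[i] * v[i];
            }
            sb.append(dot < 0 ? 'a' : 'b');
        }
        return sb.toString();
    }

    // Names of data signals whose signature matches any control signature.
    static List<String> flagged(int[][] control, String[] names, int[][] signals) {
        Set<String> blacklist = new HashSet<>();
        for (int[] c : control) {
            blacklist.add(signature(c));
        }
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < signals.length; i++) {
            if (blacklist.contains(signature(signals[i]))) {
                hits.add(names[i]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[][] control = {{99, 0, 100}, {77, 0, 75}};
        String[] names = {"flinker1", "flinker2", "flinker3", "unflinker2"};
        int[][] signals = {{100, 0, 100}, {100, 0, 101}, {50, 0, 50}, {100, 0, 50}};
        System.out.println(flagged(control, names, signals));
    }
}
```

With these hyperplanes the sketch flags flinker2 and unflinker2. flinker1 is not flagged even though (100, 0, 100) is close to (99, 0, 100): it lies exactly on three of the hyperplanes (dot product 0), so the two signals land on different sides of one of them. In the streaming version, a data event that arrives before its matching control event would also go unflagged, since connected streams carry no ordering guarantee between the two inputs.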
  • 34. Architecture
    • Decouples & plays nice with legacy MapReduce Oozie workflows
    • Projections enable passing:
      • User-Item Connected Components of Graphs
    • Achieves
      • 5 Million+ domains per batch
      • Fetching signals with 700+ readings
      • Subdomain aggregation
      • Cohort Analysis
  • 35. Detecting Botnets • 200 Randomly sampled Algorithmically Generated Domains from: Pykspa, Necurs, Suppobox, Pushdo, Zeus Gameover.
  • 37. Signal Pre-Processing • The idea: test whether more compressed versions of the signals could be stored.
  • 38. Detecting Botnets
    • Assigning each of the 8,475 signals a cluster from different LSH & pre-processing techniques. We label the signals, then randomly project the 500-dimensional signals into 2 dimensions. The plots are as follows:
      (1) hand-labeled domains in 9 families,
      (2) raw counts hashed, then clusters formed,
      (3) histogram of signals hashed, then clusters formed,
      (4) rolling average of signals hashed, then clusters formed,
      (5) rolling-average histogram of signals hashed, then clusters formed.
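As an illustration of one pre-processing step named above, here is a minimal rolling-average sketch. The talk does not give the window length or the exact smoothing used, so the window of 3 and the class name `RollingAverage` are assumptions for illustration only.

```java
import java.util.Arrays;

class RollingAverage {
    // Mean over each full window of `window` consecutive readings.
    // The output has signal.length - window + 1 entries.
    static double[] smooth(double[] signal, int window) {
        if (window < 1 || window > signal.length) {
            throw new IllegalArgumentException("window must be in [1, signal.length]");
        }
        double[] out = new double[signal.length - window + 1];
        double sum = 0.0;
        for (int i = 0; i < signal.length; i++) {
            sum += signal[i];
            if (i >= window) {
                sum -= signal[i - window]; // drop the reading that left the window
            }
            if (i >= window - 1) {
                out[i - window + 1] = sum / window;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] signal = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(smooth(signal, 3)));
    }
}
```

The smoothed signal is shorter and less noisy than the raw counts, so hashing it afterwards is one way to store a more compressed version of each signal.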
  • 39. Recap
  • 40. Outline: recap
    • Primitives Filter, Join, GroupBy
      examples: clickstream and cpu usage
    • Non-Primitive Filter, Join, GroupBy
      examples: signal clustering
    • Locality Sensitive Hashing (LSH)
      examples: cosine measure
    • Application: Botnet Detection
  • 42. Bibliography
    • Michael Mitzenmacher and Eli Upfal. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA.
    • Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA.
    • S. Chaudhuri, V. Ganti, and R. Kaushik, “A primitive operator for similarity joins in data cleaning,” Proc. Intl. Conf. on Data Engineering, 2006.
    • A.Z. Broder, M. Charikar, A.M. Frieze, and M. Mitzenmacher, “Min-wise independent permutations,” ACM Symposium on Theory of Computing, pp. 327–336, 1998.
    • A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Comm. ACM 51:1, pp. 117–122, 2008.
  • 44. Flink Main

    public class ApproximateStringsNew {

        public static class OutPreview extends Tuple3<String, String, Integer> {
            OutPreview(String hash, String preview, Integer count) {
                super(hash, preview, count);
            }
        }

        public static void main(String[] args) throws Exception {
            // set up the execution environment
            final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // get input data
            DataSet<String> text = env.fromElements(
                "http://www.google.de/login",
                "http://www.google.de/search?q=flink+forward",
                "http://www.google.de/search?q=simple+string",
                "http://www.google.de/search?q=simple+count"
            );

            DataSet<Tuple3<String, String, Integer>> counts =
                text.flatMap(new MinHasher())
                    .groupBy(0)
                    .sum(2);

            // execute and print result
            counts.print();
        }

        public static final class MinHasher implements FlatMapFunction<String, Tuple3<String, String, Integer>> {
            @Override
            public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) {
                // simple url word extractor ('+' and '?' must be escaped in the regex)
                String[] tokens = value.toLowerCase().split("(/|\\+|=|\\?|:)");
                String hash = MinHasherStatic.minHash(tokens);
                out.collect(new Tuple3<String, String, Integer>(hash, value.toLowerCase(), 1));
            }
        }
    }
  • 45. Hyperplane Generation

    public class MinHasherStatic {
        static int VOCAB_SIZE = 12;
        static int KHASHES = 2;
        static private List<Map<String, Integer>> kHashes = setup();

        // two hand-written permutations of the 12-word vocabulary
        static private List<Map<String, Integer>> setup() {
            Map<String, Integer> hm1 = new HashMap<>();
            hm1.put("http", 1);    hm1.put("www", 7);     hm1.put("google", 6);
            hm1.put("de", 5);      hm1.put("login", 4);   hm1.put("search", 2);
            hm1.put("q", 9);       hm1.put("flink", 10);  hm1.put("forward", 3);
            hm1.put("simple", 8);  hm1.put("string", 11); hm1.put("count", 12);

            Map<String, Integer> hm2 = new HashMap<>();
            hm2.put("http", 8);    hm2.put("www", 9);     hm2.put("google", 12);
            hm2.put("de", 3);      hm2.put("login", 4);   hm2.put("search", 11);
            hm2.put("q", 7);       hm2.put("flink", 10);  hm2.put("forward", 5);
            hm2.put("simple", 1);  hm2.put("string", 2);  hm2.put("count", 6);

            List<Map<String, Integer>> hashes = new ArrayList<>();
            hashes.add(hm1);
            hashes.add(hm2);
            return hashes;
        }

        // for each permutation, keep the smallest rank among the words present
        static String minHash(String[] words) {
            Integer[] mh = new Integer[KHASHES];
            Arrays.fill(mh, VOCAB_SIZE + 1);
            for (String word : words) {
                for (int i = 0; i < KHASHES; i++) {
                    Map<String, Integer> m = kHashes.get(i);
                    if (m != null) {
                        if (m.containsKey(word)) {
                            int h = m.get(word);
                            int c = mh[i];
                            if (h < c) {
                                mh[i] = h;
                            }
                        }
                    }
                }
            }
            return Arrays.toString(mh);
        }
    }
  • 46. ClickStream Clusters
    • Surprised?

    ([1, 1],http://www.google.de/search?q=simple+count,2)
    ([1, 4],http://www.google.de/login,1)
    ([1, 5],http://www.google.de/search?q=flink+forward,1)
    Program execution finished
    Job with JobID 0f8fac5481bb2ccd1acd772dca0759cc has finished.
    Job Runtime: 698 ms
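The grouping is less surprising once the token sets are compared directly: for a random permutation, two sets share a min-hash value with probability equal to their Jaccard similarity. The sketch below computes that similarity for the fake clickstream; the class name `JaccardTokens` is hypothetical, and the tokenizer assumes the escaped form of the delimiter pattern.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardTokens {
    // Assumption: the slides' delimiter set with '+' and '?' escaped.
    static final String DELIMITERS = "(/|\\+|=|\\?|:)";

    // Distinct non-empty tokens of a lower-cased URL.
    static Set<String> tokens(String url) {
        Set<String> result = new HashSet<>();
        for (String token : url.toLowerCase().split(DELIMITERS)) {
            if (token.length() > 0) {
                result.add(token);
            }
        }
        return result;
    }

    // |A intersect B| / |A union B|, defined as 0 when both sets are empty.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> simpleString = tokens("http://www.google.de/search?q=simple+string");
        Set<String> simpleCount = tokens("http://www.google.de/search?q=simple+count");
        Set<String> login = tokens("http://www.google.de/login");
        System.out.println(jaccard(simpleString, simpleCount)); // 5 shared of 7 total
        System.out.println(jaccard(simpleString, login));       // 2 shared of 7 total
    }
}
```

The two "simple" searches share five of seven distinct tokens, while either of them shares only two of seven with the login URL, which is consistent with the two searches landing in the same min-hash bucket.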