Spark Algorithms
Ashutosh Trivedi
Kaushik Ranjan
IIIT Bangalore
Spark-Meetup Bangalore
Outlier Detection and KNN Join

Agenda
• Introduction to two core algorithms
– Outlier Detection on Categorical Data
– KNN-Join
• Application in graph algorithms
– Feedback Vertex Set of a Graph
– Geographical Information Systems
• Challenges we faced
• Best practices
Ashutosh & Kaushik, Spark-Meetup
Bangalore Jan-2015

Outliers
• It's good to be different, but not in data!
• An outlier signals that something is wrong: the point was generated by a different mechanism.
• How will my model generalize?
• Image ref: http://outskirtsbattledomewiki.com/index.php/13-general-obd-terms/96-outlier

Solutions
• Distance-based solutions
– Mahalanobis distance
• Covariance-matrix solutions
• One-class SVM
• Density-based solutions
– Counting frequency
Categorical data?

MR-AVF
• Attribute Value Frequency (AVF) is based on assigning a score to each point in the dataset using the frequency of each unique attribute value.
• Easily parallelizable.
• Shown to perform favourably compared to other competitive but more complex outlier detection strategies.
• Usages
– Anomaly detection
– Security

Algorithm

Outliers on Categorical Data
• Attribute Value Frequency

Input:
Col 1   Col 2
A       B
A       C
C       B
D       E       (outlier)

Scored:
Col 1   Col 2   Score
A       B       4
A       C       3
C       B       3
D       E       2       (low score)
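The scores in the table above can be reproduced with a minimal single-machine sketch. It assumes the score is the plain sum of value frequencies, which is what the slide's numbers are consistent with; the function name `avf_scores` is illustrative and not taken from the authors' repository.

```python
from collections import Counter

def avf_scores(rows):
    """Return one AVF score per row: the sum of the frequencies of its attribute values."""
    # Count how often each (column index, value) pair occurs in the dataset.
    freq = Counter((col, value) for row in rows for col, value in enumerate(row))
    # A row's score is the sum of the counts of its cells (assumption: no normalization).
    return [sum(freq[(col, value)] for col, value in enumerate(row)) for row in rows]

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]
scores = avf_scores(rows)
# The row with the lowest score is the outlier candidate.
outlier = rows[scores.index(min(scores))]
print(scores, outlier)
```

The row (D, E) gets the lowest score because neither of its values occurs anywhere else in its column.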

AVF - Mapping
Key: <Column No, Attribute>

Mapped pairs:
(1,A) → 1, (2,B) → 1
(1,A) → 1, (2,C) → 1
(1,C) → 1, (2,B) → 1
(1,D) → 1, (2,E) → 1

Reduced by key:
(1,A) → 2, (2,B) → 2
(1,C) → 1, (2,C) → 1
(1,D) → 1, (2,E) → 1

AVF - Frequency Calculations

Input RDD:
Col 1   Col 2
A       B
A       C
C       B
D       E

freq RDD:
(1,A) → 2, (2,B) → 2
(1,C) → 1, (2,C) → 1
(1,D) → 1, (2,E) → 1

• Information of line numbers
• A unique identifier
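The mapping and frequency steps can be sketched locally in plain Python. This is a stand-in for the distributed version, not Spark code: `reduce_by_key` is a hypothetical helper that only mimics what a reduce-by-key step does, and the 1-based column numbers follow the slide.

```python
from functools import reduce
from itertools import groupby
from operator import add, itemgetter

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

# Map phase: emit ((column number, value), 1) for every cell.
mapped = [((col, value), 1) for row in rows for col, value in enumerate(row, start=1)]

def reduce_by_key(pairs, func):
    """Local stand-in for a reduce-by-key step: combine all values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: reduce(func, (value for _, value in group))
            for key, group in groupby(ordered, key=itemgetter(0))}

# Reduce phase: add up the ones per key to get the frequency of each attribute value.
freq = reduce_by_key(mapped, add)
print(freq)
```

Both phases are pure functions of their input, which is what makes the computation easy to split across partitions.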

Centralized to Distributed

Imperative to functional


AVF - Line Calculations
• Column index as well as row index
• zipWithIndex

Input RDD:
Col 1   Col 2
A       B
A       C
C       B
D       E

data RDD:
(1,A) → X1, (2,B) → X1
(1,A) → X2, (2,C) → X2
(1,C) → X3, (2,B) → X3
(1,D) → X4, (2,E) → X4

AVF - Join

Input RDD:
Col 1   Col 2
A       B
A       C
C       B
D       E

data RDD:
(1,A) → X1, (2,B) → X1
(1,A) → X2, (2,C) → X2
(1,C) → X3, (2,B) → X3
(1,D) → X4, (2,E) → X4

freq RDD:
(1,A) → 2, (2,B) → 2
(1,C) → 1, (2,C) → 1
(1,D) → 1, (2,E) → 1

AVF - Join

Joined on key (frequency, row):
(1,A) → (2, X1), (2, X2)
(2,B) → (2, X1), (2, X3)
(1,C) → (1, X3)
(2,C) → (1, X2)
(1,D) → (1, X4)
(2,E) → (1, X4)

Re-keyed by row (row, frequency):
(X1, 2), (X2, 2), (X1, 2), (X3, 2), (X3, 1), (X4, 1), (X4, 1), (X2, 1)

Summed per row:
Row   Score
X1    4
X2    3
X3    3
X4    2
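The join and the final summation can be traced with a small local sketch. It assumes row ids X1..X4 as on the slides and replaces the distributed join with a dictionary lookup, so it illustrates the data flow only and is not the authors' Spark implementation.

```python
from collections import Counter, defaultdict

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

# data: ((column number, value), row id), mirroring the "data RDD".
data = [((col, value), f"X{i}")
        for i, row in enumerate(rows, start=1)
        for col, value in enumerate(row, start=1)]

# freq: (column number, value) -> count, mirroring the "freq RDD".
freq = Counter(key for key, _ in data)

# Join on the key, then re-key each match by row id: (row id, frequency).
joined = [(row_id, freq[key]) for key, row_id in data]

# Sum the frequencies per row id to obtain the AVF score of each row.
scores = defaultdict(int)
for row_id, count in joined:
    scores[row_id] += count

print(dict(scores))
```

Keeping the row id next to every cell is what lets the per-value frequencies be routed back to the row they belong to.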

Performance

Performance on Spark
Performance on different data-points
438 MB memory, Intel Core i3 machine

Performance on 43358 data-points with different partitioning of the file

Best Practices
• Minimize the use of variables; everything should be immutable.
• More transformations, fewer actions.
• Minimize broadcasts.
• Do not update a variable inside a filter.
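The last point can be illustrated with a small plain-Python sketch. The data and the names `is_valid`, `kept` and `dropped` are made up for illustration; the commented-out version shows the pattern to avoid, since on a cluster each executor works on its own copy of a captured variable and the driver's copy is not updated.

```python
from functools import reduce

records = [3, -1, 4, -1, 5, -9, 2, 6]

# Discouraged: a predicate that also updates outside state.
# dropped = 0
# def keep(x):
#     global dropped
#     if x < 0:
#         dropped += 1
#     return x >= 0

# Preferred: a pure predicate, with the count derived by a separate aggregation.
def is_valid(x):
    return x >= 0

kept = tuple(filter(is_valid, records))
dropped = reduce(lambda acc, x: acc + (0 if is_valid(x) else 1), records, 0)
print(kept, dropped)
```

Both results are computed from immutable inputs, so the order of evaluation and the number of partitions do not affect them.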

KNN-Join
• Finds the K nearest neighbors in a data set for a given data point.
• Approximate KNN-Join helps generate results with on the order of log(n) page accesses.
• The idea uses Z-values to map points in a multi-dimensional space to a single dimension.
• It translates the KNN search for the query point into a search on the one-dimensional space.
• Usages
– Similarity search in huge datasets
– Smoothing of images

Z-Order a.k.a. Morton Code a.k.a. Space-Filling Curve

Z-Order
Z-order curve iterations extended to three dimensions

KNN-Join
Iteration: 2   Neighbors: 1

Data Set
3 14 6
2 13 7
4 12 7
4 14 6

Data Point
3 12 7

KNN-Join Calculations

Data set with row index:
Point        Index
3, 14, 6     0
2, 13, 7     1
4, 12, 7     2
4, 14, 6     3

Sorted by Z-value:
Index   Z-value
1       1259
0       1276
2       1481
3       1496

Z-value of the data point: 1261
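The Z-values listed above are consistent with plain bit interleaving of the coordinates. The sketch below assumes non-negative integer coordinates and assumes that, within each group of interleaved bits, the first coordinate takes the most significant position; that ordering was chosen because it reproduces the slide's numbers. The name `z_value` and the `bits` parameter are illustrative.

```python
def z_value(point, bits=8):
    """Interleave the bits of non-negative integer coordinates into one Z-value.

    Assumption: within each group of interleaved bits, the first coordinate
    occupies the most significant position.
    """
    dims = len(point)
    z = 0
    for bit in range(bits):
        for d, coord in enumerate(point):
            # Take bit number `bit` of this coordinate and place it in its interleaved slot.
            z |= ((coord >> bit) & 1) << (bit * dims + (dims - 1 - d))
    return z

data_set = [(3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6)]
print([z_value(p) for p in data_set])  # [1276, 1259, 1481, 1496]
print(z_value((3, 12, 7)))             # 1261
```

With `bits=8` every coordinate must be below 256; larger coordinates need a larger `bits` value.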

First Iteration Result
Point        Index
3, 14, 6     0
2, 13, 7     1
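The first-iteration result can be reproduced by sorting on Z-value and taking the entries on either side of the query's position. This is a minimal sketch that assumes the candidates are the k predecessors and k successors of the query on the curve; `z_candidates` is a hypothetical name, and the Z-values are copied from the previous slide.

```python
from bisect import bisect_left

def z_candidates(z_by_index, query_z, k):
    """Return the row indices of up to k predecessors and k successors of query_z."""
    ordered = sorted(z_by_index.items(), key=lambda item: item[1])  # (index, z) by z
    z_sorted = [z for _, z in ordered]
    pos = bisect_left(z_sorted, query_z)        # slot where the query would be inserted
    window = ordered[max(0, pos - k):pos + k]   # k entries on each side of that slot
    return [index for index, _ in window]

# Row index -> Z-value, as listed on the slide.
z_by_index = {0: 1276, 1: 1259, 2: 1481, 3: 1496}
print(z_candidates(z_by_index, query_z=1261, k=1))  # [1, 0]
```

Because only a binary search over sorted values is needed, the lookup touches a logarithmic number of positions rather than scanning the whole data set.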

Second Iteration

Data set:
Point        Index
3, 14, 6     0
2, 13, 7     1
4, 12, 7     2
4, 14, 6     3

Data Point
3 12 7

Random Vector
17 22 34

Shifted data set:
Point         Index
20, 36, 40    0
19, 35, 41    1
21, 34, 41    2
21, 36, 40    3

New data point
20 34 41

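Second Iteration - Result

Sorted by Z-value:
Index   Z-value
1       115255
2       115477
0       115584
3       115588

Z-value of the new data point: 115473

Second-iteration neighbors:
Point        Index
2, 13, 7     1
4, 12, 7     2

First-iteration neighbors:
Point        Index
3, 14, 6     0
2, 13, 7     1

• Union only appends; it does not remove duplicates.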
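The shift and the union can be sketched as follows. The vector and the per-iteration candidate indices are read off the slides; `translate` is an illustrative helper, and list concatenation stands in for a union that keeps duplicates.

```python
data_set = [(3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6)]
query = (3, 12, 7)
shift = (17, 22, 34)  # the random vector from the slide

def translate(point, vector):
    """Shift a point by adding the vector coordinate-wise."""
    return tuple(p + v for p, v in zip(point, vector))

# The same vector is applied to the data set and to the query point.
shifted_set = [translate(p, shift) for p in data_set]
shifted_query = translate(query, shift)

# Candidate row indices per iteration, as read off the slides.
first_iteration = [0, 1]
second_iteration = [1, 2]

# Union as plain concatenation: index 1 appears twice.
union = first_iteration + second_iteration
print(shifted_set, shifted_query, union)
```

Shifting changes which points end up adjacent on the curve, which is how the second iteration surfaces index 2 even though the first one missed it.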

Z-KNN

Data Point
3 12 7

Union of both iterations:
Index   Point
1       2, 13, 7
2       4, 12, 7
0       3, 14, 6
1       2, 13, 7

Grouped by index:
1   [ (2, 13, 7), (2, 13, 7) ]
2   [ (4, 12, 7) ]
0   [ (3, 14, 6) ]

Data Set
2, 13, 7
4, 12, 7
3, 14, 6

Z-KNN Results

Data Set
2, 13, 7
4, 12, 7
3, 14, 6

Data Point
3 12 7

1 Nearest Neighbor
4 12 7
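The final step, removing duplicate candidates and ranking the rest by true distance, can be sketched as below. Euclidean distance is an assumption, since the slides do not name the metric, and `nearest` is an illustrative function name.

```python
import heapq
import math

def nearest(candidates, query, k):
    """Drop duplicate candidates, then return the k closest by Euclidean distance."""
    unique = list(dict.fromkeys(candidates))  # keeps the first occurrence of each point
    return heapq.nsmallest(k, unique, key=lambda p: math.dist(p, query))

# Union of both iterations; (2, 13, 7) appears twice.
candidates = [(2, 13, 7), (4, 12, 7), (3, 14, 6), (2, 13, 7)]
print(nearest(candidates, query=(3, 12, 7), k=1))  # [(4, 12, 7)]
```

Exact distances are computed only for the small candidate set, not for the whole data set.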

Performance on different Ks

Performance on different data-points with k = 30 and 30 iterations

Best Practices
• More code review with Codacy (www.codacy.com)
• Integrated with GitHub


Application on GraphX
• Feedback Vertex Set of a Graph
• Geographical Information Systems

Future Work
• Social content matching (max-flow algorithm) (alpha)
• KNN for float types (requires calculating the Morton order for floats)
• Matrix multiplication by the Strassen algorithm, using Morton order for locality search
• Similarity between two documents: implementation of all sequence kernels
• More outlier detection algorithms

Connect with us
Email
ashu.trv@gmail.com
kaushikranjan.619@gmail.com
LinkedIn
Ashutosh: https://www.linkedin.com/in/ashutoshtrivedi
Kaushik: https://www.linkedin.com/in/ranjankaushik
Fork our repository at
https://github.com/anantasty/SparkAlgorithms

References
• Follow us at
– https://github.com/codeAshu
– https://github.com/kaushikranjan
• A. Koufakou, J. Secretan, J. Reeder, K. Cardona, and M. Georgiopoulos. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks (IJCNN), pp. 3298-3304, 2008. DOI: 10.1109/IJCNN.2008.4634266
• C. Zhang, F. Li, and J. Jestes. "Efficient parallel kNN joins for large data in MapReduce." Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012. DOI: 10.1145/2247596.2247602
