Spark Algorithms
Ashutosh Trivedi
Kaushik Ranjan
IIIT Bangalore
Spark-Meetup Bangalore
Outlier Detection and KNN Join
Agenda
• Introduction to two core algorithms
– Outlier Detection on Categorical Data
– KNN-Join
• Application in graph algorithms
– Feedback Vertex Set of a Graph
– Geographical Information Systems
• Challenges we faced
• Best practices
Ashutosh & Kaushik, Spark-Meetup
Bangalore Jan-2015
Outliers
• It's good to be different, but not in data!
• An outlier suggests that something is wrong: the point was generated by a different mechanism.
• How will my model generalize?
• Image ref: http://outskirtsbattledomewiki.com/index.php/13-general-obd-terms/96-outlier
Solutions
• Distance based solutions.
– Mahalanobis Distance
• Covariance matrix solution
• Single class SVM.
• Density based solutions
– Counting frequency
Categorical Data ?
MR-AVF
• Attribute Value Frequency (AVF) assigns a score to each point in the
dataset using the frequency of each of its attribute values.
• Easily parallelizable.
• Shown to perform favourably compared to other competitive but
more complex outlier detection strategies.
• Usages
– Anomaly Detection
– Security
Algorithm
Outliers on Categorical Data
• Attribute Value Frequency
Input
Col 1  Col 2
A      B
A      C
C      B
D      E      (outlier)
Scored
Col 1  Col 2  Score
A      B      4
A      C      3
C      B      3
D      E      2      (low score)
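The scoring in the table above can be sketched in a few lines of plain Python. This is a minimal, single-machine illustration and not the authors' Spark code; the function name `avf_scores` is hypothetical, and it uses the plain sum of frequencies because that is what the table shows.

```python
from collections import Counter

def avf_scores(rows):
    """Return one score per row: the sum of the frequencies of its values,
    where each frequency is counted within the value's own column."""
    # Count every (column number, value) pair across the whole dataset.
    freq = Counter(
        (col, value) for row in rows for col, value in enumerate(row, start=1)
    )
    # A row's score is the sum of the counts of its own (column, value) pairs.
    return [
        sum(freq[(col, value)] for col, value in enumerate(row, start=1))
        for row in rows
    ]

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]
scores = avf_scores(rows)                  # [4, 3, 3, 2]
outlier = rows[scores.index(min(scores))]  # ("D", "E"), the lowest score
```

Rows made of rare values get low scores, which is why ("D", "E") is flagged.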
AVF – Mapping
Key: <Column No, Attribute>
Map output (one pair per cell)
(1,A) → 1, (2,B) → 1
(1,A) → 1, (2,C) → 1
(1,C) → 1, (2,B) → 1
(1,D) → 1, (2,E) → 1
Reduced by key
(1,A) → 2, (2,B) → 2
(1,C) → 1, (2,C) → 1
(1,D) → 1, (2,E) → 1
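The mapping step can be emulated without a cluster. The sketch below uses plain Python lists in place of RDDs; `map_row` and `reduce_by_key` are hypothetical helper names that stand in for what would roughly be a `flatMap` followed by a `reduceByKey` in Spark.

```python
def map_row(row):
    """Emit ((column number, attribute), 1) for every cell of one row."""
    return [((col, value), 1) for col, value in enumerate(row, start=1)]

def reduce_by_key(pairs):
    """Sum the counts of all pairs that share the same key."""
    totals = {}
    for key, count in pairs:
        totals[key] = totals.get(key, 0) + count
    return totals

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]
mapped = [pair for row in rows for pair in map_row(row)]  # 8 pairs, one per cell
freq = reduce_by_key(mapped)                              # 6 distinct keys
```

Because the key includes the column number, the same letter in two different columns is counted separately.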
AVF – Frequency Calculations
Input RDD
Col 1  Col 2
A      B
A      C
C      B
D      E
freq RDD
(1,A) → 2, (2,B) → 2
(1,C) → 1, (2,C) → 1
(1,D) → 1, (2,E) → 1
• Information of line numbers
• A unique identifier
Centralized to Distributed
Imperative to functional
AVF – Line Calculations
• Column index as well as row index
• zipWithIndex
Input RDD
Col 1  Col 2
A      B
A      C
C      B
D      E
data RDD
(1,A) → X1, (2,B) → X1
(1,A) → X2, (2,C) → X2
(1,C) → X3, (2,B) → X3
(1,D) → X4, (2,E) → X4
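A rough sketch of how the row identifiers can be attached, using `enumerate` as a stand-in for Spark's `zipWithIndex`. The `X1` to `X4` labels follow the slide's naming; Spark's own indices start at 0, so the 1-based labels here are an assumption made only to match the slide.

```python
rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

# Pair every row with a unique identifier (the role zipWithIndex plays).
indexed = [(row, f"X{i}") for i, row in enumerate(rows, start=1)]

# Re-key every cell by (column number, attribute), keeping its row identifier.
data = [
    ((col, value), row_id)
    for row, row_id in indexed
    for col, value in enumerate(row, start=1)
]
```

The result has the same keys as the frequency table, which is what makes a join between the two possible.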
AVF – Join
Input RDD
Col 1  Col 2
A      B
A      C
C      B
D      E
data RDD
(1,A) → X1, (2,B) → X1
(1,A) → X2, (2,C) → X2
(1,C) → X3, (2,B) → X3
(1,D) → X4, (2,E) → X4
freq RDD
(1,A) → 2, (2,B) → 2
(1,C) → 1, (2,C) → 1
(1,D) → 1, (2,E) → 1
AVF – Join
Join of data RDD and freq RDD on <Column No, Attribute>
(1,A) → (2, X1), (2, X2)      (2,B) → (2, X1), (2, X3)
(1,C) → (1, X3)               (2,C) → (1, X2)
(1,D) → (1, X4)               (2,E) → (1, X4)
Re-keyed by row
(X1, 2), (X2, 2), (X1, 2), (X3, 2), (X3, 1), (X4, 1), (X4, 1), (X2, 1)
Summed per row
Row  Score
X1   4
X2   3
X3   3
X4   2
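The join and the final summation can be sketched with a dictionary lookup standing in for the distributed join. The two inputs below are copied from the slides; the variable names are illustrative only.

```python
from collections import defaultdict

# freq RDD: (column number, attribute) -> frequency
freq = {(1, "A"): 2, (2, "B"): 2, (1, "C"): 1, (2, "C"): 1, (1, "D"): 1, (2, "E"): 1}

# data RDD: (column number, attribute) -> row identifier
data = [
    ((1, "A"), "X1"), ((2, "B"), "X1"),
    ((1, "A"), "X2"), ((2, "C"), "X2"),
    ((1, "C"), "X3"), ((2, "B"), "X3"),
    ((1, "D"), "X4"), ((2, "E"), "X4"),
]

# Join on the shared key, then re-key each match by row identifier.
joined = [(row_id, freq[key]) for key, row_id in data]

# Sum the frequencies per row to obtain the score.
totals = defaultdict(int)
for row_id, count in joined:
    totals[row_id] += count
scores = dict(totals)

lowest = min(scores, key=scores.get)  # the row with the lowest score
```

On this data the lowest-scoring row is `X4`, the (D, E) row.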
Performance
Performance on Spark
Performance on different data-points
438MB Memory, Intel core i3 machine
Performance on Spark
Performance on 43358 data points with different partitionings of the input file
438MB Memory, Intel core i3 machine
Best Practices
• Use variables minimally; everything should be
immutable.
• More transformations, fewer actions.
• Minimize broadcast.
• Do not update variables inside a filter.
KNN-Join
• Finds the K nearest neighbors from a data set for a given data point.
• Approximate KNN-Join can produce results with on the order of log(n) page
accesses.
• The idea uses Z-values to map points in a multidimensional space to a
single dimension.
• The KNN search for the query point is then translated into a search on that
one-dimensional space.
• Usages
– Similarity search in huge datasets
– Smoothing of images
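As an illustration of the mapping to a single dimension, the sketch below computes a Z-value by interleaving the bits of non-negative integer coordinates. The function name `z_value` and the `bits` parameter are hypothetical. The bit ordering (first coordinate in the most significant slot of each group) is an assumption, chosen because it reproduces the Z-values used in the deck's worked example.

```python
def z_value(point, bits=16):
    """Interleave the bits of non-negative integer coordinates into one integer."""
    dims = len(point)
    z = 0
    for bit in range(bits):
        for d, coord in enumerate(point):
            # Bit `bit` of coordinate d goes into group `bit`; the first
            # coordinate takes the most significant slot of the group.
            z |= ((coord >> bit) & 1) << (bit * dims + (dims - 1 - d))
    return z

z_query = z_value((3, 12, 7))   # 1261
z_first = z_value((3, 14, 6))   # 1276
z_second = z_value((2, 13, 7))  # 1259
```

Points that are close in space often, but not always, end up with nearby Z-values, which is why the method is approximate.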
Z-Order (a.k.a. Morton Code), a Space-Filling Curve
Z-Order
Z-order curve iterations extended
to three dimensions
KNN-Join
Data Set
(3, 14, 6)
(2, 13, 7)
(4, 12, 7)
(4, 14, 6)
Data Point
(3, 12, 7)
Iterations: 2   Neighbors: 1
KNN-Join Calculations
Point        Id
(3, 14, 6)   0
(2, 13, 7)   1
(4, 12, 7)   2
(4, 14, 6)   3
Sorted by Z-value
Id  Z-value
1   1259
0   1276
2   1481
3   1496
Z-value of the data point (3, 12, 7): 1261
First Iteration Result
Point        Id
(3, 14, 6)   0
(2, 13, 7)   1
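The first-iteration result follows from sorting the points by Z-value and taking the neighbors on either side of the query's position. A minimal sketch, assuming that "Neighbors: 1" means one candidate before and one after the query in Z-order; `z_candidates` is a hypothetical helper, and the Z-values are the ones listed in the deck.

```python
import bisect

def z_candidates(z_by_id, query_z, k):
    """Return the ids of up to k points before and k points after query_z in Z-order."""
    ordered = sorted(z_by_id.items(), key=lambda item: item[1])
    # Position at which the query would be inserted into the sorted Z-values.
    pos = bisect.bisect_left([z for _, z in ordered], query_z)
    before = ordered[max(0, pos - k):pos]
    after = ordered[pos:pos + k]
    return [point_id for point_id, _ in before + after]

z_by_id = {0: 1276, 1: 1259, 2: 1481, 3: 1496}
candidates = z_candidates(z_by_id, query_z=1261, k=1)  # [1, 0]
```

The query's Z-value 1261 falls between 1259 (id 1) and 1276 (id 0), so those two points become the candidates.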
Second Iteration
Random Vector: (17, 22, 34)
Id  Point        Shifted Point
0   (3, 14, 6)   (20, 36, 40)
1   (2, 13, 7)   (19, 35, 41)
2   (4, 12, 7)   (21, 34, 41)
3   (4, 14, 6)   (21, 36, 40)
Data Point: (3, 12, 7)
New Data Point: (20, 34, 41)
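The shift itself is a component-wise addition of the random vector to every point and to the query. The sketch below reproduces the numbers on this slide; the helper name `shift` is hypothetical. The usual motivation, as far as can be inferred from the deck, is that a different shift changes which points sit next to each other on the Z-order curve, so a second pass can surface candidates the first pass missed.

```python
def shift(point, vector):
    """Translate a point by adding the vector component-wise."""
    return tuple(p + v for p, v in zip(point, vector))

vector = (17, 22, 34)
data_set = {0: (3, 14, 6), 1: (2, 13, 7), 2: (4, 12, 7), 3: (4, 14, 6)}

shifted = {point_id: shift(p, vector) for point_id, p in data_set.items()}
new_query = shift((3, 12, 7), vector)  # (20, 34, 41)
```

Because every point moves by the same vector, true distances between points are unchanged; only their Z-values, and therefore their order on the curve, differ.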
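Second Iteration – Result
Sorted by Z-value of the shifted points
Id  Z-value
1   115255
2   115477
0   115584
3   115588
Z-value of the new data point (20, 34, 41): 115473
Second iteration result
Point        Id
(2, 13, 7)   1
(4, 12, 7)   2
First iteration result
Point        Id
(3, 14, 6)   0
(2, 13, 7)   1
Union only appends; it does not remove duplicates.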
Z-KNN
Data Point
(3, 12, 7)
Z-KNN Results
Union of both iterations
Id  Point
1   (2, 13, 7)
2   (4, 12, 7)
0   (3, 14, 6)
1   (2, 13, 7)
Grouped by id
1   [ (2, 13, 7), (2, 13, 7) ]
2   [ (4, 12, 7) ]
0   [ (3, 14, 6) ]
Data Set (distinct candidates)
(2, 13, 7)
(4, 12, 7)
(3, 14, 6)
Data Point
(3, 12, 7)
1 Nearest Neighbor
(4, 12, 7)
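The final step can be sketched as follows: drop the duplicates that the union left behind, compute the true distance from each remaining candidate to the data point, and keep the k closest. Euclidean distance is an assumption here, since the deck does not name the metric, and `nearest` is a hypothetical helper.

```python
def nearest(candidates, query, k):
    """Return the k distinct candidates closest to the query point."""
    def squared_distance(point):
        # Squared Euclidean distance preserves the ordering of true distances.
        return sum((a - b) ** 2 for a, b in zip(point, query))

    distinct = set(candidates)  # the union may contain the same point twice
    return sorted(distinct, key=squared_distance)[:k]

candidates = [(2, 13, 7), (4, 12, 7), (3, 14, 6), (2, 13, 7)]
result = nearest(candidates, query=(3, 12, 7), k=1)  # [(4, 12, 7)]
```

The squared distances are 2 for (2, 13, 7), 1 for (4, 12, 7) and 5 for (3, 14, 6), so (4, 12, 7) is the single nearest neighbor.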
Performance on Spark
Performance on different Ks
438MB Memory, Intel core i3 machine
Performance on Spark
Performance on different data-points
with k = 30 and 30 iterations
438MB Memory, Intel core i3 machine
Best Practices
• More code review with Codacy (www.codacy.com)
• Integrated with GitHub
Application on GraphX
• Feedback Vertex Set of a Graph
• Geographical Information Systems
Future Work
• Social Content Matching (max-flow algorithm) (alpha)
• KNN for float types (requires calculation of Morton order for floats)
• Matrix multiplication by the Strassen algorithm, using Morton order as
locality search.
• Similarity between two documents, implementation of all sequence
kernels.
• More outlier detection algorithms
Connect with us
@mail
ashu.trv@gmail.com
kaushikranjan.619@gmail.com
LinkedIn
Ashutosh
https://www.linkedin.com/in/ashutoshtrivedi
Kaushik
https://www.linkedin.com/in/ranjankaushik
Fork our repository at
https://github.com/anantasty/SparkAlgorithms
• Follow us at
– https://github.com/codeAshu
– https://github.com/kaushikranjan
References
• A. Koufakou, J. Secretan, J. Reeder, K. Cardona, and M. Georgiopoulos. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks (IJCNN), pp. 3298-3304, 2008. DOI: 10.1109/IJCNN.2008.4634266
• C. Zhang, F. Li, and J. Jestes. "Efficient parallel kNN joins for large data in MapReduce." Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012. DOI: 10.1145/2247596.2247602

More Related Content

What's hot

Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
Kexin Xie
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
Databricks
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Spark Summit
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
Jen Aman
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
DB Tsai
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
Spark Summit
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
Cloudera, Inc.
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Spark Summit
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Jen Aman
 
2014.06.24.what is ubix
2014.06.24.what is ubix2014.06.24.what is ubix
2014.06.24.what is ubix
Jim Cooley
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
Ankur Dave
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Databricks
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
Nesreen K. Ahmed
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Spark Summit
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
Turi, Inc.
 
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Turi, Inc.
 
GeoMesa: Scalable Geospatial Analytics
GeoMesa:  Scalable Geospatial AnalyticsGeoMesa:  Scalable Geospatial Analytics
GeoMesa: Scalable Geospatial Analytics
VisionGEOMATIQUE2014
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
Piotr Tylenda
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 

What's hot (20)

Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
2014.06.24.what is ubix
2014.06.24.what is ubix2014.06.24.what is ubix
2014.06.24.what is ubix
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark Better {ML} Together: GraphLab Create + Spark
Better {ML} Together: GraphLab Create + Spark
 
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infr...
 
GeoMesa: Scalable Geospatial Analytics
GeoMesa:  Scalable Geospatial AnalyticsGeoMesa:  Scalable Geospatial Analytics
GeoMesa: Scalable Geospatial Analytics
 
Log Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine LearningLog Analytics in Datacenter with Apache Spark and Machine Learning
Log Analytics in Datacenter with Apache Spark and Machine Learning
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 

Similar to Spark algorithms

Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research Meeting
Chawanat Nakasan
 
Programing Slicing and Its applications
Programing Slicing and Its applicationsPrograming Slicing and Its applications
Programing Slicing and Its applications
Ankur Jain
 
Implementation of Carry Skip Adder using PTL
Implementation of Carry Skip Adder using PTLImplementation of Carry Skip Adder using PTL
Implementation of Carry Skip Adder using PTL
IRJET Journal
 
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
Rakuten Group, Inc.
 
Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2
Hyun Wong Choi
 
Design of 32 bit Parallel Prefix Adders
Design of 32 bit Parallel Prefix Adders Design of 32 bit Parallel Prefix Adders
Design of 32 bit Parallel Prefix Adders
IOSR Journals
 
Design of 32 bit Parallel Prefix Adders
Design of 32 bit Parallel Prefix AddersDesign of 32 bit Parallel Prefix Adders
Design of 32 bit Parallel Prefix Adders
IOSR Journals
 
Foundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theoryFoundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theory
DataWorks Summit
 
Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad...
Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad...Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad...
Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad...
IJMTST Journal
 
Hybrid predictive modelling of geometry with limited data in cold spray addit...
Hybrid predictive modelling of geometry with limited data in cold spray addit...Hybrid predictive modelling of geometry with limited data in cold spray addit...
Hybrid predictive modelling of geometry with limited data in cold spray addit...
Daiki Ikeuchi
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
 
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache SparkAttribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
IRJET Journal
 
Stock Market Analysis and Prediction
Stock Market Analysis and PredictionStock Market Analysis and Prediction
Stock Market Analysis and Prediction
Anil Shrestha
 
Graph x pregel
Graph x pregelGraph x pregel
Graph x pregel
Sigmoid
 
Design and Estimation of delay, power and area for Parallel prefix adders
Design and Estimation of delay, power and area for Parallel prefix addersDesign and Estimation of delay, power and area for Parallel prefix adders
Design and Estimation of delay, power and area for Parallel prefix adders
IJERA Editor
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
Editor IJCATR
 
Design Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic Circuits
Design Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic CircuitsDesign Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic Circuits
Design Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic Circuits
IJRES Journal
 
Evaluation of High Speed and Low Memory Parallel Prefix Adders
Evaluation of High Speed and Low Memory Parallel Prefix AddersEvaluation of High Speed and Low Memory Parallel Prefix Adders
Evaluation of High Speed and Low Memory Parallel Prefix Adders
IOSR Journals
 

Similar to Spark algorithms (20)

Personal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research MeetingPersonal Research Overview presented at the KU-NAIST Research Meeting
Personal Research Overview presented at the KU-NAIST Research Meeting
 
Programing Slicing and Its applications
Programing Slicing and Its applicationsPrograming Slicing and Its applications
Programing Slicing and Its applications
 
Implementation of Carry Skip Adder using PTL
Implementation of Carry Skip Adder using PTLImplementation of Carry Skip Adder using PTL
Implementation of Carry Skip Adder using PTL
 
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...Rakuten Technology Conference 2017 A Distributed SQL Database  For Data Analy...
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...
 
Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2Master defense presentation 2019 04_18_rev2
Master defense presentation 2019 04_18_rev2
 
resume
resumeresume
resume
 
Design of 32 bit Parallel Prefix Adders
Design of 32 bit Parallel Prefix Adders Design of 32 bit Parallel Prefix Adders
Design of 32 bit Parallel Prefix Adders
 
Design of 32 bit Parallel Prefix Adders
Design of 32 bit Parallel Prefix AddersDesign of 32 bit Parallel Prefix Adders
Design of 32 bit Parallel Prefix Adders
 
Foundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theoryFoundations of streaming SQL: stream & table theory
Foundations of streaming SQL: stream & table theory
 
Sai Dheeraj_Resume
Sai Dheeraj_ResumeSai Dheeraj_Resume
Sai Dheeraj_Resume
 
Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad...
Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad...Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad...
Implementation and Estimation of Delay, Power and Area for Parallel Prefix Ad...
 
Hybrid predictive modelling of geometry with limited data in cold spray addit...
Hybrid predictive modelling of geometry with limited data in cold spray addit...Hybrid predictive modelling of geometry with limited data in cold spray addit...
Hybrid predictive modelling of geometry with limited data in cold spray addit...
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache SparkAttribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
Attribute Reduction:An Implementation of Heuristic Algorithm using Apache Spark
 
Stock Market Analysis and Prediction
Stock Market Analysis and PredictionStock Market Analysis and Prediction
Stock Market Analysis and Prediction
 
Graph x pregel
Graph x pregelGraph x pregel
Graph x pregel
 
Design and Estimation of delay, power and area for Parallel prefix adders
Design and Estimation of delay, power and area for Parallel prefix addersDesign and Estimation of delay, power and area for Parallel prefix adders
Design and Estimation of delay, power and area for Parallel prefix adders
 
New Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids AlgorithmNew Approach for K-mean and K-medoids Algorithm
New Approach for K-mean and K-medoids Algorithm
 
Design Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic Circuits
Design Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic CircuitsDesign Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic Circuits
Design Of 64-Bit Parallel Prefix VLSI Adder For High Speed Arithmetic Circuits
 
Evaluation of High Speed and Low Memory Parallel Prefix Adders
Evaluation of High Speed and Low Memory Parallel Prefix AddersEvaluation of High Speed and Low Memory Parallel Prefix Adders
Evaluation of High Speed and Low Memory Parallel Prefix Adders
 

Recently uploaded

Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
Spark algorithms

  • 1. Spark Algorithms: Outlier Detection and KNN Join. Ashutosh Trivedi and Kaushik Ranjan, IIIT Bangalore. Spark-Meetup Bangalore.
  • 2. Agenda • Introduction to two core algorithms – Outlier Detection on Categorical Data – KNN-Join • Application in graph algorithms – Feedback Vertex Set of a Graph – Geographical Information Systems • Challenges we faced • Best practices Ashutosh & Kaushik, Spark-Meetup Bangalore Jan-2015 2
  • 3. Outliers • It's good to be different, but not in data! • An outlier suggests that something is wrong: the point was generated by a different mechanism. • How will my model generalize? • Image ref: http://outskirtsbattledomewiki.com/index.php/13-general-obd-terms/96-outlier
  • 4. Solutions • Distance-based solutions – Mahalanobis distance • Covariance matrix solution • Single-class SVM • Density-based solutions – Counting frequency • What about categorical data?
  • 5. MR-AVF • Attribute Value Frequency (AVF) assigns a score to each point in the dataset using the frequency of each unique attribute value. • Easily parallelizable. • Shown to perform favourably compared to other competitive but more complex outlier detection strategies. • Usages – Anomaly detection – Security
  • 6. Algorithm
  • 7. Outliers on Categorical Data • Attribute Value Frequency: the score of a row is the sum of the frequencies of its attribute values, each counted within its own column. The row with the lowest score is the outlier.
      Col 1   Col 2   Score
      A       B       4
      A       C       3
      C       B       3
      D       E       2   (low score: outlier)
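The scoring on this slide can be reproduced with a few lines of plain Python. This is a minimal single-machine sketch, not the authors' Spark code; the function name `avf_scores` is made up for illustration.

```python
from collections import Counter

def avf_scores(rows):
    """Return one AVF score per row: the sum of the frequencies of its values."""
    n_cols = len(rows[0])
    # One frequency table per column, so "C" in column 1 and "C" in column 2
    # are counted separately.
    freq = [Counter(row[c] for row in rows) for c in range(n_cols)]
    return [sum(freq[c][row[c]] for c in range(n_cols)) for row in rows]

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]
scores = avf_scores(rows)
outlier = rows[scores.index(min(scores))]
print(scores)   # [4, 3, 3, 2]
print(outlier)  # ('D', 'E')
```

The row with the lowest score is the one whose values are rarest in their columns, which is why (D, E) is flagged.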
  • 8. AVF – Mapping • Key: <Column No, Attribute>
      Map output, one pair per cell:
        (1,A) → 1, (2,B) → 1
        (1,A) → 1, (2,C) → 1
        (1,C) → 1, (2,B) → 1
        (1,D) → 1, (2,E) → 1
      After reducing by key:
        (1,A) → 2, (2,B) → 2
        (1,C) → 1, (2,C) → 1
        (1,D) → 1, (2,E) → 1
  • 9. AVF – Frequency Calculations • Information of line numbers • A unique identifier
      Input RDD:
        Col 1   Col 2
        A       B
        A       C
        C       B
        D       E
      freq RDD:
        (1,A) → 2, (2,B) → 2
        (1,C) → 1, (2,C) → 1
        (1,D) → 1, (2,E) → 1
  • 10. Centralized to Distributed
  • 11. Imperative to functional
  • 12. Centralized to Distributed • Imperative to functional
  • 13. Centralized to Distributed • Imperative to functional
  • 14. AVF – Line Calculations • Column index as well as row index • ZipWithIndex
      Input rows:
        Col 1   Col 2
        A       B
        A       C
        C       B
        D       E
      data RDD:
        (1,A) → X1, (2,B) → X1
        (1,A) → X2, (2,C) → X2
        (1,C) → X3, (2,B) → X3
        (1,D) → X4, (2,E) → X4
  • 15. AVF – Join
      Input RDD: rows (A, B), (A, C), (C, B), (D, E)
      data RDD:
        (1,A) → X1, (2,B) → X1
        (1,A) → X2, (2,C) → X2
        (1,C) → X3, (2,B) → X3
        (1,D) → X4, (2,E) → X4
      freq RDD:
        (1,A) → 2, (2,B) → 2
        (1,C) → 1, (2,C) → 1
        (1,D) → 1, (2,E) → 1
  • 16. AVF – Join
      freq RDD joined with data RDD on the key:
        (1,A) → (2, X1), (2, X2)
        (2,B) → (2, X1), (2, X3)
        (1,C) → (1, X3)
        (2,C) → (1, X2)
        (1,D) → (1, X4)
        (2,E) → (1, X4)
      Re-keyed by row:
        (X1, 2), (X2, 2), (X1, 2), (X3, 2), (X3, 1), (X4, 1), (X4, 1), (X2, 1)
      Summed per row:
        Row   Score
        X1    4
        X2    3
        X3    3
        X4    2
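The mapping, frequency, line-index and join stages from slides 8 to 16 can be traced end to end in plain Python. The sketch below only imitates the RDD operations with lists and dictionaries; the comments name the Spark operation each step stands in for, and the row ids 0..3 play the role of X1..X4. It is an illustration under those assumptions, not the code from the authors' repository.

```python
from collections import defaultdict

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

# zipWithIndex analogue: attach a row id to every row.
indexed = list(enumerate(rows))

# flatMap analogue: one ((column, value), row_id) pair per cell.
# Columns are numbered from 1, as on the slides.
data = [((col, value), row_id)
        for row_id, row in indexed
        for col, value in enumerate(row, start=1)]

# map + reduceByKey analogue: count occurrences of each (column, value) key.
freq = defaultdict(int)
for key, _ in data:
    freq[key] += 1

# join analogue: look up the frequency of each cell's key and re-key by row id.
joined = [(row_id, freq[key]) for key, row_id in data]

# reduceByKey analogue on the row id: sum the frequencies into the AVF score.
scores = defaultdict(int)
for row_id, count in joined:
    scores[row_id] += count

print(dict(freq))
print(dict(scores))
```

Keeping the row id next to every key is what lets the final sum be computed per row after the join.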
  • 17. Performance
  • 18. Performance on Spark • Performance on different numbers of data points • 438 MB memory, Intel Core i3 machine
  • 19. Performance on Spark • Performance on 43358 data points with different partitionings of the file • 438 MB memory, Intel Core i3 machine
  • 20. Best Practices • Minimize the use of variables; everything should be immutable. • Prefer more transformations and fewer actions. • Minimize broadcasts. • Do not update a variable inside a filter.
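The advice about immutability and not updating variables inside a filter can be illustrated without Spark. The following is a plain-Python sketch with made-up data; it contrasts a loop that mutates state with a pure predicate passed to `filter`. In Spark the concern is sharper, because a closure shipped to the workers operates on its own copy of any captured variable.

```python
from functools import reduce

values = [3, 8, 1, 9, 4]

# Imperative style: a counter is mutated as a side effect while selecting.
kept, seen = [], 0
for v in values:
    seen += 1
    if v > 3:
        kept.append(v)

# Functional style: the predicate is pure, and every result is derived
# from the data instead of from shared mutable state.
def is_large(v):
    return v > 3

kept_functional = list(filter(is_large, values))
total = reduce(lambda a, b: a + b, kept_functional, 0)

print(kept_functional, total)
```

Both styles select the same elements here; the functional version simply has no state that could diverge between workers.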
  • 21. KNN-Join • Finds the K nearest neighbors in a data set for a given data point. • Approximate KNN-Join generates results with on the order of log(n) page accesses. • The idea uses Z-values to map points in a multi-dimensional space to a single dimension. • It translates the KNN search for the query point into a search on the one-dimensional space. • Usages – Similarity search in huge datasets – Smoothing of images
  • 22. Z-Order, a.k.a. Morton code: a space-filling curve
  • 23. Z-Order • Z-order curve iterations extended to three dimensions
  • 24. KNN-Join • Data set: (3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6) • Data point: (3, 12, 7) • Iterations: 2 • Neighbors: 1
  • 25. KNN-Join Calculations
      Index   Point         Z-value
      0       (3, 14, 6)    1276
      1       (2, 13, 7)    1259
      2       (4, 12, 7)    1481
      3       (4, 14, 6)    1496
      Sorted by Z-value: 1 (1259), 0 (1276), 2 (1481), 3 (1496)
      Z-value of the data point (3, 12, 7): 1261
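The Z-values on this slide are consistent with a plain bit interleaving in which, at each bit position, the first coordinate contributes the most significant bit. The sketch below assumes non-negative integer coordinates that fit in `bits` bits; the name `z_value` and the default width of 8 are illustrative choices, not taken from the slides.

```python
def z_value(point, bits=8):
    """Interleave the coordinate bits, most significant bit first."""
    code = 0
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            code = (code << 1) | ((coord >> bit) & 1)
    return code

data_set = [(3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6)]
z_values = [z_value(p) for p in data_set]
query_z = z_value((3, 12, 7))

# Indices of the data set sorted along the Z-order curve.
order = sorted(range(len(data_set)), key=lambda i: z_values[i])

print(z_values)  # [1276, 1259, 1481, 1496]
print(query_z)   # 1261
print(order)     # [1, 0, 2, 3]
```

Along the curve the data point (1261) falls between index 1 (1259) and index 0 (1276), which is why those two points are the candidates of the first iteration.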
  • 26. First Iteration Result
      Index   Point
      0       (3, 14, 6)
      1       (2, 13, 7)
  • 27. Second Iteration • Random vector: (17, 22, 34) • Data point: (3, 12, 7) • New data point: (20, 34, 41)
      Index   Point         Shifted point
      0       (3, 14, 6)    (20, 36, 40)
      1       (2, 13, 7)    (19, 35, 41)
      2       (4, 12, 7)    (21, 34, 41)
      3       (4, 14, 6)    (21, 36, 40)
  • 28. Second Iteration – Result
      Sorted by Z-value: 1 (115255), 2 (115477), 0 (115584), 3 (115588)
      Z-value of the new data point: 115473
      Second iteration candidates: (2, 13, 7) [index 1], (4, 12, 7) [index 2]
      First iteration candidates: (3, 14, 6) [index 0], (2, 13, 7) [index 1]
      Union only appends; it does not remove duplicates.
  • 29. Z-KNN • Data point: (3, 12, 7)
      Union of both iterations:
        1   (2, 13, 7)
        2   (4, 12, 7)
        0   (3, 14, 6)
        1   (2, 13, 7)
      Grouped by index:
        1   [ (2, 13, 7), (2, 13, 7) ]
        2   [ (4, 12, 7) ]
        0   [ (3, 14, 6) ]
      Data set: (2, 13, 7), (4, 12, 7), (3, 14, 6)
  • 30. Z-KNN Results • Data point: (3, 12, 7) • Data set: (2, 13, 7), (4, 12, 7), (3, 14, 6) • 1 nearest neighbor: (4, 12, 7)
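The two iterations on slides 24 to 30 can be replayed as a small single-machine sketch. It assumes the first iteration uses no shift (a zero vector) and the second uses the random vector (17, 22, 34) shown on slide 27; the helper names `z_value` and `z_knn` are illustrative and not the authors' API. Candidates are collected in a set, which plays the role of the grouping step that removes the duplicates left by the union.

```python
from bisect import bisect_left
from math import dist

def z_value(point, bits=8):
    """Interleave the coordinate bits, most significant bit first."""
    code = 0
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            code = (code << 1) | ((coord >> bit) & 1)
    return code

def z_knn(data_set, query, k, shifts):
    """Approximate kNN: gather Z-order neighbours per shift, then re-rank."""
    candidates = set()
    for shift in shifts:
        shifted = [tuple(c + s for c, s in zip(p, shift)) for p in data_set]
        shifted_query = tuple(c + s for c, s in zip(query, shift))
        ranked = sorted((z_value(p), i) for i, p in enumerate(shifted))
        pos = bisect_left(ranked, (z_value(shifted_query), -1))
        # Up to k predecessors and k successors of the query on the curve.
        for _, i in ranked[max(0, pos - k):pos + k]:
            candidates.add(i)
    # Re-rank the candidates by true Euclidean distance in the original space.
    nearest = sorted(candidates, key=lambda i: dist(data_set[i], query))[:k]
    return candidates, [data_set[i] for i in nearest]

data_set = [(3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6)]
shifts = [(0, 0, 0), (17, 22, 34)]
candidates, result = z_knn(data_set, (3, 12, 7), k=1, shifts=shifts)

print(sorted(candidates))  # [0, 1, 2]
print(result)              # [(4, 12, 7)]
```

The unshifted pass alone would miss (4, 12, 7), the true nearest neighbor; the shifted pass brings it next to the query on the curve, which is the motivation for running several shifted iterations.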
  • 31. Performance on Spark • Performance for different values of K • 438 MB memory, Intel Core i3 machine
  • 32. Performance on Spark • Performance on different numbers of data points with k = 30 and 30 iterations • 438 MB memory, Intel Core i3 machine
  • 33. Best Practices • More code review at Codacy (www.codacy.com) • Integrated with GitHub
  • 35. Application on GraphX • Feedback Vertex Set of a Graph • Geographical Information Systems
  • 36. Future Work • Social content matching (max-flow algorithm) (alpha) • KNN for float types (requires calculating the Morton order for floats) • Matrix multiplication by the Strassen algorithm, using Morton order as locality search • Similarity between two documents; implementation of all sequence kernels • More outlier detection algorithms
  • 37. Connect with us • Mail: ashu.trv@gmail.com, kaushikranjan.619@gmail.com • LinkedIn: Ashutosh https://www.linkedin.com/in/ashutoshtrivedi, Kaushik https://www.linkedin.com/in/ranjankaushik • Fork our repository at https://github.com/anantasty/SparkAlgorithms
  • 38. References • Follow us at https://github.com/codeAshu and https://github.com/kaushikranjan • A. Koufakou, J. Secretan, J. Reeder, K. Cardona, and M. Georgiopoulos. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks (IJCNN), pp. 3298-3304, 2008. DOI: 10.1109/IJCNN.2008.4634266 • C. Zhang, F. Li, and J. Jestes. "Efficient parallel kNN joins for large data in MapReduce." Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012. DOI: 10.1145/2247596.2247602