Scalable Fast Parallel SVM in Cloud Clusters for
Large Datasets Classification
Ghazanfar Latif (Gabe)
Part 1: Introduction of Cloud Computing
Part 2: Introduction of Support Vector Machine
Part 3: Problem Description
Part 4: Distributing SVM on Cloud Cluster Nodes
Part 5: Experimental Results & Conclusion
Amazon Cloud Services
Cloud servers range from a 1 GHz CPU with 613 MB RAM up to 110 GHz of CPU
with 68 GB RAM (6 Regions, 3 Zones).
Cloud Storage Service where we can upload up to 5000 TB of Data.
Virtual Private Cloud within the Cloud Servers or in between Cloud
Servers and our local machines.
Amazon CloudWatch/SNS
Resource utilization monitoring, with email or SMS notifications sent to users.
Support Vector Machine
• Support vector machines were originally proposed by
Boser, Guyon, and Vapnik in 1992 and gained increasing
popularity in the late 1990s.
• SVMs are supervised learning methods that analyze data and
recognize patterns; they are used for classification.
• SVMs are currently among the best performers for a number
of classification tasks ranging from text to genomic data.
• SVMs can be applied to complex data types (e.g., graphs, sequences,
relational data) by designing kernel functions for such data.
• Currently, SVMs are widely used in:
object detection & recognition
content-based image retrieval
DNA array expression data analysis
face expression recognition
sorting documents by topic
SVM: Basic Idea
• Find the hyper-plane that maximizes the margin.
• The perpendicular distance to the closest positive or
negative sample is called the margin.
• Tuning SVMs remains a black art: selecting a specific kernel
and its parameters is usually done in a try-and-see manner.
Which of the linear separators is optimal?
SVM: Basic Idea (continued)
Vectors on the margin are the support vectors, and the total margin is 2/‖w‖.
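The margin idea above can be checked numerically. A minimal sketch (not from the slides; toy data and parameters are illustrative) using scikit-learn's linear SVC: the fitted normal vector w gives the total margin 2/‖w‖, and the points lying on the margin are the support vectors.

```python
# Sketch: a linear SVM on toy 2-D data, recovering the support
# vectors and the total margin 2/||w|| from the fitted model.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two linearly separable Gaussian clusters (classes -1 and +1).
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(+2.0, 0.5, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]                    # normal vector of the hyper-plane
margin = 2.0 / np.linalg.norm(w)    # total margin = 2/||w||
print("support vectors:", len(clf.support_vectors_))
print("total margin 2/||w||:", margin)
```

Only the support vectors determine the hyper-plane; the remaining samples could be deleted without changing the solution, which is what makes the distributed schemes later in the slides possible.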
• Training and testing an SVM on large multidimensional datasets
requires substantial computing resources in terms of memory and
computational power.
• Purchasing high-performance computing hardware for training on
large datasets is very expensive.
• Researchers also face problems due to the limited computational
resources available at their institutions, and they must wait a
long time to get results.
CS Department, KFUPM (KSA).
• Cloud Computing is emerging today as a commercial
infrastructure that eliminates the need for maintaining
expensive computing hardware.
• We proposed a technique for running support vector
machines in parallel on distributed cloud cluster nodes, which
reduces the memory and computational power required per node.
• Our solution is auto-scalable and cost-effective in terms of
time and computational expenditure.
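The slides do not spell out the exact merging scheme, so the following is an illustrative sketch only, assuming a cascade-style combination (as in Graf et al.'s Cascade SVM, compared later): each node trains an SVM on its own data partition, forwards only its support vectors, and a final SVM is trained on their union. The 4 "nodes" are simulated locally here.

```python
# Sketch of cascade-style parallel SVM training (assumption: support
# vectors from per-node models are merged and refit; node count and
# data are illustrative, not the slides' actual experiment).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1.0, size=(2000, 2)),
               rng.normal(+2, 1.0, size=(2000, 2))])
y = np.array([0] * 2000 + [1] * 2000)
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]

n_nodes = 4
sv_X, sv_y = [], []
for part_X, part_y in zip(np.array_split(X, n_nodes),
                          np.array_split(y, n_nodes)):
    local = SVC(kernel="linear", C=1.0).fit(part_X, part_y)
    sv_X.append(part_X[local.support_])   # keep only this node's SVs
    sv_y.append(part_y[local.support_])

# Final pass: train on the (much smaller) union of support vectors.
final = SVC(kernel="linear", C=1.0).fit(np.vstack(sv_X),
                                        np.concatenate(sv_y))
print("training points in final pass:", sum(len(s) for s in sv_X))
```

Because each node sees only 1/4 of the data and the final pass sees only the merged support vectors, both the per-node memory footprint and the (super-linear) per-node training time shrink, which is the effect the slides attribute to the cloud-cluster deployment.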
• We used 4 nodes of Amazon EC2 HPC clusters, locally
interconnected via VPC, for testing our datasets in the cloud.
• EC2 Cluster Specifications
Memory: 23 GB Memory
CPU: 33.5 EC2 Compute Units (≈ 43.5 GHz)
Network Connectivity: 10 Gigabit Ethernet
Operating System: Linux
Tools: MATLAB, AWS Scripting in Java
• For testing our proposed solution, we used 8 datasets of different
sizes with 2, 4, or 8 features.
• To create the testing datasets, we used Cos-Exp, Gaussian, and
multi-class Gaussian distribution classes.
• We also tested our proposed solution on the publicly available LIBSVM
classification datasets at www.ntu.edu.tw.
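The slides name the generators (Cos-Exp, Gaussian, multi-class Gaussian) but not their formulas, so here is a minimal sketch of just the two-class Gaussian case, sized like the tests in the table below; the function name and parameters are illustrative assumptions.

```python
# Hypothetical generator for a two-class Gaussian test set, in the
# spirit of the Gaussian-distributed classes described in the slides.
import numpy as np

def gaussian_two_class(n_samples, n_features, sep=3.0, seed=0):
    """Two Gaussian classes whose means differ by `sep` on every axis."""
    rng = np.random.default_rng(seed)
    half = n_samples // 2
    X = np.vstack([rng.normal(0.0, 1.0, size=(half, n_features)),
                   rng.normal(sep, 1.0, size=(n_samples - half, n_features))])
    y = np.array([0] * half + [1] * (n_samples - half))
    return X, y

# e.g. Test 1 from the table: 2000 samples, 2 features
X, y = gaussian_two_class(2000, 2)
print(X.shape, y.shape)   # (2000, 2) (2000,)
```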
Test # Data Size # of Features
1 2000 2
2 5000 2
3 10000 2
4 16000 2
5 24000 2
6 4000 4
7 22400 4
8 59535 8
Single Node Test Results
Test #  Data Size  Features  PT        ISV    Accuracy (%)
1       2000       2         14.549    804    86.2
2       5000       2         89.35     1916   84.84
3       10000      2         982.68    3620   85.12
4       16000      2         21422.22  5715   84.84
5       24000      2         79195     8407   84.97
6       4000       4         388.5193  1815   90.375
7       22400      4         53052.36  8647   85.96
8       59535      8         83517     25074  96.797
PT = Processing Time; ISV = Identified Support Vectors
Comparison with Existing Techniques
I. An Intelligent System for Accelerating Parallel SVM Classification Problems on
Large Datasets Using GPU.
II. Parallel Support Vector Machines: The Cascade SVM.
III. Distributed Parallel Support Vector Machines in Strongly Connected Networks.
IV. A Fast Parallel Optimization for Training Support Vector Machine.
Type of Infrastructure              | Efficiency                     | Accuracy                       | Resources Cost
Amazon Cloud Clusters               | Up to 60%                      |                                | Pay only for what you use
GPU Clusters                        | Up to 80%                      |                                | GPU maintenance cost
Local (Cascade SVM method)          | Depends on the # of iterations | Depends on the # of iterations |
Local (Strongly Connected Networks) | Depends on the # of iterations | Depends on the # of iterations |
Local (Single Node)                 | Maximum time                   |                                |
• We showed that our proposed solution is very efficient in terms
of training time compared with the existing techniques, and that it
classifies the datasets correctly with a minimal error rate.
• Experiments on real-world and synthetic test datasets show that
the algorithm is scalable and robust.
• We will extend the performance evaluation results by
running similar experiments on other IaaS providers and
clouds, as well as on other real large-scale platforms such as
grids and commodity clusters.
 Florian Schatz, Sven Koschnicke, Niklas Paulsen, Christoph Starke, and Manfred Schimmler, “MPI
Performance Analysis of Amazon EC2 Cloud Services for High Performance Computing”, A. Abraham et al.
(Eds.): ACC 2011, Part I, CCIS 190, pp. 371–381, 2011. Springer-Verlag Berlin Heidelberg 2011.
 Simon Ostermann, Alexandru Iosup, Nezih Yigitbasi, Radu Prodan, Thomas Fahringer and Dick Epema, “A
Performance Analysis of EC2 Cloud Computing Services for Scientific Computing”, D. R. Avresky et al. (Eds.):
CloudComp 2009, LNICST 34, pp. 115–131, 2010. Institute for Computer Sciences, Social-Informatics and
Telecommunications Engineering 2010.
 Amazon Elastic Compute Cloud (Amazon EC2): http://aws.amazon.com/ec2/
 High Performance Computing (HPC) on AWS Clusters: http://aws.amazon.com/hpc-applications/
 G. Zanghirati and L. Zanni, “A parallel solver for large quadratic programs in training support vector
machines,” Parallel Comput., vol. 29, pp. 535–551, Nov. 2003.
 C. Caragea, D. Caragea, and V. Honavar, “Learning support vector machine classifiers from distributed data
sources,” in Proc. 20th Nat. Conf. Artif. Intell. Student Abstract Poster Program, Pittsburgh, PA, 2005, pp.
 A. Navia-Vazquez, D. Gutierrez-Gonzalez, E. Parrado-Hernandez, and J. Navarro-Abellan, “Distributed
support vector machines,” IEEE Trans. Neural Netw., vol. 17, no. 4, pp. 1091–1097, Jul. 2006.
 Yumao Lu, Vwani Roychowdhury, and Lieven Vandenberghe, “Distributed Parallel Support Vector Machines
in Strongly Connected Networks”, IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 7, JULY 2008.
 C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, software and datasets
available at http://www.csie.ntu.edu.tw/cjlin/libsvm.
 B. Catanzaro, N. Sundaram, and K. Keutzer, “Fast support vector machine training and classification on
graphics processors,” in ICML ’08: Proceedings of the 25th international conference on Machine learning.
New York, NY, USA: ACM, 2008, pp. 104–111.
 S. Herrero-Lopez, J. R. Williams, and A. Sanchez, “Parallel multiclass classification using svms on gpus,” in
GPGPU’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing
Units. New York, NY, USA: ACM, 2010, pp. 2–11.
 Cao, L., Keerthi, S., Ong, C.-J., Zhang, J., Periyathamby, U., Fu, X. J., & Lee, H. (2006). Parallel sequential
minimal optimization for the training of support vector machines. IEEE Transactions on Neural Networks,
 Graf, H. P., Cosatto, E., Bottou, L., Dourdanovic, I., & Vapnik, V. (2005). Parallel support vector machines:
The cascade svm. In L. K. Saul, Y. Weiss and L. Bottou (Eds.), Advances in neural information processing
systems 17, 521-528. Cambridge, MA: MIT Press.
 Wu, G., Chang, E., Chen, Y. K., & Hughes, C. (2006). Incremental approximate matrix factorization for
speeding up support vector machines. KDD '06: Proceedings of the 12th ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 760-766). New York, NY, USA: ACM Press.
 Zanni, L., Serafini, T., & Zanghirati, G. (2006). Parallel software for training large scale support vector
machines on multiprocessor systems. J. Mach. Learn. Res., 7, 1467-1492.
 Qi Li, Raied Salman, Vojislav Kecman, “An Intelligent System for Accelerating Parallel SVM Classification
Problems on Large Datasets Using GPU”, 2010 10th International Conference on Intelligent Systems Design