SlideShare a Scribd company logo
PROPOSAL 6: [ML] [CEP] PREDICTIVE ANALYTICS WITH ONLINE DATA FOR
WSO2 MACHINE LEARNER (ML)
(dananjayamahesh@gmail.com)
I am Mahesh Dananjaya, student from university of Moratuwa, the leading technical university
of the country. I am majoring electronic and telecommunication engineering and I have been
very enthusiastic to machine learning and big data analysis and also tried to implement more
optimized and high performance platforms to implement the networking, telecommunication,
signal processing and information processing.
I have done lots of projects related to machine intelligence and information processing
and data analysis. Some of my projects include development of SOPC (System on
Programmable Chip) for real time speech recognition on reconfigurable hardware, Artificial
neural network implementation for pattern recognition in speech signals, intrusion detection
application based on cluster analysis and several other projects in SVM, Regressions,
Discriminant analysis (DA) with Fisher discriminant and ML, factor analysis for real world
applications, principle component analysis (PCA) to reduce dimensionality in network
information for information security analysis. And I have been engaging with the Big Data
analysis research with the Dr. Prathapasinghe Dharmawansa from Stanford University
(Currently in University of Moratuwa). My area is to combine IOT, Information Processing
and machine learning areas. I have been specializing multivariate analysis in my mathematics
area.
1 INTRODUCTION
Machine learning and Big Data analysis is becoming a next breakthrough of
information processing and data analysis. Many organizations carried out many researches to
bring the big data analytics a reality and to build up some commercial level applications.
Nowadays there are many machine learning algorithms are there to come up with various
solutions to build up some intelligent information analysis to usual data flow. Most of the
network data is passing through high speed network lines as streams. Therefore the today’s
trend is to extend prevailing data analysis algorithms to support streaming and to provide the
same level of predictive data analytics to stream data also. Company like WSO2 is trying to
use those Big Data technologies and Machine Learning algorithms with their current product
such as Data Analysis servers (DAS) and CEP (Complex Event Processors) with Siddhi
processor by supporting streaming of data. The use of machine learning algorithms for data
analytics with the stream data support is an emerging technology.
2 PROBLEM STATEMENT
Even though WSO2 ML (Machine Learner) can be used to analyses cluster of data as
well as standalone data and it is still not supporting stream of data. Therefore it cannot be
directly used with the other products such as CEP (Complex Event Processor). And also WSO2
ML (Machine Learner) is using Apache Spark MLLib for their machine learning algorithms.
Spark MLLib is supporting its machine learning algorithms for streaming k-means and other
generalized linear models with the spark streaming by breaking of streaming data into mini
batches. But still that technology is supported only with the Scala API. Therefore it cannot be
directly used with the WS02 machine learning package.
3 PPROPOSAL
Proposed solution is to develop Java API for WSO2 ML with streaming data support and with
spark MLLib k-mean and generalized linear models algorithms. This can be done with the
Spark MLLib k-mean clustering and GLM (Generalized Linear Model) algorithms and by
periodically retraining the model with the newly arriving sequential data. This will be
implemented as a Siddhi extension that can operate directly on incoming streams that can be
used to perform a predictive analysis of data/event streams coming from the WS02 CEP
(Complex Event Processor). Thus we can use this for real time streamline data analysis by
providing better predictions.
4 METHODOLOGY
What we are going to do basically is that we can uses the same Apache Spark MLLib for
machine learning algorithms which we use for standalone data to analyze them and periodically
retrain the model with upcoming data come as data streams from several applications such as
WS02 CEP (Complex Event Processor)’s event streams. Therefore in this API I have to
develop the same behavior inside apache spark which we have to get mini-batches of data from
data coming as streams periodically by breaking of streaming data into mini batches and retrain
the model. We use mini-batch learning in my incremental learning approach since these
algorithms operates as stochastic gradient descents so that any learning with new data can be
done on top of the previously learned models.
Then we have to save the previous model parameters and re-train the model again with the
next set of mini-batch of data. Therefore in the process we should have to follow some basic
steps. First we need to break the streams into several pieces of data sets which called the mini-
batches and train a model repeatedly with those mini-batches. After each training run, we can
save the model information (such as weights, intercepts for regression and cluster centers for
k-mean clustering). As WS02 proposal page shows we use the first approach and try to
implement incremental learning algorithms based on the mini batches of data taken from the
data streams.
4.1 Mini-Batches from Streaming Data
First task is to taking pieces of data bundles out of streaming data which is known as
mini-batches to retrain model. Initially we looked into how spark streaming taking the mini
batches from streaming data. Basically Spark Streaming use the modified version of the
Lambda Architecture, known as the Kappa Architecture. In this architecture the output of a
fast, and often imprecise, streaming layer is joined with the output of a slow, but precise, batch
layer. Typically this takes the form of a series of scheduled batch processes that replace the
output of a continually running stream processor with a more accurate equivalent. This
provides a guarantee that any inaccurate data introduced by the stream processor will be
replaced by accurate data within a small window. In Spark Mini-Batch process, it defines
several other important parameters to coordinate the mini-batches. They are Zero Time and
Batch interval. The batch interval is the interval at which events that have been collected are
processed. The zero time is defined as the same time throughout the entire distributed system,
allowing Spark Streaming to use it to short-circuit computation of outputs from before inputs
were available.
Spark MLLib can then it its machine learning algorithms to with those mini batches. This image
shows how the spark mini batch sampling is happening. With the help of these methods, we
can train models again with newly arriving data, keeping the characteristics learned with the
previous data.
In the reference research paper of [1] in the references shows that how the mini batches and the
mini batch size b with the convergence time T can be used for the Stochastic Gradient Descent
(SGD) technique which can be effectively used for large scale optimization problems including
predictive data analysis. Mini batch training needs to be employed to reduce the
communication cost. However, an increase in mini batch size typically decreases the rate of
convergence. Since we have to use spark MLLib to get the retraining model and we have the
previous models saved in the program we can go for the SGD based optimization techniques
to find the incremental model for any learning algorithms, initially k-mean clustering and
Generalized Linear Regression models (GLM).
Therefore initially what we have to do is separate streaming data into set of data bundles
called mini batch with the appropriate mini batch size b which will be a paramount important
parameters to avoid
4.2 Running the Spark ML Algorithms and Save the Model parameters with Streaming
Data.
For each mini batch we can use spark MLLib to run the relevant machine learning algorithm,
in our case it is k mean clustering and generalized linear regression model (GLM). We have to
run the algorithm and save the model parameters and when the next batch comes run the
algorithm and get the new parameters and then we can combine the models and build the new
model with this incremental algorithms.
If we take the K mean clustering algorithm that is implemented on the spark streaming is
elaborated below.
4.3 Periodically Train the models with the previous models.
For the periodically retraining of the model can be achieved by several ways. As the [1]
research paper suggest and most of the people are using we can use mini-batch training with
some stochastic gradient descent optimization techniques. So we can train the model
periodically with mini batches.
Mini Batch Generalized Linear Regression
Especially when we comes to the Generalized Linear Regression we can use the stochastic
gradient descent techniques to find the new model based on the past models also. As the spark
streaming algorithms do referring to reference [5], we can use stochastic gradient descent
methods for incremental learning model with GLS. We need to keenly look into Regularization
of the parameters and along with various parameters associated with stochastic gradient descent
such as Step Size, Number of Iterations and Mini batch fraction to control some issues
regarding data horizon (how quickly a most recent data point becomes a part of the model) and
data obsolescence (how long does it take a past data point to become irrelevant to the model.
Description of how the algorithms run with the references [8], [9], [10] and [11]. There
those references describes the how it works and how we can use it with mini batch learning
algorithms. Reference [8] shows clearly and simply how the mini batch size can be selected
and how it can be parameterized b (B- batch size).
Mini Batch K Mean Clustering
When we comes to mini batch k mean clustering we can use the model parameters such
as cluster center to periodically retrain the model. Referring to [4] spark streaming algorithm
and Scala API we can write as following.
we may want to estimate clusters dynamically, updating them as new data arrive we
need to develop the API to support streaming k mean clustering with mini batches, with
parameters to control the decay (or “forgetfulness”) of the estimates. The algorithm uses a
generalization of the mini-batch k-means update rule. For each batch of data, we assign all
points to their nearest cluster, compute new cluster centers, then update each cluster using:
𝐶𝑡+1 =
𝐶𝑡 𝑛 𝑡 𝛼 + 𝑋𝑡 𝑚 𝑡
𝑛 𝑡 𝛼 + 𝑚 𝑡
Where, 𝑛 𝑡+1 = 𝑛 𝑡 + 𝑚 𝑡
Where 𝐶𝑡the previous center for the cluster is, 𝑛 𝑡is the number of points assigned to the cluster
thus far, 𝑥𝑡 is the new cluster center from the current batch, and 𝑚 𝑡 is the number of points
added to the cluster in the current batch. The decay factor α can be used to ignore the past: with
α=1 all data will be used from the beginning; with α=0 only the most recent data will be used.
This is analogous to an exponentially-weighted moving average. These decaying factors make
more control upon data horizon and data obsolesces to control the mini batch process.
5 ARCHITECTURE OVERVIEW
Streaming data in accepted inside the core. Those streams may be coming from the WSO2 CEP
(Complex Event Processor), data streams coming out from the Siddhi processor, or else it can
be another streamline of data. Then the stream of data I break into mini batches and passes into
the API core where the algorithms run and store the models derived with the past models.
Apache Spark is used to run the machine learning algorithms that we need to run for each and
every mini batch of data.
Receiver
API Core
Apache Spark MLLib
Past
ModelsSTREAMS MINI-BATCH
6 Time Line of the Project
April 22- May 22 Community bonding period - Getting familiar with the platform and
discussing implementation methods.
May 23 – June 18 Implementing streaming generalized linear regression and test for
streams
June 19 - June 21 Implementing a part of streaming k-means
June 21-June 28 Midterm Evaluation
June 28-July 17 Implementing the other part of the streaming k means clustering and test
and test streaming GLR further with streams.
July 18-August 12 Integrating The Java API with the CEP Siddhi processor streams and
test with the CEP event streams and other streams.
August 13– August
16
Final Documentation of the project
August 16-August 24 Submit Codes and documentations and evaluations
7 References
[1] https://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf
[2] https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
[3]http://crediblecode.net/z-featureditems/featured-1/sparking-insights
[4]http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means
[5]http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
[6] https://papers.nips.cc/paper/4432-better-mini-batch-algorithms-via-accelerated-gradient-
methods.pdf
[7] https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
[8] http://stats.stackexchange.com/questions/140811/how-large-should-the-batch-size-be-for-
stochastic-gradient-descent
[9] http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf
[10] http://arxiv.org/pdf/1106.4574.pdf
[11] http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

More Related Content

What's hot

System on Chip Design and Modelling Dr. David J Greaves
System on Chip Design and Modelling   Dr. David J GreavesSystem on Chip Design and Modelling   Dr. David J Greaves
System on Chip Design and Modelling Dr. David J Greaves
Satya Harish
 
VLSI GDI Technology
VLSI GDI TechnologyVLSI GDI Technology
VLSI GDI Technology
Techno Electronics
 
Maquina estado
Maquina estadoMaquina estado
Maquina estado
Cesar Gil Arrieta
 
Asic design flow
Asic design flowAsic design flow
Asic design flow
yogeshwaran k
 
Vlsi physical design
Vlsi physical designVlsi physical design
Vlsi physical designI World Tech
 
Floor planning
Floor planningFloor planning
Floor planning
shaik sharief
 
CAD: Layout Extraction
CAD: Layout ExtractionCAD: Layout Extraction
CAD: Layout Extraction
Team-VLSI-ITMU
 
Design challenges in physical design
Design challenges in physical designDesign challenges in physical design
Design challenges in physical designDeiptii Das
 
Floorplanning in physical design
Floorplanning in physical designFloorplanning in physical design
Floorplanning in physical design
Murali Rai
 
Floor planning ppt
Floor planning pptFloor planning ppt
Floor planning ppt
Thrinadh Komatipalli
 
Floor plan & Power Plan
Floor plan & Power Plan Floor plan & Power Plan
Floor plan & Power Plan
Prathyusha Madapalli
 
DSP Processors versus ASICs
DSP Processors versus ASICsDSP Processors versus ASICs
DSP Processors versus ASICs
Sudhanshu Janwadkar
 
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
ijceronline
 
Silicon to software share
Silicon to software shareSilicon to software share
Silicon to software share
Narendra Patel
 
Design and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpgaDesign and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpga
VLSICS Design
 
System partitioning in VLSI and its considerations
System partitioning in VLSI and its considerationsSystem partitioning in VLSI and its considerations
System partitioning in VLSI and its considerationsSubash John
 
Design and Implementation of Single Precision Pipelined Floating Point Co-Pro...
Design and Implementation of Single Precision Pipelined Floating Point Co-Pro...Design and Implementation of Single Precision Pipelined Floating Point Co-Pro...
Design and Implementation of Single Precision Pipelined Floating Point Co-Pro...
Silicon Mentor
 
Vlsi design flow
Vlsi design flowVlsi design flow
Vlsi design flow
Rajendra Kumar
 
Placement and routing in full custom physical design
Placement and routing in full custom physical designPlacement and routing in full custom physical design
Placement and routing in full custom physical designDeiptii Das
 

What's hot (19)

System on Chip Design and Modelling Dr. David J Greaves
System on Chip Design and Modelling   Dr. David J GreavesSystem on Chip Design and Modelling   Dr. David J Greaves
System on Chip Design and Modelling Dr. David J Greaves
 
VLSI GDI Technology
VLSI GDI TechnologyVLSI GDI Technology
VLSI GDI Technology
 
Maquina estado
Maquina estadoMaquina estado
Maquina estado
 
Asic design flow
Asic design flowAsic design flow
Asic design flow
 
Vlsi physical design
Vlsi physical designVlsi physical design
Vlsi physical design
 
Floor planning
Floor planningFloor planning
Floor planning
 
CAD: Layout Extraction
CAD: Layout ExtractionCAD: Layout Extraction
CAD: Layout Extraction
 
Design challenges in physical design
Design challenges in physical designDesign challenges in physical design
Design challenges in physical design
 
Floorplanning in physical design
Floorplanning in physical designFloorplanning in physical design
Floorplanning in physical design
 
Floor planning ppt
Floor planning pptFloor planning ppt
Floor planning ppt
 
Floor plan & Power Plan
Floor plan & Power Plan Floor plan & Power Plan
Floor plan & Power Plan
 
DSP Processors versus ASICs
DSP Processors versus ASICsDSP Processors versus ASICs
DSP Processors versus ASICs
 
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
 
Silicon to software share
Silicon to software shareSilicon to software share
Silicon to software share
 
Design and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpgaDesign and implementation of complex floating point processor using fpga
Design and implementation of complex floating point processor using fpga
 
System partitioning in VLSI and its considerations
System partitioning in VLSI and its considerationsSystem partitioning in VLSI and its considerations
System partitioning in VLSI and its considerations
 
Design and Implementation of Single Precision Pipelined Floating Point Co-Pro...
Design and Implementation of Single Precision Pipelined Floating Point Co-Pro...Design and Implementation of Single Precision Pipelined Floating Point Co-Pro...
Design and Implementation of Single Precision Pipelined Floating Point Co-Pro...
 
Vlsi design flow
Vlsi design flowVlsi design flow
Vlsi design flow
 
Placement and routing in full custom physical design
Placement and routing in full custom physical designPlacement and routing in full custom physical design
Placement and routing in full custom physical design
 

Viewers also liked

High Performance Flow Matching Architecture for Openflow Data Plane
High Performance Flow Matching Architecture for Openflow Data PlaneHigh Performance Flow Matching Architecture for Openflow Data Plane
High Performance Flow Matching Architecture for Openflow Data Plane
Mahesh Dananjaya
 
Gsoc 2016-iit-snk-v1.0
Gsoc 2016-iit-snk-v1.0Gsoc 2016-iit-snk-v1.0
Gsoc 2016-iit-snk-v1.0
Suranga Nath Kasthurirathne
 
Clock Gating
Clock GatingClock Gating
Clock Gating
Mahesh Dananjaya
 
Low Power VLSI Design
Low Power VLSI DesignLow Power VLSI Design
Low Power VLSI Design
Mahesh Dananjaya
 
Why you need a Partner led installed base program
Why you need a Partner led installed base programWhy you need a Partner led installed base program
Why you need a Partner led installed base program
Riverview B2B
 
When Your Internet Things Know How You Feel
When Your Internet Things Know How You FeelWhen Your Internet Things Know How You Feel
When Your Internet Things Know How You Feel
Pamela Pavliscak
 
How Instagram Onboards New Users
How Instagram Onboards New UsersHow Instagram Onboards New Users
How Instagram Onboards New Users
Geoffrey Dorne
 
The Linguistic Secrets Found in Billions of Emoji - SXSW 2016 presentation
The Linguistic Secrets Found in Billions of Emoji - SXSW 2016 presentation The Linguistic Secrets Found in Billions of Emoji - SXSW 2016 presentation
The Linguistic Secrets Found in Billions of Emoji - SXSW 2016 presentation
SwiftKey
 

Viewers also liked (9)

High Performance Flow Matching Architecture for Openflow Data Plane
High Performance Flow Matching Architecture for Openflow Data PlaneHigh Performance Flow Matching Architecture for Openflow Data Plane
High Performance Flow Matching Architecture for Openflow Data Plane
 
Gsoc 2016-iit-snk-v1.0
Gsoc 2016-iit-snk-v1.0Gsoc 2016-iit-snk-v1.0
Gsoc 2016-iit-snk-v1.0
 
Study of vlsi design methodologies and limitations using cad tools for cmos t...
Study of vlsi design methodologies and limitations using cad tools for cmos t...Study of vlsi design methodologies and limitations using cad tools for cmos t...
Study of vlsi design methodologies and limitations using cad tools for cmos t...
 
Clock Gating
Clock GatingClock Gating
Clock Gating
 
Low Power VLSI Design
Low Power VLSI DesignLow Power VLSI Design
Low Power VLSI Design
 
Why you need a Partner led installed base program
Why you need a Partner led installed base programWhy you need a Partner led installed base program
Why you need a Partner led installed base program
 
When Your Internet Things Know How You Feel
When Your Internet Things Know How You FeelWhen Your Internet Things Know How You Feel
When Your Internet Things Know How You Feel
 
How Instagram Onboards New Users
How Instagram Onboards New UsersHow Instagram Onboards New Users
How Instagram Onboards New Users
 
The Linguistic Secrets Found in Billions of Emoji - SXSW 2016 presentation
The Linguistic Secrets Found in Billions of Emoji - SXSW 2016 presentation The Linguistic Secrets Found in Billions of Emoji - SXSW 2016 presentation
The Linguistic Secrets Found in Billions of Emoji - SXSW 2016 presentation
 

Similar to Proposal for google summe of code 2016

ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
Sigmoid
 
E05312426
E05312426E05312426
E05312426
IOSR-JEN
 
mapReduce for machine learning
mapReduce for machine learning mapReduce for machine learning
mapReduce for machine learning
Pranya Prabhakar
 
Real timeeventmonitoringsystem(1)
Real timeeventmonitoringsystem(1)Real timeeventmonitoringsystem(1)
Real timeeventmonitoringsystem(1)
Atyam Sriharsha
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Osman Ali
 
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
SBGC
 
Poster
PosterPoster
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_LibrarySaurabh Chauhan
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
Adam Doyle
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
acijjournal
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
Robert Grossman
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
dbpublications
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
Rusif Eyvazli
 
Serverless Machine Learning
Serverless Machine LearningServerless Machine Learning
Serverless Machine Learning
Asavari Tayal
 
Project PlanFor our Project Plan, we are going to develop.docx
Project PlanFor our Project Plan, we are going to develop.docxProject PlanFor our Project Plan, we are going to develop.docx
Project PlanFor our Project Plan, we are going to develop.docx
wkyra78
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
Arpitha Gurumurthy
 
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
ijcsa
 
Experimenting With Big Data
Experimenting With Big DataExperimenting With Big Data
Experimenting With Big Data
Nick Boucart
 
Proposal with sdlc
Proposal with sdlcProposal with sdlc
Proposal with sdlc
Kamau Francis
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
zekeLabs Technologies
 

Similar to Proposal for google summe of code 2016 (20)

ML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time SeriesML on Big Data: Real-Time Analysis on Time Series
ML on Big Data: Real-Time Analysis on Time Series
 
E05312426
E05312426E05312426
E05312426
 
mapReduce for machine learning
mapReduce for machine learning mapReduce for machine learning
mapReduce for machine learning
 
Real timeeventmonitoringsystem(1)
Real timeeventmonitoringsystem(1)Real timeeventmonitoringsystem(1)
Real timeeventmonitoringsystem(1)
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
 
Poster
PosterPoster
Poster
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_Library
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
MAP/REDUCE DESIGN AND IMPLEMENTATION OF APRIORIALGORITHM FOR HANDLING VOLUMIN...
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
 
Web Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using HadoopWeb Oriented FIM for large scale dataset using Hadoop
Web Oriented FIM for large scale dataset using Hadoop
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
Serverless Machine Learning
Serverless Machine LearningServerless Machine Learning
Serverless Machine Learning
 
Project PlanFor our Project Plan, we are going to develop.docx
Project PlanFor our Project Plan, we are going to develop.docxProject PlanFor our Project Plan, we are going to develop.docx
Project PlanFor our Project Plan, we are going to develop.docx
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
 
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
EMPIRICAL APPLICATION OF SIMULATED ANNEALING USING OBJECT-ORIENTED METRICS TO...
 
Experimenting With Big Data
Experimenting With Big DataExperimenting With Big Data
Experimenting With Big Data
 
Proposal with sdlc
Proposal with sdlcProposal with sdlc
Proposal with sdlc
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 

More from Mahesh Dananjaya

OpenFlow Aware Network Processor
OpenFlow Aware Network ProcessorOpenFlow Aware Network Processor
OpenFlow Aware Network Processor
Mahesh Dananjaya
 
Digital Integrated Circuit (IC) Design
Digital Integrated Circuit (IC) DesignDigital Integrated Circuit (IC) Design
Digital Integrated Circuit (IC) Design
Mahesh Dananjaya
 
Image segmentation using normalized graph cut
Image segmentation using normalized graph cutImage segmentation using normalized graph cut
Image segmentation using normalized graph cut
Mahesh Dananjaya
 
Low power electronic design
Low power electronic designLow power electronic design
Low power electronic design
Mahesh Dananjaya
 
Low power electronic design
Low power electronic designLow power electronic design
Low power electronic design
Mahesh Dananjaya
 
Low Power VLSI Designs
Low Power VLSI DesignsLow Power VLSI Designs
Low Power VLSI Designs
Mahesh Dananjaya
 
VLSI Power Reduction
VLSI Power ReductionVLSI Power Reduction
VLSI Power Reduction
Mahesh Dananjaya
 
Vlsi power estimation
Vlsi power estimationVlsi power estimation
Vlsi power estimation
Mahesh Dananjaya
 
Power Gating
Power GatingPower Gating
Power Gating
Mahesh Dananjaya
 
SoC Power Reduction
SoC Power ReductionSoC Power Reduction
SoC Power Reduction
Mahesh Dananjaya
 
SOC Power Estimation
SOC Power EstimationSOC Power Estimation
SOC Power Estimation
Mahesh Dananjaya
 
VLSI Power in a Nutshell
VLSI Power in a NutshellVLSI Power in a Nutshell
VLSI Power in a Nutshell
Mahesh Dananjaya
 

More from Mahesh Dananjaya (12)

OpenFlow Aware Network Processor
OpenFlow Aware Network ProcessorOpenFlow Aware Network Processor
OpenFlow Aware Network Processor
 
Digital Integrated Circuit (IC) Design
Digital Integrated Circuit (IC) DesignDigital Integrated Circuit (IC) Design
Digital Integrated Circuit (IC) Design
 
Image segmentation using normalized graph cut
Image segmentation using normalized graph cutImage segmentation using normalized graph cut
Image segmentation using normalized graph cut
 
Low power electronic design
Low power electronic designLow power electronic design
Low power electronic design
 
Low power electronic design
Low power electronic designLow power electronic design
Low power electronic design
 
Low Power VLSI Designs
Low Power VLSI DesignsLow Power VLSI Designs
Low Power VLSI Designs
 
VLSI Power Reduction
VLSI Power ReductionVLSI Power Reduction
VLSI Power Reduction
 
Vlsi power estimation
Vlsi power estimationVlsi power estimation
Vlsi power estimation
 
Power Gating
Power GatingPower Gating
Power Gating
 
SoC Power Reduction
SoC Power ReductionSoC Power Reduction
SoC Power Reduction
 
SOC Power Estimation
SOC Power EstimationSOC Power Estimation
SOC Power Estimation
 
VLSI Power in a Nutshell
VLSI Power in a NutshellVLSI Power in a Nutshell
VLSI Power in a Nutshell
 

Recently uploaded

Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
symbo111
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
aqil azizi
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
WENKENLI1
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
heavyhaig
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 

Recently uploaded (20)

Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
Building Electrical System Design & Installation
Building Electrical System Design & InstallationBuilding Electrical System Design & Installation
Building Electrical System Design & Installation
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Technical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prismsTechnical Drawings introduction to drawing of prisms
Technical Drawings introduction to drawing of prisms
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 

Proposal for google summe of code 2016

  • 1. PROPOSAL 6: [ML] [CEP] PREDICTIVE ANALYTICS WITH ONLINE DATA FOR WSO2 MACHINE LEARNER (ML) (dananjayamahesh@gmail.com) I am Mahesh Dananjaya, student from university of Moratuwa, the leading technical university of the country. I am majoring electronic and telecommunication engineering and I have been very enthusiastic to machine learning and big data analysis and also tried to implement more optimized and high performance platforms to implement the networking, telecommunication, signal processing and information processing. I have done lots of projects related to machine intelligence and information processing and data analysis. Some of my projects include development of SOPC (System on Programmable Chip) for real time speech recognition on reconfigurable hardware, Artificial neural network implementation for pattern recognition in speech signals, intrusion detection application based on cluster analysis and several other projects in SVM, Regressions, Discriminant analysis (DA) with Fisher discriminant and ML, factor analysis for real world applications, principle component analysis (PCA) to reduce dimensionality in network information for information security analysis. And I have been engaging with the Big Data analysis research with the Dr. Prathapasinghe Dharmawansa from Stanford University (Currently in University of Moratuwa). My area is to combine IOT, Information Processing and machine learning areas. I have been specializing multivariate analysis in my mathematics area. 1 INTRODUCTION Machine learning and Big Data analysis is becoming a next breakthrough of information processing and data analysis. Many organizations carried out many researches to bring the big data analytics a reality and to build up some commercial level applications. Nowadays there are many machine learning algorithms are there to come up with various solutions to build up some intelligent information analysis to usual data flow. Most of the network data is passing through high speed network lines as streams. Therefore the today’s trend is to extend prevailing data analysis algorithms to support streaming and to provide the same level of predictive data analytics to stream data also. Company like WSO2 is trying to use those Big Data technologies and Machine Learning algorithms with their current product such as Data Analysis servers (DAS) and CEP (Complex Event Processors) with Siddhi
  • 2. processor by supporting streaming of data. The use of machine learning algorithms for data analytics with the stream data support is an emerging technology. 2 PROBLEM STATEMENT Even though WSO2 ML (Machine Learner) can be used to analyses cluster of data as well as standalone data and it is still not supporting stream of data. Therefore it cannot be directly used with the other products such as CEP (Complex Event Processor). And also WSO2 ML (Machine Learner) is using Apache Spark MLLib for their machine learning algorithms. Spark MLLib is supporting its machine learning algorithms for streaming k-means and other generalized linear models with the spark streaming by breaking of streaming data into mini batches. But still that technology is supported only with the Scala API. Therefore it cannot be directly used with the WS02 machine learning package. 3 PPROPOSAL Proposed solution is to develop Java API for WSO2 ML with streaming data support and with spark MLLib k-mean and generalized linear models algorithms. This can be done with the Spark MLLib k-mean clustering and GLM (Generalized Linear Model) algorithms and by periodically retraining the model with the newly arriving sequential data. This will be implemented as a Siddhi extension that can operate directly on incoming streams that can be used to perform a predictive analysis of data/event streams coming from the WS02 CEP (Complex Event Processor). Thus we can use this for real time streamline data analysis by providing better predictions. 4 METHODOLOGY What we are going to do basically is that we can uses the same Apache Spark MLLib for machine learning algorithms which we use for standalone data to analyze them and periodically retrain the model with upcoming data come as data streams from several applications such as WS02 CEP (Complex Event Processor)’s event streams. Therefore in this API I have to develop the same behavior inside apache spark which we have to get mini-batches of data from data coming as streams periodically by breaking of streaming data into mini batches and retrain the model. We use mini-batch learning in my incremental learning approach since these algorithms operates as stochastic gradient descents so that any learning with new data can be done on top of the previously learned models.
  • 3. Then we have to save the previous model parameters and re-train the model again with the next set of mini-batch of data. Therefore in the process we should have to follow some basic steps. First we need to break the streams into several pieces of data sets which called the mini- batches and train a model repeatedly with those mini-batches. After each training run, we can save the model information (such as weights, intercepts for regression and cluster centers for k-mean clustering). As WS02 proposal page shows we use the first approach and try to implement incremental learning algorithms based on the mini batches of data taken from the data streams. 4.1 Mini-Batches from Streaming Data First task is to taking pieces of data bundles out of streaming data which is known as mini-batches to retrain model. Initially we looked into how spark streaming taking the mini batches from streaming data. Basically Spark Streaming use the modified version of the Lambda Architecture, known as the Kappa Architecture. In this architecture the output of a fast, and often imprecise, streaming layer is joined with the output of a slow, but precise, batch layer. Typically this takes the form of a series of scheduled batch processes that replace the output of a continually running stream processor with a more accurate equivalent. This provides a guarantee that any inaccurate data introduced by the stream processor will be replaced by accurate data within a small window. In Spark Mini-Batch process, it defines several other important parameters to coordinate the mini-batches. They are Zero Time and Batch interval. The batch interval is the interval at which events that have been collected are processed. The zero time is defined as the same time throughout the entire distributed system, allowing Spark Streaming to use it to short-circuit computation of outputs from before inputs were available.
  • 4. Spark MLLib can then it its machine learning algorithms to with those mini batches. This image shows how the spark mini batch sampling is happening. With the help of these methods, we can train models again with newly arriving data, keeping the characteristics learned with the previous data. In the reference research paper of [1] in the references shows that how the mini batches and the mini batch size b with the convergence time T can be used for the Stochastic Gradient Descent (SGD) technique which can be effectively used for large scale optimization problems including predictive data analysis. Mini batch training needs to be employed to reduce the communication cost. However, an increase in mini batch size typically decreases the rate of convergence. Since we have to use spark MLLib to get the retraining model and we have the previous models saved in the program we can go for the SGD based optimization techniques to find the incremental model for any learning algorithms, initially k-mean clustering and Generalized Linear Regression models (GLM). Therefore initially what we have to do is separate streaming data into set of data bundles called mini batch with the appropriate mini batch size b which will be a paramount important parameters to avoid 4.2 Running the Spark ML Algorithms and Save the Model parameters with Streaming Data. For each mini batch we can use spark MLLib to run the relevant machine learning algorithm, in our case it is k mean clustering and generalized linear regression model (GLM). We have to run the algorithm and save the model parameters and when the next batch comes run the
  • 5. algorithm and get the new parameters and then we can combine the models and build the new model with this incremental algorithms. If we take the K mean clustering algorithm that is implemented on the spark streaming is elaborated below. 4.3 Periodically Train the models with the previous models. For the periodically retraining of the model can be achieved by several ways. As the [1] research paper suggest and most of the people are using we can use mini-batch training with some stochastic gradient descent optimization techniques. So we can train the model periodically with mini batches. Mini Batch Generalized Linear Regression Especially when we comes to the Generalized Linear Regression we can use the stochastic gradient descent techniques to find the new model based on the past models also. As the spark streaming algorithms do referring to reference [5], we can use stochastic gradient descent methods for incremental learning model with GLS. We need to keenly look into Regularization of the parameters and along with various parameters associated with stochastic gradient descent such as Step Size, Number of Iterations and Mini batch fraction to control some issues regarding data horizon (how quickly a most recent data point becomes a part of the model) and data obsolescence (how long does it take a past data point to become irrelevant to the model. Description of how the algorithms run with the references [8], [9], [10] and [11]. There those references describes the how it works and how we can use it with mini batch learning algorithms. Reference [8] shows clearly and simply how the mini batch size can be selected and how it can be parameterized b (B- batch size). Mini Batch K Mean Clustering When we comes to mini batch k mean clustering we can use the model parameters such as cluster center to periodically retrain the model. Referring to [4] spark streaming algorithm and Scala API we can write as following. we may want to estimate clusters dynamically, updating them as new data arrive we need to develop the API to support streaming k mean clustering with mini batches, with parameters to control the decay (or “forgetfulness”) of the estimates. The algorithm uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign all points to their nearest cluster, compute new cluster centers, then update each cluster using:
  • 6. 𝐶𝑡+1 = 𝐶𝑡 𝑛 𝑡 𝛼 + 𝑋𝑡 𝑚 𝑡 𝑛 𝑡 𝛼 + 𝑚 𝑡 Where, 𝑛 𝑡+1 = 𝑛 𝑡 + 𝑚 𝑡 Where 𝐶𝑡the previous center for the cluster is, 𝑛 𝑡is the number of points assigned to the cluster thus far, 𝑥𝑡 is the new cluster center from the current batch, and 𝑚 𝑡 is the number of points added to the cluster in the current batch. The decay factor α can be used to ignore the past: with α=1 all data will be used from the beginning; with α=0 only the most recent data will be used. This is analogous to an exponentially-weighted moving average. These decaying factors make more control upon data horizon and data obsolesces to control the mini batch process. 5 ARCHITECTURE OVERVIEW Streaming data in accepted inside the core. Those streams may be coming from the WSO2 CEP (Complex Event Processor), data streams coming out from the Siddhi processor, or else it can be another streamline of data. Then the stream of data I break into mini batches and passes into the API core where the algorithms run and store the models derived with the past models. Apache Spark is used to run the machine learning algorithms that we need to run for each and every mini batch of data. Receiver API Core Apache Spark MLLib Past ModelsSTREAMS MINI-BATCH
  • 7. 6 Time Line of the Project April 22- May 22 Community bonding period - Getting familiar with the platform and discussing implementation methods. May 23 – June 18 Implementing streaming generalized linear regression and test for streams June 19 - June 21 Implementing a part of streaming k-means June 21-June 28 Midterm Evaluation June 28-July 17 Implementing the other part of the streaming k means clustering and test and test streaming GLR further with streams. July 18-August 12 Integrating The Java API with the CEP Siddhi processor streams and test with the CEP event streams and other streams. August 13– August 16 Final Documentation of the project August 16-August 24 Submit Codes and documentations and evaluations 7 References [1] https://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf [2] https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html [3]http://crediblecode.net/z-featureditems/featured-1/sparking-insights [4]http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means [5]http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression [6] https://papers.nips.cc/paper/4432-better-mini-batch-algorithms-via-accelerated-gradient- methods.pdf [7] https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html [8] http://stats.stackexchange.com/questions/140811/how-large-should-the-batch-size-be-for- stochastic-gradient-descent [9] http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf [10] http://arxiv.org/pdf/1106.4574.pdf [11] http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf