MEMBERS:
Dheeraj Pachauri(1809113042)
Himanshu Bharti(1809113052)
Shahnawaz Khan(1900910139007)
Abhay Kumar Mishra(1900910139001)
• Clustering
• Data Stream
• Stream Clustering
• Requirements for clustering algorithms
• Stream clustering steps & algorithms
• Prototype array
• Window models
• Outliers & their detection
• Applications of clustering
• Clustering is a method of identifying groups of similar data in a data set.
• Entities in each group are more similar to the other entities of that group than to those of other groups.
• Some methods include K-means, K-medoids, DBSCAN etc.
• STREAM: Data that arrives continuously, such as Google queries, telephone records, multimedia data, financial transactions etc.
• It is not feasible to store such data in a database, & data can be lost if it is not processed immediately.
• DATA STREAM: A continuous, massive, unbounded sequence of data objects that are generated at a rapid rate.
• The problem of data stream clustering is defined as:
  Input: a sequence of n points in a metric space & an integer k.
  Output: k centers chosen from the set of n points so as to minimize the sum of distances from the data points to their closest cluster centers.
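The objective in this definition can be written down directly. The following is a minimal sketch, assuming Euclidean distance as the metric and points given as tuples; the function name `clustering_cost` and the sample data are illustrative and do not come from the slides.

```python
import math

def clustering_cost(points, centers):
    """Sum of distances from each point to its closest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

# Four points forming two obvious groups; k = 2 centers picked from the points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
cost = clustering_cost(points, centers)
```

A stream clustering algorithm tries to keep this cost low without ever holding all n points in memory at once.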
• ONLINE PHASE: Summarize the data into memory-efficient data structures.
• OFFLINE PHASE: Use a clustering algorithm to find the data partition.
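The two phases can be sketched as follows. This is only an illustration under stated assumptions: the summary is a list of count-and-linear-sum records, a fixed `radius` decides whether a point joins an existing record, and the offline step is a plain weighted k-means with naive initialisation. The names `MicroCluster`, `online_update` and `offline_cluster` are hypothetical.

```python
import math

class MicroCluster:
    """Memory-efficient summary: a count and a per-dimension linear sum."""
    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)

    def center(self):
        return tuple(s / self.n for s in self.linear_sum)

    def absorb(self, point):
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]

def online_update(summaries, point, radius):
    """Online phase: fold one arriving point into the summary structure."""
    if summaries:
        nearest = min(summaries, key=lambda mc: math.dist(mc.center(), point))
        if math.dist(nearest.center(), point) <= radius:
            nearest.absorb(point)
            return
    summaries.append(MicroCluster(point))

def offline_cluster(summaries, k, iterations=10):
    """Offline phase: weighted k-means over the summaries, not the raw data."""
    centers = [mc.center() for mc in summaries[:k]]  # naive initialisation
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for mc in summaries:
            j = min(range(len(centers)),
                    key=lambda i: math.dist(centers[i], mc.center()))
            groups[j].append(mc)
        for j, group in enumerate(groups):
            if group:
                total = sum(mc.n for mc in group)
                dim = len(group[0].linear_sum)
                centers[j] = tuple(
                    sum(mc.linear_sum[d] for mc in group) / total
                    for d in range(dim)
                )
    return centers

summaries = []
for p in [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.2, 5.0), (0.1, 0.1)]:
    online_update(summaries, p, radius=1.0)
centers = offline_cluster(summaries, k=2)
```

The raw points are discarded after the online step; only the summaries reach the offline step.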
REQUIREMENTS FOR CLUSTERING ALGORITHMS:
• Provide timely results by performing fast & incremental processing of data objects.
• Rapidly adapt to the changing dynamics of the data, which means the algorithm should detect when new clusters appear or existing ones disappear.
• Scale to the number of objects that are continuously arriving.
• Provide a compact model representation.
• Rapidly detect the presence of outliers & act accordingly.
• Handle high dimensionality while remaining interpretable & usable.
• Deal with different data types, e.g. XML trees, DNA sequences, GPS information etc.
• ALGORITHM STEPS:
  • Data abstraction: Summarize the data into memory-efficient data structures.
  • Clustering phase: Use a clustering algorithm to find the data partition.
There are five main classes of stream clustering algorithms:
• HIERARCHICAL BASED ALGORITHMS: These use the dendrogram data structure, which is binary tree based. It is useful for summarizing & visualizing the data.
• Examples are BIRCH, CHAMELEON, ODAC, E-Stream & HUE-Stream.
• PARTITIONING BASED ALGORITHMS: These split the data instances into a predefined number of clusters based on similarity to the cluster centroids.
• Examples are CluStream, HPStream, SWClustering, StreamKM++ & CLARA.
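Assigning each instance to its most similar centroid can be done one object at a time. The sketch below uses a sequential (MacQueen-style) k-means update, which is a generic technique chosen for illustration and not one of the algorithms named above; the function name and the convention that each initial centroid counts as one observation are assumptions.

```python
import math

def sequential_kmeans(stream, initial_centroids):
    """Update a fixed number of centroids incrementally, one object at a time."""
    centroids = [list(c) for c in initial_centroids]
    counts = [1] * len(centroids)  # assumption: each initial centroid counts once
    for x in stream:
        # Assign the instance to the most similar (closest) centroid.
        j = min(range(len(centroids)), key=lambda i: math.dist(centroids[i], x))
        counts[j] += 1
        step = 1.0 / counts[j]
        # Move that centroid towards the instance by a shrinking step.
        centroids[j] = [c + step * (xi - c) for c, xi in zip(centroids[j], x)]
    return centroids

centroids = sequential_kmeans(
    [(1.0, 0.0), (9.0, 10.0), (2.0, 0.0)],
    [(0.0, 0.0), (10.0, 10.0)],
)
```

Each centroid ends up as the running mean of the objects assigned to it, so no past object needs to be stored.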
• GRID BASED ALGORITHMS: These use a multi-resolution grid data structure.
• The workspace is divided into a number of cells in a grid structure, & each instance is assigned to a cell.
• The grid cells are then clustered.
• Examples include GCHDS, GSCDS, DGClust, CLIQUE, WaveCluster & STING.
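The grid procedure above can be sketched in a few lines. This is a simplified single-resolution version under assumptions not stated in the slides: cells are axis-aligned with a fixed `cell_size`, a cell is "dense" when it holds at least `min_points` instances, and dense cells that share a face are grouped into one cluster.

```python
from collections import Counter, deque

def grid_clusters(points, cell_size, min_points):
    """Assign points to grid cells, then group adjacent dense cells."""
    # Step 1: map each instance to the cell that contains it.
    counts = Counter(tuple(int(c // cell_size) for c in p) for p in points)
    dense = {cell for cell, n in counts.items() if n >= min_points}

    # Step 2: cluster the cells themselves with a flood fill over neighbours.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            cell = queue.popleft()
            component.append(cell)
            for d in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:d] + (cell[d] + step,) + cell[d + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(sorted(component))
    return clusters

points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.4),
          (5.5, 5.5), (5.6, 5.7), (9.0, 0.0)]
clusters = grid_clusters(points, cell_size=1.0, min_points=2)
```

Only the per-cell counts are kept, so memory depends on the number of occupied cells rather than the number of instances.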
• DENSITY BASED ALGORITHMS: These keep a summary of the input data in a large number of micro-clusters.
• A micro-cluster is a set of data instances that are very close to each other.
• The synopsis of each micro-cluster is kept as a feature vector. The micro-clusters are then merged to form the final clusters.
• Examples are DBSCAN, LDBSCAN, DSCLU, SOStream & MR-Stream.
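One common way to keep such a feature vector is a triple of count, linear sum and squared sum, which makes merging two micro-clusters a simple addition. The layout below is an assumption in the style of BIRCH/CluStream summaries and is not specified in the slides; the class name `FeatureVector` is hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class FeatureVector:
    n: int      # number of instances summarized
    ls: tuple   # per-dimension linear sum
    ss: tuple   # per-dimension sum of squares

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), tuple(x * x for x in p))

    def merge(self, other):
        """Merging two micro-clusters only adds their statistics."""
        return FeatureVector(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            tuple(a + b for a, b in zip(self.ss, other.ss)),
        )

    def centroid(self):
        return tuple(s / self.n for s in self.ls)

    def radius(self):
        """Root mean squared distance of the members from the centroid."""
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

a = FeatureVector.from_point((0.0, 0.0)).merge(FeatureVector.from_point((2.0, 0.0)))
```

Because the statistics are additive, the centroid and radius of a merged micro-cluster are available without revisiting the original instances.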
• MODEL BASED ALGORITHMS: These find the data distribution model that best fits the input data.
• They attempt to optimize the fit between the data & some mathematical model.
• They adopt statistical & AI approaches.
• Examples are COBWEB, CluDistream & SWEM.
• Some data stream clustering algorithms use a simplified summarization structure called a prototype array.
• It is an array of prototypes that summarizes the data partition.
• It is used to summarize the stream by dividing the data stream into chunks of size m.
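A minimal sketch of this idea follows. It assumes the simplest possible prototype, one weighted centroid per chunk of size m; published algorithms typically keep several medoids or centroids per chunk, so this illustrates the structure rather than any specific algorithm. The function name `prototype_array` is hypothetical.

```python
from itertools import islice

def prototype_array(stream, m):
    """Summarize a stream as a list of (prototype, weight), one per chunk."""
    it = iter(stream)
    prototypes = []
    while True:
        chunk = list(islice(it, m))  # read the next chunk of at most m objects
        if not chunk:
            break
        dim = len(chunk[0])
        centroid = tuple(sum(p[d] for p in chunk) / len(chunk) for d in range(dim))
        prototypes.append((centroid, len(chunk)))
    return prototypes

array = prototype_array([(0.0,), (2.0,), (10.0,), (12.0,), (5.0,)], m=2)
```

Only one chunk is in memory at a time; the array of prototypes is what the later clustering step works on.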
• In most data stream scenarios, more recent information from the stream can reflect emerging trends or changes in the data distribution.
• This information can be used to explain the evolution of the process under observation.
• Moving window techniques have been proposed to partially address the problem of giving recent data more importance than outdated data.
• SLIDING WINDOW: Only the most recent information from the data stream is stored, in a data structure whose size can be variable or fixed.
• This is usually a first in, first out (FIFO) structure which considers the objects from the current period of time up to a certain period in the past.
• The organization & manipulation of objects are based on the principle of queue processing.
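The FIFO behaviour described above maps directly onto a bounded queue. The sketch below assumes a fixed-size window measured in number of objects; `recent_window` is an illustrative name.

```python
from collections import deque

def recent_window(stream, size):
    """Keep only the `size` most recent objects, first in, first out."""
    window = deque(maxlen=size)  # the oldest object is dropped automatically
    for obj in stream:
        window.append(obj)
    return list(window)

window = recent_window([1, 2, 3, 4, 5], size=3)
```

A variable-size window would instead evict by timestamp, removing objects older than a chosen period.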
• DAMPED WINDOW: Considers the most recent information by associating weights with objects from the data stream.
• More recent objects receive a higher weight than older objects, & the weight of each object decreases with time.
• The weight of the objects decays exponentially (shown in the slide's figure as shading from black to white).
• Adopted in density based clustering algorithms.
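The exponentially decaying weights can be sketched with a fading function of the form 2^(-λ·age). This particular form and the decay rate `lam` are assumptions chosen for illustration; the slides state only that the decay is exponential.

```python
def decay_weight(arrival_time, now, lam=1.0):
    """Weight of an object that arrived at `arrival_time`, evaluated at `now`."""
    return 2.0 ** (-lam * (now - arrival_time))

def damped_mean(observations, now, lam=1.0):
    """Weighted mean of (arrival_time, value) pairs; recent values count more."""
    weights = [decay_weight(t, now, lam) for t, _ in observations]
    return sum(w * v for w, (_, v) in zip(weights, observations)) / sum(weights)

observations = [(0, 10.0), (1, 10.0), (2, 40.0)]
mean_now = damped_mean(observations, now=2)
```

With `lam=1.0` an object loses half of its weight per time unit, so the newest value dominates the weighted mean.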
• Last in the row
• LANDMARK WINDOW: Considers the data in the data stream from the beginning until now.
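Considering all data from the beginning until now does not require storing it; a running summary is enough. The sketch below keeps an incremental count and mean, with the class name `RunningSummary` chosen purely for illustration.

```python
class RunningSummary:
    """Count and mean over everything seen since the start of the stream."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, value):
        self.count += 1
        # Incremental mean update: no past values need to be kept.
        self.mean += (value - self.mean) / self.count

summary = RunningSummary()
for value in [2.0, 4.0, 6.0]:
    summary.add(value)
```

Every object has equal influence here, which is the main contrast with the recency-focused models.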
• The coreset tree structure is responsible for reducing 2m objects to m objects. The construction of this structure is defined as follows:
• First, the tree has only the root node v, which contains all the 2m objects in Ev. The prototype Xpv of the root node is chosen randomly from Ev, & Nv = |Ev| = 2m. Afterwards, two child nodes of v are created, v1 & v2.
• To create these nodes, the object that is farthest away from the prototype object is selected.
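The first split of the root node can be sketched as follows. The slides stop after selecting the farthest object, so the final step here, sending each object to whichever of the two reference objects is closer, is an assumption added to make the example complete. Euclidean distance and the name `split_root` are also assumptions.

```python
import math
import random

def split_root(objects, rng):
    """Split the root's objects into two child sets around two reference objects."""
    prototype = rng.choice(objects)  # prototype chosen randomly from the root set
    farthest = max(objects, key=lambda x: math.dist(x, prototype))
    # Assumption: each object goes to the child whose reference object is closer.
    child1 = [x for x in objects
              if math.dist(x, prototype) <= math.dist(x, farthest)]
    child2 = [x for x in objects
              if math.dist(x, prototype) > math.dist(x, farthest)]
    return prototype, farthest, child1, child2

objects = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
prototype, farthest, child1, child2 = split_root(objects, random.Random(0))
```

Repeating such splits until the tree has m leaves, and keeping one prototype per leaf, is how 2m objects would be reduced to m.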
• OUTLIERS: A set of objects that are considerably dissimilar from the remainder of the data.
• PROBLEM: Find the top n outlier points.
• APPLICATIONS:
  • Credit card fraud detection
  • Telecom fraud detection
  • Customer segmentation
  • Medical analysis
• Besides being incremental & fast, data stream clustering algorithms should also be able to properly handle outliers throughout the stream.
• Outliers are objects that deviate from the general behaviour of a data model. They occur due to different causes, such as problems in data collection, storage & transmission errors, fraudulent activities or changes in the behaviour of the system.
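One simple way to approach the top-n outlier problem is to score each object by its distance to the nearest cluster center and keep the n highest scores. This scoring rule is an assumption chosen for illustration; the slides do not prescribe a detection method.

```python
import heapq
import math

def top_n_outliers(points, centers, n):
    """Return the n points that lie farthest from their nearest center."""
    def score(p):
        return min(math.dist(p, c) for c in centers)
    return heapq.nlargest(n, points, key=score)

points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (4.0, 4.0), (10.2, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
outliers = top_n_outliers(points, centers, n=1)
```

In a streaming setting the same score can be computed per arriving object against the current model, so suspicious objects are flagged without a second pass.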
APPLICATIONS OF CLUSTERING:
• Pattern recognition
• Spatial data analysis
• Image processing
• Economic science (especially market research)
• WWW
• Internet
• Data Mining & Analysis by MJ Zaki
• Websites (dimacs.rutgers.edu & dsc.soic.indiana.edu)
• Class notes
Clustering for Stream and Parallelism (DATA ANALYTICS)
