This is a very simple introduction to clustering with some real-world examples. At the end of the lecture I use the Stack Overflow API to test some clustering. I also wanted to try Facebook, but there were some problems with its API.
2. Objectives
At the end of this presentation you will:
Understand data science and its applications
Get an overview of machine learning
Learn some types of clustering algorithms
Implement clustering with R
3. Data Science and Its Applications
Extract knowledge or insight from data
From speech recognition and search engines to health care and the humanities
These scenarios involve:
Storing, organizing, and integrating huge amounts of unstructured data
Processing and analyzing data
Extracting knowledge and insight, and predicting the future, from data
Processing, analyzing, and extracting knowledge and insight are done through machine learning
5. Machine Learning
The field of study that gives computers the ability to learn without being explicitly programmed
Classified into three broad categories:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
7. Cluster Definition
Cluster analysis, or clustering, is the task of grouping similar objects together (each group is called a cluster)
A good clustering has:
High intra-class (within-cluster) similarity
Low inter-class (between-cluster) similarity
8. Clustering Scenarios
The following scenarios use clustering:
Market segmentation
News summarization (cluster the articles, then find each cluster's centroid)
City planning
Image segmentation
10. Partitioning Method
Given a database of n objects, a partitioning method constructs k partitions of the data which satisfy the following:
Each group contains at least one object
Each object belongs to exactly one group
Points to remember
This method creates an initial partitioning
It then uses an iterative relocation technique to improve the partitioning
17. Density-Based Methods
Areas of higher density are considered clusters
Sparse areas are usually considered noise
They use two basic ideas:
Density reachability
Density connectivity
20. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Advantages
Does not require a priori specification of the number of clusters
Able to identify noise while clustering
Able to find clusters of arbitrary size and shape
Disadvantages
Fails on neck-type datasets (clusters joined by a thin bridge of points)
Does not work well on high-dimensional data
21. Grid-Based Methods
Use a multi-resolution grid data structure
Clustering complexity depends on the number of grid cells, not the number of objects
The space is quantized into a finite number of cells that form a grid structure, on which all clustering operations are performed
Examples: CLIQUE, STING, WaveCluster
22. CLIQUE (CLustering In QUEst)
CLIQUE is used for clustering high-dimensional data
High-dimensional data means data with many attributes
CLIQUE identifies the dense units in subspaces (a toy sketch follows)
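The slides stop at the idea, so here is a toy R sketch of CLIQUE's first step only (finding dense two-dimensional grid units); the grid resolution xi and density threshold tau are hypothetical choices, and the real algorithm proceeds bottom-up through all subspaces:

# Toy sketch of CLIQUE's starting point: cut each dimension into xi intervals
# and keep the grid cells holding more than tau points ("dense units").
xi  <- 10            # intervals per dimension (hypothetical choice)
tau <- 5             # density threshold (hypothetical choice)
x   <- iris[, 1:2]   # two attributes, for readability

bins   <- lapply(x, function(col) cut(col, breaks = xi, labels = FALSE))
cells  <- paste(bins$Sepal.Length, bins$Sepal.Width, sep = ",")
counts <- table(cells)
names(counts[counts > tau])   # the dense 2-D units CLIQUE would keep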
Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured,[1][2] which is a continuation of data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).
Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, chemometrics, information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, databases, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high-performance computing. The development of machine learning has enhanced the growth and importance of data science.
Data science affects academic and applied research in many domains, including machine translation, speech recognition, robotics, search engines, and the digital economy, but also the biological sciences, medical informatics, health care, the social sciences, and the humanities. It heavily influences economics, business, and finance. From the business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and data analysis.[3]
Detection of fake book reviews (Amazon) and fake restaurant reviews (Zagat).
A major car company is exploring how deep learning can be applied to audio recordings from the engine to determine whether maintenance is necessary, or whether parts are nearing the need for replacement.
Outdoor marketing company Route is using big data to define and justify its pricing model for advertising space on billboards, benches, and the sides of buses. Traditionally, outdoor media was priced "per impression," based on an estimate of how many eyes would see the ad in a given day. No more! Now they're using sophisticated GPS, eye-tracking software, and analysis of traffic patterns to get a much more realistic idea of which advertisements will be seen the most, and therefore be the most effective.
Alan Turing: "Can machines think?"
Supervised learning: the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
Unsupervised learning: the machine learning task of inferring a function to describe hidden structure from unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.
Reinforcement learning: a computer program interacts with a dynamic environment in which it must achieve a certain goal (such as driving a car), without a teacher explicitly telling it whether it has come close to its goal. Another example is learning to play a game by playing against an opponent.
There also exist other categorizations, e.g. by the type of output, …
Between supervised and unsupervised learning is semi-supervised learning, where the teacher gives an incomplete training signal: a training set with some (often many) of the target outputs missing.
Supervised learning is the most common technique for training neural networks and decision trees.
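To make the contrast concrete, here is a minimal R sketch on the built-in iris data; the model choices (a decision tree versus k-means) are illustrative only:

# Supervised: learn a mapping from labeled examples (Species is the label)
library(rpart)                          # decision trees; ships with R
fit <- rpart(Species ~ ., data = iris)
predict(fit, iris[1:3, ], type = "class")

# Unsupervised: no labels, just group similar observations
km <- kmeans(iris[, 1:4], centers = 3)
table(km$cluster, iris$Species)         # compare found groups to the held-back labels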
Differences between clustering and classification
In general, in classification you have a set of predefined classes and want to know which class a new object belongs to.
Clustering tries to group a set of objects and find whether there is some relationship between the objects.
Intra-class: objects within the same cluster should be similar (low dissimilarity)
Inter-class: objects in different clusters should be dissimilar
(City planning example: find the best places to open emergency-care wards)
Clustering algorithms may be classified as listed below:
Exclusive Clustering
Overlapping Clustering
Hierarchical Clustering
Probabilistic Clustering
Clustering methods are sometimes grouped by the cluster model they use:
Connectivity models: for example, hierarchical clustering builds models based on distance connectivity.
Centroid models: for example, the k-means algorithm represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.
Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.
Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.
Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques, as in the HCS clustering algorithm.
In recent years considerable effort has been put into improving the performance of existing algorithms. Among them are CLARANS (Ng and Han, 1994) and BIRCH (Zhang et al., 1996). With the recent need to process larger and larger data sets (also known as big data), the willingness to trade semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting "clusters" are merely a rough pre-partitioning of the data set, to then analyze the partitions with existing slower methods such as k-means clustering. Various other approaches to clustering have been tried, such as seed-based clustering.
For high-dimensional data, many of the existing methods fail due to the curse of dimensionality, which renders particular distance functions problematic in high-dimensional spaces. This led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering, which also looks for arbitrarily rotated ("correlated") subspace clusters that can be modeled by giving a correlation of their attributes. Examples of such clustering algorithms are CLIQUE and SUBCLU.
Ideas from density-based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been adopted to subspace clustering (HiSC, hierarchical subspace clustering and DiSH) and correlation clustering (HiCO, hierarchical correlation clustering, 4C using "correlation connectivity" and ERiC exploring hierarchical density-based correlation clusters).
Several different clustering systems based on mutual information have been proposed. One is Marina Meilă's variation of information metric; another provides hierarchical clustering. Using genetic algorithms, a wide range of different fit functions can be optimized, including mutual information.[29] Message passing algorithms, a recent development in computer science and statistical physics, have also led to the creation of new types of clustering algorithms.[30]
Points to remember:
For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses an iterative relocation technique to improve the partitioning by moving objects from one group to another. K-means clustering can handle larger datasets than hierarchical clustering approaches.
There are two functions in R for this kind of clustering: pam() (in the cluster package) and kmeans(). The k-means algorithm (a short R sketch follows the steps):
1. Selects K centroids (K rows chosen at random)
2. Assigns each data point to its closest centroid
3. Recalculates the centroids as the average of all data points in a cluster (i.e., the centroids are p-length mean vectors, where p is the number of variables)
4. Assigns data points to their closest centroids
5. Continues steps 3 and 4 until the observations are not reassigned or the maximum number of iterations (R uses 10 as a default) is reached.
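A minimal R sketch of those steps on the built-in iris measurements; the choice of data and of k = 3 is illustrative only:

# k-means on the four numeric iris columns; step 1 picks centroids at random
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3, iter.max = 10)  # iter.max = 10 is the default
km$centers   # the 3 p-length mean vectors from step 3
km$cluster   # final assignment of each observation (steps 2 and 4)

# pam() from the cluster package is the partitioning-around-medoids alternative
library(cluster)
pm <- pam(iris[, 1:4], k = 3)
pm$medoids   # medoids are actual observations, unlike k-means centroids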
MiniBatchKMeans is a variant of the k-means algorithm which uses mini-batches to reduce the computation time.
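MiniBatchKMeans as named is a scikit-learn estimator and base R has no direct equivalent, so the following hand-rolled sketch of the idea is an assumption throughout: the function name, batch size, and iteration count are all my own choices.

# Hand-rolled mini-batch k-means sketch (not a standard R function): each
# iteration assigns a random mini-batch to the nearest centroids and nudges
# those centroids toward the batch points with a decaying step size.
mini_batch_kmeans <- function(x, k, batch_size = 32, n_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centroids
  counts  <- rep(0, k)                               # per-centroid update counts
  for (i in seq_len(n_iter)) {
    batch <- x[sample(nrow(x), batch_size), , drop = FALSE]
    # distances between the k centroids (rows 1..k) and the batch points
    d <- as.matrix(dist(rbind(centers, batch)))[1:k, -(1:k), drop = FALSE]
    nearest <- apply(d, 2, which.min)                # nearest centroid per point
    for (j in seq_len(batch_size)) {
      c_id <- nearest[j]
      counts[c_id] <- counts[c_id] + 1
      eta <- 1 / counts[c_id]                        # decaying per-centroid step
      centers[c_id, ] <- (1 - eta) * centers[c_id, ] + eta * batch[j, ]
    }
  }
  centers
}
centers <- mini_batch_kmeans(iris[, 1:4], k = 3)     # toy usage on iris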
Hierarchical clustering is usually used for small datasets (on the order of 100 observations).
In R, use hclust() to build the cluster tree.
Agglomerative: a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Complexity: O(n^3).
Divisive: a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. Complexity: O(2^n).
Single linkage: the distance between two clusters is defined as the shortest distance between two points, one in each cluster (the distance between clusters r and s is the distance between their two closest points).
Complete linkage: the distance between two clusters is defined as the longest distance between two points, one in each cluster (the distance between clusters r and s is the distance between their two furthest points).
Average linkage: the distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster (the distance between clusters r and s is the average over all such point pairs). A short hclust() sketch follows.
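A minimal R sketch of the three linkage criteria with the built-in hclust(), again on iris; the data choice and the k = 3 cut are illustrative:

# hierarchical clustering under three linkage criteria
d <- dist(iris[, 1:4])                         # pairwise Euclidean distances
hc_single   <- hclust(d, method = "single")    # shortest inter-cluster distance
hc_complete <- hclust(d, method = "complete")  # longest inter-cluster distance
hc_average  <- hclust(d, method = "average")   # mean of all pairwise distances
plot(hc_average)                               # dendrogram of the merge sequence
cutree(hc_average, k = 3)                      # cut the tree into 3 clusters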
The idea is that if a particular point belongs to a cluster, it should be near to lots of other points in that cluster.
It works like this: first we choose two parameters, a positive number epsilon and a natural number minPoints. We then begin by picking an arbitrary point in our dataset. If there are more than minPoints points within a distance of epsilon from that point (including the original point itself), we consider all of them to be part of a "cluster". We then expand that cluster by checking all of the new points and seeing if they too have more than minPoints points within a distance of epsilon, growing the cluster recursively if so.
Eventually, we run out of points to add to the cluster. We then pick a new arbitrary point and repeat the process. Now, it's entirely possible that a point we pick has fewer than minPoints points in its epsilon ball, and is also not a part of any other cluster. If that is the case, it's considered a "noise point" not belonging to any cluster.
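A minimal R sketch of this procedure, assuming the dbscan package from CRAN; the eps and minPts values are illustrative, not prescriptions:

# DBSCAN on the iris measurements via the dbscan package
library(dbscan)                          # install.packages("dbscan") if missing
x  <- as.matrix(iris[, 1:4])
db <- dbscan(x, eps = 0.5, minPts = 5)   # epsilon and minPoints from the text
db$cluster                               # cluster labels; 0 marks noise points
kNNdistplot(x, k = 5); abline(h = 0.5)   # a common heuristic for choosing eps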
Advantage: recognizes noise.
Disadvantage: cannot recognize clusters that are not dense; OPTICS addresses this.