SlideShare a Scribd company logo
1 of 34
DATA MINING
CLUSTER ANALYSIS
Clustering in Data Mining
• Clustering is an unsupervised Machine Learning-
based Algorithm that comprises a group of data
points into clusters so that the objects belong to the
same group.
• Clustering helps to splits data into several subsets.
Each of these subsets contains data similar to each
other, and these subsets are called clusters.
• Now that the data from our customer base is divided
into clusters, we can make an informed decision
about who we think is best suited for this product
Contd…
• Let's understand this with an example, suppose we
are a market manager, and we have a new tempting
product to sell.
• We are sure that the product would bring enormous
profit, as long as it is sold to the right people. So, how
can we tell who is best suited for the product from our
company's huge customer base?
Showing for clusters formed the set of unlabled data
Contd…
Clustering, falling under the category of unsupervised machine
learning, is one of the problems that machine learning algorithms solve.
Clustering only utilizes input data, to determine patterns, anomalies, or
similarities in its input data.
A good clustering algorithm aims to obtain clusters whose:
The intra-cluster similarities are high, It implies that the data present
inside the cluster is similar to one another.
The inter-cluster similarity is low, and it means each cluster holds data
that is not similar to other data.
What is a Cluster?
• A cluster is a subset of similar objects
• A subset of objects such that the distance between
any of the two objects in the cluster is less than the
distance between any object in the cluster and any
object that is not located inside it.
• A connected region of a multidimensional space with
a comparatively high density of objects.
What is clustering in Data Mining?
• Clustering is the method of converting a group of
abstract objects into classes of similar objects.
• Clustering is a method of partitioning a set of data or
objects into a set of significant subclasses called
clusters.
• It helps users to understand the structure or natural
grouping in a data set and used either as a stand-
alone instrument to get a better insight into data
distribution or as a pre-processing step for other
algorithms
Important points:
• Data objects of a cluster can be considered as one
group.
• We first partition the information set into groups while
doing cluster analysis. It is based on data similarities
and then assigns the levels to the groups.
• The over-classification main advantage is that it is
adaptable to modifications, and it helps single out
important characteristics that differentiate between
distinct groups.
Applications of cluster analysis in data
mining:
• In many applications, clustering analysis is widely
used, such as data analysis, market research, pattern
recognition, and image processing.
• It assists marketers to find different groups in their
client base and based on the purchasing patterns.
• They can characterize their customer groups.
• It helps in allocating documents on the internet for
data discovery.
• Clustering is also used in tracking applications such
as detection of credit card fraud.
• As a data mining function, cluster analysis serves as
a tool to gain insight into the distribution of data to
analyze the characteristics of each cluster.
• In terms of biology, It can be used to determine plant
and animal taxonomies, categorization of genes with
the same functionalities and gain insight into
structure inherent to populations.
• It helps in the identification of areas of similar land
that are used in an earth observation database and
the identification of house groups in a city according
to house type, value, and geographical location.
Why is clustering used in data
mining?
• Clustering analysis has been an evolving problem in
data mining due to its variety of applications.
• The advent of various data clustering tools in the last
few years and their comprehensive use in a broad
range of applications, including image processing,
computational biology, mobile communication,
medicine, and economics, must contribute to the
popularity of these algorithms
Contd…
• . The main issue with the data clustering algorithms is
that it cant be standardized.
• The advanced algorithm may give the best results
with one type of data set, but it may fail or perform
poorly with other kinds of data set.
• Although many efforts have been made to
standardize the algorithms that can perform well in all
situations, no significant achievement has been
achieved so far.
• Many clustering tools have been proposed so far.
However, each algorithm has its advantages or
disadvantages and cant work on all real situations.
1. Scalability:
• Scalability in clustering implies that as we boost the
amount of data objects, the time to perform clustering
should approximately scale to the complexity order of
the algorithm.
• For example, if we perform K- means clustering, we
know it is O(n), where n is the number of objects in
the data. If we raise the number of data objects 10
folds, then the time taken to cluster them should also
approximately increase 10 times. It means there
should be a linear relationship. If that is not the case,
then there is some error with our implementation
process.
Showing example where scalablitymay leads to wrong results
Contd…
• 2. Interpretability:
• The outcomes of clustering should be interpretable,
comprehensible, and usable.
• 3. Discovery of clusters with attribute shape:
• The clustering algorithm should be able to find
arbitrary shape clusters. They should not be limited to
only distance measurements that tend to discover a
spherical cluster of small sizes.
Contd…
4. Ability to deal with different types of attributes:
• Algorithms should be capable of being applied to any
data such as data based on intervals (numeric),
binary data, and categorical data.
• 5. Ability to deal with noisy data:
• Databases contain data that is noisy, missing, or
incorrect. Few algorithms are sensitive to such data
and may result in poor quality clusters.
6. High dimensionality:
• The clustering tools should not only able to handle
high dimensional data space but also the low-
dimensional space.
Text Data Mining
Areas of text mining in data mining:
• Information Extraction:
The automatic extraction of structured data such as
entities, entities relationships, and attributes
describing entities from an unstructured source is
called information extraction.
• Natural Language Processing:
NLP stands for Natural language processing.
Computer software can understand human language
as same as it is spoken. NLP is primarily a
component of artificial intelligence(AI). The
development of the NLP application is difficult
because computers generally expect humans to
"Speak" to them in a programming language that is
accurate, clear, and exceptionally structured. Human
speech is usually not authentic so that it can depend
on many complex variables, including slang, social
context, and regional dialects.
• Data Mining:
Data mining refers to the extraction of useful data,
hidden patterns from large data sets. Data mining
tools can predict behaviors and future trends that
allow businesses to make a better data-driven
decision. Data mining tools can be used to resolve
many business problems that have traditionally been
too time-consuming.
• Information Retrieval:
Information retrieval deals with retrieving useful data
from data that is stored in our systems. Alternately,
as an analogy, we can view search engines that
happen on websites such as e-commerce sites or
any other sites as part of information retrieval.
Text Mining Process:
Contd…
• Text transformation
A text transformation is a technique that is used to control
the capitalization of the text.
Here the two major way of document representation is
given.
– Bag of words
– Vector Space
• Text Pre-processing
Pre-processing is a significant task and a critical step in
Text Mining, Natural Language Processing (NLP), and
information retrieval(IR). In the field of text mining, data
pre-processing is used for extracting useful information and
knowledge from unstructured text data. Information
Retrieval (IR) is a matter of choosing which documents in a
collection should be retrieved to fulfill the user's need.
Contd…
• Feature selection:
Feature selection is a significant part of data mining.
Feature selection can be defined as the process of
reducing the input of processing or finding the
essential information sources. The feature selection
is also called variable selection.
• Data Mining:
Now, in this step, the text mining procedure merges
with the conventional process. Classic Data Mining
procedures are used in the structural database.
• Evaluate:
Afterward, it evaluates the results. Once the result is
evaluated, the result abandon.
Contd…
• Applications:
These are the following text mining applications:
• Risk Management:
Risk Management is a systematic and logical
procedure of analyzing, identifying, treating, and
monitoring the risks involved in any action or process
in organizations. Insufficient risk analysis is usually a
leading cause of disappointment. It is particularly true
in the financial organizations where adoption of Risk
Management Software based on text mining
technology can effectively enhance the ability to
diminish risk. It enables the administration of millions
of sources and petabytes of text documents, and
giving the ability to connect the data. It helps to
access the appropriate data at the right time.
Contd…
• Customer Care Service:
Text mining methods, particularly NLP, are finding
increasing significance in the field of customer care.
Organizations are spending in text analytics
programming to improve their overall experience by
accessing the textual data from different sources
such as customer feedback, surveys, customer calls,
etc. The primary objective of text analysis is to
reduce the response time of the organizations and
help to address the complaints of the customer
rapidly and productively.
Contd…
• Business Intelligence:
Companies and business firms have started to use text
mining strategies as a major aspect of their business
intelligence. Besides providing significant insights into
customer behavior and trends, text mining strategies also
support organizations to analyze the qualities and
weaknesses of their opponent's so, giving them a
competitive advantage in the market.
• .
Contd…
• Social Media Analysis:
Social media analysis helps to track the online data,
and there are numerous text mining tools designed
particularly for performance analysis of social media
sites. These tools help to monitor and interpret the
text generated via the internet from the news, emails,
blogs, etc. Text mining tools can precisely analyze
the total no of posts, followers, and total no of likes of
your brand on a social media platform that enables
you to understand the response of the individuals
who are interacting with your brand and content
Text Mining Approaches in Data
Mining:
• 1. Keyword-based Association Analysis:
• It collects sets of keywords or terms that often
happen together and afterward discover the
association relationship among them. First, it
preprocesses the text data by parsing, stemming,
removing stop words, etc. Once it pre-processed the
data, then it induces association mining algorithms.
Here, human effort is not required, so the number of
unwanted results and the execution time is reduced.
Contd…
• 2. Document Classification Analysis:
• Automatic document classification:
• This analysis is used for the automatic classification
of the huge number of online text documents like web
pages, emails, etc. Text document classification
varies with the classification of relational data as
document databases are not organized according to
attribute values pairs.
Numericizing text:
• Stemming algorithms
A significant pre-processing step before ordering of
input documents starts with the stemming of words.
The terms "stemming" can be defined as a reduction
of words to their roots. For example, different
grammatical forms of words and ordered are the
same. The primary purpose of stemming is to ensure
a similar word by text mining program.
Contd…
• Support for different languages:
There are some highly language-dependent
operations such as stemming, synonyms, the letters
that are allowed in words. Therefore, support for
various languages is important.
• Exclude certain character:
Excluding numbers, specific characters, or series of
characters, or words that are shorter or longer than a
specific number of letters can be done before the
ordering of the input documents.
Include lists, exclude lists
(stop-words):
•
A particular list of words to be listed can be
characterized, and it is useful when we want to
search for a specific word. It also classifies the input
documents based on the frequencies with which
those words occur. Additionally, "stop words," which
means terms that are to be rejected from the ordering
can be characterized. Normally, a default list of
English stop words incorporates "the," "a," "since,"
etc. These words are used in the respective language
very often but communicate very little data in the
document.
•
STAY SAFE
STAY CONNECTED

More Related Content

Similar to CLUSTER ANALYSIS.pptx

Knowledge Discovery & Representation
Knowledge Discovery & RepresentationKnowledge Discovery & Representation
Knowledge Discovery & Representation
Darshan Patil
 

Similar to CLUSTER ANALYSIS.pptx (20)

ch2 DS.pptx
ch2 DS.pptxch2 DS.pptx
ch2 DS.pptx
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
Unit II.pdf
Unit II.pdfUnit II.pdf
Unit II.pdf
 
Data mining
Data miningData mining
Data mining
 
Cluster Analysis in Data Science.pptx
Cluster Analysis in Data Science.pptxCluster Analysis in Data Science.pptx
Cluster Analysis in Data Science.pptx
 
Cluster Analysis in Data Science.pptx
Cluster Analysis in Data Science.pptxCluster Analysis in Data Science.pptx
Cluster Analysis in Data Science.pptx
 
Knowledge Discovery & Representation
Knowledge Discovery & RepresentationKnowledge Discovery & Representation
Knowledge Discovery & Representation
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 
Unit 3 part ii Data mining
Unit 3 part ii Data miningUnit 3 part ii Data mining
Unit 3 part ii Data mining
 
Data Mining Presentation.pptx
Data Mining Presentation.pptxData Mining Presentation.pptx
Data Mining Presentation.pptx
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 
Data mining
Data miningData mining
Data mining
 
Data Analytics and Big Data on IoT
Data Analytics and Big Data on IoTData Analytics and Big Data on IoT
Data Analytics and Big Data on IoT
 
Machine learning in Banks
Machine learning in BanksMachine learning in Banks
Machine learning in Banks
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptx
 
Introduction to data science.pdf
Introduction to data science.pdfIntroduction to data science.pdf
Introduction to data science.pdf
 
Data analytics for engineers- introduction
Data analytics for engineers-  introductionData analytics for engineers-  introduction
Data analytics for engineers- introduction
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
التقنيات المستخدمة لتطوير المكتبات
التقنيات المستخدمة لتطوير المكتباتالتقنيات المستخدمة لتطوير المكتبات
التقنيات المستخدمة لتطوير المكتبات
 
Ch~2.pdf
Ch~2.pdfCh~2.pdf
Ch~2.pdf
 

Recently uploaded

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 

Recently uploaded (20)

Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 

CLUSTER ANALYSIS.pptx

  • 2. Clustering in Data Mining • Clustering is an unsupervised Machine Learning- based Algorithm that comprises a group of data points into clusters so that the objects belong to the same group. • Clustering helps to splits data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters. • Now that the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product
  • 3.
  • 4. Contd… • Let's understand this with an example, suppose we are a market manager, and we have a new tempting product to sell. • We are sure that the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base?
  • 5. Showing for clusters formed the set of unlabled data
  • 6. Contd… Clustering, falling under the category of unsupervised machine learning, is one of the problems that machine learning algorithms solve. Clustering only utilizes input data, to determine patterns, anomalies, or similarities in its input data. A good clustering algorithm aims to obtain clusters whose: The intra-cluster similarities are high, It implies that the data present inside the cluster is similar to one another. The inter-cluster similarity is low, and it means each cluster holds data that is not similar to other data.
  • 7. What is a Cluster? • A cluster is a subset of similar objects • A subset of objects such that the distance between any of the two objects in the cluster is less than the distance between any object in the cluster and any object that is not located inside it. • A connected region of a multidimensional space with a comparatively high density of objects.
  • 8. What is clustering in Data Mining? • Clustering is the method of converting a group of abstract objects into classes of similar objects. • Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters. • It helps users to understand the structure or natural grouping in a data set and used either as a stand- alone instrument to get a better insight into data distribution or as a pre-processing step for other algorithms
  • 9. Important points: • Data objects of a cluster can be considered as one group. • We first partition the information set into groups while doing cluster analysis. It is based on data similarities and then assigns the levels to the groups. • The over-classification main advantage is that it is adaptable to modifications, and it helps single out important characteristics that differentiate between distinct groups.
  • 10. Applications of cluster analysis in data mining: • In many applications, clustering analysis is widely used, such as data analysis, market research, pattern recognition, and image processing. • It assists marketers to find different groups in their client base and based on the purchasing patterns. • They can characterize their customer groups. • It helps in allocating documents on the internet for data discovery.
  • 11. • Clustering is also used in tracking applications such as detection of credit card fraud. • As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to analyze the characteristics of each cluster. • In terms of biology, It can be used to determine plant and animal taxonomies, categorization of genes with the same functionalities and gain insight into structure inherent to populations. • It helps in the identification of areas of similar land that are used in an earth observation database and the identification of house groups in a city according to house type, value, and geographical location.
  • 12. Why is clustering used in data mining? • Clustering analysis has been an evolving problem in data mining due to its variety of applications. • The advent of various data clustering tools in the last few years and their comprehensive use in a broad range of applications, including image processing, computational biology, mobile communication, medicine, and economics, must contribute to the popularity of these algorithms
  • 13. Contd… • . The main issue with the data clustering algorithms is that it cant be standardized. • The advanced algorithm may give the best results with one type of data set, but it may fail or perform poorly with other kinds of data set. • Although many efforts have been made to standardize the algorithms that can perform well in all situations, no significant achievement has been achieved so far. • Many clustering tools have been proposed so far. However, each algorithm has its advantages or disadvantages and cant work on all real situations.
  • 14. 1. Scalability: • Scalability in clustering implies that as we boost the amount of data objects, the time to perform clustering should approximately scale to the complexity order of the algorithm. • For example, if we perform K- means clustering, we know it is O(n), where n is the number of objects in the data. If we raise the number of data objects 10 folds, then the time taken to cluster them should also approximately increase 10 times. It means there should be a linear relationship. If that is not the case, then there is some error with our implementation process.
  • 15. Showing example where scalablitymay leads to wrong results
  • 16. Contd… • 2. Interpretability: • The outcomes of clustering should be interpretable, comprehensible, and usable. • 3. Discovery of clusters with attribute shape: • The clustering algorithm should be able to find arbitrary shape clusters. They should not be limited to only distance measurements that tend to discover a spherical cluster of small sizes.
  • 17. Contd… 4. Ability to deal with different types of attributes: • Algorithms should be capable of being applied to any data such as data based on intervals (numeric), binary data, and categorical data. • 5. Ability to deal with noisy data: • Databases contain data that is noisy, missing, or incorrect. Few algorithms are sensitive to such data and may result in poor quality clusters. 6. High dimensionality: • The clustering tools should not only able to handle high dimensional data space but also the low- dimensional space.
  • 19. Areas of text mining in data mining:
  • 20. • Information Extraction: The automatic extraction of structured data such as entities, entities relationships, and attributes describing entities from an unstructured source is called information extraction. • Natural Language Processing: NLP stands for Natural language processing. Computer software can understand human language as same as it is spoken. NLP is primarily a component of artificial intelligence(AI). The development of the NLP application is difficult because computers generally expect humans to "Speak" to them in a programming language that is accurate, clear, and exceptionally structured. Human speech is usually not authentic so that it can depend on many complex variables, including slang, social context, and regional dialects.
  • 21. • Data Mining: Data mining refers to the extraction of useful data, hidden patterns from large data sets. Data mining tools can predict behaviors and future trends that allow businesses to make a better data-driven decision. Data mining tools can be used to resolve many business problems that have traditionally been too time-consuming. • Information Retrieval: Information retrieval deals with retrieving useful data from data that is stored in our systems. Alternately, as an analogy, we can view search engines that happen on websites such as e-commerce sites or any other sites as part of information retrieval.
  • 23. Contd… • Text transformation A text transformation is a technique that is used to control the capitalization of the text. Here the two major way of document representation is given. – Bag of words – Vector Space • Text Pre-processing Pre-processing is a significant task and a critical step in Text Mining, Natural Language Processing (NLP), and information retrieval(IR). In the field of text mining, data pre-processing is used for extracting useful information and knowledge from unstructured text data. Information Retrieval (IR) is a matter of choosing which documents in a collection should be retrieved to fulfill the user's need.
  • 24. Contd… • Feature selection: Feature selection is a significant part of data mining. Feature selection can be defined as the process of reducing the input of processing or finding the essential information sources. The feature selection is also called variable selection. • Data Mining: Now, in this step, the text mining procedure merges with the conventional process. Classic Data Mining procedures are used in the structural database. • Evaluate: Afterward, it evaluates the results. Once the result is evaluated, the result abandon.
  • 25. Contd… • Applications: These are the following text mining applications: • Risk Management: Risk Management is a systematic and logical procedure of analyzing, identifying, treating, and monitoring the risks involved in any action or process in organizations. Insufficient risk analysis is usually a leading cause of disappointment. It is particularly true in the financial organizations where adoption of Risk Management Software based on text mining technology can effectively enhance the ability to diminish risk. It enables the administration of millions of sources and petabytes of text documents, and giving the ability to connect the data. It helps to access the appropriate data at the right time.
  • 26. Contd… • Customer Care Service: Text mining methods, particularly NLP, are finding increasing significance in the field of customer care. Organizations are spending in text analytics programming to improve their overall experience by accessing the textual data from different sources such as customer feedback, surveys, customer calls, etc. The primary objective of text analysis is to reduce the response time of the organizations and help to address the complaints of the customer rapidly and productively.
  • 27. Contd… • Business Intelligence: Companies and business firms have started to use text mining strategies as a major aspect of their business intelligence. Besides providing significant insights into customer behavior and trends, text mining strategies also support organizations to analyze the qualities and weaknesses of their opponent's so, giving them a competitive advantage in the market. • .
  • 28. Contd… • Social Media Analysis: Social media analysis helps to track the online data, and there are numerous text mining tools designed particularly for performance analysis of social media sites. These tools help to monitor and interpret the text generated via the internet from the news, emails, blogs, etc. Text mining tools can precisely analyze the total no of posts, followers, and total no of likes of your brand on a social media platform that enables you to understand the response of the individuals who are interacting with your brand and content
  • 29. Text Mining Approaches in Data Mining: • 1. Keyword-based Association Analysis: • It collects sets of keywords or terms that often happen together and afterward discover the association relationship among them. First, it preprocesses the text data by parsing, stemming, removing stop words, etc. Once it pre-processed the data, then it induces association mining algorithms. Here, human effort is not required, so the number of unwanted results and the execution time is reduced.
  • 30. Contd… • 2. Document Classification Analysis: • Automatic document classification: • This analysis is used for the automatic classification of the huge number of online text documents like web pages, emails, etc. Text document classification varies with the classification of relational data as document databases are not organized according to attribute values pairs.
  • 31. Numericizing text: • Stemming algorithms A significant pre-processing step before ordering of input documents starts with the stemming of words. The terms "stemming" can be defined as a reduction of words to their roots. For example, different grammatical forms of words and ordered are the same. The primary purpose of stemming is to ensure a similar word by text mining program.
  • 32. Contd… • Support for different languages: There are some highly language-dependent operations such as stemming, synonyms, the letters that are allowed in words. Therefore, support for various languages is important. • Exclude certain character: Excluding numbers, specific characters, or series of characters, or words that are shorter or longer than a specific number of letters can be done before the ordering of the input documents.
  • 33. Include lists, exclude lists (stop-words): • A particular list of words to be listed can be characterized, and it is useful when we want to search for a specific word. It also classifies the input documents based on the frequencies with which those words occur. Additionally, "stop words," which means terms that are to be rejected from the ordering can be characterized. Normally, a default list of English stop words incorporates "the," "a," "since," etc. These words are used in the respective language very often but communicate very little data in the document. •