TEXT MINING
BY
THEJESWINI
B.Tech CSE 3 Year
SUBCODE:XCSE65
SUBNAME:DATA MINING
CONTENTS
INTRODUCTION
DATA MINING vs TEXT MINING
AREAS OF TEXT MINING
INFORMATION RETRIEVAL
TEXT MINING PROCESS
TEXT MINING APPROACHES
CHALLENGES OF TEXT MINING
REFERENCES
INTRODUCTION
• Text databases are growing rapidly because many sources now generate data
as text.
• These sources include collections of documents such as news articles,
research papers, books, digital libraries, e-mail messages, and the World Wide
Web (which can itself be viewed as a huge, interconnected, dynamic text
database). Many government and business institutions also store their data
as text.
• Understanding the patterns in this text and obtaining useful, reliable
information from it is the main motivation for text mining.
INTRODUCTION...(CONTD)
• Text mining is formally defined as the process of extracting relevant
information or patterns from sources that are in an unstructured or
semi-structured format.
• Data stored in most text databases is semi-structured, i.e. it is
neither completely unstructured nor completely structured.
• For example, a document may contain a few structured fields, such as title,
authors, publication date, and category, but also some largely unstructured
text components, such as the abstract and contents.
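The semi-structured document described above could be modeled roughly as follows. This is only an illustrative sketch: the class name `Document` and all field values are hypothetical, not taken from the slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Document:
    # Structured fields: fixed type and meaning, easy to query directly.
    title: str
    authors: List[str]
    publication_date: date
    category: str
    # Unstructured components: free text that has to be mined.
    abstract: str = ""
    contents: str = ""


# Hypothetical example values, for illustration only.
doc = Document(
    title="An Example Paper",
    authors=["A. Author", "B. Author"],
    publication_date=date(2020, 1, 15),
    category="data mining",
    abstract="Text databases are growing rapidly.",
    contents="Most of the information in this document lives in free text.",
)

# A structured field can be filtered on directly...
is_data_mining = doc.category == "data mining"
# ...whereas free text needs processing first (here, a naive word split).
abstract_words = doc.abstract.lower().rstrip(".").split()
```

The structured fields behave like database columns, while `abstract` and `contents` are the parts that text mining techniques operate on.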
DATA MINING vs TEXT MINING
DATA MINING | TEXT MINING
The process of finding patterns and extracting useful data from large data sets. | Applied to data from various text documents.
Applied to all types of data. | Applied to text data, which is mostly semi-structured or unstructured.
Processing of data is done directly. | Processing of data is done linguistically.
Statistical techniques are used to evaluate data. | Computational linguistic principles are used to evaluate text.
AREAS OF TEXT MINING
• IR (Information Retrieval): Query-based search over large collections of
text documents.
• NLP (Natural Language Processing): Developing NLP applications is difficult
because computers generally expect humans to "speak" to them in a programming
language that is precise, unambiguous, and highly structured. Human speech,
however, is often imprecise and depends on many complex variables, including
slang, social context, and regional dialects.
• IE (Information Extraction): The automatic extraction of structured data,
such as entities, relationships between entities, and attributes describing
entities, from an unstructured source.
• Data Mining: The extraction of useful data and hidden patterns from large
data sets. Data mining tools can predict behaviors and future trends, allowing
businesses to make better data-driven decisions.
INFORMATION RETRIEVAL
Information retrieval is a method for retrieving information from a large
number of text-based documents.
Because text information is so abundant, information retrieval has found
many applications. Information retrieval systems include:
-on-line library catalog systems,
-on-line document management systems, and
-the more recently developed Web search engines.
• A typical information retrieval problem is to locate relevant documents in a
document collection based on a user’s query, which is often a few keywords
describing an information need.
INFORMATION RETRIEVAL…(CONTD)
1. BASIC MEASURES OF INFORMATION RETRIEVAL
There are two basic measures for assessing the quality of text retrieval:
Precision: the percentage of retrieved documents that are in fact relevant to
the query (i.e., “correct” responses). It is formally defined as
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents relevant to the query that were in fact
retrieved. It is formally defined as
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
One commonly used trade-off between the two is the F-score, defined as the
harmonic mean of recall and precision:
F-score = (recall × precision) / ((recall + precision) / 2)
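As a small worked example of these three measures, the sketch below computes them for sets of document identifiers. The function name `precision_recall_f` and the document IDs are assumptions made for illustration.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two sets of document IDs."""
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f_score = 0.0  # harmonic mean is taken as 0 when both are 0
    else:
        f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical judgments: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall, f_score = precision_recall_f(relevant, retrieved)
```

Here precision is 2/3, recall is 2/4, and the F-score is 4/7 (about 0.571), which lies between the two and closer to the lower value, as expected of a harmonic mean.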
INFORMATION RETRIEVAL…(CONTD)
2. TEXT RETRIEVAL METHODS
Text documents can be retrieved by the following methods:
-Document selection method: The query specifies constraints for selecting
relevant documents. A typical method in this category is the “Boolean retrieval
model”, in which a document is represented by a set of keywords and the user
provides a Boolean expression of keywords,
e.g. “car and repair shops”, “tea or coffee”.
-Document ranking method: The query is used to rank all documents in order of
relevance. The goal is to approximate the degree of relevance of a document
with a score computed from information such as the frequency of words in the
document and in the whole collection.
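The two retrieval methods can be contrasted on a toy collection. The sketch below is a minimal illustration, not a full retrieval system: the documents are made up, and the ranking score uses TF-IDF weighting, which is one common way to combine word frequency in the document with frequency in the collection (the slides do not name a specific formula).

```python
import math
from collections import Counter

# Hypothetical toy collection; each document is already split into keywords.
docs = {
    "d1": ["car", "repair", "shops", "open", "daily"],
    "d2": ["tea", "and", "coffee", "shops"],
    "d3": ["car", "wash", "and", "coffee"],
}
keyword_sets = {doc_id: set(words) for doc_id, words in docs.items()}


# Document selection: Boolean retrieval over keyword sets.
def select_and(*terms):
    return sorted(d for d, kw in keyword_sets.items() if all(t in kw for t in terms))


def select_or(*terms):
    return sorted(d for d, kw in keyword_sets.items() if any(t in kw for t in terms))


car_and_repair = select_and("car", "repair")
tea_or_coffee = select_or("tea", "coffee")


# Document ranking: TF-IDF score (an assumed choice of scoring function).
def tf_idf_score(query_terms, words):
    counts = Counter(words)
    score = 0.0
    for term in query_terms:
        # Number of documents in the collection containing the term.
        df = sum(1 for kw in keyword_sets.values() if term in kw)
        if df == 0:
            continue
        score += counts[term] * math.log(len(docs) / df)
    return score


query = ["car", "repair"]
ranking = sorted(docs, key=lambda d: tf_idf_score(query, docs[d]), reverse=True)
```

Selection returns an unordered yes/no answer (`["d1"]` for the AND query, `["d2", "d3"]` for the OR query), while ranking orders every document, here `["d1", "d3", "d2"]`, so partially matching documents are still surfaced.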
INFORMATION RETRIEVAL…(CONTD)
• The first step in most retrieval systems is to identify keywords for
representing documents, a preprocessing step often called tokenization. To
avoid indexing useless words, a text retrieval system often associates a
“stop list” with a set of documents: a list of words, such as “a”, “the”, and
“of”, that are deemed irrelevant for retrieval.
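Tokenization with a stop list might look like the following sketch. The regular expression and the contents of `STOP_LIST` are illustrative assumptions; real systems use larger, collection-specific stop lists.

```python
import re

# A tiny, hypothetical stop list; real stop lists are much longer.
STOP_LIST = {"a", "an", "the", "of", "for", "with", "and", "to", "is", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text, stop_list=STOP_LIST):
    """Return the tokens worth indexing, i.e. those not on the stop list."""
    return [tok for tok in tokenize(text) if tok not in stop_list]


terms = index_terms("The first step is to identify keywords for a document.")
```

For the sample sentence, only `first`, `step`, `identify`, `keywords`, and `document` survive as index terms.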
Text mining is a part of data mining.
TEXT MINING PROCESS
• Text preprocessing
-Syntactic/semantic text analysis (text cleanup, tokenization)
• Features Generation
-Bag of words (the words a document contains and their occurrences)
-Vector space
• Features Selection
-Simple counting
-Statistics
• Text/Data Mining
-Classification(supervised)
-Clustering(unsupervised)
-Associations(relationships)
• Analyzing results
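The feature generation and feature selection steps above can be sketched in a few lines. The two sample documents and the threshold of 2 are hypothetical; the code only illustrates bag of words, selection by simple counting, and the resulting vector space representation.

```python
from collections import Counter

# Hypothetical, already cleaned documents.
docs = [
    "text mining finds patterns in text documents",
    "data mining finds patterns in large data sets",
]

# Features generation: bag of words (each word and its number of occurrences).
bags = [Counter(doc.split()) for doc in docs]

# Features selection by simple counting: keep terms seen at least twice overall.
totals = Counter()
for bag in bags:
    totals.update(bag)
vocabulary = sorted(term for term, n in totals.items() if n >= 2)

# Vector space: one count vector per document over the selected vocabulary.
vectors = [[bag[term] for term in vocabulary] for bag in bags]
```

With these inputs the vocabulary is `data, finds, in, mining, patterns, text`, and the documents become `[0, 1, 1, 1, 1, 2]` and `[2, 1, 1, 1, 1, 0]`. Vectors like these are what the classification or clustering step would then consume.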
TEXT MINING APPROACHES
• Text mining approaches differ in the inputs the text mining system takes and
the data mining tasks to be performed. In general, the major approaches, based
on the kinds of data they take as input, are:
(1) the keyword-based approach, where the input is a set of keywords or
terms in the documents,
(2) the tagging approach, where the input is a set of tags, and
(3) the information-extraction approach, where the input is semantic
information, such as events, facts, or entities uncovered by information
extraction.
TEXT MINING APPROACHES…(CONTD)
1) KEYWORD ASSOCIATION-BASED ANALYSIS:
This analysis collects sets of keywords or terms that frequently occur
together and then finds the association or correlation relationships among them.
E.g. [Stanford, University]
2) DOCUMENT CLASSIFICATION ANALYSIS:
Automated document classification is an important text mining task: with the
tremendous number of on-line documents in existence, it is tedious yet
essential to be able to organize such documents automatically into classes to
facilitate document retrieval and subsequent analysis. E.g. tagging
3) DOCUMENT CLUSTERING ANALYSIS:
Document clustering is one of the most crucial techniques for organizing
documents in an unsupervised manner.
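Keyword association analysis, as in the [Stanford, University] example, can be sketched by counting how often pairs of keywords co-occur. The keyword sets, the `MIN_SUPPORT` threshold, and the `confidence` helper are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets, one per document.
docs = [
    ["stanford", "university", "research"],
    ["stanford", "university", "campus"],
    ["university", "library"],
    ["stanford", "university", "library"],
]
MIN_SUPPORT = 3  # assumed threshold: a pair must co-occur in at least 3 documents

term_counts = Counter()
pair_counts = Counter()
for keywords in docs:
    unique = sorted(set(keywords))  # sort so each pair has one canonical order
    term_counts.update(unique)
    pair_counts.update(combinations(unique, 2))

# Keyword pairs that occur together frequently.
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_SUPPORT}


def confidence(a, b):
    """Fraction of the documents containing a that also contain b."""
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / term_counts[a]
```

Only `("stanford", "university")` reaches the threshold here. The association is asymmetric: `confidence("stanford", "university")` is 1.0, while `confidence("university", "stanford")` is 0.75.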
CHALLENGES OF TEXT MINING
• Information is in unstructured textual form.
• Large textual databases are difficult to mine.
• Relationships between concepts in text are complex and subtle.
• Word ambiguity and context sensitivity,
e.g. “windows” can mean either an operating system or openings in a wall that
let air flow into a house.
• Noisy data:
spelling mistakes and irrelevant data (outliers).
REFERENCES
[1] Jiawei Han (University of Illinois at Urbana-Champaign) and Micheline
Kamber, “Data Mining: Concepts and Techniques”, Second Edition.
[2] https://www.javatpoint.com/text-data-mining
[3] https://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf
  • 1. TEXT MINING BY THEJESWINI B.Tech CSE 3 Year SUBCODE:XCSE65 SUBNAME:DATA MINING
  • 2. CONTENTS INTRODUCTION DATA MINING vs TEXT MINING AREAS OF TEXT MINING INFORMATION RETRIEVAL TEXT MINING PROCESS TEXT MINING APPROACHES CHALLENGES OF TEXT MINING REFERNECES
  • 3. INTRODUCTION  Nowadays, there is a rapid growth in text databases due to many sources generating data in text.  Sources that generate text databases are : collections of documents from various sources - such as news articles, research papers, books, digital libraries, e-mail messages, and World Wide web(which can also be viewed as a huge, interconnected, dynamic text database) and also many government and business institutions also store their data in form of text.  Understanding that generated text patterns and obtaining useful and reliable information has become the main reason for text mining.
  • 4. INTRODUCTION...(CONTD)  Text mining is formally defined as process of extracting relevant information or pattern from different sources that are in unstructured or semi-structured format  Data stored in most text databases are semi structured data ,i.e. they are neither completely unstructured nor completely structured.  For example, a document may contain a few structured fields, such as title, authors, publication date, category, and so on, but also contain some largely unstructured text components, such as abstract and contents.
  • 5. DATA MINING vs TEXT MINING DATA MINING TEXT MINING It is the process of finding patterns and extracting useful data from large data sets. Is applied on data from from various text documents Applied on all types of data Applied on text data, which is mostly semi structured or unstructured Processing of data is done directly. Processing of data is done linguistically. Statistical techniques are used to evaluate data. Computational linguistic principles are used to evaluate text.
  • 6. AREAS OF TEXT MINING IR(Information Retrieval) NLP(Natural Language Processing) IE(Information Extraction) Data Mining Query based search on large text documents The development of the NLP application generally expect humans to "Speak" to them in a programming language that is accurate, clear, and exceptionally structured. Human speech is usually not authentic so that it can depend on many complex variables, including slang, social context, and regional dialects. The automatic extraction of structured data such as entities, entities relationships, and attributes describing entities from an unstructured source is called information extraction. Data mining refers to the extraction of useful data, hidden patterns from large data sets. Data mining tools can predict behaviors and future trends that allow businesses to make a better data-driven decision..
  • 7. INFORMATION RETRIEVAL
    Information retrieval is a method for retrieving information from a large number of text-based documents. Because text information is so abundant, information retrieval has found many applications. Many information retrieval systems exist, such as:
    - on-line library catalog systems,
    - on-line document management systems, and
    - the more recently developed Web search engines.
    A typical information retrieval problem is to locate relevant documents in a document collection based on a user’s query, which is often a few keywords describing an information need.
  • 8. INFORMATION RETRIEVAL…(CONTD)
    1. BASIC MEASURES OF INFORMATION RETRIEVAL
    There are two basic measures for assessing the quality of text retrieval:
    - Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses). It is formally defined as
      precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
    - Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved. It is formally defined as
      recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
    One commonly used trade-off is the F-score, which is defined as the harmonic mean of recall and precision:
      F-score = (recall × precision) / ((recall + precision) / 2)
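The three measures above can be checked on a toy example. The sketch below is only illustrative: the function name `precision_recall_f` and the document ids are assumptions made for this example, not part of the slides.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, f_score) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f_score = 0.0  # avoid dividing by zero when nothing relevant was retrieved
    else:
        # Harmonic mean, written in the same form as the slide's formula.
        f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical ids: four documents are relevant, three were retrieved.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
print(p, r, f)
```

Here two of the three retrieved documents are relevant, so precision is 2/3, recall is 2/4, and the F-score lies between the two.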
  • 9. INFORMATION RETRIEVAL…(CONTD)
    2. TEXT RETRIEVAL METHODS
    Information retrieval of text documents can be done by the following methods:
    - Document selection method: the query specifies constraints for selecting relevant documents. A typical method in this category is the “Boolean retrieval model”, in which a document is represented by a set of keywords and the user provides a Boolean expression of keywords, e.g. “car and repair shops” or “tea or coffee”.
    - Document ranking method: the query is used to rank all documents in order of relevance. The goal is to approximate the degree of relevance of a document with a score computed from information such as the frequency of words in the document and in the whole collection.
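The two retrieval methods above can be contrasted in a small sketch. The three toy documents and the helper names (`boolean_and`, `boolean_or`, `rank`) are assumptions for illustration. The ranking score used here is TF-IDF, which is one common way to combine word frequency in the document with frequency in the whole collection; the slides do not name a specific scoring formula.

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "d1": "car repair shops fix car engines",
    "d2": "tea and coffee shops",
    "d3": "car dealers sell tea",
}

# Document selection: each document is represented by its set of keywords.
keywords = {doc_id: set(text.split()) for doc_id, text in docs.items()}


def boolean_and(*terms):
    """Documents whose keyword set contains every term."""
    return {d for d, kws in keywords.items() if all(t in kws for t in terms)}


def boolean_or(*terms):
    """Documents whose keyword set contains at least one term."""
    return {d for d, kws in keywords.items() if any(t in kws for t in terms)}


def rank(query):
    """Document ranking: order all documents by a TF-IDF score for the query."""
    n = len(docs)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())  # term frequency in this document
        score = 0.0
        for term in query.split():
            df = sum(1 for kws in keywords.values() if term in kws)
            if df:  # skip terms that occur nowhere in the collection
                score += counts[term] * math.log(n / df)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


print(boolean_and("car", "repair", "shops"))
print(boolean_or("tea", "coffee"))
print(rank("car repair"))
```

Selection returns an unordered set of matching documents, while ranking returns every document in the collection ordered by score, including those with a score of zero.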
  • 10. INFORMATION RETRIEVAL…(CONTD)
    - The first step in most retrieval systems is to identify keywords for representing documents, a preprocessing step often called tokenization. To avoid indexing useless words, a text retrieval system often associates a “stop list” with a set of documents, i.e. a set of words deemed irrelevant, such as “a”, “the”, and “of”.
    - Text mining is a part of data mining.
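Tokenization with a stop list can be sketched in a few lines. The stop list below is a small hand-picked set chosen for this example, and the regular expression is one simple way to split text into lowercase word tokens; real systems use larger lists and more careful tokenizers.

```python
import re

# Hypothetical stop list for illustration only.
STOP_LIST = {"a", "an", "the", "of", "for", "with", "and", "or", "is", "to", "in"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text, stop_list=STOP_LIST):
    """Tokenize, then drop every token that appears in the stop list."""
    return [tok for tok in tokenize(text) if tok not in stop_list]


terms = index_terms("The repair of a car is tedious.")
print(terms)
```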
  • 11. TEXT MINING PROCESS
    • Text preprocessing
      - Syntactic/semantic text analysis (text cleanup, tokenization)
    • Feature generation
      - Bag of words (the words a document contains and their occurrences)
      - Vector space
    • Feature selection
      - Simple counting
      - Statistics
    • Text/data mining
      - Classification (supervised)
      - Clustering (unsupervised)
      - Associations (relationships)
    • Analyzing results
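The feature generation and feature selection steps listed above can be illustrated on two short documents. This is a minimal sketch under stated assumptions: the documents are made up, whitespace splitting stands in for real preprocessing, and "simple counting" is interpreted here as keeping terms whose total count reaches a threshold (`min_count` is a hypothetical parameter).

```python
from collections import Counter

# Hypothetical, already cleaned documents.
docs = [
    "text mining finds patterns in text",
    "data mining finds patterns in data sets",
]

# Feature generation, bag of words: each document becomes word -> occurrence count.
bags = [Counter(doc.split()) for doc in docs]

# Feature generation, vector space: one dimension per vocabulary term.
vocab = sorted(set().union(*bags))
vectors = [[bag[term] for term in vocab] for bag in bags]

# Feature selection by simple counting: keep terms seen at least min_count times overall.
min_count = 2
totals = sum(bags, Counter())
selected = [term for term in vocab if totals[term] >= min_count]

print(vocab)
print(vectors)
print(selected)
```

The resulting vectors, restricted to the selected terms, are the kind of input that the classification, clustering, and association steps would then work on.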
  • 12. TEXT MINING APPROACHES
    Text mining approaches differ in the inputs the text mining system takes and the data mining tasks to be performed. In general, the major approaches, based on the kinds of data they take as input, are:
    (1) the keyword-based approach, where the input is a set of keywords or terms in the documents,
    (2) the tagging approach, where the input is a set of tags, and
    (3) the information-extraction approach, where the input is semantic information, such as events, facts, or entities uncovered by information extraction.
  • 13. TEXT MINING APPROACHES…(CONTD)
    1) KEYWORD ASSOCIATION BASED ANALYSIS: This analysis collects sets of keywords or terms that frequently occur together and then finds the association or correlation relationships among them. E.g. [Stanford, University]
    2) DOCUMENT CLASSIFICATION ANALYSIS: Automated document classification is an important text mining task because, with the tremendous number of on-line documents, it is tedious yet essential to be able to organize such documents automatically into classes to facilitate document retrieval and subsequent analysis. E.g. tagging
    3) DOCUMENT CLUSTERING ANALYSIS: Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner.
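Keyword association analysis can be sketched by counting keyword pairs that occur together and keeping the frequent ones. The keyword sets, the `min_support` threshold, and the helper names below are assumptions for illustration; the pair counting is a simplified stand-in for a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets, one per document.
docs_keywords = [
    {"stanford", "university", "research"},
    {"stanford", "university", "campus"},
    {"university", "research"},
    {"stanford", "university"},
]


def frequent_pairs(keyword_sets, min_support):
    """Count co-occurring keyword pairs and keep those seen in >= min_support documents."""
    counts = Counter()
    for kws in keyword_sets:
        for pair in combinations(sorted(kws), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


def confidence(keyword_sets, a, b):
    """Fraction of documents containing keyword a that also contain keyword b."""
    with_a = [kws for kws in keyword_sets if a in kws]
    if not with_a:
        return 0.0
    return sum(1 for kws in with_a if b in kws) / len(with_a)


pairs = frequent_pairs(docs_keywords, min_support=3)
print(pairs)
print(confidence(docs_keywords, "stanford", "university"))
print(confidence(docs_keywords, "university", "stanford"))
```

In this toy collection only the pair (stanford, university) is frequent, and the two confidence values show that the association is stronger in one direction than in the other.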
  • 14. CHALLENGES OF TEXT MINING
    - Information is in unstructured textual form.
    - Textual databases are large, which makes text mining difficult to apply.
    - Relationships between concepts in text are complex and subtle.
    - Word ambiguity and context sensitivity: e.g. "windows" can mean either an operating system or an opening in a wall that lets air flow into the house.
    - Noisy data: spelling mistakes and irrelevant data (outliers).
  • 15. REFERENCES
    [1] Jiawei Han (University of Illinois at Urbana-Champaign) and Micheline Kamber, “Data Mining: Concepts and Techniques”, Second Edition
    [2] https://www.javatpoint.com/text-data-mining
    [3] https://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf