SlideShare a Scribd company logo
1 of 12
TEXT MINING
By
Yashvi Babariya
INTRODUCTION
 Text mining is a Discovery
 Also referred as Text Data
Mining (TDM) and
Knowledge Discovery in
Textual Database (KDT)
 To extract relevant
information or knowledge or
pattern from different
sources that are in
unstructured or semi-
structured form.
DATA MINING VS. TEXT MINING
Data Mining Text Mining
Process directly Linguistic processing or natural
language processing (NLP)
Identify causal relationship Discover heretofore unknown
information
Structured Data Semi-structured & unstructured
Data (Text)
Structured numeric transection
data residing in rational data
warehouse
Applications deal with much more
diverse and eclectic collections of
systems & formats
INPUT OUTPUT MODEL FOR TEXT MINING
STEPS FOR TEXT MINING
 Pre processing the text
 Applying text mining techniques
 Summarization
 Classification
 Clustering
 Visualization
 Information extraction
o Analyzing The Text
TEXT DATABASES & INFORMATION
RETIEVAL
 Text databases ( document databases)
 Large collections of documents from various sources
: news articles, research papers, books, digital libraries, e-mail
messages and web pages, library database etc.
 Data stored is usually semi-structured
 Information retrieval
 A field developed in parallel with database systems
 Information is organized into a large number of documents
 Information retrieval problem: locating relevant documents
based on user input, such as keywords or example
documents
TYPICAL INFPRMATION RETRIEVAL
PROBLEM
 To locate relevant documents in a document collection
based on a user’s query
 Some keywords describing an information need
 For ad hoc information need user takes the initiative to
“pull” the relevant information out from the collection
 For long-term information need, a retrieval system may
also take the initiative to push relevant to user’s need
 Such an information access process is called
information filtering
 Corresponding systems are called filtering systems or
recommender systems
INFORMATION RETRIEVAL
 Typical IR systems
 Online library catalogs
 Online document management systems
 Information retrieval vs. database
systems
 some DB problems are not present in IR, e.g.,
update, transections management, complex objects
 Some IR problems are not addressed well in
DBMS, e.g., unstructured documents, approximate
search using keywords and relevance
BASIC MEASURE FOR TEXT RETRIEVAL
 Suppose that a text retrieval system has just retrieved a
number of documents based on query
 Let the set of documents relevant to a query be denoted as
{relevant}
 The set of documents retrieved be donated as {retrieved}
 The set of documents that are both relevant and retrieved is
denoted as {relevant} n {retrieved}
 Precision: percentage of retrieved documents that are
in fact relevant to the query(i.e. “correct” response)
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 |
|{𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑
 Recall: percentage of documents that are relevant to the
query and were in fact, retrieved.
𝑟𝑒𝑐𝑎𝑙𝑙 =
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 |
| 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 |
 An information retrieval system often needs to trade off
recall for precision or vice versa.
 F-score, is harmonic mean of recall and precision
𝐹_𝑠𝑐𝑜𝑟𝑒 =
𝑟𝑒𝑐𝑎𝑙𝑙×𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛
(𝑟𝑒𝑐𝑎𝑙𝑙+𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛)/2
INFORMATION RETRIEVAL CONCEPT
 Basic concept
 A document can be described by a set of representative
keywords called index terms
 Different index terms varying relevance when used to
describe document content
 This effect is captured through the assignment of numerical
weights to each index term of a document
 DBMS Analogy
 Index terms → Attributes
 Weights → Attributes value
TEXT RETRIEVAL METHODS
 Document selection method
• Knowledge base retrieval
• The query is specifying constraints for selecting relevant
documents
• Boolean retrieval model- a document is represented by a set
of keywords
• User provides a Boolean expression of keywords such as “car
and repair shops”, “tea or coffee”
• Return documents that satisfy the boolean expression
• Difficulty in presenting a user’s information need exctly with a
boolean query
• Used when the user knows a lot about the document collection
and can formulate a good query

More Related Content

Similar to Text Mining.pptx

Text data mining1
Text data mining1Text data mining1
Text data mining1KU Leuven
 
Lec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemsLec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemssamiullahamjad06
 
DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT huma sh
 
Database Management System
Database Management SystemDatabase Management System
Database Management SystemTamur Iqbal
 
Ch-1-Introduction-to-Database.pdf
Ch-1-Introduction-to-Database.pdfCh-1-Introduction-to-Database.pdf
Ch-1-Introduction-to-Database.pdfMrjJoker1
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introductionnimmyjans4
 
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdfBahria University Islamabad, Pakistan
 
DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...
DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...
DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...Bahria University Islamabad, Pakistan
 
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdfBahria University Islamabad, Pakistan
 

Similar to Text Mining.pptx (20)

Citation Database
Citation Database Citation Database
Citation Database
 
Datamining
DataminingDatamining
Datamining
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
 
Lec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systemsLec20.pptx introduction to data bases and information systems
Lec20.pptx introduction to data bases and information systems
 
TAMUC LO 8
TAMUC LO 8TAMUC LO 8
TAMUC LO 8
 
DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT DATA RESOURCE MANAGEMENT
DATA RESOURCE MANAGEMENT
 
Database system Handbook 4th muhammad sharif.pdf
Database system Handbook 4th muhammad sharif.pdfDatabase system Handbook 4th muhammad sharif.pdf
Database system Handbook 4th muhammad sharif.pdf
 
Database system Handbook 4th muhammad sharif.pdf
Database system Handbook 4th muhammad sharif.pdfDatabase system Handbook 4th muhammad sharif.pdf
Database system Handbook 4th muhammad sharif.pdf
 
Database
DatabaseDatabase
Database
 
Database
DatabaseDatabase
Database
 
Database Management System
Database Management SystemDatabase Management System
Database Management System
 
Ch-1-Introduction-to-Database.pdf
Ch-1-Introduction-to-Database.pdfCh-1-Introduction-to-Database.pdf
Ch-1-Introduction-to-Database.pdf
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
 
DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...
DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...
DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...
 
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdfDatabase system Handbook 3rd DONE Complete DBMS book Full book.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
 
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
 
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdfDatabase system Handbook 3rd DONE Complete DBMS book Full book.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
 
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdfDatabase system Handbook 3rd DONE Complete DBMS book Full book.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
 
Database systems Handbook by Muhammad Sharif.pdf
Database systems Handbook by Muhammad Sharif.pdfDatabase systems Handbook by Muhammad Sharif.pdf
Database systems Handbook by Muhammad Sharif.pdf
 

Recently uploaded

Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 

Recently uploaded (20)

Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

Text Mining.pptx

  • 2. INTRODUCTION  Text mining is a Discovery  Also referred as Text Data Mining (TDM) and Knowledge Discovery in Textual Database (KDT)  To extract relevant information or knowledge or pattern from different sources that are in unstructured or semi- structured form.
  • 3. DATA MINING VS. TEXT MINING Data Mining Text Mining Process directly Linguistic processing or natural language processing (NLP) Identify causal relationship Discover heretofore unknown information Structured Data Semi-structured & unstructured Data (Text) Structured numeric transection data residing in rational data warehouse Applications deal with much more diverse and eclectic collections of systems & formats
  • 4. INPUT OUTPUT MODEL FOR TEXT MINING
  • 5. STEPS FOR TEXT MINING  Pre processing the text  Applying text mining techniques  Summarization  Classification  Clustering  Visualization  Information extraction o Analyzing The Text
  • 6. TEXT DATABASES & INFORMATION RETIEVAL  Text databases ( document databases)  Large collections of documents from various sources : news articles, research papers, books, digital libraries, e-mail messages and web pages, library database etc.  Data stored is usually semi-structured  Information retrieval  A field developed in parallel with database systems  Information is organized into a large number of documents  Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
  • 7. TYPICAL INFPRMATION RETRIEVAL PROBLEM  To locate relevant documents in a document collection based on a user’s query  Some keywords describing an information need  For ad hoc information need user takes the initiative to “pull” the relevant information out from the collection  For long-term information need, a retrieval system may also take the initiative to push relevant to user’s need  Such an information access process is called information filtering  Corresponding systems are called filtering systems or recommender systems
  • 8. INFORMATION RETRIEVAL  Typical IR systems  Online library catalogs  Online document management systems  Information retrieval vs. database systems  some DB problems are not present in IR, e.g., update, transections management, complex objects  Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
  • 9. BASIC MEASURE FOR TEXT RETRIEVAL  Suppose that a text retrieval system has just retrieved a number of documents based on query  Let the set of documents relevant to a query be denoted as {relevant}  The set of documents retrieved be donated as {retrieved}  The set of documents that are both relevant and retrieved is denoted as {relevant} n {retrieved}
  • 10.  Precision: percentage of retrieved documents that are in fact relevant to the query(i.e. “correct” response) 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = | 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 | |{𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑  Recall: percentage of documents that are relevant to the query and were in fact, retrieved. 𝑟𝑒𝑐𝑎𝑙𝑙 = | 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 ∩ 𝑅𝑒𝑡𝑟𝑖𝑒𝑣𝑒𝑑 | | 𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑡 |  An information retrieval system often needs to trade off recall for precision or vice versa.  F-score, is harmonic mean of recall and precision 𝐹_𝑠𝑐𝑜𝑟𝑒 = 𝑟𝑒𝑐𝑎𝑙𝑙×𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 (𝑟𝑒𝑐𝑎𝑙𝑙+𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛)/2
  • 11. INFORMATION RETRIEVAL CONCEPT  Basic concept  A document can be described by a set of representative keywords called index terms  Different index terms varying relevance when used to describe document content  This effect is captured through the assignment of numerical weights to each index term of a document  DBMS Analogy  Index terms → Attributes  Weights → Attributes value
  • 12. TEXT RETRIEVAL METHODS  Document selection method • Knowledge base retrieval • The query is specifying constraints for selecting relevant documents • Boolean retrieval model- a document is represented by a set of keywords • User provides a Boolean expression of keywords such as “car and repair shops”, “tea or coffee” • Return documents that satisfy the boolean expression • Difficulty in presenting a user’s information need exctly with a boolean query • Used when the user knows a lot about the document collection and can formulate a good query