ADDIS ABABA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
SCHOOL OF INFORMATION SCIENCE
February 2011
Presentation Outline
• Definition
• Related Research Areas
• Architecture
• TM Process
• Techniques
• Applications
• Pros and Cons
– Advantages
– Challenges/Limitations
• Conclusion
• Recommendations (Future of Text Mining)
Introduction and Definitions
• Mining is the process of searching for patterns
within structured or unstructured data.
• Text Mining is the discovery by computer of new,
previously unknown information, by
automatically extracting useful information from
different written resources.
• Text mining, also known as document mining, is
an emerging technology for analyzing large
collections of unstructured documents for the
purpose of extracting interesting and non-trivial
(important) patterns or knowledge.
Related Fields of Study
Field            Database Type   Search Mode     Atomic Entity
Data Retrieval   Structured      Goal-driven     Data record
Info. Retrieval  Unstructured    Goal-driven     Document
Data Mining      Structured      Opportunistic   Numbers and dimensions
Text Mining      Unstructured    Opportunistic   Language feature or concept
Table 1: Summary of differences among related fields of text mining
Figure 1: The relation and difference of text mining with other fields
General Architecture of Text Mining Systems
(Feldman and Sanger, 2007)
• Four main areas:
1. Preprocessing tasks: convert the information from each
original data source into a canonical (standard, common)
format.
2. Core mining operations: “the heart of a TMS”; they include
pattern discovery, trend analysis, and incremental
knowledge discovery algorithms.
3. Presentation layer components: include the GUI and pattern
browsing functionality as well as access to the query
language. Visualization tools and user-facing query editors
and optimizers also fall under this architectural category.
4. Refinement techniques (post-processing): include methods
that filter redundant information and cluster closely related
data.
Figure 2: System architecture for generic text mining system
Figure 3: System architecture for an advanced or domain-oriented text mining system
Figure 4: System architecture for an advanced text mining system with background knowledge base
TM Process (Vidhya and Aghila, 2010)
Figure 5: Text Mining Process
Labels in the figure:
• Document Collection
• Retrieve and Pre-process Document
1. Tokenize
2. Remove stop words
3. Stem
• Feature Selection
• Feature Generation
• TM Techniques: Classification, Clustering, Information Retrieval, Information Extraction, Summarization, Topic Discovery
• Knowledge
• Management Information Systems
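The three pre-processing steps named in Figure 5 (tokenize, remove stop words, stem) can be sketched as follows. This is a minimal illustration, not the method of Vidhya and Aghila: the stop-word list, the suffix list, and all function names are assumptions made for the example, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes removed by the crude stemmer below (assumption, not Porter's rules).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the three steps in order: tokenize, remove stop words, stem."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("The miners are extracting patterns in the documents"))
```

The resulting list of stems is the kind of canonical form that later feature selection and mining steps would consume.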
Text Mining Techniques
The major TM techniques:
• Categorization
• Clustering
• Summarization
• Question answering: deals with how to find the best answer to a given
question
• Concept linkage: connects related documents by identifying their commonly
shared concepts
• Information extraction: identifies key phrases and relationships within text
• Topic tracking: a topic tracking system keeps user profiles
and, based on the documents the user views, predicts other documents of
interest to the user
• Association detection: focuses on studying the relationships and
implications among topics, or descriptive concepts, which are used to
characterize a set of related texts
• Information visualization: puts large textual sources in a visual hierarchy or
map and provides browsing capabilities.
The user can interact with the document map by zooming, scaling, and
creating sub-maps.
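As a rough illustration of concept linkage, the sketch below links documents that share a minimum number of concepts. It rests on strong simplifying assumptions: a "concept" is taken to be any word longer than three letters, and the function names, sample documents, and the min_shared threshold are all hypothetical. A real system would derive concepts from a thesaurus or an information extraction step.

```python
from itertools import combinations


def concepts(text):
    """Treat each lowercase word longer than three letters as a concept (assumption)."""
    return {w for w in text.lower().split() if len(w) > 3}


def link_documents(docs, min_shared=2):
    """Return (id_a, id_b, shared_concepts) for every pair sharing enough concepts."""
    sets = {doc_id: concepts(text) for doc_id, text in docs.items()}
    links = []
    for a, b in combinations(sorted(sets), 2):
        shared = sets[a] & sets[b]
        if len(shared) >= min_shared:
            links.append((a, b, sorted(shared)))
    return links


if __name__ == "__main__":
    sample = {
        "d1": "gene expression in cancer cells",
        "d2": "cancer therapy targets gene mutation",
        "d3": "stock market trend analysis",
    }
    for a, b, shared in link_documents(sample):
        print(a, "<->", b, "via", shared)
```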
Text Mining Applications
Text Mining: General Applications
• Relationship Analysis
– If A is related to B, and B is related to C, there is potentially a relationship between A
and C.
• Trend analysis
– Occurrences of A peak in October.
• Mixed applications
– Co-occurrences of A together with B peak in November.
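The three general applications above can be sketched on toy data. This is only an illustration under stated assumptions: documents are (month, text) pairs, terms are matched as whole lowercase words, relations are given as explicit pairs, and every function name and sample value is hypothetical.

```python
from collections import Counter


def indirect_relations(pairs):
    """Relationship analysis: if A-B and B-C are related, propose A-C."""
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    found = set()
    for linked in neighbours.values():
        for a in linked:
            for c in linked:
                # Keep each unordered pair once, and only if not already related.
                if a < c and c not in neighbours[a]:
                    found.add((a, c))
    return found


def monthly_counts(docs, term):
    """Trend analysis: number of documents per month that mention the term."""
    counts = Counter()
    for month, text in docs:
        if term in text.lower().split():
            counts[month] += 1
    return counts


def monthly_cooccurrence(docs, term_a, term_b):
    """Mixed application: documents per month that mention both terms."""
    counts = Counter()
    for month, text in docs:
        words = set(text.lower().split())
        if term_a in words and term_b in words:
            counts[month] += 1
    return counts


def peak_month(counts):
    """Month with the highest count, or None if nothing was counted."""
    return max(counts, key=counts.get) if counts else None


if __name__ == "__main__":
    docs = [
        ("Sep", "flu cases rising"),
        ("Oct", "flu outbreak reported"),
        ("Oct", "flu vaccine shortage"),
        ("Oct", "flu season peaks"),
        ("Nov", "flu vaccine arrives"),
        ("Nov", "vaccine and flu news"),
    ]
    print(indirect_relations({("A", "B"), ("B", "C")}))
    print(peak_month(monthly_counts(docs, "flu")))
    print(peak_month(monthly_cooccurrence(docs, "flu", "vaccine")))
```

Proposed indirect relations are only candidates; as the slide says, the relationship between A and C is potential and would still need to be verified.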
Text Mining: Business Applications
• Example 1: Decision Support in CRM
– What are customers’ typical complaints?
• Example 2: Personalization in eCommerce
– Suggest products that fit a user’s interest profile
Major Advantage
Text mining provides a competitive edge for a company to process
and take advantage of a large quantity of textual information.
Other Application Areas of TM
• Security applications
• Biomedical applications
• Software and applications
• Online media applications
• Marketing applications
• Movie analysis
• Academic applications
• Internet search engines
• Call center specialists
• Lawyers, insurers and venture capitalists
• Research
• Intelligent email routing
Commercial applications
• AeroText
• Clarabridge
• Technologies
• Endeca
• Expert System S.p.A.
• Fair Isaac
• SAS
• IBM SPSS
• StatSoft
Free open-source applications
• Carrot2
• GATE
• OpenNLP
• Natural Language Toolkit (NLTK)
• RapidMiner
• tm: Text Mining Package (for R)
Challenges of Text Mining
Analytical Challenges
• Soft matching: the same entity may appear under different surface forms
Example:
Spelling variants – Wal-mart, Walmart
Company names in short form – ClearForest instead of ClearForest Corporation
Use of abbreviations – EDS instead of Electronic Data Systems Corporation
• Summarization: may create erroneous and senseless output
• Temporal resolution: most business documents are time-dependent and may
expire after a certain period of time
• Uniqueness resolution: when processing a large number of documents, it is
possible to identify many features and events that resemble one another
Example: when the same name appears in different documents
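The soft matching cases listed above (spelling variants, short forms, abbreviations) can be approximated with simple normalization plus a string similarity score. The sketch below is one possible approach, not a method described in the slides: the list of legal-form words, the 0.85 threshold, and the function names are assumptions, and the similarity score comes from Python's standard difflib module.

```python
import re
from difflib import SequenceMatcher

# Legal-form words dropped from the end of a name (illustrative assumption).
LEGAL_SUFFIXES = {"corporation", "corp", "inc", "ltd"}


def normalize(name):
    """Lowercase, drop punctuation, and remove trailing legal-form words."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def acronym(name):
    """First letters of the normalized words, e.g. 'eds'."""
    return "".join(w[0] for w in normalize(name).split())


def soft_match(a, b, threshold=0.85):
    """True if the two names plausibly refer to the same entity."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return True
    if na == acronym(b) or nb == acronym(a):
        return True
    return SequenceMatcher(None, na, nb).ratio() >= threshold


if __name__ == "__main__":
    print(soft_match("Wal-mart", "Walmart"))
    print(soft_match("ClearForest", "ClearForest corporation"))
    print(soft_match("EDS", "Electronic Data Systems Corporation"))
    print(soft_match("Walmart", "ClearForest"))
```

Acronym matching of this kind is prone to false positives on short names, so a real system would likely combine it with context or a curated alias list.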
Linguistic Challenges
• Anaphora resolution: the ability to resolve co-references
Example: resolving pronouns such as “he”, “she”, “we”, etc.
• Full parsing vs. shallow parsing
Conclusion
• TM, also known as Text Data Mining or KDT, refers
generally to the process of extracting interesting
and non-trivial information and knowledge from
unstructured text.
• Text mining is an interdisciplinary field which
draws on information retrieval, data mining,
machine learning, statistics and computational
linguistics.
• The motivation for TM is that most information (over 90%) is
stored as text
• TM has many applications in different sectors
• There are different TM techniques, but each technique faces a
number of implementation challenges
Recommendations
• Personalized autonomous mining: Current text mining
products and applications are still tools designed for
trained knowledge specialists
• Multilingual text refining: it is essential to develop text
refining algorithms that process multilingual text
documents and produce language-independent intermediate
forms
• Stronger integration and bigger overlap between text
mining, information retrieval, natural language processing
and software engineering
• Domain knowledge integration: current text mining tools do not
incorporate domain knowledge

Text Mining

  • 1. ADDIS ABABA UNIVERSITY SCHOOL OF GRADUATE STUDIES SCHOOL OF INFORMATION SCIENCE February 2011
  • 2. Presentation Outline • Definition • Related Research Areas • Architecture • TM Process • Techniques • Applications • Pros and Cons – Advantages – Challenges/ Limitations • Conclusion • Recommendations /Future of Text Mining/
  • 3. Introduction and Definitions • Mining is the process of inferring for patterns with in a structured or unstructured data. • Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting useful information from different written resources. • Text mining, also known as document mining, is an emerging technology for analyzing large collection of unstructured documents for the purposes of extracting interesting and non-trivial (important) patterns or knowledge.
  • 4. Related Fields of Study Database Type Search Mode Atomic entity Data Retrieval Structured Goal-driven Data Record Info. Retrieval Unstructured Goal-driven Document Data Mining Structured Opportunistic Numbers and Dimensions Text Mining Unstructured Opportunistic Language feature or concept Table 1: Summary of difference among related fields of Text mining Figure 1: The relation and difference of text mining with other fields
  • 5. General Architecture of Text Mining Systems (Feldman and Sanger, 2007) • four main areas: 1. Preprocessing tasks: convert the information from each original data source into a canonical (recognized or official) format. 2. Core Mining Operations: “the heart of a TMS” and include pattern discovery, trend analysis, and incremental knowledge discovery algorithms. 3. Presentation Layer Components: include GUI and pattern browsing functionality as well as access to the query language. Visualization tools and user-facing query editors and optimizers also fall under this architectural category. 4. Refinement Techniques (post-processing): include methods that filter redundant information and cluster closely related data
  • 6. Figure 2: System architecture for generic text mining system Figure 3: System architecture for an advanced or domain-oriented text mining system Figure 4: System architecture for an advanced text mining system with background knowledge base
  • 7. TM Process (Vidhya and Aghila, 2010) Figure 5: Text Mining Process. Diagram labels: Document Collection; Retrieve and Pre-process Document; Feature Selection; Feature Generation; Classification; Clustering; TM Techniques; Management Information Systems; Knowledge; Information Retrieval; Information Extraction; Summarization; Topic Discovery. Pre-processing steps: 1. Tokenize 2. Remove stop words 3. Stem
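The three pre-processing steps named on slide 7 (tokenize, remove stop words, stem) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, and the stemmer is a naive suffix stripper, not the Porter algorithm or whatever stemmer the cited authors had in mind.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "the", "to", "with"}

# Suffixes tried by the toy stemmer, in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the three steps in order: tokenize, remove stop words, stem."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Text mining is the discovery of patterns in documents"))
```

The crude stemmer reduces "mining" to "min", which shows why practical systems tend to rely on a tested stemmer or lemmatizer rather than ad hoc suffix rules.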
  • 8. Text Mining Techniques The major TM techniques: • Categorization • Clustering • Summarization • Question answering: deals with how to find the best answer to a given question • Concept linkage: connects related documents by identifying their commonly shared concepts • Information extraction: identifies key phrases and relationships within text • Topic tracking: a topic tracking system keeps user profiles and, based on the documents the user views, predicts other documents of interest to the user • Association detection: studies the relationships and implications among topics, or descriptive concepts, which are used to characterize a set of related texts • Information visualization: puts large textual sources in a visual hierarchy or map and provides browsing capabilities. The user can interact with the document map by zooming, scaling, and creating sub-maps
  • 9. Text Mining Applications Text Mining: General Applications • Relationship analysis – If A is related to B, and B is related to C, there is potentially a relationship between A and C. • Trend analysis – Occurrences of A peak in October. • Mixed applications – Co-occurrences of A together with B peak in November. Text Mining: Business Applications • Example 1: Decision support in CRM – What are customers’ typical complaints? • Example 2: Personalization in eCommerce – Suggest products that fit a user’s interest profile. Major Advantage Text mining gives a company a competitive edge by letting it process and take advantage of a large quantity of textual information.
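The relationship-analysis idea on slide 9 (A related to B, B related to C, so possibly A related to C) can be sketched as follows. The sketch assumes, purely for illustration, that "related" means "mentioned in the same document" and that each document has already been reduced to a list of concepts; the function names are hypothetical.

```python
from itertools import combinations


def cooccurring_pairs(documents):
    """Return the concept pairs that appear together in at least one document."""
    pairs = set()
    for concepts in documents:
        # Sorting makes every pair canonical: (a, b) with a < b.
        for a, b in combinations(sorted(set(concepts)), 2):
            pairs.add((a, b))
    return pairs


def candidate_links(pairs):
    """Suggest pairs (a, c) that never co-occur directly but share a neighbour b."""
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    candidates = set()
    for linked in neighbours.values():
        for a, c in combinations(sorted(linked), 2):
            if (a, c) not in pairs:
                candidates.add((a, c))
    return candidates


docs = [["A", "B"], ["B", "C"]]
print(candidate_links(cooccurring_pairs(docs)))
```

The output is only a list of candidate relationships; as the slide says, the link between A and C is potential and would still need to be checked by an analyst.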
  • 10. Other Application Areas of TM • Security applications • Biomedical applications • Software and applications • Online media applications • Marketing applications • Movie analysis • Academic applications • Internet search engine • Call center specialists • Lawyers, insurers and venture capitalists • Researching • Intelligent email routing Commercial applications • AeroText • Clarabridge • Technologies • Endeca • Expert System S.p.A. • Fair Isaac • SAS • IBM SPSS • StatSoft Free open-source applications • Carrot2 • GATE • OpenNLP • Natural Language Toolkit (NLTK) • RapidMiner • tm: Text Mining Package
  • 11. Challenges of Text Mining Analytical Challenges • Soft matching. Examples: spelling variants (Wal-mart, Walmart); company names in short form (ClearForest instead of ClearForest Corporation); abbreviations (EDS instead of Electronic Data Systems Corporation) • Summarization: may create erroneous and senseless output • Temporal resolution: most business documents are time-dependent and may expire after a certain period of time • Uniqueness resolution: when processing a large number of documents, it is possible to identify many features and events that resemble one another. Example: the same name appears in different documents Linguistic Challenges • Anaphora resolution: the ability to resolve co-references. Example: resolving pronouns such as “he”, “she”, “we”, etc. • Full parsing vs. shallow parsing
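The soft-matching examples on slide 11 can be illustrated with a small name normalizer. This is a minimal sketch under assumptions: the alias table and the list of legal-form words are hypothetical and hand-written, and true misspellings would additionally need approximate string matching (for example an edit-distance threshold), which is not shown here.

```python
import re

# Hypothetical alias table; abbreviations generally cannot be inferred from spelling.
ALIASES = {"eds": "electronic data systems"}

# Legal-form words dropped before comparison (illustrative, not exhaustive).
LEGAL_SUFFIXES = {"corporation", "corp", "inc", "ltd", "company", "co"}


def normalize(name):
    """Map a company name to a canonical key for comparison."""
    # Lowercase, remove hyphens, and keep only alphanumeric words.
    words = re.findall(r"[a-z0-9]+", name.lower().replace("-", ""))
    words = [w for w in words if w not in LEGAL_SUFFIXES]
    key = " ".join(words)
    # Expand a known abbreviation to its full form, if there is one.
    return ALIASES.get(key, key)


def same_entity(a, b):
    """Treat two names as the same entity when their canonical keys match."""
    return normalize(a) == normalize(b)


print(same_entity("Wal-mart", "Walmart"))
print(same_entity("ClearForest", "ClearForest corporation"))
print(same_entity("EDS", "Electronic Data Systems Corporation"))
```

A lookup table like this has to be maintained by hand, which is part of why the slide lists soft matching as an open analytical challenge.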
  • 12. Conclusion • TM, also known as Text Data Mining or Knowledge Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. • Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. • The motivation for TM is that most of the world's information (over 90%) is stored as text. • TM has many applications in different sectors. • There are different TM techniques, but each technique faces a number of implementation challenges.
  • 13. Recommendations • Personalized autonomous mining: current text mining products and applications are still tools designed for trained knowledge specialists • Multilingual text refining: it is essential to develop text refining algorithms that process multilingual text documents and produce language-independent intermediate forms • Stronger integration and bigger overlap between text mining, information retrieval, natural language processing and software engineering • Domain knowledge integration: current text mining tools make no provision for domain knowledge

Editor's Notes

  1. Text mining is also known as text data mining, document mining, Knowledge Discovery in Text (KDT), and Knowledge Text Analysis. The first workshops were held at the International Machine Learning Conference in July 1999 and the International Joint Conference on Artificial Intelligence in August 1999.
  2. Motivation: approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation).