WEBINAR ON TEXT MINING
Presented by
Dr. A. MUTHUSAMY
Head, Associate Professor
Department of Computer Science with Cognitive Systems
Dr. N.G.P. Arts and Science College
Coimbatore-641 048
Tamil Nadu, India
Mobile: +91 8675390072, E-mail: muthusamy@drngpasc.ac.in
TM
OUTLINE
Dr. NGPASC
COIMBATORE | INDIA
Introduction - Data & Its Types
Need of Text Mining
Objectives of Text Mining
Search Vs Discover
Relevant Disciplines
Process & Tools - Text Mining
Benefits & Applications of TM
Conclusion
INTRODUCTION
ABOUT DATA
• Data - collection of raw facts & symbols
• Data is collected from multiple sources
• Data can take the following forms (with a typical source):
  ◦ Text – alphabetic characters
  ◦ Number – digits 0 to 9
  ◦ Image – scanner
  ◦ Audio – microphone
  ◦ Video – camera
  ◦ Sensor readings
• Every organization has its own specific data, which it uses to perform certain operations
• Data gives the status of past activities and enables us to make decisions
• Information Processing – converting data into information
REAL TIME SCENARIOS
• Many individuals use some form of computing every day
whether they realize it or not
• Swiping a debit card, sending an email, or using a mobile phone
can all be considered forms of computing
DATA VARIETIES
Unstructured
• Data that has no inherent structure and is usually stored as different types of files
• Examples: text documents, PDFs, images & videos
Quasi-structured
• Textual data with erratic formats that can be formatted with effort & software tools
• Example: clickstream data
Semi-structured
• Textual data files with an apparent pattern, enabling analysis
• Examples: spreadsheets & XML files
Structured
• Data having a defined data model, format & structure
• Example: database
[Figure: Structured vs. Unstructured data]
NEED OF TEXT MINING
• Approximately 90% of the world’s data is held in unstructured formats
(source: Oracle Corporation)
• Most of the information in the public and private sectors is stored
electronically in the form of text databases
• Information-intensive business processes demand that we move beyond
simple document retrieval to Knowledge Discovery
Why do cats sit on mats?
• It would be impossible to read all the millions of research articles on
the topic yourself
• Text mining helps to filter large amounts of research and extract the
relevant information
• It helps to identify that ‘cat’ is the noun, ‘sit’ is the verb and ‘on’ is the
preposition
• It is not just a search tool: it can also recognize that a ‘cat’ is an
animal, ‘sit’ is an action, and a ‘mat’ is an object
• It identifies and maps patterns and trends across millions of articles
Answer: Most of the cats that sit on mats come from cold climates.
TEXT MINING
• Also known as Knowledge Discovery from Text (KDT): analyzing large collections
of written resources to generate valuable information from unstructured formats
• It transforms unstructured text into structured data
• It applies algorithms from data mining (DM), machine learning (ML), statistics,
and NLP to extract high-quality, useful information
• It exploits the information contained in textual documents in various ways,
including discovery of patterns and trends in data, associations among
entities, predictive rules, etc.
[Figure: Framework to create a dictionary]
SEARCH Vs DISCOVER
Data Retrieval
Find records within a structured database
• Database Type : Structured
• Search Mode : Goal-driven
• Atomic Entity : Data record
• Information Need : Find a Japanese restaurant in Bangalore that
serves vegetarian food
• Query : SELECT * FROM restaurants WHERE city = 'Bangalore'
AND type = 'Japanese' AND has_veg = true
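The Data Retrieval query above can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the `restaurants` schema, the restaurant names, and the use of 1/0 for `has_veg` are assumptions made for illustration, not part of the original slides.

```python
import sqlite3

# Hypothetical in-memory table; schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (name TEXT, city TEXT, type TEXT, has_veg INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        ("Sakura House", "Bangalore", "Japanese", 1),
        ("Tokyo Grill", "Bangalore", "Japanese", 0),
        ("Spice Route", "Bangalore", "Indian", 1),
        ("Kyoto Corner", "Chennai", "Japanese", 1),
    ],
)

# Goal-driven query: every condition is an exact match on a structured field.
rows = conn.execute(
    "SELECT name FROM restaurants "
    "WHERE city = ? AND type = ? AND has_veg = 1 ORDER BY name",
    ("Bangalore", "Japanese"),
).fetchall()
matches = [name for (name,) in rows]
print(matches)
```

A record either satisfies all three conditions or it is not returned; there is no notion of a partially relevant result.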
Information Retrieval
Find relevant information in an unstructured information source (usually text)
• Database Type : Unstructured
• Search Mode : Goal-driven
• Atomic Entity : Document
• Information Need : Find a Japanese restaurant in Bangalore
• Query : “Japanese restaurant Bangalore” or
Bangalore -> Restaurants -> Japanese
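A keyword query of this kind can be illustrated with a very small ranking function. The sketch below assumes a made-up three-document collection and scores each document by the number of distinct query terms it contains; real retrieval systems use far richer scoring.

```python
import re

# Hypothetical mini-collection; the documents are invented for illustration.
documents = {
    "doc1": "Sakura House is a Japanese restaurant in Bangalore.",
    "doc2": "A review of Italian restaurants in Bangalore.",
    "doc3": "Japanese gardens to visit in Kyoto.",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def search(query, docs):
    """Rank documents by how many distinct query terms they contain."""
    terms = set(tokenize(query))
    scores = {}
    for doc_id, text in docs.items():
        overlap = len(terms & set(tokenize(text)))
        if overlap:
            scores[doc_id] = overlap
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


results = search("Japanese restaurant Bangalore", documents)
print(results)
```

Unlike the structured query, the result is a ranked list of documents, and partially matching documents are still returned.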
Data Mining
Discover new knowledge through analysis of data
• Database Type : Structured
• Search Mode : Opportunistic
• Atomic Entity : Numbers and dimensions
• Information Need : Show the trend over time in the number of visits to
Japanese restaurants in Bangalore
• Query : SELECT date, SUM(visits) FROM restaurants WHERE city = 'Bangalore'
AND type = 'Japanese' GROUP BY date ORDER BY date
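A trend over time needs one total per date, which in SQL means grouping by the date column. The sketch below shows this with sqlite3; the table layout, the monthly `date` values, and the visit counts are all hypothetical.

```python
import sqlite3

# Hypothetical visit log; schema and numbers are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants "
    "(name TEXT, city TEXT, type TEXT, date TEXT, visits INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?, ?)",
    [
        ("Sakura House", "Bangalore", "Japanese", "2024-01", 120),
        ("Tokyo Grill", "Bangalore", "Japanese", "2024-01", 80),
        ("Sakura House", "Bangalore", "Japanese", "2024-02", 150),
        ("Tokyo Grill", "Bangalore", "Japanese", "2024-02", 95),
        ("Spice Route", "Bangalore", "Indian", "2024-01", 300),
    ],
)

# Aggregate visits per date so that the result is a time series, not one number.
trend = conn.execute(
    "SELECT date, SUM(visits) FROM restaurants "
    "WHERE city = ? AND type = ? GROUP BY date ORDER BY date",
    ("Bangalore", "Japanese"),
).fetchall()
print(trend)
```

The atomic entities here are numbers (visit totals) along a dimension (date), not individual records.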
Text Mining
Discover new knowledge through analysis of text
• Database Type : Unstructured
• Search Mode : Opportunistic
• Atomic Entity : Language feature or concept
• Information Need : Find the types of food poisoning most often
associated with Japanese restaurants
• Query : Rank the diseases found associated with “Japanese
restaurants”
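One simple way to approximate such a ranking is to count how often each concept co-occurs with the anchor phrase. The sketch below assumes an invented set of report snippets and a hand-made disease dictionary; both are placeholders, and plain substring matching stands in for real concept extraction.

```python
from collections import Counter

# Hypothetical inputs: a tiny invented corpus and a hand-made disease dictionary.
reports = [
    "Salmonella outbreak traced to a Japanese restaurant downtown.",
    "Two diners at a Japanese restaurant reported norovirus symptoms.",
    "Norovirus cases linked to a Japanese restaurant buffet.",
    "Salmonella found at an Italian restaurant.",
]
diseases = ["salmonella", "norovirus", "listeria"]


def rank_associated(texts, anchor, concepts):
    """Count, per concept, the texts that mention both the anchor and the concept."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        if anchor in lowered:
            for concept in concepts:
                if concept in lowered:
                    counts[concept] += 1
    # most_common() returns (concept, count) pairs, highest count first.
    return counts.most_common()


ranking = rank_associated(reports, "japanese restaurant", diseases)
print(ranking)
```

The answer is not stored in any single document; it emerges only from counting across the whole collection.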
RELEVANT DISCIPLINES
Information Extraction (IE)
• The process of automatically obtaining structured data from unstructured
data
Natural Language Processing (NLP)
• The study of human language; a primary component of Artificial Intelligence (AI)
• Computational techniques for the analysis & synthesis of natural language &
speech
Data Mining (DM)
• The extraction of useful information and hidden patterns from large data
sets
• Data mining tools can predict behaviors & future trends, allowing
businesses to make better data-driven decisions
Information Retrieval (IR)
• The process of obtaining information system resources that are
relevant to an information need from a collection of those resources
• IR is an extension of document extraction
• It helps to narrow down the set of records that are associated with a
specific problem
Example
Named Entity Recognition
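As a rough illustration of named entity recognition, the sketch below tags entities by looking them up in a small dictionary (a gazetteer). The dictionary, the labels, and the sample sentence are assumptions for the example; practical NER systems rely on trained models and context rather than exact lookup.

```python
import re

# Hypothetical gazetteer (entity dictionary), invented for illustration.
GAZETTEER = {
    "Coimbatore": "LOCATION",
    "Tamil Nadu": "LOCATION",
    "India": "LOCATION",
    "Oracle Corporation": "ORGANIZATION",
}


def find_entities(text):
    """Return (entity, label, start offset) for each gazetteer entry found in text."""
    found = []
    for name, label in GAZETTEER.items():
        # \b keeps the match on whole words, e.g. "India" but not "Indiana".
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, label, match.start()))
    # Report entities in the order they appear in the text.
    return sorted(found, key=lambda item: item[2])


sentence = "The report from Oracle Corporation was discussed in Coimbatore, Tamil Nadu, India."
entities = find_entities(sentence)
for name, label, _ in entities:
    print(name, "->", label)
```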
PROCESS OF TEXT MINING
Extract Documents
• Extract documents from different external sources
• Web pages
Corpus
• Document storage
• Handle multiple documents at a time
Text Transformation
• Tokenization, PoS tagging
• Stemming, lemmatization & stop-word removal
Extract Features
• Bag of words
• Vector space
Reduce Dimensions
• TF-IDF
• SVD & LDA
Data Mining
• Supervised
• Unsupervised
EXAMPLES
Tokenization
• Tokenization is the process of splitting a sentence into tokens, typically words and punctuation marks
[Figure: Tokenization of a sentence]
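A minimal sketch of tokenization using a regular expression is given below. Treating each punctuation mark as its own token is an assumption of this example; tokenizers differ on such details.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = tokenize("Why do cats sit on mats?")
print(tokens)  # ['Why', 'do', 'cats', 'sit', 'on', 'mats', '?']
```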
Stemming and Lemmatization
• Stemming and lemmatization are methods for normalizing text documents
• Text normalization reduces the vocabulary (by collapsing inflectional
forms) and improves the accuracy of many language modeling tasks
• Both convert a word to a base form: stemming strips affixes by rule, while lemmatization maps the word to its dictionary form (lemma)
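The difference between the two can be seen in a toy example. The suffix list and the lemma table below are invented for illustration and are far cruder than established stemmers or lemmatizers; the point is only that a stem need not be a real word, while a lemma is a dictionary form.

```python
# Toy suffix rules and a hand-made lemma table, both invented for illustration.
SUFFIXES = ("ing", "ies", "es", "ed", "s")
LEMMAS = {"sat": "sit", "mice": "mouse", "better": "good", "studies": "study"}


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMAS.get(word, word)


words = ["cats", "sitting", "studies", "sat", "mice"]
stems = [stem(w) for w in words]
lemmas = [lemmatize(w) for w in words]
print(stems)   # ['cat', 'sitt', 'stud', 'sat', 'mice']
print(lemmas)  # ['cats', 'sitting', 'study', 'sit', 'mouse']
```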
PoS Tagging
• Processing text and assigning a part of speech to each word
• Tags vary between tag sets; typical tags include Noun, Verb, Preposition,
Determiner, etc.
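The idea of PoS tagging can be sketched with a plain dictionary lookup. The lexicon below is hand-made for this one sentence and is an assumption of the example; real taggers use the surrounding context to resolve words that can take more than one tag.

```python
# Hand-made lexicon for illustration only.
LEXICON = {
    "the": "Determiner",
    "a": "Determiner",
    "cat": "Noun",
    "mat": "Noun",
    "sits": "Verb",
    "on": "Preposition",
}


def pos_tag(tokens, default="Unknown"):
    """Pair each token with its lexicon tag, or the default when it is missing."""
    return [(token, LEXICON.get(token.lower(), default)) for token in tokens]


tagged = pos_tag(["The", "cat", "sits", "on", "a", "mat"])
for token, tag in tagged:
    print(token, "->", tag)
```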
Reduce Dimensions
• Each row of the document-term matrix represents one document as a
high-dimensional vector (each dimension records the occurrence of one term)
• Term occurrence can be weighted by TF-IDF
• Reasons for reducing dimensions:
  ◦ To reduce processing time & memory use
  ◦ To transform vectors from term space to topic space, so that documents on
similar topics are associated with each other even when they use different terms
• Example: "pet" and "cat" are mapped to the same topic based on their
co-occurrence
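The TF-IDF weighting mentioned above can be sketched in plain Python. The version below uses relative term frequency multiplied by log(N / df), which is one common variant among several; the three tiny documents are invented. It covers only the weighting step, not the projection into topic space performed by methods such as SVD or LDA.

```python
import math
from collections import Counter

# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["cat", "sits", "on", "mat"],
    ["pet", "cat", "sleeps"],
    ["stock", "prices", "rise"],
]


def tf_idf(corpus):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append(
            {
                term: (count / len(doc)) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return weights


weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "cat" receives a lower weight than "mat" because "cat" also appears in another document and is therefore less distinctive.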
Data Exploration
• Term frequencies are computed and presented as word clouds & histograms to
show the significance of the terms
• To check the correctness of the algorithm, include statistical tests within the
model
Data Visualization
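A term-frequency histogram can be approximated in plain text without any plotting library. The token list below is a made-up placeholder for the output of the text transformation step, and the bar format is just one possible layout.

```python
from collections import Counter

# Hypothetical token list; in practice this comes from the transformed corpus.
tokens = ["cat", "mat", "cat", "sit", "cat", "mat"]


def text_histogram(words, top_n=3):
    """Return one line per term: the term, a bar of '#' marks, and its count."""
    counts = Counter(words).most_common(top_n)
    return [f"{term:<5}{'#' * count} {count}" for term, count in counts]


lines = text_histogram(tokens)
print("\n".join(lines))
```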
TOOLS
TEXT ANALYTICS
BENEFITS OF TEXT MINING
• Extracts all kinds of information
• E-commerce – personalized marketing
• Society – identifying criminal activities
• Companies – better customer relationships
• Scalability & real-time analysis
• Predictive modeling
APPLICATIONS OF TEXT MINING
Search engine technology
Analysis of survey data
Spam identification
Call center routing
Surveillance
Public health early warning
CONCLUSION
• Overcomes database storage limits
• Extracts relevant information & relationships from
unstructured documents
• Empirically proven to be more accurate & highly effective
CHALLENGES
• Domain knowledge integration
• Multilingual text refinement
• Varying concept granularity
• NLP ambiguity
Queries
Text mining

More Related Content

What's hot

Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining
Bhawi247
 
Cartic Ramakrishnan's dissertation defense
Cartic Ramakrishnan's dissertation defenseCartic Ramakrishnan's dissertation defense
Cartic Ramakrishnan's dissertation defenseCartic Ramakrishnan
 
Ontology Based PMSE with Manifold Preference
Ontology Based PMSE with Manifold PreferenceOntology Based PMSE with Manifold Preference
Ontology Based PMSE with Manifold Preference
IJCERT
 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text mining
eSAT Journals
 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text mining
eSAT Publishing House
 
Text Mining Framework
Text Mining FrameworkText Mining Framework
Text Mining Framework
Prakhyath Rai
 
Text Analytics - JCC2014 Kimelfeld
Text Analytics - JCC2014 KimelfeldText Analytics - JCC2014 Kimelfeld
Text Analytics - JCC2014 Kimelfeld
Pedro Contreras Flores
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
IRJET Journal
 
Faster Case Retrieval Using Hash Indexing Technique
Faster Case Retrieval Using Hash Indexing TechniqueFaster Case Retrieval Using Hash Indexing Technique
Faster Case Retrieval Using Hash Indexing Technique
Waqas Tariq
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
RajkiranVeluri
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload
Prof. Wim Van Criekinge
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spaces
Mounia Lalmas-Roelleke
 

What's hot (13)

Text mining and data mining
Text mining and data mining Text mining and data mining
Text mining and data mining
 
Cartic Ramakrishnan's dissertation defense
Cartic Ramakrishnan's dissertation defenseCartic Ramakrishnan's dissertation defense
Cartic Ramakrishnan's dissertation defense
 
Ontology Based PMSE with Manifold Preference
Ontology Based PMSE with Manifold PreferenceOntology Based PMSE with Manifold Preference
Ontology Based PMSE with Manifold Preference
 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text mining
 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text mining
 
Text Mining Framework
Text Mining FrameworkText Mining Framework
Text Mining Framework
 
Text Analytics - JCC2014 Kimelfeld
Text Analytics - JCC2014 KimelfeldText Analytics - JCC2014 Kimelfeld
Text Analytics - JCC2014 Kimelfeld
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
Faster Case Retrieval Using Hash Indexing Technique
Faster Case Retrieval Using Hash Indexing TechniqueFaster Case Retrieval Using Hash Indexing Technique
Faster Case Retrieval Using Hash Indexing Technique
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload2018 02 20_biological_databases_part1_v_upload
2018 02 20_biological_databases_part1_v_upload
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spaces
 
Ijetcas14 409
Ijetcas14 409Ijetcas14 409
Ijetcas14 409
 

Similar to Text mining

Search Powered by Deep Learning SmartData 2017
Search Powered by Deep Learning SmartData 2017Search Powered by Deep Learning SmartData 2017
Search Powered by Deep Learning SmartData 2017Debanjan Mahata
 
Search powered by deep learning smart data 2017
Search powered by deep learning smart data 2017Search powered by deep learning smart data 2017
Search powered by deep learning smart data 2017
Debanjan Mahata
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
thenmozhip8
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
John Popoola
 
Optimising Your Content for findability
Optimising Your Content for findabilityOptimising Your Content for findability
Optimising Your Content for findability
Kristian Norling
 
Text mining
Text miningText mining
Text mining
Koshy Geoji
 
Machine Learning for Data Extraction
Machine Learning for Data ExtractionMachine Learning for Data Extraction
Machine Learning for Data Extraction
Dasha Herrmannova
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
AKSHAY BHAGAT
 
Cloud computing in qualitative research data analysis with support of web qda...
Cloud computing in qualitative research data analysis with support of web qda...Cloud computing in qualitative research data analysis with support of web qda...
Cloud computing in qualitative research data analysis with support of web qda...
German Jordanian university
 
RDAP 15: Research Data Integration in the Purdue Libraries
RDAP 15: Research Data Integration in the Purdue LibrariesRDAP 15: Research Data Integration in the Purdue Libraries
RDAP 15: Research Data Integration in the Purdue Libraries
ASIS&T
 
Data Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesData Management Lab: Session 2 slides
Data Management Lab: Session 2 slides
IUPUI
 
Information storage and retrieval
Information storage and retrievalInformation storage and retrieval
Information storage and retrievalSadaf Rafiq
 
Ibm piquant summary
Ibm piquant summaryIbm piquant summary
Ibm piquant summary
IIUM
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
Artificial Intelligence Institute at UofSC
 
Data Management for librarians
Data Management for librariansData Management for librarians
Data Management for librarians
C. Tobin Magle
 
Introduction to Enterprise Search
Introduction to Enterprise SearchIntroduction to Enterprise Search
Introduction to Enterprise Search
Findwise
 
How to share useful data
How to share useful dataHow to share useful data
How to share useful data
Peter McQuilton
 
UNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data MiningUNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data Mining
Nandakumar P
 
Managing Ireland's Research Data - 3 Research Methods
Managing Ireland's Research Data - 3 Research MethodsManaging Ireland's Research Data - 3 Research Methods
Managing Ireland's Research Data - 3 Research Methods
Rebecca Grant
 
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
GESIS
 

Similar to Text mining (20)

Search Powered by Deep Learning SmartData 2017
Search Powered by Deep Learning SmartData 2017Search Powered by Deep Learning SmartData 2017
Search Powered by Deep Learning SmartData 2017
 
Search powered by deep learning smart data 2017
Search powered by deep learning smart data 2017Search powered by deep learning smart data 2017
Search powered by deep learning smart data 2017
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
 
Optimising Your Content for findability
Optimising Your Content for findabilityOptimising Your Content for findability
Optimising Your Content for findability
 
Text mining
Text miningText mining
Text mining
 
Machine Learning for Data Extraction
Machine Learning for Data ExtractionMachine Learning for Data Extraction
Machine Learning for Data Extraction
 
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...Big Data (SOCIOMETRIC METHODS FOR  RELEVANCY ANALYSIS OF LONG TAIL  SCIENCE D...
Big Data (SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE D...
 
Cloud computing in qualitative research data analysis with support of web qda...
Cloud computing in qualitative research data analysis with support of web qda...Cloud computing in qualitative research data analysis with support of web qda...
Cloud computing in qualitative research data analysis with support of web qda...
 
RDAP 15: Research Data Integration in the Purdue Libraries
RDAP 15: Research Data Integration in the Purdue LibrariesRDAP 15: Research Data Integration in the Purdue Libraries
RDAP 15: Research Data Integration in the Purdue Libraries
 
Data Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesData Management Lab: Session 2 slides
Data Management Lab: Session 2 slides
 
Information storage and retrieval
Information storage and retrievalInformation storage and retrieval
Information storage and retrieval
 
Ibm piquant summary
Ibm piquant summaryIbm piquant summary
Ibm piquant summary
 
Semantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including AstrophysicsSemantic Technologies for Big Sciences including Astrophysics
Semantic Technologies for Big Sciences including Astrophysics
 
Data Management for librarians
Data Management for librariansData Management for librarians
Data Management for librarians
 
Introduction to Enterprise Search
Introduction to Enterprise SearchIntroduction to Enterprise Search
Introduction to Enterprise Search
 
How to share useful data
How to share useful dataHow to share useful data
How to share useful data
 
UNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data MiningUNIT - 5: Data Warehousing and Data Mining
UNIT - 5: Data Warehousing and Data Mining
 
Managing Ireland's Research Data - 3 Research Methods
Managing Ireland's Research Data - 3 Research MethodsManaging Ireland's Research Data - 3 Research Methods
Managing Ireland's Research Data - 3 Research Methods
 
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
 

Recently uploaded

special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
timhan337
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
CarlosHernanMontoyab2
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
Balvir Singh
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
Peter Windle
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
Celine George
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
camakaiclarkmusic
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 

Recently uploaded (20)

special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Operation Blue Star - Saka Neela Tara
Operation Blue Star   -  Saka Neela TaraOperation Blue Star   -  Saka Neela Tara
Operation Blue Star - Saka Neela Tara
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Embracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic ImperativeEmbracing GenAI - A Strategic Imperative
Embracing GenAI - A Strategic Imperative
 
Model Attribute Check Company Auto Property
Model Attribute  Check Company Auto PropertyModel Attribute  Check Company Auto Property
Model Attribute Check Company Auto Property
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 

Text mining

  • 1. WEBINAR ON Presented by Dr. A. MUTHUSAMY Head, Associate Professor Department of Computer Science with Cognitive Systems Dr. N.G.P. Arts and Science College Coimbatore-641 048 Tamil Nadu, India Mobile: +91 8675390072, E-mail: muthusamy@drngpasc.ac.in 1 TM
  • 2. OUTLINE 2 Dr. NGPASC COIMBATORE | INDIA Introduction - Data & Its Types Need of Text Mining Objectives of Text Mining Search Vs Discover Relevant Disciplines Process & Tools - Text Mining Benefits & Applications of TM Conclusion
  • 4. ABOUT DATA 4 Dr. NGPASC COIMBATORE | INDIA • Data - collection of raw facts & symbols • Data is collected from multiple sources • Data can be of the following forms  Text – Alphabets  Number – 0 to 9  Image – Scanner  Audio – Microphone  Video – Camera  Sensors
  • 6. ABOUT DATA 6 Dr. NGPASC COIMBATORE | INDIA • Every organization has its own specific data - to perform certain operations within it • It gives the status of past activities and enable us to make decision • Information Processing – Converting data into information
  • 7. REAL TIME SCENARIOS • Many individuals use some form of computing every day whether they realize it or not • Swiping a debit card, sending an email / using a mobile phone can all be considered forms of computing 7 Dr. NGPASC COIMBATORE | INDIA
  • 8. DATA VARIETIES 8 Dr. NGPASC COIMBATORE | INDIA • Data that has no inherent structure & is usually stored as different types of files • Examples: Text docs, PDFs, images & Videos • Textual data with erratic formats that can be formatted with effort & S/W tools • Examples: Click stream data • Textual data files with an apparent pattern, enabling analysis • Examples: Spreadsheets & XML files • Data having a defined data model, format, structure • Example: Database
  • 9. DATA VARIETIES 9 Dr. NGPASC COIMBATORE | INDIA Structured Unstructured
  • 11. NEED OF TEXT MINING 11 Dr. NGPASC COIMBATORE | INDIA • Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation) • Most of the information in public and private sectors are stored electronically in the form of text databases • Information intensive business processes demand that we transcend from simple document retrieval to Knowledge Discovery Text Databases
  • 12. NEED OF TEXT MINING Why do cats sit on mats?
  • 13. NEED OF TEXT MINING • It would be impossible to read all the millions of research articles on the topic yourself • Text mining helps to filter large amounts of research and extract the relevant information • It helps to identify that ‘cat’ is a noun, ‘sit’ is a verb, and ‘on’ is a preposition
  • 14. NEED OF TEXT MINING • It is not just a search tool; it can also understand that a ‘cat’ is an animal, ‘sit’ is an action, and a ‘mat’ is an object • It identifies and maps patterns and trends across the millions of articles • Answer: Most of the cats who sit on mats come from cold climates.
  • 16. TEXT MINING • Knowledge Discovery of Text (KDT): analyzing large collections of written resources to generate valuable information from unstructured formats • It transforms unstructured text into structured data • It draws on algorithms from data mining (DM), machine learning (ML), statistics, and NLP to extract high-quality, useful information • It exploits the information contained in textual documents in various ways, including discovering patterns and trends in data, associations among entities, predictive rules, etc.
  • 17. TEXT MINING Framework to Create Dictionary
  • 19. SEARCH Vs DISCOVER
  • 20. SEARCH Vs DISCOVER Data Retrieval: find records within a structured database • Database Type : Structured • Search Mode : Goal-driven • Atomic entity : Data Record • Information Need : Find a Japanese restaurant in Bangalore that serves vegetarian food • Query : SELECT * FROM restaurants WHERE city = 'Bangalore' AND type = 'Japanese' AND has_veg = true
  • 21. SEARCH Vs DISCOVER Information Retrieval: find relevant information in an unstructured information source (usually text) • Database Type : Unstructured • Search Mode : Goal-driven • Atomic entity : Document • Information Need : Find a Japanese restaurant in Bangalore • Query : “Japanese restaurant Bangalore” or Bangalore -> Restaurants -> Japanese
  • 22. SEARCH Vs DISCOVER Data Mining: discover new knowledge through analysis of data • Database Type : Structured • Search Mode : Opportunistic • Atomic entity : Numbers and Dimensions • Information Need : Show trend over time in # of visits to Japanese restaurants in Bangalore • Query : SELECT date, SUM(visits) FROM restaurants WHERE city = 'Bangalore' AND type = 'Japanese' GROUP BY date ORDER BY date
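A trend over time needs one total per date, so the aggregate has to be grouped by date. The following is a minimal sketch using Python's built-in `sqlite3` module; the table layout and the sample rows are hypothetical and only illustrate the idea of the query on this slide.

```python
import sqlite3

# In-memory database with a hypothetical `restaurants` table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (city TEXT, type TEXT, date TEXT, visits INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        ("Bangalore", "Japanese", "2020-01-01", 40),
        ("Bangalore", "Japanese", "2020-01-01", 25),
        ("Bangalore", "Japanese", "2020-01-02", 70),
        ("Bangalore", "Italian", "2020-01-01", 90),
        ("Chennai", "Japanese", "2020-01-02", 30),
    ],
)

# One summed row per date, ordered chronologically, gives the trend over time.
trend = conn.execute(
    "SELECT date, SUM(visits) FROM restaurants "
    "WHERE city = ? AND type = ? GROUP BY date ORDER BY date",
    ("Bangalore", "Japanese"),
).fetchall()
print(trend)
```

Passing the city and type as parameters instead of pasting them into the SQL string keeps the string literals correctly quoted.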
  • 23. SEARCH Vs DISCOVER Text Mining: discover new knowledge through analysis of text • Database Type : Unstructured • Search Mode : Opportunistic • Atomic entity : Language feature or concept • Information Need : Find the types of food poisoning most often associated with Japanese restaurants • Query : Rank diseases found associated with “Japanese restaurants”
  • 25. RELEVANT DISCIPLINES Information Extraction (IE) • The process of automatically obtaining structured data from unstructured data Natural Language Processing (NLP) • The study of human language; a primary component of Artificial Intelligence (AI) • Computational techniques for the analysis & synthesis of natural language & speech
  • 26. RELEVANT DISCIPLINES Data Mining (DM) • The extraction of useful data and hidden patterns from large data sets • Data mining tools can predict behaviors & future trends, which allows businesses to make better data-driven decisions Information Retrieval (IR) • The process of obtaining information system resources that are relevant to an information need from a collection of those resources • IR is an extension of document extraction • It helps to narrow down the set of records that are associated with a specific problem
  • 27. RELEVANT DISCIPLINES Example: Named Entity Recognition
  • 29. PROCESS OF TEXT MINING
  • 30. PROCESS OF TEXT MINING • Extract Docs.: document extraction from different external sources; web pages • Corpus: document storage; handle multiple documents at a time • Text Transformation: tokenization, PoS tagging; stemming, lemmatization & stop words • Extract Feature: bag of words; vector space • Reduce Dimension: TF-IDF; SVD & LDA • Data Mining: supervised; unsupervised
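The feature-extraction stage of this pipeline (bag of words in a vector space) can be sketched as follows. This is an illustrative example using only the Python standard library, with whitespace splitting standing in for a real tokenizer; the function name `bag_of_words` and the sample documents are assumptions, not part of the original slides.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix): one count vector per document."""
    # Assumption: lowercase + whitespace split is a stand-in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary fixes one vector dimension per distinct term.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(["the cat sat on the mat", "the dog sat"])
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a document and each column a term, which is the document/term matrix that the later stages weight and reduce.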
  • 31. EXAMPLES Tokenization • Tokenization is the process of splitting a sentence into words [Figure: Tokenization of a sentence]
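As a rough illustration of tokenization, the sketch below splits a sentence into lowercase word tokens with a regular expression. The function name `tokenize` and the pattern are assumptions chosen for this example; production tokenizers handle punctuation, hyphens, and other languages with more care.

```python
import re

def tokenize(sentence):
    """Split a sentence into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

tokens = tokenize("Why do cats sit on mats?")
print(tokens)
```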
  • 32. EXAMPLES Stemming and Lemmatization • Stemming and lemmatization are methods for normalizing text documents • Text normalization shrinks the vocabulary (by reducing inflectional forms) and improves the accuracy of many language modeling tasks • Both convert a word to its base form
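The difference between the two approaches can be sketched with a toy example. The suffix list and the lookup table below are deliberately tiny and hypothetical; this is not the Porter stemmer or any real lemmatizer, only an illustration that stemming strips suffixes by rule while lemmatization looks up a dictionary form.

```python
# Toy suffix list (assumption): real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Toy lemma dictionary (assumption): real lemmatizers use a full lexicon.
LEMMAS = {"mice": "mouse", "sat": "sit", "better": "good"}

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["cats", "jumped", "boxes", "mice", "sat"]:
    print(word, naive_stem(word), lemmatize(word))
```

Rule-based stemming cannot map irregular forms such as "mice" or "sat", which is where a dictionary-based lemmatizer helps.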
  • 33. EXAMPLES PoS Tagging • Processing text and assigning a part of speech to each word • Tags vary between tag sets, e.g., Noun, Verb, Preposition, Determiner, etc.
  • 36. EXAMPLES Reduce Dimensions • Each row in the document/term matrix represents a document as a high-dimensional vector (each dimension represents the occurrence of a term) • Term occurrence can be measured by tf-idf • Reasons to reduce dimensions: to save time & memory, and to transform the vectors from term space to topic space, which allows documents on similar topics to be associated with each other even when they use different terms • Example: "pet" and "cat" are mapped to the same topic based on their co-occurrence
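The tf-idf weighting mentioned on this slide can be sketched with the standard library alone. This example assumes the common textbook form, tf = count / document length and idf = ln(N / df); libraries often use smoothed variants. The sample documents and the name `tf_idf` are illustrative. The projection to topic space (SVD, LDA) is not shown.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                term: (count / total) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return weights

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
weights = tf_idf(docs)
print(sorted(weights[0].items(), key=lambda item: -item[1]))
```

In the first document, "cat" outweighs "the" even though "the" occurs twice, because "the" also appears in another document and is therefore less distinctive.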
  • 37. EXAMPLES Data Exploration • By computing term frequency, a word cloud & histogram can be presented to show the significance of the terms • To verify the correctness of the algorithm, include statistical tests within the model Data Visualization
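Term-frequency exploration can be illustrated with a plain-text histogram. The stop-word set, the sample sentence, and the function name `top_terms` below are assumptions for this sketch; a word cloud would use the same counts to size the words.

```python
from collections import Counter

def top_terms(text, k=3, stop_words=frozenset({"the", "on", "a", "and"})):
    """Return the k most frequent terms after dropping stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return Counter(tokens).most_common(k)

top = top_terms("the cat sat on the mat and the cat slept")
for term, count in top:
    # One '#' per occurrence gives a simple text histogram.
    print(f"{term:<6}{'#' * count}")
```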
  • 40. BENEFITS OF TEXT MINING • Extract all kinds of information • E-commerce: personalized marketing • Society: identifying criminal activities • Companies: better customer relationships • Scalability & real-time analysis • Predictive modeling
  • 41. APPLICATIONS OF TEXT MINING • Search engine technology • Analysis of survey data • Spam identification • Call center routing • Surveillance • Public health early warning
  • 42. CONCLUSION • Overcomes database storage limits • Extracts relevant information & relationships from unstructured docs • Empirically proven to be more accurate & highly effective CHALLENGES • Domain knowledge integration • Multilingual text refinement • Varying concept granularity • NLP ambiguity