This document summarizes a webinar on text mining presented by Dr. A. Muthusamy. The webinar covered an introduction to data types and text mining, the need for text mining in analyzing large amounts of unstructured text data, the objectives and process of text mining, and applications and benefits of text mining such as knowledge discovery, predictive modeling, and information extraction. Challenges of text mining include overcoming database storage limits and handling natural language processing ambiguities.
1. WEBINAR ON TEXT MINING
Presented by
Dr. A. MUTHUSAMY
Head, Associate Professor
Department of Computer Science with Cognitive Systems
Dr. N.G.P. Arts and Science College
Coimbatore-641 048
Tamil Nadu, India
Mobile: +91 8675390072, E-mail: muthusamy@drngpasc.ac.in
2. OUTLINE
2
Dr. NGPASC
COIMBATORE | INDIA
Introduction - Data & Its Types
Need of Text Mining
Objectives of Text Mining
Search Vs Discover
Relevant Disciplines
Process & Tools - Text Mining
Benefits & Applications of TM
Conclusion
4. ABOUT DATA
• Data - collection of raw facts & symbols
• Data is collected from multiple sources
• Data can be of the following forms
Text – Alphabets
Number – 0 to 9
Image – Scanner
Audio – Microphone
Video – Camera
Sensors
6. ABOUT DATA
• Every organization has its own specific data to perform certain operations within it
• Data gives the status of past activities and enables us to make decisions
• Information Processing – converting data into information
7. REAL TIME SCENARIOS
• Many individuals use some form of computing every day, whether they realize it or not
• Swiping a debit card, sending an email or using a mobile phone can all be considered forms of computing
8. DATA VARIETIES
Unstructured
• Data that has no inherent structure and is usually stored as different types of files
• Examples: text documents, PDFs, images & videos
Quasi-structured
• Textual data with erratic formats that can be formatted with effort and software tools
• Example: clickstream data
Semi-structured
• Textual data files with an apparent pattern, enabling analysis
• Examples: spreadsheets & XML files
Structured
• Data having a defined data model, format and structure
• Example: databases
11. NEED OF TEXT MINING
• Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)
• Most of the information in the public and private sectors is stored electronically in the form of text databases
• Information-intensive business processes demand that we transcend from simple document retrieval to knowledge discovery
12. NEED OF TEXT MINING
Why do cats sit on mats?
13. NEED OF TEXT MINING
• It would be impossible to read all the millions of research articles on the topic yourself
• Text mining helps to filter large amounts of research and extract the relevant information
• It helps to identify that ‘cat’ is the noun, ‘sit’ is the verb and ‘on’ is the preposition
14. NEED OF TEXT MINING
• It is not just a search tool: it can also understand that the ‘cat’ is an animal, ‘sit’ is an action and a ‘mat’ is an object
• It identifies and maps patterns and trends across the millions of articles
Answer: Most of the cats who sit on mats come from cold climates.
16. TEXT MINING
• Knowledge Discovery in Text (KDT) – mining large collections of written resources to generate valuable information from unstructured formats
• Transforms unstructured text into structured data
• Involves algorithms from data mining, machine learning, statistics and NLP to extract high-quality, useful information
• Exploits the information contained in textual documents in various ways, including the discovery of patterns and trends in data, associations among entities, predictive rules etc.
20. SEARCH Vs DISCOVER
Data Retrieval – find records within a structured database
• Database Type : Structured
• Search Mode : Goal-driven
• Atomic entity : Data record
• Information Need : Find a Japanese restaurant in Bangalore that serves vegetarian food
• Query : SELECT * FROM restaurants WHERE city = 'Bangalore' AND type = 'Japanese' AND has_veg = TRUE
21. SEARCH Vs DISCOVER
Information Retrieval – find relevant information in an unstructured information source (usually text)
• Database Type : Unstructured
• Search Mode : Goal-driven
• Atomic entity : Document
• Information Need : Find a Japanese restaurant in Bangalore
• Query : “Japanese restaurant Bangalore” or Bangalore -> Restaurants -> Japanese
22. SEARCH Vs DISCOVER
Data Mining – discover new knowledge through analysis of data
• Database Type : Structured
• Search Mode : Opportunistic
• Atomic entity : Numbers and dimensions
• Information Need : Show the trend over time in the number of visits to Japanese restaurants in Bangalore
• Query : SELECT date, SUM(visits) FROM restaurants WHERE city = 'Bangalore' AND type = 'Japanese' GROUP BY date ORDER BY date
23. SEARCH Vs DISCOVER
Text Mining – discover new knowledge through analysis of text
• Database Type : Unstructured
• Search Mode : Opportunistic
• Atomic entity : Language feature or concept
• Information Need : Find the types of food poisoning most often associated with Japanese restaurants
• Query : Rank diseases found associated with “Japanese restaurants”
25. RELEVANT DISCIPLINES
Information Extraction (IE)
• The process of automatically obtaining structured data from unstructured data
Natural Language Processing (NLP)
• The study of human language – a primary component of Artificial Intelligence (AI)
• Computational techniques for the analysis and synthesis of natural language and speech
26. RELEVANT DISCIPLINES
Data Mining (DM)
• The extraction of useful data and hidden patterns from large data sets
• Data mining tools can predict behaviors & future trends, allowing businesses to make better data-driven decisions
Information Retrieval (IR)
• The process of obtaining, from a collection of information resources, those resources relevant to an information need
• IR is an extension of document extraction
• It helps to narrow down the set of records associated with a specific problem
30. PROCESS OF TEXT MINING
Extract Documents
• Documents extracted from different external sources, e.g. web pages
Corpus
• Document storage; handles multiple documents at a time
Text Transformation
• Tokenization, PoS tagging
• Stemming, lemmatization & stop-word removal
Extract Features
• Bag of words
• Vector space
Reduce Dimensions
• TF-IDF
• SVD & LDA
Data Mining
• Supervised
• Unsupervised
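The “Extract Features” stage above is typically a bag-of-words construction. A minimal sketch in Python (illustrative only; `bag_of_words` is a hypothetical helper name, and real pipelines would use a library such as scikit-learn):

```python
from collections import Counter

def bag_of_words(docs):
    # Build a shared vocabulary, then represent each document as a
    # vector of raw term counts (one dimension per vocabulary term).
    vocab = sorted({term for doc in docs for term in doc.split()})
    counts = [Counter(doc.split()) for doc in docs]
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

vocab, vectors = bag_of_words(["cat sits on mat", "cat eats fish"])
print(vocab)    # ['cat', 'eats', 'fish', 'mat', 'on', 'sits']
print(vectors)  # [[1, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0]]
```

Each vector is one row of the document/term matrix that the later "Reduce Dimensions" stage operates on.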
31. EXAMPLES
Tokenization
• Tokenization is the process of splitting a sentence into individual words (tokens)
Tokenization of a sentence
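The idea can be sketched in a few lines of Python (a minimal regex-based illustration; libraries such as NLTK or spaCy handle punctuation, contractions and Unicode properly):

```python
import re

def tokenize(sentence):
    # Lowercase, then keep only runs of letters/digits as tokens;
    # punctuation is discarded in this simple sketch.
    return re.findall(r"[A-Za-z0-9]+", sentence.lower())

print(tokenize("The cat sits on the mat."))
# ['the', 'cat', 'sits', 'on', 'the', 'mat']
```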
32. EXAMPLES
Stemming and Lemmatization
• Stemming and lemmatization are methods to normalize text documents
• Text normalization helps to improve the vocabulary (by reducing inflectional forms) and the accuracy of many language-modeling tasks
• Each converts a word to its base form
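A crude illustration of the difference in Python (the suffix list and lemma table below are toy assumptions, not a real algorithm; production systems use the Porter stemmer or a WordNet-backed lemmatizer, e.g. via NLTK):

```python
# Toy suffix list and lemma table - illustrative assumptions only.
SUFFIXES = ["ing", "ies", "es", "ed", "s"]
LEMMAS = {"ran": "run", "mice": "mouse", "better": "good"}

def crude_stem(word):
    # Stemming chops a matching suffix without consulting a dictionary,
    # so the result is not always a valid word.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def crude_lemmatize(word):
    # Lemmatization returns the dictionary (base) form, so irregular
    # words need a lookup table rather than suffix rules.
    return LEMMAS.get(word, crude_stem(word))

print(crude_stem("jumping"))   # 'jump'
print(crude_lemmatize("ran"))  # 'run'
```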
33. EXAMPLES
PoS Tagging
• Processing text and assigning a part of speech to each word
• Tags may vary across tagging sets: noun, verb, preposition, determiner etc.
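A toy dictionary-based tagger for the slide's example sentence (the lexicon and tag names are illustrative assumptions; real taggers such as NLTK's `pos_tag` use trained statistical models that also consider context):

```python
# Tiny hand-written lexicon - an assumption for illustration only.
TOY_LEXICON = {"the": "DET", "cat": "NOUN", "sits": "VERB",
               "on": "PREP", "mat": "NOUN"}

def toy_tag(tokens):
    # Look each token up in the lexicon; unknown words get 'UNK'.
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]

print(toy_tag(["the", "cat", "sits", "on", "the", "mat"]))
```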
36. EXAMPLES
Reduce Dimensions
• Each row in the document/term matrix represents a document as a high-dimensional vector (each dimension represents the occurrence of a term), which can be weighted by TF-IDF
• Reducing dimensions saves time & memory
• Transforming vectors from term space to a topic space allows documents on similar topics to associate with each other even when they use different terms
• Example: “pet” and “cat” map to the same topic based on their co-occurrence
37. EXAMPLES
Data Exploration
• By computing term frequencies, a word cloud and a histogram are presented to represent the significance of the terms
• To prove the correctness of the algorithm, include statistical tests within the model
Data Visualization
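The term-frequency counts behind the word cloud and histogram can be computed with Python's standard library (an illustrative sketch; the sample sentence is made up):

```python
from collections import Counter

# Made-up sample tokens; in practice these come from the corpus after
# the transformation steps shown earlier.
tokens = ("text mining discovers patterns in text "
          "mining of text needs preprocessing").split()

freq = Counter(tokens)
# The most frequent terms drive both the histogram bars and the
# relative word sizes in a word cloud.
print(freq.most_common(2))  # [('text', 3), ('mining', 2)]
```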
40. BENEFITS OF TEXT MINING
• Extract all kinds of information
• E-commerce – personalized marketing
• Society – identifying criminal activities
• Companies – better customer relationships
• Scalability & real-time analysis
• Predictive modeling
41. APPLICATIONS OF TEXT MINING
• Search engine technology
• Analysis of survey data
• Spam identification
• Call center routing
• Surveillance
• Public health early warning
42. CONCLUSION
• Overcomes database storage limits
• Extracts relevant information & relationships from unstructured documents
• Empirically proven to be more accurate & highly effective
CHALLENGES
• Domain knowledge integration
• Multilingual text refinement
• Varying concept granularity
• NLP ambiguity