SlideShare a Scribd company logo
1 of 4
Download to read offline
A. D. Khade et al Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 3( Version 2), March 2014, pp.20-23
www.ijera.com 20 | P a g e
Discover Effective Pattern for Text Mining
A. D. Khade*, A. B. Karche**, D. S. Jadhav*** , A. S. Zore****
*-*** (Student of Computer Engineering Department MMIT College Pune University, Pune-411047)
**** (Prof of Information Technology & Computer Engineering Department MMIT College Pune University,
Pune-411047)
ABSTRACT
Many data mining techniques have been discovered for finding useful patterns in documents like text document.
However, how to use effective and bring to up to date discovered patterns is still an open research task,
especially in the domain of text mining. Text mining is the finding of very interesting knowledge (or features) in
the text documents. It is a challenging task to find appropriate knowledge (or features) in text documents to help
users to find what they exactly want. This paper represent efficient mining algorithm to find particular patterns
within a reasonable and acceptable time frame.
Keywords- Pattern mining; Pattern evolve; Text mining; Text arrangment
I. INTRODUCTION
In past few years, a number of data mining
techniques have been presented to find out the
different knowledge tasks. These techniques include
association rule mining, frequent item set mining,
sequential pattern mining, maximum pattern mining,
and closed pattern mining. With a large number of
patterns generated by using data mining approaches,
how to effectively use and update these patterns is
still an open research issue. In this paper, we focus on
the development of a knowledge discovery model to
effectively use and update the discovered patterns
and apply it to the field of text mining.
II. CONCEPT
Text mining is the discovery of useful
knowledge in text documents. It is a very difficult
task to find accurate knowledge in text documents to
help users requirement. In the beginning, Information
Retrieval (IR) provided many term-based methods to
solve this challenge, The merits of term-based
methods include efficient computational performance
as well as some theories for term weighting, which
have emerged over the last couple of decades from
the IR and machine learning communities. However,
term based methods suffer from the problems of
polysemy and synonymy, where polysemy means a
word has multiple meanings, and synonymy is
multiple words having the same meaning. The
semantic meaning of many discovered terms is
uncertain for answering what users want. Over the
years, people have often held the hypothesis that
phrase-based approaches could perform better than
the term based ones, as phrases may carry more
“semantics” like information. This hypothesis has not
fared too well in the history of IR. Although phrases
are less ambiguous and more discriminative than
individual terms, the likely reasons for the
discouraging performance include:
1) phrases have inferior statistical properties to terms,
2) they have low frequency of occurrence, and
3) there are large numbers of redundant and noisy
phrases among them .
In the presence of these set backs, sequential
patterns used in data mining community have turned
out to be a promising alternative to phrases because
sequential patterns enjoy good statistical properties
like terms. To overcome the disadvantages of phrase-
based approaches, pattern mining-based approaches
have been proposed, which adopted the concept of
closed sequential patterns, and pruned non closed
patterns.
III. PROPOSED WORK
An effective pattern discovery technique is
discovered. Evaluates specificities of patterns and
then evaluates term-weights according to the
distribution of terms in the discovered patterns Solves
Misinterpretation Problem. Considers the influence
of patterns from the negative training examples to
find ambiguous (noisy) patterns and tries to reduce
their influence for the low-frequency problem. The
process of updating ambiguous patterns can be
referred as pattern evolution. The proposed approach
can improve the accuracy of evaluating term weights
because discovered patterns are more specific than
whole documents.
In General there are two phases:
Training and Testing:-
Training: In training phase the d-patterns in
positive documents (D) based on a min sup are
found, and evaluates term supports by deploying
patterns to terms.
RESEARCH ARTICLE OPEN ACCESS
A. D. Khade et al Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 3( Version 2), March 2014, pp.20-23
www.ijera.com 21 | P a g e
Testing: In Testing Phase to revise term supports
using noise negative documents in D based on an
experimental coefficient. The incoming documents
then can be sorted based on these weights.
1) Datasets:-
Dataset is collection of the data which
present in tabular form i.e. we can represent the data
in row & column wise format. Here we use special
type of dataset in our system known as RCV-
1(Reuters Corpus Value 1).
IV. SYSTEM ARCHITECTURE
The proposed architecture is shown in
Figure 1. This architecture shows the stepwise
solution of our project. The basic step is to load
documents in our database. The next step is to
remove stop word and text steaming. we removed
this stop word and text steaming with the help of
NLP(natural language process).
Fig 1: System Architecture
There are 5 sub modules of proposed system.
1) Loading documents
2) Text Preprocessing
3) Pattern taxonomy process
4) Pattern deploying
5) Pattern Testing
1) Loading documents
In this module, to load the list of all
documents. The user to retrieve one of the
documents. This document is given to next process.
That process is preprocessing.
Fig 2: Loading Dataset
2) Text Preprocessing
The retrieved document preprocessing is
done in module.
There are two types of process is done.
a) stop words removal
b) text stemming
stop words are words which are filtered out
prior to, or after, processing of natural language data.
stemming is the process for reducing inflected (or
sometimes derived) words to their stem base or root
form. It generally a written word form.
Fig 3: Preprocessing
3) Pattern taxonomy process:-
In this module, the documents are split into
paragraphs. Each paragraph is considered to beach
document. The terms , which can be extracted from
set of positive documents.
Where d=document;
m= set of paragraph;
s=keyword;
A. D. Khade et al Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 3( Version 2), March 2014, pp.20-23
www.ijera.com 22 | P a g e
Paragraph Terms
d1 s1, s2
d2 s3,s4, s6
d3 s3, s4, s5, s6
d4 s3, s4, s5, s6
d5 s1 , s2, s6, s7
d6 s1, s2, s6 , s7
TABLE 1 :- A Set of Paragraph
Frequent Pattern Covering set
{ s3, s4, s6} {d2,d3,d4}
{ s3 , s4 } {d2,d3,d4}
{ s3 , s6} {d2,d3,d4}
{ s4 , s6} {d2,d3,d4}
{ s3 } {d2,d3,d4}
{ s4 } {d2,d3,d4}
{ s1 , s2 } {d1,d5,d6}
{ s1 } {d1,d5,d6}
{ s2 } {d1,d5,d6}
{ s6} {d2,d3,
d4,d5,d6}
TABLE 2:-Frqnt Pattern & Covering Set
Fig 4: Pattern Taxonomy Process
4) Pattern deploying
The discovered patterns are summarized.
The d-pattern algorithm is used to discover all
patterns in positive documents are composed. The
term supports are calculated by all terms in d-pattern.
Term support means weight of the term is evaluated.
Fig.5 : Pattern Discovery
Fig.6 : Pattern Deployment
5) Pattern Testing
In this module used to identify the noisy
patterns in documents. Sometimes, system falsely
identified negative document as a positive. So, noise
is occurred in positive document. The noised pattern
named as offender. If partial conflict offender
contains in positive documents, the reshuffle process
is applied.
V. TECHNICAL SPECIFICATION
Software:
1. Operating System Windows XP /7
2. Coding Language JAVA
3. Tools and Databases used-MySQL.
4. Dataset used-RCV1
A. D. Khade et al Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 3( Version 2), March 2014, pp.20-23
www.ijera.com 23 | P a g e
Hardware:
1.System: Pentium IV 2.4 GHz.
2. Hard Disk: 40 GB. (Min)
3. Ram: 512 MB. (Min)
VI. TECHNICAL OVERVIEW
Advantages
1. It improves the effectiveness of using and
updating discovered patterns for finding relevant
and interesting information.
2. Able to produce the text mining on polysemy
and synonymy effectively.
3. Effective pattern Discovery Technique.
4. The proposed approach is used to improve the
accuracy of evaluating term weights.
5. Because, the discovered patterns are more
specific than whole documents.
6. To avoiding the issues of phrase-based approach
to using the pattern-based approach.
7. Pattern mining techniques can be used to find
various text patterns.
VII. EXPECTED RESULT
To focus on the development of knowledge
discovery model to effectively use and update the
discovered patterns and apply it to the field of text
mining. Technology can help in fast finding of text.
There is efficient use of text mining. Finding
searching pattern with there location in effectively.
Processing and Multilingual Aspects are present in
system.
VIII. CONCLUSION
Thus mining techniques have been proposed
in the last decade. These techniques include
association rule mining, frequent item set mining,
sequential pattern mining, maximum pattern mining,
and closed pattern mining. However, using these
discovered knowledge (or patterns) in the field of text
mining is difficult and ineffective. The reason is that
some useful long patterns with high specificity lack
in support (i.e., the low-frequency problem). We
argue that not all frequent short patterns are useful.
Hence, misinterpretations of patterns derived from
data mining techniques lead to the ineffective
performance. In this research work, an effective
pattern discovery technique has been proposed to
overcome the low-frequency and misinterpretation
problems for text mining.
IX. ACKNOWLEDGMENT
We would like to sincerely thank Mr. Amit
Zore, our mentor (Lecturer, MMIT, Lohgaon), for his
support and encouragement.
REFERENCES
[1] J. Han and K.C.-C. Chang, “Data Mining for
Web Intelligence,”, 2002.
[2] A. Maedche, Ontology Learning for the
Semantic Web. Kluwer Academic 2003
http://www.textmining.org/home/
[3] Y. Yang, “An Evaluation of Statistical
Approaches to Tex Categorization,”
Information Retrieval.
[4] “Interpretations of Association Rules by
Granular Computing,” By Y. Li and N.
Zhong.
[5] Data set RCV1.
http://www.RCV1dataset.com/home/
[6] Applying Data Mining Techniques for
Descriptive Phrase Extraction in Digital
Document Collections by H. Ahonen,
O.Heinonen, M. Klemettinen, and A.I.
Verkamo.
[7] K. Aas and L. Eikvil, “ext Categorisation: A
Survey,” Technical Report Raport NR 941,
Norwegian Computing Center, 1999
[8] R. Baeza-Yates and B. Ribeiro-Neto,”
Modern Information Retrieval” Addison
Wesley, 1999.

More Related Content

What's hot

AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
csandit
 
A new keyphrases extraction method based on suffix tree data structure for ar...
A new keyphrases extraction method based on suffix tree data structure for ar...A new keyphrases extraction method based on suffix tree data structure for ar...
A new keyphrases extraction method based on suffix tree data structure for ar...
ijma
 
Topicmodels
TopicmodelsTopicmodels
Topicmodels
Ajay Ohri
 
Lecture Notes in Computer Science:
Lecture Notes in Computer Science:Lecture Notes in Computer Science:
Lecture Notes in Computer Science:
butest
 

What's hot (18)

USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
USING TF-ISF WITH LOCAL CONTEXT TO GENERATE AN OWL DOCUMENT REPRESENTATION FO...
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
 
A new keyphrases extraction method based on suffix tree data structure for ar...
A new keyphrases extraction method based on suffix tree data structure for ar...A new keyphrases extraction method based on suffix tree data structure for ar...
A new keyphrases extraction method based on suffix tree data structure for ar...
 
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURESNAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
NAMED ENTITY RECOGNITION IN TURKISH USING ASSOCIATION MEASURES
 
Arabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodArabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation method
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...Taxonomy extraction from automotive natural language requirements using unsup...
Taxonomy extraction from automotive natural language requirements using unsup...
 
Topicmodels
TopicmodelsTopicmodels
Topicmodels
 
6.domain extraction from research papers
6.domain extraction from research papers6.domain extraction from research papers
6.domain extraction from research papers
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
IRJET- Survey for Amazon Fine Food Reviews
IRJET- Survey for Amazon Fine Food ReviewsIRJET- Survey for Amazon Fine Food Reviews
IRJET- Survey for Amazon Fine Food Reviews
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 
Proposed Method for String Transformation using Probablistic Approach
Proposed Method for String Transformation using Probablistic ApproachProposed Method for String Transformation using Probablistic Approach
Proposed Method for String Transformation using Probablistic Approach
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
Lecture Notes in Computer Science:
Lecture Notes in Computer Science:Lecture Notes in Computer Science:
Lecture Notes in Computer Science:
 

Viewers also liked

Sistemas circulatorios
Sistemas circulatoriosSistemas circulatorios
Sistemas circulatorios
Isabel Salta
 
Livro de poesia plnm 2c
Livro de poesia  plnm 2cLivro de poesia  plnm 2c
Livro de poesia plnm 2c
Veronica Baptista
 
Tac grupo iv
Tac grupo iv Tac grupo iv
Tac grupo iv
grupoiv_ie
 
Aula3 4 pl..
Aula3 4 pl..Aula3 4 pl..
Aula3 4 pl..
kriolo_loko
 

Viewers also liked (20)

E43052831
E43052831E43052831
E43052831
 
Ek4301827831
Ek4301827831Ek4301827831
Ek4301827831
 
Fb4301931934
Fb4301931934Fb4301931934
Fb4301931934
 
En4301845851
En4301845851En4301845851
En4301845851
 
G43013539
G43013539G43013539
G43013539
 
E43062837
E43062837E43062837
E43062837
 
Ec4301780790
Ec4301780790Ec4301780790
Ec4301780790
 
Kq3518291832
Kq3518291832Kq3518291832
Kq3518291832
 
Li3619361944
Li3619361944Li3619361944
Li3619361944
 
Lo3619861992
Lo3619861992Lo3619861992
Lo3619861992
 
T4102160166
T4102160166T4102160166
T4102160166
 
Etica internet
Etica internetEtica internet
Etica internet
 
Estadiom
EstadiomEstadiom
Estadiom
 
Pro str pdf_capitulos200811714821187
Pro str pdf_capitulos200811714821187Pro str pdf_capitulos200811714821187
Pro str pdf_capitulos200811714821187
 
Sistemas circulatorios
Sistemas circulatoriosSistemas circulatorios
Sistemas circulatorios
 
Cores
CoresCores
Cores
 
SystemfĂśrvaltningsdagarna 2013 joakim lindbom - v1.0
SystemfĂśrvaltningsdagarna 2013   joakim lindbom - v1.0SystemfĂśrvaltningsdagarna 2013   joakim lindbom - v1.0
SystemfĂśrvaltningsdagarna 2013 joakim lindbom - v1.0
 
Livro de poesia plnm 2c
Livro de poesia  plnm 2cLivro de poesia  plnm 2c
Livro de poesia plnm 2c
 
Tac grupo iv
Tac grupo iv Tac grupo iv
Tac grupo iv
 
Aula3 4 pl..
Aula3 4 pl..Aula3 4 pl..
Aula3 4 pl..
 

Similar to E43022023

An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 An Investigation of Keywords Extraction from Textual Documents using Word2Ve... An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
IJCSIS Research Publications
 
A comparative study on different types of effective methods in text mining
A comparative study on different types of effective methods in text miningA comparative study on different types of effective methods in text mining
A comparative study on different types of effective methods in text mining
IAEME Publication
 
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
cscpconf
 

Similar to E43022023 (20)

A0210110
A0210110A0210110
A0210110
 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text mining
 
Improved method for pattern discovery in text mining
Improved method for pattern discovery in text miningImproved method for pattern discovery in text mining
Improved method for pattern discovery in text mining
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 An Investigation of Keywords Extraction from Textual Documents using Word2Ve... An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
An Investigation of Keywords Extraction from Textual Documents using Word2Ve...
 
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
 
Using data mining methods knowledge discovery for text mining
Using data mining methods knowledge discovery for text miningUsing data mining methods knowledge discovery for text mining
Using data mining methods knowledge discovery for text mining
 
Using data mining methods knowledge discovery for text mining
Using data mining methods knowledge discovery for text miningUsing data mining methods knowledge discovery for text mining
Using data mining methods knowledge discovery for text mining
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
 
A comparative study on different types of effective methods in text mining
A comparative study on different types of effective methods in text miningA comparative study on different types of effective methods in text mining
A comparative study on different types of effective methods in text mining
 
Research on ontology based information retrieval techniques
Research on ontology based information retrieval techniquesResearch on ontology based information retrieval techniques
Research on ontology based information retrieval techniques
 
Relevance feature discovery for text mining
Relevance feature discovery for text miningRelevance feature discovery for text mining
Relevance feature discovery for text mining
 
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemKnowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
 
Enhancing the labelling technique of
Enhancing the labelling technique ofEnhancing the labelling technique of
Enhancing the labelling technique of
 
Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11Ijarcet vol-3-issue-1-9-11
Ijarcet vol-3-issue-1-9-11
 
D1802023136
D1802023136D1802023136
D1802023136
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
 
A template based algorithm for automatic summarization and dialogue managemen...
A template based algorithm for automatic summarization and dialogue managemen...A template based algorithm for automatic summarization and dialogue managemen...
A template based algorithm for automatic summarization and dialogue managemen...
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

E43022023

  • 1. A. D. Khade et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 3( Version 2), March 2014, pp.20-23 www.ijera.com 20 | P a g e Discover Effective Pattern for Text Mining A. D. Khade*, A. B. Karche**, D. S. Jadhav*** , A. S. Zore**** *-*** (Student of Computer Engineering Department MMIT College Pune University, Pune-411047) **** (Prof of Information Technology & Computer Engineering Department MMIT College Pune University, Pune-411047) ABSTRACT Many data mining techniques have been discovered for finding useful patterns in documents like text document. However, how to use effective and bring to up to date discovered patterns is still an open research task, especially in the domain of text mining. Text mining is the finding of very interesting knowledge (or features) in the text documents. It is a challenging task to find appropriate knowledge (or features) in text documents to help users to find what they exactly want. This paper represent efficient mining algorithm to find particular patterns within a reasonable and acceptable time frame. Keywords- Pattern mining; Pattern evolve; Text mining; Text arrangment I. INTRODUCTION In past few years, a number of data mining techniques have been presented to find out the different knowledge tasks. These techniques include association rule mining, frequent item set mining, sequential pattern mining, maximum pattern mining, and closed pattern mining. With a large number of patterns generated by using data mining approaches, how to effectively use and update these patterns is still an open research issue. In this paper, we focus on the development of a knowledge discovery model to effectively use and update the discovered patterns and apply it to the field of text mining. II. CONCEPT Text mining is the discovery of useful knowledge in text documents. It is a very difficult task to find accurate knowledge in text documents to help users requirement. In the beginning, Information Retrieval (IR) provided many term-based methods to solve this challenge, The merits of term-based methods include efficient computational performance as well as some theories for term weighting, which have emerged over the last couple of decades from the IR and machine learning communities. However, term based methods suffer from the problems of polysemy and synonymy, where polysemy means a word has multiple meanings, and synonymy is multiple words having the same meaning. The semantic meaning of many discovered terms is uncertain for answering what users want. Over the years, people have often held the hypothesis that phrase-based approaches could perform better than the term based ones, as phrases may carry more “semantics” like information. This hypothesis has not fared too well in the history of IR. Although phrases are less ambiguous and more discriminative than individual terms, the likely reasons for the discouraging performance include: 1) phrases have inferior statistical properties to terms, 2) they have low frequency of occurrence, and 3) there are large numbers of redundant and noisy phrases among them . In the presence of these set backs, sequential patterns used in data mining community have turned out to be a promising alternative to phrases because sequential patterns enjoy good statistical properties like terms. To overcome the disadvantages of phrase- based approaches, pattern mining-based approaches have been proposed, which adopted the concept of closed sequential patterns, and pruned non closed patterns. III. PROPOSED WORK An effective pattern discovery technique is discovered. Evaluates specificities of patterns and then evaluates term-weights according to the distribution of terms in the discovered patterns Solves Misinterpretation Problem. Considers the influence of patterns from the negative training examples to find ambiguous (noisy) patterns and tries to reduce their influence for the low-frequency problem. The process of updating ambiguous patterns can be referred as pattern evolution. The proposed approach can improve the accuracy of evaluating term weights because discovered patterns are more specific than whole documents. In General there are two phases: Training and Testing:- Training: In training phase the d-patterns in positive documents (D) based on a min sup are found, and evaluates term supports by deploying patterns to terms. RESEARCH ARTICLE OPEN ACCESS
  • 2. A. D. Khade et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 3( Version 2), March 2014, pp.20-23 www.ijera.com 21 | P a g e Testing: In Testing Phase to revise term supports using noise negative documents in D based on an experimental coefficient. The incoming documents then can be sorted based on these weights. 1) Datasets:- Dataset is collection of the data which present in tabular form i.e. we can represent the data in row & column wise format. Here we use special type of dataset in our system known as RCV- 1(Reuters Corpus Value 1). IV. SYSTEM ARCHITECTURE The proposed architecture is shown in Figure 1. This architecture shows the stepwise solution of our project. The basic step is to load documents in our database. The next step is to remove stop word and text steaming. we removed this stop word and text steaming with the help of NLP(natural language process). Fig 1: System Architecture There are 5 sub modules of proposed system. 1) Loading documents 2) Text Preprocessing 3) Pattern taxonomy process 4) Pattern deploying 5) Pattern Testing 1) Loading documents In this module, to load the list of all documents. The user to retrieve one of the documents. This document is given to next process. That process is preprocessing. Fig 2: Loading Dataset 2) Text Preprocessing The retrieved document preprocessing is done in module. There are two types of process is done. a) stop words removal b) text stemming stop words are words which are filtered out prior to, or after, processing of natural language data. stemming is the process for reducing inflected (or sometimes derived) words to their stem base or root form. It generally a written word form. Fig 3: Preprocessing 3) Pattern taxonomy process:- In this module, the documents are split into paragraphs. Each paragraph is considered to beach document. The terms , which can be extracted from set of positive documents. Where d=document; m= set of paragraph; s=keyword;
  • 3. A. D. Khade et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 3( Version 2), March 2014, pp.20-23 www.ijera.com 22 | P a g e Paragraph Terms d1 s1, s2 d2 s3,s4, s6 d3 s3, s4, s5, s6 d4 s3, s4, s5, s6 d5 s1 , s2, s6, s7 d6 s1, s2, s6 , s7 TABLE 1 :- A Set of Paragraph Frequent Pattern Covering set { s3, s4, s6} {d2,d3,d4} { s3 , s4 } {d2,d3,d4} { s3 , s6} {d2,d3,d4} { s4 , s6} {d2,d3,d4} { s3 } {d2,d3,d4} { s4 } {d2,d3,d4} { s1 , s2 } {d1,d5,d6} { s1 } {d1,d5,d6} { s2 } {d1,d5,d6} { s6} {d2,d3, d4,d5,d6} TABLE 2:-Frqnt Pattern & Covering Set Fig 4: Pattern Taxonomy Process 4) Pattern deploying The discovered patterns are summarized. The d-pattern algorithm is used to discover all patterns in positive documents are composed. The term supports are calculated by all terms in d-pattern. Term support means weight of the term is evaluated. Fig.5 : Pattern Discovery Fig.6 : Pattern Deployment 5) Pattern Testing In this module used to identify the noisy patterns in documents. Sometimes, system falsely identified negative document as a positive. So, noise is occurred in positive document. The noised pattern named as offender. If partial conflict offender contains in positive documents, the reshuffle process is applied. V. TECHNICAL SPECIFICATION Software: 1. Operating System Windows XP /7 2. Coding Language JAVA 3. Tools and Databases used-MySQL. 4. Dataset used-RCV1
  • 4. A. D. Khade et al Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 3( Version 2), March 2014, pp.20-23 www.ijera.com 23 | P a g e Hardware: 1.System: Pentium IV 2.4 GHz. 2. Hard Disk: 40 GB. (Min) 3. Ram: 512 MB. (Min) VI. TECHNICAL OVERVIEW Advantages 1. It improves the effectiveness of using and updating discovered patterns for finding relevant and interesting information. 2. Able to produce the text mining on polysemy and synonymy effectively. 3. Effective pattern Discovery Technique. 4. The proposed approach is used to improve the accuracy of evaluating term weights. 5. Because, the discovered patterns are more specific than whole documents. 6. To avoiding the issues of phrase-based approach to using the pattern-based approach. 7. Pattern mining techniques can be used to find various text patterns. VII. EXPECTED RESULT To focus on the development of knowledge discovery model to effectively use and update the discovered patterns and apply it to the field of text mining. Technology can help in fast finding of text. There is efficient use of text mining. Finding searching pattern with there location in effectively. Processing and Multilingual Aspects are present in system. VIII. CONCLUSION Thus mining techniques have been proposed in the last decade. These techniques include association rule mining, frequent item set mining, sequential pattern mining, maximum pattern mining, and closed pattern mining. However, using these discovered knowledge (or patterns) in the field of text mining is difficult and ineffective. The reason is that some useful long patterns with high specificity lack in support (i.e., the low-frequency problem). We argue that not all frequent short patterns are useful. Hence, misinterpretations of patterns derived from data mining techniques lead to the ineffective performance. In this research work, an effective pattern discovery technique has been proposed to overcome the low-frequency and misinterpretation problems for text mining. IX. ACKNOWLEDGMENT We would like to sincerely thank Mr. Amit Zore, our mentor (Lecturer, MMIT, Lohgaon), for his support and encouragement. REFERENCES [1] J. Han and K.C.-C. Chang, “Data Mining for Web Intelligence,”, 2002. [2] A. Maedche, Ontology Learning for the Semantic Web. Kluwer Academic 2003 http://www.textmining.org/home/ [3] Y. Yang, “An Evaluation of Statistical Approaches to Tex Categorization,” Information Retrieval. [4] “Interpretations of Association Rules by Granular Computing,” By Y. Li and N. Zhong. [5] Data set RCV1. http://www.RCV1dataset.com/home/ [6] Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections by H. Ahonen, O.Heinonen, M. Klemettinen, and A.I. Verkamo. [7] K. Aas and L. Eikvil, “ext Categorisation: A Survey,” Technical Report Raport NR 941, Norwegian Computing Center, 1999 [8] R. Baeza-Yates and B. Ribeiro-Neto,” Modern Information Retrieval” Addison Wesley, 1999.