IOSR Journal of Engineering (IOSRJEN) www.iosrjen.org
ISSN (e): 2250-3021, ISSN (p): 2278-8719
Vol. 04, Issue 04 (April. 2014), ||V2|| PP 27-30
International organization of Scientific Research 27 | P a g e
Use of Stylometry and Outlier Detection Algorithm in Online
Writing Sample to Detect Outliers
Sonia Sharma, Pushpendra Kumar Petrya
1 (M.Tech Scholar, Department of Computer Science and Engg, Lovely Professional University (LPU))
2 (Assistant Professor, Department of Computer Science and Engg, Lovely Professional University)
Abstract: - Outlier detection is an emerging technique in data mining. Data objects that deviate from the
other data objects in a data set are considered outliers. Outliers are classified as global outliers,
collective outliers and contextual outliers, and a broad spectrum of techniques exists to detect them.[1]
Here, we propose an algorithm that detects outliers (unmatched samples) in online writing samples. The online
writing sample is first analysed using the well-known concept of “Stylometry”, the study that helps to
distinguish between the writing styles of two persons. It is assumed that the writing styles of two persons
always differ and that every person's sample contains a feature which defines the uniqueness of the
individual.[2] This model can be used for plagiarism detection, e-mail verification and author identification.
The idea behind the proposed model is to protect data from criminal activity by ensuring that the correct
author is identified.
Keywords: - Feature Selection, Outlier detection, Stylometry, Threshold metrics, Unexpected parameter
I. INTRODUCTION
Data mining is one of the fastest growing fields today. It involves extracting hidden information from
large datasets. Various techniques are used in data mining, and outlier detection is one of the important
ones.
1.1. Outlier Detection
Outlier detection, as noted above, is an important branch of data mining. It is the process of
finding the data objects whose behavior differs from that of the other data objects. Outlier detection
methods include supervised, unsupervised, semi-supervised, statistical, proximity-based and
clustering-based methods[7]. Outlier detection has many applications, including credit card fraud
detection, security, image processing, medicine and intrusion detection.[1]
1.2 Stylometry
Stylometry is the study of a person's writing behavior. It is mainly used to determine the writing
style of a person, so that the person can be judged by that style. The analysis is based on various
features, including the use of vocabulary, the use of function words, the structure and length of sentences
and the formatting of text on the page. It is also assumed that a person's writing style and features
remain constant throughout all of his or her writings. In today's education system, plagiarism is a serious
issue, and stylometry can also be used to detect it.
In this paper, we propose a model that detects outliers (unmatched samples) in online writing samples
by first using stylometry to analyze the writing sample and then applying an outlier detection algorithm.
II. RELATED WORK
Ramyaa et al., in “Using machine learning techniques for stylometry”, justify the use of various
machine learning techniques in stylometry. Their results were obtained using decision trees and neural
networks on the test set.[4] D. Pavelec et al. proposed another model, “Compression and Stylometry for
Author Identification”, which uses two different paradigms for author identification: a compression-based
algorithm and a classical pattern recognition framework.[5] Robert Goodman et al., in “The use of
Stylometry for Email Author identification: A Feasibility Study”, applied a stylometric analysis of
biometric keystrokes to authorship attribution.[6] Oral Alan et al. proposed “An Outlier Detection Algorithm
Based on Object-Oriented Metrics Thresholds”, which detects outliers in a dataset based on metric thresholds.[7]
III. PROPOSED WORK
This paper combines work in stylometry and outlier detection. Writing patterns have to be analysed
to detect any unmatched pattern in the sample of an individual. The documents are analysed using
stylometry, in which the writing pattern of an individual is examined on the basis of different writing
features. Then the parameter(s) that establish the uniqueness of the individual are selected. Any
unexpected behavior found in the writing sample is considered an outlier and is removed using an outlier
detection algorithm.
This scheme has three phases. The first phase analyses the writing sample of the individual using
the concept of stylometry. The second phase selects the parameter that is most frequently used by the
individual. The third phase detects any unexpected pattern in the writing sample, which is considered an
outlier, using an outlier detection algorithm that compares the parameter against predefined threshold
values.
The workflow is illustrated in Fig. 1.
3.1 Phase Description
Each phase is essential for the effective working of the process. The phases are explained below.
3.1.1. Analysis Phase
In this phase, the writing sample is analysed on the basis of linguistic features. To date,
stylometric studies have used sixty-two features drawn from tagged text, parsed text and interpreted
text[4], e.g., number of words, number of sentences, number of commas, number of dashes, number of
semicolons, number of question marks and number of underscores. Three more features are included in this
paper:
• Number of occurrences of “well” per hundred tokens
• Number of occurrences of “particularly” per hundred tokens
• Number of occurrences of “must” per hundred tokens
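The per-hundred-token features can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the tokenisation rule (runs of letters and apostrophes), the function name per_hundred_tokens and the decision to count every "-" character as a dash are assumptions made for illustration, and the "multiple commas" metric is left out because its exact definition is not given here.

```python
import re

# Feature names follow the abbreviations of Table 1; the mapping itself is an assumption.
PUNCTUATION = {"CPHT": ",", "DPHT": "-", "EPHT": "!", "QPHT": "?", "SPHT": ";"}
WORDS = {"TPHT": "the", "WPHT": "well", "PPHT": "particularly", "MPHT": "must"}


def per_hundred_tokens(text):
    """Return each feature count normalised per hundred word tokens."""
    # Assumed tokenisation: lower-cased runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not tokens:
        raise ValueError("writing sample contains no word tokens")
    scale = 100.0 / len(tokens)
    # Punctuation marks are counted on the raw text, words on the token list.
    features = {name: text.count(mark) * scale for name, mark in PUNCTUATION.items()}
    for name, word in WORDS.items():
        features[name] = tokens.count(word) * scale
    return features


if __name__ == "__main__":
    sample = "Well, the cat must go; the dog must stay, particularly!"
    for name, value in sorted(per_hundred_tokens(sample).items()):
        print(name, value)
```

Normalising by the number of tokens makes samples of different lengths comparable, which is why the features are expressed per hundred tokens rather than as raw counts.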
3.1.2. Parameter Selection Phase
The next phase selects an appropriate parameter that identifies the uniqueness of an individual's
writing sample. A pre-defined threshold frequency range is assigned to each parameter used for analysis,
and the parameter is selected on the basis of the observed frequencies.
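One possible reading of this phase is sketched below in Python. The selection rule is not specified in detail here, so the sketch assumes that the parameter with the highest mean frequency across an author's known samples is chosen. The function name select_parameter and the input layout (one dictionary of per-hundred-token values per sample) are hypothetical.

```python
from statistics import mean


def select_parameter(samples):
    """Pick the feature with the highest mean frequency across an author's samples.

    `samples` is a list of dicts mapping feature names to per-hundred-token values.
    The "most frequent on average" rule is an assumption, not a rule stated in the paper.
    """
    if not samples:
        raise ValueError("at least one writing sample is required")
    names = samples[0].keys()
    averages = {name: mean(s[name] for s in samples) for name in names}
    return max(averages, key=averages.get)


if __name__ == "__main__":
    known = [{"CPHT": 4.0, "WPHT": 1.0}, {"CPHT": 6.0, "WPHT": 0.5}]
    print(select_parameter(known))
```

Other rules, such as choosing the feature with the lowest variance for an author, would fit the same interface.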
3.1.3. Detection of Outliers
The next phase detects any unmatched pattern or unexpected parameter behaviour in the writing
sample. Such a sample is considered an outlier and is identified using an outlier detection algorithm
based on thresholds defined for the metrics of the dataset. If the value of a parameter does not match the
threshold value for that parameter, the sample is considered an outlier. The metrics and their threshold
values therefore need to be defined.
Metric                                          Abbreviation  Metric                                       Abbreviation
Number of commas per hundred tokens             CPHT          Number of semicolons per hundred tokens      SPHT
Number of multiple commas per hundred tokens    MCPHT         Number of “the” per hundred tokens           TPHT
Number of dashes per hundred tokens             DPHT          Number of “well” per hundred tokens          WPHT
Number of exclamation signs per hundred tokens  EPHT          Number of “particularly” per hundred tokens  PPHT
Number of question marks per hundred tokens     QPHT          Number of “must” per hundred tokens          MPHT
Table 1. Metrics and Abbreviations
The outlier detection algorithm is given as
Outlier detection algorithm (detection of outliers in the sample dataset)
Inputs: Writing sample / new writing sample.
Output: User identified / outlier detected.
Initialization of THR_CPHT, THR_MCPHT, THR_DPHT, THR_EPHT, THR_QPHT, THR_SPHT,
THR_TPHT, THR_WPHT, THR_PPHT, THR_MPHT
for parameter in DATASET
    outlier = false
    if (parameter.CPHT != THR_CPHT)
        outlier = true
    end if
    if (parameter.MCPHT != THR_MCPHT)
        outlier = true
    end if
    if (parameter.DPHT != THR_DPHT)
        outlier = true
    end if
    if (parameter.EPHT != THR_EPHT)
        outlier = true
    end if
    if (parameter.QPHT != THR_QPHT)
        outlier = true
    end if
    if (parameter.SPHT != THR_SPHT)
        outlier = true
    end if
    if (parameter.TPHT != THR_TPHT)
        outlier = true
    end if
    if (parameter.WPHT != THR_WPHT)
        outlier = true
    end if
    if (parameter.PPHT != THR_PPHT)
        outlier = true
    end if
    if (parameter.MPHT != THR_MPHT)
        outlier = true
    end if
    if (outlier)
        sample data is an outlier
    else
        sample data is correct
    end if
end for
END Algorithm
Table 2. Outlier Detection Algorithm
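The threshold comparison of Table 2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: each threshold is treated as a (low, high) frequency range, following the "threshold frequency range" wording of Section 3.1.2, instead of a single value compared for exact equality. The numeric values in THRESHOLDS and the function names are hypothetical.

```python
# Hypothetical thresholds: (low, high) ranges per hundred tokens, for illustration only.
THRESHOLDS = {
    "CPHT": (3.0, 7.0),
    "WPHT": (0.5, 2.0),
    "MPHT": (0.0, 1.0),
}


def violated_metrics(sample, thresholds):
    """Return the metrics whose value falls outside its threshold range."""
    return [
        name
        for name, (low, high) in thresholds.items()
        if not low <= sample[name] <= high
    ]


def classify(dataset, thresholds):
    """Label every sample as 'outlier' or 'correct', mirroring the loop of Table 2."""
    labels = []
    for sample in dataset:
        # A sample is an outlier as soon as at least one metric violates its range.
        labels.append("outlier" if violated_metrics(sample, thresholds) else "correct")
    return labels


if __name__ == "__main__":
    dataset = [
        {"CPHT": 5.0, "WPHT": 1.0, "MPHT": 0.5},
        {"CPHT": 9.0, "WPHT": 1.0, "MPHT": 0.5},
    ]
    print(classify(dataset, THRESHOLDS))
```

Using a range avoids flagging a sample merely because a real-valued frequency is not exactly equal to a single threshold, and returning the violated metrics makes it possible to report which feature caused the decision.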
IV. CONCLUSION
In this paper, a framework for the detection of unexpected parameters in online writing samples has
been introduced. It uses stylometry to analyse the sample, followed by a feature selection phase that
identifies the parameter most frequently used by the person and thus establishes the individuality of the
person. The next phase detects unmatched patterns using an outlier detection algorithm, which compares the
parameters with pre-defined threshold values: if the value of a parameter does not match its threshold
value, the sample is an outlier. In future, more features can be added and the analysis extended
accordingly.
REFERENCES
[1] Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, Outlier Detection,
pp. 546-576.
[2] Stylometry. (n.d.). Retrieved December 2013, from Wikipedia: http://en.wikipedia.org/wiki/Stylometry
[3] Supervised Outlier Detection. (n.d.). Retrieved December 2013, from link.springer.com:
http://link.springer.com/chapter/10.1007%2F978-1-4614-6396-2_6#page-1
[4] Ramyaa, He Congzhou, Rasheed Khaleed, “Using machine learning techniques for stylometry”, Artificial
Intelligence Center, The University of Georgia, Athens.
[5] Pavelec D, Oliviera L.S, Justino E, Neto F.D Nobre, Batista L.V, “Compression and Stylometry for
Author identification”, Curitiba, Brazil.
[6] Goodman Robert, Hahn Matthew, Marella Madhuri, Ojar Christina, Westcott Sandy (2007), “The use of
Stylometry for Email Author identification: A feasibility study”, Seidenberg School of CSIS, Pace
University, 1 Martine Ave, White Plains, NY, 10606, USA.
[7] Alan Oral, Catal Cagatay (2009), “An Outlier detection algorithm based on the object oriented metrics
threshold”, Tubitak-Marmara Research Center, Information Technologies Institute, Turkey.

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 

D04422730

IOSR Journal of Engineering (IOSRJEN), www.iosrjen.org
ISSN (e): 2250-3021, ISSN (p): 2278-8719
Vol. 04, Issue 04 (April 2014), ||V2||, PP 27-30
International Organization of Scientific Research

Use of Stylometry and Outlier Detection Algorithm in Online Writing Sample to Detect Outliers

Sonia Sharma (1), Pushpendra Kumar Petrya (2)
(1) M.Tech Scholar, Department of Computer Science and Engg, Lovely Professional University (LPU)
(2) Assistant Professor, Department of Computer Science and Engg, Lovely Professional University

Abstract: Outlier detection is an emerging technique in data mining. Data objects that deviate from the other data objects in a data set are considered outliers. Outliers are classified as global outliers, collective outliers and contextual outliers, and a broad spectrum of techniques exists to detect them.[1] Here we propose an algorithm that detects outliers (unmatched samples) in online writing samples. The writing sample is first analysed using the well-known concept of stylometry, the study that helps to distinguish between the writing styles of two persons. It is assumed that the writing styles of two persons always differ and that every person's sample contains a feature which defines the uniqueness of that individual.[2] The model can be used for plagiarism detection, e-mail verification and author identification. The idea behind the proposed model is to protect data from criminal activity by making sure that the correct author is identified.

Keywords: Feature Selection, Outlier detection, Stylometry, Threshold metrics, Unexpected parameter

I. INTRODUCTION
Data mining is one of today's fast-growing fields. It involves the extraction of hidden information from large datasets. Various techniques are used in data mining, and outlier detection is one of the important ones.

1.1 Outlier Detection
Outlier detection is an important branch of data mining. It is the process of finding the data objects whose behavior differs from that of the other data objects. Outlier detection methods include supervised, unsupervised, semi-supervised, statistical, proximity-based and clustering-based methods.[7] Outlier detection has many applications, including credit card fraud detection, security, image processing, medicine and intrusion detection.[1]

1.2 Stylometry
Stylometry is the study of a person's writing behavior; it is mainly used to determine the writing style of a person, and through that style one person can be told apart from another. The analysis is based on various features, including the usage of vocabulary, the usage of function words, the structure and length of sentences and the formatting of text on the page. It is also assumed that the writing style and features of a person remain constant throughout all of his or her writings. In today's education system plagiarism is a serious issue, and stylometry can also be used to detect it. In this paper we propose a model that detects outliers (unmatched samples) in online samples by first using stylometry to analyze the writing sample and then applying an outlier detection algorithm.

II. RELATED WORK
Ramyaa et al., in "Using machine learning techniques for stylometry", justify the use of various machine learning techniques in stylometry. Their results were obtained using decision trees and neural networks on the test set.[3]
D. Pavelec et al. proposed another model, "Compression and Stylometry for Author Identification", which uses two different paradigms for author identification. The first is a compression-based algorithm and the second is a classical pattern recognition framework.[4]
Robert Goodman et al. proposed "The use of Stylometry for Email Author identification: A Feasibility Study". They applied a stylometric analysis of biometric keystrokes for authorship attribution.[5]
Oral Alan et al. proposed "An Outlier Detection Algorithm Based on Object-Oriented Metrics Thresholds", which detects outliers in a dataset based on threshold values defined for a set of metrics.[6]

III. PROPOSED WORK
This paper combines work in stylometry and outlier detection. Writing patterns are analysed to detect any unmatched pattern in the sample of an individual. The documents are analysed using stylometry, in which the writing pattern of an individual is analysed based on different writing features. The parameter(s) that establish the uniqueness of the individual are then selected. Any unexpected behavior found in the writing sample is considered an outlier and is removed using an outlier detection algorithm.
The scheme has three phases. The first phase analyses the writing sample of the individual using stylometry. The second phase selects the parameter that the individual uses most frequently. The third phase treats an unexpected pattern in the writing sample as an outlier and detects it using an outlier detection algorithm, which compares the parameter with predefined threshold values. The work flow is illustrated in Fig. 1.

3.1 Phase Description
The work is divided into separate phases, each of which is essential for the process to work effectively. The phases are explained below.

3.1.1 Analysis Phase
In this phase the writing sample is analysed based on linguistic features. So far, studies in stylometry have used sixty-two features, covering tagged text, parsed text and interpreted text,[3] e.g. number of words, number of sentences, number of commas, number of dashes, number of semicolons, number of question marks and number of underscores. Three more features are included in this paper:
- Number of "well" per hundred tokens
- Number of "particularly" per hundred tokens
- Number of "must" per hundred tokens

3.1.2 Parameter Selection Phase
The next phase selects an appropriate parameter which identifies the uniqueness of an individual's writing sample. A pre-defined threshold frequency range is assigned to each parameter used for the analysis, and the parameter is selected based on the observed frequencies.

3.1.3 Detection of Outliers
The last phase detects an unmatched pattern or unexpected parameter behaviour in the writing sample. Such a sample is considered an outlier and is identified using an outlier detection algorithm based on the thresholds defined for the metrics of the dataset. If the value of a parameter does not match the threshold value of that parameter, the sample is considered an outlier. This requires defining a set of metrics together with their threshold values.
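The per-hundred-token features of the analysis phase can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `per_hundred_tokens`, the regular-expression tokenizer and the choice to count each punctuation mark as one token are all assumptions made for illustration, and the "multiple commas" metric is left out because the paper does not define it precisely. The dictionary keys follow the abbreviations of Table 1.

```python
import re

# Assumed tokenizer: a run of letters/apostrophes is one token,
# and any other non-space character (punctuation, digit) is one token.
TOKEN_RE = re.compile(r"[A-Za-z']+|[^\sA-Za-z']")


def per_hundred_tokens(text):
    """Return stylometric frequencies per hundred tokens for one sample."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    total = len(tokens)
    if total == 0:
        raise ValueError("writing sample contains no tokens")

    def rate(symbol):
        # Occurrences of `symbol`, scaled to a hundred tokens.
        return 100.0 * tokens.count(symbol) / total

    return {
        "CPHT": rate(","),             # commas
        "DPHT": rate("-"),             # dashes
        "EPHT": rate("!"),             # exclamation signs
        "QPHT": rate("?"),             # question marks
        "SPHT": rate(";"),             # semicolons
        "TPHT": rate("the"),           # the word "the"
        "WPHT": rate("well"),          # the word "well"
        "PPHT": rate("particularly"),  # the word "particularly"
        "MPHT": rate("must"),          # the word "must"
    }


metrics = per_hundred_tokens("Well, the cat must go; the dog must stay!")
```

Under this tokenizer a hyphenated word such as "well-known" yields the tokens "well", "-" and "known", so it contributes to both WPHT and DPHT; a different tokenization would change the frequencies.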
Metric                                            Abbreviation   Metric                                            Abbreviation
Number of commas per hundred tokens               CPHT           Number of semicolons per hundred tokens           SPHT
Number of multiple commas per hundred tokens      MCPHT          Number of "the" per hundred tokens                TPHT
Number of dashes per hundred tokens               DPHT           Number of "well" per hundred tokens               WPHT
Number of exclamation signs per hundred tokens    EPHT           Number of "particularly" per hundred tokens       PPHT
Number of question marks per hundred tokens       QPHT           Number of "must" per hundred tokens               MPHT
Table 1. Metrics and Abbreviations

The outlier detection algorithm is given as:

Outlier detection algorithm (detection of outliers in the sample dataset)
Inputs: Writing sample / new writing sample.
Output: User identified / outlier detected.
Initialization of THR_CPHT, THR_MCPHT, THR_DPHT, THR_EPHT, THR_QPHT, THR_SPHT, THR_TPHT, THR_WPHT, THR_PPHT, THR_MPHT
for parameter in DATASET
    if (parameter.CPHT != THR_CPHT), sample data is an outlier, end if
    if (parameter.MCPHT != THR_MCPHT), sample data is an outlier, end if
    if (parameter.DPHT != THR_DPHT), sample data is an outlier, end if
    if (parameter.EPHT != THR_EPHT), sample data is an outlier, end if
    if (parameter.QPHT != THR_QPHT), sample data is an outlier, end if
    if (parameter.SPHT != THR_SPHT), sample data is an outlier, end if
    if (parameter.TPHT != THR_TPHT), sample data is an outlier, end if
    if (parameter.WPHT != THR_WPHT), sample data is an outlier, end if
    if (parameter.PPHT != THR_PPHT), sample data is an outlier, end if
    if (parameter.MPHT != THR_MPHT), sample data is an outlier, end if
    if none of the conditions above holds, sample data is correct
end for
END Algorithm
Table 2. Outlier Detection Algorithm
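The threshold comparison of Table 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: each threshold is taken to be a (low, high) range, following the "threshold frequency range" of the parameter selection phase, because an exact `!=` test on real-valued frequencies would flag almost every sample. The function names and all numeric values are hypothetical; the paper gives no concrete thresholds.

```python
def find_outlier_metrics(sample_metrics, thresholds):
    """Return the metric names whose value lies outside its threshold range."""
    deviating = []
    for name, (low, high) in thresholds.items():
        value = sample_metrics[name]
        if not (low <= value <= high):
            deviating.append(name)
    return deviating


def is_outlier(sample_metrics, thresholds):
    """A sample is an outlier as soon as one metric misses its range."""
    return bool(find_outlier_metrics(sample_metrics, thresholds))


# Hypothetical threshold ranges for one author (illustrative values only).
THRESHOLDS = {"CPHT": (4.0, 8.0), "TPHT": (3.0, 7.0), "MPHT": (0.0, 1.0)}

# Hypothetical per-hundred-token frequencies of two writing samples.
known_sample = {"CPHT": 5.5, "TPHT": 4.2, "MPHT": 0.4}
new_sample = {"CPHT": 5.0, "TPHT": 9.1, "MPHT": 2.3}

print(is_outlier(known_sample, THRESHOLDS))          # False: all metrics in range
print(find_outlier_metrics(new_sample, THRESHOLDS))  # ['TPHT', 'MPHT']
```

Returning the list of deviating metrics, rather than only a yes/no answer, shows which parameters caused a sample to be flagged.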
IV. CONCLUSION
In this paper, a framework for the detection of unexpected parameters in online samples has been introduced. It uses stylometry to analyse the sample and includes a feature selection phase, which identifies the parameter most frequently used by the person and thereby establishes the individuality of that person. The next phase detects an unmatched pattern using an outlier detection algorithm, which compares pre-defined threshold values with the parameters: if the value of a parameter does not match the threshold value, the sample is an outlier. In future, more features can be added and the analysis can be extended to those features.

REFERENCES
[1] Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques", Outlier Detection, pp. 546-576.
[2] Stylometry. (n.d.). Retrieved December 2013, from Wikipedia: http://en.wikipedia.org/wiki/Stylometry
[3] Ramyaa, He Congzhou, Rasheed Khaleed, "Using machine learning techniques for stylometry", Artificial Intelligence Center, The University of Georgia, Athens.
[4] Pavelec D, Oliveira L.S, Justino E, Neto F.D Nobre, Batista L.V, "Compression and Stylometry for Author identification", Curitiba, Brazil.
[5] Goodman Robert, Hahn Matthew, Marella Madhuri, Ojar Christina, Westcott Sandy (2007), "The use of Stylometry for Email Author identification: A feasibility study", Seidenberg School of CSIS, Pace University, 1 Martine Ave, White Plains, NY, 10606, USA.
[6] Alan Oral, Catal Cagatay (2009), "An Outlier detection algorithm based on the object oriented metrics threshold", Tubitak-Marmara Research Center, Information Technologies Institute, Turkey.
[7] Supervised Outlier Detection. (n.d.). Retrieved December 2013, from link.springer.com: http://link.springer.com/chapter/10.1007%2F978-1-4614-6396-2_6#page-1