SlideShare a Scribd company logo
1 of 26
Download to read offline
Authorship Attribution & Forensic
Linguistics with Python/Scikit-Learn/Pandas
Kostas Perifanos, Search & Analytics Engineer
@perifanoskostas
Learner Analytics & Data Science Team
Definition
“Automated authorship attribution is the problem
of identifying the author of an anonymous text, or
text whose authorship is in doubt” [Love, 2002]
Domains of application
● Author attribution
● Author verification
● Plagiarism detection
● Author profiling [age, education, gender]
● Stylistic inconsistencies [multiple collaborators/authors]
● Can be also applied in computer code, music scores, ...
“Automated authorship attribution is the problem
of identifying the author of an anonymous text, or
text whose authorship is in doubt”
“Automation”, “identification”, “text”: Machine Learning
A classification problem
● Define classes
● Extract features
● Train ML classifier
● Evaluate
Class definition[s]
● AuthorA, AuthorB, AuthorC, …
● Author vs rest-of-the-world [1-class classification
problem]
● Or even, in extended contexts, a clustering problem
Feature extraction
● Lexical features
● Character features
● Syntactic features
● Application specific
Feature extraction
● Lexical features
● Word length, sentence length etc
● Vocabulary richness [lexical density: functional word vs content words ratio]
● Word frequencies
● Word n-grams
● Spelling errors
Feature extraction
● Character features
● Character types (letters, digits, punctuation)
● Character n-grams (fixed and variable length)
● Compression methods [Entropy, which is really nice but for another talk :) ]
Feature extraction
● Syntactic features
● Part-of-speech tags [eg Verbs (VB), Nouns (NN), Prepositions (PP) etc]
● Sentence and phrase structure
● Errors
Feature extraction
● Semantic features
● Synonyms
● Semantic dependencies
● Application specific features
● Structural
● Content specific
● Language specific
Demo application
Let’s apply a classification algorithm on texts, using word
and character n-grams and POS n-grams
Data set (1): 12867 tweets from 10 users, in Greek
Language, collected in 2012 [4]
Data set (2): 1157 judgments from 2 judges, in English [5]
But what’s an “n-gram”?
[…]an n-gram is a contiguous sequence of n items from a given sequence of
text. [http://en.wikipedia.org/wiki/N-gram]
So, for the sentence above:
word 2-grams (or bigrams): [ (an, n-gram), (n-gram, is), (is, a), (a,
contiguous), …]
char 2-grams: [ ‘an’, ‘n ‘, ‘ n’, ‘n-’, ‘-g’, …]
We will use the TF-IDF weighted frequencies of both word and character n-
grams as features.
Enter Python
Flashback [or, transforming experiments to accepted papers in t<=2h]
A few months earlier, Dec 13, just one day before my holidays I get this call...
Load the dataset
# assume we have the data in 10 tsv files, one file per author.
# each file consists of two columns, id and actual text
import pandas as pd
def load_corpus(input_dir):
trainfiles= [ f for f in listdir( input_dir ) if isfile(join(input_dir ,f)) ]
trainset = []
for filename in trainfiles:
df = pd.read_csv( input_dir + "/" + filename , sep="t",
dtype={ 'id':object, 'text':object } )
for row in df['text']:
trainset.append( { "label":filename, "text": row } )
return trainset
Extract features [1]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
word_vector = TfidfVectorizer( analyzer="word" , ngram_range=(2,2),
max_features = 2000, binary = False )
char_vector = TfidfVectorizer(ngram_range=(2, 3), analyzer="char",
max_features = 2000,binary=False, min_df=0 )
for item in trainset:
corpus.append( item[“text”] )
classes.append( item["label"] )
#our vectors are the feature union of word/char ngrams
vectorizer = FeatureUnion([ ("chars", char_vector),("words", word_vector) ] )
# load corpus, use fit_transform to get vectors
X = vectorizer.fit_transform(corpus)
Extract features [2]
import nltk
#generate POS tags using nltk, return the sequence as whitespace separated string
def pos_tags(txt):
tokens = nltk.word_tokenize(txt)
return " ".join( [ tag for (word, tag) in nltk.pos_tag( tokens ) ] )
#combine word and char ngrams with POS-ngrams
tag_vector = TfidfVectorizer( analyzer="word" , ngram_range=(2,2),
binary = False, max_features= 2000, decode_error = 'ignore' )
X1 = vectorizer.fit_transform( corpus )
X2 = tag_vector.fit_transform( tags )
#concatenate the two matrices
X = sp.hstack((X1, X2), format='csr')
Extract features [2.1]
#this last part is a little bit tricky
X = sp.hstack((X1, X2), format='csr')
There was no (obvious) way to use FeatureUnion
X1, X2 are sparse matrices - so, we are using hstack to stack two matrices horizontally
(column wise)
http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html
Put everything together
word n-
grams
character n-
grams
POS tags n-
grams
(optional)
feature vector components
Author: A function of
Fit the model and evaluate (10-fold-CV)
model = LinearSVC( loss='l1', dual=True)
scores = cross_validation.cross_val_score( estimator = model,
X = matrix.toarray(),
y= np.asarray(classes), cv=10 )
print "10-fold cross validation results:", "mean score = ", scores.mean(), 
"std=", scores.std(), ", num folds =", len(scores)
Results: 96% accuracy for two authors, using 10-fold-
CV
Evaluate (train set vs test set)
from sklearn.cross_validation import train_test_split
model = LinearSVC( loss='l1', dual=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_pred = model.fit(X_train, y_train).predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
pl.matshow(cm)
pl.title('Confusion matrix')
pl.colorbar()
pl.ylabel('True label')
pl.xlabel('Predicted label')
pl.show()
[[ 57 1 2 0 4 8 3 27 13 2]
[ 0 71 1 1 0 13 0 8 6 0]
[ 3 0 51 1 3 5 4 8 25 0]
[ 0 1 0 207 2 8 8 8 82 2]
[ 5 4 3 7 106 30 10 25 23 3]
[ 9 11 3 15 11 350 14 46 42 12]
[ 3 1 3 8 13 16 244 21 38 5]
[ 8 12 10 3 11 46 13 414 39 8]
[ 8 4 7 59 11 21 31 49 579 10]
[ 2 6 1 4 3 24 13 29 15 61]]
Confusion Matrix
Interesting questions
● Many authors?
● Short texts / “micro messages"?
● Is writing style affected by time/age?
● Can we detect “mood”?
● Psychological profiles?
● What about obfuscation?
● Even more subtle problems [PAN Workshop 2013]
● Other applications (code, music scores etc)
References & Libraries
1. Authorship Attribution: An Introduction, Harold Love, 2002
2. A Survey of Modern Authorship Attribution Methods,Efstathios
Stamatatos, 2007
3. Authorship Attribution, Patrick Juola, 2008
4. Authorship Attribution in Greek Tweets Using Author's Multilevel
N-Gram Profiles, G. Mikros, Kostas Perifanos. 2012
5. Authorship Attribution with Latent Dirichlet Allocation,
Seroussi,Zukerman, Bohnert, 2011
Python libraries:
● Pandas: http://pandas.pydata.org/
● Scikit-learn: http://scikit-learn.org/stable/
● nltk, http://www.nltk.org/
Data:
www.csse.monash.edu.au/research/umnl/data
Demo Python code:
https://gist.github.com/kperi/f0730ff3028f7be86b15
Questions?
Thank you!

More Related Content

What's hot

The innateness theory_and_theories_of_language_acquisition[1]
The innateness theory_and_theories_of_language_acquisition[1]The innateness theory_and_theories_of_language_acquisition[1]
The innateness theory_and_theories_of_language_acquisition[1]
Diego M.
 
Forensic linguistics
Forensic linguistics Forensic linguistics
Forensic linguistics
mimizin
 
Online signature recognition
Online signature recognitionOnline signature recognition
Online signature recognition
Piyush Mittal
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
Alicia Ruiz
 

What's hot (20)

Natural language processing
Natural language processing Natural language processing
Natural language processing
 
The innateness theory_and_theories_of_language_acquisition[1]
The innateness theory_and_theories_of_language_acquisition[1]The innateness theory_and_theories_of_language_acquisition[1]
The innateness theory_and_theories_of_language_acquisition[1]
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Natural language processing (NLP)
Natural language processing (NLP) Natural language processing (NLP)
Natural language processing (NLP)
 
Pidgins and creoles
Pidgins and creolesPidgins and creoles
Pidgins and creoles
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Forensic linguistics
Forensic linguistics Forensic linguistics
Forensic linguistics
 
Forensic linguistics
Forensic linguisticsForensic linguistics
Forensic linguistics
 
A Right to Our Voice: Linguistic Human Rights and Peace Education
A Right to Our Voice: Linguistic Human Rights and Peace EducationA Right to Our Voice: Linguistic Human Rights and Peace Education
A Right to Our Voice: Linguistic Human Rights and Peace Education
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 
The Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information RetrievalThe Role of Natural Language Processing in Information Retrieval
The Role of Natural Language Processing in Information Retrieval
 
Discourse as dialogue
Discourse as dialogueDiscourse as dialogue
Discourse as dialogue
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
Stylometry In Authentication
Stylometry In AuthenticationStylometry In Authentication
Stylometry In Authentication
 
Online signature recognition
Online signature recognitionOnline signature recognition
Online signature recognition
 
Discourse Analysis and Grammar (cohesive devices)
Discourse Analysis and Grammar (cohesive devices)Discourse Analysis and Grammar (cohesive devices)
Discourse Analysis and Grammar (cohesive devices)
 
HISTORY AND TEXT TYPES
HISTORY AND TEXT TYPESHISTORY AND TEXT TYPES
HISTORY AND TEXT TYPES
 
Theories of Psycholinguistics.
Theories of Psycholinguistics.Theories of Psycholinguistics.
Theories of Psycholinguistics.
 
Corpus linguistics
Corpus linguisticsCorpus linguistics
Corpus linguistics
 

Viewers also liked

Viewers also liked (20)

Think machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetanThink machine-learning-with-scikit-learn-chetan
Think machine-learning-with-scikit-learn-chetan
 
Intro to scikit-learn
Intro to scikit-learnIntro to scikit-learn
Intro to scikit-learn
 
Exploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-LearnExploring Machine Learning in Python with Scikit-Learn
Exploring Machine Learning in Python with Scikit-Learn
 
Tree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptionsTree models with Scikit-Learn: Great models with little assumptions
Tree models with Scikit-Learn: Great models with little assumptions
 
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael VaroquauxPyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
Pyparis2017 / Scikit-learn - an incomplete yearly review, by Gael Varoquaux
 
Machine learning with scikit-learn
Machine learning with scikit-learnMachine learning with scikit-learn
Machine learning with scikit-learn
 
Clustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn TutorialClustering: A Scikit Learn Tutorial
Clustering: A Scikit Learn Tutorial
 
Intro to machine learning with scikit learn
Intro to machine learning with scikit learnIntro to machine learning with scikit learn
Intro to machine learning with scikit learn
 
Introduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learnIntroduction to Machine Learning with Python and scikit-learn
Introduction to Machine Learning with Python and scikit-learn
 
Machine Learning with scikit-learn
Machine Learning with scikit-learnMachine Learning with scikit-learn
Machine Learning with scikit-learn
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Intro to scikit learn may 2017
Intro to scikit learn may 2017Intro to scikit learn may 2017
Intro to scikit learn may 2017
 
Data Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learnData Science and Machine Learning Using Python and Scikit-learn
Data Science and Machine Learning Using Python and Scikit-learn
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
 
Realtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learnRealtime predictive analytics using RabbitMQ & scikit-learn
Realtime predictive analytics using RabbitMQ & scikit-learn
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
 
Converting Scikit-Learn to PMML
Converting Scikit-Learn to PMMLConverting Scikit-Learn to PMML
Converting Scikit-Learn to PMML
 
Accelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-LearnAccelerating Random Forests in Scikit-Learn
Accelerating Random Forests in Scikit-Learn
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 

Similar to Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos

Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
kperi
 
SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning Approach
Reza Rahimi
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Olivier Grisel
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
carliotwaycave
 

Similar to Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos (20)

Authorship attribution pydata london
Authorship attribution   pydata londonAuthorship attribution   pydata london
Authorship attribution pydata london
 
SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning Approach
 
R교육1
R교육1R교육1
R교육1
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Python lecture 05
Python lecture 05Python lecture 05
Python lecture 05
 
Scala 3 Is Coming: Martin Odersky Shares What To Know
Scala 3 Is Coming: Martin Odersky Shares What To KnowScala 3 Is Coming: Martin Odersky Shares What To Know
Scala 3 Is Coming: Martin Odersky Shares What To Know
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
 
Session 07 text data.pptx
Session 07 text data.pptxSession 07 text data.pptx
Session 07 text data.pptx
 
Session 07 text data.pptx
Session 07 text data.pptxSession 07 text data.pptx
Session 07 text data.pptx
 
Session 07 text data.pptx
Session 07 text data.pptxSession 07 text data.pptx
Session 07 text data.pptx
 
R Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB AcademyR Programming Tutorial for Beginners - -TIB Academy
R Programming Tutorial for Beginners - -TIB Academy
 
R Basics
R BasicsR Basics
R Basics
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
 
R basics
R basicsR basics
R basics
 
Mca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structureMca ii dfs u-1 introduction to data structure
Mca ii dfs u-1 introduction to data structure
 
Bca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structureBca ii dfs u-1 introduction to data structure
Bca ii dfs u-1 introduction to data structure
 
Language translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlowLanguage translation with Deep Learning (RNN) with TensorFlow
Language translation with Deep Learning (RNN) with TensorFlow
 
An Intoduction to R
An Intoduction to RAn Intoduction to R
An Intoduction to R
 
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docxINFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
INFORMATIVE ESSAYThe purpose of the Informative Essay assignme.docx
 

More from PyData

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 

Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos

  • 1. Authorship Attribution & Forensic Linguistics with Python/Scikit-Learn/Pandas Kostas Perifanos, Search & Analytics Engineer @perifanoskostas Learner Analytics & Data Science Team
  • 2. Definition “Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt” [Love, 2002]
  • 3. Domains of application ● Author attribution ● Author verification ● Plagiarism detection ● Author profiling [age, education, gender] ● Stylistic inconsistencies [multiple collaborators/authors] ● Can be also applied in computer code, music scores, ...
  • 4. “Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt” “Automation”, “identification”, “text”: Machine Learning
  • 5. A classification problem ● Define classes ● Extract features ● Train ML classifier ● Evaluate
  • 6. Class definition[s] ● AuthorA, AuthorB, AuthorC, … ● Author vs rest-of-the-world [1-class classification problem] ● Or even, in extended contexts, a clustering problem
  • 7. Feature extraction ● Lexical features ● Character features ● Syntactic features ● Application specific
  • 8. Feature extraction ● Lexical features ● Word length, sentence length etc ● Vocabulary richness [lexical density: functional word vs content words ratio] ● Word frequencies ● Word n-grams ● Spelling errors
  • 9. Feature extraction ● Character features ● Character types (letters, digits, punctuation) ● Character n-grams (fixed and variable length) ● Compression methods [Entropy, which is really nice but for another talk :) ]
  • 10. Feature extraction ● Syntactic features ● Part-of-speech tags [eg Verbs (VB), Nouns (NN), Prepositions (PP) etc] ● Sentence and phrase structure ● Errors
  • 11. Feature extraction ● Semantic features ● Synonyms ● Semantic dependencies ● Application specific features ● Structural ● Content specific ● Language specific
  • 12. Demo application Let’s apply a classification algorithm on texts, using word and character n-grams and POS n-grams Data set (1): 12867 tweets from 10 users, in Greek Language, collected in 2012 [4] Data set (2): 1157 judgments from 2 judges, in English [5]
  • 13. But what’s an “n-gram”? […]an n-gram is a contiguous sequence of n items from a given sequence of text. [http://en.wikipedia.org/wiki/N-gram] So, for the sentence above: word 2-grams (or bigrams): [ (an, n-gram), (n-gram, is), (is, a), (a, contiguous), …] char 2-grams: [ ‘an’, ‘n ‘, ‘ n’, ‘n-’, ‘-g’, …] We will use the TF-IDF weighted frequencies of both word and character n- grams as features.
  • 14. Enter Python Flashback [or, transforming experiments to accepted papers in t<=2h] A few months earlier, Dec 13, just one day before my holidays I get this call...
  • 15. Load the dataset # assume we have the data in 10 tsv files, one file per author. # each file consists of two columns, id and actual text import pandas as pd def load_corpus(input_dir): trainfiles= [ f for f in listdir( input_dir ) if isfile(join(input_dir ,f)) ] trainset = [] for filename in trainfiles: df = pd.read_csv( input_dir + "/" + filename , sep="t", dtype={ 'id':object, 'text':object } ) for row in df['text']: trainset.append( { "label":filename, "text": row } ) return trainset
  • 16. Extract features [1] from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.pipeline import FeatureUnion word_vector = TfidfVectorizer( analyzer="word" , ngram_range=(2,2), max_features = 2000, binary = False ) char_vector = TfidfVectorizer(ngram_range=(2, 3), analyzer="char", max_features = 2000,binary=False, min_df=0 ) for item in trainset: corpus.append( item[“text”] ) classes.append( item["label"] ) #our vectors are the feature union of word/char ngrams vectorizer = FeatureUnion([ ("chars", char_vector),("words", word_vector) ] ) # load corpus, use fit_transform to get vectors X = vectorizer.fit_transform(corpus)
  • 17. Extract features [2] import nltk #generate POS tags using nltk, return the sequence as whitespace separated string def pos_tags(txt): tokens = nltk.word_tokenize(txt) return " ".join( [ tag for (word, tag) in nltk.pos_tag( tokens ) ] ) #combine word and char ngrams with POS-ngrams tag_vector = TfidfVectorizer( analyzer="word" , ngram_range=(2,2), binary = False, max_features= 2000, decode_error = 'ignore' ) X1 = vectorizer.fit_transform( corpus ) X2 = tag_vector.fit_transform( tags ) #concatenate the two matrices X = sp.hstack((X1, X2), format='csr')
  • 18. Extract features [2.1] #this last part is a little bit tricky X = sp.hstack((X1, X2), format='csr') There was no (obvious) way to use FeatureUnion X1, X2 are sparse matrices - so, we are using hstack to stack two matrices horizontally (column wise) http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html
  • 19. Put everything together word n- grams character n- grams POS tags n- grams (optional) feature vector components Author: A function of
  • 20. Fit the model and evaluate (10-fold-CV) model = LinearSVC( loss='l1', dual=True) scores = cross_validation.cross_val_score( estimator = model, X = matrix.toarray(), y= np.asarray(classes), cv=10 ) print "10-fold cross validation results:", "mean score = ", scores.mean(), "std=", scores.std(), ", num folds =", len(scores) Results: 96% accuracy for two authors, using 10-fold- CV
  • 21. Evaluate (train set vs test set) from sklearn.cross_validation import train_test_split model = LinearSVC( loss='l1', dual=True) X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) y_pred = model.fit(X_train, y_train).predict(X_test) cm = confusion_matrix(y_test, y_pred) print(cm) pl.matshow(cm) pl.title('Confusion matrix') pl.colorbar() pl.ylabel('True label') pl.xlabel('Predicted label') pl.show()
  • 22. [[ 57 1 2 0 4 8 3 27 13 2] [ 0 71 1 1 0 13 0 8 6 0] [ 3 0 51 1 3 5 4 8 25 0] [ 0 1 0 207 2 8 8 8 82 2] [ 5 4 3 7 106 30 10 25 23 3] [ 9 11 3 15 11 350 14 46 42 12] [ 3 1 3 8 13 16 244 21 38 5] [ 8 12 10 3 11 46 13 414 39 8] [ 8 4 7 59 11 21 31 49 579 10] [ 2 6 1 4 3 24 13 29 15 61]] Confusion Matrix
  • 23. Interesting questions ● Many authors? ● Short texts / “micro messages"? ● Is writing style affected by time/age? ● Can we detect “mood”? ● Psychological profiles? ● What about obfuscation? ● Even more subtle problems [PAN Workshop 2013] ● Other applications (code, music scores etc)
  • 24. References & Libraries 1. Authorship Attribution: An Introduction, Harold Love, 2002 2. A Survey of Modern Authorship Attribution Methods,Efstathios Stamatatos, 2007 3. Authorship Attribution, Patrick Juola, 2008 4. Authorship Attribution in Greek Tweets Using Author's Multilevel N-Gram Profiles, G. Mikros, Kostas Perifanos. 2012 5. Authorship Attribution with Latent Dirichlet Allocation, Seroussi,Zukerman, Bohnert, 2011 Python libraries: ● Pandas: http://pandas.pydata.org/ ● Scikit-learn: http://scikit-learn.org/stable/ ● nltk, http://www.nltk.org/ Data: www.csse.monash.edu.au/research/umnl/data Demo Python code: https://gist.github.com/kperi/f0730ff3028f7be86b15