Authorship Attribution & Forensic Linguistics with Python/Scikit-Learn/Pandas

Kostas Perifanos, Search & Analytics Engineer
Authorship attribution pydata london

  1. Authorship Attribution & Forensic Linguistics with Python/Scikit-Learn/Pandas
     Kostas Perifanos, Search & Analytics Engineer, @perifanoskostas, Learner Analytics & Data Science Team
  2. Definition
     “Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt” [Love, 2002]
  3. Domains of application
     ● Author attribution
     ● Author verification
     ● Plagiarism detection
     ● Author profiling [age, education, gender]
     ● Stylistic inconsistencies [multiple collaborators/authors]
     ● Can also be applied to computer code, music scores, ...
  4. “Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt”
     “Automation”, “identification”, “text”: Machine Learning
  5. A classification problem
     ● Define classes
     ● Extract features
     ● Train ML classifier
     ● Evaluate
  6. Class definition[s]
     ● AuthorA, AuthorB, AuthorC, …
     ● Author vs rest-of-the-world [1-class classification problem]
     ● Or even, in extended contexts, a clustering problem
  7. Feature extraction
     ● Lexical features
     ● Character features
     ● Syntactic features
     ● Application specific
  8. Feature extraction
     ● Lexical features
        ● Word length, sentence length etc
        ● Vocabulary richness [lexical density: functional word vs content words ratio]
        ● Word frequencies
        ● Word n-grams
        ● Spelling errors
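A few of the lexical features listed on this slide can be sketched in plain Python. This is only an illustration, not the talk's code: the function name `lexical_features`, the tiny `FUNCTION_WORDS` list, the naive sentence splitter, and the type/token ratio (used here as one simple stand-in for "vocabulary richness") are all assumptions. The sketch also assumes a non-empty text.

```python
import re

# Hypothetical, deliberately tiny function-word list; a real study would use a fuller one.
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "is", "it", "that"}

def lexical_features(text):
    """Return a few simple lexical style markers for one non-empty text."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of word characters or apostrophes, lower-cased.
    words = re.findall(r"[\w']+", text.lower())
    n_words = len(words)
    function_count = sum(1 for w in words if w in FUNCTION_WORDS)
    return {
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": n_words / len(sentences),   # in words per sentence
        "type_token_ratio": len(set(words)) / n_words,      # distinct words / all words
        "function_word_ratio": function_count / n_words,    # share of function words
    }

features = lexical_features("The cat sat on the mat. It is a cat!")
print(features)
```

Each text then becomes a short numeric vector that could be fed to a classifier next to the n-gram features used later in the talk.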
  9. Feature extraction
     ● Character features
        ● Character types (letters, digits, punctuation)
        ● Character n-grams (fixed and variable length)
        ● Compression methods [Entropy, which is really nice but for another talk :) ]
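As a rough illustration of the character-level features on this slide, the sketch below computes the share of each character type and the Shannon entropy of the character distribution. Both function names are hypothetical, and the entropy shown is only the simplest (zeroth-order) reading of the slide's "Entropy" remark, not a compression-based method.

```python
import math
from collections import Counter

def char_type_ratios(text):
    """Share of letters, digits, whitespace and everything else in a non-empty text."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            counts["letter"] += 1
        elif ch.isdigit():
            counts["digit"] += 1
        elif ch.isspace():
            counts["space"] += 1
        else:
            counts["punct"] += 1   # punctuation and any other symbol
    total = len(text)
    return {kind: counts[kind] / total for kind in ("letter", "digit", "space", "punct")}

def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())

print(char_type_ratios("ab 1!"))
print(char_entropy("aabb"))
```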
  10. Feature extraction
     ● Syntactic features
        ● Part-of-speech tags [eg Verbs (VB), Nouns (NN), Prepositions (PP) etc]
        ● Sentence and phrase structure
        ● Errors
  11. Feature extraction
     ● Semantic features
        ● Synonyms
        ● Semantic dependencies
     ● Application specific features
        ● Structural
        ● Content specific
        ● Language specific
  12. Demo application
     Let’s apply a classification algorithm on texts, using word and character n-grams and POS n-grams
     Data set (1): 12867 tweets from 10 users, in Greek Language, collected in 2012 [4]
     Data set (2): 1157 judgments from 2 judges, in English [5]
  13. But what’s an “n-gram”?
     […]an n-gram is a contiguous sequence of n items from a given sequence of text. [http://en.wikipedia.org/wiki/N-gram]
     So, for the sentence above:
     word 2-grams (or bigrams): [ (an, n-gram), (n-gram, is), (is, a), (a, contiguous), …]
     char 2-grams: [ ‘an’, ‘n ’, ‘ n’, ‘n-’, ‘-g’, …]
     We will use the TF-IDF weighted frequencies of both word and character n-grams as features.
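The n-gram definition on this slide can be reproduced in a few lines of plain Python. This is a minimal sketch: the helper names `word_ngrams` and `char_ngrams` are made up for illustration, and words are split on whitespace only.

```python
def word_ngrams(text, n):
    """Contiguous word n-grams, splitting on whitespace."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Contiguous character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "an n-gram is a contiguous sequence of n items from a given sequence of text."
print(word_ngrams(sentence, 2)[:4])
# [('an', 'n-gram'), ('n-gram', 'is'), ('is', 'a'), ('a', 'contiguous')]
print(char_ngrams(sentence, 2)[:5])
# ['an', 'n ', ' n', 'n-', '-g']
```

Character n-grams keep spaces and punctuation, which is part of why they pick up on habits such as spacing and hyphenation.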
  14. Enter Python
     Flashback [or, transforming experiments to accepted papers in t<=2h]
     A few months earlier, Dec 13, just one day before my holidays, I get this call...
  15. Load the dataset
     # assume we have the data in 10 tsv files, one file per author.
     # each file consists of two columns, id and actual text
     from os import listdir
     from os.path import isfile, join
     import pandas as pd

     def load_corpus(input_dir):
         # every regular file in input_dir is one author's tweets
         trainfiles = [f for f in listdir(input_dir) if isfile(join(input_dir, f))]
         trainset = []
         for filename in trainfiles:
             df = pd.read_csv(join(input_dir, filename), sep="\t",
                              dtype={'id': object, 'text': object})
             # the file name doubles as the author label
             for row in df['text']:
                 trainset.append({"label": filename, "text": row})
         return trainset
  16. Extract features [1]
     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.pipeline import FeatureUnion

     word_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                                   max_features=2000, binary=False)
     # min_df=1 keeps every n-gram, like the slide's min_df=0
     char_vector = TfidfVectorizer(analyzer="char", ngram_range=(2, 3),
                                   max_features=2000, binary=False, min_df=1)

     # split the loaded trainset into texts and author labels
     corpus, classes = [], []
     for item in trainset:
         corpus.append(item["text"])
         classes.append(item["label"])

     # our vectors are the feature union of word/char ngrams
     vectorizer = FeatureUnion([("chars", char_vector), ("words", word_vector)])

     # load corpus, use fit_transform to get vectors
     X = vectorizer.fit_transform(corpus)
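To make the TF-IDF weighting behind `TfidfVectorizer` less of a black box, here is a small pure-Python sketch over character n-grams. It is not the library's implementation. The smoothed idf formula, ln((1 + N) / (1 + df)) + 1 followed by L2 normalisation, follows what the scikit-learn documentation describes for the default settings; the function names and the dict-based vectors are illustrative assumptions.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(docs, n=2):
    """Return one {ngram: weight} dict per document, L2-normalised."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each n-gram occur?
    df = Counter(gram for c in counts for gram in c)
    # Smoothed idf: n-grams found in every document get the lowest weight.
    idf = {gram: math.log((1 + n_docs) / (1 + d)) + 1 for gram, d in df.items()}
    vectors = []
    for c in counts:
        weights = {gram: tf * idf[gram] for gram, tf in c.items()}
        # Guard against documents shorter than n, which have no n-grams.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in weights.items()})
    return vectors

vectors = tfidf(["abab", "abcd"])
print(vectors[1])
```

In the second document, `'ab'` occurs in both texts and therefore ends up with a lower weight than `'bc'` or `'cd'`, which are unique to that text.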
  17. Extract features [2]
     import nltk
     import scipy.sparse as sp  # sp is assumed to be scipy.sparse (its hstack accepts format=)

     # generate POS tags using nltk, return the sequence as whitespace separated string
     def pos_tags(txt):
         tokens = nltk.word_tokenize(txt)
         return " ".join([tag for (word, tag) in nltk.pos_tag(tokens)])

     # combine word and char ngrams with POS-ngrams
     tag_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2), binary=False,
                                  max_features=2000, decode_error='ignore')

     # not shown on the slide: assumed to be one tag string per text
     tags = [pos_tags(txt) for txt in corpus]

     X1 = vectorizer.fit_transform(corpus)
     X2 = tag_vector.fit_transform(tags)

     # concatenate the two matrices
     X = sp.hstack((X1, X2), format='csr')
  18. Extract features [2.1]
     # this last part is a little bit tricky
     X = sp.hstack((X1, X2), format='csr')
     There was no (obvious) way to use FeatureUnion here. X1 and X2 are sparse matrices with one row per text, so we use hstack to stack the two matrices horizontally (column wise): the POS n-gram columns are appended to the right of the word/char n-gram columns.
     http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html
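What "stack horizontally" means for sparse rows can be shown without any library. In the sketch below each matrix is a list of rows and each row is a `{column_index: value}` dict; that representation and the name `hstack_rows` are assumptions made for illustration, not how SciPy stores its matrices.

```python
def hstack_rows(left, left_width, right):
    """Column-wise concatenation of two row-aligned sparse matrices.

    Each matrix is a list of rows; each row is a {column_index: value} dict.
    left_width is the number of columns of the left matrix.
    """
    if len(left) != len(right):
        raise ValueError("both matrices need the same number of rows")
    stacked = []
    for lrow, rrow in zip(left, right):
        row = dict(lrow)
        # Shift the right-hand columns past the last left-hand column.
        row.update({left_width + col: val for col, val in rrow.items()})
        stacked.append(row)
    return stacked

X1 = [{0: 0.5, 2: 0.1}, {1: 0.9}]   # 2 texts x 3 word/char n-gram columns
X2 = [{0: 0.7}, {1: 0.3}]           # 2 texts x 2 POS n-gram columns
X = hstack_rows(X1, 3, X2)
print(X)
```

Row i of the result still describes text i, which is why both matrices must be built from the texts in the same order.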
  19. Put everything together
     Author: a function of the feature vector components
     ● word ngrams
     ● character ngrams
     ● POS tag ngrams (optional)
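  20. Fit the model and evaluate (10-fold-CV)
     import numpy as np
     from sklearn.svm import LinearSVC
     # sklearn.cross_validation on the slide; the function now lives in model_selection
     from sklearn.model_selection import cross_val_score

     # loss='hinge' is the current name for the slide's loss='l1'
     model = LinearSVC(loss='hinge', dual=True)
     # the slide passes matrix.toarray(); matrix is assumed to be the feature matrix X
     scores = cross_val_score(estimator=model, X=X.toarray(),
                              y=np.asarray(classes), cv=10)
     print("10-fold cross validation results:", "mean score = ", scores.mean(),
           "std=", scores.std(), ", num folds =", len(scores))

     Results: 96% accuracy for two authors, using 10-fold CV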
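For readers new to k-fold cross validation, the splitting logic can be sketched in plain Python. This is a simplified illustration with hypothetical function names: it uses plain, unshuffled, contiguous folds, whereas scikit-learn stratifies the folds by class when given an integer `cv` and a classifier, which this sketch does not attempt.

```python
import math

def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def mean_and_std(scores):
    """Mean and population standard deviation of the per-fold scores."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, math.sqrt(variance)

for train, test in kfold_indices(10, 3):
    print(len(train), test)
```

Every sample lands in exactly one test fold, so each text is scored once by a model that never saw it during training.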
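  21. Evaluate (train set vs test set)
     import matplotlib.pyplot as pl  # pl is assumed to be a matplotlib plotting module
     from sklearn.metrics import confusion_matrix
     # sklearn.cross_validation on the slide; the function now lives in model_selection
     from sklearn.model_selection import train_test_split
     from sklearn.svm import LinearSVC

     # loss='hinge' is the current name for the slide's loss='l1'
     model = LinearSVC(loss='hinge', dual=True)
     # the slide passes y, assumed to be the author labels collected in classes
     X_train, X_test, y_train, y_test = train_test_split(X, classes, random_state=0)
     y_pred = model.fit(X_train, y_train).predict(X_test)
     cm = confusion_matrix(y_test, y_pred)
     print(cm)
     pl.matshow(cm)
     pl.title('Confusion matrix')
     pl.colorbar()
     pl.ylabel('True label')
     pl.xlabel('Predicted label')
     pl.show()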
  22. Confusion Matrix
     [[ 57 [ 0 [ 3 [ 0 [ 5 [ 9 [ 3 [ 8 [ 8 [ 2 1 71 0 1 4 11 1 12 4 6 2 0 4 8 3 27 13 1 1 0 13 0 8 6 51 1 3 5 4 8 25 0 207 2 8 8 8 82 3 7 106 30 10 25 23 3 15 11 350 14 46 42 3 8 13 16 244 21 38 10 3 11 46 13 414 39 7 59 11 21 31 49 579 1 4 3 24 13 29 15 2] 0] 0] 2] 3] 12] 5] 8] 10] 61]]
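A confusion matrix like the one on this slide counts, for every true author (row), how many texts were assigned to each predicted author (column), so correct predictions sit on the diagonal. The sketch below builds one by hand on made-up labels; the function names and the toy data are illustrative only and unrelated to the talk's results.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of all texts that sit on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall_per_class(matrix):
    """For each true author, the share of their texts attributed correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Toy labels, made up for the example.
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "B", "B", "B", "A"]
cm = confusion_matrix(y_true, y_pred, labels=["A", "B", "C"])
print(cm)
print(accuracy(cm), recall_per_class(cm))
```

Large off-diagonal cells point at pairs of authors the classifier tends to confuse with each other.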
  23. Interesting questions
     ● Many authors?
     ● Short texts / “micro messages”?
     ● Is writing style affected by time/age?
     ● Can we detect “mood”?
     ● Psychological profiles?
     ● What about obfuscation?
     ● Even more subtle problems [PAN Workshop 2013]
     ● Other applications (code, music scores etc)
  24. References & Libraries
     1. Authorship Attribution: An Introduction, Harold Love, 2002
     2. A Survey of Modern Authorship Attribution Methods, Efstathios Stamatatos, 2007
     3. Authorship Attribution, Patrick Juola, 2008
     4. Authorship Attribution in Greek Tweets Using Author's Multilevel N-Gram Profiles, G. Mikros, Kostas Perifanos, 2012
     5. Authorship Attribution with Latent Dirichlet Allocation, Seroussi, Zukerman, Bohnert, 2011
     Python libraries:
     ● Pandas: http://pandas.pydata.org/
     ● Scikit-learn: http://scikit-learn.org/stable/
     ● nltk: http://www.nltk.org/
     Data: www.csse.monash.edu.au/research/umnl/data
     Demo Python code: https://gist.github.com/kperi/f0730ff3028f7be86b15
  25. Questions?
  26. Thank you!
