Authorship Attribution and Forensic Linguistics with Python/Scikit-Learn/Pandas by Kostas Perifanos

Transcript

  • 1. Authorship Attribution & Forensic Linguistics with Python/Scikit-Learn/Pandas. Kostas Perifanos, Search & Analytics Engineer, @perifanoskostas. Learner Analytics & Data Science Team.
  • 2. Definition “Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt” [Love, 2002]
  • 3. Domains of application ● Author attribution ● Author verification ● Plagiarism detection ● Author profiling [age, education, gender] ● Stylistic inconsistencies [multiple collaborators/authors] ● Can also be applied to computer code, music scores, ...
  • 4. “Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt” “Automation”, “identification”, “text”: Machine Learning
  • 5. A classification problem ● Define classes ● Extract features ● Train ML classifier ● Evaluate
  • 6. Class definition[s] ● AuthorA, AuthorB, AuthorC, … ● Author vs rest-of-the-world [1-class classification problem] ● Or even, in extended contexts, a clustering problem
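To make the "author vs rest-of-the-world" framing concrete, here is a minimal sketch assuming scikit-learn's OneClassSVM and placeholder texts invented for illustration; it is not code from the talk.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    # Toy corpus, invented for illustration; real use needs many documents.
    author_texts = ["sample text by the known author", "another known sample"]
    unknown_texts = ["a disputed text of unknown origin"]

    vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
    X_known = vec.fit_transform(author_texts)
    X_unknown = vec.transform(unknown_texts)

    # nu is (roughly) an upper bound on the fraction of training outliers.
    occ = OneClassSVM(kernel="rbf", nu=0.1)
    occ.fit(X_known.toarray())

    # +1 means "consistent with the author's style", -1 means "outlier".
    print occ.predict(X_unknown.toarray())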
  • 7. Feature extraction ● Lexical features ● Character features ● Syntactic features ● Application specific
  • 8. Feature extraction ● Lexical features ● Word length, sentence length etc ● Vocabulary richness [lexical density: ratio of function words to content words] ● Word frequencies ● Word n-grams ● Spelling errors
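As a rough illustration of the lexical features above (a sketch of mine, not the talk's code), average word length, average sentence length and a crude type/token richness score can be computed with NLTK; the type/token ratio here is only a stand-in for the lexical-density measure named on the slide.

    import nltk

    def lexical_features(txt):
        sentences = nltk.sent_tokenize(txt)
        words = [w for w in nltk.word_tokenize(txt) if w.isalpha()]
        avg_word_len = sum(len(w) for w in words) / float(len(words))
        avg_sent_len = len(words) / float(len(sentences))
        # crude richness proxy: distinct words over total words
        richness = len(set(w.lower() for w in words)) / float(len(words))
        return [avg_word_len, avg_sent_len, richness]

    print lexical_features("Call me Ishmael. Some years ago I went to sea.")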
  • 9. Feature extraction ● Character features ● Character types (letters, digits, punctuation) ● Character n-grams (fixed and variable length) ● Compression methods [Entropy, which is really nice but for another talk :) ]
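The character-type features above reduce to simple counting; a minimal sketch (again mine, not from the slides):

    import string

    def char_type_ratios(txt):
        # proportions of letters, digits and punctuation in a (non-empty) text
        n = float(len(txt))
        letters = sum(c.isalpha() for c in txt)
        digits = sum(c.isdigit() for c in txt)
        punct = sum(c in string.punctuation for c in txt)
        return [letters / n, digits / n, punct / n]

    print char_type_ratios("Hello, world! 42 times.")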
  • 10. Feature extraction ● Syntactic features ● Part-of-speech tags [eg Verbs (VB), Nouns (NN), Prepositions (PP) etc] ● Sentence and phrase structure ● Errors
  • 11. Feature extraction ● Semantic features ● Synonyms ● Semantic dependencies ● Application specific features ● Structural ● Content specific ● Language specific
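For the synonym features just listed, one option (an assumption on my part, not something the talk specifies) is WordNet via NLTK; this sketch assumes NLTK 3, where lemma_names() is a method, and a downloaded wordnet corpus.

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def synonyms(word):
        # collect lemma names across all synsets of the word
        lemmas = set()
        for synset in wn.synsets(word):
            for name in synset.lemma_names():
                lemmas.add(name.lower())
        return lemmas

    print synonyms("judgment")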
  • 12. Demo application Let’s apply a classification algorithm to texts, using word and character n-grams and POS n-grams. Data set (1): 12,867 tweets from 10 users, in Greek, collected in 2012 [4]. Data set (2): 1,157 judgments from 2 judges, in English [5].
  • 13. But what’s an “n-gram”? “[…] an n-gram is a contiguous sequence of n items from a given sequence of text.” [http://en.wikipedia.org/wiki/N-gram] So, for the sentence above: word 2-grams (or bigrams): [ (an, n-gram), (n-gram, is), (is, a), (a, contiguous), …] char 2-grams: [ ‘an’, ‘n ‘, ‘ n’, ‘n-’, ‘-g’, …] We will use the TF-IDF weighted frequencies of both word and character n-grams as features.
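A tiny helper, for illustration only, that reproduces the bigram examples above; it works on any sequence, so a word list gives word n-grams and a plain string gives character n-grams.

    def ngrams(items, n):
        return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

    sentence = "an n-gram is a contiguous sequence of n items"
    print ngrams(sentence.split(), 2)                # word bigrams
    print ["".join(g) for g in ngrams(sentence, 2)]  # character bigrams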
  • 14. Enter Python. Flashback [or, transforming experiments into accepted papers in t <= 2h]. A few months earlier, on Dec 13, just one day before my holidays, I get this call...
  • 15. Load the dataset
    # Assume we have the data in 10 tsv files, one file per author.
    # Each file consists of two columns: id and actual text.
    from os import listdir
    from os.path import isfile, join
    import pandas as pd

    def load_corpus(input_dir):
        trainfiles = [f for f in listdir(input_dir) if isfile(join(input_dir, f))]
        trainset = []
        for filename in trainfiles:
            df = pd.read_csv(join(input_dir, filename), sep="\t",
                             dtype={'id': object, 'text': object})
            for row in df['text']:
                trainset.append({"label": filename, "text": row})
        return trainset
  • 16. Extract features [1]
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import FeatureUnion

    word_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                                  max_features=2000, binary=False)
    char_vector = TfidfVectorizer(analyzer="char", ngram_range=(2, 3),
                                  max_features=2000, binary=False, min_df=0)

    # trainset is the output of load_corpus() from the previous slide
    corpus, classes = [], []
    for item in trainset:
        corpus.append(item["text"])
        classes.append(item["label"])

    # our vectors are the feature union of word/char n-grams
    vectorizer = FeatureUnion([("chars", char_vector), ("words", word_vector)])

    # use fit_transform on the corpus to get the feature vectors
    X = vectorizer.fit_transform(corpus)
  • 17. Extract features [2]
    import nltk
    import scipy.sparse as sp

    # generate POS tags using nltk, return the sequence
    # as a whitespace-separated string
    def pos_tags(txt):
        tokens = nltk.word_tokenize(txt)
        return " ".join([tag for (word, tag) in nltk.pos_tag(tokens)])

    # combine word and char n-grams with POS n-grams
    tag_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                                 binary=False, max_features=2000,
                                 decode_error='ignore')

    tags = [pos_tags(txt) for txt in corpus]  # one POS-tag string per document
    X1 = vectorizer.fit_transform(corpus)
    X2 = tag_vector.fit_transform(tags)

    # concatenate the two sparse matrices
    X = sp.hstack((X1, X2), format='csr')
  • 18. Extract features [2.1] This last part is a little bit tricky: X = sp.hstack((X1, X2), format='csr'). There was no (obvious) way to use FeatureUnion here. X1 and X2 are sparse matrices, so we use scipy.sparse.hstack to stack the two matrices horizontally (column-wise). http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.hstack.html
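A toy example of what sp.hstack does here (not in the slides): two sparse matrices with the same number of rows become one wider matrix.

    import numpy as np
    import scipy.sparse as sp

    A = sp.csr_matrix(np.array([[1, 0], [0, 2]]))  # shape (2, 2)
    B = sp.csr_matrix(np.array([[3], [4]]))        # shape (2, 1)

    C = sp.hstack((A, B), format='csr')
    print C.shape      # (2, 3): rows preserved, columns concatenated
    print C.toarray()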
  • 19. Put everything together. The author is modeled as a function of the feature vector, whose components are word n-grams, character n-grams and, optionally, POS-tag n-grams.
  • 20. Fit the model and evaluate (10-fold CV)
    import numpy as np
    from sklearn import cross_validation
    from sklearn.svm import LinearSVC

    model = LinearSVC(loss='l1', dual=True)
    scores = cross_validation.cross_val_score(estimator=model,
                                              X=X.toarray(),
                                              y=np.asarray(classes),
                                              cv=10)
    print "10-fold cross validation results:", "mean score =", scores.mean(), \
          "std =", scores.std(), ", num folds =", len(scores)

  Results: 96% accuracy for two authors, using 10-fold CV.
  • 21. Evaluate (train set vs test set)
    import numpy as np
    import pylab as pl
    from sklearn.cross_validation import train_test_split
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import LinearSVC

    model = LinearSVC(loss='l1', dual=True)
    y = np.asarray(classes)  # labels collected on slide 16
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    y_pred = model.fit(X_train, y_train).predict(X_test)

    cm = confusion_matrix(y_test, y_pred)
    print(cm)

    pl.matshow(cm)
    pl.title('Confusion matrix')
    pl.colorbar()
    pl.ylabel('True label')
    pl.xlabel('Predicted label')
    pl.show()
  • 22. Confusion Matrix
    [[ 57   1   2   0   4   8   3  27  13   2]
     [  0  71   1   1   0  13   0   8   6   0]
     [  3   0  51   1   3   5   4   8  25   0]
     [  0   1   0 207   2   8   8   8  82   2]
     [  5   4   3   7 106  30  10  25  23   3]
     [  9  11   3  15  11 350  14  46  42  12]
     [  3   1   3   8  13  16 244  21  38   5]
     [  8  12  10   3  11  46  13 414  39   8]
     [  8   4   7  59  11  21  31  49 579  10]
     [  2   6   1   4   3  24  13  29  15  61]]
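A small follow-up the slides do not include: per-author accuracy falls out of the confusion matrix as the diagonal over each row's total, using the cm computed on slide 21.

    import numpy as np

    # cm is the confusion matrix from the previous slide
    per_class_accuracy = cm.diagonal() / cm.sum(axis=1).astype(float)
    print per_class_accuracy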
  • 23. Interesting questions ● Many authors? ● Short texts / “micro messages”? ● Is writing style affected by time/age? ● Can we detect “mood”? ● Psychological profiles? ● What about obfuscation? ● Even more subtle problems [PAN Workshop 2013] ● Other applications (code, music scores etc)
  • 24. References & Libraries
    1. Authorship Attribution: An Introduction, Harold Love, 2002
    2. A Survey of Modern Authorship Attribution Methods, Efstathios Stamatatos, 2007
    3. Authorship Attribution, Patrick Juola, 2008
    4. Authorship Attribution in Greek Tweets Using Author's Multilevel N-Gram Profiles, G. Mikros, Kostas Perifanos, 2012
    5. Authorship Attribution with Latent Dirichlet Allocation, Seroussi, Zukerman, Bohnert, 2011
    Python libraries:
    ● Pandas: http://pandas.pydata.org/
    ● Scikit-learn: http://scikit-learn.org/stable/
    ● NLTK: http://www.nltk.org/
    Data: www.csse.monash.edu.au/research/umnl/data
    Demo Python code: https://gist.github.com/kperi/f0730ff3028f7be86b15
  • 25. Questions?
  • 26. Thank you!
