Transcript

  • 1. Authorship Attribution & Forensic Linguistics with Python/Scikit-Learn/Pandas Kostas Perifanos, Search & Analytics Engineer @perifanoskostas Learner Analytics & Data Science Team
  • 2. Definition “Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt” [Love, 2002]
  • 3. Domains of application ● Author attribution ● Author verification ● Plagiarism detection ● Author profiling [age, education, gender] ● Stylistic inconsistencies [multiple collaborators/authors] ● Can also be applied to computer code, music scores, ...
  • 4. “Automated authorship attribution is the problem of identifying the author of an anonymous text, or text whose authorship is in doubt” “Automation”, “identification”, “text”: Machine Learning
  • 5. A classification problem ● Define classes ● Extract features ● Train ML classifier ● Evaluate
  • 6. Class definition[s] ● AuthorA, AuthorB, AuthorC, … ● Author vs rest-of-the-world [1-class classification problem] ● Or even, in extended contexts, a clustering problem
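    The author-vs-rest setting can also be framed as one-class classification. A minimal sketch of that idea, assuming scikit-learn's OneClassSVM over character n-gram TF-IDF features; the texts and variable names here are hypothetical placeholders, not part of the talk's demo:
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import OneClassSVM

      # hypothetical data: texts known to be by the author of interest,
      # plus a text of disputed authorship
      known_author_texts = ["a first text known to be by the author",
                            "a second text known to be by the author"]
      disputed_texts = ["a text whose authorship is in doubt"]

      vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
      X_known = vectorizer.fit_transform(known_author_texts)
      X_disputed = vectorizer.transform(disputed_texts)

      # fit on the single known class; predict +1 (same author) or -1 (outlier)
      occ = OneClassSVM(nu=0.1, kernel="linear")
      occ.fit(X_known)
      print(occ.predict(X_disputed))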
  • 7. Feature extraction ● Lexical features ● Character features ● Syntactic features ● Application specific
  • 8. Feature extraction ● Lexical features: ● Word length, sentence length etc ● Vocabulary richness [lexical density: functional words vs content words ratio] ● Word frequencies ● Word n-grams ● Spelling errors
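    A rough sketch of how a few of these lexical features could be computed with NLTK; it uses NLTK's English stopword list as a crude stand-in for a function-word list, so it is an illustration rather than the talk's actual feature set:
      import nltk
      from nltk.corpus import stopwords  # requires the nltk 'punkt' and 'stopwords' data

      def lexical_features(text):
          sentences = nltk.sent_tokenize(text)
          words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
          function_words = set(stopwords.words("english"))
          n_function = sum(1 for w in words if w.lower() in function_words)
          return {
              "avg_word_length": sum(len(w) for w in words) / float(len(words)),
              "avg_sentence_length": len(words) / float(len(sentences)),
              # content words over all words, a rough lexical density
              "lexical_density": 1.0 - n_function / float(len(words)),
          }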
  • 9. Feature extraction ● Character features: ● Character types (letters, digits, punctuation) ● Character n-grams (fixed and variable length) ● Compression methods [Entropy, which is really nice but for another talk :) ]
  • 10. Feature extraction ● Syntactic features: ● Part-of-speech tags [eg Verbs (VB), Nouns (NN), Prepositions (PP) etc] ● Sentence and phrase structure ● Errors
  • 11. Feature extraction ● Semantic features: ● Synonyms ● Semantic dependencies ● Application specific features: ● Structural ● Content specific ● Language specific
  • 12. Demo application Let’s apply a classification algorithm to texts, using word, character and POS n-grams. Data set (1): 12867 tweets from 10 users, in Greek, collected in 2012 [4] Data set (2): 1157 judgments from 2 judges, in English [5]
  • 13. But what’s an “n-gram”? […]an n-gram is a contiguous sequence of n items from a given sequence of text. [http://en.wikipedia.org/wiki/N-gram] So, for the sentence above: word 2-grams (or bigrams): [ (an, n-gram), (n-gram, is), (is, a), (a, contiguous), …] char 2-grams: [ ‘an’, ‘n ‘, ‘ n’, ‘n-’, ‘-g’, …] We will use the TF-IDF weighted frequencies of both word and character ngrams as features.
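    A tiny sketch of what those n-grams look like in plain Python, using the same sentence:
      sentence = "an n-gram is a contiguous sequence of n items"

      words = sentence.split()
      word_bigrams = list(zip(words, words[1:]))
      # [('an', 'n-gram'), ('n-gram', 'is'), ('is', 'a'), ('a', 'contiguous'), ...]

      char_bigrams = [sentence[i:i + 2] for i in range(len(sentence) - 1)]
      # ['an', 'n ', ' n', 'n-', '-g', ...]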
  • 14. Enter Python Flashback [or, transforming experiments into accepted papers in t<=2h] A few months earlier, Dec 13, just one day before my holidays, I get this call...
  • 15. Load the dataset
    # assume we have the data in 10 tsv files, one file per author.
    # each file consists of two columns, id and actual text
    from os import listdir
    from os.path import isfile, join
    import pandas as pd

    def load_corpus(input_dir):
        trainfiles = [f for f in listdir(input_dir) if isfile(join(input_dir, f))]
        trainset = []
        for filename in trainfiles:
            df = pd.read_csv(input_dir + "/" + filename, sep="\t",
                             dtype={'id': object, 'text': object})
            for row in df['text']:
                trainset.append({"label": filename, "text": row})
        return trainset
  • 16. Extract features [1]
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import FeatureUnion

    word_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2),
                                  max_features=2000, binary=False)
    char_vector = TfidfVectorizer(analyzer="char", ngram_range=(2, 3),
                                  max_features=2000, binary=False, min_df=0)

    corpus, classes = [], []
    for item in trainset:
        corpus.append(item["text"])
        classes.append(item["label"])

    # our vectors are the feature union of word/char ngrams
    vectorizer = FeatureUnion([("chars", char_vector), ("words", word_vector)])

    # load corpus, use fit_transform to get vectors
    X = vectorizer.fit_transform(corpus)
  • 17. Extract features [2]
    import nltk
    import scipy.sparse as sp

    # generate POS tags using nltk, return the sequence as a whitespace separated string
    def pos_tags(txt):
        tokens = nltk.word_tokenize(txt)
        return " ".join([tag for (word, tag) in nltk.pos_tag(tokens)])

    # combine word and char ngrams with POS-ngrams
    tag_vector = TfidfVectorizer(analyzer="word", ngram_range=(2, 2), binary=False,
                                 max_features=2000, decode_error='ignore')
    tags = [pos_tags(txt) for txt in corpus]

    X1 = vectorizer.fit_transform(corpus)
    X2 = tag_vector.fit_transform(tags)

    # concatenate the two matrices
    X = sp.hstack((X1, X2), format='csr')
  • 18. Extract features [2.1]
    # this last part is a little bit tricky
    X = sp.hstack((X1, X2), format='csr')
    There was no (obvious) way to use FeatureUnion here: X1 and X2 are sparse matrices, so we use hstack to stack the two matrices horizontally (column-wise).
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html
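    A minimal example of what that hstack call does with two sparse matrices; the shapes are made up purely for illustration:
      import numpy as np
      import scipy.sparse as sp

      X1 = sp.csr_matrix(np.ones((3, 4)))   # e.g. word/char ngram features
      X2 = sp.csr_matrix(np.zeros((3, 2)))  # e.g. POS ngram features
      X = sp.hstack((X1, X2), format='csr')
      print(X.shape)  # (3, 6): same rows, columns placed side by side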
  • 19. Put everything together: the feature vector for each author is a function of word ngrams, character ngrams and, optionally, POS-tag ngrams.
  • 20. Fit the model and evaluate (10-fold CV)
    from sklearn.svm import LinearSVC
    from sklearn import cross_validation
    import numpy as np

    model = LinearSVC(loss='l1', dual=True)
    scores = cross_validation.cross_val_score(estimator=model, X=X.toarray(),
                                              y=np.asarray(classes), cv=10)
    print "10-fold cross validation results:", "mean score =", scores.mean(), "std =", scores.std(), ", num folds =", len(scores)
    Results: 96% accuracy for two authors, using 10-fold CV
  • 21. Evaluate (train set vs test set)
    from sklearn.cross_validation import train_test_split
    from sklearn.metrics import confusion_matrix
    import pylab as pl

    model = LinearSVC(loss='l1', dual=True)
    y = np.asarray(classes)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    y_pred = model.fit(X_train, y_train).predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    pl.matshow(cm)
    pl.title('Confusion matrix')
    pl.colorbar()
    pl.ylabel('True label')
    pl.xlabel('Predicted label')
    pl.show()
  • 22. Confusion Matrix
    [[ 57   1   2  13  25 106  46   3  31   2]
     [  0  71   0   0   0  30  42  11  49   0]
     [  3   0   4   8 207  10   3  46 579   0]
     [  0   1   8   6   2  25   8  13   1   2]
     [  5   4   3  51   8  23  13 414   4   3]
     [  9  11  27   1   8   3  16  39   3  12]
     [  3   1  13   3   8  15 244   7  24   5]
     [  8  12   1   5  82  11  21  59  13   8]
     [  8   4   1   4   3 350  38  11  29  10]
     [  2   6   0   8   7  14  10  21  15  61]]
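    Not on the original slides, but useful when reading such a matrix: overall accuracy is the diagonal mass over the total, and per-class recall is each diagonal entry over its row sum.
      import numpy as np

      cm = np.array(cm)  # rows = true labels, columns = predicted labels
      accuracy = np.trace(cm) / float(cm.sum())
      per_class_recall = np.diag(cm) / cm.sum(axis=1).astype(float)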
  • 23. Interesting questions ● Many authors? ● Short texts / “micro messages”? ● Is writing style affected by time/age? ● Can we detect “mood”? ● Psychological profiles? ● What about obfuscation? ● Even more subtle problems [PAN Workshop 2013] ● Other applications (code, music scores etc)
  • 24. References & Libraries
    1. Authorship Attribution: An Introduction, Harold Love, 2002
    2. A Survey of Modern Authorship Attribution Methods, Efstathios Stamatatos, 2007
    3. Authorship Attribution, Patrick Juola, 2008
    4. Authorship Attribution in Greek Tweets Using Author's Multilevel N-Gram Profiles, G. Mikros, Kostas Perifanos, 2012
    5. Authorship Attribution with Latent Dirichlet Allocation, Seroussi, Zukerman, Bohnert, 2011
    Python libraries: ● Pandas: http://pandas.pydata.org/ ● Scikit-learn: http://scikit-learn.org/stable/ ● nltk: http://www.nltk.org/
    Data: www.csse.monash.edu.au/research/umnl/data
    Demo Python code: https://gist.github.com/kperi/f0730ff3028f7be86b15
  • 25. Questions?
  • 26. Thank you!