[
  ('We', 'PRP'),
  ('<3', 'VBP'),
  ('NLTK', 'NNP')
]
Dhiana Deva | Gabriel Fonseca 
Data Matching @ UFRJ
“NLTK” == “Natural Language ToolKit” 
+ Python library for NLP 
+ Created in 2001 at University of Pennsylvania 
+ Very extensive 
+ Many examples 
+ Built-in support for 84 datasets (today!) 
+ Great documentation 
+ Open source ;) 
+ Active community
Lots of modules!
corpus: standardized interfaces to corpora and lexicons
tokenize: tokenizers!
stem: stemmers!
collocation: t-test, chi-squared, pointwise mutual information
classify: decision tree, maximum entropy, naive Bayes
cluster: EM, k-means
chunk: regular expression, n-gram, named-entity
metrics: distances, precision, recall, agreement coefficients
probability: frequency distributions, smoothed probability distributions
parse: chart, feature-based, unification, probabilistic, dependency
tag: part-of-speech tagging, n-gram, backoff, Brill, HMM, TnT
...
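To give a feel for what the tokenize and probability modules do, here is a minimal sketch in plain Python (standard library only, so it runs without NLTK installed). The regex tokenizer and the helper names `tokenize` and `ngrams` are illustrative assumptions, not NLTK's implementation; NLTK offers its own tokenizers, `nltk.ngrams` and `nltk.FreqDist` for the same jobs.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude word tokenizer (assumption: lowercase letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("We love NLTK and we love Python")
bigrams = ngrams(tokens, 2)

# A Counter stands in for a frequency distribution: token -> count.
freq = Counter(tokens)

print(freq.most_common(2))
print(bigrams[:3])
```

A real pipeline would swap the regex for a proper tokenizer (Punkt, for example) and the Counter for a smoothed probability distribution.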
Can I haz Data Matching? 
☑ Accuracy score 
☑ Precision score 
☑ Recall score 
☑ F-measure score 
☐ Reduction ratio 
☑ Stop-words (11 languages) 
★ Punkt sentence tokenizer 
★ Punkt word tokenizer 
☑ N-gram (words and chars) 
☑ Tf-idf 
☑ Levenshtein distance 
☑ Damerau-Levenshtein distance 
☑ Binary distance... Durr! 
★ Krippendorff's interval distance 
★ MASI distance 
☑ Jaccard distance 
☐ Jaro distance 
☐ Jaro-Winkler distance 
☐ Monge-Elkan distance 
☐ Soundex 
☐ Phonex 
☐ NYSIIS 
☐ ONCA 
☐ Double-Metaphone 
☐ Fuzzy Soundex 
☑ Decision tree 
☑ SVM 
☑ Naive Bayes 
★ MaxEnt
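A few of the checked items are small enough to sketch by hand. The block below is an illustrative plain-Python version of Levenshtein distance, Jaccard distance over character bigrams, and precision/recall/F-measure over sets of matched pairs. All function names and the sample pairs are assumptions made for this sketch; NLTK ships its own versions (for example `edit_distance` and `jaccard_distance` in `nltk.metrics`), which are what you would normally call.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


def char_ngrams(text, n=2):
    """Set of character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_distance(x, y):
    """1 - |intersection| / |union| for two sets (0.0 if both are empty)."""
    union = x | y
    return 1 - len(x & y) / len(union) if union else 0.0


def precision_recall_f(reference, test):
    """Score predicted matches (test) against true matches (reference)."""
    true_positives = len(reference & test)
    precision = true_positives / len(test) if test else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


print(levenshtein("queen", "qween"))
print(jaccard_distance(char_ngrams("queen"), char_ngrams("qween")))

# Hypothetical record-ID pairs: true duplicates vs. pairs a matcher reported.
true_pairs = {(1, 2), (3, 4), (5, 6)}
found_pairs = {(1, 2), (3, 4), (7, 8), (9, 10)}
print(precision_recall_f(true_pairs, found_pairs))
```

Edit distance works on whole strings, while Jaccard over n-grams tolerates reordering, which is why data matching setups often try both.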
Fun fun fun! 
Sentiment analysis 
Spelling correction 
Spam detection 
Topic modeling 
Recommender systems 
Data deduplication
Why not song matching?! 
Grooveshark: online music streaming service 
Songs uploaded by record labels, independent 
artists and users 
Lots of duplicates! 
Tinysong: Grooveshark’s open RESTful API 
Our goal: No repeated songs! 
(remixes and live versions are okay!)
Bohemian Rhapsody by Qween-?! 
{ 
"Url": "http://tinysong.com/PBCJ", 
"SongID": 33834073, 
"SongName": "Bohemian Rhapsody", 
"ArtistID": 2324, 
"ArtistName": "Queen", 
"AlbumID": 1071492, 
"AlbumName": "Greatest Hits" 
}, 
... 
{ 
"Url": "http://tinysong.com/CYxG", 
"SongID": 28835215, 
"SongName": "Bohemian Rhapsody", 
"ArtistID": 1731732, 
"ArtistName": "Qween -", 
"AlbumID": 2364353, 
"AlbumName": "A Night at the Opera" 
} 
...
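One way the two records above could be flagged as the same song, sketched with the standard library only. This is not necessarily the matching method used in the project: the normalization rule, the exact-title requirement, the use of `difflib.SequenceMatcher` as the similarity measure and the 0.75 threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse punctuation/whitespace, so 'Qween -' becomes 'qween'."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(song_a, song_b, artist_threshold=0.75):
    """Assumed rule: identical normalized title and a close-enough artist name.

    Requiring an exact title keeps remixes and live versions apart,
    since their titles usually carry an extra marker.
    """
    same_title = normalize(song_a["SongName"]) == normalize(song_b["SongName"])
    close_artist = similarity(song_a["ArtistName"], song_b["ArtistName"]) >= artist_threshold
    return same_title and close_artist


queen = {"SongName": "Bohemian Rhapsody", "ArtistName": "Queen"}
qween = {"SongName": "Bohemian Rhapsody", "ArtistName": "Qween -"}
live = {"SongName": "Bohemian Rhapsody (Live)", "ArtistName": "Queen"}

print(is_duplicate(queen, qween))
print(is_duplicate(queen, live))
```

The threshold is the knob to tune: too low and covers by other artists get merged, too high and misspellings such as "Qween" slip through.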
Next steps 
Other textual data 
Machine learning 
Acoustic features 
Loudness 
BPM 
Liveness 
Acoustic fingerprinting for supervised learning 
Yes, songs have fingerprints too!
Our “sentiment” 
+ Quick and easy! 
+ Exteeeeeeeeeeeeeeeeensive! 
+ Docs & community! 
+ Internationalization 
- Time performance 
- Memory usage 
- No online or active learning
Want more?! 
+ jellyfish 
Jaro-Winkler, Hamming, Soundex, Metaphone, NYSIIS, … 
+ nltk-trainer 
Command-line NLTK classifiers! 
+ scikit-learn 
More machine learning! Memory efficient! 
+ pattern 
Web mining. Out-of-the-box! 
+ gensim 
Topic modeling. Out-of-the-box!
References 
http://www.nltk.org/ 
http://www.nltk.org/book/ 
http://streamhacker.com/ 
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/ 
http://developers.grooveshark.com/tuts/tinysong 
https://github.com/sunlightlabs/jellyfish 
https://github.com/japerk/nltk-trainer 
http://scikit-learn.org/stable/ 
http://www.clips.ua.ac.be/pattern 
http://radimrehurek.com/gensim/
Thanks! ;)

We love NLTK
