Text Classification Powered by Apache Mahout and Lucene
Presented by Isabel Drost-Fromm, Software Developer, Apache Software Foundation/Nokia Gate 5 GmbH at Lucene/Solr Revolution 2013 Dublin

Text classification automates the task of filing documents into pre-defined categories based on a set of example documents. The first step in automating classification is to transform the documents into feature vectors. Though this step is highly domain-specific, Apache Mahout provides a lot of easy-to-use tooling to help you get started, most of which relies heavily on Apache Lucene for analysis, tokenisation, and filtering. This session shows how to use faceting to quickly get an understanding of the fields in your documents. It walks you through the steps necessary to convert your text documents into feature vectors that Mahout classifiers can use, including a few anecdotes on drafting domain-specific features.


  1. Text classification with Apache Mahout and Lucene
  2. Isabel Drost-Fromm. Software Engineer at Nokia Maps*. Member of the Apache Software Foundation. Co-founder of Berlin Buzzwords and the Berlin Apache Hadoop GetTogether. Co-founder of Apache Mahout. *We are hiring, talk to me or mail careers@here.com
  3. TM
  4. https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout … provide your own success story online.
  5. TM
  6. Classification?
  7. January 8, 2008 by Pink Sherbet Photography http://www.flickr.com/photos/pinksherbet/2177961471/
  8. By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/
  9. http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/ http://www.flickr.com/photos/redux/409356158/
  10. http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/ http://www.flickr.com/photos/redux/409356158/
  11. Image by jasondevilla http://www.flickr.com/photos/jasondv/91960897/
  12. How a linear classifier sees data
  13. Image by ZapTheDingbat (Light meter) http://www.flickr.com/photos/zapthedingbat/3028168415
  14. Instance* (sometimes also called example, item, or in databases a row)
  15. Feature* (sometimes also called attribute, signal, predictor, co-variate, or column in databases)
  16. Label* (sometimes also called class, target variable)
  17. Image taken in Lisbon/ Portugal.
  18. Image by jasondevilla http://www.flickr.com/photos/jasondv/91960897/
  19. ● Remove noise.
  20. ● Remove noise. ● Convert text to vectors.
  21. Text consists of terms and phrases.
  22. Encoding issues? Chinese? Japanese? “New York” vs. new York? “go” vs. “going” vs. “went” vs. “gone”? “go” vs. “Go”?
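The normalization questions the slide above raises (case, tokenization) are handled in the talk by a Lucene Analyzer; purely as an illustration, a minimal Python sketch of case-folding tokenization that collapses “go” and “Go” might look like this (the function name and regex are my own, not from the deck):

```python
import re

def tokenize(text):
    # Lowercase so "go" and "Go" map to the same token,
    # then split on runs of non-letter characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

tokens = tokenize('"New York" vs. new York?')
# Both spellings of "New York" collapse to the same tokens.
```

Stemming (“going” → “go”) and handling of languages without whitespace (Chinese, Japanese) need real analyzers and are deliberately out of scope here.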
  23. Terms? Tokens? Wait!
  24. Now we have terms – how to turn them into vectors?
  25. If we looked at two phrases only: Sunny weather High performance computing
  26. Aaron Zuse
  27. Binary bag of words ● Imagine an n-dimensional space. ● Each dimension = one possible word in texts. ● Entry in vector is one if the word occurs in the text: b_{i,j} = 1 if t_i ∈ d_j, 0 otherwise. ● Problem: – How to know all possible terms in unknown text?
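As a sketch of the binary bag-of-words encoding on the slide above (vocabulary and phrases are illustrative, taken from slide 25):

```python
def binary_bow(tokens, vocabulary):
    # One dimension per known word; entry is 1 if the word occurs
    # in the text, 0 otherwise.
    occurring = set(tokens)
    return [1 if word in occurring else 0 for word in vocabulary]

vocab = ["sunny", "weather", "high", "performance", "computing"]
binary_bow(["sunny", "weather"], vocab)   # [1, 1, 0, 0, 0]
```

The slide's problem is visible here: the vocabulary must be fixed up front, so a word never seen during training has no dimension at all.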
  28. Term Frequency ● Imagine an n-dimensional space. ● Each dimension = one possible word in texts. ● Entry in vector equals the word's frequency: b_{i,j} = n_{i,j}. ● Problem: – Common words dominate vectors.
  29. TF with stop wording ● Imagine an n-dimensional space. ● Each dimension = one possible word in texts. ● Filter stopwords. ● Entry in vector equals the word's frequency: b_{i,j} = n_{i,j}. ● Problem: – Common and uncommon words get the same weight.
  30. TF-IDF ● Imagine an n-dimensional space. ● Each dimension = one possible word in texts. ● Filter stopwords. ● Entry in vector equals the weighted frequency: b_{i,j} = n_{i,j} × log(|D| / |{d : t_i ∈ d}|). ● Problem: – Long texts get larger values.
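The TF-IDF weight b_{i,j} = n_{i,j} × log(|D| / |{d : t_i ∈ d}|) from the slide above can be sketched directly; this is an illustrative stand-alone implementation, not Mahout's vectorizer, and the toy documents are mine:

```python
import math
from collections import Counter

def tf_idf(documents):
    # documents: list of token lists. Returns, per document, a dict
    # mapping term -> n_{i,j} * log(|D| / |{d : t_i in d}|).
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))          # count documents, not occurrences
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: n * math.log(n_docs / doc_freq[t])
                        for t, n in tf.items()})
    return weights

docs = [["sunny", "weather"],
        ["high", "performance", "computing"],
        ["sunny", "computing"]]
w = tf_idf(docs)
# "weather" appears in 1 of 3 documents, so in document 0 it is
# weighted higher than "sunny", which appears in 2 of 3.
```

Note how a term occurring in every document gets log(1) = 0, which is exactly the stop-word effect the previous slides motivate.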
  31. Hashed feature vectors ● Imagine an n-dimensional space. ● Each word in texts = hashed to one dimension. ● Entry in vector set to one if a word hashed to it.
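A minimal sketch of the hashing idea on the slide above; the hash function (`zlib.crc32`) and dimensionality are arbitrary stand-ins, not what Mahout's feature-vector encoders actually use:

```python
import zlib

def hashed_vector(tokens, n_dims=16):
    # Hash each token to one of n_dims buckets. No vocabulary is
    # needed, so words unseen at training time still get a dimension.
    vec = [0] * n_dims
    for token in tokens:
        vec[zlib.crc32(token.encode()) % n_dims] = 1
    return vec
```

The trade-off: different words can collide into the same bucket, so some information is lost in exchange for a fixed-size vector and no vocabulary bookkeeping.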
  32. <
  33. How a linear classifier sees data
  34. HTML → Apache Tika → Fulltext → Lucene Analyzer → TokenStream → FeatureVectorEncoder → Vector → Online Learner → Model
  35. Image by ZapTheDingbat (Light meter) http://www.flickr.com/photos/zapthedingbat/3028168415
  36. Goals ● Did I use the best model parameters? ● How well will my model perform in the wild?
  37. Tune model parameters, experiment with tokenization, experiment with vector encoding. Compute expected performance.
  38. Performance ● Use same data for training and testing. ● Problem: – Highly optimistic. – Model generalization unknown.
  39. Performance ● Use same data for training and testing. DON'T ● Problem: – Highly optimistic. – Model generalization unknown.
  40. Performance ● Use just a fraction for training. ● Set some data aside for testing. ● Problems: – Pessimistic predictor: not all data used for training. – Result may depend on which data was set aside.
  41. Performance ● Partition your data into n fractions. ● Each fraction is set aside for testing in turn. ● Problem: – Still a pessimistic predictor.
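The n-fold scheme of slide 41 (each fraction held out for testing in turn) can be sketched as an index partition; this helper is illustrative, not from the deck:

```python
def k_fold_indices(n_items, k):
    # Partition item indices into k disjoint folds; each fold serves
    # as the held-out test set once, the rest as training data.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        splits.append((train, test))
    return splits
```

Averaging the model's score over all k test folds gives the expected-performance estimate the slides are after, at the cost of training k models.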
  42. Performance ● Use just a fraction for training. ● Set some data aside for tuning and testing. ● Problems: – Highly optimistic. – Parameters manually tuned to the testing data.
  43. Performance ● Use just a fraction for training. ● Set some data aside for tuning and testing. DON'T ● Problems: – Highly optimistic. – Parameters manually tuned to the testing data.
  44. Performance ● Use just a fraction for training. ● Set some data aside for tuning. ● Set another set of data aside for testing. ● Problems: – Pretty pessimistic as not all data is used. – May depend on which data was set aside.
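Slide 44's recommended setup, separate tuning and test sets, can be sketched as a three-way split; fraction sizes and the fixed seed are illustrative choices, not from the deck:

```python
import random

def train_tune_test_split(items, tune_frac=0.2, test_frac=0.2, seed=42):
    # Shuffle once, then carve off disjoint tuning and test sets.
    # The test set is touched exactly once, after all tuning is done,
    # so parameters cannot be fitted to it.
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_tune = int(len(items) * tune_frac)
    test = items[:n_test]
    tune = items[n_test:n_test + n_tune]
    train = items[n_test + n_tune:]
    return train, tune, test
```

This keeps the estimate honest at the price the slide names: it is somewhat pessimistic, since not all data is used for training.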
  45. Performance Measures
  46. Confusion matrix: model prediction (negative/positive) against correct prediction (negative/positive).
  47. Accuracy ● ACC = (true positive + true negative) / (true positive + false positive + false negative + true negative). ● Problem: – What if the class distribution is skewed?
  48. Precision/ Recall ● Precision = true positive / (true positive + false positive). Recall = true positive / (true positive + false negative). ● Problem: – Depends on decision threshold.
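The formulas of slides 47 and 48 follow directly from the confusion-matrix cells; an illustrative sketch (labels encoded as 0/1, my convention):

```python
def confusion_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells, then compute the
    # measures from the slides above.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

The skew problem from slide 47 shows up immediately: with 95 negatives and 5 positives, predicting "negative" for everything scores 95% accuracy while recall is zero.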
  49. ROC Curves
  50. ROC Curves Orange rate
  51. ROC Curves True orange rate False orange rate
  52. ROC Curves True orange rate False orange rate
  53. ROC Curves True orange rate False orange rate
  54. ROC Curves True orange rate False orange rate
  55. ROC Curves True orange rate False orange rate
  56. AUC – area under ROC True orange rate False orange rate
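One common way to compute the AUC of slide 56 without drawing the curve uses its rank interpretation: AUC is the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one. An illustrative sketch (the O(n²) comparison is for clarity, not efficiency):

```python
def auc(scores_pos, scores_neg):
    # Compare every positive score against every negative score;
    # a tie counts as half a win.
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Unlike precision and recall, this needs no decision threshold, which is exactly why the slides move from those measures to ROC/AUC.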
  57. Photo taken by fras1977 http://www.flickr.com/photos/fras/4992313333/
  58. Image by Medienmagazin pro http://www.flickr.com/photos/medienmagazinpro/6266643422
  59. http://www.flickr.com/photos/generated/943078008/
  60. Apache Hadoop-ready: Recommendations/ collaborative filtering; kNN and matrix factorization based collaborative filtering; Classification/ Naïve Bayes, random forest; Frequent item sets/ (P)FPGrowth; Classification/ logistic regression/ SGD; Clustering/ mean shift, k-Means, Canopy, Dirichlet process, co-location search; Sequence learning/ HMM; Math libs/ Mahout collections; LDA
  61. Libraries to have a look at: Vowpal Wabbit, Mallet, LibSvm, LibLinear, Libfm, Incanter, GraphLab, scikit-learn. Frameworks worth mentioning: Apache Mahout, Matlab/ Octave, Shogun, RapidI, Apache Giraph, R, Weka, MyMediaLite. Where to get more information: “Mahout in Action” – Manning; “Taming Text” – Manning; “Machine Learning” – Andrew Ng; https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks; https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading. Get your hands dirty: http://kaggle.com; https://cwiki.apache.org/confluence/display/MAHOUT/Collections. Where to meet these people: RecSys, NIPS, KDD, PKDD, ApacheCon, O'Reilly Strata, ICML, ECML, WSDM, JMLR, Berlin Buzzwords. Image by pareeerica http://www.flickr.com/photos/pareeerica/3711741298/
  62. Get started today with the right tools. January 8, 2008 by dreizehn28 http://www.flickr.com/photos/1328/2176949559
  63. Discuss ideas and problems online. November 16, 2005 [phil h] http://www.flickr.com/photos/hi-phi/64055296
  64. Images taken at Berlin Buzzwords 2011/12/13 by Philipp Kaden. See you there end of May 2014. Discuss ideas and problems in person.
  65. Become a committer yourself
  66. BerlinBuzzwords.de – End of May 2014 in Berlin/ Germany. http:// Online – user/dev@mahout.apache.org, java-user@lucene.apache.org, dev@lucene.apache.org. Interest in solving hard problems. Being part of a lively community. Engineering best practices. Bug reports, patches, features. Documentation, code, examples. Image by: Patrick McEvoy
  67. http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/ http://www.flickr.com/photos/redux/409356158/
  68. http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/ http://www.flickr.com/photos/redux/409356158/
  69. By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/