Text Classification Powered by Apache Mahout and Lucene

Text classification
With Apache Mahout and Lucene
Isabel Drost-Fromm

Software Engineer at Nokia Maps*
Member of the Apache Software Foundation
Co-Founder of Berlin Buzzwords and
Berlin Apache Hadoop GetTogether
Co-founder of Apache Mahout

*We are hiring, talk to me or mail careers@here.com
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

… provide your own success story online.
Classification?
January 8, 2008 by Pink Sherbet Photography
http://www.flickr.com/photos/pinksherbet/2177961471/
By freezelight, http://www.flickr.com/photos/63056612@N00/155554663/
http://www.flickr.com/photos/29143375@N05/3344809375/in/photostream/

http://www.flickr.com/photos/redux/409356158/
Image by jasondevilla
http://www.flickr.com/photos/jasondv/91960897/
How a linear classifier sees data
Image by ZapTheDingbat (Light meter)
http://www.flickr.com/photos/zapthedingbat/3028168415
Instance*
(sometimes also called example, item, or in databases a row)
Feature*
(sometimes also called attribute, signal, predictor, co-variate, or column in databases)
Label*
(sometimes also called class, target variable)
Image taken in Lisbon/ Portugal.
Image by jasondevilla
http://www.flickr.com/photos/jasondv/91960897/
● Remove noise.
● Convert text to vectors.
Text consists of terms and phrases.
Encoding issues?
Chinese? Japanese?
“New York” vs. new York?
“go” vs. “going” vs. “went” vs. “gone”?
“go” vs. “Go”?
Terms? Tokens? Wait!
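The questions above can be made concrete with a deliberately naive tokenizer. This is only a sketch: the function name `tokenize` and the regular expression are assumptions for illustration, not what a Lucene Analyzer does. It lowercases everything (so "go" and "Go" collapse), does no stemming (so "go" and "going" stay distinct), and would not segment Chinese or Japanese text.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

# "New York" is split into two separate terms, and "Go"/"going" are not merged.
tokens = tokenize("Go to New York, going fast!")
print(tokens)
```

In practice each of these decisions (case folding, stemming, phrase detection) is a choice to experiment with rather than a fixed rule.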
Now we have terms – how to turn them
into vectors?
If we looked at two phrases only: “Sunny weather” and “High performance computing”.
[Figure: vocabulary axis running from “Aaron” to “Zuse”.]
Binary bag of words
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Entry in vector is one if the word occurs in the text:
  $b_{i,j} = \begin{cases} 1 & \text{if } x_i \in d_j \\ 0 & \text{else} \end{cases}$
● Problem:
  – How to know all possible terms in unknown text?
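A minimal sketch of the binary encoding, assuming a small fixed vocabulary chosen for illustration (the name `binary_vector` is hypothetical). It also shows the problem named on the slide: a term outside the vocabulary simply disappears.

```python
def binary_vector(tokens, vocabulary):
    """Return 1 for each vocabulary term that occurs in tokens, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

vocabulary = ["computing", "high", "performance", "sunny", "weather"]

# Repeated words still produce a single 1.
sunny = binary_vector(["sunny", "weather", "sunny"], vocabulary)

# "rainy" is not in the vocabulary, so it leaves no trace in the vector.
unknown = binary_vector(["rainy"], vocabulary)
print(sunny, unknown)
```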
Term Frequency
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Entry in vector equal to the word's frequency:
  $b_{i,j} = n_{i,j}$
● Problem:
  – Common words dominate vectors.
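A small sketch of term-frequency encoding, with a hypothetical helper `tf_vector` and a made-up sentence. The count for "the" ends up larger than the counts for the words that carry the meaning, which is the problem the slide points out.

```python
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Return the number of occurrences of each vocabulary term in tokens."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vocabulary = ["the", "sunny", "weather"]
tf = tf_vector("the weather the sun the sunny weather".split(), vocabulary)
print(tf)
```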
TF with stop wording
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Filter stopwords.
● Entry in vector equal to the word's frequency:
  $b_{i,j} = n_{i,j}$
● Problem:
  – Common and uncommon words with same weight.
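The same counting with a stop word filter might look as follows. The stop word list here is a tiny illustrative assumption; real lists, such as those shipped with Lucene analyzers, are longer and language specific.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "is", "of"}

def tf_vector_without_stopwords(tokens, vocabulary):
    """Count vocabulary terms after dropping stop words."""
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [counts[term] for term in vocabulary]

vocabulary = ["sunny", "weather"]
tf = tf_vector_without_stopwords("the weather is sunny the weather".split(), vocabulary)
print(tf)
```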
TF-IDF
● Imagine an n-dimensional space.
● Each dimension = one possible word in texts.
● Filter stopwords.
● Entry in vector equal to the weighted frequency:
  $b_{i,j} = n_{i,j} \times \log \frac{|D|}{|\{ d : t_i \in d \}|}$
● Problem:
  – Long texts get larger values.
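The weighting above can be sketched directly from the formula. The function name `tfidf_vectors` and the three toy documents are assumptions for illustration; no length normalization is applied, so the problem named on the slide remains.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Weight each term count by log(|D| / number of documents containing the term)."""
    vocabulary = sorted({t for doc in documents for t in doc})
    n_docs = len(documents)
    doc_freq = Counter(t for doc in documents for t in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary])
    return vocabulary, vectors

docs = [["sunny", "weather"], ["rainy", "weather"], ["sunny", "sunny", "day"]]
vocabulary, vectors = tfidf_vectors(docs)
for vector in vectors:
    print([round(value, 3) for value in vector])
```

A term that occurs in every document would get weight log(1) = 0, which is how very common words are discounted.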
Hashed feature vectors
● Imagine an n-dimensional space.
● Each word in texts = hashed to one dimension.
● Entry in vector set to one if a word hashed to it.
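A minimal sketch of the hashing idea, assuming CRC32 as the hash function and 16 dimensions; both are illustrative choices, and this is not how Mahout's feature vector encoders are implemented. No vocabulary is needed, at the price of possible collisions between different words.

```python
import zlib

def hashed_vector(tokens, n_dims=16):
    """Set the dimension each token hashes to; distinct tokens may collide."""
    vec = [0] * n_dims
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % n_dims
        vec[index] = 1
    return vec

v = hashed_vector(["sunny", "weather"])
print(v)
```

`zlib.crc32` is used instead of the built-in `hash` because string hashes in Python vary between processes, which would make stored vectors incomparable.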
[Figure: processing pipeline. HTML → Apache Tika → Fulltext → Lucene Analyzer → Tokenstream+x → FeatureVectorEncoder → Vector → Online Learner → Model]
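The last stage of the pipeline, an online learner, can be illustrated with a hand-written logistic regression trained by stochastic gradient descent. This is a sketch under assumptions: the function names, the learning rate and the two toy instances are made up, and it is not Mahout's implementation.

```python
import math

def predict(weights, features):
    """Probability of the positive class under a logistic model."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def sgd_step(weights, features, label, learning_rate=0.1):
    """Update the weights from a single instance; label is 0 or 1."""
    error = label - predict(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]

# Two toy instances; the last feature is a constant bias term.
data = [([1, 0, 1], 1), ([0, 1, 1], 0)]
weights = [0.0, 0.0, 0.0]
for _ in range(200):
    for features, label in data:
        weights = sgd_step(weights, features, label)

print(predict(weights, [1, 0, 1]), predict(weights, [0, 1, 1]))
```

Because the model is updated one instance at a time, the training data never has to fit in memory at once.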
Image by ZapTheDingbat (Light meter)
http://www.flickr.com/photos/zapthedingbat/3028168415
Goals

● Did I use the best model parameters?
  (Tune model parameters, experiment with tokenization, experiment with vector encoding.)
● How well will my model perform in the wild?
  (Compute expected performance.)
Performance
● Use same data for training and testing: DON'T.
● Problem:
  – Highly optimistic.
  – Model generalization unknown.
Performance
● Use just a fraction for training.
● Set some data aside for testing.
● Problems:
  – Pessimistic predictor: Not all data used for training.
  – Result may depend on which data was set aside.
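A hold-out split could be sketched like this. The helper name `holdout_split`, the 20% test fraction and the fixed seed are assumptions for illustration. Changing the seed changes which instances are set aside, which is exactly the second problem listed above.

```python
import random

def holdout_split(instances, test_fraction=0.2, seed=42):
    """Shuffle the instances and set a fraction aside for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(10))
print(len(train), len(test))
```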
Performance
● Partition your data into n fractions.
● Each fraction set aside for testing in turn.
● Problem:
  – Still a pessimistic predictor.
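The partitioning scheme above is commonly called n-fold cross-validation. A minimal sketch, assuming a hypothetical generator `cross_validation_folds` that assigns instances to folds round-robin (shuffling beforehand is left out for brevity):

```python
def cross_validation_folds(instances, n_folds):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    instances = list(instances)
    for k in range(n_folds):
        test = instances[k::n_folds]
        train = [x for i, x in enumerate(instances) if i % n_folds != k]
        yield train, test

folds = list(cross_validation_folds(range(10), 5))
for train, test in folds:
    print(test)
```

The performance estimate is then the average of the scores measured on each test fold.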
Performance
● Use just a fraction for training.
● Set some data aside for tuning and testing: DON'T.
● Problems:
  – Highly optimistic.
  – Parameters manually tuned to testing data.
Performance
● Use just a fraction for training.
● Set some data aside for tuning.
● Set another set of data aside for testing.
● Problems:
  – Pretty pessimistic as not all data is used.
  – May depend on which data was set aside.
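A three-way split keeps the tuning data and the testing data apart. In this sketch the helper name and the 60/20/20 proportions are assumptions; parameters are chosen on the tuning set, and the test set is touched only once at the end.

```python
import random

def train_tune_test_split(instances, tune_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the data into three disjoint parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_tune = int(len(shuffled) * tune_fraction)
    n_test = int(len(shuffled) * test_fraction)
    tune = shuffled[:n_tune]
    test = shuffled[n_tune:n_tune + n_test]
    train = shuffled[n_tune + n_test:]
    return train, tune, test

train, tune, test = train_tune_test_split(range(10))
print(len(train), len(tune), len(test))
```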
Performance Measures
                             Correct prediction: positive   Correct prediction: negative
Model prediction: positive   true positive                  false positive
Model prediction: negative   false negative                 true negative
Accuracy
$\mathrm{ACC} = \frac{\text{true positive} + \text{true negative}}{\text{true positive} + \text{false positive} + \text{false negative} + \text{true negative}}$
● Problems:
  – What if class distribution is skewed?
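The skew problem is easy to see with made-up numbers. In this sketch a model that always predicts "negative" on a data set with 5 positives and 95 negatives never finds a single positive, yet scores 95% accuracy.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Always predicting "negative": no true positives, all 5 positives are missed.
skewed = accuracy(tp=0, fp=0, fn=5, tn=95)
print(skewed)
```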
Precision/ Recall
$\mathrm{Precision} = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}$
$\mathrm{Recall} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}$
● Problem:
  – Depends on decision threshold.
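The dependence on the threshold can be shown with a handful of made-up classifier scores. The function name, the scores and the labels are assumptions for illustration; an instance is predicted positive when its score reaches the threshold.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]

strict = precision_recall(scores, labels, threshold=0.85)
lenient = precision_recall(scores, labels, threshold=0.3)
print(strict, lenient)
```

With the strict threshold precision is perfect but two of three positives are missed; with the lenient one all positives are found at the cost of a false positive.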
ROC Curves
[Figure sequence: ROC curve built up step by step. Axes: true orange rate and false orange rate.]
AUC – area under ROC
[Figure: area under the ROC curve, same axes.]
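A ROC curve can be traced by sweeping the decision threshold from high to low and recording the true positive rate against the false positive rate ("true orange rate" and "false orange rate" when the positive class is "orange"). The sketch below uses hypothetical function names and made-up scores, and assumes all scores are distinct; tied scores would need to be grouped.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold past each instance turns it into a positive prediction.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
points = roc_points(scores, labels)
area = auc(points)
print(points, area)
```

Unlike precision and recall, the area does not depend on any single threshold: it equals the fraction of (positive, negative) pairs that the scores rank in the right order.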
Photo taken by fras1977
http://www.flickr.com/photos/fras/4992313333/
Image by Medienmagazin pro
http://www.flickr.com/photos/medienmagazinpro/6266643422
http://www.flickr.com/photos/generated/943078008/
Apache Hadoop-ready
Recommendations/ Collaborative filtering
kNN and matrix factorization based collaborative filtering
Classification/ Naïve Bayes, random forest
Frequent item sets/ (P)FPGrowth
Classification/ Logistic Regression/ SGD
Clustering/ Mean shift, k-Means, Canopy, Dirichlet Process, Co-Location search
Sequence learning/ HMM
Math libs/ Mahout collections
LDA
Libraries to have a look at:
Vowpal Wabbit
Mallet
LIBSVM
LIBLINEAR
libFM
Incanter
GraphLab
scikit-learn

Where to get more information:
“Mahout in Action” - Manning
“Taming Text” - Manning
“Machine Learning” - Andrew Ng
https://cwiki.apache.org/confluence/display/MAHOUT/Books+Tutorials+and+Talks
https://cwiki.apache.org/confluence/display/MAHOUT/Reference+Reading
Image by pareeerica
http://www.flickr.com/photos/pareeerica/3711741298/

Frameworks worth mentioning:
Apache Mahout
Matlab/ Octave
Shogun
Rapid-I

Apache Giraph
R
Weka
MyMediaLite

Get your hands dirty:
http://kaggle.com
https://cwiki.apache.org/confluence/display/MAHOUT/Collections

Where to meet these people:
RecSys
NIPS
KDD
PKDD
ApacheCon
O'Reilly Strata

ICML
ECML
WSDM
JMLR
Berlin Buzzwords
Get started today with the right tools.

January 8, 2008 by dreizehn28
http://www.flickr.com/photos/1328/2176949559
Discuss ideas and problems online.

November 16, 2005 [phil h]
http://www.flickr.com/photos/hi-phi/64055296
Images taken at Berlin Buzzwords 2011/12/13 by
Philipp Kaden. See you there end of May 2014.

Discuss ideas and problems in person.
Become a committer yourself
BerlinBuzzwords.de – End of May 2014 in Berlin/ Germany.

http://

Online – user/dev@mahout.apache.org, java-user@lucene.apache.org,
dev@lucene.apache.org

Interest in solving hard problems.
Being part of a lively community.
Engineering best practices.

Bug reports, patches, features.
Documentation, code, examples.
Image by: Patrick McEvoy