Enhance discovery Solr and Mahout

Thinking Lucene
Think Lucid

Enhancing Discovery with Solr and Mahout

Grant Ingersoll
Chief Scientist
Lucid Imagination

CONFIDENTIAL


Evolution

Documents
  • Models
  • Feature Selection

Content Relationships
  • Page Rank, etc.
  • Organization

User Interaction
  • Clicks
  • Ratings/Reviews
  • Learning to Rank
  • Social Graph

Queries
  • Phrases
  • NLP

Copyright Lucid Imagination CONFIDENTIAL

Minding the Intersection

[Figure: the intersection of Search, Analytics, and Discovery]


Topics

• Background
  – Apache Mahout
  – Apache Solr and Lucene
• Recommendations with Mahout
  – Collaborative Filtering
• Discovery with Solr and Mahout
• Discussion


Apache Lucene in a Nutshell

• http://lucene.apache.org/java
• Java based Application Programming Interface (API) for adding search and indexing functionality to applications
• Fast and efficient scoring and indexing algorithms
• Lots of contributions to make common tasks easier:
  – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
• Most widely deployed search library on the planet


Apache Solr in a Nutshell

• http://lucene.apache.org/solr
• Lucene-based Search Server + other features and functionality
• Access Lucene over HTTP:
  – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
• Most programming tasks in Lucene are taken care of in Solr
• Faceting (guided navigation, filters, etc.)
• Replication and distributed search support
• Lucene Best Practices


Apache Mahout in a Nutshell

http://dictionary.reference.com/browse/mahout

• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
• The Three C’s:
  – Collaborative Filtering (recommenders)
  – Clustering
  – Classification
• Others:
  – Frequent Item Mining
  – Primitive collections
  – Math stuff


Recommendations with Mahout


Recommenders

• Collaborative Filtering (CF)
  – Provide recommendations solely based on preferences expressed between users and items
  – “People who watched this also watched that”
• Content-based Recommendations (CBR)
  – Provide recommendations based on the attributes of the items and user profile
  – ‘Modern Family’ is a sitcom, Bob likes sitcoms
    • => Suggest Modern Family to Bob
• Mahout geared towards CF, can be extended to do CBR
  – Classification can also be used for CBR
• Aside: search engines can also solve these problems
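
The content-based idea can be sketched in a few lines of Python. This is only an illustration of attribute matching, not Mahout code: the catalog entry "Hypothetical Drama", the attribute sets, and the function name `recommend_by_content` are assumptions made for the example.

```python
# Item attributes. "Modern Family" comes from the slide; the second entry
# is a hypothetical item added so the filter has something to reject.
catalog = {
    "Modern Family": {"sitcom"},
    "Hypothetical Drama": {"drama"},
}

# User profile: the attributes each user is known to like.
profiles = {"Bob": {"sitcom"}}

def recommend_by_content(user, profiles, catalog):
    """Rank items by how many of their attributes the user likes."""
    liked = profiles[user]
    scored = [(len(attrs & liked), title) for title, attrs in catalog.items()]
    # Keep only items that share at least one attribute with the profile.
    return [title for score, title in sorted(scored, reverse=True) if score > 0]

print(recommend_by_content("Bob", profiles, catalog))
```

No other user's behavior is consulted here, which is what separates this from collaborative filtering.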


To Rate or Not?

• In many instances, users don’t provide actual ratings
  – Clicks, views, etc.
• Non-Boolean ratings can also often introduce unnecessary noise
  – Even a low rating often has a positive correlation with highly rated items in the real world
• Example: Should we recommend Frankenstein to Bob?

        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -
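
One way to read the example: if the rating values are dropped and only "did the user rate it" is kept, Bob and Mary overlap on two of three books, which argues for recommending Frankenstein. The sketch below uses the Tanimoto (Jaccard) coefficient on the table's data; the helper names are assumptions for illustration and this is plain Python, not Mahout's API.

```python
# Ratings from the slide's table (Java Programming is unrated by both users).
ratings = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}

def to_boolean(user_ratings):
    """Keep only the fact that an item was rated, not the rating value."""
    return set(user_ratings)

def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

bob = to_boolean(ratings["Bob"])
mary = to_boolean(ratings["Mary"])
similarity = tanimoto(bob, mary)   # 2 shared books out of 3 distinct books
candidates = sorted(mary - bob)    # what Mary rated that Bob has not seen
print(round(similarity, 2), candidates)
```

A rating-based correlation over the two shared books would point the other way, since Bob and Mary disagree on both, which is the kind of noise the slide warns about.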


Collaborative Filtering with Mahout

• Extensive framework for collaborative filtering
• Recommenders
  – User based
  – Item based
  – Slope One
• Online and Offline support
  – Offline can utilize Hadoop

          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
…
User n    0.8      0.7          0.1

Recommendations for User X


User Similarity

What should we recommend for User 1?

[Figure: Users 1-4 and Items 1-4]
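
A minimal sketch of user-based recommendation follows. The preference data is hypothetical (the figure's links are not available in the text), and the function names and the neighborhood size `k` are assumptions; this is plain Python rather than Mahout's recommender classes.

```python
# Hypothetical boolean preferences for the four users in the figure.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 2", "Item 3"},
    "User 4": {"Item 4"},
}

def tanimoto(a, b):
    """Shared items divided by all items seen by either user."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_user_based(target, prefs, k=2):
    """Score unseen items by the similarity of the k nearest users who have them."""
    others = [u for u in prefs if u != target]
    neighbors = sorted(
        others, key=lambda u: tanimoto(prefs[target], prefs[u]), reverse=True
    )[:k]
    scores = {}
    for user in neighbors:
        sim = tanimoto(prefs[target], prefs[user])
        for item in prefs[user] - prefs[target]:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

print(recommend_user_based("User 1", prefs))
```

With this data, Users 2 and 3 are the nearest neighbors of User 1, and both contribute Item 3.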


Item Similarity

What should we recommend for User 1?

[Figure: Users 1-4 and Items 1-4]
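
The item-based variant can be sketched the same way: compare items to each other instead of users. The data below is hypothetical, co-occurrence counting is just one possible item similarity, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical boolean preferences for the four users in the figure.
prefs = {
    "User 1": {"Item 1", "Item 2"},
    "User 2": {"Item 1", "Item 2", "Item 3"},
    "User 3": {"Item 2", "Item 3"},
    "User 4": {"Item 4"},
}

def item_cooccurrence(prefs):
    """Count how many users have each pair of items in common."""
    counts = defaultdict(int)
    for items in prefs.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend_item_based(target, prefs):
    """Score each unseen item by its co-occurrence with the user's own items."""
    counts = item_cooccurrence(prefs)
    seen = prefs[target]
    all_items = set().union(*prefs.values())
    scores = {c: sum(counts[(c, s)] for s in seen) for c in all_items - seen}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [item for item in ranked if scores[item] > 0]

print(recommend_item_based("User 1", prefs))
```

Item 3 co-occurs once with Item 1 and twice with Item 2, so it scores 3; Item 4 never co-occurs with anything User 1 has and is dropped.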


Slope One

User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5

• Intuition: There is a linear relationship between rated items
  – Y = mX + b, where m = 1
• Solve for b upfront based on existing ratings: b = (Y - X)
  – Find the average difference in preference value for every pair of items
• Online can be very fast, but requires up front computation and memory
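
The worked example above can be reproduced with a short sketch of weighted Slope One. The function names are assumptions and this is plain Python, not Mahout's implementation; the ratings are the ones from the slide's table.

```python
from collections import defaultdict

def average_diffs(ratings):
    """Precompute b = average (Y - X) for every ordered pair of items."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, rating_i in user_ratings.items():
            for j, rating_j in user_ratings.items():
                if i != j:
                    sums[(i, j)] += rating_i - rating_j
                    counts[(i, j)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, counts

def predict(user_ratings, target, diffs, counts):
    """Predict target as Y = X + b, weighting each b by its support."""
    numerator = denominator = 0.0
    for j, rating_j in user_ratings.items():
        if (target, j) in diffs:
            weight = counts[(target, j)]
            numerator += (rating_j + diffs[(target, j)]) * weight
            denominator += weight
    return numerator / denominator if denominator else None

# Ratings from the slide: User A rated both items, User B only Item 2.
ratings = {"A": {"Item 1": 3.5, "Item 2": 2.0}, "B": {"Item 2": 3.0}}
diffs, counts = average_diffs(ratings)
prediction = predict(ratings["B"], "Item 1", diffs, counts)
print(prediction)
```

The `average_diffs` step is the up-front computation the last bullet refers to: it stores one value per item pair, so memory grows with the square of the number of items.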


Online and Offline Recommendations

• Online
  – Predates Hadoop
  – Designed to run on a single node
    • Matrix size of ~100M interactions
  – API for integrating with your application
• Offline
  – Hadoop based
  – Designed to run on large cluster
  – Several approaches:
    • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob


RecommenderJob

• Essentially does matrix multiplication using distributed techniques
• $MAHOUT_HOME/bin/examples/asf-email-examples.sh

        101  102  103  104  105       User A       Recs
101       7    2    0    1    3         3.0          30
102       2    8    3    5    2         0            37
103       0    3    3    6    4   X     4.0    =     38
104       1    5    6    4    7         3.0          53
105       3    2    4    7    9         2.0          64
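
The multiplication in the table can be checked on a single machine with a few lines of Python. The sketch assumes the square matrix is an item-to-item co-occurrence matrix and that a 0 in User A's vector means "not yet rated"; the slide does not label either, and the variable names are illustrative.

```python
# Item-to-item matrix from the slide (rows and columns are items 101-105).
item_ids = [101, 102, 103, 104, 105]
cooccurrence = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preference vector; 0 is assumed to mean the item is unrated.
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]

def score(matrix, prefs):
    """Multiply the item matrix by a user's preference vector."""
    return [sum(c * p for c, p in zip(row, prefs)) for row in matrix]

scores = score(cooccurrence, user_a)

# Only items the user has not rated are candidates for recommendation.
recommendations = [
    (item, s) for item, s, p in zip(item_ids, scores, user_a) if p == 0
]
print(scores, recommendations)
```

Under these assumptions item 102 is the only candidate, with a score of 37. RecommenderJob distributes the same row-by-vector products across a Hadoop cluster.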


Discovery with Solr


Discovery with Solr

• Goals:
  – Guide users to results without having to guess at keywords
  – Encourage serendipity
  – Never show empty results
• Out of the Box:
  – Faceting
  – Spell Checking
  – More Like This
  – Clustering (Carrot2)
• Extend
  – Clustering (with Mahout)
  – Frequent Item Mining (with Mahout)


Clustering

• Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
• Solr has search result clustering
  – Pluggable
  – Default implementation uses Carrot2
• Mahout has Hadoop based large scale clustering
  – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.


Discovery In Action

• Pre-reqs:
  – Apache Ant 1.7.x, Subversion (SVN)
• Command Line 1:
  – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  – cd solr-trunk/solr/
  – ant example
  – cd example
  – java -Dsolr.clustering.enabled=true -jar start.jar
• Command Line 2
  – cd exampledocs; java -jar post.jar *.xml
• http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true


Solr + Mahout


Basics

• Most Mahout tasks are offline
• Solr provides many touch points for integration:
  – ClusteringEngine
    • Clustering results
  – SearchComponent
    • Suggestions: Related searches, clusters, MLT, spellchecking
  – UpdateProcessor
    • Classification of documents
  – FunctionQuery


Example: Frequent Itemset Mining

• Discover frequently co-occurring items
• Use Case: Related Searches from Solr Logs
• Hadoop and sequential versions
  – Parallel FP Growth
• Input:
  – <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
  – Comma, pipe also allowed as delimiters
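
To make the input format and the notion of "frequently co-occurring" concrete, here is a small sketch that parses lines in the format above and counts item pairs by brute force. It is not Parallel FP Growth, which avoids enumerating every pair; the sample lines, function names, and `min_support` parameter are assumptions for illustration.

```python
import re
from collections import Counter
from itertools import combinations

def parse_transaction(line):
    """Drop the optional document id before the TAB, then split the tokens
    on spaces, commas, or pipes."""
    _, _, items = line.rpartition("\t")
    return {token for token in re.split(r"[ ,|]+", items.strip()) if token}

def frequent_pairs(lines, min_support=2):
    """Count every co-occurring pair and keep those seen min_support times."""
    counts = Counter()
    for line in lines:
        for pair in combinations(sorted(parse_transaction(line)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical transactions: two with a document id, one without.
lines = [
    "q1\tsolr faceting",
    "q2\tsolr,faceting,tutorial",
    "solr|mahout",
]
print(frequent_pairs(lines))
```

Only the pair (faceting, solr) appears in two transactions, so it is the only one that meets the minimum support of 2.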


FIM on Solr Query Logs

• Goal:
  – Extract user queries from Solr logs
  – Feed into FIM to generate Related Keyword Searches
• Context:
  – Solr Query logs
  – bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  – bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
  – bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
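
The extraction step can be tried outside Mahout with the same regular expression. The sketch below assumes the literal question mark in the lookbehind is escaped (`\?`), that log lines contain the request URL, and that URL-decoding the match approximates the url transformer; the sample log lines and the function name are hypothetical.

```python
import re
from urllib.parse import unquote_plus

# Text after "?q=" or "&q=", up to the next "&" or the end of the line.
QUERY_RE = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")

def extract_queries(log_lines):
    """Return the URL-decoded q parameter of each line that has a non-empty one."""
    queries = []
    for line in log_lines:
        match = QUERY_RE.search(line)
        if match and match.group(0):
            queries.append(unquote_plus(match.group(0)))
    return queries

# Hypothetical log lines; real Solr log formats vary by version and setup.
log_lines = [
    "GET /solr/select?q=faceted+search&rows=10",
    "GET /solr/select?fl=id&q=chris%20hostetter",
    "GET /solr/select?fq=type:doc&rows=5",
    "GET /solr/browse?q=&debugQuery=true",
]
print(extract_queries(log_lines))
```

The third line is skipped because `fq=` is not preceded directly by `?` or `&` plus `q=`, and the fourth because its query is empty.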


Output

• Key: Chris: Value:
  ([Chris, Hostetter],870),
  ([Chris],870),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
  ([Search, Faceted, Chris, Hostetter],18),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
  ([Solr, new, Chris, Hostetter, webcast, along],12),
  ([Solr, new, Chris, Hostetter, webcast],12),
  ([Solr, new, Chris, Hostetter],12)


Resources

• http://lucene.apache.org
• http://mahout.apache.org
• http://manning.com/owen
• http://manning.com/ingersoll
• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers


Appendix


Mahout Overview

[Figure: Mahout architecture layers, top to bottom]
  Applications
  Examples
  Genetic | Freq. Pattern Mining | Classification | Clustering | Recommenders
  Utilities/Integration (Lucene/Vectorizer) | Math (Vectors/Matrices/SVD) | Collections (primitives) | Apache Hadoop

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

