6. AklaBox Presentation
• Upload your documents
• Share your documents
• Collaborate on documents
• Search on documents
• Synchronize your documents
• Publish your documents
• Document Viewer
9. Solr & R Integration inside AklaBox
• Why do I get this list when I search inside the document repository?
• What counts when I run a search: the weight of every word?
• If a word appears 100 times in a document, is the document more valuable for my search?
• Maybe the document I'm looking for does not use the exact word spelling?
• How do I take multi-language support into account?
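One common way to approach the "100 times" question is to damp the raw count, so that a word repeated 100 times counts for more than a single occurrence but not 100 times more. The sketch below is only an illustration of that idea, not AklaBox's scoring; the function name `term_weights` and the `1 + log(count)` damping are assumptions chosen for the example.

```python
import math
from collections import Counter

def term_weights(text):
    """Return (raw count, log-damped weight) for each word in the text."""
    counts = Counter(text.lower().split())
    return {word: (count, 1.0 + math.log(count)) for word, count in counts.items()}

# A document that repeats one word 100 times versus one that uses it once.
doc_a = "solr " * 100
doc_b = "solr search index"

raw_a, damped_a = term_weights(doc_a)["solr"]
raw_b, damped_b = term_weights(doc_b)["solr"]

# The raw counts differ by a factor of 100; the damped weights differ far less.
print(raw_a, raw_b)
print(round(damped_a, 2), damped_b)
```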
10. Solr & R Integration inside AklaBox
• We need to review our module and rethink how we can help users deploy their own search policy
• R was a natural choice to create a new search algorithm
• We use R for our data mining development
• R contains packages to inspect documents
• R has virtually no limits when it comes to analyzing and classifying documents
• We read a lot about R & search engines …
11. Solr & R Integration inside AklaBox
• When do we analyze documents with R:
– Before Solr indexing
– After Solr indexing
• Choice:
– Before Solr indexing
• We add metadata to every document, such as top words, class of document, …
• We create classes for documents, and relations between classes
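The "analyze before indexing" choice can be sketched as a small enrichment step that runs on the document before it reaches the index. This is a minimal illustration in Python rather than the R code used in AklaBox; the `top_words` field name, the `enrich` helper and the tiny stop-word list are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real deployment would use one list per language.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def top_words(text, n=5):
    """Return the n most frequent non-stop words (ties broken alphabetically)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

def enrich(doc, n=5):
    """Return a copy of doc with a hypothetical 'top_words' metadata field."""
    enriched = dict(doc)
    enriched["top_words"] = top_words(doc["content"], n)
    return enriched

doc = {
    "id": "doc-1",
    "content": "The search index stores the search terms of the index and the search policy.",
}
enriched = enrich(doc, n=2)
print(enriched["top_words"])
```

The original document is left untouched, so the enrichment step can be re-run with a different parameter without re-reading the source file.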
12. Solr & R Integration inside AklaBox
Keywords are added to the Solr index
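As a rough picture of what "adding keywords to the index" can look like, the sketch below builds a JSON body of the kind Solr's update handler accepts (an array of documents). The field names `content` and `keywords` and the sample values are assumptions; the slides do not show the actual AklaBox schema, and nothing is sent to a server here.

```python
import json

def solr_update_payload(doc_id, content, keywords):
    """Build a JSON array with one document carrying a multi-valued keywords field."""
    doc = {
        "id": doc_id,
        "content": content,          # hypothetical field name
        "keywords": list(keywords),  # hypothetical field name
    }
    return json.dumps([doc])

payload = solr_update_payload("doc-1", "Quarterly sales report", ["sales", "report"])
print(payload)
```

Such a body would typically be POSTed to a collection's `/update` endpoint with a JSON content type, followed by a commit.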
15. Solr & R Integration inside AklaBox
R packages:
• tm: text-mining functions (stemming, word frequency, word manipulation, etc.)
• TF-IDF function (term frequency-inverse document frequency)
• Matrix: complex matrix manipulation
• cluster: fanny & kmeans functions, to calculate classes on various groups
• libsvm: svm, predict and tune functions, for automatic word classification
• sampling: to create & manipulate different data sets
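To make the TF-IDF item concrete, here is a minimal sketch of the weighting in plain Python. It is not the R code behind these slides; it uses the textbook form (term frequency normalized by document length, times the natural log of N over document frequency), and R packages may use a different log base or normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return weights

docs = [
    "solr index search".split(),
    "solr cluster scaling".split(),
    "solr search policy".split(),
]
weights = tf_idf(docs)

# "solr" occurs in every document, so it carries no weight;
# "index" occurs in only one document, so it outweighs "search".
print(weights[0])
```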
16. Solr & R Integration inside AklaBox
+
• The R algorithm runs when the document is uploaded
• We keep only a small number of words per document (parameter)
• We create classes for documents
• We can manage other concerns, such as internationalization
• The R package can be switched (other algorithm, new deployment)
• Easy & flexible to deploy and maintain
• No impact on Solr
-
• The Solr index is a gold mine … and we don't run analysis on it
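The "switchable package" and "few words per document (parameter)" points can be sketched as an upload hook that takes the analysis strategy and the keyword limit as arguments. Everything here is hypothetical (the `on_upload` hook, both analyzers, the `max_words` parameter); it only illustrates the design, in Python rather than R.

```python
from collections import Counter
from typing import Callable, Dict, List

Analyzer = Callable[[str], List[str]]

def frequency_analyzer(text: str) -> List[str]:
    """Rank words by how often they occur (ties broken alphabetically)."""
    counts = Counter(text.lower().split())
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def length_analyzer(text: str) -> List[str]:
    """Alternative strategy: rank distinct words by length, longest first."""
    return sorted(set(text.lower().split()), key=lambda w: (-len(w), w))

def on_upload(text: str, analyzer: Analyzer, max_words: int = 3) -> Dict[str, List[str]]:
    """Run the configured analyzer at upload time and keep only max_words keywords."""
    return {"keywords": analyzer(text)[:max_words]}

text = "invoice invoice payment supplier payment invoice"
by_frequency = on_upload(text, frequency_analyzer, max_words=2)
by_length = on_upload(text, length_analyzer, max_words=2)
print(by_frequency, by_length)
```

Because the analyzer is just a function argument, swapping the algorithm does not touch the indexing side, which mirrors the "no impact on Solr" point.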
21. Demonstration
• Other business cases
– Document management: pre-classification of documents (pharmaceutical industry)
– Search engine: analysis of websites during the crawling process
• Open door to new development
– Phonetic search (to solve the word spelling problem)
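Phonetic search usually means indexing a sound-based code next to each word, so that different spellings of the same sound match. The slides do not say which algorithm is planned; the sketch below uses classic American Soundex purely as an example of the technique.

```python
# Map consonants to Soundex digits; vowels and h/w/y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (result + "000")[:4]

# Two spellings of the same name collapse to the same code.
print(soundex("Smith"), soundex("Smyth"))
print(soundex("Robert"), soundex("Rupert"))
```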
22. Vanilla Air, Spark, Spark SQL for Solr
New technologies are emerging … well: they are already here!
23. Vanilla Air, Spark, Spark SQL for Solr
• Vanilla Air
– Can process R packages
– Can scale with a growing number of documents
www.vanillasmartdata.com
24. Vanilla Air, Spark, Spark SQL for Solr
Easy switch in architecture -> scalability
25. Vanilla Air, Spark, Spark & R & Solr
Spark 1.5
Version 1.5 (Sept. 2015): support for YARN cluster mode in R
26. Vanilla Air, Spark, Spark & R & Solr
We now have Spark & Solr tools: SolrRDD
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ
https://github.com/LucidWorks/spark-solr
27. Vanilla Air, Spark, Spark & R & Solr
Admin side – Running complex R programs on the Solr index, using Vanilla Air