18. The Five Questions
1. When should I use it?
2. What does the input look like?
3. What does the output look like?
4. How many parameters do I have to tune?
5. Why will it fail?
20. Collaborative Filtering (cont.)
1. To see things that are hidden.
2. <user_id>,<item_id>,<weight>
3. <item1>,<item2>,<score>
4. The distance metric and the weight calculations.
5. If the input data is too sparse.
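A minimal sketch of the input and output shapes above. The slide does not name a distance metric, so cosine similarity is an assumption here, and the sample triples and the function name `item_similarities` are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical sample input in <user_id>,<item_id>,<weight> form.
RATINGS = [
    ("u1", "a", 1.0), ("u1", "b", 1.0),
    ("u2", "a", 1.0), ("u2", "b", 1.0), ("u2", "c", 1.0),
    ("u3", "c", 1.0),
]

def item_similarities(ratings):
    """Return (item1, item2, score) triples; score is cosine similarity (assumed metric)."""
    vectors = defaultdict(dict)  # item_id -> {user_id: weight}
    for user, item, weight in ratings:
        vectors[item][user] = weight

    results = []
    for i, j in combinations(sorted(vectors), 2):
        shared = set(vectors[i]) & set(vectors[j])
        if not shared:
            continue  # no user touched both items, so no score can be computed
        dot = sum(vectors[i][u] * vectors[j][u] for u in shared)
        norm_i = sqrt(sum(w * w for w in vectors[i].values()))
        norm_j = sqrt(sum(w * w for w in vectors[j].values()))
        results.append((i, j, dot / (norm_i * norm_j)))
    return results

if __name__ == "__main__":
    for item1, item2, score in item_similarities(RATINGS):
        print(f"{item1},{item2},{score:.3f}")
```

The `continue` branch is where sparse input hurts: item pairs with no users in common produce no score at all, so very sparse data yields very few output rows.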
23. K-Means Clustering (cont.)
1. To find anomalous events.
2. Vectors of normally distributed values.
3. Cluster centroids.
4. The choice(s) of K.
5. The points aren’t even remotely normally distributed.
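A minimal sketch of how cluster centroids can be used to flag anomalous events, assuming plain Lloyd's iteration seeded with the first K points and Euclidean distance. The sample points and the function names `kmeans` and `anomaly_score` are illustrative, not taken from the slides.

```python
from math import dist  # Python 3.8+

# Hypothetical sample vectors forming two tight groups.
POINTS = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; centroids are seeded with the first k points (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids

def anomaly_score(point, centroids):
    """Distance to the nearest centroid; a larger value means a more unusual point."""
    return min(dist(point, c) for c in centroids)

if __name__ == "__main__":
    centers = kmeans(POINTS, k=2)
    print(centers)
    print(anomaly_score((5, 5), centers), anomaly_score((0, 1), centers))
```

A point far from every centroid, such as (5, 5) here, scores higher than a point inside a cluster. How high is "too high" depends on K and on the spread of each cluster, which is why the choice of K is the parameter to tune.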
26. Random Forests (cont.)
1. To classify and predict.
2. A dependent variable and many independent variables.
3. Lots and lots of little trees.
4. The number of variables to consider at each level.
5. Too many independent variables.
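A minimal sketch of the tuning parameter above: at each split, only a random subset of the independent variables is considered. The parameter is named `mtry` here after the argument of the same name in R's randomForest; the Gini impurity criterion, the sample rows, and the function `best_split` are assumptions for illustration.

```python
import random

# Hypothetical data: feature 0 separates the classes cleanly, feature 1 does not.
ROWS = [(1, 5), (2, 3), (8, 4), (9, 6)]
LABELS = ["no", "no", "yes", "yes"]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels, mtry, rng):
    """Return (impurity, feature, threshold) for the best split among mtry random features."""
    candidates = rng.sample(range(len(rows[0])), mtry)
    best = None
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

if __name__ == "__main__":
    print(best_split(ROWS, LABELS, mtry=2, rng=random.Random(0)))
    print(best_split(ROWS, LABELS, mtry=1, rng=random.Random(0)))
```

With `mtry` equal to the number of variables, every split sees the informative feature. With a small `mtry` and many uninformative variables, a split may draw only noise, which is one way to read the "too many independent variables" failure mode.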
27. Random Forests on Hadoop
• R’s randomForest and rhadoop tools
• Map: partition the input data among the reducers
• Reduce: fit the random forests to each partition
• Re-combine the resulting trees in the client
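The map, reduce, and re-combine steps above can be sketched in plain Python, with no Hadoop or R involved. Everything here is an assumption for illustration: the round-robin partitioning, the one-threshold "stump" standing in for a real tree, and all function names.

```python
import random
from collections import Counter

# Hypothetical 1-D data: (x, label) where the label is True for x >= 5.
RECORDS = [(x, x >= 5) for x in range(10)]

def fit_stump(sample):
    """Stand-in for one tree: the threshold t minimizing errors of 'predict x > t'."""
    best = None  # (error count, threshold)
    for threshold, _ in sample:
        errors = sum((x > threshold) != y for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def map_partition(records, n_reducers):
    """Map: spread the input records across the reducers (round-robin here)."""
    partitions = [[] for _ in range(n_reducers)]
    for index, record in enumerate(records):
        partitions[index % n_reducers].append(record)
    return partitions

def reduce_fit(partition, n_trees, rng):
    """Reduce: fit a small forest on bootstrap samples drawn from one partition."""
    return [fit_stump(rng.choices(partition, k=len(partition))) for _ in range(n_trees)]

def combine(forests):
    """Client: concatenate the per-partition forests into a single ensemble."""
    return [tree for forest in forests for tree in forest]

def predict(ensemble, x):
    """Majority vote across every tree in the combined ensemble."""
    votes = Counter(x > threshold for threshold in ensemble)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    rng = random.Random(0)
    forests = [reduce_fit(p, n_trees=5, rng=rng) for p in map_partition(RECORDS, 2)]
    ensemble = combine(forests)
    print(len(ensemble), predict(ensemble, 0), predict(ensemble, 9))
```

Each reducer only ever sees its own partition, so the trees it grows reflect that slice of the data. Combining works because a forest is just a collection of independently grown trees that vote.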