2. The Agenda
1. How it started
2. What is the Random Ferns algorithm?
3. How did implementation, evaluation and publishing go?
3. Motivations
• Random Ferns is a popular classification algorithm
in the image processing field
• Our colleague Miron Kursa implemented this algorithm
as part of his research[1] and published it
as an R package called rFerns
• We decided to empower the Spark community with this
method by making it available as a Spark package
5. The Algorithm
• Random Ferns
– An example of supervised learning
– Solves classification problems
– A kind of ensemble algorithm
6. Posterior Probability
• Hypothetically, we could learn the conditional probabilities
P(C = c_k | f_1, f_2, …, f_N)
• where the classifier H is described as
H(f) = argmax_k P(C = c_k | f_1, f_2, …, f_N)
• Not suitable: not tractable and memory consuming
7. Naïve Bayes Classifier
P(C_m | f) ∝ P(C_m) × P(f | C_m)
P(C_m | f) ∝ P(C_m) × ∏_{i=1}^{N} P(f_i | C_m)
• Naïve, as it ignores dependencies among features
• Often yields quite successful classifications
8. Randomness in classifiers
• Goals to reach:
– Avoid overfitting
– Build classifiers faster
• Ways to introduce randomness:
– Item sampling with replacement
– Feature sampling
9. Random Ferns
• Each classifier l ∈ [1; L] has its own set of S features
F_l = {f_{l,1}, f_{l,2}, …, f_{l,S}}
• Assume that the classifiers are independent:
P(f_1, f_2, …, f_N | C_k) = ∏_{l=1}^{L} P(F_l | C_k)
• Then classify items:
H(f) ≡ argmax_k P(C_k) ∏_{l=1}^{L} P(F_l | C_k)
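This decision rule fits in a few lines of Scala. A toy sketch (the names are ours, not the sparkling-ferns API), working in log space to avoid underflow when many small probabilities are multiplied:

// fernScores(l)(k) stands for P(F_l | C_k) for the bucket
// the object fell into in fern l.
def classify(priors: Map[Int, Double],
             fernScores: Seq[Map[Int, Double]]): Int =
  priors.keys.maxBy { k =>
    math.log(priors(k)) + fernScores.map(s => math.log(s(k))).sum
  }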
10. Random Ferns
• Less-naïve Bayes
• From a Random Forests perspective:
[Diagram: ferns drawn as perfect binary decision trees; within one fern every level tests the same feature (e.g. A, B, C), unlike a general decision tree]
14. Big data bagging
• How many times would a data point be sampled?
– Binomial distribution, p = 1/n:
P(X = k) = C(n, k) · (1/n)^k · (1 − 1/n)^(n−k)
– As n → ∞ (big data) the Binomial distribution tends to
the Poisson distribution with λ = np = 1[2]:
P(X = k) = 1 / (e · k!)
• Simulate sampling using the Poisson distribution
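A hedged sketch of that simulation on an RDD (our illustration, not the sparkling-ferns source): instead of drawing items with replacement, repeat each item Poisson(1)-many times. Spark's own RDD.sample(withReplacement = true, 1.0) relies on the same Poisson-based sampler.

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Knuth's inversion method for Poisson(lambda); fine for small lambda.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = 1.0
  do { k += 1; p *= rng.nextDouble() } while (p > limit)
  k - 1
}

// Bagging: every item appears Poisson(1)-many times in the bag.
def bag[T: ClassTag](data: RDD[T]): RDD[T] =
  data.mapPartitionsWithIndex { (idx, it) =>
    val rng = new Random(idx) // per-partition seed for reproducibility
    it.flatMap(x => Seq.fill(poisson(1.0, rng))(x))
  }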
15. Binarisation
Note: each fern has its own binarisers
• Categorical features
– Get a random subset of categories
– A given category either fits this set or not
• Continuous features
– Get two random feature values from the training set
– Use their mean as the threshold
16. Binarisation — implementation
• Categorical features
– Trivial, as we have user-supplied category info
• Continuous features
– Assign every value a random float
– Reduce by taking the two values with the greatest floats assigned
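A sketch of the continuous-feature binariser just described (function names are ours): pairing every value with a random float and keeping the two greatest keys is a distributed way of drawing two random samples.

import scala.util.Random
import org.apache.spark.rdd.RDD

// Draw two random values of a feature; their mean becomes the threshold.
def continuousThreshold(values: RDD[Double]): Double = {
  val twoRandom = values
    .map(v => (Random.nextDouble(), v)) // assign a random float to each value
    .top(2)(Ordering.by(_._1))          // keep the two greatest floats
    .map(_._2)
  twoRandom.sum / 2.0
}

// The binariser itself: one feature value becomes one bit of the fern.
def binarise(x: Double, threshold: Double): Boolean = x > threshold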
18. P(F_l | C_k)
• A combination of binary feature values used by fern l
• For a fern of height S there are 2^S distinct values of F_l
• You may think of it as the fern mapping each object
into one of 2^S buckets
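Concretely, a bucket can be identified by packing the fern's S test bits into an integer (the "feature key" mentioned in the notes); a small sketch of ours:

// Pack S binary test outcomes into a bucket index in [0, 2^S).
def bucketIndex(bits: Seq[Boolean]): Int =
  bits.foldLeft(0)((acc, b) => (acc << 1) | (if (b) 1 else 0))

// e.g. bucketIndex(Seq(true, false, true)) == 5, one of 2^3 = 8 buckets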
19. P(F_l | C_k)
• Probability of an object of class C_k falling into bucket F_l
• Count of objects of class C_k falling into bucket F_l,
divided by the count of objects of class C_k:
P(F_l | C_k) = |F_l ∩ C_k| / |C_k|
20. Reduction
• The most important part of training is
counting objects
• Sounds similar to… counting words!
• We have reduced classifier building to the
best-known big data problem
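A hedged sketch of that reduction (types and names are ours, not the sparkling-ferns code): map every training object to the bucket it falls into in each fern, keyed by (fern, bucket, class), and count the keys exactly as in word count.

import org.apache.spark.rdd.RDD

case class Point(label: Double, features: Array[Double])

// bucketOf(l, features) applies fern l's binarisers and returns a bucket.
def countBuckets(data: RDD[Point],
                 bucketOf: (Int, Array[Double]) => Int,
                 numFerns: Int): RDD[((Int, Int, Double), Long)] =
  data
    .flatMap { p =>
      (0 until numFerns).map(l => ((l, bucketOf(l, p.features), p.label), 1L))
    }
    .reduceByKey(_ + _) // the word-count step

// Dividing each count by its per-class total yields P(F_l | C_k).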
21. Memory
Q: How many probabilities do we need to compute?
A: About 2^S per fern
That means a binary classifier of 100 20-feature ferns
will weigh over 1.5 GB
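A back-of-the-envelope check of that figure (our arithmetic, assuming one 8-byte double per probability):
100 ferns × 2^20 buckets × 2 classes × 8 B ≈ 1.7 × 10^9 B, just over 1.5 GB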
23. Accuracy et al.
• Evaluation on the Iris and Car datasets as integration tests
• Iris:
– 10 ferns, 3 features per fern (out of 4)
– Accuracy: 98%
• Car:
– 20 ferns, 4 features per fern (out of 6)
– Accuracy: 90%
24. Dataset
• Million Song Dataset – Year Prediction
– Not quite a classification task, but big (0.5M items)
– Task: given 90 real-number features, indicate the
publication year (ranging from 1922 to 2011)
– For the sake of demonstration, let's just pretend it is
a classification problem
25. Model Training Code
val raw = sc.textFile(…)                    // load the raw dataset
val lp = raw.map(parseIntoLabeledPoints(_)) // parse into LabeledPoints
val data = splitIntoTrainTest(lp)           // train/test split
val numFerns = 90
val numFeatures = 10
val model = FernForest.train(data.train, numFerns, numFeatures, Map.empty)
val correct = data.test.map(lp => model.predict(lp.features) == lp.label)
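A possible follow-up (our addition, not part of the original snippet) to turn the boolean RDD into an accuracy figure:

val accuracy = correct.filter(identity).count().toDouble / correct.count()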
26. Model Training Time
T(‖f‖, ‖D‖) = 10^−5 + 4.2 · 10^−6 · ‖f‖ · ‖D‖
• Where:
– ‖f‖ is the number of features
– ‖D‖ is the number of items in the dataset
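As a worked check (our arithmetic, taking the formula's output to be minutes, as on the charts): with ‖f‖ = 10 and ‖D‖ = 0.5 × 10^6, T ≈ 4.2 · 10^−6 · 10 · 5 · 10^5 = 21 minutes, the same order as the measured training times on the next slide.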
27. Model Training Time
[Chart: estimated training time [min] against the number of features (10–20)]
• Training time is linear
– in the number of features (differently from Random Forests)
– in the number of samples
[Chart: training time [min] against the sample size (0%–60% of the 0.5M-item dataset)]
30. How can you help your users?
• Simplify discovery
– Register at spark-packages.org
• Simplify utilisation
– Publish artifacts to the Central Repository
31. spark-packages.org
• An index of packages for Apache Spark
• Spark Community keeps an eye on it
• Ideal place if you want to extend Spark
• You can register any GitHub-hosted Spark
project
32. The Central Repository
• Apache Maven retrieves all components from
the Central Repository by default
– so does Apache Spark
– and many other build systems
• Are your artifacts there yet?
33. Getting to the Central
Sonatype provides OSSRH:
– a free repository
– for open source software
– stores snapshot artifacts
– promotes releases to the Central Repository
Checklist:
1. Register[3] at Sonatype OSSRH
2. Generate GPG key (if you don’t have one yet)
3. Alter[4] your build.sbt (see the sketch after this list)
4. Build and sign your artifacts
5. Stage[5] release at OSSRH and promote to Central Repository
6. Voilà!
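A minimal build.sbt sketch for step 3 (the values and layout are illustrative; [3], [4] and [5] are the authoritative guides):

// Publish Maven-style artifacts to OSSRH; snapshots and releases
// go to different repositories.
publishMavenStyle := true
publishTo := {
  val nexus = "https://oss.sonatype.org/"
  if (isSnapshot.value)
    Some("snapshots" at nexus + "content/repositories/snapshots")
  else
    Some("releases" at nexus + "service/local/staging/deploy/maven2")
}
// The Central Repository additionally requires POM metadata
// (project URL, licence, SCM, developers) and GPG-signed artifacts;
// signing (step 4) is typically handled by the sbt-pgp plugin
// via the publishSigned task.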
34. Things are smooth now
$SPARK_HOME/bin/spark-shell \
  --packages pl.edu.icm:sparkling-ferns_2.10:0.2.0
36. References
[1] "rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning", M. Kursa, DOI: 10.18637/jss.v061.i10
[2] "Proof that the Binomial Distribution tends to the Poisson Distribution", https://youtu.be/ceOwlHnVCqo
[3] "OSSRH Guide", Sonatype, http://central.sonatype.org/pages/ossrh-guide.html
[4] "Deploying to Sonatype", sbt, http://www.scala-sbt.org/release/docs/Using-Sonatype.html
[5] "Releasing the Deployment", Sonatype, http://central.sonatype.org/pages/releasing-the-deployment.html
Editor's Notes
Good afternoon Ladies and Gentlemen,
My name is Piotr Dendek.
Together with Mateusz Fedoryszak
we are going to present
the implementation of Random Ferns
for Apache Spark
done at the Interdisciplinary Centre
for Mathematical and Computational Modelling,
part of the University of Warsaw.
I am going to tell you
how we got here
(in terms
of the Random Ferns implementation).
First, what or who has inspired us
and what are Random Ferns.
Next, Mateusz is going to describe
the implementation part
Finally we are going to share with you
evaluation results
and describe how to publish your package
on spark-packages.org.
So, let’s start!
Some of you,
especially people interested in the Image Processing Field
might have heard
about Random Ferns
as one of the state-of-the-art algorithms.
One of our colleagues at ICM,
Miron Kursa,
used Random Ferns in his research.
As a great fan of the R language
he implemented this algorithm
and published it in the CRAN repository.
That was quite a long time
before Spark version 1.0.
Seeing how successful Random Ferns can be,
we decided to empower Spark Community
with this classification algorithm.
The best way to do so
was to publish it via spark-packages.org
Now let me say a few words
about the algorithm itself.
Random Ferns is a method
which uses supervised learning
to classify or label new examples
using knowledge about the training set
The plural form in the algorithm name
indicates that during model creation
many single classifiers will be created
and when classification occurs
results from each of them
will be combined
into one result.
In the ideal world
we could use probabilistic approach
with ease.
We would know the ways
in which features depend on each other and
how the class depends on them.
That is joint probability.
In the real world,
we do not have so much information.
We cannot observe all combinations
of feature values.
Yet, we would really like
to use the probabilistic approach
and in fact we are doing so
in Random Forests, Random Ferns,
you name it.
And that is thanks to
easing constraints on classification,
especially the move to Naïve Bayes,
where we assume
that all features are independent.
This assumption is false,
yet it has proved
to be the second best thing
to the pure truth.
Thanks to this nice property
of probabilistic independence,
we only have to check
how probable obtaining a class is,
given a feature value.
Then we multiply probabilities
of having a given class
from each feature,
and eventually yield
the most probable class.
So, it is much easier
in terms of RAM and computations
to track probabilities
and return the final result.
Let me follow this ML 101 class
for just a few more slides.
So tracking joint probabilities
of everything was the first no-no.
The second no-no is called "overfitting"
Because we want to avoid overfitting and
we would like to create the model in parallel,
a good idea is sampling items with replacement
alongside feature sampling.
This process can be executed
for as many mini-classifiers
as we want,
with as many features
as we want
– and as far as the memory allows us to.
In the presented example
we have 3 subsets out of one.
Each of them holds a sample
of the original data.
Also, each of them
has the same number of features,
but the features may differ across subsets.
Now, using each subset
we can create a mini-classifier
called "Fern".
Ok, we have L ferns,
each of which
uses only S features
out of N.
So we have fewer features
and fewer items.
Each fern classifies
an item in its own way.
For each item
we have probabilities
of an item I being classified
to each of classess.
Now, it may look fancy,
but if we replace the big bold F
with the small bold f
and equate the number of ferns
with the number of features,
this gives us the classical naive Bayes classifier.
Yeah, looks familiar.
So, let's obfruscrate it with big bold F,
number S, etc. going back to Random Ferns.
Thanks to training N classifiers
each of which depends on
some subset of features
we obtain less-naive classification.
We implicitly assume
that some relations in the data
are represented as ferns.
Now what is going on
under the hood
of each fern?
Let's look at
the tree representation
of ferns.
First of all,
yes, all ferns are perfect binary trees.
This is thanks to feature binarization.
Features are somehow binarised
against some threshold, returning 1 or 0.
Each level of a fern
contains a test against the same feature.
The test returns 1 or 0.
So going from the root to a leaf,
we can collect bits,
which can be cast to an integer number,
call it the feature key.
When we are at a leaf,
we see probabilities of each class.
So a fern is a 2D array,
where the x axis is the feature key
and the y axis is a class index.
Now in the cell with indices x & y
we have a probability
Now - it might be big,
but we can train and use it fast.
What is interesting now
is how it can be constructed at scale.
Mateusz, could you give us the details?
Random Ferns are about training several small classifiers (ferns) each of which works on a subset of features
Bagging description
Simulation
Sampling from a big data set would be tough
Let’s look from a different perspective
Order doesn’t really matter
Instead of sampling individual elements we can sample how many times a particular object was selected.
Actually, there’s a probability distribution that perfectly models that process.
The Binomial distribution, whose equation is on the slide, can be used in the sampling.
The interesting thing is, as the number of elements we sample from grows to infinity
(which is true in our case, as we work with big data), the Binomial distribution tends
to Poisson, whose density function is much simpler.
So, we’ll simulate sampling with replacement using Poisson dist
Categorical features: eye colour, gender
Continuous features: income per annum, height
May seem too naïve, but actually works.
Why? Some people state that the whole algorithm is crafted out of pure magic.
A more rational explanation: each feature is used by several ferns and each of them will use its own binarisers.
Together they discriminate between various original feature values fairly well.
Categorical: trivial to implement
— we assume that categorical feature info is user-supplied.
So do the algorithms in MLlib.
To proceed to the next topic we need to analyse some of the equations that Piotr has presented.
Bear with me, you're going to like the result.
For a given object we assign the class which yields the greatest probability.
First — easy, let’s focus on the second
Highlighted part is a combination
Applying binarisers — mapping each object into a bucket
When we recall the classical definition of probability, we’ll realise that
Word „count” should ring a bell
Yes, you’re right, we have just reduced classifier training to the word count.
That gives a deeper meaning to this problem, studied since the emergence of the Hadoop era.
Before we finish this part, just a word of warning…
That's a fair trade-off: you need more memory to model more complex relationships among features.
It would be a shame to present an algorithm at a big data conference without any performance data. Piotr, can you give us some numbers?
---
Random Ferns
can be quite memory consuming.
Because of this,
the first evaluation of the package was
done on…
the Iris and Car datasets.
These datasets
are in fact
used in integration tests.
The accuracy values
obtained on these datasets
were at the expected level,
meaning that the algorithm
is implemented correctly.
At that point
we could calmly move
to bigger datasets.
To check how fast
we can train a model
depending on the number of features
and the number of training samples,
we used the Million Song Dataset
to predict the publication year of each song,
ranging from 1922 to 2011, using 90 numerical features.
This prediction would be better done with regression algorithms.
We know it,
but the point here is to use large-volume data.
The API of random ferns is similar
to the algorithms present in MLlib.
You have to read the data,
parse it into labeled points
and pass it to the train method
together with other input parameters,
i.e. the number of ferns and the number of features.
The train method returns a model,
which can predict the class of each observation.
After training many models
with different numbers
of items, features
and ferns,
we get an empirical estimation
of the time needed to train a model.
Having the number of features fixed,
training time depends linearly on the number of items.
Conversely,
having the number of training items fixed,
training time depends linearly on the number of features.
This estimation looks much better
when you look at the charts.
With the dataset of half a million items
and 10 ferns,
model training takes about 27 minutes
using 10 features out of 90.
Increasing the number of features to 20
results in about two times longer model creation.
Now let's fix
the number of ferns
and features to 10
and change the number of items
used in model training.
Training time is 3 minutes with 10% of the dataset
and about 12 minutes with 50% of the items.
To sum up this part:
assuming we have enough memory,
a model will be created
in quite reasonable and predictable time.
Knowing this, let's move to…
… package publishing.
---
As Piotr said, let us finish the presentation with a few words about the packaging and dissemination of our work.
We used great tools.
Some of them were presented.
We'd like to focus on two of them.
If your artifacts aren't there yet, they should be.
There are a few guides explaining…
Because of that, the only step needed to start working with sparkling-ferns is issuing this command.
We presented the whole process by which sparkling-ferns went from a research paper to deployment.
We have revealed some details regarding its implementation and performance.
Finally, we gave some advice regarding your packages.
Now we’ll be happy to answer any questions you may have