2. The Agenda
1. How it started
2. What is the Random Ferns algorithm?
3. How did implementation, evaluation and publishing go?
3. Motivations
• Random Ferns is a popular classification algorithm
in the image processing field
• Our colleague Miron Kursa implemented this algorithm
as part of his research[1] and published it
as an R package called rFerns
• We decided to empower the Spark community with this
method by making it available as a Spark package
5. The Algorithm
• Random Ferns
– An example of supervised learning
– Solves classification problems
– A kind of ensemble algorithm
6. Posterior Probability
• Hypothetically, we could learn the conditional probabilities
P(C = c_k | f_1, f_2, …, f_N)
• where the classifier H is described as
H(f) = argmax_k P(C = c_k | f_1, f_2, …, f_N)
• Not suitable: not tractable and memory consuming
7. Naïve Bayes Classifier
P(C_m | f) ∝ P(C_m) × P(f | C_m)
P(C_m | f) ∝ P(C_m) × ∏_{i=1}^{N} P(f_i | C_m)
• Naïve, as it ignores dependencies among features
• Often yields quite successful classifications
8. Randomness in classifiers
• Goals to reach:
– Avoid overfitting
– Build classifiers faster
• Ways to introduce randomness:
– Item sampling with replacement
– Feature sampling
9. Random Ferns
• Each classifier l ∈ [1; L] has its own set of S features
F_l = {f_{l,1}, f_{l,2}, …, f_{l,S}}
• Assume that the classifiers are independent:
P(f_1, f_2, …, f_N | C_k) = ∏_{l=1}^{L} P(F_l | C_k)
• Then classify items:
H(f) ≡ argmax_k P(C_k) ∏_{l=1}^{L} P(F_l | C_k)
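This decision rule fits in a few lines of Scala. A toy sketch (the names are ours, not the sparkling-ferns API), working in log space to avoid underflow when many small probabilities are multiplied:

// fernScores(l)(k) stands for P(F_l | C_k) for the bucket
// the object fell into in fern l.
def classify(priors: Map[Int, Double],
             fernScores: Seq[Map[Int, Double]]): Int =
  priors.keys.maxBy { k =>
    math.log(priors(k)) + fernScores.map(s => math.log(s(k))).sum
  }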
10. Random Ferns
• Less-naïve Bayes
• From a Random Forests perspective:
[Diagram: ferns drawn as perfect binary decision trees; within one fern every level tests the same feature (e.g. A, B, C), unlike a general decision tree]
14. Big data bagging
• How many times would a data point be sampled?
– Binomial distribution, p = 1/n:
P(X = k) = C(n, k) · (1/n)^k · (1 − 1/n)^(n−k)
– As n → ∞ (big data) the Binomial distribution tends to
the Poisson distribution with λ = np = 1[2]:
P(X = k) = 1 / (e · k!)
• Simulate sampling using the Poisson distribution
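A hedged sketch of that simulation on an RDD (our illustration, not the sparkling-ferns source): instead of drawing items with replacement, repeat each item Poisson(1)-many times. Spark's own RDD.sample(withReplacement = true, 1.0) relies on the same Poisson-based sampler.

import scala.reflect.ClassTag
import scala.util.Random
import org.apache.spark.rdd.RDD

// Knuth's inversion method for Poisson(lambda); fine for small lambda.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = 1.0
  do { k += 1; p *= rng.nextDouble() } while (p > limit)
  k - 1
}

// Bagging: every item appears Poisson(1)-many times in the bag.
def bag[T: ClassTag](data: RDD[T]): RDD[T] =
  data.mapPartitionsWithIndex { (idx, it) =>
    val rng = new Random(idx) // per-partition seed for reproducibility
    it.flatMap(x => Seq.fill(poisson(1.0, rng))(x))
  }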
15. Binarisation
Note: each fern has its own binarisers
• Categorical features
– Get a random subset of categories
– A given category either fits this set or not
• Continuous features
– Get two random feature values from the training set
– Use their mean as the threshold
16. Binarisation — implementation
• Categorical features
– Trivial, as we have user-supplied category info
• Continuous features
– Assign every value a random float
– Reduce by taking the two values with the greatest floats assigned
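A sketch of the continuous-feature binariser just described (function names are ours): pairing every value with a random float and keeping the two greatest keys is a distributed way of drawing two random samples.

import scala.util.Random
import org.apache.spark.rdd.RDD

// Draw two random values of a feature; their mean becomes the threshold.
def continuousThreshold(values: RDD[Double]): Double = {
  val twoRandom = values
    .map(v => (Random.nextDouble(), v)) // assign a random float to each value
    .top(2)(Ordering.by(_._1))          // keep the two greatest floats
    .map(_._2)
  twoRandom.sum / 2.0
}

// The binariser itself: one feature value becomes one bit of the fern.
def binarise(x: Double, threshold: Double): Boolean = x > threshold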
18. P(F_l | C_k)
• A combination of binary feature values used by fern l
• For a fern of height S there are 2^S distinct values of F_l
• You may think of it as the fern mapping each object
into one of 2^S buckets
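Concretely, a bucket can be identified by packing the fern's S test bits into an integer (the "feature key" mentioned in the notes); a small sketch of ours:

// Pack S binary test outcomes into a bucket index in [0, 2^S).
def bucketIndex(bits: Seq[Boolean]): Int =
  bits.foldLeft(0)((acc, b) => (acc << 1) | (if (b) 1 else 0))

// e.g. bucketIndex(Seq(true, false, true)) == 5, one of 2^3 = 8 buckets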
19. P(F_l | C_k)
• Probability of an object of class C_k falling into bucket F_l
• Count of objects of class C_k falling into bucket F_l,
divided by the count of objects of class C_k:
P(F_l | C_k) = |F_l ∩ C_k| / |C_k|
20. Reduction
• The most important part of training is
counting objects
• Sounds similar to… counting words!
• We have reduced classifier building to the
best-known big data problem
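A hedged sketch of that reduction (types and names are ours, not the sparkling-ferns code): map every training object to the bucket it falls into in each fern, keyed by (fern, bucket, class), and count the keys exactly as in word count.

import org.apache.spark.rdd.RDD

case class Point(label: Double, features: Array[Double])

// bucketOf(l, features) applies fern l's binarisers and returns a bucket.
def countBuckets(data: RDD[Point],
                 bucketOf: (Int, Array[Double]) => Int,
                 numFerns: Int): RDD[((Int, Int, Double), Long)] =
  data
    .flatMap { p =>
      (0 until numFerns).map(l => ((l, bucketOf(l, p.features), p.label), 1L))
    }
    .reduceByKey(_ + _) // the word-count step

// Dividing each count by its per-class total yields P(F_l | C_k).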
21. Memory
Q: How many probabilities do we need to compute?
A: About 2^S per fern
That means a binary classifier of 100 20-feature ferns
will weigh over 1.5 GB
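A back-of-the-envelope check of that figure (our arithmetic, assuming one 8-byte double per probability):
100 ferns × 2^20 buckets × 2 classes × 8 B ≈ 1.7 × 10^9 B, just over 1.5 GB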
23. Accuracy et al.
• Evaluation on the Iris and Car datasets as integration tests
• Iris:
– 10 ferns, 3 features per fern (out of 4)
– Accuracy: 98%
• Car:
– 20 ferns, 4 features per fern (out of 6)
– Accuracy: 90%
24. Dataset
• Million Song Dataset – Year Prediction
– Not quite a classification task, but big (0.5M items)
– Task: given 90 real-number features, indicate the
publication year (ranging from 1922 to 2011)
– For the sake of demonstration, let's just pretend it is
a classification problem
25. Model Training Code
val raw = sc.textFile(…)                    // load the raw dataset
val lp = raw.map(parseIntoLabeledPoints(_)) // parse into LabeledPoints
val data = splitIntoTrainTest(lp)           // train/test split
val numFerns = 90
val numFeatures = 10
val model = FernForest.train(data.train, numFerns, numFeatures, Map.empty)
val correct = data.test.map(lp => model.predict(lp.features) == lp.label)
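A possible follow-up (our addition, not part of the original snippet) to turn the boolean RDD into an accuracy figure:

val accuracy = correct.filter(identity).count().toDouble / correct.count()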
26. Model Training Time
T(‖f‖, ‖D‖) = 10^−5 + 4.2 · 10^−6 · ‖f‖ · ‖D‖
• Where:
– ‖f‖ is the number of features
– ‖D‖ is the number of items in the dataset
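As a worked check (our arithmetic, taking the formula's output to be minutes, as on the charts): with ‖f‖ = 10 and ‖D‖ = 0.5 × 10^6, T ≈ 4.2 · 10^−6 · 10 · 5 · 10^5 = 21 minutes, the same order as the measured training times on the next slide.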
27. Model Training Time
[Chart: estimated training time [min] against the number of features (10–20)]
• Training time is linear
– in the number of features (differently from Random Forests)
– in the number of samples
[Chart: training time [min] against the sample size (0%–60% of the 0.5M-item dataset)]
30. How can you help your users?
• Simplify discovery
– Register at spark-packages.org
• Simplify utilisation
– Publish artifacts to the Central Repository
31. spark-packages.org
• An index of packages for Apache Spark
• Spark Community keeps an eye on it
• Ideal place if you want to extend Spark
• You can register any GitHub-hosted Spark
project
32. The Central Repository
• Apache Maven retrieves all components from
the Central Repository by default
– so does Apache Spark
– and many other build systems
• Are your artifacts there yet?
33. Getting to the Central
Sonatype provides OSSRH:
– a free repository
– for open source software
– stores snapshot artifacts
– promotes releases to the Central Repository
Checklist:
1. Register[3] at Sonatype OSSRH
2. Generate GPG key (if you don’t have one yet)
3. Alter[4] your build.sbt (see the sketch after this list)
4. Build and sign your artifacts
5. Stage[5] release at OSSRH and promote to Central Repository
6. Voilà!
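A minimal build.sbt sketch for step 3 (the values and layout are illustrative; [3], [4] and [5] are the authoritative guides):

// Publish Maven-style artifacts to OSSRH; snapshots and releases
// go to different repositories.
publishMavenStyle := true
publishTo := {
  val nexus = "https://oss.sonatype.org/"
  if (isSnapshot.value)
    Some("snapshots" at nexus + "content/repositories/snapshots")
  else
    Some("releases" at nexus + "service/local/staging/deploy/maven2")
}
// The Central Repository additionally requires POM metadata
// (project URL, licence, SCM, developers) and GPG-signed artifacts;
// signing (step 4) is typically handled by the sbt-pgp plugin
// via the publishSigned task.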
34. Things are smooth now
$SPARK_HOME/bin/spark-shell \
  --packages pl.edu.icm:sparkling-ferns_2.10:0.2.0
36. References
[1] "rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning", M. Kursa, DOI: 10.18637/jss.v061.i10
[2] "Proof that the Binomial Distribution tends to the Poisson Distribution", https://youtu.be/ceOwlHnVCqo
[3] "OSSRH Guide", Sonatype, http://central.sonatype.org/pages/ossrh-guide.html
[4] "Deploying to Sonatype", sbt, http://www.scala-sbt.org/release/docs/Using-Sonatype.html
[5] "Releasing the Deployment", Sonatype, http://central.sonatype.org/pages/releasing-the-deployment.html
Editor's Notes
Good afternoon Ladies and Gentlemen,
My name is Piotr Dendek.
Together with Mateusz Fedoryszak
we are going to present
the implementation of Random Ferns
for Apache Spark
done at the Interdisciplinary Centre
for Mathematical and Computational Modelling,
part of the University of Warsaw.
I am going to tell you
how we got here
(in terms
of the Random Ferns implementation).
First, what or who has inspired us
and what are Random Ferns.
Next, Mateusz is going to describe
the implementation part
Finally we are going to share with you
evaluation results
and describe how to publish your package
on spark-packages.org.
So, let’s start!
Some of you,
especially people interested in the Image Processing Field
might have heard
about Random Ferns
as one of the state-of-the-art algorithms.
One of our colleagues at ICM,
Miron Kursa,
used Random Ferns in his research.
As a great fan of the R language
he implemented this algorithm
and published it in the CRAN repository.
That was quite a long time
before Spark version 1.0.
Seeing how successful Random Ferns can be,
we decided to empower Spark Community
with this classification algorithm.
The best way to do so
was to publish it via spark-packages.org
Now let me say a few words
about the algorithm itself.
Random Ferns is a method
which uses supervised learning
to classify or label new examples
using knowledge about the training set
The plural form in the algorithm name
indicates that during model creation
many single classifiers will be created
and when classification occurs
results from each of them
will be combined
into one result.
In the ideal world
we could use probabilistic approach
with ease.
We would know the ways
in which features depend on each other and
how the class depends on them.
That is joint probability.
In the real world,
we do not have so much information.
We cannot observe all combinations
of feature values.
Yet, we would really like
to use the probabilistic approach
and in fact we are doing so
in Random Forests, Random Ferns,
you name it.
And that is thanks to
easing constraints on classification,
especially the move to Naïve Bayes,
where we assume
that all features are independent.
This assumption is false,
yet it has proved
to be the second best thing
to the pure truth.
Thanks to this nice property
of probabilistic independence,
we only have to check
how probable obtaining a class is,
given a feature value.
Then we multiply probabilities
of having a given class
from each feature,
and eventually yield
the most probable class.
So, it is much easier
in terms of RAM and computations
to track probabilities
and return the final result.
Let me follow this ML 101 class
for just a few more slides.
So tracking joint probabilities
of everything was the first no-no.
The second no-no is called "overfitting"
Because we want to avoid overfitting and
we would like to create the model in parallel,
a good idea is sampling items with replacement
alongside feature sampling.
This process can be executed
for as many mini-classifiers
as we want,
with as many features
as we want
– and as far as the memory allows us to.
In the presented example
we have 3 subsets out of one.
Each of them holds a sample
of the original data.
Also, each of them
has the same number of features,
but the features may differ across subsets.
Now, using each subset
we can create a mini-classifier
called "Fern".
Ok, we have L ferns,
each of which
uses only S features
out of N.
So we have fewer features
and fewer items.
Each fern classifies
an item in its own way.
For each item
we have probabilities
of an item I being classified
to each of classess.
Now, it may look fancy,
but if we replace the big bold F
with the small bold f
and equate the number of ferns
with the number of features,
this gives us the classical naive Bayes classifier.
Yeah, looks familiar.
So, let's obfruscrate it with big bold F,
number S, etc. going back to Random Ferns.
Thanks to training N classifiers
each of which depends on
some subset of features
we obtain less-naive classification.
We implicitly assume
that some relations in the data
are represented as ferns.
Now what is going on
under the hood
of each fern?
Let's look at
the tree representation
of ferns.
First of all,
yes, all ferns are perfect binary trees.
This is thanks to feature binarization.
Features are somehow binarised
against some threshold, returning 1 or 0.
Each level of a fern
contains a test against the same feature.
The test returns 1 or 0.
So going from the root to a leaf,
we can collect bits,
which can be cast to an integer number,
call it the feature key.
When we are at a leaf,
we see probabilities of each class.
So a fern is a 2D array,
where the x axis is the feature key
and the y axis is a class index.
Now in the cell with indices x & y
we have a probability
Now - it might be big,
but we can train and use it fast.
What is interesting now
is how it can be constructed at scale.
Mateusz, could you give us the details?
Random Ferns are about training several small classifiers (ferns) each of which works on a subset of features
Bagging description
Simulation
Sampling from a big data set would be tough
Let’s look from a different perspective
Order doesn’t really matter
Instead of sampling individual elements we can sample how many times a particular object was selected.
Actually, there’s a probability distribution that perfectly models that process.
The Binomial distribution, whose equation is on the slide, can be used in the sampling.
The interesting thing is, as the number of elements we sample from grows to infinity
(which is true in our case, as we work with big data), the Binomial distribution tends
to Poisson, whose density function is much simpler.
So, we’ll simulate sampling with replacement using Poisson dist
Categorical features: eye colour, gender
Continuous features: income per annum, height
May seem too naïve, but actually works.
Why? Some people state that the whole algorithm is crafted out of pure magic.
A more rational explanation: each feature is used by several ferns and each of them will use its own binarisers.
Together they discriminate between various original feature values fairly well.
Categorical: trivial to implement
— we assume that categorical feature info is user-supplied.
So do the algorithms in MLlib.
To proceed to the next topic we need to analyse some of the equations that Piotr has presented.
Bear with me, you're going to like the result.
For a given object we assign the class which yields the greatest probability.
First — easy, let’s focus on the second
Highlighted part is a combination
Applying binarisers — mapping each object into a bucket
When we recall the classical definition of probability, we’ll realise that
Word „count” should ring a bell
Yes, you’re right, we have just reduced classifier training to the word count.
That gives a deeper meaning to this problem, studied since the emergence of the Hadoop era.
Before we finish this part, just a word of warning…
That's a fair trade-off: you need more memory to model more complex relationships among features.
It would be a shame to present an algorithm at a big data conference without any performance data. Piotr, can you give us some numbers?
---
Random Ferns
can be quite memory consuming.
Because of this,
the first evaluation of the package was
done on…
the Iris and Car datasets.
These datasets
are in fact
used in integration tests.
The accuracy values
obtained on these datasets
were at the expected level,
meaning that the algorithm
is implemented correctly.
At that point
we could calmly move
to bigger datasets.
To check how fast
we can train a model
depending on the number of features
and the number of training samples,
we used the Million Song Dataset
to predict the publication year of each song,
ranging from 1922 to 2011, using 90 numerical features.
This prediction would be better done with regression algorithms.
We know it,
but the point here is to use large-volume data.
The API of random ferns is similar
to the algorithms present in MLlib.
You have to read the data,
parse it into labeled points
and pass it to the train method
together with other input parameters,
i.e. the number of ferns and the number of features.
The train method returns a model,
which can predict the class of each observation.
After training many models
with different numbers
of items, features
and ferns,
we get an empirical estimation
of the time needed to train a model.
Having the number of features fixed,
training time depends linearly on the number of items.
Conversely,
having the number of training items fixed,
training time depends linearly on the number of features.
This estimation looks much better
when you look at the charts.
With the dataset of half a million items
and 10 ferns,
model training takes about 27 minutes
using 10 features out of 90.
Increasing the number of features to 20
results in about two times longer model creation.
Now let's fix
the number of ferns
and features to 10
and change the number of items
used in model training.
Training time is 3 minutes with 10% of the dataset
and about 12 minutes with 50% of the items.
To sum up this part:
assuming we have enough memory,
a model will be created
in quite reasonable and predictable time.
Knowing this, let's move to…
… package publishing.
---
As Piotr said, let us finish the presentation with a few words about the packaging and dissemination of our work.
We used great tools.
Some of them were presented.
We'd like to focus on two of them.
If your artifacts aren't there yet, they should be.
There are a few guides explaining…
Because of that, the only step needed to start working with sparkling-ferns is issuing this command.
We presented the whole process by which sparkling-ferns went from a research paper to deployment.
We have revealed some details regarding its implementation and performance.
Finally, we gave some advice regarding your packages.
Now we’ll be happy to answer any questions you may have