Sparkling Random Ferns.
From an academic paper to spark-packages.org
Piotr Jan Dendek
Mateusz Fedoryszak
The Agenda
1. How did it start?
2. What is the Random Ferns algorithm?
3. How did the implementation, evaluation and publishing go?
Motivations
• Random Ferns is a popular classification algorithm in the image processing field
• Our colleague, Miron Kursa, implemented this algorithm as part of his research[1] and published it as an R package called rFerns
• We decided to empower the Spark community with this method by making it available as a Spark package
THE ALGORITHM
The Algorithm
• Random Ferns
– An example of supervised learning
– Solves classification problems
– A kind of ensemble algorithm
Posterior Probability
• Hypothetically we could learn the full conditional probability
  $P(C = c_m \mid f_1, f_2, \dots, f_N)$
• Where the classifier $H$ is described as
  $H(\mathbf{f}) = \arg\max_k P(C = c_k \mid f_1, f_2, \dots, f_N)$
• Not suitable, not tractable, memory consuming
Naïve Bayes Classifier
$P(C_m \mid \mathbf{f}) \propto P(C_m) \times P(\mathbf{f} \mid C_m)$
$P(C_m \mid \mathbf{f}) \propto P(C_m) \prod_{i=1}^{N} P(f_i \mid C_m)$
• Naïve as it misses dependencies among features
• Often quite successful classifications
Randomness in classifiers
• Goals to reach:
– Avoid overfitting
– Build classifiers faster
• Ways to introduce randomness:
– Item sampling with replacement
– Feature sampling
Random Ferns
• Each classifier $l \in [1; L]$ has its own set of features indexed $[1; S]$
  $F_l = \{f_{l,1}, f_{l,2}, \dots, f_{l,S}\}$
• Assume that the classifiers are independent
  $P(f_1, f_2, \dots, f_N \mid C_k) = \prod_{l=1}^{L} P(F_l \mid C_k)$
• Then classify items
  $H(\mathbf{f}) \equiv \arg\max_k P(C_k) \prod_{l=1}^{L} P(F_l \mid C_k)$
Random Ferns
• Less-naïve Bayes
• From a Random Forests perspective:
[Diagram: three ferns drawn as perfect binary trees; within each fern every level tests the same binarised feature (e.g. A, B, C in the first fern), unlike a Random Forest tree]
THE IMPLEMENTATION
Bagging
[Diagram: the initial set is bagged into Fern 1, Fern 2 and Fern 3]
Each fern's bag records how many times every item was drawn, e.g.:
Fern 1: 2 0 2 1 0
Fern 2: 1 1 0 1 2
Fern 3: 1 1 1 1 1
Big data bagging
• How many times would a data point be sampled?
– Binomial distribution with $p = \frac{1}{n}$:
  $P(x = k) = \binom{n}{k} \left(\frac{1}{n}\right)^{k} \left(1 - \frac{1}{n}\right)^{n-k}$
– As $n \to \infty$ (big data), the Binomial distribution tends to the Poisson distribution with $\lambda = np = 1$ [2]:
  $P(x = k) = \frac{1}{e \cdot k!}$
Simulate sampling using Poisson distribution
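As an illustration (our sketch, not the package's internals), the Poisson(1) simulation could look like this, where data is an assumed RDD of training items and numFerns is the number of ferns:

import scala.util.Random

// Knuth's algorithm for drawing from Poisson(lambda = 1)
def poisson1(rng: Random): Int = {
  val limit = math.exp(-1.0)
  var k = 0
  var p = 1.0
  do {
    k += 1
    p *= rng.nextDouble()
  } while (p > limit)
  k - 1
}

// Attach to every training item one Poisson(1) multiplicity per fern,
// instead of sampling the whole dataset with replacement.
val withCounts = data.map { point =>
  val rng = new Random()
  (point, Array.fill(numFerns)(poisson1(rng)))
}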
Binarisation
Note: each fern has its own binarisers
Categorical features:
— Get a random subset of categories
— A given category either fits this set or not
Continuous features:
— Get two random feature values from the training set
— Use their mean as the threshold
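To make the two cases concrete, a minimal sketch of such binarisers (our naming, not the package's API):

sealed trait Binariser { def apply(value: Double): Boolean }

// Categorical: the value either belongs to a random subset of categories or it does not.
case class SubsetBinariser(chosen: Set[Double]) extends Binariser {
  def apply(value: Double): Boolean = chosen.contains(value)
}

// Continuous: compare against the mean of two random training-set values.
case class ThresholdBinariser(threshold: Double) extends Binariser {
  def apply(value: Double): Boolean = value > threshold
}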
Binarisation — implementation
Categorical features:
— Trivial, as we have user-supplied category info
Continuous features:
— Assign every value a random float
— Reduce by taking the two values with the greatest floats assigned
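A sketch of that reduce trick (not the package's actual code), assuming values is an RDD[Double] holding one feature's values from the training set:

import scala.util.Random

// Tag every value with a random float and keep the two values with the
// greatest tags; top() is a distributed reduction, so this is one pass over the data.
val tagged = values.map(v => (Random.nextDouble(), v))
val Array((_, a), (_, b)) = tagged.top(2)
val threshold = (a + b) / 2.0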
$H(\mathbf{f}) \equiv \arg\max_k P(C_k) \prod_{l=1}^{L} P(F_l \mid C_k)$
Probabilities
What’s that?
$P(F_l \mid C_k)$
• $F_l$ is a combination of the binary feature values used by fern $l$
• For a fern of height $S$ there are $2^S$ distinct values of $F_l$
• You may think of it as the fern mapping each object into one of $2^S$ buckets
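A tiny sketch of that mapping (our naming), assuming the fern's S binary test outcomes are already computed:

// Fold the S binary test outcomes into a bucket index in 0 .. 2^S - 1.
def bucketOf(bits: Seq[Boolean]): Int =
  bits.foldLeft(0)((acc, bit) => (acc << 1) | (if (bit) 1 else 0))

// e.g. for S = 3 the outcomes (true, false, true) map to bucket 5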
$P(F_l \mid C_k)$
• Probability of an object of class $C_k$ falling into bucket $F_l$
• Count of objects of class $C_k$ falling into bucket $F_l$, divided by the count of objects of class $C_k$:
  $P(F_l \mid C_k) = \frac{|F_l \cap C_k|}{|C_k|}$
Reduction
• The most important training part is
counting objects
• Sounds similar to… counting words!
• We have reduced classifier building to the
best-known big data problem
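As a word-count-style sketch (our naming: withCounts and bucketOf come from the earlier sketches, and fernTests is a hypothetical helper that applies one fern's binarisers to an item and returns the S outcomes):

// For every item and every fern emit ((fern index, bucket, class), multiplicity),
// then sum the multiplicities; this has exactly the shape of a word count.
val counts = withCounts.flatMap { case (point, multiplicities) =>
  multiplicities.zipWithIndex.map { case (m, fern) =>
    ((fern, bucketOf(fernTests(fern, point)), point.label), m.toLong)
  }
}.reduceByKey(_ + _)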
Memory
Q: How many probabilities do we need to compute?
A: About $2^S$ per class per fern
That means a binary classifier of 100 20-feature ferns will weigh over 1.5 GB
($2^{20}$ buckets × 2 classes × 100 ferns × 8 bytes ≈ 1.6 GB)
THE EVALUATION
Accuracy et al.
• Evaluation on the Iris and Car datasets as an integration test
• Iris:
– 10 ferns, 3 features per fern (out of 4)
– Accuracy: 98%
• Car:
– 20 ferns, 4 features per fern (out of 6)
– Accuracy: 90%
Dataset
• Million Song Dataset – Year Prediction
– Not quite about classification, but big (0.5M items)
– Task: given 90 real-valued features, indicate the publication year (ranging from 1922 to 2011)
– For the sake of demonstration, let's just pretend it is a classification problem
Model Training Code
// parseIntoLabeledPoints and splitIntoTrainTest are helper functions defined elsewhere
val raw = sc.textFile(…)
val lp = raw.map(parseIntoLabeledPoints(_))
val data = splitIntoTrainTest(lp)

val numFerns = 90
val numFeatures = 10
val model = FernForest.train(data.train, numFerns, numFeatures, Map.empty)

// predict each test item and compare with its true label
val correct = data.test.map(lp => model.predict(lp.features) == lp.label)
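A possible follow-up (ours, not shown on the slide) turning the per-item booleans into an accuracy figure:

val accuracy = correct.filter(identity).count().toDouble / correct.count()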
Model Training time
$T(\|\mathbf{f}\|, \|\mathbf{D}\|) = 10^{-5} + 4.2 \cdot 10^{-6} \cdot \|\mathbf{f}\| \cdot \|\mathbf{D}\|$
• Where:
– $\|\mathbf{f}\|$ is the number of features
– $\|\mathbf{D}\|$ is the number of items in the dataset
Model Training Time
[Chart: estimated training time in minutes (25–50) vs. number of features (10–20)]
• Training time is linear
– against the number of features (unlike Random Forests)
– against the number of samples
[Chart: training time in minutes (0–12) vs. sample size (0%–60% of the 0.5M-item dataset)]
THE PACKAGE
Our toolbox
How can you help your users?
• Simplify discovery
– Register at spark-packages.org
• Simplify utilisation
– Publish artifacts to the Central Repository
spark-packages.org
• An index of packages for Apache Spark
• Spark Community keeps an eye on it
• Ideal place if you want to extend Spark
• You can register any GitHub-hosted Spark
project
The Central Repository
• Apache Maven retrieves all components from
the Central Repository by default
– so does Apache Spark
– and many other build systems
• Are your artifacts there yet?
Getting to the Central
Sonatype provides OSSRH:
– a free repository
– for open source software
– stores snapshot artifacts
– promotes releases to the Central Repository
Checklist:
1. Register[3] at Sonatype OSSRH
2. Generate a GPG key (if you don't have one yet)
3. Alter[4] your build.sbt (see the sketch below)
4. Build and sign your artifacts
5. Stage[5] the release at OSSRH and promote it to the Central Repository
6. Voilà!
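For step 3, a minimal sketch of the Sonatype-related build.sbt additions (illustrative only; follow the guides in [3] and [4] for the authoritative setup):

publishMavenStyle := true

publishTo := {
  val nexus = "https://oss.sonatype.org/"
  if (isSnapshot.value) Some("snapshots" at nexus + "content/repositories/snapshots")
  else Some("releases" at nexus + "service/local/staging/deploy/maven2")
}

// The Central Repository additionally requires licence, homepage, SCM and
// developer information in the generated POM, plus GPG-signed artifacts.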
Things are smooth now
$SPARK_HOME/bin/spark-shell \
--packages pl.edu.icm:sparkling-ferns_2.10:0.2.0
THANK YOU! QUESTIONS?
http://spark-packages.org/package/CeON/sparkling-ferns
@pjden
@mfedoryszak
/piotrdendek
/mfedoryszak
References
[1] "rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning", M. Kursa, DOI: 10.18637/jss.v061.i10
[2] "Proof that the Binomial Distribution tends to the Poisson Distribution", https://youtu.be/ceOwlHnVCqo
[3] "OSSRH Guide", Sonatype, http://central.sonatype.org/pages/ossrh-guide.html
[4] "Deploying to Sonatype", sbt, http://www.scala-sbt.org/release/docs/Using-Sonatype.html
[5] "Releasing the Deployment", Sonatype, http://central.sonatype.org/pages/releasing-the-deployment.html
Editor's Notes
  1. Good afternoon, Ladies and Gentlemen. My name is Piotr Dendek. Together with Mateusz Fedoryszak, we are going to present the implementation of Random Ferns for Apache Spark, done at the Interdisciplinary Centre for Mathematical and Computational Modelling, part of the University of Warsaw.
  2. I am going to tell you how we got here (in terms of the Random Ferns implementation): first, what or who inspired us and what Random Ferns are. Next, Mateusz is going to describe the implementation part. Finally, we are going to share the evaluation results with you and describe how to publish your package on spark-packages.org. So, let's start!
  3. Some of you, especially people interested in the image processing field, might have heard about Random Ferns as one of the state-of-the-art algorithms. One of our colleagues at ICM, Miron Kursa, used Random Ferns in his research. As a great fan of the R language, he implemented this algorithm and published it in the CRAN repository. That was quite a long time before Spark version 1.0. Seeing how successful Random Ferns can be, we decided to empower the Spark community with this classification algorithm. The best way to do so was to publish it via spark-packages.org.
  4. Now let me say a few words   about the algorithm itself. 
  5. Random Ferns is a method which uses supervised learning to classify, or label, new examples using knowledge about the training set. The plural form in the algorithm name indicates that during model creation many single classifiers will be created, and when classification occurs the results from each of them will be combined into one result.
  6. In the ideal world we could use the probabilistic approach with ease. We would know the ways in which features depend on each other and how the class depends on them; that is the joint probability. In the real world we do not have so much information: we cannot observe all combinations of feature values. Yet we would really like to use the probabilistic approach, and in fact we are doing so in Random Forests, Random Ferns, you name it. And that is thanks to easing the constraints on classification, especially...
  7. ...the move to Naïve Bayes, where we assume that all features are independent. This assumption is false, yet it has proved to be the second best thing to the pure truth. Thanks to this nice property of probabilistic independence, we only have to check how probable each class is given a single feature value. Then we multiply the probabilities of a given class obtained from each feature and eventually yield the most probable class. So it is much easier, in terms of RAM and computation, to track the probabilities and return the final result. Let me follow this ML 101 class for just a few more slides.
  8. So tracking the joint probabilities of everything was the first no-no. The second no-no is called "overfitting". Because we want to avoid overfitting and we would like to create the model in parallel, a good idea is to sample items with replacement alongside feature sampling. This process can be executed for as many mini-classifiers as we want, with as many features as we want, and as memory allows us to. In the presented example we have 3 subsets out of one. Each of them has a sample of the original data. Also, each of them has the same number of features, but the features may differ across subsets. Now, using each subset, we can create a mini-classifier called a "fern".
  9. OK, we have L ferns, each of which uses only S features out of N. So we have fewer features and fewer items. Each fern classifies an item in its own way. For each item we have the probabilities of it being classified into each of the classes. Now, it may look fancy, but if we replace the big bold F with a small bold f and set the number of ferns equal to the number of features, this gives us the classical Naïve Bayes classifier. Yeah, looks familiar. So let's obfuscate it again with the big bold F, the number S, etc., going back to Random Ferns.
  10. Thanks to training L classifiers, each of which depends on some subset of features, we obtain less-naïve classification. We implicitly assume that some relations in the data are represented by the ferns. Now, what is going on under the hood of each fern? Let's look at the tree representation of ferns. First of all, yes, all ferns are perfect binary trees. This is thanks to feature binarisation: features are binarised against some threshold, returning 1 or 0. Each level of a fern contains a test against the same feature. The test returns the value 1 or 0, so going from the root to a leaf we can collect bits, which can be cast to an integer number; call it the feature key. When we are at a leaf, we see the probabilities of each class. So a fern is a 2D array, where the x axis is the feature key and the y axis is a class index. In the cell with indices x and y we have a probability. Now, it might be big, but we can train and use it fast.
  11. What is interesting now is how it can be constructed at scale. Mateusz, could you bring us details.
  12. Random Ferns are about training several small classifiers (ferns), each of which works on a subset of features. Bagging description. Simulation: sampling from a big data set would be tough, so let's look at it from a different perspective; order doesn't really matter.
  13. Instead of sampling individual elements we can record how many times a particular object was selected. Actually, there's a probability distribution that perfectly models that process.
  14. The Binomial distribution, whose equation is on the slide, can be used in the sampling. The interesting thing is that, as the number of elements we sample from grows to infinity (which is true in our case, as we work with big data), the Binomial distribution tends to the Poisson distribution, whose density function is much simpler. So we'll simulate sampling with replacement using the Poisson distribution.
  15. Categorical features: eye colour, gender. Continuous features: income per annum, height. This may seem too naïve, but it actually works. Why? Some people state that the whole algorithm is crafted out of pure magic. A more rational explanation: each feature is used by several ferns and each of them will use its own binarisers, which together discriminate between the various original feature values fairly well.
  16. Categorical: trivial to implement, as we assume that categorical feature info is user supplied. So do the algorithms in MLlib.
  17. To proceed to the next topic we need to analyse some of the equations that Piotr has presented. Bear with me, you're going to like the result. For a given object we assign the class which yields the greatest probability. The first factor is easy, so let's focus on the second.
  18. The highlighted part is a combination. Applying the binarisers means mapping each object into a bucket.
  19. When we recall the classical definition of probability, we'll realise that the word "count" should ring a bell.
  20. Yes, you're right, we have just reduced classifier training to word count. That gives a deeper meaning to this problem, studied since the emergence of the Hadoop era.
  21. Before we finish this part, just a word of warning… That's a fair trade-off: you need more memory to model more complex relationships among features.
  22. It would be a shame to present an algorithm at a big data conference without any performance data. Piotr, can you give us some numbers? --- Random Ferns can be quite memory consuming. Because of this, the first evaluation of the package was done on…
  23. …the Iris and Car datasets. These datasets are in fact used in the integration tests. The accuracy values obtained on these datasets were at the expected level, meaning that the algorithm is implemented correctly. At that point we could calmly move to bigger datasets, to check how fast we can train a model depending on the number of features and the number of training samples.
  24. We used the Million Song Dataset to predict the publication year of each song, ranging from 1922 to 2011, using 90 numerical features. This prediction would be better done with regression algorithms; we know it, but the point here is to use large-volume data.
  25. The API of Random Ferns is similar to the algorithms present in MLlib. You have to read the data, parse it into labeled points and pass them to the train method together with the other input parameters, i.e. the number of ferns and the number of features. The training method returns a model, which can predict the class of each observation. After training many models with different numbers of items, features and ferns…
  26. …we get an empirical estimation of the time needed to train a model. With the number of features fixed, training time depends linearly on the number of items. Conversely, with the number of training items fixed, training time depends linearly on the number of features. This estimation looks much better when you look at the charts.
  27. With the dataset of half a million items and 10 ferns, model training takes about 27 minutes using 10 features out of 90. Increasing the number of features to 20 results in about two times longer model creation. Now let's fix the number of ferns and features at 10 and change the number of items used in model training. Training time is 3 minutes with 10% of the dataset and about 12 minutes with 50% of the items. To sum up this part: assuming we have enough memory, a model will be created in quite reasonable and predictable time. Knowing this, let's move to…
  28. …package publishing. --- As Piotr said, let us finish the presentation with a few words about the packaging and dissemination of our work.
  29. We used great tools; some of them are presented here. We'd like to focus on two of them.
  30. If your artifacts aren't there yet, they should be.
  31. There are a few guides explaining…
  32. Because of that, the only step needed to start working with sparkling ferns is issuing this command.
  33. We presented the whole process by which Sparkling Ferns went from a research paper to deployment. We have revealed some details regarding the implementation and performance. Finally, we gave some advice regarding your own packages. Now we'll be happy to answer any questions you may have.