Sparkling Random Ferns.
From an academic paper to spark-packages.org
Piotr Jan Dendek
Mateusz Fedoryszak
The Agenda
1. How did it start?
2. What is the Random Ferns algorithm?
3. How did implementation, evaluation and publishing
go?
Motivations
• Random Ferns is a popular classification algorithm in
the image processing field
• Our colleague Miron Kursa implemented this algorithm
as part of his research [1] and published it as an R package
called rFerns
• We decided to empower the Spark community with this
method by making it available as a Spark package
THE ALGORITHM
The Algorithm
• Random Ferns
– An example of supervised learning
– Solves classification problems
– A kind of ensemble algorithm
Posterior Probability
• Hypothetically, we could learn the conditional probabilities directly:
$P(C = c_m \mid f_1, f_2, \ldots, f_N)$
• Where the classifier $H$ is described as
$H(\mathbf{f}) = \arg\max_m P(C = c_m \mid f_1, f_2, \ldots, f_N)$
• Not suitable: not tractable and memory-consuming
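A rough count shows why the full posterior is out of reach. Assuming, purely for illustration, that all $N$ features are binary and that there are $M$ classes (neither assumption comes from the slides), a complete table needs one entry per class and per combination of feature values:

```latex
\[
  \#\text{entries} = M \cdot 2^{N},
  \qquad \text{e.g. } M = 2,\ N = 30: \quad 2 \cdot 2^{30} \approx 2.1 \times 10^{9}.
\]
```

Each of those entries would also need enough training examples to be estimated reliably, which is what makes the approach impractical.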
Naïve Bayes Classifier
$P(C_m \mid \mathbf{f}) \propto P(C_m) \times P(\mathbf{f} \mid C_m)$
$P(C_m \mid \mathbf{f}) \propto P(C_m) \prod_{i=1}^{N} P(f_i \mid C_m)$
• Naïve because it ignores dependencies among features
• Often classifies quite successfully
Randomness in classifiers
• Goals:
– Avoid overfitting
– Build classifiers faster
• Ways to introduce randomness:
– Item sampling with replacement
– Feature sampling
Random Ferns
• Each classifier $l \in [1; L]$ has its own set of $S$ features
$F_l = \{f_{l,1}, f_{l,2}, \ldots, f_{l,S}\}$
• Assume that the classifiers are independent
$P(f_1, f_2, \ldots, f_N \mid C_k) = \prod_{l=1}^{L} P(F_l \mid C_k)$
• Then classify items
$H(\mathbf{f}) \equiv \arg\max_k P(C_k) \prod_{l=1}^{L} P(F_l \mid C_k)$
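The classification rule can be sketched in a few lines of Scala. This is only an illustration, not the package's code: the names `FernClassify`, `logPrior` and `logLik` are hypothetical, and summing logarithms instead of multiplying probabilities is an added choice to avoid numerical underflow when there are many ferns.

```scala
object FernClassify {
  /**
   * Returns the index k of the class maximising
   * log P(C_k) + sum over ferns l of log P(F_l | C_k).
   *
   * buckets(l)      : value of F_l observed for the object (a bucket index)
   * logPrior(k)     : log P(C_k)
   * logLik(l)(k)(b) : log P(F_l = b | C_k)
   */
  def classify(buckets: Seq[Int],
               logPrior: Array[Double],
               logLik: Array[Array[Array[Double]]]): Int = {
    val scores = logPrior.indices.map { k =>
      logPrior(k) + buckets.indices.map(l => logLik(l)(k)(buckets(l))).sum
    }
    // arg max over classes
    scores.indexOf(scores.max)
  }
}
```

Because the logarithm is monotonic, the arg max is the same as for the product form on the slide.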
Random Ferns
• Less-naïve Bayes
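• From the Random Forests perspective:
[Figure: three ferns drawn as depth-3 binary trees in which all nodes at a given depth test the same feature; the ferns use features (A, B, C), (A, B, D) and (C, D, E)]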
THE IMPLEMENTATION
Bagging
Number of times each of the 5 items of the initial set is drawn for each fern:
Fern 1: 2 0 2 1 0
Fern 2: 1 1 0 1 2
Fern 3: 1 1 1 1 1
Big data bagging
• How many times would a data point be sampled?
– Binomial distribution, $p = \frac{1}{n}$
$P(x = k) = \binom{n}{k} \left(\frac{1}{n}\right)^{k} \left(1 - \frac{1}{n}\right)^{n-k}$
– As $n \to \infty$ (big data), the binomial distribution tends to the
Poisson distribution with $\lambda = np = 1$ [2]
$P(x = k) = \frac{1}{e \cdot k!}$
• Simulate sampling using the Poisson distribution
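A minimal sketch of Poisson-based bagging, assuming Knuth's multiplication method as the sampler (the slides do not say which sampler the package uses). The names `PoissonBagging`, `samplePoisson` and `bootstrapWeights` are hypothetical. Each item independently receives, for every fern, a count drawn from Poisson(1), so no global sampling step over the whole dataset is needed.

```scala
import scala.util.Random

object PoissonBagging {
  /** Draws one Poisson(lambda) sample using Knuth's multiplication method. */
  def samplePoisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    // Multiply uniforms until the product drops to exp(-lambda) or below
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  /** weights(i)(l) = how many times item i appears in the bag of fern l. */
  def bootstrapWeights(numItems: Int, numFerns: Int, seed: Long): Array[Array[Int]] = {
    val rng = new Random(seed)
    Array.fill(numItems, numFerns)(samplePoisson(1.0, rng))
  }
}
```

On average each item is drawn once per fern, and the bag sizes are only approximately equal to the size of the initial set.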
Binarisation
Note: each fern has its own binarisers
• Categorical features
– Get a random subset of categories
– A given category is either in this subset or not
• Continuous features
– Get two random feature values from the training set
– Use their mean as the threshold
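The two kinds of binariser might look as follows. This is a sketch under stated assumptions, not the package's implementation: the object and method names are hypothetical, each category is put into the subset with probability 1/2, the two training values are drawn independently (so they may coincide), and "greater than the threshold" is an arbitrary choice of direction.

```scala
import scala.util.Random

object Binarisers {
  /** Categorical: true when the category belongs to a random subset. */
  def categorical(categories: Seq[String], rng: Random): String => Boolean = {
    // Assumption: every category joins the subset with probability 1/2
    val subset = categories.filter(_ => rng.nextBoolean()).toSet
    category => subset.contains(category)
  }

  /** Continuous: true when the value exceeds the mean of two random training values. */
  def continuous(trainingValues: IndexedSeq[Double], rng: Random): Double => Boolean = {
    val a = trainingValues(rng.nextInt(trainingValues.length))
    val b = trainingValues(rng.nextInt(trainingValues.length))
    val threshold = (a + b) / 2.0
    value => value > threshold
  }
}
```

Since each fern has its own binarisers, a fern would call these once per feature it uses and keep the returned functions.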
Binarisation — implementation
• Categorical features
– Trivial, as we have user-supplied category information
• Continuous features
– Assign every value a random float
– Reduce by taking the two values with the greatest floats assigned
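The random-float trick for continuous features can be sketched on a plain Scala collection. The names `RandomPair`, `mergeTop2` and `pickTwo` are hypothetical. The merge function is associative and commutative (ignoring ties between random keys), which is what makes it suitable for a distributed reduce.

```scala
import scala.util.Random

object RandomPair {
  type Tagged = (Double, Double) // (random key, feature value)

  /** Keeps the (at most) two entries with the greatest random keys. */
  def mergeTop2(a: List[Tagged], b: List[Tagged]): List[Tagged] =
    (a ++ b).sortBy(t => -t._1).take(2)

  /** Picks two values at random without replacement; `values` must be non-empty. */
  def pickTwo(values: Seq[Double], rng: Random): List[Double] =
    values
      .map(v => List((rng.nextDouble(), v))) // assign every value a random float
      .reduce(mergeTop2)                     // keep the two greatest keys
      .map(_._2)
}
```

The threshold is then the mean of the two returned values.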
Probabilities
$H(\mathbf{f}) \equiv \arg\max_k P(C_k) \prod_{l=1}^{L} P(F_l \mid C_k)$
What’s that?
$P(F_l \mid C_k)$
• $F_l$ is a combination of the binary feature values used by
fern $l$
• For a fern of height $S$ there are $2^S$ distinct values
of $F_l$
• You may think of a fern as mapping each object
into one of $2^S$ buckets
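One way to picture the bucket mapping is to pack the $S$ binary feature values of a fern into the bits of an integer. This is an illustrative sketch; `FernBucket` and `bucketIndex` are hypothetical names, and the bit order is an arbitrary choice.

```scala
object FernBucket {
  /** Packs S binary feature values into a bucket index in [0, 2^S). */
  def bucketIndex(bits: Seq[Boolean]): Int =
    bits.foldLeft(0)((acc, bit) => (acc << 1) | (if (bit) 1 else 0))
}
```

For example, the binary values (true, false, true) give the bit pattern 101, that is bucket 5 out of 8.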
$P(F_l \mid C_k)$
• Probability of an object of class $C_k$ falling into
bucket $F_l$
• Count of objects of class $C_k$ falling into bucket $F_l$,
divided by the count of objects of class $C_k$
$P(F_l \mid C_k) = \frac{|F_l \cap C_k|}{|C_k|}$
Reduction
• The most important part of training is
counting objects
• Sounds similar to… counting words!
• We have reduced classifier building to the
best-known big data problem
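The counting step for a single fern can be sketched on a local collection as follows; `FernCounts` and `conditionalProbabilities` are hypothetical names. On an RDD the same counting would typically be expressed with `map` and `reduceByKey`, as in the classic word count; that variant is left out to keep the sketch free of dependencies. Buckets never seen for a class simply have no entry here, and any smoothing the package may apply is not shown.

```scala
object FernCounts {
  /**
   * observations: (class label, bucket index) pairs for a single fern.
   * Returns P(bucket | class) = count(class, bucket) / count(class).
   */
  def conditionalProbabilities(observations: Seq[(Int, Int)]): Map[(Int, Int), Double] = {
    // "Word count" over (class, bucket) pairs
    val pairCounts = observations.groupBy(identity).map { case (key, hits) => key -> hits.size }
    // Number of objects in each class
    val classCounts = observations.groupBy(_._1).map { case (label, hits) => label -> hits.size }
    pairCounts.map { case ((label, bucket), n) =>
      (label, bucket) -> n.toDouble / classCounts(label)
    }
  }
}
```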
Memory
Q: How many probabilities do we need to compute?
A: About $2^S$ per fern
That means a binary classifier of 100 20-feature ferns
will weigh over 1.5 GB
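The figure can be reproduced with a short calculation, assuming one 8-byte double per class, fern and bucket (the storage format is an assumption; the slides only give the total). `FernMemory` and `modelBytes` are hypothetical names.

```scala
object FernMemory {
  /** Bytes needed to store one probability per (class, fern, bucket). */
  def modelBytes(numClasses: Int, numFerns: Int, fernHeight: Int, bytesPerValue: Int): Long =
    numClasses.toLong * numFerns * (1L << fernHeight) * bytesPerValue
}
```

With 2 classes, 100 ferns of height 20 and 8 bytes per value this gives 1,677,721,600 bytes, about 1.56 GiB. Each extra feature per fern doubles the size.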
THE EVALUATION
Accuracy et al.
• Evaluated on the Iris and Car datasets as integration tests
• Iris:
– 10 ferns, 3 features per fern (out of 4)
– Accuracy: 98%
• Car:
– 20 ferns, 4 features per fern (out of 6)
– Accuracy: 90%
Dataset
• Million Song Dataset – Year Prediction
– Not quite about classification, but big (0.5M items)
– Task: given 90 real-valued features, predict the
publication year (ranging from 1922 to 2011)
– For the sake of demonstration, let’s just pretend it is a
classification problem
Model Training Code
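// Load the raw text and parse every line into a labeled point
val raw = sc.textFile(…)
val lp = raw.map(parseIntoLabeledPoints(_))
// Split into training and test sets
val data = splitIntoTrainTest(lp)
val numFerns = 90
val numFeatures = 10
// Map.empty: presumably no categorical features info (all 90 features are real-valued)
val model = FernForest.train(data.train, numFerns, numFeatures, Map.empty)
// For every test item, check whether the predicted label matches the true one
val correct = data.test.map(lp => model.predict(lp.features) == lp.label)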
Model Training time
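$T(\lVert \mathbf{f} \rVert, \lVert D \rVert) = (10^{-5} + 4.2 \cdot 10^{-6} \cdot \lVert \mathbf{f} \rVert) \cdot \lVert D \rVert$
• Where:
– $\lVert \mathbf{f} \rVert$ is the number of features
– $\lVert D \rVert$ is the number of items in the dataset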
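As a sanity check, the formula can be evaluated directly. The sketch assumes the grouping $T = (10^{-5} + 4.2 \cdot 10^{-6} \cdot \lVert f \rVert) \cdot \lVert D \rVert$ and a result in minutes; both are inferred from the axis ranges of the training-time charts in this deck rather than stated on the slide. `TrainingTime` and `estimateMinutes` are hypothetical names.

```scala
object TrainingTime {
  /** Estimated training time in minutes under the assumed grouping of the formula. */
  def estimateMinutes(numFeatures: Int, numItems: Long): Double =
    (1e-5 + 4.2e-6 * numFeatures) * numItems
}
```

For 0.5M items this gives about 26 minutes with 10 features and about 47 minutes with 20, which fits the 25 to 50 minute range of the chart.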
Model Training Time
[Chart: estimated training time [min] (25 to 50) vs. number of features (10 to 20)]
• Training time is linear
– in the number of features (unlike Random Forests)
– in the number of samples
[Chart: training time [min] (0 to 12) vs. sample of the 0.5M-item dataset (0% to 60%)]
THE PACKAGE
Our toolbox
How can you help your users?
• Simplify discovery
– Register at spark-packages.org
• Simplify utilisation
– Publish artifacts to the Central Repository
spark-packages.org
• An index of packages for Apache Spark
• Spark Community keeps an eye on it
• Ideal place if you want to extend Spark
• You can register any GitHub-hosted Spark
project
The Central Repository
• Apache Maven retrieves all components from
the Central Repository by default
– so does Apache Spark
– and many other build systems
• Are your artifacts there yet?
Getting to the Central
Sonatype provides OSSRH
– free repository
– for open source software
– store snapshot artifacts
– promote releases to the Central Repository
Checklist:
1. Register[3] at Sonatype OSSRH
2. Generate GPG key (if you don’t have one yet)
3. Alter[4] your build.sbt
4. Build and sign your artefacts
5. Stage[5] release at OSSRH and promote to Central Repository
6. Voilà!
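Step 3 of the checklist can be sketched as the build.sbt fragment below. It follows the pattern from the sbt documentation cited as [4]; the organization, name and version are taken from the package coordinates given in this deck, and everything else is an assumption rather than the authors' actual build file.

```scala
// build.sbt (sketch; not the authors' actual build file)
organization := "pl.edu.icm"

name := "sparkling-ferns"

version := "0.2.0"

// Publish Maven-style POMs, as the Central Repository expects
publishMavenStyle := true

// Snapshots go to the OSSRH snapshot repository, releases to staging
publishTo := {
  val nexus = "https://oss.sonatype.org/"
  if (isSnapshot.value)
    Some("snapshots" at nexus + "content/repositories/snapshots")
  else
    Some("releases" at nexus + "service/local/staging/deploy/maven2")
}
```

Assuming the sbt-pgp plugin is installed, `sbt publishSigned` then builds, signs and uploads the artifacts (step 4), and the staged release is promoted as described in [5].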
Things are smooth now
$SPARK_HOME/bin/spark-shell \
--packages pl.edu.icm:sparkling-ferns_2.10:0.2.0
THANK YOU! QUESTIONS?
http://spark-packages.org/package/CeON/sparkling-ferns
@pjden
@mfedoryszak
/piotrdendek
/mfedoryszak
References
[1] "rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning", M. Kursa, DOI: 10.18637/jss.v061.i10
[2] "Proof that the Binomial Distribution tends to the Poisson Distribution", https://youtu.be/ceOwlHnVCqo
[3] "OSSRH Guide", Sonatype, http://central.sonatype.org/pages/ossrh-guide.html
[4] "Deploying to Sonatype", sbt, http://www.scala-sbt.org/release/docs/Using-Sonatype.html
[5] "Releasing the Deployment", Sonatype, http://central.sonatype.org/pages/releasing-the-deployment.html

More Related Content

What's hot

Very Small Tutorial on Terrier 3.0 Retrieval Toolkit
Very Small Tutorial on Terrier 3.0 Retrieval ToolkitVery Small Tutorial on Terrier 3.0 Retrieval Toolkit
Very Small Tutorial on Terrier 3.0 Retrieval ToolkitKavita Ganesan
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data ScienceAlbert Bifet
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data ScienceAlbert Bifet
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningSri Ambati
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlpankit_ppt
 
Anomaly Detection by ADGM / LVAE
Anomaly Detection by ADGM / LVAEAnomaly Detection by ADGM / LVAE
Anomaly Detection by ADGM / LVAEPreferred Networks
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksDatabricks
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsChris Johnson
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data ManagementAlbert Bifet
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkDatabricks
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learningKien Le
 
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...Balázs Hidasi
 
elephant56: Design and Implementation of a Parallel Genetic Algorithms Framew...
elephant56: Design and Implementation of a Parallel Genetic Algorithms Framew...elephant56: Design and Implementation of a Parallel Genetic Algorithms Framew...
elephant56: Design and Implementation of a Parallel Genetic Algorithms Framew...Pasquale Salza
 
Generating Sequences with Deep LSTMs & RNNS in julia
Generating Sequences with Deep LSTMs & RNNS in juliaGenerating Sequences with Deep LSTMs & RNNS in julia
Generating Sequences with Deep LSTMs & RNNS in juliaAndre Pemmelaar
 
Oscon data-2011-ted-dunning
Oscon data-2011-ted-dunningOscon data-2011-ted-dunning
Oscon data-2011-ted-dunningTed Dunning
 

What's hot (20)

Very Small Tutorial on Terrier 3.0 Retrieval Toolkit
Very Small Tutorial on Terrier 3.0 Retrieval ToolkitVery Small Tutorial on Terrier 3.0 Retrieval Toolkit
Very Small Tutorial on Terrier 3.0 Retrieval Toolkit
 
Introduction to Big Data Science
Introduction to Big Data ScienceIntroduction to Big Data Science
Introduction to Big Data Science
 
13. Queue
13. Queue13. Queue
13. Queue
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and Hadoop
 
Internet of Things Data Science
Internet of Things Data ScienceInternet of Things Data Science
Internet of Things Data Science
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
How to win data science competitions with Deep Learning
How to win data science competitions with Deep LearningHow to win data science competitions with Deep Learning
How to win data science competitions with Deep Learning
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
 
Anomaly Detection by ADGM / LVAE
Anomaly Detection by ADGM / LVAEAnomaly Detection by ADGM / LVAE
Anomaly Detection by ADGM / LVAE
 
Applying your Convolutional Neural Networks
Applying your Convolutional Neural NetworksApplying your Convolutional Neural Networks
Applying your Convolutional Neural Networks
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Scala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music RecommendationsScala Data Pipelines for Music Recommendations
Scala Data Pipelines for Music Recommendations
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 
Regularization in deep learning
Regularization in deep learningRegularization in deep learning
Regularization in deep learning
 
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
Parallel Recurrent Neural Network Architectures for Feature-rich Session-base...
 
elephant56: Design and Implementation of a Parallel Genetic Algorithms Framew...
elephant56: Design and Implementation of a Parallel Genetic Algorithms Framew...elephant56: Design and Implementation of a Parallel Genetic Algorithms Framew...
elephant56: Design and Implementation of a Parallel Genetic Algorithms Framew...
 
Generating Sequences with Deep LSTMs & RNNS in julia
Generating Sequences with Deep LSTMs & RNNS in juliaGenerating Sequences with Deep LSTMs & RNNS in julia
Generating Sequences with Deep LSTMs & RNNS in julia
 
Oscon data-2011-ted-dunning
Oscon data-2011-ted-dunningOscon data-2011-ted-dunning
Oscon data-2011-ted-dunning
 

Similar to Sparkling Random Ferns by P Dendek and M Fedoryszak

Data mining with Weka
Data mining with WekaData mining with Weka
Data mining with WekaAlbanLevy
 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learningRajasekhar364622
 
background.pptx
background.pptxbackground.pptx
background.pptxKabileshCm
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspectiveAnirban Santara
 
AI -learning and machine learning.pptx
AI  -learning and machine learning.pptxAI  -learning and machine learning.pptx
AI -learning and machine learning.pptxGaytriDhingra1
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptxssuserf07225
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Hadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopHadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopYahoo Developer Network
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning IntroductionDong Guo
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 
EssentialsOfMachineLearning.pdf
EssentialsOfMachineLearning.pdfEssentialsOfMachineLearning.pdf
EssentialsOfMachineLearning.pdfAnkita Tiwari
 
Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...
Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...
Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...zohebmusharraf
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Sparkelephantscale
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and HadoopDonald Miner
 
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016MLconf
 
Predict oscars (5:11)
Predict oscars (5:11)Predict oscars (5:11)
Predict oscars (5:11)Thinkful
 

Similar to Sparkling Random Ferns by P Dendek and M Fedoryszak (20)

Data mining with Weka
Data mining with WekaData mining with Weka
Data mining with Weka
 
Using Tree algorithms on machine learning
Using Tree algorithms on machine learningUsing Tree algorithms on machine learning
Using Tree algorithms on machine learning
 
background.pptx
background.pptxbackground.pptx
background.pptx
 
L4. Ensembles of Decision Trees
L4. Ensembles of Decision TreesL4. Ensembles of Decision Trees
L4. Ensembles of Decision Trees
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspective
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
05 k-means clustering
05 k-means clustering05 k-means clustering
05 k-means clustering
 
Machine Learning
Machine Learning Machine Learning
Machine Learning
 
AI -learning and machine learning.pptx
AI  -learning and machine learning.pptxAI  -learning and machine learning.pptx
AI -learning and machine learning.pptx
 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Hadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using HadoopHadoop Summit 2010 Machine Learning Using Hadoop
Hadoop Summit 2010 Machine Learning Using Hadoop
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 
EssentialsOfMachineLearning.pdf
EssentialsOfMachineLearning.pdfEssentialsOfMachineLearning.pdf
EssentialsOfMachineLearning.pdf
 
Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...
Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...
Unit-V.pptx DVD is a great way to get sbi and more jobs available review and ...
 
Machine Learning with Spark
Machine Learning with SparkMachine Learning with Spark
Machine Learning with Spark
 
Data science and Hadoop
Data science and HadoopData science and Hadoop
Data science and Hadoop
 
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016
 
Predict oscars (5:11)
Predict oscars (5:11)Predict oscars (5:11)
Predict oscars (5:11)
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 



Sparkling Random Ferns by P Dendek and M Fedoryszak

  • 1. Sparkling Random Ferns. From an academic paper to spark-packages.org Piotr Jan Dendek Mateusz Fedoryszak
  • 2. The Agenda 1. How did it start? 2. What is the Random Ferns algorithm? 3. How did implementation, evaluation and publishing go?
  • 3. Motivations • Random Ferns is a popular classification algorithm in the image processing field • Our colleague Miron Kursa implemented this algorithm as part of his research[1] and published it as an R package called rFerns • We decided to empower the Spark community with this method by making it available as a Spark package
  • 5. The Algorithm • Random Ferns – An example of supervised learning – Solves classification problems – A kind of ensemble algorithm
  • 6. Posterior Probability • Hypothetically we can learn conditional probabilities: $P(C = c_m \mid f_1, f_2, \dots, f_N)$ • Where the classifier $H$ is described as $H(\mathbf{f}) = \arg\max_m P(C = c_m \mid f_1, f_2, \dots, f_N)$ • Not suitable, not tractable, memory consuming
  • 7. Naïve Bayes Classifier $P(C_m \mid \mathbf{f}) \propto P(C_m) \times P(\mathbf{f} \mid C_m)$, so $P(C_m \mid \mathbf{f}) \propto P(C_m) \prod_{i=1}^{N} P(f_i \mid C_m)$ • Naïve as it misses dependencies among features • Often quite successful classifications
  • 8. Randomness in classifiers • Goals to reach: – Avoid overfitting – Build classifiers faster • Ways to introduce randomness: – Item sampling with replacement – Feature sampling
  • 9. Random Ferns • Each classifier $l \in [1; L]$ has its own set of features, indexed $[1; S]$: $F_l = \{f_{l,1}, f_{l,2}, \dots, f_{l,S}\}$ • Assume that classifiers are independent: $P(f_1, f_2, \dots, f_N \mid C_k) = \prod_{l=1}^{L} P(F_l \mid C_k)$ • Then classify items: $H(\mathbf{f}) \equiv \arg\max_k P(C_k) \prod_{l=1}^{L} P(F_l \mid C_k)$
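  • 10. Random Ferns • Less-naïve Bayes • From Random Forests perspective: [Figure: three ferns drawn as complete binary trees; every node on a given level tests the same feature (A, B, C in the first fern; A, B, D in the second; C, D, E in the third)]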
  • 13. Bagging [Figure: the initial set is resampled into Fern 1, Fern 2 and Fern 3] Per-item sample counts: Fern 1: 2 0 2 1 0; Fern 2: 1 1 0 1 2; Fern 3: 1 1 1 1 1
  • 14. Big data bagging • How many times would a data point be sampled? – Binomial distribution, $p = \frac{1}{n}$ • $P(x = k) = \binom{n}{k} \left(\frac{1}{n}\right)^{k} \left(1 - \frac{1}{n}\right)^{n-k}$ – As $n \to \infty$ (big data) the Binomial distribution tends to the Poisson distribution, $\lambda = np = 1$[2] • $P(x = k) = \frac{1}{e \cdot k!}$ • Simulate sampling using the Poisson distribution
  • 15. Binarisation Note: each fern has its own binarisers • Categorical features: – Get a random subset of categories – A given category either fits this set or not • Continuous features: – Get two random feature values from the training set – Use their mean as threshold
  • 16. Binarisation: implementation • Categorical features: – Trivial as we have user-supplied categories info • Continuous features: – Assign every value a random float – Reduce by taking the two values with the greatest floats assigned
  • 17. Probabilities $H(\mathbf{f}) \equiv \arg\max_k P(C_k) \prod_{l=1}^{L} P(F_l \mid C_k)$ What’s that?
  • 18. $P(F_l \mid C_k)$ • A combination of binary feature values used by fern $l$ • For a fern of height $S$ there are $2^S$ distinct values of $F_l$ • You may think of it as a fern mapping each object into one of $2^S$ buckets
  • 19. $P(F_l \mid C_k)$ • Probability of an object of class $C_k$ falling into bucket $F_l$ • Count of objects of class $C_k$ falling into bucket $F_l$ divided by count of objects of class $C_k$: $P(F_l \mid C_k) = \frac{|F_l \cap C_k|}{|C_k|}$
  • 20. Reduction • The most important training part is counting objects • Sounds similar to… counting words! • We have reduced classifier building to the best-known big data problem
  • 21. Memory Q: How many probabilities do we need to compute? A: About $2^S$ per fern. That means a binary classifier of 100 20-feature ferns will weigh over 1.5 GB
  • 23. Accuracy et al. • Evaluation on Iris and Car datasets as integration test • Iris: – 10 ferns, 3 features per fern (out of 4) – Accuracy: 98% • Car: – 20 ferns, 4 features per fern (out of 6) – Accuracy: 90%
  • 24. Dataset • Million Song Dataset – Year Prediction – Not quite about classification, but big (0.5M items) – Task: given 90 real-valued features, predict the publication year (ranging from 1922 to 2011) – For the sake of demonstration let’s just pretend it is a classification problem
  • 25. Model Training Code
      val raw = sc.textFile(…)
      val lp = raw.map(parseIntoLabeledPoints(_))
      val data = splitIntoTrainTest(lp)
      val numFerns = 90
      val numFeatures = 10
      val model = FernForest.train(data.train, numFerns, numFeatures, Map.empty)
      val correct = data.test.map(lp => model.predict(lp.features) == lp.label)
  • 26. Model Training Time $T(\|\mathbf{f}\|, \|D\|) = (10^{-5} + 4.2 \cdot 10^{-6} \cdot \|\mathbf{f}\|) \cdot \|D\|$ • Where: – $\|\mathbf{f}\|$ is the number of features – $\|D\|$ is the number of items in a dataset
  • 27. Model Training Time • Training time is linear – against number of features (unlike Random Forests) – against number of samples [Chart: estimated training time in minutes (25 to 50) against number of features (10 to 20)] [Chart: training time in minutes (0 to 12) against sample of the 0.5M-item dataset (0% to 60%)]
  • 30. How can you help your users? • Simplify discovery – Register at spark-packages.org • Simplify utilisation – Publish artifacts to the Central Repository
  • 31. spark-packages.org • An index of packages for Apache Spark • Spark Community keeps an eye on it • Ideal place if you want to extend Spark • You can register any GitHub-hosted Spark project
  • 32. The Central Repository • Apache Maven retrieves all components from the Central Repository by default – so does Apache Spark – and many other build systems • Are your artifacts there yet?
  • 33. Getting to the Central • Sonatype provides OSSRH: – free repository – for open source software – store snapshot artifacts – promote releases to the Central Repository • Checklist: 1. Register[3] at Sonatype OSSRH 2. Generate a GPG key (if you don’t have one yet) 3. Alter[4] your build.sbt 4. Build and sign your artifacts 5. Stage[5] the release at OSSRH and promote it to the Central Repository 6. Voilà!
  • 34. Things are smooth now $SPARK_HOME/bin/spark-shell --packages pl.edu.icm:sparkling-ferns_2.10:0.2.0
  • 36. References [1] "rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning", M. Kursa, DOI: 10.18637/jss.v061.i10 [2] "Proof that the Binomial Distribution tends to the Poisson Distribution", https://youtu.be/ceOwlHnVCqo [3] "OSSRH Guide", Sonatype, http://central.sonatype.org/pages/ossrh-guide.html [4] "Deploying to Sonatype", Sbt, http://www.scala-sbt.org/release/docs/Using-Sonatype.html [5] "Releasing the Deployment", Sonatype, http://central.sonatype.org/pages/releasing-the-deployment.html
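
The Poisson-based bagging from slides 13 and 14 can be sketched in plain Scala as follows. This is only an illustration under stated assumptions, not the package's actual code: the names `PoissonBagging`, `samplePoisson` and `bagWeights` are hypothetical, and Knuth's multiplication method is used here as a stand-in for whatever Poisson sampler the real implementation relies on.

```scala
import scala.util.Random

// Hypothetical sketch: simulate sampling with replacement by drawing,
// for every data point, how many times it is selected (Poisson, lambda = 1).
object PoissonBagging {

  // Knuth's multiplication method; adequate for small lambda such as 1.0.
  def samplePoisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // One row of per-item counts for each fern, as in the table on slide 13.
  def bagWeights(numItems: Int, numFerns: Int, seed: Long): Array[Array[Int]] = {
    val rng = new Random(seed)
    Array.fill(numFerns, numItems)(samplePoisson(1.0, rng))
  }
}
```

Because every count is drawn independently, each data point can be weighted on its own partition without ever materialising the resampled set, which is presumably why this view suits a distributed setting.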

Editor's Notes

  1. Good afternoon, Ladies and Gentlemen. My name is Piotr Dendek. Together with Mateusz Fedoryszak, we are going to present the implementation of Random Ferns for Apache Spark done at the Interdisciplinary Centre for Mathematical and Computational Modelling, part of the University of Warsaw.
  2. I am going to tell you how we got here (in terms of the Random Ferns implementation). First, what or who inspired us and what Random Ferns are. Next, Mateusz is going to describe the implementation part. Finally, we are going to share the evaluation results with you and describe how to publish your package on spark-packages.org. So, let’s start!
  3. Some of you, especially people interested in the image processing field, might have heard about Random Ferns as one of the state-of-the-art algorithms. One of our colleagues at ICM, Miron Kursa, used Random Ferns in his research. As a great fan of the R language, he implemented this algorithm and published it in the CRAN repository. That was quite a long time before Spark version 1.0. Seeing how successful Random Ferns can be, we decided to empower the Spark community with this classification algorithm. The best way to do so was to publish it via spark-packages.org.
  4. Now let me say a few words about the algorithm itself.
  5. Random Ferns is a method which uses supervised learning to classify or label new examples using knowledge about the training set. The plural form in the algorithm name indicates that during model creation many single classifiers will be created, and when classification occurs the results from each of them will be combined into one result.
  6. In an ideal world we could use the probabilistic approach with ease. We would know the ways in which features depend on each other and how the class depends on them. That is the joint probability. In the real world, we do not have that much information. We cannot observe all combinations of feature values. Yet we would really like to use the probabilistic approach, and in fact we do so in Random Forests, Random Ferns, you name it. And that is thanks to easing constraints on classification, especially...
  7. ... the move to Naïve Bayes, where we assume that all features are independent. This assumption is false, yet it has proved to be the second best thing to the pure truth. Thanks to this nice property of probabilistic independence, we only have to check how probable a class is given a single feature value. Then we multiply the per-feature probabilities for a given class and eventually yield the most probable class. So it is much easier in terms of RAM and computation to track probabilities and return the final result. Let me follow this ML 101 class for just a few more slides.
  8. So tracking joint probabilities of everything was the first no-no. The second no-no is called "overfitting". Because we want to avoid overfitting and we would like to create the model in parallel, a good idea is to sample items with replacement alongside feature sampling. This process can be executed for as many mini-classifiers as we want, with as many features as we want and as memory allows. In the presented example we have 3 subsets out of one. Each of them holds a sample of the original data. Also, each of them has the same number of features, but the features may differ across subsets. Now, using each subset, we can create a mini-classifier called a "fern".
  9. OK, we have L ferns, each of which uses only S features out of N. So we have fewer features and fewer items. Each fern classifies an item in its own way. For each item we have the probabilities of it being classified into each of the classes. Now, it may look fancy, but if we replace the big bold F with a small bold f and set the number of ferns equal to the number of features, we get the classical Naïve Bayes classifier. Yeah, looks familiar. So, let's obfuscate it again with the big bold F, the number S, etc., going back to Random Ferns.
  10. Thanks to training L classifiers, each of which depends on some subset of features, we obtain a less-naïve classification. We implicitly assume that some relations between data are represented by ferns. Now, what is going on under the hood of each fern? Let’s look at the tree representation of ferns. First of all, yes, all ferns are perfect binary trees. This is thanks to feature binarisation. Each feature is binarised against some threshold, returning 1 or 0. Each level of a fern contains a test against the same feature. The test returns the value 1 or 0. So going from the root to a leaf, we can collect bits, which can be cast to an integer number; call it the feature key. When we are at a leaf, we see the probabilities of each class. So a fern is a 2D array, where the x axis is the feature key and the y axis is a class index. In the cell with indices x and y we have a probability. It might be big, but we can train and use it fast.
  11. What is interesting now is how it can be constructed at scale. Mateusz, could you give us the details?
  12. Random Ferns are about training several small classifiers (ferns), each of which works on a subset of features. Bagging description. Simulation. Sampling from a big data set would be tough. Let’s look from a different perspective. Order doesn’t really matter.
  13. Instead of sampling individual elements, we can sample how many times a particular object was selected. Actually, there’s a probability distribution that perfectly models that process.
  14. The Binomial distribution, whose equation is on the slide, can be used in the sampling. The interesting thing is that as the number of elements we sample from grows to infinity, which is true in our case as we work with big data, the Binomial distribution tends to the Poisson distribution, whose probability mass function is much simpler. So we’ll simulate sampling with replacement using the Poisson distribution.
  15. Categorical features: eye colour, gender. Continuous features: income per annum, height. This may seem too naïve, but it actually works. Why? Some people state that the whole algorithm is crafted out of pure magic. A more rational explanation: each feature is used by several ferns and each of them will use its own binarisers. Together they discriminate between the various original feature values fairly well.
  16. Categorical: trivial to implement, as we assume that categorical feature info is user-supplied. So do the algorithms in MLlib.
  17. To proceed to the next topic we need to analyse some of the equations that Piotr has presented. Bear with me, you’re going to like the result. For a given object we assign the class which yields the greatest probability. The first factor is easy, so let’s focus on the second.
  18. The highlighted part is a combination. Applying binarisers means mapping each object into a bucket.
  19. When we recall the classical definition of probability, we’ll realise that the word "count" should ring a bell.
  20. Yes, you’re right, we have just reduced classifier training to word count. That gives a deeper meaning to this problem, studied since the emergence of the Hadoop era.
  21. Before we finish this part, just a word of warning… That’s a fair trade-off: you need more memory to model more complex relationships among features.
  22. It would be a shame to present an algorithm at a big data conference without any performance data. Piotr, can you give us some numbers? --- Random Ferns can be quite memory consuming. Because of this, the first evaluation of the package was done on …
  23. … the Iris and Car datasets. These datasets are in fact used in the integration tests. The accuracy values obtained on these datasets were at the expected level, meaning that the algorithm is implemented correctly. At that point we could calmly move to bigger datasets, to check how fast we can train a model depending on the number of features and the number of training samples.
  24. We used the Million Song Dataset to predict the year of publication of each song, ranging from 1922 to 2011, using 90 numerical features. This prediction would be better done with regression algorithms, we know it, but the point here is to use large-volume data.
  25. The API of Random Ferns is similar to the algorithms present in MLlib. You have to read the data, parse them into labeled points and pass them to the train method together with the other input parameters, i.e. the number of ferns and the number of features. The training method returns a model, which can predict the class of each observation. After training many models with different numbers of items, features and ferns ...
  26. ... we get an empirical estimate of the time needed to train a model. With the number of features fixed, training time depends linearly on the number of items. Conversely, with the number of training items fixed, training time depends linearly on the number of features. This estimate looks much better when you look at the charts.
  27. With the dataset of half a million items and 10 ferns, model training takes about 27 minutes using 10 features out of 90. Increasing the number of features to 20 results in about two times longer model creation. Now let’s fix the number of ferns and features to 10 and change the number of items used in model training. Training time is 3 minutes with 10% of the dataset and about 12 minutes with 50% of the items. To sum up this part, assuming we have enough memory, a model will be created in a quite reasonable and predictable time. Knowing this, let’s move to…
  28. … package publishing. --- As Piotr said, let us finish the presentation with a few words about the packaging and dissemination of our work.
  29. We used great tools. Some of them have been presented. We’d like to focus on two of them.
  30. If your artifacts aren’t there yet, they should be.
  31. There are a few guides explaining…
  32. Because of that, the only step needed to start working with Sparkling Ferns is issuing this command.
  33. We have presented the whole process by which Sparkling Ferns went from a research paper to deployment. We have revealed some details regarding their implementation and performance. Finally, we have given some advice regarding your packages. Now we’ll be happy to answer any questions you may have.
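
The bucket view of a fern described in notes 10 and 18 to 21 can be sketched in plain Scala (no Spark) as follows. This is a hedged illustration, not the package's implementation: `FernSketch`, `Threshold`, `bucket`, `train` and `modelBytes` are hypothetical names, the direction of the threshold comparison is an assumption, and the memory estimate assumes one 8-byte double per stored probability.

```scala
// Hypothetical sketch of a single fern over continuous features.
object FernSketch {

  // One binariser: compares a single feature against a threshold (assumed direction).
  final case class Threshold(featureIndex: Int, threshold: Double) {
    def bit(features: Array[Double]): Int =
      if (features(featureIndex) > threshold) 1 else 0
  }

  // Concatenate the S bits into a bucket index in [0, 2^S).
  def bucket(binarisers: Seq[Threshold], features: Array[Double]): Int =
    binarisers.foldLeft(0)((key, b) => (key << 1) | b.bit(features))

  // Word-count style training: count (bucket, label) pairs,
  // then divide by the number of objects in each class.
  def train(binarisers: Seq[Threshold],
            data: Seq[(Array[Double], Int)]): Map[(Int, Int), Double] = {
    val pairCounts = data
      .groupBy { case (f, label) => (bucket(binarisers, f), label) }
      .map { case (key, items) => key -> items.size }
    val classCounts = data.groupBy(_._2).map { case (label, items) => label -> items.size }
    pairCounts.map { case ((b, label), n) =>
      ((b, label), n.toDouble / classCounts(label))
    }
  }

  // Rough model size: ferns * 2^height * classes * bytes per probability.
  def modelBytes(numFerns: Long, fernHeight: Int, numClasses: Long, bytesPerValue: Long): Long =
    numFerns * (1L << fernHeight) * numClasses * bytesPerValue
}
```

Under these assumptions, `FernSketch.modelBytes(100, 20, 2, 8)` gives 1,677,721,600 bytes, which is in line with the "over 1.5 GB" figure quoted for a binary classifier of 100 ferns with 20 features each.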