Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
Data Science Company
SparkR
RBelgium
21/10/2015
Who am I
• Data Scientist at InfoFarm
www.infofarm.be
• PhD in Math
• Author of parallelML
https://cran.r-project.org/web/packages/parallelML
• Daily R user
• Spark enthusiast
Wannes.rosiers@infofarm.be @RosiersWannes
Overview
• Apache Spark
– A brief introduction
– R versus Scala (Java/Python)
• SparkR-1.4.0
– Getting started
– R integration
– Our own machine learning algorithms
• SparkR-1.5…
– What’s new?
– Spark MLlib
Apache Spark
“Apache Spark is a fast and general engine for big data
processing, with built-in modules for streaming, SQL,
machine learning and graph processing”
Fast, Scalable and Fault Tolerant
One ring to rule them all…
Being lazy…
• Transformations (map, filter, union, sort, …) are lazy
• Actions (count, collect, save, …) force the computation of pending transformations
… is a good thing!
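The split between lazy transformations and eager actions can be imitated in plain R. The sketch below is a toy model and not Spark itself: the names lazy_data, lazy_filter, lazy_map, lazy_collect and lazy_count are hypothetical helpers invented for this illustration. Transformations only record a step; the action runs all recorded steps at once.

```r
# Toy model of lazy evaluation in base R (not Spark itself):
# transformations only record work, an action runs the recorded steps.
lazy_data <- function(values) {
  list(values = values, steps = list())
}

# Transformation: append a filtering step to the plan, compute nothing yet.
lazy_filter <- function(ld, predicate) {
  ld$steps <- c(ld$steps, list(function(x) x[predicate(x)]))
  ld
}

# Transformation: append a (vectorised) mapping step to the plan.
lazy_map <- function(ld, f) {
  ld$steps <- c(ld$steps, list(function(x) f(x)))
  ld
}

# Action: run every recorded step in order and return a local result.
lazy_collect <- function(ld) {
  Reduce(function(acc, step) step(acc), ld$steps, ld$values)
}

# Action: count the elements that survive all recorded steps.
lazy_count <- function(ld) length(lazy_collect(ld))

plan <- lazy_map(
  lazy_filter(lazy_data(1:10), function(x) x %% 2 == 0),
  function(x) x * 10
)
length(plan$steps)   # two steps recorded, nothing computed so far
lazy_collect(plan)   # the action triggers the computation
```

Because nothing runs until an action is called, an engine built this way is free to inspect the whole plan first, which is the reason the slide calls laziness a good thing.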
R versus Scala
• Scala advantages
– Spark is natively written in Scala
– Spark's API extends Scala concepts to big data
• R disadvantages
– SparkR is still a work in progress
– R packages are not implemented for parallel processing
Yet SparkR is promising as an excellent big data analysis tool for R users
SparkR-1.4.0
Initializing SparkR
• Download Spark http://spark.apache.org/downloads
• Install
– Installation (R/install-dev.sh)
– Documentation (R/create-docs.sh)
• Run
Using SparkR
• sparkContext
• sqlContext
• Possibly hiveContext
• parquetFile
• jsonFile
• read.df
(via "com.databricks.spark.csv")
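As a rough sketch of how these pieces fit together in the SparkR 1.4 API: the function names start_sparkr and load_examples, the application name and all file paths below are hypothetical placeholders, and running the functions assumes a local Spark 1.4 installation with the spark-csv package supplied when Spark is launched. Later SparkR releases changed this API, so treat it as specific to the version discussed here.

```r
# Sketch against the SparkR 1.4 API; calling these functions requires a
# working Spark 1.4 installation. Defining them does not.
start_sparkr <- function(master = "local[*]") {
  # sparkContext: the connection to the Spark cluster
  sc <- SparkR::sparkR.init(master = master, appName = "sparkr-demo")
  # sqlContext: entry point for DataFrames
  # (SparkR::sparkRHive.init(sc) would give a hiveContext instead)
  sqlContext <- SparkR::sparkRSQL.init(sc)
  list(sc = sc, sqlContext = sqlContext)
}

load_examples <- function(sqlContext) {
  # All paths are hypothetical placeholders.
  list(
    parquet = SparkR::parquetFile(sqlContext, "data/events.parquet"),
    json    = SparkR::jsonFile(sqlContext, "data/events.json"),
    # CSV needs the external spark-csv package on the classpath.
    csv     = SparkR::read.df(sqlContext, "data/events.csv",
                              source = "com.databricks.spark.csv",
                              header = "true")
  )
}
```

A session would then start with ctx <- start_sparkr() followed by load_examples(ctx$sqlContext).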
Integrating native R code
• magrittr
• Local computations
• Within SparkR functions
collect: SparkR DataFrame to local R data.frame
createDataFrame: local R data.frame to SparkR DataFrame
Machine learning
• Spark MLlib machine learning algorithms were not yet available in SparkR
• R algorithms are not implemented in a distributed way
We implemented:
– Naive Bayes (classification)
– K-means (clustering)
– Association rules (recommendation)
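To give a feel for why an algorithm such as Naive Bayes lends itself to a distributed implementation, the base R sketch below trains a categorical Naive Bayes model from counts alone, and counting is an aggregation that a framework like Spark can spread over partitions. This is a hypothetical local illustration and not the implementation from the talk: train_nb, predict_nb, the alpha smoothing parameter and the toy weather data are all assumptions made for this example.

```r
# Categorical naive Bayes built from counts only (base R, local data).
# Hypothetical illustration; not the implementation from the talk.
train_nb <- function(features, label, alpha = 1) {
  label <- factor(label)
  prior <- table(label) / length(label)
  # One conditional probability table per feature, with Laplace smoothing.
  cond <- lapply(features, function(col) {
    counts <- table(label, factor(col)) + alpha
    counts / rowSums(counts)
  })
  list(prior = prior, cond = cond)
}

predict_nb <- function(model, newrow) {
  # Sum log-probabilities per class and pick the most likely class.
  score <- log(as.numeric(model$prior))
  for (name in names(model$cond)) {
    score <- score + log(model$cond[[name]][, as.character(newrow[[name]])])
  }
  names(model$prior)[which.max(score)]
}

# Toy training data (invented for this example).
weather <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "overcast"),
  windy   = c("yes", "no", "yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)
play <- c("no", "no", "no", "yes", "yes", "yes")

model <- train_nb(weather, play)
predict_nb(model, list(outlook = "overcast", windy = "no"))
```

The model consists only of small count tables, so in a distributed setting the heavy work is the grouping and counting, while the resulting tables are small enough to handle locally.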
Performance
• Naive Bayes
Set                     # observations       Action                   Time taken
Training + Calibration  5.890.434 + 654.325  Build model + Threshold  9min 6sec
Test                    725.479              Prediction               3min 40sec
• K-means
# observations  Total time  Time per iteration
7270238         3min 40sec  25sec (4 iterations)
• Association rules
Action           # observations  Time taken
Construct rules  1.048.575       < 30sec
Predict          1               Instantly
Lessons learned
• Nasty workarounds, e.g.
– Rounding: var - var %% 1
– Adding a constant column: cast(data[[1]]*0, 'integer')
– Calculating which column has the smallest value
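The arithmetic behind these workarounds can be checked on ordinary R vectors. The sketch below is a local base R illustration only: the variable names and sample values are made up, and the smallest-column trick is one possible way to express the idea with nothing but comparisons and arithmetic, not necessarily the workaround used in the talk. Negative values may behave differently between R and Spark, so they are left out here.

```r
# Base R illustration of the arithmetic behind the workarounds.
x <- c(0.2, 1.7, 3.0, 4.99)

# Dropping the fractional part without a rounding function
# (non-negative values only in this sketch).
whole <- x - x %% 1

# A constant column derived from an existing one:
# local analogue of cast(data[[1]] * 0, 'integer').
zeros <- as.integer(x * 0)

# Index of the column with the smallest value, using only comparisons
# and arithmetic (hypothetical sketch for three columns a, b, c3).
a  <- c(1, 5, 9)
b  <- c(2, 4, 9)
c3 <- c(3, 6, 0)
min_ab  <- a + (b < a) * (b - a)   # elementwise minimum of a and b
idx_ab  <- 1 + (b < a)             # 1 where a is smaller or equal, else 2
idx_min <- idx_ab + (c3 < min_ab) * (3 - idx_ab)
```

With ties the earliest column wins, which matches what which.min returns row by row on the same data.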
Lessons learned
• No notion of row indexes
Solvable via HiveQL
• Possible loss of ordering
Solvable by keeping an order on a certain column
• Not all Spark functions are exported yet (map, flatMap, lapply, …)
Solvable by altering the source code to export them
• Slow computations due to framework overhead
Setting numPartitions might at least help
Lessons learned
• Caching does not support all types:
lapply(nb[["model"]], function(mod){
  cache(mod)
  count(mod)  # count is an action, so it forces the cached data to materialize
})
• Sometimes it is necessary to collect intermediate results:
local_model <- collect(model)
for (i in 0:n) {
  # append a row (i, -1) for every category i missing from the model
  if (!(i %in% local_model$category))
    local_model <- rbind(local_model, c(i, -1))
}
When using native R code, this will always be the case
SparkR-1.5…
What’s new?
• Time classes (adding/subtracting times)
• More math functions (e.g. atan, rand)
• More text functions (e.g. concat, locate)
• More R functions (e.g. dim, ifelse)
• Create contingency table (crosstab)
• First machine learning algorithm (glm)
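For readers who know base R, the last two additions have familiar local counterparts. The sketch below runs entirely in base R on the built-in iris data; it is an analogue and not SparkR code. In SparkR 1.5 the glm call takes the same formula style on a distributed DataFrame, where the iris columns would be named with underscores (for example Sepal_Length), and crosstab plays the role that table plays here.

```r
# Local base R analogue of the SparkR 1.5 additions (not SparkR code).
# Gaussian GLM with the formula interface.
fit <- glm(Sepal.Length ~ Sepal.Width + Species, data = iris,
           family = gaussian())
coef(fit)

# Contingency table, the local counterpart of crosstab.
wide_sepal <- iris$Sepal.Width > 3
counts <- table(iris$Species, wide_sepal)
counts
```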
Algorithms provided by Spark
● Classification and regression
○ Linear models (SVMs, logistic regression, linear regression)
○ Naive Bayes
○ Decision trees
○ Ensembles of trees (Random Forests and Gradient-Boosted trees)
○ Isotonic regression
● Collaborative filtering
○ Alternating least squares (ALS)
● Frequent pattern mining
○ FP-growth
○ Association rules
○ PrefixSpan
● Feature extraction and transformation
● Clustering
○ K-Means
○ Gaussian mixture
○ Power Iteration clustering
○ Latent Dirichlet allocation
○ Streaming k-means
● Dimensionality reduction
○ Singular value decomposition (SVD)
○ Principal component analysis (PCA)
Questions

 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 

First impressions of SparkR: our own machine learning algorithm

  • 1. Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be Data Science Company SparkR RBelgium 21/10/2015
  • 2. Who am I
      • Data Scientist at InfoFarm (www.infofarm.be)
      • PhD in Math
      • Author of parallelML (https://cran.r-project.org/web/packages/parallelML)
      • Daily R user
      • Spark enthusiast
      Wannes.rosiers@infofarm.be @RosiersWannes
  • 3. Overview
      • Apache Spark
          – A brief introduction
          – R versus Scala (Java/Python)
      • SparkR-1.4.0
          – Getting started
          – R integration
          – Our own machine learning algorithms
      • SparkR-1.5…
          – What’s new?
          – Spark MLlib
  • 4. Apache Spark
  • 5. “Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing”
  • 6. Fast, Scalable and Fault Tolerant
  • 7. One ring to rule them all…
  • 8. Being lazy…
      • Transformations (map, filter, union, sort, …) are lazy
      • Actions (count, collect, save, …) force the computation of transformations
      … is a good thing!
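The distinction between lazy transformations and actions can be illustrated with a toy example in base R. This is only a sketch of the idea, not Spark code: `lazy_source`, `lazy_map`, `lazy_filter` and `lazy_collect` are hypothetical helpers invented for this illustration. Transformations merely record a step; the action runs all recorded steps.

```r
# Toy illustration only (base R, no Spark): transformations are recorded,
# and nothing is computed until an action is called.
lazy_source <- function(x) list(data = x, steps = list())

# Hypothetical transformations: they only append a step to the plan.
lazy_map <- function(ds, f) {
  ds$steps <- c(ds$steps, list(function(v) vapply(v, f, numeric(1))))
  ds
}
lazy_filter <- function(ds, pred) {
  ds$steps <- c(ds$steps, list(function(v) v[vapply(v, pred, logical(1))]))
  ds
}

# Hypothetical action: runs every recorded step, in order.
lazy_collect <- function(ds) {
  Reduce(function(v, step) step(v), ds$steps, ds$data)
}

# Count how often the mapped function is really evaluated.
evaluated <- 0
square <- function(v) {
  evaluated <<- evaluated + 1
  v^2
}

plan <- lazy_filter(lazy_map(lazy_source(1:10), square), function(v) v > 20)
calls_before_action <- evaluated   # nothing has run yet
result <- lazy_collect(plan)       # the action forces the computation
calls_after_action <- evaluated    # square() has now run once per element
```

Building the plan costs nothing; only the final action triggers work, which is what gives an engine room to optimize the whole chain before running it.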
  • 9. R versus Scala
      • Scala advantages
          – Spark is natively written in Scala
          – Big Data extension of Scala concepts
      • R disadvantages
          – Work in progress
          – R packages not implemented for parallel processing
      Yet promising as an excellent Big Data analysis tool for R users
  • 10. SparkR-1.4.0
  • 11. Initializing SparkR
      • Download Spark: http://spark.apache.org/downloads
      • Install
          – Installation (R/install-dev.sh)
          – Documentation (R/create-docs.sh)
      • Run
  • 12. Using SparkR
      • sparkContext
      • sqlContext
      • Possibly hiveContext
      • parquetFile
      • jsonFile
      • read.df (via "com.databricks.spark.csv")
  • 14. Integrating native R code
      • Magrittr
      • Local computations
      • Within SparkR functions
      collect, createDataFrame
  • 15. Machine learning
      • Spark MLlib machine learning algorithms were not available yet
      • R algorithms are not implemented in a distributed way
      We implemented
          – Naive Bayes (classification)
          – K-means (clustering)
          – Association rules (recommendation)
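The transcript does not show the implementations themselves. As a rough, local and non-distributed sketch of what each k-means iteration has to compute (assign every point to its nearest center, then move each center to the mean of its points), the following base R code may help. `simple_kmeans` and the sample data are hypothetical and are not the SparkR implementation from the talk.

```r
# Local, non-distributed sketch of Lloyd's k-means in base R.
# simple_kmeans() is a hypothetical helper, not part of SparkR.
simple_kmeans <- function(points, centers, iterations = 4) {
  for (it in seq_len(iterations)) {
    # Squared distance from every point (rows) to every center (columns).
    d <- sapply(seq_len(nrow(centers)), function(k) {
      rowSums(sweep(points, 2, centers[k, ])^2)
    })
    # Assignment step: index of the column with the smallest distance.
    cluster <- max.col(-d, ties.method = "first")
    # Update step: each center becomes the mean of its assigned points.
    for (k in seq_len(nrow(centers))) {
      if (any(cluster == k)) {
        centers[k, ] <- colMeans(points[cluster == k, , drop = FALSE])
      }
    }
  }
  list(centers = centers, cluster = cluster)
}

# Made-up sample data: two well-separated groups of three points.
points <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
fit <- simple_kmeans(points, centers = points[c(1, 4), ], iterations = 4)
```

In a distributed setting the assignment step runs per partition and only the per-cluster sums and counts need to be combined, which is why the algorithm suits a framework such as Spark.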
  • 17. Performance
      • Naive Bayes
          Set                      # observations        Action                    Time taken
          Training + Calibration   5.890.434 + 654.325   Build model + Threshold   9min 6sec
          Test                     725.479               Prediction                3min 40sec
      • K-means
          # observations   Total time   Time per iteration
          7.270.238        3min 40sec   25sec (4 iterations)
      • Association rules
          Action            # observations   Time taken
          Construct rules   1.048.575        < 30sec
          Predict           1                Instantly
  • 18. Lessons learned
      • Nasty workarounds, e.g.
          – Rounding: var - var %% 1
          – Adding constant column: cast(data[[1]]*0, 'integer')
          – Calculating which column has the smallest value:
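The first two workarounds can be checked locally in base R. The sketch below uses plain numeric vectors and a local data.frame, whereas the slide applies the same arithmetic to SparkR columns. `round_down` and the sample values are assumptions made for illustration. In base R, `v - v %% 1` behaves like `floor()`, so it rounds down rather than to the nearest integer.

```r
# Base R check of the rounding workaround (plain numeric vectors here,
# whereas the slide applies the same arithmetic to SparkR columns).
round_down <- function(v) v - v %% 1   # hypothetical helper name

x <- c(0.2, 1.5, 2.9, 7, -2.5)
rounded <- round_down(x)

# Local analogue of the constant-column trick: multiply an existing
# numeric column by 0 and convert the result to integer.
data <- data.frame(a = c(3.2, 4.8, 5.1))
data$constant <- as.integer(data[[1]] * 0)
```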
  • 19. Lessons learned
      • No notion of row indexes
        Solvable via HiveQL
      • Possible loss of order
        Solvable by keeping an order on a certain column
      • Not all Spark code available yet (map, flatmap, lapply, …)
        Solvable by altering source code to export them
      • Slow computations due to framework
        At least numPartitions might help you
  • 20. Lessons learned
      • Caching does not support all types:
          lapply(nb[["model"]], function(mod){
            cache(mod)
            count(mod)
          })
      • Sometimes necessary to collect intermediate results
          local_model <- collect(model)
          for (i in 0:n) {
            if (!i %in% local_model$category)
              local_model <- rbind(local_model, c(i, -1))
          }
        When using R code, this will always be the case
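The loop over the collected model can be tried without Spark by substituting a plain data.frame for the result of `collect(model)`. The column names, values and `n` below are made up for illustration, and the final sort is an addition that only makes the result easier to read.

```r
# Stand-in for the result of collect(model): a plain local data.frame.
# Column names and values are made up for illustration.
local_model <- data.frame(category = c(0, 2, 3), value = c(0.4, 0.1, 0.5))
n <- 4

# Append a row (i, -1) for every category in 0..n missing from the model.
for (i in 0:n) {
  if (!i %in% local_model$category) {
    local_model <- rbind(local_model, c(i, -1))
  }
}

# Added for readability only: order the rows by category.
local_model <- local_model[order(local_model$category), ]
```

After the loop every category from 0 to n has a row, with -1 marking the categories that were absent from the collected model.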
  • 21. SparkR-1.5…
  • 22. What’s new?
      • Time classes (adding/subtracting times)
      • More math functions (e.g. atan, rand)
      • More text functions (e.g. concat, locate)
      • More R functions (e.g. dim, ifelse)
      • Create contingency table (crosstab)
      • First machine learning algorithm (glm)
  • 23. Algorithms provided by Spark
      ● Classification and regression
          ○ Linear models (SVMs, logistic regression, linear regression)
          ○ Naive Bayes
          ○ Decision trees
          ○ Ensembles of trees (Random Forests and Gradient-Boosted trees)
          ○ Isotonic regression
      ● Collaborative filtering
          ○ Alternating least squares (ALS)
      ● Frequent pattern mining
          ○ FP-growth
          ○ Association rules
          ○ PrefixSpan
      ● Feature extraction and transformation
      ● Clustering
          ○ K-Means
          ○ Gaussian mixture
          ○ Power Iteration clustering
          ○ Latent Dirichlet allocation
          ○ Streaming k-means
      ● Dimensionality reduction
          ○ Singular value decomposition (SVD)
          ○ Principal component analysis (PCA)
  • 24. Questions