Meetup Deep Learning Italia, Rome, 17/12/2018
Apache Spark for Machine Learning: An Introduction and a Case Study
(original title: Apache Spark per il Machine Learning: Introduzione ed un caso di studio)
Speaker: Valerio Morfino
VALERIO MORFINO
Head of Big Data & Analytics at DbServices srl
Valerio Morfino has worked in computing and the Internet since 2000.
He holds a degree in Computer Engineering and, over the course of his career, has worked for consulting firms, universities, and large and medium-sized companies, in consulting, training, research, and project management roles.
He is the author of scientific papers and a conference speaker on topics related to the web, e-commerce, machine learning, and bioinformatics.
Presentation Objectives
 Gain a basic understanding of Apache Spark and its parallel model
 Understand how to approach a bioinformatics problem with a supervised machine learning approach
 Use Python and Apache Spark for the implementation
Meetup Deep Learning Italia, Roma, 17/12/2018 Apache Spark per il Machine Learning: Introduzione ed un caso di
Summary
 Apache Spark
 Spark parallel programming
model
 Case Study Introduction
 Hands on!
 Conclusions
Apache Spark
 Apache Spark is a distributed, cluster-based, general-purpose engine for big data processing
 It has become one of the key distributed frameworks for big data processing
 Spark is open source
 Spark is fully integrated with the Hadoop ecosystem
 It can run locally or in the cloud, where it is offered by the major providers (e.g. AWS, Google, Databricks, …)
 Spark can run on clusters of hundreds or even thousands of nodes
Apache Spark
 High-level APIs available in Java, Scala, Python and R
 The MLlib library provides many efficient parallel implementations of machine learning algorithms
Spark Cluster configurations
 Several cluster configurations:
   Standalone
   Hadoop YARN
   Mesos
   Kubernetes
Apache Spark is Resilient!
 Hardware can fail!
 Spark is resilient thanks to:
   Lineage
   Use of distributed file systems such as HDFS
 Is this important for my application?
   In the case of big datasets
   In the case of long training (processing) times
Apache Spark is FAST!
 Spark is very fast!
   Up to 100x faster than Hadoop MapReduce
   In-memory computing
   Lazy evaluation
MapReduce?
OK, but…
What is MapReduce?
MapReduce Paradigm
 Map jobs read a block of data and produce key-value pairs
 Reduce jobs receive the key-value pairs from multiple map jobs, sorted by key, and produce the output
 Key concept: distribute the data and process it where it is!
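The paradigm can be illustrated without a cluster. What follows is a minimal single-machine sketch in plain Python, not Spark or Hadoop code: the block contents and the helper names `map_block`, `shuffle` and `reduce_counts` are assumptions chosen for illustration. Each map call emits (nucleotide, 1) pairs for one block, the pairs are grouped and sorted by key, and each reduce call sums the values of one key.

```python
from collections import defaultdict

# Hypothetical input: one DNA fragment per data block (illustrative values).
blocks = ["AGTG", "TCCA", "GTCA"]

def map_block(block):
    """Map step: read one block and emit (key, value) pairs."""
    return [(nucleotide, 1) for nucleotide in block]

def shuffle(mapped_outputs):
    """Group the pairs produced by all map calls by key, sorted by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())

def reduce_counts(key, values):
    """Reduce step: combine all the values that share a key."""
    return key, sum(values)

# In a real cluster each map call would run on the node that holds the block.
mapped = [map_block(block) for block in blocks]
counts = dict(reduce_counts(key, values) for key, values in shuffle(mapped))
print(counts)
```

In a real cluster the map calls run in parallel on different nodes; here they run one after the other, which keeps the data flow visible.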
RDDs to store large datasets
 Resilient: fault-tolerant thanks to the RDD lineage graph, which makes it possible to recompute missing or damaged partitions
 Distributed: the data resides on multiple nodes in a cluster
 Dataset: a collection of partitioned data kept in memory as far as possible (otherwise on disk)
MAP example using Spark
 Two datasets are joined
 The computation uses a UDF (at a lower level, Spark computes a map)
 Lazy evaluation: maps are transformations, computed only when an action is called (e.g. an output request or a reduce)
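Lazy evaluation can be observed in plain Python too, because the built-in `map` returns a lazy iterator. The sketch below is only an analogy for Spark's transformations and actions, not Spark code; the function `gc_flag` and the sample sequence are assumptions made for illustration.

```python
evaluated = []  # records which elements were actually processed

def gc_flag(nucleotide):
    """Hypothetical user-defined function: 1 for G or C, 0 otherwise."""
    evaluated.append(nucleotide)
    return 1 if nucleotide in "GC" else 0

sequence = "AGTGTCCA"

# "Transformation": building the map object does not call gc_flag yet.
flags = map(gc_flag, sequence)
calls_before_action = len(evaluated)

# "Action": consuming the result forces the computation.
gc_count = sum(flags)
calls_after_action = len(evaluated)

print(calls_before_action, calls_after_action, gc_count)
```

The function is called only once the sum is requested, which mirrors how a map transformation waits for an action.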
Reduce using Spark
 Reduce operations are also largely computed in parallel
 The level of parallelism is related to the number of partitions and the number of worker nodes in the cluster
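A rough way to picture a parallel reduce: each partition is reduced locally, then the partial results are combined. The following is a plain Python sketch under stated assumptions (a thread pool stands in for the worker nodes, and `split_into_partitions` and `reduce_partition` are hypothetical helpers), not a description of Spark's internal scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add

def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

def reduce_partition(partition):
    """Reduce one partition locally; 0 is the neutral element for addition."""
    return reduce(add, partition, 0)

data = list(range(1, 11))  # illustrative values 1..10
partitions = split_into_partitions(data, num_partitions=3)

# One task per partition; the number of workers caps how many run at once.
with ThreadPoolExecutor(max_workers=2) as workers:
    partial_sums = list(workers.map(reduce_partition, partitions))

total = reduce(add, partial_sums, 0)  # final combine of the partial results
print(partitions, partial_sums, total)
```

With three partitions and two workers, at most two partial reductions run at the same time, which is the sense in which both numbers bound the parallelism.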
Spark SQL, DataFrames and Datasets
 Spark SQL is a Spark module for structured data processing.
 A Dataset is a distributed collection of data. The Dataset API is supported only in Java and Scala.
 A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood.
 Datasets and DataFrames are internally represented as RDDs, but are executed with additional optimizations!
MLlib - Spark’s machine learning library
 ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
 Featurization: feature extraction, transformation, dimensionality reduction, and selection
 Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
 Persistence: saving and loading algorithms, models, and Pipelines
 Utilities: linear algebra, statistics, data handling, etc.
 Text manipulation: tokenization, common-word (stop-word) removal, word combinations, Word2Vec
Note: As of Spark 2.0, the DataFrame-based API (package spark.ml) is the primary API. The RDD-based API (package spark.mllib) is now in maintenance mode.
CASE STUDY
 We deal with the splicing site prediction problem in DNA sequences, an important bioinformatics problem
 Useful for:
   Biological research (identification of intron-exon boundaries)
   Medical research (understanding human variation in splicing and its effect on human diseases)
   Personalized medicine
CASE STUDY
Biological Background
 DNA is a linear molecule composed of four
small molecules called nucleotide bases:
adenine (A), cytosine (C), guanine (G), and
thymine (T).
 Segments of DNA that carry genetic
information are called genes.
 The genes in DNA encode protein molecules
according to the flow known as “The Central
Dogma”: DNA → mRNA → Protein.
Biological Background II
 Most eukaryotic genes have their coding sequences (exons) interrupted by non-coding sequences (introns).
 The boundaries between exon and intron (EI, or donor) and between intron and exon (IE, or acceptor) are called “splicing sites”. During the splicing process, introns are removed.
 The DNA splicing site prediction problem deals with identifying those regions.
Splicing site problem in ML terms
 Given a sequence of DNA (e.g. 60 nucleotides):
  AGTGTCCAGTCATG…GT…GAACGTAAGTAAGA
 We wish to classify each sequence as:
   Containing a splicing site in the middle
   Not containing a splicing site in the middle
 Encoding: binary one-hot encoding (one binary value per possible nucleotide)
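One-hot encoding of a DNA sequence can be sketched in a few lines of plain Python. This is an illustration rather than the code used in the talk: the function name `one_hot` and the nucleotide order A, C, G, T are assumptions.

```python
# Fixed nucleotide order; the order itself is an assumption of this sketch.
NUCLEOTIDES = "ACGT"

def one_hot(sequence):
    """Return a flat list with four binary values per nucleotide."""
    encoded = []
    for nucleotide in sequence:
        bits = [0, 0, 0, 0]
        bits[NUCLEOTIDES.index(nucleotide)] = 1
        encoded.extend(bits)
    return encoded

print(one_hot("AGT"))
```

A sequence of 60 nucleotides therefore becomes a vector of 240 binary features, with exactly one value set to 1 in every group of four.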
Ready to code?
Supervised Machine learning recipe
 Ingredients:
   A labelled set of data. In this specific case, four files: pos_training, neg_training, pos_test, neg_test
   A learning algorithm (e.g. Decision Tree, SVM, Random Forest, Multilayer Perceptron, …)
 Preparation:
  1. Load the dataset and assign a label
     AGTGTCCAGTCATG…GT…GAACGTAAGTAAGA,1
  2. Encode features (Vector Indexer or One-Hot Encoder)
     String Indexer: 0,2,2,0,2,2,0,1,2,0,1,…,1,0,…,2,2,1,3,3,1,0,3,0,2,1,2,0,3,1
     One-Hot: 0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1
Note: The last field is the class: 1 -> splicing site; 0 -> no splicing site
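Steps 1 and 2 can be sketched in plain Python as follows. The short sequences, the helper `label_and_index` and the alphabetical index mapping are all assumptions made for illustration; Spark's own indexers may assign indices in a different order (for example by frequency), and real input would come from the four files.

```python
# Hypothetical in-memory stand-ins for the positive and negative files.
positive_sequences = ["AGGT", "CGGT"]
negative_sequences = ["ACCA"]

# Assumed index mapping (alphabetical); Spark's indexers may order differently.
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def label_and_index(sequences, label):
    """Turn each sequence into a list of indices with the class appended last."""
    return [[INDEX[n] for n in seq] + [label] for seq in sequences]

rows = label_and_index(positive_sequences, 1) + label_and_index(negative_sequences, 0)
for row in rows:
    print(",".join(str(value) for value in row))
```

As in the note on the slide, the last field of every row is the class.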
Supervised Machine learning cookbook
  3. Split the input dataset into:
      Training set (about 70-80%)
      Test set (about 20-30%)
  4. Assemble the features into a vector
     0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1
     becomes (features, label):
     [0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0],1
  5. Train a model
  6. Test the model on the test set (tune and refine…)
  7. Ready to classify new unlabelled data!
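Steps 3 and 4 can be sketched without Spark. The helpers `train_test_split` and `to_features_and_label`, the 70% fraction, the seed and the toy rows are assumptions for illustration only.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and cut it at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def to_features_and_label(row):
    """All fields but the last form the feature vector; the last is the label."""
    return row[:-1], row[-1]

# Hypothetical encoded rows: four feature values followed by the class.
rows = [[i % 2, 0, 1, 0, i % 2] for i in range(10)]

train, test = train_test_split(rows)
features, label = to_features_and_label(train[0])
print(len(train), len(test), features, label)
```

Fixing the seed keeps the split reproducible, so that different algorithms are compared on the same training and test rows.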
Let’s code!
CONCLUSIONS
Experiment Description
 Implementation steps:
 Data loading
 Data preparation (encoding)
 Data Splitting (training/test)
 Training
 Test
 Result Evaluation
Nucleotide encoding (sparse matrix):
Nucleotide   Encoded value
A            {1,0,0,0}
C            {0,1,0,0}
G            {0,0,1,0}
T            {0,0,0,1}
 Splicing Site Prediction is a Supervised
Machine Learning Binary Classification problem
Dataset and experimental environment
Datasets used:
Dataset   #Nucleotides   Training instances (pos./neg.)   Test instances (pos./neg.)   Total samples
IPDATA    60             464/1536                         302/884                      3186
HS3D_1    140            1960/2942                        836/1307                     7045
HS3D_2    140            1960/12571                       836/5431                     20768
 Execution environment:
   Databricks Cloud Cluster
     1 core
     6 GB RAM
 Software configuration:
   Spark 2.2.1, Scala 2.11
   Jupyter 4.4.0
   Python 3.5.2
Experiment Description
 Algorithms used:
   Logistic Regression
   Decision Tree
   Random Forest
   Linear Support Vector Machine
   Naïve Bayes
   Multilayer Perceptron
 Default parameters are used where possible
   Exception: Random Forest, with the number of trees set to 10
Experiment results: Classification performance
Dataset   Algorithm   Accuracy   Error rate   MCC
IPDATA    LR          0.948      0.052        0.865
IPDATA    DT          0.970      0.030        0.923
IPDATA    RF          0.965      0.035        0.906
IPDATA    SVM         0.960      0.040        0.894
IPDATA    BAYES       0.966      0.034        0.911
IPDATA    MLPERC      0.966      0.034        0.912
HS3D_1    LR          0.927      0.073        0.847
HS3D_1    DT          0.921      0.079        0.835
HS3D_1    RF          0.933      0.067        0.859
HS3D_1    SVM         0.935      0.065        0.864
HS3D_1    BAYES       0.861      0.139        0.706
HS3D_1    MLPERC      0.923      0.077        0.838
HS3D_2    LR          0.947      0.053        0.765
HS3D_2    DT          0.939      0.061        0.734
HS3D_2    RF          0.908      0.092        0.525
HS3D_2    SVM         0.949      0.051        0.776
HS3D_2    BAYES       0.902      0.098        0.614
HS3D_2    MLPERC      0.945      0.055        0.763
 The best performer is DT on the IPDATA dataset
   Accuracy: 0.970 (97%)
   Error rate: 0.030
   MCC (Matthews correlation coefficient): 0.923
Experiment results: Training Time
Dataset   Algorithm   Databricks 1-core   Local cluster 3-core
IPDATA    LR          2.23                0.80
IPDATA    DT          1.48                0.66
IPDATA    RF          13.82               4.14
IPDATA    SVM         13.95               4.45
IPDATA    BAYES       0.75                0.16
IPDATA    MLPERC      49.39               9.87
HS3D_1    LR          6.68                1.56
HS3D_1    DT          3.83                1.37
HS3D_1    RF          43.20               14.15
HS3D_1    SVM         26.42               6.27
HS3D_1    BAYES       2.04                0.16
HS3D_1    MLPERC      91.73               44.31
HS3D_2    LR          6.20                1.53
HS3D_2    DT          5.32                2.51
HS3D_2    RF          67.02               25.40
HS3D_2    SVM         26.63               7.83
HS3D_2    BAYES       2.03                0.17
HS3D_2    MLPERC      157.37              156.76
 Good scalability can be observed: training is faster on the 3-core local cluster in every case, although for MLPERC on HS3D_2 the gain is negligible (157.37 vs 156.76)
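The scalability claim can be checked by computing the ratio between the two training times. The sketch below copies three rows from the training-time table above; the helper name `speedup` is an assumption. The two environments differ in more than the core count, so the ratio is only indicative.

```python
# (dataset, algorithm) -> (Databricks 1-core time, local 3-core time),
# copied from three rows of the training-time table.
times = {
    ("IPDATA", "LR"): (2.23, 0.80),
    ("HS3D_1", "RF"): (43.20, 14.15),
    ("HS3D_2", "MLPERC"): (157.37, 156.76),
}

def speedup(one_core, three_core):
    """Ratio of the 1-core time to the 3-core time (higher is better)."""
    return one_core / three_core

for (dataset, algorithm), (t1, t3) in times.items():
    print(f"{dataset} {algorithm}: {speedup(t1, t3):.2f}x")
```

A ratio close to 3 matches the increase in cores, while a ratio close to 1 means the extra cores brought almost no benefit for that run.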
THANK YOU!
valerio.morfino@dbservices.it
https://it.linkedin.com/in/valerio-morfino
Multilayer Perceptron Classifier
 The multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network.
 MLPC consists of multiple layers of nodes.
 Each layer is fully connected to the next layer in the network.
 Nodes in the input layer represent the input data.
 All other nodes map inputs to outputs by a linear combination of the inputs with the node’s weights $w$ and bias $b$, followed by an activation function.
 Nodes in intermediate layers use the sigmoid (logistic) function:
  $f(z_i) = \frac{1}{1 + e^{-z_i}}$
 Nodes in the output layer use the softmax function:
  $f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$
 The number of nodes $N$ in the output layer corresponds to the number of classes.
 MLPC employs backpropagation for learning the model.
 The logistic loss function is used for optimization, with L-BFGS as the optimization routine.
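The two activation functions are easy to write down in plain Python. This is a sketch for illustration, not the implementation inside Spark; the max-subtraction in `softmax` is a common numerical-stability trick added here and is not stated on the slide.

```python
import math

def sigmoid(z):
    """Logistic function used by the intermediate layers."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax over the output-layer values; subtracting max(zs) avoids overflow."""
    shift = max(zs)
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))
probabilities = softmax([2.0, 1.0])  # two output nodes, one per class
print(probabilities, sum(probabilities))
```

For a binary problem such as splicing site prediction, the output layer has two nodes and the softmax values can be read as class probabilities that sum to 1.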
K-fold Cross Validation
 CrossValidator begins by splitting the dataset into a set of folds which are used as
separate training and test datasets. E.g., with k=3 folds, CrossValidator will
generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for
training and 1/3 for testing.
 To evaluate a particular ParamMap, CrossValidator computes the average
evaluation metric for the 3 Models produced by fitting the Estimator on the 3
different (training, test) dataset pairs.
 After identifying the best ParamMap, CrossValidator finally re-fits the Estimator
using the best ParamMap and the entire dataset.
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# hashingTF, lr and pipeline are assumed to be defined earlier (not shown on the slide).
paramGrid = (ParamGridBuilder()
             .addGrid(hashingTF.numFeatures, [10, 100, 1000])
             .addGrid(lr.regParam, [0.1, 0.01])
             .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice
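How k folds give k (training, test) pairs can be shown with row indices alone. The sketch below is plain Python, and the helper `k_fold_indices` is hypothetical: it assigns rows to folds round-robin for readability, whereas a real cross-validator typically assigns them at random.

```python
def k_fold_indices(num_rows, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    folds = [list(range(i, num_rows, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

pairs = list(k_fold_indices(num_rows=9, k=3))
for train, test in pairs:
    print(len(train), len(test), test)
```

With k=3 every pair uses 2/3 of the rows for training and 1/3 for testing, and each row appears in exactly one test set.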
MCC Correlation
 The Matthews correlation coefficient is used in machine learning as a measure of
the quality of binary (two-class) classifications, introduced by biochemist Brian W.
Matthews in 1975. It takes into account true and false positives and negatives
and is generally regarded as a balanced measure which can be used even if the
classes are of very different sizes. The MCC is in essence a correlation
coefficient between the observed and predicted binary classifications; it returns a
value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no
better than random prediction and −1 indicates total disagreement between
prediction and observation. The statistic is also known as the phi coefficient.
MCC is related to the chi-square statistic for a 2×2 contingency table.
 While there is no perfect way of describing the confusion matrix of true and false
positives and negatives by a single number, the Matthews correlation coefficient
is generally regarded as being one of the best such measures.
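The coefficient can be computed directly from the four entries of the confusion matrix. The sketch below uses the standard MCC formula; the function name `mcc`, the example counts and the choice to return 0 when the denominator is zero are assumptions of this illustration.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(mcc(tp=50, tn=50, fp=0, fn=0))    # perfect prediction
print(mcc(tp=0, tn=0, fp=50, fn=50))    # total disagreement
print(mcc(tp=25, tn=25, fp=25, fn=25))  # no better than random
```

The three calls reproduce the reference points described above: +1, −1 and 0.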

More Related Content

Similar to APACHE SPARK PER IL MACHINE LEARNING: INTRODUZIONE ED UN CASO DI STUDIO_ Meetup deeplearningitalia-valerio-morfino-20181217

Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySpark
Ladle Patel
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
Meetup mongo db-spark-ml-20191111
Meetup mongo db-spark-ml-20191111Meetup mongo db-spark-ml-20191111
Meetup mongo db-spark-ml-20191111
Deep Learning Italia
 
Briefing on the Modern ML Stack with R
 Briefing on the Modern ML Stack with R Briefing on the Modern ML Stack with R
Briefing on the Modern ML Stack with R
Databricks
 
A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
DataSpace Academy
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
Apache Spark and future of advanced analytics
Apache Spark and future of advanced analyticsApache Spark and future of advanced analytics
Apache Spark and future of advanced analytics
Muralidhar Somisetty
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
Infinity Tech Solutions
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
ZaranTech LLC
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.
Peadar Coyle
 
Deep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and sparkDeep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and spark
François Garillot
 
Big data with java
Big data with javaBig data with java
Big data with java
Stefan Angelov
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0
Stratio
 
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Codemotion
 
Java With The Best Online Conference - Mind the gap: Java, Machine Learning, ...
Java With The Best Online Conference - Mind the gap: Java, Machine Learning, ...Java With The Best Online Conference - Mind the gap: Java, Machine Learning, ...
Java With The Best Online Conference - Mind the gap: Java, Machine Learning, ...
Richard Abbuhl
 
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
datascienceiqss
 

Similar to APACHE SPARK PER IL MACHINE LEARNING: INTRODUZIONE ED UN CASO DI STUDIO_ Meetup deeplearningitalia-valerio-morfino-20181217 (20)

Scalable Machine Learning with PySpark
Scalable Machine Learning with PySparkScalable Machine Learning with PySpark
Scalable Machine Learning with PySpark
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Meetup mongo db-spark-ml-20191111
Meetup mongo db-spark-ml-20191111Meetup mongo db-spark-ml-20191111
Meetup mongo db-spark-ml-20191111
 
Briefing on the Modern ML Stack with R
 Briefing on the Modern ML Stack with R Briefing on the Modern ML Stack with R
Briefing on the Modern ML Stack with R
 
A Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdfA Master Guide To Apache Spark Application And Versatile Uses.pdf
A Master Guide To Apache Spark Application And Versatile Uses.pdf
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Apache Spark and future of advanced analytics
Apache Spark and future of advanced analyticsApache Spark and future of advanced analytics
Apache Spark and future of advanced analytics
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark Introduction To Data Science with Apache Spark
Introduction To Data Science with Apache Spark
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.Introduction to Spark: Or how I learned to love 'big data' after all.
Introduction to Spark: Or how I learned to love 'big data' after all.
 
Deep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and sparkDeep learning on a mixed cluster with deeplearning4j and spark
Deep learning on a mixed cluster with deeplearning4j and spark
 
Big data with java
Big data with javaBig data with java
Big data with java
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0
 
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
Emiliano Martinez | Deep learning in Spark Slides | Codemotion Madrid 2018
 
Java With The Best Online Conference - Mind the gap: Java, Machine Learning, ...
Java With The Best Online Conference - Mind the gap: Java, Machine Learning, ...Java With The Best Online Conference - Mind the gap: Java, Machine Learning, ...
Java With The Best Online Conference - Mind the gap: Java, Machine Learning, ...
 
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
Data FAIRport Skunkworks: Common Repository Access Via Meta-Metadata Descript...
 

More from Deep Learning Italia

Machine Learning driven Quantum Optimization for Marketing
Machine Learning driven Quantum Optimization for MarketingMachine Learning driven Quantum Optimization for Marketing
Machine Learning driven Quantum Optimization for Marketing
Deep Learning Italia
 
Modelli linguistici da Eliza a ChatGPT P roblemi , fraintendimenti e prospettive
Modelli linguistici da Eliza a ChatGPT P roblemi , fraintendimenti e prospettiveModelli linguistici da Eliza a ChatGPT P roblemi , fraintendimenti e prospettive
Modelli linguistici da Eliza a ChatGPT P roblemi , fraintendimenti e prospettive
Deep Learning Italia
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptx
Deep Learning Italia
 
Meetup Luglio - Operations Research.pdf
Meetup Luglio - Operations Research.pdfMeetup Luglio - Operations Research.pdf
Meetup Luglio - Operations Research.pdf
Deep Learning Italia
 
Meetup Giugno - c-ResUNET.pdf
Meetup Giugno - c-ResUNET.pdfMeetup Giugno - c-ResUNET.pdf
Meetup Giugno - c-ResUNET.pdf
Deep Learning Italia
 
MEETUP Maggio - Team Automata
MEETUP Maggio - Team AutomataMEETUP Maggio - Team Automata
MEETUP Maggio - Team Automata
Deep Learning Italia
 
MEETUP APRILE - Ganomaly - Anomaly Detection.pdf
MEETUP APRILE - Ganomaly - Anomaly Detection.pdfMEETUP APRILE - Ganomaly - Anomaly Detection.pdf
MEETUP APRILE - Ganomaly - Anomaly Detection.pdf
Deep Learning Italia
 
2022_Meetup_Mazza-Marzo.pptx
2022_Meetup_Mazza-Marzo.pptx2022_Meetup_Mazza-Marzo.pptx
2022_Meetup_Mazza-Marzo.pptx
Deep Learning Italia
 
Machine Learning Security
Machine Learning SecurityMachine Learning Security
Machine Learning Security
Deep Learning Italia
 
The science of can and can t e la computazione quantistica
The science of can and can t e la computazione quantisticaThe science of can and can t e la computazione quantistica
The science of can and can t e la computazione quantistica
Deep Learning Italia
 
Dli meetup moccia
Dli meetup mocciaDli meetup moccia
Dli meetup moccia
Deep Learning Italia
 
Pi school-dli-presentation de nobili
Pi school-dli-presentation de nobiliPi school-dli-presentation de nobili
Pi school-dli-presentation de nobili
Deep Learning Italia
 
Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework Machine Learning Explanations: LIME framework
Machine Learning Explanations: LIME framework
Deep Learning Italia
 
Explanation methods for Artificial Intelligence Models
Explanation methods for Artificial Intelligence ModelsExplanation methods for Artificial Intelligence Models
Explanation methods for Artificial Intelligence Models
Deep Learning Italia
 
Use Cases Machine Learning for Healthcare
Use Cases Machine Learning for HealthcareUse Cases Machine Learning for Healthcare
Use Cases Machine Learning for Healthcare
Deep Learning Italia
 
NLG, Training, Inference & Evaluation
NLG, Training, Inference & Evaluation NLG, Training, Inference & Evaluation
NLG, Training, Inference & Evaluation
Deep Learning Italia
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Deep Learning Italia
 
Towards quantum machine learning calogero zarbo - meet up
Towards quantum machine learning  calogero zarbo - meet upTowards quantum machine learning  calogero zarbo - meet up
Towards quantum machine learning calogero zarbo - meet up
Deep Learning Italia
 
Macaluso antonio meetup dli 2020-12-15
Macaluso antonio  meetup dli 2020-12-15Macaluso antonio  meetup dli 2020-12-15
Macaluso antonio meetup dli 2020-12-15
Deep Learning Italia
 
Data privacy e anonymization in R
Data privacy e anonymization in RData privacy e anonymization in R
Data privacy e anonymization in R
Deep Learning Italia
 

More from Deep Learning Italia (20)

Machine Learning driven Quantum Optimization for Marketing
Machine Learning driven Quantum Optimization for MarketingMachine Learning driven Quantum Optimization for Marketing
Machine Learning driven Quantum Optimization for Marketing
 
Modelli linguistici da Eliza a ChatGPT P roblemi , fraintendimenti e prospettive
Modelli linguistici da Eliza a ChatGPT P roblemi , fraintendimenti e prospettiveModelli linguistici da Eliza a ChatGPT P roblemi , fraintendimenti e prospettive
Modelli linguistici da Eliza a ChatGPT P roblemi , fraintendimenti e prospettive
 
Transformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptxTransformers In Vision From Zero to Hero (DLI).pptx
Transformers In Vision From Zero to Hero (DLI).pptx
 
Meetup Luglio - Operations Research.pdf
Meetup Luglio - Operations Research.pdfMeetup Luglio - Operations Research.pdf
Meetup Luglio - Operations Research.pdf
 
Meetup Giugno - c-ResUNET.pdf
Apache Spark for Machine Learning: Introduction and a Case Study (Meetup Deep Learning Italia, Valerio Morfino, 17/12/2018)

  • 1. Meetup Deep Learning Italia, Rome, 17/12/2018. Apache Spark for Machine Learning: Introduction and a Case Study. Speaker: Valerio Morfino
  • 2. Valerio Morfino, Head of Big Data & Analytics at DbServices srl. Valerio Morfino has worked in computing and the Internet since 2000. He holds a degree in Computer Engineering and, over the course of his career, has worked in consulting firms, universities, and large and medium-sized companies, handling consulting, training, research, and project management. He is the author of scientific articles and a conference speaker on topics related to the web, e-commerce, machine learning, and bioinformatics.
  • 3. Presentation Objectives
      - Gain a basic understanding of Apache Spark and its parallel model
      - Understand how to tackle a bioinformatics problem using a supervised machine learning approach
      - Use Python and Apache Spark for the implementation
  • 4. Summary
      - Apache Spark
      - Spark parallel programming model
      - Case study introduction
      - Hands on!
      - Conclusions
  • 5. Apache Spark
      - Apache Spark is a distributed, cluster-based, general-purpose engine for big data processing
      - It has become one of the key distributed frameworks for big data processing
      - Spark is open source
      - Spark is fully integrated with the Hadoop ecosystem
      - It is available both locally and in cloud environments from the major providers (e.g. AWS, Google, Databricks, …)
      - Spark can run on clusters of hundreds or even thousands of nodes
  • 6. Apache Spark
      - High-level APIs are accessible in Java, Scala, Python and R
      - The MLlib library is rich in efficient parallel implementations of machine learning algorithms
  • 7. Spark Cluster configurations
      - Several cluster configurations are available:
          - Standalone
          - Hadoop YARN
          - Mesos
          - Kubernetes
  • 8. Apache Spark is Resilient!
      - Hardware can fail!
      - Spark is resilient thanks to:
          - Lineage
          - Use of distributed file systems such as HDFS
      - Is this important for my application?
          - In the case of big datasets
          - In the case of long training (processing) times
  • 9. Apache Spark is FAST!
      - Spark is very fast: up to 100x faster than Hadoop MapReduce
      - In-memory computing
      - Lazy evaluation
  • 10. MapReduce? OK, but… what is MapReduce?
  • 11. MapReduce Paradigm
      - Map jobs read a block of data and produce key-value pairs
      - Reduce jobs receive key-value pairs from multiple map jobs, sorted by key, and produce the output
      - Key concept: distribute the data and process it where it is!
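The map, sort-by-key, and reduce steps described on this slide can be sketched in plain Python. This is only an illustration of the paradigm on a single machine, not Spark or Hadoop code; the function names (`map_block`, `shuffle`, `reduce_group`, `map_reduce`) and the toy DNA blocks are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_block(block):
    """Map step: read one block of data and emit (key, value) pairs."""
    return [(nucleotide, 1) for nucleotide in block]


def shuffle(mapped_blocks):
    """Group the pairs emitted by all map jobs by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_blocks):
        groups[key].append(value)
    return sorted(groups.items())


def reduce_group(key, values):
    """Reduce step: combine all values that share the same key."""
    return key, sum(values)


def map_reduce(blocks):
    # In a real cluster each block would be mapped on the node that stores it.
    mapped = [map_block(block) for block in blocks]
    return dict(reduce_group(key, values) for key, values in shuffle(mapped))


# Hypothetical input: three blocks of a DNA sequence, one per "node".
blocks = ["AGTG", "TCCA", "GTCA"]
counts = map_reduce(blocks)
print(counts)
```

Here each block stands in for a chunk of data held by a different node; only the small (key, value) pairs travel between the map and reduce steps.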
  • 12. RDDs to store large datasets
      - Resilient: fault-tolerant thanks to the RDD lineage graph, and therefore able to recompute missing or damaged partitions
      - Distributed: the data resides on multiple nodes in a cluster
      - Dataset: a collection of partitioned data, stored in memory as far as possible (otherwise on disk)
  • 13. Map example using Spark
      - Two datasets are joined
      - A value is computed using a UDF (at a lower level, Spark computes a map)
      - Lazy evaluation: maps are transformations, computed only when an action is called (e.g. an output request or a reduce)
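Lazy evaluation can be illustrated without Spark. The sketch below uses a hypothetical `LazyDataset` class (not part of any Spark API) in which `map` only records the transformation and `collect` plays the role of the action that triggers the computation. The `gc_content` function stands in for a UDF and is likewise an assumption.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, not executed, until an action."""

    def __init__(self, data, transforms=()):
        self._data = data
        self._transforms = tuple(transforms)

    def map(self, fn):
        # Transformation: return a new dataset that remembers fn; nothing runs yet.
        return LazyDataset(self._data, self._transforms + (fn,))

    def collect(self):
        # Action: only now are the recorded transformations applied.
        results = []
        for item in self._data:
            for fn in self._transforms:
                item = fn(item)
            results.append(item)
        return results


calls = []


def gc_content(sequence):
    """Stand-in for a UDF: fraction of G and C nucleotides in a sequence."""
    calls.append(sequence)
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


dataset = LazyDataset(["AGTG", "TCCA"]).map(gc_content)
calls_before_action = len(calls)  # the UDF has not been called yet
result = dataset.collect()        # the action triggers the computation
print(calls_before_action, result)
```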
  • 14. Reduce using Spark
      - Reduce operations are also largely computed in parallel
      - The level of parallelism is related to the number of partitions and the number of worker nodes in the cluster
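A partition-wise reduce can be sketched as follows: each partition is reduced independently, then the partial results are combined. This is a single-machine illustration using a thread pool, assuming an associative and commutative operation. The helper names `partition` and `parallel_reduce` are hypothetical and do not reflect how Spark schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def partition(data, num_partitions):
    """Split the data into num_partitions strided slices."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def parallel_reduce(data, op, num_partitions=3):
    """Reduce each non-empty partition separately, then combine the partials.

    Assumes data is non-empty and op is associative and commutative.
    """
    parts = [part for part in partition(data, num_partitions) if part]
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(lambda part: reduce(op, part), parts))
    return reduce(op, partials)


total = parallel_reduce(list(range(1, 11)), add)
print(total)
```

With more partitions, more partial reductions can run at the same time, which is the relationship between partitions and parallelism that the slide points to.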
  • 15. Spark SQL, DataFrames and Datasets
      - Spark SQL is a Spark module for structured data processing.
      - A Dataset is a distributed collection of data. The Dataset API is supported only in Java and Scala.
      - A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood.
      - Datasets and DataFrames are internally represented as RDDs, but executed with additional optimizations.
  • 16. MLlib: Spark's machine learning library
      - ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
      - Featurization: feature extraction, transformation, dimensionality reduction, and selection
      - Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
      - Persistence: saving and loading algorithms, models, and Pipelines
      - Utilities: linear algebra, statistics, data handling, etc.
      - Text manipulation: tokenization, common-word removal, word combinations, Word2Vec
      - Note: as of Spark 2.0, the DataFrame-based API (package spark.ml) is the primary API. The RDD-based MLlib API (package spark.mllib) is now in maintenance mode.
  • 17. CASE STUDY
      - We deal with the splicing site prediction problem in DNA sequences, an important bioinformatics problem
      - Useful for:
          - Biological research (identification of intron-exon boundaries)
          - Medical research (understanding human variation in splicing and its effect on human diseases)
          - Personalized medicine
  • 18. CASE STUDY
  • 19. Biological Background
      - DNA is a linear molecule composed of four small molecules called nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T).
      - Segments of DNA that carry genetic information are called genes.
      - The genes in DNA encode protein molecules according to the flow known as "The Central Dogma": DNA → mRNA → Protein.
  • 20. Biological Background II
      - Most eukaryotic genes have their coding sequences (exons) interrupted by non-coding sequences (introns).
      - The boundary points between exon and intron (EI, or donor) and between intron and exon (IE, or acceptor) are called "splicing sites". During the splicing process, introns are removed.
      - The DNA splicing site prediction problem consists of identifying those regions.
  • 21. Splicing site problem in ML terms
      - Given a sequence of DNA (e.g. 60 nucleotides): AGTGTCCAGTCATG…GT…GAACGTAAGTAAGA
      - We wish to classify each sequence as:
          - Containing a splicing site in the middle
          - Not containing a splicing site in the middle
      - Each nucleotide is encoded in binary with a single 1 value (one-hot encoding)
  • 22. Ready to code?
  • 23. Supervised Machine Learning recipe
      - Ingredients:
          - A labelled dataset; in this specific case, four files: pos_training, neg_training, pos_test, neg_test
          - A learning algorithm (e.g. Decision Tree, SVM, Random Forest, Multilayer Perceptron, …)
      - Preparation:
          1. Load the dataset and assign a label:
             AGTGTCCAGTCATG…GT…GAACGTAAGTAAGA,1
          2. Encode the features (Vector Indexer or OneHot Encoder):
             String Indexer: 0,2,2,0,2,2,0,1,2,0,1,…,1,0,…,2,2,1,3,3,1,0,3,0,2,1,2,0,3,1
             One Hot: 0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1
      - Note: the last field is the class: 1 -> splicing site; 0 -> no splicing site
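The two encodings in the recipe can be sketched in plain Python. This is a minimal illustration, not the Spark feature transformers. It assumes the fixed index order A, C, G, T from the experiment's nucleotide encoding table, so the indices may differ from those on the slide. The function names `string_index` and `one_hot` are hypothetical.

```python
NUCLEOTIDES = "ACGT"  # assumed index order: A=0, C=1, G=2, T=3


def string_index(sequence):
    """Replace each nucleotide with its integer index."""
    return [NUCLEOTIDES.index(nucleotide) for nucleotide in sequence]


def one_hot(sequence):
    """Replace each nucleotide with four binary digits, exactly one of them set."""
    encoded = []
    for index in string_index(sequence):
        bits = [0, 0, 0, 0]
        bits[index] = 1
        encoded.extend(bits)
    return encoded


sequence = "AGTG"
indexed = string_index(sequence)
encoded = one_hot(sequence)
print(indexed)
print(encoded)

# A 60-nucleotide sequence becomes 60 * 4 = 240 binary features.
print(len(one_hot("A" * 60)))
```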
  • 24. Supervised Machine Learning cookbook
      3. Split the input dataset into:
          - Training set (about 70-80%)
          - Test set (about 20-30%)
      4. Assemble the features into a vector:
         Before: 0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1
         After (features, label): [0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0],1
      5. Train a model
      6. Test the model on the test set (tune and refine…)
      7. Ready to classify new unlabelled data!
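Steps 3 and 4 can be sketched in plain Python: each row is turned into a (features, label) pair, and the rows are then shuffled and split. The helper names `assemble` and `train_test_split`, the 70% fraction, the fixed seed, and the toy rows are all assumptions for illustration, not the Spark API used in the talk.

```python
import random


def assemble(row):
    """Step 4: split a flat row into a feature vector and its trailing label."""
    return row[:-1], row[-1]


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Step 3: shuffle the rows reproducibly and cut them into two sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical rows: four one-hot digits followed by the class label.
raw_rows = [[1, 0, 0, 0, i % 2] for i in range(10)]
rows = [assemble(row) for row in raw_rows]

train, test = train_test_split(rows)
print(len(train), len(test))
```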
  • 25. Let's code!
  • 26. CONCLUSIONS
  • 27. Experiment Description
      - Implementation steps:
          - Data loading
          - Data preparation (encoding)
          - Data splitting (training/test)
          - Training
          - Test
          - Result evaluation
      - Nucleotide encoding:
          Nucleotide   Encoded value (sparse matrix)
          A            {1,0,0,0}
          C            {0,1,0,0}
          G            {0,0,1,0}
          T            {0,0,0,1}
      - Splicing site prediction is a supervised machine learning binary classification problem
  • 28. Dataset and experimental environment
      - Datasets used:
          Dataset   #Nucleotides   Training inst. (pos./neg.)   Test inst. (pos./neg.)   Total samples
          IPDATA    60             464/1536                     302/884                  3186
          HS3D_1    140            1960/2942                    836/1307                 7045
          HS3D_2    140            1960/12571                   836/5431                 20768
      - Execution environment:
          - Databricks cloud cluster
          - 1 core
          - 6 GB RAM
      - Software configuration:
          - Spark 2.2.1, Scala 2.11
          - Jupyter 4.4.0
          - Python 3.5.2
  • 29. Experiment Description
      - Algorithms used:
          - Logistic Regression
          - Decision Tree
          - Random Forest
          - Linear Support Vector Machine
          - Naïve Bayes
          - Multilayer Perceptron
      - Default parameters are used where possible
      - Exception: Random Forest, with the number of trees set to 10
  • 30. Experiment results: Classification performance
          Dataset   Algorithm   Accuracy   Error rate   Corr. (MCC)
          IPDATA    LR          0.948      0.052        0.865
          IPDATA    DT          0.970      0.030        0.923
          IPDATA    RF          0.965      0.035        0.906
          IPDATA    SVM         0.960      0.040        0.894
          IPDATA    BAYES       0.966      0.034        0.911
          IPDATA    MLPERC      0.966      0.034        0.912
          HS3D_1    LR          0.927      0.073        0.847
          HS3D_1    DT          0.921      0.079        0.835
          HS3D_1    RF          0.933      0.067        0.859
          HS3D_1    SVM         0.935      0.065        0.864
          HS3D_1    BAYES       0.861      0.139        0.706
          HS3D_1    MLPERC      0.923      0.077        0.838
          HS3D_2    LR          0.947      0.053        0.765
          HS3D_2    DT          0.939      0.061        0.734
          HS3D_2    RF          0.908      0.092        0.525
          HS3D_2    SVM         0.949      0.051        0.776
          HS3D_2    BAYES       0.902      0.098        0.614
          HS3D_2    MLPERC      0.945      0.055        0.763
  • 31. Experiment results: Classification performance
      - The best performer is DT on the IPDATA dataset:
          - Accuracy: 97%
          - Error rate: 0.03
          - Matthews correlation coefficient (MCC): 0.923
  • 32. Experiment results: Training Time
          Dataset   Algorithm   Databricks 1-core   Local cluster 3-core
          IPDATA    LR          2.23                0.80
          IPDATA    DT          1.48                0.66
          IPDATA    RF          13.82               4.14
          IPDATA    SVM         13.95               4.45
          IPDATA    BAYES       0.75                0.16
          IPDATA    MLPERC      49.39               9.87
          HS3D_1    LR          6.68                1.56
          HS3D_1    DT          3.83                1.37
          HS3D_1    RF          43.20               14.15
          HS3D_1    SVM         26.42               6.27
          HS3D_1    BAYES       2.04                0.16
          HS3D_1    MLPERC      91.73               44.31
          HS3D_2    LR          6.20                1.53
          HS3D_2    DT          5.32                2.51
          HS3D_2    RF          67.02               25.40
          HS3D_2    SVM         26.63               7.83
          HS3D_2    BAYES       2.03                0.17
          HS3D_2    MLPERC      157.37              156.76
      - Good scalability can be observed!
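One way to read the training-time table is as a ratio between the two environments. The sketch below computes that ratio for a few rows copied from the table. The `speedup` helper is a hypothetical name. Because the two columns come from different environments (a cloud cluster and a local cluster), the ratio is only a rough indicator rather than a pure measure of scaling with the number of cores.

```python
# (dataset, algorithm) -> (Databricks 1-core time, local cluster 3-core time),
# values copied from the training-time table.
training_time = {
    ("IPDATA", "LR"): (2.23, 0.80),
    ("IPDATA", "MLPERC"): (49.39, 9.87),
    ("HS3D_2", "RF"): (67.02, 25.40),
    ("HS3D_2", "MLPERC"): (157.37, 156.76),
}


def speedup(single_core_time, multi_core_time):
    """Ratio of the 1-core time to the 3-core time (higher means faster)."""
    return single_core_time / multi_core_time


speedups = {key: speedup(*times) for key, times in training_time.items()}

for (dataset, algorithm), value in speedups.items():
    print(f"{dataset:7s} {algorithm:7s} {value:.2f}x")
```

On these rows the ratio ranges from about 1x (MLPERC on HS3D_2) to about 5x (MLPERC on IPDATA), so the gain varies considerably by algorithm and dataset.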
  • 33. Thank you! valerio.morfino@dbservices.it, https://it.linkedin.com/in/valerio-morfino
  • 34. Multilayer Perceptron Classifier
      - The multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network.
      - MLPC consists of multiple layers of nodes.
      - Each layer is fully connected to the next layer in the network.
      - Nodes in the input layer represent the input data.
      - All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights $w$ and bias $b$, followed by an activation function.
      - Nodes in intermediate layers use the sigmoid (logistic) function: $f(z_i) = \frac{1}{1 + e^{-z_i}}$
      - Nodes in the output layer use the softmax function: $f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$
      - The number of nodes $N$ in the output layer corresponds to the number of classes.
      - MLPC employs backpropagation for learning the model.
      - The logistic loss function is used for optimization, with L-BFGS as the optimization routine.
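The two activation functions on this slide can be written directly in Python. This is a minimal numeric sketch of the formulas, not Spark's implementation. Subtracting the maximum inside `softmax` is a common numerical-stability trick added here as an assumption; it does not change the result.

```python
import math


def sigmoid(z):
    """Logistic function used by the intermediate layers: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Softmax used by the output layer: e^z_i / sum_k e^z_k."""
    shift = max(zs)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [value / total for value in exps]


print(sigmoid(0.0))
probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities, sum(probabilities))
```

For the binary splicing-site problem the output layer would have two nodes, so `softmax` would return two class probabilities that sum to 1.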
  • 35. K-fold Cross Validation
      - CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
      - To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric for the 3 Models produced by fitting the Estimator on the 3 different (training, test) dataset pairs.
      - After identifying the best ParamMap, CrossValidator finally re-fits the Estimator using the best ParamMap and the entire dataset.
      - Example (hashingTF, lr and pipeline are assumed to be defined earlier):

          from pyspark.ml.evaluation import BinaryClassificationEvaluator
          from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

          # Grid of candidate parameter values to search over
          paramGrid = (ParamGridBuilder()
                       .addGrid(hashingTF.numFeatures, [10, 100, 1000])
                       .addGrid(lr.regParam, [0.1, 0.01])
                       .build())

          crossval = CrossValidator(estimator=pipeline,
                                    estimatorParamMaps=paramGrid,
                                    evaluator=BinaryClassificationEvaluator(),
                                    numFolds=2)  # use 3+ folds in practice
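The fold mechanics can be sketched in plain Python. This illustrates the idea only, not how CrossValidator is implemented: the folds here are built by deterministic striding rather than random assignment, and `k_fold_indices`, `cross_validate`, `majority_baseline`, and the toy labels are hypothetical names and data.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        pairs.append((train, test))
    return pairs


def cross_validate(n_samples, k, fit_and_score):
    """Average the metric returned by fit_and_score over the k pairs."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Toy labels and a placeholder "model" that always predicts the majority class.
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1]


def majority_baseline(train, test):
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test) / len(test)


pairs = k_fold_indices(len(labels), 3)
mean_accuracy = cross_validate(len(labels), 3, majority_baseline)
print(pairs[0])
print(mean_accuracy)
```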
  • 36. MCC Correlation
      - The Matthews correlation coefficient (MCC), introduced by biochemist Brian W. Matthews in 1975, is used in machine learning as a measure of the quality of binary (two-class) classifications.
      - It takes into account true and false positives and negatives, and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes.
      - The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 a prediction no better than random, and −1 total disagreement between prediction and observation.
      - The statistic is also known as the phi coefficient. MCC is related to the chi-square statistic for a 2×2 contingency table.
      - While there is no perfect way of describing the confusion matrix of true and false positives and negatives by a single number, the Matthews correlation coefficient is generally regarded as one of the best such measures.
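For reference, the standard definition of the coefficient in terms of the confusion-matrix counts (true positives TP, true negatives TN, false positives FP, false negatives FN) is

$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

A minimal Python sketch of this formula follows, alongside accuracy for comparison. Returning 0 when the denominator is zero is a common convention and is assumed here. The confusion-matrix counts in the example are made up for illustration.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions; the error rate is 1 - accuracy."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced case: the classifier always predicts "negative".
always_negative = dict(tp=0, tn=90, fp=0, fn=10)
print(accuracy(**always_negative), mcc(**always_negative))

# Perfect and totally wrong predictions on a balanced set.
print(mcc(tp=50, tn=50, fp=0, fn=0), mcc(tp=0, tn=0, fp=50, fn=50))
```

In the imbalanced case the accuracy is high while the MCC is 0, which is why the coefficient is useful when one class dominates, as with HS3D_2 in the datasets above.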

Editor's Notes

  1-2. Good afternoon, everyone. I will be brief.
  3-4. Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  5-6. Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark is easy to use and reliable thanks to RDDs (Resilient Distributed Datasets), the main distributed dataset abstraction.
  7. A programming framework for distributed and parallel processing on large datasets
  8-9. Spark can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  10. DNA is transcribed into mRNA (messenger RNA), which is translated into proteins.
  11. Most eukaryotic genes have their coding sequence, that is, the part of the DNA that is transcribed into mRNA, interrupted by non-coding sequences called introns.
  12-15. Given a sequence of nucleotides, e.g. 60 nucleotides, one-hot encoded in binary, we have 240 binary digits. We have to identify a function that, for each instance, returns 1 if the sequence contains a splicing site in the middle and 0 if it does not.
  16-17. In order to test Apache Spark's standard characteristics, we use default parameters where possible. For Random Forest, the default number of trees was just 20 (very small).
  18. Update with data from the latest experiments.
  19. Thanks for your attention. I'm here for any questions.