Apache Spark for Machine Learning: Introduction and a Case Study

Meetup Deep Learning Italia, 17/12/2018, Rome
Apache Spark for Machine Learning: Introduction and a Case Study
Speaker: Valerio Morfino

VALERIO MORFINO
Head of Big Data & Analytics at DbServices srl
Valerio Morfino has worked in computing and the Internet since 2000.
He holds a degree in Computer Engineering and, over the course of his career, has worked in consulting firms, universities, and large and mid-sized companies, dealing with consulting, training, research, and project management.
He is the author of scientific articles and a conference speaker on topics related to the web, e-commerce, machine learning, and bioinformatics.

Presentation Objectives
 Basic understanding of Apache Spark and its parallel model
 Understand how to tackle a bioinformatics problem using a supervised machine learning approach
 Use of Python and Apache Spark for the implementation

Summary
 Apache Spark
 Spark parallel programming
model
 Case Study Introduction
 Hands on!
 Conclusions

Apache Spark
 Apache Spark is a general-purpose, cluster-based distributed engine for big data processing
 It has become one of the key distributed processing frameworks for big data
 Spark is open source
 Spark is fully integrated with the Hadoop ecosystem
 It is available both on local installations and in cloud environments from the major providers (e.g. AWS, Google, Databricks, …)
 Spark can run on clusters of hundreds or even thousands of nodes

Apache Spark
 High-level APIs accessible in Java, Scala, Python
and R
 The MLlib library offers many efficient parallel implementations of machine learning algorithms

Spark Cluster configurations
 Several cluster configurations are supported:
 Standalone
 Hadoop YARN
 Mesos
 Kubernetes

Apache Spark is Resilient!
 The Hardware can fail!
 Spark is resilient thanks to:
 Lineage
 Use of distributed File Systems such as HDFS
 Is this important for my Application?
 In the case of Big Datasets
 In the case of long training (processing) time

Apache Spark is FAST!
 Spark is very fast!
 Up to 100x faster than Hadoop MapReduce
 In Memory computing
 Lazy evaluation

Map Reduce?
Ok, but…
What is Map Reduce?

Map Reduce Paradigm
 Map jobs read a block of data and produce key-value pairs
 Reduce jobs receive key-value pairs from multiple map jobs, sorted by key, and produce the output
 Key concept: distribute the data and process it where it is!
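The paradigm can be illustrated with a word count in plain Python. This is only a single-machine sketch, not Hadoop or Spark code: the function names `map_phase`, `shuffle` and `reduce_phase` and the input blocks are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(block):
    """Map: read one block of text and emit (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Group values by key, with keys in sorted order, as a reducer receives them."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: combine all the values of each key into one output value."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input, already split into blocks (one block per map job).
blocks = ["spark is fast", "spark is resilient", "hadoop is distributed"]

pairs = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(pairs))
```

In a real cluster each block would be mapped on the node that stores it, and only the key-value pairs would travel over the network.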

RDDs to store large datasets
 Resilient: fault-tolerant thanks to the RDD lineage graph, which makes it possible to recompute missing or damaged partitions
 Distributed: the data resides on multiple nodes in a cluster
 Dataset: a collection of partitioned data, stored in memory as far as possible (otherwise on disk)
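The idea of lineage-based recovery can be sketched with a toy class in plain Python. `ToyRDD` and its methods are hypothetical names invented for this illustration and say nothing about Spark's internals. The only point is that a lost partition can be rebuilt by replaying the recorded transformations on the source data.

```python
class ToyRDD:
    """Keeps the source data and the list of transformations (the lineage)."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions
        self.lineage = tuple(lineage)
        self.cache = {}  # partition index -> computed values

    def map(self, func):
        # A transformation only extends the lineage; nothing is computed here.
        return ToyRDD(self.source_partitions, self.lineage + (func,))

    def compute_partition(self, index):
        if index not in self.cache:
            values = self.source_partitions[index]
            for func in self.lineage:  # replay the lineage on the source data
                values = [func(v) for v in values]
            self.cache[index] = values
        return self.cache[index]

rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
first = rdd.compute_partition(1)
rdd.cache.pop(1)  # simulate losing the in-memory partition
recomputed = rdd.compute_partition(1)
```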

MAP example using Spark
 Two datasets are joined
 The computation uses a UDF (at a lower level, Spark computes a map)
 Lazy evaluation: maps are transformations, computed only when an action is called (e.g. an output request or a reduce)
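Lazy evaluation can be observed without Spark, because Python's built-in `map` is also lazy. The sketch below is an analogy only: `to_upper` and the sample data are assumptions. It records each real call to show that nothing runs until the result is requested.

```python
calls = []  # records every time the mapped function really runs

def to_upper(sequence):
    calls.append(sequence)
    return sequence.upper()

data = ["agtc", "ggta", "ccat"]

# "Transformation": map() only builds a lazy iterator, nothing is computed yet.
mapped = map(to_upper, data)
calls_before_action = len(calls)

# "Action": materializing the result forces the computation.
result = list(mapped)
calls_after_action = len(calls)
```

Before the action the function has not been called at all. After it, the function has been called once per element.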

Reduce using Spark
 Reduce operations are also computed largely in parallel
 The level of parallelism depends on the number of partitions and the number of worker nodes in the cluster
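A partition-wise reduce can be sketched in plain Python, with a thread pool standing in for the worker nodes. This is an assumption made for illustration, not how Spark schedules tasks. Each partition is reduced on its own, then the partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add

def split_into_partitions(values, num_partitions):
    """Distribute values round-robin over num_partitions lists."""
    return [values[i::num_partitions] for i in range(num_partitions)]

def parallel_sum(values, num_partitions=4):
    partitions = split_into_partitions(values, num_partitions)
    # Each partition is reduced independently (one task per partition) ...
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(lambda part: reduce(add, part, 0), partitions))
    # ... and the partial results are combined in a final reduce step.
    return reduce(add, partials, 0)

total = parallel_sum(list(range(1, 101)), num_partitions=4)
```

This scheme requires an associative operation such as a sum, since the order in which partial results are combined is not fixed.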

Spark SQL, DataFrames and
Datasets
 Spark SQL is a Spark module for structured data processing.
 A Dataset is a distributed collection of data. The Dataset API is available only in Scala and Java.
 A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but with richer optimizations under the hood.
 Datasets and DataFrames are internally represented as RDDs, but are executed with additional optimizations!

MLlib - Spark’s machine learning library
 ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
 Featurization: feature extraction, transformation, dimensionality reduction, and selection
 Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
 Persistence: saving and loading algorithms, models, and Pipelines
 Utilities: linear algebra, statistics, data handling, etc.
 Text manipulation: tokenization, common-word removal, word combinations, Word2Vec
Note: As of Spark 2.0, the DataFrame-based API (package spark.ml) is the primary API. The RDD-based MLlib API (package spark.mllib) is now in maintenance mode.

CASE STUDY
 We deal with the splicing site prediction problem in DNA sequences, an important bioinformatics problem
 Useful for:
 Biological Research (identification of Intron-Exon
boundaries)
 Medical research (to understand human variation
on splicing and its effect on human diseases)
 Personalized medicine

Biological Background
 DNA is a linear molecule composed of four
small molecules called nucleotide bases:
adenine (A), cytosine (C), guanine (G), and
thymine (T).
 Segments of DNA that carry genetic
information are called genes.
 The genes in DNA encode protein molecules
according to the flow known as “The Central
Dogma”: DNA → mRNA → Protein.

Biological Background II
 Most eukaryotic genes have their coding sequences (exons) interrupted by non-coding sequences (introns).
 The boundary points between exon and intron (EI, or donor) and between intron and exon (IE, or acceptor) are called “splicing sites”. During the splicing process, introns are removed.
 The DNA splicing site prediction problem deals with identifying those regions.

Splicing site problem in ML terms
 Given a sequence of DNA (e.g. 60 nucleotides):
AGTGTCCAGTCATG…GT…GAACGTAAGTAAGA
 We wish to classify each sequence as:
 Containing a splicing site in the middle
 Not containing a splicing site in the middle
 Features use a binary one-hot encoding (a single 1 per nucleotide)
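A one-hot encoder for DNA sequences can be sketched in a few lines of plain Python. The function name `one_hot_encode` is an assumption, and the A, C, G, T column order follows the nucleotide encoding table shown later in the talk. With four binary features per nucleotide, a 60-nucleotide sequence becomes 240 features.

```python
# Nucleotide order A, C, G, T, as in the encoding table used in this talk.
NUCLEOTIDES = "ACGT"

def one_hot_encode(sequence):
    """Return a flat list with four binary features per nucleotide."""
    features = []
    for base in sequence.upper():
        index = NUCLEOTIDES.index(base)  # raises ValueError on unknown symbols
        features.extend(1 if i == index else 0 for i in range(len(NUCLEOTIDES)))
    return features

encoded = one_hot_encode("AGT")
```

Here `encoded` holds the three blocks 1,0,0,0 (A), 0,0,1,0 (G) and 0,0,0,1 (T), one after the other.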

Ready to code?

Supervised Machine learning recipe
 Ingredients:
 A labelled set of data. In this specific case, four files: pos_training, neg_training, pos_test, neg_test
 A learning algorithm (e.g. Decision Tree, SVM, Random Forest, Multilayer Perceptron, …)
 Preparation:
1. Load the dataset and assign a label
AGTGTCCAGTCATG…GT…GAACGTAAGTAAGA,1
2. Encode features (Vector Indexer or OneHot Encoder)
0,2,2,0,2,2,0,1,2,0,1,…,1,0,…,2,2,1,3,3,1,0,3,0,2,1,2,0,3,1 (String Indexer)
0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1 (One Hot)
Note: the last field is the class: 1 -> splicing site; 0 -> no splicing site

Supervised Machine learning cookbook
3. Split the input dataset into:
 Training set (about 70-80%)
 Test set (about 20-30%)
4. Assemble the features in a vector. The flat row
0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0,1
becomes a (features, label) pair:
[0,0,0,1,0,0,0,1,0,…,0,0,1,0,0,0,0,1,0,1,0,0,1,0,0,1,0,0],1
5. Train a model
6. Test the model on the test set (tune and refine…)
7. Ready to classify new unlabelled data!
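Steps 3 and 4 can be sketched in plain Python. The helper names `train_test_split` and `assemble`, the 80/20 ratio, the seed and the toy rows are all assumptions chosen for the example. In Spark, the DataFrame method `randomSplit` plays the role of the split.

```python
import random

def assemble(row):
    """Turn a flat encoded row into (features, label); the last field is the class."""
    return row[:-1], row[-1]

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows reproducibly and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy rows: one-hot features followed by the label.
rows = [[1, 0, 0, 0, i % 2] for i in range(10)]

train_rows, test_rows = train_test_split(rows)
train_set = [assemble(row) for row in train_rows]
```

Fixing the seed keeps the split reproducible, so that different algorithms are compared on the same training and test data.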

Let’s code!

CONCLUSIONS

Experiment Description
 Implementation steps:
 Data loading
 Data preparation (encoding)
 Data splitting (training/test)
 Training
 Test
 Result evaluation
Nucleotides encoding
Nucleotide   Encoded value (sparse matrix)
A            {1,0,0,0}
C            {0,1,0,0}
G            {0,0,1,0}
T            {0,0,0,1}
 Splicing Site Prediction is a Supervised Machine Learning Binary Classification problem

Dataset and experimental environment
Datasets used
Dataset   #Nucleotides   Training instances (pos./neg.)   Test instances (pos./neg.)   Total samples
IPDATA    60             464/1536                         302/884                      3186
HS3D_1    140            1960/2942                        836/1307                     7045
HS3D_2    140            1960/12571                       836/5431                     20768
 Execution environments:
 Databricks Cloud Cluster
 1 core
 6 GB RAM
 Software configuration:
 Spark 2.2.1, Scala 2.11
 Jupyter 4.4.0
 Python 3.5.2

Experiment Description
 Algorithms used:
 Logistic Regression
 Decision Tree
 Random Forest
 Linear Support Vector Machine
 Naïve Bayes
 Multilayer Perceptron
 We use default parameters where possible
 Exception: Random Forest, with the number of trees set to 10

Experiment results:
Classification performance
Dataset Algorithm Accuracy Error rate MCC
IPDATA LR 0.948 0.052 0.865
IPDATA DT 0.970 0.030 0.923
IPDATA RF 0.965 0.035 0.906
IPDATA SVM 0.960 0.040 0.894
IPDATA BAYES 0.966 0.034 0.911
IPDATA MLPERC 0.966 0.034 0.912
HS3D_1 LR 0.927 0.073 0.847
HS3D_1 DT 0.921 0.079 0.835
HS3D_1 RF 0.933 0.067 0.859
HS3D_1 SVM 0.935 0.065 0.864
HS3D_1 BAYES 0.861 0.139 0.706
HS3D_1 MLPERC 0.923 0.077 0.838
HS3D_2 LR 0.947 0.053 0.765
HS3D_2 DT 0.939 0.061 0.734
HS3D_2 RF 0.908 0.092 0.525
HS3D_2 SVM 0.949 0.051 0.776
HS3D_2 BAYES 0.902 0.098 0.614
HS3D_2 MLPERC 0.945 0.055 0.763

Experiment results:
Classification performance
 The best performer is DT on the IPDATA dataset
 Accuracy: 97%
 Error rate: 0.03
 MCC: 0.923

Experiment results:
Training Time
Dataset Algorithm Databricks 1-core Local cluster 3-core
IPDATA LR 2.23 0.80
IPDATA DT 1.48 0.66
IPDATA RF 13.82 4.14
IPDATA SVM 13.95 4.45
IPDATA BAYES 0.75 0.16
IPDATA MLPERC 49.39 9.87
HS3D_1 LR 6.68 1.56
HS3D_1 DT 3.83 1.37
HS3D_1 RF 43.20 14.15
HS3D_1 SVM 26.42 6.27
HS3D_1 BAYES 2.04 0.16
HS3D_1 MLPERC 91.73 44.31
HS3D_2 LR 6.20 1.53
HS3D_2 DT 5.32 2.51
HS3D_2 RF 67.02 25.40
HS3D_2 SVM 26.63 7.83
HS3D_2 BAYES 2.03 0.17
HS3D_2 MLPERC 157.37 156.76
 Good scalability can be observed!
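The scalability claim can be checked with simple arithmetic on a few rows of the table above. The sketch below divides the 1-core time by the 3-core time. The table does not state the time unit, and the two environments differ in more than core count (cloud versus local cluster), so the ratios are indicative only. The `speedup` helper is an assumption made for the example.

```python
# Training times copied from the table above (Databricks 1-core, local cluster 3-core).
training_times = {
    ("IPDATA", "LR"): (2.23, 0.80),
    ("HS3D_1", "RF"): (43.20, 14.15),
    ("HS3D_2", "SVM"): (26.63, 7.83),
}

def speedup(single_core_time, three_core_time):
    """Ratio of the 1-core time to the 3-core time (greater than 1 means faster)."""
    return single_core_time / three_core_time

speedups = {key: speedup(*times) for key, times in training_times.items()}
```

For these three rows the ratio is close to or above 3. The last row of the table (HS3D_2, MLPERC) shows almost no difference between the two environments.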

THANK YOU!
valerio.morfino@dbservices.it
https://it.linkedin.com/in/valerio-morfino

Multilayer Perceptron Classifier
 The multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network.
 MLPC consists of multiple layers of nodes.
 Each layer is fully connected to the next layer in the network.
 Nodes in the input layer represent the input data.
 All other nodes map inputs to outputs by a linear combination of the inputs with the node’s weights $w$ and bias $b$, followed by an activation function.
 Nodes in intermediate layers use the sigmoid (logistic) function:
 $f(z_i) = \frac{1}{1 + e^{-z_i}}$
 Nodes in the output layer use the softmax function:
 $f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$
 The number of nodes $N$ in the output layer corresponds to the number of classes.
 MLPC employs backpropagation for learning the model.
 We use the logistic loss function for optimization and L-BFGS as an optimization routine.
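The sigmoid and softmax activation functions named above can be written in plain Python as follows. This is a minimal sketch with assumed function names, not the MLPC implementation. The subtraction of the maximum in `softmax` is a common numerical precaution and does not change the result.

```python
import math

def sigmoid(z):
    """Logistic function used by nodes in the intermediate layers."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Softmax used by the output layer; returns one probability per class."""
    shift = max(zs)  # subtracting the maximum avoids overflow, result is unchanged
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

hidden = sigmoid(0.0)
class_probabilities = softmax([2.0, 1.0])
```

For a binary problem such as splicing site prediction the output layer has two nodes, and the two softmax outputs sum to 1.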

K-fold Cross Validation
 CrossValidator begins by splitting the dataset into a set of folds which are used as
separate training and test datasets. E.g., with k=3 folds, CrossValidator will
generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for
training and 1/3 for testing.
 To evaluate a particular ParamMap, CrossValidator computes the average
evaluation metric for the 3 Models produced by fitting the Estimator on the 3
different (training, test) dataset pairs.
 After identifying the best ParamMap, CrossValidator finally re-fits the Estimator
using the best ParamMap and the entire dataset.
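from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# hashingTF, lr and pipeline are assumed to be defined earlier: a HashingTF
# stage, a LogisticRegression estimator and the Pipeline that contains them.
paramGrid = (ParamGridBuilder()
             .addGrid(hashingTF.numFeatures, [10, 100, 1000])
             .addGrid(lr.regParam, [0.1, 0.01])
             .build())
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice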
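The fold logic described above can be sketched in plain Python, independently of Spark. The names `k_fold_pairs` and `cross_validate` and the dummy scoring function are assumptions made for illustration. CrossValidator's actual splitting is randomized, whereas this sketch assigns rows to folds round-robin.

```python
def k_fold_pairs(rows, k):
    """Yield k (training, test) pairs; each row is in the test set exactly once."""
    folds = [rows[i::k] for i in range(k)]
    for test_index in range(k):
        test = folds[test_index]
        training = [row for i, fold in enumerate(folds) if i != test_index for row in fold]
        yield training, test

def cross_validate(rows, k, fit_and_score):
    """Average the metric returned by fit_and_score over the k pairs."""
    scores = [fit_and_score(training, test) for training, test in k_fold_pairs(rows, k)]
    return sum(scores) / len(scores)

rows = list(range(9))
pairs = list(k_fold_pairs(rows, k=3))
# Hypothetical scoring function: just reports the fraction of data used for training.
average = cross_validate(rows, 3, lambda training, test: len(training) / len(rows))
```

With k=3 and nine rows, each pair trains on six rows and tests on three, which matches the 2/3 and 1/3 proportions in the text.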

Matthews Correlation Coefficient (MCC)
 The Matthews correlation coefficient is used in machine learning as a measure of
the quality of binary (two-class) classifications, introduced by biochemist Brian W.
Matthews in 1975. It takes into account true and false positives and negatives
and is generally regarded as a balanced measure which can be used even if the
classes are of very different sizes. The MCC is in essence a correlation
coefficient between the observed and predicted binary classifications; it returns a
value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no
better than random prediction and −1 indicates total disagreement between
prediction and observation. The statistic is also known as the phi coefficient.
MCC is related to the chi-square statistic for a 2×2 contingency table.
 While there is no perfect way of describing the confusion matrix of true and false
positives and negatives by a single number, the Matthews correlation coefficient
is generally regarded as being one of the best such measures.
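In terms of the confusion matrix, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The plain-Python sketch below computes it. The function name and the sample counts are assumptions, and returning 0 when the denominator is zero is a common convention rather than part of the definition.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from the four cells of a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

perfect = matthews_corrcoef(tp=5, tn=5, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=5, fn=5)
mixed = matthews_corrcoef(tp=6, tn=3, fp=1, fn=2)
```

The first two calls reproduce the extremes described above: +1 for a perfect prediction and −1 for total disagreement.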
