Meetup mongo db-spark-ml-20191111

Meetup Deep Learning Italia – 11/11/2019 - Roma
Machine Learning in the Big Data context on document-based data using Apache Spark & MongoDB
Speakers: Valerio Morfino & Orlando Moroni

DB Services is a technology company that operates throughout Italy, with offices in
Rome and Milan.
The company's mission is to help its clients manage, enhance and analyze
their data as effectively as possible, through talent and technology.
"Around and inside data", because DBS works with every technology that
concerns data. Advanced Analytics, Visual Analytics, Business Intelligence,
ETL, Data Ingestion, SQL, NoSQL and Big Data systems, Middleware, On-Premises and
Cloud are therefore part of the company's DNA.
DBS brings professionalism and innovation to major public and private
corporations and to SMEs.
Through a careful partnership strategy, DBS brings its clients the most important
and advanced technologies on the market, including MongoDB, Tableau and Oracle.
We experiment with, support and propose cutting-edge technologies thanks
to strategic partnerships with universities and startups, such as Deep Learning Italia.
www.dbservices.it
Giorgia Butera, Sales Executive
giorgia.butera@dbservices.it
https://www.linkedin.com/in/giorgia-butera-693786122/

VALERIO MORFINO
Head of Big Data & Analytics
DB Services
A computer engineer, he is Head of
Big Data & Analytics at DB
Services.
During his career he has
worked in consulting firms,
universities and companies, dealing
with consulting, training,
research and project management.
He is the author of articles and a speaker at
conferences on web, e-commerce,
machine learning and big data.
ORLANDO MORONI
COO & Principal MongoDB Architect
DB Services
A computer engineer, he is Chief
Operating Officer & Principal
MongoDB Architect at DB
Services.
During his career he has
worked with the most important
relational and non-relational
database technologies, on
architecture, modeling and
performance tuning.
He has taken part in building
some of the most important
MongoDB installations in Italy.

Summary
❑ Introduction
❑ Apache Spark
❑ MongoDB
❑ Mongo Spark Connector
❑ Case Study: SYN-DOS attack prediction on IoT
devices

Introduction

Toward a new era
Traditional data platforms are failing to meet new business
requirements that demand a no-compromise combination of:
❑ Real-time data
❑ Performance
❑ Scale
❑ Integrated data
❑ Security

The Translytical era
Translytical is a hot, emerging market that delivers a unified
data platform to support all kinds of workloads.
Translytical can support various use cases, including real-time
insights, machine learning (ML), streaming analytics, extreme
transactional processing, and operational reporting.
[The Forrester Wave™: Translytical Data Platforms, Q4 2019]

Why MongoDB & Spark?
❑ We are in a Big Data World!
❑ Store high Volumes of data
❑ Store and Analyze data with high Velocity
❑ Store data in a Variety of formats and locations
❑ Be aware of Vulnerability!

Why MongoDB & Spark? Examples
❑ Store and analyze data from IoT devices
❑ Store and analyze data in distributed environments
❑ Enable real-time analytics without ETL
❑ Advanced analytics and Machine Learning at scale
❑ Enrich BI reports & dashboards with augmented
analytics features

Apache Spark

Apache Spark
❑ A distributed, cluster-based, general-purpose engine for big data
processing
❑ Fully integrated with the Hadoop ecosystem
❑ Available both in local and in cloud environments
❑ Clusters of hundreds or even thousands of nodes
❑ Up to 100X faster than Hadoop MapReduce
❑ Resilient thanks to lineage and a distributed storage system (e.g. HDFS or
MongoDB)
❑ This matters for Big Data and long-running tasks on large clusters, where hardware,
software or network connections can fail!

Apache Spark
❑ High-level APIs accessible in Java, Scala, Python and R
❑ The MLlib library offers many efficient parallel implementations of machine
learning algorithms

Spark Cluster configurations
❑ Several Cluster configurations:
❑ Standalone
❑ Hadoop YARN
❑ Mesos
❑ Kubernetes
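
Each cluster manager listed above is selected through the master URL passed to Spark. The sketch below collects the URL forms that Spark documents for each one; the host names, ports and the `spark_submit_command` helper are placeholders introduced for illustration, not values from the talk.

```python
from typing import List

# Master URL forms documented by Spark for each deployment option.
# Host names and ports are placeholders, not real endpoints.
MASTER_URLS = {
    "local": "local[*]",                       # single machine, all cores (no cluster manager)
    "standalone": "spark://master-host:7077",  # Spark's built-in cluster manager
    "yarn": "yarn",                            # cluster located through the Hadoop configuration
    "mesos": "mesos://mesos-host:5050",
    "kubernetes": "k8s://https://k8s-apiserver:6443",
}


def spark_submit_command(manager: str, app: str) -> List[str]:
    """Build a spark-submit argument list for the chosen cluster manager."""
    return ["spark-submit", "--master", MASTER_URLS[manager], app]
```

The application code stays the same across these options; only the master URL and the deployment settings change.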

RDDs to store Large datasets
❑ Resilient, i.e. fault-tolerant thanks to RDD lineage graph, able to recompute missing or
damaged partitions
❑ Distributed, with data residing on multiple nodes in a cluster
❑ Dataset, i.e. a collection of partitioned data kept in memory as far as possible
(otherwise on disk)
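
The idea of recomputing a lost partition from its lineage can be illustrated with a toy, single-process model. This is only a sketch of the concept: `ToyRDD` is a hypothetical class written for this example and does not reflect how Spark implements RDDs internally.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Toy model of an RDD: partitions plus the lineage needed to rebuild them."""

    def __init__(self, source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None):
        self.source = source   # base data, one list per partition (stands in for durable storage)
        self.parent = parent   # lineage: the RDD this one was derived from
        self.fn = fn           # lineage: the per-element transformation
        self.cache: Dict[int, List[int]] = {}  # partitions currently held in memory

    def num_partitions(self) -> int:
        if self.source is not None:
            return len(self.source)
        return self.parent.num_partitions()

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        """Record the transformation; nothing is computed yet."""
        return ToyRDD(parent=self, fn=fn)

    def partition(self, i: int) -> List[int]:
        """Return partition i, recomputing it from lineage if it is not cached."""
        if i not in self.cache:
            if self.source is not None:
                self.cache[i] = list(self.source[i])
            else:
                self.cache[i] = [self.fn(x) for x in self.parent.partition(i)]
        return self.cache[i]

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions()) for x in self.partition(i)]


base = ToyRDD(source=[[1, 2], [3, 4]])
squares = base.map(lambda x: x * x)
squares.collect()               # computes and caches both partitions
del squares.cache[1]            # simulate losing one partition
recovered = squares.partition(1)  # rebuilt from the parent via the recorded function
```

Only the missing partition is recomputed; the partitions that survived stay in the cache untouched.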

MLlib - Spark’s machine learning library
❑ ML Algorithms: common learning algorithms such as classification,
regression, clustering, and collaborative filtering
❑ Featurization: feature extraction, transformation, dimensionality reduction, and selection
❑ Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
❑ Persistence: saving and loading algorithms, models, and Pipelines
❑ Utilities: linear algebra, statistics, data handling, etc.
❑ Text manipulation: tokenization, stop-word removal, n-grams (word combinations), Word2Vec
Note: As of Spark 2.0, the DataFrame-based API (package spark.ml) is the primary API. The RDD-based MLlib API
(package spark.mllib) is now in maintenance mode.
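
A minimal sketch of a DataFrame-based (spark.ml) Pipeline follows. The classifier (`LogisticRegression`), the column names and the two helper functions are assumptions chosen for illustration; the slides do not state which algorithm was used. `pyspark` is imported inside the function so the module can be loaded without Spark installed, while calling `build_pipeline` requires an active SparkSession.

```python
from typing import List


def feature_columns(all_columns: List[str], label_col: str) -> List[str]:
    """Treat every column except the label as a numeric feature."""
    return [c for c in all_columns if c != label_col]


def build_pipeline(all_columns: List[str], label_col: str = "label"):
    """Assemble a spark.ml Pipeline: index the label, build the vector, classify."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # Featurization: turn the string label into a numeric index.
    indexer = StringIndexer(inputCol=label_col, outputCol="label_idx")
    # Featurization: pack the numeric columns into a single vector column.
    assembler = VectorAssembler(
        inputCols=feature_columns(all_columns, label_col), outputCol="features")
    # ML algorithm: an assumed choice, any spark.ml classifier fits here.
    classifier = LogisticRegression(featuresCol="features", labelCol="label_idx")
    return Pipeline(stages=[indexer, assembler, classifier])
```

Given a training DataFrame `train_df`, `build_pipeline(train_df.columns).fit(train_df)` would return a `PipelineModel`, which can be stored with `model.write().overwrite().save(path)`.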

MongoDB

MongoDB
❑ Document-oriented NoSQL database
❑ JSON-like documents with a flexible schema
❑ A distributed database at its core:
  ❑ high availability (replica set)
  ❑ horizontal scaling (sharding)
  ❑ geographic distribution
❑ Open source, cross-platform

MongoDB – Why?
Intelligent Operational Data Platform
❑ Document Model: best way to work with data
❑ Distributed Architecture: intelligently put data where you need it
❑ Run Anywhere: freedom to run anywhere

Mongo – Relational dictionary
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  profession: ['banking', 'finance'],
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}

Mongo – Relational dictionary
MongoDB                      | SQL
database                     | database
collection                   | table
document                     | record (row)
field                        | column
linking/embedded documents   | join
primary key (_id field)      | primary key (user designated)
index                        | index
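
The "linking/embedded documents vs. join" row can be made concrete with plain Python data. The sketch below splits the sample document into a parent row and child rows, as a relational schema would, and then joins them back. The `_id` value, the `person_id` foreign key and both helper functions are hypothetical names introduced for this example.

```python
from typing import Dict, List, Tuple

person = {
    "_id": 1,  # hypothetical key; the sample document on the slide omits _id
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}


def to_relational(doc: Dict) -> Tuple[Dict, List[Dict]]:
    """Split one document into a parent row and child rows keyed by person_id."""
    parent = {k: v for k, v in doc.items() if k != "cars"}
    children = [{"person_id": doc["_id"], **car} for car in doc["cars"]]
    return parent, children


def join_back(parent: Dict, children: List[Dict]) -> Dict:
    """The relational join that the embedded array makes unnecessary."""
    cars = [{k: v for k, v in c.items() if k != "person_id"}
            for c in children if c["person_id"] == parent["_id"]]
    return {**parent, "cars": cars}
```

With embedding, reading a person together with their cars is a single document fetch; in the relational layout the same read needs the join shown in `join_back`.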

MongoDB – Replica Set
Replica Set
• Up to 50 replicas
• Distributed across racks, data centers, and regions
Self-healing
Data Center Aware
Addresses availability considerations:
• High Availability
• Disaster Recovery
• Maintenance
[Diagram: Application and Driver connected to the Primary, with Replication from the Primary to two Secondaries]
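
From the application side, a replica set is addressed by listing its members in the connection string. The sketch below builds such a string in the standard `mongodb://` format; the host names, the replica set name and the `replica_set_uri` helper are placeholders for illustration.

```python
from typing import List
from urllib.parse import urlencode


def replica_set_uri(hosts: List[str], replica_set: str,
                    read_preference: str = "primary") -> str:
    """Build a MongoDB connection string that lists the replica set members."""
    options = urlencode({"replicaSet": replica_set,
                         "readPreference": read_preference})
    return "mongodb://" + ",".join(hosts) + "/?" + options


# Hypothetical three-member replica set.
uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    replica_set="rs0",
)
```

Given a string like this, the driver discovers which member is currently the primary and redirects writes after a failover, so the application does not need to track the topology itself.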

MongoDB – Automatic Sharding
Application transparent
Multiple sharding policies: hashed, ranged, zoned
Increase or decrease capacity as you go
Automatic balancing for elasticity
Horizontally Scalable
[Diagram: Shard 1, Shard 2, Shard 3, ... Shard N]
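
The difference between the hashed and ranged policies can be sketched with a toy router. This is a simplification under stated assumptions: MongoDB uses its own hash function and moves chunks between shards with a balancer, while the two functions below are hypothetical helpers that only show how a shard key value maps to a shard.

```python
import hashlib
from bisect import bisect_right
from typing import List


def hashed_shard(key: str, num_shards: int) -> int:
    """Hashed policy: spreads keys evenly, at the cost of range locality."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def ranged_shard(key: int, upper_bounds: List[int]) -> int:
    """Ranged policy: shard i holds keys below upper_bounds[i] (exclusive);
    the last shard holds everything at or above the final bound."""
    return bisect_right(upper_bounds, key)
```

With bounds `[10, 20]`, keys 0 to 9 go to shard 0, keys 10 to 19 to shard 1 and all larger keys to shard 2, so a range query touches few shards. The hashed variant gives up that locality in exchange for an even write distribution.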

Mongo Spark Connector

Mongo Spark connector most important features
❑ Most important connector features:
  ❑ Ability to read/write BSON documents directly from/to MongoDB
  ❑ Automatic conversion from a MongoDB collection to a Spark RDD (also DataFrame and Dataset)
  ❑ Predicate pushdown:
    ❑ Filters (e.g. where conditions) and selects are pushed down to the data source, so the actual filtering
    and projection are done on the MongoDB node before the data is returned to the Spark node.
  ❑ Integration with the MongoDB aggregation pipeline:
    ❑ A MongoRDD accepts a MongoDB pipeline, so aggregations run on the MongoDB nodes instead of
    the Spark nodes. However, most of the work is performed automatically by the connector.
  ❑ Data locality:
    ❑ If the Spark nodes and the MongoDB nodes (in a sharded cluster configuration) are deployed on the same
    servers, data is loaded according to its locality in the cluster, avoiding costly network
    transfers.
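
A minimal sketch of reading through the connector with an aggregation pipeline, assuming the 2.x connector series and an existing SparkSession named `spark` that has the connector on its classpath. The field name, the value and both helper functions are hypothetical.

```python
import json


def match_stage(field: str, value) -> str:
    """Serialize a one-stage $match pipeline; the connector takes it as a JSON string."""
    return json.dumps([{"$match": {field: value}}])


def read_filtered(spark, uri: str, field: str, value):
    """Load a collection as a DataFrame, with the $match executed by MongoDB.

    `spark` is an existing SparkSession; `uri` includes database and collection.
    """
    return (spark.read
            .format("com.mongodb.spark.sql.DefaultSource")  # data source name in the 2.x connector
            .option("uri", uri)
            .option("pipeline", match_stage(field, value))
            .load())
```

For simple conditions an explicit pipeline is often unnecessary: a DataFrame expression such as `df.filter(df["label"] == "attack").select("label")` is a candidate for the predicate pushdown described above.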

Reference Architecture for MongoDB & Spark
❑ Apache Spark
❑ MongoDB Connector for Spark
❑ MongoDB nodes
❑ Data locality (Spark Workers and MongoDB nodes on the same host)

Case study Architecture
❑ Full Cloud architecture
❑ Databricks Community
❑ MongoDB Atlas

Case Study configuration
❑ Databricks Community Edition
  ❑ https://databricks.com/try-databricks
  ❑ Databricks Runtime 5.5 LTS (Spark 2.4.3 + Scala 2.11)
  ❑ Maven library import for the MongoDB Spark Connector:
    ❑ org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
❑ MongoDB Atlas
  ❑ https://www.mongodb.com/cloud/atlas
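
Outside Databricks, where the library is attached to the cluster through the UI, the same setup can be expressed as Spark configuration properties. The sketch below assumes the 2.x connector property names; the credentials, cluster host, database and collection names are placeholders, and both helpers are hypothetical.

```python
from typing import Dict
from urllib.parse import quote_plus

# Maven coordinate taken from the case study configuration.
CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1"


def atlas_uri(user: str, password: str, host: str,
              database: str, collection: str) -> str:
    """Build a mongodb+srv connection string, escaping the credentials."""
    return (f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}/{database}.{collection}")


def spark_conf(uri: str) -> Dict[str, str]:
    """Spark properties that pull in the connector and set default read/write URIs."""
    return {
        "spark.jars.packages": CONNECTOR,
        "spark.mongodb.input.uri": uri,
        "spark.mongodb.output.uri": uri,
    }
```

Each key/value pair can be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.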

CASE STUDY
SYN-DOS attack prediction

Cyber attacks
❑ They can undermine:
  ❑ Confidentiality
  ❑ Integrity
  ❑ Availability
❑ DoS (Denial of Service) attacks undermine Availability
❑ The SYN-DOS attack (also known as SYN flood) undermines availability
by saturating the server's TCP/IP connections

SYN-DOS Attack
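A normal TCP connection is opened with a three-way handshake:
1. Client requests a connection by sending a SYN (synchronize) message to the server.
2. Server acknowledges by sending a SYN-ACK (synchronize-acknowledge) message back to the client.
3. Client responds with an ACK (acknowledge) message, and the connection is established.
In a SYN flood, the attacker sends many SYN messages but never completes step 3, leaving the
server with half-open connections until it can no longer accept new ones.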
https://www.imperva.com/learn/application-security/syn-flood/
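
The effect of withholding the final ACK can be shown with a toy model of a listener that has a bounded queue of half-open connections. `ToyServer` is a hypothetical class for illustration; real TCP stacks add timeouts, retransmissions and defenses such as SYN cookies, none of which are modeled here.

```python
class ToyServer:
    """Toy TCP listener with a bounded queue of half-open connections."""

    def __init__(self, backlog: int):
        self.backlog = backlog
        self.half_open = set()    # clients that sent SYN and got SYN-ACK, ACK still pending
        self.established = set()  # clients that completed the handshake

    def on_syn(self, client: str) -> bool:
        """Steps 1-2: accept the SYN and reply SYN-ACK if the queue has room."""
        if len(self.half_open) >= self.backlog:
            return False          # queue saturated: the SYN is dropped
        self.half_open.add(client)
        return True

    def on_ack(self, client: str) -> bool:
        """Step 3: the final ACK completes the handshake and frees the slot."""
        if client not in self.half_open:
            return False
        self.half_open.remove(client)
        self.established.add(client)
        return True


server = ToyServer(backlog=3)
server.on_syn("client-a")
server.on_ack("client-a")             # normal handshake, slot released
for i in range(3):
    server.on_syn(f"spoofed-{i}")     # attacker never sends the ACK
rejected = not server.on_syn("client-b")  # legitimate client is turned away
```

The well-behaved client occupies a queue slot only briefly, while each spoofed SYN holds one indefinitely in this model, which is why a modest number of requests is enough to deny service.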

Dataset & Reference
❑ Dataset Description
  ❑ 115 features (Double)
  ❑ 1 label (String)
  ❑ 11,000 total samples (10,000 normal + 1,000 attack)
  ❑ Features contain statistics that implicitly describe the
  current state of the channel
  ❑ Data comes from IP cameras
  ❑ The statistics are generated by a Feature Extractor
❑ SYN-DOS
  ❑ Paper: https://arxiv.org/pdf/1802.09089.pdf
  ❑ Full Dataset:
  https://drive.google.com/drive/folders/1kmoWY4poGWfmmVSdSu-r_3Vo84Tu4PyE
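
With 10,000 normal and 1,000 attack samples the classes are imbalanced, which affects how accuracy should be read. The short calculation below uses only the counts stated above; the inverse-frequency weighting is a common heuristic added here for illustration, not something the talk prescribes.

```python
from typing import Dict

COUNTS = {"normal": 10_000, "attack": 1_000}


def majority_baseline_accuracy(counts: Dict[str, int]) -> float:
    """Accuracy of a model that always predicts the most frequent class."""
    return max(counts.values()) / sum(counts.values())


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


baseline = majority_baseline_accuracy(COUNTS)  # 10000 / 11000, about 0.909
weights = balanced_class_weights(COUNTS)       # normal: 0.55, attack: 5.5
```

A classifier that never flags an attack already scores about 91% accuracy on this data, so per-class metrics such as recall on the attack class say more than accuracy alone.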

Let’s code!

Cyber attacks: demo steps
❑ Access the MongoDB Atlas console:
  ❑ https://cloud.mongodb.com
  ❑ View the Collections
❑ Access Databricks
  ❑ Create the cluster
  ❑ Import the Maven library: org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
  ❑ Notebook: creating the Collection on MongoDB
  ❑ Notebook: training
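
The two notebooks are not reproduced in the slides. The following is only a sketch of what the data-loading steps could look like with the 2.x connector, assuming an existing SparkSession `spark`, a CSV export of the dataset with a header row, and a URI that names the target database and collection. All function names, the path and the split ratio are hypothetical.

```python
from typing import List

MONGO_SOURCE = "com.mongodb.spark.sql.DefaultSource"  # data source name in the 2.x connector


def split_weights(train_fraction: float) -> List[float]:
    """Weights for a train/test split; the fraction must lie strictly between 0 and 1."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    return [train_fraction, 1.0 - train_fraction]


def create_collection(spark, csv_path: str, uri: str) -> None:
    """Notebook 1 (sketch): load the CSV and write it to a MongoDB collection."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)  # header row is an assumption
    df.write.format(MONGO_SOURCE).mode("overwrite").option("uri", uri).save()


def load_splits(spark, uri: str, train_fraction: float = 0.8, seed: int = 42):
    """Notebook 2 (sketch): read the collection back and split it for training."""
    df = spark.read.format(MONGO_SOURCE).option("uri", uri).load().drop("_id")
    return df.randomSplit(split_weights(train_fraction), seed=seed)
```

Dropping `_id` keeps MongoDB's generated key out of the feature set before the training step.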

Useful links
❑ https://spark.apache.org/docs/latest/
❑ https://spark.apache.org/docs/latest/ml-guide.html
❑ https://spark.apache.org/docs/latest/ml-classification-regression.html
❑ https://docs.databricks.com/getting-started/index.html
❑ https://www.mongodb.com/it
❑ https://databricks.com/try-databricks
❑ https://www.mongodb.com/cloud/atlas
❑ https://docs.mongodb.com/spark-connector/

Thank you for your attention
valerio.morfino@dbservices.it https://it.linkedin.com/in/valerio-morfino
orlando.moroni@dbservices.it https://www.linkedin.com/in/orlandomoroni
