Meetup Deep Learning Italia – 11/11/2019 - Roma
Machine Learning in the Big Data context on document-based data using Apache Spark & MongoDB
Speaker Valerio Morfino & Orlando Moroni
DB Services is a technology company operating throughout Italy, with offices in
Rome and Milan.
The company's mission is to help its clients manage, enhance, and analyze
their data as effectively as possible, through talent and technology.
"Around and inside data", because DBS works with every technology that
touches data. Advanced Analytics, Visual Analytics, Business Intelligence,
ETL, Data Ingestion, SQL, NoSQL and Big Data systems, Middleware, On-Premise and
Cloud are therefore part of the company's DNA.
DBS brings professionalism and innovation to major public and private
companies and to SMEs.
Through a careful partnership strategy, DBS brings its clients the most important
and advanced technologies on the market, including MongoDB, Tableau, and Oracle.
We experiment with, support, and propose cutting-edge technologies thanks
to strategic partnerships with universities and startups, such as Deep Learning Italia.
www.dbservices.it
Giorgia Butera, Sales Executive
giorgia.butera@dbservices.it
https://www.linkedin.com/in/giorgia-butera-693786122/
VALERIO MORFINO
Head of Big Data & Analytics
DB Services
A computer engineer, he is Head of
Big Data & Analytics at DB
Services.
Over his career he has
worked in consulting firms,
universities, and companies, covering
consulting, training,
research, and project management.
He is the author of articles and a speaker at
conferences on web, e-commerce,
machine learning, and big data.
ORLANDO MORONI
COO & Principal MongoDB Architect
DB Services
A computer engineer, he is Chief
Operating Officer & Principal
MongoDB Architect at DB
Services.
Over his career he has
worked with the most important
relational and non-relational
database technologies on
architecture, data modeling, and
performance tuning.
He took part in building
some of the most important
MongoDB installations in Italy.
Summary
❑ Introduction
❑ Apache Spark
❑ MongoDB
❑ Mongo Spark Connector
❑ Case Study: SYN-DOS attack prediction on IoT
devices
Deep Learning Italia 11/11/2019 – Roma - Machine Learning in the Big Data context on document-based data using Apache Spark & MongoDB, Valerio Morfino & Orlando Moroni
Introduction
Toward a new era
Traditional data platforms are failing to meet new business
requirements that demand a no-compromise combination of:
❑ Real-time data
❑ Performance
❑ Scale
❑ Integrated data
❑ Security
The Translytical era
Translytical is a hot, emerging market that delivers a unified
data platform to support all kinds of workloads.
Translytical can support various use cases, including real-time
insights, machine learning (ML), streaming analytics, extreme
transactional processing, and operational reporting.
[The Forrester Wave™: Translytical Data Platforms, Q4 2019]
Why MongoDB & Spark?
❑ We are in a Big Data World!
❑ Store high Volume of Data
❑ Store and Analyze data with high Velocity
❑ Store data in a Variety of formats and locations
❑ Be aware of Vulnerability!
Why MongoDB & Spark? Examples
❑ Store and analyze data from IoT devices
❑ Store and analyze data in distributed environments
❑ Enable real-time analytics without ETL
❑ Advanced analytics and Machine Learning at scale
❑ Enrich BI reports & dashboards with augmented
analytics features
Apache Spark
Apache Spark
❑ A distributed, cluster-based, general-purpose engine for big data
processing
❑ Fully integrated with the Hadoop ecosystem
❑ Available in both local and cloud environments
❑ Clusters of hundreds or even thousands of nodes
❑ Up to 100x faster than Hadoop MapReduce (the figure claimed by the Spark project for in-memory workloads)
❑ Resilient thanks to lineage and a distributed storage system (e.g. HDFS or
MongoDB)
❑ This matters for Big Data and long-running jobs on large clusters, where hardware,
software, or network connections can fail!
Apache Spark
❑ High-level APIs accessible in Java, Scala, Python and R
❑ The MLlib library offers many efficient parallel implementations of machine
learning algorithms
Spark Cluster configurations
❑ Several Cluster configurations:
❑ Standalone
❑ Hadoop YARN
❑ Mesos
❑ Kubernetes
RDDs to store Large datasets
❑ Resilient, i.e. fault-tolerant thanks to RDD lineage graph, able to recompute missing or
damaged partitions
❑ Distributed, with data residing on multiple nodes in a cluster
❑ Dataset is a collection of partitioned data stored in memory as far as possible
(otherwise disk)
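The idea of rebuilding a lost partition from its lineage can be illustrated with a toy model in plain Python. This is only a sketch, not Spark's implementation: the `MiniRDD` class and its `lose_partition` helper are hypothetical names invented for this example, and the root `source` list stands in for durable storage.

```python
from typing import Any, Callable, Dict, List, Optional


class MiniRDD:
    """Toy RDD: partitioned data plus the lineage needed to rebuild it."""

    def __init__(self,
                 source: Optional[List[List[Any]]] = None,
                 parent: Optional["MiniRDD"] = None,
                 fn: Optional[Callable[[List[Any]], List[Any]]] = None) -> None:
        self.source = source                   # only the root dataset has raw partitions
        self.parent = parent                   # lineage: where this dataset came from
        self.fn = fn                           # lineage: how each partition is derived
        self.cache: Dict[int, List[Any]] = {}  # partitions kept in memory

    def num_partitions(self) -> int:
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def map(self, f: Callable[[Any], Any]) -> "MiniRDD":
        return MiniRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred: Callable[[Any], bool]) -> "MiniRDD":
        return MiniRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, index: int) -> List[Any]:
        # Recompute a partition from its parent only if it is not in memory.
        if index not in self.cache:
            if self.parent is None:
                self.cache[index] = list(self.source[index])
            else:
                self.cache[index] = self.fn(self.parent.compute(index))
        return self.cache[index]

    def lose_partition(self, index: int) -> None:
        # Simulate a node failure: drop the in-memory copy along the whole chain.
        node: Optional[MiniRDD] = self
        while node is not None:
            node.cache.pop(index, None)
            node = node.parent

    def collect(self) -> List[Any]:
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


root = MiniRDD(source=[[1, 2, 3], [4, 5, 6]])
even_squares = root.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = even_squares.collect()
even_squares.lose_partition(1)       # partition 1 disappears from memory
recovered = even_squares.collect()   # rebuilt by replaying filter + map
```

After the simulated failure, only partition 1 is recomputed; partition 0 is still served from memory.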
MLlib - Spark’s machine learning library
❑ ML Algorithms: common learning algorithms such as classification,
regression, clustering, and collaborative filtering
❑ Featurization: feature extraction, transformation, dimensionality reduction, and selection
❑ Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
❑ Persistence: saving and loading algorithms, models, and Pipelines
❑ Utilities: linear algebra, statistics, data handling, etc.
❑ Text manipulation: tokenization, stop-word removal, word combinations (n-grams), Word2Vec
Note: As of Spark 2.0, the DataFrame-based API (package spark.ml) is the primary API. The RDD-based MLlib API
(package spark.mllib) is now in maintenance mode.
MongoDB
MongoDB
❑ Document-oriented NoSQL database
❑ JSON-like documents with optional schema
❑ A distributed database at its core:
❑ high availability (replica set)
❑ horizontal scaling (sharding)
❑ geographic distribution
❑ Open source, cross-platform
MongoDB – Why?
Intelligent Operational Data Platform
Document Model: Best way to work with data
Distributed Architecture: Intelligently put data where you need it
Run Anywhere: Freedom to run anywhere
MongoDB
Rich Queries: Point | Range | Geospatial | Faceted Search | Aggregations | JOINs | Graph Traversals
Data models: JSON Documents | Tabular | Key-Value | Text | Graph | Geospatial
Versatile: Multiple data models, rich query functionality
Mongo – Relational dictionary
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  profession: ['banking', 'finance'],
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
RDBMS
Mongo – Relational dictionary
MongoDB | SQL
database | database
collection | table
document | record (row)
field | column
linking/embedded documents | join
primary key (_id field) | primary key (user designated)
index | index
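To make the dictionary concrete, the sketch below takes a document shaped like the "Paul Miller" example and splits it into the rows a relational schema would need. It is an illustration only: the `_id` value, the `person_id` foreign-key name, and the `to_relational` helper are assumptions made for this example, and the `location` field is left out for brevity.

```python
from typing import Any, Dict, List, Tuple

# One MongoDB-style document: embedded arrays live inside the parent document.
person_doc: Dict[str, Any] = {
    "_id": 1,  # assumed value; in MongoDB the _id field plays the primary-key role
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "profession": ["banking", "finance"],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}


def to_relational(doc: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, List[Dict[str, Any]]]]:
    """Split a document into one parent row and one child table per array field."""
    parent_row: Dict[str, Any] = {}
    child_tables: Dict[str, List[Dict[str, Any]]] = {}
    for field, value in doc.items():
        if isinstance(value, list):
            rows = []
            for item in value:
                # Embedded documents become rows; scalars become single-column rows.
                row = dict(item) if isinstance(item, dict) else {"value": item}
                row["person_id"] = doc["_id"]  # hypothetical foreign-key column
                rows.append(row)
            child_tables[field] = rows
        else:
            parent_row[field] = value
    return parent_row, child_tables


person_row, children = to_relational(person_doc)
```

Reading the person back in SQL would need joins across three tables, whereas the document form returns everything in a single read.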
MongoDB – Replica Set
Replica Set
• Up to 50 replicas
• Distributed across racks, data centers, and regions
Self-healing
Data Center Aware
Addresses availability considerations:
• High Availability
• Disaster Recovery
• Maintenance
[Diagram: Application with Driver; replica set with one Primary and two Secondaries; Replication]
MongoDB – Automatic Sharding
Application transparent
Multiple sharding policies: hashed, ranged, zoned
Increase or decrease capacity as you go
Automatic balancing for elasticity
Horizontally Scalable
Shard 1 | Shard 2 | Shard 3 | ... | Shard N
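The difference between the hashed and ranged policies can be sketched in a few lines of Python. This is a deliberate simplification, not MongoDB's algorithm: MongoDB assigns ranges of (hashed) key values to chunks and balances chunks across shards, while the modulo routing, the shard names, and the range boundaries below are assumptions chosen for illustration.

```python
import bisect
import hashlib
from typing import List

SHARDS: List[str] = ["shard1", "shard2", "shard3"]  # hypothetical shard names


def hashed_shard(key: str) -> str:
    """Hashed policy: spread keys evenly, at the cost of scattering adjacent keys."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


# Ranged policy: keys below "h" go to shard1, keys from "h" up to (not including) "p"
# go to shard2, and everything else goes to shard3. Boundaries are illustrative.
RANGE_BOUNDS: List[str] = ["h", "p"]


def ranged_shard(key: str) -> str:
    """Ranged policy: adjacent keys stay together, which helps range queries."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, key)]


cities = ["alice", "london", "rome"]
ranged_placement = {c: ranged_shard(c) for c in cities}
hashed_placement = {c: hashed_shard(c) for c in cities}
```

With the ranged policy a query for keys between "a" and "g" touches a single shard; with the hashed policy the same query would usually have to ask every shard.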
Mongo Spark Connector
Mongo Spark connector most important features
❑ Most important connector features:
❑ Ability to read/write BSON documents directly from/to MongoDB
❑ Automatic conversion from a MongoDB collection to a Spark RDD (or DataFrame and Dataset)
❑ Predicate pushdown:
❑ Filters (e.g. where conditions) and selects are pushed down to the data source, so the actual filtering
and projection are done on the MongoDB node before the data is returned to the Spark node.
❑ Integration with the MongoDB aggregation pipeline:
❑ A MongoRDD accepts a MongoDB pipeline, so aggregations run on the MongoDB nodes instead of
the Spark nodes. However, most of the work is performed automatically by the connector.
❑ Data locality:
❑ If the Spark nodes and MongoDB nodes (in a sharded cluster configuration) are deployed on the same
servers, the data is loaded according to its locality in the cluster, avoiding costly network
transfers.
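Predicate pushdown can be pictured as translating a filter and a column selection into `$match` and `$project` stages that run on the database side. The sketch below is not the connector's code: `build_pipeline`, `matches`, and `run_on_server` are hypothetical helpers, and the tiny in-memory executor supports only equality and `$gt`, just enough to show that fewer and smaller documents travel back to Spark.

```python
from typing import Any, Dict, List


def build_pipeline(filters: Dict[str, Any], columns: List[str]) -> List[Dict[str, Any]]:
    """Translate a filter and a column selection into aggregation stages."""
    pipeline: List[Dict[str, Any]] = []
    if filters:
        pipeline.append({"$match": dict(filters)})
    if columns:
        projection: Dict[str, int] = {"_id": 0}
        projection.update({c: 1 for c in columns})
        pipeline.append({"$project": projection})
    return pipeline


def matches(doc: Dict[str, Any], condition: Dict[str, Any]) -> bool:
    """Evaluate equality and $gt conditions only (illustrative subset)."""
    for field, expected in condition.items():
        value = doc.get(field)
        if isinstance(expected, dict):
            if "$gt" in expected and not (value is not None and value > expected["$gt"]):
                return False
        elif value != expected:
            return False
    return True


def run_on_server(docs: List[Dict[str, Any]], pipeline: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Stand-in for the database executing the pipeline before returning data."""
    result = list(docs)
    for stage in pipeline:
        if "$match" in stage:
            result = [d for d in result if matches(d, stage["$match"])]
        elif "$project" in stage:
            keep = [f for f, on in stage["$project"].items() if on]
            result = [{f: d[f] for f in keep if f in d} for d in result]
    return result


cars = [
    {"_id": 1, "model": "Bentley", "year": 1973, "value": 100000},
    {"_id": 2, "model": "Rolls Royce", "year": 1965, "value": 330000},
]

# Roughly what a "where year > 1970, select model, value" query becomes.
pipeline = build_pipeline({"year": {"$gt": 1970}}, ["model", "value"])
returned = run_on_server(cars, pipeline)
```

Only one trimmed document crosses the network here instead of two full ones; on real collections the saving grows with the selectivity of the filter.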
Reference Architecture for MongoDB & Spark
❑ Apache Spark
❑ MongoDB Connector for Spark
❑ MongoDB nodes
❑ Data locality (Spark Workers and MongoDB nodes on the same host)
Case study Architecture
❑ Full Cloud architecture
❑ Databricks Community
❑ MongoDB Atlas
Case Study configuration
❑ Databricks community edition
❑ https://databricks.com/try-databricks
❑ Databricks Runtime 5.5 LTS (Spark 2.4.3 + Scala 2.11)
❑ Import the Maven library for the MongoDB Spark Connector:
❑ org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
❑ MongoDB Atlas
❑ https://www.mongodb.com/cloud/atlas
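A common stumbling block in this kind of setup is picking a connector build that does not match the cluster. The sketch below parses the Maven coordinate from the configuration above and checks it against the runtime's Scala and Spark versions. The helper names are hypothetical, and the rule that the connector's major.minor version should follow Spark's major.minor version is stated here as an assumption based on how the 2.x connector releases were numbered.

```python
from typing import Dict


def parse_coordinate(coordinate: str) -> Dict[str, str]:
    """Split 'group:artifact_scalaVersion:version' into its parts."""
    group, artifact, version = coordinate.split(":")
    name, _, scala_version = artifact.rpartition("_")
    return {"group": group, "artifact": name, "scala": scala_version, "version": version}


def is_compatible(coordinate: str, cluster_scala: str, cluster_spark: str) -> bool:
    """Check the Scala suffix and (assumed rule) the Spark major.minor line."""
    parsed = parse_coordinate(coordinate)
    same_scala = parsed["scala"] == cluster_scala
    same_spark_line = parsed["version"].split(".")[:2] == cluster_spark.split(".")[:2]
    return same_scala and same_spark_line


COORDINATE = "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1"

parsed = parse_coordinate(COORDINATE)
ok_on_case_study_cluster = is_compatible(COORDINATE, cluster_scala="2.11", cluster_spark="2.4.3")
ok_on_scala_212_cluster = is_compatible(COORDINATE, cluster_scala="2.12", cluster_spark="2.4.3")
```

The `_2.11` suffix in the artifact name is the Scala binary version, which is why it has to agree with the Scala version of the Databricks runtime.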
CASE STUDY
SYN-DOS attack prediction
Cyber attacks
❑ They can undermine:
❑ Confidentiality
❑ Integrity
❑ Availability
❑ DoS (Denial of Service) attacks undermine availability
❑ The SYN-DoS attack (also known as SYN flood) undermines availability
by saturating the server's TCP/IP connections
SYN-DOS Attack
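1. Client requests a connection by sending a SYN (synchronize) message to the server.
2. Server acknowledges by sending a SYN-ACK (synchronize-acknowledge) message back to the client.
3. Client responds with an ACK (acknowledge) message, and the connection is established.
In a SYN flood, the attacker sends many SYN messages but never sends the final ACK, leaving half-open connections that exhaust the server's resources.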
https://www.imperva.com/learn/application-security/syn-flood/
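The handshake steps above, and how a flood of unanswered SYNs blocks legitimate clients, can be simulated with a toy server. This is a sketch under simplifying assumptions: the `Server` class, the backlog size of 5, and the client names are invented for illustration, and real TCP stacks also apply timeouts, retransmissions, and defenses such as SYN cookies that are not modeled here.

```python
from typing import Set


class Server:
    """Toy TCP listener with a fixed-size queue of half-open connections."""

    def __init__(self, backlog_size: int) -> None:
        self.backlog_size = backlog_size
        self.half_open: Set[str] = set()     # SYN received, waiting for the final ACK
        self.established: Set[str] = set()

    def on_syn(self, client: str) -> bool:
        """Steps 1-2: accept the SYN and reply with SYN-ACK if there is room."""
        if len(self.half_open) >= self.backlog_size:
            return False                     # backlog full: the SYN is dropped
        self.half_open.add(client)
        return True

    def on_ack(self, client: str) -> bool:
        """Step 3: the final ACK completes the handshake and frees the slot."""
        if client in self.half_open:
            self.half_open.remove(client)
            self.established.add(client)
            return True
        return False


server = Server(backlog_size=5)

# A legitimate client completes all three steps.
connected_before = server.on_syn("client-a") and server.on_ack("client-a")

# The attacker sends SYNs from spoofed addresses and never answers the SYN-ACK.
for i in range(5):
    server.on_syn(f"spoofed-{i}")

# A second legitimate client now cannot even start the handshake.
accepted_during_flood = server.on_syn("client-b")
```

Features that summarize the state of the channel, such as those in the dataset described next, are meant to capture exactly this kind of abnormal traffic pattern.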
Dataset & Reference
❑ Dataset Description
❑ 115 features (Double)
❑ 1 label (String)
❑ 11,000 total samples (10,000 normal + 1,000 attack)
❑ Features contain statistics that implicitly describe the
current state of the channel
❑ Data come from IP cameras
❑ The statistics are generated by a Feature Extractor
❑ SYN-DoS
❑ Paper: https://arxiv.org/pdf/1802.09089.pdf
❑ Full Dataset:
https://drive.google.com/drive/folders/1kmoWY4poGWfmmVSdSu-r_3Vo84Tu4PyE
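The sample counts above imply a 10:1 class imbalance, which is worth keeping in mind when reading any accuracy figure. The short calculation below is plain arithmetic on the numbers from the slide; the "always predict normal" baseline is a hypothetical reference point introduced here, not something the case study itself reports.

```python
# Counts taken from the dataset description.
normal_samples = 10_000
attack_samples = 1_000
features_per_sample = 115

total_samples = normal_samples + attack_samples
fields_per_document = features_per_sample + 1      # 115 features plus the label

attack_share = attack_samples / total_samples

# Hypothetical baseline: a model that labels every sample as "normal".
baseline_accuracy = normal_samples / total_samples
baseline_attack_recall = 0 / attack_samples        # it never detects an attack

print(f"total samples: {total_samples}")
print(f"attack share: {attack_share:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"majority-class baseline recall on attacks: {baseline_attack_recall:.1f}")
```

Because a trivial model already reaches about 91% accuracy, metrics such as recall or precision on the attack class are more informative for this dataset than accuracy alone.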
Let’s code!
Cyber attacks
❑ Log in to the MongoDB Atlas console:
❑ https://cloud.mongodb.com
❑ View the Collections
❑ Log in to Databricks
❑ Create the cluster
❑ Import the Maven library: org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
❑ Notebook: create the Collection on MongoDB
❑ Notebook: training
Useful links
❑ https://spark.apache.org/docs/latest/
❑ https://spark.apache.org/docs/latest/ml-guide.html
❑ https://spark.apache.org/docs/latest/ml-classification-regression.html
❑ https://docs.databricks.com/getting-started/index.html
❑ https://www.mongodb.com/it
❑ https://databricks.com/try-databricks
❑ https://www.mongodb.com/cloud/atlas
❑ https://docs.mongodb.com/spark-connector/
Thank you for your attention
valerio.morfino@dbservices.it https://it.linkedin.com/in/valerio-morfino
orlando.moroni@dbservices.it https://www.linkedin.com/in/orlandomoroni
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 

Recently uploaded (20)

原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 

Meetup mongo db-spark-ml-20191111

  • 1. Meetup Deep Learning Italia – 11/11/2019 – Roma. Machine Learning in the Big Data context on document-based data using Apache Spark & MongoDB. Speakers: Valerio Morfino & Orlando Moroni
  • 2. DB Services is a technology company operating throughout Italy, with offices in Rome and Milan. The company's mission is to help its clients manage, enhance, and analyze their data as effectively as possible, through talent and technology. "Around and inside data", because DBS works across all data-related technologies: Advanced Analytics, Visual Analytics, Business Intelligence, ETL, Data Ingestion, SQL, NoSQL and Big Data systems, Middleware, On-Premise and Cloud are all part of the company's DNA. DBS brings professionalism and innovation to major public and private enterprises and to SMEs. Through a careful partnership strategy, DBS brings its clients the most important and advanced technologies on the market, including MongoDB, Tableau, and Oracle. We experiment with, support, and propose cutting-edge technologies thanks to strategic partnerships with universities and startups, such as Deep Learning Italia.
      www.dbservices.it
      Giorgia Butera, Sales Executive, giorgia.butera@dbservices.it, https://www.linkedin.com/in/giorgia-butera-693786122/
  • 3. Speakers
      ❑ VALERIO MORFINO, Head of Big Data & Analytics, DB Services. A computer engineer, he is Head of Big Data & Analytics at DB Services. Over his career he has worked in consulting firms, universities, and companies, covering consulting, training, research, and project management. He is the author of articles and a conference speaker on web, e-commerce, machine learning, and big data.
      ❑ ORLANDO MORONI, COO & Principal MongoDB Architect, DB Services. A computer engineer, he is Chief Operating Officer & Principal MongoDB Architect at DB Services. Over his career he has worked with the leading relational and non-relational database technologies on architecture, data modeling, and performance tuning. He has taken part in some of the most significant MongoDB deployments in Italy.
  • 4. Summary
      ❑ Introduction
      ❑ Apache Spark
      ❑ MongoDB
      ❑ Mongo Spark Connector
      ❑ Case Study: SYN-DOS attack prediction on IoT devices
      Deep Learning Italia 11/11/2019 – Roma - Machine Learning in the Big Data context on document-based data using Apache Spark & MongoDB, Valerio Morfino & Orlando Moroni
  • 5. Introduction
  • 6. Toward a new era
      Traditional data platforms are failing to meet new business requirements, which demand a no-compromise combination of:
      ❑ Real-time data
      ❑ Performance
      ❑ Scale
      ❑ Integrated data
      ❑ Security
  • 7. The Translytical era
      "Translytical is a hot, emerging market that delivers a unified data platform to support all kinds of workloads. Translytical can support various use cases, including real-time insights, machine learning (ML), streaming analytics, extreme transactional processing, and operational reporting."
      [The Forrester Wave™: Translytical Data Platforms, Q4 2019]
  • 8. Why MongoDB & Spark?
      ❑ We are in a Big Data World!
      ❑ Store a high Volume of data
      ❑ Store and analyze data with high Velocity
      ❑ Store data in a Variety of formats and locations
      ❑ Be aware of Vulnerability!
  • 9. Why MongoDB & Spark? Examples
      ❑ Store and analyze data from IoT devices
      ❑ Store and analyze data in distributed environments
      ❑ Enable real-time analytics without ETL
      ❑ Advanced analytics and Machine Learning at scale
      ❑ Enrich BI reports and dashboards with augmented analytics features
  • 10. Apache Spark
  • 11. Apache Spark
      ❑ A distributed, cluster-based general engine for big data processing
      ❑ Fully integrated with the Hadoop ecosystem
      ❑ Available in both local and cloud environments
      ❑ Clusters of hundreds or even thousands of nodes
      ❑ Up to 100X faster than Hadoop MapReduce
      ❑ Resilient thanks to lineage and a distributed storage system (e.g. HDFS or MongoDB)
          ❑ This matters for Big Data and long-running processing tasks on large clusters, where hardware, software, or network connections can fail!
  • 12. Apache Spark
      ❑ High-level APIs available in Java, Scala, Python and R
      ❑ The MLlib library provides many efficient parallel implementations of machine learning algorithms
  • 13. Spark Cluster configurations
      ❑ Several cluster configurations:
          ❑ Standalone
          ❑ Hadoop YARN
          ❑ Mesos
          ❑ Kubernetes
  • 14. RDDs to store large datasets
      ❑ Resilient, i.e. fault-tolerant thanks to the RDD lineage graph, which makes it possible to recompute missing or damaged partitions
      ❑ Distributed, with data residing on multiple nodes in a cluster
      ❑ Dataset, i.e. a collection of partitioned data kept in memory as far as possible (otherwise on disk)
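The "resilient" property on slide 14 can be illustrated with a toy model in plain Python. This is a sketch of the idea only, not Spark's implementation: `ToyRDD` and its methods are hypothetical names, and a "lost partition" is simulated by deleting a cache entry.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the lineage needed to rebuild them."""

    def __init__(self, source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None):
        self.source = source   # base data, one inner list per partition
        self.parent = parent   # lineage: the RDD this one derives from
        self.fn = fn           # lineage: the per-element transformation
        self.cache = {}        # partition index -> materialized values

    def map(self, fn):
        # A transformation only records lineage; nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def num_partitions(self):
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, i):
        """Return partition i, rebuilding it from lineage if it is not cached."""
        if i not in self.cache:
            if self.parent is None:
                self.cache[i] = list(self.source[i])
            else:
                self.cache[i] = [self.fn(x) for x in self.parent.compute(i)]
        return self.cache[i]

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2], [3, 4], [5, 6]])
squared = base.map(lambda x: x * x)
before = squared.collect()

# Simulate losing one in-memory partition (e.g. a worker failure).
del squared.cache[1]
after = squared.collect()  # partition 1 is rebuilt from its lineage
```

Only the missing partition is recomputed; the other two are served from the cache, which is the point of tracking lineage per partition.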
  • 15. MLlib - Spark's machine learning library
      ❑ ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
      ❑ Featurization: feature extraction, transformation, dimensionality reduction, and selection
      ❑ Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
      ❑ Persistence: saving and loading algorithms, models, and Pipelines
      ❑ Utilities: linear algebra, statistics, data handling, etc.
      ❑ Text manipulation: tokenization, common word removal, word combinations, Word2Vec
      Note: As of Spark 2.0, the DataFrame-based API (package spark.ml) is the primary API. The RDD-based MLlib API (package spark.mllib) is now in maintenance mode.
  • 16. MongoDB
  • 17. MongoDB
      ❑ Document-oriented NoSQL database
      ❑ JSON-like documents with schema
      ❑ A distributed database at its core:
          ❑ high availability (replica set)
          ❑ horizontal scaling (sharding)
          ❑ geographic distribution
      ❑ Open source, cross-platform
  • 18. MongoDB – Why? Intelligent Operational Data Platform
      ❑ Document Model: best way to work with data
      ❑ Distributed Architecture: intelligently put data where you need it
      ❑ Run Anywhere: freedom to run anywhere
  • 19. MongoDB. Versatile: multiple data models, rich query functionality
      ❑ Rich queries: Point | Range | Geospatial | Faceted Search | Aggregations | JOINs | Graph Traversals
      ❑ Data models: JSON Documents | Tabular | Key-Value | Text | Graph | Geospatial
  • 20. Mongo – Relational dictionary
      {
        first_name: 'Paul',
        surname: 'Miller',
        city: 'London',
        profession: ['banking', 'finance'],
        location: [45.123, 47.232],
        cars: [
          { model: 'Bentley', year: 1973, value: 100000, … },
          { model: 'Rolls Royce', year: 1965, value: 330000, … }
        ]
      }
      [Figure: RDBMS]
  • 21. Mongo – Relational dictionary
      MongoDB | SQL
      database | database
      collection | table
      document | record (row)
      field | column
      linking/embedded documents | join
      primary key (_id field) | primary key (user designated)
      index | index
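To make the "embedded documents versus join" row concrete, the sketch below takes the document from slide 20 as a Python dict and splits it into the rows a relational schema would need. The `_id` value, the `person_id` foreign key, and the `to_relational` helper are illustrative assumptions; the fields elided with "…" on the slide are left out.

```python
# The document from slide 20; the _id value is assumed for illustration.
person = {
    "_id": 1,
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "profession": ["banking", "finance"],
    "location": [45.123, 47.232],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}


def to_relational(doc):
    """Split one document into a parent row and child rows linked by a foreign key."""
    # Scalar fields fit in a single "person" row; array fields need their own tables.
    person_row = {k: v for k, v in doc.items() if not isinstance(v, list)}
    # Each embedded car becomes a row in a "car" table pointing back to the person.
    car_rows = [dict(car, person_id=doc["_id"]) for car in doc["cars"]]
    return person_row, car_rows


person_row, car_rows = to_relational(person)
```

Reading the person together with the cars takes a join on `person_id` in the relational layout, while the document model returns everything in one read.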
  • 22. MongoDB – Replica Set
      ❑ Replica Set
          ❑ Up to 50 replicas
          ❑ Distributed across racks, data centers, and regions
      ❑ Self-healing
      ❑ Data Center Aware
      ❑ Addresses availability considerations:
          ❑ High Availability
          ❑ Disaster Recovery
          ❑ Maintenance
      [Diagram labels: Application, Driver, Primary, Secondary, Secondary, Replication]
  • 23. MongoDB – Automatic Sharding
      ❑ Application transparent
      ❑ Multiple sharding policies: hashed, ranged, zoned
      ❑ Increase or decrease capacity as you go
      ❑ Automatic balancing for elasticity
      ❑ Horizontally scalable
      [Diagram labels: Shard 1, Shard 2, Shard 3, … Shard N]
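The difference between the hashed and ranged policies can be sketched in a few lines of plain Python. This is a conceptual toy, not MongoDB's routing logic: the shard count, the range boundaries, and the use of MD5 as the hash are all assumptions made for the example.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed cluster size


def hashed_shard(key: str) -> int:
    """Hashed policy: a stable hash of the shard key picks the shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Ranged policy: each shard owns a contiguous range of key values.
# These are the exclusive upper bounds of shards 0..2; shard 3 takes the rest.
RANGE_BOUNDS = ["g", "n", "t"]


def ranged_shard(key: str) -> int:
    """Ranged policy: locate the range that contains the key."""
    return bisect.bisect_right(RANGE_BOUNDS, key)


keys = ["alice", "bob", "mallory", "zoe"]
hashed = {k: hashed_shard(k) for k in keys}
ranged = {k: ranged_shard(k) for k in keys}
```

Ranged placement keeps neighbouring keys together, which helps range queries, while hashed placement spreads them out, which helps distribute writes evenly.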
  • 24. Mongo Spark Connector
  • 25. Mongo Spark Connector: most important features
      ❑ Ability to read/write BSON documents directly from/to MongoDB
      ❑ Automatic conversion from a MongoDB collection to a Spark RDD (also DataFrame and Dataset)
      ❑ Predicate pushdown:
          ❑ Filters (e.g. where conditions) and selects are pushed down to the data source, so the actual filtering and projection are done on the MongoDB node before the data is returned to the Spark node.
      ❑ Integration with the MongoDB aggregation pipeline:
          ❑ A MongoRDD accepts a MongoDB pipeline, so aggregations run on the MongoDB nodes instead of the Spark nodes. However, most of the work is performed automatically by the connector.
      ❑ Data locality:
          ❑ If the Spark nodes and the MongoDB nodes (in a sharded cluster configuration) are deployed on the same servers, data is loaded according to its locality in the cluster, avoiding costly network transfers.
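As a rough illustration of what pushdown buys, the sketch below writes a filter and a projection as MongoDB aggregation stages (`$match` and `$project` are real MongoDB operators) and applies them to a few in-memory documents. The field names `f1`, `f2`, and `label`, and the tiny `run_locally` interpreter, are hypothetical; in a real deployment the stages would run inside MongoDB, not in Python.

```python
import json

# Hypothetical field names; the real collection schema is not shown on the slides.
pipeline = [
    {"$match": {"label": "attack"}},                 # filter on the MongoDB side
    {"$project": {"_id": 0, "f1": 1, "label": 1}},   # return only the needed fields
]


def run_locally(docs, stages):
    """Tiny interpreter for equality $match and inclusion $project, for illustration only."""
    for stage in stages:
        if "$match" in stage:
            cond = stage["$match"]
            docs = [d for d in docs if all(d.get(k) == v for k, v in cond.items())]
        elif "$project" in stage:
            keep = [k for k, v in stage["$project"].items() if v == 1]
            docs = [{k: d[k] for k in keep if k in d} for d in docs]
    return docs


docs = [
    {"_id": 1, "f1": 0.2, "f2": 7.0, "label": "normal"},
    {"_id": 2, "f1": 0.9, "f2": 3.5, "label": "attack"},
    {"_id": 3, "f1": 0.8, "f2": 1.1, "label": "attack"},
]
result = run_locally(docs, pipeline)

# Serialized form of the pipeline, e.g. for passing it along as a string.
pipeline_json = json.dumps(pipeline)
```

Three documents with four fields go in and two documents with two fields come out, so less data has to cross the network to the Spark side.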
  • 26. Reference Architecture for MongoDB & Spark
      ❑ Apache Spark
      ❑ MongoDB Connector for Spark
      ❑ MongoDB nodes
      ❑ Data locality (Spark workers and MongoDB nodes on the same machine)
  • 27. Case study architecture
      ❑ Full cloud architecture
      ❑ Databricks Community
      ❑ MongoDB Atlas
  • 28. Case study configuration
      ❑ Databricks Community Edition
          ❑ https://databricks.com/try-databricks
          ❑ Databricks Runtime 5.5 LTS (Spark 2.4.3 + Scala 2.11)
          ❑ Import the Maven library for the MongoDB Spark Connector:
              ❑ org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
      ❑ MongoDB Atlas
          ❑ https://www.mongodb.com/cloud/atlas
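A minimal sketch of how this configuration might be collected in one place, assuming the 2.x connector's `spark.mongodb.input.uri` and `spark.mongodb.output.uri` settings. The Maven coordinate comes from the slide; the host, credentials, database, and collection names are placeholders, and `mongo_spark_conf` is a hypothetical helper. On Databricks the library is usually attached to the cluster through the UI, so `spark.jars.packages` is shown only as the plain-Spark equivalent.

```python
from urllib.parse import quote_plus

# Maven coordinate taken from the slide.
CONNECTOR = "org.mongodb.spark:mongo-spark-connector_2.11:2.4.1"


def mongo_spark_conf(user, password, host, database, collection):
    """Build Spark settings for reading from and writing to one MongoDB collection."""
    # Credentials are percent-encoded so that characters like '@' do not break the URI.
    uri = (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}.{collection}"
    )
    return {
        "spark.jars.packages": CONNECTOR,
        "spark.mongodb.input.uri": uri,
        "spark.mongodb.output.uri": uri,
    }


# Placeholder values, not the speakers' actual cluster.
conf = mongo_spark_conf("user", "p@ss", "cluster0.example.mongodb.net", "iot", "syn_dos")
```

In practice the password should come from a secret store rather than from source code.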
  • 29. CASE STUDY: SYN-DOS attack prediction
  • 30. Cyber attacks
      ❑ They can undermine:
          ❑ Confidentiality
          ❑ Integrity
          ❑ Availability
      ❑ DoS (Denial of Service) attacks undermine Availability
      ❑ The SYN-DOS attack (also called SYN flood) undermines availability by saturating the server's TCP/IP connections
  • 31. SYN-DOS Attack
      The TCP three-way handshake that the attack exploits:
      1. Client requests a connection by sending a SYN (synchronize) message to the server.
      2. Server acknowledges by sending a SYN-ACK (synchronize-acknowledge) message back to the client.
      3. Client responds with an ACK (acknowledge) message, and the connection is established.
      https://www.imperva.com/learn/application-security/syn-flood/
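The three steps above, and what happens when step 3 never arrives, can be sketched with a toy server in plain Python. `ToyServer`, its backlog size, and the client names are hypothetical; real TCP stacks also apply timeouts and defences such as SYN cookies, which this sketch leaves out.

```python
class ToyServer:
    """Toy TCP listener with a bounded queue of half-open connections."""

    def __init__(self, backlog: int):
        self.backlog = backlog
        self.half_open = set()    # clients that sent SYN and were sent SYN-ACK
        self.established = set()  # clients that completed the handshake

    def on_syn(self, client: str) -> bool:
        """Steps 1 and 2: accept the SYN and reply SYN-ACK if the queue has room."""
        if len(self.half_open) >= self.backlog:
            return False          # queue saturated: the SYN is dropped
        self.half_open.add(client)
        return True

    def on_ack(self, client: str) -> bool:
        """Step 3: the final ACK turns a half-open connection into an established one."""
        if client in self.half_open:
            self.half_open.remove(client)
            self.established.add(client)
            return True
        return False


server = ToyServer(backlog=3)

# Legitimate client: SYN, then ACK, so its queue slot is freed.
server.on_syn("good-1")
server.on_ack("good-1")

# Attacker: many SYNs from spoofed addresses, never followed by an ACK.
for i in range(5):
    server.on_syn(f"spoofed-{i}")

# A new legitimate client is now refused because the queue is full.
accepted = server.on_syn("good-2")
```

The attacker never needs to complete a connection: filling the half-open queue is enough to deny service to legitimate clients.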
  • 32. Dataset & Reference
      ❑ Dataset description
          ❑ 115 features (Double)
          ❑ 1 label (String)
          ❑ 11,000 total samples (10,000 normal + 1,000 attack)
          ❑ Features contain statistics that implicitly describe the current state of the channel
          ❑ Data come from IP cameras
          ❑ The statistics are generated by a Feature Extractor
      ❑ Syn-Dos
          ❑ Paper: https://arxiv.org/pdf/1802.09089.pdf
          ❑ Full dataset: https://drive.google.com/drive/folders/1kmoWY4poGWfmmVSdSu-r_3Vo84Tu4PyE
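With 10,000 normal and 1,000 attack samples the classes are imbalanced, which affects how accuracy should be read. The arithmetic below is a sketch using only the counts from the slide; the "balanced" class-weight heuristic is a common convention and an assumption here, since the slides do not say whether the speakers reweighted the classes.

```python
n_normal, n_attack = 10_000, 1_000
n_total = n_normal + n_attack

# Accuracy of a trivial classifier that always predicts "normal".
baseline_accuracy = n_normal / n_total

# "Balanced" class weights: n_total / (n_classes * n_samples_in_class).
n_classes = 2
weights = {
    "normal": n_total / (n_classes * n_normal),
    "attack": n_total / (n_classes * n_attack),
}

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"class weights: {weights}")
```

A model therefore has to beat roughly 91% accuracy to be better than always answering "normal", which is why precision and recall on the attack class are more informative than raw accuracy.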
  • 33. Let’s code!
  • 34. Cyber attacks
      ❑ Access the MongoDB Atlas console:
          ❑ https://cloud.mongodb.com
          ❑ View the Collections
      ❑ Access Databricks
          ❑ Create the cluster
          ❑ Import the Maven library: org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
          ❑ Notebook: create the Collection on MongoDB
          ❑ Notebook: training
  • 35. Useful links
      ❑ https://spark.apache.org/docs/latest/
      ❑ https://spark.apache.org/docs/latest/ml-guide.html
      ❑ https://spark.apache.org/docs/latest/ml-classification-regression.html
      ❑ https://docs.databricks.com/getting-started/index.html
      ❑ https://www.mongodb.com/it
      ❑ https://databricks.com/try-databricks
      ❑ https://www.mongodb.com/cloud/atlas
      ❑ https://docs.mongodb.com/spark-connector/
  • 36. Thank you for your attention
      valerio.morfino@dbservices.it, https://it.linkedin.com/in/valerio-morfino
      orlando.moroni@dbservices.it, https://www.linkedin.com/in/orlandomoroni