Designing Big Data Pipelines
Applying the TOREADOR Methodology
BDVA webinar
Claudio Ardagna, Paolo Ceravolo, Ernesto Damiani
Methodology again
[Overview diagram] Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution
Supporting components of the TOREADOR Platform: Declarative Specifications, Service Catalog, Service Composition Repository, Deployment Configurations, running on the Big Data Platform (code-based and recipe-based entry points)
Legend: DS = Declarative Specification, SS = Service Selection, SC = Service Composition, WC = Workflow Compilation, E = Execution
Sample Scenario
• Infrastructure for pollution monitoring managed by Lombardia Informatica, an agency of the Lombardy region in Italy.
• A network of sensors acquires pollution data every day:
• sensors, containing information about a specific sensor, such as its ID, pollutant type, and unit of measure
• data acquisition stations, managing a set of sensors and information regarding their position (e.g. longitude/latitude)
• pollution values, containing the values acquired by the sensors, the timestamp, and the validation status. Each value is validated by a human operator who manually labels it as valid or invalid.
• The goal is to design and deploy a Big Data pipeline to:
• predict the labels of acquired
data in real time
• alert the operator when
anomalous values are observed
Reference Scenario
Key Advances
• Batch and stream support
Guide the user in selecting a consistent set of services
for both batch and stream computations
• Platform independence
Use a smart compiler for generating executable computations for different target platforms
• End-to-end verifiability
Include an end-to-end procedure for checking consistency of model specifications
• Model reuse and refinement
Store declarative, procedural and deployment models as templates to replicate or extend designs
[Pipeline diagram] Sensor Data → Queue (Kafka) → Compute predictive label (Spark) → Store (HBase) → Display/Query
Without the methodology..
• Draft the pipeline stages
• Identify the technology
• Develop the scripts
• Deploy
Slow, error-prone, and difficult to reuse…
• The pipeline includes two processing stages: a training stage and a prediction stage
• Our declarative model (DM) will include two requirement specifications:
DS1 (batch training):
DataPreparation.DataTransformation.Filtering;
DataAnalytics.LearningApproach.Supervised;
DataAnalytics.LearningStep.Training;
DataAnalytics.AnalyticsAim.Regression;
DataProcessing.AnalyticsGoal.Batch.
DS2 (stream prediction):
DataAnalytics.LearningApproach.Supervised;
DataAnalytics.LearningStep.Prediction;
DataAnalytics.AnalyticsAim.Regression;
DataProcessing.AnalyticsGoal.Streaming.
Declarative Model
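To make this concrete, here is a minimal sketch (ours, not part of the TOREADOR platform) of how such specifications can be represented and matched against a service catalog; the catalog entries and the subset-based matching rule are illustrative assumptions:

    # Declarative specifications as sets of dotted feature paths.
    DS1 = {
        "DataPreparation.DataTransformation.Filtering",
        "DataAnalytics.LearningApproach.Supervised",
        "DataAnalytics.LearningStep.Training",
        "DataAnalytics.AnalyticsAim.Regression",
        "DataProcessing.AnalyticsGoal.Batch",
    }
    DS2 = {
        "DataAnalytics.LearningApproach.Supervised",
        "DataAnalytics.LearningStep.Prediction",
        "DataAnalytics.AnalyticsAim.Regression",
        "DataProcessing.AnalyticsGoal.Streaming",
    }

    # Hypothetical catalog: each abstract service declares the features it covers.
    SERVICE_CATALOG = {
        "spark-gbt-train": {"DataAnalytics.LearningStep.Training",
                            "DataProcessing.AnalyticsGoal.Batch"},
        "spark-gbt-predict": {"DataAnalytics.LearningStep.Prediction",
                              "DataProcessing.AnalyticsGoal.Streaming"},
    }

    def select_services(ds, catalog):
        """Return services whose declared features are all required by the spec."""
        return [name for name, feats in catalog.items() if feats <= ds]

    print(select_services(DS1, SERVICE_CATALOG))  # -> ['spark-gbt-train']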
• Based on the Declarative Models, the TOREADOR Service Selection (SS) returns a set of services consistent with DS1 and DS2
• The user can easily compose these services to address the scenario’s goals
Procedural Model
DS1 → SS → SC1; DS2 → SS → SC2
• The two compositions must be connected, as the egestion (output) of SC1 is the ingestion (input) of SC2
Procedural Model
DS1 → SS → SC1; DS2 → SS → SC2
• The TOREADOR compiler translates SC1 and SC2 into executable orchestrations in a suitable workflow language
Deployment Model
DS1 → SS → SC1 → WC1; DS2 → SS → SC2 → WC2
spark-filter-sensorsTest : filter
--expr="sensorsDF#SensorId === 5958"
--inputPath="/user/root/sensors/joined.csv"
--outputPath="/user/root/sensors test.csv" &&
spark-assemblerTest : spark-assembler
--features="Data,Quote"
--inputPath="/user/root/sensors test.csv"
--outputPath="/user/root/sensors/sensors test assembled.csv" &&
spark-gbt-predict : batch-gradientboostedtree-classification-predict
--inputPath=/user/root/sensors/sensors
--outputPath=/user/root/sensors/sensors
--model=/user/root/sensors/model
WC1, WC2 (1-n; the slide shows the same listing for both workflows)
Deployment
• The execution of WC2 produces the results
Deployment Model
DS1 → SS → SC2 → WC2 → E2
The Code-based Line
Code Once/Deploy Everywhere
The TOREADOR Code-based line user is an expert programmer, aware of the potential (flexibility and controllability) and purposes (analytics developed from scratch or migration of legacy code) of a code-based approach.
She expresses the parallel computation of a coded algorithm in terms of parallel primitives.
TOREADOR distributes it among computational nodes hosted by different Cloud environments.
The resulting computation can be saved as a service for the Service-based line.
I. Code → II. Transform → III. Deploy
Skeleton-Based
Code Compiler
Code-based compiler
import math
import random

def data_parallel_region(distr, func, *repl):
    return [func(x, *repl) for x in distr]

def distance(a, b):
    """Computes euclidean distance between two vectors"""
    return math.sqrt(sum([(x[1] - x[0]) ** 2 for x in zip(a, b)]))

def kmeans_init(data, k):
    """Returns initial centroids configuration"""
    return random.sample(data, k)

def kmeans_assign(p, centroids):
    """Returns the given instance paired to key of nearest centroid"""
    comparator = lambda x: distance(x[1], p)
    # The slide snippet is truncated here; a plausible completion pairs p
    # with the index of its nearest centroid:
    return min(enumerate(centroids), key=comparator)[0], p
Source Code → Skeleton → Secondary Scripts
[Diagram: the source code above is transformed into a parallel skeleton targeting patterns such as MapReduce, Bag of Tasks, Producer Consumer, …, plus the secondary scripts needed to run it.]
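As an illustration of the "code once, deploy everywhere" idea (our sketch, not actual TOREADOR compiler output), the sequential data_parallel_region primitive above can be re-targeted by swapping in a parallel implementation, here a process pool standing in for a Bag-of-Tasks backend:

    from multiprocessing import Pool

    def data_parallel_region(distr, func, *repl):
        # Parallel re-implementation of the primitive: the replicated
        # arguments are paired with every element of the distributed
        # collection and the calls are spread over worker processes.
        with Pool() as pool:
            return pool.starmap(func, [(x,) + repl for x in distr])

    def scale(x, factor):
        # Must be a top-level function so worker processes can pickle it.
        return x * factor

    if __name__ == "__main__":
        print(data_parallel_region(range(5), scale, 10))  # [0, 10, 20, 30, 40]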
Key Advances
• Batch and stream support
Guide the user in selecting a consistent set of services
for both batch and stream computations
• Platform independence
Use a smart compiler for generating executable computations for different target platforms
• End-to-end verifiability
Include an end-to-end procedure for checking consistency of model specifications
• Model reuse and refinement
Store declarative, procedural and deployment models as templates to replicate or extend designs
Or reach us at info@toreador-project.eu
2017
Want to give it a try? Stay Tuned!
http://www.toreador-project.eu/community/
Thank you
Declarative Model Definition
Declarative Models: vocabulary
• The declarative model offers a vocabulary for a computation-independent description of a BDA
• Organized in 5 areas
• Representation (Data Model, Data Type, Management, Partitioning)
• Preparation (Data Reduction, Expansion, Cleaning, Anonymization)
• Analytics (Analytics Model, Task, Learning Approach, Expected Quality)
• Processing (Analysis Goal, Interaction, Performance)
• Visualization and Reporting (Goal, Interaction, Data Dimensionality)
• Each specification can be structured in three levels:
• Goal: Indicator – Objective – Constraint
• Feature: Type – Sub Type – Sub Sub Type
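A minimal sketch (our encoding; the field names follow the slide, the example values are assumptions) of the two three-level specification shapes as data structures:

    from dataclasses import dataclass

    @dataclass
    class Goal:
        indicator: str   # e.g. "Execution_Time" (hypothetical value)
        objective: str   # e.g. "Minimize"
        constraint: str  # e.g. "< 5 minutes"

    @dataclass
    class Feature:
        type: str          # e.g. "Data_Analytics"
        sub_type: str      # e.g. "Analytics_Aim"
        sub_sub_type: str  # e.g. "Task.Crisp_Clustering"

    spec = Feature("Data_Analytics", "Analytics_Aim", "Task.Crisp_Clustering")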
Declarative Models
• A web-based GUI for
specifying the requirements
of a BDA
• No coding, for basic users
• Analytics services are
provided by the target
TOREADOR platform
• Big Data campaign built by
composing existing services
• Based on model
transformations
Declarative Models
• A web-based GUI for
specifying the requirements of
a BDA
• Data_Preparation.Data_Source_Model.Data_Model.Document_Oriented
• Data_Analytics.Analytics_Aim.Task.Crisp_Clustering
Declarative Models: machine readable
• A web-based GUI for
specifying the requirements of
a BDA
• Data_Preparation.Data_Source_Model.Data_Model.Document_Oriented
• Data_Analytics.Analytics_Aim.Task.Crisp_Clustering
…
"tdm:label": "Data Representation",
"tdm:incorporates": [
{
"@type": "tdm:Feature",
"tdm:label": "Data Source Model Type",
"tdm:constraint": "{}",
"tdm:incorporates": [
{
"@type": "tdm:Feature",
"tdm:label": "Data Structure",
"tdm:constraint": "{}",
"tdm:visualisationType": "Option",
"tdm:incorporates": [
{
"@type": "tdm:Feature",
"tdm:constraint": "{}",
"tdm:label": "Structured",
"$$hashKey": "object:21"
}
]
},
....
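A small sketch (ours) of how such a tdm:Feature tree can be walked to collect the selected feature paths; the fragment below mirrors the structure above, and the traversal is an assumption about how a client might consume the machine-readable model:

    doc = {
        "@type": "tdm:Feature",
        "tdm:label": "Data Representation",
        "tdm:incorporates": [
            {"@type": "tdm:Feature",
             "tdm:label": "Data Source Model Type",
             "tdm:incorporates": [
                 {"@type": "tdm:Feature",
                  "tdm:label": "Data Structure",
                  "tdm:incorporates": [
                      {"@type": "tdm:Feature", "tdm:label": "Structured"}
                  ]}
             ]}
        ],
    }

    def feature_paths(node, prefix=()):
        """Yield a dotted path for every leaf feature in a tdm:Feature tree."""
        path = prefix + (node["tdm:label"],)
        children = node.get("tdm:incorporates", [])
        if not children:
            yield ".".join(path)
        for child in children:
            yield from feature_paths(child, path)

    print(list(feature_paths(doc)))
    # ['Data Representation.Data Source Model Type.Data Structure.Structured']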
Interference Declaration
• A few examples
Data_Preparation.Anonymization.Technique.k-anonymity
→ ¬ Data_Analytics.Analytics_Quality.False_Positive_Rate.low
Data_Preparation.Anonymization.Technique.hashing
→ ¬ Data_Analytics.Analytics_Aim.Task.Crisp_Clustering.algorithm=k-means
Data_Representation.Storage_Property.Coherence_Model.Strong_Consistency
→ ¬ Data_Representation.Storage_Property.Partitioning
• Interference Declarations
• Boolean Interference: P → ¬Q
• Intensity of an Interference: D_P ∩ D_Q
• Interference Enforcement
• Fuzzy interpretation: max(1 − P, 1 − Q)
Consistency Check
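A toy sketch of the consistency check (our assumption of how the fuzzy rule can be evaluated): each term is given a degree of truth in [0, 1], and an interference P → ¬Q holds to degree max(1 − P, 1 − Q); a specification is flagged when this degree falls below a threshold:

    def interference_degree(p: float, q: float) -> float:
        """Fuzzy truth degree of the interference P -> not Q."""
        return max(1.0 - p, 1.0 - q)

    # Degrees to which the user's declarative model activates each term
    # (illustrative values, not from the TOREADOR platform):
    p = 1.0  # k-anonymity is requested
    q = 0.8  # a low false-positive rate is strongly required

    degree = interference_degree(p, q)
    if degree < 0.5:  # the threshold is an assumption
        print(f"inconsistent specification (degree {degree:.2f})")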
Service-Based Line
Methodology: Building Blocks
• Declarative Specifications allow customers to define declarative models
shaping a BDA and retrieve a set of compatible services
• Service Catalog specifies the set of abstract services (e.g., algorithms,
mechanisms, or components) that are available to Big Data customers and
consultants for building their BDA
• Service Composition Repository permits the specification of the procedural model defining how services can be composed to carry out the Big Data analytics
• Supports the specification of an abstract Big Data service composition
• Deployment Configurations define the platform-dependent version of a
procedural model, as a workflow that is ready to be executed on the target
Big Data platform
Overview of the Methodology
[Overview diagram] Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution
Supporting components: Declarative Specifications, Service Catalog, Service Composition Repository, Deployment Configurations (MBDAaaS Platform, running on the Big Data Platform)
Procedural Models
• Platform-independent models that formally and
unambiguously describe how analytics should be
configured and executed
• They are generated following goals and constraints
specified in the declarative models
• They provide a workflow in the form of a service
orchestration
• Sequence
• Choice
• If-then
• Do-While
• Split-Join
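As an illustration (our own encoding; TOREADOR serializes these orchestrations in OWL-S), a procedural model can be viewed as a nested structure of the control constructs above applied to abstract services; all service names here are hypothetical:

    # Nested-tuple encoding of a service orchestration (illustrative only).
    procedural_model = (
        "sequence",
        ("service", "filter"),
        ("service", "spark-assembler"),
        ("split-join",                    # parallel branches, joined afterwards
         ("service", "train-model"),
         ("service", "score-baseline")),
        ("do-while",
         ("service", "refine-model"),
         "quality < target"),             # abstract loop condition
    )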
• User creates the flow based on the list of returned services
• Services enriched with ad hoc parameters
• The flow is submitted to the service, which translates it into an OWL-S service composition
Service Composition
• All internals are made explicit
• Clear specification of the
services
• Reuse and modularity
Service Composition
Deployment Model Definition
Overview of the Methodology
[Overview diagram] Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution
Supporting components: Declarative Specifications, Service Catalog, Service Composition Repository, Deployment Configurations (MBDAaaS Platform, running on the Big Data Platform)
• It consists of two main sub-processes
• Structure generation: the compiler parses the procedural model and identifies
the process operators (sequence, alternative, parallel, loop) composing it
• Service configuration: for each service in the procedural model, the corresponding concrete service is identified and inserted into the deployment model
• Support transformations to any orchestration engine available as a
service
• Available for Oozie and Spring XD
Workflow compiler
• Workflow compiler takes as input
• the OWL-S service composition
• information on the target platform (e.g., installed services/algorithms),
• It produces as output an executable workflow
• For example an Oozie workflow
• XML file of the workflow
• job.properties
• System variables
Deployment Model
Translating the Composition Structure
• Deployment models:
• specify how procedural models are instantiated and configured on a target platform
• drive analytics execution in real scenarios
• are platform-dependent
• The workflow compiler transforms the procedural model into a deployment model that can be directly executed on the target platform.
• This transformation is based on a compiler that takes as input
• the OWL-S service composition
• information on the target platform (e.g., installed services/algorithms),
• and produces as output a technology-dependent workflow
Translating the Composition Structure
• The OWL-S service composition structure is mapped onto different control constructs
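A schematic sketch (ours; the emitted tags only gesture at Oozie's real workflow schema) of the structure-generation step: the compiler walks a composition like the one encoded earlier and maps each control construct onto a workflow fragment:

    def compile_node(node):
        """Recursively map a control construct onto a workflow XML fragment."""
        kind = node[0]
        if kind == "service":
            return f'<action name="{node[1]}"/>'
        if kind == "sequence":
            return "".join(compile_node(child) for child in node[1:])
        if kind == "split-join":
            branches = "".join(compile_node(child) for child in node[1:])
            return f"<fork>{branches}</fork><join/>"
        raise ValueError(f"unsupported construct: {kind}")

    model = ("sequence",
             ("service", "filter"),
             ("split-join", ("service", "train-model"),
                            ("service", "score-baseline")),
             ("service", "store"))
    print(compile_node(model))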
• Workflows contain 3 distinct types of placeholders
• GREEN placeholders are
SYSTEM variables defined in
Oozie properties
• RED placeholders are JOB
variables defined in file
job.properties
• YELLOW placeholders are
ARGUMENTS of executable
jobs on OOZIE Server
• More on the demo…
Generating an Executable Workflow
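A minimal sketch (ours) of the placeholder-filling step; the tag names, variable names, and values below are illustrative assumptions, not actual TOREADOR or Oozie output:

    from string import Template

    # Template fragment using the three placeholder kinds.
    fragment = Template(
        "<action name='filter'>"
        "<job-tracker>${jobTracker}</job-tracker>"  # SYSTEM variable (Oozie properties)
        "<app-path>${appPath}</app-path>"           # JOB variable (job.properties)
        "<arg>--inputPath=${inputPath}</arg>"       # ARGUMENT of the executable job
        "</action>"
    )

    values = {
        "jobTracker": "hadoop-master:8032",
        "appPath": "/user/root/workflows/filter",
        "inputPath": "/user/root/sensors/joined.csv",
    }
    print(fragment.substitute(values))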
Analytics Deployment Approach