In the Internet of Everything, huge volumes of multimedia data are generated at very high rates by heterogeneous sources in various formats, such as sensor readings, process logs, and structured data from RDBMSs. The need of the hour is to set up efficient data pipelines that compute advanced analytics models on the data and use the results to customize services, predict future needs, or detect anomalies. This webinar explores the TOREADOR conversational, service-based approach to the easy design of efficient and reusable analytics pipelines that can be automatically deployed on a variety of cloud-based execution platforms.
Tableau’s predictive modeling feature allows users to leverage powerful statistical models to build and update predictive models efficiently, while giving them the flexibility to select their predictors, combine the model results with other table calculations, and comprehend and examine large volumes of data. Go through this presentation to see the feature in action.
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology (InfiniteGraph)
Join Oracle NoSQL DB and InfiniteGraph development teams in a discussion of the latest trends in Big Data and Graph Technology. Learn what Oracle’s view of Big Data is and how Oracle NoSQL Database technologies enable you to manage vast amounts of real-time key-value data.
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja... (Data Con LA)
In this data age, business applications generate big data. To generate value out of large-scale data applications, data models are key. Data models serve various purposes, and it is essential to deliver reliable insights in a timely fashion. This session covers the technical aspects of leveraging Spark's distributed engine to process big data and generate insights, including a few ways to optimize processes with Spark SQL. Come join me to explore the process of making data interesting!
Scanner Data
In these slides the author presents the issues and challenges related to dealing with datasets of big size such as those involved in the Scanner Data project at Istat. He illustrates IT architecture backing the testing phase of the project, currently in place, and the ideas for the production architecture. The motivations behind the design are explained as well as the solutions introduced as part of a larger scope approach to the modernization of tools and techniques used for data storage and processing in Istat, envisioning the future challenges posed by the adoption of Big Data and Data Science in NSIs.
http://www.istat.it/en/archive/168897
http://www.istat.it/it/archivio/168890
AzureDay - Introduction Big Data Analytics (Łukasz Grala)
AzureDay North 2016. Conference about cloud solutions.
What is analytics? What is Big Data? Why do we keep Big Data in the cloud? What does Microsoft offer for Big Data Analytics? How do you start with Big Data Analytics or Advanced Analytics? This session introduces the fundamentals of Big Data and Advanced Analytics.
By Data Scientist as a Service
This presentation contains a broad introduction to big data and its technologies.
Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.
Big Data is a phrase used to mean a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big or it moves too fast or it exceeds current processing capacity.
Applied Data Science Part 3: Getting dirty; data preparation and feature crea... (Dataiku)
In our 3rd applied machine learning online course, we'll dive into different methods for data preparation, including handling missing values, dummification and rescaling.
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ... (Dataconomy Media)
Anne-Sophie Roessler, International Business Developer at Dataiku presented "3 ways to Fail your Data Lab Implementation" as part of the Big Data, Berlin v 8.0 meetup organised on the 14th of July 2016 at the WeWork headquarters.
TUW-ASE Summer 2015: Advanced service-based data analytics: Models, Elasticit... (Hong-Linh Truong)
This is a lecture from the advanced service engineering course from the Vienna University of Technology. See http://dsg.tuwien.ac.at/teaching/courses/ase
Sustainability Investment Research Using Cognitive Analytics (Cambridge Semantics)
In this webinar, Anthony J. Sarkis, Chief Strategy Officer at Parabole, and Steve Sarsfield, VP Product at Cambridge Semantics, explore how portfolio managers are using the recently developed Parabole/AnzoGraph DB integration as their underlying infrastructure for conducting ML and cognitive analytics at scale, exploiting data to identify potential risks and new opportunities.
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric (Cambridge Semantics)
Watch this webinar to learn about the benefits of using semantic and graph database technology to create a Data Catalog of all of an enterprise's data, regardless of source or format. Such a catalog is part of a modern IT or data management stack and an important step toward building an Enterprise Data Fabric.
In this Strata+Hadoop World 2015 presentation, Ron Bodkin, President of Think Big, a Teradata company, explains changes for data modeling on big data systems and five important new analytic patterns becoming more commonplace as companies grow their data driven capabilities.
Retail banks are moving beyond the data warehouse and data lake and are now implementing data fabric architectures to address data discovery and integration challenges.
These are the slides from our webinar "Modern Data Discovery and Integration in Retail Banking" in which we explore the role of the data discovery and integration layer in a data fabric with special focus on evolution from data warehouse to data fabric, semantics and graph data models in data fabric and example use cases in retail banks and B2C financial services.
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form... (Alistair Hamilton)
Presentation by Al Hamilton and Cody Johnson to Canberra Semantic Web Meetup Group on why producers of official statistics are interested in semantic web community (including Linked Open Data) and outlining experimental work by Cody Johnson on transforming selected Population Census data released by the ABS in SDMX-ML to RDF Data Cube Vocabulary format.
This talk provides an overview of big data software engineering and software engineering for big data, as the two fields need to be integrated. The interplay between the two research fields, applications of data science and software engineering, will enhance future perspectives for safe, secure, and sustainable approaches to data science, and for applying data science to the 50 years of software engineering data that already exist.
Large corporations have to master vast amounts of heterogeneous data in order to stay competitive. While existing approaches have attempted to consolidate and manage the data by forcing it into a single shared data model, data lakes recently emerged that instead provide a central storage point for holding all data sets in their original form.
In this talk, we present eccenca CorporateMemory, which extends the data lake paradigm with a semantic integration layer for managing diverse, but semantically enriched data. eccenca CorporateMemory builds an extensible knowledge graph that employs RDF vocabularies for transforming and linking multiple datasets in order to generate an integrated semantic understanding of the data.
Robert Isele | Head of Data Integration Unit at eccenca GmbH
Presentation at Semantics 2016 in Leipzig in the context with the results of the LEDS project
What exactly is big data? The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three Vs. Put simply, big data is larger, more complex data sets, especially from new data sources.
Types of database processing: OLTP vs. Data Warehouses (OLAP)
Characteristics of a Data Warehouse:
• Subject-oriented
• Integrated
• Time-variant
• Non-volatile
Functionalities of a Data Warehouse:
• Roll-up (consolidation)
• Drill-down
• Slicing
• Dicing
• Pivot
The KDD process; applications of Data Mining
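Two of these cube operations can be sketched in plain Python, using a hypothetical (year, quarter, region) sales cube invented for illustration; real OLAP engines implement the same operations over multidimensional indexes:

```python
from collections import defaultdict

# Tiny fact table: (year, quarter, region) -> sales (illustrative numbers).
sales = {
    (2023, "Q1", "EU"): 100, (2023, "Q2", "EU"): 120,
    (2023, "Q1", "US"): 90,  (2023, "Q2", "US"): 110,
    (2024, "Q1", "EU"): 130, (2024, "Q1", "US"): 95,
}

def roll_up(cube):
    """Roll-up (consolidation): aggregate the quarter dimension away,
    keeping the coarser (year, region) view."""
    out = defaultdict(int)
    for (year, quarter, region), v in cube.items():
        out[(year, region)] += v
    return dict(out)

def slice_cube(cube, year):
    """Slice: fix one dimension (year) to a single value,
    keeping the remaining dimensions."""
    return {(q, r): v for (y, q, r), v in cube.items() if y == year}

print(roll_up(sales)[(2023, "EU")])   # 220
print(slice_cube(sales, 2024))        # only the 2024 cells remain
```

Drill-down is the inverse of roll-up (returning to the finer quarter level), and dicing is slicing on several dimensions at once.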
Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
The common BI/Big Data challenges and solutions were presented by seasoned experts Andriy Zabavskyy (BI Architect) and Serhiy Haziyev (Director of Software Architecture).
This was a complimentary workshop where attendees had the opportunity to learn, network and share knowledge during the lunch and education session.
Similar to BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Damiani) (20)
Big Data lies at the core of the strong data economy that is emerging in Europe. Although both large enterprises and SMEs acknowledge the potential of Big Data to disrupt markets and business models, this is not reflected in the growth of the data economy. The lack of trusted, secure, ethically-driven personal data platforms and privacy-aware analytics hinders the growth of the data economy and creates concerns. The main considerations relate to the secure sharing of personal and proprietary/industrial data, and to the definition of a fair remuneration mechanism able to capture, produce, release and cash out the value of data, always for the benefit of all the involved stakeholders.
This webinar will focus on how such concerns that pertain to privacy, ethics and intellectual property rights can be tackled, by allowing individuals to take ownership and control of their data and share them at will, through flexible data sharing and fair compensation schemes with other entities (companies or not), as researched by the DataVaults project.
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an... (Big Data Value Association)
Today’s data marketplaces are large, closed ecosystems that are in the hands of a few established players, or of a consortium that decides on the rules, policies, etc.
Yet, the main barrier of the European data economy is the fact that current data spaces and marketplaces are “siloes”, without support for data exchange across their boundaries.
This webinar reveals how these boundaries can be overcome through the i3-MARKET “backplane”, which is an infrastructure able to connect all the stakeholders providing the suitable level of trust (consensus-based self-governing, auditability, reliability, verifiable credentials), security (P2P encryption, cryptographic proofs) and privacy (self-sovereign identity, zero-knowledge proof, explicit user consent).
Three pillars for building a Smart Data Ecosystem: Trust, Security and Privacy (Big Data Value Association)
Market into context - Three pillars for building a Smart Data Ecosystem: Trus... (Big Data Value Association)
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ... (Big Data Value Association)
The objective of the workshop is to highlight the need for pan-European skill recognition for Big Data that stimulates mobility and fulfils the definition of overarching Learning Objectives & Overarching Learning Impacts. It is also meant to gather feedback on the formats being prepared, namely the usage of Badges, the Label, and the EIT Label for professionals.
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie... (Big Data Value Association)
EIT Label intro by Roberto Prieto
Muluneh Oli (EIT Digital)
BDV Skills Accreditation - Definition and ensuring of digital roles and compe... (Big Data Value Association)
BigDataPilotDemoDays - I-BiDaaS Application to the Manufacturing Sector Webinar (Big Data Value Association)
The new data-driven industrial revolution highlights the need for big data technologies to unlock the potential in various application domains. To this end, BDV PPP projects I-BiDaaS, BigDataStack, Track & Know and Policy Cloud deliver innovative technologies to address the emerging needs of data operations and applications. To fully exploit the sustainability and take full advantage of the developed technologies, the projects onboarded pilots that exhibit their applicability in a wide variety of sectors. In the Big Data Pilot Demo Days, the projects will showcase the developed and implemented technologies to interested end-users from the industry as well as technology providers, for further adoption.
One of the main goals of the I-BiDaaS project is to provide a Big Data as a self-service solution that will empower the actual employees of European companies in targeted sectors (banking, manufacturing, telecom), i.e., the true decision-makers, with the insights and tools they need in order to make the right decisions in an agile way. In this big data pilot webinar, we will demonstrate in a step by step fashion the I-BiDaaS self-service solution and its application to the banking sector. In more detail, we will present an overview of the I-BiDaaS project focusing on the requirements of the CaixaBank pilot study, the I-BiDaaS architecture with its core technologies, and a step by step demo of the I-BiDaaS solution. Last but not least, we will show through CaixaBank's success story how I-BiDaaS can resolve data availability, data sharing, and breaking silos challenges in the banking domain.
At the heart of this DataBench webinar is the goal to share a benchmarking process helping European organisations developing Big Data Technologies to reach for excellence and constantly improve their performance, by measuring their technology development activity against parameters of high business relevance.
The webinar aims to provide the audience with a framework and tools to assess the performance and impact of Big Data and AI technologies, by providing real insights coming from DataBench. In addition, representatives from other projects part of the BDV PPP such as DeepHealth and They-Buy-for-You will participate to share the challenges and opportunities they have identified on the use of Big Data, Analytics, AI. The perspective of other projects that also have looked into benchmarking, such as Track&Now and I-BiDaaS will be introduced.
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi... (Big Data Value Association)
The problem of radicalisation is very high on the European agenda as increasing numbers of young European radicals return from Syria and use the internet to disseminate propaganda. To enable policy makers to design policies that address radicalisation effectively, the Policy Cloud consortium will collect data from social media and other sources, including the open-source Global Terrorism Database (GTD), the Onion City search engine which accesses data over TOR dark web sites, and Twitter (through Firehose). The data will be analysed using sentiment analysis and opinion mining software.
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli... (Big Data Value Association)
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps avoid duplicate computations and thus also reduces iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
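As a baseline for the optimizations above, here is a minimal, non-optimized power-iteration PageRank in Python. The explicit dangling-node term is exactly the "dead end" case that the levelwise method must eliminate before decomposing the graph; the example graph and constants are illustrative only:

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=100):
    """Plain power-iteration PageRank on an adjacency dict.

    graph: {node: [out-neighbours]}. Dangling nodes (empty out-list)
    redistribute their rank uniformly over all nodes each iteration.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dead ends, spread uniformly so the total stays 1.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for w in graph[v]:
                    new[w] += share
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:  # L1 convergence
            return new
        rank = new
    return rank

g = {"a": ["b"], "b": ["c"], "c": ["a", "b"], "d": []}  # "d" is a dead end
r = pagerank(g)
print(round(sum(r.values()), 6))  # ranks remain a probability distribution: 1.0
```

Every optimization in the paragraph above (convergence skipping, chain short-circuiting, componentwise computation) is a way to avoid doing this full per-iteration sweep over all vertices.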
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Damiani)
1. Designing Big Data Pipelines
Claudio Ardagna, Paolo Ceravolo, Ernesto Damiani
2. Big Data
• A huge amount of data is generated and collected every minute (sensors)
• 1.7 million billion bytes of data, over 6 megabytes for each human (2016)
• 2.5 quintillion bytes of data created each day
• The trend is rapidly accelerating with the growth of the Internet of Things (IoT): 200 billion connected devices by 2020
• Low-latency access to huge distributed data sources has become a value proposition
• Business intelligence applications require proper big data analysis and management functionalities
4. The Big Data Difference
• Classic analytics assume:
  • Standard data models/formats
  • Reasonable volumes
  • Loose deadlines
• Problem: the five Vs jeopardise these assumptions (unless we sample or summarize)
6. Processing Models: Batch vs Stream
• Batch: receive, accumulate, then compute (data lake)
• Stream: compute while receiving (data flow)
• Same questions, different algorithms
• Both different from “mouse” computations
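The batch/stream contrast can be made concrete with a toy example: answering the same question ("what is the average sensor reading?") with a batch algorithm that accumulates first and a streaming algorithm that updates while receiving. This is an illustrative sketch, not TOREADOR code.

```python
def batch_mean(readings):
    # Batch: receive and accumulate everything first (the "data lake"),
    # then compute over the whole collection.
    data = list(readings)
    return sum(data) / len(data)

class StreamingMean:
    # Stream: compute while receiving (the "data flow"); only O(1) state
    # is kept, no matter how many readings arrive.
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, reading):
        self.count += 1
        self.mean += (reading - self.mean) / self.count
        return self.mean

readings = [3.0, 1.0, 4.0, 1.0, 5.0]
stream = StreamingMean()
for r in readings:
    running = stream.update(r)
# Same question, different algorithms, same answer.
```

For a simple aggregate the two agree exactly; the practical difference is that the streaming version answers at any point during ingestion and never stores the full data set.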
7. Hurdles in Adoption of Big Data Technologies
• Complex architecture
• Lack of standardization
• Regulatory barriers
  • Violation of data access, sharing & custody regulation
  • High cost of legal clearance
8. Big Data as a Service
• A set of automatic tools and a methodology that allows customers to design and deploy a full Big Data pipeline addressing their goals
9. How to Design a Big Data Pipeline
1. Define a business value
2. Identify the data sources
3. Define the data flow
4. Study data protection directives
5. Define visualization, reporting and interaction
6. Select data preparation stages
7. Identify processing requirements
8. Select analytics
9. Define the data processing flow
10. Big Data Pipeline Areas
• Ingestion and representation: specify how data are represented (NoSQL, graph-based, relational, extended relational, markup-based, hybrid)
• Preparation: specify how to prepare data for analytics (anonymize, reduce dimensions, hash)
• Processing: specify how data will be routed and parallelized, and how the analytics will be computed (parallel batch, stream, hybrid)
• Analytics: specify the expected outcome (descriptive, prescriptive, predictive)
• Display and reporting: specify the display and reporting of the results (scalar, multi-dimensional)
11. Model-Driven Approach
• Abstract the typical procedural models (e.g., data pipeline) implemented in big data frameworks
• Develop model transformations to translate modelling decisions into actual provisioning
• Declarative models: (non-)functional goals, the service goals of the Big Data pipeline, i.e., what the BDA should achieve
• Procedural models: how to achieve the objectives, i.e., how the BDA process should work
• Deployment models: how the pipeline is incarnated on a concrete execution platform
12. Declarative Model
• Specify non-functional/functional goals
• A single model addressing all aspects of big data pipelines: preparation, representation, analytics, processing, display and reporting
  • Aspects of different areas may impact on the same procedural model template
• Some goals map directly to Service Level Agreements (SLAs); others need a transformation function to map to SLAs
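A declarative model of this kind can be pictured as a small specification with one entry per pipeline area, plus a transformation function that selects procedural templates from the stated goals. The sketch below is hypothetical: the field names, goal values, and template names are invented for illustration and are not TOREADOR syntax.

```python
# Hypothetical declarative model: one entry per pipeline area.
declarative_model = {
    "representation": "NoSQL",
    "preparation": ["anonymize", "reduce_dimensions"],
    "processing": {"mode": "stream", "max_latency_ms": 500},  # maps to an SLA
    "analytics": "predictive",
    "display": "multi-dimensional",
}

def select_template(model):
    # Toy transformation function: the processing goal drives the choice
    # among (invented) procedural template alternatives.
    if model["processing"]["mode"] == "stream":
        return "stream-analytics-template"
    return "parallel-batch-template"

chosen = select_template(declarative_model)
```

The point of the sketch is the separation of concerns: the model states goals only, and the selection logic (here a single `if`) stands in for the mapping from declarative goals to procedural templates.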
13. Procedural Model
• Contains all the information needed for running the analytics
• Makes it simple to map declarative goals onto procedures
• Platform independent
• Specified as procedural templates (alternatives)
  • Procedural templates correspond to defined goals
  • May need additional input from the final users of big data services
  • Templates express the competences of data scientists and data technologists
  • Declarative models are used to select the proper (set of) templates
14. Deployment Model
• Specify how procedural models are to be incarnated in a ready-to-be-deployed architecture
• Drive analytics execution in real scenarios
• To be defined for each application
• Platform dependent
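As a sketch of what "platform dependent" means here, a deployment model can bind a chosen procedural template to concrete resources on one execution platform. Everything below (platform name, resource fields, topic and store names) is invented for illustration, not an actual TOREADOR deployment descriptor.

```python
# Hypothetical deployment model: one ready-to-deploy configuration per
# procedural template, defined per application and per platform.
deployment_model = {
    "stream-analytics-template": {
        "platform": "spark-on-yarn",                    # platform dependent
        "executors": 8,
        "input": {"kind": "kafka", "topic": "sensor-readings"},
        "output": {"kind": "nosql", "store": "results"},
    },
}

def deployment_plan(template):
    # Incarnate the procedural template as a concrete architecture.
    cfg = deployment_model[template]
    return f"deploy {template} on {cfg['platform']} with {cfg['executors']} executors"

plan = deployment_plan("stream-analytics-template")
```

Targeting a different cloud platform would mean swapping only this last layer; the declarative and procedural models above it stay unchanged.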