
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Damiani)


In the Internet of Everything, huge volumes of multimedia data are generated at very high rates by heterogeneous sources in various formats: sensor readings, process logs, structured data from RDBMSs, and more. The need of the hour is to set up efficient data pipelines that compute advanced analytics models on the data and use the results to customize services, predict future needs, or detect anomalies. This webinar explores the TOREADOR conversational, service-based approach to the easy design of efficient and reusable analytics pipelines that can be automatically deployed on a variety of cloud-based execution platforms.

Published in: Data & Analytics


  1. Designing Big Data Pipelines (Claudio Ardagna, Paolo Ceravolo, Ernesto Damiani)
  2. Big Data
     • A huge amount of data is generated and collected every minute (sensors)
     • 1.7 million billion bytes of data, over 6 megabytes for each human (2016)
     • 2.5 quintillion bytes of data created each day
     • The trend is rapidly accelerating with the growth of the Internet of Things (IoT): 200 billion connected devices by 2020
     • Low-latency access to huge distributed data sources has become a value proposition
     • Business intelligence applications require proper big data analysis and management functionalities
  3. What to do with these data?
     • Aggregation and statistics: data warehouse and OLAP
     • Indexing, searching, and querying: keyword-based search, pattern matching (XML/RDF)
     • Knowledge discovery: data mining, statistical modeling, machine learning
  4. The Big Data difference
     • Classic analytics assume:
       • Standard data models/formats
       • Reasonable volumes
       • Loose deadlines
     • Problem: the five Vs jeopardise these assumptions (unless we sample or summarize)
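The sampling escape hatch mentioned above can be illustrated with reservoir sampling, which keeps a fixed-size uniform sample of a stream of unknown length in constant memory. This is a minimal generic sketch, not part of the TOREADOR toolkit:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(1_000_000), 10, seed=42)
print(len(sample))  # always exactly 10, however long the stream
```

The point for big data is that the memory footprint depends only on `k`, never on the volume of the incoming stream.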
  5. The data analytics pipeline
  6. Processing models: batch vs stream
     • Batch: receive, accumulate, then compute (data lake)
     • Stream: compute while receiving (data flow)
     • Same questions, different algorithms
     • Both different from “mouse” computations
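As a toy illustration of “same questions, different algorithms”, here is the mean of a dataset computed batch-style (accumulate everything, then compute) and stream-style (update a running answer as each item arrives). The function names are illustrative:

```python
def batch_mean(source):
    """Batch: materialize all data first, then compute (the 'data lake' step)."""
    data = list(source)
    return sum(data) / len(data)

def stream_mean(source):
    """Stream: update the answer per item, O(1) memory, no accumulation."""
    mean, n = 0.0, 0
    for x in source:
        n += 1
        mean += (x - mean) / n   # incremental mean update
    return mean

values = [2, 4, 6, 8]
print(batch_mean(values), stream_mean(iter(values)))  # 5.0 5.0
```

Both answer the same question, but only the streaming version works when the data cannot be held in memory or never stops arriving.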
  7. Hurdles in the adoption of Big Data technologies
     • Complex architecture
     • Lack of standardization
     • Regulatory barriers:
       • Violation of data access, sharing & custody regulation
       • High cost of legal clearance
  8. Big Data as a Service
     • A set of automatic tools and a methodology that allow customers to design and deploy a full Big Data pipeline addressing their goals
  9. How to design a Big Data pipeline
     1. Define a business value
     2. Identify the data sources
     3. Define the data flow
     4. Study data protection directives
     5. Define visualization, reporting and interaction
     6. Select data preparation stages
     7. Identify processing requirements
     8. Select analytics
     9. Define the data processing flow
  10. Big Data pipeline areas
     • Ingestion and representation: specify how data are represented (NoSQL, graph-based, relational, extended relational, markup-based, hybrid)
     • Preparation: specify how to prepare data for analytics (anonymize, reduce dimensions, hash)
     • Processing: specify how data will be routed and parallelized, and how the analytics will be computed (parallel batch, stream, hybrid)
     • Analytics: specify the expected outcome (descriptive, prescriptive, predictive)
     • Display and reporting: specify the display and reporting of the results (scalar, multi-dimensional)
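The five areas can be pictured as stages composed into a single callable pipeline. The stage functions below are hypothetical placeholders for illustration, not TOREADOR services:

```python
from functools import reduce

# Hypothetical stage implementations, one per pipeline area.
def ingest(_):      return [{"user": "u1", "value": 10}, {"user": "u2", "value": 30}]
def prepare(rows):  return [{**r, "user": "anon"} for r in rows]      # anonymize
def process(rows):  return rows                                       # e.g. route/parallelize
def analyze(rows):  return sum(r["value"] for r in rows) / len(rows)  # descriptive analytics
def report(result): return f"mean value: {result}"                    # scalar display

def compose(*stages):
    """Chain stages left to right into a single pipeline function."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)

pipeline = compose(ingest, prepare, process, analyze, report)
print(pipeline(None))  # mean value: 20.0
```

Keeping each area a separate, swappable stage is what lets a methodology select alternatives (batch vs stream processing, scalar vs multi-dimensional display) without rewriting the whole pipeline.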
  11. Model-driven approach
     • Abstract the typical procedural models (e.g., data pipeline) implemented in big data frameworks
     • Develop model transformations to translate modelling decisions into actual provisioning
     • Three model layers:
       • Declarative models: (non-)functional goals, i.e. what the BDA should achieve
       • Procedural models: how the BDA process should work
       • Deployment models: how to achieve the objectives on a target platform
  12. Declarative model
     • Specify non-functional/functional goals
     • Single model addressing all aspects of big data pipelines: preparation, representation, analytics, processing, display and reporting
     • Aspects of different areas may impact the same procedural model template
     • Some goals map directly to Service Level Agreements (SLAs), others need a transformation function to map to SLAs
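A declarative model of this kind could be serialized as a single goal specification covering all pipeline areas, with some goals translating directly into SLA clauses. The keys and values below are invented for illustration and do not reflect the actual TOREADOR vocabulary:

```python
import json

# Hypothetical declarative model: one document covering all pipeline areas.
declarative_model = {
    "representation": {"format": "json", "store": "nosql"},
    "preparation":    {"anonymize": True},
    "processing":     {"mode": "stream", "max_latency_ms": 500},
    "analytics":      {"outcome": "predictive"},
    "display":        {"kind": "multi-dimensional"},
}

def goals_to_sla(model):
    """Map the goals that translate directly into SLA clauses."""
    sla = {}
    if "max_latency_ms" in model["processing"]:
        sla["latency_ms"] = model["processing"]["max_latency_ms"]
    if model["preparation"].get("anonymize"):
        sla["privacy"] = "data must be anonymized before analytics"
    return sla

print(json.dumps(goals_to_sla(declarative_model), indent=2))
```

Goals such as a latency bound map one-to-one onto an SLA, while others (e.g. a predictive-analytics outcome) would need a transformation function before becoming measurable SLA terms.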
  13. Procedural model
     • Contains all information needed for running the analytics
     • Simple to map declarative goals onto procedures
     • Platform independent
     • Specified as procedural templates (alternatives)
     • Procedural templates correspond to defined goals
     • May need additional input from final users of big data services
     • Templates express the competences of data scientists and data technologists
     • Declarative models are used to select the (set of) proper templates
  14. Deployment model
     • Specifies how procedural models are to be incarnated in a ready-to-be-deployed architecture
     • Drives analytics execution in real scenarios
     • To be defined for each application
     • Platform dependent
  15. Methodology again
     • Flow: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model → Execution
     • Supporting artefacts: Declarative Specifications, Service Catalog, Service Composition Repository, Deployment Configurations
     • Platforms: MBDAaaS Platform, Big Data Platform
     • Tools: Tocode-based, Torecipies
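The flow above can be sketched as a chain of transformations from declarative goals to a deployable configuration. Every function, catalog entry and field name here is a hypothetical stand-in, assumed purely to make the chain concrete:

```python
# Hypothetical service catalog: procedural templates indexed by processing mode.
SERVICE_CATALOG = {
    "batch":  {"engine": "spark-batch",     "steps": ["ingest", "prepare", "analyze"]},
    "stream": {"engine": "spark-streaming", "steps": ["ingest", "prepare", "analyze"]},
}

def select_template(declarative):
    """Service selection: pick the procedural template matching the declarative goals."""
    return SERVICE_CATALOG[declarative["processing_mode"]]

def compile_workflow(template, target):
    """Workflow compiler: turn a platform-independent template into a deployment config."""
    return {"platform": target,
            "engine": template["engine"],
            "workflow": " -> ".join(template["steps"])}

goals = {"processing_mode": "stream"}
config = compile_workflow(select_template(goals), target="my-cloud")
print(config["engine"], "|", config["workflow"])
```

The key property mirrored from the methodology is that only the last step binds to a platform: changing `target` (or swapping the catalog entry) re-deploys the same declarative goals elsewhere.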