
BDVe Webinar Series - Toreador Intro - Designing Big Data pipelines (Paolo Ceravolo)


In the Internet of Everything, huge volumes of multimedia data are generated at very high rates by heterogeneous sources in various formats, such as sensor readings, process logs, structured data from RDBMSs, etc. The need of the hour is setting up efficient data pipelines that can compute advanced analytics models on data and use the results to customize services, predict future needs or detect anomalies. This webinar explores the TOREADOR conversational, service-based approach to the easy design of efficient and reusable analytics pipelines to be automatically deployed on a variety of cloud-based execution platforms.



  1. 1. Designing Big Data Pipelines Applying the TOREADOR Methodology BDVA webinar Claudio Ardagna, Paolo Ceravolo, Ernesto Damiani
  2. 2. Methodology again: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model Execution, backed by the Declarative Specifications, Service Catalog, Service Composition Repository and Deployment Configurations of the Toreador Platform, running on a Big Data Platform (diagram labels: Tocode-based, Torecipies)
  3. 3. Methodology again: the same pipeline annotated with the abbreviations used in the following slides: DS (Declarative Specification), SS (Service Selection), SC (Service Composition), WC (Workflow Compilation), E (Execution)
  4. 4. Sample Scenario • Infrastructure for pollution monitoring managed by Lombardia Informatica, an agency of the Lombardy region in Italy. • A network of sensors acquires pollution data every day. • sensors, containing information about each acquiring sensor, such as its ID, pollutant type and unit of measure • data acquisition stations, managing a set of sensors and information regarding their position (e.g. longitude/latitude) • pollution values, containing the values acquired by sensors, the timestamp, and the validation status. Each value is validated by a human operator who manually labels it as valid or invalid.
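A minimal sketch of the three data sources, only to fix ideas; the field names below are assumptions drawn from the bullets above, not the actual Lombardia Informatica schema.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Sensor:                 # one acquiring sensor
        sensor_id: str
        pollutant_type: str
        unit_of_measure: str

    @dataclass
    class Station:                # data acquisition station managing a set of sensors
        station_id: str
        latitude: float
        longitude: float
        sensor_ids: List[str]

    @dataclass
    class PollutionValue:         # one acquired value
        sensor_id: str
        timestamp: str
        value: float
        validated: bool           # manually labelled valid/invalid by an operator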
  5. 5. Reference Scenario • The goal is to design and deploy a Big Data pipeline to: • predict the labels of acquired data in real time • alert the operator when anomalous values are observed
  6. 6. Key Advances • Batch and stream support: guide the user in selecting a consistent set of services for both batch and stream computations • Platform independence: use a smart compiler for generating executable computations for different platforms • End-to-end verifiability: include an end-to-end procedure for checking the consistency of model specifications • Model reuse and refinement: store declarative, procedural and deployment models as templates to replicate or extend designs
  7. 7. Without the methodology… • Draft the pipeline stages • Identify the technology • Develop the scripts • Deploy. Slow, error-prone, difficult to reuse… (pipeline sketch: Sensor Data → Queue (Kafka) → Compute predictive label (Spark) → Store (HBase) → Display/Query)
  8. 8. Declarative Model • The pipeline includes two processing stages: a training stage and a prediction stage • Our declarative model will therefore include two requirement specifications: DS1: DataPreparation.DataTransformation.Filtering; DataAnalytics.LearningApproach.Supervised; DataAnalytics.LearningStep.Training; DataAnalytics.AnalyticsAim.Regression; DataProcessing.AnalyticsGoal.Batch. DS2: DataAnalytics.LearningApproach.Supervised; DataAnalytics.LearningStep.Prediction; DataAnalytics.AnalyticsAim.Regression; DataProcessing.AnalyticsGoal.Streaming.
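Rendered as plain term lists, the two specifications could look like the sketch below; this is only an illustration, the actual TOREADOR declarative models are richer JSON-LD documents, as shown on a later slide.

    # Each requirement is a dotted term from the TOREADOR vocabulary.
    DS1 = [  # training stage, batch
        "DataPreparation.DataTransformation.Filtering",
        "DataAnalytics.LearningApproach.Supervised",
        "DataAnalytics.LearningStep.Training",
        "DataAnalytics.AnalyticsAim.Regression",
        "DataProcessing.AnalyticsGoal.Batch",
    ]
    DS2 = [  # prediction stage, streaming
        "DataAnalytics.LearningApproach.Supervised",
        "DataAnalytics.LearningStep.Prediction",
        "DataAnalytics.AnalyticsAim.Regression",
        "DataProcessing.AnalyticsGoal.Streaming",
    ]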
  9. 9. Procedural Model • Based on the declarative models, the TOREADOR Service Selection (SS) will return a set of services consistent with DS1 and DS2 • The user can easily compose these services to address the scenario's goals DS1 SS SC1 DS2 SS SC2
  10. 10. • The two compositions must be connected as the egestion of SC1 is the ingestion for SC2 Procedural Model DS1 SS SC1 DS2 SS SC2
  11. 11. • The two compositions must be connected as the egestion of SC1 is the ingestion for SC2 Procedural Model DS1 SS SC1 DS2 SS SC2
  12. 12. • The two compositions must be connected as the egestion of SC1 is the ingestion for SC2 Procedural Model DS1 SS SC1 DS2 SS SC2
  13. 13. Deployment Model • The TOREADOR compiler translates SC1 and SC2 into executable orchestrations (WC1, WC2) in a suitable workflow language, for example:
spark-filter-sensorsTest : filter --expr="sensorsDF#SensorId === 5958" --inputPath="/user/root/sensors/joined.csv" --outputPath="/user/root/sensors_test.csv" &&
spark-assemblerTest : spark-assembler --features="Data,Quote" --inputPath="/user/root/sensors_test.csv" --outputPath="/user/root/sensors/sensors_test_assembled.csv" &&
spark-gbt-predict : batch-gradientboostedtree-classification-predict --inputPath=/user/root/sensors/sensors --outputPath=/user/root/sensors/sensors --model=/user/root/sensors/model
DS1 SS SC1 DS2 SS SC2 WC1 WC2 1-n
  14. 14. Deployment
  15. 15. • The execution of WC2 produces the results Deployment Model DS1 SS SC2 WC2 E2
  16. 16. • The execution of WC2 produces the results Deployment Model DS1 SS SC2 WC2 E2
  17. 17. The Code-based Line: Code Once/Deploy Everywhere. The Toreador Code-based line user is an expert programmer, aware of the potentialities (flexibility and controllability) and purposes (analytics developed from scratch or migration of legacy code) of a code-based approach. She expresses the parallel computation of a coded algorithm in terms of parallel primitives; Toreador distributes it among computational nodes hosted by different Cloud environments. The resulting computation can be saved as a service for the Service-based line (I. Code → II. Transform → III. Deploy, via the Skeleton-Based Code Compiler)
  18. 18. Code-based compiler • The source code is rewritten into a parallel skeleton (MapReduce, Bag of Tasks, Producer/Consumer, …) plus secondary scripts • Source code shown on the slide:
import math
import random

def data_parallel_region(distr, func, *repl):
    return [func(x, *repl) for x in distr]

def distance(a, b):
    """Computes euclidean distance between two vectors"""
    return math.sqrt(sum([(x[1]-x[0])**2 for x in zip(a, b)]))

def kmeans_init(data, k):
    """Returns initial centroids configuration"""
    return random.sample(data, k)

def kmeans_assign(p, centroids):
    """Returns the given instance paired to key of nearest centroid"""
    comparator = lambda x: distance(x[1], p)
    print(comparator)
    …
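The k-means fragment above is truncated on the slide. Below is a minimal, self-contained sketch of how the data_parallel_region primitive could drive one k-means assignment step; the completion of kmeans_assign and the toy data are assumptions, not part of the original code.

    import math
    import random

    def data_parallel_region(distr, func, *repl):
        # Apply func to every element of the distributed collection;
        # the skeleton compiler maps this primitive onto parallel nodes.
        return [func(x, *repl) for x in distr]

    def distance(a, b):
        """Euclidean distance between two vectors."""
        return math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))

    def kmeans_assign(p, centroids):
        # Assumed completion: pair the point with the index of its nearest centroid.
        comparator = lambda c: distance(c[1], p)
        nearest = min(enumerate(centroids), key=comparator)
        return nearest[0], p

    # Hypothetical toy data (not from the slides).
    data = [[random.random(), random.random()] for _ in range(10)]
    centroids = random.sample(data, 2)
    assignments = data_parallel_region(data, kmeans_assign, centroids)
    print(assignments)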
  19. 19. Key Advances • Batch and stream support: guide the user in selecting a consistent set of services for both batch and stream computations • Platform independence: use a smart compiler for generating executable computations for different platforms • End-to-end verifiability: include an end-to-end procedure for checking the consistency of model specifications • Model reuse and refinement: store declarative, procedural and deployment models as templates to replicate or extend designs
  20. 20. Want to give it a try? Stay tuned! http://www.toreador-project.eu/community/ Or reach us at info@toreador-project.eu (2017)
  21. 21. Thank you
  22. 22. Declarative Model Definition
  23. 23. Declarative Models: vocabulary • The declarative model offers a vocabulary for a computation-independent description of a BDA (Big Data Analytics) campaign • Organized in 5 areas: • Representation (Data Mode, Data Type, Management, Partitioning) • Preparation (Data Reduction, Expansion, Cleaning, Anonymization) • Analytics (Analytics Model, Task, Learning Approach, Expected Quality) • Processing (Analysis Goal, Interaction, Performances) • Visualization and Reporting (Goal, Interaction, Data Dimensionality) • Each specification can be structured in three levels: • Goal: Indicator – Objective – Constraint • Feature: Type – Sub Type – Sub Sub Type
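To make the three-level term structure concrete, a vocabulary term such as Data_Analytics.Analytics_Aim.Task.Crisp_Clustering can be read as area.feature.sub-type.sub-sub-type; the small helper below is a hypothetical illustration, not part of the TOREADOR platform.

    def parse_term(term):
        """Split a dotted vocabulary term into (area, feature, sub_type, sub_sub_type)."""
        parts = term.split(".")
        area, feature = parts[0], parts[1]
        sub_type = parts[2] if len(parts) > 2 else None
        sub_sub_type = parts[3] if len(parts) > 3 else None
        return area, feature, sub_type, sub_sub_type

    print(parse_term("Data_Analytics.Analytics_Aim.Task.Crisp_Clustering"))
    # ('Data_Analytics', 'Analytics_Aim', 'Task', 'Crisp_Clustering')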
  24. 24. Declarative Models • A web-based GUI for specifying the requirements of a BDA • No coding, for basic users • Analytics services are provided by the target TOREADOR platform • Big Data campaign built by composing existing services • Based on model transformations
  25. 25. Declarative Models • A web-based GUI for specifying the requirements of a BDA • Data_Preparation.Data_Source_Model.Data_Model.Document_Oriented • Data_Analytics.Analytics_Aim.Task.Crisp_Clustering
  26. 26. Declarative Models: machine readable • A web-based GUI for specifying the requirements of a BDA • Data_Preparation.Data_Source_Model.Data_Model.Document_Oriented • Data_Analytics.Analytics_Aim.Task.Crisp_Clustering
…
"tdm:label": "Data Representation",
"tdm:incorporates": [
  {
    "@type": "tdm:Feature",
    "tdm:label": "Data Source Model Type",
    "tdm:constraint": "{}",
    "tdm:incorporates": [
      {
        "@type": "tdm:Feature",
        "tdm:label": "Data Structure",
        "tdm:constraint": "{}",
        "tdm:visualisationType": "Option",
        "tdm:incorporates": [
          {
            "@type": "tdm:Feature",
            "tdm:constraint": "{}",
            "tdm:label": "Structured",
            "$$hashKey": "object:21"
          }
        ]
      },
....
  27. 27. Interference Declaration • A few examples:
Data_Preparation.Anonymization.Technique.k-anonymity → ¬ Data_Analytics.Analytics_Quality.False_Positive_Rate.low
Data_Preparation.Anonymization.Technique.hashing → ¬ Data_Analytics.Analytics_Aim.Task.Crisp_Clustering.algorithm=k-means
Data_Representation.Storage_Property.Coherence_Model.Strong_Consistency → ¬ Data_Representation.Storage_Property.Partitioning
  28. 28. Consistency Check • Interference Declarations • Boolean interference: P → ¬Q • Intensity of an interference: D_P ∩ D_Q • Interference Enforcement • Fuzzy interpretation: max(1−P, 1−Q)
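A minimal sketch of how the fuzzy interference check could work, assuming each selected feature is given a truth degree in [0, 1]; the helper names and the threshold are hypothetical, only the max(1−P, 1−Q) reading comes from the slide.

    # Interference declarations: selecting P interferes with selecting Q (P -> not Q).
    INTERFERENCES = [
        ("Data_Preparation.Anonymization.Technique.k-anonymity",
         "Data_Analytics.Analytics_Quality.False_Positive_Rate.low"),
    ]

    def fuzzy_satisfaction(p_degree, q_degree):
        """Fuzzy reading of P -> not Q: max(1 - P, 1 - Q)."""
        return max(1.0 - p_degree, 1.0 - q_degree)

    def check_model(selected, threshold=0.5):
        """Return the interferences violated by a declarative model."""
        violations = []
        for p, q in INTERFERENCES:
            degree = fuzzy_satisfaction(selected.get(p, 0.0), selected.get(q, 0.0))
            if degree < threshold:
                violations.append((p, q, degree))
        return violations

    # Example: both interfering features fully selected -> violation reported.
    model = {
        "Data_Preparation.Anonymization.Technique.k-anonymity": 1.0,
        "Data_Analytics.Analytics_Quality.False_Positive_Rate.low": 1.0,
    }
    print(check_model(model))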
  29. 29. Service-Based Line
  30. 30. Methodology: Building Blocks • Declarative Specifications allow customers to define declarative models shaping a BDA and retrieve a set of compatible services • Service Catalog specifies the set of abstract services (e.g., algorithms, mechanisms, or components) that are available to Big Data customers and consultants for building their BDA • Service Composition Repository allows specifying the procedural model that defines how services can be composed to carry out the Big Data analytics, and supports the specification of an abstract Big Data service composition • Deployment Configurations define the platform-dependent version of a procedural model, as a workflow that is ready to be executed on the target Big Data platform
  31. 31. Overview of the Methodology: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model Execution, backed by the Declarative Specifications, Service Catalog, Service Composition Repository and Deployment Configurations of the MBDAaaS Platform, running on a Big Data Platform
  32. 32. Procedural Models • Platform-independent models that formally and unambiguously describe how analytics should be configured and executed • They are generated following goals and constraints specified in the declarative models • They provide a workflow in the form of a service orchestration • Sequence • Choice • If-then • Do-While • Split-Join
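As an illustration only (class and service names are hypothetical, loosely based on the services appearing in the deployment example earlier), a procedural model could be sketched in memory as a nesting of these constructs:

    from dataclasses import dataclass, field
    from typing import List, Union

    @dataclass
    class Service:
        name: str
        params: dict = field(default_factory=dict)

    @dataclass
    class Sequence:
        steps: List["Node"]

    @dataclass
    class SplitJoin:
        branches: List["Node"]

    Node = Union[Service, Sequence, SplitJoin]

    # SC1 (training, batch) followed by SC2 (prediction, streaming),
    # connected because the egestion of SC1 is the ingestion of SC2.
    sc1 = Sequence([Service("filter"), Service("assembler"),
                    Service("gradient-boosted-tree-train")])
    sc2 = Sequence([Service("filter"), Service("assembler"),
                    Service("gradient-boosted-tree-predict")])
    pipeline = Sequence([sc1, sc2])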
  33. 33. • User creates the flow based on the list of returned services Service Composition
  34. 34. • User creates the flow based on the list of returned services • Services enriched with ad hoc parameters Service Composition
  35. 35. • User creates the flow based on the list of returned services • Services enriched with ad hoc parameters • The flow is submitted to the service, which translates it into an OWL-S service composition Service Composition
  36. 36. • All internals are made explicit • Clear specification of the services • Reuse and modularity Service Composition
  37. 37. Deployment Model Definition
  38. 38. Overview of the Methodology: Declarative Model Specification → Service Selection → Procedural Model Definition → Workflow Compiler → Deployment Model Execution, backed by the Declarative Specifications, Service Catalog, Service Composition Repository and Deployment Configurations of the MBDAaaS Platform, running on a Big Data Platform
  39. 39. Workflow compiler • It consists of two main sub-processes • Structure generation: the compiler parses the procedural model and identifies the process operators (sequence, alternative, parallel, loop) composing it • Service configuration: for each service in the procedural model, the corresponding concrete service is identified and inserted in the deployment model • Supports transformations to any orchestration engine available as a service • Available for Oozie and Spring XD
  40. 40. Deployment Model • The workflow compiler takes as input • the OWL-S service composition • information on the target platform (e.g., installed services/algorithms) • It produces as output an executable workflow • For example, an Oozie workflow: • XML file of the workflow • job.properties • System variables
  41. 41. Translating the Composition Structure • Deployment models: • specify how procedural models are instantiated and configured on a target platform • drive analytics execution in real scenarios • are platform-dependent • The workflow compiler transforms the procedural model into a deployment model that can be directly executed on the target platform • This transformation is based on a compiler that takes as input • the OWL-S service composition • information on the target platform (e.g., installed services/algorithms) • and produces as output a technology-dependent workflow
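A deliberately simplified, hypothetical sketch of the structure-generation step (not the actual TOREADOR compiler): it flattens a sequence of service names into an Oozie-style action chain; the element and attribute names are illustrative only.

    # Hypothetical compiler sketch: flatten a sequence of services into
    # Oozie-like <action> elements chained by their "ok" transitions.
    import xml.etree.ElementTree as ET

    def compile_sequence(services, workflow_name="toreador-wf"):
        wf = ET.Element("workflow-app", name=workflow_name)
        targets = list(services[1:]) + ["end"]
        ET.SubElement(wf, "start", to=services[0])
        for current, nxt in zip(services, targets):
            action = ET.SubElement(wf, "action", name=current)
            ET.SubElement(action, "ok", to=nxt)
            ET.SubElement(action, "error", to="fail")
        ET.SubElement(wf, "kill", name="fail")
        ET.SubElement(wf, "end", name="end")
        return ET.tostring(wf, encoding="unicode")

    print(compile_sequence(["spark-filter", "spark-assembler", "spark-gbt-train"]))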
  42. 42. Translating the Composition Structure • The OWL-S service composition structure is mapped onto different control constructs
  43. 43. Generating an Executable Workflow • Workflows contain 3 distinct types of PLACEHOLDERS • GREEN placeholders are SYSTEM variables defined in the Oozie properties • RED placeholders are JOB variables defined in the file job.properties • YELLOW placeholders are ARGUMENTS of executable jobs on the Oozie server • More in the demo…
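Purely as an illustration of the three placeholder classes (the variable names and values below are hypothetical, except for the filter arguments, which come from the compiled workflow shown earlier), substitution could look like this:

    # Hypothetical placeholder resolution: SYSTEM values come from Oozie properties,
    # JOB values from job.properties, ARGUMENT values from the executable job itself.
    from string import Template

    SYSTEM = {"nameNode": "hdfs://namenode:8020", "jobTracker": "resourcemanager:8032"}
    JOB = {"inputPath": "/user/root/sensors/joined.csv",
           "outputPath": "/user/root/sensors_test.csv"}
    ARGS = {"expr": "sensorsDF#SensorId === 5958"}

    template = Template('filter --expr="$expr" --inputPath=$nameNode$inputPath '
                        '--outputPath=$nameNode$outputPath')
    print(template.substitute({**SYSTEM, **JOB, **ARGS}))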
  44. 44. Analytics Deployment Approach
