Session 8 - Creating Data Processing Services | Train the Trainers Program


  1. 1. i4Trust Website i4Trust Community End-to-end AI Solution With PySpark & Real-time Data Processing With Apache NiFi Rihab Feki, Machine Learning Engineer and Evangelist Sherifa Fayed, Technical Expert and Evangelist FIWARE Foundation
  2. 2. Learning goals ● Managing real time data with the Context broker ● Data transformation (JSON-LD to CSV) and persistence with Apache NiFi ● Setting up a Google Cloud environment ○ Creating a Dataproc cluster and connecting it to Jupyter Notebook ○ Using Google Cloud Storage Service (GCS) ● Modeling a ML solution based on PySpark for multi-classification ● Deploying the ML model with Flask and getting predictions in real time 2
  3. 3. End to End AI service architecture powered by FIWARE 3
  4. 4. What is Apache NiFi? 4 ● System to process and distribute data ● Supports powerful and scalable directed graphs of data routing and transformation ● Web based user interface ● Tracking data flow from beginning to end
  5. 5. Connecting NiFi to the Context Broker (architecture diagram): cURL or Postman talks to the NGSI-LD Context Broker on port 1026; the Context Broker stores its state in MongoDB on port 27017 and sends notifications to NiFi (or Draco) on port 5050.
  6. 6. Entity: Steel plate geometric measurements 6 Link to dataset
  7. 7. End to End AI service architecture powered by FIWARE 7
  8. 8. Dataflow overview 8 Ingesting
  9. 9. Data processing and persistence with NiFi 9
  10. 10. The overall NiFi workflow 10
  11. 11. Overview of the NiFi workflow ● ListenHTTP: Configured as the source processor for receiving notifications from the Context Broker ● GetFile: Reads data in JSON-LD format ● JoltTransformJSON: Transforms the nested JSON into a simple attribute-value JSON file, which is then used to form the CSV file ● ConvertRecord: Converts each JSON file into a CSV file ● MergeContent: Merges the resulting CSV record files into an aggregated CSV dataset (note: the minimum number of entries required to trigger the merge, as well as a maximum number of flow files, can be configured on the processor) ● PutGCSObject: Saves the resulting CSV to a Google Cloud Storage bucket
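The Jolt specification itself lives in the linked NiFi documentation; purely as an illustration of what the JoltTransformJSON and ConvertRecord steps achieve, here is a minimal Python sketch that flattens a hypothetical NGSI-LD notification into attribute-value records and writes them as CSV. The entity id and attribute names are assumptions, not taken from the slides.

```python
import csv
import json

# Hypothetical NGSI-LD notification payload, as the Context Broker would POST
# it to NiFi's ListenHTTP processor (entity id and attribute names are illustrative).
notification = json.loads("""
{
  "data": [
    {
      "id": "urn:ngsi-ld:SteelPlate:001",
      "type": "SteelPlate",
      "X_Minimum": {"type": "Property", "value": 42},
      "X_Maximum": {"type": "Property", "value": 50}
    }
  ]
}
""")

def flatten(entity):
    # Keep id and type, and reduce every Property to its plain value,
    # mirroring what the Jolt specification does declaratively inside NiFi.
    flat = {"id": entity["id"], "type": entity["type"]}
    for name, attr in entity.items():
        if isinstance(attr, dict) and attr.get("type") == "Property":
            flat[name] = attr["value"]
    return flat

rows = [flatten(e) for e in notification["data"]]

# Rough equivalent of ConvertRecord: write the flat records as CSV.
with open("steel_plates.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```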
  12. 12. Demo: Data transformation and persistence 12
  13. 13. End to End AI service architecture powered by FIWARE 13
  14. 14. What is PySpark? PySpark is the Python interface for Apache Spark. It is used for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETLs for a data platform.
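As a minimal illustration (not taken from the slides), an exploratory PySpark session over the CSV produced by the NiFi flow might look like this; the file path is an assumption.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on Dataproc this is preconfigured.
spark = SparkSession.builder.appName("steel-faults-eda").getOrCreate()

# Load the aggregated CSV produced by the NiFi flow (path is illustrative).
df = spark.read.csv("steel_plates.csv", header=True, inferSchema=True)

df.printSchema()      # inspect column names and inferred types
print(df.count())     # number of samples
df.describe().show()  # basic summary statistics
```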
  15. 15. What is Cloud Dataproc? Dataproc is a managed Spark and Hadoop service that lets you take advantage of open source data tools for big data processing: batch processing, querying, streaming, and machine learning.
  16. 16. The main benefits of Dataproc ● It’s a managed service: No need for a system administrator to set it up. ● It’s fast: Cluster creation in about 90 seconds. ● It’s cheaper than building your own cluster: Because you can spin up a Dataproc cluster when you need to run a job and shut it down afterward, so you only pay when jobs are running. ● It’s integrated with other Google Cloud services: Including Cloud Storage, BigQuery, and Cloud Bigtable, so it’s easy to get data into and out of it. 16
  17. 17. What makes Dataproc special? The typical mode of operation of Hadoop/Spark, on premises or in the cloud, requires you to deploy a cluster and then fill it up with jobs.
  18. 18. What makes Dataproc special? Rather than submitting the job to an already-deployed cluster, you submit the job to Dataproc, which creates a cluster on your behalf on-demand. ➢ A cluster is now a means to an end for job execution. 18
  19. 19. Let’s see how Dataproc makes it easy and scalable... Data scientists are big fans of Jupyter Notebooks. However, getting an Apache Spark cluster set up with Jupyter Notebooks can be complicated.
  20. 20. Apache Spark and Jupyter Lab architecture on Google Cloud 20
  21. 21. How does it work? 1. Setting up the Google Cloud environment and creating a project 2. Creating a Google Cloud Storage bucket for your cluster 3. Creating a Dataproc cluster with Jupyter and Component Gateway 4. Accessing the JupyterLab web UI on Dataproc 5. Creating a Notebook and developing the AI algorithm with PySpark
  22. 22. Creating a Dataproc cluster using cloud shell 22 gcloud beta dataproc clusters create ${CLUSTER_NAME} --region=${REGION} --image-version=1.4 --master-machine-type=n1-standard-4 --worker-machine-type=n1-standard-4 --bucket=${BUCKET_NAME} --optional-components=ANACONDA,JUPYTER --enable-component-gateway
  23. 23. Component gateway for additional cluster components 23
  24. 24. Steel plates faults prediction 24 ● Features: 27 Geometric Measurements of the steel plates ● Fault types: 7 ○ Pastry ○ Z_Scratch ○ K_Scatch ○ Stains ○ Dirtiness ○ Bumps ○ Other_Faults Dataset format: CSV | Number of Samples: 1941 Link to dataset
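The actual model is in the notebook linked later in the deck; as a hedged sketch only, a PySpark multi-class pipeline over this kind of data could be structured as below. The label column name fault_type, the choice of a random forest, the file paths, and the assumption that the label is already numeric (0-6) are all illustrative, not taken from the slides.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("steel-faults-classification").getOrCreate()
df = spark.read.csv("steel_plates.csv", header=True, inferSchema=True)

# Assume the 27 geometric measurements are every column except the (numeric) label.
feature_cols = [c for c in df.columns if c != "fault_type"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# A random forest handles the 7-class target directly.
rf = RandomForestClassifier(labelCol="fault_type", featuresCol="features")

pipeline = Pipeline(stages=[assembler, rf])
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# Evaluate with multi-class accuracy on the held-out split.
predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="fault_type", predictionCol="prediction", metricName="accuracy")
print("Test accuracy:", evaluator.evaluate(predictions))

# Persist the fitted pipeline so a serving process can reload it later.
model.write().overwrite().save("steel_faults_model")
```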
  25. 25. Demo: Cloud environment set up Modeling the ML solution based on PySpark 25
  26. 26. ML model deployment with Flask (architecture diagram): a Jupyter Notebook performs the model training and produces a saved model (.parquet); a Flask web service on port 5000 loads it for model prediction; the Orion Context Broker (port 1026, MongoDB on port 27017) is accessed via cURL or Postman.
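The real service is in the repository linked on the next slide; the following is only a rough sketch of a Flask prediction endpoint that reloads a saved PySpark pipeline. The route, paths, and payload format are assumptions, not taken from the deck.

```python
from flask import Flask, jsonify, request
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

app = Flask(__name__)
spark = SparkSession.builder.appName("steel-faults-api").getOrCreate()

# Load the pipeline persisted by the training notebook (path is illustrative).
model = PipelineModel.load("steel_faults_model")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a flat JSON object carrying the 27 geometric measurements,
    # keyed by the same column names used during training.
    measurements = request.get_json()
    df = spark.createDataFrame([measurements])
    prediction = model.transform(df).select("prediction").first()[0]
    return jsonify({"fault_class": int(prediction)})

if __name__ == "__main__":
    # Port 5000, as in the deployment diagram above.
    app.run(host="0.0.0.0", port=5000)
```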
  27. 27. Useful links ● Source code and documentation https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi ● Jupyter Notebook for Steel faults classification based on PySpark https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi/blob/master/PySpark/P ySpark_Steel_faults_Classification.ipynb ● Data processing and persistence with Apache NiFi documentation https://github.com/RihabFekii/PySpark-AI-service_Data-processing-NiFi/tree/master/Nifi ● NGSI-LD Context Broker ○ Docker hub: https://hub.docker.com/r/fiware/orion-ld ○ Documentation: https://github.com/FIWARE/context.Orion-LD ● Google Cloud Console: https://console.cloud.google.com/ ● Flask Apps with Docker: https://runnable.com/docker/python/docker-compose-with-flask-apps ● 27
  28. 28. Summary ● The Context Broker holds only the current state of entities; it does not persist historical data ● The Google Cloud Dataproc service gives data scientists an easy way to set up, control, and secure data science environments, and makes it simple and fast to integrate them with other open source data tools ● Once a Dataproc cluster is created, it is not possible to change its configuration or install new dependencies and libraries ● Dataproc jobs are limited to certain programming languages ● Apache NiFi may not be the easiest tool for data processing, but it manages and automates data flows and is a good fit for large-scale or real-time data ● Other cloud platforms could be used (AWS, Azure, Databricks, etc.)
  29. 29. Thank you! http://fiware.org Follow @FIWARE on Twitter
  30. 30. 30 Q&A
  31. 31. 31 Annex
  32. 32. Creating an entity in the Context Broker (screenshot): the request specifies a unique id and type, plus the attributes of the created entity.
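The screenshot itself is not reproduced in this transcript; as an illustrative example, creating such an entity with Python's requests could look like the sketch below. The entity id, attribute names, and the use of only the NGSI-LD core context are assumptions.

```python
import requests

# Context Broker endpoint from the architecture diagram (port 1026).
ORION_URL = "http://localhost:1026/ngsi-ld/v1/entities"

# Illustrative SteelPlate entity: a unique id and type, plus Property
# attributes holding a couple of the geometric measurements.
entity = {
    "id": "urn:ngsi-ld:SteelPlate:001",
    "type": "SteelPlate",
    "X_Minimum": {"type": "Property", "value": 42},
    "X_Maximum": {"type": "Property", "value": 50},
    "@context": "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld",
}

resp = requests.post(ORION_URL,
                     json=entity,
                     headers={"Content-Type": "application/ld+json"})
print(resp.status_code)  # 201 Created on success
```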
  33. 33. Subscribing to changes and listening (screenshot): the subscription is posted to Orion, targets all entities of a certain type and the relevant attributes, and sends notifications to the port NiFi is listening on.
  34. 34. 34 Subscribing to changes and listening
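Again, the screenshots are not reproduced here; a hedged sketch of the subscription described on slide 33, posted to Orion with Python, might look like this. The watched attributes and the notification path are assumptions; port 5050 comes from the earlier architecture diagram.

```python
import requests

# Illustrative NGSI-LD subscription: all SteelPlate entities, the relevant
# attributes, and notifications pushed to the port NiFi's ListenHTTP
# processor uses (5050); the /steel path is an assumption.
subscription = {
    "description": "Notify NiFi of changes to steel plate measurements",
    "type": "Subscription",
    "entities": [{"type": "SteelPlate"}],
    "watchedAttributes": ["X_Minimum", "X_Maximum"],
    "notification": {
        "attributes": ["X_Minimum", "X_Maximum"],
        "format": "normalized",
        "endpoint": {"uri": "http://nifi:5050/steel",
                     "accept": "application/json"},
    },
    "@context": "https://uri.etsi.org/ngsi-ld/v1/ngsi-ld-core-context.jsonld",
}

resp = requests.post("http://localhost:1026/ngsi-ld/v1/subscriptions",
                     json=subscription,
                     headers={"Content-Type": "application/ld+json"})
print(resp.status_code)  # 201 Created on success
```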
  35. 35. Inducing a change and receiving a notification 35
  36. 36. Inducing a change and receiving a notification (screenshot): after changing the value of X_Minimum, the Out count of the NiFi listening processor jumps to 1.
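For completeness, inducing such a change can be done with a single PATCH request, as in the sketch below; the entity id and the new value are made up for illustration.

```python
import requests

# Illustrative update of the X_Minimum attribute; Orion then fires a
# notification to NiFi and the ListenHTTP processor's Out count increases.
url = ("http://localhost:1026/ngsi-ld/v1/entities/"
       "urn:ngsi-ld:SteelPlate:001/attrs")

payload = {"X_Minimum": {"type": "Property", "value": 17}}

resp = requests.patch(url, json=payload)
print(resp.status_code)  # 204 No Content on success
```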
  37. 37. Setting up the cloud environment 37
  38. 38. Creating a project in Google Cloud Platform 38 We can manage the project via the Cloud Shell
  39. 39. Creating a Google Cloud Storage bucket ➢ Store datasets ➢ Store notebooks ➢ Store logs ➢ Store output files
  40. 40. Creating a Dataproc cluster using cloud shell 40 gcloud beta dataproc clusters create ${CLUSTER_NAME} --region=${REGION} --image-version=1.4 --master-machine-type=n1-standard-4 --worker-machine-type=n1-standard-4 --bucket=${BUCKET_NAME} --optional-components=ANACONDA,JUPYTER --enable-component-gateway
  41. 41. Creating a Dataproc cluster using GUI 41
  42. 42. Component gateway for additional cluster components 42
  43. 43. Overview of the Dataproc cluster 43
  44. 44. Dataproc cluster web interfaces 44
  45. 45. Dataproc cluster : Jupyter lab interface 45
  46. 46. Creating a Jupyter Notebook and provisioning data from Google Cloud Bucket 46 Link to Notebook
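The linked notebook contains the real code; as a short sketch, provisioning the dataset from the Google Cloud Storage bucket inside the notebook could look like this, with the bucket and object names as placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("steel-faults-notebook").getOrCreate()

# Dataproc ships with the GCS connector, so gs:// paths work directly.
# Replace the bucket and object names with your own.
bucket = "my-dataproc-bucket"  # hypothetical bucket name
df = spark.read.csv(f"gs://{bucket}/data/steel_plates.csv",
                    header=True, inferSchema=True)

df.show(5)

# Results (e.g. predictions) can be written back to the same bucket.
df.write.mode("overwrite").parquet(f"gs://{bucket}/output/steel_plates_parquet")
```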
  47. 47. Submitting a PySpark job using the Dataproc GUI
  48. 48. Submitting a PySpark job to the Dataproc cluster
  49. 49. www.egm.io Fluid Machine Learning lifecycle with FIWARE Benoit Orihuela – i4Trust Training Webinar
  50. 50. A TYPICAL ML LIFECYCLE • A Data Scientist • Get and clean up data • Prepare and train an ML model • An IT person • Package and deploy the ML model • An end user • Discover the available ML models (with respect to privacy) • Ask to use one or more of them (and optionally pay for it) • Get real-time data (predictions, outliers, …) from an ML model
  51. 51. WHAT DO WE AIM AT? Bridge the gap between data scientists and operations (MLOps) Develop the Machine Learning as a Service (MLaaS) model And also: more and more use cases require ML / AI activities, so FIWARE needs to offer a rich variety of tools
  52. 52. THE TRAINING AND PREPARATION PHASE
  53. 53. THE DISCOVERY AND REGISTRATION PHASE
  54. 54. THE PREDICTION PHASE
  55. 55. DEMONSTRATIONS • Demonstration #1 - End-to-end demonstration of an ML model's development, deployment, and use • Uses a Jupyter notebook as the interface • Applied to a simplistic water flow calculation • Demonstration #2 – Event generation from video stream analysis • Real-time extraction of context information from a video stream
  56. 56. Thank You! Benoit ORIHUELA, Lead Architect | Tel: +33 687427107 | E-mail: benoit.orihuela@egm.io | www.egm.io
  57. 57. www.egm.io MLaaS for Image analysis Anwar ALFATAYRI
  58. 58. REAL-LIFE EXAMPLE: SOCIAL DISTANCING (annotated image) Number of people: 14 | Groups of 2 people: 1 | Groups of 3 people: 2 | Groups of 4 people: 1 | Groups of >4 people: 0
  59. 59. TWO APPROACHES: Machine learning on the edge (diagram: an image of the street is analysed on the edge device, and only the result, e.g. "3 people detected", is sent to the FIWARE cloud)
  60. 60. TWO APPROACHES: Machine learning as a service (diagram: the image is sent from the street to the FIWARE cloud through a REST API, which returns "3 people detected")
