DevOps Fest 2020. Daniel Yavorovych. Data pipelines: building an efficient instrument to create a custom workflow

I will share our experience of building a system for working with big data on top of the open-source technologies Apache NiFi and Kubernetes, using the analysis of news sources with NLP as an example.


  1. 1. Data pipelines: building an efficient instrument to create a custom workflow Speaker: Daniel Yavorovych DevOpsFest2020
  2. 2. Daniel Yavorovych CTO & Co-Founder at Dysnix 10+ years of *nix systems administration; 5+ years of DevOps and SRE; 7+ years developing cloud solution architectures and HL/HA infrastructures; 7+ years developing high-performance servers (Python/Golang).
  3. 3. Real-Time Data Pipelines Processing When is it needed? Why is this a problem? Real-time processing is needed for continuously arriving data, for example from Twitter, news media, email, etc. Most Data Pipeline solutions assume Batch mode. There are only a few alternatives, which are discussed below.
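
The difference between the Batch and Stream modes mentioned on slide 3 can be sketched in a few lines of Python. This is only an illustration under simple assumptions: `incoming` is a hypothetical stand-in for a live feed, and none of the names below come from the talk or from any pipeline product.

```python
from typing import Callable, Iterable, Iterator, List


def process_batch(records: List[str], handle: Callable[[str], str]) -> List[str]:
    """Batch mode: the whole input must be collected before processing starts."""
    return [handle(r) for r in records]


def process_stream(source: Iterable[str], handle: Callable[[str], str]) -> Iterator[str]:
    """Stream mode: each record is processed as soon as it arrives."""
    for record in source:
        yield handle(record)


def incoming() -> Iterator[str]:
    """Hypothetical stand-in for a live feed such as tweets or news items."""
    yield from ["first headline", "second headline", "third headline"]


stream = process_stream(incoming(), str.upper)
first_result = next(stream)  # available before the rest of the feed is read
batch_result = process_batch(list(incoming()), str.upper)
```

In the streaming variant the first result is ready while the rest of the feed has not been consumed yet, which is the property a real-time pipeline depends on.
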
  4. 4. Data Pipeline Solutions Google Cloud Dataflow Batch and Stream modes! Fully integrated with AutoML, Pub/Sub and other GCP components Vendor lock-in Not expensive, but costs more than self-hosted solutions
  5. 5. Data Pipeline Solutions Apache Airflow Open Source & No vendor lock-in User Interface for visualizing Data Pipelines and Processing Support of various executors (Apache Spark, Celery, Kubernetes) No Stream Mode
  6. 6. Data Pipeline Solutions Luigi Open Source & No vendor lock-in Not very scalable: you need to split tasks into projects for parallel execution User Interface Hard to use: DAG tasks cannot be viewed before execution, and viewing logs is difficult
  7. 7. Data Pipeline Solutions argoproj/argo-events Open source & No vendor lock-in No User Interface Real-time mode Kubernetes-native solution 20+ event sources Argo Workflows support (a container-native workflow engine) New and still immature
  8. 8. Data Pipeline Solutions Apache NiFi Open Source & No vendor lock-in Difficult integration with Kubernetes Real-time mode Flexible & User-friendly Interface for viewing Data Pipeline and Processing Highly scalable Lots of native Processors available
  9. 9. We choose NiFi because: the number of native Processors available NiFi provides many ready-made Processors (about 300 of them), from Twitter API and Slack to TCP and HTTP servers, S3, GCS, and Google Pub/Sub
  10. 10. We choose NiFi because: Custom Scripts Missing a Processor you need? Write your own in one of the supported scripting languages: Clojure, ECMAScript, Groovy, Lua, Python, Ruby. Will it work faster? I rewrote some logic in Python to replace several NiFi Processors with a single script, and the flow became even faster.
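
As a rough sketch of the idea on slide 10, the script below does in one pass what might otherwise take a chain of filter, clean-up, and enrichment Processors. It assumes the script is run as an external command (for example through NiFi's ExecuteStreamCommand Processor, which feeds FlowFile content to stdin and takes stdout as the new content) and that the content is one JSON object per line. The field names `title` and `lang` are hypothetical, and this is not the speaker's actual code.

```python
import io
import json
from typing import Optional, TextIO


def transform(record: dict) -> Optional[dict]:
    """Filter, clean, and enrich one record in a single pass.

    The field names ("title", "lang") are hypothetical.
    """
    title = (record.get("title") or "").strip()
    if not title:
        return None  # drop records without a usable title
    return {
        "title": title,
        "lang": (record.get("lang") or "unknown").lower(),
        "title_words": len(title.split()),
    }


def run(stdin: TextIO, stdout: TextIO) -> None:
    """Read JSON lines from stdin and write transformed JSON lines to stdout."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        result = transform(json.loads(line))
        if result is not None:
            stdout.write(json.dumps(result) + "\n")


# Local demonstration with in-memory streams instead of real stdin/stdout.
sample = io.StringIO('{"title": "  NiFi on Kubernetes ", "lang": "EN"}\n{"title": ""}\n')
output = io.StringIO()
run(sample, output)
```

When the script is run as an external command, the last three lines would be replaced by `run(sys.stdin, sys.stdout)`.
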
  11. 11. We choose NiFi because: possibility to change data flows & queues in real time You can stop a Processor or a group of Processors at any time, make changes, and start it again. Meanwhile, all other Processors that do not depend on the stopped ones continue working. This lets you stop Processors that have errors or that simply need changes. All incoming messages are held in the NiFi queue until the Processor is started again.
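
The queueing behaviour described on slide 11 can be modelled with a toy example. This is a conceptual sketch only: the `Processor` class below is hypothetical and has nothing to do with the real NiFi API. It only shows that messages keep accumulating while a step is stopped and are drained in order once it is started again.

```python
from collections import deque


class Processor:
    """Toy model of a processing step with an input queue (not the NiFi API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.running = True
        self.queue: deque = deque()
        self.processed: list = []

    def offer(self, message: str) -> None:
        """Upstream always enqueues, whether or not this step is running."""
        self.queue.append(message)

    def tick(self) -> None:
        """Drain the queue only while the step is running."""
        while self.running and self.queue:
            self.processed.append(self.queue.popleft())


enrich = Processor("enrich")
enrich.offer("msg-1")
enrich.tick()

enrich.running = False  # stop the step to change its settings
enrich.offer("msg-2")
enrich.offer("msg-3")
enrich.tick()  # nothing is processed, nothing is lost
queued_while_stopped = len(enrich.queue)

enrich.running = True  # start again: the backlog is drained in order
enrich.tick()
```
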
  12. 12. We choose NiFi because: NiFi Registry NiFi Registry is a central location for the storage and management of shared resources across one or more instances of NiFi and/or MiNiFi. It lets you version your work on NiFi Processors and Process Groups (similar to Git), switch between versions, and always roll back to a previous one.
  13. 13. We choose NiFi because: Templates NiFi templates let you export your whole data flow to an XML file with a few keystrokes, as a backup or to hand it off to another developer. A template can also be used as a base for presets (we'll talk about this later)
  14. 14. We choose NiFi because: External Auth & Users/Groups NiFi has flexible permission management for Users and Groups. Permissions can be set both for operations (viewing/editing a Flow) and for specific objects (Processors/Process Groups). NiFi also supports external authentication (including OpenID Connect, LDAP, and Kerberos). For example, we integrated Keycloak to store user data in one place.
  15. 15. NiFi Arch
  16. 16. NiFi Arch: Cluster Mode
  17. 17. NiFi Scalability Source: Horizontal scaling There is no fixed limit on the number of nodes in a single cluster (only node hardware and network performance limits) It is easy to join a new node to a running cluster
  18. 18. NiFi Scalability: Multiple Clusters If even 10 nodes are not enough because you are limited by network bandwidth, you can build several NiFi clusters and connect them through Remote Process Groups.
  19. 19. NiFi & Kubernetes Existing solutions: The last Helm Chart was the most relevant and we took it as a basis
  20. 20. Helm chart NiFi Registry Grafana dashboard & Prometheus metrics (Grafana Dashboard ID: 12375) Predefined NiFi Flow
  21. 21. Tips & Tricks Use Kafka or any Message Bus: if NiFi fails, your data must stay safe. Although NiFi has a visual editor and a bunch of Processors, flows must be built by a technically competent engineer; otherwise the data flow can be destabilized. For unpredictable inputs, use a rate-limiting Processor. Use NiFi Registry: it will always let you roll back! Don't try to use only native NiFi Processors: sometimes that is too complicated, and it is easier to write a couple of lines in Python. Don't gloss over errors! In NiFi you can handle errors the same way as regular data: send them to Slack or use them for your own purposes.
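
The last tip on slide 21, treating errors as data, can be sketched outside NiFi as follows. This is a minimal illustration under stated assumptions: `route` and `slack_payload` are hypothetical helper names, and the payload relies only on the `text` field that Slack incoming webhooks accept. Nothing is actually sent here.

```python
import json
from typing import Callable, List, Tuple


def route(records: List[str], parse: Callable[[str], dict]) -> Tuple[List[dict], List[dict]]:
    """Split records into successes and failures instead of raising."""
    success, failure = [], []
    for raw in records:
        try:
            success.append(parse(raw))
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            failure.append({"raw": raw, "error": str(exc)})
    return success, failure


def slack_payload(failed: dict) -> dict:
    """Build a JSON body with a "text" field for a Slack incoming webhook."""
    return {"text": f"Pipeline error: {failed['error']} (input: {failed['raw']!r})"}


ok, failed = route(['{"id": 1}', "not json"], json.loads)
payloads = [slack_payload(f) for f in failed]
```

Failed records keep their original input and the error message, so they can be queued, inspected, or replayed like any other data instead of being silently dropped.
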
  22. 22. Production Architecture Example
  23. 23. Conclusion NiFi proved to be not only good for rapid prototyping of Data Pipeline flows but also a good basis for scalable, high-load ELT systems. Of all the free self-hosted solutions, it is the most modern and actively developed. Configuring a NiFi cluster in Kubernetes was not a trivial task, but after working through some difficulties, the resulting ready-to-use solution meets all the requirements. NiFi is flexible: it does not lock everything into itself, and when used properly it gives very good results in supporting really big projects of a similar kind.
  24. 24. Dysnix Open Source: Helm charts; cryptocurrency node Docker images; Prometheus exporters; Grafana dashboards; Terraform for Blockchain-ETL (project for Google Cloud Platform)
  25. 25. Daniel Yavorovych CTO & Co-Founder at Dysnix Questions?