Interactive workflow management using Azkaban

Managing workflows with the Azkaban scheduler, which can be used for batch as well as interactive workloads.

Published in: Data & Analytics

  1. 1. Interactive Workflow Management using Azkaban: API-driven workflow management for Spark. https://github.com/phatak-dev/interactive-azkaban
  2. 2. ● Madhukara Phatak ● Technical Lead at Tellius ● Consultant and Trainer at datamantra.io ● Consults in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  3. 3. Agenda ● Different Kinds of Applications in Spark ● Why Interactive? ● Building an Interactive Application ● Workflow in Big Data ● Challenges of Interactive Applications ● Azkaban ● Azkaban in manual/batch mode ● Azkaban AJAX API ● Azkaban client in Scala
  4. 4. Big Data Applications ● Big data applications are typically divided by their workloads ● The major divisions are ○ Batch Applications ○ Streaming Applications ● Most existing platforms now support both ● But a new category of applications is on the rise, known as interactive applications
  5. 5. Big Data Interactive Applications ● Ability to manipulate data in an interactive way ● Exploratory in nature ● Moves away from the notion that ETL and analysis have to live in silos ● Combines batch and streaming data ● For Development ○ Zeppelin, Jupyter Notebook etc. ● For Production ○ DataMeer, Tellius, ZoomData etc.
  6. 6. Spark and Interactive Applications ● Apache Spark is the only big data platform built from scratch to support interactive applications ● Spark made interactive data exploration using notebooks popular ● Caching and intelligent lazy evaluation make it a great tool for interactive systems ● As Spark combines ETL, exploration and advanced analytics in one platform, we can do all the data work in an interactive fashion
  7. 7. Building an Interactive Application
  8. 8. REST-based Spark Application [Diagram with components: Client, REST API, Spark Cluster, Database, HDFS]
  9. 9. Akka-Http ● Framework to build reactive web applications and services ● Built on top of Akka abstractions for concurrency ● Successor to the popular REST framework Spray ● As streams are the base abstraction, it works well with Spark ● Written in Scala, with APIs in Java and Scala ● We will use a local Spark session to interact with Spark
  10. 10. Simple API ● Below is the API we expose ○ /load - for loading the data ○ /view - for looking at sample data ○ /schedule - for scheduling operations ● All these operations are simple, but they show what an API-based system looks like ● We test the APIs using Postman to emulate interactive mode ● Ex : RestService.scala
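
To make the shape of these three endpoints concrete, here is a minimal, dependency-free sketch that models them as a single dispatch function. It is not the talk's RestService.scala: it uses neither Akka-Http nor Spark, the HTTP methods are assumed, and the in-memory `rows` buffer is a hypothetical stand-in for state that would normally live in the Spark session.

```scala
// Sketch only: a plain-Scala model of the /load, /view and /schedule endpoints.
// The HTTP methods and the in-memory state are assumptions for illustration.
object SimpleApi {
  final case class Response(status: Int, body: String)

  // State built up by earlier calls; a real service would hold a DataFrame instead.
  private var rows: Vector[String] = Vector.empty
  private var scheduled: Vector[String] = Vector.empty

  def handle(method: String, path: String, payload: String): Response =
    (method, path) match {
      case ("POST", "/load") =>
        // Treat each non-empty line of the payload as one record.
        rows = payload.split("\n").toVector.filter(_.nonEmpty)
        Response(200, s"loaded ${rows.size} rows")
      case ("GET", "/view") =>
        // Return a small sample of what has been loaded so far.
        Response(200, rows.take(5).mkString("\n"))
      case ("POST", "/schedule") =>
        // Remember the operation name; a real service would call the workflow manager here.
        scheduled = scheduled :+ payload
        Response(200, s"scheduled $payload")
      case _ =>
        Response(404, s"no route for $method $path")
    }
}
```

The point of the sketch is that `/view` and `/schedule` act on state left behind by `/load`, which is what separates an interactive service from a one-shot batch job.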
  11. 11. Workflow management in Big Data
  12. 12. Need for Workflow in Big Data ● Most of the tasks we do in big data are repetitive in nature ● Once we have determined our flow, we want to run it on new data as and when it arrives ● Two parts: ○ Flow Definition ○ Scheduling ● Use cases ○ ETL, updating models etc.
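
As a rough illustration of the flow definition part, a flow can be modelled as a map from each job to the jobs it depends on, and a valid run order can be derived from it. The sketch below is generic Scala, not tied to any workflow engine; the job names used with it are hypothetical.

```scala
import scala.annotation.tailrec

// Sketch only: a flow as a dependency map, plus a dependency-respecting run order.
object FlowOrder {
  // Each job name maps to the set of jobs that must finish before it starts.
  type Flow = Map[String, Set[String]]

  // Returns the jobs in an order that respects dependencies,
  // or an error message if there is a cycle or an undefined dependency.
  def runOrder(flow: Flow): Either[String, List[String]] = {
    @tailrec
    def loop(remaining: Flow, done: List[String]): Either[String, List[String]] =
      if (remaining.isEmpty) Right(done.reverse)
      else {
        // Jobs whose dependencies have all completed; sorted for a stable order.
        val ready = remaining.collect {
          case (job, deps) if deps.forall(done.contains) => job
        }.toList.sorted
        if (ready.isEmpty)
          Left(s"cycle or missing dependency among: ${remaining.keys.toList.sorted.mkString(", ")}")
        else loop(remaining -- ready, ready.reverse ::: done)
      }
    loop(flow, Nil)
  }
}
```

Scheduling is then the separate question of when to trigger this order, for example on a timer or when new data arrives.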
  13. 13. Workflow for Batch ● Most scheduling for batch applications is done using some kind of scripting ● There are many ways to define and execute a flow ● Once the code is tested, it is deployed and the scripts are scheduled ● These scripts define the flow structure and use a scheduler to run the operations ● Well-known frameworks for batch scheduling are ○ Oozie ○ Airflow
  14. 14. Workflow for Streaming ● Streaming frameworks themselves usually handle the workflow needs of the application ● The Spark Streaming code defines the flow that needs to be run ● The Spark Streaming scheduler runs the flow as and when new data appears ● So we rarely use an external workflow framework for these workloads
  15. 15. Workflow for Interactive Applications ● Ability to define workflows on the fly rather than fixed workflows as in the case of batch ● Ability to schedule and unschedule using APIs ● Should be able to handle both batch and streaming sources of data ● Should integrate with the state built up through interactions in interactive mode ● Ability to monitor the status of running jobs in real time
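
The monitoring requirement usually comes down to polling a status endpoint until the job reaches a terminal state. A minimal sketch of that loop follows, with the status lookup passed in as a function so the code stays independent of any particular workflow system. The status names are assumptions chosen for illustration.

```scala
import scala.annotation.tailrec

// Sketch only: poll a status source until a terminal state or until polls run out.
object StatusPoller {
  // Assumed terminal status names; adjust to whatever the workflow system reports.
  val terminal: Set[String] = Set("SUCCEEDED", "FAILED", "KILLED")

  // fetchStatus is any function that returns the current status, e.g. an HTTP call.
  // Returns Some(finalStatus), or None if maxPolls was exhausted first.
  @tailrec
  def awaitTerminal(fetchStatus: () => String, maxPolls: Int, sleepMillis: Long = 0L): Option[String] =
    if (maxPolls <= 0) None
    else {
      val status = fetchStatus()
      if (terminal(status)) Some(status)
      else {
        if (sleepMillis > 0) Thread.sleep(sleepMillis)
        awaitTerminal(fetchStatus, maxPolls - 1, sleepMillis)
      }
    }
}
```

Bounding the number of polls keeps an interactive request from hanging forever on a job that never finishes.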
  16. 16. Challenges of Scheduling for Interactive ● Most workflow systems do not expose a REST API for defining flows and scheduling them ● Many lack a good monitoring system for querying the status of running tasks, which is critical ● Most workflow systems run on their own sandboxed execution engine, which makes them hard to integrate with the application state ● More details [2]
  17. 17. Azkaban ● Azkaban is a workflow job scheduler created at LinkedIn to run Hadoop jobs ● Has good support for defining dependencies through its flow mechanism and for monitoring jobs ● Allows extending the UI to track new metrics ● Supports multiple runtimes such as ○ Hadoop ○ Spark ○ Java
  18. 18. Azkaban Batch Mode ● Azkaban is primarily built for scheduling big data batch jobs ● It has a simple DSL to define flows ● It allows us to define different executors for a given flow ● The abstractions ○ Project ○ Flow ● Ex : Running a Java flow using the Azkaban UI
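
For a feel of the DSL, the sketch below renders the contents of two `.job` property files from Scala. It assumes Azkaban's classic job-file format with `type`, `command` and `dependencies` keys; the job names and commands are made up, and the exact format should be checked against the Azkaban version in use.

```scala
// Sketch only: render Azkaban-style .job property files as strings.
// Assumes the classic key=value job format; names and commands are hypothetical.
object JobFiles {
  // Renders the body of one <name>.job file.
  def render(jobType: String, command: String, dependencies: Seq[String] = Nil): String = {
    val base = Seq(s"type=$jobType", s"command=$command")
    // The dependencies line is only written when the job has upstream jobs.
    val deps =
      if (dependencies.isEmpty) Nil
      else Seq(s"dependencies=${dependencies.mkString(",")}")
    (base ++ deps).mkString("\n")
  }

  // A two-job flow: report runs only after extract has finished.
  val flow: Map[String, String] = Map(
    "extract.job" -> render("command", "echo extracting"),
    "report.job"  -> render("command", "echo reporting", Seq("extract"))
  )
}
```

In batch mode these files would be zipped and uploaded to a project through the UI. Generating them from code is what makes defining flows on the fly possible.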
  19. 19. Azkaban for Interactive Workflows
  20. 20. Azkaban AJAX API ● Though Azkaban is primarily built for batch jobs, it has an AJAX API for interacting with the workflow system ● This API was built primarily for the UI to talk to the engine ● Though it is not a full-fledged REST API, it is good enough to build an interactive workflow system ● This AJAX API makes Azkaban a good fit as a workflow management system for interactive applications
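
A sketch of what talking to the AJAX API can look like: the helpers below only build the form body and URLs, so they run without a server. The parameter names (`action=login`, `ajax=executeFlow`, `ajax=fetchexecflow`, `session.id`, `project`, `flow`, `execid`) follow my reading of the Azkaban AJAX API documentation and should be treated as assumptions to verify against your Azkaban version; the helper names are hypothetical.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Sketch only: request builders for a few AJAX API calls.
// Parameter names are assumptions taken from the Azkaban docs; verify before use.
object AzkabanAjax {
  private def enc(s: String): String =
    URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  // Encodes key/value pairs as application/x-www-form-urlencoded.
  private def form(params: (String, String)*): String =
    params.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")

  // Body for a POST to the server root; a successful response carries a session id.
  def loginBody(user: String, password: String): String =
    form("action" -> "login", "username" -> user, "password" -> password)

  // URL for a GET that starts one execution of a flow.
  def executeFlowUrl(base: String, sessionId: String, project: String, flow: String): String =
    s"$base/executor?" +
      form("ajax" -> "executeFlow", "session.id" -> sessionId, "project" -> project, "flow" -> flow)

  // URL for a GET that fetches the state of a running or finished execution.
  def fetchExecUrl(base: String, sessionId: String, execId: Int): String =
    s"$base/executor?" +
      form("ajax" -> "fetchexecflow", "session.id" -> sessionId, "execid" -> execId.toString)
}
```

For example, `AzkabanAjax.executeFlowUrl("http://localhost:8081", sessionId, "demo", "report")` yields a URL that any HTTP client can call, where the host, port, project and flow names are placeholders.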
  21. 21. Azkaban Scala Client ● The Azkaban AJAX API has some rough edges, as it is not meant to work as a standard REST API ● Interacting with the API directly from your application is painful ● azkaban-scala-client is a Scala client that makes interacting with Azkaban much easier ● Most of the APIs are exposed in Scala; feature requests are welcome ● https://github.com/phatak-dev/azkaban-scala-client
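
One kind of rough edge a wrapper has to smooth over, assuming the API reports failures as an `error` field in the JSON body rather than through the HTTP status code, is turning raw responses into typed results. The sketch below is not the azkaban-scala-client API; `AzkabanResponses` and its methods are hypothetical, and the regex-based field lookup is a shortcut that a real client would replace with a JSON library.

```scala
import java.util.regex.Pattern

// Sketch only: turn a login response body into Either[error, sessionId].
// The response shapes are assumptions; this is not the azkaban-scala-client API.
object AzkabanResponses {
  // Pulls a top-level string field out of a small, flat JSON object.
  private def field(json: String, name: String): Option[String] = {
    val pattern = ("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"").r
    pattern.findFirstMatchIn(json).map(_.group(1))
  }

  // An error field wins over everything else, even if the HTTP call itself succeeded.
  def sessionId(loginResponse: String): Either[String, String] =
    field(loginResponse, "error") match {
      case Some(err) => Left(err)
      case None      => field(loginResponse, "session.id").toRight("no session.id in response")
    }
}
```

Returning an `Either` lets the calling REST service decide how to surface a failed login instead of discovering it later as a missing session.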
  22. 22. Schedule in REST API ● Now that we understand how to use the Azkaban API to interact with the workflow manager, we can use it in our REST API ● We will use our Scala client to interact with Azkaban ● The flow implementation makes a request back to the REST server in order to use the state available there ● Ex : Scheduler.scala
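
The two halves of this slide can be sketched as follows: a form body for scheduling a flow with a cron expression, and a job definition whose command calls back into the REST server. This is not the talk's Scheduler.scala. The `scheduleCronFlow` parameter names follow my reading of the Azkaban AJAX API documentation and are assumptions, and the callback path, host and `curl` command are hypothetical.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Sketch only: schedule a flow and make its job call back into the REST server.
// Parameter names and the callback endpoint are assumptions for illustration.
object ScheduleSketch {
  private def enc(s: String): String =
    URLEncoder.encode(s, StandardCharsets.UTF_8.name())

  private def form(params: (String, String)*): String =
    params.map { case (k, v) => s"${enc(k)}=${enc(v)}" }.mkString("&")

  // Body for a POST to <base>/schedule that registers a cron schedule for a flow.
  def scheduleBody(sessionId: String, project: String, flow: String, cron: String): String =
    form(
      "ajax" -> "scheduleCronFlow",
      "session.id" -> sessionId,
      "projectName" -> project,
      "flow" -> flow,
      "cronExpression" -> cron
    )

  // Contents of a .job file whose only work is to call the REST server,
  // so the scheduled run executes against the state held there.
  def callbackJob(restBase: String, operation: String): String =
    Seq("type=command", s"command=curl -X POST $restBase/$operation").mkString("\n")
}
```

Keeping the job as a thin callback means Azkaban only decides when to run, while the REST server, which owns the Spark session, decides what runs.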
  23. 23. References ● [1] http://blog.madhukaraphatak.com/interactive-scheduling-using-azkaban-setting-up-solo-server/ ● [2] http://blog.madhukaraphatak.com/interactive-scheduling-using-azkaban-challenges-in-scheduling-interactive-workloads/ ● [3] http://azkaban.github.io/azkaban/docs/latest/#ajax-api ● [4] https://github.com/azkaban/azkaban
