Welcome to my talk on Building a Self-Service Hadoop platform at LinkedIn with Azkaban.
A little bit about who I am.
I am a Software Engineer on the Hadoop Development Team at LinkedIn. I am one of the main contributors to Azkaban.
Previously, I was at Microsoft, working on the Windows Kernel.
I want to start by talking a little about how LinkedIn uses Hadoop.
About half of our Hadoop usage is ad hoc queries for general analytics. The other half powers the vast majority of our data products.
These are two pages from my LinkedIn experience: <click> - On the left is my homepage - On the right is my profile page.
The features on this page that are directly powered by Hadoop are everything in blue. <click>
- People You May Know - Recommendations - Many other features
- Our incredibly robust A/B testing platform. We test almost every change, every placement, every impression.
The sample set, the test analysis… they are all done on Hadoop.
Let’s take a closer look at the LinkedIn home page.
<click> People You May Know and Ad Recommendations are two of the many data products powered by Hadoop at LinkedIn.
<click> Here are visual representations of the dependency graphs of the Hadoop jobs that create these two features. - Dozens to hundreds of Hadoop jobs per data product. - People You May Know is one of our premier data products. It was at the core of LinkedIn’s membership growth.
These run reliably every day, and sometimes multiple times a day, to provide fresh data to our users. And it’s critical that they work.
We’re constantly testing and improving the workflows for all of our data products. We’ve had a very high velocity of changes since the beginning of data products at LinkedIn.
This is the graph representing the workflow for one of LinkedIn’s major data products back in 2009. It’s small, only a few Hadoop jobs. Not difficult to understand or maintain. But we were in a phase of rapid growth in 2009… and still are.
As we grew, these workflows changed… a lot.
<Click> Each image represents a month of change to the workflow.
- Every month, you can see that there are small to huge modifications to the workflow. - It quickly got very complex. - And this was just one of our data products. Imagine having one of these for every product.
It became clear that without being able to visualize what your workflow looked like and how it ran, it was very difficult to understand what the whole workflow did and keep up with the rapid development.
And it was with this in mind that we created <click>
Azkaban, our workflow scheduler, back in 2009.
This is what it looked like back then: pretty gothic looking, but it had the basic features we needed.
With Azkaban, we could:
- Run workflows
- Schedule jobs
- View job history
- Use other features, like failure notification
But, most importantly, it had an easy-to-use web UI with visualizations that showed you how your flow looked and how it ran.
Well, it wasn’t perfect:
- It was one server
- It stored everything in flat files on disk
- It ran every job as a forked local process
- It didn’t really have user management
- The UI was engineer-designed and a bit dark and depressing
Nonetheless, the tool was heavily used, because above all, it was simple and easy to use.
About a year and a half ago, in order to keep up with our rapid growth, we did a massive re-architecting of Azkaban.
We got some help from a Web Dev to create a completely new and easier to use UI.
Azkaban was now two separate servers:
- The executor server handled job scheduling and execution
- The web server managed projects and users, and provided the UI
We added user management, so users can no longer overwrite each other’s files.
Data is stored in a database, which is pluggable: H2 or MySQL.
And, of course, a brand new UI and more.
New plugin system
Azkaban itself has no dependency on Hadoop.
We modularized many components of Azkaban and turned them into plugins:
- The only built-in job type is command
- All other job types became plugins: Java, Pig, Hive, etc.
- We support non-Hadoop jobtypes such as Teradata and Voldemort Build-and-Push
- Viewer plugins extend the Azkaban UI to integrate new tools, such as the HDFS browser and Reportal
This means that we do not run a fork of Azkaban internally. We run the same code that lives in the open source repository on GitHub. All LinkedIn-specific code is implemented as plugins.
Over time, we found that our users used Azkaban as more than just a workflow scheduler.
They used Azkaban not only as a major tool for developing Hadoop workflows, but also for debugging them with the integrated job and flow logs, only going to the Job Tracker logs when absolutely necessary.
They used Azkaban to browse HDFS. Many of our users don’t even know the namenode HDFS browser even exists.
Azkaban had become not just a workflow manager, but the main front-end to Hadoop at LinkedIn.
With this in mind, at the beginning of this year, we released <click>
Azkaban 2.5, focusing on making Azkaban easier and more powerful to use and more productive to develop and extend.
We rebuilt the UI to be both familiar to our users and more beautiful and more intuitive. By using Bootstrap, it is also more future-proof and easier to extend.
We also added a number of new features and improvements.
We added more powerful workflow features such as embedded flows, which let you embed and reuse flows as nodes within other flows.
- A number of new self-service tools to help users better understand how their flows and jobs ran, and to make it easier to debug and tune them
- Viewer plugins that are specific to jobtypes, so that you can build tools that are seamlessly integrated into the job details page. We used this to build the new self-service tools
- Improvements to the HDFS viewer: it displays the schema in addition to content, and there is a Parquet file viewer
- A number of under-the-hood improvements and more
One major reason we pushed ease of use and simplicity so much was our users.
Our users include not only software engineers and data scientists, but also analysts and product managers who are creating, modifying and scheduling Hadoop workflows.
So, Azkaban started off as a workflow scheduler for Hadoop.
Today, it is also:
- An integrated runtime environment, where users develop and run complex workflows consisting of Pig, Hive, Java MapReduce and other types of jobs
- A unified front-end for Hadoop tools
In the first year, we only had a dozen different workflows to manage and a handful of people that used them.
Over the last 4 years, we’ve grown to have:
- Over 1000 different users
- Azkaban instances on over six different clusters
- 2,500 flows accounting for 30,000 Hadoop jobs executing per day
We have jobs that run from just a few minutes all the way to 8 days.
The bad news is that we have over 1000 users, 2,500 flows and 30,000 jobs executing per day. And this is only going to keep increasing.
We were surprised at how much it is being used. And this is only from one Azkaban instance, and we have about 6 of them.
Our Hadoop development team is fairly small, only about 8 people. To keep up with our users, we had to make our Hadoop infrastructure as self-service as possible. And the primary front-end to our Hadoop infrastructure is, of course, Azkaban.
So, how do you use Azkaban?
I am going to show you how easy it is to create an Azkaban workflow.
A workflow is simply a collection of job files. These are key-value property files. I use the ‘type’ property to define what kind of job I want to run. I can run a variety of jobs, like Pig, Hive, Java or just plain old command line jobs.
The dependencies parameter is self-explanatory. It specifies which jobs must complete successfully before this job can run. The rest of the parameters are passed to the executing job itself.
All Azkaban does with these parameters is construct a process that is run locally. The reason we can get away with running a lot of processes locally is that most of them don’t do much more than spawn Hadoop jobs on the cluster.
So in this case, bread waits on peanut butter and jelly, making this the most delicious workflow ever.
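To make that concrete, here is a minimal sketch of what those three job files might contain, written as a small Python script that renders them. The job names come from the example; the `echo` commands and the `render_job` helper are made up for illustration, and the `command` property is my assumption about how a command-type job names what to run.

```python
# Hypothetical job definitions for the sandwich example. The echo commands
# are placeholders, not real workloads.
JOBS = {
    "peanutbutter": {"type": "command", "command": "echo spread peanut butter"},
    "jelly": {"type": "command", "command": "echo spread jelly"},
    "bread": {
        "type": "command",
        "command": "echo add bread",
        # bread may only run after both of these jobs complete successfully
        "dependencies": "peanutbutter,jelly",
    },
}


def render_job(props):
    """Render one job as key=value lines, one property per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    for name, props in JOBS.items():
        print(f"# {name}.job")
        print(render_job(props))
```

Each rendered block would be saved as its own file, one file per job.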
We found that many users reuse what is effectively a sub-flow several times with different parameters.
As a result, we added support for embedded flows, making it possible to embed a workflow as a node in other workflows.
I just set the type to “flow” and set “flow.name” to the name of the workflow I want to embed.
In this case, my embedded flow, “sandwich”, which consists of peanutbutter, jelly, and bread, waits on coffee and fruit, making this workflow a complete and healthy breakfast.
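As a rough sketch, the outer breakfast flow could be described like this. The `type` and `flow.name` properties are the ones I just mentioned; the node name `sandwich_step`, the `echo` commands, and the `to_properties` helper are hypothetical.

```python
# Hypothetical outer flow: the whole "sandwich" workflow appears as one node.
BREAKFAST_JOBS = {
    "coffee": {"type": "command", "command": "echo brew coffee"},
    "fruit": {"type": "command", "command": "echo slice fruit"},
    "sandwich_step": {
        "type": "flow",           # marks this node as an embedded flow
        "flow.name": "sandwich",  # name of the workflow to embed
        "dependencies": "coffee,fruit",
    },
}


def to_properties(props):
    """Render one node as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    print(to_properties(BREAKFAST_JOBS["sandwich_step"]))
```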
Afterwards, I just package my jobs into a zip archive and upload it to Azkaban via the web UI or a REST interface.
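The packaging step needs nothing special. A minimal sketch using Python’s standard `zipfile` module is below; the file names and contents are the hypothetical sandwich jobs, and the upload itself is left out because I am not showing the REST call here.

```python
import io
import zipfile


def package_jobs(job_files):
    """Pack {filename: text} into an in-memory zip archive and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for filename, text in sorted(job_files.items()):
            archive.writestr(filename, text)
    return buffer.getvalue()


# Hypothetical job file contents for illustration only.
EXAMPLE_FILES = {
    "peanutbutter.job": "type=command\ncommand=echo spread peanut butter\n",
    "jelly.job": "type=command\ncommand=echo spread jelly\n",
    "bread.job": "type=command\ncommand=echo add bread\ndependencies=peanutbutter,jelly\n",
}

if __name__ == "__main__":
    data = package_jobs(EXAMPLE_FILES)
    print(f"archive is {len(data)} bytes")
```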
Here is the project page, where I can run my workflows, set permissions, and customize my jobs.
If I want a bird’s-eye view of which jobs make up the flows in my project, I simply click on the drop-down for one of my flows and I can see the hierarchy of its jobs in an outline form. When I mouse over one of the jobs, Azkaban automatically highlights the jobs it depends on and the ones that depend on it.
When I am ready to run my workflow, I just click the “Execute Flow” button, which brings up <click>
…the Flow Execution Panel.
Here, I can do a lot of things to customize my flow before I run it.
I have this beautiful visualization of my flow. I can enable or disable any part of my flow.
For example, here I have 3 embedded flows that process data and a final flow that pushes data back out to the front end. If I want to test my workflow and not push any data, I can right click and disable the last flow.
Because the flow will take a long time to run, I want to be notified by email if the execution fails.
I just click into the Notification tab, select to be notified when the first job fails, and specify the email address for the notification.
Here, I can also set the failure notification to be sent after the failed flow finishes running.
I can also set a notification to be sent if the flow finishes successfully.
By default, if a job in a flow fails, Azkaban will let the currently running jobs finish before killing the flow.
When I am testing my flow, I might want to change what happens when a failure is caught.
I just click into the Failure Options tab and select which behavior I want.
Aside from letting the currently running jobs finish,
I can also have Azkaban kill all jobs immediately when a failure is caught
or try to continue to run as many of the remaining jobs that it can run.
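Here is a minimal sketch of how I think about those three choices. This is not Azkaban’s code: the option names `finish_running`, `cancel_all`, and `finish_possible`, the `plan_after_failure` function, and the breakfast graph are all made up for illustration.

```python
def plan_after_failure(dependencies, failed, running, pending, option):
    """Return (jobs_to_kill, pending_jobs_still_allowed_to_start)."""
    if option == "cancel_all":
        # Kill everything that is currently running; start nothing new.
        return sorted(running), []
    if option == "finish_running":
        # Let running jobs finish, but start nothing new.
        return [], []
    if option == "finish_possible":
        # A pending job is blocked if it depends, directly or transitively,
        # on the failed job. Everything else may still run.
        blocked = {failed}
        changed = True
        while changed:
            changed = False
            for job in pending:
                if job not in blocked and blocked.intersection(dependencies[job]):
                    blocked.add(job)
                    changed = True
        return [], sorted(job for job in pending if job not in blocked)
    raise ValueError(f"unknown failure option: {option}")


# Hypothetical flow: job name -> jobs it depends on.
BREAKFAST = {
    "coffee": [],
    "fruit": [],
    "peanutbutter": [],
    "jelly": [],
    "bread": ["peanutbutter", "jelly"],
    "breakfast": ["bread", "coffee", "fruit"],
}

if __name__ == "__main__":
    for option in ("finish_running", "cancel_all", "finish_possible"):
        result = plan_after_failure(
            BREAKFAST, "jelly", {"coffee"}, {"fruit", "bread", "breakfast"}, option
        )
        print(option, result)
```

With jelly failed, only fruit can still start under the third option, because bread and breakfast both sit downstream of the failure.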
And I can do all this through the UI.
I can also customize the execution of my flow by setting custom parameters.
I just click into the Flow Parameters tab and set any parameters I like.
The parameters that I set here will override the parameters set in my job files.
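The override rule is simple enough to sketch in a few lines. The parameter names below are hypothetical, and `effective_parameters` is just an illustration of the precedence, not Azkaban’s implementation.

```python
def effective_parameters(job_file_params, flow_params):
    """Flow parameters set at execution time win over job file values."""
    merged = dict(job_file_params)  # copy so the job file values stay untouched
    merged.update(flow_params)
    return merged


if __name__ == "__main__":
    # Hypothetical values: the job file points at production, the test run
    # redirects output to a scratch directory.
    from_job_file = {"output.dir": "/data/prod/sandwich", "spread": "jelly"}
    from_flow_tab = {"output.dir": "/tmp/test/sandwich"}
    print(effective_parameters(from_job_file, from_flow_tab))
```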
This is particularly useful when I am testing my flow.
I might want to run multiple instances of my flow concurrently. This is often useful when I am testing my flow and I want to kick off a few instances of it each with different parameters.
If I am concerned about the different instances of the flow stepping on each other and touching the same files when running in parallel, I can go into the Concurrent tab and specify how I want the workflow to run if one instance is already running.
I can either disallow concurrent executions of my flow altogether.
I can let all instances of the flow run concurrently.
Or I can pipeline them in a way that ensures that new executions will not overrun the current execution.
- I can block the execution of Job A until the previous flow’s Job A has finished running
- Or, I can block Job A until the children of the previous flow’s Job A have completed. This is useful if I want my flows to be a few steps behind an already executing flow
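The two pipelining rules can be sketched as a single check. The mode labels `same_job` and `children`, the `can_start` function, and the sandwich graph are assumptions made for this illustration rather than Azkaban’s actual names.

```python
def can_start(job, previous_finished, children, mode):
    """Decide whether `job` in the new execution may start.

    previous_finished: jobs that have finished in the previous execution.
    children: job name -> jobs that depend on it.
    """
    if job not in previous_finished:
        return False  # both modes wait for the previous run of the same job
    if mode == "same_job":
        return True
    if mode == "children":
        # Also wait until every child of that job has finished.
        return all(child in previous_finished for child in children.get(job, []))
    raise ValueError(f"unknown pipeline mode: {mode}")


# Hypothetical flow: peanutbutter and jelly both feed bread.
CHILDREN = {"peanutbutter": ["bread"], "jelly": ["bread"], "bread": []}

if __name__ == "__main__":
    finished = {"peanutbutter", "jelly"}  # bread is still running
    print(can_start("jelly", finished, CHILDREN, "same_job"))
    print(can_start("jelly", finished, CHILDREN, "children"))
```

The second mode keeps the new execution one extra step behind the old one, which is why it is the safer choice when jobs share intermediate files.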
Now, when I run my flow, I am presented with this interactive visualization.
At one point, Hadoop users had to constantly refresh the Hadoop JobTracker page to see the status of their jobs.
We try to do better than that. Instead of having to sit and click refresh every 3 seconds, Azkaban visualizes the execution of your flow, automatically updating as the jobs run.
Here, I can also expand into the embedded flows and view the progress of the inner jobs as well.
If I want to look at which jobs are run at which point in the flow’s execution, I can click into the Job List tab and view the flow as a Gantt chart. <click>
This is also updated automatically as my workflow runs.
I can also expand into the embedded flows here as well.
I can also click on the details link for any of the jobs and drill down into the job logs and other details specific to the job.
Oftentimes, at LinkedIn, Hadoop users will have Azkaban open on this page on one screen while they do work on the other.
This is how we make it easier for users to follow their workflow executions, which is critical to allowing them to understand their workflows.
Many of our Hadoop workflows are scheduled to be run repeatedly, either monthly, weekly, or even daily, as is the case for some of our ETL workflows.
Azkaban makes it easy to schedule your workflow to be run at a specific time, date, and recurrence.
If I want my workflow to run at 5:30 PM every Friday, I just click “Schedule” and <click>
…bring up the Schedule Flow Panel.
Here, I can specify the time, date, and the recurrence for when I want my flow to be run.
Once my workflow is scheduled, I can view it on the Scheduled Flows page.
I can also manage other flows that I have scheduled from this page.
But what about those times when one of my jobs may be running abnormally long? We have a feature for that too.
I can click Set SLA for any of my scheduled flows and <click>
…bring up the SLA Options panel.
Here, I can set SLA settings for my entire flow, or individual jobs in the flow.
I can send notifications or even kill my job or flow when the SLA is exceeded.
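As a minimal sketch of the idea, an SLA check compares how long something has been running against a limit and decides what to do. The `sla_actions` function and the action names are hypothetical; this is an illustration, not how Azkaban implements it.

```python
from datetime import timedelta


def sla_actions(elapsed, limit, notify=True, kill=False):
    """Return the actions to take once a job or flow exceeds its SLA."""
    if elapsed <= limit:
        return []  # still within the SLA, nothing to do
    actions = []
    if notify:
        actions.append("notify")
    if kill:
        actions.append("kill")
    return actions


if __name__ == "__main__":
    # Hypothetical SLA: the flow should finish within 2 hours.
    limit = timedelta(hours=2)
    print(sla_actions(timedelta(hours=1), limit))
    print(sla_actions(timedelta(hours=3), limit, kill=True))
```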
The other side of developing Hadoop workflows is debugging and tuning their performance.
As we all know, Hadoop, Pig, and Hive are all very complex systems, with many different knobs and logs that provide a ton of information but are not always easy to find or read.
Again, with over 1000 users, several clusters, so many jobs executing per day, and a small Hadoop team,
we needed to build tools that make it as easy as possible for our users to debug problems and tune the performance of their workflows and jobs on their own.
As my jobs change over time, one thing I would want to know is how the new changes affect the performance of the job.
On the job history page, Azkaban provides a time graph visualizing the history of a job’s runtime.
I can click on any data point to view the details and logs of that particular job execution.
With Azkaban 2.5, we added a similar time graph visualizing the history of the runtimes of the workflow as a whole.
This way, as I make modifications to my workflow over time, I can easily see how the performance of my workflow changes with each of the new modifications.
Another important feature of Azkaban is that it provides all the job execution logs so that you don’t have to hunt them down.
Job logs contain very rich information. For example, Pig and Hive job logs provide tables describing numbers of mappers and reducers and task runtimes for the MapReduce jobs they fire off.
However, job logs are very verbose and are not always easy to read.
One complaint we often get is that the headers on the tables of job stats never line up with the actual columns.
As a result, one of the new self-service features we added in Azkaban 2.5 is the Job Summary.
The Job Summary parses the job logs and extracts the information most useful to the users such as - The command used to run the job - Classpath - JVM options - Memory settings - Parameters passed to the job
For Pig and Hive jobs, it also displays the table of mappers and reducers clearly in an actual table.
At LinkedIn, the majority of our Hadoop jobs are Pig jobs. As we all know, Pig scripts are ultimately compiled into a DAG of MapReduce jobs.
When developing and tuning Pig jobs, it is very useful to be able to visualize the DAG.
As a result, we built a Pig Visualizer, a plugin specific to the Pig jobtype, which:
- Uses the Pig listener interface to collect stats while my Pig job runs
- Visualizes the plan DAG and provides detailed information about each node that previously required going to the Job Tracker
This is similar to Lipstick and Ambrose but is completely integrated with Azkaban.
- You do not need to modify your Pig jobs at all
- You do not have to leave Azkaban to go to a separate tool
- As you can see, it is integrated seamlessly as a new tab next to the job logs
I’ll show you some of the things you can do with the Pig Visualizer
Here, I can select one of the nodes, which will display some summary information for that job in the sidebar, including:
- Types of operations
- Aliases
- Whether the job succeeded
- A Job Tracker URL
- Some stats
Clicking on More Details brings up a modal dialog
…which displays more detailed information about the job.
The first tab displays the Pig Job Stats, which include:
- Mapper and reducer stats
- I/O stats and spill count
- Which files the job read or wrote to and how many records
The second tab contains the Hadoop job counters.
These are the counters that you would find on the Job Tracker page but again, you can view them right here in Azkaban rather than having to go to a completely different tool.
Finally, the Pig Visualizer also displays part of the job configuration.
The values we picked to display on this page are the ones that our users most commonly look at when tuning their Pig jobs, such as:
- Split size
- io.sort.mb
- Compression options
- Pig Map and Reduce plans, converted from base64 to text
Of course, there is a convenient link to view the full details on the Job Tracker page.
Often, users want to better understand how their workflow is performing as a whole, to find out information such as which jobs:
- Run the longest
- Use the most tasks
We built the Flow Summary to provide a dashboard of details and stats for a given workflow.
At the top, we have the project name and a list of job types used.
Then, we have scheduling information. I can remove the schedule or set an SLA right from this page.
Then, the flow summary can analyze the last successful run of my flow and display an aggregate view of stats from all the jobs.
It displays a histogram of the runtimes of each of the jobs in the flow.
Below, it shows aggregate views of resource consumption, such as which job used the highest number of map and reduce slots, and the total slots used.
It also shows which jobs set the highest values for different parameters, such as maximum Xmx, Xms, and distributed cache usage.
And which jobs read and wrote the most bytes.
These tools, especially the Pig Visualizer, have been heavily used at LinkedIn and have been really helpful for our users to understand and tune their workflows.
Browsing HDFS is something we all do extensively when developing Hadoop jobs.
We make that really easy to do with the Azkaban HDFS viewer plugin.
Browsing files is really easy.
The HDFS browser works just like any other web-based file browser.
You can jump back to any parent folder in the path and easily go to your home directory.
Oftentimes, you will want to view files in HDFS, but the files may be in a binary format. At LinkedIn, we store most of our data in Avro and a few in other formats such as binary JSON.
While tools like the Namenode HDFS browser and the Hadoop command line dfs tool also let you browse HDFS, they do not have good support for binary formats and will simply dump the raw contents of the files.
And so, we did better than that.
The Azkaban HDFS browser has pluggable file type viewers, which will parse files in different binary formats and display up to 1000 records from the file in human-readable text.
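To illustrate the pluggable design, here is a small sketch of a viewer registry with the 1000-record cap. It is not the HDFS browser’s code: the `viewer` decorator, the `preview` function, and the two toy formats (plain text and JSON lines) are stand-ins I made up, since parsing Avro or Parquet would need extra libraries.

```python
import itertools
import json

MAX_RECORDS = 1000  # cap on how many records a viewer displays

VIEWERS = {}  # file extension -> function that yields readable records


def viewer(extension):
    """Register a viewer function for files ending in `extension`."""
    def register(func):
        VIEWERS[extension] = func
        return func
    return register


@viewer(".txt")
def read_text(data):
    yield from data.decode("utf-8").splitlines()


@viewer(".jsonl")
def read_json_lines(data):
    for line in data.decode("utf-8").splitlines():
        if line.strip():
            yield json.dumps(json.loads(line), sort_keys=True)


def preview(filename, data, limit=MAX_RECORDS):
    """Pick a viewer by extension and return at most `limit` records."""
    for extension, func in VIEWERS.items():
        if filename.endswith(extension):
            return list(itertools.islice(func(data), limit))
    raise ValueError(f"no viewer registered for {filename}")


if __name__ == "__main__":
    big_file = ("line\n" * 1500).encode("utf-8")
    print(len(preview("big.txt", big_file)))
    print(preview("record.jsonl", b'{"b": 1, "a": 2}\n'))
```

Supporting a new format then means registering one more function, without touching the browsing code.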
Sometimes, you want to see the schema of the file. We made that really easy to do too.
Just click the Schema tab and <click>
…the file viewer will extract the schema from the file and display it as readable text.
Currently, this is supported by the Avro and Parquet file viewers.
The HDFS browser supports a number of file formats, including …. You can even view images in the HDFS browser.
And of course, text.
As I mentioned before, Hadoop is also heavily used by analysts at LinkedIn.
Often, analysts want to simply write a query, schedule it, and get the result without having to jump through the hoops of uploading job files.
We made that really easy to do with Reportal, which is built on top of Azkaban.
This is the main dashboard of Reportal.
From here, I can manage my existing reports or schedule them.
I can also create a new report.
Creating a new report on Reportal is easy.
For this report, I want to find the 10 most common words in Alice in Wonderland.
As you can see, Reportal provides a nice editor with syntax highlighting.
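The report itself is written as a query in the Reportal editor. As a stand-in for the logic of that query, here is a minimal Python sketch of a top-N word count; the `top_words` function and the one-sentence sample are my own illustration, not the query on the slide.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most common words in `text` with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)


# The opening words of the book, used here as a tiny sample input.
SAMPLE = (
    "Alice was beginning to get very tired of sitting by her sister "
    "on the bank, and of having nothing to do"
)

if __name__ == "__main__":
    for word, count in top_words(SAMPLE, 2):
        print(word, count)
```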
Once I have my query, I can run the report and <click>
…view the results.
I can download the results. I can also easily visualize the results using a line graph, bar graph, or pie chart.
Reportal currently supports Pig, Hive, and Teradata queries.
Reportal is built completely on top of Azkaban and uses Azkaban to execute and schedule jobs.
It is heavily used at LinkedIn for simple reporting.
I want to give a sneak peek of some of the upcoming features that we are currently working on.
<go through page>
This plugin will be open sourced so stay tuned.
These are other features that are on our roadmap. <go through list>
Here are some possible future ideas that we have been discussing.
<go through list>
If you see something you would really want, please let us know.
I would like to thank some of our main contributors.
Aside from myself:
- Hien Luu – Azkaban core
- Anthony Hsu – Reportal, Job Summary
- Alex Bain – Azkaban Gradle Plugin and DSL
- Richard Park – Azkaban 2.0 rewrite, embedded flows, many more changes to core Azkaban
- Chenjie Yu – Azkaban core, trigger manager, job types
- Shida Li – Reportal on Azkaban
Azkaban is developed on GitHub.
We do not use a forked version. We run the same code internally on our clusters, except for a few LinkedIn-specific plugins.
Please check out our website and our source code on GitHub.
We look forward to you giving Azkaban a spin.
We always welcome your feedback, bug reports, and pull requests, and we would love to get new contributors as well.