Building a Self-Service Hadoop Platform at LinkedIn with Azkaban

Published in: Technology

  1. Building a Self-Service Hadoop Platform at LinkedIn with Azkaban. Data Analytics Infrastructure. ©2013 LinkedIn Corporation. All Rights Reserved.
  2. (No text on this slide.)
  3. Hadoop at LinkedIn
  4. Hadoop at LinkedIn: Profile Page, Home Page
  5. Hadoop at LinkedIn
  6. Evolution of Workflows (timeline: 2009, 2010, 2011, 2012, 2013)
  7. Azkaban 1.0
  8. Azkaban 1.0
      • Run workflows
      • Schedule jobs
      • Job history
      • Failure notification
      • Easy-to-use web UI and visualizations
  9. Azkaban 2.0
  10. Azkaban 2.0
      • Major re-architecting
      • Separate executor and web servers
      • User authentication
      • Pluggable database drivers
        – H2
        – MySQL
      • Brand new UI
  11. Azkaban 2.0
      • Jobtype plugins
        – Built-in type: command
        – Pluggable jobtypes: Java, Pig, Hive
        – Non-Hadoop jobtypes: Teradata, Voldemort
      • Viewer plugins, extending the Azkaban UI for other tools
        – HDFS browser
        – Reportal
      • LinkedIn-specific code as plugins
  12. Azkaban 2.5
  13. Azkaban 2.5
      • UI overhauled using Bootstrap
      • Embedded flows
      • New self-service tools
        – Job Summary
        – Flow Summary
        – Pig Visualizer
      • Jobtype-specific plugins
      • HDFS viewer improvements
        – Display file schema in addition to content
        – Parquet file viewer
      • And more
  14. Who’s using Azkaban?
      • Software Engineers
      • Data Scientists
      • Analysts
      • Product Managers
  15. Azkaban Today
      • Workflow manager and scheduler
      • Integrated runtime environment
      • Unified front-end for Hadoop tools
  16. Good News! Success!
      • 1000+ users
      • Several clusters
      • 2,500 flows executing per day
      • 30,000 jobs executing per day
  17. Bad News! Success
      • 1000+ users
      • Several clusters
      • 2,500 flows executing per day
      • 30,000 jobs executing per day
  18. Creating and Running Workflows
  19. Creating Workflows
      • Add job “type” plugins
        – hadoopJava
        – Command
        – Pig
        – Hive
      • Dependencies
        – Determine the dependency graph
      • Parameter passing
        – Parameters can be passed to jobs
      Example job files:
        peanutbutter.job
          type=pig
          creamy.level=4
          chunky.level=4
          ...
        jelly.job
          type=hadoopJava
          jelly.type=grape
          sugar=HFCS
          ...
        bread.job
          type=command
          bread.type=wheat
          dependencies=peanutbutter,jelly
          ...
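A minimal sketch of how the dependency graph on this slide could be derived from the three job files. It only models the `key=value` lines shown on the slide (the elided `...` entries are left out); the helper names `parse_job` and `execution_order` are hypothetical and are not Azkaban APIs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parse_job(text):
    """Parse the key=value lines of a .job file into a dict (illustrative only)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def execution_order(jobs):
    """Return job names in an order that respects each job's `dependencies`."""
    graph = {}
    for name, props in jobs.items():
        deps = [d.strip() for d in props.get("dependencies", "").split(",") if d.strip()]
        graph[name] = set(deps)
    return list(TopologicalSorter(graph).static_order())


# The three job files from the slide, without the elided lines.
job_files = {
    "peanutbutter": "type=pig\ncreamy.level=4\nchunky.level=4",
    "jelly": "type=hadoopJava\njelly.type=grape\nsugar=HFCS",
    "bread": "type=command\nbread.type=wheat\ndependencies=peanutbutter,jelly",
}
jobs = {name: parse_job(text) for name, text in job_files.items()}
order = execution_order(jobs)
print(order)  # bread comes last; peanutbutter and jelly may appear in either order
```

The two upstream jobs have no ordering constraint between them, so a scheduler is free to run them in parallel before `bread`.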
  20. Embedded Flows
      • Embed a flow as a node in another flow.
        – “flow” job type
        – Set flow.name to the name of the embedded flow
        – Parameters can be passed to the flow
      Embedded flow (diagram): peanutbutter, jelly, bread
      Example job files:
        coffee.job
          type=hive
          coffee.decaf=false
          coffee.cream=true
          ...
        fruit.job
          type=hadoopJava
          fruit.type=apple
          ...
        sandwich.job
          type=flow
          flow.name=bread
          dependencies=coffee,fruit
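The embedding described on this slide can be pictured as expanding a `type=flow` node into the jobs of the flow it names. The sketch below assumes a flow is identified by its final job (`bread` here) and uses a `parent:child` naming scheme for expanded jobs; both are assumptions made for illustration, and `expand` is a hypothetical helper, not part of Azkaban. It does not detect cycles.

```python
def dependencies(props):
    """Split the comma-separated `dependencies` property into a list."""
    return [d.strip() for d in props.get("dependencies", "").split(",") if d.strip()]


def expand(jobs, end_job, prefix=""):
    """Depth-first walk from end_job, inlining any node whose type is `flow`."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        props = jobs[name]
        for dep in dependencies(props):
            visit(dep)
        if props.get("type") == "flow":
            # Replace the node with the jobs of the embedded flow it names.
            order.extend(expand(jobs, props["flow.name"], prefix + name + ":"))
        else:
            order.append(prefix + name)

    visit(end_job)
    return order


jobs = {
    "peanutbutter": {"type": "pig"},
    "jelly": {"type": "hadoopJava"},
    "bread": {"type": "command", "dependencies": "peanutbutter,jelly"},
    "coffee": {"type": "hive"},
    "fruit": {"type": "hadoopJava"},
    "sandwich": {"type": "flow", "flow.name": "bread", "dependencies": "coffee,fruit"},
}
flat = expand(jobs, "sandwich")
print(flat)
```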
  21. Project Management: Project Page
  22. Running Workflows: Flow Execution Panel
  23. Running Workflows: Notification Options
  24. Running Workflows: Failure Options
  25. Running Workflows: Failure Options
      • Finish Current
        – Finishes the currently running jobs, then stops
      • Cancel All
        – Kills all running jobs and finishes immediately
      • Finish Possible
        – Finishes every job whose dependencies have been met, then fails the flow
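One way to read the three failure options is as a decision about running and pending jobs at the moment a job fails. The sketch below is an interpretation of the slide text only; the option strings, `downstream`, and `apply_failure_option` are hypothetical names, and the five-job graph is made up for the example.

```python
def downstream(deps, failed):
    """Jobs that transitively depend on `failed`. `deps` maps job -> set of parents."""
    blocked = {failed}
    changed = True
    while changed:
        changed = False
        for job, parents in deps.items():
            if job not in blocked and parents & blocked:
                blocked.add(job)
                changed = True
    return blocked - {failed}


def apply_failure_option(deps, failed, running, pending, option):
    """Return which jobs finish, are killed, or may still start after a failure."""
    running, pending = set(running), set(pending)
    if option == "finish_current":
        return {"finish": running, "kill": set(), "start": set()}
    if option == "cancel_all":
        return {"finish": set(), "kill": running, "start": set()}
    if option == "finish_possible":
        # Pending jobs not downstream of the failure can still have their dependencies met.
        return {"finish": running, "kill": set(), "start": pending - downstream(deps, failed)}
    raise ValueError(f"unknown failure option: {option}")


# Made-up graph: a -> c, b -> d, (c, d) -> e. Job a fails while b is running.
deps = {"a": set(), "b": set(), "c": {"a"}, "d": {"b"}, "e": {"c", "d"}}
result = apply_failure_option(deps, "a", running={"b"}, pending={"c", "d", "e"},
                              option="finish_possible")
print(result)
```

In this example only `d` can still start, because `c` and `e` both sit downstream of the failed job.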
  26. Running Workflows: Flow Parameters
  27. Running Workflows: Concurrent Execution Options
  28. Running Workflows: Concurrent Execution Options
      • Skip Executions
        – Prevent concurrent executions
      • Run Concurrently
        – Run the flow concurrently
      • Pipeline
        – Distance 1: jobA waits until the concurrent execution’s jobA finishes
        – Distance 2: jobA waits until the children of the concurrent execution’s jobA finish
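The Pipeline option can be sketched as a gate that a job in the new execution must pass before it starts. This is a reading of the slide, not Azkaban's implementation: `may_start` and its parameters are hypothetical, and the sketch assumes that at distance 2 the earlier jobA itself must also have finished (its children cannot finish before it does).

```python
def may_start(job, children, finished_in_previous, distance):
    """Decide whether `job` may start, given what the earlier execution has finished.

    children maps a job name to the jobs that directly depend on it.
    """
    if distance == 1:
        must_finish = {job}
    elif distance == 2:
        must_finish = {job} | set(children.get(job, ()))
    else:
        raise ValueError("distance must be 1 or 2")
    return must_finish <= set(finished_in_previous)


children = {"jobA": ["jobB", "jobC"]}

# The earlier execution has finished jobA but its children are still running.
print(may_start("jobA", children, {"jobA"}, distance=1))
print(may_start("jobA", children, {"jobA"}, distance=2))
```

A larger distance keeps the new execution further behind the earlier one, at the cost of less overlap between the two runs.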
  29. Running Workflows: Executing Flow Page
  30. Running Workflows: Flow Job List
  31. Scheduling Workflows
  32. Scheduling Workflows: Schedule Flow Panel
  33. Scheduling Workflows: Scheduled Flows
  34. Scheduling Flows: Setting SLAs
  35. Debugging and Tuning
  36. Hadoop at LinkedIn
      • 1000+ users
      • Several clusters
      • 2,500 flows executing per day
      • 30,000 jobs executing per day
  37. Job Execution History
  38. Flow Execution History
  39. Running Workflows: Job Logs
  40. Job Summary
  41. Pig Visualizer
  42. Pig Visualizer
  43. Pig Visualizer
  44. Pig Visualizer
  45. Pig Visualizer
  46. Flow Summary
  47. Flow Summary
  48. Browsing HDFS
  49. HDFS Viewer: Browsing Files
  50. HDFS Viewer: Viewing Files
  51. HDFS Viewer: File Schema
  52. HDFS Viewer: Supported File Types
      • Avro
      • Parquet
      • Binary JSON
      • Sequence File
      • Image
      • Text
  53. Reportal
  54. Reportal: Dashboard
  55. Reportal: New Report
  56. Reportal: Viewing Results
  57. Reportal: Supported Query Types
      • Pig
      • Hive
      • Teradata
  58. Upcoming Features
  59. Azkaban Gradle Plugin and DSL
      • Describe Azkaban flows and deploy with Gradle
      • Single file (more if you want) to describe all your workflows
        – Compiles to .job files
      • Static checker
      • Valid Groovy code
        – Add conditionals for deployment to different clusters
      Example:
        azkaban {
          jobConfDir = './jobs'
          workflow('workflow2') {
            pigJob('job2') {
              script = 'src/main/pig/count-by-country.job'
              parameter 'inputFile', '/user/foo/sample'
              reads '/data/databases/foo', [as: 'input']
              writes '/data/databases/bar', [as: 'output']
            }
            hiveJob('job3') {
              query = 'show tables'
            }
            workflowDepends 'job2', 'job3'
          }
        }
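The idea of compiling one description into several .job files, with a static check and per-cluster conditionals, can be sketched in a few lines. This is not the Gradle plugin and makes no claim about the properties it emits: `check`, `compile_jobs`, `describe`, the cluster names, and the per-cluster value are all hypothetical, and the job properties are reused from the peanutbutter, jelly, and bread example earlier in the deck.

```python
def check(jobs):
    """Static check: every job declares a type and depends only on known jobs."""
    for name, (props, deps) in jobs.items():
        if "type" not in props:
            raise ValueError(f"{name}: missing 'type'")
        for dep in deps:
            if dep not in jobs:
                raise ValueError(f"{name}: unknown dependency '{dep}'")


def compile_jobs(jobs):
    """Render each job as the text of a .job file; returns {filename: contents}."""
    check(jobs)
    files = {}
    for name, (props, deps) in jobs.items():
        lines = [f"{key}={value}" for key, value in props.items()]
        if deps:
            lines.append("dependencies=" + ",".join(deps))
        files[name + ".job"] = "\n".join(lines) + "\n"
    return files


def describe(cluster):
    """One description for all clusters; the conditional value is made up."""
    bread_type = "wheat" if cluster == "prod" else "white"
    return {
        "peanutbutter": ({"type": "pig"}, []),
        "jelly": ({"type": "hadoopJava"}, []),
        "bread": ({"type": "command", "bread.type": bread_type}, ["peanutbutter", "jelly"]),
    }


files = compile_jobs(describe("prod"))
print(files["bread.job"])
```

Running the check before any file is written means a misspelled dependency is caught at build time rather than after upload.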
  60. Future Roadmap
      • New visualizers (Hive, Tez, etc.)
      • Support DSL from other tools
      • Operationalization tooling
      • Scalability improvements
      • Improved plugin interfaces
  61. Future Discussions
      • Conditional branching
      • Hive Metastore browser
      • Pluggable executors (e.g. YARN)
      • Persistence storage server
      • Launching and monitoring long-running YARN applications (Samza, Storm, etc.)
  62. Main Contributors
      • David Chen (LinkedIn)
      • Hien Luu (LinkedIn)
      • Anthony Hsu (LinkedIn)
      • Alex Bain (LinkedIn)
      • Richard Park (RelateIQ)
      • Chenjie Yu (Tango)
      • Shida Li (University of Waterloo)
  63. How to Contribute
      • Website: azkaban.github.io
      • GitHub: github.com/azkaban
      • LinkedIn’s Data Website: data.linkedin.com
