4. 4
Apache DolphinScheduler Introduction
1. Created at Analysys in 2017.
2. Open-sourced in March 2019; entered the Apache Incubator in August 2019.
3. Built to untangle complex dependencies in data processing: it assembles
tasks into a DAG, monitors task status in real time, and supports
retrying, resuming from a specified task, suspending, and terminating
tasks.
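As a rough illustration of what "assembling tasks in a DAG" means, the sketch below derives an execution order from task dependencies. It uses only the Python standard library and is not DolphinScheduler code; the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

def run_order(dag):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(workflow)
print(order)
```

A scheduler would additionally track the state of each node (running, failed, paused) so that a run can be resumed from a specified task.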
6. 6
Pain points
• Visual DAG
- Simple and easy to operate
- View task status in real time
- Visual task log
• High availability
- Workflow fault tolerance
- Failed-task retry, rollback, and transfer
- Easy maintenance
• Dependency
- Task self-dependency
- Workflow dependency, and so on
• Alert mechanism
- Alert plugins: mail/sms/wechat…
- Warning
• Multiple task types
- Cross-language
- Custom task plugins
- Easy to extend
• Complement
- Re-run (refresh) historical data
11. 11
Advantages
• High reliability
- Decentralized multi-Master and multi-Worker design, self-supporting HA
- Task queue to avoid overload
- Fault-tolerant capability
• Simple and easy
- Process definitions are visualized through drag and drop
- Open API
- One-click deployment
• Rich usage scenarios
- Supports pause and resume operations
- Supports multi-tenancy
- Supports more task types, such as spark, hive, mr, python, sub_process, shell
• High scalability
- Supports custom task types
- Scheduling capacity grows linearly with the cluster
- Master and Worker can go online and offline dynamically
12. 12
Main capabilities
• Workflows can be started on a timer, by dependency, or manually, and can be paused, stopped, and resumed
• Tasks are associated in DAG form
• Real-time monitoring of task status
• Supports more than 10 task types, such as Shell, MR, Spark, SQL, and the dependency task type
• Workflow priority and task priority
• Global parameters and local parameters
• Complete system monitoring, with alarms on task timeout and failure
• Supports multi-tenancy, online log viewing, and online resource management
• Supports stable operation of 100,000 data tasks per day
• The decentralized design ensures the stability and high availability of the system
13. 13
Process definition: visualized drag-and-drop configuration
1. Visualized drag-and-drop
2. Supports multiple task types, including Shell, DataSource, Spark, Flink, MR, Python, Http, ChildProcess, and Task Dependency
3. Child Process
• Reuse existing workflows as building blocks and avoid repeated configuration
16. 16
Task management: multi-granularity monitoring of task status
• Tracking of task execution status
• Task status statistics
• Process instance status view
• Online task execution log
17. 17
Data source management: configure multiple data sources visually
1. Supported data sources include MySQL, PostgreSQL, Hive, Impala, Spark, ClickHouse, Oracle, SQL Server, DB2, and MongoDB.
2. Pluggable data source extension
3. Visual data source management: configure once, use everywhere.
18. 18
Workflow startup management
Task failure strategy:
1. Continue
2. End
Notification strategy:
1. Success
2. Failure
3. All
4. None
Workflow Priority
Complement data:
1. Serial execution
2. Parallel execution
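The difference between serial and parallel complement (re-running a workflow for historical dates) can be sketched as follows. This is a minimal illustration, not DolphinScheduler's implementation; `run_workflow`, `fake_workflow`, and `max_workers` are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def complement_dates(start, end):
    """List every day from start to end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]

def run_complement(dates, run_workflow, parallel=False, max_workers=4):
    """Re-run run_workflow once per date, either serially or in parallel."""
    if not parallel:
        # Serial execution: one date finishes before the next starts.
        return [run_workflow(d) for d in dates]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though runs overlap in time.
        return list(pool.map(run_workflow, dates))

def fake_workflow(day):
    # Hypothetical stand-in for a real workflow run.
    return f"processed {day.isoformat()}"

dates = complement_dates(date(2020, 1, 1), date(2020, 1, 3))
serial = run_complement(dates, fake_workflow)
parallel = run_complement(dates, fake_workflow, parallel=True)
```

Serial execution suits workflows where each day depends on the previous day's output; parallel execution shortens the backfill when the days are independent.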
24. 24
DolphinScheduler 1.3 Feature – K8S Support
Advantages:
1. Elastic scaling
2. Make full use of server resources
3. Environmental isolation
Disadvantage:
Requires K8S maintenance experience
Cloud native is the trend
25. 25
DolphinScheduler 1.3 Other Features
Batch export and import of workflows
Copying of process definitions
Deleting a process instance cascades to delete its task logs
Simplified configuration and an optimized deployment experience
27. 27
DolphinScheduler 1.3 New Architecture
Reduce the pressure on the database
• Workers no longer perform DB operations (single responsibility)
• Master and Worker communicate directly to reduce latency
• Master offers multiple strategies for distributing tasks
- Random, round-robin, and linear maximum based on CPU & memory
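The three distribution strategies can be sketched roughly as below. This is an illustration only: the `Worker` class and function names are hypothetical, and the resource-based scoring (an equal-weight sum of CPU and memory load, picking the least-loaded worker) is an assumption, since the slide does not define the exact formula.

```python
import itertools
import random
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    cpu_load: float  # fraction of CPU in use, 0.0 to 1.0
    mem_load: float  # fraction of memory in use, 0.0 to 1.0

def pick_random(workers, rng=random):
    """Random strategy: any worker with equal probability."""
    return rng.choice(workers)

def make_round_robin(workers):
    """Round-robin strategy: returns a function that cycles through workers in order."""
    cycle = itertools.cycle(workers)
    return lambda: next(cycle)

def pick_least_loaded(workers):
    """Resource-based strategy (assumed scoring: CPU load plus memory load)."""
    return min(workers, key=lambda w: w.cpu_load + w.mem_load)

workers = [Worker("w1", 0.9, 0.5), Worker("w2", 0.2, 0.3), Worker("w3", 0.4, 0.4)]
next_worker = make_round_robin(workers)
round_robin_names = [next_worker().name for _ in range(4)]
least_loaded = pick_least_loaded(workers)
random_choice = pick_random(workers, random.Random(0))
```

Random and round-robin need no load information, while a resource-based strategy relies on workers reporting CPU and memory usage to the master.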
29. 29
Experience: Priority
Question: no priority design, only a fair scheduling design:
• A task submitted first may complete at the same time as a task submitted later
• Low-priority services run first, occupying resources without releasing them
Solution: order tasks by
process priority > process instance sequence > task instance priority > task instance sequence
Default: FIFO
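The four-level ordering in the solution can be modelled as a tuple comparison in a priority queue. The sketch below is a minimal illustration, not the project's implementation; it assumes that a smaller number means higher priority and an earlier sequence, and the class and field names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedTask:
    # Fields are compared in declaration order, matching:
    # process priority > process instance sequence >
    # task instance priority > task instance sequence.
    process_priority: int       # assumption: 0 is the highest priority
    process_instance_seq: int   # earlier process instances have smaller numbers
    task_priority: int          # assumption: 0 is the highest priority
    task_instance_seq: int      # earlier task instances have smaller numbers
    name: str = field(compare=False)

queue = []
heapq.heappush(queue, QueuedTask(1, 2, 0, 5, "d"))  # low-priority process
heapq.heappush(queue, QueuedTask(0, 3, 1, 7, "c"))  # same process as "b", lower task priority
heapq.heappush(queue, QueuedTask(0, 3, 0, 8, "b"))  # high-priority process, later instance
heapq.heappush(queue, QueuedTask(0, 1, 2, 9, "a"))  # high-priority process, earliest instance

dispatch_order = [heapq.heappop(queue).name for _ in range(4)]
print(dispatch_order)
```

When every priority is equal, the sequence numbers alone decide the order, which reduces to the FIFO default.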
31. 31
Experience: Data component integration
The current 10+ task types may not meet business demand:
• Data sync task
• Kettle task
• Data quality
• SQL task
• Procedure task
• Business task
• ...
Solution:
• Task plugins
• Hot-pluggable
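The task plugin idea can be sketched as a registry that maps a task type name to a handler, so new task types are added without touching the scheduler core. This is an illustrative sketch only; the names `TASK_PLUGINS`, `task_plugin`, and `run_task` are hypothetical and do not reflect DolphinScheduler's actual plugin SPI.

```python
from typing import Callable, Dict

# Registry of task type name -> handler; filled in as plugins are loaded.
TASK_PLUGINS: Dict[str, Callable[[dict], str]] = {}

def task_plugin(task_type):
    """Decorator that registers a handler under the given task type."""
    def register(func):
        TASK_PLUGINS[task_type] = func
        return func
    return register

@task_plugin("data_sync")
def run_data_sync(params):
    return f"sync {params['source']} -> {params['target']}"

@task_plugin("data_quality")
def run_data_quality(params):
    return f"check {params['table']}"

def run_task(task_type, params):
    """Dispatch a task to its plugin, failing clearly for unknown types."""
    try:
        plugin = TASK_PLUGINS[task_type]
    except KeyError:
        raise ValueError(f"no plugin registered for task type {task_type!r}") from None
    return plugin(params)

result = run_task("data_sync", {"source": "mysql", "target": "hive"})
```

Because registration happens when a plugin module is loaded, a new task type becomes available as soon as its module is imported, which is the property "hot pluggable" refers to.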
33. 33
Practice of DolphinScheduler in Analysys
• Analysys Qianfan is an app benchmarking analysis product.
• Qianfan is a SaaS product that processes tens of billions of records every day, covers 620 million monthly active users, and runs on a 6.8 PB big data cluster, with tens of thousands of ETL tasks processed daily.
• In 2018, we started using DolphinScheduler to schedule the entire ETL process.
• The picture on the right shows one of the workflows.
34. 34
Practice of DolphinScheduler in Baosight
Extensions implemented by Baosight:
• Plugin type task
• Resource cache
• SQL function extension
• Message triggered scheduling
• Multiple data source access
• Workflow concurrency control
• Operation audit
• Alert optimization
• Configuration management
• Access control
• Operational data archiving
35. 35
Practice of DolphinScheduler in Qianxin
Why DolphinScheduler?
• Online management of resource files
- No need to worry about losing the jar
• Cluster high availability
- Decentralization
• Multi-tenant support
- We can't all use the same account
• Privilege management
- Users can only access authorized projects and resources
• Complex scheduling
- cron, dependent, and manual scheduling
• Multiple task types
- spark, shell, mr, hive, python…
• Visualization
- Drag and drop to generate the DAG workflow
• Distributed and easy to extend
- No single point of failure
- Extend when resources are insufficient
• Task failure retry/alarm
- Retry times? Interval? Email?
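The "retry times? interval? email?" questions correspond to three settings of a retry policy. A minimal sketch, assuming hypothetical parameter names (`retry_times`, `interval_seconds`, `on_failure`) that are not taken from DolphinScheduler:

```python
import time

def run_with_retry(task, retry_times=3, interval_seconds=1.0,
                   on_failure=None, sleep=time.sleep):
    """Run task, retrying up to retry_times times with a fixed interval.

    If every attempt fails, call on_failure (for example, an email alert)
    with the last error and then re-raise that error.
    """
    last_error = None
    for attempt in range(retry_times + 1):  # first attempt plus the retries
        try:
            return task()
        except Exception as exc:
            last_error = exc
            if attempt < retry_times:
                sleep(interval_seconds)
    if on_failure is not None:
        on_failure(last_error)
    raise last_error

# Hypothetical task that fails twice and then succeeds.
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = run_with_retry(flaky, retry_times=3, interval_seconds=0,
                        sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the example fast to run; a real scheduler would wait for the configured interval between attempts.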
37. 37
DolphinScheduler Roadmap Draft
• Master refactor: API communicates with the Master, event-driven design, etc.
• Task parameter transfer
• Task Plugin (doing)
• Concurrency control of tasks
• Workflow trigger
• Data quality
• List dependency (upstream dependency)
• Support multi-cluster online release
• Workflow version management
• Permission redesign
• Easy to use
If you have more suggestions for the roadmap, please discuss them on the dev mailing list.
38. 38
Open source history
• 2017.12: Architecture design
• 2018.05: Internal use; the Qianfan product uses DolphinScheduler
• Decided to open source; refactoring took 2 months
• 2019.02: External seed users
• 2019.03: Officially open sourced on March 30th (1.0.0)
• 2019.05: Released versions 1.0.1, 1.0.2, 1.0.3
• 2019.08: Entered the Apache Incubator
• 2019.12: First Apache version, 1.2.0
• …: 1.3.2
41. 41
DolphinScheduler Resources
Online demo: http://106.75.43.194:8888/
Website: https://dolphinscheduler.apache.org
GitHub: https://github.com/apache/incubator-dolphinscheduler
Get help:
• Submit an issue
• Send a mail to dev-subscribe@dolphinscheduler.apache.org and follow the reply to subscribe to the mailing list.
Welcome to join the community. Contributing to open source starts with submitting your first PR.
- Look for issues marked "easy to fix", or any very simple issue, and submit a PR