How does Apache DolphinScheduler (Incubator) support scheduling 100,000-level data tasks?

Lidong Dai
Apache DolphinScheduler PPMC & Committer

2021/1/24
CONTENTS
01 Introduction
02 Pain points
03 Community
04 Advantages
05 Architecture iteration
06 Use cases
07 Roadmap
08 Related resources

PART 1
DolphinScheduler Introduction

Apache DolphinScheduler Introduction
1. Created at Analysys in 2017.
2. Open sourced in March 2019; joined the Apache Incubator in August 2019.
3. Dedicated to solving complex dependencies in data processing. It assembles tasks into a DAG, monitors task status in real time, and supports operations such as retrying, resuming from a specified task, suspending, and terminating tasks.
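To make the DAG idea concrete, here is a minimal Python sketch, not DolphinScheduler's actual implementation. The task names are hypothetical, and "resuming from a specified task" is assumed here to mean rerunning that task plus everything downstream of it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_order(dag):
    """Return one execution order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


def resume_from(dag, start):
    """Return `start` and all of its downstream tasks, in execution order.

    Assumption: resuming means rerunning the chosen task and its descendants.
    """
    order = run_order(dag)
    selected = {start}
    for task in order:
        # A task is downstream if any of its upstream tasks is already selected.
        if dag[task] & selected:
            selected.add(task)
    return [task for task in order if task in selected]
```

Calling `resume_from(workflow, "clean")` leaves out `extract`, since nothing upstream of the chosen task needs to run again.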

PART 2
Pain points

Pain points
Visual DAG
• Simple and easy to operate
• View task status in real time
• Visual task log
High availability
• Workflow fault tolerance
• Failed retry, rollback, transfer
• Easy maintenance
Dependency
• Task self-dependency
• Workflow dependency and so on
Alert mechanism
• Alert plugin: mail/sms/wechat…
• Warning
Multi task types
• Cross language
• Custom task plugin
• Easy to extend
Complement
• Refresh historical data

DolphinScheduler Community Construction
Community over code
Contributor company distribution: Dmall, ByteDance, Analysys, Tencent, JD, Pingan, HUAWEI, MoMo, GuanData, Alibaba, FunPlus, XiaoMi, DiDi, 10086, LiZhi, 360, LY
• Code Contributor
• Document Contributor

PART 4
Advantages

Advantages
Simple and Easy
• Process definitions are visualized through drag and drop
• Open API
• One-click deployment
High reliability
• Decentralized multi-Master and multi-Worker, self-supporting HA
• Task queue to avoid overload
• Fault-tolerant capability
Rich usage scenarios
• Support pause and resume operations
• Support multi-tenancy
• Support more task types, such as spark, hive, mr, python, sub_process, shell
High scalability
• Support custom task types
• Scheduling capacity grows linearly with the cluster
• Master and Worker support dynamic online and offline

Main capabilities
• Workflows can be triggered on a schedule, by dependency, or manually, and can be paused, stopped, and resumed
• Tasks are associated in DAG form
• Real-time monitoring of task status
• Supports more than 10 task types, such as Shell, MR, Spark, SQL, and the dependency task type
• Workflow priority and task priority
• Global parameters and local parameters
• Complete system monitoring, with alarms on task timeout and failure
• Supports multi-tenancy, online log viewing, and online resource management
• Supports stable operation of 100,000 data tasks per day
• The decentralized design ensures the stability and high availability of the system

Process definition: visualized drag-and-drop configuration
1. Visualized drag-and-drop
2. Supports multiple task types, including Shell, DataSource, Spark, Flink, MR, Python, Http, ChildProcess, and Task Dependency
3. Child Process
• Reuse workflows as building blocks and avoid repeated configuration

Visualization of workflow running process

Visualize task rerun, retry and task execution

Task management: multi-granularity monitoring of task status
• Tracking of task execution status
• Task status data statistics
• Process instance status view
• Online task execution log

Data source management: visually configure multiple data sources
1. Supported data sources include MySQL, PostgreSQL, Hive, Impala, Spark, ClickHouse, Oracle, SQL Server, DB2, and MongoDB.
2. Pluggable data source extension
3. Visual data source management: configure once, use everywhere.

Workflow startup management
• Task failure strategy:
1. Continue
2. End
• Multi notification strategy:
1. Success
2. Failure
3. All
4. None
• Workflow priority
• Complement data:
1. Serial execution
2. Parallel execution
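The two task failure strategies can be illustrated with a small Python sketch. This is a simplified, sequential model under stated assumptions rather than DolphinScheduler's code: "continue" is taken to mean that branches unaffected by the failure keep running, "end" is taken to mean the workflow stops at the first failure, and the `run_workflow` function, task names, and status strings are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(dag, actions, failure_strategy="continue"):
    """Run tasks in dependency order and return a {task: status} mapping.

    dag:     task -> set of upstream tasks
    actions: task -> zero-argument callable returning True on success
    """
    status = {}
    for task in TopologicalSorter(dag).static_order():
        # A task cannot run unless every upstream task succeeded.
        if any(status.get(upstream) != "success" for upstream in dag[task]):
            status[task] = "skipped"
            continue
        status[task] = "success" if actions[task]() else "failure"
        if status[task] == "failure" and failure_strategy == "end":
            break  # assumed meaning of "End": stop scheduling further tasks
    return status


# Hypothetical workflow with two independent branches: a -> c and b -> d.
dag = {"a": set(), "b": set(), "c": {"a"}, "d": {"b"}}
actions = {
    "a": lambda: False,  # this task fails
    "b": lambda: True,
    "c": lambda: True,
    "d": lambda: True,
}

continued = run_workflow(dag, actions, failure_strategy="continue")
ended = run_workflow(dag, actions, failure_strategy="end")
```

With "continue", the `b -> d` branch still completes while `c` is skipped; with "end", neither `c` nor `d` is started.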

DolphinScheduler 1.3 Feature – DataX
Custom Template

DolphinScheduler 1.3 Feature – Sqoop

DolphinScheduler 1.3 Feature – Condition Task

DolphinScheduler 1.3 Feature – Ambari Plugin

DolphinScheduler 1.3 Feature – K8S Support
Advantages:
1. Elastic scaling
2. Make full use of server resources
3. Environment isolation
Disadvantage:
Requires K8S maintenance experience
Cloud native is the trend

DolphinScheduler 1.3 Other Features
• Batch export and import of workflows
• Process definition copy
• Deleting a process instance cascade-deletes its task logs
• Simplified configuration and an optimized deployment experience

PART 5
Architecture iteration

DolphinScheduler 1.3 New Architecture
Reduce the pressure on the database
• Workers no longer perform DB operations (single responsibility)
• Master and Worker communicate directly to reduce latency
• Master supports multiple strategies for distributing tasks
- Random, round-robin, and linear maximum based on CPU & memory
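The three distribution strategies can be sketched in Python as follows. This is an illustration only: the `Worker` class, the function names, and the equal CPU/memory weights are assumptions, and the load-based selector is one plausible reading of "linear maximum based on CPU & memory" (a linear combination of the two usages, picking the worker with the most headroom), not the project's actual formula.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class Worker:
    host: str
    cpu_usage: float     # fraction of CPU in use, 0.0 to 1.0
    memory_usage: float  # fraction of memory in use, 0.0 to 1.0


def select_random(workers, rng=random):
    """Random strategy: every worker is equally likely."""
    return rng.choice(workers)


def make_round_robin(workers):
    """Round-robin strategy: returns a function that cycles through workers."""
    cycle = itertools.cycle(workers)
    return lambda: next(cycle)


def select_least_loaded(workers, cpu_weight=0.5, memory_weight=0.5):
    """Load-based strategy (assumed): lowest weighted CPU + memory usage wins."""
    return min(
        workers,
        key=lambda w: cpu_weight * w.cpu_usage + memory_weight * w.memory_usage,
    )


# Hypothetical cluster state.
workers = [
    Worker("worker-1", cpu_usage=0.9, memory_usage=0.5),
    Worker("worker-2", cpu_usage=0.2, memory_usage=0.3),
    Worker("worker-3", cpu_usage=0.4, memory_usage=0.9),
]
next_worker = make_round_robin(workers)
```

Random and round-robin need no knowledge of worker state, while the load-based selector depends on workers reporting their CPU and memory usage to the Master.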

DolphinScheduler 1.3 workflow activity diagram

Experience: Priority
Question: with no priority design and a fair scheduling design:
• A task submitted first may complete at the same time as a task submitted later
• Low-priority services run first, occupying resources without releasing them
Solution: order tasks by process priority > process instance sequence > task instance priority > task instance sequence
Default: FIFO
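The four-level ordering above maps naturally onto a priority queue keyed by a tuple. The following Python sketch is illustrative only: the `TaskQueue` class and its parameter names are hypothetical, and it assumes that a smaller number means a higher priority and that instance ids increase in submission order.

```python
import heapq


class TaskQueue:
    """Orders tasks by (process priority, process instance sequence,
    task priority, task instance sequence); smaller values are served first."""

    def __init__(self):
        self._heap = []

    def submit(self, process_priority, process_instance_id,
               task_priority, task_instance_id, name):
        key = (process_priority, process_instance_id,
               task_priority, task_instance_id)
        heapq.heappush(self._heap, (key, name))

    def take(self):
        """Remove and return the name of the next task to dispatch."""
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)


queue = TaskQueue()
queue.submit(2, 1, 0, 1, "low-priority-process-task")
queue.submit(1, 2, 3, 2, "high-priority-process-late-task")
queue.submit(1, 2, 1, 3, "high-priority-process-urgent-task")
queue.submit(1, 2, 1, 4, "high-priority-process-urgent-task-2")

dispatch_order = [queue.take() for _ in range(len(queue))]
```

When all priorities are equal, the instance sequence numbers decide the order, which gives the FIFO default.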

Experience: Task Dependency
Task dependency and task dependency check
For example, daily process A depends on the previous day's runs of hourly process B
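The day-on-hour example can be expressed as a small dependency check. This Python sketch is an assumption-laden illustration: it assumes the daily run needs all 24 hourly runs of the previous day to have succeeded, and the function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def required_hourly_runs(run_date):
    """Hourly slots of process B that daily process A on `run_date` depends on."""
    previous_day = run_date - timedelta(days=1)
    start = datetime(previous_day.year, previous_day.month, previous_day.day)
    return [start + timedelta(hours=hour) for hour in range(24)]


def dependency_satisfied(run_date, successful_runs):
    """True only if every required hourly run is in `successful_runs`."""
    return all(slot in successful_runs for slot in required_hourly_runs(run_date))


# Hypothetical state: all hourly runs on 2021-01-23 succeeded.
all_done = set(required_hourly_runs(date(2021, 1, 24)))
# Same state, but the 05:00 run is missing.
one_missing = all_done - {datetime(2021, 1, 23, 5)}
```

The scheduler would hold back daily process A until `dependency_satisfied` returns True for its run date.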

Experience: Data component integration
Question: the current 10+ task types may not meet business demands, for example:
• data sync task
• kettle task
• data quality
• SQL task
• procedure task
• business task
• ...
Solution: task plugins, hot pluggable
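A task plugin mechanism can be sketched as a registry that maps a task type name to an implementation. The Python example below is a generic illustration of the idea, not DolphinScheduler's plugin API: `TASK_PLUGINS`, `register_task`, `EchoTask`, and `execute` are all hypothetical names.

```python
# Registry of task type name -> plugin class. New entries can be added at
# runtime, which is the essence of a hot-pluggable design.
TASK_PLUGINS = {}


def register_task(task_type):
    """Class decorator that registers a plugin under `task_type`."""
    def decorator(cls):
        TASK_PLUGINS[task_type] = cls
        return cls
    return decorator


@register_task("echo")
class EchoTask:
    """Toy plugin: returns the configured message."""

    def __init__(self, params):
        self.params = params

    def run(self):
        return self.params["message"]


def execute(task_type, params):
    """Look up the plugin for `task_type` and run it with `params`."""
    try:
        plugin_cls = TASK_PLUGINS[task_type]
    except KeyError:
        raise ValueError(f"no plugin registered for task type {task_type!r}")
    return plugin_cls(params).run()
```

The scheduler core only knows the `run()` contract, so a data sync or data quality task can be added without changing the core.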

PART 6
Use cases

Practice of DolphinScheduler in Analysys
• Analysys Qianfan is an app benchmarking analysis product.
• Qianfan is a SaaS product that processes tens of billions of records every day, covering 620 million monthly active users and a 6.8 PB big data cluster, through tens of thousands of ETL tasks per day.
• In 2018, we started using DolphinScheduler to schedule the entire ETL process.
• The picture on the right shows one of the workflows.

Practice of DolphinScheduler in Baosight
Extensions implemented by Baosight:
• Plugin type task
• Resource cache
• SQL function extension
• Message triggered scheduling
• Multiple data source access
• Workflow concurrency control
• Operation audit
• Alert optimization
• Configuration management
• Access control
• Operational data archiving

Practice of DolphinScheduler in Qianxin
Why DolphinScheduler?
• Online management of resource files: don't worry about losing the jar
• Cluster high availability: decentralization
• Support multi-tenant: we can't use the same account
• Privilege management: can only access authorized projects and resources
• Complex scheduling: cron, dependent, and manual scheduling
• Multi task types: spark, shell, mr, hive, python…
• Visualization: drag and drop to generate the DAG workflow
• Distributed & easy to extend: no single point of issue; extend when resources are insufficient
• Task failure retry/alarm: retry times? interval? email?

PART 7
Roadmap

DolphinScheduler Roadmap Draft
• Master refactor: API communicates with Master, event-driven, etc.
• Task parameter transfer
• Task plugin (in progress)
• Concurrency control of tasks
• Workflow trigger
• Data quality
• List dependency (upstream dependency)
• Support multi-cluster online release
• Workflow version management
• Permission redesign
• Easy to use
If you have more suggestions for the roadmap, please discuss them on the dev mailing list

Open source history
Dates on the timeline: 2017.12, 2018.05, 2019.02, 2019.03, 2019.05, 2019.08, 2019.12, …
Milestones:
• Architecture design
• Internal use: the Qianfan product uses DolphinScheduler
• Decided to open source; spent 2 months refactoring
• External seed users
• Officially open sourced on March 30th (1.0.0)
• Released versions 1.0.1, 1.0.2, 1.0.3
• Entered the Apache Incubator
• First Apache version: 1.2.0
• … 1.3.2

DolphinScheduler Slogan
SUCCESS
• Choose good tools
• Use the right scheduler
• Sleep very well
• Go home early

PART 8
Resources

DolphinScheduler Resources
• Online demo: http://106.75.43.194:8888/
• Website: https://dolphinscheduler.apache.org
• GitHub: https://github.com/apache/incubator-dolphinscheduler
Get help:
• Submit an issue
• Mail dev-subscribe@dolphinscheduler.apache.org and follow the reply to subscribe to the mailing list.
Welcome to join the community. Contributing to open source starts with submitting your first PR.
- Try to find an issue marked "easy to fix", or some other very easy issue, and submit a PR
