2. About me
- Data Engineer
@Data Science & Technology, Cathay Financial Holdings
- Former one-stop engineer for data science (Manufacturing)
- Former Chemical Engineer
- Polymer material, Genetic engineering, Bacterial fermentation
- D4SG (Data for Social Good) #4, winter 2018
- First prize, Genius For Home competition, MediaTek, 2018
- : orcahmlee
29. Modin vs. Dask DataFrame vs. Koalas
- Dask DataFrame and Koalas
  - Lazy execution
  - Support row-oriented partitioning and parallelism
- Modin
  - Eager execution
  - Supports row-, column-, and cell-oriented partitioning and parallelism
  - If an API is not supported yet, Modin falls back to running it in default-to-pandas mode
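To make the lazy vs. eager distinction concrete, here is a minimal pure-Python sketch. It does not use Modin, Dask, or Koalas; `EagerColumn`, `LazyColumn`, `double`, and `calls` are hypothetical names invented for illustration only.

```python
from typing import Callable, Iterable, List, Tuple

calls: List[str] = []  # records each time real work is done


def double(x: int) -> int:
    calls.append(f"double({x})")
    return x * 2


class EagerColumn:
    """Toy eager container: map() does the work immediately."""

    def __init__(self, values: Iterable[int]) -> None:
        self.values = list(values)

    def map(self, fn: Callable[[int], int]) -> "EagerColumn":
        return EagerColumn([fn(v) for v in self.values])


class LazyColumn:
    """Toy lazy container: map() only records the step; compute() runs it."""

    def __init__(
        self,
        values: Iterable[int],
        steps: Tuple[Callable[[int], int], ...] = (),
    ) -> None:
        self._values = list(values)
        self._steps = tuple(steps)

    def map(self, fn: Callable[[int], int]) -> "LazyColumn":
        # Nothing is executed here; the step is appended to the plan.
        return LazyColumn(self._values, self._steps + (fn,))

    def compute(self) -> List[int]:
        out = list(self._values)
        for fn in self._steps:
            out = [fn(v) for v in out]
        return out
```

With the lazy container, `calls` stays empty until `compute()` is invoked; with the eager one, `calls` is filled as soon as `map()` returns.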
36. Actor Model
- What the Actor Model is and why to use it
- Related languages/frameworks that implement the Actor Model:
  - Erlang, RabbitMQ, Akka
- Super useful references:
  - https://blog.techbridge.cc/2019/06/21/actor-model-in-web/
  - [COSCUP 2011] Programming for the Future, Introduction to the Actor Model and Akka Framework
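As a rough illustration of the idea, an actor can be sketched as private state plus a mailbox that is drained by a single worker. The example below uses only the Python standard library and is not how Erlang, Akka, or Ray implement actors; `CounterActor` and its message names are hypothetical.

```python
import queue
import threading
from typing import Optional, Tuple


class CounterActor:
    """Toy actor: private state, a mailbox, one thread handling messages in order."""

    def __init__(self) -> None:
        self._count = 0  # only ever touched by the actor's own thread
        self._mailbox: "queue.Queue[Optional[Tuple[str, Optional[queue.Queue]]]]" = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            message = self._mailbox.get()
            if message is None:  # stop signal
                break
            kind, reply_to = message
            if kind == "increment":
                self._count += 1
            elif kind == "get" and reply_to is not None:
                reply_to.put(self._count)

    def increment(self) -> None:
        # Fire-and-forget: the caller never touches the state directly.
        self._mailbox.put(("increment", None))

    def get(self) -> int:
        reply: "queue.Queue[int]" = queue.Queue(maxsize=1)
        self._mailbox.put(("get", reply))
        return reply.get()

    def stop(self) -> None:
        self._mailbox.put(None)
        self._thread.join()
```

Because every message goes through one mailbox and one thread, many senders can call `increment()` concurrently without a lock around the counter.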
45. Architecture - System Layer
The system layer consists of three major components:
- Global Control Store (GCS)
- Bottom-Up Distributed Scheduler
- In-Memory Distributed Object Store
Ray: A Distributed Framework for Emerging AI Applications
48. Global Control Store
- Maintains fault tolerance and low latency
- Enables every component in the system to be stateless
- Key-value store with pub-sub functionality
  - < v1.11.0: uses Redis
  - >= v1.11.0: no longer starts Redis by default
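A key-value store with pub-sub functionality can be sketched in a few lines. This is an in-memory toy that assumes a single process; it is neither Ray's GCS nor Redis, and `ControlStore`, `subscribe`, `put`, and `get` are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Callback = Callable[[str, Any], None]


class ControlStore:
    """Toy key-value store that notifies subscribers on every write to a key."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, key: str, callback: Callback) -> None:
        self._subscribers[key].append(callback)

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        # Publish the new value to everyone watching this key.
        for callback in list(self._subscribers.get(key, [])):
            callback(key, value)

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)
```

Since all state lives in the store, a component that restarts can read the current value with `get` and re-subscribe, which is the sense in which the other components can stay stateless.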
52. Global Control Store
Fault tolerance
- Decouples the durable lineage storage from other system components
- Heartbeat table, Job table, Actor table
53. Global Control Store
Low latency
- Centralized schedulers couple task scheduling and task dispatch (Dask, Spark, CIEL)
- Involving the scheduler in each object transfer is prohibitively expensive
- Ray stores the object’s metadata in the GCS rather than in the scheduler, fully decoupling task dispatch from task scheduling
56. Bottom-Up Distributed Scheduler
Existing cluster computing frameworks:
- Centralized schedulers: provide locality but at latencies in the tens of ms (Spark, CIEL, Dryad)
- Distributed schedulers: can achieve high scale, but they either don’t consider data locality (work stealing), or assume tasks belong to independent jobs (Sparrow), or assume the computation graph is known (Canary)
60. In-Memory Distributed Object Store
- Plasma: A High-Performance Shared-Memory Object Store
- Plasma was initially developed as part of Ray and is being developed as part of Apache Arrow
- On each node, Ray implements the object store via shared memory. This allows zero-copy data sharing between tasks running on the same node
- Plasma holds immutable objects in shared memory
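The shared-memory idea can be tried with Python's standard `multiprocessing.shared_memory` module. This is only a sketch of the general technique, not Plasma's implementation; `roundtrip` is a hypothetical helper, and for brevity the writer and reader handles live in the same process, whereas in practice they would belong to different worker processes on the same node.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes) -> bytes:
    """Write payload into a shared segment, then read it via a second handle."""
    # Segment size must be positive, so reserve at least one byte.
    writer = shared_memory.SharedMemory(create=True, size=max(1, len(payload)))
    try:
        writer.buf[: len(payload)] = payload
        # A second handle attached by name maps the same memory.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            # Slicing buf yields a memoryview into the segment: no bytes copied.
            view = reader.buf[: len(payload)]
            try:
                # Copy out only so the result outlives the segment.
                return bytes(view)
            finally:
                view.release()  # views must be released before close()
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()  # free the segment
```

A reader that works directly on `view` (for example by wrapping it in an array type) avoids the final copy entirely, which is the benefit the slide refers to.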
61. In-Memory Distributed Object Store
- To minimize task latency, Plasma is used to store the inputs and outputs of every task, or stateless computation.
- For low latency, Ray keeps objects entirely in memory and evicts them as needed to disk using an LRU policy
- Small objects (<100 KiB): stored in the in-process object store
- Large objects: stored in the shared-memory object store
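The LRU eviction policy can be sketched with an `OrderedDict`. This is a toy model under stated assumptions, not Ray's object store: `LruObjectStore` is a hypothetical class, and the `spilled` dictionary stands in for disk.

```python
from collections import OrderedDict
from typing import Dict, List


class LruObjectStore:
    """Toy byte-capacity store that evicts least recently used objects."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._memory: "OrderedDict[str, bytes]" = OrderedDict()
        self.spilled: Dict[str, bytes] = {}  # stand-in for disk

    def put(self, object_id: str, data: bytes) -> None:
        if object_id in self._memory:
            raise ValueError("objects are immutable; id already stored")
        if len(data) > self.capacity_bytes:
            raise ValueError("object larger than store capacity")
        # Evict from the least recently used end until the new object fits.
        while self.used_bytes + len(data) > self.capacity_bytes:
            victim_id, victim = self._memory.popitem(last=False)
            self.used_bytes -= len(victim)
            self.spilled[victim_id] = victim
        self._memory[object_id] = data
        self.used_bytes += len(data)

    def get(self, object_id: str) -> bytes:
        if object_id in self._memory:
            self._memory.move_to_end(object_id)  # mark as recently used
            return self._memory[object_id]
        data = self.spilled.pop(object_id)  # KeyError if unknown
        self.put(object_id, data)  # restore, possibly evicting others
        return data

    def in_memory(self) -> List[str]:
        return list(self._memory)
```

Reading an object moves it to the "recent" end, so objects that are never read again are the first to be pushed out when space runs short.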
64. In-Memory Distributed Object Store
Object spilling and persistence
- Objects are spilled to external storage once the capacity of the object store is used up (v1.3+)
- Two types of external storage are supported by default
- For local storage, the OS would run out of inodes very quickly. If objects are smaller than 100 MB, Ray fuses objects into a single file to avoid this problem
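Fusing many small objects into one file amounts to writing them back to back and remembering each object's offset and size. The sketch below shows that general technique only; the file layout and the `fuse_objects` and `read_object` helpers are assumptions for illustration, not Ray's actual spill format.

```python
import os
import tempfile
from typing import Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # object id -> (offset, size)


def fuse_objects(objects: Dict[str, bytes], path: str) -> Index:
    """Write all objects into one file and return where each one lives."""
    index: Index = {}
    with open(path, "wb") as f:
        for object_id, data in objects.items():
            index[object_id] = (f.tell(), len(data))
            f.write(data)
    return index


def read_object(path: str, index: Index, object_id: str) -> bytes:
    """Read a single object back without loading the whole file."""
    offset, size = index[object_id]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fused_path = os.path.join(tmp, "fused.bin")
        idx = fuse_objects({"a": b"xy", "b": b"hello"}, fused_path)
        print(idx, read_object(fused_path, idx, "b"))
```

One file then consumes one inode regardless of how many objects it holds, at the cost of keeping the offset index somewhere.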
65. In-Memory Distributed Object Store
Fault Tolerance
- Ray recovers any needed objects through lineage re-execution. The lineage stored in the GCS tracks both stateless tasks and stateful actors during initial execution
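Lineage re-execution can be sketched as: remember which function and inputs produced each object, and replay that recipe when the object is missing. The toy below covers only stateless, deterministic tasks (not actors) and is not Ray's implementation; `LineageStore`, `submit`, `lose`, and `get` are hypothetical names.

```python
from typing import Any, Callable, Dict, Tuple


class LineageStore:
    """Toy store that can rebuild lost objects from recorded lineage."""

    def __init__(self) -> None:
        # object id -> (function, ids of the input objects)
        self._lineage: Dict[str, Tuple[Callable[..., Any], Tuple[str, ...]]] = {}
        self._objects: Dict[str, Any] = {}

    def submit(self, object_id: str, fn: Callable[..., Any], *input_ids: str) -> str:
        # Record the lineage first, then run the task once.
        self._lineage[object_id] = (fn, input_ids)
        self._objects[object_id] = fn(*[self.get(i) for i in input_ids])
        return object_id

    def lose(self, object_id: str) -> None:
        """Simulate losing an object, e.g. because its node failed."""
        self._objects.pop(object_id, None)

    def get(self, object_id: str) -> Any:
        if object_id not in self._objects:
            # Re-execute the producing task, recursively rebuilding lost inputs.
            fn, input_ids = self._lineage[object_id]
            self._objects[object_id] = fn(*[self.get(i) for i in input_ids])
        return self._objects[object_id]
```

Recovery works here only because the lineage is kept separately from the objects themselves, which mirrors the role the slides assign to the GCS.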