5. Started with Map-Reduce
Task Graph with computations on data in nodes
Different Data APIs in community
● High-level Data API hides communication and decomposition from the user
● Lower-level messaging and Task APIs are harder to use but offer more powerful capabilities
● Data transformation APIs
○ Apache Crunch PCollections
○ Apache Spark RDD
○ Apache Flink DataSet
○ Apache Beam PCollections
○ Apache Heron Streamlets
● Apache Storm Task Graph
● SQL based APIs
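The first bullet above can be illustrated with a minimal sketch of a high-level data API that hides decomposition (partitioning) and communication (a shuffle) from the user. Everything here is an assumption for illustration: `ToyDataSet` and its methods are hypothetical names, workers are simulated as in-memory lists, and none of this is the API of any of the systems listed.

```python
from collections import defaultdict
from functools import reduce


class ToyDataSet:
    """Hypothetical high-level data API; partitions are an internal detail."""

    def __init__(self, partitions):
        self._partitions = partitions  # one list per simulated worker

    @classmethod
    def from_items(cls, items, num_partitions=2):
        # Decomposition: spread the items round-robin over the simulated workers.
        return cls([list(items[i::num_partitions]) for i in range(num_partitions)])

    def flat_map(self, fn):
        # Runs independently on each partition; no communication needed.
        return ToyDataSet([[out for x in part for out in fn(x)]
                           for part in self._partitions])

    def reduce_by_key(self, fn):
        # Communication: a simulated shuffle sends each key to a single partition.
        n = len(self._partitions)
        buckets = [defaultdict(list) for _ in range(n)]
        for part in self._partitions:
            for key, value in part:
                buckets[hash(key) % n][key].append(value)
        return ToyDataSet([[(k, reduce(fn, vs)) for k, vs in bucket.items()]
                           for bucket in buckets])

    def collect(self):
        return [x for part in self._partitions for x in part]


# The user code never mentions partitions or message passing.
lines = ["map reduce map", "task graph", "map task"]
word_counts = dict(
    ToyDataSet.from_items(lines)
    .flat_map(lambda line: [(word, 1) for word in line.split()])
    .reduce_by_key(lambda a, b: a + b)
    .collect()
)
```

A lower-level messaging or task API would expose the partitioning and the shuffle step directly, which is harder to use but lets the programmer control them.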
16.
Component | Area | Current Implementation | Future Implementation
Connected DataFlow | Workflow or External Dataflow between different resources | Dynamic dataflows connected by data | Ongoing
High Level APIs | Distributed Data Set, SQL, Python, Scala, Graph | TSets, Java | Dataflow optimizations, SQL, Python, Scala, Graph (in development), BEAM
Task System | Task migration | Not started | Streaming job task migrations
Task System | Streaming and Batch | Streaming and Batch execution |
Task System | Task Execution | Process, Threads | More executors
Task System | Task Scheduling | Dynamic Scheduling, Static Scheduling; Pluggable Scheduling Algorithms | More algorithms
Task System | Task Graph | Static Graph, Dynamic Graph Generation | Cyclic graphs for iteration as in Timely DataFlow
Operators / Communication | Internal DataFlow Operations | Twister:Net; MPI Based, TCP, Batch and Streaming | Integrate to other big data systems, Integrate with RDMA
Operators / Communication | BSP Operations | Conventional MPI, Harp | Native MPI Integration
Job Submission | Job Submission (Dynamic/Static); Resource Allocation | Plugins for Slurm, Mesos, Kubernetes, Nomad | Yarn, Marathon
Data Access | Static (Batch) Data | File Systems including HDFS | NoSQL, SQL
Data Access | Streaming Data | Kafka Connector | RabbitMQ, ActiveMQ
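The "Pluggable Scheduling Algorithms" entry can be pictured with a small sketch in which the scheduling algorithm is just a function handed to the scheduler. This is only an illustration under stated assumptions: `round_robin`, `packed`, `schedule` and the task names are hypothetical and are not taken from the system described in the slides.

```python
def round_robin(tasks, num_workers):
    """Spread task instances evenly: task i goes to worker i mod num_workers."""
    return {task: i % num_workers for i, task in enumerate(tasks)}


def packed(tasks, num_workers, slots_per_worker=2):
    """Fill one worker's slots before moving on to the next worker."""
    if len(tasks) > num_workers * slots_per_worker:
        raise ValueError("not enough slots for all task instances")
    return {task: i // slots_per_worker for i, task in enumerate(tasks)}


def schedule(tasks, num_workers, algorithm=round_robin):
    """Static scheduling: compute the full task-to-worker plan before execution."""
    return algorithm(tasks, num_workers)


tasks = ["source-0", "source-1", "map-0", "sink-0"]
plan_rr = schedule(tasks, num_workers=2)
plan_packed = schedule(tasks, num_workers=2, algorithm=packed)
```

Swapping the algorithm changes the placement without touching the code that builds or runs the task graph, which is the point of making scheduling pluggable.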
17.
Function | Mechanism | Implementation | Futures
Architecture Specification | Coordination Points | DataFlow coordination points and BSP | Use for Learning nodes and fault tolerance control
Architecture Specification | Execution Semantics | Both process based and thread based | Ongoing improvements
Fault tolerance | Checkpointing | Lightweight barriers, Checkpointing | Available in June Release
Security | Messaging, FaaS, Storage | Crosses all components (Research) |
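As a rough illustration of the "Lightweight barriers, Checkpointing" entry, the sketch below shows the general idea of barrier-based checkpointing: a marker travels through the stream with the data, and an operator snapshots its state when the marker arrives. It assumes a single operator with a single input; `Barrier` and `CountingOperator` are hypothetical names, and the slides do not say how the actual implementation works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    """Hypothetical marker injected into the stream to request a checkpoint."""
    checkpoint_id: int


class CountingOperator:
    """Stateful operator that counts records and snapshots its count at barriers."""

    def __init__(self):
        self.count = 0
        self.checkpoints = {}  # checkpoint id -> saved state

    def process(self, stream):
        for item in stream:
            if isinstance(item, Barrier):
                # Everything before the barrier is reflected in the snapshot.
                self.checkpoints[item.checkpoint_id] = self.count
            else:
                self.count += 1

    def restore(self, checkpoint_id):
        # After a failure, roll the state back to a completed checkpoint.
        self.count = self.checkpoints[checkpoint_id]


op = CountingOperator()
op.process(["a", "b", Barrier(1), "c", Barrier(2), "d"])
count_before_failure = op.count
op.restore(1)
count_after_restore = op.count
```

After a restore, the records that arrived after the chosen barrier would need to be replayed from the source, which this sketch leaves out.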
18. [Figure: layered architecture diagram. Labels: User APIs (Java APIs, Scala APIs, Python API, SQL API); Orchestration API; Streaming, Batch and ML Applications; TSet Runtime; Task Graph System; State; BSP Operations; Internal (fine grain) DataFlow and State Definition Operations; Connected or External DataFlow; Atomic Job Submission; Data Access APIs (HDFS, NoSQL, Message Brokers); Resource API (Mesos, Kubernetes, Standalone, Local, Slurm)]
Future Features: Python API critical
19. APIs are built combining different components of the System
● Low level APIs with the most flexibility. Harder to program
○ Worker, BSP Operations: Java API
○ Worker, DataFlow Operations: Java API (Operator Level APIs)
● APIs built on top of Task Graph
○ Worker, DataFlow Operations, Task Graph: Java API
● Higher Level APIs based on Task Graph
○ Worker, DataFlow Operations, Task Graph, TSet: Java API, Python API
○ Worker, DataFlow Operations, Task Graph, SQL
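The layering on this slide can be sketched as a toy task graph API with a higher-level, fluent API built on top of it. This is a minimal sketch under stated assumptions: `ToyTaskGraph` and `ToySet` are hypothetical names invented for the example, execution is sequential in a single process, and the slides' actual Java and Python APIs are not shown here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ToyTaskGraph:
    """Lower level: the user names tasks and wires the edges by hand."""

    def __init__(self):
        self.tasks = {}     # task name -> function taking a list of upstream results
        self.upstream = {}  # task name -> list of upstream task names

    def add_task(self, name, fn, upstream=()):
        self.tasks[name] = fn
        self.upstream[name] = list(upstream)

    def run(self):
        results = {}
        # static_order() yields every task after all of its upstream tasks.
        for name in TopologicalSorter(self.upstream).static_order():
            inputs = [results[u] for u in self.upstream[name]]
            results[name] = self.tasks[name](inputs)
        return results


class ToySet:
    """Higher level: a fluent API that builds the same kind of graph for the user."""

    def __init__(self, graph, name):
        self.graph, self.name = graph, name

    @classmethod
    def source(cls, items):
        graph = ToyTaskGraph()
        graph.add_task("source", lambda _inputs: list(items))
        return cls(graph, "source")

    def map(self, fn):
        name = f"map-{len(self.graph.tasks)}"
        self.graph.add_task(name, lambda inputs: [fn(x) for x in inputs[0]], [self.name])
        return ToySet(self.graph, name)

    def collect(self):
        return self.graph.run()[self.name]


# Task graph level: explicit tasks and edges.
manual = ToyTaskGraph()
manual.add_task("source", lambda _inputs: [1, 2, 3])
manual.add_task("double", lambda inputs: [2 * x for x in inputs[0]], ["source"])
low_level = manual.run()["double"]

# Higher level: the graph is generated behind the fluent calls.
high_level = ToySet.source([1, 2, 3]).map(lambda x: 2 * x).collect()
```

Both paths compute the same result; the difference is how much of the graph structure the user has to write by hand.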