Learn how to write YARN applications.
Agenda
1. Overview of YARN
2. Components of a YARN application
3. Lifecycle of a YARN application
We will walk through some code snippets as well.
Speaker - Priyanka Gugale is a committer on Apache Apex and an engineer at DataTorrent Software India Pvt. Ltd, where she has worked in the big data space for the past 2+ years.
Operators are the basic compute units. An operator processes each incoming tuple and emits zero or more tuples on its output ports, as per the business logic.
Input Adapter - This is one of the starting points in the application DAG and is responsible for fetching tuples from an external system. Alternatively, such an operator may generate the data itself, without interacting with the outside world.
Generic Operator - This type of operator accepts input tuples from the previous operators and passes them on to the following operators in the DAG.
Output Adapter - This is one of the ending points in the application DAG and is responsible for writing the data out to some external system.
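The generic operator type above can be sketched with the Apache Apex operator API. This is a minimal illustration, assuming the `com.datatorrent.api` port classes from apex-core (the `BaseOperator` package has varied across Apex versions); `LineLengthOperator` and its ports are hypothetical examples, not part of the library:

```java
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class LineLengthOperator extends BaseOperator
{
  // Output port: emits the length of each incoming line.
  public final transient DefaultOutputPort<Integer> output = new DefaultOutputPort<>();

  // Input port: process() is invoked once per incoming tuple.
  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String line)
    {
      // Business logic: emit zero or more tuples per input tuple.
      output.emit(line.length());
    }
  };
}
```

The ports are declared `transient` because port objects are wiring, not operator state, and should not be checkpointed with the operator.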
An operator passes through various stages during its lifetime. Each stage corresponds to an API call that the Apex engine makes on the operator.
The setup() call initializes the operator and prepares it to start processing tuples.
The beginWindow() call marks the beginning of an application window and allows any processing to be done before the window starts.
The process() call belongs to the InputPort and is triggered whenever a tuple arrives on the operator's input port.
The emitTuples() call is used by input adapters to emit tuples fetched from external systems.
The endWindow() call marks the end of the window and allows any processing to be done after the window ends.
The teardown() call gracefully shuts down the operator and releases any resources held by the operator.
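The call ordering above can be illustrated with a plain-Java simulation. This is a toy sketch with no Apex dependencies; `ToyOperator` and the driver loop are hypothetical stand-ins for the real operator API and engine:

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleDemo
{
  // Stand-in operator that records each lifecycle callback it receives.
  static class ToyOperator
  {
    final List<String> calls = new ArrayList<>();
    void setup()               { calls.add("setup"); }
    void beginWindow(long id)  { calls.add("beginWindow:" + id); }
    void process(String tuple) { calls.add("process:" + tuple); }
    void endWindow()           { calls.add("endWindow"); }
    void teardown()            { calls.add("teardown"); }
  }

  // Toy "engine": one setup, two windows of tuples, then one teardown.
  static List<String> run()
  {
    ToyOperator op = new ToyOperator();
    op.setup();
    long windowId = 0;
    for (String[] window : new String[][] {{"a", "b"}, {"c"}}) {
      op.beginWindow(windowId++);
      for (String tuple : window) {
        op.process(tuple);
      }
      op.endWindow();
    }
    op.teardown();
    return op.calls;
  }

  public static void main(String[] args)
  {
    // Prints the recorded call sequence:
    // [setup, beginWindow:0, process:a, process:b, endWindow,
    //  beginWindow:1, process:c, endWindow, teardown]
    System.out.println(run());
  }
}
```

Note that setup()/teardown() happen once per activation, while beginWindow()/endWindow() bracket every application window.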
Skeleton for Apex application
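A minimal skeleton looks like the following, assuming the apex-core `com.datatorrent.api` interfaces; the two operators (`RandomNumberGenerator`, `ConsoleOutputOperator`) and their port names are placeholders modeled on the Apex sample applications:

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;

@ApplicationAnnotation(name = "MyFirstApplication")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Placeholder operators: an input adapter and an output adapter.
    RandomNumberGenerator random = dag.addOperator("randomGenerator", RandomNumberGenerator.class);
    ConsoleOutputOperator console = dag.addOperator("console", new ConsoleOutputOperator());

    // Connect the generator's output port to the console's input port.
    dag.addStream("randomData", random.out, console.input);
  }
}
```

The engine calls populateDAG() at launch time; the application only declares operators and streams, and the platform takes care of deploying them to containers.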
For application development or for functional testing, a Hadoop cluster or its services are not required, as the application can run against the local file system as a single process with multiple threads. A Hadoop (distributed) cluster is recommended for benchmarking and production testing.
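Local-mode functional testing can be sketched as follows, assuming apex-core's `LocalMode` API; `Application` stands for any `StreamingApplication` implementation you want to exercise:

```java
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.LocalMode;

public class LocalApplicationTest
{
  public void testApplication() throws Exception
  {
    LocalMode lma = LocalMode.newInstance();
    Configuration conf = new Configuration(false);
    // Application is a placeholder StreamingApplication implementation.
    lma.prepareDAG(new Application(), conf);
    LocalMode.Controller lc = lma.getController();
    lc.run(10000);  // run the DAG in-process for ~10 seconds, no cluster needed
  }
}
```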
On a single-node cluster, throughput will not be as high as on a multi-node cluster because of memory constraints.