Kafka On YARN (KOYA)
An Open Source Initiative to integrate Kafka & YARN
Thomas Weise – thomas@datatorrent.com
April 15th, 2015
Scalable, High Performance, Fault Tolerant In-memory Stream Processing Platform
Physical | Virtual | Cloud
Hadoop 2.x – YARN + HDFS
Re-usable Stream Operator Library
Ingestion | Transformation | Analytics | Alert-Action | Visualization & Distribution
Management & Monitoring | Graphical App Design & Launch | Real-Time Data Visualization
Architecting for Stream Processing…
High-volume, auto-scaling, fault-tolerant event stream across multiple data centers globally.
Dimensional computation.
Use Case: Online Advertising Analytics
[Diagram: Ad Servers in DC1, DC2 and DC3 each emit Impressions; Metrics; Real-time Dashboard]
“A high-throughput distributed messaging system.”
“Fast, Scalable, Durable, Distributed”
Kafka is a natural fit to deliver events into our stream processing platform.
Apache Kafka
Kafka feeds Stream Processing
[Diagram: Kafka Cluster (Server-1, Server-2, Server-3, each with partitions P1 P2 P3) feeding a YARN Cluster (Resource Manager; Node Managers running the DT AppMaster and DT Containers)]
Problem?
• It is not easy to get started with Kafka
– Initial deployment difficult (bring your own tools)
• It is not easy to keep it running
– No central management (status, configuration changes,…)
– No automatic replacement for failed broker
• Operational Inefficiencies
– Resource fragmentation, underutilization
– Common infrastructure not leveraged, extra skill sets
• Adoption Barrier!
Why Kafka on YARN
• YARN enables:
– Horizontal scalability with commodity hardware
– Central resource management with queues, limits and locality constraints
– Framework for achieving fault tolerance and security
• Automate:
– Broker recovery
– Deployment of Kafka clusters
• Integrate:
– User friendly management (alternative to Kafka command line utilities)
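The "Automate: Broker recovery" point can be pictured as a reconcile loop: compare the desired number of brokers with the ones actually running and start a replacement for each missing one. The sketch below is a toy in-memory model only. `BrokerCluster`, `reconcile`, `fail` and the `node-N` host names are hypothetical names made up for illustration; this is not KOYA, Slider or YARN code.

```python
from dataclasses import dataclass, field


@dataclass
class BrokerCluster:
    """Toy model of a managed Kafka cluster (hypothetical, not KOYA code)."""
    desired: int                                  # number of brokers we want
    running: dict = field(default_factory=dict)   # broker id -> host name
    next_host: int = 0

    def _allocate_host(self) -> str:
        # Stand-in for requesting a container from the resource manager.
        self.next_host += 1
        return f"node-{self.next_host}"

    def reconcile(self) -> list:
        """Start a broker for every missing id; return the ids started."""
        started = []
        for broker_id in range(1, self.desired + 1):
            if broker_id not in self.running:
                self.running[broker_id] = self._allocate_host()
                started.append(broker_id)
        return started

    def fail(self, broker_id: int) -> None:
        """Simulate the loss of a broker (for example a container failure)."""
        self.running.pop(broker_id, None)


cluster = BrokerCluster(desired=3)
initial = cluster.reconcile()     # initial deployment starts brokers 1..3
cluster.fail(2)                   # broker 2 goes away
recovered = cluster.reconcile()   # only the missing broker is replaced
```

The same loop covers both deployment and recovery, which is why the two items sit together under "Automate".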
Kafka on YARN through Slider
[Diagram: YARN Cluster with Resource Manager and Node Managers; the DT AppMaster and DT Containers run alongside a Slider AM; Slider Agents host Kafka Server-1 and Server-2, each with partitions P1 P2 P3]
Why Slider?
• Automates deployment and configuration of components
– Simplify on-demand cluster creation
• Generic AM for long running services
– Management of container failures – automates recovery
– Sticky allocation of components to hosts across AM restart
– Isolation: node labels to pin components to specific set of machines
• Central status
– View all servers in one place
• Areas for improvement
– Anti-affinity support (YARN limitation)
– Agent API documentation
– Flexibility in component instance specification
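To make the terms "sticky allocation" and "anti-affinity" concrete, here is a minimal sketch of a placement function. It only illustrates what the two properties mean; it is not how Slider or YARN implement placement, and `place_components`, the broker names and the host names are hypothetical.

```python
def place_components(components, hosts, previous=None):
    """Assign each component instance to a distinct host.

    Sticky: reuse the host recorded in `previous` when it is still available.
    Anti-affinity: never place two instances on the same host.
    Raises RuntimeError when there are fewer free hosts than instances.
    """
    previous = previous or {}
    placement = {}
    free = list(hosts)

    # First pass: honour earlier assignments where the host still exists.
    for comp in components:
        host = previous.get(comp)
        if host in free:
            placement[comp] = host
            free.remove(host)

    # Second pass: give every remaining instance an unused host.
    for comp in components:
        if comp not in placement:
            if not free:
                raise RuntimeError(f"no free host for {comp}")
            placement[comp] = free.pop(0)
    return placement


first = place_components(["broker-1", "broker-2"], ["h1", "h2", "h3"])
# After a restart h1 is gone: broker-2 stays on h2, broker-1 has to move.
second = place_components(["broker-1", "broker-2"], ["h2", "h3"], previous=first)
```

Sticky placement matters for Kafka because a broker's log data lives on the local disks of the host it ran on.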
Configuration Example
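The configuration shown on this slide is not part of the transcript. As a stand-in, the sketch below builds a Slider-style resources specification for a set of Kafka brokers. It is an assumption-laden illustration, not the KOYA configuration: the component name `KAFKA_BROKER`, the helper `kafka_resources` and all values are hypothetical, and the `yarn.*` key names follow Slider's resources.json conventions as I understand them, so check them against the Slider version in use.

```python
import json


def kafka_resources(brokers, memory_mb=1024, vcores=1, label=None):
    """Return a Slider-style resources spec as a dict (illustrative only)."""
    component = {
        "yarn.role.priority": "1",
        "yarn.component.instances": str(brokers),  # number of broker containers
        "yarn.memory": str(memory_mb),             # memory per container, in MB
        "yarn.vcores": str(vcores),
    }
    if label:
        # Node label used to pin brokers to a specific set of machines.
        component["yarn.label.expression"] = label
    return {
        "schema": "http://example.org/specification/v2.0.0",
        "metadata": {},
        "global": {},
        "components": {
            "slider-appmaster": {},
            "KAFKA_BROKER": component,  # hypothetical component name
        },
    }


spec = kafka_resources(brokers=3, label="kafka")
text = json.dumps(spec, indent=2)  # what would be written to resources.json
```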
Demo
Project Status
• Open Source: https://github.com/DataTorrent/koya
• Python Scripts + Configuration
• Works on Hadoop 2.6 through Slider 0.6
• Install: Embedded Slider or Application Package
• First Release by Q2
• Future Enhancements
– Expanded Status Info through Slider AM
– Explore Kafka management UI options
– Support for Disk as a Resource in YARN - YARN-2139
– Better control over server placement (anti-affinity) - Slider-799
Q & A
For more information: www.datatorrent.com/blogs
