Spark+flume seattle

1© Cloudera, Inc. All rights reserved.
Hari Shreedharan, Software Engineer @ Cloudera
Committer/PMC Member, Apache Flume
Committer, Apache Sqoop
Contributor, Apache Spark
Author, Using Flume (O’Reilly)
Flume + Spark Streaming =
Ingest + Processing

What is Flume
• Collection, Aggregation of streaming Event Data
• Typically used for log data
• Significant advantages over ad-hoc solutions
• Reliable, Scalable, Manageable, Customizable and High Performance
• Declarative, Dynamic Configuration
• Contextual Routing
• Feature rich
• Fully extensible

Core Concepts: Event
An Event is the fundamental unit of data transported by Flume from its
point of origination to its final destination. Event is a byte array payload
accompanied by optional headers.
• Payload is opaque to Flume
• Headers are specified as an unordered collection of string key-value pairs, with
keys being unique across the collection
• Headers can be used for contextual routing

Core Concepts: Client
An entity that generates events and sends them to one or more Agents.
• Example
• Flume log4j Appender
• Custom Client using Client SDK (org.apache.flume.api)
• Embedded Agent – An agent embedded within your application
• Decouples Flume from the system where event data is consumed from
• Not needed in all cases

Core Concepts: Agent
A container for hosting Sources, Channels, Sinks and other components
that enable the transportation of events from one place to another.
• Fundamental part of a Flume flow
• Provides Configuration, Life-Cycle Management, and Monitoring Support
for hosted components

Inside a Flume agent

Typical Aggregation Flow
[Client]+  Agent [ Agent]*  Destination

Core Concepts: Source
An active component that receives events from a specialized location or
mechanism and places it on one or Channels.
• Different Source types:
• Specialized sources for integrating with well-known systems. Example:
Syslog, Netcat
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro
• Require at least one channel to function

Sources
• Different Source types:
• Specialized sources for integrating with well-known systems. Example:
Spooling Files, Syslog, Netcat, JMS
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro, Thrift
• Require at least one channel to function

Core Concepts: Channel
A passive component that buffers the incoming events until they are drained by
Sinks.
• Different Channels offer different levels of persistence:
• Memory Channel: volatile
• Data lost if JVM or machine restarts
• File Channel: backed by WAL implementation
• Data not lost unless the disk dies.
• Eventually, when the agent comes back data can be accessed.
• Channels are fully transactional
• Provide weak ordering guarantees
• Can work with any number of Sources and Sinks.

Core Concepts: Sink
An active component that removes events from a Channel and transmits
them to their next hop destination.
• Different types of Sinks:
• Terminal sinks that deposit events to their final destination. For example:
HDFS, HBase, Morphline-Solr, Elastic Search
• Sinks support serialization to user’s preferred formats.
• HDFS sink supports time-based and arbitrary bucketing of data while writing
to HDFS.
• IPC sink for Agent-to-Agent communication: Avro, Thrift
• Require exactly one channel to function

Flow Reliability
Normal Flow
Communication Failure between Agents
Communication Restored, Flow back to Normal

Flow Handling
Channels decouple impedance of upstream and downstream
• Upstream burstiness is damped by channels
• Downstream failures are transparently absorbed by channels
 Sizing of channel capacity is key in realizing these benefits

Interceptors
Interceptor
An Interceptor is a component applied to a source in pre-specified order
to enable decorating and filtering of events where necessary.
• Built-in Interceptors allow adding headers such as timestamps, hostname,
static markers etc.
• Custom interceptors can introspect event payload to create specific headers
where necessary

Contextual Routing
Channel Selector
A Channel Selector allows a Source to select one or more Channels from
all the Channels that the Source is configured with based on preset
criteria.
• Built-in Channel Selectors:
• Replicating: for duplicating the events
• Multiplexing: for routing based on headers

Contextual Routing
• Terminal Sinks can directly use Headers to make destination selections
• HDFS Sink can use headers values to create dynamic path for files that the
event will be added to.
• Some headers such as timestamps can be used in a more sophisticated
manner
• Custom Channel Selector can be used for doing specialized routing where
necessary

Client API
• Simple API that can be used to send data to Flume agents.
• Simplest form – send a batch of events to one agent.
• Can be used to send data to multiple agents in a round-robin, random or
failover fashion (send data to one till it fails).
• Java only.
• flume.thrift can be used to generate code for other languages.
• Use with Thrift source.

Motivation for Real-Time Stream Processing
Data is being created at unprecedented rates
• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at CAGR of 25%
How can we harness it data in real-time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...

Spark: Easy and Fast Big Data
•Easy to Develop
•Rich APIs in Java, Scala,
Python
•Interactive shell
•Fast to Run
•General execution graphs
•In-memory storage
2-5× less code
Up to 10× faster on disk,
100× in memory

Spark Architecture
Driver
Worker
Worker
Worker
Data
RAM
Data
RAM
Data
RAM

RDDs
RDD = Resilient Distributed Datasets
• Immutable representation of data
• Operations on one RDD creates a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization
Two observations:
a. Can fall back to disk when data-set does not fit in memory
b. Provides fault-tolerance through concept of lineage

Spark Streaming
Extension of Apache Spark’s Core API, for Stream Processing.
The Framework Provides
Fault Tolerance
Scalability
High-Throughput

Canonical Stream Processing Architecture
Kafka
Data Ingest
App 1
App 2
.
.
.
Kafka Flume
HDFS
HBase
Data
Sources

Spark Streaming
• Incoming data represented as Discretized Streams (DStreams)
• Stream is broken down into micro-batches
• Each micro-batch is an RDD – can share code between batch and streaming

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
flatMap flatMap flatMap
save save save
batch @ t+1batch @ t batch @ t+2tweets DStream
hashTags DStream
Stream composed of
small (1-10s) batch
computations
“Micro-batch” Architecture

Use DStreams for Windowing Functions

Spark Streaming Use-Cases
• Real-time dashboards
• Show approximate results in real-time
• Reconcile periodically with source-of-truth using Spark
• Joins of multiple streams
• Time-based or count-based “windows”
• Combine multiple sources of input to produce composite data
• Re-use RDDs created by Streaming in other Spark jobs.

Connectors
• Flume
• Kafka
• Amazon Kinesis
• MQTT
• Twitter (?!!)

Flume Polling Input DStream
• Reliable Receiver to pull data from Flume agents.
• Configure Flume agents to run the SparkSink
(org.apache.spark.streaming.flume.sink.SparkSink)
• Use FlumeUtils.createPollingStream to create a
FlumePollingInputDStream configured to connect to the
Flume agent(s)
• Enable WAL in Spark configuration to ensure no data loss
• Depends on Flume’s transactions for reliability

Flume Connector
• Flume Sink Flume receiver on Spark use Avro and uses the standard
event format Flume uses.
• No additional costs for using Spark Streaming vs just sending data from
one Flume agent to another!
• With WAL enabled, no data loss!

Flume Transactions
• Sink starts a transaction with the channel
Executor
Receiver
Block
Manager
Flume
Agent
Spark Sink
Flume
Channel
StartTxn

Flume Transactions
• Sink then pulls events out and sends it to
the receiver
Executor
Receiver
Block
Manager
Flume
Agent
Spark Sink
Flume
Channel
StartTxn
Event Stream

Flume Transactions
the receiver
• The receiver stores all events in a single
store call
Executor
Receiver
Block
Manager
Block
Flume
Agent
Spark Sink
Flume
Channel
StartTxn
Event Stream

Flume Transactions
the receiver
• The receiver stores all events in a single
store call
• Once the store call returns, the receiver
sends an ACK to the sink
Executor
Receiver
Block
Manager
Block
Flume
Agent
Spark Sink
Flume
Channel
StartTxn
ACK
Event Stream

Flume Transactions
Executor
Receiver
Block
Manager
Block
Flume
Agent
Spark Sink
Flume
Channel
StartTxn
ACK
Event Stream
• Sink then pulls events out and sends it to the
receiver
• The receiver stores all events in a single store call
• Once the store call returns, the receiver sends an
ACK to the sink
• The sink then commits the transaction
• Timeouts/NACKs trigger a rollback and data is re-
sent.
CommitTxn

Summary
• Clients send Events to Agents
• Agents hosts number Flume components – Source, Interceptors, Channel
Selectors, Channels, Sink Processors, and Sinks.
• Sources and Sinks are active components, where as Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not filtered,
puts them on channel(s) selected by the configured Channel Selector
• Sink Processor identifies a sink to invoke, that can take Events from a Channel
and send it to its next hop destination
• Channel operations are transactional to guarantee one-hop delivery semantics
• Channel persistence allows for ensuring end-to-end reliability
36

Thank you
hshreedharan@cloudera.com
@harisr1234

Spark+flume seattle

More Related Content

What's hot

Viewers also liked

Similar to Spark+flume seattle

Recently uploaded

Spark+flume seattle

Editor's Notes