1. CS157B - Big Data Management
Flume with
Twitter Integration
Date: 03/3/2014
Professor: Thanh Tran
by Swathi Kotturu
2. ETL Using Flume
What is Flume?
Apache Flume is a distributed service for efficiently
collecting, aggregating, and moving large amounts of
log data.
Flume integrates with Hadoop and can be used to capture
streaming Twitter data, which can be filtered based on
keywords and locations.
3. More About Flume
It has a very simple architecture based on streaming data flows.
Flume reads events from a source, buffers them in a channel
(such as a memory channel), and writes them out through a sink
into HDFS.
4. Flume Agents
Flume can deploy any number of agents. An Agent is a
container for Flume data flow. It can run any number of
sources, sinks, and channels.
It must have a source, channel, and sink.
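As a minimal sketch of that wiring (agent and component names here are illustrative, not from the slides), a single agent connecting a netcat source to a logger sink through a memory channel looks like:

```properties
# Minimal Flume agent: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source listening for events on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# In-memory channel buffering up to 1000 events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Logger sink prints each event to the Flume log
a1.sinks.k1.type = logger

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```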
5. Flume Sources
Sources are not necessarily restricted to log data.
It is possible to use Flume to transport event data such
as network traffic data, social-media-generated data,
e-mail messages, etc.
The events can be HTTP POSTs, RPC calls, strings on
stdout, etc.
After an event occurs, Flume sources write the event to a
channel as a transaction.
6. Flume Channels
Channels are internal passive stores with specific
characteristics. This allows a source and a sink to run
asynchronously.
Two Main Types of Channels
Memory Channels
- Volatile channel that buffers events in memory
only. If the JVM crashes, all data is lost.
File Channels
- Persistent channel that is stored to disk.
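To make the buffered events durable, the memory channel can be swapped for a file channel; a hedged sketch (directory paths are illustrative assumptions):

```properties
# File channel persists events to disk, so they survive a JVM crash
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```

The trade-off is throughput: the file channel is slower than the memory channel but guarantees no data loss on restart.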
7. You can run multiple agents and servers to collect data in
parallel.
9. Flume in Cloudera
Download flume-sources-1.0-SNAPSHOT.jar and add it
to the Flume classpath:
http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar
In the Cloudera Manager, you can add the classpath under
"Services" -> "flume1" -> "Configuration" ->
"Agent(Default)" -> "Advanced" -> "Java Configuration
Options for Flume Agent", add:
--classpath /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar
11. Flume in Cloudera (cont.)
You also have to exclude the original file that came
pre-installed with Flume by renaming it with a .org
extension. The file is
search-contrib-1.0.0-jar-with-dependencies.jar and lives in
/usr/lib/flume-ng/lib/:
mv search-contrib-1.0.0-jar-with-dependencies.jar search-contrib-1.0.0-jar-with-dependencies.jar.org
Using Hue, create a flume user and give it read and
write access in HDFS.
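One way to prepare the target directory for that user from the command line (the path is an illustrative assumption, not from the slides):

```shell
# Create a directory for incoming tweets and hand it to the flume user
hadoop fs -mkdir -p /user/flume/tweets
hadoop fs -chown -R flume:flume /user/flume
```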
13. Flume in Cloudera (cont.)
From the Cloudera Manager, go to
"Services" -> "flume1" -> "Configuration" ->
"Agent(Default)" -> "Agent Name".
Set the Agent Name to TwitterAgent (it must match the
prefix used in the configuration file).
15. Flume in Cloudera (cont.)
Also set the Configuration File to the following, making sure to
replace the consumerKey, consumerSecret, accessToken, and
accessTokenSecret values with your own Twitter API credentials:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
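The slides defining the MemChannel and the HDFS sink are not shown here; a typical completion of this configuration (property values and the HDFS path are illustrative assumptions) looks like:

```properties
# Optional keyword filter on the Twitter source
TwitterAgent.sources.Twitter.keywords = hadoop, big data

# In-memory channel buffering tweets between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# HDFS sink writing the raw JSON tweets as plain text files
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/%Y/%m/%d
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
```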
22. Next Steps
Tell Hive how to read the data.
You will need hive-serdes-1.0-SNAPSHOT.jar:
http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
Hive is set up to read delimited row formats by default,
but in this case it needs to read JSON.
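A sketch of registering that SerDe and creating an external table over the tweet JSON (the jar path, table name, location, and column subset are illustrative; real tweets carry many more fields):

```sql
-- Register the JSON SerDe from the downloaded jar (path is illustrative)
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

-- External table over the raw tweets written by the Flume HDFS sink
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
```

Because the table is EXTERNAL, dropping it leaves the tweet files in HDFS untouched.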