1. CS157B - Big Data Management
Flume with
Twitter Integration
Date: 03/3/2014
Professor: Thanh Tran
by Swathi Kotturu
2. ETL Using Flume
What is Flume?
Apache Flume is a distributed service for efficiently
collecting, aggregating, and moving large amounts of
log data.
Flume integrates with Hadoop and can be used to capture
streaming Twitter data, which can be filtered based on
keywords and locations.
3. More About Flume
It has a very simple architecture based on streaming data flows.
Flume reads events from a source, buffers them in a channel
(such as a memory channel), and writes them out through a sink
into HDFS.
4. Flume Agents
Flume can deploy any number of agents. An Agent is a
container for Flume data flow. It can run any number of
sources, sinks, and channels.
It must have a source, channel, and sink.
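As a minimal sketch of that wiring (agent and component names here are illustrative, not from the slides), a single agent connecting a netcat source to a logger sink through a memory channel looks like:

```properties
# Minimal Flume agent: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source listening for events on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# In-memory channel buffering up to 1000 events
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Logger sink prints each event to the Flume log
a1.sinks.k1.type = logger

# Wire the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```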
5. Flume Sources
Sources are not necessarily restricted to log data.
It is possible to use Flume to transport event data such
as network traffic data, social-media-generated data,
e-mail messages, etc.
The events can be HTTP POSTs, RPC calls, strings on
stdout, etc.
After an event occurs, Flume sources write the event to a
channel as a transaction.
6. Flume Channels
Channels are internal passive stores with specific
characteristics. This allows a source and a sink to run
asynchronously.
Two Main Types of Channels
Memory Channels
- Volatile channel that buffers events in memory
only. If the JVM crashes, all data is lost.
File Channels
- Persistent channel that is stored to disk.
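To make the buffered events durable, the memory channel can be swapped for a file channel; a hedged sketch (directory paths are illustrative assumptions):

```properties
# File channel persists events to disk, so they survive a JVM crash
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```

The trade-off is throughput: the file channel is slower than the memory channel but guarantees no data loss on restart.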
7. You can run multiple agents and servers to collect data in
parallel.
9. Flume in Cloudera
Download flume-sources-1.0-SNAPSHOT.jar and add it
to the Flume classpath:
http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar
In the Cloudera Manager, you can add the classpath under
"Services" -> "flume1" -> "Configuration" ->
"Agent(Default)" -> "Advanced" -> "Java Configuration
Options for Flume Agent", add:
--classpath /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar
11. Flume in Cloudera (cont.)
You also have to exclude the original file that came
pre-installed with Flume by renaming it with a .org
extension. The file is
search-contrib-1.0.0-jar-with-dependencies.jar and lives in
/usr/lib/flume-ng/lib/:
mv search-contrib-1.0.0-jar-with-dependencies.jar search-contrib-1.0.0-jar-with-dependencies.jar.org
Using Hue, create a flume user and give it read and
write access in HDFS.
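One way to prepare the target directory for that user from the command line (the path is an illustrative assumption, not from the slides):

```shell
# Create a directory for incoming tweets and hand it to the flume user
hadoop fs -mkdir -p /user/flume/tweets
hadoop fs -chown -R flume:flume /user/flume
```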
13. Flume in Cloudera (cont.)
From the Cloudera Manager, go to
"Services" -> "flume1" -> "Configuration" ->
"Agent(Default)" -> "Agent Name".
Set the Agent Name to TwitterAgent (it must match the
prefix used in the configuration file).
15. Flume in Cloudera (cont.)
Also set the Configuration File to the following, making sure to
replace the consumerKey, consumerSecret, accessToken, and
accessTokenSecret values with your own Twitter API credentials:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
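The slides defining the MemChannel and the HDFS sink are not shown here; a typical completion of this configuration (property values and the HDFS path are illustrative assumptions) looks like:

```properties
# Optional keyword filter on the Twitter source
TwitterAgent.sources.Twitter.keywords = hadoop, big data

# In-memory channel buffering tweets between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

# HDFS sink writing the raw JSON tweets as plain text files
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/%Y/%m/%d
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
```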
22. Next Steps
Tell Hive how to read the data.
You will need hive-serdes-1.0-SNAPSHOT.jar:
http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
Hive is set up to read delimited row formats by default,
but in this case it needs to read JSON.
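A sketch of registering that SerDe and creating an external table over the tweet JSON (the jar path, table name, location, and column subset are illustrative; real tweets carry many more fields):

```sql
-- Register the JSON SerDe from the downloaded jar (path is illustrative)
ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

-- External table over the raw tweets written by the Flume HDFS sink
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  `user` STRUCT<screen_name:STRING, followers_count:INT>
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
```

Because the table is EXTERNAL, dropping it leaves the tweet files in HDFS untouched.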