CS157B - Big Data Management
Flume with Twitter Integration
Date: 03/3/2014
Professor: Thanh Tran
by Swathi Kotturu

  1. CS157B - Big Data Management. Flume with Twitter Integration. Date: 03/3/2014. Professor: Thanh Tran. By Swathi Kotturu.
  2. ETL Using Flume: What is Flume? Apache Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of log data. Through its integration with Hadoop, Flume can be used to capture streaming Twitter data, which can be filtered by keywords and locations.
  3. More About Flume: It has a very simple architecture based on streaming data flows. In this setup, Flume reads events from a source and passes them through a memory channel to a sink, which writes the filtered data into HDFS.
  4. Flume Agents: Flume can deploy any number of agents. An agent is a container for a Flume data flow. It can run any number of sources, channels, and sinks, but it must have at least one of each.
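As a rough illustration of the rule that an agent needs at least one source, channel, and sink, the following Python sketch models an agent as a plain data structure and reports which kinds of component are missing. The `Agent` class and `missing_components` function are hypothetical names made up for this example; they are not part of Flume.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical model of a Flume agent: component names only."""
    name: str
    sources: list = field(default_factory=list)
    channels: list = field(default_factory=list)
    sinks: list = field(default_factory=list)


def missing_components(agent):
    """Return the component kinds the agent lacks (empty list if complete)."""
    missing = []
    for kind in ("sources", "channels", "sinks"):
        if not getattr(agent, kind):
            missing.append(kind)
    return missing


complete = Agent("TwitterAgent", ["Twitter"], ["MemChannel"], ["HDFS"])
incomplete = Agent("NoSink", ["Twitter"], ["MemChannel"])
print(missing_components(complete))    # []
print(missing_components(incomplete))  # ['sinks']
```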
  5. Flume Sources: Sources are not necessarily restricted to log data. It is possible to use Flume to transport event data such as network traffic data, social-media-generated data, and e-mail messages. The events can be HTTP POSTs, RPC calls, strings in stdout, and so on. After an event occurs, the Flume source writes the event to a channel as a transaction.
  6. Flume Channels: Channels are internal passive stores with specific characteristics. They allow a source and a sink to run asynchronously. There are two main types of channels. Memory channels are volatile and buffer events in memory only; if the JVM crashes, all buffered data is lost. File channels are persistent and store events on disk.
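To illustrate how a channel lets a source and a sink run asynchronously, here is a minimal Python sketch that uses a bounded in-memory queue as a stand-in for a memory channel. It is an analogy, not Flume code: the `source` and `sink` functions, the `DONE` sentinel, and the capacity of 4 are all assumptions made for the example. Like a memory channel, the queue loses its contents if the process dies.

```python
import queue
import threading

channel = queue.Queue(maxsize=4)  # stand-in for a memory channel
DONE = object()                   # sentinel marking the end of the stream
delivered = []


def source(events):
    """Producer: put() blocks only while the channel is full."""
    for event in events:
        channel.put(event)
    channel.put(DONE)


def sink():
    """Consumer: drains the channel at its own pace."""
    while True:
        event = channel.get()
        if event is DONE:
            break
        delivered.append(event)


events = [f"event-{i}" for i in range(10)]
producer = threading.Thread(target=source, args=(events,))
consumer = threading.Thread(target=sink)
producer.start()
consumer.start()
producer.join()
consumer.join()
print(delivered == events)  # True
```

Neither side calls the other directly; the queue is the only thing they share, which is the role the channel plays between a Flume source and sink.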
  7. You can run multiple agents and servers to collect data in parallel.
  8. Get Twitter Access
  9. Flume in Cloudera: Download flume-sources-1.0-SNAPSHOT.jar and add it to the Flume class path: http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar In Cloudera Manager, you can add the class path under “Services” -> “flume1” -> “Configuration” -> “Agent(Default)” -> “Advanced” -> “Java Configuration Options for Flume Agent” by adding: --classpath /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/flume-ng/lib/flume-sources-1.0-SNAPSHOT.jar
  10. Flume in Cloudera (cont.)
  11. Flume in Cloudera (cont.): You also have to exclude the pre-installed file that came with Flume by renaming it with a .org suffix. The file is search-contrib-1.0.0-jar-with-dependencies.jar, located in /usr/lib/flume-ng/lib/:
      mv search-contrib-1.0.0-jar-with-dependencies.jar search-contrib-1.0.0-jar-with-dependencies.jar.org
      Using Hue, create a Flume user and give it read and write access in HDFS.
  12. Flume in Cloudera (cont.)
  13. Flume in Cloudera (cont.): In Cloudera Manager, go to “Services” -> “flume1” -> “Configuration” -> “Agent(Default)” -> “Agent Name”. Set the Agent Name to TwitterAgent, which is the name used as the prefix in the configuration file.
  14. Flume in Cloudera (cont.)
  15. Flume in Cloudera (cont.): Also set the Configuration File to the following, and make sure to replace the ConsumerKey, ConsumerSecret, AccessToken, and AccessTokenSecret placeholders with your own values:
      TwitterAgent.sources = Twitter
      TwitterAgent.channels = MemChannel
      TwitterAgent.sinks = HDFS
      TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
      TwitterAgent.sources.Twitter.channels = MemChannel
      TwitterAgent.sources.Twitter.consumerKey = <consumer key>
      TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
      TwitterAgent.sources.Twitter.accessToken = <access token>
      TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
  16. Flume in Cloudera (cont.):
      TwitterAgent.sources.Twitter.keywords = flu, runny nose, tissue, sick, ill, cough
      TwitterAgent.sinks.HDFS.channel = MemChannel
      TwitterAgent.sinks.HDFS.type = hdfs
      TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/
      TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
      TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
      TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
      TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
      TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
      TwitterAgent.channels.MemChannel.type = memory
      TwitterAgent.channels.MemChannel.capacity = 10000
      TwitterAgent.channels.MemChannel.transactionCapacity = 100
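Before restarting the agent, it can help to confirm that the configuration wires every source and sink to a channel the agent actually declares. The Python sketch below is one way to do that check offline, under two assumptions: the file uses simple `key = value` lines, and component lists are separated by spaces. `parse_properties` and `check_wiring` are hypothetical helpers written for this example, not Flume tools, and only the wiring-related lines of the configuration are reproduced.

```python
CONFIG = """\
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of problems found in the agent's channel wiring."""
    problems = []
    declared = set(props.get(f"{agent}.channels", "").split())
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source}: no channel configured")
        for name in used:
            if name not in declared:
                problems.append(f"source {source}: unknown channel {name}")
    for sink in props.get(f"{agent}.sinks", "").split():
        name = props.get(f"{agent}.sinks.{sink}.channel")
        if name not in declared:
            problems.append(f"sink {sink}: unknown channel {name}")
    return problems


props = parse_properties(CONFIG)
print(check_wiring(props, "TwitterAgent"))  # []
```

Note the asymmetry the check relies on: in the configuration above, a source lists its channels under the plural key `channels`, while a sink names a single channel under `channel`.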
  17. Flume in Cloudera (cont.)
  18. Flume in Cloudera (cont.): Restart the Flume agent.
  19. Flume in Cloudera (cont.)
  20. Flume in Cloudera (cont.)
  21. Example Tweet: We loaded raw tweets into HDFS, where they are represented as chunks of JSON.
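To get a feel for what those JSON chunks contain, the following Python sketch reads tweet records and pulls out two fields. It assumes one JSON object per line, which may not match how your files are actually laid out. The two sample records are invented for illustration and include only the `text` and `user.screen_name` fields; real tweets carry many more.

```python
import json

# Invented sample data: two tweets, one JSON object per line.
RAW = (
    '{"id": 1, "text": "Home sick with the flu", '
    '"user": {"screen_name": "alice"}}\n'
    '{"id": 2, "text": "Cough will not go away", '
    '"user": {"screen_name": "bob"}}\n'
)


def iter_tweets(lines):
    """Yield one dict per well-formed JSON line, skipping the rest."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a record cut off at a file boundary


rows = [
    (tweet["user"]["screen_name"], tweet["text"])
    for tweet in iter_tweets(RAW.splitlines())
]
for screen_name, text in rows:
    print(f"{screen_name}: {text}")
```

The nested `user` object is the kind of structure that a delimited row format cannot express directly.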
  22. Next Steps: Tell Hive how to read the data. Hive is set up to read delimited row formats, but in this case it needs to read JSON, so you will need hive-serdes-1.0-SNAPSHOT.jar: http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
  23. Flume Resources: Learn more:
      https://dev.twitter.com/docs/streaming-apis/parameters
      https://cwiki.apache.org/confluence/display/FLUME/Home
      http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/
  24. Thank you! Q/A
