Flume-based Independent News Aggregator

1. Flume-based news aggregator service on Amazon EC2
   Arinto Murdopo
   Mário Almeida
   Zafar Gilani
   SDS, EMDC 2012
2. Outline
   ● Introduction
     ○ Cloudera Manager CDH3
     ○ Cloudera Flume
     ○ Hadoop Distributed File System
   ● Infrastructure setup
   ● Architecture
   ● News recommendation
   ● RSS News aggregator
   ● Proof of concept
   ● Issues faced
   ● Future work
   ● Conclusions
   ● References
3. Introduction
   ● A Flume-based independent news aggregator service.
   ● Using:
     ○ Amazon EC2 IaaS
     ○ Cloudera Manager CDH3
     ○ Cloudera Flume
     ○ Hadoop Distributed File System
4. Cloudera Manager CDH3
   ● Automates the installation and configuration of CDH3 across an entire cluster.
   ● We used the free edition (up to 50 nodes).
5. Cloudera Flume
   ● A distributed, reliable and available system for efficiently collecting, aggregating and moving large amounts of log data.
   ● Moves data from many different sources to a centralized or distributed data store (such as Hadoop HDFS).
6. Hadoop HDFS (1/2)
   ● For our purposes, Hadoop handles:
     ○ Log receipt and storage.
     ○ Search and log processing.
   ● Coordinates work across a cluster of machines.
7. Hadoop HDFS (2/2)
8. Infrastructure setup
   ● 2 Agent nodes collecting data:
     ○ Source: RSS feed
     ○ Sink: Collector
   ● 1 Agent node (Collector):
     ○ Source: Agents
     ○ Sink: HDFS
   ● HDFS NameNode:
     ○ Replicates data to DataNodes 1, 2 and 3.
   ● Cloudera Manager CDH3 node:
     ○ Manages all our nodes (Agents and HDFS nodes).
   (A configuration sketch for this topology follows below.)
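As a rough illustration of the slide above, this is what a Flume 1.x configuration for the two-tier topology could look like. It is a minimal sketch, not the project's actual configuration: the node names (agent1, collector), hostnames, port, HDFS path and the exec command are all assumptions.

    # --- Agent node: reads RSS data and forwards it to the Collector ---
    agent1.sources  = rss
    agent1.channels = mem
    agent1.sinks    = toCollector

    # exec source running a hypothetical RSS-fetching command
    agent1.sources.rss.type     = exec
    agent1.sources.rss.command  = java -jar rss-reader.jar
    agent1.sources.rss.channels = mem

    agent1.channels.mem.type = memory

    # Avro sink pointing at the Collector node
    agent1.sinks.toCollector.type     = avro
    agent1.sinks.toCollector.hostname = collector.internal
    agent1.sinks.toCollector.port     = 4141
    agent1.sinks.toCollector.channel  = mem

    # --- Collector node: receives events from both Agents, writes to HDFS ---
    collector.sources  = fromAgents
    collector.channels = mem
    collector.sinks    = toHdfs

    collector.sources.fromAgents.type     = avro
    collector.sources.fromAgents.bind     = 0.0.0.0
    collector.sources.fromAgents.port     = 4141
    collector.sources.fromAgents.channels = mem

    collector.channels.mem.type = memory

    # HDFS sink; the path is a placeholder
    collector.sinks.toHdfs.type      = hdfs
    collector.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/flume/news
    collector.sinks.toHdfs.channel   = mem

Each node would then be started with the standard flume-ng launcher, pointing it at its own section of the file (e.g. flume-ng agent -n agent1 -f flume.conf).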
9. Architecture
10. News Recommendation
   ● We hosted a webpage where people can recommend possible news sources.
     ○ http://web.ist.utl.pt/~ist156947/sds/
   ● Retrieved a large compilation of news websites and blogs from a wide variety of countries.
     ○ E.g. Spain, Libya, Russia, Syria, Iran...
11. RSS News aggregator
   ● We wrote a Java application to read RSS feeds using:
     ○ java.net.URL to open the resource pointed to by the URL.
     ○ javax.xml.parsers for XML parsing.
     ○ org.w3c.dom for the DOM interfaces used to process the XML.
   (A minimal reader along these lines is sketched below.)
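The following is a minimal sketch of such a reader, using only the packages listed on the slide; the class name, feed URL and the choice of printing title/link pairs are illustrative assumptions, not the project's actual code.

    import java.net.URL;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class RssReader {
        public static void main(String[] args) throws Exception {
            // Placeholder feed URL; the real sources came from the recommendation page.
            URL feed = new URL("http://example.com/rss.xml");

            // Parse the feed into a DOM tree directly from the URL's input stream.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(feed.openStream());

            // Each <item> element is one news entry; emit its title and link.
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String title = item.getElementsByTagName("title").item(0).getTextContent();
                String link  = item.getElementsByTagName("link").item(0).getTextContent();
                System.out.println(title + "\t" + link);
            }
        }
    }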
12. Proof of concept (1/3)
   ● Our Agents collect the RSS feeds and send them to the Collector Agent.
13. Proof of concept (2/3)
   ● The Collector receives the events from both Agents and stores them in HDFS.
14. Proof of concept (3/3)
   ● Because the replication factor is 3 and there are three DataNodes, every DataNode ends up with the same amount of data. (A quick way to check this is sketched below.)
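One way to verify this distribution, assuming the standard Hadoop command-line tools shipped with CDH3, is to look at the per-DataNode usage reported by dfsadmin:

    # Prints capacity and "DFS Used" for every DataNode; with a replication
    # factor of 3 and three DataNodes, the used space should be roughly equal.
    hadoop dfsadmin -report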
15. Issues faced (1/4)
   ● The DataNode setting dfs.datanode.du.reserved is set by default to 10 GB.
     ○ This means that if a DataNode has less than 10 GB of capacity, no space remains available to HDFS, and writes fail with the warning "Not able to place enough replicas".
   (One way to lower the reservation is sketched below.)
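If the reservation needs to be lowered, one option is to override the property, for example in hdfs-site.xml; the 1 GB value below is only an example, and on a Cloudera Manager deployment the same setting would normally be changed through the Manager UI instead.

    <!-- Reserve only 1 GB per volume for non-HDFS use (value is in bytes). -->
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>1073741824</value>
    </property>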
16. Issues faced (2/4)
   ● For Cloudera Manager to work, all nodes must run either SUSE or Red Hat.
   ● Cloudera Manager cannot run on an AWS EC2 micro instance.
   ● When an instance restarts, its IP address changes.
     ○ So Cloudera Manager loses track of the node.
   ● Cloudera Manager operates with private DNS names, so any references it makes point to this private DNS.
     ○ The web UIs are only accessible from our machines' web browsers through the public DNS names.
17. Issues faced (3/4)
   ● Some installation guides forget to mention the ports that must be open for their services to communicate.
     ○ Cloudera provides a page listing all the required ports.
   ● Creating folders and changing user permissions is not mentioned in the user guide.
     ○ We had to access Hadoop as the hdfs user, create the flume folder and change its owner to flume using chown; otherwise writes fail with an AccessControlException. (Commands along these lines are sketched after this slide.)
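A sketch of that permission fix, assuming the folder lives at /flume (the exact path is not given on the slide):

    # Create the folder as the hdfs superuser, then hand ownership to flume.
    sudo -u hdfs hadoop fs -mkdir /flume
    sudo -u hdfs hadoop fs -chown flume:flume /flume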
18. Issues faced (4/4)
   ● Although scaling out by adding new Agents is easy, it requires fine-tuning of the channel capacity (number of events) and transaction size for each Agent. (See the channel settings sketched below.)
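For a memory channel, those two knobs correspond to the capacity and transactionCapacity properties; the values and the channel name below are illustrative, not the values used in the project.

    # Per-Agent channel tuning (memory channel shown as an example)
    agent1.channels.mem.type                = memory
    agent1.channels.mem.capacity            = 10000   # max events buffered in the channel
    agent1.channels.mem.transactionCapacity = 100     # max events per source/sink transaction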
19. Future work
   ● Expand RSS sources.
   ● Implement a web UI.
   ● Provide search services on the HDFS.
   ● Improve the HDFS load balancing.
20. Conclusions (1/3)
   ● The HDFS default configuration parameters are not suitable for deployment on AWS EC2.
   ● Cloudera Manager makes the installation and configuration process much easier!
     ○ But it also introduces a few constraints that might result in higher operating costs.
   ● Adapting the Agents' RSS reader is not trivial!
     ○ Different RSS sources have different content (e.g. posts with ad banners).
21. Conclusions (2/3)
   ● The Amazon EC2 service is easier to use and more reliable than other platforms we have used!
     ○ E.g. PlanetLab.
   ● Flume's architecture, based on streaming data flows, makes it easier to add new sources and sinks.
     ○ The service can scale by adding new Agents.
   ● Flume is horizontally scalable!
     ○ Its performance is proportional to the number of machines on which it is deployed.
22. Conclusions (3/3)
   ● Fine-tuning Flume's configuration files is not trivial!
   ● The HDFS NameNode is no longer a single point of failure!
     ○ Since NameNode replication was introduced. Adding passive NameNodes does affect the overall performance of the HDFS cluster, though.
23. References (1/2)
   ● Cloudera Flume 1.x installation
     ○ https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation
   ● Cloudera Manager CDH3
     ○ https://ccp.cloudera.com/display/FREE374/Cloudera+Manager+Free+Edition+Installation+Guide
   ● Cloudera port information
     ○ https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3
   ● Cloudera Flume User Guide
     ○ http://archive.cloudera.com/cdh4/cdh/4/flume-
24. References (2/2)
   ● Find more detailed information on our setup and configuration on our personal blogs:
     ○ http://www.aknahs.pt/
     ○ http://www.otnira.com/
     ○ http://115.186.131.91/~zafar/
25. Easter Egg: Issues faced
   ● One Islamic team member declared his love to a female Cloudera member and ended up having to marry her during the project.
     ○ Turns out it was a male.
   ● One member became angry because another team was using demos in their project and ended up cutting off a poor Rastafarian's hair.
     ○ Turns out that screenshots are better than demos.
   ● One member managed to get sunburned while doing the project. Before this it was thought that computer scientists only work in caves.
     ○ Turns out he just took a very hot shower.
26. Special Thanks
   ● Leandro Navarro - UPC
   ● Amazon
   ● jarcec - #flume on irc.freenode.net
   ● mids - #cloudera on irc.freenode.net (@mids106)
   Hanging out in IRC is useful!
27. News aggregator service on Amazon EC2
   Arinto Murdopo
   Mário Almeida
   Zafar Gilani
   SDS, EMDC 2012
