Cloudera Manager CDH3
● Automates the installation and configuration of CDH3 across an entire cluster.
● We used the free edition (up to 50 nodes).
Cloudera Flume
● A distributed, reliable and available system.
● Efficiently collects, aggregates and moves large amounts of log data.
● Moves data from many different sources to a centralized or distributed data store (such as Hadoop HDFS).
Hadoop HDFS (1/2)
● For our purpose, Hadoop handles:
  ○ Log receipt and storage.
  ○ Search and log processing.
● Coordinates work among a cluster of machines.
News Recommendation
● We hosted a webpage on which people can recommend possible news sources.
  ○ http://web.ist.utl.pt/~ist156947/sds/
● We retrieved a large compilation of news websites and blogs from a reasonable variety of countries.
  ○ E.g. Spain, Libya, Russia, Syria, Iran...
RSS News aggregator
● We wrote a Java application to read RSS feeds using:
  ○ java.net.URL to open the resource that the URL points to.
  ○ javax.xml.parsers for XML parsing.
  ○ org.w3c.dom for the DOM interfaces used to process the XML.
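The reading step can be sketched with the three packages listed above. This is a minimal illustration rather than the application from the project: the class name RssSketch, its method names, the "title | link" output format and the inline sample feed are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class; not taken from the project's code base.
final class RssSketch {

    /** Parses an RSS stream and returns one "title | link" string per <item>. */
    static List<String> readItems(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        NodeList items = doc.getElementsByTagName("item");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            result.add(text(item, "title") + " | " + text(item, "link"));
        }
        return result;
    }

    /** Returns the trimmed text of the first descendant with the given tag, or "" if absent. */
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Opens a live feed through java.net.URL; the caller supplies the feed address. */
    static List<String> readFeed(String feedUrl) throws Exception {
        try (InputStream in = new URL(feedUrl).openStream()) {
            return readItems(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // Inline sample feed (made up) so the sketch runs without network access.
        String sample = "<rss version=\"2.0\"><channel><title>Sample</title>"
            + "<item><title>First post</title><link>http://example.org/1</link></item>"
            + "<item><title>Second post</title><link>http://example.org/2</link></item>"
            + "</channel></rss>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String line : readItems(in)) {
            System.out.println(line);
        }
    }
}
```

Feeds that embed ads or HTML in their items would need extra filtering on top of this; the sketch only extracts the title and link fields.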
Proof of concept (1/3)
● Our Agent collects the RSS feeds and sends them to the Collector Agent.
Proof of concept (2/3)
● The Collector receives the events from both Agents and stores them in HDFS.
Proof of concept (3/3)
● Because we use a replication factor of 3, every DataNode ends up with the same amount of data.
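The arithmetic behind this slide can be sketched as follows. The slides do not state the cluster size, so the three DataNodes used here are an assumption, as are the class name ReplicationSketch and the 600 MB figure. With as many DataNodes as replicas, each node holds one full copy of the data; with more nodes, the average share per node shrinks.

```java
// Hypothetical illustration of HDFS storage arithmetic; not project code.
final class ReplicationSketch {

    /** Total raw bytes stored across the cluster for a given logical data size. */
    static long rawBytes(long logicalBytes, int replication) {
        return logicalBytes * replication;
    }

    /** Average raw bytes per DataNode, assuming replicas are spread evenly. */
    static long bytesPerNode(long logicalBytes, int replication, int dataNodes) {
        if (replication > dataNodes) {
            throw new IllegalArgumentException("not enough DataNodes to place every replica");
        }
        return rawBytes(logicalBytes, replication) / dataNodes;
    }

    public static void main(String[] args) {
        long logical = 600L * 1024 * 1024; // 600 MB of collected events (made-up figure)
        // Assumed cluster of 3 DataNodes: each node stores the full data set.
        System.out.println(bytesPerNode(logical, 3, 3) == logical);
        // Assumed cluster of 6 DataNodes: each node stores half of it on average.
        System.out.println(bytesPerNode(logical, 3, 6) == logical / 2);
    }
}
```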
Issues faced (1/4)
● The DataNode setting dfs.datanode.du.reserved is set by default to 10 GB.
  ○ This means that a DataNode with less than 10 GB of capacity has no space left for the file system. (Warning: Not able to place enough replicas)
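A simplified model of the effect described above: the space HDFS may use on a volume is roughly its capacity minus the reserved amount minus what is already used, and never less than zero. The class name ReservedSpaceSketch and the volume sizes are assumptions for illustration; the real DataNode accounting has more detail than this.

```java
// Hypothetical, simplified model of dfs.datanode.du.reserved; not Hadoop code.
final class ReservedSpaceSketch {

    static final long GB = 1024L * 1024 * 1024;

    /** Space left for HDFS blocks on a volume; clamps at zero when the reservation exceeds capacity. */
    static long availableForHdfs(long capacityBytes, long reservedBytes, long usedBytes) {
        return Math.max(0L, capacityBytes - reservedBytes - usedBytes);
    }

    public static void main(String[] args) {
        long reserved = 10 * GB; // the 10 GB default mentioned on the slide

        // A small volume (8 GB, made-up size): nothing is left for HDFS.
        System.out.println(availableForHdfs(8 * GB, reserved, 0) / GB + " GB");

        // A larger volume (30 GB, made-up size) with 5 GB already used.
        System.out.println(availableForHdfs(30 * GB, reserved, 5 * GB) / GB + " GB");
    }
}
```

On small instances, lowering the reserved value or attaching a larger volume are the two obvious ways out of the zero case.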
Issues faced (2/4)
● For Cloudera Manager to work, all nodes must run either SUSE or Red Hat.
● Cloudera Manager cannot run on an AWS EC2 micro instance.
● When an instance restarts, its IP address changes.
  ○ So Cloudera Manager loses track of the node.
● Cloudera Manager operates with private DNS names, so any references it makes point to the private DNS.
  ○ Web UIs are only accessible from our machines' web browsers through public DNS names.
Issues faced (3/4)
● Some installation guides do not mention the ports that must be open to communicate with the services.
  ○ Cloudera provides a page with all the required ports.
● Creating folders and changing user permissions is not mentioned in the user guide.
  ○ We needed to access Hadoop as user hdfs, create the flume folder and change its owner to flume using the chown command. (AccessControlException)
Issues faced (4/4)
● Although scaling by adding new Agents is easy, it requires fine-tuning each Agent's channel capacity (number of events) and transaction size.
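Why these two values need tuning can be seen in a toy model of a bounded channel. This is not Flume's implementation: the class ChannelSizingSketch, the one-transaction-per-tick sink and all the numbers are assumptions chosen for illustration. The model only shows that when events arrive faster than the sink drains them, the channel fills up and further puts are rejected.

```java
// Hypothetical toy model of a bounded event channel; not Flume code.
final class ChannelSizingSketch {

    /**
     * Simulates a channel for the given number of ticks and returns how many
     * incoming events were rejected because the channel was full.
     * Each tick the source offers eventsInPerTick events, then the sink
     * removes at most one transaction's worth of events.
     */
    static long rejectedEvents(int capacity, int transactionSize, int eventsInPerTick, int ticks) {
        if (transactionSize > capacity) {
            throw new IllegalArgumentException("transaction size must not exceed channel capacity");
        }
        long queued = 0;
        long rejected = 0;
        for (int t = 0; t < ticks; t++) {
            long accepted = Math.min(eventsInPerTick, capacity - queued);
            rejected += eventsInPerTick - accepted;
            queued += accepted;
            queued -= Math.min(queued, transactionSize); // sink drains one transaction
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Sink drains 10 events per tick while 20 arrive: the backlog grows until puts fail.
        System.out.println(rejectedEvents(100, 10, 20, 10));
        // Sink keeps up with the source: nothing is rejected.
        System.out.println(rejectedEvents(100, 20, 20, 10));
    }
}
```

In a real deployment a rejected put typically surfaces as an error at the source rather than a silent drop, so the count here is best read as a sign that capacity or transaction size is too small for the incoming rate.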
Future work
● Expand RSS sources.
● Implement a web UI.
● Provide search services on the HDFS.
● Improve the HDFS load balancing.
Conclusions (1/3)
● The default HDFS configuration parameters are not suitable for a deployment on AWS EC2.
● Cloudera Manager makes the installation and configuration process much easier!
  ○ But it also introduces a few constraints that might result in higher operating costs.
● Adapting the Agents' RSS reader is not trivial!
  ○ Different RSS sources have different contents (e.g. posts with ad banners).
Conclusions (2/3)
● The Amazon EC2 service is easier to use and more reliable than other cloud providers!
  ○ E.g. PlanetLab.
● Flume's architecture, based on streaming data flows, makes it easier to add new sources and sinks.
  ○ The service can scale by adding new Agents.
● Flume is horizontally scalable!
  ○ Its performance is proportional to the number of machines on which it is deployed.
Conclusions (3/3)
● Fine-tuning Flume's configuration files is not trivial!
● The HDFS NameNode is no longer a single point of failure!
  ○ NameNode replication has been introduced. Adding passive NameNodes affects the overall performance of the HDFS cluster, though.
References (1/2)
● Cloudera Flume 1.x installation
  ○ https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation
● Cloudera Manager CDH3
  ○ https://ccp.cloudera.com/display/FREE374/Cloudera+Manager+Free+Edition+Installation+Guide
● Cloudera port information
  ○ https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3
● Cloudera Flume User Guide
  ○ http://archive.cloudera.com/cdh4/cdh/4/flume-
References (2/2)
● More detailed information on our setup and configuration is available on our personal blogs:
  ○ http://www.aknahs.pt/
  ○ http://www.otnira.com/
  ○ http://126.96.36.199/~zafar/
Easter Egg: Issues faced
● One Muslim team member declared his love to a female Cloudera member and ended up having to marry her during the project.
  ○ Turns out it was a male.
● One member became angry because another team was using demos in their project and ended up cutting off a poor Rastafarian's hair.
  ○ Turns out that screenshots are better than demos.
● One member managed to get sunburned while doing the project. Before this, it was thought that computer scientists only worked in caves.
  ○ Turns out that he just took a very hot shower.
Special Thanks
● Leandro Navarro - UPC
● Amazon
● jarcec - #flume on irc.freenode.net
● mids - #cloudera on irc.freenode.net (@mids106)
Hanging out in IRC is useful!
News aggregator service on Amazon EC2
Arinto Murdopo, Mário Almeida, Zafar Gilani
SDS, EMDC 2012