4. Cloudera Manager CDH3
● Automates the installation and configuration
process of CDH3 on an entire cluster.
● We used free edition (up to 50 nodes).
5. Cloudera Flume
● A distributed, reliable and available system.
● To efficiently collect, aggregate and move
large amounts of log data.
● From many different sources to a centralized
or distributed data store (such as Hadoop
HDFS).
6. Hadoop HDFS (1/2)
● For our purposes, Hadoop handles:
○ Log receipt and storage.
○ Search and log processing.
● It coordinates work among a cluster of
machines.
10. News Recommendation
● We hosted a webpage where people could
recommend possible sources of news.
○ http://web.ist.utl.pt/~ist156947/sds/
● Retrieved a large compilation of news
websites and blogs from a reasonable variety
of countries.
○ E.g. Spain, Libya, Russia, Syria, Iran...
11. RSS News aggregator
● We wrote a Java application to read RSS
feeds using:
○ java.net.URL to handle the resource pointed to by
the URL.
○ javax.xml.parsers for XML parsing.
○ org.w3c.dom for the DOM interfaces used to process
the XML.
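The combination above can be sketched as follows. This is a minimal illustration, not the project's actual reader: the class name, the sample feed string, and the logged "title -> link" format are assumptions; the tag names follow RSS 2.0.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssReader {

    // Extracts "title -> link" pairs from an RSS 2.0 stream
    // using javax.xml.parsers and the org.w3c.dom interfaces.
    static List<String> parseItems(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
        List<String> out = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String title = item.getElementsByTagName("title")
                               .item(0).getTextContent();
            String link = item.getElementsByTagName("link")
                              .item(0).getTextContent();
            out.add(title + " -> " + link);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // A real agent would open the stream from a feed URL instead:
        //   InputStream in = new java.net.URL("http://example.com/rss.xml").openStream();
        String sample = "<rss version=\"2.0\"><channel>"
                + "<item><title>Headline</title>"
                + "<link>http://example.com/a</link></item>"
                + "</channel></rss>";
        InputStream in = new ByteArrayInputStream(
                sample.getBytes(StandardCharsets.UTF_8));
        for (String s : parseItems(in)) {
            System.out.println(s);
        }
    }
}
```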
12. Proof of concept (1/3)
● Our Agent collects the RSS feeds and sends
them to the Collector Agent.
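The agent tier can be sketched as a Flume properties file. All names, the source command, the log path, and the collector host/port below are illustrative assumptions, not our actual configuration:

```
# Sketch of an agent in flume.conf (names and values are illustrative)
agent1.sources = rssSrc
agent1.channels = memCh
agent1.sinks = toCollector

# exec source tailing the file the RSS reader appends to (assumed path)
agent1.sources.rssSrc.type = exec
agent1.sources.rssSrc.command = tail -F /var/log/rss/feed.log
agent1.sources.rssSrc.channels = memCh

agent1.channels.memCh.type = memory

# avro sink forwarding events to the Collector Agent
agent1.sinks.toCollector.type = avro
agent1.sinks.toCollector.hostname = collector.internal
agent1.sinks.toCollector.port = 4141
agent1.sinks.toCollector.channel = memCh
```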
13. Proof of concept (2/3)
● The collector receives the events from both
Agents and stores them into the HDFS.
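The collector side can be sketched the same way: an avro source receiving from the agents and an HDFS sink writing into the cluster. The names, port, and HDFS path are illustrative assumptions:

```
# Sketch of the collector in flume.conf (names and paths are illustrative)
collector.sources = fromAgents
collector.channels = memCh
collector.sinks = toHdfs

collector.sources.fromAgents.type = avro
collector.sources.fromAgents.bind = 0.0.0.0
collector.sources.fromAgents.port = 4141
collector.sources.fromAgents.channels = memCh

collector.channels.memCh.type = memory

collector.sinks.toHdfs.type = hdfs
collector.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/user/flume/news
collector.sinks.toHdfs.channel = memCh
```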
14. Proof of concept (3/3)
● Because the replication factor is 3, every
block is stored on three DataNodes; with our
three-node cluster, every DataNode ends up
with the same data.
15. Issues faced (1/4)
● The DataNode setting dfs.datanode.du.reserved
defaults to 10 GB.
○ This means that if a DataNode has less than 10 GB of
capacity, no space remains available to the file
system. (Warning: Not able to place enough
replicas)
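On small EC2 volumes, one workaround is lowering this reservation in hdfs-site.xml. The 1 GB value below is an illustrative assumption, not our actual setting; the value is in bytes:

```
<!-- Sketch: reduce the per-volume reserved space for non-HDFS use -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>1073741824</value>
</property>
```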
16. Issues faced (2/4)
● In order for the CDH Manager to work, all
nodes must run either SUSE or Red Hat.
● The CDH Manager cannot run on an AWS
EC2 micro instance.
● Upon instance restart, its IP address changes.
○ So the CDH Manager loses track of the node.
● The CDH Manager operates with private DNS,
so any references it makes point to private
DNS names.
○ Web UIs are only accessible from our machines' web
browsers through public DNS names.
17. Issues faced (3/4)
● Some installation guides omit the ports that
must be opened to allow communication
with their services.
○ Cloudera provides a page with all the required ports.
● Creating folders and changing user
permissions are not mentioned in the user
guide.
○ We needed to access Hadoop as the hdfs user, create
the flume folder, and change its owner to flume
using the chown command.
(AccessControlException)
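The fix amounts to two HDFS commands. The exact folder path is an assumption (it depends on the Flume sink configuration); run them on a cluster node with the hdfs superuser:

```
# Create the Flume landing folder as the HDFS superuser,
# then hand ownership to the flume user to avoid AccessControlException.
sudo -u hdfs hadoop fs -mkdir /user/flume
sudo -u hdfs hadoop fs -chown flume:flume /user/flume
```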
18. Issues faced (4/4)
● Although scaling by adding new Agents is
easy, it requires fine-tuning of the channel
capacity (number of events) and transaction
size for each Agent.
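These are per-channel Flume properties. The sketch below shows the two knobs in question; the agent/channel names and the numbers are illustrative assumptions, not our tuned values:

```
# Per-agent channel tuning in flume.conf (values are illustrative)
agent1.channels.memCh.type = memory
# max events the channel can buffer
agent1.channels.memCh.capacity = 10000
# max events per put/take transaction
agent1.channels.memCh.transactionCapacity = 100
```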
19. Future work
● Expand RSS sources.
● Implement a web UI.
● Provide search services on the HDFS.
● Improve the HDFS load balancing.
20. Conclusions (1/3)
● HDFS default configuration parameters are
not suitable for deployment on AWS EC2.
● Cloudera Manager makes installation and
configuration process much easier!
○ but it also introduces a few constraints that might
result in higher operating costs.
● Adapting the RSS reader of the agents is not
trivial!
○ Different RSS sources have different content (e.g.
posts with ad banners).
21. Conclusions (2/3)
● The Amazon EC2 service is easier to use and
more reliable than other cloud platforms!
○ E.g. PlanetLab.
● Flume's architecture based on streaming
data flows makes it easier to add new sources
and sinks.
○ the service can scale by adding new Agents.
● Flume is horizontally scalable!
○ because its performance is proportional to the
number of machines on which it is deployed.
22. Conclusions (3/3)
● Fine-tuning Flume's configuration files is
not trivial!
● HDFS NameNode is no longer a single point
of failure!
○ since NameNode replication was introduced. Adding
passive NameNodes affects the overall performance
of the HDFS cluster though.
24. References (2/2)
● Find more detailed information on our setup
and configuration on our personal blogs:
○ http://www.aknahs.pt/
○ http://www.otnira.com/
○ http://115.186.131.91/~zafar/
25. Easter Egg: Issues faced
● One Islamic team member declared his love to a female
Cloudera member and ended up having to marry her
during the project.
○ Turns out it was a male.
● One member became angry because another team was
using demos in their project and ended up cutting off a
poor Rastafarian's hair.
○ Turns out that screenshots are better than demos.
● One member managed to get sunburned while doing
the project. Before this, it was thought that computer
scientists only worked in caves.
○ Turns out that he just took a very hot shower.
26. Special Thanks
● Leandro Navarro - UPC
● Amazon
● jarcec - #flume on irc.freenode.net
● mids - #cloudera on irc.freenode.net
(@mids106)
Hanging out in IRC is useful!
27. News aggregator service on
Amazon EC2
Arinto Murdopo
Mário Almeida
Zafar Gilani
SDS, EMDC 2012