Apache Bigtop Working Group
Hadoop Pipelines (Roman's/Ron's Idea)
● Working group, you set your own goals. Structure: do a demo in
front of the class. Focus on skills employers are looking for.
● Cluster skills using AWS: create instances with the ec2-api tools; you will
have to extend this with scripts or your own code, and demo some of it.
– Goal: manage multiple instances. You can do this manually, but the number
of keystrokes goes up exponentially as you add new components. Need
some automation or code.
– Bash scripts are a good fit because they are already used in Bigtop init.d files
and in Roman's code; e.g. copy the mkdir commands into a script and run them.
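As a sketch of that kind of automation (the hostnames and the remote command below are hypothetical placeholders, not from the notes): generate the per-node ssh invocations into a file you can review, then run the file once it looks right.

```shell
#!/usr/bin/env bash
# Sketch: fan one setup command out to every node in the cluster.
# HOSTS and CMD are placeholders -- substitute your own.
HOSTS="node1 node2 node3"
CMD='mkdir -p /data/hdfs && chown hdfs:hdfs /data/hdfs'
# Emit the ssh invocations into a reviewable script instead of
# running them blind; execute the generated file when it looks right.
for h in $HOSTS; do
  printf 'ssh %s "%s"\n' "$h" "$CMD"
done > /tmp/cluster_setup.sh
cat /tmp/cluster_setup.sh
```

Writing the commands out first keeps the keystroke count flat as nodes are added, which is the point of the bullet above.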
● Hadoop*: learn all the features of 2.0.0. No training
course can give this to you; you will have to work
through it manually.
– Use 2.0.X unit test code as a base
● Not a class that gives you answers. The
answers classes give you are too simple to be useful.
● E.g.: does YARN/Hadoop 2.0.X support
multitenancy? Multiple users/companies can't
see each other's data, and if they run a query
they can't crash the cluster for other users. This
isn't the case now.
● ZooKeeper in HDFS requires some
administration. Do you need to roll back
ZooKeeper logs when a ZK cluster fails?
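One concrete handle on the multitenancy question above is YARN's CapacityScheduler: each tenant gets its own queue with a capacity cap, so one tenant's jobs can't starve everyone else. A minimal capacity-scheduler.xml sketch (the queue names tenantA/tenantB and the percentages are illustrative, not from the notes):

```xml
<!-- capacity-scheduler.xml sketch; queue names/values are examples -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>tenantA,tenantB</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.tenantA.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.tenantB.capacity</name>
    <value>40</value>
  </property>
</configuration>
```

This addresses the "can't crash the cluster" half; data isolation still depends on HDFS permissions and the security model.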
Bigtop Basic Skills
● Run Bigtop in AWS in distributed mode, start
● Create Hadoop* pipelines (Roman's/Ron's idea)
– Ron: book. Great idea!
● Run mvn verify; learn to debug and write tests here
● Will take months, demo driven. People do demos.
● Choose where to spend time.
● Bigdata =
– App development (Astyanax)
● Don't get distracted into 3). Not enough time to do it all well. Let
the Cloudera people help you.
● Do something new that people care about
– Don't try to be better than people with the same job skill
– Learn efficiently: practice, practice, practice. You can't learn by watching
Big Company vs. Small
– Interpolate Cloudera's strategy: Hadoop 2.0.X runs in the cloud; users access it
from the desktop via a browser and can run Hive/Pig on THEIR OWN data. If you need to
ingest data, e.g. with Flume, a sysadmin has to set that up. Don't spend time getting
Flume to work in Hue, but make sure you know the 2.0.X security models/LDAP,
pipeline debugging when things get stuck, failover, and application development
– HUE != Ambari. Why?
– Is there value in building apps in Hue or with Hue? The approach for webapps is changing
away from Hue toward something like Ambari, which uses a simpler user-defined MVC pattern.
– User-defined MVC is better. Why? Think like a manager: what happens as
Django adds more complicated features?
– e.g. Jetty/J2EE example
● Do everything, use Bigtop, and get to a working app as
fast as possible. 1) and 2) are very important. You have
to do things quickly.
● You decide how to spend your own time
● Schedule 3x meetings after this every 2 weeks
● Individual demos
● Install Bigtop; demo WordCount (WC) and Pi; demo components
● Turn pipeline demos into integration tests
● Test on pseudo distributed mode and cluster
● Listen to Roman: Hue....
● HBase requirements: R/S (RegionServer) 48GB, 8-12
● Memory: M/R tasks 1-2GB+ each, R/S 32GB+, OS 4-8GB
● Disk: reserve 25% for shuffle files, keep HDFS <50% full,
JBOD, no RAID
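The memory budget above can be sanity-checked with quick arithmetic. The GB figures come from the notes; the count of 8 concurrent M/R tasks per node is an assumed workload for illustration:

```shell
# Per-node memory budget check. R/S and OS figures are from the notes;
# 8 concurrent M/R tasks at 2GB each is an assumption for illustration.
RS_GB=32; OS_GB=8; TASKS=8; TASK_GB=2
MR_GB=$((TASKS * TASK_GB))
TOTAL_GB=$((RS_GB + OS_GB + MR_GB))
echo "RegionServer=${RS_GB}G OS=${OS_GB}G MapReduce=${MR_GB}G total=${TOTAL_GB}G"
```

Summing the pieces like this makes it obvious how quickly co-locating M/R with a RegionServer eats into a node's RAM.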
Starting Hadoop, M/R
● Look at the logs /var/log/hadoop-hdfs
● Cluster ID: in the VERSION file under ~/cache/.../data;
change the text. DEMO
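The clusterID demo can be simulated safely in a scratch directory instead of a live DataNode (the real file typically sits under the data dir at .../dfs/data/current/VERSION, though the exact path varies by distro; the field values below are made up):

```shell
# Simulate the clusterID demo in a scratch dir, not a live DataNode.
# The clusterID/storageType values here are invented for illustration.
d=/tmp/version_demo; mkdir -p "$d"
cat > "$d/VERSION" <<'EOF'
clusterID=CID-aaaa-1111
storageType=DATA_NODE
EOF
# Changing the clusterID is what makes a DataNode refuse to rejoin:
sed -i 's/^clusterID=.*/clusterID=CID-bbbb-2222/' "$d/VERSION"
grep '^clusterID' "$d/VERSION"
```

On a real node, a DataNode whose VERSION clusterID no longer matches the NameNode's will fail to register, which is exactly what the in-class demo shows.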
● No connection, check ping, check core-site.xml,
● M/R/YARN: mapred-site.xml. NOTE: M/R uses
port 8021 and so does the NameNode. Keep this
port but run it on a different server; open port 8031
● Telnet jt:8021; turn off iptables, disable SELinux
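When telnet isn't installed, the same port check can be done with bash's built-in /dev/tcp redirection (bash-specific, not POSIX sh). The host name jt is the placeholder from the notes; the probe of localhost port 1 below is a deliberately closed port so the sketch has a predictable result:

```shell
# Bash-only port probe (no telnet/nc needed); prints open or closed.
check_port() {   # usage: check_port host port
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null && echo open || echo closed
}
# In practice: check_port jt 8021 (JobTracker) or check_port jt 8031.
# Port 1 on localhost is almost certainly closed, so this prints "closed":
check_port localhost 1
```

If the port shows closed from a worker but open locally on jt, suspect iptables or a bind to 127.0.0.1 rather than the cluster interface.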
● 1 node manager
– File Input Format Counters
– Bytes Read=1180
– File Output Format Counters
– Bytes Written=97
– Job Finished in 92.72 seconds
– Estimated value of Pi is 3.14080000000000000000
● 3 node managers
– File Output Format Counters
– Bytes Written=97
– Job Finished in 86.762 seconds
– Estimated value of Pi is
Many options for projects
● Integration code testing when Roman gets here
in 2 weeks
● Work w/Ron or Victor on projects
● Update the wiki with the AWS cluster setup;
automate with Whirr? + Chef/Puppet?
● Add HBase, Zookeeper management for
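For the Whirr automation idea above, a minimal whirr.properties sketch for a Hadoop cluster on AWS (the cluster name and node counts are illustrative; credentials are pulled from the environment rather than written into the file):

```properties
# whirr.properties sketch -- cluster name and instance counts are examples
whirr.cluster-name=bigtop-demo
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

A properties file like this replaces the manual instance creation from the first slide with one launch-cluster invocation, which is what makes it a good wiki/demo project.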