Apache Bigtop Working Group
7/14/2013
Basic Skills
Hadoop Pipelines (Roman's/Ron's Idea)
Career positioning
Basic Skills
● This is a working group: you set your own goals. Structure: do a demo in front of the class. Focus on skills employers are looking for.
● Cluster skills using AWS: create instances with the ec2-api tools, then extend that with scripts or your own code. You have to demo some skill.
– Goal: manage multiple instances. You can do this manually, but the number of keystrokes goes up exponentially as you add new components, so you need some automation or code.
– Bash scripts are a good choice b/c they are already used in Bigtop's init.d files and in Roman's code; e.g., copy the mkdir commands into a script and run them (see the sketch below).
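A minimal sketch of that automation, using the newer AWS CLI rather than the ec2-api tools named above; the AMI ID, key name, and user names are placeholders, not from the session:

  #!/bin/bash
  # Launch three worker instances (assumes the AWS CLI is configured).
  aws ec2 run-instances \
      --image-id ami-12345678 \
      --count 3 \
      --instance-type m1.large \
      --key-name bigtop-key

  # The "copy the mkdir commands into a script" idea: create an HDFS home
  # directory per working-group member instead of typing each one by hand.
  for u in alice bob carol; do
      sudo -u hdfs hadoop fs -mkdir -p /user/$u
      sudo -u hdfs hadoop fs -chown $u /user/$u
  done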
Basic Skills
● Hadoop*: all the features of 2.0.0. No training course can give this to you; you will have to work through it manually.
– Use 2.0.X unit test code as a base (see the sketch below)
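One way to start, e.g. running a single HDFS test from a Hadoop 2.0.x source checkout (the test class named here is just an example):

  # Build and run one unit test, then read it next to the code it exercises.
  cd hadoop-hdfs-project/hadoop-hdfs
  mvn test -Dtest=TestDFSShell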
Hadoop 2.0.0
● Basic FS review:
– Copy-on-write
– Write-through/write-back, FSCK (see the fsck sketch below)
– Inodes/B-trees, NN/DN
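For the FSCK item, HDFS ships its own fsck; a quick way to poke at a running cluster:

  # Overall HDFS health, then per-file block detail for one directory.
  sudo -u hdfs hdfs fsck /
  sudo -u hdfs hdfs fsck /user -files -blocks -locations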
Working Group
● This is not a class that hands you answers; the answers classes give you are too simple to be valuable.
● E.g.: does YARN/Hadoop 2.0.X support multitenancy? Multiple users/companies shouldn't be able to see each other's data, and a query run by one user shouldn't be able to crash the cluster for the others. That isn't the case now.
Hadoop 2.0.X
● ZooKeeper in HDFS (NameNode HA/failover) requires some administration. Do you need to roll back the ZooKeeper logs when a ZK cluster fails? (See the health checks below.)
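Basic health checks to start from (standard ZooKeeper client port; the host name is a placeholder):

  # Four-letter-word checks against a ZooKeeper server.
  echo ruok | nc zk1 2181    # expect "imok"
  echo stat | nc zk1 2181    # role (leader/follower), connected clients
  # Or on the server itself:
  zkServer.sh status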
Bigtop Basic Skills
● Run Bigtop in AWS in distributed mode, starting w/HDFS
● Create Hadoop* pipelines (Roman's/Ron's idea)
– Ron: book. Great idea!
● Run mvn verify; learn to debug and write tests here (see the sketch below)
● This will take months and is demo driven. People do demos.
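A sketch of the mvn verify loop against a Bigtop checkout (the module layout and test name are illustrative; check the checkout before relying on them):

  # Failsafe runs the integration-test + verify phases.
  cd bigtop-tests
  mvn verify
  # Narrow to a single failing test while debugging (Failsafe property).
  mvn verify -Dit.test=TestHadoopSmoke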
Career positioning
● Choose where to spend your time.
● Big data =
1) Devops
2) App development (Astyanax)
3) Internals
● Don't get distracted by 3). There isn't enough time to do all three well; let the Cloudera people help you there.
● Do something new that people care about
– Don't try to be better than people w/ the same job skill
– Learn efficiently: practice, practice, practice. You can't learn by watching.
Big Company vs. Small
● Big:
– Interpolate Cloudera's strategy: Hadoop 2.0.X runs in the cloud, users access it from the desktop via a browser and can run Hive/Pig on YOUR data; if you need to ingest data, e.g. w/ Flume, a sysadmin has to set that up. So don't spend time getting Flume to work in Hue, but do make sure you know the 2.0.x security models/LDAP, pipeline debugging when things get stuck, failover, and application development.
– HUE != Ambari. Why?
– Is there value in building apps in HUE or w/ HUE? The approach for webapps is changing away from HUE toward something like Ambari, which is a simpler user-defined MVC pattern.
– Why is user-defined MVC better? Think like a manager: what happens as Django adds more complicated features?
– e.g. the Jetty/J2EE example
Small
● Do everything: use Bigtop, get to a working app as fast as possible. 1) and 2) (devops and app development) are very important; you have to do things quickly.
● You decide how to spend your own time.
Structure
● Schedule 3 more meetings after this one, every 2 weeks
● Individual demos
● Install Bigtop; demo WordCount and Pi, then demo components and pipelines (see the sketch below)
● Turn pipeline demos into integration tests
● Test in pseudo-distributed mode and on a cluster
● Listen to Roman: Hue....
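The two standard demos, assuming a Bigtop-installed cluster (the examples jar path can vary by version):

  # Pi estimator: 16 maps, 100000 samples each.
  hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 16 100000

  # WordCount over a small HDFS input directory.
  hadoop fs -mkdir -p wc-in
  hadoop fs -put /etc/hadoop/conf/*.xml wc-in
  hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
      wordcount wc-in wc-out
  hadoop fs -cat wc-out/part-r-00000 | head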
HBase/Hadoop
● HBase requirements: RegionServer (R/S) nodes w/ 48GB RAM, 8-12 cores/node
● Memory per node: M/R tasks 1-2GB+, R/S 32GB+, OS 4-8GB, the rest to HDFS
● Disk: 25% for shuffle files, keep HDFS <50% full, JBOD, no RAID (see the dfsadmin check below)
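To see how full HDFS actually is per node:

  # Capacity, DFS Used%, and remaining space for each DataNode.
  sudo -u hdfs hdfs dfsadmin -report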
Starting Hadoop, M/R
● Look at the logs in /var/log/hadoop-hdfs
● Cluster ID: under ~/cache/.../data in the VERSION file; change the text. DEMO
● No connection? Check ping, check core-site.xml, /etc/hosts
● M/R/YARN: mapred-site.xml. NOTE: M/R uses port 8021 and so does the NAMENODE. Keep this port but run them on different servers; open port 8031
● telnet jt:8021; turn off iptables, disable SELinux (see the checklist below)
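A connectivity checklist for the above (host name jt is from the notes; the firewall/SELinux changes are dev-cluster-only moves):

  ping jt                      # basic reachability
  telnet jt 8021               # M/R (JobTracker) port
  telnet jt 8031               # YARN resource-tracker port
  sudo service iptables stop   # dev only: rule out the firewall
  sudo setenforce 0            # dev only: put SELinux in permissive mode
  # If nodes still can't see each other, check name resolution and the
  # filesystem URI the clients are using.
  cat /etc/hosts
  grep -A1 'fs.default' /etc/hadoop/conf/core-site.xml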
M/R Setup
● 1 node manager:
– WRONG_REDUCE=0
– File Input Format Counters: Bytes Read=1180
– File Output Format Counters: Bytes Written=97
– Job Finished in 92.72 seconds
– Estimated value of Pi is 3.14080000000000000000
M/R AWS
● 3 node managers:
– File Output Format Counters: Bytes Written=97
– Job Finished in 86.762 seconds
– Estimated value of Pi is 3.14080000000000000000
Zookeeper Administration
Many options for projects
● Integration code testing when Roman gets here in 2 weeks
● Work w/ Ron or Victor on projects
● Update the wiki w/ the AWS cluster setup; automate w/ Whirr? + Chef/Puppet? (see the sketch below)
● Add HBase; ZooKeeper management for Hadoop (monit/supervisord)
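A rough sketch of the Whirr route (property names follow Whirr's Hadoop recipes; treat this as a starting point, not a tested config):

  # hadoop.properties -- minimal Whirr cluster definition
  #   whirr.cluster-name=bigtop-wg
  #   whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
  #   whirr.provider=aws-ec2

  whirr launch-cluster --config hadoop.properties
  # ...run the demos...
  whirr destroy-cluster --config hadoop.properties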
