Status of Hadoop 0.23
Operations at Yahoo!
From Inception to Customer Validation

Charles Wimmer, Staff Site Reliability Engineer at LinkedIn
Summary of This Talk
● Includes
  ○ Operational changes required to support 0.23
● Does not include
  ○ Specifics about customer testing
  ○ Deployment into Research or Production clusters
Scope of This Change at Yahoo!
●   42,000+ Hadoop servers
●   20+ clusters
●   Three tiers: Sandbox, Research, Production
●   Upgrading from 0.20.205.x
Overview of the Process
● Provide customers a 0.23 Sandbox cluster
● Provide customers enough data to test their
  applications
● Provide developer support to address
  application issues quickly
● Upgrade Research and Production clusters
  as applications are certified to work with 0.23
Test Cluster
●   420 Nodes
●   2 x Westmere 4 core processors
●   24 GB RAM
●   12 x 2 TB disks
●   No Federation
Configuration
● Hierarchical Queues
● Memory Configuration
● Kerberos
Hierarchical Queues
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>BIZUNIT-A,BIZUNIT-U,BIZUNIT-C,unfunded</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.capacity</name>
  <value>100</value>
</property>
Hierarchical Queues
<property>
 <name>yarn.scheduler.capacity.root.BIZUNIT-A.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.BIZUNIT-U.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.BIZUNIT-C.capacity</name>
  <value>15</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.unfunded.capacity</name>
  <value>5</value>
</property>
Hierarchical Queues
<property>
  <name>yarn.scheduler.capacity.root.BIZUNIT-U.proj-a.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.BIZUNIT-U.proj-b.capacity</name>
  <value>50</value>
</property>
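The proj-a and proj-b leaf queues only take effect if their parent declares them. A minimal sketch of that declaration, assuming the queue names above (this property is the Capacity Scheduler's standard child-queue list; the snippet is a completion, not taken from the original deck):

```xml
<!-- Declare the children of BIZUNIT-U so its capacity can be subdivided. -->
<property>
  <name>yarn.scheduler.capacity.root.BIZUNIT-U.queues</name>
  <value>proj-a,proj-b</value>
</property>
```

Note that sibling capacities sum to 100 at each level: 50+30+15+5 at the root, and 50+50 under BIZUNIT-U.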
Hierarchical Queues
Memory Configuration
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>21504</value>
</property>
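21504 MB advertises 21 GB of each node's 24 GB to YARN, leaving roughly 3 GB of headroom for the OS and the Hadoop daemons. Container granularity is governed separately by the scheduler's allocation bounds; a hedged sketch (the property name below is the one documented for YARN's schedulers, and the value is an illustrative assumption, not from the deck):

```xml
<!-- Smallest container the ResourceManager will grant; requests are
     rounded up to a multiple of this value. Value is illustrative. -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>
</property>
```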
Kerberos Configuration
  <property>
   <name>yarn.resourcemanager.principal</name>
   <value>mapred/clustername-jt1.domain.name.com@REALM.NAME.COM</value>
  </property>

  <property>
   <name>yarn.nodemanager.principal</name>
   <value>tt/_HOST@REALM.NAME.COM</value>
  </property>
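The `_HOST` token is expanded to the local hostname at runtime, so one config file serves every node. Principals alone are not enough: each daemon also needs a keytab to authenticate with the KDC. A sketch, assuming the standard YARN keytab property and a hypothetical keytab path:

```xml
<!-- Keytab the NodeManager logs in with; the path is a hypothetical
     example, not from the original deck. -->
<property>
  <name>yarn.nodemanager.keytab</name>
  <value>/etc/security/keytabs/nm.service.keytab</value>
</property>
```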
Init Scripts
●   DataNode/NameNode
●   SecondaryNameNode
●   HistoryServer
●   NodeManager
●   ResourceManager
DataNode/NameNode
start_20(){
  ...   # start the 0.20.205 daemon (body elided on the slide)
}
start_next(){
  ...   # start the 0.23 daemon (body elided on the slide)
}
# 0.23 installs ship bin/hdfs; use its presence to pick the start path.
if [ -x /home/gs/hadoop/current/bin/hdfs ] ; then
   start_next "$@"
else
   start_20 "$@"
fi
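The detection pattern above can be sketched as a small testable helper: the init script keys off whether the new-layout `hdfs` launcher exists under the install root. The `pick_start` function and its echo-based interface are illustrative, not from the original script:

```shell
#!/bin/sh
# Version-detection sketch: 0.23 installs ship bin/hdfs, 0.20 does not.
# pick_start echoes which start routine the init script would call.
pick_start() {
  root="$1"
  if [ -x "$root/bin/hdfs" ] ; then
    echo start_next   # 0.23-style layout detected
  else
    echo start_20     # fall back to the 0.20.205 start path
  fi
}
```

This keeps one init script serving both releases, so the package upgrade only has to swap the `current` symlink.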
SecondaryNameNode
function clean_checkpoint_dir {
  CHECKPOINT_DIR=/grid/0/tmp/hadoop-hdfs/dfs/namesecondary/current
  if [ -d "$CHECKPOINT_DIR" ] ; then
    DELETE_DIR=$(mktemp -p /grid/0/tmp -d delete-XXXXXX)
    if [ $? -eq 0 ] ; then
      echo "moving $CHECKPOINT_DIR to ${DELETE_DIR}/"
      mv "$CHECKPOINT_DIR" "${DELETE_DIR}/"
      # Hand the slow recursive delete to at(1) so startup isn't blocked.
      cat <<EOF | at now+1min 2>/dev/null
if [ -d $DELETE_DIR ] ; then
    rm -rf --preserve-root $DELETE_DIR
fi
EOF
    fi
  fi
}
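The function above avoids deleting a large checkpoint directory in-line: a same-filesystem `mv` is just a rename, so the daemon can restart immediately while the slow `rm` runs later under `at`. The pattern in isolation, as a testable sketch (the helper name and scratch-dir argument are illustrative):

```shell
#!/bin/sh
# Move-aside-then-delete sketch: rename is fast, recursive delete is slow.
move_aside_and_delete() {
  target="$1"    # directory to dispose of
  scratch="$2"   # scratch area on the same filesystem as $target
  [ -d "$target" ] || return 0
  delete_dir=$(mktemp -p "$scratch" -d delete-XXXXXX) || return 1
  mv "$target" "$delete_dir"/
  # The real init script defers this step via `at now+1min` instead.
  rm -rf --preserve-root "$delete_dir"
}
```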
HistoryServer
case "$1" in
 start)
   su $HADOOP_USER -s /bin/sh -c \
     "$HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver"
   RET=$?
   ;;
ResourceManager/NodeManager
# $PROC is resourcemanager or nodemanager, set per init script
case "$1" in
 start)
   su $HADOOP_USER -s /bin/sh -c \
     "$HADOOP_PREFIX/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start $PROC"
   RET=$?
   ;;
Questions?
Charles Wimmer

@cwimmer

charles@wimmer.net

cwimmer@linkedin.com
