Infrastructure Around Hadoop


Published in: Technology

  1. Hadoop Summit 2012: Infrastructure Around Hadoop
     Backups, failover, configuration and monitoring
     Terran Melconian, Edmund MacKenty
  2. What TripAdvisor Does
     •  World's largest travel site and community
     •  Trip planning user reviews
     •  >50 million unique monthly visitors, 30 countries*
     •  >60 million reviews and opinions*
     •  Run like a startup: 30+ teams all doing their own thing
     •  Heavy use of open-source projects
     •  Speed Wins!
     * source: comScore Media Metrix for TripAdvisor Sites, Worldwide, January 2012
  3. What the Warehouse Team Does
     •  Retain and aggregate historic site activity data
     •  Make data available throughout the company
     •  Hits, reviews, forums, contacts, locations, businesses, etc.
     •  ~50 nodes in 4 clusters: Cloudera CDH3u3 (Hadoop 0.20.2)
     •  Used by ~12 analytics teams, heavy use of Hive
     •  Some jobs must run every day (e.g. ETL, aggregations)
     •  Systems are very open; we trust our users (usually)
     •  3 people, fairly new to Hadoop/Hive
  4. Why Hadoop at TripAdvisor
     •  Hadoop is how we scale analysis past the limits of one machine
        –  Some daily jobs were taking nearly 24 hours and were still growing quickly
     •  Our old RDBMS data warehouse could barely keep up with data ingestion, even running on expensive hardware with a SAN
        –  We obtained a 20x improvement in wall clock time
     •  Reprocess unaggregated historical data as definitions change
        –  Before, impossible except for a small sample
        –  Now, reprocess years of data at the finest level in a few days
     •  Efficient platform for many kinds of statistics
        –  Representative example: five-hour RDBMS job went to 25 minutes
  5. HA NameNode: DRBD, Corosync and Pacemaker
     •  Namenode and JobTracker run on “master” node
     •  Datanode and TaskTracker run on “slave” nodes
     •  Automatic fail-over of all master-node services to a passive node
     •  Provision two identical systems
     •  Set up virtual Master IP address to be failed over
     •  Secondary namenode on passive node, if available
     •  Monitor and automatically restart failed services
  6. DRBD/Corosync Configuration
     •  DRBD: replicates namenode image, Hive metadata, Oozie job data
        –  Create two identical storage devices (we used RAID 1)
        –  Connect the master nodes with a cross-over ethernet cable
        –  Configure DRBD to use the cross-over and storage devices
        –  Use drbdadm to create the replicated device
        –  Create a filesystem on /dev/drbd0 with mkfs
        –  cat /proc/drbd to see the state of the device
        –  Once created, use /etc/init.d/drbd to manage it
     •  Corosync: messaging between active-passive masters
        –  Configure Corosync to also use the cross-over ethernet cable
        –  Corosync will start Pacemaker for you
        –  Use /etc/init.d/corosync to manage both it and Pacemaker
  7. Pacemaker Configuration
     •  Define each resource you want to manage:
        –  DRBD device, master IP address, ethernet connectivity checks, Hadoop namenode and jobtracker, Hive thrift server, MySQL for Hive metadata, Oozie for workflow coordination
     •  Set monitoring intervals for each resource
     •  Define resource co-location dependencies
     •  Define resource ordering dependencies
     •  Pacemaker restarts failed services, e.g. Hive-Thrift
     •  Use the crm tool to manage nodes and resources
     •  Test with a manual fail-over:
        –  Migrate the namenode resource to the passive master
        –  Use crm status to watch all resources move over
  8. Monitoring: Ganglia and Nagios, Job Tracking
     •  Visibility into cluster operations
     •  Monitor hardware states and resource usage
     •  Notify on specific boundary or failure conditions
     •  Track MapReduce jobs and Hive tables
     •  Identify immediate problems
     •  Show trends over time to predict future needs
  9. Ganglia
     •  Standard monitoring of CPU, memory, disk usage, etc.
     •  Perl script parses Hadoop metrics and sends them using gmetric(1)
     •  ~50 Hadoop metrics, ~30 system metrics
     •  Graphs for entire cluster and individual nodes
     •  Example: Two jobs with different resource profiles
  10. Nagios
      •  Our primary notification system
      •  About 80 checks, ~25 are our own. Examples:
         –  check_hdp_connectivity: can master talk to all its slaves?
         –  check_hdp_data_nodes: are all configured slave datanodes running?
         –  check_hdp_max_mr_settings: does jobtracker have resources we expect?
         –  check_hadoop_master_logfiles: are logs being written to?
         –  check_hive_server: is it up?
      •  Some warnings:
         –  Do not let Nagios run hadoop fsck (check_hdp_hdfs)
         –  LDAP failure causes email cascade
         –  High loads can cause timeouts, which cause notifications
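A check such as check_hadoop_master_logfiles ("are logs being written to?") can be approximated with a file-age test. The following is a minimal sketch, not the authors' plugin: the function name and the thresholds are assumptions made for illustration. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_log_freshness(path, warn_secs=600, crit_secs=1800, now=None):
    """Return (exit_code, message) based on how recently `path` was modified.

    The thresholds are illustrative assumptions, not values from the slides.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as exc:
        # A missing or unreadable log is reported as UNKNOWN, not CRITICAL.
        return UNKNOWN, f"UNKNOWN - cannot stat {path}: {exc}"
    if age >= crit_secs:
        return CRITICAL, f"CRITICAL - {path} not written for {age:.0f}s"
    if age >= warn_secs:
        return WARNING, f"WARNING - {path} not written for {age:.0f}s"
    return OK, f"OK - {path} written {age:.0f}s ago"
```

A wrapper script would print the message and exit with the returned code, which is how Nagios reads a plugin's result.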
  11. Job Tracking
      •  Perl script invoked frequently by cron
      •  Parses jobtracker log entries since last run
      •  Records data on each job in a PostgreSQL DB:
         –  Job ID, user, submitting IP and time, status
         –  Cluster ID, queue, Hive query
         –  Start/stop times for job and first mapper and reducer
         –  Mapper and reducer counts, max memory, slots, splits
      •  CGI script to do queries:
         –  Running jobs, failed jobs, MapReduce capacity usage
         –  Job resource usage by status, queue, user
      •  Helps post-mortem of problems
      •  Used to predict trends, future resource needs
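The record-then-query pattern above can be sketched in a few lines. This is an illustration only: it uses SQLite as a stand-in for the PostgreSQL database, and the table layout, column names and function names are hypothetical rather than taken from the authors' scripts. The log parsing itself is omitted.

```python
import sqlite3


def open_job_db(path=":memory:"):
    """Open (or create) a job-tracking database with a hypothetical schema."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_id TEXT PRIMARY KEY, user TEXT, queue TEXT, status TEXT,
               start_time INTEGER, stop_time INTEGER,
               mappers INTEGER, reducers INTEGER)"""
    )
    return db


def record_job(db, job):
    """Insert or update one job record given as a dict keyed by column name."""
    # INSERT OR REPLACE makes re-parsing the same log entries harmless.
    db.execute(
        "INSERT OR REPLACE INTO jobs VALUES (:job_id, :user, :queue, :status,"
        " :start_time, :stop_time, :mappers, :reducers)",
        job,
    )
    db.commit()


def failed_jobs_by_user(db):
    """Return {user: number of failed jobs}, one of the report-style queries."""
    rows = db.execute(
        "SELECT user, COUNT(*) FROM jobs WHERE status = 'FAILED'"
        " GROUP BY user ORDER BY user"
    )
    return dict(rows.fetchall())
```

Keying the table on the job ID is what lets a frequently run cron job overlap with its previous run without creating duplicate rows.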
  12. Other cron scripts we run
      •  Check_load:
         –  Dumps Java stack trace when load is too high
         –  Emails list of top processes so we can see what was wrong
      •  Master nodes:
         –  Compresses Hadoop/Hive logs more than 30 days old
         –  Removes logs more than 120 days old (we keep 10+ GBs)
         –  Check_hdfs: Runs hadoop fsck to see if HDFS is “healthy”
         –  Backup current namenode fsimage
      •  Slave nodes:
         –  Check_disks: Removes read-only disks from datanode configuration
         –  Check_load: Kills some tasks and notifies us when load is too high
      •  Refresh production data to development cluster
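The log retention rule above (compress after 30 days, remove after 120 days) might look roughly like this. It is a sketch under assumptions: the function name is made up, age is judged by file modification time, and only regular files directly inside one directory are handled.

```python
import gzip
import os
import shutil
import time

DAY = 86400  # seconds


def rotate_logs(log_dir, compress_after_days=30, remove_after_days=120, now=None):
    """Compress old logs and delete very old ones; return (compressed, removed)."""
    now = time.time() if now is None else now
    compressed, removed = [], []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        mtime = os.path.getmtime(path)
        age_days = (now - mtime) / DAY
        if age_days > remove_after_days:
            os.remove(path)
            removed.append(name)
        elif age_days > compress_after_days and not name.endswith(".gz"):
            gz_path = path + ".gz"
            with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Keep the original timestamp so the 120-day rule still applies.
            os.utime(gz_path, (mtime, mtime))
            os.remove(path)
            compressed.append(name)
    return compressed, removed
```

Preserving the modification time on the compressed copy matters: without it, every compressed log would look new and would never age out.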
  13. Configuration Management
      •  Seems like extra work at first, but essential as you grow.
      •  Not Hadoop-specific: manage OS packages, Nagios and Ganglia scripts, cron jobs, svn, SSH keys, NFS mounts, jars
         –  Consistent UID/GIDs are critical with DRBD
         –  We replace some jars from the RPMs with local fixes
         –  Templatized configuration files are very convenient. ERB is good.
         –  SSH keys made consistent across nodes, masters share host key
      •  Use SVN as file delivery mechanism: checkout on each box
      •  We chose Puppet as a tool
         –  Gets the job done
         –  Lacks flexibility in inheritance to specialize defaults per-machine
         –  Some aspects of operation are hard to debug
  14. Backup: HDFS and Hive DDL
      •  Objectives:
         –  Provide safety against total HDFS failure due to software bugs or machine room environmental incident
         –  Protect against user error in dropping or overwriting tables
         –  Restore data to another cluster
      •  Assumptions
         –  Repeating one day of processing is acceptable when restoring
      •  Components
         –  Incremental HDFS backup
         –  Hive DDL backup
      •  Runs on separate backup server with storage (NexSan)
         –  Pull process driven by processes on backup server
  15. Backup HDFS
      •  Open-source Java app
      •  Requires customization to your environment
      •  Traverses HDFS directory tree
      •  Copies out files modified after a given date
      •  Doesn't copy very new directories
         –  Needed a way to avoid copying files being written at time of backup
         –  HDFS has no snapshots
      •  Ignores specified directories
      •  Generates restore shell scripts to set owners, perms
      •  Verification tool checks file sizes and checksums
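The selection rules of the incremental backup (modified after a given date, skip very new directories, ignore listed directories) can be illustrated on a local directory tree. This sketch is not the Java app described above: it walks the local filesystem as a stand-in for HDFS, and the function name and the one-hour "too new" threshold are assumptions.

```python
import os
import time


def files_to_back_up(root, since, ignore=(), min_dir_age_secs=3600, now=None):
    """List files under `root` modified after `since` (epoch seconds).

    Subdirectories listed in `ignore` (relative to root) are skipped, as are
    subdirectories modified within the last `min_dir_age_secs`, on the
    assumption that something may still be writing into them.
    """
    now = time.time() if now is None else now
    ignored = {os.path.normpath(os.path.join(root, d)) for d in ignore}
    selected = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = sorted(
            d for d in dirnames
            if os.path.normpath(os.path.join(dirpath, d)) not in ignored
            and now - os.path.getmtime(os.path.join(dirpath, d)) >= min_dir_age_secs
        )
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > since:
                selected.append(os.path.relpath(path, root))
    return selected
```

Skipping recently modified directories trades completeness for consistency: such files are picked up by a later run, which fits the stated assumption that repeating one day of processing is acceptable.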
  16. Backup Hive DDL
      •  Open-source Java app uses Thrift server
      •  Iterates over all tables and views
      •  Constructs DDL statements from Hive metadata
      •  Ignores specific tables
      •  Generates Hive command script
         –  Recreates all tables, adds all partitions back one at a time
      •  Used to move metadata to MySQL
      •  Restore full cluster:
         –  Copy files back with copyFromLocal
         –  Run perm/owner scripts
         –  Reapply Hive DDL
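To make the "construct DDL from metadata, then add partitions back one at a time" step concrete, here is a small sketch. It does not talk to the Hive Thrift server; the metadata is a plain dictionary whose layout is an assumption, and only simple managed tables with typed columns are covered.

```python
def table_ddl(table):
    """Build a CREATE TABLE statement from a metadata dict (assumed layout)."""
    cols = ", ".join(f"{name} {typ}" for name, typ in table["columns"])
    ddl = f"CREATE TABLE {table['name']} ({cols})"
    if table.get("partition_keys"):
        keys = ", ".join(f"{name} {typ}" for name, typ in table["partition_keys"])
        ddl += f" PARTITIONED BY ({keys})"
    return ddl + ";"


def partition_ddl(table_name, spec):
    """Build one ALTER TABLE ... ADD PARTITION statement for a partition spec."""
    kv = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return f"ALTER TABLE {table_name} ADD PARTITION ({kv});"


def restore_script(tables, ignore=()):
    """Emit a Hive script that recreates tables and re-adds each partition."""
    lines = []
    for table in tables:
        if table["name"] in ignore:
            continue
        lines.append(table_ddl(table))
        for spec in table.get("partitions", []):
            lines.append(partition_ddl(table["name"], spec))
    return "\n".join(lines)
```

For a table `hits` partitioned by `ds`, the script contains one CREATE TABLE line followed by one ALTER TABLE line per partition, which is slow for large tables but easy to resume after a failure.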
  17. Other Things To Potentially Back Up
      •  Backup the Namenode Metadata
         –  We do this once every 4 hours
         –  This is in addition to mirroring on four physical drives
      •  Our job tracking database
      •  No general backups of root or local FS on machines
         –  Recreate machines with Puppet or other configuration management tool instead
      •  Oozie job database
         –  We do NOT back this up
         –  Tightly coupled with HDFS state and restore would be problematic
         –  The recovery procedure is to rebuild and reinstall coordinators
  18. Oozie: Why
      •  Drawback: several times slower to write than cronjobs, while also less expressive
      •  Advantage: Ability to cleanly depend on input data
         –  With cron, you would have to poll for stamps
      •  Advantage: Clean and consistent metadata
         –  See what ran, what failed, what is still waiting and why
         –  Easily retry things which failed (good luck doing that with cron)
         –  Output datasets are deleted on rerun so ordering is preserved
  19. Oozie: How
      •  Establish consistent local practices for completion stamps, job naming, owners, and source code locations
      •  Enforce that all jobs must be idempotent
      •  Create scripts/makefiles/build.xml to rebuild and reinstall jobs after changes in their dependencies
      •  Bypass the Oozie GUI
         –  The CLI is a more capable tool
         –  Go straight to the Oozie backing DB and issue SQL queries
      •  Rerun coordinator actions, not workflows
      •  Don't ever use Derby: we experienced massive corruption
  20. Experiences and Expectations
      •  Hadoop is not mature from a reliability and stability point of view
         –  It will probably get there in a few more years
      •  Cluster outages are common events, not outliers
         –  Must bounce key services to pick up basic configuration changes such as adding a new queue
         –  As you scale up, you will encounter new classes of problems
         –  Example: kernel deadlocks during heavy disk IO
      •  You must design for failure and have a robust mechanism to cleanly and easily resume execution once the cluster is back up.
      •  Important jobs must be isolated from developers
         –  Each cluster should contain ONE tier of jobs, grouped by SLA, release process, and time-to-recovery requirements
  21. Attributes of Robust Jobs
      •  Idempotent and resumable regardless of when/how terminated
      •  Has an external framework for recording success/failure, timing, and amount of data processed
      •  Knows what input data it needs and waits for it to be ready
      •  Has a mechanism for reprocessing if the input data is restated
      •  Checked into source control
      •  Testable in an expendable cluster before release
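Two of these attributes, idempotence and waiting for input to be ready, can be shown with completion stamps. The sketch below works on local directories rather than HDFS; the stamp file name `_SUCCESS` and the function names are assumptions chosen for illustration, and input datasets are assumed to contain only regular text files.

```python
import os
import shutil

STAMP = "_SUCCESS"  # hypothetical completion-stamp file name


def is_ready(dataset_dir):
    """A dataset counts as complete only once its stamp file exists."""
    return os.path.exists(os.path.join(dataset_dir, STAMP))


def run_job(input_dir, output_dir, transform):
    """Run `transform` over every input file; safe to rerun at any time."""
    if not is_ready(input_dir):
        return "waiting"
    # Delete any previous (possibly partial) output so a rerun starts clean.
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.makedirs(output_dir)
    for name in sorted(os.listdir(input_dir)):
        if name == STAMP:
            continue
        with open(os.path.join(input_dir, name)) as src:
            data = src.read()
        with open(os.path.join(output_dir, name), "w") as dst:
            dst.write(transform(data))
    # Write the stamp last: downstream jobs never see half-written output.
    open(os.path.join(output_dir, STAMP), "w").close()
    return "succeeded"
```

Because the output is removed before each run and stamped only at the end, killing the job at any point and running it again yields the same result as a single clean run.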
  22. Benchmarks
      •  How to evaluate hardware/network changes or map/reduce slot tuning?
         –  Key insight: For the same job, the same task always does the same work
         –  Rerun job and compare execution of the same task across machines

      Machine       Tasks  Comps  Relative Perf (larger is better)
      ~~~~~~~~~~~~  ~~~~~  ~~~~~  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      type1_1          82     37  0.99 ====================
      type1_2          91     76  0.98 ====================
      type1_3          92     35  1.01 ====================
      type1_4          88     85  1.06 =====================
      type2_1          71     26  1.30 ==========================
      type3_1          92     80  0.68 ==============
      type4_1          78     42  1.19 ========================
      type4_2          78     45  1.29 ==========================
      type4_3          75     75  1.19 ========================
      remote          546    534  0.97 ===================
      local           378     69  1.05 =====================
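One way to turn the key insight into numbers is sketched below. The slides do not say how the relative performance column was computed, so the scoring here is an assumption: each execution of a task is compared with the mean duration of that task across all its executions, and a machine's score is the average of those ratios.

```python
from collections import defaultdict
from statistics import mean


def relative_performance(runs):
    """Score machines from (task_id, machine, seconds) records.

    Only tasks executed on at least two different machines are compared.
    A score above 1.0 means faster than average (larger is better).
    """
    by_task = defaultdict(list)
    for task_id, machine, seconds in runs:
        by_task[task_id].append((machine, seconds))

    scores = defaultdict(list)
    for executions in by_task.values():
        if len({machine for machine, _ in executions}) < 2:
            continue  # the task never ran elsewhere, so nothing to compare
        baseline = mean(seconds for _, seconds in executions)
        for machine, seconds in executions:
            scores[machine].append(baseline / seconds)
    return {machine: mean(values) for machine, values in scores.items()}
```

For example, if machine `a` finishes two tasks in 10 s and 30 s while machine `b` needs 20 s and 60 s for the same tasks, the scores are 1.5 for `a` and 0.75 for `b`.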
  23. Features You Should Use
      •  Fair Scheduler
      •  refreshNodes, refreshQueues
      •  Hadoop metrics
      •  Namenode audit logging (disabled by default in 0.20)
      •  Exclude files to decommission slave nodes
  24. Staffing
      •  We're living proof that you can hire some engineers with good fundamentals but no specialized experience and throw them in the deep end (it's the TA way)
      •  Skills to hire for:
         –  Operations and Linux experience
         –  General service troubleshooting
         –  Scripting
         –  Java
         –  SQL (even if not using Hive)
      •  Managing clusters which are growing 2x - 4x per year takes 1-2 people working full time just to run in place
  25. Open Questions
      •  Resuming of jobs on jobtracker restart
      •  Reloading of configurations without a restart
      •  Robust response to cluster OOM conditions
      •  Disabling job submission while allowing existing jobs to finish
      •  Please tell us if you have the answers!
  26. Questions?
  27. Appendix
      This is for you to read later after downloading the presentation
  28. Downloads
  29. DRBD Configuration
      global {
        usage-count no;
        minor-count 1;
      }
      common {
        protocol C;
        syncer { rate 90M; }
      }
      resource internal {
        startup {
          wfc-timeout 600;
          degr-wfc-timeout 60;
        }
        disk {
          on-io-error detach;
        }
        net {
          # timeout 60;
          # connect-int 10;
          # ping-int 10;
          # max-buffers 2048;
          # max-epoch-size 2048;
        }
        on {
          device /dev/drbd0;
          disk /dev/sda3;
          address;
          flexible-meta-disk internal;
        }
        on {
          device /dev/drbd0;
          disk /dev/sda3;
          address;
          flexible-meta-disk internal;
        }
      }
  30. Corosync Configuration
      compatibility: whitetank
      totem {
        version: 2
        secauth: off
        threads: 0
        interface {
          ringnumber: 0
          bindnetaddr:
          mcastaddr:
          mcastport: 5415
        }
      }
      logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        logfile: /var/log/corosync.log
        debug: off
        timestamp: on
        logger_subsys {
          subsys: AMF
          debug: off
        }
      }
      amf {
        mode: disabled
      }
      aisexec {
        user: root
        group: root
      }
      service {
        name: pacemaker
        ver: 0
      }
  31. Pacemaker Configuration
      node attributes standby="off"
      node attributes standby="off"
      property $id="cib-bootstrap-options" stonith-enabled="false" no-quorum-policy="ignore" expected-quorum-votes="2" dc-version="1.0.12-unknown" cluster-infrastructure="openais" last-lrm-refresh="1337718104"
      rsc_defaults $id="rsc-options" resource-stickiness="100"
      primitive DataStore ocf:linbit:drbd params drbd_resource="internal" op start interval="0" timeout="240s" op stop interval="0" timeout="100s"
      primitive fs_DataStore ocf:heartbeat:Filesystem params device="/dev/drbd0" directory="/data/internal" fstype="ext3" op monitor interval="60s" timeout="40s" op start interval="0" timeout="60s" op stop interval="0" timeout="60s"
      ms Cluster DataStore meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
      colocation fs-with-drbd inf: fs_DataStore Cluster:Master
      order drdb-fs inf: Cluster:promote fs_DataStore:start
      primitive MasterIP ocf:heartbeat:IPaddr2 params ip="" nic="bond0" op monitor interval="30s"
      colocation ip-with-drbd inf: MasterIP Cluster:Master
      order fs-ip inf: fs_DataStore MasterIP
      primitive NameNode lsb:hadoop-0.20-namenode op monitor interval="30s" meta target-role="Started"
      colocation namenode-with-fs inf: NameNode fs_DataStore
      order ip-namenode inf: MasterIP NameNode
      primitive JobTracker lsb:hadoop-0.20-jobtracker op monitor interval="30s" meta target-role="Started"
      colocation jobtracker-with-fs inf: JobTracker fs_DataStore
      order namenode-jobtracker inf: NameNode JobTracker
  32. Pacemaker Configuration (cont.)
      primitive SecondaryNameNode lsb:hadoop-0.20-secondarynamenode op monitor interval="30s" meta target-role="Started"
      colocation secondarynamenode-not-with-ip -inf: SecondaryNameNode MasterIP
      order jobtracker-secnamenode inf: JobTracker SecondaryNameNode
      primitive Mysql ocf:heartbeat:mysql params datadir="/data/internal/mysql" socket="/data/internal/mysql/mysql.sock" binary="/usr/bin/mysqld_safe" op monitor interval="30s" timeout="30s" op start interval="0" timeout="120s" op stop interval="0" timeout="120s" meta target-role="Started"
      colocation mysql-with-fs inf: Mysql fs_DataStore
      order ip-mysql inf: MasterIP Mysql
      primitive HiveThrift lsb:hive-thrift op monitor interval="30s" meta target-role="Started"
      colocation hivethrift-with-ip inf: HiveThrift MasterIP
      order jobtracker-hivethrift inf: JobTracker HiveThrift
      order mysql-hivethrift inf: Mysql HiveThrift
      primitive Oozie lsb:oozie op monitor interval="30s" meta target-role="Started"
      colocation oozie-with-fs inf: Oozie MasterIP
      order jobtracker-oozie inf: JobTracker Oozie
      primitive PingNodes ocf:pacemaker:ping params host_list="" multiplier="100" op start interval="0" timeout="60s" op monitor interval="30s" timeout="60s"
      clone PingClone PingNodes meta interleave="true"
      location ping-with-ip MasterIP rule $id="ping-with-ip-rule" pingd: defined pingd
      location MasterIP rule $id="" 50: #uname eq master01.tripadvisor.com
      order ip-ping inf: MasterIP PingClone
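The `order` constraints in this configuration form a dependency graph, and it can help to see one start sequence that satisfies all of them. The sketch below is a simplification and not how Pacemaker itself schedules resources: it treats every constraint as "start A before B" (ignoring the promote/start action qualifiers) and uses Python's standard `graphlib` module (Python 3.9+). The constraint pairs are copied from the configuration; the function name is made up.

```python
from graphlib import TopologicalSorter

# (first, then) pairs taken from the `order` lines of the configuration.
ORDER_CONSTRAINTS = [
    ("Cluster", "fs_DataStore"),
    ("fs_DataStore", "MasterIP"),
    ("MasterIP", "NameNode"),
    ("NameNode", "JobTracker"),
    ("JobTracker", "SecondaryNameNode"),
    ("MasterIP", "Mysql"),
    ("JobTracker", "HiveThrift"),
    ("Mysql", "HiveThrift"),
    ("JobTracker", "Oozie"),
    ("MasterIP", "PingClone"),
]


def start_order(constraints):
    """Return one resource start order consistent with every constraint."""
    sorter = TopologicalSorter()
    for first, then in constraints:
        # `then` may only start once `first` has started.
        sorter.add(then, first)
    return list(sorter.static_order())
```

Calling `start_order(ORDER_CONSTRAINTS)` always places `Cluster` (the DRBD master/slave set) first, since every other resource depends on it directly or indirectly. A cycle in the constraints would raise `graphlib.CycleError`, which is a cheap sanity check before loading a changed configuration.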
  33. Nagios Checks
      check_apt  check_breeze  check_by_ssh  check_checkup_metric
      check_clamd  check_cluster  check_cronjobs  check_crontabs
      check_dhcp  check_dig  check_disk  check_disk_smb
      check_disk_writable  check_dns  check_dummy  check_fbrs
      check_file_age  check_files_age  check_filesystems  check_flexlm
      check_ftp  check_gc  check_hadoop_master_logfiles
      check_hdp_connectivity  check_hdp_data_nodes  check_hdp_hdfs
      check_hdp_max_mr_settings  check_hive  check_hive_nsc
      check_hive_server  check_http  check_icmp  check_ide_smart
      check_ifoperstatus  check_ifstatus  check_imap  check_ircd
      check_jabber  check_load  check_local_mail  check_log
      check_log_updated  check_mailq  check_memcached  check_minerva
      check_mrtg  check_mrtgtraf  check_mysql_repl  check_nagios
      check_nntp  check_nntps  check_nrpe  check_nt
      check_ntp  check_ntp_peer  check_ntp_time  check_nwstat
      check_oracle  check_overcr  check_ping  check_pop
      check_proc_filehandles  check_procs  check_real  check_rpc
      check_sensors  check_simap  check_smtp  check_spop
      check_ssh  check_ssmtp  check_swap  check_swapping
      check_sys_filehandles  check_ta_services  check_tcp  check_time
      check_udp  check_ups  check_users  check_wave
      check_writeable_tmp
  34. Example Oozie Query
      SELECT a.todaystatus as today, a.yesterdaystatus as yday, j.status as parent,
             j.app_name, a.last_modified_time, a.nominal_time, a.id
      FROM (
        SELECT t.status as todaystatus, y.status as yesterdaystatus,
               COALESCE(, AS id, y.job_id,
               COALESCE(t.nominal_time, y.nominal_time) AS nominal_time,
               COALESCE(t.last_modified_time, y.last_modified_time) AS last_modified_time
        FROM (SELECT * FROM COORD_ACTIONS
              WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 0) t
        RIGHT OUTER JOIN
             (SELECT * FROM COORD_ACTIONS
              WHERE TIMESTAMPDIFF(DAY, last_modified_time, now()) = 1) y
        ON (t.job_id=y.job_id)
        WHERE COALESCE(t.status, '') NOT IN ('SUCCEEDED', 'WAITING')
           -- If they're WAITING today, then make sure yesterday ran OK.
           OR (t.status = 'WAITING' and y.status <> 'SUCCEEDED')
        UNION DISTINCT
        -- Dummy record to force the table to exist even when empty, since MySQL
        -- otherwise emits nothing if data is not returned.
        SELECT 'EMPTY', 'RECORD', '', '', '', 'THIS IS A DUMMY RECORD'
      ) a
      LEFT OUTER JOIN COORD_JOBS j
      ON a.job_id=j.id
      WHERE j.status = 'RUNNING' OR j.status IS NULL;
  35. Sessions will resume at 4:30pm