
Exploiting machine learning to keep Hadoop clusters healthy



Oath has one of the largest Hadoop footprints, with tens of thousands of jobs run every day, so reliability and consistency are key. With 50k+ nodes, a considerable number of nodes will have disk, memory, network, or slowness issues. A host with such issues that is serving or running jobs can sharply increase the run times of jobs bound by tight SLAs, and it frustrates both the users and the support team who have to debug it.
We are constantly working to develop systems that work in tandem with Hadoop to quickly identify and single out pressure points. Here we concentrate on disks: in our experience, disks are the most fragile and troublesome component, especially high-density disks. Given the huge scale and the monetary impact of slow disks, we took on the challenge of building a system that predicts worn-out disks and takes them out before they become a performance bottleneck and hit jobs’ SLAs. The task sounds simple: look for the symptoms of hard drive failure and take those drives out. It is not so straightforward with 200k+ disk drives. Collecting that much data periodically and reliably is a challenge in itself, though a small one compared to analyzing such huge datasets and predicting bad disks. For each disk we have the reallocated sectors count, reported uncorrectable errors, command timeout, and uncorrectable sector count. On top of that, each hard disk model has its own interpretation of these statistics.
Dheeraj Kapur, Principal Engineer, Oath, and Swetha Banagiri
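
To make the per-disk statistics more concrete, here is a minimal sketch of the pre-processing the slides describe: readings are grouped by disk model, and a reading is ignored when all five S.M.A.R.T. stats are 0. The record layout, the field names (`smart_5`, `model`, and so on), and the sample values are hypothetical; the talk does not show its actual schema or code.

```python
from collections import defaultdict

# Hypothetical field names for the five S.M.A.R.T. attributes named in the slides.
SMART_FIELDS = (
    "smart_5",    # Reallocated_Sector_Count
    "smart_187",  # Reported_Uncorrectable_Errors
    "smart_188",  # Command_Timeout
    "smart_197",  # Current_Pending_Sector_Count
    "smart_198",  # Offline_Uncorrectable
)


def preprocess(readings):
    """Group readings by disk model, skipping readings whose five stats are all 0."""
    by_model = defaultdict(list)
    for reading in readings:
        # A missing counter is treated as 0 (an assumption of this sketch).
        values = [reading.get(field, 0) for field in SMART_FIELDS]
        if all(value == 0 for value in values):
            continue  # all-zero readings are ignored
        by_model[reading["model"]].append(values)
    return dict(by_model)


# Made-up sample data, for illustration only.
sample = [
    {"host": "node-a", "model": "model-x", "smart_5": 12, "smart_197": 3},
    {"host": "node-b", "model": "model-x"},  # all five stats are 0
    {"host": "node-c", "model": "model-y", "smart_188": 1},
]
grouped = preprocess(sample)
```

Grouping by model first matters because, as noted above, each disk model interprets these counters differently, so raw values are only comparable within one model.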

Published in: Technology


  1. Exploiting ML to keep Hadoop Cluster Healthy. Dheeraj Kapur, Swetha Banagiri, Big Data Infrastructure Management Team
  3. Agenda
     ● Overview: Dheeraj Kapur
     ● Architecture: Swetha Banagiri
     ● Q&A: All Presenters
  4. Components Managed by Grid
  5. Grid Stack
     ● Layers: Backend Support, Hadoop Storage, Hadoop Compute, Hadoop Services, Support Shop
     ● Components: Zookeeper; Monitoring; Starling for logging; HDFS; HBase as NoSQL store; HCatalog for metadata registry; YARN (MapReduce) and Tez for batch processing; Storm for stream processing; Spark for iterative programming; Pig for ETL; Hive for SQL; Oozie for workflows; Proxy services; GDM for data management; Caffe on Spark for ML
  6. Challenge
     ● Oath has one of the largest footprints of the Hadoop/Storm software frameworks
     ● Computing environment includes 50,000+ nodes
     ● Nodes are spread across ~40 clusters
     ● The largest Hadoop cluster comprises >5k nodes
     ● SLA-driven, time-sensitive jobs
     ● To operate and meet SLAs, we require a per-disk throughput of 90Mbps
  7. Impact of Disk Failures
     ● Performance degradation
     ● Data corruption
     ● Shuffle slowness results in pipeline failures
     ● Task slowness as a result of datanode slowness or replication failures
     ● Job slowness becomes a critical performance bottleneck
     ● The bottleneck is even larger when speculative execution can’t be turned on
  8. Factors causing disk failures
     ● External factors: temperature, power outages
     ● Internal factors: file corruption, drive read instability, aging
     ● Prone to mechanical failure because of moving parts
  9. Proactive better than Reactive
     ● Avoids a bad disk being the performance bottleneck
     ● Avoids running tight SLA-bound jobs on a bad node
     ● Avoids pipeline failure and block corruption
     ● Reduces revenue loss due to SLA misses
     ● With the DFP system enabled across the clusters, the hosts will have a higher uptime
  12. Elastic Stack 1/3
     ● Centralized data collection system
     ● Master-slave, push architecture
     ● The master helps redirect documents to data nodes
     ● Data is pushed as JSON documents using Python code
     ● All documents are stored within an index
     ● Each key in a JSON document is called a field
  13. Elastic Stack 2/3
     ● Data is distributed across the datanodes, each housing a number of shards under a single index
     ● An API is used to store and retrieve data:
       curl -XGET <hostname>:<port>/<index_name>/_search?pretty
     ● With Kibana as the frontend, building a dashboard for visualizing the collected data is easier
  14. Elastic Stack 3/3: Index, Fields, Document
  15. What are the symptoms of a disk being bad?
  16. S.M.A.R.T. Stats 1/2
     ● Self-Monitoring, Analysis and Reporting Technology
     ● Reports internal information about a drive
     ● A drive either fails immediately or shows some symptoms before it fails
     ● The symptoms are recorded by the S.M.A.R.T. tool
     ● S.M.A.R.T. stats are inconsistent from hard drive to hard drive
  17. S.M.A.R.T. Stats 2/2: the following S.M.A.R.T. stats are used for prediction
     ● SMART 5: Reallocated_Sector_Count
     ● SMART 187: Reported_Uncorrectable_Errors
     ● SMART 188: Command_Timeout
     ● SMART 197: Current_Pending_Sector_Count
     ● SMART 198: Offline_Uncorrectable
  18. Pre-processing the data
     ● Data collected from the various nodes falls under different disk models
     ● Each node is grouped by the disk model its drive belongs to
     ● Data is ignored when all five stats are 0
  19. Labelling the data
     ● A very important and cumbersome task
     ● Labelled ~4000 nodes across the disk models
     ● Nodes are classified as Good, Fair, or Bad
     ● A high value for a S.M.A.R.T. stat means that the node is bad
  20. Feed Forward Neural Network Model
  21. Feed Forward Neural Network
     ● Fully connected
     ● 4-layer deep neural network model
     ● ‘adam’ optimizer used for back propagation
     ● The three hidden layers use the ‘relu’ activation function
     ● The output layer uses a ‘sigmoid’ activation function
     ● Loss is calculated using ‘binary-crossentropy’
  22. Training - Dataset and Accuracy Results
  23. Testing - Dataset and Accuracy Results
  24. Challenges
     ● Bad nodes are far fewer than good nodes
     ● Small dataset
     ● Fine-tuning the training parameters
  25. Q&A
  26. Thank you!
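
The feed-forward network slide can be illustrated with a minimal forward-pass sketch in plain Python: three fully connected ‘relu’ hidden layers, one ‘sigmoid’ output unit, and a binary cross-entropy loss. This is not the presenters' code. The layer widths, the random weights, the scaled input values, and the reading of the output as "probability that the disk is bad" are all assumptions, and training with the ‘adam’ optimizer is left out.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W . inputs + b) for each unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(rng, n_in, n_out):
    """Random small weights and zero biases (placeholder for trained values)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layers, features):
    """Run the input through every layer: relu on hidden layers, sigmoid on the last."""
    out = features
    for i, (weights, biases) in enumerate(layers):
        activation = sigmoid if i == len(layers) - 1 else relu
        out = dense(out, weights, biases, activation)
    return out[0]


def binary_crossentropy(label, prob, eps=1e-7):
    """Loss for one example; prob is clipped to avoid log(0)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


rng = random.Random(0)
sizes = [5, 8, 8, 4, 1]  # 5 S.M.A.R.T. inputs; hidden widths are hypothetical
layers = [init_layer(rng, n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]

# Made-up, already-scaled readings for one disk, in the order 5, 187, 188, 197, 198.
prob_bad = forward(layers, [0.12, 0.0, 0.0, 0.03, 0.0])
loss = binary_crossentropy(1, prob_bad)  # assumes label 1 means "bad"
```

With untrained weights the output is meaningless as a prediction; the sketch only shows how the four layers and the loss fit together.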